Attribute Normalization Rules for Zoning Feeds

Q: When does spatial fallback override the lexical result?

Only when the lexical result is weak: a missing code, a garbled string, or a fuzzy score below the routing threshold. A clean exact match is more authoritative than a geometric overlap, so spatial fallback never overwrites it, and recovered zones are capped at a lower confidence.

Q: Why stamp every record with a mapping version and timestamp?

Temporal stamping lets you answer point-in-time questions, roll back deterministically when a municipality retracts an amendment, and drive change-detection when the official map updates. Without it a re-run silently rewrites history and audits become impossible.

Q: How do I handle overlay districts and conditional-use permits?

Keep them out of canonical_zone. An overlay or a conditional-use permit modifies what is permitted without changing the base zone, so fold them into separate typed attributes. Otherwise a single parcel competes for one zoning value and double-counts in a spatial join.

A single metropolitan region will hand you the same fact — what a parcel is zoned — in a dozen incompatible shapes. One county ships R-1 in a shapefile DBF column called ZONE_CD; the city next door publishes Residential Low (R1) in a REST API field named zoning; a third municipality buries it in a scanned ordinance PDF as “Single-Family Residential District.” When those records land in the same table without reconciliation, every downstream system inherits the chaos: valuation models compare residential against RESIDENTIAL_SINGLE_FAMILY and R1 as if they were three different land uses, compliance checks silently pass on null zoning codes, and a development-feasibility query returns a parcel that does not exist as zoned. Attribute normalization is the deterministic bridge that collapses that variance into one canonical taxonomy before anything spatial or financial touches the data. It is the stage of Automated Feed Ingestion & GIS Data Parsing that turns “the same fact in twelve shapes” into one provable, versioned record.

A record is resolved by the cheapest authority that succeeds: exact lexical match, then bounded fuzzy, then a spatial join against verified parcels. The confidence fork sends anything below 0.75 to an immutable quarantine ledger and stamps the rest with its mapping version before export.

Prerequisites and operational context jump to heading

Normalization runs in the middle of the ingestion path, not at the start. It assumes several things already exist, and it produces inconsistent output if they do not:

A staged raw layer. Payloads should already be extracted into a working GeoDataFrame or a raw_features table. When a jurisdiction has no machine-readable feed, that staging step is the responsibility of the PDF & HTML scraping pipelines, which lift raw zoning identifiers, parcel references, and effective dates out of unstructured documents before the normalizer ever sees them.
A settled working projection. Spatial fallback joins only resolve correctly when the raw geometry and the master parcel layer share a coordinate system. That depends on the CRS alignment strategies you have standardised on — typically EPSG:4326 for storage — so an sjoin does not silently miss because one side is in a state plane system.
A canonical taxonomy. You must decide the target vocabulary before you map to it. The mapping dictionary here is one slice of a broader zoning taxonomy mapping discipline that defines what RESIDENTIAL_MULTI_FAMILY actually means across jurisdictions.
An authoritative parcel layer. Spatial fallback needs a verified municipal parcel layer carrying known-good canonical_zone values to recover attributes when a raw record’s geometry is missing or its code is unresolvable.
A quarantine sink. Records that cannot be resolved with confidence must have somewhere to go that is neither the production table nor the bit bucket.

Get these in place first. Normalization is deterministic by design, but it can only be as trustworthy as the staging, projection, and taxonomy it sits on top of.

Architecture: the three resolution layers jump to heading

Normalization is best understood as three resolution layers applied in strict order, each a fallback for the one before it. A record is resolved by the cheapest, most authoritative mechanism that succeeds, and only descends to a weaker mechanism when the stronger one fails.

Lexical resolution comes first because it is deterministic and cheap. A versioned lookup dictionary maps city-specific codes to the canonical schema by exact match. Exact matching alone is brittle — municipalities introduce typographical drift, whitespace, mixed casing, and deprecated codes between releases — so a bounded fuzzy fallback (token or ratio scoring) bridges minor variations like R-1 versus R1 versus r 1 without inventing matches that are not there. The fuzzy threshold is the single most important tuning knob: too low and C1 (commercial) gets mapped to R1 (residential); too high and legitimate variants fall through.

Spatial resolution is the second layer, reserved for records the lexicon could not confidently resolve. When a zoning code is missing, garbled, or scores below threshold, a spatial join against the authoritative parcel layer recovers the canonical zone from geometry: the parcel the record overlaps already carries a verified classification. This is how you rescue a scraped PDF row that has a parcel reference but a mangled zoning string.

Temporal resolution is the third layer and the one most teams skip. Every normalized record carries the ordinance version it was resolved against and the timestamp the rule was applied. Without it, you cannot answer “what was this parcel zoned on the date of that appraisal,” cannot roll back deterministically when a municipality retracts an amendment, and cannot drive change-detection when the official map updates.

The five concrete stages the pipeline executes, in order:

Stage	Operation	Output
1. Schema detection	Parse incoming headers, infer geometry type, standardise column casing	Normalized DataFrame
2. Lexical resolution	Exact match, then bounded fuzzy fallback, then default routing	`canonical_zone`, `confidence`
3. Spatial fallback	`sjoin` against the master parcel layer when geometry/code is weak	Validated `GeoDataFrame`
4. Confidence routing	Score records; low-confidence to quarantine, high-confidence to staging	Partitioned datasets
5. Canonical export	Apply temporal stamps, enforce types, write to the target sink	Production-ready layer

Production implementation jump to heading

The following implementation is a modular, high-throughput normalizer for municipal zoning feeds. It applies exact matching, bounded fuzzy fallback via rapidfuzz, spatial recovery, confidence-based partitioning, and structured temporal stamping. The code is complete and runnable against a raw GeoDataFrame and a verified parcel layer.

import logging
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional, Tuple

import geopandas as gpd
import pandas as pd
from rapidfuzz import fuzz, process

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
)
logger = logging.getLogger(__name__)

# Canonical schema mapping, maintained as a versioned config artifact so that
# every normalized record can be traced back to the dictionary that produced it.
MAPPING_VERSION = "2026.02"
CANONICAL_ZONING = {
    "R1": "RESIDENTIAL_SINGLE_FAMILY",
    "R2": "RESIDENTIAL_MULTI_FAMILY",
    "C1": "COMMERCIAL_GENERAL",
    "I1": "INDUSTRIAL_LIGHT",
    "AG": "AGRICULTURAL",
    "OS": "OPEN_SPACE",
}

# Spatial fallback is less authoritative than an exact code, so it is capped
# below the routing threshold unless corroborated. Tune to your data.
FUZZY_THRESHOLD = 85          # rapidfuzz ratio (0-100) for accepting a near-match
SPATIAL_CONFIDENCE = 0.65     # baseline confidence for a clean spatial join
ROUTE_THRESHOLD = 0.75        # below this, a record is quarantined


@dataclass
class NormalizationResult:
    canonical_zone: Optional[str]
    confidence: float
    method: str  # "exact" | "fuzzy" | "spatial_fallback" | "quarantine"
    mapping_version: str = MAPPING_VERSION
    resolved_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def resolve_lexical(raw_code: Optional[str]) -> NormalizationResult:
    """Exact match first, then a bounded fuzzy fallback against canonical keys."""
    if not raw_code or not str(raw_code).strip():
        return NormalizationResult(None, 0.0, "quarantine")

    # Normalise surface form before matching: strip, upper, collapse separators.
    key = str(raw_code).strip().upper().replace("-", "").replace(" ", "")
    if key in CANONICAL_ZONING:
        return NormalizationResult(CANONICAL_ZONING[key], 1.0, "exact")

    match = process.extractOne(key, CANONICAL_ZONING.keys(), scorer=fuzz.ratio)
    if match and match[1] >= FUZZY_THRESHOLD:
        return NormalizationResult(
            CANONICAL_ZONING[match[0]], match[1] / 100.0, "fuzzy"
        )

    return NormalizationResult(None, 0.0, "quarantine")


def normalize_attributes(
    raw_gdf: gpd.GeoDataFrame,
    parcel_layer: gpd.GeoDataFrame,
    zoning_col: str = "zoning_code",
) -> Tuple[gpd.GeoDataFrame, gpd.GeoDataFrame]:
    """Lexical mapping -> spatial fallback -> confidence routing.

    Returns (clean_gdf, quarantine_gdf). Records never silently disappear:
    anything unresolved lands in the quarantine partition with a reason.
    """
    if raw_gdf.crs != parcel_layer.crs:
        # A mismatched CRS makes the spatial join silently empty; fail loud.
        raise ValueError(
            f"CRS mismatch: raw={raw_gdf.crs}, parcels={parcel_layer.crs}"
        )

    results = [resolve_lexical(row.get(zoning_col)) for _, row in raw_gdf.iterrows()]
    norm_df = pd.DataFrame([r.__dict__ for r in results], index=raw_gdf.index)
    gdf = pd.concat([raw_gdf, norm_df], axis=1)

    # Spatial fallback only for records the lexicon could not resolve well.
    weak = gdf["confidence"] < ROUTE_THRESHOLD
    if weak.any():
        logger.info("Spatial fallback for %d weak records", int(weak.sum()))
        joined = gpd.sjoin(
            gdf[weak],
            parcel_layer[["parcel_id", "canonical_zone", "geometry"]],
            how="left",
            predicate="intersects",
        )
        # Collapse multiple parcel hits to the first match per source row.
        recovered = joined[~joined["canonical_zone_right"].isna()]
        recovered = recovered[~recovered.index.duplicated(keep="first")]
        idx = recovered.index
        gdf.loc[idx, "canonical_zone"] = recovered["canonical_zone_right"]
        gdf.loc[idx, "method"] = "spatial_fallback"
        gdf.loc[idx, "confidence"] = SPATIAL_CONFIDENCE

    # Partition: a null canonical_zone or sub-threshold confidence is quarantined.
    quarantine_mask = gdf["canonical_zone"].isna() | (
        gdf["confidence"] < SPATIAL_CONFIDENCE
    )
    clean_gdf = gdf[~quarantine_mask].copy()
    quarantine_gdf = gdf[quarantine_mask].copy()
    quarantine_gdf["quarantine_reason"] = quarantine_gdf.apply(
        lambda r: "unresolved_code" if pd.isna(r["canonical_zone"])
        else "low_confidence",
        axis=1,
    )

    logger.info(
        "Normalization complete: %d clean, %d quarantined",
        len(clean_gdf), len(quarantine_gdf),
    )
    return clean_gdf, quarantine_gdf

Several choices here are deliberate. Surface-form normalisation (strip, upper, separator removal) happens before matching so that R-1, r1, and R 1 all collapse to the same exact key without burning a fuzzy comparison. The CRS guard raises loudly rather than letting a projection mismatch produce a silently empty spatial join — a class of bug that otherwise quarantines an entire feed for no obvious reason. Spatial fallback is capped at SPATIAL_CONFIDENCE because a geometric overlap is weaker evidence than an exact code, and every result carries its mapping_version and resolved_at so the record is auditable the moment it is written.

Edge cases and gotchas jump to heading

Municipal attribute feeds fail in specific, recurring ways. Plan for these before they reach production:

Overlay districts double-count a parcel. A parcel inside a historic-preservation or flood overlay carries two valid zoning records. A naive sjoin returns both and your row count silently inflates. Resolve the base zone and overlays into separate typed columns rather than letting them compete for one canonical_zone.
Fuzzy matching crosses land-use boundaries. fuzz.ratio will happily score C1 against R1 reasonably high on two-character codes. Short codes need a higher threshold or a category guard, otherwise commercial parcels normalise to residential. Validate that the matched category is plausible for the source jurisdiction.
Schema drift renames the source column. When a county renames zoning to zone_class between releases, row.get(zoning_col) returns None for every row and the whole feed quarantines. Detect the zoning column dynamically or assert its presence at schema-detection time so the failure is named, not silent.
Deprecated codes resolve to retired classifications. A historical R3 that the municipality merged into R2 years ago must map through a deprecation table, not the live dictionary, or you will mislabel parcels that predate the merge. This is where temporal versioning earns its keep.
Conditional-use permits are not zoning. A CUP modifies what is permitted without changing the base zone. Folding it into canonical_zone corrupts the taxonomy; carry it as a separate attribute.
Empty-geometry spatial fallback. A record with no zoning code and no geometry cannot be recovered spatially. Route it straight to quarantine rather than letting an empty sjoin produce a misleading null.

Integration points jump to heading

Normalization is a midstream stage, defined by what feeds it and what consumes it. Its inputs are the staged payloads from acquisition and scraping; its outputs flow forward to publication. Once attributes are resolved and stamped, the clean partition becomes the input to GIS export sync workflows, which serialise it out as GeoPackage, PostGIS, GeoJSON, or vector tiles for dashboards and public parcel viewers. The quarantine partition is not a dead end — it is handed to error handling & retry logic, which logs the schema drift, alerts the data steward, and schedules reprocessing when a corrected municipal feed arrives. Both contracts are the normalized record shape — same canonical keys, same CRS, same quarantine semantics — so the normalizer can be re-run or replayed without downstream changes. For deeper, multi-jurisdiction configuration, the normalizing zoning attributes across different city schemas walkthrough extends these patterns to overlapping boundaries, overlay districts, and conditional-use cases.

Compliance and audit artifacts jump to heading

For PropTech underwriting and regulatory review, a normalization run is only finished when it is provable. Every run should emit:

A mapping manifest: the MAPPING_VERSION applied, the canonical dictionary hash, the source feed identifiers, and the working CRS — the lineage record linking each zone label back to the exact rule set that produced it.
Per-method counts: how many records resolved by exact, fuzzy, and spatial fallback versus quarantined, so a reviewer can see how much of a feed rested on weaker evidence.
Temporal stamps on every record: resolved_at and the ordinance version, enabling point-in-time reconstruction and deterministic rollback when a municipality retracts or amends a classification mid-cycle.
An immutable quarantine ledger: every rejected record with its quarantine_reason, retained for audit so a null zoning code is never silently absorbed into production.

These artifacts pair with downstream schema validation & data quality checks to form a continuous, defensible chain from raw municipal publication to underwriting-grade dataset.

FAQ jump to heading

What fuzzy-match threshold should I use for zoning codes?

Start at a ratio of 85 and raise it for short codes. Two- and three-character zoning codes like C1 and R1 score deceptively high against each other, so a low threshold maps commercial parcels to residential. Pair the threshold with a category guard that rejects a match whose canonical land-use class is implausible for the source jurisdiction.

When does spatial fallback override the lexical result?

Only when the lexical result is weak — a missing code, a garbled string, or a fuzzy score below the routing threshold. A clean exact match is more authoritative than a geometric overlap, so spatial fallback never overwrites it. Recovered zones are capped at a lower confidence because an overlap is weaker evidence than a stated code.

Why stamp every record with a mapping version and timestamp?

Temporal stamping lets you answer point-in-time questions ("what was this parcel zoned on the appraisal date"), roll back deterministically when a municipality retracts an amendment, and drive change-detection when the official map updates. Without it, a re-run silently rewrites history and audits become impossible.

What happens to a record that cannot be normalized?

It is quarantined, never dropped. The record is written to an immutable ledger with a quarantine_reason of either unresolved_code or low_confidence, then handed to error handling for alerting and scheduled reprocessing once a corrected feed arrives. A null zoning code never reaches the production table.

How do I handle overlay districts and conditional-use permits?

Keep them out of canonical_zone. An overlay (historic, flood) and a conditional-use permit modify what is permitted without changing the base zone, so fold them into separate typed attributes. Otherwise a single parcel competes for one zoning value and either double-counts in a spatial join or overwrites its own base classification.

Normalizing zoning attributes across different city schemas — multi-jurisdiction mapping, overlays, and conditional-use patterns
Async batch processing — the upstream engine that stages records for normalization at county scale
PDF & HTML scraping pipelines — supplies raw zoning identifiers for jurisdictions without an OGC API
GIS export sync workflows — the downstream consumer of the clean partition
Zoning taxonomy mapping — defines the canonical vocabulary this stage maps into
Schema validation & data quality checks — verifies the normalized output downstream

Up: Automated Feed Ingestion & GIS Data Parsing