Zoning Taxonomy Mapping

Every jurisdiction invents its own zoning alphabet. One county writes R-1 for low-density residential, the next writes RS-7, a third buries the same intent inside a LAND_USE_CLASS of SINGLE FAM DET. A real estate model that underwrites parcels across more than a handful of municipalities cannot reason over this dialect chaos directly — it needs a single, stable vocabulary where RESIDENTIAL_LOW means the same thing in Travis County as it does in King County. Zoning taxonomy mapping is the translation layer that produces that vocabulary, and when it is built as a pile of if/else branches it fails in the most dangerous way possible: silently, by mapping an unrecognized code to a plausible-but-wrong category that propagates into feasibility scores and entitlement assumptions. This page is part of Municipal Zoning Data Architecture & Compliance Frameworks, and within that architecture taxonomy mapping is the deterministic, versioned, fully audited stage that converts jurisdiction-specific land-use codes into machine-readable classifications without ever guessing in a way it cannot later explain.

Prerequisites and operational context jump to heading

Taxonomy mapping is a mid-pipeline stage, and it only behaves deterministically if the stages before it have already done their jobs. Three things must be in place before a single code is translated:

A clean, validated input contract. Mapping operates on the zoning_code attribute, so that attribute must already exist, be typed as a string, and be free of nulls and obvious garbage. Catching ZoningType vs ZONE_CD column drift and rejecting malformed rows belongs upstream in schema validation and data quality checks, not in the mapping engine. If mapping has to defend against missing columns it stops being deterministic.
Geometries already in a known projection. Mapping joins an attribute classification onto a geometry, and any downstream spatial rollup (district area by category, adjacency of mapped uses) is only honest if every feature shares one metric CRS. Enforce that with CRS alignment strategies before this stage runs; mapping itself must never reproject.
A versioned ruleset, not inline literals. The set of code-to-taxonomy rules is data, not code. It changes whenever a new municipality is onboarded or an ordinance renames a district, and every change must be traceable. Store it as a versioned table (CSV, Parquet, or a database row set with a ruleset_version) so any historical classification can be reproduced exactly.

Mapping also assumes a target taxonomy already exists. Before writing any rules, fix the canonical vocabulary — a flat or shallow hierarchy such as RESIDENTIAL_LOW, RESIDENTIAL_MEDIUM, RESIDENTIAL_HIGH, COMMERCIAL_GENERAL, MIXED_USE, INDUSTRIAL, AGRICULTURAL, PUBLIC_INSTITUTIONAL, and an explicit UNCLASSIFIED sentinel. The sentinel is not optional: every code that does not match a rule must land somewhere observable rather than disappearing or crashing the run.

Mapping architecture: tiered deterministic resolution jump to heading

The core design decision is resolution order. A single code can plausibly match several rules — MU-D matches both a generic ^MU-.*$ pattern and a Denver-specific override — so the engine must apply rules in a fixed precedence and stop at the first hit. The tiers, highest precedence first:

Tier	Match type	Example rule	Confidence	When it wins
1	Exact `(jurisdiction, code)`	`("AUSTIN", "SF-3") → RESIDENTIAL_LOW`	1.0	Hand-curated per-city code
2	Jurisdiction-scoped regex	`("DENVER", r"^MU-.*$") → MIXED_USE`	0.95	City reuses a family of codes
3	Generic regex	`("ANY", r"^R-[2-9]$") → RESIDENTIAL_MEDIUM`	0.9	Cross-jurisdiction convention
4	Default fallback	`("ANY", r".*") → UNCLASSIFIED`	0.0	Nothing else matched

Two properties make this defensible. First, every output carries a confidence score and the method that produced it, so downstream consumers can refuse to underwrite on anything below a threshold instead of treating a 0.85 regex guess like a 1.0 curated match. Second, the default fallback is a real rule, not an exception handler — an unmatched code is data flagged for review, never a thrown error that halts a county-wide batch.

Resolution must be jurisdiction-aware because the same string means different things in different places: C-1 is neighborhood commercial in one city and heavy commercial in another. The engine therefore matches on the (jurisdiction, code) pair, falling through to ANY-scoped generic rules only when no city-specific rule applies. This is also why exact matches sit above regex: a curated entry for a known-ambiguous code overrides the broad pattern that would otherwise mis-bucket it.

Production implementation jump to heading

The pipeline has three stages — validate the input contract, resolve the taxonomy with tiered rules, and emit an audit manifest. The validation stage uses Pydantic to enforce the contract and routes failures to a dead-letter queue rather than aborting:

import pandas as pd
import geopandas as gpd
from pydantic import BaseModel, ValidationError, field_validator
from typing import Any
import logging

logger = logging.getLogger(__name__)


class ZoningRecord(BaseModel):
    parcel_id: str
    jurisdiction: str
    zoning_code: str
    land_use_desc: str | None
    effective_date: str
    geometry: dict

    @field_validator("zoning_code", mode="before")
    @classmethod
    def normalize_code(cls, v: Any) -> str:
        # Normalize casing/whitespace so lookups are stable; reject empty/too-short codes.
        cleaned = str(v).strip().upper()
        if not cleaned or len(cleaned) < 2:
            raise ValueError("Invalid zoning code format")
        return cleaned

    @field_validator("jurisdiction", mode="before")
    @classmethod
    def normalize_jurisdiction(cls, v: Any) -> str:
        return str(v).strip().upper() or "UNKNOWN"


def ingest_and_validate(raw_gdf: gpd.GeoDataFrame) -> tuple[gpd.GeoDataFrame, list[dict]]:
    """Enforce the input contract before any mapping runs. Malformed rows are
    quarantined, never silently dropped or allowed to crash the batch."""
    validated_records: list[dict] = []
    rejected_queue: list[dict] = []

    for idx, row in raw_gdf.iterrows():
        try:
            record = ZoningRecord(
                parcel_id=str(row.get("parcel_id", "")).strip(),
                jurisdiction=row.get("jurisdiction", row.get("city", "UNKNOWN")),
                zoning_code=row.get("zoning", row.get("zoning_code", "")),
                land_use_desc=row.get("land_use_desc", None),
                effective_date=str(row.get("eff_date", "1900-01-01")),
                geometry=row.geometry.__geo_interface__,
            )
            validated_records.append(record.model_dump())
        except ValidationError as e:
            rejected_queue.append({
                "row_index": int(idx),
                "parcel_id": str(row.get("parcel_id", "UNKNOWN")),
                "error": str(e),
            })

    if rejected_queue:
        logger.warning("Routed %d records to dead-letter queue", len(rejected_queue))

    valid_gdf = gpd.GeoDataFrame(validated_records, geometry="geometry", crs=raw_gdf.crs)
    return valid_gdf, rejected_queue

The resolution engine reads the versioned ruleset and applies tiers in precedence order. Each rule carries a tier, a jurisdiction scope, a regex pattern, the target standard_code, and a confidence value. Already-resolved parcels are skipped, so the first matching tier wins:

import re

# Versioned ruleset — stored as data, loaded from CSV/Parquet/DB in production.
# Lower `tier` = higher precedence. `jurisdiction` of "ANY" matches every city.
ZONING_RULESET = pd.DataFrame([
    {"tier": 1, "jurisdiction": "AUSTIN", "pattern": r"^SF-3$",   "standard_code": "RESIDENTIAL_LOW",    "confidence": 1.00},
    {"tier": 2, "jurisdiction": "DENVER", "pattern": r"^MU-.*$",  "standard_code": "MIXED_USE",          "confidence": 0.95},
    {"tier": 3, "jurisdiction": "ANY",    "pattern": r"^R-1$",    "standard_code": "RESIDENTIAL_LOW",    "confidence": 0.90},
    {"tier": 3, "jurisdiction": "ANY",    "pattern": r"^R-[2-9]$","standard_code": "RESIDENTIAL_MEDIUM", "confidence": 0.90},
    {"tier": 3, "jurisdiction": "ANY",    "pattern": r"^C-.*$",   "standard_code": "COMMERCIAL_GENERAL", "confidence": 0.85},
    {"tier": 3, "jurisdiction": "ANY",    "pattern": r"^MU-.*$",  "standard_code": "MIXED_USE",          "confidence": 0.85},
    {"tier": 4, "jurisdiction": "ANY",    "pattern": r".*",       "standard_code": "UNCLASSIFIED",       "confidence": 0.00},
])

RULESET_VERSION = "2026.06.0"


def resolve_zoning_taxonomy(
    gdf: gpd.GeoDataFrame,
    ruleset: pd.DataFrame = ZONING_RULESET,
) -> gpd.GeoDataFrame:
    """Tiered, jurisdiction-aware resolution. First matching tier wins; every
    parcel receives a standard_code, a confidence, and the method that set it."""
    gdf = gdf.copy()
    gdf["standard_code"] = None
    gdf["mapping_confidence"] = 0.0
    gdf["mapping_method"] = None

    # Apply rules strictly in tier order so precedence is deterministic.
    ordered = ruleset.sort_values("tier", ascending=True)

    for _, rule in ordered.iterrows():
        unresolved = gdf["standard_code"].isna()
        if not unresolved.any():
            break  # everything classified — no need to keep scanning rules

        scope = rule["jurisdiction"]
        in_scope = unresolved if scope == "ANY" else (unresolved & (gdf["jurisdiction"] == scope))
        matches = gdf["zoning_code"].str.match(rule["pattern"], case=False, na=False)
        apply_mask = in_scope & matches

        if apply_mask.any():
            method = "default_fallback" if rule["tier"] == 4 else f"tier{int(rule['tier'])}_match"
            gdf.loc[apply_mask, "standard_code"] = rule["standard_code"]
            gdf.loc[apply_mask, "mapping_confidence"] = rule["confidence"]
            gdf.loc[apply_mask, "mapping_method"] = method

    fell_back = gdf["mapping_method"] == "default_fallback"
    if fell_back.any():
        logger.warning("%d parcels fell back to UNCLASSIFIED for manual review", int(fell_back.sum()))

    return gdf

Because the fallback is a real rule, the engine cannot raise on an unknown code — it labels it UNCLASSIFIED with confidence 0.0 and method default_fallback, which is exactly the signal a review queue and a confidence-gated underwriting model both need.

Edge cases and gotchas jump to heading

Municipal zoning data breaks taxonomy mappers in predictable, niche ways. Build for these before they break a production batch:

Code collisions across jurisdictions. C-1 is the classic trap: neighborhood commercial in one city, heavy commercial in another. A generic ^C-.*$ rule buckets both into COMMERCIAL_GENERAL and erases that distinction. The defense is tier-1 exact rules per ambiguous code, which is why resolution is jurisdiction-aware and exact matches outrank regex.
Overlay and combining districts. Codes like R-1/H (residential with a historic overlay) or C-2-PUD carry a base district plus a modifier. A naive pattern matches the base and silently discards the overlay, which often carries the binding setback or use restriction. Either model overlays as a separate attribute or write explicit rules for the combined forms — do not let the modifier vanish.
Whitespace, dashes, and Unicode. Source feeds emit R 1, R–1 (en dash), R_1, and trailing tabs. The normalize_code validator handles casing and trimming, but normalize separators too before matching, or every variant becomes a fallback miss.
Ordinance renames over time. A district renamed R-1 → RS-5 last year means historical parcels and new parcels carry different codes for the same intent. This is why the ruleset is versioned and the effective_date travels with each record: a re-run of last year’s data must use last year’s rules.
The plausible-but-wrong fallback. The most expensive failure is mapping an unknown code to a real category instead of UNCLASSIFIED. Never give the catch-all rule a non-zero confidence or a real land-use value. An honest UNCLASSIFIED at confidence 0.0 is recoverable; a confident wrong answer is not.
Regex ordering surprises. Within a tier, two patterns can match the same code. Keep patterns mutually exclusive within a tier, or add an explicit ordering column — relying on DataFrame row order is fragile. When two adjacent jurisdictions disagree on the same code family, see how to map local zoning codes to standardized taxonomies for the conflict-resolution patterns in depth.

Integration points jump to heading

Taxonomy mapping sits in the middle of the architecture and is only useful if its output is shaped for what comes next. Upstream, it consumes the typed records produced once attribute normalization rules have reconciled column names and value formats across feeds — mapping translates the zoning attribute, normalization gets that attribute into a consistent shape first. When a parcel cannot be classified above the confidence threshold, the unresolved record is handed to fallback routing logic, which decides whether to substitute a secondary source, defer to a prior snapshot, or escalate to manual review rather than ship an UNCLASSIFIED parcel into a model.

Downstream, the classified GeoDataFrame — now carrying standard_code, mapping_confidence, and mapping_method alongside the aligned geometry — is the canonical layer that GIS export sync workflows serialize to GeoParquet, shapefile, and GeoJSON for consumers. Keep the confidence and method columns in the export; an underwriting team needs to filter on them, and a regulator needs to see them. The standardized vocabulary is also what makes longitudinal change detection possible: comparing this run’s standard_code per parcel against the last run surfaces real rezonings instead of noise from a city renaming a district.

Compliance and audit artifacts jump to heading

A taxonomy decision that cannot be reconstructed is a liability in any PropTech underwriting or regulatory context. Every run must emit an immutable manifest that lets a reviewer answer two questions months later: what rules were in force and what did they decide. That means the ruleset version, a deterministic hash of the input, per-tier and per-category counts, the confidence distribution, and a rejection summary — all stored alongside the versioned output so a historical classification can be replayed exactly:

import json
import hashlib
from datetime import datetime, timezone


def compute_input_hash(gdf: gpd.GeoDataFrame) -> str:
    """Deterministic hash of the attributes that drive mapping — used for
    change detection and to prove which input produced a given output."""
    payload = gdf[["parcel_id", "jurisdiction", "zoning_code", "effective_date"]] \
        .sort_values("parcel_id") \
        .to_json(orient="records", date_format="iso")
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def generate_audit_manifest(
    run_id: str,
    input_hash: str,
    ruleset_version: str,
    valid_count: int,
    rejected_count: int,
    mapped_gdf: gpd.GeoDataFrame,
) -> dict:
    method_counts = mapped_gdf["mapping_method"].value_counts().to_dict()
    category_counts = mapped_gdf["standard_code"].value_counts().to_dict()
    low_confidence = int((mapped_gdf["mapping_confidence"] < 0.85).sum())
    unclassified = int((mapped_gdf["standard_code"] == "UNCLASSIFIED").sum())

    return {
        "run_id": run_id,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "pipeline_version": "1.4.2",
        "ruleset_version": ruleset_version,
        "input_data_hash": input_hash,
        "records_processed": valid_count + rejected_count,
        "records_validated": valid_count,
        "records_rejected": rejected_count,
        "mapping_method_distribution": method_counts,
        "taxonomy_category_distribution": category_counts,
        "low_confidence_count": low_confidence,
        "unclassified_count": unclassified,
        "compliance_status": "PASS" if (rejected_count == 0 and unclassified == 0) else "REVIEW_REQUIRED",
    }


# Usage:
# valid_gdf, rejected = ingest_and_validate(raw_gdf)
# mapped = resolve_zoning_taxonomy(valid_gdf)
# manifest = generate_audit_manifest(
#     "run_20260626_01", compute_input_hash(valid_gdf), RULESET_VERSION,
#     len(valid_gdf), len(rejected), mapped,
# )
# logger.info(json.dumps(manifest))

The compliance_status flag is deliberately strict: any rejected record or any UNCLASSIFIED parcel forces REVIEW_REQUIRED rather than letting an incomplete classification pass as clean. Hashing the input over sorted parcels makes the manifest a genuine change-detection key — two runs with the same hash and same ruleset version must produce identical output, and a changed hash tells you the source actually moved. How these manifests roll up into jurisdiction-level reporting and retention obligations is covered under compliance framework integration.

FAQ jump to heading

Why use a tiered ruleset instead of a flat dictionary lookup?

A flat dictionary cannot express precedence. The same code legitimately matches a curated per-city entry, a jurisdiction-scoped pattern, and a generic pattern at once, and you need the most specific one to win deterministically. Tiers encode that precedence explicitly: exact matches outrank regex, jurisdiction-scoped rules outrank generic ones, and a default fallback catches everything else. A dictionary also has no place to record confidence or the method that produced a result, both of which downstream consumers need.

What should the mapping engine do with a zoning code it has never seen?

Label it UNCLASSIFIED with confidence 0.0 and method default_fallback, then route it to a review queue. Never let an unknown code raise an exception that halts a county-wide batch, and never map it to a real category as a guess. An honest UNCLASSIFIED is observable and recoverable; a confident wrong classification silently corrupts every downstream feasibility score that trusts it.

How do I handle the same code meaning different things in two cities?

Match on the jurisdiction and code together, not the code alone. C-1 can be neighborhood commercial in one city and heavy commercial in another, so a generic pattern would collapse the distinction. Add tier-1 exact rules scoped to each jurisdiction for known-ambiguous codes; because exact matches sit above regex in precedence, the curated rule overrides the broad pattern that would otherwise mis-bucket the parcel.

Why version the ruleset, and what breaks if I do not?

Ordinances rename districts over time, so the same intent carries different codes in different eras and the same code can change meaning. Without a ruleset version pinned to each run, you cannot reproduce a past classification or explain why last quarter's output differs from today's. Store the rules as versioned data, travel the effective_date with each record, and record the ruleset version in the audit manifest so any historical run replays exactly.

How does confidence scoring change what downstream systems do?

It lets consumers gate on certainty instead of treating every mapping equally. An underwriting model can require confidence at or above a threshold and route anything below it to fallback routing or manual review, so a 0.85 regex guess is never trusted like a 1.0 curated match. Keep the confidence and method columns in every export — both the underwriting team filtering on them and the regulator auditing them depend on their presence.

How to map local zoning codes to standardized taxonomies — step-by-step normalization and conflict resolution between adjacent jurisdictions
Schema validation & data quality checks — the upstream gate that guarantees the input contract mapping depends on
CRS alignment strategies — projects geometries to a metric CRS before taxonomy joins
Fallback routing logic — handles parcels that map to UNCLASSIFIED or below the confidence threshold
GIS export sync workflows — serializes the classified layer to GeoParquet, shapefile, and GeoJSON for consumers

Up: Municipal Zoning Data Architecture & Compliance Frameworks