To-Do: Build Business History Dataset
Purpose
Build a business-change dataset that can support the parking minimum study at two levels:
1. an **aggregated baseline** using NaNDA tract-level annual business counts and employment
2. a later **business-level POI history dataset** using Yelp, Foursquare, OSM, and manual validation
The main idea is to start with a tract-based corridor proxy so we can learn the data, test corridor comparisons, and get an early read on signal before committing to the harder business-level pipeline.
Why This Revision Makes Sense
The original plan jumped quickly into business-level longitudinal reconstruction.
That is still a good long-run goal, but the NaNDA dataset gives us a low-effort, high-reward first step in that direction:
- annual tract-level panels
- counts and employment
- multiple retail and service categories
- long time range
- standardized geography
Examples from the downloaded NaNDA files:
- eating/drinking places
- retail stores
- grocery stores
- liquor/tobacco stores
- personal care / laundromats
- recreation stores
- dollar stores
The tract CSVs appear to contain fields like:
- `tract_fips10`
- `year`
- `totpop`
- `aland10`
- category-specific `count_*`
- category-specific `emps_*`
- category-specific `den_*`
- category-specific `aden_*`
That makes NaNDA a good baseline for:
- annual corridor-proxy business counts
- annual corridor-proxy employment
- pre/post trend inspection
- early cross-city comparison
It does **not** solve true corridor-only attribution by itself, because tracts are larger than the corridor buffers we care about. But it gives us a practical first dataset.
Framing
We should treat this as a two-stage business-history strategy.
Stage A
Build a **corridor-proxy aggregated panel** using:
- corridor geometry
- corridor buffers
- census tract overlap
- NaNDA annual tract-level business counts and employment
Stage B
Build a **business-level corridor panel** using:
- Yelp
- Foursquare
- OSM
- manual checks
Stage A helps us learn:
- whether selected corridors show plausible movement over time
- which category families appear most responsive
- where the strongest cases are
- how to structure later validation
Revised Goal
Construct a staged longitudinal business dataset aligned with the study objective of detecting corridor-level change before and after parking minimum removal.
Immediate goal
Create an **aggregated corridor-year panel** using NaNDA tract-level business and employment measures.
Later goal
Create a **business-location panel** for selected corridors after the aggregated baseline is working.
Phase I: Aggregated Baseline With NaNDA
1. Define The Analysis Frame
- Confirm final study corridors and backups
- Use the reviewed corridor geometry once available
- Define corridor buffers
  - likely `100m`, possibly with a `200m` sensitivity version
- Assign a unique corridor ID
- Assign city, state, city group, and treatment status
Output:
- corridor geometry layer
- corridor buffer layer
- corridor metadata table
2. Define The Geographic Proxy Strategy
Because NaNDA is tract-level, we need a transparent way to map corridor buffers to tracts.
Recommended default:
- intersect corridor buffers with 2010 census tracts
- compute overlap share
- assign tracts to corridors using one or more rules
Recommended rules to support:
- `binary_touch`
  - a tract is included if it intersects the corridor buffer at all
- `centroid_in_buffer`
  - a tract is included if its centroid falls inside the corridor buffer
- `area_weighted`
  - a tract contributes in proportion to the share of its area inside the corridor buffer
My recommendation:
- use `area_weighted` as the main specification
- keep `binary_touch` as a robustness check
This will not perfectly isolate corridor-only activity, but it is a defensible first-pass corridor proxy.
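The three rules can be expressed as a single weight per corridor-tract pair. The following is a minimal sketch, not a settled implementation: the function name `tract_weight` and the `centroid_in_buffer` boolean input are assumptions for illustration, and the overlap share is assumed to have been computed already in a GIS step.

```python
def tract_weight(rule, tract_overlap_share, centroid_in_buffer):
    """Return the weight a tract contributes to a corridor under one rule.

    tract_overlap_share: share of the tract's area inside the buffer (0 to 1).
    centroid_in_buffer: True if the tract centroid falls inside the buffer.
    """
    if rule == "binary_touch":
        # Any intersection at all counts the whole tract.
        return 1.0 if tract_overlap_share > 0 else 0.0
    if rule == "centroid_in_buffer":
        # The whole tract counts only when its centroid is inside the buffer.
        return 1.0 if centroid_in_buffer else 0.0
    if rule == "area_weighted":
        # The tract contributes in proportion to its overlapping area.
        return tract_overlap_share
    raise ValueError(f"unknown crosswalk rule: {rule}")
```

Keeping all rules behind one function makes it easy to rerun the same aggregation with `binary_touch` as the robustness check.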
3. Prepare NaNDA Inputs
Use the tract-level CSVs inside [`data/nanda`](c:/Users/ylaim/OneDrive/6555-urpl-transport-env/parking_research/data/nanda).
Relevant families:
- eating/drinking
- retail
- grocery
- liquor/tobacco
- personal care / laundromats
- recreation
- dollar stores
Tasks:
- unzip to a working data directory
- standardize tract identifiers
- document year coverage per file
- document category field names
- select a clean subset of variables for the first pass
Recommended first-pass fields:
- tract ID
- year
- total population
- land area
- category-specific business counts
- category-specific employment
Output:
- cleaned NaNDA tract panels by theme
- one variable dictionary for selected NaNDA fields
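Standardizing tract identifiers mostly means protecting leading zeros. Census tract GEOIDs are 11 digits (2 state, 3 county, 6 tract), and CSV readers that parse the column as a number drop the leading zero for states 01 through 09. A minimal sketch, assuming the identifier may arrive as an integer, a string, or a float-like string; the helper name `normalize_tract_fips` is hypothetical:

```python
def normalize_tract_fips(raw):
    """Return an 11-character, zero-padded tract GEOID string."""
    text = str(raw).strip()
    # Some exports render numeric IDs as floats, e.g. "1001020100.0".
    if text.endswith(".0"):
        text = text[:-2]
    # Only one leading zero can be lost, so valid inputs have 10 or 11 digits.
    if not text.isdigit() or len(text) not in (10, 11):
        raise ValueError(f"unexpected tract identifier: {raw!r}")
    return text.zfill(11)
```

Reading the ID column as text in the first place avoids the problem; this helper is a safety net for files that were already parsed numerically.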
4. Build Corridor-Tract Crosswalk
For each corridor buffer:
- intersect with tract polygons
- compute:
  - tract area overlap
  - share of tract area in corridor buffer
  - optional share of corridor buffer inside tract
Recommended crosswalk fields:
- `corridor_id`
- `city`
- `tract_fips10`
- `buffer_m`
- `tract_area_m2`
- `intersect_area_m2`
- `tract_overlap_share`
- `corridor_buffer_share`
- `crosswalk_rule`
Output:
- corridor-tract crosswalk table
This crosswalk becomes the key bridge between corridor geometry and NaNDA panels.
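Once the GIS intersection has produced areas, each crosswalk row is simple arithmetic. The sketch below assumes the three areas are already measured in square meters in a projected coordinate system; `buffer_area_m2` and the function name `crosswalk_row` are assumptions, not fields from the list above.

```python
def crosswalk_row(corridor_id, city, tract_fips10, buffer_m,
                  tract_area_m2, intersect_area_m2, buffer_area_m2,
                  crosswalk_rule="area_weighted"):
    """Build one corridor-tract crosswalk record from precomputed areas."""
    if tract_area_m2 <= 0 or buffer_area_m2 <= 0:
        raise ValueError("tract and buffer areas must be positive")
    return {
        "corridor_id": corridor_id,
        "city": city,
        "tract_fips10": tract_fips10,
        "buffer_m": buffer_m,
        "tract_area_m2": tract_area_m2,
        "intersect_area_m2": intersect_area_m2,
        # Share of the tract that lies inside the corridor buffer.
        "tract_overlap_share": intersect_area_m2 / tract_area_m2,
        # Share of the corridor buffer that lies inside this tract.
        "corridor_buffer_share": intersect_area_m2 / buffer_area_m2,
        "crosswalk_rule": crosswalk_rule,
    }
```

For example, a 2 km² tract with 0.5 km² inside a 1 km² buffer gets a `tract_overlap_share` of 0.25 and a `corridor_buffer_share` of 0.5.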
5. Build Corridor-Year Aggregated Panel
Join the corridor-tract crosswalk to the cleaned NaNDA tract panels.
Then aggregate tract values to the corridor-year level.
Recommended default:
- weighted sums using `tract_overlap_share`
Example outputs by corridor-year:
- estimated eating/drinking count
- estimated eating/drinking employment
- estimated retail count
- estimated retail employment
- estimated grocery count
- estimated personal care / laundromat count
This yields a corridor-year panel such as:
- `corridor_id`
- `city`
- `year`
- `buffer_m`
- `count_totaleatingplaces_est`
- `emps_totaleatingplaces_est`
- `count_totretail_est`
- `emps_totretail_est`
- additional category estimates
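A minimal sketch of the weighted-sum aggregation, assuming the crosswalk and the tract panel are lists of dictionaries keyed by the field names above. The function name `aggregate_corridor_year` is hypothetical. Note that area weighting implicitly assumes businesses are spread evenly across each tract, which is part of why this is only a proxy.

```python
from collections import defaultdict


def aggregate_corridor_year(crosswalk, tract_panel, value_fields):
    """Sum tract values into corridor-year estimates using tract_overlap_share."""
    # Index the tract panel by tract so each crosswalk link finds its years.
    by_tract = defaultdict(list)
    for rec in tract_panel:
        by_tract[rec["tract_fips10"]].append(rec)

    totals = defaultdict(lambda: defaultdict(float))
    for link in crosswalk:
        for rec in by_tract.get(link["tract_fips10"], []):
            key = (link["corridor_id"], rec["year"])
            for field in value_fields:
                # Each tract contributes its value scaled by its overlap share.
                totals[key][field + "_est"] += link["tract_overlap_share"] * rec[field]

    # One output row per corridor-year, sorted for stable output.
    return [
        {"corridor_id": corridor_id, "year": year, **values}
        for (corridor_id, year), values in sorted(totals.items())
    ]
```

With two tracts overlapping a corridor at shares 0.25 and 0.5 and eating-place counts of 40 and 10, the corridor estimate for that year is 0.25 × 40 + 0.5 × 10 = 15.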
6. Derive Aggregated Change Measures
From the corridor-year panel, derive:
- annual change in business counts
- annual change in employment
- pre/post treatment change
- category composition change
- corridor-level trend slopes
Good first measures:
- level change
- percent change
- rolling average change
- category share change
This is not yet business churn in the strict entry/exit sense, but it gives an early measure of corridor commercial change.
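The first measures can be sketched on a single corridor's annual series. This is an illustrative sketch with hypothetical helper names; it assumes the series is sorted by year with no gaps, and it uses an ordinary least-squares slope as one possible definition of the trend.

```python
def level_change(series):
    """Year-over-year difference in levels."""
    return [b - a for a, b in zip(series, series[1:])]


def percent_change(series):
    """Year-over-year percent change; None where the base year is zero."""
    return [100.0 * (b - a) / a if a else None for a, b in zip(series, series[1:])]


def rolling_mean(series, window=3):
    """Trailing moving average over `window` consecutive years."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def trend_slope(years, values):
    """Ordinary least-squares slope of values on years (units per year)."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return sxy / sxx
```

Pre/post treatment change can then be a difference in mean levels, or in slopes, computed over the years before and after each corridor's repeal date.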
7. Validate The Aggregated Proxy
We should explicitly test whether the tract proxy behaves plausibly.
Checks:
- compare corridor maps to tract overlap maps
- flag corridors dominated by one very large tract
- flag corridors with weak tract alignment
- compare aggregated signals to qualitative expectations from corridor notes
- compare selected corridors to OSM/POI context
Important:
Some corridors will be better candidates for tract-based inference than others.
This validation step should help us identify:
- strong aggregated candidates
- weak aggregated candidates
- corridors that likely require business-level reconstruction sooner
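Two of these checks can be automated from the crosswalk alone. The sketch below is one possible screening rule, not an agreed standard: the function name and both thresholds (`0.8` and `0.1`) are placeholder assumptions to be tuned after looking at the overlap maps.

```python
from collections import defaultdict


def flag_weak_corridors(crosswalk, dominance_threshold=0.8, min_alignment=0.1):
    """Flag corridors whose tract proxy looks unreliable.

    dominated_by_one_tract: one tract covers most of the corridor buffer.
    weak_alignment: the buffer is only a small part of the tracts it sits in,
    measured as the buffer-share-weighted mean of tract_overlap_share.
    """
    by_corridor = defaultdict(list)
    for link in crosswalk:
        by_corridor[link["corridor_id"]].append(link)

    flags = {}
    for corridor_id, links in by_corridor.items():
        max_buffer_share = max(l["corridor_buffer_share"] for l in links)
        total_buffer_share = sum(l["corridor_buffer_share"] for l in links)
        alignment = sum(
            l["corridor_buffer_share"] * l["tract_overlap_share"] for l in links
        ) / total_buffer_share
        flags[corridor_id] = {
            "dominated_by_one_tract": max_buffer_share >= dominance_threshold,
            "weak_alignment": alignment < min_alignment,
        }
    return flags
```

Corridors flagged on both counts are natural candidates for earlier business-level reconstruction.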
8. Produce Phase I Deliverables
Deliverables:
- corridor-tract crosswalk
- cleaned NaNDA tract panels
- corridor-year aggregated panel
- exploratory plots by corridor and city
- notes on which categories are most informative
This phase should be enough to:
- test the city/corridor design
- assess whether treatment corridors move differently from comparison corridors
- decide where the business-level effort is worth the cost
Phase II: Business-Level POI History
Only after the aggregated baseline is working should we move to the harder business-level build.
9. Build Multi-Source POI Inventory
Use a combined approach:
- Yelp for review/activity timing
- Foursquare for structure and categories
- OSM for validation and baseline context
For each corridor:
- query by corridor geometry or corridor-centered search areas
- collect name, address, coordinates, category, and metadata
- store raw source tables separately
10. Standardize And Merge POIs
- normalize names
- normalize categories
- deduplicate by name + proximity + address logic
- create a unified business-location table
Output:
- one row per candidate business-location
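A minimal sketch of the name + proximity part of deduplication, leaving address logic out. Everything here is an assumption for illustration: the helper names, the `name`/`lat`/`lon` keys, the list of dropped words, and the 30 m distance threshold. Distance uses the standard haversine formula.

```python
import math
import re

# Hypothetical list of words to drop before comparing names.
STOP_WORDS = {"the", "inc", "llc", "co"}


def normalize_name(name):
    """Lowercase, drop apostrophes and punctuation, remove filler words."""
    text = name.lower().replace("'", "").replace("\u2019", "")
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def is_probable_duplicate(a, b, max_distance_m=30.0):
    """Same normalized name and located within max_distance_m of each other."""
    if normalize_name(a["name"]) != normalize_name(b["name"]):
        return False
    return haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_distance_m


def dedupe(pois, max_distance_m=30.0):
    """Greedy pass: keep a POI only if it duplicates nothing already kept."""
    kept = []
    for poi in pois:
        if not any(is_probable_duplicate(poi, k, max_distance_m) for k in kept):
            kept.append(poi)
    return kept
```

Exact name matching after normalization is deliberately strict; fuzzy matching and address comparison would catch more renamed or misspelled entries but also need manual review of borderline pairs.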
11. Attach Time Signals
Examples:
- Yelp review timestamps
- first observed activity
- last observed activity
- optional Foursquare activity/metadata if useful
- OSM as validation or supplementary presence signal
This will allow us to infer:
- entry proxy
- exit proxy
- persistence
- category change
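Entry and exit proxies can be sketched from observed activity dates. This assumes each business has a list of review or activity dates; the function name `activity_window`, the `data_end` cutoff, and the two-year inactivity rule are all assumptions that would need calibrating against manual checks.

```python
from datetime import date


def activity_window(activity_dates, data_end, inactivity_days=730):
    """Infer entry and exit proxies from a business's observed activity dates.

    A business is treated as likely closed only if its last observed activity
    is more than `inactivity_days` before the end of data collection.
    """
    first_seen = min(activity_dates)
    last_seen = max(activity_dates)
    likely_closed = (data_end - last_seen).days > inactivity_days
    return {
        "entry_proxy": first_seen,
        "last_seen": last_seen,
        # No exit is inferred for businesses with recent activity.
        "exit_proxy": last_seen if likely_closed else None,
    }
```

First observed activity lags the true opening date and quiet businesses can look closed, so both proxies are biased in known directions; that is what the manual validation step is meant to catch.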
12. Build Business-Level Panel
Unit:
- business-location
Time:
- yearly or coarse interval
Fields:
- active / inactive
- category
- corridor membership
- source confidence
This is the stage that supports true churn metrics.
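A minimal sketch of the yearly panel and the churn counts it supports. It assumes each business record already carries an `entry_year` and an `exit_year` (with `None` meaning still active); those two keys and both function names are hypothetical.

```python
def business_year_panel(businesses, start_year, end_year):
    """Expand one row per business into one row per business-year."""
    rows = []
    for biz in businesses:
        # Businesses with no inferred exit are treated as active through end_year.
        last_year = biz["exit_year"] if biz["exit_year"] is not None else end_year
        for year in range(start_year, end_year + 1):
            rows.append({
                "business_id": biz["business_id"],
                "corridor_id": biz["corridor_id"],
                "category": biz["category"],
                "source_confidence": biz["source_confidence"],
                "year": year,
                "active": biz["entry_year"] <= year <= last_year,
            })
    return rows


def churn_by_year(businesses, start_year, end_year):
    """Count inferred entries and exits in each year."""
    return {
        year: {
            "entries": sum(1 for b in businesses if b["entry_year"] == year),
            "exits": sum(1 for b in businesses if b["exit_year"] == year),
        }
        for year in range(start_year, end_year + 1)
    }
```

Here the exit year is counted as the last active year; whether exits should instead be dated to the following year is a convention to fix before computing churn rates.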
13. Validate With Manual Checks
Use:
- Google Maps
- Street View
- manual parcel review where necessary
Focus on:
- false closures
- duplicate businesses
- renamed businesses
- major missed establishments
14. Final Analysis Outputs
Eventually we want two linked outputs:
Corridor-level panel
- annual business counts
- annual employment proxies
- category mix
- churn metrics
Business-level panel
- entry timing
- exit timing
- persistence
- category change
Ready for:
- before/after policy comparison
- treated vs comparison corridor comparison
- cross-city comparison
- linkage to repeal timing
Deliverable
A staged business-history workflow that starts with a tract-based aggregated baseline and later extends to business-level longitudinal reconstruction.
That gives us:
- a faster first analytic read
- a cleaner way to learn the corridor design
- a more defensible path into the harder POI history build