To-Do: Build Business History Dataset

Purpose

Build a business-change dataset that can support the parking minimum study at two levels:

1. an **aggregated baseline** using NaNDA tract-level annual business counts and employment

2. a later **business-level POI history dataset** using Yelp, Foursquare, OSM, and manual validation

The main idea is to start with a tract-based corridor proxy so we can learn the data, test corridor comparisons, and get an early read on signal before committing to the harder business-level pipeline.

Why This Revision Makes Sense

The original plan jumped quickly into business-level longitudinal reconstruction.

That is still a good long-run goal, but the NaNDA dataset gives us a low-effort, high-reward first step in that direction:

annual tract-level panels
counts and employment
multiple retail and service categories
long time range
standardized geography

Examples from the downloaded NaNDA files:

eating/drinking places
retail stores
grocery stores
liquor/tobacco stores
personal care / laundromats
recreation stores
dollar stores

The tract CSVs appear to contain fields like:

`tract_fips10`
`year`
`totpop`
`aland10`
category-specific `count_*`
category-specific `emps_*`
category-specific `den_*`
category-specific `aden_*`

That makes NaNDA a good baseline for:

annual corridor-proxy business counts
annual corridor-proxy employment
pre/post trend inspection
early cross-city comparison

It does **not** solve true corridor-only attribution by itself, because tracts are larger than the corridor buffers we care about. But it gives us a practical first dataset.

Framing

We should treat this as a two-stage business-history strategy.

Stage A

Build a **corridor-proxy aggregated panel** using:

corridor geometry
corridor buffers
census tract overlap
NaNDA annual tract-level business counts and employment

Stage B

Build a **business-level corridor panel** using:

Yelp
Foursquare
OSM
manual checks

Stage A helps us learn:

whether selected corridors show plausible movement over time
which category families appear most responsive
where the strongest cases are
how to structure later validation

Revised Goal

Construct a staged longitudinal business dataset aligned with the study objective of detecting corridor-level change before and after parking minimum removal.

Immediate goal

Create an **aggregated corridor-year panel** using NaNDA tract-level business and employment measures.

Later goal

Create a **business-location panel** for selected corridors after the aggregated baseline is working.

Phase I: Aggregated Baseline With NaNDA

1. Define The Analysis Frame

Confirm final study corridors and backups
Use the reviewed corridor geometry once available
Define corridor buffers: likely `100m` as the main version, with a possible `200m` sensitivity version
Assign a unique corridor ID
Assign city, state, city group, and treatment status

Output:

corridor geometry layer
corridor buffer layer
corridor metadata table

2. Define The Geographic Proxy Strategy

Because NaNDA is tract-level, we need a transparent way to map corridor buffers to tracts.

Recommended default:

intersect corridor buffers with 2010 census tracts
compute overlap share
assign tracts to corridors using one or more rules

Recommended rules to support:

`binary_touch`
a tract is included in full if it intersects the corridor buffer at all

`centroid_in_buffer`
a tract is included in full if its centroid falls inside the corridor buffer

`area_weighted`
a tract contributes in proportion to the share of its area inside the corridor buffer

My recommendation:

use `area_weighted` as the main specification
keep `binary_touch` as a robustness check

This will not perfectly isolate corridor-only activity, but it is a defensible first-pass corridor proxy.
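
To make the three rules concrete, here is a minimal sketch of how each one could turn a tract's overlap into a weight. The function name `tract_weight` and its arguments are hypothetical; it assumes the intersection area and the centroid test have already been computed in a GIS step.

```python
def tract_weight(rule, intersect_area_m2, tract_area_m2, centroid_in_buffer):
    """Return the weight a tract gets for one corridor buffer under a given rule.

    All argument names are illustrative; the areas are assumed to be in the
    same projected units (square metres).
    """
    if rule == "binary_touch":
        # Any positive overlap counts the whole tract.
        return 1.0 if intersect_area_m2 > 0 else 0.0
    if rule == "centroid_in_buffer":
        # The whole tract counts only if its centroid is inside the buffer.
        return 1.0 if centroid_in_buffer else 0.0
    if rule == "area_weighted":
        # The tract counts in proportion to the share of its area in the buffer.
        return intersect_area_m2 / tract_area_m2 if tract_area_m2 > 0 else 0.0
    raise ValueError(f"unknown crosswalk rule: {rule}")


# A tract with 25% of its area inside the buffer and its centroid outside.
weights = {
    rule: tract_weight(rule, 250.0, 1000.0, False)
    for rule in ("binary_touch", "centroid_in_buffer", "area_weighted")
}
```

The same tract gets three different weights here, which is why keeping more than one rule as a robustness check is useful.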

3. Prepare NaNDA Inputs

Use the tract-level CSVs inside [`data/nanda`](c:/Users/ylaim/OneDrive/6555-urpl-transport-env/parking_research/data/nanda).

Relevant families:

eating/drinking
retail
grocery
liquor/tobacco
personal care / laundromats
recreation
dollar stores

Tasks:

unzip to a working data directory
standardize tract identifiers
document year coverage per file
document category field names
select a clean subset of variables for the first pass

Recommended first-pass fields:

tract ID
year
total population
land area
category-specific business counts
category-specific employment

Output:

cleaned NaNDA tract panels by theme
one variable dictionary for selected NaNDA fields
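
A minimal sketch of the cleaning step, assuming the tract CSVs can be read as plain text. The sample rows and the `count_example` / `emps_example` / `den_example` column names are made up for illustration; the real category field names need to be confirmed against the downloaded files. The one firm detail is that 2010 tract identifiers are 11 digits (2 state + 3 county + 6 tract), so IDs that lost a leading zero are padded back.

```python
import csv
import io

# Hypothetical sample standing in for one NaNDA tract CSV.
SAMPLE_CSV = """tract_fips10,year,totpop,aland10,count_example,emps_example,den_example
1001020100,2010,1900,9.8,4,35,2.1
36061000700,2011,8200,0.2,61,540,7.4
"""

# First-pass numeric fields to keep (density fields are dropped for now).
NUMERIC_FIELDS = ["totpop", "aland10", "count_example", "emps_example"]


def clean_nanda_rows(handle, numeric_fields=NUMERIC_FIELDS):
    """Return rows with an 11-character tract ID, integer year, numeric values."""
    cleaned = []
    for row in csv.DictReader(handle):
        out = {
            # Restore leading zeros dropped by spreadsheet or numeric parsing.
            "tract_fips10": row["tract_fips10"].strip().zfill(11),
            "year": int(row["year"]),
        }
        for field in numeric_fields:
            value = row[field].strip()
            out[field] = float(value) if value else None
        cleaned.append(out)
    return cleaned


rows = clean_nanda_rows(io.StringIO(SAMPLE_CSV))
# Year coverage for this file, for the documentation task above.
year_coverage = (min(r["year"] for r in rows), max(r["year"] for r in rows))
```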

4. Build Corridor-Tract Crosswalk

For each corridor buffer:

intersect with tract polygons
compute:
tract area overlap
share of tract area in corridor buffer
optional share of corridor buffer inside tract

Recommended crosswalk fields:

`corridor_id`
`city`
`tract_fips10`
`buffer_m`
`tract_area_m2`
`intersect_area_m2`
`tract_overlap_share`
`corridor_buffer_share`
`crosswalk_rule`

Output:

corridor-tract crosswalk table

This crosswalk becomes the key bridge between corridor geometry and NaNDA panels.
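
As a sketch of how the crosswalk rows could be assembled, the snippet below assumes the intersection areas already come out of a GIS overlay step and only fills in the two share fields. The function name `build_crosswalk`, the input layout, and the sample values are hypothetical.

```python
def build_crosswalk(overlaps, buffer_area_m2, rule="area_weighted"):
    """Add share fields to precomputed corridor-tract overlaps.

    overlaps: list of dicts with corridor_id, city, tract_fips10, buffer_m,
              tract_area_m2, intersect_area_m2 (assumed input layout).
    buffer_area_m2: dict keyed by (corridor_id, buffer_m).
    """
    rows = []
    for o in overlaps:
        buffer_area = buffer_area_m2[(o["corridor_id"], o["buffer_m"])]
        row = dict(o)
        # Share of the tract that falls inside the corridor buffer.
        row["tract_overlap_share"] = o["intersect_area_m2"] / o["tract_area_m2"]
        # Share of the corridor buffer that falls inside this tract.
        row["corridor_buffer_share"] = o["intersect_area_m2"] / buffer_area
        row["crosswalk_rule"] = rule
        rows.append(row)
    return rows


overlaps = [
    {"corridor_id": "C01", "city": "Example City", "tract_fips10": "01001020100",
     "buffer_m": 100, "tract_area_m2": 2_000_000.0, "intersect_area_m2": 100_000.0},
    {"corridor_id": "C01", "city": "Example City", "tract_fips10": "01001020200",
     "buffer_m": 100, "tract_area_m2": 500_000.0, "intersect_area_m2": 300_000.0},
]
crosswalk = build_crosswalk(overlaps, {("C01", 100): 400_000.0})
```

A quick sanity check on real data would be that `corridor_buffer_share` sums to roughly 1 per corridor and buffer; a much lower total suggests missing tract polygons.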

5. Build Corridor-Year Aggregated Panel

Join the corridor-tract crosswalk to the cleaned NaNDA tract panels.

Then aggregate tract values to the corridor-year level.

Recommended default:

weighted sums using `tract_overlap_share`

Example outputs by corridor-year:

estimated eating/drinking count
estimated eating/drinking employment
estimated retail count
estimated retail employment
estimated grocery count
estimated personal care / laundromat count

This yields a corridor-year panel such as:

`corridor_id`
`city`
`year`
`buffer_m`
`count_totaleatingplaces_est`
`emps_totaleatingplaces_est`
`count_totretail_est`
`emps_totretail_est`
additional category estimates
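
The weighted-sum aggregation can be sketched as follows. The input field names `count_totaleatingplaces` and `emps_totaleatingplaces` are inferred from the `_est` output names above and should be checked against the NaNDA files; the function name and sample numbers are hypothetical.

```python
from collections import defaultdict


def aggregate_corridor_year(crosswalk, tract_panel, value_fields):
    """Sum tract values to corridor-year, weighted by tract_overlap_share."""
    by_tract = defaultdict(list)
    for row in tract_panel:
        by_tract[row["tract_fips10"]].append(row)

    totals = defaultdict(lambda: defaultdict(float))
    for link in crosswalk:
        for row in by_tract.get(link["tract_fips10"], []):
            key = (link["corridor_id"], link["buffer_m"], row["year"])
            for field in value_fields:
                if row[field] is not None:
                    totals[key][field + "_est"] += link["tract_overlap_share"] * row[field]

    return [
        {"corridor_id": cid, "buffer_m": buf, "year": year, **totals[(cid, buf, year)]}
        for cid, buf, year in sorted(totals)
    ]


crosswalk = [
    {"corridor_id": "C01", "buffer_m": 100, "tract_fips10": "A", "tract_overlap_share": 0.25},
    {"corridor_id": "C01", "buffer_m": 100, "tract_fips10": "B", "tract_overlap_share": 0.5},
]
tract_panel = [
    {"tract_fips10": "A", "year": 2010, "count_totaleatingplaces": 40.0, "emps_totaleatingplaces": 200.0},
    {"tract_fips10": "B", "year": 2010, "count_totaleatingplaces": 10.0, "emps_totaleatingplaces": 60.0},
    {"tract_fips10": "A", "year": 2011, "count_totaleatingplaces": 44.0, "emps_totaleatingplaces": 210.0},
    {"tract_fips10": "B", "year": 2011, "count_totaleatingplaces": 10.0, "emps_totaleatingplaces": 50.0},
]
panel = aggregate_corridor_year(
    crosswalk, tract_panel, ["count_totaleatingplaces", "emps_totaleatingplaces"]
)
```

Area weighting implicitly assumes businesses are spread evenly across each tract, which is one reason the estimates are labeled `_est` rather than treated as observed corridor counts.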

6. Derive Aggregated Change Measures

From the corridor-year panel, derive:

annual change in business counts
annual change in employment
pre/post treatment change
category composition change
corridor-level trend slopes

Good first measures:

level change
percent change
rolling average change
category share change

This is not business churn yet in the strict entry/exit sense, but it gives an early measure of corridor commercial change.
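
A small sketch of the first measures, applied to one corridor's annual series. The helper names and the sample values are hypothetical, and the series is assumed to be sorted by year with no gaps.

```python
def level_change(values):
    """Year-over-year difference; the first year has no prior value."""
    changes = [b - a for a, b in zip(values, values[1:])]
    return [None] + changes if values else []


def percent_change(values):
    """Year-over-year percent change; None where the prior value is zero."""
    changes = [100.0 * (b - a) / a if a else None for a, b in zip(values, values[1:])]
    return [None] + changes if values else []


def rolling_mean(values, window=3):
    """Trailing mean over `window` years; None until enough years exist."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


def category_shares(counts):
    """Share of each category in the corridor total for one year."""
    total = sum(counts.values())
    return {k: (v / total if total else None) for k, v in counts.items()}


eating_counts = [10.0, 12.0, 15.0, 15.0]  # made-up corridor estimates by year
levels = level_change(eating_counts)
percents = percent_change(eating_counts)
smoothed = rolling_mean(eating_counts, window=3)
shares = category_shares({"eating": 15.0, "retail": 45.0})
```

Differencing the smoothed series is one way to get a rolling average change that is less sensitive to single-year noise in small corridors.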

7. Validate The Aggregated Proxy

We should explicitly test whether the tract proxy behaves plausibly.

Checks:

compare corridor maps to tract overlap maps
flag corridors dominated by one very large tract
flag corridors with weak tract alignment
compare aggregated signals to qualitative expectations from corridor notes
compare selected corridors to OSM/POI context

Important:

Some corridors will be better candidates for tract-based inference than others.

This validation step should help us identify:

strong aggregated candidates
weak aggregated candidates
corridors that likely require business-level reconstruction sooner
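
Two of the checks above can be screened automatically from the crosswalk before any map review. This is a rough sketch: the function name and both thresholds (`0.8` and `0.05`) are placeholder assumptions to be tuned, not recommended values.

```python
from collections import defaultdict


def flag_corridors(crosswalk, dominant_share=0.8, weak_overlap=0.05):
    """Flag corridors whose tract proxy looks questionable.

    dominated_by_one_tract: a single tract covers most of the buffer.
    weak_alignment: the buffer is only a thin slice of every tract it touches.
    """
    by_corridor = defaultdict(list)
    for row in crosswalk:
        by_corridor[row["corridor_id"]].append(row)

    flags = {}
    for corridor_id, rows in by_corridor.items():
        flags[corridor_id] = {
            "n_tracts": len(rows),
            "dominated_by_one_tract":
                max(r["corridor_buffer_share"] for r in rows) >= dominant_share,
            "weak_alignment":
                max(r["tract_overlap_share"] for r in rows) < weak_overlap,
        }
    return flags


crosswalk = [
    {"corridor_id": "C01", "corridor_buffer_share": 0.9, "tract_overlap_share": 0.02},
    {"corridor_id": "C01", "corridor_buffer_share": 0.1, "tract_overlap_share": 0.30},
    {"corridor_id": "C02", "corridor_buffer_share": 0.5, "tract_overlap_share": 0.01},
    {"corridor_id": "C02", "corridor_buffer_share": 0.5, "tract_overlap_share": 0.02},
]
flags = flag_corridors(crosswalk)
```

Flagged corridors would still need the visual map comparison; the flags only prioritize which ones to look at first.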

8. Produce Phase I Deliverables

Deliverables:

corridor-tract crosswalk
cleaned NaNDA tract panels
corridor-year aggregated panel
exploratory plots by corridor and city
notes on which categories are most informative

This phase should be enough to:

test the city/corridor design
assess whether treatment corridors move differently from comparison corridors
decide where the business-level effort is worth the cost

Phase II: Business-Level POI History

Only after the aggregated baseline is working should we move to the harder business-level build.

9. Build Multi-Source POI Inventory

Use a combined approach:

Yelp for review/activity timing
Foursquare for structure and categories
OSM for validation and baseline context

For each corridor:

query by corridor geometry or corridor-centered search areas
collect name, address, coordinates, category, and metadata
store raw source tables separately

10. Standardize And Merge POIs

normalize names
normalize categories
deduplicate by name + proximity + address logic
create a unified business-location table

Output:

one row per candidate business-location
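
A minimal sketch of the name + proximity part of deduplication. It assumes each raw record has `name`, `lat`, `lon`, and `source` fields (a hypothetical layout), uses a placeholder 50 m distance threshold, and leaves the address logic out entirely.

```python
import math
import re


def normalize_name(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def merge_pois(records, max_dist_m=50.0):
    """Greedy merge: same normalized name within max_dist_m is one business-location."""
    merged = []
    for rec in records:
        key = normalize_name(rec["name"])
        for biz in merged:
            close = haversine_m(rec["lat"], rec["lon"], biz["lat"], biz["lon"]) <= max_dist_m
            if biz["name_norm"] == key and close:
                biz["sources"].append(rec["source"])
                break
        else:
            # No match found, so this record starts a new business-location.
            merged.append({"name_norm": key, "lat": rec["lat"], "lon": rec["lon"],
                           "sources": [rec["source"]]})
    return merged


records = [
    {"name": "Joe's Cafe", "lat": 40.0000, "lon": -75.0, "source": "yelp"},
    {"name": "JOES CAFE", "lat": 40.0001, "lon": -75.0, "source": "foursquare"},
    {"name": "Joe's Cafe", "lat": 40.0100, "lon": -75.0, "source": "osm"},
]
merged = merge_pois(records)
```

Exact name matching after normalization will miss renamed or abbreviated businesses, which is part of what the later manual checks are for.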

11. Attach Time Signals

Examples:

Yelp review timestamps
first observed activity
last observed activity
optional Foursquare activity/metadata if useful
OSM as validation or supplementary presence signal

This will allow us to infer:

entry proxy
exit proxy
persistence
category change
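
One way the entry and exit proxies could be derived from review timestamps is sketched below. The function name, the output field names, and the `stale_after_days=730` cutoff are assumptions for illustration; a business with no recent activity is only treated as a possible exit, not a confirmed closure.

```python
from datetime import date


def time_signals(review_dates, as_of, stale_after_days=730):
    """Derive first/last activity and entry/exit proxies for one business."""
    if not review_dates:
        return None
    first, last = min(review_dates), max(review_dates)
    # Treat the business as a possible exit only if activity stopped long ago
    # relative to the date the data was pulled.
    likely_closed = (as_of - last).days > stale_after_days
    return {
        "first_observed": first,
        "last_observed": last,
        "entry_proxy_year": first.year,
        "exit_proxy_year": last.year if likely_closed else None,
    }


pull_date = date(2024, 1, 1)  # hypothetical data pull date
closed_like = time_signals(
    [date(2014, 3, 2), date(2016, 7, 9), date(2012, 5, 1)], pull_date
)
still_active = time_signals([date(2019, 6, 1), date(2023, 11, 1)], pull_date)
```

The entry proxy is biased late for businesses that opened before they received their first review, so it is better read as "first observed" than as an opening date.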

12. Build Business-Level Panel

Unit:

business-location

Time:

yearly, or a coarser interval if the time signals do not support annual resolution

Fields:

active / inactive
category
corridor membership
source confidence

This is the stage that supports true churn metrics.
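
A sketch of how one business-location could be expanded into yearly active/inactive rows. The input fields `entry_proxy_year` and `exit_proxy_year`, along with `business_id` and `source_confidence`, are hypothetical names for the time signals and metadata described above.

```python
def business_year_panel(business, years):
    """Expand one business-location into one row per year with an active flag."""
    rows = []
    for year in years:
        started = business["entry_proxy_year"] <= year
        ended = (
            business["exit_proxy_year"] is not None
            and year > business["exit_proxy_year"]
        )
        rows.append({
            "business_id": business["business_id"],
            "corridor_id": business["corridor_id"],
            "category": business["category"],
            "source_confidence": business["source_confidence"],
            "year": year,
            # Active from the entry proxy year through the exit proxy year.
            "active": started and not ended,
        })
    return rows


business = {
    "business_id": "B001", "corridor_id": "C01", "category": "eating/drinking",
    "source_confidence": "medium", "entry_proxy_year": 2013, "exit_proxy_year": 2016,
}
rows = business_year_panel(business, range(2012, 2019))
active_flags = [r["active"] for r in rows]
```

Counting transitions from inactive to active (and the reverse) by corridor and year would then give entry and exit counts.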

13. Validate With Manual Checks

Use:

Google Maps
Street View
manual parcel review where necessary

Focus on:

false closures
duplicate businesses
renamed businesses
major missed establishments

14. Final Analysis Outputs

Eventually we want two linked outputs:

Corridor-level panel

annual business counts
annual employment proxies
category mix
churn metrics

Business-level panel

entry timing
exit timing
persistence
category change

Ready for:

before/after policy comparison
treated vs comparison corridor comparison
cross-city comparison
linkage to repeal timing
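
As a rough illustration of the before/after and treated-vs-comparison reads, the sketch below compares mean pre- and post-repeal levels for one treated and one comparison corridor. The function names, corridor IDs, and values are made up, and counting the repeal year itself as "post" is an assumption. This is a descriptive first look, not a full causal design.

```python
from statistics import mean


def pre_post_change(panel, corridor_id, field, repeal_year):
    """Mean post-repeal level minus mean pre-repeal level for one corridor."""
    pre = [r[field] for r in panel
           if r["corridor_id"] == corridor_id and r["year"] < repeal_year]
    post = [r[field] for r in panel
            if r["corridor_id"] == corridor_id and r["year"] >= repeal_year]
    # mean() raises StatisticsError if either period has no observations.
    return mean(post) - mean(pre)


def treated_minus_comparison(panel, treated_id, comparison_id, field, repeal_year):
    """Difference between the treated and comparison pre/post changes."""
    return (pre_post_change(panel, treated_id, field, repeal_year)
            - pre_post_change(panel, comparison_id, field, repeal_year))


years = [2015, 2016, 2017, 2018]
panel = (
    [{"corridor_id": "T01", "year": y, "count_est": v}
     for y, v in zip(years, [10.0, 10.0, 14.0, 16.0])]
    + [{"corridor_id": "C01", "year": y, "count_est": v}
       for y, v in zip(years, [20.0, 20.0, 21.0, 23.0])]
)
treated_change = pre_post_change(panel, "T01", "count_est", 2017)
gap = treated_minus_comparison(panel, "T01", "C01", "count_est", 2017)
```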

Deliverable

A staged business-history workflow that starts with a tract-based aggregated baseline and later extends to business-level longitudinal reconstruction.

That gives us:

a faster first analytic read
a cleaner way to learn the corridor design
a more defensible path into the harder POI history build