GeoLift-SDID Dataset Setup Guide

Overview

This guide provides precise technical requirements for preparing datasets compatible with the GeoLift-SDID package. Improperly formatted datasets cause cryptic matrix calculation errors that are difficult to diagnose; following these specifications prevents the vast majority of implementation failures.

Core Dataset Requirements

Required Columns

Every GeoLift-SDID dataset must contain these fundamental columns:

  • unit: Unique identifier for each geographic unit (e.g., DMA code, region ID, store ID)

  • time: Time period for each observation, typically as datetime or numeric timestamp

  • outcome: The metric being measured (e.g., sales, conversion rate, website visits)

  • treatment: Binary indicator (0/1) marking treatment periods for treatment units

Data Structure

The dataset must be in long format (panel data), where:

  • Each row represents a unique combination of unit and time period

  • All units must have observations for all time periods (balanced panel)

  • Pre-treatment and post-treatment periods must be clearly demarcated

Example structure:

| unit | time       | outcome | treatment |
|------|------------|---------|-----------|
| 501  | 2024-01-01 | 123.45  | 0         |
| 501  | 2024-03-01 | 124.56  | 1         |
| 502  | 2024-01-01 | 234.56  | 0         |
| 502  | 2024-03-01 | 235.67  | 0         |
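If the raw data arrives in wide format (one outcome column per unit), it can be reshaped into the required long format with pandas. A minimal sketch, using illustrative values matching the table above:

```python
import pandas as pd

# Wide format: one row per time period, one outcome column per unit
wide = pd.DataFrame({
    'time': ['2024-01-01', '2024-03-01'],
    '501': [123.45, 124.56],
    '502': [234.56, 235.67],
})

# Melt into long (panel) format: one row per unit-time combination
long_df = wide.melt(id_vars='time', var_name='unit', value_name='outcome')
long_df['unit'] = long_df['unit'].astype(int)  # keep unit IDs numeric and consistent
long_df['treatment'] = 0  # treatment indicator is assigned in a later step
print(long_df.shape)  # (4, 4): 2 units x 2 periods, 4 columns
```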

Data Types and Formatting

  1. unit column:

    • Can be numeric (integer/float) or string

    • Must be consistent throughout the dataset

    • Will be automatically converted to appropriate type for analysis

    • If treatment units are specified via CLI or config, they must match this type

  2. time column:

    • Can be datetime string (YYYY-MM-DD) or numeric timestamp

    • Will be automatically converted to numeric format for matrix operations

    • Must be chronologically sorted in ascending order

    • Must be consistent across all units

  3. outcome column:

    • Must be numeric (float/integer)

    • Non-numeric values will cause matrix calculation errors

    • Should not contain NaN/None/NULL values

  4. treatment column:

    • Must be binary (0/1)

    • 0 = no treatment, 1 = treatment

    • Only treatment units in post-intervention period should have value 1
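These type rules can be enforced up front with a small validation helper. The function below is an illustrative sketch, not part of the GeoLift-SDID API:

```python
import pandas as pd

def validate_dtypes(df: pd.DataFrame) -> None:
    """Fail fast on type problems that would otherwise surface as matrix errors."""
    if not pd.api.types.is_numeric_dtype(df['outcome']):
        raise TypeError("'outcome' column must be numeric")
    if df['outcome'].isna().any():
        raise ValueError("'outcome' column contains missing values")
    if not set(df['treatment'].unique()) <= {0, 1}:
        raise ValueError("'treatment' column must be binary (0/1)")

df = pd.DataFrame({
    'unit': [501, 502],
    'time': ['2024-01-01', '2024-01-01'],
    'outcome': [123.45, 234.56],
    'treatment': [0, 0],
})
validate_dtypes(df)  # passes silently on well-formed data
```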

Technical Structural Requirements

  1. Dimension Requirements:

    • Pre-intervention periods: At least 20 required (30 recommended; 60+ preferred) for robust synthetic control

    • Post-intervention periods: At least 7 required (14 recommended; 30+ preferred) for statistical validity

    • Control units: At least 5 required for single-cell (10 recommended; 20+ preferred); at least 10 for multi-cell (20 recommended; 30+ preferred) for donor pool diversity

  2. Matrix Structure Considerations:

    • The implementation handles matrices of dimensions:

      • Y_pre: (n_control, n_pre) - Control units × pre-treatment periods

      • Y_post: (n_control, n_post) - Control units × post-treatment periods

      • Y1_pre: (1, n_pre) - Treatment unit × pre-treatment periods

      • Y1_post: (1, n_post) - Treatment unit × post-treatment periods

  3. Handling Dimension Mismatches:

    • If n_pre ≠ n_post, the implementation includes fallback mechanisms

    • Bootstrap resampling is used for standard error calculation when matrix dimensions are incompatible
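These matrices can be assembled from the long-format dataframe with a pivot. The sketch below uses toy values and assumes a single treatment unit:

```python
import pandas as pd

df = pd.DataFrame({
    'unit':    [501, 501, 502, 502, 503, 503],
    'time':    [1, 2, 1, 2, 1, 2],
    'outcome': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
treatment_unit, intervention_time = 501, 2

# Pivot long data into a (units x periods) outcome matrix
wide = df.pivot(index='unit', columns='time', values='outcome')
pre_cols = [t for t in wide.columns if t < intervention_time]
post_cols = [t for t in wide.columns if t >= intervention_time]

controls = wide.drop(index=treatment_unit)
Y_pre   = controls[pre_cols].to_numpy()                     # (n_control, n_pre)
Y_post  = controls[post_cols].to_numpy()                    # (n_control, n_post)
Y1_pre  = wide.loc[[treatment_unit], pre_cols].to_numpy()   # (1, n_pre)
Y1_post = wide.loc[[treatment_unit], post_cols].to_numpy()  # (1, n_post)
print(Y_pre.shape, Y1_post.shape)  # (2, 1) (1, 1)
```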

Single-Cell vs. Multi-Cell Dataset Structure

Single-Cell Dataset

For analyzing one treatment unit against multiple controls:

  1. Data Structure:

    • One unit designated as treatment

    • All other units as controls

    • Clear delineation of pre/post intervention periods

  2. Treatment Assignment:

    • 'treatment' = 0 for all units during the pre-intervention period

    • 'treatment' = 1 ONLY for the treatment unit during the post-intervention period

    • Control units always have 'treatment' = 0

Example:

| unit | time       | outcome | treatment |
|------|------------|---------|-----------|
| 501  | 2024-01-01 | 123.45  | 0         | <- Pre-intervention
| 501  | 2024-03-01 | 125.67  | 1         | <- Post-intervention starts
| 502  | 2024-01-01 | 234.56  | 0         | <- Control (always 0)
| 502  | 2024-03-01 | 236.78  | 0         | <- Control (always 0)

Multi-Cell Dataset

For analyzing multiple treatment units simultaneously:

  1. Data Structure:

    • Multiple units can be designated as treatment

    • Treatment start date typically the same across all units

    • Treatment units can have different start dates if specified

  2. Treatment Assignment:

    • Same principle: 'treatment' = 0 pre-intervention, 'treatment' = 1 post-intervention

    • Each treatment unit's 'treatment' column changes from 0 to 1 at its intervention date
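For staggered starts, one way to construct the indicator is a per-unit mapping of intervention dates. The mapping below is illustrative and not tied to any specific package interface:

```python
import pandas as pd

df = pd.DataFrame({
    'unit': [501, 501, 502, 502, 503, 503],
    'time': pd.to_datetime(['2024-01-01', '2024-03-01'] * 3),
    'outcome': [1.0] * 6,
})

# Hypothetical per-unit intervention dates; unit 503 stays a control
starts = {501: pd.Timestamp('2024-03-01'), 502: pd.Timestamp('2024-02-01')}

df['treatment'] = [
    int(u in starts and t >= starts[u])
    for u, t in zip(df['unit'], df['time'])
]
print(df['treatment'].tolist())  # [0, 1, 0, 1, 0, 0]
```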

Critical Technical Pitfalls

  1. Misaligned Treatment Indicator:

    • COMMON ERROR: Setting treatment=1 for all rows of treatment unit

    • CORRECT: Only post-intervention periods of treatment units should have treatment=1

  2. Matrix Dimension Issues:

    • Pre/post period counts must be consistent across all units

    • Matrix operations require specific dimensional alignment

    • Error: "Input operand has a mismatch in its core dimension" indicates this problem

  3. Type Inconsistencies:

    • Treatment unit IDs must match the type in your dataframe

    • Mixing string and numeric IDs without proper conversion causes errors

    • Command-line treatment unit specifications must match dataframe type

  4. Date Format Inconsistencies:

    • Inconsistent formats between intervention date and dataframe dates

    • Strings vs. datetime objects causing comparison errors

    • Different regional formats (MM/DD/YYYY vs. DD/MM/YYYY)
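The date pitfalls above can be neutralized by parsing every date column with an explicit format string, so regional ambiguity never reaches the comparison logic. A sketch:

```python
import pandas as pd

# Parse with an explicit format so DD/MM vs. MM/DD ambiguity cannot occur
times = pd.to_datetime(['01/02/2024', '03/02/2024'], format='%d/%m/%Y')
intervention_date = pd.to_datetime('2024-02-01')  # ISO strings are unambiguous

# Comparisons now work: both sides are datetime64, not raw strings
print((times >= intervention_date).tolist())  # [True, True]
```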

Best Practices for Data Preparation

  1. Preprocessing Steps:

    • Ensure balanced panel (all units have all time periods)

    • Remove outliers that could distort synthetic control calculation

    • Normalize/standardize outcome if using units with different scales

    • Consider seasonality adjustment if strong cyclic patterns exist

  2. Validation Checks:

    • Verify treatment assignment pattern (0→1 only at intervention point)

    • Confirm pre/post period counts meet minimum recommendations

    • Check for parallel trends in pre-intervention period

    • Validate unit type consistency between data and command parameters

  3. Performance Optimization:

    • Limit unnecessary columns to reduce memory usage

    • Pre-sort data by unit and time for faster processing

    • Consider data aggregation for very high-frequency data
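One way to detect (or enforce) a balanced panel is to reindex on the full unit × time grid, so missing cells surface explicitly as NaN. A sketch with a deliberately incomplete panel:

```python
import pandas as pd

df = pd.DataFrame({
    'unit': [501, 501, 502],   # unit 502 is missing period 2
    'time': [1, 2, 1],
    'outcome': [1.0, 2.0, 3.0],
})

# Build the full unit x time grid and reindex onto it
full_index = pd.MultiIndex.from_product(
    [df['unit'].unique(), df['time'].unique()], names=['unit', 'time'])
balanced = df.set_index(['unit', 'time']).reindex(full_index).reset_index()

missing = balanced[balanced['outcome'].isna()]
print(missing[['unit', 'time']].values.tolist())  # [[502, 2]]
```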

Data Preparation Template Code

import pandas as pd
import numpy as np

# 1. Load and standardize column names
df = pd.read_csv('raw_data.csv')
df = df.rename(columns={
    'dma': 'unit',              # Geographic identifier 
    'date': 'time',             # Time period
    'sales': 'outcome',         # Measured metric
})

# 2. Handle data types - CRITICAL FOR MATRIX COMPATIBILITY
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values(['unit', 'time'])  # Sort by unit and time

# Check for non-numeric outcome
if not pd.api.types.is_numeric_dtype(df['outcome']):
    raise ValueError("'outcome' column must be numeric for matrix operations")

# 3. Define intervention parameters
treatment_units = [501]  # Must match type in dataframe
intervention_date = pd.to_datetime('2024-03-01')

# 4. Create treatment indicator (critical for matrices)
df['treatment'] = 0  # Initialize all as untreated
mask = (df['unit'].isin(treatment_units)) & (df['time'] >= intervention_date)
df.loc[mask, 'treatment'] = 1

# 5. Verify balanced panel
expected_periods = df['time'].nunique()
unit_periods = df.groupby('unit').size()
imbalanced_units = unit_periods[unit_periods != expected_periods]
if not imbalanced_units.empty:
    raise ValueError(f"Imbalanced panel will cause matrix errors. Units missing periods: {imbalanced_units.index.tolist()}")

# 6. Verify treatment pattern
pre_treatment = df[(df['unit'].isin(treatment_units)) & (df['time'] < intervention_date)]
if pre_treatment['treatment'].sum() > 0:
    raise ValueError("Invalid treatment pattern: Pre-intervention periods incorrectly marked as treatment=1")

post_treatment = df[(df['unit'].isin(treatment_units)) & (df['time'] >= intervention_date)]
if post_treatment['treatment'].sum() != len(post_treatment):
    raise ValueError("Invalid treatment pattern: Post-intervention treatment periods not all marked as 1")

# 7. Verify sufficient data for statistical validity
n_pre = df[df['time'] < intervention_date]['time'].nunique()
n_post = df[df['time'] >= intervention_date]['time'].nunique()
n_control = df[~df['unit'].isin(treatment_units)]['unit'].nunique()

if n_pre < 20:
    print(f"WARNING: Only {n_pre} pre-periods. Absolute minimum is 20; 30+ recommended.")
if n_post < 7:
    print(f"WARNING: Only {n_post} post-periods. Absolute minimum is 7; 14+ recommended.")
if n_control < 5:
    print(f"WARNING: Only {n_control} control units. Absolute minimum is 5; 10+ recommended.")

# 8. Save processed data
df.to_csv('geolift_ready_data.csv', index=False)
print(f"Data preparation complete: {len(df)} rows across {df['unit'].nunique()} units and {df['time'].nunique()} time periods.")

Advanced Data Setup Considerations

Handling Structural Constraints

The GeoLift-SDID implementation includes failsafe mechanisms for handling structural matrix constraints. When pre-intervention and post-intervention periods have different counts (e.g., 60 pre vs. 30 post), direct matrix multiplication fails. The implementation automatically:

  1. Detects dimension mismatches

  2. Falls back to direct statistical estimation

  3. Uses bootstrap resampling for standard error calculation (n=500 by default)

  4. Logs warnings when fallback mechanisms are activated

These failsafes prevent execution errors but may reduce statistical precision; matched pre- and post-period dimensions remain strongly preferred for optimal results.
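The bootstrap idea behind the fallback can be illustrated as resampling with replacement and recomputing the mean effect each draw. This is a simplified sketch of the technique, not the package's internal code; the effect values are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
effects = rng.normal(loc=2.0, scale=0.5, size=30)  # synthetic per-period effect estimates

# Bootstrap: resample with replacement, recompute the mean effect each draw
n_boot = 500  # matches the documented default
boot_means = np.array([
    rng.choice(effects, size=len(effects), replace=True).mean()
    for _ in range(n_boot)
])
se = boot_means.std(ddof=1)  # bootstrap standard error of the mean effect
print(float(se))  # approximately sigma / sqrt(n)
```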

Absolute Minimum Requirements vs. Optimal Parameters

Below are the absolute minimum and optimal parameters for valid analysis:

| Parameter                    | Absolute Minimum | Recommended Minimum | Optimal |
|------------------------------|------------------|---------------------|---------|
| Pre-treatment periods        | 20               | 30                  | 60+     |
| Post-treatment periods       | 7                | 14                  | 30+     |
| Control units (single-cell)  | 5                | 10                  | 20+     |
| Control units (multi-cell)   | 10               | 20                  | 30+     |
| Treatment units (multi-cell) | 2                | 3                   | 5+      |

Additional critical requirements:

  1. Balanced Panel: Every unit must have data for every time period

  2. Type Consistency: Unit identifiers must maintain consistent type throughout

  3. Treatment Assignment: Perfect binary pattern (0→1 at intervention only for treatment units)

  4. Data Quality: No missing values or extreme outliers in outcome metric