# GeoLift-SDID Dataset Setup Guide

## Overview

This guide specifies the technical requirements for preparing datasets compatible with the GeoLift-SDID package. Improperly formatted datasets cause cryptic matrix calculation errors that are difficult to diagnose. Following these specifications avoids the most common implementation failures.

## Core Dataset Requirements

### Required Columns

Every GeoLift-SDID dataset must contain these fundamental columns:

- **unit**: Unique identifier for each geographic unit (e.g., DMA code, region ID, store ID)
- **time**: Time period for each observation, typically as datetime or numeric timestamp
- **outcome**: The metric being measured (e.g., sales, conversion rate, website visits)
- **treatment**: Binary indicator (0/1) marking treatment periods for treatment units

### Data Structure

The dataset must be in **long format** (panel data), where:

- Each row represents a unique combination of unit and time period
- All units must have observations for all time periods (balanced panel)
- Pre-treatment and post-treatment periods must be clearly demarcated

Example structure:

```
| unit | time       | outcome | treatment |
|------|------------|---------|-----------|
| 501  | 2024-01-01 | 123.45  | 0         |
| 501  | 2024-03-01 | 124.56  | 1         |
| 502  | 2024-01-01 | 234.56  | 0         |
| 502  | 2024-03-01 | 235.67  | 0         |
```

### Data Types and Formatting

1. **unit column**:
   - Can be numeric (integer/float) or string
   - Must be consistent throughout the dataset
   - Will be automatically converted to appropriate type for analysis
   - If treatment units are specified via CLI or config, they must match this type

2. **time column**:
   - Can be datetime string (YYYY-MM-DD) or numeric timestamp
   - Will be automatically converted to numeric format for matrix operations
   - Must be chronologically sorted in ascending order
   - Must be consistent across all units

3. **outcome column**:
   - Must be numeric (float/integer)
   - Non-numeric values will cause matrix calculation errors
   - Should not contain NaN/None/NULL values

4. **treatment column**:
   - Must be binary (0/1)
   - 0 = no treatment, 1 = treatment
   - Only treatment units in post-intervention period should have value 1

### Technical Structural Requirements

1. **Dimension Requirements** (absolute minimums are listed in the table under "Absolute Minimum Requirements vs. Optimal Parameters"):
   - Pre-intervention periods: At least 30 recommended for robust synthetic control (60+ preferred)
   - Post-intervention periods: At least 14 recommended for statistical validity (30+ preferred)
   - Control units: At least 10 recommended for donor pool diversity (30+ preferred for multi-cell)

2. **Matrix Structure Considerations**:
   - The implementation handles matrices of dimensions:
     - Y_pre: (n_control, n_pre) - Control units × pre-treatment periods
     - Y_post: (n_control, n_post) - Control units × post-treatment periods
     - Y1_pre: (1, n_pre) - Treatment unit × pre-treatment periods
     - Y1_post: (1, n_post) - Treatment unit × post-treatment periods

3. **Handling Dimension Mismatches**:
   - If n_pre ≠ n_post, the implementation includes fallback mechanisms
   - Bootstrap resampling is used for standard error calculation when matrix dimensions are incompatible

## Single-Cell vs. Multi-Cell Dataset Structure

### Single-Cell Dataset

For analyzing one treatment unit against multiple controls:

1. **Data Structure**:
   - One unit designated as treatment
   - All other units as controls
   - Clear delineation of pre/post intervention periods

2. **Treatment Assignment**:
   - 'treatment' = 0 for all units during pre-intervention period
   - 'treatment' = 1 ONLY for treatment unit during post-intervention period
   - Control units always have 'treatment' = 0

Example:

```
| unit | time       | outcome | treatment |
|------|------------|---------|-----------|
| 501  | 2024-01-01 | 123.45  | 0         | <- Pre-intervention
| 501  | 2024-03-01 | 125.67  | 1         | <- Post-intervention starts
| 502  | 2024-01-01 | 234.56  | 0         | <- Control (always 0)
| 502  | 2024-03-01 | 236.78  | 0         | <- Control (always 0)
```

### Multi-Cell Dataset

For analyzing multiple treatment units simultaneously:

1. **Data Structure**:
   - Multiple units can be designated as treatment
   - Treatment start date is typically the same across all units
   - Treatment units can have different start dates if specified

2. **Treatment Assignment**:
   - Same principle: 'treatment' = 0 pre-intervention, 'treatment' = 1 post-intervention
   - Each treatment unit's 'treatment' column changes from 0 to 1 at its intervention date

## Critical Technical Pitfalls

1. **Misaligned Treatment Indicator**:
   - COMMON ERROR: Setting treatment=1 for all rows of treatment unit
   - CORRECT: Only post-intervention periods of treatment units should have treatment=1

2. **Matrix Dimension Issues**:
   - Pre/post period counts must be consistent across all units
   - Matrix operations require specific dimensional alignment
   - Error: "Input operand has a mismatch in its core dimension" indicates this problem

3. **Type Inconsistencies**:
   - Treatment unit IDs must match the type in your dataframe
   - Mixing string and numeric IDs without proper conversion causes errors
   - Command-line treatment unit specifications must match dataframe type

4. **Date Format Inconsistencies**:
   - The intervention date and the dataframe dates use inconsistent formats
   - Strings are compared against datetime objects, causing comparison errors
   - Regional formats differ (MM/DD/YYYY vs. DD/MM/YYYY)

## Best Practices for Data Preparation

1. **Preprocessing Steps**:
   - Ensure balanced panel (all units have all time periods)
   - Remove outliers that could distort synthetic control calculation
   - Normalize/standardize outcome if using units with different scales
   - Consider seasonality adjustment if strong cyclic patterns exist

2. **Validation Checks**:
   - Verify treatment assignment pattern (0→1 only at intervention point)
   - Confirm pre/post period counts meet minimum recommendations
   - Check for parallel trends in pre-intervention period
   - Validate unit type consistency between data and command parameters

3. **Performance Optimization**:
   - Drop unnecessary columns to reduce memory usage
   - Pre-sort data by unit and time for faster processing
   - Consider data aggregation for very high-frequency data

## Data Preparation Template Code

```python
import pandas as pd

# 1. Load and standardize column names
df = pd.read_csv('raw_data.csv')
df = df.rename(columns={
    'dma': 'unit',       # Geographic identifier
    'date': 'time',      # Time period
    'sales': 'outcome',  # Measured metric
})

# 2. Handle data types - CRITICAL FOR MATRIX COMPATIBILITY
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values(['unit', 'time']).reset_index(drop=True)  # Sort by unit and time

# Check for non-numeric or missing outcome values
if not pd.api.types.is_numeric_dtype(df['outcome']):
    raise ValueError("'outcome' column must be numeric for matrix operations")
if df['outcome'].isna().any():
    raise ValueError("'outcome' column must not contain missing values")

# 3. Define intervention parameters
treatment_units = [501]  # Must match type in dataframe
intervention_date = pd.to_datetime('2024-03-01')

# 4. Create treatment indicator (critical for matrices)
df['treatment'] = 0  # Initialize all as untreated
mask = df['unit'].isin(treatment_units) & (df['time'] >= intervention_date)
df.loc[mask, 'treatment'] = 1

# 5. Verify balanced panel (exactly one row per unit and time period)
if df.duplicated(['unit', 'time']).any():
    raise ValueError("Duplicate (unit, time) rows found; each pair must appear once")
expected_periods = df['time'].nunique()
unit_periods = df.groupby('unit')['time'].nunique()
imbalanced_units = unit_periods[unit_periods != expected_periods]
if not imbalanced_units.empty:
    raise ValueError(
        "Imbalanced panel will cause matrix errors. "
        f"Units missing periods: {imbalanced_units.index.tolist()}"
    )

# 6. Verify treatment pattern
pre_treatment = df[df['unit'].isin(treatment_units) & (df['time'] < intervention_date)]
if pre_treatment['treatment'].sum() > 0:
    raise ValueError("Invalid treatment pattern: Pre-intervention periods incorrectly marked as treatment=1")

post_treatment = df[df['unit'].isin(treatment_units) & (df['time'] >= intervention_date)]
if post_treatment['treatment'].sum() != len(post_treatment):
    raise ValueError("Invalid treatment pattern: Post-intervention treatment periods not all marked as 1")

# 7. Verify sufficient data for statistical validity
n_pre = df[df['time'] < intervention_date]['time'].nunique()
n_post = df[df['time'] >= intervention_date]['time'].nunique()
n_control = df[~df['unit'].isin(treatment_units)]['unit'].nunique()

if n_pre < 20:
    print(f"WARNING: Only {n_pre} pre-periods. Absolute minimum is 20; 30+ recommended.")
if n_post < 7:
    print(f"WARNING: Only {n_post} post-periods. Absolute minimum is 7; 14+ recommended.")
if n_control < 5:
    print(f"WARNING: Only {n_control} control units. Absolute minimum is 5; 10+ recommended.")

# 8. Save processed data
df.to_csv('geolift_ready_data.csv', index=False)
print(f"Data preparation complete: {len(df)} rows across {df['unit'].nunique()} units and {df['time'].nunique()} time periods.")
```

## Advanced Data Setup Considerations

### Handling Structural Constraints

The GeoLift-SDID implementation includes failsafe mechanisms for handling structural matrix constraints. When pre-intervention and post-intervention periods have different counts (e.g., 60 pre vs. 30 post), direct matrix multiplication fails. The implementation automatically:

1. Detects dimension mismatches
2. Falls back to direct statistical estimation
3. Uses bootstrap resampling for standard error calculation (n=500 by default)
4. Logs warnings when fallback mechanisms are activated

These failsafes prevent execution errors but may reduce statistical precision. Balanced panel data remains strongly preferred for optimal results.

### Absolute Minimum Requirements vs. Optimal Parameters

Below are the absolute minimum, recommended minimum, and optimal parameters for valid analysis:

| Parameter | Absolute Minimum | Recommended Minimum | Optimal |
|-----------|------------------|---------------------|---------|
| Pre-treatment periods | 20 | 30 | 60+ |
| Post-treatment periods | 7 | 14 | 30+ |
| Control units (single-cell) | 5 | 10 | 20+ |
| Control units (multi-cell) | 10 | 20 | 30+ |
| Treatment units (multi-cell) | 2 | 3 | 5+ |

Additional critical requirements:

1. **Balanced Panel**: Every unit must have data for every time period
2. **Type Consistency**: Unit identifiers must maintain consistent type throughout
3. **Treatment Assignment**: Perfect binary pattern (0→1 at intervention only for treatment units)
4. **Data Quality**: No missing values or extreme outliers in outcome metric
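
The thresholds in the table and the critical requirements listed above can be checked programmatically before running an analysis. The following is a minimal sketch, assuming pandas and the column names used in this guide. The names `THRESHOLDS`, `tier`, and `check_panel` are hypothetical helpers for illustration and are not part of the GeoLift-SDID package. The sketch assumes a single intervention date shared by all treatment units, and it checks for missing outcome values but not for outliers.

```python
import pandas as pd

# Thresholds copied from the table above:
# (absolute minimum, recommended minimum, optimal).
THRESHOLDS = {
    "pre_periods": (20, 30, 60),
    "post_periods": (7, 14, 30),
    "controls_single": (5, 10, 20),
    "controls_multi": (10, 20, 30),
    "treated_multi": (2, 3, 5),
}


def tier(value, limits):
    """Classify a count against (absolute, recommended, optimal) limits."""
    absolute, recommended, optimal = limits
    if value < absolute:
        return "below minimum"
    if value < recommended:
        return "absolute minimum"
    if value < optimal:
        return "recommended"
    return "optimal"


def check_panel(df, treatment_units, intervention_date):
    """Return (tiers, problems) for a long-format panel.

    Hypothetical helper: assumes one intervention date for all treated units.
    """
    date = pd.to_datetime(intervention_date)
    time = pd.to_datetime(df["time"])
    treated = df["unit"].isin(treatment_units)
    post = time >= date
    multi = len(treatment_units) > 1

    control_key = "controls_multi" if multi else "controls_single"
    tiers = {
        "pre_periods": tier(time[~post].nunique(), THRESHOLDS["pre_periods"]),
        "post_periods": tier(time[post].nunique(), THRESHOLDS["post_periods"]),
        "controls": tier(df.loc[~treated, "unit"].nunique(), THRESHOLDS[control_key]),
    }
    if multi:
        n_treated = df.loc[treated, "unit"].nunique()
        tiers["treated_units"] = tier(n_treated, THRESHOLDS["treated_multi"])

    problems = []
    # 1. Balanced panel: every unit has exactly one row per time period.
    per_unit = df.assign(time=time).groupby("unit")["time"].nunique()
    if df.duplicated(["unit", "time"]).any() or (per_unit != time.nunique()).any():
        problems.append("balanced panel")
    # 2. Type consistency: all unit identifiers share one Python type.
    if len({type(u) for u in df["unit"]}) > 1:
        problems.append("type consistency")
    # 3. Treatment assignment: 1 exactly for treated units in post periods.
    expected = (treated & post).astype(int)
    if not (df["treatment"] == expected).all():
        problems.append("treatment assignment")
    # 4. Data quality: no missing outcome values (outliers are not checked).
    if df["outcome"].isna().any():
        problems.append("data quality")
    return tiers, problems
```

Calling `check_panel(df, [501], '2024-03-01')` returns a dictionary that maps each parameter to one of `"below minimum"`, `"absolute minimum"`, `"recommended"`, or `"optimal"`, plus a list naming any critical requirement that failed. Returning the findings rather than raising lets the caller decide which ones should block an analysis.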