# GeoLift-SDID Dataset Setup Guide

## Overview

This guide specifies how to prepare datasets for the GeoLift-SDID package. Improperly formatted data tends to surface as cryptic matrix calculation errors that are difficult to diagnose. Following these specifications prevents the most common implementation failures.

## Core Dataset Requirements

### Required Columns

Every GeoLift-SDID dataset must contain these fundamental columns:

- **unit**: Unique identifier for each geographic unit (e.g., DMA code, region ID, store ID)
- **time**: Time period for each observation, typically as datetime or numeric timestamp
- **outcome**: The metric being measured (e.g., sales, conversion rate, website visits)
- **treatment**: Binary indicator (0/1) that is 1 only for treatment units during post-intervention periods and 0 everywhere else

### Data Structure

The dataset must be in **long format** (panel data), where:
- Each row represents a unique combination of unit and time period
- All units must have observations for all time periods (balanced panel)
- Pre-treatment and post-treatment periods must be clearly demarcated

Example structure:
```
| unit | time       | outcome | treatment |
|------|------------|---------|-----------|
| 501  | 2024-01-01 | 123.45  | 0         |
| 501  | 2024-03-01 | 124.56  | 1         |
| 502  | 2024-01-01 | 234.56  | 0         |
| 502  | 2024-03-01 | 235.67  | 0         |
```
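
Source data often arrives in wide format, with one column per date. As a minimal sketch, assuming a hypothetical wide table with a `unit` column and one column per `YYYY-MM-DD` date, pandas `melt` produces the long layout shown above. The `treatment` column still has to be added afterwards.

```python
import pandas as pd

# Hypothetical wide-format input: one row per unit, one column per date.
wide = pd.DataFrame({
    "unit": [501, 502],
    "2024-01-01": [123.45, 234.56],
    "2024-03-01": [124.56, 235.67],
})

# melt() stacks the date columns into (time, outcome) pairs, one row per unit-period.
long_df = wide.melt(id_vars="unit", var_name="time", value_name="outcome")

# Parse the former column labels as dates and sort by unit, then time.
long_df["time"] = pd.to_datetime(long_df["time"], format="%Y-%m-%d")
long_df = long_df.sort_values(["unit", "time"]).reset_index(drop=True)
```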

### Data Types and Formatting

1. **unit column**:
   - Can be numeric (integer/float) or string
   - Must be consistent throughout the dataset
   - Will be automatically converted to appropriate type for analysis
   - If treatment units are specified via CLI or config, they must match this type

2. **time column**:
   - Can be datetime string (YYYY-MM-DD) or numeric timestamp
   - Will be automatically converted to numeric format for matrix operations
   - Must be chronologically sorted in ascending order
   - Must be consistent across all units

3. **outcome column**:
   - Must be numeric (float/integer)
   - Non-numeric values will cause matrix calculation errors
   - Should not contain NaN/None/NULL values

4. **treatment column**:
   - Must be binary (0/1)
   - 0 = no treatment, 1 = treatment
   - Only treatment units in post-intervention period should have value 1
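
The column rules above can be checked before running any analysis. The helper below is a sketch only: `find_schema_problems` is a hypothetical name, not part of GeoLift-SDID, and it covers just the rules listed in this section.

```python
import pandas as pd

REQUIRED_COLUMNS = ["unit", "time", "outcome", "treatment"]

def find_schema_problems(df: pd.DataFrame) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    if not pd.api.types.is_numeric_dtype(df["outcome"]):
        problems.append("outcome is not numeric")
    elif df["outcome"].isna().any():
        problems.append("outcome contains missing values")
    if not set(df["treatment"].dropna().unique()) <= {0, 1}:
        problems.append("treatment has values other than 0/1")
    # An object column can silently mix int and str identifiers.
    if df["unit"].map(type).nunique() > 1:
        problems.append("unit mixes Python types")
    return problems
```

Calling `find_schema_problems(df)` right after loading the data gives a readable list of issues instead of a downstream matrix error.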

### Technical Structural Requirements

1. **Dimension Requirements**:
   - Pre-intervention periods: At least 30 recommended for robust synthetic control (60+ preferred; 20 is the absolute minimum)
   - Post-intervention periods: At least 14 recommended for statistical validity (30+ preferred; 7 is the absolute minimum)
   - Control units: At least 10 recommended for donor pool diversity in single-cell analyses (20+ preferred); at least 20 recommended for multi-cell (30+ preferred)

2. **Matrix Structure Considerations**:
   - The implementation handles matrices of dimensions:
     - Y_pre: (n_control, n_pre) - Control units × pre-treatment periods
     - Y_post: (n_control, n_post) - Control units × post-treatment periods
     - Y1_pre: (1, n_pre) - Treatment unit × pre-treatment periods
     - Y1_post: (1, n_post) - Treatment unit × post-treatment periods

3. **Handling Dimension Mismatches**:
   - If n_pre ≠ n_post, the implementation includes fallback mechanisms
   - Bootstrap resampling is used for standard error calculation when matrix dimensions are incompatible
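
To make the four shapes concrete, the sketch below builds them from a long-format frame with a pandas pivot. The function name `build_matrices` and the toy data are illustrative assumptions; the package's internal construction may differ.

```python
import pandas as pd

def build_matrices(df, treated_unit, intervention_date):
    """Split a balanced long-format panel into the four outcome matrices."""
    # Rows are units, columns are time periods; duplicate unit/time rows raise here.
    panel = df.pivot(index="unit", columns="time", values="outcome")
    if panel.isna().any().any():
        raise ValueError("panel is unbalanced: some unit-period cells are missing")

    pre_cols = panel.columns[panel.columns < intervention_date]
    post_cols = panel.columns[panel.columns >= intervention_date]
    controls = panel.drop(index=treated_unit)
    treated = panel.loc[[treated_unit]]  # double brackets keep a 2-D shape

    return {
        "Y_pre": controls.loc[:, pre_cols].to_numpy(),    # (n_control, n_pre)
        "Y_post": controls.loc[:, post_cols].to_numpy(),  # (n_control, n_post)
        "Y1_pre": treated.loc[:, pre_cols].to_numpy(),    # (1, n_pre)
        "Y1_post": treated.loc[:, post_cols].to_numpy(),  # (1, n_post)
    }

# Toy panel: 3 units, 4 monthly periods, intervention in the last period.
dates = pd.date_range("2024-01-01", periods=4, freq="MS")
df = pd.DataFrame(
    [(u, t, float(u) + i) for u in (501, 502, 503) for i, t in enumerate(dates)],
    columns=["unit", "time", "outcome"],
)
m = build_matrices(df, treated_unit=501, intervention_date=pd.Timestamp("2024-04-01"))
```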

## Single-Cell vs. Multi-Cell Dataset Structure

### Single-Cell Dataset

For analyzing one treatment unit against multiple controls:

1. **Data Structure**:
   - One unit designated as treatment
   - All other units as controls
   - Clear delineation of pre/post intervention periods

2. **Treatment Assignment**:
   - 'treatment' = 0 for all units during pre-intervention period
   - 'treatment' = 1 ONLY for treatment unit during post-intervention period
   - Control units always have 'treatment' = 0

Example:
```
| unit | time       | outcome | treatment |
|------|------------|---------|-----------|
| 501  | 2024-01-01 | 123.45  | 0         | <- Pre-intervention
| 501  | 2024-03-01 | 125.67  | 1         | <- Post-intervention starts
| 502  | 2024-01-01 | 234.56  | 0         | <- Control (always 0)
| 502  | 2024-03-01 | 236.78  | 0         | <- Control (always 0)
```

### Multi-Cell Dataset

For analyzing multiple treatment units simultaneously:

1. **Data Structure**:
   - Multiple units can be designated as treatment
   - Treatment start date typically the same across all units
   - Treatment units can have different start dates if specified

2. **Treatment Assignment**:
   - Same principle: 'treatment' = 0 pre-intervention, 'treatment' = 1 post-intervention
   - Each treatment unit's 'treatment' column changes from 0 to 1 at its intervention date
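
When treatment units start on different dates, the indicator can be derived from a per-unit start date. This is a sketch under assumed inputs: the `start_dates` mapping and the toy panel are hypothetical, and the unit IDs must match the dtype of the `unit` column.

```python
import pandas as pd

# Hypothetical mapping of treatment unit -> its own intervention date.
start_dates = {
    501: pd.Timestamp("2024-03-01"),
    503: pd.Timestamp("2024-04-01"),
}

# Toy panel: 3 units observed over 3 monthly periods.
dates = pd.date_range("2024-02-01", periods=3, freq="MS")
df = pd.DataFrame(
    [(u, t) for u in (501, 502, 503) for t in dates],
    columns=["unit", "time"],
)

# Units absent from the mapping get NaT, and comparing against NaT is False,
# so control units stay at 0 for every period.
unit_start = df["unit"].map(start_dates)
df["treatment"] = (df["time"] >= unit_start).astype(int)
```

With a single shared start date, the same code works with every treatment unit mapped to that date.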

## Critical Technical Pitfalls

1. **Misaligned Treatment Indicator**:
   - COMMON ERROR: Setting treatment=1 for all rows of treatment unit
   - CORRECT: Only post-intervention periods of treatment units should have treatment=1

2. **Matrix Dimension Issues**:
   - Pre/post period counts must be consistent across all units
   - Matrix operations require specific dimensional alignment
   - Error: a NumPy message such as "matmul: Input operand 1 has a mismatch in its core dimension 0" indicates this problem

3. **Type Inconsistencies**:
   - Treatment unit IDs must match the type in your dataframe
   - Mixing string and numeric IDs without proper conversion causes errors
   - Command-line treatment unit specifications must match dataframe type

4. **Date Format Inconsistencies**:
   - Inconsistent formats between intervention date and dataframe dates
   - Strings vs. datetime objects causing comparison errors
   - Different regional formats (MM/DD/YYYY vs. DD/MM/YYYY)
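
Pitfalls 3 and 4 can both be avoided by normalizing inputs before comparing them against the dataframe. The sketch below assumes treatment units arrive as strings (for example from a command line); `coerce_unit_ids` is a hypothetical helper, not a GeoLift-SDID function.

```python
import pandas as pd

def coerce_unit_ids(raw_ids, unit_column):
    """Cast raw IDs to the dtype of the unit column and confirm they exist."""
    if pd.api.types.is_integer_dtype(unit_column):
        ids = [int(x) for x in raw_ids]
    elif pd.api.types.is_float_dtype(unit_column):
        ids = [float(x) for x in raw_ids]
    else:
        ids = [str(x) for x in raw_ids]
    unknown = sorted(set(ids) - set(unit_column.unique()))
    if unknown:
        raise ValueError(f"treatment units not found in data: {unknown}")
    return ids

units = pd.Series([501, 502, 503], name="unit")
treatment_units = coerce_unit_ids(["501"], units)  # string "501" becomes int 501

# The same string means different days under different regional conventions,
# so pass an explicit format instead of relying on inference.
us_style = pd.to_datetime("03/01/2024", format="%m/%d/%Y")  # 1 March 2024
eu_style = pd.to_datetime("03/01/2024", format="%d/%m/%Y")  # 3 January 2024
```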

## Best Practices for Data Preparation

1. **Preprocessing Steps**:
   - Ensure balanced panel (all units have all time periods)
   - Remove outliers that could distort synthetic control calculation
   - Normalize/standardize outcome if using units with different scales
   - Consider seasonality adjustment if strong cyclic patterns exist

2. **Validation Checks**:
   - Verify treatment assignment pattern (0→1 only at intervention point)
   - Confirm pre/post period counts meet minimum recommendations
   - Check for parallel trends in pre-intervention period
   - Validate unit type consistency between data and command parameters

3. **Performance Optimization**:
   - Limit unnecessary columns to reduce memory usage
   - Pre-sort data by unit and time for faster processing
   - Consider data aggregation for very high-frequency data
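
One rough way to approach the pre-intervention trend check is to correlate the treatment unit's pre-period series with the average of the controls. This is an informal diagnostic under assumed column names, not a formal parallel-trends test and not something the package is known to compute; what counts as "high enough" is a judgment call.

```python
import pandas as pd

def pre_period_correlation(df, treated_unit, intervention_date):
    """Pearson correlation between the treated unit and the control mean, pre-period only."""
    pre = df[df["time"] < intervention_date]
    treated = pre[pre["unit"] == treated_unit].set_index("time")["outcome"]
    control_mean = pre[pre["unit"] != treated_unit].groupby("time")["outcome"].mean()
    # Series.corr aligns the two series on their shared time index.
    return treated.corr(control_mean)

# Toy data: the treated unit tracks the control average before the last period.
dates = pd.date_range("2024-01-01", periods=5, freq="MS")
values = {
    501: [5, 7, 6, 10, 30],
    502: [10, 12, 11, 15, 16],
    503: [20, 22, 21, 25, 26],
}
df = pd.DataFrame(
    [(u, t, float(v)) for u, vs in values.items() for t, v in zip(dates, vs)],
    columns=["unit", "time", "outcome"],
)
r = pre_period_correlation(df, treated_unit=501, intervention_date=dates[-1])
```

A value near 1 suggests the treated unit and the donor pool moved together before the intervention; a low value is a prompt to inspect the series visually.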

## Data Preparation Template Code

```python
import pandas as pd

# 1. Load and standardize column names
df = pd.read_csv('raw_data.csv')
df = df.rename(columns={
    'dma': 'unit',              # Geographic identifier 
    'date': 'time',             # Time period
    'sales': 'outcome',         # Measured metric
})

# 2. Handle data types - CRITICAL FOR MATRIX COMPATIBILITY
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values(['unit', 'time'])  # Sort by unit and time

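# Check for non-numeric or missing outcome values
if not pd.api.types.is_numeric_dtype(df['outcome']):
    raise ValueError("'outcome' column must be numeric for matrix operations")
if df['outcome'].isna().any():
    raise ValueError("'outcome' column contains missing values")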

# 3. Define intervention parameters
treatment_units = [501]  # Must match type in dataframe
intervention_date = pd.to_datetime('2024-03-01')

# 4. Create treatment indicator (critical for matrices)
df['treatment'] = 0  # Initialize all as untreated
mask = (df['unit'].isin(treatment_units)) & (df['time'] >= intervention_date)
df.loc[mask, 'treatment'] = 1

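# 5. Verify balanced panel (no duplicate rows, every unit has every period)
if df.duplicated(['unit', 'time']).any():
    raise ValueError("Duplicate unit/time rows found; each unit-period must appear exactly once")
expected_periods = df['time'].nunique()
unit_periods = df.groupby('unit')['time'].nunique()
imbalanced_units = unit_periods[unit_periods != expected_periods]
if not imbalanced_units.empty:
    raise ValueError(f"Imbalanced panel will cause matrix errors. Units missing periods: {imbalanced_units.index.tolist()}")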

# 6. Verify treatment pattern
pre_treatment = df[(df['unit'].isin(treatment_units)) & (df['time'] < intervention_date)]
if pre_treatment['treatment'].sum() > 0:
    raise ValueError("Invalid treatment pattern: Pre-intervention periods incorrectly marked as treatment=1")

post_treatment = df[(df['unit'].isin(treatment_units)) & (df['time'] >= intervention_date)]
if post_treatment['treatment'].sum() != len(post_treatment):
    raise ValueError("Invalid treatment pattern: Post-intervention treatment periods not all marked as 1")

# 7. Verify sufficient data for statistical validity
n_pre = df[df['time'] < intervention_date]['time'].nunique()
n_post = df[df['time'] >= intervention_date]['time'].nunique()
n_control = df[~df['unit'].isin(treatment_units)]['unit'].nunique()

if n_pre < 20:
    print(f"WARNING: Only {n_pre} pre-periods. Absolute minimum is 20; 30+ recommended.")
if n_post < 7:
    print(f"WARNING: Only {n_post} post-periods. Absolute minimum is 7; 14+ recommended.")
if n_control < 5:
    print(f"WARNING: Only {n_control} control units. Absolute minimum is 5; 10+ recommended.")

# 8. Save processed data
df.to_csv('geolift_ready_data.csv', index=False)
print(f"Data preparation complete: {len(df)} rows across {df['unit'].nunique()} units and {df['time'].nunique()} time periods.")
```

## Advanced Data Setup Considerations

### Handling Structural Constraints

The GeoLift-SDID implementation includes failsafe mechanisms for handling structural matrix constraints. When pre-intervention and post-intervention periods have different counts (e.g., 60 pre vs. 30 post), direct matrix multiplication fails. The implementation automatically:

1. Detects dimension mismatches
2. Falls back to direct statistical estimation
3. Uses bootstrap resampling for standard error calculation (n=500 by default)
4. Logs warnings when fallback mechanisms are activated

These failsafes prevent execution errors but may reduce statistical precision. They address unequal pre/post period counts only; a balanced panel (every unit observed in every period) is still required.
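
To illustrate the general idea of a bootstrap standard error that does not depend on n_pre equalling n_post, the sketch below resamples control units with replacement around a plain difference-in-differences estimate. It is a generic illustration, not the package's fallback code: the estimator, function names, and seed are assumptions, and only `n_boot=500` mirrors the default mentioned above.

```python
import numpy as np

def did_estimate(Y_pre, Y_post, Y1_pre, Y1_post):
    """Simple difference-in-differences on period means (stand-in estimator)."""
    treated_change = Y1_post.mean() - Y1_pre.mean()
    control_change = Y_post.mean() - Y_pre.mean()
    return treated_change - control_change

def bootstrap_se(Y_pre, Y_post, Y1_pre, Y1_post, n_boot=500, seed=0):
    """Standard deviation of the estimate across resampled control pools."""
    rng = np.random.default_rng(seed)
    n_control = Y_pre.shape[0]
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Draw control units (rows) with replacement, keeping pre/post rows paired.
        idx = rng.integers(0, n_control, size=n_control)
        estimates[b] = did_estimate(Y_pre[idx], Y_post[idx], Y1_pre, Y1_post)
    return estimates.std(ddof=1)

# Toy matrices with unequal period counts: 3 pre-periods, 2 post-periods.
Y_pre = np.zeros((4, 3))
Y_post = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
Y1_pre = np.ones((1, 3))
Y1_post = np.full((1, 2), 4.0)

effect = did_estimate(Y_pre, Y_post, Y1_pre, Y1_post)
se = bootstrap_se(Y_pre, Y_post, Y1_pre, Y1_post)
```

Because only period means are used, the pre and post matrices never need to be multiplied together, which is why unequal period counts are harmless in this sketch.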

### Absolute Minimum Requirements vs. Optimal Parameters

Below are the absolute minimum, recommended minimum, and optimal parameters for valid analysis:

| Parameter | Absolute Minimum | Recommended Minimum | Optimal |
|-----------|-----------------|---------------------|--------|
| Pre-treatment periods | 20 | 30 | 60+ |
| Post-treatment periods | 7 | 14 | 30+ |
| Control units (single-cell) | 5 | 10 | 20+ |
| Control units (multi-cell) | 10 | 20 | 30+ |
| Treatment units (multi-cell) | 2 | 3 | 5+ |

Additional critical requirements:

1. **Balanced Panel**: Every unit must have data for every time period
2. **Type Consistency**: Unit identifiers must maintain consistent type throughout
3. **Treatment Assignment**: Perfect binary pattern (0→1 at intervention only for treatment units)
4. **Data Quality**: No missing values or extreme outliers in outcome metric
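
When the balanced-panel requirement fails, it helps to see exactly which unit-period cells are absent. The sketch below, using a hypothetical helper named `missing_cells`, compares the observed rows against the full unit × time grid.

```python
import pandas as pd

def missing_cells(df):
    """Return the (unit, time) combinations that have no row in df."""
    full_grid = pd.MultiIndex.from_product(
        [df["unit"].unique(), df["time"].unique()], names=["unit", "time"]
    )
    present = pd.MultiIndex.from_frame(df[["unit", "time"]])
    return full_grid.difference(present).to_frame(index=False)

# Toy panel in which unit 502 lacks the second period.
df = pd.DataFrame({
    "unit": [501, 501, 502],
    "time": pd.to_datetime(["2024-01-01", "2024-03-01", "2024-01-01"]),
    "outcome": [123.45, 124.56, 234.56],
})
gaps = missing_cells(df)
```

An empty result means every unit has a row for every period that appears anywhere in the data.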