GeoLift-SDID Pipeline Workflow
Preparation Steps
Data Preparation: Ensure data files follow the required format, with ‘unit’, ‘time’, ‘outcome’, and ‘treatment’ columns. See the Dataset Setup Guide for detailed requirements.
Data Validation: Verify your dataset meets all matrix dimension requirements before proceeding.
Configuration Files: Set up appropriate YAML config files for each analysis step (all paths are relative to project root).
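As a rough illustration of the preparation checks above, the sketch below tests a CSV for the four required columns and for a balanced panel (every unit observed in every time period). It is not the project's `recipes/data_validator.py`; the function name `check_panel` and the sample data are hypothetical, and the real validator may apply additional rules.

```python
import csv
import io
from collections import defaultdict

REQUIRED_COLUMNS = {"unit", "time", "outcome", "treatment"}


def check_panel(csv_text):
    """Return a list of problems; an empty list means both checks passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    # Collect the set of time periods observed for each unit.
    periods_by_unit = defaultdict(set)
    for row in reader:
        periods_by_unit[row["unit"]].add(row["time"])

    # A balanced panel has every unit present in every period.
    all_periods = set().union(*periods_by_unit.values())
    problems = []
    for unit, periods in sorted(periods_by_unit.items()):
        if periods != all_periods:
            gap = len(all_periods - periods)
            problems.append(f"unit {unit} is missing {gap} period(s)")
    return problems


# Hypothetical sample: unit 502 has no row for 2024-01-02.
sample = """unit,time,outcome,treatment
501,2024-01-01,10.0,0
501,2024-01-02,11.0,0
502,2024-01-01,9.0,0
"""
print(check_panel(sample))  # ['unit 502 is missing 1 period(s)']
```

Returning a list of problems rather than raising on the first one makes it easier to fix a dataset in a single pass.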
Core Analysis Workflow
0. Data Validation (Critical)
Verify your dataset is properly formatted before analysis.
# Verify data structure and formatting
python recipes/data_validator.py --data data/GeoLift_Singlecell.csv
# Expected outputs:
# - Terminal: "Data validation passed. Dataset meets all requirements for GeoLift-SDID analysis."
# - File: "Validation report saved to: outputs/validation/GeoLift_Singlecell_validation_[TIMESTAMP].json"
# Runtime: ~5 seconds for typical datasets
1. Donor Evaluation
Identifies optimal control markets for synthetic comparison.
# Single-cell donor evaluation
python recipes/donor_evaluator.py --config configs/donor_eval_config_singlecell.yaml
# Multi-cell donor evaluation
python recipes/donor_evaluator.py --config configs/donor_eval_config_multicell.yaml
# Expected outputs:
# - Terminal: "Donor evaluation complete."
# - Files (Single-cell): 3 CSV files in outputs/singlecell_donor_eval/
# * donor_eval_detailed_results.csv: Per-location metrics
# * donor_eval_results.csv: Comprehensive donor rankings
# * donor_eval_summary.csv: Top donors by category
# - Files (Multi-cell): Enhanced outputs in outputs/multicell_donor_eval/
# * All single-cell CSVs plus:
# * donor_recommendations.yaml: Optimized donor pool selections
# * donor_map_[LOCATION].png: Geographic visualizations
# * power_analysis_config_[LOCATION].yaml: Auto-generated power configs
# * summary.md: Markdown report of findings
# Runtime: 1-5 minutes depending on dataset size
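To give a feel for what ranking control markets can involve, here is a minimal sketch that orders candidate donors by how closely their pre-treatment outcomes track the treated market, using Pearson correlation. This is an assumption made for illustration: the criteria actually used by `recipes/donor_evaluator.py` are not described here, and the names `pearson` and `rank_donors` and the sample series are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rank_donors(treated_pre, candidates_pre):
    """Sort candidate donors from most to least correlated with the treated market."""
    scores = {
        unit: pearson(treated_pre, series)
        for unit, series in candidates_pre.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical pre-treatment outcomes for one treated market and two candidates.
treated = [10, 12, 11, 13, 15]
candidates = {
    "A": [20, 24, 22, 26, 30],  # moves in step with the treated market
    "B": [5, 4, 6, 3, 2],       # moves against it
}
for unit, score in rank_donors(treated, candidates):
    print(unit, round(score, 3))
```

Correlation only captures co-movement; a fuller evaluation would likely also weigh level differences and pre-period fit.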
2. Power Analysis
Calculates minimum detectable effect and statistical power across different effect sizes and post-treatment durations.
# Single-cell power analysis
python recipes/power_calculator.py --mode power --config configs/power_analysis_config_singlecell.yaml
# Multi-cell power analysis
python recipes/power_calculator.py --mode power --config configs/power_analysis_config_multicell.yaml
# Option 1: Using auto-generated configs from donor evaluation
python recipes/power_calculator.py --mode power --config outputs/multicell_donor_eval/power_analysis_config_501.yaml
# Option 2: Using custom config
python recipes/power_calculator.py --mode power --config configs/power_analysis_config.yaml
# Expected outputs:
# - Terminal: "Power analysis completed with [N] results."
# - Files: Location-specific outputs in outputs/[TYPE]_power_analysis/
# * power_vs_duration.png: Power curves across different test durations
# * power_vs_effect_size.png: Power curves across different effect sizes
# * power_heatmap.png: 2D heatmap of power across parameter space
# * power_calculation_details.csv: Tabular results for all combinations
# * summary.json: Machine-readable results and metadata
# Runtime: 3-10 seconds for t-test method, 30-60 seconds for SDID approximation, 2-10 minutes for full simulation
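The relationship between effect size, test duration, and power can be sketched with a plain two-sided z-test approximation. This is not the t-test, SDID approximation, or simulation method implemented in `recipes/power_calculator.py`; it assumes independent per-period noise with a known standard deviation, and the function names and input values are hypothetical.

```python
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def approx_power(effect, sd, n_periods, alpha=0.05):
    """Power of a two-sided z-test to detect a mean shift of `effect`
    over `n_periods` periods with per-period noise `sd`."""
    z_crit = _Z.inv_cdf(1 - alpha / 2)
    shift = effect / (sd / n_periods ** 0.5)  # effect in standard-error units
    return _Z.cdf(shift - z_crit) + _Z.cdf(-shift - z_crit)


def approx_mde(sd, n_periods, alpha=0.05, power=0.8):
    """Approximate minimum detectable effect at the requested power
    (ignores the negligible opposite-tail rejection probability)."""
    z_crit = _Z.inv_cdf(1 - alpha / 2)
    return (z_crit + _Z.inv_cdf(power)) * sd / n_periods ** 0.5


# Longer post-treatment windows shrink the detectable effect and raise power.
for n_periods in (7, 14, 28):
    mde = approx_mde(sd=100.0, n_periods=n_periods)
    power = approx_power(effect=50.0, sd=100.0, n_periods=n_periods)
    print(n_periods, round(mde, 1), round(power, 3))
```

With zero effect the function returns the significance level itself, which is a quick sanity check on the formula.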
3. Causal Impact Analysis
Single-Cell Implementation
Analyzes treatment effect for a single market.
# Option 1: Using config file only
python recipes/geolift_single_cell.py --config configs/geolift_analysis_config_singlecell.yaml
# Option 2: Direct parameters without config
python recipes/geolift_single_cell.py --data data/GeoLift_Singlecell.csv --treatment 501 --intervention-date 2024-03-01 --output outputs/singlecell_geolift_analysis
# Option 3: Config with command-line overrides
python recipes/geolift_single_cell.py --config configs/geolift_analysis_config_singlecell.yaml --treatment 501 --output custom_output_dir
# Expected outputs:
# - Terminal: "Single-cell analysis complete. Results saved to: [output path]"
# - Note: May encounter NumPy serialization errors with current implementation
# Runtime: 10-30 seconds depending on dataset size
Multi-Cell Implementation
Analyzes treatment effects across multiple markets simultaneously.
# Option 1: Using config file only
python recipes/geolift_multi_cell.py --config configs/geolift_analysis_config_multicell.yaml
# Option 2: Direct parameters without config - USE SPACES BETWEEN IDs, NOT COMMAS
python recipes/geolift_multi_cell.py --data data/GeoLift_Multicell.csv --treatment-ids 501 502 503 --intervention-date 2024-03-01 --output outputs/multicell_geolift_analysis
# Option 3: With donor recommendations from previous step
python recipes/geolift_multi_cell.py --data data/GeoLift_Multicell.csv --treatment-ids 501 502 503 --donor-recommendations outputs/multicell_donor_eval/donor_recommendations.yaml --intervention-date 2024-03-01 --output outputs/multicell_geolift_analysis
# Expected outputs:
# - Terminal: "Multi-cell analysis complete. Results saved to: [output path]"
# - Note: May encounter NumPy serialization errors with current implementation
# Runtime: 5-15 minutes depending on number of treatment locations
Example Configuration Format (YAML):
# Single-cell analysis configuration (geolift_analysis_config_singlecell.yaml)
data_path: "data/GeoLift_Singlecell.csv"
treatment_unit: 501 # Required: ID of the treatment market
intervention_date: "2024-03-01" # Required: Start date of treatment
end_date: "2024-04-01" # Optional: End date for analysis window
output_dir: "outputs/singlecell_analysis" # Output directory path
shapemap_file: "data/geo_shapes.geojson" # Optional: For map visualizations
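Before launching an analysis, a configuration like the one above can be checked once it has been loaded into a dictionary. The sketch below assumes `data_path`, `treatment_unit`, and `intervention_date` are required (the first on the strength of the `KeyError: 'data_path'` entry in Troubleshooting); the function name `config_problems` is hypothetical and the scripts may validate differently.

```python
from datetime import datetime

# Assumed required keys; only the last two are marked "Required" in the example config.
REQUIRED_KEYS = ("data_path", "treatment_unit", "intervention_date")


def config_problems(config):
    """Return a list of problems found in an already-parsed config dictionary."""
    problems = [
        f"missing required key: {key}"
        for key in REQUIRED_KEYS
        if key not in config
    ]
    # Date fields must parse as year-month-day calendar dates.
    for key in ("intervention_date", "end_date"):
        if key in config:
            try:
                datetime.strptime(str(config[key]), "%Y-%m-%d")
            except ValueError:
                problems.append(
                    f"{key} is not in YYYY-MM-DD format: {config[key]!r}"
                )
    return problems


config = {
    "data_path": "data/GeoLift_Singlecell.csv",
    "treatment_unit": 501,
    "intervention_date": "2024-03-01",
    "end_date": "2024-04-01",
}
print(config_problems(config))  # []
```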
Technical Note: This implementation includes bootstrap standard error calculation when matrix dimension constraints prevent direct matrix calculation.
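For readers unfamiliar with bootstrap standard errors, the following is a generic nonparametric bootstrap of a mean. It is an illustration only: how this implementation resamples (for example, over units or over time blocks) is not stated here, and the function name `bootstrap_se` and the input values are hypothetical.

```python
import random
import statistics


def bootstrap_se(values, n_boot=2000, seed=0):
    """Estimate the standard error of the mean by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = len(values)
    resampled_means = [
        statistics.fmean(rng.choices(values, k=n))
        for _ in range(n_boot)
    ]
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(resampled_means)


# Hypothetical per-period effect estimates.
effects = [float(v) for v in range(1, 21)]
print(round(bootstrap_se(effects), 3))
```

For these values the result should land near the analytic standard error of the mean, which is about 1.29.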
4. AI-Powered Interpretation
Generates business-focused analysis from statistical results.
# Generate interpretation for single-cell analysis (using default deepseek-r1 model)
python recipes/generate_analysis_report.py --outputs outputs/singlecell_geolift_analysis
# Generate interpretation for multi-cell analysis with specified model
python recipes/generate_analysis_report.py --outputs outputs/multicell_geolift_analysis --model deepseek-r1
# Generate interpretation with verbose logging
python recipes/generate_analysis_report.py --outputs outputs/multicell_geolift_analysis --verbose
# Expected output: "Analysis report generated: outputs/*/geolift_interpretation.md"
# Runtime: 10-30 seconds (may vary based on model choice)
Output Structure
Each analysis step produces structured outputs in the respective directory:
outputs/
├── donor_eval_singlecell/
├── donor_eval_multicell/
├── power_analysis_singlecell/
├── power_analysis_multicell/
├── singlecell_analysis/
│ ├── summary.json # Standardized results for AI consumption
│ ├── sdid_main_plot.png # Visualization of actual vs synthetic control
│ ├── results.json # Detailed model results
│ └── geolift_interpretation.md # AI-generated business interpretation
└── multicell_analysis/
    ├── summary.json
    ├── [market]_plot.png # One plot per treatment market
    ├── results.json
    └── geolift_interpretation.md
Parameter Reference
Common Parameters
--data: Path to the input data file (relative to project root, or an absolute path)
--config: Path to the configuration YAML file (relative to project root, or an absolute path)
--output: Directory where analysis results will be saved (created if it does not exist)
--intervention-date: Date when treatment began, in YYYY-MM-DD format (must match the format in the data)
--end-date: Optional end date for the analysis, in YYYY-MM-DD format
Single-Cell Specific
--treatment: ID of single treatment market (must match type in data - numeric or string)
Multi-Cell Specific
--treatment-ids: Space-separated list of treatment market IDs (no quotes required unless IDs contain spaces)
--donor-recommendations: Path to the donor recommendations YAML from the donor evaluation step
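The difference between space-separated and comma-separated IDs is easiest to see in a parser. The sketch below uses `argparse` with `nargs="+"`; whether `geolift_multi_cell.py` is built this way is an assumption, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring two of the multi-cell flags."""
    parser = argparse.ArgumentParser(description="Multi-cell CLI sketch")
    # nargs="+" collects one or more space-separated values into a list.
    parser.add_argument("--treatment-ids", nargs="+", required=True)
    parser.add_argument("--intervention-date", required=True)
    return parser


parser = build_parser()

spaced = parser.parse_args(
    ["--treatment-ids", "501", "502", "503", "--intervention-date", "2024-03-01"]
)
print(spaced.treatment_ids)  # ['501', '502', '503']

# A comma-separated value arrives as one string, i.e. a single bogus ID.
comma = parser.parse_args(
    ["--treatment-ids", "501,502,503", "--intervention-date", "2024-03-01"]
)
print(comma.treatment_ids)  # ['501,502,503']
```

Under this design, a comma-separated argument would be matched against the data as one market ID that does not exist.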
Troubleshooting
| Error Message | Likely Cause | Solution |
|---|---|---|
| “ValueError: matrices are not aligned” | Imbalanced panel data | Ensure all units have the same number of time periods |
| “KeyError: ‘data_path’” | Missing configuration key | Verify the config file contains all required parameters |
| “OSError: no such file” | Incorrect file path | Check that all paths are relative to project root |
| “ModuleNotFoundError” | Missing dependency | Run |
| “Treatment pattern is incorrect” | Invalid treatment column values | Ensure treatment=0 before intervention, treatment=1 after for treatment units only |
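The treatment-pattern rule in the last row can be checked mechanically. The sketch below assumes ISO-formatted dates (so string comparison matches chronological order) and that treatment starts on the intervention date itself; the function name `treatment_pattern_errors` and the sample rows are hypothetical.

```python
def treatment_pattern_errors(rows, treated_units, intervention_date):
    """Return (unit, time, found, expected) for each row that breaks the rule:
    treatment is 1 only for treated units on or after the intervention date."""
    errors = []
    for row in rows:
        in_treated_period = (
            row["unit"] in treated_units and row["time"] >= intervention_date
        )
        expected = 1 if in_treated_period else 0
        found = int(row["treatment"])
        if found != expected:
            errors.append((row["unit"], row["time"], found, expected))
    return errors


# Hypothetical rows: control unit 502 is wrongly flagged as treated.
rows = [
    {"unit": "501", "time": "2024-02-29", "treatment": "0"},
    {"unit": "501", "time": "2024-03-01", "treatment": "1"},
    {"unit": "502", "time": "2024-03-01", "treatment": "1"},
]
print(treatment_pattern_errors(rows, {"501"}, "2024-03-01"))
# [('502', '2024-03-01', 1, 0)]
```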
For more detailed troubleshooting, see Technical Notes.