GeoLift-SDID Technical Best Practices

This document provides first-principles implementation guidance for conducting rigorous Synthetic Difference-in-Differences analysis using the GeoLift-SDID library.

Core Implementation Framework

The GeoLift-SDID workflow follows a sequential process:

  1. Data Preparation & Validation

  2. Donor Pool Optimization

  3. Power Analysis & Sensitivity Thresholds

  4. SDID Model Implementation

  5. Bootstrap Standard Error Estimation

  6. AI-Powered Result Interpretation

Each component addresses specific technical challenges in causal inference for marketing applications.

Single-Cell vs Multi-Cell Experiment Designs

GeoLift-SDID supports two fundamental experimental designs with distinct statistical properties:

Single-Cell Design

Definition: Treatment applied to exactly one location (treatment unit).

Configuration: See configs/power_analysis_config_singlecell.yaml and configs/donor_eval_config_singlecell.yaml

Statistical Properties:

  • Simpler inference procedure

  • Higher power for detecting effects in the single treated location

  • Unable to estimate effect heterogeneity across locations

Multi-Cell Design

Definition: Treatment applied to multiple locations (treatment units).

Configuration: See configs/power_analysis_config_multicell.yaml and configs/donor_eval_config_multicell.yaml

Statistical Properties:

  • Allows estimation of average treatment effects across locations

  • Can detect heterogeneous treatment effects

  • Requires more control locations for reliable inference

  • Lower power per location but higher overall power

Data Requirements

| Analysis Type | Min Pre-Treatment Periods | Min Treatment Locations | Min Control Locations |
|---------------|---------------------------|-------------------------|-----------------------|
| Single-Cell   | 20                        | 1                       | 5                     |
| Multi-Cell    | 20                        | 3                       | 10                    |

Insufficient data will result in poor synthetic controls and unreliable inference.
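As a quick sanity check, the minimums in the table above can be encoded as a simple guard before running any analysis. This is a hedged sketch: the helper name check_minimum_data_requirements is illustrative rather than part of the library, and it assumes the panel columns used throughout this guide ('unit', 'time', 'treatment').

def check_minimum_data_requirements(df, treatment_units, design="single_cell"):
    """Illustrative guard implementing the minimums from the table above."""
    minimums = {
        "single_cell": {"pre_periods": 20, "treatment": 1, "control": 5},
        "multi_cell":  {"pre_periods": 20, "treatment": 3, "control": 10},
    }[design]

    n_treatment = len(treatment_units)
    n_control = df["unit"].nunique() - n_treatment
    # Pre-treatment periods: time points where the treated units are not yet treated
    treated = df[df["unit"].isin(treatment_units)]
    n_pre = treated[treated["treatment"] == 0]["time"].nunique()

    if n_pre < minimums["pre_periods"]:
        raise ValueError(f"Need >= {minimums['pre_periods']} pre-treatment periods, found {n_pre}")
    if n_treatment < minimums["treatment"]:
        raise ValueError(f"Need >= {minimums['treatment']} treatment locations, found {n_treatment}")
    if n_control < minimums["control"]:
        raise ValueError(f"Need >= {minimums['control']} control locations, found {n_control}")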

Technical Implementation Guide

1. Data Preparation & Validation

Input: Raw time series data
Output: Matrix-compatible panel dataset

Critical Technical Requirements

  • Matrix Dimensionality: Ensure balanced panel structure for matrix operations

  • Type Consistency: Maintain consistent types for unit identifiers between data and parameters

  • Treatment Pattern: Binary indicator correctly marking only post-intervention periods for treatment units

  • Variable Types: Numeric outcome variables without missing values

Validation Checks

# Essential validation checks
import numpy as np
import pandas as pd

df['time'] = pd.to_datetime(df['time'])
df = df.sort_values(['unit', 'time'])

# Verify the correct treatment pattern for each treated unit
correct_pattern = np.concatenate([
    np.zeros(pre_periods),  # All zeros before intervention
    np.ones(post_periods)   # All ones after intervention
])
for unit in treatment_units:
    treat_pattern = df[df['unit'] == unit]['treatment'].values
    if not np.array_equal(treat_pattern, correct_pattern):
        raise ValueError(
            f"Treatment pattern for unit {unit} is incorrect: "
            "should be all 0s pre-intervention, all 1s post-intervention"
        )

# Check dimension balance
unit_counts = df.groupby('unit').size()
if unit_counts.nunique() > 1:
    imbalanced = unit_counts[unit_counts != unit_counts.max()]
    raise ValueError(f"Imbalanced panel: units with missing periods: {imbalanced.index.tolist()}")

# Verify numeric outcome
if not np.issubdtype(df['outcome'].dtype, np.number):
    raise ValueError("Outcome variable must be numeric")

2. Donor Pool Optimization

Input: Validated panel data
Output: Optimal control unit ranking

Technical Implementation

The donor evaluator uses multiple distance metrics to identify optimal synthetic control candidates:

# Key similarity metrics for donor quality
import numpy as np
from fastdtw import fastdtw

metrics = {
    'correlation': lambda x, y: np.corrcoef(x, y)[0, 1],  # Trend similarity
    'rmse': lambda x, y: np.sqrt(np.mean((x - y)**2)),    # Absolute difference
    'mape': lambda x, y: np.mean(np.abs((x - y) / x)),    # Relative difference
    'dtw': lambda x, y: fastdtw(x, y)[0]                  # Pattern matching with time warping
}

# Composite scoring function: metrics_dict maps each metric name to a
# {location: value} dict of candidate-donor scores
def composite_score(metrics_dict, weights=None):
    if weights is None:
        weights = {'correlation': 0.4, 'rmse': 0.3, 'mape': 0.2, 'dtw': 0.1}

    # Normalize metrics to [0,1] range
    normalized = {}
    for m in metrics_dict:
        if m == 'correlation':  # Higher is better
            # Convert [-1,1] to [0,1] for each candidate location
            normalized[m] = {k: (v + 1) / 2 for k, v in metrics_dict[m].items()}
        else:  # Lower is better
            values = np.array(list(metrics_dict[m].values()))
            min_val, max_val = np.min(values), np.max(values)
            # Add epsilon to prevent division by zero
            epsilon = 1e-10
            normalized[m] = {k: 1 - (v - min_val) / max(max_val - min_val, epsilon) 
                            for k, v in metrics_dict[m].items()}
    
    # Calculate weighted score
    scores = {}
    for location in normalized[list(normalized.keys())[0]]:
        scores[location] = sum(weights[m] * normalized[m][location] for m in normalized)
    
    return scores

Critical Considerations

  • Correlation captures trend similarity but ignores magnitude differences

  • RMSE captures absolute differences but is sensitive to scale

  • MAPE captures relative differences but fails for near-zero values

  • Dynamic Time Warping accounts for time shifts but has higher computational cost

Optimal donor pools balance these metrics based on business context.
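To make the scoring concrete, the sketch below shows one way to combine the metric functions and composite_score defined above: compute each metric for every candidate donor against the treated unit's pre-period series, then rank the candidates. The variables treat_series and donor_series are illustrative placeholders, not library objects.

# Hypothetical inputs: treat_series is the treated unit's pre-period outcomes,
# donor_series maps each candidate control location to its pre-period outcomes
metrics_by_location = {m: {} for m in metrics}
for location, series in donor_series.items():
    for m, fn in metrics.items():
        metrics_by_location[m][location] = fn(treat_series, series)

# Rank donors by weighted composite score (higher is better)
scores = composite_score(metrics_by_location)
ranked_donors = sorted(scores, key=scores.get, reverse=True)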

3. Power Analysis & Sensitivity Thresholds

Input: Validated data with optimized donor pool
Output: Minimum detectable effect (MDE) and optimal test duration

Technical Implementation

The power calculator uses t-test based approximations for efficient computation:

import numpy as np
from scipy import stats

def power_calculation(data, treatment_units, control_units, pre_period, effect_sizes, durations, alpha=0.05):
    # Extract pre-period data for estimating baseline variance
    pre_data = data[data['time'].isin(pre_period)]
    treat_pre = pre_data[pre_data['unit'].isin(treatment_units)]['outcome']
    control_pre = pre_data[pre_data['unit'].isin(control_units)]['outcome']
    
    # Calculate pooled standard deviation
    n1, n2 = len(treat_pre), len(control_pre)
    s1, s2 = treat_pre.std(), control_pre.std()
    pooled_sd = np.sqrt(((n1-1)*s1**2 + (n2-1)*s2**2) / (n1+n2-2))
    
    # Calculate statistical power for each effect size and duration
    results = {}
    for effect in effect_sizes:
        for duration in durations:
            # Standard formula for t-test power
            effect_size = effect / pooled_sd  # Cohen's d
            power = 1 - stats.t.cdf(
                stats.t.ppf(1-alpha/2, df=duration-1) - effect_size*np.sqrt(duration),
                df=duration-1
            )
            results[(effect, duration)] = power
    
    return results
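For example, a hedged usage sketch: sweep a grid of candidate effect sizes and durations, then pick the shortest duration that reaches the conventional 0.8 power threshold for a given effect. The numeric values and variable names below are illustrative only.

effect_sizes = [50, 100, 200]        # Absolute lift in outcome units (illustrative)
durations = [14, 21, 28, 42, 56]     # Candidate test lengths in periods

power_grid = power_calculation(df, treatment_units, control_units,
                               pre_period, effect_sizes, durations)

# Shortest duration achieving >= 0.8 power for a 100-unit lift
adequate = [d for d in durations if power_grid[(100, d)] >= 0.8]
min_duration = min(adequate) if adequate else None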

Minimum Detectable Effect Estimation

For a given test duration and desired power (typically 0.8):

def calculate_mde(duration, desired_power=0.8, alpha=0.05, sd=None):
    """Calculate minimum detectable effect size for given parameters"""
    # Critical value for two-sided test
    t_crit = stats.t.ppf(1-alpha/2, df=duration-1)
    
    # Critical value for desired power
    t_power = stats.t.ppf(desired_power, df=duration-1)
    
    # MDE in standardized units (Cohen's d)
    mde_standardized = (t_crit + t_power) / np.sqrt(duration)
    
    # Convert to absolute units if standard deviation provided
    if sd is not None:
        return mde_standardized * sd
    return mde_standardized
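A short usage sketch, assuming the pooled pre-period standard deviation (pooled_sd, as computed in power_calculation above) is available: tabulate the MDE for several candidate test durations.

for duration in [14, 28, 42, 56]:
    mde = calculate_mde(duration, desired_power=0.8, sd=pooled_sd)
    print(f"{duration} periods -> MDE of {mde:.2f} outcome units")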

4. SDID Model Implementation

Input: Validated data with optimized donors and sufficient power
Output: Treatment effect estimates and standard errors

Technical Implementation

The core SDID implementation contains matrix operations for weight optimization:

from synthdid.synthdid import SyntheticDifferenceInDifferences

# Preferred implementation with proper import path
sdid = SyntheticDifferenceInDifferences(
    data,
    unit="unit",     # Geographic identifier
    time="time",     # Time period identifier 
    outcome="outcome", # Measured metric
    treatment="treatment",  # Binary treatment indicator
    covariates=covariates_list  # Optional additional predictors
)

# Fit model and extract primary results
sdid.fit()
att = sdid.att           # Average treatment effect on treated
se = sdid.se             # Standard error of estimate
p_value = sdid.p_value   # Statistical significance
ci_lower, ci_upper = sdid.confidence_interval(alpha=0.05)  # 95% confidence interval

Single-Cell Implementation Considerations

For single-cell analysis, special considerations apply:

from recipes.geolift_single_cell import GeoLiftSingleCell

# Initialize single-cell analysis
single_cell = GeoLiftSingleCell(
    data_path="path/to/data.csv",
    treatment_unit="501",  # Single treatment unit
    intervention_date="2024-03-01",
    pre_periods=60,
    post_periods=30,
    # Additional parameters as needed
)

# Run analysis
results = single_cell.run()

# Extract results with fallback calculations
att = results.get('att')  # Average treatment effect
se = results.get('se')    # Standard error (may use bootstrap)
p_value = results.get('pvalue')  # Statistical significance

5. Bootstrap Standard Error Estimation

Input: SDID model results with potential matrix dimension challenges
Output: Robust standard error estimates

Technical Implementation

When standard matrix-based standard error calculations fail due to dimension constraints, bootstrap resampling provides an alternative:

import logging
import numpy as np

def bootstrap_standard_error(Y1_pre, Y1_post, Y0_pre, Y0_post, n_bootstrap=500):
    """Calculate standard error via bootstrap when matrix dimensions prevent direct calculation"""
    
    # Original dimensions may not be compatible for standard SE calculation
    bootstrap_estimates = []
    valid_iterations = 0
    
    # Track failed iterations to analyze failure modes
    failure_types = {'matrix_error': 0, 'convergence': 0, 'other': 0}
    
    for _ in range(n_bootstrap):
        try:
            # Resample control units
            n_control = Y0_pre.shape[1]
            control_indices = np.random.choice(n_control, n_control, replace=True)
            
            Y0_pre_boot = Y0_pre[:, control_indices]
            Y0_post_boot = Y0_post[:, control_indices]
            
            # Compute synthetic control weights, then validate them
            # (compute_omega is assumed to be the unit-weight solver defined elsewhere)
            omega = compute_omega(Y1_pre, Y0_pre_boot)
            if np.any(np.isnan(omega)) or np.any(np.isinf(omega)):
                failure_types['convergence'] += 1
                continue
                
            # Apply weights to get synthetic control
            Y_synth_pre = Y0_pre_boot @ omega
            Y_synth_post = Y0_post_boot @ omega
            
            # Calculate ATT as the difference in post-period means
            att_boot = np.mean(Y1_post) - np.mean(Y_synth_post)
            bootstrap_estimates.append(att_boot)
            valid_iterations += 1
        except np.linalg.LinAlgError:
            failure_types['matrix_error'] += 1
            continue
        except Exception:
            failure_types['other'] += 1
            continue
    
    # Log bootstrap performance statistics
    logging.info(f"Bootstrap completed: {valid_iterations}/{n_bootstrap} valid iterations")
    if valid_iterations < n_bootstrap * 0.8:
        logging.warning(f"Bootstrap failures: {failure_types}")
    
    # Calculate standard error as standard deviation of bootstrap estimates
    if bootstrap_estimates:
        return np.std(bootstrap_estimates)
    else:
        return np.nan

Implementation in a Production Environment

The implementation automatically detects matrix dimension issues and falls back to bootstrap:

def calculate_standard_error(model, Y1_pre, Y1_post, Y0_pre, Y0_post):
    """Calculate standard error with fallback to bootstrap if needed"""
    try:
        # Try standard matrix-based calculation
        se = model.se
        if np.isnan(se) or se <= 0:
            raise ValueError("Invalid standard error from matrix calculation")
        return se
    except Exception as e:
        print(f"Matrix-based SE calculation failed: {e}")
        print("Falling back to bootstrap standard error calculation")
        return bootstrap_standard_error(Y1_pre, Y1_post, Y0_pre, Y0_post)

6. AI-Powered Result Interpretation

Input: SDID analysis results
Output: Business-focused interpretation and recommendations

Technical Implementation

The AI interpreter converts statistical results to business insights:

import requests

def generate_ai_interpretation(results, api_key, prompt_template=None):
    """Generate AI interpretation of statistical results"""
    
    # Default prompt template focuses on business implications
    if prompt_template is None:
        prompt_template = """
        You are an expert marketing analyst interpreting causal experiment results.
        
        EXPERIMENT RESULTS:
        - Test Periods: {test_duration} days
        - Average Treatment Effect: {att:.4f} ({att_pct:.2f}% change)
        - Standard Error: {se:.4f}
        - p-value: {pvalue:.4f}
        - 95% Confidence Interval: [{ci_lower:.4f}, {ci_upper:.4f}]
        
        Provide a concise, business-focused interpretation of these results in 3-5 paragraphs,
        addressing:
        1. Statistical significance and what it means in plain language
        2. Practical significance of the effect size for business decisions
        3. Recommendations for next steps based on these findings
        4. Any limitations or considerations in interpreting these results
        
        Keep your language clear, precise, and actionable for marketing stakeholders.
        """
    
    # Format the prompt with actual results
    formatted_prompt = prompt_template.format(
        test_duration=results['post_periods'],
        att=results['att'],
        att_pct=results['att_pct'] * 100 if 'att_pct' in results else results['att'] * 100,
        se=results['se'],
        pvalue=results['pvalue'],
        ci_lower=results['ci_lower'],
        ci_upper=results['ci_upper']
    )
    
    # Call the DeepSeek API
    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}"
        },
        json={
            "model": "deepseek-chat",
            "messages": [
                {"role": "system", "content": "You are an expert marketing analyst."},
                {"role": "user", "content": formatted_prompt}
            ],
            "temperature": 0.4,  # Lower temperature for more precise responses
            "max_tokens": 1000
        }
    )
    
    # Parse and return the interpretation
    try:
        interpretation = response.json()['choices'][0]['message']['content']
        return interpretation
    except Exception as e:
        return f"Failed to generate interpretation: {str(e)}"

Critical Technical Pitfalls

  1. Matrix Dimension Mismatches:

    • SYMPTOM: “Input operand has a mismatch in its core dimension”

    • CAUSE: Pre/post period counts don’t align for matrix multiplication

    • SOLUTION: Verify matching pre-intervention and post-intervention shapes

    • REFERENCE: See error handling in recipes/power_calculator.py or ensure balanced dimensions

  2. Singular Matrix Errors:

    • SYMPTOM: “Singular matrix” or “LinAlgError”

    • CAUSE: Linear dependence between control units

    • SOLUTION: Add ridge penalty or remove highly correlated units

    • REFERENCE: See implementation in synthdid/synthdid.py

  3. Imbalanced Panel Issues:

    • SYMPTOM: “ValueError: Found input variables with inconsistent numbers of samples”

    • CAUSE: Each location has different number of time periods

    • SOLUTION: Verify complete panel structure before analysis

    • REFERENCE: See data validation in recipes/donor_evaluator.py

  4. Incorrect Package Import:

    • SYMPTOM: “ModuleNotFoundError: No module named ‘src’”

    • CAUSE: Package structure is synthdid, not src

    • SOLUTION: Use correct import from synthdid.synthdid

    • REFERENCE: See import statements in example scripts

  5. Visualization Style Error:

    • SYMPTOM: “OSError: ‘seaborn-whitegrid’ is not a valid package style”

    • CAUSE: Matplotlib version compatibility issue

    • SOLUTION: Use a version-appropriate style name (seaborn-v0_8-whitegrid); a fallback sketch appears after this list

    • REFERENCE: See style handling in recipes/power_calculator.py

  6. Incorrect Single-Cell Configuration:

    • SYMPTOM: “Multiple treatment locations detected in single-cell configuration”

    • CAUSE: Single-cell analysis requires exactly one treatment location

    • SOLUTION: Modify power_analysis_config_singlecell.yaml to include only one treatment location

    • REFERENCE: See example in configs/power_analysis_config_singlecell.yaml
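Referring to pitfall 5, the snippet below is a minimal defensive sketch for selecting a plotting style that works across Matplotlib versions. It is not taken from recipes/power_calculator.py; it simply illustrates one fallback pattern.

import matplotlib.pyplot as plt

# Prefer the newer seaborn style name, fall back to the legacy one, then default
for style in ("seaborn-v0_8-whitegrid", "seaborn-whitegrid", "default"):
    if style == "default" or style in plt.style.available:
        plt.style.use(style)
        break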

Validation Heuristics

Pre-Treatment Fit Quality

The synthetic control should closely match the treatment location during the pre-intervention period:

import numpy as np

def assess_pretreatment_fit(Y1_pre, Y_synth_pre):
    """Assess quality of pre-treatment fit"""
    # Prevent division by zero in MAPE calculation
    epsilon = 1e-10
    denom = np.maximum(np.abs(Y1_pre), epsilon)
    
    rmse = np.sqrt(np.mean((Y1_pre - Y_synth_pre) ** 2))
    correlation = np.corrcoef(Y1_pre.flatten(), Y_synth_pre.flatten())[0, 1]
    mape = np.mean(np.abs((Y1_pre - Y_synth_pre) / denom))
    
    # Interpretation heuristics - calibrated from empirical results
    if correlation > 0.9 and mape < 0.1:
        quality = "Excellent"
        recommendation = "Proceed with analysis"
    elif correlation > 0.8 and mape < 0.2:
        quality = "Good"
        recommendation = "Proceed with analysis"
    elif correlation > 0.7 and mape < 0.3:
        quality = "Acceptable"
        recommendation = "Proceed with caution, interpret results carefully"
    else:
        quality = "Poor"
        recommendation = "Consider alternative donor pools or analysis methods"
    
    return {
        "rmse": rmse,
        "correlation": correlation,
        "mape": mape,
        "quality": quality,
        "recommendation": recommendation
    }
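A short usage sketch, assuming the treated unit's pre-period vector Y1_pre, the control matrix Y0_pre, and fitted unit weights omega are already available (as in the bootstrap code above):

Y_synth_pre = Y0_pre @ omega  # Synthetic control series for the pre-period
fit = assess_pretreatment_fit(Y1_pre, Y_synth_pre)
print(f"Fit quality: {fit['quality']} (correlation={fit['correlation']:.2f}, MAPE={fit['mape']:.2%})")
print(fit['recommendation'])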

Statistical Significance Assessment

Interpreting statistical significance requires both p-values and effect sizes:

from scipy import stats

def assess_significance(att, se, baseline, alpha=0.05):
    """Assess statistical and practical significance"""
    # Statistical significance
    z_score = att / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))
    
    # Practical significance (percent change)
    pct_change = att / baseline * 100 if baseline != 0 else float('inf')
    
    # Confidence interval
    ci_lower = att - stats.norm.ppf(1-alpha/2) * se
    ci_upper = att + stats.norm.ppf(1-alpha/2) * se
    
    # Interpretation
    if p_value < alpha:
        stat_interp = "Statistically significant"
    else:
        stat_interp = "Not statistically significant"
    
    # Practical interpretation based on percent change
    if abs(pct_change) < 1:
        pract_interp = "Negligible effect"
    elif abs(pct_change) < 5:
        pract_interp = "Small effect"
    elif abs(pct_change) < 10:
        pract_interp = "Moderate effect"
    else:
        pract_interp = "Large effect"
    
    return {
        "p_value": p_value,
        "percent_change": pct_change,
        "confidence_interval": (ci_lower, ci_upper),
        "statistical_interpretation": stat_interp,
        "practical_interpretation": pract_interp
    }

Advanced Technical Considerations

1. Optimization Algorithm Selection

The core SDID implementation uses:

  • Quadratic programming for unit weights (omega)

  • Ridge regression with cross-validation for time weights (lambda)

These algorithms balance fit quality with robustness to overfitting.
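The exact solvers live in synthdid/synthdid.py; the sketch below is only a minimal illustration of the simplex-constrained least-squares problem the unit weights solve, with a small ridge term for numerical stability. The function name, ridge default, and use of SciPy's general-purpose optimizer are assumptions for illustration, not the library's API.

import numpy as np
from scipy.optimize import minimize

def solve_unit_weights(Y1_pre, Y0_pre, ridge=1e-6):
    """Sketch: weights are non-negative, sum to one, and minimize pre-period error."""
    n_controls = Y0_pre.shape[1]

    def objective(w):
        resid = Y1_pre - Y0_pre @ w
        return np.sum(resid ** 2) + ridge * np.sum(w ** 2)

    constraints = [{'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0}]  # Weights sum to 1
    bounds = [(0.0, 1.0)] * n_controls                                # Non-negative weights
    w0 = np.full(n_controls, 1.0 / n_controls)                        # Uniform starting point

    result = minimize(objective, w0, bounds=bounds, constraints=constraints)
    return result.x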

2. Bootstrapping vs. Jackknife Variance

Bootstrap resampling offers advantages over jackknife for standard errors:

  • Less sensitive to outliers in small donor pools

  • More robust when pre/post periods have different lengths

  • Better handling of non-linear relationships

3. Balanced vs. Unbalanced Panels

While balanced panels are ideal, strategies for unbalanced panels include (the first two are sketched after this list):

  • Imputation of missing values (linear interpolation recommended)

  • Restriction to common time periods (reducing data but maintaining balance)

  • Weighted estimation that accounts for missing observations
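A hedged pandas sketch of the first two strategies; column names ('unit', 'time', 'outcome') follow the examples earlier in this guide, and balance_panel is an illustrative helper rather than a library function.

import pandas as pd

def balance_panel(df, method="interpolate"):
    """Produce a balanced panel via interpolation or restriction to common periods."""
    wide = df.pivot(index="time", columns="unit", values="outcome")
    if method == "interpolate":
        # Linear interpolation of missing outcome values within each unit
        wide = wide.interpolate(method="linear", limit_direction="both")
    elif method == "common_periods":
        # Keep only time periods observed for every unit
        wide = wide.dropna(axis=0, how="any")
    return wide.stack().rename("outcome").reset_index()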

Treatment Effect Interpretation

Interpreting Heterogeneous Effects

For multi-cell experiments, we need to interpret variation in treatment effects across locations:

import numpy as np

def interpret_heterogeneous_effects(location_effects):
    """Convert raw location-specific effects to actionable insight"""
    # Coefficient of variation relative to the absolute mean effect
    mean_effect = np.mean(location_effects)
    cv = np.std(location_effects) / abs(mean_effect) if mean_effect != 0 else np.inf
    
    if cv < 0.1:
        return "Homogeneous effect across locations"
    elif cv < 0.3:
        return "Moderate heterogeneity, segment by location characteristics"
    else:
        return "High heterogeneity, analyze each location individually"

From Statistical to Business Metrics

To convert ATT to meaningful business metrics:

import numpy as np

def convert_to_business_metrics(att, baseline, time_periods, cost_per_unit=None):
    """Convert a per-period ATT to business metrics"""
    metrics = {
        'absolute_lift': att,
        'percent_lift': att / baseline * 100 if baseline != 0 else np.inf,
        'total_incremental': att * time_periods
    }
    
    if cost_per_unit is not None:
        metrics['roi'] = (att * time_periods) / cost_per_unit
    
    return metrics
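For instance, with purely hypothetical numbers: an ATT of 5 incremental conversions per period against a baseline of 100, over a 30-period test with total cost 1,000.

metrics = convert_to_business_metrics(att=5, baseline=100, time_periods=30, cost_per_unit=1000)
# -> absolute_lift 5, percent_lift 5.0, total_incremental 150, roi 0.15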

Further Technical Resources