Integrity Checks & Robustness Validation

Comprehensive validation framework ensuring benchmark reliability, detecting overfitting, and maintaining scoring integrity through systematic falsification tests and perturbation analysis.

Why Integrity Checks Matter

Benchmark integrity is crucial for ensuring that scoring algorithms provide meaningful and reliable results. Without proper validation, models can overfit to specific test cases or exploit unintended patterns in the data.

Overfitting Risk: Without integrity checks, scoring systems might learn to recognize specific benchmark patterns rather than general onboarding quality principles, leading to poor performance on real-world scenarios.

Our integrity validation framework addresses three critical areas:

  • Data Leakage Prevention: Ensures scoring logic doesn't access ground truth data
  • Robustness Validation: Tests sensitivity to input perturbations and variations
  • Falsification Testing: Validates that degraded inputs produce appropriately degraded scores

Ground Truth Leakage Detection

Ground truth leakage occurs when the scoring system has access to the correct answers during evaluation, leading to artificially inflated performance metrics.

Direct Data Access

Prevents scoring algorithms from reading meta.json files or accessing fixture directories during evaluation.

✅ PASS

Meta Information Isolation

Ensures no imports of ground truth data in probe implementations or scoring logic.

✅ PASS

Directory Traversal Protection

Blocks attempts to traverse fixture directories to discover test case information.

✅ PASS
// Example: leakage detection summary in validate-benchmarks.mjs
const leakageChecks = [
  { name: 'No direct meta.json imports', passed: true },
  { name: 'No fixture directory traversal', passed: true },
  { name: 'No ground truth data access', passed: true },
  { name: 'Scoring logic isolation', passed: true }
];

// Every check must pass for validation to succeed
if (leakageChecks.every(check => check.passed)) {
  console.log('✅ No ground truth leakage detected');
}
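
As a rough illustration of how such a scan can work, the sketch below statically searches scoring and probe sources for forbidden references. The directory layout, file-extension filter, and pattern list are assumptions made for illustration and do not reflect the exact logic of scripts/leakage_check.mjs.

// Sketch: static source scan for leakage patterns (illustrative only)
import { readFileSync, readdirSync } from 'node:fs';
import { join } from 'node:path';

const forbiddenPatterns = [/meta\.json/, /fixtures\//, /ground[-_]?truth/i];

function findLeakage(dir) {
  const violations = [];
  for (const file of readdirSync(dir, { recursive: true })) {
    if (!String(file).endsWith('.mjs')) continue;
    const source = readFileSync(join(dir, String(file)), 'utf8');
    for (const pattern of forbiddenPatterns) {
      if (pattern.test(source)) violations.push({ file, pattern: String(pattern) });
    }
  }
  return violations;
}

console.log(findLeakage('scripts')); // expect [] when no leakage patterns are present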

Falsification Tests

Falsification tests intentionally degrade input quality to verify that the scoring system responds appropriately. These tests help detect overfitting and ensure the model generalizes well.

Label Shuffle Falsification

Randomly shuffles heuristic labels to test if the scoring system is learning meaningful patterns rather than memorizing specific combinations.

Expected Result: Scores should degrade significantly (typically 40-60%) when labels are shuffled, indicating the system learns meaningful patterns.
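
As a sketch of what label shuffling can look like, assuming heuristic labels are stored as a plain key-value object (the structure and example heuristic IDs below are illustrative, not the benchmark's actual label format):

// Sketch: shuffle heuristic label values across keys (Fisher-Yates)
function shuffleLabels(labels) {
  const values = Object.values(labels);
  for (let i = values.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [values[i], values[j]] = [values[j], values[i]];
  }
  // Reattach the shuffled values to the original heuristic IDs
  return Object.fromEntries(Object.keys(labels).map((key, i) => [key, values[i]]));
}

const shuffled = shuffleLabels({
  'H-CTA-ABOVE-FOLD': true,
  'H-COPY-CLARITY': 'good',
  'H-TRUST-MARKERS': true,
  'H-PERCEIVED-SPEED': 'fast'
});
// Scoring against shuffled labels should push performance into the 40-60% degradation band.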

DOM Perturbation Tests

Applies systematic perturbations to the DOM structure while preserving semantic meaning:

Perturbation Type   | Description                         | Expected Impact                  | Acceptance Threshold
Light Perturbation  | Minor CSS changes, font variations  | Minimal score change (<5%)       | ±10%
Medium Perturbation | Layout shifts, spacing changes      | Moderate score change (5-15%)    | ±20%
Heavy Perturbation  | Structural DOM changes              | Significant score change (>15%)  | ±30%
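
A minimal sketch of what these perturbation levels can look like when applied to a raw HTML string; the specific substitutions are illustrative assumptions, not the transformations used by scripts/run-benchmarks.mjs:

// Sketch: apply a DOM perturbation of a given level to an HTML string
function perturbHtml(html, level = 'light') {
  switch (level) {
    case 'light':
      // Minor CSS changes: normalize font-family declarations
      return html.replace(/font-family:[^;"]+/g, 'font-family: sans-serif');
    case 'medium':
      // Layout shifts: inject extra spacing on block elements
      return html.replace(/<div/g, '<div style="margin: 8px"');
    case 'heavy':
      // Structural changes: wrap sections in an extra container
      return html.replace(/<section([^>]*)>/g, '<div><section$1>')
                 .replace(/<\/section>/g, '</section></div>');
    default:
      throw new Error(`Unknown perturbation level: ${level}`);
  }
}

const perturbed = perturbHtml('<section class="hero"><div>Sign up</div></section>', 'heavy');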

Inverted Heuristics Test

Intentionally inverts heuristic logic to test if the scoring system recognizes degraded quality:

// Example: inverted heuristics test
const invertedHeuristics = {
  'H-CTA-ABOVE-FOLD': false,   // CTA should be below fold
  'H-COPY-CLARITY': 'poor',    // Copy should be unclear
  'H-TRUST-MARKERS': false,    // No trust signals
  'H-PERCEIVED-SPEED': 'slow'  // Process should feel slow
};

// Expected: score should decrease by 40-60%
const expectedDegradation = 0.5; // 50%
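
A minimal check of that expectation, using baseline and inverted scores assumed purely for illustration:

// Hypothetical scores for illustration only
const baselineScore = 82;
const invertedScore = 39;

const degradation = (baselineScore - invertedScore) / baselineScore;
console.log(degradation >= 0.4 && degradation <= 0.6
  ? '✅ Falsification response within the expected 40-60% band'
  : '⚠️ Degradation outside the expected band - investigate');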

CI Test Matrix

Our continuous integration runs a comprehensive test matrix with different falsification modes:

  • A-Baseline-Train-NoPerturb (Baseline): Standard benchmark on training data. Targets: Macro-F1 ≥ 0.85, R² ≥ 0.80
  • B-Perturbation-Train-Light (Robustness): Light DOM perturbation testing. Purpose: Evaluate stability under layout changes
  • C-Holdout-NoPerturb (Validation): Independent validation on unseen data. Targets: Macro-F1 ≥ 0.80, R² ≥ 0.75
  • D-Falsification-Train-Shuffle (Falsification): Label shuffle falsification test. Targets: Macro-F1 ≤ 0.40, R² ≤ 0.10
  • E-BandMidpoint-Ablation (Ablation): Band midpoint sensitivity analysis. Purpose: Test scoring calibration stability
  • F-Leakage-Sentry (Strict): Enhanced validation with strict thresholds. Targets: Macro-F1 ≥ 0.90, R² ≥ 0.85
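
Expressed as data, the matrix targets can be checked programmatically. The sketch below mirrors the numeric targets listed above; the object shape and key names are assumptions, not the repository's actual CI configuration:

// Matrix runs with quantitative targets (runs defined only by purpose are omitted)
const ciMatrix = [
  { id: 'A-Baseline-Train-NoPerturb',    macroF1: { min: 0.85 }, r2: { min: 0.80 } },
  { id: 'C-Holdout-NoPerturb',           macroF1: { min: 0.80 }, r2: { min: 0.75 } },
  { id: 'D-Falsification-Train-Shuffle', macroF1: { max: 0.40 }, r2: { max: 0.10 } },
  { id: 'F-Leakage-Sentry',              macroF1: { min: 0.90 }, r2: { min: 0.85 } }
];

// A target is met when the observed metric is on the right side of the bound
function meetsTarget(value, target) {
  return target.min !== undefined ? value >= target.min : value <= target.max;
}

const run = { id: 'D-Falsification-Train-Shuffle', macroF1: 0.31, r2: 0.04 };
const spec = ciMatrix.find(entry => entry.id === run.id);
console.log(meetsTarget(run.macroF1, spec.macroF1) && meetsTarget(run.r2, spec.r2)); // true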

Holdout Validation

Holdout validation uses separate test sets that are not used during development to provide an unbiased estimate of model performance.

Holdout Dataset Structure

Our holdout validation includes the following (see the split sketch after this list):

  • 30% of total fixtures: Reserved exclusively for final validation
  • Balanced categories: Each category represented proportionally
  • Temporal separation: Holdout data collected after training set
  • Blind evaluation: No access to holdout results during development
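
A minimal sketch of a stratified 30% split, assuming each fixture carries a category field (the fixture shape and example categories are assumptions for illustration):

// Sketch: reserve the same proportion of each category for the holdout set
function splitHoldout(fixtures, ratio = 0.3) {
  const byCategory = new Map();
  for (const fixture of fixtures) {
    const group = byCategory.get(fixture.category) ?? [];
    group.push(fixture);
    byCategory.set(fixture.category, group);
  }
  const train = [];
  const holdout = [];
  for (const group of byCategory.values()) {
    const cut = Math.round(group.length * ratio);
    holdout.push(...group.slice(0, cut));
    train.push(...group.slice(cut));
  }
  return { train, holdout };
}

const { train, holdout } = splitHoldout([
  { id: 'fx-01', category: 'saas' },
  { id: 'fx-02', category: 'saas' },
  { id: 'fx-03', category: 'ecommerce' }
]);
console.log(holdout.length); // roughly 30% of each category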

Category Balance

Ensures holdout set maintains proportional representation across all benchmark categories.

✅ PASS

Temporal Validation

Validates that performance doesn't degrade significantly on newer data patterns.

✅ PASS

Generalization Test

Measures performance consistency between training and holdout datasets.

✅ PASS

Acceptance Thresholds

Clear thresholds define what constitutes acceptable performance across different integrity checks.

Check Type          | Metric                   | Acceptable Range | Warning Threshold | Failure Threshold
Data Leakage        | Direct access violations | 0 violations     | N/A               | >0 violations
Label Shuffle       | Score degradation        | >30% decrease    | 20-30% decrease   | <20% decrease
DOM Perturbation    | Score variance           | <±20%            | ±20-30%           | >±30%
Inverted Heuristics | Score degradation        | >40% decrease    | 30-40% decrease   | <30% decrease
Holdout Validation  | Performance gap          | <10% difference  | 10-15% difference | >15% difference

Threshold Violations: When any check crosses its warning threshold, it indicates potential issues with model robustness or generalization that should be investigated.
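
A sketch of how these thresholds could be applied when summarizing a run; the metric keys and result shape are illustrative assumptions, not the framework's actual reporting format:

// Sketch: classify a check result against the thresholds in the table above
function classifyCheck({ metric, value }) {
  const thresholds = {
    labelShuffleDegradation: { pass: v => v > 0.30, warn: v => v >= 0.20 },
    perturbationVariance:    { pass: v => Math.abs(v) < 0.20, warn: v => Math.abs(v) <= 0.30 },
    invertedDegradation:     { pass: v => v > 0.40, warn: v => v >= 0.30 },
    holdoutGap:              { pass: v => v < 0.10, warn: v => v <= 0.15 }
  };
  const t = thresholds[metric];
  if (!t) throw new Error(`Unknown metric: ${metric}`);
  if (t.pass(value)) return 'PASS';
  if (t.warn(value)) return 'WARN';
  return 'FAIL';
}

console.log(classifyCheck({ metric: 'labelShuffleDegradation', value: 0.45 })); // PASS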

Interpreting Results

Understanding integrity check results is crucial for maintaining benchmark reliability.

Green Flags (Good Signs)

  • Strong falsification response: 40-60% score degradation on inverted tests
  • Robust perturbation handling: Minimal sensitivity to minor DOM changes
  • Consistent holdout performance: <10% difference from training results
  • No leakage violations: Zero unauthorized data access attempts

Red Flags (Warning Signs)

  • Weak falsification response: <30% degradation suggests overfitting
  • Excessive perturbation sensitivity: >30% variance indicates instability
  • Holdout performance gap: >15% difference suggests poor generalization
  • Leakage violations: Any unauthorized data access is critical

Action Items for Failed Checks

  1. Investigate the root cause: Determine why the check failed
  2. Review model architecture: Check for potential overfitting patterns
  3. Validate data pipeline: Ensure no unintended data leakage
  4. Retrain with fixes: Address identified issues and re-run validation
  5. Document findings: Record issues and solutions for future reference

Implementation Details

Technical implementation of integrity checks in the OnboardingAudit.ai framework.

Running Integrity Checks

# Run all integrity checks
npm run test:integrity

# Run specific falsification tests
node scripts/validate-benchmarks.mjs --mode inverted-heuristics
node scripts/validate-benchmarks.mjs --mode label-shuffle

# Run perturbation tests
node scripts/run-benchmarks.mjs --perturb light
node scripts/run-benchmarks.mjs --perturb medium
node scripts/run-benchmarks.mjs --perturb heavy

# Check for data leakage
node scripts/leakage_check.mjs

Configuration Options

Integrity checks can be configured through various parameters (a parsing sketch follows the list):

  • --falsification-mode: Type of falsification test (shuffle, invert, perturb)
  • --perturbation-level: Intensity of DOM perturbations (light, medium, heavy)
  • --holdout-ratio: Percentage of data reserved for holdout validation
  • --threshold-strictness: Tolerance for acceptance thresholds (strict, normal, relaxed)
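
As a sketch of how these flags might be consumed, assuming they are parsed with node:util's parseArgs (the defaults below, and whether the actual scripts use parseArgs at all, are assumptions):

// Sketch: parse the configuration flags listed above
import { parseArgs } from 'node:util';

const { values } = parseArgs({
  options: {
    'falsification-mode':   { type: 'string', default: 'shuffle' },
    'perturbation-level':   { type: 'string', default: 'light' },
    'holdout-ratio':        { type: 'string', default: '0.3' },
    'threshold-strictness': { type: 'string', default: 'normal' }
  }
});

console.log(`falsification: ${values['falsification-mode']}, ` +
            `perturbation: ${values['perturbation-level']}, ` +
            `holdout ratio: ${Number(values['holdout-ratio'])}, ` +
            `strictness: ${values['threshold-strictness']}`);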