Integrity Checks & Robustness Validation
Comprehensive validation framework ensuring benchmark reliability, detecting overfitting, and maintaining scoring integrity through systematic falsification tests and perturbation analysis.
Why Integrity Checks Matter
Benchmark integrity is crucial for ensuring that scoring algorithms provide meaningful and reliable results. Without proper validation, models can overfit to specific test cases or exploit unintended patterns in the data.
Our integrity validation framework addresses three critical areas:
- Data Leakage Prevention: Ensures scoring logic doesn't access ground truth data
- Robustness Validation: Tests sensitivity to input perturbations and variations
- Falsification Testing: Validates that degraded inputs produce appropriately degraded scores
Ground Truth Leakage Detection
Ground truth leakage occurs when the scoring system has access to the correct answers during evaluation, leading to artificially inflated performance metrics.
- Direct Data Access (✅ PASS): Prevents scoring algorithms from reading meta.json files or accessing fixture directories during evaluation.
- Meta Information Isolation (✅ PASS): Ensures no imports of ground truth data in probe implementations or scoring logic.
- Directory Traversal Protection (✅ PASS): Blocks attempts to traverse fixture directories to discover test case information.
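As an illustration only (the framework's actual enforcement mechanism isn't shown here), a simple static scan can flag scoring code that references ground-truth locations; the `scoring/` directory and the pattern list below are assumptions:

```python
import pathlib

# Hypothetical layout: scoring / probe code lives under scoring/.
SCORING_SOURCES = pathlib.Path("scoring")
FORBIDDEN_PATTERNS = ("meta.json", "fixtures/", "ground_truth")

def find_leakage_violations(source_dir: pathlib.Path = SCORING_SOURCES) -> list[str]:
    """Return scoring source files that reference ground-truth locations.

    Any module mentioning meta.json, the fixture directory, or a
    ground-truth identifier is flagged; the acceptance threshold is zero hits.
    """
    violations = []
    for path in source_dir.rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for pattern in FORBIDDEN_PATTERNS:
            if pattern in text:
                violations.append(f"{path}: references '{pattern}'")
    return violations

if __name__ == "__main__":
    hits = find_leakage_violations()
    print(f"{len(hits)} violation(s) found")
    for hit in hits:
        print(" -", hit)
```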
Falsification Tests
Falsification tests intentionally degrade input quality to verify that the scoring system responds appropriately. These tests help detect overfitting and ensure the model generalizes well.
Label Shuffle Falsification
Randomly shuffles heuristic labels to test if the scoring system is learning meaningful patterns rather than memorizing specific combinations.
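A minimal sketch of such a shuffle test, assuming a generic `score_fn` callable rather than the framework's real scoring interface:

```python
import random
from typing import Callable, Sequence

def label_shuffle_score(
    fixtures: Sequence[dict],
    labels: Sequence[str],
    score_fn: Callable[[Sequence[dict], Sequence[str]], float],
    seed: int = 42,
) -> float:
    """Score the benchmark with randomly shuffled heuristic labels.

    If the scorer learns meaningful patterns, the shuffled score should
    collapse toward chance; a score close to the unshuffled baseline
    suggests memorization or leakage.
    """
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return score_fn(fixtures, shuffled)

# Usage sketch: compare against the unshuffled baseline.
# baseline = score_fn(fixtures, labels)
# shuffled = label_shuffle_score(fixtures, labels, score_fn)
# degradation = (baseline - shuffled) / baseline   # expect > 30%
```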
DOM Perturbation Tests
Applies systematic perturbations to the DOM structure while preserving semantic meaning:
Perturbation Type | Description | Expected Impact | Acceptance Threshold |
---|---|---|---|
Light Perturbation | Minor CSS changes, font variations | Minimal score change (<5%) | ±10% |
Medium Perturbation | Layout shifts, spacing changes | Moderate score change (5-15%) | ±20% |
Heavy Perturbation | Structural DOM changes | Significant score change (>15%) | ±30% |
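As a rough illustration of the light tier, the sketch below injects presentational noise into an HTML document using BeautifulSoup; the perturbation probability and font list are arbitrary choices, not the benchmark's actual parameters:

```python
import random
from bs4 import BeautifulSoup  # pip install beautifulsoup4

LIGHT_FONTS = ["Arial", "Helvetica", "Georgia", "Verdana"]

def perturb_dom_light(html: str, seed: int = 0) -> str:
    """Apply a light perturbation: inject harmless font/CSS noise.

    Semantic content is untouched; only presentational attributes change,
    so scores are expected to move by less than ~5% (±10% accepted).
    """
    rng = random.Random(seed)
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):
        if rng.random() < 0.3:  # perturb roughly a third of elements
            existing = tag.get("style", "")
            font = rng.choice(LIGHT_FONTS)
            tag["style"] = f"{existing};font-family:{font}".strip(";")
    return str(soup)
```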
Inverted Heuristics Test
Intentionally inverts heuristic logic to test whether the scoring system recognizes the degraded quality.
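A minimal sketch, assuming heuristic results carry a pass/fail verdict and a 0-1 score (the `HeuristicResult` type is illustrative, not the framework's actual data model):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class HeuristicResult:
    """Illustrative heuristic output: a name, a pass/fail verdict, a 0-1 score."""
    name: str
    passed: bool
    score: float

def invert_heuristic(result: HeuristicResult) -> HeuristicResult:
    """Flip the verdict and mirror the score around the midpoint.

    Re-scoring with inverted heuristics should cut the overall score
    substantially (>40% degradation expected; <30% counts as a failure).
    """
    return replace(result, passed=not result.passed, score=1.0 - result.score)
```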
CI Test Matrix
Our continuous integration runs a comprehensive test matrix covering baseline, robustness, validation, falsification, ablation, and strict leakage profiles:
Profile | Type | Description | Targets / Purpose |
---|---|---|---|
A-Baseline-Train-NoPerturb | Baseline | Standard benchmark on training data | Macro-F1 ≥ 0.85, R² ≥ 0.80 |
B-Perturbation-Train-Light | Robustness | Light DOM perturbation testing | Evaluate stability under layout changes |
C-Holdout-NoPerturb | Validation | Independent validation on unseen data | Macro-F1 ≥ 0.80, R² ≥ 0.75 |
D-Falsification-Train-Shuffle | Falsification | Label shuffle falsification test | Macro-F1 ≤ 0.40, R² ≤ 0.10 |
E-BandMidpoint-Ablation | Ablation | Band midpoint sensitivity analysis | Test scoring calibration stability |
F-Leakage-Sentry | Strict | Enhanced validation with strict thresholds | Macro-F1 ≥ 0.90, R² ≥ 0.85 |
Holdout Validation
Holdout validation uses separate test sets that are not used during development to provide an unbiased estimate of model performance.
Holdout Dataset Structure
Our holdout validation includes the following (a split sketch follows this list):
- 30% of total fixtures: Reserved exclusively for final validation
- Balanced categories: Each category represented proportionally
- Temporal separation: Holdout data collected after training set
- Blind evaluation: No access to holdout results during development
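A minimal sketch of a category-balanced 30% split, assuming each fixture record exposes a `category` field (field and function names are illustrative):

```python
import random
from collections import defaultdict
from typing import Sequence

def stratified_holdout(
    fixtures: Sequence[dict],
    holdout_ratio: float = 0.30,
    seed: int = 7,
) -> tuple[list[dict], list[dict]]:
    """Split fixtures into train/holdout sets, balanced per category.

    Each category contributes roughly holdout_ratio of its fixtures to the
    holdout set, which is then kept blind during development.
    """
    rng = random.Random(seed)
    by_category: dict[str, list[dict]] = defaultdict(list)
    for fixture in fixtures:
        by_category[fixture["category"]].append(fixture)

    train, holdout = [], []
    for items in by_category.values():
        rng.shuffle(items)
        cut = round(len(items) * holdout_ratio)
        holdout.extend(items[:cut])
        train.extend(items[cut:])
    return train, holdout
```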
- Category Balance (✅ PASS): Ensures the holdout set maintains proportional representation across all benchmark categories.
- Temporal Validation (✅ PASS): Validates that performance doesn't degrade significantly on newer data patterns.
- Generalization Test (✅ PASS): Measures performance consistency between training and holdout datasets.
Acceptance Thresholds
Clear thresholds define what constitutes acceptable performance across different integrity checks.
Check Type | Metric | Acceptable Range | Warning Threshold | Failure Threshold |
---|---|---|---|---|
Data Leakage | Direct access violations | 0 violations | N/A | >0 violations |
Label Shuffle | Score degradation | >30% decrease | 20-30% decrease | <20% decrease |
DOM Perturbation | Score variance | <±20% | ±20-30% | >±30% |
Inverted Heuristics | Score degradation | >40% decrease | 30-40% decrease | <30% decrease |
Holdout Validation | Performance gap | <10% difference | 10-15% difference | >15% difference |
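These ranges translate directly into code; the sketch below classifies two of the checks as pass/warn/fail using the table's values (the function names and inputs are illustrative):

```python
def classify_label_shuffle(degradation: float) -> str:
    """Classify a label-shuffle result from its relative score degradation.

    degradation is (baseline - shuffled) / baseline, e.g. 0.35 for a 35% drop.
    """
    if degradation > 0.30:
        return "pass"   # scorer collapses under shuffled labels, as expected
    if degradation >= 0.20:
        return "warn"   # borderline; investigate
    return "fail"       # likely overfitting or leakage

def classify_holdout_gap(train_score: float, holdout_score: float) -> str:
    """Classify the train/holdout performance gap (relative difference)."""
    gap = abs(train_score - holdout_score) / max(train_score, 1e-9)
    if gap < 0.10:
        return "pass"
    if gap <= 0.15:
        return "warn"
    return "fail"
```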
Interpreting Results
Understanding integrity check results is crucial for maintaining benchmark reliability.
Green Flags (Good Signs)
- Strong falsification response: 40-60% score degradation on inverted tests
- Robust perturbation handling: Minimal sensitivity to minor DOM changes
- Consistent holdout performance: <10% difference from training results
- No leakage violations: Zero unauthorized data access attempts
Red Flags (Warning Signs)
- Weak falsification response: <30% degradation suggests overfitting
- Excessive perturbation sensitivity: >30% variance indicates instability
- Holdout performance gap: >15% difference suggests poor generalization
- Leakage violations: Any unauthorized data access is critical
Action Items for Failed Checks
- Investigate the root cause: Determine why the check failed
- Review model architecture: Check for potential overfitting patterns
- Validate data pipeline: Ensure no unintended data leakage
- Retrain with fixes: Address identified issues and re-run validation
- Document findings: Record issues and solutions for future reference
Implementation Details
Technical implementation of integrity checks in the OnboardingAudit.ai framework.
Running Integrity Checks
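The exact entry point isn't shown here; as a sketch, a runner script (the `run_integrity_checks.py` name is hypothetical) could be invoked with the documented options like this:

```python
import subprocess

# Hypothetical runner name; the flags are the documented configuration options.
cmd = [
    "python", "run_integrity_checks.py",
    "--falsification-mode", "shuffle",
    "--perturbation-level", "light",
    "--holdout-ratio", "0.30",
    "--threshold-strictness", "normal",
]
result = subprocess.run(cmd, capture_output=True, text=True, check=False)
print(result.stdout)
if result.returncode != 0:
    raise SystemExit("Integrity checks failed - see output above")
```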
Configuration Options
Integrity checks can be configured through various parameters:
- --falsification-mode: Type of falsification test (shuffle, invert, perturb)
- --perturbation-level: Intensity of DOM perturbations (light, medium, heavy)
- --holdout-ratio: Percentage of data reserved for holdout validation
- --threshold-strictness: Tolerance for acceptance thresholds (strict, normal, relaxed)
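For reference, a minimal argument parser matching these options might look like the following; the choices and defaults are inferred from the descriptions above, not taken from the framework's actual parser:

```python
import argparse

parser = argparse.ArgumentParser(description="OnboardingAudit.ai integrity checks")
parser.add_argument("--falsification-mode",
                    choices=["shuffle", "invert", "perturb"], default="shuffle",
                    help="Type of falsification test to run")
parser.add_argument("--perturbation-level",
                    choices=["light", "medium", "heavy"], default="light",
                    help="Intensity of DOM perturbations")
parser.add_argument("--holdout-ratio", type=float, default=0.30,
                    help="Fraction of fixtures reserved for holdout validation")
parser.add_argument("--threshold-strictness",
                    choices=["strict", "normal", "relaxed"], default="normal",
                    help="Tolerance applied to acceptance thresholds")
args = parser.parse_args()
```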