Integrity Checks & Robustness Validation

Comprehensive validation framework ensuring benchmark reliability, detecting overfitting, and maintaining scoring integrity through systematic falsification tests and perturbation analysis.

Why Integrity Checks Matter

Benchmark integrity is crucial for ensuring that scoring algorithms provide meaningful and reliable results. Without proper validation, models can overfit to specific test cases or exploit unintended patterns in the data.

Overfitting Risk: Without integrity checks, scoring systems might learn to recognize specific benchmark patterns rather than general onboarding quality principles, leading to poor performance on real-world scenarios.

Our integrity validation framework addresses three critical areas:

  • Data Leakage Prevention: Ensures scoring logic doesn't access ground truth data
  • Robustness Validation: Tests sensitivity to input perturbations and variations
  • Falsification Testing: Validates that degraded inputs produce appropriately degraded scores

Ground Truth Leakage Detection

Ground truth leakage occurs when the scoring system has access to the correct answers during evaluation, leading to artificially inflated performance metrics.

Direct Data Access

Prevents scoring algorithms from reading meta.json files or accessing fixture directories during evaluation.

✅ PASS

Meta Information Isolation

Ensures no imports of ground truth data in probe implementations or scoring logic.

✅ PASS

Directory Traversal Protection

Blocks attempts to traverse fixture directories to discover test case information.

✅ PASS
// Example: leakage detection summary in validate-benchmarks.mjs
const leakageChecks = [
  { name: 'No direct meta.json imports', passed: true },
  { name: 'No fixture directory traversal', passed: true },
  { name: 'No ground truth data access', passed: true },
  { name: 'Scoring logic isolation', passed: true }
];

// Every check must pass for validation to succeed
if (leakageChecks.every(check => check.passed)) {
  console.log('✅ No ground truth leakage detected');
}
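
As a rough illustration of how such a scan can work, the sketch below statically searches scoring and probe sources for forbidden references. The directory layout, file-extension filter, and pattern list are assumptions made for illustration and do not reflect the exact logic of scripts/leakage_check.mjs.

// Sketch: static source scan for leakage patterns (illustrative only)
import { readFileSync, readdirSync } from 'node:fs';
import { join } from 'node:path';

const forbiddenPatterns = [/meta\.json/, /fixtures\//, /ground[-_]?truth/i];

function findLeakage(dir) {
  const violations = [];
  for (const file of readdirSync(dir, { recursive: true })) {
    if (!String(file).endsWith('.mjs')) continue;
    const source = readFileSync(join(dir, String(file)), 'utf8');
    for (const pattern of forbiddenPatterns) {
      if (pattern.test(source)) violations.push({ file, pattern: String(pattern) });
    }
  }
  return violations;
}

console.log(findLeakage('scripts')); // expect [] when no leakage patterns are present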

Falsification Tests

Falsification tests intentionally degrade input quality to verify that the scoring system responds appropriately. These tests help detect overfitting and ensure the model generalizes well.

Label Shuffle Falsification

Randomly shuffles heuristic labels to test if the scoring system is learning meaningful patterns rather than memorizing specific combinations.

Expected Result: Scores should degrade significantly (typically 40-60%) when labels are shuffled, indicating the system learns meaningful patterns.
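
As a sketch of what label shuffling can look like, assuming heuristic labels are stored as a plain key-value object (the structure and example heuristic IDs below are illustrative, not the benchmark's actual label format):

// Sketch: shuffle heuristic label values across keys (Fisher-Yates)
function shuffleLabels(labels) {
  const values = Object.values(labels);
  for (let i = values.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [values[i], values[j]] = [values[j], values[i]];
  }
  // Reattach the shuffled values to the original heuristic IDs
  return Object.fromEntries(Object.keys(labels).map((key, i) => [key, values[i]]));
}

const shuffled = shuffleLabels({
  'H-CTA-ABOVE-FOLD': true,
  'H-COPY-CLARITY': 'good',
  'H-TRUST-MARKERS': true,
  'H-PERCEIVED-SPEED': 'fast'
});
// Scoring against shuffled labels should push performance into the 40-60% degradation band.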

DOM Perturbation Tests

Applies systematic perturbations to the DOM structure while preserving semantic meaning:

Perturbation Type   | Description                         | Expected Impact                  | Acceptance Threshold
Light Perturbation  | Minor CSS changes, font variations  | Minimal score change (<5%)       | ±10%
Medium Perturbation | Layout shifts, spacing changes      | Moderate score change (5-15%)    | ±20%
Heavy Perturbation  | Structural DOM changes              | Significant score change (>15%)  | ±30%
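
A minimal sketch of what these perturbation levels can look like when applied to a raw HTML string; the specific substitutions are illustrative assumptions, not the transformations used by scripts/run-benchmarks.mjs:

// Sketch: apply a DOM perturbation of a given level to an HTML string
function perturbHtml(html, level = 'light') {
  switch (level) {
    case 'light':
      // Minor CSS changes: normalize font-family declarations
      return html.replace(/font-family:[^;"]+/g, 'font-family: sans-serif');
    case 'medium':
      // Layout shifts: inject extra spacing on block elements
      return html.replace(/<div/g, '<div style="margin: 8px"');
    case 'heavy':
      // Structural changes: wrap sections in an extra container
      return html.replace(/<section([^>]*)>/g, '<div><section$1>')
                 .replace(/<\/section>/g, '</section></div>');
    default:
      throw new Error(`Unknown perturbation level: ${level}`);
  }
}

const perturbed = perturbHtml('<section class="hero"><div>Sign up</div></section>', 'heavy');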

Inverted Heuristics Test

Intentionally inverts heuristic logic to test if the scoring system recognizes degraded quality:

// Example: inverted heuristics test
const invertedHeuristics = {
  'H-CTA-ABOVE-FOLD': false,   // CTA should be below fold
  'H-COPY-CLARITY': 'poor',    // Copy should be unclear
  'H-TRUST-MARKERS': false,    // No trust signals
  'H-PERCEIVED-SPEED': 'slow'  // Process should feel slow
};

// Expected: score should decrease by 40-60%
const expectedDegradation = 0.5; // 50%
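
A minimal check of that expectation, using baseline and inverted scores assumed purely for illustration:

// Hypothetical scores for illustration only
const baselineScore = 82;
const invertedScore = 39;

const degradation = (baselineScore - invertedScore) / baselineScore;
console.log(degradation >= 0.4 && degradation <= 0.6
  ? '✅ Falsification response within the expected 40-60% band'
  : '⚠️ Degradation outside the expected band - investigate');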

CI Test Matrix

Our continuous integration runs a comprehensive test matrix with different falsification modes:

  • A-Baseline-Train-NoPerturb (Baseline): Standard benchmark on training data. Targets: Macro-F1 ≥ 0.85, R² ≥ 0.80
  • B-Perturbation-Train-Light (Robustness): Light DOM perturbation testing. Purpose: Evaluate stability under layout changes
  • C-Holdout-NoPerturb (Validation): Independent validation on unseen data. Targets: Macro-F1 ≥ 0.80, R² ≥ 0.75
  • D-Falsification-Train-Shuffle (Falsification): Label shuffle falsification test. Targets: Macro-F1 ≤ 0.40, R² ≤ 0.10
  • E-BandMidpoint-Ablation (Ablation): Band midpoint sensitivity analysis. Purpose: Test scoring calibration stability
  • F-Leakage-Sentry (Strict): Enhanced validation with strict thresholds. Targets: Macro-F1 ≥ 0.90, R² ≥ 0.85
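
Expressed as data, the matrix targets can be checked programmatically. The sketch below mirrors the numeric targets listed above; the object shape and key names are assumptions, not the repository's actual CI configuration:

// Matrix runs with quantitative targets (runs defined only by purpose are omitted)
const ciMatrix = [
  { id: 'A-Baseline-Train-NoPerturb',    macroF1: { min: 0.85 }, r2: { min: 0.80 } },
  { id: 'C-Holdout-NoPerturb',           macroF1: { min: 0.80 }, r2: { min: 0.75 } },
  { id: 'D-Falsification-Train-Shuffle', macroF1: { max: 0.40 }, r2: { max: 0.10 } },
  { id: 'F-Leakage-Sentry',              macroF1: { min: 0.90 }, r2: { min: 0.85 } }
];

// A target is met when the observed metric is on the right side of the bound
function meetsTarget(value, target) {
  return target.min !== undefined ? value >= target.min : value <= target.max;
}

const run = { id: 'D-Falsification-Train-Shuffle', macroF1: 0.31, r2: 0.04 };
const spec = ciMatrix.find(entry => entry.id === run.id);
console.log(meetsTarget(run.macroF1, spec.macroF1) && meetsTarget(run.r2, spec.r2)); // true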

Holdout Validation

Holdout validation uses separate test sets that are not used during development to provide an unbiased estimate of model performance.

Holdout Dataset Structure

Our holdout validation includes the following (see the split sketch after this list):

  • 30% of total fixtures: Reserved exclusively for final validation
  • Balanced categories: Each category represented proportionally
  • Temporal separation: Holdout data collected after training set
  • Blind evaluation: No access to holdout results during development
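
A minimal sketch of a stratified 30% split, assuming each fixture carries a category field (the fixture shape and example categories are assumptions for illustration):

// Sketch: reserve the same proportion of each category for the holdout set
function splitHoldout(fixtures, ratio = 0.3) {
  const byCategory = new Map();
  for (const fixture of fixtures) {
    const group = byCategory.get(fixture.category) ?? [];
    group.push(fixture);
    byCategory.set(fixture.category, group);
  }
  const train = [];
  const holdout = [];
  for (const group of byCategory.values()) {
    const cut = Math.round(group.length * ratio);
    holdout.push(...group.slice(0, cut));
    train.push(...group.slice(cut));
  }
  return { train, holdout };
}

const { train, holdout } = splitHoldout([
  { id: 'fx-01', category: 'saas' },
  { id: 'fx-02', category: 'saas' },
  { id: 'fx-03', category: 'ecommerce' }
]);
console.log(holdout.length); // roughly 30% of each category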

Category Balance

Ensures holdout set maintains proportional representation across all benchmark categories.

✅ PASS

Temporal Validation

Validates that performance doesn't degrade significantly on newer data patterns.

✅ PASS

Generalization Test

Measures performance consistency between training and holdout datasets.

✅ PASS

Acceptance Thresholds

Clear thresholds define what constitutes acceptable performance across different integrity checks.

Check Type          | Metric                   | Acceptable Range | Warning Threshold | Failure Threshold
Data Leakage        | Direct access violations | 0 violations     | N/A               | >0 violations
Label Shuffle       | Score degradation        | >30% decrease    | 20-30% decrease   | <20% decrease
DOM Perturbation    | Score variance           | <±20%            | ±20-30%           | >±30%
Inverted Heuristics | Score degradation        | >40% decrease    | 30-40% decrease   | <30% decrease
Holdout Validation  | Performance gap          | <10% difference  | 10-15% difference | >15% difference

Threshold Violations: When any check crosses its warning threshold, it indicates potential issues with model robustness or generalization that should be investigated.
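
A sketch of how these thresholds could be applied when summarizing a run; the metric keys and result shape are illustrative assumptions, not the framework's actual reporting format:

// Sketch: classify a check result against the thresholds in the table above
function classifyCheck({ metric, value }) {
  const thresholds = {
    labelShuffleDegradation: { pass: v => v > 0.30, warn: v => v >= 0.20 },
    perturbationVariance:    { pass: v => Math.abs(v) < 0.20, warn: v => Math.abs(v) <= 0.30 },
    invertedDegradation:     { pass: v => v > 0.40, warn: v => v >= 0.30 },
    holdoutGap:              { pass: v => v < 0.10, warn: v => v <= 0.15 }
  };
  const t = thresholds[metric];
  if (!t) throw new Error(`Unknown metric: ${metric}`);
  if (t.pass(value)) return 'PASS';
  if (t.warn(value)) return 'WARN';
  return 'FAIL';
}

console.log(classifyCheck({ metric: 'labelShuffleDegradation', value: 0.45 })); // PASS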

Interpreting Results

Understanding integrity check results is crucial for maintaining benchmark reliability.

Green Flags (Good Signs)

  • Strong falsification response: 40-60% score degradation on inverted tests
  • Robust perturbation handling: Minimal sensitivity to minor DOM changes
  • Consistent holdout performance: <10% difference from training results
  • No leakage violations: Zero unauthorized data access attempts

Red Flags (Warning Signs)

  • Weak falsification response: <30% degradation suggests overfitting
  • Excessive perturbation sensitivity: >30% variance indicates instability
  • Holdout performance gap: >15% difference suggests poor generalization
  • Leakage violations: Any unauthorized data access is critical

Action Items for Failed Checks

  1. Investigate the root cause: Determine why the check failed
  2. Review model architecture: Check for potential overfitting patterns
  3. Validate data pipeline: Ensure no unintended data leakage
  4. Retrain with fixes: Address identified issues and re-run validation
  5. Document findings: Record issues and solutions for future reference

Implementation Details

Technical implementation of integrity checks in the OnboardingAudit.ai framework.

Running Integrity Checks

# Run all integrity checks
npm run test:integrity

# Run specific falsification tests
node scripts/validate-benchmarks.mjs --mode inverted-heuristics
node scripts/validate-benchmarks.mjs --mode label-shuffle

# Run perturbation tests
node scripts/run-benchmarks.mjs --perturb light
node scripts/run-benchmarks.mjs --perturb medium
node scripts/run-benchmarks.mjs --perturb heavy

# Check for data leakage
node scripts/leakage_check.mjs

Configuration Options

Integrity checks can be configured through various parameters (a parsing sketch follows the list):

  • --falsification-mode: Type of falsification test (shuffle, invert, perturb)
  • --perturbation-level: Intensity of DOM perturbations (light, medium, heavy)
  • --holdout-ratio: Percentage of data reserved for holdout validation
  • --threshold-strictness: Tolerance for acceptance thresholds (strict, normal, relaxed)
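
As a sketch of how these flags might be consumed, assuming they are parsed with node:util's parseArgs (the defaults below, and whether the actual scripts use parseArgs at all, are assumptions):

// Sketch: parse the configuration flags listed above
import { parseArgs } from 'node:util';

const { values } = parseArgs({
  options: {
    'falsification-mode':   { type: 'string', default: 'shuffle' },
    'perturbation-level':   { type: 'string', default: 'light' },
    'holdout-ratio':        { type: 'string', default: '0.3' },
    'threshold-strictness': { type: 'string', default: 'normal' }
  }
});

console.log(`falsification: ${values['falsification-mode']}, ` +
            `perturbation: ${values['perturbation-level']}, ` +
            `holdout ratio: ${Number(values['holdout-ratio'])}, ` +
            `strictness: ${values['threshold-strictness']}`);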