Validation Results
We validate the LLM's sentence extraction by comparing it against manually labeled ground truth data. This page documents the validation methodology, accuracy metrics, and known limitations.
AI Transparency
Validation itself uses AI: an OpenAI model (FAST_MODEL, currently gpt-4.1-mini) determines whether a bot-extracted sentence “matches” a ground truth sentence (accounting for minor text differences). This creates a second layer of AI decision-making that we document here.
Validation Dataset
Where do these dataset counts come from?
The four dataset counts (transcripts validated, ground truth sentences, bot-extracted sentences, and pairs compared) are computed from the sentence_matches table in .matchquerycache.sqlite, which is produced by ground-truth-comparisons/update_matches.py.
“Transcripts validated” means the number of distinct transcript files present in the validation dataset.
SELECT COUNT(DISTINCT transcript_filename) AS transcripts_validated FROM sentence_matches;
SELECT COUNT(DISTINCT ground_truth_sentence) AS ground_truth_sentences FROM sentence_matches;
SELECT COUNT(DISTINCT bot_identified_sentence) AS bot_extracted_sentences FROM sentence_matches;
SELECT COUNT(*) AS pairs_compared FROM sentence_matches;
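The same counts can be pulled programmatically. The sketch below uses Python's standard sqlite3 module against the table and columns named above; the cache path is assumed to be the working directory, so adjust it to wherever .matchquerycache.sqlite is written.

```python
# Reproduce the dataset counts from the match cache (path assumed to be the working directory).
import sqlite3

QUERIES = {
    "transcripts_validated":   "SELECT COUNT(DISTINCT transcript_filename) FROM sentence_matches",
    "ground_truth_sentences":  "SELECT COUNT(DISTINCT ground_truth_sentence) FROM sentence_matches",
    "bot_extracted_sentences": "SELECT COUNT(DISTINCT bot_identified_sentence) FROM sentence_matches",
    "pairs_compared":          "SELECT COUNT(*) FROM sentence_matches",
}

with sqlite3.connect(".matchquerycache.sqlite") as conn:
    for name, sql in QUERIES.items():
        (count,) = conn.execute(sql).fetchone()
        print(f"{name}: {count}")
```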
Accuracy Metrics
| Metric | Value | Interpretation |
|---|---|---|
| Precision | 63.0% | Of the bot’s positives, how many were correct |
| Recall (Coverage) | 83.6% | Share of ground truth sentences the bot found |
| F1 Score | 71.8% | Harmonic mean of precision and recall |
| Bot Sentences Matched | 148 | Bot sentences that correspond to ground truth labels (TP) |
| Bot Expansion Factor | 1.3x | Ratio of bot-extracted sentences to ground truth sentences; the bot over-extracts |
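As a quick sanity check, the F1 score follows directly from the precision and recall values above (small rounding differences aside, since the reported figure is computed from unrounded counts):

```python
# Recompute F1 as the harmonic mean of the reported precision and recall.
precision = 0.630
recall = 0.836
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.1%}")  # ~71.9% from the rounded inputs, vs. the reported 71.8% from unrounded counts
```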
Method Comparison
This comparison reports overall accuracy for the different processing modes/runs, using the same ground truth and sentence-matching cache. In the tables below, GT is the number of ground truth sentences (TP + FN) and AI is the number of bot-extracted sentences (TP + FP).
| Run | Precision | Recall | F1 | TP | FP | FN | GT | AI |
|---|---|---|---|---|---|---|---|---|
| regex_test | 64.5% | 85.3% | 73.5% | 151 | 83 | 26 | 177 | 234 |
| regex1-2025-06-19 | 64.4% | 84.7% | 73.2% | 150 | 83 | 27 | 177 | 233 |
| analysis_openai | 22.6% | 59.9% | 32.8% | 106 | 364 | 71 | 177 | 470 |
| analysis_gemini | 12.8% | 74.6% | 21.8% | 132 | 900 | 45 | 177 | 1,032 |
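The derived columns are straightforward functions of the confusion counts. A minimal sketch of the arithmetic, checked against the regex_test row:

```python
# Precision, recall, and F1 from raw confusion counts, checked against the regex_test row.
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = prf(tp=151, fp=83, fn=26)  # regex_test
print(f"precision={p:.1%} recall={r:.1%} f1={f1:.1%}")
# precision=64.5% recall=85.3% f1=73.5% -- matches the table row
```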
Method Comparison by Tag
Per-tag precision/recall/F1 for each run (limited to the four core labels). Note that the ground truth set contains no specificity labels (GT = 0), so every bot specificity tag counts as a false positive and the specificity rows report 0.0% across the board.
| Run | Tag | Precision | Recall | F1 | TP | FP | FN | GT | AI |
|---|---|---|---|---|---|---|---|---|---|
| regex_test | mentions_cyber | 64.8% | 85.8% | 73.8% | 151 | 82 | 25 | 176 | 233 |
| regex_test | mentions_board | 53.8% | 60.9% | 57.1% | 14 | 12 | 9 | 23 | 26 |
| regex_test | regulatory_reference | 62.5% | 50.0% | 55.6% | 5 | 3 | 5 | 10 | 8 |
| regex_test | specificity | 0.0% | 0.0% | 0.0% | 0 | 1 | 0 | 0 | 1 |
| regex1-2025-06-19 | mentions_cyber | 64.2% | 84.7% | 73.0% | 149 | 83 | 27 | 176 | 232 |
| regex1-2025-06-19 | mentions_board | 47.2% | 73.9% | 57.6% | 17 | 19 | 6 | 23 | 36 |
| regex1-2025-06-19 | regulatory_reference | 71.4% | 50.0% | 58.8% | 5 | 2 | 5 | 10 | 7 |
| regex1-2025-06-19 | specificity | 0.0% | 0.0% | 0.0% | 0 | 29 | 0 | 0 | 29 |
| analysis_openai | mentions_cyber | 25.2% | 59.7% | 35.4% | 105 | 312 | 71 | 176 | 417 |
| analysis_openai | mentions_board | 20.6% | 56.5% | 30.2% | 13 | 50 | 10 | 23 | 63 |
| analysis_openai | regulatory_reference | 16.7% | 50.0% | 25.0% | 5 | 25 | 5 | 10 | 30 |
| analysis_openai | specificity | 0.0% | 0.0% | 0.0% | 0 | 173 | 0 | 0 | 173 |
| analysis_gemini | mentions_cyber | 12.8% | 75.0% | 21.9% | 132 | 897 | 44 | 176 | 1,029 |
| analysis_gemini | mentions_board | 12.5% | 87.0% | 21.9% | 20 | 140 | 3 | 23 | 160 |
| analysis_gemini | regulatory_reference | 8.0% | 80.0% | 14.5% | 8 | 92 | 2 | 10 | 100 |
| analysis_gemini | specificity | 0.0% | 0.0% | 0.0% | 0 | 634 | 0 | 0 | 634 |
Metrics by Tag
| Tag | Precision | Recall | F1 | TP | FP | FN |
|---|---|---|---|---|---|---|
| mentions_cyber | 62.8% | 83.5% | 71.7% | 147 | 87 | 29 |
| mentions_board | 39.5% | 73.9% | 51.5% | 17 | 26 | 6 |
| regulatory_reference | 63.6% | 70.0% | 66.7% | 7 | 4 | 3 |
| specificity | 0.0% | 0.0% | 0.0% | 0 | 92 | 0 |
Validation by Transcript
Performance varies by transcript. Some transcripts have high match rates while others show divergence between human and AI labeling.
| Transcript | GT Matched | Bot Matched | Pairs Checked |
|---|---|---|---|
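The per-transcript figures are aggregates over sentence_matches. The query below is only a sketch of the kind of aggregation involved; is_match is a placeholder name for the stored match decision (the real column name may differ), and the cache path is assumed.

```python
# Sketch of per-transcript aggregation from the match cache.
# `is_match` is a placeholder column name for the stored match decision.
import sqlite3

SQL = """
SELECT
    transcript_filename,
    COUNT(DISTINCT CASE WHEN is_match THEN ground_truth_sentence END)   AS gt_matched,
    COUNT(DISTINCT CASE WHEN is_match THEN bot_identified_sentence END) AS bot_matched,
    COUNT(*)                                                             AS pairs_checked
FROM sentence_matches
GROUP BY transcript_filename
ORDER BY transcript_filename
"""

with sqlite3.connect(".matchquerycache.sqlite") as conn:
    for transcript, gt_matched, bot_matched, pairs in conn.execute(SQL):
        print(f"{transcript}: GT matched={gt_matched}, bot matched={bot_matched}, pairs={pairs}")
```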
Quality Assurance Process
Ground Truth Collection
Human researchers manually labeled cybersecurity-relevant sentences in a subset of transcripts, applying the same label categories the bot uses (mentions_cyber, mentions_board, regulatory_reference, specificity).
Sentence Matching
An OpenAI model (FAST_MODEL, currently gpt-4.1-mini, temperature=0) compares each ground truth sentence against each bot-extracted sentence to determine if they represent the same content (accounting for tokenization differences and minor text variations). Match decisions are stored in sentence_matches.
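The call below is an illustrative sketch of such a pairwise check using the OpenAI Python client, not the project's actual prompt or code. The model name and temperature come from the description above; the prompt wording, function name, and YES/NO parsing are assumptions.

```python
# Illustrative sketch of an LLM-based sentence match check (not the project's actual prompt).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
FAST_MODEL = "gpt-4.1-mini"

def sentences_match(ground_truth: str, bot_sentence: str) -> bool:
    """Ask the model whether two sentences convey the same content."""
    response = client.chat.completions.create(
        model=FAST_MODEL,
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Answer YES or NO: do these two sentences express the same content, "
                        "ignoring tokenization differences and minor text variations?"},
            {"role": "user",
             "content": f"Ground truth: {ground_truth}\nBot-extracted: {bot_sentence}"},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```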
Overseer QA
An "overseer" model reviews ground truth sentences that the bot skipped, evaluating whether the miss was justified or indicates a systematic gap.
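A rough outline of that review step is sketched below. The prompt, verdict labels, and model choice are assumptions for illustration; the source only says that an overseer model reviews skipped sentences.

```python
# Hedged outline of the overseer pass; the prompt and verdict labels are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
OVERSEER_MODEL = "gpt-4.1-mini"  # assumption: the overseer model is not named in this page

def overseer_review(missed_sentence: str) -> str:
    """Ask the overseer whether skipping this ground truth sentence was justified."""
    response = client.chat.completions.create(
        model=OVERSEER_MODEL,
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You review misses made by a sentence extractor. Was skipping this "
                        "ground-truth sentence justified, or does it indicate a systematic gap? "
                        "Answer JUSTIFIED or SYSTEMATIC_GAP, followed by one short reason."},
            {"role": "user", "content": missed_sentence},
        ],
    )
    return response.choices[0].message.content.strip()
```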
Pattern Analysis
We train classifiers on missed sentences to identify phrase patterns associated with false negatives, informing prompt refinement.
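One plausible setup for this step is sketched below: TF-IDF n-gram features with a logistic regression classifier, inspecting the most positive weights. The feature choice, classifier, and function name are assumptions, not a description of the project's actual pipeline.

```python
# Hedged sketch: surface phrase patterns associated with false negatives.
# The TF-IDF + logistic regression setup is an assumption, not necessarily the
# classifier the project actually trains.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def top_miss_patterns(missed: list[str], caught: list[str], k: int = 20) -> list[tuple[float, str]]:
    """Return the k n-grams most predictive of a ground truth sentence being missed.

    `missed` are ground truth sentences the bot skipped (false negatives);
    `caught` are ground truth sentences the bot found (true positives).
    """
    texts = missed + caught
    labels = [1] * len(missed) + [0] * len(caught)

    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    X = vectorizer.fit_transform(texts)

    clf = LogisticRegression(max_iter=1000).fit(X, labels)

    # Positive weights push toward the "missed" class, so the largest ones
    # point at phrasing the extractor tends to overlook.
    terms = vectorizer.get_feature_names_out()
    return sorted(zip(clf.coef_[0], terms), reverse=True)[:k]
```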
Known Issues
- Sentence boundary mismatches: Different tokenizers split sentences at different points
- Context-dependent references: Pronouns like "it" referring to cybersecurity may be missed
- Implicit mentions: "The incident" referring to a known breach may not be detected without context
- Ground truth limitations: Human labeling may also miss relevant sentences or include marginal ones
Continuous Improvement
Validation findings feed back into prompt engineering and processing improvements:
- Identified patterns in missed sentences inform keyword list expansion
- Context window size adjustments improve detection of implicit references
- Label criteria refinements reduce classification ambiguity