Validation Results

We validate the LLM's sentence extraction by comparing against manually-labeled ground truth data. This page documents the validation methodology, accuracy metrics, and known limitations.

AI Transparency

Validation itself uses AI: an OpenAI model (FAST_MODEL, currently gpt-4.1-mini) determines whether a bot-extracted sentence “matches” a ground truth sentence (accounting for minor text differences). This creates a second layer of AI decision-making that we document here.

Validation Dataset

45      Transcripts Validated
746     Ground Truth Sentences
2,849   Bot-Extracted Sentences
30,849  Pairs Compared
Where do these dataset counts come from?

These four numbers are computed from the sentence_matches table in .matchquerycache.sqlite, which is produced by ground-truth-comparisons/update_matches.py. “Transcripts validated” means the number of distinct transcript files present in the validation dataset.

SELECT COUNT(DISTINCT transcript_filename)  AS transcripts_validated
FROM sentence_matches;

SELECT COUNT(DISTINCT ground_truth_sentence) AS ground_truth_sentences
FROM sentence_matches;

SELECT COUNT(DISTINCT bot_identified_sentence) AS bot_extracted_sentences
FROM sentence_matches;

SELECT COUNT(*) AS pairs_compared
FROM sentence_matches;
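These queries can also be run together with Python's built-in sqlite3 module. The sketch below is illustrative; the database path is assumed from the filename mentioned above and may need adjusting to its actual location in the repository.

import sqlite3

# Path assumed from the filename referenced above; adjust as needed.
DB_PATH = ".matchquerycache.sqlite"

QUERIES = {
    "transcripts_validated": "SELECT COUNT(DISTINCT transcript_filename) FROM sentence_matches",
    "ground_truth_sentences": "SELECT COUNT(DISTINCT ground_truth_sentence) FROM sentence_matches",
    "bot_extracted_sentences": "SELECT COUNT(DISTINCT bot_identified_sentence) FROM sentence_matches",
    "pairs_compared": "SELECT COUNT(*) FROM sentence_matches",
}

with sqlite3.connect(DB_PATH) as conn:
    for name, sql in QUERIES.items():
        (count,) = conn.execute(sql).fetchone()
        print(f"{name}: {count:,}")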

Accuracy Metrics

Metric                 Value  Interpretation
Precision              63.0%  Of the bot's positives, how many were correct
Recall (Coverage)      83.6%  Share of ground truth sentences the bot found
F1 Score               71.8%  Harmonic mean of precision and recall
Bot Sentences Matched  148    Bot sentences that correspond to ground truth labels (TP)
Bot Expansion Factor   1.3x   Bot extracts more sentences than ground truth contains
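Precision, recall, and F1 here and in the tables below follow the standard definitions over true positives (TP), false positives (FP), and false negatives (FN). The sketch below reproduces the regex_test row of the Method Comparison table as a worked example; the function name is ours, not part of the pipeline.

def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    # Precision: share of bot positives that were correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of ground truth sentences the bot found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# regex_test row: TP=151, FP=83, FN=26 -> 64.5% precision, 85.3% recall, 73.5% F1
print(precision_recall_f1(151, 83, 26))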

Method Comparison

This comparison reports overall accuracy for different processing runs using the same ground truth and sentence-matching cache. In these tables, GT is the number of ground truth sentences in scope (TP + FN) and AI is the number of bot-identified sentences for that run (TP + FP).

Run                Precision  Recall  F1     TP   FP   FN  GT   AI
regex_test         64.5%      85.3%   73.5%  151  83   26  177  234
regex1-2025-06-19  64.4%      84.7%   73.2%  150  83   27  177  233
analysis_openai    22.6%      59.9%   32.8%  106  364  71  177  470
analysis_gemini    12.8%      74.6%   21.8%  132  900  45  177  1,032

Method Comparison by Tag

Per-tag precision/recall/F1 for each run (limited to the four core labels).

Run                Tag                   Precision  Recall  F1     TP   FP   FN  GT   AI
regex_test         mentions_cyber        64.8%      85.8%   73.8%  151  82   25  176  233
regex_test         mentions_board        53.8%      60.9%   57.1%  14   12   9   23   26
regex_test         regulatory_reference  62.5%      50.0%   55.6%  5    3    5   10   8
regex_test         specificity           0.0%       0.0%    0.0%   0    1    0   0    1
regex1-2025-06-19  mentions_cyber        64.2%      84.7%   73.0%  149  83   27  176  232
regex1-2025-06-19  mentions_board        47.2%      73.9%   57.6%  17   19   6   23   36
regex1-2025-06-19  regulatory_reference  71.4%      50.0%   58.8%  5    2    5   10   7
regex1-2025-06-19  specificity           0.0%       0.0%    0.0%   0    29   0   0    29
analysis_openai    mentions_cyber        25.2%      59.7%   35.4%  105  312  71  176  417
analysis_openai    mentions_board        20.6%      56.5%   30.2%  13   50   10  23   63
analysis_openai    regulatory_reference  16.7%      50.0%   25.0%  5    25   5   10   30
analysis_openai    specificity           0.0%       0.0%    0.0%   0    173  0   0    173
analysis_gemini    mentions_cyber        12.8%      75.0%   21.9%  132  897  44  176  1,029
analysis_gemini    mentions_board        12.5%      87.0%   21.9%  20   140  3   23   160
analysis_gemini    regulatory_reference  8.0%       80.0%   14.5%  8    92   2   10   100
analysis_gemini    specificity           0.0%       0.0%    0.0%   0    634  0   0    634

Metrics by Tag

Tag                   Precision  Recall  F1     TP   FP  FN
mentions_cyber        62.8%      83.5%   71.7%  147  87  29
mentions_board        39.5%      73.9%   51.5%  17   26  6
regulatory_reference  63.6%      70.0%   66.7%  7    4   3
specificity           0.0%       0.0%    0.0%   0    92  0

Validation by Transcript

Performance varies by transcript. Some transcripts have high match rates while others show divergence between human and AI labeling.

Transcript  GT Matched  Bot Matched  Pairs Checked

Quality Assurance Process

1. Ground Truth Collection

Human researchers manually labeled cybersecurity-relevant sentences in a subset of transcripts, applying the same label categories (mentions_cyber, mentions_board, regulatory_reference, specificity).

2. Sentence Matching

An OpenAI model (FAST_MODEL, currently gpt-4.1-mini, temperature=0) compares each ground truth sentence against each bot-extracted sentence to determine whether they represent the same content (accounting for tokenization differences and minor text variations). Match decisions are stored in sentence_matches.
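The documented facts here are the model role (FAST_MODEL, currently gpt-4.1-mini) and temperature=0; the prompt wording and function name in the sketch below are illustrative assumptions, not the exact implementation in ground-truth-comparisons/update_matches.py.

from openai import OpenAI

client = OpenAI()
FAST_MODEL = "gpt-4.1-mini"  # configured model name, per the description above

def sentences_match(ground_truth: str, bot_sentence: str) -> bool:
    # Ask the fast model whether two sentence variants express the same content.
    # The prompt text is illustrative; the real wording lives in update_matches.py.
    response = client.chat.completions.create(
        model=FAST_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": "Answer YES or NO only."},
            {
                "role": "user",
                "content": (
                    "Do these two sentences express the same underlying statement, "
                    "ignoring tokenization and minor text differences?\n"
                    f"A: {ground_truth}\nB: {bot_sentence}"
                ),
            },
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")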

3. Overseer QA

An "overseer" model reviews ground truth sentences that the bot skipped, evaluating whether the miss was justified or indicates a systematic gap.

4. Pattern Analysis

We train classifiers on missed sentences to identify phrase patterns associated with false negatives, informing prompt refinement.
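A plausible sketch of this step, assuming scikit-learn is available: word n-gram TF-IDF features and a logistic regression classifier are our assumptions, not necessarily the classifiers actually used.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def top_miss_ngrams(sentences: list[str], was_missed: list[int], k: int = 20) -> list[str]:
    # Fit a simple missed-vs-found classifier and return the n-grams whose
    # weights most strongly predict a false negative.
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), min_df=2)
    features = vectorizer.fit_transform(sentences)
    clf = LogisticRegression(max_iter=1000).fit(features, was_missed)
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(terms, clf.coef_[0]), key=lambda tw: tw[1], reverse=True)
    return [term for term, _ in ranked[:k]]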

Known Issues

Continuous Improvement

Validation findings feed back into prompt engineering and processing improvements: