Validation Results
We validate the LLM's sentence extraction by comparing it against manually labeled ground truth data. This page documents the validation methodology, accuracy metrics, and known limitations.
AI Transparency
Validation itself uses AI: an OpenAI model (FAST_MODEL, currently gpt-4.1-mini) determines whether a bot-extracted sentence “matches” a ground truth sentence (accounting for minor text differences). This creates a second layer of AI decision-making that we document here.
Validation Dataset
Where do these dataset counts come from?
The four dataset counts (transcripts validated, ground truth sentences, bot-extracted sentences, and pairs compared) are computed from the sentence_matches table in .matchquerycache.sqlite, which is produced by ground-truth-comparisons/update_matches.py.
“Transcripts validated” means the number of distinct transcript files present in the validation dataset.
SELECT COUNT(DISTINCT transcript_filename) AS transcripts_validated FROM sentence_matches;
SELECT COUNT(DISTINCT ground_truth_sentence) AS ground_truth_sentences FROM sentence_matches;
SELECT COUNT(DISTINCT bot_identified_sentence) AS bot_extracted_sentences FROM sentence_matches;
SELECT COUNT(*) AS pairs_compared FROM sentence_matches;
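The same counts can be pulled programmatically. The sketch below uses Python's standard sqlite3 module against the table and columns named above; the cache path is assumed to be the working directory, so adjust it to wherever .matchquerycache.sqlite is written.

```python
# Reproduce the dataset counts from the match cache (path assumed to be the working directory).
import sqlite3

QUERIES = {
    "transcripts_validated":   "SELECT COUNT(DISTINCT transcript_filename) FROM sentence_matches",
    "ground_truth_sentences":  "SELECT COUNT(DISTINCT ground_truth_sentence) FROM sentence_matches",
    "bot_extracted_sentences": "SELECT COUNT(DISTINCT bot_identified_sentence) FROM sentence_matches",
    "pairs_compared":          "SELECT COUNT(*) FROM sentence_matches",
}

with sqlite3.connect(".matchquerycache.sqlite") as conn:
    for name, sql in QUERIES.items():
        (count,) = conn.execute(sql).fetchone()
        print(f"{name}: {count}")
```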
Accuracy Metrics
| Metric | Value | Interpretation |
|---|---|---|
| Precision | 63.0% | Of the bot’s positives, how many were correct |
| Recall (Coverage) | 83.6% | Share of ground truth sentences the bot found |
| F1 Score | 71.8% | Harmonic mean of precision and recall |
| Bot Sentences Matched | 148 | Bot sentences that correspond to ground truth labels (TP) |
| Bot Expansion Factor | 1.3x | Ratio of bot-extracted sentences to ground truth sentences; the bot over-extracts |
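As a quick sanity check, the F1 score follows directly from the precision and recall values above (small rounding differences aside, since the reported figure is computed from unrounded counts):

```python
# Recompute F1 as the harmonic mean of the reported precision and recall.
precision = 0.630
recall = 0.836
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.1%}")  # ~71.9% from the rounded inputs, vs. the reported 71.8% from unrounded counts
```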
Method Comparison
This comparison reports overall accuracy for the different processing modes/runs, using the same ground truth and sentence-matching cache. In the tables below, GT is the number of ground truth sentences (TP + FN) and AI is the number of bot-extracted sentences (TP + FP).
| Run | Precision | Recall | F1 | TP | FP | FN | GT | AI |
|---|---|---|---|---|---|---|---|---|
| regex_test | 64.5% | 85.3% | 73.5% | 151 | 83 | 26 | 177 | 234 |
| regex1-2025-06-19 | 64.4% | 84.7% | 73.2% | 150 | 83 | 27 | 177 | 233 |
| analysis_openai | 22.6% | 59.9% | 32.8% | 106 | 364 | 71 | 177 | 470 |
| analysis_gemini | 12.8% | 74.6% | 21.8% | 132 | 900 | 45 | 177 | 1,032 |
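The derived columns are straightforward functions of the confusion counts. A minimal sketch of the arithmetic, checked against the regex_test row:

```python
# Precision, recall, and F1 from raw confusion counts, checked against the regex_test row.
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = prf(tp=151, fp=83, fn=26)  # regex_test
print(f"precision={p:.1%} recall={r:.1%} f1={f1:.1%}")
# precision=64.5% recall=85.3% f1=73.5% -- matches the table row
```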
Method Comparison by Tag
Per-tag precision/recall/F1 for each run (limited to the four core labels). Note that the ground truth set contains no specificity labels (GT = 0), so every bot specificity tag counts as a false positive and the specificity rows report 0.0% across the board.
| Run | Tag | Precision | Recall | F1 | TP | FP | FN | GT | AI |
|---|---|---|---|---|---|---|---|---|---|
| regex_test | mentions_cyber | 64.8% | 85.8% | 73.8% | 151 | 82 | 25 | 176 | 233 |
| regex_test | mentions_board | 53.8% | 60.9% | 57.1% | 14 | 12 | 9 | 23 | 26 |
| regex_test | regulatory_reference | 62.5% | 50.0% | 55.6% | 5 | 3 | 5 | 10 | 8 |
| regex_test | specificity | 0.0% | 0.0% | 0.0% | 0 | 1 | 0 | 0 | 1 |
| regex1-2025-06-19 | mentions_cyber | 64.2% | 84.7% | 73.0% | 149 | 83 | 27 | 176 | 232 |
| regex1-2025-06-19 | mentions_board | 47.2% | 73.9% | 57.6% | 17 | 19 | 6 | 23 | 36 |
| regex1-2025-06-19 | regulatory_reference | 71.4% | 50.0% | 58.8% | 5 | 2 | 5 | 10 | 7 |
| regex1-2025-06-19 | specificity | 0.0% | 0.0% | 0.0% | 0 | 29 | 0 | 0 | 29 |
| analysis_openai | mentions_cyber | 25.2% | 59.7% | 35.4% | 105 | 312 | 71 | 176 | 417 |
| analysis_openai | mentions_board | 20.6% | 56.5% | 30.2% | 13 | 50 | 10 | 23 | 63 |
| analysis_openai | regulatory_reference | 16.7% | 50.0% | 25.0% | 5 | 25 | 5 | 10 | 30 |
| analysis_openai | specificity | 0.0% | 0.0% | 0.0% | 0 | 173 | 0 | 0 | 173 |
| analysis_gemini | mentions_cyber | 12.8% | 75.0% | 21.9% | 132 | 897 | 44 | 176 | 1,029 |
| analysis_gemini | mentions_board | 12.5% | 87.0% | 21.9% | 20 | 140 | 3 | 23 | 160 |
| analysis_gemini | regulatory_reference | 8.0% | 80.0% | 14.5% | 8 | 92 | 2 | 10 | 100 |
| analysis_gemini | specificity | 0.0% | 0.0% | 0.0% | 0 | 634 | 0 | 0 | 634 |
Metrics by Tag
| Tag | Precision | Recall | F1 | TP | FP | FN |
|---|---|---|---|---|---|---|
| mentions_cyber | 62.8% | 83.5% | 71.7% | 147 | 87 | 29 |
| mentions_board | 39.5% | 73.9% | 51.5% | 17 | 26 | 6 |
| regulatory_reference | 63.6% | 70.0% | 66.7% | 7 | 4 | 3 |
| specificity | 0.0% | 0.0% | 0.0% | 0 | 92 | 0 |
Validation by Transcript
Performance varies by transcript. Some transcripts have high match rates while others show divergence between human and AI labeling.
| Transcript | GT Matched | Bot Matched | Pairs Checked |
|---|---|---|---|
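The per-transcript figures are aggregates over sentence_matches. The query below is only a sketch of the kind of aggregation involved; is_match is a placeholder name for the stored match decision (the real column name may differ), and the cache path is assumed.

```python
# Sketch of per-transcript aggregation from the match cache.
# `is_match` is a placeholder column name for the stored match decision.
import sqlite3

SQL = """
SELECT
    transcript_filename,
    COUNT(DISTINCT CASE WHEN is_match THEN ground_truth_sentence END)   AS gt_matched,
    COUNT(DISTINCT CASE WHEN is_match THEN bot_identified_sentence END) AS bot_matched,
    COUNT(*)                                                             AS pairs_checked
FROM sentence_matches
GROUP BY transcript_filename
ORDER BY transcript_filename
"""

with sqlite3.connect(".matchquerycache.sqlite") as conn:
    for transcript, gt_matched, bot_matched, pairs in conn.execute(SQL):
        print(f"{transcript}: GT matched={gt_matched}, bot matched={bot_matched}, pairs={pairs}")
```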
Quality Assurance Process
Ground Truth Collection
Human researchers manually labeled cybersecurity-relevant sentences in a subset of transcripts, applying the same label categories the bot uses (mentions_cyber, mentions_board, regulatory_reference, specificity).
Sentence Matching
An OpenAI model (FAST_MODEL, currently gpt-4.1-mini, temperature=0) compares each ground truth sentence against each bot-extracted sentence to determine if they represent the same content (accounting for tokenization differences and minor text variations). Match decisions are stored in sentence_matches.
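The call below is an illustrative sketch of such a pairwise check using the OpenAI Python client, not the project's actual prompt or code. The model name and temperature come from the description above; the prompt wording, function name, and YES/NO parsing are assumptions.

```python
# Illustrative sketch of an LLM-based sentence match check (not the project's actual prompt).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
FAST_MODEL = "gpt-4.1-mini"

def sentences_match(ground_truth: str, bot_sentence: str) -> bool:
    """Ask the model whether two sentences convey the same content."""
    response = client.chat.completions.create(
        model=FAST_MODEL,
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Answer YES or NO: do these two sentences express the same content, "
                        "ignoring tokenization differences and minor text variations?"},
            {"role": "user",
             "content": f"Ground truth: {ground_truth}\nBot-extracted: {bot_sentence}"},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```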
Overseer QA
An "overseer" model reviews ground truth sentences that the bot skipped, evaluating whether the miss was justified or indicates a systematic gap.
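A rough outline of that review step is sketched below. The prompt, verdict labels, and model choice are assumptions for illustration; the source only says that an overseer model reviews skipped sentences.

```python
# Hedged outline of the overseer pass; the prompt and verdict labels are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
OVERSEER_MODEL = "gpt-4.1-mini"  # assumption: the overseer model is not named in this page

def overseer_review(missed_sentence: str) -> str:
    """Ask the overseer whether skipping this ground truth sentence was justified."""
    response = client.chat.completions.create(
        model=OVERSEER_MODEL,
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You review misses made by a sentence extractor. Was skipping this "
                        "ground-truth sentence justified, or does it indicate a systematic gap? "
                        "Answer JUSTIFIED or SYSTEMATIC_GAP, followed by one short reason."},
            {"role": "user", "content": missed_sentence},
        ],
    )
    return response.choices[0].message.content.strip()
```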
Pattern Analysis
We train classifiers on missed sentences to identify phrase patterns associated with false negatives, informing prompt refinement.
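One plausible setup for this step is sketched below: TF-IDF n-gram features with a logistic regression classifier, inspecting the most positive weights. The feature choice, classifier, and function name are assumptions, not a description of the project's actual pipeline.

```python
# Hedged sketch: surface phrase patterns associated with false negatives.
# The TF-IDF + logistic regression setup is an assumption, not necessarily the
# classifier the project actually trains.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def top_miss_patterns(missed: list[str], caught: list[str], k: int = 20) -> list[tuple[float, str]]:
    """Return the k n-grams most predictive of a ground truth sentence being missed.

    `missed` are ground truth sentences the bot skipped (false negatives);
    `caught` are ground truth sentences the bot found (true positives).
    """
    texts = missed + caught
    labels = [1] * len(missed) + [0] * len(caught)

    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    X = vectorizer.fit_transform(texts)

    clf = LogisticRegression(max_iter=1000).fit(X, labels)

    # Positive weights push toward the "missed" class, so the largest ones
    # point at phrasing the extractor tends to overlook.
    terms = vectorizer.get_feature_names_out()
    return sorted(zip(clf.coef_[0], terms), reverse=True)[:k]
```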
Known Issues
- Sentence boundary mismatches: Different tokenizers split sentences at different points
- Context-dependent references: Pronouns like "it" referring to cybersecurity may be missed
- Implicit mentions: "The incident" referring to a known breach may not be detected without context
- Ground truth limitations: Human labeling may also miss relevant sentences or include marginal ones
Continuous Improvement
Validation findings feed back into prompt engineering and processing improvements:
- Identified patterns in missed sentences inform keyword list expansion
- Context window size adjustments improve detection of implicit references
- Label criteria refinements reduce classification ambiguity