Methodology
AI Transparency
This page documents exactly how AI systems contribute to this research. Every step where an AI makes a decision is explicitly identified, including the models used, the prompts given, and how outputs are validated.
Data Sources
Corpus: 1,307 earnings call transcripts from 35 ASX-listed Australian companies, covering 2016–2024.
Source: Transcripts obtained from S&P Global Market Intelligence.
Format: Plain text files with structured sections (Presentation, Q&A).
Processing Pipeline
Transcript Parsing
Method: Automated (rule-based)
Transcripts are split into "Presentation" and "Q&A" sections using regex patterns that identify section headers. Sentences are extracted using NLTK's sentence tokenizer.
Exclusions: Boilerplate text (copyright notices, disclaimers) is automatically filtered.
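As an illustration, the sketch below shows the shape of this step. The header patterns and boilerplate filter are simplified stand-ins, not the production regexes, which are tuned to the vendor's transcript formatting.

```python
import re
from nltk.tokenize import sent_tokenize  # requires NLTK's "punkt" tokenizer data

# Simplified stand-ins for the production patterns (illustrative only).
SECTION_HEADER = re.compile(r"^(Presentation|Question and Answer|Q&A)\s*$",
                            re.IGNORECASE | re.MULTILINE)
BOILERPLATE = re.compile(r"copyright|all rights reserved|disclaimer", re.IGNORECASE)

def parse_transcript(text: str) -> dict[str, list[str]]:
    """Split a transcript into named sections and sentence-tokenize each one."""
    sections: dict[str, list[str]] = {}
    headers = list(SECTION_HEADER.finditer(text))
    for i, match in enumerate(headers):
        start = match.end()
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        sentences = [s.strip() for s in sent_tokenize(text[start:end])
                     if s.strip() and not BOILERPLATE.search(s)]
        sections[match.group(1)] = sentences
    return sections
```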
Cybersecurity Relevance Detection
Method: Dual-mode (Regex or LLM)
Regex Mode: Sentences matching cybersecurity keywords (cyber, data breach, hacking, ransomware, phishing, scam, information security, etc.) are selected.
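A minimal sketch of the keyword match follows; the pattern covers only the examples listed above, while the full production keyword list is longer.

```python
import re

# Keywords taken from the examples above only; the full production list is longer.
CYBER_KEYWORDS = re.compile(
    r"\b(cyber\w*|data breach\w*|hack\w*|ransomware|phishing|scam\w*|"
    r"information security)\b",
    re.IGNORECASE,
)

def is_relevant_regex(sentence: str) -> bool:
    """Regex mode: keep any sentence containing a cybersecurity keyword."""
    return CYBER_KEYWORDS.search(sentence) is not None
```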
LLM Mode: Each sentence is evaluated by GPT-4.1-mini with surrounding context (3 sentences before and after). The model determines if the sentence has "any relevance whatsoever to cybersecurity topics."
AI Decision: In LLM mode, the AI decides whether each sentence is relevant. We err on the side of inclusion.
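The sketch below illustrates the LLM-mode call, assuming the OpenAI Python client. The message framing and the is_relevant_llm helper are illustrative; the actual prompt is reproduced under Prompts Used.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RELEVANCE_PROMPT = "..."  # full prompt reproduced under "Prompts Used"

def is_relevant_llm(sentences: list[str], idx: int, window: int = 3) -> bool:
    """LLM mode: ask gpt-4.1-mini whether sentences[idx] is cyber-relevant,
    given up to `window` sentences of context on either side."""
    context = " ".join(sentences[max(0, idx - window): idx + window + 1])
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        temperature=0.0,
        messages=[
            {"role": "system", "content": RELEVANCE_PROMPT},
            {"role": "user", "content": (
                f"Context: {context}\n\nSentence: {sentences[idx]}\n\n"
                "Answer true or false.")},
        ],
    )
    return response.choices[0].message.content.strip().lower().startswith("true")
```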
Classification Labeling
Method: LLM-based (Gemini 2.5 Flash)
Each relevant sentence is classified with applicable labels using the model's function-calling capability:
- mentions_board: Board or senior management involvement in cybersecurity
- regulatory_reference: Cybersecurity regulations, laws, or compliance
- specificity: Concrete details (threats, technologies, frameworks, mitigation actions)
AI Decision: The AI assigns labels based on its interpretation. Human validation is performed on a subset (see Validation).
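A minimal sketch of the function-calling step is shown below, assuming the google-generativeai SDK. The tool name assign_labels and the prompt framing are illustrative assumptions, not the project's actual code; the label definitions are reproduced under Classification Labels.

```python
import google.generativeai as genai  # assumes the google-generativeai SDK

def assign_labels(mentions_board: bool, regulatory_reference: bool,
                  specificity: bool):
    """Tool declaration: the model reports each label as a boolean argument."""

# genai.configure(api_key=...) must be called first.
model = genai.GenerativeModel(
    model_name="gemini-2.5-flash-preview-05-20",
    tools=[assign_labels],  # the SDK derives a schema from the signature
)

LABEL_PROMPT = "..."  # label definitions reproduced under "Classification Labels"

def classify(sentence: str) -> dict:
    """Return the labels the model assigns to one relevant sentence."""
    response = model.generate_content(f"{LABEL_PROMPT}\n\nSentence: {sentence}")
    for part in response.candidates[0].content.parts:
        if part.function_call.name == "assign_labels":
            return dict(part.function_call.args)
    return {}  # the model did not call the tool
```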
Storage & Deduplication
Method: Automated
Results are stored in an SQLite database. File checksums prevent reprocessing of unchanged transcripts. Token usage is tracked for cost analysis.
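The sketch below shows the checksum-based skip logic, assuming a simplified schema; the actual tables in analysis.sqlite hold additional fields (labels, token counts, model versions).

```python
import hashlib
import sqlite3

conn = sqlite3.connect("analysis.sqlite")
conn.execute("""CREATE TABLE IF NOT EXISTS processed_files (
                    path TEXT PRIMARY KEY,
                    checksum TEXT NOT NULL)""")
conn.commit()

def file_checksum(path: str) -> str:
    """SHA-256 of the raw transcript file, used to detect unchanged inputs."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def already_processed(path: str) -> bool:
    """Skip a transcript if its stored checksum matches the file on disk."""
    row = conn.execute("SELECT checksum FROM processed_files WHERE path = ?",
                       (path,)).fetchone()
    return row is not None and row[0] == file_checksum(path)
```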
Models Used
| Task | Model | Provider | Parameters |
|---|---|---|---|
| Relevance Detection (LLM mode) | gpt-4.1-mini | OpenAI | Temperature: 0.0 |
| Classification Labeling | gemini-2.5-flash-preview-05-20 | Google | Function calling |
| Sentence Matching (Validation) | gpt-4.1-mini | OpenAI | Temperature: 0.0 |
Prompts Used
Relevance Detection Prompt
Answer "true" if this sentence in this context has any relevance
whatsoever to cybersecurity topics such as cyber risk, cyber attacks,
cybercrime hacking, data breaches, incidents, data protection, online
fraud, scams (including digital scams), the Optus and Medibank hacks,
information security, digital crimes or other digital threats, awareness
campaigns to help customers not get scammed, training programs, or
support for safe online behavior. Include questions (e.g. questions
from shareholders) which ask about these topics (even if there's not
much information in them) because we want to track whether they get
answered. Err on the side of including too much: even a tangential
mention of something vaguely cyber-related should be included.
Classification Labels
Labels:
- mentions_board: Assign this label when the text refers to board or
senior management involvement, oversight, or relevant expertise in
cybersecurity. This includes chairing or attending cyber-related
meetings, stated board member experience relevant to cyber, and
statements about board-level focus or governance of cybersecurity.
- regulatory_reference: The text refers to cyber regulations, laws,
agencies or compliance obligations (which refer to cybersecurity).
It can't just be about regulatory compliance in general, there has
to be a cybersecurity aspect to it.
- specificity: Assign only if the text contains concrete cybersecurity
detail. This includes specific threats (e.g., ransomware, phishing),
technologies (e.g., firewalls, MFA), frameworks (e.g., ISO 27001, NIST),
mitigation actions (e.g., red teaming, training), quantified spending,
or internal control mechanisms. This includes awareness campaigns,
training programs, and support for safe online behaviors for customers.
Reproducibility
The processing code is available in our repository. To reproduce:
- Place transcript files in the `transcripts/` directory
- Run `uv run process_transcript_unified.py <file>`
- Results are stored in `analysis.sqlite`
Note: API keys for OpenAI and Google are required. Results may vary slightly between runs due to model non-determinism, though we use temperature 0.0 to minimize variation.
Limitations
- LLM classification is probabilistic; see validation results for accuracy assessment
- Transcripts may contain OCR or transcription errors from source
- Model behavior may change with updates; we document model versions used
- Regex mode may miss cybersecurity mentions that don't contain keyword matches
- Context window limitations may affect classification of ambiguous sentences