Methodology

AI Transparency

This page documents exactly how AI systems contribute to this research. Every step where an AI makes a decision is explicitly identified, including the models used, the prompts given, and how outputs are validated.

Data Sources

Corpus: 1,307 earnings call transcripts from 35 ASX-listed Australian companies, covering 2016–2024.

Sources: Transcripts were obtained from S&P Global Market Intelligence.

Format: Plain text files with structured sections (Presentation, Q&A).

Processing Pipeline

1. Transcript Parsing

Method: Automated (rule-based)

Transcripts are split into "Presentation" and "Q&A" sections using regex patterns that identify section headers. Sentences are extracted using NLTK's sentence tokenizer.

Exclusions: Boilerplate text (copyright notices, disclaimers) is automatically filtered.
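
The parsing step can be sketched roughly as follows. This is a minimal illustration, not the project's code: the header patterns and boilerplate markers are assumptions, and a naive regex splitter stands in for NLTK's sentence tokenizer so that the sketch runs without extra dependencies.

```python
import re

# Hypothetical header patterns; real transcripts may label sections differently.
SECTION_HEADER = re.compile(
    r"^[ \t]*(Presentation|Question and Answer|Q&A)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
# Hypothetical boilerplate markers (copyright notices, disclaimers).
BOILERPLATE = re.compile(r"copyright|all rights reserved|disclaimer", re.IGNORECASE)


def split_sections(text: str) -> dict[str, str]:
    """Return section bodies keyed by 'Presentation' and 'Q&A'."""
    sections: dict[str, str] = {}
    matches = list(SECTION_HEADER.finditer(text))
    for i, match in enumerate(matches):
        name = "Presentation" if match.group(1).lower() == "presentation" else "Q&A"
        # A section runs from its header to the next header (or end of file).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[name] = text[match.end():end].strip()
    return sections


def split_sentences(section: str) -> list[str]:
    """Naive splitter standing in for NLTK's tokenizer; drops boilerplate."""
    parts = re.split(r"(?<=[.!?])\s+", section)
    return [p.strip() for p in parts if p.strip() and not BOILERPLATE.search(p)]
```

In the pipeline described above, split_sentences would instead call NLTK's sentence tokenizer, which handles abbreviations and decimals far better than a punctuation-based split.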

2. Cybersecurity Relevance Detection

Method: Dual-mode (Regex or LLM)

Regex Mode: Sentences matching cybersecurity keywords (cyber, data breach, hacking, ransomware, phishing, scam, information security, etc.) are selected.

LLM Mode: Each sentence is evaluated by GPT-4.1-mini with surrounding context (3 sentences before and after). The model determines if the sentence has "any relevance whatsoever to cybersecurity topics."

AI Decision: In LLM mode, the AI decides whether each sentence is relevant. We err on the side of inclusion.
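
As a rough sketch of the two modes, the example below shows keyword matching for Regex mode and the assembly of the context window that accompanies each sentence in LLM mode. The keyword list covers only the terms named above (the full list is not given), the function names are hypothetical, and the actual model call is omitted.

```python
import re

# Keywords named in the text; the source list ends with "etc.", so this is partial.
CYBER_KEYWORDS = re.compile(
    r"cyber|data breach|hacking|ransomware|phishing|scam|information security",
    re.IGNORECASE,
)

CONTEXT_WINDOW = 3  # sentences before and after, per the LLM-mode description


def regex_relevant(sentence: str) -> bool:
    """Regex mode: flag a sentence if any keyword appears in it."""
    return CYBER_KEYWORDS.search(sentence) is not None


def with_context(sentences: list[str], index: int, window: int = CONTEXT_WINDOW) -> dict:
    """LLM mode input: the target sentence plus its neighbouring sentences."""
    start = max(0, index - window)  # clamp at the start of the section
    return {
        "before": sentences[start:index],
        "sentence": sentences[index],
        "after": sentences[index + 1 : index + 1 + window],
    }
```

Sentences near the start or end of a section simply receive fewer neighbours; whether the real pipeline pads or crosses section boundaries is not stated.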

3. Classification Labeling

Method: LLM-based (Gemini 2.5 Flash)

Each relevant sentence is classified with applicable labels using the model's function-calling capability:

  • mentions_board: Board or senior management involvement in cybersecurity
  • regulatory_reference: Cybersecurity regulations, laws, or compliance
  • specificity: Concrete details (threats, technologies, frameworks, mitigation actions)

AI Decision: The AI assigns labels based on its interpretation. Human validation is performed on a subset (see Validation).
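
One way the label set could be expressed for function calling is sketched below. The function name classify_sentence, the schema layout, and the parse_labels helper are all assumptions for illustration; the declaration follows the JSON-schema style that function-calling APIs generally accept, and no provider SDK is invoked here.

```python
LABELS = ("mentions_board", "regulatory_reference", "specificity")

# Hypothetical function declaration; the name and layout are assumptions,
# not the declaration used in the project.
CLASSIFY_FUNCTION = {
    "name": "classify_sentence",
    "description": "Assign every applicable cybersecurity label to the sentence.",
    "parameters": {
        "type": "object",
        "properties": {
            "labels": {
                "type": "array",
                "items": {"type": "string", "enum": list(LABELS)},
            }
        },
        "required": ["labels"],
    },
}


def parse_labels(function_args: dict) -> dict[str, bool]:
    """Turn the model's function-call arguments into one boolean per label."""
    returned = set(function_args.get("labels", []))
    unknown = returned - set(LABELS)
    if unknown:
        # Reject labels outside the defined set instead of storing them silently.
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return {label: label in returned for label in LABELS}
```

Constraining the output to an enumerated list means a sentence can carry zero, one, or several labels, and any label outside the three definitions is caught before it reaches storage.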

4. Storage & Deduplication

Method: Automated

Results are stored in a SQLite database. File checksums prevent reprocessing of unchanged transcripts. Token usage is tracked for cost analysis.
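
Checksum-based deduplication can be sketched as below. The table name processed_files, its columns, and the choice of SHA-256 are assumptions; the source states only that file checksums are used.

```python
import hashlib
import sqlite3
from pathlib import Path


def file_checksum(path: Path) -> str:
    """SHA-256 of the file contents (the hash algorithm is an assumption)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_processing(conn: sqlite3.Connection, path: Path) -> bool:
    """Return True if the file is new or changed, and record its checksum."""
    # Hypothetical bookkeeping table: one row per transcript file.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_files "
        "(path TEXT PRIMARY KEY, checksum TEXT NOT NULL)"
    )
    checksum = file_checksum(path)
    row = conn.execute(
        "SELECT checksum FROM processed_files WHERE path = ?", (str(path),)
    ).fetchone()
    if row is not None and row[0] == checksum:
        return False  # unchanged transcript: skip it
    conn.execute(
        "INSERT OR REPLACE INTO processed_files (path, checksum) VALUES (?, ?)",
        (str(path), checksum),
    )
    conn.commit()
    return True
```

For brevity this sketch records the checksum before any processing happens; a real pipeline would more safely record it only after the transcript has been processed successfully, so that a failed run is retried.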

Models Used

Task                            Model                           Provider  Parameters
Relevance Detection (LLM mode)  gpt-4.1-mini                    OpenAI    Temperature: 0.0
Classification Labeling         gemini-2.5-flash-preview-05-20  Google    Function calling
Sentence Matching (Validation)  gpt-4.1-mini                    OpenAI    Temperature: 0.0

Prompts Used

Relevance Detection Prompt

Answer "true" if this sentence in this context has any relevance
whatsoever to cybersecurity topics such as cyber risk, cyber attacks,
cybercrime hacking, data breaches, incidents, data protection, online
fraud, scams (including digital scams), the Optus and Medibank hacks,
information security, digital crimes or other digital threats, awareness
campaigns to help customers not get scammed, training programs, or
support for safe online behavior. Include questions (e.g. questions
from shareholders) which ask about these topics (even if there's not
much information in them) because we want to track whether they get
answered. Err on the side of including too much: even a tangential
mention of something vaguely cyber-related should be included.

Classification Labels

Labels:
- mentions_board: Assign this label when the text refers to board or
  senior management involvement, oversight, or relevant expertise in
  cybersecurity. This includes chairing or attending cyber-related
  meetings, stated board member experience relevant to cyber, and
  statements about board-level focus or governance of cybersecurity.

- regulatory_reference: The text refers to cyber regulations, laws,
  agencies or compliance obligations (which refer to cybersecurity).
  It can't just be about regulatory compliance in general, there has
  to be a cybersecurity aspect to it.

- specificity: Assign only if the text contains concrete cybersecurity
  detail. This includes specific threats (e.g., ransomware, phishing),
  technologies (e.g., firewalls, MFA), frameworks (e.g., ISO 27001, NIST),
  mitigation actions (e.g., red teaming, training), quantified spending,
  or internal control mechanisms. This includes awareness campaigns,
  training programs, and support for safe online behaviors for customers.

Reproducibility

The processing code is available in our repository. To reproduce:

  1. Place transcript files in the transcripts/ directory
  2. Run uv run process_transcript_unified.py <file>
  3. Results are stored in analysis.sqlite

Note: API keys for OpenAI and Google are required. Results may vary slightly between runs because model outputs are not fully deterministic, even at the temperature of 0.0 listed under Models Used.
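
Because the script takes one file at a time, the steps above can be wrapped in a small batch runner. The sketch below is a convenience, not part of the repository: the build_commands helper is hypothetical, and the *.txt pattern assumes the plain-text transcripts use that extension.

```python
import subprocess
from pathlib import Path


def build_commands(transcript_dir: str = "transcripts") -> list[list[str]]:
    """One `uv run` command per transcript; the *.txt pattern is an assumption."""
    files = sorted(Path(transcript_dir).glob("*.txt"))
    return [["uv", "run", "process_transcript_unified.py", str(f)] for f in files]


if __name__ == "__main__":
    for command in build_commands():
        subprocess.run(command, check=True)  # stop on the first failure
```

Since unchanged transcripts are skipped via their checksums, rerunning the batch after adding new files should only incur API costs for the new or modified transcripts.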

Limitations