Large Language Models in Cybersecurity

We evaluated 7 large language models across 9 cybersecurity domains using SecBench, a large-scale and multi-format benchmark for security tasks.

We tested each model on 44,823 multiple-choice questions (MCQs) and 3,087 short-answer questions (SAQs), covering domains such as data security, identity & access management, network security, vulnerability management, and cloud security.

Benchmarking LLM performance across cybersecurity domains

MCQs (Multiple-Choice Questions) benchmarking:

SAQs (Short Answer Questions):

This benchmark evaluates 7 general-purpose LLMs, including both proprietary models (e.g., GPT-4) and open-source models (e.g., DeepSeek, Mistral), across 9 cybersecurity subfields.

The x-axis domains are sorted by LLM performance, with lower-scoring domains placed toward the left and higher-scoring ones toward the right [1].
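The ordering described above can be reproduced with a short sort. This is a minimal sketch, assuming the sort key is the mean accuracy across models (the article does not say which aggregate is used); the function name `order_domains` and the numbers in the usage note are illustrative placeholders, not benchmark results.

```python
from statistics import mean


def order_domains(scores: dict[str, dict[str, float]]) -> list[str]:
    """Return domain labels sorted from lowest to highest mean accuracy.

    `scores` maps a model name to {domain label: accuracy in percent}.
    Assumption: domains are ranked by their mean accuracy across models.
    """
    domains = {d for per_model in scores.values() for d in per_model}

    def average(domain: str) -> float:
        # Average only over models that report a score for this domain.
        return mean(m[domain] for m in scores.values() if domain in m)

    # Ties are broken by label so the order is deterministic.
    return sorted(domains, key=lambda d: (average(d), d))
```

For example, with placeholder scores `{"model_a": {"D1": 80, "D2": 50, "D3": 70}, "model_b": {"D1": 90, "D2": 60, "D3": 60}}`, the lowest-scoring domain (D2) comes first and the highest-scoring one (D1) comes last.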

See benchmark methodology.

Specialized cybersecurity LLMs

The role of LLMs in cybersecurity

Large language models (LLMs) are used across cybersecurity operations to extract actionable insights from unstructured sources such as threat intelligence reports, incident logs, CVE databases, and attacker TTPs.

LLMs automate key tasks, including threat classification, alert summarization, and correlation of indicators of compromise (IOCs).

When fine-tuned on cybersecurity data, large language models can detect anomalies in logs, analyze phishing emails, prioritize vulnerabilities, and map threats to frameworks like MITRE ATT&CK.

Applications of large language models in cybersecurity

Threat intelligence

Vulnerability detection

Anomaly detection & log analysis

Red teaming / LLM-assisted attack prevention

LLMs in cybersecurity benchmark methodology

SecBench is a large-scale, multi-dimensional benchmark for evaluating LLMs in cybersecurity across different tasks, domains, languages, and formats.

Evaluation dimensions

1. Multi-level reasoning:

2. Multi-format:

3. Multi-Language:

SecBench includes questions in both Chinese and English.

4. Multi-Domain:

Questions span 9 cybersecurity domains (D1–D9), including security management, data security, network security, application security, and cloud security.

Evaluation

MCQs are graded by checking if the model selects the correct choice(s).
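A minimal sketch of this check, assuming the model's output has already been reduced to option letters and that a multi-answer question counts as correct only when the selected set matches the answer key exactly (no partial credit). The helper names are hypothetical and not part of SecBench.

```python
def normalize_choices(answer: str) -> frozenset[str]:
    """Reduce an answer string to its set of option letters."""
    # "a, c" and "CA" both become {"A", "C"}, so order and separators are ignored.
    return frozenset(ch for ch in answer.upper() if ch.isalpha())


def grade_mcq(predicted: str, answer_key: str) -> bool:
    """Return True only when the selected choices equal the answer key."""
    return normalize_choices(predicted) == normalize_choices(answer_key)
```

Under these assumptions, `grade_mcq("a, c", "CA")` counts as correct, while `grade_mcq("A", "AC")` does not because one correct choice is missing.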

SAQs are graded using a GPT-4o mini “grading agent”, which compares the model’s response to the ground truth and assigns a score based on accuracy and completeness.
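The grading-agent step could look roughly like the sketch below. Everything specific here is an assumption made for illustration: the prompt wording, the 0 to 10 scale, the "Score: <n>" reply format, and the `ask_grader` callable that stands in for a call to the grading model. None of it is taken from SecBench itself.

```python
import re
from typing import Callable

# Hypothetical prompt; the actual SecBench grading prompt is not given in the article.
GRADING_TEMPLATE = (
    "You are grading a cybersecurity short-answer question.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {response}\n"
    "Rate accuracy and completeness from 0 to 10. Reply with 'Score: <n>'."
)


def grade_saq(
    question: str,
    reference: str,
    response: str,
    ask_grader: Callable[[str], str],
) -> float:
    """Ask a grading model to score a response against the ground truth."""
    prompt = GRADING_TEMPLATE.format(
        question=question, reference=reference, response=response
    )
    reply = ask_grader(prompt)  # stand-in for the call to the grading model
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        raise ValueError(f"No score found in grader reply: {reply!r}")
    # Clamp to the assumed 0-10 scale in case the grader overshoots.
    return min(max(float(match.group(1)), 0.0), 10.0)
```

Passing the grader in as a callable keeps the sketch independent of any particular API client and lets the parsing logic be checked with a stub.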

LLM performance evaluation: Each domain is assessed separately. For example, Network Security (D3) is assessed by grouping the questions labeled D3 from SecBench's 44,823-question MCQ dataset.

Accuracy is measured based on each model’s performance, specifically on questions labeled under the D3 domain. A model’s percentage score for D3 reflects the proportion of network security questions it answered correctly.
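The per-domain score can be sketched as a simple grouped proportion. The input shape (pairs of domain label and a correct/incorrect flag) and the function name are assumptions made for this example.

```python
from collections import defaultdict
from typing import Iterable


def accuracy_by_domain(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Compute percentage accuracy per domain label.

    `results` yields (domain_label, is_correct) pairs, one per graded question.
    """
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for domain, is_correct in results:
        totals[domain] += 1
        correct[domain] += int(is_correct)
    # Percentage of questions in each domain that were answered correctly.
    return {d: 100.0 * correct[d] / totals[d] for d in totals}
```

If a model answers 3 of 4 questions labeled D3 correctly, its D3 score under this sketch is 75.0.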

Cite this research


Cem Dilmegani (2026) - "Large Language Models in Cybersecurity". Published online at AIMultiple.com. Retrieved June 5, 2026, from: https://aimultiple.com/llms-in-cybersecurity [Online Resource]

Dilmegani, C. (2026, June 5). Large Language Models in Cybersecurity. AIMultiple. https://aimultiple.com/llms-in-cybersecurity

@misc{dilmegani2026, author = {Dilmegani, Cem}, title = {{Large Language Models in Cybersecurity}}, year = {2026}, month = jun, howpublished = {\url{https://aimultiple.com/llms-in-cybersecurity}}, note = {AIMultiple. Retrieved June 5, 2026} }

Cem Dilmegani

Principal Analyst

Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per Similarweb), including 55% of the Fortune 500, every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, and the Washington Post; global firms like Deloitte and HPE; NGOs like the World Economic Forum; and supranational organizations like the European Commission. You can see more reputable companies and resources that have referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led the technology strategy and procurement of a telco while reporting to the CEO. He also led the commercial growth of the deep tech company Hypatos, which went from zero to 7-digit annual recurring revenue and a 9-digit valuation within 2 years. Cem's work at Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
