Large Language Models in Cybersecurity

We evaluated 7 large language models across 9 cybersecurity domains using SecBench, a large-scale and multi-format benchmark for security tasks.

We tested each model on 44,823 multiple-choice questions (MCQs) and 3,087 short-answer questions (SAQs), covering domains such as data security, identity & access management, network security, vulnerability management, and cloud security.

Benchmarking LLM performance across cybersecurity domains

MCQs (Multiple-Choice Questions) benchmarking:

SAQs (Short Answer Questions):

This benchmark evaluates 7 general-purpose LLMs, including both proprietary models (e.g., GPT-4) and open-source models (e.g., DeepSeek, Mistral). The benchmark spans 9 cybersecurity subfields, including:

Data Security
Identity & Access Management
Application Security
Network Security
Security Standards

The x-axis domains are sorted by LLM performance, with lower-scoring domains placed toward the left and higher-scoring ones toward the right.1

See benchmark methodology.

Specialized cybersecurity LLMs

The role of LLMs in cybersecurity

Large language models (LLMs) are used across cybersecurity operations to extract actionable insights from unstructured sources such as threat intelligence reports, incident logs, CVE databases, and attacker TTPs.

LLMs automate key tasks, including threat classification, alert summarization, and correlation of indicators of compromise (IOCs).
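The correlation step can be illustrated without a model. The sketch below is a deterministic, regex-based baseline rather than an LLM pipeline: it extracts a few indicator types from alert text and reports which indicators appear in more than one alert. The patterns, the alert IDs, and the alert texts are all assumptions made up for illustration.

```python
import re
from collections import defaultdict

# Simplified patterns for illustration only; production IOC extraction needs
# stricter validation (octet ranges, defanged indicators, and so on).
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
}


def extract_iocs(text):
    """Return the set of (kind, value) indicators found in one alert."""
    found = set()
    for kind, pattern in IOC_PATTERNS.items():
        for match in pattern.findall(text):
            found.add((kind, match))
    return found


def correlate(alerts):
    """Map each indicator to the IDs of the alerts that mention it,
    keeping only indicators shared by two or more alerts."""
    seen = defaultdict(set)
    for alert_id, text in alerts.items():
        for ioc in extract_iocs(text):
            seen[ioc].add(alert_id)
    return {ioc: ids for ioc, ids in seen.items() if len(ids) > 1}


# Hypothetical alerts; the IP addresses come from documentation-only ranges.
alerts = {
    "A1": "Outbound beacon to 203.0.113.7 after exploit of CVE-2021-44228",
    "A2": "Failed logins from 203.0.113.7",
    "A3": "Scanner probing for CVE-2021-44228 from 198.51.100.9",
}
shared = correlate(alerts)
```

In a real deployment, an LLM would more plausibly sit on either side of a step like this, for example summarizing the correlated alerts for an analyst.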

When fine-tuned on cybersecurity data, large language models can detect anomalies in logs, analyze phishing emails, prioritize vulnerabilities, and map threats to frameworks like MITRE ATT&CK.

Applications of large language models in cybersecurity

Threat intelligence

Co-pilot for contextual threat analysis: LLM-powered tools like CyLens support security analysts throughout the threat intelligence process by analyzing extensive threat reports with modular NLP pipelines and entity correlation filters.2
Real-time proactive threat intelligence: Systems integrate LLMs with retrieval‑augmented generation (RAG) frameworks to ingest continuous CTI feeds (e.g., CVE) into vector databases (like Milvus), enabling up‑to‑date automated detection, scoring, and contextual reasoning.3
Forum-based CTI extraction: LLMs analyze unstructured data from cybercrime forums to extract key threat indicators using simple prompts.4
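The retrieval half of a RAG setup like the one described above can be sketched in a few lines. This is a toy under stated assumptions: `InMemoryStore` is a hypothetical stand-in for a vector database such as Milvus, `embed` is a bag-of-words counter rather than a real embedding model, and the feed entries are invented. It shows only how retrieved CTI entries end up in the prompt that an LLM would receive.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self.items = []

    def add(self, doc_id, text):
        self.items.append((doc_id, text, embed(text)))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[2]), reverse=True)
        return [(doc_id, text) for doc_id, text, _ in ranked[:k]]


def build_prompt(question, store, k=2):
    """Retrieve the k most similar entries and place them in the prompt."""
    hits = store.search(question, k)
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Invented feed entries for illustration.
store = InMemoryStore()
store.add("feed-1", "remote code execution in the log4j logging library via jndi lookup")
store.add("feed-2", "sql injection in a web forum login form")
store.add("feed-3", "privilege escalation in a linux kernel driver")

prompt = build_prompt("Which entry covers log4j remote code execution?", store, k=1)
```

Because new feed entries only need to be added to the store, the model can reason over current intelligence without being retrained, which is the property the RAG-based systems rely on.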

Vulnerability detection

Vulnerability description enrichment: LLMs such as CVE‑LLM enrich vulnerability descriptions using domain ontologies, enabling automated triage and CVSS scoring integration within existing security management systems.5
Android filesystem vulnerability detection: Investigates how LLMs can detect file system access vulnerabilities in Android apps, including permission abuse and insecure storage.6
RL fine‑tuning for vulnerability detection: Applies reinforcement learning (RL) to fine-tune LLMs (LLaMA 3B/8B, Qwen 2.5B) for improved accuracy in identifying software vulnerabilities.7

Anomaly detection & log analysis

Semantic log anomaly detection: Frameworks like LogLLM use LLM encoders/decoders to parse and classify log entries, improving anomaly detection beyond pattern matching.8
Log parsing with large language models: Automated LLM parsing converts unstructured logs into structured formats via prompt‑based and fine‑tuned approaches.9
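A prompt-based log parser can be sketched as a prompt template plus validation of the model's reply. The field names, the prompt wording, and the `fake_llm` stub below are assumptions for illustration; `llm` is any callable that takes a prompt string and returns a string, so a real model client could be substituted.

```python
import json

# Hypothetical target schema; real parsers choose fields per log source.
REQUIRED_KEYS = {"timestamp", "level", "component", "message"}

PROMPT_TEMPLATE = (
    "Convert the log line into a JSON object with the keys "
    "timestamp, level, component, and message. Return JSON only.\n\n"
    "Log line: {line}"
)


def parse_log_line(line, llm):
    """Ask `llm` to structure one log line, then validate the reply
    instead of trusting it. Returns a dict, or None if the reply is unusable."""
    reply = llm(PROMPT_TEMPLATE.format(line=line))
    try:
        record = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or not REQUIRED_KEYS.issubset(record):
        return None
    return record


def fake_llm(prompt):
    """Stub standing in for a real model call so the sketch runs offline."""
    return json.dumps(
        {
            "timestamp": "2024-01-01T00:00:00Z",
            "level": "ERROR",
            "component": "sshd",
            "message": "authentication failure",
        }
    )


record = parse_log_line(
    "2024-01-01T00:00:00Z sshd ERROR authentication failure", fake_llm
)
```

Validating the reply matters because a model can return malformed or incomplete JSON; rejected lines can be retried or routed to a conventional parser.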

Red teaming / LLM-assisted attack prevention

LLM-driven pentesting and remediation (PenHeal): Automates penetration testing using a two-stage pipeline: first identifying security weaknesses, then generating remediation actions using a custom LLM setup.10
On-prem red team agent for internal security (Hackphyr): Deploys a fine-tuned 7B LLM agent locally to perform red-team tasks such as lateral movement simulation, credential harvesting, and vulnerability scanning in networks.11

LLMs in cybersecurity benchmark methodology

SecBench is a large-scale, multi-dimensional benchmark for evaluating LLMs in cybersecurity across different tasks, domains, languages, and formats.

Evaluation dimensions

1. Multi-level reasoning:

Knowledge Retention (KR): Questions that test factual knowledge or definitions. These are the more straightforward questions.
Logical Reasoning (LR): Questions that require inference and deeper understanding. These are more challenging and test the model’s ability to reason based on context.

2. Multi-format:

MCQs (Multiple-Choice Questions): Traditional format where the model selects from predefined answers. Total of 44,823 questions.
SAQs (Short Answer Questions): Open-ended format requiring the model to generate its own response, which is used to evaluate reasoning, clarity, and hallucination resistance. Total of 3,087 questions.

3. Multi-language:

SecBench includes questions in both Chinese and English.

4. Multi-domain:

Questions span 9 cybersecurity domains (D1–D9), including security management, data security, network security, application security, and cloud security.

Evaluation

MCQs are graded by checking if the model selects the correct choice(s).
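A minimal sketch of this check, assuming exact-match grading with no partial credit (the source does not say how partially correct multi-answer selections are scored) and assuming the option letters have already been extracted from the model's reply. The function names are hypothetical.

```python
def normalize_choices(answer):
    """Turn strings like 'a, c' or 'AC' into a set of option letters."""
    return {ch for ch in answer.upper() if ch.isalpha()}


def grade_mcq(predicted, correct):
    """Exact match: the selected letters must equal the answer key.
    Assumption: no partial credit for multi-answer questions."""
    return normalize_choices(predicted) == normalize_choices(correct)


single = grade_mcq("B", "B")        # single-answer question, correct
multi = grade_mcq("a, c", "AC")     # multi-answer question, correct
partial = grade_mcq("A", "AC")      # missing one option, graded incorrect
```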

SAQs are graded using a GPT-4o mini “grading agent”, which compares the model’s response to the ground truth and assigns a score based on accuracy and completeness.
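The general shape of such a grading agent can be sketched as a prompt plus a strict parser for the reply. Everything specific here is an assumption: the prompt wording, the 0 to 10 scale, and the `Score: <integer>` reply format are illustrative and are not taken from SecBench. `grader` is any callable that maps a prompt string to a reply string, so a real model client could be plugged in.

```python
import re

# Hypothetical grading prompt; the benchmark's actual prompt may differ.
GRADING_PROMPT = (
    "You are grading a short answer to a cybersecurity question.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {answer}\n"
    "Rate accuracy and completeness from 0 to 10. "
    "Reply in the form 'Score: <integer>'."
)


def parse_score(reply, max_score=10):
    """Extract the score from the grader's reply; None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if score <= max_score else None


def grade_saq(question, reference, answer, grader):
    """Build the grading prompt, call the grader, and parse its score."""
    prompt = GRADING_PROMPT.format(
        question=question, reference=reference, answer=answer
    )
    return parse_score(grader(prompt))


# Stub grader so the sketch runs offline.
score = grade_saq(
    "What does TLS provide?",
    "Confidentiality, integrity, and authentication for data in transit.",
    "It encrypts traffic and authenticates the server.",
    grader=lambda prompt: "Score: 7",
)
```

Returning `None` for unparseable replies keeps grader failures visible instead of silently counting them as zero.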

LLM performance evaluation: For example, Network Security (D3) is assessed by grouping the questions labeled D3 from the 44,823-question MCQ dataset.

Accuracy is measured based on each model’s performance, specifically on questions labeled under the D3 domain. A model’s percentage score for D3 reflects the proportion of network security questions it answered correctly.
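The per-domain score is a plain proportion: correct answers in a domain divided by questions in that domain. A minimal sketch, with made-up results and a hypothetical function name:

```python
from collections import defaultdict


def accuracy_by_domain(results):
    """results: iterable of (domain, is_correct) pairs.
    Returns {domain: percentage of questions answered correctly}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for domain, is_correct in results:
        totals[domain] += 1
        correct[domain] += int(is_correct)
    return {d: 100.0 * correct[d] / totals[d] for d in totals}


# Made-up graded answers for one model: four D3 questions, one D1 question.
results = [
    ("D3", True),
    ("D3", False),
    ("D3", True),
    ("D3", True),
    ("D1", False),
]
scores = accuracy_by_domain(results)  # D3: 3 of 4 correct, D1: 0 of 1 correct
```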

Cite this research


Cem Dilmegani (2026) - "Large Language Models in Cybersecurity". Published online at AIMultiple.com. Retrieved June 5, 2026, from: https://aimultiple.com/llms-in-cybersecurity [Online Resource]

Dilmegani, C. (2026, June 5). Large Language Models in Cybersecurity. AIMultiple. https://aimultiple.com/llms-in-cybersecurity

@misc{dilmegani2026, author = {Dilmegani, Cem}, title = {{Large Language Models in Cybersecurity}}, year = {2026}, month = jun, howpublished = {\url{https://aimultiple.com/llms-in-cybersecurity}}, note = {AIMultiple. Retrieved June 5, 2026} }

Cem Dilmegani

Principal Analyst

