audytx corpus study · june 2026 · claude opus 4.8

What Frontier Models Get Wrong About AWS Terraform Security

We asked Claude's flagship model to write Terraform for 50 realistic AWS infrastructure scenarios — no security guidance, one shot each. Then we measured every output with audytx and terraform validate.

50 naive prompts · 8 archetypes
686 findings across 1,421 resources
14% fail terraform validate
85 findings suppressed by cross-resource context

TL;DR

AI-generated Terraform looks like production code, not a vulnerability lab. At 48.3 findings per 100 resources, it is comparable to hand-written production modules (52.2) and more than 5× cleaner than a deliberately vulnerable benchmark (269.1). The model avoids the gaping holes. The consistent gap is operational and cross-resource: it creates secrets but omits rotation, and when it reaches for a security setting it sometimes invents an argument name the provider does not support. 7 of 50 generations (14%) fail terraform validate.

Method

Prompts: 50 developer-voiced AWS infrastructure scenarios across 8 archetypes (web app, serverless API, data pipeline, EKS cluster, static site, async worker, multi-env, vague "vibe" one-liners). Every prompt is security-free — no mention of security, compliance, or best practices. We measure defaults, not prompted-for hardening.

Generation: each prompt sent verbatim to claude-opus-4-8 with only "Write the Terraform. Output only HCL files." One generation per prompt, no quality retries. A broken output is data. Generators were isolated from the audytx repo so the scanner's context could not bias them.

Scan: audytx v0.5.1 via the MCP scan_terraform endpoint, one independent root module at a time — 50 separate scans so cross-resource reasoning stays within each app's boundary.

Reference corpora: terragoat (deliberately vulnerable AWS modules) and 21 clean production Terraform modules from the audytx benchmark, scanned with the identical per-directory method.

Key findings

Density comparison

Corpus	Resources	Findings / 100	Security / 100	HIGH+CRIT / 100
AI corpus (Opus 4.8)	1,421	48.3	12.1	1.3
Clean production modules (21)	882	52.2	9.9	3.1
terragoat (deliberately vulnerable)	55	269.1	130.9	94.5

The AI corpus is comparable to production modules on overall and security density, and lower on HIGH+CRIT density (1.3 vs 3.1 per 100 resources). It is not a vulnerability benchmark.

Where the debt lands

Category	Findings	% of total
Reliability	179	26%
Observability	168	24%
Cost	167	24%
Security	163	24%
Data Protection	9	1%

Three quarters of the findings (514 of 686) are operational rather than security-related: they fall under Reliability, Observability, and Cost. Missing CloudWatch alarms, missing prevent_destroy on stateful resources, and loose provider version pins dominate the count.
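To make two of those operational gaps concrete, here is a minimal sketch of a tighter provider pin and a prevent_destroy guard on a stateful resource. The version numbers, table name, and key schema are assumptions chosen for illustration; they are not taken from the corpus, and the exact threshold audytx uses for a "loose" pin is not stated here.

```hcl
terraform {
  required_version = ">= 1.9.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Pessimistic constraint: accepts 5.60 and later 5.x releases, rejects 6.0.
      # (Illustrative version, not a recommendation from the study.)
      version = "~> 5.60"
    }
  }
}

# Hypothetical stateful resource used only for this example.
resource "aws_dynamodb_table" "orders" {
  name         = "example-orders"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "order_id"

  attribute {
    name = "order_id"
    type = "S"
  }

  point_in_time_recovery {
    enabled = true
  }

  lifecycle {
    # Terraform refuses any plan that would destroy this table.
    prevent_destroy = true
  }
}
```

The lifecycle guard only blocks destruction through Terraform; it does not stop deletion in the AWS console or by another tool.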

The 18 high-severity findings

Finding	Rule	Count
Secrets Manager rotation not configured	AWS_SM_001	11
S3 bucket policy doesn't deny non-TLS	AWS_S3_005	3
ALB listener HTTP without HTTPS redirect	AWS_ELB_005	1
Redshift parameter group doesn't enforce SSL	AWS_REDSHIFT_002	1
S3 versioning disabled	AWS_S3_003	1
Lambda 3s timeout in VPC	AWS_XREF_001	1

Secrets Manager rotation dominates — 61% of all HIGH findings come from a single cross-resource failure pattern.
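For the second row of the table (AWS_S3_005), the usual remediation is a bucket policy that denies any request made without TLS. The sketch below shows that pattern; the bucket name and resource labels are hypothetical, and whether this exact shape is what audytx looks for is an assumption.

```hcl
# Hypothetical bucket used only for this example.
resource "aws_s3_bucket" "logs" {
  bucket = "example-logs-bucket"
}

data "aws_iam_policy_document" "deny_insecure_transport" {
  statement {
    sid     = "DenyInsecureTransport"
    effect  = "Deny"
    actions = ["s3:*"]

    # Cover both the bucket itself and every object in it.
    resources = [
      aws_s3_bucket.logs.arn,
      "${aws_s3_bucket.logs.arn}/*",
    ]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    # aws:SecureTransport is "false" when the request did not use TLS.
    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["false"]
    }
  }
}

resource "aws_s3_bucket_policy" "logs" {
  bucket = aws_s3_bucket.logs.id
  policy = data.aws_iam_policy_document.deny_insecure_transport.json
}
```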

Hallucination rate

7 of 50 (14%) fail terraform validate (Terraform 1.9.8). The signature failure: the model reaches for a security setting and invents the argument name.

pipeline-03/redshift.tf:71 — Unsupported argument: "require_tls" (no such attribute on the resource)
pipeline-04/redshift.tf:92 — Unsupported argument: "require_ssl"
webapp-01/database.tf:62 — Unsupported argument: "storage_encrypted_kms_key_id" (invented fusion of two real RDS arguments)

Three of seven hallucinations are invented security knobs — the model knows it should enforce TLS or scope the KMS key, but invents the syntax. The intent is there; the provider knowledge is wrong.
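For contrast with the invented arguments above, here is a sketch of how the same intent is normally expressed with arguments the AWS provider does support. It assumes the model meant to force SSL on Redshift connections and to encrypt RDS storage with a customer-managed key; all names, sizes, and engine choices are hypothetical.

```hcl
# Redshift: SSL enforcement is a parameter inside a parameter group,
# not an argument on the cluster resource.
resource "aws_redshift_parameter_group" "require_ssl" {
  name   = "example-require-ssl"
  family = "redshift-1.0"

  parameter {
    name  = "require_ssl"
    value = "true"
  }
}
# A cluster would then reference it through cluster_parameter_group_name.

# RDS: encryption is two separate arguments, storage_encrypted and kms_key_id.
resource "aws_kms_key" "rds" {
  description         = "Example key for RDS storage encryption"
  enable_key_rotation = true
}

resource "aws_db_instance" "main" {
  identifier        = "example-db"
  engine            = "postgres"
  instance_class    = "db.t4g.micro"
  allocated_storage = 20
  username          = "app"

  # Let RDS create and manage the master password in Secrets Manager.
  manage_master_user_password = true

  storage_encrypted = true
  kms_key_id        = aws_kms_key.rds.arn

  final_snapshot_identifier = "example-db-final"
}
```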

The cross-resource failure pattern

The most consistent security gap in the corpus — 11 of 18 HIGH findings — is a single cross-resource failure: the model creates an aws_secretsmanager_secret for the database password but never creates the companion aws_secretsmanager_secret_rotation resource.

The model knows the right primary resource. It creates it with proper KMS encryption and scoped IAM access. It stops before wiring the dependency that makes the configuration complete. This matches the failure mode documented in arXiv:2512.14792: LLMs systematically fail to model cross-resource dependencies in infrastructure code.

# What the model generates:
resource "aws_secretsmanager_secret" "db_password" {
  name       = "${var.project}-db-password"
  kms_key_id = aws_kms_key.main.arn  # ✓ encrypted
}

# What it consistently omits:
resource "aws_secretsmanager_secret_rotation" "db_password" {
  secret_id           = aws_secretsmanager_secret.db_password.id
  rotation_lambda_arn = aws_lambda_function.rotate_secret.arn
  rotation_rules {
    automatically_after_days = 30
  }
}
    

A single-resource scanner flags the secret for missing a rotation configuration attribute. audytx's cross-resource engine finds the absent companion resource, confirms rotation is not configured anywhere in the plan, and fires — with a rationale that names the missing resource, not just the missing attribute.
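The rotation resource shown above is itself only part of the wiring: Secrets Manager also needs permission to invoke the rotation Lambda. A minimal sketch of that extra piece follows. The variable name is hypothetical and stands in for whichever function the rotation resource points at; the study does not say whether audytx checks for this permission.

```hcl
# Hypothetical input: the name of the Lambda function that performs rotation.
variable "rotation_lambda_name" {
  type        = string
  description = "Name of the Lambda function used for secret rotation"
}

# Allows the Secrets Manager service to invoke the rotation function.
resource "aws_lambda_permission" "allow_secretsmanager" {
  statement_id  = "AllowSecretsManagerInvoke"
  action        = "lambda:InvokeFunction"
  function_name = var.rotation_lambda_name
  principal     = "secretsmanager.amazonaws.com"
}
```

This is a second example of the same cross-resource shape: a configuration that validates on its own but is only complete once a companion resource exists.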

What the model got right

The naive prompts did not lead the model into obvious security traps:

No wildcard-admin IAM. Zero ATTACK_PATH_* privilege-escalation chains fired across all 50 apps. Service roles are scoped: Lambda gets lambda:InvokeFunction, ECS gets secretsmanager:GetSecretValue, not *.
Public access blocks on S3. 26 of 50 directories include aws_s3_bucket_public_access_block; private ACLs appear in most others. No public buckets reached the confirmed findings.
DLQ wiring. Dead-letter queues appear in nearly every async-worker configuration. The model understands the pattern — it just forgets to alarm on the DLQs.
Encryption-at-rest broadly present. 44 of 50 directories reference encryption; 39 of 50 create a KMS key.
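The DLQ point above is easy to picture in code. The sketch below wires a dead-letter queue the way the model reportedly does, then adds the alarm it reportedly omits. Queue names, the receive count, and the SNS topic variable are assumptions made for the example.

```hcl
# Hypothetical SNS topic that receives alarm notifications.
variable "alerts_topic_arn" {
  type = string
}

resource "aws_sqs_queue" "jobs_dlq" {
  name = "example-jobs-dlq"
}

resource "aws_sqs_queue" "jobs" {
  name = "example-jobs"

  # Messages that fail 5 receives are moved to the DLQ.
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.jobs_dlq.arn
    maxReceiveCount     = 5
  })
}

# The commonly missing piece: alarm as soon as anything lands in the DLQ.
resource "aws_cloudwatch_metric_alarm" "jobs_dlq_not_empty" {
  alarm_name          = "example-jobs-dlq-not-empty"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.alerts_topic_arn]

  dimensions = {
    QueueName = aws_sqs_queue.jobs_dlq.name
  }
}
```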

audytx's cross-resource reasoning suppressed 85 findings that a naive pattern-matcher would have fired, 71 of them IAM-role findings on service roles that are scoped and non-escalatable. On this corpus, the false-positive suppression logic, built and tuned on human-written Terraform, carried over to AI-generated Terraform.

Methodology caveats

One model, one date. This is claude-opus-4-8 on 2026-06-14 only. The harness is built for multi-model follow-up; that is out of scope for v1. The flagship is the strongest case — if it ships these gaps, smaller or older models likely ship more.
The prompt set is ours. 50 prompts we authored, not a random sample of real developer requests.
audytx authored the scanner and this study. Conflict of interest disclosed. The corpus, manifest, and raw scan results are committed to the testbed repo so anyone can re-scan with another tool. terraform validate results are tool-independent.
Findings ≠ vulnerabilities. Most are reliability/cost/observability gaps, not exploitable holes. The category and severity splits are there precisely so the raw total (686) isn't misleading.
Post-publication note on AWS_VPC_004. The v0.5.1 scan reported 47 "Needs Review" findings for security groups with unknown port exposure. After freezing this study, analysis found these were scanner false positives: the AI-generated code uses the newer aws_vpc_security_group_ingress_rule resource pattern, which the v0.5.1 engine couldn't resolve. Fixed in audytx v0.14.33. The AI-generated code was not actually exposing debug ports.
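To illustrate the resource pattern behind the AWS_VPC_004 note, here is a sketch of a security group whose ingress rule lives in a standalone aws_vpc_security_group_ingress_rule resource. The names, CIDR range, and port are hypothetical. The point is that the port information sits on a separate resource, so a scanner has to join on security_group_id to see what is exposed.

```hcl
# Hypothetical input for the example.
variable "vpc_id" {
  type = string
}

# The group itself carries no inline ingress blocks.
resource "aws_security_group" "app" {
  name   = "example-app"
  vpc_id = var.vpc_id
}

# The rule is a separate resource linked back by security_group_id.
resource "aws_vpc_security_group_ingress_rule" "https_internal" {
  security_group_id = aws_security_group.app.id
  description       = "HTTPS from inside the VPC"
  cidr_ipv4         = "10.0.0.0/16"
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
}
```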

Reproduce this study

The corpus, manifest, validate results, and scan results are committed to the audytx-testbed repo on the bench/ai-claude branch:

corpus/ai-claude/                # raw Terraform outputs (50 dirs)
corpus/ai-claude/manifest.yaml   # model ID, date, prompt hash per dir
results/ai-claude/validate/      # terraform validate per dir
results/ai-claude/audytx.json    # per-dir MCP scan results (v0.5.1)

Re-scan any directory against the live engine:

curl -s https://audytx.com/mcp \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0", "id": 1,
    "method": "tools/call",
    "params": {
      "name": "scan_terraform",
      "arguments": { "files": [{"path": "main.tf", "content": "..."}] }
    }
  }'

Audit your AI-generated Terraform

audytx posts findings directly in your pull request — cross-resource context, one-click fixes, and false-positive suppression with rationale.

Also available as an MCP server for Claude Code, Cursor, and any MCP-compatible agent. See the benchmark for the false-positive comparison.