What Frontier Models Get Wrong About AWS Terraform Security

We asked Claude's flagship model to write Terraform for 50 realistic AWS infrastructure scenarios — no security guidance, one shot each. Then we measured every output with audytx and terraform validate.

50 naive prompts across 8 archetypes
686 findings across 1,421 resources
14% fail terraform validate
85 findings suppressed by cross-resource context

TL;DR

AI-generated Terraform looks like production code, not a vulnerability lab. 48.3 findings per 100 resources — statistically indistinguishable from hand-written production modules (52.2) and ~5× cleaner than a deliberately vulnerable benchmark (269.1). The model avoids the gaping holes. The consistent gap is operational and cross-resource: it creates secrets but forgets rotation. It reaches for security knobs that don't exist and hallucinates the argument name. 1 in 7 generations doesn't pass terraform validate.

Method

Prompts: 50 developer-voiced AWS infrastructure scenarios across 8 archetypes (web app, serverless API, data pipeline, EKS cluster, static site, async worker, multi-env, vague "vibe" one-liners). Every prompt is security-free — no mention of security, compliance, or best practices. We measure defaults, not prompted-for hardening.

Generation: each prompt sent verbatim to claude-opus-4-8 with only "Write the Terraform. Output only HCL files." One generation per prompt, no quality retries. A broken output is data. Generators were isolated from the audytx repo so the scanner's context could not bias them.

Scan: audytx v0.5.1 via the MCP scan_terraform endpoint, one independent root module at a time — 50 separate scans so cross-resource reasoning stays within each app's boundary.

Reference corpora: terragoat (deliberately vulnerable AWS modules) and 21 clean production Terraform modules from the audytx benchmark, scanned with the identical per-directory method.

Key findings

Density comparison

| Corpus | Resources | Findings per 100 resources | Security findings per 100 | HIGH+CRIT per 100 |
|---|---|---|---|---|
| AI corpus (Opus 4.8) | 1,421 | 48.3 | 12.1 | 1.3 |
| Clean production modules (21) | 882 | 52.2 | 9.9 | 3.1 |
| terragoat (deliberately vulnerable) | 55 | 269.1 | 130.9 | 94.5 |

The AI corpus is statistically indistinguishable from production modules on every axis. It is not a vulnerability benchmark.

Where the debt lands

| Category | Findings | % of total |
|---|---|---|
| Reliability | 179 | 26% |
| Observability | 168 | 24% |
| Cost | 167 | 24% |
| Security | 163 | 24% |
| Data Protection | 9 | 1% |

The majority is operational, not security: reliability, observability, and cost together account for about three quarters of all findings. Missing CloudWatch alarms, missing prevent_destroy on stateful resources, and loose provider version pins dominate the count.
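As a rough illustration of the three operational gaps named above, the sketch below pins the provider, protects a stateful resource from accidental destruction, and adds one alarm. The resource names, the version constraint, and the alarm threshold are illustrative assumptions, not values taken from the corpus or from audytx's rule definitions.

```hcl
terraform {
  required_version = ">= 1.9.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumed constraint; pin to the range you have tested
    }
  }
}

# Hypothetical stateful resource. With prevent_destroy set, Terraform
# rejects any plan that would destroy this table.
resource "aws_dynamodb_table" "orders" {
  name         = "orders"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "order_id"

  attribute {
    name = "order_id"
    type = "S"
  }

  lifecycle {
    prevent_destroy = true
  }
}

# Hypothetical alarm on read throttling for the table above.
resource "aws_cloudwatch_metric_alarm" "orders_read_throttles" {
  alarm_name          = "orders-read-throttle-events"
  namespace           = "AWS/DynamoDB"
  metric_name         = "ReadThrottleEvents"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0 # assumed threshold; tune for your workload
  treat_missing_data  = "notBreaching"

  dimensions = {
    TableName = aws_dynamodb_table.orders.name
  }
}
```

None of these blocks changes what the application does, which may be why a one-shot generation with no operational guidance tends to leave them out.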

The 18 high-severity findings

| Finding | Rule | Count |
|---|---|---|
| Secrets Manager rotation not configured | AWS_SM_001 | 11 |
| S3 bucket policy doesn't deny non-TLS | AWS_S3_005 | 3 |
| ALB listener HTTP without HTTPS redirect | AWS_ELB_005 | 1 |
| Redshift parameter group doesn't enforce SSL | AWS_REDSHIFT_002 | 1 |
| S3 versioning disabled | AWS_S3_003 | 1 |
| Lambda 3s timeout in VPC | AWS_XREF_001 | 1 |

Secrets Manager rotation dominates — 61% of all HIGH findings come from a single cross-resource failure pattern.
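For the two transport-security rows in the table, a minimal sketch of the usual remediation might look like the following. It is an assumption about what would satisfy such checks in general, not a statement of audytx's rule logic. The bucket, the alb_arn variable, and all resource names are hypothetical.

```hcl
# Hypothetical bucket used only to make the sketch self-contained.
resource "aws_s3_bucket" "assets" {
  bucket_prefix = "assets-"
}

# Deny every S3 action on the bucket when the request is not made over TLS.
data "aws_iam_policy_document" "deny_insecure_transport" {
  statement {
    sid     = "DenyNonTLS"
    effect  = "Deny"
    actions = ["s3:*"]

    resources = [
      aws_s3_bucket.assets.arn,
      "${aws_s3_bucket.assets.arn}/*",
    ]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["false"]
    }
  }
}

resource "aws_s3_bucket_policy" "assets" {
  bucket = aws_s3_bucket.assets.id
  policy = data.aws_iam_policy_document.deny_insecure_transport.json
}

# Hypothetical input: ARN of an existing Application Load Balancer.
variable "alb_arn" {
  type = string
}

# Port 80 listener that only redirects to HTTPS instead of forwarding traffic.
resource "aws_lb_listener" "http_redirect" {
  load_balancer_arn = var.alb_arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"

    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}
```

In both cases the fix is a second block that refers back to the primary resource, the same shape as the rotation gap that dominates the table.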

Hallucination rate

7 of 50 (14%) fail terraform validate (Terraform 1.9.8). The signature failure: the model reaches for a security setting and invents the argument name.

Three of seven hallucinations are invented security knobs — the model knows it should enforce TLS or scope the KMS key, but invents the syntax. The intent is there; the provider knowledge is wrong.

The cross-resource failure pattern

The most consistent security gap in the corpus — 11 of 18 HIGH findings — is a single cross-resource failure: the model creates an aws_secretsmanager_secret for the database password but never creates the companion aws_secretsmanager_secret_rotation resource.

The model knows the right primary resource. It creates it with proper KMS encryption and scoped IAM access. It stops before wiring the dependency that makes the configuration complete. This matches the failure mode documented in arXiv:2512.14792: LLMs systematically fail to model cross-resource dependencies in infrastructure code.

    # What the model generates:
    resource "aws_secretsmanager_secret" "db_password" {
      name       = "${var.project}-db-password"
      kms_key_id = aws_kms_key.main.arn # ✓ encrypted
    }

    # What it consistently omits:
    resource "aws_secretsmanager_secret_rotation" "db_password" {
      secret_id           = aws_secretsmanager_secret.db_password.id
      rotation_lambda_arn = aws_lambda_function.rotate_secret.arn

      rotation_rules {
        automatically_after_days = 30
      }
    }

A single-resource scanner flags the secret for missing a rotation configuration attribute. audytx's cross-resource engine finds the absent companion resource, confirms rotation is not configured anywhere in the plan, and fires — with a rationale that names the missing resource, not just the missing attribute.
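One further dependency sits behind the rotation resource in the snippet above: Secrets Manager has to be allowed to invoke the rotation function, which in Terraform is a separate aws_lambda_permission. The sketch below assumes the aws_lambda_function.rotate_secret name from that snippet. The statement_id is an arbitrary label, and scoping by source account is one reasonable choice rather than the only one.

```hcl
# Account ID of the caller, used to scope the permission.
data "aws_caller_identity" "current" {}

# Lets the Secrets Manager service invoke the rotation function.
# Assumes aws_lambda_function.rotate_secret is defined elsewhere in the module.
resource "aws_lambda_permission" "allow_secrets_manager" {
  statement_id   = "AllowSecretsManagerInvoke" # arbitrary label
  action         = "lambda:InvokeFunction"
  function_name  = aws_lambda_function.rotate_secret.function_name
  principal      = "secretsmanager.amazonaws.com"
  source_account = data.aws_caller_identity.current.account_id
}
```

A complete rotation setup is therefore a chain of at least four resources (secret, rotation, function, permission), and a check that looks at any one of them in isolation cannot tell whether the chain is closed.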

What the model got right

The naive prompts did not lead the model into obvious security traps.

audytx's cross-resource reasoning suppressed 85 findings that a naive pattern-matcher would have fired — 71 of them IAM-role findings on service roles that are scoped and non-escalatable. The false-positive suppression moat — built and tuned on human-written Terraform — generalizes to AI-generated Terraform.

Methodology caveats

  1. One model, one date. This is claude-opus-4-8 on 2026-06-14 only. The harness is built for multi-model follow-up; that is out of scope for v1. The flagship is the strongest case — if it ships these gaps, smaller or older models likely ship more.
  2. The prompt set is ours. 50 prompts we authored, not a random sample of real developer requests.
  3. audytx authored the scanner and this study. Conflict of interest disclosed. The corpus, manifest, and raw scan results are committed to the testbed repo so anyone can re-scan with another tool. terraform validate results are tool-independent.
  4. Findings ≠ vulnerabilities. Most are reliability/cost/observability gaps, not exploitable holes. The category and severity splits are there precisely so the raw total (686) isn't misleading.
  5. Post-publication note on AWS_VPC_004. The v0.5.1 scan reported 47 "Needs Review" findings for security groups with unknown port exposure. After freezing this study, analysis found these were scanner false positives: the AI-generated code uses the newer aws_vpc_security_group_ingress_rule resource pattern, which the v0.5.1 engine couldn't resolve. Fixed in audytx v0.14.33. The AI-generated code was not actually exposing debug ports.
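To make caveat 5 concrete, the sketch below contrasts the two ways of declaring the same ingress rule. All names, the port, the CIDR range, and the vpc_id variable are hypothetical. The point is only that in the newer pattern the port information lives in a separate resource, so an engine that reads the aws_security_group block alone would presumably see no rules to evaluate.

```hcl
# Hypothetical input: ID of an existing VPC.
variable "vpc_id" {
  type = string
}

# Older pattern: the rule is declared inline on the security group.
resource "aws_security_group" "inline" {
  name   = "web-inline"
  vpc_id = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/16"]
  }
}

# Newer pattern: the group itself carries no rules.
resource "aws_security_group" "standalone" {
  name   = "web-standalone"
  vpc_id = var.vpc_id
}

# The rule is its own resource and points back at the group.
resource "aws_vpc_security_group_ingress_rule" "https" {
  security_group_id = aws_security_group.standalone.id
  cidr_ipv4         = "10.0.0.0/16"
  from_port         = 443
  to_port           = 443
  ip_protocol       = "tcp"
}
```

Both forms open the same port to the same range. Resolving the second one requires following the security_group_id reference, which is itself a cross-resource lookup.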

Reproduce this study

The corpus, manifest, validate results, and scan results are committed to the audytx-testbed repo on the bench/ai-claude branch:

    corpus/ai-claude/                 # raw Terraform outputs (50 dirs)
    corpus/ai-claude/manifest.yaml    # model ID, date, prompt hash per dir
    results/ai-claude/validate/       # terraform validate per dir
    results/ai-claude/audytx.json     # per-dir MCP scan results (v0.5.1)

Re-scan any directory against the live engine:

    curl -s https://audytx.com/mcp \
      -H "Content-Type: application/json" \
      -d '{
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
          "name": "scan_terraform",
          "arguments": {
            "files": [{"path": "main.tf", "content": "..."}]
          }
        }
      }'

Audit your AI-generated Terraform

audytx posts findings directly in your pull request — cross-resource context, one-click fixes, and false-positive suppression with rationale.

Also available as an MCP server for Claude Code, Cursor, and any MCP-compatible agent. See the benchmark for the false-positive comparison.