What Frontier Models Get Wrong About AWS Terraform Security
We asked Claude's flagship model to write Terraform for 50 realistic AWS infrastructure scenarios — no security guidance, one shot each. Then we measured every output with audytx and terraform validate.
8 archetypes · 1,421 resources · terraform validate · cross-resource context
AI-generated Terraform looks like production code, not a vulnerability lab. At 48.3 findings per 100 resources it is statistically indistinguishable from hand-written production modules (52.2) and about 5.6× cleaner than a deliberately vulnerable benchmark (269.1). The model avoids the gaping holes. The consistent gap is operational and cross-resource: it creates secrets but forgets rotation, and when it reaches for a security control it sometimes hallucinates the argument name. 7 of 50 generations (14%) fail terraform validate.
Method
Prompts: 50 developer-voiced AWS infrastructure scenarios across 8 archetypes (web app, serverless API, data pipeline, EKS cluster, static site, async worker, multi-env, vague "vibe" one-liners). Every prompt is security-free — no mention of security, compliance, or best practices. We measure defaults, not prompted-for hardening.
Generation: each prompt was sent verbatim to claude-opus-4-8 with a single added instruction: "Write the Terraform. Output only HCL files." One generation per prompt, no quality retries; a broken output is data. Generators were isolated from the audytx repo so the scanner's context could not bias them.
Scan: audytx v0.5.1 via the MCP scan_terraform endpoint, one independent root module at a time — 50 separate scans so cross-resource reasoning stays within each app's boundary.
Reference corpora: terragoat (deliberately vulnerable AWS modules) and 21 clean production Terraform modules from the audytx benchmark, scanned with the identical per-directory method.
Key findings
Density comparison
| Corpus | Resources | Findings / 100 | Security / 100 | HIGH+CRIT / 100 |
|---|---|---|---|---|
| AI corpus (Opus 4.8) | 1,421 | 48.3 | 12.1 | 1.3 |
| Clean production modules (21) | 882 | 52.2 | 9.9 | 3.1 |
| terragoat (deliberately vulnerable) | 55 | 269.1 | 130.9 | 94.5 |
The AI corpus is statistically indistinguishable from production modules on every axis. It is not a vulnerability benchmark.
Where the debt lands
| Category | Findings | % of total |
|---|---|---|
| Reliability | 179 | 26% |
| Observability | 168 | 24% |
| Cost | 167 | 24% |
| Security | 163 | 24% |
| Data Protection | 9 | 1% |
Roughly three quarters of the findings (514 of 686) are operational, spread across reliability, observability, and cost, rather than security. Missing CloudWatch alarms, missing prevent_destroy on stateful resources, and loose provider version pins dominate the count.
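As a rough illustration of two of the operational gaps named above, the sketch below pins the AWS provider to a bounded version range and guards a stateful table against accidental deletion. The table name, key, and version numbers are placeholders chosen for this example, not values taken from the corpus.

```hcl
terraform {
  required_version = ">= 1.9.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Pessimistic constraint: accepts 5.60 and later 5.x releases, rejects 6.0.
      version = "~> 5.60"
    }
  }
}

# Hypothetical stateful resource used only to show the lifecycle guard.
resource "aws_dynamodb_table" "orders" {
  name         = "example-orders"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "order_id"

  attribute {
    name = "order_id"
    type = "S"
  }

  lifecycle {
    # Terraform refuses any plan that would destroy this table.
    prevent_destroy = true
  }
}
```

With prevent_destroy set, removing the table means first editing the configuration deliberately, which is the point of the guard.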
The 18 high-severity findings
| Finding | Rule | Count |
|---|---|---|
| Secrets Manager rotation not configured | AWS_SM_001 | 11 |
| S3 bucket policy doesn't deny non-TLS | AWS_S3_005 | 3 |
| ALB listener HTTP without HTTPS redirect | AWS_ELB_005 | 1 |
| Redshift parameter group doesn't enforce SSL | AWS_REDSHIFT_002 | 1 |
| S3 versioning disabled | AWS_S3_003 | 1 |
| Lambda 3s timeout in VPC | AWS_XREF_001 | 1 |
Secrets Manager rotation dominates — 61% of all HIGH findings come from a single cross-resource failure pattern.
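Two of the smaller HIGH findings have well-known Terraform shapes. The sketch below shows one common way to deny non-TLS access to a bucket and to redirect an HTTP listener to HTTPS. The variables and resource names are hypothetical, and whether this exact form satisfies AWS_S3_005 and AWS_ELB_005 is an assumption rather than something the study states.

```hcl
# Hypothetical inputs for this sketch.
variable "bucket_name" { type = string }
variable "load_balancer_arn" { type = string }

resource "aws_s3_bucket" "data" {
  bucket = var.bucket_name
}

# Deny every S3 action on the bucket and its objects when the request is not over TLS.
data "aws_iam_policy_document" "deny_insecure_transport" {
  statement {
    sid     = "DenyNonTLS"
    effect  = "Deny"
    actions = ["s3:*"]

    resources = [
      aws_s3_bucket.data.arn,
      "${aws_s3_bucket.data.arn}/*",
    ]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["false"]
    }
  }
}

resource "aws_s3_bucket_policy" "data" {
  bucket = aws_s3_bucket.data.id
  policy = data.aws_iam_policy_document.deny_insecure_transport.json
}

# Port 80 listener that only redirects to HTTPS instead of forwarding traffic.
resource "aws_lb_listener" "http_redirect" {
  load_balancer_arn = var.load_balancer_arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"

    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}
```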
Hallucination rate
7 of 50 (14%) fail terraform validate (Terraform 1.9.8). The signature failure: the model reaches for a security setting and invents the argument name.
- pipeline-03/redshift.tf:71: Unsupported argument: "require_tls" (no such attribute on the resource)
- pipeline-04/redshift.tf:92: Unsupported argument: "require_ssl"
- webapp-01/database.tf:62: Unsupported argument: "storage_encrypted_kms_key_id" (invented fusion of two real RDS arguments)
Three of seven hallucinations are invented security knobs — the model knows it should enforce TLS or scope the KMS key, but invents the syntax. The intent is there; the provider knowledge is wrong.
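For contrast, here is a sketch of where the AWS provider places the settings the model seems to have been reaching for. Redshift SSL enforcement is a parameter named require_ssl inside an aws_redshift_parameter_group, not an argument on another resource, and RDS encryption uses two separate arguments, storage_encrypted and kms_key_id. The names, sizes, and the kms_key_arn variable are illustrative assumptions. The generated files themselves are not reproduced here.

```hcl
# Hypothetical input: ARN of an existing KMS key.
variable "kms_key_arn" { type = string }

# SSL enforcement for Redshift is set through a parameter group.
# A cluster would attach it via cluster_parameter_group_name.
resource "aws_redshift_parameter_group" "require_ssl" {
  name   = "example-require-ssl"
  family = "redshift-1.0"

  parameter {
    name  = "require_ssl"
    value = "true"
  }
}

# RDS encryption at rest: two distinct arguments, not one fused name.
resource "aws_db_instance" "example" {
  identifier                  = "example-db"
  engine                      = "postgres"
  instance_class              = "db.t4g.micro"
  allocated_storage           = 20
  username                    = "app"
  manage_master_user_password = true
  final_snapshot_identifier   = "example-db-final"

  storage_encrypted = true
  kms_key_id        = var.kms_key_arn
}
```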
The cross-resource failure pattern
The most consistent security gap in the corpus — 11 of 18 HIGH findings — is a single cross-resource failure: the model creates an aws_secretsmanager_secret for the database password but never creates the companion aws_secretsmanager_secret_rotation resource.
The model knows the right primary resource. It creates it with proper KMS encryption and scoped IAM access. It stops before wiring the dependency that makes the configuration complete. This matches the failure mode documented in arXiv:2512.14792: LLMs systematically fail to model cross-resource dependencies in infrastructure code.
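The missing companion resource is small. A minimal sketch, assuming a rotation Lambda already exists and its ARN is passed in through a hypothetical rotation_lambda_arn variable; the secret name and the 30-day interval are placeholders:

```hcl
variable "rotation_lambda_arn" {
  type        = string
  description = "ARN of an existing rotation Lambda (hypothetical input for this sketch)."
}

# The primary resource the model reliably creates.
resource "aws_secretsmanager_secret" "db_password" {
  name = "example/db/password"
}

# The companion resource that is absent in the flagged configurations.
resource "aws_secretsmanager_secret_rotation" "db_password" {
  secret_id           = aws_secretsmanager_secret.db_password.id
  rotation_lambda_arn = var.rotation_lambda_arn

  rotation_rules {
    automatically_after_days = 30
  }
}
```

In a real module the rotation Lambda also needs a resource policy that lets Secrets Manager invoke it, which is one more cross-resource dependency of the same kind.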
A single-resource scanner flags the secret for missing a rotation configuration attribute. audytx's cross-resource engine finds the absent companion resource, confirms rotation is not configured anywhere in the plan, and fires — with a rationale that names the missing resource, not just the missing attribute.
What the model got right
The naive prompts did not lead the model into obvious security traps:
- No wildcard-admin IAM. Zero ATTACK_PATH_* privilege-escalation chains fired across all 50 apps. Service roles are scoped: Lambda gets lambda:InvokeFunction, ECS gets secretsmanager:GetSecretValue, not *.
- Public access blocks on S3. 26 of 50 directories include aws_s3_bucket_public_access_block; private ACLs appear in most others. No public buckets reached the confirmed findings.
- DLQ wiring. Dead-letter queues appear in nearly every async-worker configuration. The model understands the pattern; it just forgets to alarm on the DLQs.
- Encryption-at-rest broadly present. 44 of 50 directories reference encryption; 39 of 50 create a KMS key.
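The DLQ gap mentioned in the list above is an observability one: the queue exists, but nothing fires when messages land in it. One possible shape for the missing alarm is sketched below, assuming an existing SNS topic whose ARN arrives through a hypothetical alarm_topic_arn variable. Queue names and thresholds are placeholders.

```hcl
# Hypothetical input: SNS topic that receives alarm notifications.
variable "alarm_topic_arn" { type = string }

resource "aws_sqs_queue" "dlq" {
  name = "example-worker-dlq"
}

resource "aws_sqs_queue" "work" {
  name = "example-worker"

  # After 5 failed receives a message moves to the dead-letter queue.
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq.arn
    maxReceiveCount     = 5
  })
}

# Alarm as soon as any message is visible in the DLQ.
resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
  alarm_name          = "example-worker-dlq-not-empty"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.alarm_topic_arn]

  dimensions = {
    QueueName = aws_sqs_queue.dlq.name
  }
}
```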
audytx's cross-resource reasoning suppressed 85 findings that a naive pattern-matcher would have fired — 71 of them IAM-role findings on service roles that are scoped and non-escalatable. The false-positive suppression moat — built and tuned on human-written Terraform — generalizes to AI-generated Terraform.
Methodology caveats
- One model, one date. This is claude-opus-4-8 on 2026-06-14 only. The harness is built for multi-model follow-up; that is out of scope for v1. The flagship is the strongest case: if it ships these gaps, smaller or older models likely ship more.
- The prompt set is ours. 50 prompts we authored, not a random sample of real developer requests.
- audytx authored the scanner and this study. Conflict of interest disclosed. The corpus, manifest, and raw scan results are committed to the testbed repo so anyone can re-scan with another tool. terraform validate results are tool-independent.
- Findings ≠ vulnerabilities. Most are reliability/cost/observability gaps, not exploitable holes. The category and severity splits are there precisely so the raw total (686) isn't misleading.
- Post-publication note on AWS_VPC_004. The v0.5.1 scan reported 47 "Needs Review" findings for security groups with unknown port exposure. After freezing this study, analysis found these were scanner false positives: the AI-generated code uses the newer aws_vpc_security_group_ingress_rule resource pattern, which the v0.5.1 engine couldn't resolve. Fixed in audytx v0.14.33. The AI-generated code was not actually exposing debug ports.
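For readers unfamiliar with the pattern behind the AWS_VPC_004 note, the sketch below shows the standalone-rule style: the security group carries no inline ingress blocks, and each rule is its own resource that points back at the group. A scanner that reads only the group sees no ports at all, which is presumably why exposure looked unknown. The VPC variable, names, and CIDR are placeholders.

```hcl
# Hypothetical input for this sketch.
variable "vpc_id" { type = string }

# The group itself declares no inline ingress or egress rules.
resource "aws_security_group" "app" {
  name   = "example-app"
  vpc_id = var.vpc_id
}

# The rule lives in a separate resource and references the group by ID.
resource "aws_vpc_security_group_ingress_rule" "https" {
  security_group_id = aws_security_group.app.id
  description       = "HTTPS from inside the VPC range (placeholder CIDR)"
  cidr_ipv4         = "10.0.0.0/16"
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
}
```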
Reproduce this study
The corpus, manifest, validate results, and scan results are committed to the audytx-testbed repo on the bench/ai-claude branch:
Re-scan any directory against the live engine:
Audit your AI-generated Terraform
audytx posts findings directly in your pull request — cross-resource context, one-click fixes, and false-positive suppression with rationale.