# Building a Cloud Remediation Agent with Terraform and LangChain
Cloud misconfigurations are one of the top causes of security breaches. According to Gartner, through 2025, 99% of cloud security failures will be the customer's fault — primarily due to misconfigurations. What if we could build an AI agent that automatically detects and remediates these issues?
That's exactly what I built at Strobes Security. Here's how.
## The Problem
Traditional cloud security tools follow this workflow:
1. Scanner detects a misconfiguration (e.g., an S3 bucket is public)
2. Alert goes to the security team
3. Security team creates a ticket
4. DevOps engineer manually fixes it
5. Fix is verified
The gap between detection and remediation can be days or weeks. In that window, your infrastructure is vulnerable.
## The Solution: An Autonomous Remediation Agent
I built a system that compresses this entire workflow into minutes:
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Cloud     │     │  LangChain   │     │  Terraform   │
│   Scanner    │────▶│    Agent     │────▶│    Apply     │
│   (Detect)   │     │  (Analyze)   │     │ (Remediate)  │
└──────────────┘     └──────────────┘     └──────────────┘
        │                   │                    │
        ▼                   ▼                    ▼
 Misconfiguration      Fix Strategy        Applied Patch
     Detected           Generated         (with rollback)
```
## Architecture Deep Dive

### Component 1: Misconfiguration Ingestion
The agent receives findings from multiple sources — ORCA Cloud, AWS Config, or custom scanners:
```python
from pydantic import BaseModel

class CloudFinding(BaseModel):
    resource_type: str      # e.g., "aws_s3_bucket"
    resource_id: str        # e.g., "arn:aws:s3:::my-bucket"
    misconfiguration: str   # e.g., "public_access_enabled"
    severity: str           # "critical", "high", "medium", "low"
    current_config: dict    # Current resource configuration
    cloud_provider: str     # "aws", "azure", "gcp"
```
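For instance, a public-bucket finding would arrive as shown below (the field values are illustrative, and the model is repeated so the snippet stands alone):

```python
from pydantic import BaseModel

class CloudFinding(BaseModel):
    resource_type: str
    resource_id: str
    misconfiguration: str
    severity: str
    current_config: dict
    cloud_provider: str

# A hypothetical finding for a publicly readable S3 bucket
finding = CloudFinding(
    resource_type="aws_s3_bucket",
    resource_id="arn:aws:s3:::my-bucket",
    misconfiguration="public_access_enabled",
    severity="high",
    current_config={"acl": "public-read", "block_public_acls": False},
    cloud_provider="aws",
)
```

Pydantic validates field types on construction, so malformed scanner payloads fail fast, before they ever reach the agent.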
### Component 2: The LangChain Analysis Agent
The core intelligence uses LangChain with tool-calling to analyze the finding and generate a remediation plan:
```python
import subprocess
import tempfile
from pathlib import Path

from langchain_openai import ChatOpenAI
from langchain.agents import create_tool_calling_agent
from langchain_core.tools import tool

llm = ChatOpenAI(model="gpt-4o", temperature=0)

@tool
def get_terraform_docs(resource_type: str) -> str:
    """Retrieve Terraform documentation for a specific resource type."""
    # Fetch from the Terraform registry API
    return fetch_tf_docs(resource_type)

@tool
def validate_terraform(hcl_code: str) -> str:
    """Validate Terraform HCL code for syntax errors."""
    # `terraform validate` operates on a directory, not stdin, so write
    # the candidate HCL to a temporary workspace first (a config that
    # references providers would also need `terraform init` here).
    with tempfile.TemporaryDirectory() as workdir:
        (Path(workdir) / "main.tf").write_text(hcl_code)
        result = subprocess.run(
            ["terraform", "validate"],
            cwd=workdir, capture_output=True, text=True
        )
    return result.stdout + result.stderr

@tool
def generate_terraform_patch(finding: str, current_config: str) -> str:
    """Generate a Terraform patch to remediate the misconfiguration."""
    prompt = f"""
    Given this cloud misconfiguration:
    {finding}

    Current configuration:
    {current_config}

    Generate a Terraform resource block that fixes this issue.
    Include comments explaining each change.
    """
    return llm.invoke(prompt).content

tools = [get_terraform_docs, validate_terraform, generate_terraform_patch]
# prompt_template: a ChatPromptTemplate carrying the system instructions
# and an agent scratchpad placeholder (defined elsewhere)
agent = create_tool_calling_agent(llm, tools, prompt_template)
```
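Under the hood, the tool-calling agent runs a loop: the model picks a tool and arguments, the framework executes it, and the observation is fed back until the model emits a final answer. A dependency-free sketch of that shape (illustrative only, not LangChain's internals):

```python
# Simplified tool-calling loop: decide() stands in for the LLM and
# returns either ("tool", name, args) or ("final", None, text).
def run_tool_loop(decide, tools, question, max_steps=5):
    history = []
    for _ in range(max_steps):
        kind, name, payload = decide(question, history)
        if kind == "final":
            return payload
        observation = tools[name](**payload)  # execute the chosen tool
        history.append((name, observation))   # feed result back to the model
    raise RuntimeError("agent did not converge")

# Scripted stand-in for the LLM: look up docs first, then answer.
def scripted_decide(question, history):
    if not history:
        return ("tool", "get_terraform_docs", {"resource_type": "aws_s3_bucket"})
    return ("final", None, f"patch based on: {history[0][1]}")

tools = {"get_terraform_docs": lambda resource_type: f"<docs for {resource_type}>"}
answer = run_tool_loop(scripted_decide, tools, "fix public bucket")
```

`create_tool_calling_agent` automates exactly this dispatch, using the tool docstrings and type hints to tell the model what each tool does.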
### Component 3: Celery + RabbitMQ for Async Execution
Remediation operations can be slow (Terraform plans take time), so we process them asynchronously:
```python
from celery import Celery

app = Celery("remediation", broker="amqp://rabbitmq:5672")

@app.task(bind=True, max_retries=3)
def remediate_finding(self, finding_id: str):
    try:
        finding = CloudFinding.from_db(finding_id)

        # Step 1: Generate remediation plan
        plan = agent.invoke({"finding": finding.dict()})

        # Step 2: Run terraform plan (dry-run)
        plan_result = terraform_plan(plan.patch)

        # Step 3: If plan is safe, apply
        if plan_result.changes_only_target_resource:
            apply_result = terraform_apply(plan.patch)

            # Step 4: Log for audit trail
            AuditLog.create(
                finding=finding,
                action="remediated",
                patch=plan.patch,
                result=apply_result,
            )
        else:
            # Flag for human review — patch affects other resources
            flag_for_review(finding, plan)
    except Exception as exc:
        self.retry(exc=exc, countdown=60)
```
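Callers simply enqueue the task (`remediate_finding.delay(finding_id)`) and return immediately. The retry policy above (up to three attempts, 60 seconds apart) can be mimicked without a broker to see the semantics; this is a sketch of the behavior, not Celery's implementation:

```python
import time

# Sketch of @app.task(bind=True, max_retries=3) retry semantics: on
# failure, re-schedule after `countdown` seconds; re-raise once
# max_retries is exhausted.
def with_retries(fn, max_retries=3, countdown=60):
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(countdown)  # Celery schedules this asynchronously

attempts = []

def flaky_remediation():
    """Fails twice (e.g., a transient Terraform state lock), then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient terraform lock error")
    return "remediated"

result = with_retries(flaky_remediation, max_retries=3, countdown=0)
```

This is why transient failures (state locks, API throttling) rarely need human attention: the queue absorbs them.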
### Component 4: Safety Guardrails

The most critical part — ensuring the agent doesn't break production:

- Blast Radius Check: before applying, verify the Terraform plan only affects the target resource
- Rollback Capability: store the previous state so we can `terraform apply` the old config
- Approval Gates: critical/high severity findings require human approval before apply
- Dry-Run Mode: new remediation patterns run in dry-run mode for their first 10 executions
```python
def safety_check(plan_output: str, target_resource: str) -> bool:
    """Ensure the Terraform plan only modifies the target resource."""
    planned_changes = parse_terraform_plan(plan_output)
    for change in planned_changes:
        if change.resource_id != target_resource:
            return False  # Plan would affect other resources
    return True
```
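The plan-parsing step is where the blast-radius logic actually lives. One reasonable implementation (a sketch, assuming the plan has been rendered to JSON with `terraform show -json plan.out`) walks the `resource_changes` array of that output:

```python
import json

def parse_plan_changes(plan_json: str) -> list:
    """Return (address, actions) for every resource the plan would touch,
    skipping no-op entries. Input is `terraform show -json` output."""
    data = json.loads(plan_json)
    changes = []
    for rc in data.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions != ["no-op"]:
            changes.append((rc["address"], actions))
    return changes

# Hypothetical plan JSON: only the target bucket is actually modified
sample = json.dumps({"resource_changes": [
    {"address": "aws_s3_bucket.my_bucket", "change": {"actions": ["update"]}},
    {"address": "aws_iam_role.other", "change": {"actions": ["no-op"]}},
]})
changes = parse_plan_changes(sample)
```

Working from the machine-readable plan, rather than scraping the human-readable output, keeps the check robust across Terraform versions.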
## Results in Production
After deploying this agent:
- Mean Time to Remediation (MTTR) dropped from 4.2 days to 23 minutes
- 93% of findings were auto-remediated without human intervention
- Zero false-positive remediations in 6 months (thanks to the safety guardrails)
- Full audit trail for every change, satisfying compliance requirements
## Lessons Learned

- Never trust the LLM blindly: always validate generated Terraform before applying. The `terraform plan` step is non-negotiable.
- Start with low-severity findings: we rolled out auto-remediation starting with "low" severity only, gradually enabling higher severities as confidence grew.
- Rollback is mandatory: every remediation must be reversible. Store the previous state.
- Context matters: including the Terraform documentation in the agent's context dramatically improved patch quality.
- Async is essential: Terraform operations are slow. Using Celery + RabbitMQ keeps the system responsive.
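The "store the previous state" lesson can be sketched with `terraform state pull`, which dumps the workspace's current state as JSON. Here the `runner` parameter is injected purely for testability and is an illustrative choice, not part of any real API:

```python
import subprocess

STATE_PULL = ["terraform", "state", "pull"]

def snapshot_state(workdir: str, runner=subprocess.run) -> str:
    """Capture the workspace's current state as JSON before applying a
    patch, so the previous config can be restored if the fix misbehaves."""
    result = runner(STATE_PULL, cwd=workdir,
                    capture_output=True, text=True, check=True)
    return result.stdout  # persist this next to the AuditLog entry
```

Restoring then means pushing the snapshot back (e.g., via `terraform state push`) or re-applying the stored configuration.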
## Tech Stack Summary
| Component | Technology |
|---|---|
| Agent Framework | LangChain + LangGraph |
| LLM | GPT-4o (via OpenAI API) |
| Infrastructure as Code | Terraform |
| Task Queue | Celery + RabbitMQ |
| Database | PostgreSQL |
| API | FastAPI |
| Containerization | Docker |
## What's Next?
I'm exploring multi-cloud remediation — a single agent that can generate patches for AWS, Azure, and GCP using provider-specific Terraform modules. The challenge is maintaining a unified finding schema across cloud providers.
Building AI-powered security automation at Strobes Security. Connect with me on GitHub.