Building a Cloud Remediation Agent with Terraform and LangChain

2025-03-05 · AI, Terraform, LangChain, Cloud Security, DevSecOps

Cloud misconfigurations are one of the top causes of security breaches. According to Gartner, through 2025, 99% of cloud security failures will be the customer's fault — primarily due to misconfigurations. What if we could build an AI agent that automatically detects and remediates these issues?

That's exactly what I built at Strobes Security. Here's how.

The Problem

Traditional cloud security tools follow this workflow:

  1. Scanner detects misconfiguration (e.g., S3 bucket is public)
  2. Alert goes to the security team
  3. Security team creates a ticket
  4. DevOps engineer manually fixes it
  5. Fix is verified

The gap between detection and remediation can be days or weeks. In that window, your infrastructure is vulnerable.

The Solution: An Autonomous Remediation Agent

I built a system that compresses this entire workflow into minutes:

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Cloud      │    │   LangChain  │    │   Terraform  │
│   Scanner    │───▶│   Agent      │───▶│   Apply      │
│   (Detect)   │    │   (Analyze)  │    │  (Remediate) │
└──────────────┘    └──────────────┘    └──────────────┘
        │                   │                   │
        ▼                   ▼                   ▼
Misconfiguration      Fix Strategy       Applied Patch
    Detected            Generated        (with rollback)

Architecture Deep Dive

Component 1: Misconfiguration Ingestion

The agent receives findings from multiple sources (Orca Security, AWS Config, or custom scanners):

from pydantic import BaseModel

class CloudFinding(BaseModel):
    resource_type: str       # e.g., "aws_s3_bucket"
    resource_id: str         # e.g., "arn:aws:s3:::my-bucket"
    misconfiguration: str    # e.g., "public_access_enabled"
    severity: str            # "critical", "high", "medium", "low"
    current_config: dict     # Current resource configuration
    cloud_provider: str      # "aws", "azure", "gcp"
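
Each source emits findings in its own shape, so a thin normalization layer maps raw payloads onto this schema before validation. A minimal sketch, using stdlib only; the input field names here are illustrative assumptions, and each real source needs its own mapper:

```python
def normalize_aws_config(raw: dict) -> dict:
    """Map a hypothetical AWS Config-style payload onto CloudFinding fields."""
    return {
        "resource_type": raw["resourceType"],
        "resource_id": raw["resourceId"],
        "misconfiguration": raw["ruleName"],
        "severity": raw.get("severity", "medium").lower(),
        "current_config": raw.get("configuration", {}),
        "cloud_provider": "aws",
    }

finding = normalize_aws_config({
    "resourceType": "aws_s3_bucket",
    "resourceId": "arn:aws:s3:::my-bucket",
    "ruleName": "public_access_enabled",
    "severity": "HIGH",
    "configuration": {"acl": "public-read"},
})
```

The resulting dict can then be validated with `CloudFinding(**finding)`, so malformed scanner output fails fast at the ingestion boundary.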

Component 2: The LangChain Analysis Agent

The core intelligence uses LangChain with tool-calling to analyze the finding and generate a remediation plan:

import subprocess
import tempfile
from pathlib import Path

from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool

llm = ChatOpenAI(model="gpt-4o", temperature=0)

@tool
def get_terraform_docs(resource_type: str) -> str:
    """Retrieve Terraform documentation for a specific resource type."""
    # fetch_tf_docs wraps the Terraform registry API (implementation omitted)
    return fetch_tf_docs(resource_type)

@tool
def validate_terraform(hcl_code: str) -> str:
    """Validate Terraform HCL code for syntax errors."""
    # `terraform validate` reads a working directory, not stdin, so write
    # the candidate patch to a temporary workspace first
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "main.tf").write_text(hcl_code)
        result = subprocess.run(
            ["terraform", "validate"],
            cwd=workdir, capture_output=True, text=True
        )
    return result.stdout + result.stderr

@tool
def generate_terraform_patch(finding: str, current_config: str) -> str:
    """Generate a Terraform patch to remediate the misconfiguration."""
    prompt = f"""
    Given this cloud misconfiguration:
    {finding}

    Current configuration:
    {current_config}

    Generate a Terraform resource block that fixes this issue.
    Include comments explaining each change.
    """
    return llm.invoke(prompt).content

prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You are a cloud security engineer. Analyze the finding "
               "and produce a validated Terraform patch."),
    ("human", "{finding}"),
    ("placeholder", "{agent_scratchpad}"),
])

tools = [get_terraform_docs, validate_terraform, generate_terraform_patch]
agent = AgentExecutor(
    agent=create_tool_calling_agent(llm, tools, prompt_template),
    tools=tools,
)

Component 3: Celery + RabbitMQ for Async Execution

Remediation operations can be slow (Terraform plans take time), so we process them asynchronously:

from celery import Celery

app = Celery("remediation", broker="amqp://rabbitmq:5672")

@app.task(bind=True, max_retries=3)
def remediate_finding(self, finding_id: str):
    try:
        finding = CloudFinding.from_db(finding_id)

        # Step 1: Generate remediation plan
        plan = agent.invoke({"finding": finding.dict()})

        # Step 2: Run terraform plan (dry-run)
        plan_result = terraform_plan(plan.patch)

        # Step 3: If plan is safe, apply
        if plan_result.changes_only_target_resource:
            apply_result = terraform_apply(plan.patch)

            # Step 4: Log for audit trail
            AuditLog.create(
                finding=finding,
                action="remediated",
                patch=plan.patch,
                result=apply_result
            )
        else:
            # Flag for human review — patch affects other resources
            flag_for_review(finding, plan)
    except Exception as exc:
        self.retry(exc=exc, countdown=60)

Component 4: Safety Guardrails

The most critical part — ensuring the agent doesn't break production:

  1. Blast Radius Check: Before applying, verify the Terraform plan only affects the target resource
  2. Rollback Capability: Store the previous state so we can terraform apply the old config
  3. Approval Gates: Critical/high severity findings require human approval before apply
  4. Dry-Run Mode: New remediation patterns run in dry-run mode for the first 10 executions
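
Guardrails 3 and 4 can be combined into a single gating decision taken before any apply. A minimal sketch; the names (`RemediationGate`, `DRY_RUN_THRESHOLD`) are illustrative, not from the production system:

```python
DRY_RUN_THRESHOLD = 10  # first N executions of a new pattern stay dry-run

class RemediationGate:
    """Decide how a remediation may run: apply, dry-run, or human approval."""

    def __init__(self):
        self.execution_counts: dict[str, int] = {}  # pattern -> runs so far

    def decide(self, severity: str, pattern: str) -> str:
        """Return 'apply', 'dry_run', or 'needs_approval'."""
        runs = self.execution_counts.get(pattern, 0)
        self.execution_counts[pattern] = runs + 1
        if severity in ("critical", "high"):
            return "needs_approval"   # guardrail 3: approval gate
        if runs < DRY_RUN_THRESHOLD:
            return "dry_run"          # guardrail 4: burn-in period
        return "apply"
```

In the real system the counts would live in the database rather than memory, so the burn-in period survives worker restarts.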

The blast radius check (guardrail 1) is a simple filter over the parsed Terraform plan:

def safety_check(plan_output: str, target_resource: str) -> bool:
    """Ensure Terraform plan only modifies the target resource."""
    planned_changes = parse_terraform_plan(plan_output)

    for change in planned_changes:
        if change.resource_id != target_resource:
            return False  # Plan would affect other resources

    return True
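
The `parse_terraform_plan` helper can be backed by Terraform's machine-readable plan output (`terraform show -json plan.out`), which exposes a `resource_changes` array. A minimal sketch, with the `ResourceChange` shape chosen to match what `safety_check` above expects:

```python
import json
from dataclasses import dataclass

@dataclass
class ResourceChange:
    resource_id: str   # Terraform address, e.g. "aws_s3_bucket.logs"
    actions: list      # e.g. ["update"] or ["delete", "create"]

def parse_terraform_plan(plan_json: str) -> list:
    """Extract real changes from `terraform show -json` output.

    Entries whose only action is "no-op" are skipped, so safety_check
    only sees resources the plan would actually touch.
    """
    data = json.loads(plan_json)
    changes = []
    for rc in data.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions == ["no-op"]:
            continue
        changes.append(ResourceChange(rc["address"], actions))
    return changes
```
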

Results in Production

After deploying this agent:

  • Mean Time to Remediation (MTTR) dropped from 4.2 days to 23 minutes
  • 93% of findings were auto-remediated without human intervention
  • Zero false-positive remediations in 6 months (thanks to the safety guardrails)
  • Full audit trail for every change, satisfying compliance requirements

Lessons Learned

  1. Never trust the LLM blindly: Always validate generated Terraform before applying. The terraform plan step is non-negotiable.

  2. Start with low-severity findings: We rolled out auto-remediation starting with "low" severity only, gradually enabling higher severities as confidence grew.

  3. Rollback is mandatory: Every remediation must be reversible. Store the previous state.

  4. Context matters: Including the Terraform documentation in the agent's context dramatically improved patch quality.

  5. Async is essential: Terraform operations are slow. Using Celery + RabbitMQ keeps the system responsive.
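
Lesson 3 can be as simple as snapshotting the resource's prior configuration immediately before each apply. A sketch; the in-memory dict here stands in for the real database table:

```python
import json

class RollbackStore:
    """Keep the pre-remediation config so any change can be reversed."""

    def __init__(self):
        self._snapshots: dict[str, str] = {}

    def snapshot(self, resource_id: str, current_config: dict) -> None:
        # Taken immediately before `terraform apply`
        self._snapshots[resource_id] = json.dumps(current_config)

    def previous_config(self, resource_id: str) -> dict:
        # Re-rendered to Terraform and applied again to roll back
        return json.loads(self._snapshots[resource_id])
```
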

Tech Stack Summary

Component                 Technology
─────────────────────────────────────────────
Agent Framework           LangChain + LangGraph
LLM                       GPT-4o (via OpenAI API)
Infrastructure as Code    Terraform
Task Queue                Celery + RabbitMQ
Database                  PostgreSQL
API                       FastAPI
Containerization          Docker

What's Next?

I'm exploring multi-cloud remediation — a single agent that can generate patches for AWS, Azure, and GCP using provider-specific Terraform modules. The challenge is maintaining a unified finding schema across cloud providers.
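
With a unified finding schema, multi-cloud support reduces to a routing layer over provider-specific patch generators. A rough sketch of that dispatch; the generator functions are placeholders for logic that would wrap provider-specific Terraform modules:

```python
def aws_patch(finding: dict) -> str:
    return f'# aws fix for {finding["resource_id"]}'

def azure_patch(finding: dict) -> str:
    return f'# azurerm fix for {finding["resource_id"]}'

def gcp_patch(finding: dict) -> str:
    return f'# google fix for {finding["resource_id"]}'

GENERATORS = {"aws": aws_patch, "azure": azure_patch, "gcp": gcp_patch}

def generate_patch(finding: dict) -> str:
    """Route a normalized finding to its provider-specific generator."""
    try:
        return GENERATORS[finding["cloud_provider"]](finding)
    except KeyError:
        raise ValueError(f"unsupported provider: {finding['cloud_provider']}")
```
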


Building AI-powered security automation at Strobes Security. Connect with me on GitHub.