Building a Cloud Remediation Agent with Terraform and LangChain

2025-03-05 · AI, Terraform, LangChain, Cloud Security, DevSecOps

Cloud misconfigurations are one of the top causes of security breaches. According to Gartner, through 2025, 99% of cloud security failures will be the customer's fault — primarily due to misconfigurations. What if we could build an AI agent that automatically detects and remediates these issues?

That's exactly what I built at Strobes Security. Here's how.

The Problem

Traditional cloud security tools follow this workflow:

  1. Scanner detects misconfiguration (e.g., S3 bucket is public)
  2. Alert goes to the security team
  3. Security team creates a ticket
  4. DevOps engineer manually fixes it
  5. Fix is verified

The gap between detection and remediation can be days or weeks. In that window, your infrastructure is vulnerable.

The Solution: An Autonomous Remediation Agent

I built a system that compresses this entire workflow into minutes:

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Cloud      │    │   LangChain  │    │   Terraform  │
│   Scanner    │───▶│   Agent      │───▶│   Apply      │
│   (Detect)   │    │   (Analyze)  │    │  (Remediate) │
└──────────────┘    └──────────────┘    └──────────────┘
        │                   │                   │
        ▼                   ▼                   ▼
Misconfiguration      Fix Strategy       Applied Patch
    Detected            Generated        (with rollback)

Architecture Deep Dive

Component 1: Misconfiguration Ingestion

The agent receives findings from multiple sources (Orca Security, AWS Config, or custom scanners):

from pydantic import BaseModel

class CloudFinding(BaseModel):
    resource_type: str       # e.g., "aws_s3_bucket"
    resource_id: str         # e.g., "arn:aws:s3:::my-bucket"
    misconfiguration: str    # e.g., "public_access_enabled"
    severity: str            # "critical", "high", "medium", "low"
    current_config: dict     # Current resource configuration
    cloud_provider: str      # "aws", "azure", "gcp"
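
Each source emits findings in its own shape, so a thin normalization layer maps raw payloads onto this schema before validation. A minimal sketch, using stdlib only; the input field names here are illustrative assumptions, and each real source needs its own mapper:

```python
def normalize_aws_config(raw: dict) -> dict:
    """Map a hypothetical AWS Config-style payload onto CloudFinding fields."""
    return {
        "resource_type": raw["resourceType"],
        "resource_id": raw["resourceId"],
        "misconfiguration": raw["ruleName"],
        "severity": raw.get("severity", "medium").lower(),
        "current_config": raw.get("configuration", {}),
        "cloud_provider": "aws",
    }

finding = normalize_aws_config({
    "resourceType": "aws_s3_bucket",
    "resourceId": "arn:aws:s3:::my-bucket",
    "ruleName": "public_access_enabled",
    "severity": "HIGH",
    "configuration": {"acl": "public-read"},
})
```

The resulting dict can then be validated with `CloudFinding(**finding)`, so malformed scanner output fails fast at the ingestion boundary.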

Component 2: The LangChain Analysis Agent

The core intelligence uses LangChain with tool-calling to analyze the finding and generate a remediation plan:

import subprocess
import tempfile
from pathlib import Path

from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool

llm = ChatOpenAI(model="gpt-4o", temperature=0)

@tool
def get_terraform_docs(resource_type: str) -> str:
    """Retrieve Terraform documentation for a specific resource type."""
    # fetch_tf_docs wraps the Terraform registry API (implementation omitted)
    return fetch_tf_docs(resource_type)

@tool
def validate_terraform(hcl_code: str) -> str:
    """Validate Terraform HCL code for syntax errors."""
    # `terraform validate` reads a working directory, not stdin, so write
    # the candidate patch to a temporary workspace first
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "main.tf").write_text(hcl_code)
        result = subprocess.run(
            ["terraform", "validate"],
            cwd=workdir, capture_output=True, text=True
        )
    return result.stdout + result.stderr

@tool
def generate_terraform_patch(finding: str, current_config: str) -> str:
    """Generate a Terraform patch to remediate the misconfiguration."""
    prompt = f"""
    Given this cloud misconfiguration:
    {finding}

    Current configuration:
    {current_config}

    Generate a Terraform resource block that fixes this issue.
    Include comments explaining each change.
    """
    return llm.invoke(prompt).content

prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You are a cloud security engineer. Analyze the finding "
               "and produce a validated Terraform patch."),
    ("human", "{finding}"),
    ("placeholder", "{agent_scratchpad}"),
])

tools = [get_terraform_docs, validate_terraform, generate_terraform_patch]
agent = AgentExecutor(
    agent=create_tool_calling_agent(llm, tools, prompt_template),
    tools=tools,
)

Component 3: Celery + RabbitMQ for Async Execution

Remediation operations can be slow (Terraform plans take time), so we process them asynchronously:

from celery import Celery

app = Celery("remediation", broker="amqp://rabbitmq:5672")

@app.task(bind=True, max_retries=3)
def remediate_finding(self, finding_id: str):
    try:
        finding = CloudFinding.from_db(finding_id)

        # Step 1: Generate remediation plan
        plan = agent.invoke({"finding": finding.dict()})

        # Step 2: Run terraform plan (dry-run)
        plan_result = terraform_plan(plan.patch)

        # Step 3: If plan is safe, apply
        if plan_result.changes_only_target_resource:
            apply_result = terraform_apply(plan.patch)

            # Step 4: Log for audit trail
            AuditLog.create(
                finding=finding,
                action="remediated",
                patch=plan.patch,
                result=apply_result
            )
        else:
            # Flag for human review — patch affects other resources
            flag_for_review(finding, plan)
    except Exception as exc:
        self.retry(exc=exc, countdown=60)

Component 4: Safety Guardrails

The most critical part — ensuring the agent doesn't break production:

  1. Blast Radius Check: Before applying, verify the Terraform plan only affects the target resource
  2. Rollback Capability: Store the previous state so we can terraform apply the old config
  3. Approval Gates: Critical/high severity findings require human approval before apply
  4. Dry-Run Mode: New remediation patterns run in dry-run mode for the first 10 executions
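
Guardrails 3 and 4 can be combined into a single gating decision taken before any apply. A minimal sketch; the names (`RemediationGate`, `DRY_RUN_THRESHOLD`) are illustrative, not from the production system:

```python
DRY_RUN_THRESHOLD = 10  # first N executions of a new pattern stay dry-run

class RemediationGate:
    """Decide how a remediation may run: apply, dry-run, or human approval."""

    def __init__(self):
        self.execution_counts: dict[str, int] = {}  # pattern -> runs so far

    def decide(self, severity: str, pattern: str) -> str:
        """Return 'apply', 'dry_run', or 'needs_approval'."""
        runs = self.execution_counts.get(pattern, 0)
        self.execution_counts[pattern] = runs + 1
        if severity in ("critical", "high"):
            return "needs_approval"   # guardrail 3: approval gate
        if runs < DRY_RUN_THRESHOLD:
            return "dry_run"          # guardrail 4: burn-in period
        return "apply"
```

In the real system the counts would live in the database rather than memory, so the burn-in period survives worker restarts.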

The blast radius check (guardrail 1) is a simple filter over the parsed Terraform plan:

def safety_check(plan_output: str, target_resource: str) -> bool:
    """Ensure Terraform plan only modifies the target resource."""
    planned_changes = parse_terraform_plan(plan_output)

    for change in planned_changes:
        if change.resource_id != target_resource:
            return False  # Plan would affect other resources

    return True
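
The `parse_terraform_plan` helper can be backed by Terraform's machine-readable plan output (`terraform show -json plan.out`), which exposes a `resource_changes` array. A minimal sketch, with the `ResourceChange` shape chosen to match what `safety_check` above expects:

```python
import json
from dataclasses import dataclass

@dataclass
class ResourceChange:
    resource_id: str   # Terraform address, e.g. "aws_s3_bucket.logs"
    actions: list      # e.g. ["update"] or ["delete", "create"]

def parse_terraform_plan(plan_json: str) -> list:
    """Extract real changes from `terraform show -json` output.

    Entries whose only action is "no-op" are skipped, so safety_check
    only sees resources the plan would actually touch.
    """
    data = json.loads(plan_json)
    changes = []
    for rc in data.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions == ["no-op"]:
            continue
        changes.append(ResourceChange(rc["address"], actions))
    return changes
```
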

Results in Production

After deploying this agent:

  • Mean Time to Remediation (MTTR) dropped from 4.2 days to 23 minutes
  • 93% of findings were auto-remediated without human intervention
  • Zero false-positive remediations in 6 months (thanks to the safety guardrails)
  • Full audit trail for every change, satisfying compliance requirements

Lessons Learned

  1. Never trust the LLM blindly: Always validate generated Terraform before applying. The terraform plan step is non-negotiable.

  2. Start with low-severity findings: We rolled out auto-remediation starting with "low" severity only, gradually enabling higher severities as confidence grew.

  3. Rollback is mandatory: Every remediation must be reversible. Store the previous state.

  4. Context matters: Including the Terraform documentation in the agent's context dramatically improved patch quality.

  5. Async is essential: Terraform operations are slow. Using Celery + RabbitMQ keeps the system responsive.
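
Lesson 3 can be as simple as snapshotting the resource's prior configuration immediately before each apply. A sketch; the in-memory dict here stands in for the real database table:

```python
import json

class RollbackStore:
    """Keep the pre-remediation config so any change can be reversed."""

    def __init__(self):
        self._snapshots: dict[str, str] = {}

    def snapshot(self, resource_id: str, current_config: dict) -> None:
        # Taken immediately before `terraform apply`
        self._snapshots[resource_id] = json.dumps(current_config)

    def previous_config(self, resource_id: str) -> dict:
        # Re-rendered to Terraform and applied again to roll back
        return json.loads(self._snapshots[resource_id])
```
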

Tech Stack Summary

Component                 Technology
─────────────────────────────────────────────
Agent Framework           LangChain + LangGraph
LLM                       GPT-4o (via OpenAI API)
Infrastructure as Code    Terraform
Task Queue                Celery + RabbitMQ
Database                  PostgreSQL
API                       FastAPI
Containerization          Docker

What's Next?

I'm exploring multi-cloud remediation — a single agent that can generate patches for AWS, Azure, and GCP using provider-specific Terraform modules. The challenge is maintaining a unified finding schema across cloud providers.
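
With a unified finding schema, multi-cloud support reduces to a routing layer over provider-specific patch generators. A rough sketch of that dispatch; the generator functions are placeholders for logic that would wrap provider-specific Terraform modules:

```python
def aws_patch(finding: dict) -> str:
    return f'# aws fix for {finding["resource_id"]}'

def azure_patch(finding: dict) -> str:
    return f'# azurerm fix for {finding["resource_id"]}'

def gcp_patch(finding: dict) -> str:
    return f'# google fix for {finding["resource_id"]}'

GENERATORS = {"aws": aws_patch, "azure": azure_patch, "gcp": gcp_patch}

def generate_patch(finding: dict) -> str:
    """Route a normalized finding to its provider-specific generator."""
    try:
        return GENERATORS[finding["cloud_provider"]](finding)
    except KeyError:
        raise ValueError(f"unsupported provider: {finding['cloud_provider']}")
```
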


Building AI-powered security automation at Strobes Security. Connect with me on GitHub.