Problem Statement

Enterprise software deployment at scale is a coordination problem disguised as a technical one. In a 200,000-device environment, the manual packaging workflow looks like this: someone discovers a new software release, downloads it, reverse-engineers the installation parameters, builds a package configuration, tests it on a representative device, fixes the inevitable edge cases, and then deploys it through Intune. Each step requires human judgment, domain knowledge, and context switching. The cycle time from release to deployment is measured in days or weeks.

The typical enterprise solution is to hire more people or build more process. AutoPackager takes a different approach: treat the entire workflow as an orchestration problem, where each step is handled by a specialized agent that understands its domain, and the system coordinates their work through a state machine. The goal isn't full automation — it's to move the human decision point from "should I click this button" to "does this deployment plan make sense."

Architecture Overview

AutoPackager is built as a multi-agent system where each agent is responsible for a specific phase of the deployment lifecycle. The agents are:

Discovery Agent: Monitors vendor sites, release feeds, and package repositories to identify new software versions.
Packaging Agent: Downloads the installer, analyzes it, extracts installation parameters, and generates an Intune-compatible package configuration.
Testing Agent: Provisions a test environment, deploys the package, validates the installation, and captures success/failure states.
Deployment Agent: Coordinates the rollout to production, monitors deployment status, and handles rollback if needed.

Each agent operates independently but reports state changes to a central orchestrator built on Celery and Redis. The orchestrator maintains the workflow state machine, ensures agents don't step on each other, and provides observability into the entire pipeline.

  ┌────────────────────────────────────────────────────────┐
  │                  Orchestration Layer                   │
  │             (Celery + Redis State Machine)             │
  └────────────────────────────────────────────────────────┘
        │              │              │              │
        ▼              ▼              ▼              ▼
  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐
  │ Discovery │  │ Packaging │  │  Testing  │  │Deployment │
  │   Agent   │  │   Agent   │  │   Agent   │  │   Agent   │
  └───────────┘  └───────────┘  └───────────┘  └───────────┘
        │              │              │              │
        ▼              ▼              ▼              ▼
  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐
  │    LLM    │  │    LLM    │  │    LLM    │  │    LLM    │
  │Abstraction│  │Abstraction│  │Abstraction│  │Abstraction│
  └───────────┘  └───────────┘  └───────────┘  └───────────┘
     │     │        │     │        │     │        │     │
     ▼     ▼        ▼     ▼        ▼     ▼        ▼     ▼
  Claude  GPT    Claude  GPT    Claude  GPT    Claude  GPT

Multi-Agent Orchestration Design

The orchestration layer is where most of the interesting design decisions live. The core insight is that software deployment is not a linear pipeline — it's a state machine with branching, retries, and human-in-the-loop decision points.

We use Celery as the task execution engine and Redis as the state store. Each agent exposes a set of tasks (e.g., packaging.analyze_installer, testing.provision_vm) that the orchestrator can invoke. The orchestrator maintains a workflow graph in Redis that tracks:

Current state for each package (discovered, packaged, tested, deployed)
Artifacts produced by each agent (package configs, test results, deployment logs)
Retry counts and backoff state for failed tasks
Human approval gates and override flags
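
To make the tracked fields concrete, here is a minimal sketch of what one per-package record might look like. The class name, field names, and the `workflow:<id>` key format are assumptions made for illustration, not AutoPackager's actual schema; the record is serialized to JSON, which is one common way to store a value like this in Redis.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class WorkflowRecord:
    """Hypothetical per-package record; field names are illustrative only."""
    package_id: str
    state: str = "discovered"            # discovered, packaged, tested, deployed
    artifacts: dict = field(default_factory=dict)  # agent name -> artifact reference
    retry_count: int = 0                 # failed attempts so far
    next_retry_at: Optional[float] = None  # backoff state as a Unix timestamp
    awaiting_approval: bool = False      # human approval gate
    override: bool = False               # operator override flag

    def to_json(self) -> str:
        # A stable key order keeps stored values easy to diff while debugging.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "WorkflowRecord":
        return cls(**json.loads(raw))


def redis_key(package_id: str) -> str:
    # Assumed key naming convention for the state store.
    return f"workflow:{package_id}"
```

Keeping the whole record in one serializable value means an operator can inspect a single key to see where a package stands and why.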

The state machine design allows us to handle real-world complexity: if the Packaging Agent fails because a vendor changed their installer format, the system pauses at that state, alerts the operator, and waits for intervention. Once the issue is resolved, the workflow resumes from exactly where it stopped. No data is lost, no context is forgotten.
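
The pause-and-resume behavior can be sketched in a few lines. This is an in-memory illustration under stated assumptions: the `Workflow` class, its method names, and the strictly linear transition table are hypothetical, and a real orchestrator would persist this state in Redis and alert an operator on pause rather than just setting a flag.

```python
from enum import Enum


class State(Enum):
    DISCOVERED = "discovered"
    PACKAGED = "packaged"
    TESTED = "tested"
    DEPLOYED = "deployed"


# Happy-path transitions: each state maps to the state that follows it.
NEXT_STATE = {
    State.DISCOVERED: State.PACKAGED,
    State.PACKAGED: State.TESTED,
    State.TESTED: State.DEPLOYED,
}


class Workflow:
    """Hypothetical in-memory workflow for a single package."""

    def __init__(self, package_id):
        self.package_id = package_id
        self.state = State.DISCOVERED
        self.paused = False
        self.pause_reason = None

    def advance(self, step):
        """Run the step that leads to the next state; pause in place if it raises."""
        if self.paused:
            raise RuntimeError("workflow is paused; call resume() first")
        if self.state not in NEXT_STATE:
            raise RuntimeError("workflow is already complete")
        try:
            step(self.package_id)
        except Exception as exc:
            # Stay in the current state so nothing already produced is lost.
            self.paused = True
            self.pause_reason = str(exc)
            return False
        self.state = NEXT_STATE[self.state]
        return True

    def resume(self):
        """Clear the pause after an operator has resolved the issue."""
        self.paused = False
        self.pause_reason = None
```

Because a failed step never changes `state`, calling `resume()` followed by `advance()` retries exactly the step that failed.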

Agent Communication Protocol

Agents communicate through a simple event-driven protocol. Each agent publishes events to Redis (e.g., packaging.complete, testing.failed) and subscribes to events from upstream agents. The orchestrator acts as the event router and enforces ordering constraints. This design keeps agents decoupled — the Testing Agent doesn't need to know how the Packaging Agent works, only that it produces a package artifact.
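
A minimal sketch of the routing idea, using an in-memory router in place of Redis so it runs on its own. The `EventRouter` class and the prerequisite table are assumptions for illustration; only `packaging.complete` is an event name taken from the description above, and the other event names are hypothetical.

```python
from collections import defaultdict

# Assumed ordering constraints: an event is routed only after its
# prerequisite has been seen for the same package.
PREREQUISITE = {
    "packaging.complete": "discovery.complete",
    "testing.complete": "packaging.complete",
    "deployment.complete": "testing.complete",
}


class EventRouter:
    """Hypothetical in-memory stand-in for the orchestrator's event routing."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # event name -> handlers
        self.seen = defaultdict(set)          # package id -> events already routed

    def subscribe(self, event, handler):
        self.subscribers[event].append(handler)

    def publish(self, event, package_id, payload=None):
        required = PREREQUISITE.get(event)
        if required is not None and required not in self.seen[package_id]:
            # Reject out-of-order events instead of delivering them downstream.
            raise ValueError(f"{event} for {package_id} arrived before {required}")
        self.seen[package_id].add(event)
        for handler in self.subscribers[event]:
            handler(package_id, payload)
```

A subscriber only ever receives the package identifier and the payload, which is what keeps the downstream agent ignorant of how the upstream agent produced the artifact.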

Technology Choices

Celery and Redis

We chose Celery for task orchestration because it's boring technology that solves the hard problems: task queuing, distributed execution, retry logic, and failure handling. Redis serves as both the message broker and the state store. The combination gives us:

Horizontal scalability: add more workers to handle more packages
Fault tolerance: failed tasks can be retried automatically once a retry policy is set on the task
Observability: every task execution is logged and traceable
Low operational overhead: Redis is simple to run and monitor

The alternative would have been a workflow engine like Airflow or Temporal, but both felt like over-engineering for our use case. Celery's task-based model maps cleanly to our agent architecture, and the simplicity reduces the surface area for things to break.

LLM Abstraction Layer

Each agent uses LLMs for domain-specific reasoning: the Discovery Agent extracts release notes, the Packaging Agent interprets installer flags, the Testing Agent analyzes failure logs. We built an abstraction layer that supports both Claude and GPT models interchangeably.

The abstraction provides:

A unified API for prompt submission and response parsing
Automatic retry with exponential backoff for rate limits
Cost tracking and usage monitoring per agent
Fallback logic (Claude primary, GPT fallback)

The key design decision was to keep the abstraction thin. We don't try to hide model-specific capabilities — if an agent needs Claude's tool use or GPT's function calling, it can use those features directly. The abstraction only handles the common path: send a prompt, get a response, handle errors.
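
The common path might look roughly like the sketch below. Everything here is an assumption for illustration: `complete`, `RateLimitError`, and the `(name, callable)` provider pairs are hypothetical names, and the providers are plain callables rather than real Claude or GPT client calls. Cost tracking is left out to keep the example short.

```python
import time


class RateLimitError(Exception):
    """Stand-in for whatever rate-limit error a provider client raises."""


def complete(prompt, providers, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Try providers in order; retry rate limits with exponential backoff.

    `providers` is a sequence of (name, callable) pairs, primary first.
    Returns (provider_name, response_text).
    """
    last_error = None
    for name, call in providers:
        for attempt in range(max_attempts):
            try:
                return name, call(prompt)
            except RateLimitError as exc:
                last_error = exc
                if attempt < max_attempts - 1:
                    # Delays grow as base_delay * 1, 2, 4, ...
                    sleep(base_delay * 2 ** attempt)
            except Exception as exc:
                # Any other error: stop retrying and move to the next provider.
                last_error = exc
                break
    raise RuntimeError("all providers failed") from last_error
```

Passing `sleep` as a parameter is a small design choice that lets the backoff schedule be checked in tests without actually waiting.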

Why Not End-to-End Automation?

We deliberately kept human approval gates in the workflow. Full autonomy is possible but risky: a packaging error could break thousands of devices. The current design automates the tedious parts (downloading installers, extracting parameters, provisioning test VMs) and surfaces the decision points to humans (does this package configuration look correct? did the test pass?). As a rough rule of thumb, this balance gives us 80% of the efficiency gains with 20% of the risk.

Outcomes and Lessons Learned

What Worked

Agent isolation: Keeping agents independent and single-purpose made the system easy to debug and extend. When we needed to add a new capability (e.g., driver package support), we built a new agent rather than bolting it onto an existing one.
State machine design: Modeling the workflow as an explicit state machine meant failures were recoverable and workflows were observable. We could see exactly where each package was in the pipeline and intervene when needed.
LLM abstraction: Supporting both Claude and GPT gave us cost flexibility and redundancy. During a Claude API outage, we switched to GPT with zero code changes.

What Was Hard

Error handling: LLMs fail in unpredictable ways. A prompt that works 99% of the time will fail on the 1% edge case, and debugging those failures requires inspecting the full prompt/response cycle. We ended up logging everything and building a replay system to reproduce failures locally.
Rate limiting: Coordinating rate limits across multiple agents is surprisingly complex. If all agents share a single API key, one agent can starve the others. We implemented a token bucket system in Redis to enforce fair sharing.
Test environment provisioning: Spinning up VMs for testing is slow and expensive. We explored using containers but ran into issues with Windows installer compatibility. The current approach uses a pool of pre-provisioned VMs, which helps but adds operational overhead.
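
For the rate-limiting item above, the token bucket idea can be sketched as follows. This is an in-memory illustration with hypothetical names (`TokenBucket`, `make_buckets`) and an assumed equal split of the quota between agents. A Redis-backed version would keep the token count and timestamp in Redis and would need to update them atomically, which this sketch does not attempt.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second, up to `capacity` tokens."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full
        self.updated = clock()

    def try_acquire(self, cost=1.0):
        """Take `cost` tokens if available; return False without blocking otherwise."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def make_buckets(agents, total_rate, capacity, clock=time.monotonic):
    """Give each agent its own bucket with an equal share of the total rate."""
    share = total_rate / len(agents)
    return {agent: TokenBucket(share, capacity, clock) for agent in agents}
```

With one bucket per agent, an agent that exhausts its own tokens is throttled while the others keep their share, which is the starvation problem the shared API key otherwise creates.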

Lessons for Multi-Agent Systems

Building AutoPackager reinforced a few principles that apply to any multi-agent AI system:

Make the state visible: Agents should publish their state explicitly, not hide it in internal variables. This makes the system debuggable and gives operators confidence in what's happening.
Design for failure: Agents will fail. The orchestrator should treat failure as the default case, and workflows should be able to pause, retry, or roll back gracefully.
Keep agents small: An agent that does one thing well is easier to test, debug, and replace than a monolithic agent that does everything poorly.
Human-in-the-loop is a feature, not a bug: Full autonomy is a nice goal, but in production systems, human oversight is how you catch the edge cases that automation misses.

Current Status

The public precursor to AutoPackager is available on GitHub and demonstrates the basic orchestration pattern. The ML-powered version described here is in active development and not yet production-ready. The architecture is validated, the agent framework is built, and we're iterating on the LLM prompting strategy to improve reliability.

The goal is to move from "proof of concept" to "production system" by focusing on the reliability fundamentals: better error handling, more comprehensive testing, and tighter integration with Intune's API. Once those are in place, AutoPackager becomes a force multiplier for any organization managing software deployment at scale.