8th Light's Head of Agentic AI, Travis Frisinger, and I spent two days in San Francisco at LangChain Interrupt 2026 watching what happens when agent demos meet production traffic. The room was full of companies that have crossed that line, and a few that wish they hadn't tried.
The conversation a year ago was "can agents work?" The conversation in May was "how do we operate reliable, observable, governable agent systems at scale?" The discussions we had at the conference confirmed that the shift is real and not just product marketing.
What follows is what stood out across the two days, with Travis's read on the technical patterns and what I'm hearing from clients who are trying to ship.
Demos are Easy. Operating is the Job.
Travis's summary of the two days: agents are moving from demos to real operational systems, with companies like Lyft, Cisco, Toyota, and LATAM already running large-scale production deployments. The keynote framed agents as different from traditional software because of an infinite input space and non-deterministic models. The successful teams ship early and iterate. The unsuccessful ones, according to MongoDB's CJ Desai in conversation with Harrison Chase, built proofs of concept for two years and called it strategy. He called it agent washing. 2025 was supposed to be the year of agents at scale. It didn't materialize. He thinks 2026 will be that year, but only for teams who treat the operational work as the actual work.
That matches what I hear in client conversations. The question is rarely "can we build an agent?" It is "how do we run one inside an enterprise that has compliance, vendor reviews, change management, and a board that wants to see ROI?"
Observability is Not Optional. It's the Foundation.
Nearly every speaker named LangSmith or an equivalent as mandatory infrastructure rather than nice-to-have tooling. LATAM Airlines made the point with the most authority. Nico Venegas and Claudio Urbina described a concierge agent on their B2C site serving 4,000 daily active users, built on a supervisor pattern delegating to six specialist agents. LangSmith was integrated as the observability layer from day one. The architecture has changed substantially since launch, and Travis flagged this as the key point: none of those changes would have been possible without trace data showing where the system actually broke.
The lesson is the inversion of how most teams build software. With deterministic systems, you instrument after you ship. With agents, you can't ship until you can see inside the loop.
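To make "see inside the loop" concrete, here is a minimal sketch of a supervisor that routes to specialists and records a span for every step. It is not LATAM's implementation and it does not use the LangSmith API. The `Trace` and `Span` classes, the keyword router, and the two specialist functions are hypothetical stand-ins for a real tracing backend, an LLM routing call, and real agents.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Span:
    name: str
    input: str
    output: str = ""
    error: str = ""
    duration_ms: float = 0.0


@dataclass
class Trace:
    spans: list[Span] = field(default_factory=list)

    def record(self, name: str, fn: Callable[[str], str], text: str) -> str:
        """Run fn(text) and keep its input, output, error, and timing."""
        span = Span(name=name, input=text)
        start = time.perf_counter()
        try:
            span.output = fn(text)
            return span.output
        except Exception as exc:
            span.error = repr(exc)
            raise
        finally:
            span.duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append(span)


# Hypothetical specialists; a real system would call a model here.
def baggage_agent(query: str) -> str:
    return "baggage: one carry-on is included"


def refund_agent(query: str) -> str:
    return "refund: request opened"


def fallback_agent(query: str) -> str:
    return "fallback: handing off to a human"


SPECIALISTS = {"baggage": baggage_agent, "refund": refund_agent}


def route(query: str) -> str:
    """Keyword routing as a stand-in for an LLM routing decision."""
    for key in SPECIALISTS:
        if key in query.lower():
            return key
    return "fallback"


def supervisor(query: str, trace: Trace) -> str:
    choice = trace.record("route", route, query)
    handler = SPECIALISTS.get(choice, fallback_agent)
    return trace.record(f"specialist:{choice}", handler, query)
```

Calling `supervisor("Can I get a refund?", trace)` leaves two spans in `trace.spans`, one for the routing decision and one for the specialist. That record is what lets you ask later which step was slow or wrong.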
Context Management is the Architectural Problem of the Year.
The most consistent failure mode across talks was the same: too many tools, too much context, degraded reasoning.
Monday.com's Omri Bruchim presented the version-by-version story most teams will recognize. V1 was a reactive agent with basic tools. It worked, but it was limited. V2 added 20+ tools per product domain, each division getting its own agent. That version failed. Context pollution, LLM confusion, cost explosion. V3, the current production system, uses a deep-agent architecture with what Omri called progressive tool discovery: a three-tier system where the agent only sees the tools relevant to its current context, with a third tier of semantically searchable tools it can activate on demand.
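A rough sketch of what progressive tool discovery could look like follows. The talk did not spell out the tier definitions, so this assumes tier 1 is always visible, tier 2 is scoped to the current context, and tier 3 is searchable on demand. The tool names are invented, and plain word overlap stands in for semantic search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    tier: int          # 1 = always visible, 2 = context-scoped, 3 = searchable
    context: str = ""  # only meaningful for tier 2


# Hypothetical catalog, for illustration only.
CATALOG = [
    Tool("search_items", "find items on a board", 1),
    Tool("create_item", "create a new item on a board", 1),
    Tool("update_deal", "update a deal stage in the sales pipeline", 2, "crm"),
    Tool("assign_sprint", "assign an item to a sprint", 2, "dev"),
    Tool("export_invoice_pdf", "export an invoice as a pdf file", 3),
    Tool("bulk_archive", "archive many old items at once", 3),
]


def visible_tools(context: str, activated: frozenset = frozenset()) -> list[str]:
    """Names of the tools the agent sees right now."""
    return [
        t.name
        for t in CATALOG
        if t.tier == 1
        or (t.tier == 2 and t.context == context)
        or t.name in activated
    ]


def discover(query: str, limit: int = 1) -> list[str]:
    """Rank tier-3 tools by word overlap (a stand-in for semantic search)."""
    words = set(query.lower().split())
    scored = []
    for t in CATALOG:
        if t.tier != 3:
            continue
        score = len(words & set(t.description.lower().split()))
        if score:
            scored.append((score, t.name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]
```

In this sketch, an agent working in the `crm` context starts with three tools rather than six. It only gains `export_invoice_pdf` after a call like `discover("export an invoice")` returns that name and it is passed back in through `activated`.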
Rippling told a similar story. Senthil Sundaram and Akash Ashok started with hierarchical agents and moved to a single flat agent because the hierarchical version had context-sharing problems. They also moved away from large tool catalogs toward fewer, more generic tools, and they expose their database schema directly to the LLM so it can write SQL instead of receiving raw data through ten different specific tools.
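The schema-plus-SQL idea can be sketched with two generic tools in place of many specific ones. This is an illustration under assumptions, not Rippling's code: it uses an in-memory SQLite database, a made-up `employees` table, and a deliberately naive single-SELECT guard. A production system would rely on database-level read-only permissions instead of string checks.

```python
import sqlite3


def schema_for_prompt(conn: sqlite3.Connection) -> str:
    """Return the CREATE TABLE statements to include in the model's context."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return "\n".join(row[0] for row in rows)


def run_readonly_query(conn: sqlite3.Connection, sql: str, max_rows: int = 50):
    """Generic tool: run one model-written SELECT and cap the rows returned."""
    statement = sql.strip().rstrip(";")
    if ";" in statement or not statement.lower().startswith("select"):
        raise ValueError("only a single SELECT statement is allowed")
    return conn.execute(statement).fetchmany(max_rows)


# Hypothetical data for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees (name, dept) VALUES (?, ?)",
    [("Ada", "eng"), ("Grace", "eng"), ("Linus", "ops")],
)

# The model would see schema_for_prompt(conn) and write a query like this one.
headcount = run_readonly_query(
    conn, "SELECT dept, COUNT(*) FROM employees GROUP BY dept ORDER BY dept"
)
```

The design choice is that one query tool plus the schema covers questions nobody anticipated, where a tool-per-question catalog has to grow with every new request.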
The pattern across both: when in doubt, give the model less surface area, not more.
Evals are Becoming the New CI/CD.
Lyft's Nick Ung delivered the most concrete cautionary tale of the conference. The team built a lightweight simulator with an LLM playing the user, validated the agent against it, and watched offline success rates hit 90%. Then they shipped, and production performance was poor.
The reason was specific and instructive. Their simulated users were too polite, too patient, too detailed. Real Lyft riders were impatient and answered in one or two words. The fix involved fine-tuning a custom LLM on actual user verbatims and defining personas like "Bypasser," "Refund Seeker," and "AI Skeptic." The evaluation got harder. The production-offline gap got smaller.
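Lyft's fix involved fine-tuning a model on real verbatims, which does not fit in a short example. A lighter approximation of the same idea is to pin each simulated user to a persona with a hard word budget and reject simulator replies that drift back toward polite, detailed prose. The persona names come from the talk. Their goals, the word limits, and the prompt wording below are my own assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    name: str
    goal: str        # assumed, not taken from the talk
    max_words: int   # assumed cap that keeps replies terse


PERSONAS = [
    Persona("Bypasser", "reach a human agent as fast as possible", 4),
    Persona("Refund Seeker", "get money back for a ride", 6),
    Persona("AI Skeptic", "test whether the assistant can be trusted", 5),
]


def simulator_prompt(persona: Persona) -> str:
    """System prompt for the LLM that plays the user."""
    return (
        f"You are a rider. Persona: {persona.name}. Goal: {persona.goal}. "
        f"Reply in at most {persona.max_words} words. "
        "Be impatient. Do not volunteer details."
    )


def exceeds_word_budget(reply: str, persona: Persona) -> bool:
    """Flag simulated replies that are longer than the persona allows."""
    return len(reply.split()) > persona.max_words
```

A harness would regenerate or discard any simulated turn where `exceeds_word_budget` returns `True`, so the agent is only ever scored against users who answer the way real ones do.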
Chime's Philipp Comans walked through a related shift, this time for compliance. In the old model, the legal team weighed in at kickoff to explain the rules and at release to approve or block. If they blocked, the team restarted the development cycle. In the new model, compliance co-authors the evals. The evals become the alignment surface between engineering and legal. Compliance signals come in hours rather than at release. Trust is built throughout the process rather than at a single gate at the end.
This is the practical answer to the governance question every enterprise asks: how do you put guardrails on a non-deterministic system? You make the rules executable. You evaluate against them continuously. You let the experts who own each rule co-own the test that enforces it.
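As a sketch of "make the rules executable", each rule below is a small check with a named owner, and the whole set runs against a batch of agent outputs. The two rules are invented examples, not Chime's actual compliance rules, and real checks would often need a model-graded evaluator instead of a string match.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    rule_id: str
    owner: str                     # the team that co-authors and signs off
    check: Callable[[str], bool]   # True means the output passes


# Hypothetical rules for illustration.
RULES = [
    Rule(
        "no-guaranteed-returns",
        "compliance",
        lambda out: "guaranteed" not in out.lower(),
    ),
    Rule(
        "no-full-account-number",
        "compliance",
        lambda out: re.search(r"\b\d{9,}\b", out) is None,
    ),
]


def evaluate(outputs: list[str]) -> dict[str, list[int]]:
    """Map each violated rule id to the indexes of the outputs that broke it."""
    failures: dict[str, list[int]] = {}
    for rule in RULES:
        failed = [i for i, out in enumerate(outputs) if not rule.check(out)]
        if failed:
            failures[rule.rule_id] = failed
    return failures
```

Run on every change, an empty result from `evaluate` is the signal that would otherwise arrive at the release gate. A non-empty one names the rule and its owner, which tells engineering who to talk to.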
Cost is on the Architecture Diagram.
Clay's Jeff Barg runs 350 million go-to-market agents per month. The team treats infrastructure, throughput, cost, and quality as four discrete engineering disciplines, each with its own tools. The team built a back-pressure system modeled on TCP congestion control to adaptively throttle requests against rate limits, yielding 4–10x throughput improvements over naive approaches. Anthropic's prompt caching alone reduced their costs by up to 70%. They bound retries, because forcing agents to stop after a fixed number of steps often produces better results than letting them spin.
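The congestion-control idea can be sketched as additive increase, multiplicative decrease (AIMD) applied to a concurrency limit, alongside a hard cap on agent steps. This is a simplified illustration and not Clay's system: the class name, the default numbers, and the one-slot-per-success increase are assumptions, and real TCP grows its window more gradually.

```python
from typing import Callable, Optional


class AimdLimiter:
    """Adaptive concurrency limit: grow slowly on success, halve on a 429."""

    def __init__(self, initial: float = 4.0, minimum: float = 1.0,
                 maximum: float = 64.0, increase: float = 1.0,
                 decrease: float = 0.5) -> None:
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum
        self.increase = increase
        self.decrease = decrease

    def on_success(self) -> None:
        # Additive increase: probe for more headroom one slot at a time.
        self.limit = min(self.maximum, self.limit + self.increase)

    def on_rate_limited(self) -> None:
        # Multiplicative decrease: back off sharply when the provider pushes back.
        self.limit = max(self.minimum, self.limit * self.decrease)

    def allowed_in_flight(self) -> int:
        return int(self.limit)


def run_bounded(step: Callable[[int], Optional[str]], max_steps: int) -> Optional[str]:
    """Call step(i) until it returns a result or the step budget runs out."""
    for i in range(max_steps):
        result = step(i)
        if result is not None:
            return result
    return None  # the budget is spent; the caller decides what to do next
```

A dispatcher would keep at most `allowed_in_flight()` requests open and call `on_success` or `on_rate_limited` as responses arrive. `run_bounded` returns `None` instead of looping forever, which makes "the agent gave up" an explicit outcome the caller has to handle.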
Aaron Levie at Box made the point at the enterprise scale. Startups can convert VC dollars to tokens. Public companies cannot absorb a sudden $10M AI bill in the middle of a quarter. The constraint forces model routing, cost-aware design, and a real cost-quality-outcome trade-off that engineering teams have to plan around.
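One way to picture model routing under a budget is the sketch below. The model names and per-token prices are placeholders, not real vendor pricing, and the `hard` flag stands in for whatever difficulty signal a real router would compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_1k_tokens: float  # placeholder price, not a real quote


SMALL = Model("small-model", 0.001)
LARGE = Model("large-model", 0.01)


class Router:
    """Prefer the large model for hard tasks, but never overspend the budget."""

    def __init__(self, budget_usd: float) -> None:
        self.remaining = budget_usd

    def choose(self, est_tokens: int, hard: bool) -> Model:
        model = LARGE if hard else SMALL
        cost = est_tokens / 1000 * model.usd_per_1k_tokens
        if cost > self.remaining and model is LARGE:
            # Downgrade instead of blowing the budget.
            model = SMALL
            cost = est_tokens / 1000 * model.usd_per_1k_tokens
        if cost > self.remaining:
            raise RuntimeError("budget exhausted")
        self.remaining -= cost
        return model
```

The point of the sketch is that the budget is an input to the routing decision itself, so a cost overrun shows up as a visible downgrade or an error and not as a surprise invoice.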
If you are operating an agent at any meaningful scale, cost is no longer an afterthought. It is a first-class engineering concern with ROI implications that go straight to the board.
Workflow Redesign Beats Workflow Automation.
Andrew Ng made this point in conversation with Harrison Chase, and Travis flagged it as the strongest single insight of Day 2. Bottom-up automation creates incremental efficiency. Real impact comes from top-down workflow redesign with executive sponsorship. Automating loan application processing saves an hour. Rethinking how loans get approved is the 20–50% transformation.
Toyota's Ravi Ummadisetti and Kordel France made it concrete. Before centralizing, every team at Toyota was building the same chatbot with the same vision, the same instruction pipeline, and no shared security or architecture standards. Deployment took six months and six engineers. After the platform was built, deployment took four days and one engineer. The platform itself maps directly onto Toyota Production System principles: LangSmith as the andon board, kaizen as continuous agent improvement, jidoka as automation with a human in the loop, genchi genbutsu as trace-driven root cause analysis. Fifty-plus agents are now in production across the company. Gearbox, their manufacturing problem solver, has cut hours of manual searching through production line manuals to ten seconds, with millions in savings every time a production line stays running.
The Toyota story is not about building agents. It is about building a platform that lets the company build agents. The pattern repeats across every team that has shipped: LATAM ran 120 GenAI products on Cosmos, the platform they spent five years building. Monday.com's deep-agent architecture is a platform layer their product teams now build against. Clay's 350M agents a month sit on a back-pressure system, durable workflow architecture, and a data substrate that took years to assemble. The agentic product is the visible part. The platform is what makes it survive contact with production.
Enterprise Adoption is Still Hard, and That is the Real Opportunity.
Coding agents are easier than every other kind of knowledge-work agent. Aaron Levie laid out why: code is verifiable, the work has a structured representation, the users are technical, and engineers have broad access to the systems they need. Knowledge work has none of those properties. Permission structures alone are a real barrier. Every worker has different data access patterns, and an agent inherits those limitations.
This is the gap we live in. The interesting agentic products are not the easy ones. The interesting ones are customer-facing, regulated, or high-stakes. They demand governed autonomy: agents that act, with guardrails that let you sleep. Real observability. Real evals. Real architectural discipline. The teams who get there are the ones who treat operating an agent as a different engineering problem than building one.
What I've Taken Back to Client Conversations.
Five patterns from the two days that I think enterprise leaders should be sitting with:
- Observability comes first. Wire trace data and human-in-the-loop feedback before the first deploy, not after. The architecture will need to evolve. You cannot evolve what you cannot see.
- Evals are an artifact, not a stage. Treat evals like tests. Build them with the people who own the rules. Run them on every change. Use them as the conversation between engineering and the rest of the business.
- Less surface area, more composition. Fewer, more generic tools usually beat large specific tool catalogs. Progressive disclosure and code-writing tools scale better than a tool-per-task catalog.
- Cost is an engineering discipline. Caching, bounded retries, model routing, and adaptive throttling are not optimizations to defer. They determine whether the system pays for itself.
- Workflow redesign over workflow automation. The incremental gains are easier to sell. The 20–50% gains are where the actual transformation lives, and they require top-down sponsorship.
The shift this conference made obvious: the conversation has moved past whether agents work. The work now is operating them at scale, with the same engineering discipline we expect of any other production system. That is the work I want 8th Light doing with clients.
Working with 8th Light.
The pattern across every speaker who has shipped: fast to POC is easy. Slow to ship is the bottleneck. Fragile in production is the failure mode. The teams who close the gap do it with the same combination Toyota, LATAM, and Monday.com described from the stage: agentic products on top of deep platform expertise. The product is what users see. The platform is what keeps it running.
That combination is what we do at 8th Light. Our Agentic AI Studio takes one high-value workflow from concept to production in 8 to 12 weeks, with the platform discipline that makes it hold up. Not a pilot. Not a strategy deck. A team of senior designers, architects, and engineers working inside your stack to ship the workflow with orchestration, integrations, guardrails, monitoring, and recovery designed in from the start.
If you're somewhere on the path from prototype to production, book a Studio session. Thirty minutes, one conversation, no sprawling discovery. We'll help you identify the right workflow and what it would take to ship it.