Your Vendor Still Writes Code by Hand

A 25-person startup ships 8 production deploys a day with 99% AI-written code. Their CTO went from managing people 60% of his time to under 10%. This isn't a startup flex — it's a signal. If your core vendor's engineering team still operates the way they did in 2024, you're paying for a process that's already obsolete.

By Sean Hsieh
30 min read
Published April 13, 2026

If I could give every credit union technology leader one diagnostic question for their next vendor meeting, it would be this: How many times did you deploy to production last week?

Not last quarter. Not last sprint. Last week.

Peter Pang — a physicist-turned-CTO running a 25-person agent platform called CREAO — published a piece on X last Saturday that answers his own version of that question with numbers that should make every vendor in your stack nervous:

  • 99% of production code written by AI.
  • 3 to 8 production deployments per day.
  • A new feature shipped at 10 AM, A/B tested by noon, killed by 3 PM because the data said no, and replaced with a better version by 5 PM.
  • Three months earlier, that same cycle would have taken six weeks.

Those numbers are jarring. But the number that stopped me was this one: Pang went from spending 60% of his time managing people to under 10%. Not because he fired anyone. Because the system he built made most of the management work unnecessary.

The velocity gap between AI-native companies and everyone else is widening faster than most people realize. And if you’re being honest with yourself, your last vendor RFP didn’t ask a single question that would surface this gap. You asked about features. You asked about pricing. You asked about integration timelines. You didn’t ask how their engineering team actually builds software — because until recently, the answer didn’t matter. It does now.


The Name for What’s Happening

[Illustration: Code on a central screen with data streams flowing inward from surrounding nodes — database, chat, search, video, documents, globe]

Multiple teams — OpenAI, Anthropic, and several independent practitioners including Pang — have converged on the same concept. The industry is calling it harness engineering: the primary job of an engineering team is no longer writing code. It is building the systems that enable AI agents to do useful work.

When something fails in a harness-engineered system, the fix is never “try harder.” The fix is: what capability is missing, and how do we make it legible and enforceable for the agent?

The engineer’s job is not to write the code. It’s to build the harness — the guardrails, the testing infrastructure, the context pipelines, the deployment automation, the feedback loops — that allows AI to write the code reliably. The code is output. The harness is the product.

This isn’t the same factory running 10% faster. This is a different factory — one that replaced the assembly line with robotics. The workers didn’t get better at turning wrenches. The factory stopped needing wrenches.

Pang arrived at harness engineering on his own, before anyone gave it a name. He spent one week designing the new system and another week re-architecting the entire codebase using agents. CREAO is an agent platform — they used their own agents to rebuild the platform that runs agents. If the product can build itself, it works.

He’s not alone. The convergence is striking. Andrej Karpathy — the former Tesla AI director who coined “Software 2.0” — now frames the shift as Software 3.0: the LLM is the operating system, the context window is RAM, tools are system calls, and the engineer’s primary discipline is context engineering — “the delicate art and science of filling the context window with just the right information for the next step.” That’s harness engineering described from the model’s perspective rather than the engineer’s.

Garry Tan, running Y Combinator, distilled it even further just this weekend:

“Push smart fuzzy operations humans do into markdown skills. Fat skills. Push must-be-perfect deterministic operations into code. Fat code. The harness? Keep it thin.”

Three different practitioners. Three different vocabularies. The same phase change: the intelligence goes into the skills and the code. The orchestration layer — the harness — should be minimal glue, not a complex framework. What matters is what the agent knows and what guardrails constrain it, not how many layers of abstraction sit between the prompt and the output.

At Runline, we’ve been calling this an orchestration fabric — a thin coordination layer with thick governance underneath. Our architecture philosophy predates any of these terms: thick governance, thin orchestration. Invest in durable coordination primitives — task state machines, trust tiers, verification gates, immutable audit trails — that survive model upgrades and harness evolution. Keep the orchestration logic itself thin, because the capabilities underneath advance faster than any framework can keep up with.
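
To make that philosophy concrete, here is a minimal sketch of one durable coordination primitive: a task state machine that validates every transition and appends to an audit log, with orchestration reduced to a few calls against it. The names and shapes are illustrative, written for this article rather than lifted from our codebase.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class TaskState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    AWAITING_APPROVAL = "awaiting_approval"  # a verification gate
    DONE = "done"
    FAILED = "failed"


# The thick part: which transitions are ever legal lives in one place,
# and it survives model upgrades because it knows nothing about models.
LEGAL_TRANSITIONS = {
    TaskState.QUEUED: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.AWAITING_APPROVAL, TaskState.FAILED},
    TaskState.AWAITING_APPROVAL: {TaskState.DONE, TaskState.FAILED},
}


@dataclass
class Task:
    task_id: str
    state: TaskState = TaskState.QUEUED
    audit_log: list = field(default_factory=list)  # append-only by convention

    def transition(self, new_state: TaskState, actor: str) -> None:
        """Every move is validated against the rules and logged with its actor."""
        if new_state not in LEGAL_TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "from": self.state.value,
            "to": new_state.value,
            "actor": actor,
        })
        self.state = new_state


# The thin part: orchestration is just glue around the primitives.
task = Task("deploy-feature-x")
task.transition(TaskState.RUNNING, actor="agent:dev")
task.transition(TaskState.AWAITING_APPROVAL, actor="agent:dev")
task.transition(TaskState.DONE, actor="human:reviewer")
print(task.audit_log[-1])
```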

The terminology is converging. The principle is settled. The question for credit unions isn’t whether this shift is real. It’s whether your technology vendors have made it.


AI-Assisted Is Not AI-First

Most technology vendors will tell you they’re “using AI.” They are. Their engineers have Copilot licenses. Their PMs draft specs with ChatGPT. Their QA team experiments with AI-generated test scripts. The workflow stays the same. Efficiency goes up — maybe 10, maybe 20 percent in my experience. The quarterly release cycle doesn’t change. The Jira boards look identical. The standup meetings still happen every morning.

That’s AI-assisted. And in 2026, AI-assisted is table stakes. It’s the bare minimum. It’s the equivalent of telling a credit union in 2010 that your company “uses the internet.”

AI-first means you’ve torn down the existing process and rebuilt it from the ground up around the assumption that AI is the primary builder. You stop asking “how can AI help our engineers?” and start asking “how do we restructure everything so AI does the building, and engineers provide direction and judgment?”

Pang described the difference bluntly: “I see teams claim AI-first while running the same sprint cycles, the same Jira boards, the same weekly standups, the same QA sign-offs. They added AI to the loop. They didn’t redesign the loop.”

Dimension | AI-Assisted | AI-First
--- | --- | ---
Code generation | Copilot autocomplete (10-30%) | AI writes 90-99% of production code
Deploy cadence | Weekly to quarterly | Multiple times per day
Testing | Manual QA with some automation | AI-built test suites validate AI-written code
Code review | Human reviewers | 3 parallel AI review passes + human oversight
Error handling | On-call rotation, manual triage | Self-healing loop: auto-detect, triage, fix, verify
Planning cycle | Quarterly roadmap, monthly sprints | Prototype-ship-test-iterate in hours
Engineer’s role | Write code faster with AI help | Build the harness; AI writes the code
Efficiency gain | 10-20% improvement (typical) | Order-of-magnitude throughput increase (per Pang: 6-week cycles → same-day)

The difference between these two approaches isn’t 10 to 20 percent. It’s multiplicative. CREAO went from not producing a single production release in two weeks to three to eight deployments per day. That’s not optimization. That’s a different category of operation.

I wrote in Article 28 about what happens when the underlying capability gets commoditized — Jasper AI watched revenue drop from $120 million to $55 million when the models became free. The same dynamic applies to AI-assisted engineering. If your vendor’s competitive advantage is “our engineers use Copilot,” that advantage is worth exactly zero, because every other vendor’s engineers also use Copilot.

The vendors that will survive the next three years are the ones who’ve done what Pang did: dismantled the old process and rebuilt it around AI as the primary builder. The ones who bolted Copilot onto a 2020 sprint cadence are shipping at the same speed they shipped three years ago — while their competitors compound daily. The gap doesn’t close. It accelerates.


The Three Bottlenecks Your Vendor Won’t Mention

Pang identified three bottlenecks that would have killed his company if he hadn’t restructured. Every credit union should ask whether their vendors have addressed the same three.

The product management bottleneck. Traditional product management is a weeks-long research-design-specify cycle. It has worked this way for decades. But when AI can implement a feature in two hours, a weeks-long planning cycle becomes the constraint. As Pang put it: “It doesn’t make sense to think about something for months and then build it in two hours.”

At a vendor that’s truly AI-first, product decisions happen through rapid prototype-ship-test-iterate loops, not specification documents reviewed in committee. The PM doesn’t write a 30-page PRD and hand it off to engineering. They define the intent, the agent builds a working prototype in hours, the team tests it with real data, and the decision is made based on results — not projections.

Ask your vendor: how long is your planning cycle? If the answer is “quarterly roadmap with monthly sprint planning,” they haven’t addressed this bottleneck. They’re spending months deciding what to build and hours building it. The ratio is inverted.

The QA bottleneck. Same dynamic, different function. If AI builds a feature in two hours and the QA team spends three days testing corner cases, you’ve moved the bottleneck ten feet downstream. You didn’t eliminate it.

Pang’s solution was to replace manual QA with AI-built testing platforms that test AI-written code. Validation has to move at the same speed as implementation. His six-phase CI/CD pipeline enforces typechecking, linting, unit tests, integration tests, end-to-end tests, and environment parity checks on every single pull request. No exceptions. No manual overrides. The pipeline is deterministic — agents can predict outcomes and reason about failures before they hit production.
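
Pang hasn’t published the pipeline itself, but the shape is easy to sketch. Here is a hypothetical gate runner in Python; the six phase names come from his description, while the commands are stand-ins for whatever toolchain a real team runs.

```python
import subprocess
import sys

# Six phases, fixed order, every pull request. The commands below are
# illustrative placeholders, not CREAO's actual toolchain.
PHASES = [
    ("typecheck", ["npx", "tsc", "--noEmit"]),
    ("lint", ["npx", "eslint", "."]),
    ("unit tests", ["npm", "test", "--", "--ci"]),
    ("integration tests", ["npm", "run", "test:integration"]),
    ("end-to-end tests", ["npm", "run", "test:e2e"]),
    ("environment parity", ["npm", "run", "check:env-parity"]),
]


def run_pipeline() -> None:
    """Deterministic: same input, same order, no manual overrides."""
    for name, cmd in PHASES:
        try:
            result = subprocess.run(cmd)
        except FileNotFoundError:
            print(f"toolchain missing for phase: {name}")
            sys.exit(1)
        if result.returncode != 0:
            # Fail fast so an agent can reason about exactly which gate broke.
            print(f"FAILED at phase: {name}")
            sys.exit(result.returncode)
        print(f"passed: {name}")
    print("all six phases green; eligible to merge")


if __name__ == "__main__":
    run_pipeline()
```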

Every pull request also triggers three parallel AI code review passes using Claude Opus: one for code quality and logic errors, one for security vulnerabilities and authentication boundary checks, and one for dependency risks and supply chain issues. These run alongside human review, catching what humans miss at volume. When you deploy eight times a day, no human reviewer can sustain attention across every change.
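
The dispatch pattern behind parallel review passes is simple to sketch. In this hypothetical version, `request_review` is a placeholder for a call to whatever review model a team uses; it is not any vendor’s actual SDK.

```python
from concurrent.futures import ThreadPoolExecutor

# Three focused charters, run in parallel on every pull request.
REVIEW_PASSES = {
    "quality": "Review this diff for logic errors and code-quality issues.",
    "security": "Review this diff for vulnerabilities and auth boundary violations.",
    "dependencies": "Review this diff for risky dependencies and supply chain issues.",
}


def request_review(charter: str, diff: str) -> str:
    """Placeholder for a model call; a real implementation would invoke
    the provider's SDK with the charter as the system prompt."""
    return f"({charter[:30]}...) no blocking findings"


def review_pull_request(diff: str) -> dict:
    with ThreadPoolExecutor(max_workers=len(REVIEW_PASSES)) as pool:
        futures = {name: pool.submit(request_review, charter, diff)
                   for name, charter in REVIEW_PASSES.items()}
        return {name: f.result() for name, f in futures.items()}


findings = review_pull_request(diff="--- a/auth.py\n+++ b/auth.py\n...")
for name, verdict in findings.items():
    print(name, "->", verdict)  # surfaced alongside human review, not instead of it
```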

Ask your vendor: is your testing infrastructure automated end-to-end? Do you have AI code review on every PR? If they hesitate, the answer is no.

The headcount bottleneck. CREAO has 25 employees — 10 engineers — shipping at a pace that would require a traditional team many times that size. They couldn’t hire their way to parity. They had to redesign their way there.

This one matters for credit unions for a different reason than you might expect. Your vendor’s headcount tells you almost nothing about their capability. A vendor with 500 engineers operating the old way may ship less, with more bugs, than a vendor with 15 engineers operating inside a properly built harness. The question isn’t how many people they employ. It’s what system those people operate inside.

At CREAO, three systems needed AI running through them: how they design product, how they implement it, and how they test it. If any single one stays manual, it constrains the whole pipeline. The same diagnostic applies to every vendor in your current stack.


The Self-Healing Loop

[Illustration: A circular self-healing loop — five stations connected by flowing arrows, energy pulsing at the center]

This is the part of Pang’s piece that should fundamentally change how you think about what production software infrastructure looks like in 2026.

Every morning at 9:00 AM UTC, an automated health workflow runs at CREAO. An AI agent queries their monitoring infrastructure, analyzes error patterns across all services, and generates an executive health summary delivered to the team via Microsoft Teams. Nobody asked for it. Nobody initiated it. The system watches itself.

One hour later, the triage engine runs. It clusters production errors from monitoring and exception tracking, scores each cluster across nine severity dimensions, and auto-generates investigation tickets — complete with sample logs, affected users, affected endpoints, and suggested investigation paths. If an open issue already covers the same error pattern, it updates that issue instead of creating a duplicate. If a previously closed issue recurs, it detects the regression and reopens it.
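
Pang describes nine severity dimensions without listing them, so the dimension names below are my own illustration. The dedup-and-reopen logic, though, follows his description directly.

```python
from dataclasses import dataclass

# Illustrative dimension names; the source says nine but does not enumerate them.
DIMENSIONS = [
    "user_impact", "frequency", "blast_radius", "data_risk", "auth_related",
    "regression_risk", "error_rate_delta", "endpoint_criticality", "recency",
]


@dataclass
class ErrorCluster:
    fingerprint: str  # stable hash of the error pattern
    scores: dict      # dimension -> 0.0..1.0

    @property
    def severity(self) -> float:
        return sum(self.scores.get(d, 0.0) for d in DIMENSIONS) / len(DIMENSIONS)


def triage(cluster: ErrorCluster, open_issues: dict, closed_issues: dict) -> str:
    """Dedup-aware ticketing: update, reopen, or create."""
    if cluster.fingerprint in open_issues:
        return f"update existing issue {open_issues[cluster.fingerprint]}"
    if cluster.fingerprint in closed_issues:
        return f"regression: reopen issue {closed_issues[cluster.fingerprint]}"
    return f"create issue (severity {cluster.severity:.2f}) with logs and suggested paths"


cluster = ErrorCluster(
    fingerprint="timeout:/api/transfer",
    scores={d: 0.4 for d in DIMENSIONS} | {"user_impact": 0.9},
)
print(triage(cluster, open_issues={}, closed_issues={"timeout:/api/transfer": "#482"}))
```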

When an engineer pushes a fix, the same pipeline handles it: three AI review passes evaluate the pull request, CI validates across all test suites, the six-phase deploy pipeline promotes through dev and prod with testing at each stage. After deployment, the triage engine re-checks production metrics. If the original errors are resolved, the ticket auto-closes.

Read that sequence again: detect → triage → fix → verify → close. The system doesn’t just report problems. It diagnoses them, assigns them, validates the fix, and confirms resolution — then learns from the cycle.
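
Expressed as code, the loop itself is almost trivial; the engineering effort lives in the real services behind each stage. A toy skeleton, with every stage injected as a stand-in callable:

```python
def self_healing_cycle(detect, triage, fix, verify, close):
    """One pass through detect -> triage -> fix -> verify -> close.
    Each stage stands in for a real monitoring, ticketing, or deploy service."""
    for error in detect():
        ticket = triage(error)
        change = fix(ticket)      # AI-authored, human-approved pull request
        if verify(change):        # re-check production metrics after deploy
            close(ticket)
        # unresolved errors stay open and feed the next cycle


# Toy wiring: one "production error" detected, fixed, verified, auto-closed.
self_healing_cycle(
    detect=lambda: ["timeout on /api/transfer"],
    triage=lambda err: {"id": "#501", "error": err},
    fix=lambda ticket: {"pr": "fix/" + ticket["id"]},
    verify=lambda change: True,
    close=lambda ticket: print(f"auto-closed {ticket['id']} after verified fix"),
)
```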

I’ve watched structurally similar loops emerge in credit union compliance operations — I’ll walk through what that looks like in practice below. The pattern is the same: once the system closes a loop end-to-end, it starts compounding.

The self-healing loop isn’t a nice-to-have. It’s the mechanism that turns iteration speed into compounding quality. Every cycle through the loop makes the system more reliable, more adapted, more aware of its own failure modes. I’ll put hard numbers on the compounding effect in the velocity section — but the principle is simple: a system that heals itself eight times a day learns faster than one that waits for a human to file a ticket.

Now apply this diagnostic to your core vendor. When a production error occurs in their system, does it auto-triage, auto-assign, and auto-verify the fix? Or does a human have to notice the alert, file a ticket, assign it to an engineer, review the fix, and manually verify resolution? If it’s the latter, they’re running a 2020 operations model with a 2026 price tag. The error detection might be automated, but everything downstream is manual. And manual means slow. And slow means your system learns four things a month (a weekly human cadence) instead of 240 (eight deploys a day, every day).


The New Org Chart: Architects and Operators

Pang described a restructured engineering organization with two roles. If you evaluate vendors without understanding this distinction, you’ll mistake headcount for capability.

Architects — one or two people who design the standard operating procedures that teach AI how to work. They build the testing infrastructure, the integration systems, the triage engines. They decide architecture and system boundaries. They define what “good” looks like for the agents. The critical skill isn’t writing code. It’s critical thinking — finding the holes in AI’s proposals, identifying failure modes, spotting security boundaries being crossed, catching technical debt being accumulated before it compounds.

As Pang put it: “You criticize AI. You don’t follow it.”

He has a PhD in physics. He said the most useful thing it taught him was how to question assumptions, stress-test arguments, and look for what’s missing. “The ability to criticize AI will be more valuable than the ability to produce code.”

Operators — everyone else. The work matters and requires real skill. The structure is different. AI assigns tasks to humans. The triage system finds a bug, creates a ticket, surfaces the diagnosis, and assigns it to the right person. The person investigates, validates, and approves the fix. AI makes the PR. The human reviews whether there’s risk.

Then Pang described something counterintuitive that every technology leader needs to hear: junior engineers adapted faster than senior engineers.

Juniors felt empowered — they had access to tools that amplified their impact without a decade of habits to unlearn. Seniors had the hardest time. Two months of their accumulated work could be completed in one hour by AI. That’s a difficult thing to accept after years of building a rare skill set.

What he observed has direct implications for how credit unions think about their own teams.

The 25-year veteran systems administrator who knows every quirk of your core processor isn’t going away — that institutional context is exactly the kind of moat I described in Article 34. But their role is evolving from operator to architect. They should be building the systems and context layers that make AI effective inside your institution — encoding their knowledge into harnesses that compound — not manually executing the same processes AI can now handle faster and more consistently.

The implication for vendor evaluation is equally clear. When your vendor says “we have 200 engineers,” ask: how many are architects and how many are operators? Who designs the harness? Who defines what “good” looks like for the agents? If they can’t answer — if they don’t even recognize the distinction — their org chart hasn’t adapted to how software gets built now.


Five Questions for Your Next Vendor Meeting

I’ve been in enough vendor evaluation conversations to know that most credit unions ask about features, pricing, and integration timelines. Those questions still matter. But in 2026, they’re insufficient. They tell you what the vendor has. They don’t tell you whether the vendor will exist in three years.

Here are five questions that will tell you more about your vendor’s future viability than anything on their feature comparison sheet.

1. How many production deployments do you average per day?

The answer reveals everything about their engineering maturity. CREAO averages three to eight per day. Traditional vendors? Maybe once a month. Maybe quarterly if they’re “enterprise-grade.” A vendor deploying weekly is in a fundamentally different category than one deploying daily. Daily deployment requires automated testing, automated review, automated rollback — the full harness. Quarterly deployment means humans are still in every loop.

Good answer: “Multiple times per day.” Acceptable answer: “Weekly, with automated CI/CD.” Red flag: “We do quarterly releases.” That vendor is operating the same way they did in 2018.

2. What percentage of your code is AI-generated?

CREAO is at 99%. Most traditional vendors are at 10 to 30 percent — Copilot-assisted autocomplete on individual functions. There’s nothing wrong with asking this directly. If the vendor is proud of their AI engineering practice, they’ll have a number. If they deflect with “We use AI throughout our process,” they probably don’t track it, which means it’s not central to how they operate. The number itself matters less than whether they measure it.

3. Is your testing infrastructure AI-driven or manual?

This is the bottleneck question. Fast code generation without fast validation is fast-moving technical debt. It’s the equivalent of building a car with a 500-horsepower engine and drum brakes from 1960.

Ask specifically: do AI systems review every pull request? Is end-to-end testing automated on every code change? Do you have AI review passes for security, code quality, and dependencies? If their answer is “we have a great QA team,” follow up with “how many days between code completion and production deployment?” The gap between those two dates tells you exactly how much manual work sits between build and ship. A three-day gap is a three-day bottleneck.

4. Does your system detect, triage, and resolve production issues without human initiation?

Describe the self-healing loop: automated error detection, AI-driven severity scoring, auto-generated investigation tickets with full context, deployment verification, auto-close on resolution. Ask if their production infrastructure does this.

Most vendors won’t have it. The ones that do will know exactly what you’re talking about and light up — because they’re proud of it, and almost nobody asks. The ones that don’t will describe their “24/7 NOC” and “incident management process.” Those are human-speed systems. In a world of AI-speed deployment, human-speed incident management is a structural liability.

5. How has your engineering org structure changed in the last 12 months?

This is the culture question, and it’s the most revealing. A vendor that’s truly AI-first has restructured. Roles have changed. Workflows have been redesigned. Job descriptions look different than they did a year ago.

If the answer is “we added Copilot licenses and hired an AI team,” they’ve bolted AI onto the existing org. They added a room to the house. If the answer describes fundamental restructuring — reduced management overhead, new architect roles, AI-native testing pipelines, collapsed planning cycles, restructured deployment automation — they tore down the house and rebuilt it. Those are different buildings.

No vendor will ace all five questions. But the pattern of their answers will tell you whether they’re on the trajectory or behind it. And in this market, behind means irrelevant in 18 months.


The Self-Healing Credit Union

Everything above is about evaluating your vendors. But the self-healing loop isn’t just an engineering concept. It applies to credit union operations directly.

I watched it happen at a credit union partner with a 4-person BSA team. Their lead analyst — call her Dara — was reviewing 180 alerts per week manually. Every alert followed the same path: rule triggers, Dara opens it, pulls the member’s history, checks for prior SARs, writes a narrative or documents the dismissal, supervisor reviews, file gets closed. Six steps. Fifteen to forty minutes each. She told me her team spent more time documenting why something wasn’t suspicious than investigating things that were.

After six months running a compliance agent alongside her team, the system had learned Dara’s institution’s specific patterns: which transaction types were legitimately seasonal, which members had documented business reasons for cash-heavy activity, which alert categories had a 98% false-positive rate at that specific credit union. The agent now pre-triages every alert — assembling the full investigation context before a human touches it: member history, similar past alerts, the specific pattern that triggered it, relevant regulatory guidance, and a draft narrative. For clear false positives, it prepares a documented recommendation for dismissal with complete reasoning.

The critical point: Dara’s team still reviews and approves every disposition. The agent doesn’t make compliance decisions — trained personnel do, as the regulatory framework requires. But the agent does the research, assembles the evidence, and drafts the documentation. What used to take Dara 30 minutes now takes her 4. The alert volume requiring deep human investigation dropped by over 60%. Not because the rules changed. Because the system learned the context, and Dara’s corrections made it smarter every week.
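
For readers who want to picture the shape of that workflow, here is an illustrative sketch (not our production system) of a pre-triage package and the human approval gate the regulatory framework requires:

```python
from dataclasses import dataclass


@dataclass
class AlertPackage:
    """The context an agent assembles before a human ever opens the alert."""
    alert_id: str
    trigger_pattern: str
    member_history_summary: str
    similar_past_alerts: list
    draft_narrative: str
    recommendation: str  # e.g. "dismiss: documented seasonal cash business"


def disposition(package: AlertPackage, analyst_approves: bool, analyst_id: str) -> dict:
    """The gate: no disposition without a named human decision."""
    return {
        "alert": package.alert_id,
        "decision": package.recommendation if analyst_approves
                    else "escalate for deep human investigation",
        "decided_by": analyst_id,  # trained personnel decide, never the agent
        "draft_narrative": package.draft_narrative,
    }


pkg = AlertPackage(
    alert_id="A-2026-0412",
    trigger_pattern="repeated sub-threshold cash deposits",
    member_history_summary="landscaping business; cash-heavy May-Sept; 6 years clean",
    similar_past_alerts=["A-2025-0388: dismissed, same seasonal pattern"],
    draft_narrative="Activity consistent with documented seasonal business cycle...",
    recommendation="dismiss: documented seasonal cash business",
)
print(disposition(pkg, analyst_approves=True, analyst_id="analyst:dara"))
```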

Every examination finding gets encoded into the system’s understanding of what the institution’s specific compliance standards require. The agent doesn’t game examiner preferences — it builds a deeper model of what genuine compliance looks like at that institution.

I wrote about the false-positive crisis in Article 6 and the examiner-readiness framework in Article 14. The technology exists today. What’s missing at most credit unions isn’t the capability. It’s the harness — the system that connects detection to triage to documentation to learning in a continuous loop.

The same pattern applies to member service escalation, lending exception processing, fraud case management, and operational compliance. Any process that currently follows the pattern of detect → human investigates → human acts → human documents → supervisor reviews can be restructured into a self-healing loop where AI handles the routine cases with full documentation and surfaces the complex ones with complete context.

The credit unions that build these loops now will compound an operational advantage that becomes harder to replicate every month. Every cycle through the loop teaches the system something about your specific institution that no competitor can copy. It’s the same anti-fragility principle from Article 34 — institutional context accumulated through self-healing loops gets more valuable over time, not less. The pressure that drives adoption is the same pressure that widens the moat.


Beyond Engineering: The Full-Stack Test

Pang made a point that most AI coverage overlooks entirely: “If engineering ships features in hours but marketing takes a week to announce them, marketing is the bottleneck.”

At CREAO, they pushed AI-native operations into every function. Product release notes are auto-generated from changelogs. Feature introduction videos are AI-generated motion graphics. Daily social media posts are AI-orchestrated and auto-published. Health reports and analytics summaries are AI-generated from production data and monitoring systems.

The principle is clean: if one function operates at agent speed and another at human speed, the human-speed function constrains everything.

Apply this to your vendor evaluation. Don’t just ask about engineering. Ask about their support operations — is ticket triage AI-powered, or does a human read every inbound support email? Ask about their documentation — does it update automatically when features ship, or is there a technical writer three versions behind? Ask about their security posture — are vulnerability scans and patches AI-driven with automated remediation, or does someone manually review CVE bulletins every Tuesday?

Then apply it to your own institution. Your lending department processes applications at a certain speed. Your compliance team reviews alerts at a certain speed. Your member service team resolves issues at a certain speed. Your marketing team produces campaigns at a certain speed. If you accelerate lending with AI but leave compliance manual, compliance becomes the new bottleneck. If you accelerate compliance but leave member service untouched, member service becomes the constraint.

This is why I wrote in Article 36 about adoption as an institution-wide transformation, not an IT project. The credit unions that will lead their markets in three years are the ones building AI-native operations across every department. Not just the ones with the best chatbot on their website.


The Velocity Moat

I’ve written about capability moats collapsing (Article 34) and about the 18-month window (Article 16) where competitive positions will be set. Pang’s piece crystallizes why velocity — not capability — is the surviving moat.

When CREAO deploys three to eight times per day, each deployment is a learning cycle. The self-healing loop processes errors, the agents refine their understanding of the codebase, the testing harness gets stronger, the context layer deepens. Multiply that by 30 days: 90 to 240 learning cycles per month. A vendor deploying monthly gets one.

After a year, CREAO has completed somewhere north of 1,500 learning cycles. The monthly vendor has completed 12.
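
The arithmetic behind those figures, for anyone who wants to check it against Pang’s reported three to eight deploys per day:

```python
low, high = 3, 8                     # deploys per day; each deploy is a learning cycle
per_month = (low * 30, high * 30)    # (90, 240) cycles per month
per_year = (low * 365, high * 365)   # (1095, 2920); the midpoint is roughly 2,000
print(per_month, per_year, "vs 12 for a monthly-deploy vendor")
```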

This isn’t just a speed advantage. It’s a compounding intelligence advantage. Each cycle makes the system smarter, more resilient, more adapted to its specific operating environment. After 1,500 cycles, the system knows things about its codebase, its failure modes, its user patterns, and its operational edge cases that no amount of upfront planning could have predicted. After 12 cycles, the system is still learning the basics.

I’ve watched this dynamic play out in our own work with credit unions. The institutions that deployed agents early — even when the early iterations were rough around the edges — are now operating with an intelligence advantage that latecomers can’t purchase. The agent’s accumulated context about their members, their examiners, their operational patterns, their community’s economic rhythms is irreplaceable. It was built one learning cycle at a time. There is no shortcut. There is no bulk import. There is no “catch-up” package.

Pang’s description of the one-person company is the logical endpoint of this velocity thesis: if one architect with a properly built harness can match the output of 100 engineers, then the question for every technology company becomes — is your advantage your people or your system? Companies that scale through headcount will be undercut by companies that scale through harness quality. Credit unions evaluating vendors should ask: which kind of company are you building?

The answer determines whether your vendor is compounding or depreciating.


What To Do This Quarter

This isn’t theoretical. Here’s what I’d recommend a credit union executive do in the next 90 days.

Audit your vendors’ engineering velocity. Ask the five questions from this article in your next QBR or vendor check-in. If your core provider deploys quarterly and can’t articulate their AI engineering strategy, start your evaluation timeline now. Not because you’ll switch tomorrow — switching decisions take 12 to 18 months — but because the window in which viable alternatives exist is closing faster than the RFP process moves.

Identify one self-healing candidate internally. Pick one operational process at your institution that follows the manual detect → investigate → act → document → review pattern. BSA alert triage is the obvious starting point, but member service escalation, lending exception processing, and fraud case management all qualify. Define what the self-healing version would look like. Map the loop. Identify where the bottlenecks are. You don’t have to build it this quarter — you need to see it. Seeing it is what moves it from abstract to inevitable.

Restructure your IT team’s mental model. Your best IT people should be spending their time building harnesses — systems, configurations, context pipelines, integrations, automation — not manually processing routine tickets. Every hour they spend on repetitive work that an agent could handle is an hour not spent building the institutional intelligence layer that will define your competitive position for the next decade.

Stop evaluating AI tools and start evaluating AI systems. A chatbot is a tool. A self-healing operational loop is a system. A tool gives you a capability. A system gives you compounding returns. Every AI investment should be evaluated against one question: does this get smarter the longer we use it, or does it perform the same on day 365 as it did on day one? If the answer is the latter, it’s a commodity — buy the cheapest version. If the answer is the former, it’s an asset — invest accordingly.


One Person, Many Agents

[Illustration: The Tower overseeing a grid of Runners — one command surface directing many agents]

I should be transparent about something: the five vendor questions in this article aren’t hypothetical. I ask them about my own company.

Runline is one person and a growing fleet of AI agents. That’s not a temporary bootstrapping phase. It’s the architecture. From the first day I started building this company, the thesis was that AI-native wasn’t an optimization — it was the only way to build infrastructure for credit unions that could actually compete with what the big banks deploy. A traditional approach would require hundreds of engineers and a decade of runway. I didn’t have either. What I had was the conviction that if I built the harness right, the agents would do the rest.

We call our agents Runners — purpose-built AI agents aligned to specific domains. A compliance Runner handles BSA triage. A development Runner writes code, reviews pull requests, and deploys to production. An intelligence Runner does competitive research and market analysis. Each Runner executes Playbooks — SOPs encoded as structured workflows with typed state, branching logic, and approval Gates where human judgment is required. The smallest unit of work is a Skill — a portable, reusable capability that Runners can learn and compose. Everything flows through The Grid, our AI control plane that enforces kill-switches, rate limits, and immutable audit logging on every agent action. Staff observe and direct Runners through The Tower — a command surface where you can watch every Run in progress, approve gates, or trigger a derez (our kill-switch) if something goes sideways.
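
To make the vocabulary concrete, here is a toy Playbook with a single Gate. The names mirror our terminology, but the code is an illustration of the concepts, not our actual SDK:

```python
from dataclasses import dataclass, field


@dataclass
class Gate:
    """A point where a Run pauses for human judgment."""
    name: str
    approved: bool = False


@dataclass
class Playbook:
    """An SOP encoded as ordered steps with typed state and approval gates."""
    name: str
    steps: list                                 # callables: state -> state
    gates: dict = field(default_factory=dict)   # step index -> Gate

    def run(self, state: dict) -> dict:
        for i, step in enumerate(self.steps):
            if i in self.gates and not self.gates[i].approved:
                print(f"paused at gate '{self.gates[i].name}'; awaiting approval")
                return state
            state = step(state)
        return state


# Skills as small composable functions
gather = lambda s: {**s, "context": "member history + prior alerts"}
draft = lambda s: {**s, "narrative": "draft disposition narrative"}
file_disposition = lambda s: {**s, "filed": True}

playbook = Playbook(
    name="bsa-alert-triage",
    steps=[gather, draft, file_disposition],
    gates={2: Gate("analyst sign-off")},  # nothing gets filed without a human
)
state = playbook.run({"alert_id": "A-2026-0412"})  # halts before the filing step
```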

I spend my time doing what Pang’s architects do — designing the Playbooks, building the context layers, defining what “good” looks like for each Runner, and catching the things they miss. I don’t write much code by hand anymore. I build the systems that write code.

Garry Tan’s framework maps exactly to how we work: fat Skills in markdown that encode institutional knowledge about credit union operations, BSA compliance, and examiner expectations. Fat code for the deterministic pieces — the data pipelines, The Grid’s enforcement layer, the testing harness. And a thin orchestration layer that connects them. The intelligence lives in the Skills and the code, not in the glue.

When I described the self-healing loop earlier in this article — detect, triage, fix, verify, close — that’s not something I observed at CREAO and thought “we should try that.” It’s what our production infrastructure already does. When I described Dara’s BSA team working alongside compliance Runners that learn from every correction, that’s the system we built and deployed.

I’m not sharing this to pitch Runline. I’m sharing it because when a credit union CEO asks me “does your company pass its own test?” — the answer should be visible in how the company actually operates, not in a slide deck. One person. Many agents. AI-native from day one. The harness is the product we sell because it’s the product we run on.


The Harness Is the Product

Pang described the cost of this transformation honestly: “Uncertainty among employees, the CTO working 18-hour days, senior engineers questioning their value, a two-week period where the old system is gone and the new one isn’t proven.” He absorbed that cost. Two months later, the numbers spoke.

Credit unions don’t have to rebuild their engineering organizations — that’s their vendors’ job. But they have to recognize which vendors have done the hard work and which are still running the old playbook with a fresh coat of AI marketing paint. The distinction between AI-assisted and AI-first isn’t visible on a feature comparison sheet. It’s visible in deployment frequency, in org structure, in how the system handles its own failures, in whether the technology compounds institutional intelligence over time or resets to zero every session.

The principle Pang applies to fixing failures also applies to fixing vendors: the answer is never “try harder.” The answer is: what’s missing from the system, and how do you make it legible?

The code matters less than the harness. The harness matters less than the loop. And the loop only matters if it’s running fast enough to compound.

Your vendor still writes code by hand? That’s the clearest signal they could send about where they’ll be in 18 months.


This is Article 37 of the Insights series. Previously: Article 36 — The Adoption Playbook | Article 34 — Your Data Isn’t Your Moat | Article 16 — The 18-Month Window

Get Started

Ready to see what stateful AI agents can do for your credit union?

Runline builds purpose-built AI agents for regulated financial institutions. Every interaction compounds institutional intelligence.

Schedule a Demo