What Your Credit Union Looks Like in 150 Lines of Python

Andrej Karpathy just wrote down what your CU does. The same algorithm that trains a tiny GPT in 150 lines runs in your back office on a 10-year timescale — informally, without capture, watching the gradient signal walk out the door at retirement. Here's what changes when you can see it.

By Sean Hsieh, Founder & CEO, Runline
12 min read
Published April 28, 2026


Last week’s piece said the institution that captures override patterns will compound; the one that does not will train its competitors’ agents indirectly, by developing the people those competitors later poach. This week, Andrej Karpathy shipped the atomic recipe for how that compounding actually works.

He posted a gist on Monday. Just under 200 lines of pure Python. The substantive learning machinery — dataset, autograd, GPT, optimizer, training loop, inference — is about 150 lines. No dependencies beyond os, math, and random at runtime (one urllib.request line bootstraps the dataset on first run).
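
The parenthetical is worth seeing. Paraphrased (this is the shape of the pattern, not the gist's exact line, though the URL is the makemore dataset the article names), the bootstrap is:

import os
import urllib.request

# Fetch the training corpus once; every later run is dependency-free.
if not os.path.exists('input.txt'):
    url = 'https://raw.githubusercontent.com/karpathy/makemore/master/names.txt'
    urllib.request.urlretrieve(url, 'input.txt')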

He calls it:

“This file is the complete algorithm. Everything else is just efficiency.”

Karpathy just wrote down what your CU does. The 150 lines aren’t an explainer of how AI works. They’re a map of how an institution learns — same algorithm, same recipe, same compounding effect. Read it like a CU operator and you’ll see the back office on a 10-year timescale, in code.

Think of the model as a brand-new MSR on her first day at your credit union. She doesn’t know Maria’s flower shop. She doesn’t know your underwriting team’s tolerance for borderline DTI. She doesn’t know that the examiner who reviewed your last SAR has a particular preference for how the suspicious-activity narrative is structured. None of that is in any document she has read.

Everything that follows is how she becomes the senior MSR by year five. The CU that runs this loop deliberately compounds. The CU that runs it informally — the new MSR learns from the senior MSR, day by day, but the institution doesn’t capture what she learned — watches the gradient signal walk out the door at retirement, and starts over every five years.

That is the cost of running the loop without the substrate.


The five stages of learning

Each stage shows a code excerpt from Karpathy’s gist, then the credit-union analog, then a visual. Stage 3 — autograd — breaks the pattern: it runs as continuous prose rather than the formal triad, because it carries the weight of the article.

Stage 1 — The dataset

The first line of Karpathy’s gist that does any actual work is random.seed(42). The comment reads: “Let there be order among chaos.” That is the article in a sentence. Every institution starts as chaos. The deliberate ones impose enough order that the system can take a guess at what comes next.

docs = [line.strip() for line in open('input.txt') if line.strip()] # one document per line, blanks dropped
random.shuffle(docs) # mix the order so training doesn't track the file's chronology
print(f"num docs: {len(docs)}")

Karpathy trains on names.txt from his makemore repo — about 32,000 names, one per line.

Every BSA alert your team has cleared. Every loan you have underwritten. Every member service call closed. The dataset is your institution’s collective experience — the cases it has already seen. As I argued in Article 30, “The AI That Remembers You”, every CU has this corpus already; the question is whether it’s structured to be reusable.

Two lessons hide in those three lines of code. The shuffle matters because chronological order causes the model to overfit to recency: training on every BSA case from 2024, then every case from 2025, produces a worse model than the same cases mixed. The count matters because no agent learns from ten cases. Karpathy needs ~32,000 names; a CU’s agents need a similar order of magnitude.
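
Here is a minimal sketch of those same three lines pointed at a hypothetical CU corpus. The file name cleared_cases.txt and the 90/10 split are illustrative, not from the gist:

import random

random.seed(42)  # let there be order among chaos
# One closed case per line: alert narratives, underwriting memos, call summaries.
docs = [line.strip() for line in open('cleared_cases.txt') if line.strip()]
random.shuffle(docs)  # mix 2024 and 2025 so the model doesn't overfit to recency
split = int(0.9 * len(docs))
train_docs, val_docs = docs[:split], docs[split:]
print(f"num docs: {len(docs)} (train {len(train_docs)}, val {len(val_docs)})")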

[Figure: alerts, loans, calls, SARs, memos flow into the corpus — docs = [...]. Every cleared alert, every underwritten loan, every closed call: the institution's collective experience as a reusable corpus.]

Stage 2 — The tokenizer

uchars = sorted(set(''.join(docs))) # unique characters in the dataset become token ids 0..n-1
BOS = len(uchars) # token id for a special Beginning of Sequence (BOS) token
vocab_size = len(uchars) + 1 # total number of unique tokens, +1 is for BOS

Every unique character gets a number. Plus one special “Beginning of Sequence” token so the model knows where words start.

Every CU has its own vocabulary — share account, line of credit, NSF, OD return, MNTFLD ADDR1. The tokenizer is your taxonomy. Symitar’s MNTFLD codes are literally tokens — six-character strings the system uses to refer to fields. Every integration that touches Symitar has to learn that vocabulary; if the codes aren’t space-padded correctly the integration fails on day one.

Cleaning the taxonomy is the prerequisite to compounding learning, the same way Karpathy’s 26 unique characters (27 tokens once BOS is added) are the prerequisite to training the model. Incoherent vocabulary in, incoherent agent out.
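
To make that concrete, here is the same three-line tokenizer run word-level over a toy CU vocabulary. The term list is illustrative; the mechanism is Karpathy's:

# Word-level variant of the gist's character tokenizer, over CU terms.
docs = ["share account NSF", "line of credit OD return"]
uwords = sorted(set(' '.join(docs).split()))  # unique terms become token ids 0..n-1
stoi = {w: i for i, w in enumerate(uwords)}   # term -> token id
BOS = len(uwords)                             # special Beginning of Sequence token
vocab_size = len(uwords) + 1                  # +1 for BOS
print(stoi, BOS, vocab_size)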

[Figure: a toy CU vocabulary mapped to token ids — share account → 01, line of credit → 02, NSF → 03, OD return → 04, MNTFLD ADDR1 → 05, plus ⟨BOS⟩. Your taxonomy is the language your agents reason in. Symitar's MNTFLD codes are literally tokens.]

Stage 3 — Autograd: the substrate of learning

Karpathy’s Value class is small enough to read as a single artifact:

class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data                # scalar value of this node calculated during forward pass
        self.grad = 0                   # derivative of the loss w.r.t. this node, calculated in backward pass
        self._children = children       # children of this node in the computation graph
        self._local_grads = local_grads # local derivative of this node w.r.t. its children

    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

The Value class wraps every number in the model with a record of which inputs produced it. When the model is wrong at the end, backward() walks the chain of operations in reverse topological order and figures out, for each input, how much it contributed to the error. That is the entire mechanism by which the model learns from being wrong. Every other component depends on this one.
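
A usage sketch, using the class exactly as excerpted. There are no operator overloads in the excerpt, so the graph is wired by hand: one multiply, one backward, two gradients.

# loss = a * b, wired by hand.
a = Value(2.0)
b = Value(3.0)
# Local derivatives: d(a*b)/da = b.data, d(a*b)/db = a.data.
loss = Value(a.data * b.data, children=(a, b), local_grads=(b.data, a.data))
loss.backward()
print(a.grad, b.grad)  # 3.0 2.0: how much each input moved the output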

I have watched a BSA analyst clear a case in forty seconds and then spend an hour explaining why to a junior. Every word of that hour is a gradient. Almost none of it gets captured.

Most CUs treat audit trails as compliance overhead — a record of what was decided, retained because regulators require it. Karpathy’s gist suggests something different. The audit trail is the substrate of learning. Without it, you cannot propagate gradients. Without gradients, you cannot update. Without updates, the institution does not compound — it forgets, every retirement.

This is also where regulatory care starts mattering. As soon as audit-trail data flows back into a model that makes risk decisions, that pipeline becomes a model component governed by SR 11-7 model risk management principles. NCUA examiners reference the same interagency MRM principles when reviewing credit-union risk models. Effective challenge, independent validation, change management — all attach. The institution that operationalizes Stage 3 also operationalizes the governance that comes with it; the institution that hand-waves Stage 3 will eventually meet an examiner who asks “how did the agent learn that?” and won’t have an answer.

This is the hardest piece of the algorithm to read in code, and the highest-leverage piece for a CU to operationalize. Skip it and the rest of the loop has nothing to flow through.

[Figure: Forward pass — member context, transaction, policy, and history flow into the agent's verdict. Backward pass — when the override happens, the gradient walks back, distributing blame to each input: which inputs to nudge, and by how much. The audit trail is the substrate.]

Stage 4 — The forward pass

def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id] # token embedding
    pos_emb = state_dict['wpe'][pos_id] # position embedding
    x = [t + p for t, p in zip(tok_emb, pos_emb)] # joint token and position embedding
    x = rmsnorm(x)
    # ... attention layers + MLP layers ...
    logits = linear(x, state_dict['lm_head'])
    return logits

The forward pass takes a token and asks: given everything I know about this token, its position, and what I have seen so far, what should come next?

A senior analyst looking at a new case. She pulls in context — token + position embeddings answer what kind of member, what point in the timeline. Runs through her mental model — attention layers pattern-match against thousands of similar cases; MLP layers filter through CU policy. Produces a verdict — logits over possible actions.

Every judgment call in your back office has this shape. The senior analyst is doing the same thing the model does — except she does it intuitively, in seconds, with thousands of cases of compounded experience already inside her.
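
The last line of the forward pass returns logits, not probabilities. Here is a hedged sketch of the step that follows, with a standard softmax stand-in and illustrative action names (the four verdicts are this article's analogy, not the gist's):

import math

def softmax(logits):
    # Subtract the max for numerical stability; standard practice.
    exps = [math.exp(l - max(logits)) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

actions = ['approve', 'deny', 'escalate', 'investigate']
logits = [2.1, 0.3, 1.4, -0.5]  # the raw verdict
for action, p in zip(actions, softmax(logits)):
    print(f"{action}: {p:.2f}")  # most of the mass lands on 'approve'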

[Figure: A senior analyst's mental model, in code. A new case flows through token and position embeddings ("what kind of member, what point in the timeline"), attention layers ("pattern-match against thousands of similar cases"), MLP layers ("filter through CU policy"), and the lm_head into a probability over actions: approve, deny, escalate, investigate.]

Stage 5 — Loss + backward + Adam

Code excerpt, abridged to the inner loop. lr_t = learning_rate * (1 - step / num_steps) is defined just above in the gist:

loss = (1 / n) * sum(losses) # final average loss over the document sequence. May yours be low.

# Backward the loss, calculating the gradients with respect to all model parameters
loss.backward()

# Adam optimizer update: update the model parameters based on the corresponding gradients
for i, p in enumerate(params):
    m[i] = beta1 * m[i] + (1 - beta1) * p.grad
    v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
    m_hat = m[i] / (1 - beta1 ** (step + 1))
    v_hat = v[i] / (1 - beta2 ** (step + 1))
    p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
    p.grad = 0

The gap between the model’s prediction and what actually came next is the loss. Walk that gap backward through the network. Each parameter gets a small adjustment in the direction that would have made the error smaller. Adam is just a smarter way to figure out how big each adjustment should be.
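
One step of that excerpt with concrete numbers, as illustration: a single parameter at its first update, so the moment buffers start at zero. All constants are illustrative.

# First Adam step on one parameter.
beta1, beta2, eps_adam = 0.9, 0.999, 1e-8
lr_t = 0.01
p_data, p_grad = 0.50, 2.0                 # current weight, current gradient
m = beta1 * 0.0 + (1 - beta1) * p_grad     # first moment (mean of gradients)
v = beta2 * 0.0 + (1 - beta2) * p_grad**2  # second moment (mean of squared gradients)
m_hat = m / (1 - beta1 ** 1)               # bias correction at step + 1 = 1
v_hat = v / (1 - beta2 ** 1)
p_data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
print(round(p_data, 4))  # 0.49: a small, calibrated nudge, not an overcorrection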

The override.

When a senior person reviews the agent’s output and says “no, that’s not right,” the loss is the gap between the agent’s call and the right call. The backward pass is the post-mortem: “why did the agent get this wrong? was it missing context? was the policy interpretation off?” The Adam update is the playbook revision — small, deliberate, calibrated to not overcorrect on a single case.

Run this thousands of times — every override captured, every gradient propagated, every playbook nudged — and the agent’s wrongness traces the same descending shape as Karpathy’s loss curve. Same algorithm. Same compounding. The new MSR becomes the senior MSR.
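
What capture might look like at its most minimal. Every name below is hypothetical, a sketch of the loop rather than any real system:

# Hypothetical override-capture step (all names illustrative).
def capture_override(case_id, agent_verdict, human_verdict, rationale, audit_log):
    if agent_verdict != human_verdict:
        audit_log.append({
            'case': case_id,
            'agent': agent_verdict,   # what the agent said
            'human': human_verdict,   # the right call
            'rationale': rationale,   # the hour of explanation, captured
        })

audit_log = []
capture_override('SAR-1042', 'clear', 'escalate',
                 'structuring pattern across linked share accounts', audit_log)
print(len(audit_log))  # each entry is a gradient the institution keeps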

The same code has another dial: probs = softmax([l / temperature for l in logits]). Temperature controls how creative the output is. The BSA Runner wants low temperature — predictable, examiner-friendly. The Member Services Runner wants medium — warm, flexible. The Discovery Runner wants higher — exploratory. Different agents need different dials based on what kind of mistake you can afford.
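
The dial in code, reusing the softmax sketch from Stage 4. Same logits, three temperatures; the Runner pairings are from the prose above, the numeric values are illustrative:

import math

def softmax(logits):
    exps = [math.exp(l - max(logits)) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.1, 0.3, 1.4, -0.5]
for label, temperature in [('BSA', 0.3), ('Member Services', 1.0), ('Discovery', 2.0)]:
    probs = softmax([l / temperature for l in logits])
    print(label, [round(p, 2) for p in probs])
# Low temperature concentrates mass on the top action (predictable, examiner-friendly);
# high temperature spreads it out (exploratory).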

The override signal must be audited for bias before it updates model parameters. If reviewers disproportionately override agent denials for one protected class and approvals for another, the gradient encodes that pattern into the agent — and Reg B / ECOA disparate-impact analysis applies to the credit decisions the model produces. Fair-lending oversight has to attach to the same loop that drives improvement.

[Figure: The override walks back through the system. Each playbook parameter — BSA threshold, DTI tolerance, verification, policy weight, flag rule — gets a small calibrated nudge. Repeat 1,000+ times — every override, every gradient, every nudge — and the new MSR becomes the senior MSR.]

The training loop

num_steps = 1000
for step in range(num_steps):
    ...

Karpathy runs the entire learning loop 1,000 times. Each iteration: see one document, predict the next character, compute loss, walk back, update.

Every interaction at your CU is a training step. The institution that runs 1,000 such steps with intent compounds. The institution that runs them without capture watches the gradient signal walk out the door at retirement.

The math is the same as last week’s asymptote — the irreducible slice doesn’t compress, the bar moves with the floor, today’s 99% is tomorrow’s 50%. Each loop step shrinks the loss by a small amount; the curve descends asymptotically; the limit is never reached. Institutional learning over time traces the same shape. The CU that compounds is operating at the useful end of that curve.
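
A toy rendering of that shape, assuming exponential decay toward an irreducible floor. The constants are illustrative, not fit to anything:

import math

floor = 0.2  # the irreducible slice that doesn't compress
for step in [0, 100, 250, 500, 1000]:
    loss = floor + (1 - floor) * math.exp(-step / 200)
    print(f"step {step:>4}: loss {loss:.3f}")
# Relabel the x-axis from steps to years and the same curve is the
# institution's wrongness, if the loop is captured.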

The CU that runs 1,000 deliberate steps per BSA officer per year, captured, ends year three with a different agent than the CU that runs the same 1,000 steps without capture. Same volume of work, same starting point, different trajectory. Different year three. Different year five. By year ten, the gap is the difference between two institutions that started the decade looking identical.

[Figure: Karpathy's loss curve descends across 1,000 training steps; institutional wrongness descends across a decade of cases. Same shape, different timescale — if the loop is captured.]

What changes when you can see it

What AI builders have that most credit unions don’t is a deliberate training loop. CUs run the same algorithm informally — the new MSR learns from the senior MSR, day by day — but unevenly, without capture, without the audit trail flowing back into anything that compounds. All five mechanisms are already running in your back office; all five are uncaptured. The substrate is missing.

That substrate is what we build at Runline: override capture, audit-trail-as-training-data infrastructure, playbook revision pipelines. The institution still runs the loop; we ship what makes the loop teachable. That is the difference between a CU whose senior people walk out with their gradients and a CU whose senior people leave their gradients in the system.

The cost of running the loop without the substrate compounds in the opposite direction. The senior who clears a case in forty seconds and explains it to a junior for an hour leaves at year ten. Without capture, that hour of gradient walks out with her. The new MSR learns it again from the next senior, slower, with more errors, until she becomes the next senior who carries it out the door at year fifteen. Every five years the institution starts over. Every five years the agent resets to random initialization.

Read the ~150 lines. Then look at your institution. The dataset is your alerts. The tokenizer is your taxonomy. The autograd is your audit trail. The forward pass is your senior analyst’s judgment. The loss is the override. They are all there.

The only question is whether you have structured your institution to compound them — or to forget them every retirement.

Get Started

Ready to see what stateful AI agents can do for your credit union?

Runline builds purpose-built AI agents for regulated financial institutions. Every interaction compounds institutional intelligence.

Schedule a Demo