What follows is what an embedded twelve-week AI engineering engagement looks like at our practice. There is no discovery phase. There is no readiness assessment. There is no kickoff deck. Week one is shipping week.
This is a deliberate choice, and not a fashionable one. The standard consulting playbook has weeks one through three for “discovery,” which in practice is a series of meetings where the client tells the consultant what they already know and the consultant takes notes. By week four the consultant produces a deck. By week six the deck is approved. By week eight the team starts to build. By week twelve they are halfway done with whatever they were going to build, and the consultant disappears.
We dislike this for two reasons. The first is that the team usually knows the problem. The discovery phase is theatre, and senior engineers can smell theatre. The second is that the only way to actually understand what an AI feature should do is to put a bad version of it in front of users and watch them use it. That is week one work, not week eight work.
Week one
Two of us arrive on a Monday. We have read the relevant code over the weekend. We have a pre-existing thesis about what to build first. The thesis came from a one-hour call with the engineering lead the previous week, plus an hour reading the codebase.
Monday and Tuesday are spent with three or four people on the team, refining the thesis. We do not run workshops. We sit at desks. We ask one question at a time. We say things like, “If we shipped only this, what would happen tomorrow?” and we listen. We rule out the wrong things faster than we converge on the right thing.
By Wednesday afternoon there is a pull request open, against the actual product, scaffolding the simplest possible version of the feature. By Friday a small group of internal users has touched it. They hate parts of it. We have notes.
The point of week one is not the artifact. The point of week one is to break the spell. Most teams, when they start working with us, expect a deliverable they cannot use for several weeks. By Friday they have a thing they can poke. The relationship for the rest of the engagement runs on that.
Weeks two through five
The middle is the part that looks the most like normal product engineering, with one difference: every Friday we cut a small release of whatever has changed and put it in front of someone who is not in the team. The team owns the work. We are pair-programmers, code reviewers, and architectural counsel. We do not own a backlog. We do not own a Jira project. We own the feedback loop.
There are usually three things happening at once. A retrieval or evaluation pipeline. A model call path that someone on the team is making faster, cheaper, or more reliable. A small piece of internal tooling that is not the headline feature but unblocks one of the engineers. By the end of week five most of the team has done at least one round of this work alone.
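By “retrieval or evaluation pipeline” we mean, in week two, something closer to a fifty-line script than a framework. A minimal sketch of that eval loop follows; every name in it (load_cases, grade, run_eval, the file format) is a hypothetical stand-in for whatever the team already has on hand:

```python
# A minimal eval loop, sketched, not a framework. All names here are
# hypothetical stand-ins for whatever the team already has.
import json
from pathlib import Path

def load_cases(path: str) -> list[dict]:
    # One JSON object per line: {"input": "...", "must_contain": ["..."]}
    return [json.loads(line)
            for line in Path(path).read_text().splitlines() if line.strip()]

def grade(output: str, case: dict) -> bool:
    # The cheapest check a human can argue with: required substrings.
    # A model grader can replace this later; start somewhere inspectable.
    return all(term.lower() in output.lower() for term in case["must_contain"])

def run_eval(call_feature, cases: list[dict]) -> float:
    # call_feature is the team's own entry point: input string -> output string.
    passed = 0
    for case in cases:
        output = call_feature(case["input"])
        if grade(output, case):
            passed += 1
        else:
            print(f"FAIL: {case['input'][:60]!r}")
    return passed / len(cases)
```

Crude, deliberately. The point at this stage is a pass rate the team can watch move week over week, not a harness.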
This is the period where senior engineers either start to move or stop moving. We have written elsewhere about what this transition looks like and what we do when it stalls. The short version is that one of us pairs with the slowest senior for a week, on a real problem, and we ship their first model-driven feature into production together. After that, in our experience, it tends to take care of itself.
Week six
By week six the engagement has its first real review. The questions are concrete. What is now in production. What is the latency, the cost, the user-visible quality. What is the team doing now that it could not do in week one. What do the next six weeks need to deliver to be worth paying for.
We almost always restructure something at week six. The plan we made in week one was wrong about a few things. The mid-engagement reset is a feature of the design, not an embarrassment. The teams that try to push through their week-one plan unchanged at week six are the ones that miss the actual win sitting two pivots to the side.
Week six is also when we start a deliberate handoff conversation. Who on the team owns the model call path on day ninety-one. Who owns the eval. Who has the relationships with the relevant infra teams. If those names are not nominated by week seven, the engagement is in trouble.
Weeks seven through ten
This stretch shifts from us-and-them to them-and-us. We are still in the codebase. We are still in standups. We are pairing less. We are reviewing more. The number of features in production is going up, which means the number of pages, dashboards, and on-call rotations the team owns is going up. We watch how those rotations get handled. The engagement is not going well if we are still the people who get paged.
Two specific things happen during this period that we look for, because they are reliable signals.
The first is that someone on the team, usually a senior who was slow at first, starts pushing back on us about a design choice we proposed in week three. They are right. The team is now ahead of where the original plan was, because they understand the problem better than we do. We change the design. The fact that this happens is the deliverable.
The second is that a feature we did not propose, did not scope, and did not know about gets shipped by someone on the team who saw a chance and took it. This is the one we celebrate the most. It is the moment where the team is no longer being run; it is running.
Weeks eleven and twelve
The last two weeks are operational. Documentation. On-call runbooks. The eval framework, which by this point has accreted real test cases from real production failures, gets cleaned and committed. The handoff document is short. We try to keep it under five pages. Long handoff documents tend not to be read.
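“Accreted from production failures” means, concretely, that every page leaves a regression case behind. A sketch of that habit, with field names that are ours for this illustration only, matching the case format sketched earlier:

```python
# Sketch only: one production failure becomes one committed eval case.
# Field names are illustrative, not a standard.
import datetime
import json

def failure_to_case(user_input: str, bad_output: str,
                    must_contain: list[str]) -> dict:
    return {
        "input": user_input,             # the real input that paged someone
        "observed_failure": bad_output,  # kept for the postmortem trail
        "must_contain": must_contain,    # what a correct answer must include
        "added": datetime.date.today().isoformat(),
    }

# Appended to the same JSONL file the weekly eval run reads.
with open("eval_cases.jsonl", "a") as f:
    f.write(json.dumps(failure_to_case(
        "cancel my subscription but keep my account",
        "Your account has been deleted.",
        ["subscription", "account will remain"],
    )) + "\n")
```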
The week-twelve review is structured to answer a single question. Can the team, on day ninety-one, ship the next thing without us. If the answer is yes, the engagement was successful. If the answer is no, we usually offer a six-week extension at a reduced load, two days a week. About one engagement in three runs that extension. About zero engagements in three need it past the six weeks.
What we measure
We measure four things across the engagement. We do not pretend they are sufficient.
Time from “we want to do this” to “users can touch a version of it.” Across the last six engagements this number went from a median of nine weeks before we arrived to a median of eleven days by the time we left.
Cost per relevant model call. Half the engagements involved at least one rewrite of a call path that took the price per feature down by sixty percent or more. The rewrites were almost always the same two or three patterns. We will eventually write those down.
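In the meantime, one generic illustration of the category, explicitly not one of our actual patterns: the cheapest rewrite is often to stop paying for the same call twice. A sketch, where call_model stands in for whatever client the team uses:

```python
# Generic illustration only: memoize identical model calls on a hot path.
# Valid only when the path is deterministic (temperature 0) and the same
# inputs genuinely recur; measure the hit rate before celebrating.
import hashlib

_cache: dict[str, str] = {}

def cached_call(call_model, model: str, prompt: str) -> str:
    # call_model is a hypothetical stand-in: (model, prompt) -> str.
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(model=model, prompt=prompt)
    return _cache[key]
```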
Number of people on the team who have shipped a model-driven feature alone. We start every engagement with a count, and we end every engagement with a count, and the gap is the deliverable.
Whether the team feels less stuck. This is not a metric. It is a question we ask three times during the engagement and once after. The answer almost always tracks the other three. When it doesn’t, it tracks the other three eventually.
The framework above is what we have settled into after running this engagement six times. It is not the only way to do it. We have been wrong about pieces of it, and we have changed the pieces we were wrong about. The shape that has stayed constant, across every engagement, is that week one ships. Everything else is downstream of that.