Born in 16k Tokens · clement.sh

Back at Greenkub (yes, the wooden-house company again) we had a different AI problem before the agents showed up. Our sales team lived on the phone. Every day, hundreds of calls, some of them an hour and a half long, full of signal nobody had time to extract: what the prospect actually wanted, their budget, whether they were a tyre- kicker or ready to sign. So we built Revelio: a thing that listens to a call and answers questions about it.

This was, in fact, the project I was hired for. I'd started at Greenkub that September, barely a month in, and Revelio was the reason. The first commit landed on 16 October 2023. Keep that date in mind, it's the whole joke.

The innocent version

The idea was simple. Take the transcript, hand it to GPT, ask it questions, get structured answers back. The clever bit was that the questions weren't hard-coded: a non-dev could define a Query in the product (a name, a question, a return type, maybe a list of choices), and we'd compile those into an OpenAI function-calling schema at runtime.

askQueryToOpenAI.tsts

const { properties, required } = queries.reduce((acc, query) => ({
  properties: {
    ...acc.properties,
    [query.name]: {
      type: formatQueryType(query.returnType),
      description: query.query,
      ...(query.choices?.length && { enum: query.choices }),
    },
  },
  required: [...acc.required, query.name],
}), { properties: {}, required: [] })

await openai.chat.completions.create({
  model: 'gpt-3.5-turbo-16k-0613',
  messages: [
    { role: 'system', content: prompt },
    { role: 'user', content },
  ],
  functions: [
    {
      name: 'analyse_conversation',
      parameters: { type: 'object', properties, required },
    },
  ],
  function_call: { name: 'analyse_conversation' },
})

If that functions array looks dated to you, good. In October 2023 there was no JSON mode, no Structured Outputs, no tool_calls. Function calling, shipped that June, was the only reliable way to force a model to return typed JSON instead of a friendly paragraph. So we leaned on it hard, pinning function_call to a single function so the model had no choice but to fill in our schema.

And gpt-3.5-turbo-16k? At the time, that 16,385-token window was the roomy option. Base GPT-4 capped out at 8k. The 32k model was expensive and gated. Sixteen thousand tokens felt like a mansion.

Then, on 6 November 2023, OpenAI held DevDay and announced GPT-4 Turbo with a 128k context window, JSON mode, and parallel tool calls. Three weeks after our first commit, every constraint we'd designed around was on its way out. We'd built a product in the last days of the small-context era and didn't know it.

The wall: a 90-minute call does not fit in a mansion

Here's the thing about a 16k window when your input is a 90-minute sales call. It doesn't fit. A long French transcript runs tens of thousands of tokens, and that's before you count the schema. Because the function definition also costs tokens, every question you ask eats into the same budget as the conversation itself.

So the first real piece of engineering wasn't a feature. It was a token counter that included the function schema in the estimate, not just the messages.

getTokenCountFromOpenAiFunction.tsts

import { promptTokensEstimate } from 'openai-chat-tokens'

// Count the messages AND the function schema — both spend the budget.
export const getTokenCount = ({ content, context, properties, required }) =>
  promptTokensEstimate({
    messages: [
      { role: 'system', content: context },
      { role: 'user', content },
    ],
    functions: [
      {
        name: 'ask_queries',
        parameters: { type: 'object', properties, required },
      },
    ],
  })

With a way to measure, we could slice. The transcript is one line per utterance, so we packed lines greedily until the running total (content + system prompt + schema) crossed the limit, then started a fresh chunk.

divideContentInSmallerParts.tsts

const contents = []
let current = ''
for (const line of content.split('\n')) {
  const tokens = getTokenCount({
    content: current + line,
    context,
    properties,
    required,
  })
  if (tokens > maxTokens) {
    contents.push(current)
    current = ''
  }
  current += line + '\n'
}
if (current.length) contents.push(current)

The second wall: too many questions

Chunking the transcript solved one axis. The other was the questions. Pile twenty, thirty properties into a single function schema and two bad things happen: the schema blows your budget, and the model gets sloppy, answering the first few questions well and phoning in the rest.

So we batched the queries too. Roughly: at most six free-text string questions per call, up to twenty for the cheaper constrained types (booleans, enums, numbers). A 90-minute call with a rich tag setup therefore fanned out into N transcript chunks × M question batches separate OpenAI calls, all merged back together at the end. One "analyse this call" click could quietly become a dozen requests.

Which, of course, is when we met the rate limiter. We ended up reading the x-ratelimit-* response headers back into the database so we could pace ourselves instead of getting slapped. Every file in that repo is a lesson taken to the face.

One more sin: the transcripts themselves

Small confession. Aircall, our phone provider, didn't expose call transcriptions through its public API back then. So we logged into their internal GraphQL endpoint with an email and password and pulled the transcripts straight out. The code knew exactly what it was:

SaaS.interfaces.tsts

aircall: {
  // will probably be removed when aircall
  // prevents us from getting transcription
  email: string
  password: string
}

Ship now, cry later. It's the most honest comment in the codebase.

The code did not age, it scarred

Fast-forward and the model registry tells the whole arc on its own:

modelsOpenAI.tsts

[
  { name: 'GPT-3.5',     openAIName: 'gpt-3.5-turbo-1106', limitTokens: 16385  },
  { name: 'GPT-4',       openAIName: 'gpt-4-0613',         limitTokens: 8192   },
  { name: 'GPT-4o mini', openAIName: 'gpt-4o-mini',        limitTokens: 128000 },
]

That last line is the punchline. By August 2024 we had a 128k window for a fraction of the old price, and a 90-minute call dropped into it whole, no slicing required. The clever token counter, the greedy chunker, the query batcher: all suddenly optional.

But here's what I keep telling people. The discipline that scarcity forced on us never went away. Counting tokens before you send them, splitting work so the model stays sharp, treating each call as something that costs money and can fail, pacing yourself against a rate limit: that's not 2023 nostalgia, that's just how you keep an LLM product cheap and reliable at scale. Bigger windows didn't delete the lessons. They just stopped being the price of entry.

Revelio was born in 16k tokens. It grew up to fit in 128k. And the bones it built while the room was small are the reason it still stands.