Prompt Engineering is Technical Debt
Why your 5,000-token system prompt is a liability, and how to refactor it into a fine-tuned model.
If you have built an agent in production, you know the lifecycle.
It starts with a simple prompt: “You are a helpful assistant.”
Then the edge cases hit. The prompt grows.
“You are a helpful assistant. Do not hallucinate URLs. If the user asks for weather, use the tool, do not guess. If the tool fails, retry once. Do not use markdown in JSON blocks...”
I am not talking about telling Claude to “be concise.” That is just communication.
I’m talking about the “Prompt Monolith”—the 3,000-token system prompt filled with 50 edge-case rules, 10 few-shot examples, and complex XML schemas.
You know the one. It lives in a dedicated file. Everyone is terrified to touch it because changing a sentence in Paragraph 4 somehow breaks the JSON output in Paragraph 12. That is not engineering; that is Jenga.
Eventually, you hit The Prompt Engineering Ceiling.
Your system prompt is now 2,000 tokens long.
Cost: It taxes every single API call.
Latency: Time-to-first-token spikes.
Reliability: The model starts suffering from the “Lost in the Middle” phenomenon, ignoring instructions buried in the noise.
You cannot prompt-engineer reliability into a stochastic system forever. At some point, you have to stop telling the model what to do, and train it to know what to do.
The Fine-Tuning Gap
Most engineers avoid fine-tuning because it feels like “Science.” It feels like you need GPUs, PyTorch knowledge, and a dedicated MLOps team.
But in 2025, the bottleneck for fine-tuning isn’t Compute (OpenAI and Anthropic have solved that via API). The bottleneck is Data.
To fine-tune gpt-4o-mini to handle your specific edge cases, you don’t need a GPU cluster. You need a .jsonl file containing 50-100 perfect examples of the behavior you want.
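To make the target concrete, here is a minimal sketch of one such training example, assuming OpenAI's chat fine-tuning format (one JSON object per line). The company name, question, and URL are invented for illustration; only the structure matters.

# A sketch of one training example in OpenAI's chat fine-tuning format.
# The content is illustrative; the structure is what the API expects.
import json

example = {
    "messages": [
        {"role": "system", "content": "You are the support agent for Acme."},
        {"role": "user", "content": "Where can I read your refund policy?"},
        # The "ideal" correction: a real page instead of a hallucinated URL.
        {"role": "assistant", "content": "You can find it at https://acme.example/refunds."},
    ]
}

with open("fine_tuning_data.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")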
Creating that file is the nightmare. It requires you to:
Dig through production logs.
Find the failure cases (the “Confusing Moments”).
Manually write the “ideal” correction.
Format it into the strict JSONL schema the fine-tuning API expects.
This manual friction is the only reason most teams are still stuck in prompt engineering hell.
The “Active” Data Engine
We built Steer to solve the reliability problem using Verification (stopping errors before they happen). But we realized that Verification is actually the perfect wedge for Data Collection.
If you are already catching failures in production using Verifiers, you are sitting on a goldmine of training data. You just need to capture it.
We added a Data Engine to Steer (v0.2) to close this loop.
The Workflow:
Capture: The SDK catches a failure (e.g., a PII leak or a hallucinated URL).
Teach: Instead of digging through logs, you click “Teach” in the local dashboard and define the fix (e.g., “Enforce Strict JSON”). Steer captures this human signal.
Export: The system automatically converts those interactions into the exact JSONL format required for fine-tuning.
# One command to turn your bug reports into training data
steer export --format openai --out fine_tuning_data.jsonl
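From there, training is a short script rather than an MLOps project. A minimal sketch, assuming the openai Python SDK (v1+); the gpt-4o-mini snapshot name is the fine-tunable one at the time of writing and may change.

# Upload the exported data and start a fine-tuning job (sketch, openai SDK v1+).
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("fine_tuning_data.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # fine-tunable snapshot; check current docs
)
print(job.id)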
From 50 Rules to Zero
The goal of this architecture is to delete your system prompt.
Instead of a 2,000-token prompt filled with “DO NOT do X” rules, you fine-tune a small, fast model on the data collected from your verifiers. The model learns the behavior natively. It stops making the mistake not because you told it to, but because it learned not to.
Latency: Down (smaller prompt).
Cost: Down (fewer input tokens).
Reliability: Up (behavior is baked in, not injected).
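Concretely, the end state looks something like this. A hedged sketch assuming an OpenAI-hosted fine-tune; the ft: model ID is hypothetical and comes from your own fine-tuning job.

# Calling the fine-tuned model with a one-line system prompt (sketch).
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:acme::abc123",  # hypothetical fine-tuned model ID
    messages=[
        {"role": "system", "content": "You are Acme's support agent."},  # no 2,000-token monolith
        {"role": "user", "content": "What's the weather in Berlin today?"},
    ],
)
print(response.choices[0].message.content)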
This is the shift from AI-Enhanced (better prompts) to AI-Native (better models).
Stop writing more rules. Start collecting better data.
Steer is open source: github.com/imtt-dev/steer

The Jenga metaphor is perfect. Watched this exact pattern play out last year when our team's agent prompt hit 4k tokens and every fix broke something downstream. The shift from instruction to training data makes total sense, but the real insight here is that verification failures are actually your training dataset. Most teams treat bugs as incidents to close, not as signal to learn from. Capturing those teach moments inline instead of digging through logs later removes the actual bottleneck.