Saad Ullah Bilal — AI Systems Architect

I've watched teams spend three months and $40,000 fine-tuning a model to do something a well-crafted prompt would have handled in two days. I've also watched teams spend six months trying to coax GPT-4 with prompts to do something that fundamentally required fine-tuning from the start.

The largest frontier model delivers roughly 90% of the value of a well-chosen smaller model, at something like ten times the cost. For one-off strategy work, that premium is irrelevant. For workflows that run fifty thousand times a day, it's the difference between a viable product and a line item your CFO circles in red.

Neither is a failure of intelligence — it's a failure of having a clear decision framework. Here's the one I use.

Start with Prompting. Always.

If you haven't exhausted prompt engineering, you're not ready to fine-tune. Prompt engineering is fast (hours to days), cheap (API costs), and reversible. Fine-tuning is slow (weeks), expensive (GPU compute + data labeling), and locks you into a specific model version.

Before considering fine-tuning, try: few-shot examples, chain-of-thought prompting, structured output constraints, system prompt refinement, and prompt chaining. If these don't get you to 80% of your target quality, then start the fine-tuning conversation.

When Small Models Win

Repetitive, bounded tasks

Latency-sensitive workflows

High-volume operations

Well-defined inputs and outputs

Narrow domain expertise

When Frontier Models Win

Open-ended, ambiguous problems

Low-volume, high-stakes work

Strategy, synthesis, investigation

Novel problems with no template

Human-level reasoning required

When Fine-Tuning Actually Wins

Consistent Format Requirements

If every output must follow a precise schema and prompting keeps drifting, fine-tuning locks in the behavior.

Domain-Specific Knowledge

Proprietary terminology, internal processes, or specialized knowledge not in the base model's training data.

Cost & Latency at Scale

A fine-tuned smaller model often outperforms a prompted larger model at 1/10th the cost per call.

Tone & Style

If you need a very specific voice that prompt engineering can't reliably produce at scale.

The Data Problem

The most common reason fine-tuning projects fail isn't the training process — it's the data. You need high-quality input-output pairs that accurately represent what you want the model to do. 500 mediocre examples will produce a mediocre model.

If you can't define the output quality criteria clearly enough to label 500 examples consistently, you're not ready to fine-tune. Get clearer on what 'good' looks like first — then collect data.

The Maturity Move

The frontier model is your senior consultant. Brilliant, expensive, and absolutely not who you call to file routine paperwork. You bring in the consultant for the hard, ambiguous, consequential problems. You don't put them on data entry.

The maturity move in enterprise AI isn't picking the biggest model and feeling reassured. It's building the discipline to ask, task by task, what's the smallest model that does this job well — and reserving your frontier budget for work that genuinely earns it.