
Prompt Engineering for ChatGPT and GPT-4 (Practical Guide)


So here’s the thing. When I first wrote this post in early 2023, GPT-4 had been out for ten days and “prompt engineer” was a job title with a six-figure salary. Almost two years later, the salary inflation cooled off, the models got dramatically better, and most of what I called “prompt engineering” in that first version turned out to be either common sense or scaffolding the model now does for you.

This is the rewrite. The patterns that actually still matter in 2025, the ones that don’t, and the small handful of techniques that have outlived two model generations. I use Claude and GPT-4 daily as part of my actual job, so this is what works for me, not a cargo cult.


What changed since 2023: models got much better at inferring intent, so verbose “you are an expert…” prefixes mostly stopped helping. Long-context windows (200K tokens and beyond) made retrieval-augmented prompting trivial. Tool use and structured output became reliable, which moved the discipline from “magic incantation” to “API design”.

What prompt engineering actually is in 2025

The 2023 definition was “the discipline of crafting natural-language inputs to LLMs to get the output you want”. That’s still true, but the surface area has narrowed. Three things matter today, and the rest is mostly noise that the model has gotten good enough to silently absorb.

The dirty secret of frontier LLMs in 2025 is that they’re surprisingly forgiving. You can write a sloppy, ungrammatical, abbreviated prompt and still get a useful answer 90% of the time. The discipline isn’t about avoiding bad prompts so much as recognising when you’ve hit the 10% case and need to be precise. The patterns below are what I reach for when the easy path doesn’t produce the output I want, not what I do for every interaction.

Be specific about the output shape. Models will give you what you ask for if you describe it. Want JSON with three keys? Say so, give an example, and use the model’s structured-output mode if it has one. Want a code block in Python with type hints? Say so. The single biggest improvement in my outputs came from spending an extra 30 seconds describing the format.
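
Concretely, here's the shape-first version of a summarisation prompt (the key names are illustrative, pick whatever your downstream code expects):

Summarise this support ticket as JSON with exactly three keys:
"summary" (one sentence), "severity" (one of "low", "medium", "high"),
and "next_action" (one imperative sentence).

Ticket: [paste ticket here]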

Show, don’t just tell. A single concrete example in the prompt outperforms a paragraph of abstract instructions, every time. “Translate this to French: ‘I’d like a coffee.’ → ‘Je voudrais un café.’” works better than “translate the following text from English to French in a casual register”.

Iterate against the actual output. Modern LLM workflows are loops, not one-shots. Write a prompt, look at the result, find the part that’s wrong, edit the prompt, run again. Three iterations beat any amount of pre-planning.

The rest of this post is the patterns inside those three principles.

Patterns that still work in 2025

These four patterns survive across model generations and are worth keeping in muscle memory.

Few-shot examples. Give the model 2-5 examples of the input/output shape you want. The model figures out the pattern from the examples without you explaining it. This is the single highest-leverage technique in the whole post; if you only remember one thing, remember this. The reason is that examples carry information density that prose can’t: they show the format, the tone, the level of detail, and the edge cases all at once. Two good examples often outperform 200 words of “instructions” because the model is fundamentally a pattern-matcher, and you’re handing it a pattern.

Convert the user's free-text query into a SQL WHERE clause.

Examples:
"users named alice"  →  WHERE name = 'alice'
"orders over $100"   →  WHERE total > 100
"posts from last week" →  WHERE created_at > NOW() - INTERVAL '7 days'

Query: orders from california

The model returns WHERE state = 'california' without you having to explain what a SQL WHERE clause is.

Chain-of-thought (CoT), but for math and logic only. For arithmetic or multi-step reasoning, asking the model to “think step by step” still meaningfully improves accuracy. For prose generation or factual questions, CoT mostly adds latency without improving quality. Use it surgically, not by default. Newer “reasoning” models (o1, Claude with extended thinking, DeepSeek R1) bake the CoT into a separate hidden phase, which means you don’t have to ask for it manually anymore; you just enable the reasoning mode at the API level and the model handles the rest.
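
When I do want CoT, the trigger is one line at the end of the prompt. A toy example:

A project has three phases taking 45, 90, and 150 minutes.
If work starts at 9:40, when does it finish?
Think step by step, then give the final answer on its own line.

The model works through the addition (45 + 90 + 150 = 285 minutes, so 9:40 plus 4 hours 45 minutes is 14:25) before committing to an answer, which is where the accuracy gain comes from.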

Role assignment, kept short. “You are a senior Python engineer” still nudges the output toward more idiomatic code, but the long “you are a world-class expert with 20 years of experience” preambles are noise. One sentence is enough. Better still, just put a code example in your prompt and skip the role. The model picks up the right register from the example faster than from any title or credential you assign it. The exception is when you genuinely want a domain-specific tone (legal language, medical writing, ad copy); a one-line role still helps shift register in those cases.

Output format specification with delimiters. When you want JSON, ask for JSON wrapped in <output> tags or in a fenced json code block. Modern models follow these reliably and the wrapper makes parsing the output trivial.

Extract the city, country, and population from this text.
Return as JSON inside <output> tags.

Text: Tokyo is the capital of Japan with about 13.96 million residents.

<output>
{"city": "Tokyo", "country": "Japan", "population": 13960000}
</output>

Text: Mumbai is India's most populous city, home to over 12 million people.

The model continues the pattern, returning a JSON object inside <output> tags for the new input. Use the OpenAI structured-outputs API or Anthropic’s tool-use mode for production, but this delimiter pattern is the universal fallback.

Patterns that stopped earning their keep

These were popular in 2023 and are mostly cargo cult in 2025. I include them because half the prompt engineering courses still being sold on YouTube treat them as load-bearing. They aren’t.

“Take a deep breath and think carefully.” This one trended because a Google paper showed it helped on math benchmarks. Modern models don’t need it, and it adds tokens without improving quality. If the model is hallucinating, the fix is more context, not motivational language. Most of the “magic phrase” tricks from 2023 worked because the underlying models were sensitive to specific token patterns; that brittleness has been smoothed out by post-training. The phrases still don’t hurt, they just don’t help.

Heavy persona prompting. “You are an expert PhD in machine learning with 30 years of experience writing in the style of…” was the 2023 cliché. These long preambles barely change the model’s behaviour compared to a single sentence like “answer like a knowledgeable but concise senior engineer”. Most of the time, the persona is friction.

“Tip the model” / threats / emotional appeals. A dozen viral tweets in 2023 claimed promising the model “I’ll tip $200 if you do a good job” or “this is critical for my career” measurably improved output. Modern post-training has flattened this; the differences are noise. Save your tokens.

Ten-paragraph instructions for trivial tasks. If you’re spending 500 tokens to instruct the model to do a 50-token task, you’ve over-engineered the prompt. Try a single sentence with one example first. Add complexity only when the simple version fails. The corollary: every time the model fails, your first instinct should be to look at the input you gave it, not to add more instructions. Most prompt failures I diagnose for other people come down to “you didn’t give it the information it needed”, not “your wording was wrong”.

A useful sanity check: if you read your prompt out loud and it sounds like a contract, you’ve over-engineered it. Real human-to-human task delegation looks more like “here’s what I want, here’s an example, ping me if anything’s unclear”. Talk to the model the way you’d brief a junior colleague who’s bright but new to the codebase.

Patterns that emerged after 2023

Three new patterns earned a place in my workflow over the last two years. These came from API-level capabilities the providers added, not from “magic words” the community discovered, which is why they actually stick.

Long-context retrieval prompting. With 200K-token context windows, you can paste an entire codebase or document into the prompt and ask questions of it directly. This used to require an embedding pipeline, a vector database, and weeks of plumbing; today it’s a single API call with the document inlined. For any document under 200,000 tokens (about 150,000 words, or a thick novel), skip the vector database entirely and just paste it in. The pattern is “here’s the document, then the question, then the constraints”:

<document>
[paste your full PDF, codebase, or notes here]
</document>

Based only on the document above, answer the question.
If the document doesn't contain the answer, say "not in the document".

Question: What's the total cost in section 4?

The “answer only from the document” guardrail is the load-bearing part. Without it, the model fills gaps with its training data, which is where hallucinations live.

Tool use / function calling. Modern models can call functions you define (“get_weather”, “search_database”, “send_email”) instead of generating text answers. This turns the LLM into a router that picks the right action. For applications where you need real data or real side effects, this beats text-only prompting on every axis. The OpenAI and Anthropic APIs both support this; the Anthropic tool-use docs and OpenAI function-calling guide are the cleanest references I’ve found.
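
A minimal sketch with the OpenAI Python SDK; get_weather is a made-up tool for illustration, and in a real application you'd define and execute your own:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for this sketch
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

# Instead of prose, the model returns a structured tool call;
# you execute it and send the result back in a follow-up message.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)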

Structured outputs. Both OpenAI and Anthropic now support enforced JSON schemas in their APIs. You define a schema, the model returns JSON that’s guaranteed to match it. This eliminates the “the model returned almost-JSON but the trailing comma broke my parser” class of bugs. If you’re calling LLMs from production code, use structured outputs.
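
Here's roughly what the OpenAI flavour looks like with a Pydantic model as the schema, reusing the city-extraction example from earlier (the Anthropic equivalent goes through tool use):

from openai import OpenAI
from pydantic import BaseModel

class CityInfo(BaseModel):
    city: str
    country: str
    population: int

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Extract the city, country, and population: "
                   "Mumbai is India's most populous city, home to over 12 million people.",
    }],
    response_format=CityInfo,
)

info = completion.choices[0].message.parsed  # a CityInfo instance, guaranteed to match the schema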

Prompt caching. A change that landed in 2024 and is now standard: the API will cache prompt prefixes you mark as cacheable. If you’re sending the same 50,000-token system prompt on every request, mark it as cached and pay 10% of the token price after the first hit. For agentic workflows that build long context, this is the difference between “viable in production” and “burning $1,000/day on tokens”. The right place to put cacheable content is at the start of the prompt: tool definitions, system instructions, fixed examples. Variable input goes at the end so the cache prefix stays consistent across requests.
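
Here's roughly what that looks like with the Anthropic SDK's cache_control marker; the prompt strings are placeholders:

import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

LONG_SYSTEM_PROMPT = "..."  # your big, stable prefix: instructions, tool docs, fixed examples
user_query = "..."          # the variable part of the request

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cached after the first request
    }],
    messages=[{"role": "user", "content": user_query}],  # variable input goes last
)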

Worked example: refactoring a Python function

Here’s a real prompt I sent Claude last week. The task: refactor a Python function that fetched and parsed an HTML page. The original was 60 lines of mixed concerns; the refactor is what the model produced after one back-and-forth.

You are reviewing this Python function. Suggest a refactor that:
1. Separates fetching from parsing
2. Adds proper exception handling for HTTP errors
3. Uses type hints throughout
4. Keeps the existing public function signature

Return the full refactored code in a single Python code block.

<original>
def fetch_articles(url):
    import requests
    from bs4 import BeautifulSoup
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    articles = []
    for tag in soup.find_all("article"):
        title = tag.find("h2").text
        link = tag.find("a")["href"]
        articles.append({"title": title, "link": link})
    return articles
</original>

The model returned a clean refactor with a fetch_html function, a parse_articles function, type hints, and an except clause for requests.HTTPError. Total time from prompt to merged code: 3 minutes.

The pattern: numbered constraints (so the model can confirm it hit each one), an XML-tagged input block (so it’s clear what to refactor), an explicit output format (single Python block, easy to copy). No persona prompting, no chain-of-thought, no tipping. The model knows how to write Python.

Two more things worth knowing about prompts for code. First, the model will often invent imports for libraries that don’t exist or have moved. Always run the code (or have your editor’s static analysis run it) before trusting it. Second, when refactoring, paste the surrounding test cases too. The model uses tests as a contract and rarely breaks behaviour the tests pin down. When refactoring code without tests, write a single test case yourself first, paste both into the prompt, and you’ll get a much safer refactor than asking for one cold.
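
In practice the second tip is one more tagged block in the refactor prompt. The test below is a toy, but any pinned behaviour, however small, constrains the model:

<tests>
def test_fetch_articles_returns_title_and_link():
    articles = fetch_articles("https://example.com")
    assert all({"title", "link"} <= set(a) for a in articles)
</tests>

These tests must still pass after the refactor.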

When to skip the LLM entirely

Three years of using these things daily has taught me when an LLM is the wrong tool. The hype cycle pushes “use AI for everything” and that’s actively bad advice; LLMs are powerful but they’re also slow, expensive at scale, and non-deterministic. Knowing when not to use one is half the skill.

Deterministic transformations. If you can write the rule as a regex, AST manipulation, or a SQL query, do that. LLMs cost more, are slower, and occasionally hallucinate. sed, jq, awk, and the rest of the text-manipulation toolbox are still the right answer for “rename every variable” or “extract the third column from a CSV”. A common anti-pattern I see in 2025 startups: someone replaces a 5-line regex with a $0.05-per-call LLM, ships it, and then notices the bill at the end of the month.
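
To make that concrete, here's the "extract the third column from a CSV" task as five deterministic lines of Python (orders.csv is a placeholder filename):

import csv

# No LLM call, no hallucination risk, effectively free at any scale.
with open("orders.csv", newline="") as f:
    third_column = [row[2] for row in csv.reader(f)]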

Anything safety-critical without verification. If the LLM’s wrong answer would cause real harm (medical advice, legal advice, financial decisions, code that runs unattended in production), you need either a human reviewer in the loop or a deterministic verifier downstream. The model being right 99% of the time is unacceptable when the 1% failures matter. The right architecture for these cases is “LLM proposes, deterministic system disposes”, which keeps the LLM’s flexibility while bounding its mistakes.
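
A sketch of what "LLM proposes, deterministic system disposes" can look like when the model generates SQL; the allowlist and the execute_readonly helper are hypothetical:

ALLOWED_TABLES = {"orders", "users"}  # hypothetical allowlist for this sketch

def run_if_safe(sql: str) -> None:
    """Deterministic gate between the model's proposal and the database."""
    normalized = sql.strip().lower()
    if not normalized.startswith("select"):
        raise ValueError("only read-only SELECT statements are allowed")
    if ";" in normalized.rstrip(";"):
        raise ValueError("multiple statements are not allowed")
    if not any(table in normalized for table in ALLOWED_TABLES):
        raise ValueError("query touches no allowed table")
    execute_readonly(sql)  # hypothetical helper that actually runs the query

The model keeps its flexibility on the proposal side; the gate bounds the blast radius on the execution side.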

Tasks the model is provably bad at. As of 2025, frontier models still struggle with: counting characters in a word, doing arithmetic on numbers larger than ~10 digits without tool use, knowing what’s happened in the last 6 months without web search, and reasoning about visual/spatial layouts described in text. If your prompt depends on any of those, expect failures. Some of these gaps will close with future model versions; others (like the tokenisation issue that makes character counting hard) are architectural and may persist for a while.

Real-time, low-latency interactions. Even the fastest models have a 200-500ms latency floor for non-trivial outputs. If your application needs sub-100ms responses, an LLM is not the right path. Cache aggressively, pre-compute likely responses, or use a smaller specialised model fine-tuned for the task. The “stream the tokens as they generate” UI trick hides latency for chat-style products but doesn’t actually reduce it for backend pipelines.

Cost-sensitive batch jobs. Running an LLM over a million records at $0.001 each is a $1,000 job. If the same task can be done with a regex or a small classifier model, do that. LLMs are best when each invocation has high marginal value (one user query, one piece of content), not when you’re brute-forcing volume.

For everything else, especially boilerplate code generation, summarisation, format transformation, brainstorming, and explanation, modern LLMs are genuinely good. The bar isn’t “is the LLM perfect”; it’s “is it faster and good enough that I’d rather review its output than write from scratch”. For most of my daily work, the answer is yes. The skill of using LLMs effectively is mostly knowing where the answer is yes and where it’s no, then having the discipline to use the right tool either way.

Frequently asked questions

ChatGPT vs Claude vs Gemini, which should I use?

Day-to-day for code, I use Claude (Anthropic) most. For creative writing, I jump to GPT-4. For “search the web and summarise” I lean on Gemini or Perplexity. The capability gap between the top three frontier models has narrowed enough that the answer to “which is best?” is often “the one you have the API key for”. For a developer starting today, Claude’s tool use, prompt caching, and JSON-mode reliability make it the easiest path; OpenAI’s ecosystem is the broadest if you want third-party libraries.

How much should I worry about prompt injection?

If your prompt is read-only (just to get an answer for a human to review), prompt injection is a curiosity. If your prompt drives tool calls, web requests, or database writes, prompt injection is a security boundary you have to defend. The OWASP LLM top 10 covers the practical mitigations: don’t put untrusted content where the model can interpret it as instructions, treat tool-call inputs as untrusted, and run actions through human-approval steps for anything destructive.

Do I need to fine-tune a model for my use case?

Almost never, in 2025. Frontier models are good enough at zero-shot or few-shot for nearly every business task. Fine-tuning is worthwhile only when you have thousands of high-quality examples of the exact task and the latency or cost of frontier models is prohibitive. For 99% of “I want the model to behave a certain way” needs, a well-crafted system prompt and few-shot examples beat a fine-tune.

How do I get the model to stop being so verbose?

Tell it directly: “Answer in one sentence” or “Reply with only the code, no explanation.” If it still over-explains, add an example showing the terse format you want. As a last resort, set max_tokens low. The model’s default tone is “be helpful by explaining”; you have to actively suppress that for terse output.

What’s the right way to handle hallucinations?

Three layers, in order of effectiveness. First, give the model the source-of-truth document in its context window so it doesn’t have to make up facts. Second, use tool calls to actually fetch data instead of asking the model to recall it. Third, run cheap downstream verifiers (regex, a second LLM call, a deterministic check) on the output before trusting it. Don’t try to “trust the model harder”; that doesn’t work.

If you’re using LLMs in your dev workflow, my Node.js backend without Express post covers the kind of project where I lean on Claude for boilerplate, and the Linux text-manipulation commands post is the still-relevant alternative when an LLM is overkill.

The skill of using LLMs in 2025 is mostly knowing when not to. The rest is just typing.

Last updated: January 2025