Building Reliable Prompts

At Brevian, prompts are treated as a first‑class engineering artifact. Every prompt that powers our AI systems is expected to meet the same standards of reliability, observability, and maintainability as any other production software component.

Anupreet Walia
CTO, Co-Founder
Jul 28, 2025
3 min read

Automatic Optimization, Guardrails, and Evals: Building Trustworthy Prompts at Scale


To give a little background: we had a few realizations very quickly after deploying our first prompts to production. First, prompts are fickle and tightly coupled to the underlying model. Second, prompts need to be designed around "what good looks like": LLMs are very good at generating fluent content, even at the expense of facts, and few-shot prompting with examples didn't help, because the models started imitating the pattern of the examples to the point of comical inaccuracy. Third, structured JSON output has come a long way, but you still need to prompt carefully to ensure that the constraints your APIs impose downstream are correctly reflected.
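The third realization is the easiest to make concrete. A minimal sketch of the kind of downstream validation we mean, with hypothetical field names and constraints standing in for a real API contract:

```python
import json

# Hypothetical downstream API contract: the fields and value ranges a
# consuming service requires from the model's structured output.
REQUIRED_FIELDS = {"title": str, "priority": str, "assignee": str}
ALLOWED_PRIORITIES = {"low", "medium", "high"}

def validate_llm_output(raw: str) -> dict:
    """Parse model output and enforce the constraints the API expects."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), ftype):
            raise ValueError(f"missing or mistyped field: {field}")
    if data["priority"] not in ALLOWED_PRIORITIES:
        raise ValueError(f"priority out of range: {data['priority']}")
    return data
```

Even when the model is prompted to emit JSON, a check like this is what catches the cases where a constraint in the prompt was ignored.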

To achieve this, we follow a structured 3‑Pass Rule when designing and deploying prompts. This process combines automatic optimization, guardrails, and continuous evaluation to ensure consistency and correctness over time.

Pass 1: Automatic Prompt Optimization (APO)

We begin with APO — Automatic Prompt Optimization.
A curated test dataset is created that contains a representative set of inputs along with their expected outputs and required formatting rules. The prompt is then iteratively tuned against this dataset, using reinforcement learning, until it meets predefined success criteria across all test cases. This allows us to build prompts that are validated against real product use cases rather than ad‑hoc examples.
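Stripped of the learning machinery, the core loop looks something like the sketch below. `call_llm` and `mutate_prompt` are stand-ins for the model call and the candidate-generation step, not actual Brevian APIs:

```python
def score(prompt, dataset, call_llm):
    """Fraction of test cases where the model output matches expectations."""
    hits = sum(call_llm(prompt, case["input"]) == case["expected"]
               for case in dataset)
    return hits / len(dataset)

def optimize_prompt(prompt, dataset, call_llm, mutate_prompt,
                    threshold=1.0, max_iters=20):
    """Iteratively tune a prompt until it clears the success threshold."""
    best_prompt, best_score = prompt, score(prompt, dataset, call_llm)
    for _ in range(max_iters):
        if best_score >= threshold:
            break
        candidate = mutate_prompt(best_prompt)   # propose a revised prompt
        s = score(candidate, dataset, call_llm)  # evaluate against the dataset
        if s > best_score:                       # keep only improvements
            best_prompt, best_score = candidate, s
    return best_prompt, best_score
```

In a real APO setup the mutation step is driven by a learned policy rather than a fixed heuristic, but the invariant is the same: a candidate prompt only survives if it scores better on the curated dataset.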

Pass 2: Guardrails and Constraints

After the initial optimization, guardrail instructions are added. These explicitly define constraints and expected behaviors, such as:

  • Maintaining output within a specified scope and tone.
  • Handling incomplete or ambiguous inputs by prompting for clarification.
  • Avoiding irrelevant, unsafe, or non‑compliant responses.

This phase ensures that prompts exhibit predictable behavior even under unexpected or edge‑case inputs.
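Guardrails like these live partly in the prompt itself and partly in code around the model call. A minimal post-generation sketch, with illustrative rules that are not Brevian's actual policy set:

```python
# Illustrative guardrail rules (hypothetical, for demonstration only).
BLOCKED_TOPICS = ("medical advice", "legal advice")
CLARIFY_MSG = "Could you clarify your request?"

def apply_guardrails(user_input: str, model_output: str) -> str:
    """Enforce scope and clarification rules around a model response."""
    # Handle incomplete or ambiguous inputs by asking for clarification.
    if len(user_input.split()) < 3:
        return CLARIFY_MSG
    # Keep output within scope: refuse non-compliant topics.
    lowered = model_output.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return "I can't help with that topic."
    return model_output
```

The point is not the specific rules but that they are explicit and testable, so edge-case behavior is a code path rather than a hope.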

Pass 3: Evals and Runtime Observability

Finally, prompts are placed under evals—automated evaluation suites that continuously verify outputs both during CI/CD and in runtime environments. These evaluations:

  • Compare live outputs against expected patterns.
  • Surface exceptions when hallucinations or format deviations are detected.
  • Include dedicated edge‑case inputs to identify regressions or model drift over time.

With this in place, prompts remain stable and trustworthy, and drift or unexpected behavior is detected before it reaches end users.
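An eval in this sense can be as simple as a pattern check over live outputs. A sketch, where the eval cases and the `call_llm` hook are hypothetical:

```python
import re

# Hypothetical eval cases: each live output must match an expected pattern.
EVAL_CASES = [
    {"input": "refund status for order 123",
     "pattern": r"order\s+123"},  # the answer must reference the order
]

def run_evals(call_llm, cases=EVAL_CASES):
    """Return the cases whose outputs deviate from the expected pattern."""
    failures = []
    for case in cases:
        out = call_llm(case["input"])
        if not re.search(case["pattern"], out, re.IGNORECASE):
            failures.append({"input": case["input"], "output": out})
    return failures  # non-empty results surface as alerts in CI or runtime
```

The same suite runs in CI against the curated dataset and in production against sampled live traffic, so a format deviation or hallucination shows up as a failure list rather than a user complaint.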

For engineers accustomed to traditional software systems, the 3‑Pass Rule maps naturally to established engineering disciplines. In classical systems engineering we strive for a system that is observable, testable, and resilient: one that handles expected workloads correctly, fails gracefully under unexpected conditions, and provides clear signals when something is wrong. In practice, we build this with strong validation test suites, well‑defined interfaces, and thorough monitoring with dashboards. We are adapting these principles to prompts, and we see similar results. We solve for:

  1. Model drift: Continuous evaluation highlights subtle shifts in model behavior.
  2. Hallucinations: Outputs are checked against expected patterns, with deviations flagged.
  3. Runtime exceptions: Guardrails and evals catch failure modes before they affect end users.
  4. Edge‑case errors: Dedicated test inputs ensure consistency under stress conditions.
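Model drift in particular is easy to monitor once evals emit pass/fail signals: track a rolling pass rate and alert when it falls below the baseline. A sketch, with hypothetical window and threshold values:

```python
from collections import deque

class DriftMonitor:
    """Flag suspected model drift from a rolling window of eval results."""

    def __init__(self, window=100, baseline=0.95, tolerance=0.05):
        self.results = deque(maxlen=window)  # most recent eval outcomes
        self.baseline = baseline             # expected pass rate
        self.tolerance = tolerance           # allowed dip before alerting

    def record(self, passed: bool) -> bool:
        """Record one eval result; return True if drift is suspected."""
        self.results.append(passed)
        if len(self.results) < self.results.maxlen:
            return False  # wait for a full window before judging
        rate = sum(self.results) / len(self.results)
        return rate < self.baseline - self.tolerance
```

A sustained dip below the threshold is the "subtle shift in model behavior" the continuous evaluation pass is designed to surface.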

Prompt Engineering Practice → Traditional Engineering Equivalent

  • Automatic Prompt Optimization (APO) → End‑to‑End Test Suites
  • Guardrails, Consistency Checks and Constraints → Exception handling & edge‑case protection
  • Evals and Runtime Checks → Observability & monitoring
  • CI Prompt Tests (a subset of the APO dataset) → Unit tests in CI/CD

Prompts are not static artifacts. They evolve alongside models, datasets, and user needs. By applying disciplined engineering practices—APO, guardrails, and evals—we ensure that every prompt deployed at Brevian is rigorously tested, observable in production, and resilient to change.

About the Author

Anupreet Walia is the Chief Technology Officer at Brevian, where she leads AI architecture, strategy, and engineering innovation. Her work focuses on building reliable, production-grade AI systems that bridge research and real-world deployment.


