Building Reliable Prompts

At Brevian, prompts are treated as a first‑class engineering artifact. Every prompt that powers our AI systems is expected to meet the same standards of reliability, observability, and maintainability as any other production software component.

Anupreet Walia
CTO, Co-Founder
Jul 28, 2025
3 min read

Automatic Optimization, Guardrails, and Evals: Building Trustworthy Prompts at Scale


To give a little background: we had a few realizations very quickly after deploying our first prompts to production. First, prompts are fickle and tightly coupled to the underlying model. Second, prompts need to be designed around "what good looks like": LLMs are very good at generating fluent content, even at the expense of facts, and few-shot prompting with examples didn't help, because the models started imitating the pattern of the examples to the point of comical inaccuracy. Third, structured JSON output has come a long way, but you still need to prompt carefully to ensure that the constraints your APIs impose downstream are correctly reflected.
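The third realization is the easiest to make concrete. A minimal sketch of the kind of downstream validation we mean, with hypothetical field names and constraints standing in for a real API contract:

```python
import json

# Hypothetical downstream API contract: the fields and value ranges a
# consuming service requires from the model's structured output.
REQUIRED_FIELDS = {"title": str, "priority": str, "assignee": str}
ALLOWED_PRIORITIES = {"low", "medium", "high"}

def validate_llm_output(raw: str) -> dict:
    """Parse model output and enforce the constraints the API expects."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), ftype):
            raise ValueError(f"missing or mistyped field: {field}")
    if data["priority"] not in ALLOWED_PRIORITIES:
        raise ValueError(f"priority out of range: {data['priority']}")
    return data
```

Even when the model is prompted to emit JSON, a check like this is what catches the cases where a constraint in the prompt was ignored.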

To achieve this, we follow a structured 3‑Pass Rule when designing and deploying prompts. This process combines automatic optimization, guardrails, and continuous evaluation to ensure consistency and correctness over time.

Pass 1: Automatic Prompt Optimization (APO)

We begin with APO — Automatic Prompt Optimization.
A curated test dataset is created that contains a representative set of inputs along with their expected outputs and required formatting rules. The prompt is then iteratively tuned against this dataset, using reinforcement learning, until it meets predefined success criteria across all test cases. This allows us to build prompts that are validated against real product use cases rather than ad‑hoc examples.
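Stripped of the learning machinery, the core loop looks something like the sketch below. `call_llm` and `mutate_prompt` are stand-ins for the model call and the candidate-generation step, not actual Brevian APIs:

```python
def score(prompt, dataset, call_llm):
    """Fraction of test cases where the model output matches expectations."""
    hits = sum(call_llm(prompt, case["input"]) == case["expected"]
               for case in dataset)
    return hits / len(dataset)

def optimize_prompt(prompt, dataset, call_llm, mutate_prompt,
                    threshold=1.0, max_iters=20):
    """Iteratively tune a prompt until it clears the success threshold."""
    best_prompt, best_score = prompt, score(prompt, dataset, call_llm)
    for _ in range(max_iters):
        if best_score >= threshold:
            break
        candidate = mutate_prompt(best_prompt)   # propose a revised prompt
        s = score(candidate, dataset, call_llm)  # evaluate against the dataset
        if s > best_score:                       # keep only improvements
            best_prompt, best_score = candidate, s
    return best_prompt, best_score
```

In a real APO setup the mutation step is driven by a learned policy rather than a fixed heuristic, but the invariant is the same: a candidate prompt only survives if it scores better on the curated dataset.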

Pass 2: Guardrails and Constraints

After the initial optimization, guardrail instructions are added. These explicitly define constraints and expected behaviors, such as:

  • Maintaining output within a specified scope and tone.
  • Handling incomplete or ambiguous inputs by prompting for clarification.
  • Avoiding irrelevant, unsafe, or non‑compliant responses.

This phase ensures that prompts exhibit predictable behavior even under unexpected or edge‑case inputs.
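Guardrails like these live partly in the prompt itself and partly in code around the model call. A minimal post-generation sketch, with illustrative rules that are not Brevian's actual policy set:

```python
# Illustrative guardrail rules (hypothetical, for demonstration only).
BLOCKED_TOPICS = ("medical advice", "legal advice")
CLARIFY_MSG = "Could you clarify your request?"

def apply_guardrails(user_input: str, model_output: str) -> str:
    """Enforce scope and clarification rules around a model response."""
    # Handle incomplete or ambiguous inputs by asking for clarification.
    if len(user_input.split()) < 3:
        return CLARIFY_MSG
    # Keep output within scope: refuse non-compliant topics.
    lowered = model_output.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return "I can't help with that topic."
    return model_output
```

The point is not the specific rules but that they are explicit and testable, so edge-case behavior is a code path rather than a hope.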

Pass 3: Evals and Runtime Observability

Finally, prompts are placed under evals—automated evaluation suites that continuously verify outputs both during CI/CD and in runtime environments. These evaluations:

  • Compare live outputs against expected patterns.
  • Surface exceptions when hallucinations or format deviations are detected.
  • Include dedicated edge‑case inputs to identify regressions or model drift over time.

With this in place, prompts remain stable and trustworthy, and drift or unexpected behavior is detected before it reaches end users.
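An eval in this sense can be as simple as a pattern check over live outputs. A sketch, where the eval cases and the `call_llm` hook are hypothetical:

```python
import re

# Hypothetical eval cases: each live output must match an expected pattern.
EVAL_CASES = [
    {"input": "refund status for order 123",
     "pattern": r"order\s+123"},  # the answer must reference the order
]

def run_evals(call_llm, cases=EVAL_CASES):
    """Return the cases whose outputs deviate from the expected pattern."""
    failures = []
    for case in cases:
        out = call_llm(case["input"])
        if not re.search(case["pattern"], out, re.IGNORECASE):
            failures.append({"input": case["input"], "output": out})
    return failures  # non-empty results surface as alerts in CI or runtime
```

The same suite runs in CI against the curated dataset and in production against sampled live traffic, so a format deviation or hallucination shows up as a failure list rather than a user complaint.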

For engineers accustomed to traditional software systems, the 3‑Pass Rule maps naturally to established engineering disciplines. In classical systems engineering we strive for a system that is observable, testable, and resilient: one that handles expected workloads correctly, fails gracefully under unexpected conditions, and provides clear signals when something is wrong. In practice, we build this with strong validation test suites, well‑defined interfaces, and thorough monitoring with dashboards. We are adapting these principles to prompts, and we see similar results. We solve for:

  1. Model drift: Continuous evaluation highlights subtle shifts in model behavior.
  2. Hallucinations: Outputs are checked against expected patterns, with deviations flagged.
  3. Runtime exceptions: Guardrails and evals catch failure modes before they affect end users.
  4. Edge‑case errors: Dedicated test inputs ensure consistency under stress conditions.
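Model drift in particular is easy to monitor once evals emit pass/fail signals: track a rolling pass rate and alert when it falls below the baseline. A sketch, with hypothetical window and threshold values:

```python
from collections import deque

class DriftMonitor:
    """Flag suspected model drift from a rolling window of eval results."""

    def __init__(self, window=100, baseline=0.95, tolerance=0.05):
        self.results = deque(maxlen=window)  # most recent eval outcomes
        self.baseline = baseline             # expected pass rate
        self.tolerance = tolerance           # allowed dip before alerting

    def record(self, passed: bool) -> bool:
        """Record one eval result; return True if drift is suspected."""
        self.results.append(passed)
        if len(self.results) < self.results.maxlen:
            return False  # wait for a full window before judging
        rate = sum(self.results) / len(self.results)
        return rate < self.baseline - self.tolerance
```

A sustained dip below the threshold is the "subtle shift in model behavior" the continuous evaluation pass is designed to surface.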

Prompt Engineering Practice → Traditional Engineering Equivalent

  • Automatic Prompt Optimization (APO) → End‑to‑End Test Suites
  • Guardrails, Consistency Checks and Constraints → Exception handling & edge‑case protection
  • Evals and Runtime Checks → Observability & monitoring
  • CI Prompt Tests (a subset of the APO dataset) → Unit tests in CI/CD

Prompts are not static artifacts. They evolve alongside models, datasets, and user needs. By applying disciplined engineering practices—APO, guardrails, and evals—we ensure that every prompt deployed at Brevian is rigorously tested, observable in production, and resilient to change.

About the Author

Anupreet Walia is the Chief Technology Officer at Brevian, where she leads AI architecture, strategy, and engineering innovation. Her work focuses on building reliable, production-grade AI systems that bridge research and real-world deployment.


