Mastering AI Engineering: Structured Output with Multiple Candidates
In the current landscape of Applied AI engineering, ensuring consistent, reliable output from LLMs is a critical challenge. In several of our projects, we have had to balance the need for accurate structured outputs with the inherent variability of LLMs. To address this, we have developed a robust solution that leverages structured output and selects the most reliable response from multiple sampled candidates.

Demo Notebook and Dataset
To help you understand and experiment with this technique, we've created a demo notebook that implements our solution using the receipts dataset from Hugging Face. The notebook walks you through the two main parts of our approach:
- How to define and implement the structured output schema.
- How to generate and process multiple response candidates.
The receipts dataset is particularly well-suited for this demonstration because:
- It contains real-world examples of structured data extraction.
- The images are high-quality and well-formatted.
- The dataset is publicly available and easy to access.
You can read the full blog and follow along for a stepwise implementation, or directly dive into the final implementation in the notebook.
Note that we are working with Google Gemini models in the example code, but the core logic works for any LLM provider that supports structured output.
The Challenge of Consistent AI Outputs
LLMs, by their very nature, can produce inconsistent or unpredictable outputs. It is also known that setting the temperature to 0 or constraining the model's output too much degrades output quality[1][2]. However, when building production-grade applications that rely on AI-generated content, this variability can create significant challenges. For critical business applications, like the quote comparison tool that we recently built for one of our clients, we often need reliable, structured data that can be consistently processed by downstream systems.
Our Solution: Structured Output with Response Sampling
Our approach to obtaining more reliable and robust output from LLMs combines two powerful techniques:
- Structured Output using Pydantic Models: We define structured output schemas using Pydantic, providing type safety and validation. This is already common practice in the industry.
- Multiple Candidate Generation with Frequency-Based Selection: We generate multiple outputs and select the most frequent candidate as the 'final' result, using a hashable BaseModel.
This technique uses response sampling to identify the most accurate output. It works on the assumption that the best answer has a higher probability of appearing across multiple generations, even when the LLM doesn't consistently produce it. So, given that there is a best answer, we can find it by selecting the response that occurs the most in our response sample.
Let's examine how you could implement this:
Structured Output with Pydantic
Successful integration of LLMs into your applications usually requires them to generate well-defined, structured outputs. Therefore, we want to define structured schemas for various AI tasks. For example, here's how we define a schema for structuring the extraction of information from receipts:
from pydantic import BaseModel, Field

class ReceiptInfo(BaseModel):
    # Transaction details
    total_amount: float = Field(description="Total amount of the transaction")
    tip: float | None = Field(description="Tip amount if visible")
    tax: float | None = Field(description="Tax amount if visible")
    currency: str | None = Field(description="Three letter code of the currency of the transaction")
By utilizing Pydantic's Field class, we can provide a clear description for each element of our output. These descriptions not only guide human readers of the code; they can also directly guide an AI in generating appropriate values for each field. We can therefore use these model definitions as explicit instructions that help the model understand exactly what information to extract from the documents. Since most LLMs are trained on plenty of Python code, you can directly copy and paste this code into the prompt to guide the LLM towards the correct structured output.
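As a minimal illustration of that idea, here is a hypothetical sketch (not taken from the notebook) that pastes the schema source into the prompt; inspect.getsource simply returns the class definition as a string, assuming the class is importable from a module or a recent notebook environment:

import inspect

# Embed the schema source in the prompt so the field names and descriptions
# double as extraction instructions for the model.
EXTRACTION_PROMPT = (
    "Extract the transaction details from the attached receipt. "
    "Return a JSON object that matches this Pydantic model exactly:\n\n"
    + inspect.getsource(ReceiptInfo)
)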
Generating Multiple Candidates
Even with a relatively strict instruction, the output of the LLM will often vary between queries. In our example, for instance, the LLM might confuse the tip amount with the tax amount. Instead of trying to solve this variability by restricting the LLM even further, we take an alternative approach: we simply query the model multiple times to get a more robust response, relying on the assumption that, with a correct prompt, the model is more likely to output the correct response. With Google's Gemini models, obtaining multiple answers to the same prompt can be done easily and efficiently by setting the candidate_count attribute of the GenerateContentConfig to a number larger than 1.
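As a rough sketch of what such a request could look like (assuming the google-genai SDK; parameter names may differ in other client versions, and the prompt variable comes from the hypothetical snippet above):

from google import genai
from google.genai import types

client = genai.Client()  # assumes GOOGLE_API_KEY is set in the environment

# Request several candidates in a single call; in practice you would also
# attach the receipt image to `contents`.
response = client.models.generate_content(
    model="gemini-2.0-flash-lite",
    contents=EXTRACTION_PROMPT,  # hypothetical prompt from the sketch above
    config=types.GenerateContentConfig(
        candidate_count=5,
        response_mime_type="application/json",
        response_schema=ReceiptInfo,
    ),
)

# Each candidate carries its own JSON string
candidate_texts = [c.content.parts[0].text for c in response.candidates]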
Selecting the Most Frequent Candidate
The core of our approach is selecting the most frequent response among the candidates. Your intuition will tell you that the most Pythonic way to find the most common item in a list is to use the collections.Counter class. However, Pydantic's BaseModel is not hashable by default. Luckily, we can easily implement this:
class ExcludeFieldsBaseModel(BaseModel):
    def __hash__(self):
        return hash(self.model_dump_json())
Note that we use model_dump_json because strings are hashable, unlike the dictionaries that model_dump returns.
Then we update our class from earlier like so:
class ReceiptInfo(ExcludeFieldsBaseModel):
    # ... omitted code from earlier
Here's how we implement candidate selection logic:
from collections import Counter

from pydantic import BaseModel, ValidationError

def parse_response(candidates: list[str], target_schema: type[BaseModel]):
    # Parse and validate each candidate response
    validated_responses = []
    for candidate in candidates:
        try:
            result = target_schema.model_validate_json(candidate)
            validated_responses.append(result)
        except ValidationError:
            # Skip candidates that are not valid JSON or do not match the schema
            pass
    if not validated_responses:
        raise ValueError("No valid candidates found")
    # Count identical responses and pick the most frequent one
    candidate_counts = Counter(validated_responses)
    best_answer, count = candidate_counts.most_common(1)[0]
    print(f"consensus score: {count / len(candidates):.2f}")
    return best_answer
This implementation counts the frequency of each unique response and selects the most common one. It also prints the relative frequency of the most common answer, which can be helpful in determining the quality of the response.
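Tying the pieces together, a usage sketch (reusing the hypothetical candidate_texts from the Gemini snippet above) could be as simple as:

# Pick the most frequent structured response among the sampled candidates.
best_receipt = parse_response(candidate_texts, ReceiptInfo)
print(best_receipt.total_amount, best_receipt.currency)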
The Benefits of This Approach
Our frequency-based candidate selection provides several nice advantages:
- Improved Reliability: By sampling outputs and selecting the most frequent response, we reduce the impact of occasional model hallucinations or errors.
- Quantifiable Confidence: The consensus percentage gives us a metric to assess the confidence in our results.
- Ability to Fail Gracefully: When consensus is low, we can choose to flag the response for human review.
In the demo notebook you will find an example of how a data extraction task can easily be improved by using this method. For instance, you can see that this approach effectively reduces hallucinations, such as the generation of an “imaginary” currency that does not actually appear on the receipts.
Include Reasoning
As we wrote earlier, it is not uncommon to let an LLM reason before supplying an answer to improve quality and reliability. The easiest way to do this is, in our experience, to add a field named chain_of_thought to your output schema, like so:
class ReceiptInfo(ExcludeFieldsBaseModel):
    chain_of_thought: str | None = Field(description="Your reasoning and thought process before providing this answer")
    # ... other fields
It's important to place the chain_of_thought field first in your output schema, as this encourages the LLM to reason before providing its final answer. While some might assume this approach is obsolete with newer reasoning-focused models, it remains valuable for most general-purpose LLMs, since the reasoning becomes part of the context window and better grounds the final answer. Note that some implementations (like Google's) may automatically sort fields alphabetically in structured outputs, in which case you might need to prefix the field name with a character like "0" to ensure it appears first.
Adding the chain_of_thought field to the response, however, introduces a new issue. Look at the following example responses:
test_class1 = ReceiptInfo(
    chain_of_thought="Based on the receipt, I can identify the total amount as 13.000, the tax as 10% and the currency as EUR.",
    total_amount="13000",
    tip=None,
    tax="1300",
    currency="EUR",
)

test_class2 = ReceiptInfo(
    chain_of_thought="Based on the receipt, I can identify the total amount as 13.000, the tax as 1.300 and the currency as EUR.",
    total_amount="13000",
    tip=None,
    tax="1300",
    currency="EUR",
)
The extracted information from the receipt is the same in both responses, but the presented reasoning is not! So, with our current code these responses would not be considered equal, while they should be. To resolve this issue, we need a way to ignore the chain_of_thought field.
There are several solutions to this problem:
1. Lie to your LLM
Since Pydantic's BaseModel.model_validate_json ignores extra keys, you can simply tell your LLM to generate the key without ever actually using it.
2. Exclude the Attribute
You can exclude the attribute in the Field function with Field(..., exclude=True), which will ensure that the attribute is ignored in every model_dump(_json) and __eq__ call. The tricky part here is that it becomes nearly impossible to model_dump that attribute from that point forward, due to Pydantic's exclusion settings. This means that if you want it for logging purposes, you will have to extract it explicitly. While this would work fine in the toy example, it can get a lot trickier when nesting Pydantic models.
3. Use our Implementation of ExcludeFieldsBaseModel
We have built a custom implementation of Pydantic's BaseModel that adds the private property _exclude_fields, where you can define which fields to ignore when comparing two BaseModels. By implementing custom __hash__ and __eq__ methods, we make sure that all nested ExcludeFieldsBaseModels also exclude their respective _exclude_fields, so that they can be used properly in a Counter (or other comparison context).
In our approach, we set a private attribute named _exclude_fields, which we use in the __hash__ and __eq__ calls. Our example class then becomes:
from collections.abc import Iterable

from pydantic import BaseModel

class ExcludeFieldsBaseModel(BaseModel):
    _exclude_fields: Iterable[str] = set()

    def __hash__(self) -> int:
        temporary_hash = 0
        for attr in [*self.model_fields, *self.model_computed_fields]:
            if attr in self._exclude_fields:
                continue
            attr_value = getattr(self, attr)
            # Convert non-hashable objects to tuples to make sure they are hashable.
            # Python handles the underlying object hash, which calls this method again
            # if it is another ExcludeFieldsBaseModel, thereby excluding the fields
            # of the nested objects.
            if isinstance(attr_value, dict):
                attr_value = tuple(attr_value.items())
            elif isinstance(attr_value, list):
                attr_value = tuple(attr_value)
            # The hash-summing makes it order insensitive
            temporary_hash += hash(attr_value)
        # Hash it again to make sure it is a valid hash
        return hash(temporary_hash)

    def __eq__(self, other: "ExcludeFieldsBaseModel") -> bool:
        for attr in [*self.model_fields, *self.model_computed_fields]:
            if attr in self._exclude_fields:
                continue
            if getattr(self, attr) != getattr(other, attr):
                return False
        return True

# ... omitted code in between ...

class ReceiptInfo(ExcludeFieldsBaseModel):
    # ... other fields ...
    _exclude_fields = {"chain_of_thought"}
You can now use the _exclude_fields attribute in each model where you want to exclude fields. Just make sure that every 'parent' class also uses ExcludeFieldsBaseModel, so it correctly hashes and computes equality.
Besides the chain of thought, you can now also exclude other non-deterministic fields from the final comparison and majority vote. For example:
class ReceiptInfo(ExcludeFieldsBaseModel):
    # ... other fields ...
    sample_item: str = Field(description="An example of an item on this receipt.")

    _exclude_fields = {"chain_of_thought", "sample_item"}
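As a quick sanity check (a sketch, using the two example responses from earlier and assuming the ReceiptInfo variant that excludes only chain_of_thought), the responses now collapse into a single Counter entry:

from collections import Counter

# test_class1 and test_class2 differ only in their chain_of_thought text,
# which is listed in _exclude_fields, so they hash and compare as equal.
assert test_class1 == test_class2
assert len(Counter([test_class1, test_class2])) == 1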
Quality and Cost Tradeoff
When implementing the presented multiple candidate approach, an important consideration is balancing model quality, cost, and latency. We've discovered some interesting dynamics in our applications:
- Model Quality vs. Quantity Tradeoff: While higher-quality models like Gemini 1.5 Pro can produce more accurate individual responses, multiple samples from a more cost-effective model can achieve comparable reliability at lower cost. For instance, "gemini-2.0-flash-lite" costs $0.30 per million output tokens versus $5.00 for gemini-1.5-pro, a roughly 16x difference. Crucially, with a multi-candidate request you only pay for the extra output tokens.
- Finding the Optimal Sample Size: Through experimentation, we found that adding more candidates yields diminishing returns in reliability. So, while using fewer samples from a higher-quality model might seem optimal, the cost difference often makes multiple samples from a cheaper model more economical.
- Latency Considerations: When response time matters, running multiple candidates in parallel is crucial. This approach adds minimal latency while significantly improving reliability, though it does increase computational load. Most LLM providers support concurrent requests, making this approach viable for real-time applications.
- Consensus Thresholds: Set an appropriate consensus threshold based on your application's requirements. For critical financial data extraction, we require at least 70% consensus; otherwise, we flag the response for human review (a sketch of such a check follows below). For less critical applications, a lower threshold may be acceptable.
Our recommendation is to begin with a medium-quality model and 5 candidates, then adjust based on your specific needs, cost constraints, and reliability requirements. Track your consensus scores across different document types to identify where the approach works or needs refinement.
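If you want to fold the consensus threshold into the selection step itself, a minimal sketch could look like this (the select_with_threshold name and the 0.7 default are illustrative, and it assumes hashable models like the ones above):

from collections import Counter

from pydantic import BaseModel, ValidationError

def select_with_threshold(
    candidates: list[str],
    target_schema: type[BaseModel],
    threshold: float = 0.7,  # placeholder cutoff; tune per application
):
    # Validate each candidate, count identical responses, and surface the
    # consensus score so low-confidence results can be routed to human review.
    validated = []
    for candidate in candidates:
        try:
            validated.append(target_schema.model_validate_json(candidate))
        except ValidationError:
            pass
    if not validated:
        raise ValueError("No valid candidates found")
    best_answer, count = Counter(validated).most_common(1)[0]
    consensus = count / len(candidates)
    needs_review = consensus < threshold
    return best_answer, consensus, needs_review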
Lessons Learned and Best Practices
Through our work with these techniques, we've identified several best practices:
- Be Specific in Schema Definitions: Use descriptive field names and provide detailed field descriptions to guide the AI effectively.
- Balance Candidate Count with Model Quality: Cheaper, 'less intelligent' models can be used in place of stronger, more expensive models if sampled more often.
- Include Reasoning Fields: Separate fields for reasoning and the actual target help the AI organize its thoughts before supplying the answer you are looking for.
- Log and Monitor Consensus Rates: Track how often candidates agree to identify problematic queries or schemas.
Conclusion
Structured output with multiple candidate selection represents a powerful approach for building more reliable AI applications. By generating multiple candidates and selecting the most frequent response, we improve reliability and confidence in our AI outputs.
These techniques have been central to the success of several solutions we’ve delivered to our clients, enabling processing of complex documents with high reliability and accuracy. As AI continues to evolve, these engineering approaches will remain essential tools for bridging the gap between AI capabilities and production requirements.
By implementing these patterns in your own AI applications, you can significantly improve output quality and reliability, making AI a more practical solution for complex real-world problems.
[1] Holtzman et al., "The Curious Case of Neural Text Degeneration": greedy output sampling can lead to repetitive and low-quality outputs. https://arxiv.org/pdf/1904.09751
[2] Research by TrustGraph found that using temperature 0 doesn't always produce optimal results: https://dev.to/trustgraph/what-does-llm-temperature-actually-mean-18b