"Evals" refers to evaluating a model's performance for a specific application.
Warning
Unlike unit tests, evals are an emerging art/science; anyone who claims to know for sure exactly how your evals should be defined can safely be ignored.
Pydantic Evals is a powerful evaluation framework designed to help you systematically test and evaluate the performance and accuracy of the systems you build, especially when working with LLMs.
We've designed Pydantic Evals to be useful while not being too opinionated since we (along with everyone else) are still figuring out best practices. We'd love your feedback on the package and how we can improve it.
In Beta
Pydantic Evals support was introduced in v0.0.47 and is currently in beta. The API is subject to change and the documentation is incomplete.
Installation
To install the Pydantic Evals package, run:
```bash
pip install pydantic-evals
```

or

```bash
uv add pydantic-evals
```
pydantic-evals does not depend on pydantic-ai, but has an optional dependency on logfire if you'd like to
use OpenTelemetry traces in your evals, or send evaluation results to logfire.
```bash
pip install 'pydantic-evals[logfire]'
```

or

```bash
uv add 'pydantic-evals[logfire]'
```
Datasets and Cases
In Pydantic Evals, everything begins with Datasets and Cases:
Case: A single test scenario corresponding to "task" inputs. Can also optionally have a name, expected outputs, metadata, and evaluators.
Dataset: A collection of test cases designed for the evaluation of a specific task or function.
simple_eval_dataset.py
```python
from pydantic_evals import Case, Dataset

case1 = Case(
    name='simple_case',
    inputs='What is the capital of France?',
    expected_output='Paris',
    metadata={'difficulty': 'easy'},
)

dataset = Dataset(cases=[case1])
```
(This example is complete, it can be run "as is")
Evaluators
Evaluators are the components that analyze and score the results of your task when tested against a case.
Pydantic Evals includes several built-in evaluators and allows you to create custom evaluators:
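For example, here is a minimal sketch that attaches the built-in IsInstance evaluator to the dataset from the previous example. It assumes simple_eval_dataset.py from above is importable and that Dataset exposes an add_evaluator method for imperative registration; custom evaluators, and the declarative evaluators kwarg, are shown in the complete example below.

```python
from pydantic_evals.evaluators import IsInstance

from simple_eval_dataset import dataset

# Check that the task output is a string for every case in the dataset.
dataset.add_evaluator(IsInstance(type_name='str'))
```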
The evaluation process involves running a task against all cases in a dataset. Putting the above two examples together, and using the more declarative evaluators kwarg to Dataset:
simple_eval_complete.py
```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext, IsInstance

case1 = Case(
    name='simple_case',
    inputs='What is the capital of France?',
    expected_output='Paris',
    metadata={'difficulty': 'easy'},
)


class MyEvaluator(Evaluator[str, str]):
    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
        if ctx.output == ctx.expected_output:
            return 1.0
        elif (
            isinstance(ctx.output, str)
            and ctx.expected_output.lower() in ctx.output.lower()
        ):
            return 0.8
        else:
            return 0.0


dataset = Dataset(
    cases=[case1],
    evaluators=[IsInstance(type_name='str'), MyEvaluator()],
)


async def guess_city(question: str) -> str:
    return 'Paris'


report = dataset.evaluate_sync(guess_city)
report.print(include_input=True, include_output=True, include_durations=False)
"""
                              Evaluation Summary: guess_city
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Case ID     ┃ Inputs                         ┃ Outputs ┃ Scores            ┃ Assertions ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ simple_case │ What is the capital of France? │ Paris   │ MyEvaluator: 1.00 │ ✔          │
├─────────────┼────────────────────────────────┼─────────┼───────────────────┼────────────┤
│ Averages    │                                │         │ MyEvaluator: 1.00 │ 100.0% ✔   │
└─────────────┴────────────────────────────────┴─────────┴───────────────────┴────────────┘
"""
```
(This example is complete, it can be run "as is")
Evaluation with LLMJudge
In this example we evaluate a method for generating recipes based on customer orders.
judge_recipes.py
```python
from __future__ import annotations

from typing import Any

from pydantic import BaseModel

from pydantic_ai import Agent, format_as_xml
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance, LLMJudge


class CustomerOrder(BaseModel):
    dish_name: str
    dietary_restriction: str | None = None


class Recipe(BaseModel):
    ingredients: list[str]
    steps: list[str]


recipe_agent = Agent(
    'groq:llama-3.3-70b-versatile',
    output_type=Recipe,
    system_prompt=(
        'Generate a recipe to cook the dish that meets the dietary restrictions.'
    ),
)


async def transform_recipe(customer_order: CustomerOrder) -> Recipe:
    r = await recipe_agent.run(format_as_xml(customer_order))
    return r.output


recipe_dataset = Dataset[CustomerOrder, Recipe, Any](
    cases=[
        Case(
            name='vegetarian_recipe',
            inputs=CustomerOrder(
                dish_name='Spaghetti Bolognese',
                dietary_restriction='vegetarian',
            ),
            expected_output=None,
            metadata={'focus': 'vegetarian'},
            evaluators=(
                LLMJudge(
                    rubric='Recipe should not contain meat or animal products',
                ),
            ),
        ),
        Case(
            name='gluten_free_recipe',
            inputs=CustomerOrder(
                dish_name='Chocolate Cake',
                dietary_restriction='gluten-free',
            ),
            expected_output=None,
            metadata={'focus': 'gluten-free'},
            # Case-specific evaluator with a focused rubric
            evaluators=(
                LLMJudge(
                    rubric='Recipe should not contain gluten or wheat products',
                ),
            ),
        ),
    ],
    evaluators=[
        IsInstance(type_name='Recipe'),
        LLMJudge(
            rubric='Recipe should have clear steps and relevant ingredients',
            include_input=True,
            model='anthropic:claude-3-7-sonnet-latest',
        ),
    ],
)

report = recipe_dataset.evaluate_sync(transform_recipe)
print(report)
"""
     Evaluation Summary: transform_recipe
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Case ID            ┃ Assertions ┃ Duration ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩
│ vegetarian_recipe  │ ✔✔✔        │     10ms │
├────────────────────┼────────────┼──────────┤
│ gluten_free_recipe │ ✔✔✔        │     10ms │
├────────────────────┼────────────┼──────────┤
│ Averages           │ 100.0% ✔   │     10ms │
└────────────────────┴────────────┴──────────┘
"""
```
(This example is complete, it can be run "as is")
Saving and Loading Datasets
Datasets can be saved to and loaded from YAML or JSON files.
save_load_dataset_example.py
```python
from pathlib import Path

from judge_recipes import CustomerOrder, Recipe, recipe_dataset

from pydantic_evals import Dataset

recipe_transforms_file = Path('recipe_transform_tests.yaml')
recipe_dataset.to_file(recipe_transforms_file)
print(recipe_transforms_file.read_text())
"""
# yaml-language-server: $schema=recipe_transform_tests_schema.json
cases:
- name: vegetarian_recipe
  inputs:
    dish_name: Spaghetti Bolognese
    dietary_restriction: vegetarian
  metadata:
    focus: vegetarian
  evaluators:
  - LLMJudge: Recipe should not contain meat or animal products
- name: gluten_free_recipe
  inputs:
    dish_name: Chocolate Cake
    dietary_restriction: gluten-free
  metadata:
    focus: gluten-free
  evaluators:
  - LLMJudge: Recipe should not contain gluten or wheat products
evaluators:
- IsInstance: Recipe
- LLMJudge:
    rubric: Recipe should have clear steps and relevant ingredients
    model: anthropic:claude-3-7-sonnet-latest
    include_input: true
"""

# Load dataset from file
loaded_dataset = Dataset[CustomerOrder, Recipe, dict].from_file(recipe_transforms_file)

print(f'Loaded dataset with {len(loaded_dataset.cases)} cases')
#> Loaded dataset with 2 cases
```
(This example is complete, it can be run "as is")
Parallel Evaluation
You can control concurrency during evaluation (this might be useful to prevent exceeding a rate limit):
parallel_evaluation_example.py
```python
import asyncio
import time

from pydantic_evals import Case, Dataset

# Create a dataset with multiple test cases
dataset = Dataset(
    cases=[
        Case(
            name=f'case_{i}',
            inputs=i,
            expected_output=i * 2,
        )
        for i in range(5)
    ]
)


async def double_number(input_value: int) -> int:
    """Function that simulates work by sleeping briefly before returning double the input."""
    await asyncio.sleep(0.1)  # Simulate work
    return input_value * 2


# Run evaluation with unlimited concurrency
t0 = time.time()
report_default = dataset.evaluate_sync(double_number)
print(f'Evaluation took less than 0.3s: {time.time() - t0 < 0.3}')
#> Evaluation took less than 0.3s: True

report_default.print(include_input=True, include_output=True, include_durations=False)
"""
 Evaluation Summary: double_number
┏━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ Case ID  ┃ Inputs ┃ Outputs ┃
┡━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ case_0   │ 0      │ 0       │
├──────────┼────────┼─────────┤
│ case_1   │ 1      │ 2       │
├──────────┼────────┼─────────┤
│ case_2   │ 2      │ 4       │
├──────────┼────────┼─────────┤
│ case_3   │ 3      │ 6       │
├──────────┼────────┼─────────┤
│ case_4   │ 4      │ 8       │
├──────────┼────────┼─────────┤
│ Averages │        │         │
└──────────┴────────┴─────────┘
"""

# Run evaluation with limited concurrency
t0 = time.time()
report_limited = dataset.evaluate_sync(double_number, max_concurrency=1)
print(f'Evaluation took more than 0.5s: {time.time() - t0 > 0.5}')
#> Evaluation took more than 0.5s: True

report_limited.print(include_input=True, include_output=True, include_durations=False)
"""
 Evaluation Summary: double_number
┏━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ Case ID  ┃ Inputs ┃ Outputs ┃
┡━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ case_0   │ 0      │ 0       │
├──────────┼────────┼─────────┤
│ case_1   │ 1      │ 2       │
├──────────┼────────┼─────────┤
│ case_2   │ 2      │ 4       │
├──────────┼────────┼─────────┤
│ case_3   │ 3      │ 6       │
├──────────┼────────┼─────────┤
│ case_4   │ 4      │ 8       │
├──────────┼────────┼─────────┤
│ Averages │        │         │
└──────────┴────────┴─────────┘
"""
```
(This example is complete, it can be run "as is")
OpenTelemetry Integration
Pydantic Evals integrates with OpenTelemetry for tracing.
The EvaluatorContext includes a span_tree property that returns a SpanTree, which lets you query and analyze the spans generated while the task function ran. This gives evaluators access to the results of instrumentation during evaluation.
Note
If you just want to write unit tests that assert specific spans are produced during calls to your evaluation task, it's usually better to use the logfire.testing.capfire fixture directly.
There are two main ways this is useful: asserting that particular spans were (or were not) produced, and deriving scores from the timing and structure of the recorded spans. The example below does both.
opentelemetry_example.py
```python
import asyncio
from typing import Any

import logfire

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator
from pydantic_evals.evaluators.context import EvaluatorContext
from pydantic_evals.otel.span_tree import SpanQuery

logfire.configure(
    send_to_logfire='if-token-present',  # ensure that an OpenTelemetry tracer is configured
)


class SpanTracingEvaluator(Evaluator[str, str]):
    """Evaluator that analyzes the span tree generated during function execution."""

    def evaluate(self, ctx: EvaluatorContext[str, str]) -> dict[str, Any]:
        # Get the span tree from the context
        span_tree = ctx.span_tree
        if span_tree is None:
            return {'has_spans': False, 'performance_score': 0.0}

        # Find all spans with "processing" in the name
        processing_spans = span_tree.find(lambda node: 'processing' in node.name)

        # Calculate total processing time
        total_processing_time = sum(
            (span.duration.total_seconds() for span in processing_spans), 0.0
        )

        # Check for error spans
        error_query: SpanQuery = {'name_contains': 'error'}
        has_errors = span_tree.any(error_query)

        # Calculate a performance score (faster processing gets a higher score)
        performance_score = 1.0 if total_processing_time < 0.5 else 0.5

        return {
            'has_spans': True,
            'has_errors': has_errors,
            'performance_score': 0 if has_errors else performance_score,
        }


async def process_text(text: str) -> str:
    """Function that processes text with OpenTelemetry instrumentation."""
    with logfire.span('process_text'):
        # Simulate initial processing
        with logfire.span('text_processing'):
            await asyncio.sleep(0.1)
            processed = text.strip().lower()

        # Simulate additional processing
        with logfire.span('additional_processing'):
            if 'error' in processed:
                with logfire.span('error_handling'):
                    logfire.error(f'Error detected in text: {text}')
                    return f'Error processing: {text}'
            await asyncio.sleep(0.2)
            processed = processed.replace(' ', '_')

        return f'Processed: {processed}'


# Create test cases
dataset = Dataset(
    cases=[
        Case(
            name='normal_text',
            inputs='Hello World',
            expected_output='Processed: hello_world',
        ),
        Case(
            name='text_with_error',
            inputs='Contains error marker',
            expected_output='Error processing: Contains error marker',
        ),
    ],
    evaluators=[SpanTracingEvaluator()],
)

# Run evaluation - spans are automatically captured since logfire is configured
report = dataset.evaluate_sync(process_text)

# Print the report
report.print(include_input=True, include_output=True, include_durations=False)
"""
                                              Evaluation Summary: process_text
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Case ID         ┃ Inputs                ┃ Outputs                                 ┃ Scores                   ┃ Assertions ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ normal_text     │ Hello World           │ Processed: hello_world                  │ performance_score: 1.00  │ ✔✗         │
├─────────────────┼───────────────────────┼─────────────────────────────────────────┼──────────────────────────┼────────────┤
│ text_with_error │ Contains error marker │ Error processing: Contains error marker │ performance_score: 0     │ ✔✔         │
├─────────────────┼───────────────────────┼─────────────────────────────────────────┼──────────────────────────┼────────────┤
│ Averages        │                       │                                         │ performance_score: 0.500 │ 75.0% ✔    │
└─────────────────┴───────────────────────┴─────────────────────────────────────────┴──────────────────────────┴────────────┘
"""
```
(This example is complete, it can be run "as is")
Generating Test Datasets
Pydantic Evals allows you to generate test datasets using LLMs with generate_dataset.
Datasets can be generated in either JSON or YAML format. In both cases, a JSON schema file is generated alongside the dataset and referenced from it, so you should get type checking and auto-completion in your editor.
generate_dataset_example.py
```python
from __future__ import annotations

from pathlib import Path

from pydantic import BaseModel, Field

from pydantic_evals import Dataset
from pydantic_evals.generation import generate_dataset


class QuestionInputs(BaseModel, use_attribute_docstrings=True):
    """Model for question inputs."""

    question: str
    """A question to answer"""
    context: str | None = None
    """Optional context for the question"""


class AnswerOutput(BaseModel, use_attribute_docstrings=True):
    """Model for expected answer outputs."""

    answer: str
    """The answer to the question"""
    confidence: float = Field(ge=0, le=1)
    """Confidence level (0-1)"""


class MetadataType(BaseModel, use_attribute_docstrings=True):
    """Metadata model for test cases."""

    difficulty: str
    """Difficulty level (easy, medium, hard)"""
    category: str
    """Question category"""


async def main():
    dataset = await generate_dataset(
        dataset_type=Dataset[QuestionInputs, AnswerOutput, MetadataType],
        n_examples=2,
        extra_instructions="""
        Generate question-answer pairs about world capitals and landmarks.
        Make sure to include both easy and challenging questions.
        """,
    )
    output_file = Path('questions_cases.yaml')
    dataset.to_file(output_file)
    print(output_file.read_text())
    """
    # yaml-language-server: $schema=questions_cases_schema.json
    cases:
    - name: Easy Capital Question
      inputs:
        question: What is the capital of France?
      metadata:
        difficulty: easy
        category: Geography
      expected_output:
        answer: Paris
        confidence: 0.95
      evaluators:
      - EqualsExpected
    - name: Challenging Landmark Question
      inputs:
        question: Which world-famous landmark is located on the banks of the Seine River?
      metadata:
        difficulty: hard
        category: Landmarks
      expected_output:
        answer: Eiffel Tower
        confidence: 0.9
      evaluators:
      - EqualsExpected
    """
```
(This example is complete, it can be run "as is" — you'll need to add asyncio.run(main()) to run main)
You can also write datasets as JSON files:
generate_dataset_example_json.py
```python
from pathlib import Path

from generate_dataset_example import AnswerOutput, MetadataType, QuestionInputs

from pydantic_evals import Dataset
from pydantic_evals.generation import generate_dataset


async def main():
    dataset = await generate_dataset(
        dataset_type=Dataset[QuestionInputs, AnswerOutput, MetadataType],
        n_examples=2,
        extra_instructions="""
        Generate question-answer pairs about world capitals and landmarks.
        Make sure to include both easy and challenging questions.
        """,
    )
    output_file = Path('questions_cases.json')
    dataset.to_file(output_file)
    print(output_file.read_text())
    """
    {
      "$schema": "questions_cases_schema.json",
      "cases": [
        {
          "name": "Easy Capital Question",
          "inputs": {
            "question": "What is the capital of France?"
          },
          "metadata": {
            "difficulty": "easy",
            "category": "Geography"
          },
          "expected_output": {
            "answer": "Paris",
            "confidence": 0.95
          },
          "evaluators": [
            "EqualsExpected"
          ]
        },
        {
          "name": "Challenging Landmark Question",
          "inputs": {
            "question": "Which world-famous landmark is located on the banks of the Seine River?"
          },
          "metadata": {
            "difficulty": "hard",
            "category": "Landmarks"
          },
          "expected_output": {
            "answer": "Eiffel Tower",
            "confidence": 0.9
          },
          "evaluators": [
            "EqualsExpected"
          ]
        }
      ]
    }
    """
```
(This example is complete, it can be run "as is" — you'll need to add asyncio.run(main()) to run main)
Integration with Logfire
Pydantic Evals is implemented using OpenTelemetry to record traces of the evaluation process. These traces contain all
the information included in the terminal output as attributes, but also include full tracing from the executions of the
evaluation task function.
You can send these traces to any OpenTelemetry-compatible backend, including Pydantic Logfire.
All you need to do is configure Logfire via logfire.configure:
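For example, a minimal sketch mirroring the configuration used in the OpenTelemetry example above (pass whatever settings are appropriate for your project):

```python
import logfire

# Send traces to Logfire only when a Logfire token is available
logfire.configure(send_to_logfire='if-token-present')
```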
Logfire has some special integration with Pydantic Evals traces, including a table view of the evaluation results on the evaluation root span (which is generated in each call to Dataset.evaluate), and a detailed view of the inputs and outputs for the execution of each case.
In addition, any OpenTelemetry spans generated during the evaluation process will be sent to Logfire, allowing you to visualize the full execution of the code called during evaluation.
This can be especially helpful when attempting to write evaluators that make use of the span_tree property of the
EvaluatorContext, as described in the
OpenTelemetry Integration section above.
This allows you to write evaluations that depend on information about which code paths were executed during the call to
the task function without needing to manually instrument the code being evaluated, as long as the code being evaluated
is already adequately instrumented with OpenTelemetry. In the case of PydanticAI agents, for example, this can be used
to ensure specific tools are (or are not) called during the execution of specific cases.
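For illustration, here is a minimal sketch of such an evaluator. The get_weather tool name is hypothetical, and the sketch assumes your instrumentation produces tool-call spans whose names contain the tool name; adjust the span query to match the spans your agent actually emits.

```python
from pydantic_evals.evaluators import Evaluator, EvaluatorContext


class CalledWeatherTool(Evaluator[str, str]):
    """Check that a span for the (hypothetical) get_weather tool was recorded."""

    def evaluate(self, ctx: EvaluatorContext[str, str]) -> bool:
        span_tree = ctx.span_tree
        if span_tree is None:
            # No tracer was configured, so no spans were recorded.
            return False
        # Assumes tool-call spans are named after the tool being executed.
        return span_tree.any({'name_contains': 'get_weather'})
```

You could then attach CalledWeatherTool() to the relevant cases via their evaluators argument, just like the evaluators in the earlier examples.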
Using OpenTelemetry in this way also means that all data used to evaluate the task executions will be accessible in
the traces produced by production runs of the code, making it straightforward to perform the same evaluations on
production data.