Developer
Detailed tutorial for benchmark developers.
Build your benchmark with BenchFlow
We provide benchmark developers with two interfaces for interacting with the intelligence (the agent under evaluation): BenchClient and BaseBench.
BenchClient
BenchClient turns your benchmark into a client, enabling seamless interaction with the intelligence over HTTP. It should be embedded in your evaluation entry point.
BaseBench
BaseBench is an interface for running, managing, and displaying benchmark results. All benchmark outputs are unified, enabling standardized visualization on BenchFlow.
Install the benchflow SDK
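Assuming the SDK is published on PyPI under the name benchflow, installation is a single pip command:

```bash
pip install benchflow
```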
Make your benchmark a client
We walk through the entire benchmark onboarding process, using the integration of MMLU-Pro into BenchFlow as an example.
Import BenchClient
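A minimal import might look like the following, assuming BenchClient is exported from the top-level benchflow package:

```python
from benchflow import BenchClient
```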
Extend BenchClient
You need to implement two methods:
- parse_input defines the structure of the data provided by the benchmark and returns a dictionary.
- parse_response parses the raw response from the agent into a structured dictionary.
The intelligence_url is the address used for communicating with the agent. Your evaluation script should provide an argument to accept this URL; we explain the details in subsequent steps.
The task_step_inputs provided by MMLU-Pro consists of four fields:

    "prompt": prompt,
    "input_text": input_text,
    "entry": single_question,
    "cot_examples_dict": cot_examples_dict
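Putting this together, a subclass for MMLU-Pro might look roughly like the sketch below. The class name MMLUProClient and the exact constructor and method signatures are assumptions for illustration; only parse_input, parse_response, and the intelligence_url argument come from the description above.

```python
from typing import Any, Dict

from benchflow import BenchClient


class MMLUProClient(BenchClient):
    """Illustrative BenchClient subclass for MMLU-Pro (signatures are assumed)."""

    def __init__(self, intelligence_url: str):
        # The address of the intelligence endpoint, supplied by BenchFlow.
        super().__init__(intelligence_url)

    def parse_input(self, raw_step_inputs: Dict[str, Any]) -> Dict[str, Any]:
        # Forward the four MMLU-Pro fields documented above to the intelligence.
        return {
            "prompt": raw_step_inputs["prompt"],
            "input_text": raw_step_inputs["input_text"],
            "entry": raw_step_inputs["entry"],
            "cot_examples_dict": raw_step_inputs["cot_examples_dict"],
        }

    def parse_response(self, raw_response: str) -> Dict[str, Any]:
        # Reduce the agent's raw text to the fields the grader needs,
        # e.g. the predicted answer.
        return {"answer": raw_response.strip()}
```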
Benchmark developers should clearly document in the README (model card) the keys of the provided input data and their meanings. This is essential for intelligence developers running the benchmark.
Get a response with get_response
Use get_response, provided by BenchClient, to get a response from the intelligence. This method first calls parse_input, then sends the parsed input to the intelligence. After receiving a response, it calls parse_response and returns its result.
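In the evaluation script, the call might look like the sketch below; run_single_task and its arguments are placeholders, and MMLUProClient refers to the sketch above.

```python
from typing import Any, Dict


def run_single_task(intelligence_url: str, task_step_inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Send one MMLU-Pro task to the intelligence and return the parsed reply."""
    client = MMLUProClient(intelligence_url)
    # get_response calls parse_input, sends the parsed payload to the
    # intelligence over HTTP, and returns the dict built by parse_response.
    return client.get_response(task_step_inputs)
```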
Ensure your script can retrieve the intelligence_url field from the command line. This field will be provided by BenchFlow via the INTELLIGENCE_URL environment variable.
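One way to wire this up (an illustration, not a requirement of BenchFlow) is an argparse flag whose default falls back to the environment variable:

```python
import argparse
import os

parser = argparse.ArgumentParser()
# BenchFlow injects INTELLIGENCE_URL into the container's environment;
# the flag lets you override it when debugging locally (the local default
# here is arbitrary).
parser.add_argument(
    "--intelligence_url",
    default=os.environ.get("INTELLIGENCE_URL", "http://localhost:8000"),
)
args = parser.parse_args()
client = MMLUProClient(args.intelligence_url)
```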
Containerize your benchmark
Package your benchmark as an image and provide an entry point to run the benchmark.
All parameters will be passed to the container as environment variables. When extending BaseBench, you can specify the names of the required and optional environment variables, allowing you to retrieve arguments directly from the environment in your script.
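Inside the container, the evaluation entry point can then read those parameters directly from the environment. In the sketch below, INTELLIGENCE_URL and TEST_START_ID are the variables described in this tutorial, while OPENAI_API_KEY is only an example of a required argument:

```python
import os

# Parameters declared via BenchArgs (see below) arrive as environment variables.
intelligence_url = os.environ["INTELLIGENCE_URL"]
start_id = int(os.environ.get("TEST_START_ID", "0"))
api_key = os.environ.get("OPENAI_API_KEY", "")
```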
Run Your Benchmarks
Import BaseBench, BenchArgs and BenchmarkResults
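Assuming these classes are exported from the top-level benchflow package, the imports would be:

```python
from benchflow import BaseBench, BenchArgs, BenchmarkResults
```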
Implement your BaseBench
BenchArgs uses pydantic to automatically verify that the intelligence has supplied all the required parameters. You can also specify default values for optional arguments. Additionally, BenchFlow provides an environment variable called INTELLIGENCE_URL, which your evaluation script can read alongside all the other defined environment variables.
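A sketch of the argument declaration follows; the exact schema accepted by BenchArgs is an assumption based on the description above (required names, plus optional names with default values):

```python
from benchflow import BenchArgs

# Hypothetical declaration: required arguments must be supplied by the
# intelligence developer, optional ones fall back to the defaults given here.
args = BenchArgs({
    "required": ["OPENAI_API_KEY"],
    "optional": [{"TEST_START_ID": "0"}],
})
```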
Use get_image_name to return the name of the image you uploaded to Docker Hub.
Return the path within the container where the benchmark results are stored.
Return the path within the container where the benchmark logs are stored.
Return all the task_ids for your benchmark.
Many benchmarks do not include a task_id field. In such cases, you can either pass a line number or treat the entire benchmark as a single task. The classification of task IDs is flexible and is primarily used for parallel processing.
The task_id field is also passed to the evaluation environment as an environment variable named TEST_START_ID.
Parse and return your benchmark result
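A sketch of how the pieces above might fit together in benchflow_interface.py is shown below. Apart from get_image_name, every method name, signature, and BenchmarkResults field is a hypothetical placeholder for the corresponding hook described in this section:

```python
import json
import os
from typing import Any, Dict

from benchflow import BaseBench, BenchArgs, BenchmarkResults


class MMLUProBench(BaseBench):
    """Illustrative BaseBench implementation (hook names are assumed)."""

    def get_args(self, task_id: str) -> BenchArgs:
        # Required and optional environment variables for the container
        # (constructor shape is an assumption, as noted above).
        return BenchArgs({
            "required": ["OPENAI_API_KEY"],
            "optional": [{"TEST_START_ID": task_id}],
        })

    def get_image_name(self) -> str:
        # The image you uploaded to Docker Hub.
        return "yourdockerhubuser/mmlu-pro-benchmark:latest"

    def get_results_dir_in_container(self) -> str:
        # Path inside the container where benchmark results are written.
        return "/app/results"

    def get_log_files_dir_in_container(self) -> str:
        # Path inside the container where benchmark logs are written.
        return "/app/logs"

    def get_all_tasks(self, split: str) -> Dict[str, Any]:
        # MMLU-Pro has no native task_id, so treat the whole run as one task.
        return {"task_ids": ["0"], "error_message": None}

    def get_results(self, task_id: str) -> BenchmarkResults:
        # Parse the results file produced by the evaluation script into the
        # unified structure BenchFlow expects. Both the file layout and the
        # BenchmarkResults fields here are assumptions.
        results_path = os.path.join("results", f"{task_id}.json")
        with open(results_path) as f:
            raw = json.load(f)
        return BenchmarkResults(
            is_resolved=raw.get("is_resolved", False),
            score=raw.get("accuracy", 0.0),
            message=raw.get("message", ""),
        )
```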
Upload your benchmark to Benchmark Hub
Here’s your checklist:
- benchflow_interface.py – your implementation of BaseBench should live in this file; make sure the file is named exactly this.
- readme.md – clearly document the keys of the provided input data and their meanings, as well as all required and optional arguments for your benchmark.