from typing import Any, Dict

from benchflow import BenchClient

class YourClient(BenchClient):
    def prepare_input(self, raw_step_inputs: Dict[str, Any]) -> dict:
        # Each benchmark should supply a dataset for each task step as a dict.
        # For example, if your benchmark is a Q&A dataset, the returned
        # dictionary should include at least a "question" key. You can also
        # include any additional fields contained in your dataset.
        ...

    def parse_response(self, raw_response: str) -> Dict[str, Any]:
        ...

# Then you can use your client in your evaluation script.
# For example:
client = YourClient(intelligence_url)
for step in your_task_steps:
    env = {"question": "question text", "hint": "hint"}
    response = client.run_bench(env)
    score = your_eval_method(response)
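As a concrete illustration, here is one way the two methods might be filled in for a Q&A benchmark. The field names (`question`, `hint`, `answer`) and the JSON response format are assumptions made for this sketch, not part of the BenchFlow API; adapt them to your dataset.

```python
import json
from typing import Any, Dict

def prepare_input(raw_step_inputs: Dict[str, Any]) -> dict:
    # Shape one dataset record into the dict handed to the agent.
    # "question" and "hint" are illustrative field names.
    return {
        "question": raw_step_inputs["question"],
        "hint": raw_step_inputs.get("hint", ""),
    }

def parse_response(raw_response: str) -> Dict[str, Any]:
    # Assume the agent replies with a JSON object such as {"answer": "4"};
    # fall back to treating the whole string as the answer.
    try:
        return json.loads(raw_response)
    except json.JSONDecodeError:
        return {"answer": raw_response.strip()}
```

In a real subclass these would be methods on `YourClient`; they are shown as plain functions here so the sketch is self-contained.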
3
Containerize Your Benchmark
Package your benchmark as an image and provide an entry point to run the benchmark.
Please configure your Docker image to target the Linux platform. We plan to support additional platforms in future releases.
4
Extend BaseBench to Run Your Benchmarks
Implement your subclass in benchflow_interface.py and upload it to the Benchmark Hub.
Six methods must be implemented.
Five of them are very simple and can often be implemented with just a return statement; the only one that may take some time is the get_result method.
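Since get_result is the method that usually takes the most effort, here is a hedged sketch of what it might do: read a results file written by the benchmark container and reduce it to a summary dict. The file layout (one JSON object per line with a `correct` field) and the return keys are assumptions for illustration only; match them to whatever your benchmark actually emits and to the signature BaseBench expects.

```python
import json
from typing import Any, Dict

def get_result(results_path: str) -> Dict[str, Any]:
    # Assumed layout: the container writes one JSON object per line,
    # e.g. {"task_id": "1", "correct": true}. Adapt to your benchmark.
    total = 0
    passed = 0
    with open(results_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            total += 1
            if record.get("correct"):
                passed += 1
    return {
        "is_resolved": total > 0 and passed == total,
        "score": passed / total if total else 0.0,
        "message": f"{passed}/{total} tasks passed",
    }
```

The key point is that get_result is where raw container output becomes a structured score; the other five methods typically just return static configuration.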