Choose your role

BenchFlow aims to be the bridge between benchmark users and benchmark developers. Please choose a role to start your BenchFlow journey.

For benchmark users

1. Install the BenchFlow SDK

git clone https://github.com/benchflow-ai/benchflow.git
cd benchflow
pip install benchflow
2. Select your benchmarks

Discover benchmarks tailored to your needs on Benchmark Hub.

3. Implement your call_api

Extend the BaseAgent interface. call_api is where you call your intelligence (an LLM, an agent, and so on).

YourAgent.py
from benchflow import BaseAgent

class YourAgent(BaseAgent):
  def call_api(self, task_step_inputs):
    ...
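For illustration, here is a minimal sketch of a call_api backed by the OpenAI chat completions API. The "prompt" key read from task_step_inputs and the model name are assumptions for this example; the actual input fields depend on the benchmark you selected and are documented in its readme.

import os

from openai import OpenAI
from benchflow import BaseAgent

class YourAgent(BaseAgent):
  def call_api(self, task_step_inputs):
    # Hypothetical example: the "prompt" key is an assumption; check the
    # selected benchmark's readme for the actual input fields.
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    response = client.chat.completions.create(
      model="gpt-4o-mini",
      messages=[{"role": "user", "content": task_step_inputs["prompt"]}],
    )
    # Return the raw text; the benchmark client parses it on its side.
    return response.choices[0].message.content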
4. Run your benchmark

Run your benchmark in a separate script.

Kickstart your free BenchFlow trial on BenchFlow.ai to unlock benchmarking insights.

import os

from benchflow import load_benchmark
from YourAgent import YourAgent

benchmark_name = "organization/selected_benchmark"
bf_token = os.getenv("BF_TOKEN")
bench = load_benchmark(benchmark_name, bf_token)

your_agent = YourAgent()

run_ids = bench.run(
    task_ids=[0],                   # IDs of the tasks to run
    agents=your_agent,              # Your agent
    api={                           # Your provider API configuration
        "provider": "",
        "model": "",
        "OPENAI_API_KEY": "",
    },
    requirements_txt="requirements.txt",  # Extra dependencies
    args={}                         # Arguments for your benchmarks
)

results = bench.get_results(run_ids)

For benchmark developers

1. Install the BenchFlow SDK

pip install benchflow
2. Make your benchmark a client

from typing import Any, Dict

from benchflow import BenchClient

class YourClient(BenchClient):
  def prepare_input(self, raw_step_inputs: Dict[str, Any]) -> dict:
    # Each benchmark supplies the data for a task step as a dict.
    # For example, if your benchmark is a Q&A dataset, the returned dictionary should include at least a "question" key.
    # You can also include any additional fields contained in your dataset.
    ...

  def parse_response(self, raw_response: str) -> Dict[str, Any]:
    ...
    
# Then you can use your Client in your evaluation script
# For example:
client = YourClient(intelligence_url)

responses = []
for step in your_task_steps:
  env = {"question": step["question"], "hint": step.get("hint", "")}
  responses.append(client.run_bench(env))

score = your_eval_method(responses)
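For example, a minimal Q&A client might look like the sketch below; the "question" and "hint" input fields and the "answer" output key belong to this hypothetical dataset and are not required by BenchClient.

from typing import Any, Dict

from benchflow import BenchClient

class QAClient(BenchClient):
  def prepare_input(self, raw_step_inputs: Dict[str, Any]) -> dict:
    # Forward the fields the agent needs for this step ("question" and
    # "hint" are specific to this hypothetical dataset).
    return {
      "question": raw_step_inputs["question"],
      "hint": raw_step_inputs.get("hint", ""),
    }

  def parse_response(self, raw_response: str) -> Dict[str, Any]:
    # The agent replies with plain text; wrap it in the dict shape your
    # evaluation code expects (the "answer" key is an assumption).
    return {"answer": raw_response.strip()}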
3. Containerize your benchmark

Package your benchmark as an image and provide an entry point to run the benchmark.

Please configure your Docker image to target the Linux platform. We plan to support additional platforms in future releases.
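For example, assuming a Dockerfile at your benchmark's repository root, you can build a Linux image with something like the following (the image name is a placeholder; use the same name you will later return from get_image_name):

docker build --platform linux/amd64 -t your-org/your-benchmark:latest .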

4. Extend BaseBench to run your benchmarks

Implement your subclass in benchflow_interface.py and upload it to Benchmark Hub. There are six methods to implement.

Five of them are very simple and can often be implemented with just a return statement. The only one that might take some time is the get_result method (a sketch follows the interface below).

from typing import Any, Dict

from benchflow import BaseBench
from benchflow.schemas import BenchArgs, BenchmarkResult

class YourBench(BaseBench):
  def get_args(self, task_id) -> BenchArgs:
     ...
  def get_image_name(self) -> str:
     ...
  def get_results_dir_in_container(self) -> str:
     ...
  def get_log_files_dir_in_container(self) -> str:
     ...
  def get_result(self, task_id: str) -> BenchmarkResult:
     ...
  def get_all_tasks(self, split: str) -> Dict[str, Any]:
     ...
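As a sketch of get_result, assume the container writes one JSON file per task into the results directory. The file layout and the BenchmarkResult fields shown below are assumptions for illustration only; check benchflow.schemas in your installed version for the exact names.

import json

from benchflow import BaseBench
from benchflow.schemas import BenchmarkResult

class YourBench(BaseBench):
  # ... the other five methods as above ...

  def get_result(self, task_id: str) -> BenchmarkResult:
    # Hypothetical layout: one ./results/<task_id>.json file per task,
    # written by the container's entry point.
    with open(f"./results/{task_id}.json") as f:
      data = json.load(f)
    # The keyword arguments below are illustrative assumptions about
    # BenchmarkResult; consult benchflow.schemas for the real fields.
    return BenchmarkResult(
      task_id=task_id,
      is_resolved=data.get("is_resolved", False),
      metrics={"score": data.get("score", 0.0)},
      log={"details": str(data)},
      other={},
    )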
5. Upload your benchmark to Benchmark Hub

Here’s your checklist:

  1. benchflow_interface.py – ensure your file is named correctly.

  2. readme.md – this should document the input field formats provided by the prepare_input method from Step 2, along with detailed descriptions.

Explore the detailed integration process.