Benchmark your intelligence

Intelligence is any AI-related product (e.g., LLMs, AI agents, …).

BenchFlow uses the API provided by your intelligence to help you rapidly build a benchmark pipeline without any benchmark setup.

Install the benchflow SDK

uv add benchflow

Implement the interface for your API

Import the interface for api

You need to implement the BaseAgent interface provided by benchflow.

from benchflow import BaseAgent

Although this interface is named BaseAgent, you don’t need to design an AI agent; simply implementing the call_api method is sufficient. We call it an “agent” because it serves as an agent to invoke your API.
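The required shape can be sketched as follows. Note that this uses a stand-in base class purely for illustration (in real code you subclass benchflow's BaseAgent), and EchoCaller is a hypothetical name:

```python
from typing import Any, Dict


class BaseAgent:
    """Stand-in for benchflow.BaseAgent, for illustration only."""

    def call_api(self, task_step_inputs: Dict[str, Any]) -> str:
        raise NotImplementedError


class EchoCaller(BaseAgent):
    """Hypothetical caller that simply echoes the question back."""

    def call_api(self, task_step_inputs: Dict[str, Any]) -> str:
        # Read whatever keys the benchmark card defines; here, "question".
        return f"You asked: {task_step_inputs['question']}"
```

The only contract is that call_api accepts the benchmark's input dictionary and returns your intelligence's answer as a string.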

Check the benchmark card

Go to the Benchmark Hub and read the benchmark card for the benchmarks you want to test on, especially the task_step_inputs format provided by the benchmark developer.

The task_step_inputs argument is a dictionary passed to the call_api method that contains all the benchmark dataset information. You will need to use it to test your intelligence.

Implement your call_api function

Here is a basic example of testing an OpenAI model on a Q&A benchmark. Suppose the benchmark card specifies that the format of task_step_inputs is {"question": "question text"}.

openai_caller.py
import os
from typing import Any, Dict

from openai import OpenAI
from benchflow import BaseAgent

class YourCaller(BaseAgent):
    def call_api(self, task_step_inputs: Dict[str, Any]) -> str:
        # Build a single-turn chat from the benchmark's question.
        messages = [
            {"role": "user", "content": task_step_inputs["question"]}
        ]
        client = OpenAI(
            api_key=os.getenv("OPENAI_API_KEY"),
        )
        response = client.chat.completions.create(
            messages=messages,
            model="gpt-4o",
            temperature=0.9,
        )
        return response.choices[0].message.content

Please implement this interface in a separate file. More flexible implementation approaches will be supported in the future.
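Since call_api is invoked once per task step, transient API failures can fail an entire run. A simple retry wrapper can help; this helper is not part of benchflow, just a hedged sketch you could use inside your own call_api:

```python
import time
from typing import Any, Callable, Dict


def call_with_retries(fn: Callable[[Dict[str, Any]], str],
                      inputs: Dict[str, Any],
                      attempts: int = 3,
                      delay: float = 1.0) -> str:
    """Call fn(inputs), retrying up to `attempts` times on any exception."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn(inputs)
        except Exception as exc:  # narrow this to your client's error types
            last_exc = exc
            time.sleep(delay)
    # All attempts failed; surface the last error.
    raise last_exc
```

Inside your caller you would then write something like call_with_retries(self._raw_call, task_step_inputs), where _raw_call is a hypothetical method wrapping the actual API request.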

Run the benchmark

Create the environment for calling your API

We require you to provide a Python-style requirements.txt file that lists all dependencies needed to call your API. For instance, for the openai_caller.py example above, your requirements.txt should include the following dependencies:

openai
benchflow

Get your BenchFlow token

Kickstart your free BenchFlow trial on BenchFlow.ai to unlock benchmarking insights.

Benchmark your intelligence

Load your benchmark from the Benchmark Hub.

import os

from benchflow import load_benchmark

bench = load_benchmark("benchflow/webarena", bf_token=os.getenv("BF_TOKEN"))

Benchmark names follow the format organization_name/benchmark_name.

Import your API caller (agent).

from openai_caller import YourCaller

agent = YourCaller()

Start the benchmark tasks.

# Refer to the fields description below for more details
run_ids = bench.run(
    task_ids=[0],
    agents=agent,
    api={
        "provider": "openai",
        "model": "gpt-4o-mini",
        "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY")
    },
    requirements_txt="webarena_requirements.txt",
    args={}
)

Get your results

You can get your results from our SDK:

results = bench.get_results(run_ids)

or download the results from the BenchFlow.ai dashboard.
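Assuming get_results returns one result per run id (an assumption; each benchmark defines its own result schema), the two lists can be paired up for inspection. The values below are hypothetical placeholders standing in for real bench.run and bench.get_results output:

```python
# Hypothetical placeholders; in practice these come from bench.run(...)
# and bench.get_results(...), and the result schema varies per benchmark.
run_ids = ["run-0", "run-1"]
results = ["<result for run-0>", "<result for run-1>"]

# Pair each run id with its result for easy lookup.
paired = dict(zip(run_ids, results))
for run_id, result in paired.items():
    print(run_id, result)
```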

Complex Examples

Webarena

Rarebench

Webcanvas