A detailed tutorial for benchmark users.
Benchmark your intelligence
An intelligence is any AI-related product (e.g. LLMs, AI agents, …).
BenchFlow uses the API provided by your intelligence to help you rapidly build a benchmark pipeline without any benchmark setup.
Install benchflow sdk
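Assuming the SDK is published on PyPI under the name benchflow (check BenchFlow.ai for the exact package name), installation would look like:

```shell
pip install benchflow
```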
Implement the interface for your API
Import the interface
You need to implement the BaseAgent interface provided by benchflow.
Although this interface is named BaseAgent, you don’t need to design an AI agent; simply implementing the call_api method is sufficient. We call it an “agent” because it serves as an agent to invoke your API.
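A minimal sketch of what such an implementation looks like. The BaseAgent below is a stand-in with the single abstract call_api method described above; in real code you would import the actual class from the benchflow SDK instead (the exact import path may differ, so check the SDK docs):

```python
from abc import ABC, abstractmethod

# Stand-in for benchflow's BaseAgent (assumption: in your own code,
# replace this with the real import from the benchflow package).
class BaseAgent(ABC):
    @abstractmethod
    def call_api(self, task_step_inputs: dict) -> str:
        """Take one benchmark step's input dict and return the answer."""

class EchoAgent(BaseAgent):
    """Trivial agent: echoes the question back, just to show the shape."""
    def call_api(self, task_step_inputs: dict) -> str:
        return task_step_inputs["question"]
```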
Check the benchmark card
Go to the Benchmark Hub and read the model card of the benchmark you want to test on, especially the task_step_input provided by the benchmark developer.
The task_step_input is a dictionary provided as input to the call_api method that contains all the benchmark dataset information. You will need to use it to test your intelligence.
Implement your call_api function
Here is a basic example of testing OpenAI on a Q&A benchmark. Suppose the model card specifies that the format of task_step_inputs is {"question": "question text"}.
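A hedged sketch of such a call_api implementation using the openai Python package. The model name is an assumption, and the BaseAgent class is stubbed here so the snippet is self-contained; in your openai_caller.py you would import the real BaseAgent from the benchflow SDK:

```python
from abc import ABC, abstractmethod

# Stub for benchflow's BaseAgent; in openai_caller.py you would import
# the real class from the benchflow SDK instead.
class BaseAgent(ABC):
    @abstractmethod
    def call_api(self, task_step_inputs: dict) -> str: ...

class OpenAICaller(BaseAgent):
    def call_api(self, task_step_inputs: dict) -> str:
        # Lazy import so the module loads even without openai installed.
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model; use any chat model
            messages=[{"role": "user",
                       "content": task_step_inputs["question"]}],
        )
        return resp.choices[0].message.content
```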
Please implement this interface in a separate file. In the future, we will support more flexible implementation approaches.
Run the benchmark
Create the environment for calling your API
We require you to provide a Python-style requirements.txt file that lists all dependencies needed to call your API. For instance, for openai_caller.py, your requirements.txt should include the following dependencies:
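For example, if your caller uses the openai Python client, a minimal requirements.txt might be (the version pin is illustrative; pin as needed):

```
openai>=1.0.0
```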
Get your BenchFlow token
Kickstart your free BenchFlow trial on BenchFlow.ai to unlock benchmarking insights.
Benchmark your intelligence
Load your benchmark from the benchmark hub.
The naming format for benchmarks is organization_name/benchmark_name.
Import your API caller (agent).
Start the benchmark tasks.
Fields Description
- task_ids: A list that specifies which task(s) to run. If you leave this list empty, the system defaults to running the full benchmark. For the precise format, please refer to the model card.
- agents: An instance of your agent that implements the call_api method.
- api: A dictionary containing the API configuration details. This should include the provider name, model, and any necessary API keys.
- requirements_txt: A file path to your dependencies file, formatted like a standard requirements.txt. This file should list all the Python dependencies needed for your API calls.
- args: A dictionary of any required and optional arguments for your benchmark. Please refer to the model card on the benchmark hub.
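Putting the fields together, a run might look like the following sketch. The function and method names (load_benchmark, run) and import paths are assumptions for illustration only; consult the benchflow SDK documentation for the real API:

```python
from benchflow import load_benchmark        # assumed import path
from openai_caller import OpenAICaller      # your call_api implementation

benchmark = load_benchmark("organization_name/benchmark_name")
run_ids = benchmark.run(
    task_ids=[],                  # empty list -> run the full benchmark
    agents=OpenAICaller(),        # instance implementing call_api
    api={"provider": "openai",    # shape of this dict per the model card
         "model": "gpt-4o-mini",
         "OPENAI_API_KEY": "<your-key>"},
    requirements_txt="requirements.txt",
    args={},                      # benchmark-specific args (see model card)
)
```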
Get your results
You can get your results from our SDK:
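For instance (the method name here is an assumption; check the SDK docs for the real call):

```python
results = benchmark.get_results(run_ids)  # hypothetical; see SDK docs
```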
or download the results from the BenchFlow.ai dashboard.