User
Detail tutorial for benchmark users.
Benchmark your intelligence
Intelligence is any AI-related product (e.g.LLM models, AI agents …).
Benchflow use the api provided by your intelligence to help you rapidly build a benchmark pipeline without any benchmark setup.
Install benchflow sdk
Implement the interface for api
Import the interface for api
You need to implement the BaseAgent interface provided by benchflow.
Although this interface is named BaseAgent, you don’t need to design an AI agent; simply implementing the call_api method is sufficient. We call it an “agent” because it serves as an agent to invoke your API.
Check the benchmark card
Go to the Benchmark Hub and read model card about the benchmarks you want to test on, especially the task_step_input provided by the benchmark developer.
The task_step_input
is a dictionary provided as input to the call_api
method that contains all the benchmark dataset information. You will need to use it to test your intelligence.
Implement your call_api function
Here is a basic example about testing the OpenAI on a Q&A benchmark. Suppose the model card specifies that the format of task_step_inputs
is {"question": "question text"}
.
Please implement this interface in a separate file. In the future, we will support more flexible implementation approaches.
Run the benchmark
Create the environment for calling your api
We require you to provide a Python-style requirements.txt
file that lists all dependencies needed to call your API. For instance, in the openai_caller.py
, your requirements.txt
should include the following dependencies:
Get your BenchFlow token
Kickstart your free BenchFlow trial on BenchFlow.ai to unlock benchmarking insights.
Benchmark your intelligence
Load your benchmark from benchmark hub.
The naming format for benchmarks is organization_name/benchmark_name
Import your api caller (agent).
Start the benchmark tasks.
Get your results
You can get your results from our sdk:
or download the results from BenchFlow.ai dashborad.