Detailed tutorial for benchmark developers.
`parse_input` defines the structure of the data provided by the benchmark, and it returns a dictionary.
`parse_response` parses the raw response from the agent into a structured dictionary.
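As a concrete illustration, here is a minimal sketch of a `parse_response` implementation. It assumes the agent replies with free-form text containing an answer letter such as "The answer is (B)"; the regex and the output field names are illustrative, not part of the BenchFlow API.

```python
import re

def parse_response(raw_response: str) -> dict:
    """Parse the agent's raw text reply into a structured dictionary."""
    # Assumption: the agent states its choice as "answer is (X)"; adjust
    # the pattern to whatever format your benchmark expects.
    match = re.search(r"answer is \(?([A-J])\)?", raw_response, re.IGNORECASE)
    return {
        "raw_response": raw_response,
        "answer": match.group(1).upper() if match else None,
    }
```

For example, `parse_response("I think the answer is (b).")` yields `{"raw_response": ..., "answer": "B"}`, while a reply with no recognizable letter yields `"answer": None`.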
`intelligence_url` is the address used for communicating with the agent. Your evaluation script should accept this URL as an argument; we explain the details in subsequent steps.

The `task_step_inputs` provided by MMLU-Pro consists of four fields, which are as follows:
```python
{
    "prompt": prompt,
    "input_text": input_text,
    "entry": single_question,
    "cot_examples_dict": cot_examples_dict
}
```
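The four fields above would typically be assembled inside `parse_input`. Below is a sketch of such an implementation; only the four output keys come from the MMLU-Pro example, while the function signature and how the prompt is built from the question row are assumptions.

```python
def parse_input(single_question: dict, cot_examples_dict: dict) -> dict:
    """Build the structured input dict for one MMLU-Pro question.

    Hypothetical signature: how the question row and few-shot examples
    reach this method depends on your benchmark's data loader.
    """
    input_text = single_question.get("question", "")
    prompt = f"Answer the following multiple-choice question.\n\n{input_text}"
    return {
        "prompt": prompt,
        "input_text": input_text,
        "entry": single_question,
        "cot_examples_dict": cot_examples_dict,
    }
```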
`get_response` is provided by `BaseClient` to get a response from the intelligence. This method first calls `parse_input`, then sends the input to the intelligence. After receiving a response, it calls `parse_response` and returns its result.
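The flow just described can be sketched as follows. This is not `BaseClient`'s actual implementation: the endpoint path, JSON shape, and transport are assumptions, and the `send` hook is injectable only so the flow can be exercised without a live agent.

```python
import json
from urllib import request

def get_response(task_step_inputs: dict, intelligence_url: str,
                 parse_input, parse_response, send=None) -> dict:
    """Sketch of the get_response flow: parse input, send, parse reply."""
    structured = parse_input(task_step_inputs)       # step 1: shape the input
    if send is None:
        def send(url, payload):                      # step 2: default HTTP sender (assumed)
            req = request.Request(
                url,
                data=json.dumps(payload).encode(),
                headers={"Content-Type": "application/json"},
            )
            with request.urlopen(req) as resp:
                return resp.read().decode()
    raw = send(intelligence_url, structured)
    return parse_response(raw)                       # step 3: structure the reply
```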
Your evaluation script should accept the `intelligence_url` field from the command line; BenchFlow provides this field via the `INTELLIGENCE_URL` environment variable. Arguments are defined with `pydantic`, and you can also specify default values for optional arguments. You can read `INTELLIGENCE_URL` in your evaluation script in the same way as the other defined environment variables.
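One way to wire this up is to let the `INTELLIGENCE_URL` environment variable supply the default while a command-line flag can still override it. The flag name and fallback URL below are illustrative.

```python
import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    """Build an argument parser for a hypothetical evaluation script."""
    parser = argparse.ArgumentParser(description="Benchmark evaluation script")
    parser.add_argument(
        "--intelligence_url",
        # BenchFlow sets INTELLIGENCE_URL; the localhost fallback is an assumption.
        default=os.environ.get("INTELLIGENCE_URL", "http://localhost:8000"),
        help="Agent endpoint, provided by BenchFlow via INTELLIGENCE_URL",
    )
    return parser
```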
`get_image_name` returns the image you uploaded to Docker Hub.
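A minimal sketch of what this might look like, assuming `get_image_name` simply returns the Docker Hub reference you pushed; the class name and image tag are hypothetical.

```python
class MMLUProBench:
    """Hypothetical benchmark class illustrating get_image_name."""

    def get_image_name(self) -> str:
        # Replace with the image you actually pushed to Docker Hub.
        return "your-dockerhub-user/mmlu-pro-bench:latest"
```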
Some benchmarks may not have a `task_id` field. In such cases, you can either pass a line number or treat the entire benchmark as a single task. The classification of task IDs is flexible and is primarily used for parallel processing.

The `task_id` field is also passed to the evaluation environment as an environment variable named `TEST_START_ID`.
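Inside the evaluation environment, the task id can then be read back from `TEST_START_ID`. A sketch, assuming the id is a line number and falling back to 0 when the variable is unset; both assumptions depend on how you chose to classify tasks.

```python
import os

def get_start_id(default: int = 0) -> int:
    """Read TEST_START_ID from the environment, falling back to a default."""
    raw = os.environ.get("TEST_START_ID")
    # Assumption: task ids are non-negative line numbers encoded as digits.
    return int(raw) if raw is not None and raw.isdigit() else default
```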