A dataset hosted on Hugging Face, focused on code generation.
It reportedly offers more complex and diverse tasks1 compared to existing datasets like HumanEval.
This dataset comes in two flavours:
The first approach generates code from a function signature and a detailed docstring instruction (see “Complete Prompt”). This variant is called BigCodeBench-Complete.
The second approach, BigCodeBench-Instruct, generates code from requirements specified in natural-language chat (see “Instruct Prompt”). This is much harder, as it:
- may involve multi-turn dialogue
- is more conversational and less verbose
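To make the two flavours concrete, here is a minimal sketch of how the same task might be presented in each style. The field names (`complete_prompt`, `instruct_prompt`) mirror the dataset's published schema, but the sample record and helper functions below are invented for illustration:

```python
# Hypothetical sample record; field names mirror the BigCodeBench schema,
# but the contents here are invented for illustration.
sample = {
    "complete_prompt": (
        "import statistics\n"
        "def task_func(values):\n"
        '    """Return the mean of a list of numbers.\n'
        "    >>> task_func([1, 2, 3])\n"
        "    2.0\n"
        '    """\n'
    ),
    "instruct_prompt": (
        "Write a function task_func(values) that returns the mean "
        "of a list of numbers."
    ),
}

def build_complete_input(record):
    # Complete flavour: the model continues the code after the docstring.
    return record["complete_prompt"]

def build_instruct_input(record):
    # Instruct flavour: the requirement is phrased as a chat instruction,
    # with far less scaffolding for the model to lean on.
    return f"You are a coding assistant.\n\nTask: {record['instruct_prompt']}"
```

Note how the Complete prompt hands the model the signature, imports, and a doctest, while the Instruct prompt conveys only the requirement in prose.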
Lineage
Originated from ODEX and was expanded with GPT-4 into comprehensive function-level tasks.
Limitations of this dataset
- Python-only, at the moment
- Function-level tasks only; it doesn’t evaluate the ability of LLMs on code design/architecture tasks
- Tasks are expected to be solved in vacuo (i.e. in a sandbox environment), with no interaction with external sources or tools.
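The in-vacuo evaluation described above can be sketched as follows. This is an assumed minimal harness, not the benchmark's actual runner (which would also restrict imports, CPU time, memory, filesystem, and network access): it executes a candidate completion in an isolated namespace and checks it against a unit test.

```python
def run_in_sandbox(candidate_code: str, test_code: str) -> bool:
    """Execute a candidate solution and its test in a fresh namespace.

    Minimal sketch only: a real sandbox would additionally confine
    resources and block external tools, matching the in-vacuo setup.
    """
    namespace = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function
        exec(test_code, namespace)       # run assertions against it
        return True
    except Exception:
        return False

# Hypothetical candidate and test for a mean-of-list task.
candidate = "def task_func(values):\n    return sum(values) / len(values)\n"
test = "assert task_func([1, 2, 3]) == 2.0"
print(run_in_sandbox(candidate, test))  # → True
```

A pass/fail signal like this is all the harness reports; nothing about the task requires (or permits) calling out to other tools.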
Leaderboard
As of 2024-06-25, GPT-4o is the leader, slightly outperforming its sibling variants.
Footnotes
1. As qualitatively reported by the team.