A dataset by Hugging Face's BigCode project, focused on code generation.

It reportedly has more complex and diverse tasks1 compared to existing datasets like HumanEval.

This dataset comes in two flavours:

The first approach is to generate code from a function signature and a detailed docstring (see “Complete Prompt”). This variant is called BigCodeBench-Complete.

The second approach is to generate code from requirements specified in natural-language chat (see “Instruct Prompt”); this variant is called BigCodeBench-Instruct. It is much harder, as it:

  • may involve multi-turn dialogue
  • is more conversational and less verbose
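To make the contrast concrete, here is a sketch of one hypothetical task in both flavours. The prompt texts below are illustrative assumptions, not actual records from the dataset:

```python
# A hypothetical task shown in both flavours. The contents below are
# illustrative assumptions, not real BigCodeBench records.

complete_prompt = '''import pandas as pd

def task_func(csv_path):
    """Read the CSV file at csv_path and return the mean of its "score" column.

    Args:
        csv_path (str): Path to a CSV file with a numeric "score" column.

    Returns:
        float: The arithmetic mean of the "score" column.
    """
'''

instruct_prompt = (
    "Write a function task_func(csv_path) that reads a CSV file and "
    "returns the mean of its score column."
)

# The Complete variant spells out the full interface and contract; the
# Instruct variant conveys the same task conversationally, with far less
# specification text for the model to lean on.
print(len(instruct_prompt) < len(complete_prompt))  # → True
```

The gap between the two is the point of the benchmark: a model scoring well on Complete but poorly on Instruct is relying on the docstring rather than genuinely understanding the request.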

Lineage

It originated from the ODEX dataset and was expanded with GPT-4 into comprehensive function-level tasks.

Limitations of this dataset

  • Python-only, at the moment
  • Function-level tasks only. It doesn’t evaluate the ability of LLMs on code design/architecture tasks
  • Tasks are expected to be solved in vacuo (i.e. in a sandboxed environment), with no interaction with other sources or tools

Leaderboard

As of 2024-06-25, GPT-4o is the leader. It slightly outperforms its sibling variants.

Footnotes

  1. As qualitatively reported by the team. ↩