A dataset by Hugging Face's BigCode project, focused on code generation.

It reportedly has more complex and diverse tasks1 compared to existing datasets like HumanEval.

This dataset comes in two flavours:

The first approach is to generate code from a function signature and a detailed docstring (see “Complete Prompt”). This variant is called BigCodeBench-Complete.

The second approach is to generate code from requirements specified in natural-language chat (see “Instruct Prompt”); this variant is called BigCodeBench-Instruct. It is much harder, as it:

  • may involve multi-turn dialogue
  • is more conversational and less verbose
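To make the contrast concrete, here is a sketch of one hypothetical task in both flavours. The prompt texts below are illustrative assumptions, not actual records from the dataset:

```python
# A hypothetical task shown in both flavours. The contents below are
# illustrative assumptions, not real BigCodeBench records.

complete_prompt = '''import pandas as pd

def task_func(csv_path):
    """Read the CSV file at csv_path and return the mean of its "score" column.

    Args:
        csv_path (str): Path to a CSV file with a numeric "score" column.

    Returns:
        float: The arithmetic mean of the "score" column.
    """
'''

instruct_prompt = (
    "Write a function task_func(csv_path) that reads a CSV file and "
    "returns the mean of its score column."
)

# The Complete variant spells out the full interface and contract; the
# Instruct variant conveys the same task conversationally, with far less
# specification text for the model to lean on.
print(len(instruct_prompt) < len(complete_prompt))  # → True
```

The gap between the two is the point of the benchmark: a model scoring well on Complete but poorly on Instruct is relying on the docstring rather than genuinely understanding the request.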

Lineage

It originated from the ODEX dataset and was expanded with GPT-4 into comprehensive function-level tasks.

Limitations of this dataset

  • Python-only, at the moment
  • Function-level tasks only. It doesn’t evaluate the ability of LLMs on code design/architecture tasks
  • Tasks are expected to be solved in vacuo (i.e. in a sandboxed environment), with no interaction with other sources or tools

Leaderboard

As of 2024-06-25, GPT-4o is the leader. It slightly outperforms its sibling variants.

Footnotes

  1. As qualitatively reported by the team. ↩