HumanEval

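A benchmark dataset for evaluating LLMs on code generation. Its tasks are algorithm-oriented: each asks the model to complete a standalone Python function, which is then checked against unit tests.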
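
To make the task format concrete, here is a minimal sketch of what a HumanEval-style problem and its correctness check might look like. The field names (`prompt`, `entry_point`, `test`) mirror those used in the dataset, but the task itself, the `running_max` function, and the `passes` helper are made up for illustration and are not taken from HumanEval or its official harness.

```python
# Hypothetical task in the style of HumanEval (not an actual dataset entry).
task = {
    "prompt": (
        "def running_max(xs):\n"
        '    """Return a list whose i-th element is the max of xs[:i+1]."""\n'
    ),
    "entry_point": "running_max",
    "test": (
        "def check(candidate):\n"
        "    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]\n"
        "    assert candidate([]) == []\n"
    ),
}

# A completion a model might produce: only the function body.
completion = (
    "    out, best = [], None\n"
    "    for x in xs:\n"
    "        best = x if best is None or x > best else best\n"
    "        out.append(best)\n"
    "    return out\n"
)


def passes(task, completion):
    """Return True if prompt + completion satisfies every assertion in the tests.

    Illustrative helper only: exec() on model output is unsafe, so a real
    evaluation would run this in a sandbox with a timeout.
    """
    namespace = {}
    try:
        exec(task["prompt"] + completion + "\n" + task["test"], namespace)
        namespace["check"](namespace[task["entry_point"]])
        return True
    except Exception:
        return False
```

Calling `passes(task, completion)` reports whether the generated body is functionally correct, which is the kind of signal this style of benchmark scores on, rather than textual similarity to a reference solution.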

Some argue that these tasks are too simple to adequately represent real-world code generation, which often relies on extensive external libraries.

There’s also the concern of contamination¹ and overfitting, since LLMs may have been trained on data that includes the dataset itself.
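
One rough way to probe for contamination is to measure how many word n-grams from a benchmark prompt also appear verbatim in a sample of training text. The sketch below is a simple heuristic under stated assumptions, not a method described in this note: the function names, the whitespace tokenization, and the default `n=5` are all illustrative choices.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in text (whitespace tokenization)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_text, corpus_text, n=5):
    """Fraction of the benchmark's n-grams that also occur in the corpus.

    Returns 0.0 when the benchmark text has fewer than n tokens.
    A high ratio hints at verbatim leakage; a low ratio does not rule out
    paraphrased or translated copies.
    """
    bench = ngrams(benchmark_text, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(corpus_text, n)) / len(bench)
```

A ratio near 1.0 for a given prompt would suggest that the prompt appears almost verbatim in the sampled corpus, in which case a high score on that task says little about the model's ability to generalize.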

🪴 Chris' Digital Garden