Dataset to evaluate LLMs on code generation. The tasks here are algorithm-oriented.
Some argue that the tasks here are too simple to adequately represent real-world code generation tasks that utilizes extensive, external libraries.
There’s also the concern of contamination1 and overfitting, as LLMs are trained on the dataset itself.
Footnotes
-
Model is trained on given data. ↩