Evaluation Techniques

Correct by Construction

Motivation: It’s hard to use LLMs to generate datasets when some property of the answer is hard to verify. (e.g. an API call generated from natural language can easily be tested for validity, but it is hard to know whether it semantically does what we want.)

Start with correct API calls
Ask an LLM to generate a description for each call
In general: start with the correct answer $y$, and create a generated response $X$. This allows you to create an ($X$, $y$) pair that’s guaranteed to be correct
- The assumption here is that generating $X$ is easy, i.e. there are many correct responses $X$ for a given answer $y$
Tips:
- Set the temperature lower when you are generating a correct response $X$; otherwise the model may hallucinate and include irrelevant info in the response
- You can set the temperature higher if you want a negative response $X$
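
The steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `build_pairs`, the `LLM` callable signature, the prompt wording, the default temperature of 0.2, and the offline `fake_llm` stand-in are all assumptions made for the example. Swap in whatever model client you actually use.

```python
from typing import Callable, List, Tuple

# Hypothetical interface for an LLM call: (prompt, temperature) -> generated text.
LLM = Callable[[str, float], str]


def build_pairs(
    api_calls: List[str], llm: LLM, temperature: float = 0.2
) -> List[Tuple[str, str]]:
    """Return (X, y) pairs: y is a known-correct API call, X its generated description."""
    pairs = []
    for y in api_calls:
        prompt = f"Describe in one sentence what this API call does:\n{y}"
        # Low temperature keeps the description close to the call itself.
        x = llm(prompt, temperature).strip()
        if x:  # skip empty generations
            pairs.append((x, y))
    return pairs


def fake_llm(prompt: str, temperature: float) -> str:
    """Offline stand-in for a real model (assumption), so the sketch runs as-is."""
    call = prompt.splitlines()[-1]
    return f"Request that performs: {call}"


calls = ["GET /users/42", "DELETE /sessions/current"]
dataset = build_pairs(calls, fake_llm)
```

Because every $y$ comes from the list of known-good calls, the label side of each pair never depends on the model; only the description $X$ is generated.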

Resource Overboard

Can be used in tandem with Correct by Construction.

Use a better model to generate examples of desired outputs
Use a cheaper model for evaluations
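
One possible reading of this split, sketched under assumptions: the stronger model is called once up front to produce reference outputs, and the cheaper model then acts as a judge that compares candidates against those references. The `Model` interface, the function names, the YES/NO prompt, and both `fake_*` stand-ins are hypothetical and exist only so the example runs offline.

```python
from typing import Callable, List

# Hypothetical model interface: prompt -> generated text.
Model = Callable[[str], str]


def generate_references(strong: Model, tasks: List[str]) -> List[str]:
    """Use the expensive model once to produce examples of desired outputs."""
    return [strong(task) for task in tasks]


def judge(cheap: Model, candidate: str, reference: str) -> bool:
    """Ask the cheap model whether a candidate matches the reference output."""
    prompt = (
        "Do these two answers mean the same thing? Reply YES or NO.\n"
        f"Reference: {reference}\n"
        f"Candidate: {candidate}"
    )
    return cheap(prompt).strip().upper().startswith("YES")


def fake_strong(task: str) -> str:
    """Stand-in for the better model (assumption): just upper-cases the task."""
    return task.upper()


def fake_cheap(prompt: str) -> str:
    """Stand-in for the cheaper judge (assumption): case-insensitive comparison."""
    lines = prompt.splitlines()
    reference = lines[-2].split(": ", 1)[1]
    candidate = lines[-1].split(": ", 1)[1]
    return "YES" if reference.lower() == candidate.lower() else "NO"


references = generate_references(fake_strong, ["list all users"])
matches = judge(fake_cheap, "list all users", references[0])
differs = judge(fake_cheap, "delete all users", references[0])
```

The expensive calls scale with the number of reference examples, while the cheap calls scale with the number of evaluations, which is usually the larger count.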

🪴 Chris' Digital Garden
