I’m realizing there’s more depth to LLM-based synthetic data generation than I first thought.
One strategy is highlighted in Andrew Ng’s article[^1] (a rough code sketch of the loop follows the list):
1. Fine-tune the LLM on a limited but high-quality dataset
2. Use RAG to gather the data points your synthetic examples need, then prompt an LLM to generate synthetic data from them
3. Use an LLM to critique the quality of the synthetic data
4. Generate a second iteration of the synthetic data that addresses the critique
5. Repeat steps 1-4 to collect data
6. Fine-tune the model on the second iteration of synthetic conversations
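For concreteness, here’s a minimal Python sketch of steps 2-4 (RAG-grounded generation, critique, regeneration). Everything below is an assumption on my part, not code from the article: `call_llm` and `retrieve_context` are hypothetical placeholders you’d swap for your LLM provider’s API and your retrieval stack, and the patient-doctor framing comes from the footnoted example.

```python
# Sketch of the generate -> critique -> regenerate loop (steps 2-4).
# call_llm and retrieve_context are stand-ins; replace them with real calls.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., your provider's chat API)."""
    return f"<LLM output for: {prompt[:40]}...>"

def retrieve_context(topic: str) -> str:
    """Placeholder for RAG retrieval over your domain corpus."""
    return f"<retrieved passages about {topic}>"

def generate_synthetic_example(topic: str) -> str:
    """Step 2: ground the generation prompt in retrieved data points."""
    context = retrieve_context(topic)
    prompt = (
        "Using only the reference material below, write a realistic "
        f"patient-doctor conversation about {topic}.\n\n"
        f"Reference material:\n{context}"
    )
    return call_llm(prompt)

def critique(example: str) -> str:
    """Step 3: have an LLM list concrete problems with the draft."""
    prompt = (
        "Critique this synthetic conversation for factual accuracy, "
        f"realism, and coverage. List concrete problems.\n\n{example}"
    )
    return call_llm(prompt)

def refine(example: str, feedback: str) -> str:
    """Step 4: produce a second iteration that fixes the critique."""
    prompt = (
        "Rewrite the conversation below, fixing every problem in the "
        f"critique.\n\nConversation:\n{example}\n\nCritique:\n{feedback}"
    )
    return call_llm(prompt)

def build_dataset(topics: list[str]) -> list[str]:
    """Run the loop per topic; the second-pass outputs become training data."""
    dataset = []
    for topic in topics:
        first_pass = generate_synthetic_example(topic)
        feedback = critique(first_pass)
        dataset.append(refine(first_pass, feedback))
    return dataset

if __name__ == "__main__":
    print(build_dataset(["managing type 2 diabetes"]))
```

The point of splitting critique and refinement into separate calls is that the critic sees the draft fresh, without the generation prompt, which seems to make it more willing to flag problems.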
[^1]: That example uses patient-doctor conversations