I’m realizing there’s more depth to LLM-based synthetic data generation than I first thought.
One strategy is highlighted in Andrew Ng’s article[^1] (a rough code sketch of the loop follows the list):
1. Fine-tune the LLM on a limited but high-quality dataset
2. Use RAG to gather the data points your synthetic examples need, then prompt an LLM to generate synthetic data from them
3. Use an LLM to critique the quality of the synthetic data
4. Generate a second iteration of the synthetic data that addresses the critique
5. Repeat steps 1-4 to collect data
6. Fine-tune the model on the second iteration of synthetic conversations
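For concreteness, here’s a minimal Python sketch of steps 2-4 (RAG-grounded generation, critique, regeneration). Everything below is an assumption on my part, not code from the article: `call_llm` and `retrieve_context` are hypothetical placeholders you’d swap for your LLM provider’s API and your retrieval stack, and the patient-doctor framing comes from the footnoted example.

```python
# Sketch of the generate -> critique -> regenerate loop (steps 2-4).
# call_llm and retrieve_context are stand-ins; replace them with real calls.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., your provider's chat API)."""
    return f"<LLM output for: {prompt[:40]}...>"

def retrieve_context(topic: str) -> str:
    """Placeholder for RAG retrieval over your domain corpus."""
    return f"<retrieved passages about {topic}>"

def generate_synthetic_example(topic: str) -> str:
    """Step 2: ground the generation prompt in retrieved data points."""
    context = retrieve_context(topic)
    prompt = (
        "Using only the reference material below, write a realistic "
        f"patient-doctor conversation about {topic}.\n\n"
        f"Reference material:\n{context}"
    )
    return call_llm(prompt)

def critique(example: str) -> str:
    """Step 3: have an LLM list concrete problems with the draft."""
    prompt = (
        "Critique this synthetic conversation for factual accuracy, "
        f"realism, and coverage. List concrete problems.\n\n{example}"
    )
    return call_llm(prompt)

def refine(example: str, feedback: str) -> str:
    """Step 4: produce a second iteration that fixes the critique."""
    prompt = (
        "Rewrite the conversation below, fixing every problem in the "
        f"critique.\n\nConversation:\n{example}\n\nCritique:\n{feedback}"
    )
    return call_llm(prompt)

def build_dataset(topics: list[str]) -> list[str]:
    """Run the loop per topic; the second-pass outputs become training data."""
    dataset = []
    for topic in topics:
        first_pass = generate_synthetic_example(topic)
        feedback = critique(first_pass)
        dataset.append(refine(first_pass, feedback))
    return dataset

if __name__ == "__main__":
    print(build_dataset(["managing type 2 diabetes"]))
```

The point of splitting critique and refinement into separate calls is that the critic sees the draft fresh, without the generation prompt, which seems to make it more willing to flag problems.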
[^1]: That example uses patient-doctor conversations