I'm realizing that there's more depth to LLM-based synthetic data generation than I initially appreciated.

One strategy is highlighted in Andrew Ng's article[^1] (sketched in code after the list):

  1. Fine-tune the LLM on a limited but high-quality dataset
  2. Use RAG to gather the data points your synthetic examples need, then prompt an LLM to generate more synthetic data
  3. Critique the quality of the synthetic data (again using an LLM)
  4. Generate a second iteration of the synthetic data based on that critique
  5. Repeat steps 1-4 to collect data
  6. Fine-tune the model on the second iteration of conversations
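
Here is a minimal sketch of steps 2-5 in Python, assuming an OpenAI-style chat client. The model name, the prompts, and `retrieve_context` are placeholders of my own, not anything specified in the article, and the fine-tuning in steps 1 and 6 is only indicated in comments.

```python
# Sketch of the generate -> critique -> regenerate loop.
# `retrieve_context`, the model name, and the prompts are assumed placeholders;
# swap in your own RAG store and fine-tuning code.

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumption: any capable chat model works here


def retrieve_context(topic: str) -> str:
    """Placeholder: pull supporting facts for `topic` from your RAG index."""
    raise NotImplementedError


def chat(prompt: str) -> str:
    """Single-turn helper around the chat completions API."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def generate_example(topic: str, context: str) -> str:
    """Step 2: prompt an LLM to draft one synthetic example from retrieved facts."""
    return chat(
        f"Using only these facts:\n{context}\n\n"
        f"Write one realistic synthetic training example about: {topic}"
    )


def critique(example: str) -> str:
    """Step 3: ask an LLM to point out factual and stylistic problems."""
    return chat(f"Critique the quality of this synthetic example:\n{example}")


def revise(example: str, feedback: str) -> str:
    """Step 4: generate a second iteration that addresses the critique."""
    return chat(
        f"Example:\n{example}\n\nCritique:\n{feedback}\n\n"
        "Rewrite the example so it addresses every point in the critique."
    )


def build_dataset(topics: list[str]) -> list[str]:
    """Step 5: repeat the loop across topics to collect data."""
    dataset = []
    for topic in topics:
        context = retrieve_context(topic)
        draft = generate_example(topic, context)
        dataset.append(revise(draft, critique(draft)))
    return dataset


# Steps 1 and 6 (not shown): fine-tune the model on the small seed set first,
# then again on the dataset returned by build_dataset().
```

Keeping the critique and revision as separate calls (rather than one "improve this" prompt) makes it easy to log the critiques and filter out examples the judge model flags as low quality.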

[^1]: That example uses patient-doctor conversations.