Reproducing the results of LLM papers is a worthy goal in general, and publishing detailed reproduction journeys is highly laudable. While some corporations feel the need for secrecy to create competitive advantage, wider understanding of this technology seems imperative.
This is particularly relevant in the case of the Phi family of LLMs. Although Microsoft initially released some good papers for Phi-1 and Phi-1.5 (Phi-2 has so far only been documented in a blog post), it held back some important knowledge, so we still don’t know quite how this important “little” LLM was created.
Understanding what makes Phi perform so well for its size class is important because it might provide hints for LLM explainability as well as help us all improve training for LLMs of all sizes. Also, smaller models are substantially faster at inference and far less demanding of hardware, so high quality-to-size ratios are useful in all sorts of environments.
🤗 seem to be making a specialty of reproducing papers from important closed models; their IDEFICS reproduction of the Flamingo paper springs to mind as another valuable contribution. As with IDEFICS, following this reproduction attempt of Phi, they have released:
- A blog post describing their efforts.
- The code for the end-to-end pipeline used to generate the dataset.
- The Cosmopedia dataset of synthetic textbooks, blog posts, stories, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1, containing 30M files and 25B tokens.
- Cosmo-1b, a 1B parameter model which outperforms most similar-sized LLMs (but unfortunately not Phi-1.5).
The Cosmo team focussed on the most important open question for Phi: how Microsoft generated the data. We know it’s synthetic, but as far as its makeup goes, we only know:
(Phi-1.5 team) carefully selected 20K topics to seed the generation of (these synthetic textbooks). In our generation prompts, we use samples from web datasets for diversity.
…
It requires intricate iterations, strategic topic selection, and a deep understanding of knowledge gaps to ensure quality and diversity of the data.
As the 🤗 team says, if the average generated file is 1000 tokens, ~20 million distinct prompts are required. How the Phi team combined topics and web samples to increase diversity in these prompts is unknown, but the Cosmopedia team used web data and curated sources separately to build Cosmopedia’s “seed data”, keeping them isolated throughout the pipeline. Unfortunately, this resulted in lower-value web data generating 80% of their seeds, substantially outweighing the presumably more significant curated curricula data.

The resulting model has performance gaps compared to Phi-1.5, which 🤗 suggest may be due to the quality of the synthetic data generation, pointing to the LLM used, topic coverage, and the prompts themselves. They stress that they are continuing to work on the reproduction, focussing on improving the quality of the generated content. Apparently they noticed Mixtral hallucinating, particularly on historical facts and mathematical reasoning, and are considering RAG to mitigate hallucinations and hallucination measurement to quantify them.

Perhaps in a future attempt, more could be done to leverage the high quality and significance of the curriculum data in the prompts used to generate the training data. For instance, the web datasets could be clustered toward the curated data; relevant web data could then be used to generate more intrinsically varied textbooks with web-sourced perspectives, instead of just manually created audience and style variation. The enormous amount of prompt engineering required to make the chosen approach work makes it very expensive to reproduce and brittle, since the prompts will likely need to be reformulated to re-run with a different generation model.
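To make that suggestion concrete, here is a minimal sketch of one way web documents could be attached to curriculum topics by embedding similarity. This is my illustration, not anything the Cosmopedia team did; the embedding model and the similarity threshold are arbitrary assumptions.

# Hypothetical sketch: attach web documents to curriculum topics by embedding
# similarity, so curated seeds stay central and web data only adds perspective.
import numpy as np
from sentence_transformers import SentenceTransformer

curriculum_topics = ["Linear algebra: eigenvalues", "Cell biology: mitosis"]  # from curated outlines
web_docs = [
    "A blog post about PageRank and the eigenvectors of the web graph...",
    "Lab notes on observing cell division under a microscope...",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
topic_emb = model.encode(curriculum_topics, normalize_embeddings=True)
web_emb = model.encode(web_docs, normalize_embeddings=True)

similarity = web_emb @ topic_emb.T          # cosine similarity (embeddings are normalised)
best_topic = similarity.argmax(axis=1)
best_score = similarity.max(axis=1)

for doc, idx, score in zip(web_docs, best_topic, best_score):
    if score > 0.3:  # arbitrary threshold; weakly-related web data would simply be dropped
        print(f"{curriculum_topics[idx]}  <-  {doc[:50]}...")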
The (First) Replication Attempt
Generating Prompts from Curated Sources
The curated seed data came from reputable educational sources such as Stanford courses, Khan Academy, OpenStax, and WikiHow. These resources cover valuable topics for an LLM to learn, and the team used the outlines of Stanford courses to create prompts requesting that a textbook be generated for each unit (for a value of “textbook” defined in “Textbooks Are All You Need”). However, only 16,000 unique base prompts from OpenStax and 250,000 from Stanford could be derived from these sources.
By specifying four different audiences (young children, high school students, college students, researchers) and three styles (textbooks, blog posts, WikiHow articles), they multiplied their prompt count by twelve, but the variations need to be strongly emphasised, with specifics on how the format and content should differ, to avoid a (too) high rate of duplicate content. A sketch of this expansion follows below.
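As a minimal sketch of the audience-by-style expansion (the prompt wording here is illustrative; the real Cosmopedia templates carry far more detail about how each variant should differ):

# Hypothetical sketch of expanding one curated seed topic across audiences and styles.
from itertools import product

audiences = ["young children", "high school students", "college students", "researchers"]
styles = ["textbook", "blog post", "WikiHow article"]

def expand(seed_topic: str) -> list[str]:
    """Turn one seed topic into 4 x 3 = 12 prompt variants."""
    return [
        f"Write a {style} about '{seed_topic}' aimed at {audience}. "
        "Tailor the depth, vocabulary, format and examples to this audience."
        for audience, style in product(audiences, styles)
    ]

print(len(expand("Eigenvalues and eigenvectors")))  # 12 variants per seed topic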

Generating Prompts from Web Data
They used text-clustering to break ~100k samples from datasets like RefinedWeb into 145 clusters, then asked Mixtral to label each cluster and give it an educational score out of 10 (using 10 samples from each cluster), so they could exclude low-value clusters like adult material, celebrity gossip, or obituaries, leaving 112 topics. A rough sketch of this cluster-and-score step follows below.
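Here is a rough, hedged sketch of that idea using sentence-transformers and scikit-learn. It stands in for the team’s actual text-clustering pipeline: the toy corpus, the two clusters (rather than 145), and the judging prompt are all illustrative.

# Rough sketch of the cluster-then-judge step; a stand-in for the real pipeline.
import random
from collections import defaultdict
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

web_samples = [
    "Introduction to photosynthesis and the light-dependent reactions...",
    "Celebrity X spotted at a restaurant with...",
    "How to solve quadratic equations by completing the square...",
    "Gossip roundup: who wore it best this week...",
    "A beginner's guide to supply and demand curves...",
    "Obituary: beloved local figure passes away at 92...",
]  # in the real run, ~100k documents from RefinedWeb and similar datasets

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(web_samples)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)  # 145 in the real run

clusters = defaultdict(list)
for text, label in zip(web_samples, labels):
    clusters[label].append(text)

for label, texts in clusters.items():
    excerpts = random.sample(texts, k=min(10, len(texts)))
    judge_prompt = (
        "Here are samples from one cluster of web documents:\n"
        + "\n---\n".join(excerpts)
        + "\n\nGive the cluster a short topic label and an educational value score from 1 to 10."
    )
    # judge_prompt would then be sent to Mixtral; low-scoring clusters
    # (adult content, gossip, obituaries, ...) are dropped.
    print(label, judge_prompt[:80], "...")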

From these samples, 23 million prompts were constructed: the model was instructed to generate a textbook related to a web sample, with the scope restricted to the cluster’s topic in only 50% of prompts (for diversity, and to mitigate labelling and clustering errors), again across multiple audiences and generation styles.
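As a hedged sketch (illustrative wording only, not the team’s real templates), building a web-seeded prompt with 50% topic conditioning might look like this:

# Hypothetical sketch of constructing a web-seeded prompt, conditioning on the
# cluster topic only half of the time, to keep generations diverse and to limit
# damage from clustering or labelling mistakes.
import random

def build_web_prompt(web_extract: str, cluster_topic: str, audience: str, style: str) -> str:
    prompt = (
        f"Write a {style} for {audience} inspired by the following web extract:\n\n"
        f"{web_extract}\n\n"
    )
    if random.random() < 0.5:  # condition on the topic only 50% of the time
        prompt += f"Stay within the topic of {cluster_topic}. "
    prompt += "Do not copy the extract; use it only as a starting point for new content."
    return prompt

print(build_web_prompt(
    web_extract="The light-dependent reactions take place in the thylakoid membranes...",
    cluster_topic="biology",
    audience="high school students",
    style="textbook chapter",
))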

Adding More Science
AutoMathText, a curated dataset of mathematical texts, was also sampled to increase the amount of so-called ”scientific” content.

Dataset Generation
Having generated their prompts, they used the llm-swarm library to generate 25 billion tokens of synthetic content with Mixtral-8x7B-Instruct-v0.1, served via TGI on H100 GPUs from the Hugging Face science cluster, spending over 10k GPU hours on the dataset.
Here’s an example using llm-swarm with Mixtral to run 100k Cosmopedia prompts using 2 TGI instances on a Slurm cluster:
# clone the repo and follow the installation requirements
git clone https://github.com/huggingface/llm-swarm
cd llm-swarm
python ./examples/textbooks/generate_synthetic_textbooks.py \
--model mistralai/Mixtral-8x7B-Instruct-v0.1 \
--instances 2 \
--prompts_dataset "HuggingFaceTB/cosmopedia-100k" \
--prompt_column prompt \
--max_samples -1 \
--checkpoint_path "./tests_data" \
--repo_id "HuggingFaceTB/generations_cosmopedia_100k" \
--checkpoint_interval 500
They tracked the generations with wandb to monitor the throughput and number of generated tokens.
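By way of illustration only (this is not their logging code, and the project and metric names are made up), tracking token throughput with wandb can be as simple as:

# Illustrative wandb logging loop; project and metric names are hypothetical.
import time
import wandb

run = wandb.init(project="cosmopedia-generation")  # hypothetical project name
total_tokens = 0
start = time.time()

for batch_tokens in [8_192, 7_936, 8_050]:  # stand-in for tokens produced per generation batch
    total_tokens += batch_tokens
    run.log({
        "generated_tokens": total_tokens,
        "tokens_per_second": total_tokens / (time.time() - start),
    })

run.finish()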

Interestingly, they just used HuggingChat for the initial iterations on the prompts, then generated samples for each prompt type using llm-swarm to spot unusual patterns and fix them. Apparently, the model frequently began stories with “Once upon a time…” or “The sun hung low in the sky…”, so they had to explicitly prompt Mixtral to avoid these introductory phrases and to be creative.
Adding More Commonsense
Models trained on just the generated textbooks lacked common sense and fundamental knowledge, so they also created stories incorporating this kind of knowledge, using portions of the UltraChat and OpenHermes2.5 instruction-tuning datasets, which they added to the seed data. The subsets were chosen to span a broad range of subjects suitable for storytelling: the UltraChat “Questions about the world” subset covers 30 meta-concepts, while the OpenHermes2.5 “glaive-code-assist” and “camelai” sources (programming and advanced chemistry) were omitted.
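As a hedged illustration of this kind of subset filtering (the `source` column name is an assumption about the dataset schema, and the dataset id is the public OpenHermes-2.5 release on the Hub):

# Hypothetical sketch: keep only OpenHermes2.5 records whose source is suitable
# for storytelling, dropping code-assistant and advanced-chemistry material.
# The "source" column name is an assumption about the dataset schema.
from datasets import load_dataset

openhermes = load_dataset("teknium/OpenHermes-2.5", split="train")
excluded_sources = {"glaive-code-assist", "camelai"}

story_seeds = openhermes.filter(
    lambda example: example.get("source") not in excluded_sources
)
print(len(openhermes), "->", len(story_seeds), "records kept as story seeds")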

Decontamination
To ensure fair benchmark scores, they compared the generated dataset against the benchmark test sets and dropped contaminated samples. This decontamination was applied across all benchmarks the Cosmo-1B model was evaluated on, including MMLU, HellaSwag, PIQA, SIQA, Winogrande, OpenBookQA, ARC-Easy, and ARC-Challenge.
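A minimal sketch of n-gram-based decontamination, assuming a simple 10-gram exact-overlap check (the team’s actual thresholds and matching method may well differ):

# Minimal sketch: drop any generated sample that shares a 10-gram with a
# benchmark test set. The n-gram size and exact-match rule are assumptions.
def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(samples: list[str], benchmark_texts: list[str], n: int = 10) -> list[str]:
    benchmark_ngrams: set[tuple[str, ...]] = set()
    for text in benchmark_texts:
        benchmark_ngrams |= ngrams(text, n)
    return [s for s in samples if not (ngrams(s, n) & benchmark_ngrams)]

clean = decontaminate(
    samples=["a long generated textbook passage about photosynthesis ..."],
    benchmark_texts=["a benchmark question and its answer choices ..."],
)
print(len(clean), "samples kept")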
The final dataset can be investigated with their viewer.
Training
They trained cosmo-1b, a 1B-parameter LLM using the Llama 2 architecture, on their generated Cosmopedia dataset to assess its quality, although they remained light on details here, so I guess it went quite smoothly(?). First they used datatrove for data deduplication and tokenization, then nanotron for model training, and finally lighteval for evaluation.
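To illustrate the deduplication idea in plain Python (datatrove’s actual pipelines are considerably more sophisticated; this exact-match version is only a concept sketch):

# Concept-only sketch: exact-match deduplication by hashing normalised text.
# The real pipeline used datatrove, which provides far more robust dedup stages.
import hashlib

def dedup(samples: list[str]) -> list[str]:
    seen: set[str] = set()
    unique = []
    for text in samples:
        key = hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

print(dedup(["A  textbook chapter.", "a textbook chapter.", "Another one."]))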
Conclusion
The Cosmo-1B team has not found the ideal recipe yet, but they are continuing to work on the problem and have some interesting ideas. It will be interesting to see how their approach changes if they experiment with different LLMs for generation; I hope they don’t have to redo all of their prompt engineering efforts for every model.
Personally, I would like to see an attempt to use web data differently, trying to support the curriculum seeds more directly, because I think in their first attempt the lower-value data might have swamped the higher-value knowledge.
Regardless of the outcome, publishing the results of all experiments is a valuable contribution of knowledge that we can all build upon. The 🤗 team should be applauded for attempting this reproduction and publicly releasing the results of their work.
I wish them great success in their ongoing efforts!
