
Cosmo-1B, a Phi-1.5 Reproduction Attempt

Reproducing the results of LLM papers is a worthy goal in general and publishing detailed reproduction journeys is highly laudable. While some corporations feel the need for secrets to create competitive advantage, wider understanding of this technology seems imperative.

This is particularly relevant in the case of the Phi family of LLMs. Although Microsoft initially released some good papers for Phi-1 & Phi-1.5 (Phi-2 so far has only been documented in a blog post), Microsoft held back some important knowledge, so we still don’t know quite how they created this important “little” LLM.

Understanding what makes Phi perform so well at its size class is important because it might provide hints for LLM explainability as well as help us all improve training for LLMs of all sizes. Also, inference speed and hardware requirements improve substantially with smaller models, so high quality-to-size ratios are useful in all sorts of environments.

🤗 seem to be making a specialty of reproducing papers from important closed models; their IDEFICS reproduction of the Flamingo paper springs to mind as another valuable contribution. As with IDEFICS, following this reproduction attempt of Phi, they have released:

  • A blog post describing their efforts.
  • The code for the end-to-end pipeline used to generate the dataset.
  • The Cosmopedia dataset of synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1 containing 30M files and 25B tokens.
  • Cosmo-1B, a 1B parameter model which outperforms most similar-sized LLMs (but unfortunately not Phi-1.5).

The Cosmo team focussed on the most important open question for Phi: how Microsoft generated the data. We know it’s synthetic, but as to its makeup, we only know:

(Phi-1.5 team) carefully selected 20K topics to seed the generation of (these synthetic textbooks). In our generation prompts, we use samples from web datasets for diversity.

It requires intricate iterations, strategic topic selection, and a deep understanding of knowledge gaps to ensure quality and diversity of the data.

As the 🤗 team says, if the average file generated is 1000 tokens, ~20 million distinct prompts are required. How the Phi team combined topics and web samples to increase diversity in these prompts is unknown, but the Cosmopedia team used web data and curated sources separately to build Cosmopedia’s “seed data”, keeping them isolated throughout the pipeline. Unfortunately, this resulted in lower-value web data generating 80% of their seeds, substantially outweighing the presumably more-significant curated curricula data.
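A quick back-of-envelope check of that figure, using the token budget and average file length quoted above:

```python
# Back-of-envelope: distinct prompts needed for a Cosmopedia-scale run.
target_tokens = 25_000_000_000   # dataset size from the post
tokens_per_file = 1_000          # average generation length assumed above

prompts_needed = target_tokens // tokens_per_file
print(f"{prompts_needed:,} prompts")  # 25,000,000 prompts
```

which is the same order of magnitude as the ~20 million the team cites.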

The distribution of data sources for building Cosmopedia prompts (left plot) and the distribution of sources inside the Curated sources category (right plot), courtesy HuggingFace.

The resulting model has performance gaps compared to Phi-1.5, which 🤗 suggest is due to the quality of the synthetic data generation, pointing to the LLM used, the topic coverage, or the prompts. They stress that they are continuing to work on the reproduction, focussing on improving the quality of the generated content. Apparently they noticed Mixtral hallucinating, particularly on historical facts and mathematical reasoning, and are considering using RAG to mitigate hallucinations and a hallucination measure to quantify them.

Evaluation results of Cosmo-1B, courtesy HuggingFace.

Perhaps in a future attempt, more could be done to leverage the high quality and significance of the curriculum data in the prompts used to generate the training data. For instance, the web datasets could be clustered around the curated data; relevant web data could then be used to generate more intrinsically varied textbooks with web-sourced perspectives, instead of just manually-created audience and style variation. The enormous amount of prompt engineering required by the chosen approach makes it expensive to reproduce and brittle, since the prompts will likely need to be reformulated to re-run with a different generation model.

The (First) Replication Attempt

Generating Prompts from Curated Sources

The curated seed data came from reputable educational sources such as Stanford courses, Khan Academy, OpenStax, and WikiHow. These resources cover valuable topics for an LLM to learn, and the team used the outlines of Stanford courses to create prompts requesting that a textbook be generated for each unit (for a value of “textbooks” defined in Textbooks Are All You Need). However, only 16,000 unique base prompts from OpenStax and 250,000 from Stanford could be derived from these sources.

By specifying four different audiences (young children, high school students, college students, researchers) and three styles (textbooks, blog posts, wikiHow articles), they multiplied their prompts twelvefold, but the variations needed to be strongly emphasised, with specifics on how the format and content should differ, to avoid a (too) high rate of duplicate content.
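A minimal sketch of that audience × style expansion; the prompt wording is invented for illustration, and only the audiences and styles come from the post:

```python
from itertools import product

audiences = ["young children", "high school students",
             "college students", "researchers"]
styles = ["textbook", "blog post", "wikiHow article"]

def expand(base_prompt: str) -> list[str]:
    """Turn one base prompt into 12 variants, spelling out how the format
    and content must differ to discourage near-duplicate generations."""
    variants = []
    for audience, style in product(audiences, styles):
        variants.append(
            f"{base_prompt}\n\nWrite this as a {style} aimed at {audience}. "
            f"Adapt the vocabulary, depth and examples to that audience, "
            f"and follow the conventions of the {style} format."
        )
    return variants

prompts = expand("Create a unit on Newton's laws of motion.")
print(len(prompts))  # 12
```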

Example prompts for generating the same textbook for young children vs for professionals and researchers vs for high school students, courtesy HuggingFace.

Generating Prompts from Web Data

They used text-clustering to break ~100k samples from datasets like RefinedWeb into 145 clusters, then asked Mixtral to label each cluster and give it an educational score out of 10 (based on 10 samples from each) so they could exclude low-value clusters like adult material, celebrity gossip, or obituaries, leaving 112 topics.

The flow of the text-clustering pipeline, courtesy HuggingFace.
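The filtering step after labelling can be sketched roughly as follows. The cluster names and scores are hard-coded stand-ins for what Mixtral would return, and the cutoff is an assumption, since the post doesn’t state the exact threshold used:

```python
# Illustrative cluster labels with 1-10 educational scores, standing in
# for Mixtral's judgements on 10 samples per cluster.
clusters = {
    "machine learning": 9,
    "home gardening": 6,
    "celebrity gossip": 2,
    "obituaries": 1,
}

MIN_SCORE = 3  # hypothetical cutoff

# Keep only clusters judged educational enough to seed textbook prompts.
kept = {topic: score for topic, score in clusters.items() if score >= MIN_SCORE}
print(sorted(kept))  # ['home gardening', 'machine learning']
```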

From these samples, 23 million prompts were constructed. The model was instructed to generate a textbook related to each web sample; in 50% of prompts the scope was limited to the sample’s topic (for diversity, and to mitigate labelling and clustering errors), across multiple audiences and generation styles.

Example of a web extract and the associated prompt, courtesy HuggingFace.
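That 50/50 scope restriction could be sketched like this; the prompt wording and helper function are invented for illustration:

```python
import random

def build_prompt(extract: str, topic: str, rng: random.Random) -> str:
    """Half the prompts pin the generation to the sample's cluster topic;
    half leave the topic open, which also softens labelling mistakes."""
    base = f"Write a textbook chapter related to this web extract:\n{extract}\n"
    if rng.random() < 0.5:
        return base + f"Stay within the topic of {topic}."
    return base + "You may branch out beyond the extract's topic."

rng = random.Random(0)
prompts = [build_prompt("<sample text>", "astronomy", rng) for _ in range(1000)]
restricted = sum("Stay within" in p for p in prompts)
print(restricted)  # roughly half of the 1000 prompts
```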

Adding More Science

AutoMathText, a curated dataset of mathematical texts, was also sampled to increase the amount of so-called “scientific” content.

The distribution of seed data, generation format and target audiences in the Cosmopedia dataset, courtesy HuggingFace.

Dataset Generation

Having generated their prompts, they used the llm-swarm library to generate 25 billion tokens of synthetic content with Mixtral-8x7B-Instruct-v0.1, served via TGI on H100 GPUs from the Hugging Face science cluster, spending over 10k GPU hours to generate the dataset.

Here’s an example using llm-swarm with Mixtral to run 100k Cosmopedia prompts using 2 TGI instances on a Slurm cluster:

# clone the repo and follow the installation requirements
git clone https://github.com/huggingface/llm-swarm
cd llm-swarm
python ./examples/textbooks/generate_synthetic_textbooks.py \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --instances 2 \
    --prompts_dataset "HuggingFaceTB/cosmopedia-100k" \
    --prompt_column prompt \
    --max_samples -1 \
    --checkpoint_path "./tests_data" \
    --repo_id "HuggingFaceTB/generations_cosmopedia_100k" \
    --checkpoint_interval 500

They tracked the generations with wandb to monitor the throughput and number of generated tokens.

Weights and Biases plots for an `llm-swarm` run, courtesy HuggingFace.

Interestingly, they just used HuggingChat for the initial iterations on the prompts, then generated samples for each prompt type using llm-swarm to spot unusual patterns and fix them. Apparently, the model frequently began stories with “Once upon a time…” or “The sun hung low in the sky…”, so they had to explicitly prompt Mixtral to avoid these introductory phrases and to be creative.
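A cheap way to surface such repetitive openings is to tally the first few words of each generation; this is a hypothetical sketch, not the team’s actual tooling:

```python
from collections import Counter

# Illustrative generations; two share the boilerplate opening the team noticed.
samples = [
    "Once upon a time, a robot learned to paint.",
    "Once upon a time, in a small village...",
    "The sun hung low in the sky as the experiment began.",
    "Quantum tunnelling lets particles cross classical barriers.",
]

# Tally each sample's first four words to spot over-used openings.
openings = Counter(" ".join(s.split()[:4]) for s in samples)
for phrase, count in openings.most_common(2):
    print(count, phrase)
```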

Adding More Commonsense

Models trained on just the generated textbooks lacked common sense and fundamental knowledge, so they also created stories incorporating this knowledge, using portions of the UltraChat and OpenHermes2.5 instruction-tuning datasets as seeds, and added them to the seed data. The subsets were chosen to span a broad range of subjects suitable for storytelling: UltraChat’s “Questions about the world” subset covers 30 meta-concepts, while the OpenHermes2.5 “glaive-code-assist” and “camelai” sources (programming and advanced chemistry) were omitted.

Example prompts for generating stories from UltraChat and OpenHermes samples for young children, a general audience & Reddit forums, courtesy HuggingFace.
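The subset selection could be sketched like this; the record contents and the kept source name are invented for illustration, and only the two excluded source names come from the post:

```python
# Source categories named in the post as unsuitable for storytelling.
EXCLUDED_SOURCES = {"glaive-code-assist", "camelai"}

# Illustrative instruction-tuning records with a source tag per sample.
records = [
    {"source": "glaive-code-assist", "text": "Write a Python function..."},
    {"source": "airoboros", "text": "Why is the sky blue?"},
    {"source": "camelai", "text": "Balance this redox reaction..."},
]

# Keep only samples whose source category suits storytelling.
story_seeds = [r for r in records if r["source"] not in EXCLUDED_SOURCES]
print(len(story_seeds))  # 1
```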

Decontamination

To ensure fair benchmark scores, they compared the generated dataset to the benchmarks and dropped contaminated data. The decontamination process was applied across all benchmarks evaluated with the Cosmo-1B model, including MMLU, HellaSwag, PIQA, SIQA, Winogrande, OpenBookQA, ARC-Easy, and ARC-Challenge.
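The post doesn’t restate the method here, but a common decontamination technique is checking for n-gram overlap with benchmark text; a minimal sketch, with the choice of n = 10 as an assumption:

```python
def ngrams(text: str, n: int = 10) -> set:
    """All word n-grams of a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(sample: str, benchmark_ngrams: set, n: int = 10) -> bool:
    """A sample is contaminated if it shares any n-gram with a benchmark."""
    return bool(ngrams(sample, n) & benchmark_ngrams)

benchmark = "the quick brown fox jumps over the lazy dog near the river bank"
bench_ngrams = ngrams(benchmark)

clean = "an unrelated paragraph about training small language models on data"
leaky = "students read that the quick brown fox jumps over the lazy dog near a pond"

print(is_contaminated(clean, bench_ngrams))  # False
print(is_contaminated(leaky, bench_ngrams))  # True
```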

The final dataset can be investigated with their viewer.

Training

They trained Cosmo-1B, a 1B LLM using the Llama 2 architecture, on their generated Cosmopedia dataset to assess its quality, although they remained light on details here, so I guess it went quite smoothly(?). First they used datatrove for data deduplication and tokenization, then nanotron for model training, and finally lighteval for evaluation.
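As a much-simplified illustration of the deduplication step (datatrove’s real pipeline is more sophisticated, e.g. fuzzy deduplication), exact-duplicate removal by content hash looks like:

```python
import hashlib

def dedup(docs: list[str]) -> list[str]:
    """Drop exact duplicate documents, keeping first occurrences in order."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["a textbook on optics", "a story about tides", "a textbook on optics"]
print(len(dedup(docs)))  # 2
```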

Conclusion

The Cosmo team has not found the ideal recipe yet, but they are continuing to work on the problem and have some interesting ideas. It will be interesting to see how their approach changes if they experiment with using different LLMs for generation; I hope they don’t have to re-do all of their prompt engineering efforts for every model.

Personally, I would like to see an attempt to use web data differently, trying to support the curriculum seeds more directly, because I think in their first attempt the lower-value data might have swamped the higher value knowledge.

Regardless of the outcome, publishing the results of all experiments is a valuable contribution of knowledge that we can all build upon. The 🤗 team should be applauded for attempting this reproduction and publicly releasing the results of their work.

I wish them great success in their ongoing efforts!