152334H · 1552 words · 8 minutes · included in tech
Paper summary: LLaMA-like 7B/67B pretrain (Base) + SFT & DPO (Chat). 2T tokens with a strong CN/EN mix, >1M SFT examples. Well-executed exploration of scaling laws, with good details on evals and safety. Not much is said about their actual data.
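
As an aside, the core move in a scaling-law study is fitting a power law to small-scale runs and extrapolating to larger budgets. A minimal sketch of that fit below, with hypothetical data points and a generic L(C) = a * C^b form, not the paper's exact functional form or measurements.

```python
# Illustrative sketch only: a generic power-law fit of loss vs. training compute,
# the basic ingredient of a scaling-law study. The data points are made up.
import numpy as np

# hypothetical (training FLOPs, validation loss) pairs from small-scale runs
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss    = np.array([2.95, 2.78, 2.61, 2.47, 2.33])

# fit log L = log a + b * log C, i.e. L(C) = a * C^b
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)
print(f"L(C) ~= {a:.2f} * C^{b:.4f}")

# extrapolate to a larger compute budget
C_target = 1e22
print(f"predicted loss at {C_target:.0e} FLOPs: {a * C_target ** b:.2f}")
```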
152334H · 1236 words · 6 minutes · included in tech
Paper summary: 1.3B to 33B LLMs trained on 2T code tokens (87 languages) with FIM and 16K seqlen. Strong effort in constructing the pretraining data from GitHub from scratch, with repository-level samples. Evals beat OSS code models solidly and edge out GPT-3.5; Coder-7B often beats CodeLlama-33B.
They don’t spend much effort on instruction tuning. They also run a continued pretrain of DeepSeek LLM -> Coder; I believe it underperforms, though they disagree.
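
For reference, a minimal sketch of FIM (fill-in-the-middle) data construction in PSM (prefix-suffix-middle) order, the standard setup for this kind of training. The sentinel strings below are placeholders rather than DeepSeek-Coder's actual special tokens, and a real pipeline applies this transform to only a fraction of samples.

```python
# Sketch of PSM-style FIM sample construction; sentinel strings are placeholders.
import random

FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"

def to_fim_psm(document: str, rng: random.Random) -> str:
    """Split a document into prefix/middle/suffix and reorder it so the model
    learns to generate the middle conditioned on both surrounding sides."""
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # PSM order: prefix and suffix come first, the middle is the training target
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

rng = random.Random(0)
sample = "def add(a, b):\n    return a + b\n"
print(to_fim_psm(sample, rng))
```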