152334H · 1552 words · 8 minutes · included in tech
Paper summary: LLaMA-like 7B/67B pretrain (Base) + SFT & DPO (Chat). 2T tokens with a strong CN/EN mix, >1M SFT examples. Well-executed exploration of scaling laws, with good details on evals and safety. Not much is said about their actual data.
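
As an aside, the core move in a scaling-law study is fitting a power law to small-scale runs and extrapolating to larger budgets. A minimal sketch of that fit below, with hypothetical data points and a generic L(C) = a * C^b form, not the paper's exact functional form or measurements.

```python
# Illustrative sketch only: a generic power-law fit of loss vs. training compute,
# the basic ingredient of a scaling-law study. The data points are made up.
import numpy as np

# hypothetical (training FLOPs, validation loss) pairs from small-scale runs
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss    = np.array([2.95, 2.78, 2.61, 2.47, 2.33])

# fit log L = log a + b * log C, i.e. L(C) = a * C^b
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)
print(f"L(C) ~= {a:.2f} * C^{b:.4f}")

# extrapolate to a larger compute budget
C_target = 1e22
print(f"predicted loss at {C_target:.0e} FLOPs: {a * C_target ** b:.2f}")
```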
152334H · 1236 words · 6 minutes · included in tech
Paper summary: 1.3B to 33B LLMs trained on 2T code tokens (87 languages) with FIM and 16K seqlen. Strong effort in constructing the pretraining data from GitHub from scratch, with repository-level samples. Evals beat OSS code models solidly and edge out GPT-3.5; Coder-7B often beats CodeLlama-33B.
They don’t spend much effort on instruction tuning. They also run a continued pretrain of DeepSeek LLM -> Coder; I believe it underperforms, though they disagree.
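
For reference, a minimal sketch of FIM (fill-in-the-middle) data construction in PSM (prefix-suffix-middle) order, the standard setup for this kind of training. The sentinel strings below are placeholders rather than DeepSeek-Coder's actual special tokens, and a real pipeline applies this transform to only a fraction of samples.

```python
# Sketch of PSM-style FIM sample construction; sentinel strings are placeholders.
import random

FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"

def to_fim_psm(document: str, rng: random.Random) -> str:
    """Split a document into prefix/middle/suffix and reorder it so the model
    learns to generate the middle conditioned on both surrounding sides."""
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # PSM order: prefix and suffix come first, the middle is the training target
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

rng = random.Random(0)
sample = "def add(a, b):\n    return a + b\n"
print(to_fim_psm(sample, rng))
```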