
152334H

Calculating the Cost of a Google Deepmind Paper

Recently, GDM released a great paper titled Scaling Exponents Across Parameterizations and Optimizers, in which they conduct over 10,000 LLM training runs to obtain optimal hyperparameters under different regimes.

After reading it, I wanted to test my understanding of the paper by tallying up every experiment conducted within it and calculating the total compute cost it would take to replicate the whole thing.
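To make that kind of estimate concrete before diving in, here is a minimal sketch of the arithmetic: training FLOPs per run via the standard 6ND approximation, converted to GPU-hours and dollars. The H100 peak throughput, MFU, and rental price below are illustrative assumptions, not figures taken from the paper or from GDM's actual hardware.

```python
# Minimal sketch: cost of one training run under the 6*N*D approximation.
# All hardware and pricing numbers here are assumptions for illustration.

def run_cost_usd(
    n_params: float,                # model parameters, e.g. 268e6
    n_tokens: float,                # tokens seen during the run
    peak_flops: float = 989e12,     # assumed H100 BF16 dense peak FLOP/s
    mfu: float = 0.40,              # assumed model FLOPs utilization
    usd_per_gpu_hour: float = 2.0,  # assumed GPU rental price
) -> float:
    train_flops = 6 * n_params * n_tokens              # forward + backward estimate
    gpu_hours = train_flops / (peak_flops * mfu) / 3600
    return gpu_hours * usd_per_gpu_hour

# Example: a 268M-param model trained on ~20 tokens/param (Chinchilla-ish).
print(f"~${run_cost_usd(268e6, 20 * 268e6):,.0f} per run")
```

Summing something like this over every run described in the paper is, in essence, what the rest of this post sets out to do.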

DeepSeek Core Readings 0 - Coder

Paper summary: 1.3B to 33B LLMs on 1/2T code tokens (87 langs) w/ FiM and 16K seqlen. Strong effort in constructing pretraining data from GitHub from scratch, with repository-level samples. Evals beat OSS code models solidly + GPT-3.5 a bit; Coder-7B > CodeLlama-33B often.
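As an aside, for anyone unfamiliar with FiM: below is a minimal sketch of prefix-suffix-middle (PSM) style sample construction. The sentinel strings are placeholders, not DeepSeek-Coder's actual special tokens.

```python
# Minimal sketch of fill-in-the-middle (FiM) training sample construction,
# PSM order. Sentinel names are hypothetical placeholders.
import random

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def to_fim_psm(document: str, rng: random.Random) -> str:
    # Pick two cut points, splitting the document into prefix / middle / suffix.
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # PSM order: the model conditions on prefix and suffix, then predicts the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

print(to_fim_psm("def add(a, b):\n    return a + b\n", random.Random(0)))
```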

They don’t spend much effort on instruction tuning. They also do a continued pretrain of DeepSeek LLM -> Coder: I believe it underperforms; the authors don’t.