152334H

Blog Refurbishment

152334H — Sat, 13 Aug 2022 05:30:32 +0100

Under new management

DeepSeek Core Readings 0 - Coder

152334H — Sun, 30 Jun 2024 00:00:00 +0800

Paper summary: 1.3B to 33B LLMs on 1/2T code tokens (87 langs) w/ FiM and 16K seqlen. Strong effort in constructing pretraining data from GitHub from scratch, with repository-level samples. Evals beat OSS code models solidly + GPT-3.5 a bit; Coder-7B > CodeLlama-33B often.

They don’t spend much effort on instruction tuning. They also do a continued pretrain from DeepSeek LLM to Coder: I believe it underperforms; they don’t.

DeepSeek Core Readings 1 - LLM

152334H — Sun, 23 Jun 2024 00:00:00 +0800

Paper summary: LLaMA-like 7B/67B pretrain (Base) + SFT&DPO (Chat). 2T tokens with strong CN/EN mix, >1mil SFT examples. Well-executed exploration of scaling laws. Good details about evals and safety. Not much described about their actual data.

Basic tips for remaining conscious

152334H — Fri, 12 Apr 2024 12:56:00 +0800

I don’t think very often. These are some of the things I’ve tried in the past to rectify that.

2023

152334H — Sun, 31 Dec 2023 23:25:26 +0800

Let’s keep this short.

Rough thoughts on Mixtral vs Open Source

152334H — Wed, 13 Dec 2023 20:12:34 +0800

Here’s a thesis (hypothesis, predicate, etc) to chew on:

The mixture-of-experts paradigm is fundamentally a hindrance to open source development, and mixtral-8x5B+2B will be summarily supplanted by a dense model like llama3/mistral-70b/yi/qwen/… in the near future.

Knowing Enough About MoE to Explain Dropped Tokens in GPT-4

152334H — Wed, 09 Aug 2023 05:15:14 +0800

In a previous blogpost, I made a simple observation about GPT-4 from a paper I had incidentally read on a whim. After finishing the post, I realised I had never actually figured out how token dropping could occur; I had only learned a black-box rule that it could occur in batched MoE inference, for reasons.

This post is here to fix that – to collect enough info from important MoE papers (and alleged GPT-4 leaks) to explain the full mechanism of token dropping.
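
As a rough illustration of the kind of mechanism involved, here is a minimal sketch of capacity-limited routing, the scheme described in several MoE papers. Everything here is an assumption for illustration: the function name `route_with_capacity`, the top-1 routing, and the first-come-first-served ordering are hypothetical, and nothing is claimed about how GPT-4 is actually served.

```python
def route_with_capacity(expert_choices, num_experts, capacity):
    """Assign tokens to their chosen expert in batch order.

    expert_choices[i] is the expert that token i was routed to (top-1).
    Each expert accepts at most `capacity` tokens; later tokens that
    pick a full expert are dropped. Returns one bool per token:
    True if the token was processed, False if it was dropped.
    """
    load = [0] * num_experts
    kept = []
    for expert in expert_choices:
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(True)
        else:
            kept.append(False)
    return kept


# The last token always wants expert 0; only its batch neighbours change.
quiet_batch = [1, 1, 0]  # neighbours went to expert 1
busy_batch = [0, 0, 0]   # neighbours already filled expert 0

kept_quiet = route_with_capacity(quiet_batch, num_experts=2, capacity=2)
kept_busy = route_with_capacity(busy_batch, num_experts=2, capacity=2)
```

Under these assumptions, whether the same token gets processed depends on what else shares its batch, which is the kind of batch-level effect the post sets out to explain.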

Non-determinism in GPT-4 is caused by Sparse MoE

152334H — Sat, 05 Aug 2023 04:09:15 +0800

It’s well-known at this point that GPT-4 and GPT-3.5-turbo are non-deterministic, even at temperature=0.0. This is odd behavior if you’re used to dense decoder-only models, where temp=0 should imply greedy sampling, which in turn should imply full determinism, because the logits for the next token should be a pure function of the input sequence & the model weights.
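
To make the expectation concrete, here is a minimal sketch of why temp=0 is usually read as greedy decoding. The logits and function names are hypothetical; this is plain Python arithmetic, not any provider's sampling code.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_token(logits):
    """The temperature -> 0 limit: index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [1.0, 3.5, 3.4, -2.0]  # hypothetical next-token logits

probs_warm = softmax(logits, 1.0)   # mass spread over tokens 1 and 2
probs_cold = softmax(logits, 0.01)  # nearly all mass on token 1
choice = greedy_token(logits)       # same logits in, same token out
```

As the temperature shrinks, the distribution collapses onto the argmax, so if the logits really are a pure function of the input and the weights, the chosen token is too. Non-determinism therefore has to come from the logits themselves differing between calls.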

Dumped Blog Ideas

152334H — Sun, 02 Jul 2023 01:07:44 +0800

Despite the dead appearance of this blog, I actually think about it surprisingly often! Over the years, I’ve written a number of draft blogs or summaries that I simply ended up dumping at varying stages of completion.

Why can TorToiSe be fine-tuned?

152334H — Thu, 16 Feb 2023 11:18:28 +0800

Five days ago, I published a blog post, describing why TorToiSe could not be fine-tuned.

Today, I have released a fork of DL-Art-School with TorToiSe fine-tuning code. How did that happen?