2023
Let’s keep this short.
Here’s a thesis (hypothesis, prediction, etc.) to chew on:
The mixture-of-experts paradigm is fundamentally a hindrance to open source development, and mixtral-8x5B+2B will be summarily supplanted by a dense model like llama3/mistral-70b/yi/qwen/… in the near future.
In a previous blogpost, I made a simple observation about GPT-4, based on a paper I had read on a whim. After finishing the post, I realised I had never actually figured out how token dropping could occur; I had only learned, as a black-box rule, that it could happen in batched MoE inference, for reasons.
This post is here to fix that – to collect enough info from important MoE papers (and alleged GPT-4 leaks) to explain the full mechanism of token dropping.
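As a preview of the mechanism, here is a minimal sketch of Switch-Transformer-style top-1 routing with a hard per-expert capacity. The function name, shapes, and numbers are my own illustration, not code from any particular paper or framework: the point is only that when an expert’s buffer is full for the current batch, the overflow tokens get dropped.

```python
import torch

def route_top1_with_capacity(router_logits, capacity):
    """Illustrative top-1 routing with a hard per-expert capacity.

    router_logits: [num_tokens, num_experts] scores for every token in the batch.
    capacity: max number of tokens each expert may process this batch.
    Returns each token's assigned expert, with -1 marking dropped tokens.
    """
    num_tokens, num_experts = router_logits.shape
    preferred = router_logits.argmax(dim=-1)           # each token's top expert
    load = torch.zeros(num_experts, dtype=torch.long)  # tokens accepted so far
    assignment = torch.full((num_tokens,), -1)         # -1 == dropped
    for t in range(num_tokens):                        # tokens compete in order
        e = int(preferred[t])
        if load[e] < capacity:
            assignment[t] = e
            load[e] += 1
        # else: the expert's buffer is full, so this token's expert output is
        # dropped (the token only passes through via the residual connection)
    return assignment

# 8 tokens pooled from a batch of sequences, 2 experts, room for 3 tokens each:
torch.manual_seed(0)
print(route_top1_with_capacity(torch.randn(8, 2), capacity=3))
```

Because the capacity budget is shared across everything in the batch, whether a given token gets dropped depends on which other sequences it happened to be batched with.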
It’s well-known at this point that GPT-4/GPT-3.5-turbo is non-deterministic, even at temperature=0.0. This is an odd behavior if you’re used to dense decoder-only models, where temp=0 should imply greedy sampling, which should imply full determinism, because the logits for the next token should be a pure function of the input sequence & the model weights.
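Concretely, here’s the kind of decoding rule I mean, with a toy stand-in model rather than anyone’s real inference code: with fixed weights, the next-token logits are a function of the input ids alone, so temperature=0 argmax can’t vary between runs.

```python
import torch

def sample_next(logits, temperature):
    """Pick the next token id from a row of next-token logits."""
    if temperature == 0.0:
        return int(logits.argmax())                    # greedy: no randomness
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, 1))            # stochastic otherwise

# Toy stand-in for a dense LM: logits are a pure function of the input ids
# and the (fixed) random weights below.
torch.manual_seed(0)
vocab, dim = 100, 16
emb, head = torch.randn(vocab, dim), torch.randn(dim, vocab)

def toy_logits(input_ids):
    return emb[input_ids].mean(dim=0) @ head

ids = torch.tensor([3, 14, 15, 92, 65])
runs = [sample_next(toy_logits(ids), temperature=0.0) for _ in range(5)]
assert len(set(runs)) == 1   # temp=0 on a dense model: always the same token
```

The sparse-MoE wrinkle is that, with capacity-limited routing, the logits stop being a pure function of your sequence alone: they also depend on whatever else landed in the batch.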
Despite the dead appearance of this blog, I actually think about it surprisingly often! Over the years, I’ve written a number of draft blogs and summaries that I simply ended up dumping at varying stages of completion.