In a previous blogpost, I made a simple observation about GPT-4 from a paper I had read on a whim. After finishing the post, I realised I never actually figured out how token dropping could occur; I had only learned the black-box rule that it can happen in batched MoE inference, without understanding why.
This post is here to fix that – to collect enough info from important MoE papers (and alleged GPT-4 leaks) to explain the full mechanism of token dropping.
It’s well-known at this point that GPT-4/GPT-3.5-turbo is non-deterministic, even at temperature=0.0. This is odd behavior if you’re used to dense decoder-only models, where temp=0 should imply greedy sampling, which should imply full determinism, because the logits for the next token should be a pure function of the input sequence and the model weights.
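To make that intuition concrete, here’s a minimal sketch (a toy stand-in, not anyone’s actual inference stack): if the logits are a pure function of the weights and the input tokens, then temperature-0 decoding degenerates to `argmax`, and every run must pick the same token.

```python
import numpy as np

def logits(weights: np.ndarray, token_ids: list[int]) -> np.ndarray:
    # Stand-in for a real forward pass: any fixed, side-effect-free
    # function of (weights, input tokens) has the same property.
    counts = np.zeros(weights.shape[0])
    for t in token_ids:
        counts[t % weights.shape[0]] += 1.0
    return weights.T @ counts

def greedy_next_token(weights: np.ndarray, token_ids: list[int]) -> int:
    # temperature=0.0 degenerates to argmax over the logits
    return int(np.argmax(logits(weights, token_ids)))

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 32))
prompt = [3, 14, 15]

# Same weights + same prompt -> same next token, every single run.
runs = {greedy_next_token(W, prompt) for _ in range(100)}
print(len(runs))  # → 1
```

The surprise with GPT-4, then, is that this purity assumption breaks: in batched MoE inference, the logits for your sequence can depend on what *else* is in the batch.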
Despite the dead appearance of this blog, I actually think about it surprisingly often! Over the years, I’ve written a number of draft blogs or summaries that simply ended up dumped at varying stages of completion.