Basic tips for remaining conscious
I don’t think very often. These are some of the things I’ve tried in the past to rectify that.
2023
Let’s keep this short.
Rough thoughts on Mixtral vs Open Source
Here’s a thesis (hypothesis, predicate, etc.) to chew on:
The mixture-of-experts paradigm is fundamentally a hindrance to open source development, and mixtral-8x5B+2B will be summarily supplanted by a dense model like llama3/mistral-70b/yi/qwen/… in the near future.
Knowing Enough About MoE to Explain Dropped Tokens in GPT-4
In a previous blogpost, I made a simple observation about GPT-4 based on a paper I’d happened to read on a whim. After finishing the post, I realised I never actually figured out how token dropping could occur; I’d only learned the black-box rule that it can happen in batched MoE inference, for reasons.
This post is here to fix that – to collect enough info from important MoE papers (and alleged GPT-4 leaks) to explain the full mechanism of token dropping.
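For context on the rule being referenced, here is a minimal sketch (not taken from the post itself; the routing scheme, names, and numbers are illustrative assumptions) of how capacity-based top-1 routing ends up dropping tokens once an expert’s slots fill up:

```python
import numpy as np

# Toy illustration: top-1 routing with a fixed per-expert capacity,
# the Switch-Transformer-style setup in which token dropping arises.
rng = np.random.default_rng(0)

num_tokens, num_experts = 16, 4
capacity_factor = 1.0
capacity = int(capacity_factor * num_tokens / num_experts)  # 4 slots per expert

router_logits = rng.normal(size=(num_tokens, num_experts))
expert_choice = router_logits.argmax(axis=-1)  # top-1 expert per token

slots_used = np.zeros(num_experts, dtype=int)
dropped = []
for tok, e in enumerate(expert_choice):
    if slots_used[e] < capacity:
        slots_used[e] += 1    # token claims an expert slot
    else:
        dropped.append(tok)   # expert is full: token is "dropped", i.e. it
                              # skips the expert FFN (typically passing through
                              # the residual connection instead)

print(f"dropped tokens: {dropped}")
```

Whether a given token gets dropped in a setup like this depends on which other tokens it happens to share a batch with, which is the batch-dependent behaviour the black-box rule above gestures at.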