Why Language Models?
Over the course of the last month or so, I’ve been working on a webapp text editor that uses GPT-J, a language model, to perform autocomplete. If you don’t know what that is, then I hope you’ll enjoy some of the links in this blogpost.
But for the majority that does know what language models are, it’s safe to say that I’ve done nothing complex. Technologically, I made a React app with a text editor derived from Slate.js, and connected that to a FastAPI backend which throws requests to huggingface’s
None of this is revolutionary. There are many solutions online that do it far better. EleutherAI hosts their own free demo page that runs a lot more elegantly than my webapp. OpenAI’s GPT-3 models are a lot better than anything open source can provide. And companies like NovelAI corner the submarket of people who want to do more specific tasks, like writing certain kinds of fiction.
So, why am I even working on any of this?
Models need less VRAM now
Back in August, someone published a paper titled LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. I recommend reading the huggingface article on it if you’re interested, but in short, to quote the paper:
“With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.”
Now, 175B parameters is still pretty big. With Int8, each parameter takes one byte, so the weights alone would need about 175 gigabytes of memory, which is still well in the category of “not for personal use”.
But the improvements apply to any language model built on the transformer architecture. And there are many great models that are now accessible to a much larger section of the general population because of this.
| Model | 3050 (4GB) | 2080 Ti (11GB) | Tesla T4 (16GB) | 3090 (24GB) |
✅ - int8 improvement | ⬛ - no change | ❌ - int8 insufficient
If you have an RTX 3090, you can run GPT-NeoX-20B or CodeGeeX-13B. If you have a 2080 Ti, you can run GPT-J-6B or InCoder-6B. And if you have enough memory to run Stable Diffusion, you can run CodeGen-2B.
That last example is particularly motivating, because of the next section.
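The fit checks in this section can be approximated with some back-of-the-envelope arithmetic. The sketch below is only an illustration: the function names are made up for this post, the parameter counts are the nominal ones from the model names, and it counts weights only (activations and other runtime overhead are ignored, so real usage will be higher). It also assumes that "no change" in the legend means the model already fit at 16-bit.

```python
# Weights-only estimate; real memory use is higher once activations
# and framework overhead are included.
BYTES_FP16 = 2  # 16-bit weights
BYTES_INT8 = 1  # 8-bit weights


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def classify(num_params: float, vram_gb: float) -> str:
    """Label a model/GPU pair using the wording of the table legend."""
    if weight_memory_gb(num_params, BYTES_FP16) <= vram_gb:
        return "no change"          # assumed meaning: already fits at 16-bit
    if weight_memory_gb(num_params, BYTES_INT8) <= vram_gb:
        return "int8 improvement"   # only fits once converted to int8
    return "int8 insufficient"      # too large even in int8


# Nominal sizes taken from the model names (6B for GPT-J, 20B for GPT-NeoX).
print(weight_memory_gb(175e9, BYTES_INT8))  # 175.0
print(classify(6e9, 11))    # GPT-J-6B on an 11GB card
print(classify(20e9, 24))   # GPT-NeoX-20B on a 24GB card
```

Under these assumptions, GPT-J-6B needs about 12 GB at 16-bit, which just misses an 11GB card, and about 6 GB in int8, which fits.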
Advances in sampling strategies
Earlier this month, huggingface added contrastive search to their transformers library. While I’m not at all qualified to describe what it does (and whether it is ‘novel’ or ‘obvious’), I find their results rather encouraging.
A 3% jump might not sound like much, but it puts CodeGen-2B at the same level as Codex-2.5B. This puts open source replacements for Copilot (like fauxpilot) at the same level of code completion competency.
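For the curious, here is a toy sketch of the selection step as I understand it from the contrastive search paper. It is not huggingface's implementation, and the function names and inputs are invented for illustration. The idea: among the top-k most probable next tokens, prefer one that is likely but whose hidden state is not too similar to the tokens already generated. In the transformers library, this mode is switched on by passing `penalty_alpha` and `top_k` to `generate`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_pick(candidates, context_states, alpha=0.6):
    """Pick one token from the top-k candidates.

    candidates: list of (token, probability, hidden_state) tuples,
        assumed to be the top-k tokens already chosen by the model.
    context_states: hidden states of the tokens generated so far.
    alpha: weight of the similarity penalty (0 means plain greedy choice).
    """
    best_token, best_score = None, -math.inf
    for token, prob, state in candidates:
        # Penalty: how closely this candidate resembles any earlier token.
        penalty = max((cosine(state, h) for h in context_states), default=0.0)
        score = (1 - alpha) * prob - alpha * penalty
        if score > best_score:
            best_token, best_score = token, score
    return best_token


# Made-up two-dimensional hidden states, purely for illustration.
context = [[1.0, 0.0]]
top_k = [("the", 0.6, [1.0, 0.0]), ("cat", 0.4, [0.0, 1.0])]
print(contrastive_pick(top_k, context, alpha=0.6))  # cat
print(contrastive_pick(top_k, context, alpha=0.0))  # the
```

With the penalty switched off, the most probable token wins. With it on, the candidate that merely repeats what the context already looks like gets pushed down, which is roughly why this strategy is less prone to repetitive output.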
Contrastive search also does a lot better at long-form writing than other sampling strategies, which is great because:
I wanted to write blogposts again
I’m not very good at writing. While I don’t think the things I publish are terrible, I often feel that I take way too long to get from ‘idea’ to ‘written essay’. And I’m sure that’s not a unique problem, but it’s the kind of problem that a lot of people seem to shrug at and say one of two things:
“Guess I have to try harder.”
“Guess I can’t do much of that.”
I don’t like either of these options. The third option, “Make a computer do it for you,” is what language models offer. But I also don’t really like sending my drafted blogposts to a remote SaaS, so I wanted a solution that could run locally on my own hardware.
And that was surprisingly difficult to find online. I did some googling, asked a few communities, double-checked a laundry list of github tags to make sure I didn’t miss anything, and somehow I just found nothing. I’m still 90% certain someone has already built an “open source webapp editor that uses GPT-J,” but for the life of me, I couldn’t find it. The search results I got were polluted with solutions that, while open source, were only designed to send requests to OpenAI’s GPT-3 API. Great for most people; not what I’m looking for.
So, I got to work on a simple tool that would help me to run GPT-J locally, thinking it would take me less than a weekend to finish.
The next few blogposts in this series will cover how I ended up spending a month doing just that.