It is only a matter of time before we’re running 40B+ parameter models at home (casually).
I guess that’s kind of my problem. :) With 64GB of RAM you can run 40B, 65B, or 70B parameter quantized models pretty casually. It’s not super fast, but I don’t really have a specific “use case”, so something like 600ms/token (about 1.7 tokens/second) is acceptable. That being the case, how do I get excited about a 7B or 13B? It would have to do something really special that even the bigger models can’t.
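For anyone wondering how a 70B fits in 64GB, here’s a rough back-of-the-envelope. The ~4.5 bits/weight figure (4-bit q4_0-style quantization plus per-block scales) and the overhead estimates are ballpark assumptions, not measured numbers:

```python
# Rough RAM estimate for the weights of a quantized model.
# Assumption: ~4.5 effective bits per weight for 4-bit quantization
# once per-block scale factors are included.

def quantized_weights_gib(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

for n in (40, 65, 70):
    print(f"{n}B @ ~4.5 bits/weight: ~{quantized_weights_gib(n):.0f} GiB")

# 40B @ ~4.5 bits/weight: ~21 GiB
# 65B @ ~4.5 bits/weight: ~34 GiB
# 70B @ ~4.5 bits/weight: ~37 GiB
```

So a 70B lands around 37 GiB of weights, which leaves headroom in 64GB for the KV cache, the runtime, and the OS.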
I assume they’ll be working on a Vicuna-70B 1.5 based on LLaMA 2, so I’ll definitely try that one out when it’s released, assuming it performs well.