I may have been doing something wrong, but in my experience llama.cpp with OpenCL offloading isn’t much faster than CPU-only: CPU usage stays the same, with the addition of my GPU making typewriter noises.
I have written this gist to run fastchat-t5-3b-v1.0 using Intel’s IPEX, and it runs quite well. I have an A770 16GB, but it seems to use under 8GB when running in bfloat16. It could easily be modified to run something else, though.
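For reference, the rough shape of the approach looks like this. This is a minimal sketch, not the gist itself: the model id, tokenizer choice, and generation settings are illustrative assumptions, and it falls back to CPU when no XPU is present so the structure is still visible without an Arc card.

```python
# Sketch: load fastchat-t5-3b-v1.0 in bfloat16 on Intel's "xpu" device via IPEX.
# Assumes torch, transformers, and (on Arc hardware) intel_extension_for_pytorch
# are installed; details may differ from the actual gist.
import torch

try:
    import intel_extension_for_pytorch as ipex  # registers the "xpu" device with PyTorch
    HAS_XPU = torch.xpu.is_available()
except (ImportError, AttributeError):
    HAS_XPU = False

DEVICE = "xpu" if HAS_XPU else "cpu"  # fall back to CPU so the sketch still runs


def load_model(model_id: str = "lmsys/fastchat-t5-3b-v1.0"):
    """Load tokenizer + model in bfloat16 and optimize with IPEX when available."""
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model = model.to(DEVICE).eval()
    if HAS_XPU:
        model = ipex.optimize(model, dtype=torch.bfloat16)
    return tokenizer, model


def generate(prompt: str, tokenizer, model, max_new_tokens: int = 128) -> str:
    """Run one prompt through the seq2seq model and decode the reply."""
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

Swapping in another seq2seq checkpoint should mostly be a matter of changing `model_id`, as long as it fits in VRAM at bfloat16.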
Or if you want a GUI (or a nice CLI), I’ve added support for Intel XPUs in FastChat.