If they manage to fit larger and more recent models onto it, it could greatly reduce the energy needed to run these models.
Google's recent diffusion LLM is also really exciting, and I think diffusion models might become the architecture of choice when running on general-purpose GPUs, especially consumer cards, which are usually memory-constrained but have plenty of compute.
Yeah, that is pretty insane at 17k tokens per second. It feels as if the answers are already cached, just waiting for your prompt.
https://chatjimmy.ai/