I’m on a MacBook with an M2 and 32GB of RAM. Literally just tried:
Well, I guess I’ll try again next year.
For context: my home PC is running gemma4:31b just fine. It’s also a beefy ass desktop, though.
Are you running an MLX model? If not, try that. My M4 MacBook runs qwen3.6-35b-a3b lightning fast. It has its issues, but it’s fast nonetheless.
What kind of context length can you get with that, and how much RAM do you have?
I have a model with 64GB of RAM. I’ve limited context to 16k in an effort to make it more stable, but tbh it’s rather unreliable no matter what I do. With my setup (mlx_lm and webui), it frequently collapses or loops, regardless of the settings. I’ve done a lot of debugging and have concluded it’s probably inherent model behavior.
That’s lame about the looping, but yeah, I don’t think that’s an MLX issue. I’ve had it on my desktop with my NVIDIA card as well. I also tried fussing with configurations, and I was never sure if it was the models or my settings. I was mainly toying around with Llama-based models.
You might be doing something wrong. Models that size shouldn’t be that slow on a 32GB M2 if they’re properly configured.
You need a Metal-optimized client and model, not the same models you’d run on your desktop machine.