this post was submitted on 25 Oct 2025
79 points (94.4% liked)

LocalLLaMA

3822 readers
11 users here now

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped about the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive constructive way.

Rules:

Rule 1 - No harassment or personal character attacks of community members, i.e. no name-calling, no generalizing about entire groups of people that make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming that the resource usage required to train a model is anything close to that of maintaining a blockchain or mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. no statements such as "LLMs are basically just simple text prediction like what your phone keyboard autocorrect uses, and they're still using the same algorithms as <over 10 years ago>."

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.

founded 2 years ago
top 10 comments
[–] afk_strats@lemmy.world 5 points 1 week ago (2 children)

I'm not sure if it's a me issue, but that's a static image. I figure you posted the one where they throw a brick into it.

Also, if this post was serious: how does a highly quantized model compare to a less quantized one with fewer parameters? I haven't seen benchmarks for that other than perplexity, which isn't a good measure of capability.

[–] Xylight@lemdro.id 5 points 1 week ago (2 children)

It's a WebP animation. Maybe your client doesn't display it right; I'll replace it with a GIF.

Regarding your other question, I tend to see better results with higher parameter counts at lower precision than with lower parameter counts at higher precision. That's just based on "vibes" though; I haven't done any real testing. Based on what I've seen, Q4 is the lowest safe quantization, and below that the performance really starts to drop off. Unfortunately, even at 1-bit quantization I can't run GLM 4.6 on my system.
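
For a rough sense of why bigger-but-more-quantized tends to win on memory, the usual back-of-envelope rule is weight memory ≈ parameters × bits per weight / 8; the effective bits-per-weight figures below are my own approximations, and KV cache and runtime overhead are ignored entirely:

# Rough weight-memory estimate; bits-per-weight values are approximations,
# and KV cache / activations / runtime overhead are not counted.
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

for name, params, bits in [
    ("7B @ Q8_0", 7, 8.5),      # quant blocks carry scale metadata, hence > 8 bits
    ("14B @ Q4_K_M", 14, 4.8),
    ("30B @ Q2_K", 30, 2.7),
]:
    print(f"{name}: ~{weight_gib(params, bits):.1f} GiB of weights")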

[–] hendrik@palaver.p3x.de 3 points 1 week ago* (last edited 1 week ago) (1 children)

What's higher precision for you? What I remember from the old ggml measurements is that lower than Q3 rarely makes sense, and at roughly Q3 you'd start thinking about switching to a smaller variant. On the other hand, everything above Q6 shows only marginal differences in perplexity, so Q6, Q8, and full precision are basically the same thing.

[–] Xylight@lemdro.id 3 points 1 week ago* (last edited 1 week ago) (1 children)

As a memory-poor user (hence the 8 GB VRAM card), I consider Q8+ to be high precision, Q4-Q5 to be mid-to-low precision (what I typically use), and anything below that to be low precision.

[–] hendrik@palaver.p3x.de 4 points 1 week ago* (last edited 1 week ago)

Thanks, that sounds reasonable. Btw, you're not the only poor person around; I don't even own a graphics card... I'm not a gamer, so I never saw any reason to buy one before I took an interest in AI. I do inference on my CPU, and that's connected to more than 8 GB of memory. It's just slow 😉 But I guess I'm fine with that. I don't rely on AI; it's just tinkering, and I'm patient. And a few times a year I'll rent some cloud GPU by the hour. Maybe one day I'll buy one myself.

[–] afk_strats@lemmy.world 2 points 1 week ago

That fixed it.

I am a fan of this quant cook. He often posts perplexity charts.

https://huggingface.co/ubergarm

All of his quants require ik_llama, which works best with Nvidia CUDA, but they can do a lot with RAM + VRAM, or even hard drive + RAM. I don't know if 8 GB is enough for everything.
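
For the RAM + VRAM split, the general idea (shown here with the mainline llama-cpp-python bindings rather than ik_llama itself; the path and numbers are made up) is that you pick how many layers live on the GPU and the rest stay in system RAM:

from llama_cpp import Llama

# Hypothetical model path; n_gpu_layers decides how many transformer layers
# go to VRAM, the remaining layers are served from system RAM (mmapped from disk).
llm = Llama(
    model_path="./models/example-Q4_K_M.gguf",
    n_gpu_layers=20,   # lower this until it fits in 8 GB of VRAM
    n_ctx=8192,
)

print(llm("Q: What does Q4 quantization mean?\nA:", max_tokens=64)["choices"][0]["text"])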

[–] hendrik@palaver.p3x.de 4 points 1 week ago* (last edited 1 week ago)

I think perplexity is still central to evaluating models. It's notoriously difficult to come up with other ways to measure these things.
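
For anyone unfamiliar, perplexity is just the exponentiated average negative log-likelihood the model assigns to some held-out text; a minimal sketch of the computation (the probabilities here are made up):

import math

# Per-token probabilities a model assigned to a held-out text (made-up values).
token_probs = [0.21, 0.05, 0.62, 0.33, 0.11]

# perplexity = exp(-(1/N) * sum(log p_i)); lower means the model was less "surprised".
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(f"perplexity = {math.exp(nll):.2f}")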

[–] TheMightyCat@ani.social 4 points 1 week ago (1 children)

I would suggest trying exllamav3 once; I have no idea what kind of black magic they use, but it's very memory efficient.

I can't load Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 with 16K context using vLLM,

but using exllamav3 I can SOMEHOW load ArtusDev/Qwen_Qwen3-Coder-30B-A3B-Instruct-EXL3:8.0bpw_H8 at its full context of 262,144 tokens with 2 GiB still to spare.

I really feel like this is too good to be true and I'm doing something wrong, but it just works, so I don't know.

[–] fhein@lemmy.world 1 points 1 week ago (1 children)

I guess there's some automatic VRAM paging going on. How many tokens per second do you get while generating?

[–] TheMightyCat@ani.social 1 points 1 day ago* (last edited 1 day ago)

I found the reason: somehow setting --max_num_seqs 1 makes vLLM way more efficient.

I'm not sure exactly what it does, but I think it's because vLLM batches requests and the API I was using with exllamav3 doesn't.
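
If anyone wants to reproduce this, the same knob is exposed through vLLM's offline Python API; a rough sketch (the argument names are standard vLLM options, but the values other than max_num_seqs are just my starting guesses):

from vllm import LLM, SamplingParams

# Limiting the engine to one in-flight request seems to be what frees the memory
# it would otherwise set aside for batching; max_num_seqs=1 is the key setting.
llm = LLM(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
    tensor_parallel_size=2,
    max_model_len=100_000,
    max_num_seqs=1,
    gpu_memory_utilization=0.95,  # guessed starting value
)

out = llm.generate(["Write a hello world in Rust."], SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)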

Now I'm doing 100k with vLLM too:

(Worker_TP0_EP0 pid=99695) INFO 11-03 17:34:00 [gpu_worker.py:298] Available KV cache memory: 4.73 GiB
(Worker_TP1_EP1 pid=99696) INFO 11-03 17:34:00 [gpu_worker.py:298] Available KV cache memory: 4.73 GiB
(EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1087] GPU KV cache size: 103,264 tokens
(EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1091] Maximum concurrency for 100,000 tokens per request: 1.03x
(EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1087] GPU KV cache size: 103,328 tokens
(EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1091] Maximum concurrency for 100,000 tokens per request: 1.03x
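
Those numbers also line up with the usual KV-cache arithmetic, assuming Qwen3-30B-A3B uses 48 layers, 4 KV heads, and a head dim of 128 with an fp16 cache (those config values are from memory, so double-check them):

# Back-of-envelope KV cache check; the model config values are assumptions,
# not something stated in this thread.
layers, kv_heads, head_dim, bytes_per_elem = 48, 4, 128, 2  # fp16 cache
tp = 2  # tensor-parallel degree; each GPU holds kv_heads / tp of the heads

bytes_per_token_per_gpu = 2 * layers * (kv_heads // tp) * head_dim * bytes_per_elem
available_bytes = 4.73 * 2**30  # "Available KV cache memory: 4.73 GiB" per GPU

print(bytes_per_token_per_gpu)                         # ~49,152 bytes (~48 KiB) per token per GPU
print(int(available_bytes / bytes_per_token_per_gpu))  # ~103k tokens, matching the log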

I would say exllamav3 is still slightly more efficient, but this explains the huge discrepancy. exllamav3 also lets you set GB per GPU, which gets me a few more GB than vLLM, which spreads it evenly even though a bunch of memory on GPU 0 is used for other stuff.

As for the T/s, it's about the same, in the 80-100 range. This is what I'm getting with vLLM:

(APIServer pid=99454) INFO 11-03 17:36:31 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /tokenize HTTP/1.1" 200 OK
[...]
(APIServer pid=99454) INFO 11-03 17:36:34 [loggers.py:127] Engine 000: Avg prompt throughput: 461.4 tokens/s, Avg generation throughput: 17.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 66.9%
[...]
(APIServer pid=99454) INFO 11-03 17:36:44 [loggers.py:127] Engine 000: Avg prompt throughput: 1684.4 tokens/s, Avg generation throughput: 96.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.4%, Prefix cache hit rate: 83.4%

Now that I've found this out, I've switched back to vLLM, because the API I'm using with exllamav3 doesn't support Qwen 3 tools yet :(