Repeats a single word or phrase endlessly every single time I try to prompt
Using the following: GLM-4.7-REAP-218B-A32B-UD-Q2_K_XL with the latest llama.cpp as of right now.
It does this during reasoning on the first prompt. I've tried messing with repeat penalty, temp, min-p, top-p. Nothing stops it.
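(For context, those samplers only filter the next-token distribution before drawing, so none of them can fix a model whose weights are broken. A simplified single-pass sketch of top-p followed by min-p; the real llama.cpp sampler chain is configurable and may apply them in a different order:)

```python
def filter_top_p_min_p(probs, top_p=0.95, min_p=0.0):
    """Simplified sketch of top-p then min-p filtering over a probability list."""
    # Sort token ids by probability, descending
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Min-p: drop anything below min_p times the top token's probability
    pmax = probs[order[0]]
    keep = [i for i in keep if probs[i] >= min_p * pmax]
    # Renormalize the survivors
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}
```

If the model puts almost all its mass on one repeated token, every setting of these knobs still samples that token, which matches what people are seeing here.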
With other versions of REAP I've encountered similar issues, where the model would attempt to access something that had apparently been pruned away, for example a language other than English.
The REAP method is apparently lossy in some areas, and if your use case depends on those, you might as well try a lower quant of the full model instead.
Have you checked the checksums of the downloaded files?
I've not checked the checksums. One thing to note is that the full (non-REAP) GLM-4.7 Q4_K_XL works great.
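For anyone who wants to rule out a corrupted download without re-downloading, a quick sketch that streams a shard and computes its SHA-256 (compare against the hash shown on the model's Hugging Face file page; the path below is just an example):

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream the file in 1 MiB chunks so multi-GB GGUF shards
    don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Example (substitute your actual shard path):
# sha256_of("GLM-4.7-REAP-218B-A32B-UD-Q2_K_XL-00001-of-00002.gguf")
```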
One interesting thing: the word or phrase it repeats is always clothing-related. If I write a story unrelated to what a character may be wearing, it generates fine; I have not had it repeat anything else. It happens even if I don't mention clothing: whether or not the characters are described, the reasoning will start a section deciding what clothes the characters have, and that is where it always fails.
Same here. REAP/Q2_K_XL broken
Also on an RTX 6000 Pro? Getting about those tokens/sec too.
Can confirm Q3_K_M, UD-Q4_K_XL, and also Q5_K_S repeat a word infinitely (different each time, but always clothing-related). I've also noticed the models load extremely quickly, about 10x regular loading speed (in KoboldCpp this happens if you reload a model that is already in VRAM), but I was loading both cold.
Programming-related prompts generate fine.
Correct me if I'm wrong, but I believe the REAP model produced by Cerebras (on which these quants are based) used calibration datasets centred on code generation, tool calling, and agentic use for the pruning process, so you'd expect disproportionate loss of proficiency in other use cases.
Hey all, sorry for the delay - yep, I can repro. The REAP models do in fact generate repeating words about clothing, as @Disdrix's example showed, i.e. running:
./llama.cpp/llama-cli --model unsloth/GLM-4.7-REAP-218B-A32B-GGUF/UD-Q2_K_XL/GLM-4.7-REAP-218B-A32B-UD-Q2_K_XL-00001-of-00002.gguf \
--fit on --flash-attn on --jinja --temp 1.0 --top-p 0.95 --min-p 0.00 \
--prompt "write a story about john. He is suddenly transported to new york city. he is wearing shorts and a tshirt."
produces:
The normal NON REAP Q2_K_XL generates correctly:
./llama.cpp/llama-cli --model unsloth/GLM-4.7-GGUF/UD-Q2_K_XL/GLM-4.7-UD-Q2_K_XL-00001-of-00003.gguf \
--fit on --flash-attn on --jinja --temp 1.0 --top-p 0.95 --min-p 0.00 \
--prompt "write a story about john. He is suddenly transported to new york city. he is wearing shorts and a tshirt."
gives:
As @names-are-for-friends hypothesized, the REAP models are mostly calibrated for coding, whilst our non-REAP dynamic GGUFs use diverse datasets.
I would use REAP primarily for coding tasks - I tried 20 conversations and it works well.
However, other REAPed models don't seem to exhibit this issue - it might be luck.
Looks like we can handle it with frequency-penalty/repeat-penalty settings?




