Repeats a single word or phrase endlessly every single time I try to prompt
Using the following: GLM-4.7-REAP-218B-A32B-UD-Q2_K_XL with the latest llama.cpp as of right now.
It does this during reasoning on the first prompt. I've tried messing with repeat penalty, temp, min-p, top-p. Nothing stops it.
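(For context, those samplers only filter the next-token distribution before drawing, so none of them can fix a model whose weights are broken. A simplified single-pass sketch of top-p followed by min-p; the real llama.cpp sampler chain is configurable and may apply them in a different order:)

```python
def filter_top_p_min_p(probs, top_p=0.95, min_p=0.0):
    """Simplified sketch of top-p then min-p filtering over a probability list."""
    # Sort token ids by probability, descending
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Min-p: drop anything below min_p times the top token's probability
    pmax = probs[order[0]]
    keep = [i for i in keep if probs[i] >= min_p * pmax]
    # Renormalize the survivors
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}
```

If the model puts almost all its mass on one repeated token, every setting of these knobs still samples that token, which matches what people are seeing here.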
With other versions of REAP I've encountered similar issues, where the model would attempt to access something that had apparently been pruned away, for example a language other than English.
The REAP method is apparently lossy in some areas, and if your use case depends on those, you might as well try a lower quant of the full model instead.
Have you checked the checksums of the downloaded files?
I've not checked the checksums. One thing to note is that the full (non-REAP) GLM-4.7 Q4_K_XL works great.
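For anyone who wants to rule out a corrupted download without re-downloading, a quick sketch that streams a shard and computes its SHA-256 (compare against the hash shown on the model's Hugging Face file page; the path below is just an example):

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream the file in 1 MiB chunks so multi-GB GGUF shards
    don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Example (substitute your actual shard path):
# sha256_of("GLM-4.7-REAP-218B-A32B-UD-Q2_K_XL-00001-of-00002.gguf")
```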
One interesting thing: the word or phrase it repeats is always clothing-related. If I write a story unrelated to what a character may be wearing, it generates fine; I have not had it repeat anything else. It happens even if I don't mention clothing: whether or not the characters are described, the reasoning will start a section deciding what clothes the characters have, and that is where it always fails.
Same here. REAP/Q2_K_XL broken
Also on an RTX 6000 Pro? Getting about those tokens/sec too.
Can confirm Q3_K_M, UD-Q4_K_XL, and also Q5_K_S repeat a word infinitely (different each time, but always clothing-related). I've also noticed the models load extremely quickly, about 10x regular loading speed (in KoboldCpp this happens if you reload a model that is already in VRAM), but I was loading both cold.
Programming-related prompts generate fine.
Correct me if I'm wrong, but I believe the REAP model produced by Cerebras (on which these quants are based) used calibration datasets centred on code generation, tool calling, and agentic use for the pruning process, so you'd expect disproportionate loss of proficiency in other use cases.
Hey all, sorry for the delay - yep, I can repro. The REAP models do in fact generate repeating words about clothing, as @Disdrix's example showed, i.e. running:
./llama.cpp/llama-cli --model unsloth/GLM-4.7-REAP-218B-A32B-GGUF/UD-Q2_K_XL/GLM-4.7-REAP-218B-A32B-UD-Q2_K_XL-00001-of-00002.gguf \
--fit on --flash-attn on --jinja --temp 1.0 --top-p 0.95 --min-p 0.00 \
--prompt "write a story about john. He is suddenly transported to new york city. he is wearing shorts and a tshirt."
produces:
The normal NON REAP Q2_K_XL generates correctly:
./llama.cpp/llama-cli --model unsloth/GLM-4.7-GGUF/UD-Q2_K_XL/GLM-4.7-UD-Q2_K_XL-00001-of-00003.gguf \
--fit on --flash-attn on --jinja --temp 1.0 --top-p 0.95 --min-p 0.00 \
--prompt "write a story about john. He is suddenly transported to new york city. he is wearing shorts and a tshirt."
gives:
As @names-are-for-friends hypothesized, the REAP models are mostly calibrated for coding, whilst our non-REAP dynamic GGUFs use diverse datasets.
I would use REAP primarily for coding tasks - I tried 20 conversations and it works well.
However, other REAPed models don't seem to exhibit this issue - it might be luck.
Looks like we can handle it with frequency-penalty/repeat-penalty settings?




