REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression
Paper: arXiv:2510.13999
40% Expert-Pruned + INT4 Quantized GLM-4.7 (218B total / 32B active params, ~108GB)
A highly compressed version of GLM-4.7 combining REAP expert pruning (40% of experts removed) with INT4 weight quantization (AutoRound W4A16). The result is ~6.5x smaller than the original ~717GB GLM-4.7.
| Property | Value |
|---|---|
| Base Model | GLM-4.7-REAP-218B-A32B |
| Original (GLM-4.7) | 358B params, ~717GB |
| After REAP Pruning | 218B params, ~407GB |
| After W4A16 Quant | 218B params, ~108GB |
| Active Parameters | 32B per forward pass |
| Total Compression | ~6.5x from original |
| Quantization | INT4 weights, FP16 activations |
| Group Size | 128 |
| Format | AutoRound |
| VRAM Required | ~110GB |
```
GLM-4.7 (358B, ~717GB)
        |
        v  REAP pruning: 40% of experts removed (96/160 kept per layer)
        |
GLM-4.7-REAP-218B-A32B (218B, ~407GB)
        |
        v  AutoRound W4A16 quantization
        |
GLM-4.7-REAP-218B-A32B-W4A16 (218B, ~108GB)  <-- this model

Total: ~6.5x compression
```
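The sizes in this pipeline can be sanity-checked with back-of-envelope arithmetic (a rough sketch only: BF16 at 2 bytes/param, W4A16 at 4 bits/param plus one FP16 scale per group of 128; embeddings, norms, and zero-points are ignored):

```python
# Back-of-envelope checkpoint-size estimates (decimal GB = 1e9 bytes).
# Note: the ~407GB figure in the table matches 436e9 bytes expressed
# in GiB (436e9 / 2**30 ~= 406), so the document mixes GB and GiB.

def bf16_gb(params: float) -> float:
    """BF16 stores 2 bytes per parameter."""
    return params * 2 / 1e9

def w4a16_gb(params: float, group_size: int = 128) -> float:
    """INT4 weights plus one FP16 scale per group (~4.125 bits/param at g128)."""
    bits_per_param = 4 + 16 / group_size
    return params * bits_per_param / 8 / 1e9

print(f"GLM-4.7 BF16:    {bf16_gb(358e9):.0f} GB")   # ~716 GB
print(f"REAP-218B BF16:  {bf16_gb(218e9):.0f} GB")   # ~436 GB
print(f"REAP-218B W4A16: {w4a16_gb(218e9):.0f} GB")  # ~112 GB
print(f"Compression:     {bf16_gb(358e9) / w4a16_gb(218e9):.1f}x")  # ~6.4x
```

The estimate lands close to the stated ~108GB and ~6.5x; the small gap comes from the ignored overheads and mixed GB/GiB units.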
Tested on 8x RTX 3090:
| Metric | Value |
|---|---|
| Prefill | 375 tps |
| Generation | 38.5 tps |
| Time to First Token | 3.82s |
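These measurements translate into a simple latency estimate for a full response (a sketch: total time ≈ time to first token + output tokens / generation rate; prompt length is folded into the fixed TTFT here):

```python
# End-to-end response-time estimate from the measured numbers above.
TTFT_S = 3.82    # time to first token, seconds (8x RTX 3090)
GEN_TPS = 38.5   # generation throughput, tokens/sec

def response_time_s(output_tokens: int) -> float:
    """Approximate wall-clock time to receive a full response."""
    return TTFT_S + output_tokens / GEN_TPS

print(f"{response_time_s(256):.1f}s")   # ~10.5s for a short answer
print(f"{response_time_s(1000):.1f}s")  # ~29.8s for a long one
```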
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve GLM-4.7-REAP-218B-A32B-W4A16 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --max-model-len 165000 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --kv-cache-dtype fp8_e4m3 \
  --tool-call-parser glm47 \
  --served-model-name glm-4.7 \
  --enable-auto-tool-choice \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
```
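Once the server is up, vLLM exposes an OpenAI-compatible API. A minimal stdlib-only client sketch (the URL and `glm-4.7` name come from the serve flags above; the request is only sent if a `GLM_URL` environment variable is set, so the snippet is safe to run without a server):

```python
import json
import os
import urllib.request

# Chat request against vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# "glm-4.7" matches --served-model-name in the serve command above.
payload = {
    "model": "glm-4.7",
    "messages": [{"role": "user", "content": "Write a haiku about pruning."}],
    "max_tokens": 128,
}

def chat(base_url: str) -> dict:
    """POST the chat payload and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Only send when an endpoint is configured, e.g. GLM_URL=http://localhost:8000
if os.environ.get("GLM_URL"):
    out = chat(os.environ["GLM_URL"])
    print(out["choices"][0]["message"]["content"])
```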
AutoRound is Intel's weight-only quantization method; it tunes the rounding offsets and clipping ranges with signed gradient descent. Settings used for this model:
```yaml
bits: 4
group_size: 128
format: auto_round
nsamples: 64
seqlen: 512
dataset: NeelNanda/pile-10k
```
```shell
# 1. Download the BF16 REAP model
huggingface-cli download 0xSero/GLM-4.7-REAP-218B-A32B --local-dir ./GLM-4.7-REAP-218B-A32B

# 2. Install AutoRound and run quantization (~2 hours on 8x H200)
pip install auto-round
python quantize.py
```

```python
# quantize.py
from auto_round import AutoRound

# Match the settings listed above: 4-bit weights, group size 128.
ar = AutoRound(
    "./GLM-4.7-REAP-218B-A32B",
    bits=4,
    group_size=128,
    nsamples=64,
    seqlen=512,
    batch_size=1,
    device_map="auto",  # shard calibration across all visible GPUs
)
ar.quantize_and_save("./GLM-4.7-REAP-218B-A32B-W4A16", format="auto_round")
```
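After quantization it is worth verifying that the saved checkpoint actually carries the intended settings. A small sketch, assuming (as auto-round typically does) that the exported `config.json` gains a `quantization_config` section with `bits` and `group_size` fields; the directory check makes it a no-op when the model is not present locally:

```python
import json
import os

def check_quant_config(model_dir: str, bits: int = 4, group_size: int = 128) -> bool:
    """Return True if config.json reports the expected quantization settings."""
    cfg_path = os.path.join(model_dir, "config.json")
    with open(cfg_path) as f:
        qc = json.load(f).get("quantization_config", {})
    return qc.get("bits") == bits and qc.get("group_size") == group_size

# Hypothetical local path; only checked if the quantized model was saved here.
if os.path.isdir("./GLM-4.7-REAP-218B-A32B-W4A16"):
    assert check_quant_config("./GLM-4.7-REAP-218B-A32B-W4A16")
```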
| Model | Params | Size | Format | Link |
|---|---|---|---|---|
| GLM-4.7 (Base) | 358B | ~700GB | BF16 | zai-org/GLM-4.7 |
| GLM-4.7-REAP-218B-A32B | 218B | ~407GB | BF16 | 0xSero/GLM-4.7-REAP-218B-A32B |
| This Model | 218B | ~108GB | W4A16 | - |
Benchmarks in progress
| Benchmark | GLM-4.7 Base | REAP BF16 | REAP W4A16 |
|---|---|---|---|
| HumanEval | - | - | - |
| MBPP | - | - | - |
| GSM8K | - | - | - |
```bibtex
@article{jones2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Jones, et al.},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

@misc{autoround2024,
  title={AutoRound: Advanced Weight Quantization},
  author={Intel Corporation},
  year={2024},
  howpublished={\url{https://github.com/intel/auto-round}}
}
```
Built with REAP + AutoRound | Sponsored by Prime Intellect
If this work is useful, support Sybil Solutions here: https://donate.sybilsolutions.ai