GLM-4.7-REAP-218B-A32B-W4A16

40% Expert-Pruned + INT4 Quantized GLM-4 (218B total / 32B active params, ~116GB)

A highly compressed version of GLM-4.7 that combines REAP expert pruning (40% of experts removed) with INT4 weight quantization (AutoRound W4A16), making it roughly 6.5x smaller than the ~717GB BF16 original.

Model Details

| Property | Value |
|---|---|
| Base Model | GLM-4.7-REAP-218B-A32B |
| Original (GLM-4.7) | 358B params, ~717GB |
| After REAP Pruning | 218B params, ~407GB |
| After W4A16 Quant | 218B params, ~108GB |
| Active Parameters | 32B per forward pass |
| Total Compression | ~6.5x from original |
| Quantization | INT4 weights, FP16 activations |
| Group Size | 128 |
| Format | AutoRound |
| VRAM Required | ~110GB |

Compression Pipeline

GLM-4.7 (358B, 700GB)
        |
        v  REAP 40% pruning (64 of 160 experts removed, 96 retained)
        |
GLM-4.7-REAP-218B-A32B (218B, 407GB)
        |
        v  AutoRound W4A16 quantization
        |
GLM-4.7-REAP-218B-A32B-W4A16 (218B, 108GB)  <-- This model

Total: 6.5x compression
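
The sizes in the pipeline above can be sanity-checked with back-of-envelope arithmetic: BF16 stores 2 bytes per parameter, and W4A16 with group size 128 stores 4 bits per weight plus roughly one FP16 scale per 128 weights (~4.125 bits/weight). These estimates ignore embeddings, any tensors kept at higher precision, and file overhead, so they land slightly above the reported ~407GB and ~108GB figures but in the right neighborhood:

```python
# Back-of-envelope sizes for the compression pipeline above.
# BF16: 2 bytes/param. W4A16 (group size 128): 4 bits/weight
# plus ~one FP16 scale per 128 weights, i.e. ~4.125 bits/weight.
GB = 1e9

bf16_original = 358e9 * 2 / GB             # GLM-4.7 in BF16
bf16_pruned = 218e9 * 2 / GB               # after REAP pruning
int4_pruned = 218e9 * (4.125 / 8) / GB     # after W4A16 quantization

print(f"original:   ~{bf16_original:.0f} GB")   # ~716 GB
print(f"pruned:     ~{bf16_pruned:.0f} GB")     # ~436 GB
print(f"quantized:  ~{int4_pruned:.0f} GB")     # ~112 GB
print(f"compression: ~{bf16_original / int4_pruned:.1f}x")  # ~6.4x
```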

Usage
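
Once served (see Deployment below), the model speaks the OpenAI-compatible API. A minimal sketch using only the standard library, assuming the vLLM server from the Deployment section is running locally (the port and model name come from that serve command; the prompt is illustrative):

```python
import json
from urllib.request import Request, urlopen

# Endpoint and model name match the vLLM serve command in the
# Deployment section (--port 8000, --served-model-name glm-4.7).
URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "glm-4.7",
    "messages": [{"role": "user", "content": "Write a haiku about pruning."}],
    "max_tokens": 128,
}

def chat(payload: dict) -> str:
    """POST a chat request to the running vLLM server, return the reply text."""
    req = Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# print(chat(payload))  # requires the server from the Deployment section
```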


📊 Performance

Tested on 8x RTX 3090:

| Metric | Value |
|---|---|
| Prefill | 375 tps |
| Generation | 38.5 tps |
| Time to First Token | 3.82 s |
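
A rough planning aid: end-to-end latency for a response can be estimated from the numbers above as time-to-first-token plus output length divided by generation speed (a simplification that ignores batching and prompt-length effects):

```python
# Rough latency estimate from the measured numbers above:
# total time ≈ TTFT + output_tokens / generation_speed
TTFT = 3.82      # seconds (time to first token)
GEN_TPS = 38.5   # generated tokens per second

def estimated_latency(output_tokens: int) -> float:
    return TTFT + output_tokens / GEN_TPS

print(f"~{estimated_latency(1000):.1f} s for a 1000-token reply")  # ~29.8 s
```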

Deployment

vLLM

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve GLM-4.7-REAP-218B-A32B-W4A16 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --max-model-len 165000 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --kv-cache-dtype fp8_e4m3 \
  --tool-call-parser glm47 \
  --served-model-name glm-4.7 \
  --enable-auto-tool-choice \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
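
The flags above split the model across all eight GPUs: tensor parallel 4 × pipeline parallel 2 = 8 ranks, so each card holds roughly an eighth of the ~108GB of weights, leaving headroom for the FP8 KV cache. A quick sanity check of that split, assuming 24GB cards as in the benchmark setup:

```python
# Sanity-check the parallelism layout in the serve command above.
TP, PP = 4, 2        # --tensor-parallel-size, --pipeline-parallel-size
GPUS = TP * PP
WEIGHTS_GB = 108     # quantized checkpoint size
GPU_MEM_GB = 24      # RTX 3090, as in the benchmark setup
UTIL = 0.92          # --gpu-memory-utilization

per_gpu_weights = WEIGHTS_GB / GPUS
kv_headroom = GPU_MEM_GB * UTIL - per_gpu_weights

print(f"{GPUS} GPUs, ~{per_gpu_weights:.1f} GB of weights each")  # ~13.5 GB
print(f"~{kv_headroom:.1f} GB per GPU left for KV cache")         # ~8.6 GB
```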

AutoRound Quantization Details

AutoRound is Intel's weight-only quantization method, which tunes weight rounding and clipping ranges via signed gradient descent on a small calibration set.

bits: 4
group_size: 128
format: auto_round
nsamples: 64
seqlen: 512
dataset: NeelNanda/pile-10k

Reproduce This Model

# 1. Download the BF16 REAP model
huggingface-cli download 0xSero/GLM-4.7-REAP-218B-A32B --local-dir ./GLM-4.7-REAP-218B-A32B

# 2. Run AutoRound quantization
pip install auto-round

python -c "
from auto_round import AutoRound

# Settings match the quantization config listed above (W4A16, group size 128)
ar = AutoRound(
    './GLM-4.7-REAP-218B-A32B',
    bits=4,
    group_size=128,
    dataset='NeelNanda/pile-10k',
    nsamples=64,
    seqlen=512,
    batch_size=1,
    device_map='auto'
)
ar.quantize_and_save('./GLM-4.7-REAP-218B-A32B-W4A16', format='auto_round')
"

# Takes ~2 hours on 8x H200

Related Models

| Model | Params | Size | Format | Link |
|---|---|---|---|---|
| GLM-4.7 (Base) | 358B | ~700GB | BF16 | zai-org/GLM-4.7 |
| GLM-4.7-REAP-218B-A32B | 218B | ~407GB | BF16 | 0xSero/GLM-4.7-REAP-218B-A32B |
| This Model | 218B | ~108GB | W4A16 | - |

Benchmarks

Benchmarks in progress

| Benchmark | GLM-4.7 Base | REAP BF16 | REAP W4A16 |
|---|---|---|---|
| HumanEval | - | - | - |
| MBPP | - | - | - |
| GSM8K | - | - | - |

Citation

@article{jones2025reap,
  title={REAP: Router-Experts Activation Pruning for Efficient Mixture-of-Experts},
  author={Jones and others},
  journal={arXiv preprint arXiv:2505.20877},
  year={2025}
}

@misc{autoround2024,
  title={AutoRound: Advanced Weight Quantization},
  author={Intel Corporation},
  year={2024},
  howpublished={\url{https://github.com/intel/auto-round}}
}

Built with REAP + AutoRound | Sponsored by Prime Intellect

Support

If this work is useful, support Sybil Solutions here: https://donate.sybilsolutions.ai
