GLM-4.7-REAP-218B-A32B-W4A16

40% Expert-Pruned + INT4 Quantized GLM-4 (218B total / 32B active params, ~116GB)

A highly compressed version of GLM-4.7 that combines REAP expert pruning (40% of experts removed) with INT4 weight quantization (AutoRound W4A16), making it roughly 6.5x smaller than the ~717GB BF16 original.

Model Details

| Property | Value |
|---|---|
| Base Model | GLM-4.7-REAP-218B-A32B |
| Original (GLM-4.7) | 358B params, ~717GB |
| After REAP Pruning | 218B params, ~407GB |
| After W4A16 Quant | 218B params, ~108GB |
| Active Parameters | 32B per forward pass |
| Total Compression | ~6.5x from original |
| Quantization | INT4 weights, FP16 activations |
| Group Size | 128 |
| Format | AutoRound |
| VRAM Required | ~110GB |

Compression Pipeline

GLM-4.7 (358B, 700GB)
        |
        v  REAP 40% pruning (64 of 160 experts removed, 96 retained)
        |
GLM-4.7-REAP-218B-A32B (218B, 407GB)
        |
        v  AutoRound W4A16 quantization
        |
GLM-4.7-REAP-218B-A32B-W4A16 (218B, 108GB)  <-- This model

Total: 6.5x compression
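
The sizes in the pipeline above can be sanity-checked with back-of-envelope arithmetic: BF16 stores 2 bytes per parameter, and W4A16 with group size 128 stores 4 bits per weight plus roughly one FP16 scale per 128 weights (~4.125 bits/weight). These estimates ignore embeddings, any tensors kept at higher precision, and file overhead, so they land slightly above the reported ~407GB and ~108GB figures but in the right neighborhood:

```python
# Back-of-envelope sizes for the compression pipeline above.
# BF16: 2 bytes/param. W4A16 (group size 128): 4 bits/weight
# plus ~one FP16 scale per 128 weights, i.e. ~4.125 bits/weight.
GB = 1e9

bf16_original = 358e9 * 2 / GB             # GLM-4.7 in BF16
bf16_pruned = 218e9 * 2 / GB               # after REAP pruning
int4_pruned = 218e9 * (4.125 / 8) / GB     # after W4A16 quantization

print(f"original:   ~{bf16_original:.0f} GB")   # ~716 GB
print(f"pruned:     ~{bf16_pruned:.0f} GB")     # ~436 GB
print(f"quantized:  ~{int4_pruned:.0f} GB")     # ~112 GB
print(f"compression: ~{bf16_original / int4_pruned:.1f}x")  # ~6.4x
```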

Usage
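
Once served (see Deployment below), the model speaks the OpenAI-compatible API. A minimal sketch using only the standard library, assuming the vLLM server from the Deployment section is running locally (the port and model name come from that serve command; the prompt is illustrative):

```python
import json
from urllib.request import Request, urlopen

# Endpoint and model name match the vLLM serve command in the
# Deployment section (--port 8000, --served-model-name glm-4.7).
URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "glm-4.7",
    "messages": [{"role": "user", "content": "Write a haiku about pruning."}],
    "max_tokens": 128,
}

def chat(payload: dict) -> str:
    """POST a chat request to the running vLLM server, return the reply text."""
    req = Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# print(chat(payload))  # requires the server from the Deployment section
```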


📊 Performance

Tested on 8x RTX 3090:

| Metric | Value |
|---|---|
| Prefill | 375 tps |
| Generation | 38.5 tps |
| Time to First Token | 3.82 s |
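
A rough planning aid: end-to-end latency for a response can be estimated from the numbers above as time-to-first-token plus output length divided by generation speed (a simplification that ignores batching and prompt-length effects):

```python
# Rough latency estimate from the measured numbers above:
# total time ≈ TTFT + output_tokens / generation_speed
TTFT = 3.82      # seconds (time to first token)
GEN_TPS = 38.5   # generated tokens per second

def estimated_latency(output_tokens: int) -> float:
    return TTFT + output_tokens / GEN_TPS

print(f"~{estimated_latency(1000):.1f} s for a 1000-token reply")  # ~29.8 s
```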

Deployment

vLLM

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve GLM-4.7-REAP-218B-A32B-W4A16 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --max-model-len 165000 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --kv-cache-dtype fp8_e4m3 \
  --tool-call-parser glm47 \
  --served-model-name glm-4.7 \
  --enable-auto-tool-choice \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
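
The flags above split the model across all eight GPUs: tensor parallel 4 × pipeline parallel 2 = 8 ranks, so each card holds roughly an eighth of the ~108GB of weights, leaving headroom for the FP8 KV cache. A quick sanity check of that split, assuming 24GB cards as in the benchmark setup:

```python
# Sanity-check the parallelism layout in the serve command above.
TP, PP = 4, 2        # --tensor-parallel-size, --pipeline-parallel-size
GPUS = TP * PP
WEIGHTS_GB = 108     # quantized checkpoint size
GPU_MEM_GB = 24      # RTX 3090, as in the benchmark setup
UTIL = 0.92          # --gpu-memory-utilization

per_gpu_weights = WEIGHTS_GB / GPUS
kv_headroom = GPU_MEM_GB * UTIL - per_gpu_weights

print(f"{GPUS} GPUs, ~{per_gpu_weights:.1f} GB of weights each")  # ~13.5 GB
print(f"~{kv_headroom:.1f} GB per GPU left for KV cache")         # ~8.6 GB
```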

AutoRound Quantization Details

AutoRound is Intel's weight-only quantization method, which tunes weight rounding and clipping ranges via signed gradient descent on a small calibration set.

bits: 4
group_size: 128
format: auto_round
nsamples: 64
seqlen: 512
dataset: NeelNanda/pile-10k

Reproduce This Model

# 1. Download the BF16 REAP model
huggingface-cli download 0xSero/GLM-4.7-REAP-218B-A32B --local-dir ./GLM-4.7-REAP-218B-A32B

# 2. Run AutoRound quantization
pip install auto-round

python -c "
from auto_round import AutoRound

# Settings match the quantization config listed above (W4A16, group size 128)
ar = AutoRound(
    './GLM-4.7-REAP-218B-A32B',
    bits=4,
    group_size=128,
    dataset='NeelNanda/pile-10k',
    nsamples=64,
    seqlen=512,
    batch_size=1,
    device_map='auto'
)
ar.quantize_and_save('./GLM-4.7-REAP-218B-A32B-W4A16', format='auto_round')
"

# Takes ~2 hours on 8x H200

Related Models

| Model | Params | Size | Format | Link |
|---|---|---|---|---|
| GLM-4.7 (Base) | 358B | ~700GB | BF16 | zai-org/GLM-4.7 |
| GLM-4.7-REAP-218B-A32B | 218B | ~407GB | BF16 | 0xSero/GLM-4.7-REAP-218B-A32B |
| This Model | 218B | ~108GB | W4A16 | - |

Benchmarks

Benchmarks in progress

| Benchmark | GLM-4.7 Base | REAP BF16 | REAP W4A16 |
|---|---|---|---|
| HumanEval | - | - | - |
| MBPP | - | - | - |
| GSM8K | - | - | - |

Citation

@article{jones2025reap,
  title={REAP: Router-Experts Activation Pruning for Efficient Mixture-of-Experts},
  author={Jones and others},
  journal={arXiv preprint arXiv:2505.20877},
  year={2025}
}

@misc{autoround2024,
  title={AutoRound: Advanced Weight Quantization},
  author={Intel Corporation},
  year={2024},
  howpublished={\url{https://github.com/intel/auto-round}}
}

Built with REAP + AutoRound | Sponsored by Prime Intellect

Support

If this work is useful, support Sybil Solutions here: https://donate.sybilsolutions.ai
