# Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-NVFP4
NVIDIA FP4 (NVFP4) quantized version of HauhauCS/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive, with full multimodal (vision) and tool-calling capability preserved.
## Model Details

- Base model: HauhauCS/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive
- Architecture: Qwen3_5ForConditionalGeneration (hybrid Gated-DeltaNet + full attention)
- Quantization: NVFP4 (4-bit weights + FP8 activations) via llm-compressor
- Calibration: 512 samples from neuralmagic/calibration (LLM split), 4096 sequence length
- Model size: ~19.7 GB (vs. ~54 GB for the bf16 original)
- Vision encoder: bf16 (unquantized, ~0.9 GB)
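As a rough sanity check on the size figure, a back-of-envelope estimate is sketched below. The ~27B parameter count is inferred from the 54 GB bf16 size; the per-16-element FP8 block scale follows the NVFP4 format, and the ~2.3B bf16-retained parameter count is an assumption chosen to match the excluded tensors listed below, not a measurement of this checkpoint:

```python
# Rough size estimate for the NVFP4 checkpoint (all figures approximate).
params = 27e9  # inferred from 54 GB at 2 bytes/param in bf16

bf16_bytes = params * 2                  # original checkpoint
nvfp4_bytes_per_param = 4 / 8 + 1 / 16   # 0.5 B weight + amortized FP8 block scale
kept_bf16 = 2.3e9                        # assumed: lm_head, vision encoder, norms, SSM tensors

quantized = (params - kept_bf16) * nvfp4_bytes_per_param
total = quantized + kept_bf16 * 2
print(f"bf16: {bf16_bytes/1e9:.1f} GB, NVFP4 estimate: {total/1e9:.1f} GB")
```

This lands in the same ballpark as the ~19.7 GB actually on disk; the remainder is metadata and rounding in the assumed bf16-retained fraction.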
## What's quantized, what's not
| Component | Format | Notes |
|---|---|---|
| MLP (gate/up/down_proj) | NVFP4 | All 64 layers |
| Full attention (q/k/v/o_proj) | NVFP4 | 16 layers (every 4th) |
| Linear attention (in_proj_qkv/z, out_proj) | NVFP4 | 48 layers |
| Linear attention (in_proj_a/b) | bf16 | SSM parameters, excluded |
| lm_head | bf16 | Output projection, excluded |
| Vision encoder | bf16 | All vision weights, excluded |
| Norms, biases, A_log, dt_bias, conv1d | bf16 | Small tensors, excluded |
## Quantization Recipe

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets=["Linear"],
    ignore=["lm_head", "re:.*visual.*", "re:.*in_proj_a$", "re:.*in_proj_b$"],
    scheme="NVFP4",
)
```
This follows the approach from Kbenkhaled/Qwen3.5-27B-NVFP4.
## Usage with vLLM

### Basic (chat + reasoning)
```shell
docker run -d --name hauhaucs-nvfp4 \
  --ipc host --network host --device nvidia.com/gpu=all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/vllm:/root/.cache/vllm \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  vllm/vllm-openai:cu130-nightly \
  lyf/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-NVFP4 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 4096 \
  --kv-cache-dtype fp8 \
  --reasoning-parser qwen3
```
### With tool calling

Add these flags to enable OpenAI-compatible function calling:

```shell
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml
```
The model uses XML-style tool calls inherited from the HauhauCS chat template:

```xml
<tool_call>
<function=get_weather>
<parameter=city>
Tokyo
</parameter>
</function>
</tool_call>
```
**Important:** Use `qwen3_xml` as the tool-call parser, NOT `qwen3_coder`. Although this model shares its chat template with standard Qwen3.5, `qwen3_xml` is the parser that matches the `<tool_call><function=...>` XML output format; `qwen3_coder` happens to work in some cases but is not the correct match.
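For reference, a minimal OpenAI-style request body that exercises tool calling might look like the Python dict below. The `get_weather` schema is illustrative, mirroring the XML example above; actually posting it requires the server from the docker command to be running:

```python
import json

# Illustrative tool schema matching the get_weather XML example above.
payload = {
    "model": "lyf/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-NVFP4",
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}

# POST this to http://localhost:8000/v1/chat/completions; with
# --tool-call-parser qwen3_xml the response carries a structured
# tool_calls array instead of raw <tool_call> text.
print(json.dumps(payload, indent=2))
```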
### Disabling thinking mode

To get direct answers without chain-of-thought reasoning, pass `enable_thinking: false` via the API:
```json
{
  "chat_template_kwargs": {"enable_thinking": false}
}
```
Some clients (e.g., Chatbox) may send this automatically when thinking mode is toggled off.
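Concretely, a full request body with thinking disabled might look like this (model name from the serving example above; posting it requires a running server):

```python
import json

# chat_template_kwargs rides alongside the usual chat-completions fields.
payload = {
    "model": "lyf/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-NVFP4",
    "messages": [{"role": "user", "content": "Name three prime numbers."}],
    "chat_template_kwargs": {"enable_thinking": False},
}
print(json.dumps(payload))
```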
### Full docker-compose example
```yaml
services:
  vllm:
    image: vllm/vllm-openai:cu130-nightly
    container_name: hauhaucs-nvfp4
    restart: unless-stopped
    network_mode: host
    ipc: host
    devices:
      - nvidia.com/gpu=all
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ~/.cache/vllm:/root/.cache/vllm
    environment:
      - VLLM_USE_FLASHINFER_MOE_FP4=0
      - VLLM_NVFP4_GEMM_BACKEND=marlin
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    command:
      - --model
      - lyf/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-NVFP4
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
      - --max-model-len
      - "32768"
      - --gpu-memory-utilization
      - "0.90"
      - --max-num-seqs
      - "4"
      - --max-num-batched-tokens
      - "4096"
      - --kv-cache-dtype
      - fp8
      - --reasoning-parser
      - qwen3
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_xml
```
## Memory budget (RTX 5090, 32GB VRAM)
| Component | Size |
|---|---|
| NVFP4 weights | ~18 GB |
| Vision encoder (bf16) | ~0.9 GB |
| KV cache (fp8, 32K ctx) | ~8 GB |
| Overhead | ~3 GB |
| Total | ~30 GB |
Tested on RTX 5090 with vLLM v0.17+ nightly. Runs at ~82 tokens/s.
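The KV-cache row can be cross-checked with a rough per-token estimate. The 16 full-attention layers come from the quantization table above; the KV-head count and head dimension are assumptions typical for Qwen3-class models, not verified against this checkpoint:

```python
# Rough fp8 KV-cache footprint per token of context.
# Only the 16 full-attention layers hold a growing KV cache; the 48
# linear-attention layers keep a small fixed-size recurrent state.
full_attn_layers = 16          # from the quantization table above
kv_heads, head_dim = 8, 128    # assumed, typical for Qwen3-class models
fp8_bytes = 1

per_token = 2 * full_attn_layers * kv_heads * head_dim * fp8_bytes  # K and V
per_32k_seq_gib = per_token * 32768 / 2**30
print(f"{per_token} B/token, {per_32k_seq_gib:.1f} GiB per full 32K sequence")
```

Note that vLLM preallocates whatever VRAM remains under `--gpu-memory-utilization 0.90` as KV blocks, which is where the ~8 GB figure comes from; under these assumptions that is roughly eight full 32K sequences' worth of cache.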
## Capabilities

### Multimodal (vision)

Image understanding works out of the box via the OpenAI vision API format:
```json
{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What do you see?"},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
    ]
  }]
}
```
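Building the data URL for the `image_url` field is straightforward; a minimal sketch (the embedded 1x1 transparent PNG is just a stand-in payload for whatever image bytes you have):

```python
import base64

# Stand-in image: a well-known 1x1 transparent PNG.
png_b64 = ("iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJ"
           "AAAADUlEQVR42mNkYPhfDwAChwGA60e6kgAAAABJRU5ErkJggg==")
png_bytes = base64.b64decode(png_b64)

# Base64-encode the raw bytes into a data URL for the OpenAI vision format.
data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode()
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What do you see?"},
        {"type": "image_url", "image_url": {"url": data_url}},
    ],
}
print(data_url[:40])
```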
### Tool calling

Verified working with `--enable-auto-tool-choice --tool-call-parser qwen3_xml`. The model correctly populates the `tool_calls` array in OpenAI-compatible responses with `finish_reason: "tool_calls"`.
## Red Team AI Benchmark
| Scorer | Score |
|---|---|
| Keyword matching | 75.0% |
| Semantic similarity (gte-large-en-v1.5) | 77.1% |
12/12 questions answered without refusal. Strongest on low-level C/C++/assembly tasks (PE mapping, syscall shellcode, EDR unhooking). Benchmark: toxy4ny/redteam-ai-benchmark.
## How It Was Made

The original model was distributed as GGUF files. Since transformers does not support loading Qwen3.5 from GGUF, we built a manual conversion pipeline that handles three critical GGUF-specific pitfalls:

- RMSNorm +1.0 offset -- GGUF stores `1 + learned_param`, HF expects `learned_param`
- A_log domain mismatch -- GGUF stores `-exp(A_log)`, HF expects `A_log`
- Value-head (3,16) permutation -- GGUF stores the 48 value heads in (3-per-group, 16-groups) order; HF expects (16-groups, 3-per-group)
Full pipeline code and detailed write-up: github.com/li-yifei/gguf-to-nvfp4
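The three fixes above amount to simple tensor transforms. A minimal pure-Python sketch, using flat lists in place of real weight tensors (the function names and the choice of which head dimension is outermost are illustrative, not the pipeline's exact code):

```python
import math

def fix_rmsnorm(w):
    # GGUF stores 1 + learned_param; HF expects learned_param.
    return [x - 1.0 for x in w]

def fix_a_log(a):
    # GGUF stores -exp(A_log); invert to recover A_log for HF.
    return [math.log(-x) for x in a]

def fix_value_head_order(heads, groups=16, per_group=3):
    # GGUF lists the 48 value heads in (per_group, groups) order;
    # HF expects (groups, per_group), so HF head g*per_group + p
    # comes from GGUF position p*groups + g.
    return [heads[p * groups + g] for g in range(groups) for p in range(per_group)]

# Round-trip check of the A_log transform on dummy values
a_log = [-1.0, 0.5, 2.0]
stored = [-math.exp(x) for x in a_log]   # what GGUF holds
assert all(abs(r - x) < 1e-12 for r, x in zip(fix_a_log(stored), a_log))

# Permutation check with head-index placeholders
assert fix_value_head_order(list(range(48)))[:3] == [0, 16, 32]
print("ok")
```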
## MT-Bench Results (mini, 24 questions)
| Category | Score |
|---|---|
| Math | 9.33 |
| Coding | 8.83 |
| Humanities | 8.33 |
| Writing | 7.67 |
| Extraction | 7.50 |
| Roleplay | 7.33 |
| Reasoning | 7.17 |
| STEM | 6.67 |
| Overall | 7.85 |
Judged by gpt-5.1-codex-mini.
## Acknowledgments
- HauhauCS for the uncensored Qwen3.5-27B base model
- Kbenkhaled for the NVFP4 quantization recipe
- Neural Magic / llm-compressor for the quantization framework
- vLLM for serving infrastructure