Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-NVFP4

NVIDIA FP4 (NVFP4) quantized version of HauhauCS/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive, with full multimodal (vision) and tool-calling capability preserved.

Model Details

  • Base model: HauhauCS/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive
  • Architecture: Qwen3_5ForConditionalGeneration (hybrid Gated-DeltaNet + full attention)
  • Quantization: NVFP4 (4-bit weights + FP8 activations) via llm-compressor
  • Calibration: 512 samples from neuralmagic/calibration (LLM split), 4096 seq length
  • Model size: ~19.7 GB (vs ~54 GB bf16 original)
  • Vision encoder: bf16 (unquantized, ~0.9 GB)

What's quantized, what's not

| Component | Format | Notes |
|---|---|---|
| MLP (gate/up/down_proj) | NVFP4 | All 64 layers |
| Full attention (q/k/v/o_proj) | NVFP4 | 16 layers (every 4th) |
| Linear attention (in_proj_qkv/z, out_proj) | NVFP4 | 48 layers |
| Linear attention (in_proj_a/b) | bf16 | SSM parameters, excluded |
| lm_head | bf16 | Output projection, excluded |
| Vision encoder | bf16 | All vision weights, excluded |
| Norms, biases, A_log, dt_bias, conv1d | bf16 | Small tensors, excluded |

Quantization Recipe

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets=["Linear"],
    ignore=["lm_head", "re:.*visual.*", "re:.*in_proj_a$", "re:.*in_proj_b$"],
    scheme="NVFP4",
)
```

Following the approach from Kbenkhaled/Qwen3.5-27B-NVFP4.
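A sketch of applying this recipe end-to-end with llm-compressor's oneshot API. The argument values mirror the calibration setup in Model Details; the output directory name is illustrative, and selecting the LLM split of the calibration set may require an extra dataset argument depending on the llm-compressor version.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets=["Linear"],
    ignore=["lm_head", "re:.*visual.*", "re:.*in_proj_a$", "re:.*in_proj_b$"],
    scheme="NVFP4",
)

# One-shot calibration + quantization: 512 samples at 4096 tokens,
# matching the setup described above.
oneshot(
    model="HauhauCS/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive",
    dataset="neuralmagic/calibration",
    recipe=recipe,
    max_seq_length=4096,
    num_calibration_samples=512,
    output_dir="./Qwen3.5-27B-NVFP4",
)
```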

Usage with vLLM

Basic (chat + reasoning)

```shell
docker run -d --name hauhaucs-nvfp4 \
  --ipc host --network host --device nvidia.com/gpu=all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/vllm:/root/.cache/vllm \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  vllm/vllm-openai:cu130-nightly \
  lyf/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-NVFP4 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 4096 \
  --kv-cache-dtype fp8 \
  --reasoning-parser qwen3
```
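Once the container is up, a quick smoke test against the OpenAI-compatible endpoint (served model name as above):

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "lyf/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-NVFP4",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 64
      }'
```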

With tool calling

Add these flags to enable OpenAI-compatible function calling:

  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml

The model uses XML-style tool calls inherited from the HauhauCS chat template:

```xml
<tool_call>
<function=get_weather>
<parameter=city>
Tokyo
</parameter>
</function>
</tool_call>
```

Important: use qwen3_xml as the tool-call parser, not qwen3_coder. Although this model shares the standard Qwen3.5 chat template, qwen3_xml is the parser that matches the <tool_call><function=...> XML output format; qwen3_coder happens to work in some cases, but qwen3_xml is the correct choice.

Disabling thinking mode

To get direct answers without chain-of-thought reasoning, pass enable_thinking: false via the API:

```json
{
  "chat_template_kwargs": {"enable_thinking": false}
}
```

Some clients (e.g., Chatbox) may send this automatically when thinking mode is toggled off.
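The flag can also be set per request from code; a sketch of building such a request body in Python (vLLM forwards chat_template_kwargs into the chat template):

```python
import json

# Request body for /v1/chat/completions with chain-of-thought disabled.
payload = {
    "model": "lyf/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-NVFP4",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "chat_template_kwargs": {"enable_thinking": False},
}

body = json.dumps(payload)
```

With the OpenAI Python client, the same chat_template_kwargs dict can be passed via the extra_body argument instead.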

Full docker-compose example

```yaml
services:
  vllm:
    image: vllm/vllm-openai:cu130-nightly
    container_name: hauhaucs-nvfp4
    restart: unless-stopped
    network_mode: host
    ipc: host
    devices:
      - nvidia.com/gpu=all
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ~/.cache/vllm:/root/.cache/vllm
    environment:
      - VLLM_USE_FLASHINFER_MOE_FP4=0
      - VLLM_NVFP4_GEMM_BACKEND=marlin
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    command:
      - --model
      - lyf/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-NVFP4
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
      - --max-model-len
      - "32768"
      - --gpu-memory-utilization
      - "0.90"
      - --max-num-seqs
      - "4"
      - --max-num-batched-tokens
      - "4096"
      - --kv-cache-dtype
      - fp8
      - --reasoning-parser
      - qwen3
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_xml
```

Memory budget (RTX 5090, 32GB VRAM)

| Component | Size |
|---|---|
| NVFP4 weights | ~18 GB |
| Vision encoder (bf16) | ~0.9 GB |
| KV cache (fp8, 32K ctx) | ~8 GB |
| Overhead | ~3 GB |
| Total | ~30 GB |
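The budget can be sanity-checked with simple arithmetic, using the approximate component sizes from the table:

```python
# Approximate VRAM components from the table, in GB.
weights_nvfp4 = 18.0
vision_bf16 = 0.9
kv_cache_fp8 = 8.0   # 32K context with fp8 KV cache
overhead = 3.0       # activations, CUDA graphs, allocator slack

total = weights_nvfp4 + vision_bf16 + kv_cache_fp8 + overhead
print(f"{total:.1f} GB")  # 29.9 GB, fits under 32 GB on an RTX 5090
```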

Tested on RTX 5090 with vLLM v0.17+ nightly. Runs at ~82 tokens/s.

Capabilities

Multimodal (vision)

Image understanding works out of the box via the OpenAI vision API format:

```json
{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What do you see?"},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
    ]
  }]
}
```
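A sketch of building that message in Python from raw image bytes (helper names are illustrative, not part of any API):

```python
import base64

def image_to_data_url(data: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL for the vision API."""
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"

def vision_message(text: str, image_bytes: bytes) -> dict:
    """Build an OpenAI-style multimodal user message."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": {"url": image_to_data_url(image_bytes)}},
        ],
    }
```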

Tool calling

Verified working with --enable-auto-tool-choice --tool-call-parser qwen3_xml. The model correctly populates the tool_calls array in OpenAI-compatible responses with finish_reason: "tool_calls".
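An example tool definition matching the get_weather call shown earlier (the schema contents are illustrative). The tools list goes in the standard tools field of a chat completion request, and the qwen3_xml parser turns the model's XML output into the tool_calls array:

```python
# OpenAI-compatible function-calling schema for the example tool.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}]

request_body = {
    "model": "lyf/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-NVFP4",
    "messages": [{"role": "user", "content": "Weather in Tokyo?"}],
    "tools": tools,
    "tool_choice": "auto",
}
```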

Red Team AI Benchmark

| Scorer | Score |
|---|---|
| Keyword matching | 75.0% |
| Semantic similarity (gte-large-en-v1.5) | 77.1% |

12/12 questions answered without refusal. Strongest on low-level C/C++/assembly tasks (PE mapping, syscall shellcode, EDR unhooking). Benchmark: toxy4ny/redteam-ai-benchmark.

How It Was Made

The original model was distributed as GGUF files. Since transformers does not support loading Qwen3.5 from GGUF, we built a manual conversion pipeline that handles three critical GGUF-specific pitfalls:

  1. RMSNorm +1.0 offset -- GGUF stores 1 + learned_param, HF expects learned_param
  2. A_log domain mismatch -- GGUF stores -exp(A_log), HF expects A_log
  3. Value head (3,16) permutation -- GGUF stores 48 value heads in (3-per-group, 16-groups) order; HF expects (16-groups, 3-per-group)
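The three fixes reduce to small tensor transforms. A sketch with plain Python lists standing in for tensors; shapes are simplified, and the head permutation assumes the GGUF (3, 16) layout is flattened row-major:

```python
import math

# 1. RMSNorm offset: GGUF stores 1 + learned_param, HF expects learned_param.
def fix_rmsnorm(gguf_weight):
    return [w - 1.0 for w in gguf_weight]

# 2. A_log domain: GGUF stores -exp(A_log), HF expects A_log.
def fix_a_log(gguf_a):
    return [math.log(-a) for a in gguf_a]

# 3. Value-head order: GGUF lays out 48 heads as (3 per group, 16 groups);
#    HF expects (16 groups, 3 per group). Equivalent to reshape(3, 16),
#    transpose, flatten.
def fix_value_heads(heads):
    assert len(heads) == 48
    return [heads[k * 16 + g] for g in range(16) for k in range(3)]
```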

Full pipeline code and detailed write-up: github.com/li-yifei/gguf-to-nvfp4

MT-Bench Results (mini, 24 questions)

| Category | Score |
|---|---|
| Math | 9.33 |
| Coding | 8.83 |
| Humanities | 8.33 |
| Writing | 7.67 |
| Extraction | 7.50 |
| Roleplay | 7.33 |
| Reasoning | 7.17 |
| STEM | 6.67 |
| Overall | 7.85 |

Judged by gpt-5.1-codex-mini.
