ModelBest's "Little Steel Cannon" MiniCPM 4.0 Is Open-Sourced: 5x Typical Speedup for On-Device Inference!


ModelScope Community · Published 2025-06-09 10:09:31

01.MiniCPM 4.0

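ModelBest has released MiniCPM 4.0, a highly efficient on-device large language model. With its in-house CPM.cu inference framework, it reaches a 220x speedup in extreme cases and a 5x speedup in typical ones. The open-source release centers on two parameter scales, 8B and 0.5B, both reported to achieve the best performance among models of comparable size. The officially released models include: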

MiniCPM4-8B:

The flagship MiniCPM4 model, with 8 billion parameters, trained on 8T tokens.

MiniCPM4-0.5B:

The small MiniCPM4 model, with 0.5 billion parameters, trained on 1T tokens.

MiniCPM4-8B-Eagle-FRSpec:

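An Eagle head for FRSpec that accelerates speculative inference for MiniCPM4-8B.

MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu:

An Eagle head for FRSpec trained with quantization-aware training (QAT); it combines speculation and quantization efficiently to give MiniCPM4-8B a further speedup.

MiniCPM4-8B-Eagle-vLLM:

An Eagle head in vLLM format that accelerates speculative inference for MiniCPM4-8B.

MiniCPM4-8B-marlin-Eagle-vLLM:

A quantized Eagle head in vLLM format that accelerates speculative inference for MiniCPM4-8B.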

BitCPM4-0.5B:

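Applies extreme ternary quantization to MiniCPM4-0.5B, compressing the model parameters into ternary values and reducing bit width by 90%.

BitCPM4-1B:

Applies extreme ternary quantization to MiniCPM3-1B, compressing the model parameters into ternary values and reducing bit width by 90%.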

MiniCPM4-Survey:

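Built on MiniCPM4-8B; takes a user query as input and automatically generates a trustworthy, long-form survey paper.

MiniCPM4-MCP:

Built on MiniCPM4-8B; takes a user query and the available MCP tools as input, and automatically calls the relevant MCP tools to satisfy the request.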

Model collection:

https://www.modelscope.cn/collections/MiniCPM-4-ec015560e8c84d

GitHub:

https://github.com/openbmb/minicpm

Technical report:

https://github.com/OpenBMB/MiniCPM/blob/main/report/MiniCPM_4_Technical_Report.pdf

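According to the technical report, the MiniCPM4 series consists of highly efficient large language models (LLMs) designed for end-side devices. The efficiency comes from systematic innovation along four key dimensions: model architecture, training data, training algorithms, and inference systems.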

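Efficient model architecture:

- InfLLM v2 -- trainable sparse attention: with a trainable sparse attention architecture, each token computes relevance with fewer than 5% of the tokens when processing 128K-token texts, which significantly reduces the compute cost of long-context processing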
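To get a feel for what "fewer than 5% of the tokens" means at 128K context, the sketch below counts query-key comparisons for dense causal attention and for a simplified sparse variant. It is only a counting model under stated assumptions (every token keeps at most 5% of the tokens visible to it, and the cost of choosing which tokens to keep is ignored); it is not the InfLLM v2 algorithm, and `attention_pairs` is a hypothetical helper name.

```python
def attention_pairs(seq_len: int, keep_percent: int = 100) -> int:
    """Count query-key comparisons for causal attention.

    The token at 1-based position `visible` can see `visible` tokens.
    With keep_percent < 100 it is assumed to compare against only
    ceil(keep_percent% of visible) of them (a simplification).
    """
    total = 0
    for visible in range(1, seq_len + 1):
        # Integer ceiling division avoids floating-point rounding.
        total += -(-visible * keep_percent // 100)
    return total


n = 128 * 1024  # 128K tokens
dense = attention_pairs(n)
sparse = attention_pairs(n, keep_percent=5)
print(f"dense:  {dense:,} comparisons")
print(f"sparse: {sparse:,} comparisons ({sparse / dense:.1%} of dense)")
```

Because dense attention grows quadratically with sequence length, capping the fraction of keys each query touches scales the dominant cost down by roughly the same fraction.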

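Efficient learning algorithms:

- Model Wind Tunnel 2.0 -- efficient Predictable Scaling: introduces a scaling prediction method for downstream task performance, enabling a more precise search over model training configurations
- BitCPM -- extreme ternary quantization: compresses model parameters to three possible values, cutting model bit width by 90%
- Efficient training engineering: uses FP8 low-precision computation combined with a Multi-token Prediction training strategy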
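The 90% figure is consistent with simple arithmetic: a three-valued weight carries about log2(3) ≈ 1.58 bits, versus 16 bits for a bfloat16 weight. The sketch below shows that arithmetic together with a toy ternary quantizer. The quantizer uses a generic absmean scheme (one shared scale, round, clip to -1/0/+1) chosen here for illustration; the article does not describe BitCPM's actual procedure, and the 16-bit baseline is an assumption.

```python
import math


def ternary_quantize(weights):
    """Map each weight to -1, 0, or +1 plus one shared scale (absmean scheme)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0:
        return [0] * len(weights), 0.0
    # Round to the nearest integer, then clip into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate weights from ternary codes."""
    return [c * scale for c in codes]


codes, scale = ternary_quantize([0.9, -0.05, -1.2, 0.4])
print(codes, round(scale, 4))

bits_per_ternary = math.log2(3)        # information content of one ternary value
reduction = 1 - bits_per_ternary / 16  # assumed baseline: 16-bit weights
print(f"bit-width reduction: {reduction:.1%}")
```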

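High-knowledge-density training data:

- UltraClean -- cleaning and synthesis of high-quality pre-training data: builds an iterative data cleaning strategy based on efficient verification, and open-sources UltraFineweb, a high-quality Chinese and English pre-training dataset
- UltraChat v2 -- synthesis of high-quality supervised fine-tuning data: builds a large-scale, high-quality supervised fine-tuning dataset covering knowledge-intensive data, reasoning-intensive data, instruction-following data, long-text understanding data, and tool-calling data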

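Efficient inference systems:

- CPM.cu -- a lightweight, efficient CUDA inference framework: combines sparse attention, model quantization, and speculative sampling to make full use of MiniCPM4's efficiency advantages
- ArkInfer -- a cross-platform deployment system: supports one-click deployment across multiple backend environments and offers flexible cross-platform adaptation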
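Speculative sampling, one of the techniques CPM.cu combines, can be illustrated with a toy draft-and-verify loop. The sketch below is the greedy variant with stand-in "models" that are plain Python functions; it is not CPM.cu's or Eagle's implementation, and every name in it (`speculative_decode`, `toy_target`, `toy_draft`) is hypothetical. A real system verifies all drafted positions in a single batched forward pass of the large model, which is where the speedup comes from; here the target is called once per position for clarity.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one call to the model per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, draft_len: int = 4) -> List[int]:
    """Greedy draft-and-verify decoding; output matches greedy_decode(target, ...)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1) The cheap draft model proposes a short continuation.
        proposal: List[int] = []
        for _ in range(min(draft_len, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2) The target model checks the proposal position by position.
        for drafted in proposal:
            expected = target(tokens)
            tokens.append(expected)   # always keep the target's own choice
            if expected != drafted:
                break                 # first mismatch: drop the rest of the draft
    return tokens


def toy_target(prefix: List[int]) -> int:
    return (sum(prefix) * 31 + 7) % 50


def toy_draft(prefix: List[int]) -> int:
    # Agrees with the target except when the prefix length is a multiple of 3.
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


prompt = [3, 14, 15]
fast = speculative_decode(toy_target, toy_draft, prompt, max_new_tokens=12)
slow = greedy_decode(toy_target, prompt, max_new_tokens=12)
print(fast == slow)
```

The key property is that drafted tokens are only kept when the target model would have produced them anyway, so the draft model affects speed but not the output.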

02.Model Inference

Inference with transformers

from modelscope import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)
path = 'OpenBMB/MiniCPM4-8B'
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)
# User can directly use the chat interface
# responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
# print(responds)
# User can also use the generate interface
messages = [
    {"role": "user", "content": "Write an article about Artificial Intelligence."},
]
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)
model_outputs = model.generate(
    model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7
)
output_token_ids = [
    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
]
responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)

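Inference with CPM.cu

We recommend running MiniCPM4 inference with CPM.cu. CPM.cu is a CUDA inference framework developed by OpenBMB that integrates efficient sparse attention, speculative sampling, and quantization to take full advantage of MiniCPM4's efficiency.

Install CPM.cu with the following commands: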

git clone https://github.com/OpenBMB/cpm.cu.git --recursive
cd cpm.cu
python3 setup.py install

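MiniCPM4 natively supports context lengths of up to 32,768 tokens. To reproduce the long-text speedup reported in the paper, use the validated LongRoPE factors. Enable LongRoPE by editing the rope_scaling field in config.json: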


{
    ...,
    "rope_scaling": {
        "rope_type": "longrope", 
        "long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
        "short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
        "original_max_position_embeddings": 32768
    }
}
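If you prefer not to edit the file by hand, the rope_scaling block above can be written with a few lines of Python. This is a minimal sketch using only the standard library; `with_longrope` and `patch_config_file` are hypothetical helper names, and the factor lists are passed in by the caller rather than repeated here.

```python
import json
from pathlib import Path


def with_longrope(config, long_factor, short_factor,
                  original_max_position_embeddings=32768):
    """Return a copy of `config` with the rope_scaling block set for LongRoPE."""
    updated = dict(config)
    updated["rope_scaling"] = {
        "rope_type": "longrope",
        "long_factor": list(long_factor),
        "short_factor": list(short_factor),
        "original_max_position_embeddings": original_max_position_embeddings,
    }
    return updated


def patch_config_file(config_path, long_factor, short_factor):
    """Rewrite config.json in place, leaving all other fields unchanged."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    updated = with_longrope(config, long_factor, short_factor)
    path.write_text(json.dumps(updated, indent=2, ensure_ascii=False) + "\n",
                    encoding="utf-8")
    return updated
```

A call would look like `patch_config_file(config_path, factors, factors)`, where `config_path` points at the config.json of your local model copy and `factors` holds the 64 values listed above (the long and short lists shown are identical). Back up the original file first.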

After making the change, run the following command to reproduce the long-context speedup:

python3 tests/test_generate.py

For more details on CPM.cu, see the CPM.cu repository (https://github.com/OpenBMB/cpm.cu).

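03.Model Fine-tuning

This section shows how to use ms-swift for self-cognition fine-tuning of MiniCPM4-8B. ms-swift is the official ModelScope community framework for training and deploying large language models and multimodal large models.

ms-swift repository:

https://github.com/modelscope/ms-swift

Before fine-tuning, make sure your environment is set up: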

git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .
pip install transformers -U

Prepare the fine-tuning dataset in the following format (the system field is optional), then pass `--dataset <dataset_path>` in the training script.

{"messages": [{"role": "user", "content": "What is the capital of Zhejiang?"}, {"role": "assistant", "content": "The capital of Zhejiang is Hangzhou."}]}
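A small script can build and sanity-check a file in this format before training. The sketch below writes one JSON object per line and checks only the structure visible in the example (a `messages` list whose items have a `role` and a string `content`, ending with an assistant reply). The allowed role set and the "last message is the assistant" rule are assumptions made for this sketch, not a statement of everything ms-swift accepts, and the helper names are hypothetical.

```python
import json

# Roles seen in the example plus the optional system role (assumption).
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record):
    """Raise ValueError if the record does not match the expected shape."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("record needs a non-empty 'messages' list")
    for msg in messages:
        if msg.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            raise ValueError("each message needs a string 'content'")
    if messages[-1]["role"] != "assistant":
        raise ValueError("the last message should be the assistant's reply")


def to_jsonl(records):
    """Serialize validated records as JSON Lines text."""
    lines = []
    for record in records:
        validate_record(record)
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


records = [
    {"messages": [
        {"role": "user", "content": "What is the capital of Zhejiang?"},
        {"role": "assistant", "content": "The capital of Zhejiang is Hangzhou."},
    ]},
]
print(to_jsonl(records), end="")
```

Writing the returned text to a file and passing that file's path to `--dataset` follows the usage described above.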

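The following script runs a quick self-cognition fine-tune of MiniCPM4-8B in about 10 minutes. It can run on the free A10 compute provided by ModelScope: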

https://modelscope.cn/my/mynotebook

# Training GPU memory: 18GB
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model OpenBMB/MiniCPM4-8B \
    --train_type lora \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
              'AI-ModelScope/alpaca-gpt4-data-en#500' \
              'swift/self-cognition#500' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --model_author swift \
    --model_name swift-robot
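The flags in the script interact in a few ways that are easy to work out by hand. The sketch below computes the effective batch size, an approximate number of optimizer steps, the warmup length, and the standard LoRA scaling factor (alpha divided by rank). It assumes all 1,500 sampled examples are used for training on a single GPU; the framework may hold out some samples for evaluation or round steps differently, so treat the step counts as estimates. `training_plan` is a hypothetical helper.

```python
import math


def training_plan(num_samples, per_device_batch, grad_accum,
                  num_devices=1, epochs=1, warmup_ratio=0.05):
    """Estimate optimizer steps and warmup steps from the CLI settings."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps = math.ceil(num_samples / effective_batch) * epochs
    warmup_steps = math.ceil(steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "optimizer_steps": steps,
        "warmup_steps": warmup_steps,
    }


# Three datasets, each subsampled to 500 examples via the '#500' suffix.
plan = training_plan(num_samples=3 * 500, per_device_batch=1, grad_accum=16)
lora_scale = 32 / 8  # lora_alpha / lora_rank in standard LoRA
print(plan, lora_scale)
```

With roughly 94 optimizer steps in total, `--eval_steps 50` and `--save_steps 50` trigger only once or twice during the run, which fits the short, quick-start nature of this recipe.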

[Figure: training GPU memory usage]

After training completes, run inference with the following command:

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --temperature 0 \
    --max_new_tokens 2048
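The `checkpoint-xxx` part of the adapter path depends on the training step at which the checkpoint was saved. As a convenience, the sketch below picks the highest-numbered `checkpoint-<step>` directory inside a given run directory. `latest_checkpoint` is a hypothetical helper, and it assumes checkpoint directories are named `checkpoint-` followed by an integer step; the highest step is not necessarily the best checkpoint by evaluation loss.

```python
import re
from pathlib import Path

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(run_dir):
    """Return the checkpoint directory with the largest step, or None."""
    best_step, best_path = -1, None
    for child in Path(run_dir).iterdir():
        match = CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            step = int(match.group(1))
            # Compare numerically so checkpoint-100 beats checkpoint-50.
            if step > best_step:
                best_step, best_path = step, child
    return best_path
```

Pass the run directory created under `output` by the training command, then use the returned path as the value of `--adapters`.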

Push the model to ModelScope:

CUDA_VISIBLE_DEVICES=0 \
swift export \
    --adapters output/vx-xxx/checkpoint-xxx \
    --push_to_hub true \
    --hub_model_id '<your-model-id>' \
    --hub_token '<your-sdk-token>'

Follow the link below to open the model collection:

https://www.modelscope.cn/collections/MiniCPM-4-ec015560e8c84d