ModelScope Community Model Digest (March 30 - April 12)

ModelScope community progress this period: 1,911 new models, 297 datasets, 113 innovative applications, and 10 articles.

ModelScope Community · Published 2025-04-14 13:35:11


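🙋 ModelScope community progress this period:

📟 1,911 models: InternVL3 series, Kimi-VL series, Llama 4 series, SpatialLM, Dolphin series, TripoSG, and more;

📁 297 datasets: MapDR, OpenAI-4o_t2i_human_preference, openhands-feedback, reasoning-v1-20m, and more;

🎨 113 innovative applications: ModelScope MCP Playground, an online demo of PaddlePaddle's fourth-generation OCR model, Motionshop2 (replace the person in a video using a single photo), the video-erasing diffusion model DiffuEraser, and more;

📄 10 articles: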

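InternVL3 open-sourced: 7 sizes covering text, image, and video processing, with multimodal capabilities extended to industrial image analysis
Kimi open-sources an MoE multimodal reasoning model: few activated parameters, big capability!
Turn a large model into your personal WeChat assistant in three steps!
Breaking the "traffic-rule dilemma" in autonomous driving: Amap and Xi'an Jiaotong University release MapDR, a traffic-rule and HD-map benchmark for online lane-level traffic rule understanding, so AI understands traffic rules better!
BAAI upgrades its open-source FlagOS: the first efficient, fast deployment of full-size DeepSeek-R1 across multiple chip types
Llama 4 is live on ModelScope! Community inference and fine-tuning tutorials are here!
Langflow, an out-of-the-box visual AI application orchestration tool that can call ModelScope's free API as a tool
Major release: Dolphin, a new SOTA speech model supporting 40 Eastern languages and 22 Chinese dialects, is now open source!
The latest open-source spatial understanding model from Hangzhou's "Six Little Dragons": a step-by-step tutorial is here!
Tongyi Lingma deeply integrated with ModelScope Notebook: online coding out of the box, development efficiency multiplied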

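01. Recommended Models

InternVL3 Series

OpenGVLab has open-sourced the InternVL3 series in 7 sizes, from 1B to 78B. As an advanced multimodal large language model (MLLM), it can process text, images, video, and other inputs together and shows strong overall performance. Compared with InternVL 2.5, InternVL3 shows stronger multimodal perception and reasoning, and it extends its multimodal capabilities to tool use, GUI agents, industrial image analysis, 3D visual perception, and more.

Model collection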

https://modelscope.cn/collections/InternVL3-5d0bdc54b7d84e

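Sample code

For hands-on inference and fine-tuning tutorials, see:

InternVL3 open-sourced: 7 sizes covering text, image, and video processing, with multimodal capabilities extended to industrial image analysis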

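Kimi-VL Series

Kimi-VL is an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) from Moonshot AI. It offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Building on this base, Moonshot also released Kimi-VL-Thinking, which was developed with long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) and shows strong long-horizon reasoning.

Model links

Kimi-VL-A3B-Instruct:

https://www.modelscope.cn/models/moonshotai/Kimi-VL-A3B-Instruct

Kimi-VL-A3B-Thinking: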

https://www.modelscope.cn/models/moonshotai/Kimi-VL-A3B-Thinking

Sample code

Run inference on the Thinking model with the ModelScope SDK (compatible with transformers).

from PIL import Image
from modelscope import AutoModelForCausalLM, AutoProcessor
model_path = "moonshotai/Kimi-VL-A3B-Thinking"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
image_paths = ["./figures/demo1.png", "./figures/demo2.png"]
images = [Image.open(path) for path in image_paths]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path} for image_path in image_paths
        ] + [{"type": "text", "text": "Please infer step by step who this manuscript belongs to and what it records"}],
    },
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=images, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
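
The trimming step in the example above relies on `generate` returning the prompt tokens followed by the newly generated tokens for a decoder-only model, so the prompt prefix has to be sliced off before decoding. A minimal sketch of that slicing on plain Python lists follows; the token IDs and the helper name `trim_prompt_tokens` are made up for illustration and are not part of the ModelScope or transformers API.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence.

    Each entry of generated_ids is assumed to start with the matching
    entry of input_ids, followed by the newly generated tokens.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Hypothetical token IDs standing in for a batch of two prompts.
prompt_batch = [[101, 7, 8], [101, 9, 10]]
output_batch = [[101, 7, 8, 42, 43], [101, 9, 10, 55]]

new_tokens = trim_prompt_tokens(prompt_batch, output_batch)
print(new_tokens)
```

Without this step, the decoded response would repeat the full prompt (including the chat template text) before the model's answer.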

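Llama 4 Series

Meta has released the first models in the Llama 4 series: Llama 4 Scout and Llama 4 Maverick. Both use an MoE architecture. Llama 4 Scout has 17 billion active parameters and 16 experts, with a context window of up to 10 million tokens; Llama 4 Maverick has 17 billion active parameters and 128 experts. The models use a native multimodal design with early fusion, which integrates text and vision tokens into a unified model backbone and allows joint pre-training on large amounts of unlabeled text, image, and video data. Meta also improved the vision encoder in Llama 4: it is based on MetaCLIP but is trained separately alongside a frozen Llama model so that the encoder adapts better to the LLM.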
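
To illustrate why models with 16 and 128 experts can report the same active parameter count, here is a toy sketch of generic top-k expert routing. It is not Meta's implementation: the router scores are invented, the choice of `top_k=1` is an assumption, and shared layers such as attention are ignored. The point is only that the per-token cost depends on how many experts are selected, not on how many exist.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=1):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def routed_fraction(num_experts, top_k=1):
    """Share of routed experts that actually run for a single token."""
    return top_k / num_experts


# Hypothetical router scores for one token over 4 toy experts.
print(route_token([0.1, 2.0, -1.0, 0.5], top_k=1))

# Under the top_k=1 assumption, a smaller share of experts runs per token
# as the expert count grows, while the per-token work stays the same.
print(routed_fraction(16), routed_fraction(128))
```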

Model collection

https://modelscope.cn/collections/LLaMA-4-c9b8a04227f445

Sample code

To run inference with transformers, make sure transformers v4.51.0 is installed, or upgrade by running pip install -U transformers.

from modelscope import AutoProcessor, Llama4ForConditionalGeneration
import torch
model_id = "LLM-Research/Llama-4-Scout-17B-16E-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
url1 = "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/car.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "text", "text": "Can you describe the image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
print(outputs[0])
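
Before running the example above, the installed transformers version can be checked with a small standard-library sketch like the one below. It assumes v4.51.0 is meant as a minimum version, which the text does not state explicitly; the helper names are illustrative.

```python
from importlib import metadata


def parse_version(text):
    """Turn '4.51.0' or '4.51.0.dev0' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, required="4.51.0"):
    """Compare versions numerically, so that 4.9.0 sorts below 4.51.0."""
    return parse_version(installed) >= parse_version(required)


def installed_transformers_version():
    """Return the installed transformers version, or None if it is absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


version = installed_transformers_version()
if version is None:
    print("transformers is not installed")
elif meets_minimum(version):
    print(f"transformers {version} is recent enough")
else:
    print(f"transformers {version} is too old; run: pip install -U transformers")
```

Comparing integer tuples rather than raw strings matters here, because a plain string comparison would rank "4.9.0" above "4.51.0".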

SpatialLM

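SpatialLM is a 3D spatial understanding model open-sourced by Manycore Tech, one of Hangzhou's "Six Little Dragons". It generates structured 3D scene understanding output from 3D point cloud data, supports multiple data sources (such as monocular video sequences, RGBD images, and LiDAR sensors), and comes in 1B and 0.5B versions.

Model links:

1B:

https://modelscope.cn/models/manycore-research/SpatialLM-Llama-1B

0.5B: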

https://modelscope.cn/models/manycore-research/SpatialLM-Qwen-0.5B

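For a hands-on deployment tutorial, see:

The latest open-source spatial understanding model from Hangzhou's "Six Little Dragons": a step-by-step tutorial is here!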

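Dolphin Series

Dolphin is a multilingual, multitask ASR model developed jointly by Dataocean AI and Tsinghua University. It supports 40 Eastern languages across East Asia, South Asia, Southeast Asia, and the Middle East, as well as 22 Chinese dialects. It was trained on more than 210,000 hours of data, including Dataocean AI's proprietary datasets and open-source datasets. The model can perform speech recognition, voice activity detection (VAD), segmentation, and language identification (LID).

Model links

dolphin-base:

https://www.modelscope.cn/models/DataoceanAI/dolphin-base

dolphin-small: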

https://modelscope.cn/models/DataoceanAI/dolphin-small

Sample code

One-step install

pip install -U dataoceanai-dolphin

Calling Dolphin from the command line

dolphin audio.wav

# Download model and specify the model path
dolphin audio.wav --model small --model_dir /data/models/dolphin/

# Specify language and region
dolphin audio.wav --model small --model_dir /data/models/dolphin/ --lang_sym "zh" --region_sym "CN"

# Pad speech to 30 seconds
dolphin audio.wav --model small --model_dir /data/models/dolphin/ --lang_sym "zh" --region_sym "CN" --padding_speech true
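
When these commands need to be launched from a script, the argument list can be assembled programmatically. The sketch below uses only the flags shown in the examples above; the helper name `build_dolphin_command` is hypothetical, and the command is printed rather than executed so the snippet runs without Dolphin installed.

```python
import shlex


def build_dolphin_command(audio_path, model=None, model_dir=None,
                          lang_sym=None, region_sym=None, padding_speech=False):
    """Assemble the argument list for the dolphin CLI."""
    cmd = ["dolphin", audio_path]
    if model:
        cmd += ["--model", model]
    if model_dir:
        cmd += ["--model_dir", model_dir]
    if lang_sym:
        cmd += ["--lang_sym", lang_sym]
    if region_sym:
        cmd += ["--region_sym", region_sym]
    if padding_speech:
        # Only the "true" form appears in the examples above,
        # so the flag is emitted only when padding is requested.
        cmd += ["--padding_speech", "true"]
    return cmd


cmd = build_dolphin_command(
    "audio.wav",
    model="small",
    model_dir="/data/models/dolphin/",
    lang_sym="zh",
    region_sym="CN",
    padding_speech=True,
)
print(shlex.join(cmd))
```

Once Dolphin is installed, the resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list instead of a single string avoids shell-quoting problems with file names that contain spaces.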

Using Dolphin from Python

import dolphin

waveform = dolphin.load_audio("audio.wav")
model = dolphin.load_model("small", "/data/models/dolphin", "cuda")
result = model(waveform)
# Specify language and region
result = model(waveform, lang_sym="zh", region_sym="CN")
print(result.text)

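For a hands-on deployment tutorial, see:

Major release: Dolphin, a new SOTA speech model supporting 40 Eastern languages and 22 Chinese dialects, is now open source!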