ModelScope Community Model Express (May 25 - June 1)

ModelScope community progress this issue: 1,636 models, 663 datasets, 147 innovative applications, and 6 articles

ModelScope Community · Published 2025-06-03 10:09:18

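🙋 ModelScope community progress this issue:

📟 1,636 models: DeepSeek-R1-0528, QwenLong-L1, QwenLong-CPRS, sarvam-m, and more;

📁 663 datasets: DocQA-RL-1.6K, an object detection fine-tuning dataset for large models, PhysicalAI-Spatial-Intelligence-Warehouse, LIBRA, solyanka, and more;

🎨 147 innovative applications: MCP Playground V1, EasyControl, and more;

📄 6 articles:

Xiaomi Makes Another Big Move! The MiMo-VL Multimodal Large Model Is Open-Sourced, with a Full ModelScope Guide to Inference and Fine-Tuning
DeepSeek-R1-0528: A Small Update, a Big Upgrade
Paper Classification Leaderboard Challenge Baseline: Fine-Tuning InternLM with ms-swift
Pushing the Limits of Long-Context Processing: Tongyi Lab Open-Sources the QwenLong-L1 and QwenLong-CPRS Models
ModelScope Meetup | Highlights from the ModelScope Core Developer Co-Creation Session
Hands-On | A Domain Fine-Tuning Tutorial for Object Detection (Grounding) with Qwen2.5-VL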

01. Model Recommendations

DeepSeek-R1-0528

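DeepSeek recently open-sourced a minor version upgrade of DeepSeek R1: DeepSeek-R1-0528. In this update, DeepSeek R1 significantly deepens its reasoning and inference capabilities by using more compute and by introducing algorithmic optimizations during post-training. The model performs strongly across benchmarks covering mathematics, programming, and general logic, and its overall performance now approaches that of leading models such as o3 and Gemini 2.5 Pro. Beyond improved reasoning, this release also reduces hallucinations, strengthens function-calling support, and improves the coding experience.

In addition, the research team distilled the chain of thought from DeepSeek-R1-0528 to post-train Qwen3-8B Base, producing DeepSeek-R1-0528-Qwen3-8B, which ranks second only to DeepSeek-R1-0528 on the AIME 2024 math benchmark.

Model links: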

DeepSeek-R1-0528

https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-0528

DeepSeek-R1-0528-Qwen3-8B

https://www.modelscope.cn/models/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B

For more on inference and fine-tuning, see the tutorial:

DeepSeek-R1-0528: A Small Update, a Big Upgrade

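QwenLong Series

Tongyi Lab has released two new technologies, QwenLong-L1 (a reinforcement learning framework for long-context reasoning) and QwenLong-CPRS (a dynamic context compression system), and open-sourced the accompanying models and datasets, marking a leap in long-context processing capability.

QwenLong-L1-32B is the first long-context reasoning model trained with reinforcement learning. It performs strongly on multiple long-document question answering benchmarks, surpassing flagship reasoning models such as OpenAI-o3-mini and Qwen3-235B-A22B and matching the performance of Claude-3.7-Sonnet-Thinking.

QwenLong-CPRS is a query-aware, multi-granularity compression framework that optimizes long-context processing and outperforms both RAG and sparse attention methods. It selects content at the token level to extract information precisely, which improves accuracy. Unlike sparse attention methods, which require retraining, it works as a plug-and-play module compatible with a wide range of large language models and needs no additional training, combining fine-grained optimization with cross-architecture integration.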
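
To make the idea of query-aware compression concrete, here is a deliberately simplified sketch. It is not the QwenLong-CPRS algorithm: it works at sentence granularity rather than token granularity, and it uses plain word overlap with the query as a stand-in for the learned relevance model. The function name `compress_context` and the `max_sentences` budget are hypothetical and exist only for this illustration.

```python
import re


def _words(text):
    """Lower-cased set of word tokens in a piece of text."""
    return set(re.findall(r"\w+", text.lower()))


def compress_context(context, query, max_sentences=2):
    """Toy query-aware compression (illustration only, not QwenLong-CPRS).

    Scores each sentence by how many words it shares with the query and
    keeps at most `max_sentences` of the best ones, in their original order.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    query_words = _words(query)
    scored = [(len(_words(s) & query_words), i) for i, s in enumerate(sentences)]
    # Highest overlap first; ties are broken by earlier position.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:max_sentences]
    keep = sorted(i for score, i in best if score > 0)
    return " ".join(sentences[i] for i in keep)


context = (
    "The warehouse opened in 2010. It stores spare parts for trains. "
    "The manager likes tea. Trains arrive every Monday."
)
print(compress_context(context, "When do trains arrive?"))
```

The compressed text would then replace the full document in the prompt sent to the downstream model, which is the sense in which such a module is plug-and-play.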

Model links:

QwenLong-L1-32B

https://modelscope.cn/models/iic/QwenLong-L1-32B

QwenLong-CPRS-7B

https://modelscope.cn/models/iic/QwenLong-CPRS-7B

Example code:

QwenLong-L1-32B inference code

from modelscope import AutoModelForCausalLM, AutoTokenizer
model_name = "iic/QwenLong-L1-32B"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
# prepare the model input
template = """Please read the following text and answer the question below.
<text>
$DOC$
</text>
$Q$
Format your response as follows: "Therefore, the answer is (insert answer here)"."""
context = "<YOUR_CONTEXT_HERE>" 
question = "<YOUR_QUESTION_HERE>"
prompt = template.replace('$DOC$', context.strip()).replace('$Q$', question.strip())
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=10000,
    temperature=0.7,
    top_p=0.95
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 
# parsing thinking content
try:
    # rindex finding 151649 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151649)
except ValueError:
    index = 0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print("thinking content:", thinking_content)
print("content:", content)
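
Because the prompt template asks the model to end with "Therefore, the answer is (insert answer here)", the final answer can be pulled out of `content` with a small helper. The sketch below is an assumption about how one might post-process the output; `extract_final_answer` is a hypothetical name and is not part of the QwenLong-L1 release.

```python
import re
from typing import Optional

# The closing phrase requested by the prompt template above.
_MARKER = re.compile(r"therefore,\s+the\s+answer\s+is", re.IGNORECASE)


def extract_final_answer(content: str) -> Optional[str]:
    """Return the text after the last 'Therefore, the answer is', or None."""
    matches = list(_MARKER.finditer(content))
    if not matches:
        return None
    answer = content[matches[-1].end():].strip()
    # Remove wrapping quotes, a trailing period, and wrapping parentheses.
    answer = answer.strip("\"'").strip()
    answer = answer.rstrip(".").strip()
    answer = answer.strip("\"'").strip()
    if answer.startswith("(") and answer.endswith(")"):
        answer = answer[1:-1].strip()
    return answer or None


print(extract_final_answer("The clause is in section 4. Therefore, the answer is (B)."))
```

Taking the last occurrence of the phrase guards against the model restating an earlier, discarded answer before its final one.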

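MiMo-VL Series

Xiaomi has open-sourced two 7B-scale vision-language models, MiMo-VL-7B-SFT and MiMo-VL-7B-RL. The pre-training data totals about 2.4 trillion tokens of multimodal corpora, covering image captions, interleaved text-image data, OCR/grounding, video, GUI interaction, and synthetic reasoning data. In post-training, the research team introduced mixed on-policy reinforcement learning, which seamlessly integrates diverse reward signals spanning perception accuracy, visual grounding precision, logical reasoning, and human/AI preferences.

In general vision-language understanding, MiMo-VL-7B reaches state-of-the-art performance among open-source models. In multimodal reasoning, both the SFT and RL models significantly outperform all compared open-source baselines on the benchmarks. MiMo-VL-7B-RL also shows strong GUI understanding and grounding capabilities, and on an internal evaluation dataset judged by GPT-4o it achieves the highest Elo rating among open-source vision-language models ranging from 7B to 72B parameters.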
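
For readers unfamiliar with Elo ratings, the sketch below shows the standard update rule applied to one pairwise comparison, such as a judge preferring one model's response over another's. The article does not say how the MiMo-VL evaluation configured its Elo computation, so the K-factor of 32 and the 400-point scale used here are conventional defaults assumed for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A's response was preferred, 0.0 if B's was,
    and 0.5 for a tie. k is an assumed K-factor.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two models start level; the judge prefers model A once.
print(update_elo(1000.0, 1000.0, score_a=1.0))
```

Repeating this update over many judged pairs yields a ranking in which a higher rating means a model's responses were preferred more often than expected.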

Model collection:

https://modelscope.cn/collections/MiMo-VL-bb651017e02742

Example code:

Inference with transformers

from modelscope import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "XiaomiMiMo/MiMo-VL-7B-SFT", torch_dtype="auto", device_map="auto"
)
# default processor
processor = AutoProcessor.from_pretrained("XiaomiMiMo/MiMo-VL-7B-SFT")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the device the model was placed on (device_map="auto" chooses it)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

sarvam-m

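sarvam-m is a multilingual, hybrid-reasoning, text-only model built by the sarvamai team on top of Mistral-Small. It delivers significant gains on Indian-language, math, and programming benchmarks, supports a "think" mode for complex reasoning and a "non-think" mode for efficient conversation, and converses fluently across multiple languages.

Model link:

https://modelscope.cn/models/sarvamai/sarvam-m

Example code:

The following snippet shows how to use sarvam-m with Transformers: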

from modelscope import AutoModelForCausalLM, AutoTokenizer
model_name = "sarvamai/sarvam-m"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
# prepare the model input
prompt = "Who are you and what is your purpose on this planet?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=True,  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(**model_inputs, max_new_tokens=8192)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :].tolist()
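output_text = tokenizer.decode(output_ids)
# str.rstrip("</s>") strips any trailing '<', '/', 's' or '>' characters rather than
# the literal suffix, so use removesuffix (Python 3.9+) to drop only the EOS marker.
if "</think>" in output_text:
    reasoning_content = output_text.split("</think>")[0].rstrip("\n")
    content = output_text.split("</think>")[-1].lstrip("\n").removesuffix("</s>")
else:
    reasoning_content = ""
    content = output_text.removesuffix("</s>")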
print("reasoning content:", reasoning_content)
print("content:", content)

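For an in-depth explanation, see the article:

Can Synthetic Data Conquer the Real World Too? AETHER, the First Generative World Model Unifying Reconstruction, Prediction, and Planning, Is Open-Sourced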

02. Dataset Recommendations

DocQA-RL-1.6K

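DocQA-RL-1.6K is a document question answering (DocQA) dataset built specifically for reinforcement learning training. It contains 1,600 DocQA problems spanning mathematical, logical, and multi-hop reasoning, and is intended to help models improve long-context reasoning.

Dataset link: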

https://modelscope.cn/datasets/iic/DocQA-RL-1.6K

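Object Detection Fine-Tuning Dataset for Large Models

This dataset provides fine-tuning data for object detection tasks with large models. By annotating the bounding boxes of objects in images, it helps large models better understand and localize target objects, which in turn improves their performance on multimodal tasks such as visual question answering.

Dataset link: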

https://www.modelscope.cn/datasets/Tina12345/textVQA_groundingtask_bbox
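
When fine-tuning on bounding-box annotations like these, a common way to compare a predicted box with the annotated one is intersection over union (IoU). The sketch below assumes boxes are given as `(x1, y1, x2, y2)` pixel corners; the article does not specify the dataset's box format, so check the dataset card before reusing it.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2
    (an assumed format, not confirmed by the dataset card).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# A prediction shifted halfway to the right of the ground-truth box.
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

A prediction is often counted as correct when its IoU with the annotation exceeds a chosen threshold such as 0.5.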

PhysicalAI-Spatial-Intelligence-Warehouse

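PhysicalAI-Spatial-Intelligence-Warehouse is a comprehensive synthetic dataset created by NVIDIA to advance 3D scene understanding in warehouse environments. It contains 499k question-answer pairs for training, 19k for testing, and 1.9k for validation, along with about 95k RGB-D image pairs. The data covers four categories (spatial relations, multiple-choice questions, distance measurement, and object counting) and is annotated as natural-language question-answer pairs, so it can be used to train and evaluate vision-language models on spatial intelligence tasks.

Dataset link: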

https://modelscope.cn/datasets/nv-community/PhysicalAI-Spatial-Intelligence-Warehouse

LIBRA

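LIBRA (Long Input Benchmark for Russian Analysis) evaluates the ability of large language models (LLMs) to understand and process long Russian texts. The benchmark comprises 21 datasets adapted to different tasks and complexity levels. The tasks are divided into four complexity groups and can be evaluated at context lengths from 4k to 128k tokens.

Dataset link: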

https://modelscope.cn/datasets/ai-forever/LIBRA

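solyanka

solyanka is a collection of datasets containing about 10 million weakly supervised pairs for training text embedding models. Any dataset in the collection can be used in SentenceTransformers with the InfoNCE loss.

Dataset link: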

https://modelscope.cn/datasets/ai-forever/solyanka
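
As a reminder of what the InfoNCE loss mentioned for solyanka computes, here is a minimal pure-Python sketch with in-batch negatives: for each query, its paired document is the positive and every other document in the batch is a negative. The temperature of 0.05 and the use of cosine similarity are assumptions made for illustration, and the code is a teaching aid rather than a replacement for the library implementation (in SentenceTransformers, this family of loss is provided as `MultipleNegativesRankingLoss`).

```python
import math


def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def _normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosines."""
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def info_nce(queries, documents, temperature=0.05):
    """Mean InfoNCE loss over a batch of (query, document) embedding pairs.

    queries[i] is paired with documents[i]; all other documents in the
    batch act as negatives for query i.
    """
    q = [_normalize(v) for v in queries]
    d = [_normalize(v) for v in documents]
    losses = []
    for i, q_i in enumerate(q):
        logits = [_dot(q_i, d_j) / temperature for d_j in d]
        # Numerically stable log-sum-exp over all documents in the batch.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is close to zero when each query is most similar to its own document and grows when a negative scores higher than the positive, which is what pushes paired texts together during training.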