ModelScope Community Model Roundup (June 22 - July 6)

ModelScope Community · Published 2025-07-07 13:10:16

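🙋 ModelScope community progress in this issue:

📟 2,662 models: the ERNIE 4.5 series, GLM-4.1V-9B-Thinking, FLUX.1-Kontext-dev, Osmosis-Structure-0.6B, Hunyuan-A13B, jina-embeddings-v4, the official release of Gemma 3n, Kimi-VL-A3B-Thinking-2506, ThinkSound, and more;

📁 263 datasets: wiki_fr, RLPR-Train-Dataset, mad-cars, and more;

🎨 152 innovative applications: GLM-4.1V-9B-Thinking-Demo, Happy-LLM-215M-SFT, and more;

📄 12 articles:

Giveaway: the first batch of closed-beta invitations for the FlowBench client
The 2025 ModelScope MCP & Agent Challenge officially kicks off, with a total prize pool of 500,000!
Jina Embeddings V4: a multimodal, multilingual embedding model built for search
Zhipu AI releases GLM-4.1V-9B-Thinking, a new open-source VLM that introduces a thinking paradigm, with an 8x performance improvement
Can AI really code, or has it just memorized the answers? | A Code Bench livestream on the true picture of coding ability
The full-ecosystem FLUX.1 Kontext tutorial is here: try it online in the AIGC zone, plus DiffSynth framework and ComfyUI workflow tutorials
The ERNIE 4.5 model series is officially open source!
Optimizing Qwen3 series inference on ModelScope with the new PyTorch architecture of NVIDIA TensorRT-LLM
Tencent Hunyuan open-sources Hunyuan-A13B, its first hybrid-reasoning MoE model: strong performance with only 13B active parameters
Ant Group's visualization chart MCP debuts: it generates more than 25 chart types, and route maps too
A step-by-step Dify MCP tutorial is here!
Small model, big impact: a small language model for structured output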

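01. Model Recommendations

ERNIE 4.5 series

On June 30, Baidu officially open-sourced ERNIE 4.5. The open-source ERNIE 4.5 family comprises 10 models: mixture-of-experts (MoE) models with 47B and 3B active parameters (the largest has 424B total parameters), plus a 0.3B dense model. For the MoE architecture, the research team proposes a novel heterogeneous multimodal model structure that fuses knowledge across modalities through a cross-modal parameter-sharing mechanism while reserving dedicated parameter space for each individual modality.

All ERNIE 4.5 models are trained, served, and deployed with the PaddlePaddle deep learning framework. In large language model pretraining, model FLOPs utilization (MFU) reached 47%. Experimental results show the series reaching SOTA on multiple text and multimodal benchmarks, with especially strong results on instruction following, world knowledge memorization, visual understanding, and multimodal reasoning. The model weights are released under the Apache 2.0 license, supporting both academic research and industrial applications.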
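
As a rough illustration of what an MFU figure like 47% means, the sketch below computes MFU as achieved training FLOPs per second divided by the hardware's peak FLOPs per second. It assumes the common approximation of about 6 FLOPs per active parameter per training token (which ignores attention cost). The function name is hypothetical, and the throughput and cluster peak are made-up numbers chosen only to keep the arithmetic easy to follow; none of them describe Baidu's actual training setup.

```python
def model_flops_utilization(active_params: float,
                            tokens_per_second: float,
                            peak_flops_per_second: float) -> float:
    """Estimate MFU with the ~6 * N FLOPs-per-token approximation."""
    # Forward + backward pass costs roughly 6 FLOPs per active parameter per token.
    achieved_flops_per_second = 6 * active_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical inputs (NOT taken from the ERNIE 4.5 release):
mfu = model_flops_utilization(
    active_params=47e9,            # 47B active parameters (MoE)
    tokens_per_second=1.0e6,       # assumed cluster-wide training throughput
    peak_flops_per_second=6.0e17,  # assumed aggregate peak of the cluster
)
print(f"MFU = {mfu:.0%}")
```

For an MoE model the active parameter count, not the total, is what drives per-token compute, which is why the sketch uses 47B rather than 424B.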

Model collection:

https://modelscope.cn/collections/ERNIE-45-56f40e2777e348

Example code:

Inference with transformers (ERNIE-4.5-21B-A3B-PT):

from modelscope import AutoModelForCausalLM, AutoTokenizer
model_name = "./ERNIE-4.5-21B-A3B-PT"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name,device_map="auto", trust_remote_code=True)
# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# move the inputs to the same device as the model (needed with device_map="auto")
model_inputs = tokenizer([text], add_special_tokens=False, return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=1024
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# decode the generated ids
generate_text = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print("generate_text:", generate_text)

For more hands-on tutorials on inference, deployment, and fine-tuning, see:

The ERNIE 4.5 model series is officially open source!

GLM-4.1V-9B-Thinking

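Zhipu AI has released GLM-4.1V-9B-Thinking, a new open-source VLM that introduces a thinking paradigm and uses Reinforcement Learning with Curriculum Sampling (RLCS) to improve the model's capabilities across the board, reaching the strongest performance among vision-language models at the 10B-parameter scale. The base model, GLM-4.1V-9B-Base, is open-sourced alongside it, in the hope of helping more researchers explore the capability boundaries of vision-language models.

Compared with the previous-generation CogVLM2 and GLM-4V series, GLM-4.1V-Thinking brings the following improvements:

It is the first reasoning model in the series; it goes beyond mathematics and reaches world-leading levels in multiple subdomains.
It supports a 64k context length.
It supports arbitrary aspect ratios and image resolutions up to 4k.
It provides an open-source model version that supports both Chinese and English.

Model collection: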

https://modelscope.cn/collections/GLM-41V-35d24b6def9f49

Example code:

Single-image inference with transformers:

First, install the transformers library:

pip install git+https://github.com/huggingface/transformers.git

from modelscope import AutoProcessor, Glm4vForConditionalGeneration
import torch
MODEL_PATH = "ZhipuAI/GLM-4.1V-9B-Thinking"
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://upload.wikimedia.org/wikipedia/commons/f/fa/Grayscale_8bits_palette_sample_image.png"
            },
            {
                "type": "text",
                "text": "describe this image"
            }
        ],
    }
]
processor = AutoProcessor.from_pretrained(MODEL_PATH, use_fast=True)
model = Glm4vForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=8192)
output_text = processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(output_text)

Jina-embeddings-v4

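Jina AI has open-sourced jina-embeddings-v4, a new multimodal embedding model with 3.8 billion parameters that, for the first time, processes text and images together. To maximize performance across retrieval tasks, the model ships with a built-in set of task-specific LoRA adapters that strengthen query-document retrieval, semantic matching, and code search.

Across benchmarks including MTEB, MMTEB, CoIR, LongEmbed, STS, Jina-VDR, and ViDoRe, jina-embeddings-v4 delivers top-tier performance on multimodal and multilingual retrieval. It is especially good at interpreting visually rich content: whether tables, charts, or complex diagrams, it captures their underlying semantics accurately. The model also supports both single-vector and multi-vector representations to fit different scenarios.

Model link: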

https://modelscope.cn/models/jinaai/jina-embeddings-v4

Example code:

Inference with transformers:

# !pip install transformers>=4.52.0 torch>=2.6.0 peft>=0.15.2 torchvision pillow
# !pip install
from modelscope import AutoModel
import torch
# Initialize the model
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v4", trust_remote_code=True, torch_dtype=torch.float16)
model.to("cuda")
# ========================
# 1. Retrieval Task
# ========================
# Configure truncate_dim, max_length (for texts), max_pixels (for images), vector_type, batch_size in the encode function if needed
# Encode query
query_embeddings = model.encode_text(
    texts=["Overview of climate change impacts on coastal cities"],
    task="retrieval",
    prompt_name="query",
)
# Encode passage (text)
passage_embeddings = model.encode_text(
    texts=[
        "Climate change has led to rising sea levels, increased frequency of extreme weather events..."
    ],
    task="retrieval",
    prompt_name="passage",
)
# Encode image/document
image_embeddings = model.encode_image(
    images=["https://i.ibb.co/nQNGqL0/beach1.jpg"],
    task="retrieval",
)
# ========================
# 2. Text Matching Task
# ========================
texts = [
    "غروب جميل على الشاطئ",  # Arabic
    "海滩上美丽的日落",  # Chinese
    "Un beau coucher de soleil sur la plage",  # French
    "Ein wunderschöner Sonnenuntergang am Strand",  # German
    "Ένα όμορφο ηλιοβασίλεμα πάνω από την παραλία",  # Greek
    "समुद्र तट पर एक खूबसूरत सूर्यास्त",  # Hindi
    "Un bellissimo tramonto sulla spiaggia",  # Italian
    "浜辺に沈む美しい夕日",  # Japanese
    "해변 위로 아름다운 일몰",  # Korean
]
text_embeddings = model.encode_text(texts=texts, task="text-matching")
# ========================
# 3. Code Understanding Task
# ========================
# Encode query
query_embedding = model.encode_text(
    texts=["Find a function that prints a greeting message to the console"],
    task="code",
    prompt_name="query",
)
# Encode code
code_embeddings = model.encode_text(
    texts=["def hello_world():\n    print('Hello, World!')"],
    task="code",
    prompt_name="passage",
)
# ========================
# 4. Use multivectors
# ========================
multivector_embeddings = model.encode_text(
    texts=texts,
    task="retrieval",
    prompt_name="query",
    return_multivector=True,
)
images = ["https://i.ibb.co/nQNGqL0/beach1.jpg", "https://i.ibb.co/r5w8hG8/beach2.jpg"]
multivector_image_embeddings = model.encode_image(
    images=images,
    task="retrieval",
    return_multivector=True,
)
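
Once embeddings are computed, single-vector and multi-vector outputs are usually scored in different ways. The sketch below is an illustration with plain Python lists, not part of the jina-embeddings-v4 API: it assumes cosine similarity for single vectors and a ColBERT-style late-interaction score (MaxSim) for multi-vectors. The function names are hypothetical and the tiny 2-D vectors are made up to stand in for real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def maxsim(query_vectors, doc_vectors):
    """Late-interaction score: for each query vector, take its best match
    among the document vectors, then sum those maxima."""
    return sum(max(cosine(q, d) for d in doc_vectors) for q in query_vectors)


# Single-vector case: one vector per query and per document.
single_score = cosine([1.0, 0.0], [0.0, 1.0])

# Multi-vector case: one vector per query token and per document token/patch.
query_multi = [[1.0, 0.0], [0.0, 1.0]]
doc_multi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
multi_score = maxsim(query_multi, doc_multi)

print(single_score, multi_score)
```

The single-vector score compresses each item into one point, which is cheap to index; the multi-vector score keeps token-level matches at the cost of storing and comparing more vectors.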

For a deeper look at the model's technical details, see:

Jina Embeddings V4: a multimodal, multilingual embedding model built for search

FLUX.1-Kontext-dev

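FLUX.1 Kontext is a professional image generation and editing model open-sourced by Black Forest Labs, focused on precise image editing through context awareness. The model accepts mixed text and image input, understands image content, and performs editing tasks such as object modification, style transfer, and background replacement, while keeping the subject reasonably consistent across multi-turn edits. At its core is a flow matching architecture with a hybrid double-stream and single-stream design, which improves both the precision of semantic association and generation speed.

FLUX.1 Kontext [dev] is now live in the ModelScope AIGC zone, with free online image editing. The zone also supports model training through an online GUI, so you can train LoRA models on top of the FLUX.1 Kontext [dev] base model.

Model link: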

https://www.modelscope.cn/models/black-forest-labs/FLUX.1-Kontext-dev

Try it directly:

https://www.modelscope.cn/aigc/imageGeneration

Hunyuan-A13B

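On June 27, Tencent Hunyuan announced the open-source release of Hunyuan-A13B, with 80 billion total parameters and only 13 billion active parameters. It matches leading open-source models in quality while substantially reducing inference latency and compute cost. Hunyuan-A13B scores well on multiple authoritative industry benchmarks and stands out in agent tool calling and long-text capability; its native 256K context window helps it achieve leading results on long-text datasets.

The Hunyuan team built a multi-agent data synthesis framework that combines MCP, sandboxes, and LLM-simulated scenarios, and uses reinforcement learning for autonomous exploration across multiple environments, effectively improving model performance. The model can switch dynamically between "fast thinking" and "slow thinking" modes: the former gives concise, efficient output for lightweight tasks, while the latter uses deep reasoning and backtracking for complex tasks. This resource allocation mechanism balances efficiency and accuracy while preserving response speed.

Model link: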

https://modelscope.cn/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct

Example code:

Inference with modelscope (compatible with transformers):

from modelscope import AutoModelForCausalLM, AutoTokenizer
import os
import re
model_name_or_path = "Tencent-Hunyuan/Hunyuan-A13B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto",trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here
messages = [
    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True,return_tensors="pt",
                                                enable_thinking=True # Toggle thinking mode (default: True)
                                                )
outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=4096)
output_text = tokenizer.decode(outputs[0])
think_pattern = r'<think>(.*?)</think>'
think_matches = re.findall(think_pattern, output_text, re.DOTALL)
answer_pattern = r'<answer>(.*?)</answer>'
answer_matches = re.findall(answer_pattern, output_text, re.DOTALL)
think_content = [match.strip() for match in think_matches][0]
answer_content = [match.strip() for match in answer_matches][0]
print(f"thinking_content:{think_content}\n\n")
print(f"answer_content:{answer_content}\n\n")
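
Indexing the match lists with [0], as the snippet above does, raises IndexError whenever the output lacks a complete <think> or <answer> block, for example when generation is cut off by max_new_tokens before the closing tag. A more defensive variant could look like the sketch below; the helper name extract_block and the sample string are hypothetical, and the tag format is assumed to be the same as in the snippet above.

```python
import re


def extract_block(text: str, tag: str) -> str:
    """Return the stripped content of the first <tag>...</tag> block, or "" if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else ""


# Hypothetical model output, truncated before </answer> was generated.
sample_output = "<think>\nCompare the options.\n</think>\n<answer>\nExercise improves"

think_content = extract_block(sample_output, "think")
answer_content = extract_block(sample_output, "answer")
print(repr(think_content), repr(answer_content))
```

Returning an empty string instead of raising lets the caller decide whether a missing block is an error or simply means the response needs a larger token budget.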

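For more hands-on fine-tuning tutorials, see:

Tencent Hunyuan open-sources Hunyuan-A13B, its first hybrid-reasoning MoE model: strong performance with only 13B active parameters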

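Gemma 3n (official release)

Google has officially open-sourced Gemma 3n, an on-device multimodal model that runs locally on phones, tablets, and laptops and handles audio, text, image, and video input.

Compared with the earlier preview, the full release of Gemma 3n further improves performance, can run locally on hardware with 2GB of memory, and focuses its improvements on coding and reasoning.

This release comes in two sizes, 5 billion parameters (E2B) and 8 billion parameters (E4B), whose actual memory footprint is comparable to 2-billion- and 4-billion-parameter models respectively. It uses the MatFormer nested architecture (a matryoshka-doll-like design) that supports dynamically adjusting compute, combined with Per Layer Embeddings (PLE) and the MobileNet-v5 vision encoder, significantly improving memory efficiency and visual processing.

The model strengthens multilingual support (140 languages for text, 35 for multimodal understanding), math, code generation, and complex reasoning. It suits scenarios such as offline assistants, real-time multimodal interaction, and local content generation, balancing high performance with low power consumption.

Model links: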

gemma-3n-E4B

https://www.modelscope.cn/models/google/gemma-3n-E4B

gemma-3n-E4B-it

https://www.modelscope.cn/models/google/gemma-3n-E4B-it

gemma-3n-E2B

https://www.modelscope.cn/models/google/gemma-3n-E2B

gemma-3n-E2B-it

https://www.modelscope.cn/models/google/gemma-3n-E2B-it

Example code:

Running the model on a single GPU, using gemma-3n-E2B-it as an example:

from modelscope import AutoProcessor, Gemma3nForConditionalGeneration
from PIL import Image
import requests
import torch
model_id = "google/gemma-3n-E2B-it"
# from_pretrained takes device_map (not device) to place the weights on the GPU
model = Gemma3nForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16).eval()
processor = AutoProcessor.from_pretrained(model_id)
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)
input_len = inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
# **Overall Impression:** The image is a close-up shot of a vibrant garden scene,
# focusing on a cluster of pink cosmos flowers and a busy bumblebee.
# It has a slightly soft, natural feel, likely captured in daylight.

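02. Dataset Recommendations

wiki_fr

This dataset contains a complete snapshot of French Wikipedia as of April 20, 2025. It includes the latest revision of each page, with the raw text content, the titles of linked pages, and a unique identifier.

The text of each article preserves its MediaWiki formatting structure (such as == Section title ==, === Subsection title ===, and so on). This makes it particularly well suited to tasks that benefit from document hierarchy.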
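
To show how that heading structure can be used, the sketch below splits one article's text into sections by matching MediaWiki heading lines. It is a minimal example under stated assumptions: the function name split_sections is hypothetical, the sample text is made up rather than taken from the dataset, and the regular expression handles only plain == ... == headings on their own line.

```python
import re

# A heading line: 2-6 "=" signs, a title, then the same number of "=" signs.
HEADING = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def split_sections(wikitext: str):
    """Split MediaWiki text into (level, title, body) tuples.

    Text before the first heading is returned with level 1 and an empty title.
    """
    sections = []
    level, title, start = 1, "", 0
    for m in HEADING.finditer(wikitext):
        # Close the section that ends where this heading begins.
        sections.append((level, title, wikitext[start:m.start()].strip()))
        level, title, start = len(m.group(1)), m.group(2), m.end()
    sections.append((level, title, wikitext[start:].strip()))
    return sections


# Made-up sample in the format described above, not an actual dataset record.
sample = """Paris est la capitale de la France.
== Histoire ==
Texte sur l'histoire.
=== Antiquité ===
Texte sur l'Antiquité.
"""

sections = split_sections(sample)
for level, title, body in sections:
    print(level, repr(title), repr(body))
```

Keeping the heading level with each chunk makes it possible to rebuild the section tree later, for instance to prepend parent titles to a passage before indexing it for retrieval.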

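The corpus is well suited to language model training, information retrieval, question answering, and other natural language processing (NLP) research that needs large amounts of structured encyclopedic text.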