🙋 ModelScope community updates in this issue:
📟 315 models: Qwen2-Audio, the Qwen2-Math series, the MiniCPM-V-2_6 series, the InternLM2.5 series, CogVideoX-2b, and more;
📁 36 datasets: MedTrinity-25M, Recap-DataComp-1B, WikiRAG-TR, and more;
🎨 62 creative applications: MindSearch (思·索), 天降之物合集-Bert-VITS2-2.3, the FLUX text-to-image/image-to-image demo space (Gradio version), and more;
📄 5 articles:
- Qwen2-Math is open-sourced! A first look at synthetic math data generation
- Qwen2-Audio is open-sourced, making voice chat smoother
- InternLM2.5 (书生·浦语 2.5) open-sources ultra-lightweight, high-performance versions at multiple parameter scales for diverse application needs
- Multi-image and video understanding arrive on-device for the first time! ModelBest's (面壁) "little steel cannon" MiniCPM-V 2.6 lands, with ModelScope hands-on tutorials for inference, fine-tuning, and deployment
- A MindSearch technical deep dive: build a local AI search app that rivals Perplexity
Featured Models
Qwen2-Audio
Qwen2-Audio is the next generation of Qwen-Audio. It accepts audio and text inputs and generates text outputs, with the following features:
- Voice chat: users can talk to the audio-language model by voice, with no automatic speech recognition (ASR) module in between.
- Audio analysis: the model can analyze audio, including speech, sounds, and music, following text instructions (a sketch of this mode follows the code example below).
- Multilingual support: the model supports more than 8 languages and dialects, including Chinese, English, Cantonese, French, Italian, Spanish, German, and Japanese.
Model link:
Qwen2-Audio-7B-Instruct
https://www.modelscope.cn/models/qwen/Qwen2-Audio-7B-Instruct
Code example:
Voice chat inference
from io import BytesIO
from urllib.request import urlopen

import librosa
import torch
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
from modelscope import snapshot_download

# Download the weights from ModelScope and load the processor and model
model_dir = snapshot_download("Qwen/Qwen2-Audio-7B-Instruct")
processor = AutoProcessor.from_pretrained(model_dir)
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    model_dir, device_map="auto", torch_dtype=torch.bfloat16
)

# A multi-turn voice-chat conversation in which the user speaks via audio clips
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Fetch every referenced clip at the sampling rate the feature extractor expects
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(librosa.load(
                    BytesIO(urlopen(ele["audio_url"]).read()),
                    sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens before decoding the reply
generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
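The audio-analysis mode described earlier goes through the same API: place a text instruction alongside the audio in a single user turn. A minimal sketch, continuing from the block above (the clip URL is reused from the voice-chat example; the instruction text is our own):

# Audio analysis: a text instruction paired with the audio in one user turn
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
        {"type": "text", "text": "Describe what you hear in this audio."},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio = librosa.load(
    BytesIO(urlopen(conversation[0]["content"][0]["audio_url"]).read()),
    sr=processor.feature_extractor.sampling_rate)[0]
inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True).to("cuda")
generate_ids = model.generate(**inputs, max_length=256)
print(processor.batch_decode(generate_ids[:, inputs.input_ids.size(1):],
                             skip_special_tokens=True)[0])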
For more hands-on tutorials, see the Qwen2-Audio article listed above.
Qwen2-Math series
Qwen2-Math is built on the open-source Qwen2 models. On the authoritative MATH benchmark, Qwen2-Math-72B-Instruct outscores today's mainstream closed- and open-source models, solving problems in algebra, geometry, counting and probability, number theory, and more with 84% accuracy.
The Qwen2-Math series currently supports English only; the Qwen team will soon release Chinese-English bilingual versions, with multilingual versions also in development.
Model links:
Qwen2-Math-1.5B
https://www.modelscope.cn/models/qwen/Qwen2-Math-1.5B
Qwen2-Math-72B
https://www.modelscope.cn/models/qwen/Qwen2-Math-72B
Qwen2-Math-7B
https://www.modelscope.cn/models/qwen/Qwen2-Math-7B
Qwen2-Math-72B-Instruct
https://www.modelscope.cn/models/qwen/Qwen2-Math-72B-Instruct
Qwen2-Math-7B-Instruct
https://www.modelscope.cn/models/qwen/Qwen2-Math-7B-Instruct
Qwen2-Math-1.5B-Instruct
https://www.modelscope.cn/models/qwen/Qwen2-Math-1.5B-Instruct
Code example:
Using Qwen2-Math-72B-Instruct as an example:
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "qwen/Qwen2-Math-72B-Instruct"
device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Find the value of $x$ that satisfies the equation $4x+5 = 6x+7$."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Keep only the newly generated tokens, dropping the echoed prompt
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
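Math-tuned chat models like this are commonly prompted to place the final answer inside \boxed{…}; assuming the response above follows that convention (an assumption on our part, not a guarantee of the API), a minimal extraction helper:

import re

def extract_boxed_answer(text):
    # Return the contents of the last non-nested \boxed{...} in the output
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

print(extract_boxed_answer(response))  # expected "-1" for the equation above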
MiniCPM-V 2.6 series
MiniCPM-V 2.6 is the latest and best-performing model in the MiniCPM-V series. Built on SigLip-400M and Qwen2-7B with 8B parameters in total, it delivers a marked performance gain over MiniCPM-Llama3-V 2.5 and adds new multi-image and video understanding capabilities (a multi-image sketch follows the code example below).
Model links:
MiniCPM-V-2_6
https://www.modelscope.cn/models/OpenBMB/MiniCPM-V-2_6
MiniCPM-V-2_6-gguf
https://www.modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-gguf
MiniCPM-V-2_6-int4
https://www.modelscope.cn/models/openbmb/minicpm-v-2_6-int4
Code example:
Using MiniCPM-V-2_6 as an example:
# test.py
import torch
from PIL import Image
from modelscope import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('OpenBMB/MiniCPM-V-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('OpenBMB/MiniCPM-V-2_6', trust_remote_code=True)

image = Image.open('image.png').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]

res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)

## if you want to use streaming, please make sure sampling=True and stream=True
## the model.chat will return a generator
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)

generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')
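The multi-image understanding highlighted above uses the same chat interface, with several PIL images in one content list. A minimal sketch reusing the model and tokenizer loaded above (the file names and question are placeholders of our own):

# Multi-image chat: pass several images plus the question in a single user turn
image1 = Image.open('image1.png').convert('RGB')
image2 = Image.open('image2.png').convert('RGB')
msgs = [{'role': 'user', 'content': [image1, image2, 'Compare these two images.']}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))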
CogVideoX-2b
CogVideoX-2b is Zhipu AI's open-source video generation model, sharing its lineage with the company's Qingying (清影) product. Prompts are capped at 226 tokens; generated videos run 6 seconds at 8 frames per second with a resolution of 720x480. Inference at FP16 precision needs only 18 GB of VRAM, and fine-tuning needs only 40 GB.
Model link:
https://www.modelscope.cn/models/ZhipuAI/CogVideoX-2b
Running the example:
Install dependencies
pip install --upgrade opencv-python transformers diffusers  # diffusers>=0.30.0 is required
Run the code
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
)
# Offload submodules to CPU between forward passes to cut peak VRAM usage
pipe.enable_model_cpu_offload()

# Encode the prompt once; the model caps prompts at 226 tokens
prompt_embeds, _ = pipe.encode_prompt(
    prompt=prompt,
    do_classifier_free_guidance=True,
    num_videos_per_prompt=1,
    max_sequence_length=226,
    device="cuda",
    dtype=torch.float16,
)

video = pipe(
    num_inference_steps=50,
    guidance_scale=6,
    prompt_embeds=prompt_embeds,
).frames[0]

# Save the frames as a 6-second, 8 fps MP4
export_to_video(video, "output.mp4", fps=8)
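The block above pulls weights by Hugging Face model ID; to download from the ModelScope link given earlier instead, one plausible variant (following the snapshot_download pattern used elsewhere in this post) is:

from modelscope import snapshot_download

# Download CogVideoX-2b from ModelScope and point diffusers at the local copy
model_dir = snapshot_download("ZhipuAI/CogVideoX-2b")
pipe = CogVideoXPipeline.from_pretrained(model_dir, torch_dtype=torch.float16)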
Recommended Datasets
MedTrinity-25M
MedTrinity-25M is a comprehensive, large-scale multimodal medical dataset covering more than 25 million images across 10 modalities, with multi-granular annotations for over 65 diseases. These rich annotations include global textual information, such as disease/lesion type, modality, region-specific descriptions, and inter-region relationships, as well as detailed local annotations for regions of interest (ROIs), including bounding boxes and segmentation masks. Compared with existing datasets, MedTrinity-25M offers the richest annotations, supporting comprehensive multimodal tasks such as captioning and report generation, as well as vision-centric tasks such as classification and segmentation. It can power large-scale pretraining of multimodal medical AI models and contribute to the development of future foundation models for medicine.
Dataset link:
https://www.modelscope.cn/datasets/AI-ModelScope/MedTrinity-25M
WikiRAG-TR
WikiRAG-TR is a dataset of 6K (5,999) question-answer pairs created synthetically from the introduction sections of Turkish Wikipedia articles, built for Turkish retrieval-augmented generation (RAG) tasks.
Dataset link:
https://www.modelscope.cn/datasets/AI-ModelScope/WikiRAG-TR
Recap-DataComp-1B
Recap-DataComp-1B is a large-scale image-text dataset created by recaptioning the roughly 1.3 billion images of DataComp-1B with a LLaMA-3-powered LLaVA model, yielding higher-quality synthetic captions for vision-language training.
Dataset link:
https://www.modelscope.cn/datasets/AI-ModelScope/Recap-DataComp-1B
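Both datasets can be pulled directly from ModelScope; a minimal loading sketch with MsDataset (the split name here is an assumption and may differ per dataset):

from modelscope.msdatasets import MsDataset

# Load WikiRAG-TR from ModelScope; 'train' is an assumed split name
ds = MsDataset.load('AI-ModelScope/WikiRAG-TR', split='train')
print(next(iter(ds)))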
Featured Applications
Qwen2-Audio chat demo (通义千问2-音频模型-对话)
Unlike traditional speech models that handle only human voice, the audio-understanding model Qwen2-Audio-Instruct can perceive and understand all kinds of audio signals: human speech, natural sounds, animal sounds, music, and more. Give it an audio clip and you can ask it to interpret what it hears, or even use the audio as the basis for creative writing, logical reasoning, or story continuation, giving the model hearing that approaches a human's.
Try it out:
https://modelscope.cn/studios/qwen/Qwen2-Audio-Instruct-Demo
MindSearch (思·索)
The InternLM (书生·浦语) team's MindSearch (思·索) framework can actively gather and distill relevant information from 300+ web pages within 3 minutes, summarizing it to complete tasks that would take a person about 3 hours.
Try it out:
https://www.modelscope.cn/studios/Shanghai_AI_Laboratory/MindSearch
天降之物合集-Bert-VITS2-2.3 (Heaven's Lost Property collection)
Generates speech in the voices of anime characters.
Try it out:
https://www.modelscope.cn/studios/Ikaros/Ikaros-Bert-VITS2-2.3