🙋 ModelScope community highlights this issue:

📟 1,340 models: Qwen QwQ-32B, HunyuanVideo-I2V, CogView4-6B, the Phi-4 series, and more;

📁 220 datasets: Big-Math-RL-UNVERIFIED, msmarco-MiniLM-L6-v3, KodCode-V1, and more;

🎨 91 creative applications: CogView4, QwQ-32B-Demo, Reasoning Model Showdown (QwQ-32B vs DeepSeek-R1), and more;

📄 8 articles:

  • Tencent open-sources the HunyuanVideo-I2V image-to-video model plus LoRA training scripts; community deployment and inference tutorials are here!

  • QwQ-32B is open-sourced! A smaller model that matches full-strength R1 with only 1/20 of the parameters

  • Microsoft open-sources the Phi-4 series: breakthroughs in multimodal and text processing

  • Building cross-lingual intelligent tools and applications: the "万卷·丝路" special research program is now open for applications

  • CogView4 is released as open source! Zhipu AI's text-to-image model supports bilingual input of any length, excels at generating Chinese characters, and is available for commercial use!

  • CLIPer: a pioneering framework that enhances CLIP's spatial representations, achieving a breakthrough in open-vocabulary semantic segmentation

  • Efficiently deploying Wan2.1 (Tongyi Wanxiang): hands-on text-to-video and image-to-video with ComfyUI, ready-made workflows included!

  • Efficiently deploying Wan2.1 (Tongyi Wanxiang): hands-on WebUI experience built with Gradio

01. Featured Models

Qwen QwQ-32B

QwQ-32B is the latest open-source reasoning model from the Qwen team. With 32 billion parameters it delivers performance rivaling much larger models. Highlights include:

Outstanding reasoning: on the AIME24 math benchmark it matches DeepSeek-R1, on LiveCodeBench (coding) it comes close, and on general-capability evaluations such as LiveBench it even surpasses DeepSeek-R1;

Staged reinforcement learning: early training is optimized with math-answer verification and code-execution feedback, while later stages add a general reward model to keep performance balanced across domains;

Lightweight deployment: runs on a single card (e.g., RTX 3090 or M4 Max), lowering the hardware barrier; a serving sketch follows the sample code below;

Open-source benchmark value: at roughly 1/20 the parameter count of DeepSeek-R1, it demonstrates the efficiency gains of reinforcement learning and gives developers a cost-effective option.

Model collection:

https://www.modelscope.cn/collections/QwQ-32B-0f1806b8a8514a

Sample code:

from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r's are in the word \"strawberry\""
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
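
As a lightweight-deployment alternative to the transformers example above, here is a minimal sketch of serving QwQ-32B with vLLM. It assumes a GPU with enough memory for the bf16 weights (smaller cards would need a quantized variant) and uses the VLLM_USE_MODELSCOPE environment variable to pull weights from ModelScope:

import os
os.environ["VLLM_USE_MODELSCOPE"] = "True"  # fetch weights from ModelScope instead of Hugging Face

from vllm import LLM, SamplingParams

# Assumes the GPU can hold bf16 QwQ-32B; use a quantized build on smaller cards.
llm = LLM(model="Qwen/QwQ-32B", max_model_len=32768)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)

messages = [{"role": "user", "content": "How many r's are in the word \"strawberry\"?"}]
outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)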

For more hands-on deployment, inference, and fine-tuning tutorials, see:

QwQ-32B is open-sourced! A smaller model that matches full-strength R1 with only 1/20 of the parameters

HunyuanVideo-I2V image-to-video model

Tencent Hunyuan has released and open-sourced the HunyuanVideo-I2V image-to-video model. Built on the HunyuanVideo text-to-video foundation model, it leverages that model's advanced video-generation capability to extend it to image-to-video tasks. The LoRA training code is open-sourced alongside it for customized special-effect generation, enabling more creative video effects.

The team uses an image-latent concatenation technique to re-inject the reference image's information into the video-generation process, and employs a pretrained decoder-only multimodal large language model as the text encoder to strengthen the model's understanding of the input image's semantics and to fuse the image with the text description more deeply.
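
As a rough illustration of the image-latent conditioning idea (a conceptual sketch only, not the official implementation; all shapes below are hypothetical):

import torch

# Hypothetical latent shapes: (batch, channels, frames, height, width)
b, c, t, h, w = 1, 16, 33, 90, 160
noise_latents = torch.randn(b, c, t, h, w)   # video latents to be denoised
image_latent = torch.randn(b, c, 1, h, w)    # VAE latent of the reference image

# Broadcast the reference latent across time and concatenate along channels,
# so every denoising step can "see" the reference image.
cond = image_latent.expand(-1, -1, t, -1, -1)
model_input = torch.cat([noise_latents, cond], dim=1)  # (b, 2c, t, h, w)
print(model_input.shape)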

Model link:

https://modelscope.cn/models/AI-ModelScope/HunyuanVideo-i2v

Sample code:

A local inference setup for HunyuanVideo-I2V is provided; the hardware requirements are as follows.

Model | Resolution | Peak GPU memory
HunyuanVideo-I2V | 720p | 60 GB

  • An NVIDIA GPU with CUDA support is required

  • Tested on a single 80 GB GPU

  • Minimum: at least 60 GB of GPU memory for 720p resolution

  • Recommended: an 80 GB GPU for better generation quality

  • Tested operating system: Linux

Clone the repository

git clone https://github.com/tencent/HunyuanVideo-I2V
cd HunyuanVideo-I2V

Set up the environment

pip install -r requirements.txt
pip install ninja
pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3

Hunyuan image-to-video involves three models: the base model hunyuan-video-i2v-720p and two text encoders (text_encoder_i2v and text_encoder_2). By default the downloaded weights are placed under HunyuanVideo-I2V/ckpts, with the following layout:

HunyuanVideo-I2V
  ├──ckpts
  │  ├──README.md
  │  ├──hunyuan-video-i2v-720p
  │  │  ├──transformers
  │  │  │  ├──mp_rank_00_model_states.pt
  │  │  ├──vae
  │  │  ├──lora
  │  │  │  ├──embrace_kohaya_weights.safetensors
  │  │  │  ├──hair_growth_kohaya_weights.safetensors
  │  ├──text_encoder_i2v
  │  ├──text_encoder_2
  ├──...

All three models can be downloaded from ModelScope with the following commands:

cd HunyuanVideo-I2V

# Download the base model
modelscope download --model AI-ModelScope/HunyuanVideo-I2V --local_dir ./ckpts

# Download the MLLM text encoder
modelscope download --model AI-ModelScope/llava-llama-3-8b-v1_1-transformers --local_dir ./ckpts/text_encoder_i2v

# Download the CLIP text encoder
modelscope download --model AI-ModelScope/clip-vit-large-patch14 --local_dir ./ckpts/text_encoder_2

Inference code

cd HunyuanVideo-I2V

python3 sample_image2video.py \
    --model HYVideo-T/2 \
    --prompt "A man with short gray hair plays a red electric guitar." \
    --i2v-mode \
    --i2v-image-path ./assets/demo/i2v/imgs/0.png \
    --i2v-resolution 720p \
    --video-length 129 \
    --infer-steps 50 \
    --flow-reverse \
    --flow-shift 17.0 \
    --seed 0 \
    --use-cpu-offload \
    --save-path ./results

Time: with 50 inference steps, generating a 5-second video at 1280×704 takes roughly 50 minutes on an A100

GPU memory usage: about 60 GB

For more hands-on deployment and inference tutorials, see:

Tencent open-sources the HunyuanVideo-I2V image-to-video model plus LoRA training scripts; community deployment and inference tutorials are here!

CogView4-6B

Zhipu AI has officially released and open-sourced its latest image-generation model, CogView4. The model offers complex semantic alignment and instruction following, supports Chinese-English bilingual input of arbitrary length, can generate images at arbitrary resolutions, and is particularly strong at rendering text, with standout Chinese-character generation.

It ranks first overall on the DPG-Bench benchmark, reaching SOTA among open-source text-to-image models. CogView4 combines 2D rotary position embeddings, flow-matching diffusion modeling, and a multi-stage training strategy to break through traditional limits on text length and image resolution, making it a good fit for creative work in advertising, short video, and similar domains. It is also the first open-source image-generation model released under the Apache 2.0 license.

Model link:

https://modelscope.cn/models/ZhipuAI/CogView4-6B

Sample code:

Install dependencies

pip install git+https://github.com/huggingface/diffusers.git

Inference code

from diffusers import CogView4Pipeline
from modelscope import snapshot_download
import torch

model_dir = snapshot_download("ZhipuAI/CogView4-6B")
pipe = CogView4Pipeline.from_pretrained(model_dir, torch_dtype=torch.bfloat16)

# Enable these to reduce GPU memory usage
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."
image = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview4.png")

Phi-4 series

Microsoft's newly open-sourced Phi-4 series includes Phi-4-multimodal and Phi-4-mini. Phi-4-multimodal is a 5.6B-parameter multimodal language model that processes speech, vision, and text simultaneously, opening new possibilities for context-aware applications; Phi-4-mini is a compact 3.8B-parameter model built for speed and efficiency that excels at text-based tasks.

Model collection:

https://www.modelscope.cn/collections/phi-4-4ce2630c1b664f

Sample code:

Phi-4-mini-instruct inference code:

from vllm import LLM, SamplingParams

llm = LLM(model="LLM-Research/Phi-4-mini-instruct", trust_remote_code=True)

messages = [
 {"role": "system", "content": "You are a helpful AI assistant."},
 {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
 {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
 {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

sampling_params = SamplingParams(
 max_tokens=500,
 temperature=0.0,
)

output = llm.chat(messages=messages, sampling_params=sampling_params)
print(output[0].outputs[0].text)

Phi-4-multimodal-instruct inference code:

import requests
import torch
import os
import io
from PIL import Image
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from urllib.request import urlopen
from modelscope import snapshot_download


# Define model path
model_path = snapshot_download("LLM-Research/Phi-4-multimodal-instruct")

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, 
    device_map="cuda", 
    torch_dtype="auto", 
    trust_remote_code=True, 
    #attn_implementation='flash_attention_2',
).cuda()

# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

# Part 2: Audio Processing
print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open the audio file
audio, samplerate = sf.read(io.BytesIO(urlopen(audio_url).read()))

# Process with the model
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

For more hands-on fine-tuning tutorials for the Phi-4 series, see:

Microsoft open-sources the Phi-4 series: breakthroughs in multimodal and text processing

02. Recommended Datasets

Big-Math-RL-UNVERIFIED

Big-Math-RL-UNVERIFIED is a high-quality dataset focused on mathematics, containing more than 250,000 math problems with verifiable solutions, designed specifically for reinforcement learning (RL) with language models.

Dataset link:

https://www.modelscope.cn/datasets/SynthLabsAI/Big-Math-RL-UNVERIFIED
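
A minimal sketch for loading the dataset with the ModelScope SDK (the record field names may differ, so inspect a sample first):

from modelscope.msdatasets import MsDataset

# Load the train split from ModelScope; print one record to see the actual problem/answer fields.
ds = MsDataset.load("SynthLabsAI/Big-Math-RL-UNVERIFIED", split="train")
print(next(iter(ds)))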

msmarco-MiniLM-L6-v3

msmarco-MiniLM-L6-v3 is a pretrained Sentence-Transformers model designed for text embedding and semantic-similarity tasks. Fine-tuned on the MS MARCO dataset, it is well suited to information retrieval, text matching, and semantic search.

Model link:

https://www.modelscope.cn/models/sentence-transformers/msmarco-MiniLM-L6-v3
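
A minimal usage sketch with the sentence-transformers library, downloading the weights from ModelScope first (this assumes the ModelScope repo mirrors the standard sentence-transformers layout; the query and passages below are illustrative):

from modelscope import snapshot_download
from sentence_transformers import SentenceTransformer, util

# Download the model files from ModelScope and load them locally.
model_dir = snapshot_download("sentence-transformers/msmarco-MiniLM-L6-v3")
model = SentenceTransformer(model_dir)

query_emb = model.encode("how many people live in London")
passage_embs = model.encode([
    "Around 9 million people live in London.",
    "London is known for its financial district.",
])
print(util.cos_sim(query_emb, passage_embs))  # higher score = more relevant passage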

KodCode-V1

KodCode is the largest fully synthetic open-source dataset providing verifiable solutions and tests for coding tasks. It contains 12 distinct subsets spanning different domains (from algorithms to package-specific knowledge) and difficulty levels (from basic coding exercises to interview and competitive-programming challenges). KodCode is designed for supervised fine-tuning (SFT) and RL optimization.

Dataset link:

https://www.modelscope.cn/datasets/AI-ModelScope/KodCode-V1

03. Featured Applications

CogView4

Try it out:

https://www.modelscope.cn/studios/ZhipuAI/CogView4

QwQ-32B-Demo

Try it out:

https://modelscope.cn/studios/Qwen/QwQ-32B-Demo

Reasoning Model Showdown (QwQ-32B vs DeepSeek-R1)

Try it out:

https://www.modelscope.cn/studios/AI-ModelScope/QwQ-32B-vs-DeepSeek-R1

04. Featured Community Articles


ModelScope aims to build the next-generation open-source Model-as-a-Service sharing platform, providing AI developers with flexible, easy-to-use, and low-cost one-stop model services to make building with models simpler!
