🙋 ModelScope community highlights this issue:

📟 1,340 models: Qwen QwQ-32B, HunyuanVideo-I2V, CogView4-6B, the Phi-4 series, and more;

📁 220 datasets: Big-Math-RL-UNVERIFIED, msmarco-MiniLM-L6-v3, KodCode-V1, and more;

🎨 91 creative applications: CogView4, QwQ-32B-Demo, Reasoning Model Showdown (QwQ-32B vs DeepSeek-R1), and more;

📄 8 articles:

  • Tencent open-sources the HunyuanVideo-I2V image-to-video model plus LoRA training scripts; community deployment and inference tutorials are here!

  • QwQ-32B is open-sourced! A smaller model that matches full-strength R1 with only 1/20 of the parameters

  • Microsoft open-sources the Phi-4 series: breakthroughs in multimodal and text processing

  • Building cross-lingual intelligent tools and applications: the "万卷·丝路" special research program is now open for applications

  • CogView4 is released as open source! Zhipu AI's text-to-image model supports bilingual input of any length, excels at generating Chinese characters, and is available for commercial use!

  • CLIPer: a pioneering framework that enhances CLIP's spatial representations, achieving a breakthrough in open-vocabulary semantic segmentation

  • Efficiently deploying Wan2.1 (Tongyi Wanxiang): hands-on text-to-video and image-to-video with ComfyUI, ready-made workflows included!

  • Efficiently deploying Wan2.1 (Tongyi Wanxiang): hands-on WebUI experience built with Gradio

01. Featured Models

Qwen QwQ-32B

QwQ-32B is the latest open-source reasoning model from the Qwen team. With 32 billion parameters it delivers performance rivaling much larger models. Highlights include:

Outstanding reasoning: on the AIME24 math benchmark it matches DeepSeek-R1, on LiveCodeBench (coding) it comes close, and on general-capability evaluations such as LiveBench it even surpasses DeepSeek-R1;

Staged reinforcement learning: early training is optimized with math-answer verification and code-execution feedback, while later stages add a general reward model to keep performance balanced across domains;

Lightweight deployment: runs on a single card (e.g., RTX 3090 or M4 Max), lowering the hardware barrier; a serving sketch follows the sample code below;

Open-source benchmark value: at roughly 1/20 the parameter count of DeepSeek-R1, it demonstrates the efficiency gains of reinforcement learning and gives developers a cost-effective option.

Model collection:

https://www.modelscope.cn/collections/QwQ-32B-0f1806b8a8514a

Sample code:

from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r's are in the word \"strawberry\""
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
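
As a lightweight-deployment alternative to the transformers example above, here is a minimal sketch of serving QwQ-32B with vLLM. It assumes a GPU with enough memory for the bf16 weights (smaller cards would need a quantized variant) and uses the VLLM_USE_MODELSCOPE environment variable to pull weights from ModelScope:

import os
os.environ["VLLM_USE_MODELSCOPE"] = "True"  # fetch weights from ModelScope instead of Hugging Face

from vllm import LLM, SamplingParams

# Assumes the GPU can hold bf16 QwQ-32B; use a quantized build on smaller cards.
llm = LLM(model="Qwen/QwQ-32B", max_model_len=32768)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)

messages = [{"role": "user", "content": "How many r's are in the word \"strawberry\"?"}]
outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)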

For more hands-on deployment, inference, and fine-tuning tutorials, see:

QwQ-32B is open-sourced! A smaller model that matches full-strength R1 with only 1/20 of the parameters

HunyuanVideo-I2V image-to-video model

Tencent Hunyuan has released and open-sourced the HunyuanVideo-I2V image-to-video model. Built on the HunyuanVideo text-to-video foundation model, it leverages that model's advanced video-generation capability to extend it to image-to-video tasks. The LoRA training code is open-sourced alongside it for customized special-effect generation, enabling more creative video effects.

The team uses an image-latent concatenation technique to re-inject the reference image's information into the video-generation process, and employs a pretrained decoder-only multimodal large language model as the text encoder to strengthen the model's understanding of the input image's semantics and to fuse the image with the text description more deeply.
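
As a rough illustration of the image-latent conditioning idea (a conceptual sketch only, not the official implementation; all shapes below are hypothetical):

import torch

# Hypothetical latent shapes: (batch, channels, frames, height, width)
b, c, t, h, w = 1, 16, 33, 90, 160
noise_latents = torch.randn(b, c, t, h, w)   # video latents to be denoised
image_latent = torch.randn(b, c, 1, h, w)    # VAE latent of the reference image

# Broadcast the reference latent across time and concatenate along channels,
# so every denoising step can "see" the reference image.
cond = image_latent.expand(-1, -1, t, -1, -1)
model_input = torch.cat([noise_latents, cond], dim=1)  # (b, 2c, t, h, w)
print(model_input.shape)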

Model link:

https://modelscope.cn/models/AI-ModelScope/HunyuanVideo-i2v

Sample code:

A local inference setup for HunyuanVideo-I2V is provided; the hardware requirements are as follows.

Model | Resolution | Peak GPU memory
HunyuanVideo-I2V | 720p | 60 GB

  • An NVIDIA GPU with CUDA support is required

  • Tested on a single 80 GB GPU

  • Minimum: at least 60 GB of GPU memory for 720p resolution

  • Recommended: an 80 GB GPU for better generation quality

  • Tested operating system: Linux

Clone the repository

git clone https://github.com/tencent/HunyuanVideo-I2V
cd HunyuanVideo-I2V

Set up the environment

pip install -r requirements.txt
pip install ninja
pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3

Hunyuan image-to-video involves three models: the base model hunyuan-video-i2v-720p and two text encoders (text_encoder_i2v and text_encoder_2). By default the downloaded weights are placed under HunyuanVideo-I2V/ckpts, with the following layout:

HunyuanVideo-I2V
  ├──ckpts
  │  ├──README.md
  │  ├──hunyuan-video-i2v-720p
  │  │  ├──transformers
  │  │  │  ├──mp_rank_00_model_states.pt
  │  │  ├──vae
  │  │  ├──lora
  │  │  │  ├──embrace_kohaya_weights.safetensors
  │  │  │  ├──hair_growth_kohaya_weights.safetensors
  │  ├──text_encoder_i2v
  │  ├──text_encoder_2
  ├──...

All three models can be downloaded from ModelScope with the following commands:

cd HunyuanVideo-I2V

# Download the base model
modelscope download --model AI-ModelScope/HunyuanVideo-I2V --local_dir ./ckpts

# Download the MLLM text encoder
modelscope download --model AI-ModelScope/llava-llama-3-8b-v1_1-transformers --local_dir ./ckpts/text_encoder_i2v

# Download the CLIP text encoder
modelscope download --model AI-ModelScope/clip-vit-large-patch14 --local_dir ./ckpts/text_encoder_2

Inference code

cd HunyuanVideo-I2V

python3 sample_image2video.py \
    --model HYVideo-T/2 \
    --prompt "A man with short gray hair plays a red electric guitar." \
    --i2v-mode \
    --i2v-image-path ./assets/demo/i2v/imgs/0.png \
    --i2v-resolution 720p \
    --video-length 129 \
    --infer-steps 50 \
    --flow-reverse \
    --flow-shift 17.0 \
    --seed 0 \
    --use-cpu-offload \
    --save-path ./results

Time: with 50 inference steps, generating a 5-second video at 1280×704 takes roughly 50 minutes on an A100

GPU memory usage: about 60 GB

For more hands-on deployment and inference tutorials, see:

Tencent open-sources the HunyuanVideo-I2V image-to-video model plus LoRA training scripts; community deployment and inference tutorials are here!

CogView4-6B

Zhipu AI has officially released and open-sourced its latest image-generation model, CogView4. The model offers complex semantic alignment and instruction following, supports Chinese-English bilingual input of arbitrary length, can generate images at arbitrary resolutions, and is particularly strong at rendering text, with standout Chinese-character generation.

It ranks first overall on the DPG-Bench benchmark, reaching SOTA among open-source text-to-image models. CogView4 combines 2D rotary position embeddings, flow-matching diffusion modeling, and a multi-stage training strategy to break through traditional limits on text length and image resolution, making it a good fit for creative work in advertising, short video, and similar domains. It is also the first open-source image-generation model released under the Apache 2.0 license.

Model link:

https://modelscope.cn/models/ZhipuAI/CogView4-6B

Sample code:

Install dependencies

pip install git+https://github.com/huggingface/diffusers.git

Inference code

from diffusers import CogView4Pipeline
from modelscope import snapshot_download
import torch

model_dir = snapshot_download("ZhipuAI/CogView4-6B")
pipe = CogView4Pipeline.from_pretrained(model_dir, torch_dtype=torch.bfloat16)

# Enable these to reduce GPU memory usage
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."
image = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview4.png")

Phi-4 series

Microsoft's newly open-sourced Phi-4 series includes Phi-4-multimodal and Phi-4-mini. Phi-4-multimodal is a 5.6B-parameter multimodal language model that processes speech, vision, and text simultaneously, opening new possibilities for context-aware applications; Phi-4-mini is a compact 3.8B-parameter model built for speed and efficiency that excels at text-based tasks.

Model collection:

https://www.modelscope.cn/collections/phi-4-4ce2630c1b664f

Sample code:

Phi-4-mini-instruct inference code:

from vllm import LLM, SamplingParams

llm = LLM(model="LLM-Research/Phi-4-mini-instruct", trust_remote_code=True)

messages = [
 {"role": "system", "content": "You are a helpful AI assistant."},
 {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
 {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
 {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

sampling_params = SamplingParams(
 max_tokens=500,
 temperature=0.0,
)

output = llm.chat(messages=messages, sampling_params=sampling_params)
print(output[0].outputs[0].text)

Phi-4-multimodal-instruct inference code:

import requests
import torch
import os
import io
from PIL import Image
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from urllib.request import urlopen
from modelscope import snapshot_download


# Define model path
model_path = snapshot_download("LLM-Research/Phi-4-multimodal-instruct")

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, 
    device_map="cuda", 
    torch_dtype="auto", 
    trust_remote_code=True, 
    #attn_implementation='flash_attention_2',
).cuda()

# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

# Part 2: Audio Processing
print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open the audio file
audio, samplerate = sf.read(io.BytesIO(urlopen(audio_url).read()))

# Process with the model
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

For more hands-on fine-tuning tutorials for the Phi-4 series, see:

Microsoft open-sources the Phi-4 series: breakthroughs in multimodal and text processing

02. Recommended Datasets

Big-Math-RL-UNVERIFIED

Big-Math-RL-UNVERIFIED is a high-quality dataset focused on mathematics, containing more than 250,000 math problems with verifiable solutions, designed specifically for reinforcement learning (RL) with language models.

Dataset link:

https://www.modelscope.cn/datasets/SynthLabsAI/Big-Math-RL-UNVERIFIED
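
A minimal sketch for loading the dataset with the ModelScope SDK (the record field names may differ, so inspect a sample first):

from modelscope.msdatasets import MsDataset

# Load the train split from ModelScope; print one record to see the actual problem/answer fields.
ds = MsDataset.load("SynthLabsAI/Big-Math-RL-UNVERIFIED", split="train")
print(next(iter(ds)))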

msmarco-MiniLM-L6-v3

msmarco-MiniLM-L6-v3 is a pretrained Sentence-Transformers model designed for text embedding and semantic-similarity tasks. Fine-tuned on the MS MARCO dataset, it is well suited to information retrieval, text matching, and semantic search.

Model link:

https://www.modelscope.cn/models/sentence-transformers/msmarco-MiniLM-L6-v3
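
A minimal usage sketch with the sentence-transformers library, downloading the weights from ModelScope first (this assumes the ModelScope repo mirrors the standard sentence-transformers layout; the query and passages below are illustrative):

from modelscope import snapshot_download
from sentence_transformers import SentenceTransformer, util

# Download the model files from ModelScope and load them locally.
model_dir = snapshot_download("sentence-transformers/msmarco-MiniLM-L6-v3")
model = SentenceTransformer(model_dir)

query_emb = model.encode("how many people live in London")
passage_embs = model.encode([
    "Around 9 million people live in London.",
    "London is known for its financial district.",
])
print(util.cos_sim(query_emb, passage_embs))  # higher score = more relevant passage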

KodCode-V1

KodCode is the largest fully synthetic open-source dataset providing verifiable solutions and tests for coding tasks. It contains 12 distinct subsets spanning different domains (from algorithms to package-specific knowledge) and difficulty levels (from basic coding exercises to interview and competitive-programming challenges). KodCode is designed for supervised fine-tuning (SFT) and RL optimization.

Dataset link:

https://www.modelscope.cn/datasets/AI-ModelScope/KodCode-V1

03. Featured Applications

CogView4

Try it out:

https://www.modelscope.cn/studios/ZhipuAI/CogView4

QwQ-32B-Demo

Try it out:

https://modelscope.cn/studios/Qwen/QwQ-32B-Demo

Reasoning Model Showdown (QwQ-32B vs DeepSeek-R1)

Try it out:

https://www.modelscope.cn/studios/AI-ModelScope/QwQ-32B-vs-DeepSeek-R1

04. Featured Community Articles


ModelScope aims to build the next-generation open-source Model-as-a-Service sharing platform, providing AI developers with flexible, easy-to-use, and low-cost one-stop model services to make building with models simpler!
