🙋 ModelScope community highlights this issue:
📟 4,848 models: the Emu3 series, GLM-4-Voice, stable-diffusion-3.5-large, Janus-1.3B, and more;
📁 45 datasets: CCI3-HQ-Annotation-Benchmark, SWE-bench, simpletuner_venv, and more;
🎨 46 innovative apps: SD3.5-turbo fast image generation, Alibaba's Tora trajectory-oriented video generation, open-notebooklm-demo, and more;
📄 7 articles:
- GLM-4-Voice: Zhipu's open-source "Her" is here!
- A unified multimodal model arrives! BAAI releases the multimodal world model Emu3!
- Compass Arena: the LLM arena's multimodal leaderboard is out!
- Today's highlight: "AI uses phones and computers like a human"; ModelScope's open-source projects are already one step ahead
- DeepSeek open-sources Janus, a multimodal LLM framework: ModelScope community best practices
- Students with an edu email address: come claim your exclusive (free) GPU!
- MemoryScope: a long-term memory system for LLM chatbots
Featured Models
GLM-4-Voice
Zhipu AI has released and open-sourced GLM-4-Voice, an end-to-end speech model. GLM-4-Voice can directly understand and generate Chinese and English speech, hold real-time voice conversations, and follow user instructions to change attributes of the speech such as emotion, intonation, speaking rate, and dialect.
GLM-4-Voice consists of three components:
- GLM-4-Voice-Tokenizer: adds vector quantization to the encoder of Whisper and is trained with supervision on ASR data; it converts continuous speech input into discrete tokens, with one second of audio represented by roughly 12.5 discrete tokens on average (a standalone sketch follows this list).
- GLM-4-Voice-Decoder: a speech decoder supporting streaming inference, trained on the flow-matching architecture of CosyVoice; it converts discrete speech tokens back into a continuous waveform and can start generating from as few as 10 speech tokens, reducing end-to-end conversation latency.
- GLM-4-Voice-9B: pre-trained and aligned for the speech modality on top of GLM-4-9B, enabling it to understand and generate discrete speech tokens.
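As a quick illustration of the tokenizer component, the minimal sketch below converts a speech clip into discrete tokens. It reuses the same helpers (`WhisperVQEncoder`, `WhisperFeatureExtractor`, `extract_speech_token`) and the same sample clip as the full example later in this section, and assumes it is run from inside the cloned `GLM-4-Voice` repository so the local modules resolve.
from modelscope import snapshot_download
from speech_tokenizer.modeling_whisper import WhisperVQEncoder
from speech_tokenizer.utils import extract_speech_token
from transformers import WhisperFeatureExtractor

# Load the speech tokenizer (Whisper encoder + vector quantization).
tokenizer_path = snapshot_download('ZhipuAI/glm-4-voice-tokenizer')
whisper_model = WhisperVQEncoder.from_pretrained(tokenizer_path).eval().to('cuda:0')
feature_extractor = WhisperFeatureExtractor.from_pretrained(tokenizer_path)

# Convert the clip into discrete speech tokens (roughly 12.5 tokens per second of audio).
audio_tokens = extract_speech_token(
    whisper_model, feature_extractor,
    ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav'])[0]
print(len(audio_tokens), audio_tokens[:10])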
Model links:
GLM-4-Voice-Tokenizer:
https://modelscope.cn/models/ZhipuAI/glm-4-voice-tokenizer
GLM-4-Voice-9B:
https://modelscope.cn/models/ZhipuAI/glm-4-voice-9b
GLM-4-Voice-Decoder:
https://modelscope.cn/models/ZhipuAI/glm-4-voice-decoder
Code example:
Environment setup:
git clone https://github.com/THUDM/GLM-4-Voice.git
pip install matcha-tts torchaudio hyperpyyaml
cd GLM-4-Voice
# If you run into environment issues, run the following command
pip install -r requirements.txt
Then run the following code from inside the `GLM-4-Voice` directory:
import os
import uuid
from typing import List, Optional, Tuple
import torch
import torchaudio
from flow_inference import AudioDecoder
from modelscope import snapshot_download
from speech_tokenizer.modeling_whisper import WhisperVQEncoder
from speech_tokenizer.utils import extract_speech_token
from transformers import AutoModel, AutoTokenizer, GenerationConfig, WhisperFeatureExtractor
class GLM4Voice:

    def _prepare_model(self):
        model_path = snapshot_download('ZhipuAI/glm-4-voice-9b')
        decoder_path = snapshot_download('ZhipuAI/glm-4-voice-decoder')
        tokenizer_path = snapshot_download('ZhipuAI/glm-4-voice-tokenizer')
        flow_config = os.path.join(decoder_path, 'config.yaml')
        flow_checkpoint = os.path.join(decoder_path, 'flow.pt')
        hift_checkpoint = os.path.join(decoder_path, 'hift.pt')
        # GLM
        self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map=self.device).eval()
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
        # Flow & Hift
        self.audio_decoder = AudioDecoder(
            config_path=flow_config, flow_ckpt_path=flow_checkpoint, hift_ckpt_path=hift_checkpoint, device=self.device)
        # Speech tokenizer
        self.whisper_model = WhisperVQEncoder.from_pretrained(tokenizer_path).eval().to(self.device)
        self.feature_extractor = WhisperFeatureExtractor.from_pretrained(tokenizer_path)

    def clear(self):
        self.previous_tokens = ''

    def __init__(self, generation_config=None):
        if generation_config is None:
            generation_config = GenerationConfig(top_p=0.8, temperature=0.2, max_new_tokens=2000, do_sample=True)
        self.generation_config = generation_config
        self.device = 'cuda:0'
        self._prepare_model()
        self.audio_offset = self.tokenizer.convert_tokens_to_ids('<|audio_0|>')
        self.end_token_id = self.tokenizer.convert_tokens_to_ids('<|user|>')
        self.clear()
    def infer(self, audio_path: Optional[str] = None, text: Optional[str] = None) -> Tuple[str, str]:
        if audio_path is not None:
            # Speech input: convert the audio into discrete speech tokens
            audio_tokens = extract_speech_token(self.whisper_model, self.feature_extractor, [audio_path])[0]
            audio_tokens = ''.join([f'<|audio_{x}|>' for x in audio_tokens])
            audio_tokens = '<|begin_of_audio|>' + audio_tokens + '<|end_of_audio|>'
            user_input = audio_tokens
            system_prompt = 'User will provide you with a speech instruction. Do it step by step. First, think about the instruction and respond in a interleaved manner, with 13 text token followed by 26 audio tokens. '
        else:
            user_input = text
            system_prompt = 'User will provide you with a text instruction. Do it step by step. First, think about the instruction and respond in a interleaved manner, with 13 text token followed by 26 audio tokens.'

        # Build the prompt on top of the conversation history
        text = self.previous_tokens
        text = text.strip()
        if '<|system|>' not in text:
            text += f'<|system|>\n{system_prompt}'
        text += f'<|user|>\n{user_input}<|assistant|>streaming_transcription\n'

        inputs = self.tokenizer([text], return_tensors='pt').to(self.device)
        generate_ids = self.model.generate(**inputs, generation_config=self.generation_config)[0]
        generate_ids = generate_ids[inputs['input_ids'].shape[1]:]
        # `text` already contains the previous history, so assign rather than append to avoid duplicating it
        self.previous_tokens = text + self.tokenizer.decode(generate_ids, spaces_between_special_tokens=False)
        return self._parse_generate_ids(generate_ids)
    def _parse_generate_ids(self, generate_ids: List[int]) -> Tuple[str, str]:
        text_tokens, audio_tokens = [], []
        this_uuid = str(uuid.uuid4())
        # Split the generated ids into interleaved text tokens and audio tokens
        for token_id in generate_ids.tolist():
            if token_id >= self.audio_offset:
                audio_tokens.append(token_id - self.audio_offset)
            elif token_id != self.end_token_id:
                text_tokens.append(token_id)
        # Decode the audio tokens back to a waveform and save it
        audio_tokens_pt = torch.tensor(audio_tokens, device=self.device)[None]
        tts_speech, _ = self.audio_decoder.token2wav(audio_tokens_pt, uuid=this_uuid, finalize=True)
        audio_path = f'{this_uuid}.wav'
        with open(audio_path, 'wb') as f:
            torchaudio.save(f, tts_speech.cpu(), 22050, format='wav')
        response = self.tokenizer.decode(text_tokens)
        return response, audio_path
if __name__ == '__main__':
    generation_config = GenerationConfig(top_p=0.8, temperature=0.2, max_new_tokens=2000, do_sample=True)
    glm_voice = GLM4Voice(generation_config=generation_config)

    # Turn 1: speech input
    audio_path = 'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav'
    response, output_path = glm_voice.infer(audio_path=audio_path)
    print(f'response: {response}\noutput_path: {output_path}')

    # Turn 2: text input within the same conversation ("Please answer in English")
    text = '请用英文回答'
    response, output_path = glm_voice.infer(text=text)
    print(f'response: {response}\noutput_path: {output_path}')

    # Start a fresh conversation
    glm_voice.clear()  # clear history
    text = '请用英文回答'
    response, output_path = glm_voice.infer(text=text)
    print(f'response: {response}\noutput_path: {output_path}')
"""
response: 是啊,阳光明媚的,真是个出门走走的好日子!你今天有什么计划吗?
output_path: 7f146cb5-4c1f-4c2c-85d0-0a8c985c90c0.wav
response: Sure! Today's weather is really nice, isn't it? It's a great day to go out and enjoy some fresh air. Do you have any plans for today?
output_path: 9326df35-aeec-4292-856b-5c0b1688e3f8.wav
response: Sure, I'll answer in English. What would you like to know?
output_path: e6e7c94b-7532-475f-bea7-e41566a954b6.wav
"""
For more usage tutorials, see:
Emu3 Series
On October 21, 2024, BAAI (Beijing Academy of Artificial Intelligence) officially released Emu3, a native multimodal world model. The model is trained with a single Transformer by converting images, text, and video into tokens in a discrete space. Using next-token prediction alone, with no diffusion models or compositional approaches, it can both understand and generate text, images, and video, surpassing strong task-specific models and reaching SOTA results on both generation and perception tasks.
Model links:
Emu3-Stage1:
https://modelscope.cn/models/BAAI/Emu3-Stage1
Emu3-VisionTokenizer:
https://modelscope.cn/models/BAAI/Emu3-VisionTokenizer
Emu3-Gen:
https://modelscope.cn/collections/Emu3-9eacc8668b1043
Emu3-Chat:
https://modelscope.cn/models/BAAI/Emu3-Chat
Code example:
from PIL import Image
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor, AutoModelForCausalLM
from transformers.generation.configuration_utils import GenerationConfig
from transformers.generation import LogitsProcessorList, PrefixConstrainedLogitsProcessor, UnbatchedClassifierFreeGuidanceLogitsProcessor
import torch
from modelscope import snapshot_download
# model path
EMU_HUB = snapshot_download("BAAI/Emu3-Stage1")
VQ_HUB = snapshot_download("BAAI/Emu3-VisionTokenizer")
import sys
sys.path.append(EMU_HUB)
from processing_emu3 import Emu3Processor
# prepare model and processor
model = AutoModelForCausalLM.from_pretrained(
EMU_HUB,
device_map="cuda:0",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(EMU_HUB, trust_remote_code=True, padding_side="left")
image_processor = AutoImageProcessor.from_pretrained(VQ_HUB, trust_remote_code=True)
image_tokenizer = AutoModel.from_pretrained(VQ_HUB, device_map="cuda:0", trust_remote_code=True).eval()
processor = Emu3Processor(image_processor, image_tokenizer, tokenizer, chat_template="{image_prompt}{text_prompt}")
# Image Generation
# prepare input
POSITIVE_PROMPT = " masterpiece, film grained, best quality."
NEGATIVE_PROMPT = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry."
classifier_free_guidance = 3.0
prompt = "a portrait of young girl."
prompt += POSITIVE_PROMPT
kwargs = dict(
mode='G',
ratio="1:1",
image_area=model.config.image_area,
return_tensors="pt",
padding="longest",
)
pos_inputs = processor(text=prompt, **kwargs)
neg_inputs = processor(text=NEGATIVE_PROMPT, **kwargs)
# prepare hyper parameters
GENERATION_CONFIG = GenerationConfig(
use_cache=True,
eos_token_id=model.config.eos_token_id,
pad_token_id=model.config.pad_token_id,
max_new_tokens=40960,
do_sample=True,
top_k=2048,
)
h = pos_inputs.image_size[:, 0]
w = pos_inputs.image_size[:, 1]
constrained_fn = processor.build_prefix_constrained_fn(h, w)
logits_processor = LogitsProcessorList([
UnbatchedClassifierFreeGuidanceLogitsProcessor(
classifier_free_guidance,
model,
unconditional_ids=neg_inputs.input_ids.to("cuda:0"),
),
PrefixConstrainedLogitsProcessor(
constrained_fn ,
num_beams=1,
),
])
# generate
outputs = model.generate(
pos_inputs.input_ids.to("cuda:0"),
GENERATION_CONFIG,
logits_processor=logits_processor,
attention_mask=pos_inputs.attention_mask.to("cuda:0"),
)
mm_list = processor.decode(outputs[0])
for idx, im in enumerate(mm_list):
    if not isinstance(im, Image.Image):
        continue
    im.save(f"result_{idx}.png")
# Multimodal Understanding
text = "The image depicts "
image = Image.open("assets/demo.png")
inputs = processor(
text=text,
image=image,
mode='U',
padding="longest",
return_tensors="pt",
)
GENERATION_CONFIG = GenerationConfig(
pad_token_id=tokenizer.pad_token_id,
bos_token_id=tokenizer.bos_token_id,
eos_token_id=tokenizer.eos_token_id,
max_new_tokens=1024,
)
outputs = model.generate(
inputs.input_ids.to("cuda:0"),
GENERATION_CONFIG,
attention_mask=inputs.attention_mask.to("cuda:0"),
)
outputs = outputs[:, inputs.input_ids.shape[-1]:]
answers = processor.batch_decode(outputs, skip_special_tokens=True)
for ans in answers:
    print(ans)
For more usage tutorials, see:
stable-diffusion-3.5-large
Stability AI recently released its latest model family, stable-diffusion-3.5-large. SD3.5 brings comprehensive architectural improvements and ships under an updated, more permissive community license, with better image fidelity, prompt adherence, and controllability, and it runs comfortably on consumer GPUs.
Model link:
https://modelscope.cn/models/AI-ModelScope/stable-diffusion-3.5-large
Example code:
Install dependencies
!pip install diffusers -U
Inference code:
import torch
from diffusers import StableDiffusion3Pipeline
from modelscope import snapshot_download
model_dir = snapshot_download("AI-ModelScope/stable-diffusion-3.5-large")
pipe = StableDiffusion3Pipeline.from_pretrained(model_dir, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")
image = pipe(
"A capybara holding a sign that reads Hello World",
num_inference_steps=28,
guidance_scale=3.5,
).images[0]
image.save("capybara.png")
Janus-1.3B
DeepSeek recently released Janus, a simple, unified, and flexible multimodal framework that handles both multimodal understanding and generation. Unlike earlier work, Janus decouples visual encoding into separate pathways while processing everything with a single, unified transformer. This not only eases the conflict the visual encoder faces between understanding and generation, but also makes the framework more flexible.
Model link:
https://modelscope.cn/models/deepseek-ai/Janus-1.3B
Example code:
Environment setup
!git clone https://github.com/deepseek-ai/Janus.git
%cd Janus
!pip install -e .
Visual understanding
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images
from modelscope import snapshot_download
# specify the path to the model
model_path = snapshot_download("deepseek-ai/Janus-1.3B")
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
conversation = [
{
"role": "User",
"content": "<image_placeholder>\nConvert the formula into latex code.",
"images": ["/mnt/workspace/Janus/images/equation.png"],
},
{"role": "Assistant", "content": ""},
]
# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)
# run the image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
# run the model to get the response
outputs = vl_gpt.language_model.generate(
inputs_embeds=inputs_embeds,
attention_mask=prepare_inputs.attention_mask,
pad_token_id=tokenizer.eos_token_id,
bos_token_id=tokenizer.bos_token_id,
eos_token_id=tokenizer.eos_token_id,
max_new_tokens=512,
do_sample=False,
use_cache=True,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
Image generation
import os
import PIL.Image
import torch
import numpy as np
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from modelscope import snapshot_download
# specify the path to the model
model_path = snapshot_download("deepseek-ai/Janus-1.3B")
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
conversation = [
{
"role": "User",
"content": "A stunning princess from kabul in red, white traditional clothing, blue eyes, brown hair",
},
{"role": "Assistant", "content": ""},
]
sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
conversations=conversation,
sft_format=vl_chat_processor.sft_format,
system_prompt="",
)
prompt = sft_format + vl_chat_processor.image_start_tag
@torch.inference_mode()
def generate(
    mmgpt: MultiModalityCausalLM,
    vl_chat_processor: VLChatProcessor,
    prompt: str,
    temperature: float = 1,
    parallel_size: int = 16,
    cfg_weight: float = 5,
    image_token_num_per_image: int = 576,
    img_size: int = 384,
    patch_size: int = 16,
):
    input_ids = vl_chat_processor.tokenizer.encode(prompt)
    input_ids = torch.LongTensor(input_ids)

    # Duplicate the prompt: even rows are conditional, odd rows are unconditional (for classifier-free guidance)
    tokens = torch.zeros((parallel_size*2, len(input_ids)), dtype=torch.int).cuda()
    for i in range(parallel_size*2):
        tokens[i, :] = input_ids
        if i % 2 != 0:
            tokens[i, 1:-1] = vl_chat_processor.pad_id

    inputs_embeds = mmgpt.language_model.get_input_embeddings()(tokens)

    # Autoregressively sample image tokens one position at a time
    generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int).cuda()
    for i in range(image_token_num_per_image):
        outputs = mmgpt.language_model.model(inputs_embeds=inputs_embeds, use_cache=True, past_key_values=outputs.past_key_values if i != 0 else None)
        hidden_states = outputs.last_hidden_state

        # Classifier-free guidance on the image-token head
        logits = mmgpt.gen_head(hidden_states[:, -1, :])
        logit_cond = logits[0::2, :]
        logit_uncond = logits[1::2, :]
        logits = logit_uncond + cfg_weight * (logit_cond-logit_uncond)
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated_tokens[:, i] = next_token.squeeze(dim=-1)

        # Feed the sampled token back in for both the conditional and unconditional rows
        next_token = torch.cat([next_token.unsqueeze(dim=1), next_token.unsqueeze(dim=1)], dim=1).view(-1)
        img_embeds = mmgpt.prepare_gen_img_embeds(next_token)
        inputs_embeds = img_embeds.unsqueeze(dim=1)

    # Decode the discrete image tokens back to pixels and save the results
    dec = mmgpt.gen_vision_model.decode_code(generated_tokens.to(dtype=torch.int), shape=[parallel_size, 8, img_size//patch_size, img_size//patch_size])
    dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)
    dec = np.clip((dec + 1) / 2 * 255, 0, 255)

    visual_img = np.zeros((parallel_size, img_size, img_size, 3), dtype=np.uint8)
    visual_img[:, :, :] = dec

    os.makedirs('generated_samples', exist_ok=True)
    for i in range(parallel_size):
        save_path = os.path.join('generated_samples', "img_{}.jpg".format(i))
        PIL.Image.fromarray(visual_img[i]).save(save_path)
generate(
vl_gpt,
vl_chat_processor,
prompt,
)
Recommended Datasets
CCI3-HQ-Annotation-Benchmark
CCI3-HQ-Annotation-Benchmark, provided by BAAI, is built around a high-quality Chinese web corpus and serves as a data-annotation benchmark for a range of natural language processing tasks.
Dataset link:
https://modelscope.cn/datasets/BAAI/CCI3-HQ-Annotation-Benchmark
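The datasets listed here can be pulled directly with the ModelScope SDK for a quick look. A minimal sketch, assuming the modelscope package is installed and the dataset exposes a default 'train' split (check the dataset card and adjust split accordingly); the same pattern applies to the other datasets below.
from modelscope.msdatasets import MsDataset

# Download (or reuse the cached copy of) the dataset and peek at a few records.
ds = MsDataset.load('BAAI/CCI3-HQ-Annotation-Benchmark', split='train')
for i, sample in enumerate(ds):
    print(sample)  # field names depend on the dataset card
    if i >= 2:
        break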
SWE-bench
SWE-bench is a software-engineering benchmark built from real GitHub issues and their corresponding pull requests; it evaluates whether models can resolve real-world repository issues by generating patches that pass the project's tests.
Dataset link:
https://modelscope.cn/datasets/AI-ModelScope/SWE-bench
simpletuner_venv
simpletuner_venv packages a Python virtual environment for the SimpleTuner fine-tuning toolkit, intended to simplify environment setup for diffusion-model training.
Dataset link:
https://modelscope.cn/datasets/livehouse/simpletuner_venv
Featured Apps
SD3.5-turbo Fast Image Generation
Building on Stability's newly released stable-diffusion-3.5-large family, SD3.5-turbo is a fast image-generation demo based on the Stable Diffusion 3.5 Large Turbo model, offering a quick and efficient text-to-image experience.
Try it out:
https://modelscope.cn/studios/AI-ModelScope/stable-diffusion-3.5-large-turbo
Alibaba Tora: Trajectory-Oriented Video Generation
Tora is Alibaba's recently open-sourced trajectory-controlled video generation tool: draw any number of trajectories and enter a text prompt, and it generates a 6-second trajectory-controlled video. You can use the provided preset trajectories or draw your own for a more personalized result.
Try it out:
https://modelscope.cn/studios/xiaoche/Tora
open-notebooklm-demo
open-notebooklm-demo is a demo showing how to apply and test large language models in a notebook-style environment.
Try it out:
https://modelscope.cn/studios/studio-test/open-notebooklm-demo
Featured Community Articles
- GLM-4-Voice: Zhipu's open-source "Her" is here!
- A unified multimodal model arrives! BAAI releases the multimodal world model Emu3!
- Compass Arena: the LLM arena's multimodal leaderboard is out!
- Today's highlight: "AI uses phones and computers like a human"; ModelScope's open-source projects are already one step ahead
- DeepSeek open-sources Janus, a multimodal LLM framework: ModelScope community best practices
- Students with an edu email address: come claim your exclusive (free) GPU!
- MemoryScope: a long-term memory system for LLM chatbots