今天,智谱 AI 推出并开源端到端语音模型 GLM-4-Voice!GLM-4-Voice 能够直接理解和生成中英文语音,进行实时语音对话,并且能够遵循用户的指令要求改变语音的情感、语调、语速、方言等属性。
模型结构
GLM-4-Voice 由三个部分组成:
-
GLM-4-Voice-Tokenizer: 通过在 Whisper 的 Encoder 部分增加 Vector Quantization 并在 ASR 数据上有监督训练,将连续的语音输入转化为离散的 token。每秒音频平均只需要用 12.5 个离散 token 表示。
-
GLM-4-Voice-Decoder: 基于 CosyVoice 的 Flow Matching 模型结构训练的支持流式推理的语音解码器,将离散化的语音 token 转化为连续的语音输出。最少只需要 10 个语音 token 即可开始生成,降低端到端对话延迟。
-
GLM-4-Voice-9B: 在 GLM-4-9B 的基础上进行语音模态的预训练和对齐,从而能够理解和生成离散化的语音 token。
预训练方面,为了攻克模型在语音模态下的智商和合成表现力两个难关,研究团队将 Speech2Speech 任务解耦合为“根据用户音频做出文本回复”和“根据文本回复和用户语音合成回复语音”两个任务,并设计两种预训练目标,分别基于文本预训练数据和无监督音频数据合成语音-文本交错数据以适配这两种任务形式。GLM-4-Voice-9B 在 GLM-4-9B 的基座模型基础之上,经过了数百万小时音频和数千亿 token 的音频文本交错数据预训练,拥有很强的音频理解和建模能力。
对齐方面,为了支持高质量的语音对话,研究团队设计了一套流式思考架构:根据用户语音,GLM-4-Voice 可以流式交替输出文本和语音两个模态的内容,其中语音模态以文本作为参照保证回复内容的高质量,并根据用户的语音指令要求做出相应的声音变化,在最大程度保留语言模型智商的情况下仍然具有端到端建模的能力,同时具备低延迟性,最低只需要输出 20 个 token 便可以合成语音。
模型体验
模型链接
GLM-4-Voice-Tokenizer:
https://modelscope.cn/models/ZhipuAI/glm-4-voice-tokenizer
GLM-4-Voice-9B:
https://modelscope.cn/models/ZhipuAI/glm-4-voice-9b
GLM-4-Voice-Decoder:
https://modelscope.cn/models/ZhipuAI/glm-4-voice-decoder
模型效果体验
体验链接:
https://modelscope.cn/studios/ZhipuAI/GLM-4-Voice-Demo/summary
小程序体验:
模型实战实践
模型推理
环境准备:
git clone https://github.com/THUDM/GLM-4-Voice.git
pip install matcha-tts torchaudio hyperpyyaml
cd GLM-4-Voice
# 如果出现环境问题,可以运行以下命令
pip install -r requirements.txt
然后进去`GLM-4-Voice`的目录下运行以下代码:
import os
import uuid
from typing import List, Optional, Tuple
import torch
import torchaudio
from flow_inference import AudioDecoder
from modelscope import snapshot_download
from speech_tokenizer.modeling_whisper import WhisperVQEncoder
from speech_tokenizer.utils import extract_speech_token
from transformers import AutoModel, AutoTokenizer, GenerationConfig, WhisperFeatureExtractor
class GLM4Voice:
def _prepare_model(self):
model_path = snapshot_download('ZhipuAI/glm-4-voice-9b')
decoder_path = snapshot_download('ZhipuAI/glm-4-voice-decoder')
tokenizer_path = snapshot_download('ZhipuAI/glm-4-voice-tokenizer')
flow_config = os.path.join(decoder_path, 'config.yaml')
flow_checkpoint = os.path.join(decoder_path, 'flow.pt')
hift_checkpoint = os.path.join(decoder_path, 'hift.pt')
# GLM
self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map=self.device).eval()
self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
# Flow & Hift
self.audio_decoder = AudioDecoder(
config_path=flow_config, flow_ckpt_path=flow_checkpoint, hift_ckpt_path=hift_checkpoint, device=self.device)
# Speech tokenizer
self.whisper_model = WhisperVQEncoder.from_pretrained(tokenizer_path).eval().to(self.device)
self.feature_extractor = WhisperFeatureExtractor.from_pretrained(tokenizer_path)
def clear(self):
self.previous_tokens = ''
def __init__(self, generation_config=None):
if generation_config is None:
generation_config = GenerationConfig(top_p=0.8, temperature=0.2, max_new_tokens=2000, do_sample=True)
self.generation_config = generation_config
self.device = 'cuda:0'
self._prepare_model()
self.audio_offset = self.tokenizer.convert_tokens_to_ids('<|audio_0|>')
self.end_token_id = self.tokenizer.convert_tokens_to_ids('<|user|>')
self.clear()
def infer(self, audio_path: Optional[str] = None, text: Optional[str] = None) -> Tuple[str, str]:
if audio_path is not None:
audio_tokens = extract_speech_token(self.whisper_model, self.feature_extractor, [audio_path])[0]
audio_tokens = ''.join([f'<|audio_{x}|>' for x in audio_tokens])
audio_tokens = '<|begin_of_audio|>' + audio_tokens + '<|end_of_audio|>'
user_input = audio_tokens
system_prompt = 'User will provide you with a speech instruction. Do it step by step. First, think about the instruction and respond in a interleaved manner, with 13 text token followed by 26 audio tokens. '
else:
user_input = text
system_prompt = 'User will provide you with a text instruction. Do it step by step. First, think about the instruction and respond in a interleaved manner, with 13 text token followed by 26 audio tokens.'
text = self.previous_tokens
text = text.strip()
if '<|system|>' not in text:
text += f'<|system|>\n{system_prompt}'
text += f'<|user|>\n{user_input}<|assistant|>streaming_transcription\n'
inputs = self.tokenizer([text], return_tensors='pt').to(self.device)
generate_ids = self.model.generate(**inputs, generation_config=self.generation_config)[0]
generate_ids = generate_ids[inputs['input_ids'].shape[1]:]
self.previous_tokens += text + self.tokenizer.decode(generate_ids, spaces_between_special_tokens=False)
return self._parse_generate_ids(generate_ids)
def _parse_generate_ids(self, generate_ids: List[int]) -> Tuple[str, str]:
text_tokens, audio_tokens = [], []
this_uuid = str(uuid.uuid4())
for token_id in generate_ids.tolist():
if token_id >= self.audio_offset:
audio_tokens.append(token_id - self.audio_offset)
elif token_id != self.end_token_id:
text_tokens.append(token_id)
audio_tokens_pt = torch.tensor(audio_tokens, device=self.device)[None]
tts_speech, _ = self.audio_decoder.token2wav(audio_tokens_pt, uuid=this_uuid, finalize=True)
audio_path = f'{this_uuid}.wav'
with open(audio_path, 'wb') as f:
torchaudio.save(f, tts_speech.cpu(), 22050, format='wav')
response = self.tokenizer.decode(text_tokens)
return response, audio_path
if __name__ == '__main__':
generation_config = GenerationConfig(top_p=0.8, temperature=0.2, max_new_tokens=2000, do_sample=True)
glm_voice = GLM4Voice(generation_config=generation_config)
audio_path = 'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav'
response, output_path = glm_voice.infer(audio_path=audio_path)
print(f'response: {response}\noutput_path: {output_path}')
text = '请用英文回答'
response, output_path = glm_voice.infer(text=text)
print(f'response: {response}\noutput_path: {output_path}')
glm_voice.clear() # 清空历史
text = '请用英文回答'
response, output_path = glm_voice.infer(text=text)
print(f'response: {response}\noutput_path: {output_path}')
"""
response: 是啊,阳光明媚的,真是个出门走走的好日子!你今天有什么计划吗?
output_path: 7f146cb5-4c1f-4c2c-85d0-0a8c985c90c0.wav
response: Sure! Today's weather is really nice, isn't it? It's a great day to go out and enjoy some fresh air. Do you have any plans for today?
output_path: 9326df35-aeec-4292-856b-5c0b1688e3f8.wav
response: Sure, I'll answer in English. What would you like to know?
output_path: e6e7c94b-7532-475f-bea7-e41566a954b6.wav
"""
显存占用:
模型Web Demo体验
智谱提供了可以直接启动的 Web Demo。用户可以输入语音或文本,模型会同时给出语音和文字回复。
首先下载仓库
git clone --recurse-submodules https://github.com/THUDM/GLM-4-Voice
cd GLM-4-Voice
然后安装依赖。
pip install -r requirements.txt
由于 Decoder 模型不支持通过 transformers
初始化,因此 checkpoint 需要单独下载,建议单独下载三个模型。
# 使用ModelScope CLI下载
modelscope download --model=ZhipuAI/glm-4-voice-9b --local_dir ./glm-4-voice-9b
# 建议其他模型也单独下载
modelscope download --model=ZhipuAI/glm-4-voice-decoder --local_dir ./glm-4-voice-decoder
modelscope download --model=ZhipuAI/glm-4-voice-tokenizer --local_dir ./glm-4-voice-tokenizer
发布WebDemo
首先启动模型服务,指定文件路径
python model_server.py --model-path glm-4-voice-9b
然后启动 web 服务
python web_demo.py --tokenizer-path glm-4-voice-tokenizer --model-path glm-4-voice-9b
即可在 http://127.0.0.1:8888 访问 web demo。可以手动下载之后通过 --tokenizer-path
和 --model-path
指定本地的路径。
显存占用:
点击链接👇,直达模型体验
https://modelscope.cn/studios/ZhipuAI/GLM-4-Voice-Demo/summary
所有评论(0)