GLM-4-Voice: Zhipu's open-source "Her" is here!

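Today, Zhipu AI released and open-sourced GLM-4-Voice, an end-to-end speech model. GLM-4-Voice can directly understand and generate Chinese and English speech, hold real-time voice conversations, and follow user instructions to change attributes of its speech such as emotion, intonation, speaking rate, and dialect.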

Model architecture

GLM-4-Voice consists of three components:

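GLM-4-Voice-Tokenizer: adds vector quantization to the encoder of Whisper and is trained with supervision on ASR data, converting continuous speech input into discrete tokens. On average, one second of audio needs only 12.5 discrete tokens.
GLM-4-Voice-Decoder: a speech decoder that supports streaming inference, trained on the Flow Matching model architecture of CosyVoice. It converts discrete speech tokens into continuous speech output. Generation can start with as few as 10 speech tokens, which reduces end-to-end conversation latency.
GLM-4-Voice-9B: built on GLM-4-9B with pre-training and alignment on the speech modality, so that it can understand and generate discrete speech tokens.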
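
To make the two figures above concrete, here is a small back-of-the-envelope sketch. It only uses the numbers stated in the list (12.5 tokens per second on average, 10 tokens before the decoder can start). The function names are illustrative and not part of the GLM-4-Voice code base.

```python
import math

TOKENS_PER_SECOND = 12.5   # average tokenizer rate stated above
MIN_DECODER_TOKENS = 10    # minimum speech tokens before decoding can start


def speech_token_count(duration_s: float) -> int:
    """Approximate number of discrete tokens for `duration_s` seconds of audio."""
    return math.ceil(duration_s * TOKENS_PER_SECOND)


def audio_seconds(num_tokens: int) -> float:
    """Approximate audio duration represented by `num_tokens` speech tokens."""
    return num_tokens / TOKENS_PER_SECOND


print(speech_token_count(10.0))            # 125
print(audio_seconds(MIN_DECODER_TOKENS))   # 0.8
```

In other words, a 10-second utterance costs roughly 125 tokens of context, and the decoder's 10-token minimum corresponds to about 0.8 seconds of audio.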

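On the pre-training side, the team faced two challenges in the speech modality: the model's intelligence and the expressiveness of its synthesized speech. To address them, the research team decoupled the Speech2Speech task into two tasks: "produce a text reply from the user's audio" and "synthesize the reply speech from the text reply and the user's speech". They designed two pre-training objectives that synthesize interleaved speech-text data from text pre-training data and from unsupervised audio data, respectively, to fit these two task formats. Starting from the GLM-4-9B base model, GLM-4-Voice-9B was pre-trained on millions of hours of audio and hundreds of billions of tokens of interleaved audio-text data, giving it strong audio understanding and modeling capabilities.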

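On the alignment side, to support high-quality voice conversation, the research team designed a streaming "thinking" architecture: given the user's speech, GLM-4-Voice outputs text and speech in a streaming, alternating fashion. The speech modality uses the text as a reference to keep the reply content high-quality, and the voice changes according to the user's spoken instructions. This preserves as much of the language model's intelligence as possible while retaining end-to-end modeling, and it keeps latency low: speech synthesis can begin after as few as 20 output tokens.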
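
The alternating text/speech output described above can be illustrated with a small sketch. It assumes, as in the inference script later in this article, that every token id at or above an `audio_offset` is a speech token and everything below it is text. The function name, the placeholder offset, and the chunking policy are illustrative assumptions, not the project's actual streaming implementation.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[str, List[int]]


def split_interleaved(token_ids: Iterable[int],
                      audio_offset: int,
                      min_audio_chunk: int = 10) -> Iterator[Event]:
    """Yield ("text", ids) and ("audio", codes) events from an interleaved stream.

    Audio codes are buffered and emitted once `min_audio_chunk` of them are
    available, so a downstream decoder could start synthesizing early.
    Any remaining text or audio is flushed at the end of the stream.
    """
    text_buf: List[int] = []
    audio_buf: List[int] = []
    for tid in token_ids:
        if tid >= audio_offset:
            if text_buf:                      # a text run just ended
                yield ("text", text_buf)
                text_buf = []
            audio_buf.append(tid - audio_offset)
            if len(audio_buf) >= min_audio_chunk:
                yield ("audio", audio_buf)
                audio_buf = []
        else:
            text_buf.append(tid)
    if text_buf:
        yield ("text", text_buf)
    if audio_buf:
        yield ("audio", audio_buf)


# Toy stream: ids below 100 are "text", ids from 100 upward are "audio".
stream = [1, 2, 100, 101, 102, 3, 103]
for kind, ids in split_interleaved(stream, audio_offset=100, min_audio_chunk=3):
    print(kind, ids)
```

In a real pipeline each "audio" event would be handed to the speech decoder while the language model keeps generating, which is what keeps the first-audio latency low.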

Trying the model

Model links

GLM-4-Voice-Tokenizer:

https://modelscope.cn/models/ZhipuAI/glm-4-voice-tokenizer

GLM-4-Voice-9B:

https://modelscope.cn/models/ZhipuAI/glm-4-voice-9b

GLM-4-Voice-Decoder:

https://modelscope.cn/models/ZhipuAI/glm-4-voice-decoder

Online demo

Demo link:

https://modelscope.cn/studios/ZhipuAI/GLM-4-Voice-Demo/summary

Hands-on practice

Model inference

Environment setup:

git clone https://github.com/THUDM/GLM-4-Voice.git
 
pip install matcha-tts torchaudio hyperpyyaml
cd GLM-4-Voice

# If you run into environment issues, run the following command
pip install -r requirements.txt

Then, from inside the `GLM-4-Voice` directory, run the following code:

import os
import uuid
from typing import List, Optional, Tuple

import torch
import torchaudio
from flow_inference import AudioDecoder
from modelscope import snapshot_download
from speech_tokenizer.modeling_whisper import WhisperVQEncoder
from speech_tokenizer.utils import extract_speech_token
from transformers import AutoModel, AutoTokenizer, GenerationConfig, WhisperFeatureExtractor


class GLM4Voice:

    def _prepare_model(self):
        model_path = snapshot_download('ZhipuAI/glm-4-voice-9b')
        decoder_path = snapshot_download('ZhipuAI/glm-4-voice-decoder')
        tokenizer_path = snapshot_download('ZhipuAI/glm-4-voice-tokenizer')

        flow_config = os.path.join(decoder_path, 'config.yaml')
        flow_checkpoint = os.path.join(decoder_path, 'flow.pt')
        hift_checkpoint = os.path.join(decoder_path, 'hift.pt')
        # GLM
        self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map=self.device).eval()
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

        # Flow & Hift
        self.audio_decoder = AudioDecoder(
            config_path=flow_config, flow_ckpt_path=flow_checkpoint, hift_ckpt_path=hift_checkpoint, device=self.device)

        # Speech tokenizer
        self.whisper_model = WhisperVQEncoder.from_pretrained(tokenizer_path).eval().to(self.device)
        self.feature_extractor = WhisperFeatureExtractor.from_pretrained(tokenizer_path)

    def clear(self):
        self.previous_tokens = ''

    def __init__(self, generation_config=None):
        if generation_config is None:
            generation_config = GenerationConfig(top_p=0.8, temperature=0.2, max_new_tokens=2000, do_sample=True)
        self.generation_config = generation_config
        self.device = 'cuda:0'
        self._prepare_model()
        self.audio_offset = self.tokenizer.convert_tokens_to_ids('<|audio_0|>')
        self.end_token_id = self.tokenizer.convert_tokens_to_ids('<|user|>')
        self.clear()

    def infer(self, audio_path: Optional[str] = None, text: Optional[str] = None) -> Tuple[str, str]:
        if audio_path is not None:
            audio_tokens = extract_speech_token(self.whisper_model, self.feature_extractor, [audio_path])[0]
            audio_tokens = ''.join([f'<|audio_{x}|>' for x in audio_tokens])
            audio_tokens = '<|begin_of_audio|>' + audio_tokens + '<|end_of_audio|>'
            user_input = audio_tokens
            system_prompt = 'User will provide you with a speech instruction. Do it step by step. First, think about the instruction and respond in a interleaved manner, with 13 text token followed by 26 audio tokens. '
        else:
            user_input = text
            system_prompt = 'User will provide you with a text instruction. Do it step by step. First, think about the instruction and respond in a interleaved manner, with 13 text token followed by 26 audio tokens.'

        text = self.previous_tokens
        text = text.strip()
        if '<|system|>' not in text:
            text += f'<|system|>\n{system_prompt}'
        text += f'<|user|>\n{user_input}<|assistant|>streaming_transcription\n'

        inputs = self.tokenizer([text], return_tensors='pt').to(self.device)
        generate_ids = self.model.generate(**inputs, generation_config=self.generation_config)[0]
        generate_ids = generate_ids[inputs['input_ids'].shape[1]:]
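        # `text` already starts with the previous history, so assign rather than
        # append; using `+=` here would duplicate earlier turns from the third turn on.
        self.previous_tokens = text + self.tokenizer.decode(generate_ids, spaces_between_special_tokens=False)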
        return self._parse_generate_ids(generate_ids)

    def _parse_generate_ids(self, generate_ids: torch.Tensor) -> Tuple[str, str]:
        text_tokens, audio_tokens = [], []
        this_uuid = str(uuid.uuid4())

        for token_id in generate_ids.tolist():
            if token_id >= self.audio_offset:
                audio_tokens.append(token_id - self.audio_offset)
            elif token_id != self.end_token_id:
                text_tokens.append(token_id)

        audio_tokens_pt = torch.tensor(audio_tokens, device=self.device)[None]
        tts_speech, _ = self.audio_decoder.token2wav(audio_tokens_pt, uuid=this_uuid, finalize=True)
        audio_path = f'{this_uuid}.wav'
        with open(audio_path, 'wb') as f:
            torchaudio.save(f, tts_speech.cpu(), 22050, format='wav')
        response = self.tokenizer.decode(text_tokens)
        return response, audio_path


if __name__ == '__main__':
    generation_config = GenerationConfig(top_p=0.8, temperature=0.2, max_new_tokens=2000, do_sample=True)
    glm_voice = GLM4Voice(generation_config=generation_config)

    audio_path = 'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav'
    response, output_path = glm_voice.infer(audio_path=audio_path)
    print(f'response: {response}\noutput_path: {output_path}')

    # The prompt below means "Please answer in English".
    text = '请用英文回答'
    response, output_path = glm_voice.infer(text=text)
    print(f'response: {response}\noutput_path: {output_path}')

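    glm_voice.clear()  # clear the conversation history
    # Same prompt again ("Please answer in English"), now without any history.
    text = '请用英文回答'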
    response, output_path = glm_voice.infer(text=text)
    print(f'response: {response}\noutput_path: {output_path}')
"""
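response: 是啊，阳光明媚的，真是个出门走走的好日子！你今天有什么计划吗？  (English: "Yes, it's bright and sunny, a great day to go out for a walk! Do you have any plans today?")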
output_path: 7f146cb5-4c1f-4c2c-85d0-0a8c985c90c0.wav
response: Sure! Today's weather is really nice, isn't it? It's a great day to go out and enjoy some fresh air. Do you have any plans for today?
output_path: 9326df35-aeec-4292-856b-5c0b1688e3f8.wav
response: Sure, I'll answer in English. What would you like to know?
output_path: e6e7c94b-7532-475f-bea7-e41566a954b6.wav
"""
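
The prompt format that `infer` assembles can also be looked at in isolation. The sketch below mirrors the string handling in the script above as a standalone function so the multi-turn layout can be inspected without loading any model. `build_prompt` and the shortened system prompt are illustrative names and values, not part of the repository.

```python
def build_prompt(history: str, user_input: str, system_prompt: str) -> str:
    """Append one user turn to `history` using the special tokens from the script.

    The system prompt is added only once, at the start of the conversation.
    """
    prompt = history.strip()
    if '<|system|>' not in prompt:
        prompt += f'<|system|>\n{system_prompt}'
    prompt += f'<|user|>\n{user_input}<|assistant|>streaming_transcription\n'
    return prompt


system_prompt = 'User will provide you with a text instruction.'  # shortened placeholder

# Turn 1: empty history.
prompt_1 = build_prompt('', 'Hello', system_prompt)
reply_1 = 'Hi there!'  # stands in for the decoded model output

# Turn 2: the history is the full previous prompt plus the model's reply.
history = prompt_1 + reply_1
prompt_2 = build_prompt(history, 'How are you?', system_prompt)
print(prompt_2)
```

Each new prompt is the previous prompt plus the previous reply plus the new user turn, so the stored history should be replaced by that concatenation after every turn rather than appended to itself.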

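Web demo

Zhipu provides a web demo that can be launched directly. Users can provide speech or text input, and the model replies with both speech and text.

First, clone the repository: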

git clone --recurse-submodules https://github.com/THUDM/GLM-4-Voice
cd GLM-4-Voice

Then install the dependencies:

pip install -r requirements.txt

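Because the Decoder model cannot be initialized through transformers, its checkpoint has to be downloaded separately. We recommend downloading all three models individually.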

# Download with the ModelScope CLI
modelscope download --model=ZhipuAI/glm-4-voice-9b --local_dir ./glm-4-voice-9b
# We recommend downloading the other models separately as well
modelscope download --model=ZhipuAI/glm-4-voice-decoder --local_dir ./glm-4-voice-decoder
modelscope download --model=ZhipuAI/glm-4-voice-tokenizer --local_dir ./glm-4-voice-tokenizer
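
Before launching anything, it can be worth checking that the downloads landed where expected. The sketch below assumes the three `--local_dir` values used above and the decoder file names that the inference script earlier in this article reads (`config.yaml`, `flow.pt`, `hift.pt`). The helper `missing_files` is an illustrative name, not part of the repository or of ModelScope.

```python
from pathlib import Path
from typing import Iterable, List

# File names the inference script reads from the decoder checkpoint directory.
DECODER_FILES = ("config.yaml", "flow.pt", "hift.pt")


def missing_files(model_dir: str, required: Iterable[str] = DECODER_FILES) -> List[str]:
    """Return the names in `required` that are not regular files in `model_dir`."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # Directory names match the --local_dir values in the commands above.
    for d in ("glm-4-voice-9b", "glm-4-voice-decoder", "glm-4-voice-tokenizer"):
        print(d, "present" if Path(d).is_dir() else "MISSING")
    print("decoder files missing:", missing_files("glm-4-voice-decoder"))
```

An empty list for the decoder directory means the three checkpoint files are in place.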

Launching the web demo

First start the model server, specifying the model path:

python model_server.py --model-path glm-4-voice-9b

Then start the web service:

python web_demo.py --tokenizer-path glm-4-voice-tokenizer --model-path glm-4-voice-9b

The web demo is then available at http://127.0.0.1:8888. If you downloaded the models manually, use --tokenizer-path and --model-path to point to the local paths.
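
If the page does not load, a quick first check is whether anything is listening on that port. The following is a minimal sketch using only the Python standard library. It assumes the default address above, and `is_listening` is an illustrative helper name.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 8888, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("web demo reachable:", is_listening())
```

This only confirms that a process accepts TCP connections on the port. It does not verify that the model server behind the web demo has finished loading.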

Click the link below to try the model directly:
https://modelscope.cn/studios/ZhipuAI/GLM-4-Voice-Demo/summary
