ModelScope Teams Up with vLLM for Faster, More Efficient LLM Inference
Introduction
In June this year, researchers from UC Berkeley, Stanford University, and UC San Diego proposed PagedAttention, a new attention algorithm inspired by the classic virtual-memory and paging techniques of operating systems, and built the LLM serving system vLLM on top of it.
Paper:
https://arxiv.org/pdf/2309.06180.pdf
GitHub repository:
https://github.com/vllm-project/vllm
vLLM achieves near-zero waste in KV cache memory and can flexibly share the KV cache both within a single request and across requests, further reducing memory usage.
Recently, the ModelScope community partnered with vLLM to bring faster and more efficient LLM inference to Chinese developers. With vLLM, developers can run offline batch inference over datasets using ModelScope's large language models, build API servers, and launch OpenAI-compatible API servers.
The latest ModelScope image ships with vLLM pre-installed. Official ModelScope image:
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda11.8.0-py310-torch2.1.0-tf2.14.0-1.9.5
The latest image will also be added to ModelScope's free compute image list soon.
Models supported by the ModelScope community:
| Model architecture | Model name | Example model IDs |
| --- | --- | --- |
| AquilaForCausalLM | Aquila | BAAI/AquilaChat2-34B, BAAI/Aquila2-34B, etc. |
| BaiChuanForCausalLM | Baichuan | baichuan-inc/Baichuan2-7B-Base, baichuan-inc/Baichuan2-13B-Base, etc. |
| ChatGLMModel | ChatGLM | ZhipuAI/chatglm2-6b, ZhipuAI/chatglm3-6b, etc. |
| InternLMForCausalLM | InternLM | internlm/internlm-7b, internlm/internlm-chat-7b, etc. |
| QWenLMHeadModel | Qwen | qwen/Qwen-7B, qwen/Qwen-7B-Chat, etc. |
| LlamaForCausalLM | LLaMA | modelscope/Llama-2-7b-ms, modelscope/Llama-2-13b-ms, modelscope/Llama-2-70b-ms, etc. |
| YiForCausalLM | Yi | 01ai/Yi-6B, 01ai/Yi-34B, etc. |
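If you prefer to fetch one of the models above ahead of time rather than letting vLLM download it on first use, you can pull a snapshot with the ModelScope SDK and point vLLM at the local directory. This is a minimal sketch, not part of the original article; the model ID and revision are only examples.

from modelscope import snapshot_download
from vllm import LLM

# Download the model files from ModelScope into the local cache and get the path
model_dir = snapshot_download('qwen/Qwen-7B-Chat', revision='v1.1.8')

# vLLM also accepts a local directory in place of a model ID
llm = LLM(model=model_dir, trust_remote_code=True)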
ModelScope Community Best Practices
To use ModelScope models with vLLM, simply set one environment variable before running any vLLM command:
export VLLM_USE_MODELSCOPE=True
Then pass a ModelScope model ID wherever a model ID is expected. The code examples below show how to quickly load ModelScope models into vLLM for inference.
Offline Batch Inference
We first show an example of using vLLM for offline batch inference over a dataset.
It uses a base LLM from the ModelScope community:
qwen/Qwen-7B
https://www.modelscope.cn/models/qwen/Qwen-7B/summary
from vllm import LLM, SamplingParams
import os

# Set the environment variable so the model is downloaded from ModelScope
os.environ['VLLM_USE_MODELSCOPE'] = 'True'

llm = LLM(model="qwen/Qwen-7B", revision="v1.1.8", trust_remote_code=True)

prompts = [
    "Hello, my name is",
    "today is a sunny day,",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, stop=["<|endoftext|>"])

outputs = llm.generate(prompts, sampling_params)

# Print the generated text for each prompt
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Next, we use a chat LLM from the ModelScope community (both single-turn and multi-turn conversation are supported):
ChatGLM3-6b-32k
https://www.modelscope.cn/models/ZhipuAI/chatglm3-6b-32k/summary
from vllm import LLM, SamplingParams
import os
from modelscope import AutoTokenizer
from copy import deepcopy

# Set the environment variable so the model is downloaded from ModelScope
os.environ['VLLM_USE_MODELSCOPE'] = 'True'

def process_response(output, history):
    # Code borrowed from ChatGLM3-6b-32k: parse the raw model output and
    # append the assistant turn to the conversation history
    content = ""
    history = deepcopy(history)
    for response in output.split("<|assistant|>"):
        metadata, content = response.split("\n", maxsplit=1)
        if not metadata.strip():
            content = content.strip()
            history.append({"role": "assistant", "metadata": metadata, "content": content})
            content = content.replace("[[训练时间]]", "2023年")
        else:
            history.append({"role": "assistant", "metadata": metadata, "content": content})
            if history[0]["role"] == "system" and "tools" in history[0]:
                content = "\n".join(content.split("\n")[1:-1])

                def tool_call(**kwargs):
                    return kwargs

                parameters = eval(content)
                content = {"name": metadata.strip(), "parameters": parameters}
            else:
                content = {"name": metadata.strip(), "content": content}
    return content, history

llm = LLM(model="ZhipuAI/chatglm3-6b-32k", revision="v1.0.1", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ZhipuAI/chatglm3-6b-32k", trust_remote_code=True)

prompts = [
    "Hello, my name is Alia",
    "Today is a sunny day,",
    "The capital of France is",
    "Introduce YaoMing to me.",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128,
                                 stop=[tokenizer.eos_token, "<|user|>", "<|observation|>"])

# First round: build chat inputs from the prompts with an empty history
inputs = []
for prompt in prompts:
    # build chat input according to the prompt and history
    inputs.append(tokenizer.build_chat_input(prompt, [])['input_ids'].numpy()[0].tolist())

# call with prompt_token_ids, which already contain the chat template
outputs = llm.generate(prompt_token_ids=inputs, sampling_params=sampling_params)

histories = []
for prompt, output in zip(prompts, outputs):
    history = []
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    history.append({"role": "user", "content": prompt})
    generated_text, history = process_response(generated_text, history)
    histories.append(history)

# Second round: follow-up questions that rely on the first-round history
prompts_new = [
    'What is my name again?',
    'What is the weather I just said today?',
    'What is the city you mentioned just now?',
    'How tall is him?',
]
inputs = []
for prompt, history in zip(prompts_new, histories):
    inputs.append(tokenizer.build_chat_input(prompt, history)['input_ids'].numpy()[0].tolist())

outputs = llm.generate(prompt_token_ids=inputs, sampling_params=sampling_params)

# print the output of the second round
for prompt, output in zip(prompts_new, outputs):
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Inference over the four prompts completes in under one second, and the multi-turn conversation is also fluent.
API Server
vLLM can be deployed as an LLM service. The server uses the AsyncLLMEngine class to process incoming requests asynchronously.
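If you want to embed that asynchronous engine in your own Python service instead of going through the bundled server, the sketch below shows the general pattern. It is written against the vLLM 0.2.x API that was current when this article was published (AsyncEngineArgs, AsyncLLMEngine.from_engine_args, and the async generate generator); newer releases may differ, so treat it as an illustration rather than a definitive recipe.

import asyncio
import os

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid

# Download the model from ModelScope, as in the earlier examples
os.environ['VLLM_USE_MODELSCOPE'] = 'True'

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="qwen/Qwen-7B-Chat", revision="v1.1.8", trust_remote_code=True))

async def answer(prompt: str) -> str:
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
    final_output = None
    # generate() returns an async generator that streams partial RequestOutputs;
    # the last item holds the finished generation
    async for request_output in engine.generate(prompt, sampling_params, request_id=random_uuid()):
        final_output = request_output
    return final_output.outputs[0].text

print(asyncio.run(answer("San Francisco is a")))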
To start the OpenAI-compatible server:
VLLM_USE_MODELSCOPE=True python -m vllm.entrypoints.openai.api_server \
--model="qwen/Qwen-7B-Chat" --revision="v1.1.8" --trust-remote-code
Query the model from the shell:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen/Qwen-7B-Chat",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
# Response:
# {"id":"cmpl-2a54b777c8714388806a53e7c00daf1d","object":"text_completion","created":1127948,"model":"qwen/Qwen-7B-Chat","choices":[{"index":0,"text":" city in California, United States.","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":4,"total_tokens":11,"completion_tokens":7}}
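Since the server implements the OpenAI API, it can also be queried from Python with the official openai client. The following is a minimal sketch, assuming the openai package (v1 or later) is installed; the API key is a placeholder, as the server does not verify it by default.

from openai import OpenAI

# Point the client at the local vLLM server; the key is a dummy value
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="qwen/Qwen-7B-Chat",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)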
For more on using vLLM with ModelScope models, see the official vLLM quickstart guide:
https://docs.vllm.ai/en/latest/getting_started/quickstart.html