
First Impressions of the Qwen2.5 Models
Qwen2.5 deployment & inference with vLLM
1. Environment Setup
Hardware: 3× A100 (40 GB each)
Software: CUDA 12.2, conda virtual environment
conda create -n my_vllm python==3.9.19 pip
conda activate my_vllm
pip install modelscope
pip install vllm
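Before downloading a 72B checkpoint, it is worth a quick sanity check that the environment sees the GPUs and that vllm imports cleanly. A minimal sketch, assuming torch was pulled in as a vllm dependency:

import torch
import vllm

# Confirm the install and that all expected GPUs are visible
print(f"vLLM version: {vllm.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")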
2. Model Download
Due to the hardware constraints, after several attempts I could only deploy the Qwen2.5-72B-Instruct-GPTQ-Int4 version; roughly speaking, the 72B weights in BF16 need on the order of 144 GB, more than 3× 40 GB can hold, while the GPTQ-Int4 checkpoint fits in about 40 GB.
It may also be that my deployment approach was wrong, which is why larger variants kept hitting OOM...
# Model download
# Default ModelScope cache path: /root/.cache/modelscope/hub/qwen/Qwen2.5-72B-Instruct-GPTQ-Int4
from modelscope import snapshot_download

model_dir = snapshot_download('qwen/Qwen2.5-72B-Instruct-GPTQ-Int4',
                              local_dir='/home/models/qwen/Qwen2.5-72B-Instruct-GPTQ-Int4')
3. Launching the vLLM Server Directly on the Machine for Testing
vllm serve /home/models/qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 --tensor-parallel-size 2
The server started successfully.
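Instead of reading the startup log, you can confirm the endpoint is up by querying the OpenAI-compatible /v1/models route. A minimal sketch, assuming the default port 8000:

import requests

# Should list the model path that was passed to `vllm serve`
resp = requests.get("http://localhost:8000/v1/models")
print(resp.json())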
Reference: https://qwen.readthedocs.io/zh-cn/latest/deployment/vllm.html
4. Test Code
import json
import requests

url = "http://localhost:8000/v1/chat/completions"
headers = {
    "Content-Type": "application/json"
}
question = 'balabala'
data = {
    "model": "/home/models/qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
    "messages": [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": question}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "repetition_penalty": 1.05,
    "max_tokens": 512
}
response = requests.post(url, headers=headers, data=json.dumps(data))
# Print the full response
print(response.json())
# My prompt asks for strict JSON wrapped in a code fence, so strip the leading
# ```json and trailing ``` markers (same trick as parse_result_format in section 5)
print(response.json()['choices'][0]['message']['content'][8:-4])
The request returns results as expected.
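If you prefer an SDK over raw requests, the same call also works through the official openai client, since the vLLM server is OpenAI-compatible. A sketch, assuming pip install openai and the default endpoint; api_key is a placeholder because this server runs without authentication:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="/home/models/qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
    messages=[
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "balabala"}
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    # Non-standard sampling options can be passed through to vLLM via extra_body
    extra_body={"repetition_penalty": 1.05}
)
print(completion.choices[0].message.content)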
5. Batch Inference
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
# Local helper modules (not shown in full): MODEL_PATH, question/question1/question2
# and parse_result_format come from these
from utils import *
from custom_prompts import *

batch_question = [
    [{"role": "user", "content": question}],
    [{"role": "user", "content": question1}],
    [{"role": "user", "content": question2}]
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
sampling_params = SamplingParams(temperature=0.01, max_tokens=512)

# Apply the chat template to each conversation, returning plain prompt strings
batch_question = tokenizer.apply_chat_template(
    batch_question,
    tokenize=False,
    add_generation_prompt=True
)

llm = LLM(model=MODEL_PATH, tensor_parallel_size=2, dtype='auto',
          gpu_memory_utilization=0.9, max_model_len=512, enforce_eager=True)

# Generate outputs for the whole batch in one call
outputs = llm.generate(batch_question, sampling_params)

# Print the outputs
for idx, output in enumerate(outputs):
    prompt = output.prompt
    generated_text = output.outputs[0].text
    formated_res = parse_result_format(generated_text)
    print(f"Answer-{idx}:\n{formated_res}")
import json

# How you format the result depends heavily on your prompt; in my case the prompt
# requires a strict JSON answer, so this strips the surrounding fence and pretty-prints it
def parse_result_format(llm_result):
    # Drop the leading ```json and trailing ``` fence, then remove newlines
    llm_result = llm_result[8:-4].replace('\n', '')
    llm_result = json.dumps(json.loads(llm_result), ensure_ascii=False, indent=4)
    return llm_result
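The hard-coded [8:-4] slice assumes the answer is always wrapped in a ```json ... ``` fence of exactly that length; if the model occasionally returns bare JSON, a slightly more defensive variant (a sketch, with the hypothetical name parse_result_format_safe) strips an optional fence with a regex first:

import json
import re

def parse_result_format_safe(llm_result):
    # Strip an optional ```json ... ``` fence, then pretty-print the JSON payload
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", llm_result, re.DOTALL)
    payload = match.group(1) if match else llm_result
    return json.dumps(json.loads(payload), ensure_ascii=False, indent=4)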