1. Environment Setup

Hardware: 3x NVIDIA A100 GPUs, 40 GB of memory each

Software: CUDA 12.2, a conda virtual environment

conda create -n my_vllm python==3.9.19 pip
conda activate my_vllm
pip install modelscope
pip install vllm
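
Before downloading the model, it is worth confirming that the GPUs are visible and that vLLM imports cleanly. A minimal check (output will of course differ by machine):

nvidia-smi
python -c "import torch, vllm; print(torch.cuda.device_count(), 'GPUs, vLLM', vllm.__version__)"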

2. Model Download

Due to the hardware constraints, and after multiple attempts, I was only able to deploy the Qwen2.5-72B-Instruct-GPTQ-Int4 version.

It is also possible that my deployment approach was wrong, which is why larger variants kept running out of GPU memory (OOM).

# Download the model
# modelscope's default download path: /root/.cache/modelscope/hub/qwen/Qwen2.5-72B-Instruct-GPTQ-Int4
from modelscope import snapshot_download
model_dir = snapshot_download('qwen/Qwen2.5-72B-Instruct-GPTQ-Int4', local_dir='/home/models/qwen/Qwen2.5-72B-Instruct-GPTQ-Int4')
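
As an alternative, the modelscope package also ships a command-line downloader. A rough equivalent of the snippet above (double-check the flag names against the modelscope version you installed):

modelscope download --model qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 --local_dir /home/models/qwen/Qwen2.5-72B-Instruct-GPTQ-Int4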

References:

ModelScope community

Efficiency Evaluation - Qwen

3. Starting the vLLM Server Directly for Testing

vllm serve /home/models/qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 --tensor-parallel-size 2

Once the server starts successfully, it exposes an OpenAI-compatible API on http://localhost:8000 by default.
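
A quick sanity check (assuming the default host and port) is to list the served models; the returned model id is the path that was passed to vllm serve:

curl http://localhost:8000/v1/models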

Reference:

https://qwen.readthedocs.io/zh-cn/latest/deployment/vllm.html

4. Test Code

import json
import requests


url = "http://localhost:8000/v1/chat/completions"
headers = {
    "Content-Type": "application/json"
}

question = 'balabala'  # placeholder question text

data = {
    "model": "/home/models/qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
    "messages": [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": question}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "repetition_penalty": 1.05,
    "max_tokens": 512
}

response = requests.post(url, headers=headers, data=json.dumps(data))

# Print the raw response, then the message content with the first 8 and last 4 characters
# stripped (this assumes the answer is wrapped in a ```json ... ``` code fence)
print(response.json())
print(response.json()['choices'][0]['message']['content'][8:-4])

The request returns a result as expected.
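
Because vLLM serves an OpenAI-compatible API, the same request can also be sent with the official openai Python client. A minimal sketch (repetition_penalty is a vLLM-specific sampling parameter, so it is passed through extra_body):

from openai import OpenAI

# the vLLM server does not check the API key by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="/home/models/qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
    messages=[
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "balabala"},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    extra_body={"repetition_penalty": 1.05},  # vLLM extension to the OpenAI API
)
print(completion.choices[0].message.content)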

5. Batch Inference

import os
# restrict inference to the first two GPUs
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# local helper modules: they provide MODEL_PATH, the question/question1/question2 prompts,
# and the parse_result_format helper shown at the end of this section
from utils import *
from custom_prompts import *

batch_question = [
    [{"role": "user", "content": question}],
    [{"role": "user", "content": question1}],
    [{"role": "user", "content": question2}]
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
sampling_params = SamplingParams(temperature=0.01, max_tokens=512)

batch_question = tokenizer.apply_chat_template(
    batch_question,
    tokenize=False,
    add_generation_prompt=True
)

# tensor parallel across the two visible GPUs; the small max_model_len caps the context length
# so the KV cache fits on the 40 GB cards
llm = LLM(model=MODEL_PATH, tensor_parallel_size=2, dtype='auto', gpu_memory_utilization=0.9, max_model_len=512, enforce_eager=True)

# generate outputs
outputs = llm.generate(batch_question, sampling_params)

# Print the outputs.
for idx, output in enumerate(outputs):
    prompt = output.prompt
    generated_text = output.outputs[0].text
    formatted_res = parse_result_format(generated_text)
    print(f"Answer-{idx}:\n{formatted_res}")
import json

# The exact formatting of the result is tightly coupled to your prompt; in my case the prompt
# requires a strict JSON answer, and this is how I post-process it
def parse_result_format(llm_result):
    # drop the first 8 and last 4 characters (the surrounding ```json ... ``` fence) and remove newlines
    llm_result = llm_result[8:-4].replace('\n', '')
    # re-serialize as pretty-printed JSON, keeping non-ASCII characters readable
    llm_result = json.dumps(json.loads(llm_result), ensure_ascii=False, indent=4)

    return llm_result
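
The fixed [8:-4] slice breaks if the model ever omits the fence or adds extra whitespace. A slightly more defensive variant (just a sketch; parse_result_format_safe is a name I made up, not part of the original code) strips any surrounding fence with a regular expression and falls back to the raw text when the JSON does not parse:

import json
import re

def parse_result_format_safe(llm_result):
    # remove an optional leading ``` or ```json fence and an optional trailing ``` fence
    text = re.sub(r'^\s*```(?:json)?\s*|\s*```\s*$', '', llm_result)
    try:
        return json.dumps(json.loads(text), ensure_ascii=False, indent=4)
    except json.JSONDecodeError:
        # not valid JSON: return the cleaned text unchanged so the caller can inspect it
        return text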
