[LLM Acceleration in Practice] Accelerating Qwen2-7B Inference with vLLM

Allen :-) · Published 2024-10-13 17:36:47

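Contents

1. What is vLLM?
2. Environment Setup
- 2.1 Requirements
- 2.2 Installing Miniconda
- 2.3 Creating a Virtual Environment
3. Loading the LLM
- 3.1 Installing git
- 3.2 Downloading the Model
4. Model Deployment and Inference
- 4.1 Model Deployment
- 4.2 Model Inference
References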

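1. What is vLLM?

vLLM is an inference and serving engine for large language models (LLMs), designed for high throughput and efficient memory use.

vLLM repository: https://github.com/vllm-project/vllm
Latest documentation: https://docs.vllm.ai/en/latest/
Paper: https://arxiv.org/pdf/2309.06180

This article uses Qwen2-7B as an example to show how vLLM accelerates LLM inference.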

2. Environment Setup

2.1 Requirements

Note that vLLM currently supports only Linux. The requirements are:

OS: Linux
Python: 3.8 – 3.12
GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100)
*Compute capability lookup: https://developer.nvidia.com/cuda-gpus
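
The three requirements above can be checked programmatically. The following is a minimal sketch, not part of the original tutorial: the function names are hypothetical, and it assumes a recent NVIDIA driver whose `nvidia-smi` supports the `compute_cap` query field. If the query fails, the script reports that the compute capability could not be determined rather than guessing.

```python
import platform
import subprocess
import sys


def meets_requirements(os_name, py_version, compute_cap):
    """Return a list of problems; an empty list means all checks passed.

    os_name: e.g. "Linux"; py_version: (major, minor); compute_cap: float or None.
    """
    problems = []
    if os_name != "Linux":
        problems.append(f"OS is {os_name}, expected Linux")
    if not ((3, 8) <= tuple(py_version) <= (3, 12)):
        problems.append(f"Python {py_version[0]}.{py_version[1]} is outside 3.8-3.12")
    if compute_cap is None:
        problems.append("could not determine GPU compute capability")
    elif compute_cap < 7.0:
        problems.append(f"compute capability {compute_cap} is below 7.0")
    return problems


def query_compute_cap():
    """Ask nvidia-smi for the first GPU's compute capability (None on failure)."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
        return float(out.strip().splitlines()[0])
    except (OSError, subprocess.SubprocessError, ValueError, IndexError):
        return None


if __name__ == "__main__":
    issues = meets_requirements(platform.system(), sys.version_info[:2], query_compute_cap())
    print("OK" if not issues else "\n".join(issues))
```

Inside WSL, `platform.system()` reports `Linux`, so the OS check should pass there as well.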

This article deploys vLLM and Qwen2-7B on the Windows Subsystem for Linux (WSL), running Ubuntu 24.04.1 LTS.
*WSL installation tutorial: https://blog.csdn.net/wangtcCSDN/article/details/137950545

2.2 Installing Miniconda

Install the Miniconda package manager; the result is shown in Figure 1:

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh

Figure 1. Miniconda installation

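Initialize Miniconda for your shell:

# Run the line that matches your shell, then restart Ubuntu
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
# After restarting, reload the matching shell configuration
source /home/{username}/.bashrc
source /home/{username}/.zshrc

Here {username} is the current user name ("allen" in this article); replace it before running the commands.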

Configure a mirror hosted in mainland China (the Tsinghua TUNA mirror); the result is shown in Figure 2:

conda config --set show_channel_urls yes
# Edit ~/.condarc so that it contains the settings below
vi ~/.condarc

show_channel_urls: true
channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
  - defaults
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch-lts: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
# Check the configuration
conda info

Figure 2. Successful configuration

2.3 Creating a Virtual Environment

Create and activate the virtual environment; the result is shown in Figure 3:

# Create a virtual environment named my_vLLM with Python 3.10
conda create --name my_vLLM python=3.10 -y
# Activate the virtual environment
conda activate my_vLLM

Figure 3. Virtual environment my_vLLM created successfully
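
To confirm that the interpreter in use really belongs to the new environment, a short check can be run with `python` after activation. This is an illustrative sketch: the function name is hypothetical, and it relies on `conda activate` setting the `CONDA_DEFAULT_ENV` environment variable.

```python
import os
import sys


def describe_environment(environ=None):
    """Collect the active conda environment name and interpreter details."""
    environ = os.environ if environ is None else environ
    return {
        "conda_env": environ.get("CONDA_DEFAULT_ENV"),  # set by `conda activate`
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "prefix": sys.prefix,  # directory of the running interpreter's environment
    }


if __name__ == "__main__":
    info = describe_environment()
    print(info)
    if info["conda_env"] != "my_vLLM":
        print("Warning: my_vLLM is not the active environment")
```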

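The virtual environment required by vLLM is now ready. Next, we download the LLM to be accelerated (Qwen2-7B in this article).

3. Loading the LLM

ModelScope (recommended for users in mainland China) and Hugging Face both offer several ways to download LLMs; this article downloads Qwen2-7B with git.

3.1 Installing git

Install git and verify it with the "git" command; the result is shown in Figure 4: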

# Check whether git is already installed
git --version
# If it is not installed, refresh the package index first
sudo apt-get update
# Then install git
sudo apt-get install git

Figure 4. git installed successfully

3.2 Downloading the Model

ModelScope provides a detailed download tutorial:

# Install Git LFS (Large File Storage)
sudo apt-get install git-lfs
# Create the working directory
mkdir -p ~/ModelSpace && cd ~/ModelSpace
# Download the model with git
git lfs install
git clone https://www.modelscope.cn/Qwen/Qwen2-7B.git Qwen2-7B
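
A common failure with this download method is that Git LFS does not fetch the weight files, leaving small text pointer stubs in their place. The sketch below is an optional check, not part of the ModelScope tutorial: the function name is hypothetical, and it assumes the clone lives in `~/ModelSpace/Qwen2-7B` as in the commands above. It looks for files that still begin with the Git LFS pointer header.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return names of files that are still Git LFS pointer stubs."""
    pointers = []
    for path in sorted(Path(model_dir).expanduser().iterdir()):
        if path.is_file():
            with path.open("rb") as fh:
                if fh.read(len(LFS_HEADER)) == LFS_HEADER:
                    pointers.append(path.name)
    return pointers


if __name__ == "__main__":
    model_dir = Path("~/ModelSpace/Qwen2-7B").expanduser()
    if not model_dir.is_dir():
        print(f"{model_dir} does not exist")
    else:
        stubs = find_lfs_pointers(model_dir)
        print("config.json present:", (model_dir / "config.json").is_file())
        print("unresolved LFS pointers:", stubs or "none")
```

If any pointers are reported, running `git lfs pull` inside the model directory should fetch the real files.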

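The model has now been downloaded with git clone. Next, we use vLLM to deploy the LLM and accelerate inference.

4. Model Deployment and Inference

4.1 Model Deployment

The official Qwen documentation provides a detailed tutorial on deploying the model with vLLM: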

# Install the vLLM library
pip install vllm
# Deploy Qwen2-7B as an OpenAI-compatible API server
python -m vllm.entrypoints.openai.api_server --model ~/ModelSpace/Qwen2-7B --port 8000 --host 127.0.0.1
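
Loading the model weights can take a while, and requests sent before the server is ready will be refused. The following sketch polls the OpenAI-compatible `/v1/models` endpoint until it answers and returns the served model names. The function names and the timeout values are hypothetical choices for illustration; the address matches the `--host` and `--port` used above.

```python
import json
import time
import urllib.request


def extract_model_ids(payload):
    """Pull the model names out of an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def wait_for_server(base_url="http://127.0.0.1:8000", timeout_s=300, interval_s=5):
    """Poll /v1/models until the server answers or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                return extract_model_ids(json.load(resp))
        except OSError:
            # Covers refused connections and timeouts while the server is loading.
            time.sleep(interval_s)
    raise TimeoutError(f"no response from {base_url} within {timeout_s}s")


# Example (run in a second terminal while the server starts):
# print(wait_for_server())
```

The returned names are the values the server accepts in the `"model"` field of a request, which is useful when the model was loaded from a local path.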

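4.2 Model Inference

Create a script file for the inference request, keeping the server from section 4.1 running in another terminal:

vi Qwen2_7B_interface

# Verify the service with curl (replace {username} with your user name)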
curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "/home/{username}/ModelSpace/Qwen2-7B",
  "messages": [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 512
}'
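
The same request can also be sent from Python. This is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and the sampling parameters are copied from the curl command above. It assumes the server returns a standard OpenAI-style chat completion response.

```python
import json
import urllib.request


def build_chat_request(model, user_message):
    """Build the same JSON body as the curl command above."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.7,
        "top_p": 0.8,
        "repetition_penalty": 1.05,
        "max_tokens": 512,
    }


def chat(base_url, payload):
    """POST the payload to /v1/chat/completions and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires the server from section 4.1 to be running;
# replace {username} as in the curl command):
# payload = build_chat_request("/home/{username}/ModelSpace/Qwen2-7B",
#                              "Tell me something about large language models.")
# print(chat("http://127.0.0.1:8000", payload))
```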