Deploying and Inferencing with vLLM on AMD GPUs

Posted by _勇 on 2024-11-25 00:45:00

Inferencing and serving with vLLM on AMD GPUs — ROCm Blogs

September 19, 2024, by Clint Greene.

Introduction

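In the rapidly evolving field of artificial intelligence, large language models (LLMs) have become powerful tools for understanding and generating human-like text. Deploying these models efficiently at scale, however, remains a significant challenge. This is where vLLM comes in. vLLM is an open-source library designed to optimize LLM serving. At its core is PagedAttention, a novel algorithm that makes the attention mechanism more efficient by managing its memory the way an operating system manages virtual memory. This improves GPU memory utilization, makes longer sequences practical, and lets larger models run within existing hardware limits. vLLM also uses continuous batching to maximize throughput and minimize latency. Together, these techniques significantly improve the performance and scalability of LLM deployments, so organizations can use advanced AI models more efficiently and at lower cost.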

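Looking more closely at vLLM's features, PagedAttention optimizes memory use by splitting the key-value (KV) cache into manageable, non-contiguous memory blocks, much as an operating system manages memory pages. This structure keeps memory waste low. The KV cache stores the keys and values computed for earlier tokens, so the model only has to compute attention for the current token. This speeds up processing and reduces overhead, because the model no longer needs to recompute attention scores for past tokens.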
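The block-based layout described above can be illustrated with a toy allocator. This is only a sketch under simplifying assumptions, not vLLM's implementation: the class name `PagedKVCache`, its methods, and the use of plain strings in place of key/value tensors are all hypothetical.

```python
class PagedKVCache:
    """Toy block-table allocator; stores strings instead of real key/value tensors."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids not in use
        self.blocks = {}        # physical block id -> entries stored in that block
        self.block_tables = {}  # sequence id -> ordered list of physical block ids
        self.lengths = {}       # sequence id -> number of cached tokens

    def append(self, seq_id, kv_entry):
        """Cache one token's entry, allocating a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            block_id = self.free_blocks.pop()
            self.blocks[block_id] = []
            table.append(block_id)
        self.blocks[table[-1]].append(kv_entry)
        self.lengths[seq_id] = length + 1

    def read(self, seq_id):
        """Return cached entries in logical order by walking the block table."""
        return [e for b in self.block_tables[seq_id] for e in self.blocks[b]]

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        for b in self.block_tables.pop(seq_id):
            del self.blocks[b]
            self.free_blocks.append(b)
        del self.lengths[seq_id]


cache = PagedKVCache(num_blocks=4, block_size=2)
cache.append("A", "a0")
cache.append("A", "a1")
cache.append("B", "b0")  # sequence B takes a block between A's two blocks
cache.append("A", "a2")
print(cache.block_tables["A"], cache.read("A"))
```

Sequence A ends up in two physical blocks that are not adjacent, yet reading through its block table still returns the entries in order. Memory is claimed one block at a time rather than reserved up front for the longest possible sequence.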

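Continuous batching, meanwhile, dynamically groups incoming requests into batches instead of waiting for a fixed batch size to fill, which raises throughput and minimizes latency. vLLM can start processing requests as soon as they arrive, giving faster response times and more efficient handling of large request volumes.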
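To make the scheduling idea concrete, here is a small simulation of continuous batching. It is a simplified sketch, not vLLM's scheduler: the function name `continuous_batching`, the request tuples, and the assumption that every running request produces exactly one token per step are all illustrative.

```python
from collections import deque


def continuous_batching(requests, max_batch_size):
    """Simulate iteration-level scheduling.

    requests: list of (name, arrival_step, tokens_to_generate), tokens >= 1.
    Returns {name: step at which the request finished}.
    """
    pending = deque(sorted(requests, key=lambda r: r[1]))
    running = {}   # name -> tokens still to generate
    finished = {}
    step = 0
    while pending or running:
        # Admit arrived requests whenever a slot is free; never wait for a full batch.
        while pending and pending[0][1] <= step and len(running) < max_batch_size:
            name, _, n_tokens = pending.popleft()
            running[name] = n_tokens
        # One decoding iteration: every running request produces one token.
        for name in list(running):
            running[name] -= 1
            if running[name] <= 0:
                del running[name]  # the slot is free again at the next step
                finished[name] = step + 1
        step += 1
    return finished


result = continuous_batching(
    [("r1", 0, 4), ("r2", 0, 1), ("r3", 1, 2)], max_batch_size=2
)
print(result)
```

In this run, r3 takes over the slot released by the short request r2 and finishes at step 3, while r1 is still generating. Under a simple fixed-batch policy, r3 would instead have to wait until the whole first batch (r1 and r2) had completed.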

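In this blog, we walk you through serving large language models with vLLM, from setting up the environment to running basic inference on AMD GPUs with state-of-the-art LLMs such as Qwen2-7B, Yi-34B, and Llama3-70B.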

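Prerequisites

To run this blog, you will need the following:

Linux: see the supported Linux distributions
ROCm: see the installation guide
AMD GPUs: see the list of compatible GPUs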

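Installation

To access the latest vLLM features on ROCm, clone the vLLM repository and build the Docker image with the commands below. Depending on your system, the build can take a considerable amount of time.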

git clone https://github.com/vllm-project/vllm.git
cd vllm
DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm .

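Once the vLLM ROCm Docker image is built, you can run it with the command below. If you have a folder of LLMs that you want to access inside the container, replace <path/to/model> with the actual path to that folder so it is mounted in the container at /app/models. If not, remove -v <path/to/model>:/app/models from the command.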

docker run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri -v <path/to/model>:/app/models vllm-rocm

Inferencing

To run offline inference on a batch of prompts, first import the LLM and SamplingParams classes.

from vllm import LLM, SamplingParams

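Next, define your batch of prompts and the sampling parameters for generation. The sampling parameters are:

temperature = 0.8
top-k tokens to select from = 20
nucleus sampling probability (top-p) = 0.95
maximum number of generated tokens = 128

For more information about the sampling parameters, see the class definition.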

prompts = ["Write a haiku about machine learning"]
sampling_params = SamplingParams(
    max_tokens=128,
    skip_special_tokens=True,
    temperature=0.8,
    top_k=20,
    top_p=0.95,
)
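For readers unfamiliar with these parameters, the sketch below shows one common way temperature, top-k, and top-p shape the next-token distribution. It is not vLLM's sampler, whose ordering and details may differ. The function name `filter_distribution` and the toy logits are assumptions made for illustration.

```python
import math


def filter_distribution(logits, temperature=0.8, top_k=20, top_p=0.95):
    """Return {token: probability} after temperature, top-k, then top-p filtering."""
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exp = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exp.values())
    probs = sorted(
        ((tok, e / total) for tok, e in exp.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )
    # Top-k: keep only the k most likely tokens.
    probs = probs[:top_k]
    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_logits = {"learn": 2.0, "data": 1.0, "model": 0.0, "cat": -3.0}
dist = filter_distribution(toy_logits, temperature=0.8, top_k=3, top_p=0.9)
print(dist)
```

With these toy logits, top-k removes the least likely token and top-p then trims the tail further, so only "learn" and "data" remain as candidates. A token would then be drawn at random from the remaining distribution.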

You can now load an LLM. We will demonstrate how to load the smaller Qwen2-7B model as well as the larger Yi-34B and Llama3-70B models.

Qwen2-7B

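Since Qwen2-7B fits easily in the VRAM of an MI210, we can simply call the LLM class with the model name, which loads Qwen2-7B from the Hugging Face cache folder. If you have the model weights stored elsewhere, you can specify the path directly, for example `model="/app/models/qwen-7b/"`, assuming you specified the appropriate folder mount in the `docker run` command. If you have not already downloaded the weights, we recommend doing so before this step to speed up loading.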

llm = LLM(model="Qwen/Qwen2-7B-Instruct")

To generate text from the prompt above, we simply call the `generate` function and print the output:

outputs = llm.generate(prompts, sampling_params)

prompt = prompts[0]
generated_text = outputs[0].outputs[0].text
print(prompt + ': ' + generated_text)

The output is:

Data flows in streams,
Algorithms sift and learn,
Predictions emerge.

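To run larger (30B+ parameter) language models, we may need tensor parallelism to spread the model across multiple GPUs. This works by splitting the model's weight matrices column-wise into N parts, with each of the N GPUs receiving a different part. After each GPU finishes its computation, the results are combined with an `allreduce` operation. vLLM uses Megatron-LM's tensor parallelism algorithm and Python's `multiprocessing` to manage the distributed runtime on a single node.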
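The split-then-combine pattern can be checked on a tiny two-layer MLP using plain Python lists, with a loop standing in for the GPUs. This is a sketch of the general idea as we understand the Megatron-LM scheme (first weight matrix split by columns, second by the matching rows, partial outputs summed), not vLLM code. All function names and the ReLU activation are assumptions chosen for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def split_columns(m, n):
    """Split matrix m into n equal groups of columns."""
    width = len(m[0]) // n
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(n)]


def split_rows(m, n):
    """Split matrix m into n equal groups of rows."""
    height = len(m) // n
    return [m[i * height:(i + 1) * height] for i in range(n)]


def all_reduce_sum(partials):
    """Element-wise sum of same-shaped matrices, standing in for allreduce."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def tensor_parallel_mlp(x, w1, w2, n_gpus):
    w1_shards = split_columns(w1, n_gpus)  # each simulated GPU holds some columns of w1
    w2_shards = split_rows(w2, n_gpus)     # and the matching rows of w2
    partials = []
    for a, b in zip(w1_shards, w2_shards):
        hidden = relu(matmul(x, a))          # local computation, no communication
        partials.append(matmul(hidden, b))   # partial output with the full output shape
    return all_reduce_sum(partials)          # single combine step at the end


x = [[1.0, 2.0]]
w1 = [[1.0, -1.0, 2.0, 0.0], [0.0, 1.0, -1.0, 3.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, -1.0]]

serial = matmul(relu(matmul(x, w1)), w2)
parallel = tensor_parallel_mlp(x, w1, w2, n_gpus=2)
print(serial, parallel)
```

The sharded computation matches the single-device result, and each simulated GPU only ever holds half of each weight matrix. The element-wise activation can be applied locally because each shard owns complete columns of the hidden layer.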

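To enable tensor parallelism in vLLM, add it as an argument to LLM and specify the number of GPUs to distribute the model across. We also recommend using `multiprocessing` (mp) as the distributed executor backend, since it is faster than `ray`.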

llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct", tensor_parallel_size=4, distributed_executor_backend="mp")

With the same prompt and sampling parameters, Llama3-70B generates the following output:

Algorithms dance
Data whispers secrets sweet
Intelligence born

Now let's try another high-performing LLM: Yi-34B.

llm = LLM(model="01-ai/Yi-34B-Chat", tensor_parallel_size=4, distributed_executor_backend="mp")

The output is:

In the digital realm,
Algorithms learn and evolve,
Predicting the future.

Serving

You can deploy your LLM as a service with vLLM by calling vllm serve <model-name> in a terminal.

vllm serve Qwen/Qwen2-7B-Instruct

You can then query the vLLM service with curl from another terminal window.

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2-7B-Instruct",
        "prompt": "Write a haiku about artificial intelligence",
        "max_tokens": 128,
        "top_p": 0.95,
        "top_k": 20,
        "temperature": 0.8
      }'

This produces the following JSON output:

{
  "id": "cmpl-622896e563984235a6f83633c46db7cf",
  "object": "text_completion",
  "created": 1724445396,
  "model": "Qwen/Qwen2-7B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": ". Machines learn and grow,\nBinary thoughts never falter,\nIntelligence artificial. \n\nThis haiku captures the idea of machines learning and growing through artificial intelligence, while their thoughts are never subject to the same limitations as human emotions or physical constraints. The use of binary suggests the reliance on a system of ones and zeros, which is a fundamental aspect of how computers process information. Overall, the haiku highlights the potential and possibilities of artificial intelligence while also acknowledging the limitations of machines as they lack the same complexity and depth of human intelligence.",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 115,
    "completion_tokens": 108
  }
}
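The same query can be issued from Python instead of curl. The sketch below uses only the standard library and assumes the server is listening on localhost:8000 as in the curl example. The helper names `build_completion_request` and `extract_text` are hypothetical, and the network call itself is left as a comment so the snippet can be read and run without a live server.

```python
import json
import urllib.request


def build_completion_request(model, prompt, base_url="http://localhost:8000", **sampling):
    """Build a POST request for the /v1/completions endpoint used above."""
    payload = {"model": model, "prompt": prompt, **sampling}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Pull the generated text, finish reason, and token count out of a response."""
    body = json.loads(response_body)
    choice = body["choices"][0]
    return choice["text"], choice["finish_reason"], body["usage"]["completion_tokens"]


request = build_completion_request(
    "Qwen/Qwen2-7B-Instruct",
    "Write a haiku about artificial intelligence",
    max_tokens=128,
    top_p=0.95,
    top_k=20,
    temperature=0.8,
)

# With the server running, send the request like this:
#   with urllib.request.urlopen(request) as resp:
#       text, finish_reason, n_tokens = extract_text(resp.read())
#       print(text)
```

Keeping request construction and response parsing in separate functions makes it easy to swap in a different model name or sampling settings without touching the parsing code.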

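If you need to serve a model that is too large to fit on a single GPU, you can run multi-GPU serving by adding --tensor-parallel-size <number-of-gpus> when starting vllm serve. We also specify multiprocessing (mp) as the distributed executor backend.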

vllm serve meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 4 --distributed-executor-backend=mp

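Sending the same curl request as before, with the model field set to meta-llama/Meta-Llama-3-70B-Instruct, produces the following output: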

{
  "id": "cmpl-bed1534b639a4ab7b65775f75cdeed33",
  "object": "text_completion",
  "created": 1724446207,
  "model": "meta-llama/Meta-Llama-3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": "\nHere is a haiku about artificial intelligence:\n\nMetal minds awake\nIntelligence born of code\nFuture's uncertain",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": 128009,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 32,
    "completion_tokens": 24
  }
}

Note

This blog was originally published on April 4, 2024.