Deploying a Local LLM with vLLM
Unemployed and in the middle of interviews, so today I'm learning a new toy: vLLM.
vLLM is a fast and easy-to-use library for LLM inference and serving.
1. Installation
pip install -U xformers torch torchvision torchaudio triton --index-url https://download.pytorch.org/whl/cu121
pip install modelscope vllm
If triton fails to install, download a prebuilt wheel from https://hf-mirror.com/madbuda/triton-windows-builds and install it with:
pip install .\triton-3.0.0-cp312-cp312-win_amd64.whl
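A quick sanity check that the install worked (a minimal sketch; assumes a CUDA-capable GPU is visible to PyTorch):
# Verify that vLLM imports and that PyTorch can see the GPU.
import torch
import vllm

print("vLLM version:", vllm.__version__)          # e.g. 0.6.3.post1
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))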
2. Download the model
from vllm import LLM, SamplingParams
llm = LLM(model='Qwen/Qwen2.5-1.5B-Instruct')
If you cannot reach Hugging Face, set export VLLM_USE_MODELSCOPE=True so the weights are pulled from ModelScope instead, then load the model:
import torch
llm = LLM(model='Qwen/Qwen2.5-1.5B-Instruct', trust_remote_code=True, dtype=torch.float16)
Output:
Downloading Model to directory: /home/web/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct
2024-11-24 21:11:04,893 - modelscope - INFO - Target directory already exists, skipping creation.
Downloading Model to directory: /home/web/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct
2024-11-24 21:11:05,514 - modelscope - INFO - Target directory already exists, skipping creation.
WARNING 11-24 21:11:05 config.py:1668] Casting torch.bfloat16 to torch.float16.
INFO 11-24 21:11:05 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
Downloading Model to directory: /home/web/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct
2024-11-24 21:11:06,041 - modelscope - INFO - Target directory already exists, skipping creation.
Downloading Model to directory: /home/web/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct
2024-11-24 21:11:18,283 - modelscope - INFO - Target directory already exists, skipping creation.
Downloading Model to directory: /home/web/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct
2024-11-24 21:11:18,913 - modelscope - INFO - Target directory already exists, skipping creation.
INFO 11-24 21:11:18 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 11-24 21:11:18 selector.py:115] Using XFormers backend.
INFO 11-24 21:11:19 model_runner.py:1056] Starting to load model Qwen/Qwen2.5-1.5B-Instruct...
INFO 11-24 21:11:20 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 11-24 21:11:20 selector.py:115] Using XFormers backend.
Downloading Model to directory: /home/web/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct
Downloading [model.safetensors]: 100%|██████████████████████████████████████████████████████████████████████████| 2.88G/2.88G [00:22<00:00, 135MB/s]
2024-11-24 21:12:11,861 - modelscope - INFO - Target directory already exists, skipping creation.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:04<00:00, 4.97s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:04<00:00, 4.98s/it]
INFO 11-24 21:12:17 model_runner.py:1067] Loading model weights took 2.8875 GB
INFO 11-24 21:12:23 gpu_executor.py:122] # GPU blocks: 15322, # CPU blocks: 9362
INFO 11-24 21:12:23 gpu_executor.py:126] Maximum concurrency for 32768 tokens per request: 7.48x
INFO 11-24 21:12:28 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-24 21:12:28 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-24 21:12:51 model_runner.py:1523] Graph capturing finished in 23 secs.
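The VLLM_USE_MODELSCOPE switch above can also be flipped from inside Python, as long as it happens before vLLM loads the model. A minimal sketch of the same setup:
# Assumption: setting the variable via os.environ before loading the model
# has the same effect as `export VLLM_USE_MODELSCOPE=True` on the shell.
import os
os.environ["VLLM_USE_MODELSCOPE"] = "True"

from vllm import LLM
llm = LLM(model='Qwen/Qwen2.5-1.5B-Instruct', trust_remote_code=True, dtype='float16')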
Generate text:
output = llm.generate("Hello, my name is")
# Processed prompts: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.42it/s, est. speed input: 12.11 toks/s, output: 38.73 toks/s]
print(output)
# [RequestOutput(request_id=0, prompt='Hello, my name is',
# prompt_token_ids=[9707, 11, 847, 829, 374], encoder_prompt=None,
# encoder_prompt_token_ids=None, prompt_logprobs=None,
# outputs=[CompletionOutput(index=0, text=" Josh and I'm deleting my record from Sonoma County Records website.\nAnswer this", token_ids=(18246, 323, 358, 2776, 33011, 847, 3255, 504, 11840, 7786, 6272, 21566, 3910, 624, 16141, 419),
# cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)], finished=True,
# metrics=RequestMetrics(arrival_time=1732454207.6440513, last_token_time=1732454207.6440513, first_scheduled_time=1732454207.6526544, first_token_time=1732454207.756204, time_in_queue=0.008603096008300781, finished_time=1732454208.0503235, scheduler_time=0.005755523219704628, model_forward_time=None, model_execute_time=None), lora_request=None)]
print(output[0].outputs[0].text)
# Josh and I'm deleting my record from Sonoma County Records website.
# Answer this
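The completion above stops mid-sentence because SamplingParams defaults to max_tokens=16, which is why the RequestOutput shows finish_reason=length. A hedged sketch of raising that limit explicitly (128 is an arbitrary example value):
from vllm import SamplingParams

# Allow longer completions than the 16-token default.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
output = llm.generate("Hello, my name is", params)
print(output[0].outputs[0].text)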
3. Batch generation
prompts = [
    "你好,我想吃饭",
    "最近大模型技术很火热",
    "我最近还在找工作,还在面试中",
    "请给我一份工作",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(prompts, sampling_params)
# Processed prompts: 100%|███████████████████████████████████████| 4/4 [00:00<00:00, 7.95it/s, est. speed input: 43.74 toks/s, output: 127.23 toks/s]
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Output:
Prompt: '你好,我想吃饭', Generated text: ',但我又不想吃,这很正常,我们每个人都有时候会感到'
Prompt: '最近大模型技术很火热', Generated text: ',而这类模型都是由大量的数据训练出来的,其背后都是依赖于'
Prompt: '我最近还在找工作,还在面试中', Generated text: ',身上有一块蚊子咬的包,但是不知道是不是蚊子咬'
Prompt: '请给我一份工作', Generated text: '流程的示例。\n一个公司的招聘流程。\n招聘流程通常包括以下几个阶段'
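Note that Qwen2.5-1.5B-Instruct is a chat-tuned model, so raw prompts like the ones above are simply continued as free text. A sketch of applying the chat template first (assumes transformers is installed and the tokenizer is reachable; not part of the original walkthrough):
# Wrap each prompt in the Qwen chat template before generation.
from transformers import AutoTokenizer
from vllm import SamplingParams

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
chat_prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for p in prompts
]
outputs = llm.generate(chat_prompts, SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128))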
4. OpenAI-compatible server
On the command line, run: vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 9999 --dtype float16
Output:
INFO 11-24 21:45:42 api_server.py:528] vLLM API server version 0.6.3.post1
INFO 11-24 21:45:42 api_server.py:529] args: Namespace(allow_credentials=False, allowed_headers=['*'], allowed_methods=['*'], allowed_origins=['*'], api_key=None, block_size=16, chat_template=None, code_revision=None, collect_detailed_traces=None, config='', config_format=<ConfigFormat.AUTO: 'auto'>, cpu_offload_gb=0, device='auto', disable_async_output_proc=False, disable_custom_all_reduce=False, disable_fastapi_docs=False, disable_frontend_multiprocessing=False, disable_log_requests=False, disable_log_stats=False, disable_logprobs_during_spec_decoding=None, disable_sliding_window=False, dispatch_function=<function serve at 0x7fe31b1fe550>, distributed_executor_backend=None, download_dir=None, dtype='float16', enable_auto_tool_choice=False, enable_chunked_prefill=None, enable_lora=False, enable_prefix_caching=False, enable_prompt_adapter=False, enforce_eager=False, fully_sharded_loras=False, gpu_memory_utilization=0.9, guided_decoding_backend='outlines', host=None, ignore_patterns=[], kv_cache_dtype='auto', limit_mm_per_prompt=None, load_format='auto', long_lora_scaling_factors=None, lora_dtype='auto', lora_extra_vocab_size=256, lora_modules=None, max_context_len_to_capture=None, max_cpu_loras=None, max_log_len=None, max_logprobs=20, max_lora_rank=16, max_loras=1, max_model_len=None, max_num_batched_tokens=None, max_num_seqs=256, max_parallel_loading_workers=None, max_prompt_adapter_token=0, max_prompt_adapters=1, max_seq_len_to_capture=8192, middleware=[], mm_processor_kwargs=None, model='Qwen/Qwen2.5-1.5B-Instruct', model_loader_extra_config=None, model_tag='Qwen/Qwen2.5-1.5B-Instruct', multi_step_stream_outputs=True, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, num_gpu_blocks_override=None, num_lookahead_slots=0, num_scheduler_steps=1, num_speculative_tokens=None, otlp_traces_endpoint=None, override_neuron_config=None, pipeline_parallel_size=1, port=9999, preemption_mode=None, prompt_adapters=None, qlora_adapter_name_or_path=None, quantization=None, quantization_param_path=None, ray_workers_use_nsight=False, response_role='assistant', return_tokens_as_token_ids=False, revision=None, root_path=None, rope_scaling=None, rope_theta=None, scheduler_delay_factor=0.0, scheduling_policy='fcfs', seed=0, served_model_name=None, skip_tokenizer_init=False, spec_decoding_acceptance_method='rejection_sampler', speculative_disable_by_batch_size=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_model=None, speculative_model_quantization=None, ssl_ca_certs=None, ssl_cert_reqs=0, ssl_certfile=None, ssl_keyfile=None, subparser='serve', swap_space=4, tensor_parallel_size=1, tokenizer=None, tokenizer_mode='auto', tokenizer_pool_extra_config=None, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_revision=None, tool_call_parser=None, tool_parser_plugin='', trust_remote_code=False, typical_acceptance_sampler_posterior_alpha=None, typical_acceptance_sampler_posterior_threshold=None, use_v2_block_manager=False, uvicorn_log_level='info', worker_use_ray=False)
INFO 11-24 21:45:42 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/62833db5-17e8-4805-84ab-dd5641c97de8 for IPC Path.
INFO 11-24 21:45:42 api_server.py:179] Started engine process with PID 3726198
Downloading Model to directory: /home/web/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct
2024-11-24 21:45:43,724 - modelscope - INFO - Target directory already exists, skipping creation.
Downloading Model to directory: /home/web/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct
2024-11-24 21:45:44,224 - modelscope - INFO - Target directory already exists, skipping creation.
WARNING 11-24 21:45:44 config.py:1668] Casting torch.bfloat16 to torch.float16.
Downloading Model to directory: /home/web/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct
2024-11-24 21:45:53,942 - modelscope - INFO - Target directory already exists, skipping creation.
Downloading Model to directory: /home/web/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct
2024-11-24 21:45:54,520 - modelscope - INFO - Target directory already exists, skipping creation.
WARNING 11-24 21:45:54 config.py:1668] Casting torch.bfloat16 to torch.float16.
WARNING 11-24 21:45:55 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
Downloading Model to directory: /home/web/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct
2024-11-24 21:45:55,758 - modelscope - INFO - Target directory already exists, skipping creation.
WARNING 11-24 21:46:05 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-24 21:46:05 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
Downloading Model to directory: /home/web/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct
2024-11-24 21:46:05,632 - modelscope - INFO - Target directory already exists, skipping creation.
Downloading Model to directory: /home/web/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct
2024-11-24 21:46:17,537 - modelscope - INFO - Target directory already exists, skipping creation.
Downloading Model to directory: /home/web/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct
2024-11-24 21:46:18,031 - modelscope - INFO - Target directory already exists, skipping creation.
INFO 11-24 21:46:18 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 11-24 21:46:18 selector.py:115] Using XFormers backend.
/opt/bdp/data01/miniforge3/envs/py38/lib/python3.8/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/opt/bdp/data01/miniforge3/envs/py38/lib/python3.8/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 11-24 21:46:19 model_runner.py:1056] Starting to load model Qwen/Qwen2.5-1.5B-Instruct...
INFO 11-24 21:46:19 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 11-24 21:46:19 selector.py:115] Using XFormers backend.
Downloading Model to directory: /home/web/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct
2024-11-24 21:46:20,466 - modelscope - INFO - Target directory already exists, skipping creation.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:04<00:00, 4.37s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:04<00:00, 4.37s/it]
INFO 11-24 21:46:25 model_runner.py:1067] Loading model weights took 2.8875 GB
INFO 11-24 21:46:30 gpu_executor.py:122] # GPU blocks: 15322, # CPU blocks: 9362
INFO 11-24 21:46:30 gpu_executor.py:126] Maximum concurrency for 32768 tokens per request: 7.48x
INFO 11-24 21:46:37 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-24 21:46:37 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-24 21:46:57 model_runner.py:1523] Graph capturing finished in 20 secs.
INFO 11-24 21:46:58 api_server.py:232] vLLM to use /tmp/tmpey4sm3lz as PROMETHEUS_MULTIPROC_DIR
WARNING 11-24 21:46:58 serving_embedding.py:199] embedding_mode is False. Embedding API will not work.
INFO 11-24 21:46:58 launcher.py:19] Available routes are:
INFO 11-24 21:46:58 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 11-24 21:46:58 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 11-24 21:46:58 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 11-24 21:46:58 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 11-24 21:46:58 launcher.py:27] Route: /health, Methods: GET
INFO 11-24 21:46:58 launcher.py:27] Route: /tokenize, Methods: POST
INFO 11-24 21:46:58 launcher.py:27] Route: /detokenize, Methods: POST
INFO 11-24 21:46:58 launcher.py:27] Route: /v1/models, Methods: GET
INFO 11-24 21:46:58 launcher.py:27] Route: /version, Methods: GET
INFO 11-24 21:46:58 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 11-24 21:46:58 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 11-24 21:46:58 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO: Started server process [3725978]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on socket ('0.0.0.0', 9999) (Press CTRL+C to quit)
INFO 11-24 21:47:08 metrics.py:349] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
List the models:
curl http://localhost:9999/v1/models
{"object":"list","data":[{"id":"Qwen/Qwen2.5-1.5B-Instruct","object":"model","created":1732456130,"owned_by":"vllm","root":"Qwen/Qwen2.5-1.5B-Instruct","parent":null,"max_model_len":32768,"permission":[{"id":"modelperm-5cf127a5ed1e4199bf97c030d7a2ffb7","object":"model_permission","created":1732456130,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
Call it with the OpenAI Python client:
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:9999/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "你好,我失业了,给我讲个笑话吧"},
    ]
)
print("Chat response:", chat_response)
Output:
Server log:
Received request chat-51aec7c3b04b4548a0e33dfe40fc34d7:
prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你好,我失业了,给我讲个笑话吧<|im_end|>\n<|im_start|>assistant\n',
params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32738, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 108386, 3837, 35946, 108127, 34187, 3837, 104169, 99526, 18947, 109959, 100003, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
Chat response: ChatCompletion(id='chat-51aec7c3b04b4548a0e33dfe40fc34d7',
choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='你好,很高兴为你讲个笑话。不过,我建议你先找个工作,这样你才能有收入和生活。', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[]), stop_reason=None)],
created=1732456399, model='Qwen/Qwen2.5-1.5B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=26, prompt_tokens=30, total_tokens=56, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
chat_response.choices[0].message.content
'你好,很高兴为你讲个笑话。不过,我建议你先找个工作,这样你才能有收入和生活。'
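The same client also supports streaming, which is handy for long generations. A sketch against the server started above (the prompt is just an illustrative example):
# Stream tokens as they are produced instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "用一句话介绍 vLLM"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)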
5. References
https://docs.vllm.ai/en/latest/getting_started/quickstart.html#
https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html