Qwen1.5-14B-Chat: Usage and Multiple Deployment Methods (Linux and GPU Environments)
Download the Model Weights
Download with ModelScope
from modelscope.models import Model

# Downloads the weights from the ModelScope hub and loads the model
model = Model.from_pretrained('qwen/Qwen1.5-14B-Chat')
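If you only want to fetch the weights to disk without instantiating the model, ModelScope's snapshot_download works as well. A minimal sketch; the cache_dir path is an assumption, adjust it to your environment:

from modelscope import snapshot_download

# Download the full repository snapshot and return its local path;
# cache_dir is a hypothetical location, change it to suit your disk layout
model_dir = snapshot_download('qwen/Qwen1.5-14B-Chat', cache_dir='/data/models')
print(model_dir)  # pass this path to from_pretrained later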
Download with Hugging Face
from transformers import AutoModelForCausalLM, AutoTokenizer

# Downloads and caches the weights and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained('Qwen/Qwen1.5-14B-Chat')
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen1.5-14B-Chat')
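To download the files without loading the model into memory, huggingface_hub's snapshot_download is an alternative. A minimal sketch; the local_dir path is an assumption:

from huggingface_hub import snapshot_download

# Fetch all repository files to local_dir; local_dir is a hypothetical path
snapshot_download(repo_id='Qwen/Qwen1.5-14B-Chat', local_dir='/data/models/Qwen1.5-14B-Chat')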
Using the Model with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

# Load the model in its native dtype and let accelerate place it on the GPU(s)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-14B-Chat",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-14B-Chat")

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Render the chat messages into the model's ChatML prompt format
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
# Strip the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
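For interactive use you may want tokens printed as they are generated rather than all at once when generation finishes. Transformers provides TextStreamer for this; a minimal sketch reusing the model, tokenizer, and model_inputs from above:

from transformers import TextStreamer

# Prints decoded tokens to stdout as they are produced;
# skip_prompt avoids echoing the input prompt back
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(model_inputs.input_ids, max_new_tokens=512, streamer=streamer)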
Deploying the Model with vLLM
Launch an OpenAI-compatible server with vLLM
python -m vllm.entrypoints.openai.api_server \
--model qwen/Qwen1.5-14B-Chat --max-model-len 8192 --gpu-memory-utilization 0.95
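Note that a 14B model in bf16/fp16 needs roughly 28 GB for the weights alone, so a single 24 GB card will not hold it. vLLM can shard the model across GPUs with tensor parallelism; a sketch for a two-GPU machine (the GPU count is an assumption, set it to the number of cards you have):

python -m vllm.entrypoints.openai.api_server \
    --model qwen/Qwen1.5-14B-Chat --max-model-len 8192 \
    --gpu-memory-utilization 0.95 --tensor-parallel-size 2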
Query the server. The "model" field in the request must match the --model value the server was started with:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen/Qwen1.5-14B-Chat",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write an essay on the theme of spring"}
        ],
        "stop": ["<|im_end|>", "<|endoftext|>"]
    }'
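Since the endpoint is OpenAI-compatible, the official openai Python client can talk to it as well. A minimal sketch; the api_key value is an arbitrary placeholder because vLLM does not check it by default:

from openai import OpenAI

# Point the client at the local vLLM server; the key is a placeholder
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="qwen/Qwen1.5-14B-Chat",  # must match the server's --model value
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write an essay on the theme of spring"},
    ],
)
print(completion.choices[0].message.content)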
References:
- https://huggingface.co/Qwen/Qwen1.5-14B-Chat
- https://developer.aliyun.com/article/1439006