CodeFuse-CodeLlama-34B 4-bit quantized version open-sourced on ModelScope: inference on a single GPU!
Following the release of CodeFuse-CodeLlama-34B on 2023-09-11, which reached 74.4% HumanEval pass@1 (greedy decoding) and was the open-source SOTA at the time, a 4-bit quantized version has now been released. CodeFuse-CodeLlama-34B-4bits is the 4-bit quantized version of CodeFuse-CodeLlama-34B, a code LLM obtained by fine-tuning the base model CodeLlama-34b-Python on multiple code tasks with QLoRA; its input length is 4K tokens.
After 4-bit quantization, CodeFuse-CodeLlama-34B-4bits can be loaded on a single A10 (24GB VRAM) or RTX 4090 (24GB VRAM), while still achieving 73.8% HumanEval pass@1.
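As a rough illustration of why the quantized model fits on a 24GB card (a back-of-the-envelope estimate, not a figure from the release notes), the 4-bit weights alone take about half a byte per parameter:

# Illustrative estimate only: real usage also includes GPTQ scales/zero-points,
# non-quantized layers, activations and the KV cache (see the table further below).
num_params = 34e9              # ~34B parameters
bytes_per_param = 0.5          # 4 bits per weight
weight_gb = num_params * bytes_per_param / 1e9
print(f"approx. 4-bit weight footprint: {weight_gb:.0f} GB")   # ~17 GB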
Evaluation results (code):
| Model | HumanEval (pass@1) | Date |
| --- | --- | --- |
| CodeFuse-CodeLlama-34B | 74.4% | 2023.9 |
| CodeFuse-CodeLlama-34B-4bits | 73.8% | 2023.9 |
| WizardCoder-Python-34B-V1.0 | 73.2% | 2023.8 |
| GPT-4 (zero-shot) | 67.0% | 2023.3 |
| PanGu-Coder2 15B | 61.6% | 2023.8 |
| CodeLlama-34b-Python | 53.7% | 2023.8 |
| CodeLlama-34b | 48.8% | 2023.8 |
| GPT-3.5 (zero-shot) | 48.1% | 2022.11 |
| OctoCoder | 46.2% | 2023.8 |
| StarCoder-15B | 33.6% | 2023.5 |
| LLaMA 2 70B (zero-shot) | 29.9% | 2023.7 |
Requirements:
- Python 3.8 or later
- PyTorch 1.12 or later (2.0 or later recommended)
- CUDA 11.4 or later recommended (relevant for GPU users); see the environment check sketch below
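A minimal check of the local environment against these requirements (an illustrative snippet, not part of the original post) could be:

import sys
import torch

# Print the interpreter, PyTorch and CUDA versions to compare against the requirements above.
print(f"python:  {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")
print(f"pytorch: {torch.__version__}")
print(f"cuda available: {torch.cuda.is_available()}")
print(f"cuda (torch build): {torch.version.cuda}")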
Usage steps
This walkthrough was run on PAI-DSW (a single GPU is sufficient).
The quantized CodeFuse model is now open-sourced in the ModelScope community:
CodeFuse-CodeLlama-34B 4bits:
https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B-4bits/summary
The community supports downloading the model repo directly:
from modelscope.hub.snapshot_download import snapshot_download

# Download revision v1.0.0 of the quantized model into the local ModelScope cache
model_dir = snapshot_download('codefuse-ai/CodeFuse-CodeLlama-34B-4bits', 'v1.0.0')
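Here model_dir is the local cache path of the downloaded snapshot; a quick, optional sanity check that the files are in place (not part of the original post) might be:

import os

# List the downloaded files (config, tokenizer and weight files should appear here)
print(model_dir)
print(os.listdir(model_dir))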
Dependencies:
pip install "modelscope>=1.9.1"
pip install auto_gptq
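To confirm both packages installed correctly (assuming each exposes a __version__ attribute, which recent releases do), a quick check is:

import modelscope
import auto_gptq

# modelscope should be >= 1.9.1 per the install command above
print(f"modelscope: {modelscope.__version__}")
print(f"auto_gptq:  {auto_gptq.__version__}")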
Inference code:
import os
import time

import torch
from modelscope import AutoTokenizer, snapshot_download
from auto_gptq import AutoGPTQForCausalLM

os.environ["TOKENIZERS_PARALLELISM"] = "false"


def load_model_tokenizer(model_path):
    """
    Load model and tokenizer based on the given model name or local path of the downloaded model.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True,
                                              use_fast=False,
                                              legacy=False)
    tokenizer.padding_side = "left"
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids("<unk>")
    tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("</s>")

    model = AutoGPTQForCausalLM.from_quantized(model_path,
                                               inject_fused_attention=False,
                                               inject_fused_mlp=False,
                                               use_cuda_fp16=True,
                                               disable_exllama=False,
                                               device_map='auto'  # Supports multi-GPU
                                               )
    return model, tokenizer


def inference(model, tokenizer, prompt):
    """
    Use the given model and tokenizer to generate an answer for the specified prompt.
    """
    st = time.time()
    prompt = prompt if prompt.endswith('\n') else f'{prompt}\n'
    # Wrap the prompt in the CodeFuse chat template: a human turn followed by an empty bot turn.
    inputs = f"<|role_start|>human<|role_end|>{prompt}<|role_start|>bot<|role_end|>"

    input_ids = tokenizer.encode(inputs,
                                 return_tensors="pt",
                                 padding=True,
                                 add_special_tokens=False).to("cuda")
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=input_ids,
            top_p=0.95,
            temperature=0.1,
            do_sample=True,
            max_new_tokens=512,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id
        )
    print(f'generated tokens num is {len(generated_ids[0][input_ids.size(1):])}')
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    print(f'generate text is {outputs[0][len(inputs):]}')
    latency = time.time() - st
    print('latency is {} seconds'.format(latency))


if __name__ == "__main__":
    model_dir = snapshot_download('codefuse-ai/CodeFuse-CodeLlama-34B-4bits', revision='v1.0.0')

    prompt = 'Please write a QuickSort program in Python'

    model, tokenizer = load_model_tokenizer(model_dir)
    inference(model, tokenizer, prompt)
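Note that the snippet above samples with top_p/temperature, while the HumanEval pass@1 numbers at the top of this post were reported with greedy decoding. A minimal variant of the generate call for greedy decoding (only the decoding flags change; this is just a sketch, not the evaluation harness used for the benchmark) would be:

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=input_ids,
        do_sample=False,              # greedy decoding
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )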
Resource consumption:
We measured the GPU memory footprint right after the model is loaded, as well as the memory usage when generating 1024/2048 output tokens from 2048/1024 input tokens, as shown in the table below:
| Precision | Model loaded (idle) | 2048 input + 1024 output tokens | 1024 input + 2048 output tokens |
| --- | --- | --- | --- |
| bfloat16 | 64.89 GB | 69.31 GB | 66.41 GB |
| int4 | 19.09 GB | 22.19 GB | 20.78 GB |
GPU memory usage of the int4 example code: (figure omitted)
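A simple way to reproduce this kind of measurement yourself (an illustrative sketch using standard PyTorch memory APIs, not the exact script behind the table above) is:

import torch

# Reset the peak-memory counter, run one generation, then read the peak back.
# Note: this reports memory allocated by PyTorch's caching allocator; nvidia-smi
# will show a somewhat higher number due to the CUDA context and allocator overhead.
torch.cuda.reset_peak_memory_stats()
inference(model, tokenizer, prompt)
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak GPU memory during generation: {peak_gb:.2f} GB")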