Fine-tuning the Qwen2-7B Model: Adding a Corpus of Unknown Information
For a model like Qwen2, the corpus fed in during fine-tuning should follow the ChatML format.
OpenAI - ChatML
Below is an introduction to the ChatML format:
https://github.com/openai/openai-python/blob/release-v0.28.1/chatml.md
Traditionally, GPT models consumed unstructured text.
ChatGPT models instead require a structured format, called Chat Markup Language (ChatML for short).
A ChatML document consists of a sequence of messages. Each message contains a header (which today consists of who said it, but in the future will contain other metadata) and contents (which today is a text payload, but in the future will contain other data types).
ChatML is still evolving, but the current version (ChatML v0) can be represented with the upcoming "list of dicts" JSON format, as follows:
[
{"token": "<|im_start|>"},
"system\nYou are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible.\nKnowledge cutoff: 2021-09-01\nCurrent date: 2023-03-01",
{"token": "<|im_end|>"}, "\n", {"token": "<|im_start|>"},
"user\nHow are you",
{"token": "<|im_end|>"}, "\n", {"token": "<|im_start|>"},
"assistant\nI am doing well!",
{"token": "<|im_end|>"}, "\n", {"token": "<|im_start|>"},
"user\nHow are you now?",
{"token": "<|im_end|>"}, "\n"
]
You can also represent it in the classic "unsafe raw string" format.
However, this format inherently allows injection from user input containing the special-token syntax, similar to a SQL injection:
<|im_start|>system
You are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible.
Knowledge cutoff: 2021-09-01
Current date: 2023-03-01<|im_end|>
<|im_start|>user
How are you<|im_end|>
<|im_start|>assistant
I am doing well!<|im_end|>
<|im_start|>user
How are you now?<|im_end|>
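Qwen2-Instruct itself uses this ChatML layout in its chat template, which is why the fine-tuning corpus below is formatted the same way. As a quick sanity check (a minimal sketch, assuming the model has been downloaded to the local path ./Qwen/Qwen2-7B-Instruct used later in this post), you can render a conversation without tokenizing and inspect the raw ChatML text:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./Qwen/Qwen2-7B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How are you"},
]
# Render the conversation as text (no tokenization) to see the ChatML string
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
# Expected output:
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# How are you<|im_end|>
# <|im_start|>assistant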
Obtaining the corpus:
You can have Doubao or ChatGPT analyze an uploaded document; a suitable prompt is:
Based on the document, list a fairly comprehensive set of questions and answers.
Code implementation:
yuliao.txt:
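The contents of yuliao.txt are not reproduced here. The training script qwen2.py below imports QwenDataset from a dataset module that is also not shown, so here is a minimal sketch of dataset.py, assuming yuliao.txt stores one question–answer pair per line separated by a tab (both the file layout and this loader are assumptions, not taken from the original post). Each item must be a (question, answer) tuple, because that is how collate_fn in qwen2.py indexes it.
# dataset.py -- minimal sketch of the QwenDataset imported by qwen2.py
# Assumption: yuliao.txt contains one "question<TAB>answer" pair per line.
from torch.utils.data import Dataset

class QwenDataset(Dataset):
    def __init__(self, path="yuliao.txt"):
        self.pairs = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                question, answer = line.split("\t", 1)
                self.pairs.append((question, answer))

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        # Returns a (question, answer) tuple, matching collate_fn in qwen2.py
        return self.pairs[idx]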
qwen2.py:
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForCausalLM, get_scheduler
from dataset import QwenDataset
from tqdm import tqdm

# 1. Load the pretrained model and tokenizer
model_name = "./Qwen/Qwen2-7B-Instruct"  # replace with the Qwen2 model path you are actually using
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


def collate_fn(batch):
    # Render each (question, answer) pair into the ChatML fine-tuning template
    finetune_template = """<|im_start|>system
You are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible.
Knowledge cutoff: 2023-09-01
Current date: 2024-08-25<|im_end|>
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
{answer}<|im_end|>
"""
    finetune_list = []
    for question, answer in batch:
        finetune_list.append(finetune_template.format(question=question, answer=answer))
    return finetune_list


def train():
    train_dataset = QwenDataset()
    train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True, collate_fn=collate_fn)

    # 2. Define the optimizer and learning-rate scheduler
    optimizer = AdamW(model.parameters(), lr=5e-5)
    num_epochs = 30  # number of training epochs
    num_training_steps = len(train_dataloader) * num_epochs
    scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )

    # 3. Select the device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # 4. Start training
    for epoch in range(num_epochs):
        model.train()
        for batch in tqdm(train_dataloader):
            # Tokenize the rendered ChatML strings into model inputs
            inputs = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=512).to(device)
            input_ids = inputs['input_ids']
            # Use input_ids as labels, because we want the model to predict the next token
            labels = input_ids.clone()

            # Forward pass
            outputs = model(input_ids=input_ids, labels=labels)
            loss = outputs.loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()

        print(f"Epoch {epoch}, Loss: {loss.item()}")

    # 5. Save the fine-tuned model
    model.save_pretrained("./fine-tuned-model")
    tokenizer.save_pretrained("./fine-tuned-model")


if __name__ == "__main__":
    train()
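One optional refinement, not part of the script above and only a sketch of a common Hugging Face convention: padded positions can be excluded from the loss by setting their labels to -100, the ignore index used by the model's cross-entropy loss. Inside the batch loop, the labels would then be built like this:
labels = input_ids.clone()
# Positions labeled -100 are ignored by the loss, so padding no longer contributes to it
labels[inputs['attention_mask'] == 0] = -100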
Once training completes, you can run generation.
Generating a new conversation:
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "./fine-tuned-model",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")

prompt = "熟悉哪些算法模型和算法原理?"
print(prompt)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)