For a model like Qwen2, the corpus fed in during fine-tuning should follow the ChatML format.

OpenAI - ChatML

Below is an introduction to the ChatML format:

https://github.com/openai/openai-python/blob/release-v0.28.1/chatml.md

Traditionally, GPT models consumed unstructured text.
ChatGPT models instead expect a structured format called Chat Markup Language (ChatML for short).
A ChatML document consists of a sequence of messages. Each message contains a header (which today identifies who is speaking, but will carry other metadata in the future) and contents (today a text payload, but other data types will come later).

We are still evolving ChatML, but the current version (ChatML v0) can be represented with our upcoming "list of dicts" JSON format as follows:

[
 {"token": "<|im_start|>"},
 "system\nYou are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible.\nKnowledge cutoff: 2021-09-01\nCurrent date: 2023-03-01",
 {"token": "<|im_end|>"}, "\n", {"token": "<|im_start|>"},
 "user\nHow are you",
 {"token": "<|im_end|>"}, "\n", {"token": "<|im_start|>"},
 "assistant\nI am doing well!",
 {"token": "<|im_end|>"}, "\n", {"token": "<|im_start|>"},
 "user\nHow are you now?",
 {"token": "<|im_end|>"}, "\n"
]
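
For illustration, here is a small Python helper (our own sketch, not part of the ChatML spec; the render_chatml name is hypothetical) that flattens such a message list into the single raw string shown in the "unsafe raw string" form discussed next:

IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def render_chatml(messages):
    # messages: list of {"role": ..., "content": ...} dicts
    return "".join(
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages
    )

print(render_chatml([
    {"role": "user", "content": "How are you"},
    {"role": "assistant", "content": "I am doing well!"},
]))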

You can also represent it in the classic "unsafe raw string" format.
However, this format inherently allows injection from user input that contains special-token syntax, similar to SQL injection:

<|im_start|>system
You are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible.
Knowledge cutoff: 2021-09-01
Current date: 2023-03-01<|im_end|>
<|im_start|>user
How are you<|im_end|>
<|im_start|>assistant
I am doing well!<|im_end|>
<|im_start|>user
How are you now?<|im_end|>
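
Because the raw format is injectable, user text should not be spliced in verbatim. A simple mitigation (our own sketch, not from the ChatML spec) is to strip the special-token strings from user input before it enters the prompt:

def sanitize(user_text: str) -> str:
    # Remove ChatML control-token syntax so user input cannot
    # forge extra <|im_start|>/<|im_end|> message boundaries.
    for token in ("<|im_start|>", "<|im_end|>"):
        user_text = user_text.replace(token, "")
    return user_text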

Obtaining the corpus:

You can have Doubao or ChatGPT analyze an uploaded document to produce Q&A pairs; a concrete prompt is:

Based on the document, list a reasonably comprehensive set of questions and answers

Code implementation:

yuliao.txt:
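
dataset.py:

The QwenDataset class imported by qwen2.py below lives in a separate dataset.py, which is not reproduced here. The following is a minimal sketch, assuming yuliao.txt stores one Q&A pair per line in the form "question<TAB>answer" (that file layout is an assumption, not from the original):

from torch.utils.data import Dataset

class QwenDataset(Dataset):
    # Assumption: yuliao.txt holds one sample per line, "question<TAB>answer".
    def __init__(self, path="yuliao.txt"):
        self.pairs = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or "\t" not in line:
                    continue
                question, answer = line.split("\t", 1)
                self.pairs.append((question, answer))

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx]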

qwen2.py:

import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForCausalLM, get_scheduler
from dataset import QwenDataset  # custom dataset module (see dataset.py above)
from tqdm import tqdm

# 1. Load the pretrained model and tokenizer
model_name = "./Qwen/Qwen2-7B-Instruct"  # replace with the actual Qwen2 model path you use
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def collate_fn(batch):
    # Build one ChatML-formatted training string per (question, answer) pair.
    # Each sample ends with the assistant's reply, which is what the model
    # should learn to produce.
    finetune_template = \
"""<|im_start|>system
You are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible.
Knowledge cutoff: 2023-09-01
Current date: 2024-08-25<|im_end|>
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
{answer}<|im_end|>
"""
    finetune_list = []
    for item in batch:
        question = item[0]
        answer = item[1]
        finetune_list.append(finetune_template.format(question=question, answer=answer))

    return finetune_list

def train():
    # 2. Build the dataset and dataloader
    train_dataset = QwenDataset()
    train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True, collate_fn=collate_fn)

    # 3. Define the optimizer and learning-rate scheduler
    num_epochs = 3  # train for 3 epochs
    optimizer = AdamW(model.parameters(), lr=5e-5)
    num_training_steps = len(train_dataloader) * num_epochs
    scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )

    # 4. Move the model to the training device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # 5. Training loop
    for epoch in range(num_epochs):
        model.train()
        for batch in tqdm(train_dataloader):
            # Tokenize the ChatML strings into model inputs
            inputs = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=512).to(device)
            input_ids = inputs['input_ids']

            # Use `input_ids` as `labels` for next-token prediction;
            # mask padding positions so they are ignored by the loss
            labels = input_ids.clone()
            labels[labels == tokenizer.pad_token_id] = -100

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=inputs['attention_mask'], labels=labels)
            loss = outputs.loss

            # Backward pass and parameter update
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()

            print(f"Epoch {epoch}, Loss: {loss.item()}")

    # 6. Save the fine-tuned model and tokenizer
    model.save_pretrained("./fine-tuned-model")
    tokenizer.save_pretrained("./fine-tuned-model")



if __name__ == "__main__":
    train()
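
As an aside, instead of hand-writing the ChatML template in collate_fn, the Qwen2 tokenizer ships with a built-in chat template that emits the same ChatML layout. A sketch of the equivalent sample construction, reusing the tokenizer loaded above and assuming it keeps Qwen2's default template:

# Build one training sample via the tokenizer's chat template.
question, answer = "What is ChatML?", "A structured chat markup format."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": question},
    {"role": "assistant", "content": answer},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)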

Once training finishes, you can run generation.

Generating a new conversation:

from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "./fine-tuned-model",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")

prompt = "熟悉哪些算法模型和算法原理?"  # "Which algorithm models and principles are you familiar with?"
print(prompt)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
