化学科学催化反应产率预测之大模型回归（Datawhale AI 夏令营）

在本文中，我们学习了如何使用大模型API和下载模型进行催化反应产率预测。通过构造Prompt和并行处理样本，我们实现了对化学反应产率的预测。在未来的研究中，我们可以进一步优化Prompt工程，提高预测精度。欢迎关注我的后续博文，我将分享更多关于人工智能、自然语言处理和计算机视觉的精彩内容。

冷基栋_攻城师

1663人浏览 · 2024-08-01 04:15:00

冷基栋_攻城师 · 2024-08-01 04:15:00 发布

在本文中，我们将探索如何使用大模型解决化学科学中的催化反应产率预测问题。需要说明的是，大模型在有限空间内的分类问题上表现出色，但在回归问题（如产率预测）上表现相对较弱。为了提高大模型的预测精度，许多研究将回归问题转化为分类问题，例如将产率预测划分为“高产率”和“低产率”两类。

了解大模型在化学中的应用

本节目标：

了解大模型在化学领域的研究进展和应用
认识大模型在各种化学任务中的表现
了解大模型的局限性和优势
掌握提示词工程（Prompt Engineering）的基本概念

大模型与化学领域的研究进展

化学研究中的大模型应用不断进步。以下是一些关键的研究论文：

ChemLLM: A Chemical Large Language Model ChemLLM
What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks Benchmark

为什么化学需要大模型？

大模型是一种预训练的深度神经网络，通过提示词可以引导其利用已学知识输出结果。大模型在许多分类任务中表现出色，但在化学等科学领域，许多数据并非序列类型，因此需要转换为序列格式才能输入大模型。本赛题使用药物的SMILES序列，非常适合直接输入大模型

大模型的局限性

需要注意的是，大模型更擅长分类任务，而不擅长处理回归任务。目前的研究将产率预测转换为“高产率”和“低产率”的二分类问题，以提高大模型的预测精度。

提示词工程（Prompt Engineering）

提示词工程是一门新的学科，关注如何设计和优化提示词，帮助用户将大语言模型用于各种场景和研究领域。提示词工程的目标是提升大模型处理复杂任务的能力，如问答和算术推理。

实践部分：使用大模型进行产率预测

如何调用大模型API

目前有许多国产大模型提供了API接口，如阿里的Qwen模型和讯飞的星火大模型。以下是如何使用这些API的简要步骤：

使用Qwen-7b-instruct模型

申请API密钥。
调用API进行预测。

from http import HTTPStatus
import dashscope

MODEL_NAME = 'qwen2-7b-instruct'
dashscope.api_key="your-api-key"

def get_completions(query, MODEL_NAME='qwen2-7b-instruct'):
    messages = [{'role': 'user', 'content': query}]
    response = dashscope.Generation.call(
        MODEL_NAME,
        messages=messages,
        result_format='message'
    )
    if response.status_code == HTTPStatus.OK:
        return response['output']['choices'][0]['message']['content']
    else:
        raise Exception("API调用失败")

使用讯飞的星火大模型

申请API密钥。
调用API进行预测。

from sparkai.llm.llm import ChatSparkLLM, ChunkPrintHandler
from sparkai.core.messages import ChatMessage

SPARKAI_URL = 'wss://spark-api.xf-yun.com/v3.5/chat'
SPARKAI_APP_ID="your-app-id"
SPARKAI_API_SECRET="your-api-secret"
SPARKAI_API_KEY="your-api-key"
SPARKAI_DOMAIN = 'generalv3.5'

def get_completions(text):
    messages = [ChatMessage(role="user", content=text)]
    spark = ChatSparkLLM(
        spark_api_url=SPARKAI_URL,
        spark_app_id=SPARKAI_APP_ID,
        spark_api_key=SPARKAI_API_KEY,
        spark_api_secret=SPARKAI_API_SECRET,
        spark_llm_domain=SPARKAI_DOMAIN,
        streaming=False,
    )
    handler = ChunkPrintHandler()
    response = spark.generate([messages], callbacks=[handler])
    return response.generations[0][0].text

如何下载大模型

我们可以使用Hugging Face下载Qwen2-7B-Instruct模型。

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

构造Prompt

构造Prompt时，可以包括角色描述、任务描述、格式要求、示例参考和待预测样本等信息。

chem_prompt = """
你精通预测药物合成反应的产率，你的任务是根据给定的反应数据准确预测反应的产率(Yield)。请仔细阅读以下说明并严格遵循:

***** 任务描述：*****
1. 你将获得一个相应物质的SMILES字符串字段组成的催化合成反应数据，包含Reactant1(反应物1)、Reactant2(反应物2)、Product(产物)、Additive(添加剂)、Solvent(溶剂)。
2. 数据为药物合成中常用的碳氮键形成反应。
3. 待预测的Yield是目标字段，是归一化为0到1之间的4位浮点数，表示反应的产率。

***** 你的任务：*****
1. 仔细分析提供的示例数据，理解反应与Yield之间的关系。
2. 根据待预测样本的数据，预测其反应产率Yield。
3. 输出预测的Yield值，精确到小数点后四位。

***** 输出格式要求：*****
1. 仅输出预测的Yield值，不要包含任何其他解释或评论。
2. 使用以下格式输出你的预测：@{{预测的Yield值}}
   例如：@{{0.7823}}

***** 注意事项：*****
1. 你必须输出预测的产率预测结果，不能输出其他内容。
2. 确保你的预测合理，介于0.0000到1.0000之间。
3. 始终保持四位小数的格式。
"""

def generate_prompt(prompt_template, test_sample, top_samples):
    examples = "\n\n".join([f"Reactant1: {row['Reactant1']}  \nReactant2: {row['Reactant2']}  \nProduct: {row['Product']}  \nAdditive: {row['Additive']}  \nSolvent: {row['Solvent']}  \nYield: {row['Yield']}" for _, row in top_samples.iterrows()])
    return prompt_template.format(
        examples=examples,
        test_reactant1=test_sample['Reactant1'],
        test_reactant2=test_sample['Reactant2'],
        test_product=test_sample['Product'],
        test_additive=test_sample['Additive'],
        test_solvent=test_sample['Solvent']
    )

并行处理样本并生成预测结果

import backoff
import concurrent.futures
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

@backoff.on_exception(backoff.expo, Exception, max_tries=5, max_time=300)
def process_single_sample(args):
    test_sample, train_df, train_tfidf, tfidf = args
    try:
        test_tfidf = tfidf.transform([test_sample['combined_features']])
        similarities = cosine_similarity(test_tfidf, train_tfidf).flatten()
        top_k_indices = similarities.argsort()[-3:][::-1]
        top_k_samples = train_df.iloc[top_k_indices]
        prompt = generate_prompt(chem_prompt, test_sample, top_k_samples)
        prediction = get_completions(prompt)
        yield_value = extract_yield(prediction)
        if yield_value is None:
            raise ValueError("无法提取有效的产率值")
        return test_sample['rxnid'], yield_value, None
    except Exception as e:
        return test_sample['rxnid'], None, str(e)

def extract_yield(prediction):
    yield_match = re.search(r'@{(.+?)}', prediction)
    if yield_match:
        yield_value = yield_match.group(1)
        try:
            float_yield = float(yield_value)
            if 0 <= float_yield <= 1:
                return f"{float_yield:.4f}"
        except ValueError:
            logger.warning(f"无法将提取的值 '{yield_value}' 转换为浮点数")
    logger.warning("无法从预测结果中提取产率值")
    return None

def process_samples_parallel(test_df, train_df, train_tfidf, tfidf, max_workers=1, batch_size=100):
    results = {}
    error_indices = []
    total_samples = len(test_df)
    logger.info(f"开始处理 {total_samples} 个测试样本")

    batches = [test_df[i:i+batch_size] for i in range(0, total_samples, batch_size)]
    with tqdm(total=total_samples, desc="处理测试样本", unit="sample") as pbar:
        for batch in batches:
            with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
                futures = [executor.submit(process_single_sample, (row, train_df, train_tfidf, tfidf)) for _, row in batch.iterrows()]
                for future in concurrent.futures.as_completed(futures):
                    rxnid, yield_value, error = future.result()
                    if error:
                        logger.error(f"处理样本 {rxnid} 时出错: {error}")
                        error_indices.append(rxnid)
                    else:
                        results[rxnid] = yield_value
                    pbar.update(1)
    return results, error_indices

处理错误样本并生成最终结果

def process_default_yield(test_df, train_df, train_tfidf, tfidf):
    default_yield = []
    for _, test_sample in tqdm(test_df.iterrows(), total=len(test_df), desc="处理测试样本"):
        test_tfidf = tfidf.transform([test_sample['combined_features']])
        similarities = cosine_similarity(test_tfidf, train_tfidf).flatten()
        top_k_indices = similarities.argsort()[-3:][::-1]
        top_k_yields = train_df.iloc[top_k_indices]['Yield'].astype(float).values
        weights = similarities[top_k_indices] / similarities[top_k_indices].sum()
        weighted_yield = np.dot(top_k_yields, weights)
        default_yield.append(weighted_yield)
    return default_yield

logger.info("开始处理测试样本...")
results, error_indices = process_samples_parallel(test_df, train_df, train_tfidf, tfidf)
default_yield = process_default_yield(test_df, train_df, train_tfidf, tfidf)

if error_indices:
    logger.info(f"有 {len(error_indices)} 个样本处理出错，正在重新处理...")
    error_df = test_df[test_df['rxnid'].isin(error_indices)]
    retry_results, retry_error_indices = process_samples_parallel(error_df, train_df, train_tfidf, tfidf)
    results.update(retry_results)
    for rxnid in retry_error_indices:
        results[rxnid] = default_yield[int(rxnid[4:]) - 1]

logger.info("开始写入结果...")
with open('submit.txt', 'w') as f:
    f.write('rxnid,Yield\n')
    for rxnid in test_df['rxnid']:
        yield_value = results.get(rxnid, default_yield[int(rxnid[4:]) - 1])
        f.write(f"{rxnid},{yield_value:.4f}\n")
logger.info("结果已保存到submit.txt文件中")