如何将视觉大模型（VLM）与多模态RAG 结合起来，创建服装搜索和搭配推荐！本文展示了InternVL模型在分析服装图像和提取颜色、款式和类型等关键特征方面的强大功能。

InternVL2是国内首个在MMMU(多学科问答)上突破60的模型，堪称开源多模态大模型性能新标杆。

基于视觉大模型的功能，本文采用自定义匹配算法和 RAG 技术在我们的知识库中搜索与已识别特征互补的项目。该算法考虑了颜色兼容性和样式一致性等因素，为用户提供合适的建议。

使用VLM+ RAG（检索增强生成）的组合具有以下几个优点：

情境理解：VLM可以分析输入图像并理解情境，例如所描绘的对象、场景和活动。这允许在各个领域（OCR，道路交通，商品图片）提供更准确、更相关的建议或信息。
丰富的知识库：RAG 将 VLM的生成功能与检索组件相结合，该组件可以访问不同领域的大量信息。这意味着系统可以根据从历史事实到科学概念的广泛知识提供建议或见解。
定制：该方法允许轻松定制以满足各种应用程序中的特定用户需求或偏好。系统都可以进行调整以提供个性化的体验。

InternVL模型链接：

https://modelscope.cn/models/OpenGVLab/InternVL2-8B

LMDeploy代码链接：

https://github.com/InternLM/lmdeploy

GTE模型链接：

https://modelscope.cn/models/iic/nlp_gte_sentence-embedding_english-base

最佳实践参考链接：

https://cookbook.openai.com/examples/how_to_combine_gpt4o_with_rag_outfit_assistant

最佳实践

最佳实践流程图如下，本文主要通过多模态RAG搜索相似的衣服，并推荐最佳搭配，其中主要考察了多模态模型的图片描述生成（稳定生成JSON格式）以及多图对比的能力。

环境安装

首先，我们将安装必要的依赖项，然后导入库并编写一些稍后将使用的实用函数，ModelScope >= 1.6.0镜像已经预装LMDeploy

pip install openai tenacity tqdm numpy typing tiktoken concurrent fastapi-cli --quiet
git clone https://github.com/openai/openai-cookbook.git
cd openai-cookbook/examples

使用LMDeploy v0.5.0, 需要先设置chat template. 创建如下json文件chat_template.json.

{
    "model_name":"internlm2",
    "meta_instruction":"你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型，英文名叫InternVL, 是一个有用无害的人工智能助手。",
    "stop_words":["<|im_start|>", "<|im_end|>"]
}

模型下载和运行

modelscope login --token YOUR_MODELSCOPE_SDK_TOKEN
modelscope download --model=OpenGVLab/InternVL2-8B --local_dir ./InternVL2-8B
LMDeploy serve api_server ./InternVL2-8B/ --model-name InternVL2-8B --server-port 23333 --chat-template chat_template.json

模型定义

import pandas as pd
import numpy as np
import json
import ast
import tiktoken
import concurrent
from openai import OpenAI
from tqdm import tqdm
from tenacity import retry, wait_random_exponential, stop_after_attempt
from IPython.display import Image, display, HTML
from typing import List

client = OpenAI(api_key='local client', base_url='http://0.0.0.0:23333/v1')

InternVL_MODEL = "InternVL2-8B"
EMBEDDING_MODEL = "iic/nlp_gte_sentence-embedding_english-base"

创建Embedding

我们现在将通过选择数据库并为其生成Embedding来设置知识库。参考数据文件夹中的 sample_styles.csv 文件。这是一个包含约 44K 个项目的数据集样本。

styles_filepath = "data/sample_clothes/sample_styles.csv"
styles_df = pd.read_csv(styles_filepath, on_bad_lines='skip')
print(styles_df.head())
print("Opened dataset successfully. Dataset has {} items of clothing.".format(len(styles_df)))
from modelscope.models import Model
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

def get_emb(chunked_corpus=None, sequence_length=8191):
    pip = pipeline(Tasks.sentence_embedding,
                   model=EMBEDDING_MODEL,
                   sequence_length=sequence_length)
get_emb()

现在我们将为整个数据集生成Embedding。我们可以并行执行这些Embedding，以确保脚本可以扩展到更大的数据集。

from modelscope.models import Model
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

def get_embeddings(chunked_corpus, sequence_length=8191):
    pip = pipeline(Tasks.sentence_embedding,
                   model=EMBEDDING_MODEL,
                   sequence_length=sequence_length)
    inputs = {"source_sentence": chunked_corpus}
    res = pip(inputs)['text_embedding']
    embs = [list(item) for item in res]
    
    return embs


# Splits an iterable into batches of size n.
def batchify(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx : min(ndx + n, l)]


def embed_corpus(
    corpus: List[str],
    batch_size=64,
    num_workers=8,
    sequence_length=8191):
    chunked_corpus = [
        item[:sequence_length] for item in corpus]

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [
            executor.submit(get_embeddings, text_batch)
            for text_batch in batchify(chunked_corpus, batch_size)
        ]

        with tqdm(total=len(chunked_corpus)) as pbar:
            for _ in concurrent.futures.as_completed(futures):
                pbar.update(batch_size)

        embeddings = []
        for future in futures:
            data = future.result()
            embeddings.extend(data)

        return embeddings

    
def generate_embeddings(df, column_name):
    # Initialize an empty list to store embeddings
    descriptions = df[column_name].astype(str).tolist()
    embeddings = embed_corpus(descriptions)

    # Add the embeddings as a new column to the DataFrame
    df['embeddings'] = embeddings
    print("Embeddings created successfully.")

创建Embedding的两个选项：将为样本服装数据集创建Embedding，并将结果写入本地 .csv 文件。该过程使用魔搭社区的GTE模型，

generate_embeddings(styles_df, 'productDisplayName')
print("Writing embeddings to file ...")
styles_df.to_csv('data/sample_clothes/sample_styles_with_embeddings_new.csv', index=False)
print("Embeddings successfully stored in sample_styles_with_embeddings_new.csv")

print(styles_df.head())print("Opened dataset successfully. Dataset has {} items of clothing along with their embeddings.".format(len(styles_df)))

我们开发一个基于余弦相似度检索的函数find_similar_items，来查找数据框中的相似项。该函数接受四个参数：

input_embedding：想要找到匹配的嵌入。
embeddings：用于搜索最佳匹配的嵌入列表。
threshold（可选）：此参数指定匹配被视为有效的最小相似度得分。较高的阈值可产生更接近（更好）的匹配，而较低的阈值可返回更多项目，尽管它们可能与初始匹配度不高embedding。
top_k（可选）：此参数确定要返回的超出给定阈值的项目数。这些将是提供的得分最高的匹配项embedding。

def cosine_similarity_manual(vec1, vec2):
    """Calculate the cosine similarity between two vectors."""
    vec1 = np.array(vec1, dtype=float)
    vec2 = np.array(vec2, dtype=float)


    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)


def find_similar_items(input_embedding, embeddings, threshold=0.5, top_k=2):
    """Find the most similar items based on cosine similarity."""
    
    # Calculate cosine similarity between the input embedding and all other embeddings
    similarities = [(index, cosine_similarity_manual(input_embedding, vec)) for index, vec in enumerate(embeddings)]
    
    # Filter out any similarities below the threshold
    filtered_similarities = [(index, sim) for index, sim in similarities if sim >= threshold]
    
    # Sort the filtered similarities by similarity score
    sorted_indices = sorted(filtered_similarities, key=lambda x: x[1], reverse=True)[:top_k]

    # Return the top-k most similar items
    return sorted_indices

def find_matching_items_with_rag(df_items, item_descs):
   """Take the input item descriptions and find the most similar items based on cosine similarity for each description."""
   
   # Select the embeddings from the DataFrame.
   embeddings = df_items['embeddings'].tolist()

   
   similar_items = []
   for desc in item_descs:
      
      # Generate the embedding for the input item
      input_embedding = get_embeddings([desc])
    
      # Find the most similar items based on cosine similarity
      similar_indices = find_similar_items(input_embedding, embeddings, threshold=0.6)
      similar_items += [df_items.iloc[i] for i in similar_indices]
    
   return similar_items

图片分析

在分析模块中，我们利用InternVL模型，来分析输入图像并提取重要特征，如详细描述、样式和类型等等。分析通过简单的 API 调用执行，我们提供图像的 URL 进行分析并请求模型识别相关特征。为了确保模型返回准确的结果，我们在提示中使用了特定的技术，来输出格式规范。我们指示模型返回具有预定义结构的 JSON 块，包括：

items(str[])：字符串列表，每个字符串代表一件衣服的简明标题，包括款式、颜色和性别。这些标题与productDisplayName我们原始数据库中的属性非常相似。
category(str)：最能代表给定项目的类别。该模型从articleTypes原始样式数据框中存在的所有唯一列表中进行选择。
gender(str)：表示该物品适用性别的标签。模型从选项中进行选择[Men, Women, Boys, Girls, Unisex]。

同时我们对项目标题应包含的内容以及输出格式提供了清晰简洁的说明。输出应为 JSON 格式，但不包含json模型响应通常包含的标签。为了进一步明确预期输出，我们为模型提供了一个示例输入描述和相应的示例输出。虽然这可能会增加使用的 token 数量（从而增加调用成本），但它有助于指导模型并带来更好的整体性能。通过遵循这种结构化方法，我们旨在从模型中获取精确和有用的信息，以便进一步分析并集成到我们的数据库中。

def analyze_image(image_base64, subcategories):
    response = client.chat.completions.create(
        model=InternVL_MODEL,
        messages=[
            {
            "role": "user",
            "content": [
                {
                "type": "text",
                "text": """Given an image of an item of clothing, analyze the item and generate a JSON output with the following fields: "items", "category", and "gender". 
                           Use your understanding of fashion trends, styles, and gender preferences to provide accurate and relevant suggestions for how to complete the outfit.
                           The items field should be a list of items that would go well with the item in the picture. Each item should represent a title of an item of clothing that contains the style, color, and gender of the item.
                           The category needs to be chosen between the types in this list: {subcategories}.
                           You have to choose between the genders in this list: [Men, Women, Boys, Girls, Unisex]
                           Do not include the description of the item in the picture. Do not include the ```json ``` tag in the output.
                           
                           Example Input: An image representing a black leather jacket.

                           Example Output: {"items": ["Fitted White Women's T-shirt", "White Canvas Sneakers", "Women's Black Skinny Jeans"], "category": "Jackets", "gender": "Women"}
                           """,
                },
                {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{image_base64}",
                },
                }
            ],
            }
        ]
    )
    # Extract relevant features from the response
    features = response.choices[0].message.content
    return features

图片测试

为了评估提示的有效性，让我们使用数据集中的精选图像加载并测试它。我们将使用文件夹中的图像"data/sample_clothes/sample_images"，确保风格、性别和类型多样。以下为选定的样本：

2133.jpg：男式衬衫
7143.jpg：女式衬衫
4226.jpg: 休闲男士印花T恤

通过用这些不同的图像测试提示，我们可以评估其准确分析和提取不同类型的服装和配饰的相关特征的能力。

我们需要一个实用函数来将 .jpg 图像以 base64 格式编码

import base64

def encode_image_to_base64(image_path):
    with open(image_path, 'rb') as image_file:
        encoded_image = base64.b64encode(image_file.read())
        return encoded_image.decode('utf-8')

接下来，我们处理图像分析的输出，并使用它来过滤和显示数据集中的匹配项。

# Set the path to the images and select a test image
image_path = "data/sample_clothes/sample_images/"
test_images = ["2133.jpg", "7143.jpg", "4226.jpg"]

# Encode the test image to base64
reference_image = image_path + test_images[0]
encoded_image = encode_image_to_base64(reference_image)

# Select the unique subcategories from the DataFrame
unique_subcategories = styles_df['articleType'].unique()

# Analyze the image and return the results
analysis = analyze_image(encoded_image, unique_subcategories)
print(analysis)
image_analysis = json.loads(analysis)

# Display the image and the analysis results
display(Image(filename=reference_image))
print(image_analysis

以下是代码的细分：

提取图像分析结果：我们从字典中提取项目描述、类别和性别image_analysis。
过滤数据集：我们过滤styles_df DataFrame 以仅包含与图像分析中的性别匹配（或男女通用）的项目，并排除与分析图像属于同一类别的项目。
查找匹配项目：我们使用该find_matching_items_with_rag函数在过滤的数据集中查找与从分析的图像中提取的描述匹配的项目。
显示匹配项：我们创建一个 HTML 字符串来显示匹配项的图像。我们使用项目 ID 构造图像路径，并将每个图像附加到 HTML 字符串。最后，我们使用display(HTML(html))在笔记本中渲染图像。

以下有效地演示了如何使用图像分析的结果，来过滤数据集，并以可视化的方式显示与分析图像的特征相匹配的项目。

# Extract the relevant features from the analysis
item_descs = image_analysis['items']
item_category = image_analysis['category']
item_gender = image_analysis['gender']


# Filter data such that we only look through the items of the same gender (or unisex) and different category
filtered_items = styles_df.loc[styles_df['gender'].isin([item_gender, 'Unisex'])]
filtered_items = filtered_items[filtered_items['articleType'] != item_category]
print(str(len(filtered_items)) + " Remaining Items")

# Find the most similar items based on the input item descriptions
matching_items = find_matching_items_with_rag(filtered_items, item_descs)

# Display the matching items (this will display 2 items for each description in the image analysis)
html = ""
paths = []
for i, item in enumerate(matching_items):
    item_id = item['id']
        
    # Path to the image file
    image_path = f'data/sample_clothes/sample_images/{item_id}.jpg'
    paths.append(image_path)
    html += f'<img src="{image_path}" style="display:inline;margin:1px"/>'

# Print the matching item description as a reminder of what we are looking for
print(item_descs)
# Display the image
display(HTML(html))

在使用 InternVL2 等大模型时，“护栏”是指为确保模型的输出保持在所需参数或边界内，而采取的机制或检查。这些护栏对于保持模型响应的质量和相关性至关重要，尤其是在处理复杂或细微的任务时。护栏之所以有用，有以下几个原因：

准确性：它们有助于确保模型的输出准确且与提供的输入相关。
一致性：它们保持模型响应的一致性，尤其是在处理相似或相关的输入时。
安全性：它们可以防止模型生成有害、攻击性或不适当的内容。
上下文相关性：它们确保模型的输出与其所使用的特定任务或领域在上下文上相关。

在我们的案例中，我们使用InternVL2分析时尚图片，并推荐可以与原始服装相配的单品。为了实施护栏，我们可以优化结果：从 InternVL2 获得初步建议后，可以将原始图片和推荐的单品发送回模型。然后，我们可以让 InternVL2 评估每个推荐的单品是否确实适合原始服装。这使得模型能够根据反馈或其他信息，进行自我修正和调整自己的输出。通过实施这些防护措施并启用自我修正，我们可以在时尚分析和推荐的背景下提高模型输出的可靠性和实用性。

为了实现这一点，我们编写了一个提示，要求大模型对建议的物品是否与原始服装相匹配的问题，做出简单的“是”或“否”回答。这种二元回答有助于简化改进过程，并确保模型提供清晰且可操作的反馈。

def check_match(reference_image_base64, suggested_image_base64):
    response = client.chat.completions.create(
        model=InternVL_MODEL,
        messages=[
            {
            "role": "user",
            "content": [
                {
                "type": "text",
                "text": """ You will be given two images of two different items of clothing.
                            Your goal is to decide if the items in the images would work in an outfit together.
                            The first image is the reference item (the item that the user is trying to match with another item).
                            You need to decide if the second item would work well with the reference item.
                            Your response must be a JSON output with the following fields: "answer", "reason".
                            The "answer" field must be either "yes" or "no", depending on whether you think the items would work well together.
                            The "reason" field must be a short explanation of your reasoning for your decision. Do not include the descriptions of the 2 images.
                            Do not include the ```json ``` tag in the output.
                           """,
                },
                {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{reference_image_base64}",
                },
                },
                {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{suggested_image_base64}",
                },
                }
            ],
            }
        ],
        max_tokens=300,
    )
    # Extract relevant features from the response
    features = response.choices[0].message.content
    return features

最后，让我们确定一下上面列出的哪些物品真正适合这套服装。

# Select the unique paths for the generated images
paths = list(set(paths))
print(paths)
for path in paths:
    # Encode the test image to base64
    suggested_image = encode_image_to_base64(path)
    
    # Check if the items match
    match = json.loads(check_match(encoded_image, suggested_image))
    
    # Display the image and the analysis results
    if match["answer"] == 'yes':
        display(Image(filename=path))
        print("The items match!")
        print(match["reason"])

我们可以观察到，最初的潜在物品清单已经得到进一步完善，从而产生了与服装更加匹配的精选物品。此外，该模型还解释了为什么每件物品被认为是很好的搭配，为决策过程提供了宝贵的见解。