Why We Chose miniCPM

When selecting a model for our needs, miniCPM stood out for several reasons. Chief among them is its efficiency at inference time: it runs well without extensive computational resources, making it a good fit for environments with limited hardware. In our case, miniCPM runs effectively on a single GPU, such as the NVIDIA L4.

Additionally, miniCPM has demonstrated strong performance in Optical Character Recognition (OCR) tasks. Accurately recognizing and processing text within images is crucial for our application, where understanding visual content is a core requirement. Its proven performance across a range of tasks reassured us that it would be a reliable choice for our specific needs.

Use Case & Playground

We use this as a building block for our AI-agent QA-tester. Read the blog post on how we leverage this in production, or try it out yourself in the Playground App.

Data Set

To fine-tune miniCPM for our purposes, we curated a specialized data set from user interactions. From our experience, we learned that the model performs more effectively when it’s not overwhelmed with too much information at once. To address this, we decided to separate the understanding task from the decision-making task. This separation allows the model to first comprehend the content before proceeding to make any decisions, enhancing its overall performance.
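
In practice this means the agent makes two separate model calls per UI element: one to describe what an element does, and one to decide what to do with that description. Below is a minimal sketch of that flow; ask_model is a hypothetical helper wrapping the miniCPM inference call, and the prompt wording is illustrative rather than our exact production prompts.

def ask_model(image, prompt):
    # Hypothetical wrapper around a miniCPM inference call
    # (e.g. the model's chat API); returns the model's text reply.
    ...

def understand_then_decide(screenshot, element_html):
    # Stage 1 (understanding): describe the highlighted element in isolation,
    # similar to the prompt used in the data prep script below.
    description = ask_model(
        screenshot,
        "In a short sentence, what does the action outlined with a magenta rectangle do?\n"
        f"surrounding_html: {element_html}",
    )
    # Stage 2 (decision-making): reason over the short description instead of
    # the raw screenshot and HTML all at once.
    decision = ask_model(
        screenshot,
        f"The outlined element does the following: {description}\n"
        "Should the QA agent interact with it as the next test step? Answer yes or no.",
    )
    return description, decision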

Each data point in our data set is carefully structured to provide the necessary information for both the understanding and decision-making tasks. Each one includes the following elements (an illustrative example follows the list):

  • Bounding Box of Element (Coordinates):
    This defines the specific area of the screen or document where the relevant information is located. By providing precise coordinates, the model can focus on the exact area that needs attention, improving its processing accuracy.
  • Relevant DOM / HTML:
    The Document Object Model (DOM) or HTML code associated with the content is also included. This provides the structural and contextual information necessary for the model to understand the content better.
  • Human Readable Description:
    A plain text description that offers a human-readable summary of the content. This element helps bridge the gap between raw data and semantic understanding, enabling the model to make more informed decisions.
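
Putting these together, a single data point might look like the example below (shown as a Python dict). The field names and values here are an illustration, not our exact schema; the columns consumed by the data prep script later in this post (action_type, element_type, text, html, summary) follow the same idea.

example_data_point = {
    "external_id": "a1b2c3",
    "image": "dataset/images/image_a1b2c3.png",
    # Bounding box of the element, in pixels, relative to the screenshot
    "bounding_box": {"x": 412, "y": 188, "width": 96, "height": 32},
    "action_type": "click",
    "element_type": "button",
    "text": "Add to cart",
    # Relevant DOM / HTML surrounding the element
    "html": '<button class="btn-primary" data-testid="add-to-cart">Add to cart</button>',
    # Human readable description of what the action does
    "summary": "Clicking this button adds the selected product to the shopping cart.",
}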

How to Fine-Tune the Model

Create the Data Set

Start with a well-annotated data set of at least 10,000 data points. Quality annotations help the model learn and generalize better.

We use Hugging Face to keep track of our data sets.
Read their docs on how to create a data set.
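
If you build the data set yourself, the datasets library can assemble the annotated records and push them to the Hugging Face Hub. The sketch below assumes column names that mirror the fields used by the data prep script further down; the repository name is a placeholder.

from datasets import Dataset, Image

# Annotated records; in practice these come from your annotation pipeline.
records = [
    {
        "external_id": "a1b2c3",
        "image": "dataset/images/image_a1b2c3.png",
        "action_type": "click",
        "element_type": "button",
        "text": "Add to cart",
        "html": "<button>Add to cart</button>",
        "summary": "Clicking this button adds the selected product to the shopping cart.",
    },
    # ... more annotated data points
]

dataset = Dataset.from_list(records)
# Store the image column as actual image data rather than file paths.
dataset = dataset.cast_column("image", Image())

# Push to the Hub (requires `huggingface-cli login`); the repo name is a placeholder.
dataset.push_to_hub("your-username/name-of-your-dataset")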

Prepare the Data Set

Download a data set from Hugging Face or use your own. Process it into a format compatible with miniCPM using the data prep notebook.

This is an example script for data preparation:

import json
import os
from datasets import load_dataset
from PIL import Image

# Load the dataset from Hugging Face
dataset = load_dataset("your-username/name-of-your-dataset")

# Directory where images will be saved; create it if it does not exist
image_dir = "dataset/images"
os.makedirs(image_dir, exist_ok=True)

PROMPT_TEMPLATE = """<image>
In a short sentence, what does the action outlined with a magenta rectangle do?
Action data:
action_type: {action_type}
element_type: {element_type}
text: {text}
surrounding_html: {html}
"""

# Build the two-turn conversation (user prompt + assistant summary) for one data point
def format_conversations_template(action_type, element_type, text, html, summary):
    return [
        {
            'role': 'user', 
            'content': PROMPT_TEMPLATE.format(
                action_type=action_type, 
                element_type=element_type,
                text=text, 
                html=html)
        }, 
        {
            'role': 'assistant', 
            'content': summary
        }
    ]

print("Preparing training dataset...")

# Initialize the list to hold the JSON data
json_data = []

# Iterate over the dataset to create the JSON format
for idx, row in enumerate(dataset['train']):  # Adjust to the correct split if necessary (e.g., 'validation', 'test')
    if idx % 100 == 0:
        print(f"Processing row {idx+1}/{len(dataset['train'])}")
    
    # Limit this example run to the first ~100 rows; remove this check for a full export
    if idx > 100:
        break
        
    # Extract the image and save it
    image = row['image']  # This is the PIL.Image object
    image_path = os.path.join(image_dir, f"image_{row['external_id']}.png") 
    image.save(image_path)

    entry = {
        "id": row['external_id'],
        "image": image_path,
        "conversations": format_conversations_template(
            action_type=row['action_type'], 
            element_type=row['element_type'],
            text=row['text'], 
            html=row['html'],
            summary=row['summary'],
        )
    }
    json_data.append(entry)

# Save to a JSON file
output_file = "dataset/train.json"
with open(output_file, "w") as f:
    json.dump(json_data, f, indent=4)

print(f"Data successfully saved to {output_file}")

Set Up a Fine-Tuning Environment

Use a machine with a high-performance GPU, such as an NVIDIA H100. The fine-tuning run takes a few hours and should be repeated periodically to keep the model up to date.

On Google Cloud, see https://cloud.google.com/compute/docs/gpus for the available GPU machine types.
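
Before launching a multi-hour run, it is worth confirming that PyTorch can actually see the GPU and that the precision you plan to train with is supported. A quick check might look like this:

import torch

# Confirm the GPU is visible and report its name and memory.
assert torch.cuda.is_available(), "No CUDA device found - check drivers and the machine type."
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, {props.total_memory / 1024**3:.0f} GiB")

# The fine-tuning script below uses fp16; bf16 is an alternative on newer GPUs such as the H100.
print(f"bf16 supported: {torch.cuda.is_bf16_supported()}")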

Run the Fine-Tuning

You can use Low-Rank Adaptation (LoRA) or full fine-tuning. We chose LoRA for its lower resource requirements and good results. The script below is adapted from the MiniCPM-V repository's LoRA fine-tuning example:

#!/bin/bash

GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

MODEL="openbmb/MiniCPM-V-2_6" # or openbmb/MiniCPM-V-2, openbmb/MiniCPM-Llama3-V-2_5
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="path/to/trainging_data"
EVAL_DATA="path/to/test_data"
LLM_TYPE="qwen2" 
# if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm
#if use openbmb/MiniCPM-Llama3-V-2_5, please set LLM_TYPE=llama3

MODEL_MAX_Length=2048 # if conduct multi-images sft, please set MODEL_MAX_Length=4096

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py  \
    --model_name_or_path $MODEL \
    --llm_type $LLM_TYPE \
    --data_path $DATA \
    --eval_data_path $EVAL_DATA \
    --remove_unused_columns false \
    --label_names "labels" \
    --prediction_loss_only false \
    --bf16 false \
    --bf16_full_eval false \
    --fp16 true \
    --fp16_full_eval true \
    --do_train \
    --do_eval \
    --tune_vision true \
    --tune_llm false \
    --use_lora true \
    --lora_target_modules "llm\..*layers\.\d+\.self_attn\.(q_proj|k_proj|v_proj|o_proj)" \
    --model_max_length $MODEL_MAX_Length \
    --max_slice_nums 9 \
    --max_steps 10000 \
    --eval_steps 1000 \
    --output_dir output/output__lora \
    --logging_dir output/output_lora \
    --logging_strategy "steps" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-6 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --gradient_checkpointing true \
    --deepspeed ds_config_zero2.json \
    --report_to "tensorboard" # wandb


Full documentation: https://github.com/OpenBMB/MiniCPM-V
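
Once training finishes, the LoRA adapter is written to the output_dir configured above. The sketch below shows one way to load it for a quick sanity check using the peft library; the adapter and image paths are placeholders, and the exact chat() signature differs between MiniCPM-V versions, so check the repository README for your model.

import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoModel, AutoTokenizer

base = "openbmb/MiniCPM-V-2_6"
# output_dir from the fine-tuning script (or a checkpoint-* subdirectory inside it)
adapter_path = "output/output__lora"

tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModel.from_pretrained(base, trust_remote_code=True, torch_dtype=torch.float16)
model = model.eval().cuda()
# Attach the fine-tuned LoRA weights on top of the base model.
model = PeftModel.from_pretrained(model, adapter_path)

image = Image.open("path/to/screenshot.png").convert("RGB")
question = "In a short sentence, what does the action outlined with a magenta rectangle do?"
# MiniCPM-V-2_6 accepts the image inside the message content; older versions take image= directly.
msgs = [{"role": "user", "content": [image, question]}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))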

By following these steps, you can fine-tune miniCPM to perform optimally for your specific application.

Further Reading

Using Multimodal LLMs to Understand UI Elements on Websites

Try it out in the Playground App