Preprocessing inputs
Info
The Cornstarch repository provides an end-to-end example in `examples/pretrain_vlm.py`.
Running a multimodal model requires multimodal inputs. However, the processors of the individual modality models do not implement the interaction required for multimodal model execution.
In this section, we introduce the basics of this multimodal interaction and the Cornstarch APIs that provide it.
Interaction between modality inputs
Multimodal LLMs merge the modality encoder outputs into the text embedding and run the LLM on the merged sequence.
In the text input, it is typical to use special tokens such as `<image>` to indicate where the modality encoder outputs should be located:
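For example, a text input for a vision-language model may look like the following (an illustrative prompt, not taken from the Cornstarch repository):

```text
<image>
What is shown in this image?
```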
When this text is tokenized, the `<image>` token is replaced with multiple image token IDs that act as placeholders; the modality encoder outputs are later injected by replacing the hidden states at the corresponding locations.
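HuggingFace processors for models such as Llava perform this replacement automatically. A minimal sketch, assuming a recent transformers version in which the processor itself expands the image token; the checkpoint name is only an example:

```python
from PIL import Image
from transformers import AutoProcessor

# Checkpoint name is illustrative; any Llava-family checkpoint behaves similarly.
processor = AutoProcessor.from_pretrained("llava-hf/llama3-llava-next-8b-hf")

image = Image.open("example.jpg")
prompt = "<image>\nWhat is shown in this image?"  # real prompts would also apply the chat template

inputs = processor(images=image, text=prompt, return_tensors="pt")
# The single <image> token in the prompt is expanded into many image token IDs
# (128256 here, the first ID after the Llama-3 vocabulary), one placeholder per image feature.
```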
Cornstarch MultimodalProcessor
To support the same feature with multimodal-unaware processors, Cornstarch provides the `MultimodalProcessor` class, which defines the multimodal interaction between modality processors, feature extractors, and a tokenizer.
The following example creates a `MultimodalProcessor` that wraps a CLIP image processor for the vision modality encoder, a Whisper feature extractor for the audio modality encoder, and a Llama tokenizer for the LLM into a single processor:
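A minimal construction sketch. The import path and the constructor argument names (`encoder_processors`, `llm_tokenizer`, `model`) are assumptions for illustration, `mm_model` is a `MultimodalModel` assumed to have been built beforehand, and the checkpoint names are examples; see `examples/pretrain_vlm.py` for the exact API.

```python
from transformers import AutoTokenizer, CLIPImageProcessor, WhisperFeatureExtractor

# Import path is an assumption; check the Cornstarch repository for the real one.
from cornstarch.models.multimodal_language_model import MultimodalProcessor

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

mm_processor = MultimodalProcessor(
    encoder_processors={"vision": image_processor, "audio": feature_extractor},
    llm_tokenizer=tokenizer,
    model=mm_model,  # the MultimodalModel built earlier; see "Token ID configuration" below
)
```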
Functions for calculating the number of features
In the Llava processor example above, the tokenized input contains many image tokens (ID 128256). The number of image tokens must exactly match the number of image features; otherwise, merging the modality encoder outputs fails.
Because Cornstarch does not know how many features the given modality encoder will generate, users need to provide functions that return the number of features in `num_features_calculation_funcs`:
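For example, under the same naming assumptions as the construction sketch above, a per-modality dictionary of callables can be passed to the constructor; `calculate_vision_num_features` is sketched right below:

```python
mm_processor = MultimodalProcessor(
    encoder_processors={"vision": image_processor},
    llm_tokenizer=tokenizer,
    model=mm_model,
    # One callable per modality key; calculate_vision_num_features is defined below.
    num_features_calculation_funcs={"vision": calculate_vision_num_features},
)
```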
Cornstarch passes the processor inputs and the processor outputs to the function as two dictionaries:
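A sketch of such a function for the CLIP image processor above; the `pixel_values` key and the patch size are assumptions about the underlying image processor:

```python
def calculate_vision_num_features(inputs: dict, outputs: dict) -> list[int]:
    # inputs:  what was passed to the CLIP image processor for this call
    # outputs: what the image processor returned
    pixel_values = outputs["pixel_values"]   # (num_images, channels, height, width)
    patch_size = 14                          # ViT-L/14 patch size (an assumption)
    num_patches = (pixel_values.shape[-2] // patch_size) * (pixel_values.shape[-1] // patch_size)
    num_features = [num_patches] * len(pixel_values)   # e.g. 24 x 24 = 576 for 336 x 336 inputs
    return num_features
```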
where `num_features` should be either:

- `list[int]`: a list of the number of features, one per modality input, across the entire batch, or
- `list[list[int]]`: a list of lists of the number of features, one per modality input per batch.
Note
For modality encoders that Cornstarch officially supports, the calculation functions are set automatically.
However, if your multimodal model needs more features than the base encoder provides (e.g. Llava-Next's dynamic high resolution, which its underlying CLIP vision encoder does not support), or if you use a modality encoder that Cornstarch does not know, you must provide a custom function.
Token ID configuration
The tokenizer does not know which special tokens should be used for the modality encoders.
Likewise, the LLM in `MultimodalModel` does not know which token IDs should be replaced with the modality encoder outputs when merging them.
For this reason, unlike HuggingFace transformers processors that are independent from models, `MultimodalProcessor` requires taking the `MultimodalModel` to set up this interaction; this is why the construction sketch above passes the model to the processor.
By default, Cornstarch registers a `<modal_key>` special token to the tokenizer for each modality, e.g. `<vision>` for the vision encoder and `<audio>` for the audio encoder.
When your dataset already includes its own special token, you can override the default by providing `predefined_tokens`.
The following example registers `<image>` instead of `<vision>` for the vision encoder:
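A sketch under the same naming assumptions as the construction example above; only the `predefined_tokens` argument is new here:

```python
mm_processor = MultimodalProcessor(
    encoder_processors={"vision": image_processor},
    llm_tokenizer=tokenizer,
    model=mm_model,
    # Use the dataset's own <image> token instead of the default <vision> token.
    predefined_tokens={"vision": "<image>"},
)
```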
Data preprocessing with MultimodalProcessor
Cornstarch designs `MultimodalProcessor` to give users maximum flexibility in data processing.
To avoid duplicated or conflicting arguments across the multiple modality processors and the LLM, `MultimodalProcessor` takes one input dictionary per modality encoder and one for the LLM:
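A sketch of a call, under the same naming assumptions as above; the argument names `encoder_inputs` and `llm_inputs` are assumptions, while the inner keys (`images`, `raw_speech`, `text`) are the regular arguments of the wrapped CLIP image processor, Whisper feature extractor, and Llama tokenizer:

```python
import numpy as np
from PIL import Image

images = [Image.open("example.jpg")]
speech = [np.zeros(16000, dtype=np.float32)]   # 1 second of 16 kHz audio as a placeholder
texts = ["<vision> <audio> Describe what you see and hear."]  # default special tokens

batch = mm_processor(
    encoder_inputs={
        "vision": {"images": images},       # arguments for the CLIP image processor
        "audio": {"raw_speech": speech},    # arguments for the Whisper feature extractor
    },
    llm_inputs={"text": texts},             # arguments for the Llama tokenizer
)
```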
Cornstarch executes each processor and the tokenizer with its corresponding input dictionary.
It also forwards the arguments in `kwargs` to every processor that accepts them, so you do not have to repeat common arguments in the dictionaries of multiple processors.
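For example (same assumptions as the call sketch above), a common argument such as `return_tensors` only needs to be passed once:

```python
batch = mm_processor(
    encoder_inputs={"vision": {"images": images}},
    llm_inputs={"text": texts, "padding": True},
    return_tensors="pt",  # given once and forwarded to both the image processor and the tokenizer
)
```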