oumi.core.collators#
Submodules#
oumi.core.collators.text_collator_with_padding module#
oumi.core.collators.text_completions_collator_with_padding module#
- class oumi.core.collators.text_completions_collator_with_padding.TextCompletionsCollatorWithPadding(tokenizer: PreTrainedTokenizerBase, response_template: str, train_target: str, instruction_template: str | None = None, debug: bool = False, end_of_turn_template: str | None = None, ignore_index: int = -100)[source]#
Bases: object
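A minimal usage sketch. The chat-template markers below (``<|assistant|>``, ``<|end|>``) and the pre-tokenized batch-item format are illustrative assumptions; use the markers from your model's actual chat template:

>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.collators.text_completions_collator_with_padding import (
...     TextCompletionsCollatorWithPadding,
... )
>>> tokenizer = build_tokenizer(ModelParams(model_name="microsoft/Phi-3-mini-4k-instruct"))
>>> collator = TextCompletionsCollatorWithPadding(
...     tokenizer=tokenizer,
...     response_template="<|assistant|>",
...     end_of_turn_template="<|end|>",
...     train_target="all_assistant_turns",
... )
>>> # Assumed item format: pre-tokenized examples with an "input_ids" field.
>>> text = "<|user|>Hi<|end|><|assistant|>Hello!<|end|>"
>>> batch = collator([{"input_ids": tokenizer(text)["input_ids"]}])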
oumi.core.collators.trl_data_collator_for_completion_only_lm module#
- class oumi.core.collators.trl_data_collator_for_completion_only_lm.DataCollatorForCompletionOnlyLM(response_template: str | list[int], instruction_template: str | list[int] | None = None, *args, train_target: str, end_of_turn_template: str | list[int] | None = None, mlm: bool = False, ignore_index: int = -100, padding_free: bool = False, **kwargs)[source]#
Bases: DataCollatorForLanguageModeling
Data collator for completion-only training.
Masks input labels so that the loss is only computed on specific tokens (typically assistant responses), while ignoring other tokens (system prompts, user messages, padding).
The ``train_target`` parameter selects the training target:
- ``all_assistant_turns``: Span-based masking for multi-turn and tool-calling conversations. Masks everything, then unmasks each assistant response span bounded by ``response_template``..``end_of_turn_template`` (inclusive of EOT). Correctly handles interleaved tool results and parallel tool calls.
- ``final_assistant_turn``: Masks all tokens before the last ``response_template`` occurrence. Only the final assistant response is trained on. Suitable for single-turn completions.
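As an illustrative sketch of the ``all_assistant_turns`` span logic, here is a toy re-implementation over plain token-ID lists. This is not the collator's actual code; in particular, whether the response-template tokens themselves are unmasked is an implementation detail, and this version keeps them masked:

>>> IGNORE_INDEX = -100
>>> def unmask_assistant_spans(input_ids, response_ids, eot_ids):
...     # Mask everything, then unmask each response..EOT span (EOT inclusive).
...     labels = [IGNORE_INDEX] * len(input_ids)
...     i = 0
...     while i < len(input_ids):
...         if input_ids[i : i + len(response_ids)] == response_ids:
...             start = i + len(response_ids)
...             j = start
...             while j < len(input_ids) and input_ids[j : j + len(eot_ids)] != eot_ids:
...                 j += 1
...             end = min(j + len(eot_ids), len(input_ids))
...             labels[start:end] = input_ids[start:end]
...             i = end
...         else:
...             i += 1
...     return labels
>>> # Tokens 5, 6 are an assistant reply; [1, 2] marks its start and 9 ends the turn.
>>> unmask_assistant_spans([7, 8, 1, 2, 5, 6, 9, 3], response_ids=[1, 2], eot_ids=[9])
[-100, -100, -100, -100, 5, 6, 9, -100]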
- Parameters:
response_template – String or token IDs marking the start of an assistant response. Required for all modes.
instruction_template – String or token IDs marking the start of a user instruction. Legacy — only used with the instruction+response fallback path.
train_target – One of "all_assistant_turns", "final_assistant_turn", "_legacy_instruction_response". Resolved by the builder before construction.
end_of_turn_template – String or token IDs marking the end of a conversational turn. Required for ``all_assistant_turns`` mode.
mlm – Whether to use masked language modeling. Default False.
ignore_index – Label value for masked tokens. Default -100.
padding_free – Remove padding and add position_ids. Default False.
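A hedged construction sketch. The template strings are illustrative, and passing the tokenizer through ``**kwargs`` to the DataCollatorForLanguageModeling base is an assumption based on the signature above:

>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.collators.trl_data_collator_for_completion_only_lm import (
...     DataCollatorForCompletionOnlyLM,
... )
>>> tokenizer = build_tokenizer(ModelParams(model_name="microsoft/Phi-3-mini-4k-instruct"))
>>> collator = DataCollatorForCompletionOnlyLM(
...     response_template="<|assistant|>",
...     end_of_turn_template="<|end|>",
...     train_target="all_assistant_turns",
...     tokenizer=tokenizer,  # assumed to be consumed by the base collator
... )
>>> out = collator([tokenizer("<|user|>Hi<|end|><|assistant|>Hello!<|end|>")])
>>> # out["labels"] matches out["input_ids"] inside assistant spans, -100 elsewhere.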
oumi.core.collators.vision_language_collator_with_padding module#
- class oumi.core.collators.vision_language_collator_with_padding.VisionLanguageCollatorWithPadding(tokenizer: PreTrainedTokenizerBase, *, max_length: int | None, truncation: bool = False, label_ignore_index: int | None = None, allow_multi_image_inputs: bool = True, main_image_feature: str = 'pixel_values', debug: bool = False)[source]#
Bases: object
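A minimal construction sketch. Per the signature, ``max_length`` is keyword-only and required; the per-item feature schema noted in the comment is an assumption, since this collator expects pre-processed features rather than raw conversations:

>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.collators.vision_language_collator_with_padding import (
...     VisionLanguageCollatorWithPadding,
... )
>>> tokenizer = build_tokenizer(ModelParams(model_name="llava-hf/llava-1.5-7b-hf"))
>>> collator = VisionLanguageCollatorWithPadding(
...     tokenizer=tokenizer,
...     max_length=1024,
...     truncation=True,
...     label_ignore_index=-100,
... )
>>> # Assumed per-item schema: tokenized text fields plus image features stored
>>> # under main_image_feature (default "pixel_values").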
oumi.core.collators.vision_language_sft_collator module#
Vision-Language SFT collator for conversation-based multimodal training.
This module provides a collator specifically designed for supervised fine-tuning (SFT) of vision-language models using conversation data.
Unlike VisionLanguageCollatorWithPadding, which expects pre-processed features, this collator works with raw conversation objects and handles the complete feature generation pipeline.
Example
>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.collators.vision_language_sft_collator import VisionLanguageSftCollator
>>> tokenizer = build_tokenizer(ModelParams(model_name="llava-hf/llava-1.5-7b-hf"))
>>> collator = VisionLanguageSftCollator(
... tokenizer=tokenizer,
... processor_name="llava-hf/llava-1.5-7b-hf",
... max_length=512,
... truncation=True
... )
>>> # Expects batch items with conversation_json field
>>> batch = collator([{"conversation_json": conversation1.to_json()}, ...])
- class oumi.core.collators.vision_language_sft_collator.VisionLanguageSftCollator(tokenizer: PreTrainedTokenizerBase, processor_name: str, *, processor_kwargs: dict[str, Any] | None = None, max_length: int | None = None, truncation: bool = False, truncation_side: str = 'right', label_ignore_index: int | None = None, allow_multi_image_inputs: bool = True, trust_remote_code: bool = False, train_on_completions_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, process_individually: bool = False)[source]#
Bases: object
Collator for vision-language SFT that processes conversation data.
This collator is designed for supervised fine-tuning of vision-language models where training data comes in the form of conversations containing both text and images. It handles the complete pipeline from raw conversations to model-ready tensor batches.
Key Features:
- Processes Conversation objects containing text and image data
- Uses model-specific processors to extract image features
- Handles tokenization and feature generation in one step
- Supports various vision-language architectures
- Manages padding, truncation, and label masking
The collator expects batch items with a “conversation_json” field containing serialized Conversation objects. These conversations can include:
- Multiple turns of dialogue
- Image references (paths, URLs, or base64 data)
- System prompts and user/assistant messages
- __call__(batch) → dict[str, Any][source]#
Process a batch of conversation data into model-ready features.
This method converts serialized conversations into the tensor format expected by vision-language models. It handles the complete pipeline:
1. Deserializes conversation JSON strings
2. Passes conversations to the feature generator
3. Returns batched tensors ready for training
- Parameters:
batch –
List of dictionaries, where each dictionary must contain a “conversation_json” field with a serialized Conversation object.
Expected format:
[ {"conversation_json": '{"messages": [...], "images": [...]}'}, {"conversation_json": '{"messages": [...], "images": [...]}'}, ... ]
The conversation JSON should include:
- messages: List of message dictionaries with role and content
- images: Optional list of image data (paths, URLs, or base64)
- Returns:
- "input_ids": Token IDs including image placeholders
- "attention_mask": Attention masks for the input
- "labels": Target labels with appropriate masking
- "pixel_values" or model-specific image features
- Additional model-specific features (cross_attention_mask, etc.)
The exact keys depend on the model architecture and processor used.
- Return type:
Dictionary containing all features needed for model training
- Raises:
ValueError – If batch is empty or any item lacks “conversation_json” field.
Example:
>>> conversation = Conversation(messages=[
...     {"role": "user", "content": "What's in this image?"},
...     {"role": "assistant", "content": "I see a cat."}
... ], images=["path/to/image.jpg"])
>>> batch_item = {"conversation_json": conversation.to_json()}
>>> features = collator([batch_item])
>>> print(features.keys())
dict_keys(['input_ids', 'attention_mask', 'labels', 'pixel_values'])