Supervised Fine-Tuning#
Supervised Fine-Tuning (SFT) is the most common approach for adapting a pre-trained language model to specific downstream tasks. This involves fine-tuning the model’s parameters on a labeled dataset of input-output pairs, effectively teaching the model to perform the desired task.
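To make the idea of input-output pairs concrete, here is a minimal, framework-agnostic sketch (plain Python, not Oumi code) of how one labeled example might be rendered into a single training string. The prompt template is an illustrative assumption; real SFT pipelines use a model-specific chat template.

```python
# A labeled SFT example: an input (prompt) paired with the desired output.
example = {
    "input": "Translate to French: Hello, world!",
    "output": "Bonjour, le monde !",
}


def render_pair(example: dict) -> str:
    # Render an input-output pair into one training string.
    # The "### Instruction/### Response" template is illustrative only.
    return (
        f"### Instruction:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )


print(render_pair(example))
```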
This guide covers how to use SFT datasets in Oumi OSS.
SFT Datasets#
Out of the box, we support multiple popular SFT datasets:
| Name | Description | Reference |
|---|---|---|
| AlpacaDataset | In-memory dataset for SFT data. | |
| ArgillaDollyDataset | Dataset class for the Databricks Dolly 15k curated dataset. | |
| ArgillaMagpieUltraDataset | Dataset class for the argilla/magpie-ultra-v0.1 dataset. | |
| AyaDataset | Dataset class for the CohereForAI/aya_dataset dataset. | |
| ChatRAGBenchDataset | In-memory dataset for SFT data. | |
| ChatqaDataset | In-memory dataset for SFT data. | |
| ChatqaTatqaDataset | ChatQA subclass to handle tatqa subsets. | |
| CoALMDataset | Dataset class for the UIUC CoALM dataset. | |
| HuggingFaceDataset | Converts HuggingFace Datasets with messages to Oumi Message format. | |
| MagpieProDataset | Dataset class for the Magpie-Align/Llama-3-Magpie-Pro-1M-v0.1 dataset. | |
| OpenO1SFTDataset | Synthetic reasoning SFT dataset. | |
| PromptResponseDataset | Converts HuggingFace Datasets with input/output columns to Message format. | |
| TextSftJsonLinesDataset | Dataset class for loading SFT data in oumi and alpaca formats. | |
| Tulu3MixtureDataset | In-memory dataset for SFT data. | |
| UltrachatH4Dataset | Dataset class for the HuggingFaceH4/ultrachat_200k dataset. | |
| WildChatDataset | Dataset class for the allenai/WildChat-1M dataset. | |
Usage#
Configuration#
To use a specific SFT dataset in your Oumi OSS configuration, specify it in the TrainingConfig.
Here’s an example:
```yaml
training:
  data:
    train:
      datasets:
        - dataset_name: your_sft_dataset_name
          split: train
          stream: false
      collator_name: text_with_padding
```
In this configuration:
- `dataset_name` specifies the name of your SFT dataset
- `split` selects a specific dataset split (e.g., train, validation, test)
- `stream` enables streaming mode for large datasets
- `collator_name` specifies the collator to use for batching
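As a quick sanity check, the YAML above corresponds to the following nested structure. The sketch below models it with plain Python dicts for illustration only; Oumi actually parses the YAML into typed config classes such as `TrainingConfig`.

```python
# Plain-dict model of the training config above (illustrative only;
# Oumi parses the YAML into typed config classes, not raw dicts).
config = {
    "training": {
        "data": {
            "train": {
                "datasets": [
                    {
                        "dataset_name": "your_sft_dataset_name",
                        "split": "train",
                        "stream": False,
                    }
                ],
                "collator_name": "text_with_padding",
            }
        }
    }
}

# The per-split section holds both the dataset list and the collator.
train = config["training"]["data"]["train"]
assert train["datasets"][0]["split"] == "train"
assert train["collator_name"] == "text_with_padding"
```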
Python API#
To use a specific SFT dataset in your code, you can use the build_dataset() function:
```python
from oumi.builders import build_dataset
from oumi.core.configs import DatasetSplit
from torch.utils.data import DataLoader

# Assume you have your tokenizer initialized
tokenizer = ...

# Build the dataset
dataset = build_dataset(
    dataset_name="your_sft_dataset_name",
    tokenizer=tokenizer,
    dataset_split=DatasetSplit.TRAIN,
)

loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Now you can use the dataset in your training loop
for batch in loader:
    # Process your batch
    ...
```
Adding a New SFT Dataset#
All SFT datasets in Oumi OSS are subclasses of BaseSftDataset.
To add a new SFT dataset:

1. Subclass `BaseSftDataset`.
2. Implement the `transform_conversation()` method to define the dataset-specific transformation logic.
3. Register your new dataset class so it can be referenced by name, e.g. with the `@register_dataset` decorator.
For example:
```python
from typing import Any, Dict

from oumi.core.configs import DatasetSplit, TrainingConfig
from oumi.core.datasets import BaseSftDataset
from oumi.core.registry import register_dataset
from oumi.core.tokenizers import BaseTokenizer
from oumi.core.types.conversation import Conversation, Message, Role


@register_dataset("custom_sft_dataset")
class CustomSftDataset(BaseSftDataset):
    def __init__(
        self,
        config: TrainingConfig,
        tokenizer: BaseTokenizer,
        dataset_split: DatasetSplit,
    ):
        super().__init__(config, tokenizer, dataset_split)
        # Initialize your dataset here

    def transform_conversation(self, example: Dict[str, Any]) -> Conversation:
        # Transform the raw example into a Conversation object.
        # 'example' represents one row of the raw dataset.
        # Structure of 'example':
        # {
        #     'input': str,   # The user's input or question
        #     'output': str,  # The assistant's response
        # }
        conversation = Conversation(
            messages=[
                Message(role=Role.USER, content=example["input"]),
                Message(role=Role.ASSISTANT, content=example["output"]),
            ]
        )
        return conversation
```
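Once registered under the name `custom_sft_dataset`, the class can be referenced from a training config just like the built-in datasets. A sketch following the configuration format shown earlier:

```yaml
training:
  data:
    train:
      datasets:
        - dataset_name: custom_sft_dataset
          split: train
```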
Tip
For more advanced SFT dataset implementations, explore the oumi.datasets module, which contains implementations of several open source datasets.
Using an Unregistered Dataset Whose Format is Identical to a Registered Dataset#
Many datasets on Hugging Face share the same format as datasets already registered in Oumi OSS. You don't need to register each such dataset explicitly; instead, you can override the dataset_name parameter with a keyword argument, as shown in the snippet below.
```yaml
- dataset_name: registered_hf_dataset_with_compatible_class
  dataset_kwargs:
    - dataset_name_override: hf_dataset_with_data_to_use
```
NOTE: This feature is experimental, and we expect it to change in a future release.
Using Custom Datasets via the CLI#
See Customizing Oumi OSS to quickly enable your dataset when using the CLI.