[AIR] HuggingFaceTrainer&Predictor implementation #23876

Merged: 89 commits from hf_trainer_implementation into master, Apr 29, 2022
Changes shown below are from 15 of the 89 commits.

Commits (89)
b72c722
WIP
Yard1 Apr 7, 2022
9a0b41e
WIP
Yard1 Apr 7, 2022
5ca35b0
WIP
Yard1 Apr 7, 2022
f8e153a
WIP
Yard1 Apr 7, 2022
240e661
Make datasets arg mandatory
Yard1 Apr 7, 2022
196905a
WIP
Yard1 Apr 7, 2022
d73991e
Merge branch 'master' into hf_trainer_implementation
Yard1 Apr 11, 2022
f6e9daf
Add docs
Yard1 Apr 11, 2022
55633ef
WIP
Yard1 Apr 11, 2022
e94d55d
Merge branch 'master' into hf_trainer_implementation
Yard1 Apr 12, 2022
587f8ad
Merge branch 'master' into hf_trainer_implementation
Yard1 Apr 12, 2022
3152272
HuggingFaceTrainer
Yard1 Apr 12, 2022
7b19a01
Add basic example
Yard1 Apr 12, 2022
0a0223e
Remove notebook
Yard1 Apr 12, 2022
ec83f90
Doc
Yard1 Apr 12, 2022
66cef4a
Doc
Yard1 Apr 12, 2022
347ba0b
Better example
Yard1 Apr 13, 2022
c58ec5f
Lint
Yard1 Apr 13, 2022
5d252d8
Typo fix
Yard1 Apr 13, 2022
22b7812
Merge branch 'master' into hf_trainer_implementation
Yard1 Apr 13, 2022
0fc3a04
_checkpoint_manager_cls
Yard1 Apr 13, 2022
28ae105
Improve checkpointing
Yard1 Apr 13, 2022
3263bd4
cleanup checkpoint
Yard1 Apr 13, 2022
35cc359
Sort imports
Yard1 Apr 13, 2022
7920c22
Remove monkey patching for callbacks
Yard1 Apr 13, 2022
552d0fe
Move to utils
Yard1 Apr 13, 2022
ac2bd18
Lint fix, check transformers version
Yard1 Apr 13, 2022
af271cf
Bump transformers version in requirements
Yard1 Apr 13, 2022
542e007
Fix checkpoint loading
Yard1 Apr 13, 2022
04a6d1d
Add `HuggingFacePredictor`
Yard1 Apr 13, 2022
d4f98cf
Fix predictor columns
Yard1 Apr 13, 2022
e9ba551
Address some comments
Yard1 Apr 14, 2022
d55f843
add tune checkpoint id
Yard1 Apr 14, 2022
e36bfce
Update python/ray/ml/utils/huggingface_utils.py
Yard1 Apr 14, 2022
0d7f14d
Use an external func
Yard1 Apr 14, 2022
890f84e
Rename huggingface_basic_language_modelling_example.py to huggingface…
Yard1 Apr 14, 2022
aa9e0e2
Merge branch 'master' into hf_trainer_implementation
Yard1 Apr 14, 2022
fd815f6
Do not override __new__
Yard1 Apr 14, 2022
3de7302
Merge branch 'master' into hf_trainer_implementation
Yard1 Apr 19, 2022
d020fca
HuggingFacePredictor inherits from TorchPredictor
Yard1 Apr 19, 2022
9cdadb7
Move utils to hf folder
Yard1 Apr 19, 2022
13ac55f
Inheritance tweak
Yard1 Apr 19, 2022
7f3403a
Doc fix
Yard1 Apr 19, 2022
c3d8050
Improve tensorize
Yard1 Apr 19, 2022
d4851e4
Fix docs
Yard1 Apr 19, 2022
3185009
Stack after all
Yard1 Apr 19, 2022
814b889
Add tests, predictor work
Yard1 Apr 19, 2022
d49739f
Merge branch 'master' into hf_trainer_implementation
Yard1 Apr 20, 2022
25d3bda
Lint
Yard1 Apr 20, 2022
8a74340
Make tests work
Yard1 Apr 20, 2022
97d99ac
Add n>1 gpus warning
Yard1 Apr 20, 2022
2188e40
Raise exception instead of warning
Yard1 Apr 20, 2022
6028d4f
Add predictor doc
Yard1 Apr 20, 2022
95c792e
Merge branch 'master' into hf_trainer_implementation
Yard1 Apr 21, 2022
7699f79
Bump train requirements
Yard1 Apr 21, 2022
e852959
Lint
Yard1 Apr 21, 2022
a66da31
Add more mocks to docs
Yard1 Apr 21, 2022
f2627bb
Put data into file
Yard1 Apr 21, 2022
5617524
Add predictor test, fix small issues
Yard1 Apr 21, 2022
5f7fda1
CI fixes
Yard1 Apr 22, 2022
bdc7387
Fix docs
Yard1 Apr 22, 2022
a948482
Merge branch 'master' into hf_trainer_implementation
Yard1 Apr 25, 2022
6302ae4
Merge branch 'master' into hf_trainer_implementation
Yard1 Apr 26, 2022
3ef8d1b
Expand predictor
Yard1 Apr 26, 2022
c0da17e
Apply suggestions from code review
Yard1 Apr 26, 2022
8c7a6eb
WIP
Yard1 Apr 27, 2022
2d5f94e
Complete refactor
Yard1 Apr 27, 2022
8f085b4
Clarify
Yard1 Apr 27, 2022
2126556
Remove shuffle mention from docstring
Yard1 Apr 27, 2022
89e5d55
Merge branch 'master' into hf_trainer_implementation
Yard1 Apr 28, 2022
0abab40
Doc fix
Yard1 Apr 28, 2022
3c69367
Upgrade torch version
Yard1 Apr 28, 2022
dc4fb41
Update requirements_ml_docker.txt
Yard1 Apr 28, 2022
2c83878
Update requirements_dl.txt
Yard1 Apr 28, 2022
58d8136
Revert
Yard1 Apr 28, 2022
790a31e
Revert
Yard1 Apr 28, 2022
3ed8551
Update huggingface_basic_language_modeling_example.py
Yard1 Apr 28, 2022
3bc93e4
Update huggingface_basic_language_modeling_example.py
Yard1 Apr 28, 2022
e48794c
Update huggingface_basic_language_modeling_example.py
Yard1 Apr 28, 2022
79ad5b4
Merge branch 'ray-project:master' into hf_trainer_implementation
Yard1 Apr 28, 2022
42d99dd
Update custom_directives.py
Yard1 Apr 29, 2022
cf322b2
Update custom_directives.py
Yard1 Apr 29, 2022
27f1d70
Merge branch 'ray-project:master' into hf_trainer_implementation
Yard1 Apr 29, 2022
1757ff1
Merge branch 'ray-project:master' into hf_trainer_implementation
Yard1 Apr 29, 2022
71d2f1b
Better checkpoint detection
Yard1 Apr 29, 2022
3b45939
Apply suggestions from code review
Yard1 Apr 29, 2022
b26f67f
Add context
Yard1 Apr 29, 2022
2c8f48c
Load huggingface checkpoint to staticmethod
Yard1 Apr 29, 2022
b549dbe
Update python/ray/ml/examples/huggingface/huggingface_basic_language_…
amogkam Apr 29, 2022
Files changed
4 changes: 4 additions & 0 deletions doc/source/ray-air/getting-started.rst
@@ -60,6 +60,10 @@ Trainer
     :members:
     :show-inheritance:
 
+.. automodule:: ray.ml.train.integrations.huggingface
+    :members:
+    :show-inheritance:
+
 .. automodule:: ray.ml.train.integrations.sklearn
     :members:
     :show-inheritance:
22 changes: 17 additions & 5 deletions python/ray/data/dataset.py
@@ -26,6 +26,7 @@
     import ray.util.sgd
     import torch
     import tensorflow as tf
+    import torch.utils.data
     from ray.data.dataset_pipeline import DatasetPipeline
     from ray.data.grouped_dataset import GroupedDataset
 
@@ -302,7 +303,8 @@ def transform(block: Block) -> Iterable[Block]:
             ):
                 raise ValueError(
                     "The map batches UDF returned the value "
-                    f"{applied}, which is not allowed. "
+                    f"{applied} of type {type(applied)}, "
+                    "which is not allowed. "
                     "The return type must be either list, "
                     "pandas.DataFrame, or pyarrow.Table"
                 )
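
For context, a minimal sketch (my illustration, not part of this PR) of a map_batches call that would trigger the improved error; including type(applied) in the message makes the failure easier to diagnose:

import ray

ds = ray.data.range(4)
# The UDF must return a list, pandas.DataFrame, or pyarrow.Table.
# Returning a bare int now produces an error that reports both the
# value and its type, e.g. "... 123 of type <class 'int'> ...".
ds.map_batches(lambda batch: 123)  # raises ValueError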
@@ -2072,6 +2074,7 @@ def to_torch(
         prefetch_blocks: int = 0,
         drop_last: bool = False,
         unsqueeze_label_tensor: bool = True,
+        unsqueeze_feature_tensors: bool = True,
     ) -> "torch.utils.data.IterableDataset":
         """Return a Torch IterableDataset over this dataset.
 
@@ -2145,6 +2148,10 @@
             be left as is, that is (N, ). In general, regression loss
             functions expect an unsqueezed tensor, while classification
             loss functions expect a squeezed one. Defaults to True.
+            unsqueeze_feature_tensors (bool): If set to True, the feature tensors
+                will be unsqueezed (reshaped to (N, 1)) before being concatenated into
+                the final features tensor. Otherwise, they will be left as is, that is
+                (N, ). Defaults to True.
 
         Returns:
             A torch IterableDataset.
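
A minimal usage sketch of the new flag (my illustration, not code from this PR; assumes a small in-memory dataset):

import ray

ds = ray.data.from_items([{"x": float(i), "y": 2.0 * i} for i in range(8)])
torch_ds = ds.to_torch(
    label_column="y",
    batch_size=4,
    unsqueeze_label_tensor=False,     # labels keep shape (N,)
    unsqueeze_feature_tensors=False,  # each feature column keeps shape (N,)
)
for features, labels in torch_ds:
    print(features.shape, labels.shape)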
@@ -2196,10 +2203,13 @@ def make_generator():
                 drop_last=drop_last,
             ):
                 if label_column:
-                    label_vals = batch.pop(label_column).values
-                    label_tensor = torch.as_tensor(label_vals, dtype=label_column_dtype)
-                    if unsqueeze_label_tensor:
-                        label_tensor = label_tensor.view(-1, 1)
+                    label_tensor = convert_pandas_to_torch_tensor(
+                        batch,
+                        [label_column],
+                        label_column_dtype,
+                        unsqueeze=unsqueeze_label_tensor,
+                    )
+                    batch.pop(label_column)
                 else:
                     label_tensor = None
 
@@ -2211,6 +2221,7 @@
                         feature_column_dtypes[key]
                         if isinstance(feature_column_dtypes, dict)
                         else feature_column_dtypes,
+                        unsqueeze=unsqueeze_feature_tensors,
                     )
                     for key in feature_columns
                 }
@@ -2219,6 +2230,7 @@
                     batch,
                     columns=feature_columns,
                     column_dtypes=feature_column_dtypes,
+                    unsqueeze=unsqueeze_feature_tensors,
                 )
 
                 yield (features_tensor, label_tensor)
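
The unsqueeze flag threaded through above boils down to a reshape; a plain-torch illustration (not code from this PR):

import torch

t = torch.as_tensor([0.5, 1.5, 2.5])
print(t.shape)              # torch.Size([3])    -- what unsqueeze=False keeps, (N,)
print(t.view(-1, 1).shape)  # torch.Size([3, 1]) -- what unsqueeze=True produces, (N, 1)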
94 changes: 94 additions & 0 deletions python/ray/ml/examples/huggingface_example.py
@@ -0,0 +1,94 @@
# Based on
# huggingface/notebooks/examples/language_modeling_from_scratch.ipynb

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
)

import ray
import ray.data
from ray.ml.train.integrations.huggingface import HuggingFaceTrainer

model_checkpoint = "gpt2"
tokenizer_checkpoint = "sgugger/gpt2-like-tokenizer"

# block_size = tokenizer.model_max_length
block_size = 128


def get_dataset():
    datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)

    def tokenize_function(examples):
        return tokenizer(examples["text"])

    tokenized_datasets = datasets.map(
        tokenize_function, batched=True, num_proc=1, remove_columns=["text"]
    )

    def group_texts(examples):
        # Concatenate all texts.
        concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        # We drop the small remainder; we could add padding instead if the model
        # supported it. You can customize this part to your needs.
        total_length = (total_length // block_size) * block_size
        # Split by chunks of max_len.
        result = {
            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result

    lm_datasets = tokenized_datasets.map(
        group_texts,
        batched=True,
        batch_size=1000,
        num_proc=1,
    )
    return lm_datasets


lm_dataset = get_dataset()
ray_train = ray.data.from_arrow(lm_dataset["train"]._data.table)
ray_validation = ray.data.from_arrow(lm_dataset["validation"]._data.table)


def train_function(train_dataset, eval_dataset=None, **config):
    model_config = AutoConfig.from_pretrained(model_checkpoint)
    model = AutoModelForCausalLM.from_config(model_config)
    print("Initializing TrainingArguments...")
    training_args = TrainingArguments(
        f"{model_checkpoint}-wikitext2",
        evaluation_strategy="epoch",
        num_train_epochs=2,
        learning_rate=2e-5,
        weight_decay=0.01,
        disable_tqdm=True,
        save_strategy="epoch",
    )
    print("Initializing Trainer...")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    print("Trainer initialized! Starting training...")
    return trainer


trainer = HuggingFaceTrainer(
    trainer_init_per_worker=train_function,
    scaling_config={"num_workers": 2, "use_gpu": False},
    datasets={"train": ray_train.limit(16), "evaluation": ray_validation.limit(8)},
)
results = trainer.fit()
print(results.metrics)
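
The PR also adds a HuggingFacePredictor (see the commit log above). A hypothetical follow-up sketch; the import path and constructor signature below are assumptions, not taken from this diff:

# Hypothetical sketch -- names below are assumptions, not from this diff.
import pandas as pd

from ray.ml.predictors.integrations.huggingface import HuggingFacePredictor

predictor = HuggingFacePredictor.from_checkpoint(results.checkpoint)
predictions = predictor.predict(pd.DataFrame({"text": ["the quick brown fox"]}))
print(predictions)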
5 changes: 4 additions & 1 deletion python/ray/ml/train/data_parallel_trainer.py
@@ -249,6 +249,9 @@ def _validate_train_loop_per_worker(
                 f"but it accepts {num_params} arguments instead."
             )
 
+    def _get_checkpoint_manager(self) -> TuneCheckpointManager:
+        return _DataParallelCheckpointManager()
+
     def training_loop(self) -> None:
         scaling_config_dataclass = ScalingConfigDataClass(**self.scaling_config)
 
@@ -271,7 +274,7 @@ def training_loop(self) -> None:
             max_retries=0,
         )
 
-        checkpoint_manager = _DataParallelCheckpointManager()
+        checkpoint_manager = self._get_checkpoint_manager()
         checkpoint_manager.on_init(preprocessor=self.preprocessor)
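
The new _get_checkpoint_manager hook exists so a subclass can substitute its own checkpoint manager; a hypothetical sketch of how a trainer such as HuggingFaceTrainer might use it (class names and import paths below are my assumptions for the ray.ml layout of this era, not code from this PR):

# Hypothetical illustration of the new hook -- not code from this PR.
from ray.ml.train.data_parallel_trainer import DataParallelTrainer
from ray.train.checkpoint import TuneCheckpointManager


class _MyCheckpointManager(TuneCheckpointManager):
    """Custom checkpointing behavior, e.g. HuggingFace-style checkpoint dirs."""


class MyTrainer(DataParallelTrainer):
    def _get_checkpoint_manager(self) -> TuneCheckpointManager:
        # training_loop() picks up this manager via the new hook.
        return _MyCheckpointManager()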