support packed dataset
Signed-off-by: Chen Cui <[email protected]>

support packed dataset

Signed-off-by: Chen Cui <[email protected]>

[Codec] Finite scalar quantizer (NVIDIA#7886)

* Finite scalar quantizer

Signed-off-by: Ante Jukić <[email protected]>

* Updated test

Signed-off-by: Ante Jukić <[email protected]>

---------

Signed-off-by: Ante Jukić <[email protected]>

upgrade to latest mcore and TE (NVIDIA#7908)

* reimport module

Signed-off-by: dimapihtar <[email protected]>

* update mcore and TE commits

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>

Tar codec (NVIDIA#7867)

added missing torch import (NVIDIA#7913)

Signed-off-by: David Mosallanezhad <[email protected]>

add cpu init check (NVIDIA#7889)

Signed-off-by: Chen Cui <[email protected]>

Fix pinned triton version (NVIDIA#7925)

* Fix pinned triton version

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Change README

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove flash-attn in Dockerfile

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Revert

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

fix tp_overlap config var name (NVIDIA#7928)

Signed-off-by: Xiaowei Ren <[email protected]>

add Dutch P&C FC model info (NVIDIA#7892)

* add Dutch P&C FC model info

Signed-off-by: zhehuaichen <[email protected]>

* update order of the results

Signed-off-by: zhehuaichen <[email protected]>

---------

Signed-off-by: zhehuaichen <[email protected]>

fix issues with convert_nemo_llama_to_hf.py (NVIDIA#7922)

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

fix collate_fn bug for TP > 1

Signed-off-by: Chen Cui <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

make packed dataset work

Signed-off-by: Chen Cui <[email protected]>

fix nan bug

Signed-off-by: Chen Cui <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

support answer only loss

Signed-off-by: Chen Cui <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

account for padding in cu_seqlens during dataloading for attn kernel

Signed-off-by: Chen Cui <[email protected]>

fix path for answer_only_loss = false

Signed-off-by: Chen Cui <[email protected]>
cuichenx authored and erhoo82 committed Dec 2, 2023
1 parent eba699d commit 44cb7fd
Showing 19 changed files with 1,025 additions and 90 deletions.
6 changes: 2 additions & 4 deletions Dockerfile
@@ -47,7 +47,7 @@ WORKDIR /workspace/
# We leave it here in case we need to work off of a specific commit in main
RUN git clone https://github.com/NVIDIA/Megatron-LM.git && \
cd Megatron-LM && \
git checkout 4c7a0251ae7c234a4ca3f02327330235d8d35028 && \
git checkout e122536b7645edcb7ebf099b5c92a443f7dbf8e7 && \
pip install .

# Distributed Adam support for multiple dtypes
@@ -85,10 +85,8 @@ WORKDIR /tmp/nemo
COPY requirements .
RUN for f in $(ls requirements*.txt); do pip3 install --disable-pip-version-check --no-cache-dir -r $f; done

# install flash attention dependencies
# install flash attention
RUN pip install flash-attn
# pinned triton version for flash-attention https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py#L3
RUN pip install triton==2.0.0.dev20221202
# install numba for latest containers
RUN pip install numba>=0.57.1

4 changes: 2 additions & 2 deletions Jenkinsfile
@@ -64,7 +64,7 @@ pipeline {
steps {
sh 'git clone https://github.com/NVIDIA/TransformerEngine.git && \
cd TransformerEngine && \
git fetch origin 8eae4ce2b8fdfbbe525fc8bfecb0df5498cc9687 && \
git fetch origin e6676c53f26f6ef072943c909d136cf2a39c1d90 && \
git checkout FETCH_HEAD && \
git submodule init && git submodule update && \
NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install .'
@@ -75,7 +75,7 @@ pipeline {
steps {
sh 'git clone https://github.com/NVIDIA/Megatron-LM.git && \
cd Megatron-LM && \
git checkout 4c7a0251ae7c234a4ca3f02327330235d8d35028 && \
git checkout e122536b7645edcb7ebf099b5c92a443f7dbf8e7 && \
pip install -e .'
}
}
2 changes: 1 addition & 1 deletion README.rst
@@ -319,7 +319,7 @@ Transformer Engine requires PyTorch to be built with CUDA 11.8.

Flash Attention
~~~~~~~~~~~~~~~~~~~~
Transformer Engine already supports Flash Attention for GPT models. If you want to use Flash Attention for non-causal models or use with attention bias (introduced from position encoding, e.g. Alibi), please install `flash-attn <https://github.com/HazyResearch/flash-attention>`_.
Transformer Engine already supports Flash Attention for GPT models. If you want to use Flash Attention for non-causal models, please install `flash-attn <https://github.com/HazyResearch/flash-attention>`_. If you want to use Flash Attention with attention bias (introduced by position encodings, e.g. Alibi), please also install the pinned Triton version noted in the `implementation <https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/flash_attn_triton.py#L3>`_.

.. code-block:: bash
2 changes: 2 additions & 0 deletions docs/source/asr/data/scores/nl/fastconformer_nl.csv
@@ -0,0 +1,2 @@
Model Name,Language,MCV Test-Set v12.0 (nl),MLS Test (nl)
stt_nl_fastconformer_hybrid_large_pc,nl,9.2 %,12.1 %
2 changes: 2 additions & 0 deletions docs/source/asr/data/scores_pc/nl/fastconformer_nl.csv
@@ -0,0 +1,2 @@
Model Name,Language,MCV Test-Set v12.0 (nl),MLS Test (nl)
stt_nl_fastconformer_hybrid_large_pc,nl,32.1 %,25.1 %
24 changes: 21 additions & 3 deletions docs/source/asr/scores.rst
@@ -278,6 +278,16 @@ KAB

--------------------

NL
^^

.. csv-table::
:header-rows: 1
:align: left
:file: data/scores/nl/fastconformer_nl.csv

--------------------

PL
^^

@@ -350,7 +360,6 @@ ZH
--------------------



Scores with Punctuation and Capitalization
------------------------------------------

@@ -414,6 +423,16 @@ IT with P&C

--------------------

NL with P&C
^^^^^^^^^^^

.. csv-table::
:header-rows: 1
:align: left
:file: data/scores_pc/nl/fastconformer_nl.csv

--------------------

PL with P&C
^^^^^^^^^^^

@@ -432,5 +451,4 @@ UA with P&C
:align: left
:file: data/scores_pc/ua/fastconformer_ua.csv

--------------------

--------------------
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py
@@ -448,3 +448,95 @@ def collate_fn(self, batch):
}

return processed_batch


class GPTSFTPackedDataset(GPTSFTDataset):
def __init__(self, file_path: str, tokenizer: TokenizerSpec, **kwargs):
super().__init__(file_path, tokenizer, **kwargs)

self._load_packed_dataset(file_path)

def __getitem__(self, idx):
input_ids = self.indexed_dataset[idx]['input_ids']
seq_boundaries = self.indexed_dataset[idx]['seq_start_id'] + [len(input_ids)]
loss_mask = self.indexed_dataset[idx]['loss_mask']
return {'input_ids': input_ids, 'seq_boundaries': seq_boundaries, 'loss_mask': loss_mask}

def __len__(self):
return len(self.indexed_dataset)

def _load_packed_dataset(self, file_path):
self.indexed_dataset = np.load(file_path, allow_pickle=True)

def _build_loss_mask(self, processed_example):
if self.answer_only_loss:
seq_boundaries = processed_example['seq_boundaries']
return np.concatenate(
[
processed_example['loss_mask'][seq_boundaries[i] + 1 : seq_boundaries[i + 1]]
for i in range(len(seq_boundaries) - 1)
]
)
return [1.0] * (len(processed_example['input_ids']) - len(processed_example['seq_boundaries']))

def _maybe_cast_to_list(self, x):
return [item.tolist() if isinstance(item, np.ndarray) else item for item in x]

def collate_fn(self, batch):
input_ids = [
np.concatenate(
[
item['input_ids'][item['seq_boundaries'][i] : item['seq_boundaries'][i + 1] - 1]
for i in range(len(item['seq_boundaries']) - 1)
]
)
for item in batch
]
labels = [
np.concatenate(
[
item['input_ids'][item['seq_boundaries'][i] + 1 : item['seq_boundaries'][i + 1]]
for i in range(len(item['seq_boundaries']) - 1)
]
)
for item in batch
]

loss_mask = [self._build_loss_mask(item) for item in batch]

max_length = max(len(l) for l in input_ids)
max_length = self._ceil_to_nearest(max_length, 16)

position_ids: List[List[int]] = []
cu_seqlens: List[List[int]] = []
for item in batch:
position_ids.append([])
cu_seqlens.append([0])
seqlens = np.array(item['seq_boundaries'][1:]) - np.array(item['seq_boundaries'][:-1])
for l in seqlens:
# length minus 1 because input_ids is truncated by 1 for labels
position_ids[-1].extend(list(range(l - 1)))
cu_seqlens[-1].append(cu_seqlens[-1][-1] + l - 1)
# set last seq to the max seq len because rope and attn kernels expect no padding
cu_seqlens[-1][-1] = max_length

assert len(input_ids[0]) == len(
position_ids[0]
), "Dataset problem: input_ids and position_ids lengths don't match"

cu_seqlens = self._collate_item(cu_seqlens, max_length=max(len(l) for l in cu_seqlens) + 1, pad_id=-1)
input_ids = self._collate_item(input_ids, max_length=max_length, pad_id=self.tokenizer.eos_id)
labels = self._collate_item(labels, max_length=max_length, pad_id=self.tokenizer.eos_id)
loss_mask = self._collate_item(loss_mask, max_length=max_length, pad_id=0)
position_ids = self._collate_item(position_ids, max_length=max_length, pad_id=0)

processed_batch = {
'tokens': torch.LongTensor(input_ids),
'labels': torch.LongTensor(labels),
'attention_mask': torch.LongTensor([1] * len(input_ids)), # no attention mask is needed for packed seq
'loss_mask': torch.LongTensor(loss_mask),
'position_ids': torch.LongTensor(position_ids),
'cu_seqlens': torch.IntTensor(cu_seqlens), # cu_seqlens_q must be in dtype torch.int32
}

return processed_batch
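
For context, the packed .npy record layout this class expects can be sketched from the fields read above ('input_ids', 'seq_start_id', 'loss_mask'); the token values, loss-mask pattern, and file name below are illustrative assumptions, not taken from the commit:

import numpy as np

# Two original SFT examples packed into a single row. The field names mirror what
# __getitem__ and _load_packed_dataset read; everything else is made up for illustration.
packed_rows = np.array(
    [
        {
            'input_ids': [101, 5, 6, 7, 102, 8, 9, 10, 11],  # tokens of both examples, concatenated
            'seq_start_id': [0, 5],                           # start offset of each packed sequence
            'loss_mask': [0, 0, 1, 1, 1, 0, 0, 1, 1],         # 1 on answer tokens, 0 on prompt tokens
        }
    ],
    dtype=object,
)
np.save('packed_sft_toy.npy', packed_rows)  # loadable via np.load(path, allow_pickle=True)

# For this row, __getitem__ derives seq_boundaries = seq_start_id + [len(input_ids)] = [0, 5, 9],
# which collate_fn then uses to rebuild per-sequence inputs, labels, and cu_seqlens.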
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py
@@ -76,7 +76,7 @@
try:
from megatron.core import InferenceParams, parallel_state
from megatron.core.models.gpt import GPTModel as MCoreGPTModel
from megatron.core.models.gpt.gpt_layer_specs import gpt_layer_with_transformer_engine_spec
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_with_transformer_engine_spec
from megatron.core.pipeline_parallel.schedules import get_forward_backward_func
from megatron.core.transformer.module import Float16Module as MCoreFloat16Module
from megatron.core.transformer.transformer_config import TransformerConfig
@@ -246,7 +246,7 @@ def __init__(self, cfg: DictConfig, trainer: Trainer):

if self.megatron_amp_O2:

if not self.with_distributed_adam:
if not self.with_distributed_adam and not self.cfg.get("use_cpu_initialization", False):
# Pre-allocate the model on GPU to have master parameters allocated on the same device with matching data type
if isinstance(self.model, list):
for module in self.model:
@@ -308,7 +308,7 @@ def model_provider_func(self, pre_process, post_process):
if self.mcore_gpt:
model = MCoreGPTModel(
config=self.transformer_config,
transformer_layer_spec=gpt_layer_with_transformer_engine_spec,
transformer_layer_spec=get_gpt_layer_with_transformer_engine_spec(),
vocab_size=self.cfg.get('override_vocab_size', self.padded_vocab_size),
max_sequence_length=self.cfg.get('encoder_seq_length', 512),
pre_process=pre_process,
@@ -835,6 +835,8 @@ def fwd_output_and_loss_func(dataloader_iter, model, checkpoint_activations_all_
required_keys.update(batch.keys())
else:
required_keys.add('attention_mask')
if 'cu_seqlens' in batch:
required_keys.add('cu_seqlens')
if parallel_state.is_pipeline_first_stage():
required_keys.update(('tokens', 'position_ids'))
if parallel_state.is_pipeline_last_stage():
@@ -859,6 +861,15 @@ def fwd_output_and_loss_func(dataloader_iter, model, checkpoint_activations_all_
else:
# TODO: @eharper can we add this to mcore?
forward_args.pop('loss_mask')

if 'cu_seqlens' in batch: # packed sequence from GPTSFTPackedDataset
# these args are passed eventually into TEDotProductAttention.forward()
cu_seqlens = batch['cu_seqlens'].squeeze() # remove batch size dimension (mbs=1)
cu_seqlens = cu_seqlens[: torch.argmin(cu_seqlens)] # remove -1 "paddings" added in collate_fn
forward_args['cu_seqlens_q'] = cu_seqlens
forward_args['cu_seqlens_kv'] = cu_seqlens
forward_args['qkv_format'] = 'thd'

output_tensor = model(**forward_args)

def loss_func(output_tensor):
@@ -1585,7 +1596,7 @@ def build_transformer_config(self) -> TransformerConfig:
'recompute_method': recompute_method,
'recompute_num_layers': recompute_num_layers,
'distribute_saved_activations': False, # not currently used in NeMo
'ub_tp_comm_overlap': ub_tp_comm_overlap,
'tp_comm_overlap': ub_tp_comm_overlap,
'fp8': fp8,
}

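
To make the cu_seqlens handling in fwd_output_and_loss_func above concrete, here is a small self-contained sketch of the trimming step; the tensor values are illustrative, not taken from the commit:

import torch

# collate_fn pads each row of cu_seqlens with -1 so all rows in the batch share one length.
# Valid entries are non-negative and non-decreasing, so the first -1 is the minimum and
# torch.argmin() marks where the padding begins.
cu_seqlens = torch.IntTensor([0, 9, 31, 48, -1, -1])
cu_seqlens = cu_seqlens[: torch.argmin(cu_seqlens)]
print(cu_seqlens)  # tensor([ 0,  9, 31, 48], dtype=torch.int32)

# The trimmed offsets are then passed as cu_seqlens_q / cu_seqlens_kv with qkv_format='thd',
# which lets the attention kernel treat the packed row as independent sequences.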
nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py
@@ -26,7 +26,7 @@
)
from nemo.collections.nlp.data.language_modeling.megatron.blendable_dataset import BlendableDataset
from nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_chat_dataset import GPTSFTChatDataset
from nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_dataset import GPTSFTDataset
from nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_dataset import GPTSFTDataset, GPTSFTPackedDataset
from nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers import (
MegatronPretrainingBatchSampler,
)
@@ -208,6 +208,7 @@ def setup(self, stage=None):
self.setup_complete = True

def _build_dataset(self, data_cfg, is_train=True):
packed_sequence = data_cfg.get("packed_sequence", False)
datasets = []
# Determine if we are using a single dataset or a list of datasets.
is_list_config = isinstance(data_cfg.file_names, ListConfig)
@@ -263,6 +264,9 @@ def _build_dataset(self, data_cfg, is_train=True):
for file_path, num_samples in zip(data_cfg.file_names, num_train_samples_per_dataset):
if self.cfg.data.get("chat", False):
dataset_cls = GPTSFTChatDataset
elif packed_sequence:
dataset_cls = GPTSFTPackedDataset
assert data_cfg.micro_batch_size == 1, "Micro batch size must be 1 if using packed sequence"
else:
dataset_cls = GPTSFTDataset
dataset = dataset_cls(
@@ -274,7 +278,7 @@
add_eos=data_cfg.get('add_eos', True),
add_sep=data_cfg.get('add_sep', False),
sep_id=self.sep_id,
max_num_samples=num_samples[0],
max_num_samples=num_samples[0] if not packed_sequence else None,
seed=data_cfg.get('seed', 1234),
label_key=data_cfg.get('label_key', 'answer'),
answer_only_loss=self.cfg.get('answer_only_loss', True),
@@ -301,6 +305,8 @@ def _build_dataset(self, data_cfg, is_train=True):
)
datasets.append(dataset)
if is_train:
if packed_sequence:
num_train_samples_after_blend = sum(len(dataset) for dataset in datasets)
dataset = BlendableDataset(
datasets=datasets, weights=data_cfg.concat_sampling_probabilities, size=num_train_samples_after_blend
)
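
For reference, a minimal sketch of the dataset config knobs this branch reads in _build_dataset; the field names follow the data_cfg.get(...) calls above, but the fragment and the .npy path are illustrative assumptions, not a complete NeMo SFT config:

from omegaconf import OmegaConf

# Illustrative fragment only; 'packed_sft_toy.npy' is a hypothetical packed dataset file.
data_cfg = OmegaConf.create(
    {
        'file_names': ['packed_sft_toy.npy'],
        'packed_sequence': True,               # selects GPTSFTPackedDataset instead of GPTSFTDataset
        'micro_batch_size': 1,                 # asserted above when packed_sequence is enabled
        'concat_sampling_probabilities': [1.0],
    }
)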
@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import torch
import torch.nn.functional as F
from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
from megatron.core.fusions.fused_bias_gelu import bias_gelu_impl
17 changes: 14 additions & 3 deletions nemo/collections/nlp/modules/common/megatron/attention.py
@@ -65,11 +65,23 @@

HAVE_MEGATRON_CORE = False

try:
# Flash Attention Triton
import pkg_resources
from flash_attn.flash_attn_triton import flash_attn_func as flash_attn_func_triton

# pinned triton version for flash-attention triton https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py#L3
assert pkg_resources.get_distribution("triton").version == '2.0.0.dev20221202'

except (ImportError, ModuleNotFoundError, AssertionError):

flash_attn_func_triton = None


try:
# Flash Attention 1.X
from flash_attn.bert_padding import pad_input, unpad_input
from flash_attn.flash_attn_interface import flash_attn_unpadded_func
from flash_attn.flash_attn_triton import flash_attn_func as flash_attn_func_triton

HAVE_FLASH_ATTENTION = True
flash_attn_func = None
@@ -85,8 +97,7 @@
except (ImportError, ModuleNotFoundError):

HAVE_FLASH_ATTENTION = False

flash_attn_unpadded_func, flash_attn_func_triton, flash_attn_func = None, None, None
flash_attn_unpadded_func, flash_attn_func = None, None
unpad_input, pad_input = None, None

try: