support packed dataset
Signed-off-by: Chen Cui <[email protected]>

support packed dataset

Signed-off-by: Chen Cui <[email protected]>

[Codec] Finite scalar quantizer (NVIDIA#7886)

* Finite scalar quantizer

Signed-off-by: Ante Jukić <[email protected]>

* Updated test

Signed-off-by: Ante Jukić <[email protected]>

---------

Signed-off-by: Ante Jukić <[email protected]>

upgrade to latest mcore and TE (NVIDIA#7908)

* reimport module

Signed-off-by: dimapihtar <[email protected]>

* update mcore and TE commits

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>

Tar codec (NVIDIA#7867)

added missing torch import (NVIDIA#7913)

Signed-off-by: David Mosallanezhad <[email protected]>

add cpu init check (NVIDIA#7889)

Signed-off-by: Chen Cui <[email protected]>

Fix pinned triton version (NVIDIA#7925)

* Fix pinned triton version

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Change README

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove flash-attn in Dockerfile

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Revert

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

fix tp_overlap config var name (NVIDIA#7928)

Signed-off-by: Xiaowei Ren <[email protected]>

add Dutch P&C FC model info (NVIDIA#7892)

* add Dutch P&C FC model info

Signed-off-by: zhehuaichen <[email protected]>

* update order of the results

Signed-off-by: zhehuaichen <[email protected]>

---------

Signed-off-by: zhehuaichen <[email protected]>

fix issues with convert_nemo_llama_to_hf.py (NVIDIA#7922)

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

fix collate_fn bug for TP > 1

Signed-off-by: Chen Cui <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

make packed dataset work

Signed-off-by: Chen Cui <[email protected]>

fix nan bug

Signed-off-by: Chen Cui <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

support answer only loss

Signed-off-by: Chen Cui <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

account for padding in cu_seqlens during dataloading for attn kernel

Signed-off-by: Chen Cui <[email protected]>

fix path for answer_only_loss = false

Signed-off-by: Chen Cui <[email protected]>
cuichenx authored and erhoo82 committed Dec 2, 2023
1 parent eba699d commit 44cb7fd
Showing 19 changed files with 1,025 additions and 90 deletions.
6 changes: 2 additions & 4 deletions Dockerfile
@@ -47,7 +47,7 @@ WORKDIR /workspace/
# We leave it here in case we need to work off of a specific commit in main
RUN git clone https://github.com/NVIDIA/Megatron-LM.git && \
cd Megatron-LM && \
git checkout 4c7a0251ae7c234a4ca3f02327330235d8d35028 && \
git checkout e122536b7645edcb7ebf099b5c92a443f7dbf8e7 && \
pip install .

# Distributed Adam support for multiple dtypes
@@ -85,10 +85,8 @@ WORKDIR /tmp/nemo
COPY requirements .
RUN for f in $(ls requirements*.txt); do pip3 install --disable-pip-version-check --no-cache-dir -r $f; done

# install flash attention dependencies
# install flash attention
RUN pip install flash-attn
# pinned triton version for flash-attention https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py#L3
RUN pip install triton==2.0.0.dev20221202
# install numba for latest containers
RUN pip install numba>=0.57.1

4 changes: 2 additions & 2 deletions Jenkinsfile
@@ -64,7 +64,7 @@ pipeline {
steps {
sh 'git clone https://github.com/NVIDIA/TransformerEngine.git && \
cd TransformerEngine && \
git fetch origin 8eae4ce2b8fdfbbe525fc8bfecb0df5498cc9687 && \
git fetch origin e6676c53f26f6ef072943c909d136cf2a39c1d90 && \
git checkout FETCH_HEAD && \
git submodule init && git submodule update && \
NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install .'
@@ -75,7 +75,7 @@ pipeline {
steps {
sh 'git clone https://github.com/NVIDIA/Megatron-LM.git && \
cd Megatron-LM && \
git checkout 4c7a0251ae7c234a4ca3f02327330235d8d35028 && \
git checkout e122536b7645edcb7ebf099b5c92a443f7dbf8e7 && \
pip install -e .'
}
}
2 changes: 1 addition & 1 deletion README.rst
@@ -319,7 +319,7 @@ Transformer Engine requires PyTorch to be built with CUDA 11.8.

Flash Attention
~~~~~~~~~~~~~~~~~~~~
Transformer Engine already supports Flash Attention for GPT models. If you want to use Flash Attention for non-causal models or use with attention bias (introduced from position encoding, e.g. Alibi), please install `flash-attn <https://github.com/HazyResearch/flash-attention>`_.
Transformer Engine already supports Flash Attention for GPT models. If you want to use Flash Attention for non-causal models, please install `flash-attn <https://github.com/HazyResearch/flash-attention>`_. If you want to use Flash Attention with attention bias (introduced by position encodings, e.g. Alibi), please also install the pinned Triton version noted in the `implementation <https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/flash_attn_triton.py#L3>`_.

.. code-block:: bash
2 changes: 2 additions & 0 deletions docs/source/asr/data/scores/nl/fastconformer_nl.csv
@@ -0,0 +1,2 @@
Model Name,Language,MCV Test-Set v12.0 (nl),MLS Test (nl)
stt_nl_fastconformer_hybrid_large_pc,nl,9.2 %,12.1 %
2 changes: 2 additions & 0 deletions docs/source/asr/data/scores_pc/nl/fastconformer_nl.csv
@@ -0,0 +1,2 @@
Model Name,Language,MCV Test-Set v12.0 (nl),MLS Test (nl)
stt_nl_fastconformer_hybrid_large_pc,nl,32.1 %,25.1 %
24 changes: 21 additions & 3 deletions docs/source/asr/scores.rst
@@ -278,6 +278,16 @@ KAB

--------------------

NL
^^

.. csv-table::
:header-rows: 1
:align: left
:file: data/scores/nl/fastconformer_nl.csv

--------------------

PL
^^

@@ -350,7 +360,6 @@ ZH
--------------------



Scores with Punctuation and Capitalization
------------------------------------------

@@ -414,6 +423,16 @@ IT with P&C

--------------------

NL with P&C
^^^^^^^^^^^

.. csv-table::
:header-rows: 1
:align: left
:file: data/scores_pc/nl/fastconformer_nl.csv

--------------------

PL with P&C
^^^^^^^^^^^

@@ -432,5 +451,4 @@ UA with P&C
:align: left
:file: data/scores_pc/ua/fastconformer_ua.csv

--------------------

--------------------
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py
@@ -448,3 +448,95 @@ def collate_fn(self, batch):
}

return processed_batch


class GPTSFTPackedDataset(GPTSFTDataset):
def __init__(self, file_path: str, tokenizer: TokenizerSpec, **kwargs):
super().__init__(file_path, tokenizer, **kwargs)

self._load_packed_dataset(file_path)

def __getitem__(self, idx):
input_ids = self.indexed_dataset[idx]['input_ids']
seq_boundaries = self.indexed_dataset[idx]['seq_start_id'] + [len(input_ids)]
loss_mask = self.indexed_dataset[idx]['loss_mask']
return {'input_ids': input_ids, 'seq_boundaries': seq_boundaries, 'loss_mask': loss_mask}

def __len__(self):
return len(self.indexed_dataset)

def _load_packed_dataset(self, file_path):
self.indexed_dataset = np.load(file_path, allow_pickle=True)

def _build_loss_mask(self, processed_example):
if self.answer_only_loss:
seq_boundaries = processed_example['seq_boundaries']
return np.concatenate(
[
processed_example['loss_mask'][seq_boundaries[i] + 1 : seq_boundaries[i + 1]]
for i in range(len(seq_boundaries) - 1)
]
)
return [1.0] * (len(processed_example['input_ids']) - len(processed_example['seq_boundaries']))

def _maybe_cast_to_list(self, x):
return [item.tolist() if isinstance(item, np.ndarray) else item for item in x]

def collate_fn(self, batch):
input_ids = [
np.concatenate(
[
item['input_ids'][item['seq_boundaries'][i] : item['seq_boundaries'][i + 1] - 1]
for i in range(len(item['seq_boundaries']) - 1)
]
)
for item in batch
]
labels = [
np.concatenate(
[
item['input_ids'][item['seq_boundaries'][i] + 1 : item['seq_boundaries'][i + 1]]
for i in range(len(item['seq_boundaries']) - 1)
]
)
for item in batch
]

loss_mask = [self._build_loss_mask(item) for item in batch]

max_length = max(len(l) for l in input_ids)
max_length = self._ceil_to_nearest(max_length, 16)

position_ids: List[List[int]] = []
cu_seqlens: List[List[int]] = []
for item in batch:
position_ids.append([])
cu_seqlens.append([0])
seqlens = np.array(item['seq_boundaries'][1:]) - np.array(item['seq_boundaries'][:-1])
for l in seqlens:
# length minus 1 because input_ids is truncated by 1 for labels
position_ids[-1].extend(list(range(l - 1)))
cu_seqlens[-1].append(cu_seqlens[-1][-1] + l - 1)
# set last seq to the max seq len because rope and attn kernels expect no padding
cu_seqlens[-1][-1] = max_length

assert len(input_ids[0]) == len(
position_ids[0]
), "Dataset problem: input_ids and position_ids lengths don't match"

cu_seqlens = self._collate_item(cu_seqlens, max_length=max(len(l) for l in cu_seqlens) + 1, pad_id=-1)
input_ids = self._collate_item(input_ids, max_length=max_length, pad_id=self.tokenizer.eos_id)
labels = self._collate_item(labels, max_length=max_length, pad_id=self.tokenizer.eos_id)
loss_mask = self._collate_item(loss_mask, max_length=max_length, pad_id=0)
position_ids = self._collate_item(position_ids, max_length=max_length, pad_id=0)

processed_batch = {
'tokens': torch.LongTensor(input_ids),
'labels': torch.LongTensor(labels),
'attention_mask': torch.LongTensor([1] * len(input_ids)), # no attention mask is needed for packed seq
'loss_mask': torch.LongTensor(loss_mask),
'position_ids': torch.LongTensor(position_ids),
'cu_seqlens': torch.IntTensor(cu_seqlens), # cu_seqlens_q must be in dtype torch.int32
}

return processed_batch
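
For context, the packed .npy record layout this class expects can be sketched from the fields read above ('input_ids', 'seq_start_id', 'loss_mask'); the token values, loss-mask pattern, and file name below are illustrative assumptions, not taken from the commit:

import numpy as np

# Two original SFT examples packed into a single row. The field names mirror what
# __getitem__ and _load_packed_dataset read; everything else is made up for illustration.
packed_rows = np.array(
    [
        {
            'input_ids': [101, 5, 6, 7, 102, 8, 9, 10, 11],  # tokens of both examples, concatenated
            'seq_start_id': [0, 5],                           # start offset of each packed sequence
            'loss_mask': [0, 0, 1, 1, 1, 0, 0, 1, 1],         # 1 on answer tokens, 0 on prompt tokens
        }
    ],
    dtype=object,
)
np.save('packed_sft_toy.npy', packed_rows)  # loadable via np.load(path, allow_pickle=True)

# For this row, __getitem__ derives seq_boundaries = seq_start_id + [len(input_ids)] = [0, 5, 9],
# which collate_fn then uses to rebuild per-sequence inputs, labels, and cu_seqlens.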
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py
@@ -76,7 +76,7 @@
try:
from megatron.core import InferenceParams, parallel_state
from megatron.core.models.gpt import GPTModel as MCoreGPTModel
from megatron.core.models.gpt.gpt_layer_specs import gpt_layer_with_transformer_engine_spec
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_with_transformer_engine_spec
from megatron.core.pipeline_parallel.schedules import get_forward_backward_func
from megatron.core.transformer.module import Float16Module as MCoreFloat16Module
from megatron.core.transformer.transformer_config import TransformerConfig
@@ -246,7 +246,7 @@ def __init__(self, cfg: DictConfig, trainer: Trainer):

if self.megatron_amp_O2:

if not self.with_distributed_adam:
if not self.with_distributed_adam and not self.cfg.get("use_cpu_initialization", False):
# Pre-allocate the model on GPU to have master parameters allocated on the same device with matching data type
if isinstance(self.model, list):
for module in self.model:
@@ -308,7 +308,7 @@ def model_provider_func(self, pre_process, post_process):
if self.mcore_gpt:
model = MCoreGPTModel(
config=self.transformer_config,
transformer_layer_spec=gpt_layer_with_transformer_engine_spec,
transformer_layer_spec=get_gpt_layer_with_transformer_engine_spec(),
vocab_size=self.cfg.get('override_vocab_size', self.padded_vocab_size),
max_sequence_length=self.cfg.get('encoder_seq_length', 512),
pre_process=pre_process,
@@ -835,6 +835,8 @@ def fwd_output_and_loss_func(dataloader_iter, model, checkpoint_activations_all_
required_keys.update(batch.keys())
else:
required_keys.add('attention_mask')
if 'cu_seqlens' in batch:
required_keys.add('cu_seqlens')
if parallel_state.is_pipeline_first_stage():
required_keys.update(('tokens', 'position_ids'))
if parallel_state.is_pipeline_last_stage():
@@ -859,6 +861,15 @@ def fwd_output_and_loss_func(dataloader_iter, model, checkpoint_activations_all_
else:
# TODO: @eharper can we add this to mcore?
forward_args.pop('loss_mask')

if 'cu_seqlens' in batch: # packed sequence from GPTSFTPackedDataset
# these args are passed eventually into TEDotProductAttention.forward()
cu_seqlens = batch['cu_seqlens'].squeeze() # remove batch size dimension (mbs=1)
cu_seqlens = cu_seqlens[: torch.argmin(cu_seqlens)] # remove -1 "paddings" added in collate_fn
forward_args['cu_seqlens_q'] = cu_seqlens
forward_args['cu_seqlens_kv'] = cu_seqlens
forward_args['qkv_format'] = 'thd'

output_tensor = model(**forward_args)

def loss_func(output_tensor):
@@ -1585,7 +1596,7 @@ def build_transformer_config(self) -> TransformerConfig:
'recompute_method': recompute_method,
'recompute_num_layers': recompute_num_layers,
'distribute_saved_activations': False, # not currently used in NeMo
'ub_tp_comm_overlap': ub_tp_comm_overlap,
'tp_comm_overlap': ub_tp_comm_overlap,
'fp8': fp8,
}

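
To make the cu_seqlens handling in fwd_output_and_loss_func above concrete, here is a small self-contained sketch of the trimming step; the tensor values are illustrative, not taken from the commit:

import torch

# collate_fn pads each row of cu_seqlens with -1 so all rows in the batch share one length.
# Valid entries are non-negative and non-decreasing, so the first -1 is the minimum and
# torch.argmin() marks where the padding begins.
cu_seqlens = torch.IntTensor([0, 9, 31, 48, -1, -1])
cu_seqlens = cu_seqlens[: torch.argmin(cu_seqlens)]
print(cu_seqlens)  # tensor([ 0,  9, 31, 48], dtype=torch.int32)

# The trimmed offsets are then passed as cu_seqlens_q / cu_seqlens_kv with qkv_format='thd',
# which lets the attention kernel treat the packed row as independent sequences.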
nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py
@@ -26,7 +26,7 @@
)
from nemo.collections.nlp.data.language_modeling.megatron.blendable_dataset import BlendableDataset
from nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_chat_dataset import GPTSFTChatDataset
from nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_dataset import GPTSFTDataset
from nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_dataset import GPTSFTDataset, GPTSFTPackedDataset
from nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers import (
MegatronPretrainingBatchSampler,
)
@@ -208,6 +208,7 @@ def setup(self, stage=None):
self.setup_complete = True

def _build_dataset(self, data_cfg, is_train=True):
packed_sequence = data_cfg.get("packed_sequence", False)
datasets = []
# Determine if we are using a single dataset or a list of datasets.
is_list_config = isinstance(data_cfg.file_names, ListConfig)
@@ -263,6 +264,9 @@ def _build_dataset(self, data_cfg, is_train=True):
for file_path, num_samples in zip(data_cfg.file_names, num_train_samples_per_dataset):
if self.cfg.data.get("chat", False):
dataset_cls = GPTSFTChatDataset
elif packed_sequence:
dataset_cls = GPTSFTPackedDataset
assert data_cfg.micro_batch_size == 1, "Micro batch size must be 1 if using packed sequence"
else:
dataset_cls = GPTSFTDataset
dataset = dataset_cls(
@@ -274,7 +278,7 @@
add_eos=data_cfg.get('add_eos', True),
add_sep=data_cfg.get('add_sep', False),
sep_id=self.sep_id,
max_num_samples=num_samples[0],
max_num_samples=num_samples[0] if not packed_sequence else None,
seed=data_cfg.get('seed', 1234),
label_key=data_cfg.get('label_key', 'answer'),
answer_only_loss=self.cfg.get('answer_only_loss', True),
@@ -301,6 +305,8 @@ def _build_dataset(self, data_cfg, is_train=True):
)
datasets.append(dataset)
if is_train:
if packed_sequence:
num_train_samples_after_blend = sum(len(dataset) for dataset in datasets)
dataset = BlendableDataset(
datasets=datasets, weights=data_cfg.concat_sampling_probabilities, size=num_train_samples_after_blend
)
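
For reference, a minimal sketch of the dataset config knobs this branch reads in _build_dataset; the field names follow the data_cfg.get(...) calls above, but the fragment and the .npy path are illustrative assumptions, not a complete NeMo SFT config:

from omegaconf import OmegaConf

# Illustrative fragment only; 'packed_sft_toy.npy' is a hypothetical packed dataset file.
data_cfg = OmegaConf.create(
    {
        'file_names': ['packed_sft_toy.npy'],
        'packed_sequence': True,               # selects GPTSFTPackedDataset instead of GPTSFTDataset
        'micro_batch_size': 1,                 # asserted above when packed_sequence is enabled
        'concat_sampling_probabilities': [1.0],
    }
)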
@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import torch
import torch.nn.functional as F
from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
from megatron.core.fusions.fused_bias_gelu import bias_gelu_impl
17 changes: 14 additions & 3 deletions nemo/collections/nlp/modules/common/megatron/attention.py
@@ -65,11 +65,23 @@

HAVE_MEGATRON_CORE = False

try:
# Flash Attention Triton
import pkg_resources
from flash_attn.flash_attn_triton import flash_attn_func as flash_attn_func_triton

# pinned triton version for flash-attention triton https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py#L3
assert pkg_resources.get_distribution("triton").version == '2.0.0.dev20221202'

except (ImportError, ModuleNotFoundError, AssertionError):

flash_attn_func_triton = None


try:
# Flash Attention 1.X
from flash_attn.bert_padding import pad_input, unpad_input
from flash_attn.flash_attn_interface import flash_attn_unpadded_func
from flash_attn.flash_attn_triton import flash_attn_func as flash_attn_func_triton

HAVE_FLASH_ATTENTION = True
flash_attn_func = None
@@ -85,8 +97,7 @@
except (ImportError, ModuleNotFoundError):

HAVE_FLASH_ATTENTION = False

flash_attn_unpadded_func, flash_attn_func_triton, flash_attn_func = None, None, None
flash_attn_unpadded_func, flash_attn_func = None, None
unpad_input, pad_input = None, None

try: