Training not working with default script #1698

Open
ByteBrigand opened this issue Aug 28, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@ByteBrigand

Bug description

I tried to train using the example script from the README, but I get an error. The script is the following:

mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

# 1) Download a tokenizer
litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True

# 2) Pretrain the model
litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 10_000_000 \
  --out_dir out/custom-model

Here's the entire output from terminal:

(mpi_env) root@a785850c8b5e:/workspace# mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

# 1) Download a tokenizer
litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True

# 2) Pretrain the model
litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 10_000_000 \
  --out_dir out/custom-model
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  676k  100  676k    0     0  2714k      0 --:--:-- --:--:-- --:--:-- 2717k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  497k  100  497k    0     0  1700k      0 --:--:-- --:--:-- --:--:-- 1698k
Fetching 3 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 476.43it/s]
Using bfloat16 Automatic Mixed Precision (AMP)
{'data': {'batch_size': 1,
          'max_seq_length': -1,
          'num_workers': 4,
          'seed': 42,
          'tokenizer': None,
          'train_data_path': PosixPath('custom_texts'),
          'val_data_path': None},
 'devices': 'auto',
 'eval': {'final_validation': True,
          'initial_validation': False,
          'interval': 1000,
          'max_iters': 100,
          'max_new_tokens': None},
 'initial_checkpoint_dir': None,
 'logger_name': 'tensorboard',
 'model_config': {'attention_logit_softcapping': None,
                  'attention_scores_scalar': None,
                  'bias': True,
                  'block_size': 2048,
                  'final_logit_softcapping': None,
                  'gelu_approximate': 'none',
                  'head_size': 64,
                  'hf_config': {'name': 'pythia-160m', 'org': 'EleutherAI'},
                  'intermediate_size': 3072,
                  'lm_head_bias': False,
                  'mlp_class_name': 'GptNeoxMLP',
                  'n_embd': 768,
                  'n_expert': 0,
                  'n_expert_per_token': 0,
                  'n_head': 12,
                  'n_layer': 12,
                  'n_query_groups': 12,
                  'name': 'pythia-160m',
                  'norm_class_name': 'LayerNorm',
                  'norm_eps': 1e-05,
                  'padded_vocab_size': 50304,
                  'padding_multiple': 128,
                  'parallel_residual': True,
                  'post_attention_norm': False,
                  'post_mlp_norm': False,
                  'rope_base': 10000,
                  'rope_condense_ratio': 1,
                  'rotary_percentage': 0.25,
                  'scale_embeddings': False,
                  'shared_attention_norm': False,
                  'sliding_window_layer_placing': None,
                  'sliding_window_size': None,
                  'vocab_size': 50254},
 'model_name': 'EleutherAI/pythia-160m',
 'num_nodes': 1,
 'optimizer': 'AdamW',
 'out_dir': PosixPath('out/custom-model'),
 'precision': None,
 'resume': False,
 'seed': 42,
 'tokenizer_dir': PosixPath('checkpoints/EleutherAI/pythia-160m'),
 'train': {'epochs': None,
           'global_batch_size': 512,
           'log_interval': 1,
           'lr_warmup_fraction': None,
           'lr_warmup_steps': 2000,
           'max_norm': 1.0,
           'max_seq_length': None,
           'max_steps': None,
           'max_tokens': 10000000,
           'micro_batch_size': 4,
           'min_lr': 4e-05,
           'save_interval': 1000,
           'tie_embeddings': False}}
Seed set to 42
Time to instantiate model: 0.25 seconds.
Total parameters: 162,322,944
Create an account on https://lightning.ai/ to optimize your data faster using multiple nodes and large machines.
Setting multiprocessing start_method to spawn. 
Storing the files under /workspace/custom_texts/train
Setup started with fast_dev_run=False.
Setup finished in 0.001 seconds. Found 1 items to process.
Starting 1 workers with 1 items. The progress bar is only updated when a worker finishes.
Workers are ready ! Starting data processing...
Rank 0 inferred the following `['no_header_tensor:16']` data format.
Worker 0 is terminating.
Worker 0 is done.
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.48s/it]
Workers are finished.
Finished data processing!
Create an account on https://lightning.ai/ to optimize your data faster using multiple nodes and large machines.
Setting multiprocessing start_method to spawn. 
Storing the files under /workspace/custom_texts/val
Setup started with fast_dev_run=False.
Setup finished in 0.0 seconds. Found 1 items to process.
Starting 1 workers with 1 items. The progress bar is only updated when a worker finishes.
Workers are ready ! Starting data processing...
Rank 0 inferred the following `['no_header_tensor:16']` data format.
Worker 0 is terminating.
Worker 0 is done.
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.90s/it]
Workers are finished.
Finished data processing!
Traceback (most recent call last):
  File "/usr/local/bin/litgpt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/usr/local/lib/python3.11/dist-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/litgpt/pretrain.py", line 154, in setup
    main(
  File "/usr/local/lib/python3.11/dist-packages/litgpt/pretrain.py", line 214, in main
    train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/litgpt/pretrain.py", line 411, in get_dataloaders
    train_dataloader = data.train_dataloader()
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/litgpt/data/text_files.py", line 107, in train_dataloader
    train_dataset = StreamingDataset(
                    ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/litdata/streaming/dataset.py", line 91, in __init__
    self.subsampled_files, self.region_of_interest = subsample_streaming_dataset(
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/litdata/utilities/dataset_utilities.py", line 68, in subsample_streaming_dataset
    roi = generate_roi(original_chunks, item_loader)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/litdata/utilities/dataset_utilities.py", line 164, in generate_roi
    roi.append((0, chunk["dim"] // item_loader._block_size))
                   ~~~~~~~~~~~~~^^~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'
(mpi_env) root@a785850c8b5e:/workspace# 
(mpi_env) root@a785850c8b5e:/workspace# pip freeze
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
attrs==24.2.0
boto3==1.35.7
botocore==1.35.7
certifi==2024.7.4
charset-normalizer==3.3.2
datasets==2.21.0
dill==0.3.8
docstring_parser==0.16
filelock==3.15.4
frozenlist==1.4.1
fsspec==2024.6.1
huggingface-hub==0.24.6
idna==3.8
importlib_resources==6.4.4
Jinja2==3.1.4
jmespath==1.0.1
jsonargparse==4.32.1
lightning==2.4.0.dev20240728
lightning-utilities==0.11.6
litdata==0.2.24
litgpt==0.4.11
MarkupSafe==2.1.5
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
networkx==3.3
numpy==2.1.0
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
packaging==24.1
pandas==2.2.2
pyarrow==17.0.0
python-dateutil==2.9.0.post0
pytorch-lightning==2.4.0
pytz==2024.1
PyYAML==6.0.2
regex==2024.7.24
requests==2.32.3
s3transfer==0.10.2
safetensors==0.4.4
six==1.16.0
sympy==1.13.2
tokenizers==0.19.1
torch==2.4.0
torchmetrics==1.4.1
tqdm==4.66.5
transformers==4.44.2
triton==3.0.0
typeshed_client==2.7.0
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
xxhash==3.5.0
yarl==1.9.4
(mpi_env) root@a785850c8b5e:/workspace# 
(mpi_env) root@a785850c8b5e:/workspace# python3 --version
Python 3.11.9
(mpi_env) root@a785850c8b5e:/workspace# uname -a
Linux a785850c8b5e 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May  7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

What operating system are you using?

Linux

LitGPT Version

Version: 0.4.11



ByteBrigand added the bug label on Aug 28, 2024
@srikhetramohanty

Hi. I am facing the same issue. Any progress yet?

@srikhetramohanty

I have an interesting observation here. I tried running this in the lightning.ai Studio and it works there. The litgpt version there is 0.3.0.dev0, which is not available at https://pypi.org/project/litgpt/#history, so it seems to be a dev version rolled out internally. I am facing the above issue on version 0.4.11.

@ByteBrigand
Author

I have an interesting observation here. I tried running this in the lightning.ai Studio and it works there. The litgpt version there is 0.3.0.dev0, which is not available at https://pypi.org/project/litgpt/#history, so it seems to be a dev version rolled out internally. I am facing the above issue on version 0.4.11.

Can you please dump the source code and upload it somewhere?
Use the following script:

import os
import shutil
import sys

def dump_package_source(package_name, output_dir):
    try:
        package = __import__(package_name)
        package_path = os.path.dirname(package.__file__)
        destination_dir = os.path.join(output_dir, package_name)
        shutil.copytree(package_path, destination_dir)
        print(f"Source code of the package '{package_name}' has been dumped to '{destination_dir}'")
    except ImportError:
        print(f"Package '{package_name}' is not installed.")
    except Exception as e:
        print(f"An error occurred: {str(e)}")

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python dump_package.py <package_name> <output_dir>")
    else:
        package_name = sys.argv[1]
        output_dir = sys.argv[2]
        dump_package_source(package_name, output_dir)
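
For example (dumped_src is just a placeholder output directory), running python dump_package.py litgpt dumped_src should copy the installed litgpt package into dumped_src/litgpt so you can zip it and upload it.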

@twaka

twaka commented Aug 30, 2024

Your litdata version may be too new for litgpt.
There is a breaking change in litdata (Lightning-AI/litdata#296) that litgpt does not seem to handle yet.
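
For anyone skimming, my reading of the traceback (the chunk index layout is a litdata internal, so treat this purely as an illustration): each chunk entry is expected to carry a dim token count, and the region of interest is computed as dim // block_size, so a chunk written without token metadata leaves dim as None and the division fails:

# Illustration of the failing expression from the traceback.
# The chunk dicts here are made up; real entries come from litdata's chunk index.
block_size = 16

token_aware_chunk = {"dim": 2048}              # written with a per-chunk token count
print(token_aware_chunk["dim"] // block_size)  # -> 128

legacy_chunk = {"dim": None}                   # written without token metadata
legacy_chunk["dim"] // block_size              # TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'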

@rasbt
Collaborator

rasbt commented Sep 9, 2024

Thanks for the note. I was out for the last 2 weeks and haven't had a chance to look into it yet.

@rasbt
Collaborator

rasbt commented Sep 10, 2024

I just tested it in a Studio, on both CPU and GPU, and it seemed to work fine:

Progress: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.05it/s]
Workers are finished.
Finished data processing!
Verifying settings ...
Measured TFLOPs: 7.93
Epoch 8 | iter 128 step 1 | loss train: 10.943, val: n/a | iter time: 445.72 ms (step) remaining time: 0:08:16
Epoch 16 | iter 256 step 2 | loss train: 9.708, val: n/a | iter time: 364.51 ms (step) remaining time: 0:05:51

This was with versions

  • LitGPT: 0.4.11
  • LitData: 0.2.17 (and I tested 0.2.26 as well)

installed from the latest main branch:

git clone https://github.com/Lightning-AI/litgpt.git
pip install -e ".[all]"

Could you let me know which LitData version you were using?

You can use pip show litgpt | grep Version: and pip show litdata | grep Version: to get these.
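
Or, as a quick Python check (a minimal sketch; importlib.metadata is in the standard library on Python 3.8+):

# Print the installed litgpt and litdata versions using only the standard library.
from importlib.metadata import version

for pkg in ("litgpt", "litdata"):
    print(pkg, version(pkg))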

@rasbt
Collaborator

rasbt commented Sep 10, 2024

There might be a LitData bug. I was getting an error both with the LitGPT code and with a simpler self-contained example. Reported it here: Lightning-AI/litdata#367
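
For reference, this is roughly the pattern involved (a rough sketch, not the exact script I filed; the repro_data path and the dummy tokenizer are placeholders, and litdata signatures may differ slightly between 0.2.x releases):

# Rough sketch of a self-contained repro: write chunks with litdata's optimize(),
# then read them back through TokensLoader the way litgpt's TextFiles module does.
import torch
from litdata import optimize
from litdata.streaming import StreamingDataset, TokensLoader

def dummy_tokenize(index):
    # Stand-in for a real tokenizer: yield one 1-D tensor of "token ids" per input.
    yield torch.randint(0, 50254, (2048,), dtype=torch.int64)

if __name__ == "__main__":
    optimize(
        fn=dummy_tokenize,
        inputs=list(range(4)),
        output_dir="repro_data/train",
        num_workers=1,
        chunk_bytes="50MB",
    )
    # On affected litdata versions the error surfaces here, because the chunks
    # above were written without per-chunk token counts (dim is None).
    dataset = StreamingDataset(
        input_dir="repro_data/train",
        item_loader=TokensLoader(block_size=2048),
    )
    print(len(dataset), dataset[0].shape)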

@ByteBrigand
Author

litdata==0.2.24
litgpt==0.4.11

@rasbt
Collaborator

rasbt commented Sep 10, 2024

Thanks!

@zhe-thoughts

I'm getting the same error

@rasbt : I'm a big fan of litgpt (as well as nanoGPT), because they are minimal / clean implementations.

Should we consider having an option to run this small example (pythia-160m) without relying on litdata?

@rasbt
Collaborator

rasbt commented Sep 12, 2024

Thanks for the kind comment. Ideally, it would be nice to keep LitData here, because then we don't have to maintain two implementations, one for small-scale and one for larger-scale experiments. A contributor to LitData mentioned that they are looking into this issue, so hopefully it gets fixed soon.

@srikhetramohanty

Hi, any resolution to this on Linux systems?

@emily-xiao-19

I'm also getting this error. What's the current solution? Which older version works? I tried downgrading and it still did not work.
miniconda3/envs/myenv/lib/python3.9/site-packages/litdata/utilities/dataset_utilities.py", line 111, in generate_roi
roi.append((0, chunk["dim"] // item_loader._block_size))
TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'
