Training not working with default script #1698

Open
ByteBrigand opened this issue Aug 28, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@ByteBrigand

Bug description

I tried to train using the example script from the README, but I get an error. The script is the following:

mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

# 1) Download a tokenizer
litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True

# 2) Pretrain the model
litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 10_000_000 \
  --out_dir out/custom-model

Here's the entire output from terminal:

(mpi_env) root@a785850c8b5e:/workspace# mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

# 1) Download a tokenizer
litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True

# 2) Pretrain the model
litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 10_000_000 \
  --out_dir out/custom-model
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  676k  100  676k    0     0  2714k      0 --:--:-- --:--:-- --:--:-- 2717k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  497k  100  497k    0     0  1700k      0 --:--:-- --:--:-- --:--:-- 1698k
Fetching 3 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 476.43it/s]
Using bfloat16 Automatic Mixed Precision (AMP)
{'data': {'batch_size': 1,
          'max_seq_length': -1,
          'num_workers': 4,
          'seed': 42,
          'tokenizer': None,
          'train_data_path': PosixPath('custom_texts'),
          'val_data_path': None},
 'devices': 'auto',
 'eval': {'final_validation': True,
          'initial_validation': False,
          'interval': 1000,
          'max_iters': 100,
          'max_new_tokens': None},
 'initial_checkpoint_dir': None,
 'logger_name': 'tensorboard',
 'model_config': {'attention_logit_softcapping': None,
                  'attention_scores_scalar': None,
                  'bias': True,
                  'block_size': 2048,
                  'final_logit_softcapping': None,
                  'gelu_approximate': 'none',
                  'head_size': 64,
                  'hf_config': {'name': 'pythia-160m', 'org': 'EleutherAI'},
                  'intermediate_size': 3072,
                  'lm_head_bias': False,
                  'mlp_class_name': 'GptNeoxMLP',
                  'n_embd': 768,
                  'n_expert': 0,
                  'n_expert_per_token': 0,
                  'n_head': 12,
                  'n_layer': 12,
                  'n_query_groups': 12,
                  'name': 'pythia-160m',
                  'norm_class_name': 'LayerNorm',
                  'norm_eps': 1e-05,
                  'padded_vocab_size': 50304,
                  'padding_multiple': 128,
                  'parallel_residual': True,
                  'post_attention_norm': False,
                  'post_mlp_norm': False,
                  'rope_base': 10000,
                  'rope_condense_ratio': 1,
                  'rotary_percentage': 0.25,
                  'scale_embeddings': False,
                  'shared_attention_norm': False,
                  'sliding_window_layer_placing': None,
                  'sliding_window_size': None,
                  'vocab_size': 50254},
 'model_name': 'EleutherAI/pythia-160m',
 'num_nodes': 1,
 'optimizer': 'AdamW',
 'out_dir': PosixPath('out/custom-model'),
 'precision': None,
 'resume': False,
 'seed': 42,
 'tokenizer_dir': PosixPath('checkpoints/EleutherAI/pythia-160m'),
 'train': {'epochs': None,
           'global_batch_size': 512,
           'log_interval': 1,
           'lr_warmup_fraction': None,
           'lr_warmup_steps': 2000,
           'max_norm': 1.0,
           'max_seq_length': None,
           'max_steps': None,
           'max_tokens': 10000000,
           'micro_batch_size': 4,
           'min_lr': 4e-05,
           'save_interval': 1000,
           'tie_embeddings': False}}
Seed set to 42
Time to instantiate model: 0.25 seconds.
Total parameters: 162,322,944
Create an account on https://lightning.ai/ to optimize your data faster using multiple nodes and large machines.
Setting multiprocessing start_method to spawn. 
Storing the files under /workspace/custom_texts/train
Setup started with fast_dev_run=False.
Setup finished in 0.001 seconds. Found 1 items to process.
Starting 1 workers with 1 items. The progress bar is only updated when a worker finishes.
Workers are ready ! Starting data processing...
Rank 0 inferred the following `['no_header_tensor:16']` data format.
Worker 0 is terminating.
Worker 0 is done.
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.48s/it]
Workers are finished.
Finished data processing!
Create an account on https://lightning.ai/ to optimize your data faster using multiple nodes and large machines.
Setting multiprocessing start_method to spawn. 
Storing the files under /workspace/custom_texts/val
Setup started with fast_dev_run=False.
Setup finished in 0.0 seconds. Found 1 items to process.
Starting 1 workers with 1 items. The progress bar is only updated when a worker finishes.
Workers are ready ! Starting data processing...
Rank 0 inferred the following `['no_header_tensor:16']` data format.
Worker 0 is terminating.
Worker 0 is done.
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.90s/it]
Workers are finished.
Finished data processing!
Traceback (most recent call last):
  File "/usr/local/bin/litgpt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/usr/local/lib/python3.11/dist-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/litgpt/pretrain.py", line 154, in setup
    main(
  File "/usr/local/lib/python3.11/dist-packages/litgpt/pretrain.py", line 214, in main
    train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/litgpt/pretrain.py", line 411, in get_dataloaders
    train_dataloader = data.train_dataloader()
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/litgpt/data/text_files.py", line 107, in train_dataloader
    train_dataset = StreamingDataset(
                    ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/litdata/streaming/dataset.py", line 91, in __init__
    self.subsampled_files, self.region_of_interest = subsample_streaming_dataset(
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/litdata/utilities/dataset_utilities.py", line 68, in subsample_streaming_dataset
    roi = generate_roi(original_chunks, item_loader)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/litdata/utilities/dataset_utilities.py", line 164, in generate_roi
    roi.append((0, chunk["dim"] // item_loader._block_size))
                   ~~~~~~~~~~~~~^^~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'
(mpi_env) root@a785850c8b5e:/workspace# 
(mpi_env) root@a785850c8b5e:/workspace# pip freeze
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
attrs==24.2.0
boto3==1.35.7
botocore==1.35.7
certifi==2024.7.4
charset-normalizer==3.3.2
datasets==2.21.0
dill==0.3.8
docstring_parser==0.16
filelock==3.15.4
frozenlist==1.4.1
fsspec==2024.6.1
huggingface-hub==0.24.6
idna==3.8
importlib_resources==6.4.4
Jinja2==3.1.4
jmespath==1.0.1
jsonargparse==4.32.1
lightning==2.4.0.dev20240728
lightning-utilities==0.11.6
litdata==0.2.24
litgpt==0.4.11
MarkupSafe==2.1.5
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
networkx==3.3
numpy==2.1.0
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
packaging==24.1
pandas==2.2.2
pyarrow==17.0.0
python-dateutil==2.9.0.post0
pytorch-lightning==2.4.0
pytz==2024.1
PyYAML==6.0.2
regex==2024.7.24
requests==2.32.3
s3transfer==0.10.2
safetensors==0.4.4
six==1.16.0
sympy==1.13.2
tokenizers==0.19.1
torch==2.4.0
torchmetrics==1.4.1
tqdm==4.66.5
transformers==4.44.2
triton==3.0.0
typeshed_client==2.7.0
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
xxhash==3.5.0
yarl==1.9.4
(mpi_env) root@a785850c8b5e:/workspace# 
(mpi_env) root@a785850c8b5e:/workspace# python3 --version
Python 3.11.9
(mpi_env) root@a785850c8b5e:/workspace# uname -a
Linux a785850c8b5e 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May  7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

What operating system are you using?

Linux

LitGPT Version

Version: 0.4.11



ByteBrigand added the bug label on Aug 28, 2024
@srikhetramohanty

Hi. I am facing the same issue. Any progress yet?

@srikhetramohanty

I have an interesting observation here. I tried running this in the lightning.ai Studio and it works there. The litgpt version there is 0.3.0.dev0, which is not available at https://pypi.org/project/litgpt/#history, so it seems to be a dev version rolled out internally. I am facing the above issue on version 0.4.11.

@ByteBrigand
Author

I have an interesting observation here. I tried running this in the lightning.ai Studio and it works there. The litgpt version there is 0.3.0.dev0, which is not available at https://pypi.org/project/litgpt/#history, so it seems to be a dev version rolled out internally. I am facing the above issue on version 0.4.11.

Can you please dump the source code and upload it somewhere?
Use the following script:

import os
import shutil
import sys

def dump_package_source(package_name, output_dir):
    try:
        package = __import__(package_name)
        package_path = os.path.dirname(package.__file__)
        destination_dir = os.path.join(output_dir, package_name)
        shutil.copytree(package_path, destination_dir)
        print(f"Source code of the package '{package_name}' has been dumped to '{destination_dir}'")
    except ImportError:
        print(f"Package '{package_name}' is not installed.")
    except Exception as e:
        print(f"An error occurred: {str(e)}")

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python dump_package.py <package_name> <output_dir>")
    else:
        package_name = sys.argv[1]
        output_dir = sys.argv[2]
        dump_package_source(package_name, output_dir)
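
For example (dumped_src is just a placeholder output directory), running python dump_package.py litgpt dumped_src should copy the installed litgpt package into dumped_src/litgpt so you can zip it and upload it.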

@twaka

twaka commented Aug 30, 2024

Your litdata version may be too new for litgpt.
There is a breaking change in litdata (Lightning-AI/litdata#296) that litgpt does not seem to handle yet.
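
For anyone skimming, my reading of the traceback (the chunk index layout is a litdata internal, so treat this purely as an illustration): each chunk entry is expected to carry a dim token count, and the region of interest is computed as dim // block_size, so a chunk written without token metadata leaves dim as None and the division fails:

# Illustration of the failing expression from the traceback.
# The chunk dicts here are made up; real entries come from litdata's chunk index.
block_size = 16

token_aware_chunk = {"dim": 2048}              # written with a per-chunk token count
print(token_aware_chunk["dim"] // block_size)  # -> 128

legacy_chunk = {"dim": None}                   # written without token metadata
legacy_chunk["dim"] // block_size              # TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'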

@rasbt
Collaborator

rasbt commented Sep 9, 2024

Thanks for the note. I was out for the last 2 weeks and haven't had a chance to look into it yet.

@rasbt
Collaborator

rasbt commented Sep 10, 2024

I just tested it in a Studio, on both CPU and GPU, and it seemed to work fine:

Progress: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.05it/s]
Workers are finished.
Finished data processing!
Verifying settings ...
Measured TFLOPs: 7.93
Epoch 8 | iter 128 step 1 | loss train: 10.943, val: n/a | iter time: 445.72 ms (step) remaining time: 0:08:16
Epoch 16 | iter 256 step 2 | loss train: 9.708, val: n/a | iter time: 364.51 ms (step) remaining time: 0:05:51

This was with versions

  • LitGPT: 0.4.11
  • LitData: 0.2.17 (and I tested 0.2.26 as well)

installed from the latest main branch:

git clone https://github.com/Lightning-AI/litgpt.git
pip install -e ".[all]"

Could you let me know which LitData version you were using?

You can use pip show litgpt | grep Version: and pip show litdata | grep Version: to get these.
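
Or, as a quick Python check (a minimal sketch; importlib.metadata is in the standard library on Python 3.8+):

# Print the installed litgpt and litdata versions using only the standard library.
from importlib.metadata import version

for pkg in ("litgpt", "litdata"):
    print(pkg, version(pkg))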

@rasbt
Collaborator

rasbt commented Sep 10, 2024

There might be a LitData bug. I was getting an error both with the LitGPT code and with a simpler self-contained example. Reported it here: Lightning-AI/litdata#367
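
For reference, this is roughly the pattern involved (a rough sketch, not the exact script I filed; the repro_data path and the dummy tokenizer are placeholders, and litdata signatures may differ slightly between 0.2.x releases):

# Rough sketch of a self-contained repro: write chunks with litdata's optimize(),
# then read them back through TokensLoader the way litgpt's TextFiles module does.
import torch
from litdata import optimize
from litdata.streaming import StreamingDataset, TokensLoader

def dummy_tokenize(index):
    # Stand-in for a real tokenizer: yield one 1-D tensor of "token ids" per input.
    yield torch.randint(0, 50254, (2048,), dtype=torch.int64)

if __name__ == "__main__":
    optimize(
        fn=dummy_tokenize,
        inputs=list(range(4)),
        output_dir="repro_data/train",
        num_workers=1,
        chunk_bytes="50MB",
    )
    # On affected litdata versions the error surfaces here, because the chunks
    # above were written without per-chunk token counts (dim is None).
    dataset = StreamingDataset(
        input_dir="repro_data/train",
        item_loader=TokensLoader(block_size=2048),
    )
    print(len(dataset), dataset[0].shape)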

@ByteBrigand
Author

litdata==0.2.24
litgpt==0.4.11

@rasbt
Collaborator

rasbt commented Sep 10, 2024

Thanks!

@zhe-thoughts

I'm getting the same error

@rasbt : I'm a big fan of litgpt (as well as nanoGPT), because they are minimal / clean implementations.

Should we consider having an option to run this small example (pythia-160m) without relying on litdata?

@rasbt
Collaborator

rasbt commented Sep 12, 2024

Thanks for the kind comment. Ideally, it would be nice to keep LitData here, because then we don't have to maintain two implementations, one for small-scale and one for larger-scale experiments. A contributor to LitData mentioned that they are looking into this issue, so hopefully it gets fixed soon.

@srikhetramohanty

Hi, any resolution to this on Linux systems?

@emily-xiao-19

I'm also getting this error. What's the current solution? Which older version works? I tried downgrading and it still did not work.
miniconda3/envs/myenv/lib/python3.9/site-packages/litdata/utilities/dataset_utilities.py", line 111, in generate_roi
roi.append((0, chunk["dim"] // item_loader._block_size))
TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'
