
Example notebook not working #69

Open
1nuno opened this issue Nov 18, 2024 · 3 comments
Labels
bug Something isn't working

Comments


1nuno commented Nov 18, 2024

Bug description

When I try to run this example notebook (01-active-learning-for-text-classification-with-small-text-intro.ipynb), it errors on the cell corresponding to the 3rd section (III. Setting up the Active Learner).

Steps to reproduce

  1. Download the notebook.
  2. Run it in an adequate environment such as Jupyter Notebook, JupyterLab, or Google Colab.

Expected behavior

The notebook runs successfully without any errors, outputting the expected results.

Environment:

Python version: 3.11.9
small-text version: 1.4.1
small-text integrations (e.g., transformers): 4.46.2
PyTorch version (if applicable): 2.5.1+cu124

Installation (pip, conda, or from source): pip
CUDA version (if applicable): 12.4

Additional information

This is the error it outputs:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/torch/serialization.py:850, in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization, _disable_byteorder_record)
    849 with _open_zipfile_writer(f) as opened_zipfile:
--> 850     _save(
    851         obj,
    852         opened_zipfile,
    853         pickle_module,
    854         pickle_protocol,
    855         _disable_byteorder_record,
    856     )
    857     return

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/torch/serialization.py:1114, in _save(obj, zip_file, pickle_module, pickle_protocol, _disable_byteorder_record)
   1113 # Now that it is on the CPU we can directly copy it into the zip file
-> 1114 zip_file.write_record(name, storage, num_bytes)

RuntimeError: [enforce fail at inline_container.cc:778] . PytorchStreamWriter failed writing file data/97: file write failed

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Cell In[9], line 29
     26 query_strategy = PredictionEntropy()
     28 active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, train)
---> 29 indices_labeled = initialize_active_learner(active_learner, train.y)

Cell In[9], line 14, in initialize_active_learner(active_learner, y_train)
     11 def initialize_active_learner(active_learner, y_train):
     13     indices_initial = random_initialization_balanced(y_train, n_samples=20)
---> 14     active_learner.initialize_data(indices_initial, y_train[indices_initial])
     16     return indices_initial

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/small_text/active_learner.py:154, in PoolBasedActiveLearner.initialize_data(self, indices_initial, y_initial, indices_ignored, indices_validation, retrain)
    151     self.indices_ignored = np.empty(shape=(0), dtype=int)
    153 if retrain:
--> 154     self._retrain(indices_validation=indices_validation)

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/small_text/active_learner.py:393, in PoolBasedActiveLearner._retrain(self, indices_validation)
    390 dataset.y = self.y
    392 if indices_validation is None:
--> 393     self._clf.fit(dataset, **self.fit_kwargs)
    394 else:
    395     indices = np.arange(self.indices_labeled.shape[0])

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/small_text/integrations/transformers/classifiers/classification.py:378, in TransformerBasedClassification.fit(self, train_set, validation_set, weights, early_stopping, model_selection, optimizer, scheduler)
    374 self.class_weights_ = self.initialize_class_weights(sub_train)
    375 self.criterion = self._get_default_criterion(self.class_weights_,
    376                                              use_sample_weights=weights is not None)
--> 378 return self._fit_main(sub_train, sub_valid, sub_train_weights, early_stopping,
    379                       model_selection, fit_optimizer, fit_scheduler)

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/small_text/integrations/transformers/classifiers/classification.py:401, in TransformerBasedClassification._fit_main(self, sub_train, sub_valid, weights, early_stopping, model_selection, optimizer, scheduler)
    398 self.model = self.model.to(self.device)
    400 with tempfile.TemporaryDirectory(dir=get_tmp_dir_base()) as tmp_dir:
--> 401     self._train(sub_train, sub_valid, weights, early_stopping, model_selection,
    402                 optimizer, scheduler, tmp_dir)
    403     self._perform_model_selection(optimizer, model_selection)
    405 return self

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/small_text/integrations/transformers/classifiers/classification.py:432, in TransformerBasedClassification._train(self, sub_train, sub_valid, weights, early_stopping, model_selection, optimizer, scheduler, tmp_dir)
    429 if not stop:
    430     start_time = datetime.datetime.now()
--> 432     train_acc, train_loss, valid_acc, valid_loss, stop = self._train_loop_epoch(epoch,
    433                                                                                 sub_train,
    434                                                                                 sub_valid,
    435                                                                                 weights,
    436                                                                                 early_stopping,
    437                                                                                 model_selection,
    438                                                                                 optimizer,
    439                                                                                 scheduler,
    440                                                                                 tmp_dir)
    442     timedelta = datetime.datetime.now() - start_time
    444     self._log_epoch(epoch, timedelta, sub_train, sub_valid, train_acc, train_loss,
    445                     valid_acc, valid_loss)

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/small_text/integrations/transformers/classifiers/classification.py:468, in TransformerBasedClassification._train_loop_epoch(self, num_epoch, sub_train, sub_valid, weights, early_stopping, model_selection, optimizer, scheduler, tmp_dir)
    465 else:
    466     validate_every = None
--> 468 train_loss, train_acc, valid_loss, valid_acc, stop = self._train_loop_process_batches(
    469     num_epoch,
    470     sub_train,
    471     sub_valid,
    472     weights,
    473     early_stopping,
    474     model_selection,
    475     optimizer,
    476     scheduler,
    477     tmp_dir,
    478     validate_every=validate_every)
    480 return train_acc, train_loss, valid_acc, valid_loss, stop

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/small_text/integrations/transformers/classifiers/classification.py:536, in TransformerBasedClassification._train_loop_process_batches(self, num_epoch, sub_train_, sub_valid_, weights, early_stopping, model_selection, optimizer, scheduler, tmp_dir, validate_every)
    529 measured_values = {
    530     'train_loss': train_loss,
    531     'train_acc': train_acc,
    532     'val_loss': valid_loss,
    533     'val_acc': valid_acc
    534 }
    535 stop = early_stopping.check_early_stop(num_epoch+1, measured_values)
--> 536 self._save_model(optimizer, model_selection, f'{num_epoch}-b0',
    537                  train_acc, train_loss, valid_acc, valid_loss, stop, tmp_dir)
    538 return train_loss, train_acc, valid_loss, valid_acc, stop

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/small_text/integrations/pytorch/classifiers/base.py:48, in PytorchModelSelectionMixin._save_model(self, optimizer, model_selection, model_id, train_acc, train_loss, valid_acc, valid_loss, stop, tmp_dir)
     45 measured_values = {'train_acc': train_acc, 'train_loss': train_loss,
     46                    'val_acc': valid_acc, 'val_loss': valid_loss}
     47 model_path = Path(tmp_dir).joinpath(f'model_{model_id}.pt')
---> 48 torch.save(self.model.state_dict(), model_path)
     49 optimizer_path = model_path.with_suffix('.pt.optimizer')
     50 torch.save(optimizer.state_dict(), optimizer_path)

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/torch/serialization.py:849, in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization, _disable_byteorder_record)
    846 _check_save_filelike(f)
    848 if _use_new_zipfile_serialization:
--> 849     with _open_zipfile_writer(f) as opened_zipfile:
    850         _save(
    851             obj,
    852             opened_zipfile,
   (...)
    855             _disable_byteorder_record,
    856         )
    857         return

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/torch/serialization.py:690, in _open_zipfile_writer_file.__exit__(self, *args)
    689 def __exit__(self, *args) -> None:
--> 690     self.file_like.write_end_of_file()
    691     if self.file_stream is not None:
    692         self.file_stream.close()

RuntimeError: [enforce fail at inline_container.cc:603] . unexpected pos 428560128 vs 428560016
1nuno added the bug label on Nov 18, 2024
chschroeder (Contributor) commented:

Hi @1nuno,

Thank you for reporting this and sorry for the inconvenience.

I have never seen this before, but a PyTorch issue (pytorch/pytorch#76108) suggests that it is related to running out of disk space or memory. Could you try to reproduce it and check the disk space and memory while doing so?

My guess would be disk space. In this version we save intermediate models for model selection (which will be disabled by default starting with 2.0.0+), which can quickly take up a lot of space.

You can also try to disable it by changing the model arguments:

clf_factory = TransformerBasedClassificationFactory(transformer_model, 
                                                    num_classes, 
                                                    kwargs=dict({'device': 'cuda', 
                                                                 'mini_batch_size': 32,
                                                                 'class_weight': 'balanced',
                                                                 'model_selection': False  # this line is new compared to the notebook example
                                                                }))

1nuno (Author) commented Nov 19, 2024

Hi, thanks for the fast reply!

I tried setting that option (model_selection) to False, but it keeps raising the same error.

I would like to help by reproducing it while logging the disk space and memory, so I can share the results here, but I don't know how to do that. Do you mind explaining how I can achieve that? Or should I just monitor it myself using htop or something similar?

PS: I am quite new to programming.

chschroeder (Contributor) commented:

From htop, I conclude that you are using a Linux/Unix-like operating system and that you are comfortable using the command line; is that right?

Usually, training is not so fast that you can't monitor it yourself. Open a second terminal window and use du -hs /tmp to get the total disk usage of the folder /tmp. There are some htop-like tools for watching disk space, but since I don't use any of them, I don't want to give a wrong recommendation here.
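If you prefer to stay inside the notebook, a minimal sketch using only Python's standard library can poll the free space of the filesystem holding /tmp (the default location mentioned below) while training runs:

```python
import shutil

# Free/total space on the filesystem that holds /tmp, where small-text
# writes its intermediary models by default. Run this in a second cell
# or loop while training to watch the space shrink.
usage = shutil.disk_usage('/tmp')
print(f"free: {usage.free / 1e9:.1f} GB of {usage.total / 1e9:.1f} GB")
```

If the free space approaches zero shortly before the RuntimeError appears, that would confirm the disk-space hypothesis.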

By default, the intermediary models are written to /tmp. You can override this path by setting the SMALL_TEXT_TEMP environment variable, e.g. to /path/to/another/location, before training starts.
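For example, from inside a notebook cell you could set the variable via os.environ before creating the classifier factory (the path below is only a placeholder from the comment above; any existing, writable directory with enough free space works):

```python
import os

# Redirect small-text's intermediary model files away from /tmp.
# This must run before training starts, and the directory must exist.
os.environ['SMALL_TEXT_TEMP'] = '/path/to/another/location'
```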
