
Example notebook not working #69

Open
1nuno opened this issue Nov 18, 2024 · 3 comments
Labels
bug Something isn't working

Comments


1nuno commented Nov 18, 2024

Bug description

When I try to run this example notebook (01-active-learning-for-text-classification-with-small-text-intro.ipynb), it errors on the cell corresponding to the 3rd section (III. Setting up the Active Learner).

Steps to reproduce

  1. Download the notebook.
  2. Run it in an adequate environment such as Jupyter Notebook, JupyterLab, or Google Colab.

Expected behavior

The notebook runs successfully without any errors, outputting the expected results.

Environment:

Python version: 3.11.9
small-text version: 1.4.1
small-text integrations (e.g., transformers): 4.46.2
PyTorch version (if applicable): 2.5.1+cu124

Installation (pip, conda, or from source): pip
CUDA version (if applicable): 12.4

Additional information

This is the error it outputs:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/torch/serialization.py:850, in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization, _disable_byteorder_record)
    849 with _open_zipfile_writer(f) as opened_zipfile:
--> 850     _save(
    851         obj,
    852         opened_zipfile,
    853         pickle_module,
    854         pickle_protocol,
    855         _disable_byteorder_record,
    856     )
    857     return

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/torch/serialization.py:1114, in _save(obj, zip_file, pickle_module, pickle_protocol, _disable_byteorder_record)
   1113 # Now that it is on the CPU we can directly copy it into the zip file
-> 1114 zip_file.write_record(name, storage, num_bytes)

RuntimeError: [enforce fail at inline_container.cc:778] . PytorchStreamWriter failed writing file data/97: file write failed

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Cell In[9], line 29
     26 query_strategy = PredictionEntropy()
     28 active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, train)
---> 29 indices_labeled = initialize_active_learner(active_learner, train.y)

Cell In[9], line 14, in initialize_active_learner(active_learner, y_train)
     11 def initialize_active_learner(active_learner, y_train):
     13     indices_initial = random_initialization_balanced(y_train, n_samples=20)
---> 14     active_learner.initialize_data(indices_initial, y_train[indices_initial])
     16     return indices_initial

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/small_text/active_learner.py:154, in PoolBasedActiveLearner.initialize_data(self, indices_initial, y_initial, indices_ignored, indices_validation, retrain)
    151     self.indices_ignored = np.empty(shape=(0), dtype=int)
    153 if retrain:
--> 154     self._retrain(indices_validation=indices_validation)

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/small_text/active_learner.py:393, in PoolBasedActiveLearner._retrain(self, indices_validation)
    390 dataset.y = self.y
    392 if indices_validation is None:
--> 393     self._clf.fit(dataset, **self.fit_kwargs)
    394 else:
    395     indices = np.arange(self.indices_labeled.shape[0])

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/small_text/integrations/transformers/classifiers/classification.py:378, in TransformerBasedClassification.fit(self, train_set, validation_set, weights, early_stopping, model_selection, optimizer, scheduler)
    374 self.class_weights_ = self.initialize_class_weights(sub_train)
    375 self.criterion = self._get_default_criterion(self.class_weights_,
    376                                              use_sample_weights=weights is not None)
--> 378 return self._fit_main(sub_train, sub_valid, sub_train_weights, early_stopping,
    379                       model_selection, fit_optimizer, fit_scheduler)

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/small_text/integrations/transformers/classifiers/classification.py:401, in TransformerBasedClassification._fit_main(self, sub_train, sub_valid, weights, early_stopping, model_selection, optimizer, scheduler)
    398 self.model = self.model.to(self.device)
    400 with tempfile.TemporaryDirectory(dir=get_tmp_dir_base()) as tmp_dir:
--> 401     self._train(sub_train, sub_valid, weights, early_stopping, model_selection,
    402                 optimizer, scheduler, tmp_dir)
    403     self._perform_model_selection(optimizer, model_selection)
    405 return self

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/small_text/integrations/transformers/classifiers/classification.py:432, in TransformerBasedClassification._train(self, sub_train, sub_valid, weights, early_stopping, model_selection, optimizer, scheduler, tmp_dir)
    429 if not stop:
    430     start_time = datetime.datetime.now()
--> 432     train_acc, train_loss, valid_acc, valid_loss, stop = self._train_loop_epoch(epoch,
    433                                                                                 sub_train,
    434                                                                                 sub_valid,
    435                                                                                 weights,
    436                                                                                 early_stopping,
    437                                                                                 model_selection,
    438                                                                                 optimizer,
    439                                                                                 scheduler,
    440                                                                                 tmp_dir)
    442     timedelta = datetime.datetime.now() - start_time
    444     self._log_epoch(epoch, timedelta, sub_train, sub_valid, train_acc, train_loss,
    445                     valid_acc, valid_loss)

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/small_text/integrations/transformers/classifiers/classification.py:468, in TransformerBasedClassification._train_loop_epoch(self, num_epoch, sub_train, sub_valid, weights, early_stopping, model_selection, optimizer, scheduler, tmp_dir)
    465 else:
    466     validate_every = None
--> 468 train_loss, train_acc, valid_loss, valid_acc, stop = self._train_loop_process_batches(
    469     num_epoch,
    470     sub_train,
    471     sub_valid,
    472     weights,
    473     early_stopping,
    474     model_selection,
    475     optimizer,
    476     scheduler,
    477     tmp_dir,
    478     validate_every=validate_every)
    480 return train_acc, train_loss, valid_acc, valid_loss, stop

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/small_text/integrations/transformers/classifiers/classification.py:536, in TransformerBasedClassification._train_loop_process_batches(self, num_epoch, sub_train_, sub_valid_, weights, early_stopping, model_selection, optimizer, scheduler, tmp_dir, validate_every)
    529 measured_values = {
    530     'train_loss': train_loss,
    531     'train_acc': train_acc,
    532     'val_loss': valid_loss,
    533     'val_acc': valid_acc
    534 }
    535 stop = early_stopping.check_early_stop(num_epoch+1, measured_values)
--> 536 self._save_model(optimizer, model_selection, f'{num_epoch}-b0',
    537                  train_acc, train_loss, valid_acc, valid_loss, stop, tmp_dir)
    538 return train_loss, train_acc, valid_loss, valid_acc, stop

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/small_text/integrations/pytorch/classifiers/base.py:48, in PytorchModelSelectionMixin._save_model(self, optimizer, model_selection, model_id, train_acc, train_loss, valid_acc, valid_loss, stop, tmp_dir)
     45 measured_values = {'train_acc': train_acc, 'train_loss': train_loss,
     46                    'val_acc': valid_acc, 'val_loss': valid_loss}
     47 model_path = Path(tmp_dir).joinpath(f'model_{model_id}.pt')
---> 48 torch.save(self.model.state_dict(), model_path)
     49 optimizer_path = model_path.with_suffix('.pt.optimizer')
     50 torch.save(optimizer.state_dict(), optimizer_path)

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/torch/serialization.py:849, in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization, _disable_byteorder_record)
    846 _check_save_filelike(f)
    848 if _use_new_zipfile_serialization:
--> 849     with _open_zipfile_writer(f) as opened_zipfile:
    850         _save(
    851             obj,
    852             opened_zipfile,
   (...)
    855             _disable_byteorder_record,
    856         )
    857         return

File ~/virtual_envs/iach_2/lib64/python3.11/site-packages/torch/serialization.py:690, in _open_zipfile_writer_file.__exit__(self, *args)
    689 def __exit__(self, *args) -> None:
--> 690     self.file_like.write_end_of_file()
    691     if self.file_stream is not None:
    692         self.file_stream.close()

RuntimeError: [enforce fail at inline_container.cc:603] . unexpected pos 428560128 vs 428560016
1nuno added the bug label on Nov 18, 2024
chschroeder (Contributor) commented:

Hi @1nuno,

Thank you for reporting this and sorry for the inconvenience.

I have never seen this before, but a PyTorch issue (pytorch/pytorch#76108) suggests that it is related to running out of disk space or memory. Could you try to reproduce it and check the disk space and memory while doing so?

My guess would be disk space. In this version we save intermediate models for model selection (which will be disabled by default starting with 2.0.0+), which can quickly take up a lot of space.

You can also try to disable it by changing the model arguments:

clf_factory = TransformerBasedClassificationFactory(transformer_model, 
                                                    num_classes, 
                                                    kwargs=dict({'device': 'cuda', 
                                                                 'mini_batch_size': 32,
                                                                 'class_weight': 'balanced',
                                                                 'model_selection': False  # this line is new compared to the notebook example
                                                                }))

1nuno (Author) commented Nov 19, 2024

Hi, thanks for the fast reply!

I tried setting that option (model_selection) to False, but it keeps raising the same error.

I would like to help by reproducing it while logging the disk space and memory, so I can share the results here, but I don't know how to do that. Do you mind explaining how I can achieve that? Or should I just monitor it myself using htop or something similar?

PS: I am quite new to programming.

chschroeder (Contributor) commented:

From htop, I conclude that you are using a Linux/Unix-like operating system and that you are comfortable using the command line; is that right?

Usually, training is not so fast that you can't monitor it yourself. Open a second terminal window and use du -hs /tmp to get the total disk usage of the folder /tmp. There are some htop-like tools for watching disk space, but since I don't use any of them, I don't want to give a wrong recommendation here.
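If you prefer to stay inside the notebook, a minimal sketch using only Python's standard library can poll the free space of the filesystem holding /tmp (the default location mentioned below) while training runs:

```python
import shutil

# Free/total space on the filesystem that holds /tmp, where small-text
# writes its intermediary models by default. Run this in a second cell
# or loop while training to watch the space shrink.
usage = shutil.disk_usage('/tmp')
print(f"free: {usage.free / 1e9:.1f} GB of {usage.total / 1e9:.1f} GB")
```

If the free space approaches zero shortly before the RuntimeError appears, that would confirm the disk-space hypothesis.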

By default, the intermediary models are written to /tmp. You can override this path by setting the SMALL_TEXT_TEMP environment variable, e.g. to /path/to/another/location, before training starts.
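For example, from inside a notebook cell you could set the variable via os.environ before creating the classifier factory (the path below is only a placeholder from the comment above; any existing, writable directory with enough free space works):

```python
import os

# Redirect small-text's intermediary model files away from /tmp.
# This must run before training starts, and the directory must exist.
os.environ['SMALL_TEXT_TEMP'] = '/path/to/another/location'
```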
