[QST] StopIteration error in model.fit at validation time #1240

Open
CarloNicolini opened this issue Jun 3, 2024 · 1 comment

❓ Questions & Help

I am using the NVIDIA Merlin TensorFlow 23.08 Docker container.

I've created my training and validation datasets and saved them as Parquet files, following the standard procedure with an NVTabular workflow (nvt.Workflow).

I am now having trouble training a two-tower model based largely on the examples provided in the notebooks, but with many more list features (such as genres in the MovieLens dataset).
Training starts and the loss decreases, but at the validation step I get an UnknownError. It appears to originate from an out-of-bounds positional index in the underlying cudf Series, which in turn is raised while a StopIteration is being handled as the validation data is evaluated.

UnknownError: Graph execution error:

2 root error(s) found.
  (0) UNKNOWN:  IndexError: single positional indexer is out-of-bounds
Traceback (most recent call last):

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 332, in _get_next_batch
    batch = next(self._batch_itr)

StopIteration


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/script_ops.py", line 267, in __call__
    ret = func(*args)

  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
    return func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/data/ops/from_generator_op.py", line 198, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/usr/local/lib/python3.10/dist-packages/keras/engine/data_adapter.py", line 902, in wrapped_generator
    for data in generator_fn():

  File "/usr/local/lib/python3.10/dist-packages/keras/engine/data_adapter.py", line 1049, in generator_fn
    yield x[i]

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/tensorflow.py", line 93, in __getitem__
    return self.__next__()

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/tensorflow.py", line 97, in __next__
    converted_batch = self.convert_batch(super().__next__())

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 261, in __next__
    return self._get_next_batch()

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 344, in _get_next_batch
    batch = next(self._batch_itr)

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 369, in make_tensors
    tensors_by_name = self._convert_df_to_tensors(gdf)

  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 548, in _convert_df_to_tensors
    if isinstance(leaves[0], list):

  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 1293, in __getitem__
    return self.loc[arg]

  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 270, in __getitem__
    return self._frame.iloc[arg]

  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 187, in __getitem__
    data = self._frame._get_elements_from_column(arg)

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/single_column_frame.py", line 398, in _get_elements_from_column
    return self._column.element_indexing(int(arg))

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 539, in element_indexing
    raise IndexError("single positional indexer is out-of-bounds")

IndexError: single positional indexer is out-of-bounds


	 [[{{node PyFunc}}]]
	 [[IteratorGetNext]]
	 [[retrieval_model_v2_1/parallel_block_5/encoder_2/prepare_features_3/prepare_list_features_3/StatefulPartitionedCall_16/RaggedFromRowSplits/RowPartitionFromRowSplits/assert_non_negative/assert_less_equal/Assert/Assert/data_0/_2484]]
  (1) UNKNOWN:  IndexError: single positional indexer is out-of-bounds
Traceback (most recent call last):

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 332, in _get_next_batch
    batch = next(self._batch_itr)

StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/script_ops.py", line 267, in __call__
    ret = func(*args)

  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
    return func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/data/ops/from_generator_op.py", line 198, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/usr/local/lib/python3.10/dist-packages/keras/engine/data_adapter.py", line 902, in wrapped_generator
    for data in generator_fn():

  File "/usr/local/lib/python3.10/dist-packages/keras/engine/data_adapter.py", line 1049, in generator_fn
    yield x[i]

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/tensorflow.py", line 93, in __getitem__
    return self.__next__()

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/tensorflow.py", line 97, in __next__
    converted_batch = self.convert_batch(super().__next__())

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 261, in __next__
    return self._get_next_batch()

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 344, in _get_next_batch
    batch = next(self._batch_itr)

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 369, in make_tensors
    tensors_by_name = self._convert_df_to_tensors(gdf)

  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 548, in _convert_df_to_tensors
    if isinstance(leaves[0], list):

  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 1293, in __getitem__
    return self.loc[arg]

  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 270, in __getitem__
    return self._frame.iloc[arg]

  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 187, in __getitem__
    data = self._frame._get_elements_from_column(arg)

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/single_column_frame.py", line 398, in _get_elements_from_column
    return self._column.element_indexing(int(arg))

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 539, in element_indexing
    raise IndexError("single positional indexer is out-of-bounds")

IndexError: single positional indexer is out-of-bounds


	 [[{{node PyFunc}}]]
	 [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_test_function_919093]

I then ran some test iterations over the validation dataset and found, to my surprise, that even mm.Loader cannot iterate over it correctly.

In other words, I've verified that I cannot consume all the batches from the dataset unless I set batch_size to 1, which divides every possible number of rows.
Indeed, this simple loop raises StopIteration:

for batch in mm.Loader(validation, batch_size=512):
    pass
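
To narrow this down, a small diagnostic along the following lines may help. It is only a sketch, not part of Merlin: `expected_batches` and `first_failing_batch` are hypothetical helper names, and the code assumes nothing about the loader beyond it being a regular Python iterable. The first helper computes how many batches a given row count should yield and how large the last one is; the second consumes a loader and reports how many batches were read before an exception, if any.

```python
from typing import Iterable, Optional, Tuple


def expected_batches(num_rows: int, batch_size: int) -> Tuple[int, int]:
    """Return (number of batches, size of the last batch), assuming no rows are dropped."""
    if num_rows < 0 or batch_size <= 0:
        raise ValueError("num_rows must be >= 0 and batch_size must be > 0")
    full, remainder = divmod(num_rows, batch_size)
    if remainder:
        # The final batch is partial and holds the leftover rows.
        return full + 1, remainder
    # Either the rows divide evenly or there are no rows at all.
    return full, batch_size if full else 0


def first_failing_batch(loader: Iterable) -> Tuple[int, Optional[BaseException]]:
    """Consume `loader`; return (batches read, exception raised or None)."""
    consumed = 0
    iterator = iter(loader)
    while True:
        try:
            next(iterator)
        except StopIteration:
            # Normal end of iteration: every batch was read.
            return consumed, None
        except Exception as exc:  # report the failure instead of propagating it
            return consumed, exc
        consumed += 1
```

Calling `first_failing_batch(mm.Loader(validation, batch_size=512))` and comparing the count with `expected_batches(number_of_rows, 512)` would show whether the failure happens on the final partial batch or earlier, for example at a boundary between chunks.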

I hope this is a mistake on my side. I did not call shuffle_by_keys on the loaded dataset, nor when creating it. Could this be related?

rnyak (Contributor) commented Jul 15, 2024

@CarloNicolini please provide a minimal reproducible example so that we can run it and reproduce the issue you are facing.

  • What are the dtypes of your list columns? Are you categorifying the list features properly with NVTabular, and are you transforming your validation data accordingly?

  • Why do you think you need shuffle_by_keys? We have shuffle_by_keys in the Groupby op for cases where one groups by a given column (say, a unique session ID) whose values are scattered over different Parquet files, but we don't recommend using it for large datasets. Are you doing something like that? You are fine-tuning a two-tower model, right? Not a session-based model, I believe.
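
As a starting point for the list-column check, something like the sketch below could be run on each list column of the validation data. It rests on an assumption: the traceback fails at `leaves[0]` inside `_convert_df_to_tensors`, which suggests, but does not confirm, that some chunk has a list column with no leaf values at all. `summarize_list_column` is a hypothetical helper, not an NVTabular or Merlin function, and it expects the column as a plain Python sequence of lists (for example, one entry of the dict returned by pyarrow's `Table.to_pydict()`).

```python
from collections import Counter
from typing import Any, Dict, Sequence


def summarize_list_column(values: Sequence[Any]) -> Dict[str, Any]:
    """Count null rows, empty lists, and leaf element types in one list column."""
    nulls = sum(1 for v in values if v is None)
    empty_lists = sum(1 for v in values if v is not None and len(v) == 0)
    # Leaf types show whether the column was categorified to integers.
    leaf_types = Counter(
        type(x).__name__ for v in values if v is not None for x in v
    )
    return {
        "rows": len(values),
        "nulls": nulls,
        "empty_lists": empty_lists,
        "leaf_types": dict(leaf_types),
        "total_leaves": sum(leaf_types.values()),
    }
```

A result with `total_leaves` equal to 0 for a file or partition would be consistent with the out-of-bounds error above, and a `leaf_types` entry other than `int` would hint that the column was not categorified.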
