Fixes retrieval encoders when query / item features have dense list features #1169

Merged
merged 3 commits into from
Jul 4, 2023

Conversation

gabrielspmoreira
Member

@gabrielspmoreira gabrielspmoreira commented Jun 30, 2023

Goals ⚽

This PR fixes the retrieval encoder methods (e.g. to_top_k_model(), batch_predict()), which were failing for some kinds of input features, e.g. multi-hot non-ragged item features.

Implementation Details 🚧

  • The retrieval models (i.e. those based on RetrievalModelV2) are composed of two towers that encode item features and query/user features separately. This allows each tower to be run on its own to generate the item or query embeddings.
  • Internally, those encoding methods use Dask DataFrame.map_partitions() to call the encoding function on every partition and collect its output (i.e. the output of the tower).
  • If the meta argument is not passed to DataFrame.map_partitions(), Dask generates fake data based on the input dataframe schema and uses it to infer the output dataframe schema. That fake data may differ from the real data. In particular, the fake data generated for dense (non-ragged) list columns, e.g. multi-hot or embedding features, causes an error when the model's encode function is called.
  • This PR sets the meta argument of DataFrame.map_partitions() by manually computing the expected output dataframe schema from a sample batch of real data, which makes the encoding more robust to different types of inputs.
  • The PR also changes data_iterator_func(), which is used by the model encoder, to use the schema directly rather than the old Loader arguments that set categorical, continuous and target columns separately, because the previous code did not handle list features correctly.

Testing Details 🔍

  • Added the test_two_tower_v2_export_item_tower_embeddings_with_seq_item_features test, which uses the music_streaming_data synthetic data and contains multi-hot list features (ragged and non-ragged), for which the encoding functions were failing before this fix.

@gabrielspmoreira gabrielspmoreira self-assigned this Jun 30, 2023
@gabrielspmoreira gabrielspmoreira added this to the Merlin 23.07 milestone Jun 30, 2023
@gabrielspmoreira gabrielspmoreira added the bug Something isn't working label Jun 30, 2023
@gabrielspmoreira gabrielspmoreira changed the title from "Fixes retrieval encoders when query / item features have multi-hot features" to "Fixes retrieval encoders when query / item features have dense list features" Jun 30, 2023
…ons(meta) with the expected output dataframe schema. This fixes the issue when multi-hot features were used in the user / item tower encoding
@@ -72,7 +72,8 @@
 {
 "name": "item_genres",
 "valueCount": {
-"min": "4"
+"min": "4",
+"max": "4"
Member Author

This change makes the Loader return item_genres as a dense multi-hot tensor rather than a ragged representation (__values, __offsets). I kept the user_genres feature as a ragged multi-hot feature, so that we test both cases.
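
To make the two representations concrete, here is a minimal plain-Python sketch. It is not Merlin code: the helper names and the toy values are hypothetical, and it assumes the common convention that the offsets array has one more entry than there are rows, with row i spanning values[offsets[i]:offsets[i + 1]]. It also illustrates why equal min and max value counts in the schema allow a dense layout.

```python
from typing import List, Sequence


def is_fixed_length(min_count: int, max_count: int) -> bool:
    """A list feature can be laid out densely when every row has the same length."""
    return min_count == max_count


def ragged_to_rows(values: Sequence[int], offsets: Sequence[int]) -> List[List[int]]:
    """Split a flat values array into rows using row offsets (assumed layout)."""
    return [list(values[start:end]) for start, end in zip(offsets[:-1], offsets[1:])]


def rows_to_dense(rows: Sequence[Sequence[int]], width: int) -> List[List[int]]:
    """Return a rectangular batch, failing loudly if any row has the wrong length."""
    for row in rows:
        if len(row) != width:
            raise ValueError(f"expected {width} values per row, got {len(row)}")
    return [list(row) for row in rows]


# Ragged feature (like user_genres in the test): rows of varying length.
user_genres_values = [3, 7, 1, 4, 9, 2]
user_genres_offsets = [0, 2, 3, 6]
user_rows = ragged_to_rows(user_genres_values, user_genres_offsets)

# Fixed-length feature (like item_genres with min = max = 4): a dense batch.
item_genres_dense = rows_to_dense([[1, 2, 3, 4], [5, 6, 7, 8]], width=4)
```

In this sketch the ragged feature needs two arrays per batch, while the fixed-length one fits in a single rectangular array, which is the distinction the schema change above relies on.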

@github-actions
Documentation preview

https://nvidia-merlin.github.io/models/review/pr-1169

…e required changes within test_two_tower_v2_export_item_tower_embeddings_with_seq_item_features
@gabrielspmoreira gabrielspmoreira merged commit bb8b7bd into main Jul 4, 2023