Fixes retrieval encoders when query / item features have dense list features #1169

gabrielspmoreira · 2023-06-30T15:53:49Z

Goals ⚽

This PR fixes the retrieval encoder methods (e.g. to_top_k_model(), batch_predict()), which failed for some kinds of input features, such as multi-hot non-ragged (dense) item features.

Implementation Details 🚧

The retrieval models (i.e., those based on RetrievalModelV2) are composed of two towers that encode item features and query/user features separately. Each tower can be run on its own to generate the item or query embeddings.
Internally, those encoding methods use Dask DataFrame.map_partitions() to call the encoding function on every partition and collect its output (i.e., the output of the tower).
If the meta argument is not passed to Dask DataFrame.map_partitions(), Dask generates fake data based on the input dataframe schema to infer the output dataframe schema. That fake data can differ from the real data. In particular, the fake data generated for dense (non-ragged) list columns, e.g. multi-hot or embedding features, causes an error when the model's encode function is called.
This PR sets the meta argument of DataFrame.map_partitions() by manually computing the expected output dataframe schema from a sample batch of real data, which makes the encoding more robust to different types of inputs.
The PR also changes data_iterator_func(), which is used by the model encoder, to use the schema directly rather than the old Loader arguments that set categorical, continuous and target columns separately, as the previous code did not handle list features correctly.

Testing Details 🔍

Added the test_two_tower_v2_export_item_tower_embeddings_with_seq_item_features test, which uses the music_streaming_data synthetic data with multi-hot list features (ragged and non-ragged), for which the encoding functions failed before this fix.

gabrielspmoreira · 2023-06-30T15:59:32Z

merlin/datasets/entertainment/music_streaming/schema.json

@@ -72,7 +72,8 @@
    {
      "name": "item_genres",
      "valueCount": {
-        "min": "4"
+        "min": "4",
+        "max": "4"


This change makes the Loader return item_genres as a dense multi-hot tensor rather than in the ragged representation (__values, __offsets). I kept the user_genres feature as a ragged multi-hot feature, so that we test both cases.
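
To illustrate the distinction in the comment above, here is a small standard-library sketch. It assumes the rule implied by this diff: a list feature is treated as fixed-length (dense) only when valueCount sets both min and max to the same value. The helper names is_ragged and to_ragged are hypothetical and are not Merlin APIs.

```python
from typing import Dict, List, Tuple


def is_ragged(value_count: Dict[str, str]) -> bool:
    """Assumed rule: ragged unless min and max are both set and equal."""
    lo = value_count.get("min")
    hi = value_count.get("max")
    return lo is None or hi is None or int(lo) != int(hi)


def to_ragged(rows: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Flatten variable-length rows into (values, offsets), the ragged layout."""
    values: List[int] = []
    offsets: List[int] = [0]
    for row in rows:
        values.extend(row)
        offsets.append(len(values))  # end position of this row in `values`
    return values, offsets


# Schema before the diff: only "min" is set, so the column is ragged.
before = is_ragged({"min": "4"})
# Schema after the diff: min == max == 4, so every row has exactly 4 values.
after = is_ragged({"min": "4", "max": "4"})

values, offsets = to_ragged([[1, 2], [3], [4, 5, 6]])
```

Row i of the ragged layout is values[offsets[i]:offsets[i + 1]], whereas a dense feature is stored as a single rectangular tensor.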

github-actions · 2023-06-30T16:03:14Z

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-1169

gabrielspmoreira requested a review from marcromeyn June 30, 2023 15:53

gabrielspmoreira self-assigned this Jun 30, 2023

gabrielspmoreira added this to the Merlin 23.07 milestone Jun 30, 2023

gabrielspmoreira added the bug Something isn't working label Jun 30, 2023

gabrielspmoreira changed the title ~~Fixes retrieval encoders when query / item features have multi-hot features~~ Fixes retrieval encoders when query / item features have dense list features Jun 30, 2023

gabrielspmoreira requested a review from oliverholworthy June 30, 2023 15:55

Making retrieval encoders more trustworthy by setting the map_partiti…

33fb8c4

…ons(meta) with the expected output dataframe schema. This fixes the issue when multi-hot features were used in the user / item tower encoding

gabrielspmoreira force-pushed the tf/candidate_embeddings_fix branch from 208ffcb to 33fb8c4 Compare June 30, 2023 15:56

gabrielspmoreira added 2 commits June 30, 2023 16:41

Changing back the schema of music_streaming_data fixture and doing th…

fbb07d3

…e required changes within test_two_tower_v2_export_item_tower_embeddings_with_seq_item_features

Merge branch 'main' into tf/candidate_embeddings_fix

592ec73

oliverholworthy approved these changes Jul 4, 2023

gabrielspmoreira merged commit bb8b7bd into main Jul 4, 2023
