
updates notebooks for multistage with subgraphs #1022

Merged

Conversation

@jperez999 (Collaborator) commented Jun 21, 2023

This PR updates the multi-stage recsys notebooks to use subgraphs. There are quite a few changes, the largest being a shift in what information is stored in the feature store: it now holds raw item and raw user information. These values have not been preprocessed, so when they are retrieved in the systems graph they must go through a preprocessing step. In the first notebook, we add the Subgraph operator and create two subgraphs, one for item features and one for user features. We also create a third subgraph for item categorification, so that the item features can be categorified separately when they are used to retrieve item embeddings. In the second notebook, the use of subgraphs requires the ensemble to include NVTabular workflows that preprocess the user and item data after it is retrieved. This is where we introduce subworkflows, which are built on subgraphs.
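As a rough, hedged sketch of the pattern described above (the column names, ops, and the exact Subgraph import path are illustrative assumptions, not the notebook's code):

```python
import nvtabular as nvt
from nvtabular.ops import Categorify, TagAsUserFeatures, TagAsItemFeatures
from merlin.dag.ops.subgraph import Subgraph  # import path may differ by Merlin version

# Assumed raw columns, for illustration only.
user_branch = ["user_id", "user_age"] >> Categorify() >> TagAsUserFeatures()
item_branch = ["item_id", "item_category"] >> Categorify() >> TagAsItemFeatures()

# Named subgraphs can later be pulled back out of the fitted workflow,
# e.g. nvt_wkflow.get_subworkflow("user") / get_subworkflow("item").
subgraph_user = Subgraph("user", user_branch)
subgraph_item = Subgraph("item", item_branch)

nvt_wkflow = nvt.Workflow(subgraph_user + subgraph_item)
nvt_wkflow.fit(nvt.Dataset(train_df))  # train_df is assumed to exist
```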

This PR depends on the following PRs:
NVIDIA-Merlin/core#349
NVIDIA-Merlin/systems#372
NVIDIA-Merlin/core#350
NVIDIA-Merlin/core#353
NVIDIA-Merlin/systems#378

@jperez999 added the enhancement (New feature or request) and chore (Infrastructure update) labels Jun 21, 2023
@jperez999 added this to the Merlin 23.07 milestone Jun 21, 2023
@jperez999 self-assigned this Jun 21, 2023
@review-notebook-app commented:

Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter notebooks.

@github-actions commented:

Documentation preview: https://nvidia-merlin.github.io/Merlin/review/pr-1022

@rnyak (Contributor) commented on a notebook cell Jun 23, 2023 (via ReviewNB):

Maybe we can add some text right above this cell to explain what we are doing here and what we are registering to the feast repo. I can do that.

@rnyak (Contributor) commented on a notebook cell Jun 23, 2023 (via ReviewNB):

We need to explain what's going on in this cell, since cat_wkflow = nvt_wkflow.get_subworkflow("items_cat") is new syntax.
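For readers hitting this syntax for the first time, a minimal hedged sketch of what the call does; the serving-graph usage below is an assumption modeled on the TransformWorkflow pattern quoted later in this review:

```python
from merlin.systems.dag.ops.workflow import TransformWorkflow

# Pull the subgraph registered under the name "items_cat" out of the
# fitted NVTabular workflow as a standalone sub-workflow.
cat_wkflow = nvt_wkflow.get_subworkflow("items_cat")

# Illustrative use in a serving graph: categorify raw item ids on the fly.
item_ids_encoded = ["item_id"] >> TransformWorkflow(cat_wkflow)
```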

@rnyak (Contributor) commented on a notebook cell Jun 23, 2023 (via ReviewNB):

Line #5: view="user_features",

I guess view should be user_attributes?

A contributor replied:

@jperez999 and @karlhigley, changing view to user_attributes gives an error since it was saved to the feature store as user_features, so we should leave it as it is.

@jperez999 (Collaborator, author) replied:

Yeah, you can't change the name of the view; that comes from the feast configs at the bottom of notebook 1.

A contributor replied:

You can change it if it's changed in both places though, right?
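To make the coupling concrete, a hedged sketch following the QueryFeast pattern from Merlin Systems (the exact arguments and variable names here are assumptions): the view name used in the serving graph has to match the FeatureView name registered in the feast repo in notebook 1, so renaming one without the other produces the error discussed above.

```python
from merlin.systems.dag.ops.feast import QueryFeast

# "user_features" must match the FeatureView name registered in the
# feast configs at the bottom of notebook 1.
user_attributes = user_ids >> QueryFeast.from_feature_view(
    store=feature_store,
    view="user_features",
    column="user_id",
)
```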

@rnyak (Contributor) commented on a notebook cell Jun 23, 2023 (via ReviewNB):

Line #3: view="item_features",

Similarly, view here should be item_attributes? But if we use "item_attributes" it gives an error, since we registered it as item_features in the feature store.

@jperez999 (Collaborator, author) replied:

Same as above for this change. You cannot make this change unless we change the configs.

A contributor replied:

I think we probably should change both for the sake of clarity.

@rnyak (Contributor) commented on a notebook cell Jun 26, 2023 (via ReviewNB):

Line #14: os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

I guess we need to move this line right after import os to make it effective.

@jperez999 (Collaborator, author) replied:

As long as import os is done before it, this line does not have to come right after it. If you look further up in that block, os is the first import.

@rnyak (Contributor) replied Jun 28, 2023:

@jperez999 I wanted to say that if os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async" comes after import tensorflow, it might NOT work.

A contributor replied:

I think it has to happen with the os call first and import tensorflow second, so that TF picks up the config from the environment when it loads.

@rnyak (Contributor) replied Jun 28, 2023:

I say let's move os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async" right after import os and before import tensorflow, based on my experience :)
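A minimal sketch of the ordering being discussed; the point is simply that the environment variable has to be set before TensorFlow is imported, since TF reads it when it initializes:

```python
import os

# Must be set before TensorFlow is imported so that TF picks up the
# async CUDA allocator when it initializes the GPU runtime.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

import tensorflow as tf  # noqa: E402
```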

@rnyak (Contributor) commented on a notebook cell Jun 26, 2023 (via ReviewNB):

Line #5: item_embs_df["item_id"] = item_features["item_id"]

I am skeptical that this step is correct; it is not doing the correct mapping. If we are trying to replace encoded items with their original raw item_ids, I think we should use unique.item_id.parquet, so we can do something like:

raw_item_ids = pd.read_parquet('./categories/unique.item_id.parquet')
item_embs_df["item_id"] = raw_item_ids["item_id"].values

@rnyak (Contributor) added Jun 26, 2023:

In addition, are we sure we want to save item embeddings for raw item_ids? Aren't we supposed to do the KNN search after the retrieval stage based on the encoded items? If yes, we should then get rid of the mapping scripts above.

@jperez999 (Collaborator, author) replied:

You need to save the embeddings with raw item_ids because those embeddings should correspond to the raw data. That way, when you do the lookup in the second notebook, you are getting the actual items. Those items are then processed through the item subworkflow to get the features the model needs to rank them. Suppose you wanted to use that similarity search with a different feature engineering workflow: if you had stored the data with the item IDs already categorified, that would not work. For the item_id to raw item_id mapping, you are correct that it is not merged correctly. However, I pose the question: why are we sorting IDs by default? I see that Categorify uses it to make NAs/nulls the first items in the dataframe, but as a user I would expect the categories to be assigned in first-come, first-served order, unless I elect to use freq_threshold or a kwarg that changes the strategy. Anyway, I will fix it for now.

@rnyak (Contributor) commented on a notebook cell Jun 27, 2023 (via ReviewNB):

Line #1: user_id_raw = ["user_id"] >> Rename(postfix='_raw') >> LambdaOp(lambda col: col.astype("int32")) >> TagAsUserFeatures()

I think we no longer need user_id_raw and item_id_raw since we do not register them in the feature store.

@jperez999 (Collaborator, author) replied:

That is correct, but I wanted to leave it so people can see the incremental changes. I will remove it.

@rnyak (Contributor) commented on a notebook cell Jun 27, 2023 (via ReviewNB):

Line #2: store=feature_store,

I am not sure how this step works without registering any encoded IDs in the feature store. I assume retrieval["candidate_ids"] returns encoded item IDs? If so, how can the feature store map encoded IDs to the raw IDs if it does not have such mapping info stored?

@jperez999 (Collaborator, author) replied:

Those candidate IDs come from an embedding space, and that embedding space is loaded into a vector database so we can run similarity search. The IDs returned here are the raw item IDs (i.e., the actual item names) from the previous notebook (note they are currently misrepresented, because Categorify always sorts the values in a dataframe; I will fix that shortly). So you take the user features, pass them to the retrieval model to get the user embedding, project that embedding into the item embedding space, and run a similarity search to find the items closest (most similar) to the supplied user embedding. Once we have the real item_ids (which should be item names/UUIDs), we can query the feature store for the features of those items. The records we end up with are real item ID/feature values that have not been processed yet, which is why the next step is to process them with the item subworkflow.
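For orientation, a hedged sketch of the serving flow described above, written with Merlin Systems-style operators; the operator arguments and variable names (query_tower, faiss_index_path, user_attributes, etc.) are illustrative assumptions rather than the exact notebook code:

```python
from merlin.systems.dag.ops.faiss import QueryFaiss
from merlin.systems.dag.ops.feast import QueryFeast
from merlin.systems.dag.ops.tensorflow import PredictTensorflow
from merlin.systems.dag.ops.workflow import TransformWorkflow

# 1. Preprocess the raw user attributes with the "user" subworkflow,
#    then run the query (user) tower to get a user embedding.
user_embedding = (
    user_attributes
    >> TransformWorkflow(nvt_wkflow.get_subworkflow("user"))
    >> PredictTensorflow(query_tower)  # query_tower: assumed loaded user tower
)

# 2. Similarity search in the item-embedding index returns raw candidate item ids.
candidate_ids = user_embedding >> QueryFaiss(faiss_index_path, topk=100)

# 3. Look up the raw (unprocessed) item features for those ids in the feature
#    store, then preprocess them with the "item" subworkflow before ranking.
ranking_features = (
    candidate_ids
    >> QueryFeast.from_feature_view(
        store=feature_store, view="item_features", column="item_id"
    )
    >> TransformWorkflow(nvt_wkflow.get_subworkflow("item"))
)
```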

A contributor replied:

@jperez999 as long as retrieval["candidate_ids"] returns raw item_ids, then all good.

A contributor replied:

This would be a lot easier to sort through if the dataset generation returned string IDs for categorical columns, because then we'd be able to tell visually whether we were looking at raw or encoded values 🤔

A contributor replied:

@karlhigley agreed :)

A contributor replied:

I took a stab at adding string ID columns to the dataset generation this morning, and quickly realized there are some bigger issues to sort out regarding synthetic data and how our libraries relate to it. @oliverholworthy is going to take a look and try to figure out where we should go with that, but for now I think we're going to have to continue with integer IDs. I wonder if we could make them more visually distinct from the Categorify-encoded IDs by adding a large offset to the raw IDs (e.g. 1_000_000)?

@rnyak (Contributor) replied Jun 30, 2023:

@karlhigley if this is a nice-to-have, we don't have to prioritize it now. Your call; I am OK with moving forward with integer IDs :)
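If the offset idea floated above were adopted, it would be a one-line change at data generation time; a minimal sketch, assuming the synthetic interactions live in a pandas/cuDF dataframe with an integer item_id column:

```python
RAW_ID_OFFSET = 1_000_000

# Shift raw ids into a visibly different range from the small integers
# produced by Categorify, so raw vs. encoded values are easy to tell apart.
interactions_df["item_id"] = interactions_df["item_id"] + RAW_ID_OFFSET
```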

@rnyak (Contributor) commented on a notebook cell Jun 27, 2023 (via ReviewNB):

Line #6: schema = train.schema.without(['user_id_raw', 'item_id_raw'])

If we remove user_id_raw and item_id_raw from the NVT pipeline above, we can remove .without here as well.

@jperez999 linked an issue Jun 27, 2023 that may be closed by this pull request.
@rnyak (Contributor) commented on a notebook cell Jun 27, 2023 (via ReviewNB):

Let's rephrase this as:

"We start with the offline candidate retrieval stage. We are going to train a Two-Tower model for item retrieval using only positive interactions, which is why we apply the Filter operator below. To learn ..."

@rnyak (Contributor) commented on a notebook cell Jun 27, 2023 (via ReviewNB):

We already have a "Feature Engineering with NVTabular" section above, so it is better to remove this title here.

@rnyak (Contributor) commented on a notebook cell Jun 27, 2023 (via ReviewNB):

Let's move this paragraph up, right under the "Feature Engineering with NVTabular" section above.

@rnyak (Contributor) commented Jun 27, 2023:

@jperez999 we need to update the unit test as well. I can push it to this PR.

@rnyak (Contributor) commented on a notebook cell Jun 28, 2023 (via ReviewNB):

Line #3: )

So we can save this as user_attributes if you want, like below:

user_features.to_parquet(
    os.path.join(feature_repo_path, "data", "user_attributes.parquet")
)

@rnyak (Contributor) commented on a notebook cell Jun 28, 2023 (via ReviewNB):

Similarly, we can save this as item_attributes if you want, like below:

item_features.to_parquet(
    os.path.join(feature_repo_path, "data", "item_attributes.parquet")
)

@karlhigley (Contributor) left a review comment:

Looks good from the perspective of code changes. @rnyak, let's update the text in a second PR.

@@ -74,7 +74,7 @@ def test_func(tmpdir):
df_lib = get_lib()
train = df_lib.read_parquet(
os.path.join(f"{tmpdir / 'data'}/processed_nvt/", "train", "part_0.parquet"),
A contributor commented:

Is that sending a request using the transformed dataset instead of the raw user_id?

@jperez999 merged commit 237fe4b into NVIDIA-Merlin:main Jul 5, 2023
@rnyak (Contributor) commented on a notebook cell Jul 5, 2023 (via ReviewNB):

Line #4: workflow = nvt.Workflow(["item_id"] + (['item_id', 'item_brand', 'item_category', 'item_shop'] >> TransformWorkflow(nvt_wkflow.get_subworkflow("item")) >> PredictTensorflow(model_tt.first.item_block())))

This step is not really intuitive from a user perspective. We should explain why item_id appears twice, why we need to do fit_transform rather than only transform, and why we are fitting again.

A contributor replied:

This line and comment have us way down a rabbit hole fixing underlying issues that make this awkward 😅
