updates notebooks for multistage with subgraphs #1022
Conversation
.../Building-and-deploying-multi-stage-RecSys/01-Building-Recommender-Systems-with-Merlin.ipynb
Maybe we can add some text right above this cell to explain what we are doing here and what we are registering to the Feast repo. I can do that.
We need to explain in this cell what's going on, since cat_wkflow = nvt_wkflow.get_subworkflow("items_cat") is new syntax.
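For context, a minimal sketch of what that new syntax does, assuming nvt_wkflow is the fitted Workflow from notebook 1, "items_cat" was registered as a named subgraph inside it, and the returned subworkflow can be applied like a regular fitted workflow (item_id_df below is an illustrative dataframe of raw item ids):

```python
import nvtabular as nvt

# Pull the item-categorification piece out of the full workflow; it can then be
# used to encode item ids without re-running the rest of the feature-engineering graph.
cat_wkflow = nvt_wkflow.get_subworkflow("items_cat")

item_ids_ds = nvt.Dataset(item_id_df)        # hypothetical dataframe of raw item ids
encoded_items = cat_wkflow.transform(item_ids_ds)
```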
@jperez999 and @karlhigley changing the view to user_attributes gives an error, since it was saved to the feature store as user_features. So we should leave it as it is.
Yeah, you can't change the name of the view; that comes from the Feast configs at the bottom of notebook 1.
You can change it if it’s changed in both places though, right?
Line #3. view="item_features",
Similarly, should the view here be item_attributes? But if we use "item_attributes" it gives an error, since we registered it as item_features in the feature store.
Same as above for this change. You cannot make this change unless we change the configs.
I think we probably should change both for the sake of clarity
Line #14. os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
I guess we need to move this line right after import os to make it effective.
As long as import os is done before it, the line does not have to come immediately after it. If you look further up in that block, os is the first import.
@jperez999 I wanted to say that if os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async" comes after import tensorflow, it might NOT work.
I think it has to happen with the os call first and import tensorflow second, so that TF picks up the config from the environment when it loads.
I say let's move os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async" right after import os and before import tensorflow, based on my experience :)
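A minimal sketch of the ordering being suggested here (nothing Merlin-specific is assumed):

```python
import os

# Set the allocator before TensorFlow is imported, so TF reads the variable when it
# initializes; setting it after `import tensorflow` may be ignored.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

import tensorflow as tf  # noqa: E402  (import intentionally placed after the env var)
```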
Line #5. item_embs_df["item_id"] = item_features["item_id"]
I am skeptical that this step is correct; it is not doing the correct mapping. If we are trying to replace encoded items with their original raw item_ids, I think we should use unique.item_id.parquet, so we can do something like:
raw_item_ids = pd.read_parquet('./categories/unique.item_id.parquet')
item_embs_df["item_id"] = raw_item_ids["item_id"].values
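Spelled out as a sketch with the assumptions made explicit (Categorify is assumed to have written its mapping to ./categories/unique.item_id.parquet, item_embs_df comes from the notebook cell above, and its rows are assumed to be ordered by encoded item id, which is what the positional assignment relies on):

```python
import pandas as pd

# The categories parquet maps encoded ids (row position) back to the raw item_id values.
raw_item_ids = pd.read_parquet("./categories/unique.item_id.parquet")

# Positional assignment: row i of item_embs_df is assumed to hold the embedding for
# encoded id i, so it receives the raw id stored at row i of the mapping file.
item_embs_df["item_id"] = raw_item_ids["item_id"].values
```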
In addition, are we sure we want to save item_embeddings for raw item_ids? Aren't we supposed to do the KNN search after the retrieval stage based on the encoded items? If yes, we should then get rid of the mapping scripts above.
You need to save the embeddings with raw_item_ids because those embeddings should correspond to the raw data. This way, when you do the lookup in the second notebook, you get the actual items. Those items are then processed through the item subworkflow to get the features for the model to rank. Suppose you wanted to use that similarity search with a different feature engineering workflow; if you had saved the data with the item ids already Categorified, that would not work.

For the item_id to raw_item_ids mapping, you are correct that it is not correctly merged. However, I pose the question: why are we sorting IDs by default? I see that we use it in the Categorify code to make NAs/nulls the first items in the dataframe... but as a user I would expect to see the categories assigned in first-come, first-served order, unless I elect to use freq_threshold or a kwarg that changes the strategy. Anyway, I will fix it for now.
Line #1. user_id_raw = ["user_id"] >> Rename(postfix='_raw') >> LambdaOp(lambda col: col.astype("int32")) >> TagAsUserFeatures()
I think we no longer need user_id_raw and item_id_raw, since we do not register them in the feature store.
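For reference, the line under discussion reflowed for readability (this is just the existing code reformatted; the ops are assumed to be imported from nvtabular.ops, as elsewhere in the notebook):

```python
from nvtabular.ops import LambdaOp, Rename, TagAsUserFeatures

user_id_raw = (
    ["user_id"]
    >> Rename(postfix="_raw")                        # keep an untouched copy of the id
    >> LambdaOp(lambda col: col.astype("int32"))     # cast to int32
    >> TagAsUserFeatures()
)
```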
That is correct but I wanted to leave it so people can see the incremental changes. I will remove it.
Line #2. store=feature_store,
I am not sure how this step works without registering any encoded ids in the feature store. I assume retrieval["candidate_ids"] returns encoded item ids? If so, how can the feature store map encoded ids to raw ids if it does not have that mapping info stored?
Those candidate IDs come from an embedding space; that embedding space is loaded into a vector database so we can conduct similarity search. The IDs returned here are the raw item IDs (aka the actual names) from the previous notebook (note they are currently misrepresented, because Categorify always sorts the values in a dataframe; I will fix that shortly). So you take the user features and pass them to the retrieval model to get the user embeddings. Those embeddings are overlaid on the item embedding space, and we run a similarity search against them to find the items that are closest (most similar) to the supplied user embedding.

Once we have the real item_ids (which should be item names/UUIDs), we can query for the features of those items in the feature store. The records we end up with are real item id/feature values that have not been processed yet. That is why the next step is to process them using the item subworkflow.
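A hedged sketch of that flow, using op and variable names that appear elsewhere in this PR; the feature-store lookup op (QueryFeast) and its exact arguments are assumptions for illustration rather than the notebook's literal code, and retrieval, feature_store, and nvt_wkflow are assumed to come from earlier cells:

```python
# retrieval["candidate_ids"] holds the raw item ids returned by the similarity search.
ranking_inputs = (
    retrieval["candidate_ids"]
    # hypothetical lookup: fetch the raw, unprocessed features for those item ids
    >> QueryFeast.from_feature_view(store=feature_store, view="item_features", column="item_id")
    # preprocess the retrieved raw features with the item subworkflow from notebook 1
    >> TransformWorkflow(nvt_wkflow.get_subworkflow("item"))
)
```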
@jperez999 as long as retrieval["candidate_ids"] returns raw item_ids, then all good.
This would be a lot easier to sort through if the dataset generation returned string ids for categorical columns, because we’d then be able to visually tell whether we were looking at raw or encoded values 🤔
@karlhigley agreed :)
I took a stab at adding string id columns to the dataset generation this morning, and quickly realized there are some bigger issues to sort out re: synthetic data and how our libraries relate to it. @oliverholworthy is going to take a look and try to figure out where we should go with that, but for now I think we're going to have to continue with integer IDs. I wonder if we could make them more visually distinct from the Categorified IDs by adding a large offset to the raw IDs (e.g. 1_000_000)?
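A tiny sketch of that idea, assuming the synthetic raw ids live in a dataframe column named item_id (names and values are illustrative):

```python
import pandas as pd

RAW_ID_OFFSET = 1_000_000  # illustrative constant

# Toy frame standing in for the synthetic dataset's raw ids.
interactions_df = pd.DataFrame({"item_id": [0, 1, 2, 3]})

# Shift raw ids far away from the small 0-based integers Categorify assigns,
# so raw vs. encoded values are easy to tell apart when eyeballing outputs.
interactions_df["item_id"] = interactions_df["item_id"] + RAW_ID_OFFSET
```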
@karlhigley if this is a nice-to-have, we don't have to prioritize it now. Your call; I am OK with moving ahead with integer IDs :)
Line #6. schema = train.schema.without(['user_id_raw', 'item_id_raw'])
If we remove user_id_raw and item_id_raw above from the NVT pipeline, we can remove .without here as well.
Let's rephrase this as:
"We start with the offline candidate retrieval stage. We are going to train a Two-Tower model for item retrieval using only positive interactions, which is why we apply the Filter operator below. To learn ..."
We already have a "Feature Engineering with NVTabular" section above, so it would be better to remove this title here.
Let's move this paragraph up, right under the "Feature Engineering with NVTabular" section above.
@jperez999 we need to update the unit test as well. I can push it to this PR.
Line #3. )
So we can save this as user_attributes if you want, like below:
user_features.to_parquet(
    os.path.join(feature_repo_path, "data", "user_attributes.parquet")
)
Similarly, we can save this as item_attributes if you want, like below:
item_features.to_parquet(
    os.path.join(feature_repo_path, "data", "item_attributes.parquet")
)
Looks good from the perspective of code changes. @rnyak, let’s update the text in a second PR
@@ -74,7 +74,7 @@ def test_func(tmpdir):
df_lib = get_lib()
train = df_lib.read_parquet(
    os.path.join("{tmpdir / "data"}/processed_nvt/", "train", "part_0.parquet"),
Is that sending a request with the transformed dataset instead of the raw user_id?
Line #4. workflow = nvt.Workflow(["item_id"] + (['item_id', 'item_brand', 'item_category', 'item_shop'] >> TransformWorkflow(nvt_wkflow.get_subworkflow("item")) >> PredictTensorflow(model_tt.first.item_block())))
This step is not really intuitive from a user perspective. We should explain why item_id appears twice, why we need to do fit_transform and not only transform, and why we are fitting again.
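For readability, here is the same line reflowed (just the existing code reformatted, which makes the two item_id usages easier to see):

```python
workflow = nvt.Workflow(
    ["item_id"]                                                   # pass the id through unchanged
    + (
        ["item_id", "item_brand", "item_category", "item_shop"]  # item columns to preprocess
        >> TransformWorkflow(nvt_wkflow.get_subworkflow("item"))
        >> PredictTensorflow(model_tt.first.item_block())
    )
)
```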
This line and comment have us way down a rabbit-hole fixing underlying issues that make this awkward 😅
This PR updates the multi-stage recsys notebooks to use subgraphs. There are quite a few changes in here. One major change is a shift in what information is stored in the feature store: it now holds raw item and raw user information. These values have not been preprocessed, so when they are retrieved in the systems graph they must go through a preprocessing step.

In the first notebook, we add usage of the Subgraph operator and create two subgraphs, one for items and one for users. We also create another subgraph for the item categorification, so that we can Categorify the item features separately when they are used to retrieve item embeddings. In the second notebook, the use of subgraphs requires the ensemble to introduce NVT workflows that handle the preprocessing of the data after it is retrieved, for both users and items. This is where we introduce the usage of subworkflows, which are based on subgraphs.
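A loose sketch of the notebook-1 structure described above. The column groupings, the Subgraph import path, and its (name, graph) constructor are assumptions made for illustration, not the notebook's exact code:

```python
import nvtabular as nvt
from nvtabular.ops import Categorify, Subgraph  # Subgraph import path assumed

# Hypothetical column groupings
item_id = ["item_id"] >> Categorify()
item_feats = ["item_category", "item_shop", "item_brand"] >> Categorify()
user_feats = ["user_id", "user_shops", "user_brands"] >> Categorify()

# Named subgraphs: "items_cat" nests inside "item" so the item-id encoding can also
# be pulled out on its own later (e.g. when exporting item embeddings).
outputs = Subgraph("user", user_feats) + Subgraph(
    "item", Subgraph("items_cat", item_id) + item_feats
)

nvt_wkflow = nvt.Workflow(outputs)

# After fitting, notebook 2 can retrieve the pieces it needs:
item_subworkflow = nvt_wkflow.get_subworkflow("item")
cat_wkflow = nvt_wkflow.get_subworkflow("items_cat")
```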
This PR depends on the following PRs:
NVIDIA-Merlin/core#349
NVIDIA-Merlin/systems#372
NVIDIA-Merlin/core#350
NVIDIA-Merlin/core#353
NVIDIA-Merlin/systems#378