diff --git a/tables/automl/notebooks/census_income_prediction/getting_started_notebook.ipynb b/tables/automl/notebooks/census_income_prediction/getting_started_notebook.ipynb index 2e963d2b1553..48af9c77c529 100644 --- a/tables/automl/notebooks/census_income_prediction/getting_started_notebook.ipynb +++ b/tables/automl/notebooks/census_income_prediction/getting_started_notebook.ipynb @@ -1,23 +1,51 @@ { "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Copyright 2019 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, { "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "m26YhtBMvVWA" - }, + "metadata": {}, "source": [ "# Getting started with AutoML Tables\n", "\n", - "To use this Colab notebook, copy it to your own Google Drive and open it with [Colaboratory](https://colab.research.google.com/) (or Colab). To run a cell hold the Shift key and press the Enter key (or Return key). Colab automatically displays the return value of the last line in each cell. Refer to [this page](https://colab.research.google.com/notebooks/welcome.ipynb) for more information on Colab.\n", - "\n", - "You can run a Colab notebook on a hosted runtime in the Cloud. The hosted VM times out after 90 minutes of inactivity and you will lose all the data stored in the memory including your authentication data. 
If your session gets disconnected (for example, because you closed your laptop) for less than the 90 minute inactivity timeout limit, press 'RECONNECT' on the top right corner of your notebook and resume the session. After Colab timeout, you'll need to\n", - "\n", - "1. Re-run the initialization and authentication.\n", - "2. Continue from where you left off. You may need to copy-paste the value of some variables such as the `dataset_name` from the printed output of the previous cells.\n", - "\n", - "Alternatively you can connect your Colab notebook to a [local runtime](https://research.google.com/colaboratory/local-runtimes.html).\n", - "\n" + "\n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \"Google Read on cloud.google.com\n", + " \n", + " \n", + " \n", + " \"Colab Run in Colab\n", + " \n", + " \n", + " \n", + " \"GitHub\n", + " View on GitHub\n", + " \n", + "
" ] }, { @@ -27,10 +55,31 @@ "id": "b--5FDDwCG9C" }, "source": [ - "## 1. Project set up\n", + "## Overview\n", "\n", + "[Google’s AutoML](https://cloud.google.com/automl-tables/) provides the ability for software engineers to build high quality models without the need to know how to build, train models, or deploy/serve models on the cloud. Instead, one only needs to know about dataset curation, evaluating results, and the how-to steps.\n", "\n", - "\n" + "\"AutoML\n", + "\n", + "AutoML Tables is a supervised learning service. This means that you train a machine learning model with example data. AutoML Tables uses tabular (structured) data to train a machine learning model to make predictions on new data. One column from your dataset, called the target, is what your model will learn to predict. Some number of the other data columns are inputs (called features) that the model will learn patterns from. \n", + "\n", + "In this notebook, we will use the [Google Cloud SDK AutoML Python API](https://cloud.google.com/automl-tables/docs/client-libraries) to create a binary classification model using a real dataset from the [Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income).\n", + "\n", + "We will provide the training and evaluation dataset, once dataset is created we will use AutoML API to create the model and then perform predictions to predict if a given individual has an income above or below 50k, given information like the person's age, education level, marital-status, occupation etc... 
\n", + "\n", + "For setting up a Google Cloud Platform (GCP) account for using AutoML, please see the online documentation for [Getting Started](https://cloud.google.com/automl-tables/docs/quickstart).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dataset\n", + "\n", + "This tutorial uses the [United States Census Income\n", + "Dataset](https://archive.ics.uci.edu/ml/datasets/census+income) provided by the\n", + "[UC Irvine Machine Learning\n", + "Repository](https://archive.ics.uci.edu/ml/index.php), containing information about people from a 1994 Census database, including age, education, marital status, occupation, and whether they make more than $50,000 a year. The dataset consists of over 30k rows, where each row corresponds to a different person. For a given row, there are 14 features that the model uses to predict the income of the person. A few of the features are named above, and the exhaustive list can be found at the dataset link above." ] }, { @@ -40,20 +89,19 @@ "id": "AZs0ICgy4jkQ" }, "source": [ - "Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to\n", + "## Before you begin\n", + "\n", + "Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to:\n", "* Create a Google Cloud Platform (GCP) project.\n", "* Enable billing.\n", "* Apply to whitelist your project.\n", "* Enable AutoML API.\n", - "* Enable AutoML Tables API.\n", - "* Create a service account, grant required permissions, and download the service account private key.\n", "\n", - "You also need to upload your data into Google Cloud Storage (GCS) or BigQuery. For example, to use GCS as your data source\n", - "* Create a GCS bucket.\n", - "* Upload the training and batch prediction files.\n", + "You also need to upload your data into [Google Cloud Storage](https://cloud.google.com/storage/) (GCS) or [BigQuery](https://cloud.google.com/bigquery/). 
\n", + "For example, to use GCS as your data source:\n", + "\n", "\n", - "\n", - "**Warning:** Private keys must be kept secret. If you expose your private key it is recommended to revoke it immediately from the Google Cloud Console." + "* [Create a GCS bucket](https://cloud.google.com/storage/docs/creating-buckets).\n", + "* Upload the training and batch prediction files." ] }, { @@ -71,13 +119,127 @@ }, { "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "rstRPH9SyZj_" - }, + "metadata": {}, + "source": [ + "## Instructions\n", + "\n", + "This notebook walks you through the following steps to train and deploy a model in\n", + "AutoML:\n", + "\n", + "\n", + " * Set up your local development environment (optional).\n", + " * Set the project ID and compute region.\n", + " * Authenticate your GCP account.\n", + " * Import the Python API SDK and create a client instance.\n", + " * Create a dataset instance and import the data.\n", + " * Create a model instance and train the model.\n", + " * Evaluate the trained model.\n", + " * Deploy the model on the cloud for online predictions.\n", + " * Make online predictions.\n", + " * Undeploy the model.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Set up your local development environment\n", + "\n", + "**If you are using Colab or AI Platform Notebooks**, your environment already meets\n", + "all the requirements to run this notebook. You can skip this step." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Set up your GCP project\n", + "\n", + "**The following steps are required, regardless of your notebook environment.**\n", + "\n", + "1. [Select or create a GCP project.](https://console.cloud.google.com/cloud-resource-manager)\n", + "\n", + "2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n", + "\n", + "3. 
[Enable the AutoML API (\"AutoML API\")](https://console.cloud.google.com/flows/enableapi?apiid=automl.googleapis.com)\n", + "\n", + "4. Enter your project ID in the cell below. Then run the cell to make sure the\n", + "Cloud SDK uses the right project for all the commands in this notebook.\n", + "\n", + "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ - "## 2. Initialize and authenticate\n", - "This section runs intialization and authentication. It creates an authenticated session which is required for running any of the following sections." + "PROJECT_ID = \"\" # @param {type:\"string\"}\n", + "COMPUTE_REGION = \"us-central1\" # Currently only supported region.\n", + "! gcloud config set project $PROJECT_ID" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Authenticate your GCP account\n", + "\n", + "**If you are using AI Platform Notebooks**, your environment is already\n", + "authenticated. Skip this step." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**If you are using Colab**, run the cell below and follow the instructions\n", + "when prompted to authenticate your account via oAuth.\n", + "\n", + "**Otherwise**, follow these steps:\n", + "\n", + "1. In the GCP Console, go to the [**Create service account key**\n", + " page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).\n", + "\n", + "2. From the **Service account** drop-down list, select **New service account**.\n", + "\n", + "3. In the **Service account name** field, enter a name.\n", + "\n", + "4. From the **Role** drop-down list, select\n", + " **AutoML > AutoML Admin** and\n", + " **Storage > Storage Object Admin**.\n", + "\n", + "5. Click *Create*. 
A JSON file that contains your key downloads to your\n", + "local environment.\n", + "\n", + "6. Enter the path to your service account key as the\n", + "`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "\n", + "# If you are running this notebook in Colab, run this cell and follow the\n", + "# instructions to authenticate your GCP account. This provides access to your\n", + "# Cloud Storage bucket and lets you submit training jobs and prediction\n", + "# requests.\n", + "\n", + "if 'google.colab' in sys.modules: \n", + " from google.colab import files\n", + " keyfile_upload = files.upload()\n", + " keyfile = list(keyfile_upload.keys())[0]\n", + " %env GOOGLE_APPLICATION_CREDENTIALS $keyfile\n", + "# If you are running this notebook locally, replace the string below with the\n", + "# path to your service account key and run this cell to authenticate your GCP\n", + "# account.\n", + "else:\n", + " %env GOOGLE_APPLICATION_CREDENTIALS /path/to/service_account.json" ] }, { @@ -93,7 +255,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -101,81 +263,67 @@ }, "outputs": [], "source": [ - "#@title Install AutoML Tables client library { vertical-output: true }\n", - "!pip install google-cloud-automl" + "%pip install google-cloud-automl" ] }, { "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "eVFsPPEociwF" - }, + "metadata": {}, "source": [ - "### Authenticate using service account key\n", - "Run the following cell. Click on the 'Choose Files' button and select the service account private key file. If your Service Account key file or folder is hidden, you can reveal it in a Mac by pressing the Command + Shift + . combo." 
+ "### Import libraries and define constants\n", + "\n", + "First, import the Python libraries required for training.\n", + "The code example below demonstrates importing the AutoML Python API module into a Python script. " ] }, { "cell_type": "code", - "execution_count": 0, - "metadata": { - "colab": {}, - "colab_type": "code", - "id": "u-kCqysAuaJk" - }, + "execution_count": null, + "metadata": {}, "outputs": [], "source": [ - "#@title Authenticate and create a client. { vertical-output: true }\n", + "# AutoML library\n", + "from google.cloud import automl_v1beta1 as automl\n", "\n", - "from google.colab import files\n", - "from google.cloud import automl_v1beta1\n", + "import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types\n", + "import matplotlib.pyplot as plt" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Quickstart for AutoML Tables\n", "\n", - "# Upload service account key\n", - "keyfile_upload = files.upload()\n", - "keyfile_name = list(keyfile_upload.keys())[0]\n", - "# Authenticate and create an AutoML client.\n", - "client = automl_v1beta1.AutoMlClient.from_service_account_file(keyfile_name)\n", - "# Authenticate and create a prediction service client.\n", - "prediction_client = automl_v1beta1.PredictionServiceClient.from_service_account_file(keyfile_name)" + "This section of the tutorial walks you through creating an AutoML client." ] }, { "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "s3F2xbEJdDvN" - }, + "metadata": {}, "source": [ - "### Test" + "Next, create an instance of `AutoMlClient`.\n", + "This client instance is the HTTP request/response interface between the Python script and the GCP AutoML service." ] }, { "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "0uX4aJYUiXh5" - }, + "metadata": {}, "source": [ - "Enter your GCP project ID." 
+ "### Create API Client to AutoML Service\n", + "\n", + "**If you are using AI Platform Notebooks or Colab** and your environment is already\n", + "authenticated through `GOOGLE_APPLICATION_CREDENTIALS`, run this step." ] }, { "cell_type": "code", - "execution_count": 0, - "metadata": { - "colab": {}, - "colab_type": "code", - "id": "6R4h5HF1Dtds" - }, + "execution_count": null, + "metadata": {}, "outputs": [], "source": [ - "#@title GCP project ID and location\n", - "\n", - "project_id = 'my-project-trial5' #@param {type:'string'}\n", - "location = 'us-central1'\n", - "location_path = client.location_path(project_id, location)\n", - "location_path" + "client = automl.AutoMlClient()\n", + "prediction_client = automl.PredictionServiceClient()" ] }, { @@ -185,12 +333,49 @@ "id": "rUlBcZ3OfWcJ" }, "source": [ - "To test whether your project set up and authentication steps were successful, run the following cell to list your datasets." + "**If you are using Colab or Jupyter** and you have created a service account,\n", + "you can instead create the AutoML clients directly from its key file.\n", + "\n", + "The commented-out cell below shows this alternative way to create the API clients." 
] }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# client = automl.AutoMlClient.from_service_account_file('/path/to/service_account.json')\n", + "# prediction_client = automl.PredictionServiceClient.from_service_account_file('/path/to/service_account.json')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Get the GCP location of your project.\n", + "project_location = client.location_path(PROJECT_ID, COMPUTE_REGION)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "List datasets in Project:" + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "cellView": "both", "colab": {}, @@ -199,11 +384,9 @@ }, "outputs": [], "source": [ - "#@title List datasets. { vertical-output: true }\n", - "\n", - "list_datasets_response = client.list_datasets(location_path)\n", - "datasets = {\n", - " dataset.display_name: dataset.name for dataset in list_datasets_response}\n", + "# List datasets in Project\n", + "list_datasets = client.list_datasets(project_location)\n", + "datasets = { dataset.display_name: dataset.name for dataset in list_datasets }\n", "datasets" ] }, @@ -219,7 +402,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "cellView": "both", "colab": {}, @@ -228,10 +411,8 @@ }, "outputs": [], "source": [ - "#@title List models. 
{ vertical-output: true }\n", - "\n", - "list_models_response = client.list_models(location_path)\n", - "models = {model.display_name: model.name for model in list_models_response}\n", + "list_models = client.list_models(project_location)\n", + "models = { model.display_name: model.name for model in list_models }\n", "models" ] }, @@ -248,16 +429,6 @@ "\n" ] }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "ODt86YuVDZzm" - }, - "source": [ - "## 3. Import training data" - ] - }, { "cell_type": "markdown", "metadata": { @@ -265,7 +436,7 @@ "id": "XwjZc9Q62Fm5" }, "source": [ - "### Create dataset" + "### Create a dataset" ] }, { @@ -275,12 +446,16 @@ "id": "_JfZFGSceyE_" }, "source": [ + "Now we are ready to create a dataset instance (on GCP) using the client method `create_dataset()`. This method takes two parameters: **project_location** (see above) and **dataset_settings**. \n", + "\n", + "The **dataset_settings** parameter is a dictionary with two keys: **display_name** and **tables_dataset_metadata**. A value must be specified for **display_name**, which must be a string consisting only of alphanumeric characters and the underscore. The display name is what one would see in the web UI of the AutoML service.\n", + "\n", "Select a dataset display name and pass your table source information to create a new dataset." 
] }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -288,13 +463,12 @@ }, "outputs": [], "source": [ - "#@title Create dataset { vertical-output: true, output-height: 200 }\n", - "\n", - "dataset_display_name = 'test_deployment' #@param {type: 'string'}\n", + "# Create dataset\n", "\n", - "create_dataset_response = client.create_dataset(\n", - " location_path,\n", - " {'display_name': dataset_display_name, 'tables_dataset_metadata': {}})\n", + "dataset_display_name = 'census' \n", + "dataset_settings = {'display_name': dataset_display_name, \n", + " 'tables_dataset_metadata': {}}\n", + "create_dataset_response = client.create_dataset(project_location, dataset_settings)\n", "dataset_name = create_dataset_response.name\n", "create_dataset_response" ] @@ -317,16 +491,29 @@ }, "source": [ "You can import your data to AutoML Tables from GCS or BigQuery. For this tutorial, you can use the [census_income dataset](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income.csv) \n", - "as your training data. You can create a GCS bucket and upload the data intofa your bucket. The URI for your file is `gs://BUCKET_NAME/FOLDER_NAME1/FOLDER_NAME2/.../FILE_NAME`. Alternatively you can create a BigQuery table and upload the data into the table. The URI for your table is `bq://PROJECT_ID.DATASET_ID.TABLE_ID`.\n", + "as your training data. You can create a GCS bucket and upload the data into your bucket.\n", + "\n", + "- The URI for your file is `gs://BUCKET_NAME/filename`. \n", + "\n", + "Alternatively you can create a BigQuery table and upload the data into the table:\n", + "\n", + "- The URI for your table is `bq://PROJECT_ID.DATASET_ID.TABLE_ID`.\n", "\n", "Importing data may take a few minutes or hours depending on the size of your data. If your Colab times out, run the following command to retrieve your dataset. 
Replace `dataset_name` with its actual value obtained in the preceding cells.\n", "\n", " dataset = client.get_dataset(dataset_name)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Data source is GCS**" + ] + }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -334,20 +521,26 @@ }, "outputs": [], "source": [ - "#@title ... if data source is GCS { vertical-output: true }\n", + "gcs_input_uris = ['gs://cloud-ml-data-tables/notebooks/census_income.csv',]\n", "\n", - "dataset_gcs_input_uris = ['gs://cloud-ml-data/automl-tables/notebooks/census_income.csv',] #@param\n", "# Define input configuration.\n", "input_config = {\n", " 'gcs_source': {\n", - " 'input_uris': dataset_gcs_input_uris\n", + " 'input_uris': gcs_input_uris\n", " }\n", "}" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Data source is BigQuery**" + ] + }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -355,20 +548,26 @@ }, "outputs": [], "source": [ - "#@title ... if data source is BigQuery { vertical-output: true }\n", + "bq_input_uri = 'bq://bigquery-public-data.ml_datasets.census_adult_income'\n", "\n", - "dataset_bq_input_uri = 'bq://my-project-trial5.census_income.income_census' #@param {type: 'string'}\n", "# Define input configuration.\n", "input_config = {\n", " 'bigquery_source': {\n", - " 'input_uri': dataset_bq_input_uri\n", + " 'input_uri': bq_input_uri\n", " }\n", "}" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Import the data into the dataset. This may take a while, depending on the size of your data. Once the import completes, you can verify its status in the cell below." 
+ ] + }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -376,13 +575,29 @@ }, "outputs": [], "source": [ - " #@title Import data { vertical-output: true }\n", - "\n", "import_data_response = client.import_data(dataset_name, input_config)\n", "print('Dataset import operation: {}'.format(import_data_response.operation))\n", - "# Wait until import is done.\n", + "\n", + "# Synchronous check of operation status. Wait until import is done.\n", "import_data_result = import_data_response.result()\n", - "import_data_result" + "import_data_response.done()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Obtain the dataset details, this time pay attention to the `example_count` field with 32561 records." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dataset = client.get_dataset(dataset_name)\n", + "dataset" ] }, { @@ -392,7 +607,7 @@ "id": "QdxBI4s44ZRI" }, "source": [ - "### Review the specs" + "### Review the data specs" ] }, { @@ -402,12 +617,15 @@ "id": "RC0PWKqH4jwr" }, "source": [ - "Run the following command to see table specs such as row count." + "Run the following command to see table specs such as row count.\n", + "We can see the different data types (numerical, string or categorical). 
\n", + "\n", + "More information [here](https://cloud.google.com/automl-tables/docs/data-types)" ] }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -415,37 +633,39 @@ }, "outputs": [], "source": [ - "#@title Table schema { vertical-output: true }\n", - "\n", - "import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types\n", - "import matplotlib.pyplot as plt\n", - "\n", "# List table specs\n", "list_table_specs_response = client.list_table_specs(dataset_name)\n", "table_specs = [s for s in list_table_specs_response]\n", + "\n", "# List column specs\n", "table_spec_name = table_specs[0].name\n", "list_column_specs_response = client.list_column_specs(table_spec_name)\n", "column_specs = {s.display_name: s for s in list_column_specs_response}\n", + "\n", + "# Print Features and data_type:\n", + "\n", + "features = [(key, data_types.TypeCode.Name(value.data_type.type_code)) for key, value in column_specs.items()]\n", + "print('Feature list:\\n')\n", + "for feature in features:\n", + " print(feature[0],':', feature[1])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "# Table schema pie chart.\n", + "\n", "type_counts = {}\n", "for column_spec in column_specs.values():\n", " type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)\n", " type_counts[type_name] = type_counts.get(type_name, 0) + 1\n", - "\n", + " \n", "plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')\n", "plt.axis('equal')\n", - "plt.show()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "vcJP7xoq4yAJ" - }, - "source": [ - "Run the following command to see column specs such inferred schema." + "plt.show()" ] }, { @@ -465,7 +685,7 @@ "id": "kNRVJqVOL8h3" }, "source": [ - "## 4. 
Update dataset: assign a label column and enable nullable columns" + "### Update dataset: assign a label column and enable nullable columns" ] }, { @@ -475,7 +695,9 @@ "id": "-57gehId9PQ5" }, "source": [ - "AutoML Tables automatically detects your data column type. For example, for the ([census_income](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income.csv)) it detects `income` to be categorical (as it is just either over or under 50k) and `age` to be numerical. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema." + "This section is important, as it is where you specify which column (meaning which feature) you will use as your label. This label feature will then be predicted using all other features in the row.\n", + "\n", + "AutoML Tables automatically detects your data column type. For example, for the ([census_income](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income.csv)) it detects `income_bracket` to be categorical (as it is just either over or under 50k) and `age` to be numerical. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema." 
] }, { @@ -485,12 +707,12 @@ "id": "iRqdQ7Xiq04x" }, "source": [ - "### Update a column: set to nullable" + "#### Update a column: Set to nullable" ] }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -498,10 +720,8 @@ }, "outputs": [], "source": [ - "#@title Update dataset { vertical-output: true }\n", - "\n", "update_column_spec_dict = {\n", - " 'name': column_specs['income'].name,\n", + " 'name': column_specs['income_bracket'].name,\n", " 'data_type': {\n", " 'type_code': 'CATEGORY',\n", " 'nullable': False\n", @@ -528,12 +748,12 @@ "id": "nDMH_chybe4w" }, "source": [ - "### Update dataset: assign a label" + "#### Update dataset: Assign a label" ] }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -541,12 +761,11 @@ }, "outputs": [], "source": [ - "#@title Update dataset { vertical-output: true }\n", - "\n", - "label_column_name = 'income' #@param {type: 'string'}\n", + "label_column_name = 'income_bracket'\n", "label_column_spec = column_specs[label_column_name]\n", "label_column_id = label_column_spec.name.rsplit('/', 1)[-1]\n", "print('Label column ID: {}'.format(label_column_id))\n", + "\n", "# Define the values of the fields to be updated.\n", "update_dataset_dict = {\n", " 'name': dataset_name,\n", @@ -575,7 +794,7 @@ "id": "FcKgvj1-Tbgj" }, "source": [ - "## 5. Creating a model" + "### Creating a model" ] }, { @@ -585,15 +804,20 @@ "id": "Pnlk8vdQlO_k" }, "source": [ - "### Train a model\n", - "Specify the duration of the training. For example, `'train_budget_milli_node_hours': 1000` runs the training for one hour. If your Colab times out, use `client.list_models(location_path)` to check whether your model has been created. Then use model name to continue to the next steps. Run the following command to retrieve your model. 
Replace `model_name` with its actual value.\n", + "Once we have defined our datasets and features we will create a model.\n", + "\n", + "Specify the duration of the training. For example, `'train_budget_milli_node_hours': 1000` runs the training for one hour. \n", "\n", - " model = client.get_model(model_name)" + "If your Colab times out, use `client.list_models(project_location)` to check whether your model has been created. Then use model name to continue to the next steps. Run the following command to retrieve your model. Replace `model_name` with its actual value.\n", + "\n", + "```\n", + " model = client.get_model(model_name) \n", + "```" ] }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -601,23 +825,38 @@ }, "outputs": [], "source": [ - "#@title Create model { vertical-output: true }\n", - "\n", - "model_display_name = 'census_income_model' #@param {type:'string'}\n", + "model_display_name = 'census_income_model'\n", "\n", "model_dict = {\n", " 'display_name': model_display_name,\n", " 'dataset_id': dataset_name.rsplit('/', 1)[-1],\n", " 'tables_model_metadata': {'train_budget_milli_node_hours': 1000}\n", "}\n", - "create_model_response = client.create_model(location_path, model_dict)\n", + "create_model_response = client.create_model(project_location, model_dict)\n", "print('Dataset import operation: {}'.format(create_model_response.operation))\n", "# Wait until model training is done.\n", "create_model_result = create_model_response.result()\n", - "model_name = create_model_result.name\n", "create_model_result" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Model status" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Get complete detail of the model.\n", + "model_name = create_model_result.name\n", + "client.get_model(model_name)" + ] + }, { "cell_type": "markdown", "metadata": { 
@@ -632,35 +871,117 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "LMYmHSiCE8om" + "id": "xGVGwgwXSZe_" }, "source": [ - "## 6. Make a prediction" + "Adjust the sliders on the right to the desired test values for your online prediction." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "bDzd5GYQSdpa" + }, + "outputs": [], + "source": [ + "#@title Make an online prediction: set the numeric variables { vertical-output: true }\n", + "\n", + "age = 34 #@param {type:'slider', min:1, max:100, step:1}\n", + "capital_gain = 40000 #@param {type:'slider', min:0, max:100000, step:10000}\n", + "capital_loss = 3.8 #@param {type:'slider', min:0, max:4000, step:0.1}\n", + "fnlwgt = 150000 #@param {type:'slider', min:0, max:1000000, step:50000}\n", + "education_num = 9 #@param {type:'slider', min:1, max:16, step:1}\n", + "hours_per_week = 40 #@param {type:'slider', min:1, max:100, step:1}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Model deployment" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "G2WVbMFll96k" + "id": "n0lFAIkISf4K" + }, + "source": [ + "**Important:** Deploy the model, then wait until the model FINISHES deployment.\n", + "\n", + "The model takes a while to deploy online. When the deployment call `response = client.deploy_model(model_name)` finishes, you will be able to see this on the UI. Check the [UI](https://console.cloud.google.com/automl-tables?_ga=2.255483016.-1079099924.1550856636) and navigate to the predict tab of your model, and then to the online prediction portion, to see when it finishes online deployment before running the prediction cell. You should see \"online prediction\" text near the top; click on it, and it will take you to a view of your online prediction interface. 
You should see \"model deployed\" on the far right of the screen if the model is deployed, or a \"deploying model\" message if it is still deploying. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "kRoHFbVnSk05" + }, + "outputs": [], + "source": [ + "deploy_model_response = client.deploy_model(model_name)\n", + "deploy_model_result = deploy_model_response.result()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Verify that the model has been deployed: the `deployment_state` field should show `DEPLOYED`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client.get_model(model_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "0tymBrhLSnDX" }, "source": [ - "### There are two different prediction modes: online and batch. The following cells show you how to make an online prediction. " + "Run the prediction only after the model finishes deployment." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "ZAGi8Co-SU-b" + "id": "LMYmHSiCE8om" }, "source": [ - "Run the following cell, and then choose the desired test values for your online prediction." + "### Make an online prediction" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "G2WVbMFll96k" + }, + "source": [ + "You can set the values you want for all of the numeric features with the sliders, and choose the values for the categorical features from the drop-down menus.\n", + "\n", + "Note: If the model has not finished deployment, the prediction will NOT work.\n", + "The following cells show you how to make an online prediction. 
" ] }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -680,6 +1001,7 @@ "race_ids = ['White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other', 'Black']\n", "sex_ids = ['Female', 'Male']\n", "native_country_ids = ['United-States', 'Cambodia', 'England', 'Puerto-Rico', 'Canada', 'Germany', 'Outlying-US(Guam-USVI-etc)', 'India', 'Japan', 'Greece', 'South', 'China', 'Cuba', 'Iran', 'Honduras', 'Philippines', 'Italy', 'Poland', 'Jamaica', 'Vietnam', 'Mexico', 'Portugal', 'Ireland', 'France', 'Dominican-Republic', 'Laos', 'Ecuador', 'Taiwan', 'Haiti', 'Columbia', 'Hungary', 'Guatemala', 'Nicaragua', 'Scotland', 'Thailand', 'Yugoslavia', 'El-Salvador', 'Trinadad&Tobago', 'Peru', 'Hong', 'Holand-Netherlands']\n", + "\n", "workclass = widgets.Dropdown(options=workclass_ids, value=workclass_ids[0],\n", " description='workclass:')\n", "\n", @@ -711,76 +1033,22 @@ "display(relationship)\n", "display(race)\n", "display(sex)\n", - "display(native_country)\n" + "display(native_country)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "xGVGwgwXSZe_" - }, - "source": [ - "Adjust the slides on the right to the desired test values for your online prediction." 
- ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "colab": {}, - "colab_type": "code", - "id": "bDzd5GYQSdpa" - }, - "outputs": [], - "source": [ - "#@title Make an online prediction: set the numeric variables{ vertical-output: true }\n", - "\n", - "age = 34 #@param {type:'slider', min:1, max:100, step:1}\n", - "capital_gain = 40000 #@param {type:'slider', min:0, max:100000, step:10000}\n", - "capital_loss = 3.8 #@param {type:'slider', min:0, max:4000, step:0.1}\n", - "fnlwgt = 150000 #@param {type:'slider', min:0, max:1000000, step:50000}\n", - "education_num = 9 #@param {type:'slider', min:1, max:16, step:1}\n", - "hours_per_week = 40 #@param {type:'slider', min:1, max:100, step:1}\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "n0lFAIkISf4K" - }, - "source": [ - "**IMPORTANT** : Deploy the model, then wait until the model FINISHES deployment.\n", - "Check the [UI](https://console.cloud.google.com/automl-tables?_ga=2.255483016.-1079099924.1550856636) and navigate to the predict tab of your model, and then to the online prediction portion, to see when it finishes online deployment before running the prediction cell." - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "colab": {}, - "colab_type": "code", - "id": "kRoHFbVnSk05" - }, - "outputs": [], - "source": [ - "response = client.deploy_model(model_name)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "0tymBrhLSnDX" + "id": "ZAGi8Co-SU-b" }, "source": [ - "Run the prediction, only after the model finishes deployment" + "Run the following cell, and then choose the desired test values for your online prediction." 
] }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -808,7 +1076,28 @@ " ]\n", " }\n", "}\n", - "prediction_client.predict(model_name, payload)" + "prediction_result = prediction_client.predict(model_name, payload)\n", + "print(prediction_result)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Get Prediction \n", + "\n", + "`prediction_result` is a `google.cloud.automl_v1beta1.types.PredictResponse` object. We iterate over its payload to build a list of (score, label) tuples, sort the list by score in descending order, and display the top entry." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "predictions = [(prediction.tables.score, prediction.tables.value.string_value) for prediction in prediction_result.payload]\n", + "predictions = sorted(predictions, key=lambda tup: (tup[0],tup[1]), reverse=True)\n", + "print('Prediction is: ', predictions[0])" ] }, { @@ -823,7 +1112,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -831,7 +1120,7 @@ }, "outputs": [], "source": [ - "response2 = client.undeploy_model(model_name)" + "undeploy_model_response = client.undeploy_model(model_name)" ] }, { @@ -841,7 +1130,7 @@ "id": "TarOq84-GXch" }, "source": [ - "## 7. Batch prediction" + "### Batch prediction" ] }, { @@ -851,7 +1140,7 @@ "id": "Soy5OB8Wbp_R" }, "source": [ - "### Initialize prediction" + "#### Initialize prediction" ] }, { @@ -861,15 +1150,39 @@ "id": "39bIGjIlau5a" }, "source": [ - "Your data source for batch prediction can be GCS or BigQuery. For this tutorial, you can use [census_income_batch_prediction_input.csv](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income_batch_prediction_input.csv) as input source. Create a GCS bucket and upload the file into your bucket. 
Some of the lines in the batch prediction input file are intentionally left missing some values. 
Some of the lines in the batch prediction input file are intentionally left missing some values. The AutoML Tables logs the errors in the `errors.csv` file.\n", - "Also, enter the UI and create the bucket into which you will load your predictions. The bucket's default name here is automl-tables-pred.\n", + "Your data source for batch prediction can be GCS or BigQuery. \n", + "\n", + "For this tutorial, you can use: \n", + "\n", + "- [census_income_batch_prediction_input.csv](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income_batch_prediction_input.csv) as input source. \n", + "\n", + "Create a GCS bucket and upload the file into your bucket. \n", "\n", - "**NOTE:** The client library has a bug. If the following cell returns a `TypeError: Could not convert Any to BatchPredictResult` error, ignore it. The batch prediction output file(s) will be updated to the GCS bucket that you set in the preceding cells." + "Some of the lines in the batch prediction input file are intentionally left missing some values. \n", + "The AutoML Tables logs the errors in the `errors.csv` file.\n", + "Also, enter the UI and create the bucket into which you will load your predictions. \n", + "\n", + "The bucket's default name here is `automl-tables-pred` to be replaced with your own.\n", + "\n", + "**NOTE:** The client library has a bug. If the following cell returns a:\n", + "\n", + "`TypeError: Could not convert Any to BatchPredictResult` error, ignore it. \n", + "\n", + "The batch prediction output file(s) will be updated to the GCS bucket that you set in the preceding cells." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! 
gsutil ls -al gs://cloud-ml-data-tables/notebooks/census_income_batch_prediction_input.csv" ] }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -879,9 +1192,9 @@ "source": [ "#@title Start batch prediction { vertical-output: true, output-height: 200 }\n", "\n", - "batch_predict_gcs_input_uris = ['gs://cloud-ml-data/automl-tables/notebooks/census_income_batch_prediction_input.csv',] #@param\n", - "batch_predict_gcs_output_uri_prefix = 'gs://automl-tables-pred1' #@param {type:'string'}\n", - "#gs://automl-tables-pred\n", + "batch_predict_gcs_input_uris = ['gs://cloud-ml-data-tables/notebooks/census_income_batch_prediction_input.csv',] #@param\n", + "batch_predict_gcs_output_uri_prefix = 'gs://automl-tables-pred/' #@param {type:'string'}\n", + "\n", "# Define input source.\n", "batch_prediction_input_source = {\n", " 'gcs_source': {\n", @@ -893,7 +1206,22 @@ " 'gcs_destination': {\n", " 'output_uri_prefix': batch_predict_gcs_output_uri_prefix\n", " }\n", - "}\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Launch batch prediction" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "batch_predict_response = prediction_client.batch_predict(\n", " model_name, batch_prediction_input_source, batch_prediction_output_target)\n", "print('Batch prediction operation: {}'.format(batch_predict_response.operation))\n", @@ -901,6 +1229,16 @@ "batch_predict_result = batch_predict_response.result()\n", "batch_predict_response.metadata" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Next steps\n", + "\n", + "Follow the latest updates on AutoML [here](https://cloud.google.com/automl/docs/).\n", + "If you have any questions, contact us at [cloud-automl-tables-discuss](https://groups.google.com/forum/#!forum/cloud-automl-tables-discuss)." 
+ ] } ], "metadata": { @@ -925,9 +1263,9 @@ 
"name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.2" + "version": "3.5.3" } }, "nbformat": 4, - "nbformat_minor": 1 + "nbformat_minor": 2 }