google-gemini · markmcd · Oct 11, 2024 · Oct 4, 2024 · Oct 7, 2024 · Oct 8, 2024
diff --git a/quickstarts/Tuning.ipynb b/quickstarts/Tuning.ipynb
@@ -48,7 +48,7 @@
       "source": [
         "<table align=\"left\">\n",
         "  <td>\n",
-        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Tuning.ipynb\"><img src=\"../images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
+        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Tuning.ipynb\"><img src=\"https://github.com/google-gemini/cookbook/blob/main/images/colab_logo_32px.png?raw=1\" />Run in Google Colab</a>\n",
         "  </td>\n",
         "</table>"
       ]
@@ -164,23 +164,260 @@
     {
       "cell_type": "markdown",
       "metadata": {
-        "id": "BhkXRzciv3Dp"
+        "id": "UVVPChWGX-2K"
       },
       "source": [
-        "## Create tuned model"
+        "## Prepare your dataset"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {
-        "id": "OO8VZYAinLWc"
+        "id": "82j6NHPC5g8Q"
       },
       "source": [
-        "To create a tuned model, you need to pass your dataset to the model in the `genai.create_tuned_model` method. You can do this be directly defining the input and output values in the call or importing from a file into a dataframe to pass to the method.\n",
+        "Before you can start fine-tuning, you need a dataset to tune the model with. For the best performance, the examples in the dataset should be of high quality, diverse, and representative of real inputs and outputs.\n",
         "\n",
         "For this example, you will tune a model to generate the next number in the sequence. For example, if the input is `1`, the model should output `2`. If the input is `one hundred`, the output should be `one hundred one`.\n",
         "\n",
-        "**Note**: In general, you need between 100 and 500 examples to significantly change the behavior of the model."
+        "The dataset for tuning the model can be one of the following types:\n",
+        "1. `Iterable` of dicts or tuples.\n",
+        "2. `Mapping` of `Iterable[str]`.\n",
+        "3. CSV file.\n",
+        "4. JSON file.\n",
+        "\n",
+        "To learn more about preparing a dataset for fine-tuning, see the [model tuning documentation](https://ai.google.dev/gemini-api/docs/model-tuning#prepare-dataset).\n",
+        "\n",
+        "Note: In general, you need between 100 and 500 examples to significantly change the behavior of the model."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "0in9dID6c7lS"
+      },
+      "source": [
+        "The following sections illustrate how to provide the dataset as an `Iterable` or a CSV file."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "MIHNSWr90qzN"
+      },
+      "source": [
+        "### Training data as an `Iterable`"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "jZ6mcXJi5W_F"
+      },
+      "source": [
+        "Data can be an `Iterable` of:\n",
+        "* `{'text_input': text_input, 'output': output}` dicts\n",
+        "* `(text_input, output)` tuples"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "J73L1obNYtoF"
+      },
+      "outputs": [],
+      "source": [
+        "# Provide data as a list of dicts\n",
+        "\n",
+        "dict_data = [\n",
+        "    {\n",
+        "          'text_input': '1',\n",
+        "          'output': '2',\n",
+        "    },{\n",
+        "          'text_input': '3',\n",
+        "          'output': '4',\n",
+        "    },{\n",
+        "          'text_input': '-3',\n",
+        "          'output': '-2',\n",
+        "    },{\n",
+        "          'text_input': 'twenty two',\n",
+        "          'output': 'twenty three',\n",
+        "    },{\n",
+        "          'text_input': 'two hundred',\n",
+        "          'output': 'two hundred one',\n",
+        "    },{\n",
+        "          'text_input': 'ninety nine',\n",
+        "          'output': 'one hundred',\n",
+        "    },{\n",
+        "          'text_input': '8',\n",
+        "          'output': '9',\n",
+        "    },{\n",
+        "          'text_input': '-98',\n",
+        "          'output': '-97',\n",
+        "    },{\n",
+        "          'text_input': '1,000',\n",
+        "          'output': '1,001',\n",
+        "    },{\n",
+        "          'text_input': '10,100,000',\n",
+        "          'output': '10,100,001',\n",
+        "    },{\n",
+        "          'text_input': 'thirteen',\n",
+        "          'output': 'fourteen',\n",
+        "    },{\n",
+        "          'text_input': 'eighty',\n",
+        "          'output': 'eighty one',\n",
+        "    },{\n",
+        "          'text_input': 'one',\n",
+        "          'output': 'two',\n",
+        "    },{\n",
+        "          'text_input': 'three',\n",
+        "          'output': 'four',\n",
+        "    },{\n",
+        "          'text_input': 'seven',\n",
+        "          'output': 'eight',\n",
+        "    }\n",
+        "]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "QtCw-DsddxZe"
+      },
+      "source": [
+        "### Training data as a CSV file"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "TD3_PjOI45Jw"
+      },
+      "source": [
+        "You can provide your CSV file to the tuning API in one of the following ways:\n",
+        "  * A path of type `str` or `pathlib.Path` to a local CSV file.\n",
+        "  * A URL to the CSV file.\n",
+        "  * The public URL of a Google Sheets file.\n",
+        "\n",
+        "This example creates a local CSV file containing the training dataset and references it with a `pathlib.Path` object that you can pass to the tuning API.\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "NyEBsKxH2AAu"
+      },
+      "source": [
+        "Run the following cell to create the CSV file, `data.csv`.\n",
+        "The CSV file has the default columns, `text_input` for the input and `output` for the output.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "JW1Lb9He079n"
+      },
+      "outputs": [],
+      "source": [
+        "%%writefile data.csv\n",
+        "text_input,output\n",
+        "1,2\n",
+        "3,4\n",
+        "-3,-2\n",
+        "twenty two,twenty three\n",
+        "two hundred,two hundred one\n",
+        "ninety nine,one hundred\n",
+        "8,9\n",
+        "-98,-97\n",
+        "\"1,000\",\"1,001\"\n",
+        "\"10,100,000\",\"10,100,001\"\n",
+        "thirteen,fourteen\n",
+        "eighty,eighty one\n",
+        "one,two\n",
+        "three,four\n",
+        "seven,eight"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "-THQjy61DgE8"
+      },
+      "source": [
+        "If your CSV file doesn't use the default field names, you can specify the input and output field names directly in the `create_tuned_model` call.\n",
+        "\n",
+        "```\n",
+        "create_tuned_model(\n",
+        "    training_data=<csv file path>,\n",
+        "    ...\n",
+        "    input_key=<input field name>,\n",
+        "    output_key=<output field name>\n",
+        ")\n",
+        "```"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "ReHerc2i1p6z"
+      },
+      "source": [
+        "Get the CSV file path as a `pathlib.Path` object."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ZJQn89jPgG4K"
+      },
+      "outputs": [],
+      "source": [
+        "import pathlib\n",
+        "\n",
+        "# Provide data as a CSV file `pathlib.Path` object.\n",
+        "csv_file = pathlib.Path('data.csv')"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "o_Q3BrtFBewK"
+      },
+      "source": [
+        "### Pass your dataset as training data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ijC7LvH4Bf9L"
+      },
+      "outputs": [],
+      "source": [
+        "# Here you can specify any of the supported formats, e.g. dict_data or csv_file.\n",
+        "train_data = dict_data"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "BhkXRzciv3Dp"
+      },
+      "source": [
+        "## Create tuned model"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "PXAMtm1RHWvC"
+      },
+      "source": [
+        "Get the list of models available for tuning.\n"
       ]
     },
     {
@@ -197,6 +434,15 @@
         "tunable_models  # ['models/gemini-1.0-pro-001', 'models/gemini-1.5-flash-001-tuning']"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "obdMpissCVAO"
+      },
+      "source": [
+        "Select the source model for tuning.\n"
+      ]
+    },
     {
       "cell_type": "code",
       "execution_count": null,
@@ -209,11 +455,20 @@
         "base_model"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "OO8VZYAinLWc"
+      },
+      "source": [
+        "To create a tuned model, pass the dataset you prepared earlier to the `genai.create_tuned_model` method."
+      ]
+    },
     {
       "cell_type": "code",
       "execution_count": null,
       "metadata": {
-        "id": "baHjHh1oTTTC"
+        "id": "_q1iKDiZ_cjG"
       },
       "outputs": [],
       "source": [
@@ -223,54 +478,8 @@
         "operation = genai.create_tuned_model(\n",
         "    # You can use a tuned model here too. Set `source_model=\"tunedModels/...\"`\n",
         "    source_model=base_model.name,\n",
-        "    training_data=[\n",
-        "        {\n",
-        "             'text_input': '1',\n",
-        "             'output': '2',\n",
-        "        },{\n",
-        "             'text_input': '3',\n",
-        "             'output': '4',\n",
-        "        },{\n",
-        "             'text_input': '-3',\n",
-        "             'output': '-2',\n",
-        "        },{\n",
-        "             'text_input': 'twenty two',\n",
-        "             'output': 'twenty three',\n",
-        "        },{\n",
-        "             'text_input': 'two hundred',\n",
-        "             'output': 'two hundred one',\n",
-        "        },{\n",
-        "             'text_input': 'ninety nine',\n",
-        "             'output': 'one hundred',\n",
-        "        },{\n",
-        "             'text_input': '8',\n",
-        "             'output': '9',\n",
-        "        },{\n",
-        "             'text_input': '-98',\n",
-        "             'output': '-97',\n",
-        "        },{\n",
-        "             'text_input': '1,000',\n",
-        "             'output': '1,001',\n",
-        "        },{\n",
-        "             'text_input': '10,100,000',\n",
-        "             'output': '10,100,001',\n",
-        "        },{\n",
-        "             'text_input': 'thirteen',\n",
-        "             'output': 'fourteen',\n",
-        "        },{\n",
-        "             'text_input': 'eighty',\n",
-        "             'output': 'eighty one',\n",
-        "        },{\n",
-        "             'text_input': 'one',\n",
-        "             'output': 'two',\n",
-        "        },{\n",
-        "             'text_input': 'three',\n",
-        "             'output': 'four',\n",
-        "        },{\n",
-        "             'text_input': 'seven',\n",
-        "             'output': 'eight',\n",
-        "        }\n",
-        "    ],\n",
+        "    # Pass the dataset created earlier.\n",
+        "    training_data=train_data,\n",
         "    id = name,\n",
         "    epoch_count = 100,\n",
         "    batch_size=4,\n",