Tuning notebook updated to include csv upload #300

Merged
merged 4 commits into from
Oct 11, 2024
319 changes: 264 additions & 55 deletions quickstarts/Tuning.ipynb
Expand Up @@ -48,7 +48,7 @@
"source": [
"<table align=\"left\">\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Tuning.ipynb\"><img src=\"../images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
" <a target=\"_blank\" href=\"https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Tuning.ipynb\"><img src=\"https://github.com/google-gemini/cookbook/blob/main/images/colab_logo_32px.png?raw=1\" />Run in Google Colab</a>\n",
" </td>\n",
"</table>"
]
Expand Down Expand Up @@ -164,23 +164,260 @@
{
"cell_type": "markdown",
"metadata": {
"id": "BhkXRzciv3Dp"
"id": "UVVPChWGX-2K"
},
"source": [
"## Create tuned model"
"## Prepare your dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OO8VZYAinLWc"
"id": "82j6NHPC5g8Q"
},
"source": [
"To create a tuned model, you need to pass your dataset to the model in the `genai.create_tuned_model` method. You can do this by directly defining the input and output values in the call or importing from a file into a dataframe to pass to the method.\n",
"Before you can start fine-tuning, you need a dataset to tune the model with. For the best performance, the examples in the dataset should be of high quality, diverse, and representative of real inputs and outputs.\n",
"\n",
"For this example, you will tune a model to generate the next number in the sequence. For example, if the input is `1`, the model should output `2`. If the input is `one hundred`, the output should be `one hundred one`.\n",
"\n",
"**Note**: In general, you need between 100 and 500 examples to significantly change the behavior of the model."
"The dataset for tuning the model can be one of the following types:\n",
"1. `Iterable` of dicts or tuples.\n",
"2. `Mapping` of `Iterable[str]`.\n",
"3. CSV file.\n",
"4. JSON file.\n",
"\n",
"To learn more about preparing a dataset for fine-tuning, see the [model-tuning documentation](https://ai.google.dev/gemini-api/docs/model-tuning#prepare-dataset).\n",
"\n",
"Note: In general, you need between 100 and 500 examples to significantly change the behavior of the model."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0in9dID6c7lS"
},
"source": [
"The following sections illustrate how to provide the dataset as an `Iterable` or a CSV file."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MIHNSWr90qzN"
},
"source": [
"### Training data as an `Iterable`"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jZ6mcXJi5W_F"
},
"source": [
"Data can be an `Iterable` of:\n",
"* `{'text_input': text_input, 'output': output}` dicts\n",
"* `(text_input, output)` tuples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "J73L1obNYtoF"
},
"outputs": [],
"source": [
"# Provide data as a list of dicts.\n",
"\n",
"dict_data = [\n",
" {\n",
" 'text_input': '1',\n",
" 'output': '2',\n",
" },{\n",
" 'text_input': '3',\n",
" 'output': '4',\n",
" },{\n",
" 'text_input': '-3',\n",
" 'output': '-2',\n",
" },{\n",
" 'text_input': 'twenty two',\n",
" 'output': 'twenty three',\n",
" },{\n",
" 'text_input': 'two hundred',\n",
" 'output': 'two hundred one',\n",
" },{\n",
" 'text_input': 'ninety nine',\n",
" 'output': 'one hundred',\n",
" },{\n",
" 'text_input': '8',\n",
" 'output': '9',\n",
" },{\n",
" 'text_input': '-98',\n",
" 'output': '-97',\n",
" },{\n",
" 'text_input': '1,000',\n",
" 'output': '1,001',\n",
" },{\n",
" 'text_input': '10,100,000',\n",
" 'output': '10,100,001',\n",
" },{\n",
" 'text_input': 'thirteen',\n",
" 'output': 'fourteen',\n",
" },{\n",
" 'text_input': 'eighty',\n",
" 'output': 'eighty one',\n",
" },{\n",
" 'text_input': 'one',\n",
" 'output': 'two',\n",
" },{\n",
" 'text_input': 'three',\n",
" 'output': 'four',\n",
" },{\n",
" 'text_input': 'seven',\n",
" 'output': 'eight',\n",
" }\n",
"]"
]
},
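The notebook text says the data can be an `Iterable` of `{'text_input': ..., 'output': ...}` dicts or of `(text_input, output)` tuples, so the same examples can also be written in tuple form. The following is a minimal sketch in plain Python. The names `to_tuples` and `sample` are hypothetical and do not appear in the notebook or the SDK.

```python
# Hypothetical helper for illustration only; not part of google.generativeai.
def to_tuples(records):
    """Convert {'text_input', 'output'} dicts into (text_input, output) tuples."""
    return [(record['text_input'], record['output']) for record in records]


# A two-example subset of the notebook's dataset, used as sample input.
sample = [
    {'text_input': '1', 'output': '2'},
    {'text_input': 'twenty two', 'output': 'twenty three'},
]

tuple_data = to_tuples(sample)
print(tuple_data)
```

Either form carries the same input/output pairs. The dict form is more explicit about which value is the input, while the tuple form is more compact.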
{
"cell_type": "markdown",
"metadata": {
"id": "QtCw-DsddxZe"
},
"source": [
"### Training data as a CSV file"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TD3_PjOI45Jw"
},
"source": [
"You can provide your CSV file to the tuning API in one of the following ways:\n",
" * A path of type `str` or `pathlib.Path` to a local CSV file.\n",
" * A URL to the CSV file.\n",
" * The public URL of a Google Sheets file.\n",
"\n",
"For this example, you will pass the tuning API a `pathlib.Path` that points to a local CSV file containing the training dataset.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NyEBsKxH2AAu"
},
"source": [
"Run the following cell to create the CSV file, `data.csv`.\n",
"The CSV file has the default columns, `text_input` for the input and `output` for the output.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JW1Lb9He079n"
},
"outputs": [],
"source": [
"%%writefile data.csv\n",
"text_input,output\n",
"1,2\n",
"3,4\n",
"-3,-2\n",
"twenty two,twenty three\n",
"two hundred,two hundred one\n",
"ninety nine,one hundred\n",
"8,9\n",
"-98,-97\n",
"\"1,000\",\"1,001\"\n",
"\"1,01,00,000\",\"1,01,00,001\"\n",
"thirteen,fourteen\n",
"eighty,eighty one\n",
"one,two\n",
"three,four\n",
"seven,eight"
]
},
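Some rows in the CSV contain commas inside a value (for example `"1,000"`), which is why those fields are quoted. A quick local check with Python's standard `csv` module can confirm that each row parses into exactly one `text_input` and one `output`. This is only a sketch for sanity-checking a file before tuning; `csv_text`, `rows`, and `incomplete` are hypothetical names, and the tuning API's own parser may behave differently.

```python
import csv
import io

# A small excerpt in the same layout as data.csv (assumed here as an inline string).
csv_text = '''text_input,output
1,2
"1,000","1,001"
seven,eight
'''

# DictReader uses the header row for keys and honors quoted fields.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Collect rows where either column is missing or empty.
incomplete = [row for row in rows if not row.get('text_input') or not row.get('output')]

print(len(rows), 'rows parsed;', len(incomplete), 'incomplete')
```

If the quotes around `1,000` were dropped, that row would split into extra fields, so a check like this catches the problem early.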
{
"cell_type": "markdown",
"metadata": {
"id": "-THQjy61DgE8"
},
"source": [
"If your CSV file doesn't use the default column names, you can specify the input and output column names directly in the `create_tuned_model` call.\n",
"\n",
"```\n",
"create_tuned_model(\n",
" training_data=<csv file path>,\n",
" ...\n",
" input_key=<input field name>,\n",
" output_key=<output field name>\n",
")\n",
"```"
]
},
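As an alternative to passing `input_key` and `output_key`, a file with custom column names can be normalized to the default `text_input` / `output` names before tuning. The sketch below uses only the standard library. The column names `prompt` and `completion`, and the helper `rename_columns`, are hypothetical and chosen for illustration.

```python
import csv
import io


# Hypothetical helper for illustration only; not part of google.generativeai.
def rename_columns(csv_text, input_key, output_key):
    """Read CSV text and return records keyed by the default field names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {'text_input': row[input_key], 'output': row[output_key]}
        for row in reader
    ]


# Assumed example file contents with non-default column names.
custom_csv = 'prompt,completion\none,two\n"1,000","1,001"\n'

records = rename_columns(custom_csv, 'prompt', 'completion')
print(records)
```

The resulting list of dicts has the same shape as `dict_data`, so it could be used wherever the notebook accepts an `Iterable` of dicts.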
{
"cell_type": "markdown",
"metadata": {
"id": "ReHerc2i1p6z"
},
"source": [
"Get the CSV file path as a `pathlib.Path` object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZJQn89jPgG4K"
},
"outputs": [],
"source": [
"import pathlib\n",
"\n",
"# Provide data as a CSV file `pathlib.Path` object.\n",
"csv_file = pathlib.Path('data.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "o_Q3BrtFBewK"
},
"source": [
"### Pass your dataset as training data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ijC7LvH4Bf9L"
},
"outputs": [],
"source": [
"# Here you can specify any of the supported formats, e.g. dict_data or csv_file.\n",
"train_data = dict_data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BhkXRzciv3Dp"
},
"source": [
"## Create tuned model"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PXAMtm1RHWvC"
},
"source": [
"Get the list of models available for tuning.\n"
]
},
{
Expand All @@ -197,6 +434,15 @@
"tunable_models # ['models/gemini-1.0-pro-001', 'models/gemini-1.5-flash-001-tuning']"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "obdMpissCVAO"
},
"source": [
"Select the source model for tuning.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
Expand All @@ -209,11 +455,20 @@
"base_model"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OO8VZYAinLWc"
},
"source": [
"To create a tuned model, pass the dataset you prepared earlier to the `genai.create_tuned_model` method."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "baHjHh1oTTTC"
"id": "_q1iKDiZ_cjG"
},
"outputs": [],
"source": [
Expand All @@ -223,54 +478,8 @@
"operation = genai.create_tuned_model(\n",
" # You can use a tuned model here too. Set `source_model=\"tunedModels/...\"`\n",
" source_model=base_model.name,\n",
" training_data=[\n",
" {\n",
" 'text_input': '1',\n",
" 'output': '2',\n",
" },{\n",
" 'text_input': '3',\n",
" 'output': '4',\n",
" },{\n",
" 'text_input': '-3',\n",
" 'output': '-2',\n",
" },{\n",
" 'text_input': 'twenty two',\n",
" 'output': 'twenty three',\n",
" },{\n",
" 'text_input': 'two hundred',\n",
" 'output': 'two hundred one',\n",
" },{\n",
" 'text_input': 'ninety nine',\n",
" 'output': 'one hundred',\n",
" },{\n",
" 'text_input': '8',\n",
" 'output': '9',\n",
" },{\n",
" 'text_input': '-98',\n",
" 'output': '-97',\n",
" },{\n",
" 'text_input': '1,000',\n",
" 'output': '1,001',\n",
" },{\n",
" 'text_input': '10,100,000',\n",
" 'output': '10,100,001',\n",
" },{\n",
" 'text_input': 'thirteen',\n",
" 'output': 'fourteen',\n",
" },{\n",
" 'text_input': 'eighty',\n",
" 'output': 'eighty one',\n",
" },{\n",
" 'text_input': 'one',\n",
" 'output': 'two',\n",
" },{\n",
" 'text_input': 'three',\n",
" 'output': 'four',\n",
" },{\n",
" 'text_input': 'seven',\n",
" 'output': 'eight',\n",
" }\n",
" ],\n",
" # Pass the dataset created earlier.\n",
" training_data=train_data,\n",
" id = name,\n",
" epoch_count = 100,\n",
" batch_size=4,\n",