
REST API #82

Closed
jobliz opened this issue Feb 14, 2018 · 19 comments

@jobliz

jobliz commented Feb 14, 2018

A REST API would be useful for accessing ATM from other languages and devices through the web, and also for illustrating how the code is structured and might be extended.

From the project's readme I gather that the internal APIs are going to change, and that it consequently might be a bit early to develop a REST API, but I wanted to see what was possible with the current code. The API currently serves:

  • Various GET endpoints for reading data from the four entities currently present in the database
  • One GET endpoint that runs the worker.py script inside the virtualenv as a subprocess and retrieves its stdout and stderr
  • One POST endpoint that receives a .csv file in the HTTP request, saves it to the atm/data directory, and runs enter_data on it.

No modifications were made outside of the rest_api_server.py file, except for adding flask and simplejson to the project dependencies in requirements.txt.
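As a rough illustration, the upload endpoint boils down to something like this (a simplified sketch, not the exact code in the PR; names such as DATA_DIR are placeholders, and the exact enter_data call depends on ATM's internal API):

import os

from flask import Flask, jsonify, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
DATA_DIR = 'atm/data'  # placeholder for the data directory discussed above


@app.route('/enter_data', methods=['POST'])
def enter_data_endpoint():
    uploaded = request.files.get('file')
    if uploaded is None:
        return jsonify({'success': False, 'error': 'no file provided'}), 400

    # Save the CSV into atm/data (note the caveat below about overwriting
    # files that share a name).
    filename = secure_filename(uploaded.filename)
    save_path = os.path.join(DATA_DIR, filename)
    uploaded.save(save_path)

    # Hand the saved file off to ATM's enter_data routine here; the exact
    # call signature depends on ATM's (still changing) internal API.
    return jsonify({'success': True})


if __name__ == '__main__':
    app.run(port=5000)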

TODOs / caveats:

  • No AWS integration
  • api.py currently does not check whether the uploaded filename is already present, so a CSV upload can overwrite a previously sent file with the same name. This needs fixing, but I thought I should ask first whether the atm/data directory is the right place to put new files, and whether storing UUIDs fits the project's design, before using them and opening a pull request.

Example api.py usage:

After following the readme's installation instructions and running python scripts/rest_api_server.py in a separate shell under the virtualenv:

curl localhost:5000/enter_data -F file=@/path/file.csv

It should return:

{"success": true}

To see the created dataset:

curl localhost:5000/datasets/1 | jq

{
  "class_column": "class",
  "description": null,
  "size_kb": 6,
  "test_path": null,
  "k_classes": 3,
  "majority": 0.333333333,
  "d_features": 6,
  "train_path": "FULLPATH/file.csv",
  "id": 1,
  "n_examples": 150,
  "name": "file"
}

To see the created datarun:

curl localhost:5000/dataruns/1 | jq

{
  "status": "pending",
  "start_time": null,
  "description": "uniform__uniform",
  "r_minimum": 2,
  "metric": "f1",
  "budget": 100,
  "selector": "uniform",
  "priority": 1,
  "score_target": "cv_judgment_metric",
  "deadline": null,
  "budget_type": "classifier",
  "id": 1,
  "tuner": "uniform",
  "dataset_id": 1,
  "gridding": 0,
  "k_window": 3,
  "end_time": null
}

To run the worker.py script once:

curl localhost:5000/simple_worker | jq

After a while it returns:

{
  "stderr": "",
  "stdout": "huge stdout string with worker.py's output"
}
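Internally, /simple_worker is essentially a thin subprocess wrapper around worker.py; a minimal sketch of the idea (simplified; paths are illustrative and the actual rest_api_server.py may differ):

import subprocess
import sys

from flask import Flask, jsonify

app = Flask(__name__)


@app.route('/simple_worker', methods=['GET'])
def simple_worker():
    # Run worker.py once with the same interpreter that serves the API
    # (i.e. the one inside the virtualenv) and capture its output.
    # The path to worker.py is illustrative.
    proc = subprocess.Popen(
        [sys.executable, 'worker.py'],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    stdout, stderr = proc.communicate()
    return jsonify({'stdout': stdout, 'stderr': stderr})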
@bcyphers
Contributor

First of all, thanks so much for working on this! Really fantastic stuff. A REST API is something that's always been in the back of my mind, but I wouldn't have had time to get to it.

I don't see any problem with using UUIDs. And since we don't have AWS working yet anyway, that's obviously not an issue.

I'll ask @kveerama, but I think this is something we want in the main repo. If you create a pull request, I'll try to review it this weekend and have some more feedback. It would also be very helpful if you could write a couple of tests for this.

Also, FYI: I was a full-time maintainer of this project for a while, but I've taken another job and will no longer be able to devote much time. I'll keep up with the repo, but updates may come much more sporadically than they did the past couple of months. I believe the lab is trying to hire another maintainer, but until that happens, feedback may be slow.

Thanks again for contributing!

@jobliz
Author

jobliz commented Jun 5, 2018

Hi! I recently looked at my code again and realized that, of all things, I had forgotten to open the PR here. I just did it. I'll look into adding some tests too.

@RogerTangos RogerTangos self-assigned this Sep 25, 2018
@RogerTangos

Hey there - I'm picking up this issue. I think @jobliz's design was pretty good, and I'm going to propose a few more endpoints and decouple the API from the database.

@RogerTangos

RogerTangos commented Oct 2, 2018

Here are some more thoughts. I'd really appreciate feedback and other insights into this.

A. I started off using @jobliz's existing REST API. However, since I would very much like to force database access to go through the classes, this would require adding to_json methods to the following classes (a rough sketch of such a method appears below, after the list of options):
Database.Dataset
Database.Datarun
Database.Hyperpartition
Database.Classifier
Worker
B. This would also mean making some edits to Worker for starting/stopping/inspecting.
C. This is a good option and I can pursue it if that's what people want.
D. However, mapping out the relationships between the four main tables, I'm beginning to think that using the graphQL specification may be a better bet. It would certainly make for simpler frontend client development.
E. This would allow people to make queries like the following:

{
  datarun(id: 1) {
    id
    name
    dataset {
      name
    }
    classifiers {
      id
      cv_judgment_metric
    }
  }
}

F. I'm not very familiar with graphQL, but a co-worker likes it, and it seems like it's sticking around and may replace REST. Given that I'm going to need to do work in the Database subclasses anyway, it doesn't seem unreasonable.
G. There are some other options also detailed below. If you have thoughts on these, please share them. I'd like to start working on the API this week.

  1. GraphQL implementation (detailed above)
  2. REST implementation without direct database access (detailed above)
  3. REST implementation with direct database access
    - This would be roughly what's already on the rest-api branch, with just a few tweaks for starting/stopping workers
  4. Another option I haven't thought of

Please let me know what you think.
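For reference, here's the kind of to_json method I have in mind for option 2 (a sketch only; date and NULL handling would need more care):

from sqlalchemy import inspect


class ToJsonMixin(object):
    """Mixin giving SQLAlchemy-mapped classes a simple to_json method."""

    def to_json(self):
        # Build a plain dict from the mapped columns so the API layer never
        # has to touch ORM objects or the session directly.
        return {
            column.key: getattr(self, column.key)
            for column in inspect(self).mapper.column_attrs
        }

Database.Dataset, Database.Datarun, etc. could then mix this in, and the endpoint handlers would just return obj.to_json().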


If I did do a REST API, here are the endpoints I think would make sense

GET summarize datasets table
POST to upload a file
/datasets

GET information about a particular dataset
/datasets/<int:id>

GET info about all the times we ran ATM
/dataruns

GET info about one particular time we ran ATM
/dataruns/<int: id>

GET all classifiers that were trained in dataruns
/classifiers

GET single classifier that was trained in a datarun
/classifiers/<int:id>

GET dataruns on a given dataset
POST to create new datarun on a dataset
/dataruns/dataset/<int:dataset_id>

GET classifiers for a given datarun
POST to train a new classifier on the datarun (possibly)
/classifiers/datarun/<int:datarun_id>

GET information on all workers
POST - Create a worker?
/worker

GET - get information on a single worker
POST - Create a worker? I'm not sure this makes sense here.
PUT - Pause/Start the worker
DELETE - Destroy?
/worker/<int: id>

GET all hyperpartitions
note: values are currently base64 encoded and the API should make a note of this
/hyperpartitions

GET a single hyperpartition
/hyperpartitions/<int:id>

GET the hyperpartitions used in a datarun
/hyperpartitions/datarun/<int:datarun_id>

GET the classifiers using a hyperpartition
/classifiers/hyperpartition/<int:hyperpartition_id>

Changed from @jobliz's PR:
/enter_data
/simple_worker

@RogerTangos

RogerTangos commented Oct 4, 2018

I'm posting discussion from our slack here so that it's available to public contributors

from @csala to @RogerTangos
Given that you are the only one who seems to have any experience with graphQL, and that you said that you are not very familiar with it, I would discard graphQL.

I think that REST will completely satisfy our needs, and it's something that most people are already familiar with.

@RogerTangos

from @csala

Apart from that, a couple of comments about the API design:

In general, when building a REST API I follow the following pattern for each possible "object" (e.g. dataruns, datasets, etc.)

  1. Getting all the objects is done in the base path: GET /object_name
  2. Getting multiple objects but not all, is done in the base path, filtering the query string: GET /object_name?condition1=value1&condition2=value2
  3. Getting a single object by ID is done in the base path + id: GET /object_name/<object-id>
  4. Put (which creates a NEW object) is done on the base path: PUT /object_name + Object Body
  5. Post (which updates an object) is done on the base_path + id: POST /object_name/<object-id> + Object Body
  6. Delete is done on the base_path + id: DELETE /object_name/<object-id>

So, for example, I would implement this as follows:

  1. Getting all the dataruns: GET /dataruns
  2. Getting all the dataruns for a given dataset: GET /dataruns?dataset_id=<dataset-id>
  3. Getting a single datarun, by id: GET /dataruns/<datarun-id>
  4. Creating a new datarun: PUT /dataruns + JSON indicating the datarun details

Apart from that, I would also implement DELETE in some cases, where it makes sense:

  1. If adding "static" things like a dataset is possible, I think that deleting them should also be. So, I would also implement DELETE /datasets/.
  2. Similarly, should we also allow deleting other things like dataruns as a "cleanup measure"?

Another example of the schema explained above: to "train a new classifier on a datarun", instead of posting to /classifiers/datarun/<int:datarun_id>, I would use:

POST /classifiers
{
    "datarun_id": "datarun_id",
    ...
}
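To make the filtering pattern concrete, in Flask the dataruns read endpoints would look roughly like this (a sketch; session and Datarun stand for ATM's SQLAlchemy session and mapped class, and to_json is whatever serialization we settle on):

from flask import Flask, jsonify, request

app = Flask(__name__)
# session, Datarun and to_json come from ATM's database layer (assumed here).


@app.route('/dataruns', methods=['GET'])
def get_dataruns():
    # Points 1 and 2: no query string returns everything, and recognized
    # parameters (e.g. ?dataset_id=1) narrow the result set.
    filters = {}
    dataset_id = request.args.get('dataset_id', type=int)
    if dataset_id is not None:
        filters['dataset_id'] = dataset_id

    dataruns = session.query(Datarun).filter_by(**filters).all()
    return jsonify([d.to_json() for d in dataruns])


@app.route('/dataruns/<int:datarun_id>', methods=['GET'])
def get_datarun(datarun_id):
    # Point 3: a single object is fetched on the base path + id.
    datarun = session.query(Datarun).get(datarun_id)
    if datarun is None:
        return jsonify({'error': 'not found'}), 404
    return jsonify(datarun.to_json())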

@RogerTangos

from @RogerTangos

Good critique. Thanks @csala!

I think we should allow for deleting dataruns. TBH, I haven't spent as much time on the methods as I'd like to. (I was trying to get that out before leaving work.)
Tomorrow [Wednesday October 4] is a holiday in Germany. I may do some more work then, but will push out a new draft for Thursday.

@micah - responding to the ATM-Vis stuff first: I've taken a look at their API, and although it does cover more use cases, I don't think it's nearly as well thought out as what @jobliz put together in the PR.

@RogerTangos

from @RogerTangos
The ATM-Viz group does have some other shim code, which is good to look at, but I think it should be integrated directly into the classes themselves.
For GraphQL, my reasoning is that it's generally a much more expressive syntax. Instead of having to hit multiple endpoints, the client can specify exactly what they want and (in most cases) receive exactly that. This lends itself to better stats on what clients actually want and to faster development.

I went through their getting started guide. Once under the hood, it seems surprisingly simple. There are methods that would need to be defined on the classes, which need to have some relationships to each other. Those relationships are already explicit thanks to SQLAlchemy, so my impression is that it's an equal or slightly larger amount of work for a significantly easier-to-use client experience.
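To give a sense of the amount of work involved, wiring SQLAlchemy models into a GraphQL schema with the graphene and graphene-sqlalchemy packages looks roughly like this (a sketch using toy stand-ins for ATM's Database.Datarun / Database.Classifier classes; a real version would reuse the existing models and session):

import graphene
from graphene_sqlalchemy import SQLAlchemyObjectType
from sqlalchemy import Column, Float, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()


# Toy stand-ins for ATM's mapped classes; the real schema would point at
# Database.Datarun and Database.Classifier instead.
class Datarun(Base):
    __tablename__ = 'dataruns'
    id = Column(Integer, primary_key=True)
    description = Column(String)
    classifiers = relationship('Classifier')


class Classifier(Base):
    __tablename__ = 'classifiers'
    id = Column(Integer, primary_key=True)
    datarun_id = Column(Integer, ForeignKey('dataruns.id'))
    cv_judgment_metric = Column(Float)


class DatarunType(SQLAlchemyObjectType):
    class Meta:
        model = Datarun  # SQLAlchemy relationships become nested GraphQL fields


class ClassifierType(SQLAlchemyObjectType):
    class Meta:
        model = Classifier


class Query(graphene.ObjectType):
    datarun = graphene.Field(DatarunType, id=graphene.Int(required=True))

    def resolve_datarun(self, info, id):
        # get_query expects a SQLAlchemy session to be available on info.context
        return DatarunType.get_query(info).get(id)


schema = graphene.Schema(query=Query)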

@RogerTangos

from @RogerTangos to @csala
I see your reasoning on going with something familiar. (REST has been around forever and is also quite familiar to me.) Will you have time to look at the graphQL docs sometime soon? I’d be interested to hear your impression after taking a look.

Because of your feedback, I'm leaning towards just building a REST API, but I'd like to make that decision based on more than familiarity.

(I'm really unhappy with the fickleness of front-end development, and actually dragged my feet on looking at graphQL because of that. After having a look though, I'm realizing it's a specification - not an implementation - and I think it's likely to stick around.)

@RogerTangos

RogerTangos commented Oct 4, 2018

from @micahjsmith

albert [@RogerTangos] I looked into GraphQL based on your recommendation and it does look really great. If I understand correctly, the main use is the endpoint where someone wants to get all classifiers associated with a datarun. Then, we would specify an allowed query using graphql on the server that returns, for a given datarun, any info about the dataset and classifiers that the client requires. The REST alternative would be to first make a GET /dataruns/1 and then make a series of GETs to /classifiers/1, etc.?


</copypasta>

@RogerTangos

Hey @micahjsmith, thanks for taking a look. That's correct. It would also allow the client to extend that query to retrieve metrics about the classifier.

The alternative is something similar to the endpoints that @csala and I have been discussing above. (I'm editing those today.)

@csala have you had a chance to look at graphQL?

@micahjsmith
Member

I'll let @csala weigh in, as it seems like the two of you have more expertise than I do on this type of question. But if it seems that both approaches will work, and you are the one putting in the time to implement this feature, I'd think you could use your best judgment.

@RogerTangos

RogerTangos commented Oct 9, 2018

New endpoints. I'm working on implementing these

GET summarize datasets table
/datasets

GET information about a particular dataset
POST to create a new dataset
DELETE to remove a dataset
PUT to replace a dataset with the one supplied in the request
/datasets?id=<int:id>&name=<str: name>&description=<str: description>

note: Further down, this is abbreviated as follows:

/datasets?<int:id>&<str: name>&<str: description>

GET info about all the times we ran ATM
/dataruns

GET info about one or more dataruns
DELETE to remove a datarun
/dataruns?<int: id>&<int: dataset_id>&<str: description>&<str: tuner>&<str: status>

PATCH to stop a datarun
/dataruns?<int: id>&<int: dataset_id>&<str: description>&<str: tuner>&<str: status>/status

GET all classifiers that were trained in dataruns
/classifiers

GET one or more classifiers
DELETE one or more
returns link to download model(s)
/classifiers?<int:id>&<int:datarun_id>&<hyperpartition_id>

GET information on all workers
POST to create a worker
/worker

GET information about a single worker, including the status
DELETE a worker
PUT to replace a worker
PATCH to update a worker
/worker?<int: id>&<int: dataset_id>

GET one or more hyperpartitions
/hyperpartitions?<int: id>&<int: datarun_id>&<str: method>&<str: status>

PATCH to disable/enable a hyperpartition for further search
/hyperpartitions?<int: id>&<int: datarun_id>&<str: method>&<str: status>
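For illustration, clients would then hit the filter-style endpoints with query strings along these lines (hypothetical values and port):

curl 'localhost:5000/dataruns?dataset_id=1&status=running' | jq
curl 'localhost:5000/classifiers?datarun_id=1' | jq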

@micahjsmith
Member

A couple questions:

GET info about one or more dataruns
DELETE to remove a datarun
/dataruns?<int: id>&<int: dataset_id>&<str: description>&<str: tuner>&<str: status>

PATCH to stop a datarun
/dataruns?<int: id>&<int: dataset_id>&<str: description>&<str: tuner>&<str: status>/status

Why not just provide the datarun id? Why is tuner a query parameter? Would we stop dataruns with a tuner matching some string?

PUT to update or replace a worker
PATCH to update/modify a worker

What is the difference here?

PATCH to activate or deactivate the worker
/worker?<int: id>?<int: dataset_id>/status

Thought we discussed that this is not currently possible or relevant

@RogerTangos

Thank you for taking a look!

GET info about one or more dataruns
DELETE to remove a datarun
/dataruns?<int: id>&<int: dataset_id>&<str: description>&<str: tuner>&<str: status>

PATCH to stop a datarun
/dataruns?<int: id>&<int: dataset_id>&<str: description>&<str: tuner>&<str: status>/status

Why not just provide the datarun id? Why is tuner a query parameter? Would we stop dataruns with a tuner matching some string?

I'm trying to provide a maximal ability for users to search for a datarun here. I'd also like to include <str: selector> and other parameters in the datasets table. I should have noted that users could stop or start dataruns based on the query string.

Do you think it'd be better to limit searching to just the ID?

PATCH to activate or deactivate the worker
/worker?<int: id>?<int: dataset_id>/status

Thought we discussed that this is not currently possible or relevant

Good catch. We had. I just forgot to remove those lines when copy pasting. Edited above.

@micahjsmith
Member

Thanks for your explanations.

Do you think it'd be better to limit searching to just the ID?

Based on your explanations, I don't think it makes sense to limit searching, though I wonder whether it makes sense to limit deletion. DELETE /dataruns?tuner=gp, for example, could be overly destructive.

@RogerTangos

Note: There are requests from Kalyan that

  1. We be able to provide API keys
  2. We make datasets storable off the main server (probably in s3 buckets) with supplied links.

Both seem doable. I'll make a separate issue for no. 2 if it's not already possible.

@takabayashi

Hey Guys,

I am the architect and main contributor of [marvin](https://github.com/apache/incubator-marvin). Marvin is a platform that makes it easy to create, deploy and manage models. We already have a lot of pieces that deliver APIs and address other architectural concerns. This example shows how I integrated another AutoML library. The project is very new as an Apache project.

My idea is to start helping you out, because I believe AutoML is the future.

This is one paper that, more or less, shows the idea that supports marvin.

Summarizing the benefits of this integration:

  • REST APIs to execute actions: training, evaluation and prediction
  • S3 or HDFS as a storage backend to save and version models, datasets and metrics
  • Online model reloading based on metrics
  • Distributed architecture based on the actor model

What do you guys think about this?

@csala csala added this to the 0.1.1 milestone Apr 25, 2019
@csala
Contributor

csala commented Apr 25, 2019

The first read-only version of the REST API is already in master, so I'm closing this issue; future enhancements can be added as new issues.

@csala csala closed this as completed Apr 25, 2019
@csala csala modified the milestones: 0.1.1, 0.1.2 Apr 25, 2019