Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GET/PUT file ID/path handlers #12

Merged
merged 4 commits into from
Oct 17, 2022

Conversation

davidbrochart
Copy link
Contributor

No description provided.

@codecov-commenter
Copy link

codecov-commenter commented Oct 13, 2022

Codecov Report

Base: 81.98% // Head: 77.53% // Decreases project coverage by -4.45% ⚠️

Coverage data is based on head (4e47dcd) compared to base (0db749a).
Patch coverage: 50.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #12      +/-   ##
==========================================
- Coverage   81.98%   77.53%   -4.46%     
==========================================
  Files           4        5       +1     
  Lines         272      316      +44     
  Branches       41       45       +4     
==========================================
+ Hits          223      245      +22     
- Misses         37       59      +22     
  Partials       12       12              
Impacted Files Coverage Δ
jupyter_server_fileid/handlers.py 48.78% <48.78%> (ø)
jupyter_server_fileid/extension.py 50.00% <66.66%> (+2.38%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@web.authenticated
@authorized
async def get(self, path):
idx = self.settings["file_id_manager"].index(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be:

Suggested change
idx = self.settings["file_id_manager"].index(path)
idx = self.settings["file_id_manager"].get_id(path)

otherwise, index() will add the path (assuming it can be stat'd) and return the newly created ID.

Are you planning on supporting the ability to fetch a file's path given its ID? This will be necessary for external references (e.g., comments) to locate their parent/host file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that get_id doesn't index the file automatically, while index does. So I guess we need the latter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you planning on supporting the ability to fetch a file's path given its ID?

Maybe in another PR, but it's not needed for now in JupyterLab.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that get_id doesn't index the file automatically, while index does. So I guess we need the latter?

I think so - otherwise you're potentially performing a transaction on a fetch request - which is contrary to REST best practices (AFAIU).

Maybe in another PR, but it's not needed for now in JupyterLab.

Ok - this should be fine. You won't be able to use a similar format of endpoint in that case as it will be ambiguous so I wasn't sure if supporting path-fetching will affect this endpoint. You could probably take an approach similar to kernel interrupt and restarts with an endpoint definition of something like api/fileid/{id,path}/(.*) although I'm sure there are more elegant approaches. My comment was more about not impacting what is essentially "set in stone" when fetching paths is required - which will be the case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so - otherwise you're potentially performing a transaction on a fetch request - which is contrary to REST best practices (AFAIU).

I think you misunderstood me, we need index which might perform a transaction.

with an endpoint definition of something like api/fileid/{id,path}/(.*)

I think you're right, we need api/fileid/id/(.*) here, and we'll need api/fileid/path/(.*) when we want to get the path corresponding to a given ID.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that we will need both the path and id variant. @dlqqq can you help us understand which underlying methods to call?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Kevin's concern is that a GET request shouldn't be creating a new resource under RESTful semantics. Indexing creates a file ID "resource" and is also an idempotent operation, so under RESTful semantics, it should be called via PUT.

An API that conforms strictly to RESTful semantics might look something like this:

GET /api/fileid/files/{path or id} => calls get_path() if given ID, or get_id() if given path
PUT /api/fileid/ids => index a file at a given path (provided in request payload or URI query string)

An API compatible with @davidbrochart's suggestion might look something like this:

GET /api/fileid/{path or id} => index a file if given path, or get_path() if given ID

Personally, I'm inclined towards Kevin's suggestion as it allows REST API users to distinguish between calling get_id() and index(), which is useful in scenarios where a user wants to check whether a file was indexed previously but doesn't want to index it now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The obvious issue with putting the path or ID in the request URL rather than as a query string is that it's not possible for the backend to distinguish whether you're talking about a file with the name of a UUID located in the server root or if you're talking about a file ID. So regardless of which high-level API design we settle on, we should probably modify the GET requests to explicitly specify whether they're passing in a path or an ID via ?path=... or ?id=....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, from the client point of view, GETting a file ID seems like the right HTTP verb. The server creating a new resource is an implementation detail. For instance, if all files under a root directory were indexed at startup and then tracked for any change (renaming, etc.), then there would be no need for indexing in this GET request. The client doesn't ask for a new resource, it asks for a property of a file.
I don't have any preference as to where to specify what we request (an ID for a path or a path for an ID). It could be in the URL path:

GET /api/fileid/id/{path}: get an ID for a path
GET /api/fileid/path/{id}: get a path for an ID

or as a query parameter:

GET /api/fileid?path={path}: get an ID for a path
GET /api/fileid?id={id}: get a path for an ID

The latter is slightly more complicated as we need to check that only one parameter is passed, and the returned schema could theoretically be different for a path or an ID, but it's not a big deal.

@kevin-bates
Copy link
Member

@dlqqq is correct regarding my concern about RESTful API semantics. As David Q points out, users should be able to determine that a given path is not currently indexed, so, in those cases, GET should return some indication (404 perhaps?) if the path is not indexed. Users would then turn around and issue a PUT if they wanted to create the index (which should probably return a 404 if the path doesn't exist). Since David B knows he wants to index the path if it's not indexed already, then PUT (with its idempotent semantics) is the correct approach for that (IMO).

Regarding the endpoints, I believe Jupyter, thus far, has tried to not introduce parameters and instead used "determinators" embedded into the endpoint. This would argue for api/fileid/{id,path}/{path-value, id-value} where the "determinator" of id indicates that the target "resource" is an ID and the remaining portion of the endpoint is the path, while a "determinator" of path indicates the opposite (resource is a PATH with remaining portion the ID).

If there's a desire to keep the implementation to a single handler, you could then have PUT api/fileid/id/my/path/to/my/file indicate that an ID is created relative to my/path/to/my/file and, due to its idempotent semantic, will return the existing ID should it already be indexed. The analogous PUT api/fileid/path/<uuid> would probably return a 501 (not implemented) since the FileID service doesn't allow for BYOID (currently). (If using UUID as a PK, we could support this endpoint, which would require a body to provide the file path.)

@davidbrochart
Copy link
Contributor Author

users should be able to determine that a given path is not currently indexed

I have difficulty understanding why they would want to know that, and if it should be allowed. For me an ID is like a property of the file, just like its size.

This would argue for api/fileid/{id,path}/{path-value, id-value} where the "determinator" of id indicates that the target "resource" is an ID and the remaining portion of the endpoint is the path

I'm fine with that too.

@ellisonbg
Copy link

ellisonbg commented Oct 14, 2022 via email

@kevin-bates
Copy link
Member

The only case it
doesn't cover is if the client wants to know if the path has been assigned
an id or not. If we believe that a client will need that information, then
the dual GET/PUT API makes sense.

If we're ever going to want to allow BYOID (which UUIDs enable - in addition to global uniqueness) we need GET to indicate a given path is not currently indexed. PUT would then be used for cases where the user needs the ID and doesn't want to provide it (i.e., let the service generate the ID), while POST could be used to associate a given ID to a path.

However, getting down to brass tacks, GET is defined to be both idempotent and safe and if GET requests can change the state of the server, they are not "safe". Using PUT we'd return a 201 when the index is created and 200 when it already exists - just as the spec indicates (and 404 if the path doesn't exist). This is the precise behavior David B wants and we maintain appropriate semantics - win, win.

@ellisonbg
Copy link

ellisonbg commented Oct 14, 2022 via email

@davidbrochart
Copy link
Contributor Author

It seems strange to me to use PUT to actually get something, but I'm definitely not an expert in requests semantics, so I trust you guys.

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good to me.

Should extensions specify/expose their own Open API (swagger) docs? If so, that can be added later - although I'm not really finding precedent for this at the moment.

@ellisonbg
Copy link

ellisonbg commented Oct 17, 2022 via email

Copy link
Collaborator

@dlqqq dlqqq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevin-bates @davidbrochart Thank you two for resolving your concerns together while I was away! I love this new REST API. It balances readability with usability.

It seems strange to me to use PUT to actually get something, but I'm definitely not an expert in requests semantics, so I trust you guys.

Again, the concern is that index() has the potential to write to the database, which a GET request shouldn't do. In a way, all requests are "getting" something from the backend, be it a JSON payload or status code. GET should be reserved for requests that involve exclusively database reads, when we're talking about REST APIs for backends backed by a database. Of course the semantics here get muddled if every request writes telemetry/logging data, but generally we don't count that. 😉

Question, why PUT and not POST? Isn't POST` used in situation where the
server is responsible for assigning an identifier?

@ellisonbg PUT is essentially POST with an idempotency constraint. PUT is to POST as "upsert" is to "insert". For more, see: https://www.rfc-editor.org/rfc/rfc2616#section-9.1.2

@dlqqq
Copy link
Collaborator

dlqqq commented Oct 17, 2022

@davidbrochart Feel free to merge if this PR is complete.

@ellisonbg
Copy link

ellisonbg commented Oct 17, 2022 via email

@davidbrochart
Copy link
Contributor Author

Again, the concern is that index() has the potential to write to the database, which a GET request shouldn't do.

As I said, it's an implementation detail. When I implement this feature in jupyverse, I will probably use watchfiles to detect file changes and automatically index them before any request is made. So I could have used a GET because the file will always be indexed before getting its ID, hence no write to the database inside the GET. That is, as Kevin said, if we don't allow BYOID.
You could argue that there might be a small amount of time when the client requests an ID for a file that "is going to be indexed", because the GET request and the file change detection are parallel and independent tasks. But why would the client ask an ID for a file that it doesn't even know exists (because not detected yet) in the first place? You could say that it's because the client thought it existed and the file was swapped with another one before the GET request. Fair enough, but then the client has a problem, because it's going to open the wrong file. Maybe we should just return a 404 when the file is not yet indexed.
Anyway, all this seems quite arbitrary to me, but if everyone is fine with this API, I'll merge 😄

@kevin-bates
Copy link
Member

I totally see where you're coming from David. This discussion is primarily about the discipline required to adhere to a spec otherwise, what good is a spec? I like to think of REST verbs mapped to CRUD operations and GET maps to Read - where no transaction will be created. Indexing, on the other hand, can resort in a transaction. If one doesn't already exist, one will be created, etc. PUT, POST, and DELETE all require transactions. (This is all coming from years of Microsoft DCOM and J2EE EJB distributed transactions.)

Regardless of the implementation, clients wanting the ID of a given path need to first attempt GET - which will succeed 99% of the time in both implementations (but may not in others that might choose an on-demand approach to indexing), then fallback to PUT when GET returns 404.

@davidbrochart
Copy link
Contributor Author

Thanks everyone, merging!

@davidbrochart davidbrochart changed the title Add GET file ID handler Add GET/PUT file ID/path handlers Oct 17, 2022
@davidbrochart davidbrochart merged commit 7ec1f95 into jupyter-server:main Oct 17, 2022
@davidbrochart davidbrochart deleted the handler branch October 17, 2022 19:35
@davidbrochart
Copy link
Contributor Author

@dlqqq Would you mind adding davidbrochart as a maintainer on https://pypi.org/project/jupyter-server-fileid ?

@dlqqq
Copy link
Collaborator

dlqqq commented Oct 17, 2022

Added both @davidbrochart and @Zsailer as maintainers.

@davidbrochart
Copy link
Contributor Author

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants