Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DS-4301] Added Content Reports section and Filtered Collections report therein #202

Merged
merged 25 commits into from
Feb 28, 2024

Conversation

jeffmorin
Copy link
Contributor

This pull request features a REST specification for the first of the two REST-based reports to be ported to DSpace 7.x (see ticket DS-4301, DSpace/DSpace#7641), the Filtered Collections report.

Following a suggestion from Andrea Bollini, I created a new /contentreports branch in which I created a /filteredcollections URI for the report itself. I added an entry under Endpoints Under Development/Discussion in endpoints.md.

I also added a missing link and removed a duplicate entry in endpoints.md.

@jeffmorin
Copy link
Contributor Author

Added the second report, Filtered Items (for metadata queries).

Copy link
Member

@tdonohue tdonohue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeffmorin : Gave this a quick review today (while also reviewing the corresponding backend PR: DSpace/DSpace#8598 -- still testing it though). I have some questions about the design here, namely about the GET vs POST versions of these endpoints... I'm not sure I understand why POST is used so heavily (as POST tends to mean you are creating something new).

Also, I noticed that in the old REST API, these were all GET queries: https://wiki.lyrasis.org/display/DSDOC6x/REST+Reports+-+Summary+of+API+Calls

It's possible I'm simply forgetting why these are switched to POST, or it has to do with the limit to the number of params that you pass to a GET. But, I think we need to describe GET vs POST better in this Contract.

?filters=is_discoverable,has_multiple_originals,has_pdf_original
```

In POST mode, it is defined as a JSON document like this:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if we describe GET and POST mode separately. I'm having a hard time understanding the way this is documented. When would someone use GET and when would they use POST? It's unclear if everything below this point in the docs is ONLY for POST or if it also applies to GET? Could we give some more examples here as to what the differences are?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reorganized report parameterization in the Filtered Collections report documentation. I also fixed a few mistakes and added some info I realised that was missing (e.g., definition of "basic report" in Filtered Items).

About usage of GET vs. POST, please see my other comments below.

[Back to the list of all defined endpoints](endpoints.md)

## Statistics for the whole repository
**POST /api/contentreports/filtereditems**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a POST instead of a GET? I notice that the "statistics" endpoints always use GET except when they are adding data to the statistics. See https://github.com/DSpace/RestContract/blob/main/statistics-reports.md and https://github.com/DSpace/RestContract/blob/main/statistics-viewevents.md

Could we better describe why we need to use POST for these endpoints? It appears they are readonly, which implies they might be switched to GET.

Copy link
Contributor Author

@jeffmorin jeffmorin Jan 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, I thought about it. My concern with GET is, as you suggested above, the limited length of the parameters passed as part of the URL. There should be no problems with the Filtered Collections report (only Boolean filters).

Parameterization of the Filtered Items report, however, is much more complex and can easily become long enough to exceed any limit enforced by application servers for URL query strings. This is why I implemented this report as a POST endpoint.

For (a bit of) uniformity, I also added POST support to the Filtered Collections report.

Besides, while the HTTP spec clearly states that GET should be used for read-only requests, I saw nothing stating that POST should be used only for data-changing requests.

If you feel that everything should be switched to GET anyway, I can do it. In this case, the Filtered Items report shall be thoroughly tested with lots of parameters selected to make sure that nothing goes wrong due to too long a query string.

Copy link
Contributor Author

@jeffmorin jeffmorin Jan 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About Filtered Items: The parameters could be organized into a GET query string, although such a string might become quite long. Another concern I had while designing the API for this report is the "query predicates" part: these are structured parameters (a query predicate is a (field, operator, value) tuple). This is another reason why I didn't include a GET version of this report.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check: https://github.com/DSpace/RestContract/blob/main/search-endpoint.md solution.

A query string solution it's used for filtering results:

f.<:filter-name>=<:filter-value>,<:filter-operator>

where a filter-name belongs to a predefined structure that is previously returned. Like: f.title=rainbows,notcontains

{
  "query":"my query",
  "scope":"9076bd16-e69a-48d6-9e41-0238cb40d863",
  "appliedFilters": [
      {
        "filter" : "title",
        "operator" : "notcontains",
        "value" : "abcd",
        "label" : "abcd"
      },

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this discussion stems from a disagreement between HTTP and REST about what POST is for. RFC 9110 says that creating a new resource is only one possible use for POST. "The POST method requests that the target resource process the representation enclosed in the request according to the resource's own specific semantics." The description of POST here is quite a bit narrower. One might say that REST is a reuse of HTTP syntax with different semantics.

So it can be argued that REST is not a very good fit for this operation, but it is what we have.

Would it be a violation of the "spirit of REST" to consider such a POST to be storing a report description, which is consumed in the act of generating the report? Reports may take some time to create. Suppose that one POSTs a document describing the desired report and receives a token in the response. The report generator runs in the background. When finished, the token can be presented (using GET) to receive the report, and the report description is then destroyed.

@jeffmorin
Copy link
Contributor Author

I changed all contentreports occurrences to contentreport to address a request in the REST layer regarding category names.

@tdonohue tdonohue removed this from the 7.5 milestone Feb 21, 2023
@paulo-graca
Copy link
Contributor

Thank you @jeffmorin for this important contribution. I would like to help you bringing this to DSpace core. I didn't (yet) check your implementation and I was expecting to have the DSpace demo site working for comparing, but it's unavailable.

I agree with @tdonohue , our Rest Contract defines that POST method should be used to create new resources, leaving the GET to access resources:
https://github.com/DSpace/RestContract#on-collection-of-resources-endpoints

My understanting it's that we aren't creating resources, but access them, so we should use GET.
I understand that we have a GET max chars limitation and a POST could help us in this matter. That's why this solution.
A somehow similiar feature (to me) was addressed at:
https://github.com/DSpace/RestContract/blob/main/search-endpoint.md

At the discovery or search endpoints, a complex structure is defined at the backend, stating what the client can do (defining available browses, facets, sorts options,...). It's also possible use query string to pass params. But, when pre-defining what structure to use, you don't need to pass a lot of query params.
I didn't saw (yet) your implementation, but this feature could somehow benefict from discovery's rest design.

@mwoodiupui
Copy link
Member

I could argue that we are neither creating nor accessing resources, because these reports are not resources; they are snapshots of the operational state of a store of resources. (I realize that this is the opposite of what I said above.) What is wanted is really not REST at all, but REST is all that we have to work with. REST basically defines putting things into a store and getting them out again. Complex configurable reports on the status of the store itself are something that REST isn't able to do if one cares about ideological purity.

The notion of triggering pre-defined reports (which is how I understand the Discovery analogy) is interesting, but I think that we need to consider how the reports will be used. Statistics might be a useful model. One sometimes doesn't yet know what questions to ask, so one engages in wide-ranging exploration of the data to find promising perspectives. At other times (such as after exploratory examination) one has specific questions to be answered, possibly over and over again as the dataset evolves. Is this a good fit to the way we expect Content Reports to be used? Because doing exploratory work by repeatedly editing a set of preconfigured reports on the back-end sounds slow, tedious and exhausting.

@tdonohue tdonohue self-requested a review April 13, 2023 14:40
@paulo-graca
Copy link
Contributor

I don't have access to this feature right now, but I found a very useful video by Terry Brady (@terrywbrady )
https://www.youtube.com/watch?v=K2gGHYUZI40 and he shows some work they did and how we can interact with it.

Just to contextualize others, I would also like to add the Wiki page URL for the DS6 feature:
https://wiki.lyrasis.org/display/DSDOC6x/REST+Reports+-+Summary+of+API+Calls

@jeffmorin
Copy link
Contributor Author

I just added a GET endpoint to the Filtered Items content report. I also made a minor update to the Filtered Collections report documentation.

Copy link
Member

@abollini abollini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeffmorin thanks for sharing your proposal to move the DSpace 6 reporting system to DSpace 7. I have added my feedback inline

contentreport-filteredcollections.md Show resolved Hide resolved
Comment on lines 12 to 27
The GET-based endpoint takes a `filters` query parameter whose value is a comma-separated list of filters
like the following:
```
?filters=is_discoverable,has_multiple_originals,has_pdf_original
```

Alternatively, the comma-separated list can be replaced by a repetition of the `filters` parameter
for each requested filter:
```
?filters=is_discoverable&filter=has_multiple_originals&filter=has_pdf_original
```


Please see [below](#available-filters) for the list of available filters.

**POST /api/contentreport/filteredcollections**
Copy link
Member

@abollini abollini May 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend against that. We should use the filter to build an unique id of the report, similar to what is done in the statistics endpoint see https://github.com/DSpace/RestContract/blob/main/statistics-reports.md
In this way the /api/contentreports/filteredcollections will represent our collection of resources endpoint (it can return 405 as there is no use case in scrolling all the potential report) and /api/contentreports/filteredcollections/<concatenation-of-filter-as-report-id> the individual report. In this way the HTTP caching mechanism will work at the best

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be very close to the discovery endpoint... we need to build a query and get the list of matching items. If we want to get a different representation of the resulting items it should be an "export" script.
My suggestion is to look to which "filter" type are not yet supported in discover and eventually implement them instead than build from scratch a separate endpoint. This would allow to reuse almost everything in the UI just defining a special discovery configuration used for this purpose (similar to what has been done for the administrative search)

@jeffmorin
Copy link
Contributor Author

Please see the comment I wrote at DSpace/DSpace#8598.

@jeffmorin
Copy link
Contributor Author

My fork is now synchronized with the main branch.

Copy link
Contributor

@paulo-graca paulo-graca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jeffmorin !
I've already had a manifested my thoughts about when we should bring this feature into DS. But that doesn't mean that I don't want it for DS8. I just think we need to straighten our heads in which way is the right path for it, then build the needed scaffolding to have this feature in top of it. Just for comparison, a specific Working Group was created just to address discussions about Entities single feature on DSpace 7.
I'm afraid that having this feature as is, others might build features on top of it, or use it and then it will be very hard to have it refactored later on.

DS6 content report feature has a set of functionalities that are still missing in this specification on this PR, for example collection filters (I didn't see any endpoint specification that retrieved available filters) and also filtering by collection id.

You can take a glance on:
https://wiki.lyrasis.org/display/DSDOC6x/REST+Reports+-+Collection+Report+Screenshots+with+Annotated+API+Calls

Also, I think queryPredicates at the item filtering level could be better described in the documentation. And I also think we need to have endpoints for what type of operations are available within fields (perhaps advanced search could help in here).
Even if the implementation doesn't support it yet, I think those missing endpoints should be defined at this phase.

@mwoodiupui
Copy link
Member

Perhaps some of the difficulty is in calling it "beta"? That implies that the design is finished and the code is expected to work well. Would "experimental" or "preview" be more suitable? With a stern warning: "the work is unfinished and we expect to make substantial changes in the final, supported code to address, at least, these issues...."

Because that's what was agreed upon, I think: an early release in 8.0 despite 8.0's compressed schedule, to get this facility into the hands of users this year and to gather wider experience; and a stable release with full support by 9.0.

If people choose to rely on internals of code when they've been clearly warned not to, that is their problem, and we should not be shy about reminding them. Politely appreciative of feedback: yes; apologetic: no.

It would be helpful to have missing features logged as issues so that they can be studied, scheduled and tracked.

@tdonohue
Copy link
Member

tdonohue commented Feb 12, 2024

@paulo-graca : From talking with @pilasou and @jeffmorin , I don't think it's possible to update this PR any further in time for 8.0. The decision for 8.0 was to scale back because @pilasou and @jeffmorin don't have availability to do any more detailed work on this code. We have to remember that, while this is wanted by many, the development on this is entirely donated by U of Laval.

There are only two options that remain:

  1. We accept this work pretty much "as-is" for 8.0 based on the agreement we achieved in recent DevMtgs (see notes from Feb 8 meeting). Then we would work to improve it / fix bugs in 8.1+ and 9.0.
  2. OR, the only other option is to delay everything until 9.0. There is no time remaining to add more features or even design new endpoints....both of those take time, and we have no time left to get this into 8.0. (Keep in mind, this feature was initially built for 7.x, but was delayed until 8.0. This option would be yet another delay)

As decided in the meeting on Feb 8, we voted for option 1. That means we let this into 8.0 so that everyone can start to improve/enhance it. But, put warnings on it that it is missing features & there are some known issues. Those missing features / known issues can be detailed in GitHub issue tickets to be addressed in future 8.x or 9.0 releases.

@paulo-graca
Copy link
Contributor

@tdonohue, to me, there are two different aspects here:

  • The first one is what you just mentioned, and it's related to code changes and missing features. I completely understand that we don't have the time and necessary resources to address them in time for the 8.0 release. I want to make it clear that I don't oppose it for DS8, nor am I in any kind of blocking position here. I want to move it forward. The decision in our meeting was to address only minor things like the requests (POST->GET) and merge it as it is.
  • The second aspect relates specifically to this PR, the REST contract, and my comments were about it. I understand that some specifications are still missing, and some parts should be more detailed, such as the queryPredicates. I understand that the REST contract, generally speaking, could address the definition of parts that aren't yet implemented. So, I think my suggestions don't pose a blocking state to this feature (Content Reports) for DS8, especially because the REST Contract isn't tied to any specific version of DSpace. It's very version-agnostic (>=7.0)

@tdonohue
Copy link
Member

tdonohue commented Feb 12, 2024

@paulo-graca : To clarify my point... I disagree with this statement:

Even if the implementation doesn't support it yet, I think those missing endpoints should be defined at this phase.

I feel the REST Contract should only define the endpoints that are implemented at this time. Otherwise, it is misleading to other developers as to what features actually exist. (And to clarify, we do have a versioned REST Contract. We have the main branch which is pre-8.0, and the dspace-7_x branch, which is the 7.x version of the REST Contract. So, the contract is versioned in the same way that the backend is versioned because the contract describes the backend.)

Overall, I agree that we should enhance this contract as necessary to describe the current implementation. However, we should not add endpoints which do not yet exist, as the Contract is meant to describe the implementation.

I hope that clarifies things. If you do have feedback on the contract compared to the current implementation, then please do point it out. We both agree that there is the opportunity to clean up this PR as the implementation PR also gets cleaned up.

Copy link
Member

@tdonohue tdonohue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Thanks @jeffmorin ! This looks good now.

@tdonohue tdonohue added this to the 8.0 milestone Feb 28, 2024
@tdonohue tdonohue merged commit 1ed2768 into DSpace:main Feb 28, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

5 participants