Skip to content

Commit

Permalink
Merge remote-tracking branch 'IQSS/develop' into change_solr_doc_deletes
Browse files Browse the repository at this point in the history
  • Loading branch information
qqmyers committed Jun 12, 2024
2 parents a7839b5 + 5bf6b6d commit 5dc7b8f
Show file tree
Hide file tree
Showing 57 changed files with 725 additions and 194 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@ oauth-credentials.md
/src/main/webapp/oauth2/newAccount.html
scripts/api/setup-all.sh*
scripts/api/setup-all.*.log
src/main/resources/edu/harvard/iq/dataverse/openapi/

# ctags generated tag file
tags
Expand Down
1 change: 0 additions & 1 deletion Dockerfile

This file was deleted.

3 changes: 2 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ Dataverse is an [open source][] software platform for sharing, finding, citing,

We maintain a demo site at [demo.dataverse.org][] which you are welcome to use for testing and evaluating Dataverse.

To install Dataverse, please see our [Installation Guide][] which will prompt you to download our [latest release][].
To install Dataverse, please see our [Installation Guide][] which will prompt you to download our [latest release][]. Docker users should consult the [Container Guide][].

To discuss Dataverse with the community, please join our [mailing list][], participate in a [community call][], chat with us at [chat.dataverse.org][], or attend our annual [Dataverse Community Meeting][].

Expand All @@ -28,6 +28,7 @@ Dataverse is a trademark of President and Fellows of Harvard College and is regi
[Dataverse community]: https://dataverse.org/developers
[Installation Guide]: https://guides.dataverse.org/en/latest/installation/index.html
[latest release]: https://github.com/IQSS/dataverse/releases
[Container Guide]: https://guides.dataverse.org/en/latest/container/index.html
[features]: https://dataverse.org/software-features
[project board]: https://github.com/orgs/IQSS/projects/34
[roadmap]: https://www.iq.harvard.edu/roadmap-dataverse-project
Expand Down
4 changes: 2 additions & 2 deletions conf/solr/9.3.0/solrconfig.xml
Original file line number Diff line number Diff line change
Expand Up @@ -290,7 +290,7 @@
have some sort of hard autoCommit to limit the log size.
-->
<autoCommit>
<maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
<maxTime>${solr.autoCommit.maxTime:30000}</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>

Expand All @@ -301,7 +301,7 @@
-->

<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
<maxTime>${solr.autoSoftCommit.maxTime:1000}</maxTime>
</autoSoftCommit>

<!-- Update Related Event Listeners
Expand Down
8 changes: 8 additions & 0 deletions doc/release-notes/10236-openapi-definition-endpoint.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
In Dataverse 6.0 Payara was updated, which caused the url `/openapi` to stop working:

- https://github.com/IQSS/dataverse/issues/9981
- https://github.com/payara/Payara/issues/6369

When it worked in Dataverse 5.x, the `/openapi` output was generated automatically by Payara, but in this release we have switched to OpenAPI output produced by the [SmallRye OpenAPI plugin](https://github.com/smallrye/smallrye-open-api/tree/main/tools/maven-plugin). This gives us finer control over the output.

For more information, see the section on [OpenAPI](https://dataverse-guide--10328.org.readthedocs.build/en/10328/api/getting-started.html#openapi) in the API Guide.
1 change: 1 addition & 0 deletions doc/release-notes/10466-math-challenge-403-error-page.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
On forbidden access error page, also know as 403 error page, the math challenge is now correctly display to submit the contact form.
1 change: 1 addition & 0 deletions doc/release-notes/10547-solr-updates.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Multiple improvements have ben made to they way Solr indexing and searching is done. Response times should be significantly improved. See the individual PRs in this release for details.
5 changes: 5 additions & 0 deletions doc/release-notes/10554-avoid-solr-join-guest.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
Two experimental features flag called "add-publicobject-solr-field" and "avoid-expensive-solr-join" have been added to change how Solr documents are indexed for public objects and how Solr queries are constructed to accommodate access to restricted content (drafts, etc.). It is hoped that it will help with performance, especially on large instances and under load.

Before the search feature flag ("avoid-expensive...") can be turned on, the indexing flag must be enabled, and a full reindex performed. Otherwise publicly available objects are NOT going to be shown in search results.

For details see https://dataverse-guide--10555.org.readthedocs.build/en/10555/installation/config.html#feature-flags and #10555.
1 change: 1 addition & 0 deletions doc/release-notes/10561-3dviewer.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
3DViewer by openforestdata.pl has been added to the list of external tools: https://preview.guides.gdcc.io/en/develop/admin/external-tools.html#inventory-of-external-tools
1 change: 1 addition & 0 deletions doc/release-notes/10568-Fix File Reingest.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
A bug that prevented the Ingest option in the File page Edit File menu from working has been fixed
1 change: 1 addition & 0 deletions doc/release-notes/9729-release-notes.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
An error is now correctly reported when an attempt is made to assign an identical role to the same collection, dataset, or file. #9729 #10465
Original file line number Diff line number Diff line change
Expand Up @@ -7,3 +7,4 @@ Data Curation Tool configure file "A GUI for curating data by adding labels, gro
Ask the Data query file Ask the Data is an experimental tool that allows you ask natural language questions about the data contained in Dataverse tables (tabular data). See the README.md file at https://github.com/IQSS/askdataverse/tree/main/askthedata for the instructions on adding Ask the Data to your Dataverse installation.
TurboCurator by ICPSR configure dataset TurboCurator generates metadata improvements for title, description, and keywords. It relies on open AI's ChatGPT & ICPSR best practices. See the `TurboCurator Dataverse Administrator <https://turbocurator.icpsr.umich.edu/tc/adminabout/>`_ page for more details on how it works and adding TurboCurator to your Dataverse installation.
JupyterHub explore file The `Dataverse-to-JupyterHub Data Transfer Connector <https://forgemia.inra.fr/dipso/eosc-pillar/dataverse-jupyterhub-connector>`_ is a tool that simplifies the transfer of data between Dataverse repositories and the cloud-based platform JupyterHub. It is designed for researchers, scientists, and data analysts, facilitating collaboration on projects by seamlessly moving datasets and files. The tool is a lightweight client-side web application built using React and relies on the Dataverse External Tool feature, allowing for easy deployment on modern integration systems. Currently optimized for small to medium-sized files, future plans include extending support for larger files and signed Dataverse endpoints. For more details, you can refer to the external tool manifest: https://forgemia.inra.fr/dipso/eosc-pillar/dataverse-jupyterhub-connector/-/blob/master/externalTools.json
3DViewer by openforestdata.pl explore file The 3DViewer by openforestdata.pl can be used to explore 3D files (e.g. STL format). It was presented by Kamil Guryn during the 2020 community meeting (`slide deck <https://osf.io/aqgr5>`_, `video <https://youtu.be/YH4I_kldmGI?si=Y-9dZ3xPXeswchAl&t=2379>`_) and can be found at https://github.com/OpenForestData/open-forest-data-previewers
5 changes: 5 additions & 0 deletions doc/sphinx-guides/source/admin/harvestclients.rst
Original file line number Diff line number Diff line change
Expand Up @@ -47,3 +47,8 @@ What if a Run Fails?
Each harvesting client run logs a separate file per run to the app server's default logging directory (``/usr/local/payara6/glassfish/domains/domain1/logs/`` unless you've changed it). Look for filenames in the format ``harvest_TARGET_YYYY_MM_DD_timestamp.log`` to get a better idea of what's going wrong.

Note that you'll want to run a minimum of Dataverse Software 4.6, optimally 4.18 or beyond, for the best OAI-PMH interoperability.

Harvesting Non-OAI-PMH
~~~~~~~~~~~~~~~~~~~~~~

`DOI2PMH <https://github.com/IQSS/doi2pmh-server>`__ is a community-driven project intended to allow OAI-PMH harvesting from non-OAI-PMH sources.
7 changes: 7 additions & 0 deletions doc/sphinx-guides/source/api/apps.rst
Original file line number Diff line number Diff line change
Expand Up @@ -133,6 +133,13 @@ https://github.com/libis/rdm-integration
PHP
---

DOI2PMH
~~~~~~~

The DOI2PMH server allow Dataverse instances to harvest DOI through OAI-PMH from otherwise unharvestable sources.

https://github.com/IQSS/doi2pmh-server

OJS
~~~

Expand Down
1 change: 1 addition & 0 deletions doc/sphinx-guides/source/api/changelog.rst
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@ v6.0
----

- **/api/access/datafile**: When a null or invalid API token is provided to download a public (non-restricted) file with this API call, it will result on a ``401`` error response. Previously, the download was allowed (``200`` response). Please note that we noticed this change sometime between 5.9 and 6.0. If you can help us pinpoint the exact version (or commit!), please get in touch. See :doc:`dataaccess`.
- **/openapi**: This endpoint is currently broken. See https://github.com/IQSS/dataverse/issues/9981

v5.6
----
Expand Down
6 changes: 4 additions & 2 deletions doc/sphinx-guides/source/api/faq.rst
Original file line number Diff line number Diff line change
Expand Up @@ -56,12 +56,12 @@ Where is the Comprehensive List of All API Functionality?

There are so many Dataverse Software APIs that a single page in this guide would probably be overwhelming. See :ref:`list-of-dataverse-apis` for links to various pages.

It is possible to get a complete list of API functionality in Swagger/OpenAPI format if you deploy Dataverse Software 5.x. For details, see https://github.com/IQSS/dataverse/issues/5794
It is possible to get a complete list of API functionality in Swagger/OpenAPI format. See :ref:`openapi`.

Is There a Changelog of API Functionality That Has Been Added Over Time?
------------------------------------------------------------------------

No, but there probably should be. If you have suggestions for how it should look, please create an issue at https://github.com/IQSS/dataverse/issues
Changes to the API that don't break anything can be found in the `release notes <https://github.com/IQSS/dataverse/releases>`_ of each release. Breaking changes are documented in :doc:`changelog`.

.. _no-api:

Expand Down Expand Up @@ -89,6 +89,8 @@ Why Are the Return Values (HTTP Status Codes) Not Documented?

They should be. Please consider making a pull request to help. The :doc:`/developers/documentation` section of the Developer Guide should help you get started. :ref:`create-dataverse-api` has an example you can follow or you can come up with a better way.

Also, please note that we are starting to experiment with putting response codes in our OpenAPI document. See :ref:`openapi`.

What If My Question Is Not Answered Here?
-----------------------------------------

Expand Down
26 changes: 26 additions & 0 deletions doc/sphinx-guides/source/api/getting-started.rst
Original file line number Diff line number Diff line change
Expand Up @@ -154,6 +154,32 @@ Listing Permissions (Role Assignments)

See :ref:`list-role-assignments-on-a-dataverse-api`.

.. _openapi:

Getting the OpenAPI Document
----------------------------

You can access our `OpenAPI document`_ using the ``/openapi`` endpoint. The default format is YAML if no parameter is provided, but you can also obtain the JSON version by either passing ``format=json`` as a query parameter or by sending ``Accept:application/json`` (case-sensitive) as a header.

.. _OpenAPI document: https://spec.openapis.org/oas/latest.html#openapi-document

.. note:: See :ref:`curl-examples-and-environment-variables` if you are unfamiliar with the use of export below.

.. code-block:: bash
export SERVER_URL=https://demo.dataverse.org
export FORMAT=json
curl "$SERVER_URL/openapi?format=$FORMAT"
The fully expanded example above (without environment variables) looks like this:

.. code-block:: bash
curl "https://demo.dataverse.org/openapi?format=json"
We are aware that our OpenAPI document is not perfect. You can find more information about validating the document under :ref:`openapi-dev` in the Developer Guide.

Beyond "Getting Started" Tasks
------------------------------

Expand Down
2 changes: 1 addition & 1 deletion doc/sphinx-guides/source/api/native-api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1179,7 +1179,7 @@ See also :ref:`batch-exports-through-the-api` and the note below:
export PERSISTENT_IDENTIFIER=doi:10.5072/FK2/J8SJZB
export METADATA_FORMAT=ddi
curl "$SERVER_URL/api/datasets/export?exporter=$METADATA_FORMAT&persistentId=PERSISTENT_IDENTIFIER"
curl "$SERVER_URL/api/datasets/export?exporter=$METADATA_FORMAT&persistentId=$PERSISTENT_IDENTIFIER"
The fully expanded example above (without environment variables) looks like this:

Expand Down
15 changes: 15 additions & 0 deletions doc/sphinx-guides/source/developers/api-design.rst
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,21 @@ API design is a large topic. We expect this page to grow over time.
.. contents:: |toctitle|
:local:

.. _openapi-dev:

OpenAPI
-------

As you add API endpoints, please be conscious that we are exposing these endpoints as an OpenAPI document at ``/openapi`` (e.g. http://localhost:8080/openapi ). See :ref:`openapi` in the API Guide for the user-facing documentation.

We've played around with validation tools such as https://quobix.com/vacuum/ and https://pb33f.io/doctor/ only to discover that our OpenAPI output is less than ideal, generating various warnings and errors.

You can prevent additional problems in our OpenAPI document by observing the following practices:

- When creating a method name within an API class, make it unique.

If you are looking for a reference about the annotations used to generate the OpenAPI document, you can find it in the `MicroProfile OpenAPI Specification <https://download.eclipse.org/microprofile/microprofile-open-api-3.1/microprofile-openapi-spec-3.1.html#_detailed_usage_of_key_annotations>`_.

Paths
-----

Expand Down
12 changes: 3 additions & 9 deletions doc/sphinx-guides/source/developers/deployment.rst
Original file line number Diff line number Diff line change
Expand Up @@ -91,17 +91,11 @@ Download `ec2-create-instance.sh`_ and put it somewhere reasonable. For the purp

.. _ec2-create-instance.sh: https://raw.githubusercontent.com/GlobalDataverseCommunityConsortium/dataverse-ansible/master/ec2/ec2-create-instance.sh

To run it with default values you just need the script, but you may also want a current copy of the ansible `group vars <https://raw.githubusercontent.com/GlobalDataverseCommunityConsortium/dataverse-ansible/master/defaults/main.yml>`_ file.
To run the script, you can make it executable (``chmod 755 ec2-create-instance.sh``) or run it with bash, like this with ``-h`` as an argument to print the help:

ec2-create-instance accepts a number of command-line switches, including:
``bash ~/Downloads/ec2-create-instance.sh -h``

* -r: GitHub Repository URL (defaults to https://github.com/IQSS/dataverse.git)
* -b: branch to build (defaults to develop)
* -p: pemfile directory (defaults to $HOME)
* -g: Ansible GroupVars file (if you wish to override role defaults)
* -h: help (displays usage for each available option)

``bash ~/Downloads/ec2-create-instance.sh -b develop -r https://github.com/scholarsportal/dataverse.git -g main.yml``
If you run the script without any arguments, it should spin up the latest version of Dataverse.

You will need to wait for 15 minutes or so until the deployment is finished, longer if you've enabled sample data and/or the API test suite. Eventually, the output should tell you how to access the Dataverse installation in a web browser or via SSH. It will also provide instructions on how to delete the instance when you are finished with it. Please be aware that AWS charges per minute for a running instance. You may also delete your instance from https://console.aws.amazon.com/console/home?region=us-east-1 .

Expand Down
4 changes: 4 additions & 0 deletions doc/sphinx-guides/source/developers/performance.rst
Original file line number Diff line number Diff line change
Expand Up @@ -118,6 +118,10 @@ Solr

While in the past Solr performance hasn't been much of a concern, in recent years we've noticed performance problems when Harvard Dataverse is under load. Improvements were made in `PR #10050 <https://github.com/IQSS/dataverse/pull/10050>`_, for example.

We are tracking performance problems in `#10469 <https://github.com/IQSS/dataverse/issues/10469>`_.

In a meeting with a Solr expert on 2024-05-10 we were advised to avoid joins as much as possible. (It was acknowledged that many Solr users make use of joins because they have to, like we do, to keep some documents private.) Toward that end we have added two feature flags called ``avoid-expensive-solr-join`` and ``add-publicobject-solr-field`` as explained under :ref:`feature-flags`. It was confirmed experimentally that performing the join on all the public objects (published collections, datasets and files), i.e., the bulk of the content in the search index, was indeed very expensive, especially on a large instance the size of the IQSS prod. archive, especially under indexing load. We confirmed that it was in fact unnecessary and were able to replace it with a boolean field directly in the indexed documents, which is achieved by the two feature flags above. However, as of writing this, this mechanism should still be considered experimental.

Datasets with Large Numbers of Files or Versions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expand Down
6 changes: 6 additions & 0 deletions doc/sphinx-guides/source/installation/config.rst
Original file line number Diff line number Diff line change
Expand Up @@ -3268,6 +3268,12 @@ please find all known feature flags below. Any of these flags can be activated u
* - api-session-auth
- Enables API authentication via session cookie (JSESSIONID). **Caution: Enabling this feature flag exposes the installation to CSRF risks!** We expect this feature flag to be temporary (only used by frontend developers, see `#9063 <https://github.com/IQSS/dataverse/issues/9063>`_) and for the feature to be removed in the future.
- ``Off``
* - avoid-expensive-solr-join
- Changes the way Solr queries are constructed for public content (published Collections, Datasets and Files). It removes a very expensive Solr join on all such documents, improving overall performance, especially for large instances under heavy load. Before this feature flag is enabled, the corresponding indexing feature (see next feature flag) must be turned on and a full reindex performed (otherwise public objects are not going to be shown in search results). See :doc:`/admin/solr-search-index`.
- ``Off``
* - add-publicobject-solr-field
- Adds an extra boolean field `PublicObject_b:true` for public content (published Collections, Datasets and Files). Once reindexed with these fields, we can rely on it to remove a very expensive Solr join on all such documents in Solr queries, significantly improving overall performance (by enabling the feature flag above, `avoid-expensive-solr-join`). These two flags are separate so that an instance can reindex their holdings before enabling the optimization in searches, thus avoiding having their public objects temporarily disappear from search results while the reindexing is in progress.
- ``Off``

**Note:** Feature flags can be set via any `supported MicroProfile Config API source`_, e.g. the environment variable
``DATAVERSE_FEATURE_XXX`` (e.g. ``DATAVERSE_FEATURE_API_SESSION_AUTH=1``). These environment variables can be set in your shell before starting Payara. If you are using :doc:`Docker for development </container/dev-usage>`, you can set them in the `docker compose <https://docs.docker.com/compose/environment-variables/set-environment-variables/>`_ file.
Expand Down
5 changes: 5 additions & 0 deletions docker/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
# Dataverse in Docker

Please see the [Container Guide][].

[Container Guide]: https://guides.dataverse.org/en/latest/container/index.html
Loading

0 comments on commit 5dc7b8f

Please sign in to comment.