Formatter / Cache & landing pages #4299

Closed
wants to merge 21 commits into from

Member

@fxprunayre fxprunayre commented Dec 10, 2019

Landing pages apply to public records only and provide quick access to pre-built HTML pages based on a formatter. The feature is disabled by default. The main goal of a landing page is to provide a persistent page that can be used as the DOI-registered URL for a record.

Formatters

GeoNetwork provides an advanced cache of record view modes (aka formatters). While viewing records, the default view is based on the search service response and is built by the JS client application, so no cache is involved. The advanced view uses a formatter, and other custom views can also be created using formatters (eg. a view based on an editor configuration, JSON-LD). When a formatter is used, the first-level cache is an in-memory one. Requests are added to a queue which then populates a file system cache.

The file system cache is persisted in the data directory under resources/formatter-cache. A database stores the current content of the cache, and a public and a private folder store the formatter outputs (usually HTML) depending on the record's privileges.

All formatters can be cached, but caching depends on URL parameters (PDF outputs are not cached). If the call to the formatter contains a parameter not listed in formatterAllowedParameters, the output will not be cached. Mind the parameters available in a formatter to properly cache a version which depends on parameters: two URLs with slightly different parameters pointing to the same formatter may be cached in the same file. The cache key depends on the record id and the formatter id; URL parameters are not part of the key for now.
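
To make the caching rule concrete, here is a minimal Java sketch. It is not GeoNetwork's actual code: the class `FormatterCacheRule`, its methods and the `recordId/formatterId` key layout are hypothetical names chosen for illustration. It only mirrors what is stated above: PDF is never cached, any parameter outside the allowed list disables caching, and the key ignores URL parameters.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of the caching rule; not a GeoNetwork class. */
final class FormatterCacheRule {
    // Stand-in for the formatterAllowedParameters bean configuration.
    private final Set<String> allowedParameters;

    FormatterCacheRule(Set<String> allowedParameters) {
        this.allowedParameters = Set.copyOf(allowedParameters);
    }

    /** PDF is never cached; otherwise every URL parameter must be allowed. */
    boolean isCacheable(String outputType, Map<String, String> urlParameters) {
        if ("pdf".equalsIgnoreCase(outputType)) {
            return false;
        }
        return allowedParameters.containsAll(urlParameters.keySet());
    }

    /** The key uses record id and formatter id only (assumed layout). */
    static String cacheKey(int recordId, String formatterId) {
        return recordId + "/" + formatterId;
    }
}
```

Because the key has no parameter component, two cacheable calls such as `view=basic` and `view=advanced` on the same record and formatter would map to the same entry in this sketch, which is the collision the paragraph warns about.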

Landing pages

A specific cache is added to store landing pages for public records. A landing page may be used as the main entry point for a record, for example when registering a DOI. By default the landing page is a simple HTML page containing all record information as HTML plus a JSON-LD representation for better indexing by search engines. The landing page uses the xsl-view formatter by default, but this can be customized.

The landing page cache can be populated using the API and is updated when records change (cf. FormatterCachePublishListener, which takes care of moving formatter outputs from the private to the public folder depending on publication status; the event is MetadataIndexCompleted). Those HTML pages may be served directly by a web server for direct and quick access (in this case the landing page folder has to be mounted as a web folder).
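
As a rough illustration of what such a listener has to do on the file system, the sketch below moves a cached output between the `public` and `private` folders when the publication status changes. It is not the actual FormatterCachePublishListener: the class name, method names and the file naming inside the folders are assumptions, and handling the reverse direction (public to private on unpublication) is also an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper; not the actual FormatterCachePublishListener. */
final class CacheVisibilityMover {
    // Assumed to point at DATA_DIR/resources/formatter-cache.
    private final Path cacheRoot;

    CacheVisibilityMover(Path cacheRoot) {
        this.cacheRoot = cacheRoot;
    }

    /** Folder holding outputs for a record with the given publication status. */
    Path folderFor(boolean isPublic) {
        return cacheRoot.resolve(isPublic ? "public" : "private");
    }

    /** Moves one cached output after a status change; false if nothing was cached. */
    boolean onPublicationChanged(String fileName, boolean nowPublic) {
        Path from = folderFor(!nowPublic).resolve(fileName);
        Path to = folderFor(nowPublic).resolve(fileName);
        if (!Files.exists(from)) {
            return false;
        }
        try {
            Files.createDirectories(to.getParent());
            Files.move(from, to, StandardCopyOption.REPLACE_EXISTING);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```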

Only public records are stored in this landing page cache.

File structure is:

DATA_DIR/resources/htmlcache/landing-page/uuid.html
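
The file layout above can be expressed as a small path helper. This is only a sketch: `LandingPagePaths` is a hypothetical name, and rejecting UUIDs that contain path separators is an added precaution that the PR description does not mention.

```java
import java.nio.file.Path;

/** Hypothetical helper mapping a record UUID to its landing page file. */
final class LandingPagePaths {

    /** Returns DATA_DIR/resources/htmlcache/landing-page/{uuid}.html. */
    static Path landingPageFile(Path dataDir, String uuid) {
        // Assumption: a UUID must not be able to escape the landing page folder.
        if (uuid == null || uuid.isEmpty() || uuid.contains("/") || uuid.contains("\\")) {
            throw new IllegalArgumentException("Unsupported UUID: " + uuid);
        }
        return dataDir
            .resolve("resources")
            .resolve("htmlcache")
            .resolve("landing-page")
            .resolve(uuid + ".html");
    }
}
```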

The configuration is made in config.properties:

landingPage.formatter=xsl-view
landingPage.language=${language.default}
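
In the application these properties are presumably resolved by its own configuration mechanism. Purely to show how the `${language.default}` reference expands, here is a standalone sketch using `java.util.Properties`; the class `LandingPageConfig` and the value `eng` for `language.default` are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical reader for the two landing page properties. */
final class LandingPageConfig {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Parses properties text such as the content of config.properties. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the value of key with ${other.key} references expanded once. */
    static String resolve(Properties props, String key) {
        String raw = props.getProperty(key);
        if (raw == null) {
            return null;
        }
        Matcher m = PLACEHOLDER.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Unknown references are left untouched.
            String value = props.getProperty(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With `language.default=eng` defined alongside the two lines above, `resolve(props, "landingPage.language")` yields `eng` in this sketch.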

When exposing the landing page folder directly to the web, it may be relevant to set up a redirect to the catalog in case the requested page does not exist. This will trigger the landing page creation. To do so, create a .htaccess file in the landing page folder:

# Landing page redirect. 
# If the landing page requested does not exist, 
# redirect to the catalogue which will check 
# if it really does not exist or needs to be generated.
RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*)\.html$ /geonetwork/srv/api/records/$1 [L,R=301]
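
For readers less familiar with mod_rewrite, the decision made by the rule can be restated in Java. This is an illustration only, not something the web server or the catalogue runs: `LandingPageRedirect` is a hypothetical name, and the sketch matches the dot before `html` literally.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical restatement of the .htaccess redirect decision. */
final class LandingPageRedirect {
    // Same capture as the RewriteRule: everything before ".html".
    private static final Pattern HTML_PAGE = Pattern.compile("^(.*)\\.html$");

    /**
     * Returns the catalogue URL to redirect to, or empty when the file
     * exists (RewriteCond !-f fails) or the path is not an .html page.
     */
    static Optional<String> redirectFor(String relativePath, boolean fileExists) {
        if (fileExists) {
            return Optional.empty();
        }
        Matcher m = HTML_PAGE.matcher(relativePath);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of("/geonetwork/srv/api/records/" + m.group(1));
    }
}
```

A request for a missing `abc-123.html` therefore ends up at `/geonetwork/srv/api/records/abc-123`, where the catalogue decides whether the record exists and whether the page needs to be generated.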

Other changes

  • Restore the caching mechanism for the xsl-view formatter.

  • Move ALLOWED_PARAMETERS to bean config as it depends on catalogue available formatters.

  • Add API to get cache info (eg. cache current size).

  • Restore the option to skip popularity when using a formatter

  • Add option to force a cache refresh when using a formatter

  • Avoid NPE when a formatter is called before the app init while trying to retrieve the lang code

  • Portal link. Values can be:

    • custom url using {{uuid}} to be replaced by the record UUID. eg. http://another.portal.org/${uuid}

    • default (will link to the main application)

    • group (will link to the record groupowner website URL if any, default if not)
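
The three portal link modes can be summarised in a short Java sketch. It is not the formatter's implementation: `PortalLinkResolver` and its parameters are hypothetical, a missing value is assumed to behave like `default`, and since the description shows both `{{uuid}}` and `${uuid}` as placeholder spellings, the sketch accepts either.

```java
/** Hypothetical resolver for the portalLink setting. */
final class PortalLinkResolver {

    /**
     * @param portalLink   "default", "group" or a custom URL template
     * @param uuid         record UUID substituted into a custom template
     * @param defaultUrl   link to the main application
     * @param groupWebsite website URL of the record's group owner, may be null
     */
    static String resolve(String portalLink, String uuid,
                          String defaultUrl, String groupWebsite) {
        if (portalLink == null || portalLink.isEmpty() || "default".equals(portalLink)) {
            return defaultUrl;
        }
        if ("group".equals(portalLink)) {
            // Fall back to the default link when the group has no website.
            boolean hasWebsite = groupWebsite != null && !groupWebsite.isEmpty();
            return hasWebsite ? groupWebsite : defaultUrl;
        }
        // Custom URL: both placeholder spellings are replaced (assumption).
        return portalLink.replace("${uuid}", uuid).replace("{{uuid}}", uuid);
    }
}
```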

Documentation

This PR description will be added to https://geonetwork-opensource.org/manuals/3.8.x/en/customizing-application/creating-custom-view.html

@fxprunayre fxprunayre added this to the 3.10.0 milestone Dec 16, 2019
@fxprunayre fxprunayre marked this pull request as ready for review December 16, 2019 08:26

* A formatter may depend on request language? Define one language in the landing page config?
* Web server direct access: what if a landing page does not exist? Trigger a redirect to check whether the record exists or not?
…g to make the landing page creation asynch (to not extend indexing time).
@pvgenuchten

francois, does the landing page (and the htaccess rewrite rule) respect the request's language and Accept (encoding) headers? For URIs such as DOIs, it's good practice to support multiple output encodings (xml, json, html) depending on the requested Accept header.

@fxprunayre
Member Author

does the landingpage (and htaccess rewrite rule) respect the request headers language and accept (encoding)?

No. As described, the config defines which formatter to use and which language:

The configuration is made in config.properties:

landingPage.formatter=xsl-view
landingPage.language=${language.default}

This work, funded by Ifremer, has as its main goal to provide a static (and up to date) HTML view of all public records of a catalogue so that search engines index everything properly (with the JSON-LD representation). Then you click on the get more information button and you access the catalogue with the complete UI, formats, language selection ... If we start making a full cache of all combinations of language/encoding then it is another story. Here we focus on having a static, fast and up to date version of all public records. Ifremer was using an ETL process to take care of this, which will be replaced by a custom landing page formatter.

It slows down the indexing process a bit (even if done in a separate thread), which we should probably improve in the near future, but that's another story too.

@fxprunayre fxprunayre modified the milestones: 3.10.0, 3.10.1 Jan 17, 2020
Collaborator

@cmangeat cmangeat left a comment

seems there is a problem at compilation time.

also, FormatterApi.getRecordFormattedBy could be broken in two parts, a reusable one and the HTTP/REST wrapper, the reusable one to be used for generating landing page (as anyway getRecordFormattedBy not really used thru http as a distributed service).

furthermore, as, as far as I understand, landing pages are accessed as static resources, not really using the webapp but relying on the web server, first level cache (memory level) of formatter cache is of no use for landing page, considering building landing page repository apart from formatter cache could make sense (change date FormatterCache logic also does not apply).

final Root<OperationAllowed> root = query.from(OperationAllowed.class);
final Root<Metadata> metadataRoot = query.from(Metadata.class);

Predicate mdIdEquals = cb.equal(metadataRoot.get(Metadata_.id), root.get(OperationAllowed_.id).get(OperationAllowedId_.metadataId));
Collaborator

@cmangeat cmangeat Feb 11, 2020

One should also use a join clause => not sure, join seems to be not "straight-forward"

<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-csv</artifactId>
</dependency>
<dependency>
<groupId>${project.groupId}</groupId>
<artifactId>wro4j</artifactId>
Collaborator

Why is this needed on the service side, please?

final String lang2 = appContext.getBean(IsoLanguagesMapper.class).iso639_2_to_iso639_1(lang3);

String finalLang = lang3;
Collaborator

Shouldn't this have the final modifier?

}
if (StringUtils.isNotEmpty(landingPageFormatter)) {
final ServletContext servletContext = _context.getServlet().getServletContext();
_context.setAsThreadLocal();
Collaborator

_context (the one from geonet startup) shared (also overriding the previous one) in Thread from which comes index MetadataIndexCompleted event, an asynchronous way... I understand that we need a context to perform service call thru http if we plan to rely on a distributed services framework, but this is not what we are doing as it seems we need both to set the context in Thread before to use it to execute request. Furthermore, we are not really executing request, just in a certain way invoking a local spring service (i.e. formatService cannot be hosted on a different node/jvm).

(with "context = createServiceContext" in getRecordFormattedBy)

* in formattersToInitialize so they get loaded eg. XSLCache init
* @param context
*/
public static void initializeFormatters(final ServiceContext context) {
Collaborator

Seems that this method is never used

* @param lang the current ui language
* @param formatType the content type of the output
* @param formatterId the formatter used to create the output
* @param hideWithheld if true then elements in the metadata with the attribute
* gco:nilreason="withheld" are being hidden
*/
public Key(int mdId, String lang, FormatType formatType, String formatterId, boolean hideWithheld, FormatterWidth width) {
public Key(int mdId, String mdUuid, String lang, FormatType formatType, String formatterId, boolean hideWithheld, FormatterWidth width) {
Collaborator

Why does adding uuid in the key make it better, please ?

Member Author

@fxprunayre fxprunayre Mar 4, 2020

because you need it to write/remove the landing page file.

@Autowired
OperationAllowedRepository operationAllowedRepo;
@Autowired
DataManager dataManager;
Collaborator

never used?

}


@Autowired
Collaborator

I am used to looking for the available beans at the beginning of the file

}
}
}

private void resize() throws SQLException, IOException {
private void resize(String mdUuid) throws SQLException, IOException {
Collaborator

Resize shrinks the cache to half its size by removing the oldest elements... Those older elements and the uuid of the metadata currently being cached are NOT related.
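
To illustrate the point of this comment, here is a minimal Java sketch of an oldest-first resize. It is not the FormatterCache code (which tracks entries in a database): `OldestFirstResize` is a hypothetical name and insertion order in a `LinkedHashMap` stands in for entry age. Nothing in it needs the UUID of the record currently being cached.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical oldest-first eviction; insertion order stands in for age. */
final class OldestFirstResize {

    /** Removes the oldest entries until half remain; returns how many were removed. */
    static <K, V> int shrinkToHalf(LinkedHashMap<K, V> cache) {
        int target = cache.size() / 2;
        int removed = 0;
        Iterator<Map.Entry<K, V>> it = cache.entrySet().iterator();
        while (cache.size() > target && it.hasNext()) {
            it.next();
            it.remove(); // evict the oldest remaining entry
            removed++;
        }
        return removed;
    }
}
```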

FormatType.html,
true,
true,
true,
Collaborator

Why refresh in this case, please? If no refresh is needed, the service call should be shared with fillLandingPageCache.

@fxprunayre fxprunayre modified the milestones: 3.10.1, 3.10.2 Feb 12, 2020
@fxprunayre fxprunayre modified the milestones: 3.10.2, 3.10.3 Apr 8, 2020
@pvgenuchten

pvgenuchten commented May 26, 2020

hi @fxprunayre, what's the status on this, it's an impressive work, would love to see that merged soon. I have some questions.

  • I didn't quite understand your statement: "It is disabled by default." As far as i know landingpages are enabled by default, and accessible via the sitemap. Where is this configuration?

  • xsl-view is derived from editor-configuration, therefore you can add a parameter to indicate which editor to use, basic, inspire or advanced. I would suggest to add this configuration option also for the landingpage.

  • the header and footer of landingpages are defined in https://github.com/geonetwork/core-geonetwork/blob/master/web/src/main/webapp/xslt/skin/default/skin.xsl. The idea with 'skin' in this url is that users could easily provide an alternative skin in /xslt/skin/forest/skin.xsl and then use some configuration (for example the view=default parameter in settings) to point to the alternative skin. Is that something to build on top of this PR?

  • it's probably more a formatter issue, but the display of xsl-view keeps having these weird html artifacts caused by the xslt used in formatters, < i / > is not allowed in html

  • Would be interesting to have on the landingpage an "add to map" button for any wm(t)s links, which opens the map page in the angular application with the selected layer loaded.

  • For those records which have a link with a dataset/service or parent/child record, would be interesting to retrieve the links and display them on the landingpage

@fxprunayre
Member Author

  • I didn't quite understand your statement: "It is disabled by default." As far as i know landingpages are enabled by default, and accessible via the sitemap. Where is this configuration?

Landing pages are always available via the formatter call. What is disabled by default is the possibility to have a permanent HTML cache of all public records ie. if the app is down, the HTML version is always available.

xsl-view is derived from editor-configuration, therefore you can add a parameter to indicate which editor to use, basic, inspire or advanced. I would suggest to add this configuration option also for the landingpage.

It is. For Sextant which is using this PR, we have a custom view with even custom styling depending on the group of the record.

    <landingPage.formatter>xsl-view?view=sextant&amp;template=sextant-summary-view&amp;portalLink=group</landingPage.formatter>
  • The idea with 'skin' i

For Sextant, we also experimented on top of that PR, the possibility to add a style based on the group ... and also customize the link of the "view in the portal" button. That is still work in progress.

it's probably more a formatter issue, but the display of xsl-view keeps having this weird html artifacts caused by the xslt used in formatters, < i / > is not allowed in html

Yep I thought all those were fixed. I'll have a look.

Would be interesting to have on the landingpage an "add to map" button for any wm(t)s links, which opens the map page in the angular application with the selected layer loaded.

This is more difficult because the add to map actions are all controlled by the JS logic of the angular app ... not sure how to solve that easily.

For those records which have a link with a dataset/service or parent/child record, would be interesting to retrieve the links and display them on the landingpage

Same here. All that is done in JS so that's why it is important to have a quite visible button to open the record in the JS app.

it's an impressive work, would love to see that merged soon.

We are discussing with Christophe who is improving this a bit. We are not 100% convinced on where/when/how to produce the landing page. We would like to avoid altering indexing performance. We also discussed micro services - so having a separate app taking care of that kind of jobs would make sense - having an indexing machine not part of the main webapp could help ...

@pvgenuchten

pvgenuchten commented May 28, 2020

if the app is down, the HTML version is always available.

this is an interesting aspect. Should be in title of the PR

Would be interesting to have on the landingpage an "add to map" button for any wm(t)s links, which opens the map page in the angular application with the selected layer loaded.

This is more difficult because the add to map actions are all controlled by the JS logic of the angular app ... not sure how to solve that easily.

We should reuse the add-to-map of the angular app as much as possible, so only add a link. Suggestion for each online resource:

<xsl:if test="contains(gmd:protocol/*/text(), 'wms')">
<a href="catalog.search#/map?type=wms&amp;url={gmd:url/*/text()}&amp;layer={gmd:name/*/text()}">Add to map</a>
</xsl:if>

For those records which have a link with a dataset/service or parent/child record, would be interesting to retrieve the links and display them on the landingpage

Same here. All that is done in JS so that's why it is important to have a quite visible button to open the record in the JS app.

Machines will not follow the link to the js app, but it is important they detect that linkage to other records is available, i guess we can use backend code from record/uuid/related as part of landing page creation

We would like to avoid altering indexing performance.

agree, to me creation at initial request (and removal at update), similar to tile-cache, makes sense. third party can set up a caching mechanism (can be a search engine crawler).
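
The create-on-first-request, remove-on-update idea could look roughly like the sketch below. It is a simplification under stated assumptions: `LazyLandingPageCache` is a hypothetical name, an in-memory map stands in for the landing page folder, and the renderer function stands in for the formatter call.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical lazy cache: build on first request, drop on record update. */
final class LazyLandingPageCache {
    private final Map<String, String> pages = new ConcurrentHashMap<>();
    // Stand-in for the formatter that renders the HTML of a record.
    private final Function<String, String> renderer;

    LazyLandingPageCache(Function<String, String> renderer) {
        this.renderer = renderer;
    }

    /** Returns the cached page, rendering it only if it is not cached yet. */
    String get(String uuid) {
        return pages.computeIfAbsent(uuid, renderer);
    }

    /** Called when a record changes; the next request re-renders the page. */
    void onRecordUpdated(String uuid) {
        pages.remove(uuid);
    }
}
```

With this shape, indexing only has to call `onRecordUpdated`, so the rendering cost moves from indexing time to the first request after a change.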

@josegar74 josegar74 modified the milestones: 3.10.3, 3.10.4 Jun 25, 2020
@fxprunayre
Member Author

Not that much interest in this PR? It has been used live in Sextant for a couple of months now; the main drawback is that it slows down indexing a bit.

Closing for now unless someone reviews it. We may move to another approach with a dedicated service taking care of building landing pages ...

@pvgenuchten

This could be an aspect of the upcoming micro services refactoring / OGC API Records implementation

@archaeogeek
Contributor

@fxprunayre I'm really keen to use this. If it was re-opened, what sort of review do you still need- is it for code or for how it works in action? Could it be implemented in 3.10.x or is it mainly for 4?

@fxprunayre
Member Author

@fxprunayre I'm really keen to use this. If it was re-opened, what sort of review do you still need- is it for code or for how it works in action?

The code works, it has been used for a couple of months in Ifremer for Sextant (production currently based on 3.10.x).
BTW any extra code review would be good too.

The major "complaint" so far on the Ifremer instance is that it slows down indexing, and that's why we are looking into moving this to a specific app (synergy with the OGC API Records micro service: we would have a dedicated app to index and populate the landing page cache). We did some experiments, but this is not planned yet.

Could it be implemented in 3.10.x or is it mainly for 4?

For now, this PR was mainly targeting 3.10.4~.

@archaeogeek
Contributor

@fxprunayre I'm happy to implement and test it in 3.10.4 and see how it impacts performance. I'm not going to be much good at a code review though!

@fxprunayre
Member Author

Sure you can try to run and check this PR. Also an issue we have on Ifremer side is that the landing page still depends on the webapp for the CSS and attachments. They would like to completely 'detach' the landing page from the app in case it is down.

I'm trying to push more to a dedicated service taking care of this. I added an item about this in the potential January sprint we will make and will try to find support to make progress on this topic. It could be good to hook a cache mechanism/events on top of OGC API Records or indexing service. https://github.com/geonetwork/core-geonetwork/wiki/GeoNetwork-UI-and-microservices-codesprint-January-2021
