JSON Harvester / Add XSL transformation for harvesting DKAN catalogs #6240

Merged (3 commits) on May 23, 2022


jahow
Contributor

@jahow jahow commented Apr 8, 2022

Can be tested using the following parameters:

And leave other parameters empty.

@fgravin
Member

fgravin commented May 5, 2022

Thanks @jahow

I noticed some weird behavior though.
In the source JSON, the `url` properties contain encoded HTML,
e.g.:

    {
      "id": "379ac99b-9864-4269-8f1b-5ab6a4a198d0",
      "revision_id": "",
      "url": "\u003Cdiv class=\u0022field field-name-field-link-remote-file field-type-file field-label-hidden\u0022\u003E\u003Cdiv class=\u0022field-items\u0022\u003E\u003Cdiv class=\u0022field-item even\u0022\u003Ehttps:\/\/sig.hautsdefrance.fr\/ext\/opendata\/Sraddet2020\/cer_reservoir_s_fr32.csv\u003C\/div\u003E\u003C\/div\u003E\u003C\/div\u003E",
      "description": "\u003Cp\u003EDonn\u00e9es brutes au format Csv (r\u00e9servoirs de la biodiversit\u00e9- Trame verte)\u003C\/p\u003E\n",
      "format": "csv",
      "state": "Active",
      "revision_timestamp": "lun, 06\/12\/2021 - 03:00",
      "name": "Tableau de donn\u00e9es (r\u00e9servoirs de la biodiversit\u00e9)",
      "mimetype": "csv",
      "size": "",
      "created": "jeu, 03\/06\/2021 - 03:00",
      "resource_group_id": "b72cd25d-1cec-49f6-8c71-297bd373fa01",
      "last_modified": "Date changed lun, 06\/12\/2021 - 03:00"
    },

which ends up in the metadata XML as

    <cit:linkage>
      <gco:CharacterString xmlns:gco="http://standards.iso.org/iso/19115/-3/gco/1.0">
        <div class="field field-name-field-link-remote-file field-type-file field-label-hidden">
          <div class="field-items">
            <div class="field-item even">
              https://sig.hautsdefrance.fr/ext/opendata/Sraddet2020/cer_reservoir_s_fr32.csv
            </div>
          </div>
        </div>
      </gco:CharacterString>
    </cit:linkage>

Is there any XSL utility that could help remove all HTML tags from a text element?
ping @fxprunayre @josegar74

Thanks

@jahow
Contributor Author

jahow commented May 5, 2022

I've added a commit to remove the HTML tags from the urls with a regex. It works quite well in the datahub:
[image]

(you can see that the data preview is functional now)

There has been work recently by @fxprunayre to handle HTML content in metadata records but I think it was more intended to convert HTML to markdown, not strip HTML tags completely.
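
For readers who want to see the idea outside of XSL, the regex approach can be sketched in Python. This is only an illustration and not the code from the commit: the `strip_tags` helper and the `TAG` pattern are hypothetical names, and the sample is a shortened version of the resource quoted earlier in this thread (one `div` level instead of three).

```python
import json
import re

# Shortened copy of the resource quoted in this thread (assumption: one <div>
# level is enough to illustrate the problem). A raw string keeps the \uXXXX
# escapes intact so that json.loads decodes them, as a JSON harvester would.
raw = r'''{
  "url": "\u003Cdiv class=\u0022field-item even\u0022\u003Ehttps:\/\/sig.hautsdefrance.fr\/ext\/opendata\/Sraddet2020\/cer_reservoir_s_fr32.csv\u003C\/div\u003E",
  "format": "csv"
}'''

# Hypothetical pattern: anything between "<" and the next ">" is treated as a tag.
TAG = re.compile(r"<[^>]+>")


def strip_tags(value: str) -> str:
    """Remove anything that looks like an HTML tag, then trim whitespace."""
    return TAG.sub("", value).strip()


resource = json.loads(raw)
# After JSON decoding, the value is literal HTML markup wrapping the URL.
clean_url = strip_tags(resource["url"])
print(clean_url)
```

A regex like this is fine for simple wrappers such as the one above, but it does not decode HTML entities and can misfire on a `>` inside an attribute value, which is one reason to prefer a real HTML-to-text utility.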

@fxprunayre
Member

Is there any XSL utility that could help remove all HTML tags from a text element?

See https://github.com/geonetwork/core-geonetwork/blob/main/core/src/main/java/org/fao/geonet/util/XslUtil.java#L659-L661

@jahow
Contributor Author

jahow commented May 6, 2022

Updated with the html2textNormalized utility, looks like it works fine
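
As a rough illustration of what a parser-based "HTML to normalized text" step does, here is a Python sketch. It does not reproduce the `XslUtil` code; it assumes, from the name only, that the utility extracts the text content and collapses whitespace. `TextExtractor` and `html_to_text_normalized` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text_normalized(html: str) -> str:
    """Return the text content of `html` with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())


# Decoded form of the "description" value quoted earlier in this thread.
description = "<p>Données brutes au format Csv (réservoirs de la biodiversité- Trame verte)</p>\n"
print(html_to_text_normalized(description))
```

Unlike a plain regex, a parser also decodes entities and is not confused by unusual attribute values.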

@fgravin
Member

fgravin commented May 6, 2022

See https://github.com/geonetwork/core-geonetwork/blob/main/core/src/main/java/org/fao/geonet/util/XslUtil.java#L659-L661

My bad, thanks @fxprunayre. I was very surprised to find XSLUtil almost empty, but I had confused it with the one in the 19115-3 schema; I didn't see there were two different ones :/

@fgravin
Member

fgravin commented May 6, 2022

@jahow did you check metadata 7698d9ab-3e4f-497c-9332-87413deb24f2? I remember there were also weird characters in the keywords (the encoding of the ').

@jahow
Contributor Author

jahow commented May 6, 2022

@jahow did you check metadata 7698d9ab-3e4f-497c-9332-87413deb24f2? I remember there were also weird characters in the keywords (the encoding of the ').

No, I haven't handled that yet.

@jahow
Contributor Author

jahow commented May 23, 2022

@fgravin with the latest commit this is good to go:
[image]

@fgravin
Member

fgravin commented May 23, 2022

Yes, looks good, thanks @jahow

@fgravin fgravin merged commit cb58c1b into geonetwork:main May 23, 2022
@jahow jahow deleted the dkan-harvester branch May 23, 2022 13:38
@gkeimeHDF

@jahow Hello, has the feature to harvest with DKAN-to-ISO19115-3-2018 disappeared in GeoNetwork 4.4.1? I see there is a new field "XSL transformation to apply" with the value "schema:iso19115-3.2018:convert/fromJsonDkan", but it doesn't seem to work.
