201211dataextractor

François Prunayre edited this page Jan 13, 2014 · 4 revisions

Epic: Users can extract data from the catalogue in any format, projection and bounds they desire.

Current:

  • For WFS resources: A WFS download panel is available in html5ui, from which users can extract data within custom bounds in any format/projection supported by the WFS server. Some servers, like GeoServer, offer a wide range of formats; others, like Deegree, only offer GML, a format that few people outside the geo-world can handle.
  • For uploaded files: A link provides access to the file download, subject to the user's privileges.

UI: For every WFS (WCS) online resource a download panel should be added, where users can select the required bounds, a projection and a format (gml, geojson, csv, shp, pdf, dxf, kml, gdb, geopackage, ...). The selected resource is then filtered and converted to match that selection. A user can either download the requested data directly or be sent a link (by mail) from which the file can be downloaded.
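
As a rough illustration of how the panel's selections could map onto a request, the sketch below builds a WFS 2.0 GetFeature URL from bounds, a projection and a format. The helper name `build_getfeature_url`, the endpoint and the layer name are hypothetical; the query parameters are the standard WFS key-value ones, and which `outputFormat` values are accepted depends on the server.

```python
from urllib.parse import urlencode


def build_getfeature_url(endpoint, type_name, bbox, srs, output_format):
    """Build a WFS 2.0 GetFeature URL (hypothetical helper, not GeoNetwork code)."""
    minx, miny, maxx, maxy = bbox
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,
        # The CRS of the bounding box is appended as a fifth bbox value.
        # Axis order conventions vary by server and CRS identifier.
        "bbox": f"{minx},{miny},{maxx},{maxy},{srs}",
        # srsName asks the server to return geometries in this projection.
        "srsName": srs,
        # Supported values are server specific (see its GetCapabilities).
        "outputFormat": output_format,
    }
    return f"{endpoint}?{urlencode(params)}"


# Example values are made up for illustration.
url = build_getfeature_url(
    "https://example.org/wfs",
    "topp:roads",
    (4.0, 51.0, 5.0, 52.0),
    "EPSG:4326",
    "application/json",
)
```

Formats the server cannot produce itself would still need a conversion step after this request.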

Additional parameters are required for some conversions. Some of these might be managed in an advanced tab. The most important case is the conversion from complex GML served by a WFS to the flat features required for a shapefile. A user should be able to drill down into the complex structure and select the geometry fields (plus attributes) that are required for the export. At first we can just let GeoTools decide which parts of the data are best to export.
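
To make the "complex to flat" step concrete, here is a minimal sketch that flattens nested feature properties into a single level of columns. It is only an illustration of the idea in Python, not how GeoTools does it; the function name `flatten_properties` and the sample feature are hypothetical.

```python
def flatten_properties(properties, prefix="", sep="_"):
    """Flatten nested dicts/lists into one level of column names."""
    flat = {}
    for key, value in properties.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Nested element: recurse and prefix the child names.
            flat.update(flatten_properties(value, name, sep))
        elif isinstance(value, list):
            # Repeated element: one column per occurrence, suffixed by index.
            for index, item in enumerate(value):
                item_name = f"{name}{sep}{index}"
                if isinstance(item, dict):
                    flat.update(flatten_properties(item, item_name, sep))
                else:
                    flat[item_name] = item
        else:
            flat[name] = value
    return flat


# Made-up example of a feature with nested attributes.
feature = {
    "name": "Station A",
    "address": {"city": "Utrecht", "street": {"name": "Main", "number": 12}},
    "tags": ["rail", "bus"],
}
flat = flatten_properties(feature)
```

A real shapefile export would also have to shorten the generated names, since dBASE field names are limited to 10 characters.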

Formats to convert: Online resources of type WFS and WCS can probably be extracted directly. Online resources of type download can be extracted if they are of a known spatial type such as shp, dxf, mif, gml, gdb, kml, geojson or geopackage. In the worst case, the extraction service will only discover that a file does not qualify after it has been fully downloaded.
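
A first check along these lines could be sketched as follows, assuming each online resource exposes a protocol label and a URL. The function name, the protocol labels and the status strings are hypothetical, and the extension list is only one possible mapping of the types listed above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed labels and extensions, chosen for illustration only.
SERVICE_PROTOCOLS = {"WFS", "WCS"}
KNOWN_EXTENSIONS = {
    ".shp", ".dxf", ".mif", ".gml", ".gdb", ".kml", ".geojson", ".gpkg",
}


def extraction_status(protocol, url):
    """Classify an online resource before anything is downloaded."""
    if protocol.upper() in SERVICE_PROTOCOLS:
        return "extractable"
    # Look only at the path so query strings do not confuse the check.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix in KNOWN_EXTENSIONS:
        return "extractable"
    # Archives or opaque links: the type is only known after download.
    return "unknown-until-downloaded"
```

For example, `extraction_status("download", "https://example.org/data/roads.zip")` falls into the last case, which corresponds to the worst-case scenario described above.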

Related: The WFS download panel available in the html5ui. GeOrchestra provides an advanced data extractor tool (https://github.com/georchestra/georchestra/tree/master/extractorapp). Some of those concepts can be reused, if the license permits.

Authorisation: Some services will be protected. The user could be allowed to enter a username/password as a configuration option, and/or a user CAS/SAML/OpenID token could be forwarded to the server when accessing the resource. Authorisation by IP whitelisting will cause an issue here, since requests would reach the protected server from the extraction service's address rather than the user's.
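
The two options could be sketched as a helper that picks the request headers to send to the protected service. This is a simplified illustration: it assumes a forwarded token can be sent as an HTTP bearer token, which may not hold for CAS or SAML setups, and the function name is hypothetical.

```python
import base64


def build_auth_headers(username=None, password=None, token=None):
    """Return headers for a protected service (hypothetical helper)."""
    if token is not None:
        # Assumption: the forwarded user token is accepted as a bearer token.
        return {"Authorization": f"Bearer {token}"}
    if username is not None and password is not None:
        # HTTP Basic authentication: base64 of "username:password".
        raw = f"{username}:{password}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    # No credentials available: send the request anonymously.
    return {}
```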

Server load: The extraction service is best hosted as a service separate from GeoNetwork, because extracting large datasets causes high CPU load. It could even be considered to add download actions to a queue by default.

Extraction Workflow:

  • User opens the extraction panel for a dataset. The dataset is analysed for compliance with the extraction requirements (does it offer the required protocol/format/projection?).
  • The panel is shown with the available extraction options. User selects options and presses the download button.
  • A download task is scheduled in a queue. If the queue is empty and the task is below a threshold time (10sec?), the download is returned to the user immediately. Otherwise the user is given the option to enter an email address; when the task is finished, a mail will be sent with the download location.
  • An alternative is "Streaming ETL": some datasets are too big to be handled in memory, so the data should be converted while it is being downloaded and made available to the user as a stream. http://stetl.org offers such processing.
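
The queue decision in the workflow above can be sketched as a small function. It assumes the service can estimate a task's duration up front, which this page does not specify; the names `decide_delivery` and `Decision`, the mode strings and the 10 second value are placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# The page only suggests "10sec?"; the real threshold is undecided.
THRESHOLD_SECONDS = 10


@dataclass
class Decision:
    mode: str  # "immediate", "ask-email" or "email"
    message: str


def decide_delivery(
    queue_length: int, estimated_seconds: float, email: Optional[str] = None
) -> Decision:
    """Decide how the result of a download task reaches the user."""
    if queue_length == 0 and estimated_seconds < THRESHOLD_SECONDS:
        # Nothing is waiting and the task is short: answer in the same request.
        return Decision("immediate", "Download returned directly.")
    if email is None:
        # The task has to wait or takes long: ask where to send the link.
        return Decision("ask-email", "Please provide an email address.")
    return Decision("email", f"Download link will be mailed to {email}.")
```

For instance, `decide_delivery(0, 3)` yields an immediate download, while `decide_delivery(2, 3)` asks for an email address because other tasks are already queued.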

Activities:

  • Build the download panel (AngularJS) - 1 month
  • Build services required to initialise the download panel (analyse GetCapabilities) - .25 month
  • Build the extraction service
    • Management of the queue (send mail) - .25 month
    • Get data (filter by bounds) - .25 month
    • Transform data (reprojection and format conversion; assuming GeoTools+OGR provide all required functionality) - 1 month
  • Add extraction options to GeoNetwork config screens - .25 month