This repository has been archived by the owner on Sep 20, 2024. It is now read-only.

Scrape and harvest collections / sources P4 (OELA) #137

Open
3 of 5 tasks
osahon-okungbowa opened this issue May 15, 2020 · 0 comments

Comments

@osahon-okungbowa
Contributor

Depends on #114

Description

Harvesting collections and sources depends on the schema validation allowing groups of different types and package relationships in the data.json source files.

The proposed spec for implementation is located here: Specs For Implementing data.json Validation Schema for Dept of Ed

Format

Uses the same format outlined in #115.

Scraping rules:

  • If an HTML page contains multiple datasets -> extract the page itself as a collection
  • If an HTML page contains no datasets, but has multiple links to pages that are collections -> extract the page as a source
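
The two rules above can be sketched as a small classification step. This is only an illustration under stated assumptions, not the scraper's actual code: `ScrapedPage`, `classify_page`, and the field names are hypothetical, and "multiple" is read here as "more than one".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ScrapedPage:
    """Hypothetical summary of what the scraper found on one HTML page."""
    url: str
    # URLs of datasets found directly on this page (assumed field name).
    dataset_urls: List[str] = field(default_factory=list)
    # URLs of linked pages already classified as collections (assumed field name).
    collection_page_urls: List[str] = field(default_factory=list)


def classify_page(page: ScrapedPage) -> Optional[str]:
    """Apply the two scraping rules; return None when neither rule matches."""
    # Rule 1: multiple datasets on the page -> the page itself is a collection.
    if len(page.dataset_urls) > 1:
        return "collection"
    # Rule 2: no datasets, but multiple links to collection pages -> a source.
    if not page.dataset_urls and len(page.collection_page_urls) > 1:
        return "source"
    # Single-dataset pages and everything else are left unclassified here.
    return None
```

A page with exactly one dataset matches neither rule, so the sketch returns `None` and leaves that case to the existing dataset handling.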

CKAN extensions updates:

The datajson extension needs Collection / Source processing capabilities based on the data it finds in the data.json file.

SITUATION

Based on the format and scraping rules specified above, the OELA office requires an implementation of collections and sources.

Tasks

  • Implement the scraping output changes
  • Implement the scraping rules for Collection / Source
  • Add the new items to the datajson schema we are using
  • Load a datajson containing collections and sources into a harvester source and test

Acceptance criteria:

  • Sources and collections are implemented for the OELA office and are visible on the staging portal