Utilities for fetching, parsing and crawling OpenActive data catalogs and dataset sites

openactive/dataset-utils

dataset-utils

@openactive/dataset-utils is a Node.js utility library designed to simplify the handling of OpenActive data catalogs and dataset sites. The library facilitates fetching, parsing, and manipulating data from various dataset URLs within a specified catalog, ensuring a seamless interaction with OpenActive data.

Features

  • Recursive Data Catalog Crawling: Methodically navigates through data catalogs, fetches datasets, and extracts JSON-LD from dataset HTML.
  • Data URL Retrieval: Efficiently retrieves an array of dataset site URLs from data catalogs and part collections.
  • Metadata Extraction: Extracts JSON-LD metadata from HTML dataset pages.

Installation

Install the package via npm:

npm install @openactive/dataset-utils

Usage

getAllDatasetSiteUrls(dataCatalogUrl)

Description

This is a recursive function that returns an array of dataset site URLs. If the URL supplied is a data catalog collection, it gets all the data catalogs in hasPart and crawls them. If the URL supplied is a data catalog, it gets the dataset array and flattens it.
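The recursion described above can be sketched roughly as follows. This is an illustration only, not the library's implementation: `crawlCatalog` and the injected `fetchJson` function are hypothetical names, and the sketch assumes that `hasPart` and `dataset` are arrays of URL strings.

```javascript
// Hypothetical sketch: fetchJson is an injected async function that takes a
// URL and resolves with the parsed JSON-LD document found there.
async function crawlCatalog(url, fetchJson, errors = []) {
  let doc;
  try {
    doc = await fetchJson(url);
  } catch (err) {
    // Record the failure and keep crawling the rest of the tree
    errors.push({ url, status: err.status, message: err.message });
    return { urls: [], errors };
  }

  if (Array.isArray(doc.hasPart)) {
    // Catalog collection: crawl each child catalog, then flatten the results
    const parts = await Promise.all(
      doc.hasPart.map((partUrl) => crawlCatalog(partUrl, fetchJson, errors))
    );
    return { urls: parts.flatMap((part) => part.urls), errors };
  }

  // Plain catalog: its dataset array is assumed to list dataset site URLs
  return { urls: Array.isArray(doc.dataset) ? doc.dataset : [], errors };
}
```

Injecting `fetchJson` keeps the sketch runnable without network access; a failed fetch is recorded in `errors` rather than aborting the whole crawl, which mirrors the `errors` array described under Returns.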

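Parameters

  • dataCatalogUrl (optional): A string containing the URL of the data catalog or data catalog collection to crawl. The example below calls the function with no argument, which indicates that a default is used when it is omitted.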

Returns

A Promise that resolves with an object containing:

  • catalogMetadata: A JSON-LD object of the root data catalog provided.
  • urls: An array of strings, each being a URL for a dataset.
  • errors: An array of error objects, each containing details about an error encountered during retrieval. If no errors were encountered, this array is empty. Each error object includes:
    • url: The URL from which data was being fetched when the error occurred.
    • status: HTTP status code of the error response (if available).
    • message: A descriptive message detailing the nature of the error.
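Because each error object carries an optional `status`, one simple way to get an overview of a large `errors` array is to count entries per status code. The helper below is a hypothetical sketch (`countErrorsByStatus` is not part of the library) and assumes only the error shape listed above.

```javascript
// Hypothetical helper: counts error objects per HTTP status code.
function countErrorsByStatus(errors) {
  const counts = {};
  for (const { status } of errors) {
    // status may be missing when no HTTP response was received
    const key = status === undefined || status === null ? 'no-status' : String(status);
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}
```

Passing the `errors` array returned by `getAllDatasetSiteUrls()` would yield an object keyed by status code, with errors that have no status grouped under `no-status`.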

Example

const { getAllDatasetSiteUrls } = require('@openactive/dataset-utils');

// await is not available at the top level of a CommonJS module, so wrap the call
async function main() {
  const { urls, errors } = await getAllDatasetSiteUrls();

  console.log(`Retrieved ${urls.length} dataset URLs`);
  if (errors.length > 0) {
    console.error(`${errors.length} errors encountered during retrieval:`);
    errors.forEach(error => {
      console.error(`- [${error.status}] ${error.url}: ${error.message}`);
    });
  }
}

main();

extractJSONLDfromHTML(url, html)

This function extracts JSON-LD metadata from the HTML of a given Dataset Site, using the provided url to resolve relative URLs within the JSON-LD.

Note that relative URLs are not generally permissible within OpenActive data. However, the underlying JSON-LD library still requires that a base URL be specified.

Parameters:

  • url: The URL used to resolve relative URLs in the HTML page.
  • html: HTML content from which JSON-LD data will be extracted.

Returns:

An object representing the extracted JSON-LD, or null if extraction fails.

Example:

const { extractJSONLDfromHTML } = require('@openactive/dataset-utils');

const jsonld = extractJSONLDfromHTML('https://example.com/dataset', '<html>...</html>');
console.log(jsonld);
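To illustrate what "extracting JSON-LD from HTML" involves, here is a deliberately simplified sketch. It is not how the library works internally (the library relies on a JSON-LD library and a base URL, as noted above); `extractFirstJsonLdScript` is a hypothetical name, and the sketch assumes the metadata sits in a `<script type="application/ld+json">` element and does not resolve relative URLs.

```javascript
// Simplified sketch, not the library's implementation: find the first
// <script type="application/ld+json"> element and parse its contents.
function extractFirstJsonLdScript(html) {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/i;
  const match = pattern.exec(html);
  if (!match) return null; // no JSON-LD script element found
  try {
    return JSON.parse(match[1]);
  } catch (err) {
    return null; // malformed JSON is treated as "nothing extracted"
  }
}
```

Like `extractJSONLDfromHTML`, the sketch returns null when nothing usable is found, so callers can treat "no script element" and "unparseable JSON" the same way.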

getAllDatasets([dataCatalogUrl])

This function recursively crawls through a data catalog, fetches datasets, and extracts JSON-LD from the dataset HTML. It combines getAllDatasetSiteUrls() and extractJSONLDfromHTML().

The errors array it returns details any issues that occurred while fetching and extracting data from URLs. It can contain many entries, because OpenActive feeds are maintained independently by many different publishers.

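Parameters:

  • dataCatalogUrl (optional): A string containing the URL of the data catalog or data catalog collection to crawl. The square brackets in the signature mark it as optional, and the example below omits it.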

Returns:

A Promise that resolves with an object containing:

  • catalogMetadata: A JSON-LD object of the root data catalog provided.
  • datasets: An array of extracted JSON-LD objects from the Dataset Sites.
  • errors: An array of error objects indicating any issues encountered during fetching. Each error object includes:
    • url: The URL from which data was being fetched when the error occurred.
    • status: HTTP status code of the error response (if available).
    • message: A descriptive message detailing the nature of the error.

Example:

const { getAllDatasets } = require('@openactive/dataset-utils');

getAllDatasets().then(({ datasets, errors }) => {
  console.log(datasets);
  
  // Iterating through the errors
  errors.forEach(error => {
    console.log(`Error fetching URL: ${error.url}`);
    console.log(`HTTP Status Code: ${error.status}`);
    console.log(`Message: ${error.message}`);
  });
});
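The way the two steps combine can be sketched as below. This is a hypothetical illustration rather than the library's code: `collectDatasets`, `fetchHtml`, and `extract` are assumed names, with the fetch and extract steps injected so the sketch runs without network access.

```javascript
// Hypothetical composition: fetchHtml(url) resolves with page HTML, and
// extract(url, html) returns a JSON-LD object or null.
async function collectDatasets(urls, fetchHtml, extract) {
  const datasets = [];
  const errors = [];
  for (const url of urls) {
    try {
      const html = await fetchHtml(url);
      const jsonld = extract(url, html);
      if (jsonld === null) {
        // Page was fetched but contained no usable JSON-LD
        errors.push({ url, status: undefined, message: 'No JSON-LD could be extracted' });
      } else {
        datasets.push(jsonld);
      }
    } catch (err) {
      // Fetch failed: keep going and report the failure to the caller
      errors.push({ url, status: err.status, message: err.message });
    }
  }
  return { datasets, errors };
}
```

Collecting failures instead of throwing means one broken Dataset Site does not prevent the remaining datasets from being returned.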

validateJsonLdId(id, expectHtml)

Description

This function validates the @id (or id, for backwards compatibility) property within a JSON-LD Dataset or DataCatalog. It fetches JSON-LD data from a specified URL, checks whether the data is embedded in HTML or raw JSON-LD, extracts the JSON-LD, and ensures that the @id field within the document matches the provided id. This function acts as a safety check, affirming that the expected identifier aligns exactly with the identifier found within the fetched JSON-LD document. Note that @id is case sensitive and must match exactly.

Parameters

  • id (string): A string that specifies the expected @id or id value in the JSON-LD document.
  • expectHtml (boolean): A boolean flag indicating whether the fetched data is expected to be embedded within HTML, as for a Dataset Site (when true), or to be raw JSON-LD, as for a Data Catalog (when false).

Returns

A Promise that resolves with an object containing:

  • isValid: A boolean that is true if the validation is successful (the expected @id matches the found @id) and false otherwise.
  • error: A string describing the error encountered during the validation process, or null if the validation is successful.

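Example

const { validateJsonLdId } = require('@openactive/dataset-utils');

async function exampleUsage() {
  const id = "https://example.com/data.jsonld";
  // false: the URL is expected to return raw JSON-LD rather than HTML
  const { isValid, error } = await validateJsonLdId(id, false);

  if (isValid) {
    console.log(`Validation successful for ID: ${id}`);
  } else {
    console.error(`Validation failed for ID: ${id}. Error: ${error}`);
  }
}

exampleUsage();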
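The comparison step at the heart of this check can be sketched in isolation. `checkId` is a hypothetical helper, not a library export, and its error messages are invented for illustration; it assumes the JSON-LD document has already been fetched and parsed.

```javascript
// Hypothetical helper mirroring the comparison described above.
function checkId(jsonld, expectedId) {
  // Prefer @id, falling back to the legacy id property
  const foundId = jsonld['@id'] !== undefined ? jsonld['@id'] : jsonld.id;
  if (foundId === undefined) {
    return { isValid: false, error: 'No @id or id property found' };
  }
  // Strict equality: the comparison is case sensitive and must match exactly
  if (foundId !== expectedId) {
    return {
      isValid: false,
      error: `Expected @id "${expectedId}" but found "${foundId}"`,
    };
  }
  return { isValid: true, error: null };
}
```

Using strict equality makes the case-sensitivity rule explicit: an identifier that differs only in letter case is reported as a mismatch.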

Testing

Execute test cases using:

npm test

The test suite, located in ./test/getAllDatasets-test.js, utilises mocks to simulate various use cases.

Contributions

We welcome your contributions! Feel free to submit a pull request.
