Scripts for querying Australian public data APIs. These scripts were part of my PhD project, but I have used them for several other projects since then and have made them open source as they may be helpful to others.
Using PHP rather than JavaScript (with fetch or axios) avoids CORS errors, as several of the APIs don't allow cross-origin requests. Note that some of the scripts at the bottom of the page are a few years old now and probably need some work. There is also a new directory of Jupyter notebooks with examples.
In future, I hope to update all the scripts and consolidate them. I may add the option to dump output as CSV. For the time being, I copy the HTML and paste it into a spreadsheet when I need to analyse the results further.
This is the latest and most complete search script. It runs queries on data.gov.au, then the CSIRO Knowledgebase, and finally on all the CKAN catalogues listed in ckan_apis.json. It outputs the results in a table with better formatting and layout than the older scripts. It checks for duplicate datasets based on both ID and title (the older scripts checked by ID alone). It shows the number of finds per repository and, for each dataset found, lists the repositories it was found in.
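The duplicate check and per-repository bookkeeping described above can be sketched in a few lines. This is a minimal Python illustration of the idea, not the PHP script's actual code: the function names, the `id`/`title` field names and the title normalisation (lower-casing and collapsing whitespace) are all assumptions made for the example.

```python
def normalise_title(title):
    """Lower-case and collapse whitespace so near-identical titles compare equal."""
    return " ".join(title.lower().split())


def merge_results(results_by_repo):
    """Merge per-repository result lists, treating a dataset as a duplicate
    when either its ID or its normalised title has been seen before."""
    merged = []      # one entry per unique dataset
    by_id = {}       # dataset ID -> entry in merged
    by_title = {}    # normalised title -> entry in merged
    for repo, datasets in results_by_repo.items():
        for ds in datasets:
            title_key = normalise_title(ds["title"])
            entry = by_id.get(ds["id"]) or by_title.get(title_key)
            if entry is None:
                entry = {"id": ds["id"], "title": ds["title"], "repositories": []}
                merged.append(entry)
            by_id.setdefault(ds["id"], entry)
            by_title.setdefault(title_key, entry)
            if repo not in entry["repositories"]:
                entry["repositories"].append(repo)
    return merged


# Hypothetical sample data: two of the four hits are duplicates.
sample = {
    "data.gov.au": [{"id": "a1", "title": "Soil Moisture"},
                    {"id": "b2", "title": "Rainfall"}],
    "CSIRO": [{"id": "zz", "title": "soil  moisture"}],   # same title, different ID
    "VIC": [{"id": "b2", "title": "Rainfall (copy)"}],    # same ID, different title
}
finds_per_repo = {repo: len(hits) for repo, hits in sample.items()}
unique = merge_results(sample)
```

Matching on ID or title catches the same dataset republished under a new ID in another catalogue, at the cost of occasionally merging distinct datasets that happen to share a title.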
JSON specification for the list of CKAN APIs to query by search_all.php, used by the scripts that query all CKAN instances. Each API needs a unique name, a URL and an API key (note: a key is required for the VIC CKAN API). The script will run without an API key, but the result set will then contain no results for that API and no error will be reported. Note: search-all.php allows the use of a ckan_apis_local.json file, which may contain private API keys. If this file is present, the script uses it; otherwise it uses ckan_apis.json. The local version of the file is excluded from the repo.
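The local-file fallback is easy to reproduce in other tools that want to share the same configuration. The sketch below is Python rather than the PHP used by the scripts, and it assumes the JSON file holds a list of objects with `name`, `url` and `key` fields; the actual layout of ckan_apis.json may differ, so treat the field names and both function names as hypothetical.

```python
import json
from pathlib import Path


def load_ckan_apis(directory="."):
    """Return the parsed API list, preferring the private local file if present."""
    base = Path(directory)
    local = base / "ckan_apis_local.json"
    default = base / "ckan_apis.json"
    path = local if local.is_file() else default
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def check_unique_names(apis):
    """Raise if two entries share a name.

    Assumes a list of {"name": ..., "url": ..., "key": ...} objects.
    """
    names = [api["name"] for api in apis]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"duplicate API names: {duplicates}")
    return names
```

Validating names up front gives an explicit error for a misconfigured list, which is useful given that a missing key otherwise fails silently.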
I have started adding notebooks to this directory as examples of using some of the APIs from Python. Notebooks are a great way to convert JSON responses to dataframes and then create output in various formats such as CSV in a few lines of code, and they are very useful for prototyping.
Now superseded by v3.
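The notebooks themselves are not reproduced here. As a dependency-free illustration of the same JSON-to-CSV step, the sketch below uses only the Python standard library instead of a dataframe library; the function name and the `id`/`title` field names are assumptions for the example.

```python
import csv
import io


def records_to_csv(records, fields):
    """Write a list of dicts to CSV text, keeping only the named fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells rather than raising an error.
        writer.writerow({f: record.get(f, "") for f in fields})
    return buffer.getvalue()


# Hypothetical records, shaped like entries in a decoded JSON search response.
records = [
    {"id": "a1", "title": "Soil Moisture", "notes": "ignored"},
    {"id": "b2"},
]
csv_text = records_to_csv(records, ["id", "title"])
```

In a notebook, a dataframe library would typically replace this helper, with the added benefit of handling nested fields.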
Searches the Magda catalogue first and then goes through the CKAN instances. Marks Magda results and other sources (work in progress). Output is an HTML table of results. The form has a tick box that reformats the output to produce a better spreadsheet without nested cells, and filters out duplicate distributions (and licence info).
Several data catalogues are based on the CKAN API. With this script, a user can select one or more of the available Australian CKAN APIs (refer to ckan_apis.json/ckan_apis_local.json) and specify search terms and the maximum number of results to return. Returns an HTML table of results and the total number of datasets found, highlighting any duplicates. This table can be copied and pasted into a spreadsheet for further processing.
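For readers unfamiliar with CKAN, searches of this kind go through the `package_search` action of the CKAN Action API, which takes a `q` query string and a `rows` limit and returns a JSON object with `success` and `result` (`count`, `results`) fields. The Python sketch below shows how a request URL could be built and a decoded response summarised. It is not a transcription of the PHP script; the function names and the example base URL are hypothetical, and no network call is made.

```python
from urllib.parse import urlencode


def build_search_url(base_url, terms, max_results=10):
    """Build a CKAN package_search URL for the given search terms."""
    query = urlencode({"q": terms, "rows": max_results})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{query}"


def summarise_response(payload):
    """Return (total_count, [(id, title), ...]) from a decoded package_search response."""
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    result = payload["result"]
    rows = [(ds["id"], ds.get("title", "")) for ds in result["results"]]
    return result["count"], rows


# Hypothetical catalogue URL and a hand-written response of the expected shape.
url = build_search_url("https://example.org/", "soil moisture", 5)
total, rows = summarise_response({
    "success": True,
    "result": {"count": 42, "results": [{"id": "a1", "title": "Soil Moisture"}]},
})
```

Note that `count` is the total number of matches in the catalogue, which can exceed the number of rows actually returned.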
Data.gov.au is built on the Magda API. This script queries the Magda API; a user can specify a search term and optional search tags (keywords). The output is very similar to the CKAN search results.
Meat & Livestock Australia publishes market livestock reports. These are available via a public API. This script obtains a list of available reports from that API, and provides a URL to query each report. Details of the API can be found here.
The Australian National Data Service (ANDS) has several API endpoints for data searches. This script uses either the getRIFCS API or the getExtRif API, and the output is an HTML table. Details are available via the Widgets & APIs page at ARDC. The script is very basic: the API endpoint and query parameters need to be set in the source. Note also that at times the script returns no results from the API endpoint. However, it is possible to obtain an API response (XML) via Postman, save it as an XML file and then use that saved file as script input to produce formatted output. Refer to the comments in the script.

### ands-search.html
Script that fetches a dataset listing from ANDS via the registry widget. See RDA Registry Search Widget for details. Has a modifiable results template. Very limited in returned fields. The search_RDA script produces more detailed output.
CSIRO data is available via DAP Web Services and OpenAPI; details are on the Developer Tools page. This script uses the OpenAPI endpoint. Documentation can be found on the CSIRO Data Access Portal Web Services page.
Figshare uses OpenAPI, and this script searches that catalogue, creating an HTML table as output.
Essentially the same as magda_search.php, but will query the Magda interface provided by CSIRO. Not used as the other script returns more results, and according to CSIRO documentation, the V2 OpenAPI used by the above script is the more relevant one.
The FAIR Evaluation Services allow datasets to be tested against a selected set of FAIR metrics. A collection called Maturity Indicator collection for Australian Agricultural data was created and is the default used by this script. This collection (id 15) consists of 18 individual tests from the FAIR Evaluation Services. The PHP script runs a FAIR evaluation using the FAIRMetrics API. It defaults to metrics collection 15, but this can be changed in the form shown. Note that the datasets to be evaluated need to be specified in datasets-to-check.csv. It is recommended to run fewer than 10 at a time, as the tests are very slow, often taking more than 5 minutes per dataset. A user has to specify an ORCID iD and provide a title for the test.
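One way to follow the "fewer than 10 at a time" advice is to split a longer list of datasets into small batches before filling datasets-to-check.csv. The Python sketch below is only an illustration of that batching step. It assumes the dataset identifier sits in the first column of each row, which may not match the file's real layout, and both function names are hypothetical.

```python
import csv
import io


def read_dataset_ids(csv_text):
    """Return the first column of each non-empty row (assumed to hold the dataset identifier)."""
    rows = csv.reader(io.StringIO(csv_text))
    return [row[0].strip() for row in rows if row and row[0].strip()]


def batches(items, size=9):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Twenty hypothetical identifiers split into runs of at most nine.
ids = read_dataset_ids("\n".join(f"dataset-{n}" for n in range(20)))
batch_sizes = [len(batch) for batch in batches(ids)]
```

At more than 5 minutes per dataset, a batch of nine can take upwards of 45 minutes, so smaller batches also make it easier to resume after a failure.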
Can be found in the /visualise directory. The live version of Graphs allows the viewing of data in different ways. Data is based on a spreadsheet that summarises results of all the data searches using the above scripts as well as manual searches.
This application shows the output from several data summaries based on that spreadsheet and uses the D3.js library, and is based on a sample application by subrata20011997. Screenshot below:
User can specify search terms. Returns table of results and number of finds. Identifies duplicates (by dataset name).
Very basic script. User needs to specify search terms in the script. All functionality is now in magda_search.php. Only keeping it as it has the older code that lists organisations and facets at the start of the output.