This repository has been archived by the owner on May 6, 2024. It is now read-only.

Detailed data from BAG available #492

Closed
baryluk opened this issue Apr 9, 2020 · 11 comments
Labels
data update This pull request or issue is about new or updated data

Comments

@baryluk
Contributor

baryluk commented Apr 9, 2020

I just found that BAG is now providing a very detailed data dump, with a breakdown by canton, age group, and sex, fully historicized:

Also, as far as I can see, there is no data for the Principality of Liechtenstein there.

I got these links from https://covid-19-schweiz.bagapps.ch/de-2.html and https://covid-19-schweiz.bagapps.ch/de-1.html , but the interface requires me to select the end date, so they will most likely break and/or not include all data by tomorrow. For completeness, I am attaching an archive with the files:

BAG_tableau_csv_2020-04-09.tar.gz

I guess it might be useful to develop a tool to cross-reference the data and compare it with what we store in the repo? Or maybe even publish it in a separate directory in this repo too?

@metaodi metaodi added the data update This pull request or issue is about new or updated data label Apr 9, 2020
@baryluk
Contributor Author

baryluk commented Apr 9, 2020

The links don't work any more. I think they are tied dynamically to a particular viewing session (on my computer), and so are gone now.

There is probably a way to get this data without a session, or to create a session programmatically.

There is some information on the HTTP REST API, and a Python client for it, here: https://github.com/tableau/server-client-python

And a tutorial here: https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api_get_started_tutorial_part_1.htm

The HTTP REST API itself is also documented; for example, these are the methods we are probably most interested in: https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api_ref_datasources.htm#download_data_source and https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api_ref_workbooksviews.htm#get_view

It looks like this: `GET /api/api-version/sites/site-id/datasources/datasource-id/content` or `GET /api/api-version/sites/site-id/views/view-id` and/or `GET /api/api-version/sites/site-id/views/view-id/data`.
(It is also implemented in the official Python library mentioned above.)
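A minimal sketch of how those endpoint URLs could be assembled. This is an assumption-laden illustration: the host, API version, and the site/view/datasource IDs below are placeholders, not values confirmed for the BAG server, and a real request would still need a sign-in token from `POST /api/{version}/auth/signin` in the `X-Tableau-Auth` header.

```python
# Sketch of the documented Tableau REST endpoint patterns.
# BASE and API_VERSION are assumptions, not confirmed for the BAG server.
BASE = "https://covid-19-schweiz.bagapps.ch"
API_VERSION = "3.8"

def view_data_url(site_id: str, view_id: str) -> str:
    """Build the 'Query View Data' endpoint (returns CSV)."""
    return f"{BASE}/api/{API_VERSION}/sites/{site_id}/views/{view_id}/data"

def datasource_content_url(site_id: str, datasource_id: str) -> str:
    """Build the 'Download Data Source' endpoint (returns a packaged datasource)."""
    return (f"{BASE}/api/{API_VERSION}/sites/{site_id}"
            f"/datasources/{datasource_id}/content")
```

The official Python client wraps the same calls, so once the site and view IDs are known, the download itself should be a few lines.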

Needs more digging to figure this out.

There is also a JavaScript client, the same one used for the visualisations themselves. It might be possible to execute some of it in Node.js, but it may have too many browser dependencies to actually work there.

So far, from inspecting the HTML and JavaScript, the "path" parameter is `shared/3WXJ8ZXN4`.

@BFLB

BFLB commented Apr 14, 2020

Hi @baryluk , thank you for working on this. I found this data source some time ago as well and have started feeding it into Elasticsearch. So far I am downloading it manually via https://covid-19-schweiz.bagapps.ch/de-1.html. Have you already found a way to automate the download? If it is not possible via the API, I could help by building a scraper based on synthetic monitoring.
As for the dataset, I am using the version with all columns and all lines. As you already mentioned, the number of lines corresponds to the number of confirmed cases, which is as detailed as it can get. There is even a column called f1, which seems to contain the case number, which could simplify updating the data. The problem is that these numbers seem to somehow change over time. On each of the updates I made, more than 1000 numbers no longer existed in the new data, yet the total number of lines in the new dataset was correct.
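The churn in case numbers between two downloads could be quantified with a small set comparison. A sketch; the column name `f1` is taken from the observation above, everything else (row shapes, file handling) is illustrative:

```python
import csv

def case_ids(rows, id_column="f1"):
    """Collect the set of case numbers from an iterable of CSV dict rows."""
    return {row[id_column] for row in rows}

def compare_snapshots(old_rows, new_rows):
    """Report case numbers that vanished from or appeared in the newer dump."""
    old_ids, new_ids = case_ids(old_rows), case_ids(new_rows)
    return {"vanished": old_ids - new_ids, "appeared": new_ids - old_ids}

# Usage with two downloaded files (filenames are hypothetical):
# with open("dump_old.csv", newline="") as a, open("dump_new.csv", newline="") as b:
#     report = compare_snapshots(csv.DictReader(a), csv.DictReader(b))
```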
In addition, a lot of the data is redundant, i.e. it exists in German, in French, or as abbreviations. Right now I plan to implement some normalization on import, although creating a sanitized English version would be the better solution, ideally stored publicly in this repo, since it is used by a lot of people and solutions. If collaboration is welcome, please let me know.
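The normalization pass mentioned above could start as a simple lookup table. The German/French labels below are hypothetical examples of the kind of redundancy described; the real values would have to be collected from the actual file:

```python
# Hypothetical localized labels -> English; not the actual BAG vocabulary.
SEX_MAP = {
    "Männlich": "male", "Hommes": "male",
    "Weiblich": "female", "Femmes": "female",
}

def normalize_sex(value: str) -> str:
    """Map a localized label to its English form, passing unknowns through."""
    return SEX_MAP.get(value, value)
```

Passing unknown values through unchanged makes it easy to spot labels the table does not cover yet.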

@baryluk
Contributor Author

baryluk commented Apr 14, 2020 via email

@BFLB

BFLB commented Apr 15, 2020

Hi Witold,
I have just created my own repo. It contains one original file, a first draft of a converted file, and a Python script to do the conversion.
Scraping will be done manually for the moment.
Please feel free to have a look. Comments are highly welcome.

Regards
Bernhard

@BFLB

BFLB commented Apr 22, 2020

Hi @baryluk ,
If you are interested, a Python/Selenium scraper is now available in my repo.

Cheers
Bernhard

@zukunft
Contributor

zukunft commented Apr 28, 2020

I just compared the number of deceased reported by the cantons and by the BAG:

| Canton | Date | Canton number | BAG number | Diff | Diff in percent |
|---|---|---|---|---|---|
| JU | 2020-04-27 | 7 | 1 | 6 | 600% |
| SH | 2020-04-28 | 6 | 2 | 4 | 200% |
| NE | 2020-04-22 | 65 | 25 | 40 | 160% |
| VS | 2020-04-27 | 132 | 86 | 46 | 53% |
| TI | 2020-04-27 | 311 | 219 | 92 | 42% |
| ZG | 2020-04-25 | 8 | 6 | 2 | 33% |
| VD | 2020-04-26 | 355 | 267 | 88 | 33% |
| SO | 2020-04-28 | 15 | 12 | 3 | 25% |
| BE | 2020-04-27 | 83 | 72 | 11 | 15% |
| GE | 2020-04-26 | 239 | 222 | 17 | 8% |
| GR | 2020-04-26 | 43 | 40 | 3 | 8% |
| BL | 2020-04-27 | 30 | 28 | 2 | 7% |
| AG | 2020-04-27 | 33 | 31 | 2 | 6% |
| TG | 2020-04-27 | 17 | 16 | 1 | 6% |
| ZH | 2020-04-27 | 115 | 112 | 3 | 3% |
| FR | 2020-04-27 | 78 | 78 | 0 | 0% |
| SG | 2020-04-27 | 31 | 31 | 0 | 0% |
| SZ | 2020-04-27 | 18 | 18 | 0 | 0% |
| GL | 2020-04-27 | 7 | 7 | 0 | 0% |
| UR | 2020-04-27 | 5 | 5 | 0 | 0% |
| AR | 2020-04-27 | 3 | 3 | 0 | 0% |
| NW | 2020-04-27 | 3 | 3 | 0 | 0% |
| LU | 2020-04-27 | 16 | 17 | -1 | -6% |
| BS | 2020-04-27 | 46 | 50 | -4 | -8% |
| CH TOTAL | | 1666 | 1351 | 315 | |

It looks to me that in many cantons the difference is small and most likely depends on the reporting time. But in NE, VS, TI, and VD I guess the difference has some other reason. Maybe NE can be an indication: the number of "Décès hospitalier" (hospital deaths) so far, based on the data from the canton, is 22 and the BAG reports 25, whereas the total number in the canton is 65.
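The diff and percentage columns in the comparison above follow directly from the two counts; a sketch of the arithmetic:

```python
def compare(canton_count: int, bag_count: int):
    """Return the absolute difference and the difference as a percentage
    of the BAG count (None when the BAG count is zero)."""
    diff = canton_count - bag_count
    pct = round(100 * diff / bag_count) if bag_count else None
    return diff, pct

# e.g. JU: compare(7, 1) -> (6, 600), matching the table's 600%.
```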

@zukunft
Contributor

zukunft commented Apr 30, 2020

It is possible that the difference between the numbers of the cantons and the BAG is due to testing criteria. Some cantons also declare a COVID-19-positive case based on a CT scan.

@baryluk
Contributor Author

baryluk commented May 16, 2020

@BFLB The Selenium scraper sounds like an interesting idea. Have you had a continuous stream of archived data for the last 2 weeks with it?

@BFLB

BFLB commented May 18, 2020

@baryluk
The scraper works fine most of the time and needed some changes once in a while. Last week was a game changer: BAG changed the data model of the CSV file from individual cases to aggregations. Now there is one line per gender-ageClass-canton-date group, where the date is either the confirmation date or the death date. This adds around 1000 lines per day, most of them containing 0 values, since the curve has flattened.

What I will do next is provide a lean version of the converted CSV file with only non-zero data sets. This should drastically reduce the size, and for most use cases it should be sufficient.
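Dropping the zero-valued rows is straightforward. A sketch, assuming a count column; the real column name in the BAG CSV may differ:

```python
import csv

def nonzero_rows(rows, count_column="count"):
    """Keep only rows whose (assumed) count column holds a non-zero value."""
    return [row for row in rows if row.get(count_column) not in ("0", "", None)]

# Usage (filename and column name are assumptions about the BAG CSV):
# with open("bag_aggregated.csv", newline="") as fh:
#     lean = nonzero_rows(csv.DictReader(fh))
```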

I adapted to the new model last week and refactored the code today. If everything works smoothly, I will start running the scraper as a scheduled task by the end of the week to fully automate the process.

Finally, I would like to scrape the number of tests done as soon as I have time, since these numbers have also been published for a week or two now.

@metaodi
Collaborator

metaodi commented Jul 7, 2020

It seems that no one is working on this issue currently (i.e. to cross-reference the data from here and BAG). I'm closing this for now, but feel free to re-open it if needed.

@metaodi metaodi closed this as completed Jul 7, 2020
@zukunft
Contributor

zukunft commented Jul 7, 2020

I tried to get confirmation from SH and the BAG for the number of deceased, but no one wanted to confirm this issue. It seems that the BAG is now also working on a fully digital version that should be ready for the second wave. So maybe soon we will see more details.
