This repo provisions and runs a Clickhouse instance loaded with data from the msk_met_2021, msk_ch_2020 and msk_impact_2017 datahub studies. A modified cBioPortal backend can use this instance to run cohort/filter queries in Study View.
Clickhouse performs well for analytical queries (filtering on column values) but is less suited to retrieving all column values of an entity (typically SELECT * FROM ...). In the current implementation, the samples table therefore contains a column with the internal sample identifiers used in the cBioPortal MySQL database. Once Clickhouse has determined which sample identifiers belong to the cohort, the sample objects can be retrieved efficiently from the MySQL database (with SELECT * FROM sample ...).
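The two-step lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for both Clickhouse and MySQL so that the example runs without either server, and the table and column names (sample_columnar, sample, sample_internal_id, cancer_type, stable_id) are hypothetical, not the actual schema of this repo or of cBioPortal.

```python
import sqlite3

# ASSUMPTION: sqlite3 is a stand-in for both databases, and every table and
# column name below is hypothetical (not the real Clickhouse/cBioPortal schema).


def make_demo_stores():
    """Create two in-memory databases playing the roles of Clickhouse and MySQL."""
    clickhouse = sqlite3.connect(":memory:")
    clickhouse.execute(
        "CREATE TABLE sample_columnar (sample_internal_id INTEGER, cancer_type TEXT)"
    )
    clickhouse.executemany(
        "INSERT INTO sample_columnar VALUES (?, ?)",
        [(1, "Breast Cancer"), (2, "Prostate Cancer"), (3, "Breast Cancer")],
    )

    mysql = sqlite3.connect(":memory:")
    mysql.execute(
        "CREATE TABLE sample (internal_id INTEGER PRIMARY KEY, stable_id TEXT, sample_type TEXT)"
    )
    mysql.executemany(
        "INSERT INTO sample VALUES (?, ?, ?)",
        [(1, "S-1", "Primary"), (2, "S-2", "Primary"), (3, "S-3", "Metastasis")],
    )
    return clickhouse, mysql


def cohort_sample_ids(clickhouse, cancer_type):
    """Step 1: the analytical store filters on a column and returns only internal ids."""
    rows = clickhouse.execute(
        "SELECT sample_internal_id FROM sample_columnar "
        "WHERE cancer_type = ? ORDER BY sample_internal_id",
        (cancer_type,),
    ).fetchall()
    return [row[0] for row in rows]


def fetch_samples(mysql, internal_ids):
    """Step 2: the row store returns full sample rows for the selected ids."""
    if not internal_ids:
        return []
    placeholders = ", ".join("?" for _ in internal_ids)
    query = (
        f"SELECT * FROM sample WHERE internal_id IN ({placeholders}) "
        "ORDER BY internal_id"
    )
    return mysql.execute(query, list(internal_ids)).fetchall()
```

Calling cohort_sample_ids first and passing its result to fetch_samples mirrors the division of labour: the column store only has to return a narrow list of identifiers, and the full rows come from the database that already holds them.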
The Clickhouse schema is defined in the clickhouse_provisioning/ directory.
- Edit the study_configs section in the create_clickhouse_db_table_files.py file to reflect the paths to the msk_met_2021, msk_ch_2020 and msk_impact_2017 datahub studies:
    study_configs = [
        {
            "study_dir": "/home/pnp300/git/datahub/public/msk_met_2021",
            "name": "msk_met_2021"
        },
        {
            "study_dir": "/home/pnp300/git/datahub/public/msk_ch_2020",
            "name": "msk_ch_2020"
        },
        {
            "study_dir": "/home/pnp300/git/datahub/public/msk_impact_2017",
            "name": "msk_impact_2017"
        }
    ]
- Create the Clickhouse staging files in the clickhouse_provisioning directory (in this repo) by running the create_clickhouse_db_table_files.py script:
    python3 create_clickhouse_db_table_files.py
- Provision and start Clickhouse with the services defined in the docker-compose.yml file:
    docker-compose up
  or, for detached mode:
    docker-compose up -d
This will start a Clickhouse instance with port 8123 exposed on the host system.
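To confirm that the instance is reachable once the container has started, a small check against the Clickhouse HTTP interface can help. The sketch below assumes the instance is reachable at localhost:8123, as described above, and uses the /ping endpoint of the Clickhouse HTTP interface, which answers "Ok." when the server is up. The function names ping_url and clickhouse_is_up are illustrative and not part of this repo.

```python
import http.client
import urllib.request


def ping_url(host="localhost", port=8123):
    """Build the URL of the Clickhouse HTTP health-check endpoint."""
    return f"http://{host}:{port}/ping"


def clickhouse_is_up(host="localhost", port=8123, timeout=2.0):
    """Return True if the server answers the ping request with 'Ok.'.

    ASSUMPTION: host and port default to the values mentioned above;
    adjust them if docker-compose.yml maps the port differently.
    """
    try:
        with urllib.request.urlopen(ping_url(host, port), timeout=timeout) as resp:
            return resp.status == 200 and resp.read().strip() == b"Ok."
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status or malformed
        # response: treat all of these as "not up (yet)".
        return False
```

Right after docker-compose up -d the server may still be initializing, so clickhouse_is_up() can return False for a short while before it starts returning True.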