aim-rsf/cprd-data-wrangle

License: MIT

👋 Welcome

👥 Who is this repository for?

This repository is for anyone new to working with datasets released by the Clinical Practice Research Datalink (CPRD). Researchers tasked with understanding the database tables, then querying and filtering to create a research cohort, may find our pre-processing pipeline and interactive notebooks a helpful guide to getting started.

Please note:

  • You need your own copy of CPRD's synthetic/real data to run the code. This repository does not contain any data files.

  • CPRD are moving towards a Trusted Research Environment (TRE) model of data access, in which researchers work with the data inside a secure environment instead of downloading it onto their own computer. Read more here.

  • This is a work-in-progress repository. If you would like to suggest or contribute a change, please read our contributor guide.

🥅 Project Goals

We aim to streamline the process for researchers using CPRD datasets, with the creation of clear documentation, efficient data management strategies and analytical pipelines. We will start with development of workflows utilising CPRD's medium fidelity synthetic datasets because they resemble

"the real world CPRD data with respect to the data types, data values, data formats, data structure and table relationships" ref.

New to Synthetic Data? Read an introduction here.

We will create and share documentation and code in openly available languages. We will start by loading the data into a relational database and summarising some of its main features.

By working with our research collaborators, we aim to test workflows written with synthetic datasets on the real datasets to ensure transferability and utility. We anticipate mismatches in the size of the data files and possibly in file format. Please reach out to us if you want to test our code on your real CPRD data, or have any feedback on improving transferability and utility.

CPRD's most recently released data specifications can be found here for the real datasets and here for the synthetic datasets.

💻 Current content

We include information on CPRD's Code Browser tool and how to request access to it.

The code-for-aurum folder uses Python and PostgreSQL to build a pre-processing workflow for CPRD Aurum data. The workflow converts the data files to a compatible format and then reads them into tables in a relational database. Workbooks have been created to familiarise a user with the CPRD Aurum tables, including how they link together and how to build a sample cohort. See a preview below:

[Animated preview: landing-page-demo-gif]
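As a rough illustration of the two pre-processing steps described above (file format conversion, then loading into relational tables), here is a minimal sketch. It is not the code in the code-for-aurum folder. It assumes the input is tab-delimited text, uses Python's built-in sqlite3 as a stand-in for PostgreSQL so that it runs without a database server, and the function names, table name and sample values are all hypothetical.

```python
import csv
import io
import sqlite3


def tsv_to_csv(tsv_text: str) -> str:
    """Convert tab-delimited text to comma-separated text (assumed conversion step)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


def load_csv(conn: sqlite3.Connection, table: str, csv_text: str) -> int:
    """Create a table from the CSV header and insert the remaining rows as text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
    return len(data)


# Hypothetical two-row extract; column names and values are illustrative only.
sample_tsv = "patid\tyob\tgender\n1001\t1980\t1\n1002\t1975\t2\n"

# sqlite3 in-memory database used here in place of PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
n_loaded = load_csv(conn, "patient", tsv_to_csv(sample_tsv))
n_in_table = conn.execute('SELECT COUNT(*) FROM "patient"').fetchone()[0]
print(n_loaded, n_in_table)
```

Every column is loaded as text to keep the sketch short. A real workflow would assign column types from the CPRD data specification and would likely use the database's bulk-loading facility for large files.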

🤝 Contributions and Acknowledgments

We acknowledge and thank these groups for making this project possible:

The views expressed within any file in this repository are those of the author(s) within the AIM-RSF programme, and not necessarily those of the NIHR, the Department of Health and Social Care, the Medicines and Healthcare products Regulatory Agency (MHRA) or CPRD.

Thanks to specific contributors

This project follows the all-contributors specification, using the emoji key:

  • Rachael Stickland: 📆 🚧 💻 📖 🤔

  • Mahwish Mohammad: 🚧 💻 📖 🤔 👀

  • Batool Almarzouq: 👀 🤔

  • Ann-Marie Mallon: 📆 🤔

  • Kirstie Whitaker: 🤔

Would you like to contribute? Please read our contributor guide.

♻️ Licences

This project is licensed under the MIT License. See the LICENSE file for more details.

Citation

Almarzouq, B., Mallon, A.-M., Mohammad, M., Stickland, R., Whitaker, K., & AIM-RSF team. (2024). Introduction to CPRD using synthetic datasets (cprd-data-wrangle) (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.13693616


You got to the end of the README? You get our 🦭 of approval!