Natural Language Processing using private and secure data. Powered by OpenMined's tools PySyft and SyferText.
The contents of this repo were featured in the blog post "Encrypted training on medical text data using SyferText and PyTorch" on OpenMined's blog.
This is a work in progress. Be prepared to run into coding errors and typos.
Follow the instructions to install:
Dataset compiled for Natural Language Processing using a corpus of medical transcriptions and custom-generated clinical stop words and vocabulary.
- X.csv. Fully processed dataset obtained from running the Data Modelling notebook.
- classes.txt. Text file describing the dataset's classes: Surgery, Medical Records, Internal Medicine and Other.
- train.csv. Training data subset. Contains 90% of the X.csv processed file.
- test.csv. Test data subset. Contains 10% of the X.csv processed file.
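The 90%/10% split described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code the repo actually uses; the function name `split_rows`, the seed, and the toy rows are assumptions made for the example.

```python
import random


def split_rows(rows, test_fraction=0.1, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-in for the rows of X.csv (hypothetical columns).
rows = [{"text": f"note {i}", "label": i % 4} for i in range(100)]
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed keeps the split reproducible, so train.csv and test.csv would contain the same rows on every run.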
- mtsamples.csv. Compiled from Kaggle's medical transcriptions dataset by Tara Boyle, scraped from Transcribed Medical Transcription Sample Reports and Examples. See Kaggle repository.
- clinical-stopwords.txt. Compiled from Dr. Kavita Ganesan's clinical-concepts repository. See the paper "Discovering Related Clinical Concepts Using Large Amounts of Clinical Notes".
- vocab.txt. Generated vocabulary text file for Natural Language Processing (NLP) using the Systematized Nomenclature of Medicine International (SNMI) data. See how to Generate your own vocab file.
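As a rough sketch of how a stop-word list and a vocabulary file like the ones above might be applied to a transcription: the example assumes both files hold one term per line, which is not confirmed here, and the helper names `load_terms` and `filter_tokens` are hypothetical. In-memory strings stand in for the real files so the snippet runs on its own.

```python
import io
import re


def load_terms(handle):
    """Read one lowercased term per line, skipping blank lines."""
    return {line.strip().lower() for line in handle if line.strip()}


def filter_tokens(text, stopwords, vocab=None):
    """Tokenize text, drop stop words, and optionally keep only vocabulary terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if t not in stopwords]
    if vocab is not None:
        kept = [t for t in kept if t in vocab]
    return kept


# Assumed file layout: one term per line (stand-ins for the real files).
stopwords = load_terms(io.StringIO("the\npatient\nwas\n"))
vocab = load_terms(io.StringIO("appendectomy\nlaparoscopic\n"))

tokens = filter_tokens(
    "The patient was scheduled for laparoscopic appendectomy.", stopwords, vocab
)
print(tokens)  # ['laparoscopic', 'appendectomy']
```

With the real files, `open("clinical-stopwords.txt")` and `open("vocab.txt")` would replace the `io.StringIO` stand-ins, provided the one-term-per-line assumption holds.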
- Data Modelling: Data exploration and feature engineering using Pandas, matplotlib and Seaborn, and consolidation of the dataset using scikit-learn.
- Medical text data exploration: An introduction to data exploration of medical text using Pandas, matplotlib and Seaborn. NOTE: Deprecated. Use Data Modelling instead.
- Medical text encrypted training: Tutorial on how to train an NLP model on data you cannot see using PySyft and SyferText. Heavily inspired by Alan Aboudib's Sentiment classification using SyferText use case. WARNING: This is an ongoing project, so be wary of errors.
Holds the script used to download whole datasets from a URL.
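Downloading a dataset from a URL could look roughly like the sketch below. It is not the repo's script: the function name `download_file` and the paths are assumptions, and the demo uses a local `file://` URL so that it runs without network access.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_file(url, destination):
    """Stream the resource at url to destination and return the saved path."""
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        # Copy in chunks so large datasets are not held in memory at once.
        shutil.copyfileobj(response, out)
    return destination


# Offline demo: a temporary local file stands in for a remote dataset.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.csv"
    source.write_text("text,label\nsample note,Surgery\n")
    saved = download_file(source.as_uri(), Path(tmp) / "data" / "copy.csv")
    downloaded_text = saved.read_text()

print(downloaded_text)
```

For a real dataset, an `https://` URL would replace `source.as_uri()`.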
Issues and pull requests are welcome.