private_nlp

Natural Language Processing using private and secure data. Powered by OpenMined's tools PySyft and SyferText.

Blog post

The contents of this repo were featured in the blog post "Encrypted training on medical text data using SyferText and PyTorch" on OpenMined's blog.

Disclaimer

This is a work in progress. Expect coding errors and typos.

Getting Started

Follow the instructions to install:

PySyft==0.2.5. Version 0.2.6 has an incompatibility issue with TensorFlow.
SyferText
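
Because the PySyft pin matters, it can help to confirm the installed version before opening the notebooks. The following is a minimal sketch, not part of this repo: it assumes PySyft is installed under the distribution name `syft`, and `check_pins` is a hypothetical helper name.

```python
from importlib import metadata

# Pin taken from the list above. The distribution name "syft" is an assumption
# about how PySyft was installed in your environment.
REQUIRED = {"syft": "0.2.5"}


def check_pins(required, get_version=metadata.version):
    """Return a dict of package -> problem description for every unmet pin."""
    problems = {}
    for package, wanted in required.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            problems[package] = "not installed"
            continue
        if installed != wanted:
            problems[package] = f"found {installed}, expected {wanted}"
    return problems


if __name__ == "__main__":
    # Prints nothing when every pinned package matches.
    for package, problem in check_pins(REQUIRED).items():
        print(f"{package}: {problem}")
```

An empty result means the environment matches the pin. SyferText is left out of `REQUIRED` because no version is pinned for it here.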

Data

Dataset compiled for Natural Language Processing using a corpus of medical transcriptions and custom-generated clinical stop words and vocabulary.

X.csv. Fully processed dataset obtained from running the Data Modelling notebook.
classes.txt. Text file describing the dataset's classes: Surgery, Medical Records, Internal Medicine and Other.
train.csv. Training data subset. Contains 90% of the X.csv processed file.
test.csv. Test data subset. Contains 10% of the X.csv processed file.
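
The 90/10 split described above can be sketched with the standard library alone. This is an illustration under assumptions, not the code from the Data Modelling notebook (which may use scikit-learn instead): it assumes X.csv has a header row, and the shuffle, the seed, and the helper names `split_rows` and `split_csv` are hypothetical.

```python
import csv
import random


def split_rows(rows, test_fraction=0.1, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_csv(source, train_path, test_path, test_fraction=0.1, seed=42):
    """Split a CSV file with a header row into train and test CSV files."""
    with open(source, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # assumption: the first row holds column names
        rows = list(reader)
    train, test = split_rows(rows, test_fraction, seed)
    for path, subset in ((train_path, train), (test_path, test)):
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(subset)
    return len(train), len(test)
```

Calling `split_csv("X.csv", "train.csv", "test.csv")` from the data folder would write both subsets and return their row counts. Fixing the seed keeps the split reproducible between runs.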

Authors and acknowledgment

mtsamples.csv. Compiled from Kaggle's medical transcriptions dataset by Tara Boyle, scraped from Transcribed Medical Transcription Sample Reports and Examples. See Kaggle repository.
clinical-stopwords.txt. Compiled from Dr. Kavita Ganesan's clinical-concepts repository. See the Discovering Related Clinical Concepts Using Large Amounts of Clinical Notes paper.
vocab.txt. Generated vocabulary text file for Natural Language Processing (NLP) using the Systematized Nomenclature of Medicine International (SNMI) data. See how to Generate your own vocab file.
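
As a rough illustration of how a stop-word list and a vocabulary file like the ones above are typically applied to text, here is a minimal sketch. It assumes both files hold one term per line, which may not match the actual file layout, and `load_terms` and `clean_tokens` are hypothetical helper names rather than functions from the notebooks.

```python
import re


def load_terms(path):
    """Read one term per line, lowercased, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def clean_tokens(text, stop_words, vocab=None):
    """Tokenize text, drop stop words, and optionally keep only vocab terms."""
    # Simple tokenizer for this sketch: runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [tok for tok in tokens if tok not in stop_words]
    if vocab is not None:
        kept = [tok for tok in kept if tok in vocab]
    return kept
```

For example, `clean_tokens(note, load_terms("clinical-stopwords.txt"), load_terms("vocab.txt"))` would return only the tokens of `note` that are in the vocabulary and are not stop words.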

Notebooks

Data Modelling: Data exploration and feature engineering using Pandas, matplotlib and Seaborn, and consolidation of the dataset using scikit-learn.
Medical text data exploration: An introduction to data exploration of medical text using Pandas, matplotlib and Seaborn. NOTE: Deprecated. Use Data Modelling instead.
Medical text encrypted training: Tutorial on how to train an NLP model on data you cannot see using PySyft and SyferText. Heavily inspired by Alan Aboudib's Sentiment classification using SyferText use case. WARNING: This is an ongoing project; be wary of errors.
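
To give an intuition for training on data you cannot see, the sketch below shows additive secret sharing, the idea behind the shared tensors PySyft 0.2.x uses for encrypted computation. It is a conceptual, standard-library illustration and does not use the PySyft or SyferText APIs. The modulus `Q` and the function names are arbitrary choices made for this example.

```python
import secrets

Q = 2**61 - 1  # modulus for this sketch; the exact value is an arbitrary choice


def share(secret, n_parties=2):
    """Split an integer into n additive shares that sum to it modulo Q."""
    shares = [secrets.randbelow(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares


def reconstruct(shares):
    """Recombine shares; values in the upper half of the range are negatives."""
    total = sum(shares) % Q
    return total - Q if total > Q // 2 else total


def add_shared(a_shares, b_shares):
    """Each party adds its own two shares locally; no secret is revealed."""
    return [(a + b) % Q for a, b in zip(a_shares, b_shares)]
```

Each share on its own looks like a random number, yet `reconstruct(add_shared(share(5), share(-12)))` yields the sum of the two secrets. Real encrypted training also needs fixed-point encoding and a multiplication protocol, which this sketch omits.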

Scripts

Holds the script used to download whole datasets from a URL.
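
For readers who want to see what such a download step can look like, here is a minimal standard-library sketch. It is not the script shipped in this folder: the function name `download` and the default `data` destination are assumptions made for illustration.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download(url, dest_dir="data"):
    """Download url into dest_dir and return the path of the saved file."""
    # Use the last path segment of the URL as the local file name.
    name = Path(urlparse(url).path).name
    if not name:
        raise ValueError(f"cannot derive a file name from {url!r}")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / name
    # Stream the response to disk so large datasets are not held in memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

The function raises `ValueError` before any network access when the URL has no file name component, for example a bare domain.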

Contributing

Issues and pull requests are welcome.

License

GNU GENERAL PUBLIC LICENSE VERSION 3

salgadev/private_nlp
