A simple Anonymization Tool for Japanese texts.
It uses Ginza POS tagging to identify potential Proper Nouns
and replace those
with a [ANON]
tag.
The method was specifically tunned to anonymize Japanese clinical texts, aiming to remove patient and doctor personal information, such as names and addresses, and also institution (hospital, clinic) names and locations.
Currently, this software only supports Excel (.xls/xlsx) files as input.
A setup file is provided in the Releases section. This installer packs all required dependencies (including Python). After installation, the program can be executed from the start menu or the desktop shortcut.
It can be executed in two main ways currently:
- Executing
gui.py
will initialize a gui to select files and define the parameters - To run it in the console,
main.py
can be called, passing the proper parameters.
-
The list of files to be processed can be added by clicking the
Select File(s)/Dir(s)
button.
Files can be removed by selecting them in the list and clicking theRemove Selected
button. -
Select the output directory by clicking the
Select Output Dir
button or typing the path. -
Shows/Hides the advanced options.
-
Add the name of the columns which should be completely anonymized (e.g. A column containing a person's name).
Column names can be added by typing inn the text box at the top and hittingenter
.
To remove an entry, simply select it in the list and click theRemove Selected
button. -
Add tokens (words) that should always be anonymized, regardless of the context. This can be useful in situations where, for example, a less common name is not correctly anonymized.
-
Add Stop-words. There are special words that usually indicate that the previous word should be anonymized, such as "病院" or "クリニック" (hospital and clinic).
-
Clicking the
Anonymize Documents
button will start the process. -
The
Results
box will show the statistics of the process (how many documents were processed and how many entites were anonymized). It also shows any errors that may occur.
- Python 3.10 or higher (It may work with older versions, but it was not tested)
- Spacy 3.4.4
- Ginza 5.1.2 (and ja-ginza module for Spacy)
The executable can be generated using PyInstaller.
The AnonymizationTool.spec
file contains the configuration used for the generation of the executable.
The setup file was generated using NSIS and NSIS Quick Setup Script Generator .
Note that all compilation and setup generation was done in a Windows 10 environment and some modification may be needed for it to work in other environments.