Training and application with the GATE Batch Learning PR are already straightforward: you can use the GATE GUI to load a corpus, then train and test on it with the Batch Learning PR. So what is the scope of GATE-ML?
- Crazy? No.
- Reinventing the wheel? No.
There is one and only one reason:
- When you load a very large corpus and try to preprocess and train on it in the GATE GUI, the GUI hangs.
GATE-ML runs machine learning in GATE as an embedded library. The package has three phases:
- Preprocessing: reads the input text files and creates GATE XML files
- Training: trains on the GATE XML files and creates a model
- Application: classifies an input text using the trained model
Reminder: change forward slashes and backslashes in paths to match your operating system.
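If the paths are kept in Java .properties files (an assumption here, suggested by the GATE_ML.properties file name), note that java.util.Properties treats a backslash as an escape character. A Windows path therefore needs doubled backslashes or forward slashes. A minimal sketch of that behaviour; the class name PathProperties and the C:/gate path are illustrative only:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class PathProperties {

    /** Parses properties text with java.util.Properties and returns the value stored under key. */
    static String valueOf(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // File content: GATE_HOME=C:\gate  (the single backslash is silently dropped)
        System.out.println(valueOf("GATE_HOME=C:\\gate", "GATE_HOME"));
        // File content: GATE_HOME=C:\\gate (the doubled backslash survives as one)
        System.out.println(valueOf("GATE_HOME=C:\\\\gate", "GATE_HOME"));
        // File content: GATE_HOME=C:/gate  (forward slashes pass through unchanged)
        System.out.println(valueOf("GATE_HOME=C:/gate", "GATE_HOME"));
    }
}
```

Forward slashes are usually the simpler choice, since Java's file APIs accept them on Windows as well.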
Initial properties needed for the system to run:
- GATE_HOME: the GATE home directory on your system
- learningMode: one of the three modes: Preprocessing, Training or Application
- sourceDirectory: contains all the property files for the three learning modes above
The source directory contains three subdirectories, one for each of the three learning modes.
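The three startup properties could be read and validated along these lines. This is a sketch, not the project's actual code: it assumes they live in a standard Java properties file, and the class GateMlSettings and the sample values are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class GateMlSettings {
    final String gateHome;
    final String learningMode;
    final String sourceDirectory;

    private GateMlSettings(String gateHome, String learningMode, String sourceDirectory) {
        this.gateHome = gateHome;
        this.learningMode = learningMode;
        this.sourceDirectory = sourceDirectory;
    }

    /** Reads the three required properties and fails fast if one is missing or blank. */
    static GateMlSettings load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new GateMlSettings(
                required(props, "GATE_HOME"),
                required(props, "learningMode"),
                required(props, "sourceDirectory"));
    }

    private static String required(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing required property: " + key);
        }
        return value.trim();
    }

    public static void main(String[] args) {
        // Hypothetical sample values, for illustration only.
        String sample = "GATE_HOME=/opt/gate\nlearningMode=Training\nsourceDirectory=sources\n";
        GateMlSettings settings = load(new StringReader(sample));
        System.out.println(settings.learningMode + " using " + settings.sourceDirectory);
    }
}
```

Failing on a missing property at startup gives a clearer error than a null value surfacing later in the run.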
Properties for Preprocess mode (sources/preprocess):
- GAPPFile: GAPP file for preprocessing. A sample GAPP file can be found at gappFile/ml_data_preprocessing.gapp
- AnnotationTypesRequired: name of the annotation into which the class label is injected. The default is Sentence. You can add your own custom annotations here. If you use an annotation other than the GATE default annotations, make sure to build the GAPP files with the PRs that create it
- CorpusName: name of the corpus
- inputDir: contains the training files as .txt files. During preprocessing, the name of each directory is used as the class label for all the .txt files in it. A simple directory hierarchy is expected, like that of 20news-group-data
- outputDir: the output GATE XML files are stored here
- removeStopWords: whether to remove stop words (true / false)
- removePunctuation: whether to remove punctuation (true / false)
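The rule that a directory name becomes the class label for the .txt files inside it can be sketched as follows. This only illustrates the convention and is not the project's code: the class ClassLabels and its methods are hypothetical, and it assumes a layout of inputDir/label/file.txt.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

final class ClassLabels {

    /** Returns the name of the directory that directly contains the file. */
    static String labelFor(Path file) {
        Path parent = file.getParent();
        if (parent == null || parent.getFileName() == null) {
            throw new IllegalArgumentException("File has no named parent directory: " + file);
        }
        return parent.getFileName().toString();
    }

    /** Maps each .txt file (path relative to inputDir) to the label taken from its directory. */
    static Map<String, String> labelsFor(Path inputDir) {
        Map<String, String> labels = new TreeMap<>();
        // Depth 2 covers inputDir/<label>/<file>; deeper nesting is ignored.
        try (Stream<Path> files = Files.walk(inputDir, 2)) {
            files.filter(Files::isRegularFile)
                 .filter(p -> p.getFileName().toString().endsWith(".txt"))
                 .filter(p -> !p.getParent().equals(inputDir)) // skip files with no label directory
                 .forEach(p -> labels.put(inputDir.relativize(p).toString(), labelFor(p)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return labels;
    }
}
```

With a 20 Newsgroups style layout, a file such as sci.space/a.txt would be labelled sci.space.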
Properties for Training mode (sources/training):
- GAPPFile: GAPP file for training. A sample GAPP file can be found at gappFile/ml_training.gapp
- CorpusName: name of the corpus
- xmlCorpus: the outputDir of Preprocess mode
The ml-config.xml file is in this folder, so the trained model is stored here by default.
Properties for Application mode (sources/application):
- GAPPFile: GAPP file for application. A sample GAPP file can be found at gappFile/ml_application.gapp
- CorpusName: name of the corpus
- removeStopWords: whether to remove stop words (true / false)
- removePunctuation: whether to remove punctuation (true / false)
Sample GAPP files can be found in the gappFile directory:
- ml_data_preprocessing.gapp: ANNIE with defaults (without the NE Transducer and OrthoMatcher)
- ml_training.gapp: Batch Learning PR with ml-config.xml from sources/training
- ml_application.gapp: ANNIE with defaults (without the NE Transducer and OrthoMatcher) plus the Batch Learning PR
Execution starts in GateLearning.java, which reads GATE_ML.properties and proceeds according to the learningMode.
If the learningMode is "Preprocess", the system uses the sources/preprocess folder as its configuration directory.
If the learningMode is "Training", the system uses the sources/training folder as its configuration directory.
If the learningMode is "Application", the system uses the sources/application folder as its configuration directory. This is only a demo mode: the input text is hard-coded in the executeClassifier method of GateLearning.
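The mapping from learningMode to configuration directory described above can be sketched as a small lookup. This is an illustration under stated assumptions, not the code in GateLearning.java: the class ModeDirectories is hypothetical, and because this README spells the first mode both "Preprocess" and "Preprocessing", the sketch accepts either.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Objects;

final class ModeDirectories {

    /** Resolves the configuration directory for a learning mode under sourceDirectory. */
    static Path configDirFor(String sourceDirectory, String learningMode) {
        Objects.requireNonNull(learningMode, "learningMode");
        switch (learningMode) {
            case "Preprocess":
            case "Preprocessing":
                return Paths.get(sourceDirectory, "preprocess");
            case "Training":
                return Paths.get(sourceDirectory, "training");
            case "Application":
                return Paths.get(sourceDirectory, "application");
            default:
                throw new IllegalArgumentException("Unknown learningMode: " + learningMode);
        }
    }

    public static void main(String[] args) {
        // Prints the platform-specific form of sources/training.
        System.out.println(configDirFor("sources", "Training"));
    }
}
```

Building the path with Paths.get rather than string concatenation keeps the separator correct on every operating system.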
Build with Maven: mvn clean install assembly:single, or mvn clean package.
Apache License 2 - http://www.apache.org/licenses/LICENSE-2.0.html