
Pre-processing, Training and Classification in Embedded GATE


srijiths/GATE-ML


Why GATE-ML?

Training and applying a model with the GATE Batch Learning PR is already straightforward: you can load a corpus in the GATE GUI, then train and test it with the Batch Learning PR. So what is the scope of GATE-ML?

  • Crazy? No.
  • Reinventing the wheel? No.

One and only one reason :)

  • When you load a very large corpus and try to preprocess and train on it in the GATE GUI, the GUI hangs badly.

GATE-ML runs machine learning through GATE Embedded instead of the GUI. The package has three phases:

  • Preprocessing : Reads input text files and creates GATE XML files
  • Training : Trains on the GATE XML files and creates a model
  • Application : Classifies an input text using the trained model

Property files

Reminder: adjust forward slashes and backslashes in all paths to match your operating system.

GATE_ML.properties

Initial properties needed for the system to run:

  • GATE_HOME : The GATE home directory on your system
  • learningMode : One of the three modes: Preprocessing, Training or Application
  • sourceDirectory : Directory that contains the property files for the three learning modes
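As a rough illustration of the three keys above, the sketch below parses a sample GATE_ML.properties and fails fast when a key is missing. It assumes the file is read with java.util.Properties, which the README does not state; the class name GateMlPropertiesSketch and every path in the sample are placeholders, not project values.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

final class GateMlPropertiesSketch {

    // Placeholder values for illustration only; none of these paths come from the project.
    static final String SAMPLE =
            "GATE_HOME=/opt/gate\n"
          + "learningMode=Training\n"
          + "sourceDirectory=/home/user/GATE-ML/sources\n";

    // The three keys the README lists for GATE_ML.properties.
    static final String[] REQUIRED = {"GATE_HOME", "learningMode", "sourceDirectory"};

    /** Parses properties text and rejects it if any required key is absent. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException("Could not read properties", e);
        }
        for (String key : REQUIRED) {
            if (props.getProperty(key) == null) {
                throw new IllegalArgumentException("Missing required property: " + key);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        for (String key : REQUIRED) {
            System.out.println(key + " = " + props.getProperty(key));
        }
    }
}
```

If the file really is loaded through java.util.Properties, note that a backslash is an escape character in that format, so a Windows path needs either forward slashes (C:/gate) or doubled backslashes (C:\\gate).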

Sources Directory

The source directory contains three subdirectories, one for each of the three learning modes.

preprocess

  • GAPPFile : GAPP file for preprocessing. A sample GAPP file can be found at gappFile/ml_data_preprocessing.gapp
  • AnnotationTypesRequired : Name of the annotation that receives the class label. The default is Sentence. You can add your own custom annotations here. If you use an annotation other than the GATE default annotations, make sure the GAPP file is built with the PRs that create it
  • CorpusName : Name of the corpus
  • inputDir : Directory containing the training files as .txt files. During preprocessing, the name of each directory is used as the class label for all the .txt files in it. Expects a simple directory hierarchy like 20news-group-data
  • outputDir : Directory where the output GATE XML files are stored
  • removeStopWords : Whether to remove stop words ( true / false )
  • removePunctuation : Whether to remove punctuation ( true / false )
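The directory-name-as-label convention described for inputDir can be sketched as follows. This is not the project's code: the class DirectoryLabelSketch, the helper names, and the demo corpus are all hypothetical, and the sketch assumes exactly one level of class directories directly under inputDir.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

final class DirectoryLabelSketch {

    /** Maps "label/fileName" to label for every .txt file one level below inputDir. */
    static Map<String, String> labelsFor(Path inputDir) {
        Map<String, String> labels = new TreeMap<>();
        try (DirectoryStream<Path> classDirs =
                 Files.newDirectoryStream(inputDir, entry -> Files.isDirectory(entry))) {
            for (Path classDir : classDirs) {
                // The directory name doubles as the class label.
                String label = classDir.getFileName().toString();
                try (DirectoryStream<Path> files = Files.newDirectoryStream(classDir, "*.txt")) {
                    for (Path file : files) {
                        labels.put(label + "/" + file.getFileName(), label);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return labels;
    }

    /** Builds a tiny throwaway corpus in a temp directory (hypothetical data). */
    static Path createDemoCorpus() {
        try {
            Path root = Files.createTempDirectory("gate-ml-input");
            write(root.resolve("rec.sport.hockey").resolve("game.txt"), "The home team won in overtime.");
            write(root.resolve("sci.space").resolve("orbit.txt"), "The probe entered a stable orbit.");
            write(root.resolve("sci.space").resolve("notes.md"), "Skipped: not a .txt file.");
            write(root.resolve("readme.txt"), "Skipped: not inside a class directory.");
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void write(Path file, String text) throws IOException {
        Files.createDirectories(file.getParent());
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        labelsFor(createDemoCorpus())
            .forEach((file, label) -> System.out.println(file + " -> " + label));
    }
}
```

Files placed directly in inputDir get no label in this sketch, which is one reason to keep the hierarchy flat and consistent.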

training

  • GAPPFile : GAPP file for training. A sample GAPP file can be found at gappFile/ml_training.gapp
  • CorpusName : Name of the corpus
  • xmlCorpus : The outputDir of the preprocess mode

The ml-config.xml file is in this folder, so the trained model is stored here by default.

application

  • GAPPFile : GAPP file for application. A sample GAPP file can be found at gappFile/ml_application.gapp
  • CorpusName : Name of the corpus
  • removeStopWords : Whether to remove stop words ( true / false )
  • removePunctuation : Whether to remove punctuation ( true / false )
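To make the effect of the removeStopWords and removePunctuation flags concrete, here is a minimal sketch of what such filtering could look like on plain strings. It is an assumption-laden stand-in: the README does not say how GATE-ML implements the flags, and the class TextCleanupSketch, the tiny stop list, and the whitespace tokenization are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TextCleanupSketch {

    // Hypothetical stop list; a real one would be much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and", "to", "in"));

    /** Applies the two flags to a text and returns the remaining tokens joined by spaces. */
    static String clean(String text, boolean removeStopWords, boolean removePunctuation) {
        // Replace ASCII punctuation with spaces so adjacent words stay separated.
        String working = removePunctuation ? text.replaceAll("\\p{Punct}", " ") : text;
        List<String> kept = new ArrayList<>();
        for (String token : working.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (removeStopWords && STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                continue;
            }
            kept.add(token);
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String text = "The cat is on the mat.";
        System.out.println(clean(text, true, true));   // both flags on
        System.out.println(clean(text, false, false)); // both flags off
    }
}
```

Whatever the real implementation does, the same flag values should be used for preprocessing and application so that the classifier sees text cleaned the same way it was trained on.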

GAPP Files

Sample gapp files can be found here.

  • ml_data_preprocessing.gapp : ANNIE with defaults (without the NE Transducer and OrthoMatcher)
  • ml_training.gapp : Batch Learning PR with ml-config.xml from sources/training
  • ml_application.gapp : ANNIE with defaults (without the NE Transducer and OrthoMatcher) plus the Batch Learning PR

GATE-ML Work Flow

Execution starts from GateLearning.java, which reads GATE_ML.properties and proceeds according to the learningMode.

If the learningMode is "Preprocess", the system uses the sources/preprocess folder as its configuration directory.

If the learningMode is "Training", the system uses the sources/training folder as its configuration directory.

If the learningMode is "Application", the system uses the sources/application folder as its configuration directory. This is just a demo mode: the input text is hard-coded in GateLearning's executeClassifier method.
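The mode-to-folder dispatch described in this section can be sketched as below. This is not GateLearning.java itself: the class LearningModeSketch and the method configDirFor are hypothetical, and the mode strings are taken from the quoted values in this section, so the exact spelling the project accepts is an assumption.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class LearningModeSketch {

    /** Resolves the configuration folder under sourceDirectory for a learning mode. */
    static Path configDirFor(String sourceDirectory, String learningMode) {
        switch (learningMode) {
            case "Preprocess":
                return Paths.get(sourceDirectory, "preprocess");
            case "Training":
                return Paths.get(sourceDirectory, "training");
            case "Application":
                return Paths.get(sourceDirectory, "application");
            default:
                // Fail loudly rather than silently picking a folder.
                throw new IllegalArgumentException("Unknown learningMode: " + learningMode);
        }
    }

    public static void main(String[] args) {
        for (String mode : new String[] {"Preprocess", "Training", "Application"}) {
            System.out.println(mode + " -> " + configDirFor("sources", mode));
        }
    }
}
```

Building the path with Paths.get avoids hard-coding a slash direction, which sidesteps the forward-slash versus backslash issue for this one lookup.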

Dependency Project

Build

Using Maven: mvn clean install assembly:single or mvn clean package

License

Apache License 2 - http://www.apache.org/licenses/LICENSE-2.0.html
