- About
- Setup Tutorial
- Data
- EasyCCG Submodule
- ETE_Trees
- categorize.py
- convert.sh & to_tree.py
- batch_categorize.sh
This repository can be used to categorize questions based upon their CCG parses from EasyCCG.
This repository was created for the Honors Option for Computer Science 442 at The Pennsylvania State University, Fall 2018.
Work on this project ran from August 2018 to November 2018.
Authors:
For instructions on cloning this repository along with its submodule, see the EasyCCG Submodule section.
For this project, the pre-trained model used in development is model_questions.
The categorize.py script requires a path to an input file that contains one question per line, without line numbers. The Python script data/clean.py contains a function that strips line numbers from an input file.
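As a rough illustration of this preprocessing step, the sketch below strips a leading line number from each question. The function name `strip_line_numbers` and the assumed number format (digits, optionally followed by `.`, `)` or `:`, then whitespace) are hypothetical and are not taken from data/clean.py, whose actual function may behave differently.

```python
import re

# Assumed (hypothetical) line-number format: optional leading whitespace, digits,
# an optional '.', ')' or ':', then at least one whitespace character.
_LINE_NUMBER = re.compile(r"^\s*\d+[.):]?\s+")


def strip_line_numbers(lines):
    """Return the lines with any leading line number removed, skipping blank lines."""
    cleaned = []
    for line in lines:
        # Remove the trailing newline, then at most one leading line number.
        text = _LINE_NUMBER.sub("", line.rstrip("\n"), count=1)
        if text.strip():
            cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    sample = ["1 Who wrote Hamlet?\n", "2. What is the capital of France?\n"]
    for question in strip_line_numbers(sample):
        print(question)
```

Under this assumed format, a question that itself begins with a number followed by a space would lose that number, so a real input file may need a stricter pattern.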
For instructions on running the categorization script categorize.py, see the categorize.py section.
If an --outfile parameter is not provided, categorize.py writes its output to data/output/_<inputfile>_grouped_out.txt by default.
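The default naming rule can be sketched as follows. This is only an illustration: the helper name `default_outfile` is hypothetical, and the sketch assumes that `<inputfile>` means the input file name without its extension, which the text above does not specify.

```python
from pathlib import Path


def default_outfile(input_path, output_dir="data/output"):
    """Build a default output path following the pattern described above.

    Assumption: <inputfile> is the input file name without its extension.
    """
    stem = Path(input_path).stem
    return (Path(output_dir) / f"_{stem}_grouped_out.txt").as_posix()


if __name__ == "__main__":
    print(default_outfile("data/input/QALD-questions-stripped.txt"))
```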
The contents of the output file depend on the option given to the -o flag of categorize.py.
Option | Output Description |
---|---|
0 | The output will contain only the categorized questions. |
1 | The output will contain only the categorized CCG trees. |
2 | The output will contain all of the categorized questions as well as their common CCG subtree. |
The data/ directory contains the input files of questions to be categorized and the output files of categorized questions.
It also contains the script clean.py, which provides functions for processing question files.
EasyCCG is a CCG parser created by Mike Lewis. It is added as a submodule to this repository.
git clone --recursive [email protected]:jed326/EasyCCGTrees.git
git submodule init
git submodule update
EasyCCG requires a model in order to run. Fortunately, the author of EasyCCG has provided pre-trained models.
Pre-trained models can be downloaded here: https://drive.google.com/drive/folders/0B7AY6PGZ8lc-NGVOcUFXNU5VWXc
After the models have been downloaded, they should be placed in the easyccg/ directory.
For more detailed setup instructions, reference the EasyCCG repository.
To parse questions into text form:
java -jar $EASYCCG_HOME/easyccg.jar -f path/to/input --model $EASYCCG_HOME/model_questions [> outfile.txt]
To output trees to html:
java -jar $EASYCCG_HOME/easyccg.jar -f path/to/input --model $EASYCCG_HOME/model_questions -o html [> outfile.txt]
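If the parser needs to be driven from Python rather than from the shell, the two commands above could be wrapped roughly as shown here. The function names `build_easyccg_command` and `parse_to_file` are hypothetical helpers, not part of this repository; the sketch assumes `java` is on the PATH and that the EASYCCG_HOME environment variable is set as in the commands above.

```python
import os
import subprocess


def build_easyccg_command(input_path, easyccg_home, html=False):
    """Return the argument list for the EasyCCG invocation shown above."""
    command = [
        "java", "-jar", f"{easyccg_home}/easyccg.jar",
        "-f", input_path,
        "--model", f"{easyccg_home}/model_questions",
    ]
    if html:
        # Same as appending "-o html" on the command line.
        command += ["-o", "html"]
    return command


def parse_to_file(input_path, outfile, html=False):
    """Run EasyCCG and redirect its standard output to outfile."""
    easyccg_home = os.environ["EASYCCG_HOME"]
    with open(outfile, "w") as out:
        subprocess.run(
            build_easyccg_command(input_path, easyccg_home, html),
            stdout=out,
            check=True,  # raise CalledProcessError if the parser exits non-zero
        )
```

Passing the command as a list avoids shell quoting problems with input paths that contain spaces.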
ETE is a Python library that can be used to visualize and print tree structures. In this project it can be used to save trees to .png files.
Instructions on how to use ETE can be found in the ETE_Trees directory.
categorize.py is the primary script for this project. Usage is as follows:
    usage: python3 categorize.py [-h] [--outfile OUTFILE] [-d DEPTH] [-o OUTPUT] path

    Group similar questions into categories

    positional arguments:
      path                  Relative path to input file containing newline
                            separated questions to group

    optional arguments:
      -h, --help            show this help message and exit
      --outfile OUTFILE     Optional path to output categories to
      -d DEPTH, --depth DEPTH
                            Maximum depth to compare trees at
      -o OUTPUT, --output OUTPUT
                            0: Questions Only / 1: Trees Only / 2: Questions and Common Subtree
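For reference, an argparse parser that would accept the same command-line interface can be sketched as below. This is a reconstruction from the usage text only, not the code in categorize.py: the function name `build_parser`, the integer types, and the default values for --depth and --output are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the usage text above (defaults are assumptions)."""
    parser = argparse.ArgumentParser(
        prog="python3 categorize.py",
        description="Group similar questions into categories",
    )
    parser.add_argument(
        "path",
        help="Relative path to input file containing newline separated questions to group",
    )
    parser.add_argument("--outfile", help="Optional path to output categories to")
    # Assumed: depth is an integer with no limit (None) when omitted.
    parser.add_argument("-d", "--depth", type=int, default=None,
                        help="Maximum depth to compare trees at")
    # Assumed: output defaults to 0 (questions only).
    parser.add_argument("-o", "--output", type=int, choices=[0, 1, 2], default=0,
                        help="0: Questions Only / 1: Trees Only / 2: Questions and Common Subtree")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.path, args.outfile, args.depth, args.output)
```

Restricting --output with `choices` makes argparse reject values outside the three documented options instead of passing them through silently.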
to_tree.py reads EasyCCG output from stdin and writes the corresponding tree string to stdout.
convert.sh uses the to_tree function to batch-convert questions to tree form:
./convert data/input/QALD-questions-stripped.txt > output.txt
# or use -i to ignore 1 column per line
./convert -i1 data/input/QALD-questions.txt > output.txt
batch_categorize.sh is a helper script that exports all three types of categorized results to the data/output folder.
The three types are described in the Output table.