GitHub - emory-irlab/AttributeValueExtraction

# AttributeValue

Project description
--------------------------------------
From raw text data, generates relevant pairs of attributes (nouns or compound nouns) and values (nouns, adjectives, or
quantities). Also provides measures of each pair's frequency within the input data and a truncated file containing only
the most frequent pairs in terms of support (with a ceiling of 20,000).

How to use
--------------------------------------
1 - Ensure that the JAVA_HOME variable is set, preferably to version 1.8.0_73 or later of the JDK
2 - Ensure that Maven, preferably version 3.3.9 or later, is installed and that its "bin" directory is included in the PATH
3 - Navigate to the directory "AttributeValue"
4 - Run the "run.sh" file in the root project directory (ex: enter "bash run.sh" into Bash when in the root directory).
    Include 3 arguments as follows:
        1: A path to input data (an XML file containing discrete entries denoted with the tag "row", and the text as the
           value of the "body" attribute)
        2: A path to a file that will contain all of the outputted attribute-value pairs, plus the support and confidence
           of each
        3: A path to a file that will contain up to the top 20,000 most frequent outputted attribute-value pairs, plus the
           support and confidence of each
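
To make the expected input shape concrete, here is a minimal sketch (not the project's parseXML class) that reads the text attribute of every "row" element using the JDK's built-in DOM parser. The root element name "posts", the sample rows, and the class and method names are hypothetical. This README writes the attribute both as "body" and as "Body"; the sketch assumes "Body", so adjust it to match your data.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of the project.
class RowBodyReaderSketch {

    // Returns the "Body" attribute of every <row> element, in document order.
    // ASSUMPTION: the attribute is spelled "Body".
    static List<String> readBodies(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList rows = doc.getElementsByTagName("row");
            List<String> bodies = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                Element row = (Element) rows.item(i);
                // The parser has already decoded entities such as &lt; and &gt;.
                bodies.add(row.getAttribute("Body"));
            }
            return bodies;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data in the assumed shape.
        String xml = "<posts>"
                + "<row Id=\"1\" Title=\"Lost dog\" Body=\"&lt;p&gt;I lost my blue dog.&lt;/p&gt;\"/>"
                + "<row Id=\"2\" Title=\"Heater\" Body=\"A solar water heater.\"/>"
                + "</posts>";
        for (String body : readBodies(xml)) {
            System.out.println(body);
        }
    }
}
```

Note that the decoded body of the first row still contains HTML tags, which is why the procedure includes a tag-removal step before annotation.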

Methodology
--------------------------------------
Attribute-value pairs here are taken to be pairs of words/chains of words indexing objects (e.g. "dog", "water heater")
and words indexing features that describe them (e.g. "lost" for "dog", "solar" for "water heater"). A limited definition
is adopted here; candidate objects are nouns, as well as dependent nouns that form a linear chain to the left of the
root noun (in "red garden gnome society", the last three words form an object, as "garden" and "gnome" are both nouns
depending on and immediately to the left of the root noun "society" without any intervening words, whereas in
"garden gnome appreciating society", "garden" and "gnome" are not part of the same object as "society", since there
is an adjective separating them).
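
The chain rule above can be sketched as follows. This is a simplified illustration, not the project's code: it assumes Penn Treebank tags and uses part-of-speech adjacency alone, whereas the project works from dependency structures. The class and method names are hypothetical, and "appreciating" is tagged JJ here only to follow the example above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of the project.
class NounChainSketch {

    // Penn Treebank noun tags (NN, NNS, NNP, NNPS) all start with "NN".
    static boolean isNoun(String tag) {
        return tag.startsWith("NN");
    }

    // Given parallel arrays of words and tags and the index of a root noun,
    // returns the unbroken run of nouns that ends at the root, in surface order.
    static List<String> objectChain(String[] words, String[] tags, int rootIndex) {
        List<String> chain = new ArrayList<>();
        // Walk leftward from the root and stop at the first non-noun.
        for (int i = rootIndex; i >= 0 && isNoun(tags[i]); i--) {
            chain.add(words[i]);
        }
        Collections.reverse(chain);
        return chain;
    }

    public static void main(String[] args) {
        String[] w1 = {"red", "garden", "gnome", "society"};
        String[] t1 = {"JJ", "NN", "NN", "NN"};
        System.out.println(objectChain(w1, t1, 3)); // prints [garden, gnome, society]

        String[] w2 = {"garden", "gnome", "appreciating", "society"};
        String[] t2 = {"NN", "NN", "JJ", "NN"};
        System.out.println(objectChain(w2, t2, 3)); // prints [society]
    }
}
```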

The general format of an attribute-value pair here is as follows: (Attribute/POS, value/POS). Examples: (dog/NN, blue/JJ)
for "blue dog"; ((heater/NN, water/NN), solar/NN) for "solar water heater". This presentation takes note of the fact
that objects may themselves be attribute-value pairs. In "solar water heater", the object "water heater" is itself
an attribute-value pair, with "heater" as the attribute and "water" as the value. In this way, a hierarchy of attributes
and values can be seen in this project's presentation of multi-word pairs.
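
One way to model this hierarchy is a small recursive type in which an attribute or value is either a tagged word or another pair. The sketch below is illustrative only and is not the project's AttrValPair class. All type names are hypothetical, and it is written in Java 8 style to match the recommended JDK.

```java
// Hypothetical types for illustration; not the project's AttrValPair.
class PairSketch {

    // Anything that can appear as an attribute or a value.
    interface Node {
        String render();
    }

    // A single word with its part-of-speech tag, e.g. heater/NN.
    static final class Word implements Node {
        final String text;
        final String pos;

        Word(String text, String pos) {
            this.text = text;
            this.pos = pos;
        }

        public String render() {
            return text + "/" + pos;
        }
    }

    // An (attribute, value) pair; either side may itself be a pair.
    static final class Pair implements Node {
        final Node attribute;
        final Node value;

        Pair(Node attribute, Node value) {
            this.attribute = attribute;
            this.value = value;
        }

        public String render() {
            return "(" + attribute.render() + ", " + value.render() + ")";
        }
    }

    public static void main(String[] args) {
        // "solar water heater": the object "water heater" is itself a pair.
        Node waterHeater = new Pair(new Word("heater", "NN"), new Word("water", "NN"));
        Node solarWaterHeater = new Pair(waterHeater, new Word("solar", "NN"));
        System.out.println(solarWaterHeater.render()); // prints ((heater/NN, water/NN), solar/NN)
    }
}
```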

Procedure
--------------------------------------
1 - Take in an XML file
2 - Collect the text data of the input file's <row> elements into a file, with each line
consisting of a <row>'s number, "Title" value, and "Body" text
3 - Remove HTML tags from the file created in step 2 (so that HTML tags are not
tokenized)
4 - Annotate the parsed input text and generate all dependency structures contained within it
5 - Extract all dependency structures with a noun or proper noun as the root
6 - From the dependency structures given in 5, generate attribute-value pairs
7 - Sort the attribute-value pairs
8 - Calculate the support and confidence of each pair
9 - Generate two output files: One containing all attribute-value pairs, and one containing the top 20,000 attribute-value
pairs (no duplicates). The output files feature three columns: One for each pair, one for the pair's support, and
one for the pair's confidence

Directory structure and relevant file descriptions
--------------------------------------
1 - .idea
2 - src/main
    i. defunctJava
        a. defunctCode.java
            -Various methods not used in the final implementation of this project, for reference
        b. generateItemsets1_1.java
            -An unfinished set of methods to obtain attribute-value pairs from a list of dependencies. Intended to
             collect all possible interpretations of noun phrases, not including prepositional phrases or determiners
             (e.g. "old garden gnome society" would be represented as (((society, gnome), garden), old),
             ((society, (gnome, garden)), old), as well as others)
    ii. development_txts
        -Various files used for development purposes
    iii. I-O_data
        -Used to store data at various stages of processing
    iv. java
        -Package containing all Java classes needed and used in the execution of the project (all are .java classes)
        a. AttrValPair
            -A representation of an attribute-value pair, with fields for the attribute, value, transaction ID (if the
             pair is part of a collection of attribute-value pairs from various transactions), and frequency metrics
             such as support and confidence
        b. AttrValPairsToOutput
            -Contains the second main method in the project. Takes in a file of attribute-value pairs in the following
             format, each pair in its own line: Attribute:Value;transaction ID. Computes the support and confidence of
             each attribute-value pair, then writes all itemsets to one file, and the top 20,000 most frequent itemsets
             in terms of support, no duplicates, to another. The output format is as follows:

                Attribute-value pair    Support     Confidence
                (attr1, val1)           support1    confidence1
                (attr2, val2)           support2    confidence2
                .                       .           .
                .                       .           .
             Args:
                1: Path to a file from which input will be taken
                2: Path to a file. All pairs will be printed here
                3: Path to a file. A ceiling of the 20,000 most frequent pairs (in terms of support) will be printed here
        c. combineAttrValPairs
            -Not implemented
        d. generateAttrValPairs0_2
            -Methods useful for generating attribute-value pairs from an array of noun dependencies
        e. generateOutput
            -Methods to generate the output formats used in this project, including a method to help print values in
             columns and a method to generate a string version of a rounded double
        f. getNounDependencies
            -Methods to generate and print the dependencies of a root noun
        g. InputXMLsToAttrValPairs
            -Contains the first main method in the project. Takes in an XML file containing posts, then generates
             attribute-value pairs and prints them to a file, in the following format:

                attribute1:value1;transaction ID1
                attribute2:value2;transaction ID2
                .
                .
                .
             Args:
                1: Path to an XML file from which data will be extracted
                2: Path to a file to which attribute-value pairs will be printed
        h. StaticMinOrientedAttrValHeap
            -Implementation of a min-oriented heap with a fixed capacity. Only compares attribute-value pairs in terms
             of support. Designed to track the attribute-value pairs with the highest support
        i. parseXML
            -Methods useful for parsing text from <row> tags in XML files. Creates a single line of each <row>'s Id,
             Title, and Body attributes
        j. RemoveHTMLTagsFromTXTs
            -Methods useful for removing HTML tags from an inputted file as well as, potentially, row numbers
        k. splitInputs
            -Not implemented
        l. test
            -Site for test operations
    v. resources
3 - target
4 - AttributeValue.iml
5 - pom.xml
    -Project object model
6 - README.txt
    -This file
7 - run.sh
    -Bash shell script that runs the first part of the project
8 - run2.sh
    -Bash shell script that runs the second part of the project
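
The fixed-capacity, min-oriented heap described for StaticMinOrientedAttrValHeap is a common way to keep only the highest-support pairs without sorting everything. The sketch below illustrates the idea with java.util.PriorityQueue swapped in for the project's own heap class. The Scored type, the topK method, and the sample values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper for illustration; uses the JDK's PriorityQueue,
// not the project's StaticMinOrientedAttrValHeap.
class TopKSketch {

    // A pair label together with its support.
    static final class Scored {
        final String pair;
        final double support;

        Scored(String pair, double support) {
            this.pair = pair;
            this.support = support;
        }
    }

    // Orders by support only, lowest first.
    static final Comparator<Scored> BY_SUPPORT =
            Comparator.comparingDouble((Scored s) -> s.support);

    // Returns the k items with the highest support, highest first.
    static List<Scored> topK(List<Scored> items, int k) {
        PriorityQueue<Scored> heap = new PriorityQueue<>(BY_SUPPORT);
        for (Scored s : items) {
            heap.offer(s);
            // The root of a min-heap is the weakest kept item; evict it when over capacity.
            if (heap.size() > k) {
                heap.poll();
            }
        }
        List<Scored> result = new ArrayList<>(heap);
        result.sort(BY_SUPPORT.reversed());
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample values.
        List<Scored> items = Arrays.asList(
                new Scored("(dog, blue)", 0.5),
                new Scored("(dog, lost)", 0.1),
                new Scored("(heater, solar)", 0.9),
                new Scored("(gnome, red)", 0.3));
        for (Scored s : topK(items, 2)) {
            System.out.println(s.pair + " " + s.support);
        }
    }
}
```

Because the heap never holds more than k + 1 items, memory stays bounded by the ceiling (20,000 in this project) regardless of how many pairs are read.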
