emory-irlab/AttributeValueExtraction

# AttributeValue

Project description
--------------------------------------
Generates relevant attribute-value pairs from raw text data, where attributes are nouns or compound nouns and values
are nouns, adjectives, or quantities. Also reports how frequent each pair is within the input data and writes a
truncated file containing only the pairs with the highest support (with a ceiling of 20,000).

How to use
--------------------------------------
1 - Ensure that the JAVA_HOME variable is set and points to a JDK, preferably version 1.8.0_73 or later
2 - Ensure that Maven, preferably version 3.3.9 or later, is installed and included in the PATH
3 - Navigate to the directory "AttributeValue"
4 - Run the "run.sh" file in the root project directory (e.g. enter "bash run.sh" into Bash when in the root directory).
    Include 3 arguments as follows:
    1: A path to the input data (an XML file containing discrete entries denoted with the tag "row", with the text as
       the value of the "Body" attribute)
    2: A path to a file that will contain all of the generated attribute-value pairs, plus the support and confidence
       of each
    3: A path to a file that will contain up to the 20,000 most frequent attribute-value pairs, plus the
       support and confidence of each
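
As a rough illustration of the input format described in argument 1, the sketch below reads the text of every "row"
entry with the JDK's built-in DOM parser. It is not the project's own parser: the class name RowBodyReader, the
<posts> root element, the sample rows, and the exact attribute spelling "Body" are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; it is not one of this project's classes.
class RowBodyReader {

    // Returns the "Body" attribute of every <row> element, in document order.
    static List<String> readBodies(InputStream xml) {
        try {
            NodeList rows = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(xml)
                    .getElementsByTagName("row");
            List<String> bodies = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                Element row = (Element) rows.item(i);
                // getAttribute returns an empty string when the attribute is missing.
                bodies.add(row.getAttribute("Body"));
            }
            return bodies;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse the input XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample input; the <posts> root and the row contents are assumptions.
        String sample = "<posts>"
                + "<row Id=\"1\" Title=\"Lost dog\" Body=\"&lt;p&gt;I lost my dog.&lt;/p&gt;\" />"
                + "<row Id=\"2\" Title=\"Heater\" Body=\"A solar water heater.\" />"
                + "</posts>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        for (String body : readBodies(in)) {
            System.out.println(body);
        }
    }
}
```

Note that the parser decodes XML entities, so a body stored as escaped HTML comes back with literal HTML tags in it.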

Methodology
--------------------------------------
Attribute-value pairs here are taken to be pairs of words or chains of words denoting objects (e.g. "dog",
"water heater") and words denoting features that describe them (e.g. "lost" for "dog", "solar" for "water heater").
A limited definition is adopted here: candidate objects are nouns, together with any dependent nouns that form a linear
chain to the left of the root noun. In "red garden gnome society", the last three words form an object, as "garden" and
"gnome" are both nouns depending on and immediately to the left of the root noun "society" without any intervening
words. In "garden gnome appreciating society", by contrast, "garden" and "gnome" are not part of the same object as
"society", since there is an adjective ("appreciating") separating them.
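
The chain rule above can be sketched in a few lines of Java. This is a simplification, not the project's
implementation: it looks only at hand-assigned part-of-speech tags and adjacency, whereas the project works from
dependency structures. The class name NounChain, the method name, and the tags in the example are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; it is not one of this project's classes.
class NounChain {

    // Returns the object ending at headIndex: the head noun plus every noun
    // that sits immediately to its left with no other word in between.
    static List<String> objectEndingAt(String[] words, String[] tags, int headIndex) {
        if (!tags[headIndex].startsWith("NN")) {
            throw new IllegalArgumentException("The head word must be a noun");
        }
        int start = headIndex;
        // Walk left while the neighbouring word is also tagged as a noun.
        while (start > 0 && tags[start - 1].startsWith("NN")) {
            start--;
        }
        return new ArrayList<>(Arrays.asList(words).subList(start, headIndex + 1));
    }

    public static void main(String[] args) {
        // Tags are assigned by hand here (an assumption); a tagger would supply them in practice.
        String[] words1 = {"red", "garden", "gnome", "society"};
        String[] tags1 = {"JJ", "NN", "NN", "NN"};
        System.out.println(objectEndingAt(words1, tags1, 3)); // [garden, gnome, society]

        String[] words2 = {"garden", "gnome", "appreciating", "society"};
        String[] tags2 = {"NN", "NN", "JJ", "NN"};
        System.out.println(objectEndingAt(words2, tags2, 3)); // [society]
    }
}
```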

The general format of an attribute-value pair here is as follows: (attribute/POS, value/POS). Examples: (dog/NN, blue/JJ)
for "blue dog"; ((heater/NN, water/NN), solar/NN) for "solar water heater". This notation reflects the fact
that objects may themselves be attribute-value pairs. In "solar water heater", the object "water heater" is itself
an attribute-value pair, with "heater" as the attribute and "water" as the value. In this way, multi-word pairs are
presented as a hierarchy of attributes and values.
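
One way to picture the nesting is as a fold over the phrase from right to left: the rightmost word is the innermost
attribute, and each word to its left wraps the pair built so far as a new value. The sketch below only reproduces the
notation for the two examples above; the class name NestedPair and its method are assumptions, and the project's own
pair-generation code may work differently.

```java
// Hypothetical helper for illustration; it is not one of this project's classes.
class NestedPair {

    // Builds the nested "(attribute, value)" notation for a phrase whose last word is the root noun.
    static String nest(String[] words, String[] tags) {
        if (words.length == 0 || words.length != tags.length) {
            throw new IllegalArgumentException("Need one tag per word and at least one word");
        }
        int last = words.length - 1;
        // Start from the root noun, which is the innermost attribute.
        String pair = words[last] + "/" + tags[last];
        // Each word further to the left becomes the value of the pair built so far.
        for (int i = last - 1; i >= 0; i--) {
            pair = "(" + pair + ", " + words[i] + "/" + tags[i] + ")";
        }
        return pair;
    }

    public static void main(String[] args) {
        System.out.println(nest(new String[] {"blue", "dog"}, new String[] {"JJ", "NN"}));
        // (dog/NN, blue/JJ)
        System.out.println(nest(new String[] {"solar", "water", "heater"}, new String[] {"NN", "NN", "NN"}));
        // ((heater/NN, water/NN), solar/NN)
    }
}
```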

Procedure
--------------------------------------
1 - Take in an XML file
2 - Collect the text data of the input file's <row> elements into a file, with each line
    consisting of a <row>'s number, "Title" value, and "Body" text
3 - Remove HTML tags from the file created in step 2 (so that HTML tags are not
    tokenized)
4 - Annotate the parsed input text and generate all dependency structures contained within it
5 - Extract all dependency structures with a noun or proper noun as the root
6 - From the dependency structures extracted in step 5, generate attribute-value pairs
7 - Sort the attribute-value pairs
8 - Calculate the support and confidence of each pair
9 - Generate two output files: one containing all attribute-value pairs, and one containing the top 20,000
    attribute-value pairs (no duplicates). Each output file has three columns: one for the pair, one for the pair's
    support, and one for the pair's confidence
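
Steps 8 and 9 can be sketched as follows. This README does not define support or confidence, so the sketch assumes
the usual association-rule definitions: support is the fraction of all transactions that contain the pair, and
confidence is the fraction of transactions containing the attribute that also contain the pair. The class name
PairMetrics, the sample lines, and the assumption that attributes contain no ":" are all illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; it is not one of this project's classes.
class PairMetrics {

    // Maps each pair, written "(attribute, value)", to {support, confidence}.
    // Input lines are assumed to look like "attribute:value;transaction ID".
    static Map<String, double[]> compute(List<String> lines) {
        Map<String, Set<String>> pairTx = new LinkedHashMap<>(); // pair -> transactions containing it
        Map<String, Set<String>> attrTx = new HashMap<>();       // attribute -> transactions containing it
        Map<String, String> attrOfPair = new HashMap<>();
        Set<String> allTx = new HashSet<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            int semi = line.lastIndexOf(';');
            if (colon < 0 || semi < colon) {
                continue; // skip malformed lines
            }
            String attr = line.substring(0, colon);
            String pair = "(" + attr + ", " + line.substring(colon + 1, semi) + ")";
            String tx = line.substring(semi + 1);
            pairTx.computeIfAbsent(pair, k -> new HashSet<>()).add(tx);
            attrTx.computeIfAbsent(attr, k -> new HashSet<>()).add(tx);
            attrOfPair.put(pair, attr);
            allTx.add(tx);
        }
        Map<String, double[]> metrics = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : pairTx.entrySet()) {
            double count = e.getValue().size();
            double support = count / allTx.size();                                     // assumed definition
            double confidence = count / attrTx.get(attrOfPair.get(e.getKey())).size(); // assumed definition
            metrics.put(e.getKey(), new double[] {support, confidence});
        }
        return metrics;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "dog/NN:lost/JJ;1", "dog/NN:blue/JJ;2", "dog/NN:lost/JJ;3", "heater/NN:solar/NN;3");
        // Print three left-aligned columns: pair, support, confidence.
        System.out.printf(Locale.ROOT, "%-28s%-16s%s%n", "Attribute-value pair", "Support", "Confidence");
        for (Map.Entry<String, double[]> e : compute(lines).entrySet()) {
            System.out.printf(Locale.ROOT, "%-28s%-16.3f%.3f%n", e.getKey(), e.getValue()[0], e.getValue()[1]);
        }
    }
}
```

With the four sample lines, (dog/NN, lost/JJ) appears in 2 of 3 transactions, so its support and confidence are both
2/3 under these definitions, while (heater/NN, solar/NN) has support 1/3 and confidence 1.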

Directory structure and relevant file descriptions
--------------------------------------
1 - .idea
2 - src/main
    i. defunctJava
        a. defunctCode.java
           -Various methods not used in the final implementation of this project, for reference
        b. generateItemsets1_1.java
           -An unfinished set of methods to obtain attribute-value pairs from a list of dependencies. Intended to
            collect all possible interpretations of noun phrases, not including prepositional phrases or determiners
            (e.g. "old garden gnome society" would be represented as (((society, gnome), garden), old),
            ((society, (gnome, garden)), old), as well as others)
    ii. development_txts
        -Various files used for development purposes
    iii. I-O_data
        -Used to store data at various stages of processing
    iv. java
        -Contains all Java classes used in the execution of the project (all are .java files)
         a. AttrValPair
            -A representation of an attribute-value pair, with fields for the attribute, value, transaction ID (if the
             pair is part of a collection of attribute-value pairs from various transactions), and frequency metrics
             such as support and confidence
         b. AttrValPairsToOutput
            -Contains the second main method in the project. Takes in a file of attribute-value pairs, one pair per
             line, in the following format: Attribute:Value;transaction ID. Computes the support and confidence of
             each attribute-value pair, then writes all pairs to one file and the top 20,000 most frequent pairs
             in terms of support, no duplicates, to another. The output format is as follows:

             Attribute-value pair        Support         Confidence
             (attr1, val1)               support1        confidence1
             (attr2, val2)               support2        confidence2
             .                           .               .
             .                           .               .
             Args:
             1: Path to a file from which input will be taken
             2: Path to a file. All pairs will be printed here
             3: Path to a file. Up to 20,000 of the most frequent pairs (in terms of support) will be printed here
         c. combineAttrValPairs
            -Not implemented
         d. generateAttrValPairs0_2
            -Methods useful for generating attribute-value pairs from an array of noun dependencies
         e. generateOutput
            -Methods to generate the output formats used in this project, including a method to help print values in
             columns and a method to generate a string version of a rounded double
         f. getNounDependencies
            -Methods to generate and print the dependencies of a root noun
         g. InputXMLsToAttrValPairs
            -Contains the first main method in the project. Takes in an XML file containing posts, then generates
             attribute-value pairs and prints them to a file, in the following format:

             attribute1:value1;transaction ID1
             attribute2:value2;transaction ID2
             .
             .
             .
             Args:
             1: Path to an XML file from which data will be extracted
             2: Path to a file to which attribute-value pairs will be printed
         h. StaticMinOrientedAttrValHeap
            -Implementation of a min-oriented heap with a fixed capacity. Compares attribute-value pairs only in terms
             of support. Designed to keep track of the attribute-value pairs with the highest support
         i. parseXML
            -Methods useful for parsing text from <row> tags in XML files. Writes each <row>'s Id, Title, and Body
             values to a single line
         j. RemoveHTMLTagsFromTXTs
            -Methods useful for removing HTML tags and, optionally, row numbers from an input file
         k. splitInputs
            -Not implemented
         l. test
            -Site for test operations
    v. resources
3 - target
4 - AttributeValue.iml
5 - pom.xml
    -Project object model
6 - README.md
    -This file
7 - run.sh
    -Bash shell script that runs the first part of the project
8 - run.sh
    -Bash shell script that runs the second part of the project
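
The idea behind StaticMinOrientedAttrValHeap (item h under "java" above) can be sketched with the JDK's
java.util.PriorityQueue standing in for the project's own heap. Keeping the smallest retained support at the top of a
fixed-size min-heap means each new pair only has to beat that one value to enter the top K. The class names
TopKBySupport and Scored, and the sample values, are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper for illustration; it is not one of this project's classes.
class TopKBySupport {

    // Minimal stand-in for an attribute-value pair with a support score.
    static final class Scored {
        final String pair;
        final double support;

        Scored(String pair, double support) {
            this.pair = pair;
            this.support = support;
        }
    }

    // Returns the k pairs with the highest support, ordered from highest to lowest.
    static List<Scored> topK(Iterable<Scored> pairs, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        Comparator<Scored> bySupport = Comparator.comparingDouble((Scored s) -> s.support);
        // Min-heap: the retained pair with the lowest support is always at the head.
        PriorityQueue<Scored> heap = new PriorityQueue<>(k, bySupport);
        for (Scored s : pairs) {
            if (heap.size() < k) {
                heap.add(s);
            } else if (s.support > heap.peek().support) {
                heap.poll(); // evict the weakest retained pair
                heap.add(s);
            }
        }
        List<Scored> result = new ArrayList<>(heap);
        result.sort(bySupport.reversed());
        return result;
    }

    public static void main(String[] args) {
        List<Scored> all = Arrays.asList(
                new Scored("(dog/NN, lost/JJ)", 0.5),
                new Scored("(dog/NN, blue/JJ)", 0.1),
                new Scored("(heater/NN, solar/NN)", 0.9),
                new Scored("(gnome/NN, garden/NN)", 0.3));
        for (Scored s : topK(all, 2)) {
            System.out.println(s.pair + "\t" + s.support);
        }
    }
}
```

With a capacity of 20,000, as used for the truncated output file, the heap never holds more than 20,000 pairs no
matter how many pairs are scanned.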
