emory-irlab/AttributeValueExtraction
# AttributeValue

Project description
--------------------------------------
From raw text data, generates relevant pairs of attributes (nouns or compound nouns) and values (nouns, adjectives, or quantities). Also provides measures of each pair's frequency within the input data, along with a truncated file containing only the most frequent pairs in terms of support (with a ceiling of 20,000).

How to use
--------------------------------------
1 - Ensure that the JAVA_HOME variable is set, preferably to version 1.8.0_73 or later of the JDK
2 - Ensure that Maven, preferably version 3.3.9 or later, is installed and available on the PATH
3 - Navigate to the directory "AttributeValue"
4 - Run the "run.sh" file in the root project directory (e.g. enter "bash run.sh" into Bash when in the root directory). Include 3 arguments as follows:
    1: A path to input data (an XML file containing discrete entries denoted with the tag "row", with the text as the value of the "body" attribute)
    2: A path to a file that will contain all of the outputted attribute-value pairs, plus the support and confidence of each
    3: A path to a file that will contain up to the top 20,000 most frequent outputted attribute-value pairs, plus the support and confidence of each

Methodology
--------------------------------------
Attribute-value pairs here are taken to be pairs of words or chains of words indexing objects (e.g. "dog", "water heater") and words indexing features that describe them (e.g. "lost" for "dog", "solar" for "water heater").
A limited definition is adopted here; candidate objects are nouns, as well as dependent nouns that form a linear chain to the left of the root noun (in "red garden gnome society", the last three words form an object, as "garden" and "gnome" are both nouns depending on and immediately to the left of the root noun "society" without any intervening words, whereas in "garden gnome appreciating society", "garden" and "gnome" are not part of the same object as "society", since there is an adjective separating them).

The general format of an attribute-value pair here is as follows: (attribute/POS, value/POS).

Examples:
    (dog/NN, blue/JJ) for "blue dog"
    ((heater/NN, water/NN), solar/NN) for "solar water heater"

This presentation takes note of the fact that objects may themselves be attribute-value pairs. In "solar water heater", the object "water heater" is itself an attribute-value pair, with "heater" as the attribute and "water" as the value. In this way, a hierarchy of attributes and values can be seen in this project's presentation of multi-word pairs.

Procedure
--------------------------------------
1 - Take in an XML file
2 - Collect the text data of each input file's <row> entities into a file, with each line consisting of a <row>'s number, "Title" value, and "Body" text
3 - Remove HTML tags from the file created in step 2 (so that HTML tags are not tokenized)
4 - Annotate the parsed input text, generate all dependency structures contained within it
5 - Extract all dependency structures with a noun or proper noun as the root
6 - From the dependency structures given in 5, generate attribute-value pairs
7 - Sort the attribute-value pairs
8 - Calculate the support and confidence of each pair
9 - Generate two output files: one containing all attribute-value pairs, and one containing the top 20,000 attribute-value pairs (no duplicates)
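As a rough illustration of how step 6 might identify a multi-word object under the rule given in the Methodology section, the sketch below walks left from a root noun and collects the unbroken run of nouns next to it. It is not the project's code: the class `NounChainSketch`, the `Tagged` type, and the `tag` helper are hypothetical names, and the sketch checks adjacency and POS tags only, assuming that a dependency parser has already confirmed that the collected nouns depend on the root.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; none of these names come from the project.
class NounChainSketch {

    // A word paired with its POS tag, e.g. ("gnome", "NN").
    static final class Tagged {
        final String word;
        final String pos;
        Tagged(String word, String pos) { this.word = word; this.pos = pos; }
    }

    // Builds a tagged sentence from "word/POS" strings such as "dog/NN".
    static List<Tagged> tag(String... wordSlashPos) {
        List<Tagged> out = new ArrayList<>();
        for (String s : wordSlashPos) {
            int k = s.lastIndexOf('/');
            out.add(new Tagged(s.substring(0, k), s.substring(k + 1)));
        }
        return out;
    }

    // Assumption: any tag starting with "NN" (NN, NNS, NNP, NNPS) counts as a noun.
    static boolean isNoun(Tagged t) { return t.pos.startsWith("NN"); }

    // Returns the root noun at rootIndex plus the unbroken run of nouns
    // immediately to its left, in sentence order. Returns an empty list
    // if the token at rootIndex is not a noun.
    static List<String> objectAt(List<Tagged> sentence, int rootIndex) {
        List<String> chain = new ArrayList<>();
        int i = rootIndex;
        while (i >= 0 && isNoun(sentence.get(i))) {
            chain.add(sentence.get(i).word);
            i--;
        }
        Collections.reverse(chain);
        return chain;
    }

    public static void main(String[] args) {
        // "red" is an adjective, so the chain stops before it.
        System.out.println(objectAt(tag("red/JJ", "garden/NN", "gnome/NN", "society/NN"), 3));
        // prints [garden, gnome, society]

        // "appreciating" is tagged JJ here, following the README's description of it.
        System.out.println(objectAt(tag("garden/NN", "gnome/NN", "appreciating/JJ", "society/NN"), 3));
        // prints [society]
    }
}
```

The two calls in `main` reproduce the README's own examples: the intervening non-noun in the second sentence cuts "garden" and "gnome" off from "society".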
The output files feature three columns: one for each pair, one for the pair's support, and one for the pair's confidence.

Directory structure and relevant file descriptions
--------------------------------------
1 - .idea
2 - .src/main
    i. defunctJava
        a. defunctCode.java - Various methods not used in the final implementation of this project, for reference
        b. generateItemsets1_1.java - An unfinished set of methods to obtain attribute-value pairs from a list of dependencies. Intended to collect all possible interpretations of noun phrases, not including prepositional phrases or determiners (e.g. "old garden gnome society" would be represented as (((society, gnome), garden), old), ((society, (gnome, garden)), old), as well as others)
    ii. development_txts - Various files used for development purposes
    iii. I-O_data - Used to store data at various stages of processing
    iv. java - Package containing all Java classes needed and used in the execution of the project (all are .java classes)
        a. AttrValPair - A representation of an attribute-value pair, with fields for the attribute, value, transaction ID (if the pair is part of a collection of attribute-value pairs from various transactions), and frequency metrics such as support and confidence
        b. AttrValPairsToOutput - Contains the second main method in the project. Takes in a file of attribute-value pairs in the following format, each pair on its own line: Attribute:Value;transaction ID. Computes the support and confidence of each attribute-value pair, then writes all itemsets to one file, and the top 20,000 most frequent itemsets in terms of support, no duplicates, to another. The output format is as follows:
            Attribute-value pair    Support     Confidence
            (attr1, val1)           support1    confidence1
            (attr2, val2)           support2    confidence2
            .                       .           .
            .                       .           .
           Args:
            1: Path to a file from which input will be taken
            2: Path to a file. All pairs will be printed here
            3: Path to a file. A ceiling of the 20,000 most frequent pairs (in terms of support) will be printed here
        c. combineAttrValPairs - Not implemented
        d. generateAttrValPairs0_2 - Methods useful for generating attribute-value pairs from an array of noun dependencies
        e. generateOutput - Methods to generate the output formats used in this project, including a method to help print values in columns and a method to generate a string version of a rounded double
        f. getNounDependencies - Methods to generate and print the dependencies of a root noun
        g. InputXMLsToAttrValPairs - Contains the first main method in the project. Takes in an XML file containing posts, then generates attribute-value pairs and prints them to a file, in the following format:
            attribute1:value1;transaction ID1
            attribute2:value2;transaction ID2
            .
            .
            .
           Args:
            1: Path to an XML file from which data will be extracted
            2: Path to a file to which attribute-value pairs will be printed
        h. StaticMinOrientedAttrValHeap - Implementation of a min-oriented heap with a fixed capacity. Only compares attribute-value pairs in terms of support. Designed to keep track of the attribute-value pairs with the highest support
        i. parseXML - Methods useful for parsing text from <row> tags in XML files. Creates a single line from each <row>'s Id, Title, and Body attributes
        j. RemoveHTMLTagsFromTXTs - Methods useful for removing HTML tags from an inputted file as well as, potentially, row numbers
        k. splitInputs - Not implemented
        l. test - Site for test operations
    v. resources
3 - target
4 - AttributeValue.iml
5 - pom.xml - Project object model
6 - README.md - This file
7 - run.sh - Bash shell script that runs the first part of the project
8 - run.sh - Bash shell script that runs the second part of the project
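The description of StaticMinOrientedAttrValHeap above corresponds to a common way of keeping the top N items from a stream: hold at most N items in a min-heap ordered by support, and evict the smallest whenever a better candidate arrives. The sketch below shows that idea using `java.util.PriorityQueue`. It is an assumption about the general technique, not the project's implementation: the class `TopKBySupportSketch` and its `Pair` type are hypothetical, the support values are made-up sample numbers, and duplicate removal is not handled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; this is not the project's StaticMinOrientedAttrValHeap.
class TopKBySupportSketch {

    // Minimal stand-in for an attribute-value pair with a precomputed support.
    static final class Pair {
        final String attribute;
        final String value;
        final double support;
        Pair(String attribute, String value, double support) {
            this.attribute = attribute;
            this.value = value;
            this.support = support;
        }
        @Override public String toString() { return "(" + attribute + ", " + value + ")"; }
    }

    private final int capacity;
    private final PriorityQueue<Pair> heap;

    TopKBySupportSketch(int capacity) {
        if (capacity < 1) throw new IllegalArgumentException("capacity must be positive");
        this.capacity = capacity;
        // Min-oriented: the pair with the lowest support sits at the head.
        this.heap = new PriorityQueue<>(capacity, Comparator.comparingDouble((Pair p) -> p.support));
    }

    // Keeps the pair only if there is room or it beats the current minimum.
    void offer(Pair p) {
        if (heap.size() < capacity) {
            heap.add(p);
        } else if (p.support > heap.peek().support) {
            heap.poll();
            heap.add(p);
        }
    }

    // Returns the retained pairs, highest support first.
    List<Pair> descending() {
        List<Pair> out = new ArrayList<>(heap);
        out.sort(Comparator.comparingDouble((Pair p) -> p.support).reversed());
        return out;
    }

    public static void main(String[] args) {
        TopKBySupportSketch top = new TopKBySupportSketch(2); // the project uses a ceiling of 20,000
        top.offer(new Pair("dog", "lost", 0.10));     // sample values, not real results
        top.offer(new Pair("heater", "water", 0.40));
        top.offer(new Pair("dog", "blue", 0.25));
        System.out.println(top.descending()); // prints [(heater, water), (dog, blue)]
    }
}
```

With a fixed capacity, memory stays bounded by the ceiling no matter how many pairs are offered, and each offer costs at most one logarithmic-time heap update.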