I will use two datasets for this project. The first is the Universal Decompositional Semantics (UDS) dataset. The main Decomp repository, which contains both the UDS dataset and the Decomp toolkit, is available at https://github.com/decompositional-semantics-initiative/decomp. The novelty of the UDS dataset is that it combines five decompositional semantics-aligned annotation datasets into a single graph-based resource, with attributes on both nodes and edges.
The second dataset, which will provide more non-prototypical constructions (to test the benefits and limitations of the UDS dataset), is the CHILDES Narrative English Hicks Corpus. The main page with the project description is available at https://childes.talkbank.org/access/Eng-NA/Hicks.html. This dataset contains narratives by first, second, and fifth graders, split into three genres: a factual news report, an ongoing event case, and a more embellished story. All three genres are retellings of a silent film that was shown to each student. The data were coded for forms that might mark genre differences, so there are many annotation codes to work with (syntactic, event, and indexical), which I hope will lead to an interesting investigation in light of the UDS framework. The original researchers were interested in assessing differences in syntactic complexity, use of verb forms, and use of intensifiers among the three genres.
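To make the annotation codes concrete, here is a minimal sketch of how coded transcript lines could be grouped by utterance and tallied. It assumes the corpus files follow the general CHAT layout used across CHILDES (main tiers beginning with `*`, dependent tiers beginning with `%`, headers beginning with `@`). The sample utterances and the `$`-prefixed codes are invented for illustration and are not taken from the Hicks corpus, and the helper names `parse_chat` and `count_codes` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample in a CHAT-like layout. The utterances and the codes on
# the %cod tier are invented for illustration, not taken from the Hicks corpus.
SAMPLE = """\
@Begin
*CHI:\tthe boy is running to the store .
%cod:\t$PROG $EVENT
*CHI:\tthen he bought a balloon .
%cod:\t$PAST $CONN
@End
"""


def parse_chat(text):
    """Group each main-tier utterance with the dependent tiers that follow it.

    Simplification: header lines (@...) are skipped and continuation lines
    (lines starting with a tab) are ignored.
    """
    utterances = []
    for line in text.splitlines():
        if line.startswith("*"):
            # Main tier: "*SPEAKER:<tab>words"
            speaker, _, words = line[1:].partition(":")
            utterances.append(
                {"speaker": speaker, "text": words.strip(), "tiers": defaultdict(list)}
            )
        elif line.startswith("%") and utterances:
            # Dependent tier: attach its codes to the most recent utterance.
            tier, _, content = line[1:].partition(":")
            utterances[-1]["tiers"][tier].extend(content.split())
    return utterances


def count_codes(utterances, tier="cod"):
    """Count how often each code appears on the given dependent tier."""
    counts = defaultdict(int)
    for utt in utterances:
        for code in utt["tiers"].get(tier, []):
            counts[code] += 1
    return dict(counts)


utterances = parse_chat(SAMPLE)
code_counts = count_codes(utterances)
```

Counts like these, computed per genre and per grade, would be one simple way to reproduce the kind of comparison the original researchers describe before bringing UDS into the picture.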
A large part of my analysis section will be dedicated to exploring the UDS dataset and the Decomp toolkit that comes with it. The dataset is massive, and it will take some time to become familiar with it and to understand how it can be queried with the toolkit to yield the information I want to examine further. I will probably include a separate exploration section before the analysis portion in my final product.
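As a rough illustration of what querying a graph with node attributes might involve, the sketch below filters predicate nodes by an attribute value. It deliberately does not use the Decomp toolkit's own API: the graph is a plain Python dictionary standing in for one sentence graph, and the node ids, attribute names, nesting, and values are all assumptions made up for the example.

```python
# Toy stand-in for a single sentence graph: node ids map to attribute dicts.
# All ids, attribute names, and values are invented for illustration and do
# not come from the UDS dataset.
TOY_GRAPH = {
    "nodes": {
        "pred-1": {
            "type": "predicate",
            "form": "ran",
            "factuality": {"factual": {"value": 1.1, "confidence": 1.0}},
        },
        "pred-2": {
            "type": "predicate",
            "form": "buy",
            "factuality": {"factual": {"value": -0.8, "confidence": 0.9}},
        },
        "arg-1": {"type": "argument", "form": "boy"},
    },
}


def get_value(attrs, path):
    """Follow a sequence of keys into nested dicts; return None if any is missing."""
    current = attrs
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def select_nodes(graph, node_type, path, threshold):
    """Return sorted ids of nodes of node_type whose value at path exceeds threshold."""
    selected = []
    for node_id, attrs in graph["nodes"].items():
        if attrs.get("type") != node_type:
            continue
        value = get_value(attrs, path)
        if value is not None and value > threshold:
            selected.append(node_id)
    return sorted(selected)


# Predicates annotated as having happened (positive factuality value).
factual = select_nodes(
    TOY_GRAPH, "predicate", ("factuality", "factual", "value"), 0.0
)
```

In the real exploration, the toolkit's own query interface would replace these helpers. The point here is only the shape of the question: pick a node type, pick an attribute, and apply a threshold.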
The core of the analysis will examine how the CHILDES dataset aligns with the UDS framework. UDS is intended to capture the semantics of words as an average speaker would understand them. I will take the phenomena that the original researchers (who collected the narrative corpus) studied and analyze them within the framework of the UDS dataset. I think the most interesting aspects of the narrative corpus for this purpose will be the differences in "verb forms, aspectual markers, timemarkers and logical connectors," as well as the event descriptions, since the UDS dataset appears to offer detailed information about these linguistic features.
As planned, I will first need to demonstrate what the Universal Decompositional Semantics dataset looks like. Using the Decomp toolkit, I will show a sample of the dataset and the different types of queries that can be run for semantic analysis. I will also describe the CHILDES dataset, with some background on how it is annotated and what the researchers were initially trying to do with the data. I will then demonstrate how the universal decompositional semantics approach works with the CHILDES dataset: what it captures, what it fails to capture, and so on.