Yanjun Gao's PhD Thesis Defense
Title: Analysis of Text To Identify, Represent, And Group Distinct Propositions
Atomic propositions are the semantic building blocks of discursive text, and are organized into simple or complex sentences with diverse syntactic structures. They are critical for many NLP applications. In this thesis, we study the identification, representation, and grouping of propositions for text analysis. At the beginning of the thesis, we will introduce NLP resources for identifying and representing propositions, which include a newly annotated corpus and two corpora modified from publicly released datasets. We will also present educational data collection and annotation for the purpose of developing educational technologies to analyze and assess student writing. Later we will present the NLP contributions of this thesis, which include EDUA, an algorithm to group propositions from different texts that mean essentially the same thing; ABCD, a neural model that learns edit operations to identify and extract propositions from complex sentences; and DAnCE, a neural model that learns clause-based representations that perform well in discourse tasks such as predicting connectives to join two sentences.
EDUA groups propositions extracted from different reference summaries of the same source text into distinct expressions of the same idea. The input to EDUA is a graph of semantic vectors of propositions extracted from input sentences. EDUA is a key component of PyrEval, a tool for automatic content assessment of human or machine summaries. ABCD successfully identifies propositions in complex sentences, where each proposition corresponds to a distinct clause in the original sentence. We show that ABCD achieves competitive results compared to a state-of-the-art encoder-decoder model, which performs poorly when the input sentences cover a wider range of linguistic phenomena. DAnCE takes input sentences and their dependency and constituency parses, and converts each clause into a dependency-anchor graph that highlights the verb phrase and the subject. It produces clause embeddings using a graph convolution layer that aggregates features of the verb phrase and the subject. On connective prediction, DAnCE performs better than two Tree-LSTM variants that also take parses as input. We also present DAnCE++, an extension of DAnCE for complex sentence representation. Discourse coherence pertains to a wide range of semantic connections that link successive sentences to form coherent discourse. DAnCE++ shows great potential for producing coherence-aware sentence representations, as indicated by its good performance on both connective prediction and sentence ordering tasks. Together, these three NLP algorithms address problems that are fundamental to automated analysis of discourse.
Zoom Link: https://psu.zoom.us/j/95990301287