Prepared on October 15, 2011 for an internal review at USC.
My background and early research are in linguistics and cognitive science. Since 2004 I have been working in computational linguistics, mostly on corpus creation and the evaluation of dialogue systems. My research strives to integrate linguistic knowledge and psychological plausibility into computational models of natural language processing, for example by annotating complex and ambiguous relations between anaphors and antecedents, or by incorporating phonetic knowledge into natural language understanding processes (see details below). I have a specific interest in experimental methodology, and have developed novel means of data analysis to gain insight into the performance of systems that deliver “soft” outcomes such as user engagement. My main strands of research are detailed in the following sections.
Formal semantics of natural language
My research until 2004 was in theoretical linguistics, mostly on the formal semantics of natural language. My Ph.D. thesis at Rutgers University developed a compositional semantics for parts of words, in order to explain the behavior of constructions such as ortho and periodontists. The account decomposes the semantics of such words into a function-argument structure assigned to their prosodic parts. This representation retains the standard compositional semantics of the conjunction and rather than treating the constructions as exceptional cases; it also explains some puzzles about the behavior of plurality in such constructions (the expression is not equivalent to orthodontists and periodontists). My postdoctoral work at the Technion—Israel Institute of Technology concentrated on temporal quantification, developing a theory of natural language quantifiers in temporal clauses, as in the sentence a secretary cried after each executive resigned (abstract). This sentence has an interpretation that associates a different secretary with each executive, whereas according to standard linguistic theory, temporal clauses should be “islands” which restrict the scope of quantifiers. The solution to the puzzle lies in treating all temporal modifiers as generalized quantifiers, following Pratt and Francez 2001.
During my graduate career at Rutgers I also did research in cognitive science, developing a model that accounts for certain preferences in human sentence processing, for example the tendency to interpret the interrogative which dog in the partial utterance Which dog did John give… as a direct object of the verb rather than an indirect object (unpublished manuscript [PDF]; results cited in Stevenson and Smolensky 2006). I maintain contact with the linguistics community (e.g. reviewing for conferences and attending talks).
Corpus annotation for anaphora resolution
Anaphora is a linguistic process whereby the meaning or referent of a linguistic expression depends on a previous expression: for example, in a text like John bought a fish. It weighed one pound, the pronoun it is an anaphor whose referent depends on the expression a fish. Anaphora resolution is the computational task of identifying the referents of anaphoric expressions, often construed more narrowly as identifying textual antecedents, that is, identifying a preceding expression on which the anaphor depends. In the above example, anaphora resolution needs to recognize that the word it is an anaphor and link it to the antecedent a fish.
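To make the task concrete, here is a deliberately naive sketch (not the ARRAU annotation scheme or any published resolver): a "most recent noun phrase" baseline that links each pronoun to the last determiner-plus-noun sequence seen. The tokenization and pronoun list are simplifying assumptions.

```python
def resolve_pronouns(text):
    """Toy anaphora-resolution baseline: link each pronoun to the most
    recently mentioned noun phrase (approximated as determiner + noun)."""
    tokens = text.split()
    antecedents = []   # noun phrases seen so far, most recent last
    links = {}         # token index of pronoun -> antecedent string
    for i, tok in enumerate(tokens):
        word = tok.strip(".,").lower()
        if word in {"it", "they", "he", "she"} and antecedents:
            links[i] = antecedents[-1]
        elif i > 0 and tokens[i - 1].lower() in {"a", "an", "the"}:
            antecedents.append(tokens[i - 1] + " " + tok.strip(".,"))
    return links

print(resolve_pronouns("John bought a fish. It weighed one pound."))
# → {4: 'a fish'}
```

A baseline this crude fails immediately on ambiguous antecedents and on event anaphora, which is precisely the kind of phenomenon the corpus work below is designed to capture.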
Like many computational linguistic tasks, current methods of anaphora resolution depend on text corpora annotated with anaphor-antecedent relations, both for supervised machine learning and for evaluation of other algorithms. My postdoctoral work at the University of Essex centered on creating such an annotated corpus – the ARRAU corpus (abstract). This work involved an extensive series of experiments testing the feasibility of schemes for annotating anaphoric relations in transcripts of human dialogue and in newspaper text. The resulting corpus captures not only simple anaphoric relations, as in previous corpora which concentrated on expressions referring to specific named entities, but also anaphoric expressions whose antecedent is ambiguous (abstract), and ones whose referent is an event rather than an object, such as the reference of the demonstrative pronoun that in the sentence So we ship one boxcar of oranges to Elmira, and that takes another two hours (abstract).
Reliability of human annotation
A general problem with human-annotated corpora, needed for supervised machine learning and for evaluation, is ensuring the quality of annotations. A standard method for assessing quality is measuring agreement between independent annotators: if agreement is high, then we have reason to believe that annotators are applying the guidelines consistently, based on a shared understanding of the task and the annotated material. As part of my work on creating the ARRAU corpus at Essex, I engaged in an extensive study of the theory and practice of measuring inter-coder agreement, surveying existing coefficients for measuring agreement, examining the application of such coefficients to a variety of annotation tasks in computational linguistics, and developing specific extensions and guidelines for the use and interpretation of such measures as applied to the annotation of anaphoric relations. The results were published in Computational Linguistics (abstract), the leading journal of the ACL, prompting the journal to change its policy about publishing survey articles.
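As an illustration of the chance-corrected coefficients such a study deals with, here is a minimal computation of Cohen's kappa, one of the standard two-coder measures (the anaphora work required extensions well beyond it; the labels below are invented):

```python
from collections import Counter

def cohen_kappa(coder1, coder2):
    """Cohen's kappa: agreement between two coders, corrected for chance."""
    n = len(coder1)
    # Observed agreement: fraction of items the two coders label identically.
    p_o = sum(a == b for a, b in zip(coder1, coder2)) / n
    # Expected agreement: probability of matching by chance, given each
    # coder's own label distribution.
    c1, c2 = Counter(coder1), Counter(coder2)
    p_e = sum(c1[label] * c2[label] for label in c1) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two annotators marking four expressions as anaphoric ('a') or not ('n').
print(cohen_kappa(["a", "a", "n", "n"], ["a", "n", "n", "n"]))  # → 0.5
```

Note that the raw observed agreement here is 0.75, but kappa discounts the agreement the coders would reach by guessing, which is why such coefficients, rather than raw percentages, are the standard report.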
Evaluation of conversational dialogue systems
Since coming to ICT in 2007 I have worked extensively on corpora and evaluation for implemented conversational dialogue systems. A typical dialogue system consists of a set of components, each of which performs a specified Natural Language Processing task: for example, a speech recognizer transforms sound from a speaker’s voice into a text string; a natural language understanding component interprets text coming from a user; a dialogue manager decides what the system should say and when; a natural language generation component expresses a system’s intention as a string of words; and a speech synthesizer transforms a string of words into an audible utterance. Such components are fairly well understood, and the criteria for their evaluation are well established – components are typically evaluated on a target set of predefined input-output relations, to which a metric like accuracy or precision/recall is applied. While suitable for component evaluation, this method is usually not appropriate for end-to-end evaluation of a complete dialogue system, because dialogue is highly context-dependent: a system reaction to a particular user action may be appropriate in some dialogue contexts but inappropriate in others, so evaluating a system against a prespecified target dialogue will not work if an unexpected system action changes the context for subsequent utterances. Additionally, conversational dialogue systems often have a primary goal of engaging the user or teaching a conversational skill, with the information goals taking a secondary role, so evaluation should emphasize dialogue coherence and user engagement.
To deal with these challenges I developed a method which I dubbed semi-formal evaluation (abstract). The method measures the “soft” aspects of dialogue coherence by collecting user ratings of the utterances in context, employing inter-annotator reliability measures to isolate the parts where judgments are clear from those where they are not. The ratings are combined with “hard” data from within the system, such as response classes, response frequencies, and internal system confidence, in order to achieve a better understanding of where the system performs well and where it fails. Similar techniques for the analysis of “soft” data were used in assessing the applicability and limits of schemes for representing dialogue acts in conversational characters (abstract, abstract).
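The details of the procedure are in the cited paper; the core move of using rater agreement to separate clear judgments from unclear ones can be sketched as follows (the threshold, labels, and data are illustrative assumptions, not values from the paper):

```python
from collections import Counter

def split_by_agreement(ratings, threshold=0.8):
    """Partition rated items into 'clear' (raters largely agree on a label)
    and 'unclear' (no label reaches the agreement threshold).

    ratings: dict mapping item id -> list of labels from independent raters.
    """
    clear, unclear = {}, {}
    for item, labels in ratings.items():
        majority_label, count = Counter(labels).most_common(1)[0]
        bucket = clear if count / len(labels) >= threshold else unclear
        bucket[item] = majority_label
    return clear, unclear

# Hypothetical per-utterance coherence ratings from five judges.
ratings = {
    "utt1": ["good", "good", "good", "good", "bad"],
    "utt2": ["good", "bad", "bad", "good", "good"],
}
print(split_by_agreement(ratings))  # → ({'utt1': 'good'}, {'utt2': 'good'})
```

Items in the "clear" partition can then be cross-tabulated against hard system-internal data, while the "unclear" partition flags dialogue contexts where even humans cannot judge coherence reliably.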
Applied natural language processing
Some of my recent work falls under the general heading of applied natural language processing, that is, research into ways to improve the performance of deployed applications that use natural language processing components. One such effort is phonetically aware natural language understanding, developed together with William Yang Wang, a student intern at ICT in the summer of 2010 (abstract). The standard architecture for interpreting human speech uses a speech recognizer to transform speech sounds into text, which is then interpreted by a natural language understanding (NLU) component. Our work shows that adding linguistic knowledge to the NLU in the form of a phonetic dictionary allows better recovery from certain speech recognition errors, where the recognizer outputs a word that is phonetically similar to the one uttered by the user but distinct in meaning. This work was a finalist for the best paper award at FLAIRS-24 (9 finalists out of 179 submissions). This research has also led to a new method of quantifying the tradeoff between misunderstanding and non-understanding in dialogue systems (abstract).
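The published method is described in the cited paper; the underlying idea of using phonetic similarity to recover from recognition errors can be sketched with a Levenshtein distance over phone sequences. The dictionary entries and word choices below are illustrative assumptions, not the actual resource or examples from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

# Illustrative ARPAbet-style phone dictionary (hypothetical entries).
PHONES = {
    "marines":  ["M", "ER", "IY", "N", "Z"],
    "machines": ["M", "AH", "SH", "IY", "N", "Z"],
    "latrines": ["L", "AH", "T", "R", "IY", "N", "Z"],
}

def phonetically_closest(asr_word, vocabulary):
    """Map a misrecognized word to the in-vocabulary word whose
    pronunciation is nearest in phone-level edit distance."""
    return min(vocabulary, key=lambda w: edit_distance(PHONES[asr_word], PHONES[w]))

print(phonetically_closest("marines", ["latrines", "machines"]))  # → machines
```

The point of the sketch is that a comparison at the phone level, rather than the word level, lets the NLU treat a recognition error as a near miss instead of an unrelated word.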
Another applied effort is the study of question generation as a means for creating knowledge bases for virtual characters, developed with Grace Chen and Emma Tosch, ICT student interns in the summer of 2010 (abstract). Many conversational systems developed at ICT use a training database of linked questions and answers, from which a system learns to find an appropriate response to novel inputs; these training databases are authored by hand, which presents a substantial bottleneck for development. Our experiments show that a training database, created automatically by applying existing question generation tools to encyclopedia texts, results in systems which give an appropriate response about half the time. A follow-up study with Elnaz Nouri, a student at USC, showed that adding an automatically generated question-answer database to an existing, hand-authored database results in similar performance on questions relating to the added topic, though with some degradation of the system’s ability to appropriately respond to questions that the original system was able to handle (abstract).
In the near future I see myself concentrating on the latter two research thrusts: developing rigorous evaluation methods for “soft” data, and pursuing innovative applications of natural language processing that can help with the overall development of conversational dialogue systems. The applied work also leads to theoretical questions: in the phonetically aware natural language understanding project, the integration of phonetic knowledge with contextual meaning retrieval was treated as an engineering problem, but there is a parallel debate on how context affects lexical access in human language processing. I believe we should strive to integrate these observations, working toward computational models that are also linguistically adequate and psychologically plausible. I am also seeking to incorporate more detailed linguistic representations into natural language processing components, and to work on conversational systems in languages other than English. The latter direction will initially build on a previous effort to develop a conversational system that speaks Pashto, a language of Afghanistan and Pakistan (abstract).