Presentation of the dissertation project and the research results to
date:
Motivation
The
starting point of my project is the observation that human language
processing is inherently incremental, i.e. processing starts on
partial input and an interpretation of the current input is available
at each point in time. Early approaches to natural language
processing (NLP) were based on a sequence of discrete modules, each
working independently on the complete output of the previous one. The
underlying assumption was that the problem tackled in each domain can
be solved independently without feedback from later stages in the
processing. Contrary to this assumption, human language processing
has been shown to be inherently incremental [1] and highly
interleaved [2]. Incremental processing starts before the input is
complete, generates interpretations for partial input, and integrates
additional input into the current interpretation if possible,
reanalyzing otherwise. Interleaved processing includes syntactic and
semantic expectations driving speech recognition, and contextual
knowledge driving linguistic processing. This context is especially
rich in a face-to-face communication situation, where there are
multiple communication channels in different modalities (e.g.
gestures) and a common visual context to refer to.
To approach the cognitive example given by human language processing,
we have to make sure an artificial language processing system is able
to:
- process partial input as early as possible
- provide partial analyses as early as possible
- generate expectations to be fed back to other modules like speech
  processing
- integrate expectations derived from the context (e.g. the dialog
  history, visual context or other communication channels like
  gestures)
Platform
The
platform of my research is the Weighted Constraint Dependency Grammar
(WCDG) system [3]. The WCDG system is already capable of incremental
syntax processing, i.e. it can assign a dependency structure to a
prefix of a sentence and use this partial analysis as the starting
point for analyzing an extended prefix. This reuse is faster than
starting from scratch for about 95% of the test sentences [4].
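The control flow of this incremental mode can be pictured with the
following toy sketch in Python. It is not the actual WCDG
implementation; in particular, the trivial attachment heuristic merely
stands in for the constraint optimization:

    # Toy illustration of incremental parsing with reuse of the previous
    # partial analysis. This is NOT the WCDG implementation; it only
    # demonstrates the control flow described above.

    def attach(analysis, prefix):
        """Attach the newest word of `prefix` to some earlier word.

        `analysis` maps each word index to its head index (-1 = root).
        Earlier attachments are kept; only the new word is integrated,
        which is what makes this cheaper than reparsing from scratch.
        """
        new = len(prefix) - 1
        # Trivial heuristic standing in for constraint optimization:
        # attach to the preceding word, or to the root for the first word.
        analysis[new] = new - 1 if new > 0 else -1
        return analysis

    def parse_incrementally(words):
        analysis = {}
        for i in range(len(words)):
            analysis = attach(analysis, words[:i + 1])
            yield dict(analysis)  # an interpretation exists at every step

    for step in parse_incrementally("the dog chased the cat".split()):
        print(step)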
Goal
The
goal of my project is to design and implement an architecture for
passing partial analyses and expectations between different modules
in language processing, including an interface to integrate
processing of other modalities, especially vision. Such an interface
will be bidirectional, with input from NLP guiding (visual) attention
and input from the external modality giving hints on disambiguation
in NLP. This part of the work builds on the project of Patrick
McCrae, who worked on syntactic disambiguation via context
integration. Furthermore, I am cooperating with Christopher
Baumgärtner, who is working on online context integration and
attention guidance.
To determine the requirements of such a system, I will investigate
the applications of partial analyses, especially given a visual
context. I have identified the following possible applications:
- early reference resolution
- anticipation of upcoming input
- feedback to the sender on the processing status
The possible benefits of this approach include
- improved speed and accuracy of syntax parsing
- improved performance of other modules through top-down expectations
  derived from the analysis of incoming language and from multi-modal
  context integration
- improved user acceptance in human-to-computer/robot dialog
  situations through verbal or nonverbal feedback while the utterance
  is still in progress
- the ability to reproduce psycholinguistic observations related to
  incremental language processing, such as anticipatory eye movements
  or garden pathing
Early Reference Resolution
A
partial linguistic analysis can be used to establish references to
entities from the context. There are several benefits of doing this
as early as possible:
- Determine if a nominal phrase (NP) already denotes a unique object
  from the context or if additional complements like an upcoming
  prepositional phrase (PP) are needed. In this way, ambiguities like
  the PP-attachment problem could be handled where a context is
  available (compare [5]); a sketch of such a uniqueness test follows
  after this list. If an ambiguity cannot be resolved this way, its
  early detection enables timely feedback in the form of a clarifying
  question.
- Visual focus can be shifted to referenced objects while the
  utterance is produced, thus giving feedback to the producer.
- Visual focus can be shifted to gather additional details from the
  visual context. This information can be used to check the current
  interpretation for consistency with the context and to help process
  the rest of the utterance.
- Partial instructions can be extracted and their execution possibly
  begun, which saves time and makes it possible to give feedback.
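The uniqueness test from the first point can be pictured as follows;
the scene representation and the attribute names are invented for
illustration and do not reflect an actual implementation:

    # Illustrative uniqueness check for early reference resolution.
    # The scene and its attributes are invented for this example.

    scene = [
        {"id": 1, "type": "ball", "color": "red"},
        {"id": 2, "type": "ball", "color": "blue"},
        {"id": 3, "type": "box",  "color": "red"},
    ]

    def candidates(partial_np, scene):
        """Return all scene objects compatible with the NP seen so far."""
        return [obj for obj in scene
                if all(obj.get(k) == v for k, v in partial_np.items())]

    # After "the ball ..." two referents remain: wait for more input
    # (e.g. a PP) or prepare a clarifying question.
    print(candidates({"type": "ball"}, scene))

    # After "the red ball" the referent is unique: resolve immediately.
    print(candidates({"type": "ball", "color": "red"}, scene))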
Building up Expectations
When
only parts of an input sentence are known, there are several ways to
anticipate different aspects of the upcoming input. These
expectations can be inferred linguistically from the partial input
alone or contextually, by integrating other sources. One source of
linguistic anticipation is the set of constraints violated when a
partial input is parsed with a constraint dependency grammar parser
like WCDG. These violations might hint that the sentence is incomplete
and at what kind of continuation is missing.
Upcoming
words anticipated via linguistic constraints will be mostly
underspecified. For example, a violated transitivity constraint of a
verb proposes an upcoming object, but provides no further details
about it. To enrich the anticipation, the underspecified slots can be
augmented in several ways. One approach is to add ontological aspects
to the valence information of a verb, e.g. the direct object of “to
eat” is probably edible. Context information is a rich source of
anticipation as well: partially specified events (an action and its
arguments) can be extracted from the partial input and matched with
events that are visible in the visual context or possible to execute
there, thereby identifying candidates to fill the unknown parts of
the event.
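The following toy sketch illustrates how such an underspecified slot
might be enriched; the lexicon and ontology entries are invented for
illustration:

    # Illustrative enrichment of an underspecified expectation.
    # Lexicon and ontology entries are invented for this example.

    valence = {"eat": ["subject", "object"]}        # "eat" is transitive
    selectional = {("eat", "object"): "edible"}     # ontological restriction

    def expectations(verb, filled_roles):
        """Predict placeholders for roles not yet filled by the input."""
        slots = []
        for role in valence.get(verb, []):
            if role not in filled_roles:
                slots.append({
                    "role": role,
                    # enrich the bare placeholder with ontological knowledge
                    "ontology": selectional.get((verb, role), "unrestricted"),
                })
        return slots

    # After "Peter eats ..." only the subject is filled, so an edible
    # object is anticipated and can be matched against the visual context.
    print(expectations("eat", {"subject"}))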
Anticipation
of the upcoming input has several applications in natural language
processing, including
- top-down expectations for speech recognition
- incremental syntax parsing, where upcoming words are matched to
  predicted placeholders that are already integrated into the analysis
  structure (see the sketch after this list)
- simulating anticipatory eye movements that search for possible
  upcoming referents, as has been observed in humans
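How an incoming word could be matched against such a predicted
placeholder is sketched below, continuing the invented representation
from the previous example:

    # Illustrative matching of a recognized word against a predicted
    # placeholder. Continues the invented representation from above.

    def matches(word_entry, slot):
        """Check whether a recognized word can fill a predicted slot."""
        return (slot["ontology"] == "unrestricted"
                or slot["ontology"] in word_entry["ontology"])

    apple = {"form": "apple", "ontology": {"edible", "object"}}
    slot = {"role": "object", "ontology": "edible"}
    print(matches(apple, slot))  # True: "apple" can fill the object slot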
Feedback
In face-to-face dialog between humans, feedback is given by the
listener while an utterance is produced. This feedback includes verbal
and nonverbal cues on
- whether the utterance was successfully processed up to that moment
- which entities in the visual context are related to the utterance,
  conveyed by eye movement
- ambiguities that could not be resolved, possibly by asking back.
This
kind of feedback could be provided by a robotic system, given that
its language
processor
works incrementally, with early reference resolution and a capacity
to anticipate certain aspects of the upcoming input.
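A minimal representation of such incremental feedback signals might
look as follows; the event types and their encoding are invented for
illustration:

    # Illustrative feedback events a listening system could emit while
    # an utterance is still in progress. The event types are invented.

    from dataclasses import dataclass

    @dataclass
    class Feedback:
        time: float      # position within the utterance (seconds)
        kind: str        # "ack", "gaze", or "clarify"
        payload: object  # e.g. a referent id or a clarifying question

    log = [
        Feedback(0.8, "ack", "processed so far"),             # backchannel
        Feedback(1.2, "gaze", 1),                             # look at object 1
        Feedback(2.0, "clarify", "Which ball do you mean?"),  # unresolved
    ]
    print(log)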
Current Status of the Project
The initial conception and literature review stage is finished, and
the project is currently in the implementation phase for a first
prototype.
The next steps include
• designing and implementing means of integrating feedback from later
  stages of processing into WCDG
• combining the context integration architecture designed by Patrick
  McCrae with the incremental processing mode of WCDG
• selecting specific psycholinguistic observations to reproduce
Connections of my PhD project to other CINACS projects
Multi-modal
context integration into language processing is the research topic of
two other CINACS PhD students, Patrick McCrae and Christopher
Baumgärtner, with whom I cooperate. In his project "A Model for the
influence of Cross-Modal Context upon Syntactic Parsing", P. McCrae
designed a system for disambiguating syntactic ambiguities via
semantic knowledge gained from visual context. In my project I am
expanding on this foundation, in cooperation with C. Baumgärtner, to
permit online context integration during incremental language
processing.
With the project of Dominik Off, "Cross-Modal Enhanced Memory for
Mobile Service Robots", my project shares the common scenario of
giving
instructions to a service robot in natural language. This scenario is
interesting for my research, as it is rich in references to objects
in the surroundings and possible actions therein. A long-term goal is
to integrate the planning and language processing components on a
robot platform like TASER.
The Tsinghua CINACS students Wei Qiao and Kaixu Zhang work on the
problem of Chinese word segmentation. I discussed with them the
possibilities of integrating word segmentation with syntax parsing
and the similarities to word segmentation in speech recognition. The
WCDG parser used in my project is already capable of working with
word lattices, the ambiguity structure found in word segmentation,
and thus shows promise as a common platform for research in this
direction. I hope to be able to intensify this cooperation in the
future.
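A word lattice can be pictured as a set of edges over character (or
time) positions, where every path through the lattice is one
segmentation hypothesis; the following toy example is constructed for
illustration:

    # Illustrative word lattice: each edge is a segmentation hypothesis
    # spanning character positions [start, end); every path from the
    # first to the last position is one segmentation. A parser can then
    # choose among paths using syntactic constraints.

    edges = [
        (0, 2, "AB"),  # one two-character word covering positions 0-2 ...
        (0, 1, "A"),   # ... or two one-character words
        (1, 2, "B"),
    ]

    def segmentations(edges, start, end, path=()):
        """Enumerate all paths (segmentations) through the lattice."""
        if start == end:
            yield path
        for s, e, word in edges:
            if s == start:
                yield from segmentations(edges, e, end, path + (word,))

    print(list(segmentations(edges, 0, 2)))  # [('AB',), ('A', 'B')]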
References
[1] R. P. G. van Gompel and M. J. Pickering. Syntactic parsing. In
The Oxford Handbook of Psycholinguistics, 2007.
[2] M. Mayberry, M. W. Crocker and P. Knoeferle. Learning to Attend:
A Connectionist Model of the Coordinated Interplay of Utterance,
Visual Context, and World Knowledge. Cognitive Science, 33:449-496,
2008.
[3] K. A. Foth. Hybrid Methods of Natural Language Analysis. PhD
thesis, Universität Hamburg, Department Informatik, 2007.
[4] N. Beuck. Inkrementelles Parsing mit
Constraint-Dependency-Grammatiken. Diploma thesis, Universität
Hamburg, 2009.
[5] T. Brick and M. Scheutz. Incremental natural language processing
for HRI. In Proc. HRI 2007, the ACM/IEEE International Conference on
Human-Robot Interaction, Washington DC, USA, March 9-11, 2007.