1997 — 1999 |
Saltzman, Elliot L |
R01
Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies. |
Task Dynamics--Goals and Time in Speech Production
behavioral/social science research; model design/development
|
0.958 |
2001 — 2004 |
Saltzman, Elliot L |
R01
Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies. |
Temporal Structuring of Speech and Manual Gestures
The long-term objective of this program is to understand the temporal structuring of speech and manual gestures. A major hypothesis in the present research program is that such an understanding requires a consideration of the dynamics that underlie the spatiotemporal patterning of activity across the task-relevant elements of a speaker's or performer's behavioral repertoire. Our specific aims are to investigate these dynamics, both empirically and theoretically, by focusing on three major temporal phenomena in both speech production and the production of simple patterns of finger-tapping "gestures." Knowledge gained from each of these areas will have specific implications for further developments of our task-dynamic model of speech production and its generalization to nonspeech behaviors. The first phenomenon is that local variations in the duration of single elements in a spoken utterance or manual sequence, induced through specification of the element's intrinsic duration or emphatic stress, have systematic but poorly understood effects on the overall global duration of the sequence. Understanding these effects will provide constraints for modeling interactions between central "clock-like" processes and peripheral "motoric" events. The second phenomenon is the decrease in pattern stability that occurs during the production of "tongue twisters" and comparably difficult manual patterns. Of particular interest is the way that instability develops over time. Understanding this effect will provide constraints for modeling the interactions (coupling functions) among simultaneously active gestural units. The third phenomenon is the systematic "signature" of high-level structure (hierarchical prosodic structure in speech; hierarchical rhythmic structure in unimanual tapping) on the kinematics of speech and the kinematics and kinetics of manual activity. Understanding this effect will provide constraints for modeling the means by which high-level structure expressively modulates ongoing movements, and how this high-level structure is embodied in central clocks or timekeepers. Overall, the research program will provide valuable data and theoretical understanding of the manner in which human skilled activities are structured in time and, hence, will provide valuable clues to understanding movement disorders, such as Parkinson's disease, in which compromised control in the temporal domain is central to the disorder's presentation.
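To make the task-dynamic framing concrete, the sketch below simulates a single speech gesture as a critically damped point attractor that drives a tract variable toward its target, the core equation of the task-dynamic model. It is a minimal sketch, not the authors' implementation: the parameter values, units, and the function name simulate_gesture are illustrative assumptions rather than the project's actual settings.

```python
# A minimal sketch, not the authors' implementation: one task-dynamic gesture
# modeled as a critically damped point attractor that drives a tract variable
# (e.g., lip aperture) toward its target. The stiffness, target, and initial
# value below are illustrative numbers, not fitted model parameters.
import numpy as np

def simulate_gesture(z0, target, k, duration, dt=0.001):
    """Integrate z'' = -k*(z - target) - b*z' with critical damping (unit mass)."""
    b = 2.0 * np.sqrt(k)                  # critical damping coefficient
    z, v = z0, 0.0                        # tract-variable value and velocity
    trajectory = []
    for _ in range(int(duration / dt)):
        a = -k * (z - target) - b * v     # point-attractor acceleration
        v += a * dt                       # simple Euler integration step
        z += v * dt
        trajectory.append(z)
    return np.array(trajectory)

# Example: a lip-aperture closing gesture from 10 mm toward 0 mm over ~200 ms.
traj = simulate_gesture(z0=10.0, target=0.0, k=800.0, duration=0.2)
print(f"aperture after 200 ms: {traj[-1]:.2f} mm")
```

In such a sketch, the local duration or stress manipulations of the first aim could be mimicked by varying a gesture's stiffness or activation interval, and the instabilities of the second aim by altering the coupling functions linking several simultaneously active gestures.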
|
0.958 |
2007 — 2011 |
Saltzman, Elliot |
N/A
Activity Code Description: No activity code was retrieved. |
Collaborative Research: Landmark-Based Robust Speech Recognition Using Prosody-Guided Models of Speech @ Trustees of Boston University
Proposal ID 0703859 | Date 04/11/2007
Despite great strides in the development of automatic speech recognition technology, we do not yet have a system with performance comparable to humans in automatically transcribing unrestricted conversational speech, representing many speakers and dialects, and embedded in adverse acoustic environments. This approach applies new high-dimensional machine learning techniques, constrained by empirical and theoretical studies of speech production and perception, to learn from data the information structures that human listeners extract from speech. To do this, we will develop large-vocabulary, psychologically realistic models of speech acoustics, pronunciation variability, prosody, and syntax by deriving knowledge representations that reflect those proposed for human speech production and speech perception, using machine learning techniques to adjust the parameters of all knowledge representations simultaneously in order to minimize the structural risk of the recognizer. The team will develop nonlinear acoustic landmark detectors and pattern classifiers that integrate auditory-based signal processing and acoustic-phonetic processing, are invariant to noise, changes in speaker characteristics, and reverberation, and can be learned in a semi-supervised fashion from labeled and unlabeled data. In addition, they will use variable frame rate analysis, which will allow for multi-resolution analysis, as well as implement lexical access based on gesture, using a variety of training data. The work will improve communication and collaboration between people and machines and also improve understanding of how humans produce and perceive speech. The work brings together a team of experts in speech processing, acoustic phonetics, prosody, gestural phonology, statistical pattern matching, language modeling, and speech perception, with faculty across engineering, computer science, and linguistics. The project will support and engage students and postdoctoral fellows in speech modeling and algorithm development. Finally, the proposed work will result in a set of databases and tools that will be disseminated to serve the research and education community at large.
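As a rough illustration of the landmark idea, and only that, the sketch below flags candidate acoustic landmarks at frames where high-band energy changes abruptly. The band edges, frame size, and 9 dB threshold are assumed values chosen for demonstration; they do not describe the project's detectors.

```python
# An illustrative sketch of one ingredient of landmark-based recognition:
# flagging abrupt rises or falls in high-band energy as candidate acoustic
# landmarks. The band edges, frame size, and 9 dB threshold are assumptions
# chosen for demonstration, not the project's actual detectors.
import numpy as np

def band_energy_db(signal, sr, lo, hi, frame=0.01):
    """Short-time energy (in dB) of `signal` restricted to the band [lo, hi) Hz."""
    hop = int(frame * sr)
    energies = []
    for start in range(0, len(signal) - hop, hop):
        windowed = signal[start:start + hop] * np.hanning(hop)
        spectrum = np.abs(np.fft.rfft(windowed)) ** 2
        freqs = np.fft.rfftfreq(hop, 1.0 / sr)
        band = spectrum[(freqs >= lo) & (freqs < hi)]
        energies.append(10.0 * np.log10(band.sum() + 1e-12))
    return np.array(energies)

def candidate_landmarks(signal, sr, threshold_db=9.0):
    """Frame indices where high-band energy jumps between adjacent 10 ms frames."""
    energy = band_energy_db(signal, sr, lo=1200.0, hi=7000.0)
    return np.where(np.abs(np.diff(energy)) > threshold_db)[0]

# Example: silence followed by noise yields a candidate landmark at the onset.
sr = 16000
signal = np.concatenate([np.zeros(sr // 2), 0.1 * np.random.randn(sr // 2)])
print(candidate_landmarks(signal, sr))
```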
|
1 |
2012 — 2016 |
Saltzman, Elliot |
N/A
Activity Code Description: No activity code was retrieved. |
RI: Medium: Collaborative Research: Multilingual Gestural Models For Robust Language-Independent Speech Recognition @ Trustees of Boston University
Current state-of-the-art automatic speech recognition (ASR) systems typically model speech as a string of acoustically defined phones and use contextualized phone units, such as tri-phones or quin-phones, to model contextual influences due to coarticulation. Such acoustic models may suffer from data sparsity and may fail to capture coarticulation appropriately because the span of a tri- or quin-phone's contextual influence is not flexible. In a small-vocabulary context, however, research has shown that ASR systems that estimate articulatory gestures from the acoustics and incorporate these gestures in the ASR process can better model coarticulation and are more robust to noise. The current project investigates the use of estimated articulatory gestures in large-vocabulary automatic speech recognition. Gestural representations of the speech signal are initially created from the acoustic waveform using the Task Dynamic model of speech production. These data are then used to train automatic models for articulatory gesture recognition, where the articulatory gestures serve as subword units in the gesture-based ASR system. The main goal of the proposed work is to evaluate the performance of a large-vocabulary gesture-based ASR system using American English (AE). The gesture-based system will be compared to a set of competitive state-of-the-art recognition systems in terms of word and phone recognition accuracies, under both clean and noisy acoustic background conditions.
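As an informal sketch of what "gestures as subword units" can look like, the code below encodes a word as a small gestural score of activation intervals over tract variables and renders it as frame-level features for a recognizer. The tract-variable inventory, the timings, and the example word are hypothetical illustrations, not output of the project's Task Dynamic pipeline.

```python
# An informal sketch (assumptions only, not the project's system) of gestures
# as subword units: a word is represented as a small "gestural score" of
# activation intervals over tract variables, which can be rendered as
# frame-level features for a statistical recognizer.
from dataclasses import dataclass
import numpy as np

# Hypothetical tract-variable inventory: lip aperture, tongue-body and
# tongue-tip constriction degree, velic aperture, glottal aperture.
TRACT_VARIABLES = ["LA", "TBCD", "TTCD", "VEL", "GLO"]

@dataclass
class Gesture:
    tract_variable: str   # which constriction the gesture controls
    onset: float          # activation start, in seconds
    offset: float         # activation end, in seconds

# Hand-written, illustrative gestural score for the word "bad":
# bilabial closure, tongue-body vowel gesture, then alveolar closure.
SCORE_BAD = [
    Gesture("LA",   0.00, 0.12),
    Gesture("TBCD", 0.05, 0.30),
    Gesture("TTCD", 0.25, 0.38),
]

def activation_matrix(score, duration, frame=0.01):
    """Binary (tract variable x frame) activation pattern for one word."""
    n_frames = int(round(duration / frame))
    act = np.zeros((len(TRACT_VARIABLES), n_frames))
    for g in score:
        row = TRACT_VARIABLES.index(g.tract_variable)
        act[row, int(round(g.onset / frame)):int(round(g.offset / frame))] = 1.0
    return act

features = activation_matrix(SCORE_BAD, duration=0.40)
print(features.shape)  # (5, 40): one row per tract variable, one column per 10 ms frame
```

Because the activation intervals overlap, coarticulation is represented directly in the units themselves rather than through fixed tri-phone or quin-phone context windows.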
The broad impact of this research is threefold: (1) the creation of a large-vocabulary American English (AE) speech database containing acoustic waveforms and their articulatory representations, (2) the introduction of novel machine learning techniques to model articulatory representations from acoustic waveforms, and (3) the development of a large-vocabulary ASR system that uses articulatory representations as subword units. The robust and accurate ASR system for AE resulting from the proposed project will deal effectively with speech variability, thereby significantly enhancing communication and collaboration between people and machines in AE, with the promise of generalizing the method to multiple languages. The knowledge gained and the systems developed will contribute to the broad application of articulatory features in speech processing, and will have the potential to transform the fields of ASR, speech-mediated person-machine interaction, and automatic translation among languages. The interdisciplinary collaboration will facilitate a cross-disciplinary learning environment for the participating faculty, researchers, graduate students, and undergraduate students. Thus, this collaboration will result in the broader impact of enhanced training in speech modeling and algorithm development. Finally, the proposed work will result in a set of databases and tools that will be disseminated to serve the research and education community at large.
|
1 |
2016 — 2020 |
Saltzman, Elliot |
N/A
Activity Code Description: No activity code was retrieved. |
Collaborative Research: Prosodic Structure: An Integrated Empirical and Modeling Investigation @ Trustees of Boston University
This project examines how the prosodic structure of language shapes the articulation of spoken utterances. Speaking is a complex, uniquely human ability that relies on precisely coordinated movements of the speech organs (tongue, lips, jaw, soft palate, and larynx) and respiratory system. These movements produce sounds that listeners perceive and that convey not only the 'dictionary' content of the utterance, but also its prosodic content. Prosody organizes phonological forms into successively larger units or phrases, and renders certain syllables, words and phrases more 'prominent' (perceptually or rhythmically important) to the listener. Understanding the processes that shape prosodic structure and how these processes act to transmit this structure through coordination of the speech organs has profound implications for advancing our understanding of language processing and communication disorders, for improving speech technology, and for providing insights regarding the more general relationship between linguistic and cognitive operations.
The central hypothesis examined in this collaborative project is that the control processes governing temporal and tonal structure are intimately linked with one another. This linkage is viewed as emerging from the underlying motor control dynamics that coordinate the motions of the speech organs and guide speech production. In order to investigate this hypothesis, two additional open questions are addressed concerning the nature of the units that exist at different prosodic and prominence levels, and the relationships among syllables, prosodic feet, words and phrases: "What are the structural and articulatory-acoustic properties of prosodic feet?"; and "How can the relationships among the levels be best understood?" These questions are investigated through a series of articulatory studies--using electromagnetic articulography--and acoustic measurements that will examine: 1) how temporal and tonal properties interact at phrasal boundaries, 2) how temporal and tonal properties interact within and across all levels of prominence, and 3) how motor control dynamics shape the details of articulatory-acoustic structure at the foot level. These studies will be complemented by a series of computational simulations, which will serve the dual purpose of testing the project's hypotheses and guiding further developments of the prosodic component of the group's Task-Dynamics model of speech production.
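The coupled-oscillator flavor of such simulations can be suggested with a minimal sketch: two phase oscillators standing in for a foot-level and a syllable-level planning clock, coupled in a 1:2 relation so that syllable timing locks to the foot despite detuned natural frequencies. The frequencies, coupling strength, and integration scheme below are illustrative assumptions, not the project's Task-Dynamics implementation.

```python
# A minimal sketch, under simplifying assumptions, of a coupled-oscillator
# style of timing simulation: a foot-level and a syllable-level planning
# oscillator coupled in a 1:2 relation, so the syllable's timing phase-locks
# to the foot despite detuned natural frequencies. The frequencies, coupling
# strength, and integration scheme are illustrative, not the project's model.
import numpy as np

def simulate_relative_phase(duration=4.0, dt=0.001, coupling=5.0):
    """Integrate two phase oscillators with 1:2 coupling; return syll - 2*foot over time."""
    foot, syll = 0.0, 0.8                                # initial phases (radians)
    w_foot = 2 * np.pi * 2.0                             # ~2 feet per second
    w_syll = 2 * np.pi * 4.3                             # ~4.3 syllables per second (detuned from 4.0)
    rel = np.empty(int(duration / dt))
    for i in range(rel.size):
        error = np.sin(syll - 2 * foot)                  # 1:2 phase-locking error term
        foot += dt * (w_foot + coupling * error)         # foot is pulled toward the syllable
        syll += dt * (w_syll - coupling * error)         # syllable is pulled toward twice the foot
        rel[i] = (syll - 2 * foot) % (2 * np.pi)
    return rel

rel = simulate_relative_phase()
print(f"relative phase settles near {rel[-1]:.2f} rad")  # stable lock despite the detuning
```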
|
1 |