2007 — 2011
Leen, Todd (co-PI); Kain, Alexander
HCC: High-Quality Compression, Enhancement, and Personalization of Text-to-Speech Voices @ Oregon Health and Science University
The vast variability of the human speech signal remains a central challenge for Text-to-Speech (TTS) systems. The objective of this research is to develop TTS technologies that focus on eliminating concatenation errors and on accurate speech modification in the areas of coarticulation, degree of articulation, prosodic effects, and speaker characteristics. The investigators are exploring an asynchronous interpolation model (AIM), which promises to provide high-quality and flexible TTS. The core idea of AIM is to represent a short region of speech as a composition of several types of features called streams. Each stream is computed by asynchronous interpolation of basis vectors. Each basis vector is associated with a particular phoneme, allophone, or more specialized unit. Thus, the speech region is described by the varying degrees of influence of several types of preceding and following acoustic features. Using AIM, the investigators are also developing methods to optimally compress the acoustic inventories of TTS systems, given a size or a quality constraint, and to adapt the system to a new voice, given a few training samples. The system under investigation is a hybrid of traditional concatenative and formant-based synthesis that combines the advantages of both, resulting in a high-quality, optimized TTS system with voice adaptation capabilities. TTS has generally recognized societal benefits for universal access, education, and information access by voice. Our research will make it possible, for example, to build personalized TTS systems for individuals with speech disorders who can only intermittently produce normal speech sounds.
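The abstract describes AIM only at a high level. The sketch below is a minimal Python illustration of the asynchronous-interpolation idea: each feature stream transitions between two unit basis vectors on its own schedule. The function names, feature shapes, and sigmoid weight form are assumptions made for illustration, not the project's actual model.

```python
import numpy as np

def sigmoid_weights(num_frames, midpoint, slope):
    """Smooth 0-to-1 transition weights over the frames of a speech region."""
    t = np.arange(num_frames)
    return 1.0 / (1.0 + np.exp(-slope * (t - midpoint)))

def interpolate_stream(basis_left, basis_right, num_frames, midpoint, slope):
    """Interpolate one feature stream between two unit basis vectors.

    Each stream gets its own (midpoint, slope), so streams transition
    asynchronously -- the core idea of AIM as described above.
    """
    w = sigmoid_weights(num_frames, midpoint, slope)          # shape (T,)
    # Each frame is a convex combination of the two basis vectors.
    return (1 - w)[:, None] * basis_left + w[:, None] * basis_right

# Hypothetical /a/-to-/i/ region: spectral and energy streams whose
# transitions are deliberately offset in time (i.e., asynchronous).
spec_a, spec_i = np.random.randn(12), np.random.randn(12)    # stand-in bases
spectrum = interpolate_stream(spec_a, spec_i, 20, midpoint=8, slope=0.8)
energy = interpolate_stream(np.array([1.0]), np.array([0.2]), 20,
                            midpoint=12, slope=1.2)
```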
2009 — 2013
Shafran, Izhak (co-PI); Song, Xubo; Kain, Alexander; Van Santen, Jan; Black, Lois
HCC: Medium: Automatic Detection of Atypical Patterns in Cross-Modal Affect @ Oregon Health and Science University
"This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5)."
The expression of affect in face-to-face situations requires the ability to generate a complex, coordinated, cross-modal affective signal, having gesture, facial expression, vocal prosody, and language content modalities. This ability is compromised in neurological disorders such as Parkinson's disease and autism spectrum disorder (ASD). The PI's long-term goal is to build computer-based, interactive, agent-based systems for remediation of poor affect communication and for diagnosis of the underlying neurological disorders based on analysis of affective signals. A requirement for such systems is technology to detect atypical patterns in affective signals. The objective of this project is to develop that technology. Toward that end the PI will develop a play situation for eliciting affect and will collect audio-visual data from approximately 60 children between 4 and 7 years of age, half of them with ASD and the other half constituting a control group of typically developing children. The PI will label the data on relevant affective dimensions, will develop algorithms for the analysis of affective incongruity, and will then test the algorithms against the labeled data in order to determine their ability to differentiate between ASD and typical development. While automatic methods for cross-modal recognition of discrete affect classes have already yielded promising results, automatic detection and quantification of atypical patterns in affective signals, and the ability to do so in semi-natural interactive situations, are unexplored territory. The PI expects this research will lead to new methods for affect recognition based on facial affective features (with special emphasis on facial frontalization algorithms and on modeling of facial expressive dynamics), vocal affective features, and lexical affective features, as well as to new methods for automated measurement of cross-modal affective incongruity.
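The abstract does not specify how cross-modal incongruity would be quantified. As one simple illustration of what such a measure could look like, the sketch below scores disagreement between time-aligned per-frame affect estimates from independent facial and vocal recognizers; the correlation-based score is an assumption chosen for simplicity, not the project's published method.

```python
import numpy as np

def incongruity(face_affect, voice_affect):
    """Cross-modal affective incongruity between two time-aligned,
    per-frame affect estimates (e.g., valence) from independent facial
    and vocal recognizers. Low correlation -> high incongruity, in [0, 2]."""
    face = (face_affect - face_affect.mean()) / (face_affect.std() + 1e-8)
    voice = (voice_affect - voice_affect.mean()) / (voice_affect.std() + 1e-8)
    r = float(np.mean(face * voice))   # Pearson correlation of z-scores
    return 1.0 - r                     # 0 = congruent, 2 = fully opposed
```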
Broader Impacts: The expression of affect in special populations is a largely neglected area in affective computing and robotics; yet, these populations may be among the most important beneficiaries of these technologies. Affective expression impairments afflict many individuals, including those with neuro-developmental disorders such as autism, and those with neuro-degenerative disorders such as Parkinson?s disease. Because these impairments concern a core aspect of human communication and, hence, may cause profound social isolation in these individuals, intervention is highly desirable. However, one-on-one intervention by therapists, if effective, would be available only to relatively few individuals, thereby making computer-based intervention critical for broader access to such treatment. Accurate processing of the affective signal will be of use as a research and diagnostic tool for a range of neurological disorders. The CSLU research team will continue its tradition of disseminating research findings and technology, including speech corpora and software, to the research community.
2009 — 2013
Kain, Alexander; Hosom, John-Paul
RI: Small: Modeling Coarticulation For Automatic Speech Recognition @ Oregon Health and Science University
This project focuses on applying a model used in text-to-speech synthesis (TTS) to the task of automatic speech recognition (ASR). The standard method in ASR for addressing variability due to phonemic context, or "coarticulation," requires a large amount of training data and is sensitive to differences between training and testing conditions. Despite the effective use of stochastic models, current ASR systems are often unable to sufficiently account for the large degree of variability observed in speech. In many cases, this variability is not due to random factors, but is due to predictable changes in the speech signal. These factors are currently modeled in order to generate speech via TTS, but they are not yet modeled in order to recognize speech, largely because of non-local dependencies. We apply the Asynchronous Interpolation Model (AIM) used in TTS to the task of speech recognition by decomposing the speech signal into target vectors and weight trajectories, and then searching weight-trajectory and stochastic target-vector models for the highest-probability match to the input signal.
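To make the recognition-by-decomposition idea concrete, here is a minimal Python sketch: each hypothesis pairs two target vectors with a weight trajectory, and the best hypothesis is the one whose AIM reconstruction most closely matches the observed frames. The squared-error match score and function names are illustrative assumptions; the actual system is probabilistic and more elaborate.

```python
import numpy as np

def reconstruction_score(frames, basis_left, basis_right, weights):
    """Negative squared error between observed frames (T x D) and the
    AIM reconstruction implied by one (left target, right target,
    weight-trajectory) hypothesis."""
    recon = (1 - weights)[:, None] * basis_left + weights[:, None] * basis_right
    return -np.sum((frames - recon) ** 2)

def recognize_region(frames, hypotheses):
    """Return the (target pair + trajectory) hypothesis that best explains
    the observed frames -- a stand-in for the probabilistic search over
    stochastic target-vector models described above."""
    return max(hypotheses, key=lambda h: reconstruction_score(frames, *h))
```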
The goal of this research is to improve the robustness of ASR to variability that is due to phonemic and lexical context. This improvement will increase the use of ASR technology in automated information access by telephone, educational software, and universal access for individuals with visual, auditory, or speech-production challenges. More effective models of coarticulation may increase our understanding of both human speech perception and speech production. Results from this project are disseminated through technical papers and the CSLU Toolkit software package.
2010 — 2016
Miczek, Klaus; Grant, Kathleen (co-PI); Shafran, Izhak (co-PI); Kain, Alexander; Coleman, Kristine
Collaborative Research: CDI-Type I: Computational Models For the Automatic Recognition of Non-Human Primate Social Behaviors @ Oregon Health and Science University
The goal of this project is to develop methods that will permit researchers to remotely and automatically monitor the behavior of primates and other highly social animals. The PIs will collect behavioral data from cameras and microphones. They will then develop statistical models and computational algorithms to track the individuals in the group and to recognize facial expressions and vocalizations. Patterns in movements, expressions, and vocalizations will be used to develop behavior-identifying algorithms that will recognize different behaviors such as aggression, submission, grooming, eating, and sleeping. The project is a collaboration between computer scientists and primatologists. A key element of this project is the observation that complex social interactions can often be regarded as being composed of sequences of elementary behaviors which occur frequently and consist of relatively simple and distinct gestures. Thus, the task of modeling complex social interactions can be broken down into two regimes: elementary behaviors spanning short durations, and their stochastic sequences spanning relatively longer durations.
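A minimal sketch of the two-regime idea, under assumptions of my own: a per-window classifier supplies scores over elementary behaviors (the short-duration regime), and a Viterbi decoder over an assumed transition matrix recovers the most likely behavior sequence (the longer-duration regime). The behavior labels and model form are illustrative, not the project's actual design.

```python
import numpy as np

BEHAVIORS = ["groom", "eat", "aggress", "submit", "sleep"]  # assumed labels

def viterbi(log_emissions, log_trans, log_init):
    """Most likely elementary-behavior sequence given per-window classifier
    scores (log_emissions: T x K), a K x K log-transition matrix, and
    K log-initial probabilities."""
    T, K = log_emissions.shape
    dp = log_init + log_emissions[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = dp[:, None] + log_trans        # predecessor x successor
        back[t] = scores.argmax(axis=0)         # best predecessor per state
        dp = scores.max(axis=0) + log_emissions[t]
    path = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):               # trace back through time
        path.append(int(back[t][path[-1]]))
    return [BEHAVIORS[k] for k in reversed(path)]

# Usage with per-window classifier probabilities (T windows, K=5 behaviors):
# labels = viterbi(np.log(clf_probs), np.log(trans), np.log(init))
```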
Apart from advancing computational science, the new methods for recording behavior unobtrusively and analyzing it at a high data rate are likely to be of interest to behavioral ecologists, sociobiologists, and neuroscientists in studies of primates and other highly social animals. With these new tools, scientists can study and understand behavior, for example, in the context of planning conservation efforts for threatened species, building accurate animal models for health research, and supporting animal husbandry decisions in zoos. The project will provide an extensive, annotated data repository and associated algorithms, and will also fund graduate students who will gain hands-on training in all aspects of the project.
2010 — 2015
Klabbers, Esther; Kain, Alexander; Van Santen, Jan (co-PI)
HCC: Medium: Synthesis and Perception of Speaker Identity @ Oregon Health and Science University
This proposal addresses the problem of synthesizing speaker identity when only a small training sample is available. To achieve this goal, the project will address problems including trainable, abstract parameterizations of the prosodic patterns that characterize a speaker, as well as voice conversion methods. The project falls into the general category of building a Text-to-Speech (TTS) synthesis system that generates speech sounding like that of a specific individual (Speaker Identity Synthesis, or SIS). Systems of this kind have numerous applications, including the creation of personalized voices for individuals with neurodegenerative disorders who anticipate becoming users of Speech Generating Devices (SGDs), as well as many applications in the consumer products and entertainment industry. Consumer products such as navigation systems and mobile phones that make use of linguistic information about the generated utterance are being developed rapidly. The project will also provide new tools and data for the study of human perception of speaker identity. The tools developed in the process and the associated perceptual studies are also relevant for the assessment of speaker recognition systems, and the project provides a new generation of concise, trainable characterizations of a speaker's prosodic patterns that can be incorporated into these systems. The proposed study will elucidate the trade-offs and algorithmic issues of the proposed SIS systems, and the work is likely to have a strong intellectual impact on the field of speech synthesis.
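As a rough illustration of voice conversion from a small training sample, the sketch below fits a regularized linear spectral mapping from time-aligned source and target frames. The linear map, ridge regularization, and function names are stand-in assumptions; the project investigates richer parameterizations, including prosodic ones.

```python
import numpy as np

def train_conversion(source_feats, target_feats, reg=1e-3):
    """Least-squares linear spectral mapping from a small set of
    time-aligned source/target feature frames (N x D arrays).
    Ridge regularization keeps the map well-behaved with few samples."""
    X = np.hstack([source_feats, np.ones((len(source_feats), 1))])  # bias term
    A = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]),
                        X.T @ target_feats)
    return A                                   # shape (D + 1, D)

def convert(frame, A):
    """Map one source spectral frame toward the target speaker."""
    return np.append(frame, 1.0) @ A
```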
2010 — 2015
Shafran, Izhak (co-PI); Sproat, Richard (co-PI); Roark, Brian; Kain, Alexander
RI: Medium: Collaborative Research: Semi-Supervised Discriminative Training of Language Models @ Oregon Health and Science University
This project is conducting fundamental research in statistical language modeling to improve human language technologies, including automatic speech recognition (ASR) and machine translation (MT).
A language model (LM) is conventionally optimized, using text in the target language, to assign high probability to well-formed sentences. This method has a fundamental shortcoming: the optimization does not explicitly target the kinds of distinctions necessary to accomplish the task at hand, such as discriminating (for ASR) between different words that are acoustically confusable or (for MT) between different target-language words that express the multiple meanings of a polysemous source-language word.
Discriminative optimization of the LM, which would overcome this shortcoming, requires large quantities of paired input-output sequences: speech and its reference transcription for ASR or source-language (e.g. Chinese) sentences and their translations into the target language (say, English) for MT. Such resources are expensive, and limit the efficacy of discriminative training methods.
In a radical departure from convention, this project is investigating discriminative training using easily available, *unpaired* input and output sequences: un-transcribed speech or monolingual source-language text on the input side, and unpaired target-language text on the output side. Two key ideas are being pursued: (i) unlabeled input sequences (e.g. speech or Chinese text) are processed to learn likely confusions encountered by the ASR or MT system; (ii) unpaired output sequences (English text) are leveraged to discriminate these well-formed sentences from the (presumably) ill-formed sentences the system could potentially confuse them with.
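A minimal sketch of how idea (ii) could drive a discriminative update, in the style of a perceptron over n-gram features: whenever a confusable hypothesis (produced per idea (i)) outscores the well-formed sentence, the language-model weights are pushed apart. The feature map, update rule, and function names are illustrative assumptions, not the project's algorithm.

```python
def perceptron_update(weights, features, correct, confusion, lr=1.0):
    """One discriminative LM update. `weights` is a dict of feature weights;
    `features(sentence)` maps a sentence to sparse n-gram counts;
    `correct` is a well-formed sentence and `confusion` a synthetically
    generated confusable alternative."""
    def score(sent):
        return sum(weights.get(f, 0.0) * v for f, v in features(sent).items())
    if score(confusion) >= score(correct):
        # Reward features of the well-formed sentence...
        for f, v in features(correct).items():
            weights[f] = weights.get(f, 0.0) + lr * v
        # ...and penalize features of the confusable one.
        for f, v in features(confusion).items():
            weights[f] = weights.get(f, 0.0) - lr * v
    return weights
```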
This self-supervised discriminative training, if successful, will advance machine intelligence in fundamental ways that impact many other applications.
2011 — 2012
Kain, Alexander
R21 Activity Code Description: To encourage the development of new research activities in categorical program areas. (Support generally is restricted in level of support and in time.)
Computer-Based Pronunciation Analysis For Children With Speech Sound Disorders @ Oregon Health & Science University
DESCRIPTION (provided by applicant): The long-term objective of the proposed work is to develop speech-production assessment and pronunciation-training tools for children with speech sound disorders. The technology resulting from research on computer-assisted pronunciation training has not yet been successfully extended to help children with speech sound disorders, primarily because of a lack of accuracy in phoneme-level analysis of the speech signal. The goal of the proposed exploratory research is to develop a set of algorithms that will constitute the core components of an effective pronunciation analysis system for children with speech sound disorders. The components of this system, when used in concert, will reliably identify and score the intelligibility of a phoneme within an isolated target word. The algorithms will also identify specific types of distortion errors (e.g. fronting, in which the /sh/ phoneme is realized as /s/). The tools resulting from the proposed work will provide immediate, relevant, and understandable feedback about pronunciation errors. The Specific Aims are to (1) create individualized speech templates for use in objective analysis of pronunciation, (2) automatically identify phoneme locations in speech recordings, and (3) automatically score phoneme intelligibility for children with speech sound disorders. For Specific Aim 1, the template for evaluating a participant's spoken word will be selected from a large pool of templates of that word, and each template will be further individualized to match the general spectral characteristics of the participant. For Specific Aim 2, the primary challenge is to identify phoneme locations when the observed (spoken) phoneme sequence is different from the expected (target) phoneme sequence. A five-step process will be used to identify possible differences between the observed and expected phoneme sequences using several independent sources of information. Methods will include automatic classification of manner of articulation using a Hidden Markov Model, dynamic time warping, and a priori determination of likely phoneme errors. Specific Aim 3 will provide a measure of the intelligibility of a target phoneme and also identify distorted features. The scoring of intelligibility will be performed using a proposed Phoneme Intelligibility Analysis (PIA) module, which is phoneme-specific and composed of six sources of information, including an acoustic template of the target phoneme, likely phonetic substitutions, acoustic features used in analysis, thresholds of acceptability, statistics of phoneme duration in the given context, and evaluation metrics. The use of human perceptual data (intelligibility scores) as training data is an important and new component of the proposed approach.
PUBLIC HEALTH RELEVANCE: The proposed work is relevant to public health in that the software tools that result from this work will enable children with speech sound disorders to better communicate with the general population. Furthermore, these tools will assist teachers of such children in the task of pronunciation assessment, allowing the teachers to use their time more effectively.
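As an illustration of the template-based scoring with dynamic time warping that the Specific Aims mention, here is a minimal sketch: a low length-normalized alignment cost between a child's production and an individualized template suggests an intelligible phoneme. The feature choice, normalization, and the idea of calibrating thresholds on perceptual scores are assumptions for illustration.

```python
import numpy as np

def dtw_distance(template, utterance):
    """Dynamic-time-warping cost between an acoustic template and a
    child's production (both arrays of feature frames, T x D).
    Thresholds of acceptability would be calibrated against human
    perceptual intelligibility scores."""
    n, m = len(template), len(utterance)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(template[i - 1] - utterance[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # skip a template frame
                                 D[i, j - 1],      # skip an utterance frame
                                 D[i - 1, j - 1])  # match both
    return D[n, m] / (n + m)   # length-normalized alignment cost
```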