2002 — 2008 |
Hasegawa-Johnson, Mark |
CAREER: Landmark-Based Speech Recognition in Music and Speech Backgrounds @ University of Illinois At Urbana-Champaign
This is a Faculty Early Career Development (CAREER) award. The research will develop speech recognition and auditory scene analysis models that are probability distributions whose parameters can be trained from data and whose internal structures are capable of abstracting the perceptual response patterns of human listeners. Two broad research questions will be explored: (1) Can probability models representing the pitch, envelope, and timing of an acoustic source be computed and integrated in a tractable manner? (2) What are the theoretical and empirical requirements for the partitioning, training, and recognition scoring of probability models for landmark-based acoustic features? Landmarks in speech are identifiable points in the flow of sound over time, such as consonant releases and closures, vowel centers, and glide extrema. The educational component of this project includes significant curriculum development at both the undergraduate and graduate levels, and a strong investment in the mentoring of undergraduate and graduate research trainees.
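As an illustration of the landmark concept above, here is a minimal sketch in Python (NumPy/SciPy) that flags times where the short-time log spectrum changes abruptly, a crude stand-in for consonant closure and release landmarks. The function name, window sizes, and threshold are illustrative assumptions, not the trainable probability models this project develops.

import numpy as np
from scipy.signal import stft, find_peaks

def landmark_candidates(x, fs, frame_ms=10.0, threshold_db=6.0):
    """Return times (seconds) where the short-time log spectrum changes abruptly."""
    nperseg = int(fs * 0.025)                     # 25 ms analysis window
    hop = int(fs * frame_ms / 1000.0)             # 10 ms hop
    _, t, Z = stft(x, fs, nperseg=nperseg, noverlap=nperseg - hop)
    logspec = 20.0 * np.log10(np.abs(Z) + 1e-10)  # dB magnitude spectrogram
    # mean absolute dB change between adjacent frames, averaged over frequency
    delta = np.mean(np.abs(np.diff(logspec, axis=1)), axis=0)
    peaks, _ = find_peaks(delta, height=threshold_db, distance=int(50 / frame_ms))
    return t[1:][peaks]

Abrupt-change times found this way would still have to be classified (release vs. closure vs. vowel center) by trained models of the kind the project describes.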
This CAREER award recognizes and supports the early career-development activities of a teacher-scholar who is likely to become an academic leader of the twenty-first century. This is fundamental scientific research in acoustics and computer science, but it addresses the very practical problem that computers are still far worse at recognizing speech than human beings are. Speech recognition technology has already become an important industry, but it will become far more important in the future as mobile computing and computer-mediated communications make it necessary for millions of people to control machines verbally rather than by means of keyboards. The educational component of this work will train graduate students to be teachers and communicators, as well as researchers, thus preparing them to help build the base of personnel needed in this exciting, growing area.
|
1 |
2004 — 2008 |
Shih, Chi-Lin; Hasegawa-Johnson, Mark; Cole, Jennifer (co-PI) |
Prosodic, Intonational, and Voice Quality Correlates of Disfluency @ University of Illinois At Urbana-Champaign
Spontaneous speech is typically interrupted by one disfluency every 10-20 words. While humans easily comprehend disfluent speech, computer speech recognizers often fail to separate disfluent regions from surrounding context, resulting in failed transcription and loss of meaning. This project develops disfluency recognition for automatic speech recognition by investigating perceptually salient acoustic correlates of disfluency. Effects of disfluency on the pitch, energy and voice source features are examined in the Switchboard corpus of spontaneous speech. The approximate repetition of pitch and energy contours is investigated as a cue marking the dependency between a disfluency and its subsequent repair. Analysis-by-synthesis techniques are adapted from the Stem-ML model of speech generation to recognize prosodic repetition that is often obscured by differences in scaling. Voice quality correlates of disfluency, such as glottalization, are tracked through several acoustic measures of the spectral envelope, with ROC testing performed to determine the best predictors. Correlations between disfluency and intonational features marking accent and phrasing are examined through the creation of a ToBI-standard intonation labeling of the speech corpus. Acoustic and prosodic correlates of disfluency are combined with a repetition language model in the design of a speech recognizer that automatically transcribes both words and disfluencies. The recognizer integrates cues at multiple linguistic levels which together serve to identify regions of disfluency in spontaneous speech.
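A hedged sketch of the prosodic-repetition cue mentioned above, in Python: it scores how closely a candidate repair's pitch contour repeats the reparandum's contour after removing differences in length and overall scaling. It is only an illustration; the project's analysis-by-synthesis approach uses the Stem-ML model rather than this simple correlation.

import numpy as np

def contour_repetition_score(f0_a, f0_b, n_points=50):
    """Correlation between two pitch contours after length and scale normalization.
    Inputs are per-frame F0 values in Hz; unvoiced frames are assumed marked as 0."""
    def normalize(f0):
        f0 = np.asarray(f0, dtype=float)
        f0 = f0[f0 > 0]                                   # keep voiced frames only
        grid = np.linspace(0.0, 1.0, len(f0))
        resampled = np.interp(np.linspace(0.0, 1.0, n_points), grid, f0)
        return (resampled - resampled.mean()) / (resampled.std() + 1e-8)
    a, b = normalize(f0_a), normalize(f0_b)
    return float(np.dot(a, b) / n_points)                  # near 1.0 = similar shape

A score near 1.0 for two adjacent stretches of speech would be one piece of evidence, alongside voice-quality and intonational cues, that the second stretch is a repair of the first.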
This research will advance speech technology by enabling recognition of disfluency in natural speech. It will contribute new statistical and acoustic models of disfluency and a publicly accessible corpus of spontaneous speech with prosody and disfluency annotation.
|
1 |
2005 — 2009 |
Huang, Thomas (co-PI); Gunderson, Jon; Hasegawa-Johnson, Mark; Perlman, Adrienne (co-PI) |
Audiovisual Distinctive-Feature-Based Recognition of Dysarthric Speech @ University of Illinois At Urbana-Champaign
Automatic dictation software with reasonably high word recognition accuracy is now widely available to the general public. Many people with gross motor impairments, including some people with cerebral palsy and closed head injuries, have not enjoyed the benefit of these advances, however, because their motor impairment includes a component of dysarthria (reduced speech intelligibility caused by neuro-motor impairment) and often also precludes normal use of a keyboard. For this reason, dysarthric users now often find it easier to use a small-vocabulary automatic speech recognition system, with code words representing letters and formatting commands, and with acoustic speech recognition models carefully adapted to the speech of the individual user. But development of such individualized speech recognition systems remains extremely labor-intensive, because so little is understood about the general characteristics of dysarthric speech. In this project, the PI will study the general audio and visual characteristics of articulation errors in dysarthric speech, and apply the results to the development of speaker-independent large-vocabulary and small-vocabulary audio and audiovisual dysarthric speech recognition systems. More specifically, the PI will research word-based, phone-based, and phonologic-feature-based audio and audiovisual speech recognition models for both small-vocabulary and large-vocabulary speech recognizers designed for unrestricted text entry on a personal computer. The models will be based on audio and video analysis of phonetically balanced speech samples from a group of speakers with dysarthria, categorized into the following four groups: very low intelligibility (0-25% intelligibility, as rated by human listeners), low intelligibility (25-50%), moderate intelligibility (50-75%), and high intelligibility (75-100%). Interactive phonetic analysis will seek to describe the talker-dependent characteristics of articulation error in dysarthria; based on analysis of preliminary data, the PI hypothesizes that manner of articulation errors, place of articulation errors, and voicing errors are approximately independent events. Preliminary experiments also suggest that different dysarthric users will require dramatically different speech recognition architectures, because the symptoms of dysarthria vary so much from subject to subject, so the PI will develop and test at least three categories of audio-only and audiovisual speech recognition algorithms for dysarthric users: phone-based and whole-word recognizers using hidden Markov models (HMMs), phonologic-feature-based and whole-word recognizers using support vector machines (SVMs), and hybrid SVM-HMM recognizers. The models will be evaluated to determine overall recognition accuracy of each algorithm, changes in accuracy due to learning, group differences in accuracy due to severity of dysarthria, and dependence of accuracy on vocabulary size.
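The phonologic-feature classifiers mentioned above could, under one simple reading, be built as independent classifiers for manner, place, and voicing, reflecting the hypothesis that these error types are approximately independent. The following Python/scikit-learn sketch assumes acoustic feature vectors and per-segment labels are already available; it is an illustration, not the project's SVM-HMM system.

from sklearn.svm import SVC

def train_feature_classifiers(X, manner_labels, place_labels, voicing_labels):
    """Train one SVM per phonological feature dimension on acoustic feature vectors X."""
    classifiers = {}
    for name, y in [("manner", manner_labels),
                    ("place", place_labels),
                    ("voicing", voicing_labels)]:
        clf = SVC(kernel="rbf", probability=True)  # probability outputs allow later fusion with HMMs
        clf.fit(X, y)
        classifiers[name] = clf
    return classifiers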
Broader Impacts: This research will lay the foundation for constructing a speech recognition tool for practical use by computer users with neuro-motor disabilities. Tools and data developed in this project will all be released open-source, and will be designed so they can be easily ported to an open-source audiovisual speech recognition system for dysarthric users. The work may also have applicability beyond the target community, in that project outcomes may be relevant to many other populations (e.g., people with foreign accents) who have trouble training current ASR systems.
|
1 |
2006 — 2007 |
Hasegawa-Johnson, Mark Allan |
R21 Activity Code Description: To encourage the development of new research activities in categorical program areas. (Support generally is restricted in level of support and in time.) |
Audiovisual Description and Recognition of Dysarthric Speech @ University of Illinois Urbana-Champaign
DESCRIPTION (provided by applicant): The primary aim of this proposal is to describe and automatically recognize the audible and visible correlates of phonological contrasts as produced by persons with spastic dysarthria. In order to meet the primary aim, this investigation will:
- enroll a total of 16 subjects with dysarthria and 16 control subjects,
- record each subject's production of phonetically balanced and pragmatically useful speech material using our AVICAR array of eight microphones and four video cameras,
- measure the acoustic and visible correlates of consonant place of articulation, including formant locus, frication spectrum, lip aperture area, and jaw height,
- develop automatic audio-only and audiovisual isolated word recognition algorithms, and
- record each subject's participation in an objective comparison of audiovisual speech recognition, audio-only speech recognition, and typing as text input methods for human computer interface.
LAY LANGUAGE SUMMARY: Subjects whose neuromotor deficit precludes or hinders their use of a keyboard may nevertheless retain some control over speech articulators. Our preliminary data demonstrate that subjects with 19-30% intelligibility (as rated by human listeners) may nevertheless achieve 90-100% recognition accuracy in an automatic isolated digit recognition task. We have found that the use of video in automatic speech recognition improves word recognition accuracy for talkers without dysarthria; we propose to extend our work to seek the same gains for talkers with dysarthria.
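For context only, a minimal Python sketch of the isolated-word recognition task mentioned in the summary, using template matching with dynamic time warping over per-frame feature matrices. The project itself develops HMM- and SVM-based recognizers; this sketch, including its function names, is purely illustrative.

import numpy as np

def dtw_distance(A, B):
    """Dynamic-time-warping cost between two (frames x features) matrices."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(test_feats, templates):
    """templates: dict mapping each vocabulary word to a list of example feature matrices."""
    return min(templates, key=lambda w: min(dtw_distance(test_feats, t) for t in templates[w]))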
|
0.958 |
2007 — 2011 |
Ross, Brian (co-PI); Bock, J. Kathryn; Shih, Chi-Lin; Hasegawa-Johnson, Mark; Sproat, Richard |
DHB: An Interdisciplinary Study of the Dynamics of Second Language Fluency @ University of Illinois At Urbana-Champaign
This project will develop and test psycholinguistic models of the relationship between first language fluency, second language competence, and second language fluency. These models will be applied toward the automatic assessment of fluency in a second language. The project involves a unique collaboration between researchers with backgrounds in second-language pedagogy, testing methodology, linguistics, speech and language technology, and psychology, and is organized around a common set of data, namely oral presentations given by university students in third-year Mandarin classes. These student performances are videotaped, transcribed, and rated by trained raters according to a custom-designed and validated testing procedure. The same students will be recruited at the beginning of the semester to participate in psycholinguistic experiments to measure their first language fluency, and related studies will be conducted during the course of the year. The results from expert rating of second-language fluency will be correlated with the psycholinguistic studies of first-language fluency. In parallel with this, the team will develop algorithms that automatically assign scores to a student's second-language performance that correlate with expert judgments. These algorithms will range from low-level signal processing methods to estimate such factors as syllable rate and pause duration, to Dynamic Bayesian Networks that combine information from a large number of sources to improve the performance of Automatic Speech Recognition on the data. The results of this work will be both a better understanding of what it means to be fluent in a second language and robust methods that allow for objective automatic assessment of fluency.
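A small Python sketch of the low-level fluency measures mentioned above (pause statistics from a simple energy-based silence detector). The frame length, silence threshold, and minimum pause duration are assumed values for illustration; the project's actual estimators, and its Dynamic Bayesian Network models, are more sophisticated.

import numpy as np

def pause_durations(x, fs, frame_ms=20, silence_db=-35.0, min_pause_s=0.25):
    """Durations (seconds) of silent stretches in a speech signal x (float array)."""
    hop = int(fs * frame_ms / 1000)
    n_frames = len(x) // hop
    energy_db = np.array([10 * np.log10(np.mean(x[i * hop:(i + 1) * hop] ** 2) + 1e-12)
                          for i in range(n_frames)])
    silent = energy_db < (energy_db.max() + silence_db)   # threshold relative to loudest frame
    pauses, run = [], 0
    for is_silent in silent:
        if is_silent:
            run += 1
        else:
            if run * frame_ms / 1000.0 >= min_pause_s:
                pauses.append(run * frame_ms / 1000.0)
            run = 0
    if run * frame_ms / 1000.0 >= min_pause_s:
        pauses.append(run * frame_ms / 1000.0)
    return pauses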
|
1 |
2007 — 2012 |
Hasegawa-Johnson, Mark; Cole, Jennifer |
RI-Collaborative Research: Landmark-Based Robust Speech Recognition Using Prosody-Guided Models of Speech Variability @ University of Illinois At Urbana-Champaign
Proposal ID 0703859 (04/11/2007). Despite great strides in the development of automatic speech recognition technology, we do not yet have a system with performance comparable to that of humans in automatically transcribing unrestricted conversational speech, representing many speakers and dialects, and embedded in adverse acoustic environments. This project applies new high-dimensional machine learning techniques, constrained by empirical and theoretical studies of speech production and perception, to learn from data the information structures that human listeners extract from speech. To do this, we will develop large-vocabulary, psychologically realistic models of speech acoustics, pronunciation variability, prosody, and syntax by deriving knowledge representations that reflect those proposed for human speech production and speech perception, using machine learning techniques to adjust the parameters of all knowledge representations simultaneously in order to minimize the structural risk of the recognizer. The team will develop nonlinear acoustic landmark detectors and pattern classifiers that integrate auditory-based signal processing and acoustic phonetic processing, are invariant to noise, change in speaker characteristics and reverberation, and can be learned in a semi-supervised fashion from labeled and unlabeled data. In addition, they will use variable frame rate analysis, which will allow for multi-resolution analysis, as well as implement lexical access based on gesture, using a variety of training data. The work will improve communication and collaboration between people and machines and also improve understanding of how humans produce and perceive speech. The work brings together a team of experts in speech processing, acoustic phonetics, prosody, gestural phonology, statistical pattern matching, language modeling, and speech perception, with faculty across engineering, computer science and linguistics. The project supports and engages students and postdoctoral fellows in speech modeling and algorithm development. Finally, the proposed work will result in a set of databases and tools that will be disseminated to serve the research and education community at large.
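One concrete reading of the variable frame rate analysis mentioned above is to keep acoustic frames densely where the spectrum is changing quickly and sparsely where it is steady. The Python sketch below uses an accumulated-distance rule with assumed thresholds; it illustrates the idea rather than the project's implementation.

import numpy as np

def variable_frame_rate_select(features, threshold=1.0, max_skip=5):
    """features: (n_frames x dim) array computed at a fixed, high frame rate.
    Returns indices of retained frames: dense where spectra change, sparse where steady."""
    kept = [0]
    accumulated = 0.0
    for i in range(1, len(features)):
        accumulated += np.linalg.norm(features[i] - features[i - 1])
        if accumulated >= threshold or i - kept[-1] >= max_skip:
            kept.append(i)
            accumulated = 0.0
    return np.array(kept)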
|
1 |
2008 — 2014 |
Huang, Thomas (co-PI); Hasegawa-Johnson, Mark; Kaczmarski, Hank; Goudeseune, Camille (co-PI) |
FODAVA-Partner: Visualizing Audio for Anomaly Detection @ University of Illinois At Urbana-Champaign
The goal of this proposal is to transform large audio corpora into a form suitable for visualization. Specifically, this proposal addresses the type of audio anomalies that human data analysts hear instantly: angry shouting, trucks at midnight on a residential street, gunshots. The human ear detects anomalies of this type rapidly and with high accuracy. Unfortunately, a data analyst can listen to only one sound at a time. Visualization shows the analyst many sounds at once, possibly allowing him or her to detect an anomaly several orders of magnitude faster than "real time." This proposal aims to render large audio data sets, comprising thousands of microphones or thousands of minutes, in the form of interactive graphics that reveal important anomalies at a glance. Data transformations will include signal processing, statistical modeling, and visualization. Signal processing will seek to characterize all of the ways in which the difference between two audio signals may be "important," including, for example, spectral differences, rhythmic differences, and differences in the impression made on the auditory cortex of a human listener. Statistical modeling will seek to characterize the range of audio events that are "normal" or easily explicable, so that we may precisely measure the degree to which a potential anomaly is abnormal or inexplicable. Visualization methods will render measures of abnormality, and information about the signal characteristics of each anomaly, in a form suitable for rapid browsing. Two testbeds are proposed. The "multi-day audio timeline" will be a portable application, visually similar to a nonlinear audio editing suite, which will allow the analyst to rapidly zoom in on potentially anomalous periods of time. The "milliphone" will be a three-dimensional visualization tool for command and control centers. Audio recordings from one thousand security microphones scattered throughout a city or a large industrial site will be rendered in the form of brightly colored visible threads reaching skyward from a map of the secure region. The analyst will be able to listen to the audio recorded on any microphone by touching its thread; by touching the thread at different heights, the analyst will be able to audit different periods of time. The brightness, color, and thickness of each thread will display the abnormality and signal characteristics of the audio signal at each point in time. Data transformation research will map related types of abnormality to related color/brightness codes, so that important anomalies, and anomalies confirmed by multiple microphones, are immediately visible to the trained data analyst. This research seeks broad impact in the area of security analysis. Video cameras are routinely used for security monitoring of industrial sites, government installations, day care centers, and nursing homes. Data analysts and guards routinely browse the video recorded by up to twenty surveillance cameras simultaneously, fast-forwarding through uninteresting periods of time. Microphones would be used in the same applications, if they were useful; but there is at present no way for a data analyst to rapidly and accurately audit the signals from many different microphones.
The proposed techniques will give guards and data analysts new data transformation and visualization tools that will help them rapidly identify dangerous situations signaled by anomalous, otherwise inexplicable audio signals.
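A minimal sketch, under assumed modeling choices, of the abnormality-scoring step described above: a Gaussian mixture model is fit to frame-level features from audio known to be unremarkable, and the per-frame negative log-likelihood then serves as an abnormality score that could be mapped to thread brightness or color. This is Python/scikit-learn for illustration, not the project's statistical models.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_normal_model(background_feats, n_components=16):
    """Fit a GMM to feature frames taken from audio known to be unremarkable."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(background_feats)
    return gmm

def abnormality(gmm, feats):
    """Per-frame negative log-likelihood; larger values suggest more anomalous audio."""
    return -gmm.score_samples(feats)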
|
1 |
2008 — 2010 |
Huang, Thomas (co-PI); Hasegawa-Johnson, Mark; Bernhardt-Walther, Dirk |
RI Medium: Audio Diarization - Towards Comprehensive Description of Audio Events @ University of Illinois At Urbana-Champaign
"Perceptual salience" is a term used by psychologists of vision to describe the power of an object to draw viewer attention; for example, it has been demonstrated that eye movements target salient objects sooner than less-salient objects, and that salient objects are detected more quickly than less-salient objects. The first sub-goal of this research is to develop automatic measurements of perceptual salience for auditory events, defined here to be a center-surround contrast in terms of amplitude, spectrum, or temporal features such as zero-crossing rate and periodicity. The second sub-goal of this research is to test salience measurements in an audio event detection paradigm, using the 2007 University of Illinois CLEAR evaluation system (Classification and Labeling of Events, Activities and Relationships). The third sub-goal of this research is to compare audio event transcriptions generated by human labelers viewing an audiovisual record of a meeting vs. transcriptions generated by labelers who listen to the audio without watching any accompanying video; the experimental hypothesis states that auditory salience predicts audio-only labels better than it predicts audiovisual labels. This research is designed as a collaboration between experts in computer vision and audio signal processing. If successful, the proposed methods will help to add an audio channel to the video security monitoring systems currently installed in many hospitals, nursing homes, government buildings and industrial sites.
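The center-surround definition of auditory salience given above can be illustrated directly on a spectrogram: each time-frequency point is compared with a broader local neighborhood, so brief, locally distinct events score high. The neighborhood sizes in this Python sketch are assumed, not taken from the project.

import numpy as np
from scipy.ndimage import uniform_filter

def salience_map(log_spectrogram, center_size=(3, 3), surround_size=(15, 31)):
    """log_spectrogram: (freq x time) array in dB. Returns a same-shape salience map."""
    center = uniform_filter(log_spectrogram, size=center_size)      # small local average
    surround = uniform_filter(log_spectrogram, size=surround_size)  # broader context average
    return np.abs(center - surround)   # large where the local region stands out from its surround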
|
1 |
2010 — 2015 |
Poole, Marshall; Forsyth, David (co-PI); Pena-Mora, Feniosky (co-PI); Hasegawa-Johnson, Mark; Bajcsy, Peter (co-PI); McHenry, Kenton |
CDI-Type II: Collaborative Research: GroupScope: Instrumenting Research on Interaction Networks in Complex Social Contexts @ University of Illinois At Urbana-Champaign
Many of the most important functions in society are undertaken by large groups or teams. Emergency response, product development, health care, education, and economic activity are pursued in the context of large, dynamic, interacting networks of groups. Theory and research on such networks of groups are much less developed than research on isolated small groups or formal organizations. A major challenge for research on networks of groups lies in the difficulties that accompany the collection and analysis of the huge bodies of high-resolution, high-volume observational data necessary to study these large, dynamic networks of groups. The goal of this project is to address this challenge by applying advanced computing applications to capture, manage, annotate and analyze these massive observational sets of video, audio, and other data. The resulting data analysis system, GroupScope, will enable breakthrough research into social interaction in large, dynamic groups to be conducted much more quickly and with much higher reliability than was previously possible. It will do this by automating as many functions as possible to the highest degree possible, including managing huge volumes of video, audio, and sensor data, transcription, parsing audio for critical discourse events, annotation and indexing of video streams, and coding interaction. These first-pass analyses can then be supplemented by human analysts (and their analyses in turn will feed into machine learning that will improve the computerized analysis).
GroupScope will be developed with the collaboration of social scientists studying emergency response teams, children's playground behavior, distributed teams, and product development teams. When developed, GroupScope will be deployed in a cyberenvironment, a Web 2.0 based cyberinfrastructure that enables a community of researchers to collaborate on common problems. The cyberenvironment will enable multiple researchers to analyze and code the same group data for both small groups and large dynamic groups and networks. Multiple analyses and codings working from diverse perspectives will enable discovery of previously unsuspected relationships among different levels and layers of human interaction. They can also be linked to survey responses from participants, enabling linkage to the realm of perceptions and traits.
Many of the most fundamental advances in science have come through the development of new instruments, such as more powerful telescopes or microscopes that can allow scientists to view molecules. In the same way GroupScope will shed light on the workings of critical functions performed by real world groups such as emergency response units, health care teams, stock exchanges, and military units. GroupScope will also have applications in the training of those working in multi-team systems, such as first responders to disasters. It can be used to record and "grade" training sessions, giving participants feedback on both strengths and weaknesses of their approaches.
|
1 |
2015 — 2017 |
Jyothi, Preethi (co-PI); Hasegawa-Johnson, Mark; Varshney, Lav |
EAGER: Matching Non-Native Transcribers to the Distinctive Features of the Language Transcribed @ University of Illinois At Urbana-Champaign
Automatic speech recognition (ASR) systems must be trained using hundreds of hours of speech, with synchronized text transcriptions. Transcribing that much speech is beyond the means of most language communities; therefore ASR systems do not exist for most languages. To overcome this bottleneck, this exploratory EAGER project asks people who don't understand a particular language to transcribe it as if they were listening to nonsense syllables. Of course, when people try to transcribe speech in a language they don't understand, they make mistakes. However, there are patterns to those mistakes, which can be modeled using decoding strategies developed for telephone and wireless communication and used to route each transcription task to people whose native language helps them to perform it. The resulting transcriptions are then fused in order to recover correct transcriptions. Five different languages are to be tested, including languages with lexical tone and languages with a variety of consonant contrasts very different from English. The resulting transcriptions can then train ASR systems in all five languages, and the quality of the research will be evaluated based on its ability to train those systems without using transcriptions produced by native speakers.
Mismatched crowdsourcing is formalized as a noisy channel; the talker encodes meaning in a string of symbols (phonemes) not all of which are reliably distinguishable by the perceiver. Models of second-language speech perception for each transcriber can be initialized using a perceptual assimilation model, then specialized. In particular, this proposal seeks increases in the scale and robustness of mismatched crowdsourcing by using error-correcting codes to divide the transcription task, and by then distributing each sub-task to transcribers whose native language contains the distinctive feature requested. It also seeks to develop new theory at the intersection of the current fields of crowdsourcing (the learnability of a function under conditions of label noise) and grammar induction (the learnability of a function from one language to another), and to perform grammar induction under conditions of label noise. Preliminary bounds exist for some aspects of this problem; the proposed research is designed to develop more detailed theoretic results, and test and apply them to determine the feasibility of creating serviceable ASR systems for under-resourced languages without having to use fluent speakers of those languages to transcribe speech in those languages.
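A hedged sketch of the noisy-channel fusion idea: each non-native transcriber is modeled by a confusion matrix giving the probability of each written symbol given each true phoneme, and several independent transcriptions of the same segment are fused by multiplying their likelihoods. The data structures and names below are illustrative assumptions; the project additionally uses error-correcting-code task assignment and full sequence models.

import numpy as np

def fuse_transcriptions(observations, confusions, prior):
    """observations: list of (transcriber_id, symbol_index) pairs for one speech segment.
    confusions[t]: (n_symbols x n_phonemes) matrix P(symbol | phoneme) for transcriber t.
    prior: (n_phonemes,) prior over phonemes. Returns a posterior over phonemes."""
    log_post = np.log(np.asarray(prior) + 1e-12)
    for t, symbol in observations:
        log_post += np.log(confusions[t][symbol] + 1e-12)   # independent transcribers
    post = np.exp(log_post - log_post.max())
    return post / post.sum()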
|
1 |
2019 — 2022 |
Hasegawa-Johnson, Mark |
RI: Small: Collaborative Research: Automatic Creation of New Speech Sound Inventories @ University of Illinois At Urbana-Champaign
Speech technology is supposed to be available for everyone, but in reality, it is not. There are 7000 languages spoken in the world, but speech technology (speech-to-text recognition and text-to-speech synthesis) only works in a few hundred of them. This project will solve that problem by automatically figuring out the set of phonemes for each new language, that is, the set of speech sounds that define differences between words (for example, "peek" versus "peck": long-E and short-E are distinct phonemes in English). Phonemes are the link between speaking and writing. A neural net that converts speech into text using some kind of phoneme inventory, and then back again, can be said to have used the correct phoneme inventory if its resynthesized speech always has the same meaning as the speech it started with. This approach can even be tested in languages that don't have any standard written form, because the text doesn't have to be real text: it could be chat alphabet (the kind of pseudo-Roman alphabet that speakers of Arabic and Hindi sometimes use on Twitter), or it could even be a picture (showing, in an image, what the user was describing). This research will make it possible for people to talk to their artificial intelligence systems (smart speakers, smart phones, smart cars, etc.) using their native languages. This research will advance science by providing big-data tools that scientists can use to study languages that do not have a (standard) writing system.
End-to-end neural network methods can be used to develop speech-to-text-to-speech (S2T2S) and other spoken language processing applications with little additional software infrastructure, and little background knowledge. In fact, toolkits provide recipes so that a researcher with no prior speech experience can train an end-to-end neural system after only a few hours of data preparation. End-to-end systems are only practical, however, for languages with thousands of hours of transcribed data. For under-resourced languages (languages with very little transcribed speech) cross-language adaptation is necessary; for unwritten languages (those lacking any standard and well-known orthographic convention), it is necessary to define a spoken language task that doesn't require writing before one can even attempt cross-language adaptation. Preliminary evidence suggests that both types of cross-language adaptation are performed more accurately if the system has available, or creates, a phoneme inventory for the under-resourced language, and leverages the phoneme inventory to facilitate adaptation. The aim of this project is to automatically infer the acoustic phoneme inventory for under-resourced and unwritten languages in order to maximize the speech technology quality of an end-to-end neural system adapted into that language. The research team has demonstrated that it is possible to visualize sub-categorical distinctions between sounds as a neural net adapts to a new phoneme category; proposed experiments 1 and 2 leverage visualizations of this type, along with other methods of phoneme inventory validation, to improve cross-language adaptation. Experiments 3 and 4 go one step further, by adapting to languages without orthography; for a speech technology system to be trained and used in a language without orthography, it must first learn a useful phoneme inventory. Innovations in this project that occur nowhere else include: (1) the use of articulatory feature transcription as a multi-task training criterion for an end-to-end neural system that seeks to learn the phoneme set of a new language, (2) the use of visualization error rate as a training criterion in multi-task learning -- this training criterion is based on a method recently developed to visualize the adaptation of phoneme categories in a neural network, (3) the application of cross-language adaptation to improve the error rates of image2speech applications in a language without orthography, (4) the use of non-standard orthography (chat alphabet) to transcribe speech in an unwritten language, and (5) the use of non-native transcription (mismatched crowdsourcing) to jump-start the speech2chat training task. The methods proposed here will facilitate the scientific study of language, for example, by helping phoneticians to document the phoneme inventories of undocumented languages, thereby expediting the study of currently undocumented endangered languages before they disappear. Conversely, in minority languages with active but shrinking native speaker populations, planned methods will help develop end-to-end neural training methods with which the native speakers can easily develop new speech applications. All planned software will be packaged as recipes for the speech recognition virtual kitchen, permitting high school students and undergraduates with no speech expertise to develop systems for their own languages, and encouraging their interest in speech.
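A minimal PyTorch sketch of the multi-task idea in innovation (1) above: a shared encoder with one head predicting phoneme labels and a second head predicting binary articulatory features, trained with a weighted sum of the two losses. The architecture, dimensions, and loss weighting are assumptions for illustration, not the project's system.

import torch
import torch.nn as nn

class MultiTaskAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, n_phonemes=50, n_artic_feats=20):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.phoneme_head = nn.Linear(hidden, n_phonemes)
        self.artic_head = nn.Linear(hidden, n_artic_feats)   # binary articulatory features

    def forward(self, x):                     # x: (batch, time, feat_dim)
        h, _ = self.encoder(x)                # h: (batch, time, hidden)
        return self.phoneme_head(h), self.artic_head(h)

def multitask_loss(phone_logits, artic_logits, phone_targets, artic_targets, alpha=0.5):
    """phone_targets: (batch, time) class indices; artic_targets: (batch, time, n_artic_feats) in {0,1}."""
    ce = nn.functional.cross_entropy(phone_logits.transpose(1, 2), phone_targets)
    bce = nn.functional.binary_cross_entropy_with_logits(artic_logits, artic_targets)
    return alpha * ce + (1 - alpha) * bce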
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
|
1 |