2010 — 2012
Chiang, David
EAGER: Phylo: Phylogenetic Reconstruction of Textual Histories @ University of Southern California
This project, supported by an EArly-concept Grant for Exploratory Research (EAGER), is developing computational models of how manuscripts of premodern texts changed over time due to copying with errors, intentional editing, and translation into different languages. The purpose of these models is to reconstruct the original texts and to better understand the forces that shaped them. We are building on work applying ideas from computational evolutionary biology to the task, but the main focus of the project is to explore whether cutting-edge ideas from computational linguistics and natural language processing are better suited for modeling the evolution of natural-language texts. In particular, we are exploring the use of techniques from nonprojective dependency parsing to model the tree of relationships among manuscripts, and from statistical machine translation to model the relationship between pairs of manuscripts.
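To illustrate the parsing connection: reconstructing a copying history (a stemma) from pairwise scores is the maximum spanning arborescence problem that nonprojective dependency parsers solve, typically with the Chu-Liu/Edmonds algorithm. The sketch below is a brute-force version over a tiny hypothetical example; the manuscript names and similarity scores are invented for illustration and are not from the project.

```python
from itertools import product

# Hypothetical directed scores: score[(p, c)] = plausibility that
# manuscript c was copied from manuscript p (all values invented).
manuscripts = ["A", "B", "C", "D"]
score = {
    ("A", "B"): 4.0, ("A", "C"): 3.0, ("A", "D"): 1.0,
    ("B", "C"): 2.0, ("B", "D"): 5.0, ("B", "A"): 1.0,
    ("C", "B"): 1.5, ("C", "D"): 2.5, ("C", "A"): 1.0,
    ("D", "B"): 0.5, ("D", "C"): 2.0, ("D", "A"): 0.5,
}

def best_stemma(root, nodes, score):
    """Brute-force maximum spanning arborescence rooted at `root`.

    Each non-root manuscript chooses one parent; an assignment is a
    valid tree iff following parents from every node reaches the root.
    (Real parsers use Chu-Liu/Edmonds instead of enumeration.)
    """
    others = [n for n in nodes if n != root]
    best, best_tree = float("-inf"), None
    for parents in product(nodes, repeat=len(others)):
        tree = dict(zip(others, parents))  # child -> parent
        ok = True
        for n in others:  # reject self-loops and cycles by walking up
            seen, cur = set(), n
            while cur != root:
                if cur in seen or tree[cur] == cur:
                    ok = False
                    break
                seen.add(cur)
                cur = tree[cur]
            if not ok:
                break
        if not ok:
            continue
        total = sum(score.get((p, c), float("-inf")) for c, p in tree.items())
        if total > best:
            best, best_tree = total, tree
    return best_tree, best

tree, total = best_stemma("A", manuscripts, score)
print(tree, total)  # {'B': 'A', 'C': 'A', 'D': 'B'} 12.0
```

Here the best hypothesis is that B and C were copied from A, and D from B; on real data the scores would come from a learned model of scribal change rather than a hand-built table.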
The tools that result from the project will be made publicly available in order to foster cross-disciplinary research. These tools will enable scholars of ancient and medieval literature to use our models to analyze collections of manuscripts that were previously impractical to analyze by hand. The techniques explored will shed light on computationally hard learning and search problems such as those that frequently arise in natural language processing.
2011 — 2012
Chiang, David
EAGER: Machine Translation for Language Preservation @ University of Southern California
In the last 50 years, computational linguistics research has touched barely 1% of the world's languages. In 100 years, 90% of them will be extinct or nearly so. What can computational linguistics offer to support the urgent task of documenting and analyzing the world's endangered languages? Based on the observation that bilingual parallel text is both the primary artifact collected in documentary linguistics and the primary object of statistical translation models, this project explores the use of machine translation to accelerate the global language documentation effort. Specifically, it develops novel ways to model any number of related languages simultaneously, pooling information from all the languages to make stronger inferences about each. In order to exploit language relationships, it explores methods that simultaneously model phonological, morphological, lexical, and syntactic phenomena. In addition, it develops algorithms to standardize highly variable transcription practices.
These technologies, which will be field-tested in the Eastern Highlands of Papua New Guinea, are designed to enable speakers of endangered languages who have no specialized linguistic training to create large collections of translated oral literature, providing an authentic and interpretable record of their language that serves current and future generations of scholars, teachers, and learners. They will do so, moreover, at much less cost than is needed to support the efforts of trained linguists and ethnographers to create such collections.
2014 — 2017
Chiang, David
RI: Small: Language Induction Meets Language Documentation: Leveraging Bilingual Aligned Audio for Learning and Preserving Languages @ University of Notre Dame
Thousands of the world's languages are in danger of dying out before they have been systematically documented. Many other languages have millions of speakers, yet they exist only in spoken form, and minimal documentary records are available. As a consequence, important sources of knowledge about human language and culture are inaccessible, and at risk of being lost forever. Moreover, it is difficult to develop technologies for processing these languages, leaving their speech communities on the far side of a widening digital divide. The first step to solving these problems is language documentation, and so the goal of this project is to develop computational methods based on automatic speech recognition and machine translation for documenting endangered and unwritten languages on an unprecedented scale.
To be successful, any approach must guarantee both the sufficiency and interpretability of the documentation it produces. This project ensures sufficiency by using a combination of community outreach, crowdsourcing techniques, and mobile/web technologies to collect hundreds of hours (millions of words) of speech. It ensures interpretability by augmenting original speech recordings with careful verbatim repetitions along with translations into a well-resourced language. Finally, computational models are developed to automate transcription of recordings and alignment with translations, resulting in bilingual aligned text. The result is a kind of digital Rosetta Stone: a large-scale key for interpreting the world's languages even if they are not written, or no longer even spoken.
2020 — 2024
Chiang, David
Collaborative Research: FMitF: Track I: Differentiable Probabilistic Programming with Recursive Structured Models @ University of Notre Dame
Symbols (like the letters of the alphabet) and structures (like words formed out of letters) are natural for humans to work with: they are ubiquitous in daily life, they are easy for us to understand, and it is easy to write programs that work with them. But current artificial intelligence (AI) systems learn by making many small changes to see which ones improve the performance of the system; they are therefore good at working with representations that allow small changes, like numbers, and not so good with symbols and structures, like letters and words. This can be an obstacle both to building AI systems and to understanding why they work. A typical way for an AI system to learn to work with symbols and structures is to consider all choices and make small changes to their probabilities. But what if there are not 26 choices, but 26 trillion? For example, the grammatical structure of a sentence can be represented by a tree, one out of a large or even infinite number of possible trees. In such cases -- which are the rule rather than the exception -- one can resort to approximations, like randomly selecting a few thousand possibilities, or one can use carefully constructed algorithms to consider all of them. But it is not easy to do the latter or even to know when it is possible. This project's novelty is to develop a new programming framework to make it easy to code such algorithms, so that writing a program that learns to use trees can be as easy as writing a program that uses trees. If successful, the project's impact is to help make machine learning an everyday part of computer programming, not only for researchers but even for beginners.
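One classical instance of the "carefully constructed algorithms" mentioned above is the inside algorithm, which sums the probabilities of all parse trees of a sentence, however many there are, in polynomial time rather than by enumeration. The sketch below uses an invented toy grammar and plain floats; in a differentiable framework of the kind this project proposes, the rule probabilities would be tensor parameters, and gradients would flow through the same dynamic program.

```python
from collections import defaultdict

# A toy PCFG in Chomsky normal form (rules and probabilities invented).
# binary_rules[A] = [(B, C, p)] means rule A -> B C with probability p;
# lexical_rules[A][w] = p means rule A -> w with probability p.
binary_rules = {
    "S": [("NP", "VP", 1.0)],
    "NP": [("Det", "N", 0.7)],
    "VP": [("V", "NP", 1.0)],
}
lexical_rules = {
    "NP": {"she": 0.3},
    "Det": {"the": 1.0},
    "N": {"dog": 0.5, "cat": 0.5},
    "V": {"saw": 1.0},
}

def inside(words):
    """CKY inside algorithm: total probability of ALL parse trees of
    `words`, computed in O(n^3) without listing the trees one by one."""
    n = len(words)
    chart = defaultdict(float)  # (i, j, A) -> sum of tree probs, span i..j
    for i, w in enumerate(words):
        for A, lex in lexical_rules.items():
            if w in lex:
                chart[i, i + 1, A] += lex[w]
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for A, rules in binary_rules.items():
                    for B, C, p in rules:
                        chart[i, j, A] += p * chart[i, k, B] * chart[k, j, C]
    return chart[0, n, "S"]

print(inside(["she", "saw", "the", "dog"]))  # 0.105
```

The chart entry for each span accumulates over every split point and rule, so the final number is an exact sum over the full (potentially exponential) space of trees, which is what makes gradient-based learning over such spaces tractable.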
This project draws on and contributes to the fields of machine learning, programming languages, and formal language theory. In machine learning, there is growing interest in neural networks that make probabilistic decisions about discrete structures such as trees that represent the possible grammatical structures of a sentence. In programming language research, there has been much work on probabilistic programs and operations on them that preserve meaning exactly. However, in existing frameworks for both neural networks and probabilistic programs, it is still difficult to represent distributions over recursive structures exactly and to efficiently perform operations on them like differentiation. This project uses ideas from formal language theory to bridge this gap, making it easy to work with these distributions exactly and efficiently. The project has three stages: First, it is extending and vectorizing exact transformations on probabilistic programs so that they work on programs parameterized by differentiable tensors. Second, the project is using hyperedge replacement graph grammars (HRGs) to represent distributions over recursive structures. HRGs generalize both graphical models and string/tree automata, providing a single highly expressive formalism for structured models. Methods for efficient inference on HRGs are also being developed. Third, the team is automating the translation of probabilistic code that uses recursive data structures into HRGs. The techniques developed are being implemented in an open-source deep-learning framework.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
2021 — 2025
Chiang, David
Collaborative Research: Language Documentation with an Artificial Intelligence (AI) Helper @ University of Notre Dame
Documentation of languages, especially endangered languages, is crucial for conserving humanity's knowledge and cultural heritage, as well as for advancing an understanding of human language. Traditional documentation methods produce invaluable materials such as grammars, dictionaries, and annotated texts, but require more time than is available given current rates of language extinction. The most constructive response to this crisis is to complement documentation efforts by collecting data for as many languages as possible now and to make them accessible and interpretable so that they can be studied later by both linguists and members of the language communities. Digital technologies make it practical to obtain many hours of recordings in an endangered language along with translations. This project advances technologies for analyzing the recordings at the sub-word, word, and clause level so that they become accessible for a wide variety of documentary purposes.
The project makes the information in digital recordings more interpretable for further linguistic analysis in three ways. First, the team is devising computational methods to automatically derive a basic phonological understanding and produce phonetic representations for languages, even if they do not have an established writing system. Second, the team is developing methods to automatically analyze the internal structure of words in languages where this structure is highly complex. Third, the team uses knowledge of more widely spoken languages to analyze related endangered languages. The resulting tool, the AI-helper toolbox, will be packaged with software already in wide use by linguists and language communities in the language documentation process. All tools will be accessible through a web-based interface, and the source code will be publicly available through GitHub.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
2021 — 2023
Chiang, David
Collaborative Research: RI: Small: NL(V)P: Natural Language (Variety) Processing @ University of Notre Dame
No language is a monolith. Languages vary richly across countries, regions, social classes, and other factors. Despite recent advances in natural language processing (NLP) technology for translating between languages, answering questions, or engaging in simple conversations, current approaches have largely focused only on "standard" varieties of languages. By ignoring other varieties, treating them essentially as statistical noise, current technologies neglect the millions of people who speak these varieties.
This project is creating ways to enable language technologies, such as translation and question-answering systems, both to process and to generate fine-grained language varieties. The team will develop computational methods to automatically recognize features of different language varieties and then create approaches for integrating such linguistic information into the models powering language technologies. Additionally, the team will design methods to adapt models to varieties for which minimal training data may be available. The resulting suite of general methods will benefit diverse communities and less-privileged populations that speak underserved languages and varieties.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.