2007 — 2012
Yarowsky, David; Callison-Burch, Chris
RI: Multi-Level Modeling of Language and Translation @ Johns Hopkins University
Previous approaches to statistical machine translation (SMT) have employed phrase-based models which represent phrases as sequences of fully-inflected words, and are otherwise devoid of linguistic detail. Such approaches are unable to generalize and essentially rely on memorizing the translations of words and phrases that are observed in training data.
This project aims to improve the quality of SMT through the introduction of more sophisticated models which represent phrases using multiple levels of information. This can include basic linguistic information such as part of speech, lemmas, and agreement information (case, number, person), as well as more sophisticated linguistic detail including semantic classes, argument structure, co-reference, phrase boundaries, and information propagated from syntactic heads.
By annotating all data with this information and extending models appropriately, there is the potential to learn much more from training than was possible under previous approaches. There is now the potential to learn translations of unseen words if other forms of the words occur; it is now possible to learn general facts about a language's word order; it is now feasible to use linguistic context to generate grammatical output. Such generalization has the potential to result in much higher quality translation, especially for languages that only have small amounts of training data. It therefore represents a significant advance over previous approaches to SMT.
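To make the idea of multi-level representation concrete, here is a minimal hypothetical sketch (illustrative only, not the project's actual system) of annotating a phrase with surface form, lemma, part of speech, and agreement features, so that a model can back off to coarser levels when a fully inflected form never appeared in training:

```python
# Illustrative multi-level phrase representation (hypothetical code).
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    surface: str        # fully inflected word form
    lemma: str          # dictionary form
    pos: str            # part-of-speech tag
    morph: tuple = ()   # agreement features, e.g. ("num=pl",)


def backoff_keys(phrase):
    """Yield progressively coarser views of a phrase, from surface
    words (most specific) down to POS tags alone (most general)."""
    yield tuple(t.surface for t in phrase)   # memorized translations
    yield tuple(t.lemma for t in phrase)     # generalizes over inflection
    yield tuple(t.pos for t in phrase)       # generalizes over word choice


phrase = [Token("los", "el", "DET", ("num=pl",)),
          Token("gatos", "gato", "NOUN", ("num=pl",))]
print(list(backoff_keys(phrase)))
# [('los', 'gatos'), ('el', 'gato'), ('DET', 'NOUN')]
```

A model that stores translation statistics at each of these levels can still say something useful about an unseen inflection of a known lemma, which is exactly the generalization the surface-only phrase tables lack.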
Multi-level models have the potential for wide-ranging impact on all language technologies. Simultaneous modeling of different levels of representation is an extremely useful and natural way of describing language. This project is developing a general framework for the creation of multi-level probabilistic models of language and translation, and exploring its application to tasks beyond translation including generation, paraphrasing, and the automatic evaluation of natural language technologies.
2010 — 2015
Karakos, Damianos (co-PI); Khudanpur, Sanjeev; Callison-Burch, Chris
RI: Medium: Collaborative Research: Semi-Supervised Discriminative Training of Language Models @ Johns Hopkins University
This project is conducting fundamental research in statistical language modeling to improve human language technologies, including automatic speech recognition (ASR) and machine translation (MT).
A language model (LM) is conventionally optimized, using text in the target language, to assign high probability to well-formed sentences. This method has a fundamental shortcoming: the optimization does not explicitly target the kinds of distinctions necessary to accomplish the task at hand, such as discriminating (for ASR) between different words that are acoustically confusable or (for MT) between different target-language words that express the multiple meanings of a polysemous source-language word.
Discriminative optimization of the LM, which would overcome this shortcoming, requires large quantities of paired input-output sequences: speech and its reference transcription for ASR or source-language (e.g. Chinese) sentences and their translations into the target language (say, English) for MT. Such resources are expensive, and limit the efficacy of discriminative training methods.
In a radical departure from convention, this project is investigating discriminative training using easily available, *unpaired* input and output sequences: un-transcribed speech or monolingual source-language text, and unpaired target-language text. Two key ideas are being pursued: (i) unlabeled input sequences (e.g. speech or Chinese text) are processed to learn likely confusions encountered by the ASR or MT system; (ii) unpaired output sequences (English text) are leveraged to discriminate these well-formed sentences from the (presumably) ill-formed sentences the system could potentially confuse them with.
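The discriminative objective can be sketched in miniature (hypothetical code; the names and the perceptron-style update are illustrative stand-ins, not the project's method): given a well-formed sentence and a confusable ill-formed variant, raise the weights of n-grams in the good sentence and lower those in the bad one whenever the model prefers the wrong one.

```python
# Toy perceptron-style discriminative LM update (illustrative sketch).
from collections import defaultdict


def bigrams(words):
    return list(zip(["<s>"] + words, words + ["</s>"]))


def score(weights, words):
    return sum(weights[b] for b in bigrams(words))


def perceptron_update(weights, good, bad, lr=1.0):
    """If the model scores the ill-formed sentence at least as high,
    push weights toward the well-formed one."""
    if score(weights, bad) >= score(weights, good):
        for b in bigrams(good):
            weights[b] += lr
        for b in bigrams(bad):
            weights[b] -= lr


weights = defaultdict(float)
good = "i want to recognize speech".split()
bad = "i want to wreck a nice beach".split()  # acoustically confusable
perceptron_update(weights, good, bad)
assert score(weights, good) > score(weights, bad)
```

The project's point is precisely that the `bad` candidates need not come from expensive paired data: they can be hypothesized from unlabeled inputs via learned confusion models.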
This self-supervised discriminative training, if successful, will advance machine intelligence in fundamental ways that impact many other applications.
2012 — 2014
Van Durme, Benjamin; Callison-Burch, Chris
EAGER: Combining Natural Language Inference and Data-Driven Paraphrasing @ Johns Hopkins University
Natural language inference (NLI) and data-driven paraphrasing share the related goals of being able to detect the semantic relationship between two natural language expressions, and being able to re-word an input text so that the resulting text is meaning-equivalent but worded differently. On the one hand, work in recognizing textual entailment (RTE) within NLI has attempted to formalize the process of determining whether a natural language hypothesis is entailed by a natural language premise, sometimes called "natural logic". Research in data-driven paraphrasing, on the other hand, attempts to extract paraphrases at a variety of levels of granularity including lexical paraphrases (simple synonyms), phrasal paraphrases, phrasal templates (or "inference rules"), and sentential paraphrases, for various downstream applications such as question answering, information extraction, text generation, and summarization.
This EAGER award explores bridging the gap, through analysis of sentential paraphrasing via synchronous context free grammars (SCFGs), and how they may be coupled to formal constraints akin to recent work in phrase-based formulations of natural logic for RTE. Data-driven paraphrasing has largely neglected semantic formalisms, and NLI has relied heavily on hand-crafted resources like WordNet. If this project is successful it will potentially lead towards NLI systems that are more robust, and paraphrasing systems that are better formalized. Taken together, these improvements will allow better RTE systems to be developed. Moreover, this project has the potential to impact widely used human language technologies such as web search and natural language interfaces to mobile devices, and to further the connection between computational semantics and formal linguistics.
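As a drastically simplified stand-in for the SCFG machinery (hypothetical code; real synchronous grammars carry linked nonterminals, probabilities, and many rules), a single sentential paraphrase template can be pictured as a rewrite with slots shared between the two sides:

```python
# Toy paraphrase template: NP1's NP2 -> the NP2 of NP1 (illustrative only).
import re


def apply_possessive_rule(text):
    """Apply the classic possessive paraphrase template, with \\1 and \\2
    acting as the linked slots an SCFG rule would share across sides."""
    return re.sub(r"(\w+)'s (\w+)", r"the \2 of \1", text)


print(apply_possessive_rule("Pennsylvania's capital is Harrisburg"))
# the capital of Pennsylvania is Harrisburg
```

Coupling such rewrites to formal constraints (e.g. requiring the output to preserve or strengthen entailment) is the bridge between paraphrasing and NLI that the project explores.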
2014 — 2016
Callison-Burch, Chris |
EAGER: Simplification as Machine Translation @ University of Pennsylvania
This EArly-concept Grant for Exploratory Research (EAGER) aims to advance text simplification technology, which automatically rewrites complex English texts into simpler English texts. Research into this topic has many potential practical applications. It can provide reading aids for people with disabilities, low literacy, non-native language backgrounds, or non-expert knowledge. It can also help other computer technologies that need to process difficult words and complicated sentences. This one-year exploratory project focuses on simplification for children with different reading levels. If this technology is successful, it could help make knowledge accessible to all children and gradually help to improve their reading skills.
Simplification can be thought of as a monolingual translation task, where the output is equivalent in meaning to the input, but its surface form is constrained by a readability or grade-level requirement. Prior work has drawn the connection between machine translation and text simplification, but has treated the SMT technology as a black box. Going beyond previous work, this study provides an extensive exploration of adapting key parts of the statistical machine translation pipeline to simplify text. It aims to tailor simplification to different readability levels. The three research activities being undertaken in this study are: (1) constructing a "parallel corpus" consisting of complex sentences paired with several different levels of simplification, (2) developing automatic metrics for targeted simplification, and (3) designing features for targeted simplification.
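Activity (2) calls for automatic metrics keyed to readability levels. As a hedged sketch of the simplest signal such a metric could build on (not the project's actual metric), here is the standard Flesch-Kincaid grade-level formula with a rough vowel-group syllable counter; the coefficients 0.39, 11.8, and 15.59 are the published ones, while the syllable count is only an approximation:

```python
# Flesch-Kincaid grade level with an approximate syllable counter.
import re


def count_syllables(word):
    # Approximation: count runs of consecutive vowels (incl. y).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade(text):
    """0.39 * (words/sentence) + 11.8 * (syllables/word) - 15.59"""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59


complex_s = "The committee subsequently promulgated comprehensive regulations."
simple_s = "The group then made new rules."
assert fk_grade(complex_s) > fk_grade(simple_s)
```

A targeted-simplification metric would compare such a grade estimate for system output against the requested reading level, in addition to checking that meaning is preserved.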
2016 — 2017
Liberman, Mark (co-PI); Cieri, Christopher; Callison-Burch, Chris
CI-P: Planning for Scalable Language Resource Creation Through Novel Incentives and Crowdsourcing @ University of Pennsylvania
Advances in human language technologies enable systems that, for example, obey natural language commands and respond in kind, translate among many language pairs and summarize multilingual news. However, the technology's potential remains largely untapped because the linguistic resources that fuel development still fall far short of need. This community infrastructure planning (CI-P) initiative begins the process of building infrastructure to continuously develop high quality language resources, by employing techniques proven to work in multiple scientific disciplines. Social media, crowdsourcing, games with a purpose and citizen science show us that human resources are effectively limitless for some activities. By offering human contributors appropriate opportunities and incentives, this project enhances language resource development well beyond what direct funding alone can produce. By removing constraints on participation and designing activities to appeal to multiple communities, the project creates educational opportunities for the public, including students and under-represented groups. The increase in scale and diversity of data also benefits those working in language-related research, education and technology development. The availability of an ever-growing body of resources for an expanding range of languages will permit developers to supply technologies to a greater proportion of the world.
This project is the first step in the creation of infrastructure capable of high volume, continuous collection of language data and judgments through: ubiquity, perseverance, comprehensive annotation, automated training and certification, appropriate incentives, task engineering and variants of crowdsourcing. Building upon Linguistic Data Consortium's WebAnn framework, virtual front end web servers provide multiple interfaces to incentivize and engineer linguistic data contributions from targeted groups: linguists, citizen scientists, game players and students. Collection and annotation activities are analyzed into component tasks according to the skills they require and are assigned as appropriate to different workforces using different workflows. The combination of customized interfaces and novel incentive strategies enables ongoing, scalable data collection and annotation resulting in diverse language resources available to the wider Computer and Information Science and Engineering research and education communities.
2017 — 2019
Cieri, Christopher; Callison-Burch, Chris; Liberman, Mark (co-PI)
CI-NEW: NIEUW: Novel Incentives and Workflows in Linguistic Data Collection and Annotation @ University of Pennsylvania
Language touches every aspect of human life. People speak and write in order to manage relationships from the personal to the international, to gather and provide information, to negotiate, influence and inspire. Scientists use language to communicate their findings regardless of their field of study. Although researchers have been working for six decades to process language via computer, only in the past several years have their efforts produced technologies of sufficient maturity that they can affect the lives of the average citizen. Today, some of the most fortunate use computers to search the vast archives of the Internet, to translate material from languages they do not understand into languages they do, and to interact with smart devices by giving them natural language commands and queries and receiving responses in kind. Despite the growth and promise of human language technologies, they are in fact available for only a tiny portion of the world's approximately 7000 languages and, even then, for only a limited range of situations. This is the case because the approaches that have proven most successful in developing human language technologies require vast amounts of spoken or written language material that have been augmented by human judgment as to their interpretation, but such resources are lacking for most languages and for many types of situations, even for languages of international importance, including English. This Research Infrastructure project will address this shortage of language resources by supporting the language technology research community to employ novel incentives and alternate workflows to greatly expand the methods that have been used to date for collecting and annotating language data. The resulting resources will support research and development on an expanded range of language technologies, leading to the creation and deployment of applications for an increasingly broad range of languages and situations.
Even a brief observation of user behavior on social media, online games, citizen science and public good initiatives demonstrates that many people around the world are willing to devote collectively vast amounts of effort when given appropriate motivation and effective tools. This project will harness some of the immense people-power that drives such activities and focus it on the problem of developing language resources that help computers learn to process language. Specifically, the project team will develop a software toolkit, in response to the needs of language technology researchers, for creating online activities that yield language resources. The activities will include games, citizen science and tools for language professionals, clustered into a series of portals that appeal to different populations of users. The project will build and maintain the database and web servers, with redundancy, load balancing and failover, to run the principal instance of all of the activities, and an open-source release of the software will enable other researchers to build their own instances independently. Finally, the data resulting from this project will be shared with the least restrictive terms possible to further support language technology research and development activities worldwide.
2019 — 2023
Callison-Burch, Chris; Ticona, Julia
FW-HTF-RL: Collaborative Research: Enabling Marginalized Rural and Urban Digital Workers to Collaborate With AI to Learn Skills, Increase Wages, and Access Creative Work @ University of Pennsylvania
Many rural areas in the United States face a lack of economic opportunity. The future of work can bring opportunities for rural and urban marginalized communities through online work and the gig economy. However, work on current platforms is often low-level labeling work offering few opportunities for advancement. It is often intended to train Artificial Intelligence to automate this work away, instead of training workers. The proposed project aims to uplift workers and improve the marketplace for online work so that digital work may help with the economic recovery of regions whose traditional industries have left. This project aims to develop sustainable methods for transitioning workers to high-skilled and creative digital jobs that are unlikely to be automated in the near- to medium-term future. Crowd work can be transformed to not only improve the work product for the employer, but also to help the worker move along the career paths necessary for the future of work. The project team from four universities, Carnegie Mellon University, West Virginia University, Pennsylvania State University and the University of Pennsylvania, has partnered with local institutions to provide workers training to perform progressively more advanced digital work, while earning money. The vision of the project is to scaffold workers through basic computer fluency, working with AI tools, and finally innovation and creativity skills. This work is in collaboration with a rural partner (Rupert Public Library, in Rupert, WV) and an urban partner (CommunityForge in Wilkinsburg, PA) and also benefits from a partnership with Bosch Inc. in Pittsburgh, ConservationX Labs in Washington DC, and the State of West Virginia.
The proposed research addresses a fundamental challenge in that those who most need to develop skills to gain higher paying jobs cannot afford the unpaid time spent in training needed to develop them. Accomplishing this vision will require solving the following core research questions: (i) How can one best support marginalized workers in their transition to online work? (ii) How can Artificial Intelligence tools augment workers, rather than displace them? (iii) How can tools be designed to help workers build skills and creativity for work that is unlikely to be automated in the future? This project has the potential to make advances across a variety of interrelated fields including crowdsourcing, Artificial Intelligence, Human-Computer Interaction, Cognitive Science, Learning Science, Sociology and Economics. Simultaneously enabling both improved work outcomes as well as skill development in crowd work will require the development of models of workers, skills, and their trajectories at a more nuanced level. Enabling workers to collaborate with Artificial Intelligence will require new human-computer interaction paradigms. Supporting creativity and the development of new skills will require the exploration of new organization and coordination structures. By grounding the investigations in real world contexts, the research aims for generalizable knowledge that can lay a foundation for research on the future of crowd work at the human-AI frontier.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.