2003 — 2008 |
Davidson, Susan (co-PI); Liberman, Mark; Santorini, Beatrice; Bird, Steven (co-PI); Maxwell, Michael (co-PI) |
N/A |
Querying Linguistic Databases @ University of Pennsylvania
With National Science Foundation support, Dr. Mark Liberman and Dr. Steven Bird will lead a team conducting three years of research on data models and query languages for linguistic databases. The project will develop relational and XML data models for linguistic databases combining annotated recordings, comparative wordlists, data tabulations, interlinear texts, syntactic trees, ontologies of descriptive terms, and links between all these types. High-level user interfaces will support query-by-example and online analytical processing, permitting linguists to select appropriate language data, integrate data from multiple sources, transform the structure of the data, add new annotations in collaboration with others, and convert it all to suitable formats for archiving and for use in research and teaching.
Describing and analyzing human languages depends on being able to manage large databases of annotated text and recorded speech. The size and complexity of these databases promises to bring unprecedented depth and breadth to empirical linguistic research. However, this promise will not be fulfilled until language scientists can readily access and manipulate the data. This project will apply recent research in databases to linguistics, develop a linguistic query language, and deploy it in a variety of open-source tools for creating, managing, analyzing, and displaying annotated linguistic databases. By making rich data re-usable, the research will open the way to a deeper and broader understanding of the world's languages.
|
1 |
2012 — 2015 |
Santorini, Beatrice |
N/A |
Collaborative Research: a Syntactically Annotated Corpus of Appalachian English @ University of Pennsylvania
The goal of this project is to create an online corpus of one million words of traditional Appalachian speech, which will be freely available to the scholarly community and to the public. Though often socially stigmatized, Appalachian English is historically central to the development of American English from its British origins, and the project aims to provide a resource unprecedented in scope and in public accessibility for cultural, historical, and linguistic research on the English of Appalachia.
The project is based on an existing collection of recordings and transcripts by Prof. Michael Montgomery, a recognized authority in the field, and its goals include digitizing the recordings, aligning the digitized sound files with the transcripts, and annotating the transcripts with detailed grammatical information. Digitizing the recordings will preserve this valuable cultural resource for future generations, and aligning the digitized recordings with the transcripts will allow researchers to rapidly find recorded words and phrases by searching the transcribed text. The grammatical annotation will allow in-depth analyses of particular constructions that are specific to Appalachian English or typical of vernacular American speech more generally, as well as comparisons of Appalachian English with contemporary standard American English, with other vernacular Englishes, and with earlier stages of the language.
Because the corpus will be large, publicly available, and searchable online with standard computational tools, it will foster replicability, thereby contributing to increased empirical rigor in linguistic research. These same properties will also make it possible to use the corpus as a teaching tool at the high school and college levels. It will also serve as a model for the creation of similar corpora of other varieties of English. In sum, the annotated corpus of Appalachian English will deepen our understanding of America's linguistic heritage and promote a scientifically informed appreciation of regional language and culture.
|
1 |
2016 — 2020 |
Santorini, Beatrice |
N/A |
Collaborative Research: a Corpus of New York City English: Audio-Aligned and Parsed @ University of Pennsylvania
This project aims to further the study of New York City English (NYCE) - the varieties of English particular to New York City and the surrounding region - through the development and use of an innovative audio-aligned and parsed corpus of New Yorkers' speech. The project will combine recent advances in speech corpus development tools with the special talents and backgrounds of undergraduates at the City University of New York (CUNY), to create the first such corpus of New York City English (the CUNY-CoNYCE). The CUNY-CoNYCE will be based on interviews with New Yorkers across the five boroughs and Long Island, conducted by CUNY undergraduates from Queens College, Lehman College (The Bronx), and the College of Staten Island. Because our student populations draw predominantly from neighborhoods across the five boroughs of New York City and Long Island, they are uniquely able to collectively gather and produce large quantities of speech data from all over the region. The ultimate product will be an on-line, freely accessible, ~1,000,000-word audio-aligned and grammatically annotated corpus of NYCE speech, which will be accompanied by a full set of digital, text-searchable recordings of the speech signal from which the corpus is transcribed.
In addition to answering questions about language variation and change in NYCE, the corpus will further research in all areas of linguistics, especially in phonetics, phonology, morphology, syntax, sociolinguistics, and discourse analysis. The use of oral history and sociological measurements of ethnic affiliation components in data collection will also make the CUNY-CoNYCE a useful tool for sociologists and anthropologists examining lived experience in urban settings, inter-ethnic relations, and near-term history of New York life. The project will also provide transformative research experiences for dozens of CUNY undergraduates, giving them unique research opportunities. Additionally, users of the corpus will develop an understanding of and appreciation for the grammar of non-standard dialects, and functions of non-standard speech as necessary linguistic resources for social integration.
|
1 |
2020 — 2023 |
Kulick, Seth (co-PI); Santorini, Beatrice |
N/A |
EAGER: Annotating and Extracting Detailed Syntactic Information From a 1.1-Billion-Word Corpus @ University of Pennsylvania
Over the past decade, very large text corpora of English have become available to researchers that turn out to be of considerable value for the language sciences. Even more recently, methods in natural language processing have advanced to a point where we can begin to imagine conducting linguistic research using automatically parsed and uncorrected corpora of the sort that has so far been conducted using human-corrected corpora. It is this new situation that the PIs wish to exploit by producing an automatically parsed billion-plus word corpus of early modern English based on the digitized Early English Books Online (EEBO) corpus that has recently been completed and made accessible to research. The aim is to create an automatically parsed database with a level of accuracy suitable for both linguistic and computational research, using the recently developed cutting-edge methods in natural language processing. The resulting resource will make possible investigations hitherto impossible; specifically, the information contained in a parsed version of EEBO will permit researchers to investigate frequency effects not just of words, but of larger grammatical units (phrases and clauses). In addition to their inherent linguistic interest, the results of such investigations may lead to the discovery of more sophisticated meaning-based properties and how these vary, which should be of value for research in natural language processing. The PIs have made progress on this goal, having created a first automatically parsed version of the EEBO corpus and begun to assess its accuracy. Some features like the syntax of clausal negation are already within our reach, but for many other structures, it remains to be determined how accurate retrieval with large-scale methods can be.
Since EEBO is more than 300 times larger than even the largest individual human-corrected corpora, it is expected that a more accurately parsed version of it than the one now available will begin to allow researchers to study phenomena that are only sporadically attested in existing English corpora, to zero in on the very beginnings and ends of historical changes, to investigate many different types of frequency effects (including the novel ones already mentioned) with an accuracy and reliability not hitherto possible, and to rigorously evaluate mathematical models of language change. Because the stage of English covered by EEBO (1500-1700) is already recognizably the modern language, a parsed version of EEBO can to some extent stand proxy for a corpus of Present-Day English for research in the language sciences. As a result, it should be useful as a training and testing ground for applications in computational linguistics including part-of-speech tagging, parsing, named entity recognition, and eventually lemmatization, sense disambiguation, and others. EEBO's great genre variety and variable orthography and its moderate distance from Present-Day English will also make a parsed version of it a natural candidate for assessing and improving the robustness of these applications and for developing novel parser evaluation metrics that can serve as linguistically informed benchmarks for computational linguistics.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
|
1 |