1999 — 2002 |
MacWhinney, Brian; Buneman, O. Peter; Liberman, Mark; Bird, Steven |
N/A |
KDI: TalkBank: A Multimodal Database of Communicative Interaction @ Carnegie-Mellon University
The goal of TalkBank is the creation of a distributed, web-based, data archiving system for transcribed video and audio data on communicative interactions. These interactions will include mothers talking with their children, family dinner table talk, classroom interactions, animal cries, signed language, formal debates, phone calls, talk with foreigners, club meetings, and dozens of other types of communicative interactions. The data will come from speakers of many languages, professions, and ages. Some speakers will have language disabilities and some will be language learners. The formal specification for data in TalkBank will use a system called Codon. Tools will be created for the entry of new and existing data into the Codon format; transcriptions will be linked to speech and video; and there will be extensive support for collaborative commentary from competing perspectives. Researchers will be able to locate a particular segment of an interaction and immediately play that section back on their computer monitors. TalkBank will facilitate comparisons across social groups, languages, and situations. It will also provide tools for working in detail on single populations, as well as collaborative commentaries that test competing interpretations against a constant set of data.
The initiative establishes an ongoing interaction between computer scientists, linguists, psychologists, sociologists, political scientists, criminologists, educators, ethologists, cinematographers, psychiatrists, and anthropologists.
|
0.951 |
2000 — 2003 |
Liberman, Mark (co-PI); Bird, Steven |
N/A |
Multidimensional Exploration of Linguistic Databases @ University of Pennsylvania
This project aims to foster a new mode of fundamental research in linguistics, namely 'web-based exploration of linguistic field data.' The objectives are to develop tools for manipulating linguistic databases; to store and disseminate large datasets using the model; to exploit the tools and datasets in teaching and research; and -- underlying all of the above -- to explore new methods for representing and analyzing linguistic data. The consequence of this research will be increased accessibility, accountability, and stability of empirical linguistic research.
The project will provide wide-ranging support for empirical linguistic research, through the combination of traditional field methods with new technologies for exploring and visualizing complex databases. The interlinked, heterogeneous, and multimodal aspects of the data will be a key component, and the research will encompass data types including lexicons, interlinear texts, field notes, paradigms, grammar sketches, annotated recordings, annotated maps and photographs, folios, course notes, and problem sets, as well as links between all of these. A set of collaborators have granted access to their field data for the purposes of this project, and have agreed to road-test the new tools in their ongoing fieldwork. All of the primary data created by the project will be published on the web site of the Linguistic Data Consortium (LDC), for general public access, subject to the appropriate permissions having been granted. All tools and documentation produced by the project will be freely available to others.
|
1 |
2001 — 2007 |
Langendoen, D. Terence; Aristar-Dry, Helen; Aristar, Anthony; Bird, Steven; Ratliff, Martha |
N/A |
E-MELD: Electronic Metastructure for Endangered Languages Data
Language data is central to the research of a large social sciences community: not only linguists, but also anthropologists, archaeologists, historians, and sociologists interested in the culture of indigenous peoples. Members of this research community currently face two urgent situations: the number of languages in the world is rapidly diminishing, while the number of initiatives to create digital archives of language data is rapidly multiplying. The latter might seem to be an unalloyed good in the face of the former, but there are two ways things may go wrong without adequate collaboration among archivists, linguists, and language engineers. First, a common standard for the digitization of linguistic data may never be agreed upon, and the resulting variation in archiving practices and language representation would seriously inhibit data access, searching, and cross-linguistic comparison. Second, standards may be implemented without guidance from the people who best know the range of structural possibilities in human language: descriptive linguists who have done fieldwork on poorly described languages.
If digital archives of language data and documentation are to offer the widest possible access and to provide information in a maximally useful form, consensus must be reached about certain aspects of archive infrastructure. As the largest linguistic organization in the world and the central electronic publication of the discipline, The LINGUIST List is organizing a collaborative project with a dual objective: (1) to preserve endangered languages data and documentation and (2) to aid in the development of infrastructure for linguistic archives. One outcome of the project will be a LINGUIST List digital archive housing data from 10 endangered languages. But the focus on infrastructure will produce other, equally important results. In the first place, the LINGUIST archive will function not only as a repository but also as a 'showroom of best practice.' The archive will offer endangered languages data marked up and catalogued according to community consensus about best practice; furthermore, the archive will disseminate reference material delineating best practice and software tools supporting it. Another outcome will be the establishment on the LINGUIST List site of a central metadata server for the discipline; this server will organize information on all the language-related resources residing at distributed sites, not just information on endangered languages. Other infrastructure-related outcomes include (1) the involvement of the linguistics community in establishing best practice, (2) the widespread dissemination of the resulting recommendations, and (3) the hands-on training of a substantial core of linguists and language archivists in the implementation of the guidelines. Although the data collection efforts will focus initially on endangered languages, the metadata server, the recommendations for best practice, and the distribution of supporting software will have a significant impact on all empirical research in linguistics.
The project will thus add value to many other language-related projects currently planned or underway.
|
0.952 |
2003 — 2006 |
Comrie, Bernard; Whalen, Douglas; Bird, Steven; Mason, James; Bollacker, Kurt |
N/A |
Collaborative Project: The Rosetta Project - ALL Language Archive @ The Long Now Foundation
This collaborative project involves the Rosetta Project at The Long Now Foundation, the LINGUIST List, Stanford University, Eastern Michigan University, the Open Language Archive Community, and the Endangered Language Fund. The investigators are leading a global team of language specialists and native speakers to build a publicly accessible online archive for all documented human languages that serves as the definitive reference work on the languages of the world to date. Rosetta currently serves over 30,000 text pages documenting writing systems, phonology, grammar, vernacular texts, core wordlists, numbering systems, maps, audio files, and demographic/historical descriptions for over 1,000 languages. A major sub-component of the Rosetta archive is the ALL Language Word List Database, a collection of 200-term core vocabulary lists for the languages of the world, currently supporting 1,300 languages. This project supports the growth of this aspect of the Rosetta library with an expectation of increasing the coverage from 1,000 to 2,500 languages. Integral with this effort, LINGUIST is expanding and elaborating the functions of its "people" database, an index of the majority of the world's contemporary linguists, searchable by languages and families of interest, current research and teaching interests, course offerings, and contact information. This database is a critical resource to support the open contribution and peer review process that builds Rosetta, as well as for educators wishing to find others with related teaching interests and sharable pedagogical materials. A compelling web environment offers "anywhere, anytime" tools for scholars and speakers to contribute and collaboratively view, vet, comment on, correct, and contextualize all the materials in the archive. These tools are combined with a user-focused site design, enabling both skilled and unskilled users to easily browse, locate, and download materials of interest.
The result is an online digital library, which enables educators, researchers and learners to engage language datasets of unprecedented range and diversity. For many languages the Stanford Library and other libraries are providing links to "shelf materials" that provide more depth. This resource is also usable in linguistics courses that focus on properties of language, and it facilitates student research projects. The Cognitive, Psychological, and Language Sciences Program in the NSF Division of Behavioral and Cognitive Sciences (BCS) is providing significant co-funding of this project in recognition of its value in serving the broader educational goals of BCS and its parent Directorate for Social, Behavioral, and Economic Sciences.
|
0.903 |
2003 — 2008 |
Davidson, Susan (co-PI); Liberman, Mark; Santorini, Beatrice (co-PI); Bird, Steven; Maxwell, Michael (co-PI) |
N/A |
Querying Linguistic Databases @ University of Pennsylvania
With National Science Foundation support, Dr. Mark Liberman and Dr. Steven Bird will lead a team conducting three years of research on data models and query languages for linguistic databases. The project will develop relational and XML data models for linguistic databases combining annotated recordings, comparative wordlists, data tabulations, interlinear texts, syntactic trees, ontologies of descriptive terms, and links between all these types. High-level user interfaces will support query-by-example and online analytical processing, permitting linguists to select appropriate language data, integrate data from multiple sources, transform the structure of the data, add new annotations in collaboration with others, and convert it all to suitable formats for archiving and for use in research and teaching.
Describing and analyzing human languages depends on being able to manage large databases of annotated text and recorded speech. The size and complexity of these databases promise to bring unprecedented depth and breadth to empirical linguistic research. However, this promise will not be fulfilled until language scientists can readily access and manipulate the data. This project will apply recent research in databases to linguistics, develop a linguistic query language, and deploy it in a variety of open-source tools for creating, managing, analyzing, and displaying annotated linguistic databases. By making rich data re-usable, the research will open the way to a deeper and broader understanding of the world's languages.
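As a purely illustrative sketch of what a relational model for one of the data types named above (comparative wordlists) might look like, the following uses Python's standard `sqlite3` module. The table names, column names, language codes, and placeholder forms are all hypothetical and are not taken from the project; the actual data models and query language are not described in this abstract.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical wordlist schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE language (
            code TEXT PRIMARY KEY,   -- hypothetical identifier, not a real code
            name TEXT NOT NULL
        );
        CREATE TABLE wordlist_entry (
            id            INTEGER PRIMARY KEY,
            language_code TEXT NOT NULL REFERENCES language(code),
            gloss         TEXT NOT NULL,  -- meaning shared across languages
            form          TEXT NOT NULL   -- the word as recorded
        );
        """
    )
    conn.executemany(
        "INSERT INTO language (code, name) VALUES (?, ?)",
        [("xxa", "Language A"), ("xxb", "Language B")],
    )
    # Placeholder forms only; no real linguistic data is implied.
    conn.executemany(
        "INSERT INTO wordlist_entry (language_code, gloss, form) VALUES (?, ?, ?)",
        [
            ("xxa", "water", "placeholder-a-water"),
            ("xxb", "water", "placeholder-b-water"),
            ("xxa", "fire", "placeholder-a-fire"),
        ],
    )
    return conn


def forms_for_gloss(conn: sqlite3.Connection, gloss: str) -> list:
    """Return (language name, form) pairs for one gloss, ordered by language."""
    rows = conn.execute(
        """
        SELECT l.name, w.form
        FROM wordlist_entry AS w
        JOIN language AS l ON l.code = w.language_code
        WHERE w.gloss = ?
        ORDER BY l.name
        """,
        (gloss,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for name, form in forms_for_gloss(db, "water"):
        print(name, form)
```

Keying entries on a shared gloss is one simple way to line up forms across languages for comparison; a real model would also need to link entries to recordings, texts, and annotations.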
|
1 |
2007 — 2011 |
Liberman, Mark; Bird, Steven |
N/A |
Collaborative Research: OLAC: Accessing the World's Language Resources @ University of Pennsylvania
Language resources are the bread and butter of language documentation and linguistic investigation. They include the primary objects of study such as texts and recordings, the outputs of research such as dictionaries and grammars, and the enabling technologies such as software tools and interchange standards. Increasingly, these resources are maintained and distributed in digital form. Although language resources have begun to proliferate on the web, they are often difficult or impossible to locate and reuse. In this collaborative research project, Drs. Mark Liberman and Steven Bird of the University of Pennsylvania and Dr. Gary Simons of the Graduate Institute of Applied Linguistics will address this problem through new research to enhance the digital infrastructure of the Open Language Archives Community (OLAC). OLAC provides a standard set of language resource descriptors and a portal that permits users to query dozens of language archives simultaneously using a single search. However, the current coverage of OLAC is only the tip of the iceberg. The aim of the project is to greatly improve access to language resources for linguists and the broader communities of interest, by achieving an order-of-magnitude increase in the coverage of the OLAC catalog and in the use of OLAC search services. The project will do so through two main areas of activity: developing guidelines and services that encourage language archives to follow best common practices that will facilitate language resource discovery through OLAC, and developing services to bridge from the resource catalogs of the library and web domains to the OLAC catalog.
The project should have a broad impact across the field of linguistics by developing an online service that gives linguists access to resources for the thousands of languages in the world. But the impact will extend well beyond the linguistics community. Access to these language resources will assist technologists who are endeavoring to make information technologies work with every language, not just a select few. It will also permit educators, students, and members of society at large to access a wealth of materials that demonstrate the full range of linguistic diversity in the world. Yet another audience for these language resources is the speakers of the world's languages themselves. In the case of endangered languages, access to language resources is a critical asset in the process of language revitalization. The project will also serve to advocate the widespread use of ISO 639-3, a newly adopted standard that provides codes for precisely identifying the 7,500 known human languages, past and present. This will encourage reform in current cataloging practice, which is based on an earlier ISO standard that recognizes fewer than 400 languages, and begin the process of helping the major storehouses of knowledge around the world to deal appropriately with linguistic diversity.
|
0.958 |
2010 — 2016 |
Liberman, Mark; Bird, Steven |
N/A |
Prosodic Systems in New Guinea: Integrating Computational and Typological Approaches to Linguistic Analysis @ University of Pennsylvania
The world's languages make heavy use of prosody--tone, stress, intonation, and length--to communicate meaning, and tone is the most complex of these elements. Although non-tone languages typically exploit pitch for intonational purposes, the more sophisticated use of pitch in tone languages means that speakers of such languages will have quite different mental representations of pitch from speakers of English and better-known European non-tone languages. This project will investigate the tone and reduced-tone languages of New Guinea, a linguistically under-investigated area of the world which is home to a sixth of the world's languages. The project will collect substantial new bodies of recorded and transcribed language data from several undescribed tone languages. It will then use computational and theoretical methods to analyze the geographical distribution of tonal properties and the interaction of tone and other prosodic features.
The project will incorporate technology into linguistic field work and develop an exemplary model of prosodic description. Language consultants will be trained in the model's use, leading to more accessible primary data and more accountable descriptions. The data will be made available in a form that can be readily used by scholars, language teachers, and communities of speakers and will support the development of writing systems and literacy programs for these languages.
|
0.958 |
2012 — 2014 |
Liberman, Mark; Bird, Steven |
N/A |
Language Preservation 2.0: Crowdsourcing Oral Language Documentation Using Mobile Devices @ University of Pennsylvania
The purpose of this pilot project is to demonstrate the feasibility of a new approach to documenting endangered languages.
To allow wide-ranging investigation of a language even after it is no longer spoken, we need the equivalent of the million words of extant biblical Hebrew texts, or the five million words of extant classical Latin. But for endangered languages without a significant culture of literacy, diverse text collections on this scale seem out of reach.
Given typical speaking rates of about 10,000 word-equivalents per hour, a hundred hours of recorded speech -- conversations, narratives, or oral histories -- would give us the equivalent of a million words of text. With community involvement, hundreds of hours of such recordings are easily within reach.
However, transcribing such large audio collections is a daunting task, given the small number of literate native speakers and the time-consuming nature of such transcription, which can take 200 hours of work for every hour of audio. We propose to solve this problem by substituting re-speaking and verbal translation: one or more native speakers repeat each phrase of a recording, speaking slowly and carefully, and then translate it into a better-documented language.
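As a rough check on the figures quoted above, the short sketch below works through the arithmetic. The function and parameter names are illustrative only; the rates are the abstract's own estimates (about 10,000 word-equivalents per hour of speech, and up to 200 hours of transcription work per hour of audio), so the transcription total should be read as an upper bound.

```python
def words_from_audio(audio_hours: float, words_per_hour: int = 10_000) -> int:
    """Estimate word-equivalents contained in a given amount of recorded speech."""
    return round(audio_hours * words_per_hour)


def transcription_hours(audio_hours: float, work_per_audio_hour: float = 200) -> float:
    """Estimate conventional transcription effort for the same recordings."""
    return audio_hours * work_per_audio_hour


if __name__ == "__main__":
    audio_hours = 100
    print(words_from_audio(audio_hours))     # word-equivalents in 100 hours of speech
    print(transcription_hours(audio_hours))  # hours of transcription work, worst case
```

Under these assumptions, a hundred hours of audio corresponds to a million word-equivalents, but transcribing it conventionally could take on the order of 20,000 hours of work, which is the bottleneck that re-speaking is meant to avoid.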
The utility of translated passages as a way to analyze otherwise-unknown languages has been demonstrated many times, starting with the Rosetta Stone. This aspect of our task is easier, since at least a grammatical sketch will in general be available.
Our goal in this project is to demonstrate the utility of re-speaking. We believe that linguists, starting out with relatively little knowledge of a language, can produce phonetic transcriptions that will be good enough to support subsequent analysis resulting in coherent texts, in a process analogous to (but easier than) the process that allowed previous generations of scholars to learn to read ancient Egyptian or Sumerian.
|
0.958 |
2014 |
Bird, Steven; Chiang, David |
N/A |
RI: Small: Language Induction Meets Language Documentation: Leveraging Bilingual Aligned Audio for Learning and Preserving Languages @ University of Southern California
Thousands of the world's languages are in danger of dying out before they have been systematically documented. Many other languages have millions of speakers, yet they exist only in spoken form, and minimal documentary records are available. As a consequence, important sources of knowledge about human language and culture are inaccessible, and at risk of being lost forever. Moreover, it is difficult to develop technologies for processing these languages, leaving their speech communities on the far side of a widening digital divide. The first step to solving these problems is language documentation, and so the goal of this project is to develop computational methods based on automatic speech recognition and machine translation for documenting endangered and unwritten languages on an unprecedented scale.
To be successful, any approach must guarantee both the sufficiency and interpretability of the documentation it produces. This project ensures sufficiency by using a combination of community outreach, crowdsourcing techniques, and mobile/web technologies to collect hundreds of hours (millions of words) of speech. The interpretability is enabled by augmenting original speech recordings with careful verbatim repetitions along with translations into a well-resourced language. Finally, computational models are developed to automate transcription of recordings and alignment with translations, resulting in bilingual aligned text. The result is a kind of digital Rosetta Stone: a large-scale key for interpreting the world's languages even if they are not written, or no longer even spoken.
|
0.958 |