2018 — 2021 |
Gonzalez, Richard (co-PI); Mihalcea, Rada; Banea, Carmen |
N/A |
RI: Small: Demographic-Aware Lexical Semantics @ University of Michigan Ann Arbor
A central challenge in natural language processing is to develop methods for determining how the meanings of words relate to one another. This task is called "lexical semantics", because "lexical" means "word" and "semantics" means "meaning". Traditional dictionaries do not solve the problem of lexical semantics, because definitions are often circular or incomplete, especially for the most common words. Instead, models of lexical semantics are computed by processing large bodies of text, using the principle that words that often appear in the same contexts must have meanings that are similar along some dimensions. For example, the words "man" and "boy" would be inferred to have similar meanings along the dimensions of "human" and "gender". However, a limitation of current models is that they assume that the meaning of a word is the same for all speakers of a language. This is plainly false: we know, for example, that English speakers use words differently depending on, among other factors, their age, gender, field of work, and geographic location; that is, on the basis of their demographics. This project will overcome this limitation by developing methods for demographic-aware lexical semantics, where people-centric information complements language-based information. This work will help improve systems for natural language communication between people and computers, such as Siri or Alexa, as well as systems for automatically translating between different languages.
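The distributional principle described above can be illustrated with a minimal sketch. It assumes a tiny invented corpus (`TOY_CORPUS`), a fixed context window, and raw co-occurrence counts compared by cosine similarity. These choices and all names are hypothetical illustrations, not the project's actual models or data.

```python
from collections import Counter, defaultdict
import math

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Invented toy corpus (an assumption for illustration only).
TOY_CORPUS = [
    "the man walked to the store",
    "the boy walked to the school",
    "the man read a book",
    "the boy read a comic",
    "rain fell on the roof",
]

vectors = cooccurrence_vectors(TOY_CORPUS)
sim_man_boy = cosine(vectors["man"], vectors["boy"])
sim_man_rain = cosine(vectors["man"], vectors["rain"])
```

In this toy setting "man" and "boy" share their contexts while "man" and "rain" share none, so `sim_man_boy` comes out higher than `sim_man_rain`. Note that the same vector is produced regardless of who wrote the text, which is the limitation the project addresses.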
Recent years have witnessed significant progress in research in lexical semantics using corpus-based approaches such as distributional vector-space models and word embeddings. At the same time, the growth of Web 2.0 has led to tremendous volumes of text, most of which is rich in explicit or implicit demographic information, such as the age, gender, industry, or location of the writer. The goal of this project is to take the next natural step at the confluence of these two trends and develop methods for demographic-aware lexical semantics, where people-centric information complements language-based information for enhanced linguistic representations that explicitly account for the demographics and traits of the people behind the language. The project targets the following three main research objectives. First, it develops novel demographic-aware word representation models that account not only for contextual knowledge but also for people-centric information. Methods that are explored include distributional vector-space models that can be composed to create demographic-aware vector-space representations for various demographic profiles, and joint word embeddings that combine generic context-based embeddings with specialized embeddings that reflect the specifics of given demographic dimensions. Second, building upon extensive previous work in behavioral studies targeting the identification of systematic heterogeneity across groups, lab studies are devised to validate the findings from the computational models. Third, the application of these novel people-centric word representations to three core tasks in natural language processing is explored, ranging from simple to complex, namely: word associations, text similarity, and diversified news.
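One way to picture the joint embeddings mentioned above is the following minimal sketch. It assumes that the generic and specialized vectors are simply concatenated, and it uses invented two-dimensional values with location as the demographic dimension. The abstract does not specify the combination method, so the concatenation, the `weight` parameter, and all names and numbers here are hypothetical.

```python
import math

# Hypothetical generic (demographic-independent) vectors; a real system would learn these.
GENERIC = {"football": [0.9, 0.1], "soccer": [0.8, 0.2]}

# Hypothetical specialized vectors keyed by (demographic value, word).
SPECIALIZED = {
    ("US", "football"): [0.1, 0.9],
    ("US", "soccer"): [0.9, 0.1],
    ("UK", "football"): [0.9, 0.1],
    ("UK", "soccer"): [0.8, 0.3],
}

def joint_embedding(word, group, weight=1.0):
    """Concatenate the generic vector with the (weighted) group-specific vector.

    Falls back to a zero vector when no specialized vector exists for the group.
    """
    generic = GENERIC[word]
    special = SPECIALIZED.get((group, word), [0.0] * len(generic))
    return generic + [weight * x for x in special]

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

us_sim = cosine(joint_embedding("football", "US"), joint_embedding("soccer", "US"))
uk_sim = cosine(joint_embedding("football", "UK"), joint_embedding("soccer", "UK"))
```

With these invented values, `uk_sim` is higher than `us_sim`: the same word pair is treated as closer in meaning for one group than for the other, which is the kind of group-dependent similarity the word association and text similarity tasks would draw on.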
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
|
0.952 |