2008 — 2012 |
Huang, Heng (co-PI); Ding, Chris |
N/A |
Collaborative Research: Matrix-Model Machine Learning: Unifying Machine Learning and Scientific Computing @ University of Texas At Arlington
To analyze the ever-growing quantities of data for pattern recognition and knowledge discovery, effective machine learning models and efficient computational algorithms are essential tools. The goal of this research is to establish a theoretical foundation for solving challenging machine learning problems with matrix/tensor computational methodologies, leveraging the success of scientific computing over recent decades, including well-developed algorithms and mature, freely available software.
This research begins with a critical connection between machine learning and scientific computing: an effective global solution to the K-means clustering problem is provided by principal component analysis (PCA), which is based on the singular value decomposition (SVD). This fundamental relationship will be systematically extended to matrices, tensors and multi-relational data, to deal with increasingly higher dimensions, multiple indexes and data types. The key goal of this research is to establish that well-known scientific computing techniques such as SVD and matrix and tensor decompositions can be used directly for pattern discovery, and to further develop these computational methodologies for semi-supervised learning, clustering and classification. The focus will be on multi-index data (tensors, such as a sequence of weather maps or a sequence of traffic measurements over a network) and multi-relational data (multiple pairwise relations, such as protein domains - proteins - pathways or words - documents - authors). Applications in genomics, text mining, and computer vision will be investigated.
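The PCA and K-means connection mentioned above can be illustrated for the two-cluster case. This is a minimal sketch, not the investigators' method: it assumes NumPy, uses synthetic data, and the function name `pca_two_way_split` is hypothetical. It splits the points by the sign of their score on the first principal component, which is the relaxed solution for K = 2.

```python
import numpy as np

def pca_two_way_split(X):
    """Split the rows of X into two groups by the sign of the
    first principal component score (relaxed 2-means)."""
    Xc = X - X.mean(axis=0)                      # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, 0] * s[0]                      # projection onto the first principal direction
    return (scores > 0).astype(int)

# Synthetic data (an assumption for illustration): two well-separated blobs.
rng = np.random.default_rng(0)
a = rng.normal(loc=[-5.0, 0.0], scale=0.5, size=(20, 2))
b = rng.normal(loc=[5.0, 0.0], scale=0.5, size=(20, 2))
X = np.vstack([a, b])

labels = pca_two_way_split(X)
print(labels)
```

Which blob receives label 0 depends on the arbitrary sign of the singular vector, so only the grouping is meaningful, not the label values.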
|
1 |
2008 — 2009 |
Ding, Chris |
N/A |
SGER: Collaborative Research: Non-Negative Matrix Factorizations For Data Mining: Algorithms and Applications @ University of Texas At Arlington
Nonnegative matrix factorization (NMF) factorizes an input nonnegative matrix into two nonnegative matrices of lower rank. It was recently discovered that NMF in its most basic form is equivalent to a relaxed K-means clustering, the most widely used pattern discovery algorithm in data mining. This direct link between mathematics and data mining has set in motion a large number of developments in using matrix factorizations for pattern discovery. It turns out that NMF provides more consistent and mathematically well-defined optimization formulations for many fundamental and emerging data-mining problems. NMF algorithms have well-understood properties; they are simple, easy to implement, and well suited to distributed parallel architectures. This research aims to formally establish a comprehensive NMF-based framework for data mining. In particular, we will (1) extend matrix factorization data-mining methodology from its current focus on clustering (pattern discovery) to newer problems: semi-supervised clustering (extending partial knowledge to the whole data set) and classification (pattern prediction, such as distinguishing cancerous tumor tissue from normal tissue); (2) develop fast numerical algorithms and incorporate state-of-the-art numerical optimization techniques; and (3) apply and evaluate the NMF algorithms in different real-world applications including text mining and bioinformatics.
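To make the factorization concrete, the following is a minimal sketch of NMF using the widely known multiplicative-update rule for the squared-error objective. The abstract does not name a specific algorithm, so the choice of update rule, the function name `nmf`, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np

def nmf(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate nonnegative X (n x m) as W @ H with W (n x rank) >= 0
    and H (rank x m) >= 0, using multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    errors = []
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so the factors stay nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - W @ H))   # Frobenius reconstruction error
    return W, H, errors

# Toy nonnegative matrix with two visible blocks (illustrative data only).
X = np.array([[5.0, 4.0, 0.0, 0.0],
              [4.0, 5.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 4.0],
              [0.0, 0.0, 4.0, 3.0]])

W, H, errors = nmf(X, rank=2)
print(errors[0], errors[-1])
```

The loop consists only of matrix products and elementwise operations, which is one reason such algorithms are easy to implement and to parallelize.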
|
1 |
2009 — 2014 |
Huang, Heng (co-PI); Ding, Chris |
N/A |
Collaborative Research: Non-Negative Matrix Factorizations For Data Mining: Foundations, Capabilities, and Applications @ University of Texas At Arlington
Nonnegative matrix factorization (NMF) factorizes an input nonnegative matrix into two nonnegative matrices of lower rank. It was recently discovered that NMF has a unique ability to solve challenging data mining and machine learning problems. The advantages of NMF over existing unsupervised learning methods are: (1) NMF can model widely varying data distributions; (2) NMF performs both hard and soft clustering simultaneously; and (3) many other data mining problems, such as semi-supervised clustering, can be reformulated as NMF problems. Building upon these foundations, the investigators propose to establish an NMF-based comprehensive framework for data mining: (a) provide a deeper understanding of NMF's clustering capability; (b) extend the data mining capability of NMF to solve various data mining and machine learning problems; (c) develop fast numerical algorithms that incorporate state-of-the-art developments from numerical optimization for various matrix factorization models; (d) develop novel and rigorous proof strategies to prove the correctness and convergence properties of the numerical algorithms; (e) apply and evaluate these new algorithms in real-world applications.
The proposed work creates a new paradigm for analyzing vast amounts of data and discovering new knowledge from the data by transforming established matrix computational methodologies. This new technology can automatically group news articles into meaningful categories, discover protein modules in protein networks, extract weather patterns in climate data, segment pictures into distinct objects, detect communities on the Web, and enable many other scientific discoveries and the creation of new technologies. On a fundamental level, the proposed work establishes that a simple matrix factorization in fact solves challenging data mining problems. This research reinforces the importance of mathematics in today's data-centric world and encourages students to learn mathematics.
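One common way to read both soft and hard clusterings from a single nonnegative factor is sketched below. It assumes a factor matrix `H` whose column j holds item j's weights over the clusters; the function name `cluster_memberships` and the hand-made values are hypothetical, and this is not presented as the investigators' procedure.

```python
import numpy as np

def cluster_memberships(H):
    """H: (k, n) nonnegative factor; column j holds item j's weights over k clusters.
    Returns soft memberships (columns sum to 1) and hard labels (argmax)."""
    H = np.asarray(H, dtype=float)
    col_sums = H.sum(axis=0, keepdims=True)
    soft = H / np.where(col_sums > 0, col_sums, 1.0)   # avoid dividing by zero
    hard = H.argmax(axis=0)                            # ties go to the lowest cluster index
    return soft, hard

# Hand-made factor for illustration: 2 clusters, 4 items.
H = np.array([[0.9, 0.8, 0.1, 0.5],
              [0.1, 0.2, 0.9, 0.5]])

soft, hard = cluster_memberships(H)
print(soft)
print(hard)
```

The last item has equal weight in both clusters, which the soft memberships preserve while the hard labels must pick one side.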
|
1 |
2009 — 2010 |
Huang, Heng (co-PI); Ding, Chris |
N/A |
EAGER: Collaborative Research: Cross-Domain Knowledge Transformation Via Matrix Decompositions @ University of Texas At Arlington
Traditional data mining algorithms discover knowledge in new domains starting from scratch, ignoring knowledge learned in other domains. Knowledge transformation is a transformative paradigm that uses knowledge previously acquired in other domains to guide the knowledge discovery process in a new domain, and it is especially useful for large data sets. In particular, using applicable knowledge from other domains helps to stabilize unsupervised learning and to generate results of which we may already have a preliminary understanding.
The goal of this project is to design and develop cross-domain knowledge transformation mechanisms for knowledge discovery. The transformation mechanisms are based on matrix decompositions in which the knowledge being transferred is represented directly and explicitly, making it easy to comprehend and to use in practice. The proposed mechanisms provide a versatile knowledge transformation framework with a solid theoretical foundation and enable a new paradigm of unsupervised learning with domain knowledge.
The usefulness of these knowledge transformation mechanisms/systems will be demonstrated for effective information retrieval, consumer recommender systems, and product/online opinion sentiment analysis. The versatility of this transformative methodology will be verified across many domains.
|
1 |
2009 — 2014 |
Huang, Heng (co-PI); Ding, Chris |
N/A |
New Theoretical Foundations of Tensor Applications: Clustering, Error Analysis, Global Convergence, and Robust Formulations @ University of Texas At Arlington
Tensor decompositions are becoming increasingly important in analyzing high-dimensional multi-index data. However, applications of tensor decompositions have so far been restricted: (1) They are mainly used for data compression; critically important tasks such as data clustering have not been addressed. (2) No bounds on reconstruction error exist, so the compression parameters are determined on a trial-and-error basis. (3) As solutions to non-convex optimization problems, tensor decompositions are not unique. This could severely affect the reliability of tensor analysis. (4) Tensor decompositions are obtained by minimizing the sum of squared errors and are thus sensitive to noise and outliers in the data. A robust formulation of decomposition is highly desirable for applications with high noise levels. In this proposal, we investigate these new fundamental aspects of tensor applications: (1) investigate the clustering capabilities of tensor decompositions, in addition to the established theoretical results on clustering; (2) provide a comprehensive error analysis of tensor decompositions and derive lower and upper error bounds; (3) investigate conditions for global convergence of tensor decompositions and investigate good initializations for the cases where global convergence fails; (4) develop robust formulations for tensor decompositions. In addition, we will develop a user-friendly software toolbox that contains the resulting algorithms and make it available to the public. We will also educate graduate and undergraduate students in the fundamentals of matrix and tensor computations. We will present tutorials and organize workshops on this new direction.
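As an illustration of tensor decomposition used for compression, the sketch below computes a truncated higher-order SVD (a Tucker-type decomposition) with NumPy and measures the reconstruction error empirically. The abstract does not specify which decomposition is studied, so the choice of HOSVD, the helper names, and the synthetic tensor are assumptions.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Multiply tensor T by matrix M along axis `mode`."""
    out = np.tensordot(M, T, axes=(1, mode))   # contracted axis is replaced; new axis comes first
    return np.moveaxis(out, 0, mode)

def truncated_hosvd(T, ranks):
    """Return a core tensor and one orthonormal factor per mode."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])               # leading r left singular vectors
    core = T
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)   # project onto each factor basis
    return core, factors

def reconstruct(core, factors):
    T = core
    for mode, U in enumerate(factors):
        T = mode_product(T, U, mode)
    return T

# Synthetic 4 x 5 x 6 tensor built to have multilinear rank (2, 2, 2).
rng = np.random.default_rng(0)
G = rng.normal(size=(2, 2, 2))
A, B, C = rng.normal(size=(4, 2)), rng.normal(size=(5, 2)), rng.normal(size=(6, 2))
T = reconstruct(G, [A, B, C])

core, factors = truncated_hosvd(T, ranks=(2, 2, 2))
T_hat = reconstruct(core, factors)
rel_err = np.linalg.norm(T - T_hat) / np.linalg.norm(T)
print(core.shape, rel_err)
```

Because the example tensor has exactly the chosen multilinear rank, the error is at the level of floating-point rounding; on real data the ranks act as compression parameters and the error has to be checked for each choice.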
|
1 |
2014 — 2017 |
Huang, Heng; Ding, Chris |
N/A |
ABI Innovation: A New Automated Data Integration, Annotations, and Interaction Network Inference System For Analyzing Drosophila Gene Expression @ University of Texas At Arlington
Large-scale in situ hybridization (ISH) screens are providing an abundance of data showing spatio-temporal patterns of gene expression that are valuable for understanding the mechanisms of gene regulation. Knowledge gained from analysis of Drosophila expression patterns is widely important, because a large number of genes involved in fruit fly development are commonly found in humans and other species. Thus, research into the spatial and temporal characteristics of Drosophila gene expression images has been at the leading edge of scientific investigations into the fundamental principles of development in different species. Drosophila gene expression pattern images enable the integration of spatial expression patterns with other genomic datasets that link regulators with their downstream targets. This project addresses the computational challenges in analyzing Drosophila gene expression patterns by leveraging a new bioinformatics software system. It focuses on designing principled bioinformatics and computational biology algorithms and tools that will integrate multi-modal spatial patterns of gene expression for Drosophila embryos' developmental stage recognition and anatomical ontology term annotation, and will infer gene interaction networks to generate a more comprehensive picture of gene function and interaction. The bioinformatics methods resulting from the project activities are broadly applicable to a variety of fields such as biomedical science and engineering, systems biology, clinical pathology, oncology, and pharmaceutics. Novel tools to enhance courses and research experiences for diverse populations of students are planned to broaden participation in science.
This project investigates three challenging problems in studying Drosophila embryo ISH images via innovative bioinformatics algorithms: 1) a sparse multi-dimensional feature learning method to integrate the multimodal spatial gene expression patterns for annotating Drosophila ISH images, 2) heterogeneous multi-task learning models that use a high-order relational graph to jointly recognize the developmental stages and annotate anatomical ontology terms, 3) an embedded sparse representation algorithm to infer the gene interaction network. Applying structured sparse learning, multi-task learning, and high-order relational graph models to the analysis of Drosophila gene expression patterns is innovative and holds great promise for scientific investigations into the fundamental principles of animal development. The algorithms and tools resulting from this research are expected to help knowledge discovery in broader scientific and biological domains with massive high-dimensional and heterogeneous data sets. This project facilitates the development of novel educational tools to enhance several current courses at the University of Texas at Arlington. The PIs engage minority students and under-served populations in research activities to provide opportunities for exposure to cutting-edge scientific research. For further information see the web site at: http://ranger.uta.edu/~heng/NSF-DBI-1356628.html
|
1 |
2016 — 2020 |
Huang, Heng; Rao, Jia (co-PI); Ding, Chris |
N/A |
BIGDATA: Collaborative Research: IA: Big Imaging-Omics Data Mining Framework For Precision Medicine @ University of Texas At Arlington
The research objective of this proposal is to address the computational challenges in an innovative BIGDATA application on imaging-omics based precision medicine. Recent advances in high-throughput imaging (such as histopathology images) and multi-omics (such as DNA sequence, RNA expression, methylation, etc.) technologies have created new opportunities for exploring relationships between histology, molecular events, and clinical outcomes using quantitative methods. However, the unprecedented scale and complexity of these imaging-omic data have presented critical computational bottlenecks requiring new concepts and enabling tools. This project builds a new computational framework to integrate novel big data mining algorithms with cloud and high-performance computing strategies for revealing complex relationships between histopathology images, multi-omics, and phenotypic outcomes. This project is innovative and crucial not only to facilitating the development of new big data mining techniques, but also to addressing emerging scientific questions in imaging-omics and many other biomedical applications. The developed methods and tools are expected to impact other cancer genomics research and enable investigators working on cancer medicine to effectively test their scientific hypotheses. This project facilitates the development of novel educational tools to enhance several current courses. The University of Texas at Arlington is a minority-serving institution and has a large population of Hispanic and Black American students. This project engages minority students and under-served populations in research activities to give them better exposure to cutting-edge scientific research.
To address the key challenges in big imaging-omics data mining, this project explores the following research tasks. First, large-scale non-convex sparse learning models are developed for identifying outcome-relevant phenotypic traits from big histopathology images. Second, biological domain knowledge is used to guide the sparse learning models to uncover the molecular bases of complex traits. Third, data integration models are designed to integrate imaging-omics data from multiple sources and discover heterogeneous biomarkers. Fourth, a Bayesian learning model is explored to predict longitudinal cancer outcomes. Fifth, cloud computing and high-performance computing strategies are developed to support big imaging-omics data mining, such as optimizations for various data mining workloads on heterogeneous hardware (e.g. GPU and NUMA multicore processors) to fully unlock the potential of data center hardware. Integrating big data mining algorithms with cloud and high-performance computing for imaging-omics is innovative and holds great promise for a systems biology of precision medicine.
|
1 |