1995 — 1998 |
Hastie, Trevor |
N/A |
Mathematical Sciences: Flexible Regression and Classification
Proposal: DMS 9504495 PI: Trevor Hastie Institution: Stanford University Title: Flexible Regression and Classification
Abstract
The research pursues several directions with a common theme: to push widely accepted but limited statistical tools in more adventurous directions, while retaining some of their attractive features, such as model interpretability. Specifically, the research involves the development of: a) nonparametric extensions of logistic regression for multiclass responses, including additive, projection pursuit and basis expansion techniques, as well as rank-reduced models similar to Fisher's LDA; b) a new adaptive algorithm for basis selection, similar to Friedman's MARS model, which uses a natural penalized criterion to simultaneously select variables and shrink their coefficients; c) a technique for locally adapting the nearest neighbor distance metric to combat the curse of dimensionality.
Many important problems in data analysis and modeling focus on prediction. Important examples include computer-assisted diagnosis of disease (e.g. reading digital mammograms), heart disease risk assessment, automatic reading of handwritten digits (e.g. zip codes on envelopes), and speech recognition. This research is about enriching the current toolbox of well-established statistical models in a natural way to address some of these more complex scenarios. Often new exotic techniques, such as neural networks, are "black boxes" that appear to produce good results, but do not provide the analyst with an interpretable model, diagnostics or similar feedback to give them confidence that the box has produced sensible results. Statistics can play an active role in these important prediction and data analysis problems through the development of competitive and defensible models. This research does just that by creating a blend between the well-understood classical techniques and the new techniques that allow for model exploration.
|
1 |
1998 — 2023 |
Hastie, Trevor |
N/A |
Flexible Statistical Modeling
This project studies methods for analyzing large datasets using L1 and related regularization. Coordinate descent algorithms are developed to provide entire families of solutions for L1 and more aggressive concave penalized regression problems. Applications include generalized linear regression models for very wide datasets, and structure-finding algorithms for undirected graphical models. L1 regularization is used as well to develop efficient convex algorithms for finding low-rank approximations (SVDs) to extremely large, sparsely populated matrices.
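As a rough illustration of the coordinate descent idea mentioned above, the following sketch fits a lasso (L1-penalized least squares) path with warm starts. It is a minimal toy version under stated assumptions, not the project's software: the function names `soft_threshold`, `lasso_cd` and `lasso_path`, the objective scaling (1/2n)||y - Xb||^2 + lam*||b||_1, and the lambda grid are all choices made here for illustration.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator: shrink z toward zero by gamma."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, beta=None, n_sweeps=200, tol=1e-8):
    """Cyclic coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1.

    `beta` is an optional warm start (hypothetical parameter name).
    """
    n, p = X.shape
    beta = np.zeros(p) if beta is None else beta.copy()
    col_sq = (X ** 2).sum(axis=0) / n      # per-column curvature
    resid = y - X @ beta                   # full residual, kept up to date
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the partial residual
            # (the residual with coordinate j's contribution added back).
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                resid -= X[:, j] * delta
                beta[j] = new_bj
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

def lasso_path(X, y, n_lambdas=20, ratio=1e-2):
    """Solve on a decreasing lambda grid, warm-starting each fit."""
    n = X.shape[0]
    lam_max = np.max(np.abs(X.T @ y)) / n  # smallest lam giving all-zero beta
    lams = lam_max * np.logspace(0, np.log10(ratio), n_lambdas)
    beta = np.zeros(X.shape[1])
    path = []
    for lam in lams:
        beta = lasso_cd(X, y, lam, beta=beta)
        path.append(beta.copy())
    return lams, np.array(path)
```

Warm starts are what make a whole family of solutions cheap here: each fit begins from the previous solution, which is usually close to the next one. The sketch assumes no intercept and does not standardize columns; a practical implementation would handle both.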
This project develops tools with a wide variety of applications, illustrated here in medicine and merchandising. Modern technologies in genomics produce measurements of half a million or more genotypes at particular locations (SNPs) along an individual's genome in a few hours. Armed with such measurements on a few thousand individuals, some sick and some healthy, this project develops powerful statistical tools for identifying groups of SNPs associated with diseases such as Alzheimer's or breast cancer. Online movie renters or book buyers are often asked to rate their purchases. Although each individual sees a minuscule fraction of the selections available, the investigators are able to develop recommender systems that exploit the overlap to learn genres of movies, assign viewers to like-minded cliques, and make recommendations for products not yet seen.
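The low-rank approach to a sparsely observed ratings matrix can be sketched with an iterative soft-thresholded SVD. This is a small dense toy under assumptions made here, not the project's algorithm: the function name `soft_impute`, the boolean `mask` convention (True where a rating is observed) and the stopping rule are illustrative, and a dense SVD would not scale to the extremely large matrices the abstract describes.

```python
import numpy as np

def soft_impute(Y, mask, lam, n_iters=100, tol=1e-6):
    """Approximately minimize
        0.5 * sum over observed (Y - Z)^2 + lam * nuclear_norm(Z)
    by repeatedly filling missing entries with the current estimate
    and soft-thresholding the singular values.
    """
    Z = np.zeros_like(Y, dtype=float)
    for _ in range(n_iters):
        # Observed entries come from Y, missing ones from the current fit.
        filled = np.where(mask, Y, Z)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)       # shrink singular values
        Z_new = (U * s) @ Vt
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    return Z
```

Entries of the returned matrix at unobserved positions serve as predicted ratings. Larger values of `lam` give lower-rank, more heavily shrunken fits.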
|
1 |
2002 — 2005 |
Hastie, Trevor |
N/A |
Flexible Statistical Modelling
Proposal ID: DMS-0204612 PI: Trevor Hastie Title: Flexible statistical modeling
Abstract
The investigator and his students will study the popular support vector machines from the viewpoint of the traditional logistic regression model (with regularization). While the latter exhibits very similar generalization performance to the SVM, it has several advantages: it estimates class probabilities, and generalizes to multiclass problems seamlessly. This and several other even simpler approaches will be developed for the classification of gene expression arrays.
This proposal will thus develop useful models for making class predictions in a variety of high-dimensional situations, in particular for gene expression arrays. While the science journals abound with exotic statistical and computational algorithms for use in this fashionable domain, this investigator and his colleagues firmly believe that some very simple modifications of classical approaches perform as well or better, and are easier to understand.
|
1 |
2018 — 2022 |
Owen, Art; Hastie, Trevor |
N/A |
Bigdata: F: Computationally Efficient Algorithms For Large-Scale Crossed Random Effects Models
The problems of deciding what to buy, where to eat, which movie to watch, and so forth are of enormous economic value to consumers, sellers, and the people employed making those goods and services. Companies try to match people and products using vast data sets recording purchases and opinions. Even with a large data set it is a challenge to get reliable results. Measurements on the same or similar products are correlated, as are measurements by the same or similar people; however, correlated data yield less information than uncorrelated data. Properly accounting for the correlation requires too much computation, even on modern large computers, because the amount of computation grows as a power of the size of the data. Ignoring those correlations will produce an analysis that becomes overconfident and findings that are not reproducible, leading to inefficiency and wasteful decisions. This project will develop computationally efficient and reliable methods to handle data of this kind as well as more complicated data structures. The results of this research will benefit both industry and individuals making purchasing decisions.
The problems described above are known as crossed random effects in the statistical literature. The statistically proper tools are linear mixed models and generalized linear mixed models. The usual ways to fit linear mixed models have a cost that grows faster than linearly in the size of the data set, with an exponent of three halves. The same cost arises in a Bayesian approach. With large modern data sets these costs are completely out of reach. Some recent solutions work with the method of moments at a cost that scales linearly with the data size. This project will develop a backfitting method that starts with the moment method and then iterates towards the maximum likelihood solution. It will also extend to the generalized linear mixed model case in order to handle binary outcomes, such as whether the customer did or did not buy a particular item. While crossed random effects are prevalent in electronic commerce, they can arise in any setting where there are many-to-many relationships connecting one sort of entity to another. Any place where we have observations on the edges of a bipartite graph is a place where crossed random effects may arise. This work will also include random slope models.
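To make the backfitting idea concrete, here is a minimal sketch for the model y = mu + a[row] + b[col] + noise, in which each sweep touches every observation once and so costs time linear in the data size. It rests on simplifying assumptions that are not taken from the abstract: the variance ratios `lam_a` and `lam_b` (noise variance divided by each random-effect variance) are treated as known and positive, the intercept is held at the grand mean, and the function name `backfit_crossed` is hypothetical. The project itself concerns estimating the variance components as well.

```python
import numpy as np

def backfit_crossed(rows, cols, y, n_rows, n_cols, lam_a, lam_b,
                    n_sweeps=100, tol=1e-10):
    """Alternate shrunken-mean updates for row and column effects.

    rows, cols: integer index arrays giving each observation's row
    (e.g. customer) and column (e.g. product). lam_a, lam_b must be
    positive so that levels with no observations are well defined.
    """
    mu = y.mean()                          # intercept fixed for simplicity
    a = np.zeros(n_rows)
    b = np.zeros(n_cols)
    n_per_row = np.bincount(rows, minlength=n_rows)
    n_per_col = np.bincount(cols, minlength=n_cols)
    for _ in range(n_sweeps):
        a_old, b_old = a, b
        # Update row effects from residuals with column effects removed.
        r = y - mu - b[cols]
        a = np.bincount(rows, weights=r, minlength=n_rows) / (n_per_row + lam_a)
        # Update column effects from residuals with row effects removed.
        r = y - mu - a[rows]
        b = np.bincount(cols, weights=r, minlength=n_cols) / (n_per_col + lam_b)
        if max(np.abs(a - a_old).max(), np.abs(b - b_old).max()) < tol:
            break
    return mu, a, b
```

The number of sweeps needed depends on the sampling pattern, so a linear per-sweep cost does not by itself guarantee a linear total cost.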
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
|
1 |
2020 |
George, Paul (co-PI); Hastie, Trevor J.; Heilshorn, Sarah C |
R21 (Activity Code Description: To encourage the development of new research activities in categorical program areas. Support generally is restricted in level of support and in time.) |
Combinatorial Matrix-Mimetic Recombinant Proteins as Engineered Nerve Guidance Conduits
ABSTRACT Over 500,000 Americans suffer from peripheral nerve injury (PNI), and despite surgical interventions, most suffer permanent loss of motor function and sensation. Current clinical options for long nerve gap PNI include naturally derived grafts, which provide native matrix cues to regenerate neurons but suffer from very limited supply and batch-to-batch variability, or synthetic nerve guidance conduits (NGCs), which are easy to manufacture but often fail due to lack of regenerative cues. The main challenge with using any NGC for treatment of PNI is the immense trade-off between providing the complex matrix cues necessary for optimal nerve regeneration while providing a conduit that is readily available, reproducible, and easily fabricated. To overcome this challenge, we propose an entirely new type of biomaterial: a computationally optimized, protein-engineered recombinant NGC (rNGC). This rNGC combines the reliability of synthetic NGCs with the presentation of multiple regenerative matrix cues of natural NGCs. Because current understanding of cell-matrix interactions is insufficient to enable the direct design of a fully functional rNGC, we hypothesize that the use of machine learning and computational optimization methods will allow identification of an rNGC that promotes nerve regeneration similar to the current gold standard autograft. We utilize a family of protein-engineered, elastin-like proteins (ELPs) that are reproducible, with predictable, consistent material properties, and fully chemically defined for streamlined FDA approval. Due to ELPs' modular design, they have biomechanical (i.e. matrix stiffness) and biochemical (i.e. cell-adhesive ligand) properties that are independently tunable over a broad range.
While numerous studies detail the effects of individual biomechanical or biochemical matrix cues on neurite outgrowth using single-variable approaches, their combinatorial effects have been largely unexplored as insufficient knowledge exists to make accurate predictions of their interactions a priori. This fundamentally prohibits the direct design of combinatorial matrix cues. We hypothesize that optimized presentation of biomechanical and biochemical cues will create a microenvironment that better mimics the native ECM milieu, resulting in synergistic ligand cross-talk to improve nerve regeneration. In Aim 1, we use computational optimization methods to identify the combination of ligand identities, ligand concentrations, and matrix stiffness that best enhances neurite outgrowth. We will develop and characterize a library of ELP variants with distinct cell-adhesive ligands derived from native ECM, and assess their ability to support neurite outgrowth from rat dorsal root ganglia (DRG). In Aim 2, we will validate our in vitro optimization results in a preclinical, rat sciatic nerve injury model. A core-shell, ELP-based rNGC with an inner core matrix of the optimized ELP formulation from Aim 1 will be fabricated and evaluated for its ability to enhance therapeutic outcome. Controls include reversed nerve autograft, hollow silicone conduit, and non-optimized ELP-based rNGC. This study would represent the first use of computational optimization methods to design a reproducible, reliable, recombinant biomaterial with multiple regenerative matrix cues.
|
0.958 |