REGULARIZATION IN STATISTICS, SEPTEMBER 6-11, 2003, BANFF
==============================================================================
SUNDAY MORNING
------------------------------------------------------------------------------
Rudolf Beran
Regularized Bayes fits to incomplete unbalanced multi-way layouts

Statistical computing provides a technological environment for advances in
statistical thought beyond those supported by probability theory. This talk
illustrates the point by developing low-risk adaptive Bayes fits to large
discrete multi-way layouts, possibly unbalanced or incomplete or both, whose
factor levels may be ordinal or nominal or both. Prior distributions that
express competing notions about the smoothness of, and about the interactions
(in the ANOVA sense) among, the unknown means yield candidate Bayes estimators
of these means. Minimizing estimated frequentist risk under a general model
for the observed multi-way layout selects the adaptive Bayes fit from the
class of candidate Bayes estimators. As the number of factor levels tends to
infinity, the risk of the adaptive Bayes fit converges to the minimum risk
attainable over the candidate class. The method is equivalent to
regularization by penalized least squares, with smallest estimated risk used
to select the penalty weights.
------------------------------------------------------------------------------
Stephen Smale
Learning with determined inputs and generalized Shannon-Nyquist sampling

Extensions of Shannon-Nyquist sampling will be discussed in the context of
regularization and learning.
------------------------------------------------------------------------------
Ivan Mizera
Regularization in statistics: Metamorphoses and leftovers

I will try, in the simplest regularization setting of nonparametric regression
(smoothing), to indicate similarities and differences between various possible
approaches to similar questions. Topics addressed include the difference
between dense (functional, gridded) and scattered (point) data formulations;
various motivations for nonlinearity and how they relate to linear problems;
certain bivariate aspects - the role of the domain and subsequent implications
for the numerical developments; and the classical desideratum of smoothing
versus emerging alternative targets. A few historical and interdisciplinary
parallels are drawn along the way. The challenges posed by nonlinearity are
illustrated on plastic splines, a total-variation based regularization method
developed for the scattered data formulation. Only a few practical examples
are displayed, but some time is devoted to the physical justification of, and
parallels between, bivariate plastic and elastic splines. Finally, open
problems stemming from the nonlinear nature of the subject are pointed out.
==============================================================================
SUNDAY AFTERNOON
------------------------------------------------------------------------------
Enno Mammen
Optimal estimation in additive regression models

Optimal estimation of an additive nonparametric component is discussed for an
additive regression model $$Y=m_0+m_1(X_1)+...+m_D(X_D) + \varepsilon,$$ where
$Y$ is the response, $X_1,...,X_D$ are observed covariables, $m_1,...,m_D$ are
unknown functions (with the norming $\mathrm{E}\,m_d(X_d)=0$ for
$d=1,...,D$), $m_0$ is an unknown constant and $\varepsilon$ is an unobserved
error variable. We show that, up to first order, an additive component, $m_1$
say, can be estimated as if the other components $m_2,...,m_D$ were known.
This claim is proved for kernel smoothers, local polynomials, smoothing
splines and orthogonal series estimates. We present a two-step procedure that
achieves this aim: after a presmoothing step, the smoothing algorithm is
applied to the presmoothed data. We compare this estimate with a theoretical
estimate in which the same smoothing algorithm is applied to the theoretical
data $$Y-m_2(X_2)-...-m_D(X_D).$$ This estimate makes use of the unknown
functions $m_2,...,m_D$. We show that the theoretical estimate and the
two-step estimate are asymptotically equivalent. In this sense our estimate
achieves an oracle bound. Our result shows that in this respect nonparametric
estimation differs fundamentally from parametric estimation. Extensions of the
results to more general nonparametric models are discussed. The talk reports
on joint work with Joel Horowitz and Jussi Klemela.
------------------------------------------------------------------------------
Joel Horowitz
Nonparametric estimation in the presence of instrumental variables

We suggest two nonparametric approaches, based on kernel methods and
orthogonal series, respectively, to estimating regression functions in the
presence of instrumental variables. For the first time in this class of
problems we derive optimal convergence rates and show that they are attained
by particular estimators. In the presence of instrumental variables the
relation that identifies the regression function also defines an ill-posed
inverse problem, the "difficulty" of which depends on the eigenvalues of a
certain integral operator determined by the joint density of the endogenous
and instrumental variables. We delineate the role played by problem difficulty
in determining both the optimal rate and the appropriate choice of smoothing
parameter. (This is based on joint work with Peter Hall.)
------------------------------------------------------------------------------
Paul Speckman
Adaptive function estimation using smooth stochastic variability priors

We consider Bayesian nonparametric regression, with priors on the unknown
function corresponding to smoothing with L-splines. The penalty term also
includes a weight function that is optimized to give a locally adaptive
smoother. Appropriate priors can be constructed that are similar to stochastic
volatility models in finance. Smoother versions than those commonly used in
the finance literature appear to be more useful in nonparametric regression.
MCMC is employed to estimate the models. Applications have been made to
regression with normal errors and also to density estimation.
------------------------------------------------------------------------------
S. C. Samuel Kou
Statistical analysis of single molecule experiments in chemistry

Recent technological advances allow scientists for the first time to follow a
biochemical process on a single-molecule basis, which, unlike traditional
macroscopic experiments, raises many challenging data-analysis problems and
calls for a sophisticated statistical modeling and inference effort. This
paper provides the first likelihood-based analysis of the single-molecule
fluorescence lifetime experiment, in which the conformational dynamics of a
single DNA hairpin molecule are of interest. The conformational change is
modeled as a continuous-time two-state Markov chain, which is not directly
observable and has to be inferred from changes in photon emissions from a dye
attached to the DNA hairpin molecule.
In addition to the hidden Markov structure, the presence of molecular Brownian
diffusion further complicates the matter. We show that a closed-form
likelihood function can be obtained and that a Metropolis-Hastings algorithm
can be applied to compute the posterior distribution of the parameters of
interest. The data augmentation technique is utilized to handle both the
Brownian diffusion and the issue of model discrimination. Our results increase
the estimation resolution severalfold. The success of this analysis indicates
that there is an urgent need to bring modern statistical techniques to the
analysis of data produced by modern technologies. This work is joint with
Sunney Xie and Jun Liu.
==============================================================================
MONDAY MORNING
------------------------------------------------------------------------------
Emmanuel Candes
Chirplets: Multiscale detection and recovery of chirps

In this talk, we consider the problem of detecting and recovering chirps from
noisy data. Chirps are signals which are neither smoothly varying nor
stationary but rather exhibit rapid oscillations and rapid changes in their
frequency content. This behavior is very different from that assumed in the
standard literature, which typically assumes smoothness and homogeneity. One
particular application of note in conjunction with this line of research is
the detection of gravitational waves. Building on recent advances in
computational harmonic analysis, we design libraries of multiscale chirplets
and introduce detection strategies which are more sensitive than existing
feature detectors. The idea is to use structured algorithms which exploit
information in the chirplet dictionary to chain chirplets together adaptively
so as to form chirps with polygonal instantaneous frequency; these structured
algorithms are so sensitive that they allow one to detect signals whenever
their strength makes them detectable by any method, no matter how intractable.
Formally, we propose a test statistic which provably attains near-optimal
decision bounds over a wide range of meaningful classes of chirps. In
addition, dynamic programming can be invoked to rapidly compute our test
statistics. Similar strategies and results extend to the estimation problem.
------------------------------------------------------------------------------
Antonin Chambolle
An algorithm for total variation minimization and applications

I will present an algorithm for minimizing total variation under a quadratic
constraint, based on the dual formulation. Some applications to image
processing will be discussed.
------------------------------------------------------------------------------
Otmar Scherzer
Tube methods for bounded variation regularization

In this presentation we use statistical modeling to develop regularization
models for de-noising, de-jittering, and de-blurring applications in image
processing, and we present an analysis of such regularization models. Inspired
by a relation between the taut-string algorithm (commonly used in statistics)
and bounded variation regularization, we generalize the taut-string algorithm
to higher dimensions; the result is a tube method, since the desired solution
can be found within a tube around the data. Finally, we present tube methods
for the statistically motivated regularization algorithms.
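
As a concrete illustration of the total variation regularization discussed in
the two preceding abstracts, the following minimal numpy sketch implements a
standard dual projection iteration for the discrete Rudin-Osher-Fatemi model
min_u TV(u) + (1/(2*lam)) ||u - f||^2. It is offered only as a generic
example of the dual approach and need not coincide with the specific
algorithms presented in the talks.

    import numpy as np

    def grad(u):
        # forward differences with Neumann boundary conditions
        gx = np.zeros_like(u); gy = np.zeros_like(u)
        gx[:-1, :] = u[1:, :] - u[:-1, :]
        gy[:, :-1] = u[:, 1:] - u[:, :-1]
        return gx, gy

    def div(px, py):
        # discrete divergence, the negative adjoint of grad
        dx = np.zeros_like(px); dy = np.zeros_like(py)
        dx[0, :] = px[0, :]
        dx[1:-1, :] = px[1:-1, :] - px[:-2, :]
        dx[-1, :] = -px[-2, :]
        dy[:, 0] = py[:, 0]
        dy[:, 1:-1] = py[:, 1:-1] - py[:, :-2]
        dy[:, -1] = -py[:, -2]
        return dx + dy

    def tv_denoise(f, lam, n_iter=200, tau=0.125):
        # dual fixed-point iteration; tau <= 1/8 is the usual stability bound
        px = np.zeros_like(f); py = np.zeros_like(f)
        for _ in range(n_iter):
            gx, gy = grad(div(px, py) - f / lam)
            norm = 1.0 + tau * np.sqrt(gx ** 2 + gy ** 2)
            px = (px + tau * gx) / norm
            py = (py + tau * gy) / norm
        return f - lam * div(px, py)   # denoised image
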
==============================================================================
MONDAY AFTERNOON
------------------------------------------------------------------------------
Roger Koenker
Total variation regularization for noisy, scattered data

Two variants of total variation regularization for bivariate function
estimation will be described: piecewise constant functions on Voronoi
tessellations, or Voronograms; and piecewise linear functions on Delaunay
tessellations, or triograms. Voronograms employ a total variation roughness
penalty on the fitted function and are well adapted to image segmentation and
the estimation of sharp discontinuities. Triograms penalize the total
variation of the gradient of the fitted function while preserving its
continuity; they are also capable of preserving sharp features of the target
function. Both methods employ a quantile infidelity criterion of goodness of
fit and can be efficiently computed using primal-dual interior point methods.
Sparsity of the underlying linear algebra enables quite large problems to be
solved.
------------------------------------------------------------------------------
Arne Kovac
Taut strings and modality

We consider in this paper the problems of non-parametric regression and
density estimation. Our goal in both situations is to specify a function $f$
that is adequate for the data but does not contain spurious local extreme
values. Our approaches rely on suitable definitions of adequacy. A measure of
adequacy gives rise to a set of adequate functions, each of them representing
a plausible model for the data in the sense that the data look like a
``typical'' sample from the model. In the regression context we employ
multiresolution criteria, where we check the sums of the residuals or of their
signs on various scales and at various locations. For density estimation our
notion is based on a generalization of the Kuiper metric. Having specified the
set of adequate functions, we look for an adequate function with minimal
modality. Our strategy in most approaches is to use the taut string method to
generate a sequence of functions with an increasing number of local extreme
values. As soon as we encounter an adequate function we choose this function
as an approximation to the data. In a different approach we minimize the total
variation among all adequate functions.
------------------------------------------------------------------------------
Robert Nowak
Near minimax optimal learning with dyadic classification trees

This talk reports on a family of computationally practical classifiers that
converge to the Bayes error at near-minimax optimal rates for a variety of
distributions. The classifiers are based on dyadic classification trees
(DCTs), which involve adaptively pruned partitions of the feature space. A key
aspect of DCTs is their spatial adaptivity, which enables local (rather than
global) fitting of the decision boundary. Our risk analysis involves a spatial
decomposition of the usual concentration inequalities, leading to a spatially
adaptive, data-dependent pruning criterion. For any distribution on $(X,Y)$
whose Bayes decision boundary behaves locally like a Lipschitz smooth
function, we show that the DCT error converges to the Bayes error at a rate
within a logarithmic factor of the minimax optimal rate. We also study DCTs
equipped with polynomial classification rules at each leaf, and show that as
the smoothness of the boundary increases their errors converge to the Bayes
error at a rate approaching the parametric rate.
We are not aware of any other practical classifiers that provide similar
rate-of-convergence guarantees. Fast algorithms for tree pruning and
adaptation are discussed.
------------------------------------------------------------------------------
Sylvain Sardy
$L_1$ penalized likelihood nonparametric function estimation

We discuss the relationship between Laplace Markov random fields and total
variation regularization for regression, density estimation and linear
inverse problems. We propose a relaxation algorithm to solve the dual
optimization problem, and we propose ways of choosing the regularization
parameter $\lambda$. Monte Carlo simulations point to the need for a more
flexible penalty, e.g., an $\ell_p$ penalty, where now both $\lambda$ and $p$
have to be selected.
==============================================================================
TUESDAY MORNING
------------------------------------------------------------------------------
Chong Gu
Model diagnostics for penalized likelihood estimates

Functional ANOVA decompositions can be incorporated into multivariate function
estimation through the penalized likelihood method. In this talk, we present
some simple diagnostics for the ``testing'' of selected model terms in the
decomposition; the elimination of practically insignificant terms generally
enhances the interpretability of the estimates, and may sometimes also have
inferential implications. What we try to achieve are the tasks of the
traditional likelihood ratio tests, but in the absence of sampling
distributions, due to the typically infinite-dimensional nulls in
nonparametric settings. The diagnostics are illustrated in the settings of
regression, probability density estimation, and hazard rate estimation.
------------------------------------------------------------------------------
Dennis Cox
Functional-data ANOVA via a randomization-based multiple comparisons procedure

We consider a very simple approach to functional-data ANOVA which has been
used by many investigators: apply univariate ANOVA pointwise on a grid and use
the randomization-based multiple comparisons correction of Westfall and Young
(1992) to obtain a desired significance level. This method is relatively
straightforward and gives very interpretable results (regions in the space of
the independent variable where there are significant differences), but there
have been concerns about potential problems. In particular, most multiple
comparisons procedures penalize for more comparisons, which suggests that the
regions of statistical significance will shrink if one uses a finer grid for
the pointwise tests, and perhaps that there is an optimal sampling interval.
We conjecture that the Westfall and Young procedure in fact overcomes this
problem: as the grid becomes finer, the corrected p-values converge to an
approximately continuous function. Motivation and numerical examples will be
presented in support of this conjecture. Other approaches to FANOVA will be
discussed if time permits.
------------------------------------------------------------------------------
Charles Kooperberg
Bayesian logic regression (tentative topic)

I am trying to fit regression models to SNP (single nucleotide polymorphism)
data: many, essentially binary, covariates. I am running MCMC over the class
of regression models (logic regression) that I am interested in. Rather than
summarizing all models by a mean or something similar, I am trying to
summarize all of my MCMC models in a qualitative way.
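
One simple qualitative summary of such MCMC output - offered here only as an
illustration, not as the speaker's method - is to tabulate how often
individual SNPs, and pairs of SNPs, appear in the sampled models. The sketch
assumes each sampled model is recorded simply as the set of SNP indices used
in its logic trees; the summaries actually presented may be quite different.

    from collections import Counter
    from itertools import combinations

    def summarize_models(sampled_models):
        """sampled_models: iterable of sets of SNP indices, one per MCMC draw."""
        n_draws = 0
        singles, pairs = Counter(), Counter()
        for model in sampled_models:
            n_draws += 1
            snps = sorted(set(model))
            singles.update(snps)                  # marginal inclusion counts
            pairs.update(combinations(snps, 2))   # co-occurrence counts
        marginal = {s: c / n_draws for s, c in singles.items()}
        joint = {p: c / n_draws for p, c in pairs.items()}
        return marginal, joint

    # three hypothetical draws
    marginal, joint = summarize_models([{3, 17}, {3, 17, 42}, {3}])
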
------------------------------------------------------------------------------
Jim O. Ramsay
From data to differential equations

Differential equations (DIFEs) can represent the underlying processes giving
rise to observed functional data, and as such can offer a number of potential
advantages over parametric or basis expansion models. They explicitly model
the behavior of derivatives, and derivative estimates based on DIFEs are
usually superior to those derived from conventional data smoothers. Solutions
to a linear DIFE of order $m$ span an $m$-dimensional space, and consequently
have the capacity to model curve-to-curve variation as well as to fit the
data. We can build known structural features into DIFE models more easily
than is usually the case for conventional functional models. Finally, a DIFE
offers a wider range of ways to introduce stochastic behavior into models.
Earlier DIFE identification methods, such as principal differential analysis,
required the intermediate step of first smoothing the data. We will discuss a
technique for going directly from the discrete and noisy data to a DIFE that
is based on the work of Heckman and Ramsay (2000). Illustrations of its
performance for simulated data will be offered, as well as examples from
process control in chemical engineering and from medical data on treatment
regimes for lupus.

Heckman, N. and Ramsay, J. O. (2000). Penalized regression with model-based
penalties. The Canadian Journal of Statistics, 28, 241-258.
==============================================================================
TUESDAY AFTERNOON
------------------------------------------------------------------------------
Selim Esedoglu
Decomposition of images by the anisotropic Rudin-Osher-Fatemi model

The total variation based image de-noising model of Rudin, Osher, and Fatemi
can be generalized in a natural way to privilege certain edge directions. We
consider the resulting anisotropic energies and study properties of their
minimizers. Joint work with Stanley J. Osher.
------------------------------------------------------------------------------
Frits Ruymgaart
Improving regression function estimators in indirect models

It turns out that the traditional estimators of linear functionals of the
regression function in general, and of their Fourier coefficients in
particular, are not asymptotically efficient in the sense of Le Cam-Hájek-van
der Vaart, except, possibly, when the error distribution is normal. Since
these estimators are still root-n consistent, a procedure like the one in
Bickel et al. (1993) applies to construct a modification which is
asymptotically efficient; see also Pfanzagl (1994). A self-contained proof can
be given. When repeated measurements are available, such an improvement
procedure can in particular be carried out for error distributions with heavy
tails, starting with a preliminary estimator based on linear combinations of
the order statistics in the subsamples. These procedures can be performed when
the regression function is the input of an operator equation and the operator
has pure point spectrum and bounded, positive eigenvalues. Ordinary, direct
regression is included as a special case.
------------------------------------------------------------------------------
Curt Vogel
The wavefront reconstruction problem in adaptive optics

Adaptive optics (AO) can dramatically improve the resolution of ground-based
optical telescopes. In this talk, we review the basic concepts behind AO.
We describe in detail the computation of the wavefront reconstructor---the
mapping from sensor measurements to mirror deformations---and we highlight its
connection to statistical regularization. We also describe recent work on fast
reconstruction algorithms.
==============================================================================
WEDNESDAY MORNING
------------------------------------------------------------------------------
Gene Golub
The method of moments and statistical computation
------------------------------------------------------------------------------
Werner Stuetzle and Tom Duchamp
Spline smoothing on surfaces

We present a method for estimating functions on topologically and/or
geometrically complex surfaces from possibly noisy observations. Our approach
is an extension of spline smoothing, using a finite element method. The paper
has a substantial tutorial component: we start by reviewing smoothness
measures for functions defined on surfaces, simplicial surfaces and
differentiable structures on such surfaces, subdivision functions, and
subdivision surfaces. After describing our method, we show results of an
experiment comparing finite element approximations to exact smoothing splines
on the sphere, and we give examples suggesting that generalized
cross-validation is an effective way of determining the optimal degree of
smoothing for function estimation on surfaces.
------------------------------------------------------------------------------
Jianhong Jackie Shen
A young mathematician's reflection on vision: Ill-posedness and regularization

Vision, the perception of the 3-D world from its 2-D partial projections onto
the left and right retinas, is fundamentally an ill-posed inverse problem. Yet
after millions of years of evolution, human vision works astonishingly
accurately and satisfactorily. How could it have become such a remarkable
inverse problem solver, and what are the hidden (or subconscious)
regularization techniques it employs? This talk attempts to reflect on and
shed some light on these billion-dollar or Nobel-level questions (the works of
several Nobel Laureates will indeed be mentioned), based on the limited but
unique experience and philosophy of an immature young mathematician. In
particular, we discuss the topological, geometric, and statistical/Bayesian
regularization techniques of visual perception. I owe deep appreciation to my
former advisors and mentors for leading me to the Door of Vision: Professors
Gil Strang, Tony Chan, Stan Osher, David Mumford, Stu Geman, and Dan Kersten.
------------------------------------------------------------------------------
Tony Chan
Geometric and total variation regularization in inverse problems

In many inverse problems, one has to recover functions that are piecewise
smooth, with the smooth pieces separated by lower-dimensional interfaces. For
these problems, it is important to choose a regularization technique that will
respect and preserve the discontinuities of the function values, as well as
control the geometric regularity of the interfaces. Notable examples of this
class of regularization include minimizing the total variation of the function
and minimizing the surface area of the interfaces. In this talk, I'll review
recent advances in this area, with illustrative examples from image
restoration and segmentation, elliptic inverse problems, and medical image
tomography problems.
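
The link drawn in the preceding abstract between penalizing function values
and penalizing interface geometry can be made concrete in a toy computation
(an illustration only, not taken from the talk): for a binary image, the
anisotropic discrete total variation simply counts boundary edges, i.e. the
perimeter of the interface.

    import numpy as np

    def anisotropic_tv(u):
        # sum of absolute forward differences in both coordinate directions
        return np.abs(np.diff(u, axis=0)).sum() + np.abs(np.diff(u, axis=1)).sum()

    u = np.zeros((20, 20))
    u[5:15, 5:15] = 1.0            # indicator of a 10-by-10 square
    print(anisotropic_tv(u))       # 40.0 = perimeter of the square's boundary
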
==============================================================================
THURSDAY MORNING
------------------------------------------------------------------------------
Sara van de Geer
Oracle inequalities for regularized estimators

The aim of this talk is to emphasize the special role of the $\ell_1$ penalty.
We will first explain that this penalty is different from penalties based on
the bias-variance trade-off. Next, we investigate least squares estimators
with general penalties, with smoothing splines as a special case. We examine
which penalties yield an oracle inequality for the estimator. For robust
regression estimators, the theory is similar, provided the margin behavior is
quadratic. The margin behavior is defined as the behavior of the theoretical
loss near its minimizer. We establish an oracle inequality that shows that the
$\ell_1$ penalty adapts both to the smoothness and to possibly non-quadratic
margin behavior.
------------------------------------------------------------------------------
Laurie Davies
Approximating data

The talk treats stochastic models as approximations to data in a consistent
manner. The basic idea is that a model is an adequate approximation for a
data set if typical samples generated under the model look like the data set.
This somewhat loose formulation is made precise, and building on this we
discuss problems of topologies in data analysis and inference. It is argued
that the topology of data analysis is weak (few open sets) whilst most
principles of inference make use of concepts such as likelihood which are
based on a strong topology (many open sets). This dichotomy can only be
removed by basing all of statistics on concepts which are continuous in the
weak topology. Examples of the idea of approximation, with applications to
nonparametric regression, densities and financial data, are given.
------------------------------------------------------------------------------
Grace Wahba
The multicategory support vector machine and the polychotomous penalized
likelihood estimate

We describe two modern methods for statistical model building and
classification: penalized likelihood methods and support vector machines
(SVMs). Both are obtained as solutions to optimization problems in
reproducing kernel Hilbert spaces (RKHS). A training set is given, and an
algorithm for classifying future observations is built from it. The
($k$-category) polychotomous penalized likelihood method returns a vector of
probabilities $(p_1(t), \cdots, p_k(t))$, where $t$ is the attribute vector
of the object to be classified. The multicategory support vector machine
returns a classifier vector $(f_1(t), \cdots, f_k(t))$ satisfying
$\sum_\ell f_{\ell}(t) = 0$, where $\max_{\ell} f_{\ell}(t)$ identifies the
category. The two-category SVM is very well known, while the multicategory
SVM (MSVM) described here, which includes modifications for unequal
misclassification costs and unrepresentative training sets, is new. We
describe applications of each method: for penalized likelihood, estimating
the 10-year probability of death due to several causes as a function of
several risk factors observed in a demographic study; and for the MSVM,
classifying radiance profiles from the MODIS instrument as clear, water cloud
or ice cloud. Some computational and tuning issues are noted.
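
A minimal sketch of the two-category penalized likelihood classifier that the
multicategory methods above generalize: kernel logistic regression in an
RKHS, fit here with an assumed Gaussian kernel and plain gradient descent on
the representer coefficients. It is an illustration only, not the speakers'
software, and is simplified relative to the methods described in the talk.

    import numpy as np

    def gaussian_kernel(X, Z, bandwidth=1.0):
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))

    def fit_penalized_logistic(X, y, lam=0.1, bandwidth=1.0, n_iter=2000):
        """y in {-1, +1}; returns c with f(t) = sum_i c_i K(t, x_i)."""
        K = gaussian_kernel(X, X, bandwidth)
        n = len(y)
        c = np.zeros(n)
        L = np.linalg.eigvalsh(K)[-1]                  # largest eigenvalue of K
        step = 1.0 / (L * (0.25 * L / n + 2.0 * lam))  # safe step for this smooth objective
        for _ in range(n_iter):
            f = K @ c
            p = 1.0 / (1.0 + np.exp(np.clip(y * f, -50, 50)))   # = sigmoid(-y*f)
            # gradient of average negative log-likelihood plus lam * c'Kc
            grad = K @ (-y * p) / n + 2.0 * lam * (K @ c)
            c -= step * grad
        return c

    def classify(X_train, c, t, bandwidth=1.0):
        return np.sign(gaussian_kernel(t, X_train, bandwidth) @ c)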