alb.info
(c) Peter M. Hooper, August 24, 2002

This document describes the input and output files for the FORTRAN
program alb.f. The alb.f program implements the regression methodology
described in

   Hooper, P.M. (2001). Flexible regression modeling with adaptive
   logistic basis functions (with discussion). The Canadian Journal
   of Statistics, 29:343-378.

I refer to this below as the ALB paper.

The alb.f program fits a flexible regression function as a linear
combination of logistic basis functions:

   y = f(x) = Sum_{k=1,nk} delta_k*phi_k(x)

   x    = a vector of np predictor variables (covariates)
   y    = a real-valued response variable
   f(x) = a measure of location (e.g., mean, median, or quantile) of
          the conditional distribution of y given x.

The logistic basis functions are parameterized in terms of squared
Euclidean distance to reference points xi_k in the covariate space,
with exponential functions normalized to sum to one:

   Set phi_k(x) = exp[ gamma_k - tau^{-2}*||x - xi_k||^2 ]
   then normalize so that Sum_{k=1,nk} phi_k(x) = 1 for all x.

Parameters are estimated from data {(x_i,y_i),i=1,n} by minimizing a
training risk; e.g.,

   Sum_{i=1,n} |y_i - f(x_i)|^pwr

The interpretation of f(x) depends on the choice of pwr:

   pwr = 2 for the conditional mean of y given x
   pwr = 1 for the conditional median (when alpha = 0.5)

Values of pwr between 1 and 2 can also be used for a robust location
measure between the mean and median.

When pwr = 1, one can obtain alpha-quantiles for probabilities
0 < alpha < 1. In this case, the training risk utilizes a check
function. See the last section of the ALB paper.

***********************************************************************

The Fortran 77 source code is contained in the text file alb2.f. You
will need to compile this using the f77 compiler on your system. I
recommend compiling with optimization enabled, to speed up
computation. Suppose the executable file is called alb2.exe.
To run alb2.exe, enter

   ./alb2.exe filestem

on the command line. Here "filestem" can be any word with at most 8
letters.

The program uses two input files:

   rvUtable     (supplied with the program)
   filestem.in  (prepared by the user)

The program creates six output files:

   filestem.out
   filestem.rsk
   filestem.fit
   filestem.grd
   filestem.clu
   filestem.lbf

as well as printing summary results to the monitor.

I have provided two example input files and a template:

   boyswh.in    (1 predictor)
   boston.in    (13 predictors)
   template.in  (template input file with no data)

***********************************************************************

Input file: rvUtable

rvUtable contains a list of random integers used by the random number
generator in alb.f.

***********************************************************************

Input file: filestem.in
(where "filestem" can be any name with at most 8 letters)

filestem.in is the main input file for alb. It contains parameters
controlling options, parameters describing the data set, and the
training and (optional) test data. I suggest that you simply edit a
copy of the template.in file, keeping the original. The format must be
exactly as shown in the template file. Blank rows must be correctly
located.

Title line: This can be anything. The first 80 characters are
reproduced in the output.

np    = number of predictors (denoted "d" in the ALB paper)
pwr   = power used in loss function (denoted "q" in the ALB paper)
alpha = quantile probability, 0 < alpha < 1, alpha = 0.5 for median
        A value for alpha must be entered, but alpha is ignored
        unless pwr = 1.
ctfm  = transformation parameter, ctfm >= 0
        ctfm = 0 standardizes each predictor to have zero mean and
          unit standard deviation
        ctfm > 0 calculates derived predictors as linear
          transformations of the standardized predictors, with
          reduced correlations among the derived predictors. A large
          value of ctfm produces uncorrelated principal components.
I usually set ctfm = 0, but larger values may improve performance of
the optimization algorithm when predictors are highly correlated and
f(x) is a function of principal components with small eigenvalues.

nk and lambda.
nk     = initial number of basis functions (denoted "K" in the ALB
         paper)
lambda = smoothing parameter (not discussed in the ALB paper)
         If lambda > 0, then the optimization algorithm repeatedly
         jitters each predictor by adding normally distributed random
         variables with mean 0 and standard deviation lambda. This
         has an effect similar to that of ridge regression. Smoothing
         can be useful when the number of predictors is large
         relative to the sample size.
One can enter multiple rows with (nk,lambda) values, and use e.g.
10-fold cross-validation to select nk. I usually set nk=1 and
lambda=0, then let the program select nk by generalized
cross-validation (GCV), as described below.

Cross-validation options.
indcv = 1 indicates that ncv-fold cross-validation is to be carried
        out
indcv = 0 otherwise
ncv   = number of subsets used in the cross-validation partition.
        ncv must be between 2 and n = the number of cases in the
        training set.
cvopt = option controlling the order of the cases used to construct
        the cross-validation partition.
        cvopt = 1 uses the sequential order of the cases in
          filestem.in.
        cvopt = 2 uses an order corresponding to a permutation input
          with the data (an integer appended following the last
          predictor in each line of filestem.in).
        cvopt = 3 randomly permutes the order of the cases,
          independently for each (nk,lambda) pair.
The ncv and cvopt lines must be present, but the values are ignored
if indcv = 0.

Output options:
iprint = 1 to print results to output files
iprint = 0 suppresses all output except summary results.
ipredict = 1 to print predictors, gradients, and standardized
           gradients to filestem.grd
ipredict = 0 suppresses this output
ngcl     = number of gradient cluster projections
nprint   = upper bound on number of cases printed

GCV search options (generalized cross-validation, assumes lambda=0)
isearch = 1 to carry out GCV search for optimal nk, starting with
          values of nk specified above
isearch = 0 to just fit model for specified values of nk
idr     = 1 to fit models based on a subset of the gradient principal
          components, and attempt to find a lower-dimensional model
          that improves on the original alb model
idr     = 0 to skip this
          This dimension reduction technique is not discussed in the
          ALB paper, and is still in an experimental state.
nfail   = number of successive failures before stopping GCV search
          I usually set nfail = 3 but (as discussed in the ALB paper)
          larger values are occasionally helpful.

Initialization
Tuning parameters controlling the gain function used in the
initialization step of the training algorithm: i.e.,
   gain(i) = gain(0)*w/(i + w), i=1,nm
I recommend leaving these at the values set below.
3000   nm1   determines number of iterations nm = nm1*nk**.5
1.0    rpt0  = rpt gain(0)
100    rpt1  determines w = rpt1*nk**.5
1      ctau  (tau = ctau*snn, if ctau <= 0 then set default 1.)
0      ywt   determines weight assigned to (dlta-yy)**2 in vq
Setting ywt=0 yields the initialization technique described in the
ALB paper; i.e., initial reference points are K-means cluster
centroids. Setting ywt>0 makes use of co-variation in y and x when
initializing the reference points. I am still experimenting with this.

Training algorithm tuning parameters:
I recommend leaving these at the values set below.
50000  nm1   determines number of iterations nm = nm1*nk**.5
.25    rpt0  = rpt gain(0)
500    rpt1  determines w = rpt1*nk**.5
10     nrep  = number of repetitions of first prop*nm iterations
.10    prop  = proportion of iterations repeated

Training sample.
One case per line, beginning with the response y, followed by the
predictors x (np values), followed by an optional integer (one number,
required if cvopt = 2). Values may be delimited by spaces, commas, or
tabs. A row of the form
   1.0D99 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
indicates the end of the training data.

Test sample.
Same format as training sample. If there is no test sample, there must
still be a row
   1.0D99 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...

***********************************************************************

Output file: filestem.rsk

filestem.rsk contains risks in a format suitable for plotting. This
can be useful when selecting nk and lambda. For example, using the
viking.in data, we obtain:

 iloop  icv   dim    nk  lambda  rsktrn  rskadj   rsktst
     1    1  0.00  1.00    0.00  1.0000  1.0004  -1.0000
     2    1  1.00  2.00    0.00  0.1038  0.1040  -1.0000
     3    1  2.00  3.00    0.00  0.0117  0.0117  -1.0000
     4    1  2.00  4.00    0.00  0.0093  0.0094  -1.0000
     5    1  2.00  5.00    0.00  0.0073  0.0074  -1.0000

iloop  indexes the separate (nk,lambda) rows in the input file
icv    indexes the ncv-fold cross-validation groups (when appropriate)
dim    = dimension of the fitted model, dim <= min{np,nk-1}
nk     = number of basis functions
lambda = smoothing parameter
rsktrn = training risk
rskadj = GCV-adjusted training risk
rsktst = test risk (set to -1 when there is no test data)

***********************************************************************

Output file: filestem.out

filestem.out contains more extensive summary information: a recap of
the input parameters, parameter estimates, risk estimates (for
training and test sets), and information about the gradient principal
components and the gradient cluster centroids (see the Boston housing
data example in Section 2 of the ALB paper).

***********************************************************************

Output file: filestem.fit

Contains fitted values and related variables. There are two formats.
When np=1, the following variables are produced:
   iloop cluster loss y f sef df sedf d2f x

iloop   indexes the (nk,lambda) pairs in the input file
cluster indicates membership of x-points in various gradient clusters
loss    = value of the loss function evaluated at (x,y)
y       = y value
f       = fitted value = estimated f(x)
sef     = standard error of f(x)
df      = first derivative of f(x)
sedf    = standard error of derivative of f(x)
d2f     = second derivative of f(x)
x       = x value (since np = 1)

When np>1, the following variables are produced:
   iloop cluster loss y f sef 601 602

Variables 601, 602, etc. contain the gradient principal components;
i.e., linear combinations of the original predictors that are most
useful for obtaining a global view. Often a 3D plot of f against the
first two gradient pc's will reveal much of the structure of f.

***********************************************************************

Output file: filestem.grd

When np > 1, contains gradients, standardized gradients, and
predictors, identified by number as follows:

201, 202, etc. contain the gradients; i.e., the derivatives of f(x)
               with respect to x_1, x_2, etc.
301, 302, etc. contain the standardized gradients; i.e., each
               gradient divided by its standard error.
101, 102, etc. contain the original predictors x_1, x_2, etc.

Plots of gradients can reveal additive structure. If the effect of x_1
on f(x) is additive, then a plot of 201 versus 101 may show a line or
curve with little scatter. If the joint effect of x_1 and x_2 on f(x)
is additive, then 3D plots of 201 and 202 against (101,102) may show
surfaces with little scatter.

Plots of the standardized gradients may reveal variables with
negligible effect on f(x). One may obtain a better fit by eliminating
variables where the standardized gradients are (nearly) all in the
interval (-2,+2).
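
The screening rule for standardized gradients can be sketched in
Python as follows. This is only an illustration and is not part of
alb.f. The function name negligible_predictors, the 0.95 cutoff used
for "(nearly) all", and the in-memory layout (one list per case, with
columns corresponding to variables 301, 302, etc.) are all assumptions.
Reading filestem.grd is left out because the exact file layout is not
described here.

```python
def negligible_predictors(std_grads, bound=2.0, min_fraction=0.95):
    """Return 1-based indices of predictors whose standardized gradients
    lie in the open interval (-bound, +bound) for at least min_fraction
    of the cases. min_fraction = 0.95 is an assumed reading of
    "(nearly) all"; it is not a value taken from alb.f.
    """
    if not std_grads:
        return []
    n_cases = len(std_grads)
    n_pred = len(std_grads[0])
    flagged = []
    for j in range(n_pred):
        # Count the cases where predictor j has a small standardized gradient.
        inside = sum(1 for row in std_grads if -bound < row[j] < bound)
        if inside / n_cases >= min_fraction:
            flagged.append(j + 1)
    return flagged


# Hypothetical standardized gradients: 4 cases, 3 predictors.
example = [
    [0.4, 3.1, -1.2],
    [-0.9, 2.8, 0.3],
    [1.5, 4.0, -0.7],
    [0.1, 3.5, 1.9],
]
print(negligible_predictors(example))
```

A flagged predictor is only a candidate for removal. The model should
be refitted without it and the risks compared before it is dropped.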
***********************************************************************

Output file: filestem.clu

Contains linear combinations of predictors determined by gradient
cluster centroids, with variables identified by numbers 501, 502, etc.
It may be helpful to plot the fitted values f against 501, 502, etc.,
with points colored by cluster, to obtain local views of the fitted
f(x). E.g., in a plot of f versus 501, the points in cluster 1 usually
lie close to a smooth curve (or possibly several curves, representing
subclusters with the same gradient direction but different locations
in the predictor space). The fitted values and cluster numbers appear
as variables f and cluster in filestem.fit.

***********************************************************************

Output file: filestem.lbf

Contains the logistic basis functions, in two formats.

When np=1, the file contains three variables:
k  = index for basis function
bf = basis function phi_k(x)
x  = predictor
This allows all basis functions to be plotted in a single plot, with
different colors for different basis functions.

When np>1, the file contains the basis functions phi_k(x), with
variables identified by numbers 801, 802, etc. If np=2, then one can
examine 3D plots of basis functions against (x_1,x_2) (variables 101,
102 in filestem.grd).

The basis functions do not appear to be useful for data analysis. They
are included in the alb.f program to assist in understanding the
components of the alb model.

***********************************************************************
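
As a further aid to understanding the components of the alb model, the
definition of the basis functions given at the start of this document
can be sketched in Python. This is an illustration only and is not a
translation of alb.f. The function names and the example parameter
values are hypothetical, and subtracting the largest exponent before
exponentiating is a standard numerical safeguard that I am assuming
here, not something taken from the program.

```python
import math


def logistic_basis(x, xi, gamma, tau):
    """Evaluate phi_1(x), ..., phi_nk(x) for one point x.

    x     : list of np predictor values
    xi    : list of nk reference points, each a list of np values
    gamma : list of nk offsets gamma_k
    tau   : common scale parameter
    """
    scores = []
    for xi_k, gamma_k in zip(xi, gamma):
        dist2 = sum((a - b) ** 2 for a, b in zip(x, xi_k))
        scores.append(gamma_k - dist2 / tau ** 2)
    top = max(scores)  # shift exponents to avoid overflow (assumed safeguard)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    # Normalize so that the basis functions sum to one at x.
    return [w / total for w in weights]


def fitted_value(x, xi, gamma, delta, tau):
    """f(x) = Sum_k delta_k * phi_k(x)."""
    phi = logistic_basis(x, xi, gamma, tau)
    return sum(d * p for d, p in zip(delta, phi))


# Hypothetical parameters with np = 1 and nk = 2.
xi = [[0.0], [1.0]]
gamma = [0.0, 0.0]
delta = [10.0, 20.0]
phi = logistic_basis([0.5], xi, gamma, tau=1.0)
print(phi)
print(fitted_value([0.5], xi, gamma, delta, tau=1.0))
```

In this example x = 0.5 is equidistant from the two reference points
and the gamma values are equal, so the two basis functions take the
same value and f(x) is the average of delta_1 and delta_2.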