alb.info
(c) Peter M. Hooper, August 24, 2002
This document provides a description of the input and output
files for the FORTRAN program alb.f.
The alb.f program implements regression methodology described in
Hooper, P.M. (2001).
Flexible regression modeling with adaptive logistic basis functions
(with discussion).
The Canadian Journal of Statistics, 29:343-378.
I refer to this below as the ALB paper.
The alb.f program fits a flexible regression function
as a linear combination of logistic basis functions:
y = f(x) = Sum_{k=1,nk} delta_{k}*phi_k(x)
x = a vector of np predictor variables (covariates)
y = a real-valued response variable
f(x) = a measure of location (e.g., mean, median, or quantile)
of the conditional distribution of y given x.
The logistic basis functions are parameterized in terms of squared
Euclidean distance to reference points xi_k in the covariate space,
with exponential functions normalized to sum to one:
Set phi_k(x) = exp[ gamma_k - tau^{-2}*||x - xi_k||^2 ]
then normalize so that Sum_{k=1,nk} phi_k(x) = 1 for all x.
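As a sketch of the computation (in Python, not the Fortran implementation; function and argument names are mine, not alb.f's):

```python
import math

def logistic_basis(x, xis, gammas, tau):
    """Normalized logistic basis functions phi_k(x).

    x      : list of np predictor values
    xis    : list of nk reference points xi_k (each a list of np values)
    gammas : list of nk offsets gamma_k
    tau    : scale parameter
    """
    # Unnormalized log-weights: gamma_k - ||x - xi_k||^2 / tau^2
    logw = [g - sum((xj - cj) ** 2 for xj, cj in zip(x, xi)) / tau ** 2
            for g, xi in zip(gammas, xis)]
    m = max(logw)                       # subtract the max for numerical stability
    w = [math.exp(v - m) for v in logw]
    s = sum(w)
    return [wk / s for wk in w]         # the phi_k sum to one
```

By construction the returned values are nonnegative and sum to one, so each phi_k(x) behaves like a soft membership weight for reference point xi_k.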
Parameters are estimated from data {(x_i,y_i),i=1,n}
by minimizing a training risk; e.g.,
Sum_{i=1,n} |y_i - f(x_i)|^pwr
The interpretation of f(x) depends on the choice of pwr:
pwr = 2 for the conditional mean of y given x
pwr = 1 for the conditional median (when alpha = 0.5)
Values of pwr between 1 and 2 can also be used for
a robust location measure between the mean and median.
When pwr = 1, one can obtain alpha-quantiles for probabilities
0 < alpha < 1. In this case, the training risk utilizes a
check function. See the last section of the ALB paper.
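The training risk above can be sketched as follows (a Python illustration, assuming the standard check/pinball loss for the quantile case; not taken from alb.f):

```python
def training_risk(y, fhat, pwr=2.0, alpha=0.5):
    """Training risk Sum_{i=1,n} loss(y_i - f(x_i)).

    For pwr > 1 the loss is |u|^pwr.  For pwr = 1 the check (pinball)
    loss rho(u) = u*(alpha - 1[u < 0]) is used, so minimizing the risk
    targets the alpha-quantile (alpha = 0.5 gives the median).
    """
    resid = (yi - fi for yi, fi in zip(y, fhat))
    if pwr == 1.0:
        return sum(u * (alpha if u >= 0 else alpha - 1.0) for u in resid)
    return sum(abs(u) ** pwr for u in resid)
```

With alpha = 0.5 the check loss reduces to |u|/2, so pwr = 1 with alpha = 0.5 indeed estimates the conditional median.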
***********************************************************************
The Fortran 77 source code is contained in the text file alb2.f .
You will need to compile this using the f77 compiler on your system.
I recommend compiling with optimization enabled, to speed up computation.
Suppose the executable file is called alb2.exe .
To run alb2.exe, enter
./alb2.exe filestem
on the command line. Here "filestem" can be any word with at most
8 letters. The program uses two input files:
rvUtable (supplied with the program) and
filestem.in (prepared by the user).
The program creates six output files:
filestem.out
filestem.rsk
filestem.fit
filestem.grd
filestem.clu
filestem.lbf
as well as printing summary results to the monitor.
I have provided two example input files and a template:
boyswh.in (1 predictor)
boston.in (13 predictors)
template.in (template input file with no data)
***********************************************************************
Input file: rvUtable
rvUtable contains a list of random integers used by the
random number generator in alb.f.
***********************************************************************
Input file: filestem.in
(where "filestem" can be any name with at most 8 letters)
filestem.in is the main input file for alb.
It contains parameters controlling options, parameters describing
the data set, and the training and (optional) test data.
I suggest that you simply edit the template.in file, saving
the original.
The format must be exactly as shown in the template file.
Blank rows must be correctly located.
Title line: This can be anything. The first 80 characters are
reproduced in the output.
np = number of predictors (denoted "d" in the ALB paper)
pwr = power used in loss function (denoted "q" in the ALB paper)
alpha = quantile probability, 0 < alpha < 1, alpha = 0.5 for median
A value for alpha must be entered, but alpha is ignored unless pwr = 1.
ctfm = transformation parameter, ctfm >= 0
ctfm = 0 standardizes each predictor to have zero mean and
unit standard deviation
ctfm > 0 calculates derived predictors as linear transformations of
the standardized predictors, with reduced correlations among the
derived predictors.
A large value of ctfm produces uncorrelated principal components.
I usually set ctfm = 0, but larger values may improve
performance of the optimization algorithm when predictors are highly
correlated and f(x) is a function of principal components with
small eigenvalues.
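The ctfm = 0 case can be sketched as plain standardization (a Python illustration; alb.f may divide by n-1 rather than n, which is an assumption here):

```python
def standardize(X):
    """ctfm = 0: center each predictor and scale it to unit standard
    deviation.  X is a list of cases, each a list of np predictor values.
    (Population SD, dividing by n, is assumed here.)"""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    sds = [(sum((row[j] - means[j]) ** 2 for row in X) / n) ** 0.5
           for j in range(p)]
    return [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in X]
```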
nk and lambda.
nk = initial number of basis functions (denoted "K" in the ALB paper)
lambda = smoothing parameter (not discussed in ALB paper)
If lambda > 0, then the optimization algorithm repeatedly jitters
each predictor by adding normally distributed random variables
with mean 0 and standard deviation lambda. This has an effect
similar to that of ridge regression. Smoothing can be useful
when the number of predictors is large relative to the sample size.
One can enter multiple rows with (nk,lambda) values,
and use e.g. 10-fold cross-validation to select nk.
I usually set nk=1 and lambda=0, then let the program select
nk by generalized cross-validation (GCV), as described below.
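The jittering step with lambda > 0 amounts to the following (a Python sketch of the idea; the actual perturbation schedule inside alb.f's optimizer may differ):

```python
import random

def jitter(X, lam, rng=None):
    """Replace each predictor value x by x + N(0, lam) noise.

    With lam = lambda > 0 the optimizer trains on perturbed copies of
    the predictors, which shrinks the fit much as ridge regression does.
    """
    rng = rng or random.Random(0)
    return [[xij + rng.gauss(0.0, lam) for xij in row] for row in X]
```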
Cross-validation options.
indcv = 1 indicates that ncv-fold cross-validation is to be carried out
indcv = 0 otherwise
ncv = number of subsets used in the cross-validation partition.
ncv must be between 2 and n = the number of cases in the training set.
cvopt = option controlling the order of the cases used
to construct cross-validation partition.
cvopt = 1 uses the sequential order of the cases in filestem.in.
cvopt = 2 uses an order corresponding to a permutation input
with the data (an integer appended following the last predictor
in each line of filestem.in).
cvopt = 3 randomly permutes the order of the cases, independently
for each (nk,lambda) pair.
The ncv and cvopt lines must be present, but the values
are ignored if indcv = 0.
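The three cvopt choices can be illustrated as follows (Python sketch; round-robin fold assignment over the chosen order is my assumption, and alb.f's actual partition rule may differ):

```python
import random

def cv_folds(n, ncv, cvopt=1, perm=None, rng=None):
    """Assign each of n cases to one of ncv folds, 2 <= ncv <= n.

    cvopt = 1 takes the cases in their order in filestem.in,
    cvopt = 2 takes a user-supplied permutation (the integer column
              appended after the last predictor),
    cvopt = 3 takes a random permutation.
    """
    if cvopt == 1:
        order = list(range(n))
    elif cvopt == 2:
        order = list(perm)
    else:
        order = list(range(n))
        (rng or random.Random(0)).shuffle(order)
    folds = [0] * n
    for pos, case in enumerate(order):
        folds[case] = pos % ncv     # round-robin over the chosen order
    return folds
```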
Output options:
iprint = 1 to print results to output files
iprint = 0 suppresses all output except summary results.
ipredict = 1 to print predictors, gradients, and standardized
gradients to filestem.grd
ipredict = 0 suppresses this output
ngcl = number of gradient cluster projections
nprint = upper bound on number of cases printed
GCV search options (generalized cross-validation, assumes lambda=0)
isearch = 1 to carry out GCV search for optimal nk, starting with
values of nk specified above
isearch = 0 to just fit model for specified values of nk
idr = 1 to fit models based on a subset of the gradient
principal components, and attempt to find a lower-dimensional
model that improves on original alb model
idr = 0 to skip this
This dimension reduction technique is not discussed in the ALB paper,
and is still in an experimental state.
nfail = number of successive failures before stopping GCV search
I usually set nfail = 3 but (as discussed in the ALB paper) larger
values are occasionally helpful.
Initialization
Tuning parameters controlling the gain function
used in the initialization step of the training algorithm:
i.e., gain(i) = gain(0)*w/(i + w), i=1,nm
I recommend leaving these at the values set below
3000 nm1 determines number of iterations nm = nm1*nk**.5
1.0 rpt0 = rpt gain(0)
100 rpt1 determines w = rpt1*nk**.5
1 ctau (tau = ctau*snn; if ctau <= 0, the default 1 is used)
0 ywt determines weight assigned to (dlta-yy)**2 in vq
Setting ywt=0 yields the initialization technique described in
the ALB paper; i.e., initial reference points are K-means cluster
centroids. Setting ywt>0 makes use of co-variation in y and x
when initializing the reference points. I am still experimenting
with this.
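The gain sequence above can be written out as follows (Python sketch; truncating nm to an integer is my assumption):

```python
def gain_schedule(nk, nm1=3000, rpt0=1.0, rpt1=100.0):
    """Gain sequence for the initialization step:
    gain(i) = gain(0)*w/(i + w), i = 1..nm,
    with nm = nm1*nk**0.5, w = rpt1*nk**0.5, and gain(0) = rpt0."""
    nm = int(nm1 * nk ** 0.5)
    w = rpt1 * nk ** 0.5
    return [rpt0 * w / (i + w) for i in range(1, nm + 1)]
```

The sequence starts near gain(0) and decays like w/i, so early iterations move the reference points quickly and later iterations refine them.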
Training algorithm tuning parameters:
I recommend leaving these at values set below.
50000 nm1 determines number of iterations nm = nm1*nk**.5
.25 rpt0 = rpt gain(0)
500 rpt1 determines w = rpt1*nk**.5
10 nrep = number of repetitions of first prop*nm iterations
.10 prop = proportion of iterations repeated
Training sample.
One case per line, beginning with the response y,
followed by predictors x (np values),
followed by optional integer (one number, required if cvopt = 3).
Values may be delimited by spaces, commas, or tabs.
A row of the form
1.0D99 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
indicates the end of the training data.
Test sample. Same format as training sample.
If there is no test sample, there must still be a row
1.0D99 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
***********************************************************************
Output file: filestem.rsk
filestem.rsk contains risks in a format suitable for plotting.
This can be useful when selecting nk and lambda. For example,
using the viking.in data, we obtain:
iloop icv dim nk lambda rsktrn rskadj rsktst
1 1 0.00 1.00 0.00 1.0000 1.0004 -1.0000
2 1 1.00 2.00 0.00 0.1038 0.1040 -1.0000
3 1 2.00 3.00 0.00 0.0117 0.0117 -1.0000
4 1 2.00 4.00 0.00 0.0093 0.0094 -1.0000
5 1 2.00 5.00 0.00 0.0073 0.0074 -1.0000
iloop indexes the separate (nk,lambda) rows in the input file
icv indexes the ncv-fold cross-validation groups (when appropriate)
dim = dimension of the fitted model, dim <= min{np,nk-1}
nk = number of basis functions
lambda = smoothing parameter
rsktrn = training risk
rskadj = GCV-adjusted training risk
rsktst = test risk (set to -1 when there is no test data)
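Since filestem.rsk is a plain whitespace-delimited table, it is easy to read into other software; a minimal Python parser (my own helper, not part of alb.f) might look like:

```python
def read_rsk(lines):
    """Parse the filestem.rsk table into one dict per row.
    All values are returned as floats; rsktst = -1 means no test data."""
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        vals = line.split()
        if not vals:
            continue
        rows.append({h: float(v) for h, v in zip(header, vals)})
    return rows
```

One can then plot rskadj against nk to pick the number of basis functions.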
***********************************************************************
Output file: filestem.out
filestem.out contains more extensive summary information:
a recap of the input parameters,
parameter estimates,
risk estimates (for training and test sets),
information about the gradient principal components and the
gradient cluster centroids (see the Boston housing data example
in Section 2 of the ALB paper)
***********************************************************************
Output file: filestem.fit
Contains fitted values and related variables.
There are two formats.
When np=1, the following variables are produced:
iloop cluster loss y f sef df sedf d2f x
iloop indexes the (nk,lambda) pairs in the input file
cluster indicates membership of x-points in various gradient clusters
loss = value of the loss function evaluated at (x,y)
y = y value
f = fitted value = estimated f(x)
sef = standard error of f(x)
df = first derivative of f(x)
sedf = standard error of derivative of f(x)
d2f = second derivative of f(x)
x = x value (since np = 1)
When np>1, the following variables are produced:
iloop cluster loss y f sef 601 602
Variables 601, 602, etc. contain the gradient principal components;
i.e., linear combinations of the original predictors that
are most useful for obtaining a global view. Often a 3D plot
of f against the first two gradient pc's will reveal much of
the structure of f.
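The idea behind the gradient principal components can be sketched as a power iteration on the second-moment matrix of the gradient vectors (a Python illustration of the concept only; the computation in alb.f and the ALB paper may differ in details such as centering):

```python
def first_gradient_pc(grads, iters=200):
    """First principal direction of the gradient vectors: the direction
    in predictor space along which f varies most.

    grads : list of n gradient vectors, each of length np.
    Uses power iteration on M = G^T G / n (uncentered second moments).
    """
    n, p = len(grads), len(grads[0])
    M = [[sum(g[a] * g[b] for g in grads) / n for b in range(p)]
         for a in range(p)]
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(M[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v
```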
***********************************************************************
Output file: filestem.grd
When np > 1, contains gradients, standardized gradients, and
predictors, identified by number as follows:
201, 202, etc. contain the gradients; i.e., the
derivatives of f(x) with respect to x_1, x_2, etc.
301, 302, etc. contain the standardized gradients;
i.e., each gradient divided by its standard error.
101, 102, etc. contain the original predictors x_1, x_2, etc.
Plots of gradients can reveal additive structure.
If the effect of x_1 on f(x) is additive, then a plot of
201 versus 101 may show a line or curve with little scatter.
If the joint effect of x_1 and x_2 on f(x) is additive, then
3D plots of 201 and 202 against (101,102) may show surfaces
with little scatter.
Plots of the standardized gradients may reveal variables with
negligible effect on f(x). One may obtain a better fit by eliminating
variables where the standardized gradients are (nearly) all
in the interval (-2,+2).
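This screening rule is easy to automate (a Python sketch; the function name and the strict-inequality convention are my choices):

```python
def negligible_predictors(std_grad, lo=-2.0, hi=2.0):
    """Flag predictors whose standardized gradients (the 301, 302, ...
    variables in filestem.grd) all lie in (lo, hi): these are candidates
    for elimination from the model.

    std_grad : list of n cases, each a list of np standardized gradients.
    Returns the 0-based indices of the flagged predictors.
    """
    p = len(std_grad[0])
    return [j for j in range(p)
            if all(lo < row[j] < hi for row in std_grad)]
```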
***********************************************************************
Output file: filestem.clu
Contains linear combinations of predictors determined by gradient
cluster centroids, with variables identified by numbers
501, 502, etc.
It may be helpful to plot the fitted values f against
501, 502, etc, with points colored by cluster, to obtain local
views of the fitted f(x). E.g., in a plot of f versus 501,
the points in cluster 1 usually lie close to a smooth curve
(or possibly several curves, representing subclusters with
the same gradient direction but different locations in the
predictor space).
The fitted values and cluster numbers appear as variables
f and cluster in filestem.fit.
***********************************************************************
Output file: filestem.lbf
Contains the logistic basis functions, in two formats.
When np=1, file contains three variables
k = index for basis function
bf = basis function phi_k(x)
x = predictor
This allows all basis functions to be plotted in a single plot,
with different colors for different basis functions.
When np>1, file contains the basis functions phi_k(x), with variables
identified by numbers 801, 802, etc.
If np=2, then one can examine 3D plots of basis functions against
(x_1,x_2) (variables 101, 102 in filestem.grd).
The basis functions do not appear to be useful for data analysis.
They are included in the alb.f program to assist in understanding
the components of the alb model.
***********************************************************************