Analyzes survival and reliability data using Cox’s proportional hazards model.
Synopsis
#include <imsls.h>
float *imsls_f_prop_hazards_gen_lin (int n_observations, int n_columns, float x[], int nef, int n_var_effects[], int indices_effects[], int max_class, int *ncoef, …, 0)
The type double function is imsls_d_prop_hazards_gen_lin.
Required Arguments
int n_observations (Input) Number of observations.
int n_columns (Input) Number of columns in x.
float x[] (Input) Array of length n_observations×n_columns containing the data. When optional argument itie = 1, the observations in x must be grouped by stratum and sorted from largest to smallest failure time within each stratum, with the strata separated.
int nef (Input) Number of effects in the model. In addition to effects involving classification variables, simple covariates and the product of simple covariates are also considered effects.
int n_var_effects[] (Input) Array of length nef containing the number of variables associated with each effect in the model.
int indices_effects[] (Input) Index array of length n_var_effects[0] + ... + n_var_effects[nef-1] containing the column indices of x associated with each effect. The first n_var_effects[0] elements of indices_effects contain the column indices of x for the variables in the first effect. The next n_var_effects[1] elements in indices_effects contain the column indices for the second effect, etc.
int max_class (Input) An upper bound on the total number of different values found among the classification variables in x. For example, if the model consisted of two class variables, one with the values {1, 2, 3, 4} and a second with the values {0, 1}, then the total number of different classification values is 4 + 2 = 6, and max_class >= 6.
int *ncoef (Output) Number of estimated coefficients in the model.
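The packed layout of n_var_effects and indices_effects, and the bound for max_class, can be sketched in plain C. The helper names below are hypothetical and are not part of the IMSL library; they only illustrate how the array lengths follow from the argument descriptions.

```c
/* Total length required for indices_effects: the sum of n_var_effects. */
static int indices_effects_length(int nef, const int n_var_effects[])
{
    int total = 0;
    for (int i = 0; i < nef; i++)
        total += n_var_effects[i];
    return total;
}

/* Position in indices_effects at which the columns of a given effect begin. */
static int effect_offset(int effect, const int n_var_effects[])
{
    int offset = 0;
    for (int i = 0; i < effect; i++)
        offset += n_var_effects[i];
    return offset;
}

/* Smallest admissible max_class: the sum, over all classification
   variables, of the number of distinct values each one takes. */
static int max_class_bound(int n_class_var, const int n_distinct[])
{
    int total = 0;
    for (int i = 0; i < n_class_var; i++)
        total += n_distinct[i];
    return total;
}
```

As an illustration (an assumed model, not one from this document), two simple covariates in columns 2 and 3 plus their product would use n_var_effects = {1, 1, 2} and indices_effects = {2, 3, 2, 3}: the length is 4 and the third effect starts at offset 2. For the two class variables in the max_class description, the bound is 4 + 2 = 6.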
Return Value
Pointer to an array of length ncoef×4, coef, containing the parameter estimates and associated statistics.
Column   Statistic
1        Coefficient estimate.
2        Estimated standard deviation of the estimated coefficient.
3        Asymptotic normal score for testing that the coefficient is zero against the two-sided alternative.
4        p-value associated with the normal score in column 3.
Synopsis with Optional Arguments
#include <imsls.h>
float *imsls_f_prop_hazards_gen_lin (int n_observations, int n_columns, float x[], int nef, int n_var_effects[], int indices_effects[], int max_class, int *ncoef,
IMSLS_RETURN_USER, float coef[] (Output) If specified, coef is an array of length ncoef×4 containing the parameter estimates and associated statistics. See Return Value.
IMSLS_PRINT_LEVEL, int iprint (Input) Printing option.
iprint   Action
0        No printing is performed.
1        Printing is performed, but observational statistics are not printed.
2        All output statistics are printed.
Default: iprint = 0.
IMSLS_MAX_ITERATIONS, int max_iterations (Input) Maximum number of iterations. max_iterations = 30 will usually be sufficient. Use max_iterations = 0 to compute the Hessian and gradient, stored in cov and gr, at the initial estimates. When max_iterations = 0, IMSLS_INITIAL_EST_INPUT must be used.
Default: max_iterations = 30.
IMSLS_CONVERGENCE_EPS, float eps (Input) Convergence criterion. Convergence is assumed when the relative change in algl from one iteration to the next is less than eps. If eps is zero, eps = 0.0001 is assumed.
Default: eps = 0.0001.
IMSLS_RATIO, float ratio (Input) Ratio at which a stratum is split into two strata. Let $r_k = \exp(z_k\hat{\beta} + w_k)$ be the observation proportionality constant, where $z_k$ is the design row vector for the k-th observation and $w_k$ is the optional fixed parameter specified by $x_{k,\text{ifix}}$. Let $r_{\min}$ be the minimum value of $r_k$ in a stratum, where, for failed observations, the minimum is over all times less than or equal to the time of occurrence of the k-th observation. Let $r_{\max}$ be the maximum value of $r_k$ for the remaining observations in the group. Then, if $r_{\min} > \text{ratio} \cdot r_{\max}$, the observations in the group are divided into two groups at k. ratio = 1000 is usually a good value. Set ratio = -1.0 if no division into strata is to be made.
Default: ratio = 1000.0.
IMSLS_X_RESPONSE_COL, int irt (Input) Column index in x containing the response variable. For point observations, $x_{i,\text{irt}}$ contains the time of the i-th event. For right-censored observations, $x_{i,\text{irt}}$ contains the right-censoring time. Note that because imsls_f_prop_hazards_gen_lin only uses the order of the events, negative “times” are allowed.
Default: irt = 0.
IMSLS_CENSOR_CODES_COL, int icen (Input) Column index in x containing the censoring code for each observation.
$x_{i,\text{icen}}$   Censoring
0        Point observation. The exact failure time is $x_{i,\text{irt}}$.
1        Right censored. The exact failure time is greater than $x_{i,\text{irt}}$.
Default: A censoring code of 0 is assumed for all observations.
IMSLS_STRATIFICATION_COL, int istrat (Input) Column index in x containing the stratification variable. Column istrat in x contains a unique number for each stratum. The risk set for an observation is determined by its stratum.
Default: All observations are considered to be in one stratum.
IMSLS_CONSTANT_COL, int ifix (Input) Column index in x containing a constant, $w_i$, to be added to the linear response. The linear response is taken to be $w_i + z_i\hat{\beta}$, where $w_i$ is the observation constant, $z_i$ is the observation design row vector, and $\hat{\beta}$ is the vector of estimated parameters. The “fixed” constant allows one to test hypotheses about parameters via the log-likelihoods.
Default: wi is assumed to be 0 for all observations.
IMSLS_FREQ_RESPONSE_COL, int ifrq (Input) Column index in x containing the number of responses for each observation.
Default: A response frequency of 1 for each observation is assumed.
IMSLS_TIES_OPTION, int itie (Input) Method for handling ties.
itie   Method
0      Breslow’s approximate method.
1      Failures are assumed to occur in the same order as the observations input in x. The observations in x must be sorted from largest to smallest failure time within each stratum, and grouped by stratum. All observations are treated as if their failure/censoring times were distinct when computing the log-likelihood.
Default: itie = 0.
IMSLS_MAXIMUM_LIKELIHOOD, float *algl (Output) The maximized log-likelihood.
IMSLS_N_MISSING, int *nrmiss (Output) Number of rows of data in x that contain missing values in one or more of the columns irt, ifrq, ifix, icen, istrat, index_class_var, or indices_effects.
IMSLS_STATISTICS, float **case (Output) Address of a pointer to an array of length n_observations×5 containing the case statistics for each observation.
Column   Statistic
1        Estimated survival probability at the observation time.
2        Estimated observation influence or leverage.
3        A residual estimate.
4        Estimated cumulative baseline hazard rate.
5        Observation proportionality constant.
IMSLS_STATISTICS_USER, float case[] (Output) Storage for case is provided by the user. See IMSLS_STATISTICS.
IMSLS_X_MEAN, float **xmean (Output) Address of a pointer to an array of length ncoef containing the means of the design variables.
IMSLS_X_MEAN_USER, float xmean[] (Output) Storage for xmean is provided by the user. See IMSLS_X_MEAN.
IMSLS_VARIANCE_COVARIANCE_MATRIX, float **cov (Output) Address of a pointer to an array of length ncoef×ncoef containing the estimated asymptotic variance-covariance matrix of the parameters. For max_iterations = 0, cov contains the inverse of the Hessian of the negative of the log-likelihood, computed at the estimates input in in_coef.
IMSLS_VARIANCE_COVARIANCE_MATRIX_USER, float cov[] (Output) Storage for cov is provided by the user. See IMSLS_VARIANCE_COVARIANCE_MATRIX.
IMSLS_INITIAL_EST_INPUT, float *in_coef (Input) An array of length ncoef containing the initial estimates of the coefficients on input to imsls_f_prop_hazards_gen_lin.
Default: all initial estimates are taken to be 0.
IMSLS_UPDATE, float **gr (Output) Address of a pointer to an array of length ncoef containing the last parameter updates (excluding step halvings). For max_iterations = 0, gr contains the inverse of the Hessian times the gradient vector computed at the estimates input in in_coef.
IMSLS_UPDATE_USER, float gr[] (Output) Storage for gr is provided by the user. See IMSLS_UPDATE.
IMSLS_DUMMY, int n_class_var, int index_class_var[] (Input) Variable n_class_var is the number of classification variables. Dummy variables are generated for classification variables using dummy_method = IMSLS_LEAVE_OUT_LAST of the IMSLS_DUMMY option of function imsls_f_regressors_for_glm (see Regression). Argument index_class_var is an index array of length n_class_var containing the column numbers of x that are the classification variables. (If n_class_var is equal to zero, index_class_var is not used.)
Default: n_class_var = 0.
IMSLS_STRATUM_NUMBER, int **igrp (Output) Address of a pointer to an array of length n_observations giving the stratum number used for each observation. If ratio is not -1.0, additional “strata” (other than those specified by column istrat of x) may be generated. igrp also contains a record of the generated strata. See the Description section for more detail.
IMSLS_STRATUM_NUMBER_USER, int igrp[] (Output) Storage for igrp is provided by the user. See IMSLS_STRATUM_NUMBER.
IMSLS_CLASS_VARIABLES, int **n_class_values, float **class_values (Output) n_class_values is an address of a pointer to an array of length n_class_var containing the number of values taken by each classification variable. n_class_values[i] is the number of distinct values for the i-th classification variable. class_values is an address of a pointer to an array of length n_class_values[0] + n_class_values[1] + … + n_class_values[n_class_var-1] containing the distinct values of the classification variables. The first n_class_values[0] elements of class_values contain the values for the first classification variable, the next n_class_values[1] elements contain the values for the second classification variable, etc.
IMSLS_CLASS_VARIABLES_USER, int n_class_values[], float class_values[] (Output) Storage for n_class_values and class_values is provided by the user. Because the length of class_values is not known in advance, use max_class as the maximum length of class_values. See IMSLS_CLASS_VARIABLES.
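When itie = 1 is used, the rows of x must be grouped by stratum and ordered from largest to smallest failure time within each stratum. A minimal sketch of such a reordering with the standard qsort function is given here. It assumes that x is stored row by row (row-major), and the helper names are hypothetical. Because qsort is not a stable sort, rows with tied times end up in an unspecified order, so any deliberate tie-breaking still has to be done by the caller.

```c
#include <stdlib.h>

/* Column indices consulted by the comparator. File-scope variables are
   used because qsort passes no user context to the comparison function. */
static int g_time_col;
static int g_stratum_col;

/* Order rows by stratum (ascending), then by time (descending). */
static int compare_rows(const void *a, const void *b)
{
    const float *ra = (const float *)a;
    const float *rb = (const float *)b;

    if (ra[g_stratum_col] < rb[g_stratum_col]) return -1;
    if (ra[g_stratum_col] > rb[g_stratum_col]) return 1;
    if (ra[g_time_col] > rb[g_time_col]) return -1;
    if (ra[g_time_col] < rb[g_time_col]) return 1;
    return 0;
}

/* Sort the rows of a row-major n_observations-by-n_columns array in place. */
static void sort_rows_for_itie1(float x[], int n_observations, int n_columns,
                                int time_col, int stratum_col)
{
    g_time_col = time_col;
    g_stratum_col = stratum_col;
    qsort(x, (size_t)n_observations, (size_t)n_columns * sizeof(float),
          compare_rows);
}
```

Here time_col and stratum_col play the roles of irt and istrat. If the data contain no stratification column, a column holding the same value in every row can serve as stratum_col.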
Description
Function imsls_f_prop_hazards_gen_lin computes parameter estimates and other statistics in Proportional Hazards Generalized Linear Models. These models were first proposed by Cox (1972). Two methods for handling ties are allowed in imsls_f_prop_hazards_gen_lin. Time-dependent covariates are not allowed. The user is referred to Cox and Oakes (1984), Kalbfleisch and Prentice (1980), Elandt-Johnson and Johnson (1980), Lee (1980), or Lawless (1982), among other texts, for a thorough discussion of the Cox proportional hazards model.
Let $\lambda(t, z_i)$ represent the hazard rate at time t for observation number i with covariables contained as elements of row vector $z_i$. The basic assumption in the proportional hazards model (the proportionality assumption) is that the hazard rate can be written as a product of a time-varying function $\lambda_0(t)$, which depends only on time, and a function $f(z_i)$, which depends only on the covariable values. The function $f(z_i)$ used in imsls_f_prop_hazards_gen_lin is given as $f(z_i) = \exp(w_i + \beta z_i)$, where $w_i$ is a fixed constant assigned to the observation, and $\beta$ is a vector of coefficients to be estimated. With this function one obtains a hazard rate $\lambda(t, z_i) = \lambda_0(t) \exp(w_i + \beta z_i)$. The form of $\lambda_0(t)$ is not important in proportional hazards models.
The constants $w_i$ may be known theoretically. For example, the hazard rate may be proportional to a known length or area, and the $w_i$ can then be determined from this known length or area. Alternatively, the $w_i$ may be used to fix a subset of the coefficients $\beta$ (say, $\beta_1$) at specified values. When $w_i$ is used in this way, constants $w_i = \beta_1 z_{1i}$ are used, while the remaining coefficients in $\beta$ are free to vary in the optimization algorithm. If user-specified constants are not desired, the user should omit the IMSLS_CONSTANT_COL option so that $w_i = 0$ will be used.
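Fixing a subset of coefficients amounts to storing the corresponding linear combination in the constant column of x. A minimal sketch, assuming row-major storage of x and simple covariates (no classification variables) among the fixed terms, is shown here; the helper name and its arguments other than x, n_observations, n_columns, and ifix are hypothetical.

```c
#include <stddef.h>

/* For every row of x, store w_i = sum_j beta_fixed[j] * x[i, fixed_cols[j]]
   in column ifix. The columns listed in fixed_cols are the covariates
   whose coefficients are held at the values in beta_fixed. */
static void fill_constant_column(float x[], int n_observations, int n_columns,
                                 int ifix, int n_fixed,
                                 const int fixed_cols[],
                                 const float beta_fixed[])
{
    for (int i = 0; i < n_observations; i++) {
        float *row = &x[(size_t)i * (size_t)n_columns];
        float w = 0.0f;
        for (int j = 0; j < n_fixed; j++)
            w += beta_fixed[j] * row[fixed_cols[j]];
        row[ifix] = w;
    }
}
```

The fixed covariates would then be left out of the effects passed in indices_effects, so that only the remaining coefficients are estimated. Comparing the maximized log-likelihoods of the constrained and unconstrained fits gives the hypothesis test mentioned under IMSLS_CONSTANT_COL.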
With this definition of $\lambda(t, z_i)$, the usual partial (or marginal, see Kalbfleisch and Prentice (1980)) likelihood becomes
$$L = \prod_{i=1}^{n_d} \frac{\exp(w_i + \beta z_i)}{\sum_{j \in R(t_i)} \exp(w_j + \beta z_j)}$$
where $R(t_i)$ denotes the set of indices of observations that have not yet failed at time $t_i$ (the risk set), $t_i$ denotes the time of failure for the i-th observation, and $n_d$ is the total number of observations that fail. Right-censored observations (i.e., observations that are known to have survived to time $t_i$, but for which no time of failure is known) are incorporated into the likelihood through the risk set $R(t_i)$. Such observations never appear in the numerator of the likelihood. When itie = 0, all observations that are censored at time $t_i$ are not included in $R(t_i)$, while all observations that fail at time $t_i$ are included in $R(t_i)$.
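If it can be assumed that the dependence of the hazard rate upon the covariate values remains the same from stratum to stratum, while the time-dependent term, $\lambda_0(t)$, may be different in different strata, then imsls_f_prop_hazards_gen_lin allows the incorporation of strata into the likelihood as follows. Let k index the m strata identified by column istrat of x, and let $n_k$ be the number of failures in stratum k. Then, the likelihood is given by
$$L_S = \prod_{k=1}^{m} \left[ \prod_{i=1}^{n_k} \frac{\exp(w_{ki} + \beta z_{ki})}{\sum_{j \in R(t_{ki})} \exp(w_{kj} + \beta z_{kj})} \right]$$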
In imsls_f_prop_hazards_gen_lin, the log of the likelihood is maximized with respect to the coefficients β. A quasi-Newton algorithm approximating the Hessian via the matrix of sums of squares and cross products of the first partial derivatives is used in the initial iterations (the “Q-N” method in the output). When the change in the log-likelihood from one iteration to the next is less than 100*eps, Newton-Raphson iteration is used (the “N-R” method). If, during any iteration, the initial step does not lead to an increase in the log-likelihood, then step halving is employed to find a step that will increase the log-likelihood.
Once the maximum likelihood estimates have been computed, imsls_f_prop_hazards_gen_lin computes estimates of a probability associated with each failure. Within stratum k, an estimate of the probability that the i-th observation fails at time $t_{ki}$ given the risk set $R(t_{ki})$ is given by
$$p_{ki} = \frac{\exp(w_{ki} + z_{ki}\hat{\beta})}{\sum_{j \in R(t_{ki})} \exp(w_{kj} + z_{kj}\hat{\beta})}$$
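A diagnostic “influence” or “leverage” statistic is computed for each noncensored observation as:
$$l_{ki} = -g'_{ki} H_s^{-1} (g'_{ki})^{T}$$
where $H_s$ is the matrix of second partial derivatives of the log-likelihood, and $g'_{ki}$ is computed as:
$$g'_{ki} = z_{ki} - \frac{z_{ki} \exp(w_{ki} + z_{ki}\hat{\beta})}{\sum_{j \in R(t_{ki})} \exp(w_{kj} + z_{kj}\hat{\beta})}$$
Influence statistics are not computed for censored observations.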
A “residual” is computed for each of the input observations according to methods given in Cox and Oakes (1984, page 108). Residuals are computed as
$$r_{ki} = \exp(w_{ki} + z_{ki}\hat{\beta}) \sum_{t_{kj} \le t_{ki}} \frac{d_{kj}}{\sum_{l \in R(t_{kj})} \exp(w_{kl} + z_{kl}\hat{\beta})}$$
where $d_{kj}$ is the number of tied failures in group k at time $t_{kj}$. Assuming that the proportional hazards assumption holds, the residuals should approximate a random sample (with censoring) from the unit exponential distribution. By subtracting the expected values, centered residuals can be obtained. (The j-th expected order statistic from the unit exponential with censoring is given as
$$e_j = \sum_{l \le j} \frac{1}{h - l + 1}$$
where h is the sample size, and censored observations are not included in the summation.)
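The expected order statistics used for centering can be sketched as a running sum of 1 / (h - l + 1) over the ordered observations, skipping censored ones. The function name and the failed flag array are hypothetical, and the observations are assumed to be already ordered from smallest to largest residual.

```c
/* Expected order statistics from the unit exponential with censoring.
   failed[j-1] is 1 if the j-th ordered observation failed and 0 if it
   was censored; censored observations add nothing to the running sum.
   On return, e[j-1] holds the expected value for the j-th observation. */
static void expected_order_statistics(int h, const int failed[], double e[])
{
    double running = 0.0;

    for (int j = 1; j <= h; j++) {
        if (failed[j - 1])
            running += 1.0 / (double)(h - j + 1);
        e[j - 1] = running;
    }
}
```

For h = 3 with no censoring the expected values are 1/3, 5/6, and 11/6. Subtracting e[j-1] from the j-th ordered residual gives the centered residual.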
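An estimate of the cumulative baseline hazard within group k is given as
$$\hat{H}_{k0}(t_{ki}) = \sum_{t_{kj} \le t_{ki}} \frac{d_{kj}}{\sum_{l \in R(t_{kj})} \exp(w_{kl} + z_{kl}\hat{\beta})}$$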
The observation proportionality constant is computed as
$$\exp(w_{ki} + z_{ki}\hat{\beta})$$
Programming Notes
1. The covariate vectors $z_{ki}$ are computed from each row of the input matrix x via function imsls_f_regressors_for_glm (see Chapter 2, Regression). Thus, class variables are easily incorporated into the $z_{ki}$. The reader is referred to the documentation for imsls_f_regressors_for_glm in the regression chapter for a more detailed discussion.
Note that imsls_f_prop_hazards_gen_lin calls imsls_f_regressors_for_glm with dummy_method = IMSLS_LEAVE_OUT_LAST of the IMSLS_DUMMY option.
2. The average of each of the explanatory variables is subtracted from the variable prior to computing the product $z_{ki}\beta$. Subtraction of the mean values has no effect on the computed log-likelihood or the estimates since the constant term occurs in both the numerator and denominator of the likelihood. Subtracting the mean values does help to avoid invalid exponentiation in the algorithm and may also speed convergence.
3. Function imsls_f_prop_hazards_gen_lin allows for two methods of handling ties. In the first method (itie = 1), the user is allowed to break ties in any manner desired. When this method is used, it is assumed that the user has sorted the rows in x from largest to smallest with respect to the failure/censoring times $x_{i,\text{irt}}$ within each stratum (and across strata), with tied observations (failures or censored) broken in the manner desired. The same effect can be obtained with itie = 0 by adding (or subtracting) a small amount from each of the tied observations' failure/censoring times $t_i = x_{i,\text{irt}}$ so as to break the ties in the desired manner.
The second method for handling ties (itie = 0) uses an approximation for the tied likelihood proposed by Breslow (1974). The likelihood in Breslow’s method is as specified above, with the risk set at time $t_i$ including all observations that fail at time $t_i$, while all observations that are censored at time $t_i$ are not included. (Tied censored observations are assumed to be censored immediately prior to the time $t_i$.)
4. If the IMSLS_INITIAL_EST_INPUT option is used, then it is assumed that the user has provided initial estimates for the model coefficients $\beta$ in in_coef. When initial estimates are provided by the user, care should be taken to ensure that the estimates correspond to the generated covariate vector $z_{ki}$. If the IMSLS_INITIAL_EST_INPUT option is not used, then initial estimates of zero are used for all of the coefficients. This corresponds to no effect from any of the covariate values.
5. If a linear combination of covariates is monotonically increasing or decreasing with increasing failure times, then one or more of the estimated coefficients is infinite and extended maximum likelihood estimates must be computed. Such estimates may be written as $\hat{\beta} = \hat{\beta}_f + \rho\hat{\gamma}$, where $\rho = \infty$ at the supremum of the likelihood so that $\hat{\beta}_f$ is the finite part of the solution. In imsls_f_prop_hazards_gen_lin, it is assumed that extended maximum likelihood estimates must be computed if, within any group k, for any time t,
$$\min_{t_{ki} \le t} \exp(w_{ki} + z_{ki}\hat{\beta}) > \rho \max_{t_{kj} > t} \exp(w_{kj} + z_{kj}\hat{\beta})$$
where $\rho$ = ratio is specified by the user. Thus, for example, if $\rho = 10000$, then imsls_f_prop_hazards_gen_lin does not compute extended maximum likelihood estimates until the estimated proportionality constant
$$\exp(w_{ki} + z_{ki}\hat{\beta})$$
is 10000 times larger for all observations prior to t than for all observations after t. When this occurs, imsls_f_prop_hazards_gen_lin computes estimates for $\hat{\beta}_f$ by splitting the failures in stratum k into two strata at t (see Bryson and Johnson 1981). Censored observations in stratum k are placed into a stratum based upon the associated value for
$$\exp(w_{ki} + z_{ki}\hat{\beta})$$
The results of the splitting are returned in igrp.
The estimates $\hat{\beta}_f$ based upon the stratified likelihood represent the finite part of the extended maximum likelihood solution. Function imsls_f_prop_hazards_gen_lin does not compute $\hat{\gamma}$ explicitly, but an estimate for $\hat{\gamma}$ may be obtained in some circumstances by setting ratio = -1 and optimizing the log-likelihood without forming additional strata. The solution $\hat{\beta}$ obtained will be such that $\hat{\beta} = \hat{\beta}_f + \rho\hat{\gamma}$ for some finite value of $\rho > 0$. At this solution, the Newton-Raphson algorithm will not have “converged” because the Newton-Raphson step sizes returned in gr will be large, at least for some variables. Convergence will be declared, however, because the relative change in the log-likelihood during the final iterations will be small.
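The splitting rule in note 5 can be sketched as a scan over candidate split points within one stratum. The code below is a simplified illustration rather than the library's algorithm: it assumes the proportionality constants r are already ordered by increasing time, ignores ties and the special placement of censored observations, and treats any negative ratio as "no splitting". The function name is hypothetical.

```c
/* Return the first index k such that the smallest proportionality
   constant among observations 0..k exceeds ratio times the largest one
   among observations k+1..n-1, or -1 if no such split point exists.
   r[] must be ordered by increasing failure time. */
static int find_split_index(int n, const double r[], double ratio)
{
    if (ratio < 0.0)
        return -1; /* splitting disabled, as with ratio = -1.0 */

    for (int k = 0; k + 1 < n; k++) {
        double r_min = r[0];
        for (int i = 1; i <= k; i++)
            if (r[i] < r_min)
                r_min = r[i];

        double r_max = r[k + 1];
        for (int i = k + 2; i < n; i++)
            if (r[i] > r_max)
                r_max = r[i];

        if (r_min > ratio * r_max)
            return k;
    }
    return -1;
}
```

For r = {5000, 3000, 1, 2} and ratio = 1000, no split occurs after the first observation (5000 is not greater than 1000 × 3000), but one occurs after the second (3000 > 1000 × 2), so the function returns 1.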
Example
The following data are taken from Lawless (1982, page 287) and involve the survival of lung cancer patients based upon their initial tumor types and treatment type. In the first example, the likelihood is maximized with no strata present in the data. This corresponds to Example 7.2.3 in Lawless (1982, page 367). The input data is printed in the output. The model is given as:
$$\ln(\lambda) = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \alpha_i + \gamma_j$$
where $\alpha_i$ and $\gamma_j$ correspond to dummy variables generated from column indices 5 and 6 of x, respectively, $x_1$ corresponds to column index 2, $x_2$ corresponds to column index 3, and $x_3$ corresponds to column index 4 of x.