2 Methods

2.1 Notation

We observe RCT data \((Z, X, Y)\), where for each patient \(Z_i= 0, 1\) is the treatment status, \(Y_i = 0, 1\) is the observed outcome and \(X_i\) is a set of measured covariates. Let \(\{Y_i(z), z=0, 1\}\) denote the unobservable potential outcomes. We observe \(Y_i = Z_iY_i(1) + (1 - Z_i)Y_i(0)\). We are interested in predicting the conditional average treatment effect (CATE), \[\tau(x) = E\{Y(0) - Y(1)|X=x\}\] Assuming that \((Z, X, Y)\) is a random sample from the target population and that \(\big(Y(0), Y(1)\big)\perp \!\!\! \perp Z|X\), as we are in the RCT setting, we can predict CATE from \[\begin{align*} \tau(x) &= E\{Y(0)\:\vert\:X=x\}-E\{Y(1)\:\vert\:X=x\}\\ &=E\{Y\:\vert\:X=x, Z=0\}-E\{Y\:\vert\:X=x, Z=1\} \end{align*}\]

2.2 Simulation scenarios

We simulated a typical RCT, comparing equally-sized treatment and control arms in terms of a binary outcome. For each patient we generated 8 baseline covariates \(x_1,\dots,x_4\sim N(0, 1)\) and \(x_5,\dots,x_8\sim B(1,0.2)\). Outcomes in the control arm were generated from Bernoulli variables with true probabilities following a logistic regression model including all baseline covariates, i.e. \(P(Y(0)=1\:\vert\:X=x) = \text{expit}(lp_0) = e^{lp_0}/(1+e^{lp_0})\), with \(lp_0=lp_0(x)=x^t\beta\). In the base scenarios coefficient values \(\beta\) were such, that the AUC of the logistic regression model was 0.75 and the event rate in the control arm was \(20\%\).

Outcomes in the treatment arm were first generated using 3 simple scenarios: absent (OR = 1), moderate (OR = 0.8) or strong (OR = 0.5) constant relative treatment effect. We then introduced linear, quadratic and non-monotonic deviations from constant treatment effects using: \[lp_1 = \gamma_2(lp_0-c)^2 + \gamma_1(lp_0-c) + \gamma_0, \] where \(lp_1\) is the true linear predictor in the treatment arm, so that \(P(Y(1)=1\:\vert\:X=x) = \text{expit}(lp_1)\). Finally, we incorporated constant absolute harms for all treated patients, such that \(P(Y(1)=1|X=x) = \text{expit}(lp_1) + \text{harm}\).

The sample size for the base scenarios was set to 4,250 (80% power for the detection of a marginal OR of 0.8 with the standard alpha of 5%). We evaluated the effect of smaller or larger sample sizes of 1,063 and 17,000, respectively. We also evaluated the effect of risk model discriminative ability, adjusting the baseline covariate coefficients, such that the AUC of the regression model in the control arm was 0.65 and 0.85, respectively.

These settings resulted in a simulation study of 648 scenarios covering the observed HTE in 32 large trials as well as many other potential variations of risk-based treatment effect (Supplement, Sections 2 and 3) [7].

2.3 Individualized risk-based benefit predictions

In each simulation run we internally developed a prediction model on the entire population, using a logistic regression with main effects for all baseline covariates and treatment assignment. Individual risk predictions were derived by setting treatment assignment to 0. Another approach would be to derive the prediction model solely on the control patients; however, it has been shown to lead to biased benefit predictions [5,8,9].

A stratified HTE method has been suggested as an alternative to traditional subgroup analyses [3,4]. Patients are stratified into equally-sized risk strata—in this case based on risk quartiles. Absolute treatment effects within risk strata are estimated by the difference in event rate between control and treatment arm patients. We considered this approach as a reference, expecting it to perform worse than the other candidates, as its objective is to provide an illustration of HTE rather than to optimize individualized benefit predictions.

Second, we considered a model which assumes constant relative treatment effect (constant odds ratio). Hence, absolute benefit is predicted from \(\tau(x;\hat{\beta}) = \text{expit}(\hat{lp}_0) - \text{expit}(\hat{lp}_0+\delta_0)\), where \(\delta_0\) is the log of the assumed constant odds ratio and \(\hat{lp}_0 = \hat{lp}_0(x;\hat{\beta}) = x^t\hat{\beta}\) the linear predictor of the estimated baseline risk model.

Third, we considered a logistic regression model including treatment, the prognostic index, and their linear interaction. Absolute benefit is then estimated from \(\tau(x;\hat{\beta}) = \text{expit}(\delta_0+\delta_1\hat{lp}_0) - \text{expit}(\delta_0+\delta_2+(\delta_1+\delta_3)\hat{lp}_0)\) We will refer to this method as the linear interaction approach.

Fourth, we used restricted cubic splines (RCS) to relax the linearity assumption on the effect of the linear predictor [10]. We considered splines with 3 (RCS-3), 4 (RCS-4) and 5 (RCS-5) knots to compare models with different levels of flexibility.

Finally, we considered an adaptive approach using Akaike’s Information Criterion (AIC) for model selection. More specifically, we ranked the constant relative treatment effect model, the linear interaction model, and the RCS models with 3, 4, and 5 knots based on their AIC and selected the one with the lowest value. The extra degrees of freedom were 1 (linear interaction), 2, 3 and 4 (RCS models) for these increasingly complex interactions with the treatment effect.

2.4 Evaluation metrics

We evaluated the predictive accuracy of the considered methods by the root mean squared error (RMSE):

\[\text{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^n\big(\tau(x_i) - \hat{\tau}(x_i)\big)^2}\]

We compared the discriminative ability of the methods under study using c-for-benefit and the integrated calibration index (ICI) for benefit (Supplement, Section 6) [11].

For each scenario we performed 500 replications, within which all the considered models were fitted. We simulated a super-population of size 500,000 for each scenario within which we calculated RMSE and discrimination and calibration for benefit of all the models in each replication.

2.5 Empirical illustration

We demonstrated the different methods using 30,510 patients with acute myocardial infarction (MI) included in the GUSTO-I trial. 10,348 patients were randomized to tissue plasminogen activator (tPA) treatment and 20,162 were randomized to streptokinase. The outcome of interest was 30-day mortality (total of 2,128 events), recorded for all patients. In line with previous analyses [12,13], we fitted a logistic regression model with 6 baseline covariates, i.e. age, Killip class, systolic blood pressure, heart rate, an indicator of previous MI, and the location of MI, to predict 30-day mortality risk (Supplement, Section 8).