Efficient design and analysis of extended case-control studies

This is a dissertation from Stockholm: Karolinska Institutet, Dept of Medical Epidemiology and Biostatistics

Abstract: The nested case-control design is widely used in epidemiology for its efficiency, as it combines the advantages of both cohort and case-control designs. This design is an extension of the matched case-control design, where the matching variable is the time of occurrence of the outcome. Consequently, nested case-control data are usually analysed with conditional logistic regression; however, this analysis suffers from various limitations. Several authors have developed novel statistical methods for alternative analyses of nested case-control data using basic information from the underlying cohort. Among these methods, one approach consists of ignoring the matching, weighting the sampled individuals to recover a representation of the underlying cohort and analysing the data by maximising a weighted partial likelihood. This method can be considered when two conditions are fulfilled: 1) the sampling was performed in a well-defined underlying cohort for which basic information is available, and 2) the exact sampling procedure is known. This thesis aimed to refine and extend the scope of the weighted likelihood approach in nested case-control data analysis by investigating the advantages of this method as an alternative to the traditional conditional logistic regression in several situations. The thesis addressed the reuse of nested case-control data to answer a research question regarding a new outcome, the calculation of absolute risk, the mitigation of the problem of overmatching, fuller exploitation of clustered data and the analysis of subgroups of nested case-control data. While Studies I and III were motivated by an actual epidemiological question for which data were available, simulation studies were the main approach used in Studies II and IV. Reusing nested case-control data to address a research question regarding another outcome was the central point of interest in Study I.
To address an epidemiological question regarding the risk factors for contralateral breast cancer, for which data on contralateral breast cancer case patients were available, we studied the feasibility of reusing nested case-control data from a previous study as the control dataset. Practical aspects of the approach were highlighted, such as the consequences of reusing data that have narrow inclusion criteria, the restriction in the choice of the type of weights which can be calculated and the importance of having information on censoring dates for controls. In addition, we found that an imperfect reconstruction of the study base led to estimates similar to those obtained with an appropriate study base reconstruction; moreover, we confirmed that using unstratified weights (in cases of stratified sampling) provided exposure estimates similar to those from stratified weights, provided that adjustments were made for the confounder variables which drove the sampling. We also confirmed that using a naïve unweighted method instead of an appropriate method led to biased estimates. Absolute risk estimation was studied in Study II. Two methods were compared using both simulation studies and a real data application. The ability of each method to provide valid absolute risk estimates was investigated, in particular in cases of matched study designs. Both the Langholz-Borgan and weighted methods provided valid estimates in most situations, the latter showing slightly higher levels of precision than the former. With fine matching, the Langholz-Borgan method was more prone to bias than the weighted method and had larger standard errors. In Study III, we handled nested case-control data, which had been collected to address an epidemiological question regarding how radiation therapy and smoking interact in their association with lung cancer in female breast cancer patients.
Data on paired organs (breast and lungs) were collected for exposure and outcome variables, which provided clustered data at the individual level. The collected data were also characterised by the problem of overmatching, which arose at the design stage. Using a weighted partial likelihood mitigated the problem of overmatching and made better use of the collected data than conditional logistic regression. A further advantage of the weighted approach was that it enabled calculation of the absolute risk of a lung developing cancer given the radiation therapy dose received for breast cancer treatment and the smoking habits of the patient. In Study IV, we compared the conditional logistic regression and weighted likelihood methods in terms of validity and efficiency of nested case-control data subgroup analyses, with subgroups defined by different variables measured at baseline. All investigated subgroup analyses provided valid estimates with both methods. The advantages of weighted likelihood compared to conditional logistic regression were seen in the precision of the estimates. In addition, we showed that the weighting system enabled, on average, the reconstruction of the correct number of individuals at risk over time, for the whole cohort and in subgroups. In conclusion, the weighted likelihood approach showed several advantages compared to the traditional conditional logistic regression in nested case-control data analysis, which reinforces, refines and extends what has been previously shown in the literature.
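
As an illustration of the weighting idea described in the abstract, the following is a minimal sketch of how a sampled control's probability of ever being selected could be computed from basic cohort information. It assumes Kaplan-Meier type inclusion probabilities, a commonly used choice that the abstract itself does not specify; the thesis may rely on other types of weights. The function name, the argument names and the toy cohort are hypothetical, and the sketch assumes simple random sampling of m controls per case without additional matching.

```python
def km_inclusion_probabilities(entry, exit_, is_case, m):
    """Probability that each cohort member is ever included in the
    nested case-control sample (assumed Kaplan-Meier type weights).

    entry, exit_ : follow-up start and end times for each subject
    is_case      : True if the subject's follow-up ends with the event
    m            : number of controls sampled per case (assumption)
    """
    n = len(exit_)

    def at_risk(i, t):
        # Subject i is under observation at time t
        return entry[i] < t <= exit_[i]

    case_times = sorted(exit_[i] for i in range(n) if is_case[i])

    probs = []
    for i in range(n):
        if is_case[i]:
            # Cases are always included
            probs.append(1.0)
            continue
        not_sampled = 1.0
        for t in case_times:
            if at_risk(i, t):
                # Size of the risk set at this case's event time
                n_t = sum(at_risk(k, t) for k in range(n))
                # Chance of being drawn among the n_t - 1 eligible controls
                not_sampled *= 1.0 - min(1.0, m / (n_t - 1))
        probs.append(1.0 - not_sampled)
    return probs


# Toy cohort (hypothetical): four subjects, two events, one control per case
entry = [0, 0, 0, 0]
exit_ = [1, 2, 3, 4]
is_case = [True, True, False, False]
probs = km_inclusion_probabilities(entry, exit_, is_case, m=1)
# Analysis weights are the inverse inclusion probabilities of sampled subjects
weights = [1.0 / p for p in probs]
```

In a weighted analysis of this kind, each sampled subject would then contribute to a Cox-type partial likelihood with weight 1/p, so that the sample stands in for the cohort members at risk at each event time.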
