WP5 - GENE-ENVIRONMENT INTERACTION

Work package number: 5
Start date or starting event: Month 1
Work package title: GENE-ENVIRONMENT INTERACTION
Activity Type
Participant number:          4       2     9     11     13
Person-months participant:   121.2   10    176   38.4   2

Objectives
1. To integrate available lifestyle, hormonal and environmental risk factor data across all studies within BCAC, OCAC, PRACTICAL and CIMBA/IBCCS

2. To assess whether environmental/lifestyle risk factors modify genetic susceptibility for breast, ovarian and prostate cancer

Description of work.
The gene-environment interaction analyses will be conducted to assess whether the effects of established or suspected lifestyle/environmental risk factors for breast, ovarian and prostate cancer differ in subgroups classified according to genetic susceptibility. Data on relevant risk factors are available through studies within the consortia. They will be synthesized so that defined variables for each of the risk factors are coded according to a common protocol.

Genetic susceptibility will be defined according to genetic loci that have already been identified, applying strict criteria, or that will be identified in the course of this project, as being associated with cancer risk. The strength of the participating consortia (BCAC for breast cancer, OCAC for sporadic ovarian cancer, PRACTICAL for prostate cancer and CIMBA/IBCCS for BRCA1/2 mutation carriers) is the availability of data on genetic loci already shown to be related to cancer risk, in combination with information on lifestyle/environmental risk factors. Data will be available for 30,000 cases of breast cancer and 40,000 controls, 9,000 BRCA1/2 mutation carriers, 15,000 cases of ovarian cancer and 15,000 controls, and 14,000 cases of prostate cancer and 14,000 controls.

For each consortium, genetic databases have already been established (WP2) and constitute an ideal starting point for consortia-wide databases that include information on both genetic and lifestyle/environmental risk factors, enabling the gene-environment interaction analyses. In addition, the participating consortia have high potential to find new pathogenic loci in the course of the project (WP2, WP3), which will give us an excellent opportunity to integrate these loci efficiently into the ongoing gene-environment interaction analyses. The fine-scale mapping of genetic loci with strong evidence of risk associations, as conducted in WP4, may elucidate the functional genes in which the newly identified SNPs are located.

This will likely generate new hypotheses that can be pursued by also taking suspected risk factors into account in the gene-environment analyses. In addition to multiplicative gene-environment interaction (i.e. whether the joint effect of a selected gene and lifestyle factor is greater than expected when the individual relative risks are multiplied), we will also examine gene-environment interaction at the additive level, i.e. whether the joint effect of a selected gene and lifestyle factor is greater than expected when the individual excess risks are summed. Interaction at the additive level is of special relevance for estimating the public health impact of gene-environment interactions.
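The two scales can be written out explicitly. The following is a sketch in notation that does not appear in the proposal: OR_10, OR_01 and OR_11 are assumed to denote the odds ratios for carrying the risk genotype only, being exposed only, and both, each relative to unexposed non-carriers. RERI (relative excess risk due to interaction) is one commonly used additive-scale measure and is named here as an assumption, not as the measure the work package commits to.

```latex
\begin{align*}
\text{Multiplicative scale:}\quad
  & \psi = \frac{OR_{11}}{OR_{10}\,OR_{01}},
  && \psi > 1 \text{ if the joint effect is greater than multiplicative},\\[4pt]
\text{Additive scale:}\quad
  & \mathrm{RERI} = OR_{11} - OR_{10} - OR_{01} + 1,
  && \mathrm{RERI} > 0 \text{ if } (OR_{11}-1) > (OR_{10}-1) + (OR_{01}-1).
\end{align*}
```

The two measures can disagree: a joint effect that is exactly multiplicative (ψ = 1) is generally greater than additive whenever both individual odds ratios exceed 1.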

Apart from gene-environment interaction analyses in which each type of cancer will be examined as a homogeneous disease, a further step will be taken in WP6, in which cases will be classified into subtypes according to the pathological/molecular characteristics of their tumours.

Task 5.1 Standardisation of epidemiological data and analysis of association between lifestyle factors and cancer risk
Standardization of epidemiological data is essential for the evaluation of gene-environment interaction across contributing studies. Established risk factors for breast cancer include reproductive variables, exogenous hormones, body weight and height/body mass index, physical activity, alcohol use and radiation exposure. Some similar risk factors are involved in ovarian cancer, including reproductive variables, exogenous hormones, and body weight/body mass index.

The aetiology of prostate cancer is less well understood, and the only established risk factors are race and family history. Steroid hormones are likely to play a role, since androgens are involved in the development and maintenance of the prostate gland; it is therefore well justified to examine body weight/body mass index, height, physical activity and surrogate measures of hormonal status as possible risk factors. A recently published report from the World Cancer Research Fund indicated that few specific dietary factors play an important role in cancers of the breast, ovary and prostate. Of greater relevance for prevention is the maintenance of appropriate energy balance, for which the two indicator variables are obesity and physical activity; both will be accounted for in these analyses.

There is little evidence of infectious etiology for the cancers included in this project. Although tobacco smoking is not generally recognised as a risk factor, we recently showed in a meta-analysis that it increases the risk of breast cancer in women with NAT2 slow acetylation genotypes (21). It therefore appears warranted to evaluate whether smoking may also be a risk factor for ovarian and prostate cancer in genetically susceptible subgroups. There is recent evidence suggesting that long-term daily use of adult-strength aspirin may be associated with modestly reduced overall cancer incidence. A meta-analysis also supported current evidence that aspirin may reduce the risk of breast cancer. Studies regarding the relationship between aspirin or non-steroidal anti-inflammatory drugs (NSAIDs) and cancers of the ovary and prostate have not yielded consistent results (22).

Because of the widespread use of aspirin and NSAIDs, particularly in the older age groups, any association with risk may have important public health implications. Therefore, both established and suspected risk factors described for the three cancers will be examined for effect modification by genetic factors. The principal initial task will be to synthesize and standardise the existing epidemiological data on specified established or suspected lifestyle/environmental risk factors according to a common protocol. Questionnaires used in the different consortia involved (BCAC, CIMBA/IBCCS, OCAC and PRACTICAL) have been assembled to assess types and quality of data on the respective risk factors.

Global variables will be constructed for each of the established risk factors on which a maximal number of studies can provide information. For a subgroup of studies, more detailed variables will be defined. The review of the questionnaires and the information provided by consortia members confirm that we have sufficient power for analysis of gene-environment interaction for all the risk factors to be considered. The list of variables selected for the GxE analysis at different stages (see tasks) will be determined after standardization of data on the lifestyle/environmental factors. The following table gives a preliminary count of cases and controls with data on the respective risk factor (global variable) for the different cancer sites.


Risk factor                          Breast             Ovarian            Prostate           BRCA1/2
                                     cases   controls   cases   controls   cases   controls   carriers
Reproductive history
  age at menarche                    33079   44456      11538   14653      -       -          9691
  pregnancy                          33079   44456      11699   15134      -       -          9873
  breastfeeding                      28963   40275      10071   13157      -       -          -
  oral contraceptives (OC)           31960   43439      11699   15134      -       -          9606
Exogenous hormone use                31960   43439      11018   13570      -       -          9606
Weight and height / Body Mass Index  31960   43439      11538   14653      2465    2214       9362
Tobacco smoking                      30750   42827      10721   14213      2465    2214       8816
Alcohol consumption                  27953   39819      9442    12958      2465    2214       8816
Aspirin / NSAIDs                     17072   29375      8404    10337      -       -          -
Physical activity                    16059   19492      9107    11498      665     663        -


A data dictionary will be developed for each cancer site that specifies the exposure variables (definition and characterization) for detailed analysis of each risk factor. Where possible, common definitions of variables will be used for the three cancer sites (as a common protocol). Definition of the exposure variables will take into account the different instruments used to collect epidemiologic data in the studies involved. For example, exposure variables for hormone replacement therapy (HRT) will include ever use, recency of use, duration of HRT in years, use of estrogen-alone therapy, use of combined estrogen-progesterone therapy, and duration of use by type of HRT (coordinated by partner 9). Common procedures for the coding of unknown data will be specified.
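As an illustration of what a data-dictionary rule for HRT might look like in practice, the sketch below derives three of the variables named above (ever use, recency of use, duration in years) from raw questionnaire answers. All field names, category labels and the use of `None` for unknown values are hypothetical choices made for this example; the actual definitions are to be fixed in the common protocol.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HrtRecord:
    """Harmonised HRT variables; None is the (hypothetical) code for unknown."""
    ever_use: Optional[bool]
    recency: Optional[str]            # "never", "former" or "current"
    duration_years: Optional[float]


def harmonise_hrt(ever_used: Optional[bool],
                  age_start: Optional[float],
                  age_stop: Optional[float],
                  still_using: Optional[bool],
                  age_at_reference: Optional[float]) -> HrtRecord:
    """Map raw study answers onto the common HRT variables."""
    if ever_used is None:
        # Unknown exposure stays unknown on every derived variable.
        return HrtRecord(None, None, None)
    if not ever_used:
        return HrtRecord(False, "never", 0.0)

    recency = None
    if still_using is True:
        recency = "current"
    elif still_using is False:
        recency = "former"

    # Current users are assumed to have used HRT up to the reference date.
    end_age = age_at_reference if still_using else age_stop
    duration = None
    if age_start is not None and end_age is not None and end_age >= age_start:
        duration = float(end_age - age_start)
    return HrtRecord(True, recency, duration)


print(harmonise_hrt(True, 50, None, True, 58))
```

Keeping the unknown code separate from "never" and from a duration of zero is the point of a common coding procedure: studies that did not ask a question must not be pooled with subjects who answered "no".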

The data dictionaries for each cancer site will be posted on the COGS website and made accessible through a member password. Subsequently, risk factor data will be retrieved from the members in each consortium. The databases will be established and pseudonymised through the use of study subject identifying numbers, which will also be used to merge the epidemiologic and genetic data. The databases will only be accessible to authorized persons via network and database passwords.

The databases will be protected from changes by user actions and be programmed to optimise queries, inserts and updates, as well as to provide meaningful data visualization, data backup, and tracking and/or undo of updates. All data transferred will be subjected to editing and plausibility checks. The databases will then contain data on uniformly defined and cleaned global variables for all studies per cancer site and on more detailed variables for the maximal subgroup of studies: for unselected breast cancer (BCAC) at DKFZ (partner 4), for BRCA1/2 mutation carriers (CIMBA/IBCCS) at NKI (partner 9), and for prostate cancer at ICR (partner 13). For OCAC, a common database of epidemiologic variables has already been established and will be made available for analysis regarding ovarian cancer at DCS (partner 11). These databases will be fed into the central database in Cambridge (WP2) to also enable analyses for WP6 and WP7.
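The editing and plausibility checks could take a form such as the following minimal sketch. The variable names, the permitted ranges and the cross-field rule are assumptions chosen only for illustration; the real limits would come from the data dictionaries.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical permitted ranges (inclusive) for three global variables.
RANGES: Dict[str, Tuple[float, float]] = {
    "age_at_reference": (18, 100),
    "age_at_menarche": (8, 20),
    "bmi": (10.0, 80.0),
}


def plausibility_issues(record: Dict[str, Optional[float]]) -> List[str]:
    """Return a list of human-readable problems found in one subject record."""
    issues: List[str] = []

    # Range checks; missing values are skipped because unknown data are
    # handled by the common coding rules, not treated as errors.
    for name, (low, high) in RANGES.items():
        value = record.get(name)
        if value is None:
            continue
        if not low <= value <= high:
            issues.append(f"{name}={value} outside [{low}, {high}]")

    # Cross-field check: an event cannot occur after the reference date.
    menarche = record.get("age_at_menarche")
    reference = record.get("age_at_reference")
    if menarche is not None and reference is not None and menarche > reference:
        issues.append("age_at_menarche exceeds age_at_reference")
    return issues


print(plausibility_issues({"age_at_reference": 45, "age_at_menarche": 47, "bmi": 24.0}))
```

Records with a non-empty issue list would be returned to the contributing study for clarification instead of being silently corrected.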

The BRCA1/2 carrier cohort database (IBCCS) already contains standardised risk factor information for a subgroup of ≥ 4,000 BRCA1/2 carriers within the CIMBA study, and will also involve updating information on newly diagnosed cancers and standardised risk factor information during follow-up. Following the establishment of the pooled databases, systematic analyses of lifestyle risk factors for each of the cancers included in the project will be carried out by the respective partners 4, 9 and 11 in collaboration with consortia members.

Regression analysis will be performed to generate systematic risk estimates for each risk factor by cancer site. Specifically, logistic regression models will be employed for case-control analysis to yield odds ratios and 95% confidence intervals and weighted Cox proportional-hazards models for the cohort of BRCA1/2 mutation carriers to yield hazard ratios and 95% confidence intervals. Age at reference date will be included in all regression models and stratification by study centre is intended.
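For orientation, the quantity reported by the case-control analyses can be illustrated in its simplest, unadjusted form. The sketch below computes an odds ratio and a Woolf (log-scale) 95% confidence interval from a single 2x2 table. It is a simplification: it does not adjust for age or stratify by study centre, which the planned logistic regression models do, and the function name and counts are made up for the example.

```python
import math
from typing import Tuple


def odds_ratio_ci(exposed_cases: int, unexposed_cases: int,
                  exposed_controls: int, unexposed_controls: int,
                  z: float = 1.959964) -> Tuple[float, float, float]:
    """Unadjusted odds ratio with a Woolf confidence interval (default 95%)."""
    odds_ratio = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls)
    # Standard error of log(OR) is the root of the summed reciprocal counts.
    se_log_or = math.sqrt(1 / exposed_cases + 1 / unexposed_cases
                          + 1 / exposed_controls + 1 / unexposed_controls)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Illustrative counts only.
or_, lo, hi = odds_ratio_ci(20, 80, 10, 90)
print(f"OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```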

Given the hormone-related etiology of breast cancer, the analyses within the breast cancer studies will also be stratified according to menopausal status, where appropriate. Quantitative variables will be entered into the model continuously (where available) and categorically. Tests for trend will be performed using a model with a continuous variable or a score for ordinal categories. Relevant confounding or effect-modifying factors will be identified and then accounted for in multivariate analyses. Tests for heterogeneity by study strata will be performed by comparing logistic regression models with and without a risk factor x study interaction term using a likelihood ratio test.


Task 5.2 Evaluation of gene-environment interaction (GxE) with established genetic susceptibility loci using existing genotype data
Initially the gene-environment analysis will be conducted using the defined epidemiologic variables for the designated environmental/lifestyle risk factors with existing genotype data. We will use confirmed genetic variants already identified within the consortia and those variants confirmed by WP2 to show main effects for breast cancer in BCAC, for ovarian cancer in OCAC, for prostate cancer in PRACTICAL and for mutation carriers in CIMBA.

  1. For breast cancer, the analysis will include the 5 novel gene loci recently identified by the GWAS of breast cancer in BCAC and the 2 genetic variants identified through candidate gene analyses and replication within BCAC. Genotype data for these genetic variants are already available in BCAC, so these analyses will be carried out in 20,000 cases and 20,000 controls at DKFZ (partner 4). Similar genotype data are available for CIMBA (6,000 mutation carriers including 3,300 breast cancer cases), and GxE analyses will be carried out at NKI (partner 9). Further, newly confirmed SNPs emerging from the analyses of GWAS by WP2 will also be included.

  2. For ovarian cancer, several novel genetic loci are expected to emerge from the ongoing GWAS in Cambridge and in OCAC (led by Tom Sellers at the Moffitt Research Institute, Tampa) by 2009. Analysis of gene-environment interactions will be carried out in this sample of 8,000 cases and 8,000 controls at DCS (partner 11) for the confirmed set of genetic loci determined by WP2.

  3. For prostate cancer, 5 SNPs have been identified through several GWAS. These SNPs will have been genotyped in 10,000 cases and 10,000 controls in PRACTICAL by the start of this project. GxE analysis will be carried out with these 5 SNPs, if replicated, and with additional confirmed SNPs for prostate cancer in PRACTICAL at ICR (partner 13).

Analysis of gene-environment interaction will initially be carried out under the standard multiplicative logistic regression model. Departures from the multiplicative joint effect of genotype and environmental exposure are measured by the OR for the interaction. An interaction parameter greater than 1 indicates a greater than multiplicative effect between the two factors. To test for effect modification, multiplicative interaction terms will be included in the models and evaluated by the likelihood ratio test.

The main test of the null hypothesis of no interaction will generally be a likelihood ratio test (2 d.f.) comparing a model that includes interaction terms for risk factor (dichotomous) x genotype (3 categories) with a model without interaction terms, and a likelihood ratio test (1 d.f.) for a model including only one interaction term for risk factor (dichotomous) x scored genotype (for allelic dose). A correspondingly greater number of interaction terms will be required when exposure variables with more than two categories are used.
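The arithmetic of the 1 d.f. and 2 d.f. likelihood ratio tests can be sketched as follows. The function assumes that the maximised log-likelihoods of the two nested logistic models have already been obtained from a statistical package; the function name and inputs are illustrative. Only 1 and 2 degrees of freedom are handled because the chi-square tail probability has a simple closed form in those two cases.

```python
import math
from typing import Tuple


def lrt_p_value(loglik_without_interaction: float,
                loglik_with_interaction: float,
                df: int) -> Tuple[float, float]:
    """Likelihood ratio statistic and chi-square p-value for df = 1 or 2."""
    statistic = 2.0 * (loglik_with_interaction - loglik_without_interaction)
    statistic = max(statistic, 0.0)  # guard against tiny negative rounding error
    if df == 1:
        # P(chi2_1 > x) = erfc(sqrt(x / 2))
        p_value = math.erfc(math.sqrt(statistic / 2.0))
    elif df == 2:
        # P(chi2_2 > x) = exp(-x / 2)
        p_value = math.exp(-statistic / 2.0)
    else:
        raise ValueError("this sketch only covers 1 or 2 degrees of freedom")
    return statistic, p_value


# Illustrative log-likelihoods only.
stat, p = lrt_p_value(-13862.9, -13859.1, df=2)
print(f"LR statistic = {stat:.2f}, p = {p:.4f}")
```

For exposure variables with more than two categories, the degrees of freedom grow with the number of interaction terms and a general chi-square routine from a statistics library would be needed instead.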

For each risk factor, both global variables (e.g. parous versus non-parous, ever versus never HRT use) and more detailed variables (e.g. number of full-term pregnancies, current HRT use, duration of HRT use) will be assessed for interaction with the genetic loci. When significant heterogeneity in ORs (i.e. gene-environment interaction) is observed, the effect of lifestyle/environmental risk factors will be determined according to genotype by performing logistic regression to yield risk estimates for the risk factor separately for subgroups defined by the genotype or allele carrier status. If there is no evidence of modification of odds ratios under a null multiplicative model, departure from additivity will be evaluated for the same combinations of genotypes of confirmed genetic loci and environmental exposures.

Joint effects of genotype and environmental exposure will also be examined by the generation of subgroups defined by genotype and risk factor level. The groups can then be entered simultaneously in the logistic model to calculate the associated odds ratios with respect to a common reference group. This statistical analysis plan will be discussed, formally written up and circulated to all collaborators to promote homogeneous analyses across cancer sites.
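A minimal sketch of the joint-effects layout for a dichotomous genotype and a dichotomous exposure is given below. It computes crude odds ratios for each genotype-by-exposure subgroup against the common reference group (non-carrier, unexposed) and then summarises departure from multiplicativity and from additivity. The counts are invented, the estimates are unadjusted, and the additive measure used (relative excess risk due to interaction, RERI) is an assumption of this example, not a measure named in the analysis plan.

```python
from typing import Dict, Tuple

# (carrier, exposed) -> (cases, controls); invented counts for illustration.
Counts = Dict[Tuple[int, int], Tuple[int, int]]


def joint_effects(counts: Counts) -> Dict[str, float]:
    """Crude joint odds ratios relative to the (0, 0) reference group."""
    ref_cases, ref_controls = counts[(0, 0)]

    def odds_ratio(key: Tuple[int, int]) -> float:
        cases, controls = counts[key]
        return (cases * ref_controls) / (controls * ref_cases)

    or_g = odds_ratio((1, 0))    # genotype only
    or_e = odds_ratio((0, 1))    # exposure only
    or_ge = odds_ratio((1, 1))   # both
    return {
        "or_genotype_only": or_g,
        "or_exposure_only": or_e,
        "or_joint": or_ge,
        # > 1 means the joint effect is greater than multiplicative
        "interaction_or": or_ge / (or_g * or_e),
        # > 0 means the joint effect is greater than additive
        "reri": or_ge - or_g - or_e + 1.0,
    }


example: Counts = {
    (0, 0): (100, 100),
    (1, 0): (150, 100),
    (0, 1): (200, 100),
    (1, 1): (600, 100),
}
print(joint_effects(example))
```

In the planned analyses the subgroup indicators would be entered simultaneously in an adjusted logistic model; the crude table above only shows how the common reference group ties the estimates together.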


Task 5.3 Evaluation of gene-environment interactions (GxE) with possible genetic susceptibility loci using existing genotype data
In the GWAS for breast cancer performed by the Cambridge group together with BCAC, the 31 most significant SNPs in a second-stage study of 4,000 cases and 4,000 controls were tested in a large series of case-control samples, including about 22,000 cases and 22,000 controls. Although 25 of these SNPs did not achieve genome-wide significance levels (P < 10^-7), it is possible that some may be associated with significant GxE effects.

Whether or not a significant main effect occurs in the presence of interaction depends on the strength of the GxE effect and the proportion in the population that is exposed to the involved genetic and environmental factors. For breast cancer, we shall perform the analyses using the genotype data of the 25 SNPs from the GWAS in BCAC, which is already available in about 20,000 cases and 20,000 controls for whom epidemiologic data have also been collected. The complete sample will be subdivided so that the screening for GxE will be performed in 10,000 cases and 10,000 controls and the most significant results obtained will be evaluated in a replication sample of the same size (partner 4).
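The subdivision into a screening and a replication sample could be done as in the sketch below. The proposal does not state how the split will be made, so the random, reproducible halving shown here (applied separately to cases and to controls) is an assumption; a split by contributing study would be an equally plausible design. The function name and seed are illustrative.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def split_screening_replication(
        subject_ids: Sequence[Hashable],
        seed: int = 2009) -> Tuple[List[Hashable], List[Hashable]]:
    """Randomly halve a list of pseudonymised subject IDs, reproducibly."""
    shuffled = list(subject_ids)
    random.Random(seed).shuffle(shuffled)  # local RNG; global state untouched
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Apply the split separately to cases and controls so both halves keep
# the same case-control ratio.
case_ids = [f"case{i:05d}" for i in range(20000)]
control_ids = [f"ctrl{i:05d}" for i in range(20000)]
screen_cases, repl_cases = split_screening_replication(case_ids)
screen_controls, repl_controls = split_screening_replication(control_ids)
print(len(screen_cases), len(repl_cases), len(screen_controls), len(repl_controls))
```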

For ovarian cancer, we shall perform a similar analysis using the genotype data on the SNPs that were tested but not confirmed in the GWAS for ovarian cancer. The complete sample of 10,000 cases and 10,000 controls will also be subdivided to provide a screening and a replication sample. An approach similar to that for ovarian cancer will be used for prostate cancer. To test for GxE between the 25 SNPs and defined epidemiological variables, we shall use the case-only design, which has demonstrated greater power than case-control analysis to detect gene-environment interaction in the absence of gene-environment association in the population.

The interaction odds ratio can be estimated by fitting a logistic regression model with a dichotomous variable (either dichotomized exposure variable or genetic variable) as the dependent variable. Polytomous regression models will be fitted for dependent variables with more than two categories. However, this approach holds only under the assumption that genotype and environmental exposure are uncorrelated in the population. Therefore, when statistically significant interaction odds ratios are found using case-only analysis, the analysis of gene-environment interaction will be repeated using case-control data in order to validate the finding by eliminating false positive associations due to genotype-environmental exposure correlations in the population. Again, this statistical analysis plan will be formally written up and circulated to all collaborators to promote homogeneous analyses across cancer sites.
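In the simplest setting of a dichotomous genotype and a dichotomous exposure, the case-only estimate reduces to the genotype-exposure odds ratio among cases. The sketch below shows that calculation with a Woolf 95% confidence interval. It is unadjusted, the counts are invented, and the result is interpretable as a multiplicative interaction odds ratio only under the assumptions of genotype-exposure independence in the source population and a rare disease.

```python
import math
from typing import Tuple


def case_only_interaction_or(carrier_exposed: int, carrier_unexposed: int,
                             noncarrier_exposed: int, noncarrier_unexposed: int,
                             z: float = 1.959964) -> Tuple[float, float, float]:
    """Genotype-exposure odds ratio among cases, with a Woolf interval."""
    estimate = (carrier_exposed * noncarrier_unexposed) / (
        carrier_unexposed * noncarrier_exposed)
    se_log = math.sqrt(1 / carrier_exposed + 1 / carrier_unexposed
                       + 1 / noncarrier_exposed + 1 / noncarrier_unexposed)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


# Cases only; no controls enter the calculation.
est, lo, hi = case_only_interaction_or(60, 40, 30, 70)
print(f"case-only interaction OR = {est:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Because controls are not used, the same calculation run in controls gives a direct check on the independence assumption: a genotype-exposure odds ratio clearly different from 1 among controls signals that the case-only estimate is biased.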


Task 5.4 Testing of gene-environment interactions (GxE) with all 1,536 SNPs genotyped in this project to select SNPs showing GxE but not main effects for genotyping in further case-control sets
Genetic susceptibility loci that modify effects of risk factors but have a weak main effect may not be detected when testing for association with cancer risk alone. However, they may be detected by GxE analysis. In order to identify such loci, we shall test for GxE between all 1,536 genotyped SNPs from WP3 and selected risk factors, using only dichotomized epidemiologic variables (partner 4).

Genetic variants found to show significant GxE will be submitted to WP2 for consideration and inclusion into the selected set of SNPs for genotyping in further case-control datasets. Validation of significant findings of GxE can then be carried out in the second series of 79,000 samples (20,000 breast cancer cases and 20,000 controls; 10,000 prostate cancer cases and 10,000 controls; 5,000 ovarian cancer cases and 5,000 controls; 9,000 BRCA1/2 carriers) from the Consortia after integration of genotype data from WP3. With the large number of SNPs to be interrogated, we require a method for testing gene-environment interaction in the presence of gene-environment association in the population that also adequately accounts for multiple testing.

Although the more powerful case-only approach is not valid in the presence of gene-environment association, adequate adjustment is possible if the source(s) for the gene-environment association are known and can be measured. Therefore we plan to use an Empirical Bayesian approach to test for gene-environment interaction that has the extra power of a case-only statistic without its bias (proposed by Duncan Thomas, USC, Los Angeles, unpublished). This is achieved by hierarchical modelling of population SNP-risk factor associations in cases accounting for the bias which is estimated from all the SNP-risk factor association assessments in the controls across all SNPs. By modelling on all the SNPs, multiple testing is accounted for in the test statistic.


Task 5.5 Evaluation of gene-environment interaction (GxE) with confirmed genetic susceptibility loci using existing genotype data
Detailed GxE analysis will be carried out for the defined epidemiologic variables and the list of 50 genetic variants selected by WP2 showing main effects for breast cancer, ovarian cancer and prostate cancer, to be genotyped in further case-control sets for each cancer. Genotype data will be available for 76,000 samples (20,000 breast cancer cases and 20,000 controls; 10,000 prostate cancer cases and 10,000 controls; 5,000 ovarian cancer cases and 5,000 controls; 6,000 BRCA1/2 carriers).

The analyses will be carried out using the methodology described in Task 5.3, whereby confounding or effect-modifying factors will be accounted for in multivariate analyses. It will be possible to validate significant findings using data from the second series of 79,000 samples genotyped for these 50 SNPs from WP3.

Task 5.6 Evaluation of gene-environment interaction (GxE) with genetic susceptibility loci identified by WP2
The GxE analyses will be carried out for defined epidemiologic variables with the novel SNPs confirmed by WP2 to have a main effect on cancer risk for the three cancer sites. The same statistical methods as described in Task 5.2 will be employed for assessing GxE with risk factors using both global and more detailed variables. In order to replicate within the project, we shall use the first series of samples for the 1,536-SNP array (10,000 cases and 10,000 controls for breast cancer, 10,000 cases and 10,000 controls for prostate cancer, 5,000 cases and 5,000 controls for ovarian cancer and 6,000 BRCA1/2 carriers) to test for GxE with the novel SNPs.

Significant GxE findings will be validated using data from the second series of 79,000 samples (20,000 breast cancer cases and 20,000 controls; 10,000 prostate cancer cases and 10,000 controls; 5,000 ovarian cancer cases and 5,000 controls; 9,000 BRCA1/2 carriers) from the Consortia after integration of genotype data from WP3. Apart from gene-environment interaction analyses in which each type of cancer will be examined as a homogeneous disease, a further step will be taken in WP6, in which tumour subtypes will be classified according to their pathological/molecular characteristics.

Deliverables (brief description) and months of delivery
D5.1. Questionnaires from consortia members collected and posted on COGS website (month 3)

D5.2. Data dictionary of established and suspected lifestyle/environmental risk factors for each cancer site, including variable definition and coding, finalized and placed on COGS website (month 6)

D5.3. A database with harmonized epidemiologic data established for each of the different cancers and for BRCA1/2 mutation carriers (month 12)

D5.4. Results posted on the COGS website from analysis of GxE performed for the already established susceptibility loci of breast cancer (month 18)

D5.5. List of selected lifestyle/environmental variables used for exploratory GxE analyses for breast cancer with all 1,536 SNPs genotyped, to be placed on COGS website (month 23)

D5.6. Results posted on the COGS website from analysis of GxE performed for the possible susceptibility loci of breast cancer from previous GWAS (month 26)

D5.7. Results posted on the COGS website from analysis of GxE (using global variables) performed for the already established susceptibility loci of breast cancer in BRCA1/2 mutation carriers and for the established susceptibility loci from GWAS for ovarian and prostate cancer (month 28)

D5.8. List of selected detailed variables for GxE analyses in breast, ovarian and prostate cancer to be placed on COGS website (month 30)

D5.9. Results posted on the COGS website from analysis of GxE for confirmed susceptibility loci for breast, ovarian, and prostate cancer (month 40)

Work package number 4 Start date or starting event: Month 1
Work package title GENE-ENVIRONMENT INTERACTION
Activity Type
Participant number 4 2
9
11 13
Person-months participant 121.2 10
176
38.4 2

Objectives
1. To integrate available lifestyle, hormonal and environmental risk factor data across all studies within BCAC, OCAC, PRACTICAL and CIMBA/IBCCS

2. To assess whether environmental/lifestyle risk factors modify genetic susceptibility for breast, ovarian and prostate cancer

Description of work.
The gene-environment interaction analyses will be conducted to assess whether the effects of established or suspected lifestyle/environmental risk factors for breast, ovarian and prostate cancer differ in subgroups classified according to genetic susceptibility. Data on relevant risk factors are available through studies within the consortia. They will be synthesized so that defined variables for each of the risk factors are coded according to a common protocol.

Genetic susceptibility will be defined according to genetic loci that have already been identified applying strict criteria, or will be identified in the course of this project, to be associated with cancer risk. The strength of the participating consortia, BCAC for breast cancer, OCAC for sporadic ovarian cancer, PRACTICAL for prostate cancer and CIMBA/IBCCS for BRCA1/2 mutation carriers is the availability of data on genetic loci already shown to be related with cancer risks, in combination with information on lifestyle/environmental risk factors. Data will be available for 30,000 cases of breast cancer and 40,000 controls, 9,000 BRCA1/2 mutation carriers, 15,000 cases of ovarian cancer and 15,000 controls and 14,000 cases of prostate cancer and 14,000 controls.

For each consortium, genetic databases have already been established (WP2) and constitute an ideal starting point for consortia-wide databases that include information on both genetic and lifestyle/environmental risk factors enabling the gene-environment interaction analyses. In addition, the participating consortia encompass high potential to find new pathogenic loci in the course of the project (WP2, WP3) that will then provide us with an excellent opportunity to efficiently integrate these genetic loci in the ongoing gene-environment interaction analyses. The fine-scale mapping of genetic loci with strong evidence of risk associations, as conducted in WP4 may elucidate the functional genes in which the newly identified SNPs are located.

This will likely generate new hypotheses that can be pursued by also taking suspected risk factors into account in the gene- environment analyses. In addition to multiplicative gene-environment interaction (i.e. whether the joint effects of a selected gene and lifestyle factor are greater than expected when individual excess risks are multiplied) we will also examine gene-environment interaction at the additive level, i.e. whether the joint effects of a selected gene and lifestyle factor are greater than expected when individual excess risks are summed. Interaction at the additive level is of special relevance to estimate the impact of gene-environment interactions for public health.

Apart from gene-environment interaction analyses in which each type of cancer will be examined as a homogeneous disease, a further step will be taken in WP6, in which cases will be classified into subtypes according to the pathological/molecular characteristics of their tumours.

Task 5.1 Standardisation of epidemiological data and analysis of association between lifestyle factors and cancer risk
Standardization of epidemiological data is essential for the evaluation of gene-environment interaction across contributing studies. Established risk factors for breast cancer include reproductive variables, exogenous hormones, body weight and height/body mass index, physical activity, alcohol use and radiation exposure. Some similar risk factors are involved in ovarian cancer, including reproductive variables, exogenous hormones, and body weight/body mass index.

The aetiology of prostate cancer is less well understood and the only established risk factors are race and family history. Steroid hormones are likely to play a role since androgens are involved in the development and maintenance of the prostate gland, thus it is well justified to examine body weight/body mass index, height, physical activity and surrogate measures of hormonal status as possible risk factors. A recently published report from the World Cancer Research Fund indicated that few specific dietary factors play an important role for cancers of the breast, ovary and prostate. Of greater relevance for prevention is the maintenance of appropriate energy balance and the two indicator variables are obesity and physical activity, which will be accounted for in these analyses.

There is little evidence of infectious etiology for the cancers included in this project. Although not generally recognised as a risk factor, we showed recently in a meta-analysis that tobacco smoking increases the risk of breast cancer in women with NAT2 slow acetylation genotypes (21). It appears therefore warranted to evaluate whether smoking may be a risk factor also for ovarian and prostate cancer in genetically susceptible subgroups. There is recent evidence suggesting that long-term daily use of adult-strength aspirin may be associated with modestly reduced overall cancer incidence. A meta-analysis also supported current evidence that aspirin may reduce risk of breast cancer. Studies regarding the relationship between aspirin or non- steroidal anti-inflammatory drugs (NSAIDs) and cancers of the ovary and prostate have not yielded consistent results (22).

Because of the widespread use of aspirin and NSAIDs particularly in the older age groups, any association with risk may have important public health implications. Therefore, both established and suspected risk factors described for the three cancers will be examined for effect modification by genetic factors. The principal initial task will be to synthesize and standardise the existing epidemiological data on specified established or suspected lifestyle/environmental risk factors according to a common protocol. Questionnaires used in the different consortia involved, BCAC, CIMBA/IBCCS, OCAC, and PRACTICAL have been assembled to assess types and quality of data on the respective risk factors.

Global variables will be constructed for each of the established risk factors on which a maximal number of studies can provide information. For a subgroup of studies, more detailed variables will be defined. The review of the questionnaires and information provided by consortia members confirm that we have sufficient power for analysis of gene-environmental interaction for all the risk factors to be considered. The list of variables selected for the GxE analysis at different stages (see tasks) will be determined after standardization of data on the lifestyle/environmental factors. The following table gives a preliminary count of cases and controls with data on the respective risk factor (global variable) for the different cancer sites.


Risk factor                           Breast            Ovarian           Prostate          BRCA1/2
                                      cases   controls  cases   controls  cases   controls  carriers
Reproductive history
age at menarche                       33079   44456     11538   14653     -       -         9691
pregnancy                             33079   44456     11699   15134     -       -         9873
breastfeeding                         28963   40275     10071   13157     -       -         -
OC                                    31960   43439     11699   15134     -       -         9606
Exogenous hormone use                 31960   43439     11018   13570     -       -         9606
Weight and height / Body Mass Index   31960   43439     11538   14653     2465    2214      9362
Tobacco smoking                       30750   42827     10721   14213     2465    2214      8816
Alcohol consumption                   27953   39819     9442    12958     2465    2214      8816
Aspirin / NSAIDs                      17072   29375     8404    10337     -       -         -
Physical activity                     16059   19492     9107    11498     665     663       -


A data dictionary will be developed for each cancer site that specifies the exposure variables (definition and characterization) for detailed analysis of each risk factor. Common definitions of variables will be used for the three cancer sites (as common protocol), where possible. Definition of the exposure variables will take into account the different instruments used to collect epidemiologic data in the studies involved. For example, exposure variables for hormone replacement therapy (HRT) will include ever use, recency of use, duration of HRT in years, use of estrogen-alone therapy, use of combined estrogen-progesterone therapy, and duration of use by type of HRT (coordinated by partner 9). Common procedures for the coding of unknown data will be specified.
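
As an illustration of how a data dictionary entry could drive harmonisation, the following minimal sketch maps study-specific codings of ever use of HRT onto one common coding. The study names, the raw questionnaire codes and the common codes (0 = never, 1 = ever, 9 = unknown) are hypothetical and chosen only for this example; the actual definitions will be those agreed in the data dictionary.

```python
# Hypothetical common coding for the global variable "ever use of HRT".
NEVER, EVER, UNKNOWN = 0, 1, 9

# Hypothetical study-specific codings, as they might appear in questionnaires.
STUDY_CODINGS = {
    "study_A": {"no": NEVER, "yes": EVER},
    "study_B": {"1": EVER, "2": NEVER, "8": UNKNOWN},
}

def harmonise_hrt_ever(study, raw_value):
    """Map a raw questionnaire value to the common coding.

    Missing, blank or unrecognised values (and unknown studies) are coded
    as UNKNOWN so that unknown data are handled uniformly across studies.
    """
    if raw_value is None:
        return UNKNOWN
    key = str(raw_value).strip().lower()
    return STUDY_CODINGS.get(study, {}).get(key, UNKNOWN)

# Invented example records: (study, raw value).
records = [("study_A", "Yes"), ("study_B", 2), ("study_B", ""), ("study_C", "yes")]
harmonised = [harmonise_hrt_ever(study, value) for study, value in records]
print(harmonised)
```

Keeping the mapping in a table rather than in code branches means that a new study can be added by extending the dictionary alone.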

The data dictionaries for each cancer site will be posted on the COGS website and accessible through member password. Subsequently, risk factor data will be retrieved from the members in each consortium. The databases will be established and made pseudo-anonymous through the use of study subject identifying numbers, which will also be used to merge the epidemiologic and genetic data. The databases will only be accessible for authorized persons via network and database passwords.

The database will be protected from changes by user actions and be programmed to optimise queries, inserts, and updates as well as providing meaningful data visualization, data backup and tracking and/or undo of updates. All data transferred will be subjected to editing and plausibility checks. The databases will then contain data on uniformly defined and cleaned global variables for all studies per cancer site and on more detailed variables for the maximal subgroup of studies - for unselected breast cancer (BCAC) at DKFZ (partner 4), for BRCA1/2 mutation carriers (CIMBA/IBCCS) at NKI (partner 9), and for prostate cancer by ICR (partner 13). For OCAC, a common database of epidemiologic variables has already been established and will be made available for analysis regarding ovarian cancer at DCS (partner 11). These databases will be fed into the central database in Cambridge (WP2) to also enable analyses for WP6 and WP7.

The BRCA1/2 carrier cohort database (IBCCS) already contains standardised risk factor information of a subgroup of ≥ 4,000 BRCA1/2 carriers within the CIMBA study, and will also involve updating information on newly diagnosed cancers and standardised risk factor information during follow-up. Following the establishment of the pooled databases, systematic analyses of lifestyle risk factors for each of the cancers included in the project will be carried out by the respective partners 4, 9, 11 in collaboration with consortia members.

Regression analysis will be performed to generate systematic risk estimates for each risk factor by cancer site. Specifically, logistic regression models will be employed for case-control analysis to yield odds ratios and 95% confidence intervals and weighted Cox proportional-hazards models for the cohort of BRCA1/2 mutation carriers to yield hazard ratios and 95% confidence intervals. Age at reference date will be included in all regression models and stratification by study centre is intended.
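
For a single dichotomous risk factor without covariates, the odds ratio from a logistic regression model equals the cross-product ratio of the 2x2 table, and a Wald 95% confidence interval follows from the standard error of the log odds ratio. The sketch below shows only this unadjusted special case; the counts are invented, the function name is hypothetical, and the planned analyses (adjusted for age and stratified by study centre) would require fitting the full model in statistical software.

```python
import math

def odds_ratio_ci(exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval (2x2 table)."""
    odds_ratio = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls)
    # Standard error of log(OR): square root of the summed reciprocal counts.
    se_log_or = math.sqrt(1 / exposed_cases + 1 / unexposed_cases
                          + 1 / exposed_controls + 1 / unexposed_controls)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Invented counts for illustration only.
estimate, lower, upper = odds_ratio_ci(300, 700, 200, 800)
print(f"OR = {estimate:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```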

Given the hormone-related etiology of breast cancer, the analyses within the breast cancer studies will also be stratified according to menopausal status, where appropriate. Quantitative variables will be entered into the model continuously (where available) and categorically. Test for trend will be performed with a model with a continuous variable or a score for ordinal categories. Relevant confounding or effect modifying factors will be identified and then accounted for in multivariate analyses. Test for heterogeneity by study strata will be performed by comparing logistic regression models with and without a risk factor x study interaction term using a likelihood ratio test.


Task 5.2 Evaluation of gene-environment interaction (GxE) with established genetic susceptibility loci using existing genotype data
Initially the gene-environment analysis will be conducted using the defined epidemiologic variables for the designated environmental/lifestyle risk factors with existing genotype data. We will use confirmed genetic variants already identified within the consortia and those variants confirmed by WP2 to show main effects for breast cancer in BCAC, for ovarian cancer in OCAC, for prostate cancer in PRACTICAL and for mutation carriers in CIMBA.

  1. For breast cancer, the analysis will include the 5 novel gene loci recently identified by the GWAS of breast cancer in BCAC and the 2 genetic variants identified through candidate gene analyses and replication within BCAC. Genotype data for these genetic variants are already available in BCAC so that these analyses will be carried out in 20,000 cases and 20,000 controls at DKFZ (partner 4). Similar genotype data are available for CIMBA (6,000 mutation carriers including 3,300 breast cancer cases) and GxE analyses will be carried out at NKI (partner 9). Further, newly confirmed SNPs emerging from the analyses of GWAS by WP2 will also be included.

  2. For ovarian cancer, several novel genetic loci are expected to emerge from the ongoing GWAS in Cambridge and in OCAC (led by Tom Sellers at the Moffitt Research Institute, Tampa) by 2009. Analysis of gene-environment interactions will be carried out in this sample of 8,000 cases and 8,000 controls at DCS (partner 11) for the confirmed set of genetic loci determined by WP2.

  3. For prostate cancer, 5 SNPs have been identified through several GWAS. These SNPs will have been genotyped in 10,000 cases and 10,000 controls in PRACTICAL by the start of this project. GxE analysis will be carried out with these 5, if replicated, and additional confirmed SNPs for prostate cancer in PRACTICAL at ICR (partner 13).

Analysis of gene-environment interaction will be initially carried out under the standard multiplicative logistic regression model. Departures from the multiplicative joint effect of genotype and environment exposure are measured by the OR for the interaction. An interaction parameter of more than 1 indicates a greater than multiplicative effect between the 2 factors. To test for effect modification, multiplicative interaction terms will be included in the models and evaluated by the log likelihood ratio test.

The main test of the null hypothesis of no interaction will generally be a likelihood ratio test (2 d.f.) comparing a model that includes interaction terms for risk factor (dichotomous) x genotype (3 categories) with a model without interaction terms, and a likelihood ratio test (1 d.f.) for a model including only one interaction term for risk factor (dichotomous) x scored genotype (for allelic dose). A correspondingly greater number of interaction terms will be required when exposure variables with more than two categories are used.
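
The two likelihood ratio tests can be sketched as follows for grouped counts of cases and controls in the six cells defined by genotype (0, 1, 2) and a dichotomous risk factor. This is a minimal, unadjusted illustration that omits age, study centre and other covariates; the function names and the counts are hypothetical, and the logistic models are fitted with a small Newton-Raphson routine written only for this example rather than with the statistical software that would be used in practice.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def max_loglik(design, cases, controls, iterations=30):
    """Maximised log-likelihood of a logistic model on grouped data.

    design holds one covariate row per cell; fitting is by Newton-Raphson.
    """
    k = len(design[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for x, d, c in zip(design, cases, controls):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for i in range(k):
                grad[i] += x[i] * (d - (d + c) * p)
                for j in range(k):
                    info[i][j] += x[i] * x[j] * (d + c) * p * (1.0 - p)
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    loglik = 0.0
    for x, d, c in zip(design, cases, controls):
        p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
        loglik += d * math.log(p) + c * math.log(1.0 - p)
    return loglik

def interaction_tests(cases, controls):
    """Likelihood ratio tests for risk factor x genotype interaction.

    Cells are ordered (g0,e0), (g1,e0), (g2,e0), (g0,e1), (g1,e1), (g2,e1).
    """
    cells = [(g, e) for e in (0, 1) for g in (0, 1, 2)]
    # 2 d.f. test: genotype in 3 categories (two indicator variables).
    main_cat = [[1, int(g == 1), int(g == 2), e] for g, e in cells]
    full_cat = [row + [row[1] * row[3], row[2] * row[3]] for row in main_cat]
    stat_2df = max(0.0, 2 * (max_loglik(full_cat, cases, controls)
                             - max_loglik(main_cat, cases, controls)))
    # 1 d.f. test: genotype scored 0/1/2 for allelic dose.
    main_trend = [[1, g, e] for g, e in cells]
    full_trend = [[1, g, e, g * e] for g, e in cells]
    stat_1df = max(0.0, 2 * (max_loglik(full_trend, cases, controls)
                             - max_loglik(main_trend, cases, controls)))
    return {
        # Chi-square upper tail: exp(-x/2) for 2 d.f., erfc(sqrt(x/2)) for 1 d.f.
        "categorical_2df": (stat_2df, math.exp(-stat_2df / 2)),
        "trend_1df": (stat_1df, math.erfc(math.sqrt(stat_1df / 2))),
    }

# Invented counts in which carriers of two alleles show a stronger exposure effect.
example = interaction_tests([100, 200, 400, 150, 300, 1800], [100] * 6)
print(example)
```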

For each risk factor, both global variables (e.g. parous versus non-parous, ever versus never HRT use) and more detailed variables (e.g. number of full-term pregnancies, current HRT use, duration of HRT use) will be assessed for interaction with the genetic loci. When significant heterogeneity in ORs (i.e. gene-environment interaction) is observed, the effect of lifestyle/environmental risk factors will be determined according to genotype by performing logistic regression to yield risk estimates for the risk factor separately for subgroups defined by the genotype or allele carrier status. If there is no evidence of modification of odds ratios under a null multiplicative model, departure from additivity will be evaluated for the same combinations of genotypes of confirmed genetic loci and environmental exposures.

Joint effects of genotype and environmental exposure will also be examined by the generation of subgroups defined by genotype and risk factor level. The groups can then be entered simultaneously in the logistic model to calculate the associated odds ratios with respect to a common reference group. This statistical analysis plan will be discussed, formally written up and circulated to all collaborators to promote homogeneous analyses across cancer sites.
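
The joint-effects approach, and the contrast between the multiplicative and additive scales, can be sketched for a dichotomous genotype (carrier versus non-carrier) and a dichotomous exposure. The counts and names below are invented. The relative excess risk due to interaction (RERI), computed here from odds ratios, is one commonly used measure of departure from additivity; its use is an assumption of this example, since the text does not specify which additive measure will be applied.

```python
def joint_effects(cases, controls):
    """Joint odds ratios relative to the unexposed non-carrier group.

    cases and controls are dicts keyed by (carrier, exposed), each 0 or 1.
    """
    ref_odds = cases[(0, 0)] / controls[(0, 0)]
    ors = {cell: (cases[cell] / controls[cell]) / ref_odds for cell in cases}
    or_g, or_e, or_ge = ors[(1, 0)], ors[(0, 1)], ors[(1, 1)]
    return {
        "joint_ORs": ors,
        # Ratio of 1 means the joint effect is exactly multiplicative.
        "multiplicative_interaction": or_ge / (or_g * or_e),
        # RERI of 0 means the joint effect is exactly additive.
        "RERI": or_ge - or_g - or_e + 1.0,
    }

# Invented counts: exactly multiplicative joint effect, yet more than additive.
cases = {(0, 0): 100, (1, 0): 150, (0, 1): 200, (1, 1): 300}
controls = {(0, 0): 100, (1, 0): 100, (0, 1): 100, (1, 1): 100}
summary = joint_effects(cases, controls)
print(summary)
```

In this invented example the interaction ratio on the multiplicative scale is 1 while the RERI is positive, which is the situation in which an additional evaluation on the additive scale is informative.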


Task 5.3 Evaluation of gene-environment interactions (GxE) with possible genetic susceptibility loci using existing genotype data
In the GWAS for breast cancer performed by the Cambridge group together with BCAC, the 31 most significant SNPs in a second stage study of 4,000 cases and 4,000 controls were tested in large series of case-control samples, including about 22,000 cases and 22,000 controls. Although 25 of these SNPs did not achieve genome-wide significance levels (10^-7), it is possible that some may be associated with significant GxE effects.

Whether or not a significant main effect occurs in the presence of interaction depends on the strength of the GxE effect and the proportion in the population that is exposed to the involved genetic and environmental factors. For breast cancer, we shall perform the analyses using the genotype data of the 25 SNPs from the GWAS in BCAC, which is already available in about 20,000 cases and 20,000 controls for whom epidemiologic data have also been collected. The complete sample will be subdivided so that the screening for GxE will be performed in 10,000 cases and 10,000 controls and the most significant results obtained will be evaluated in a replication sample of the same size (partner 4).

For ovarian cancer, we shall perform similar analysis using the genotype data on the SNPs that were tested but not confirmed in the GWAS for ovarian cancer. The complete sample of 10,000 cases and 10,000 controls will also be subdivided to provide a screening and a replication sample. An approach similar to that for ovarian cancer will be used for prostate cancer. To test for GxE between the 25 SNPs and defined epidemiological variables, we shall use the case-only design, which has demonstrated greater power than case-control analysis to detect gene-environment interaction in the absence of gene-environment association in the population.

The interaction odds ratio can be estimated by fitting a logistic regression model with a dichotomous variable (either dichotomized exposure variable or genetic variable) as the dependent variable. Polytomous regression models will be fitted for dependent variables with more than two categories. However, this approach holds only under the assumption that genotype and environmental exposure are uncorrelated in the population. Therefore, when statistically significant interaction odds ratios are found using case-only analysis, the analysis of gene-environment interaction will be repeated using case-control data in order to validate the finding by eliminating false positive associations due to genotype-environmental exposure correlations in the population. Again, this statistical analysis plan will be formally written up and circulated to all collaborators to promote homogeneous analyses across cancer sites.
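
For a dichotomous genotype and a dichotomous exposure, the case-only interaction odds ratio is the genotype-exposure cross-product ratio among cases, while the case-control interaction odds ratio is that ratio divided by the corresponding ratio among controls. The following sketch, with invented counts and hypothetical function names, shows both estimates together with the standard errors of their logarithms; covariates are ignored.

```python
import math

def cross_product_or(n11, n10, n01, n00):
    """Genotype-exposure odds ratio; nGE with G and E coded 1 (yes) or 0 (no)."""
    return (n11 * n00) / (n10 * n01)

def se_log_or(*counts):
    """Standard error of a log odds ratio from the cell counts involved."""
    return math.sqrt(sum(1 / n for n in counts))

def compare_designs(cases, controls):
    """cases and controls are tuples (n11, n10, n01, n00)."""
    case_only_or = cross_product_or(*cases)
    control_or = cross_product_or(*controls)  # G-E association in controls
    return {
        "case_only_OR": case_only_or,
        "case_only_SE": se_log_or(*cases),
        "control_GE_OR": control_or,
        "case_control_OR": case_only_or / control_or,
        "case_control_SE": se_log_or(*cases, *controls),
    }

# Invented counts in which genotype and exposure are independent in controls.
comparison = compare_designs(cases=(120, 80, 200, 200),
                             controls=(100, 100, 200, 200))
print(comparison)
```

With these counts the two designs give the same interaction odds ratio, but the case-only standard error is smaller, which reflects the greater power of the design. If the genotype-exposure odds ratio in controls differed from 1, the case-only estimate would be biased by that factor, which is why significant case-only findings are to be re-examined with case-control data.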


Task 5.4 Testing of gene-environment interactions (GxE) with all 1,536 SNPs genotyped in this project to select SNPs showing GxE but not main effects for genotyping in further case-control sets.
Genetic susceptibility loci that modify effects of risk factors but have a weak main effect may not be detected when testing for association with cancer risk alone. However, they may be detected by GxE analysis. In order to identify such loci, we shall test for GxE with all genotyped 1,536 SNPs from WP3 with selected risk factors and only dichotomized epidemiologic variables (partner 4).

Genetic variants found to show significant GxE will be submitted to WP2 for consideration and inclusion into the selected set of SNPs for genotyping in further case-control datasets. Validation of significant findings of GxE can then be carried out in the second series of 79,000 samples (20,000 breast cancer cases and 20,000 controls; 10,000 prostate cancer cases and 10,000 controls; 5,000 ovarian cancer cases and 5,000 controls; 9,000 BRCA1/2 carriers) from the Consortia after integration of genotype data from WP3. With the large number of SNPs to be interrogated, we require a method for testing gene-environment interaction in the presence of gene-environment association in the population that also adequately accounts for multiple testing.

Although the more powerful case-only approach is not valid in the presence of gene-environment association, adequate adjustment is possible if the source(s) for the gene-environment association are known and can be measured. Therefore we plan to use an Empirical Bayesian approach to test for gene-environment interaction that has the extra power of a case-only statistic without its bias (proposed by Duncan Thomas, USC, Los Angeles, unpublished). This is achieved by hierarchical modelling of population SNP-risk factor associations in cases accounting for the bias which is estimated from all the SNP-risk factor association assessments in the controls across all SNPs. By modelling on all the SNPs, multiple testing is accounted for in the test statistic.


Task 5.5 Evaluation of gene-environment interaction (GxE) with confirmed genetic susceptibility loci using existing genotype data
Detailed GxE analysis will be carried out for the defined epidemiologic variables and the list of 50 genetic variants selected by WP2 showing main effects for breast cancer, ovarian cancer and prostate cancer, to be genotyped in further case-control sets for each cancer. Genotype data will be available for 76,000 samples (20,000 breast cancer cases and 20,000 controls; 10,000 prostate cancer cases and 10,000 controls; 5,000 ovarian cancer cases and 5,000 controls; 6,000 BRCA1/2 carriers).

The analyses will be carried out using the methodology described in task 5.3, whereby confounding or effect modifying factors will be accounted for in multivariate analyses. It will be possible to validate significant findings using data from the second series of 79,000 samples genotyped for these 50 SNPs from WP3.

Task 5.6 Evaluation of gene-environment interaction (GxE) with genetic susceptibility loci identified by WP2
The GxE analyses will be carried out for defined epidemiologic variables with the novel SNPs confirmed by WP2 to have a main effect on cancer risk for the three cancer sites. The same statistical methods as described in Task 5.2 will be employed for assessing GxE with risk factors using both global and more detailed variables. In order to replicate within the project, we shall use the first series of samples for the 1,536-SNP array (10,000 cases and 10,000 controls for breast cancer, 10,000 cases and 10,000 controls for prostate cancer, 5,000 cases and 5,000 controls for ovarian cancer and 6,000 BRCA1/2 carriers) to test for GxE with the novel SNPs.

Significant GxE findings will be validated using data from the second series of 79,000 samples (20,000 breast cancer cases and 20,000 controls; 10,000 prostate cancer cases and 10,000 controls; 5,000 ovarian cancer cases and 5,000 controls; 9,000 BRCA1/2 carriers) from the Consortia after integration of genotype data from WP3. Apart from gene-environment interaction analyses in which each type of cancer will be examined as a homogeneous disease, a further step will be taken in WP6, in which tumour subtypes will be classified according to their pathological/molecular characteristics.

Deliverables (brief description) and months of delivery
D5.1. Questionnaires from consortia members collected and posted on COGS website (month 3)
D5.2. Data dictionary of established and suspected lifestyle/environmental risk factors for each cancer site, including variable definition and coding, finalized and placed on COGS website (month 6)
D5.3. A database with harmonized epidemiologic data established for each of the different cancers and for BRCA1/2 mutation carriers (month 12)
D5.4. Results posted on the COGS website from analysis of GxE performed for the already established susceptibility loci of breast cancer (month 18)
D5.5. List of selected lifestyle/environmental variables used for exploratory analysis of GxE analyses for breast cancer with all 1536 SNPs genotyped to be placed on COGS website (month 23)
D5.6. Results posted on the COGS website from analysis of GxE performed for the possible susceptibility loci of breast cancer from previous GWAS (month 26)
D5.7. Results posted on the COGS website from analysis of GxE (using global variables) performed for the already established susceptibility loci of breast cancer in BRCA1/2 mutation carriers and for the established susceptibility loci from GWAS for ovarian and prostate cancer (month 28)
D5.8. List of selected detailed variables for GxE analyses in breast, ovarian and prostate cancer to be placed on COGS website (month 30)
D5.9. Results posted on the COGS website from GxE analysis for confirmed susceptibility loci for breast, ovarian, and prostate cancer (month 40)

Additional information