WP2 - BIOSTATISTICS AND BIOINFORMATICS

Work package number: 2          Start date or starting event: Month 1
Work package title: BIOSTATISTICS AND BIOINFORMATICS
Activity type: RTD
Participant number          1      2      5      9
Person-months participant   32.2   204    19.2   65

Objectives

  1. To identify novel susceptibility loci by synthesis of existing genetic data (stage I).
  2. To identify 1536 SNPs for further evaluation of disease associations in large case-control studies
    (stage II).
  3. To identify novel genetic susceptibility loci through analysis of newly generated genotype data
    (stage III).
  4. To develop new tools for visualization of the association data.
  5. To develop and test new models for risk prediction for each cancer.
  6. To evaluate associations between genetic variants and gene expression levels in tumours.

Description of work
This work package will be responsible for statistical analyses and bioinformatics related to multiple aspects of the proposal.

Task 2.1 Synthesis of existing association data, Stage I
Tasks 2.1-2.4 will be co-ordinated in UCAM (Professor Easton). The initial task will be to assemble and construct databases of the genome-wide association data, together with phenotypic data, from studies of each cancer (at least 4 for breast cancer, 3 for prostate cancer and 2 for ovarian cancer). This step is referred to as stage I in the multi-step SNP selection procedure of COGS. The current data sets that will contribute GWAS data are:

Cancer     Country     Cases/controls   Status
Breast     UK          400/400          Completed
           US          1200/1200        Completed
           Sweden      800/800          Completed
           Finland     800/800          Completed
           Germany     500/2000         Completed
           Total       3700/4200
Prostate   UK          2000/2000        Completed
           Australia   2000/2000        Completed
           Sweden      1500/1500        Completed
           US          1200/1200        Completed
           Total       6700/6700
Ovary      UK          2000/2000        Mar-09
           US          2000/2000        Mar-09
           Total       4000/4000

These data will be used to evaluate the association between each genotype and disease, combined over all available studies. To allow for differences in the genotyping platforms used in each genome scan, we will use imputation methods to estimate genotypes at all known SNPs, using the international HapMap as a reference. We will then perform statistical tests of association for each known SNP against disease. We will also provide a web-based tool to visualise results from all the genome-scan data (and, subsequently, the follow-up results). Once the initial combined analyses are complete, we will identify a set of up to 1,536 SNPs for each disease, representing the most promising loci, for further genetic analyses as described in WP3 (task 3.2). This set will be chosen by first identifying the most significant SNPs, and then using multiple regression approaches to define a set of independently significant SNPs for further genotyping. Simultaneously, those loci with strong evidence of association will be passed to WP4 for fine-scale mapping.
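As an illustration of combining evidence for one SNP over several studies, the sketch below uses fixed-effects inverse-variance weighting of per-study log odds ratios. This is one common choice and is an assumption here, since the proposal does not name the combination method. The function names and the allele counts are hypothetical.

```python
import math

def log_or_from_counts(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Per-allele log odds ratio and Woolf standard error from a 2x2 allele-count table."""
    log_or = math.log((case_alt * ctrl_ref) / (case_ref * ctrl_alt))
    se = math.sqrt(1 / case_alt + 1 / case_ref + 1 / ctrl_alt + 1 / ctrl_ref)
    return log_or, se

def fixed_effects_meta(estimates):
    """Combine (log_odds_ratio, standard_error) pairs by inverse-variance weighting."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value under a normal approximation
    return beta, se, z, p

# Hypothetical allele counts (case_alt, case_ref, ctrl_alt, ctrl_ref) for one SNP
# in three studies; these numbers are illustrative only.
hypothetical_studies = [
    (260, 540, 220, 580),
    (780, 1620, 700, 1700),
    (520, 1080, 470, 1130),
]
per_study = [log_or_from_counts(*counts) for counts in hypothetical_studies]
beta, se, z, p = fixed_effects_meta(per_study)
print(f"combined OR = {math.exp(beta):.3f}, z = {z:.2f}, p = {p:.3g}")
```

In practice the per-study estimates would come from regression on observed or imputed genotypes with study-specific covariates; the weighting step would stay the same.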
Task 2.2 Analysis of Association data arising from WP3
Data from WP3 and WP4 will be integrated into databases for each disease. We will develop online systems for downloading data from each consortium site and for providing quality control. Standard methods will then be used to estimate age-specific cancer risks associated with each SNP, and to derive significance tests that combine the existing association data with that generated in task 2.1. These results will be used to identify a further set of SNPs (up to 50 for each disease) to be typed in WP3 (task 3.3, taking into account any additional SNPs identified by WP5). This additional genotyping will in turn be integrated into the database, and final combined significance tests and age-specific risk estimates for each SNP will be derived.
Task 2.3 Modification of risk in BRCA1/2 carriers
SNPs identified as associated with breast and ovarian cancer in task 3.2 will also be typed in BRCA1 and BRCA2 carriers, via the CIMBA consortium (WP3). Distinct statistical methods will be required for this group, since the subjects are ascertained through retrospective family-based cohorts rather than population-based case-control studies. We will estimate breast and ovarian cancer risks associated with these SNP genotypes in carriers, using a retrospective likelihood approach to condition on disease status.
Task 2.4 Fine-scale mapping analyses
The aim of this task will be to define, for each locus analysed in WP4, the set of variants that are most likely to be causally related to disease. We will aim to use a combination of regression analyses, haplotype analyses and coalescent approaches to define a likelihood for each variant in the region.
Task 2.5 Gene-gene interaction
CNIO (Dr Roger Milne) will be responsible for these analyses. We will conduct combined analyses of all SNPs that show clear evidence of association, to test for departure from a multiplicative model and to define the risks associated with each combined genotype. Separately, we will conduct exploratory pairwise analyses of all SNPs selected for analysis in WP3. If strong evidence for interactions is found, these SNPs will then be available for future genotyping in the consortia.
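A departure from the multiplicative model for two loci can be illustrated with a simple Wald test. The sketch below assumes each SNP is reduced to carrier versus non-carrier and that there is no covariate adjustment; both are simplifications made for the example, and the function name and counts are hypothetical. The statistic is the ratio of the joint odds ratio to the product of the two single-locus odds ratios, which equals 1 under a multiplicative model.

```python
import math

def multiplicative_interaction_test(counts):
    """Wald test for departure from a multiplicative model for two binary factors.

    counts maps (carrier_snp1, carrier_snp2), each coded 0 or 1,
    to a (cases, controls) pair; all eight counts must be positive.
    """
    log_ratio = 0.0
    variance = 0.0
    for (g1, g2), (cases, controls) in counts.items():
        # log[OR11 / (OR10 * OR01)]: cells (0,0) and (1,1) enter with +, the others with -
        sign = 1.0 if g1 == g2 else -1.0
        log_ratio += sign * (math.log(cases) - math.log(controls))
        variance += 1.0 / cases + 1.0 / controls  # Woolf variance over all eight cells
    z = log_ratio / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return math.exp(log_ratio), z, p

# Hypothetical counts, illustrative only.
table = {
    (0, 0): (900, 1000),
    (1, 0): (420, 400),
    (0, 1): (330, 300),
    (1, 1): (190, 120),
}
ratio, z, p = multiplicative_interaction_test(table)
print(f"interaction ratio = {ratio:.2f}, z = {z:.2f}, p = {p:.3g}")
```

With per-allele genotype coding or covariates, the same hypothesis would normally be tested through the interaction term of a logistic regression model.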
Task 2.6 Expression and DNA copy number array analysis
NKI (Dr Lodewyk Wessels) will have responsibility for this task. The data that will be employed in this task are derived from between 4000 and 6000 breast cancer cases from the TRANSBIG-MINDACT trial and 1000 from NBAC. For each of these samples the following measurements will become available: 1) 44K gene expression data; 2) genotyping data through WP3; and 3) histo-pathological data via TRANSBIG/NBAC. We will concentrate on three sub-tasks. First, we will employ gene expression data to define molecular subtypes, i.e. groups of samples that have a common gene expression profile, dissimilar from other samples in the dataset. Second, in collaboration with WP6, we will evaluate the association of these subtypes with disease-associated SNPs. A similar approach will be used to analyse the 4000 OCAC comparative genomic hybridization arrays. Finally, we will employ histo-pathological data to subtype cases in the BCAC and CIMBA consortia (see WP6). This will allow SNP associations with disease subtypes to be evaluated in these much larger cohorts.

Task 2.6.1 Identifying molecular subtypes from array data
We will follow three strategies to perform tumour subtyping based on gene expression or array comparative genomic hybridization data: 1) apply known subtyping; 2) apply outcome-driven subtyping derived from a publicly available breast cancer expression compendium and existing predictive gene expression signatures; and 3) derive a new subtyping from the TRANSBIG/NBAC/OCAC data, based on functional gene sets.

Apply existing (published) molecular subtyping. In this approach the subtyping is performed by employing existing gene sets, which have been shown to yield groups of samples with distinct disease outcome. This will simply involve application of existing subtyping procedures to the tumours in this project.

Outcome-driven subtyping based on public breast cancer expression data. Since it is known that existing subtypes are heterogeneous in terms of outcome, we will perform outcome-driven subtyping based on the results derived from a large collection of publicly available breast cancer expression datasets containing 1147 samples. We employed most of the currently published profiles predictive of outcome to determine a consensus set of genes predictive of outcome. Functional analysis of this set revealed known and novel processes associated with outcome of specific subsets of tumours. Depending on the combination of processes active in a given tumour, the tumour set can be stratified into distinct outcome groups. The gene expression of e.g. the TRANSBIG tumours will then be employed to stratify these accordingly.

Geneset-based subtyping based on TRANSBIG expression data. We will develop a new subtyping based on grouping of module expression profiles - modules being functionally related sets of genes. This grouping can be achieved by, for example, hierarchical clustering or biclustering, and is aimed at revealing groups of tumours with homogeneous but distinct expression profiles. This approach has the advantage that functional interpretation can be attached to a particular subtype in terms of the activity of biological processes as represented by up- or down-regulated modules.
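The grouping of module expression profiles by hierarchical clustering can be sketched as follows. The example assumes average linkage with Euclidean distance and a fixed number of clusters; these choices, the function names and the module scores are all illustrative assumptions, not the procedure the work package commits to.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two module-score profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage_clusters(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters whose members
    have the smallest mean pairwise distance, until n_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(euclidean(profiles[a], profiles[b])
                            for a in clusters[i] for b in clusters[j])
                mean_dist = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or mean_dist < best[0]:
                    best = (mean_dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical data: one row per tumour, one column per module
# (for instance the mean expression of the genes in a functional gene set).
module_scores = [
    [2.0, -1.0, 0.1],
    [1.8, -0.9, 0.0],
    [2.2, -1.2, -0.1],
    [-1.5, 1.1, 0.2],
    [-1.7, 0.9, 0.0],
    [-1.4, 1.3, 0.1],
]
subtypes = average_linkage_clusters(module_scores, n_clusters=2)
print(subtypes)
```

This naive implementation recomputes all distances at every merge and is only suitable for small examples; for thousands of tumours an established library routine would be the practical choice.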


Task 2.6.2 Finding SNP associations with the defined subtypes (performed by WP6)
The subtypes defined in Subtask 2.6.1 will be analyzed for association with single genetic loci, locus interactions, as well as locus-environment interactions. For the higher (than first) order analyses, a restricted set of known or novel (marginal) loci could be employed.

Task 2.6.3 Linking expression subtypes to histo-pathological data
Since expression data is not available for all samples in this project, we will employ histo-pathological data to link the expression subtypes defined on the TRANSBIG tumours to other large cohorts. This requires the selection of a set of histo-pathological parameters that can predict the expression subtypes as reliably as possible. This classifier can then be applied to subtype the BCAC and CIMBA set, to validate the findings obtained on the TRANSBIG tumours in Task 2.6.2.
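The idea of predicting an expression subtype from histo-pathological parameters can be illustrated with a small categorical classifier. The sketch below uses naive Bayes with add-one smoothing, which is chosen here only because it is compact; the proposal does not specify a classifier. The feature values (ER status, grade), subtype labels and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class frequencies of each feature value."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> counts of values
    values_seen = defaultdict(set)       # feature index -> distinct values in training data
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
            values_seen[j].add(value)
    return class_counts, value_counts, values_seen

def predict_subtype(model, row):
    """Return the class with the highest posterior log-probability."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        for j, value in enumerate(row):
            # Add-one (Laplace) smoothing so unseen combinations keep non-zero probability
            numerator = value_counts[(label, j)][value] + 1
            denominator = n + len(values_seen[j])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set: tumours with both expression subtype and pathology data.
pathology = [("ER+", "grade1"), ("ER+", "grade2"), ("ER+", "grade1"),
             ("ER-", "grade3"), ("ER-", "grade3"), ("ER-", "grade2")]
expression_subtype = ["subtype_A"] * 3 + ["subtype_B"] * 3

model = train_naive_bayes(pathology, expression_subtype)
print(predict_subtype(model, ("ER+", "grade1")))
print(predict_subtype(model, ("ER-", "grade3")))
```

Whatever classifier is used, its accuracy would need to be assessed on tumours held out from training before it is applied to cohorts without expression data.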
Task 2.7 Risk model development
UCAM (Dr Antonis Antoniou) will have responsibility for this part of the WP. For breast and ovarian cancer, we will base our design on our existing model (BOADICEA), which incorporates the effects of known high-risk genes (BRCA1 and BRCA2). We will first extend the model to incorporate the effects of other known susceptibility loci. We will then incorporate the effects of all SNP associations established during the project, together with lifestyle factors and interactions. A separate model will be developed for prostate cancer. We will evaluate the models for their ability to discriminate between high- and low-risk individuals and test their calibration using goodness-of-fit tests. For each model, we will extend our existing web-based tool to allow risk predictions to be calculated for individuals with given combinations of genotypes and risk factors.
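Discrimination and calibration of a risk model can be summarised as in the sketch below. It assumes discrimination is measured by the area under the ROC curve and calibration by a Hosmer-Lemeshow-type statistic over groups of increasing predicted risk; the proposal only says "goodness of fit tests", so these specific measures, the function names and the toy data are assumptions.

```python
def auc(risks, outcomes):
    """Probability that a randomly chosen case has a higher predicted risk
    than a randomly chosen non-case (ties count one half)."""
    cases = [r for r, y in zip(risks, outcomes) if y == 1]
    controls = [r for r, y in zip(risks, outcomes) if y == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

def calibration_groups(risks, outcomes, n_groups):
    """Return (group size, observed events, expected events) per risk group."""
    order = sorted(range(len(risks)), key=lambda i: risks[i])
    size = len(order) / n_groups
    rows = []
    for g in range(n_groups):
        idx = order[round(g * size):round((g + 1) * size)]
        observed = sum(outcomes[i] for i in idx)
        expected = sum(risks[i] for i in idx)
        rows.append((len(idx), observed, expected))
    return rows

def hosmer_lemeshow(rows):
    """Sum over groups of (O - E)^2 / (E * (1 - E / n)); larger values mean poorer fit.
    Assumes every group has expected events strictly between 0 and its size."""
    return sum((o - e) ** 2 / (e * (1.0 - e / n)) for n, o, e in rows)

# Hypothetical predicted risks and observed outcomes (1 = developed disease).
predicted = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.60, 0.80]
observed = [0, 0, 0, 1, 0, 1, 1, 1]

print(f"AUC = {auc(predicted, observed):.3f}")
rows = calibration_groups(predicted, observed, n_groups=2)
print(rows, f"HL statistic = {hosmer_lemeshow(rows):.3f}")
```

The statistic is usually compared with a chi-square distribution whose degrees of freedom depend on the number of groups and on whether the data were also used to fit the model.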
Deliverables
D2.1. Construction of central databases for each cancer to hold individual phenotype and SNP data from the GWAS to be used in stage I (month 3).
D2.2. Integration of SNP data from GWAS into the SNP database (month 6).
D2.3. Complete analysis of existing GWA data for breast, ovarian and prostate cancer – novel loci identified to be followed up by WP4/5/6 (month 9).
D2.4. Define set of SNPs for breast, ovarian and prostate cancer, for genotyping in stage II (WP3) (month 12).
D2.5. Incorporate individual SNP genotype data generated in stage II (WP3, task 3.2) into the central databases (month 22).
D2.6. First reports describing main effect association analyses for each cancer (month 24).
D2.7. Define set of SNPs to be typed for each cancer in WP3 for stage III (task 3.3) (month 24).
D2.8. Integration of gene expression data for breast cancer into the gene expression database (month 24).
D2.9. Website giving access to combined analyses (month 24).
D2.10. Incorporate data from WP3 (task 3.3) into database (month 33).
D2.11. Report describing main effect association analyses for subtypes (with WP7) (month 39).
D2.12. Reports on final main effects analysis for each cancer (month 42).
D2.13. Report on main effects of SNPs in BRCA1/2 carriers (month 42).
D2.14. Reports describing gene-gene interaction analyses in each cancer (month 45).
D2.15. Report of risk models for breast/ovary (month 45).
D2.16. Report of risk model for prostate cancer (month 45).
D2.17. Report describing mapping of expression subtypes to histopathological equivalents (month 45).
D2.18. Website for the risk models (month 48).

Additional information