p1RCC DATA DESCRIPTION
Patient Samples
The patient submitted a whole blood sample and FFPE tissue samples (formalin-fixed, paraffin-embedded; a method for preserving tissue). The whole blood sample serves as the source of “normal” DNA and the tissue sample serves as the source of “tumour” DNA. Within the project directories, WBA is the whole blood sample and T1_1A is the tumour tissue sample.
Whole Genome Sequencing
Qualified genomic DNA samples were randomly fragmented using Covaris technology, and 350 bp fragments were obtained after size selection. The DNA fragments were end-repaired and an "A" base was added to the 3'-end of each strand. Adapters were then ligated to both ends of the end-repaired, dA-tailed fragments, which were amplified by ligation-mediated PCR (LM-PCR); the single strands were then separated and circularized. Rolling circle amplification (RCA) was performed to produce DNA nanoballs (DNBs). The qualified DNBs were loaded onto patterned nanoarrays and paired-end reads were sequenced on the BGISEQ-500 platform. High-throughput sequencing was performed for each library to ensure that each sample met the average sequencing coverage requirement (90x). Raw image files from the sequencer were processed by the BGISEQ-500 base-calling software with default parameters. The sequence data for each sample were generated as paired-end reads, which are defined as "raw data" and stored in FASTQ format.
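To make the 90x coverage requirement concrete, here is a rough sketch of the usual depth arithmetic (total sequenced bases divided by genome size). The read length and genome size below are assumptions for illustration only; neither is stated in this document, so check the FASTQ files for the real read length.

```python
GENOME_SIZE = 3_100_000_000  # assumption: approximate human genome length in bp
READ_LENGTH = 100            # assumption: read length is not stated in this document


def mean_coverage(read_pairs: int, read_length: int, genome_size: int) -> float:
    """Average depth = total sequenced bases / genome size (2 reads per pair)."""
    return read_pairs * 2 * read_length / genome_size


def pairs_needed(target_depth: int, read_length: int, genome_size: int) -> int:
    """Smallest number of read pairs whose total bases reach the target depth."""
    bases_needed = target_depth * genome_size
    bases_per_pair = 2 * read_length
    return (bases_needed + bases_per_pair - 1) // bases_per_pair  # integer ceiling


if __name__ == "__main__":
    pairs = pairs_needed(90, READ_LENGTH, GENOME_SIZE)
    print(f"{pairs:,} read pairs give about "
          f"{mean_coverage(pairs, READ_LENGTH, GENOME_SIZE):.0f}x")
```

Under these assumptions, 90x corresponds to roughly 1.4 billion read pairs per sample, which is why the FASTQ and BAM files are large.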
Video Overview of Methods | In-depth Info on BGISEQ-500 | Comparison to Illumina HiSeq X Ten
Data
There are two primary directories within the project folder: rarekidneycancer_patient_0 and somatic. Within rarekidneycancer_patient_0 → F18FTSUSAT0015_HUMaasR you’ll find the data from the BGISEQ-500 DNB sequencing (both raw and processed). This directory contains two similarly structured folders, one per sample: WBA and T1_1A. As previously noted, WBA is the "normal" whole blood sample and T1_1A is the "tumour" sample. Inside each are three folders: clean_data, result_alignment, and result_variation.
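The layout described above can be sketched as a small path helper. This is only an illustration: the project root passed in is a hypothetical placeholder, and the folder names are copied from the listing above, so confirm the exact spelling on disk before relying on them.

```python
from pathlib import PurePosixPath

# Sample folder name -> what it contains (from the description above).
SAMPLES = {"WBA": "normal (whole blood)", "T1_1A": "tumour (FFPE tissue)"}
SUBFOLDERS = ("clean_data", "result_alignment", "result_variation")


def sample_folders(project_root: str) -> dict[str, dict[str, PurePosixPath]]:
    """Build the expected per-sample folder paths; does not touch the disk."""
    base = (PurePosixPath(project_root)
            / "rarekidneycancer_patient_0"
            / "F18FTSUSAT0015_HUMaasR")
    return {
        sample: {sub: base / sample / sub for sub in SUBFOLDERS}
        for sample in SAMPLES
    }


if __name__ == "__main__":
    # "/data/p1RCC" is a made-up root used only for this example.
    for sample, folders in sample_folders("/data/p1RCC").items():
        print(sample, "=", SAMPLES[sample])
        for path in folders.values():
            print("   ", path)
```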
Inside the clean_data folder, you’ll find data that has been filtered to reduce noise in the sequencing data. The filtering process (1) removed reads containing the sequencing adapter; (2) removed reads in which more than 50% of bases are low quality (base quality less than or equal to 5); and (3) removed reads in which more than 10% of bases are unknown ('N' bases). All downstream analysis was performed on this data.
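The three filtering rules above can be sketched as a single per-read check. This is not the vendor's actual filtering code. It assumes Phred+33 quality encoding and exact-match adapter detection, and it looks at one read at a time (a production filter would typically tolerate adapter mismatches and drop both mates of a pair). The adapter string in the usage lines is a made-up placeholder.

```python
def keep_read(seq: str, qual: str, adapter: str, phred_offset: int = 33) -> bool:
    """Return True if a read passes the three clean_data-style filters."""
    n = len(seq)
    if n == 0:
        return False
    # (1) Drop reads containing the adapter (assumption: exact substring match).
    if adapter and adapter in seq:
        return False
    # (2) Drop reads where more than 50% of bases have quality <= 5.
    low_quality = sum(1 for ch in qual if ord(ch) - phred_offset <= 5)
    if low_quality / n > 0.5:
        return False
    # (3) Drop reads where more than 10% of bases are 'N'.
    if seq.upper().count("N") / n > 0.10:
        return False
    return True


if __name__ == "__main__":
    adapter = "GGGG"  # hypothetical placeholder, not the real adapter sequence
    print(keep_read("ACGTACGTAC", "IIIIIIIIII", adapter))  # high quality, no adapter
    print(keep_read("ACGTNNGTAC", "IIIIIIIIII", adapter))  # 20% N bases
```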
Inside the result_alignment folder you’ll find the BAM files for each sample for loading as reads. All clean reads were aligned to the human reference genome (GRCh37/hg19) using the BWA-MEM algorithm of the Burrows-Wheeler Aligner (BWA v0.7.12).
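For orientation, a paired-end BWA-MEM invocation generally has the shape sketched below. This is not the command line that produced the BAM files here; the file names, thread count, and read-group values are hypothetical placeholders. The helper only builds the argument list and does not run anything.

```python
def bwa_mem_command(reference: str, read1: str, read2: str,
                    sample: str, threads: int = 4) -> list[str]:
    """Build a `bwa mem` argument list for one paired-end sample."""
    # bwa expects the literal two characters "\t" in -R and expands them itself.
    read_group = rf"@RG\tID:{sample}\tSM:{sample}"
    return ["bwa", "mem", "-t", str(threads), "-R", read_group,
            reference, read1, read2]


if __name__ == "__main__":
    # All file names below are made up for illustration.
    cmd = bwa_mem_command("hg19.fa", "WBA_1.fq.gz", "WBA_2.fq.gz", "WBA")
    print(" ".join(cmd))
```

`bwa mem` writes SAM to standard output, so in practice the output is usually piped into a sort-and-compress step to produce a BAM file.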
Inside the result_variation folder you’ll find directories for cnv (Copy Number Variants), indel (Insertions/Deletions), and snp (Single Nucleotide Polymorphisms), plus sv (Structural Variants) for the WBA sample. Within the indel and snp directories you’ll find the vcf files (both raw and filtered for each) for loading as callsets. Within the sv and cnv folders you’ll find csv/xls files. The SNPs and indels were detected using HaplotypeCaller from GATK (v3.3.0). The variant quality score recalibration (VQSR) method was then applied to obtain high-confidence variant calls. The CNVs were called using the CNVnator \[8\] v0.2.7 read-depth algorithm. The SVs were detected using Breakdancer or CREST. Finally, the SnpEff tool was used to annotate the variants.
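As a quick way to compare the raw and filtered vcf files, the sketch below tallies records by variant type and FILTER value using only the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is a simplification: anything that is not single-base REF and ALT is counted as an indel, and symbolic alleles are not handled. The example records are made up and are not taken from this dataset.

```python
from collections import Counter
from typing import Iterable


def tally_variants(lines: Iterable[str]) -> Counter:
    """Count VCF records by (kind, FILTER); expects uncompressed text lines."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the header row
        fields = line.rstrip("\n").split("\t")
        ref, alt, filt = fields[3], fields[4], fields[6]
        # Simplification: single-base REF and ALT alleles count as a SNP.
        is_snp = len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))
        counts[("snp" if is_snp else "indel", filt)] += 1
    return counts


if __name__ == "__main__":
    # Made-up records for illustration only.
    example = [
        "##fileformat=VCFv4.1\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr1\t1000\t.\tA\tG\t50.0\tPASS\t.\n",
        "chr1\t2000\t.\tAT\tA\t30.0\tPASS\t.\n",
        "chr2\t3000\t.\tC\tT\t10.0\tLowQual\t.\n",
    ]
    for key, count in sorted(tally_variants(example).items()):
        print(key, count)
```

For a real file, pass an open text handle instead of the list (for a gzipped vcf, `gzip.open(path, "rt")` from the standard library).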
Within the somatic directory, you’ll find vcf files corresponding to the somatic variants between the tumour and “normal” DNA in the patient’s samples. The inputs used to generate the somatic variant calls are the tumour and normal genomes, represented by the tumour and normal BAM files. The README file in the somatic directory contains the actual command lines used, as well as links to the software.
Review of Whole Genome Sequencing | Review of Somatic Mutations in Cancer | Wiki on Genetic Variation
Useful Resources
- NCBI Databases (Recommended: ClinVar, dbGaP, SRA, GEO DataSets, BioSystems)
- BioStars
- Genomic Data Commons Data Portal
- Kyoto Encyclopedia of Genes and Genomes (Ex: Pathway Exploration)
- BioConductor
- BioConda
- NCBI Hackathons
- Ben Busby's Slideshare