NBDC Research ID: hum0331.v1
SUMMARY
Aims: Whole genome sequencing (WGS) is being promoted to improve medical care for rare diseases and cancers. Genome analyses of these diseases require WGS data from a healthy control group. We conducted WGS analysis of specimens from healthy individuals held by the six National Center (NC) Biobanks in Japan, taking regional variation into consideration, and constructed a genome database of healthy individuals and disease control groups.
Methods: DNA samples that met the study criteria were shipped from the biobanks and subjected to WGS analysis at a contract analysis laboratory. WGS analysis was performed on a NovaSeq 6000 sequencer using a PCR-free protocol to obtain a minimum output of 90 Gb. The resulting read data in FASTQ format underwent data analysis (mapping and variant calling) at the principal institute (National Center for Global Health and Medicine), and the data, including variant information, were compiled into a database.
Participants/Materials: A total of 9850 DNA samples from healthy individuals (including people with common complex diseases who do not have rare diseases or cancers) that can be used as controls for studies of cancers and rare diseases. Of these, 560 were excluded after QC, leaving 9290 individuals.
URL: https://ncbiobank.org/en/
https://ncbiobank.org/cbs/en/index.html
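The figures in the summary can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the genome size of roughly 3.1 Gb is an assumed approximation for GRCh38 that this record does not state. It converts the 90 Gb minimum output into an approximate raw depth and reproduces the post-QC sample count.

```python
# Assumption: approximate length of the GRCh38 primary assembly in base pairs.
# This value is not given in the record; it is used only for a rough estimate.
ASSUMED_GENOME_SIZE_BP = 3.1e9


def approx_depth(total_bases, genome_size=ASSUMED_GENOME_SIZE_BP):
    """Rough mean depth: sequenced bases divided by genome length."""
    return total_bases / genome_size


def samples_after_qc(total_samples, excluded):
    """Number of samples remaining after QC exclusions."""
    return total_samples - excluded


if __name__ == "__main__":
    # Minimum output of 90 Gb per sample, as stated in Methods.
    print(round(approx_depth(90e9), 1))
    # 9850 samples sequenced, 560 excluded after QC.
    print(samples_after_qc(9850, 560))
```

Under this assumed genome size, the 90 Gb minimum corresponds to roughly 29x raw depth, which is a lower bound per sample rather than the cohort average reported elsewhere in the record.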
Dataset ID | Type of Data | Criteria | Release Date |
---|---|---|---|
hum0331.v1.freq.v1 | NGS (WGS) | Unrestricted-access | 2023/02/01 |
*When research results that include data downloaded from NHA/DRA are published or presented, the data user must cite the papers related to the data or include them in the acknowledgments.
MOLECULAR DATA
Participants/Materials | Healthy individuals without any cancers or rare diseases (ICD10: Z006): 9290 individuals |
Targets | WGS |
Target Loci for Capture Methods | - |
Platform | Illumina [NovaSeq 6000] |
Library Source | DNAs extracted from peripheral blood cells |
Cell Lines | - |
Library Construction (kit name) | TruSeq DNA PCR-Free HT Library Prep Kit |
Fragmentation Methods | Ultrasonic fragmentation |
Spot Type | Paired-end |
Read Length (without Barcodes, Adaptors, Primers, and Linkers) | 150 bp |
QC |
Whole genome sequencing analysis was performed under the following conditions:
- Library size is confirmed to be 400 bp to 750 bp.
- At least 75% of the bases are QV30 or better.
- Total number of bases after removal of duplicate reads by FastQC is more than 90 Gbase.
After alignment and variant calling, the following samples were excluded from the analysis:
- Samples with abnormal values for depth and mapping rate.
- Samples where the depth of the sex chromosomes is inconsistent with the clinical information.
- Any of the samples determined by the KING program to be within the second degree of kinship.
Variant call results were filtered as follows:
- Genotypes with GQ64 or with less than 25% minor alleles in heterozygous calls are set to no call.
- VQSR results are set in the FILTER field of the VCF.
- LowCR is set in the FILTER field for variants with a call rate below 95%.
- HWE is set in the FILTER field for variants with a Hardy-Weinberg equilibrium test P-value less than 10^-6. |
Deduplication | MarkDuplicates (GATK4.1.0) compatible algorithm (Parabricks 3.1.0 fq2bam) |
Calibration for re-alignment and base quality | - |
Mapping Methods | bwa mem (v0.7.15) compatible algorithm (Parabricks 3.1.0 fq2bam) |
Mapping Quality | No hard filtering was performed by mapping quality. |
Reference Genome Sequence | GRCh38 (+HLA+decoy) |
Coverage (Depth) | 34.0 (autosome) |
Detecting Methods for Variations | HaplotypeCaller (GATK4.1.0) compatible algorithm (Parabricks 3.1.0 haplotypecaller) |
SNV Numbers (after QC) |
153,554,029 (autosomes) 6,325,046 (chromosome X) |
INDEL Numbers (after QC) |
18,899,392 (autosomes) 836,126 (chromosome X) |
NBDC Dataset ID | hum0331.v1.freq.v1 |
Total Data Volume | 69.24 GB (vcf) |
Comments (Policies) | NBDC policy |
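The variant-level filtering rules listed in the QC row can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used for this dataset: the function names are hypothetical, the Hardy-Weinberg test is a simple 1-d.f. chi-square test (the record does not say whether an exact test was used), and the "less than 25% minor alleles" rule is interpreted as the fraction of reads supporting the less-represented allele. The GQ threshold is left out because its value is not clear from the record.

```python
import math


def hwe_chi2_pvalue(n_hom_ref, n_het, n_hom_alt):
    """P-value of a 1-d.f. chi-square Hardy-Weinberg test for a biallelic site.

    Assumption: a chi-square test stands in for whatever test was actually used.
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return 1.0
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic site: nothing to test
    observed = (n_hom_ref, n_het, n_hom_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with 1 d.f. is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def mask_het_call(ref_depth, alt_depth, min_minor_fraction=0.25):
    """True if a heterozygous call should be set to no-call.

    Assumption: "minor alleles" means the read fraction of the rarer allele.
    """
    total = ref_depth + alt_depth
    if total == 0:
        return True
    return min(ref_depth, alt_depth) / total < min_minor_fraction


def filter_flags(n_hom_ref, n_het, n_hom_alt, n_missing,
                 min_call_rate=0.95, hwe_threshold=1e-6):
    """Return the FILTER labels (LowCR, HWE) that apply to one variant."""
    flags = []
    called = n_hom_ref + n_het + n_hom_alt
    total = called + n_missing
    if total == 0 or called / total < min_call_rate:
        flags.append("LowCR")
    if hwe_chi2_pvalue(n_hom_ref, n_het, n_hom_alt) < hwe_threshold:
        flags.append("HWE")
    return flags


if __name__ == "__main__":
    print(filter_flags(8100, 1800, 100, 0))     # genotype counts in HWE proportions
    print(filter_flags(5000, 0, 5000, 0))       # no heterozygotes at all
    print(filter_flags(8100, 1800, 100, 1000))  # about 9% missing genotypes
```

In this sketch the two labels are independent, so a variant can carry both LowCR and HWE. A real pipeline would also merge these with the VQSR result in the same FILTER field.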
DATA PROVIDER
Principal Investigator: Katsushi Tokunaga
Affiliation: National Center for Global Health and Medicine Genome Medical Science
Project / Group Name: National Center Biobank Network
Funds / Grants (Research Project Number):
Name | Title | Project Number |
---|---|---|
Program for an Integrated Database of Clinical and Genomic Information, Japan Agency for Medical Research and Development (AMED) | Development of an integrated database of clinical genome information that contributes to the implementation of genomic medicine and the establishment of a continuous genomic medicine system in Japan | JP19kk0205012 |
PUBLICATIONS
No. | Title | DOI | Dataset ID |
---|---|---|---|
1 | Exploring the genetic diversity of the Japanese population: Insights from a large-scale whole genome sequencing analysis | doi: 10.1371/journal.pgen.1010625 | hum0331.v1.freq.v1 |