NBDC Research ID: hum0331.v1

 

SUMMARY

Aims: Whole genome sequencing (WGS) is being promoted for better medical care of rare diseases and cancers. These disease genome analyses require WGS data from a healthy control group. We conducted WGS analysis of healthy individuals, to serve as controls for cancers and rare diseases, using biobank specimens held by the six National Center (NC) Biobanks in Japan, taking regional variation into consideration, and constructed a genome database of healthy individuals and disease control groups.

Methods: DNA samples that met the study criteria were shipped from the biobanks and subjected to WGS analysis at a contract analysis laboratory. WGS analysis was performed on a NovaSeq 6000 sequencer using a PCR-free protocol to obtain a minimum output of 90 Gb. The resulting read data in FASTQ format were subjected to data analysis (mapping and variant calling) at the principal institute (National Center for Global Health and Medicine), and the data, including variant information, were compiled into a database.
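As a rough illustration of what the 90 Gb minimum output means in terms of depth, the following sketch divides the sequenced bases by the genome size. The genome size of 3.1 Gb is an assumption (an approximate figure for the human reference), and the function names are hypothetical; the 150 bp paired-end read layout is taken from the molecular data description of this dataset.

```python
# Back-of-the-envelope depth estimate; illustrative only, not the project's pipeline.
GENOME_SIZE_BP = 3.1e9  # assumption: approximate size of the human reference genome


def expected_depth(total_bases: float, genome_size: float = GENOME_SIZE_BP) -> float:
    """Mean depth = sequenced bases / genome size.

    Ignores unmapped reads, duplicates, and uneven coverage.
    """
    return total_bases / genome_size


def read_pairs_needed(total_bases: float, read_length: int = 150) -> float:
    """Number of paired-end read pairs needed to reach a given base output."""
    return total_bases / (2 * read_length)


if __name__ == "__main__":
    min_output = 90e9  # minimum output per sample stated in the Methods
    print(f"approximate mean depth: {expected_depth(min_output):.1f}x")
    print(f"read pairs required:    {read_pairs_needed(min_output):.0f}")
```

Under these assumptions the minimum output corresponds to roughly 29x, which is consistent with 90 Gb being a lower bound and with the reported average autosomal depth of 34.0.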

Participants/Materials: A total of 9850 DNA samples from healthy individuals (including people with common complex diseases who do not have rare diseases or cancers) that can be used as controls for studies of cancers and rare diseases. Of these, 560 were excluded after QC, leaving 9290 individuals.

URL: https://ncbiobank.org/en/

    https://ncbiobank.org/cbs/en/index.html

 

Dataset ID | Type of Data | Criteria | Release Date
hum0331.v1.freq.v1 | NGS (WGS) | Unrestricted-access | 2023/02/01


*When research results that include data downloaded from NHA/DRA are published or presented, the data user must cite the papers related to the data or mention them in the acknowledgments.

 

MOLECULAR DATA

hum0331.v1.freq.v1

Participants/Materials Healthy individuals without any cancers or rare diseases (ICD10: Z006): 9290 individuals
Targets WGS
Target Loci for Capture Methods -
Platform Illumina [NovaSeq 6000]
Library Source DNAs extracted from peripheral blood cells
Cell Lines -
Library Construction (kit name) TruSeq DNA PCR-Free HT Library Prep Kit
Fragmentation Methods Ultrasonic fragmentation
Spot Type Paired-end
Read Length (without Barcodes, Adaptors, Primers, and Linkers) 150 bp
QC

Whole genome sequencing analysis was performed under the following conditions.

- Library size is confirmed to be 400-750 bp.

- At least 75% of the bases are QV30 or better.

- Total number of bases after removal of duplicate reads by FASTQC is more than 90 GBase.

 

After alignment and variant calling, the following samples were excluded from the analysis.

- Samples with abnormal values for depth and mapping rate.

- Samples where the depth of the sex chromosome is inconsistent with the clinical information.

- Samples determined by the KING program to be within the second degree of kinship of another sample.

 

Variant call results were filtered as follows:

- Genotypes with GQ64 or with less than 25% minor alleles in heterozygous calls are set to no call

- Set VQSR results to FILTER field in VCF

- Set LowCR in FILTER field for variants with less than 95% call rate

- Variants with a Hardy-Weinberg equilibrium test P-value less than 10^-6 have HWE set in FILTER field

Deduplication MarkDuplicates (GATK4.1.0) compatible algorithm (Parabricks 3.1.0 fq2bam)
Calibration for re-alignment and base quality -
Mapping Methods bwa mem (v0.7.15) compatible algorithm (Parabricks 3.1.0 fq2bam)
Mapping Quality No hard filtering was performed based on mapping quality.
Reference Genome Sequence GRCh38 (+HLA+decoy)
Coverage (Depth) 34.0 (autosome)
Detecting Methods for Variations HaplotypeCaller (GATK4.1.0) compatible algorithm (Parabricks 3.1.0 haplotypecaller)
SNV Numbers (after QC)

153,554,029 (autosomes)

6,325,046 (chromosome X)

INDEL Numbers (after QC)

18,899,392 (autosomes)

836,126 (chromosome X)

NBDC Dataset ID

hum0331.v1.freq.v1

hum0331.v1.freq-index.v1

(Click the Dataset ID to download the file)

README

Total Data Volume 69.24 GB (vcf)
Comments (Policies) NBDC policy
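
The variant-level filters listed under QC can be sketched roughly as follows. This is a minimal illustration, not the project's pipeline: the function names are hypothetical, the 25% rule is interpreted here as the fraction of reads supporting the less frequent allele, and the Hardy-Weinberg test is a simple chi-square approximation chosen for brevity (the record does not state which test was used). The GQ criterion is omitted.

```python
import math

CALL_RATE_MIN = 0.95        # variants below this call rate get LowCR
HWE_P_MIN = 1e-6            # variants with a smaller HWE P-value get HWE
MIN_MINOR_FRACTION = 0.25   # heterozygous calls below this become no call


def mask_het(genotype, ref_reads, alt_reads):
    """Return None (no call) for an unbalanced heterozygous genotype.

    Assumption: "less than 25% minor alleles" is read as allele balance.
    """
    if genotype != (0, 1):
        return genotype
    total = ref_reads + alt_reads
    if total == 0 or min(ref_reads, alt_reads) / total < MIN_MINOR_FRACTION:
        return None
    return genotype


def hwe_chisq_p(n_hom_ref, n_het, n_hom_alt):
    """Chi-square (1 df) P-value for Hardy-Weinberg equilibrium.

    Assumption: stands in for whatever test the pipeline actually used.
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return 1.0
    p = (2 * n_hom_ref + n_het) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: nothing to test
    observed = (n_hom_ref, n_het, n_hom_alt)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def filter_flags(genotypes):
    """Return FILTER flags for one biallelic variant.

    genotypes: list of (0, 0), (0, 1), (1, 1), or None for no call.
    """
    flags = []
    if not genotypes:
        return flags
    called = [g for g in genotypes if g is not None]
    if len(called) / len(genotypes) < CALL_RATE_MIN:
        flags.append("LowCR")
    counts = [called.count(g) for g in ((0, 0), (0, 1), (1, 1))]
    if hwe_chisq_p(*counts) < HWE_P_MIN:
        flags.append("HWE")
    return flags
```

In a real VCF the FILTER field would hold these flags alongside the VQSR result, with PASS used when no flag applies; the sketch returns an empty list in that case.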

 

DATA PROVIDER

Principal Investigator: Katsushi Tokunaga

Affiliation: National Center for Global Health and Medicine Genome Medical Science

Project / Group Name: National Center Biobank Network

Funds / Grants (Research Project Number):

Name | Title | Project Number
Program for an Integrated Database of Clinical and Genomic Information, Japan Agency for Medical Research and Development (AMED) | Development of an integrated database of clinical genome information that contributes to the implementation of genomic medicine and the establishment of a continuous genomic medicine system in Japan | JP19kk0205012

 

PUBLICATIONS

No. | Title | DOI | Dataset ID
1 | Exploring the genetic diversity of the Japanese population: Insights from a large-scale whole genome sequencing analysis | doi: 10.1371/journal.pgen.1010625 | hum0331.v1.freq.v1