NBDC Research ID: hum0343.v4

SUMMARY

Aims: To construct a system for predicting severe disease through whole genome sequencing, RNA sequencing, and ultra-high-precision HLA analysis of patients with COVID-19, asymptomatic infected patients, and patients suspected of having novel coronavirus infection. In addition, anonymized data will be used to predict the severity of COVID-19 with mathematical models, and the association of viruses with autoimmune diseases and COVID-19 will be investigated.

Methods: Genome-wide association study (GWAS), RNA-seq, Protein expression analysis, eQTL/sQTL/pQTL study, whole genome sequencing

Participants/Materials: GWAS: 5,682 Japanese individuals (2,393 COVID-19 infected patients and 3,289 controls)

RNA-seq: Maximum 1,019 COVID-19 infected patients

Protein expression: 1,384 COVID-19 infected patients

Whole genome sequencing: 1,164 COVID-19 infected patients

Dataset ID	Type of Data	Criteria	Release Date
hum0343.v1.covid19.v1	GWAS for COVID-19	Unrestricted-access	2022/05/26
hum0343.v1.count.v1	NGS (RNA-seq) for COVID-19	Unrestricted-access	2022/05/26
hum0343.v2.qtl.v1	eQTL/sQTL summary statistics for COVID-19	Unrestricted-access	2022/06/14
E-GEAD-759	NGS (RNA-seq) for COVID-19; Protein expression for COVID-19	Unrestricted-access	2024/06/24
hum0343.v3.qtl.v1	eQTL/pQTL summary statistics for COVID-19	Unrestricted-access	2024/06/24
JGAS000739	The presence or absence of endogenous herpesvirus 6 and anellovirus load calculated from NGS (WGS) for COVID-19	Controlled-access (Type I)	2024/10/02

*Release Note

*Data users need to apply for the use of NBDC Human Data to access the Controlled-access Data.

*When research results that include data downloaded from NHA/DRA are published or presented, the data user must cite the papers related to the data or include them in the acknowledgments.

MOLECULAR DATA

GWAS


Participants/Materials	[GWAS-1] COVID-19 (ICD-10: U071): 2,393 cases, healthy controls: 3,289 individuals; [GWAS-2] Severe COVID-19: 990 cases and 3,289 healthy controls from [GWAS-1]; [GWAS-3] COVID-19 under age 65: 1,484 cases and 2,377 healthy controls under age 65 from [GWAS-1]; [GWAS-4] Severe COVID-19 under age 65: 440 cases and 2,377 healthy controls under age 65 from [GWAS-3]
Targets	Genome wide SNPs
Target Loci for Capture Methods	-
Platform	Illumina [Infinium Asian Screening Array]
Source	DNAs extracted from peripheral blood cells
Cell Lines	-
Reagents (Kit, Version)	Infinium Asian Screening Array
Genotype Call Methods (software)	Genotyping: GenomeStudio; haplotype phasing: SHAPEIT4 (autosome), SHAPEIT2 (X-chromosome); imputation: Minimac4
Association Analysis (software)	PLINK2
Filtering Methods	Sample QC: We excluded (1) samples with call rate < 0.97, (2) samples with excess heterozygosity of genotypes > mean + 3SD, (3) related samples with PI_HAT > 0.175, and (4) outlier samples from East Asian clusters in principal component analysis with 1000 Genomes Project samples. Genotyping QC: We excluded variants with (1) variant call rate < 0.99, (2) significant call rate differences between cases and controls with P < 5.0×10^-8, (3) deviation from Hardy-Weinberg equilibrium with P < 1.0×10^-6, or (4) minor allele count < 5. Imputation QC: MAF ≥ 0.1% and imputation score (Rsq) > 0.5
Marker Number (after QC)	[GWAS-1] 13,484,569 variants [GWAS-2] 13,199,053 variants [GWAS-3] 13,241,602 variants [GWAS-4] 12,764,136 variants
NBDC Dataset ID	hum0343.v1.covid19.v1 [GWAS-1][GWAS-2][GWAS-3][GWAS-4] (Click the gwas number to download files) Dictionary file
Total Data Volume	[GWAS-1] 361 MB [GWAS-2] 354 MB [GWAS-3] 354 MB [GWAS-4] 343 MB
Comments (Policies)	NBDC policy
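
The genotyping and imputation QC thresholds listed under Filtering Methods can be read as simple per-variant predicates. The following is a minimal sketch under stated assumptions, not the pipeline used for this dataset: the `VariantStats` record and its field names are hypothetical, and the per-variant statistics are assumed to have been computed beforehand by a tool such as PLINK2.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Hypothetical per-variant summary; field names are illustrative only."""
    call_rate: float            # fraction of samples with a genotype call
    case_control_diff_p: float  # P-value for call-rate difference, cases vs controls
    hwe_p: float                # Hardy-Weinberg equilibrium test P-value
    minor_allele_count: int


def passes_genotyping_qc(v: VariantStats) -> bool:
    """Return True if none of the four genotyping QC exclusions applies."""
    if v.call_rate < 0.99:
        return False
    if v.case_control_diff_p < 5.0e-8:
        return False
    if v.hwe_p < 1.0e-6:
        return False
    if v.minor_allele_count < 5:
        return False
    return True


def passes_imputation_qc(maf: float, rsq: float) -> bool:
    """Imputation QC as stated above: MAF >= 0.1% and Rsq > 0.5."""
    return maf >= 0.001 and rsq > 0.5
```

Keeping each exclusion as a separate condition makes it easy to count how many variants each criterion removes.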

RNA-seq


Participants/Materials	COVID-19 (ICD-10: U071): 473 cases
Targets	RNA-seq
Target Loci for Capture Methods	-
Platform	Illumina [NovaSeq6000]
Library Source	RNAs extracted from peripheral blood cells
Cell Lines	-
Library Construction (kit name)	NEBNext® Poly(A) mRNA Magnetic Isolation Module and NEBNext® Ultra™ Directional RNA Library Prep Kit for Illumina
Fragmentation Methods	Incubation in the buffer containing Mg2+ at 94°C for 15 minutes
Spot Type	Paired-end
Read Length (without Barcodes, Adaptors, Primers, and Linkers)	100 bp
Mapping Methods	Adapter removal: Trimmomatic (v0.39); alignment: STAR (v2.7.9a); annotation: GENCODE v30
Reference Genome Sequence	GRCh38/hg38
Detecting method for read count (software)	Gene level quantification and normalization: RSEM (v1.3.3)
QC	Median transcripts per million (TPM) > 10
Gene Number	5991
NBDC Dataset ID	hum0343.v1.count.v1 (Click the Dataset ID to download the file) Sample Information
Total Data Volume	6 MB
Comments (Policies)	NBDC policy
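
The QC criterion for this dataset (median TPM > 10 across samples) can be sketched as below. This is an illustrative example only: the function name, the in-memory dictionary layout, and the gene labels in the usage note are assumptions, not the format of the distributed file.

```python
from statistics import median
from typing import Dict, List


def genes_passing_median_tpm(
    tpm_by_gene: Dict[str, List[float]], threshold: float = 10.0
) -> List[str]:
    """Return genes whose median TPM across samples is strictly above threshold."""
    return [
        gene
        for gene, values in tpm_by_gene.items()
        if values and median(values) > threshold
    ]
```

For example, with hypothetical values `{"GENE_A": [5, 12, 30], "GENE_B": [1, 2, 50]}`, only `GENE_A` is kept: its median is 12, while `GENE_B` has a median of 2 despite one high sample.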

eQTL/sQTL study


Participants/Materials	COVID-19 (ICD-10: U071): 465 cases (severe cases: 359, mild cases: 106)
Targets	eQTL/sQTL summary statistics
Target Loci for Capture Methods	-
Platform	RNA-seq: Illumina [NovaSeq6000] SNP array data: Illumina [Infinium Asian Screening Array]
Library Source	read count data of RNA-seq and SNP array data of GWAS
Cell Lines	-
Library Construction (kit name)	RNA-seq: See RNA-seq SNP array data: See GWAS
Detecting method for read count (software)	Gene level quantification and normalization: RSEM (v1.3.3) Intron cluster quantification: LeafCutter (v0.2.7)
QC	Following GTEx pipeline (https://github.com/broadinstitute/gtex-pipeline/)
Detection method of eQTL (cis)	The eQTL effects of common (>1%) variants in the cis window (within ±1 Mb of transcription start sites) were tested using fastQTL. Variant-gene pairs with cis-eQTL p-value < 0.05 were summarized, annotated with allele frequency (AF), p-value, effect size (beta), and posterior inclusion probability (PIP).
Detection method of eQTL (trans)	Trans-eQTL effects were tested using tensorQTL. Variant-gene pairs with trans-eQTL p-value < 5×10^-8 were summarized, annotated with AF, p-value, and beta.
Detection method of sQTL	The sQTL effects of common (>1%) variants in the cis window (within ±1 Mb of intron cluster start sites) were tested using fastQTL. Variant-intron cluster pairs with cis-sQTL p-value < 0.05 were summarized, annotated with AF, p-value, beta, and PIP.
NBDC Dataset ID	hum0343.v2.qtl.v1 (Click the Dataset ID to download the file) Dictionary file
Total Data Volume	714 MB (tsv)
Comments (Policies)	NBDC policy
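
The cis window and the reporting thresholds described in the detection methods can be illustrated with a short sketch. The function names are hypothetical, and treating the 1 Mb boundary as exclusive is an assumption taken from the "<" in the description; the actual window handling is determined by fastQTL and tensorQTL.

```python
CIS_WINDOW_BP = 1_000_000   # 1 Mb on either side of the transcription start site
CIS_P_CUTOFF = 0.05         # reporting threshold for cis-eQTL/sQTL pairs
TRANS_P_CUTOFF = 5e-8       # reporting threshold for trans-eQTL pairs


def is_cis_pair(variant_chrom: str, variant_pos: int,
                gene_chrom: str, tss_pos: int) -> bool:
    """True if the variant lies on the same chromosome, within 1 Mb of the TSS."""
    return (variant_chrom == gene_chrom
            and abs(variant_pos - tss_pos) < CIS_WINDOW_BP)


def is_reported(p_value: float, cis: bool) -> bool:
    """Apply the cis or trans p-value threshold used for the summary files."""
    cutoff = CIS_P_CUTOFF if cis else TRANS_P_CUTOFF
    return p_value < cutoff
```

The much stricter trans threshold reflects the far larger number of variant-gene pairs tested outside the cis window.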

RNA-seq (E-GEAD-759)


Participants/Materials	COVID-19 (ICD-10: U071): 1,019 cases
Targets	RNA-seq
Target Loci for Capture Methods	-
Platform	Illumina [NovaSeq6000]
Library Source	RNAs extracted from peripheral blood cells
Cell Lines	-
Library Construction (kit name)	NEBNext® Poly(A) mRNA Magnetic Isolation Module and NEBNext® Ultra™ Directional RNA Library Prep Kit for Illumina
Fragmentation Methods	Incubation in the buffer containing Mg2+ at 94°C for 15 minutes
Spot Type	Paired-end
Read Length (without Barcodes, Adaptors, Primers, and Linkers)	100 bp
Mapping Methods	Alignment: STAR (v2.5.3a) Annotation: GENCODE v30
Reference Genome Sequence	GRCh38/hg38
Detecting method for read count (software)	Gene level quantification and normalization: RSEM (v1.3.0)
QC	Transcripts per million (TPM) ≥0.1 in ≥20% of samples; read count ≥6 in ≥20% of samples
Gene Number	20329
Genomic Expression Archive ID	E-GEAD-759 Dictionary file
Total Data Volume	91.6 MB (tsv)
Comments (Policies)	NBDC policy
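
The two expression thresholds in the QC row can be sketched as a per-gene filter. This is a minimal illustration with hypothetical function names; it assumes that both criteria must hold for a gene to be retained, as in the GTEx eQTL pipeline, which the source does not state explicitly for this dataset.

```python
from typing import Sequence


def fraction_at_least(values: Sequence[float], cutoff: float) -> float:
    """Fraction of samples whose value is at or above the cutoff."""
    return sum(v >= cutoff for v in values) / len(values)


def gene_is_expressed(
    tpm: Sequence[float],
    counts: Sequence[float],
    min_tpm: float = 0.1,
    min_count: float = 6,
    min_fraction: float = 0.2,
) -> bool:
    """Keep a gene if TPM >= 0.1 and read count >= 6, each in >= 20% of samples."""
    if not tpm or not counts:
        return False
    return (
        fraction_at_least(tpm, min_tpm) >= min_fraction
        and fraction_at_least(counts, min_count) >= min_fraction
    )
```

Requiring a minimum fraction of samples rather than a minimum mean keeps genes that are expressed in only a subset of individuals.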

Protein expression


Participants/Materials	COVID-19 (ICD-10: U071): 1,384 cases
Targets	Protein expression (2932 proteins)
Target Loci for Capture Methods	-
Platform	Olink [Olink Explore 3072]
Library Source	Plasma
Cell Lines	-
Library Construction (kit name)	Olink Explore 3072
Fragmentation Methods	-
Spot Type	-
Read Length (without Barcodes, Adaptors, Primers, and Linkers)	-
Detecting Methods for Proteins (software)	OlinkAnalyze v3.4.1
Normalization Methods	Normalized Protein eXpression (NPX) transformation
Validation Methods	Bridge sample comparison
Genomic Expression Archive ID	E-GEAD-759 Dictionary file
Total Data Volume	91.6 MB (tsv)
Comments (Policies)	NBDC policy

eQTL/pQTL study


Participants/Materials	COVID-19 (ICD-10: U071): 1,405 cases (severe cases: 995, mild cases: 410); eQTL analysis: 1,019 cases; pQTL analysis: 1,384 cases (998 intersecting cases)
Targets	eQTL/pQTL summary statistics
Target Loci for Capture Methods	-
Platform	RNA-seq: Illumina [NovaSeq6000] SNP array data: Illumina [Infinium Asian Screening Array] Protein expression data: Olink Explore 3072
Library Source	read count data of RNA-seq, SNP array data of GWAS and Protein expression data
Cell Lines	-
Library Construction (kit name)	RNA-seq: See RNA-seq SNP array data: See GWAS Protein expression data: See Protein expression
Detecting method for read count (software)	Gene level quantification and normalization: RSEM (v1.3.0); protein expression quantification: OlinkAnalyze v3.4.1
QC	Following GTEx pipeline (https://github.com/broadinstitute/gtex-pipeline/)
Detection method of eQTL (cis)	The eQTL effects of cis variants (within ±1 Mb of transcription start sites, minor allele count >2) were tested using fastQTL. Variant-gene pairs with p-value < 0.05 or posterior inclusion probability (PIP) > 0.001 were then summarized as separate files, annotated with allele frequency (AF), p-value, effect size (beta), and PIP.
Detection method of pQTL (cis)	The pQTL effects of cis variants (within ±1 Mb of transcription start sites, minor allele count >2) were tested using fastQTL. Variant-gene pairs with p-value < 0.05 or posterior inclusion probability (PIP) > 0.001 were then summarized as separate files, annotated with allele frequency (AF), p-value, effect size (beta), and PIP.
NBDC Dataset ID	hum0343.v3.qtl.v1 (Click the Dataset ID to download the file) Dictionary file
Total Data Volume	881.5 MB (tsv)
Comments (Policies)	NBDC policy
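
The participant counts in this table are consistent under inclusion-exclusion: the eQTL and pQTL cohorts overlap in 998 cases, so their union is 1,019 + 1,384 - 998 = 1,405. The sketch below checks that arithmetic and illustrates the reporting rule for the summary files (p-value < 0.05 or PIP > 0.001). The function names are hypothetical and are not part of the distributed data.

```python
def union_size(n_a: int, n_b: int, n_both: int) -> int:
    """Size of the union of two cohorts given their overlap (inclusion-exclusion)."""
    return n_a + n_b - n_both


def is_reported(p_value: float, pip: float,
                p_cutoff: float = 0.05, pip_cutoff: float = 0.001) -> bool:
    """A variant-gene pair is kept if either threshold is met."""
    return p_value < p_cutoff or pip > pip_cutoff


# Cohort sizes taken from the Participants/Materials row above.
total_cases = union_size(1019, 1384, 998)
```

Because the rule is a logical OR, a pair with a weak marginal p-value can still be kept when fine-mapping assigns it a PIP above 0.001.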

JGAS000739


Participants/Materials	COVID-19 (ICD-10: U071): 1,164 cases (severe cases: 1,068)
Targets	WGS
Target Loci for Capture Methods	-
Platform	Illumina [NovaSeq 6000]
Library Source	DNAs extracted from peripheral blood cells
Cell Lines	-
Library Construction (kit name)	TruSeq DNA PCR-free Library Prep Kit
Fragmentation Methods	Ultrasonic fragmentation
Spot Type	Paired-end
Read Length (without Barcodes, Adaptors, Primers, and Linkers)	150 bp x 2
Methods for removing host sequence/detecting viral sequence (software)	https://github.com/shohei-kojima/integrated_HHV6_recon https://github.com/shohei-kojima/human_anellovirus_detection
QC	We conducted principal component analysis (PCA) against HapMap3 data using SNP data of the same individuals to confirm the East Asian genetic background.
Reference sequence for viral genome	Refer to the GitHub repositories of the software. List of viral sequences
Japanese Genotype-phenotype Archive Dataset ID	JGAD000874
Total Data Volume	38.5 KB (tsv)
Comments (Policies)	NBDC policy

DATA PROVIDER

Principal Investigator: Koichi Fukunaga

Affiliation: Department of Medicine, Pulmonary Division, Keio University School of Medicine

Project / Group Name: -

Funds / Grants (Research Project Number):

Name	Title	Project Number
Project Promoting Support for Drug Discovery, Japan Agency for Medical Research and Development (AMED)	Development of genetically-designed COVID19 mucosal immune vaccine with molecular needle platform	JP20nk0101612
Research Program on Emerging and Re-emerging Infectious Diseases, Japan Agency for Medical Research and Development (AMED)	Promotion of genetic, immunological, and metabolic research necessary for the development of next-generation vaccines and drugs aiming to prevent the aggravation of coronavirus disease 2019	JP20fk0108415
Research Program on Emerging and Re-emerging Infectious Diseases, Japan Agency for Medical Research and Development (AMED)	Elucidation of pathogenesis and development of therapeutic strategies using genetic, immunological, and metabolic studies against SARS-CoV-2 variants	JP20fk0108452
Japan Program for Infectious Diseases Research and Infrastructure, Japan Agency for Medical Research and Development (AMED)	Elucidation of the pathophysiology of the sequelae of coronavirus disease 2019 using a multidisciplinary approach	JP21wm0325031
Core Research and Evolutional Science and Technology (CREST), Japan Science and Technology Agency (JST)	Research on Conquering Coronavirus Disease by Advanced Genomic Analysis and Artificial Intelligence	JPMJCR20H2
Practical Research Project for Allergic Diseases and Immunology, Japan Agency for Medical Research and Development (AMED)	Genomic prediction medicine of rheumatoid arthritis based on comprehensive immune-omics resources	20ek0410075h0001
KAKENHI Grant-in-Aid for Scientific Research (A)	Elucidation of tissue-specificity of disease biology using trans-layer omics analysis and whole-genome sequencing	19H01021
Program for Promoting Platform of Genomics based Drug Discovery, Japan Agency for Medical Research and Development (AMED)	Systematic evaluation of variant of uncertain significance (VUS) pathogenicity through population genomics data analysis and massively parallel reporter assay	JP22kk0305022
Fusion Oriented REsearch for disruptive Science and Technology (FOREST) program, Japan Science and Technology Agency (JST)	Towards a generalized and interpretable model for comprehensive understanding of human gene regulatory mechanisms	JPMJFR225Y
Promoting Individual Research to Nurture the Seeds of Future Innovation and Organizing Unique, Innovative Network (PRESTO), Japan Science and Technology Agency (JST)	Fundamental research to build an academic system resilient to pandemics	JPMJPR21R7

PUBLICATIONS

	Title	DOI	Dataset ID
1	DOCK2 is involved in the host genetics and biology of severe COVID-19	doi: 10.1038/s41586-022-05163-5	hum0343.v1.covid19.v1 hum0343.v1.count.v1
2	The whole blood transcriptional regulation landscape in 465 COVID-19 infected samples from Japan COVID-19 Task Force	doi: 10.1038/s41467-022-32276-2	hum0343.v2.qtl.v1
3	Statistically and functionally fine-mapped blood eQTLs and pQTLs from 1,405 humans reveal their distinct regulation patterns and disease relevance	doi: 10.1038/s41588-024-01896-3	E-GEAD-759 hum0343.v3.qtl.v1
4	Blood DNA virome associates with autoimmune diseases and COVID-19.		JGAD000874

USERS (Controlled-access Data)

Principal Investigator	Affiliation	Country/Region	Research Title	Data in Use (Dataset ID)	Period of Data Use