Table 3. Batch effects for the merged dataset using each study as its own "batch."
The tail observed was originally thought to be the non-European American samples.
Manhattan plot showing test-of-association results for a merged dataset prior to the strand fix. The x-axis corresponds to each genotyped SNP along the genome; the y-axis corresponds to the -log p-value. The red line indicates genome-wide significance. The Q-Q plot (top) illustrates that there are many more significant results (black line) than expected by chance (red dotted line).
Given the thorough eMERGE QC procedures, it was concerning that these strand orientation issues were not detected earlier in the merging process.
To investigate, several items were explored. These tests uncovered not only the source of the strand orientation issue but also several other important points. First, when evaluating HapMap concordance, it is critical to perform across-study-site HapMap comparisons (Figure 3). In this instance, we are describing HapMap control samples genotyped in different laboratories, but depending on the study design, these could be other known control samples. The initial QC concordance tests had been performed within each site and so could not have identified a strand issue.
However, once the across-study-site comparisons were performed, strand definition issues became evident. Second, upon careful inspection of the PLINK files, it was apparent that they had been created using different strand designations. Thus, it would have been beneficial to merge from the individual sample files rather than from the PLINK files, if possible. Finally, upon comparison of allele frequencies for a random set of SNPs, the strand issues became quite clear.
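The allele-label and allele-frequency comparisons described above can be sketched in a few lines of Python. This is a minimal illustration, not the eMERGE procedure: the function names, the SNP identifiers, the example frequencies, and the 0.2 tolerance are all assumptions made for the example. A/T and C/G SNPs read the same on both strands, so for those only a frequency comparison can expose a flip.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_strand(alleles_a, alleles_b):
    """Compare the allele pair reported for one SNP in two datasets.

    Returns 'same', 'flipped', 'ambiguous', or 'mismatch'.
    """
    a = frozenset(alleles_a)
    b = frozenset(alleles_b)
    flipped_a = frozenset(COMPLEMENT[x] for x in a)
    if a == flipped_a:
        # A/T and C/G pairs are their own complement: labels cannot show a flip.
        return "ambiguous"
    if a == b:
        return "same"
    if flipped_a == b:
        return "flipped"
    return "mismatch"


def flag_frequency_discordance(freq_a, freq_b, tolerance=0.2):
    """List SNPs whose frequency for the same-labelled allele differs by more
    than `tolerance` between two datasets (tolerance is a hypothetical value)."""
    shared = freq_a.keys() & freq_b.keys()
    return sorted(s for s in shared if abs(freq_a[s] - freq_b[s]) > tolerance)


# Hypothetical allele labels for three SNPs at two sites.
site1 = {"rs_demo1": ("A", "G"), "rs_demo2": ("A", "C"), "rs_demo3": ("A", "T")}
site2 = {"rs_demo1": ("G", "A"), "rs_demo2": ("T", "G"), "rs_demo3": ("A", "T")}
report = {snp: classify_strand(site1[snp], site2[snp]) for snp in site1}

# Hypothetical frequencies of the allele labelled "A" at each site.
freq1 = {"rs_demo1": 0.25, "rs_demo3": 0.30}
freq2 = {"rs_demo1": 0.27, "rs_demo3": 0.70}
suspects = flag_frequency_discordance(freq1, freq2)
```

In this toy input, `rs_demo2` is flagged as flipped from its labels alone, while the ambiguous `rs_demo3` is caught only by the frequency comparison.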
As such, these three QC steps should be an important component of dataset merging.
Flowchart illustrating concordance checks for HapMap and duplicate samples within each dataset (i.e., Dataset 1 in blue and Dataset 2 in red) and the complications that can arise in a merged dataset (orange) if there are strand issues.
With the strand orientation issue resolved, the five datasets were merged (version 2). Additional steps were added to examine our findings across the different datasets. Figure 4 illustrates the Manhattan plot after the strand orientation issue was resolved.
Manhattan plot showing test-of-association results for a merged dataset after the strand fix. The Q-Q plot (top) illustrates that there are a few more significant results (black line) than expected by chance (red dotted line).
Most of what was observed in the merged dataset was consistent with what was found in the individual dataset QC process. The unique elements that emerged for sample QC include sample relatedness and population stratification.
We investigated sample relatedness using the --genome option in PLINK, which calculates pairwise kinship estimates based on the proportion of loci where each pair of individuals shares 0, 1, and 2 alleles identical by descent (IBD). Using these guidelines, we can provide evidence for plausible relationships in our dataset. We performed this step for the merged dataset and compared the inferred relationships to those from our initial QC steps for each individual study.
We found the same relative pairs within each dataset, and additional evidence for several related pairs where pair members were from different sites (Figure 5). The data show 6 inferred across-site related pairs.
Identical by descent (IBD) plot from the merged dataset illustrating the proportion of SNPs that are shared between each pair of samples, with each pair represented by a dot on the plot. Z0 corresponds to the sharing of 0 alleles between each pair. Z1 corresponds to the sharing of 1 allele between each pair. The pairs at (0,0) correspond to duplicate pairs who share 2 alleles in common.
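As a rough illustration of how the Z0 and Z1 values map onto relationships, the sketch below derives Z2 and the overall IBD proportion (PI_HAT = Z2 + 0.5 × Z1) and assigns a coarse label. The cut-offs, sample IDs, and site names are assumptions chosen for the example, not the thresholds used in the study; the ideal values they are built around are (0, 0) for duplicates, (0, 1) for parent-offspring, (0.25, 0.5) for full siblings, and (1, 0) for unrelated pairs.

```python
def pi_hat(z1, z2):
    """Overall proportion shared IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1


def classify_pair(z0, z1):
    """Assign a coarse relationship label from Z0 and Z1.

    The cut-offs below are illustrative assumptions only.
    """
    z2 = 1.0 - z0 - z1
    share = pi_hat(z1, z2)
    if share >= 0.9:
        return "duplicate or monozygotic twin"
    if share >= 0.4:
        return "parent-offspring" if z1 >= 0.8 else "full sibling"
    if share >= 0.2:
        return "second-degree"
    return "unrelated or distant"


# Hypothetical pairs: (sample, site, sample, site, Z0, Z1).
pairs = [
    ("id_1", "site_A", "id_2", "site_B", 0.0, 0.0),
    ("id_3", "site_A", "id_4", "site_A", 0.0, 1.0),
    ("id_5", "site_B", "id_6", "site_C", 0.25, 0.5),
    ("id_7", "site_A", "id_8", "site_C", 1.0, 0.0),
]

# Keep only related pairs whose members come from different sites.
across_site = [
    (a, b, classify_pair(z0, z1))
    for a, site_a, b, site_b, z0, z1 in pairs
    if site_a != site_b and classify_pair(z0, z1) != "unrelated or distant"
]
```

Filtering on differing sites mirrors the point made in the text: within-site QC cannot see these pairs, so the check has to be repeated on the merged data.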
One pair was determined to be the same subject with two different medical record numbers. The last pair involved two different subjects with different decades of birth, indicating a possible sample handling error. The sample with the lower call rate from each duplicate pair was removed, except for the last duplicate pair, where both samples were removed because the identity of the sample was in question.
It is not uncommon to identify duplicates with different IDs or related samples across datasets. For eMERGE, this was anticipated because three of the sites (Marshfield, Mayo Clinic, and Northwestern University) are in relatively close geographical proximity to one another, so an individual may be enrolled in multiple biobanks. Even though each individual study likely identified plausible relationship pairs in its own dataset, it is vital to examine relatedness across merged datasets to ensure that samples are independent.
Usually only a trivial number of such samples will be identified, but on rare occasions one dataset may contain a substantial subset of samples from another dataset. Such overlap could inflate the risk estimates if the overlapping samples are mostly cases, or add noise and reduce the risk estimates if they are mostly controls.
Another QC step is investigation of population stratification 18. In the merged dataset, about 14, samples were classified i. We used Eigenstrat on this set of samples to examine population stratification using only SNPs that were ancestry informative (continentally informative) based on the Illumina Test Panel (Figure 6) 18. We observed that using ancestry informative markers (AIMs) gave results comparable to using a larger random set of SNPs, in less computational time. Given the magnitude of the dataset and the various geographic regions of these samples, Figure 6 shows a large, loose clustering of samples.
Although the plot appears to be tightly clustered, given the size of the cluster, there is likely some degree of population stratification even among the European ancestry samples in our dataset. Further evidence of population stratification is apparent when we examine batch effects across the eMERGE sites (see Batch Effects section).
The non-European American samples were removed from the analysis; thus, no tail is observed as it was in Figure 1. The question remains how to handle this population stratification statistically. Traditional epidemiologic studies with various ancestry groups adjust the analyses using a race indicator variable.
However, these are all European ancestry samples; thus, using the principal components computed in Eigenstrat may be the best solution for controlling for the diversity within the European ancestry group i.
When dealing with larger datasets i. As mentioned previously, this merged dataset was genotyped at two different genotyping centers; both centers failed a small proportion of SNPs, but each used different criteria to determine these failures. The failures are dataset-specific even within a center, so any particular SNP might fail for only one dataset.
When merging datasets, if one SNP fails in one of the datasets, then the marker genotyping efficiency will likely be lower than the established threshold because the denominator (total SNPs attempted) differs between substudies. Figure 7 illustrates this effect: a large proportion of SNPs drop out simultaneously, unlike the smooth curve observed for each individual dataset (Figure 8).
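The arithmetic behind this drop-out can be sketched as follows. The sketch assumes that a marker's genotyping efficiency in the merged data is the number of called genotypes divided by the number of samples attempted across all datasets; the dataset sizes, call counts, SNP names, and the 0.99 threshold are hypothetical values chosen only to show the effect.

```python
def merged_call_rate(called, attempted):
    """Call rate for one SNP across datasets: genotypes called / samples attempted."""
    return sum(called) / sum(attempted)


def passing_in_all(passing_by_dataset):
    """Return the SNPs that passed QC in every dataset (set intersection)."""
    sets = [set(snps) for snps in passing_by_dataset]
    return set.intersection(*sets)


THRESHOLD = 0.99  # hypothetical genotyping-efficiency cut-off
attempted = [3000, 2000, 1000]  # hypothetical samples per dataset

# A SNP that genotyped well everywhere stays above the cut-off.
clean = merged_call_rate([2990, 1995, 998], attempted)

# The same SNP failed outright in the smallest dataset: all 1000 genotypes missing.
failed_in_one = merged_call_rate([2990, 1995, 0], attempted)

# Keeping only SNPs that passed in all datasets avoids the sudden drop-out.
kept = passing_in_all([
    {"rs_demo1", "rs_demo2"},
    {"rs_demo1", "rs_demo2"},
    {"rs_demo1"},
])
```

Every SNP that failed in any one dataset falls below the cut-off by roughly that dataset's share of the samples, which is why such markers drop out together rather than along a smooth curve.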
The datasets were then merged using only high-quality genotyping data. Because we wish to keep only SNPs that were successfully genotyped in all datasets, we start out with fewer SNPs in the merged dataset. For our merged dataset, 5. This resulted in , SNPs in 17, samples. Next, minor allele frequencies were investigated. With more samples and ancestry groups genotyped, there are likely to be at least one or two copies of the minor allele present for SNPs with very low MAFs.
In contrast to the MAF analysis, the number of SNPs out of Hardy-Weinberg equilibrium (HWE) was expected to be larger given that multiple ancestry groups are present. As expected, when all ancestries are jointly considered, about a third of the SNPs are out of HWE at a p-value threshold of 10^-4 or less (Table 2). This is likely due to population stratification and differing allele frequencies between the various ancestry groups.
To address this, we divided the samples into the two main ancestry groups based on self or third-party report: European and African ancestries.
List of the number of SNPs out of Hardy-Weinberg equilibrium in the overall merged dataset (not race-stratified), , samples, and 2, African American samples.
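A small worked example shows why pooling ancestry groups pushes SNPs out of HWE and why stratifying resolves it. The sketch uses a standard 1-degree-of-freedom chi-square goodness-of-fit test, which is an assumption here since the text does not say which HWE test was applied; the genotype counts are invented so that each group is in equilibrium on its own but at a different allele frequency.

```python
import math


def hwe_p_value(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test of Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic SNP: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical groups, each in HWE but with allele frequencies 0.1 and 0.9.
group_1 = (10, 180, 810)
group_2 = (810, 180, 10)
pooled = tuple(x + y for x, y in zip(group_1, group_2))

p_group_1 = hwe_p_value(*group_1)
p_group_2 = hwe_p_value(*group_2)
p_pooled = hwe_p_value(*pooled)
```

Pooling produces a deficit of heterozygotes relative to the expectation at the pooled allele frequency, so the combined sample fails the test at the 10^-4 threshold even though neither group does.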
By randomizing these variables, we hoped to control for possible confounders if gross differences in the metrics for the batches i. This is not surprising given these studies include exclusively African Americans and the other studies are more heterogeneous. For the merged dataset, it seemed likely that population stratification for one plate versus all other plates across all studies would show gross differences in allele frequencies, and we had not detected any significant batch effects within each individual study.
However, we already had evidence that the merged dataset had significant population stratification, which we observed when we examined allele frequencies and p-values for SNPs per study, especially between the two earliest-completed studies (Marshfield and Vanderbilt; Table 3). The earliest dataset genotyped (Marshfield Clinic) contained many more related samples than the rest of the datasets; thus, one member of each first-degree relative pair was removed from our dataset for the batch effect analyses.
When this did not eliminate the effect we observed, we considered only a tighter cluster of samples of European ancestry. Even with a tighter clustering of samples, we still observed that the same two studies had more significant results (Table 4).
Batch effects for the merged dataset minus Marshfield related individuals and using a tighter European clustered group.
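One way to picture a "one batch versus all others" comparison is an allele-count chi-square test for a single SNP, repeated with each study in turn as the batch. The text does not specify the test statistic that was used, so the 2x2 chi-square below is a stand-in, and the study labels, allele counts, and 10^-4 cut-off are hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def batch_vs_rest_p_value(allele_counts, batch):
    """Compare (minor, major) allele counts in one batch against all other batches."""
    a, b = allele_counts[batch]
    c = sum(minor for name, (minor, _) in allele_counts.items() if name != batch)
    d = sum(major for name, (_, major) in allele_counts.items() if name != batch)
    # 1-degree-of-freedom chi-square survival function.
    return math.erfc(math.sqrt(chi2_2x2(a, b, c, d) / 2.0))


# Hypothetical (minor, major) allele counts for one SNP in five studies.
counts = {
    "study_1": (200, 1800),
    "study_2": (200, 1800),
    "study_3": (200, 1800),
    "study_4": (200, 1800),
    "study_5": (270, 1730),
}

p_values = {batch: batch_vs_rest_p_value(counts, batch) for batch in counts}
flagged = sorted(batch for batch, p in p_values.items() if p < 1e-4)
```

Counting how many SNPs each study flags in this way gives a per-study tally of the kind a batch-effect table would report.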
We have expanded the single-site eMERGE QC pipeline 14 to include additional steps to be used when merging datasets for replication, higher-powered studies, or meta-analysis (see Figure 9). Various other consortia have developed important QC and quality assurance (QA) pipelines to ensure thorough cleaning of data, especially GWAS data 23,26. These processes should not be neglected or marginalized, as they help to reduce the number of false positive and false negative results.
The best place to start is with a good study design. At the inception of the eMERGE network, we anticipated merging our datasets; thus, we included various safeguards to reduce the number of complications that could occur. We illustrate here a less complex merging of datasets than might be the case in other pooled studies, including data pooled from dbGaP.
Merging genotype data from various centers across different platforms creates additional complications and may require imputation. Likewise, merging phenotype data across studies has its own complexities. Quality control analyses had already been performed on each individual dataset, which was an important check when doing these analyses on the merged dataset. When pooling GWAS data, we have demonstrated that thorough cleaning of the individual datasets prior to merging, and then cleaning of the combined dataset once merged, are essential to obtaining good-quality data.
Establishing that strand orientation of alleles is consistent among datasets prior to merging is an important first step. Once merged, investigation of kinship coefficients is important to check for unintended duplicates or related pairs across datasets. Following QC of the merged dataset, genetic analyses will be conducted within eMERGE-I to identify variants associated with fourteen phenotypes using subsets of individuals with each of the case and control definitions for the phenotypes available.
Given the demographics, the merged dataset will likely be split into two populations for application of stratified regression analyses: European ancestry and African ancestry. QC procedures, such as Eigenstrat, batch effects, and HWE analysis, will need to be performed on the subsets used for each particular phenotype to adjust for principal components, to ensure association findings are not confounded by which studies the samples came from, and to ensure associated SNPs are not grossly out of HWE, respectively. When extracting subsets of samples for additional phenotype association studies, it is important to examine the distribution of samples among the different studies.
Pulling the majority of samples from one of these two studies (Marshfield or Vanderbilt) could lead to spurious results, as observed in the batch effect analyses. In other words, if the proportion of cases and controls differs by site, an observed association could be the result of a batch effect instead of a true effect.
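Before running an association test on a phenotype subset, the distribution of samples across sites can be tabulated with a few lines of code. The sketch below is only an illustration of that check; the site labels and counts are made up, and the summary fields are names chosen for the example.

```python
def site_summary(samples):
    """Summarize a phenotype subset by site.

    `samples` is an iterable of (site, is_case) tuples. For each site the
    result gives its share of the whole subset and its fraction of cases.
    """
    counts = {}
    for site, is_case in samples:
        cases, controls = counts.get(site, (0, 0))
        if is_case:
            counts[site] = (cases + 1, controls)
        else:
            counts[site] = (cases, controls + 1)
    total = sum(cases + controls for cases, controls in counts.values())
    return {
        site: {
            "share_of_subset": (cases + controls) / total,
            "case_fraction": cases / (cases + controls),
        }
        for site, (cases, controls) in counts.items()
    }


# Hypothetical subset: site_A supplies mostly cases, site_B mostly controls.
subset = (
    [("site_A", True)] * 80
    + [("site_A", False)] * 20
    + [("site_B", True)] * 10
    + [("site_B", False)] * 90
)
summary = site_summary(subset)
```

In this invented subset, case status is almost a proxy for site, which is the situation in which a batch effect could masquerade as an association.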
Through the eMERGE network, we have learned a great deal about large-scale quality control to ensure data integrity. In particular, merging datasets for joint analysis introduced interesting, unanticipated subtleties. We identified some potential points of concern when merging datasets as well as approaches to address them (Figure 9). Such approaches will become even more important as we venture further into the possibilities offered by dbGaP. Careful merging of datasets increases sample size, and with it the power to detect genetic associations and improve our understanding of the genetic architecture of complex traits.