Standard Genome Wide Association Study

In a standard GWAS analyses, a single phenotype is repeatedly regressed on each single nucleotide polymorphism (SNP) in the users genomic data file (as well as a fixed set of covariates). While there are several software packages that allow users to conduct GWAS analyses, GW-SEM is the only routine that allows users to conduct GWAS analyses that model ordinal data as a probit response, relying on a liability-threshold model to link the observed ordinal variable to a continuous underlying liability. GW-SEM can also manage quantitative (continuous) or binary (case-control) variables in the same manner as other standard GWAS software packages.

Example

To demonstrate how to conduct a GWAS using GW-SEM, we use simulated data for three correlated phenotypes for 6,000 individuals with accompanying GWAS data consisting of 2,000 SNPs, as well as 6 covariates (that are a proxy for age, sex, ancestry principle components or other confounds). As we are not allowed to post the raw data online, we simulated similar data that we use in this demonstration. The first step in the analysis is to load the GW-SEM package in R. The command load the packages is:

library(gwsem)
#> Loading required package: OpenMx
#> To take full advantage of multiple cores, use:
#>   mxOption(key='Number of Threads', value=parallel::detectCores()) #now
#>   Sys.setenv(OMP_NUM_THREADS=parallel::detectCores()) #before library(OpenMx)
#> 
#> Attaching package: 'gwsem'
#> The following object is masked from 'package:base':
#> 
#>     signif

After GW-SEM has been loaded into the R environment, you can read in the phenotypic data, examine it, and recode variables as necessary. For the current example, this can be done using the following commands:

location <- 'https://jpritikin.github.io/gwsem/gwsemItemExample/'

# Read the phenotype data into R and look at the data
phenoData <- read.table(file.path(location, "itemPhenoData.txt"),
                        header=TRUE)
head(phenoData)
#>   tobacco cannabis alcohol        pc1         pc2        pc3         pc4
#> 1       2        3       0 -0.6692077 -1.59248351 -0.6648463  0.09277401
#> 2       1        1       0  1.8676228  0.24132599 -0.2799667  0.18483088
#> 3       0        0       0  0.2991572 -0.29125218  0.4060377  1.24527580
#> 4       0        0       1 -1.5104513  3.12487392  0.5867622 -0.76880776
#> 5       2        3       2 -0.7101579 -0.17109929 -0.2001498 -0.19719950
#> 6       2        3       4  0.7511404  0.05465809  1.0260702  2.19846930
#>          pc5
#> 1 -0.9351552
#> 2 -0.9795520
#> 3  0.2449974
#> 4 -0.4592330
#> 5 -0.4026468
#> 6  0.2845062

table(phenoData$tobacco)
#> 
#>    0    1    2 
#> 2000 2000 2000
table(phenoData$cannabis)
#> 
#>    0    1    2    3 
#> 1500 1500 1500 1500
table(phenoData$alcohol)
#> 
#>    0    1    2    3    4 
#> 1200 1200 1200 1200 1200

As you can see, the tobacco variable is an categorical variable with 3 levels, cannabis is an categorical variable with 4 levels, and alcohol is an categorical variable with 5 levels. To tell R that you want to treat these variables as ordinal, you can use the following mxFactor() commands:

phenoData$tobacco  <- mxFactor(phenoData$tobacco  , levels = 0:2)
phenoData$cannabis <- mxFactor(phenoData$cannabis , levels = 0:3)
phenoData$alcohol  <- mxFactor(phenoData$alcohol  , levels = 0:4)

After you are confident that the variables are recoded in the optimal manner, you can begin building the GWAS model. To being with, let’s look at how to do a standard GWAS with only one item. The first step in conducting the GWAS analysis is to to build a general model. This takes a series of arguments:

tob <- buildItem(phenoData = phenoData,                              # the phenotypic data object
                     depVar = c("tobacco"),                          # the outcome or dependent variable 
                     covariates=c('pc1','pc2','pc3','pc4','pc5'),    # the necessary covariates
                     fitfun = "WLS")                                 # the fit function (WLS is much faster than ML)

After you have built the model and put it into an object (named tob), you can take this object which is technically an OpenMx model, and fit it using mxRun()"

tobFit <- mxRun(tob)
#> Running OneItem with 8 parameters
summary(tobFit)
#> Summary of OneItem 
#>  
#> free parameters:
#>             name     matrix     row     col      Estimate  Std.Error
#> 1 snp_to_tobacco          A tobacco     snp  0.0079547736 0.02076189
#> 2 pc1_to_tobacco          A tobacco     pc1 -0.0024232344 0.01459601
#> 3 pc2_to_tobacco          A tobacco     pc2  0.0008759935 0.01435417
#> 4 pc3_to_tobacco          A tobacco     pc3  0.0084324080 0.01480966
#> 5 pc4_to_tobacco          A tobacco     pc4  0.0230969252 0.01433350
#> 6 pc5_to_tobacco          A tobacco     pc5 -0.0053262676 0.01462257
#> 7  tobacco_Thr_1 Thresholds       1 tobacco -0.4146929439 0.04460971
#> 8  tobacco_Thr_2 Thresholds       2 tobacco  0.4470613537 0.04464401
#> 
#> Model Statistics: 
#>                |  Parameters  |  Degrees of Freedom  |  Fit (r'Wr units)
#>        Model:              8                   5992          4.35677e-29
#>    Saturated:              2                   5998          0.00000e+00
#> Independence:              2                   5998                   NA
#> Number of observations/statistics: 6000/6000
#> 
#> chi-square:  χ² ( df=0 ) = 0,  p = 1
#> CFI: NA 
#> TLI: 1   (also known as NNFI) 
#> RMSEA:  0  [95% CI (NA, NA)]
#> Prob(RMSEA <= 0.05): NA
#> To get additional fit indices, see help(mxRefModels)
#> timestamp: 2020-07-10 20:07:07 
#> Wall clock time: 0.4480011 secs 
#> optimizer:  SLSQP 
#> OpenMx version number: 2.17.4 
#> Need help?  See help(mxSummary)

It is strongly advised to test out the model, as it is a great time to learn if your model is being specified in a way that you didn’t expect or that the model is not giving you the answers that you are expecting. Now, provided that the model looks reasonable, you can plug the model that you have built into the GWAS function. To fit this function, you must tell GW-SEM :

library(curl)
curl_download(file.path(location, 'example.pgen'),
              file.path(tempdir(),'example.pgen'))
curl_download(file.path(location, 'example.pvar'),
              file.path(tempdir(),'example.pvar'))

GWAS(model = tob,                       # where the model is
    snpData = file.path(tempdir(),'example.pgen'),       # where the snpData 
    out=file.path(tempdir(),"tob.log"))   # where you want to save the results
#> Running OneItem with 8 parameters
#> Done. See '/tmp/RtmpdmiDTZ/tob.log' for results

When you execute this command, you will begin running your GWAS.

After you finish running the GWAS analysis, the next step is to read the results in to R so that you can examine them, plot them, prepare them to be used by other software packages, etc. While you are unlikely to want to read all of the output into R, it is useful to do this on a small test GWAS analysis to ensure that you are getting reasonable results. To do so, you can use the base R function read.delim. This is didactically useful, but it contains much more information than most people want

TobFullResult <- read.delim(file.path(tempdir(), "tob.log"))

Note that the results file is tab separated (with dates and times for each snp), so read.table will tell you that the the number of column names does not match the number of data columns.

More reasonably, we will want to examine the results for a specific parameter using the loadResults function. This function takes takes two arguments: the path to the data and the column in the results # file for the parameter that you want to examine.

succinct <- loadResults(path = file.path(tempdir(), "tob.log"),
                        focus =  "snp_to_tobacco")
succinct <- signif(succinct, focus =  "snp_to_tobacco")

In the post analysis section, I will give you a way to read several files into a single R object.

After you have read the results into R, you can manipulate them to your heart’s content. Here is a quick function to plot the p-values as a Manhattan plot:

plot(succinct)

As you will probably be conducting the GWAS analysis on a cluster, here are the essential components of the R script that you will need in your *.R file to run the GWAS.

library(gwsem)

# Read the phenotype data into R and look at the data
phenoData <- read.table(file.path(location, "itemPhenoData.txt"), header=TRUE)
phenoData$tobacco  <- mxFactor(phenoData$tobacco  , levels = 0:2)
phenoData$cannabis <- mxFactor(phenoData$cannabis , levels = 0:3)
phenoData$alcohol  <- mxFactor(phenoData$alcohol  , levels = 0:4)

tob <- buildItem(phenoData = phenoData, depVar = c("tobacco"), 
                 covariates=c('pc1','pc2','pc3','pc4','pc5'),
                 fitfun = "WLS") 

GWAS(model = tob,
     snpData = file.path(tempdir(), 'example.pgen'),
     out= file.path(tempdir(), "tob.log"))
#> Running OneItem with 8 parameters
#> Done. See '/tmp/RtmpdmiDTZ/tob.log' for results

Standard Genome Wide Association Study

Example

The Multivariate GWAS Model

Now we can plot all the residual manhattan plots