Introduction

Reference trait analysis is a method for relating incompatible phenotypes. For example, in the study of brain gene expression predisposing drug self-administration behaviors in mice, it is impossible to measure drug-naive brain gene expression and drug self-administration behaviors in the same individual mice.

In this example we are going succinctly demonstrate how the method works using gene expression and anxiety data from N=258 Diversity Outbred mice studied in Logan et al. 2013. In this example dataset, measurements of two separate sets of anxiety-related traits and hippocampal gene expression profiles have been collected for each of the 258 mice. Since the anxiety behaviors and hippocampal gene expression are not incompatible measures, we can test how well the reference trait analysis method performs.

Code for this analysis is available at this github repository.

Loading the data

# Open-field and light-dark box testing. 
# See https://www.ncbi.nlm.nih.gov/pubmed/23433259 for phenotyping details
of <- read_tsv("open_field_Logan_2013.tsv", col_types=cols(mouse_id="c", .default='d'))
ld <- read_tsv("light_dark_Logan_2013.tsv", col_types=cols(mouse_id="c", .default='d'))

# Hippocampal gene expression data (RNA-Seq)
# Data for 17,539 genes in 258 mice.
# These data have been upper quartile normalized and rankZ transformed.
expr <- read.csv("hippo_gene_expr.csv.gz", row.names=1)

We transform some of the phenotypes with log and sqrt to more closely achieve approximate normality:

sqrt_cols <- c("Percent_Time_Center", "Percent_Time_Periphery", "PercentTime_Mobile")
of[, sqrt_cols] <- sqrt(of[, sqrt_cols])
of[, "Total_Distance_cm"] <- log(of[, "Total_Distance_cm"])

Examining the data

Let’s make a few plots to look at the data.

# Make plots of open-field arena data:
of_long <- gather(of, "phenotype", "value", -mouse_id)
ggplot(of_long) + geom_histogram(aes(x=value), bins=50, color="gray") +
    facet_wrap(~ phenotype, scales='free') + theme_bw(base_size=16)

# Make plots of light-dark box data:
ld_long <- gather(ld, "phenotype", "value", -mouse_id)
ggplot(ld_long) + geom_histogram(aes(x=value), bins=50, color='gray') +
    facet_wrap(~ phenotype, scales='free') + theme_bw(base_size=16)

All phenotypes look roughly bell shaped.

Removing the effect of sex

We also regress out the effect of sex in an attempt to mitigate this known covariate that is not of primary interest.

remove_sex <- function(df) {
  residuals(lm(value ~ sex, data=df))
}
ld_corrected <- mutate(ld_long, sex=substring(mouse_id, 1, 1)) %>%
  group_by(phenotype) %>% nest() %>% 
  mutate(resid=map(data, remove_sex)) %>% 
  unnest(data, resid) %>% 
  select(mouse_id, phenotype, resid) %>%
  rename("value"="resid")

of_corrected <- mutate(of_long, sex=substring(mouse_id, 1, 1)) %>%
  group_by(phenotype) %>% nest() %>% 
  mutate(resid=map(data, remove_sex)) %>% 
  unnest(data, resid) %>% 
  select(mouse_id, phenotype, resid) %>%
  rename("value"="resid")

Splitting the animals into two groups

We call light-dark box traits the reference traits and open-field traits the target traits. In this example walkthrough, we pretend that we lack gene expression data for N/2=129 animals and we lack open-field data for the other 129 animals.

set.seed(1)
grp1 <- sort(sample(1:258, size=129, replace=FALSE))  # training group
grp2 <- setdiff(1:258, grp1)                          # test group

mouse_ids <- sort(of$mouse_id)
grp1_ids <- mouse_ids[grp1]
grp2_ids <- mouse_ids[grp2]

Building the model from training group animals

of_train <- filter(of, mouse_id %in% grp1_ids) %>% select(-mouse_id)
ld_train <- filter(ld, mouse_id %in% grp1_ids) %>% select(-mouse_id)
ccor <- cancor(of_train, ld_train, xcenter=TRUE, ycenter=TRUE)

Let’s confirm the correlation for the first canonical variables:

ccor$cor[1]
## [1] 0.6992908
of_canonical_variable1 <- as.matrix(of_train) %*% ccor$xcoef[, 1]
ld_canonical_variable1 <- as.matrix(ld_train) %*% ccor$ycoef[, 1]
# sanity check:
are_equal(ccor$cor[1], cor(of_canonical_variable1, ld_canonical_variable1)[1])
## [1] TRUE

Get projected traits from testing group animals

Here we use the canonical correlation model built above to get projected reference traits. Specifically, the model above gives us weights that we can apply to the reference trait data measured on grp2 animals to construct projected traits.

ld_test <- filter(ld, mouse_id %in% grp2_ids) %>% select(-mouse_id)
projected_ref_traits <- as.matrix(ld_test) %*% ccor$ycoef[, 1]

# compare to projection of target traits (open-field)
# (impossible in typical data but possible here because all animals have all data)
of_test <- filter(of, mouse_id %in% grp2_ids) %>% select(-mouse_id)
projected_target_traits <- as.matrix(of_test) %*% ccor$xcoef[, 1]

cor(projected_ref_traits, projected_target_traits)[1]
## [1] 0.6110526

Finding molecular correlates of projected reference traits

We may compare variation in projected reference traits to transcript abundance measurements:

expr_grp2 <- expr[, grp2_ids]
exprcors <- cor(t(expr_grp2), projected_ref_traits)

head(exprcors)
##                           [,1]
## ENSMUSG00000000001 -0.02751342
## ENSMUSG00000000028  0.09268146
## ENSMUSG00000000031  0.08498029
## ENSMUSG00000000037 -0.11694770
## ENSMUSG00000000049  0.12843627
## ENSMUSG00000000056  0.10435831

Session information

devtools::session_info()
## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.5.0 (2018-04-23)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/New_York            
##  date     2018-06-14
## Packages -----------------------------------------------------------------
##  package    * version    date       source                       
##  assertthat * 0.2.0      2017-04-11 CRAN (R 3.5.0)               
##  backports    1.1.2      2017-12-13 CRAN (R 3.5.0)               
##  base       * 3.5.0      2018-04-24 local                        
##  bindr        0.1.1      2018-03-13 CRAN (R 3.5.0)               
##  bindrcpp   * 0.2.2      2018-03-29 CRAN (R 3.5.0)               
##  broom        0.4.4      2018-03-29 CRAN (R 3.5.0)               
##  cellranger   1.1.0      2016-07-27 CRAN (R 3.5.0)               
##  cli          1.0.0      2017-11-05 CRAN (R 3.5.0)               
##  colorspace   1.3-2      2016-12-14 CRAN (R 3.5.0)               
##  compiler     3.5.0      2018-04-24 local                        
##  crayon       1.3.4      2017-09-16 CRAN (R 3.5.0)               
##  das        * 0.0.0.9000 2018-05-09 Github (daskelly/das@0a8f8f5)
##  datasets   * 3.5.0      2018-04-24 local                        
##  devtools     1.13.5     2018-02-18 CRAN (R 3.5.0)               
##  digest       0.6.15     2018-01-28 CRAN (R 3.5.0)               
##  dplyr      * 0.7.5      2018-05-19 cran (@0.7.5)                
##  evaluate     0.10.1     2017-06-24 CRAN (R 3.5.0)               
##  forcats    * 0.3.0      2018-02-19 CRAN (R 3.5.0)               
##  foreign      0.8-70     2017-11-28 CRAN (R 3.5.0)               
##  ggplot2    * 2.2.1      2016-12-30 CRAN (R 3.5.0)               
##  glue         1.2.0      2017-10-29 CRAN (R 3.5.0)               
##  graphics   * 3.5.0      2018-04-24 local                        
##  grDevices  * 3.5.0      2018-04-24 local                        
##  grid         3.5.0      2018-04-24 local                        
##  gtable       0.2.0      2016-02-26 CRAN (R 3.5.0)               
##  haven        1.1.1      2018-01-18 CRAN (R 3.5.0)               
##  hms          0.4.2      2018-03-10 CRAN (R 3.5.0)               
##  htmltools    0.3.6      2017-04-28 CRAN (R 3.5.0)               
##  httr         1.3.1      2017-08-20 CRAN (R 3.5.0)               
##  import       1.1.0      2015-06-22 CRAN (R 3.5.0)               
##  jsonlite     1.5        2017-06-01 CRAN (R 3.5.0)               
##  knitr        1.20       2018-02-20 CRAN (R 3.5.0)               
##  labeling     0.3        2014-08-23 CRAN (R 3.5.0)               
##  lattice      0.20-35    2017-03-25 CRAN (R 3.5.0)               
##  lazyeval     0.2.1      2017-10-29 CRAN (R 3.5.0)               
##  lubridate    1.7.4      2018-04-11 CRAN (R 3.5.0)               
##  magrittr     1.5        2014-11-22 CRAN (R 3.5.0)               
##  memoise      1.1.0      2017-04-21 CRAN (R 3.5.0)               
##  methods    * 3.5.0      2018-04-24 local                        
##  mnormt       1.5-5      2016-10-15 CRAN (R 3.5.0)               
##  modelr       0.1.1      2017-07-24 CRAN (R 3.5.0)               
##  munsell      0.4.3      2016-02-13 CRAN (R 3.5.0)               
##  nlme         3.1-137    2018-04-07 CRAN (R 3.5.0)               
##  parallel     3.5.0      2018-04-24 local                        
##  pillar       1.2.2      2018-04-26 CRAN (R 3.5.0)               
##  pkgconfig    2.0.1      2017-03-21 CRAN (R 3.5.0)               
##  plyr         1.8.4      2016-06-08 CRAN (R 3.5.0)               
##  psych        1.8.4      2018-05-06 CRAN (R 3.5.0)               
##  purrr      * 0.2.4      2017-10-18 CRAN (R 3.5.0)               
##  R6           2.2.2      2017-06-17 CRAN (R 3.5.0)               
##  Rcpp         0.12.17    2018-05-18 cran (@0.12.17)              
##  readr      * 1.1.1      2017-05-16 CRAN (R 3.5.0)               
##  readxl       1.1.0      2018-04-20 CRAN (R 3.5.0)               
##  reshape2     1.4.3      2017-12-11 CRAN (R 3.5.0)               
##  rlang        0.2.0      2018-02-20 CRAN (R 3.5.0)               
##  rmarkdown    1.9        2018-03-01 CRAN (R 3.5.0)               
##  rprojroot    1.3-2      2018-01-03 CRAN (R 3.5.0)               
##  rstudioapi   0.7        2017-09-07 CRAN (R 3.5.0)               
##  rvest        0.3.2      2016-06-17 CRAN (R 3.5.0)               
##  scales       0.5.0      2017-08-24 CRAN (R 3.5.0)               
##  stats      * 3.5.0      2018-04-24 local                        
##  stringi      1.2.2      2018-05-02 CRAN (R 3.5.0)               
##  stringr    * 1.3.1      2018-05-10 CRAN (R 3.5.0)               
##  tibble     * 1.4.2      2018-01-22 CRAN (R 3.5.0)               
##  tidyr      * 0.8.0      2018-01-29 CRAN (R 3.5.0)               
##  tidyselect   0.2.4      2018-02-26 CRAN (R 3.5.0)               
##  tidyverse  * 1.2.1      2017-11-14 CRAN (R 3.5.0)               
##  tools        3.5.0      2018-04-24 local                        
##  utils      * 3.5.0      2018-04-24 local                        
##  withr        2.1.2      2018-03-15 CRAN (R 3.5.0)               
##  xml2         1.2.0      2018-01-24 CRAN (R 3.5.0)               
##  yaml         2.1.19     2018-05-01 CRAN (R 3.5.0)