An overview of the COPASutils package

In this example, the COPASutils package will be used to read in, process, and analyze data resulting from a Genome Wide Association Study (GWAS) using Caenorhabditis elegans nematode worms and a COPAS BIOSORT large particle flow cytometer. This example assumes that the COPASutils package, along with all necessary dependencies, is installed on your local machine. To install the COPASutils package, you can use the command install.packages("COPASutils").

Note: Some of the commands below will produce warnings when executed on a local machine. The warnings describe issues such as missing or strange values fed to the underlying functions and should not alarm the user. Warnings have been suppressed in this vignette for the sake of aesthetics.

The COPASutils package will first be loaded so that we can utilize the associated functions and example data:

library(COPASutils)

Reading Data

Note: All of the raw .txt files referenced below are available at https://github.com/AndersenLab/COPASutils-data

We can now read in the plate data from one of the example data sets. In this GWAS experimental design, each plate is set up with three worms sorted to each well in every other column. An example data set, called “plateData1.txt”, that illustrates this step of the experiment is available from the Andersen Laboratory website (andersenlab.org). This text file is a representative example of raw data collected from a Union Biometrica BIOSORT large particle flow cytometer with ReFLx sampler. We will first read this file in and save it to a data frame called setupRaw:

setupRaw <- readPlate("plateData1.txt")

If we look at the head of this data frame we can begin to step through the different data that are output by the COPAS machine:

head(setupRaw)
## Source: local data frame [6 x 15]
## Groups: row, col
## 
##   row col sort TOF EXT time green yellow red norm.EXT norm.green norm.yellow norm.red object call50
## 1   A   1    6 341 206    0    42     57  44   0.6041    0.12317     0.16716  0.12903 1.0000 object
## 2   A   1    6 345 226   62    57     61  44   0.6551    0.16522     0.17681  0.12754 1.0000 object
## 3   A   1    2  53  29  421    10      4   5   0.5472    0.18868     0.07547  0.09434 0.9987 object
## 4   A   1    6 341 185  452    60     64  47   0.5425    0.17595     0.18768  0.13783 1.0000 object
## 5   A   1    0 487 193  858    47     47  38   0.3963    0.09651     0.09651  0.07803 1.0000 object
## 6   A   1    0  58  17  920     3      3   4   0.2931    0.05172     0.05172  0.06897 0.9989 object

We can see that the COPAS machine groups the output data primarily by row and column of a 96-well plate, represented by the row and col columns in the above data frame. Each row in the data frame represents the readings for a single object. Working through the columns left to right, we see that the sorter returns the sort status of the object in the sort column (for our purposes, we are only concerned with instances where sort = 6, as these worms were sorted to the respective wells in the target plate; for a complete explanation of sort status codes, please consult the machine's user manual). We then see TOF, which stands for "time of flight", a measure of the length of the object in microns. Next is EXT, or "extinction", a measure of the optical density of the object. Following this is the time column, which represents the relative time from the first object sorted in each well. Using this scheme, the first object to pass through the flow cell for each well is assigned a time of 0. Next are the peak height values for each of the fluorescence channels, indicated by the green, yellow, and red columns. The next four columns contain the normalized EXT, green, yellow, and red values (each value divided by TOF).

Lastly, the object and call50 columns hold data returned by a support vector machine (SVM) that probabilistically determines whether each "object" is actually an object (cell, worm, etc.) or a bubble. This feature is useful if the experiment, like ours, requires the bubble trap hardware to be bypassed. It is important to note that the SVM was trained on data from our own machine and that its efficacy should be tested against a user's own data before any user utilizes this feature. The call50 column displays "object" if the probability of being an object (given in the object column) is greater than .5, and "bubble" otherwise.
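As a quick sanity check (a sketch, not part of the package), the norm.* columns can be recomputed by hand, since each is just the corresponding raw value divided by TOF:

```r
# Sketch: each norm.* column is the corresponding raw value divided by TOF.
# Toy object built from the first row of the output above:
obj <- data.frame(TOF = 341, EXT = 206, green = 42, yellow = 57, red = 44)

norm.EXT   <- obj$EXT / obj$TOF    # 206 / 341
norm.green <- obj$green / obj$TOF  # 42 / 341

round(norm.EXT, 4)    # 0.6041, matching the norm.EXT column above
round(norm.green, 5)  # 0.12317, matching the norm.green column above
```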

If you would like to remove the last two columns (i.e. read in data without the help of the SVM), you can set the SVM argument to FALSE, as below:

setupRaw2 <- readPlate("plateData1.txt", SVM=FALSE)
head(setupRaw2)
## Source: local data frame [6 x 13]
## Groups: row, col
## 
##   row col sort TOF EXT time green yellow red norm.EXT norm.green norm.yellow norm.red
## 1   A   1    6 341 206    0    42     57  44   0.6041    0.12317     0.16716  0.12903
## 2   A   1    6 345 226   62    57     61  44   0.6551    0.16522     0.17681  0.12754
## 3   A   1    2  53  29  421    10      4   5   0.5472    0.18868     0.07547  0.09434
## 4   A   1    6 341 185  452    60     64  47   0.5425    0.17595     0.18768  0.13783
## 5   A   1    0 487 193  858    47     47  38   0.3963    0.09651     0.09651  0.07803
## 6   A   1    0  58  17  920     3      3   4   0.2931    0.05172     0.05172  0.06897

We can also set cutoffs for minimum and maximum time of flight and extinction values as such:

setupRaw3 <- readPlate("plateData1.txt", tofmin=60, tofmax=1000, extmin=50, extmax=500)
head(setupRaw3)
## Source: local data frame [6 x 15]
## Groups: row, col
## 
##   row col sort TOF EXT time green yellow red norm.EXT norm.green norm.yellow norm.red object call50
## 1   A   1    6 341 206    0    42     57  44   0.6041    0.12317     0.16716  0.12903      1 object
## 2   A   1    6 345 226   62    57     61  44   0.6551    0.16522     0.17681  0.12754      1 object
## 3   A   1    6 341 185  452    60     64  47   0.5425    0.17595     0.18768  0.13783      1 object
## 4   A   1    0 487 193  858    47     47  38   0.3963    0.09651     0.09651  0.07803      1 object
## 5   A   1    0 257 248 2012    66     77  60   0.9650    0.25681     0.29961  0.23346      1 object
## 6   A   1    0 469 259 2199    63     70  57   0.5522    0.13433     0.14925  0.12154      1 object
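Conceptually, these cutoffs act as a row filter on the raw object data. A minimal sketch of the equivalent manual filter on a toy frame (whether the bounds are inclusive in readPlate is an assumption here):

```r
# Sketch: the tofmin/tofmax/extmin/extmax arguments act like a row filter
# on the raw object data. Toy frame with the relevant columns:
raw <- data.frame(TOF = c(341, 53, 487, 58),
                  EXT = c(206, 29, 193, 17))

kept <- subset(raw, TOF >= 60 & TOF <= 1000 & EXT >= 50 & EXT <= 500)
nrow(kept)  # 2 of the 4 toy objects survive the cutoffs
```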

Lastly, we can read in a data file from a BioSorter machine, which utilizes the LP Sampler instead of the BIOSORT machine’s ReFLx module. We can adjust the way in which the file is read to compensate for this difference by setting the reflx parameter to FALSE:

biosorterData <- readPlate("BioSorter.txt", reflx=FALSE)
head(biosorterData)
## Source: local data frame [6 x 15]
## Groups: row, col
## 
##   row col sort TOF   EXT time green yellow  red norm.EXT norm.green norm.yellow norm.red object call50
## 1   A   1    0 480 356.0    0  2.70   1.65 2.33   0.7417   0.005625    0.003438 0.004854 1.0000 object
## 2   A   1    0 138  29.9   95  0.51   0.39 0.41   0.2167   0.003696    0.002826 0.002971 0.9999 object
## 3   A   1    0  40   1.2  931  0.11   0.11 0.11   0.0300   0.002750    0.002750 0.002750 0.3757 bubble
## 4   A   1    0 280  71.0  973  8.25   1.06 3.47   0.2536   0.029464    0.003786 0.012393 1.0000 object
## 5   A   1    0 264  61.6 1125 11.80   1.14 2.65   0.2333   0.044697    0.004318 0.010038 1.0000 object
## 6   A   1    0 492  55.8 1195 11.00   1.85 6.03   0.1134   0.022358    0.003760 0.012256 1.0000 object

Processing Data

Now that we have read our raw data into R using the readPlate function, we probably want to summarize the data by well. We are not necessarily interested in the data at the per-object level, but it would be nice to get some summary statistics for each well. For instance, in the GWAS experiment whose data we are examining, three worms were to be sorted into each well in every other column. To summarize the data we can use the summarizePlate function:

setupSummarized <- summarizePlate(setupRaw)
colnames(setupSummarized)
##  [1] "row"                "col"                "n"                  "n.sorted"           "mean.TOF"          
##  [6] "median.TOF"         "mean.EXT"           "median.EXT"         "mean.red"           "median.red"        
## [11] "mean.green"         "median.green"       "mean.yellow"        "median.yellow"      "mean.norm.EXT"     
## [16] "median.norm.EXT"    "mean.norm.red"      "median.norm.red"    "mean.norm.green"    "median.norm.green" 
## [21] "mean.norm.yellow"   "median.norm.yellow"

We now see that we have many more trait variables, many of which describe the distribution of the originally measured values by well. We can get an even more complete picture by adding in extra quantiles (quantiles argument), log-transformed values of EXT and the fluorescence channels (log argument), and the minimum and maximum of each trait (ends argument), as below:

setupSummarized2 <- summarizePlate(setupRaw, quantiles=TRUE, log=TRUE, ends=TRUE)
colnames(setupSummarized2)
##   [1] "row"                    "col"                    "n"                      "n.sorted"              
##   [5] "mean.TOF"               "min.TOF"                "q10.TOF"                "q25.TOF"               
##   [9] "median.TOF"             "q75.TOF"                "q90.TOF"                "max.TOF"               
##  [13] "mean.EXT"               "min.EXT"                "q10.EXT"                "q25.EXT"               
##  [17] "median.EXT"             "q75.EXT"                "q90.EXT"                "max.EXT"               
##  [21] "mean.red"               "min.red"                "q10.red"                "q25.red"               
##  [25] "median.red"             "q75.red"                "q90.red"                "max.red"               
##  [29] "mean.green"             "min.green"              "q10.green"              "q25.green"             
##  [33] "median.green"           "q75.green"              "q90.green"              "max.green"             
##  [37] "mean.yellow"            "min.yellow"             "q10.yellow"             "q25.yellow"            
##  [41] "median.yellow"          "q75.yellow"             "q90.yellow"             "max.yellow"            
##  [45] "mean.norm.EXT"          "min.norm.EXT"           "q10.norm.EXT"           "q25.norm.EXT"          
##  [49] "median.norm.EXT"        "q75.norm.EXT"           "q90.norm.EXT"           "max.norm.EXT"          
##  [53] "mean.norm.red"          "min.norm.red"           "q10.norm.red"           "q25.norm.red"          
##  [57] "median.norm.red"        "q75.norm.red"           "q90.norm.red"           "max.norm.red"          
##  [61] "mean.norm.green"        "min.norm.green"         "q10.norm.green"         "q25.norm.green"        
##  [65] "median.norm.green"      "q75.norm.green"         "q90.norm.green"         "max.norm.green"        
##  [69] "mean.norm.yellow"       "min.norm.yellow"        "q10.norm.yellow"        "q25.norm.yellow"       
##  [73] "median.norm.yellow"     "q75.norm.yellow"        "q90.norm.yellow"        "max.norm.yellow"       
##  [77] "mean.log.EXT"           "min.log.EXT"            "q10.log.EXT"            "q25.log.EXT"           
##  [81] "median.log.EXT"         "q75.log.EXT"            "q90.log.EXT"            "max.log.EXT"           
##  [85] "mean.log.red"           "min.log.red"            "q10.log.red"            "q25.log.red"           
##  [89] "median.log.red"         "q75.log.red"            "q90.log.red"            "max.log.red"           
##  [93] "mean.log.green"         "min.log.green"          "q10.log.green"          "q25.log.green"         
##  [97] "median.log.green"       "q75.log.green"          "q90.log.green"          "max.log.green"         
## [101] "mean.log.yellow"        "min.log.yellow"         "q10.log.yellow"         "q25.log.yellow"        
## [105] "median.log.yellow"      "q75.log.yellow"         "q90.log.yellow"         "max.log.yellow"        
## [109] "mean.log.norm.EXT"      "min.log.norm.EXT"       "q10.log.norm.EXT"       "q25.log.norm.EXT"      
## [113] "median.log.norm.EXT"    "q75.log.norm.EXT"       "q90.log.norm.EXT"       "max.log.norm.EXT"      
## [117] "mean.log.norm.red"      "min.log.norm.red"       "q10.log.norm.red"       "q25.log.norm.red"      
## [121] "median.log.norm.red"    "q75.log.norm.red"       "q90.log.norm.red"       "max.log.norm.red"      
## [125] "mean.log.norm.green"    "min.log.norm.green"     "q10.log.norm.green"     "q25.log.norm.green"    
## [129] "median.log.norm.green"  "q75.log.norm.green"     "q90.log.norm.green"     "max.log.norm.green"    
## [133] "mean.log.norm.yellow"   "min.log.norm.yellow"    "q10.log.norm.yellow"    "q25.log.norm.yellow"   
## [137] "median.log.norm.yellow" "q75.log.norm.yellow"    "q90.log.norm.yellow"    "max.log.norm.yellow"
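The qXX columns correspond to quantiles of the per-well distributions, in the sense of R's quantile function; a small sketch on toy TOF readings for a single well:

```r
# Sketch: the q10/q25/q75/q90 columns are per-well quantiles, as computed
# by R's quantile() function (toy TOF readings for one well):
tof <- c(341, 345, 53, 341, 487, 58)

quantile(tof, probs = c(0.10, 0.25, 0.50, 0.75, 0.90))
median(tof)  # the value that would appear as median.TOF for this well
```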

We now have a great deal of information describing, in detail, the distribution of each of the measured traits. Again, each subset of new trait values can be excluded by leaving the corresponding optional parameter (quantiles, log, or ends) set to FALSE. Each trait follows a specific naming system, wherein any mathematical transformation applied to the original data is prepended to the trait name. For instance, the mean of all of the time of flight (TOF) data for each well can be found in the column mean.TOF. If we wanted the 25th percentile of the log-transformed extinction (EXT) data, we could find it in the column named q25.log.EXT. All of the naming conventions are outlined in the table below:

Statistic                      Abbreviation   Example
mean                           mean           mean.TOF
median                         median         median.EXT
minimum                        min            min.yellow
maximum                        max            max.green
normalized to time of flight   norm           mean.norm.red
quantiles                      qXX            q25.red
log transformed data           log            mean.log.red

Some statistics, such as normalized and log-transformed values, are calculated before the data are summarized and, as such, have distributions of their own. For these traits, mean, median, min, max, and quantile values are all available by stringing together names as in the table above.
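The naming scheme composes from the statistic on the left to the raw trait on the right; a hypothetical helper (not part of COPASutils) that assembles names this way:

```r
# Sketch: trait column names compose as <statistic>.<transformations>.<trait>.
# trait_name() is a hypothetical illustration, not a package function.
trait_name <- function(stat, trait, log = FALSE, norm = FALSE) {
  parts <- c(stat, if (log) "log", if (norm) "norm", trait)
  paste(parts, collapse = ".")
}

trait_name("mean", "TOF")                           # "mean.TOF"
trait_name("q25", "EXT", log = TRUE)                # "q25.log.EXT"
trait_name("mean", "red", log = TRUE, norm = TRUE)  # "mean.log.norm.red"
```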

In the experiment we are examining here, no worms were to be sorted into the wells in the even columns. Therefore, the data we see in those columns result from background debris or the accidental placement of worms into the wells. We now want to remove these wells before we continue with our analysis. Included in the package is the removeWells function, which does exactly what its name implies:

setupSummarized3 <- removeWells(setupSummarized, c("A2", "A4", "A6")) # All wells you want to remove need to be included here
head(setupSummarized3[,1:10])
## Source: local data frame [6 x 10]
## Groups: 
## 
##   row col  n n.sorted mean.TOF median.TOF mean.EXT median.EXT mean.red median.red
## 1   A   1 24        3    342.8        341    241.5      230.5    64.92       49.5
## 2   A   2 NA       NA       NA         NA       NA         NA       NA         NA
## 3   A   3 70        3    307.2        310    190.9      209.5    62.61       52.0
## 4   A   4 NA       NA       NA         NA       NA         NA       NA         NA
## 5   A   5 69        3    323.9        346    203.1      215.0    47.00       48.0
## 6   A   6 NA       NA       NA         NA       NA         NA       NA         NA

The removeWells function takes as input a summarized plate data frame, as well as a vector of strings corresponding to the wells to be removed, and returns the data frame with all phenotype data in those wells set to NA. The optional drop parameter specifies whether to "NA out" the trait data or to drop those rows from the frame entirely:

setupSummarized4 <- removeWells(setupSummarized, c("A2", "A4", "A6"), drop=TRUE)
head(setupSummarized4[,1:10])
## Source: local data frame [6 x 10]
## Groups: 
## 
##   row col  n n.sorted mean.TOF median.TOF mean.EXT median.EXT mean.red median.red
## 1   A   1 24        3    342.8        341    241.5      230.5    64.92       49.5
## 3   A   3 70        3    307.2        310    190.9      209.5    62.61       52.0
## 5   A   5 69        3    323.9        346    203.1      215.0    47.00       48.0
## 7   A   7 64        3    301.4        285    185.6      202.0    54.28       46.5
## 8   A   8  2        2    245.0        245    211.5      211.5    43.50       43.5
## 9   A   9 82        3    324.0        356    190.7      205.5    44.24       47.0
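Well labels such as "A2" are simply a row letter followed by a column number; a hypothetical helper (not part of the package) that splits one, for readers building well lists programmatically:

```r
# Sketch: a well label such as "A2" is a row letter followed by a column
# number. parse_well() is a hypothetical helper, not a COPASutils function.
parse_well <- function(well) {
  list(row = substr(well, 1, 1),
       col = as.integer(substr(well, 2, nchar(well))))
}

parse_well("A2")$col   # 2
parse_well("H12")$row  # "H"
```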

The converse of the removeWells function is the fillWells function. This function fills the trait data for any wells missing from the supplied data frame with NAs. To fill the wells back in from the example above, where the rows were dropped from the data frame, we can use the following command:

setupSummarized5 <- fillWells(setupSummarized4)
head(setupSummarized5[,1:10])
##   row col  n n.sorted mean.TOF median.TOF mean.EXT median.EXT mean.red median.red
## 1   A   1 24        3    342.8        341    241.5      230.5    64.92       49.5
## 2   A   2 NA       NA       NA         NA       NA         NA       NA         NA
## 3   A   3 70        3    307.2        310    190.9      209.5    62.61       52.0
## 4   A   4 NA       NA       NA         NA       NA         NA       NA         NA
## 5   A   5 69        3    323.9        346    203.1      215.0    47.00       48.0
## 6   A   6 NA       NA       NA         NA       NA         NA       NA         NA

One issue when working with 96-well plates is that of placement within the plate. Edge wells, being exposed to more direct dry air currents, may experience greater evaporation, which may have an effect on different traits in those wells. To test this hypothesis, we can utilize the edgeEffect function:

edgeEffect(setupSummarized, "n")
## Warning: cannot compute exact p-value with ties
## [1] 0.3775

This function takes a summarized plate data frame and the name of the trait to test. In this instance we tested our setup plate for any effect with respect to the number of worms in each well. The function splits the plate population into wells found on the perimeter of the plate and those found on the interior, then performs a Wilcoxon rank-sum test between the two populations for the specified trait. The resultant p-value is returned. Since the returned p-value exceeds our significance threshold of .05, we fail to reject the null hypothesis that the two populations are drawn from the same distribution. If we want to simultaneously test all traits, we can simply not specify a trait to be tested; a data frame of all traits and associated p-values will be returned.
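Under the hood, this amounts to splitting wells into perimeter and interior groups and running wilcox.test. A self-contained sketch on fake counts (the grouping logic mirrors the description above, not necessarily the package's exact code):

```r
# Sketch: classify wells of a 96-well plate as edge (perimeter) or interior,
# then compare a trait between the groups with a Wilcoxon rank-sum test.
wells <- expand.grid(row = LETTERS[1:8], col = 1:12)
wells$edge <- wells$row %in% c("A", "H") | wells$col %in% c(1, 12)

set.seed(1)
wells$n <- rpois(nrow(wells), lambda = 30)  # fake per-well worm counts

wilcox.test(n ~ edge, data = wells)$p.value  # analogous to edgeEffect's output
```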

Plotting Data

Now that we have access to both the unsummarized and summarized data, we would like to visualize the results from this plate. The plate in this example was set up with worms in every other column. To confirm that large populations exist only in every other column, we will plot a heat map of the plate representing the population present in each well using the plotTrait function:

plotTrait(setupSummarized, "n")

[Plot: heat map of worm count (n) by well]

We can see that the larger populations of worms are, in fact, present only in every other column. We can also see wells that are missing data points, as they are colored gray and lack the numerical annotation. The returned plot is a ggplot2 object and as such can be manipulated using standard ggplot2 functions; this is true for all of the following plotting functions as well. For instance, we can add a title:

plotTrait(setupSummarized, "n") + ggtitle("Example Heatmap")

[Plot: the same heat map with the title "Example Heatmap"]

We are not simply limited to heat maps, however. By plotting the raw data, we can get a better feel for the distributions of the traits. We can plot a histogram of the values in each well:

plotTrait(setupRaw, "TOF", type="hist")

[Plot: histograms of TOF values by well]

Or we can plot a scatter plot between two traits:

plotTrait(setupRaw, "TOF", "EXT", type="scatter")

[Plot: scatter plots of TOF against EXT by well]

We may also want to compare how traits differ across plates. Included in the package is a function called plotCompare that accomplishes this task. Here, we'll compare the distributions of the time-of-flight values between the data in setupRaw and a new plate called scoreRaw. These two plates are entered as a list in the first argument to the function. Then, we specify the trait to be compared (TOF in this case). Finally, we enter a vector of the plate names as the optional third argument:

scoreRaw <- plateData2
plotCompare(list(setupRaw, scoreRaw), "TOF", plateNames=c("Setup", "Score"))

[Plot: side-by-side box plots of TOF for the Setup and Score plates]

We can see that side-by-side box plots are plotted for the values in each well. Likewise, we can compare the summarized values between plates by feeding in summarized plate data:

scoreSummarized <- summarizePlate(scoreRaw)
plotCompare(list(setupSummarized, scoreSummarized), "mean.TOF", plateNames=c("Setup", "Score"))

[Plot: side-by-side box plots of mean.TOF for the Setup and Score plates]

In addition, we can check for correlation between traits both within and across plates using the plotCorMatrix function. This function plots a correlation heat map between all of the traits either within or between summarized plates. Here we will examine correlations between traits within the summarized setup data:

plotCorMatrix(setupSummarized)

[Plot: correlation heat map of traits within the summarized setup plate]

In the above matrix we can see that some traits are positively correlated, some are negatively correlated, and some are completely uncorrelated. We can also examine these patterns between plates as such:

plotCorMatrix(setupSummarized, scoreSummarized)

[Plot: correlation heat map of traits between the setup and score plates]

We can now see an instance where all values of a trait (n.sorted) were equivalent and therefore the correlation could not be calculated. Tiles for these traits are all set to gray.

If we now transition to a new data set, representing an experiment in which drug dose response curves were measured, we can utilize the last plotting function in the package, plotDR. We will first load our example data set, called doseData, into the variable dosesRaw. We will then summarize these data, filling in the strains with a 96-element vector corresponding to the names of the strains across the plate, row-wise.

dosesRaw <- doseData
strains <- rep(c("Strain 1", "Strain 2", "Strain 3", "Strain 4"), each=6, times=4)
dosesSummarized <- summarizePlate(dosesRaw, strains)
doses <- rep(c(0,2.5,5,10,20,NA), times=16)
plotDR(dosesSummarized, dosages=doses, trait="mean.TOF")

[Plot: dose response curves of mean.TOF by strain]
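Since both vectors fill the plate row-wise, their layout can be sanity-checked by reshaping them into 8 x 12 plate form (a sketch using the same rep calls as above):

```r
# Sketch: the strains and doses vectors fill the plate row-wise, so
# reshaping with byrow = TRUE recovers the plate layout.
strains <- rep(c("Strain 1", "Strain 2", "Strain 3", "Strain 4"),
               each = 6, times = 4)
doses <- rep(c(0, 2.5, 5, 10, 20, NA), times = 16)

length(strains)  # 96, one entry per well
matrix(strains, nrow = 8, ncol = 12, byrow = TRUE)[1, ]  # row A: two strains
matrix(doses, nrow = 8, ncol = 12, byrow = TRUE)[1, ]    # row A: two dose series
```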

We can see that we now need to include the strains vector when summarizing the data as well as a dosages vector when plotting the dose response. We also might like to see how the strains vary across every trait. We can generate a list of ggplot2 objects using the plotDR_allTraits function:

plotList <- plotDR_allTraits(dosesSummarized, dosages=doses)

We can even access each plot by name using the scheme below:

plotList$median.red

[Plot: dose response curves of median.red by strain]

plotList$mean.EXT

[Plot: dose response curves of mean.EXT by strain]

Conclusion

The COPASutils package provides quick, streamlined tools for reading, processing, and plotting data resulting from COPAS platform machines. Here we have analyzed data from several different experiments. Of course, analysis pipelines will change from project to project, and the pipeline described here may or may not be the best fit for your data. COPASutils provides a general and flexible workflow for COPAS data and should be easily adaptable to your specific project.