# 1 Installation

```r
# Install NACHO from CRAN:
install.packages("NACHO")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("mcanouil/NACHO")
```

# 2 Overview

NACHO (NAnostring quality Control dasHbOard) is developed for NanoString nCounter data.
NanoString nCounter is an mRNA or miRNA expression assay that works with fluorescent barcodes (Geiss et al. 2008).
Each barcode is assigned to an mRNA/miRNA and can be counted after binding to its target.
As a result, each count of a specific barcode represents the presence of its target mRNA/miRNA.

NACHO can load, visualise and normalise the exported NanoString nCounter data, and helps the user perform quality control.
NACHO does this by visualising quality control metrics, expression of control genes, principal components and sample specific size factors in an interactive web application.

RCC files are summarised and visualised with two functions, summarise() and visualise():

• The summarise() function is used to preprocess the data.
• The visualise() function initiates an RStudio Shiny-based dashboard that visualises all relevant QC plots.

NACHO also includes a function normalise(), which calculates sample specific size factors and normalises the data.

• The normalise() function creates a list in which your settings, the raw counts and normalised counts are stored.
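To make the idea of sample-specific size factors concrete, here is a minimal base R sketch of a geometric-mean normalisation on housekeeping genes. It is an illustration under stated assumptions, not NACHO's actual implementation: the count matrix, the gene names and the helper `geo_mean()` are all made up for this example.

```r
# Toy count matrix: rows are genes, columns are samples (values are made up).
counts <- matrix(
  c(100, 200, 400,
     50, 100, 200,
     10,  40,  20),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("HK1", "HK2", "GENE_A"), c("S1", "S2", "S3"))
)
housekeeping <- c("HK1", "HK2") # hypothetical housekeeping genes

# Geometric mean of a vector of strictly positive counts
# (zero counts would need special handling, which is omitted here).
geo_mean <- function(x) exp(mean(log(x)))

# Geometric mean of the housekeeping counts within each sample.
sample_means <- apply(counts[housekeeping, , drop = FALSE], 2, geo_mean)

# Size factor: average across samples divided by each sample's own value.
size_factors <- mean(sample_means) / sample_means

# Multiply each sample (column) by its size factor.
normalised <- sweep(counts, 2, size_factors, `*`)
```

In this toy data the housekeeping genes differ between samples only by a constant factor, so after scaling they take the same value in every sample, while `GENE_A` keeps its sample-to-sample differences.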

# 3 An example

To display the usage and utility of NACHO, we show three examples in which the above-mentioned functions are used and the results are briefly examined. NACHO comes with presummarised data, and in the first example we use this data to call the interactive web application with visualise(). In the second example we show the process of going from raw RCC files to visualisations with a data set queried from GEO using GEOquery.

In the third example we use the summarised data from the second example to calculate the sample specific size factors using normalise() and its added functionality to predict housekeeping genes.

Besides creating interactive visualisations, NACHO also identifies poorly performing samples, which are listed under the Outlier Table tab in the interactive web application.
When calling normalise(), the user can remove these outliers before the size factors are calculated.

## 3.1 Get NanoString nCounter data

### 3.1.1 Presummarised data from NACHO

This example shows how to use summarised data to call the interactive web application.
The raw data used in the summarisation is from a study of Liu et al. (2016) and was acquired from the NCBI GEO public database (Barrett et al. 2013).

```r
library(NACHO)
data(GSE74821)
visualise(GSE74821)
```

### 3.1.2 Raw data from GEO

Numerous NanoString nCounter data sets are available from GEO (Barrett et al. 2013).
In this example we use an mRNA dataset from the study of Bruce et al. (2015) with the GEO accession number GSE70970. The data is extracted and prepared using the following code.

Note: GEOquery seems to be currently down. Thus, the following code was not executed.

```r
library(GEOquery)
gse <- getGEO("GSE70970")
# Get phenotypes
targets <- pData(phenoData(gse[[1]]))
getGEOSuppFiles(GEO = "GSE70970", baseDir = tempdir())
# Unzip data
untar(
  tarfile = paste0(tempdir(), "/GSE70970/GSE70970_RAW.tar"),
  exdir = paste0(tempdir(), "/GSE70970/Data")
)
targets$IDFILE <- list.files(paste0(tempdir(), "/GSE70970/Data"))
```

After extracting the dataset to the paste0(tempdir(), "/GSE70970/Data") directory, a Samplesheet.csv containing a column with the exact file name for each sample can be written, or the targets data frame can be used as is.
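As a rough sketch of that step, the snippet below builds a small sample sheet, checks that every file it names exists in the data directory, and writes it to a CSV file. The sample names, file names, the `SAMPLE` column and the `example_rcc` directory are hypothetical placeholders; only the `IDFILE` column name follows the example above.

```r
# Hypothetical sample annotations; in the GEO example these come from `targets`.
samplesheet <- data.frame(
  SAMPLE = c("sample_1", "sample_2"),
  IDFILE = c("sample_1.RCC", "sample_2.RCC"),
  stringsAsFactors = FALSE
)

# Stand-in data directory with empty placeholder files (not real RCC content).
data_dir <- file.path(tempdir(), "example_rcc")
dir.create(data_dir, showWarnings = FALSE)
file.create(file.path(data_dir, samplesheet$IDFILE))

# Every file named in the sample sheet should exist in the data directory.
missing_files <- setdiff(samplesheet$IDFILE, list.files(data_dir))
stopifnot(length(missing_files) == 0)

# Write the sample sheet without row names so IDFILE stays a regular column.
ssheet_path <- file.path(tempdir(), "Samplesheet.csv")
write.csv(samplesheet, file = ssheet_path, row.names = FALSE)
```

Checking the file names before summarising catches mismatches early, for example when the order of list.files() does not correspond to the rows of the phenotype table.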

## 3.2 The summarise() function

The first argument is the path to the directory containing the RCC files, the second argument is the location of the samplesheet, and the third argument is the name of the column containing the exact file names.
The housekeeping_genes and normalisation_method arguments indicate, respectively, which housekeeping genes and which normalisation method should be used.

```r
library(NACHO)
GSE70970_sum <- summarise(
  data_directory = paste0(tempdir(), "/GSE70970/Data"), # Where the data is
  ssheet_csv = targets, # The samplesheet
  id_colname = "IDFILE", # Name of the column that contains the identifiers
  housekeeping_genes = NULL, # Custom list of housekeeping genes
  housekeeping_predict = TRUE, # Predict the housekeeping genes based on the data?
  normalisation_method = "GEO", # Geometric mean ("GEO") or "GLM"
  n_comp = 5 # Number of principal components to compute
)
```
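The n_comp argument sets how many principal components are computed. The following base R sketch shows what keeping the first five components of a principal component analysis looks like on simulated counts. It assumes a log2 transformation and `stats::prcomp()`, which may differ from what NACHO does internally, and all values are simulated.

```r
set.seed(1)

# Simulated counts: 8 samples (rows) by 20 genes (columns); values are made up.
counts <- matrix(
  rpois(8 * 20, lambda = 100),
  nrow = 8,
  dimnames = list(paste0("S", 1:8), paste0("G", 1:20))
)

n_comp <- 5

# PCA on log-transformed counts; adding 1 avoids log2(0).
pca <- prcomp(log2(counts + 1), center = TRUE, scale. = FALSE)

# Keep the sample coordinates on the first n_comp components.
scores <- pca$x[, seq_len(n_comp)]

# Proportion of variance explained by each retained component.
var_explained <- (pca$sdev^2 / sum(pca$sdev^2))[seq_len(n_comp)]
```

Each row of `scores` is one sample, so the first two columns can be plotted against each other to look for samples that separate from the rest.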

## 3.3 The visualise() function

When the summarisation is done, the summarised (or normalised) data can be visualised using the visualise() function as can be seen in the following chunk of code.

```r
visualise(GSE70970_sum)
```

On the left there are widgets that control the interactive plotting area. These widgets differ depending on which main tab is chosen. The second layer of tabs is also interactive and changes based on which main tab is chosen. Each sample in the plots can be coloured based on either the technical specifications included in the RCC files or specifications of your own choosing, provided the latter are included in the samplesheet.

## 3.4 The normalise() function

NACHO allows the discovery of housekeeping genes within your own dataset. NACHO finds the five most suitable housekeeping genes; however, one of these five genes might not be suitable, so a subset of the discovered housekeeping genes might work better in some cases. For this example we use the GSE70970 dataset from the previous example. The discovered housekeeping genes are stored in the housekeeping_genes element of the object returned by summarise().

```r
print(GSE70970_sum[["housekeeping_genes"]])
my_housekeeping <- GSE70970_sum[["housekeeping_genes"]][-c(1, 2)]
print(my_housekeeping)
```
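The negative index in the code above drops genes by position. The short sketch below shows the same idea on a made-up vector, together with a name-based alternative; the gene names are hypothetical placeholders, not the genes NACHO predicts for GSE70970.

```r
# Hypothetical gene names standing in for five predicted housekeeping genes.
predicted <- c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E")

# Negative indices drop elements by position: remove the first two.
subset_by_position <- predicted[-c(1, 2)]

# Equivalent selection by name, which does not depend on the order.
subset_by_name <- setdiff(predicted, c("GENE_A", "GENE_B"))

print(subset_by_position)
```

Selecting by name is a little more verbose, but it keeps working if the order of the predicted genes changes between runs.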

The next step is the actual normalisation. The first argument requires the summary created with the summarise() function. The second argument requires a vector of gene names; in this case it is the subset of the discovered housekeeping genes we just made. The remove_outliers argument lets the user choose whether to remove the outliers. Lastly, the normalisation method can be chosen.
Here the user has a choice between "GLM" and "GEO". The differences between the normalisation methods are nuanced, and which one is preferable depends on the use case.
In this example "GEO" is used.

```r
GSE70970_norm <- normalise(
  nacho_object = GSE70970_sum,
  housekeeping_genes = my_housekeeping,
  housekeeping_norm = TRUE,
  normalisation_method = "GEO",
  remove_outliers = TRUE
)
```

normalise() returns a list object (same as summarise()) with raw_counts and normalised_counts slots filled with the raw and normalised counts. Both counts are also in the NACHO data.frame.

# References

Barrett, Tanya, Stephen E. Wilhite, Pierre Ledoux, Carlos Evangelista, Irene F. Kim, Maxim Tomashevsky, Kimberly A. Marshall, et al. 2013. “NCBI GEO: Archive for Functional Genomics Data Sets—Update.” Nucleic Acids Research 41 (D1): D991–D995. https://doi.org/10.1093/nar/gks1193.

Bruce, Jeff P., Angela B. Y. Hui, Wei Shi, Bayardo Perez-Ordonez, Ilan Weinreb, Wei Xu, Benjamin Haibe-Kains, et al. 2015. “Identification of a microRNA Signature Associated with Risk of Distant Metastasis in Nasopharyngeal Carcinoma.” Oncotarget 6 (6): 4537–50. https://doi.org/10.18632/oncotarget.3005.

Geiss, Gary K., Roger E. Bumgarner, Brian Birditt, Timothy Dahl, Naeem Dowidar, Dwayne L. Dunaway, H. Perry Fell, et al. 2008. “Direct Multiplexed Measurement of Gene Expression with Color-Coded Probe Pairs.” Nature Biotechnology 26 (3): 317–25. https://doi.org/10.1038/nbt1385.

Liu, Minetta C., Brandelyn N. Pitcher, Elaine R. Mardis, Sherri R. Davies, Paula N. Friedman, Jacqueline E. Snider, Tammi L. Vickery, et al. 2016. “PAM50 Gene Signatures and Breast Cancer Prognosis with Adjuvant Anthracycline- and Taxane-Based Chemotherapy: Correlative Analysis of C9741 (Alliance).” Npj Breast Cancer 2 (January): 15023. https://doi.org/10.1038/npjbcancer.2015.23.