Chapter 3 Importing Data

DE4Rumi takes 2 files as input:

  1. Counts matrix/table
  2. A table with metadata for each sample

3.1 Note on importing files

There are many functions used to import data into R, but it is suggested that the fread function from the data.table packages is used (Dowle and Srinivasan 2021). Not only is it much faster than alternatives, it also automatically detects file types and column separators (e.g. tab or comma) for most files. If you do not use data.table much, then you might not want to load it into your session and instead just name the package each time you use fread by typing data.table::fread(file = "./testfile.csv").

3.2 Import count data

Count data refers to data generated by Next Generation Sequencing technologies, such as RNAseq, and represents how abundant a gene’s mRNA was in the sample. These count tables are produced from processing .fastq files with an RNAseq pipeline, which is not discussed here. For example, the data used in this example was generated with a bioinformatics workflow consisting of FastQC -> trimmomatic -> STAR Aligner -> featureCounts on htt://usegalaxy.org.au. Visit https://davos-i.github.io/Galaxy_RNASeq_Book/ for more details.

The counts matrix/table must be imported with the first column titled gene_ensembl and all other columns named as the respective individual samples.
Each row contains the counts for each gene across all samples, and the first column must contain Ensembl ID’s e.g. ENSBTAG00000000005.

cts_all <- data.table::fread( file = "./Inputs/counts_matrix_bstCattle.txt")
#view first 10 rows of first 5 columns
head(cts_all[,1:5])
##          gene_ensembl VMH_15 VMH_13 VMH_12 VMH_11
## 1: ENSBTAG00000000005   1260    705    776    561
## 2: ENSBTAG00000000008      2      0      0      0
## 3: ENSBTAG00000000009     83     51     81     71
## 4: ENSBTAG00000000010   1084    619    674    462
## 5: ENSBTAG00000000011    230    179    237    170
## 6: ENSBTAG00000000012    380    287    293    227

3.2.1 Check structure of data

It’s often a good idea to check the structure of a table after it is imported, particularly to check that columns are of the correct type/class. E.g. are numbers treated as numbers (int) or are they characters (chr) that need converting? Fortunately data.table::fread guesses these correctly most of the time.

#showing first 15 columns to save space
str(cts_all[,1:15])
## Classes 'data.table' and 'data.frame':   27607 obs. of  15 variables:
##  $ gene_ensembl: chr  "ENSBTAG00000000005" "ENSBTAG00000000008" "ENSBTAG00000000009" "ENSBTAG00000000010" ...
##  $ VMH_15      : int  1260 2 83 1084 230 380 1124 1507 593 10 ...
##  $ VMH_13      : int  705 0 51 619 179 287 962 1050 200 12 ...
##  $ VMH_12      : int  776 0 81 674 237 293 1050 1249 377 5 ...
##  $ VMH_11      : int  561 0 71 462 170 227 712 856 206 10 ...
##  $ VMH_10      : int  977 0 53 733 221 317 853 1028 419 3 ...
##  $ VMH_09      : int  803 3 48 623 198 314 939 1049 381 17 ...
##  $ VMH_08      : int  847 0 45 519 227 243 815 937 273 5 ...
##  $ VMH_07      : int  893 2 50 695 186 243 718 1164 433 12 ...
##  $ VMH_06      : int  793 3 76 563 167 267 838 859 286 9 ...
##  $ VMH_05      : int  703 2 54 720 207 183 613 1189 425 9 ...
##  $ VMH_04      : int  529 1 51 445 142 190 721 704 180 4 ...
##  $ VMH_03      : int  669 3 55 550 154 225 567 893 330 3 ...
##  $ VMH_02      : int  614 0 50 508 170 244 726 839 321 4 ...
##  $ VMH_01      : int  693 3 29 613 154 246 607 910 471 4 ...
##  - attr(*, ".internal.selfref")=<externalptr>

3.3 Import column metadata

DE4Rumi automates many processes that would otherwise require manual input to set names of treatments and pairwise comparisons, etc. To do this, it relies on the sample names being in a specific format:

Tissue_Region_ID__Treatment or in R: paste0(Tissue_Region, "_", ID, "__", Treatment)

Note that there are two underscores between ID and Treatment

coldata <- data.table::fread("./Inputs/colData_all_regions.csv") 

#Turn animal ID's into characters (not integers), with leading 0's and then 
#make sample_names column
coldata <- 
  coldata %>% 
  mutate(ID = as.character(ID) %>% stringr::str_pad(3, pad = "0"),
         sample_names = paste0(Tissue_Region, "_", ID, "__", Treatment))

head(coldata)
##      Name Treatment Tissue_Region  ID sample_names
## 1: ARC_01        LR           ARC 004  ARC_004__LR
## 2: ARC_02       MAL           ARC 008 ARC_008__MAL
## 3: ARC_03        LR           ARC 022  ARC_022__LR
## 4: ARC_04        LR           ARC 032  ARC_032__LR
## 5: ARC_05       LAL           ARC 052 ARC_052__LAL
## 6: ARC_06       MAL           ARC 066 ARC_066__MAL

3.4 Rename columns to match required format

Each set of data will be different, and it is unlikely that the column names of the counts table match the required format for use in this package. Therefore, there users will normally have to rename these columns. Below is a quick way to do it.

You will notice that some names don’t look correct. This will be explained further in next step.

# Create new names for cts columns
## This step will normally require customisation for each new dataset
## This step is important to make sure the new names are in teh same order as the old names.
temp_colnames_calc <- 
  colnames(cts_all)[-1] %>% #Take the column names minus the first one
  as_tibble() %>% #convert it to a tibble
  dplyr::rename(Name = value) %>% #rename the 'value' column which is generated in previous step to be 'Name'
  left_join(coldata, by = "Name") #add the coldata table to it, by matching on the columns "Name"


# add new column names
## Take the column named 'sample_names" and add them as the new column names of the counts table, starting after the first column
colnames(cts_all)[-1] <- temp_colnames_calc[["sample_names"]]
## Name the first column 'gene_ensembl', as required.
colnames(cts_all)[1] <- "gene_ensembl"

# Check column names of counts table
colnames(cts_all)
##   [1] "gene_ensembl" "VMH_193__LAL" "VMH_184__LAL" "VMH_149__LR"  "VMH_136__LAL"
##   [6] "VMH_112__LAL" "VMH_096__MAL" "VMH_088__MAL" "VMH_079__LR"  "VMH_066__MAL"
##  [11] "VMH_052__LAL" "VMH_032__LR"  "VMH_022__LR"  "VMH_008__MAL" "VMH_004__LR" 
##  [16] "LHA_193__LAL" "LHA_186__MAL" "LHA_184__LAL" "LHA_149__LR"  "LHA_136__LAL"
##  [21] "LHA_112__LAL" "LHA_096__MAL" "LHA_088__MAL" "LHA_079__LR"  "LHA_066__MAL"
##  [26] "LHA_052__LAL" "LHA_032__LR"  "LHA_022__LR"  "LHA_008__MAL" "LHA_004__LR" 
##  [31] "ARC_193__LAL" "ARC_186__MAL" "ARC_184__LAL" "ARC_149__LR"  "ARC_136__LAL"
##  [36] "ARC_112__LAL" "ARC_096__MAL" "ARC_088__MAL" "ARC_079__LR"  "ARC_066__MAL"
##  [41] "ARC_052__LAL" "ARC_032__LR"  "ARC_022__LR"  "ARC_008__MAL" "ARC_004__LR" 
##  [46] "VMH_186__MAL" "RUM_066__MAL" "ABO_052__LAL" "DUO_079__LR"  "DUO_186__MAL"
##  [51] "ABO_149__LR"  "ABO_186__MAL" "RUM_112__LAL" "DUO_149__LR"  "DUO_004__LR" 
##  [56] "ABO_088__MAL" "ABO_022__LR"  "DUO_066__MAL" "RUM_079__LR"  "ABO_032__LR" 
##  [61] "ABO_184__LAL" "DUO_193__LAL" "RUM_136__LAL" "DUO_184__LAL" "ABO_193__LAL"
##  [66] "RUM_004__LR"  "ABO_008__MAL" "ABO_096__MAL" "ABO_112__LAL" "RUM_186__MAL"
##  [71] "DUO_052__LAL" "ABO_079__LR"  "RUM_032__LR"  "DUO_088__MAL" "ABO_004__LR" 
##  [76] "RUM_008__MAL" "RUM_096__MAL" "DUO_022__LR"  "DUO_112__LAL" "RUM_149__LR" 
##  [81] "RUM_193__LAL" "ABO_136__LAL" "ABO_066__MAL" "RUM_052__LAL" "DUO_032__LR" 
##  [86] "RUM_088__MAL" "DUO_008__MAL" "DUO_096__MAL" "RUM_022__LR"  "DUO_136__LAL"
##  [91] "RUM_184__LAL" "LIV_30"       "LIV_193__LAL" "LIV_186__MAL" "LIV_184__LAL"
##  [96] "LIV_149__LR"  "LIV_25"       "LIV_24"       "LIV_136__LAL" "LIV_22"      
## [101] "LIV_112__LAL" "LIV_20"       "LIV_19"       "LIV_18"       "LIV_096__MAL"
## [106] "LIV_088__MAL" "LIV_079__LR"  "LIV_066__MAL" "LIV_13"       "LIV_052__LAL"
## [111] "LIV_11"       "LIV_10"       "LIV_032__LR"  "LIV_08"       "LIV_022__LR" 
## [116] "LIV_06"       "LIV_05"       "LIV_04"       "LIV_03"       "LIV_008__MAL"
## [121] "LIV_004__LR"

3.5 Check data

The DE4Rumi package includes some helper functions to ensure that the column data and the counts matrix are formatted correctly.

In the previous step, there were some unusual looking names. This is because there is coldata information for samples that were not included in this analysis. This might be common if you have sequenced a subset of samples from a larger experiment. This is not a problem, and the following functions will handle it appropriately.

The first check we run is the check_count_matrix() function. It takes the counts data, the column data and the name of the column in the column data that contains the sample names (to match to the counts data). If there is an error, it will tell you more details in the Console of R Studio (at bottom of screen).

cts_filtered <- 
  check_count_matrix(count_data = cts_all,
                   colData = coldata,
                   column_with_col_names = sample_names)

In this case there was some checks that did not pass, but there was also an error which terminated the function. The critical error here was:

## FAIL: colData has entries in column_with_col_names that do not match a column in count_data. Potentially missing data?
## Entries in colData with no matching counts data: LIV_999__NO
##  Error in check_count_matrix(count_data = cts_all, colData = coldata, column_with_col_names = sample_names) : 
##   Checks terminated. See notes in Console. Run subset_colData() and try again.

In the console it listed the Entries in colData with no matching counts data: LIV_999__NO which means that there is an extra sample in the counts data. Users may wish to address this manually by re-importing a correct coldata file. Otherwise, the DE4Rumi also contains the function subset_colData() which can automatically remove the entry. Notice that the function outputs the new coldata, and we have assigned it to the same name: coldata.

coldata <- subset_colData(count_data = cts_all,
                          colData = coldata,
                          column_with_col_names = sample_names)
## Entries in colData with no matching counts data: LIV_999__NO
## Dimensions before filtering:  106   5
## Dimensions after filtering:  105   5

The output above shows that the missing sample has been removed. Next, we re-run the check_count_matrix() function. See that it also shows the extra coldata entries that we noticed above. Also notice that this function outputs a cts-filtered object which has been filtered, meaning it is ready to use downstream.

cts_filtered <- 
  check_count_matrix(count_data = cts_all,
                   colData = coldata,
                   column_with_col_names = sample_names)
## FAIL: Not all columns in count_data have matching data in colData
## Missing samples:
##  [1] "LIV_30" "LIV_25" "LIV_24" "LIV_22" "LIV_20" "LIV_19" "LIV_18" "LIV_13"
##  [9] "LIV_11" "LIV_10" "LIV_08" "LIV_06" "LIV_05" "LIV_04" "LIV_03"
## Function will now output count_data without these columns!!
## PASS: Rownames not detected
## PASS: First column name is 'gene_ensembl'
## 
##  Dimensions of raw counts dataset BEFORE sorting:
## genes:  27607     samples:  120
## 
##  Number of samples in colData: 105
## 
##  Dimensions of raw counts dataset AFTER sorting:
## genes:  27607     samples:  105

References

Dowle, Matt, and Arun Srinivasan. 2021. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.