Chapter 3 Importing Data
DE4Rumi takes 2 files as input:
- Counts matrix/table
- A table with metadata for each sample
3.1 Note on importing files
There are many functions used to import data into R, but it is suggested that
the fread
function from the data.table packages is used (Dowle and Srinivasan 2021).
Not only is it much faster than alternatives, it also automatically detects file
types and column separators (e.g. tab or comma) for most files. If you do not
use data.table much, then you might not want to load it into your session and
instead just name the package each time you use fread
by
typing data.table::fread(file = "./testfile.csv")
.
3.2 Import count data
Count data refers to data generated by Next Generation Sequencing technologies, such as RNAseq, and represents how abundant a gene’s mRNA was in the sample. These count tables are produced from processing .fastq files with an RNAseq pipeline, which is not discussed here. For example, the data used in this example was generated with a bioinformatics workflow consisting of FastQC -> trimmomatic -> STAR Aligner -> featureCounts on htt://usegalaxy.org.au. Visit https://davos-i.github.io/Galaxy_RNASeq_Book/ for more details.
The counts matrix/table must be imported with the first column titled
gene_ensembl
and all other columns named as the respective individual samples.
Each row contains the counts for each gene across all samples, and the first
column must contain Ensembl ID’s e.g. ENSBTAG00000000005
.
<- data.table::fread( file = "./Inputs/counts_matrix_bstCattle.txt")
cts_all #view first 10 rows of first 5 columns
head(cts_all[,1:5])
## gene_ensembl VMH_15 VMH_13 VMH_12 VMH_11
## 1: ENSBTAG00000000005 1260 705 776 561
## 2: ENSBTAG00000000008 2 0 0 0
## 3: ENSBTAG00000000009 83 51 81 71
## 4: ENSBTAG00000000010 1084 619 674 462
## 5: ENSBTAG00000000011 230 179 237 170
## 6: ENSBTAG00000000012 380 287 293 227
3.2.1 Check structure of data
It’s often a good idea to check the structure of a table after it is imported, particularly to check that
columns are of the correct type/class. E.g. are numbers treated as numbers (int) or are they characters (chr) that need converting?
Fortunately data.table::fread
guesses these correctly most of the time.
#showing first 15 columns to save space
str(cts_all[,1:15])
## Classes 'data.table' and 'data.frame': 27607 obs. of 15 variables:
## $ gene_ensembl: chr "ENSBTAG00000000005" "ENSBTAG00000000008" "ENSBTAG00000000009" "ENSBTAG00000000010" ...
## $ VMH_15 : int 1260 2 83 1084 230 380 1124 1507 593 10 ...
## $ VMH_13 : int 705 0 51 619 179 287 962 1050 200 12 ...
## $ VMH_12 : int 776 0 81 674 237 293 1050 1249 377 5 ...
## $ VMH_11 : int 561 0 71 462 170 227 712 856 206 10 ...
## $ VMH_10 : int 977 0 53 733 221 317 853 1028 419 3 ...
## $ VMH_09 : int 803 3 48 623 198 314 939 1049 381 17 ...
## $ VMH_08 : int 847 0 45 519 227 243 815 937 273 5 ...
## $ VMH_07 : int 893 2 50 695 186 243 718 1164 433 12 ...
## $ VMH_06 : int 793 3 76 563 167 267 838 859 286 9 ...
## $ VMH_05 : int 703 2 54 720 207 183 613 1189 425 9 ...
## $ VMH_04 : int 529 1 51 445 142 190 721 704 180 4 ...
## $ VMH_03 : int 669 3 55 550 154 225 567 893 330 3 ...
## $ VMH_02 : int 614 0 50 508 170 244 726 839 321 4 ...
## $ VMH_01 : int 693 3 29 613 154 246 607 910 471 4 ...
## - attr(*, ".internal.selfref")=<externalptr>
3.3 Import column metadata
DE4Rumi automates many processes that would otherwise require manual input to set names of treatments and pairwise comparisons, etc. To do this, it relies on the sample names being in a specific format:
Tissue_Region_ID__Treatment
or in R: paste0(Tissue_Region, "_", ID, "__", Treatment)
Note that there are two underscores between ID and Treatment
<- data.table::fread("./Inputs/colData_all_regions.csv")
coldata
#Turn animal ID's into characters (not integers), with leading 0's and then
#make sample_names column
<-
coldata %>%
coldata mutate(ID = as.character(ID) %>% stringr::str_pad(3, pad = "0"),
sample_names = paste0(Tissue_Region, "_", ID, "__", Treatment))
head(coldata)
## Name Treatment Tissue_Region ID sample_names
## 1: ARC_01 LR ARC 004 ARC_004__LR
## 2: ARC_02 MAL ARC 008 ARC_008__MAL
## 3: ARC_03 LR ARC 022 ARC_022__LR
## 4: ARC_04 LR ARC 032 ARC_032__LR
## 5: ARC_05 LAL ARC 052 ARC_052__LAL
## 6: ARC_06 MAL ARC 066 ARC_066__MAL
3.4 Rename columns to match required format
Each set of data will be different, and it is unlikely that the column names of the counts table match the required format for use in this package. Therefore, there users will normally have to rename these columns. Below is a quick way to do it.
You will notice that some names don’t look correct. This will be explained further in next step.
# Create new names for cts columns
## This step will normally require customisation for each new dataset
## This step is important to make sure the new names are in teh same order as the old names.
<-
temp_colnames_calc colnames(cts_all)[-1] %>% #Take the column names minus the first one
as_tibble() %>% #convert it to a tibble
::rename(Name = value) %>% #rename the 'value' column which is generated in previous step to be 'Name'
dplyrleft_join(coldata, by = "Name") #add the coldata table to it, by matching on the columns "Name"
# add new column names
## Take the column named 'sample_names" and add them as the new column names of the counts table, starting after the first column
colnames(cts_all)[-1] <- temp_colnames_calc[["sample_names"]]
## Name the first column 'gene_ensembl', as required.
colnames(cts_all)[1] <- "gene_ensembl"
# Check column names of counts table
colnames(cts_all)
## [1] "gene_ensembl" "VMH_193__LAL" "VMH_184__LAL" "VMH_149__LR" "VMH_136__LAL"
## [6] "VMH_112__LAL" "VMH_096__MAL" "VMH_088__MAL" "VMH_079__LR" "VMH_066__MAL"
## [11] "VMH_052__LAL" "VMH_032__LR" "VMH_022__LR" "VMH_008__MAL" "VMH_004__LR"
## [16] "LHA_193__LAL" "LHA_186__MAL" "LHA_184__LAL" "LHA_149__LR" "LHA_136__LAL"
## [21] "LHA_112__LAL" "LHA_096__MAL" "LHA_088__MAL" "LHA_079__LR" "LHA_066__MAL"
## [26] "LHA_052__LAL" "LHA_032__LR" "LHA_022__LR" "LHA_008__MAL" "LHA_004__LR"
## [31] "ARC_193__LAL" "ARC_186__MAL" "ARC_184__LAL" "ARC_149__LR" "ARC_136__LAL"
## [36] "ARC_112__LAL" "ARC_096__MAL" "ARC_088__MAL" "ARC_079__LR" "ARC_066__MAL"
## [41] "ARC_052__LAL" "ARC_032__LR" "ARC_022__LR" "ARC_008__MAL" "ARC_004__LR"
## [46] "VMH_186__MAL" "RUM_066__MAL" "ABO_052__LAL" "DUO_079__LR" "DUO_186__MAL"
## [51] "ABO_149__LR" "ABO_186__MAL" "RUM_112__LAL" "DUO_149__LR" "DUO_004__LR"
## [56] "ABO_088__MAL" "ABO_022__LR" "DUO_066__MAL" "RUM_079__LR" "ABO_032__LR"
## [61] "ABO_184__LAL" "DUO_193__LAL" "RUM_136__LAL" "DUO_184__LAL" "ABO_193__LAL"
## [66] "RUM_004__LR" "ABO_008__MAL" "ABO_096__MAL" "ABO_112__LAL" "RUM_186__MAL"
## [71] "DUO_052__LAL" "ABO_079__LR" "RUM_032__LR" "DUO_088__MAL" "ABO_004__LR"
## [76] "RUM_008__MAL" "RUM_096__MAL" "DUO_022__LR" "DUO_112__LAL" "RUM_149__LR"
## [81] "RUM_193__LAL" "ABO_136__LAL" "ABO_066__MAL" "RUM_052__LAL" "DUO_032__LR"
## [86] "RUM_088__MAL" "DUO_008__MAL" "DUO_096__MAL" "RUM_022__LR" "DUO_136__LAL"
## [91] "RUM_184__LAL" "LIV_30" "LIV_193__LAL" "LIV_186__MAL" "LIV_184__LAL"
## [96] "LIV_149__LR" "LIV_25" "LIV_24" "LIV_136__LAL" "LIV_22"
## [101] "LIV_112__LAL" "LIV_20" "LIV_19" "LIV_18" "LIV_096__MAL"
## [106] "LIV_088__MAL" "LIV_079__LR" "LIV_066__MAL" "LIV_13" "LIV_052__LAL"
## [111] "LIV_11" "LIV_10" "LIV_032__LR" "LIV_08" "LIV_022__LR"
## [116] "LIV_06" "LIV_05" "LIV_04" "LIV_03" "LIV_008__MAL"
## [121] "LIV_004__LR"
3.5 Check data
The DE4Rumi package includes some helper functions to ensure that the column data and the counts matrix are formatted correctly.
In the previous step, there were some unusual looking names. This is because there is coldata information for samples that were not included in this analysis. This might be common if you have sequenced a subset of samples from a larger experiment. This is not a problem, and the following functions will handle it appropriately.
The first check we run is the check_count_matrix()
function. It takes the counts data, the column data and the name of the column in the column data that contains the sample names (to match to the counts data). If there is an error, it will tell you more details in the Console of R Studio (at bottom of screen).
<-
cts_filtered check_count_matrix(count_data = cts_all,
colData = coldata,
column_with_col_names = sample_names)
In this case there was some checks that did not pass, but there was also an error which terminated the function. The critical error here was:
## FAIL: colData has entries in column_with_col_names that do not match a column in count_data. Potentially missing data?
## Entries in colData with no matching counts data: LIV_999__NO
## Error in check_count_matrix(count_data = cts_all, colData = coldata, column_with_col_names = sample_names) :
## Checks terminated. See notes in Console. Run subset_colData() and try again.
In the console it listed the Entries in colData with no matching counts data: LIV_999__NO
which means that there is an extra sample in the counts data. Users may wish to address this manually by re-importing a correct coldata file. Otherwise, the DE4Rumi also contains the function subset_colData()
which can automatically remove the entry. Notice that the function outputs the new coldata, and we have assigned it to the same name: coldata
.
<- subset_colData(count_data = cts_all,
coldata colData = coldata,
column_with_col_names = sample_names)
## Entries in colData with no matching counts data: LIV_999__NO
## Dimensions before filtering: 106 5
## Dimensions after filtering: 105 5
The output above shows that the missing sample has been removed. Next, we re-run the check_count_matrix()
function. See that it also shows the extra coldata entries that we noticed above. Also notice that this function outputs a cts-filtered
object which has been filtered, meaning it is ready to use downstream.
<-
cts_filtered check_count_matrix(count_data = cts_all,
colData = coldata,
column_with_col_names = sample_names)
## FAIL: Not all columns in count_data have matching data in colData
## Missing samples:
## [1] "LIV_30" "LIV_25" "LIV_24" "LIV_22" "LIV_20" "LIV_19" "LIV_18" "LIV_13"
## [9] "LIV_11" "LIV_10" "LIV_08" "LIV_06" "LIV_05" "LIV_04" "LIV_03"
## Function will now output count_data without these columns!!
## PASS: Rownames not detected
## PASS: First column name is 'gene_ensembl'
##
## Dimensions of raw counts dataset BEFORE sorting:
## genes: 27607 samples: 120
##
## Number of samples in colData: 105
##
## Dimensions of raw counts dataset AFTER sorting:
## genes: 27607 samples: 105