Chapter 5 More notes on multiple lane files
The output from Illumina sequencing is sometimes provided in multiple files, each corresponding to a ‘Lane’ on the sequencer. It is easier to ask the lab to provide the output as a single file, which can be generated using the --no-lane-splitting option of Illumina’s bcl2fastq program. However, multiple lane files can also be handled in Galaxy.
The following is a description of how to handle this manually. Please note that this is automated within Step 1 Workflow.
If there are multiple files, it is best practice to run FastQC on each individual file, as one file could be corrupt or you may identify a bias for one particular ‘Lane’. If the files are OK, they can be concatenated before proceeding with all further steps. Two workflows are used in Galaxy: the first is designed to work with each individual lane file, and the second requires 1 file per sample. To get from individual lane files to 1 file per sample, first group the lane files by sample with the following steps:
- Open the ‘Apply Rule to Collection’ tool
- Input collection is the list with the re-labelled data
- Press Edit button
- There should be 1 column titled ‘A’ with the names of the files (the re-labelled names set previously)
Make new columns with grouping information:
- Press +Column -> Using a regular expression
- Select ‘Create column matching expression groups’
- Paste the following code into the Regular Expression box:
(.*?)_L(.*)
- Set the number of groups to 2
- Press Apply
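To see what the regular expression above does to a file name, the following is a minimal sketch using Python's `re` module as a stand-in for Galaxy's rule builder (Galaxy's own matching may differ in details). The file names and the `split_name` helper are hypothetical and used only for illustration. The first group becomes Column B (the sample) and the second becomes Column C (the lane).

```python
import re

# Same expression as pasted into the Regular Expression box.
LANE_PATTERN = re.compile(r"(.*?)_L(.*)")


def split_name(name):
    """Return (sample, lane) for a re-labelled file name (hypothetical helper)."""
    match = LANE_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"No '_L' lane suffix found in: {name!r}")
    return match.group(1), match.group(2)


# Hypothetical re-labelled names, assuming the lane suffix is '_L001', '_L002', ...
for name in ["2139_Stage 2_Fast_L001", "2139_Stage 2_Fast_L002"]:
    sample, lane = split_name(name)
    print(f"{name} -> B={sample!r}, C={lane!r}")

# Caveat: the first group is lazy, so the split happens at the FIRST '_L'.
# A sample name that itself contains '_L' is split too early.
print(split_name("Sample_Liver_L001"))  # ('Sample', 'iver_L001')
```

If any sample names contain `_L` before the lane suffix, check Columns B and C in the rule builder preview before pressing Apply.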
Set columns as identifiers for grouping:
- Press +Rules -> Add / Modify Column Definitions
- Press +Add Definition -> List Identifiers
- Select column B, then click on ‘Assign another column’ and select C
- Press Apply
- Press Save, then Execute the job
This outputs a nested list to the history. The number of items in the list should match the number of samples/animals, and each sample in the list should contain its individual lane files (e.g. 2 or 3 files).
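The shape of the nested list can be checked by hand or, outside Galaxy, with a short script. The sketch below is only an illustration of the expected structure: the file names and the `group_by_sample` helper are hypothetical, and a plain Python dictionary stands in for the Galaxy collection.

```python
import re
from collections import defaultdict


def group_by_sample(names):
    """Group lane identifiers under their sample name (hypothetical helper)."""
    nested = defaultdict(list)
    for name in names:
        match = re.fullmatch(r"(.*?)_L(.*)", name)
        if match is None:
            raise ValueError(f"No '_L' lane suffix found in: {name!r}")
        sample, lane = match.groups()
        nested[sample].append(lane)
    return dict(nested)


# Hypothetical input: 2 samples, each sequenced over 2 lanes.
names = [
    "2139_Stage 2_Fast_L001",
    "2139_Stage 2_Fast_L002",
    "2140_Stage 2_Fast_L001",
    "2140_Stage 2_Fast_L002",
]

nested = group_by_sample(names)

# Outer level: one entry per sample. Inner level: one entry per lane file.
print(f"{len(nested)} samples")
for sample, lanes in nested.items():
    print(f"  {sample}: {len(lanes)} lane files {lanes}")
```

A sample with fewer lane files than the others is worth investigating before concatenating, as a lane file may be missing or mis-labelled.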
To then join these datasets together, use the ‘Collapse Collection’ tool. In the background it does the same as using the ‘concatenate datasets tail-to-head’ tool on the files of each sample individually. Although this is not clear from the tool’s description, if it is given a nested list it collapses the lowest level of groups together; in this case that is Column C from above, which holds the individual lanes. It should output a new list with the same number of files as the number of samples/animals. The names should be the names defined in Column B above, e.g. 2139_Stage 2_Fast:
- Open Collapse Collection tool
- Choose to use a dataset collection, not an individual file
- Select the nested list from above
- Execute with all other default settings
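For reference, tail-to-head concatenation simply appends the content of each lane file, in order, to one output file per sample. The following is a minimal sketch of that idea outside Galaxy; it is not how the ‘Collapse Collection’ tool is implemented. The `concatenate_lanes` helper, the file names and the tiny FASTQ records are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def concatenate_lanes(lane_paths, out_path):
    """Append each lane file to out_path, in the order given (hypothetical helper)."""
    with open(out_path, "wb") as out:
        for lane_path in lane_paths:
            with open(lane_path, "rb") as lane:
                # Copy raw bytes so the records are not altered in any way.
                shutil.copyfileobj(lane, out)
    return Path(out_path)


# Demonstration with two tiny made-up lane files for one sample.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    lane1 = tmp / "2139_Stage 2_Fast_L001.fastq"
    lane2 = tmp / "2139_Stage 2_Fast_L002.fastq"
    lane1.write_text("@read1\nACGT\n+\nIIII\n")
    lane2.write_text("@read2\nTTGA\n+\nIIII\n")

    merged = concatenate_lanes([lane1, lane2], tmp / "2139_Stage 2_Fast.fastq")
    merged_text = merged.read_text()

print(merged_text)
```

Because the bytes are copied unchanged, the merged file should contain the same total number of reads as the lane files combined, which is a useful check on the Galaxy output as well.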