Chapter 5 More notes on multiple lane files

The output from Illumina sequencing is sometimes provided in multiple files, each corresponding to a ‘Lane’ on the sequencer. It is easier to ask the lab to provide the output as a single file, which can be generated using the --no-lane-splitting option of Illumina’s bcl2fastq program. However, multiple lane files can also be handled in Galaxy.

The following describes how to handle this manually. Please note that this is automated within the Step 1 Workflow.

If there are multiple files, it is best practice to run FastQC on each individual file, because one file could be corrupt or you may identify a bias specific to one ‘Lane’. If the files are OK, they can be concatenated before proceeding with all further steps. Two workflows will be used in Galaxy: the first is designed to work with each individual lane file and the second requires 1 file per sample. The lane files can be grouped by sample and then concatenated by following these steps:

‘Apply Rule to Collection’ tool
Input collection is the list with the re-labelled data
Press Edit button
There should be 1 column titled ‘A’ with the names of the files (the re-labelled names set previously)

Make new columns with grouping information:

Press +Column -> Using a regular expression
Select ‘Create column matching expression groups’
Paste the following code into the Regular Expression box: (.*?)_L(.*)
Set the number of groups to 2
Press Apply
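
To see what the two expression groups capture, the same pattern can be tried outside Galaxy. The following is a minimal Python sketch, not Galaxy's own code: the function name and the second example file name are hypothetical, and it assumes the rule builder's regular expression engine treats this pattern the same way Python's re module does.

```python
import re

# Same pattern as pasted into the Regular Expression box.
LANE_PATTERN = re.compile(r"(.*?)_L(.*)")


def split_sample_and_lane(name):
    """Return the two capture groups (column B, column C), or None if no match."""
    match = LANE_PATTERN.fullmatch(name)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Hypothetical re-labelled names from column A.
for name in ["2139_Stage 2_Fast_L001", "2139_Liver_L001"]:
    print(name, "->", split_sample_and_lane(name))
```

For 2139_Stage 2_Fast_L001 the groups are 2139_Stage 2_Fast (column B, the sample) and 001 (column C, the lane). Because the first group is non-greedy, the split happens at the first ‘_L’ in the name, so a name such as 2139_Liver_L001 would be split into 2139 and iver_L001. It is worth checking that the re-labelled sample names do not contain ‘_L’ themselves.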

Set columns as identifiers for grouping:

Press +Rules -> Add / Modify Column Definitions
Press +Add Definition -> List Identifiers
Select column B, then click on ‘Assign another column’ and select C
Press Apply
Press Save, then Execute the job

This outputs a nested list to the history. The number of items in the list should match the number of samples/animals, and each sample in the list should contain its individual lane files (e.g. 2 or 3 files).
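
The structure of the nested list can be pictured as a two-level grouping: samples (column B) on the outside, lanes (column C) on the inside. The sketch below is only an illustration in Python with made-up sample and file names; it is not how Galaxy stores collections.

```python
from collections import defaultdict


def build_nested_list(rows):
    """Group (sample, lane, filename) rows into {sample: {lane: filename}}."""
    nested = defaultdict(dict)
    for sample, lane, filename in rows:
        nested[sample][lane] = filename
    return dict(nested)


# Hypothetical rows: column B (sample), column C (lane), column A (file name).
rows = [
    ("2139_Stage 2_Fast", "001", "2139_Stage 2_Fast_L001"),
    ("2139_Stage 2_Fast", "002", "2139_Stage 2_Fast_L002"),
    ("2140_Stage 2_Fast", "001", "2140_Stage 2_Fast_L001"),
    ("2140_Stage 2_Fast", "002", "2140_Stage 2_Fast_L002"),
]

nested = build_nested_list(rows)

# The outer count should equal the number of samples/animals,
# and each inner count should equal the number of lane files for that sample.
for sample, lanes in nested.items():
    print(sample, "has", len(lanes), "lane files")
```

A sample with fewer lane files than the others is a sign that a file is missing or was re-labelled inconsistently.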

To then join these datasets together, use the ‘Collapse Collection’ tool. In the background it works the same way as using the ‘concatenate datasets tail-to-head’ tool to concatenate the files individually. Although this is not clear from the tool’s description, if it is given a nested list it collapses the lowest level of groups together; in this case that is Column C from above, the individual lanes. It should output a new list with the same number of files as the number of samples/animals. The names should be those defined in Column B above, e.g. 2139_Stage 2_Fast. The steps are:

Open Collapse Collection tool
Choose to use a dataset collection, not an individual file
Select the nested list from above
Execute with all other default settings
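
For readers who want to picture what this collapse amounts to, the following Python sketch concatenates lane files tail-to-head into one file per sample. It is an illustration under stated assumptions, not the Collapse Collection implementation: the function name, the {sample: {lane: path}} input shape, the .fastq output extension and the choice to join lanes in sorted lane order are all hypothetical.

```python
import shutil
from pathlib import Path


def collapse_lanes(nested, out_dir):
    """Concatenate each sample's lane files into one file per sample.

    nested is assumed to look like {sample: {lane: path_to_file}}.
    Lanes are joined in sorted lane order (an assumption of this sketch).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = {}
    for sample, lanes in nested.items():
        out_path = out_dir / f"{sample}.fastq"
        with open(out_path, "wb") as out_handle:
            for lane in sorted(lanes):
                # Copy bytes unchanged, appending each lane after the previous one.
                with open(lanes[lane], "rb") as in_handle:
                    shutil.copyfileobj(in_handle, out_handle)
        outputs[sample] = out_path
    return outputs
```

Calling collapse_lanes with two samples of two lanes each would write two output files, one per sample, which mirrors the check above that the collapsed list has as many files as there are samples/animals.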