Bioinformatics pipeline example from the bottom up

This tutorial is aimed at scientists and bioinformaticians who know how to work the command line and have heard about pipelines before but feel lost in the jungle of tools like Docker, Nextflow, Airflow, Reflow, Snakemake, etc.

In this post, we're going to strip away some of the complexity, take a simple bioinformatics workflow, and build a pipeline from the bottom up. The goal is to understand the pattern of taking some scripts written in a language like bash or python and turning them into a more streamlined (and perhaps automated) workflow.

We start by introducing the pipeline that we're going to build. In essence, it is a set of python scripts that take some data, do something with that data, and save the output somewhere else. The first step towards a minimal pipeline is writing a master shell script that runs all of these python scripts sequentially. We then use a Makefile to do the very same while explaining some of the advantages that come with it. Finally, we use Nextflow, a commonly used bioinformatics workflow tool, to wrap up our pipeline. If you feel adventurous, you can follow this tutorial on how to set up an AWS environment for Nextflow and then run your pipeline on it.

The workflow we're going to wrap in a pipeline looks like this:

  1. Take a set of .fasta protein files
  2. Split each into peptides using a variable number of missed cleavages
  3. Count the number of cysteines in total as well as the number of peptides that contain a cysteine
  4. Generate an output report containing this information in a .tsv file
  5. Create an archive to share with colleagues
An example output protein report

The first part of this tutorial is influenced by this post on how to create bioinformatics pipelines with Make. I won't go into as much depth to explain Makefiles themselves, so if this is the first time you're encountering a Makefile, I'd recommend going through the linked post first.

Setup

Go through the instructions below to install the required tools. I tried to keep the dependencies to a minimum.

Mac OS

# Add project to your path for this session.
export PATH="$PATH:$(pwd)"

# Open the terminal; Install utilities for homebrew
xcode-select --install

# Install homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install python3
# Follow this tutorial: https://opensource.com/article/19/5/python-3-default-mac

# Install make
brew install make

# Install git
brew install git

# Install matplotlib
pip3 install matplotlib

# Install Nextflow (https://www.nextflow.io/docs/latest/getstarted.html)
wget -qO- https://get.nextflow.io | bash
chmod +x nextflow
## Move Nextflow to a directory in your $PATH such as /usr/local/bin
mv nextflow /usr/local/bin/

Linux

# Install python3, git and make
sudo apt-get update
sudo apt-get install python3 git make

# Install matplotlib
sudo apt-get install python3-matplotlib

# Install Nextflow (https://www.nextflow.io/docs/latest/getstarted.html)
wget -qO- https://get.nextflow.io | bash
chmod +x nextflow
## Move Nextflow to a directory in your $PATH such as /usr/local/bin
mv nextflow /usr/local/bin/

Introduction

In this section, we'll go through the basic intuition of what a pipeline is and why we need one. To walk through this from the ground up I chose a basic example. We have a bunch of proteins in .fasta files and want to create a report of how many cysteines each contains after it has been digested into peptides.

I created a Github repository which we'll be working with. To start off, fork it, clone it locally, and check out the branch part_00.  

# Fork and clone the repository and switch to branch part_00
git clone https://github.com/<your_git_username>/bioinformatics-pipeline-tutorial.git
cd bioinformatics-pipeline-tutorial/
git checkout part_00

Open the project in your favorite code editor to check out the directory structure. We have two folders: bin/ contains the python scripts we'll use throughout this tutorial to transform our files and fasta/ contains a set of protein .fasta files that we'll use (I went with the four Yamanaka factors but feel free to drop in whatever your favorite protein is).
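
If you're unsure what to expect, the layout should look roughly like this (a sketch based on the file names used later in this tutorial, not an exhaustive listing of the repo):

bioinformatics-pipeline-tutorial/
├── bin/
│   ├── 01_digest_protein.py
│   ├── 02_count_amino_acids.py
│   ├── 03a_plot_count.py
│   └── 03b_get_report.py
└── fasta/
    ├── KLF4.fasta
    ├── MYC.fasta
    ├── PO5F1.fasta
    └── SOX2.fasta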

Make all the scripts in bin executable by running the following command:

chmod +x bin/01_digest_protein.py bin/02_count_amino_acids.py bin/03a_plot_count.py bin/03b_get_report.py

Let's walk through the steps manually using KLF4 as our protein. First, we need to digest our protein into peptides. This is what the prepared script 01_digest_protein.py does. Feel free to open up the file and check it out. The required flags for the script are an input .fasta file and an output file path. The optional flags have default values but feel free to play around with them. For example, we can change the number of missed cleavages by appending --missed_cleavages=1 to our command. To digest our protein, run:

mkdir data/
bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt

We now have the file KLF4.peptides.txt in the data/ directory, which should contain all the peptides of KLF4 after it was digested with trypsin (you can change the digestion enzyme by passing the --enzyme_regex flag).

Next up, we want to count the total # of amino acids in KLF4, the # of cysteines, the # of peptides and how many of them contain a cysteine. To do this, we run:

bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv

We can use the --amino_acid flag to change which amino acid is counted (it defaults to cysteine, C).
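
For example, to allow one missed cleavage during digestion or to count methionines instead of cysteines (the flag names are the ones described above; the specific values here are just illustrative):

# Allow one missed cleavage when digesting
bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt --missed_cleavages=1

# Count methionines (M) instead of cysteines
bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv --amino_acid=M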

We're halfway there. Now we want to a) plot each output count file as a bar plot (see below) and b) create an output report summarizing the counts for multiple proteins.

Bar plots showing the number of cysteines in peptides and amino acids

To get the output plot we run:

# Just show
bin/03a_plot_count.py data/KLF4.count.tsv show

# Save fig
bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png

Now we run all the same steps for MYC and generate an output report for the two proteins.

# Digest
bin/01_digest_protein.py fasta/MYC.fasta data/MYC.peptides.txt

# Count
bin/02_count_amino_acids.py fasta/MYC.fasta data/MYC.peptides.txt data/MYC.count.tsv

# Plot
bin/03a_plot_count.py data/MYC.count.tsv data/MYC.plot.png

# Generate Report for KLF4 and MYC
bin/03b_get_report.py data/KLF4.count.tsv data/MYC.count.tsv --output_file=data/protein_report.tsv

Lastly, create an archive of the resulting output files.

# Create a results/ folder and archive it for sharing
mkdir results
cp data/*plot.png data/protein_report.tsv results/
tar -czf data/results.tgz results
rm -r results

Together these scripts implement a common workflow:

  1. Digest protein(s)
  2. Count occurrences of amino acid in protein(s)
  3. Plot results
  4. Generate a report with the results
  5. Archive the plots and report

Instead of running each of the commands manually, as above, we can create a master script that runs the whole pipeline from start to finish. Our run_pipeline.sh looks like this:

#!/usr/bin/env bash
# USAGE: bash run_pipeline.sh

mkdir -p data

# 01. Digest
bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt
bin/01_digest_protein.py fasta/MYC.fasta data/MYC.peptides.txt
bin/01_digest_protein.py fasta/PO5F1.fasta data/PO5F1.peptides.txt
bin/01_digest_protein.py fasta/SOX2.fasta data/SOX2.peptides.txt

# 02. Count
bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv
bin/02_count_amino_acids.py fasta/MYC.fasta data/MYC.peptides.txt data/MYC.count.tsv
bin/02_count_amino_acids.py fasta/PO5F1.fasta data/PO5F1.peptides.txt data/PO5F1.count.tsv
bin/02_count_amino_acids.py fasta/SOX2.fasta data/SOX2.peptides.txt data/SOX2.count.tsv

# 03a. Plot
bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png
bin/03a_plot_count.py data/MYC.count.tsv data/MYC.plot.png
bin/03a_plot_count.py data/PO5F1.count.tsv data/PO5F1.plot.png
bin/03a_plot_count.py data/SOX2.count.tsv data/SOX2.plot.png

# 03b. Generate Report
bin/03b_get_report.py data/KLF4.count.tsv \
					  data/MYC.count.tsv \
					  data/PO5F1.count.tsv \
					  data/SOX2.count.tsv \
					  --output_file=data/protein_report.tsv

# 04. Archive the results in a tarball so we can share them with a colleague
rm -rf results
mkdir results
cp data/*plot.png data/protein_report.tsv results/
tar -czf data/results.tgz results
rm -r results

Now we have a reproducible pipeline that we can easily run by calling:

bash run_pipeline.sh

We can also share it with colleagues, and we have some assurance that it will behave in exactly the same manner when we rerun it (and we don't have to worry about typos from typing commands manually).

If you're following along using your own GitHub repository, this is a good time to take a step back and commit your results.

git init
git add .
git commit -m "Finished setup"
git push

Let's also clean up the data folder for now, as we'll regenerate the files again in the next step:

rm data/*

Makefile

If you're starting at this point, please check out branch part_01 from the GitHub repo.
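
If you're picking up from here, that's simply:

git checkout part_01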

Now, let's say we wanted to use pie charts instead of bar plots. We could just go into 03a_plot_count.py and change plt.bar to plt.pie, right?

Sure, but then we'd have to rerun the entire pipeline even though the earlier steps didn't change at all. With only four files that wouldn't be a big deal, but imagine running this on the whole human proteome .fasta or on much bigger files. Alas, our current pipeline is not ideal.

As I mentioned, this post gives a much deeper overview of how to create Makefiles for bioinformatics workflows. I'm only covering the basics needed for our little tutorial here. There's also a great cheat sheet here if you get stuck on some commands.

Make is a computer program originally designed to automate the compilation and installation of software. Make automates the process of building target files through a series of discrete steps. Despite its original purpose, this design makes it a great fit for bioinformatics pipelines, which often work by transforming data from one form to another (e.g. raw data → word counts → ??? → profit).
Source: http://byronjsmith.com/make-bml/

Let's start by creating a Makefile and porting our first step into it.

touch Makefile

Use the text editor of your choice to add to the Makefile. The simplest possible Makefile recipe is this:

targets: prerequisites
	command

We want to create the file KLF4.peptides.txt in the data/ folder (the target) using the bin/01_digest_protein.py script (the command), as before. Our input file is fasta/KLF4.fasta (the prerequisite). The result looks like this:

data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt

The targets are what we create as a result of executing the commands. Now, run:

make data/KLF4.peptides.txt

Once this has run, you should have the KLF4.peptides.txt file in the data/ folder. Make will only rerun the command if the prerequisites have been modified since the target was created.

Try running make data/KLF4.peptides.txt again. You should get the following message, telling you that the prerequisites have not changed and therefore the target won't be different if you run it again:

$ make data/KLF4.peptides.txt
make: 'data/KLF4.peptides.txt' is up to date.

We can get around this by changing the modification time of fasta/KLF4.fasta, which restores the original behavior.

touch fasta/KLF4.fasta
make data/KLF4.peptides.txt

Let's add the second step, the counting step. As a reminder: we want to create the file KLF4.count.tsv in the data/ folder (the target) using the bin/02_count_amino_acids.py script (the command). Our input files are fasta/KLF4.fasta and data/KLF4.peptides.txt (the prerequisites). The resulting Makefile looks like this:

data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt

data/KLF4.count.tsv: fasta/KLF4.fasta data/KLF4.peptides.txt
	bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv

Now, try to add the plotting command (creates the file data/KLF4.plot.png) and the report (data/protein_report.tsv) yourself.

Here is the solution.

data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt

data/KLF4.count.tsv: fasta/KLF4.fasta data/KLF4.peptides.txt
	bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv

data/KLF4.plot.png: data/KLF4.count.tsv
	bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png

data/protein_report.tsv: data/KLF4.count.tsv
	bin/03b_get_report.py data/KLF4.count.tsv --output_file=data/protein_report.tsv

data/results.tgz: data/KLF4.plot.png data/protein_report.tsv
	rm -rf results
	mkdir results
	cp data/KLF4.plot.png data/protein_report.tsv results/
	tar -czf data/results.tgz results
	rm -r results

Let's remove all the files from the data/ subdirectory and run Make.

rm data/*
make data/results.tgz

You'll notice that Make executes every single command in the Makefile.

$ make data/results.tgz
bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt
bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv
bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png
bin/03b_get_report.py data/KLF4.count.tsv --output_file=data/protein_report.tsv
rm -rf results
mkdir results
cp data/KLF4.plot.png data/protein_report.tsv results/
tar -czf data/results.tgz results
rm -r results

Why is that? Makefiles work in a pull-based fashion. This means that the workflow is invoked by asking for a specific output file, after which all tasks required to reproduce that file are executed. We can visualize this by looking at the dependency graph. To generate it, we use makefile2graph and call:

make -Bnd data/results.tgz | make2graph | dot -Tpng -o out.png

We call make with our target data/results.tgz; to create it, Make first needs to create data/KLF4.plot.png and data/protein_report.tsv, which in turn need data/KLF4.count.tsv, and so on. That's why it generates all the files in one go.

The dependency graph of our Makefile

To just see which commands would be executed, we can use the --dry-run flag or its short form -n.

$ make --dry-run data/results.tgz
bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt
bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv
bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png
bin/03b_get_report.py data/KLF4.count.tsv --output_file=data/protein_report.tsv
rm -rf results
mkdir results
cp data/KLF4.plot.png data/protein_report.tsv results/
tar -czf data/results.tgz results
rm -r results

We can put an all target at the very top of our Makefile. This is good practice because the topmost recipe is the one built by default when you call make without arguments. Add the following to the top of your Makefile:

all: data/results.tgz

[...]

Another common target is clean:. Let's add the following below the all: target in our Makefile:

clean:
	rm -rf data/*

We can now create all our files by calling make all and clean the data/ folder by calling make clean.
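
In other words:

# Build the archive and everything it depends on
make all

# Delete everything in data/ and start fresh
make clean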

We have to tell Make that all and clean always refer to the targets in our Makefile and never to actual files, so we also add this to our Makefile:

.PHONY: all clean

Our Makefile should now look like this:

# Dummy targets
all: data/results.tgz

clean:
	rm -rf data/*

.PHONY: all clean

# Analysis and plotting
data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt

data/KLF4.count.tsv: fasta/KLF4.fasta data/KLF4.peptides.txt
	bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv

data/KLF4.plot.png: data/KLF4.count.tsv
	bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png

data/protein_report.tsv: data/KLF4.count.tsv
	bin/03b_get_report.py data/KLF4.count.tsv --output_file=data/protein_report.tsv

data/results.tgz: data/KLF4.plot.png data/protein_report.tsv
	rm -rf results
	mkdir results
	cp data/KLF4.plot.png data/protein_report.tsv results/
	tar -czf data/results.tgz results
	rm -r results

You might have noticed that there is a fair amount of repetition in each of the recipes. Let's take the first one and simplify it:

data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt

In Makefiles, the automatic variables $^ and $@ refer to the prerequisites and the target of a rule, so we can rewrite the above as:

data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py $^ $@

In fact, to make sure that our python script is also considered a prerequisite (so the recipe is rerun whenever the script itself is updated), we change it like this:

data/KLF4.peptides.txt: bin/01_digest_protein.py fasta/KLF4.fasta
	$^ $@

After applying these transformations, our Makefile should look like this:

# Dummy targets
all: data/results.tgz

clean:
	rm -rf data/*

.PHONY: all clean

# Analysis and plotting
data/KLF4.peptides.txt: bin/01_digest_protein.py fasta/KLF4.fasta
	$^ $@

data/KLF4.count.tsv: bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt
	$^ $@

data/KLF4.plot.png: bin/03a_plot_count.py data/KLF4.count.tsv
	$^ $@

data/protein_report.tsv: bin/03b_get_report.py data/KLF4.count.tsv
	$^ --output_file=$@

# Archive for sharing
data/results.tgz: data/KLF4.plot.png data/protein_report.tsv
	rm -rf results
	mkdir results
	cp $^ results/
	tar -czf $@ results
	rm -r results

Now the current Makefile only creates the KLF4 protein files. Let's add the other proteins, starting with MYC.

# Analysis and plotting
data/KLF4.peptides.txt: bin/01_digest_protein.py fasta/KLF4.fasta
	$^ $@
    
data/MYC.peptides.txt: bin/01_digest_protein.py fasta/MYC.fasta
	$^ $@

As you probably noticed, that would be a lot of repetition. Instead, we can use pattern rules to abstract the individual protein names away:

# Analysis and plotting
data/%.peptides.txt: bin/01_digest_protein.py fasta/%.fasta
	$^ $@

If we had gone ahead and blindly applied this to the last two rules, we would have gotten the following error:

make: *** No rule to make target `data/%.count.tsv', needed by `data/%.plot.png'. Stop.

Why is that? Remember how Makefiles use a "pull-based" scheduling strategy? If we used the % wildcard everywhere, we'd never actually tell Make what the wildcard stands for. At some point, we need to define concrete target files that Make can use to fill in the wildcards for the rest.

We insert two variables at the top and use them for the last two rules. Voilà.

COUNT := data/KLF4.count.tsv data/MYC.count.tsv \
			data/PO5F1.count.tsv data/SOX2.count.tsv
PLOT := data/KLF4.plot.png data/MYC.plot.png \
			data/PO5F1.plot.png data/SOX2.plot.png

[...]

data/protein_report.tsv: bin/03b_get_report.py ${COUNT}
	$^ --output_file=$@

# Archive for sharing
data/results.tgz: ${PLOT} data/protein_report.tsv
	rm -rf results
	mkdir results
	cp $^ results/
	tar -czf $@ results
	rm -r results

Now we're back on track. Let's run it:

make

You might have noticed that the pipeline took a little bit longer to process the four proteins because it ran everything sequentially. You can use the --jobs or -j flag to run Make in parallel. That'll give us a nice speedup.
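
For example (four jobs is an arbitrary choice; pick a number that suits your machine):

# Start from a clean slate and let Make run up to four recipes at a time
make clean
make -j 4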

Let's check out the dependency graph at this point.

Your final Makefile should look like this:

COUNT := data/KLF4.count.tsv data/MYC.count.tsv \
			data/PO5F1.count.tsv data/SOX2.count.tsv
PLOT := data/KLF4.plot.png data/MYC.plot.png \
			data/PO5F1.plot.png data/SOX2.plot.png

# Dummy targets
all: data/results.tgz

clean:
	rm -rf data/*

.PHONY: all clean

# Analysis and plotting
data/%.peptides.txt: bin/01_digest_protein.py fasta/%.fasta
	$^ $@

data/%.count.tsv: bin/02_count_amino_acids.py fasta/%.fasta data/%.peptides.txt
	$^ $@

data/%.plot.png: bin/03a_plot_count.py data/%.count.tsv
	$^ $@

data/protein_report.tsv: bin/03b_get_report.py ${COUNT}
	$^ --output_file=$@

# Archive for sharing
data/results.tgz: ${PLOT} data/protein_report.tsv
	rm -rf results
	mkdir results
	cp $^ results/
	tar -czf $@ results
	rm -r results

Nextflow

If you're starting at this point, please check out branch part_02 from the GitHub repo.

We're now going to switch gears and turn to Nextflow. Makefiles are awesome, but they're limited. For larger pipelines, the biggest limitation is that we can't easily scale them horizontally: everything has to run on the same machine. Many of the steps we implemented earlier could easily run on multiple machines in parallel, so let's see how Nextflow helps with that. There are many tools like Nextflow, including but not limited to Airflow, Reflow, and Snakemake. They all have their advantages and disadvantages, but I chose Nextflow for this tutorial (and for our work at talus.bio) because of its flexibility and its popularity in the bioinformatics community.

Start by creating a file called main.nf, which is the commonly used name for a Nextflow script. To keep things simple, we add only the digestProtein step for now.

#!/usr/bin/env nextflow

// Run workflow for all .fasta files in the fasta directory
fasta = Channel.fromPath("$baseDir/fasta/*.fasta")


// Helper function to extract the protein name from the filename
def getProtein(fileName) {
  fileName.getBaseName().tokenize(".")[0]
}


// Digest a protein and save the peptides
process digestProtein {  
  input:
    path input_fasta from fasta

  output:
    path "*.txt" into peptides

  script:
    def protein = getProtein(input_fasta)
    """
    01_digest_protein.py ${input_fasta} ${protein}.peptides.txt
    """
}

peptides.view()

We also create a small nextflow.config file that includes only the publishDir directive for now. Nextflow does all its work inside the work folder, and if we want our processes to export data to a different location we have to specify that with the publishDir directive. Here we use our data/ folder again.

process {
	publishDir = [path: "data/", mode: "copy"]
}

One of the main functional differences between Nextflow and Make is that Makefiles are pull-based while Nextflow is push-based. With the Makefile, we had to specify the output files we wanted to generate, and Make figured out which parts it had to execute to produce them. Here we take all the .fasta files from our fasta/ folder and "push" them into the pipeline.
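
To make the contrast concrete, here are the two invocations side by side:

# Pull: ask Make for a specific output; it works backwards through the prerequisites
make data/results.tgz

# Push: hand Nextflow the inputs (the .fasta files); it pushes them forward through the processes
nextflow run main.nf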

Nextflow uses the Groovy programming language, which is based on Java. This makes it a lot more flexible than bash. There are two main concepts: processes and channels. Processes are similar to the rules in a Makefile: we specify inputs and outputs as well as a script that determines how to generate the output from the input. Channels connect processes by carrying files (or values) from one process to the next.

Run the pipeline with nextflow run main.nf and you'll see that it runs the process digestProtein four times (once for each fasta).
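
One aside that this tutorial doesn't cover further: Nextflow caches completed tasks in its work folder, and the -resume flag reuses those cached results on subsequent runs, which is roughly the incremental behavior we got from Make's timestamps.

# Rerun the pipeline, skipping any task whose inputs haven't changed
nextflow run main.nf -resume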

We'll now add the other processes step by step. When adding the countAA process, we notice that it also takes the fasta channel as an input. Let's try to run it without changing anything.

#!/usr/bin/env nextflow

// Run workflow for all .fasta files in the fasta directory
fasta = Channel.fromPath("$baseDir/fasta/*.fasta")


// Helper function to extract the protein name from the filename
def getProtein(fileName) {
  fileName.getBaseName().tokenize(".")[0]
}


// Digest a protein and save the peptides
process digestProtein {
  input:
    path input_fasta from fasta

  output:
    path "*.txt" into peptides

  script:
    def protein = getProtein(input_fasta)
    """
    01_digest_protein.py ${input_fasta} ${protein}.peptides.txt
    """
}


// Count the number of times a given amino acid appears in a protein as well 
// as its peptides after digestion
process countAA {  
  input:
    path input_peptides from peptides
    path input_fasta from fasta

  output:
    path "*.tsv" into aa_count

  script:
    def protein = getProtein(input_peptides)
    """
    02_count_amino_acids.py ${input_fasta} ${input_peptides} ${protein}.count.tsv
    """
}

You should've gotten the following error:

$ nextflow run main.nf
N E X T F L O W  ~  version 21.09.0-edge
Launching `main.nf` [clever_solvay] - revision: 84edb4b9a9
Channel `fasta` has been used twice as an input by process `countAA` and process `digestProtein`

 -- Check script 'main.nf' at line: 31 or see '.nextflow.log' file for more details
[-        ] process > digestProtein -

To avoid this error we split our fasta channel into two.

[...]

// Run workflow for all .fasta files in the fasta directory
fasta = Channel.fromPath("$baseDir/fasta/*.fasta")
fasta.into { 
  fasta_a
  fasta_b 
}

[...]

process digestProtein {
  input:
    path input_fasta from fasta_a
[...]

process countAA {  
  input:
    path input_peptides from peptides
    path input_fasta from fasta_b
[...]

We also know that the output from countAA goes into both plotCount and generateReport, so we use the same trick as with the fasta channel. Our main.nf file should now look like this. Note that we use .collect() in both generateReport and archiveResults. By default, Nextflow would run these processes once per item in the channel; here we deliberately avoid that behavior because these processes need all the files at once.

#!/usr/bin/env nextflow

// Run workflow for all .fasta files in the fasta directory
fasta = Channel.fromPath("$baseDir/fasta/*.fasta")
fasta.into { 
  fasta_a
  fasta_b 
}


// Helper function to extract the protein name from the filename
def getProtein(fileName) {
  fileName.getBaseName().tokenize(".")[0]
}


// Digest a protein and save the peptides
process digestProtein {
  input:
    path input_fasta from fasta_a

  output:
    path "*.txt" into peptides

  script:
    def protein = getProtein(input_fasta)
    """
    01_digest_protein.py ${input_fasta} ${protein}.peptides.txt
    """
}


// Count the number of times a given amino acid appears in a protein as well 
// as its peptides after digestion
process countAA {  
  input:
    path input_peptides from peptides
    path input_fasta from fasta_b

  output:
    path "*.tsv" into aa_count_a, aa_count_b

  script:
    def protein = getProtein(input_peptides)
    """
    02_count_amino_acids.py ${input_fasta} ${input_peptides} ${protein}.count.tsv
    """
}


// Load the calculated counts and create a plot
process plotCount {  
  input:
    path input_count from aa_count_a

  output:
    path "*.png" into count_plot

  script:
    def protein = getProtein(input_count)
    """
    03a_plot_count.py ${input_count} ${protein}.plot.png
    """
}


// Get a list of input files from a given folder and create a report
process generateReport {  
  input:
    path input_count from aa_count_b.collect()

  output:
    path "*.tsv" into protein_report

  script:
    """
    03b_get_report.py ${input_count} --output_file=protein_report.tsv
    """
}


// Gather result files and archive them
process archiveResults {  
  input:
    path input_plot from count_plot.collect()
    path input_report from protein_report

  output:
    path "*.tgz" into archive_results

  script:
    """
    mkdir results
    cp ${input_plot} ${input_report} results/
    tar -czf results.tgz results
    """
}

So far we've been using the "old" way of writing pipelines in Nextflow. I wrote the pipeline this way on purpose, to showcase the difference between push-based and pull-based execution. It's still a legitimate way of writing pipelines, but Nextflow has recently released a new DSL (version 2) which makes the whole process more flexible and IMO a bit more elegant. Instead of having to think about how to connect processes, we treat them more like functions that take an input and an output and connect them via a workflow block. Let's see what that looks like. Copy main.nf to main_old.nf to keep the old version around, then edit main.nf in place.

cp main.nf main_old.nf

We start by enabling the new DSL at the top of our file.

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

[...]

Then we remove all the from and into directives from our processes and add the following workflow block at the bottom.

[...]

workflow {
  // Run workflow for all .fasta files in the fasta directory
  fasta = Channel.fromPath("$baseDir/fasta/*.fasta")
  peptides = digestProtein(fasta)
  aa_count = countAA(peptides, fasta)
  count_plot = plotCount(aa_count)
  protein_report = generateReport(aa_count | collect)
  archive_results = archiveResults(count_plot | collect, protein_report)
}

As a last check, your main.nf should now look like this:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2


// Helper function to extract the protein name from the filename
def getProtein(fileName) {
  fileName.getBaseName().tokenize(".")[0]
}


// Digest a protein and save the peptides
process digestProtein {
  input:
    path input_fasta

  output:
    path "*.txt"

  script:
    def protein = getProtein(input_fasta)
    """
    01_digest_protein.py ${input_fasta} ${protein}.peptides.txt
    """
}


// Count the number of times a given amino acid appears in a protein as well 
// as its peptides after digestion
process countAA {  
  input:
    path input_peptides
    path input_fasta

  output:
    path "*.tsv"

  script:
    def protein = getProtein(input_peptides)
    """
    02_count_amino_acids.py ${input_fasta} ${input_peptides} ${protein}.count.tsv
    """
}


// Load the calculated counts and create a plot
process plotCount {  
  input:
    path input_count

  output:
    path "*.png" 

  script:
    def protein = getProtein(input_count)
    """
    03a_plot_count.py ${input_count} ${protein}.plot.png
    """
}


// Get a list of input files from a given folder and create a report
process generateReport {  
  input:
    path input_count

  output:
    path "*.tsv"

  script:
    """
    03b_get_report.py ${input_count} --output_file=protein_report.tsv
    """
}


// Gather result files and archive them
process archiveResults {  
  input:
    path input_plot
    path input_report

  output:
    path "*.tgz"

  script:
    """
    mkdir results
    cp ${input_plot} ${input_report} results/
    tar -czf results.tgz results
    """
}


workflow {
  // Run workflow for all .fasta files in the fasta directory
  fasta = Channel.fromPath("$baseDir/fasta/*.fasta")
  peptides = digestProtein(fasta)
  aa_count = countAA(peptides, fasta)
  count_plot = plotCount(aa_count)
  protein_report = generateReport(aa_count | collect)
  archive_results = archiveResults(count_plot | collect, protein_report)
}

Conclusion

That's a wrap! We created a fully functional pipeline from the bottom up, covering shell scripts, Makefiles, and Nextflow, as well as the two main types of execution: push-based and pull-based. We've seen the benefits that modern tools like Nextflow have over more traditional approaches like scripts and Makefiles. Hopefully, this tutorial provided a solid baseline for what a pipeline is and how to write one from scratch (while climbing up the ladder of complexity).

For follow-up questions or feedback on this article, you can submit an issue through the accompanying GitHub repository or reach me on Twitter.

Optional: Nextflow in the Cloud ☁️

If you're starting at this point and want to follow along with this part, please check out branch part_03 from the GitHub repo. To see the final state of the repo, check out branch part_04.

If you have followed the tutorial I mentioned in the introduction and set up your AWS environment for Nextflow, there's one more thing worth covering. The biggest advantage of Nextflow over shell scripts or Makefiles is its ability to scale into the cloud. To do that, we don't need to change anything in the main.nf file itself. We only need to:

  1. Set up our AWS environment
  2. Add another executor

The executor we've been using implicitly so far is the local one; in nextflow.config it looks like this:

profiles {
	standard {
		process.executor = "local"
	}
}

[...]

We now add the Nextflow plugin for AWS, our region of choice, and a role to operate with. We also add an executor profile for AWS Batch. You can either build the Docker image I used from the Dockerfile in the repository or use the one I published, called rmeinl/python-plt.
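
If you'd rather build the image yourself than pull mine, something like this should work (assuming the Dockerfile sits at the root of the repository; adjust the tag if you push to your own registry):

# Pull the published image...
docker pull rmeinl/python-plt:latest

# ...or build it locally from the repository's Dockerfile
docker build -t rmeinl/python-plt:latest .
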
The nextflow.config file should then look like this:

// Profiles
profiles {
	standard {
		process.executor = "local"
	}
	cloud {
		process {
			executor = "awsbatch"
			queue = "terraform-nextflow-medium-size-spot-batch-job-queue"
			container = "rmeinl/python-plt:latest"
			errorStrategy = "retry"
			maxRetries = 3
		}
	}
}

// Process
process {
	publishDir = [path: "data/", mode: "copy"]
}

// Plugins
plugins {
    id "nf-amazon"
}

// AWS Setup
aws {
    region = "us-west-2"
    batch {
    	cliPath = "/home/ec2-user/bin/aws"
        jobRole = "arn:aws:iam::622568582929:role/terraform-nextflow-batch-job-role"
    }
}

In order to run this whole workflow in the cloud we call:

nextflow run main.nf -profile cloud

You should now see this message indicating success:

$ nextflow run main.nf -profile cloud
N E X T F L O W  ~  version 21.09.0-edge
Launching `main.nf` [dreamy_murdock] - revision: 7be483af55
Uploading local `bin` scripts folder to s3://terraform-nextflow-work-bucket/tmp/f4/43104ae6c68d4b50070806e54e391a/bin
executor >  awsbatch (14)
[90/eabf4a] process > digestProtein (3) [100%] 4 of 4 ✔
[77/fec491] process > countAA (4)       [100%] 4 of 4 ✔
[95/e4ea25] process > plotCount (4)     [100%] 4 of 4 ✔
[e4/a2dff2] process > generateReport    [100%] 1 of 1 ✔
[4a/01e553] process > archiveResults    [100%] 1 of 1 ✔
Completed at: 09-Dec-2021 19:36:18
Duration    : 4m 33s
CPU hours   : (a few seconds)
Succeeded   : 14