<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[ricomnl]]></title><description><![CDATA[machine learning / bioelectricity / longevity]]></description><link>https://ricomnl.com/</link><image><url>https://ricomnl.com/favicon.png</url><title>ricomnl</title><link>https://ricomnl.com/</link></image><generator>Ghost 4.36</generator><lastBuildDate>Wed, 15 Apr 2026 13:18:26 GMT</lastBuildDate><atom:link href="https://ricomnl.com/blog/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[[Extension: redun] Bioinformatics pipelines from the bottom up]]></title><description><![CDATA[This is the first part of a series of extensions that I will add to my previous post on Bioinformatics pipelines from the bottom up.  Learn about the core features of redun by using it to reimplement a toy bioinformatics workflow.]]></description><link>https://ricomnl.com/blog/bottom-up-bioinformatics-pipeline-extension-redun/</link><guid isPermaLink="false">626fb68c3eb66e1c2428e4b8</guid><category><![CDATA[bioinformatics]]></category><category><![CDATA[pipelines]]></category><category><![CDATA[redun]]></category><category><![CDATA[insitro]]></category><category><![CDATA[biotech]]></category><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Wed, 11 May 2022 21:02:05 GMT</pubDate><media:content url="https://ricomnl.com/content/images/2022/05/sigmund-4CNNH2KEjhc-unsplash.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><aside class="toc"></aside><!--kg-card-end: html--><img src="https://ricomnl.com/content/images/2022/05/sigmund-4CNNH2KEjhc-unsplash.png" alt="[Extension: redun] Bioinformatics pipelines from the bottom up"><p>This is the first part of a series of extensions that I will add to my previous post on <a 
href="https://ricomnl.com/blog/bottom-up-bioinformatics-pipeline/">Bioinformatics pipelines from the bottom up</a>. To be able to follow along, I&apos;d recommend skimming over the other tutorial at least up until the part where we start with Makefiles. Here&apos;s what we&apos;ll cover in this tutorial:</p><ul><li>Learn about the core features of redun by using it to reimplement a toy bioinformatics workflow</li><li>Run redun workflows on AWS Batch</li><li>Import submodules via pip</li><li>Emulate Makefile behavior in redun with a custom DSL</li></ul><h1 id="motivation">Motivation</h1><p>Simple workflows are a great way to quickly get an in-depth look into the core features and advantages of new tools. The toy workflow we implemented in the first post (and reimplement in this one) consists of the following steps:</p><ol><li>Take a set of .fasta protein files</li><li>Split each into peptides using a variable number of missed cleavages</li><li>Count the number of cysteines in total as well as the number of peptides that contain a cysteine</li><li>Generate an output report containing this information in a .tsv file</li><li>Create an archive to share with colleagues</li></ol><p>In the last post, I covered vanilla bash, Makefiles, and Nextflow as three modes of execution for bioinformatics workflows. Given the size and scale of modern workflows, the former two are rarely a viable option anymore, and Nextflow is just one example of a toolchain that enables developers to run their pipelines at scale in the cloud. There are <a href="https://github.com/pditommaso/awesome-pipeline">a lot</a> of others.</p><p>For my own work, Python is the main workhorse for all of my data processing and analysis code, so naturally I&apos;m drawn towards something that can natively integrate with it. 
The most natural integration happens when the toolchain itself is written in Python and I can just annotate my functions with something like a <code>@task</code> operator to be able to chain them into a workflow. A couple of frameworks come to mind here:</p><ul><li><a href="https://metaflow.org/">Metaflow</a></li><li><a href="https://github.com/insitro/redun">Redun</a> (covered in this blog post)</li><li><a href="https://docs.dagster.io/getting-started">Dagster</a></li><li><a href="https://www.prefect.io/">Prefect</a></li><li><a href="https://docs.latch.bio/">Latch SDK</a></li></ul><p>In this post, we&apos;ll cover redun, a tool written by the data engineering team at <a href="https://insitro.com/">Insitro</a> which was open-sourced in November 2021. The <a href="https://github.com/insitro/redun">Github repo</a> contains the following description:</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text"><strong>redun</strong> aims to be a more expressive and efficient workflow framework, built on top of the popular Python programming language. It takes the somewhat contrarian view that writing dataflows directly is unnecessarily restrictive, and by doing so we lose abstractions we have come to rely on in most modern high-level languages (control flow, composability, recursion, high order functions, etc). redun&apos;s key insight is that workflows can be expressed as <a href="https://github.com/insitro/redun#whats-the-trick">lazy expressions</a>, which are then evaluated by a scheduler that performs automatic parallelization, caching, and data provenance logging.</div></div><p>Redun introduces a bunch of interesting features and, in my opinion, it is one of the first workflow tools out there that really nailed it. 
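</p><p>To build an intuition for the &quot;lazy expressions&quot; idea, here is a self-contained toy sketch (emphatically <em>not</em> redun&apos;s actual implementation): calls to decorated functions record an expression graph instead of running anything, and a tiny &quot;scheduler&quot; evaluates that graph afterwards.</p><pre><code class="language-python"># toy_lazy.py -- a toy illustration of the idea, not redun itself
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class Expr:
    fn: Callable
    args: Tuple[Any, ...]


def task(fn):
    # Calling a task records the call instead of executing it.
    def wrapper(*args):
        return Expr(fn, args)
    return wrapper


def evaluate(value):
    # A minimal scheduler: evaluate nested expressions bottom-up.
    # (redun additionally caches results and parallelizes here.)
    if isinstance(value, Expr):
        return value.fn(*[evaluate(arg) for arg in value.args])
    return value


@task
def double(x):
    return 2 * x


@task
def add(x, y):
    return x + y


# Nothing runs yet; this only builds an expression graph ...
workflow = add(double(3), double(4))
# ... which the scheduler then evaluates: 2*3 + 2*4 = 14
result = evaluate(workflow)</code></pre><p>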
I highly recommend checking out its very well-written <a href="https://insitro.github.io/redun/design.html">design document</a> as well as reading through the <a href="https://github.com/insitro/redun/blob/main/examples/README.md">first 4 tutorials</a>.</p><h1 id="setup">Setup</h1><p>We start by cloning the existing <a href="https://github.com/ricomnl/bioinformatics-pipeline-tutorial">GitHub repository</a> and use <code>part_00</code> as our starting point:</p><pre><code class="language-bash"># Fork and clone repository and switch to branch part_00
git clone https://github.com/&lt;your_git_username&gt;/bioinformatics-pipeline-tutorial.git
cd bioinformatics-pipeline-tutorial/
git checkout part_00</code></pre><p>The structure of the repository will look like this:</p><pre><code class="language-bash">$ tree
.
&#x251C;&#x2500;&#x2500; README.md
&#x251C;&#x2500;&#x2500; bin
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; 01_digest_protein.py
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; 02_count_amino_acids.py
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; 03a_plot_count.py
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; 03b_get_report.py
&#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; __init__.py
&#x2514;&#x2500;&#x2500; fasta
    &#x251C;&#x2500;&#x2500; KLF4.fasta
    &#x251C;&#x2500;&#x2500; MYC.fasta
    &#x251C;&#x2500;&#x2500; PO5F1.fasta
    &#x2514;&#x2500;&#x2500; SOX2.fasta

2 directories, 10 files</code></pre><p>In order to set up our environment, let&apos;s add a <code>requirements.txt</code> file with the following content and run <code>pip install -r requirements.txt</code> (feel free to use virtualenv, conda, or poetry to set up a virtual environment).</p><pre><code class="language-txt">redun
plotly
kaleido</code></pre><p>First, we create a <code>data/</code> directory for our workflow output data and touch a <code>workflow.py</code> file which we&apos;ll use to write our redun workflow using the existing code from the first tutorial in the <code>bin/</code> folder as a reference.</p><pre><code class="language-bash">mkdir -p data
touch workflow.py</code></pre><h1 id="porting-the-workflow">Porting the workflow</h1><p>A lot of bioinformatics workflows rely on programs that are shipped as binaries and are executed through the command line. Redun supports <a href="https://insitro.github.io/redun/design.html#script-tasks">Script tasks</a> as first-class citizens and we&apos;ll initially explore how to use these to write our workflow. The first step in our pipeline is the script in <code>bin/01_digest_protein.py</code>. In order to call it, let&apos;s start by adding some code to <code>workflow.py</code>:</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;
import os

from redun import File, script, task

redun_namespace = &quot;bioinformatics_pipeline_tutorial.script_workflow&quot;


@task()
def digest_protein_task(
    input_fasta: File,
    enzyme_regex: str = &quot;[KR]&quot;,
    missed_cleavages: int = 0,
    min_length: int = 4,
    max_length: int = 75,
) -&gt; File:
    protein = input_fasta.basename().split(&quot;.&quot;)[0]
    output_path = os.path.join(
        os.path.split(input_fasta.dirname())[0], &quot;data&quot;, f&quot;{protein}.peptides.txt&quot;
    )
    return script(
        f&quot;&quot;&quot;
        bin/01_digest_protein.py \
            {input_fasta.path} \
            {output_path} \
            --enzyme_regex {enzyme_regex} \
            --missed_cleavages {missed_cleavages} \
            --min_length {min_length} \
            --max_length {max_length}
        &quot;&quot;&quot;,
        outputs=File(output_path),
        )</code></pre><p>We can then execute the task like this:</p><pre><code class="language-bash">redun run workflow.py digest_protein_task --input-fasta fasta/KLF4.fasta</code></pre><p>If successful, you should see the file <code>KLF4.peptides.txt</code> in the <code>data/</code> directory.</p><p>Now, a major reason (at least for me) to use redun is that we can natively define workflows in Python. If you&apos;re interested in seeing a working example of the complete workflow using script tasks, check out the <code>scripts_workflow.py</code> file in the final <code>redun</code> <a href="https://github.com/ricomnl/bioinformatics-pipeline-tutorial/blob/redun/wf/script_workflow.py">branch</a>. In the following part, we will follow a different, Python-native approach to defining tasks. Execute these commands to rename <code>workflow.py</code> and touch a new file.</p><pre><code class="language-bash">mv workflow.py scripts_workflow.py
touch workflow.py</code></pre><p>The first step is to copy over the three functions needed for the <code>digest_protein</code> task: <code>load_fasta()</code>, <code>save_peptides()</code> and <code>digest_protein()</code> (from <code>bin/01_digest_protein.py</code>). Only three small changes were made: we added type annotations as a good practice, added a <code>redun_namespace</code> variable at the top to define the namespace in which we run our workflow, and lastly, we adapted the <code>save_peptides()</code> function to return a redun <code>File</code> object after saving the results to it. For reference, the updated <code>workflow.py</code> file:</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;
import os
import re
from typing import List, Tuple

from redun import File, task

redun_namespace = &quot;bioinformatics_pipeline_tutorial.workflow&quot;


def load_fasta(input_file: File) -&gt; Tuple[str, str]:
    &quot;&quot;&quot;
    Load a protein with its metadata from a given .fasta file.
    &quot;&quot;&quot;
    with input_file.open(&quot;r&quot;) as fasta_file:
        lines = fasta_file.read().splitlines()
    metadata = lines[0]
    sequence = &quot;&quot;.join(lines[1:])
    return metadata, sequence


def save_peptides(filename: str, peptides: List[str]) -&gt; File:
    &quot;&quot;&quot;
    Write out the list of given peptides to a .txt file. Each line is a different peptide.
    &quot;&quot;&quot;
    output_file = File(filename)
    with output_file.open(&quot;w&quot;) as out:
        for peptide in peptides:
            out.write(&quot;{}\n&quot;.format(peptide))
    return output_file


def digest_protein(
    protein_sequence: str,
    enzyme_regex: str = &quot;[KR]&quot;,
    missed_cleavages: int = 0,
    min_length: int = 4,
    max_length: int = 75,
) -&gt; List[str]:
    &quot;&quot;&quot;
    Digest a protein into peptides using a given enzyme. Defaults to trypsin.
    &quot;&quot;&quot;
    # Find the cleavage sites
    enzyme_regex = re.compile(enzyme_regex)
    sites = (
        [0]
        + [m.end() for m in enzyme_regex.finditer(protein_sequence)]
        + [len(protein_sequence)]
    )

    peptides = set()

    # Do the digest
    for start_idx, start_site in enumerate(sites):
        for diff_idx in range(1, missed_cleavages + 2):
            end_idx = start_idx + diff_idx
            if end_idx &gt;= len(sites):
                continue
            end_site = sites[end_idx]
            peptide = protein_sequence[start_site:end_site]
            if len(peptide) &lt; min_length or len(peptide) &gt; max_length:
                continue
            peptides.add(peptide)
    return list(peptides)</code></pre><p>As you can see, we now have essentially the equivalent of our previous <code>bin/01_digest_protein.py</code> file without the <code>main()</code> function and parameter options (we&apos;ll add these later). We can now take the <code>main()</code> function from <code>bin/01_digest_protein.py</code> and add the <code>@task</code> operator to it:</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;

[...]

@task()
def digest_protein_task(input_fasta: File, output_file: File) -&gt; File:
    _, protein_sequence = load_fasta(input_fasta)
    peptides = digest_protein(protein_sequence)
    peptides_file = save_peptides(output_file.path, peptides)
    return peptides_file</code></pre><p>In redun, there is no extra <code>@workflow</code> decorator. A workflow gets assembled when a task is called that depends on other tasks. We will later see what that looks like. The cool thing about this is that it enables us to execute any task by itself (which is not possible with a lot of tools because they only let you execute workflows as a whole). To run the <code>digest_protein_task()</code>, we call:</p><pre><code class="language-bash">redun run workflow.py digest_protein_task --input-fasta fasta/KLF4.fasta --output-file data/KLF4.peptides.txt</code></pre><p>In order to not have to pass the <code>--output-file</code> path as an extra argument, we make it dependent on the input file by changing the code to the following:</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;

[...]

@task()
def digest_protein_task(input_fasta: File) -&gt; File:
    _, protein_sequence = load_fasta(input_fasta)
    peptides = digest_protein(protein_sequence)
    protein = input_fasta.basename().split(&quot;.&quot;)[0]
    output_path = os.path.join(
        os.path.split(input_fasta.dirname())[0], &quot;data&quot;, f&quot;{protein}.peptides.txt&quot;
    )
    peptides_file = save_peptides(output_path, peptides)
    return peptides_file</code></pre><p>Try rerunning it without the <code>--output-file</code> flag:</p><pre><code class="language-bash">redun run workflow.py digest_protein_task --input-fasta fasta/KLF4.fasta </code></pre><p>Now, as you might remember, the input we&apos;re dealing with is a list of input files, not just a single file. To make our workflow compatible with that, we add a <code>main()</code> task that takes as input a list of files.</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;
[...]
from redun import File, task
from redun.file import glob_file

[...]

@task()
def main(input_dir: str) -&gt; List[File]:
    input_fastas = [File(f) for f in glob_file(f&quot;{input_dir}/*.fasta&quot;)]
    peptide_files = [digest_protein_task(fasta) for fasta in input_fastas]
    return peptide_files</code></pre><p>Try running:</p><pre><code class="language-bash">redun run workflow.py main --input-dir fasta/</code></pre><p>Now, the <code>digest_protein_task()</code> should have been executed for all files in the <code>fasta/</code> folder. We can verify this by using redun&apos;s logging functionality. The command <code>redun log -</code> shows the execution of the most recent run. Alternatively, one can note down the execution ID (see below) when launching a new run and use either the full string or everything up to the first <code>-</code>: <code>redun log 1091e19e-b5b7-412b-bdf0-b703a9f79cd5</code> or <code>redun log 1091e19e</code>.</p><pre><code class="language-bash">$ redun run workflow.py main --input-dir fasta/
[redun] redun :: version 0.8.7
[redun] config dir: /Users/ricomeinl/Downloads/bioinformatics-pipeline-tutorial/.redun
[redun] Start Execution 1091e19e-b5b7-412b-bdf0-b703a9f79cd5:  redun run workflow.py main --input-dir fasta/
[...]</code></pre><p>By running either of the above, we can observe that redun indeed ran five tasks: the <code>main()</code> task plus the <code>digest_protein_task()</code> for each of the four files.</p><pre><code class="language-bash">$ redun log -
Exec 1091e19e-b5b7-412b-bdf0-b703a9f79cd5 [ DONE ] 2022-05-11 18:17:15:  run workflow.py main --input-dir fasta/ (git_commit=785ccac738c29bb27efa5fe8e950c23018961621, git_origin_url=https://github.com/ricomnl/bioinformatics-pipel..., project=bioinformatics_pipeline_tutorial.workflow, redun.version=0.8.7, user=ricomeinl)
Duration: 0:00:00.15

Jobs: 5 (DONE: 5, CACHED: 0, FAILED: 0)
--------------------------------------------------------------------------------
Job acaf05c6 [ DONE ] 2022-05-11 18:17:15:  bioinformatics_pipeline_tutorial.workflow.main(input_dir=&apos;fasta/&apos;) 
  Job 9620645d [ DONE ] 2022-05-11 18:17:15:  bioinformatics_pipeline_tutorial.workflow.digest_protein_task(File(path=fasta/SOX2.fasta, hash=621d4a48)) 
  Job efb908dc [ DONE ] 2022-05-11 18:17:15:  bioinformatics_pipeline_tutorial.workflow.digest_protein_task(File(path=fasta/KLF4.fasta, hash=10761e8a)) 
  Job fdb4f1fc [ DONE ] 2022-05-11 18:17:15:  bioinformatics_pipeline_tutorial.workflow.digest_protein_task(File(path=fasta/PO5F1.fasta, hash=341326f2)) 
  Job f2bc3668 [ DONE ] 2022-05-11 18:17:15:  bioinformatics_pipeline_tutorial.workflow.digest_protein_task(File(path=fasta/MYC.fasta, hash=daeb9045)) </code></pre><p>Great stuff! Let&apos;s add the next task <code>count_amino_acids()</code>.</p><p>We start by adding the helper functions from <code>bin/02_count_amino_acids.py</code> to our <code>workflow.py</code> file (with some small changes akin to the ones mentioned above).</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;

[...]

def load_peptides(input_file: File) -&gt; List[str]:
    &quot;&quot;&quot;
    Load peptides from a .txt file as a list.
    &quot;&quot;&quot;
    with input_file.open(&quot;r&quot;) as peptide_file:
        lines = peptide_file.read().splitlines()
    return lines


def save_counts(filename: str, peptide_counts: List[int]) -&gt; File:
    &quot;&quot;&quot;
    Write out the peptide counts to a .tsv file using tabs as a separator.
    &quot;&quot;&quot;
    output_file = File(filename)
    with output_file.open(&quot;w&quot;) as out:
        out.write(&quot;{}\n&quot;.format(&quot;\t&quot;.join([str(c) for c in peptide_counts])))
    return output_file


def num_peptides(peptides: List[str]) -&gt; int:
    &quot;&quot;&quot;
    Retrieve the number of peptides in a given list.
    &quot;&quot;&quot;
    return len(peptides)


def num_peptides_with_aa(peptides: List[str], amino_acid: str = &quot;C&quot;) -&gt; int:
    &quot;&quot;&quot;
    Count the number of peptides in a given list that contain a given amino acid. 
    Defaults to cysteine.
    &quot;&quot;&quot;
    return sum([1 if amino_acid in peptide else 0 for peptide in peptides])


def total_num_aa_in_protein(protein: str) -&gt; int:
    &quot;&quot;&quot;
    Count the total number of amino acids in a given protein string.
    &quot;&quot;&quot;
    return len(protein)


def num_aa_in_protein(protein: str, amino_acid: str = &quot;C&quot;) -&gt; int:
    &quot;&quot;&quot;
    Count the number of times a given amino acid occurs in a given protein.
    Defaults to cysteine.
    &quot;&quot;&quot;
    return protein.count(amino_acid)
    

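# A quick sanity check for the helpers above, e.g. in a Python REPL
# (the peptide strings are made up for illustration):
#
#     num_peptides([&quot;ACDK&quot;, &quot;CCGK&quot;])         # -&gt; 2
#     num_peptides_with_aa([&quot;ACDK&quot;, &quot;GGGK&quot;])  # -&gt; 1 (only ACDK contains a C)
#     num_aa_in_protein(&quot;ACCG&quot;)               # -&gt; 2
#
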
@task()
def digest_protein_task(input_fasta: File) -&gt; File: ...

[...]</code></pre><p>After that, we again port the <code>main()</code> function, this time from <code>bin/02_count_amino_acids.py</code>, to a new function decorated with <code>@task()</code>:</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;

[...]

@task()
def count_amino_acids_task(
    input_fasta: File, input_peptides: File, amino_acid: str = &quot;C&quot;
) -&gt; File:
    &quot;&quot;&quot;
    Count the number of times a given amino acid appears in a protein as well
    as its peptides after digestion.
    &quot;&quot;&quot;
    _, protein_sequence = load_fasta(input_fasta)
    peptides = load_peptides(input_peptides)
    n_peptides = num_peptides(peptides)
    n_peptides_with_aa = num_peptides_with_aa(peptides, amino_acid=amino_acid)
    total_aa_in_protein = total_num_aa_in_protein(protein_sequence)
    aa_in_protein = num_aa_in_protein(protein_sequence, amino_acid=amino_acid)
    protein = input_fasta.basename().split(&quot;.&quot;)[0]
    output_path = os.path.join(
        os.path.split(input_fasta.dirname())[0], &quot;data&quot;, f&quot;{protein}.count.tsv&quot;
    )
    aa_count_file = save_counts(
        output_path,
        [
            amino_acid,
            n_peptides,
            n_peptides_with_aa,
            total_aa_in_protein,
            aa_in_protein,
        ],
    )
    return aa_count_file


@task()
def main(input_dir: str) -&gt; List[File]: ...</code></pre><p>It&apos;s easy to test our task by itself:</p><pre><code class="language-bash">redun run workflow.py count_amino_acids_task --input-fasta fasta/KLF4.fasta --input-peptides data/KLF4.peptides.txt</code></pre><p>If successful, the task should have created a file called <code>KLF4.count.tsv</code> in the <code>data/</code> folder. We can now combine the two tasks in our <code>main()</code> function and execute it with:</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;

[...]

@task()
def main(input_dir: str) -&gt; List[File]:
    input_fastas = [File(f) for f in glob_file(f&quot;{input_dir}/*.fasta&quot;)]
    peptide_files = [digest_protein_task(fasta) for fasta in input_fastas]
    aa_count_files = [
        count_amino_acids_task(fasta, peptides)
        for (fasta, peptides) in zip(input_fastas, peptide_files)
    ]
    return aa_count_files</code></pre><pre><code class="language-bash">redun run workflow.py main --input-dir fasta/</code></pre><p>Running <code>redun log -</code> again will show that this time the four <code>digest_protein_task()</code> jobs were cached because neither their code nor their inputs changed.</p><pre><code class="language-bash">$ redun log -
Exec 8c438a1b-c24d-49c0-9c4c-cdf71e2504a8 [ DONE ] 2022-05-11 20:52:22:  run workflow.py main --input-dir fasta/ (git_commit=785ccac738c29bb27efa5fe8e950c23018961621, git_origin_url=https://github.com/ricomnl/bioinformatics-pipel..., project=bioinformatics_pipeline_tutorial.workflow, redun.version=0.8.7, user=ricomeinl)
Duration: 0:00:00.18

Jobs: 9 (DONE: 5, CACHED: 4, FAILED: 0)
--------------------------------------------------------------------------------
Job 72959d5d [ DONE ] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.main(input_dir=&apos;fasta/&apos;) 
  Job 344ede72 [CACHED] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.digest_protein_task(File(path=fasta/SOX2.fasta, hash=621d4a48)) 
  Job 0e853ce6 [CACHED] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.digest_protein_task(File(path=fasta/KLF4.fasta, hash=10761e8a)) 
  Job d8a5ea59 [CACHED] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.digest_protein_task(File(path=fasta/PO5F1.fasta, hash=341326f2)) 
  Job 80151743 [CACHED] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.digest_protein_task(File(path=fasta/MYC.fasta, hash=daeb9045)) 
  Job 60d89b74 [ DONE ] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.count_amino_acids_task(File(path=fasta/SOX2.fasta, hash=621d4a48), File(path=data/SOX2.peptides.txt, hash=de981d55), amino_acid=&apos;C&apos;) 
  Job 1271c054 [ DONE ] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.count_amino_acids_task(File(path=fasta/KLF4.fasta, hash=10761e8a), File(path=data/KLF4.peptides.txt, hash=365eea97), amino_acid=&apos;C&apos;) 
  Job 92e3bbe5 [ DONE ] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.count_amino_acids_task(File(path=fasta/PO5F1.fasta, hash=341326f2), File(path=data/PO5F1.peptides.txt, hash=cf7b5a5e), amino_acid=&apos;C&apos;) 
  Job ad2298f6 [ DONE ] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.count_amino_acids_task(File(path=fasta/MYC.fasta, hash=daeb9045), File(path=data/MYC.peptides.txt, hash=06a265e1), amino_acid=&apos;C&apos;)</code></pre><p>I now encourage you to add the two final tasks yourself. Remember from the last post, we want to create plots for the generated counts of each protein .fasta file (<code>bin/03a_plot_count.py</code>) and finally, generate an output report for the results (<code>bin/03b_get_report.py</code>). Below you can find the final code for <code>main()</code> and <code>archive_results_task()</code>. Try to fill in the code for <code>plot_count_task()</code> and <code>get_report_task()</code>.</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;


[...]


@task()
def plot_count_task(input_count: File) -&gt; File:
    &quot;&quot;&quot;
    Load the calculated counts and create a plot.
    &quot;&quot;&quot;
    # TODO
    pass


@task()
def get_report_task(input_counts: List[File]) -&gt; File:
    &quot;&quot;&quot;
    Get a list of input files from a given folder and create a report.
    &quot;&quot;&quot;
    # TODO
    pass


@task()
def archive_results_task(inputs_plots: List[File], input_report: File) -&gt; File:
    output_path = os.path.join(
        os.path.split(input_report.dirname())[0], &quot;data&quot;, &quot;results.tgz&quot;
    )
    tar_file = File(output_path)
    with tar_file.open(&quot;wb&quot;) as out:
        with tarfile.open(fileobj=out, mode=&quot;w|gz&quot;) as tar:
            for file_path in inputs_plots + [input_report]:
                if get_filesystem_class(url=file_path.path).name == &quot;s3&quot;:
                    tmp_file = File(os.path.basename(file_path.path))
                else:
                    tmp_file = file_path
                output_file = file_path.copy_to(tmp_file, skip_if_exists=True)
                tar.add(output_file.path)
    return tar_file


@task()
def main(
    input_dir: str,
    amino_acid: str = &quot;C&quot;,
    enzyme_regex: str = &quot;[KR]&quot;,
    missed_cleavages: int = 0,
    min_length: int = 4,
    max_length: int = 75,
) -&gt; File:
    input_fastas = [File(f) for f in glob_file(f&quot;{input_dir}/*.fasta&quot;)]
    peptide_files = [
        digest_protein_task(
            fasta,
            enzyme_regex=enzyme_regex,
            missed_cleavages=missed_cleavages,
            min_length=min_length,
            max_length=max_length,
        )
        for fasta in input_fastas
    ]
    aa_count_files = [
        count_amino_acids_task(
            fasta, peptides, amino_acid=amino_acid
        )
        for (fasta, peptides) in zip(input_fastas, peptide_files)
    ]
    count_plots = [
        plot_count_task(aa_count)
        for aa_count in aa_count_files
    ]
    report_file = get_report_task(aa_count_files)
    results_archive = archive_results_task(
        count_plots, report_file
    )
    return results_archive</code></pre><p>Hint: In order to port over <code>bin/03a_plot_count.py</code>, the <code>plot_counts()</code> function needs to be adjusted to use plotly instead of matplotlib because redun parallelizes tasks across multiple threads and matplotlib will throw an error when it&apos;s run outside the main thread.</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Update: The code using <code>matplotlib</code> should still work when using the process-based instead of the <a href="https://insitro.github.io/redun/executors.html#local-executor">thread-based executor</a>.</div></div><p>Hence, here is the updated function using plotly:</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;
import os
import re
import tarfile
from typing import List, Tuple

from plotly.subplots import make_subplots
import plotly.graph_objects as go
from redun import task, File
from redun.file import glob_file, get_filesystem_class

[...]

def plot_counts(filename: str, counts: List[str]) -&gt; File:
    &quot;&quot;&quot;
    Plot the calculated counts.
    &quot;&quot;&quot;
    (
        amino_acid,
        n_peptides,
        n_peptides_with_aa,
        total_aa_in_peptides,
        aa_in_peptides,
    ) = counts
    labels_n_peptides = [&quot;No. of Peptides&quot;, &quot;No. of Peptides w/ {}&quot;.format(amino_acid)]
    labels_n_aa = [&quot;Total No. of Amino Acids&quot;, &quot;No. of {}&apos;s&quot;.format(amino_acid)]
    colors = [&quot;#001425&quot;, &quot;#308AAD&quot;]
    fig = make_subplots(rows=1, cols=2)
    fig.add_trace(
        go.Bar(
            x=labels_n_peptides,
            y=[int(n_peptides_with_aa), int(n_peptides)],
            marker_color=colors[0],
        ),
        row=1,
        col=1,
    )
    fig.add_trace(
        go.Bar(
            x=labels_n_aa,
            y=[int(aa_in_peptides), int(total_aa_in_peptides)],
            marker_color=colors[1],
        ),
        row=1,
        col=2,
    )
    fig.update_layout(
        height=600,
        width=800,
        title_text=&quot;{}&apos;s in Peptides and Amino Acids&quot;.format(amino_acid),
        showlegend=False,
    )
    if get_filesystem_class(url=filename).name == &quot;s3&quot;:
        tmp_file = File(os.path.basename(filename))
    else:
        tmp_file = File(filename)
    fig.write_image(tmp_file.path)
    output_file = tmp_file.copy_to(File(filename), skip_if_exists=True)
    return output_file

[...]</code></pre><p>If you&apos;ve made it all the way here, you can check your solution against the working version on the <code>redun</code> <a href="https://github.com/ricomnl/bioinformatics-pipeline-tutorial/blob/redun/wf/old_workflow.py">branch of the Github repository</a>.</p><p>If you run <code>redun run workflow.py main --input-dir fasta/</code>, your local <code>data/</code> directory should be populated with these files:</p><pre><code class="language-bash">$ tree data
data
&#x251C;&#x2500;&#x2500; KLF4.count.plot.png
&#x251C;&#x2500;&#x2500; KLF4.count.tsv
&#x251C;&#x2500;&#x2500; KLF4.peptides.txt
&#x251C;&#x2500;&#x2500; MYC.count.plot.png
&#x251C;&#x2500;&#x2500; MYC.count.tsv
&#x251C;&#x2500;&#x2500; MYC.peptides.txt
&#x251C;&#x2500;&#x2500; PO5F1.count.plot.png
&#x251C;&#x2500;&#x2500; PO5F1.count.tsv
&#x251C;&#x2500;&#x2500; PO5F1.peptides.txt
&#x251C;&#x2500;&#x2500; SOX2.count.plot.png
&#x251C;&#x2500;&#x2500; SOX2.count.tsv
&#x251C;&#x2500;&#x2500; SOX2.peptides.txt
&#x251C;&#x2500;&#x2500; protein_report.tsv
&#x2514;&#x2500;&#x2500; results.tgz

0 directories, 14 files</code></pre><h1 id="taking-it-to-the-cloud">Taking it to the cloud</h1><p>To define where a <code>@task</code> will run, we can specify a task executor like this:</p><pre><code class="language-python">@task(executor=&quot;my_executor&quot;)
def digest_protein_task():
    # ...</code></pre><p>The executor <code>my_executor</code> then has to be defined in the <a href="https://insitro.github.io/redun/config.html">redun configuration</a> <code>.redun/redun.ini</code>. If you go and open it up you can see the default executor defined already:</p><pre><code class="language-bash"># redun configuration.

[backend]
db_uri = sqlite:///redun.db

[executors.default]
type = local
max_workers = 20</code></pre><p>As of now, redun supports AWS Batch and Glue executors that will run tasks in the cloud. A Kubernetes executor is <a href="https://github.com/insitro/redun/pull/22">currently in the making</a>. We&apos;ll walk through how to create one for AWS Batch below. </p><p>I&apos;m not going to go through all the steps on how to set up AWS Batch as there are a lot of great tutorials online. If you want to follow along make sure you have <a href="https://docs.docker.com/get-docker/">Docker installed</a> and an existing <a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html">AWS CLI setup</a>.</p><p>You&apos;ll need the following AWS resources:</p><ul><li>S3 Bucket</li><li>Push access to Elastic Container Registry (ECR)</li><li>An AWS Batch queue that we can publish jobs to</li></ul><p>To get started, we need to create a Dockerfile like this:</p><pre><code class="language-Dockerfile">FROM ubuntu:20.04

# Install OS-level libraries.
RUN apt-get update -y &amp;&amp; DEBIAN_FRONTEND=&quot;noninteractive&quot; apt-get install -y \
    python3 \
    python3-pip &amp;&amp; \
    apt-get clean

WORKDIR /code

# Install our python code dependencies.
COPY requirements.txt .
RUN pip3 install --upgrade pip
RUN pip3 install -r requirements.txt</code></pre><p>We&apos;ll also create a Makefile to simplify the process of building and pushing our Docker image:</p><pre><code class="language-Makefile">IMAGE=bioinformatics_pipeline_tutorial
ACCOUNT=$(shell aws ecr describe-registry --query registryId --output text)
REGION=$(shell aws configure get region)
REGISTRY=$(ACCOUNT).dkr.ecr.$(REGION).amazonaws.com

login:
	aws ecr get-login-password --region $(REGION) | docker login --username AWS --password-stdin $(REGISTRY)

build:
	docker build -t $(REGISTRY)/$(IMAGE) --build-arg REGISTRY=$(REGISTRY) .

build-local:
	docker build -t $(IMAGE) --build-arg REGISTRY=$(REGISTRY) .

create-repo:
	aws ecr create-repository --repository-name $(IMAGE)

push:
	docker push $(REGISTRY)/$(IMAGE)

bash:
	docker run --rm -it $(REGISTRY)/$(IMAGE) bash

bash-local:
	docker run --rm -it $(IMAGE) bash
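</code></pre><p>The registry hostname that the variables at the top of this Makefile assemble follows a fixed template: account ID, then region, spliced into the ECR domain. As a quick illustration, here is the same logic in Python (<code>ecr_registry</code> is a made-up helper and the account ID below is a placeholder, not a real account):</p><pre><code class="language-python">def ecr_registry(account_id, region):
    """Build the ECR registry hostname, mirroring REGISTRY in the Makefile."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


print(ecr_registry("123456789012", "us-west-2"))
# 123456789012.dkr.ecr.us-west-2.amazonaws.com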
</code></pre><p>To build and test the Docker image locally, run:</p><pre><code class="language-bash">make build-local
docker run --rm -it bioinformatics_pipeline_tutorial pip list | grep &quot;redun&quot;</code></pre><p>If the output is <code>redun &#xA0; &#xA0; &#xA0; &#xA0; &#xA0; &#xA0; &#xA0; &#xA0; &#xA0; &#xA0;0.8.7</code>, the Dockerfile has built and installed the dependencies correctly. &#xA0;</p><p>Now, we can use the following make command to build our Docker image:</p><pre><code class="language-bash">make login
make build</code></pre><p>After the image builds, we need to publish it to ECR so that it is accessible by AWS Batch. There are several steps for doing that, which are covered in these make commands:</p><pre><code class="language-bash"># If the docker repo does not exist yet.
make create-repo

# Push the locally built image to ECR.
make push</code></pre><p>You might be wondering: how will our Python code get into the container? We didn&apos;t add our <code>workflow.py</code> file to the Docker image. The answer lies in redun&apos;s <a href="https://insitro.github.io/redun/executors.html#code-packaging">code packaging feature</a>, which essentially packages all the Python code in the current directory into a tar file and copies it to our S3 scratch directory. From here, it will be downloaded into the running AWS Batch job. This makes it a lot faster to iterate, without having to rebuild the Docker image for every code change.</p><p>Let&apos;s add our custom AWS Batch executor to our <code>.redun/redun.ini</code> config in the current working directory:</p><pre><code class="language-bash">[...]

[executors.batch]
type = aws_batch

# Required:
image = YOUR_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/bioinformatics_pipeline_tutorial
queue = YOUR_QUEUE_NAME
s3_scratch = s3://YOUR_BUCKET/redun/

# Optional:
role = arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_ROLE
job_name_prefix = redun-example
debug = False</code></pre><p>To get a working example, you&apos;ll need to replace all caps variables <code>YOUR_ACCOUNT_ID</code>, <code>YOUR_QUEUE_NAME</code>, and <code>YOUR_BUCKET</code>, with your own AWS Account ID, AWS Batch queue name, and S3 bucket, respectively.</p><p>Next up, make sure that all the tasks are equipped with our shiny new AWS Batch executor.</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;

[...]

@task(executor=&quot;batch&quot;)
def main(...)</code></pre><p>Now we can execute our pipeline as usual, with the difference that redun will now run each task as a separate AWS Batch job and use the input data stored in the given S3 bucket (you&apos;ll need to upload the <code>fasta/</code> folder to the S3 bucket you want to use).</p><pre><code class="language-bash">redun run workflow.py main --input-dir s3://YOUR_BUCKET/fasta/</code></pre><p>Note: I ran into the following error and used <a href="https://aws.amazon.com/premiumsupport/knowledge-center/ecs-unable-to-assume-role/">the fix detailed in this post</a> by AWS to set up my role correctly.</p><div class="kg-card kg-callout-card kg-callout-card-red"><div class="kg-callout-text">ECS was unable to assume the role &apos;arn:aws:iam::***:role/role-name&apos; that was provided for this task. Please verify that the role being passed has the proper trust relationship and permissions and that your IAM user has permissions to pass this role.</div></div><p>Redun provides a nice <a href="https://github.com/insitro/redun/tree/main/examples/05_aws_batch#interactive-debugging">debug functionality</a> through which the tasks run locally in Docker containers and the data is still pulled from S3. To enable it change the <code>debug</code> field in the <code>.redun/redun.ini</code> config:</p><pre><code class="language-bash">[...]

[executors.batch]

[...]

debug = True</code></pre><p>Then, in order to jump into a running task, you can add the familiar <code>import pdb; pdb.set_trace()</code> statement to debug. </p><h1 id="importing-submodules-via-pip">Importing submodules via pip</h1><p>Ok, so we&apos;ve written a workflow that consists of five tasks connected through our <code>main()</code> task. The workflow itself might be quite specific but it&apos;s easy to imagine that many individual tasks could be reused by other workflows. Redun solves this very elegantly and it&apos;s something that&apos;s hard to get right (e.g. Nextflow only very recently added this feature with the <a href="https://www.nextflow.io/docs/latest/dsl2.html#modules">release of their DSL2</a> and it&apos;s still bumpy IMO). </p><p>We&apos;re going to create a <code>bioinformatics_pipeline_tutorial/</code> package folder and put all of our reusable tasks in there so that someone who wants to use them can just <code>pip install</code> our Github repository (or released Python package).</p><pre><code class="language-bash">mkdir -p bioinformatics_pipeline_tutorial/
touch bioinformatics_pipeline_tutorial/__init__.py
touch bioinformatics_pipeline_tutorial/lib.py</code></pre><p>Now copy everything except the <code>main()</code> function into <code>bioinformatics_pipeline_tutorial/lib.py</code>. In the <code>workflow.py</code> file, import the task functions.</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;
[...]

from bioinformatics_pipeline_tutorial.lib import (
    digest_protein_task,
    count_amino_acids_task,
    plot_count_task,
    get_report_task,
    archive_results_task,
)

[...]</code></pre><p>Finally, let&apos;s add a <code>setup.py</code> file to make the Github repository installable. Feel free to try and publish your package on pip and install it for another project.</p><pre><code class="language-python">&quot;&quot;&quot;setup.py&quot;&quot;&quot;
from setuptools import setup


setup(
    name=&quot;bioinformatics_pipeline_tutorial&quot;,
    version=&quot;0.0.1&quot;,
    packages=[&quot;bioinformatics_pipeline_tutorial&quot;],
    install_requires=[&quot;redun&quot;, &quot;plotly&quot;, &quot;kaleido&quot;],
)
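</code></pre><p>Before publishing, it can be useful to check that the modules listed under <code>packages</code> are actually importable. Here is a small, dependency-free sketch (<code>module_available</code> is a hypothetical helper, not part of the tutorial code); after a <code>pip install .</code>, <code>module_available(&quot;bioinformatics_pipeline_tutorial.lib&quot;)</code> should hold as well:</p><pre><code class="language-python">import importlib.util


def module_available(name):
    """Return True if `name` can be found on the current sys.path."""
    return importlib.util.find_spec(name) is not None


print(module_available("os"))  # True; stdlib modules are always found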
</code></pre><p>You can try it out by installing the finished module via:</p><pre><code class="language-bash">pip install git+https://github.com/&lt;your_git_username&gt;/bioinformatics-pipeline-tutorial@redun</code></pre><p>Then open up a Python console by calling <code>python</code> and try to import the <code>digest_protein_task()</code>.</p><pre><code class="language-bash">$ python
Python 3.8.5 (default, Sep 27 2020, 11:35:15) 
[Clang 12.0.0 (clang-1200.0.32.2)] on darwin
Type &quot;help&quot;, &quot;copyright&quot;, &quot;credits&quot; or &quot;license&quot; for more information.
&gt;&gt;&gt; from bioinformatics_pipeline_tutorial.lib import digest_protein_task
&gt;&gt;&gt;</code></pre><p>Check out the final state of this part of the tutorial by pulling the <code>redun</code> <a href="https://github.com/ricomnl/bioinformatics-pipeline-tutorial/tree/redun">branch</a>.</p><h1 id="redun-makefiles">Redun Makefiles</h1><p>The <a href="https://github.com/insitro/redun/tree/main/examples/02_compile#bonus-round">bonus round section</a> of the second tutorial in the redun Github repository shows how redun can emulate Makefile behavior. If you recall, in <code>part_02</code> of the original blog post we created a Makefile to execute our pipeline. Let&apos;s have some fun and try to rewrite that Makefile in redun. This part will also showcase redun&apos;s ability to handle recursion.</p><p>First, check out <code>part_02</code> and run <code>make all</code> to make sure it&apos;s still working:</p><pre><code class="language-bash">git checkout part_02
make all</code></pre><p>As you will recall from the last post, a Makefile is made up of recipes that specify how to build target files. The structure of each recipe is this:</p><pre><code class="language-Makefile">targets: prerequisites
	command</code></pre><p>This is how we would specify a recipe for how to create the file <code>KLF4.peptides.txt</code> in the <code>data/</code> folder (the target) using the <code>bin/01_digest_protein.py</code> script (the command) with <code>fasta/KLF4.fasta</code> being the only prerequisite.</p><pre><code class="language-Makefile">data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt</code></pre><p>We can then call <code>make data/KLF4.peptides.txt</code> to generate the target file.</p><p>To emulate this behavior in redun, we start by creating a file <code>make_workflow.py</code>. </p><pre><code class="language-bash">touch make_workflow.py</code></pre><p>We start by defining the first rule with a custom DSL:</p><pre><code class="language-python">&quot;&quot;&quot;make_workflow.py&quot;&quot;&quot;

redun_namespace = &quot;bioinformatics_pipeline_tutorial.make_workflow&quot;


# Custom DSL for describing targets, dependencies (deps), and commands.
rules = {
    &quot;data/KLF4.peptides.txt&quot;: {
        &quot;deps&quot;: [&quot;fasta/KLF4.fasta&quot;],
        &quot;command&quot;: &quot;bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt&quot;
    },
}</code></pre><p>Next, we copy the two functions <code>run_command()</code> and <code>make()</code> from the <a href="https://github.com/insitro/redun/blob/main/examples/02_compile/make2.py">redun tutorial</a>. The <code>run_command()</code> function takes as input a shell command specified as a string which it runs to generate the target file. The <code>make()</code> function generates the target by recursively creating all its dependencies (if needed).</p><pre><code class="language-python">&quot;&quot;&quot;make_workflow.py&quot;&quot;&quot;
import os
from typing import List, Optional

from redun import task, File
from redun.functools import const


redun_namespace = &quot;bioinformatics_pipeline_tutorial.make_workflow&quot;


# Custom DSL for describing targets, dependencies (deps), and commands.
rules = {
    &quot;data/KLF4.peptides.txt&quot;: {
        &quot;deps&quot;: [&quot;fasta/KLF4.fasta&quot;],
        &quot;command&quot;: &quot;bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt&quot;
    },
}


@task()
def run_command(command: str, inputs: List[File], output_path: str) -&gt; File:
    &quot;&quot;&quot;
    Run a shell command to produce a target.
    &quot;&quot;&quot;
    # Ignore inputs. We pass it as an argument to simply force a dependency.
    assert os.system(command) == 0
    return File(output_path)


@task()
def make(target: str = &quot;all&quot;, rules: dict = rules) -&gt; Optional[File]:
    &quot;&quot;&quot;
    Make a target (file) using a series of rules.
    &quot;&quot;&quot;
    rule = rules.get(target)
    if not rule:
        # No rule. See if target already exists.
        file = File(target)
        if not file.exists():
            raise ValueError(f&quot;No rule for target: {target}&quot;)
        return file

    # Recursively make dependencies.
    inputs = [
        make(dep, rules=rules)
        for dep in rule.get(&quot;deps&quot;, [])
    ]

    # Run command, if needed.
    if &quot;command&quot; in rule:
        return run_command(rule[&quot;command&quot;], inputs, target)
    else:
        # const(None, inputs) evaluates to None while still depending on
        # inputs, so dependency-only rules still build their deps.
        return const(None, inputs)
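</code></pre><p>To see the recursive resolution in isolation, here is a hypothetical, redun-free sketch of the same idea: a target is built only after all of its dependencies, with no caching or parallelism (which the real <code>make()</code> task gets from redun). The names <code>make_plain</code> and <code>demo_rules</code> are invented for this illustration:</p><pre><code class="language-python">def make_plain(target, rules, built=None):
    """Resolve `target` depth-first, recording the build order in `built`."""
    built = [] if built is None else built
    rule = rules.get(target)
    if rule is None:
        # No rule: assume it is a source file that already exists.
        return built
    for dep in rule.get("deps", []):
        make_plain(dep, rules, built)
    built.append(target)  # stand-in for running the rule's command
    return built


demo_rules = {
    "report": {"deps": ["counts"]},
    "counts": {"deps": ["peptides"]},
    "peptides": {"deps": ["protein.fasta"]},
}
print(make_plain("report", demo_rules))
# ['peptides', 'counts', 'report']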
</code></pre><p>We can generate a target by calling:</p><pre><code class="language-bash">redun run make_workflow.py make --target data/KLF4.peptides.txt</code></pre><p>Let&apos;s add some more rules:</p><pre><code class="language-python">&quot;&quot;&quot;make_workflow.py&quot;&quot;&quot;

[...]

# Custom DSL for describing targets, dependencies (deps), and commands.
rules = {
    &quot;data/KLF4.peptides.txt&quot;: {
        &quot;deps&quot;: [&quot;fasta/KLF4.fasta&quot;],
        &quot;command&quot;: &quot;bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt&quot;
    },
    &quot;data/KLF4.count.tsv&quot;: {
        &quot;deps&quot;: [&quot;fasta/KLF4.fasta&quot;, &quot;data/KLF4.peptides.txt&quot;],
        &quot;command&quot;: &quot;bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv&quot;
    },
    &quot;data/KLF4.plot.png&quot;: {
        &quot;deps&quot;: [&quot;data/KLF4.count.tsv&quot;],
        &quot;command&quot;: &quot;bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png&quot;
    },
    &quot;data/protein_report.tsv&quot;: {
        &quot;deps&quot;: [&quot;data/KLF4.count.tsv&quot;],
        &quot;command&quot;: &quot;bin/03b_get_report.py data/KLF4.count.tsv --output_file=data/protein_report.tsv&quot;
    },
    &quot;data/results.tgz&quot;: {
        &quot;deps&quot;: [&quot;data/KLF4.plot.png&quot;, &quot;data/protein_report.tsv&quot;],
        &quot;command&quot;: &quot;&quot;&quot;rm -rf results
                      mkdir results
                      cp data/KLF4.plot.png data/protein_report.tsv results/
                      tar -czf data/results.tgz results
                      rm -r results&quot;&quot;&quot;
    },
}

[...]</code></pre><p>Now we can run the command below and it will generate the files <code>data/KLF4.plot.png</code>, <code>data/KLF4.count.tsv</code>, and <code>data/KLF4.peptides.txt</code>.</p><pre><code class="language-bash">redun run make_workflow.py make --target data/KLF4.plot.png</code></pre><p>But what if we wanted to add rules for new proteins? If you read the last post, you already know the answer and it&apos;s not: add separate rules for each file. We&apos;ll use pattern matching. Now, the initial implementation from the redun examples doesn&apos;t support pattern matching. Therefore, let&apos;s add a more advanced <code>match_target()</code> function and use it within the <code>make()</code> function. We&apos;ll also need to adjust our rules accordingly and use <code>%</code> as a wild card.</p><pre><code class="language-python">&quot;&quot;&quot;make_workflow.py&quot;&quot;&quot;
import os
from typing import Dict, List, Optional

[...]

rules = {
    &quot;data/%.peptides.txt&quot;: {
        &quot;deps&quot;: [&quot;fasta/%.fasta&quot;],
        &quot;command&quot;: &quot;bin/01_digest_protein.py fasta/%.fasta data/%.peptides.txt&quot;
    },
    &quot;data/%.count.tsv&quot;: {
        &quot;deps&quot;: [&quot;fasta/%.fasta&quot;, &quot;data/%.peptides.txt&quot;],
        &quot;command&quot;: &quot;bin/02_count_amino_acids.py fasta/%.fasta data/%.peptides.txt data/%.count.tsv&quot;
    },
    &quot;data/%.plot.png&quot;: {
        &quot;deps&quot;: [&quot;data/%.count.tsv&quot;],
        &quot;command&quot;: &quot;bin/03a_plot_count.py data/%.count.tsv data/%.plot.png&quot;
    },
    &quot;data/protein_report.tsv&quot;: {
        &quot;deps&quot;: [&quot;data/%.count.tsv&quot;],
        &quot;command&quot;: &quot;bin/03b_get_report.py data/%.count.tsv --output_file=data/protein_report.tsv&quot;
    },
    &quot;data/results.tgz&quot;: {
        &quot;deps&quot;: [&quot;data/%.plot.png&quot;, &quot;data/protein_report.tsv&quot;],
        &quot;command&quot;: &quot;&quot;&quot;rm -rf results
                      mkdir results
                      cp data/%.plot.png data/protein_report.tsv results/
                      tar -czf data/results.tgz results
                      rm -r results&quot;&quot;&quot;
    },
}

def match_target(target: str = &quot;all&quot;, rules: dict = rules) -&gt; Optional[Dict[str, Dict]]:
    &quot;&quot;&quot;
    Emulate GNU make pattern matching described here: 
    https://www.gnu.org/software/make/manual/html_node/Pattern-Match.html#Pattern-Match
    &quot;&quot;&quot;
    rule = rules.get(target)
    if not rule:
        _, tbase = os.path.split(target)
        for rkey, rval in rules.items():
            _, rbase = os.path.split(rkey)
            if &quot;%&quot; not in rbase: continue
            pre, post = rbase.split(&quot;%&quot;)
            if tbase.startswith(pre) and tbase.endswith(post):
                # Length-based slice also handles patterns ending in &quot;%&quot; (post == &quot;&quot;).
                stem = tbase[len(pre):len(tbase) - len(post)]
                rule = {
                    &quot;deps&quot;: [dep.replace(&quot;%&quot;, stem) for dep in rval.get(&quot;deps&quot;, [])],
                    &quot;command&quot;: rval.get(&quot;command&quot;, &quot;&quot;).replace(&quot;%&quot;, stem),
                }
                break
    return rule
    
[...]

@task()
def make(target: str = &quot;all&quot;, rules: dict = rules) -&gt; Optional[File]:
    &quot;&quot;&quot;
    Make a target (file) using a series of rules.
    &quot;&quot;&quot;
    rule = match_target(target, rules) if &quot;%&quot; not in target else None
    [...]</code></pre><p>We can now generate the target for <em>any</em> protein and our workflow will substitute the <code>%</code> with a matched stem if it finds one. Try running:</p><pre><code class="language-bash">redun run make_workflow.py make --target data/MYC.plot.png</code></pre><p>However, if you try to generate either one of the last two targets <code>data/protein_report.tsv</code> and <code>data/results.tgz</code>, you&apos;ll run into the following issue:</p><pre><code class="language-bash">$ redun run make_workflow.py make --target data/protein_report.tsv
[...]
ValueError: No rule for target: data/%.count.tsv</code></pre><p>This is the same behavior we&apos;d get if we were to run the same command with make as seen in the last post:</p><pre><code class="language-bash">make: *** No rule to make target `data/%.count.tsv&apos;, needed by `data/%.plot.png&apos;. Stop.</code></pre><p>This occurs because when trying to generate the target <code>data/protein_report.tsv</code>, one of its dependencies is <code>data/%.count.tsv</code> and there is no way for redun (or make) to know which stem to replace the wildcard with. Hence, at some point in our program, we need to define a list of target files that we want to generate. We insert two variables at the top and use them for the last two rules. We also add recipes for <code>all</code> and <code>clean</code>.</p><pre><code class="language-python">&quot;&quot;&quot;make_workflow.py&quot;&quot;&quot;
[...]

COUNT = [&quot;data/KLF4.count.tsv&quot;, &quot;data/MYC.count.tsv&quot;, &quot;data/PO5F1.count.tsv&quot;, &quot;data/SOX2.count.tsv&quot;]
PLOT = [&quot;data/KLF4.plot.png&quot;, &quot;data/MYC.plot.png&quot;, &quot;data/PO5F1.plot.png&quot;, &quot;data/SOX2.plot.png&quot;]


# Custom DSL for describing targets, dependencies (deps), and commands.
rules = {
    &quot;all&quot;: {
        &quot;deps&quot;: [&quot;data/results.tgz&quot;],
    },
    &quot;clean&quot;: {
        &quot;command&quot;: &quot;rm -rf data/*&quot;,
    },
    &quot;data/%.peptides.txt&quot;: {
        &quot;deps&quot;: [&quot;fasta/%.fasta&quot;],
        &quot;command&quot;: &quot;bin/01_digest_protein.py fasta/%.fasta data/%.peptides.txt&quot;
    },
    &quot;data/%.count.tsv&quot;: {
        &quot;deps&quot;: [&quot;fasta/%.fasta&quot;, &quot;data/%.peptides.txt&quot;],
        &quot;command&quot;: &quot;bin/02_count_amino_acids.py fasta/%.fasta data/%.peptides.txt data/%.count.tsv&quot;
    },
    &quot;data/%.plot.png&quot;: {
        &quot;deps&quot;: [&quot;data/%.count.tsv&quot;],
        &quot;command&quot;: &quot;bin/03a_plot_count.py data/%.count.tsv data/%.plot.png&quot;
    },
    &quot;data/protein_report.tsv&quot;: {
        &quot;deps&quot;: COUNT,
        &quot;command&quot;: &quot;bin/03b_get_report.py {COUNT} --output_file=data/protein_report.tsv&quot;.format(COUNT=&quot; &quot;.join(COUNT))
    },
    &quot;data/results.tgz&quot;: {
        &quot;deps&quot;: PLOT + [&quot;data/protein_report.tsv&quot;],
        &quot;command&quot;: &quot;&quot;&quot;rm -rf results
                      mkdir results
                      cp {PLOT} data/protein_report.tsv results/
                      tar -czf data/results.tgz results
                      rm -r results&quot;&quot;&quot;.format(PLOT=&quot; &quot;.join(PLOT))
    },
}

[...]
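</code></pre><p>As a quick standalone check (illustration only, using the same <code>COUNT</code> values as above), the report rule&apos;s command should expand to a single <code>bin/03b_get_report.py</code> call over all four count files:</p><pre><code class="language-python">COUNT = ["data/KLF4.count.tsv", "data/MYC.count.tsv", "data/PO5F1.count.tsv", "data/SOX2.count.tsv"]

# Same substitution as in the rules dict: join the file list with spaces.
cmd = "bin/03b_get_report.py {COUNT} --output_file=data/protein_report.tsv".format(COUNT=" ".join(COUNT))
print(cmd)
# bin/03b_get_report.py data/KLF4.count.tsv data/MYC.count.tsv data/PO5F1.count.tsv data/SOX2.count.tsv --output_file=data/protein_report.tsv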
</code></pre><p>You should now be able to clean up all past files that were generated with:</p><pre><code class="language-bash">redun run make_workflow.py make --target clean</code></pre><p>Finally, run the following and check whether it actually generates all of our target files in the <code>data/</code> folder:</p><pre><code class="language-bash">redun run make_workflow.py make --target all
# Or:
redun run make_workflow.py make</code></pre><p>You can browse the final state of this part of the tutorial in the <code>redun</code> <a href="https://github.com/ricomnl/bioinformatics-pipeline-tutorial/blob/redun/wf/make_workflow.py">branch</a>.</p><h1 id="conclusion">Conclusion</h1><p>That&apos;s it! Thanks for sticking with me all the way until the end, I hope it was fun and you got to explore some of redun&apos;s functionality. I linked some further resources below. Just to recap, here&apos;s what we covered:</p><ul><li>Core features of redun</li><li>Run redun workflows on AWS Batch</li><li>Import submodules via pip</li><li>Emulate Makefile behavior in redun</li></ul><p>For follow-up questions or feedback on this article, you can submit an issue through <a href="https://github.com/ricomnl/bioinformatics-pipeline-tutorial/issues">the accompanying GitHub repository</a> or reach me on <a href="https://twitter.com/ricomnl">Twitter</a>.</p><p>Huge thanks to <a href="https://twitter.com/AlexandreTrapp">Alex Trapp</a> and <a href="https://twitter.com/mattrasmus">Matt Rasmussen</a> for their thoughts and feedback on the draft.</p><h1 id="resources">Resources</h1><ul><li>Data Science workflows at insitro: using redun on AWS Batch: <a href="https://aws.amazon.com/blogs/hpc/data-science-workflows-at-insitro-using-redun-on-aws-batch/">https://aws.amazon.com/blogs/hpc/data-science-workflows-at-insitro-using-redun-on-aws-batch/</a></li><li>Data Science workflows at insitro: how redun uses the advanced service features from AWS Batch and AWS Glue: <a href="https://aws.amazon.com/blogs/hpc/how-insitro-redun-uses-advanced-aws-features/">https://aws.amazon.com/blogs/hpc/how-insitro-redun-uses-advanced-aws-features/</a></li><li>Redun Design Document: <a href="https://insitro.github.io/redun/design.html">https://insitro.github.io/redun/design.html</a></li><li>Redun <a href="https://github.com/insitro/redun/tree/main/examples">tutorials</a> I&apos;d recommend: 
<code>03_scheduler</code>, <code>04_script</code>, <code>05_aws_batch</code>, <code>functools</code>, <code>setup_scheduler</code>, and <code>testing</code></li><li>Great thread on what makes a good pipeline: <a href="https://twitter.com/VictoriaCarr_/status/1521496097230839810">https://twitter.com/VictoriaCarr_/status/1521496097230839810</a></li></ul>]]></content:encoded></item><item><title><![CDATA[HTGAA 22]]></title><description><![CDATA[<p>I&apos;m participating as a committed listener in 2022&apos;s <a href="https://htgaa2022.notion.site/htgaa2022/HTGAA-2022-d39e5560ad83483ab87d415f085b60c6">How to Grow (Almost) Anything</a>. Here&apos;s the link to all my assignment submissions:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://htgaa22-ricomeinl.notion.site/Rico-Meinl-efa379490adc4c169e51f9e7b0af4b87"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Notion &#x2013; The all-in-one workspace for your notes, tasks, wikis, and databases.</div><div class="kg-bookmark-description">A new tool that blends your everyday work apps into one.</div></div></a></figure>]]></description><link>https://ricomnl.com/blog/htgaa-22/</link><guid isPermaLink="false">620d828ab6907e05c48ae4e0</guid><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Wed, 16 Feb 2022 23:03:50 GMT</pubDate><content:encoded><![CDATA[<p>I&apos;m participating as a committed listener in 2022&apos;s <a href="https://htgaa2022.notion.site/htgaa2022/HTGAA-2022-d39e5560ad83483ab87d415f085b60c6">How to Grow (Almost) Anything</a>. 
Here&apos;s the link to all my assignment submissions:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://htgaa22-ricomeinl.notion.site/Rico-Meinl-efa379490adc4c169e51f9e7b0af4b87"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Notion &#x2013; The all-in-one workspace for your notes, tasks, wikis, and databases.</div><div class="kg-bookmark-description">A new tool that blends your everyday work apps into one. It&#x2019;s the all-in-one workspace for you and your team</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://htgaa22-ricomeinl.notion.site/images/logo-ios.png" alt><span class="kg-bookmark-author">Notion</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://www.notion.so/images/meta/default.png" alt></div></a></figure>]]></content:encoded></item><item><title><![CDATA[Bioinformatics pipeline example from the bottom up]]></title><description><![CDATA[This tutorial is aimed at scientists and bioinformaticians who know how to work the command line and have heard about pipelines before but feel lost in the jungle of tools like Docker, Nextflow, Airflow, Reflow, Snakemake, etc. 
]]></description><link>https://ricomnl.com/blog/bottom-up-bioinformatics-pipeline/</link><guid isPermaLink="false">61b0dff49bc2fd1a4b2a2f91</guid><category><![CDATA[bioinformatics]]></category><category><![CDATA[pipelines]]></category><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Fri, 14 Jan 2022 16:34:31 GMT</pubDate><media:content url="https://ricomnl.com/content/images/2022/01/sigmund-4CNNH2KEjhc-unsplash.jpeg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><aside class="toc"></aside><!--kg-card-end: html--><img src="https://ricomnl.com/content/images/2022/01/sigmund-4CNNH2KEjhc-unsplash.jpeg" alt="Bioinformatics pipeline example from the bottom up"><p>This tutorial is aimed at scientists and bioinformaticians who know how to work the command line and have heard about pipelines before but feel lost in<a href="https://github.com/pditommaso/awesome-pipeline"> the jungle of tools</a> like Docker, Nextflow, Airflow, Reflow, Snakemake, etc. </p><p>In this post, we&apos;re gonna strip away some of the complexity and take a simple bioinformatics workflow, and build a pipeline from the bottom up. The goal is to understand the pattern of how to take some scripts written in a language like bash or python and turn them into a more streamlined (and perhaps automated) workflow.</p><p>We start by introducing the pipeline that we&apos;re going to build. In essence, it is a set of python scripts that take some data, do something with that data and save the output somewhere else. The first step to creating a minimal pipeline is writing a master shell script that sequentially runs all of these python scripts. We then use a Makefile to do the very same while explaining some of the advantages that come with it. Finally, we use Nextflow, a commonly used bioinformatics workflow tool, to wrap up our pipeline. 
If you feel adventurous, you can follow <a href="https://t-neumann.github.io/pipelines/AWS-pipeline/">this tutorial</a> on how to set up an AWS environment for Nextflow and then run your pipeline on it.</p><p>The workflow we&apos;re going to wrap in a pipeline looks like this:</p><ol><li>Take a set of .fasta protein files</li><li>Split each into peptides using a variable number of missed cleavages</li><li>Count the number of cysteines in total as well as the number of peptides that contain a cysteine</li><li>Generate an output report containing this information in a .tsv file</li><li>Create an archive to share with colleagues</li></ol><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://ricomnl.com/content/images/2022/01/Screen-Shot-2022-01-14-at-10.06.25.png" class="kg-image" alt="Bioinformatics pipeline example from the bottom up" loading="lazy" width="1672" height="268" srcset="https://ricomnl.com/content/images/size/w600/2022/01/Screen-Shot-2022-01-14-at-10.06.25.png 600w, https://ricomnl.com/content/images/size/w1000/2022/01/Screen-Shot-2022-01-14-at-10.06.25.png 1000w, https://ricomnl.com/content/images/size/w1600/2022/01/Screen-Shot-2022-01-14-at-10.06.25.png 1600w, https://ricomnl.com/content/images/2022/01/Screen-Shot-2022-01-14-at-10.06.25.png 1672w" sizes="(min-width: 720px) 720px"><figcaption>An example output protein report</figcaption></figure><p>The first part of this tutorial is influenced by <a href="http://byronjsmith.com/make-bml/">this post</a> on how to create bioinformatics pipelines with Make. I won&apos;t go into as much depth to explain Makefiles themselves, so if this is the first time you&apos;re encountering a Makefile, I&apos;d recommend going through the linked post first.</p><h1 id="setup">Setup</h1><p>Go through the box below to install the needed tools. 
I tried to make the dependencies as small as possible.</p><h2 id="mac-os">Mac OS</h2><pre><code class="language-bash"># Add project to your path for this session.
export PATH=&quot;$PATH:$(pwd)&quot;

# Open the terminal; Install utilities for homebrew
xcode-select --install

# Install homebrew
/bin/bash -c &quot;$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)&quot;

# Install python3
# Follow this tutorial: https://opensource.com/article/19/5/python-3-default-mac

# Install make
brew install make

# Install git
brew install git

# Install matplotlib
pip3 install matplotlib

# Install Nextflow (https://www.nextflow.io/docs/latest/getstarted.html)
wget -qO- https://get.nextflow.io | bash
chmod +x nextflow
## Move Nextflow to a directory in your $PATH such as /usr/local/bin
mv nextflow /usr/local/bin/</code></pre><h2 id="linux">Linux</h2><pre><code class="language-bash"># Install python3, git and make
sudo apt-get update
sudo apt-get install python3 git make

# Install matplotlib
sudo apt-get install python3-matplotlib

# Install Nextflow (https://www.nextflow.io/docs/latest/getstarted.html)
wget -qO- https://get.nextflow.io | bash
chmod +x nextflow
## Move Nextflow to a directory in your $PATH such as /usr/local/bin
mv nextflow /usr/local/bin/</code></pre><h1 id="introduction">Introduction</h1><p>In this section, we&apos;ll go through the basic intuition of what a pipeline is and why we need one. To walk through this from the ground up I chose a basic example. We have a bunch of proteins in .fasta files and want to create a report of how many cysteines each contains after it has been digested into peptides. </p><p>I created a <a href="https://github.com/ricomnl/bioinformatics-pipeline-tutorial">GitHub repository</a> which we&apos;ll be working with. To start off, <a href="https://docs.github.com/en/get-started/quickstart/fork-a-repo">fork it</a>, clone it locally, and check out the branch <code>part_00</code>. &#xA0;</p><pre><code class="language-bash"># Fork and clone the repository and switch to branch part_00
git clone https://github.com/&lt;your_git_username&gt;/bioinformatics-pipeline-tutorial.git
cd bioinformatics-pipeline-tutorial/
git checkout part_00</code></pre><p>Open the project in your favorite code editor to check out the directory structure. We have two folders: <code>bin/</code> contains the python scripts we&apos;ll use throughout this tutorial to transform our files and <code>fasta/</code> contains a set of protein .fasta files that we&apos;ll use (I went with the four Yamanaka factors but feel free to drop in whatever your favorite protein is).</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/Screen-Shot-2022-01-14-at-10.19.28.png" class="kg-image" alt="Bioinformatics pipeline example from the bottom up" loading="lazy" width="664" height="622" srcset="https://ricomnl.com/content/images/size/w600/2022/01/Screen-Shot-2022-01-14-at-10.19.28.png 600w, https://ricomnl.com/content/images/2022/01/Screen-Shot-2022-01-14-at-10.19.28.png 664w"></figure><p>Make all the scripts in <code>bin</code> executable by running the following command:</p><pre><code class="language-bash">chmod +x bin/01_digest_protein.py bin/02_count_amino_acids.py bin/03a_plot_count.py bin/03b_get_report.py</code></pre><p>Let&apos;s walk through the steps manually using <code>KLF4</code> as our protein. First, we need to digest our protein into peptides. This is what the prepared script <code>01_digest_protein.py</code> does. Feel free to open up the file and check it out. The required flags for the script are an input .fasta file and an output file path. The optional flags have default values but feel free to play around with them. For example, we can change the number of missed cleavages by appending <code>--missed_cleavages=1</code> to our command. To digest our protein, run:</p><pre><code class="language-bash">mkdir data/
bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt</code></pre><p>We now have the file <code>KLF4.peptides.txt</code> in the <code>data/</code> directory which should contain all the peptides of <code>KLF4</code> after it was digested with trypsin (you can change the digestion enzyme by passing the <code>--enzyme_regex</code> flag). </p><p>Next up, we want to count the total # of amino acids in <code>KLF4</code>, the # of cysteines, the # of peptides and how many of them contain a cysteine. To do this, we run:</p><pre><code class="language-bash">bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv</code></pre><p>We can use the <code>--amino_acid</code> flag to change the amino acid to count (defaults to cysteine == C). </p><p>We&apos;re halfway there. Now we want to a) plot each output count file as a bar plot (see below) and b) create an output report summarizing the counts for multiple proteins. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://ricomnl.com/content/images/2022/01/Screen-Shot-2022-01-11-at-16.28.41.png" class="kg-image" alt="Bioinformatics pipeline example from the bottom up" loading="lazy" width="1512" height="1022" srcset="https://ricomnl.com/content/images/size/w600/2022/01/Screen-Shot-2022-01-11-at-16.28.41.png 600w, https://ricomnl.com/content/images/size/w1000/2022/01/Screen-Shot-2022-01-11-at-16.28.41.png 1000w, https://ricomnl.com/content/images/2022/01/Screen-Shot-2022-01-11-at-16.28.41.png 1512w" sizes="(min-width: 720px) 720px"><figcaption>Barplot charts showing the number of cysteines in peptides and amino acids</figcaption></figure><p>To get the output plot we run:</p><pre><code class="language-bash"># Just show
bin/03a_plot_count.py data/KLF4.count.tsv show

# Save fig
bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png</code></pre><p>Now we run all the same steps for <code>MYC</code> and generate an output report for the two proteins.</p><pre><code class="language-bash"># Digest
bin/01_digest_protein.py fasta/MYC.fasta data/MYC.peptides.txt

# Count
bin/02_count_amino_acids.py fasta/MYC.fasta data/MYC.peptides.txt data/MYC.count.tsv

# Plot
bin/03a_plot_count.py data/MYC.count.tsv data/MYC.plot.png

# Generate Report for KLF4 and MYC
bin/03b_get_report.py data/KLF4.count.tsv data/MYC.count.tsv --output_file=data/protein_report.tsv</code></pre><p>Lastly, create an archive of the resulting output files.</p><pre><code class="language-bash"># Create a results/ folder and archive it for sharing
mkdir results
cp data/*plot.png data/protein_report.tsv results/
tar -czf data/results.tgz results
rm -r results</code></pre><p>Together these scripts implement a common workflow:</p><ol><li>Digest protein(s)</li><li>Count occurrences of amino acid in protein(s)</li><li>Plot results</li><li>Generate a report with the results</li><li>Archive the plots and report</li></ol><p>Instead of running each of the commands manually, as above, we can create a master script that runs the whole pipeline from start to finish. Our <code>run_pipeline.sh</code> looks like this:</p><pre><code class="language-bash">#!/usr/bin/env bash
# USAGE: bash run_pipeline.sh

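```shell
# Optional hardening (my addition, not part of the original script): abort on
# the first failing command, treat unset variables as errors, and make a
# pipeline fail if any stage in it fails.
set -euo pipefail
```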
mkdir -p data

# 01. Digest
bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt
bin/01_digest_protein.py fasta/MYC.fasta data/MYC.peptides.txt
bin/01_digest_protein.py fasta/PO5F1.fasta data/PO5F1.peptides.txt
bin/01_digest_protein.py fasta/SOX2.fasta data/SOX2.peptides.txt

# 02. Count
bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv
bin/02_count_amino_acids.py fasta/MYC.fasta data/MYC.peptides.txt data/MYC.count.tsv
bin/02_count_amino_acids.py fasta/PO5F1.fasta data/PO5F1.peptides.txt data/PO5F1.count.tsv
bin/02_count_amino_acids.py fasta/SOX2.fasta data/SOX2.peptides.txt data/SOX2.count.tsv

# 03a. Plot
bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png
bin/03a_plot_count.py data/MYC.count.tsv data/MYC.plot.png
bin/03a_plot_count.py data/PO5F1.count.tsv data/PO5F1.plot.png
bin/03a_plot_count.py data/SOX2.count.tsv data/SOX2.plot.png

# 03b. Generate Report
bin/03b_get_report.py data/KLF4.count.tsv \
					  data/MYC.count.tsv \
					  data/PO5F1.count.tsv \
					  data/SOX2.count.tsv \
					  --output_file=data/protein_report.tsv

# 04. Archive the results in a tarball so we can share them with a colleague
rm -rf results
mkdir results
cp data/*plot.png data/protein_report.tsv results/
tar -czf data/results.tgz results
rm -r results
</code></pre><p>Now we have a reproducible pipeline that we can easily run by calling:</p><pre><code class="language-bash">bash run_pipeline.sh</code></pre><p>We can also share it with colleagues and have some assurance that it will behave in exactly the same manner when rerun (and we don&apos;t have to worry about typos from typing each command manually).</p><p>If you&apos;re following along using your own GitHub repository, this is a good time to take a step back and commit your results.</p><pre><code class="language-bash">git init
git add .
git commit -m &quot;Finished setup&quot;
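```shell
# Optional: confirm the commit captured everything before pushing.
git log --oneline -1    # shows the commit we just made
git status --short      # prints nothing if the working tree is clean
```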
git push</code></pre><p>Let&apos;s also clean up the data folder for now, as we&apos;ll regenerate the files again in the next step:</p><pre><code class="language-bash">rm data/*</code></pre><h1 id="makefile">Makefile</h1><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text">If you&apos;re starting at this point, please check out branch &quot;part_01&quot; from the GitHub repo.</div></div><p>Now, let&apos;s say we wanted to use pie charts instead of bar plots. We could just go into <code>03a_plot_count.py</code> and change <code>plt.bar</code> to <code>plt.pie</code>, right? </p><p>Sure, but then we&apos;d have to rerun the entire script even though the first part didn&apos;t change at all. With only four files that&apos;s not a big deal, but imagine we were running this on the whole human .fasta file, or our files were just much bigger. Alas, our current pipeline is not ideal. </p><p>As I mentioned, <a href="http://byronjsmith.com/make-bml/">this post</a> gives a much deeper overview of how to create Makefiles for bioinformatics workflows. I&apos;m only covering the basics needed for our little tutorial here. There&apos;s also a <a href="https://devhints.io/makefile">great cheat sheet here</a> if you get stuck on some commands.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text"><em>Make</em> is a computer program originally designed to automate the compilation and installation of software. <em>Make</em> automates the process of building target files through a series of discrete steps. Despite its original purpose, this design makes it a great fit for bioinformatics pipelines, which often work by transforming data from one form to another (e.g. 
<em>raw data</em> &#x2192; <em>word counts</em> &#x2192; <em>???</em> &#x2192; <em>profit</em>).<br><em>Source: http://byronjsmith.com/make-bml/</em></div></div><p>Let&apos;s start by creating a Makefile and porting our first step into it.</p><pre><code class="language-bash">touch Makefile</code></pre><p>Use the text editor of your choice to add to the Makefile. The simplest possible Makefile recipe is this:</p><pre><code class="language-Makefile">targets: prerequisites
	command</code></pre><p>We want to create the file <code>KLF4.peptides.txt</code> in the <code>data/</code> folder (the target) using the <code>bin/01_digest_protein.py</code> script (the command), as before. Our input file is <code>fasta/KLF4.fasta</code> (the prerequisite). The result looks like this:</p><pre><code class="language-Makefile">data/KLF4.peptides.txt: fasta/KLF4.fasta
bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt</code></pre><p>The targets are what we create as a result of executing the commands. Now, run:</p><pre><code class="language-bash">make data/KLF4.peptides.txt</code></pre><p>Once this has executed, you should see the <code>KLF4.peptides.txt</code> file in the <code>data/</code> folder. Make will only rerun the command if the prerequisites have been modified since the target was created. </p><p>Try running <code>make data/KLF4.peptides.txt</code> again. You should get the following message, telling you that the prerequisites have not changed and therefore the target won&apos;t be different if you run it again:</p><pre><code class="language-bash">$ make data/KLF4.peptides.txt
make: &apos;data/KLF4.peptides.txt&apos; is up to date.</code></pre><p>We can get around this by changing the modification time of <code>fasta/KLF4.fasta</code>, which restores the original behavior.</p><pre><code class="language-bash">touch fasta/KLF4.fasta
make data/KLF4.peptides.txt</code></pre><p>Let&apos;s add the second step, the counting step. As a reminder: we want to create the file <code>KLF4.count.tsv</code> in the <code>data/</code> folder (the target) using the <code>bin/02_count_amino_acids.py</code> script (the command). Our input files are <code>fasta/KLF4.fasta</code> and <code>data/KLF4.peptides.txt</code> (the prerequisites). The resulting Makefile looks like this:</p><pre><code class="language-Makefile">data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt

data/KLF4.count.tsv: fasta/KLF4.fasta data/KLF4.peptides.txt
	bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv</code></pre><p>Now, try to add the plotting command (creates the file <code>data/KLF4.plot.png</code>) and the report (<code>data/protein_report.tsv</code>) yourself.</p><p>Here is the solution.</p><pre><code class="language-Makefile">data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt

data/KLF4.count.tsv: fasta/KLF4.fasta data/KLF4.peptides.txt
	bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv

data/KLF4.plot.png: data/KLF4.count.tsv
	bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png

data/protein_report.tsv: data/KLF4.count.tsv
	bin/03b_get_report.py data/KLF4.count.tsv --output_file=data/protein_report.tsv

data/results.tgz: data/KLF4.plot.png data/protein_report.tsv
	rm -rf results
	mkdir results
	cp data/KLF4.plot.png data/protein_report.tsv results/
	tar -czf data/results.tgz results
	rm -r results</code></pre><p>Let&apos;s remove all the files from the <code>data/</code> subdirectory and run Make.</p><pre><code class="language-bash">rm data/*
make data/results.tgz</code></pre><p>You&apos;ll notice that Make executes every single command in the Makefile. </p><pre><code class="language-bash">$ make data/results.tgz
bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt
bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv
bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png
bin/03b_get_report.py data/KLF4.count.tsv --output_file=data/protein_report.tsv
rm -rf results
mkdir results
cp data/KLF4.plot.png data/protein_report.tsv results/
tar -czf data/results.tgz results
rm -r results</code></pre><p>Why is that? Makefiles work in a pull-based fashion. This means that the workflow is invoked by asking for a specific output file, after which all tasks required to reproduce that file are executed. We can visualize this by looking at the dependency graph. To generate it we&apos;re using <a href="https://github.com/lindenb/makefile2graph">makefile2graph</a> and call:</p><pre><code class="language-bash">make -Bnd data/results.tgz | make2graph | dot -Tpng -o out.png</code></pre><p>We call make with our target <code>data/results.tgz</code> and in order to create it, we first need to create <code>data/KLF4.plot.png</code> and <code>data/protein_report.tsv</code>, which in turn need <code>data/KLF4.count.tsv</code>, and so on. That&apos;s why it generates all files at once.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://ricomnl.com/content/images/2022/01/out.png" class="kg-image" alt="Bioinformatics pipeline example from the bottom up" loading="lazy" width="493" height="443"><figcaption>The dependency graph of our Makefile</figcaption></figure><p>To see which files would be created without actually running anything, we can use the flag <code>--dry-run</code> or its short form <code>-n</code>. </p><pre><code class="language-bash">$ make --dry-run data/results.tgz
bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt
bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv
bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png
bin/03b_get_report.py data/KLF4.count.tsv --output_file=data/protein_report.tsv
rm -rf results
mkdir results
cp data/KLF4.plot.png data/protein_report.tsv results/
tar -czf data/results.tgz results
rm -r results</code></pre><p>It is good practice to put an <code>all</code> target at the very top of our Makefile, because the topmost recipe is the one that is built by default when calling just <code>make</code>. Add the following to the top of your Makefile:</p><pre><code class="language-Makefile">all: data/results.tgz

[...]</code></pre><p>Another common target is <code>clean:</code>. Let&apos;s add the following below the <code>all:</code> target in our Makefile:</p><pre><code class="language-Makefile">clean:
	rm -rf data/*</code></pre><p>We can now create all our files by calling <code>make all</code> and clean the <code>data/</code> folder by calling <code>make clean</code>.</p><p>We have to tell Make that <code>all:</code> and <code>clean:</code> will always refer to the targets in our Makefile and never to any files themselves, therefore we also add this to our Makefile:</p><pre><code class="language-Makefile">.PHONY: all clean</code></pre><p>Our Makefile should now look like this:</p><pre><code class="language-Makefile"># Dummy targets
all: data/results.tgz

clean:
	rm -rf data/*

.PHONY: all clean

# Analysis and plotting
data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt

data/KLF4.count.tsv: fasta/KLF4.fasta data/KLF4.peptides.txt
	bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv

data/KLF4.plot.png: data/KLF4.count.tsv
	bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png

data/protein_report.tsv: data/KLF4.count.tsv
	bin/03b_get_report.py data/KLF4.count.tsv --output_file=data/protein_report.tsv

data/results.tgz: data/KLF4.plot.png data/protein_report.tsv
	rm -rf results
	mkdir results
	cp data/KLF4.plot.png data/protein_report.tsv results/
	tar -czf data/results.tgz results
	rm -r results</code></pre><p>You might have noticed that there is a fair amount of repetition in each of the recipes. Let&apos;s take the first one and simplify it:</p><pre><code class="language-Makefile">data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt</code></pre><p>In Makefiles, the two variables <code>$^</code> and <code>$@</code> refer to the prerequisite and target of a rule so we can rewrite the above as:</p><pre><code class="language-Makefile">data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py $^ $@</code></pre><p>In fact, to make sure that our python script is also considered as a prerequisite and the recipe is rerun when our script is updated we change it like this:</p><pre><code class="language-Makefile">data/KLF4.peptides.txt: bin/01_digest_protein.py fasta/KLF4.fasta
	$^ $@</code></pre><p> After applying these transformations, our Makefile should look like this:</p><pre><code class="language-Makefile"># Dummy targets
all: data/results.tgz

clean:
	rm -rf data/*

.PHONY: all clean

# Analysis and plotting
data/KLF4.peptides.txt: bin/01_digest_protein.py fasta/KLF4.fasta
	$^ $@

data/KLF4.count.tsv: bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt
	$^ $@

data/KLF4.plot.png: bin/03a_plot_count.py data/KLF4.count.tsv
	$^ $@

data/protein_report.tsv: bin/03b_get_report.py data/KLF4.count.tsv
	$^ --output_file=$@

# Archive for sharing
data/results.tgz: data/KLF4.plot.png data/protein_report.tsv
	rm -rf results
	mkdir results
	cp $^ results/
	tar -czf $@ results
	rm -r results
</code></pre><p>Now the current Makefile only creates the KLF4 protein files. Let&apos;s add the other proteins, starting with MYC.</p><pre><code class="language-Makefile"># Analysis and plotting
data/KLF4.peptides.txt: bin/01_digest_protein.py fasta/KLF4.fasta
	$^ $@
    
data/MYC.peptides.txt: bin/01_digest_protein.py fasta/MYC.fasta
	$^ $@</code></pre><p>As you probably noticed, that would be a lot of repetition. We can use pattern rules to abstract the individual protein names away:</p><pre><code class="language-Makefile"># Analysis and plotting
data/%.peptides.txt: bin/01_digest_protein.py fasta/%.fasta
	$^ $@</code></pre><p>Had we gone ahead and blindly applied this to the last two rules, we would have gotten the following error:</p><pre><code class="language-bash">make: *** No rule to make target `data/%.count.tsv&apos;, needed by `data/%.plot.png&apos;. Stop.</code></pre><p>Why is that? Remember how Makefiles use a &#x201C;pull-based&#x201D; scheduling strategy?<br>If we were to use the wildcard <code>%</code> everywhere, we&apos;d never actually tell the Makefile what the wildcard stands for. <br>At some point, we need to define some target files that Make can use as a basis to fill in the wildcards for the others.</p><p>We insert two variables at the top and use them for the last two rules. Voil&#xE0;.</p><pre><code class="language-Makefile">COUNT := data/KLF4.count.tsv data/MYC.count.tsv \
			data/PO5F1.count.tsv data/SOX2.count.tsv
PLOT := data/KLF4.plot.png data/MYC.plot.png \
			data/PO5F1.plot.png data/SOX2.plot.png

[...]

data/protein_report.tsv: bin/03b_get_report.py ${COUNT}
	$^ --output_file=$@

# Archive for sharing
data/results.tgz: ${PLOT} data/protein_report.tsv
	rm -rf results
	mkdir results
	cp $^ results/
	tar -czf $@ results
rm -r results</code></pre><p>Now we&apos;re back on track. Let&apos;s run it:</p><pre><code class="language-bash">make</code></pre><p>You might have noticed that the pipeline took a little bit longer to process the four proteins because it ran everything sequentially. You can use the <code>--jobs</code> or <code>-j</code> flag (e.g. <code>make -j4</code> to run up to four recipes at once) to run Make in parallel. That&apos;ll give us a nice speedup.</p><p>Let&apos;s check out the dependency graph at this point.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2021/12/out-2.png" class="kg-image" alt="Bioinformatics pipeline example from the bottom up" loading="lazy" width="1712" height="539" srcset="https://ricomnl.com/content/images/size/w600/2021/12/out-2.png 600w, https://ricomnl.com/content/images/size/w1000/2021/12/out-2.png 1000w, https://ricomnl.com/content/images/size/w1600/2021/12/out-2.png 1600w, https://ricomnl.com/content/images/2021/12/out-2.png 1712w" sizes="(min-width: 720px) 720px"></figure><p>Your final Makefile should look like this:</p><pre><code class="language-Makefile">COUNT := data/KLF4.count.tsv data/MYC.count.tsv \
			data/PO5F1.count.tsv data/SOX2.count.tsv
PLOT := data/KLF4.plot.png data/MYC.plot.png \
			data/PO5F1.plot.png data/SOX2.plot.png

# Dummy targets
all: data/results.tgz

clean:
	rm -rf data/*

.PHONY: all clean

# Analysis and plotting
data/%.peptides.txt: bin/01_digest_protein.py fasta/%.fasta
	$^ $@

data/%.count.tsv: bin/02_count_amino_acids.py fasta/%.fasta data/%.peptides.txt
	$^ $@

data/%.plot.png: bin/03a_plot_count.py data/%.count.tsv
	$^ $@

data/protein_report.tsv: bin/03b_get_report.py ${COUNT}
	$^ --output_file=$@

# Archive for sharing
data/results.tgz: ${PLOT} data/protein_report.tsv
	rm -rf results
	mkdir results
	cp $^ results/
	tar -czf $@ results
	rm -r results
</code></pre><h1 id="nextflow">Nextflow</h1><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text">If you&apos;re starting at this point, please check out branch &quot;part_02&quot; from the GitHub repo.</div></div><p>We&apos;re now going to switch gears and turn to Nextflow. Makefiles are awesome but they&apos;re limited. For larger scale pipelines, the biggest limitation is that we can&apos;t easily scale them horizontally: everything has to run on the same machine. Many of the steps we&apos;ve implemented earlier, though, could easily run on multiple machines in parallel. Let&apos;s see how Nextflow helps with that. There are many tools like Nextflow, including but not limited to <a href="https://airflow.apache.org/">Airflow</a>, <a href="https://github.com/grailbio/reflow">Reflow</a>, <a href="https://snakemake.readthedocs.io/en/stable/">Snakemake</a>, etc. They all have their advantages and disadvantages but I chose Nextflow for this tutorial (and for our work at <a href="https://talus.bio">talus.bio</a>) because of its flexibility and popularity in the bioinformatics community.</p><p>Start by creating a file <code>main.nf</code>, which is a commonly used name for Nextflow entry points. We add only the <code>digestProtein</code> step for now, to keep it simple.</p><pre><code class="language-groovy">#!/usr/bin/env nextflow

// Run workflow for all .fasta files in the fasta directory
fasta = Channel.fromPath(&quot;$baseDir/fasta/*.fasta&quot;)


// Helper function to extract the protein name from the filename
def getProtein(fileName) {
  fileName.getBaseName().tokenize(&quot;.&quot;)[0]
}


// Digest a protein and save the peptides
process digestProtein {  
  input:
    path input_fasta from fasta

  output:
    path &quot;*.txt&quot; into peptides

  script:
    def protein = getProtein(input_fasta)
    &quot;&quot;&quot;
    01_digest_protein.py ${input_fasta} ${protein}.peptides.txt
    &quot;&quot;&quot;
}

peptides.view()</code></pre><p>We also create a small <code>nextflow.config</code> file that includes only the <code>publishDir</code> directive for now. Nextflow does all the work in the <code>work</code> folder and if we want our processes to export data to a different location we have to specify that with the <code>publishDir</code> directive. Here we use our <code>data/</code> folder again.</p><pre><code class="language-groovy">process {
	publishDir = [path: &quot;data/&quot;, mode: &quot;copy&quot;]
}</code></pre><p>One of the main functional differences between Nextflow and Makefiles is that Makefiles are <strong>pull-based</strong> but Nextflow is <strong>push-based</strong>. With Makefiles, we had to specify the output files we wanted to generate, and Make automatically figured out which steps it had to execute to generate them. Here we take all the <code>.fasta</code> files from our <code>fasta</code> folder and &quot;push&quot; them into the pipeline. </p><p>Nextflow uses the <a href="http://groovy-lang.org/documentation.html">Groovy</a> programming language, which is based on Java. This makes it a lot more flexible than using bash. There are two main concepts: <em>Processes</em> and <em>Channels</em>. Processes are similar to the rules in a Makefile. We specify input and output as well as a script that determines how to generate the output from the input. </p><p>Run the pipeline with <code>nextflow run main.nf</code> and you&apos;ll see that it runs the process <code>digestProtein</code> four times (once for each fasta).</p><p>We&apos;ll now add the other functions step by step. When adding the <code>countAA</code> process we notice that it also takes the <code>fasta</code> Channel as an input. Let&apos;s try to run it without changing anything.</p><pre><code class="language-groovy">#!/usr/bin/env nextflow

// Run workflow for all .fasta files in the fasta directory
fasta = Channel.fromPath(&quot;$baseDir/fasta/*.fasta&quot;)


// Helper function to extract the protein name from the filename
def getProtein(fileName) {
  fileName.getBaseName().tokenize(&quot;.&quot;)[0]
}


// Digest a protein and save the peptides
process digestProtein {
  input:
    path input_fasta from fasta

  output:
    path &quot;*.txt&quot; into peptides

  script:
    def protein = getProtein(input_fasta)
    &quot;&quot;&quot;
    01_digest_protein.py ${input_fasta} ${protein}.peptides.txt
    &quot;&quot;&quot;
}


// Count the number of times a given amino acid appears in a protein as well 
// as its peptides after digestion
process countAA {  
  input:
    path input_peptides from peptides
    path input_fasta from fasta

  output:
    path &quot;*.tsv&quot; into aa_count

  script:
    def protein = getProtein(input_peptides)
    &quot;&quot;&quot;
    02_count_amino_acids.py ${input_fasta} ${input_peptides} ${protein}.count.tsv
    &quot;&quot;&quot;
}</code></pre><p>You should&apos;ve gotten the following error:</p><pre><code class="language-bash">$ nextflow run main.nf
N E X T F L O W  ~  version 21.09.0-edge
Launching `main.nf` [clever_solvay] - revision: 84edb4b9a9
Channel `fasta` has been used twice as an input by process `countAA` and process `digestProtein`

 -- Check script &apos;main.nf&apos; at line: 31 or see &apos;.nextflow.log&apos; file for more details
[-        ] process &gt; digestProtein -</code></pre><p>To avoid this error we split our <code>fasta</code> channel into two. </p><pre><code class="language-groovy">[...]

// Run workflow for all .fasta files in the fasta directory
fasta = Channel.fromPath(&quot;$baseDir/fasta/*.fasta&quot;)
fasta.into { 
  fasta_a
  fasta_b 
}

[...]

process digestProtein {
  input:
    path input_fasta from fasta_a
[...]

process countAA {  
  input:
    path input_peptides from peptides
    path input_fasta from fasta_b
[...]</code></pre><p>We also know that the output from <code>countAA</code> goes into both <code>plotCount</code> and <code>generateReport</code> so we use the same trick as with the fasta channel. Our <code>main.nf</code> file should now look like this. Note that we used <code>.collect()</code> both in <code>generateReport</code> and <code>archiveResults</code>. By default, Nextflow would&apos;ve run these processes once for each item. In this case, we deliberately want to avoid that behavior, because our processes use all files at once.</p><pre><code class="language-groovy">#!/usr/bin/env nextflow

// Run workflow for all .fasta files in the fasta directory
fasta = Channel.fromPath(&quot;$baseDir/fasta/*.fasta&quot;)
fasta.into { 
  fasta_a
  fasta_b 
}


// Helper function to extract the protein name from the filename
def getProtein(fileName) {
  fileName.getBaseName().tokenize(&quot;.&quot;)[0]
}


// Digest a protein and save the peptides
process digestProtein {
  input:
    path input_fasta from fasta_a

  output:
    path &quot;*.txt&quot; into peptides

  script:
    def protein = getProtein(input_fasta)
    &quot;&quot;&quot;
    01_digest_protein.py ${input_fasta} ${protein}.peptides.txt
    &quot;&quot;&quot;
}


// Count the number of times a given amino acid appears in a protein as well 
// as its peptides after digestion
process countAA {  
  input:
    path input_peptides from peptides
    path input_fasta from fasta_b

  output:
    path &quot;*.tsv&quot; into aa_count_a, aa_count_b

  script:
    def protein = getProtein(input_peptides)
    &quot;&quot;&quot;
    02_count_amino_acids.py ${input_fasta} ${input_peptides} ${protein}.count.tsv
    &quot;&quot;&quot;
}


// Load the calculated counts and create a plot
process plotCount {  
  input:
    path input_count from aa_count_a

  output:
    path &quot;*.png&quot; into count_plot

  script:
    def protein = getProtein(input_count)
    &quot;&quot;&quot;
    03a_plot_count.py ${input_count} ${protein}.plot.png
    &quot;&quot;&quot;
}


// Get a list of input files from a given folder and create a report
process generateReport {  
  input:
    path input_count from aa_count_b.collect()

  output:
    path &quot;*.tsv&quot; into protein_report

  script:
    &quot;&quot;&quot;
    03b_get_report.py ${input_count} --output_file=protein_report.tsv
    &quot;&quot;&quot;
}


// Gather result files and archive them
process archiveResults {  
  input:
    path input_plot from count_plot.collect()
    path input_report from protein_report

  output:
    path &quot;*.tgz&quot; into archive_results

  script:
    &quot;&quot;&quot;
    mkdir results
    cp ${input_plot} ${input_report} results/
    tar -czf results.tgz results
    &quot;&quot;&quot;
}
</code></pre><p>So far we&apos;ve been using the &quot;old&quot; way of writing pipelines in Nextflow. I wrote the pipeline this way on purpose, in order to showcase the difference between push-based and pull-based execution. It&apos;s still a legitimate way of writing them, but Nextflow has recently released a new DSL (version 2) which makes the whole process more flexible and, in my opinion, a bit more elegant. Instead of having to think about how to connect processes, we treat them more like functions that take inputs, produce outputs, and are connected via a <code>workflow</code> block. Let&apos;s see what that would look like. Copy the current <code>main.nf</code> to <code>main_old.nf</code> as a backup; we&apos;ll then rewrite <code>main.nf</code> in place.</p><pre><code class="language-bash">cp main.nf main_old.nf</code></pre><p>We start by enabling the new DSL at the top of our file.</p><pre><code class="language-groovy">#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

[...]</code></pre><p>Then we remove all the <code>from</code> and <code>into</code> directives from our processes and add the following <code>workflow</code> block at the bottom.</p><pre><code class="language-groovy">[...]

workflow {
  // Run workflow for all .fasta files in the fasta directory
  fasta = Channel.fromPath(&quot;$baseDir/fasta/*.fasta&quot;)
  peptides = digestProtein(fasta)
  aa_count = countAA(peptides, fasta)
  count_plot = plotCount(aa_count)
  protein_report = generateReport(aa_count | collect)
  archive_results = archiveResults(count_plot | collect, protein_report)
}</code></pre><p>Last check, your <code>main.nf</code> should now look like this:</p><pre><code class="language-groovy">#!/usr/bin/env nextflow
nextflow.enable.dsl = 2


// Helper function to extract the protein name from the filename
def getProtein(fileName) {
  fileName.getBaseName().tokenize(&quot;.&quot;)[0]
}


// Digest a protein and save the peptides
process digestProtein {
  input:
    path input_fasta

  output:
    path &quot;*.txt&quot;

  script:
    def protein = getProtein(input_fasta)
    &quot;&quot;&quot;
    01_digest_protein.py ${input_fasta} ${protein}.peptides.txt
    &quot;&quot;&quot;
}


// Count the number of times a given amino acid appears in a protein as well 
// as its peptides after digestion
process countAA {  
  input:
    path input_peptides
    path input_fasta

  output:
    path &quot;*.tsv&quot;

  script:
    def protein = getProtein(input_peptides)
    &quot;&quot;&quot;
    02_count_amino_acids.py ${input_fasta} ${input_peptides} ${protein}.count.tsv
    &quot;&quot;&quot;
}


// Load the calculated counts and create a plot
process plotCount {  
  input:
    path input_count

  output:
    path &quot;*.png&quot; 

  script:
    def protein = getProtein(input_count)
    &quot;&quot;&quot;
    03a_plot_count.py ${input_count} ${protein}.plot.png
    &quot;&quot;&quot;
}


// Get a list of input files from a given folder and create a report
process generateReport {  
  input:
    path input_count

  output:
    path &quot;*.tsv&quot;

  script:
    &quot;&quot;&quot;
    03b_get_report.py ${input_count} --output_file=protein_report.tsv
    &quot;&quot;&quot;
}


// Gather result files and archive them
process archiveResults {  
  input:
    path input_plot
    path input_report

  output:
    path &quot;*.tgz&quot;

  script:
    &quot;&quot;&quot;
    mkdir results
    cp ${input_plot} ${input_report} results/
    tar -czf results.tgz results
    &quot;&quot;&quot;
}


workflow {
  // Run workflow for all .fasta files in the fasta directory
  fasta = Channel.fromPath(&quot;$baseDir/fasta/*.fasta&quot;)
  peptides = digestProtein(fasta)
  aa_count = countAA(peptides, fasta)
  count_plot = plotCount(aa_count)
  protein_report = generateReport(aa_count | collect)
  archive_results = archiveResults(count_plot | collect, protein_report)
}
</code></pre><h1 id="conclusion">Conclusion</h1><p>That&apos;s a wrap! We created a fully functional pipeline from the bottom up, covering shell scripts, Makefiles, and Nextflow, as well as the two main types of execution: push- and pull-based. We&apos;ve seen the benefits that modern tools like Nextflow can have over more traditional approaches like scripts and Makefiles. Hopefully, this tutorial provided a solid baseline for what a pipeline is and how to write one from scratch (while climbing up the ladder of complexity).</p><p>For follow-up questions or feedback on this article, you can submit an issue through <a href="https://github.com/ricomnl/bioinformatics-pipeline-tutorial/issues">the accompanying GitHub repository</a> or reach me on <a href="https://twitter.com/ricomnl">Twitter</a>.</p><p>If you want to learn more about the concepts covered in this article, check out these tutorials:</p><ul><li>Bioinformatics pipelines with Make: <a href="http://byronjsmith.com/make-bml/">http://byronjsmith.com/make-bml/</a></li><li>Bioinformatics pipelines with Nextflow: <a href="https://carpentries-incubator.github.io/workflows-nextflow/aio/index.html">https://carpentries-incubator.github.io/workflows-nextflow/aio/index.html</a></li></ul><p>If you&apos;re interested in exploring some other tools, check out these resources:</p><ul><li>A pretty extensive list of all existing pipeline tools: <a href="https://github.com/pditommaso/awesome-pipeline">https://github.com/pditommaso/awesome-pipeline</a></li><li>Nextflow vs Snakemake vs Reflow: <a href="http://blog.booleanbiotech.com/nextflow-snakemake-reflow.html">http://blog.booleanbiotech.com/nextflow-snakemake-reflow.html</a></li><li>How to choose the right one: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7906312/">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7906312/</a></li><li>This review is also interesting: <a 
href="https://academic.oup.com/bib/article/18/3/530/2562749">https://academic.oup.com/bib/article/18/3/530/2562749</a></li><li>And <a href="https://twitter.com/gauravjain49/status/1219040943380336642">these</a> <a href="https://twitter.com/michelebusby/status/1217212677896003584">comparison</a> <a href="https://twitter.com/marius/status/1129036323778486278">threads</a></li></ul><h1 id="optional-nextflow-in-the-cloud-%E2%98%81%EF%B8%8F">Optional: Nextflow in the Cloud &#x2601;&#xFE0F;</h1><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text">If you&apos;re starting at this point and want to follow along during this part, please check out branch &quot;part_03&quot; from the GitHub repo. To see the final repo state, check out branch &quot;part_04&quot;.</div></div><p>If you have followed the <a href="https://t-neumann.github.io/pipelines/AWS-pipeline/">tutorial I mentioned in the first paragraph</a> and set up your AWS environment for Nextflow, there&apos;s one more thing to mention. <br>The biggest advantage of Nextflow over shell scripts or Makefiles is its ability to easily scale into the cloud. To do that, we don&apos;t need to change anything in the <code>main.nf</code> file itself. We only need to:</p><ol><li>Set up our AWS environment</li><li>Add another executor</li></ol><p>The executor we&apos;ve been using implicitly so far looks like this (in the <code>nextflow.config</code>): </p><pre><code class="language-groovy">profiles {
	standard {
		process.executor = &quot;local&quot;
	}
}

[...]</code></pre><p>We now add the Nextflow plugin for AWS as well as our region of choice and a role to operate with. We also add an executor for AWS Batch. You can either build the Docker image I used from the Dockerfile in the repository or use <a href="https://hub.docker.com/repository/docker/rmeinl/python-plt">the one I published</a> called <code>rmeinl/python-plt</code>. <br>The <code>nextflow.config</code> file should then look like this:</p><pre><code class="language-groovy">// Profiles
profiles {
	standard {
		process.executor = &quot;local&quot;
	}
	cloud {
		process {
			executor = &quot;awsbatch&quot;
			queue = &quot;terraform-nextflow-medium-size-spot-batch-job-queue&quot;
			container = &quot;rmeinl/python-plt:latest&quot;
			// retry failed tasks (e.g. reclaimed spot instances) up to 3 times
			errorStrategy = &quot;retry&quot;
			maxRetries = 3
		}
	}
}

// Process
process {
	publishDir = [path: &quot;data/&quot;, mode: &quot;copy&quot;]
}

// Plugins
plugins {
    id &quot;nf-amazon&quot;
}

// AWS Setup
aws {
    region = &quot;us-west-2&quot;
    batch {
    	cliPath = &quot;/home/ec2-user/bin/aws&quot;
        jobRole = &quot;arn:aws:iam::622568582929:role/terraform-nextflow-batch-job-role&quot;
    }
}</code></pre><p>In order to run this whole workflow in the cloud we call:</p><pre><code class="language-bash">nextflow run main.nf -profile cloud</code></pre><p>You should now see this message indicating success:</p><pre><code class="language-bash">$ nextflow run main.nf -profile cloud
N E X T F L O W  ~  version 21.09.0-edge
Launching `main.nf` [dreamy_murdock] - revision: 7be483af55
Uploading local `bin` scripts folder to s3://terraform-nextflow-work-bucket/tmp/f4/43104ae6c68d4b50070806e54e391a/bin
executor &gt;  awsbatch (14)
[90/eabf4a] process &gt; digestProtein (3) [100%] 4 of 4 &#x2714;
[77/fec491] process &gt; countAA (4)       [100%] 4 of 4 &#x2714;
[95/e4ea25] process &gt; plotCount (4)     [100%] 4 of 4 &#x2714;
[e4/a2dff2] process &gt; generateReport    [100%] 1 of 1 &#x2714;
[4a/01e553] process &gt; archiveResults    [100%] 1 of 1 &#x2714;
Completed at: 09-Dec-2021 19:36:18
Duration    : 4m 33s
CPU hours   : (a few seconds)
Succeeded   : 14</code></pre>]]></content:encoded></item><item><title><![CDATA[Awesome Open-Source Bio/Cheminformatics]]></title><description><![CDATA[A (growing) list of open-source Bio/Cheminformatics tools that I found useful in my work. If you know other tools in this realm that I should check out, please reach out.]]></description><link>https://ricomnl.com/blog/open-source-bio-chem-informatics/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f57</guid><category><![CDATA[bioinformatics]]></category><category><![CDATA[cheminformatics]]></category><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Wed, 30 Jun 2021 20:38:46 GMT</pubDate><content:encoded><![CDATA[<p>A (growing) list of open-source Bio/Cheminformatics tools that I found useful in my work. If you know other tools in this realm that I should check out, please reach out.</p><h3 id="autodock-vina"><a href="http://vina.scripps.edu/">Autodock Vina</a></h3><p>#molecular-docking</p><ul><li>Open-source program for doing <a href="http://en.wikipedia.org/wiki/Docking_(molecular)">molecular docking</a>.</li></ul><p>Publication: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3041641/">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3041641/</a></p><p>Forks:</p><ul><li><a href="https://github.com/mwojcikowski/smina">smina</a> is a fork of AutoDock Vina that focuses on improving scoring and minimization</li><li><a href="https://qvina.github.io/">QuickVina</a> - fast and accurate molecular docking tool, attained at accurately accelerating AutoDock Vina</li><li><a href="https://github.com/gnina/gnina">Gnina</a> - molecular docking program with integrated support for scoring and optimizing ligands using convolutional neural networks. It is a fork of smina, which is a fork of AutoDock Vina</li></ul><h3 id="autodock-gpu"><a href="https://github.com/ccsb-scripps/AutoDock-GPU">Autodock GPU</a></h3><p>#molecular-docking</p><ul><li>OpenCL and Cuda accelerated version of AutoDock4.2.6. 
It leverages its embarrassingly parallelizable LGA (Lamarckian genetic algorithm) by processing ligand-receptor poses in parallel over multiple compute units.</li></ul><p>Github: <a href="https://github.com/ccsb-scripps/AutoDock-GPU">https://github.com/ccsb-scripps/AutoDock-GPU</a><br>Publication: Accelerating AutoDock4 with GPUs and Gradient-Based Local Search, <a href="https://doi.org/10.1021/acs.jctc.0c01006" rel="nofollow">J. Chem. Theory Comput. 2021, 10.1021/acs.jctc.0c01006</a></p><h3 id="virtualflow"><a href="https://virtual-flow.org/">VirtualFlow</a></h3><p>#virtual-screening</p><ul><li>VirtualFlow is a versatile, parallel workflow platform for carrying out virtual screening related tasks on Linux-based computer clusters of any type and size which are managed by a batchsystem (such as SLURM).</li></ul><p>Github: <a href="https://github.com/VirtualFlow/VFVS">https://github.com/VirtualFlow/VFVS</a><br>Publication: An open-source drug discovery platform enables ultra-large virtual screens. Nature 580, 663&#x2013;668 (2020). <a href="https://doi.org/10.1038/s41586-020-2117-z">https://doi.org/10.1038/s41586-020-2117-z</a></p><h3 id="gypsum-dl"><a href="https://durrantlab.pitt.edu/gypsum-dl/">Gypsum-DL</a></h3><p>#ligand-preparation</p><ul><li>Gypsum-DL is a free, open-source program for preparing 3D small-molecule models. Beyond simply assigning atomic coordinates, Gypsum-DL accounts for alternate ionization, tautomeric, chiral, cis/trans isomeric, and ring-conformational forms.</li></ul><p>Gitlab: <a href="https://git.durrantlab.pitt.edu/jdurrant/gypsum_dl">https://git.durrantlab.pitt.edu/jdurrant/gypsum_dl</a><br>Publication: &quot;Gypsum-DL: An Open-source Program for Preparing Small-molecule Libraries for Structure-based Virtual Screening.&quot; Journal of Cheminformatics 11:1. 
<a href="https://doi.org/10.1186/s13321-019-0358-3">doi:10.1186/s13321-019-0358-3</a></p><h3 id="lit-pcba"><a href="http://drugdesign.unistra.fr/LIT-PCBA/">LIT-PCBA</a></h3><p>#dataset</p><ul><li>15 target sets, 9780 actives and 407839 unique inactives selected from high-confidence <a href="http://drugdesign.unistra.fr/LIT-PCBA/Files/LIT-PCBA_bioactivities.xlsx">PubChem Bioassay data</a></li></ul><p>Data: <a href="http://drugdesign.unistra.fr/LIT-PCBA/">http://drugdesign.unistra.fr/LIT-PCBA/</a><br>Publication: LIT-PCBA: An Unbiased Data Set for Machine Learning and Virtual Screening. <a href="https://doi.org/10.1021/acs.jcim.0c00155">https://doi.org/10.1021/acs.jcim.0c00155</a></p><h3 id="apricot"><a href="https://apricot-select.readthedocs.io/en/latest/index.html">Apricot</a></h3><p>#submodular-optimization</p><ul><li>apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly. See the documentation page: <a href="https://apricot-select.readthedocs.io/en/latest/index.html" rel="nofollow">https://apricot-select.readthedocs.io/en/latest/index.html</a></li></ul><p>Github: <a href="https://github.com/jmschrei/apricot">https://github.com/jmschrei/apricot</a><br>Publication: <a href="https://jmlr.org/papers/volume21/19-467/19-467.pdf">https://jmlr.org/papers/volume21/19-467/19-467.pdf</a></p><h3 id="molpal"><a href="https://github.com/coleygroup/molpal">MolPal</a></h3><p>#active-learning</p><ul><li>Accelerating high-throughput virtual screening through molecular pool-based active learning.</li></ul><p>Github: <a href="https://github.com/coleygroup/molpal">https://github.com/coleygroup/molpal</a><br>Publication: <a href="https://arxiv.org/abs/2012.07127">https://arxiv.org/abs/2012.07127</a></p><h3 id="pyscreener"><a href="https://github.com/coleygroup/pyscreener">PyScreener</a></h3><p>#virtual-screening</p><ul><li>A pythonic interface to high-throughput virtual screening 
software.</li></ul><p>Github: <a href="https://github.com/coleygroup/pyscreener">https://github.com/coleygroup/pyscreener</a></p><h3 id="other-resources">Other Resources</h3><ul><li>Building a virtual ligand screening pipeline using free software: a survey.<strong> </strong><a href="https://doi.org/10.1093/bib/bbv037">https://doi.org/10.1093/bib/bbv037</a></li></ul>]]></content:encoded></item><item><title><![CDATA[How to set up your own ENS domain name]]></title><description><![CDATA[This is a short tutorial on how to set up your own ENS domain name using MetaMask and Google Chrome.]]></description><link>https://ricomnl.com/blog/how-to-set-up-ens-domain-name/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f53</guid><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Sun, 25 Apr 2021 16:11:36 GMT</pubDate><content:encoded><![CDATA[<p>This is a short tutorial on how to set up your own ENS domain name using MetaMask and Google Chrome. Normally I&apos;m using Brave but I thought doing the demos in Chrome would allow more people to access it.</p><ol><li>Go to https://ens.domains/ and click <strong>Launch App</strong>.</li></ol><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/1-launch.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>2. Enter the name you are planning to register. It can end with .eth but doesn&apos;t have to.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/2-search-name.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>3. Now we have to connect ENS to an Ethereum account. Though they offer multiple ways to connect your wallet (as shown below) we are going to use MetaMask for this tutorial. 
If you already have a MetaMask account set up, you can jump straight to step 10.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/3-connect-wallet.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>4. In order to download MetaMask, go to <a href="https://metamask.io/download.html">https://metamask.io/download.html</a> and press <strong>Install MetaMask for Chrome</strong>. You&apos;ll be redirected to the Chrome Web Store.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/4-metamask-chrome.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>5. Next, add MetaMask to Chrome by pressing <strong>Add to Chrome</strong>. After installation, you&apos;ll be redirected to the setup screen. </p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/5-add-metamask-chrome.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>6. Press <strong>Get Started</strong> and <strong>Create a Wallet</strong> unless you already have one. </p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/6-metamask-get-started.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>7. Create a password for your wallet.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/7-create-password.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>8. Store your <strong>secret</strong> backup phrase in a safe place. It makes it easy to back up and restore your account. (The only reason I&apos;m showing my phrase is because I&apos;m using a throwaway account for this tutorial. 
You should never show it to anyone.)</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/8-backup-phrase.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>9. You&apos;re all set. You should now see a page with your MetaMask account like this.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/9-account.png" class="kg-image" alt loading="lazy" width="2000" height="1213" srcset="https://ricomnl.com/content/images/size/w600/2022/01/9-account.png 600w, https://ricomnl.com/content/images/size/w1000/2022/01/9-account.png 1000w, https://ricomnl.com/content/images/size/w1600/2022/01/9-account.png 1600w, https://ricomnl.com/content/images/size/w2400/2022/01/9-account.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>10. Now that MetaMask is all set up, switch back to the ENS tab and click <strong>Connect </strong>to connect with your wallet. It&apos;ll open the same window as in step 3, but it should also include MetaMask now. You might have to refresh your browser.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/10-connect-wallet.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>11. Click <strong>MetaMask </strong>and select the account you want to authenticate with. Click <strong>Next</strong> and finally <strong>Connect</strong>. On the left side next to the ENS domain name you should now see that your account is connected to the mainnet.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/11-init-metamask.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>12. In order to pay for the domain name you need to add some Ether to your account. The fastest way to do that is a direct deposit as shown below. 
Go to <strong>Buy</strong> &gt; <strong>Directly Deposit Ether </strong> &gt; <strong>View Account </strong>to get your MetaMask Ether address. Use the wallet of your choice to send Ether to this account. It could take up to 10 minutes for your funds to arrive.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/12-deposit-eth.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>13. Click on the domain name you want to register and select the number of years you want to reserve it for (2+ years are recommended, given the gas fees). There are three steps in total as listed on the website.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/13.1-three-steps.png" class="kg-image" alt loading="lazy" width="2000" height="282" srcset="https://ricomnl.com/content/images/size/w600/2022/01/13.1-three-steps.png 600w, https://ricomnl.com/content/images/size/w1000/2022/01/13.1-three-steps.png 1000w, https://ricomnl.com/content/images/size/w1600/2022/01/13.1-three-steps.png 1600w, https://ricomnl.com/content/images/size/w2400/2022/01/13.1-three-steps.png 2400w" sizes="(min-width: 720px) 720px"></figure><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/13-register-years.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>14. <strong>Request to register</strong>: Your wallet will open and you will be asked to confirm the first of two transactions required for registration.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/14-request-to-register.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>15. <strong>Wait for 1 minute</strong>: The waiting period is required to ensure another person hasn&#x2019;t tried to register the same name and protect you after your request. 
Afterward, your screen should look like this.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/15-wait-for-1-minute.png" class="kg-image" alt loading="lazy" width="2000" height="1220" srcset="https://ricomnl.com/content/images/size/w600/2022/01/15-wait-for-1-minute.png 600w, https://ricomnl.com/content/images/size/w1000/2022/01/15-wait-for-1-minute.png 1000w, https://ricomnl.com/content/images/size/w1600/2022/01/15-wait-for-1-minute.png 1600w, https://ricomnl.com/content/images/size/w2400/2022/01/15-wait-for-1-minute.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>16. <strong>Complete Registration</strong>: Click <strong>Register</strong> and your wallet will re-open. Only after the 2nd transaction is confirmed you&apos;ll know if you got the name. This could take up to 10 minutes. As you can see, this transaction cost me about $50 in total but the gas fees are variable so it might be more or less depending on when you submit yours.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/16-complete-registration.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>17. After the registration is completed, you should see the name show up under <strong>My Account</strong>.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/17-my-account.png" class="kg-image" alt loading="lazy" width="2000" height="1220" srcset="https://ricomnl.com/content/images/size/w600/2022/01/17-my-account.png 600w, https://ricomnl.com/content/images/size/w1000/2022/01/17-my-account.png 1000w, https://ricomnl.com/content/images/size/w1600/2022/01/17-my-account.png 1600w, https://ricomnl.com/content/images/size/w2400/2022/01/17-my-account.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>18. Click <strong>Reverse record: not set</strong>. 
Select your ENS name, then click <strong>Save</strong>, and submit the transaction to save it on the blockchain.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/18-reverse-record.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>19. After about 10 minutes you should see that your reverse record has been set up successfully.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/19-record-success.png" class="kg-image" alt loading="lazy" width="2000" height="1220" srcset="https://ricomnl.com/content/images/size/w600/2022/01/19-record-success.png 600w, https://ricomnl.com/content/images/size/w1000/2022/01/19-record-success.png 1000w, https://ricomnl.com/content/images/size/w1600/2022/01/19-record-success.png 1600w, https://ricomnl.com/content/images/size/w2400/2022/01/19-record-success.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>20. In order to add some records, click on your name in the list below. You should see that it already points to your Ethereum address. Click on <strong>Add/Edit Record</strong>. I&apos;m going to add my BTC address, my website, and my Twitter and GitHub handles.</p><p>21. Finally, confirm the transaction and submit it to the blockchain via MetaMask. This should take another 10 minutes.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/21-confirm-transaction.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>22. That&apos;s it! You can now use a browser like Opera or Brave to check whether everything worked out. I&apos;m using Brave here, which will initially ask for confirmation to redirect via ENS. You should then see your record. 
If you have neither Brave nor Opera, just go to <a href="https://app.ens.domains/name/rmeinl.eth">https://app.ens.domains/name/&lt;your_domain&gt;.eth</a></p><hr><p>We just walked through how to set up your own ENS domain name using MetaMask and Chrome. To give you a rough idea about the costs, the whole process cost me $97.86. Here&apos;s the breakdown: </p><ul><li>$6.17 for step 14 (initial request)</li><li>$46.26 for step 16 (paying for the name)</li><li>$21.48 for step 18 (setting up the reverse record)</li><li>$23.95 for step 21 (adding custom records)</li></ul><p>Obviously, the majority of these costs are gas fees; you only pay ENS for step 16, so it will vary for you depending on when you set up yours.</p><p>Hope this was helpful!</p>]]></content:encoded></item><item><title><![CDATA[Setting up Virtual Flow on AWS using Parallelcluster and Slurm]]></title><description><![CDATA[<p>This is a short tutorial on how to set up AWS <a href="https://aws.amazon.com/hpc/parallelcluster/">Parallelcluster</a> with Slurm to run <a href="https://virtual-flow.org/">VirtualFlow</a>. </p><blockquote>VirtualFlow is a versatile, parallel workflow platform for carrying out virtual screening related tasks on Linux-based computer clusters of any type and size which are managed by a batchsystem (such as SLURM).</blockquote><h2 id="aws-parallelcluster-with-slurm">AWS</h2>]]></description><link>https://ricomnl.com/blog/setting-up-virtual-flow-on-aws/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f51</guid><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Mon, 19 Apr 2021 10:27:48 GMT</pubDate><content:encoded><![CDATA[<p>This is a short tutorial on how to set up AWS <a href="https://aws.amazon.com/hpc/parallelcluster/">Parallelcluster</a> with Slurm to run <a href="https://virtual-flow.org/">VirtualFlow</a>. 
</p><blockquote>VirtualFlow is a versatile, parallel workflow platform for carrying out virtual screening related tasks on Linux-based computer clusters of any type and size which are managed by a batchsystem (such as SLURM).</blockquote><h2 id="aws-parallelcluster-with-slurm">AWS Parallelcluster with Slurm</h2><h3 id="creating-our-working-environment">Creating our working environment</h3><p>First, we&apos;ll create our working directory and set up a virtual environment using Poetry. We need to add the <code>awscli</code> and <code>aws-parallelcluster</code> packages.</p><pre><code>mkdir parallel_cluster
cd parallel_cluster
poetry init
poetry add awscli aws-parallelcluster</code></pre><h3 id="setting-up-the-cluster-config">Setting up the cluster config</h3><p>To set up the AWS Parallelcluster I mainly followed <a href="https://aws.amazon.com/blogs/opensource/aws-parallelcluster/">this post</a>. We start by creating the config for our cluster. Make sure to create an <a href="https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#KeyPairs">EC2 key pair</a> beforehand.</p><pre><code> $ poetry run pcluster configure                  
Allowed values for AWS Region ID:
1. ap-northeast-1
2. ap-northeast-2
3. ap-south-1
4. ap-southeast-1
5. ap-southeast-2
6. ca-central-1
7. eu-central-1
8. eu-north-1
9. eu-west-1
10. eu-west-2
11. eu-west-3
12. sa-east-1
13. us-east-1
14. us-east-2
15. us-west-1
16. us-west-2
AWS Region ID [us-west-2]: 16
Allowed values for EC2 Key Pair Name:
1. parallelcluster
EC2 Key Pair Name [parallelcluster]: 1
Allowed values for Scheduler:
1. sge
2. torque
3. slurm
4. awsbatch
Scheduler [slurm]: 3
Allowed values for Operating System:
1. alinux
2. alinux2
3. centos7
4. centos8
5. ubuntu1604
6. ubuntu1804
Operating System [alinux2]: 2
Minimum cluster size (instances) [0]: 1
Maximum cluster size (instances) [10]: 
Head node instance type [t2.micro]: c4.large
Compute instance type [t2.micro]: c4.xlarge
Automate VPC creation? (y/n) [n]: y</code></pre><p>We should now have a config file similar to this:</p><pre><code>$ cat ~/.parallelcluster/config 
[aws]
aws_region_name = us-west-2

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[global]
cluster_template = default
update_check = true
sanity_check = true

[vpc default]
vpc_id = vpc-*****************
master_subnet_id = subnet-*****************

[cluster default]
key_name = parallelcluster
scheduler = slurm
master_instance_type = c4.large
base_os = alinux2
vpc_settings = default
queue_settings = compute

[queue compute]
enable_efa = false
enable_efa_gdr = false
compute_resource_settings = default

[compute_resource default]
instance_type = c4.xlarge
min_count = 1</code></pre><h3 id="creating-the-cluster">Creating the cluster</h3><p>After the config file is set, we can create our cluster using the following commands. AWS will then spin up our CloudFormation stack which will take a couple of minutes.</p><pre><code>$ poetry run pcluster create test-cluster
Beginning cluster creation for cluster: test-cluster
Creating stack named: parallelcluster-test-cluster
...</code></pre><p>In order to access our head node we can run the following:</p><pre><code>poetry run pcluster ssh test-cluster -i ~/.ssh/&lt;key_name&gt;</code></pre><h2 id="virtualflow">VirtualFlow</h2><p>To get started with VirtualFlow I recommend running through the <a href="https://docs.virtual-flow.org/tutorials/-LdE94b2AVfBFT72zK-v/vfvs-tutorial-1/introduction">first tutorial</a> to make sure the cluster has been set up correctly. I&apos;ll only go through the changes that need to be made and list the other steps solely for completeness; the tutorial does a good job of explaining each individual step.</p><h3 id="setting-up-virtualflow">Setting up VirtualFlow</h3><p>First, we download the tutorial files and unzip them.</p><pre><code>$ wget https://virtual-flow.org/sites/virtual-flow.org/files/tutorials/VFVS_GK.tar
$ tar -xvf VFVS_GK.tar
$ cd VFVS_GK/tools
</code></pre><h3 id="preparing-the-config-files">Preparing the config files</h3><p>There are two files in which we need to make changes. We want to make sure our batch system is set to &apos;SLURM&apos; and change the partition to &apos;compute&apos;, which is the default name when we use AWS Parallelcluster. </p><pre><code># tools/templates/all.ctrl
...
batchsystem=SLURM
# Possible values: SLURM, TORQUE, PBS, LSF, SGE
# Settable via range control files: No
...
partition=compute
# Partitions are also called queues in some batchsystems
# Settable via range control files: Yes</code></pre><p>If &apos;compute&apos; doesn&apos;t work, try running the following command to retrieve the correct partition name: </p><pre><code>$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
compute*     up   infinite      5  idle~ compute-dy-c4xlarge-[5-9] 
compute*     up   infinite      5  alloc compute-dy-c4xlarge-[1-4],compute-st-c4xlarge-1 </code></pre><p>The second config file we need to adjust is the Slurm job template script. Usually we should be able to leave all the default values but I ran into this error:</p><pre><code>srun: error: Unable to create step for job 874794: Memory required by task is not available</code></pre><p>In order to solve it, we simply comment out the line with the --mem-per-cpu parameter.</p><pre><code># Slurm Settings
###############################################################################

#SBATCH --job-name=h-1.1
##SBATCH --mail-user=To be completed if uncommented
#SBATCH --mail-type=fail
#SBATCH --time=00-12:00:00
##SBATCH --mem-per-cpu=1024M
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=main
#SBATCH --output=../workflow/output-files/jobs/job-1.1_%j.out           # File to which standard out will be written
#SBATCH --error=../workflow/output-files/jobs/job-1.1_%j.out            # File to which standard err will be written
#SBATCH --signal=10@300</code></pre><p>As a last preparation step, we go back to the tools/ subfolder and run this command:</p><pre><code>./vf_prepare_folders.sh</code></pre><p>More details here: <a href="https://docs.virtual-flow.org/tutorials/-LdE94b2AVfBFT72zK-v/vfvs-tutorial-1/setting-up-the-workflow">https://docs.virtual-flow.org/tutorials/-LdE94b2AVfBFT72zK-v/vfvs-tutorial-1/setting-up-the-workflow</a>.</p><h3 id="starting-the-jobs">Starting the jobs</h3><p>To spin up our nodes, we run this command:</p><pre><code>./vf_start_jobline.sh 1 12 templates/template1.slurm.sh submit 1</code></pre><p>More details can be found here: <a href="https://docs.virtual-flow.org/tutorials/-LdE94b2AVfBFT72zK-v/vfvs-tutorial-1/starting-the-workflow">https://docs.virtual-flow.org/tutorials/-LdE94b2AVfBFT72zK-v/vfvs-tutorial-1/starting-the-workflow</a>.</p><h3 id="monitoring-and-wrapping-up">Monitoring and Wrapping Up</h3><p>To monitor the jobs and view the files after completion, I recommend the respective sections of the tutorial: </p><p><a href="https://docs.virtual-flow.org/tutorials/-LdE94b2AVfBFT72zK-v/vfvs-tutorial-1/monitoring-the-workflow">Monitoring</a></p><p><a href="https://docs.virtual-flow.org/tutorials/-LdE94b2AVfBFT72zK-v/vfvs-tutorial-1/the-completed-workflow">Completed Workflow</a></p><h2 id="using-our-own-files">Using our own files</h2><p>Running the same workflow with our own files is pretty straightforward. After downloading the template files in the &apos;Setting up VirtualFlow&apos; step, we need to replace the ligand library as well as our target protein. 
</p><h3 id="replacing-the-ligand-library">Replacing the ligand library</h3><p>The second tutorial in the VirtualFlow documentation has <a href="https://docs.virtual-flow.org/tutorials/-LdE94b2AVfBFT72zK-v/tutorial-2-vfvs-scratch/setting-up-the-workflow#preparing-the-input-files-folder">a section dedicated</a> to this.</p><h3 id="using-a-different-protein">Using a different protein</h3><p>Here, I downloaded <a href="http://vina.scripps.edu/download.html">AutoDock Vina</a> together with <a href="http://mgltools.scripps.edu/downloads">MGLTools</a> and followed the <a href="http://vina.scripps.edu/tutorial.html">tutorial</a> on <a href="http://vina.scripps.edu/tutorial.html">http://vina.scripps.edu</a>, which looks outdated but still works fine. We can use AutoDock Vina to convert our protein from .pdb to .pdbqt and use the &apos;GridBox&apos; tool to get the necessary parameters for the respective receptor config file. </p><pre><code># ../input-files/smina_rigid_receptor1/config.txt
receptor = ../input-files/receptor/&lt;protein&gt;.pdbqt
center_x = 28.614
center_y = 15.838
center_z = -2.045
size_x = 36.0
size_y = 32.0
size_z = 36.0
exhaustiveness = 4
scoring = vinardo
cpu = 1</code></pre><p>We add our protein to the folder and change both the smina (/input-files/smina_rigid_receptor1) and qvina receptor (/input-files/qvina02_rigid_receptor1) config files. </p><p>That&apos;s it. Now we can follow the rest of the steps outlined in the &apos;VirtualFlow&apos; section above.</p>]]></content:encoded></item><item><title><![CDATA[The 80/20 Computer Science Degree]]></title><description><![CDATA[Nand to Tetris was created by two CS professors, Noam Nisan and Shimon Schocken. In a nutshell, you'll build your own computer in a bottom-up fashion all the way up from NAND gates. ]]></description><link>https://ricomnl.com/blog/nand2tetris/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f50</guid><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Tue, 30 Mar 2021 22:12:56 GMT</pubDate><content:encoded><![CDATA[<p>DRAFT/DISCLAIMER &#x2014; This post is a submission to a competition on <a href="http://1729.com/" rel="noopener noreferrer">1729.com</a>. No prizes will be awarded for any submissions at this time. Learn more at <a href="https://1729.com/decentralized-task-creation" rel="noopener noreferrer">1729.com/decentralized-task-creation</a>. </p><p>Nonetheless, I highly recommend everyone who is interested in CS to take this course.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2021/03/nand2tetris-1.png" class="kg-image" alt loading="lazy" width="2000" height="1216" srcset="https://ricomnl.com/content/images/size/w600/2021/03/nand2tetris-1.png 600w, https://ricomnl.com/content/images/size/w1000/2021/03/nand2tetris-1.png 1000w, https://ricomnl.com/content/images/size/w1600/2021/03/nand2tetris-1.png 1600w, https://ricomnl.com/content/images/2021/03/nand2tetris-1.png 2000w" sizes="(min-width: 720px) 720px"></figure><p>Listening to podcasts with people in tech, you&apos;ll often hear that they got interested in the field because they built their own computers or coded their own games. 
Elon, for example, sold his first computer game at the age of 12 and built custom computers for others in university. </p><p>Now that most people have laptops, it becomes harder to just open them up, check what&apos;s inside and put them back together. Of course, you could go and buy all the parts separately or get <a href="https://www.playpiper.com/">a DIY kit</a>. Though this might not be logistically feasible for everyone, which is a shame, because this kind of tinkering is a great learning vehicle for anything related to Computer Science. </p><p>What if you could virtualize the whole experience while being guided by some world-class CS professors? Enter <a href="https://www.nand2tetris.org/">Nand to Tetris</a>.</p><h2 id="a-game-changer-in-cs-education">A game changer in CS education</h2><p>Nand to Tetris was created by two CS professors, <a href="http://www.cs.huji.ac.il/~noam/" rel="noopener">Noam Nisan</a> and <a href="http://www.shimonschocken.com/" rel="noopener">Shimon Schocken</a>. In a nutshell, you&apos;ll build your own computer in a bottom-up fashion all the way up from <a href="https://en.wikipedia.org/wiki/NAND_gate#:~:text=In%20digital%20electronics%2C%20a%20NAND,HIGH%20(1)%20output%20results.">NAND gates</a>. <br>In the process, you&apos;ll get hands-on coverage of most of the important ideas and techniques in applied computer science, focusing on computer architecture, compilation, and software engineering, in one course. Nand to Tetris also provides a hands-on overview of key data structures and algorithms, as they unfold in the context of 12 captivating hardware and software development projects.</p><blockquote>Nand to Tetris courses are now taught at 200+ universities and high schools around the world. The students who take them range from high school students to Ph.D. 
students to Google engineers.</blockquote><h2 id="task-earn-500-in-btc">Task: Earn $500 in BTC</h2><h3 id="complete-all-12-projects-and-submit-a-link-to-the-github-project-repository">Complete all 12 projects and submit a link to the Github project repository</h3><ol><li><a href="https://drive.google.com/file/d/1MY1buFHo_Wx5DPrKhCNSA2cm5ltwFJzM/view">Boolean Logic</a></li><li><a href="https://b1391bd6-da3d-477d-8c01-38cdf774495a.filesusr.com/ugd/56440f_2e6113c60ec34ed0bc2035c9d1313066.pdf">Boolean Arithmetic</a></li><li><a href="https://b1391bd6-da3d-477d-8c01-38cdf774495a.filesusr.com/ugd/56440f_e458602dcb0c4af9aaeb7fdaa34bb2b4.pdf">Sequential Logic</a></li><li><a href="https://b1391bd6-da3d-477d-8c01-38cdf774495a.filesusr.com/ugd/56440f_12f488fe481344328506857e6a799f79.pdf">Machine Language</a></li><li><a href="https://b1391bd6-da3d-477d-8c01-38cdf774495a.filesusr.com/ugd/56440f_96cbb9c6b8b84760a04c369453b62908.pdf">Computer Architecture</a></li><li><a href="https://b1391bd6-da3d-477d-8c01-38cdf774495a.filesusr.com/ugd/56440f_65a2d8eef0ed4e0ea2471030206269b5.pdf">Assembler</a></li><li><a href="https://drive.google.com/file/d/19fe1PeGnggDHymu4LlVY08KmDdhMVRpm/view">Virtual Machine I: Stack Arithmetic</a></li><li><a href="https://drive.google.com/file/d/1lBsaO5XKLkUgrGY6g6vLMsiZo6rWxlYJ/view">Virtual Machine II: Program Control</a></li><li><a href="https://drive.google.com/file/d/1rbHGZV8AK4UalmdJyivgt0fpPiD1Q6Vk/view">High Level Language</a></li><li><a href="https://drive.google.com/file/d/1ujgcS7GoI-zu56FxhfkTAvEgZ6JT7Dxl/view">Compiler I: Syntax Analysis</a></li><li><a href="https://drive.google.com/file/d/1DfGKr0fuJcCvlIPABNSg7fsLfFFqRLex/view">Compiler II: Code Generation</a></li><li><a href="https://drive.google.com/file/d/137PiYjt4CAZ3ROWiD0DJ8XMUbMM0_VHR/view">Operating System</a></li><li>Tetris</li></ol><p>During these 12 projects you will build your own Assembler, Virtual Machine, Java-like High Level Language, Compiler and Operating System. 
In the optional 13th project you can tie all these things together to write an implementation of Tetris or any other game of your choice using all the components you previously built.</p><p>There is a guided Coursera course with <a href="https://www.coursera.org/learn/build-a-computer">two</a> <a href="https://www.coursera.org/learn/nand2tetris2">parts</a> but just using the links above or the <a href="https://www.amazon.com/Elements-Computing-Systems-Building-Principles/dp/0262640686/ref=ed_oe_p">book</a> works perfectly fine. <a href="https://www.youtube.com/watch?v=wTl5wRDT0CU">Check out the introduction video here</a>. Some inspirational projects can be found <a href="https://www.nand2tetris.org/copy-of-talks">here</a>.</p><!--kg-card-begin: html--><iframe src="https://docs.google.com/forms/d/e/1FAIpQLSfZ4OSXgNF7mvOJt4q65xd-g2SeRNIgPqFSpmHYOLBJVpCSSg/viewform?embedded=true" width="640" height="1451" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe><!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[Sequential and simultaneous modes of awareness]]></title><description><![CDATA[<p>The most interesting part in Ted Chiang&apos;s &quot;Story of your life&quot; is the parallel of the causal and teleological explanation with a sequential and simultaneous mode of awareness.</p><p>Fermat&apos;s principle of least time can be interpreted in terms of cause and effect: a difference</p>]]></description><link>https://ricomnl.com/blog/sequential-and-simultaneous-modes-of-awareness/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f4f</guid><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Mon, 30 Nov 2020 17:24:24 GMT</pubDate><content:encoded><![CDATA[<p>The most interesting part in Ted Chiang&apos;s &quot;Story of your life&quot; is the parallel of the causal and teleological explanation with a sequential and simultaneous mode of awareness.</p><p>Fermat&apos;s principle of least time can be interpreted in terms of cause and 
effect: a difference in the index of refraction caused the light ray to change direction when it hit the surface of the water. This is most intuitive to us humans. </p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2020/11/Swimmer-and-Lifeguard-2.jpg" class="kg-image" alt loading="lazy" width="960" height="720" srcset="https://ricomnl.com/content/images/size/w600/2020/11/Swimmer-and-Lifeguard-2.jpg 600w, https://ricomnl.com/content/images/2020/11/Swimmer-and-Lifeguard-2.jpg 960w" sizes="(min-width: 720px) 720px"></figure><p>It can also be interpreted teleologically: the ray of light has to know where its destination is in order to compute the path of least time. This is more intuitive to the heptapods.</p><p>The parallel to the causal explanation is a sequential mode of awareness: experiencing events in order, and perceiving their relationship as cause and effect. This is how humans experience things. We don&apos;t know the future and are therefore able to exercise free will.</p><p>The parallel to the teleological explanation is a simultaneous mode of awareness: experiencing events all at once, and perceiving a purpose underlying them all. This is how heptapods experience. They already know the future, so freedom is meaningless and every act is performative*.</p><p>If you have free will, it&apos;s impossible to know about the future because you could change it. On the other side, if you know the future you cannot act freely anymore. (as in the example of the book of ages).</p><p>Sequential and simultaneous modes of awareness are like the optical illusion of the old and young lady. 
Both are valid but you can&apos;t see them at the same time.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2020/11/rQkQZ6pDZbEHz23rxckWPm.png" class="kg-image" alt loading="lazy" width="900" height="1235" srcset="https://ricomnl.com/content/images/size/w600/2020/11/rQkQZ6pDZbEHz23rxckWPm.png 600w, https://ricomnl.com/content/images/2020/11/rQkQZ6pDZbEHz23rxckWPm.png 900w" sizes="(min-width: 720px) 720px"></figure><hr><p>*Performative language: Saying equals doing.<br>	<em>Example</em>: At a wedding ceremony everybody knows that at the end the pastor will pronounce the couple husband and wife but it doesn&apos;t count until he actually says it.</p>]]></content:encoded></item><item><title><![CDATA[The best way to encompass the future is by building a strong set of beliefs.]]></title><description><![CDATA[<p>Using claims as a first-class citizen in your thinking helps you move towards strong beliefs.</p><p>If you don&apos;t explicitly state your claims you&apos;re never going to move in any direction. Everything will seem kind of relevant and worth pursuing.</p><p>Writing down claims can help manifest them.</p>]]></description><link>https://ricomnl.com/blog/the-best-way-to-encompass-the-future-is-by-building-a-strong-set-of-beliefs/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f4e</guid><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Wed, 28 Oct 2020 22:13:50 GMT</pubDate><content:encoded><![CDATA[<p>Using claims as a first-class citizen in your thinking helps you move towards strong beliefs.</p><p>If you don&apos;t explicitly state your claims you&apos;re never going to move in any direction. Everything will seem kind of relevant and worth pursuing.</p><p>Writing down claims can help manifest them. It helps to understand their implications, as well as supporting and opposing claims.</p><p>Claims eventually turn into beliefs and beliefs give perspective. They act like gravity. 
Strong beliefs are something that new information can be attached to.</p><p>Related:</p><blockquote><a href="https://www.saffo.com/02008/07/26/strong-opinions-weakly-held/">The best way to get to a good forecast is by making predictions with limited information and trying to find opposing evidence; then using the accumulated insights to improve your predictions. </a></blockquote><figure class="kg-card kg-embed-card kg-card-hascaption"><blockquote class="twitter-tweet" data-width="550"><p lang="en" dir="ltr">The best way to encompass the future is by building a strong set of beliefs.</p>&#x2014; Rico Meinl (@rmeinl) <a href="https://twitter.com/rmeinl/status/1321574734161694720?ref_src=twsrc%5Etfw">October 28, 2020</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<figcaption>Discussed on Twitter.</figcaption></figure>]]></content:encoded></item><item><title><![CDATA[Why we gain compounding benefits from incremental knowledge tools]]></title><description><![CDATA[<p>Knowledge and productivity are like compound interest. As knowledge workers, we live on the margins and every seemingly little improvement can add up to that compound in the long run.</p><p><a href="https://notes.andymatuschak.org/Knowledge_work_should_accrete">The more you know, the more you learn</a>; the more you learn, the more you can do; the more you</p>]]></description><link>https://ricomnl.com/blog/the-marginal-benefits-we-gain-from-knowledge-tools-are/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f4d</guid><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Tue, 27 Oct 2020 11:43:49 GMT</pubDate><content:encoded><![CDATA[<p>Knowledge and productivity are like compound interest. As knowledge workers, we live on the margins and every seemingly little improvement can add up to that compound in the long run.</p><p><a href="https://notes.andymatuschak.org/Knowledge_work_should_accrete">The more you know, the more you learn</a>; the more you learn, the more you can do; the more you can do, the more the opportunity. </p><p>With the old file cabinet like note taking systems there was literally no gain when going from 10 notes to 10.000 notes. It was probably more of a downward linear trend because of the growing lack of structure. With graph-based tools like Roam Research, your knowledge management system can improve almost exponentially the more you add to it (if done right). The increasing number of notes allows for ever more unexpected connections.</p><p>Roam Research is also an IDE for knowledge work and enables us to treat notes as composable blocks of knowledge. 
<a href="https://notes.andymatuschak.org/z7DvEiUpF6dYkFGbpZZTBKQVM9jjNnx8D8Xzu">Text is not as composable as code or graphic elements.</a></p><p>But as the Zettelkasten shows, the notes that contribute to an idea and eventually to a piece of content are very much composable. Knowledge systems that compose and have atomic statements make it much easier to write and publish.</p><p>The interface of Roam is mouldable and we can build our own meta-tools on top of it. The question for all the builders will be if we can make the new meta-tools for knowledge as valuable as the meta-tools for programming.</p><figure class="kg-card kg-embed-card kg-card-hascaption"><blockquote class="twitter-tweet" data-width="550"><p lang="en" dir="ltr">When you zoom out and look at the bigger picture, a tool like <a href="https://twitter.com/RoamResearch?ref_src=twsrc%5Etfw">@RoamResearch</a> perhaps makes you 5% more productive in the short term. I realized today why this still matters a lot:</p>&#x2014; Rico Meinl (@rmeinl) <a href="https://twitter.com/rmeinl/status/1320877510586966017?ref_src=twsrc%5Etfw">October 26, 2020</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<figcaption>Discussed on Twitter.</figcaption></figure>]]></content:encoded></item><item><title><![CDATA[Embed Twitter Threads in Roam Research]]></title><description><![CDATA[ Paste a tweet url into Roam. The thread is then copied to your clipboard. Paste it into Roam via CMD+V (Mac) CTRL-V (Windows).]]></description><link>https://ricomnl.com/blog/embed-tweet-threads-in-roam-research/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f4c</guid><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Fri, 23 Oct 2020 18:50:23 GMT</pubDate><content:encoded><![CDATA[<h3></h3><p><a href="https://chrome.google.com/webstore/detail/scify/kedfefpmgjcfidhnabfodadhnlabpili">Get it here</a></p><p>How to use:<br>- Paste a tweet url into Roam. <br>- The thread is then copied to your clipboard. <br>- Paste it into Roam via CMD+V (Mac) CTRL-V (Windows).</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2020/10/roam-twitter.gif" class="kg-image" alt loading="lazy" width="800" height="374"></figure>]]></content:encoded></item><item><title><![CDATA[Recommender Systems: The Most Valuable Application of Machine Learning (Part 2)]]></title><description><![CDATA[Why Recommender Systems are the most valuable application of Machine Learning and how Machine Learning-driven Recommenders already drive almost every aspect of our lives.]]></description><link>https://ricomnl.com/blog/recommender-systems-the-most-valuable-application-of-machine-learning-part-2/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f4b</guid><category><![CDATA[recommender systems]]></category><category><![CDATA[machine learning]]></category><category><![CDATA[data science]]></category><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Sun, 04 Oct 2020 20:50:12 GMT</pubDate><content:encoded><![CDATA[<p>Why Recommender Systems are the most valuable application of Machine Learning and how Machine Learning-driven Recommenders already drive almost every aspect of our 
lives.</p><p><a href="https://towardsdatascience.com/recommender-systems-the-most-valuable-application-of-machine-learning-2bc6903c63ce">Read this article on Medium.</a></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*F2mBbZRHPXxa3cyg3z8esg.png" class="kg-image" alt loading="lazy"><figcaption>Recommender Systems already drive almost every aspect of our daily&#xA0;lives.</figcaption></figure><hr><p>This is the second part of the article published on 11 May. In the first part I covered:</p><ul><li>Business Value</li><li>Problem Formulation</li><li>Data</li><li>Algorithms</li></ul><p>In this second part I will cover the following topics:</p><ul><li>Evaluation Metrics</li><li>User Interface</li><li>Cold-start Problem</li><li>Exploration vs. Exploitation</li><li>The Future of Recommender Systems</li></ul><p>Throughout this article, I will continue to use examples of the companies that have built the most widely used systems over the last couple of years, including Airbnb, Amazon, Instagram, LinkedIn, Netflix, Spotify, Uber Eats, and YouTube.</p><hr><h3 id="evaluation-metrics">Evaluation Metrics</h3><p>Now that we have the algorithm for our Recommender System, we need to find a way to evaluate its performance. As with every Machine Learning model, there are two types of evaluation:</p><ol><li>Offline Evaluation</li><li>Online Evaluation</li></ol><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*HjqVASOg7qdIUEafmYK9rg.png" class="kg-image" alt loading="lazy"><figcaption>Offline/Online Testing Framework</figcaption></figure><p>Generally speaking, we can consider the Offline Evaluation metrics as <em>low-level</em> metrics, that are usually easily measurable. The most well-known example would be Netflix choosing to use <em>root mean squared error</em> (RMSE) as a proxy metric for their Netflix Prize Challenge. 
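As a toy illustration of such a low-level proxy metric (the ratings below are invented, not Netflix data), RMSE is easy to compute by hand:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and observed ratings."""
    assert len(predicted) == len(actual) and predicted
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted))

# Five titles with predicted vs. actual star ratings (invented numbers)
print(round(rmse([3.5, 4.0, 2.0, 5.0, 1.5], [4, 4, 1, 5, 2]), 4))  # 0.5477
```

Lower is better; the Netflix Prize famously asked for a 10% RMSE improvement over Netflix's own Cinematch baseline.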
The Online Evaluation metrics are the <em>high-level</em> business metrics that only become measurable once we ship our model into the real world and test it with real users. Some examples include customer retention, click-through rate, or user engagement.</p><h4 id="offline-evaluation">Offline Evaluation</h4><p>As most of the existing Recommender Systems consist of two stages (candidate generation and ranking), we need to pick the right metrics for each stage. For the candidate generation stage,<a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf" rel="noopener"> YouTube</a>, for instance, focuses on <strong>high recall</strong> (<em>&#x201C;how many of the relevant videos did we find&#x201D;</em>). This makes sense given that in the first stage we want to filter down to a smaller set of videos whilst making sure no potentially relevant ones are lost. In the second stage, presenting a few &#x201C;best&#x201D; recommendations in a list requires <strong>high precision</strong> (<em>&#x201C;out of all the videos that were pre-selected, how many are relevant&#x201D;</em>) and a fine-level representation to distinguish relative importance among the candidates.</p><p>Most of the examples use the standard evaluation metrics of the Machine Learning community: from ranking measures, such as normalized discounted cumulative gain, mean reciprocal rank, or fraction of concordant pairs, to classification metrics including accuracy, precision, recall, or F-score.</p><p><a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/" rel="noopener">Instagram formulated</a> the optimization function of their final pass model a little differently:</p><blockquote>We predict individual actions that people take on each piece of media, whether they&#x2019;re positive actions such as like and save, or negative actions such as &#x201C;See Fewer Posts Like This&#x201D; 
(SFPLT). We use a multi-task multi-label (MTML) neural network to predict these events.</blockquote><p>As appealing as offline experiments are, they have a major drawback: they assume that members would have behaved in the same way, for example, playing the same videos, if the new algorithm being evaluated had been used to generate the recommendations. That&#x2019;s why we need online evaluation to measure the actual impact our model has on the higher-level business metrics.</p><h4 id="online-evaluation">Online Evaluation</h4><p>The approach to be aware of here is A/B testing. There are many interesting and exhaustive articles/<a href="https://www.udacity.com/course/ab-testing--ud257" rel="noopener">courses</a> that cover this well, so I won&#x2019;t spend too much time on it. The only slight variation I have encountered is Netflix&#x2019;s approach called &#x201C;Consumer Data Science&#x201D;, which you can<a href="https://netflixtechblog.com/how-we-determine-product-success-980f81f0047e" rel="noopener"> read about here</a>.</p><p>The most popular high-level metrics that companies measure here are <em>Click-Through Rate</em> and <em>Engagement</em>. Uber Eats goes further and designed a multi-objective tradeoff that<a href="https://eng.uber.com/uber-eats-recommending-marketplace/" rel="noopener"> captures multiple high-level metrics</a> to account for the overall health of their three-sided marketplace (among others: Marketplace Fairness, Gross Bookings, Reliability, Eater Happiness). In addition to medium-term engagement, Netflix focuses on member retention rates as their online tests can<a href="https://dl.acm.org/doi/pdf/10.1145/2843948" rel="noopener"> range from 2&#x2013;6 months</a>.</p><p>YouTube famously prioritizes watch-time over click-through rate. 
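To see why the choice of objective matters, here is a toy sketch (all numbers invented) of how the same three videos rank under click-through rate versus expected watch time per impression:

```python
# Toy catalog: (title, click_through_rate, avg_watch_minutes); numbers are invented
videos = [
    ("clickbait_compilation", 0.20, 0.5),
    ("in_depth_tutorial", 0.05, 18.0),
    ("music_video", 0.10, 3.5),
]

# Rank by CTR alone
by_ctr = sorted(videos, key=lambda v: v[1], reverse=True)
# Rank by expected watch time per impression = CTR * average watch duration
by_watch_time = sorted(videos, key=lambda v: v[1] * v[2], reverse=True)

print(by_ctr[0][0])         # clickbait_compilation
print(by_watch_time[0][0])  # in_depth_tutorial
```

The clickbait video wins on clicks alone but loses once watch duration is factored in.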
They even<a href="https://youtube-creators.googleblog.com/2012/08/youtube-now-why-we-focus-on-watch-time.html" rel="noopener"> wrote an article explaining why</a>:</p><blockquote>Ranking by click-through rate often promotes deceptive videos that the user does not complete (&#x201C;clickbait&#x201D;) whereas watch time better captures engagement</blockquote><h4 id="evaluating-embeddings">Evaluating Embeddings</h4><p>As covered in the section on algorithms, embeddings are a crucial part of the candidate generation stage. However, unlike with a classification or regression model, it&#x2019;s<a href="https://blog.twitter.com/engineering/en_us/topics/insights/2018/embeddingsattwitter.html" rel="noopener"> notoriously difficult to measure the quality of an embedding</a> given that embeddings are often used in different contexts. A sanity check we can perform is to map the high-dimensional embedding vector into a lower-dimensional representation (via PCA, t-SNE, or UMAP) or apply clustering techniques such as k-means and then visualize the results. Airbnb did this with their<a href="https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e"> listing embeddings</a> to confirm that listings from similar locations are clustered together.</p><hr><h3 id="user-interface">User Interface</h3><p>For a Machine Learning Engineer or Data Scientist, probably the most overlooked aspect of the equation is the User Interface. The problem is that if your UI does not contain the needed components to showcase the recommendations or showcases them in the wrong context, the feedback loop is inherently flawed.</p><p>Let&#x2019;s take Linkedin as an example to illustrate this. If I&#x2019;m browsing through people&#x2019;s profiles, on the right-hand side of the screen I see recommendations for <em>similar people</em>. 
When I&#x2019;m browsing through companies, I see recommendations for <em>similar companies</em>. The recommendations are adapted to my current goals and context and encourage me to keep browsing the site. If the <em>similar companies</em> recommendations appeared on a person&#x2019;s profile, I would probably be less encouraged to click on their profile as it is not what I am currently looking for.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*i-3rNsokIOjyoBRgeiOhvA.png" class="kg-image" alt loading="lazy"><figcaption>Similar User Recommendations on&#xA0;Linkedin</figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*A1NpLcH0HG0RrBB3oOTVKA.png" class="kg-image" alt loading="lazy"><figcaption>Similar Companies Recommendations on&#xA0;Linkedin</figcaption></figure><p>You can build the best Recommender System in the world; however, if your interface is not designed to serve the user&#x2019;s needs and wants, no one will appreciate the recommendations. 
In fact, the User Interface challenge is so crucial that<a href="https://netflixtechblog.com/learning-a-personalized-homepage-aa8ec670359a" rel="noopener"> Netflix turned all components on their website into dynamic ones</a> which are assembled by a Machine Learning algorithm to best reflect the goals of a user.</p><p>Spotify followed that model and <a href="https://labs.spotify.com/2020/01/16/for-your-ears-only-personalizing-spotify-home-with-machine-learning/" rel="noopener">adopted a similar layout for their home screen design</a>, as can be seen below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*ZopV25d9-x1Gma5-YIbkIA.png" class="kg-image" alt loading="lazy"><figcaption>Personalizing Spotify Home with Machine Learning (Source:&#xA0;<a href="https://www.oreilly.com/radar/personalization-of-spotify-home-and-tensorflow/" data-href="https://www.oreilly.com/radar/personalization-of-spotify-home-and-tensorflow/" class="markup--anchor markup--figure-anchor" rel="noopener" target="_blank">Spotify</a>)</figcaption></figure><p>This is an ongoing area where there is still a lot of experimentation. As an example, YouTube recently changed their homepage interface to enable users to narrow down the recommendations for different topics:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*Edcu3SHtcLuF5aZqI1N_rw.png" class="kg-image" alt loading="lazy"><figcaption>New YouTube Home&#xA0;Page</figcaption></figure><hr><h3 id="cold-start-problem">Cold-start Problem</h3><p>The<a href="https://en.wikipedia.org/wiki/Cold_start_%28computing%29" rel="noopener"> cold-start problem</a> is often seen in Recommender Systems because methods such as collaborative filtering rely heavily on past user-item interactions. Companies are confronted with the cold-start problem in two ways: user and item cold-start. 
Depending on the type of platform, either one of them is more prevalent.</p><h4 id="user-cold-start">User cold-start</h4><p>Imagine a new member signs up for Netflix. At this point, the company doesn&#x2019;t know anything about the new members&#x2019; preferences. How does the company keep her engaged by providing great recommendations?</p><p>In Netflix&#x2019;s case, new members get a one-month free trial, during which cancellation rates are the highest while they decrease quickly after that. This is why any improvements to the cold-start problem present an immense business opportunity for Netflix, in order to increase engagement and retention in those first 30 days. Today, their members are given a survey during the sign-up process, during which they are asked to select videos from an algorithmically populated set that is then used as an input into all of their algorithms.</p><h4 id="item-cold-start">Item cold-start</h4><p>Companies face a similar challenge when new items or content are added to the catalog. Platforms like Netflix or Prime Video hold an existing catalog of media items that changes less frequently (it takes time to create movies or series!), therefore they struggle less with this. On the contrary, on Airbnb or Zillow, new listings are created every day and at that point, they do not have an embedding as they were not present during the training process. Airbnb solves this the following way:</p><blockquote>To create embeddings for a new listing we find 3 geographically closest listings that do have embeddings, and are of same listing type and price range as the new listing, and calculate their mean vector.</blockquote><p>For Zillow, this is especially critical as some of the new home listings might only be on the site for a couple of days. 
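Airbnb's fallback quoted above amounts to averaging the embeddings of comparable neighbors; a minimal sketch with toy 3-dimensional vectors (the filtering by location, listing type, and price range is assumed to happen upstream):

```python
def cold_start_embedding(neighbor_embeddings, k=3):
    """Average the embeddings of the k nearest comparable listings.

    `neighbor_embeddings` is assumed to be pre-filtered to the same
    listing type and price range, sorted by geographic distance
    (hypothetical upstream step).
    """
    nearest = neighbor_embeddings[:k]
    dim = len(nearest[0])
    return [sum(vec[i] for vec in nearest) / len(nearest) for i in range(dim)]

# Toy 3-d embeddings of the three closest comparable listings
print(cold_start_embedding([[1.0, 0.0, 2.0], [3.0, 2.0, 4.0], [2.0, 1.0, 0.0]]))  # [2.0, 1.0, 2.0]
```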
They<a href="https://www.zillow.com/tech/embedding-similar-home-recommendation/" rel="noopener"> creatively solved this problem</a> by creating a neural network-based mapping function from the content space to the embedding space, which is guided by the engagement data from users during the learning phase. This allows them to map a new home listing to the learned embedding space just by using its features.</p><hr><h3 id="exploration-vs-exploitation">Exploration vs. Exploitation</h3><p>The concept of exploration/exploitation can be seen as the balancing of new content with well-established content. I was going to illustrate this concept myself, but then I found this great excerpt that hits it out of the park:</p><blockquote>&#x201C;Imagine you&#x2019;ve just entered an ice cream shop. You now face a crucial decision&#x200A;&#x2014;&#x200A;out of about 30 flavors you need to choose only one!<br>You can go with two strategies: either go with that favorite flavor of yours that you already know is the best; or explore new flavors you never tried before, and maybe find a new best flavor.<br>These two strategies&#x200A;&#x2014;&#x200A;exploitation and exploration&#x200A;&#x2014;&#x200A;can also be used when recommending content. We can either exploit items that have high click-through rate with high certainty&#x200A;&#x2014;&#x200A;maybe because these items have been shown thousands of times to similar users, or we can explore new items we haven&#x2019;t shown to many users in the past.
Incorporating exploration into your recommendation strategy is crucial&#x200A;&#x2014;&#x200A;without it, new items don&#x2019;t stand a chance against older, more familiar ones.&#x201D;</blockquote><p><em>(Source: </em><a href="https://anotherdatum.com/exploration-exploitation.html" rel="noopener"><em>Recommender Systems: Exploring the Unknown Using Uncertainty</em></a><em>)</em></p><p>This tradeoff is a typical reinforcement learning problem, and a commonly used approach is the multi-armed bandit algorithm. This is used by Spotify for the<a href="http://sigir.org/afirm2019/slides/16.%20Friday%20-%20Music%20Recommendation%20at%20Spotify%20-%20Ben%20Carterette.pdf" rel="noopener"> personalization of each user&#x2019;s home page</a> as well as by Uber Eats for personalized recommendations<a href="https://eng.uber.com/uber-eats-recommending-marketplace/" rel="noopener"> optimized for their three-sided marketplace</a>. Two scientists at Netflix gave a great talk about how they are<a href="https://www.youtube.com/watch?v=kY-BCNHd_dM" rel="noopener"> using the MAB framework for movie recommendations</a>.</p><p>Though I should mention that this is by no means the final solution to this problem, it seems to work for Netflix, Spotify, and Uber Eats, right?</p><p>Yes. But!</p><p>Netflix has roughly 160 million users and about 6,000 movies/shows. Spotify has about 230 million users and 50 million songs + 500,000 podcasts.</p><p>Twitter&#x2019;s 330 million active users generate more than <strong><em>500 million tweets</em></strong> per day (350,000 tweets per minute, 6,000 tweets per second).
And then there&#x2019;s YouTube, with its <strong><em>300 hours of video</em></strong> uploaded every minute!</p><p>The exploration space in the latter two cases is a <em>little</em> bit bigger than in the case of Netflix or Uber Eats, which makes the problem a lot more challenging.</p><hr><h3 id="the-future-of-recommender-systems">The Future of Recommender Systems</h3><p>This is the end of my little survey of Recommender Systems. As we have observed, Recommender Systems already guide so many aspects of our life. All the algorithms we covered over the course of these two articles are competing for our attention every day. And, after all, they are all maximizing the time we spend on their platforms. As I illustrated in the section on Evaluation methods, most of the algorithms are optimizing for something like click-through rate, engagement, or, in YouTube&#x2019;s case, watch time.</p><p><strong><em>What does that mean for us as consumers?</em></strong></p><p>It means that we are not in control of our desires anymore. While this might sound poetic, think about it. Let&#x2019;s look at YouTube; we all have goals when coming to the site. We might want to listen to music, watch something funny, or learn something new. But all the content that is recommended to us (either through the Home Page recommendations, Search Ranking, or Watch Next) is optimized to keep us on the site for longer.</p><p>Lex Fridman and Fran&#xE7;ois Chollet had a<a href="https://www.youtube.com/watch?v=Bo8MY4JpiXE" rel="noopener"> great conversation about this</a> on the Artificial Intelligence Podcast. Instead of choosing the metric to optimize for, what if companies put the user in charge of choosing their own objective function? What if they took the personal goals in the user&#x2019;s profile into account and asked the user: what do you want to achieve? Right now, this technology is almost like our boss and we&#x2019;re not in control of it.
Wouldn&#x2019;t it be incredible to leverage the power of Recommender Systems to be more like a mentor, a coach, or an assistant?</p><p>Imagine, as a consumer, you could ask YouTube to optimize the content to maximize learning outcomes. The technology is certainly already there. The challenge would really lie in aligning this with existing business models and designing the right interface to empower the user to make that choice, and to change it as their goals evolve. With its new interface, YouTube is perhaps already taking baby steps in that direction by putting the user in charge of selecting the categories that she wants to see recommendations for. But this is just the beginning.</p><p>Could this be the way forward or is this just a consumer&#x2019;s dream?</p><hr><p><strong><em>Resources</em></strong></p><p><a href="https://www.youtube.com/watch?v=Bo8MY4JpiXE" rel="noopener">Fran&#xE7;ois Chollet: Keras, Deep Learning, and the Progress of AI | Artificial Intelligence Podcast</a></p><p><a href="https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e" rel="noopener">Airbnb&#x200A;&#x2014;&#x200A;Listing Embeddings in Search Ranking</a></p><p><a href="https://medium.com/airbnb-engineering/machine-learning-powered-search-ranking-of-airbnb-experiences-110b4b1a0789" rel="noopener">Airbnb&#x200A;&#x2014;&#x200A;Machine Learning-Powered Search Ranking of Airbnb Experiences</a></p><p><a href="https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf" rel="noopener nofollow noopener noopener">Amazon&#x200A;&#x2014;&#x200A;Amazon.com Recommendations Item-to-Item Collaborative Filtering</a></p><p><a href="https://www.amazon.science/the-history-of-amazons-recommendation-algorithm" rel="noopener nofollow noopener noopener">Amazon&#x200A;&#x2014;&#x200A;The history of Amazon&#x2019;s recommendation algorithm</a></p><p><a
href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/" rel="noopener nofollow noopener noopener">Instagram&#x200A;&#x2014;&#x200A;Powered by AI: Instagram&#x2019;s Explore recommender system</a></p><p><a href="https://ls13-www.cs.tu-dortmund.de/homepage/rsweb2014/papers/rsweb2014_submission_3.pdf" rel="noopener nofollow noopener noopener">LinkedIn&#x200A;&#x2014;&#x200A;The Browsemaps: Collaborative Filtering at LinkedIn</a></p><p><a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429" rel="noopener nofollow noopener noopener">Netflix&#x200A;&#x2014;&#x200A;Netflix Recommendations: Beyond the 5 stars (Part 1)</a></p><p><a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-2-d9b96aa399f5" rel="noopener nofollow noopener noopener">Netflix&#x200A;&#x2014;&#x200A;Netflix Recommendations: Beyond the 5 stars (Part 2)</a></p><p><a href="https://dl.acm.org/doi/pdf/10.1145/2843948" rel="noopener nofollow noopener noopener">Netflix&#x200A;&#x2014;&#x200A;The Netflix Recommender System: Algorithms, Business Value, and Innovation</a></p><p><a href="https://netflixtechblog.com/learning-a-personalized-homepage-aa8ec670359a" rel="noopener nofollow noopener noopener">Netflix&#x200A;&#x2014;&#x200A;Learning a Personalized Homepage</a></p><p><a href="https://pdfs.semanticscholar.org/f635/6c70452b3f56dc1ae07b4649a80239afb1b6.pdf" rel="noopener nofollow noopener noopener">Pandora&#x200A;&#x2014;&#x200A;Pandora&#x2019;s Music Recommender</a></p><p><a href="https://medium.com/s/story/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe" rel="noopener">Spotify&#x200A;&#x2014;&#x200A;Discover Weekly: How Does Spotify Know You So Well?</a></p><p><a href="https://labs.spotify.com/2020/01/16/for-your-ears-only-personalizing-spotify-home-with-machine-learning/" rel="noopener nofollow noopener noopener">Spotify&#x200A;&#x2014;&#x200A;For Your Ears Only: 
Personalizing Spotify Home with Machine Learning</a></p><p><a href="https://www.slideshare.net/MrChrisJohnson/from-idea-to-execution-spotifys-discover-weekly/31-1_0_0_0_1" rel="noopener nofollow noopener noopener">Spotify&#x200A;&#x2014;&#x200A;From Idea to Execution: Spotify&#x2019;s Discover Weekly</a></p><p><a href="https://blog.twitter.com/engineering/en_us/topics/insights/2018/embeddingsattwitter.html" rel="noopener nofollow noopener noopener">Twitter&#x200A;&#x2014;&#x200A;Embeddings@Twitter</a></p><p><a href="https://eng.uber.com/uber-eats-recommending-marketplace/" rel="noopener nofollow noopener noopener">Uber Eats&#x200A;&#x2014;&#x200A;Food Discovery with Uber Eats: Recommending for the Marketplace</a></p><p><a href="https://eng.uber.com/uber-eats-graph-learning/" rel="noopener nofollow noopener noopener">Uber Eats&#x200A;&#x2014;&#x200A;Food Discovery with Uber Eats: Using Graph Learning to Power Recommendations</a></p><p><a href="https://www.inf.unibz.it/~ricci/ISR/papers/p293-davidson.pdf" rel="noopener nofollow noopener noopener">YouTube&#x200A;&#x2014;&#x200A;The YouTube Video Recommendation System</a></p><p><a href="https://arxiv.org/pdf/1409.2944.pdf" rel="noopener nofollow noopener noopener">YouTube&#x200A;&#x2014;&#x200A;Collaborative Deep Learning for Recommender Systems</a></p><p><a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf" rel="noopener nofollow noopener noopener">YouTube&#x200A;&#x2014;&#x200A;Deep Neural Networks for YouTube Recommendations</a></p><p><a href="https://www.zillow.com/tech/embedding-similar-home-recommendation/" rel="noopener nofollow noopener noopener">Zillow&#x200A;&#x2014;&#x200A;Home Embeddings for Similar Home Recommendations</a></p><p><a href="https://www.youtube.com/watch?v=giIXNoiqO_U&amp;list=PL-6SiIrhTAi6x4Oq28s7yy94ubLzVXabj" rel="noopener nofollow noopener noopener">Andrew Ng&#x2019;s Machine Learning Course (Recommender Systems)</a></p><p><a 
href="https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture" rel="noopener nofollow noopener noopener">Google&#x2019;s Machine Learning Crash Course&#x200A;&#x2014;&#x200A;Embeddings</a></p><hr>]]></content:encoded></item><item><title><![CDATA[Recommender Systems: The Most Valuable Application of Machine Learning (Part 1)]]></title><description><![CDATA[Why Recommender Systems are the most valuable application of Machine Learning and how Machine Learning-driven Recommenders already drive almost every aspect of our lives.]]></description><link>https://ricomnl.com/blog/recommender-systems-the-most-valuable-application-of-machine-learning-part-1/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f4a</guid><category><![CDATA[recommender systems]]></category><category><![CDATA[machine learning]]></category><category><![CDATA[data science]]></category><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Sun, 04 Oct 2020 20:46:09 GMT</pubDate><content:encoded><![CDATA[<p>Why Recommender Systems are the most valuable application of Machine Learning and how Machine Learning-driven Recommenders already drive almost every aspect of our lives.</p><p><a href="https://towardsdatascience.com/recommender-systems-the-most-valuable-application-of-machine-learning-part-1-f96ecbc4b7f5">Read this article on Medium.</a></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*F2mBbZRHPXxa3cyg3z8esg.png" class="kg-image" alt loading="lazy"><figcaption>Recommender Systems already drive almost every aspect of our daily&#xA0;lives.</figcaption></figure><hr><p>Look back at your week: a Machine Learning algorithm determined what songs you might like to listen to, what food to order online, what posts you see on your favorite social networks, as well as the next person you may want to connect with, what series or movies you would like to watch, etc&#x2026;</p><p>Machine Learning already guides so many 
aspects of our life without us necessarily being conscious of it. All of the applications mentioned above are driven by one type of algorithm: recommender systems.</p><p>In this article, I will explore and dive deeper into all the aspects that come into play to build a successful recommender system. The length of this article got a little out of hand so I decided to split it into two parts. This first part will cover:</p><ul><li>Business Value</li><li>Problem Formulation</li><li>Data</li><li>Algorithms</li></ul><p>The Second Part will cover:</p><ul><li>Evaluation Metrics</li><li>User Interface</li><li>Cold-start Problem</li><li>Exploration vs. Exploitation</li><li>The Future of Recommender Systems</li></ul><p>Throughout this article, I will be using examples of the companies that have built the most widely used systems over the last couple of years, including Airbnb, Amazon, Instagram, LinkedIn, Netflix, Spotify, Uber Eats, and YouTube.</p><hr><h3 id="business-value">Business Value</h3><p>Harvard Business Review made a strong statement by calling Recommenders the <a href="https://hbr.org/2017/08/great-digital-companies-build-great-recommendation-engines" rel="noopener">single most important algorithmic distinction between &#x201C;born digital&#x201D; enterprises and legacy companies</a>. 
HBR also described the virtuous business cycle these can generate: the more people use a company&#x2019;s Recommender System, the more valuable it becomes; and the more valuable it becomes, the more people use it.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*U_eUe0NBy7uQPA10SFY-VQ.png" class="kg-image" alt loading="lazy"><figcaption>The Virtuous Business Cycle of Recommender Systems (source: <a href="https://www.mdpi.com/2199-8531/5/3/44/htm" data-href="https://www.mdpi.com/2199-8531/5/3/44/htm" class="markup--anchor markup--figure-anchor" rel="noopener" target="_blank">MDPI</a>,&#xA0;CC)</figcaption></figure><p>We are encouraged to look at recommender systems not as a way to sell more online, but rather as a renewable resource for <em>relentlessly improving customer insights and our own insights as well</em>. If we look at the illustration above, we can see that many legacy companies also have tons of users and therefore tons of data. The reason their virtuous cycle has not picked up as much as those of Amazon, Netflix, or Spotify is their lack of knowledge about how to convert their user data into actionable insights, which can then be used to improve their products or services.</p><p>Looking at Netflix, for example, shows how crucial this is, as 80% of what people watch comes from some sort of recommendation.
In <a href="https://dl.acm.org/doi/pdf/10.1145/2843948" rel="noopener">2015, one of their papers stated</a>:</p><blockquote>&#x201C;We think the combined effect of personalization and recommendations save us more than $1B per year.&#x201D;</blockquote><p>If we look at Amazon, <a href="https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers" rel="noopener">35% of what customers purchase at Amazon</a> comes from product recommendations, and at Airbnb, Search Ranking and Similar Listings drive <a href="https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e">99% of all booking conversions</a>.</p><hr><h3 id="problem-formulation">Problem Formulation</h3><p>Now that we&#x2019;ve seen the immense value companies can gain from Recommender Systems, let&#x2019;s look at the types of challenges they can solve. Generally speaking, tech companies are trying to recommend the <strong>most relevant content</strong> to their users. That could mean:</p><ul><li>similar home listings (Airbnb, Zillow)</li><li>relevant media, e.g. photos, videos and stories (Instagram)</li><li>relevant series and movies (Netflix, Amazon Prime Video)</li><li>relevant songs and podcasts (Spotify)</li><li>relevant videos (YouTube)</li><li>similar users, posts (LinkedIn, Twitter, Instagram)</li><li>relevant dishes and restaurants (Uber Eats)</li></ul><p>The formulation of the problem is critical here. Most of the time, companies want to recommend content that users are most likely to enjoy in the future.
The reformulation of this problem, as well as the algorithmic changes from recommending &#x201C;what users are most likely to watch&#x201D; to &#x201C;what users are most likely to watch <em>in the future</em>&#x201D; <a href="https://www.amazon.science/the-history-of-amazons-recommendation-algorithm" rel="noopener">allowed Amazon Prime Video to gain a 2x improvement</a>, a &#x201C;once-in-a-decade leap&#x201D; for their movie Recommender System.</p><blockquote>&#x201C;Amazon researchers found that using neural networks to generate movie recommendations worked much better when they sorted the input data chronologically and used it to predict future movie preferences over a short (one- to two-week) period.&#x201D;</blockquote><hr><h3 id="data">Data</h3><p>Recommender Systems usually take two types of data as input:</p><ul><li><strong>User Interaction Data </strong>(Implicit/Explicit)</li><li><strong>Item Data</strong> (Features)</li></ul><p>The &#x201C;classic&#x201D;, and still widely used, approach to recommender systems, based on <strong>collaborative filtering</strong> (used by <a href="https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf" rel="noopener">Amazon</a>, <a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429" rel="noopener">Netflix</a>, <a href="https://ls13-www.cs.tu-dortmund.de/homepage/rsweb2014/papers/rsweb2014_submission_3.pdf" rel="noopener">LinkedIn</a>, <a href="https://medium.com/s/story/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe">Spotify</a> and <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf" rel="noopener">YouTube</a>), uses either User-User or Item-Item relationships to find similar content.
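</p><p>As a toy illustration (made-up interaction matrix, not any company&#x2019;s production pipeline), the Item-Item flavor can be sketched as cosine similarity between the item columns of a user-item matrix:</p>

```python
import numpy as np

# Toy user-item interaction matrix: rows are users, columns are items;
# a 1 means the user interacted with (e.g. watched) the item.
interactions = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

def item_item_similarity(m):
    """Cosine similarity between every pair of item columns."""
    norms = np.linalg.norm(m, axis=0, keepdims=True)
    normed = m / np.clip(norms, 1e-12, None)
    return normed.T @ normed

sim = item_item_similarity(interactions)
# Items 0 and 1 are always consumed together -> maximal similarity;
# items 0 and 3 share no users -> zero similarity.
print(round(sim[0, 1], 3), round(sim[0, 3], 3))  # 1.0 0.0
```

<p>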
I&#x2019;m not going to go deeper into the inner workings of this, as there are a lot of articles on that topic&#x200A;&#x2014;&#x200A;<a href="https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/" rel="noopener">like this one</a>&#x200A;&#x2014;&#x200A;that explain this concept well.</p><p>The <em>user interaction data</em> is the data we gather from the weblogs and can be divided into two groups:</p><p><strong><em>Explicit data</em></strong>: explicit input from our users (e.g. movie ratings, search logs, liked, commented, watched, favorited, etc.)</p><p><strong><em>Implicit data</em></strong>: information that is not provided intentionally but gathered from available data streams (e.g. search history, order history, clicked on, accounts interacted with, etc.)</p><p>The <em>item data</em> consists mainly of an item&#x2019;s features. In YouTube&#x2019;s case, that would be a video&#x2019;s metadata such as title and description. For Zillow, this could be a home&#x2019;s Zip Code, City Region, Price, or Number of Bedrooms for instance.</p><p>Other data sources could be <strong><em>external data</em></strong> (for example, Netflix might <a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-2-d9b96aa399f5" rel="noopener">add external item data features</a> such as box office performance or critic reviews) or <strong>expert-generated data</strong> (Pandora&#x2019;s <a href="https://pdfs.semanticscholar.org/f635/6c70452b3f56dc1ae07b4649a80239afb1b6.pdf" rel="noopener">Music Genome Project</a> uses human input to apply values for each song in each of approximately 400 musical attributes).</p><p>A key insight here is that, obviously, having more data about your users will inevitably lead to better model results (if applied correctly). However, as Airbnb shows in their <a href="https://medium.com/airbnb-engineering/machine-learning-powered-search-ranking-of-airbnb-experiences-110b4b1a0789">3-part
journey to building a Ranking Model for Airbnb Experiences</a>, you can already achieve quite a lot with less data: the team at Airbnb improved bookings by +13% with just 500 experiences and a training set of only 50k examples.</p><blockquote>&#x201C;The main take-away is: <em>Don&#x2019;t wait until you have big data, you can do quite a bit with small data to help grow and improve your business.</em>&#x201D;</blockquote><hr><h3 id="algorithms">Algorithms</h3><p>Often, we associate Recommender Systems with just collaborative filtering. That&#x2019;s fair, as in the past this has been the go-to method for a lot of the companies that have deployed successful systems in practice. Amazon was probably the first company to leverage <a href="https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf" rel="noopener">item-to-item collaborative filtering</a>. When they first released the inner workings of their method in a paper in 2003, the system had already been in use for six years.</p><p>Then, in 2006, Netflix followed suit with its famous Netflix Prize, which offered $1 million to whoever improved the accuracy of their existing system, called <em>Cinematch</em>, by 10%. Collaborative filtering was also a part of the early Recommender Systems at <a href="https://medium.com/s/story/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe">Spotify</a> and <a href="https://arxiv.org/pdf/1409.2944.pdf" rel="noopener">YouTube</a>. LinkedIn even developed a horizontal collaborative filtering infrastructure, <a href="https://ls13-www.cs.tu-dortmund.de/homepage/rsweb2014/papers/rsweb2014_submission_3.pdf" rel="noopener">known as Browsemaps</a>.
This platform enables rapid development, deployment, and computation of collaborative filtering recommendations for almost any use case on LinkedIn.</p><p>If you want to know more about collaborative filtering, I would recommend checking out <a href="https://www.youtube.com/watch?v=giIXNoiqO_U&amp;list=PL-6SiIrhTAi6x4Oq28s7yy94ubLzVXabj" rel="noopener">Section 16 of Andrew Ng&#x2019;s Machine Learning course on Coursera</a> where he goes deeper into the math behind it.</p><p>Now, I would like to take a step back and generalize the concept of a Recommender System. While many companies used to rely on collaborative filtering, today there are a lot of other algorithms at play that either complement or have even replaced the collaborative filtering approach. Netflix went through this change when they shifted from a DVD-shipping business to a streaming business. As described in one of their papers:</p><blockquote>&#x201C;We indeed relied on such an algorithm heavily when our main business was shipping DVDs by mail, partly because in that context, a star rating was the main feedback that we received that a member had actually watched the video. [&#x2026;] But the days when stars and DVDs were the focus of recommendations at Netflix have long passed.
[&#x2026;] Now, our recommender system consists of a variety of algorithms that collectively define the Netflix experience, most of which come together on the Netflix homepage.&#x201D;</blockquote><p>If we zoom out a little bit and look at Recommender Systems more broadly we find that they essentially consist of two parts:</p><ol><li><strong>Candidate Generation</strong></li><li><strong>Ranking</strong></li></ol><p>I am going to use <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf" rel="noopener">YouTube&#x2019;s Recommender System</a> as an example below as they provided a good visualization, but that very same concept is applied by Instagram for <a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/" rel="noopener">recommendations in &#x201C;Instagram Explore&#x201D;</a>, by Uber Eats in their <a href="https://eng.uber.com/uber-eats-graph-learning/" rel="noopener">Dish and Restaurant Recommender System</a>, by Netflix for their <a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-2-d9b96aa399f5" rel="noopener">movie recommendations </a>and probably many other companies.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*6LG9QN2XEtK6UCOZG4cavA.png" class="kg-image" alt loading="lazy"><figcaption>2-stage Recommender System (inspired by&#xA0;<a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf" data-href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf" class="markup--anchor markup--figure-anchor" rel="noopener" target="_blank">YouTube</a>)</figcaption></figure><p>According to Netflix, the goal of Recommender Systems is to present a number of attractive items for a person to choose from. 
This is usually accomplished by selecting some items (<em>candidate generation</em>) and sorting them (<em>ranking</em>) in the order of expected enjoyment (or utility).</p><p>Let&#x2019;s further investigate the two stages:</p><h4 id="candidate-generation">Candidate Generation</h4><p>In this stage, we want to source the relevant candidates that could be eligible to show to our users. Here, we are working with the whole catalog of items so it can be quite large (YouTube and Instagram are great examples here). The key to doing this is entity embeddings. What are entity embeddings?</p><p>An entity embedding is a mathematical vector representation of an entity such that its dimensions might represent certain properties. Twitter has a great example of this in a <a href="https://blog.twitter.com/engineering/en_us/topics/insights/2018/embeddingsattwitter.html" rel="noopener">blog post about Embeddings@Twitter</a>: say we have two NBA players (Stephen Curry and LeBron James) and two musicians (Kendrick Lamar and Bruno Mars). We expect the distance between the embeddings of the NBA players to be smaller than the distance between the embeddings of a player and a musician. We can calculate the distance between two embeddings using the formula for Euclidean distance.</p><p><strong><em>How do we come up with these embeddings?</em></strong></p><p>Well, one way to do this would be collaborative filtering. We have our items and our users. If we put them in a matrix (for the example of Spotify) it could look like this:</p><figure class="kg-card kg-image-card"><img src="https://cdn-images-1.medium.com/max/1600/1*dkvXGVpAlK25F-WznatMMA.png" class="kg-image" alt loading="lazy"></figure><p>After applying the <a href="https://www.slideshare.net/MrChrisJohnson/from-idea-to-execution-spotifys-discover-weekly/31-1_0_0_0_1" rel="noopener">matrix factorization algorithm</a>, we end up with user vectors and song vectors. 
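</p><p>As a rough sketch of what that looks like (toy play counts; truncated SVD standing in here for whichever factorization algorithm is actually used):</p>

```python
import numpy as np

# Toy user x song matrix of play counts: users 0 and 1 share a taste,
# user 2 listens to different songs.
plays = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
])

# Truncated SVD as a stand-in for matrix factorization:
# plays is approximated by user_vecs @ song_vecs.T
k = 2
U, s, Vt = np.linalg.svd(plays, full_matrices=False)
user_vecs = U[:, :k] * s[:k]   # one k-dimensional vector per user
song_vecs = Vt[:k, :].T        # one k-dimensional vector per song

def euclidean(a, b):
    """Distance between two embeddings (as in the Twitter example above)."""
    return float(np.linalg.norm(a - b))

# Users with similar listening histories end up close together:
print(euclidean(user_vecs[0], user_vecs[1]) < euclidean(user_vecs[0], user_vecs[2]))  # True
```

<p>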
To find out which users&#x2019; tastes are most similar to one another&#x2019;s, collaborative filtering compares one user&#x2019;s vector with all of the other users&#x2019; vectors, ultimately spitting out which users are the closest matches. The same goes for the Y vector, <em>songs</em>: you can compare a single song&#x2019;s vector with all the others, and find out which songs are most similar to the one in question.</p><p>Another way to do this takes inspiration from applications in the domain of Natural Language Processing. Researchers generalized the <a href="https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf" rel="noopener">word2vec algorithm</a>, developed by Google in the early 2010s, to all entities appearing in a similar context. In word2vec, the networks are trained by directly taking into account the word order and their co-occurrence, based on the assumption that words frequently appearing together in sentences also share more statistical dependence. As Airbnb describes in their <a href="https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e">blog post about creating Listing Embeddings</a>:</p><blockquote>More recently, the concept of embeddings has been extended beyond word representations to other applications outside of NLP domain. Researchers from the Web Search, E-commerce and Marketplace domains have realized that just like one can train word embeddings by treating a sequence of words in a sentence as context, the same can be done for training embeddings of user actions by treating sequence of user actions as context.
Examples include learning representations of <a href="https://arxiv.org/pdf/1606.07154.pdf" rel="noopener nofollow noopener noopener noopener">items that were clicked or purchased</a> or <a href="https://arxiv.org/pdf/1607.01869.pdf" rel="noopener nofollow noopener noopener noopener">queries and ads that were clicked</a>. These embeddings have subsequently been leveraged for a variety of recommendations on the Web.</blockquote><p>Apart from Airbnb, this concept is used by Instagram (IG2Vec) to<a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/" rel="noopener"> learn account embeddings</a>, by YouTube to <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf" rel="noopener">learn video embeddings</a> and by Zillow to <a href="https://www.zillow.com/tech/embedding-similar-home-recommendation/" rel="noopener">learn categorical embeddings</a>.</p><p>Another, more novel approach to this is called Graph Learning and it is <a href="https://eng.uber.com/uber-eats-graph-learning/" rel="noopener">used by Uber Eats for their dish and restaurant embeddings</a>. They represent each of their dishes and restaurants in a separate graph and apply the <a href="http://snap.stanford.edu/graphsage/" rel="noopener">GraphSAGE algorithm</a> to obtain the representations (embeddings) of the respective nodes.</p><p>And last but not least, we can also learn an embedding as part of the neural network for our target task. This approach gets you an embedding well customized for your particular system, but may take longer than training the embedding separately. The <a href="https://keras.io/api/layers/core_layers/embedding/" rel="noopener">Keras Embedding Layer</a> would be one way to achieve this.
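</p><p>Conceptually, an embedding layer is just a trainable lookup table from item IDs to dense vectors. A minimal numpy sketch of the forward pass (toy sizes; random weights standing in for trained ones):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

n_items, dim = 1000, 16                            # catalog size, embedding size
embedding_table = rng.normal(size=(n_items, dim))  # the layer's trainable weights

def embed(item_ids):
    """Forward pass of an embedding layer: just a row lookup."""
    return embedding_table[item_ids]

batch = embed(np.array([3, 42, 7]))
print(batch.shape)  # (3, 16)
```

<p>During training (for instance with the Keras layer inside a larger recommendation model), gradients flow only into the looked-up rows, so the table is learned for the target task.</p><p>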
Google covers this well as part of their <a href="https://developers.google.com/machine-learning/crash-course/embeddings/obtaining-embeddings" rel="noopener">Machine Learning Crash Course.</a></p><p>Once we have this vectorial representation of our items, we can simply use Nearest Neighbour Search to find our potential candidates. <br><a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/" rel="noopener">Instagram, for example</a>, defines a couple of seed accounts (accounts that people have interacted with in the past) and uses their IG2Vec account embeddings to find accounts similar to those. Based on these accounts, they are able to find the media that these accounts posted or engaged with. By doing that, they are able to filter billions of media items down to a couple thousand, then sample 500 candidates from the pool and send those candidates downstream to the ranking stage.</p><p>This phase can also be guided by business rules or just user input (the more information we have, the more specific we can be). As Uber Eats mentions <a href="https://eng.uber.com/uber-eats-graph-learning/" rel="noopener">in one of their blog posts</a>, for instance, pre-filtering can be based on factors such as geographical location.</p><p>So, to summarize:</p><p><em>In the candidate generation (or sourcing) phase, we filter our whole content catalog for a smaller subset of items that our users might be interested in. To do this, we need to map our items into a mathematical representation called embeddings so we can use a similarity function to find the most similar items in space. There are several ways to achieve this, three of them being collaborative filtering, word2vec for entities, and graph learning.</em></p><h4 id="ranking">Ranking</h4><p>Let&#x2019;s loop back to the case of Instagram.
After the candidate generation stage, we have about 500 media items that are potentially relevant and that we could show to a user in their &#x201C;Explore&#x201D; feed. <br>But which ones are going to be the <strong>most relevant</strong>?</p><p>Because, after all, there are only 25 spots on the first page of the &#x201C;Explore&#x201D; section. And if the first items suck, the user is not going to be impressed or intrigued enough to keep browsing. Netflix&#x2019;s and Amazon Prime Video&#x2019;s web interfaces <a href="https://www.amazon.science/the-history-of-amazons-recommendation-algorithm" rel="noopener">show only the top 6 recommendations on the first page</a> associated with each title in their catalogs. <a href="https://www.slideshare.net/MrChrisJohnson/from-idea-to-execution-spotifys-discover-weekly/31-1_0_0_0_1" rel="noopener">Spotify&#x2019;s Discover Weekly</a> playlist contains only 30 songs. <br>Also, all of this depends on the user&#x2019;s device; smartphones, of course, allow less space for relevant recommendations than a web browser.</p><p>&#x201C;There are many ways one could construct a ranking function ranging from simple scoring methods, to pairwise preferences, to optimization over the entire ranking. If we were to formulate this as a Machine Learning problem, we could select positive and negative examples from our historical data and let a Machine Learning algorithm learn the weights that optimize our goal. This family of Machine Learning problems is known as &#x201C;<a href="http://en.wikipedia.org/wiki/Learning_to_rank" rel="noopener">Learning to rank</a>&#x201D; and is central to application scenarios such as search engines or ad targeting. 
In the ranking stage, we are not aiming for our items to have a global notion of <em>relevance</em>, but rather look for ways of optimizing a personalized model&#x201D; <em>(Extract from </em><a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-2-d9b96aa399f5" rel="noopener"><em>Netflix Blog Post</em></a><em>).</em></p><p>To accomplish this, <a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/" rel="noopener">Instagram uses a three-stage ranking infrastructure</a> to help balance the trade-offs between ranking relevance and computation efficiency. In the <a href="https://eng.uber.com/uber-eats-graph-learning/" rel="noopener">case of Uber Eats</a>, their personalized ranking system is &#x201C;a fully-fledged ML model that ranks the pre-filtered dish and restaurant candidates based on additional contextual information, such as the day, time, and current location of the user when they open the Uber Eats app&#x201D;. In general, the level of complexity for your model really depends on the size of your feature space. Many supervised classification methods can be used for ranking; typical choices include Logistic Regression, Support Vector Machines, Neural Networks, or Decision Tree-based methods such as Gradient Boosted Decision Trees (GBDT). On the other hand, a great number of algorithms specifically designed for learning to rank have appeared in recent years, such as RankSVM or RankBoost.</p><p>To summarize:</p><p><em>After selecting initial candidates for our recommendations, in the ranking stage, we need to design a ranking function that ranks items by their relevance. This can be formulated as a Machine Learning problem, and the goal here is to optimize a personalized model for each user. 
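</em></p><p>As a minimal, purely illustrative sketch of the pointwise &#x201C;learning to rank&#x201D; formulation: train a logistic-regression scorer on positive and negative historical examples, then order fresh candidates by predicted score. All data and feature choices below are made up:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic historical data: two hypothetical features per item
# (e.g. similarity to user taste, item popularity); label = clicked or not.
X = rng.normal(size=(500, 2))
y = (X @ np.array([2.0, 1.0]) + rng.normal(scale=0.5, size=500) > 0).astype(float)

# Pointwise learning to rank: fit logistic regression by gradient descent.
w = np.zeros(2)
for _ in range(300):
    p = 1 / (1 + np.exp(-(X @ w)))       # predicted click probability
    w -= 0.1 * X.T @ (p - y) / len(y)    # gradient step on log loss

# Rank a batch of 10 pre-filtered candidates, best first.
candidates = rng.normal(size=(10, 2))
scores = 1 / (1 + np.exp(-(candidates @ w)))
ranking = np.argsort(-scores)
```

<p>A production ranker would use far richer contextual features and typically a GBDT or neural network, often with a pairwise or listwise objective rather than this pointwise classifier.</p><p><em>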
This step is important because in most interfaces we have limited space to recommend items, so we need to make the best use of that space by putting the most relevant items at the very top.</em></p><h4 id="baseline">Baseline</h4><p>As with every Machine Learning algorithm, we need a good baseline to measure the improvement of any change. A good baseline to start with is just to use the <a href="https://www.amazon.science/the-history-of-amazons-recommendation-algorithm" rel="noopener">most popular items in the catalog, as described by Amazon</a>:</p><blockquote>&#x201C;In the recommendations world, there&#x2019;s a cardinal rule. If I know nothing about you, then the best things to recommend to you are the most popular things in the world.&#x201D;</blockquote><p>However, if you don&#x2019;t even know what is most popular, because you just launched a new product or new items&#x200A;&#x2014;&#x200A;as was the case with <a href="https://medium.com/airbnb-engineering/machine-learning-powered-search-ranking-of-airbnb-experiences-110b4b1a0789">Airbnb Experiences</a>&#x200A;&#x2014;&#x200A;you can just randomly re-rank the item collection daily until you have gathered enough data for your first model.</p><hr><p>That&#x2019;s a wrap for Part 1 of this series. There are a couple of points I wanted to emphasize in this article:</p><ul><li>Recommender Systems are the most valuable application of Machine Learning as they are able to create a Virtuous Feedback Loop: the more people use a company&#x2019;s Recommender System, the more valuable it becomes; and the more valuable it becomes, the more people use it. Once you enter that Loop, the Sky is the Limit.</li><li>The right Problem Formulation is key.</li><li>In the Netflix Prize Challenge, teams tried to build models that predict a user&#x2019;s rating for a given movie. 
In the &#x201C;real world&#x201D;, companies use much more sophisticated data inputs which can be classified into two categories: Explicit and Implicit Data.</li><li>In today&#x2019;s world, Recommender Systems rely on much more than just Collaborative Filtering.</li></ul><p>In the Second Part I will cover:</p><ul><li>Evaluation Metrics</li><li>User Interface</li><li>Cold-start Problem</li><li>Exploration vs. Exploitation</li></ul><hr><p><strong><em>Resources</em></strong></p><p><a href="https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e">Airbnb&#x200A;&#x2014;&#x200A;Listing Embeddings in Search Ranking</a></p><p><a href="https://medium.com/airbnb-engineering/machine-learning-powered-search-ranking-of-airbnb-experiences-110b4b1a0789">Airbnb&#x200A;&#x2014;&#x200A;Machine Learning-Powered Search Ranking of Airbnb Experiences</a></p><p><a href="https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf" rel="noopener">Amazon&#x200A;&#x2014;&#x200A;Amazon.com Recommendations Item-to-Item Collaborative Filtering</a></p><p><a href="https://www.amazon.science/the-history-of-amazons-recommendation-algorithm" rel="noopener">Amazon&#x200A;&#x2014;&#x200A;The history of Amazon&#x2019;s recommendation algorithm</a></p><p><a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/" rel="noopener">Instagram&#x200A;&#x2014;&#x200A;Powered by AI: Instagram&#x2019;s Explore recommender system</a></p><p><a href="https://ls13-www.cs.tu-dortmund.de/homepage/rsweb2014/papers/rsweb2014_submission_3.pdf" rel="noopener">LinkedIn&#x200A;&#x2014;&#x200A;The Browsemaps: Collaborative Filtering at LinkedIn</a></p><p><a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429" rel="noopener">Netflix&#x200A;&#x2014;&#x200A;Netflix Recommendations: Beyond the 5 stars (Part 1)</a></p><p><a 
href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-2-d9b96aa399f5" rel="noopener">Netflix&#x200A;&#x2014;&#x200A;Netflix Recommendations: Beyond the 5 stars (Part 2)</a></p><p><a href="https://dl.acm.org/doi/pdf/10.1145/2843948" rel="noopener">Netflix&#x200A;&#x2014;&#x200A;The Netflix Recommender System: Algorithms, Business Value, and Innovation</a></p><p><a href="https://netflixtechblog.com/learning-a-personalized-homepage-aa8ec670359a" rel="noopener">Netflix&#x200A;&#x2014;&#x200A;Learning a Personalized Homepage</a></p><p><a href="https://pdfs.semanticscholar.org/f635/6c70452b3f56dc1ae07b4649a80239afb1b6.pdf" rel="noopener">Pandora&#x200A;&#x2014;&#x200A;Pandora&#x2019;s Music Recommender</a></p><p><a href="https://medium.com/s/story/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe">Spotify&#x200A;&#x2014;&#x200A;Discover Weekly: How Does Spotify Know You So Well?</a></p><p><a href="https://labs.spotify.com/2020/01/16/for-your-ears-only-personalizing-spotify-home-with-machine-learning/" rel="noopener">Spotify&#x200A;&#x2014;&#x200A;For Your Ears Only: Personalizing Spotify Home with Machine Learning</a></p><p><a href="https://www.slideshare.net/MrChrisJohnson/from-idea-to-execution-spotifys-discover-weekly/31-1_0_0_0_1" rel="noopener">Spotify&#x200A;&#x2014;&#x200A;From Idea to Execution: Spotify&#x2019;s Discover Weekly</a></p><p><a href="https://blog.twitter.com/engineering/en_us/topics/insights/2018/embeddingsattwitter.html" rel="noopener">Twitter&#x200A;&#x2014;&#x200A;Embeddings@Twitter</a></p><p><a href="https://eng.uber.com/uber-eats-recommending-marketplace/" rel="noopener">Uber Eats&#x200A;&#x2014;&#x200A;Food Discovery with Uber Eats: Recommending for the Marketplace</a></p><p><a href="https://eng.uber.com/uber-eats-graph-learning/" rel="noopener">Uber Eats&#x200A;&#x2014;&#x200A;Food Discovery with Uber Eats: Using Graph Learning to Power Recommendations</a></p><p><a 
href="https://www.inf.unibz.it/~ricci/ISR/papers/p293-davidson.pdf" rel="noopener">YouTube&#x200A;&#x2014;&#x200A;The YouTube Video Recommendation System</a></p><p><a href="https://arxiv.org/pdf/1409.2944.pdf" rel="noopener">YouTube&#x200A;&#x2014;&#x200A;Collaborative Deep Learning for Recommender Systems</a></p><p><a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf" rel="noopener">YouTube&#x200A;&#x2014;&#x200A;Deep Neural Networks for YouTube Recommendations</a></p><p><a href="https://www.zillow.com/tech/embedding-similar-home-recommendation/" rel="noopener">Zillow&#x200A;&#x2014;&#x200A;Home Embeddings for Similar Home Recommendations</a></p><p><a href="https://www.youtube.com/watch?v=giIXNoiqO_U&amp;list=PL-6SiIrhTAi6x4Oq28s7yy94ubLzVXabj" rel="noopener">Andrew Ng&#x2019;s Machine Learning Course (Recommender Systems)</a></p><p><a href="https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture" rel="noopener">Google&#x2019;s Machine Learning Crash Course&#x200A;&#x2014;&#x200A;Embeddings</a></p>]]></content:encoded></item><item><title><![CDATA[Machine Learning System Design]]></title><description><![CDATA[Some great resources on Machine Learning System designs from Facebook, Twitter, Google, Airbnb, Uber, Instagram, Netflix, AWS and Spotify.]]></description><link>https://ricomnl.com/blog/machine-learning-system-design/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f48</guid><category><![CDATA[machine learning]]></category><category><![CDATA[system design]]></category><category><![CDATA[data science]]></category><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Mon, 02 Mar 2020 21:40:00 GMT</pubDate><content:encoded><![CDATA[<p><a href="https://becominghuman.ai/machine-learning-system-design-f2f4018f2f8">Read this post on Medium.</a></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://ricomnl.com/content/images/2020/10/1_mHGcYI_L-ci7jnkKkzfBOw.png" 
class="kg-image" alt loading="lazy" width="1400" height="783" srcset="https://ricomnl.com/content/images/size/w600/2020/10/1_mHGcYI_L-ci7jnkKkzfBOw.png 600w, https://ricomnl.com/content/images/size/w1000/2020/10/1_mHGcYI_L-ci7jnkKkzfBOw.png 1000w, https://ricomnl.com/content/images/2020/10/1_mHGcYI_L-ci7jnkKkzfBOw.png 1400w" sizes="(min-width: 720px) 720px"><figcaption>Facebook Field Guide to Machine Learning</figcaption></figure><p>While preparing for job interviews I found some great resources on Machine Learning System designs from Facebook, Twitter, Google, Airbnb, Uber, Instagram, Netflix, AWS and Spotify.</p><p>I find this to be a fascinating topic because it&#x2019;s something not often covered in online courses.</p><p><strong>Twitter</strong></p><ul><li><a href="https://blog.twitter.com/engineering/en_us/topics/insights/2017/using-deep-learning-at-scale-in-twitters-timelines.html" rel="noopener">Using Deep Learning at Scale in Twitter&#x2019;s Timelines</a></li><li><a href="https://blog.twitter.com/engineering/en_us/topics/insights/2019/improving-engagement-on-digital-ads-with-delayed-feedback.html" rel="noopener">Improving engagement on digital ads with delayed feedback</a></li><li><a href="https://blog.twitter.com/engineering/en_us/topics/insights/2018/embeddingsattwitter.html" rel="noopener">Embeddings@Twitter</a></li></ul><p><strong>Instagram</strong></p><ul><li><a href="https://instagram-engineering.com/lessons-learned-at-instagram-stories-and-feed-machine-learning-54f3aaa09e56" rel="noopener">Lessons Learned at Instagram Stories and Feed Machine Learning</a></li><li><a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/" rel="noopener">Powered by AI: Instagram&#x2019;s Explore recommender system</a></li></ul><p><strong>Facebook</strong></p><ul><li><a href="https://engineering.fb.com/security/fighting-abuse-scale-2019/" rel="noopener">Deep Entity Classification: An abusive account detection framework</a></li><li><a 
href="https://ai.facebook.com/blog/community-standards-report/" rel="noopener">New progress in using AI to detect harmful content</a></li></ul><p><strong>Uber Eats</strong></p><ul><li><a href="https://eng.uber.com/uber-eats-query-understanding/" rel="noopener">Food Discovery with Uber Eats: Building a Query Understanding Engine</a></li><li><a href="https://eng.uber.com/uber-eats-recommending-marketplace/" rel="noopener">Food Discovery with Uber Eats: Recommending for the Marketplace</a></li><li><a href="https://eng.uber.com/uber-eats-graph-learning/" rel="noopener">Food Discovery with Uber Eats: Using Graph Learning to Power Recommendations</a></li></ul><p><strong>Uber</strong></p><ul><li><a href="https://eng.uber.com/nlp-deep-learning-uber-maps/" rel="noopener">Applying Customer Feedback: How NLP &amp; Deep Learning Improve Uber&#x2019;s Maps</a></li><li><a href="https://eng.uber.com/forecasting-introduction/" rel="noopener">Forecasting at Uber: An Introduction</a></li></ul><p><strong>Airbnb</strong></p><ul><li><a href="https://medium.com/airbnb-engineering/using-machine-learning-to-predict-value-of-homes-on-airbnb-9272d3d4739d">Using Machine Learning to Predict Value of Homes On Airbnb</a></li><li><a href="https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e">Listing Embeddings in Search Ranking</a></li><li><a href="https://medium.com/airbnb-engineering/learning-market-dynamics-for-optimal-pricing-97cffbcc53e3">Learning Market Dynamics for Optimal Pricing</a></li><li><a href="https://medium.com/airbnb-engineering/categorizing-listing-photos-at-airbnb-f9483f3ab7e3">Categorizing Listing Photos at Airbnb</a></li><li><a href="https://medium.com/airbnb-engineering/applying-deep-learning-to-airbnb-search-7ebd7230891f">Applying Deep Learning To Airbnb Search</a></li><li><a 
href="https://medium.com/airbnb-engineering/discovering-and-classifying-in-app-message-intent-at-airbnb-6a55f5400a0c">Discovering and Classifying In-app Message Intent at Airbnb</a></li></ul><p><strong>Airbnb Experiences</strong></p><ul><li><a href="https://medium.com/airbnb-engineering/machine-learning-powered-search-ranking-of-airbnb-experiences-110b4b1a0789">Machine Learning-Powered Search Ranking of Airbnb Experiences</a></li></ul><p><strong>Linkedin</strong></p><ul><li><a href="https://engineering.linkedin.com/blog/2018/10/an-introduction-to-ai-at-linkedin" rel="noopener">An Introduction to AI at LinkedIn</a></li><li><a href="https://engineering.linkedin.com/blog/2019/fairness-privacy-transparency-by-design" rel="noopener">Fairness, Privacy, and Transparency by Design in AI/ML Systems</a></li><li><a href="https://engineering.linkedin.com/blog/2019/06/building-communities-around-interests" rel="noopener">Communities AI: Building Communities Around Interests on LinkedIn</a></li><li><a href="https://engineering.fb.com/security/fighting-abuse-scale-2019/" rel="noopener">Preventing abuse using unsupervised learning</a></li></ul><p><strong>Google</strong></p><ul><li><a href="http://highscalability.com/blog/2016/3/16/jeff-dean-on-large-scale-deep-learning-at-google.html" rel="noopener">Jeff Dean On Large-Scale Deep Learning At Google</a></li></ul><p><strong>Netflix</strong></p><ul><li><a href="https://www.youtube.com/watch?v=kY-BCNHd_dM" rel="noopener">A Multi-Armed Bandit Framework for Recommendations at Netflix</a></li></ul><p><strong>Spotify</strong></p><ul><li><a href="https://labs.spotify.com/2020/01/16/for-your-ears-only-personalizing-spotify-home-with-machine-learning/" rel="noopener">For Your Ears Only: Personalizing Spotify Home with Machine Learning</a></li><li><a href="https://medium.com/s/story/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe">How Does Spotify Know You So Well?</a></li></ul><hr><p>In addition, here are some 
resources on a more general process, starting with the book <a href="https://www.amazon.com/Data-Science-Business-Data-Analytic-Thinking/dp/1449361323" rel="noopener">Data Science for Business</a>, which explains CRISP-DM (the Cross-Industry Standard Process for Data Mining).</p><p>The process involves six stages:</p><ol><li>Business Understanding</li><li>Data Understanding</li><li>Data Preparation</li><li>Modelling</li><li>Evaluation</li><li>Deployment</li></ol><p>Here is a higher-level breakdown of <a href="https://gist.github.com/bluekidds/cad5c0ea2e5051b638ec39810f3c4b09" rel="noopener">how to apply CRISP-DM on AWS</a>.</p><p>Facebook also created a video series, the <a href="https://research.fb.com/the-facebook-field-guide-to-machine-learning-video-series/" rel="noopener">Facebook Field Guide to Machine Learning</a>, where they go in depth on how they structure Machine Learning projects.</p>]]></content:encoded></item></channel></rss>