<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[ricomnl]]></title><description><![CDATA[machine learning / bioelectricity / longevity]]></description><link>https://ricomnl.com/</link><image><url>https://ricomnl.com/favicon.png</url><title>ricomnl</title><link>https://ricomnl.com/</link></image><generator>Ghost 4.36</generator><lastBuildDate>Wed, 15 Apr 2026 13:18:26 GMT</lastBuildDate><atom:link href="https://ricomnl.com/blog/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[[Extension: redun] Bioinformatics pipelines from the bottom up]]></title><description><![CDATA[This is the first part of a series of extensions that I will add to my previous post on Bioinformatics pipelines from the bottom up.  Learn about the core features of redun by using it to reimplement a toy bioinformatics workflow.]]></description><link>https://ricomnl.com/blog/bottom-up-bioinformatics-pipeline-extension-redun/</link><guid isPermaLink="false">626fb68c3eb66e1c2428e4b8</guid><category><![CDATA[bioinformatics]]></category><category><![CDATA[pipelines]]></category><category><![CDATA[redun]]></category><category><![CDATA[insitro]]></category><category><![CDATA[biotech]]></category><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Wed, 11 May 2022 21:02:05 GMT</pubDate><media:content url="https://ricomnl.com/content/images/2022/05/sigmund-4CNNH2KEjhc-unsplash.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><aside class="toc"></aside><!--kg-card-end: html--><img src="https://ricomnl.com/content/images/2022/05/sigmund-4CNNH2KEjhc-unsplash.png" alt="[Extension: redun] Bioinformatics pipelines from the bottom up"><p>This is the first part of a series of extensions that I will add to my previous post on <a href="https://ricomnl.com/blog/bottom-up-bioinformatics-pipeline/">Bioinformatics pipelines from the bottom up</a>. In order for you to be able to follow along, I&apos;d recommend skimming over the other tutorial at least up until the part where we start with Makefiles. Here&apos;s what we&apos;ll cover in this tutorial:</p><ul><li>Learn about the core features of redun by using it to reimplement a toy bioinformatics workflow</li><li>Run redun workflows on AWS Batch</li><li>Import submodules via pip</li><li>Emulate Makefile behavior in redun with a custom DSL</li></ul><h1 id="motivation">Motivation</h1><p>Simple workflows are a great way to quickly get an in-depth look into the core features and advantages of new tools. The toy workflow we implemented in the first post (and reimplement in this one) consists of the following steps:</p><ol><li>Take a set of .fasta protein files</li><li>Split each into peptides using a variable number of missed cleavages</li><li>Count the number of cysteines in total as well as the number of peptides that contain a cysteine</li><li>Generate an output report containing this information in a .tsv file</li><li>Create an archive to share with colleagues</li></ol><p>In the last post, I covered vanilla bash, Makefiles, and Nextflow as three modes of execution for bioinformatics workflows. Given the size and scale of modern workflows, the two former are rarely a valid option for anyone anymore and Nextflow is just one example of a toolchain that enables developers to run their pipelines at scale in the cloud. There are <a href="https://github.com/pditommaso/awesome-pipeline">a lot</a> of others.</p><p>For my own work, Python is the main workhorse for all of my data processing and analysis code so naturally, I&apos;m drawn towards something that can natively integrate with it. The most natural integration happens when the toolchain itself is written in Python and I can just annotate my functions with something like a <code>@task</code> operator to be able to chain them into a workflow. A couple of frameworks come to mind here:</p><ul><li><a href="https://metaflow.org/">Metaflow</a></li><li><a href="https://github.com/insitro/redun">Redun</a> (covered in this blog post)</li><li><a href="https://docs.dagster.io/getting-started">Dagster</a></li><li><a href="https://www.prefect.io/">Prefect</a></li><li><a href="https://docs.latch.bio/">Latch SDK</a></li></ul><p>In this post, we&apos;ll cover redun, a tool written by the data engineering team at <a href="https://insitro.com/">Insitro</a> which was open-sourced in November 2021. The <a href="https://github.com/insitro/redun">Github repo</a> contains the following description:</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text"><strong>redun</strong> aims to be a more expressive and efficient workflow framework, built on top of the popular Python programming language. It takes the somewhat contrarian view that writing dataflows directly is unnecessarily restrictive, and by doing so we lose abstractions we have come to rely on in most modern high-level languages (control flow, composability, recursion, high order functions, etc). redun&apos;s key insight is that workflows can be expressed as <a href="https://github.com/insitro/redun#whats-the-trick">lazy expressions</a>, which are then evaluated by a scheduler that performs automatic parallelization, caching, and data provenance logging.</div></div><p>Redun introduces a bunch of interesting features and, in my opinion, it is one of the first workflow tools out there that really nailed it. I highly recommend checking out its very well-written <a href="https://insitro.github.io/redun/design.html">design document</a> as well as reading through the <a href="https://github.com/insitro/redun/blob/main/examples/README.md">first 4 tutorials</a>.</p><h1 id="setup">Setup</h1><p>We start by cloning the existing <a href="https://github.com/ricomnl/bioinformatics-pipeline-tutorial">GitHub repository</a> and use <code>part_00</code> as our starting point:</p><pre><code class="language-bash"># Fork and clone repository and switch to branch part_00
git clone https://github.com/&lt;your_git_username&gt;/bioinformatics-pipeline-tutorial.git
cd bioinformatics-pipeline-tutorial/
git checkout part_00</code></pre><p>The structure of the repository will look like this:</p><pre><code class="language-bash">$ tree
.
&#x251C;&#x2500;&#x2500; README.md
&#x251C;&#x2500;&#x2500; bin
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; 01_digest_protein.py
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; 02_count_amino_acids.py
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; 03a_plot_count.py
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; 03b_get_report.py
&#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; __init__.py
&#x2514;&#x2500;&#x2500; fasta
    &#x251C;&#x2500;&#x2500; KLF4.fasta
    &#x251C;&#x2500;&#x2500; MYC.fasta
    &#x251C;&#x2500;&#x2500; PO5F1.fasta
    &#x2514;&#x2500;&#x2500; SOX2.fasta

2 directories, 10 files</code></pre><p>In order to set up our environment, let&apos;s add a <code>requirements.txt</code> file with the following content and run <code>pip install -r requirements.txt</code> (feel free to use virtualenv, conda, or poetry to set up a virtual environment).</p><pre><code class="language-txt">redun
plotly
kaleido</code></pre><p>First, we create a <code>data/</code> directory for our workflow output data and touch a <code>workflow.py</code> file which we&apos;ll use to write our redun workflow using the existing code from the first tutorial in the <code>bin/</code> folder as a reference.</p><pre><code class="language-bash">mkdir -p data
touch workflow.py</code></pre><h1 id="porting-the-workflow">Porting the workflow</h1><p>A lot of bioinformatics workflows rely on programs that are shipped as binaries and are executed through the command line. Redun supports <a href="https://insitro.github.io/redun/design.html#script-tasks">Script tasks</a> as first-class citizens and we&apos;ll initially explore how to use these to write our workflow. The first step in our pipeline is the script in <code>bin/01_digest_protein.py</code>. In order to call it, let&apos;s start by adding some code to <code>workflow.py</code>:</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;
import os

from redun import File, script, task

redun_namespace = &quot;bioinformatics_pipeline_tutorial.script_workflow&quot;


@task()
def digest_protein_task(
    input_fasta: File,
    enzyme_regex: str = &quot;[KR]&quot;,
    missed_cleavages: int = 0,
    min_length: int = 4,
    max_length: int = 75,
) -&gt; File:
    protein = input_fasta.basename().split(&quot;.&quot;)[0]
    output_path = os.path.join(
        os.path.split(input_fasta.dirname())[0], &quot;data&quot;, f&quot;{protein}.peptides.txt&quot;
    )
    return script(
        f&quot;&quot;&quot;
        bin/01_digest_protein.py \
            {input_fasta.path} \
            {output_path} \
            --enzyme_regex {enzyme_regex} \
            --missed_cleavages {missed_cleavages} \
            --min_length {min_length} \
            --max_length {max_length}
        &quot;&quot;&quot;,
        outputs=File(output_path),
    )</code></pre><p>We can then execute the task like this:</p><pre><code class="language-bash">redun run workflow.py digest_protein_task --input-fasta fasta/KLF4.fasta</code></pre><p>If successful, you should see the file <code>KLF4.peptides.txt</code> in the <code>data/</code> directory.</p><p>Now, a major reason (at least for me) to use redun is that we can natively define workflows in Python. If you&apos;re &#xA0;interested to see a working example of the complete workflow using script tasks, check out the <code>scripts_workflow.py</code> file in the final <code>redun</code> <a href="https://github.com/ricomnl/bioinformatics-pipeline-tutorial/blob/redun/wf/script_workflow.py">branch</a>. In the following part, we will follow a different, Python-native approach of defining tasks. Execute these commands to rename <code>workflow.py</code> and touch a new file.</p><pre><code class="language-bash">mv workflow.py scripts_workflow.py
touch workflow.py</code></pre><p>The first step is to copy over the three functions needed for the <code>digest_protein</code> task: <code>load_fasta()</code>, <code>save_peptides()</code> and <code>digest_protein()</code> (from <code>bin/01_digest_protein.py</code>. Only three small changes were made: we added type annotations as a good practice, added a <code>redun_namespace</code> variable at the top to define the namespace in which we run our workflow, and lastly, we adapted the <code>save_peptides()</code> function to return a redun <code>File</code> object after saving the results to it. For reference, the updated <code>workflow.py</code> file:</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;
import os
import re
from typing import List, Tuple

from redun import File, task

redun_namespace = &quot;bioinformatics_pipeline_tutorial.workflow&quot;


def load_fasta(input_file: File) -&gt; Tuple[str, str]:
    &quot;&quot;&quot;
    Load a protein with its metadata from a given .fasta file.
    &quot;&quot;&quot;
    with input_file.open(&quot;r&quot;) as fasta_file:
        lines = fasta_file.read().splitlines()
    metadata = lines[0]
    sequence = &quot;&quot;.join(lines[1:])
    return metadata, sequence


def save_peptides(filename: str, peptides: List[str]) -&gt; File:
    &quot;&quot;&quot;
    Write out the list of given peptides to a .txt file. Each line is a different peptide.
    &quot;&quot;&quot;
    output_file = File(filename)
    with output_file.open(&quot;w&quot;) as out:
        for peptide in peptides:
            out.write(&quot;{}\n&quot;.format(peptide))
    return output_file


def digest_protein(
    protein_sequence: str,
    enzyme_regex: str = &quot;[KR]&quot;,
    missed_cleavages: int = 0,
    min_length: int = 4,
    max_length: int = 75,
) -&gt; List[str]:
    &quot;&quot;&quot;
    Digest a protein into peptides using a given enzyme. Defaults to trypsin.
    &quot;&quot;&quot;
    # Find the cleavage sites
    enzyme_regex = re.compile(enzyme_regex)
    sites = (
        [0]
        + [m.end() for m in enzyme_regex.finditer(protein_sequence)]
        + [len(protein_sequence)]
    )

    peptides = set()

    # Do the digest
    for start_idx, start_site in enumerate(sites):
        for diff_idx in range(1, missed_cleavages + 2):
            end_idx = start_idx + diff_idx
            if end_idx &gt;= len(sites):
                continue
            end_site = sites[end_idx]
            peptide = protein_sequence[start_site:end_site]
            if len(peptide) &lt; min_length or len(peptide) &gt; max_length:
                continue
            peptides.add(peptide)
    return peptides</code></pre><p>As you can see we now have what is basically the equivalent to our previous <code>bin/01_digest_protein.py</code> file without the <code>main()</code> function and parameter options (we&apos;ll add these later). We can now basically take the <code>main()</code> function from <code>bin/01_digest_protein.py</code> and add the <code>@task</code> operator to it:</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;

[...]

@task()
def digest_protein_task(input_fasta: File, output_file: File) -&gt; File:
    _, protein_sequence = load_fasta(input_fasta)
    peptides = digest_protein(protein_sequence)
    peptides_file = save_peptides(output_file.path, peptides)
    return peptides_file</code></pre><p>In redun, there is no extra <code>@workflow</code> decorator. A workflow gets assembled when a task is called and that task has other tasks that it is dependent on. We will later see what that looks like. The cool thing about this is that it enables us to execute any task by itself (which is not possible with a lot of tools because they only let you execute workflows as a whole). To run the <code>digest_protein_task()</code>, we call:</p><pre><code class="language-bash">redun run workflow.py digest_protein_task --input-fasta fasta/KLF4.fasta --output-file data/KLF4.peptides.txt</code></pre><p>In order to not have to pass the <code>--output-file</code> path as an extra argument, we make it dependent on the input file by changing the code to the following:</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;

[...]

@task()
def digest_protein_task(input_fasta: File) -&gt; File:
    _, protein_sequence = load_fasta(input_fasta)
    peptides = digest_protein(protein_sequence)
    protein = input_fasta.basename().split(&quot;.&quot;)[0]
    output_path = os.path.join(
        os.path.split(input_fasta.dirname())[0], &quot;data&quot;, f&quot;{protein}.peptides.txt&quot;
    )
    peptides_file = save_peptides(output_path, peptides)
    return peptides_file</code></pre><p>Try rerunning it without the `--output-file` flag:</p><pre><code class="language-bash">redun run workflow.py digest_protein_task --input-fasta fasta/KLF4.fasta </code></pre><p>Now, as you might remember, the input we&apos;re dealing with is a list of input files, not just a single file. To make our workflow compatible with that, we add a <code>main()</code> task that takes as input a list of files.</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;
[...]
from redun import File, task
from redun.file import glob_file

[...]

@task()
def main(input_dir: str) -&gt; List[File]:
    input_fastas = [File(f) for f in glob_file(f&quot;{input_dir}/*.fasta&quot;)]
    peptide_files = [digest_protein_task(fasta) for fasta in input_fastas]
    return peptide_files</code></pre><p>Try running:</p><pre><code class="language-bash">redun run workflow.py main --input-dir fasta/</code></pre><p>Now, the <code>digest_protein_task()</code> should have been executed for all files in the <code>fasta/</code> folder. We can ensure this by using redun&apos;s logging functionality. The command <code>redun log -</code> shows the execution of the most recent run. Alternatively one can note down the execution ID (see below) when launching a new run and use either the full string or everything up to the first <code>-</code>: <code>redun log 1091e19e-b5b7-412b-bdf0-b703a9f79cd5</code> or <code>redun log 1091e19e</code>.</p><pre><code class="language-bash">$ redun run workflow.py main --input-dir fasta/                                                                
[redun] redun :: version 0.8.7
[redun] config dir: /Users/ricomeinl/Downloads/bioinformatics-pipeline-tutorial/.redun
[redun] Start Execution 1091e19e-b5b7-412b-bdf0-b703a9f79cd5:  redun run workflow.py main --input-dir fasta/
[...]</code></pre><p>By running either of the above, we can observe that redun indeed ran five tasks: the main tasks and the digest_protein task for all four files.</p><pre><code class="language-bash">$ redun log -
Exec 1091e19e-b5b7-412b-bdf0-b703a9f79cd5 [ DONE ] 2022-05-11 18:17:15:  run workflow.py main --input-dir fasta/ (git_commit=785ccac738c29bb27efa5fe8e950c23018961621, git_origin_url=https://github.com/ricomnl/bioinformatics-pipel..., project=bioinformatics_pipeline_tutorial.workflow, redun.version=0.8.7, user=ricomeinl)
Duration: 0:00:00.15

Jobs: 5 (DONE: 5, CACHED: 0, FAILED: 0)
--------------------------------------------------------------------------------
Job acaf05c6 [ DONE ] 2022-05-11 18:17:15:  bioinformatics_pipeline_tutorial.workflow.main(input_dir=&apos;fasta/&apos;) 
  Job 9620645d [ DONE ] 2022-05-11 18:17:15:  bioinformatics_pipeline_tutorial.workflow.digest_protein_task(File(path=fasta/SOX2.fasta, hash=621d4a48)) 
  Job efb908dc [ DONE ] 2022-05-11 18:17:15:  bioinformatics_pipeline_tutorial.workflow.digest_protein_task(File(path=fasta/KLF4.fasta, hash=10761e8a)) 
  Job fdb4f1fc [ DONE ] 2022-05-11 18:17:15:  bioinformatics_pipeline_tutorial.workflow.digest_protein_task(File(path=fasta/PO5F1.fasta, hash=341326f2)) 
  Job f2bc3668 [ DONE ] 2022-05-11 18:17:15:  bioinformatics_pipeline_tutorial.workflow.digest_protein_task(File(path=fasta/MYC.fasta, hash=daeb9045)) </code></pre><p>Great stuff! Let&apos;s add the next task <code>count_amino_acids()</code>.</p><p>We start by adding the helper functions from <code>bin/02_count_amino_acids.py</code> to our <code>workflow.py</code> file (with some small changes akin to the ones mentioned above).</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;

[...]

def load_peptides(input_file: File) -&gt; List[str]:
    &quot;&quot;&quot;
    Load peptides from a .txt file as a list.
    &quot;&quot;&quot;
    with input_file.open(&quot;r&quot;) as peptide_file:
        lines = peptide_file.read().splitlines()
    return lines


def save_counts(filename: str, peptide_counts: List[int]) -&gt; File:
    &quot;&quot;&quot;
    Write out the peptide counts to a .tsv file using tabs as a separator.
    &quot;&quot;&quot;
    output_file = File(filename)
    with output_file.open(&quot;w&quot;) as out:
        out.write(&quot;{}\n&quot;.format(&quot;\t&quot;.join([str(c) for c in peptide_counts])))
    return output_file


def num_peptides(peptides: List[str]) -&gt; int:
    &quot;&quot;&quot;
    Retrieve the number of peptides in a given list.
    &quot;&quot;&quot;
    return len(peptides)


def num_peptides_with_aa(peptides: List[str], amino_acid: str = &quot;C&quot;) -&gt; int:
    &quot;&quot;&quot;
    Count the number of peptides in a given list that contain a given amino acid. 
    Defaults to cysteine.
    &quot;&quot;&quot;
    return sum([1 if amino_acid in peptide else 0 for peptide in peptides])


def total_num_aa_in_protein(protein: str) -&gt; int:
    &quot;&quot;&quot;
    Count the total number of amino acids in a given protein string.
    &quot;&quot;&quot;
    return len(protein)


def num_aa_in_protein(protein: str, amino_acid: str = &quot;C&quot;) -&gt; int:
    &quot;&quot;&quot;
    Count the number of times a given amino acid occurs in a given protein.
    Defaults to cysteine.
    &quot;&quot;&quot;
    return protein.count(amino_acid)
    

@task()
def digest_protein_task(input_fasta: File) -&gt; File: ...

[...]</code></pre><p>After that, we again port the <code>main()</code> function, this time from <code>bin/02_count_amino_acids.py</code>, to a new function decorated with <code>@task()</code>:</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;

[...]

@task()
def count_amino_acids_task(
    input_fasta: File, input_peptides: File, amino_acid: str = &quot;C&quot;
) -&gt; File:
    &quot;&quot;&quot;
    Count the number of times a given amino acid appears in a protein as well
    as its peptides after digestion.
    &quot;&quot;&quot;
    _, protein_sequence = load_fasta(input_fasta)
    peptides = load_peptides(input_peptides)
    n_peptides = num_peptides(peptides)
    n_peptides_with_aa = num_peptides_with_aa(peptides, amino_acid=amino_acid)
    total_aa_in_protein = total_num_aa_in_protein(protein_sequence)
    aa_in_protein = num_aa_in_protein(protein_sequence, amino_acid=amino_acid)
    protein = input_fasta.basename().split(&quot;.&quot;)[0]
    output_path = os.path.join(
        os.path.split(input_fasta.dirname())[0], &quot;data&quot;, f&quot;{protein}.count.tsv&quot;
    )
    aa_count_file = save_counts(
        output_path,
        [
            amino_acid,
            n_peptides,
            n_peptides_with_aa,
            total_aa_in_protein,
            aa_in_protein,
        ],
    )
    return aa_count_file


@task()
def main(input_dir: str) -&gt; List[File]: ...</code></pre><p>It&apos;s easy to test our task by itself:</p><pre><code class="language-bash">redun run workflow.py count_amino_acids_task --input-fasta fasta/KLF4.fasta --input-peptides data/KLF4.peptides.txt</code></pre><p>If successful, the task should have created a file called <code>KLF4.count.tsv</code> in the <code>data/</code> folder. We can now combine the two tasks in our <code>main()</code> function and execute it with:</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;

[...]

@task()
def main(input_dir: str) -&gt; List[File]:
	input_fastas = [File(f) for f in glob_file(f&quot;{input_dir}/*.fasta&quot;)]
    peptide_files = [digest_protein_task(fasta) for fasta in input_fastas]
    aa_count_files = [count_amino_acids_task(fasta, peptides) for (fasta, peptides) in zip(input_fastas, peptide_files)]
    return aa_count_files</code></pre><pre><code class="language-bash">redun run workflow.py main --input-dir fasta/</code></pre><p>Running <code>redun log -</code> again will show that this time the first four tasks were cached because neither the code of the tasks nor the generated files were changed.</p><pre><code class="language-bash">$ redun log -
Exec 8c438a1b-c24d-49c0-9c4c-cdf71e2504a8 [ DONE ] 2022-05-11 20:52:22:  run workflow.py main --input-dir fasta/ (git_commit=785ccac738c29bb27efa5fe8e950c23018961621, git_origin_url=https://github.com/ricomnl/bioinformatics-pipel..., project=bioinformatics_pipeline_tutorial.workflow, redun.version=0.8.7, user=ricomeinl)
Duration: 0:00:00.18

Jobs: 9 (DONE: 5, CACHED: 4, FAILED: 0)
--------------------------------------------------------------------------------
Job 72959d5d [ DONE ] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.main(input_dir=&apos;fasta/&apos;) 
  Job 344ede72 [CACHED] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.digest_protein_task(File(path=fasta/SOX2.fasta, hash=621d4a48)) 
  Job 0e853ce6 [CACHED] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.digest_protein_task(File(path=fasta/KLF4.fasta, hash=10761e8a)) 
  Job d8a5ea59 [CACHED] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.digest_protein_task(File(path=fasta/PO5F1.fasta, hash=341326f2)) 
  Job 80151743 [CACHED] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.digest_protein_task(File(path=fasta/MYC.fasta, hash=daeb9045)) 
  Job 60d89b74 [ DONE ] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.count_amino_acids_task(File(path=fasta/SOX2.fasta, hash=621d4a48), File(path=data/SOX2.peptides.txt, hash=de981d55), amino_acid=&apos;C&apos;) 
  Job 1271c054 [ DONE ] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.count_amino_acids_task(File(path=fasta/KLF4.fasta, hash=10761e8a), File(path=data/KLF4.peptides.txt, hash=365eea97), amino_acid=&apos;C&apos;) 
  Job 92e3bbe5 [ DONE ] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.count_amino_acids_task(File(path=fasta/PO5F1.fasta, hash=341326f2), File(path=data/PO5F1.peptides.txt, hash=cf7b5a5e), amino_acid=&apos;C&apos;) 
  Job ad2298f6 [ DONE ] 2022-05-11 20:52:22:  bioinformatics_pipeline_tutorial.workflow.count_amino_acids_task(File(path=fasta/MYC.fasta, hash=daeb9045), File(path=data/MYC.peptides.txt, hash=06a265e1), amino_acid=&apos;C&apos;)</code></pre><p>I now encourage you to add the two final tasks yourself. Remember from the last post, we want to create plots for the generated counts of each protein .fasta file (<code>bin/03a_plot_count.py</code> ) and finally, generate an output report for the results (<code>bin/03b_get_report.py</code>). Below you can find the final code for the <code>main()</code> and the <code>archive_results_task()</code>. Try to fill in the code for <code>plot_count_task()</code> and <code>get_report_task()</code>.</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;


[...]


@task()
def plot_count_task(input_count: File) -&gt; File:
    &quot;&quot;&quot;
    Load the calculated counts and create a plot.
    &quot;&quot;&quot;
    # TODO
    pass


@task()
def get_report_task(input_counts: List[File]) -&gt; File:
    &quot;&quot;&quot;
    Get a list of input files from a given folder and create a report.
    &quot;&quot;&quot;
    # TODO
    pass


@task()
def archive_results_task(inputs_plots: List[File], input_report: File) -&gt; File:
    output_path = os.path.join(
        os.path.split(input_report.dirname())[0], &quot;data&quot;, f&quot;results.tgz&quot;
    )
    tar_file = File(output_path)
    with tar_file.open(&quot;wb&quot;) as out:
        with tarfile.open(fileobj=out, mode=&quot;w|gz&quot;) as tar:
            for file_path in inputs_plots + [input_report]:
                if get_filesystem_class(url=file_path.path).name == &quot;s3&quot;:
                    tmp_file = File(os.path.basename(file_path.path))
                else:
                    tmp_file = file_path
                output_file = file_path.copy_to(tmp_file, skip_if_exists=True)
                tar.add(output_file.path)
    return tar_file


@task()
def main(
    input_dir: str,
    amino_acid: str = &quot;C&quot;,
    enzyme_regex: str = &quot;[KR]&quot;,
    missed_cleavages: int = 0,
    min_length: int = 4,
    max_length: int = 75,
) -&gt; List[File]:
    input_fastas = [File(f) for f in glob_file(f&quot;{input_dir}/*.fasta&quot;)]
    peptide_files = [
        digest_protein_task(
            fasta,
            enzyme_regex=enzyme_regex,
            missed_cleavages=missed_cleavages,
            min_length=min_length,
            max_length=max_length,
        )
        for fasta in input_fastas
    ]
    aa_count_files = [
        count_amino_acids_task(
            fasta, peptides, amino_acid=amino_acid
        )
        for (fasta, peptides) in zip(input_fastas, peptide_files)
    ]
    count_plots = [
        plot_count_task(aa_count)
        for aa_count in aa_count_files
    ]
    report_file = get_report_task(aa_count_files)
    results_archive = archive_results_task(
        count_plots, report_file
    )
    return results_archive</code></pre><p>Hint: In order to port over <code>bin/03a_plot_count.py</code> the <code>plot_counts()</code> function needs to be adjusted to use plotly instead of matplotlib because redun parallelizes tasks across multiple threads and matplotlib will throw an error when it&apos;s run outside the main thread. </p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Update: The code using <code>matplotlib</code> should still work when using the process-based instead of the <a href="https://insitro.github.io/redun/executors.html#local-executor">thread-based executor</a>.</div></div><p>Hence, here is the updated function using plotly:</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;
import os
import re
from typing import List, Tuple

from plotly.subplots import make_subplots
import plotly.graph_objects as go
from redun import task, File
from redun.file import glob_file, get_filesystem_class

[...]

def plot_counts(filename: str, counts: List[str]) -&gt; File:
    &quot;&quot;&quot;
    Plot the calculated counts.
    &quot;&quot;&quot;
    (
        amino_acid,
        n_peptides,
        n_peptides_with_aa,
        total_aa_in_peptides,
        aa_in_peptides,
    ) = counts
    labels_n_peptides = [&quot;No. of Peptides&quot;, &quot;No. of Peptides w/ {}&quot;.format(amino_acid)]
    labels_n_aa = [&quot;Total No. of Amino Acids&quot;, &quot;No. of {}&apos;s&quot;.format(amino_acid)]
    colors = [&quot;#001425&quot;, &quot;#308AAD&quot;]
    fig = make_subplots(rows=1, cols=2)
    fig.add_trace(
        go.Bar(
            x=labels_n_peptides,
            y=[int(n_peptides_with_aa), int(n_peptides)],
            marker_color=colors[0],
        ),
        row=1,
        col=1,
    )
    fig.add_trace(
        go.Bar(
            x=labels_n_aa,
            y=[int(aa_in_peptides), int(total_aa_in_peptides)],
            marker_color=colors[1],
        ),
        row=1,
        col=2,
    )
    fig.update_layout(
        height=600,
        width=800,
        title_text=&quot;{}&apos;s in Peptides and Amino Acids&quot;.format(amino_acid),
        showlegend=False,
    )
    if get_filesystem_class(url=filename).name == &quot;s3&quot;:
        tmp_file = File(os.path.basename(filename))
    else:
        tmp_file = File(filename)
    fig.write_image(tmp_file.path)
    output_file = tmp_file.copy_to(File(filename), skip_if_exists=True)
    return output_file

[...]</code></pre><p>If you&apos;ve made it all there way up until here, you can check your solution against the working version on the <code>redun</code> <a href="https://github.com/ricomnl/bioinformatics-pipeline-tutorial/blob/redun/wf/old_workflow.py">branch of the Github repository</a>.</p><p>If you run <code>redun run workflow.py main --input-dir fasta/</code>, your local <code>data/</code> repository should be populated with these files:</p><pre><code class="language-bash">$ tree data 
data
&#x251C;&#x2500;&#x2500; KLF4.count.plot.png
&#x251C;&#x2500;&#x2500; KLF4.count.tsv
&#x251C;&#x2500;&#x2500; KLF4.peptides.txt
&#x251C;&#x2500;&#x2500; MYC.count.plot.png
&#x251C;&#x2500;&#x2500; MYC.count.tsv
&#x251C;&#x2500;&#x2500; MYC.peptides.txt
&#x251C;&#x2500;&#x2500; PO5F1.count.plot.png
&#x251C;&#x2500;&#x2500; PO5F1.count.tsv
&#x251C;&#x2500;&#x2500; PO5F1.peptides.txt
&#x251C;&#x2500;&#x2500; SOX2.count.plot.png
&#x251C;&#x2500;&#x2500; SOX2.count.tsv
&#x251C;&#x2500;&#x2500; SOX2.peptides.txt
&#x251C;&#x2500;&#x2500; protein_report.tsv
&#x2514;&#x2500;&#x2500; results.tgz

0 directories, 14 files</code></pre><h1 id="taking-it-to-the-cloud">Taking it to the cloud</h1><p>To define where a <code>@task</code> will run, we can specify a task executor like this:</p><pre><code class="language-python">@task(executor=&quot;my_executor&quot;)
def digest_protein_task():
    # ...</code></pre><p>The executor <code>my_executor</code> then has to be defined in the <a href="https://insitro.github.io/redun/config.html">redun configuration</a> <code>.redun/redun.ini</code>. If you go and open it up you can see the default executor defined already:</p><pre><code class="language-bash"># redun configuration.

[backend]
db_uri = sqlite:///redun.db

[executors.default]
type = local
max_workers = 20</code></pre><p>As of now, redun supports AWS Batch and Glue executors that will run tasks in the cloud. A Kubernetes executor is <a href="https://github.com/insitro/redun/pull/22">currently in the making</a>. We&apos;ll walk through how to create one for AWS Batch below. </p><p>I&apos;m not going to go through all the steps on how to set up AWS Batch as there are a lot of great tutorials online. If you want to follow along make sure you have <a href="https://docs.docker.com/get-docker/">Docker installed</a> and an existing <a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html">AWS CLI setup</a>.</p><p>You&apos;ll need the following AWS resources:</p><ul><li>S3 Bucket</li><li>Push access to Elastic Container Registry (ECR)</li><li>An AWS Batch queue that we can publish jobs to</li></ul><p>To get started, we need to create a Dockerfile like this:</p><pre><code class="language-Dockerfile">FROM ubuntu:20.04

# Install OS-level libraries.
RUN apt-get update -y &amp;&amp; DEBIAN_FRONTEND=&quot;noninteractive&quot; apt-get install -y \
    python3 \
    python3-pip &amp;&amp; \
    apt-get clean

WORKDIR /code

# Install our python code dependencies.
COPY requirements.txt .
RUN pip3 install --upgrade pip
RUN pip3 install -r requirements.txt</code></pre><p>We&apos;ll also create a Makefile to simplify the process of building and pushing our Docker image:</p><pre><code class="language-Makefile">IMAGE=bioinformatics_pipeline_tutorial
ACCOUNT=$(shell aws ecr describe-registry --query registryId --output text)
REGION=$(shell aws configure get region)
REGISTRY=$(ACCOUNT).dkr.ecr.$(REGION).amazonaws.com

login:
	aws ecr get-login-password --region $(REGION) | docker login --username AWS --password-stdin $(REGISTRY)

build:
	docker build -t $(REGISTRY)/$(IMAGE) --build-arg REGISTRY=$(REGISTRY) .

build-local:
	docker build -t $(IMAGE) --build-arg REGISTRY=$(REGISTRY) .

create-repo:
	aws ecr create-repository --repository=$(IMAGE)

push:
	docker push $(REGISTRY)/$(IMAGE)

bash:
	docker run --rm -it $(REGISTRY)/$(IMAGE) bash

bash-local:
	docker run --rm -it $(IMAGE) bash
</code></pre><p>To build and test the Docker image locally, run:</p><pre><code class="language-bash">make build-local
docker run --rm -it bioinformatics_pipeline_tutorial pip list | grep &quot;redun&quot;</code></pre><p>If the output is <code>redun &#xA0; &#xA0; &#xA0; &#xA0; &#xA0; &#xA0; &#xA0; &#xA0; &#xA0; &#xA0;0.8.7</code>, the Dockerfile has built and installed the dependencies correctly. &#xA0;</p><p>Now, we can use the following make command to build our Docker image:</p><pre><code class="language-bash">make login
make build</code></pre><p>After the image builds, we need to publish it to ECR so that it is accessible by AWS Batch. There are several steps for doing that, which are covered in these make commands:</p><pre><code class="language-bash"># If the docker repo does not exist yet.
make create-repo

# Push the locally built image to ECR.
make push</code></pre><p>You might be wondering: how will our Python code get into the container? We didn&apos;t add our <code>workflow.py</code> file to the Docker image. The answer lies in redun&apos;s <a href="https://insitro.github.io/redun/executors.html#code-packaging">code packaging feature</a>, which essentially packages all the Python code in the current directory into a tar file and copies it to our S3 scratch directory. From here, it will be downloaded into the running AWS Batch job. This makes it a lot faster to iterate, without having to rebuild the Docker image for every code change.</p><p>Let&apos;s add our custom AWS Batch executor to our <code>.redun/redun.ini</code> config in the current working directory:</p><pre><code class="language-bash">[...]

[executors.batch]
type = aws_batch

# Required:
image = YOUR_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/bioinformatics_pipeline_tutorial
queue = YOUR_QUEUE_NAME
s3_scratch = s3://YOUR_BUCKET/redun/

# Optional:
role = arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_ROLE
job_name_prefix = redun-example
debug = False</code></pre><p>To get a working example, you&apos;ll need to replace all caps variables <code>YOUR_ACCOUNT_ID</code>, <code>YOUR_QUEUE_NAME</code>, and <code>YOUR_BUCKET</code>, with your own AWS Account ID, AWS Batch queue name, and S3 bucket, respectively.</p><p>Next up, make sure that all the tasks are equipped with our shiny new AWS Batch executor.</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;

[...]

@task(executor=&quot;batch&quot;)
def main(...)</code></pre><p>Now we can execute our pipeline as usual, with the difference that redun will now run each task as a separate AWS Batch job and use the input data stored in the given S3 bucket (you&apos;ll need to upload the <code>fasta/</code> folder to the S3 bucket you want to use).</p><pre><code class="language-bash">redun run workflow.py main --input-dir s3://YOUR_BUCKET/fasta/</code></pre><p>Note: I ran into the following error and used <a href="https://aws.amazon.com/premiumsupport/knowledge-center/ecs-unable-to-assume-role/">the fix detailed in this post</a> by AWS to set up my role correctly.</p><div class="kg-card kg-callout-card kg-callout-card-red"><div class="kg-callout-text">ECS was unable to assume the role &apos;arn:aws:iam::***:role/role-name&apos; that was provided for this task. Please verify that the role being passed has the proper trust relationship and permissions and that your IAM user has permissions to pass this role.</div></div><p>Redun provides a nice <a href="https://github.com/insitro/redun/tree/main/examples/05_aws_batch#interactive-debugging">debug functionality</a> through which the tasks run locally in Docker containers and the data is still pulled from S3. To enable it change the <code>debug</code> field in the <code>.redun/redun.ini</code> config:</p><pre><code class="language-bash">[...]

[executors.batch]

[...]

debug = True</code></pre><p>Then, in order to jump into a running task, you can add the familiar <code>import pdb; pdb.set_trace()</code> statement to debug. </p><h1 id="importing-submodules-via-pip">Importing submodules via pip</h1><p>Ok, so we&apos;ve written a workflow that consists of five tasks that are connected together through our <code>main()</code> task. The workflow itself might be quite specific but it&apos;s easy to imagine that many individual tasks could be reused by other workflows. Redun solves this very elegantly and it&apos;s something that&apos;s hard to get right (e.g. Nextflow only very recently added this feature with the <a href="https://www.nextflow.io/docs/latest/dsl2.html#modules">release of their DSL2</a> and it&apos;s still bumpy IMO). </p><p>We&apos;re going to create a <code>src/</code> folder and put all of our reusable tasks in there so that someone who wants to use them can just <code>pip install</code> our Github repository (or released Python package).</p><pre><code class="language-bash">mkdir -p bioinformatics_pipeline_tutorial/
touch bioinformatics_pipeline_tutorial/__init__.py
touch bioinformatics_pipeline_tutorial/lib.py</code></pre><p>Now copy everything except the <code>main()</code> function into <code>bioinformatics_pipeline_tutorial/lib.py</code>. In the <code>workflow.py</code> file, import the task functions.</p><pre><code class="language-python">&quot;&quot;&quot;workflow.py&quot;&quot;&quot;
[...]

from bioinformatics_pipeline_tutorial.lib import (
    digest_protein_task,
    count_amino_acids_task,
    plot_count_task,
    get_report_task,
    archive_results_task,
)

[...]</code></pre><p>Finally, let&apos;s add a <code>setup.py</code> file to make the Github repository installable. Feel free to try and publish your package on pip and install it for another project.</p><pre><code class="language-python">&quot;&quot;&quot;setup.py&quot;&quot;&quot;
from setuptools import setup


setup(
    name=&quot;bioinformatics_pipeline_tutorial&quot;,
    version=&quot;0.0.1&quot;,
    packages=[&quot;bioinformatics_pipeline_tutorial&quot;],
    install_requires=[&quot;redun&quot;, &quot;plotly&quot;, &quot;kaleido&quot;],
)
</code></pre><p>You can try it out by installing the finished module via:</p><pre><code class="language-bash">pip install git+https://github.com/&lt;your_git_username&gt;/bioinformatics-pipeline-tutorial@redun</code></pre><p>Then open up a Python console by calling <code>python</code> and try to import the <code>digest_protein_task()</code>.</p><pre><code class="language-bash">$ python
Python 3.8.5 (default, Sep 27 2020, 11:35:15) 
[Clang 12.0.0 (clang-1200.0.32.2)] on darwin
Type &quot;help&quot;, &quot;copyright&quot;, &quot;credits&quot; or &quot;license&quot; for more information.
&gt;&gt;&gt; from bioinformatics_pipeline_tutorial.lib import digest_protein_task
&gt;&gt;&gt;</code></pre><p>Check out the final state of this part of the tutorial by pulling the <code>redun</code> <a href="https://github.com/ricomnl/bioinformatics-pipeline-tutorial/tree/redun">branch</a>.</p><h1 id="redun-makefiles">Redun Makefiles</h1><p>The <a href="https://github.com/insitro/redun/tree/main/examples/02_compile#bonus-round">bonus round section</a> of the second tutorial in the redun Github repository shows how redun can emulate Makefile behavior. Those of you who recall the original blog post will remember that in <code>part_02</code> we created a Makefile to execute our pipeline. Let&apos;s have some fun and try to rewrite the Makefile in redun. This part will also showcase redun&apos;s ability to do recursions.</p><p>First, check out <code>part_02</code> and run <code>make all</code> to make sure it&apos;s still working:</p><pre><code class="language-bash">git checkout part_02</code></pre><p>As you will recall from the last post, a Makefile is made up of recipes that specify how to build target files. The structure of each recipe is this:</p><pre><code class="language-Makefile">targets: prerequisites
	command</code></pre><p>This is how we would specify a recipe for how to create the file <code>KLF4.peptides.txt</code> in the <code>data/</code> folder (the target) using the <code>bin/01_digest_protein.py</code> script (the command) with <code>fasta/KLF4.fasta</code> being the only prerequisite.</p><pre><code class="language-Makefile">data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt</code></pre><p>We can then call <code>make data/KLF4.peptides.txt</code> to generate the target file.</p><p>To emulate this behavior in redun, we start by creating a file <code>make_workflow.py</code>. </p><pre><code class="language-bash">touch make_workflow.py</code></pre><p>We start by defining the first rule with a custom DSL:</p><pre><code class="language-python">&quot;&quot;&quot;make_workflow.py&quot;&quot;&quot;

redun_namespace = &quot;bioinformatics_pipeline_tutorial.make_workflow&quot;


# Custom DSL for describing targets, dependencies (deps), and commands.
rules = {
    &quot;data/KLF4.peptides.txt&quot;: {
        &quot;deps&quot;: [&quot;fasta/KLF4.fasta&quot;],
        &quot;command&quot;: &quot;bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt&quot;
    },
}</code></pre><p>Next, we copy the two functions <code>run_command()</code> and <code>make()</code> from the <a href="https://github.com/insitro/redun/blob/main/examples/02_compile/make2.py">redun tutorial</a>. The <code>run_command()</code> function takes as input a shell command specified as a string which it runs to generate the target file. The <code>make()</code> function generates the target by recursively creating all its dependencies (if needed).</p><pre><code class="language-python">&quot;&quot;&quot;make_workflow.py&quot;&quot;&quot;
import os
from typing import List, Optional

from redun import task, File
from redun.functools import const


redun_namespace = &quot;bioinformatics_pipeline_tutorial.make_workflow&quot;


# Custom DSL for describing targets, dependencies (deps), and commands.
rules = {
    &quot;data/KLF4.peptides.txt&quot;: {
        &quot;deps&quot;: [&quot;fasta/KLF4.fasta&quot;],
        &quot;command&quot;: &quot;bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt&quot;
    },
}


@task()
def run_command(command: str, inputs: List[File], output_path: str) -&gt; File:
    &quot;&quot;&quot;
    Run a shell command to produce a target.
    &quot;&quot;&quot;
    # Ignore inputs. We pass it as an argument to simply force a dependency.
    assert os.system(command) == 0
    return File(output_path)


@task()
def make(target: str = &quot;all&quot;, rules: dict = rules) -&gt; Optional[File]:
    &quot;&quot;&quot;
    Make a target (file) using a series of rules.
    &quot;&quot;&quot;
    rule = rules.get(target)
    if not rule:
        # No rule. See if target already exists.
        file = File(target)
        if not file.exists():
            raise ValueError(f&quot;No rule for target: {target}&quot;)
        return file

    # Recursively make dependencies.
    inputs = [
        make(dep, rules=rules)
        for dep in rule.get(&quot;deps&quot;, [])
    ]

    # Run command, if needed.
    if &quot;command&quot; in rule:
        return run_command(rule[&quot;command&quot;], inputs, target)
    else:
        return const(None, inputs)
</code></pre><p>We can generate a target by calling:</p><pre><code class="language-bash">redun run make_workflow.py make --target data/KLF4.peptides.txt</code></pre><p>Let&apos;s add some more rules:</p><pre><code class="language-python">&quot;&quot;&quot;make_workflow.py&quot;&quot;&quot;

[...]

# Custom DSL for describing targets, dependencies (deps), and commands.
rules = {
    &quot;data/KLF4.peptides.txt&quot;: {
        &quot;deps&quot;: [&quot;fasta/KLF4.fasta&quot;],
        &quot;command&quot;: &quot;bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt&quot;
    },
    &quot;data/KLF4.count.tsv&quot;: {
        &quot;deps&quot;: [&quot;fasta/KLF4.fasta&quot;, &quot;data/KLF4.peptides.txt&quot;],
        &quot;command&quot;: &quot;bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv&quot;
    },
    &quot;data/KLF4.plot.png&quot;: {
        &quot;deps&quot;: [&quot;data/KLF4.count.tsv&quot;],
        &quot;command&quot;: &quot;bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png&quot;
    },
    &quot;data/protein_report.tsv&quot;: {
        &quot;deps&quot;: [&quot;data/KLF4.count.tsv&quot;],
        &quot;command&quot;: &quot;bin/03b_get_report.py data/KLF4.count.tsv --output_file=data/protein_report.tsv&quot;
    },
    &quot;data/results.tgz&quot;: {
        &quot;deps&quot;: [&quot;data/KLF4.plot.png&quot;, &quot;data/protein_report.tsv&quot;],
        &quot;command&quot;: &quot;&quot;&quot;rm -rf results
                      mkdir results
                      cp data/KLF4.plot.png data/protein_report.tsv results/
                      tar -czf data/results.tgz results
                      rm -r results&quot;&quot;&quot;
    },
}

[...]</code></pre><p>Now we can run the command below and it will generate the files <code>data/KLF4.plot.png</code>, <code>data/KLF4.count.tsv</code>, and <code>data/KLF4.peptides.txt</code>.</p><pre><code class="language-bash">redun run make_workflow.py make --target data/KLF4.plot.png</code></pre><p>But what if we wanted to add rules for new proteins? If you read the last post, you already know the answer and it&apos;s not: add separate rules for each file. We&apos;ll use pattern matching. Now, the initial implementation from the redun examples doesn&apos;t support pattern matching. Therefore, let&apos;s add a more advanced <code>match_target()</code> function and use it within the <code>make()</code> function. We&apos;ll also need to adjust our rules accordingly and use <code>%</code> as a wild card.</p><pre><code class="language-python">&quot;&quot;&quot;make_workflow.py&quot;&quot;&quot;
import os
from typing import Dict, List, Optional

[...]

rules = {
    &quot;data/%.peptides.txt&quot;: {
        &quot;deps&quot;: [&quot;fasta/%.fasta&quot;],
        &quot;command&quot;: &quot;bin/01_digest_protein.py fasta/%.fasta data/%.peptides.txt&quot;
    },
    &quot;data/%.count.tsv&quot;: {
        &quot;deps&quot;: [&quot;fasta/%.fasta&quot;, &quot;data/%.peptides.txt&quot;],
        &quot;command&quot;: &quot;bin/02_count_amino_acids.py fasta/%.fasta data/%.peptides.txt data/%.count.tsv&quot;
    },
    &quot;data/%.plot.png&quot;: {
        &quot;deps&quot;: [&quot;data/%.count.tsv&quot;],
        &quot;command&quot;: &quot;bin/03a_plot_count.py data/%.count.tsv data/%.plot.png&quot;
    },
    &quot;data/protein_report.tsv&quot;: {
        &quot;deps&quot;: [&quot;data/%.count.tsv&quot;],
        &quot;command&quot;: &quot;bin/03b_get_report.py data/%.count.tsv --output_file=data/protein_report.tsv&quot;
    },
    &quot;data/results.tgz&quot;: {
        &quot;deps&quot;: [&quot;data/%.plot.png&quot;, &quot;data/protein_report.tsv&quot;],
        &quot;command&quot;: &quot;&quot;&quot;rm -rf results
                      mkdir results
                      cp data/%.plot.png data/protein_report.tsv results/
                      tar -czf data/results.tgz results
                      rm -r results&quot;&quot;&quot;
    },
}

def match_target(target: str = &quot;all&quot;, rules: dict = rules) -&gt; Optional[Dict[str, Dict]]:
    &quot;&quot;&quot;
    Emulate GNU make pattern matching described here: 
    https://www.gnu.org/software/make/manual/html_node/Pattern-Match.html#Pattern-Match
    &quot;&quot;&quot;
    rule = rules.get(target)
    if not rule:
        _, tbase = os.path.split(target)
        for rkey, rval in rules.items():
            _, rbase = os.path.split(rkey)
            if not &quot;%&quot; in rbase: continue
            pre, post = rbase.split(&quot;%&quot;)
            if tbase.startswith(pre) and tbase.endswith(post):
                stem = tbase[len(pre):-len(post)]
                rule = {
                    &quot;deps&quot;: [dep.replace(&quot;%&quot;, stem) for dep in rval.get(&quot;deps&quot;, [])],
                    &quot;command&quot;: rval.get(&quot;command&quot;, &quot;&quot;).replace(&quot;%&quot;, stem),
                }
                break
    return rule
    
[...]

@task()
def make(target: str = &quot;all&quot;, rules: dict = rules) -&gt; Optional[File]:
    &quot;&quot;&quot;
    Make a target (file) using a series of rules.
    &quot;&quot;&quot;
    rule = match_target(target, rules) if not &quot;%&quot; in target else None
    [...]</code></pre><p>We can now generate the target for <em>any</em> protein and our workflow will substitute the <code>%</code> with a matched stem if it finds one. Try running:</p><pre><code class="language-bash">redun run make_workflow.py make --target data/MYC.plot.png</code></pre><p>However, if you try to generate either one of the last two targets <code>data/protein_report.tsv</code> and <code>data/results.tgz</code>, you&apos;ll run into the following issue:</p><pre><code class="language-bash">$ redun run make_workflow.py make --target data/protein_report.tsv
[...]
ValueError: No rule for target: data/%.count.tsv</code></pre><p>This is the same behavior we&apos;d get if we were to run the same command with make as seen in the last post:</p><pre><code class="language-bash">make: *** No rule to make target `data/%.count.tsv&apos;, needed by `data/%.plot.png&apos;. Stop.</code></pre><p>This occurs because when trying to generate the target <code>data/protein_report.tsv</code>, one of its dependencies is <code>data/%.count.tsv</code> and there is no way for redun (or make) to know which stem to replace the wildcard with. Hence, at some point in our program, we need to define a list of target files that we want to generate. We insert two variables at the top and use them for the last two rules. We also add recipes for <code>all</code> and <code>clean</code>.</p><pre><code class="language-python">&quot;&quot;&quot;make_workflow.py&quot;&quot;&quot;
[...]

COUNT = [&quot;data/KLF4.count.tsv&quot;, &quot;data/MYC.count.tsv&quot;, &quot;data/PO5F1.count.tsv&quot;, &quot;data/SOX2.count.tsv&quot;]
PLOT = [&quot;data/KLF4.plot.png&quot;, &quot;data/MYC.plot.png&quot;, &quot;data/PO5F1.plot.png&quot;, &quot;data/SOX2.plot.png&quot;]


# Custom DSL for describing targets, dependencies (deps), and commands.
rules = {
    &quot;all&quot;: {
        &quot;deps&quot;: [&quot;data/results.tgz&quot;],
    },
    &quot;clean&quot;: {
        &quot;command&quot;: &quot;rm -rf data/*&quot;,
    },
    &quot;data/%.peptides.txt&quot;: {
        &quot;deps&quot;: [&quot;fasta/%.fasta&quot;],
        &quot;command&quot;: &quot;bin/01_digest_protein.py fasta/%.fasta data/%.peptides.txt&quot;
    },
    &quot;data/%.count.tsv&quot;: {
        &quot;deps&quot;: [&quot;fasta/%.fasta&quot;, &quot;data/%.peptides.txt&quot;],
        &quot;command&quot;: &quot;bin/02_count_amino_acids.py fasta/%.fasta data/%.peptides.txt data/%.count.tsv&quot;
    },
    &quot;data/%.plot.png&quot;: {
        &quot;deps&quot;: [&quot;data/%.count.tsv&quot;],
        &quot;command&quot;: &quot;bin/03a_plot_count.py data/%.count.tsv data/%.plot.png&quot;
    },
    &quot;data/protein_report.tsv&quot;: {
        &quot;deps&quot;: COUNT,
        &quot;command&quot;: &quot;bin/03b_get_report.py {COUNT} --output_file=data/protein_report.tsv&quot;.format(COUNT=&quot; &quot;.join(COUNT))
    },
    &quot;data/results.tgz&quot;: {
        &quot;deps&quot;: PLOT + [&quot;data/protein_report.tsv&quot;],
        &quot;command&quot;: &quot;&quot;&quot;rm -rf results
                      mkdir results
                      cp {PLOT} data/protein_report.tsv results/
                      tar -czf data/results.tgz results
                      rm -r results&quot;&quot;&quot;.format(PLOT=&quot; &quot;.join(PLOT))
    },
}

[...]
</code></pre><p>You should now be able to clean up all past files that were generated with:</p><pre><code class="language-bash">redun run make_workflow.py make --target clean</code></pre><p>Finally, run the following and check whether it actually generates all of our target files in the <code>data/</code> folder:</p><pre><code class="language-bash">redun run make_workflow.py make --target all
# Or:
redun run make_workflow.py make</code></pre><p>You can browse the final state of this part of the tutorial in the <code>redun</code> <a href="https://github.com/ricomnl/bioinformatics-pipeline-tutorial/blob/redun/wf/make_workflow.py">branch</a>.</p><h1 id="conclusion">Conclusion</h1><p>That&apos;s it! Thanks for sticking with me all the way until the end, I hope it was fun and you got to explore some of redun&apos;s functionality. I linked some further resources below. Just to recap, here&apos;s what we covered:</p><ul><li>Core features of redun</li><li>Run redun workflows on AWS Batch</li><li>Import submodules via pip</li><li>Emulate Makefile behavior in redun</li></ul><p>For follow-up questions or feedback on this article, you can submit an issue through <a href="https://github.com/ricomnl/bioinformatics-pipeline-tutorial/issues">the accompanying GitHub repository</a> or reach me on <a href="https://twitter.com/ricomnl">Twitter</a>.</p><p>Huge thanks to <a href="https://twitter.com/AlexandreTrapp">Alex Trapp</a> and <a href="https://twitter.com/mattrasmus">Matt Rasmussen</a> for their thoughts and feedback on the draft.</p><h1 id="resources">Resources</h1><ul><li>Data Science workflows at insitro: using redun on AWS Batch: <a href="https://aws.amazon.com/blogs/hpc/data-science-workflows-at-insitro-using-redun-on-aws-batch/">https://aws.amazon.com/blogs/hpc/data-science-workflows-at-insitro-using-redun-on-aws-batch/</a></li><li>Data Science workflows at insitro: how redun uses the advanced service features from AWS Batch and AWS Glue: <a href="https://aws.amazon.com/blogs/hpc/how-insitro-redun-uses-advanced-aws-features/">https://aws.amazon.com/blogs/hpc/how-insitro-redun-uses-advanced-aws-features/</a></li><li>Redun Design Document: <a href="https://insitro.github.io/redun/design.html">https://insitro.github.io/redun/design.html</a></li><li>Redun <a href="https://github.com/insitro/redun/tree/main/examples">tutorials</a> I&apos;d recommend: <code>03_scheduler</code>, <code>04_script</code>, <code>05_aws_batch</code>, <code>functools</code>, <code>setup_scheduler</code>, and <code>testing</code></li><li>Great thread on what makes a good pipeline: <a href="https://twitter.com/VictoriaCarr_/status/1521496097230839810">https://twitter.com/VictoriaCarr_/status/1521496097230839810</a></li></ul>]]></content:encoded></item><item><title><![CDATA[HTGAA 22]]></title><description><![CDATA[<p>I&apos;m participating as a committed listener in 2022&apos;s <a href="https://htgaa2022.notion.site/htgaa2022/HTGAA-2022-d39e5560ad83483ab87d415f085b60c6">How to Grow (Almost) Anything</a>. Here&apos;s the link to all my assignment submissions:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://htgaa22-ricomeinl.notion.site/Rico-Meinl-efa379490adc4c169e51f9e7b0af4b87"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Notion &#x2013; The all-in-one workspace for your notes, tasks, wikis, and databases.</div><div class="kg-bookmark-description">A new tool that blends your everyday work apps into one.</div></div></a></figure>]]></description><link>https://ricomnl.com/blog/htgaa-22/</link><guid isPermaLink="false">620d828ab6907e05c48ae4e0</guid><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Wed, 16 Feb 2022 23:03:50 GMT</pubDate><content:encoded><![CDATA[<p>I&apos;m participating as a committed listener in 2022&apos;s <a href="https://htgaa2022.notion.site/htgaa2022/HTGAA-2022-d39e5560ad83483ab87d415f085b60c6">How to Grow (Almost) Anything</a>. Here&apos;s the link to all my assignment submissions:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://htgaa22-ricomeinl.notion.site/Rico-Meinl-efa379490adc4c169e51f9e7b0af4b87"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Notion &#x2013; The all-in-one workspace for your notes, tasks, wikis, and databases.</div><div class="kg-bookmark-description">A new tool that blends your everyday work apps into one. It&#x2019;s the all-in-one workspace for you and your team</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://htgaa22-ricomeinl.notion.site/images/logo-ios.png" alt><span class="kg-bookmark-author">Notion</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://www.notion.so/images/meta/default.png" alt></div></a></figure>]]></content:encoded></item><item><title><![CDATA[Bioinformatics pipeline example from the bottom up]]></title><description><![CDATA[This tutorial is aimed at scientists and bioinformaticians who know how to work the command line and have heard about pipelines before but feel lost in the jungle of tools like Docker, Nextflow, Airflow, Reflow, Snakemake, etc. ]]></description><link>https://ricomnl.com/blog/bottom-up-bioinformatics-pipeline/</link><guid isPermaLink="false">61b0dff49bc2fd1a4b2a2f91</guid><category><![CDATA[bioinformatics]]></category><category><![CDATA[pipelines]]></category><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Fri, 14 Jan 2022 16:34:31 GMT</pubDate><media:content url="https://ricomnl.com/content/images/2022/01/sigmund-4CNNH2KEjhc-unsplash.jpeg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><aside class="toc"></aside><!--kg-card-end: html--><img src="https://ricomnl.com/content/images/2022/01/sigmund-4CNNH2KEjhc-unsplash.jpeg" alt="Bioinformatics pipeline example from the bottom up"><p>This tutorial is aimed at scientists and bioinformaticians who know how to work the command line and have heard about pipelines before but feel lost in<a href="https://github.com/pditommaso/awesome-pipeline"> the jungle of tools</a> like Docker, Nextflow, Airflow, Reflow, Snakemake, etc. </p><p>In this post, we&apos;re gonna strip away some of the complexity and take a simple bioinformatics workflow, and build a pipeline from the bottom up. The goal is to understand the pattern of how to take some scripts written in a language like bash or python and turn them into a more streamlined (and perhaps automated) workflow.</p><p>We start by introducing the pipeline that we&apos;re going to build. In essence, it is a set of python scripts that take some data, do something with that data and save the output somewhere else. The first step to creating a minimal pipeline is writing a master shell script that sequentially runs all of these python scripts. We then use a Makefile to do the very same while explaining some of the advantages that come with it. Finally, we use Nextflow, a commonly used bioinformatics workflow tool, to wrap up our pipeline. If you feel adventurous, you can follow <a href="https://t-neumann.github.io/pipelines/AWS-pipeline/">this tutorial</a> on how to run setup an AWS environment for Nextflow and then run your pipeline on it.</p><p>The workflow we&apos;re going to wrap in a pipeline looks like this:</p><ol><li>Take a set of .fasta protein files</li><li>Split each into peptides using a variable number of missed cleavages</li><li>Count the number of cysteines in total as well as the number of peptides that contain a cysteine</li><li>Generate an output report containing this information in a .tsv file</li><li>Create an archive to share with colleagues</li></ol><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://ricomnl.com/content/images/2022/01/Screen-Shot-2022-01-14-at-10.06.25.png" class="kg-image" alt="Bioinformatics pipeline example from the bottom up" loading="lazy" width="1672" height="268" srcset="https://ricomnl.com/content/images/size/w600/2022/01/Screen-Shot-2022-01-14-at-10.06.25.png 600w, https://ricomnl.com/content/images/size/w1000/2022/01/Screen-Shot-2022-01-14-at-10.06.25.png 1000w, https://ricomnl.com/content/images/size/w1600/2022/01/Screen-Shot-2022-01-14-at-10.06.25.png 1600w, https://ricomnl.com/content/images/2022/01/Screen-Shot-2022-01-14-at-10.06.25.png 1672w" sizes="(min-width: 720px) 720px"><figcaption>An example output protein report</figcaption></figure><p>The first part of this tutorial is influenced by <a href="http://byronjsmith.com/make-bml/">this post</a> on how to create bioinformatics pipelines with Make. I won&apos;t go into as much depth to explain Makefiles themselves, so if this is the first time you&apos;re encountering a Makefile, I&apos;d recommend going through the linked post first.</p><h1 id="setup">Setup</h1><p>Go through the box below to install the needed tools. I tried to make the dependencies as small as possible.</p><h2 id="mac-os">Mac OS</h2><pre><code class="language-bash"># Add project to your path for this session.
export PATH=&quot;$PATH:$(pwd)&quot;

# Open the terminal; Install utilities for homebrew
xcode-select --install

# Install homebrew
/bin/bash -c &quot;$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)&quot;

# Install python3
Follow this tutorial: https://opensource.com/article/19/5/python-3-default-mac

# Install make
brew install make

# Install git
brew install git

# Install matplotlib
pip3 install matplotlib

# Install Nextflow (https://www.nextflow.io/docs/latest/getstarted.html)
wget -qO- https://get.nextflow.io | bash
chmod +x nextflow
## Move Nextflow to a directory in your $PATH such as /usr/local/bin
mv nextflow /usr/local/bin/</code></pre><h2 id="linux">Linux</h2><pre><code class="language-bash"># Install python3, git and make
sudo apt-get update
sudo apt-get install python3 git make

# Install matplotlib
sudo apt-get install python3-matplotlib

# Install Nextflow (https://www.nextflow.io/docs/latest/getstarted.html)
wget -qO- https://get.nextflow.io | bash
chmod +x nextflow
## Move Nextflow to a directory in your $PATH such as /usr/local/bin
mv nextflow /usr/local/bin/</code></pre><h1 id="introduction">Introduction</h1><p>In this section, we&apos;ll go through the basic intuition of what a pipeline is and why we need one. To walk through this from the ground up I chose a basic example. We have a bunch of proteins in .fasta files and want to create a report of how many cysteines each contains after it has been digested into peptides. </p><p>I created a <a href="https://github.com/ricomnl/bioinformatics-pipeline-tutorial">Github repository</a> which we&apos;ll be working with. To start off, <a href="https://docs.github.com/en/get-started/quickstart/fork-a-repo">fork it</a>, clone it locally, and check out the branch <code>part_00</code>. &#xA0;</p><pre><code class="language-bash"># Fork and clone repository and switch to branch part_1
git clone https://github.com/&lt;your_git_username&gt;/bioinformatics-pipeline-tutorial.git
cd bioinformatics-pipeline-tutorial/
git checkout part_00</code></pre><p>Open the project in your favorite code editor to check out the directory structure. We have two folders: <code>bin/</code> contains the python scripts we&apos;ll use throughout this tutorial to transform our files and <code>fasta/</code> contains a set of protein .fasta files that we&apos;ll use (I went with the four Yamanaka factors but feel free to drop in whatever your favorite protein is).</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/Screen-Shot-2022-01-14-at-10.19.28.png" class="kg-image" alt="Bioinformatics pipeline example from the bottom up" loading="lazy" width="664" height="622" srcset="https://ricomnl.com/content/images/size/w600/2022/01/Screen-Shot-2022-01-14-at-10.19.28.png 600w, https://ricomnl.com/content/images/2022/01/Screen-Shot-2022-01-14-at-10.19.28.png 664w"></figure><p>Make all the scripts in <code>bin</code> executable by running the following command:</p><pre><code class="language-bash">chmod +x bin/01_digest_protein.py bin/02_count_amino_acids.py bin/03a_plot_count.py bin/03b_get_report.py</code></pre><p>Let&apos;s walk through the steps manually using <code>KLF4</code> as our protein. First, we need to digest our protein into peptides. This is what the prepared script <code>01_digest_protein.py</code> does. Feel free to open up the file and check it out. The required flags for the script are an input .fasta file and an output file path. The optional flags have default values but feel free to play around with them. For example, we can change the number of missed cleavages by appending <code>--missed_cleavages=1</code> to our command. To digest our protein, run:</p><pre><code class="language-bash">mkdir data/
bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt</code></pre><p>We now have the file <code>KLF4.txt</code> in the <code>data/</code> directory which should contain all the peptides of <code>KLF4</code> after it was digested with trypsin (you can change the digestion enzyme by passing the <code>--enzyme_regex</code> flag). </p><p>Next up, we want to count the total # of amino acids in <code>KLF4</code>, the # of cysteines, the # of peptides and how many of them contain a cysteine. To do this, we run:</p><pre><code class="language-bash">bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv</code></pre><p>We can use the <code>--amino_acid</code> to change the amino acid to count (defaults to cysteine == C). </p><p>We&apos;re halfway there. Now we want to a) plot each output count file as a bar plot (see below) and b) create an output report summarizing the counts for multiple proteins. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://ricomnl.com/content/images/2022/01/Screen-Shot-2022-01-11-at-16.28.41.png" class="kg-image" alt="Bioinformatics pipeline example from the bottom up" loading="lazy" width="1512" height="1022" srcset="https://ricomnl.com/content/images/size/w600/2022/01/Screen-Shot-2022-01-11-at-16.28.41.png 600w, https://ricomnl.com/content/images/size/w1000/2022/01/Screen-Shot-2022-01-11-at-16.28.41.png 1000w, https://ricomnl.com/content/images/2022/01/Screen-Shot-2022-01-11-at-16.28.41.png 1512w" sizes="(min-width: 720px) 720px"><figcaption>Barplot charts showing the number of cysteines in peptides and amino acids</figcaption></figure><p>To get the output plot we run:</p><pre><code class="language-bash"># Just show
bin/03a_plot_count.py data/KLF4.count.tsv show

# Save fig
bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png</code></pre><p>Now we run all the same steps for <code>MYC</code> and generate an output report for the two proteins.</p><pre><code class="language-bash"># Digest
bin/01_digest_protein.py fasta/MYC.fasta data/MYC.peptides.txt

# Count
bin/02_count_amino_acids.py fasta/MYC.fasta data/MYC.peptides.txt data/MYC.count.tsv

# Plot
bin/03a_plot_count.py data/MYC.count.tsv data/MYC.plot.png

# Generate Report for KLF4 and MYC
bin/03b_get_report.py data/KLF4.count.tsv data/MYC.count.tsv --output_file=data/protein_report.tsv</code></pre><p>Lastly, create an archive of the resulting output files.</p><pre><code class="language-bash"># Create a results/ folder and archive it for sharing
mkdir results
cp data/*plot.png data/protein_report.tsv results/
tar -czf data/results.tgz results
rm -r results</code></pre><p>Together these scripts implement a common workflow:</p><ol><li>Digest protein(s)</li><li>Count occurrences of amino acid in protein(s)</li><li>Plot results</li><li>Generate a report with the results</li><li>Archive the plots and report</li></ol><p>Instead of running each of the commands manually, as above, we can create a master script that runs the whole pipeline from start to finish. Our <code>run_pipeline.sh</code> looks like this:</p><pre><code class="language-bash">#!/usr/bin/env bash
# USAGE: bash run_pipeline.sh

mkdir -p data

# 01. Digest
bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt
bin/01_digest_protein.py fasta/MYC.fasta data/MYC.peptides.txt
bin/01_digest_protein.py fasta/PO5F1.fasta data/PO5F1.peptides.txt
bin/01_digest_protein.py fasta/SOX2.fasta data/SOX2.peptides.txt

# 02. Count
bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv
bin/02_count_amino_acids.py fasta/MYC.fasta data/MYC.peptides.txt data/MYC.count.tsv
bin/02_count_amino_acids.py fasta/PO5F1.fasta data/PO5F1.peptides.txt data/PO5F1.count.tsv
bin/02_count_amino_acids.py fasta/SOX2.fasta data/SOX2.peptides.txt data/SOX2.count.tsv

# 03a. Plot
bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png
bin/03a_plot_count.py data/MYC.count.tsv data/MYC.plot.png
bin/03a_plot_count.py data/PO5F1.count.tsv data/PO5F1.plot.png
bin/03a_plot_count.py data/SOX2.count.tsv data/SOX2.plot.png

# 03b. Generate Report
bin/03b_get_report.py data/KLF4.count.tsv \
					  data/MYC.count.tsv \
					  data/PO5F1.count.tsv \
					  data/SOX2.count.tsv \
					  --output_file=data/protein_report.tsv

# 04. Archive the results in a tarball so we can share them with a colleague
rm -rf results
mkdir results
cp data/*plot.png data/protein_report.tsv results/
tar -czf data/results.tgz results
rm -r results
</code></pre><p>Now we have a reproducible pipeline that we can easily run by calling:</p><pre><code class="language-bash">bash run_pipeline.sh</code></pre><p>We can also share it with colleagues and we have some security that it will behave in the exact same manner when we rerun it (and don&apos;t have to worry about typos when using the command line manually).</p><p>If you&apos;re following along using your own GitHub repository, this is a good time to take a step back and commit your results.</p><pre><code class="language-bash">git init
git add .
git commit -m &quot;Finished setup&quot;
git push</code></pre><p>Let&apos;s also clean up the data folder for now, as we&apos;ll regenerate the files again in the next step:</p><pre><code class="language-bash">rm data/*</code></pre><h1 id="makefile">Makefile</h1><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text">If you&apos;re starting at this point, please checkout branch &quot;part_01&quot; from the GitHub repo.</div></div><p>Now, let&apos;s say we wanted to use pie charts instead of bar plots. We could just go into <code>03a_plot_count.py</code> and change <code>plt.bar</code> to <code>plt.pie</code> , right? </p><p>Sure, but then we&apos;d have to rerun the entire script even though the first part didn&apos;t change at all. Even so, this wouldn&apos;t be a big deal because we only have four files but imagine we were running this for the whole human .fasta file or our files were just much bigger. Alas, our current pipeline is not ideal. </p><p>As I mentioned, <a href="http://byronjsmith.com/make-bml/">this post</a> gives a much deeper overview of how to create Makefiles for bioinformatics workflows. I&apos;m only covering the basics needed for our little tutorial here. There&apos;s also a <a href="https://devhints.io/makefile">great cheat sheet here</a> if you get stuck on some commands.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text"><em>Make</em> is a computer program originally designed to automate the compilation and installation of software. <em>Make</em> automates the process of building target files through a series of discrete steps. Despite it&#x2019;s original purpose, this design makes it a great fit for bioinformatics pipelines, which often work by transforming data from one form to another (e.g. <em>raw data</em> &#x2192; <em>word counts</em> &#x2192; <em>???</em> &#x2192; <em>profit</em>).<br><em>Source: http://byronjsmith.com/make-bml/</em></div></div><p>Let&apos;s start by creating a Makefile and porting our first step into it.</p><pre><code class="language-bash">touch Makefile</code></pre><p>Use the text editor of your choice to add to the Makefile. The simplest possible Makefile recipe is this:</p><pre><code class="language-Makefile">targets: prerequisites
	command</code></pre><p>We want to create the file <code>KLF4.peptides.txt</code> in the <code>data/</code> folder (the target) using the <code>bin/01_digest_protein.py</code> script (the command), as before. Our input file is <code>fasta/KLF4.fasta</code> (the prerequisite). The result looks like this:</p><pre><code class="language-Makefile">data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt</code></pre><p>The targets are what we create as a result of executing the commands. Now, run:</p><pre><code class="language-bash">make data/KLF4.peptides.txt</code></pre><p>Once we&apos;ve executed this, you should have generated the <code>KLF4.peptides.txt</code> file in the <code>data/</code> folder. Make will only allow rerunning the command if the prerequisites have been modified since the target was created. </p><p>Try running <code>make data/KLF4.peptides.txt</code> again. You should get the following message, telling you that the prerequisites have not changed and therefore the target won&apos;t be different if you run it again:</p><pre><code class="language-bash">$ make data/KLF4.peptides.txt
make: &apos;data/KLF4.peptides.txt&apos; is up to date.</code></pre><p>We can go around this by just changing the modification time of <code>fasta/KLF4.fasta</code> and restore the original behavior.</p><pre><code class="language-bash">touch fasta/KLF4.fasta
make data/KLF4.peptides.txt</code></pre><p>Let&apos;s add the second step, the counting step. As a reminder: we want to create the file <code>KLF4.count.tsv</code> in the <code>data/</code> folder (the target) using the <code>bin/02_count_amino_acids.py</code> script (the command). Our input files are <code>fasta/KLF4.fasta</code> and <code>data/KLF4.peptides.txt</code> (the prerequisites). The resulting Makefile looks like this:</p><pre><code class="language-Makefile">data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt

data/KLF4.count.tsv: fasta/KLF4.fasta data/KLF4.peptides.txt
	bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv</code></pre><p>Now, try to add the plotting command (creates the file <code>data/KLF4.plot.png</code>) and the report (<code>data/protein_report.tsv</code>) yourself.</p><p>Here is the solution.</p><pre><code class="language-Makefile">data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt

data/KLF4.count.tsv: fasta/KLF4.fasta data/KLF4.peptides.txt
	bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv

data/KLF4.plot.png: data/KLF4.count.tsv
	bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png

data/protein_report.tsv: data/KLF4.count.tsv
	bin/03b_get_report.py data/KLF4.count.tsv --output_file=data/protein_report.tsv

data/results.tgz: data/KLF4.plot.png data/protein_report.tsv
	rm -rf results
	mkdir results
	cp data/KLF4.plot.png data/protein_report.tsv results/
	tar -czf data/results.tgz results
	rm -r results</code></pre><p>Let&apos;s remove all the files from the <code>data/</code> subdirectory and run Make.</p><pre><code class="language-bash">rm data/*
make data/results.tgz</code></pre><p> You&apos;ll notice that Make executes every single command in the Makefile. </p><pre><code class="language-bash">$ make data/result.tgz
bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt
bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv
bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png
bin/03b_get_report.py data/KLF4.count.tsv --output_file=data/protein_report.tsv
rm -rf results
mkdir results
cp data/KLF4.plot.png data/protein_report.tsv results/
tar -czf data/results.tgz results
rm -r results</code></pre><p>Why is that? Makefiles work in a pull-based fashion. This means that the workflow is invoked by asking for a specific output file, whereafter all tasks required for reproducing the file will be executed. We can visualize this by looking at the dependency graph. To generate it we&apos;re using <a href="https://github.com/lindenb/makefile2graph">makefile2graph</a> and call:</p><pre><code class="language-bash">make -Bnd data/results.tgz | make2graph | dot -Tpng -o out.png</code></pre><p>We call make with our target <code>data/results.tgz</code> and in order to create it, we first need to create <code>data/KLF4.plot.png</code> and <code>data/protein_report.tsv</code> which in turn need <code>data/KLF4.count.tsv</code> and so on. That&apos;s why it generates all files at once.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://ricomnl.com/content/images/2022/01/out.png" class="kg-image" alt="Bioinformatics pipeline example from the bottom up" loading="lazy" width="493" height="443"><figcaption>The dependency graph of our Makefile</figcaption></figure><p>In order to just see what files will be created, we can run use the flag <code>--dry-run</code> or its short form <code>-n</code>. </p><pre><code class="language-bash">$ make --dry-run data/result.tgz
bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt
bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv
bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png
bin/03b_get_report.py data/KLF4.count.tsv --output_file=data/protein_report.tsv
rm -rf results
mkdir results
cp data/KLF4.plot.png data/protein_report.tsv results/
tar -czf data/results.tgz results
rm -r results</code></pre><p>We can put an <code>all</code> target at the very top of our Makefile which is good practice as the topmost recipe is the one that is built by default when calling just <code>make</code>. Add the following to the top of your Makefile:</p><pre><code class="language-Makefile">all: data/results.tgz

[...]</code></pre><p>Another common target is <code>clean:</code>. Let&apos;s add the following below the <code>all:</code> target in our Makefile:</p><pre><code class="language-Makefile">clean:
	rm -rf data/*</code></pre><p>We can now create all our files by calling <code>make all</code> and clean the <code>data/</code> folder by calling <code>make clean</code>.</p><p>We have to tell Make that <code>all:</code> and <code>clean:</code> will always refer to the targets in our Makefile and never to any files themselves, therefore we also add this to our Makefile:</p><pre><code class="language-Makefile">.PHONY: all clean</code></pre><p>Our Makefile should now look like this:</p><pre><code class="language-Makefile"># Dummy targets
all: data/results.tgz

clean:
	rm -rf data/*

.PHONY: all clean

# Analysis and plotting
data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt

data/KLF4.count.tsv: fasta/KLF4.fasta data/KLF4.peptides.txt
	bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt data/KLF4.count.tsv

data/KLF4.plot.png: data/KLF4.count.tsv
	bin/03a_plot_count.py data/KLF4.count.tsv data/KLF4.plot.png

data/protein_report.tsv: data/KLF4.count.tsv
	bin/03b_get_report.py data/KLF4.count.tsv --output_file=data/protein_report.tsv

data/results.tgz: data/KLF4.plot.png data/protein_report.tsv
	rm -rf results
	mkdir results
	cp data/KLF4.plot.png data/protein_report.tsv results/
	tar -czf data/results.tgz results
	rm -r results</code></pre><p>You might have noticed that there is a fair amount of repetition in each of the recipes. Let&apos;s take the first one and simplify it:</p><pre><code class="language-Makefile">data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py fasta/KLF4.fasta data/KLF4.peptides.txt</code></pre><p>In Makefiles, the two variables <code>$^</code> and <code>$@</code> refer to the prerequisite and target of a rule so we can rewrite the above as:</p><pre><code class="language-Makefile">data/KLF4.peptides.txt: fasta/KLF4.fasta
	bin/01_digest_protein.py $^ $@</code></pre><p>In fact, to make sure that our python script is also considered as a prerequisite and the recipe is rerun when our script is updated we change it like this:</p><pre><code class="language-Makefile">data/KLF4.peptides.txt: bin/01_digest_protein.py fasta/KLF4.fasta
	$^ $@</code></pre><p> After applying these transformations, our Makefile should look like this:</p><pre><code class="language-Makefile"># Dummy targets
all: data/results.tgz

clean:
	rm -rf data/*

.PHONY: all clean

# Analysis and plotting
data/KLF4.peptides.txt: bin/01_digest_protein.py fasta/KLF4.fasta
	$^ $@

data/KLF4.count.tsv: bin/02_count_amino_acids.py fasta/KLF4.fasta data/KLF4.peptides.txt
	$^ $@

data/KLF4.plot.png: bin/03a_plot_count.py data/KLF4.count.tsv
	$^ $@

data/protein_report.tsv: bin/03b_get_report.py data/KLF4.count.tsv
	$^ --output_file=$@

# Archive for sharing
data/results.tgz: data/KLF4.plot.png data/protein_report.tsv
	rm -rf results
	mkdir results
	cp $^ results/
	tar -czf $@ results
	rm -r results
</code></pre><p>Now the current Makefile only creates the KLF4 protein files. Let&apos;s add the other proteins, starting with MYC.</p><pre><code class="language-Makefile"># Analysis and plotting
data/KLF4.peptides.txt: bin/01_digest_protein.py fasta/KLF4.fasta
	$^ $@
    
data/MYC.peptides.txt: bin/01_digest_protein.py fasta/MYC.fasta
	$^ $@</code></pre><p> As you probably noticed, that would be a lot of repetition. We can pattern rules to abstract the individual protein names away: </p><pre><code class="language-Makefile"># Analysis and plotting
data/%.peptides.txt: bin/01_digest_protein.py fasta/%.fasta
	$^ $@</code></pre><p>If we would&apos;ve gone ahead and blindly applied this to the last two rules we would&apos;ve gotten the following error:</p><pre><code class="language-bash">make: *** No rule to make target `data/%.count.tsv&apos;, needed by `data/%.plot.png&apos;. Stop.</code></pre><p>Why is that? Remember how Makefiles use a &#x201C;pull-based&#x201D; scheduling strategy?<br>If we were to use the wildcards <code>%</code> everywhere, we&apos;d never actually tell the Makefile what the wildcard stands for. <br>At some point, we need to define some target files that Make can use as a basis to fill in the wildcards for the others.</p><p>We insert two variables at the top and use them for the last two rules. Voil&#xE0;.</p><pre><code class="language-Makefile">COUNT := data/KLF4.count.tsv data/MYC.count.tsv \
			data/PO5F1.count.tsv data/SOX2.count.tsv
PLOT := data/KLF4.plot.png data/MYC.plot.png \
			data/PO5F1.plot.png data/SOX2.plot.png

[...]

data/protein_report.tsv: bin/03b_get_report.py ${COUNT}
	$^ --output_file=$@

# Archive for sharing
data/results.tgz: ${PLOT} data/protein_report.tsv
	rm -rf results
	mkdir results
	cp $^ results/
	tar -czf $@ results
	rm -r results</code></pre><p>Now we&apos;re back on track. Let&apos;s run it:</p><pre><code class="language-bash">make</code></pre><p>You might have noticed that the pipeline took a little bit longer to process the four proteins because it ran everything sequentially. You can use the <code>--jobs</code> or <code>-j</code> flag to run Make in parallel. That&apos;ll give us a nice speedup.</p><p>Let&apos;s check out the dependency graph at this point.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2021/12/out-2.png" class="kg-image" alt="Bioinformatics pipeline example from the bottom up" loading="lazy" width="1712" height="539" srcset="https://ricomnl.com/content/images/size/w600/2021/12/out-2.png 600w, https://ricomnl.com/content/images/size/w1000/2021/12/out-2.png 1000w, https://ricomnl.com/content/images/size/w1600/2021/12/out-2.png 1600w, https://ricomnl.com/content/images/2021/12/out-2.png 1712w" sizes="(min-width: 720px) 720px"></figure><p>Your final Makefile should look like this:</p><pre><code class="language-Makefile">COUNT := data/KLF4.count.tsv data/MYC.count.tsv \
			data/PO5F1.count.tsv data/SOX2.count.tsv
PLOT := data/KLF4.plot.png data/MYC.plot.png \
			data/PO5F1.plot.png data/SOX2.plot.png

# Dummy targets
all: data/results.tgz

clean:
	rm -rf data/*

.PHONY: all clean

# Analysis and plotting
data/%.peptides.txt: bin/01_digest_protein.py fasta/%.fasta
	$^ $@

data/%.count.tsv: bin/02_count_amino_acids.py fasta/%.fasta data/%.peptides.txt
	$^ $@

data/%.plot.png: bin/03a_plot_count.py data/%.count.tsv
	$^ $@

data/protein_report.tsv: bin/03b_get_report.py ${COUNT}
	$^ --output_file=$@

# Archive for sharing
data/results.tgz: ${PLOT} data/protein_report.tsv
	rm -rf results
	mkdir results
	cp $^ results/
	tar -czf $@ results
	rm -r results
</code></pre><h1 id="nextflow">Nextflow</h1><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text">If you&apos;re starting at this point, please checkout branch &quot;part_02&quot; from the GitHub repo.</div></div><p>We&apos;re now going to switch gears and turn to Nextflow. Makefiles are awesome but they&apos;re limited. For larger scale pipelines, the biggest limit is that we can&apos;t easily scale them horizontally. It&apos;d have to be on the same machine. A lot of the steps we&apos;ve implemented earlier though could easily run on multiple machines in parallel. Let&apos;s see how Nextflow helps with that. There are many tools like Nextflow, including but not limited to <a href="https://airflow.apache.org/">Airflow</a>, <a href="https://github.com/grailbio/reflow">Reflow</a>, <a href="https://snakemake.readthedocs.io/en/stable/">Snakemake</a>, etc. They all have their advantages and disadvantages but I chose Nextflow for this tutorial (and for our work at <a href="https://talus.bio">talus.bio</a>) because of its flexibility and popularity in the Bioinformatics community.</p><p>Start by creating a file <code>main.nf</code> which is a commonly used name for Nextflow files. We add only the <code>digestProtein</code> step, for now, to keep it simple.</p><pre><code class="language-groovy">#!/usr/bin/env nextflow

// Run workflow for all .fasta files in the fasta directory
fasta = Channel.fromPath(&quot;$baseDir/fasta/*.fasta&quot;)


// Helper function to extract the protein name from the filename
def getProtein(fileName) {
  fileName.getBaseName().tokenize(&quot;.&quot;)[0]
}


// Digest a protein and save the peptides
process digestProtein {  
  input:
    path input_fasta from fasta

  output:
    path &quot;*.txt&quot; into peptides

  script:
    def protein = getProtein(input_fasta)
    &quot;&quot;&quot;
    01_digest_protein.py ${input_fasta} ${protein}.peptides.txt
    &quot;&quot;&quot;
}

peptides.view()</code></pre><p>We also create a small <code>nextflow.config</code> file that includes only the <code>publishDir</code> directive for now. Nextflow does all the work in the <code>work</code> folder and if we want our processes to export data to a different location we have to specify that with the <code>publishDir</code> directive. Here we use our <code>data/</code> folder again.</p><pre><code class="language-groovy">process {
	publishDir = [path: &quot;data/&quot;, mode: &quot;copy&quot;]
}</code></pre><p>One of the main functional differences between Nextflow and Makefiles is that Makefiles are <strong>pull-based</strong> but Nextflow is <strong>push-based</strong>. With the Makefiles, we had to specify the output files we want to generate and it automatically figured out which parts it had to execute to generate them. Here we take all the <code>.fasta</code> files from our <code>fasta</code> folder and &quot;push&quot; them into the pipeline. </p><p>Nextflow uses the <a href="http://groovy-lang.org/documentation.html">groovy</a> programming language which is based on Java. This makes it a lot more flexible than using bash. There are two main concepts: <em>Processes</em> and <em>Channels</em>. Processes are similar to the rules in a Makefile. We specify input and output as well as a script that determines how to generate the output from the input. </p><p>Run the pipeline with <code>nextflow run main.nf</code> and you&apos;ll see that it runs the process <code>digestProtein</code> four times (once for each fasta).</p><p>We&apos;ll now add the other functions step by step. When adding the <code>countAA</code> process we notice that it also takes the <code>fasta</code> Channel as an input. Let&apos;s try to run it without changing anything.</p><pre><code class="language-groovy">#!/usr/bin/env nextflow

// Run workflow for all .fasta files in the fasta directory
fasta = Channel.fromPath(&quot;$baseDir/fasta/*.fasta&quot;)


// Helper function to extract the protein name from the filename
def getProtein(fileName) {
  fileName.getBaseName().tokenize(&quot;.&quot;)[0]
}


// Digest a protein and save the peptides
process digestProtein {
  input:
    path input_fasta from fasta

  output:
    path &quot;*.txt&quot; into peptides

  script:
    def protein = getProtein(input_fasta)
    &quot;&quot;&quot;
    01_digest_protein.py ${input_fasta} ${protein}.peptides.txt
    &quot;&quot;&quot;
}


// Count the number of times a given amino acid appears in a protein as well 
// as its peptides after digestion
process countAA {  
  input:
    path input_peptides from peptides
    path input_fasta from fasta

  output:
    path &quot;*.tsv&quot; into aa_count

  script:
    def protein = getProtein(input_peptides)
    &quot;&quot;&quot;
    02_count_amino_acids.py ${input_fasta} ${input_peptides} ${protein}.count.tsv
    &quot;&quot;&quot;
}</code></pre><p>You should&apos;ve gotten the following error:</p><pre><code class="language-bash">$ nextflow run main.nf
N E X T F L O W  ~  version 21.09.0-edge
Launching `main.nf` [clever_solvay] - revision: 84edb4b9a9
Channel `fasta` has been used twice as an input by process `countAA` and process `digestProtein`

 -- Check script &apos;main.nf&apos; at line: 31 or see &apos;.nextflow.log&apos; file for more details
[-        ] process &gt; digestProtein -</code></pre><p>To avoid this error we split our <code>fasta</code> channel into two. </p><pre><code class="language-groovy">[...]

// Run workflow for all .fasta files in the fasta directory
fasta = Channel.fromPath(&quot;$baseDir/fasta/*.fasta&quot;)
fasta.into { 
  fasta_a
  fasta_b 
}

[...]

process digestProtein {
  input:
    path input_fasta from fasta_a
[...]

process countAA {  
  input:
    path input_peptides from peptides
    path input_fasta from fasta_b
[...]</code></pre><p>We also know that the output from <code>countAA</code> goes into both <code>plotCount</code> and <code>generateReport</code> so we use the same trick as with the fasta channel. Our <code>main.nf</code> file should now look like this. Note that we used <code>.collect()</code> both in <code>generateReport</code> and <code>archiveResults</code>. By default, Nextflow would&apos;ve run these processes once for each item. In this case, we deliberately want to avoid that behavior, because our processes use all files at once.</p><pre><code class="language-groovy">#!/usr/bin/env nextflow

// Run workflow for all .fasta files in the fasta directory
fasta = Channel.fromPath(&quot;$baseDir/fasta/*.fasta&quot;)
fasta.into { 
  fasta_a
  fasta_b 
}


// Helper function to extract the protein name from the filename
def getProtein(fileName) {
  fileName.getBaseName().tokenize(&quot;.&quot;)[0]
}


// Digest a protein and save the peptides
process digestProtein {
  input:
    path input_fasta from fasta_a

  output:
    path &quot;*.txt&quot; into peptides

  script:
    def protein = getProtein(input_fasta)
    &quot;&quot;&quot;
    01_digest_protein.py ${input_fasta} ${protein}.peptides.txt
    &quot;&quot;&quot;
}


// Count the number of times a given amino acid appears in a protein as well 
// as its peptides after digestion
process countAA {  
  input:
    path input_peptides from peptides
    path input_fasta from fasta_b

  output:
    path &quot;*.tsv&quot; into aa_count_a, aa_count_b

  script:
    def protein = getProtein(input_peptides)
    &quot;&quot;&quot;
    02_count_amino_acids.py ${input_fasta} ${input_peptides} ${protein}.count.tsv
    &quot;&quot;&quot;
}


// Load the calculated counts and create a plot
process plotCount {  
  input:
    path input_count from aa_count_a

  output:
    path &quot;*.png&quot; into count_plot

  script:
    def protein = getProtein(input_count)
    &quot;&quot;&quot;
    03a_plot_count.py ${input_count} ${protein}.plot.png
    &quot;&quot;&quot;
}


// Get a list of input files from a given folder and create a report
process generateReport {  
  input:
    path input_count from aa_count_b.collect()

  output:
    path &quot;*.tsv&quot; into protein_report

  script:
    &quot;&quot;&quot;
    03b_get_report.py ${input_count} --output_file=protein_report.tsv
    &quot;&quot;&quot;
}


// Gather result files and archive them
process archiveResults {  
  input:
    path input_plot from count_plot.collect()
    path input_report from protein_report

  output:
    path &quot;*.tgz&quot; into archive_results

  script:
    &quot;&quot;&quot;
    mkdir results
    cp ${input_plot} ${input_report} results/
    tar -czf results.tgz results
    &quot;&quot;&quot;
}
</code></pre><p>So far we&apos;ve been the &quot;old&quot; way of writing pipelines in Nextflow. I wrote the pipeline this way on purpose, in order to showcase the difference between push-based and pull-based. It&apos;s still a legitimate way of writing them but Nextflow has recently released a new DSL (Version 2) which makes the whole process more flexible and IMO a bit more elegant. Instead of having to think about how to connect processes, we treat them more like functions that take an input and output and connect them via a <code>workflow</code> block. Let&apos;s see what that would look like. Rename the <code>main.nf</code> to <code>main_old.nf</code> and copy its content into a new <code>main.nf</code>.</p><pre><code class="language-bash">cp main.nf main_old.nf</code></pre><p>We start by enabling the new DSL at the top of our file.</p><pre><code class="language-groovy">#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

[...]</code></pre><p>Then we remove all the <code>from</code> and <code>into</code> directives from our processes and add the following <code>workflow</code> block at the bottom.</p><pre><code class="language-groovy">[...]

workflow {
  // Run workflow for all .fasta files in the fasta directory
  fasta = Channel.fromPath(&quot;$baseDir/fasta/*.fasta&quot;)
  peptides = digestProtein(fasta)
  aa_count = countAA(peptides, fasta)
  count_plot = plotCount(aa_count)
  protein_report = generateReport(aa_count | collect)
  archive_results = archiveResults(count_plot | collect, protein_report)
}</code></pre><p>Last check, your <code>main.nf</code> should now look like this:</p><pre><code class="language-groovy">#!/usr/bin/env nextflow
nextflow.enable.dsl = 2


// Helper function to extract the protein name from the filename
def getProtein(fileName) {
  fileName.getBaseName().tokenize(&quot;.&quot;)[0]
}


// Digest a protein and save the peptides
process digestProtein {
  input:
    path input_fasta

  output:
    path &quot;*.txt&quot;

  script:
    def protein = getProtein(input_fasta)
    &quot;&quot;&quot;
    01_digest_protein.py ${input_fasta} ${protein}.peptides.txt
    &quot;&quot;&quot;
}


// Count the number of times a given amino acid appears in a protein as well 
// as its peptides after digestion
process countAA {  
  input:
    path input_peptides
    path input_fasta

  output:
    path &quot;*.tsv&quot;

  script:
    def protein = getProtein(input_peptides)
    &quot;&quot;&quot;
    02_count_amino_acids.py ${input_fasta} ${input_peptides} ${protein}.count.tsv
    &quot;&quot;&quot;
}


// Load the calculated counts and create a plot
process plotCount {  
  input:
    path input_count

  output:
    path &quot;*.png&quot; 

  script:
    def protein = getProtein(input_count)
    &quot;&quot;&quot;
    03a_plot_count.py ${input_count} ${protein}.plot.png
    &quot;&quot;&quot;
}


// Get a list of input files from a given folder and create a report
process generateReport {  
  input:
    path input_count

  output:
    path &quot;*.tsv&quot;

  script:
    &quot;&quot;&quot;
    03b_get_report.py ${input_count} --output_file=protein_report.tsv
    &quot;&quot;&quot;
}


// Gather result files and archive them
process archiveResults {  
  input:
    path input_plot
    path input_report

  output:
    path &quot;*.tgz&quot;

  script:
    &quot;&quot;&quot;
    mkdir results
    cp ${input_plot} ${input_report} results/
    tar -czf results.tgz results
    &quot;&quot;&quot;
}


workflow {
  // Run workflow for all .fasta files in the fasta directory
  fasta = Channel.fromPath(&quot;$baseDir/fasta/*.fasta&quot;)
  peptides = digestProtein(fasta)
  aa_count = countAA(peptides, fasta)
  count_plot = plotCount(aa_count)
  protein_report = generateReport(aa_count | collect)
  archive_results = archiveResults(count_plot | collect, protein_report)
}
</code></pre><h1 id="conclusion">Conclusion</h1><p>That&apos;s a wrap! We created a fully functional pipeline from the bottom-up, covering shell scripts, Makefiles, Nexflow as well as the two main types of execution: push- and pull-based. We&apos;ve seen the benefits that modern tools like Nextflow can have over more traditional approaches like scripts and Makefiles. Hopefully, this tutorial provided a solid baseline for what a pipeline is and how to write one from scratch (while climbing up the ladder of complexity).</p><p>For follow-up questions or feedback on this article, you can submit an issue through <a href="https://github.com/ricomnl/bioinformatics-pipeline-tutorial/issues">the accompanying GitHub repository</a> or reach me on <a href="https://twitter.com/ricomnl">Twitter</a>.</p><p>If you want to learn more about the concepts covered in this article check out these tutorials:</p><ul><li>Bioinformatics pipelines with Make: <a href="http://byronjsmith.com/make-bml/">http://byronjsmith.com/make-bml/</a></li><li>Bioinformatics pipelines with Nextflow: <a href="https://carpentries-incubator.github.io/workflows-nextflow/aio/index.html">https://carpentries-incubator.github.io/workflows-nextflow/aio/index.html</a></li></ul><p>If you&apos;re interested to explore some other tools check out these resources:</p><ul><li>A pretty extensive list of all existing pipeline tools: <a href="https://github.com/pditommaso/awesome-pipeline">https://github.com/pditommaso/awesome-pipeline</a></li><li>Nextflow vs Snakemake vs Reflow <a href="http://blog.booleanbiotech.com/nextflow-snakemake-reflow.html">http://blog.booleanbiotech.com/nextflow-snakemake-reflow.html</a></li><li>How to choose the right one: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7906312/">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7906312/</a></li><li>This review is also interesting: <a href="https://academic.oup.com/bib/article/18/3/530/2562749">https://academic.oup.com/bib/article/18/3/530/2562749</a></li><li>And <a href="https://twitter.com/gauravjain49/status/1219040943380336642">these</a> <a href="https://twitter.com/michelebusby/status/1217212677896003584">comparison</a> <a href="https://twitter.com/marius/status/1129036323778486278">threads</a> &#xA0;</li></ul><h1 id="optional-nextflow-in-the-cloud-%E2%98%81%EF%B8%8F">Optional: Nextflow in the Cloud &#x2601;&#xFE0F;</h1><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text">If you&apos;re starting at this point and want to follow along during this part, please checkout branch &quot;part_03&quot; from the GitHub repo. To see the final repo state, checkout branch &quot;part_04&quot;.</div></div><p>If you have followed the <a href="https://t-neumann.github.io/pipelines/AWS-pipeline/">tutorial I mentioned in the first paragraph</a> and set up your AWS environment for Nextflow, there&apos;s one more thing to mention. <br>The biggest advantage of Nextflow over shell scripts or Mainfiles is its ability to easily scale into the cloud. To do that we don&apos;t need to change anything in the <code>main.nf</code> file itself. We need to</p><ol><li>Setup our AWS Environment</li><li>Add another executor</li></ol><p>The one we&apos;ve been using implicitly so far looks like this (in the <code>nextflow.config</code>): </p><pre><code class="language-groovy">profiles {
	standard {
		process.executor = &quot;local&quot;
	}
}

[...]</code></pre><p>We now add the Nextflow plugin for AWS as well as our region of choice and a role to operate with. We also add an executor for AWS Batch. You can either build the docker image I used from the Dockerfile in the repository or use <a href="https://hub.docker.com/repository/docker/rmeinl/python-plt">the one I published</a> called <code>rmeinl/python-plt</code>. <br>The <code>nexflow.config</code> file should then look like this:</p><pre><code class="language-groovy">// Profiles
profiles {
	standard {
		process.executor = &quot;local&quot;
	}
	cloud {
		process {
			executor = &quot;awsbatch&quot;
			queue = &quot;terraform-nextflow-medium-size-spot-batch-job-queue&quot;
			container = &quot;rmeinl/python-plt:latest&quot;
		}
		errorStrategy = &quot;retry&quot;
		maxRetries = 3
	}
}

// Process
process {
	publishDir = [path: &quot;data/&quot;, mode: &quot;copy&quot;]
}

// Plugins
plugins {
    id &quot;nf-amazon&quot;
}

// AWS Setup
aws {
    region = &quot;us-west-2&quot;
    batch {
    	cliPath = &quot;/home/ec2-user/bin/aws&quot;
        jobRole = &quot;arn:aws:iam::622568582929:role/terraform-nextflow-batch-job-role&quot;
    }
}</code></pre><p>In order to run this whole workflow in the cloud we call:</p><pre><code class="language-bash">nextflow run main.nf -profile cloud</code></pre><p>You should now see this message indicating success:</p><pre><code class="language-bash">$ nextflow run main.nf -profile cloud
N E X T F L O W  ~  version 21.09.0-edge
Launching `main.nf` [dreamy_murdock] - revision: 7be483af55
Uploading local `bin` scripts folder to s3://terraform-nextflow-work-bucket/tmp/f4/43104ae6c68d4b50070806e54e391a/bin
executor &gt;  awsbatch (14)
[90/eabf4a] process &gt; digestProtein (3) [100%] 4 of 4 &#x2714;
[77/fec491] process &gt; countAA (4)       [100%] 4 of 4 &#x2714;
[95/e4ea25] process &gt; plotCount (4)     [100%] 4 of 4 &#x2714;
[e4/a2dff2] process &gt; generateReport    [100%] 1 of 1 &#x2714;
[4a/01e553] process &gt; archiveResults    [100%] 1 of 1 &#x2714;
Completed at: 09-Dec-2021 19:36:18
Duration    : 4m 33s
CPU hours   : (a few seconds)
Succeeded   : 14</code></pre>]]></content:encoded></item><item><title><![CDATA[Awesome Open-Source Bio/Cheminformatics]]></title><description><![CDATA[A (growing) list of open-source Bio/Cheminformatics tools that I found useful in my work. If you know other tools in this realm that I should check out, please reach out.]]></description><link>https://ricomnl.com/blog/open-source-bio-chem-informatics/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f57</guid><category><![CDATA[bioinformatics]]></category><category><![CDATA[cheminformatics]]></category><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Wed, 30 Jun 2021 20:38:46 GMT</pubDate><content:encoded><![CDATA[<p>A (growing) list of open-source Bio/Cheminformatics tools that I found useful in my work. If you know other tools in this realm that I should check out, please reach out.</p><h3 id="autodock-vina"><a href="http://vina.scripps.edu/">Autodock Vina</a></h3><p>#molecular-docking</p><ul><li>Open-source program for doing <a href="http://en.wikipedia.org/wiki/Docking_(molecular)">molecular docking</a>.</li></ul><p>Publication: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3041641/">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3041641/</a></p><p>Forks:</p><ul><li><a href="https://github.com/mwojcikowski/smina">smina</a> is a fork of AutoDock Vina that focuses on improving scoring and minimization</li><li><a href="https://qvina.github.io/">QuickVina</a> - fast and accurate molecular docking tool, attained at accurately accelerating AutoDock Vina</li><li><a href="https://github.com/gnina/gnina">Gnina</a> - molecular docking program with integrated support for scoring and optimizing ligands using convolutional neural networks. It is a fork of smina, which is a fork of AutoDock Vina</li></ul><h3 id="autodock-gpu"><a href="https://github.com/ccsb-scripps/AutoDock-GPU">Autodock GPU</a></h3><p>#molecular-docking</p><ul><li>OpenCL and Cuda accelerated version of AutoDock4.2.6. It leverages its embarrasingly parallelizable LGA by processing ligand-receptor poses in parallel over multiple compute units.</li></ul><p>Github: <a href="https://github.com/ccsb-scripps/AutoDock-GPU">https://github.com/ccsb-scripps/AutoDock-GPU</a><br>Publication: Accelerating AutoDock4 with GPUs and Gradient-Based Local Search, <a href="https://doi.org/10.1021/acs.jctc.0c01006" rel="nofollow">J. Chem. Theory Comput. 2021, 10.1021/acs.jctc.0c01006</a></p><h3 id="virtualflow"><a href="https://virtual-flow.org/">VirtualFlow</a></h3><p>#virtual-screening</p><ul><li>VirtualFlow is a versatile, parallel workflow platform for carrying out virtual screening related tasks on Linux-based computer clusters of any type and size which are managed by a batchsystem (such as SLURM).</li></ul><p>Github: <a href="https://github.com/VirtualFlow/VFVS">https://github.com/VirtualFlow/VFVS</a><br>Publication: An open-source drug discovery platform enables ultra-large virtual screens. Nature 580, 663&#x2013;668 (2020). <a href="https://doi.org/10.1038/s41586-020-2117-z">https://doi.org/10.1038/s41586-020-2117-z</a></p><h3 id="gypsum-dl"><a href="https://durrantlab.pitt.edu/gypsum-dl/">Gypsum-DL</a></h3><p>#ligand-preparation</p><ul><li>Gypsum-DL is a free, open-source program for preparing 3D small-molecule models. Beyond simply assigning atomic coordinates, Gypsum-DL accounts for alternate ionization, tautomeric, chiral, cis/trans isomeric, and ring-conformational forms.</li></ul><p>Gitlab: <a href="https://git.durrantlab.pitt.edu/jdurrant/gypsum_dl">https://git.durrantlab.pitt.edu/jdurrant/gypsum_dl</a><br>Publication: &quot;Gypsum-DL: An Open-source Program for Preparing Small-molecule Libraries for Structure-based Virtual Screening.&quot; Journal of Cheminformatics 11:1. <a href="https://doi.org/10.1186/s13321-019-0358-3">doi:10.1186/s13321-019-0358-3</a></p><h3 id="lit-pcba"><a href="http://drugdesign.unistra.fr/LIT-PCBA/">LIT-PCBA</a></h3><p>#dataset</p><ul><li>15 target sets, 9780 actives and 407839 unique inactives selected from high-confidence <a href="http://drugdesign.unistra.fr/LIT-PCBA/Files/LIT-PCBA_bioactivities.xlsx">PubChem Bioassay data</a></li></ul><p>Data: <a href="http://drugdesign.unistra.fr/LIT-PCBA/">http://drugdesign.unistra.fr/LIT-PCBA/</a><br>Publication: LIT-PCBA: An Unbiased Data Set for Machine Learning and Virtual Screening. <a href="https://doi.org/10.1021/acs.jcim.0c00155">https://doi.org/10.1021/acs.jcim.0c00155</a></p><h3 id="apricot"><a href="https://apricot-select.readthedocs.io/en/latest/index.html">Apricot</a></h3><p>#submodular-optimization</p><ul><li>apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly. See the documentation page: <a href="https://apricot-select.readthedocs.io/en/latest/index.html" rel="nofollow">https://apricot-select.readthedocs.io/en/latest/index.html</a></li></ul><p>Github: <a href="https://github.com/jmschrei/apricot">https://github.com/jmschrei/apricot</a><br>Publication: <a href="https://jmlr.org/papers/volume21/19-467/19-467.pdf">https://jmlr.org/papers/volume21/19-467/19-467.pdf</a></p><h3 id="molpal"><a href="https://github.com/coleygroup/molpal">MolPal</a></h3><p>#active-learning</p><ul><li>Accelerating high-throughput virtual screening through molecular pool-based active learning.</li></ul><p>Github: <a href="https://github.com/coleygroup/molpal">https://github.com/coleygroup/molpal</a><br>Publication: <a href="https://arxiv.org/abs/2012.07127">https://arxiv.org/abs/2012.07127</a></p><h3 id="pyscreener"><a href="https://github.com/coleygroup/pyscreener">PyScreener</a></h3><p>#virtual-screening</p><ul><li>A pythonic interface to high-throughput virtual screening software.</li></ul><p>Github: <a href="https://github.com/coleygroup/pyscreener">https://github.com/coleygroup/pyscreener</a></p><h3 id="other-resources">Other Resources</h3><ul><li>Building a virtual ligand screening pipeline using free software: a survey.<strong> </strong><a href="https://doi.org/10.1093/bib/bbv037">https://doi.org/10.1093/bib/bbv037</a></li></ul>]]></content:encoded></item><item><title><![CDATA[How to set up your own ENS domain name]]></title><description><![CDATA[This is a short tutorial on how to set up your own ENS domain name using MetaMask and Google Chrome.]]></description><link>https://ricomnl.com/blog/how-to-set-up-ens-domain-name/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f53</guid><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Sun, 25 Apr 2021 16:11:36 GMT</pubDate><content:encoded><![CDATA[<p>This is a short tutorial on how to set up your own ENS domain name using MetaMask and Google Chrome. Normally I&apos;m using Brave but I thought doing the demos in Chrome would allow more people to access it.</p><ol><li>Go to https://ens.domains/ and click <strong>Launch App</strong>.</li></ol><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/1-launch.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>2. Enter the name you are planning to register. It can end with .eth but doesn&apos;t have to.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/2-search-name.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>3. Now we have to connect ENS to an Ethereum account. Though they offer multiple ways to connect your wallet (as shown below) we are going to use MetaMask for this tutorial. If you already have a MetaMask account set up, you can jump straight to step 10.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/3-connect-wallet.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>4. In order to download MetaMask, go to <a href="https://metamask.io/download.html">https://metamask.io/download.html</a> and press <strong>Install MetaMask for Chrome</strong>. You&apos;ll be redirected to the Chrome Web Store.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/4-metamask-chrome.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>5. Next, add MetaMask to Chrome by pressing <strong>Add to Chrome</strong>. After installation, you&apos;ll be redirected to the setup screen. </p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/5-add-metamask-chrome.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>6. Press <strong>Get Started</strong> and <strong>Create a Wallet</strong> unless you already have one. </p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/6-metamask-get-started.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>7. Create a password for your wallet.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/7-create-password.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>8. Store your <strong>secret</strong> backup phrase in a safe place. It makes it easy to back up and restore your account. (The only reason I&apos;m showing my phrase is because I&apos;m using a throwaway account for this tutorial. You should never show it to anyone.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/8-backup-phrase.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>9. You&apos;re all set. You should now see a page with your MetaMask account like this.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/9-account.png" class="kg-image" alt loading="lazy" width="2000" height="1213" srcset="https://ricomnl.com/content/images/size/w600/2022/01/9-account.png 600w, https://ricomnl.com/content/images/size/w1000/2022/01/9-account.png 1000w, https://ricomnl.com/content/images/size/w1600/2022/01/9-account.png 1600w, https://ricomnl.com/content/images/size/w2400/2022/01/9-account.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>10. Now that MetaMask is all set up, switch back to the ENS tab and click <strong>Connect </strong>to connect with your wallet. It&apos;ll open the same window as in step 3, but it should also include MetaMask now. You might have to refresh your browser.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/10-connect-wallet.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>11. Click <strong>MetaMask </strong>and select the account you want to authenticate with. Click <strong>Next</strong> and finally <strong>Connect</strong>. On the left side next to the ENS domain name you should now see that your account is connected to the mainnet.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/11-init-metamask.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>12. In order to pay for the domain name you need to add some Ether to your account. The fastest way to do that is a direct deposit as shown below. Go to <strong>Buy</strong> &gt; <strong>Directly Deposit Ether </strong> &gt; <strong>View Account </strong>to get your MetaMask Ether address. Use the wallet of your choice to send Ether to this account. It could take up to 10 minutes for your funds to arrive.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/12-deposit-eth.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>13. Click on the domain name you want to register and select the number of years you want to reserve it for (2+ years are recommended, given the gas fees). There are three steps in total as listed on the website.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/13.1-three-steps.png" class="kg-image" alt loading="lazy" width="2000" height="282" srcset="https://ricomnl.com/content/images/size/w600/2022/01/13.1-three-steps.png 600w, https://ricomnl.com/content/images/size/w1000/2022/01/13.1-three-steps.png 1000w, https://ricomnl.com/content/images/size/w1600/2022/01/13.1-three-steps.png 1600w, https://ricomnl.com/content/images/size/w2400/2022/01/13.1-three-steps.png 2400w" sizes="(min-width: 720px) 720px"></figure><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/13-register-years.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>14. <strong>Request to register</strong>: Your wallet will open and you will be asked to confirm the first of two transactions required for registration.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/14-request-to-register.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>15. <strong>Wait for 1 minute</strong>: The waiting period is required to ensure another person hasn&#x2019;t tried to register the same name and protect you after your request. Afterward, your screen should look like this.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/15-wait-for-1-minute.png" class="kg-image" alt loading="lazy" width="2000" height="1220" srcset="https://ricomnl.com/content/images/size/w600/2022/01/15-wait-for-1-minute.png 600w, https://ricomnl.com/content/images/size/w1000/2022/01/15-wait-for-1-minute.png 1000w, https://ricomnl.com/content/images/size/w1600/2022/01/15-wait-for-1-minute.png 1600w, https://ricomnl.com/content/images/size/w2400/2022/01/15-wait-for-1-minute.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>16. <strong>Complete Registration</strong>: Click <strong>Register</strong> and your wallet will re-open. Only after the 2nd transaction is confirmed you&apos;ll know if you got the name. This could take up to 10 minutes. As you can see, this transaction cost me about $50 in total but the gas fees are variable so it might be more or less depending on when you submit yours.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/16-complete-registration.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>17. After the registration is completed, you should see the name show up under <strong>My Account</strong>.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/17-my-account.png" class="kg-image" alt loading="lazy" width="2000" height="1220" srcset="https://ricomnl.com/content/images/size/w600/2022/01/17-my-account.png 600w, https://ricomnl.com/content/images/size/w1000/2022/01/17-my-account.png 1000w, https://ricomnl.com/content/images/size/w1600/2022/01/17-my-account.png 1600w, https://ricomnl.com/content/images/size/w2400/2022/01/17-my-account.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>18. Click <strong>Reverse record: not set</strong>. Select your ENS name then click <strong>Save</strong>, and submit the transaction to save it on the blockchain.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/18-reverse-record.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>19. After about 10 minutes you should see that your reverse record has been set up successfully.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/19-record-success.png" class="kg-image" alt loading="lazy" width="2000" height="1220" srcset="https://ricomnl.com/content/images/size/w600/2022/01/19-record-success.png 600w, https://ricomnl.com/content/images/size/w1000/2022/01/19-record-success.png 1000w, https://ricomnl.com/content/images/size/w1600/2022/01/19-record-success.png 1600w, https://ricomnl.com/content/images/size/w2400/2022/01/19-record-success.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>20. In order to add some records, click on your name in the list below. You should see that it already points to your Ethereum address. Click on <strong>Add/Edit Record</strong>. I&apos;m going to add my BTC address, my website and twitter as well as github handle.</p><p></p><p>21. Finally, confirm the transaction and submit it to the blockchain via MetaMask. This should take another 10 minutes.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2022/01/21-confirm-transaction.gif" class="kg-image" alt loading="lazy" width="2000" height="1100"></figure><p>22. That&apos;s it! You can now use a browser like Opera or Brave to check whether everything worked out. I&apos;m using Brave here, which will initially ask for confirmation to redirect via ENS. You should then see your record. If you neither have Brave or Opera, just go to <a href="https://app.ens.domains/name/rmeinl.eth">https://app.ens.domains/name/&lt;your_domain&gt;.eth</a></p><hr><p>We just walked through how to set up your own ENS domain name using MetaMask and Chrome. To give you a rough idea about the costs, the whole process cost me $97.86. Here&apos;s the breakdown: </p><ul><li>$6.17 for step 14 (initial request)</li><li>$46.26 for step 16 (paying for the name)</li><li>$21.48 for step 18 (setting up the reverse record)</li><li>$23.95 for step 21 (adding custom records)</li></ul><p>Obviously the majority of these costs are gas fees, you only pay ENS for step 16 so it will vary for you depending on when you set up yours.</p><p>Hope this was helpful!</p>]]></content:encoded></item><item><title><![CDATA[Setting up Virtual Flow on AWS using Parallelcluster and Slurm]]></title><description><![CDATA[<p>This is a short tutorial on how to set up AWS <a href="https://aws.amazon.com/hpc/parallelcluster/">Parallelcluster</a> with Slurm to run <a href="https://virtual-flow.org/">VirtualFlow</a>. </p><blockquote>VirtualFlow is a versatile, parallel workflow platform for carrying out virtual screening related tasks on Linux-based computer clusters of any type and size which are managed by a batchsystem (such as SLURM).</blockquote><h2 id="aws-parallelcluster-with-slurm">AWS</h2>]]></description><link>https://ricomnl.com/blog/setting-up-virtual-flow-on-aws/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f51</guid><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Mon, 19 Apr 2021 10:27:48 GMT</pubDate><content:encoded><![CDATA[<p>This is a short tutorial on how to set up AWS <a href="https://aws.amazon.com/hpc/parallelcluster/">Parallelcluster</a> with Slurm to run <a href="https://virtual-flow.org/">VirtualFlow</a>. </p><blockquote>VirtualFlow is a versatile, parallel workflow platform for carrying out virtual screening related tasks on Linux-based computer clusters of any type and size which are managed by a batchsystem (such as SLURM).</blockquote><h2 id="aws-parallelcluster-with-slurm">AWS Parallelcluster with Slurm</h2><h3 id="creating-our-working-environment">Creating our working environment</h3><p>First, we&apos;ll create our working directory and set up a virtual environment using poetry. We need to add the awscli package as well as the aws-parallelcluster package.</p><pre><code>mkdir parallel_cluster
cd parallel_cluster
poetry init
poetry add awscli aws-parallelcluster</code></pre><h3 id="setting-up-the-cluster-config">Setting up the cluster config</h3><p>To set up the AWS Parallelcluster I mainly followed <a href="https://aws.amazon.com/blogs/opensource/aws-parallelcluster/">this post</a>. We start by creating the config for our cluster. Make sure to create an <a href="https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#KeyPairs">EC2 key pair</a> beforehand.</p><pre><code> $ poetry run pcluster configure                  
Allowed values for AWS Region ID:
1. ap-northeast-1
2. ap-northeast-2
3. ap-south-1
4. ap-southeast-1
5. ap-southeast-2
6. ca-central-1
7. eu-central-1
8. eu-north-1
9. eu-west-1
10. eu-west-2
11. eu-west-3
12. sa-east-1
13. us-east-1
14. us-east-2
15. us-west-1
16. us-west-2
AWS Region ID [us-west-2]: 16
Allowed values for EC2 Key Pair Name:
1. parallelcluster
EC2 Key Pair Name [parallelcluster]: 1
Allowed values for Scheduler:
1. sge
2. torque
3. slurm
4. awsbatch
Scheduler [slurm]: 3
Allowed values for Operating System:
1. alinux
2. alinux2
3. centos7
4. centos8
5. ubuntu1604
6. ubuntu1804
Operating System [alinux2]: 2
Minimum cluster size (instances) [0]: 1
Maximum cluster size (instances) [10]: 
Head node instance type [t2.micro]: c4.large
Compute instance type [t2.micro]: c4.xlarge
Automate VPC creation? (y/n) [n]: y</code></pre><p>We should now have a config file similar to this:</p><pre><code>$ cat ~/.parallelcluster/config 
[aws]
aws_region_name = us-west-2

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[global]
cluster_template = default
update_check = true
sanity_check = true

[vpc default]
vpc_id = vpc-*****************
master_subnet_id = subnet-*****************

[cluster default]
key_name = parallelcluster
scheduler = slurm
master_instance_type = c4.large
base_os = alinux2
vpc_settings = default
queue_settings = compute

[queue compute]
enable_efa = false
enable_efa_gdr = false
compute_resource_settings = default

[compute_resource default]
instance_type = c4.xlarge
min_count = 1</code></pre><h3 id="creating-the-cluster">Creating the cluster</h3><p>After the config file is set, we can create our cluster using the following commands. AWS will then spin up our CloudFormation stack which will take a couple of minutes.</p><pre><code>$ poetry run pcluster create test-cluster
Beginning cluster creation for cluster: test-cluster
Creating stack named: parallelcluster-test-cluster
...</code></pre><p>In order to access our head node we can run the following:</p><pre><code>poetry run pcluster ssh test-cluster -i ~/.ssh/&lt;key_name&gt;</code></pre><h2 id="virtualflow">VirtualFlow</h2><p>To get started with VirtualFlow I recommend running through the <a href="https://docs.virtual-flow.org/tutorials/-LdE94b2AVfBFT72zK-v/vfvs-tutorial-1/introduction">first tutorial</a> to make sure the cluster has been set up correctly. I&apos;m only going through the changes that need to be made and list the other steps solely for completeness. The tutorial does a good job at explaining each individual step.</p><h3 id="setting-up-virtualflow">Setting up VirtualFlow</h3><p>First, we download the tutorial files and unzip them.</p><pre><code>$ wget https://virtual-flow.org/sites/virtual-flow.org/files/tutorials/VFVS_GK.tar
$ tar -xvf VFVS_GK.tar
$ cd VFVS_GK/tools
</code></pre><h3 id="preparing-the-config-files">Preparing the config files</h3><p>There are two files in which we need to make changes. We want to make sure our batchsystem is set to &apos;SLURM&apos; and change the partition to &apos;compute&apos; which is the default name when we use AWS Parallelcluster. </p><pre><code># tools/templates/all.ctrl
...
batchsystem=SLURM
# Possible values: SLURM, TOQRUE, PBS, LSF, SGE
# Settable via range control files: No
...
partition=compute
# Partitions are also called queues in some batchsystems
# Settable via range control files: Yes</code></pre><p>If &apos;compute&apos; doesn&apos;t work, try running the following command to retrieve the correct partition name: </p><pre><code>$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
compute*     up   infinite      5  idle~ compute-dy-c4xlarge-[5-9] 
compute*     up   infinite      5  alloc compute-dy-c4xlarge-[1-4],compute-st-c4xlarge-1 </code></pre><p>The second config file we need to adjust is the Slurm job template script. Usually we should be able to leave all the default values but I ran into this error:</p><pre><code>srun: error: Unable to create step for job 874794: Memory required by task is not available</code></pre><p>In order to solve it, we simply comment out the line with the --mem-per-cpu parameter.</p><pre><code># Slurm Settings
###############################################################################

#SBATCH --job-name=h-1.1
##SBATCH --mail-user=To be completed if uncommented
#SBATCH --mail-type=fail
#SBATCH --time=00-12:00:00
##SBATCH --mem-per-cpu=1024M
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=main
#SBATCH --output=../workflow/output-files/jobs/job-1.1_%j.out           # File to which standard out will be written
#SBATCH --error=../workflow/output-files/jobs/job-1.1_%j.out            # File to which standard err will be written
#SBATCH --signal=10@300</code></pre><p>As a last preparation step we simply go back to the /tools subfolder and run this command:</p><pre><code>./vf_prepare_folders.sh</code></pre><p>More details here: <a href="https://docs.virtual-flow.org/tutorials/-LdE94b2AVfBFT72zK-v/vfvs-tutorial-1/setting-up-the-workflow">https://docs.virtual-flow.org/tutorials/-LdE94b2AVfBFT72zK-v/vfvs-tutorial-1/setting-up-the-workflow</a>.</p><h3 id="starting-the-jobs">Starting the jobs</h3><p>To spin up our nodes, we simply run this command:</p><pre><code>./vf_start_jobline.sh 1 12 templates/template1.slurm.sh submit 1</code></pre><p>More details can be found here: <a href="https://docs.virtual-flow.org/tutorials/-LdE94b2AVfBFT72zK-v/vfvs-tutorial-1/starting-the-workflow">https://docs.virtual-flow.org/tutorials/-LdE94b2AVfBFT72zK-v/vfvs-tutorial-1/starting-the-workflow</a>.</p><h3 id="monitoring-and-wrapping-up">Monitoring and Wrapping Up</h3><p>In order to monitor the jobs and view the files after completion, I recommend the respective sections of the tutorial: </p><p><a href="https://docs.virtual-flow.org/tutorials/-LdE94b2AVfBFT72zK-v/vfvs-tutorial-1/monitoring-the-workflow">Monitoring</a></p><p><a href="https://docs.virtual-flow.org/tutorials/-LdE94b2AVfBFT72zK-v/vfvs-tutorial-1/the-completed-workflow">Completed Workflow</a></p><h2 id="using-our-own-files">Using our own files</h2><p>Running the same workflow with our own files is pretty straightforward. After we downloaded the template files in the &apos;Setting up VirtualFlow&apos; step we need to replace the ligand library as well as our target protein. </p><h3 id="replacing-the-ligand-library">Replacing the ligand library</h3><p>The second tutorial in the VirtualFlow documentation has <a href="https://docs.virtual-flow.org/tutorials/-LdE94b2AVfBFT72zK-v/tutorial-2-vfvs-scratch/setting-up-the-workflow#preparing-the-input-files-folder">a section dedicated</a> to this.</p><h3 id="using-a-different-protein">Using a different Protein</h3><p>Here, I downloaded <a href="http://vina.scripps.edu/download.html">AutoDock Vina</a> together with <a href="http://mgltools.scripps.edu/downloads">MGLTools</a> and followed the <a href="http://vina.scripps.edu/tutorial.html">tutorial</a> on <a href="http://vina.scripps.edu/tutorial.html">http://vina.scripps.edu</a> which looks outdated but still works fine. We can use AutoDock Vina to convert our protein from .pbd to .pdbqt and use the &apos;GridBox&apos; tool to get the necessary parameters for respective receptor config file. </p><pre><code># ../input-files/smina_rigid_receptor1/config.txt
receptor = ../input-files/receptor/&lt;protein&gt;.pdbqt
center_x = 28.614
center_y = 15.838
center_z = -2.045
size_x = 36.0
size_y = 32.0
size_z = 36.0
exhaustiveness = 4
scoring = vinardo
cpu = 1</code></pre><p>We add our protein to the folder and change both the smina (/input-files/smina_rigid_receptor1) and qvina receptor (/input-files/qvina02_rigid_receptor1) config files. </p><p>That&apos;s it. Now we can follow the rest of the steps outlined in the &apos;VirtualFlow&apos; section above.</p>]]></content:encoded></item><item><title><![CDATA[The 80/20 Computer Science Degree]]></title><description><![CDATA[Nand to Tetris was created by two CS professors, Noam Nisan and Shimon Schocken. In a nutshell, you'll build your own computer in a bottom-up fashion all the way up from NAND gates. ]]></description><link>https://ricomnl.com/blog/nand2tetris/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f50</guid><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Tue, 30 Mar 2021 22:12:56 GMT</pubDate><content:encoded><![CDATA[<p>DRAFT/DISCLAIMER &#x2014; This post is a submission to a competition on <a href="http://1729.com/" rel="noopener noreferrer">1729.com</a>. No prizes will be awarded for any submissions at this time. Learn more at <a href="https://1729.com/decentralized-task-creation" rel="noopener noreferrer">1729.com/decentralized-task-creation</a>. </p><p>Nonetheless, I highly recommend everyone who is interested in CS to take this course.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2021/03/nand2tetris-1.png" class="kg-image" alt loading="lazy" width="2000" height="1216" srcset="https://ricomnl.com/content/images/size/w600/2021/03/nand2tetris-1.png 600w, https://ricomnl.com/content/images/size/w1000/2021/03/nand2tetris-1.png 1000w, https://ricomnl.com/content/images/size/w1600/2021/03/nand2tetris-1.png 1600w, https://ricomnl.com/content/images/2021/03/nand2tetris-1.png 2000w" sizes="(min-width: 720px) 720px"></figure><p>Listening to podcasts with people in tech, you&apos;ll often hear that they got interested in the field because they built their own computers or coded their own games. Elon, for example, sold his first computer game at the age of 12 and built custom computers for others in university. </p><p>Now, that most people have laptops it becomes harder to just open them up, check what&apos;s inside and put it back together. Of course you could buy go and buy all the parts separately or get <a href="https://www.playpiper.com/">a DIY kit</a>. Though this might not be logistically feasible for everyone which is a shame because this kind of tinkering is a great learning vehicle for anything related to Computer Science. </p><p>What if you could virtualize the whole experience while being guided by some world-class CS professors? Enter <a href="https://www.nand2tetris.org/">Nand to Tetris</a>.</p><h2 id="a-game-changer-in-cs-education">A game changer in CS education</h2><p>Nand to Tetris was created by two CS professors, <a href="http://www.cs.huji.ac.il/~noam/" rel="noopener">Noam Nisan</a> and <a href="http://www.shimonschocken.com/" rel="noopener">Shimon Schocken</a>. In a nutshell, you&apos;ll build your own computer in a bottom-up fashion all the way up from <a href="https://en.wikipedia.org/wiki/NAND_gate#:~:text=In%20digital%20electronics%2C%20a%20NAND,HIGH%20(1)%20output%20results.">NAND gates</a>. <br>In the process, you&apos;ll get a hands-on coverage of most of the important ideas and techniques in applied computer science, focusing on computer architecture, compilation, and software engineering, in one course. Nand to Tetris also provides a hands-on overview of key data structures and algorithms, as they unfold in the context of 12 captivating hardware and software development projects.</p><blockquote>Nand to Tetris courses are now taught at 200+ universities and high schools around the world. The students who take them range from high school students to Ph.D. students to Google engineers.</blockquote><h2 id="task-earn-500-in-btc">Task: Earn $500 in BTC</h2><h3 id="complete-all-12-projects-and-submit-a-link-to-the-github-project-repository">Complete all 12 projects and submit a link to the Github project repository</h3><ol><li><a href="https://drive.google.com/file/d/1MY1buFHo_Wx5DPrKhCNSA2cm5ltwFJzM/view">Boolean Logic</a></li><li><a href="https://b1391bd6-da3d-477d-8c01-38cdf774495a.filesusr.com/ugd/56440f_2e6113c60ec34ed0bc2035c9d1313066.pdf">Boolean Arithmetic</a></li><li><a href="https://b1391bd6-da3d-477d-8c01-38cdf774495a.filesusr.com/ugd/56440f_e458602dcb0c4af9aaeb7fdaa34bb2b4.pdf">Sequential Logic</a></li><li><a href="https://b1391bd6-da3d-477d-8c01-38cdf774495a.filesusr.com/ugd/56440f_12f488fe481344328506857e6a799f79.pdf">Machine Language</a></li><li><a href="https://b1391bd6-da3d-477d-8c01-38cdf774495a.filesusr.com/ugd/56440f_96cbb9c6b8b84760a04c369453b62908.pdf">Computer Architecture</a></li><li><a href="https://b1391bd6-da3d-477d-8c01-38cdf774495a.filesusr.com/ugd/56440f_65a2d8eef0ed4e0ea2471030206269b5.pdf">Assembler</a></li><li><a href="https://drive.google.com/file/d/19fe1PeGnggDHymu4LlVY08KmDdhMVRpm/view">Virtual Machine I: Stack Arithmetic</a></li><li><a href="https://drive.google.com/file/d/1lBsaO5XKLkUgrGY6g6vLMsiZo6rWxlYJ/view">Virtual Machine II: Program Control</a></li><li><a href="https://drive.google.com/file/d/1rbHGZV8AK4UalmdJyivgt0fpPiD1Q6Vk/view">High Level Language</a></li><li><a href="https://drive.google.com/file/d/1ujgcS7GoI-zu56FxhfkTAvEgZ6JT7Dxl/view">Compiler I: Syntax Analysis</a></li><li><a href="https://drive.google.com/file/d/1DfGKr0fuJcCvlIPABNSg7fsLfFFqRLex/view">Compiler II: Code Generation</a></li><li><a href="https://drive.google.com/file/d/137PiYjt4CAZ3ROWiD0DJ8XMUbMM0_VHR/view">Operating System</a></li><li>Tetris</li></ol><p>During these 12 projects you will build your own Assembler, Virtual Machine, Java-like High Level Language, Compiler and Operating System. In the optional 13th project you can tie all these things together to write an implementation of Tetris or any other game of your choice using all the components you previously built.</p><p>There is a guided Coursera course with <a href="https://www.coursera.org/learn/build-a-computer">two</a> <a href="https://www.coursera.org/learn/nand2tetris2">parts</a> but just using the links above or the <a href="https://www.amazon.com/Elements-Computing-Systems-Building-Principles/dp/0262640686/ref=ed_oe_p">book</a> works perfectly fine. <a href="https://www.youtube.com/watch?v=wTl5wRDT0CU">Check out the introduction video here</a>. Some inspirational projects can be found <a href="https://www.nand2tetris.org/copy-of-talks">here</a>.</p><!--kg-card-begin: html--><iframe src="https://docs.google.com/forms/d/e/1FAIpQLSfZ4OSXgNF7mvOJt4q65xd-g2SeRNIgPqFSpmHYOLBJVpCSSg/viewform?embedded=true" width="640" height="1451" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe><!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[Sequential and simultaneous modes of awareness]]></title><description><![CDATA[<p>The most interesting part in Ted Chiang&apos;s &quot;Story of your life&quot; is the parallel of the causal and teleological explanation with a sequential and simultaneous mode of awareness.</p><p>Fermat&apos;s principle of least time can be interpreted in terms of cause and effect: a difference</p>]]></description><link>https://ricomnl.com/blog/sequential-and-simultaneous-modes-of-awareness/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f4f</guid><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Mon, 30 Nov 2020 17:24:24 GMT</pubDate><content:encoded><![CDATA[<p>The most interesting part in Ted Chiang&apos;s &quot;Story of your life&quot; is the parallel of the causal and teleological explanation with a sequential and simultaneous mode of awareness.</p><p>Fermat&apos;s principle of least time can be interpreted in terms of cause and effect: a difference in the index of refraction caused the light ray to change direction when it hit the surface of the water. This is most intuitive to us humans. </p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2020/11/Swimmer-and-Lifeguard-2.jpg" class="kg-image" alt loading="lazy" width="960" height="720" srcset="https://ricomnl.com/content/images/size/w600/2020/11/Swimmer-and-Lifeguard-2.jpg 600w, https://ricomnl.com/content/images/2020/11/Swimmer-and-Lifeguard-2.jpg 960w" sizes="(min-width: 720px) 720px"></figure><p>It can also be interpreted teleologically: the ray of light has to know where its destination is in order to compute the path of least time. This is more intuitive to the heptapods.</p><p>The parallel to the causal explanation is a sequential mode of awareness: experiencing events in order, and perceiving their relationship as cause and effect. This is how humans experience things. We don&apos;t know the future and are therefore able to exercise free will.</p><p>The parallel to the teleological explanation is a simultaneous mode of awareness: experiencing events all at once, and perceiving a purpose underlying them all. This is how heptapods experience. They already know the future, so freedom is meaningless and every act is performative*.</p><p>If you have free will, it&apos;s impossible to know about the future because you could change it. On the other side, if you know the future you cannot act freely anymore. (as in the example of the book of ages).</p><p>Sequential and simultaneous modes of awareness are like the optical illusion of the old and young lady. Both are valid but you can&apos;t see them at the same time.</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2020/11/rQkQZ6pDZbEHz23rxckWPm.png" class="kg-image" alt loading="lazy" width="900" height="1235" srcset="https://ricomnl.com/content/images/size/w600/2020/11/rQkQZ6pDZbEHz23rxckWPm.png 600w, https://ricomnl.com/content/images/2020/11/rQkQZ6pDZbEHz23rxckWPm.png 900w" sizes="(min-width: 720px) 720px"></figure><hr><p>*Performative language: Saying equals doing.<br>	<em>Example</em>: At a wedding ceremony everybody knows that at the end the pastor will pronounce the couple husband and wife but it doesn&apos;t count until he actually says it.</p>]]></content:encoded></item><item><title><![CDATA[The best way to encompass the future is by building a strong set of beliefs.]]></title><description><![CDATA[<p>Using claims as a first-class citizen in your thinking helps you move towards strong beliefs.</p><p>If you don&apos;t explicitly state your claims you&apos;re never going to move in any direction. Everything will seem kind of relevant and worth pursuing.</p><p>Writing down claims can help manifest them.</p>]]></description><link>https://ricomnl.com/blog/the-best-way-to-encompass-the-future-is-by-building-a-strong-set-of-beliefs/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f4e</guid><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Wed, 28 Oct 2020 22:13:50 GMT</pubDate><content:encoded><![CDATA[<p>Using claims as a first-class citizen in your thinking helps you move towards strong beliefs.</p><p>If you don&apos;t explicitly state your claims you&apos;re never going to move in any direction. Everything will seem kind of relevant and worth pursuing.</p><p>Writing down claims can help manifest them. It helps to understand their implications, as well as supporting and opposing claims.</p><p>Claims eventually turn into beliefs and beliefs give perspective. They act like gravity. Strong beliefs are something that new information can be attached to.</p><p>Related:</p><blockquote><a href="https://www.saffo.com/02008/07/26/strong-opinions-weakly-held/">The best way to get to a good forecast is by making predictions with limited information and trying to find opposing evidence; then using the accumulated insights to improve your predictions. </a></blockquote><figure class="kg-card kg-embed-card kg-card-hascaption"><blockquote class="twitter-tweet" data-width="550"><p lang="en" dir="ltr">The best way to encompass the future is by building a strong set of beliefs.</p>&#x2014; Rico Meinl (@rmeinl) <a href="https://twitter.com/rmeinl/status/1321574734161694720?ref_src=twsrc%5Etfw">October 28, 2020</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<figcaption>Discussed on Twitter.</figcaption></figure>]]></content:encoded></item><item><title><![CDATA[Why we gain compounding benefits from incremental knowledge tools]]></title><description><![CDATA[<p>Knowledge and productivity are like compound interest. As knowledge workers, we live on the margins and every seemingly little improvement can add up to that compound in the long run.</p><p><a href="https://notes.andymatuschak.org/Knowledge_work_should_accrete">The more you know, the more you learn</a>; the more you learn, the more you can do; the more you</p>]]></description><link>https://ricomnl.com/blog/the-marginal-benefits-we-gain-from-knowledge-tools-are/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f4d</guid><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Tue, 27 Oct 2020 11:43:49 GMT</pubDate><content:encoded><![CDATA[<p>Knowledge and productivity are like compound interest. As knowledge workers, we live on the margins and every seemingly little improvement can add up to that compound in the long run.</p><p><a href="https://notes.andymatuschak.org/Knowledge_work_should_accrete">The more you know, the more you learn</a>; the more you learn, the more you can do; the more you can do, the more the opportunity. </p><p>With the old file cabinet like note taking systems there was literally no gain when going from 10 notes to 10.000 notes. It was probably more of a downward linear trend because of the growing lack of structure. With graph-based tools like Roam Research, your knowledge management system can improve almost exponentially the more you add to it (if done right). The increasing number of notes allows for ever more unexpected connections.</p><p>Roam Research is also an IDE for knowledge work and enables us to treat notes as composable blocks of knowledge. <a href="https://notes.andymatuschak.org/z7DvEiUpF6dYkFGbpZZTBKQVM9jjNnx8D8Xzu">Text is not as composable as code or graphic elements.</a></p><p>But as the Zettelkasten shows, the notes that contribute to an idea and eventually to a piece of content are very much composable. Knowledge systems that compose and have atomic statements make it much easier to write and publish.</p><p>The interface of Roam is mouldable and we can build our own meta-tools on top of it. The question for all the builders will be if we can make the new meta-tools for knowledge as valuable as the meta-tools for programming.</p><figure class="kg-card kg-embed-card kg-card-hascaption"><blockquote class="twitter-tweet" data-width="550"><p lang="en" dir="ltr">When you zoom out and look at the bigger picture, a tool like <a href="https://twitter.com/RoamResearch?ref_src=twsrc%5Etfw">@RoamResearch</a> perhaps makes you 5% more productive in the short term. I realized today why this still matters a lot:</p>&#x2014; Rico Meinl (@rmeinl) <a href="https://twitter.com/rmeinl/status/1320877510586966017?ref_src=twsrc%5Etfw">October 26, 2020</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<figcaption>Discussed on Twitter.</figcaption></figure>]]></content:encoded></item><item><title><![CDATA[Embed Twitter Threads in Roam Research]]></title><description><![CDATA[ Paste a tweet url into Roam. The thread is then copied to your clipboard. Paste it into Roam via CMD+V (Mac) CTRL-V (Windows).]]></description><link>https://ricomnl.com/blog/embed-tweet-threads-in-roam-research/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f4c</guid><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Fri, 23 Oct 2020 18:50:23 GMT</pubDate><content:encoded><![CDATA[<h3></h3><p><a href="https://chrome.google.com/webstore/detail/scify/kedfefpmgjcfidhnabfodadhnlabpili">Get it here</a></p><p>How to use:<br>- Paste a tweet url into Roam. <br>- The thread is then copied to your clipboard. <br>- Paste it into Roam via CMD+V (Mac) CTRL-V (Windows).</p><figure class="kg-card kg-image-card"><img src="https://ricomnl.com/content/images/2020/10/roam-twitter.gif" class="kg-image" alt loading="lazy" width="800" height="374"></figure>]]></content:encoded></item><item><title><![CDATA[Recommender Systems: The Most Valuable Application of Machine Learning (Part 2)]]></title><description><![CDATA[Why Recommender Systems are the most valuable application of Machine Learning and how Machine Learning-driven Recommenders already drive almost every aspect of our lives.]]></description><link>https://ricomnl.com/blog/recommender-systems-the-most-valuable-application-of-machine-learning-part-2/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f4b</guid><category><![CDATA[recommender systems]]></category><category><![CDATA[machine learning]]></category><category><![CDATA[data science]]></category><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Sun, 04 Oct 2020 20:50:12 GMT</pubDate><content:encoded><![CDATA[<p>Why Recommender Systems are the most valuable application of Machine Learning and how Machine Learning-driven Recommenders already drive almost every aspect of our lives.</p><p><a href="https://towardsdatascience.com/recommender-systems-the-most-valuable-application-of-machine-learning-2bc6903c63ce">Read this article on Medium.</a></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*F2mBbZRHPXxa3cyg3z8esg.png" class="kg-image" alt loading="lazy"><figcaption>Recommender Systems already drive almost every aspect of our daily&#xA0;lives.</figcaption></figure><hr><p>This is the second part of the article published on 11 May. In the first part I covered:</p><ul><li>Business Value</li><li>Problem Formulation</li><li>Data</li><li>Algorithms</li></ul><p>In this second part I will cover the following topics:</p><ul><li>Evaluation Metrics</li><li>User Interface</li><li>Cold-start Problem</li><li>Exploration vs. Exploitation</li><li>The Future of Recommender Systems</li></ul><p>Throughout this article, I will continue to use examples of the companies that have built the most widely used systems over the last couple of years, including Airbnb, Amazon, Instagram, LinkedIn, Netflix, Spotify, Uber Eats, and YouTube.</p><hr><h3 id="evaluation-metrics">Evaluation Metrics</h3><p>Now that we have the algorithm for our Recommender System, we need to find a way to evaluate its performance. As with every Machine Learning model, there are two types of evaluation:</p><ol><li>Offline Evaluation</li><li>Online Evaluation</li></ol><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*HjqVASOg7qdIUEafmYK9rg.png" class="kg-image" alt loading="lazy"><figcaption>Offline/Online Testing Framework</figcaption></figure><p>Generally speaking, we can consider the Offline Evaluation metrics as <em>low-level</em> metrics, that are usually easily measurable. The most well-known example would be Netflix choosing to use <em>root mean squared error</em> (RMSE) as a proxy metric for their Netflix Prize Challenge. The Online Evaluation metrics are the <em>high-level</em> business metrics that are only measurable as soon as we ship our model into the real world and test it with real users. Some examples include customer retention, click-through rate, or user engagement.</p><h4 id="offline-evaluation">Offline Evaluation</h4><p>As most of the existing Recommender Systems consist of two stages (candidate generation and ranking), we need to pick the right metrics for each stage. For the candidate generation stage,<a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf" rel="noopener"> YouTube</a>, for instance, focuses on <strong>high precision</strong> so <em>&#x201C;out of all the videos that were pre-selected how many are relevant&#x201D;</em>. This makes sense given that in the first stage we want to filter for a smaller set of videos whilst making sure all of them are potentially relevant to the user. In the second stage, presenting a few &#x201C;best&#x201D; recommendations in a list requires a fine-level representation to distinguish relative importance among candidates with <strong>high recall </strong>(<em>&#x201C;how many of the relevant videos did we find&#x201D;</em>)<strong>.</strong></p><p>Often, most of the examples are using the standard evaluation metrics used in the Machine Learning community: from ranking measures, such as normalized discounted cumulative gain, mean reciprocal rank, or fraction of concordant pairs, to classification metrics including accuracy, precision, recall, or F-score.</p><p><a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/" rel="noopener">Instagram formulated</a> the optimization function of their final pass model a little different:</p><blockquote>We predict individual actions that people take on each piece of media, whether they&#x2019;re positive actions such as like and save, or negative actions such as &#x201C;See Fewer Posts Like This&#x201D; (SFPLT). We use a multi-task multi-label (MTML) neural network to predict these events.</blockquote><p>As appealing as offline experiments are, they have a major drawback: they assume that members would have behaved in the same way, for example, playing the same videos, if the new algorithm being evaluated had been used to generate the recommendations. That&#x2019;s why we need online evaluation to measure the actual impact our model has on the higher-level business metrics.</p><h4 id="online-evaluation">Online Evaluation</h4><p>The approach to be aware of here is A/B testing. There are many interesting and exhaustive articles/<a href="https://www.udacity.com/course/ab-testing--ud257" rel="noopener">courses</a> that cover this well, therefore I won&#x2019;t spend too much time on this. The only slight variation I have encountered is Netflix&#x2019;s approach called &#x201C;Consumer Data Science&#x201D; that you can<a href="https://netflixtechblog.com/how-we-determine-product-success-980f81f0047e" rel="noopener"> read about it here</a>.</p><p>The most popular high-level metrics that companies are measuring here are <em>Click-Through Rate</em> and <em>Engagement</em>. Uber Eats goes further here and designed a multi-objective tradeoff that<a href="https://eng.uber.com/uber-eats-recommending-marketplace/" rel="noopener"> captures multiple high-level metrics</a> to account for the overall health of their three-sided marketplace (among others: Marketplace Fairness, Gross Bookings, Reliability, Eater Happiness). In addition to medium-term engagement, Netflix focuses on member retention rates as their online tests can<a href="https://dl.acm.org/doi/pdf/10.1145/2843948" rel="noopener"> range from between 2&#x2013;6 months</a>.</p><p>YouTube famously prioritizes watch-time over click-through rate. They even<a href="https://youtube-creators.googleblog.com/2012/08/youtube-now-why-we-focus-on-watch-time.html" rel="noopener"> wrote an article, explaining why</a>:</p><blockquote>Ranking by click-through rate often promotes deceptive videos that the user does not complete (&#x201C;clickbait&#x201D;) whereas watch time better captures engagement</blockquote><h4 id="evaluating-embeddings">Evaluating Embeddings</h4><p>As covered in the section on algorithms, embeddings are a crucial part of the candidate generation stage. However, unlike with a classification or regression model, it&#x2019;s<a href="https://blog.twitter.com/engineering/en_us/topics/insights/2018/embeddingsattwitter.html" rel="noopener"> notoriously difficult to measure the quality of an embedding</a> given that they are often being used in different contexts. A sanity check we can perform is to map the high-dimensional embedding vector into a lower-dimensional representation (via PCA, t-SNE, or UMAP) or apply clustering techniques such as k-means and then visualize the results. Airbnb did this with their<a href="https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e"> listing embeddings</a> to confirm that listings from similar locations are clustered together.</p><hr><h3 id="user-interface">User Interface</h3><p>For a Machine Learning Engineer or Data Scientist, the probably most overlooked aspect of the equation is the User Interface. The problem is that if your UI does not contain the needed components to showcase the recommendations or showcases them in the wrong context, the feedback loop is inherently flawed.</p><p>Let&#x2019;s take Linkedin as an example to illustrate this. If I&#x2019;m browsing through people&#x2019;s profiles, on the right-hand side of the screen I see recommendations for <em>similar people</em>. When I&#x2019;m browsing through companies, I see recommendations for <em>similar companies</em>. The recommendations are adapted to my current goals and context and encourage me to keep browsing the site. If the <em>similar companies</em> recommendations would appear on a person&#x2019;s profile, I would probably be less encouraged to click on their profile as it is not what I am currently looking for.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*i-3rNsokIOjyoBRgeiOhvA.png" class="kg-image" alt loading="lazy"><figcaption>Similar User Recommendations on&#xA0;Linkedin</figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*A1NpLcH0HG0RrBB3oOTVKA.png" class="kg-image" alt loading="lazy"><figcaption>Similar Companies Recommendations on&#xA0;Linkedin</figcaption></figure><p>You can build the best Recommender System in the world, however, if your interface is not designed to serve the user&#x2019;s needs and wants, no one will appreciate the recommendations. In fact, the User Interface challenge is so crucial that<a href="https://netflixtechblog.com/learning-a-personalized-homepage-aa8ec670359a" rel="noopener"> Netflix turned all components on their website into dynamic ones</a> which are assembled by a Machine Learning algorithm to best reflect the goals of a user.</p><p>Spotify followed that model and <a href="https://labs.spotify.com/2020/01/16/for-your-ears-only-personalizing-spotify-home-with-machine-learning/" rel="noopener">adopted a similar layout for their home screen design</a>, as can be seen below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*ZopV25d9-x1Gma5-YIbkIA.png" class="kg-image" alt loading="lazy"><figcaption>Personalizing Spotify Home with Machine Learning (Source:&#xA0;<a href="https://www.oreilly.com/radar/personalization-of-spotify-home-and-tensorflow/" data-href="https://www.oreilly.com/radar/personalization-of-spotify-home-and-tensorflow/" class="markup--anchor markup--figure-anchor" rel="noopener" target="_blank">Spotify</a>)</figcaption></figure><p>This is an ongoing area where there is still a lot of experimentation. As an example, YouTube recently changed their homepage interface to enable users to narrow down the recommendations for different topics:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*Edcu3SHtcLuF5aZqI1N_rw.png" class="kg-image" alt loading="lazy"><figcaption>New YouTube Home&#xA0;Page</figcaption></figure><hr><h3 id="cold-start-problem">Cold-start Problem</h3><p>The<a href="https://en.wikipedia.org/wiki/Cold_start_%28computing%29" rel="noopener"> cold-start problem</a> is often seen in Recommender Systems because methods such as collaborative filtering rely heavily on past user-item interactions. Companies are confronted with the cold-start problem in two ways: user and item cold-start. Depending on the type of platform, either one of them is more prevalent.</p><h4 id="user-cold-start">User cold-start</h4><p>Imagine a new member signs up for Netflix. At this point, the company doesn&#x2019;t know anything about the new members&#x2019; preferences. How does the company keep her engaged by providing great recommendations?</p><p>In Netflix&#x2019;s case, new members get a one-month free trial, during which cancellation rates are the highest while they decrease quickly after that. This is why any improvements to the cold-start problem present an immense business opportunity for Netflix, in order to increase engagement and retention in those first 30 days. Today, their members are given a survey during the sign-up process, during which they are asked to select videos from an algorithmically populated set that is then used as an input into all of their algorithms.</p><h4 id="item-cold-start">Item cold-start</h4><p>Companies face a similar challenge when new items or content are added to the catalog. Platforms like Netflix or Prime Video hold an existing catalog of media items that changes less frequently (it takes time to create movies or series!), therefore they struggle less with this. On the contrary, on Airbnb or Zillow, new listings are created every day and at that point, they do not have an embedding as they were not present during the training process. Airbnb solves this the following way:</p><blockquote>To create embeddings for a new listing we find 3 geographically closest listings that do have embeddings, and are of same listing type and price range as the new listing, and calculate their mean vector.</blockquote><p>For Zillow, this is especially critical as some of the new home listings might only be on the site for a couple of days. They<a href="https://www.zillow.com/tech/embedding-similar-home-recommendation/" rel="noopener"> creatively solved this problem</a> by creating a neural network-based mapping function from the content space to the embedding space, which is guided by the engagement data from users during the learning phase. This allows them to map a new home listing to the learned embedding space just by using its features.</p><hr><h3 id="exploration-vs-exploitation">Exploration vs. Exploitation</h3><p>The concept of exploration/exploitation can be seen as the balancing of new content with well-established content. I was going to illustrate this concept myself, while I found this great excerpt that hits it right out of the ballpark:</p><blockquote>&#x201C;Imagine you&#x2019;ve just entered an ice cream shop. You now face a crucial decision&#x200A;&#x2014;&#x200A;out of about 30 flavors you need to choose only one!<br>You can go with two strategies: either go with that favorite flavor of yours that you already know is the best; or explore new flavors you never tried before, and maybe find a new best flavor.<br>These two strategies&#x200A;&#x2014;&#x200A;exploitation and exploration&#x200A;&#x2014;&#x200A;can also be used when recommending content. We can either exploit items that have high click-through rate with high certainty&#x200A;&#x2014;&#x200A;maybe because these items have been shown thousands of times to similar users, or we can explore new items we haven&#x2019;t shown to many users in the past. Incorporating exploration into your recommendation strategy is crucial&#x200A;&#x2014;&#x200A;without it, new items don&#x2019;t stand a chance against older, more familiar ones.&#x201D;</blockquote><p><em>(Source: </em><a href="https://anotherdatum.com/exploration-exploitation.html" rel="noopener"><em>Recommender Systems: Exploring the Unknown Using Uncertainty</em></a><em>)</em></p><p>This tradeoff is a typical reinforcement learning problem and a commonly used approach is the multi-armed bandit algorithm. This is used by Spotify for the<a href="http://sigir.org/afirm2019/slides/16.%20Friday%20-%20Music%20Recommendation%20at%20Spotify%20-%20Ben%20Carterette.pdf" rel="noopener"> personalization of each users&#x2019; home page</a> as well as Uber Eats for personalized recommendations<a href="https://eng.uber.com/uber-eats-recommending-marketplace/" rel="noopener"> optimized for their three-sided marketplace</a>. Two scientists at Netflix gave a great talk about how they are<a href="https://www.youtube.com/watch?v=kY-BCNHd_dM" rel="noopener"> using the MAB framework for movie recommendations</a>.</p><p>Though I should mention that this is, by no means, the final solution to this problem, it seems to work for Netflix, Spotify, and Uber Eats, right?</p><p>Yes. But!</p><p>Netflix has roughly 160 million users and about 6.000 movies/shows. Spotify has about 230 million users and 50 million songs + 500.000 podcasts.</p><p>Twitter&#x2019;s 330 million active users generate more than <strong><em>500 million tweets</em></strong> per day (350.000 tweets per minute, 6.000 tweets per second). And then there&#x2019;s YouTube, with its <strong><em>300 hours of videos</em></strong> uploaded every minute!</p><p>The exploration space in the two latter cases is a <em>little</em> bit bigger than in the case of Netflix or Uber Eats, which makes the problem a lot more challenging.</p><hr><h3 id="the-future-of-recommender-systems">The Future of Recommender Systems</h3><p>This is the end of my little survey over Recommender Systems. As we have observed, Recommender Systems already guide so many aspects of our life. All the algorithms we covered over the course of these two articles are competing for our attention every day. And after all, they are all maximizing the time spent on their platform. As I illustrated in the section on Evaluation methods, most of the algorithms are optimizing for something like Click-through rate, engagement, or in YouTube&#x2019;s case: watch time.</p><p><strong><em>What does that mean for us as a consumer?</em></strong></p><p>What it means is, that we are not in control of our desires anymore. While this might sound poetic, think about it. Let&#x2019;s look at YouTube; we all have goals when coming to the site. We might want to listen to music, watch something funny, or learn something new. But all the content that is recommended to us (either through the Home Page recommendations, Search Ranking, or Watch Next) is optimized to keep us on the site for longer.</p><p>Lex Fridman and Fran&#xE7;ois Chollet had a<a href="https://www.youtube.com/watch?v=Bo8MY4JpiXE" rel="noopener"> great conversation about this</a> on the Artificial Intelligence Podcast. Instead of choosing the metric to optimize for, what if companies would put the user in charge of choosing their own objective function? What if they would take the personal goals of the user&#x2019;s profile into account and ask the user, what do you want to achieve? Right now, this technology is almost like our boss and we&#x2019;re not in control of it. Wouldn&#x2019;t it be incredible to leverage the power of Recommender Systems to be more like a mentor, a coach, or an assistant?</p><p>Imagine, as a consumer, you could ask YouTube to optimize the content to maximize learning outcomes. The technology is certainly already there. The challenge would really lie in aligning this with the existing business models and designing the right interface to empower the user to make that choice, and also to change as their goals evolve. With its new interface, YouTube is perhaps already taking baby-steps in that direction by putting the user in charge to select categories that she wants to see recommendations for. But this is just the beginning.</p><p>Could this be the way forward or is this just a consumer&#x2019;s dream?</p><hr><p><strong><em>Resources</em></strong></p><p><a href="https://www.youtube.com/watch?v=Bo8MY4JpiXE" rel="noopener">Fran&#xE7;ois Chollet: Keras, Deep Learning, and the Progress of AI | Artificial Intelligence Podcast</a></p><p><a href="https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e" rel="noopener">Airbnb&#x200A;&#x2014;&#x200A;Listing Embeddings in Search Ranking</a></p><p><a href="https://medium.com/airbnb-engineering/machine-learning-powered-search-ranking-of-airbnb-experiences-110b4b1a0789" rel="noopener">Airbnb&#x200A;&#x2014;&#x200A;Machine Learning-Powered Search Ranking of Airbnb Experiences</a></p><p><a href="https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf" rel="noopener nofollow noopener noopener">Amazon&#x200A;&#x2014;&#x200A;Amazon.com Recommendations Item-to-Item Collaborative Filtering</a></p><p><a href="https://www.amazon.science/the-history-of-amazons-recommendation-algorithm" rel="noopener nofollow noopener noopener">Amazon&#x200A;&#x2014;&#x200A;The history of Amazon&#x2019;s recommendation algorithm</a></p><p><a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/" rel="noopener nofollow noopener noopener">Instagram&#x200A;&#x2014;&#x200A;Powered by AI: Instagram&#x2019;s Explore recommender system</a></p><p><a href="https://ls13-www.cs.tu-dortmund.de/homepage/rsweb2014/papers/rsweb2014_submission_3.pdf" rel="noopener nofollow noopener noopener">LinkedIn&#x200A;&#x2014;&#x200A;The Browsemaps: Collaborative Filtering at LinkedIn</a></p><p><a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429" rel="noopener nofollow noopener noopener">Netflix&#x200A;&#x2014;&#x200A;Netflix Recommendations: Beyond the 5 stars (Part 1)</a></p><p><a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-2-d9b96aa399f5" rel="noopener nofollow noopener noopener">Netflix&#x200A;&#x2014;&#x200A;Netflix Recommendations: Beyond the 5 stars (Part 2)</a></p><p><a href="https://dl.acm.org/doi/pdf/10.1145/2843948" rel="noopener nofollow noopener noopener">Netflix&#x200A;&#x2014;&#x200A;The Netflix Recommender System: Algorithms, Business Value, and Innovation</a></p><p><a href="https://netflixtechblog.com/learning-a-personalized-homepage-aa8ec670359a" rel="noopener nofollow noopener noopener">Netflix&#x200A;&#x2014;&#x200A;Learning a Personalized Homepage</a></p><p><a href="https://pdfs.semanticscholar.org/f635/6c70452b3f56dc1ae07b4649a80239afb1b6.pdf" rel="noopener nofollow noopener noopener">Pandora&#x200A;&#x2014;&#x200A;Pandora&#x2019;s Music Recommender</a></p><p><a href="https://medium.com/s/story/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe" rel="noopener">Spotify&#x200A;&#x2014;&#x200A;Discover Weekly: How Does Spotify Know You So Well?</a></p><p><a href="https://labs.spotify.com/2020/01/16/for-your-ears-only-personalizing-spotify-home-with-machine-learning/" rel="noopener nofollow noopener noopener">Spotify&#x200A;&#x2014;&#x200A;For Your Ears Only: Personalizing Spotify Home with Machine Learning</a></p><p><a href="https://www.slideshare.net/MrChrisJohnson/from-idea-to-execution-spotifys-discover-weekly/31-1_0_0_0_1" rel="noopener nofollow noopener noopener">Spotify&#x200A;&#x2014;&#x200A;From Idea to Execution: Spotify&#x2019;s Discover Weekly</a></p><p><a href="https://blog.twitter.com/engineering/en_us/topics/insights/2018/embeddingsattwitter.html" rel="noopener nofollow noopener noopener">Twitter&#x200A;&#x2014;&#x200A;Embeddings@Twitter</a></p><p><a href="https://eng.uber.com/uber-eats-recommending-marketplace/" rel="noopener nofollow noopener noopener">Uber Eats&#x200A;&#x2014;&#x200A;Food Discovery with Uber Eats: Recommending for the Marketplace</a></p><p><a href="https://eng.uber.com/uber-eats-graph-learning/" rel="noopener nofollow noopener noopener">Uber Eats&#x200A;&#x2014;&#x200A;Food Discovery with Uber Eats: Using Graph Learning to Power Recommendations</a></p><p><a href="https://www.inf.unibz.it/~ricci/ISR/papers/p293-davidson.pdf" rel="noopener nofollow noopener noopener">YouTube&#x200A;&#x2014;&#x200A;The YouTube Video Recommendation System</a></p><p><a href="https://arxiv.org/pdf/1409.2944.pdf" rel="noopener nofollow noopener noopener">YouTube&#x200A;&#x2014;&#x200A;Collaborative Deep Learning for Recommender Systems</a></p><p><a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf" rel="noopener nofollow noopener noopener">YouTube&#x200A;&#x2014;&#x200A;Deep Neural Networks for YouTube Recommendations</a></p><p><a href="https://www.zillow.com/tech/embedding-similar-home-recommendation/" rel="noopener nofollow noopener noopener">Zillow&#x200A;&#x2014;&#x200A;Home Embeddings for Similar Home Recommendations</a></p><p><a href="https://www.youtube.com/watch?v=giIXNoiqO_U&amp;list=PL-6SiIrhTAi6x4Oq28s7yy94ubLzVXabj" rel="noopener nofollow noopener noopener">Andrew Ng&#x2019;s Machine Learning Course (Recommender Systems)</a></p><p><a href="https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture" rel="noopener nofollow noopener noopener">Google&#x2019;s Machine Learning Crash Course&#x200A;&#x2014;&#x200A;Embeddings</a></p><hr>]]></content:encoded></item><item><title><![CDATA[Recommender Systems: The Most Valuable Application of Machine Learning (Part 1)]]></title><description><![CDATA[Why Recommender Systems are the most valuable application of Machine Learning and how Machine Learning-driven Recommenders already drive almost every aspect of our lives.]]></description><link>https://ricomnl.com/blog/recommender-systems-the-most-valuable-application-of-machine-learning-part-1/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f4a</guid><category><![CDATA[recommender systems]]></category><category><![CDATA[machine learning]]></category><category><![CDATA[data science]]></category><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Sun, 04 Oct 2020 20:46:09 GMT</pubDate><content:encoded><![CDATA[<p>Why Recommender Systems are the most valuable application of Machine Learning and how Machine Learning-driven Recommenders already drive almost every aspect of our lives.</p><p><a href="https://towardsdatascience.com/recommender-systems-the-most-valuable-application-of-machine-learning-part-1-f96ecbc4b7f5">Read this article on Medium.</a></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*F2mBbZRHPXxa3cyg3z8esg.png" class="kg-image" alt loading="lazy"><figcaption>Recommender Systems already drive almost every aspect of our daily&#xA0;lives.</figcaption></figure><hr><p>Look back at your week: a Machine Learning algorithm determined what songs you might like to listen to, what food to order online, what posts you see on your favorite social networks, as well as the next person you may want to connect with, what series or movies you would like to watch, etc&#x2026;</p><p>Machine Learning already guides so many aspects of our life without us necessarily being conscious of it. All of the applications mentioned above are driven by one type of algorithm: recommender systems.</p><p>In this article, I will explore and dive deeper into all the aspects that come into play to build a successful recommender system. The length of this article got a little out of hand so I decided to split it into two parts. This first part will cover:</p><ul><li>Business Value</li><li>Problem Formulation</li><li>Data</li><li>Algorithms</li></ul><p>The Second Part will cover:</p><ul><li>Evaluation Metrics</li><li>User Interface</li><li>Cold-start Problem</li><li>Exploration vs. Exploitation</li><li>The Future of Recommender Systems</li></ul><p>Throughout this article, I will be using examples of the companies that have built the most widely used systems over the last couple of years, including Airbnb, Amazon, Instagram, LinkedIn, Netflix, Spotify, Uber Eats, and YouTube.</p><hr><h3 id="business-value">Business Value</h3><p>Harvard Business Review made a strong statement by calling Recommenders the <a href="https://hbr.org/2017/08/great-digital-companies-build-great-recommendation-engines" rel="noopener">single most important algorithmic distinction between &#x201C;born digital&#x201D; enterprises and legacy companies</a>. HBR also described the virtuous business cycle these can generate: the more people use a company&#x2019;s Recommender System, the more valuable they become and the more valuable they become, the more people use them.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*U_eUe0NBy7uQPA10SFY-VQ.png" class="kg-image" alt loading="lazy"><figcaption>The Virtuous Business Cycle of Recommender Systems (source: <a href="https://www.mdpi.com/2199-8531/5/3/44/htm" data-href="https://www.mdpi.com/2199-8531/5/3/44/htm" class="markup--anchor markup--figure-anchor" rel="noopener" target="_blank">MDPI</a>,&#xA0;CC)</figcaption></figure><p>We are encouraged to look at recommender systems, not as a way to sell more online, but rather to see it as a renewable resource for <em>relentlessly improving customer insights and our own insights as well</em>. If we look at the illustration above, we can see that many legacy companies also have tons of users and therefore tons of data. The reason their virtuous cycle has not picked up as much as the ones off Amazon, Netflix or Spotify is because of the lack of knowledge on how to convert their user data into actionable insights, which can then be used to improve their product or services.</p><p>Looking at Netflix, for example, shows how crucial this is, as 80% of what people watch comes from some sort of recommendation. In <a href="https://dl.acm.org/doi/pdf/10.1145/2843948" rel="noopener">2015, one of their papers quoted</a>:</p><blockquote>&#x201C;We think the combined effect of personalization and recommendations save us more than $1B per year.&#x201D;</blockquote><p>If we look at Amazon, <a href="https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers" rel="noopener">35% of what customers purchase at Amazon</a> comes from product recommendations and at Airbnb, Search Ranking and Similar Listings drive <a href="https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e">99% of all booking conversions</a>.</p><hr><h3 id="problem-formulation">Problem Formulation</h3><p>Now that we&#x2019;ve seen the immense value, companies can gain from Recommender Systems, let&#x2019;s look at the type of challenges that can be solved by them. Generally speaking, tech companies are trying to recommend the <strong>most relevant content</strong> to their users. That could mean:</p><ul><li>similar home listings (Airbnb, Zillow)</li><li>relevant media, e.g. photos, videos and stories (Instagram)</li><li>relevant series and movies (Netflix, Amazon Prime Video)</li><li>relevant songs and podcasts (Spotify)</li><li>relevant videos (YouTube)</li><li>similar users, posts (LinkedIn, Twitter, Instagram)</li><li>relevant dishes and restaurants (Uber Eats)</li></ul><p>The formulation of the problem is critical here. Most of the time, companies want to recommend content that users are most likely to enjoy in the future. The reformulation of this problem, as well as the algorithmic changes from recommending &#x201C;what users are most likely to watch&#x201D; to &#x201C;what users are most likely to watch <em>in the future</em>&#x201D; <a href="https://www.amazon.science/the-history-of-amazons-recommendation-algorithm" rel="noopener">allowed Amazon PrimeVideo to gain a 2x improvement</a>, a &#x201C;once-in-a-decade leap&#x201D; for their movie Recommender System.</p><blockquote>&#x201C;Amazon researchers found that using neural networks to generate movie recommendations worked much better when they sorted the input data chronologically and used it to predict future movie preferences over a short (one- to two-week) period.&#x201D;</blockquote><hr><h3 id="data">Data</h3><p>Recommender Systems usually take two types of data as input:</p><ul><li><strong>User Interaction Data </strong>(Implicit/Explicit)</li><li><strong>Item Data</strong> (Features)</li></ul><p>The &#x201C;classic&#x201D;, and still widely used approach to recommender systems based on <strong>collaborative filtering</strong> (used by <a href="https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf" rel="noopener">Amazon</a>, <a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429" rel="noopener">Netflix</a>, <a href="https://ls13-www.cs.tu-dortmund.de/homepage/rsweb2014/papers/rsweb2014_submission_3.pdf" rel="noopener">LinkedIn</a>, <a href="https://medium.com/s/story/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe">Spotify</a> and <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf" rel="noopener">YouTube</a>) uses either User-User or Item-Item relationships to find similar content. I&#x2019;m not going to go deeper into the inner workings of this, as there are a lot of articles on that topic&#x200A;&#x2014;&#x200A;<a href="https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/" rel="noopener">like this one</a>&#x200A;&#x2014;&#x200A;that explain this concept well.</p><p>The <em>user interaction data</em> is the data we gather from the weblogs and can be divided into two groups:</p><p><strong><em>Explicit data</em></strong>: explicit input from our users (e.g. movie ratings, search logs, liked, commented, watched, favorited, etc.)</p><p><strong><em>Implicit data</em></strong>: information that is not provided intentionally but gathered from available data streams (e.g. search history, order history, clicked on, accounts interacted with, etc.)</p><p>The <em>item data</em> consists mainly of an item&#x2019;s features. In YouTube&#x2019;s case that would be a video&#x2019;s metadata such as title and description. For Zillow, this could be a home&#x2019;s Zip Code, City Region, Price, or Number of Bedrooms for instance.</p><p>Other data sources could be <strong><em>external data</em></strong> (for example, Netflix might <a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-2-d9b96aa399f5" rel="noopener">add external item data features</a> such as box office performance or critic reviews) or <strong>expert-generated data</strong> (Pandora&#x2019;s <a href="https://pdfs.semanticscholar.org/f635/6c70452b3f56dc1ae07b4649a80239afb1b6.pdf" rel="noopener">Music Genome Project</a> uses human input to apply values for each song in each of approximately 400 musical attributes).</p><p>A key insight here is that obviously, having more data about your users will inevitably lead to better model results (if applied correctly), however, as Airbnb shows in their <a href="https://medium.com/airbnb-engineering/machine-learning-powered-search-ranking-of-airbnb-experiences-110b4b1a0789">3-part journey to building a Ranking Model for Airbnb Experiences</a> you can already achieve quite a lot with lesser data: the team at Airbnb already improved bookings by +13% with just 500 experiences and 50k training data size.</p><blockquote>&#x201C;The main take-away is: <em>Don&#x2019;t wait until you have big data, you can do quite a bit with small data to help grow and improve your business.</em>&#x201D;</blockquote><hr><h3 id="algorithms">Algorithms</h3><p>Often, we associate Recommender Systems with just collaborative filtering. That&#x2019;s fair, as in the past this has been the go-to method for a lot of the companies that have deployed successful systems in practice. Amazon was probably the first company to leverage <a href="https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf" rel="noopener">item-to-item collaborative filtering</a>. When they first released the inner workings of their method in a paper in 2003, the system had already been in use for six years.</p><p>Then, in 2006 Netflix followed suit with its famous Netflix Price Challenge which offered $1 million to whoever improved the accuracy of their existing system called <em>Cinematch</em> by 10%. Collaborative filtering was also a part of the early Recommender Systems at <a href="https://medium.com/s/story/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe">Spotify</a> and <a href="https://arxiv.org/pdf/1409.2944.pdf" rel="noopener">YouTube</a>. LinkedIn even developed a horizontal collaborative filtering infrastructure, <a href="https://ls13-www.cs.tu-dortmund.de/homepage/rsweb2014/papers/rsweb2014_submission_3.pdf" rel="noopener">known as Browsemaps</a>. This platform enables rapid development, deployment, and computation of collaborative filtering recommendations for almost any use case on LinkedIn.</p><p>If you want to know more about collaborative filtering, I would recommend checking out <a href="https://www.youtube.com/watch?v=giIXNoiqO_U&amp;list=PL-6SiIrhTAi6x4Oq28s7yy94ubLzVXabj" rel="noopener">Section 16 of Andrew Ng&#x2019;s Machine Learning course on Coursera</a> where he goes deeper into the math behind it.</p><p>Now, I would like to take a step back and generalize the concept of a Recommender System. While many companies used to rely on collaborative filtering, today there are a lot of other different algorithms at play that either complement or even replaced the collaborative filtering approach. Netflix went through this change when they shifted from a DVD shipping to a streaming business. As described in one of their papers:</p><blockquote>&#x201C;We indeed relied on such an algorithm heavily when our main business was shipping DVDs by mail, partly because in that context, a star rating was the main feedback that we received that a member had actually watched the video. [&#x2026;] But the days when stars and DVDs were the focus of recommendations at Netflix have long passed. [&#x2026;] Now, our recommender system consists of a variety of algorithms that collectively define the Netflix experience, most of which come together on the Netflix homepage.&#x201D;</blockquote><p>If we zoom out a little bit and look at Recommender Systems more broadly we find that they essentially consist of two parts:</p><ol><li><strong>Candidate Generation</strong></li><li><strong>Ranking</strong></li></ol><p>I am going to use <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf" rel="noopener">YouTube&#x2019;s Recommender System</a> as an example below as they provided a good visualization, but that very same concept is applied by Instagram for <a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/" rel="noopener">recommendations in &#x201C;Instagram Explore&#x201D;</a>, by Uber Eats in their <a href="https://eng.uber.com/uber-eats-graph-learning/" rel="noopener">Dish and Restaurant Recommender System</a>, by Netflix for their <a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-2-d9b96aa399f5" rel="noopener">movie recommendations </a>and probably many other companies.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*6LG9QN2XEtK6UCOZG4cavA.png" class="kg-image" alt loading="lazy"><figcaption>2-stage Recommender System (inspired by&#xA0;<a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf" data-href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf" class="markup--anchor markup--figure-anchor" rel="noopener" target="_blank">YouTube</a>)</figcaption></figure><p>According to Netflix, the goal of Recommender Systems is to present a number of attractive items for a person to choose from. This is usually accomplished by selecting some items (<em>candidate generation</em>) and sorting them (<em>ranking</em>) in the order of expected enjoyment (or utility).</p><p>Let&#x2019;s further investigate the two stages:</p><h4 id="candidate-generation">Candidate Generation</h4><p>In this stage, we want to source the relevant candidates that could be eligible to show to our users. Here, we are working with the whole catalog of items so it can be quite large (YouTube and Instagram are great examples here). The key to doing this is entity embeddings. What are entity embeddings?</p><p>An entity embedding is a mathematical vector representation of an entity such that its dimensions might represent certain properties. Twitter has a great example of this in a <a href="https://blog.twitter.com/engineering/en_us/topics/insights/2018/embeddingsattwitter.html" rel="noopener">blog post about Embeddings@Twitter</a>: say we have two NBA players (Stephen Curry and LeBron James) and two musicians (Kendrick Lamar and Bruno Mars). We expect the distance between the embeddings of the NBA players to be smaller than the distance between the embeddings of a player and a musician. We can calculate the distance between two embeddings using the formula for Euclidean distance.</p><p><strong><em>How do we come up with these embeddings?</em></strong></p><p>Well, one way to do this would be collaborative filtering. We have our items and our users. If we put them in a matrix (for the example of Spotify) it could look like this:</p><figure class="kg-card kg-image-card"><img src="https://cdn-images-1.medium.com/max/1600/1*dkvXGVpAlK25F-WznatMMA.png" class="kg-image" alt loading="lazy"></figure><p>After applying the <a href="https://www.slideshare.net/MrChrisJohnson/from-idea-to-execution-spotifys-discover-weekly/31-1_0_0_0_1" rel="noopener">matrix factorization algorithm</a>, we end up with user vectors and song vectors. To find out which users&#x2019; tastes are most similar to another&#x2019;s, collaborative filtering compares one users&#x2019; vector with all of the other users&#x2019; vectors, ultimately spitting out which users are the closest matches. The same goes for the Y vector, <em>songs</em>: you can compare a single song&#x2019;s vector with all the others, and find out which songs are most similar to the one in question.</p><p>Another way to do this takes inspiration from applications in the domain of Natural Language Processing. Researchers generalized the <a href="https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf" rel="noopener">word2vec algorithm</a>, developed by Google in the early 2010s to all entities appearing in a similar context. In word2vec, the networks are trained by directly taking into account the word order and their co-occurrence, based on the assumption that words frequently appearing together in the sentences also share more statistical dependence. As Airbnb describes, in their <a href="https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e">blog post about creating Listing Embeddings</a>:</p><blockquote>More recently, the concept of embeddings has been extended beyond word representations to other applications outside of NLP domain. Researchers from the Web Search, E-commerce and Marketplace domains have realized that just like one can train word embeddings by treating a sequence of words in a sentence as context, the same can be done for training embeddings of user actions by treating sequence of user actions as context. Examples include learning representations of <a href="https://arxiv.org/pdf/1606.07154.pdf" rel="noopener nofollow noopener noopener noopener">items that were clicked or purchased</a> or <a href="https://arxiv.org/pdf/1607.01869.pdf" rel="noopener nofollow noopener noopener noopener">queries and ads that were clicked</a>. These embeddings have subsequently been leveraged for a variety of recommendations on the Web.</blockquote><p>Apart from Airbnb, this concept is used by Instagram (IG2Vec) to<a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/" rel="noopener"> learn account embeddings</a>, by YouTube to <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf" rel="noopener">learn video embeddings</a> and by Zillow to <a href="https://www.zillow.com/tech/embedding-similar-home-recommendation/" rel="noopener">learn categorical embeddings</a>.</p><p>Another, more novel approach to this is called Graph Learning and it is <a href="https://eng.uber.com/uber-eats-graph-learning/" rel="noopener">used by Uber Eats for their dish and restaurant embeddings</a>. They represent each of their dishes and restaurant in a separate graph and apply the <a href="http://snap.stanford.edu/graphsage/" rel="noopener">GraphSAGE algorithm</a> to obtain the representations (embeddings) of the respective nodes.</p><p>And last but not least, we can also learn an embedding as part of the neural network for our target task. This approach gets you an embedding well customized for your particular system, but may take longer than training the embedding separately. The <a href="https://keras.io/api/layers/core_layers/embedding/" rel="noopener">Keras Embedding Layer</a> would be one way to achieve this. Google covers this well as part of their <a href="https://developers.google.com/machine-learning/crash-course/embeddings/obtaining-embeddings" rel="noopener">Machine Learning Crash Course.</a></p><p>Once we have this vectorial representation of our items we can simply use Nearest Neighbour Search to find our potential candidates. <br><a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/" rel="noopener">Instagram, for example</a>, defines a couple of seed accounts (accounts that people have interacted with in the past) and uses their IG2Vec account embeddings to find similar accounts that are like those. Based on these accounts, they are able to find the media that these accounts posted or engaged with. By doing that, they are able to filter billions of media items down to a couple thousand and then sample 500 candidates from the pool and send those candidates downstream to the ranking stage.</p><p>This phase can also be guided by business rules or just user input (the more information we have the more specific we can be). As Uber Eats mentions <a href="https://eng.uber.com/uber-eats-graph-learning/" rel="noopener">in one of their blog posts</a>, for instance, pre-filtering can be based on factors such as geographical location.</p><p>So, to summarize:</p><p><em>In the candidate generation (or sourcing) phase, we filter our whole content catalog for a smaller subset of items that our users might be interested in. To do this we need to map our items into a mathematical representation called embeddings so we can use a similarity function to find the most similar items in space. There are several ways to achieve this. Three of them being collaborative filtering, word2vec for entities, and graph learning.</em></p><h4 id="ranking">Ranking</h4><p>Let&#x2019;s loop back to the case of Instagram. After the candidate generation stage, we have about 500 media items that are potentially relevant and that we could show to a user in their &#x201C;Explore&#x201D; feed. <br>But which ones are going to be the <strong>most relevant</strong>?</p><p>Because, after all, there are only 25 spots on the first page of the &#x201C;Explore&#x201D; section. And if the first items suck, the user is not going to be impressed nor intrigued to keep browsing. Netflix&#x2019;s and Amazon PrimeVideo&#x2019;s web interface <a href="https://www.amazon.science/the-history-of-amazons-recommendation-algorithm" rel="noopener">shows only the top 6 recommendations on the first page</a> associated with each title in its catalog. <a href="https://www.slideshare.net/MrChrisJohnson/from-idea-to-execution-spotifys-discover-weekly/31-1_0_0_0_1" rel="noopener">Spotify&#x2019;s Discover Weekly</a> Playlist contains only 30 songs. <br>Also, all of this is subject to the users&#x2019; device. Smartphones, of course, allowing for less space for relevant recommendations than a web browser.</p><p>&#x201C;There are many ways one could construct a ranking function ranging from simple scoring methods, to pairwise preferences, to optimization over the entire ranking. If we were to formulate this as a Machine Learning problem, we could select positive and negative examples from our historical data and let a Machine Learning algorithm learn the weights that optimize our goal. This family of Machine Learning problems is known as &#x201C;<a href="http://en.wikipedia.org/wiki/Learning_to_rank" rel="noopener">Learning to rank</a>&#x201D; and is central to application scenarios such as search engines or ad targeting. In the ranking stage, we are not aiming for our items to have a global notion of <em>relevance</em>, but rather look for ways of optimizing a personalized model&#x201D; <em>(Extract from </em><a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-2-d9b96aa399f5" rel="noopener"><em>Netflix Blog Post</em></a><em>).</em></p><p>To accomplish this, <a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/" rel="noopener">Instagram uses a three-stage ranking infrastructure</a> to help balance the trade-offs between ranking relevance and computation efficiency. In the <a href="https://eng.uber.com/uber-eats-graph-learning/" rel="noopener">case of Uber Eats</a>, their personalized ranking system is &#x201C;a fully-fledged ML model that ranks the pre-filtered dish and restaurant candidates based on additional contextual information, such as the day, time, and current location of the user when they open the Uber Eats app&#x201D;. In general, the level of complexity for your model really depends on the size of your feature space. Many supervised classification methods can be used for ranking. Typical choices include Logistic Regression, Support Vector Machines, Neural Networks, or Decision Tree-based methods such as Gradient Boosted Decision Trees (GBDT). On the other hand, a great number of algorithms specifically designed for learning to rank have appeared in recent years such as RankSVM or RankBoost.</p><p>To summarise:</p><p><em>After selecting initial candidates for our recommendations, in the ranking stage, we need to design a ranking function that ranks items by their relevance. This can be formulated as a Machine Learning problem, and the goal here is to optimize a personalized model for each user. This step is important because in most interfaces we have limited space to recommend items so we need to make the best use of that space by putting the most relevant items at the very top.</em></p><h4 id="baseline">Baseline</h4><p>As for every Machine Learning algorithm, we need a good baseline to measure the improvement of any change. A good baseline to start with is just to use the <a href="https://www.amazon.science/the-history-of-amazons-recommendation-algorithm" rel="noopener">most popular items in the catalog, as described by Amazon</a>:</p><blockquote>&#x201C;In the recommendations world, there&#x2019;s a cardinal rule. If I know nothing about you, then the best things to recommend to you are the most popular things in the world.&#x201D;</blockquote><p>However, if you don&#x2019;t even know what is most popular, because you just launched a new product or new items&#x200A;&#x2014;&#x200A;as was the case with <a href="https://medium.com/airbnb-engineering/machine-learning-powered-search-ranking-of-airbnb-experiences-110b4b1a0789">Airbnb Experiences</a>&#x200A;&#x2014;&#x200A;you can just randomly re-rank the item collection daily until you have gathered enough data for your first model.</p><hr><p>That&#x2019;s a wrap for Part 1 of this series. There are a couple of points I wanted to emphasize in this article:</p><ul><li>Recommender Systems are the most valuable application of Machine Learning as they are able to create a Virtuous Feedback Loop: the more people use a company&#x2019;s Recommender System, the more valuable they become and the more valuable they become, the more people use them. Once you enter that Loop, the Sky is the Limit.</li><li>The right Problem Formulation is key.</li><li>In the Netflix Price Challenge, teams tried to build models that predict a users&#x2019; rating for a given movie. In the &#x201C;real world&#x201D;, companies use much more sophisticated data inputs which can be classified into two categories: Explicit and Implicit Data.</li><li>In today&#x2019;s world, Recommender Systems rely on much more than just Collaborative Filtering.</li></ul><p>In the Second Part I will cover:</p><ul><li>Evaluation Metrics</li><li>User Interface</li><li>Cold-start Problem</li><li>Exploration vs. Exploitation</li></ul><hr><p><strong><em>Resources</em></strong></p><p><a href="https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e">Airbnb&#x200A;&#x2014;&#x200A;Listing Embeddings in Search Ranking</a></p><p><a href="https://medium.com/airbnb-engineering/machine-learning-powered-search-ranking-of-airbnb-experiences-110b4b1a0789">Airbnb&#x200A;&#x2014;&#x200A;Machine Learning-Powered Search Ranking of Airbnb Experiences</a></p><p><a href="https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf" rel="noopener">Amazon&#x200A;&#x2014;&#x200A;Amazon.com Recommendations Item-to-Item Collaborative Filtering</a></p><p><a href="https://www.amazon.science/the-history-of-amazons-recommendation-algorithm" rel="noopener">Amazon&#x200A;&#x2014;&#x200A;The history of Amazon&#x2019;s recommendation algorithm</a></p><p><a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/" rel="noopener">Instagram&#x200A;&#x2014;&#x200A;Powered by AI: Instagram&#x2019;s Explore recommender system</a></p><p><a href="https://ls13-www.cs.tu-dortmund.de/homepage/rsweb2014/papers/rsweb2014_submission_3.pdf" rel="noopener">LinkedIn&#x200A;&#x2014;&#x200A;The Browsemaps: Collaborative Filtering at LinkedIn</a></p><p><a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429" rel="noopener">Netflix&#x200A;&#x2014;&#x200A;Netflix Recommendations: Beyond the 5 stars (Part 1)</a></p><p><a href="https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-2-d9b96aa399f5" rel="noopener">Netflix&#x200A;&#x2014;&#x200A;Netflix Recommendations: Beyond the 5 stars (Part 2)</a></p><p><a href="https://dl.acm.org/doi/pdf/10.1145/2843948" rel="noopener">Netflix&#x200A;&#x2014;&#x200A;The Netflix Recommender System: Algorithms, Business Value, and Innovation</a></p><p><a href="https://netflixtechblog.com/learning-a-personalized-homepage-aa8ec670359a" rel="noopener">Netflix&#x200A;&#x2014;&#x200A;Learning a Personalized Homepage</a></p><p><a href="https://pdfs.semanticscholar.org/f635/6c70452b3f56dc1ae07b4649a80239afb1b6.pdf" rel="noopener">Pandora&#x200A;&#x2014;&#x200A;Pandora&#x2019;s Music Recommender</a></p><p><a href="https://medium.com/s/story/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe">Spotify&#x200A;&#x2014;&#x200A;Discover Weekly: How Does Spotify Know You So Well?</a></p><p><a href="https://labs.spotify.com/2020/01/16/for-your-ears-only-personalizing-spotify-home-with-machine-learning/" rel="noopener">Spotify&#x200A;&#x2014;&#x200A;For Your Ears Only: Personalizing Spotify Home with Machine Learning</a></p><p><a href="https://www.slideshare.net/MrChrisJohnson/from-idea-to-execution-spotifys-discover-weekly/31-1_0_0_0_1" rel="noopener">Spotify&#x200A;&#x2014;&#x200A;From Idea to Execution: Spotify&#x2019;s Discover Weekly</a></p><p><a href="https://blog.twitter.com/engineering/en_us/topics/insights/2018/embeddingsattwitter.html" rel="noopener">Twitter&#x200A;&#x2014;&#x200A;Embeddings@Twitter</a></p><p><a href="https://eng.uber.com/uber-eats-recommending-marketplace/" rel="noopener">Uber Eats&#x200A;&#x2014;&#x200A;Food Discovery with Uber Eats: Recommending for the Marketplace</a></p><p><a href="https://eng.uber.com/uber-eats-graph-learning/" rel="noopener">Uber Eats&#x200A;&#x2014;&#x200A;Food Discovery with Uber Eats: Using Graph Learning to Power Recommendations</a></p><p><a href="https://www.inf.unibz.it/~ricci/ISR/papers/p293-davidson.pdf" rel="noopener">YouTube&#x200A;&#x2014;&#x200A;The YouTube Video Recommendation System</a></p><p><a href="https://arxiv.org/pdf/1409.2944.pdf" rel="noopener">YouTube&#x200A;&#x2014;&#x200A;Collaborative Deep Learning for Recommender Systems</a></p><p><a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf" rel="noopener">YouTube&#x200A;&#x2014;&#x200A;Deep Neural Networks for YouTube Recommendations</a></p><p><a href="https://www.zillow.com/tech/embedding-similar-home-recommendation/" rel="noopener">Zillow&#x200A;&#x2014;&#x200A;Home Embeddings for Similar Home Recommendations</a></p><p><a href="https://www.youtube.com/watch?v=giIXNoiqO_U&amp;list=PL-6SiIrhTAi6x4Oq28s7yy94ubLzVXabj" rel="noopener">Andrew Ng&#x2019;s Machine Learning Course (Recommender Systems)</a></p><p><a href="https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture" rel="noopener">Google&#x2019;s Machine Learning Crash Course&#x200A;&#x2014;&#x200A;Embeddings</a></p>]]></content:encoded></item><item><title><![CDATA[Machine Learning System Design]]></title><description><![CDATA[Some great resources on Machine Learning System designs from Facebook, Twitter, Google, Airbnb, Uber, Instagram, Netflix, AWS and Spotify.]]></description><link>https://ricomnl.com/blog/machine-learning-system-design/</link><guid isPermaLink="false">61b0142f9bc2fd1a4b2a2f48</guid><category><![CDATA[machine learning]]></category><category><![CDATA[system design]]></category><category><![CDATA[data science]]></category><dc:creator><![CDATA[Rico Meinl]]></dc:creator><pubDate>Mon, 02 Mar 2020 21:40:00 GMT</pubDate><content:encoded><![CDATA[<p><a href="https://becominghuman.ai/machine-learning-system-design-f2f4018f2f8">Read this post on Medium.</a></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://ricomnl.com/content/images/2020/10/1_mHGcYI_L-ci7jnkKkzfBOw.png" class="kg-image" alt loading="lazy" width="1400" height="783" srcset="https://ricomnl.com/content/images/size/w600/2020/10/1_mHGcYI_L-ci7jnkKkzfBOw.png 600w, https://ricomnl.com/content/images/size/w1000/2020/10/1_mHGcYI_L-ci7jnkKkzfBOw.png 1000w, https://ricomnl.com/content/images/2020/10/1_mHGcYI_L-ci7jnkKkzfBOw.png 1400w" sizes="(min-width: 720px) 720px"><figcaption>Facebook Field Guide to Machine Learning</figcaption></figure><p>While preparing for job interviews I found some great resources on Machine Learning System designs from Facebook, Twitter, Google, Airbnb, Uber, Instagram, Netflix, AWS and Spotify.</p><p>I find this to be a fascinating topic because it&#x2019;s something not often covered in online courses.</p><p><strong>Twitter</strong></p><ul><li><a href="https://blog.twitter.com/engineering/en_us/topics/insights/2017/using-deep-learning-at-scale-in-twitters-timelines.html" rel="noopener">Using Deep Learning at Scale in Twitter&#x2019;s Timelines</a></li><li><a href="https://blog.twitter.com/engineering/en_us/topics/insights/2019/improving-engagement-on-digital-ads-with-delayed-feedback.html" rel="noopener">Improving engagement on digital ads with delayed feedback</a></li><li><a href="https://blog.twitter.com/engineering/en_us/topics/insights/2018/embeddingsattwitter.html" rel="noopener">Embeddings@Twitter</a></li></ul><p><strong>Instagram</strong></p><ul><li><a href="https://instagram-engineering.com/lessons-learned-at-instagram-stories-and-feed-machine-learning-54f3aaa09e56" rel="noopener">Lessons Learned at Instagram Stories and Feed Machine Learning</a></li><li><a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/" rel="noopener">Powered by AI: Instagram&#x2019;s Explore recommender system</a></li></ul><p><strong>Facebook</strong></p><ul><li><a href="https://engineering.fb.com/security/fighting-abuse-scale-2019/" rel="noopener">Deep Entity Classification: An abusive account detection framework</a></li><li><a href="https://ai.facebook.com/blog/community-standards-report/" rel="noopener">New progress in using AI to detect harmful content</a></li></ul><p><strong>Uber Eats</strong></p><ul><li><a href="https://eng.uber.com/uber-eats-query-understanding/" rel="noopener">Food Discovery with Uber Eats: Building a Query Understanding Engine</a></li><li><a href="https://eng.uber.com/uber-eats-recommending-marketplace/" rel="noopener">Food Discovery with Uber Eats: Recommending for the Marketplace</a></li><li><a href="https://eng.uber.com/uber-eats-graph-learning/" rel="noopener">Food Discovery with Uber Eats: Using Graph Learning to Power Recommendations</a></li></ul><p><strong>Uber</strong></p><ul><li><a href="https://eng.uber.com/nlp-deep-learning-uber-maps/" rel="noopener">Applying Customer Feedback: How NLP &amp; Deep Learning Improve Uber&#x2019;s Maps</a></li><li><a href="https://eng.uber.com/forecasting-introduction/" rel="noopener">Forecasting at Uber: An Introduction</a></li></ul><p><strong>Airbnb</strong></p><ul><li><a href="https://medium.com/airbnb-engineering/using-machine-learning-to-predict-value-of-homes-on-airbnb-9272d3d4739d">Using Machine Learning to Predict Value of Homes On Airbnb</a></li><li><a href="https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e">Listing Embeddings in Search Ranking</a></li><li><a href="https://medium.com/airbnb-engineering/learning-market-dynamics-for-optimal-pricing-97cffbcc53e3">Learning Market Dynamics for Optimal Pricing</a></li><li><a href="https://medium.com/airbnb-engineering/categorizing-listing-photos-at-airbnb-f9483f3ab7e3">Categorizing Listing Photos at Airbnb</a></li><li><a href="https://medium.com/airbnb-engineering/applying-deep-learning-to-airbnb-search-7ebd7230891f">Applying Deep Learning To Airbnb Search</a></li><li><a href="https://medium.com/airbnb-engineering/discovering-and-classifying-in-app-message-intent-at-airbnb-6a55f5400a0c">Discovering and Classifying In-app Message Intent at Airbnb</a></li></ul><p><strong>Airbnb Experiences</strong></p><ul><li><a href="https://medium.com/airbnb-engineering/machine-learning-powered-search-ranking-of-airbnb-experiences-110b4b1a0789">Machine Learning-Powered Search Ranking of Airbnb Experiences</a></li></ul><p><strong>Linkedin</strong></p><ul><li><a href="https://engineering.linkedin.com/blog/2018/10/an-introduction-to-ai-at-linkedin" rel="noopener">An Introduction to AI at LinkedIn</a></li><li><a href="https://engineering.linkedin.com/blog/2019/fairness-privacy-transparency-by-design" rel="noopener">Fairness, Privacy, and Transparency by Design in AI/ML Systems</a></li><li><a href="https://engineering.linkedin.com/blog/2019/06/building-communities-around-interests" rel="noopener">Communities AI: Building Communities Around Interests on LinkedIn</a></li><li><a href="https://engineering.fb.com/security/fighting-abuse-scale-2019/" rel="noopener">Preventing abuse using unsupervised learning</a></li></ul><p><strong>Google</strong></p><ul><li><a href="http://highscalability.com/blog/2016/3/16/jeff-dean-on-large-scale-deep-learning-at-google.html" rel="noopener">Jeff Dean On Large-Scale Deep Learning At Google</a></li></ul><p><strong>Netflix</strong></p><ul><li><a href="https://www.youtube.com/watch?v=kY-BCNHd_dM" rel="noopener">A Multi-Armed Bandit Framework for Recommendations at Netflix</a></li></ul><p><strong>Spotify</strong></p><ul><li><a href="https://labs.spotify.com/2020/01/16/for-your-ears-only-personalizing-spotify-home-with-machine-learning/" rel="noopener">For Your Ears Only: Personalizing Spotify Home with Machine Learning</a></li><li><a href="https://medium.com/s/story/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe">How Does Spotify Know You So Well?</a></li></ul><hr><p>In addition, here are some resources on a more general process. Starting with the book <a href="https://www.amazon.com/Data-Science-Business-Data-Analytic-Thinking/dp/1449361323" rel="noopener">Data Science for Business</a> which explains the CRISP-DM (Cross Industry Standard Process for Data Mining).</p><p>The process involves six stages:</p><ol><li>Business Understanding</li><li>Data Understanding</li><li>Data Preparation</li><li>Modelling</li><li>Evaluation</li><li>Deployment</li></ol><p>Here is a more high-level breakdown on <a href="https://gist.github.com/bluekidds/cad5c0ea2e5051b638ec39810f3c4b09" rel="noopener">how to apply CRISP-DM on AWS.</a></p><p>Facebook also created a video series where they go into depth in how they structure Machine Learning Projects with the <a href="https://research.fb.com/the-facebook-field-guide-to-machine-learning-video-series/" rel="noopener">Facebook Field Guide</a> to Machine Learning.</p>]]></content:encoded></item></channel></rss>