BioDAP - Bioinformatics Data Analysis Pipeline

Introduction

BioDAP started as an effort to analyze the microarray data I was generating in the lab more reliably. Originally I had developed a series of Perl scripts to analyze my data using a fairly common workflow. The data started out in a tab-delimited text file. From there, I fed the file to Perl script after Perl script, each one adding a column or two to the file. This left me with a ton of intermediate data files (one for each script) and a ton of stdout log files. This was all nice and easy, and it worked: I was able to get the gene list I needed.

Then I decided I wanted to change a few filtering features and generate a few other lists. This meant that my previously linear set of piped Perl scripts was now a branched workflow, which turned out to be simply unworkable.

Based upon this experience, and the experience of helping others analyze microarray data, I came up with BioDAP. There have been many other bioinformatics pipelines written, but none of them quite fit the bill: they were targeted at other use cases (like the all-web-services Taverna), restricted you to other tools (like the Emboss JPIPE), or were seemingly defunct (BioPipe). I needed something to analyze my data in a reproducible manner that let me tailor the analysis parameters.

So, I started to replicate my Perl scripting workflow, but using a Java engine that reads a pipeline workflow from an XML file. This was definitely inspired by Apache Ant's build.xml file.

There are only two concepts in BioDAP, but they provide everything needed for data analysis.  They are DataSources and Tasks.

Below is an example of a BioDAP pipeline XML file.

<?xml version="1.0" encoding="UTF-8"?>
<experiment name="Experiment Foo">
    <import package="com.fourspaces.biodap.task.io.*"/>
    <import package="com.fourspaces.biodap.task.math.*"/>
    <import package="com.fourspaces.biodap.datasource.*"/> 
    <param name="infile" type="filename">data.txt</param>

    <data-source type="InMemoryDataSource"/>
    <task type="TabImporter">
        <param name="filename">${infile}</param>
        <param name="key_column_name">Accn</param>
    </task>
    <task type="Mean">
        <bind-input name="column" from="A,B,C,D,E,F"/>
        <bind-output name="mean" to="ave"/>
        <bind-output name="count" to="N"/>
        <bind-output name="standard_deviation" to="stddev"/>
    </task>
    <task type="Print">--- output ---</task>
    <task type="TabExporter"/>
</experiment>

DataSources

DataSources are fairly self-explanatory. These are the containers that hold all of the data. Data is organized similarly to an Excel spreadsheet (but a bit more powerful). There is a "key" column that contains a unique identifier for each row. There are then any number of other columns that hold data values.

The type of data is automatically determined by BioDAP's DataValue class. For example, BioDAP can determine that "Example" is a string, "1.0" is a number, and so on. This information comes in handy when running tasks that expect certain types of data. Currently the supported data types are: NUMBER, TEXT, LONGTEXT, BOOLEAN, DATE, TIME, DATETIME, GENERAL, COLLECTION, URL, EMPTY.

There can be many data sources, and you can copy data from one data source to another. They can also have various backing implementations. For example, the default implementation (InMemoryDataSource) stores data in memory in a HashMap. However, you could also have implementations that are backed by databases, flat files, etc. If you do not define a data source, an anonymous InMemoryDataSource is created for you.

You can pass parameters to data sources at initialization, as well as assign them a unique name.

Example full format:

<data-source type="InMemoryDataSource" name="two">
        <param name="id-column">$ID$</param>
</data-source> 

Lists 

In addition to storing data, data sources can also store a number of tuple lists. These are lists of rows that are created largely by filtering tasks.  This allows you to create things like gene lists based upon a variety of criteria.

Tasks

As expected, tasks perform most of the work. A task operates on a list of rows that belong to a data source. By default, this is the list of all rows in the anonymous data source. The task can operate on the data source as a whole, or on a row-by-row basis. Tasks accept three types of additional configuration: bind-input, bind-output, and param.

Tasks may require a certain input or inputs.  These inputs are drawn from the columns stored in the data source.  These are named inputs that vary based upon the task.  In order to bind a required input from a column in your data source to the task, you use the bind-input configuration.  This simply binds a named input "name" to a column or columns ("from").

The same type of operation exists for task outputs.  A task can also have many named outputs, which are bound to new or existing columns in the target data source.  This is handled by the bind-output configuration.  Instead of binding from a column, you bind "to" a column.

Additionally, a task may require some other configuration.  This is handled by the param configuration. These are named parameters that can contain other information, such as filter cutoffs, file names, etc...

Tasks can either place their results in the original data source or another "target" data source.  By default they place their results (if there are any) in the original data source.

Tasks don't have to operate on a data source.  They can print out values, they can interact with databases, or run other scripts.  If it can be expressed in Java, or called externally, you can use a task to perform the job.

Many tasks have already been written, including tasks to import tab-delimited files, perform basic filtering (greater than, less than, etc.), connect to a database and perform a JDBC query (binding output), call an external script (such as a Perl script), perform basic math and statistics (mean, sum, log, min, max, ratio), perform regular expression matching, and call R scripts directly (binding inputs and outputs for either whole data-source or row-by-row operations).

Example full format:

<task type="RegExMatch">
     <bind-input name="column" from="Gene"/>
     <bind-output name="1" to="gene_number"/>
     <param name="regex">GENE(\d+)</param>
</task>

The above task iterates over each row, performing the listed regular expression match on the "Gene" column. It then binds the first captured group of the match (the output named "1", here the digits matched by \d+) to a column named "gene_number".

Other configuration options 

Import 

The import parameter saves the user from having to type full Java class names in the "type" fields for DataSources and Tasks. You can enter either a wildcard package name, to import everything in that package into the main namespace, or single classes. This works much like the import keyword in Java or the use keyword in Perl.
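
As a rough illustration, assuming TabImporter lives in the com.fourspaces.biodap.task.io package (the pipeline above imports that package with a wildcard), the forms below should be roughly equivalent ways of making the task available. Reusing the "package" attribute for a single class is an assumption, since only the wildcard form is shown in this document.

<!-- wildcard: every class in the package -->
<import package="com.fourspaces.biodap.task.io.*"/>

<!-- single class (attribute name assumed) -->
<import package="com.fourspaces.biodap.task.io.TabImporter"/>

<!-- no import at all: full class name in the type field -->
<task type="com.fourspaces.biodap.task.io.TabImporter">
    <param name="filename">data.txt</param>
    <param name="key_column_name">Accn</param>
</task>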

Context parameters

You can also define any number of parameters in the pipeline file. These are called context parameters because their values are not stored in a data source, but within the context of the BioDAP engine itself. Input bindings, output bindings, and parameters can use these context parameters in lieu of hard-coded values. This helps to eliminate typos, as you only have to define each value once. An example would be defining one list of columns that correspond to experimental values and another that corresponds to control values. You could then operate on each list with many tasks while typing the list only once.

As an added benefit, if you are using the GUI version, you can also be prompted for context parameter values instead of hard-coding them in the pipeline XML file. Supported promptable parameters include filenames, column names (single or multiple), and plain text strings.

Context parameters (and column values in tasks) can be referenced like this: ${param_name}
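
As a sketch of the column-list case described above, the fragment below (placed inside the experiment element) defines a list of experimental columns once and reuses it in a task. The parameter name exp_columns and the output column exp_mean are made up for illustration. The type attribute is left off because this document does not show the type value used for column lists, so treat that part as an assumption.

<param name="exp_columns">A,B,C</param>

<task type="Mean">
    <bind-input name="column" from="${exp_columns}"/>
    <bind-output name="mean" to="exp_mean"/>
</task>

Any other task that needs the same columns can bind from ${exp_columns} as well, so a change to the list only has to be made in one place.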