Debugging thousand-CPU-hour, multi-gigabyte analyses with Python decorators

Description

How to handle multi-gigabyte datasets and multi-hour runs, and how to debug them quickly, using Python.

Abstract

I have written a few scientific applications that took many hours to complete, and I have spent a lot of time debugging the problems typical of this kind of scientific computing.

I will talk about architecture for such big data analysis tasks, and about how to debug them quickly even when a run takes at least a night to complete and generates thousands of special cases and exceptions.

I will:

Describe the standard problems of "application fusion" scripting as used in bioinformatics data analysis.
Consider the architectural requirements of open data analyses that rely on software which may be only partially open and only barely supported by its parent scientific institution.
State the typical problems solved in this environment, with large multi-gigabyte datasets and many hours or days of aggregate runtime across the cluster.
Show how best to address these challenges with Python and its advanced language features, such as decorators.
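
To give a flavour of the last point, here is a minimal sketch of one way a decorator can keep a long run alive while collecting its special cases for later debugging. It is an illustration under stated assumptions, not necessarily the approach presented in the talk: the names collect_failures, failures and gc_content are hypothetical, and a real pipeline would probably write failures to disk rather than keep them in memory.

```python
import functools
import traceback


def collect_failures(failures, default=None):
    """Decorator factory (hypothetical name): record failing calls in
    `failures` and return `default` instead of aborting the whole run."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Keep enough context to replay this one case later,
                # without rerunning the overnight job.
                failures.append({
                    "function": func.__name__,
                    "args": args,
                    "kwargs": kwargs,
                    "error": repr(exc),
                    "traceback": traceback.format_exc(),
                })
                return default
        return wrapper
    return decorator


failures = []


@collect_failures(failures)
def gc_content(sequence):
    """Hypothetical per-record step: fraction of G and C bases."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# The empty sequence raises ZeroDivisionError; the run continues anyway.
results = [gc_content(seq) for seq in ["GGCC", "", "ATGC"]]

for failure in failures:
    print(failure["function"], failure["args"], failure["error"])
```

After the run, each entry in failures holds the arguments of one failing call, so a single special case can be replayed in isolation, for example with gc_content.__wrapped__(*failure["args"]), which calls the undecorated function and lets the original exception surface in a debugger.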