Variant Call Format: really?

1000 genomes are making their genotypes available in variant call format (vcf). Now as others have noticed, vcf isn’t the prettiest format around. There are a few things to dislike:

  • The data is in ‘wide’ format which means that a file is fifteen screens wide and hides rare variation in a load of noise – it’s also ‘ungreppable’.
  • The use of tab delimiting with embedded semi-colon delimiting is an old trick, but these days everyone has a json parser handy so you should just use json for an structured field inside a tab delimited file.   Since strings, lists and key/value pairs all have simple json representations this will cover all cases.
  • The fact that the file headers dynamically describe the structure of the fields makes it less a format and more a family of formats.  The genotype fields have arbitrary keys.
  • Despite the flexibility defining genotypes, the locus information is not extensible so you cannot annotate the site with extra information.

Looking into PiCloud’s Sandbox

EDIT: this was due to python 2.7 incompatibility and incorrect documentation. These examples all work with python 2.6

PiCloud looks very interesting. Execute on demand python could give you a much greater level of control over cloud computing use. Now, of course, sandboxing python is not easy, despite some well known implementations.

So PiCloud don’t write much about the limitations of the sandbox, so I gave it a quick poke to see what happens:


import cloud

def blocker(x=1):
    while True:
        x = x + 1
    return x

jid = cloud.call(blocker)
cloud.join(jid)
print cloud.result(jid)
print cloud.info(jid, ['stde

This produces:

Traceback (most recent call last):
cloud.cloud.CloudException: Job 38: Traceback (most recent call last):
  File "/root/.local/lib/python2.6/site-packages/cloudserver/workers/employee/child.py", line 334, in run
  File "tp.py", line 4, in blocker
    while True:
SystemError: unknown opcode

Which is an interesting error!

Trying this:


def blocker(x=1):
    while x < 10:
        x = x + 1
    return x

returns

cloud.cloud.CloudException: Job 39: None

So we're not allowed while loops?

Friday quiz: where and when does this describe?

But these evils, though great, were small compared to
those far more deep-seated signs of disease which now
showed themselves throughout the country. One of these
was the obliteration of thrift from the minds of the ***
people. The *** are naturally thrifty; but, with such
masses of money and with such uncertainty as to its future
value, the ordinary motives for saving and care diminished,
and a loose luxury spread throughout the country. A still
worse outgrowth was the increase of speculation and
gambling. With the plethora of paper currency in *** ap·
peared the first evidences of that cancerous disease which
always follows large issues of irredeemable currency,—a
disease more permanently injurious to a nation than war,
pestilence or famine. For at the great metropolitan centers
grew a luxurious, speculative, stock-gambling body, which,
like a malignant tumor, absorbed into itself the strength of
the nation and sent out its cancerous fibres to the remotest
hamlets. At these city centers abundant wealth seemed to
be piled up: in the country at large there grew a dislike
of steady labor and a contempt for moderate gains and sim-
ple living.

This outgrowth was a
vast debtor class in the nation, directly interested in the
depreciation of the currency in which they were to pay their
debts. The nucleus of this class was formed by those who
had purchased the church lands from the government. Only
small payments down had been required and the remainder
was to be paid in deferred installments: an indebtedness of
a multitude of people had thus been created to the amount
of hundreds of millions. This body of debtors soon saw, of
course, that their interest was to depreciate the currency in
which their debts were to be paid; and these were speedily
joined by a far more influential class;—by that class whose
speculative tendencies had been stimulated by the abun-
dance of paper money, and who had gone largely into debt,
looking for a rise in nominal values. Soon demagogues of
the viler sort in the political clubs began to pander to it;
a little later important persons in this debtor class were to
be found intriguing in the Assembly—first in its seats and
later in more conspicuous places of public trust. Before long,
the debtor class became a powerful body extending through
all ranks of society. From the stock-gambler who sat in the
Assembly to the small land speculator in the rural districts;
from the sleek inventor of canards on the *** Exchange
to the lying stock-jobber in the market town, all pressed
vigorously for new issues of paper; all were apparently able
to demonstrate to the people that in new issues of paper lay
the only chance for national prosperity.

Siôn Simon’s Comments on Tom Watson in the DEB debate: Beyond Irony?

Siôn Simon decided to caricature the opponent’s of the Digital Economy Bill with a piece of ‘fan fiction’ based on Star Wars. As chilling effects notes, copyright law is not clear about the use of characters in derivative works. However, I’m sure as a staunch defender of copyright he will have cleared it with LucasFilm first; after all they carefully defend their creations when anyone else uses them

pivot tables with sqlalchemy

If your database doesn’t support pivots, here is a quick technique to get pivot columns with sqlalchemy

import operator
from sqlalchemy.sql import case, func, select

def pivot_report(report, pivot_on=None, pivot_columns=None, pivot_func=func.sum,
                    non_pivot_columns=None, group_by=None):
    """ produce a pivot on a select

    if we have a report: 

        id, type, count
        1, white, 10
        1, black, 20
        2, white, 12
        2, black, 20

    and we want 

        id, black count, white count
        1, 20, 10
        2, 20, 12

    pass in type as the pivot_on, and [count] as pivot columns  

    """

    # find all possible values of the pivot
    pivot_values = map(
        operator.itemgetter(0),
        select([pivot_on], from_obj=[report]).distinct().execute()
    )

    # build the new pivot columns
    new_columns = [
        pivot_func(case([(pivot_on == value, column)])).label("%s %s" % (value, column))
        for value in pivot_values
        for column in pivot_columns
    ]

    return select(
        non_pivot_columns + new_columns,
        from_obj=[report],
        group_by=group_by
    )

# example code
from sqlalchemy import Table, Column, Integer, String, MetaData
from sqlalchemy import create_engine

metadata = MetaData()
example = Table('example', metadata,
    Column('id', Integer),
    Column('type', String),
    Column('count', Integer),
)

engine = create_engine('sqlite:///:memory:')
metadata.bind = engine
example.create()
example.insert().execute(
    dict(id=1, type='white', count=10),
    dict(id=1, type='black', count=20),
    dict(id=2, type='white', count=30),
    dict(id=2, type='black', count=40),
)

report = example.select()

# now build the pivot
report = pivot_report(
    report,
    pivot_on=report.c.type,
    pivot_columns=[report.c.count],
    non_pivot_columns=[report.c.id],
    group_by=[report.c.id])

for r in report.execute():
    print r.items()

object ceremony, dynamic languages, JSON and algebraic data types

The reason most people end up using a dynamic language is to avoid the boilerplate associated with object creation. You know, typing “FileWriter fout = new FileWriter(”fred.txt”);” gets boring quickly. I think this is a good enough reason to move to another language on its own. This boilerplate is also sometimes called ceremony, and I have come to realize that far from being low ceremony, dynamic languages actually revolve around ceremony.

Think of the hash of hashes, list of hashes approach that you often find in a perl, python or ruby program. These are so useful, that they have been codified as JSON – which can simply be evaled to return your data in several languages. Ad-hoc data structures like this have a great appeal when hacking something in python, yet you quickly get to a pain point when using them when the data doesn’t look exactly as you would expect and you need to handle exceptions and edge cases.

So why is the hash of hashes approach so tempting? Because it avoids the ceremony around object creation. Things like Python’s “__init__” and Perl’s “bless” are ceremony and are necessarily ceremony because all a type in a dynamic language is just data with some ceremony. Clearly, perl has got a perfect name for the ceremony in “bless”.

The eureka moment comes when you use a typed language that is low ceremony, such as Haskell. Algebraic data types give you the freedom to create complex data structures without ceremony, which you can then process without the hassle involved in unpicking a big blob of JSON, which will typically need lots of switches. Instead, you can pattern match on the type.

So if you went to a dynamic language to avoid the ceremony, you may well be moving in the wrong direction.

cogent: the unsung hero of bioinformatics and python

I recently started using cogent – the COmparative GENomics Toolkit and discovered that it is an excellent piece of kit. A google search for ‘python ensembl‘ doesn’t even show it at all, yet it definitely has the best bindings for ensembl avaiable in python – they’re based on sqlalchemy making it easy enough to pull of any query. Have a look at the full list of examples.

Installing python bioinformatics tools with virtualenv and pip

Python seems to have developed a decent set of tools for quickly building development environments. I want to store my notes on how to get a good environment for bioinformatics set up quickly.

First of all, if you haven’t already, install virtualenv and pip. Both are easy installable. Now install virtualenv wrapper.

Now we are going to setup a bioinformatics environment with both biopython and pygr installed so that you can hack on them. Firstly create a new virtualenv, passing the no site packages flag to keep this clean:

james@flapjack:~/Documents/virtualenvs$ mkvirtualenv --no-site-packages bio
New python executable in bio/bin/python
Installing setuptools............done.
(bio)james@flapjack:~/Documents/virtualenvs$ cdvirtualenv
(bio)james@flapjack:~/Documents/virtualenvs/bio$

Now, to install biopython we first use pip to install numpy:

(bio)james@flapjack:~/Documents/virtualenvs/bio$ pip -E . install numpy
Downloading/unpacking numpy
...

Important to remember the ‘-E’ flag which tells pip to use the virtualenv we are in (this should be added to virtualenv_wrapper IMHO). Now we can install biopython from our github fork, using the ‘-e’ flag to keep it editable (i.e we are hacking on it).

(bio)james@flapjack:~/Documents/virtualenvs/bio$ pip -E . install -e git://github.com/jamescasbon/biopython.git#egg=biopython
Obtaining biopython from git+git://github.com/jamescasbon/biopython.git#egg=biopython
Cloning git://github.com/jamescasbon/biopython.git to ./src/biopython
remote: Counting objects: 22719, done.
...

Next up, we want pygr so we need pyrex to build the c files:

(bio)james@flapjack:~/Documents/virtualenvs/bio$ pip -E . install -U pyrex -f http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/Pyrex-0.9.8.5.tar.gz
Downloading/unpacking pyrex
Downloading Pyrex-0.9.8.5.tar.gz (242Kb): 242Kb downloaded
In the tar file /var/folders/Gn/GneSaDeKGaGpZXx+hcopdU+++TI/-Tmp-/tmpTEdhFd/Pyrex-0.9.8.5.tar.gz the member Pyrex-0.9.8.5/Demos/embed/Makefile is invalid: 'filename None not found'
Running setup.py egg_info for package pyrex
Installing collected packages: pyrex
Running setup.py install for pyrex
changing mode of build/scripts-2.5/pyrexc from 644 to 755
changing mode of /Users/james/Documents/virtualenvs/bio/bin/pyrexc to 755
Successfully installed pyrex

Now, to get and editable pygr:

(bio)james@flapjack:~/Documents/virtualenvs/bio$ pip -E . install -e git://github.com/jamescasbon/pygr.git#egg=pygr
Obtaining pygr from git+git://github.com/jamescasbon/pygr.git#egg=pygr
Cloning git://github.com/jamescasbon/pygr.git to ./src/pygr
remote: Counting objects: 6281, done.
...
Successfully installed pygr

Finally, ipython:

(bio)james@flapjack:~/Documents/virtualenvs/bio$ pip -E . install ipython
Downloading/unpacking ipython
Downloading ipython-0.9.1.tar.gz (2.8Mb): 2.8Mb downloaded

We now have a completely isolated environment, where pygr and biopython are editable:

(bio)james@flapjack:~/Documents/virtualenvs/bio$ bin/ipython
Python 2.5.4 (r254:67916, Mar 2 2009, 10:40:04)
Type "copyright", "credits" or "license" for more information.

IPython 0.9.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object'. ?object also works, ?? prints more.

In [1]: import pygr

In [2]: pygr.__file__
Out[2]: '/Users/james/Documents/virtualenvs/bio/src/pygr/pygr/__init__.pyc'

More money…

My company has met its goals and secured a second tranche of VC money. To celebrate, we’ve even got a website: Population Genetics Technologies.

Two excellent pieces of writing

Now that I’ve hung up my hat at the Open Rights Group, I actually have time to read stuff for pleasure again. And it has been with great pleasure that I’ve read the two pieces listed below. Sometimes it doesn’t matter what you’re writing about – the quality of your prose sings through. In the case of these two pieces, though, that quality is matched by the urgency of the subject matter. Enjoy.