Ubuntu 9.10, nicknamed Karmic Koala, is about to be released, and in a moment of idleness, I upgraded my old 9.04 install to the latest beta. Upgrading always generates a slight feeling of dread, taking the plunge from the cozy stability of bugs I’ve learned to work around into the great unknown, but it all went even smoother than the previous one. And on the plus side, ghc is now, finally, upgraded to an almost modern release (6.10.4), and lots of libraries are included as well. Great work by Joachim Breitner and his army of debianizers. So I’m all ready to take advantage of my new compiler and its improvements, but first I need to bring all my software up to speed. I’ll make notes here as I go along, and hopefully this will also be useful for users of other Linux distributions.
First I need to install the bioinformatics library. I’m about to release 0.4.1, but this is also a good opportunity to check that everything works with 0.4 (which is what you’ll find on Hackage), so let’s do that first. Using darcs, I pull the repo up to the 0.4 tag (but you can of course get the tarball from Hackage):
% ./Setup.hs configure
Configuring bio-0.4...
Setup.hs: At least the following dependencies are missing:
QuickCheck <2, binary -any
(Side note: you may notice that I run Setup.hs directly, as opposed to using runhaskell. I prefer it this way, but you may have to do a chmod +x Setup.hs if you downloaded this from the darcs repository or similar.)
Since we want to use the system libraries as far as possible, these libraries are just an apt-get away:
sudo apt-get install libghc6-quickcheck1\*
sudo apt-get install libghc6-binary-\*
Now, let’s try again:
% ./Setup.hs configure
Configuring bio-0.4...
% ./Setup.hs build
Preprocessing library bio-0.4...
Building bio-0.4...
Binary: Int64 truncated to fit in 32 bit Int
ghc: panic! (the 'impossible' happened)
(GHC version 6.10.4 for i386-unknown-linux):
Prelude.chr: bad argument
Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug
Okay, this was not what was supposed to happen. As always, dropping to #haskell on IRC is the first thing to do, and sure enough:
<sereven> ketil: that's also shown up for xmonad users when .hi and .o files weren't cleaned
between rebuilds mixing different versions. usually between ghc updates IIRC.
Let’s try to get rid of old cruft lying about, polluting directories:
./Setup.hs clean && ./Setup.hs configure && ./Setup.hs build
Sure enough, this time it worked. For good measure, we’ll run the unit tests:
make test
After a zillion tests, we notice that everything is go, great!
Next, it is time to go through the list of bioinformatics applications. Since my working directory is a mess of branches and versions, we’ll just go over the published applications and versions on Hackage.
xsact is an application to do sequence (in particular EST sequence) clustering. It predates and thus doesn’t actually use the bioinformatics library, but we’ll check it anyway. So we try the familiar command line:
./Setup.hs clean && ./Setup.hs configure && ./Setup.hs build
And things compile. However, the version on Hackage is outdated, so we’ll upload a new version, 1.7.1. One test case still fails, but I can’t imagine anybody is using it to generate Newick-formatted trees — I am certainly not — and since there are many equally correct outputs (including tree rotations and rounding modes), output is likely correct anyway. Holler if you need it!
rbr is an application to mask repeats in sequence data. Normally, this is done using a library of known repeats, but this application tries to do it using statistics, making the — I think justifiable — assumption that repeats are going to be more common than non-repeats. The version on Hackage is old, and only works with the library prior to 0.4, so again this is a good time to push the latest changes out in the limelight. Compiling this works great, by the way.
cluster_tools is a package that contains a bunch of binaries, useful for working with the results of sequence clustering, including extracting various information from ACE files. This uses another library, called simpleargs, that simplifies command line argument parsing for simple cases. Again, the Hackage version is for bio<0.4, so a new version will be pushed. At the same time, we make a mental note to push version 0.2 of simpleargs to Hackage as well, instead of keeping age-old modifications buried forever.
dephd is my Swiss-army-knife of sequence analysis, and lets you do various things like converting between formats, plotting and trimming by quality. This is a more live project than most of the others (I’m currently working on improved quality trimming and automatic generation of files for submission to GenBank), but the currently available version also compiles without incident.
estreps is a couple of programs I needed for repeat analysis, perhaps not tremendously interesting, but at least rselect, which lets you select randomized subsets from Fasta files, might be of interest to some? We try the usual invocation to compile, and get:
src/Unigene.hs:24:23:
Couldn't match expected type `a' against inferred type `Unknown'
`a' is a rigid type variable bound by
the type signature for `clusters' at src/Unigene.hs:22:41
This error arises from the phantom types for identifying sequences that were introduced in bio 0.4. Unfortunately, this version of estreps already contains some adaptation to this model, so it won’t compile against older versions of the library either. So it looks like yet another sdist for Hackage. Look for version 0.3.1.
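To illustrate the kind of mismatch the compiler is complaining about, here is a minimal sketch. All the names in it (Sequence, Unknown, Nuc, readSeq, castSeq) are simplified stand-ins chosen for illustration and are not the actual definitions from the bio library; only the function name clusters is taken from the error message.

```haskell
{-# LANGUAGE EmptyDataDecls #-}

-- Hypothetical tags that exist only at the type level (no values).
data Unknown
data Nuc

-- The parameter 'a' is a phantom: it does not appear on the right-hand side.
newtype Sequence a = Seq String deriving (Show, Eq)

-- A reader that cannot know what kind of sequence it has parsed.
readSeq :: String -> Sequence Unknown
readSeq = Seq

-- A signature promising *any* 'a' is rejected, because the body can only
-- deliver 'Sequence Unknown' (the "rigid type variable" error):
--   clusters :: [String] -> [Sequence a]
--   clusters = map readSeq
-- Stating the actual type makes it compile.
clusters :: [String] -> [Sequence Unknown]
clusters = map readSeq

-- An explicit cast for when the caller does know the sequence type.
castSeq :: Sequence a -> Sequence b
castSeq (Seq s) = Seq s
```

In this sketch the fix is either to say Unknown in the signature, or to convert explicitly with something like castSeq where the caller knows better, for example castSeq s :: Sequence Nuc.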
flower is a utility for extracting information from SFF files (containing sequences from Roche’s 454 machines). Although a new version is around the corner, the old 0.2 just works.
xml2x is a utility for converting BLAST results in XML format into CSV, which somehow is more compatible with biologists. Trying to compile it fails, with the following error:
src/Xml2X.hs:152:49:
Couldn't match expected type `[b]'
against inferred type `Maybe Bio.Sequence.KEGG.KO'
In the first argument of `concatMap', namely `(flip M.lookup ks)'
In the first argument of `($)', namely
`concatMap (flip M.lookup ks)'
In the second argument of `($)', namely
`concatMap (flip M.lookup ks) $ map chop $ map subject fs'
It turns out that somewhere along the way, the lookup function from Data.Map was de-generalized from working on arbitrary monads to just returning a Maybe. I was using this to return a list, using the empty list to signal an unsuccessful lookup. This is easily remedied, but that means yet another sdist for Hackage.
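A minimal sketch of one way to adapt code like this, assuming the only goal is to collect the successful lookups and skip the failures. The function names lookupAll and lookupAll' are made up for this example, and this is not necessarily the change that went into xml2x.

```haskell
import qualified Data.Map as M
import Data.Maybe (mapMaybe, maybeToList)

-- With the old monadic lookup, this worked in the list monad:
--   concatMap (flip M.lookup ks) keys
-- Now that lookup returns a Maybe, each result is turned into a
-- zero- or one-element list before concatenating.
lookupAll :: Ord k => M.Map k v -> [k] -> [v]
lookupAll ks = concatMap (maybeToList . flip M.lookup ks)

-- Equivalent and a little more direct: mapMaybe drops the Nothing results.
lookupAll' :: Ord k => M.Map k v -> [k] -> [v]
lookupAll' ks = mapMaybe (flip M.lookup ks)
```

Either variant keeps the old behaviour, where a key missing from the map simply contributes nothing to the result.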
korfu is a utility for identifying open reading frames in sequence data. It hasn’t yet been ported to version 0.4 of the library, but works if you install 0.3.5. I updated this too, since it didn’t have a category. Now it too resides in the bioinformatics section.
In retrospect, it seems like giving old code a thorough spring cleaning once in a while is a good idea. Although nothing really critical or difficult happened, a good number of small annoyances were discovered, and a bunch of new sdists are now ready to be uploaded to Hackage. Next will be converting all this into Debian packages.
The important question is, of course, how do we avoid this in the future? During development, it is important to be able to modify libraries and applications, but installing a new version of the biolib, say, overwrites the old one, and suddenly I’m compiling and testing everything against a different library than Joe Random Hackage User is going to find. I have some thoughts on how to avoid this, but if you have a method that works nicely, I’m all ears.