Archive for category bioinformatics

Double standards in Nature biotechnology

OK, So that is a relatively inflammatory and controversial headline, edging on the side of tabloid sensationalism. What is refers to is probably a situation that I may never find myself in again, which is in this months edition of Nature Biotechnology I am an author on two, biological standards related publications.

The first is a letter advertising the PSI’s MIAPE Guidelines for reporting the use of gel electrophoresis in proteomics. This letter is also accompanied by letters referring to the  MIAPE guidelines for Mass Spectrometry, Mass Spectrometry Informatics and protein modification data.

The second is a paper on the Minimum Information about a Biomedical or Biological Investigations (MIBBI) registry entitled Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project.

The following press release describes this paper in more detail.

More than 20 grass-roots standardisation groups, led by scientists at the European Bioinformatics Institute (EMBL-EBI) and the Centre for Ecology & Hydrology (CEH), have combined forces to form the “Minimum Information about a Biomedical or Biological Investigation” (MIBBI) initiative. Their aim is to harmonise standards for high-throughput biology, and their methodology is described in a Commentary article, published today in the journal Nature Biotechnology.

Data standards are increasingly vital to scientific progress, as groups from around the world look to share their data and mine it more effectively. But the proliferation of projects to build “Minimum Information” checklists that describe experimental procedures was beginning to create problems. “There was no way of even finding all the current checklist projects without days of googling,” says the EMBL-EBI’s Chris Taylor, who shares first authorship of the paper with Dawn Field (CEH) and Susanna-Assunta Sansone (EMBL-EBI). “As a result, much of the great work that’s going into developing community standards was being overlooked, and different communities were at risk of developing mutually incompatible standards. MIBBI will help to prevent them from reinventing the wheel.

The MIBBI Portal already offers a one-stop shop for researchers, funders, journals and reviewers searching for a comprehensive list of minimum information checklists. The next step will be to build the MIBBI Foundry, which will bring together diverse communities to rationalise and streamline standardisation efforts. “Communities working together through MIBBI will produce non-overlapping minimal information modules,” says CEH’s Dawn Field. “The idea is that each checklist will fit neatly into a jigsaw, with each community being able to take the pieces that are relevant to them.” Some, such as checklists describing the nature of a biological sample used for an experiment, will be relevant to many communities, whereas others, such as standards for describing a flow cytometry experiment, may be developed and used by a subset of communities.

“MIBBI represents the first new effort taking the Open Biomedical Ontologies (OBO) as its role model”, says Susanna-Assunta Sansone. “The MIBBI Portal operates in a manner analogous to OBO as an open information resource, while the MIBBI Foundry fosters collaborative development and integration of checklists into self-contained modules just like the OBO Foundry does for the ontologies”.

There is a growing understanding of the value of such minimal information standards among biologists and an increased willingness to work together across disciplinary boundaries. The benefits include making experimental data more reproducible and allowing more powerful analyses over diverse sets of data. New checklist communities are encouraged to register with MIBBI and consider joining the MIBBI Foundry.

Press release issued by the EMBL-European Bioinformatics Institute and the Centre for Ecology and Hydrology, UK.
Zemanta Pixie

, , , , , , , , , , , , , , ,

Leave a comment

CARMEN – A Scalable Science cloud

Paul Watson presents a talk on CARMEN a the Google Seattle Conference on Scalability.

, , , , , , ,

1 Comment

Proteomics Standards Initiative recommendations

Several new standard artefacts have progressed through the public consultation process of the PSI. They are the MIAPE Column Chromatography document and the Mass Spectrometer Markup Language Specification (mzML).

Leave a comment

CARMEN; Re-branded

carmen logo

We have gone through a bit of professional branding with a shiney new logo and some publicity material. The website has also been re-designed. The original drupal site was replaced with a Plone site, a decision I was not involved with. I am not sure I am a big fan of the big mug-shot which spills over the website template, the fact you have to scroll to the bottom of the front page to find out what the project is about is less than ideal. Feel free to comment

1 Comment

Open-ed gel electrophoresis data

Several months ago – about 3, I made a public commitment to make the data I have generated during my Phd open and available online. Well I have not ignored this and in the interim I have been investigating various ways I can do this. Not only do I want to make it available but I want to structure it in a standard form, namely the gelML format. In addition, I was involved in developing it the specification and therefore, I have somewhat an obligation to use it. As it is an XML transfer format I needed to be make changes and revision it, like developing code, so in that sense recording the data on a wiki or blog would not be appropriate. For this reason I have chosen to create a google code project for gel electrophoresis data and do everything in subversion. You can browse the subversion repository or check it out anonymously. The geML file that will eventually (as its still very much a work in progress) contain the data is here. As I am doing this, I though I might as well publish my lab book while I was at it. This will be done using LateX and the pdf that gets generated can be found here.

To date, this is still a work in progress and a reverse engineering project, as the experiments are not being done live. It may take a while to complete but in the end I hope presenting my data in gelML and making my labbook available can be more of a benefit than decomposing for years on cellulose.


How do you select your Scientific Journal?

I was trying to work out a suitable journal to where I could submit a paper on sepCV, the PSI ontology for sample preparation and separation techniques. I found my self drawing up a table, so I thought I would blog it. My initial remit is that is should be in a proteomics relevant journal as well as bioinformatics, as we are trying to encourage a greater community contribution for term collection. In this respect it has to be open access. I would also prefer the journal to accept LateX instead on proprietary formats such as word. I was really disappointed with the number of journals that only accept word documents, even PlosOne refuses anything other than word or rft, tut, tut.

Based on this loose criteria, Proteome science comes top closely followed by Journal of Proteomics and Bioinformatics (if I sacrifice using word for open access). BMC Bioinformatics also ticks all the boxes but it misses out on the proteomics audience. The table below also includes Impact factor, but I did not really take that into consideration. Wouldn’t it be nice is there was an app that you could just enter in criteria like this, target audience, submission format, copyright, etc and get back the journals that meet these requirements. Something like this would save me an afternoon trawling through the web building spreadsheets.

How do you select your journal?



Journal Word Latex Open Access Impact Factor


Proteomics Yes NO cant find statement 5.735


Journal of Proteomics and Bioinformatics Yes NO Yes


Biomedical Informatics Yes Yes NO 2.346


BMC Bioinformatics Yes Yes Yes 3.62


PlosONE Yes No Yes


Proteome Science Yes Yes Yes


Proteomics research Yes Yes No 5.151


Modeling experiments and presenting the information

I have struggled to keep up with this discussion, with excuses ranging from attending workshops, major release deadline on the horizon (now past) and a post-mortem on the release schedule, to attending (only to please the parents) my graduation ceremony. I am only now starting to catch up on my feeds but dauntingly Google reader tells me I have 1000+, moan moan moan, (
Anyway, in Cameron’s last post on the subject he points to all the previous discussions and other commentary on the topic. I will pick up from his last post and respond to some his responses to the responses to responses.

I still feel that we are trying to describe and achieve different things, but that this discussion is a great way of getting to the bottom of this and achieving some clarity in our description and language.

This certainly may be the case, I want to present FuGE as something that is worth considering rather than re-inventing. However, this is no denying that FuGE is a datamodel and does not come with a high degree of tool support or nice user interfaces, which Cameron is crying out for, as are most lab scientists from a usability point of view.

I got off to a very bad start here. I should have used the word ‘capture’ here. This to me is about capturing the data streams that come out of lab work.

This seems to be a change of tact 🙂 The original post was about a data model for lab notebooks . There is no reason why data streams can not be structured. However in reading the rest of Cameron’s post there would appear to be a 4th point of separation of modeling experiments, following on from the 3 presented here.

4. The publication of the data

Here we start to see how the different motivations are driving our views. What I want here is a marker on a web document that says ‘I am a scientific experiment’ (page was a more term to use – I simply mean any web document, generally accessed through discrete web pages). This will allow aggregation and distribution of the notebook a la PostGenomic or Chemical Blogspace. To me this is more important than the format of the underlying data. If I can find interesting data I will probably put the work into extracting it in a form useful to me To Frank I suspect the aggregation and indexing is a peripheral issue. If the data isn’t in a agreed format it isn’t useful for him.

This comment seems to re-iterate the 4th level of separation, structuring the experimental data is separate from publishing it. I do not see any reason why a document that contained structured data could not be embedded in a wiki,blog, lump of RDF or whatever. Once you have found it by whatever publication mechanism and arrived at the data, it is going to be alot easier to do interesting stuff with it, if it is in a common structure. Imagine the scenario of 10, in fact lets say 500 to cross the barrier of humans doing it faster, laboratory’s doing the same type of experiment. Would it not be cool if you could write one app to interegate 500 open lab books with one input structure, instead of 500 file parsers which would then be placed into a common format anyway, to do some cool meta-analysis on what protocol produced the best results out of the 500?

The table!

Again this is a central user interface issue for us. Capturing an experiment in the wet lab, whether noting it as it happens or planning what you are going to in advance, is often most easily done with a table. Tables are not well implemented in the wiki and blog frameworks we are using for these systems. Therefore providing a table to capture the experiment is critical if you actually want anyone to use your system. Our users consistently identify this as the single biggest barrier to them using our system.

A table is a visual summary of your experiment. In order to produce a table you have to think about what you are recording and model the table accordingly in advance. Structurally this is more efficiently achieved at the model layer. Visually, there is no question that a summary of what you captured works well via a table. Is a table the best mechanism? it is certainly easier while entering information via a pen on paper. Ultimately if you abstract far enough back a table is column separated values. FuGE provides a mechanism to define matrices of data (tables) which can be presented to the user.

Now the heavyweight approach to this is to say; ‘That’s why you need a data model. Once you have that you can generate a nice web form to capture the necessary data’. The problem with this comes when you do something slightly different. As an example I had a template set up in our system for capturing the setup of SDS-PAGE gels. This would go and look for anything that tagged as ‘protein’ as potential samples and present these in a drop down menu. This was fine until the day I wanted to run a DNA-protein conjugate on the gel. Essentially I had broken my own data model. This could be fixed, and I did fix it, by changing the way my template looked for potential samples. But in the cut and thrust of real lab work (as opposed to an academic pottering under sufferance of his students) this isn’t feasible. We can’t extend the data model every time we do something new – we are always doing something new.
The work is in developing a solid data model in the first place 🙂 Restricting inputs to a particular material in this case proteins, wont work if you try to add DNA. You will get no argument from the here. This is why FuGE (and GelML) have “material” or “data” inputs to a protocol allowing any type of material and therefore any experiment to be captured. On the other hand you can restrict the data model (the drop down box) to be only proteins. This does not break the model. Instead the model stays the same and you add more terms to you ontology or controlled vocabulary. In response to “we are always doing something new”, I would not doubt this. However, I would suggest you are always doing something new inside the structure of input=some_material (and or some_data), apply process, output some_material (and or some_data).

[FG]….FuGE is designed so that it provides a generic structure which can then be described or further specialised by the user/application by extending the model itself or by using cv’s/ontologies or free text. This provides the flexibility and in theory future proof.

But does this require that the user does the extension every time they move on to something new. As a matter of interest, how much time and effort went into agreeing the GelML? Is it practical to do this extension over and over again? And who will fund it?

There is no question that to pick a particular technology or process and model it takes time. To answer your question GelML probably took 2 years to complete, which is not trivial. However FuGE the data model for experiments – analogous to what you proposed in your original post, probably has taken close to 5 years of development with a larger number of developers than GelML. I could envision that creating a lab note book with FuGE as an underlying model you could re-use these extensions – like pluging in specific experiments to your generic lab book. Conforming to a common structure will only allow this plug-in scenario to be achieved, whether it is FuGE or another model.  Funding, certainly, who will fund it? All the main funders are starting to say we should make our data available but provide very little monetary incentive to do so. GelML was not funded, we did it out of the goodness of our hearts and the greater good.

My concern is that achieving added value requires the controlled vocabulary. If we are going to just end up using free text because a cv doesn’t exist for the experiment we are doing then why use a complex data structure?

You are also correct in your assertion in that the added value, or the semantics is in the ontology not the data model. Using a data model allows you to understand certain information contained in the structure; that is is a material, that it is a protocol, that it is a piece of equipment. The ontology allows you to say specifically what it is and what it means. This does present a catch 22 in that without the ontology it is difficult to add semantics, is free text more suitable, I would say no. Its easier and you will understand it. By the very notion that you are making your labbook available you want other people and computers to interpret it and understand it. For example, are the free text terms, 1D, gel electrophoresis, gel, matrix separation, electrophoresis, all referring to the same thing that you use SDS-PAGE to refer to? You might assume so, I might have implied otherwise. You cant tell unless they are associated with meaning – free text has no meaning, only assumed interpretation. This is the motivation behind OBI – the ontology for Biomedical Investigations. This probably suffers from the same labeling as FuGE as it is science experiments, not just biology.

So I will start with re-pointing what I believe to be the areas of conflation within these discussions
  1. the representation of experiments – the data model
  2. the presentation or level of abstraction to the user (probably some what dependent on 3.)
  3. the implementation of the data model
  4. the publication of the data (Notification, RSS etc.)

FuGE itself is only applicable to point 1. It will provide a structure to represent experiments. That’s it. I believe it is applicable to a lab note book. However there is no glossing over the fact that there needs to be an abstraction over the model (2) dependent on (3) to allow it to be used by scientists and to make this a reality – this is work that has to be done and its not me offering to do it either 🙂 Once this is in place it should be relatively trivial to publish or notify others of experiments (4).


The First MIBBI Workshop: Day 1

MIBBI is a registry of scientific experiment reporting guidelines with the idea to foster a foundry of best practice to further develop and encourage modular development and re-use of reporting guidelines. The first workshop is being held at the EBI on the 2nd – 3rd April 2008 and is a relatively closed workshop to those developers and guidelines that are registered on the site. The schedule for day one is a whistle stop tour consisting of 5 min talks (adjusting for an academics interpretation of what 5 minutes means) for all the guidelines that exist, their scope and the people behind them. Due to this I am not going to comment on individual talks. I presented two talks during the day. One on CARMEN and the development of the MINI: Electrophysiology reporting guidelines, and one, standing in for Andy Jones on FuGE.

I tried sharing these slides via google presentation, they looked quite nice. However, wordpress does not seem to allow them to embed. So I put them on slide share instead. These set the tone for the discussions for the afternoon and tomorrow.


A data model for life-science experiments; FuGE

This post may be one in a series of responses to Cameron’s post on “Proposing a data model for Open Notebooks“. When I originally read this post I commented on the fact that a data model for experiments actually exists and that he may get some mileage out of it rather than starting from scratch and re-creating the wheel. Several discussions have followed on from this original post and Neil has picked up on it as well, with sentiments that I agree with.

I think a large part of this discussion confuses and conflates 3 issues which I believe to be separate;

  1. the representation of experiments – the data model
  2. the presentation or level of abstraction to the user (probably some what dependent on 3.)
  3. the implementation of the data model

With these three issues in mind, to start with, I am going back to the original post and respond to some of the comments.

What I’m suggesting is a standard format to describe experiments;…

A “standard” in the true sense of the word (established by consensus and approved by a recognized body) already exists to describe life-science experiments. It is a data model represented in UML called FuGE.

…..a default format for online notebooks. The object is to do a number of things. Firstly identify the page(s) as being an online laboratory notebook so that they can be aggregated or auto-processed as appropriate.

I see this as two different and separate things, the data model which represents experiments, and the presentation of the model to the user, in this case described as an online notebook. Page numbers are an arbitrary visual aid, they are not integral to modelling experiments

…Secondly to make rich metadata available in a human readable and machine processable form making mashups and other things possible using tools such as Yahoo! Pipes, Dapper, and the growing range of other interesting tools, but not to impose any unnecessary limitations on what that metadata might look like. ..

I am not going to deal with metadata here, as the post will probably be long enough. However, traditionally, metadata, (cv’s and ontologies) have been used to add specificity or meaning to the structured data. The choice of the metadata to use (or build) will be dependent on the application.

Another issue is the tables. My original thinking was that if we had a data model for tables then most of our problems would go away.

I am not sure I agree here. What is a table? I see it as a particular visual display mechanism that you have chosen to represent you results. The results can be modelled more accurately within the data model such as chemical-has_measurement, measurement has_numerical value and has_unit. I believe this statement is confusing the visual presentation of data with structuring the data.
However the argument against still stands. Anything that requires a fixed vocabulary is going to break

Well, anything that requires a fixed vocabulary is less flexible, breaking is something different. If it breaks doing the job it was designed to do then this is a problem. If it breaks when applied to a different application, then well, it was not designed for that application in the first place. FuGE is designed so that it provides a generic structure which can then be described or further specialised by the user/application by extending the model itself or by using cv’s/ontologies or free text. This provides the flexibility and in theory future proof.

Overall an experiment has inputs and outputs. These may be data or material objects. Procedures take inputs and generate outputs.[..] Broadly speaking there seem to be three types of item; material objects , data, and procedures (possibly also comments). For each of these we require a provenance (author), and a date

I would agree with you assessment of what classes are needed. This corresponds to what FuGE contains as illustrated in the digram below (click on image to see original)


In summary, the position I want to present is that FuGE is a data model to represent scientific experiments. Several domains are using it to represent their experiments from traditional biology/molecular biology to neurophysiology. I believe FuGE could form the underlying model for a “notebook” via an abstraction/presentation layer to the user. In how should it be implemented, blog, wiki, database, latex, XML, RDF, OWL, I am not going to hypothesis. However, a database implementation of the FuGE schema is already in development called SyMBA which abstracts away from the user presenting simple web forms to fill out the XML which is then stored as a relation database.


Minimum Information about a Neuroscience Investigation (MINI)

The idea behind the CARMEN project is that we provide a system to store electrophysiology data and analysis services so that data can be shared and analysed in the “Neuro-cloud”. An important factor in realising this system is that the stored data and the services have to be described in a way that is both human and computationally amenable. The first stage of this is agreeing what information should actually be ascribed to the data. In other words, the balance between what the experimentalist want to say about their data and what informaticians need to know about a particular data set in order to perform their analysis. To this end we have defined what we believe to be the minimum information that must be ascribed to an electrophysiology experiment for submission to the CARMEN system. It follows the now well practised format of MIAME and MIAPE minimum reporting requirements. In the first instance the document only represents consensus within the CARMEN consortium. However, it could form the basis of a community reporting standard for electrophysiology experiments. The document is available on Nature preceedings at the following URL and comments and opinions are encouraged.