Archive for April, 2008
Several months ago – about 3, I made a public commitment to make the data I have generated during my Phd open and available online. Well I have not ignored this and in the interim I have been investigating various ways I can do this. Not only do I want to make it available but I want to structure it in a standard form, namely the gelML format. In addition, I was involved in developing it the specification and therefore, I have somewhat an obligation to use it. As it is an XML transfer format I needed to be make changes and revision it, like developing code, so in that sense recording the data on a wiki or blog would not be appropriate. For this reason I have chosen to create a google code project for gel electrophoresis data and do everything in subversion. You can browse the subversion repository or check it out anonymously. The geML file that will eventually (as its still very much a work in progress) contain the data is here. As I am doing this, I though I might as well publish my lab book while I was at it. This will be done using LateX and the pdf that gets generated can be found here.
To date, this is still a work in progress and a reverse engineering project, as the experiments are not being done live. It may take a while to complete but in the end I hope presenting my data in gelML and making my labbook available can be more of a benefit than decomposing for years on cellulose.
I was trying to work out a suitable journal to where I could submit a paper on sepCV, the PSI ontology for sample preparation and separation techniques. I found my self drawing up a table, so I thought I would blog it. My initial remit is that is should be in a proteomics relevant journal as well as bioinformatics, as we are trying to encourage a greater community contribution for term collection. In this respect it has to be open access. I would also prefer the journal to accept LateX instead on proprietary formats such as word. I was really disappointed with the number of journals that only accept word documents, even PlosOne refuses anything other than word or rft, tut, tut.
Based on this loose criteria, Proteome science comes top closely followed by Journal of Proteomics and Bioinformatics (if I sacrifice using word for open access). BMC Bioinformatics also ticks all the boxes but it misses out on the proteomics audience. The table below also includes Impact factor, but I did not really take that into consideration. Wouldn’t it be nice is there was an app that you could just enter in criteria like this, target audience, submission format, copyright, etc and get back the journals that meet these requirements. Something like this would save me an afternoon trawling through the web building spreadsheets.
How do you select your journal?
I have struggled to keep up with this discussion, with excuses ranging from attending workshops, major release deadline on the horizon (now past) and a post-mortem on the release schedule, to attending (only to please the parents) my graduation ceremony. I am only now starting to catch up on my feeds but dauntingly Google reader tells me I have 1000+, moan moan moan,
Anyway, in Cameron’s last post on the subject he points to all the previous discussions and other commentary on the topic. I will pick up from his last post and respond to some his responses to the responses to responses.
I still feel that we are trying to describe and achieve different things, but that this discussion is a great way of getting to the bottom of this and achieving some clarity in our description and language.
This certainly may be the case, I want to present FuGE as something that is worth considering rather than re-inventing. However, this is no denying that FuGE is a datamodel and does not come with a high degree of tool support or nice user interfaces, which Cameron is crying out for, as are most lab scientists from a usability point of view.
I got off to a very bad start here. I should have used the word ‘capture’ here. This to me is about capturing the data streams that come out of lab work.
This seems to be a change of tact 🙂 The original post was about a data model for lab notebooks . There is no reason why data streams can not be structured. However in reading the rest of Cameron’s post there would appear to be a 4th point of separation of modeling experiments, following on from the 3 presented here.
4. The publication of the data
Here we start to see how the different motivations are driving our views. What I want here is a marker on a web document that says ‘I am a scientific experiment’ (page was a more term to use – I simply mean any web document, generally accessed through discrete web pages). This will allow aggregation and distribution of the notebook a la PostGenomic or Chemical Blogspace. To me this is more important than the format of the underlying data. If I can find interesting data I will probably put the work into extracting it in a form useful to me To Frank I suspect the aggregation and indexing is a peripheral issue. If the data isn’t in a agreed format it isn’t useful for him.
This comment seems to re-iterate the 4th level of separation, structuring the experimental data is separate from publishing it. I do not see any reason why a document that contained structured data could not be embedded in a wiki,blog, lump of RDF or whatever. Once you have found it by whatever publication mechanism and arrived at the data, it is going to be alot easier to do interesting stuff with it, if it is in a common structure. Imagine the scenario of 10, in fact lets say 500 to cross the barrier of humans doing it faster, laboratory’s doing the same type of experiment. Would it not be cool if you could write one app to interegate 500 open lab books with one input structure, instead of 500 file parsers which would then be placed into a common format anyway, to do some cool meta-analysis on what protocol produced the best results out of the 500?
Again this is a central user interface issue for us. Capturing an experiment in the wet lab, whether noting it as it happens or planning what you are going to in advance, is often most easily done with a table. Tables are not well implemented in the wiki and blog frameworks we are using for these systems. Therefore providing a table to capture the experiment is critical if you actually want anyone to use your system. Our users consistently identify this as the single biggest barrier to them using our system.
A table is a visual summary of your experiment. In order to produce a table you have to think about what you are recording and model the table accordingly in advance. Structurally this is more efficiently achieved at the model layer. Visually, there is no question that a summary of what you captured works well via a table. Is a table the best mechanism? it is certainly easier while entering information via a pen on paper. Ultimately if you abstract far enough back a table is column separated values. FuGE provides a mechanism to define matrices of data (tables) which can be presented to the user.
Now the heavyweight approach to this is to say; ‘That’s why you need a data model. Once you have that you can generate a nice web form to capture the necessary data’. The problem with this comes when you do something slightly different. As an example I had a template set up in our system for capturing the setup of SDS-PAGE gels. This would go and look for anything that tagged as ‘protein’ as potential samples and present these in a drop down menu. This was fine until the day I wanted to run a DNA-protein conjugate on the gel. Essentially I had broken my own data model. This could be fixed, and I did fix it, by changing the way my template looked for potential samples. But in the cut and thrust of real lab work (as opposed to an academic pottering under sufferance of his students) this isn’t feasible. We can’t extend the data model every time we do something new – we are always doing something new.
[FG]….FuGE is designed so that it provides a generic structure which can then be described or further specialised by the user/application by extending the model itself or by using cv’s/ontologies or free text. This provides the flexibility and in theory future proof.
But does this require that the user does the extension every time they move on to something new. As a matter of interest, how much time and effort went into agreeing the GelML? Is it practical to do this extension over and over again? And who will fund it?
There is no question that to pick a particular technology or process and model it takes time. To answer your question GelML probably took 2 years to complete, which is not trivial. However FuGE the data model for experiments – analogous to what you proposed in your original post, probably has taken close to 5 years of development with a larger number of developers than GelML. I could envision that creating a lab note book with FuGE as an underlying model you could re-use these extensions – like pluging in specific experiments to your generic lab book. Conforming to a common structure will only allow this plug-in scenario to be achieved, whether it is FuGE or another model. Funding, certainly, who will fund it? All the main funders are starting to say we should make our data available but provide very little monetary incentive to do so. GelML was not funded, we did it out of the goodness of our hearts and the greater good.
My concern is that achieving added value requires the controlled vocabulary. If we are going to just end up using free text because a cv doesn’t exist for the experiment we are doing then why use a complex data structure?
You are also correct in your assertion in that the added value, or the semantics is in the ontology not the data model. Using a data model allows you to understand certain information contained in the structure; that is is a material, that it is a protocol, that it is a piece of equipment. The ontology allows you to say specifically what it is and what it means. This does present a catch 22 in that without the ontology it is difficult to add semantics, is free text more suitable, I would say no. Its easier and you will understand it. By the very notion that you are making your labbook available you want other people and computers to interpret it and understand it. For example, are the free text terms, 1D, gel electrophoresis, gel, matrix separation, electrophoresis, all referring to the same thing that you use SDS-PAGE to refer to? You might assume so, I might have implied otherwise. You cant tell unless they are associated with meaning – free text has no meaning, only assumed interpretation. This is the motivation behind OBI – the ontology for Biomedical Investigations. This probably suffers from the same labeling as FuGE as it is science experiments, not just biology.
- the representation of experiments – the data model
- the presentation or level of abstraction to the user (probably some what dependent on 3.)
- the implementation of the data model
- the publication of the data (Notification, RSS etc.)
FuGE itself is only applicable to point 1. It will provide a structure to represent experiments. That’s it. I believe it is applicable to a lab note book. However there is no glossing over the fact that there needs to be an abstraction over the model (2) dependent on (3) to allow it to be used by scientists and to make this a reality – this is work that has to be done and its not me offering to do it either 🙂 Once this is in place it should be relatively trivial to publish or notify others of experiments (4).
The main focus of the day was discussing what it means to be registered on the MIBBI site and be a member of the Foundry. As a straw man, rather than starting from scratch, we used the OBO foundry principles to see if they could be applied to reporting check-lists, and came up with a draft set of principles. A full “official” report of the workshop should be forthcoming.
As a break from the rigorous standards development and process of reporting check-lists we went to a local restaurant and experienced anarchy when it came to understanding the menu. The meal itself was brilliant, however some degree of semantic extraction had to be applied to the menu to actually understand what we were ordering – No fancy algorithms were applied here, other than aksing the staff so explain it in English! And we think we have problems in the life-sciences! An example of the menu can be seen in the photo graphs which also include the delegates the meal and the venue.
|First MIBBI workshop 2-3rd April 2008|
MIBBI is a registry of scientific experiment reporting guidelines with the idea to foster a foundry of best practice to further develop and encourage modular development and re-use of reporting guidelines. The first workshop is being held at the EBI on the 2nd – 3rd April 2008 and is a relatively closed workshop to those developers and guidelines that are registered on the site. The schedule for day one is a whistle stop tour consisting of 5 min talks (adjusting for an academics interpretation of what 5 minutes means) for all the guidelines that exist, their scope and the people behind them. Due to this I am not going to comment on individual talks. I presented two talks during the day. One on CARMEN and the development of the MINI: Electrophysiology reporting guidelines, and one, standing in for Andy Jones on FuGE.
I tried sharing these slides via google presentation, they looked quite nice. However, wordpress does not seem to allow them to embed. So I put them on slide share instead. These set the tone for the discussions for the afternoon and tomorrow.