Modeling experiments and presenting the information

I have struggled to keep up with this discussion, with excuses ranging from attending workshops, major release deadline on the horizon (now past) and a post-mortem on the release schedule, to attending (only to please the parents) my graduation ceremony. I am only now starting to catch up on my feeds but dauntingly Google reader tells me I have 1000+, moan moan moan,
Anyway, in Cameron’s last post on the subject he points to all the previous discussions and other commentary on the topic. I will pick up from his last post and respond to some his responses to the responses to responses.

I still feel that we are trying to describe and achieve different things, but that this discussion is a great way of getting to the bottom of this and achieving some clarity in our description and language.

This certainly may be the case, I want to present FuGE as something that is worth considering rather than re-inventing. However, this is no denying that FuGE is a datamodel and does not come with a high degree of tool support or nice user interfaces, which Cameron is crying out for, as are most lab scientists from a usability point of view.

I got off to a very bad start here. I should have used the word ‘capture’ here. This to me is about capturing the data streams that come out of lab work.

This seems to be a change of tact 🙂 The original post was about a data model for lab notebooks . There is no reason why data streams can not be structured. However in reading the rest of Cameron’s post there would appear to be a 4th point of separation of modeling experiments, following on from the 3 presented here.

4. The publication of the data

Here we start to see how the different motivations are driving our views. What I want here is a marker on a web document that says ‘I am a scientific experiment’ (page was a more term to use – I simply mean any web document, generally accessed through discrete web pages). This will allow aggregation and distribution of the notebook a la PostGenomic or Chemical Blogspace. To me this is more important than the format of the underlying data. If I can find interesting data I will probably put the work into extracting it in a form useful to me To Frank I suspect the aggregation and indexing is a peripheral issue. If the data isn’t in a agreed format it isn’t useful for him.

This comment seems to re-iterate the 4th level of separation, structuring the experimental data is separate from publishing it. I do not see any reason why a document that contained structured data could not be embedded in a wiki,blog, lump of RDF or whatever. Once you have found it by whatever publication mechanism and arrived at the data, it is going to be alot easier to do interesting stuff with it, if it is in a common structure. Imagine the scenario of 10, in fact lets say 500 to cross the barrier of humans doing it faster, laboratory’s doing the same type of experiment. Would it not be cool if you could write one app to interegate 500 open lab books with one input structure, instead of 500 file parsers which would then be placed into a common format anyway, to do some cool meta-analysis on what protocol produced the best results out of the 500?

The table!

Again this is a central user interface issue for us. Capturing an experiment in the wet lab, whether noting it as it happens or planning what you are going to in advance, is often most easily done with a table. Tables are not well implemented in the wiki and blog frameworks we are using for these systems. Therefore providing a table to capture the experiment is critical if you actually want anyone to use your system. Our users consistently identify this as the single biggest barrier to them using our system.

A table is a visual summary of your experiment. In order to produce a table you have to think about what you are recording and model the table accordingly in advance. Structurally this is more efficiently achieved at the model layer. Visually, there is no question that a summary of what you captured works well via a table. Is a table the best mechanism? it is certainly easier while entering information via a pen on paper. Ultimately if you abstract far enough back a table is column separated values. FuGE provides a mechanism to define matrices of data (tables) which can be presented to the user.

Now the heavyweight approach to this is to say; ‘That’s why you need a data model. Once you have that you can generate a nice web form to capture the necessary data’. The problem with this comes when you do something slightly different. As an example I had a template set up in our system for capturing the setup of SDS-PAGE gels. This would go and look for anything that tagged as ‘protein’ as potential samples and present these in a drop down menu. This was fine until the day I wanted to run a DNA-protein conjugate on the gel. Essentially I had broken my own data model. This could be fixed, and I did fix it, by changing the way my template looked for potential samples. But in the cut and thrust of real lab work (as opposed to an academic pottering under sufferance of his students) this isn’t feasible. We can’t extend the data model every time we do something new – we are always doing something new.

The work is in developing a solid data model in the first place 🙂 Restricting inputs to a particular material in this case proteins, wont work if you try to add DNA. You will get no argument from the here. This is why FuGE (and GelML) have “material” or “data” inputs to a protocol allowing any type of material and therefore any experiment to be captured. On the other hand you can restrict the data model (the drop down box) to be only proteins. This does not break the model. Instead the model stays the same and you add more terms to you ontology or controlled vocabulary. In response to “we are always doing something new”, I would not doubt this. However, I would suggest you are always doing something new inside the structure of input=some_material (and or some_data), apply process, output some_material (and or some_data).

[FG]….FuGE is designed so that it provides a generic structure which can then be described or further specialised by the user/application by extending the model itself or by using cv’s/ontologies or free text. This provides the flexibility and in theory future proof.

But does this require that the user does the extension every time they move on to something new. As a matter of interest, how much time and effort went into agreeing the GelML? Is it practical to do this extension over and over again? And who will fund it?

There is no question that to pick a particular technology or process and model it takes time. To answer your question GelML probably took 2 years to complete, which is not trivial. However FuGE the data model for experiments – analogous to what you proposed in your original post, probably has taken close to 5 years of development with a larger number of developers than GelML. I could envision that creating a lab note book with FuGE as an underlying model you could re-use these extensions – like pluging in specific experiments to your generic lab book. Conforming to a common structure will only allow this plug-in scenario to be achieved, whether it is FuGE or another model. Funding, certainly, who will fund it? All the main funders are starting to say we should make our data available but provide very little monetary incentive to do so. GelML was not funded, we did it out of the goodness of our hearts and the greater good.

My concern is that achieving added value requires the controlled vocabulary. If we are going to just end up using free text because a cv doesn’t exist for the experiment we are doing then why use a complex data structure?

You are also correct in your assertion in that the added value, or the semantics is in the ontology not the data model. Using a data model allows you to understand certain information contained in the structure; that is is a material, that it is a protocol, that it is a piece of equipment. The ontology allows you to say specifically what it is and what it means. This does present a catch 22 in that without the ontology it is difficult to add semantics, is free text more suitable, I would say no. Its easier and you will understand it. By the very notion that you are making your labbook available you want other people and computers to interpret it and understand it. For example, are the free text terms, 1D, gel electrophoresis, gel, matrix separation, electrophoresis, all referring to the same thing that you use SDS-PAGE to refer to? You might assume so, I might have implied otherwise. You cant tell unless they are associated with meaning – free text has no meaning, only assumed interpretation. This is the motivation behind OBI – the ontology for Biomedical Investigations. This probably suffers from the same labeling as FuGE as it is science experiments, not just biology.

Summary

So I will start with re-pointing what I believe to be the areas of conflation within these discussions

the representation of experiments – the data model
the presentation or level of abstraction to the user (probably some what dependent on 3.)
the implementation of the data model
the publication of the data (Notification, RSS etc.)

FuGE itself is only applicable to point 1. It will provide a structure to represent experiments. That’s it. I believe it is applicable to a lab note book. However there is no glossing over the fact that there needs to be an abstraction over the model (2) dependent on (3) to allow it to be used by scientists and to make this a reality – this is work that has to be done and its not me offering to do it either 🙂 Once this is in place it should be relatively trivial to publish or notify others of experiments (4).

This entry was posted on April 14, 2008, 4:53 pm and is filed under bioinformatics, data standards, ontology, open data, open science. You can follow any responses to this entry through RSS 2.0. You can leave a response, or trackback from your own site.

#1 by Duncan on April 15, 2008 - 7:51 am

Hi Frank, nice post. I like your four point summary, but for many bench-based scientists it is hard for them to seperate representation and presenation of anything, experiments included…

#2 by Cameron Neylon on April 15, 2008 - 9:36 am

Duncans point may be the key one. I think its similar to what I mean when I say that we need to get our heads around what the user’s data model is (remembering that these may be different for the person doing the experiment and the person using the description of the experiment later).

But Frank’s point about conflation is precisely that I am failing to separate the representation and presentation I think.

Science in the open » More on FuGE and data models for lab notebooks

fgibson.com