Several months ago (about three), I made a public commitment to make the data I generated during my PhD open and available online. I have not ignored this, and in the interim I have been investigating various ways of doing it. Not only do I want to make the data available, I also want to structure it in a standard form, namely the gelML format. I was involved in developing the specification, so I feel somewhat obliged to use it. As gelML is an XML transfer format, I need to be able to make changes and revise the file, much like developing code, so recording the data on a wiki or blog would not be appropriate. For this reason I have chosen to create a Google Code project for gel electrophoresis data and do everything in Subversion. You can browse the Subversion repository or check it out anonymously. The gelML file that will eventually contain the data (it is still very much a work in progress) is here. While doing this, I thought I might as well publish my lab book too. This will be done using LaTeX, and the PDF that gets generated can be found here.
To date, this is still a work in progress and a reverse-engineering project, as the experiments are not being done live. It may take a while to complete, but in the end I hope that presenting my data in gelML and making my lab book available will be of more benefit than leaving it to decompose for years on cellulose.
This post may be one in a series of responses to Cameron’s post on “Proposing a data model for Open Notebooks”. When I originally read the post, I commented that a data model for experiments already exists and that he might get some mileage out of it rather than starting from scratch and reinventing the wheel. Several discussions have followed on from the original post, and Neil has picked up on it as well, with sentiments that I agree with.
I think a large part of this discussion conflates three issues which I believe to be separate:
- the representation of experiments – the data model
- the presentation or level of abstraction to the user (probably somewhat dependent on the third issue, the implementation)
- the implementation of the data model
With these three issues in mind, I am going back to the original post to respond to some of its points.
What I’m suggesting is a standard format to describe experiments;…
A “standard” in the true sense of the word (established by consensus and approved by a recognized body) already exists to describe life-science experiments. It is a data model represented in UML called FuGE.
…..a default format for online notebooks. The object is to do a number of things. Firstly identify the page(s) as being an online laboratory notebook so that they can be aggregated or auto-processed as appropriate.
I see these as two separate things: the data model, which represents experiments, and the presentation of the model to the user, in this case described as an online notebook. Page numbers are an arbitrary visual aid; they are not integral to modelling experiments.
…Secondly to make rich metadata available in a human readable and machine processable form making mashups and other things possible using tools such as Yahoo! Pipes, Dapper, and the growing range of other interesting tools, but not to impose any unnecessary limitations on what that metadata might look like. ..
I am not going to deal with metadata here, as the post will probably be long enough already. Traditionally, however, metadata (controlled vocabularies, or CVs, and ontologies) has been used to add specificity or meaning to the structured data. The choice of metadata to use (or build) will depend on the application.
Another issue is the tables. My original thinking was that if we had a data model for tables then most of our problems would go away.
I am not sure I agree here. What is a table? I see it as a particular visual display mechanism that you have chosen to represent your results. The results can be modelled more accurately within the data model, for example: chemical has_measurement, measurement has_numerical_value and has_unit. I believe this statement confuses the visual presentation of data with the structuring of data.
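To make the distinction concrete, here is a minimal sketch in Python. The class names (Chemical, Measurement) and the as_table helper are hypothetical, chosen only for illustration; they are not taken from FuGE or gelML, and the values are made up. The structured data exists on its own, and the table is just one view generated from it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Measurement:
    # measurement has_numerical_value and has_unit
    value: float
    unit: str


@dataclass
class Chemical:
    # chemical has_measurement (zero or more); hypothetical class, not from FuGE
    name: str
    measurements: List[Measurement] = field(default_factory=list)


def as_table(chemicals: List[Chemical]) -> List[List[str]]:
    """Render the structured data as rows; the table is only one possible view."""
    rows = [["chemical", "value", "unit"]]
    for chemical in chemicals:
        for m in chemical.measurements:
            rows.append([chemical.name, str(m.value), m.unit])
    return rows


# Illustrative values only, not real experimental data.
tris = Chemical("Tris", [Measurement(50.0, "mM")])
table = as_table([tris])
```

The same Chemical objects could just as easily be rendered as XML, a chart or a wiki page, which is why I would not treat the table itself as the thing to model.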
However the argument against still stands. Anything that requires a fixed vocabulary is going to break
Well, anything that requires a fixed vocabulary is less flexible; breaking is something different. If it breaks while doing the job it was designed to do, then that is a problem. If it breaks when applied to a different application, then it was not designed for that application in the first place. FuGE is designed to provide a generic structure which can then be described or further specialised by the user or application, by extending the model itself or by using CVs/ontologies or free text. This provides the flexibility and, in theory, makes it future-proof.
Overall an experiment has inputs and outputs. These may be data or material objects. Procedures take inputs and generate outputs.[..] Broadly speaking there seem to be three types of item; material objects , data, and procedures (possibly also comments). For each of these we require a provenance (author), and a date
I would agree with your assessment of what classes are needed. This corresponds to what FuGE contains, as illustrated in the diagram below.
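The three item types from the quoted passage can also be sketched in a few lines of Python. This is only a rough illustration under my own assumptions: the class and field names (Item, Material, Data, Procedure, author, created) are hypothetical and are not the actual FuGE class definitions, and the example values are placeholders.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Union


@dataclass
class Item:
    # Every item carries a provenance (author) and a date.
    name: str
    author: str
    created: date


@dataclass
class Material(Item):
    """A physical object, e.g. a sample."""


@dataclass
class Data(Item):
    """A data object, e.g. an image or a set of measurements."""


@dataclass
class Procedure(Item):
    # Procedures take inputs and generate outputs (materials or data).
    inputs: List[Union[Material, Data]] = field(default_factory=list)
    outputs: List[Union[Material, Data]] = field(default_factory=list)


# Placeholder values for illustration only.
when = date(2008, 1, 1)
sample = Material("protein sample", "author", when)
gel_image = Data("gel image", "author", when)
run_gel = Procedure("run gel", "author", when,
                    inputs=[sample], outputs=[gel_image])
```

Chaining procedures, so that the output of one becomes the input of the next, is what gives a record of the whole experiment rather than a set of disconnected pages.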
In summary, the position I want to present is that FuGE is a data model for representing scientific experiments. Several domains are using it to represent their experiments, from traditional biology and molecular biology to neurophysiology. I believe FuGE could form the underlying model for a “notebook”, presented to the user via an abstraction/presentation layer. As for how it should be implemented (blog, wiki, database, LaTeX, XML, RDF, OWL), I am not going to hypothesise. However, a database implementation of the FuGE schema, called SyMBA, is already in development. It abstracts the model away from the user, presenting simple web forms for filling out the XML, which is then stored in a relational database.