Archive for category data standards
Several months ago – about 3, I made a public commitment to make the data I have generated during my Phd open and available online. Well I have not ignored this and in the interim I have been investigating various ways I can do this. Not only do I want to make it available but I want to structure it in a standard form, namely the gelML format. In addition, I was involved in developing it the specification and therefore, I have somewhat an obligation to use it. As it is an XML transfer format I needed to be make changes and revision it, like developing code, so in that sense recording the data on a wiki or blog would not be appropriate. For this reason I have chosen to create a google code project for gel electrophoresis data and do everything in subversion. You can browse the subversion repository or check it out anonymously. The geML file that will eventually (as its still very much a work in progress) contain the data is here. As I am doing this, I though I might as well publish my lab book while I was at it. This will be done using LateX and the pdf that gets generated can be found here.
To date, this is still a work in progress and a reverse engineering project, as the experiments are not being done live. It may take a while to complete but in the end I hope presenting my data in gelML and making my labbook available can be more of a benefit than decomposing for years on cellulose.
I was trying to work out a suitable journal to where I could submit a paper on sepCV, the PSI ontology for sample preparation and separation techniques. I found my self drawing up a table, so I thought I would blog it. My initial remit is that is should be in a proteomics relevant journal as well as bioinformatics, as we are trying to encourage a greater community contribution for term collection. In this respect it has to be open access. I would also prefer the journal to accept LateX instead on proprietary formats such as word. I was really disappointed with the number of journals that only accept word documents, even PlosOne refuses anything other than word or rft, tut, tut.
Based on this loose criteria, Proteome science comes top closely followed by Journal of Proteomics and Bioinformatics (if I sacrifice using word for open access). BMC Bioinformatics also ticks all the boxes but it misses out on the proteomics audience. The table below also includes Impact factor, but I did not really take that into consideration. Wouldn’t it be nice is there was an app that you could just enter in criteria like this, target audience, submission format, copyright, etc and get back the journals that meet these requirements. Something like this would save me an afternoon trawling through the web building spreadsheets.
How do you select your journal?
I have struggled to keep up with this discussion, with excuses ranging from attending workshops, major release deadline on the horizon (now past) and a post-mortem on the release schedule, to attending (only to please the parents) my graduation ceremony. I am only now starting to catch up on my feeds but dauntingly Google reader tells me I have 1000+, moan moan moan,
Anyway, in Cameron’s last post on the subject he points to all the previous discussions and other commentary on the topic. I will pick up from his last post and respond to some his responses to the responses to responses.
I still feel that we are trying to describe and achieve different things, but that this discussion is a great way of getting to the bottom of this and achieving some clarity in our description and language.
This certainly may be the case, I want to present FuGE as something that is worth considering rather than re-inventing. However, this is no denying that FuGE is a datamodel and does not come with a high degree of tool support or nice user interfaces, which Cameron is crying out for, as are most lab scientists from a usability point of view.
I got off to a very bad start here. I should have used the word ‘capture’ here. This to me is about capturing the data streams that come out of lab work.
This seems to be a change of tact 🙂 The original post was about a data model for lab notebooks . There is no reason why data streams can not be structured. However in reading the rest of Cameron’s post there would appear to be a 4th point of separation of modeling experiments, following on from the 3 presented here.
4. The publication of the data
Here we start to see how the different motivations are driving our views. What I want here is a marker on a web document that says ‘I am a scientific experiment’ (page was a more term to use – I simply mean any web document, generally accessed through discrete web pages). This will allow aggregation and distribution of the notebook a la PostGenomic or Chemical Blogspace. To me this is more important than the format of the underlying data. If I can find interesting data I will probably put the work into extracting it in a form useful to me To Frank I suspect the aggregation and indexing is a peripheral issue. If the data isn’t in a agreed format it isn’t useful for him.
This comment seems to re-iterate the 4th level of separation, structuring the experimental data is separate from publishing it. I do not see any reason why a document that contained structured data could not be embedded in a wiki,blog, lump of RDF or whatever. Once you have found it by whatever publication mechanism and arrived at the data, it is going to be alot easier to do interesting stuff with it, if it is in a common structure. Imagine the scenario of 10, in fact lets say 500 to cross the barrier of humans doing it faster, laboratory’s doing the same type of experiment. Would it not be cool if you could write one app to interegate 500 open lab books with one input structure, instead of 500 file parsers which would then be placed into a common format anyway, to do some cool meta-analysis on what protocol produced the best results out of the 500?
Again this is a central user interface issue for us. Capturing an experiment in the wet lab, whether noting it as it happens or planning what you are going to in advance, is often most easily done with a table. Tables are not well implemented in the wiki and blog frameworks we are using for these systems. Therefore providing a table to capture the experiment is critical if you actually want anyone to use your system. Our users consistently identify this as the single biggest barrier to them using our system.
A table is a visual summary of your experiment. In order to produce a table you have to think about what you are recording and model the table accordingly in advance. Structurally this is more efficiently achieved at the model layer. Visually, there is no question that a summary of what you captured works well via a table. Is a table the best mechanism? it is certainly easier while entering information via a pen on paper. Ultimately if you abstract far enough back a table is column separated values. FuGE provides a mechanism to define matrices of data (tables) which can be presented to the user.
Now the heavyweight approach to this is to say; ‘That’s why you need a data model. Once you have that you can generate a nice web form to capture the necessary data’. The problem with this comes when you do something slightly different. As an example I had a template set up in our system for capturing the setup of SDS-PAGE gels. This would go and look for anything that tagged as ‘protein’ as potential samples and present these in a drop down menu. This was fine until the day I wanted to run a DNA-protein conjugate on the gel. Essentially I had broken my own data model. This could be fixed, and I did fix it, by changing the way my template looked for potential samples. But in the cut and thrust of real lab work (as opposed to an academic pottering under sufferance of his students) this isn’t feasible. We can’t extend the data model every time we do something new – we are always doing something new.
[FG]….FuGE is designed so that it provides a generic structure which can then be described or further specialised by the user/application by extending the model itself or by using cv’s/ontologies or free text. This provides the flexibility and in theory future proof.
But does this require that the user does the extension every time they move on to something new. As a matter of interest, how much time and effort went into agreeing the GelML? Is it practical to do this extension over and over again? And who will fund it?
There is no question that to pick a particular technology or process and model it takes time. To answer your question GelML probably took 2 years to complete, which is not trivial. However FuGE the data model for experiments – analogous to what you proposed in your original post, probably has taken close to 5 years of development with a larger number of developers than GelML. I could envision that creating a lab note book with FuGE as an underlying model you could re-use these extensions – like pluging in specific experiments to your generic lab book. Conforming to a common structure will only allow this plug-in scenario to be achieved, whether it is FuGE or another model. Funding, certainly, who will fund it? All the main funders are starting to say we should make our data available but provide very little monetary incentive to do so. GelML was not funded, we did it out of the goodness of our hearts and the greater good.
My concern is that achieving added value requires the controlled vocabulary. If we are going to just end up using free text because a cv doesn’t exist for the experiment we are doing then why use a complex data structure?
You are also correct in your assertion in that the added value, or the semantics is in the ontology not the data model. Using a data model allows you to understand certain information contained in the structure; that is is a material, that it is a protocol, that it is a piece of equipment. The ontology allows you to say specifically what it is and what it means. This does present a catch 22 in that without the ontology it is difficult to add semantics, is free text more suitable, I would say no. Its easier and you will understand it. By the very notion that you are making your labbook available you want other people and computers to interpret it and understand it. For example, are the free text terms, 1D, gel electrophoresis, gel, matrix separation, electrophoresis, all referring to the same thing that you use SDS-PAGE to refer to? You might assume so, I might have implied otherwise. You cant tell unless they are associated with meaning – free text has no meaning, only assumed interpretation. This is the motivation behind OBI – the ontology for Biomedical Investigations. This probably suffers from the same labeling as FuGE as it is science experiments, not just biology.
- the representation of experiments – the data model
- the presentation or level of abstraction to the user (probably some what dependent on 3.)
- the implementation of the data model
- the publication of the data (Notification, RSS etc.)
FuGE itself is only applicable to point 1. It will provide a structure to represent experiments. That’s it. I believe it is applicable to a lab note book. However there is no glossing over the fact that there needs to be an abstraction over the model (2) dependent on (3) to allow it to be used by scientists and to make this a reality – this is work that has to be done and its not me offering to do it either 🙂 Once this is in place it should be relatively trivial to publish or notify others of experiments (4).
The main focus of the day was discussing what it means to be registered on the MIBBI site and be a member of the Foundry. As a straw man, rather than starting from scratch, we used the OBO foundry principles to see if they could be applied to reporting check-lists, and came up with a draft set of principles. A full “official” report of the workshop should be forthcoming.
As a break from the rigorous standards development and process of reporting check-lists we went to a local restaurant and experienced anarchy when it came to understanding the menu. The meal itself was brilliant, however some degree of semantic extraction had to be applied to the menu to actually understand what we were ordering – No fancy algorithms were applied here, other than aksing the staff so explain it in English! And we think we have problems in the life-sciences! An example of the menu can be seen in the photo graphs which also include the delegates the meal and the venue.
|First MIBBI workshop 2-3rd April 2008|
MIBBI is a registry of scientific experiment reporting guidelines with the idea to foster a foundry of best practice to further develop and encourage modular development and re-use of reporting guidelines. The first workshop is being held at the EBI on the 2nd – 3rd April 2008 and is a relatively closed workshop to those developers and guidelines that are registered on the site. The schedule for day one is a whistle stop tour consisting of 5 min talks (adjusting for an academics interpretation of what 5 minutes means) for all the guidelines that exist, their scope and the people behind them. Due to this I am not going to comment on individual talks. I presented two talks during the day. One on CARMEN and the development of the MINI: Electrophysiology reporting guidelines, and one, standing in for Andy Jones on FuGE.
I tried sharing these slides via google presentation, they looked quite nice. However, wordpress does not seem to allow them to embed. So I put them on slide share instead. These set the tone for the discussions for the afternoon and tomorrow.
This post may be one in a series of responses to Cameron’s post on “Proposing a data model for Open Notebooks“. When I originally read this post I commented on the fact that a data model for experiments actually exists and that he may get some mileage out of it rather than starting from scratch and re-creating the wheel. Several discussions have followed on from this original post and Neil has picked up on it as well, with sentiments that I agree with.
I think a large part of this discussion confuses and conflates 3 issues which I believe to be separate;
- the representation of experiments – the data model
- the presentation or level of abstraction to the user (probably some what dependent on 3.)
- the implementation of the data model
With these three issues in mind, to start with, I am going back to the original post and respond to some of the comments.
What I’m suggesting is a standard format to describe experiments;…
A “standard” in the true sense of the word (established by consensus and approved by a recognized body) already exists to describe life-science experiments. It is a data model represented in UML called FuGE.
…..a default format for online notebooks. The object is to do a number of things. Firstly identify the page(s) as being an online laboratory notebook so that they can be aggregated or auto-processed as appropriate.
I see this as two different and separate things, the data model which represents experiments, and the presentation of the model to the user, in this case described as an online notebook. Page numbers are an arbitrary visual aid, they are not integral to modelling experiments
…Secondly to make rich metadata available in a human readable and machine processable form making mashups and other things possible using tools such as Yahoo! Pipes, Dapper, and the growing range of other interesting tools, but not to impose any unnecessary limitations on what that metadata might look like. ..
I am not going to deal with metadata here, as the post will probably be long enough. However, traditionally, metadata, (cv’s and ontologies) have been used to add specificity or meaning to the structured data. The choice of the metadata to use (or build) will be dependent on the application.
Another issue is the tables. My original thinking was that if we had a data model for tables then most of our problems would go away.
I am not sure I agree here. What is a table? I see it as a particular visual display mechanism that you have chosen to represent you results. The results can be modelled more accurately within the data model such as chemical-has_measurement, measurement has_numerical value and has_unit. I believe this statement is confusing the visual presentation of data with structuring the data.
However the argument against still stands. Anything that requires a fixed vocabulary is going to break
Well, anything that requires a fixed vocabulary is less flexible, breaking is something different. If it breaks doing the job it was designed to do then this is a problem. If it breaks when applied to a different application, then well, it was not designed for that application in the first place. FuGE is designed so that it provides a generic structure which can then be described or further specialised by the user/application by extending the model itself or by using cv’s/ontologies or free text. This provides the flexibility and in theory future proof.
Overall an experiment has inputs and outputs. These may be data or material objects. Procedures take inputs and generate outputs.[..] Broadly speaking there seem to be three types of item; material objects , data, and procedures (possibly also comments). For each of these we require a provenance (author), and a date
I would agree with you assessment of what classes are needed. This corresponds to what FuGE contains as illustrated in the digram below (click on image to see original)
In summary, the position I want to present is that FuGE is a data model to represent scientific experiments. Several domains are using it to represent their experiments from traditional biology/molecular biology to neurophysiology. I believe FuGE could form the underlying model for a “notebook” via an abstraction/presentation layer to the user. In how should it be implemented, blog, wiki, database, latex, XML, RDF, OWL, I am not going to hypothesis. However, a database implementation of the FuGE schema is already in development called SyMBA which abstracts away from the user presenting simple web forms to fill out the XML which is then stored as a relation database.
The idea behind the CARMEN project is that we provide a system to store electrophysiology data and analysis services so that data can be shared and analysed in the “Neuro-cloud”. An important factor in realising this system is that the stored data and the services have to be described in a way that is both human and computationally amenable. The first stage of this is agreeing what information should actually be ascribed to the data. In other words, the balance between what the experimentalist want to say about their data and what informaticians need to know about a particular data set in order to perform their analysis. To this end we have defined what we believe to be the minimum information that must be ascribed to an electrophysiology experiment for submission to the CARMEN system. It follows the now well practised format of MIAME and MIAPE minimum reporting requirements. In the first instance the document only represents consensus within the CARMEN consortium. However, it could form the basis of a community reporting standard for electrophysiology experiments. The document is available on Nature preceedings at the following URL and comments and opinions are encouraged. http://precedings.nature.com/documents/1720/version/1
I described in an earlier post that data sharing in Neuroscience is relatively non-existent. Some commentary on the subject has appeared since then via the 2007 SfN Satellite Symposium on Data Sharing entitled Value Added by Data SharingLong-TermPotentiation of Neuroscience Research, published in Neuroinformatics. I was also excited to see an article published last week Data Sharing for Computational Neuroscienc, also in Neuroinformatics. However, there is a caveat or two. Apart from ignoring all the data representation issues presented in other domains such as bioinformatics, the re-use of data models such as FuGE, or contribution to ontology efforts such as OBI, all these articles are not open access! How ironic, or should that be how embarrassing. Phil also covers this issue in his blog.
Oh well, looks as if there is still a challenge in the domain of Neuroscience for access to valuable insights into information flow in the brain. Who want to know how the brain works anyway? You can always pay $32 to springer if you want to find out.
When finnished, I would have liked it to be published some where like Nature Preceedings, however they only accept proprietary Microsoft files and pdfs rather than XML documents. I also though of creating a Google code project for it, but it seems quite elaborate for something nobody else would be contributing to and once completed would be rather static. Any suggestions are very welcome.