In a recent Nature editorial entitled Standardizing data, several projects were highlighted that are forfeiting their chances of winning a Nobel prize (according to Quackenbush) and championing the blue collar science of data standardization in the life sciences.
I wanted to take the article a step further and highlight three significant properties of scientific data that I believe are fundamental when considering how to curate, standardize or simply represent scientific data, from primary data, to lab books, to publication. These properties are the content, syntax and semantics, or more simply put: What do we want to say? How do we say it? What does it all mean? I refer to these three properties as the Triumvirate of scientific data.
Content: What do we want to say?
Data content is defined as the items, topics or information “contained in” or represented by a data object: what is, should or must be said. Generic data content standards exist, such as Dublin Core, as well as more focused or domain-specific standards. Most aspects of the research life-cycle have a content standard. For example, when submitting a manuscript to a scientific publisher you are required to conform to a content standard for that journal. PlosOne calls its content standard Criteria for Publication and lists seven points to conform to.
The Minimum Information about [insert favourite technology] guidelines are efforts by the relevant communities to define content standards for their experiments. These do (should) not define how the content is represented (in a database or file format); rather, they state what information is required to describe an experiment. Collecting and defining content standards for the life-sciences is the premise of the MIBBI project.
Syntax: How do we say it?
The content of data is independent of any structure, language implementation or semantics. For example, when viewing a journal article on BioMed Central you typically have the option to view or download the “Full Text”, which is often represented in HTML, or to view the PDF file or XML. Each representation has the same scientific content to a human but is structured and then rendered (or “presented”) to the user in three different syntaxes.
The structural or syntactic representation of scientific data is largely database centric. However, alternative methods can be identified such as wikis (OpenWetWare, UsefulChem), blogs (LaBLog), XML (GelML) and RDF (UniProt export), or the data can be described as a data model (FuGE) which can be realised in multiple syntaxes.
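To make the separation between content and syntax concrete, here is a minimal sketch in Python. The record, its field names and the element name "experiment" are invented for illustration and do not follow GelML, FuGE or any other real format. The same content is written out in two syntaxes and read back unchanged.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical content: field names and values are illustrative only.
content = {"technique": "SDS-PAGE", "gel_percentage": "12", "sample": "lysate A"}


def to_json(record):
    """Serialise the content as JSON text."""
    return json.dumps(record, sort_keys=True)


def to_xml(record):
    """Serialise the same content as XML text."""
    root = ET.Element("experiment")
    for key, value in record.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


def from_json(text):
    """Recover the content from its JSON form."""
    return json.loads(text)


def from_xml(text):
    """Recover the content from its XML form."""
    return {child.tag: child.text for child in ET.fromstring(text)}
```

The two strings look nothing alike, yet both parse back to the same dictionary, which is the sense in which content is independent of syntax.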
Semantics: What do we mean?
The explicit meaning of data is very difficult to get right and is a hard problem in the life-sciences. One word can have many meanings and one meaning can be described by many words. A good example of a failure to correctly determine the semantics of data is described in the paper by Zeeberg et al. 2004. In the paper they describe the misinterpretation of the semantics of gene names: Excel irreversibly converted gene names to date format, and the error percolated through to the curated LocusLink public repository.
Within the life-sciences the issue of semantics is being addressed via the use of controlled vocabularies and ontologies.
According to the Neurocommons definition, a controlled vocabulary is an association between formal names (identifiers) and their definitions. An ontology is a controlled vocabulary augmented with logical constraints that describe their interrelationships. Not only do we need semantics for data, we need shared semantics, so that we are able to describe data consistently, within laboratories, across collaborations and transcending scientific domains. The OBO Foundry is one of the projects tasked with fostering the orthogonal development of ontologies (one term only appears in one ontology and is referenced by others) with the goal of shared semantics.
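The difference between the two definitions can be sketched in a few lines of Python. The identifiers ("EX:0001" and so on) and the definitions are made up for this example and are not taken from any real ontology; the only logical constraint modelled is a simple is_a relationship.

```python
# A controlled vocabulary: formal names (identifiers) and their definitions.
# All identifiers and definitions here are hypothetical.
controlled_vocabulary = {
    "EX:0001": "separation technique",
    "EX:0002": "gel electrophoresis",
    "EX:0003": "SDS-PAGE",
}

# An ontology adds relationships between the terms; here only "is_a".
is_a = {
    "EX:0003": "EX:0002",
    "EX:0002": "EX:0001",
}


def ancestors(term_id):
    """Return every term reachable from term_id by following is_a links."""
    found = []
    while term_id in is_a:
        term_id = is_a[term_id]
        found.append(term_id)
    return found
```

With the relationships in place a program can work out that anything annotated as SDS-PAGE is also a gel electrophoresis experiment, something the vocabulary alone cannot tell it.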
When considering how to curate, standardize or represent scientific data, either internally within laboratories, or externally for publication, the three significant properties of content, syntax and semantics should be considered carefully for the specific data. Consistent representation of data conforming to the Triumvirate of scientific data will provide a platform for the dissemination, interpretation, evaluation and advancement of scientific knowledge.
Thanks to Phil Lord for helpful discussions on the Triumvirate of data.
Conflict of interest
I am involved in the MIBBI project and the development of GelML, and am a member of the OBO Foundry via the OBI project.
Several months ago (about three), I made a public commitment to make the data I generated during my PhD open and available online. I have not ignored this, and in the interim I have been investigating various ways I can do it. Not only do I want to make the data available, but I want to structure it in a standard form, namely the GelML format. In addition, I was involved in developing the specification and therefore I have something of an obligation to use it. As it is an XML transfer format, I need to be able to make changes and revise it, much like developing code, so in that sense recording the data on a wiki or blog would not be appropriate. For this reason I have chosen to create a Google Code project for gel electrophoresis data and do everything in Subversion. You can browse the Subversion repository or check it out anonymously. The GelML file that will eventually contain the data (it is still very much a work in progress) is here. As I am doing this, I thought I might as well publish my lab book while I was at it. This will be done using LaTeX, and the PDF that gets generated can be found here.
To date, this is still a work in progress and a reverse engineering project, as the experiments are not being done live. It may take a while to complete, but in the end I hope presenting my data in GelML and making my lab book available will be of more benefit than leaving it to decompose for years on cellulose.
I have struggled to keep up with this discussion, with excuses ranging from attending workshops, a major release deadline on the horizon (now past) and a post-mortem on the release schedule, to attending (only to please the parents) my graduation ceremony. I am only now starting to catch up on my feeds, but dauntingly Google Reader tells me I have 1000+. Moan, moan, moan.
Anyway, in Cameron’s last post on the subject he points to all the previous discussions and other commentary on the topic. I will pick up from his last post and respond to some of his responses to the responses to responses.
I still feel that we are trying to describe and achieve different things, but that this discussion is a great way of getting to the bottom of this and achieving some clarity in our description and language.
This certainly may be the case. I want to present FuGE as something that is worth considering rather than re-inventing. However, there is no denying that FuGE is a data model and does not come with a high degree of tool support or nice user interfaces, which Cameron is crying out for, as are most lab scientists from a usability point of view.
I got off to a very bad start here. I should have used the word ‘capture’ here. This to me is about capturing the data streams that come out of lab work.
This seems to be a change of tack 🙂 The original post was about a data model for lab notebooks. There is no reason why data streams cannot be structured. However, in reading the rest of Cameron’s post there would appear to be a fourth point of separation in modelling experiments, following on from the three presented here.
4. The publication of the data
Here we start to see how the different motivations are driving our views. What I want here is a marker on a web document that says ‘I am a scientific experiment’ (page was a more term to use – I simply mean any web document, generally accessed through discrete web pages). This will allow aggregation and distribution of the notebook a la PostGenomic or Chemical Blogspace. To me this is more important than the format of the underlying data. If I can find interesting data I will probably put the work into extracting it in a form useful to me To Frank I suspect the aggregation and indexing is a peripheral issue. If the data isn’t in a agreed format it isn’t useful for him.
This comment seems to reiterate the fourth level of separation: structuring the experimental data is separate from publishing it. I do not see any reason why a document that contains structured data could not be embedded in a wiki, blog, lump of RDF or whatever. Once you have found it by whatever publication mechanism and arrived at the data, it is going to be a lot easier to do interesting stuff with it if it is in a common structure. Imagine the scenario of 10 laboratories (in fact, let’s say 500, to cross the barrier of humans doing it faster) doing the same type of experiment. Would it not be cool if you could write one app to interrogate 500 open lab books with one input structure, instead of 500 file parsers whose output would then be placed into a common format anyway, to do some cool meta-analysis on which protocol produced the best results out of the 500?
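The "one app, one input structure" scenario might look something like the Python sketch below. Everything in it is an assumption made for illustration: each lab book is taken to be JSON text holding a list of entries with a "protocol" name and a numeric "score", which is not GelML or FuGE, and the two example lab books are invented.

```python
import json


def best_protocol(notebook_texts):
    """Return the protocol with the highest mean score across all notebooks.

    Each notebook is assumed to be JSON text containing a list of
    {"protocol": ..., "score": ...} entries (a made-up common structure).
    """
    scores_by_protocol = {}
    for text in notebook_texts:
        for entry in json.loads(text):
            scores = scores_by_protocol.setdefault(entry["protocol"], [])
            scores.append(entry["score"])
    means = {name: sum(s) / len(s) for name, s in scores_by_protocol.items()}
    return max(means, key=means.get)


# Two invented lab books sharing the same structure.
lab_a = json.dumps([{"protocol": "P1", "score": 0.5}, {"protocol": "P2", "score": 1.0}])
lab_b = json.dumps([{"protocol": "P1", "score": 0.5}, {"protocol": "P2", "score": 0.5}])
```

Because every lab book shares the structure, the same function works for 2 or 500 of them without a single extra parser.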
Again this is a central user interface issue for us. Capturing an experiment in the wet lab, whether noting it as it happens or planning what you are going to in advance, is often most easily done with a table. Tables are not well implemented in the wiki and blog frameworks we are using for these systems. Therefore providing a table to capture the experiment is critical if you actually want anyone to use your system. Our users consistently identify this as the single biggest barrier to them using our system.
A table is a visual summary of your experiment. In order to produce a table you have to think about what you are recording and model the table accordingly in advance. Structurally this is more efficiently achieved at the model layer. Visually, there is no question that a summary of what you captured works well via a table. Is a table the best mechanism? It is certainly easier when entering information via a pen on paper. Ultimately, if you abstract far enough back, a table is just column-separated values. FuGE provides a mechanism to define matrices of data (tables) which can be presented to the user.
Now the heavyweight approach to this is to say; ‘That’s why you need a data model. Once you have that you can generate a nice web form to capture the necessary data’. The problem with this comes when you do something slightly different. As an example I had a template set up in our system for capturing the setup of SDS-PAGE gels. This would go and look for anything that tagged as ‘protein’ as potential samples and present these in a drop down menu. This was fine until the day I wanted to run a DNA-protein conjugate on the gel. Essentially I had broken my own data model. This could be fixed, and I did fix it, by changing the way my template looked for potential samples. But in the cut and thrust of real lab work (as opposed to an academic pottering under sufferance of his students) this isn’t feasible. We can’t extend the data model every time we do something new – we are always doing something new.
[FG]….FuGE is designed so that it provides a generic structure which can then be described or further specialised by the user/application by extending the model itself or by using cv’s/ontologies or free text. This provides the flexibility and in theory future proof.
But does this require that the user does the extension every time they move on to something new. As a matter of interest, how much time and effort went into agreeing the GelML? Is it practical to do this extension over and over again? And who will fund it?
There is no question that picking a particular technology or process and modelling it takes time. To answer your question, GelML probably took 2 years to complete, which is not trivial. However FuGE, the data model for experiments (analogous to what you proposed in your original post), has probably taken close to 5 years of development with a larger number of developers than GelML. I could envision that, in creating a lab notebook with FuGE as the underlying model, you could re-use these extensions, like plugging specific experiments into your generic lab book. Only conforming to a common structure will allow this plug-in scenario to be achieved, whether it is FuGE or another model. As for funding, who will fund it? All the main funders are starting to say we should make our data available but provide very little monetary incentive to do so. GelML was not funded; we did it out of the goodness of our hearts and for the greater good.
My concern is that achieving added value requires the controlled vocabulary. If we are going to just end up using free text because a cv doesn’t exist for the experiment we are doing then why use a complex data structure?
You are also correct in your assertion that the added value, or the semantics, is in the ontology, not the data model. Using a data model allows you to understand certain information contained in the structure: that it is a material, that it is a protocol, that it is a piece of equipment. The ontology allows you to say specifically what it is and what it means. This does present a catch-22 in that without the ontology it is difficult to add semantics. Is free text more suitable? I would say no. It is easier, and you will understand it, but by the very notion that you are making your lab book available you want other people and computers to interpret and understand it. For example, are the free text terms 1D, gel electrophoresis, gel, matrix separation and electrophoresis all referring to the same thing that you use SDS-PAGE to refer to? You might assume so; I might have implied otherwise. You can’t tell unless they are associated with meaning: free text has no meaning, only assumed interpretation. This is the motivation behind OBI, the Ontology for Biomedical Investigations. It probably suffers from the same labelling problem as FuGE, as it covers science experiments, not just biology.
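A small Python sketch of the point about free text: a label only becomes comparable once somebody has explicitly tied it to a term. The identifier "EX:0003" and the mapping are invented for this example; a real system would point at identifiers from an ontology such as OBI.

```python
# Hypothetical mapping from free-text labels to a single term identifier.
SYNONYMS = {
    "sds-page": "EX:0003",
    "1d gel electrophoresis": "EX:0003",
}


def resolve(label):
    """Return the term identifier for a free-text label, or None if unmapped."""
    return SYNONYMS.get(label.strip().lower())
```

Labels that have been mapped resolve to the same identifier whatever the author typed, while an unmapped label such as "gel" resolves to nothing, leaving the reader (or program) to guess.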
1. the representation of experiments – the data model
2. the presentation or level of abstraction to the user (probably somewhat dependent on 3.)
3. the implementation of the data model
4. the publication of the data (Notification, RSS etc.)
FuGE itself is only applicable to point 1. It will provide a structure to represent experiments. That’s it. I believe it is applicable to a lab notebook. However, there is no glossing over the fact that there needs to be an abstraction over the model (2), dependent on (3), to allow it to be used by scientists and to make this a reality. This is work that has to be done, and it’s not me offering to do it either 🙂 Once this is in place it should be relatively trivial to publish or notify others of experiments (4).
MIBBI is a registry of scientific experiment reporting guidelines with the idea to foster a foundry of best practice to further develop and encourage modular development and re-use of reporting guidelines. The first workshop is being held at the EBI on the 2nd – 3rd April 2008 and is a relatively closed workshop, limited to those developers and guidelines that are registered on the site. The schedule for day one is a whistle stop tour consisting of 5 min talks (adjusting for an academic’s interpretation of what 5 minutes means) for all the guidelines that exist, their scope and the people behind them. Due to this I am not going to comment on individual talks. I presented two talks during the day: one on CARMEN and the development of the MINI: Electrophysiology reporting guidelines, and one on FuGE, standing in for Andy Jones.
I tried sharing these slides via Google presentation, and they looked quite nice. However, WordPress does not seem to allow them to be embedded, so I put them on SlideShare instead. These set the tone for the discussions for the afternoon and tomorrow.
This post may be one in a series of responses to Cameron’s post on “Proposing a data model for Open Notebooks“. When I originally read this post I commented on the fact that a data model for experiments actually exists and that he may get some mileage out of it rather than starting from scratch and reinventing the wheel. Several discussions have followed on from this original post and Neil has picked up on it as well, with sentiments that I agree with.
I think a large part of this discussion confuses and conflates 3 issues which I believe to be separate:
1. the representation of experiments – the data model
2. the presentation or level of abstraction to the user (probably somewhat dependent on 3.)
3. the implementation of the data model
With these three issues in mind, to start with, I am going back to the original post and respond to some of the comments.
What I’m suggesting is a standard format to describe experiments;…
A “standard” in the true sense of the word (established by consensus and approved by a recognized body) already exists to describe life-science experiments. It is a data model represented in UML called FuGE.
…..a default format for online notebooks. The object is to do a number of things. Firstly identify the page(s) as being an online laboratory notebook so that they can be aggregated or auto-processed as appropriate.
I see this as two different and separate things: the data model which represents experiments, and the presentation of the model to the user, in this case described as an online notebook. Page numbers are an arbitrary visual aid; they are not integral to modelling experiments.
…Secondly to make rich metadata available in a human readable and machine processable form making mashups and other things possible using tools such as Yahoo! Pipes, Dapper, and the growing range of other interesting tools, but not to impose any unnecessary limitations on what that metadata might look like. ..
I am not going to deal with metadata here, as the post will probably be long enough. However, traditionally, metadata (cv’s and ontologies) have been used to add specificity or meaning to the structured data. The choice of the metadata to use (or build) will be dependent on the application.
Another issue is the tables. My original thinking was that if we had a data model for tables then most of our problems would go away.
I am not sure I agree here. What is a table? I see it as a particular visual display mechanism that you have chosen to represent your results. The results can be modelled more accurately within the data model, such as chemical has_measurement, and measurement has_numerical_value and has_unit. I believe this statement confuses the visual presentation of data with structuring the data.
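As a rough sketch of what modelling the results rather than the table could mean, the relationships above can be written as Python classes. The class and attribute names simply mirror the has_measurement, has_numerical_value and has_unit wording and are assumptions of this example, not FuGE classes. The table then falls out as one possible view of the structured data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Measurement:
    numerical_value: float  # measurement has_numerical_value
    unit: str               # measurement has_unit


@dataclass
class Chemical:
    name: str
    measurements: List[Measurement] = field(default_factory=list)  # chemical has_measurement


def as_table(chemicals):
    """Flatten the structured data into rows; the table is just one view of it."""
    return [
        (chemical.name, m.numerical_value, m.unit)
        for chemical in chemicals
        for m in chemical.measurements
    ]


# Invented example data.
tris = Chemical("Tris", [Measurement(25.0, "mM")])
```

Calling as_table([tris]) produces the rows a notebook page would display, while the underlying objects keep the value and its unit explicitly separate.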
However the argument against still stands. Anything that requires a fixed vocabulary is going to break
Well, anything that requires a fixed vocabulary is less flexible; breaking is something different. If it breaks doing the job it was designed to do, then this is a problem. If it breaks when applied to a different application then, well, it was not designed for that application in the first place. FuGE is designed so that it provides a generic structure which can then be described or further specialised by the user/application by extending the model itself or by using cv’s/ontologies or free text. This provides the flexibility and, in theory, future-proofing.
Overall an experiment has inputs and outputs. These may be data or material objects. Procedures take inputs and generate outputs.[..] Broadly speaking there seem to be three types of item; material objects , data, and procedures (possibly also comments). For each of these we require a provenance (author), and a date
I would agree with your assessment of what classes are needed. This corresponds to what FuGE contains, as illustrated in the accompanying diagram.
In summary, the position I want to present is that FuGE is a data model to represent scientific experiments. Several domains are using it to represent their experiments, from traditional biology/molecular biology to neurophysiology. I believe FuGE could form the underlying model for a “notebook” via an abstraction/presentation layer to the user. As to how it should be implemented (blog, wiki, database, LaTeX, XML, RDF, OWL), I am not going to hypothesise. However, a database implementation of the FuGE schema, called SyMBA, is already in development; it abstracts the model away from the user, presenting simple web forms to fill out the XML, which is then stored in a relational database.
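Separately from the diagram, the classes named in the quote (material objects, data and procedures, each with an author and a date, with procedures taking inputs and generating outputs) can be sketched in Python as follows. The class names, attribute names and example values are assumptions for illustration only and are not the actual FuGE class names.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Item:
    """Provenance shared by every item: who recorded it and when."""
    name: str
    author: str
    recorded: date


@dataclass
class Material(Item):
    """A physical object, such as a sample."""


@dataclass
class Data(Item):
    """A data object, such as an image or a file of readings."""


@dataclass
class Procedure(Item):
    """A procedure takes inputs and generates outputs."""
    inputs: List[Item] = field(default_factory=list)
    outputs: List[Item] = field(default_factory=list)


# Invented example: running a gel turns a sample into an image.
day = date(2008, 4, 2)
sample = Material("protein sample", "A. Scientist", day)
image = Data("gel image", "A. Scientist", day)
run = Procedure("run gel", "A. Scientist", day, inputs=[sample], outputs=[image])
```

Chaining procedures so that the output of one becomes the input of the next gives the provenance trail of an experiment without any reference to how it is displayed.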
The idea behind the CARMEN project is that we provide a system to store electrophysiology data and analysis services so that data can be shared and analysed in the “Neuro-cloud”. An important factor in realising this system is that the stored data and the services have to be described in a way that is both human and computationally amenable. The first stage of this is agreeing what information should actually be ascribed to the data. In other words, we need to find the balance between what the experimentalists want to say about their data and what informaticians need to know about a particular data set in order to perform their analysis. To this end we have defined what we believe to be the minimum information that must be ascribed to an electrophysiology experiment for submission to the CARMEN system. It follows the now well practised format of the MIAME and MIAPE minimum reporting requirements. In the first instance the document only represents consensus within the CARMEN consortium. However, it could form the basis of a community reporting standard for electrophysiology experiments. The document is available on Nature Precedings at the following URL, and comments and opinions are encouraged. http://precedings.nature.com/documents/1720/version/1
Call for Papers for Bio-Ontologies 2008. Submissions are now invited for Bio-Ontologies 2008: Knowledge in Biology, a SIG at Intelligent Systems for Molecular Biology 2008.
Key Dates to remember:
- Submission due: Friday 2nd May
- Notifications: Friday 23rd May
- Final Version Due: Friday 30th May
- Workshop: Sunday 20th July
Bio-Ontologies has existed as a SIG at ISMB for more than a decade, making it one of the longest running. Throughout this time, Bio-Ontologies has provided a forum for discussion on the latest and most cutting edge research on ontologies. In this decade, the use of ontologies has become mature, moving from niche to mainstream usage within bioinformatics. Following on from last year’s reflective look, this year we are broadening the scope of the SIG; we are interested in any formal or informal approach to organising, presenting and disseminating knowledge in biology.
So, for example:
- Semantic and/or Scientific wikis.
- Multimedia blogs
- Tag Clouds
- Collaborative Curation Platforms
- Collaborative Ontology Authoring and Peer-Review Mechanisms
are topics which will be of relevance to the SIG, in addition to the more traditional areas for bio-ontologies:
- Biological Applications of Ontologies
- Reports on Newly Developed or Existing Bio-Ontologies
- Tools for Developing Ontologies
- Use of Ontologies in Data Communication Standards
- Use of Semantic Web technologies in Bioinformatics
- Implications of Bio-Ontologies or the Semantic Web for drug discovery
- Current Research In Ontology Languages and its implication for Bio-Ontologies
Please note that this year ISCB has made an innovative schedule, holding some of the SIGs DURING ISMB. Bio-Ontologies is on the Sunday, parallel to the main conference.
Submissions are now open and can be made through EasyChair.
Instructions to Authors
We are inviting two types of submissions.
Short papers, up to 4 pages.
Poster abstracts, up to 1/2 page.
Following review, successful papers will be presented at the Bio-Ontologies SIG. Poster abstracts will be provided with poster space, and time will be allocated during the day for at least one poster session. Unsuccessful papers will automatically be considered for poster presentation; there is no need to submit both on the same topic.
- Phillip Lord, Newcastle University
- Susanna-Assunta Sansone, EBI
- Nigam Shah, Stanford
- Matt Cockerill, BioMedCentral
The programme committee, listed alphabetically, is:
- Mike Bada, University of Colorado
- Judith Blake, Jackson Laboratory
- Frank Gibson, Newcastle University
- Cliff Joslyn, Pacific National Laboratory
- Wacek Kusnierczyk, Norwegian University of Science and Technology
- Robin MacEntire, GSK
- Helen Parkinson, EBI
- Daniel Rubin, Stanford University
- Alan Ruttenberg, Science Commons
- Robert Stevens, University of Manchester
- and the conference organisers.
Submission templates are available from the Bio-Ontologies website.