Archive for category bioinformatics
I have placed an e-print of a manuscript on Nature Precedings that I have been working on, in collaboration with the authors listed on the manuscript. It presents a review of the available published ontology engineering methodologies, and then assesses their suitability when applied to community ontology development (the decentralised setting).
It is a lengthy document. Here is the abstract:
This paper addresses two research questions: “How should a well-engineered methodology facilitate the development of ontologies within communities of practice?” and “What methodology should be used?” If ontologies are to be developed by communities then the ontology development life cycle should be better understood within this context. This paper presents the Melting Point (MP), a proposed new methodology for developing ontologies within decentralized settings. It describes how MP was developed by taking best practices from other methodologies, provides details on recommended steps and recommended processes, and compares MP with alternatives. The methodology presented here is the product of direct first-hand experience and observation of biological communities of practice in which some of the authors have been involved. The Melting Point is a methodology engineered for decentralised communities of practice for which the designers of technology and the users may be the same group. As such, MP provides a potential foundation for the establishment of standard practices for ontology engineering.
These are the slides I gave at a DCC workshop entitled "Digital curation 101", which aimed to give an overview of what to consider regarding data curation and management in the context of applying for research funding. The presentation starts with definitions of content, syntax and semantics, and examples of how these concepts are being applied in the life sciences, specifically proteomics.
The premise of the BioSysBio conference is to
bring together the best young researchers working in Synthetic Biology, Systems Biology and Bioinformatics, providing a platform to hear and discuss the most recent scientific advances and applications in these fascinating fields.
This year's BioSysBio 09 has just taken place in Cambridge, UK. The program was slanted more towards synthetic biology than towards traditional systems biology, which I think reflects the growing momentum that synthetic biology has gained in the past year. I think this is good progress, and I was secretly glad, as I did not want to spend three days looking at massive network diagrams squashed onto PowerPoint slides.
This was the first conference I had been to where the organisers actually requested that we use the BioSysBio FriendFeed room and Twitter to communicate, so I did. Halfway through the first day the organisers demonstrated the FF room, which seemed to consist solely of Allyson's posts, and questions were asked about whether she was a blogging bot. When we did confirm there was actually a female at an engineering conference, she was thereafter known as the BioSysBio poster girl.
As ever, Ally was monumental in her blogging during the conference, and all her posts can be found here. At one stage Simon did try to blog her talk with the same detail and speed, but he just kept coming up with excuses about the wifi being slow – eventually he got there.
This was the first time I had attended BioSysBio, and I thoroughly enjoyed the experience. In general all of the talks were of a high standard; most notable for me were Allyson Lister's talk on Saint: a lightweight SBML annotation integration environment, Christina Smolke on Programming RNA Devices to Control Cellular Information Processing, Piers Millet on Why Secure Synthetic Biology? and Drew Endy on Building a new Biology. It was also good to hear from Randy Rettberg about improvements to the Registry of Standard Biological Parts and the wiki-style community building of the product catalogue, or data sheet, for each part.
There is no point in me re-posting coverage that has already been documented, so if you would like to follow what happened you can follow the #biosysbio Twitter stream, the biosysbio FriendFeed room, or, if you want a more comprehensive overview, Ally's blog.
This was also the first time I had used Twitter (via TweetDeck) instead of FriendFeed to microblog a conference. This approach certainly generated a lot of noise and random soundbites, and was probably a fast way to make notes. However, although everything is grouped under the #biosysbio tag, posts are not grouped around a particular talk or discussion thread. I can't help thinking that microblogging via FriendFeed would be organised around specific talks and provide a more focused discussion, as opposed to just covering what was happening second by second.
In a recent Nature editorial entitled Standardizing data, several projects were highlighted that are forfeiting their chances of winning a Nobel prize (according to Quackenbush) and championing the blue-collar science of data standardization in the life sciences.
I wanted to take the article a step further and highlight three significant properties of scientific data that I believe to be fundamental when considering how to curate, standardize or simply represent scientific data; from primary data, to lab books, to publication. These significant properties are the content, syntax and semantics, or, more simply put: What do we want to say? How do we say it? What does it all mean? These three significant properties of data are what I refer to as the Triumvirate of scientific data.
Content: What do we want to say?
Data content is defined as the items, topics or information that is "contained in" or represented by a data object: what is, should or must be said. Generic data content standards exist, such as Dublin Core, as well as more focused or domain-specific standards. Most aspects of the research life-cycle have a content standard. For example, when submitting a manuscript to a scientific publisher you are required to conform to a content standard for that journal: PLoS ONE calls its content standard Criteria for Publication and lists seven points to conform to.
The Minimum Information about [insert favourite technology] checklists are efforts by the relevant communities to define content standards for their experiments. These do (should) not define how the content is represented (in a database or file format); rather, they state what information is required to describe an experiment. Collecting and defining content standards for the life sciences is the premise of the MIBBI project.
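To make the distinction concrete, here is a minimal sketch of the idea that a content standard only says *what* must be present, independently of how the data is stored. The checklist field names below are invented for illustration; they are not taken from any real MIBBI checklist.

```python
# Hypothetical checklist items, invented for illustration only.
REQUIRED = {"organism", "sample_description", "protocol", "instrument"}

def missing_content(record):
    """Return the required checklist items absent from a submission.

    The checklist does not care whether 'record' came from a database,
    an XML file or a wiki page - only whether the content is there.
    """
    return sorted(REQUIRED - record.keys())

submission = {"organism": "E. coli", "protocol": "2-DE gel", "instrument": "scanner"}
print(missing_content(submission))  # ['sample_description']
```

The same check could be run against any syntactic representation of the submission, which is exactly why content and syntax are worth keeping separate.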
Syntax: How do we say it?
The content of data is independent of any structure, language implementation or semantics. For example, when viewing a journal article on BioMed Central you typically have the option to view or download the "Full Text", which is often represented in HTML, or you have the option of viewing the PDF file or the XML. Each representation has the same scientific content to a human, but is structured and then rendered (or "presented") to the user in three different syntaxes.
The majority of the structural or syntactic representation of scientific data is database-centric. However, alternative methods can be identified, such as wikis (OpenWetWare, UsefulChem), blogs (LaBLog), XML (GelML), RDF (UniProt export), or a data model (FuGE) which can be realised in multiple syntaxes.
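The one-content, many-syntaxes point can be sketched in a few lines: the record below is invented for illustration, and the same content is serialised in two different syntaxes.

```python
import json
import xml.etree.ElementTree as ET

# An invented protein record - the *content* we want to say.
record = {"accession": "P12345", "name": "Example protein", "length": "321"}

# Syntax 1: the record as JSON.
as_json = json.dumps(record)

# Syntax 2: the same content as XML.
root = ET.Element("protein")
for field, value in record.items():
    ET.SubElement(root, field).text = value
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_xml)
```

Parsing either serialisation recovers the identical content, which is the sense in which the content is independent of the syntax used to say it.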
Semantics: What do we mean?
The explicit meaning of data is very difficult to get right and is a hard problem in the life sciences. One word can have many meanings, and one meaning can be described by many words. A good example of a failure to correctly determine the semantics of data is described in the paper by Zeeberg et al. 2004, which reports the misinterpretation of the semantics of gene names. This misinterpretation resulted in an irreversible conversion of gene symbols to date format by Excel, which then percolated through to the curated LocusLink public repository.
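The failure mode is easy to reproduce. The sketch below is not Excel's actual code, just a naive "auto-detect" routine of the kind Zeeberg et al. describe, which treats gene symbols such as SEPT2 or MARCH1 as month-plus-day tokens:

```python
import re
from datetime import date

# Month-name prefixes a naive coercion might recognise (illustrative only).
MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "MARCH": 3, "APR": 4, "MAY": 5,
          "JUN": 6, "JUL": 7, "AUG": 8, "SEP": 9, "SEPT": 9, "OCT": 10,
          "NOV": 11, "DEC": 12}

def naive_coerce(cell):
    """Return a date if the cell 'looks like' month+day, else the string.

    This discards the original gene symbol - the conversion is irreversible,
    which is exactly the problem reported for the curated data.
    """
    m = re.fullmatch(r"([A-Za-z]+)(\d{1,2})", cell)
    if m and m.group(1).upper() in MONTHS:
        return date(2004, MONTHS[m.group(1).upper()], int(m.group(2)))
    return cell

for symbol in ["SEPT2", "MARCH1", "DEC1", "ACTB"]:
    print(symbol, "->", naive_coerce(symbol))
# SEPT2, MARCH1 and DEC1 become dates; ACTB survives unchanged.
```

Because the coercion happens on import, the semantics of the cell (a gene name, not a date) are lost before any human ever sees the data.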
Within the life-sciences the issue of semantics is being addressed via the use of Controlled vocabularies and ontologies.
According to the Neurocommons definition, a controlled vocabulary is an association between formal names (identifiers) and their definitions. An ontology is a controlled vocabulary augmented with logical constraints that describe the interrelationships between terms. Not only do we need semantics for data, we need shared semantics, so that we are able to describe data consistently, within laboratories, across collaborations and transcending scientific domains. The OBO Foundry is one of the projects tasked with fostering the orthogonal development of ontologies – one term appears in only one ontology and is referenced by others – with the goal of shared semantics.
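The vocabulary-versus-ontology distinction can be sketched directly from that definition. The GO identifiers below are real, but the is_a chain is simplified for illustration and omits the intermediate terms of the actual Gene Ontology hierarchy:

```python
# A controlled vocabulary: identifiers associated with definitions.
vocabulary = {
    "GO:0006412": "translation",
    "GO:0009058": "biosynthetic process",
    "GO:0008152": "metabolic process",
}

# An ontology adds logical constraints between the terms; here only
# is_a relations (real ontologies carry much richer axioms), and the
# chain is simplified relative to the actual GO hierarchy.
is_a = {
    "GO:0006412": "GO:0009058",
    "GO:0009058": "GO:0008152",
}

def ancestors(term):
    """Walk the is_a hierarchy upwards from a term."""
    out = []
    while term in is_a:
        term = is_a[term]
        out.append(term)
    return out

print(ancestors("GO:0006412"))  # ['GO:0009058', 'GO:0008152']
```

The constraints are what make an ontology computable: given shared identifiers, software in different laboratories can draw the same inferences from the same annotations.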
When considering how to curate, standardize or represent scientific data, either internally within laboratories, or externally for publication, the three significant properties of content, syntax and semantics should be considered carefully for the specific data. Consistent representation of data conforming to the Triumvirate of scientific data will provide a platform for the dissemination, interpretation, evaluation and advancement of scientific knowledge.
Thanks to Phil Lord for helpful discussions on the Triumvirate of data
Conflict of interest
I am involved in the MIBBI project, the development of GelML and a member of the OBO Foundry via the OBI project.
The MIAPE: Gel Informatics module formalised by the Proteomics Standards Initiative (PSI) is now available for public comment on the PSI web site. Typically a lot of this information will be contained in the image analysis software, so we would especially encourage software vendors to review the document. The public comment period enables the wider community to provide feedback on a proposed standard before it is formally accepted, and thus is an important step in the standardisation process.
This message is to encourage you to contribute to the standards development activity by commenting on the material that is available online. We invite both positive and negative comments. If negative comments are being made, these could be on the relevance, clarity, correctness, appropriateness, etc, of the proposal as a whole or of specific parts of the proposal.
If you do not feel well placed to comment on this document, but know someone who may be, please consider forwarding this request. There is no requirement that people commenting should have had any prior contact with the PSI.
If you have comments that you would like to make but would prefer not to make public, please email the PSI editor Norman Paton.