Posts Tagged RDF
The Semantic Web of Life Science
Posted by peanutbutter in bioinformatics, life-science, semantic web on April 25, 2009
This summary was born out of a question on Twitter and percolated to FriendFeed, which was “Who is using RDF and integrating other resources at the minute and what are those resources? From this question, several resources were highlighted.
UniProt. The comprehensive resource of protein information is available as an RDF distribution and each Protein record has a corresponding RDF download option.
Phil pointed out Semantic Systems Biology, As systems biology is largely concerned with representing networks and interactions at a systems level, a language like RDF would seem an obvious choice to represent this type of knowledge, to aid semantic description and data integration.
Melanie pointed out the following resources such as Bio2RDF. This project aims to RDF-ize numerous public life-science resources using what they call a three step approach which they have developed. The following image illustrates some of the resources that are included in Bio2RDF.
The NeuroCommons project seeks to make all scientific research materials – research articles, annotations, data, physical materials – as available and as usable as they can be. As a result they have an RDF triple store which they encourage you to either contribute to or download and use.
For a more general overview of resources that exist as an RDF implementation, the Linked Open Data cloud provides a graphical summary of the resources that exists and the relationships between them.
If you know of any more life-science resources or projects using RDF, then please do comment below. Egon has indicated he is working on RDF-ing the NMRShiftDB and ChEMBL’s Starlight, and Andrew Clegg is considering a project proposal involving RDF. As a result a very interesting discussion ensued on FF.
The Triumvirate of Scientific Data
Posted by peanutbutter in bioinformatics, data standards, ontology, open data on October 30, 2008
In a recent Nature editorial entitled Standardizing data, several projects were highlighted that are forfeiting there chances of winning a Nobel prize (according to Quackenbush) and championing the blue collar science of data standardization.in the life-sciences.
I wanted to take the article a step further highlight three significant properties of scientific data that I believe to be fundamental in considering how to curate, standardize or simply represent scientific data; from primary data, to lab books, to publication. These significant properties of scientific data are the content, syntax, and semantics, or more simply put -What do we want to say? How do we say it? What does it all mean? These three significant properties of data are what I refer to as the Triumvirate of scientific data.
Content: What do we want to say?
Data Content is defined as the items, topics or information that is “contained in” or represented by a data object. What is, should or must be said. Generic data content standards exists, such as Dublin Core, as well as more focused or domain specific standards. Most aspects of the research life-cycle have a content standard. For example, when submitting a manuscript to a scientific publisher you are required to conform to a content standard for that Journal. For example, PlosOne calls their content standard Criteria for Publication and lists seven points to conform to.
The Minimum Information about [insert favourite technology] are efforts by the relevant communities to define content standards for their experiments. These do (should) not define how the content is represented (in a database or file format) rather they state what information is required to describe an experiment. Collecting and defining content standards for the life-sciences is the premise of the MIBBI project.
Syntax: How do we say it?
The content of data is independent of any structure, language implementation or semantics. For example when viewing a journal article on Biomed central you typically have the option to view or download the “Full Text” which is often represented in HTML or you have the option of viewing the PDF file or XML. Each representation has the same scientific content to a human but is structured and then rendered (or “presented”) to the user in three different syntax.
The majority of the structural of syntactic representation of scientific data is largely database centric. However, alternative methods can be identified such as Wikis (OpenWetWare, UsefulChem), Blogs (LaBLog), XML, (GelML), RDF (UniProt export) or described as a data model (FuGE) which can be realised in multiple syntax
Semantics: What do we mean?
The explicit meaning of data is very difficult to get right and is a difficult problem in the life-sciences. One word can have many meanings and one meaning can be described by many words. A good example of a failure to correctly determine the semantics of data is described in the paper by Zeeberg et al 2004. In the paper they describe the mis-interpretation of the semantics of gene names. This mis-interpretation of semantics resulted in an irreversible conversion to date-format by Excel and which percolated through to the curated LocusLink public repository.
Within the life-sciences the issue of semantics is being addressed via the use of Controlled vocabularies and ontologies.
According to the Neurocommons definition; A controlled vocabulary is an association between formal names (identifiers) and their definitions. A ontology is a controlled vocabulary augmented with logical constraints that describe their interrelationships. Not only do we need semantics for data, we need shared semantics, so that we are able to describe data consistently, within laboratories, across collaborations and transcending scientific domains. The OBO Foundry is one of the projects tasked with fostering the orthogonal development of ontologies – one term only appears in one ontology and is referenced by others – with the goal of shared semantics.
When considering how to curate, standardize or represent scientific data, either internally within laboratories, or externally for publication, the three significant properties of content, syntax and semantics should be considered carefully for the specific data. Consistent representation of data conforming to the Triumvirate of scientific data will provide a platform for the dissemination, interpretation, evaluation and advancement of scientific knowledge.
Thanks to Phil Lord for helpful discussions on the Triumvirate of data
Conflict of interest
I am involved in the MIBBI project, the development of GelML and a member of the OBO Foundry via the OBI project.