I work on a project called Science Commons – part of the Creative Commons (CC) non-profit organization. (CC is the creator of a set of legal tools for sharing copyrighted works on the Web using a modular set of machine-readable contracts. CC licenses cover more than 150,000,000 copyrighted objects on the Web, including such high-impact offerings as BioMed Central, the Public Library of Science, Nature Precedings, Hindawi Publishing, and the UniProt database of proteins.) Science Commons is building a toolkit of policy, contracts, and technology that increases the chance of meaningful discovery through a shared, open approach to scientific research. We’re building part of the infrastructure for knowledge sharing, and we’re also deploying some test cases to demonstrate the power of investing in this kind of infrastructure.
Science Commons isn’t alone. Sharing approaches that address a single piece of the research cycle are making real, but painfully slow, progress. Open Access journals are far from the standard. Biological research materials are still hard to find and harder to access. And while most data remains behind laboratory firewalls, even those data sets that do make it online are frequently poorly annotated and hard to use. The existing approaches are not creating the radical acceleration of scientific advancement that the technical infrastructure for generating and sharing information makes possible.
Science Commons represents an integrated approach – one with potential to create this radical acceleration. We are targeting three key blocking points in the scientific research cycle – access to the literature, experimental materials, and data sharing – in a unified approach. We are testing the hypothesis that the solutions to one problem represent inputs to the next problem, and that a holistic approach to the problems discussed here potentially benefits from network effects and can create disruptive change in the throughput of scientific research. I will outline how these approaches represent tentative steps towards open knowledge infrastructure in the field of neuroscience.
Above is the biological pathway for Huntington’s Disease. This pathway is like a circuit – it governs the movement of information between genes and proteins, processes and locations in the cell. This one is a relatively simple pathway, as far as such things go. More complex pathways can have hundreds of elements in the network, each connected by “directional” links – not just linked like Web pages, but typed and directed, where the kind of relationship and the causal order are vital both in vitro and in silico.
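To make “typed and directed” concrete, here is a minimal sketch of how such a pathway might be held in software: a directed graph whose edges carry relationship labels. The node names echo entities that appear later in this piece (HD, p53, a caspase), but the specific relations and directions shown are invented placeholders, not the curated Huntington’s pathway.

```python
# Minimal sketch: a pathway as a typed, directed graph.
# NOTE: the relations below are illustrative placeholders, not curated biology.
import networkx as nx

pathway = nx.MultiDiGraph()

# Each edge records *which kind* of relationship points *which way* –
# the two things a plain hyperlink-style graph cannot express.
pathway.add_edge("HD", "CASP3", relation="activates")            # placeholder
pathway.add_edge("p53", "CASP3", relation="upregulates")         # placeholder
pathway.add_edge("HD", "mitochondrion", relation="located_in")   # placeholder

# Walking the circuit means following typed, directed links:
for source, target, data in pathway.edges(data=True):
    print(f"{source} --{data['relation']}--> {target}")
```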
In this pathway, the problem is the HD gene in the middle of the circuit - if that gene is broken, it leads to a cascade that causes a rare, fatal disease where the brain degenerates rapidly. Although the genetic element has been understood for a long time, there is no cure. Not enough people get the disease for it to be financially worth finding a cure, given how expensive it is to find drugs and get them to market. That’s cold comfort to the tens of thousands of people who succumb each year and to their families who know they have a 50% chance of passing on the gene and disease to their children. But that’s the reality.
Years of research have led to an enormous amount of knowledge about Huntington’s. For example, a search in the U.S. government’s free Entrez web resource on “Huntington’s” yields more than 6,000 papers, 450+ gene sequences, 200+ protein sequences, and 55,000 expression and molecular abundance profiles. That’s a lot of knowledge. The papers alone would take 17 years to read, at the rate of one paper per day (and that’s assuming no new papers are published in the intervening years). Yet Huntington’s actually returns a relatively small result set. One of the actors in the pathway is called “TP53.” Searching for it brings up another 2,500 papers – and also reveals, via an indirect link to a page about sequences for this entity, that it has a synonym: “p53.” Entrez brings back 42,000 articles for that search string – 115 years to read!
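Those counts come from ordinary Entrez searches, and anyone can reproduce the “years to read” arithmetic programmatically. Below is a hedged sketch using Biopython’s wrapper around NCBI’s E-utilities; the numbers will have grown since this was written, and the search terms shown are just one reasonable way to phrase the queries, not the exact strings behind the figures quoted above.

```python
# Sketch: reproduce the "years to read" arithmetic from live Entrez counts.
# Counts change constantly; the terms below are one reasonable phrasing, not
# the exact queries behind the numbers quoted in the text.
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI asks for a contact address

def pubmed_count(term):
    """Return the number of PubMed records matching a search term."""
    handle = Entrez.esearch(db="pubmed", term=term, retmax=0)
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"])

for term in ("huntington's disease", "TP53", "p53"):
    n = pubmed_count(term)
    print(f"{term}: {n} papers, ~{n / 365:.0f} years at one paper per day")
```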
It goes on and on. And having all of this knowledge is wonderful. But there are more than a few problems here. The first is something you might call “cognitive overload.” Our brains simply aren’t strong enough to take in 500,000 papers, read them all, build a mental model of the information, and then use that information to make decisions – decisions like: what happens if I knock out that CASP box in the pathway, the one with 27,000 papers behind it?
The other problems stem from the complexity of the body. In what other circuits is each entity in the pathway involved? What about those tricky causal relationships above and below it in the circuit? What are the implications of intervening in this circuit for all the other circuits?
Some of these entities, the boxes in the diagram, are metaphorically similar to the airport in Knoxville, TN. Knocking out that airport doesn’t foul up a lot of air traffic. But some of these – p53, for example – are more like Chicago. Interfering with that piece of the network reverberates across a lot of seemingly unrelated pieces of the network. That’s what we call side effects, and it’s one of the reasons drugs are so expensive – we know that we can impact this circuit, but we don’t realize how badly it affects everything else until we run the drug in the only model available that covers all possible impacts: the human body.
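The Knoxville/Chicago difference is, at bottom, just connectivity: some nodes touch a handful of edges, while hubs touch a large fraction of the network. The toy network below (invented for illustration, not pathway data) shows the effect: removing a peripheral node barely changes the graph, while removing the hub shatters it.

```python
# Toy illustration of hub vs. peripheral nodes – an invented network, not biology.
import networkx as nx

# A hub-and-spoke network: one Chicago-like node, many Knoxville-like ones.
g = nx.Graph()
g.add_edges_from(("hub", f"spoke{i}") for i in range(10))
g.add_edge("spoke0", "regional")  # one small side connection

def largest_piece_after_removing(graph, node):
    """Size of the biggest connected component once `node` is knocked out."""
    trimmed = graph.copy()
    trimmed.remove_node(node)
    return max(len(c) for c in nx.connected_components(trimmed))

print(largest_piece_after_removing(g, "regional"))  # prints 11 – network stays intact
print(largest_piece_after_removing(g, "hub"))       # prints 2 – network shatters
```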
And this is just the papers. There are thousands of databases with valuable information in them. Each has different access conditions, different formats, different languages, and different goals; none was designed to work with the others; and each is maintained at a different level of quality. But they hold vital – or potentially vital, to the right person asking the right question – information. And if we could connect these knowledge sources into a single network, we just might be able to leverage the power of other technologies built for other networks. (Like Google – but maybe more like the next Google, something as dramatically better and different and radical as Google was when we first saw it in the late 1990s.)
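One common way to make that connection – and roughly the approach the Semantic Web community advocates – is to express each source’s records as triples against shared identifiers, so that independently maintained databases merge into a single queryable graph. The sketch below is a toy under loud assumptions: the URIs, predicates, and facts are invented, and in practice the hard part is the curation work of agreeing on identifiers in the first place.

```python
# Toy sketch: merging two independently maintained sources into one graph.
# The URIs, predicates, and facts here are invented for illustration only.
from rdflib import Graph

papers_ttl = """
@prefix ex: <http://example.org/> .
ex:paper42 ex:mentionsGene ex:TP53 .
"""

pathways_ttl = """
@prefix ex: <http://example.org/> .
ex:TP53 ex:participatesIn ex:HuntingtonsPathway .
"""

g = Graph()
g.parse(data=papers_ttl, format="turtle")    # source one: a literature index
g.parse(data=pathways_ttl, format="turtle")  # source two: a pathway database

# A question neither source can answer alone: which papers mention genes
# that sit in the Huntington's pathway?
query = """
PREFIX ex: <http://example.org/>
SELECT ?paper WHERE {
  ?paper ex:mentionsGene ?gene .
  ?gene  ex:participatesIn ex:HuntingtonsPathway .
}
"""
for row in g.query(query):
    print(row.paper)  # -> http://example.org/paper42
```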
There are two problems to be addressed here. One is the materials that underpin the knowledge in these databases and articles. Those materials are “dark” to the Web, invisible, and not subject to the efficiency gains we take for granted in the consumer world. The second is the massive knowledge overload that the average scientist faces. I’ll outline two proofs of concept to demonstrate the value of investment in infrastructure for knowledge sharing that can address these problems.