CTWatch Quarterly » Cyberinfrastructure For Knowledge Sharing

Perspectives

John Wilbanks, Science Commons

Forget “Web 2.0” – what about “Web 1.0” for science?

Much of the functionality we take for granted on the Web comes from making the choice to make sharing information easier, not harder. A good example is the way that Google interacts with the scientific literature.

With few exceptions, we rank the importance and relevance of scientific articles the way we always have, with citations and “impact factors.” Citations are longstanding and important. Impact factors – roughly, the average number of citations that a journal’s recent articles receive – are the dominant metric for journal quality. And for a long time, citations were clearly the best, and perhaps the only, statistical measure of the quality of a journal. In a print world, a world without hyperlinks and search engines and blogs and collaborative filtering, citations are a beacon of relevance.

But we live in a different world now. We have the ability to make connection after connection between documents, to traverse easily from one page to another page. Hyperlinks are cheap and they’re everywhere. It was a conscious design decision made by Tim Berners-Lee to allow this functionality. Other competing systems thought it insane that the WWW would let just anyone link to just anything else – those links might be broken, leading to the dreaded “404 not found” – and that would obviously kill the WWW! It hasn’t worked out that way. The choice to allow users the right to make hyperlinks, to make hyperlinking easy and fast, not only did not kill the Web, it is a big part of what makes Google searching so powerful.

Google ranks pages by downloading enormous chunks of the Web and running software that analyzes the linkages between Web pages. The system quite literally depends on there being lots and lots of links, many of them perhaps useless on their own, but which in aggregate provide hints of relevance. Thus, the number one Google result for a search on the words “Science Commons” is, roughly, the Web page matching the words “Science Commons” that has the most links pointing to it. There’s more complexity, obviously, but that’s a big part of the idea.
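
The link-analysis idea can be illustrated with a minimal sketch. It is not Google’s actual system: it is a simplified version of the published PageRank power iteration, and the page names, the toy link graph, the damping value, and the iteration count are all hypothetical choices made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages by the links pointing at them (simplified PageRank).

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to a score; the scores sum to 1.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy Web: most pages link to "science_commons".
toy_web = {
    "science_commons": ["portal"],
    "portal": ["science_commons"],
    "blog_a": ["science_commons"],
    "blog_b": ["science_commons", "blog_a"],
    "lab_page": ["science_commons"],
}

ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
print(best, round(ranks[best], 3))
```

No single link in the toy graph says much, but in aggregate the inbound links push one page to the top. The calculation is only possible because every page and every link is open to be read and counted.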

If those Web pages were private, the page ranking system wouldn’t work. The Web pages themselves are part of the infrastructure on which Google operates, on which millions of startup dreams are founded. In a world where every page was locked, where every Web designer had to ask permission to make an inbound link…we wouldn’t have the sprawling value creation we associate with the Internet. It would look a lot more like Prodigy looked a long time ago: a closed network that can’t compete in the end with the open networks.

Put another way, we have far more efficiency brought to bear on accelerating our capability to order consumer products than we do on accelerating our capability to perform scientific research. Biological reagents and assays are re-invented and reverse-engineered by readers of “papers” – years of laboratory work, data, living DNA and more compressed down to the digital equivalent of a sheet of dead tree.

We need the Web to work as well for science as it does for other areas. The capabilities now exist to integrate information, data, physical tools, order fulfillment, overnight shipping, online billing, one-to-one orders, and more. If we are to solve the persistent health problems of the world, of infectious disease in the developing world and rare disease in the developed world, the “Web 1.0” efficiency is an obvious benefit to bring to the life sciences.

But these advances we take for granted in daily life, like Google’s relevance-based search of the entire Web, eBay’s many-to-many listing and fulfillment, and Amazon’s one-click ordering, won’t come to science accidentally. There’s a significant collective action problem blocking the adoption of these systems and preventing the network effects from taking over in discovery.

But it’s not just the Google issue, which simply forces us to forgo existing technology and focus on citations as we have always done. Citations are also more constricting as a search metaphor. You are likely to enter the citation-ranked search world only when you know what to search for. But you might not know what you’re looking for. You might not know how to say it in the nomenclature of a related, but distinct, discipline.

It goes on. Citation linkages between papers are subject to enormous social pressures. One cites the papers of one’s bosses, of course. Review articles can skew impact factors. And of course, a tried-and-true way to get a heavily cited article remains to be horrifically, memorably wrong.

And over the long term, the cost of lacking more complex and realistic interconnections between articles – a web, a set of highways, an infrastructure connecting the knowledge – is that we can’t begin to integrate the articles with the databases. That matters because the actors in the articles (the genes, proteins, cells and diseases) are described in hundreds of databases.

And if we could link the articles not just to each other by a richer method than citations, but also to the databases, we could inch closer to the goal of a Rosetta Stone of knowledge, the small element upon which we can begin to build truly integrated, public knowledge spaces. That would in turn allow us to begin automatically indexing the data that robots are producing in labs every day, and to meaningfully extract actionable information from the terabytes of genomic data we are capable of producing.

You get those virtues only where you are dealing with the knowledge claims themselves, not just the sub-component of them that people in the field thought worthy of exposing. Only a better infrastructure gets you there, just as the modern highway system in the US allowed for better efficiency than the evolved hodge-podge of state highways. Citation linkages are very useful (and a later version needs to cross-reference them with these highways we propose – we didn’t throw away the state highways, after all!). This is simply a different set of tasks, and one that can be accomplished if enough smart people have enough rights and time to work on the knowledge.

But sadly, no one – no one! – has the right to download and index the scholarly literature without burning years of time and money in negotiations. Google has spent years asking for the right to index a lot of the scholarly canon for its Scholar project, but that’s not some open land trust for any researcher to work on. It’s just for Google. And the fact that Google alone has the right to index articles for such a service means that the next Google, the next set of genius entrepreneurs with a taste for search coding away in the halls of the local university, can’t apply their skills to the sciences.

Though we have the capability to drastically increase sharing at a much lower cost through digital distribution, search, and more, the reaction has instead been to segregate knowledge behind walls of cost, technology, and competitive secrecy. The net result is that we’re doing things the way we always did, only somewhat faster. If we want to bring both efficiency gains and radical transformation to the life sciences, getting more knowledge online, with the rights to transform, twist, tag, reformat, translate, and more, is going to be part of the solution. We have to start allowing the best minds of the world to apply the newest technologies to the scientific problems facing us.

There isn’t a single, open “Web” of content to search – it’s owned by a group of publishers who prevent indexing and search outside their own engines, and who use copyright and contracts to keep it locked up. There isn’t any easy way to find the tools of biological science – it’s a complicated social system of call-and-response, of email and phone calls, of “are you in the club of scientists worth partnering with?” questions and answers. And there isn’t a standard way to get your orders fulfilled, but instead a system in itself of materials transfer and ordering, university technology transfer, commercial incentives, deliberate withholding, and more. We don’t have the Web working yet for science.

