One of these Knowledge Centers, managed by the Mayo Clinic and Herndon, Va.-based SemanticBits, has a focus on vocabularies. The Mayo Clinic team has technical knowledge of tools in the caBIG vocabulary domain and expertise with software such as LexBIG, Semantic Media Wiki, and Protégé. SemanticBits, meantime, will develop online resources, including the knowledge base that delivers these vocabulary resources to users seeking to build their own interoperable tools and systems.
Vinay Kumar, COO of SemanticBits, told BioInform that the 3-year-old company has worked on several distributed computing grid infrastructure projects for NCI and NIH, and has developed quantitative biological tools, tools for clinical trials management, and applications involving semantic interoperability in the areas of vocabularies and metadata.
One tool that SemanticBits has developed for caBIG is caTRIP, an application that lets users post a query across several caBIG data services designed using common data elements in order to link different services.
The Knowledge Centers bring experts together, he said. "If people have questions they can come to us and we will work together to address that question."
In the next few weeks, caBIG will announce the vendors who qualify as licensed caBIG service providers under the caBIG Support Service Providers Program. Among those waiting is Fremont, Calif.-based BioPhase Systems.
"Many scientists have their data in Excel spreadsheets; we can help them migrate their data into the tools supported by caBIG," BioPhase founder and CEO Meena Vora told BioInform. BioPhase has developed software to integrate genomic and proteomic data analysis and has the potential to integrate caBIG applications, she said.
Where Space Ends
During a session on caBIG's Enterprise Support Network, however, some participants expressed confusion about the delineation between the Knowledge Centers and the vendor-based support program.
As Leslie Derr, caBIG's director of community alliances, explained to BioInform, the Knowledge Centers will provide web-based assistance while support service providers will work on a fee-for-service basis to, for example, tailor an installation or perform data migration for caBIG users. "To me there are very clear distinctions there," she said.
Knowledge Centers give researchers with expertise in a particular area of biomedicine and IT the ability to administer and monitor the caBIG infrastructure, she said. "Instead of the government maintaining expertise, [and] having that all focused within the government, we have empowered the community to provide that domain expertise."
Miguel Buddle, an associate at Booz Allen Hamilton, told BioInform after the session that the confusion may arise from the fact that this was the first public discussion of these two new entities. "As a new concept, people are having trouble seeing that line and maybe there is more communication we have to do on that," he said.
For high-level support, such as calling a help desk or obtaining customized training or documentation, institutions "absolutely should turn to support service providers," he said. "The Knowledge Centers provide only a limited amount of support that is entirely web-based," he added.
"The government doesn't want to be in the position of competing with private enterprise in this area," he said.
At the same time, he said he believes that this endeavor, with the government sponsoring "truly open development" of projects that are then turned over to the community, "is a pretty unique way of doing business." "It's hard for us to make the transition ... but it's certainly critical for the success of caBIG for it to grow," he said.
The Day-to-Day of caBIG
When it comes to software, adaptation and adoption are quite different beasts. While many scientists recognize the value of adopting caBIG tools and describe adoption as a fairly straightforward process, adaptation is a rougher road: it calls for software engineering so that caBIG tools link to legacy systems.
Sometimes small solutions can make a big difference. Northwestern University Biomedical Informatics Center's Gilbert Feng outlined to BioInform a "bridge" he developed called caBIO2BioC, laughingly adding that it urgently needs a shorter name.
This tool builds a connection between caBIG and BioConductor such that a query in R syntax leads to a reply from the caBIO database in XML, which, through an XML parsing library, is returned in R.
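The XML-parsing step in such a bridge can be illustrated with a minimal sketch. It is written in Python rather than R, and it is not Feng's code: the XML layout, the element and attribute names, and the parse_records function are all assumptions made for illustration, since the article does not show the caBIO response format.

```python
import xml.etree.ElementTree as ET

# Hypothetical response text; the real caBIO XML schema is not described
# in the article, so this layout is an assumption.
SAMPLE_XML = """
<queryResponse>
  <record>
    <field name="symbol">BRCA1</field>
    <field name="chromosome">17</field>
  </record>
  <record>
    <field name="symbol">TP53</field>
    <field name="chromosome">17</field>
  </record>
</queryResponse>
"""


def parse_records(xml_text):
    """Turn each <record> element into a dict keyed by field name."""
    root = ET.fromstring(xml_text.strip())
    return [
        {field.get("name"): (field.text or "").strip()
         for field in record.findall("field")}
        for record in root.findall("record")
    ]


records = parse_records(SAMPLE_XML)
for row in records:
    print(row)
```

The result is a list of plain records that an analysis environment could load into a table, which is roughly the role the XML parsing library plays in the bridge described above.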
At his institution, as at many others, researchers struggle to organize, integrate, and analyze their data. "Therefore, that is very important to connect BioConductor to caBIG," Feng said.
While there are packages that claim this connection is already possible, Feng explained that universal data retrieval between caBIG and BioConductor was previously not available.
Some adaptations require more than a software tool. As Booz Allen Hamilton's Adams indicated in a session on adaptation, the caBIG way of doing things begins with a well-established data model annotated with standardized vocabularies. This annotated information model is converted into common data elements, and then the information model can generate the application programming interface.
Outlining various design patterns of adaptation to connect a legacy tool to a caBIG tool, he explained that these patterns entail varying degrees of software engineering. Some design patterns might apply wrappers, while others can involve a message broker to transfer a message between the tools, which may be "really good" at institutions that already have a tradition of messaging with a robust HL7 V2 messaging architecture. Others, meantime, may include the use of extract, transform, and load (ETL) scripts and data warehousing.
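Of these patterns, the wrapper is the simplest to picture. The following is a minimal sketch in Python, not code from caBIG or from the session: the class names, method names, and record fields are hypothetical. It shows a legacy component left untouched while a thin layer presents its data under standardized attribute names.

```python
class LegacySampleStore:
    """Stand-in for an existing in-house system (hypothetical)."""

    def __init__(self):
        # Rows are stored as bare tuples: (tissue, collection date).
        self._rows = {101: ("liver", "2008-03-01")}

    def fetch_row(self, row_id):
        return self._rows[row_id]


class SampleServiceWrapper:
    """Wrapper exposing the legacy store through standardized names."""

    def __init__(self, legacy_store):
        self._legacy = legacy_store

    def get_sample(self, sample_id):
        # Delegate to the legacy call, then relabel the values.
        tissue, collected = self._legacy.fetch_row(sample_id)
        return {
            "id": sample_id,
            "tissueType": tissue,
            "collectionDate": collected,
        }


service = SampleServiceWrapper(LegacySampleStore())
print(service.get_sample(101))
```

The legacy class needs no changes, which is why a wrapper tends to sit at the lighter end of the engineering effort; a message broker or ETL pipeline would involve more moving parts.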
As Adams' colleague Reechik Chatterjee outlined, different design patterns are associated with different costs. For example, generating an API takes "a lot of effort" and users should keep the relationship between the API and the database in mind.
Doing the data mapping between the caBIG API and tables in a legacy database is "a considerable cost," said Chatterjee, and requires experts to be on hand for the task.
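A toy version of such a mapping, sketched in Python with an in-memory SQLite table, gives a sense of the task. The table, column names, and attribute names are invented for illustration and do not come from caArray or any caBIG API.

```python
import sqlite3

# Hypothetical mapping from legacy column names to API attribute names.
COLUMN_TO_ATTRIBUTE = {
    "smp_id": "sampleId",
    "tiss": "tissueType",
    "arr_dt": "hybridizationDate",
}


def map_rows(conn, table, mapping):
    """Read the mapped columns and relabel each row with API names."""
    columns = ", ".join(mapping)
    cursor = conn.execute(f"SELECT {columns} FROM {table}")
    return [dict(zip(mapping.values(), row)) for row in cursor]


# Build a tiny stand-in for a legacy database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE legacy_samples (smp_id INTEGER, tiss TEXT, arr_dt TEXT)"
)
conn.execute("INSERT INTO legacy_samples VALUES (1, 'liver', '2008-01-15')")

mapped = map_rows(conn, "legacy_samples", COLUMN_TO_ATTRIBUTE)
print(mapped)
```

In practice the hard part is not the relabeling but deciding what each legacy column means and which standardized element it corresponds to, which is presumably why experts need to be on hand.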
That is the situation Jackson Laboratory faced when it became one of the first institutions to adopt caArray, a microarray data management system now in version 2.0. And that is why "mapping" was, for a while, not exactly Grace Stafford's favorite term.
Stafford, senior bioinformatics specialist at the Jackson Lab, was responsible for the mapping project, which took 220 hours, she said.
The end result has been that the Jackson Lab's internal database appears unchanged for researchers who use the system for tracking gene expression or genetic aberrations. Users can request their data be exported to caArray anytime.
"We were very eager to get our data exposed to the grid and build a state-of-the-art data analysis environment for our cancer center investigators," said Charles Donnelly in a presentation. He directs the Jackson Lab's computational sciences group, which helps scientists with the development of scientific applications, statistics and other kinds of analysis, laboratory management, and also caBIG deployment.
"We went to the proverbial caBIG hardware store," he said, to find the tools needed for the caArray adaptation, but found that they were lacking.
As it turned out, the project required domain experts, software engineers, biostatisticians, bioinformaticists, research scientists, and project managers. It began with a look at how much the adaptation was going to cost and, more importantly, an analysis for the lab's scientists to ensure that the project would help with research and have scientific impact.
"That is why we do this, and not just because I am a propeller head, which I definitely am, and it is really cool computer science, but it actually needs to have scientific impact," Donnelly said.
Jackson Lab's internally developed database tracks investigators submitting requests, stores data, tracks the tissues, manages workflow, and presents the data to researchers. "You really don't want to disrupt that process," he said. CaArray, on the other hand, is more of a repository with a really strong querying capability, he told BioInform.