Linking Data: New Life for Semantic Technologies

Cambridge Semantics provides flexible solutions for the data deluge.

By Kevin Davies

February 4, 2011 | A small software company formed by a group of former IBM staffers is breathing new life into semantic technologies. But don’t look for Cambridge Semantics to harp on the term.

“The world of people well versed in semantic technology is still quite small,” says co-founder Lee Feigenbaum. “It’s important that anyone working with our software should not be IT. You won’t see the word ‘semantics’ anywhere in our software. It’s an enabler for us. We can’t build our software without these technologies, but now we’ve built them, we’ve no interest in preaching that you’re using semantics.” (See “Masters of the Semantic Web,” Bio•IT World, Oct 2005.)

“We don’t lead with ‘Semantic Web’ as a marketing term,” adds senior product manager Rob Gonzalez. “We’d like to see more companies like us trying to solve real-world problems. For us it’s about the problems we’re solving.”

Along with CTO Sean Martin, Feigenbaum was one of a group of about 20 people in an advanced technology group at IBM dating back to 1995. The group’s mission was to research new Internet technologies (including semantic technologies) and potential applications for IBM. An early client was a group of cancer researchers at the Massachusetts General Hospital (the Center for the Development of a Virtual Tumor), for which the IBM team helped to deploy semantic technologies for building and sharing models, data, and literature.

In 2007, Martin and Feigenbaum, together with Simon Martin and Emmett Eldred, established Cambridge Semantics and spent a couple of years building up the engineering team and testing early products before launching its first commercial product in late 2009. Luckily, much of the IBM group’s technology was open source. “People have been [saying] that they can’t build libraries or services that are really reusable or discoverable. We think with semantics, you get these benefits,” says Feigenbaum.

Early customers include Johnson & Johnson, Merck, and Biogen Idec, although Cambridge Semantics’ client base also includes Fortune 500 companies in advertising and the oil industry. “This technology can be used in many industries, but is particularly geared toward life sciences,” says Gonzalez. “The data bonanza isn’t comparable to other industries. Life scientists simply need this flexibility.”

Semantic Sidestep

There’s a saying that Feigenbaum admits is neither new nor particularly funny, but it makes a point: If you put ten Semantic Web advocates in a room, you’ll get 15 different explanations of what the Semantic Web is. “You have a loosely coupled set of technologies that people can use for a million different things. People will latch onto something and say this is the real semantic technology.”

Indeed, Feigenbaum is blunt in his criticism of vendors and users alike who proclaim the magical properties of the Semantic Web. “I’ve seen pharma talk about semantics as the ultimate data integration/analysis tool. That’s all well and good and we might get there in the next 10-15 years, but it’s never been what we’ve seen in semantics.”

For Feigenbaum, the interesting bit of semantic technology is the notion of rebranding data in a flexible and agile way. “The underlying properties of semantic technologies let you build very agile, adaptive software systems as data sources change. It happens in all industries but especially in life sciences.”

Semantics is about flexibility and having a common data model into which one can take information from a variety of sources (XML, relational databases, or public clinical trial databases) and “map them to a common format not constrained by any a priori database schema or XML structure. We saw this flexibility in 2001, and proved it out at IBM. That’s what we wanted to leverage.”

Cambridge Semantics released its first three products in 2009. “There’s no magic to the software,” says Feigenbaum. Just an easy-to-use interface and set of tools that allows users to point to a particular area in a spreadsheet, for example, and ascribe a meaning, e.g. adverse event, assay result. “You have these common vocabularies and data models, and the system takes care of finding values that match and links them together, without having necessarily considered that way of linking things when you set up the system.”
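
The idea Feigenbaum describes, ascribing a shared meaning to columns from different sources and letting matching values link the records, can be illustrated with a minimal Python sketch. This is not Cambridge Semantics' software or API; the function names, the vocabulary terms ("compound", "adverse_event", "assay_result"), and the sample rows are all hypothetical, chosen only to show the general technique of mapping tabular data to subject-predicate-value triples.

```python
from collections import defaultdict


def rows_to_triples(rows, id_column, column_meanings, source):
    """Turn tabular rows into (subject, term, value) triples.

    column_meanings maps a source-specific column name to a shared
    vocabulary term (hypothetical terms, e.g. {"AE": "adverse_event"}).
    """
    triples = set()
    for row in rows:
        subject = f"{source}:{row[id_column]}"
        for column, term in column_meanings.items():
            value = row.get(column)
            if value not in (None, ""):
                triples.add((subject, term, value))
    return triples


def link_by_shared_value(triples, term):
    """Group subjects from any source that share a value for one term."""
    groups = defaultdict(set)
    for subject, predicate, value in triples:
        if predicate == term:
            groups[value].add(subject)
    # Keep only values that actually connect two or more records.
    return {v: s for v, s in groups.items() if len(s) > 1}


# Hypothetical sample data: a desktop spreadsheet and a relational table.
spreadsheet_rows = [
    {"Row": 1, "Cmpd": "CMP-001", "AE": "headache"},
    {"Row": 2, "Cmpd": "CMP-002", "AE": "nausea"},
]
database_rows = [
    {"assay_id": 77, "compound_code": "CMP-001", "result": 0.42},
]

# Each source keeps its own column names; only the mapping differs.
graph = rows_to_triples(
    spreadsheet_rows, "Row",
    {"Cmpd": "compound", "AE": "adverse_event"}, "sheet",
)
graph |= rows_to_triples(
    database_rows, "assay_id",
    {"compound_code": "compound", "result": "assay_result"}, "db",
)

links = link_by_shared_value(graph, "compound")
print(links)
```

In this sketch, neither source had to anticipate the other: the spreadsheet row and the database row become linked only because both were mapped to the same vocabulary term and happen to share a value. Adding a third source would mean writing one more mapping, not changing a schema.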

The Anzo Data Collaboration Server, which sits on the user’s server, is semantic middleware, the plumbing that runs and connects everything else. Says Feigenbaum: “It invokes Web services. It has data services and server services that let you build flexible applications.”

Anzo on the Web is a Web 2.0-style application for self-service reporting of any data connected to the data collaboration server. Typically, when users want to use a new data source, Gonzalez explains, they have to change the database, then the application code, then the web tier. “With Anzo on the Web, you can bring the new data source easily into the data collaboration server, and it propagates throughout the system without requiring a lot of manual changes, so it’s resilient to new types of information being added.” The application is designed for scientists who aren’t necessarily IT experts. “They don’t have to go to IT to build new views; they can do it,” says Feigenbaum.

Anzo for Excel is a plug-in to Microsoft Excel that lets people use spreadsheets more effectively. It makes the collection of ad hoc data trivial, says Feigenbaum. “It turns Excel into a data collection application and lets it serve as a user interface for all this data integrated on the server. Now you can consume the data.” A recently released second version adds an unnamed component that allows users to collect and integrate data from relational databases.

The company announced in mid-January an agreement with Cray to collectively develop and market solutions, including the Cray XMT system and the Anzo product suite. But Feigenbaum is also using the Amazon Cloud, particularly with new prospects. “The data integration paradigm we’re preaching is anathema to a lot of traditional IT,” says Feigenbaum, particularly in regard to procuring hardware, which can sometimes take months. “Many customers run a proof-of-concept in the Cloud with hosted versions of the software. That lets them prove out the technology and work on the procurement to deploy inside their firewall.”

One of the chief benefits of Cambridge Semantics, says Feigenbaum, is that it affords pharma customers the ability not only to pull in and analyze the data from a traditional database but also “the last 10-15% of their data that might be lurking in a desktop spreadsheet or a public resource such as NCBI. They don’t want to spend millions of dollars and 18 months only to get 90% of the way. They need to handle the heterogeneity of Excel and public data. [The missing data] might only be a small part of the total information but it’s a deal breaker.”

Early users span applications from manufacturing quality control to budgeting, allowing customers such as Biogen Idec to compare their actual spend with budget projections. Merck is using Cambridge Semantics applications to procure time on lab equipment.

Cambridge Semantics is still learning from its early customers where its technology can be leveraged. One promising area is in clinical trial data management. Says Feigenbaum: “When you’ve brought together data that don’t normally talk to each other, there’s a bunch of things you can do, such as looking at data for a drug across trials/phases. But some [historical] trials might have used SAS or Oracle Clinical. This is a good way to bring data together,” perhaps to identify reporting discrepancies for regulatory purposes.

An alternative term for semantic technologies that is growing in popularity is “linked data.” “It’s fine,” shrugs Feigenbaum. “It’s just another name. It’s had some success in life sciences, but I don’t care what it’s called.” •

This article also appeared in the January-February 2011 issue of Bio-IT World Magazine.