BitSpeed Pushes Software Solutions for High-Speed Data Transfer
By Kevin Davies
February 7, 2013 | Imagine a piece of software that could simultaneously write the same data file in Los Angeles while it is still streaming off a next-generation sequencing (NGS) instrument in New York, essentially reducing the time to transport that data from hours or minutes to virtually zero.
It sounds a little far-fetched, but that is the promised performance of Concurrency, the latest software product from Los Angeles-based BitSpeed. The software is currently being tested in a research lab at the University of Southern California (USC), and BitSpeed executives believe it will warrant a close look by many life science organizations struggling to manage big data or balking at the cost or ease of use of existing commercial or open-source solutions for data transport.
Concurrency updates BitSpeed’s Velocity software, which expedites the transfer of large data files. Although its software is based on a different protocol, BitSpeed hopes to offer a compelling alternative to Aspera, which over the past few years has become the dominant commercial provider of data transport protocols, gaining strong traction within the life sciences community.
The BioTeam consultant Chris Dwan, who is currently working with the New York Genome Center, says the bandwidth problem addressed by companies like Aspera and BitSpeed, and by tools such as EMC Isilon's SyncIQ and GridFTP from Globus Online, is critical. “There are a lot of underutilized 1 Gb/sec connections out there in the world,” says Dwan.
“Aspera’s done a good job,” BitSpeed co-founder Doug Davis conceded in an interview with Bio-IT World, before laying out why he thinks his software is superior in cost effectiveness, ease of configuration, features and performance.
Moving Data
BitSpeed was founded in 2008 by Davis and Allan Ignatin, who previously founded Tape Laboratories, a developer of back-up technologies and virtual tape libraries. Tape Laboratories developed a close relationship with Hewlett-Packard (its back-up technology still exists as part of HP’s widely used NonStop series) until it was sold in 2006.
Later, Ignatin reconnected with Davis, a former CEO of Tape Laboratories, and hatched the idea of BitSpeed. “We noticed problems in transferring data outside buildings,” says Davis. “But what did we know? We were just storage guys—we thought latency was just a necessary evil.”
Initially BitSpeed focused on local area network (LAN) optimizations, but the founders soon recognized a much bigger opportunity. Launched in 2010, Velocity gained a foothold in the video and entertainment sector as well as other verticals. Some health care centers such as the Mayo Clinic also signed on, but the medical space wasn’t the initial focus.
Velocity is a peer-to-peer software package that does three things, says Davis: “Accelerate. Ensure. Secure.” It’s about enhancing the speed, integrity, and security of the data, he says. The product works on LANs as well as within organizations and between storage nodes. “No other solution does that,” says Davis.
The software installs within a few minutes, says Davis, and configures automatically. Because of a modular architecture, it is embeddable in other solutions. There are two licensing models—either point-to-point or multitenant. “You can put a big license in the cloud deployment or data center, and all clients are free of charge. It’s a compelling model,” says Davis.
Protocol Preferences
As reported in a Bio-IT World cover story in 2010, Aspera’s patented fasp data transfer protocol makes use of UDP (user datagram protocol), which was originally developed more than 30 years ago as a lightweight means of moving data quickly.
In Davis’ opinion, however, UDP is like “throwing mud on a wall, then picking up what falls off with a shovel, and repeating the process until all the mud is on the wall.” A transmission might be reported as successful even as packets of data are still being sent to complete the transmission, he says.
BitSpeed, by contrast, is based on TCP (transmission control protocol). “We’re the only company with accelerated TCP,” says Davis. “We can perform better, provide more security and order than UDP-based solutions.”
TCP is an ordered protocol, which Davis argues is important for data integrity. “We grab the data, mixed up in some cases, and lay the data down at the destination in the same sequence. This is important: if the data are jumbled, you might need a third software package to re-order the data.”
UDP and TCP have their respective advocates, of course, but as Cycle Computing CEO Jason Stowe points out, both also have tradeoffs and there is only so much that can be deduced by evaluating algorithms theoretically. “TCP inherently gets better reliability and is used by many protocols, including HTTP, at the cost of lower throughput and overhead,” says Stowe. He also points out that “noisy networks aren't friendly to UDP either.”
But the only true test, says Stowe, is a benchmark with real-world data between real endpoints, ideally also including open protocols such as FDT and Tsunami.
Another potential advantage of TCP is that it is being studied extensively by standards organizations. While modest improvements are being made to UDP, according to Ignatin, “TCP has thousands of organizations working on it, all of which we can take advantage of. They’re distributed and converted to each operating system fairly transparently. So when a new congestion-control algorithm [is released], we get it automatically. We’ve added MD5 checksums on every block, so all data received are 100 percent intact.”
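Ignatin does not spell out how the per-block checksums are computed, but the general idea can be sketched in a few lines of Python. The block size, function names, and retransmit behavior below are illustrative assumptions, not BitSpeed’s implementation:

```python
import hashlib

BLOCK_SIZE = 1 << 20  # 1 MiB per block; BitSpeed's actual block size is not public

def send_blocks(src_path):
    """Yield (block, MD5 digest) pairs for each fixed-size block of a file."""
    with open(src_path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            yield block, hashlib.md5(block).hexdigest()

def receive_blocks(dst_path, blocks_with_digests):
    """Write blocks to disk, re-checking each MD5 before accepting it."""
    with open(dst_path, "wb") as f:
        for block, expected in blocks_with_digests:
            if hashlib.md5(block).hexdigest() != expected:
                raise IOError("block failed MD5 verification; retransmit needed")
            f.write(block)
```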
Stowe notes, however, that checksums are “commonly used to verify that transfers occurred without error in many systems.”
As the name suggests, one of the potential virtues of Velocity is speed of data transfer, which emerges from a complex multi-level buffering scheme that “gulps data off storage and puts it back on,” says Ignatin. “We take the connection between two points, and use all the available bandwidth. Most connections use 30-40 percent efficiency, so we get more bang for the buck. We take a single TCP connection, break it into multiple parallel streams, then re-assemble [the data] on the other end.”
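Neither Davis nor Ignatin describes the striping scheme in detail, but the basic pattern of splitting one logical transfer across several TCP connections and reassembling by file offset at the far end might look roughly like the following sketch. The hostname, chunk size, stream count, and offset-prefix framing are all assumptions for illustration, not BitSpeed’s protocol:

```python
import socket
from concurrent.futures import ThreadPoolExecutor

CHUNK = 8 << 20        # 8 MiB stripes; illustrative value
STREAMS = 4            # number of parallel TCP connections (assumption)
DEST = ("dest.example.org", 9000)   # hypothetical receiver

def send_stripe(args):
    """Send one stripe over its own TCP connection, prefixed by its file offset."""
    offset, data = args
    with socket.create_connection(DEST) as sock:
        sock.sendall(offset.to_bytes(8, "big") + len(data).to_bytes(8, "big") + data)

def parallel_send(path):
    """Break a file into stripes and push them over STREAMS concurrent sockets.

    Because each stripe carries its offset, the receiver can seek() to the right
    position, so arrival order across connections does not matter."""
    stripes = []
    with open(path, "rb") as f:   # reads stripes into memory; fine for a sketch only
        offset = 0
        while True:
            data = f.read(CHUNK)
            if not data:
                break
            stripes.append((offset, data))
            offset += len(data)
    with ThreadPoolExecutor(max_workers=STREAMS) as pool:
        list(pool.map(send_stripe, stripes))
```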
“The bigger the bandwidth, the bigger the acceleration,” says Davis. In one benchmarking test, he says Velocity took 1 minute 43 seconds to move 10 gigabytes (GB) of data to four sites in New York, Tokyo, Rome and Sydney—regardless of distance.
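For a rough sense of scale, that benchmark works out to nearly the full capacity of the 1 Gb/sec links Dwan mentions; a quick back-of-the-envelope check (assuming decimal gigabytes):

```python
# Rough check of the quoted benchmark
size_bytes = 10 * 10**9          # 10 GB payload
elapsed_s  = 1 * 60 + 43         # 1 minute 43 seconds
rate_MBps  = size_bytes / elapsed_s / 10**6   # ~97 MB/sec
rate_Gbps  = rate_MBps * 8 / 1000             # ~0.78 Gb/sec
print(round(rate_MBps), round(rate_Gbps, 2))  # prints: 97 0.78
```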
Thus far, the only significant deployment within a life sciences organization is in the lab of neuroscientist James Knowles at the USC Keck Medical Center. (The introduction was made by Ignatin’s wife, who is a USC faculty member.) At the time of Velocity’s installation, the Knowles lab had three Illumina sequencers sending data to a Windows server and a Solaris server, writing at about 4 MB/sec. The Solaris server transfers data to the HPC computing center six miles away.
In the capable hands of system administrator Andrew Clark, Velocity has expedited the transport of about 1 terabyte of NGS data daily to the HPC computing center. What formerly crawled along at 5-7 megabytes (MB)/second was upgraded to nearly 80 MB/sec without configuration, and 112 MB/sec with configuration. Typical transport times of 20 hours were slashed to less than two.
When Clark’s team added compression, he found no benefit at first—until it became apparent that the storage I/O of the disk array in the HPC center wasn’t fast enough. “This is a pretty common result for the software,” says Davis. “Marketing geniuses that we are, it never occurred to us that we could do this.”
Following the installation of a faster disk array, transfer speeds doubled to nearly 235 MB/sec. Clark said Velocity “has proved absolutely invaluable to speeding up our data transfers.”
Active Replication
As promising as Velocity looks, BitSpeed has particularly high hopes for its latest software, Concurrency—a patent-pending technology that does active file replication. The product was unveiled in May 2012 at the National Association of Broadcasters convention.
Explains Ignatin: “Concurrency senses the beginning of a file and writes it in multiple locations at the same time. As data are created at the source, it’s being created at the destination. The destination, in turn, can be transferring it simultaneously to another location. It’s called ‘chain multi-casting’ and saves a lot of time.”
“We’ve made it virtually automatic,” Ignatin continues. “We watch those folders for creation of files that match a specific description—in name, or suffix, time, whatever. There is no limit to the number of watch folders we can handle. It’s not like a Dropbox. None of the SysAdmins at server B, C, or D have to do anything.”
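BitSpeed has not published how Concurrency detects and follows growing files, but the watch-folder idea Ignatin describes can be approximated with simple polling. The directory path, filename pattern, timing constants, and forward() callback below are all hypothetical stand-ins; in the chain multi-casting scenario, forward() would push each new region to server B, which could relay it on to C:

```python
import glob, os, time

WATCH_DIR = "/data/runs"      # hypothetical watch folder
PATTERN   = "*.fastq.gz"      # match by suffix, as Ignatin describes
POLL_SECS = 2
IDLE_SECS = 30                # assume a file is complete once it stops growing

def follow(path, forward):
    """Stream a growing file, shipping each newly written region as soon as it lands."""
    sent, idle = 0, 0.0
    while idle < IDLE_SECS:
        size = os.path.getsize(path)
        if size > sent:
            with open(path, "rb") as f:
                f.seek(sent)
                forward(f.read(size - sent))  # e.g. push to server B, which relays on to C
            sent, idle = size, 0.0
        else:
            time.sleep(POLL_SECS)
            idle += POLL_SECS

def watch(forward):
    """Poll the watch folder and follow any new file matching PATTERN."""
    seen = set()
    while True:
        for path in glob.glob(os.path.join(WATCH_DIR, PATTERN)):
            if path not in seen:
                seen.add(path)
                follow(path, forward)  # a production tool would follow each file in its own thread
        time.sleep(POLL_SECS)
```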
With Concurrency, Davis says, data written to the local servers are also written to a center miles away. “When the sequencers have finished, it’s already there.” In theory, hours of transport time are reduced to essentially zero. At USC, Clark has been experimenting with Concurrency, but he told Bio-IT World that the product was still being evaluated and he had no further comment.
BitSpeed has also developed faster algorithms for data compression and encryption. The compression algorithms run as the data are in flight, which in principle provides further performance advantages. A pair of encryption algorithms optimizes security, including a proprietary algorithm called ASC (Advanced Symmetric Cipher). “It’s a robust algorithm… with very little CPU usage,” says Ignatin.
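BitSpeed’s compression codecs and ASC cipher are proprietary, so the sketch below substitutes Python’s zlib and the widely used cryptography library’s Fernet cipher purely to illustrate the “compress and encrypt while in flight” pipeline; chunk framing, key exchange, and the matching decrypt/decompress stage on the receiving end are omitted:

```python
import zlib
from cryptography.fernet import Fernet   # stand-in cipher; BitSpeed's ASC is proprietary

key = Fernet.generate_key()              # in practice the key would be exchanged securely
cipher = Fernet(key)
compressor = zlib.compressobj()

def prepare_chunk(chunk, last=False):
    """Compress, then encrypt, one chunk of a stream while the transfer is in flight."""
    data = compressor.compress(chunk)
    if last:
        data += compressor.flush()       # flush any buffered compressed bytes at end of stream
    return cipher.encrypt(data) if data else b""
```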
The ability to have data encrypted in flight should prove attractive for patient records and other data subject to HIPAA and other compliance requirements. “How do [users] get the big data to/from the cloud? How do they ensure it is secure?” asks Davis. It may expand use of the cloud, as a cloud provider’s security is of little use if the data aren’t secured en route, he says.
Davis says that BitSpeed’s software is attractively priced and interested parties can register online for a 15-day free trial.
But while rival protocols duke it out in the marketplace, Dwan from The BioTeam says they are still missing the bigger issue, namely “the question of making the data scientifically useful and usable. None of these tools address that question at all.”