Recent years have seen an explosion in the number and variety of devices producing data streams, from software logs to handheld phones to aerial cameras. About 2.5 quintillion bytes of data are created globally each day. Much of this data starts its life widely distributed. However, users often want to analyze data across the system as a whole. By focusing only on data processing within datacenters, the research community is overlooking an increasingly important part of the Big Data challenge.
Today, a common technique for data analysis is to backhaul all the data generated at wide-area sources to a central datacenter, where it is then stored and processed. This approach lets analysts use existing tools developed for single-datacenter large-scale analytics. Backhaul, however, incurs the high cost of wide-area data transfer, and that tradeoff is worsening: wide-area bandwidth has historically grown far more slowly, and stayed far more expensive, than the compute and storage inside a datacenter.
We have been investigating new systems and algorithmic approaches to this problem, and have built JetStream, a stream processing system designed for the wide area.
JetStream addresses bandwidth limits in two ways, both of which are explicit in the programming model. The system incorporates structured storage in the form of OLAP data cubes, so data can be stored for analysis near where it is generated. Using cubes, queries can aggregate data in ways and at locations of their choosing. The system also includes adaptive filtering and other transformations that adjust data quality to match available bandwidth.
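To make the two ideas concrete, here is a minimal sketch, not JetStream's actual API: a cube is modeled as counts keyed by dimension tuples, a rollup aggregates along a subset of dimensions, and a degradation step ships a coarser rollup when the full-resolution result exceeds a (hypothetical) bandwidth budget measured in cells.

```python
# Illustrative sketch of source-side aggregation with an OLAP-style cube
# and bandwidth-adaptive degradation. All names here are assumptions for
# illustration, not JetStream's real interface.
from collections import defaultdict

def insert(cube, dims, value):
    """Aggregate a data point into the cube, keyed by its dimension values."""
    cube[dims] += value

def rollup(cube, keep):
    """Roll the cube up along a subset of dimension positions.
    E.g. keep=(0,) collapses every dimension except the first."""
    out = defaultdict(int)
    for dims, v in cube.items():
        out[tuple(dims[i] for i in keep)] += v
    return out

def degrade(cube, keep_full, keep_coarse, budget):
    """Send full resolution if it fits the cell budget, else a coarser rollup."""
    full = rollup(cube, keep_full)
    return full if len(full) <= budget else rollup(cube, keep_coarse)

# A local node aggregates request counts by (minute, URL) near the source.
cube = defaultdict(int)
insert(cube, ("12:00", "/a"), 3)
insert(cube, ("12:00", "/b"), 1)
insert(cube, ("12:01", "/a"), 2)

# Ample bandwidth: ship per-URL counts. Constrained: only per-minute totals.
print(dict(degrade(cube, keep_full=(0, 1), keep_coarse=(0,), budget=10)))
print(dict(degrade(cube, keep_full=(0, 1), keep_coarse=(0,), budget=2)))
```

The key property this illustrates is that degradation trades resolution, not correctness: the coarse result is an exact aggregate over fewer dimensions, so downstream queries at that granularity still see true totals.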
A paper about JetStream appeared at NSDI 2014.