The CCC “Big Data Computing Study Group” helped organize two adjacent events in Sunnyvale in March: the “Hadoop Summit” and the “Data-Intensive Scalable Computing (DISC) Symposium”.
The Hadoop Summit was an open event, hosted by Yahoo! Research. Its goal was to build a community among users of the open-source Hadoop software suite for distributed programming in the map-reduce style. About 350 people attended, a much larger crowd than originally expected. The DISC Symposium was an invitation-only event (~125 attendees) whose goal was to build a community among DISC researchers.
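For readers unfamiliar with the map-reduce style, the canonical illustration is counting word occurrences across a large collection of files. The sketch below is written against the Hadoop Java API of this era (the org.apache.hadoop.mapred classes); it is a minimal illustration of the programming model, not code drawn from either event.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // map: for each word in an input line, emit the pair (word, 1)
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // reduce: sum all the counts emitted for a given word
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); // pre-aggregate counts on each node
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf); // submit the job and wait for completion
  }
}
```

The appeal of the model is that the programmer supplies only the map and reduce functions; the framework takes care of partitioning the input across the cluster, shuffling and sorting the intermediate pairs, and re-running tasks that fail.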
The presentations at the Hadoop Summit were fascinating. They varied greatly in technical depth, but taken together they conveyed a sense of rapid growth in the ingenuity being directed at large-scale data-intensive problems on scalable computing clusters. As one might expect, academic researchers were among the speakers, as were people from the industry research labs at Yahoo!, IBM, and Microsoft. But there were also technical talks by developers at places like Google, Amazon, Rapleaf, Facebook, and Autodesk, each essentially a “show-and-tell” on interesting data-intensive problems being tackled inside those companies. This gave attendees a glimpse of the growing industry interest in Hadoop.
The DISC Symposium had attendees from a broad range of companies and research institutions. By design, the program was broad and shallow — the idea was to bring together researchers from all aspects of DISC. Among the highlights:
- In the “DISC systems” arena, Randy Bryant laid out a broad range of research challenges, Jeff Dean gave a lightning-fast overview of Google’s tools (clusters, GFS, MapReduce, BigTable, Chubby), and Garth Gibson talked about challenges in large-scale data systems.
- In the “middleware” arena, ChengXiang Zhai discussed text information management, and Joe Hellerstein promoted declarative programming as a universal elixir.
- In the “applications” arena, Jill Mesirov described computational paradigms for genomic medicine, Jon Kleinberg talked about algorithms for analyzing large-scale social network data, and Alex Szalay described applications in the physical sciences.
- Jeannette Wing and Christophe Bisciglia announced NSF’s new program supporting DISC research utilizing a large-scale cluster provided by Google and IBM.
Slides from all presentations at both the Hadoop Summit and the DISC Symposium, as well as videos of most presentations, are available here.
So what can we conclude from all of this? Well, at the Hadoop Summit, the speakers (especially the ones from industry) were not the “usual suspects”, which is notable given the fairly hard-core technical nature of research in large-scale distributed systems. There was an overwhelming sense that a major wave is starting, and the excitement level at the meeting was extremely high.
Regarding the concept of “DISC”, here is our unabashed opinion: Ubiquitous cheap sensors (in gene sequencers, in telescopes, in buildings, on the sea floor, in the form of point-of-sale terminals or the readable web, etc.) are transforming many fields from data-poor to data-rich. The enormous volume of data makes “automated discovery” (machine learning, data mining, visualization) essential, requiring innovation throughout the stack. The traditional “high performance computing” crowd has missed the boat on this one. (The focus must be on the data.) Web companies such as Google, Yahoo!, and Microsoft have made significant strides, but there remains plenty of room, and plenty of need, for additional breakthroughs. Bluntly, a university that lacks this “big data” capability is not going to be competitive.
The job of the Computing Community Consortium is to help the computing research community envision, articulate, and pursue longer-range, more audacious research challenges. “Visioning workshops” such as these are one route that the CCC is pursuing, and this was the first CCC-sponsored meeting. While there is room for improvement (more time for discussion, more junior attendees, …), most participants viewed this workshop as a success: there was a real buzz.
Let us know your thoughts!
— Ed Lazowska and Peter Lee
Though the dominant effort from the academic high-performance computing community has been on compute-intensive applications, it is not quite true that data-intensive applications received no attention.
Systems like ADR and FREERIDE pre-date Google’s MapReduce but share many similarities with it. Compiler support for data-intensive systems was also explored around 2000-2002.