The following is a special contribution to this blog by CCC Executive Council Member Mark D. Hill of the University of Wisconsin-Madison.
The Internet-based services we have all come to love (e.g., search, email, social networks, video/photo sharing) are powered by large back-end data centers, designed and managed as warehouse-scale computers. Emerging cloud computing workloads also run on such warehouse-scale computers, making it even more important to understand and optimize this class of computer systems.
But until now, such warehouse-scale computers have (true to their name!) been big black boxes, offering very little insight into the detailed performance characteristics of deployments at scale: What is the nature of the workloads that run on these large computers? How well are they served by the microarchitecture of current processors? Where are the next opportunities for improving this important class of systems?
Engineers at Google, in collaboration with researchers at Harvard University, have recently presented some answers to these questions. Their paper at this summer's International Symposium on Computer Architecture, titled "Profiling a warehouse-scale computer," presents results from a longitudinal study spanning tens of thousands of servers in actual Google data centers, examining the detailed microarchitectural behavior of thousands of different applications while serving live traffic across billions of user requests.
So, what did these researchers find? One key nugget is that Google workloads exhibit significant diversity, and that diversity has been increasing over the years. Another is that they differ in significant ways from the traditional SPEC benchmarks we are used to seeing in architectural studies. Notably, workloads running on the Google computer do markedly less useful work on their processors (i.e., they have lower instructions per cycle, or IPC) than the typical SPEC benchmark, and they also lose a significantly larger fraction of cycles to front-end pipeline stalls while fetching instructions.
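To make those two metrics concrete, here is a minimal Python sketch, not taken from the paper, that computes IPC and the front-end stall fraction from hardware performance-counter readings. The counter values below are invented solely to illustrate the qualitative SPEC-versus-warehouse contrast the paper reports.

```python
# A minimal sketch of the two metrics discussed above, computed from
# hypothetical performance-counter readings (all numbers are made up).

def ipc(instructions_retired: int, cycles: int) -> float:
    """Instructions per cycle: the rate of useful work done by a core."""
    return instructions_retired / cycles

def frontend_stall_fraction(frontend_stall_cycles: int, cycles: int) -> float:
    """Fraction of cycles the pipeline front end could not supply
    instructions (e.g., due to instruction-cache or iTLB misses)."""
    return frontend_stall_cycles / cycles

# Hypothetical counters for a SPEC-like workload vs. a warehouse-scale
# computer (WSC) workload, chosen only to illustrate the contrast.
spec_like = dict(instructions=4_000_000_000, cycles=2_000_000_000,
                 frontend_stalls=200_000_000)
wsc_like  = dict(instructions=1_200_000_000, cycles=2_000_000_000,
                 frontend_stalls=600_000_000)

for name, c in [("SPEC-like", spec_like), ("WSC-like", wsc_like)]:
    print(f"{name}: IPC = {ipc(c['instructions'], c['cycles']):.2f}, "
          f"front-end stalls = "
          f"{frontend_stall_fraction(c['frontend_stalls'], c['cycles']):.0%}")
```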
The paper is also chock-full of other interesting nuggets about bottlenecks elsewhere in the CPU pipeline and in the cache/memory hierarchy. Notably, while no individual workload has significant hotspots, when cycles are aggregated across all workloads at the warehouse-scale-computer level, a few common low-level functions in the software stack account for nearly one third of the total cycles!
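That aggregation effect is worth pausing on: a function can be a minor cost in every individual profile yet dominate fleet-wide. The toy Python sketch below, whose workloads, function names, and sample counts are all made up, shows how per-workload profiles can look flat while a few shared low-level functions still account for roughly a third of all cycles once combined.

```python
# A toy sketch of fleet-wide profile aggregation (all data is fabricated
# for illustration; real profiles come from sampling live servers).
from collections import Counter

SHARED = {"memcpy", "malloc", "compress"}  # common low-level functions

# Cycle samples per function for three hypothetical workloads.
profiles = {
    "search": {"memcpy": 4, "malloc": 3, "compress": 3,
               "rank": 5, "parse": 5, "index": 5, "snippet": 5},
    "ads":    {"memcpy": 5, "malloc": 3, "compress": 2,
               "score": 5, "fetch": 4, "bid": 6, "log": 5},
    "video":  {"memcpy": 3, "malloc": 4, "compress": 3,
               "transcode": 6, "mux": 5, "seek": 5, "thumb": 4},
}

fleet = Counter()
for workload, samples in profiles.items():
    hottest = max(samples.values()) / sum(samples.values())
    print(f"{workload}: hottest single function is only {hottest:.0%} of its cycles")
    fleet.update(samples)  # accumulate samples across the fleet

total = sum(fleet.values())
shared_cycles = sum(fleet[f] for f in SHARED)
print(f"fleet-wide: shared low-level functions take {shared_cycles/total:.0%} of cycles")
```

Running the sketch, no workload's hottest function exceeds about 20% of its own cycles, yet the three shared functions together consume about a third of the fleet's cycles, mirroring the paper's observation.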
This paper is a great start, shedding light on some of the mysteries of large warehouse-scale computers in the wild and on the opportunities to optimize the software and hardware stack. But there is more to do, and our community has a wonderful opportunity to write the sequel.