More computing is happening in the cloud. Personal and corporate applications are increasingly sourced from network-based services. ‘Cloud’ is a good way of describing the consumer experience: the consumer does not have to worry about the details of implementation or carry the burden of physical plant. However, it is a little misleading in an important way. As users pull more from the cloud, providers increasingly concentrate capacity to meet the need. We move to a utility model. Such utilities require large computational capacity and considerable physical plant (space, power, cooling, …). The cloud is fed from massive physical infrastructure.
One of the interesting stories of our times is the rush by Google, Microsoft, and others to build large processing plants. Web-scale computing rests on a major physical presence.
Communications of the ACM is celebrating its 50th anniversary with a special issue, which includes a couple of short articles that address the issues of operating at this scale.
David Patterson talks about the dramatic differences between “developing software for millions to use as a service versus distributing software for millions to run on their PCs” (Technical perspective: the data center is the computer). He quotes Luiz Barroso of Google: “The data center is now the computer”. He discusses the challenges of writing applications whose target deployment environment is a data center, and introduces MapReduce, an approach Google developed to address this issue. The challenge is to architect for large systems made up of thousands of individual computers.
Jeffrey Dean and Sanjay Ghemawat of Google describe the MapReduce programming model (MapReduce: simplified data processing on large clusters). What struck me here was the scale of operation. Here is the abstract:
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google’s clusters every day, processing a total of more than twenty petabytes of data per day. [MapReduce]
Head-spinning, as Niall Kennedy suggests:
It’s some fascinating large-scale processing data that makes your head spin and appreciate the years of distributed computing fine-tuning applied to today’s large problems. [Google processes over 20 petabytes of data per day]
My colleague Thom Hickey introduces MapReduce here and talks about using it here.
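To make the map/reduce split a little more concrete, here is a minimal, single-machine sketch of the programming model, using the word-count task that Dean and Ghemawat use as their running example. The names here (word_count_map, word_count_reduce, run_mapreduce) are illustrative rather than Google’s actual API, and the in-memory grouping step merely stands in for what the real runtime does across thousands of machines, with distributed storage and automatic handling of machine failures.

    # A minimal, single-machine sketch of the MapReduce programming model.
    # Names are illustrative, not Google's API; the real runtime shards the
    # input across thousands of machines, groups intermediate keys with a
    # distributed sort, and recovers from machine failures automatically.

    from collections import defaultdict


    def word_count_map(document: str):
        """Map: emit an intermediate (key, value) pair for each word."""
        for word in document.split():
            yield word.lower(), 1


    def word_count_reduce(word: str, counts: list) -> int:
        """Reduce: combine all values emitted for a single key."""
        return sum(counts)


    def run_mapreduce(inputs, map_fn, reduce_fn):
        """Stand-in for the runtime: apply map, group by key, apply reduce."""
        intermediate = defaultdict(list)
        for record in inputs:                 # map phase
            for key, value in map_fn(record):
                intermediate[key].append(value)
        return {key: reduce_fn(key, values)   # reduce phase
                for key, values in intermediate.items()}


    if __name__ == "__main__":
        docs = ["the data center is the computer",
                "the cloud is fed from massive physical infrastructure"]
        print(run_mapreduce(docs, word_count_map, word_count_reduce))

Run over a couple of toy documents this just prints per-word totals, but the point of the model is that the same two user-supplied functions could, in principle, be applied unchanged across clusters processing petabytes a day.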