The picture of how data gets analysed on a single computer is very simple: a program is written to read the data, do some manipulation and write out the results. This is something that every student of programming learns to do. The figure below shows a whimsical interpretation of software processing a data file. The rectangle represents a volume of data that requires 3 time units to analyse, and the (data gobbling analytical) fish represents the data analysis program running on a single processor core.
The volume of data is handled from beginning to end, and when the last of the data has been looked at, the results are written or displayed and the analysis stops.
If the amount of data coming in for analysis increases, the run time increases with it. In the simple case where the volume of work is doubled, the program will take twice as long to do the analysis, in this case 6 time units.
Of course, the actual time required to analyse twice as much data will depend on the algorithms used in the analysis.
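The single-core picture can be sketched in a few lines of Python. This is only an illustration: the function name `analyse` and the choice of summary (a count and a total) are assumptions made for the example, not part of any particular analysis. Because the loop touches each record exactly once, doubling the number of records doubles the work, which matches the simple case described above.

```python
from typing import Iterable


def analyse(records: Iterable[float]) -> dict:
    """Read every record once, from first to last, and summarise it.

    Hypothetical stand-in for a real analysis program.
    """
    count = 0
    total = 0.0
    for value in records:  # one pass: work grows in step with the data volume
        count += 1
        total += value
    return {"count": count, "total": total}


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    print(analyse(data))         # the original volume of data
    print(analyse(data + data))  # twice the data, twice the loop iterations
```

An analysis whose algorithm compares every record with every other record, for example, would not scale this gently.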
As the amount of data to be analysed in a day continues to increase, eventually the computer and analysis software will no longer be able to keep up with the stream of data coming in. One short-term solution to this might be to purchase a new computer to run the analysis program on. This approach used to work quite well when CPU clock speeds were increasing at a fairly predictable rate.
Unfortunately, we have passed the time when each new generation of processors brought a significant increase in single-core speed. New generations of CPUs tend to have more cores, but only a minimal increase in clock speed. For an analysis that is running on a single core, the new machine won’t do much to increase the volume of data that can be analysed in a day.
To get around this, and process a larger amount of data in a day, we must learn to use multiple cores and/or multiple computers to analyse the data.
The data set must be split into independent pieces, which we can then distribute to multiple computers for concurrent analysis. Notice that we are assuming that the data is such that we can easily divide it into independent pieces. Data of this kind is sometimes loosely referred to as “unstructured data”.
We arbitrarily divide our data volume as shown in Fig. 3, and then we assign the task of analysing each chunk of the data to a different computer. Each computer analyses its piece concurrently with the others as shown in Fig. 4.
The entire analysis now requires only one time unit instead of the three that a single computer needs. In the ideal case, if we can create N chunks of data, and have them all analysed simultaneously, we would expect the time to get the result to be reduced by a factor of N.
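The ideal case can be written compactly. The symbols here are introduced only for illustration: T_1 is the time one computer needs for the whole data set, and T_N is the time needed when N computers each analyse one of N equal chunks at the same time.

```latex
T_N = \frac{T_1}{N},
\qquad \text{e.g. } T_1 = 3,\; N = 3 \;\Rightarrow\; T_3 = \frac{3}{3} = 1 \text{ time unit}
```

This is a best case. It assumes the chunks are of equal size and that splitting the data and collecting the results cost nothing.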
This is the fundamental idea of all parallel computation, whether we are doing numerical analysis within a High Performance Computing facility or analysing large data sets in a Hadoop cluster. The model presented above is simple in that we assume the data can be split into independent pieces for analysis. A common name for this type of parallel work is “embarrassingly parallel”. I prefer the name “perfectly parallel” to describe this situation. This type of parallel work is possible when no communication needs to happen between the analysing processes. When each chunk of data is done, the result needs to be aggregated with the results of all the other chunks to find the results for the entire data set.
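The split, analyse and aggregate steps can be sketched with Python's standard `concurrent.futures` module. This is a minimal sketch under stated assumptions, not how a Hadoop cluster or an HPC code does it: the helper names (`split_into_chunks`, `analyse_chunk`, `aggregate`, `analyse_in_parallel`) are hypothetical, the "analysis" is just a count and a total, and the workers are processes on one machine rather than separate computers.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import List, Sequence


def split_into_chunks(data: Sequence[float], n_chunks: int) -> List[Sequence[float]]:
    """Cut the data into n_chunks contiguous pieces of nearly equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(data[start:end])
        start = end
    return chunks


def analyse_chunk(chunk: Sequence[float]) -> dict:
    """Analyse one chunk without any communication with the other chunks."""
    return {"count": len(chunk), "total": sum(chunk)}


def aggregate(partials: List[dict]) -> dict:
    """Combine the per-chunk results into a result for the entire data set."""
    return {
        "count": sum(p["count"] for p in partials),
        "total": sum(p["total"] for p in partials),
    }


def analyse_in_parallel(data: Sequence[float], n_chunks: int, executor: Executor) -> dict:
    """Split the data, hand each chunk to the executor, then aggregate."""
    chunks = split_into_chunks(data, n_chunks)
    partials = list(executor.map(analyse_chunk, chunks))
    return aggregate(partials)


if __name__ == "__main__":
    data = [float(x) for x in range(9)]
    # Three worker processes, one per chunk, as in the three-chunk picture.
    with ProcessPoolExecutor(max_workers=3) as pool:
        print(analyse_in_parallel(data, n_chunks=3, executor=pool))
```

Passing the executor in as an argument keeps the split and aggregate logic independent of where the chunks actually run. On data this small, the cost of starting processes will outweigh any gain, so the sketch shows the structure of the work, not a speed-up.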
Some data can be analysed in this embarrassingly parallel fashion, while other data requires communication between processes as the analysis progresses. In practice, this communication limits how much the analysis can be sped up. We will discuss this in future articles, along with methods to measure how efficiently the analysis speeds up as the number of cores increases.
Acknowledgement: the data gobbling analytical fish was created by Denise Thornton from Videre Analytics.