03 Jul Amdahl’s Law Part 6: How to Measure and Compare Parallel Systems
This last post in my six-part series addresses systems effects, future directions, and I score these equations using the same metrics I rated Amdahl’s law and Karp and Flatt’s equation. For review, my full equations…
Unified Memory Architecture
In a system where serial and parallel resources – main processor and offload resources – both have equal access to main memory, they can pass main memory pointers to code and data between them, effectively reducing the memory-to-memory transfer component of my equations to zero. Variants of this approach can be based on tiered memory architectures that implement large shared pools of multiple access memory, where serial and parallel resources may also have their own local memory.
TX and PX are vampire performance terms – they suck time and power out of a system without contributing directly to useful work results in:
And where the serial and parallel can be pipelined:
Likewise, we still cannot pipeline power consumption:
For future unified memory systems the TDP term is likely to remain zero, because it will remain inefficient to integrate remote data access into highly parallel loops.
These are fairly simple equations that can be used to generate reasonable estimates of performance and power consumption in modern compute offload architectures.
To confine the scope of this equation to a system on chip (SoC) design, practically speaking we can eliminate the failure domain issue for many consumer applications, and perhaps for short duration server and workstation applications as well. That simplifies the runtime equation:
It’s this final step that closed the distance and generates a simplified form of the equation that is close to Karp and Flatt:
At this level, it is apparent that I’ve added a data movement term and an acknowledgement that, in the physical world, parallelism is rough grained – that it comes in discrete increments of parallel processing hardware.
As mentioned previously, modern compilers and runtime environments automatically and opportunistically parallelize threads of “serial” code to run on multiple processing resources, specifically multiple processor cores in a multi-core SoC. While it seems like this behavior will not substantially affect the power equation (for reasons already listed above), it may have a measurable effect on the time it takes to execute the formerly serial code. Generalizations and rules-of-thumb would be helpful to incorporate at some point.
There can be substantial synchronization issues when a single task generates a large number of threads containing different code, assumed compiled for the same processor, but perhaps vastly exceeding the hardware thread count of that processor. Creating bounds or heuristics here would be useful.
Likewise, message passing between parallel threads can introduce substantial performance challenges. My equations implicitly assume that message passing is negligible or that it incurs zero overhead – that the threads are either “stateless” or that they do not explicitly synchronize with other threads while they are running (synchronization is performed with returned data structures, at a database level after a thread terminates, etc.).
Addressing concurrent runtime environments – multiple threads, tasks or applications scheduled on the same physical hardware resource, but not in a single cohesive execution stream . It would be interesting to folks building hyperscale datacenters to create a set of equations that describe the aggregate behavior of such a system.
Failure domains and bound estimates for the real world performance of large systems is also an interesting future topic.
Scorecard for the New Equations
Here is how my two equations score:
|Performance||Absolute, based on measured time and power|
|Heterogeneous Processors||Explicit via separate time and power terms|
|Scale-out||Implicit via data movement terms|
|Data Movement||Explicit via new terms|
|Failure Rates||Explicit via new term|
|Power Consumption||Explicit via second equation|
I intend to use these equations to examine trade-offs in systems architecture. I believe I have created a useful high-level tool for modeling the performance of a diverse array of modern heterogeneous compute architectures. These equations should scale well, albeit in a general fashion, from single SoCs to multi-chip systems and then to systems of systems.
There are already many choices for serial processors, with various flavors of big cores and small cores of a few popular instruction sets. Couple that choice with a wide variety of architectures and implementations for compute offload engines, for example: general purpose graphics engines (GPGPUs), digital signal processors (DSPs), vector processing instruction set extensions, field programmable gate arrays (FPGAs), and interesting pairings of big cores with seas of smaller cores.
It should become apparent that data movement, as expressed by chip-level interconnects, system-level fabrics, and datacenter networking choices already have a substantial effect on system performance.
Exascale is the next logical step past hyperscale. The only way to get there is to start by considering holistic system performance.
Karp, Alan H. & Flatt, Horace P. (1990). “Measuring Parallel Processor Performance”. Communication of the ACM33 (5): 539–543.doi:10.1145/78607.78614
Acknowledgements for the Series
Chris Miller provided sound technical proofing.
Samuel Teich helped me crystalize my thinking and contributed to the form of the equations.
Mark Welker asked insightful questions and improved theme elements along the way.
This is the last in a six-part series, but I will continue to refine my assumptions and equations. Previous posts in this series: