



#### An Empirical Methodology for Judging the Performance of Parallel Algorithms on Heterogeneous Clusters

#### Jackson W. Massey, Anton Menshov, and Ali E. Yilmaz

#### Department of Electrical & Computer Engineering University of Texas at Austin

FEM 2016 Florence, Italy, May 2016







- Motivation
  - End of an Era
  - Heterogeneous Computing
  - Computational Systems
- Proposed Methodology
  - Generalized Parallel Efficiency Definition
  - Iso-Efficiency Maps
- Applications
  - MPI vs. OpenMP vs. MPI/OpenMP for Multi-core CPU with MOM
  - Multi-core CPU vs. MIC vs. Multi-core CPU+MIC with MOM
  - Iterative vs. Direct MOM Solver
- Summary & Conclusions



## Limits of Sequential Computing



#### Processor Performance Plateaued about 2004 "The Future of Computing Performance-Game Over or Next Level?" **BONAL ACADEMIES** S. H. Fuller, L. I. Millett, Eds.; National Microprocessor Performance "Expectation Gap" over Time (1985-2020 projected) **PUTING PERFORMANCE** 1,000,000 Research Council. 2011. Game Ov 100,000 The Expectation Gap 10,000 Trends in Computation Transistors 1,000 (Thousands) 10<sup>6</sup> 100 10<sup>5</sup> Sequential Performance 10<sup>4</sup> (SpecINT) 1995 2010 2015 2020 1990 2000 2005 Frequency Year of Introduction (MHz) $10^{3}$ "The end of the exponential runup in uniprocessor performance and the market saturation of the Typical Power 10<sup>2</sup> general-purpose processor mark the end of the (Watts) "killer micro." This is a golden time for innovation Cores 10<sup>1</sup> in computing architectures and software. We have already begun to see diversity in computer 10<sup>0</sup> designs to optimize for such metrics as power and throughput. The next generation of discoveries will require advances at both the hardware and the 1975 1980 1985 1990 1995 2000 2005 2010 2015 software levels." Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten, and D. Wentzlaff



### **Interesting Times**



"Gearing up for the next challenge in highperformance computing," Research Highlights, Lawrence Livermore National Lab, Mar. 2015.

Central processing unit (CPU) Multicore CPU Memory (MEM) Cache Graphic processing unit (GPU)



Single CPU per node with main memory

New programming models



2000-2010 Multiple CPUs per node sharing main memory

"Are supercomputing's elite turning backs on accelerators?" hpcwire.com, June 2014 Accelerators



L Clearspeed ATI Radeon Nvidia Kepler Nvidia Fermi

MEM GPU MEM

2000-2010

Accelerators usher in

era of heterogeneity



2014 Accelerators share common view of memory with CPU



2015 Simple low-power cores and non-uniform memory access



Processor in memory

RF THAN BY M. MITCHELL WALDROP

THE SEMICONDUCTOR INDUSTRY Nature, Feb. 2016 WILL SOON ABANDON ITS PURSUIT OF MOORE'S LAW.

#### **NOW THINGS COULD GET A LOT MORE INTERESTING.**

"The industry road map released next month will for the first time lay out a research and development plan that is not centered on Moore's law..."



# Clusters of Heterogeneous Nodes



#### Top Supercomputers 2015 (top500.org)

- 1. Tianhe-2
  Intel Xeon E5 + Xeon Phi 31S1P
  12 Cores 2.2 GHz
- 2. Titan Cray XK7 AMD Opteron 6274+ Nvidia K20x 16 Cores 2.2 GHz
- 10. Stampede
   Intel Xeon E5 + Xeon Phi SE10P
   2x 8 Cores 2.7 GHz +
   61 Cores ~ 1070 GFLOPs



Stampede



#### Tianhe-2

Titan

#### Performance Development





## **Clusters of Heterogeneous Nodes**





#### Stampede – Poweredge C8220 Node

2 x Intel Xeon E5-2680 Phi SE10P

- 2.7 GHz

- 0.3456 TF peak DP 1.074 TF peak DP
- 1.1 GHz
- 2 x 8 cores 61 cores (244 threads)
- 256-bit vector unit 512-bit vector unit



- Heterogeneous computing
- Coordination of different types of "processors" to perform computations
  - Differences include: clock speeds, memory size/speed, instruction sets, ...
- Must re-think concepts of computational power, efficiency
- Must account for types of processors not just number of processors



# Computational Systems for Science & Engineering



- Ingredients of "computational systems" (e.g., for solving EM problems)
- System = algorithm + software implementation + hardware architecture
- Ongoing advances in each ingredient
  - Often focus on one and make abstractions (sweeping generalizations/ simplifications) about others, assuming/hoping

| "best" | "best"         | "best"              |        | "best"   |
|--------|----------------|---------------------|--------|----------|
| system | =<br>algorithm | ∩<br>implementation | $\cap$ | hardware |

and improved ingredient => improved system

- Enabled tremendous progress, becoming more difficult/less valid: Algorithm dependent hardware performance (architects often recognize this), implementation dependent algorithm performance (coders often recognize this) ...
- No "universal best system" for all problems but some systems (much) better for important problem classes
- <u>How to judge different systems</u>? Define problem, <u>define metrics</u>, apply system, collect data, observe/explain/compare, ...



# Computational Systems for Science & Engineering



- Metrics/figures of merit/performance measures for judging CEM systems
- Most important ones:
  - 1. Accuracy: Is error in solution acceptable? (Need a reference)
  - 2. Cost: Is problem solved fast enough? (Need a lower limit)
  - 3. Efficiency: How much of available computational power is wasted? (Must define available computational power)
  - 4. Scalability: How much should system grow when problem grows to keep metrics acceptable? (Must define paths to grow problem, system)
  - 5... stability/robustness, error convergence rate, portability, user interface, ...
- This work:
  - Efficiency (and scalability) on heterogeneous computers & clusters of them
- Key concepts (computational power, workload)
- Proposed methodology (iso-efficiency contours and acceptable performance)
- Examples comparing different systems





## **Proposed Methodology**

- Generalized Parallel Efficiency Definition
- Iso-Efficiency Maps



#### Efficiency for Heterogeneous Clusters



- System of interest = algorithm + software implementation + *P* processors
- Well-known for homogeneous clusters of *P* identical processors

W: workload

- $t_{\scriptscriptstyle p}(W)$  : wallclock time to solve problem using only processor p
- $t_{obs}(W)$ : wallclock time to solve problem using all P processors

$$e(P,W) \triangleq \frac{t_{\scriptscriptstyle 1}(W)}{Pt_{\scriptscriptstyle \rm obs}(P,W)}: \text{(parallel) efficiency of system}$$

Generalization to heterogeneous clusters of different types of processors

L. Pastor and J. L. B. Orero, "An efficiency and scalability model for heterogeneous clusters," in *Proc. IEEE Conf. Cluster Comp.*, Oct. 2001, pp. 427-434.

$$\begin{split} e(C_{\rm tot},W) &\triangleq \frac{1/t_{\rm obs}(P,W)}{C_{\rm tot}(P,W)}: \text{(parallel) efficiency of system} \\ \\ & S \\ C_{\rm tot}(P,W) \triangleq \sum_{p=1}^{P} C_p(W): \text{total comp. power available to system} \\ & C_p(W) \triangleq \frac{1}{t_p(W)}: \text{ average comp. power of system using only processor } p \end{split}$$



### Efficiency for <u>Heterogeneous Clusters</u>



- System of interest = algorithm + software implementation + *P* processors

L. Pastor and J. L. B. Orero, "An efficiency and scalability model for heterogeneous clusters," in *Proc. IEEE Conf. Cluster Comp.*, Oct. 2001, pp. 427-434.

$$\begin{split} e(C_{\text{tot}},W) &\triangleq \frac{1/t_{\text{obs}}(P,W)}{C_{\text{tot}}(P,W)} : \text{(parallel) efficiency of system} \\ C_{\text{tot}}(P,W) &\triangleq \sum_{p=1}^{P} C_{p}(W) : \text{total comp. power available to system} \\ C_{p}(W) &\triangleq \frac{1}{t_{p}(W)} : \text{average comp. power of system using only processor } p \end{split}$$

#### Properties & interpretation

- Salient feature: Define W to be independent of system!
- $C_{tot}$ : Part of work that could have been done per sec (if system efficiency=1)
- $1/t_{obs}$ : Part of work actually done per sec
- Reduces to usual definition for homogeneous clusters
- *C*<sub>tot</sub>, *e* sensitive to algorithm, software implementation, number/type of processors
   used & workload => Can study effect of each ingredient





# **Iso-Efficiency Maps**

- System of interest = algorithm + software implementation + *P* processors

 $e(C_{_{\rm tot}},W) \triangleq \frac{1 \ / \ t_{_{\rm obs}}(P,W)}{C_{_{\rm tot}}(P,W)}: \text{(parallel) efficiency of system}$ 

- Maps of iso-efficiency contours
  - Generate by sweeping P, W and recording  $t_{obs}$ . Plot in  $C_{tot} W$  plane.



Pitfall: Must find a way to estimate  $t_p(W)$  for large W. Extrapolating from larger  $C_{tot}$  often too rosy. Extrapolating from smaller W may not be possible.

 <sup>0.6</sup> F. Wei and A. E. Yılmaz, "A Systematic Approach to Judging Parallel Algorithms: Acceptable Parallelization Regions in the N-P Plane," in *Proc. FEM* '14, May 2014.

 $C_{\text{tot}}$ : Part of work that could have been done per sec  $1/t_{\text{obs}}$ : Part of work actually performed per sec





# **Iso-Efficiency Maps**

- System of interest = algorithm + software implementation + *P* processors

 $e(C_{_{\rm tot}},W) \triangleq \frac{1 \ / \ t_{_{\rm obs}}(P,W)}{C_{_{\rm tot}}(P,W)}: \text{(parallel) efficiency of system}$ 

#### Maps of iso-efficiency contours

- Generate by sweeping P, W and recording  $t_{obs}$ . Plot in  $C_{tot} - W$  plane.



 $C_{\text{tot}}$ : Part of work that could have been done per sec  $1/t_{\text{obs}}$ : Part of work actually done per sec

- Specify requirements: e.g.,
  - acceptable efficiency  $e \ge 0.9$
  - must do >0.1% of work per sec
- Find highest *C*<sub>tot</sub> that meets requirements
- Pitfall: Must find a way to estimate reference t<sub>p</sub>(W) for large W. Extrapolating from larger C<sub>tot</sub> often too rosy. Extrapolating from smaller W may not be possible.

(F. Wei and A. E. Yılmaz, "A Systematic Approach to Judging Parallel Algorithms: Acceptable Parallelization Regions in the N-P Plane," in *Proc. FEM '14*, May 14.)





### **Applications**

#### Benchmark Description

- System Evaluation for Algorithm I Iterative Solver
  - System Evaluation for Algorithm II –Direct Solver
    - Computational System Comparison



#### **Example Algorithms**



CFIE for perfectly conducting closed surface S

$$EFIE: \mathbf{E}^{inc}(\mathbf{r})\Big|_{\tan S} = \frac{j\omega\mu_0 \iint_S \mathbf{J}_S(\mathbf{r}')g(\mathbf{R})dS'}{-\frac{\nabla}{j\omega\varepsilon_0}\iint_S \nabla' \cdot \mathbf{J}_S(\mathbf{r}')g(\mathbf{R})dS'}\Big|_{\tan S}$$

$$MFIE: \hat{\mathbf{n}} \times \mathbf{H}^{inc}(\mathbf{r}) = \mathbf{J}_{kl}^{\mathbf{S}} - \hat{\mathbf{n}} \times \left(\nabla \times \iint_S \mathbf{J}_S(\mathbf{r}')g(\mathbf{R})dS'\right)$$

$$CFIE = \alpha EFIE + (1-\alpha)\eta_0 MFIE$$

$$g(\mathbf{R}) = e^{-jk_0R}/4\pi R; \quad \eta_0 = \sqrt{\mu_0/\varepsilon_0}$$

$$Rao, Wilton, Glisson, IEEE$$

$$Trans. AP, May 1982.$$

Method of moments solver

$$\mathbf{J}_{S}(\mathbf{r}) \cong \sum_{n=1}^{N} \mathbf{I}[n] \mathbf{f}_{n}(\mathbf{r}) \Longrightarrow \mathbf{Z}_{N \times N} \mathbf{I}_{N \times N_{\text{RHS}}} = \mathbf{V}_{N \times N_{\text{RHS}}}$$

• Computational complexity Matrix fill:  $O(N^2)$  Algorithm I=> Iterative solve (TFQMR):  $O(N_{iter}N_{RHS}N^2)$ Algorithm II=> Direct solve (MKL LAPACK + ScaLAPACK\*):  $O(N^3 + N_{RHS}N^2)$ \*ScaLAPACK block size = 512





N

 $r/\lambda_0$ 

## Sample Workloads



Asymptotic algorithm costs:

$$t_{
m fill} \propto \left(rac{r}{\lambda_0}
ight)^4$$
  
Algorithm I  $t_{
m solve} \propto \left(rac{r}{\lambda_0}
ight)^4 N_{
m pos} N_{
m iter}$   
Algorithm II  $t_{
m solve} \propto \left(rac{r}{\lambda_0}
ight)^6 + \left(rac{r}{\lambda_0}
ight)^4 N_{
m pos}$ 



155 310

211 947

6

7

• Find reference  $t_p(W)$  for large W by extrapolating from  $t_p(W)$  for small Wusing asymptotic expressions

• Workload definition:  $W = \frac{r}{\lambda_0}$ 

• Fill acceptable efficiency  $e \ge 0.9$ 

Must do >0.1% of work per sec

• Solve acceptable efficiency  $e \ge 0.5$ 





### **Applications**

- Benchmark Description
- System Evaluation for Algorithm I Iterative Solver
  - System Evaluation for Algorithm II –Direct Solver
    - Computational System Comparison



Vary OpenMP threads (1-16) Vary OpenMP threads (1-8)

1 OpenMP thread



### Q: Which process/thread configuration is best?











## Iterative Solver Intranode CPU+MIC Study



- Symmetric MIC run\*
- CPU
  - 2 MPI processes
  - 8 OpenMP threads each
- MIC
  - 1 MPI process
  - 60 OpenMP threads
- Find optimal workload balance
- \* Simplest method to use MIC with CPU (not ideal)









### **Applications**

- Benchmark Description
- System Evaluation for Algorithm I Iterative Solver
  - System Evaluation for Algorithm II –Direct Solver
    - Computational System Comparison



Q: Overall, which parallel

implementation is best?

# Direct Solver Intranode CPU Study





Vary OpenMP threads (1-16) Vary OpenMP threads (1-8) 1 OpenMP thread



Q: Which process/thread

configuration is best?

#### Direct Solver Intranode CPU Study



#### Pure OpenMP





A:  $\begin{cases} 1 \text{ MPI process with} \\ 8 \text{ OpenMP threads,} \\ 1 \text{ MPI process with} \\ 16 \text{ OpenMP threads,} \end{cases} \quad W \ge 0.75 \end{cases}$ 

1 MPI process Vary OpenMP threads (1-16)



Q: Which process/thread

configuration is best?

### **Direct Solver** Intranode MIC Study



W < 1.25

 $W \ge 1.75$ 





## Direct Solver Intranode CPU+MIC Study



- *W* = 2
- Direct solve
  - Automatic offload (Intel MKL)

C

- CPU: 1 MPI process with 16 OpenMP threads
- MIC: 30 OpenMP threads

Direct: Use MKL automatic varying workload



#### Q: Which process/thread configuration is best? Intranode CPU+MIC Study



ICES Pus s



Q: Which hardware + parallel

implementation is best?

## Direct Solver Intranode Study



**CPU Pure OpenMP CPU+MIC** MIC Pure OpenMP e 10<sup>2</sup> 10  $t_{obs}$  $t_{obs}$  Sample Sample Sample 10 > 0.9 100  $10^{0}$ 0.7 Solve 10<sup>-</sup> 0.5 10-4 0 0.3 10-3 10-4 < 0.1 10-4 2<sup>10<sup>-4</sup> - 0.5</sup> 10 0.5 1 2 0.5 Work  $(W = r/\lambda_0)$ 2 Work  $(W = r/\lambda_0)$ Work  $(W = r/\lambda_0)$ 

MIC: 1 MPI process with varying OpenMP threads (1-240) CPU: 1 MPI process with varying OpenMP threads (1-16)

| A: { | CPU Pure OpenMP,<br>CPU+MIC, | W < 1.25     |
|------|------------------------------|--------------|
|      | CPU+MIC,                     | $W \ge 1.25$ |

CPU: 1 MPI process with 16 OpenMP threads MIC: Varying OpenMP threads (1-240) MKL automatic offloading



#### Q: Which hardware + parallel implementation is best? nternode CPU+MIC Study









### **Applications**

- Benchmark Description
- System Evaluation for Algorithm I Iterative Solver
  - System Evaluation for Algorithm II –Direct Solver
    - Computational System Comparison

#### C: Which computational system is better? Direct vs. Iterative Solver <u>nternode CPU+MIC Study</u>



- Varying nodes (<1-64), P<sub>nodes</sub>
- $(<1-04), P_{node}$
- CPU / node
  - Iterative: 1-2 MPI processes with up to 8 OpenMP threads

IS DATERBUTY OF TERMS AT ADDITION

ICES

- Direct: 1 MPI process with up to 16 OpenMP threads
- MIC / node
  - 60 OpenMP threads
  - Iterative: 1 MPI process & 25% workload
    - Direct: MKL automatic offload& varying workload





## **Summary & Conclusions**



#### **Observations**



- Judging algorithms, software, hardware
  - Era of independently judging algorithms, implementations, and hardware is (probably) ending
  - Will not be able to (credibly) claim
    - processor p1 is better/faster/more energy efficient/... than processor p2 without mentioning algorithm & software properties
    - algorithm A is better/faster/... than algorithm B without mentioning software & hardware properties
  - Must judge entire system (algorithm + software implementation + hardware)
    - can still judge ingredients but in context
    - faster not always better, must evaluate cost!
      - => Q: Which one is better? Destination 100km away:
      - (a) Drive in 1h or 2h? A: Of course faster is better.
      - (b) Drive in 1h spending 10L of fuel or 2h spending 1L of fuel? A: It depends...
    - (parallel) efficiency is a reasonable metric for judging cost of computational systems, even for heterogeneous computing



#### **Empirical Approach**



- Proposed methodology
  - Carefully define problem of interest
  - Define workload independent of system (not in terms of basic operations [flops])
  - Determine average computational power of system under different configurations
    - Evaluate efficiency as a function of available computational power, workload, determine iso-efficiency contours
  - Compare & contrast

#### Pros & cons

- + Can compare entire computational systems
- + Can compare ingredients (hardware, software implementations, algorithms) by modifying only one and keeping other ingredients fixed
- Requires (access to) whole system
- Must generate lots of data including those from relatively inefficient simulations