Sunday, April 24, 2011

Performance Analysis.


Scope of This Tutorial: 
A variety of profiling and execution analysis tools exist for both serial and parallel programs. They range widely in usefulness and complexity: 
Simple command line timing utilities 
Fortran and C timing routines (a minimal C sketch follows this list) 
Profilers 
Execution trace generators 
Graphical execution analyzers - with/without trace generation 
Both real-time and post-execution tools 
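
As a small taste of the simplest of these, the sketch below shows a wall-clock timing routine in C built on the POSIX gettimeofday() call (Fortran provides similar routines, and MPI programs can use MPI_Wtime()). The function and variable names are illustrative only:

  #include <stdio.h>
  #include <sys/time.h>                  /* POSIX gettimeofday() */

  /* Return the current wall clock time in seconds. */
  static double wall_seconds(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return (double) tv.tv_sec + (double) tv.tv_usec / 1.0e6;
  }

  int main(void)
  {
      double t0, t1, sum = 0.0;
      long   i;

      t0 = wall_seconds();
      for (i = 0; i < 10000000L; i++)    /* region of code being timed */
          sum += 1.0 / (double) (i + 1);
      t1 = wall_seconds();

      printf("region took %.6f seconds (sum = %f)\n", t1 - t0, sum);
      return 0;
  }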

Most of the more sophisticated and useful tools have a learning curve associated with them, and each could fill a full-day tutorial of its own. 

The purpose of this tutorial is to briefly review a range of performance analysis tools and to provide pointers to further information about many of them. 

Although a number of the tools reviewed are cross-platform, the emphasis of this tutorial is their usage on the IBM SP platform. 

Motivation: 
Writing large-scale parallel and distributed scientific applications that make optimum use of computational resources is a challenging problem. Very often, resources are under-utilized or used inefficiently. 

The factors that determine a program's performance are complex, interrelated, and often hidden from the programmer. Some of them are listed by category below. 

Application Related Factors: 
Algorithms 
Dataset Sizes 
Memory Usage Patterns 
Use of I/O 
Communication Patterns 
Task Granularity 
Load Balancing 
Amdahl's Law 

Hardware Related Factors: 
Processor Architecture 
Memory Hierarchy 
I/O Configuration 
Network 

Software Related Factors: 
Operating System 
Compiler 
Preprocessor 
Communication Protocols 
Libraries 

Because of these challenges and complexities, performance analysis tools are essential to optimizing an application's performance. They can help you understand what your program is "really doing" and suggest where its performance can be improved. 


Performance Considerations and Strategies 



The most important goal of performance tuning is to reduce a program's wall clock execution time. Reducing resource usage in other areas, such as memory or disk requirements, may also be a tuning goal. 

Performance tuning is an iterative process used to optimize the efficiency of a program. It usually involves finding your program's hot spots and eliminating the bottlenecks in them. 

Hot Spot: An area of code within the program that uses a disproportionately high amount of processor time. 

Bottleneck: An area of code within the program that uses processor resources inefficiently and therefore causes unnecessary delays. 

Performance tuning usually involves profiling - using software tools to measure a program's run-time characteristics and resource utilization. 

Use profiling tools and techniques to learn which areas of your code offer the greatest potential performance increase BEFORE you start the tuning process. Then, target the most time consuming and frequently executed portions of a program for optimization. 
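
As a concrete illustration, a typical session with a gprof-style profiler looks like the following. This is a sketch assuming GNU gcc and gprof are available; the compile flag and profiler name may differ on your platform, and myprog is just a placeholder name:

  gcc -pg -O2 -o myprog myprog.c        # compile and link with profiling enabled
  ./myprog                              # run normally; writes gmon.out in the current directory
  gprof myprog gmon.out > profile.txt   # flat profile (time per routine) plus call graph

The flat profile at the top of the report ranks routines by the CPU time spent in them, which is usually the first place to look for hot spots.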

Consider optimizing your underlying algorithm: an extremely fine-tuned O(N * N) sorting algorithm may still perform significantly worse than an untuned O(N log N) algorithm. 

For data dependent computations, benchmark with a variety of realistic input data sets (realistic in both size and values). Maintain consistent input data during the fine-tuning process. 

Take advantage of compiler and preprocessor optimizations when possible (for example, the compiler's higher -O optimization levels). 

Finally, know when to stop - there are diminishing returns in successive optimizations. Consider a program with the following breakdown of execution time percentages for the associated parts of the program: 
Procedure       % CPU Time
main()             13%
procedure1()       17%
procedure2()       20%
procedure3()       50% 



A 20% increase in the performance of procedure3() results in a 10% performance increase overall. 

A 20% increase in the performance of main() results in only a 2.6% performance increase overall. 
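
The arithmetic behind those two figures is simply each procedure's share of total CPU time multiplied by its local improvement, which is the reasoning behind Amdahl's Law. A minimal sketch, using the percentages from the table above:

  #include <stdio.h>

  int main(void)
  {
      double local_gain = 0.20;     /* each procedure made 20% faster */
      double share_p3   = 0.50;     /* procedure3(): 50% of CPU time  */
      double share_main = 0.13;     /* main():       13% of CPU time  */

      /* overall improvement = share of total time x local improvement */
      printf("procedure3(): %.1f%% overall\n", share_p3 * local_gain * 100.0);    /* 10.0% */
      printf("main():       %.1f%% overall\n", share_main * local_gain * 100.0);  /*  2.6% */
      return 0;
  }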


The time command, placed in front of the command that runs your program (for example, time ./myprog), reports the total execution time of your program. 

The format of the output differs between the Korn shell and the C shell. The basic information is: 
Real time: the total wall clock (start to finish) time your program took to load, execute, and exit. 
User time: the total amount of CPU time your program took to execute. 
System time: the amount of CPU time spent on operating system calls in executing your program. 

The system and user times are defined differently across different computer architectures. 
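
If you need the same user and system CPU figures from inside a program rather than from the shell, here is a minimal C sketch using the POSIX getrusage() call (times come back as whole seconds plus microseconds):

  #include <stdio.h>
  #include <sys/time.h>
  #include <sys/resource.h>              /* POSIX getrusage() */

  int main(void)
  {
      struct rusage ru;
      double sum = 0.0;
      long   i;

      for (i = 0; i < 10000000L; i++)    /* some work that accumulates CPU time */
          sum += (double) i * 0.5;

      getrusage(RUSAGE_SELF, &ru);       /* resource usage of the calling process */
      printf("user   %ld.%06ld s\n", (long) ru.ru_utime.tv_sec, (long) ru.ru_utime.tv_usec);
      printf("system %ld.%06ld s\n", (long) ru.ru_stime.tv_sec, (long) ru.ru_stime.tv_usec);
      printf("(sum = %f)\n", sum);
      return 0;
  }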

Example csh time output: 

  1.150u 0.020s 0:01.76 66.4% 15+3981k 24+10io 0pf+0w



Explanation (fields from left to right): 
1.15 seconds of user CPU time 
0.02 seconds of system (kernel) CPU time used on behalf of the user 
1.76 seconds of real (wall clock) time 
total CPU time (user + system) was 66.4% of the elapsed time 
15 Kbytes of shared memory usage and 3981 Kbytes of unshared data space 
24 block input operations and 10 block output operations 
no page faults 
no swaps 

Example ksh time output: 

  real 0m2.58s
  user 0m1.14s
  sys 0m0.03s
Explanation: 
0 minutes, 2.58 seconds of wall clock time 
0 minutes, 1.14 seconds of user CPU time 
0 minutes, 0.03 seconds of system CPU time.