Performance Prediction and Improvement Techniques for Parallel Programs in Multiprocessors

Detta är en avhandling från Ronneby : Blekinge Institute of Technology

Sammanfattning: The performance of a computer system is important. One way of improving performance is to use multiprocessors with several processors that can work in parallel. Where multiprocessors are used, the programs must also be parallel in order to achieve high performance. However, it is not always easy to write parallel programs for multiprocessors; program developers need support in this area. Such support includes, for example, information regarding how well the parallel program scales-up when the number of processors increases and identification of performance bottlenecks; ideally, the result should be presented graphically. Bottlenecks arise both as a result of parallelization as well as traditional (sequential) code. Further, the developer may need to predict performance on other systems than the one used for development, since the development environment often is the (uni-processor) workstation on the developer's desk. One way of increasing the performance may be to bind threads on processors statically. Finding the optimal allocation is NP-hard and it is necessary to resort to heuristic algorithms. When heuristic algorithms are used we do not know how near/far we are from the optimal allocation. Finding a bound for the program's completion time shows what should be achievable using a heuristic algorithm. In this thesis, I present techniques how to simulate a multiprocessor execution of a parallel program based on a monitored execution on a uni-processor. The result of the (simulated) multiprocessor execution is graphically presented in order to give feedback to the developer. The techniques can be used for heuristic algorithms to find an allocation of threads to processors. Further, I show an algorithm that identifies the critical path of the parallel program on a multiprocessor, thereby identifying the segments that are worthwhile optimizing. I also show how to calculate a tight bound on the minimal completion time for the optimal allocation of threads to processors. Finally, I discuss the implications of the choice of simulation model. The techniques and algorithms described have been manifested in a prototype tool which I have used to perform empirical studies. The tool has been validated using a real multiprocessor.