Public:Intelligent Optimization Tuning

From ZevilsWiki

Jump to: navigation, search

Contents

Notes To Self

Math syntax help

Abstract

Modern computer architectures and compilers are both very complicated. There are many different (valid) ways in which a compiler can optimize any given program, and these different transformations can have very different performance characteristics. Compiler makers try to hide this complexity from programmers, but they do so at a cost to performance of the generated code. By exploiting well-known machine learning techniques, we can more fully exploit the compiler's optimization capabilities.

Terminology

A transformation is a mapping from some "source" language to some "machine" language. For some source code s, let Ts be the set of all valid transformations over s. A transformation is valid if and only if the transformed coded produces the same results, when executed, as the original code with respect to all defined aspects of the system. Typically, the time it takes to execute the program is one behaviour which is undefined, and it is desirable to minimize this time. For some transformation t, let t(t) be this execution time. We present here an efficient mechanism for finding \min t(t) | t \in T_s[TODO: reword]

Limiting the Search Space

It would be too computationally expensive to generate the entire set Ts. We can utilize the knowledge that compiler developers have about the problem domain to restrict the search space. While most developers want to be isolated from the details of optimization and just want the compiler to "do the right thing", some developers are extremely sensitive to performance and want an extremely fine-grained level of control over the output produced by the compiler. So, when compiler developers are writing optimization passes, the routines in the compiler which select which transformation will be emitted, if it may be desirable for the pass to behave differently, they often introduce a facility for application developers to tune the pass. We will limit the search space to only those transformations which the compiler could produce under some combination of these optimization parameters.

In order to evaluate the performance of our mechanism, we need some way to evaluate the execution time of compiled code. To make this feasable, we will restrict our study to source code which forms a complete program which can have its execution time measured in some automated fashion. There have been attempts to use static analysis to determine the execution time of programs without actually executing them,[1] but to guarantee that our measurements are meaningful, we will avoid these.

Search Methodologies

For our initial trials, we used ACOVEA. ACOVEA uses genetic algorithms to perform a directed search of the transformation space. The software we attempted to optimize was the GCC compiler itself, and we studied its execution time by measuring how long it took it to compile the SPEC CPU2006 benchmark (excluding those benchmarks which required a Fortran compiler). Unfortunately, ACOVEA requires a very large number of trial runs to produce results, and when a trial run consists of building a very large program and executing a time-consuming benchmark, the amount of time it takes to run can be prohibitive if large clusters of machines aren't available. Also, ACOVEA lacks the ability to resume an interrupted run, so if the machine hangs or runs out of disk space, the process will need to start over.

Future Work

Once we've mastered the ability to determine, for a given program, the optimal set of flags to compile it with (through computations which may be extremely time-consuming,) our goal is to construct a corpus of many programs and their optimal flags. Then, we plan to write software which will enable an arbitrary novel input to be compared against the programs in the corpus extremely quickly; the corpus program most similar to the input will be retrieved, and its flags can be used as a "reasonable approximation" of what the best flags for the input would be, without needing to spend days or weeks of machine time finding them.

In order to perform this classification task, we will need to extract certain features from the corpus and input programs and use them to drive some classification algorithm. Some possible features[2] we might use are:

  • Number of floating point operations in the program's loop bodies
  • Number of operands in the program's loop bodies
  • Size of the program's live ranges
  • Average level to which the program's loops are nested

ACOVEA Details

To see which flags were included in the universe of possible solutions for my ACOVEA run, see the ACOVEA configuration file. These flags were selected because they are either commonly suggested or because the GCC manual either recommends them or indicates that they may sometimes produce faster code and may sometimes produce slower code on the i386 architecture. Note that the "command" is set to a custom script, mkcompiler. This script will build a copy of GCC using the specified options and then write a second script to the location where ACOVEA expects its output. This second script installs a compiler (the first script includes the path to the compiler to install in the body of the second script), builds the SPEC CPU2006 benchmarks with it, and then emits the time that the build took.

Building a compiler and then building SPEC with it are very time-consuming operations, and ACOVEA needs to do many trials to get useful data. Based on a sampling of 500 trials, it would have taken one month of run-time for ACOVEA to complete on the hardware I had available (a Mac Mini Core Solo.) This may have been sufficient for the run to finish in the time available, but there were some incidents of the machine running out of disk space in the middle of the run, thus invalidating data generated after that point. ACOVEA does not have any checkpointing facility which would allow an interrupted run to be resumed and selected data discarded, so these incidents required restarting from scratch.

For descriptions of the programs which whose compile time was benchmarked, see the list of SPEC CPU2006 benchmarks. Any programs which required a Fortran compiler were omitted.

References

  1. K. Yotov, X. Li, G. Ren, M. Cibulskis, G. DeJong, M. Garzaran, D. Padua, K. Pingali, P. Stodghill, and P. Wu. A comparison of empirical and model-driven optimization. SIG-PLAN Not., 38(5):63–76, 2003.
  2. M. Stephenson and S. Amarasinghe. Predicting Unroll Factors Using Supervised Classification. International Symposium on Code Generation and Optimization, 2005
Personal tools