{"title": "AutoPhase: Juggling HLS Phase Orderings in Random Forests with Deep Reinforcement Learning", "book": "Proceedings of Machine Learning and Systems", "page_first": 70, "page_last": 81, "abstract": "The performance of the code a compiler generates depends on the order in which it applies the optimization passes.  \r\nChoosing a good order--often referred to as the {\\em phase-ordering} problem--is an NP-hard problem. As a result, existing solutions rely on a variety of heuristics.\r\nIn this paper, we evaluate a new technique to address the phase-ordering problem: deep reinforcement learning.\r\nTo this end, we implement a framework that takes a program and finds a sequence of passes that optimize the performance of the generated circuit. \r\nWithout loss of generality, we instantiate this framework in the context of an LLVM compiler and target high-level synthesis programs. \r\nWe use random forests to quantify the correlation between the effectiveness of a given pass and the program's features. This helps us reduce the search space by avoiding orderings that are unlikely to improve the performance of a given program. \r\nWe compare the performance of deep reinforcement learning to state-of-the-art algorithms that address the phase-ordering problem.\r\nIn our evaluation, we show that reinforcement learning improves circuit performance by 28\\%\r\nwhen compared to using the -O3 compiler flag, and it achieves competitive results compared to the state-of-the-art solutions, while requiring fewer samples. \r\nMore importantly, unlike existing state-of-the-art solutions, our reinforcement learning solution can generalize to more than 12,000 different programs after training on as few as a hundred programs for less than ten minutes.", "full_text": "AUTOPHASE: JUGGLING HLS PHASE ORDERINGS IN RANDOM FORESTS\r\n                                           WITHDEEPREINFORCEMENTLEARNING\r\n                    AmeerHaj-Ali*1 QijingHuang*1 WilliamMoses2 JohnXiang1 KrsteAsanovic1 JohnWawrzynek1\r\n                                                                       Ion Stoica1\r\n                                                                      ABSTRACT\r\n                    The performance of the code a compiler generates depends on the order in which it applies the optimization\r\n                    passes. Choosing a good order\u2013often referred to as the phase-ordering problem, is an NP-hard problem. As a\r\n                    result, existing solutions rely on a variety of heuristics. In this paper, we evaluate a new technique to address the\r\n                    phase-ordering problem: deep reinforcement learning. To this end, we implement AutoPhase: a framework that\r\n                    takes a program and uses deep reinforcement learning to \ufb01nd a sequence of compilation passes that minimizes its\r\n                    execution time. Without loss of generality, we construct this framework in the context of the LLVM compiler\r\n                    toolchain and target high-level synthesis programs. We use random forests to quantify the correlation between\r\n                    the effectiveness of a given pass and the program\u2019s features. This helps us reduce the search space by avoiding\r\n                    phase orderings that are unlikely to improve the performance of a given program. We compare the performance of\r\n                    AutoPhase to state-of-the-art algorithms that address the phase-ordering problem. 
In our evaluation, we show that\r\n                    AutoPhase improves circuit performance by 28% when compared to using the -O3 compiler \ufb02ag, and achieves\r\n                    competitive results compared to the state-of-the-art solutions, while requiring fewer samples. Furthermore, unlike\r\n                    existing state-of-the-art solutions, our deep reinforcement learning solution shows promising result in generalizing\r\n                    to real benchmarks and 12,874 different randomly generated programs, after training on a hundred randomly\r\n                    generated programs.\r\n               1    INTRODUCTION                                               In this paper, we build off the LLVM compiler (Lattner &\r\n               High-Level Synthesis (HLS) automates the process of cre-        Adve, 2004). However, our techniques, can be broadly ap-\r\n               ating digital hardware circuits from algorithms written in      plicable to any compiler that uses a series of optimization\r\n               high-level languages. Modern HLS tools (Xilinx, 2019; In-       passes. In this case, the optimization of an HLS program\r\n               tel, 2019; Canis et al., 2013) use the same front-end as the    consists of applying a sequence of analysis and optimiza-\r\n               traditional software compilers. They rely on traditional soft-  tion phases, where each phase in this sequence consumes\r\n               ware compiler techniques to optimize the input program\u2019s        the output of the previous phase, and generates a modi\ufb01ed\r\n               intermediate representation (IR) and produce circuits in the    version of the program for the next phase. Unfortunately,\r\n               form of RTL code. Thus, the quality of compiler front-end       these phases are not commutative which makes the order in\r\n               optimizations directly impacts the performance of HLS-          which these phases are applied critical to the performance\r\n               generated circuit.                                              of the output.\r\n               Program optimization is a notoriously dif\ufb01cult task. A pro-     Consider the program in Figure 1, which normalizes a vec-\r\n               gram must be just in \u201dthe right form\u201d for a compiler to         tor. Without any optimizations, the norm function will take\r\n                                                                               \u0398(n2)tonormalizeavector. However,asmartcompilerwill\r\n               recognize the optimization opportunities. This is a task a      implement the loop invariant code motion (LICM) (Much-\r\n               programmer might be able to perform easily, but is often        nick, 1997) optimization, which allows it to move the call to\r\n               dif\ufb01cult for a compiler. Despite a decade of research on        magabovetheloop,resultinginthecodeontheleftcolumn\r\n               developing sophisticated optimization algorithms, there is      in Figure 2. This optimization brings the runtime down to\r\n               still a performance gap between the HLS generated code          \u0398(n)\u2014abigspeedupimprovement. Anotheroptimization\r\n               and the hand-optimized one produced by experts.                 the compiler could perform is (function) inlining (Muchnick,\r\n                 *Equal contribution   1University of California, Berkeley     1997). 
With inlining, a call to a function is simply replaced with the body of the function, reducing the overhead of the function call. Applying inlining to the code will result in the code in the right column of Figure 2.

    __attribute__((const))
    double mag(const double *A, int n) {
        double sum = 0;
        for(int i=0; i<n; i++){
            sum += A[i] * A[i];
        }
        return sqrt(sum);
    }

    void norm(double * restrict out,
              const double * restrict in, int n) {
        for(int i=0; i<n; i++) {
            out[i] = in[i] / mag(in, n);
        }
    }

Figure 1: A simple program to normalize a vector.

Now, consider applying these optimization passes in the opposite order: first inlining, then LICM. After inlining, we get the code on the left of Figure 3. Once again we get a modest speedup, having eliminated n function calls, though our runtime is still Θ(n^2). If the compiler afterwards attempted to apply LICM, we would find the code on the right of Figure 3. LICM was able to successfully move the allocation of sum outside the loop. However, it was unable to move the instruction setting sum=0 outside the loop, as doing so would mean that all iterations excluding the first one would end up with a garbage value for sum. Thus, the internal loop will not be moved out.

As this simple example illustrates, the order in which the optimization phases are applied can be the difference between the program running in Θ(n^2) versus Θ(n). It is thus crucial to determine the optimal phase ordering to maximize the circuit speed. Unfortunately, not only is this a difficult task, but the optimal phase ordering may vary from program to program. Furthermore, it turns out that finding the optimal sequence of optimization phases is an NP-hard problem, and exhaustively evaluating all possible sequences is infeasible in practice. In this work, for example, the search space extends to more than 2^247 phase orderings.

The goal of this paper is to provide a mechanism for automatically determining good phase orderings for HLS programs to optimize for circuit speed. To this end, we aim to leverage recent advancements in deep reinforcement learning (RL) (Sutton & Barto, 1998; Haj-Ali et al., 2019b) to address the phase-ordering problem. With RL, a software agent continuously interacts with the environment by taking actions. Each action can change the state of the environment and generate a "reward". The goal of RL is to learn a policy, that is, a mapping between the observed states of the environment and a set of actions, in order to maximize the cumulative reward. An RL algorithm that uses a deep neural network to approximate the policy is referred to as a deep RL algorithm. In our case, the observation from the environment could be the program and/or the optimization passes applied so far. The action is the optimization pass to apply next, and the reward is the improvement in the circuit performance after applying this pass. The particular framing of the problem as an RL problem has a significant impact on the solution's effectiveness, and significant challenges exist in understanding how to formulate the phase-ordering optimization problem in an RL framework.

In this paper, we consider three approaches to represent the environment's state. The first approach is to directly use salient features from the program. The second approach is to derive the features from the sequence of optimizations we applied while ignoring the program's features. The third approach combines the first two. We evaluate these approaches by implementing a framework that takes a group of programs as input and quickly finds a phase ordering that competes with state-of-the-art solutions. Our main contributions are:

• We extend a previous work (Huang et al., 2019) and leverage deep RL to address the phase-ordering problem.
• We perform an importance analysis on the features using random forests to significantly reduce the state and action spaces.
• We build AutoPhase: a framework that integrates the current HLS compiler infrastructure with the deep RL algorithms.
• We show that AutoPhase achieves a 28% improvement over -O3 for nine real benchmarks. Unlike all state-of-the-art approaches, deep RL demonstrates the potential to generalize to thousands of different programs after training on a hundred programs.
    /* Left: after LICM */
    void norm(double * restrict out,
              const double * restrict in, int n) {
        double precompute = mag(in, n);
        for(int i=0; i<n; i++) {
            out[i] = in[i] / precompute;
        }
    }

    /* Right: after LICM, then inlining */
    void norm(double * restrict out,
              const double * restrict in, int n) {
        double precompute, sum = 0;
        for(int i=0; i<n; i++){
            sum += in[i] * in[i];
        }
        precompute = sqrt(sum);
        for(int i=0; i<n; i++) {
            out[i] = in[i] / precompute;
        }
    }

Figure 2: Progressively applying LICM (left) then inlining (right) to the code in Figure 1.

    /* Left: after inlining */
    void norm(double * restrict out,
              const double * restrict in, int n) {
        for(int i=0; i<n; i++) {
            double sum = 0;
            for(int j=0; j<n; j++){
                sum += in[j] * in[j];
            }
            out[i] = in[i] / sqrt(sum);
        }
    }

    /* Right: after inlining, then LICM */
    void norm(double * restrict out,
              const double * restrict in, int n) {
        double sum;
        for(int i=0; i<n; i++) {
            sum = 0;
            for(int j=0; j<n; j++){
                sum += in[j] * in[j];
            }
            out[i] = in[i] / sqrt(sum);
        }
    }

Figure 3: Progressively applying inlining (left) then LICM (right) to the code in Figure 1.

2 BACKGROUND

2.1 Compiler Phase-ordering

Compilers execute optimization passes to transform programs into more efficient forms to run on various hardware targets. Groups of optimizations are often packaged into "optimization levels", such as -O0 and -O3, for ease of use. While these optimization levels offer a simple set of choices for developers, they are handpicked by the compiler designers and often most benefit certain groups of benchmark programs.

The compiler community has attempted to address this issue by selecting a particular set of compiler optimizations on a per-program or per-target basis for software (Triantafyllis et al., 2003; Almagor et al., 2004; Pan & Eigenmann, 2006; Ansel et al., 2014).

Since the search space of phase-ordering is too large for an exhaustive search, many heuristics have been proposed to explore the space by using machine learning. Huang et al. tried to address this challenge for HLS applications by using modified greedy algorithms (Huang et al., 2013; 2015), achieving a 16% improvement over -O3 on the CHStone benchmarks (Hara et al., 2008), which we also use in this paper. In (Agakov et al., 2006), both independent and Markov models were applied to automatically target an optimized search space for iterative methods to improve the search results. In (Stephenson et al., 2003), genetic algorithms were used to tune heuristic priority functions for three compiler optimization passes. Milepost GCC (Fursin et al., 2011) used machine learning to determine the set of passes to apply to a given program, based on a static analysis of its features. It achieved an 11% execution time improvement over -O3 for the ARC reconfigurable processor on the MiBench program suite.
In (Kulkarni & Cavazos, 2012), the challenge was formulated as a Markov process and supervised learning was used to predict the next optimization, based on the current program state. OpenTuner (Ansel et al., 2014) autotunes a program using an AUC-Bandit-meta-technique-directed ensemble selection of algorithms. Its current mechanism for selecting the compiler optimization passes does not consider the order or support repeated optimizations. Wang et al. (Wang & O'Boyle, 2018) provided a survey of machine learning in compiler optimization, in which they also described that using program features might be helpful. NeuroVectorizer (Haj-Ali et al., 2020; 2019a) used deep RL for automatically tuning compiler pragmas such as vectorization and interleaving factors. NeuroVectorizer achieves 97% of the oracle performance (brute-force search) on a wide range of benchmarks.

2.2 Reinforcement Learning Algorithms

Reinforcement learning (RL) is a machine learning approach in which an agent continually interacts with the environment (Kaelbling et al., 1996). In particular, the agent observes the state of the environment, and based on this observation takes an action. The goal of the RL agent is then to compute a policy, a mapping between the environment states and actions, that maximizes a long-term reward.

RL can be viewed as a stochastic optimization solution for solving Markov Decision Processes (MDPs) (Bellman, 1957), when the MDP is not known. An MDP is defined by a tuple with four elements: S, A, P(s,a), r(s,a), where S is the set of states of the environment, A describes the set of actions or transitions between states, s' ~ P(s,a) describes the probability distribution of next states given the current state and action, and r(s,a) : S \times A \to \mathbb{R} is the reward of taking action a in state s. Given an MDP, the goal of the agent is to gain the largest possible aggregate reward. The objective of an RL algorithm associated with an MDP is to find a decision policy \pi^*(a|s) : s \to A that achieves this goal for that MDP:

    \pi^* = \arg\max_\pi \mathbb{E}_{\tau \sim \pi(\tau)} \Big[ \sum_t r(s_t, a_t) \Big]
          = \arg\max_\pi \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim \pi(s_t, a_t)} [ r(s_t, a_t) ].    (1)
Deep RL leverages a neural network to learn the policy (and sometimes the reward function). Policy Gradient (PG) (Sutton et al., 2000), for example, updates the policy directly by differentiating the aggregate reward in Equation (1):

    \nabla_\theta J = \frac{1}{N} \sum_{i=1}^{N} \Big( \sum_t \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \Big) \Big( \sum_t r(s_{i,t}, a_{i,t}) \Big)    (2)

and updating the network parameters (weights) in the direction of the gradient:

    \theta \leftarrow \theta + \alpha \nabla_\theta J.    (3)

Note that PG is an on-policy method, in that it uses decisions made directly by the current policy to compute the new policy.
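To make Equations (2) and (3) concrete, the following is a minimal, self-contained sketch of the vanilla policy-gradient update for a softmax policy over K discrete actions. The toy reward function is an illustrative stand-in for the cycle-count improvement a profiler would report; it is not part of AutoPhase.

    import numpy as np

    K = 4                  # number of discrete actions (passes, in our setting)
    theta = np.zeros(K)    # policy parameters
    alpha = 0.1            # learning rate

    def policy(theta):
        # Softmax policy pi_theta(a).
        e = np.exp(theta - theta.max())
        return e / e.sum()

    def grad_log_pi(theta, a):
        # grad_theta log pi_theta(a) for a softmax policy: one-hot(a) - pi.
        g = -policy(theta)
        g[a] += 1.0
        return g

    def toy_reward(a):
        # Hypothetical per-action reward standing in for r(s_t, a_t).
        return [0.0, 1.0, 0.2, -0.5][a]

    for _ in range(100):
        grad = np.zeros(K)
        for _ in range(32):  # N sampled trajectories, as in Equation (2)
            actions = [int(np.random.choice(K, p=policy(theta))) for _ in range(5)]
            ret = sum(toy_reward(a) for a in actions)            # sum_t r(s_t, a_t)
            glp = sum(grad_log_pi(theta, a) for a in actions)    # sum_t grad log pi
            grad += glp * ret
        theta += alpha * grad / 32   # Equation (3)

    print(policy(theta))  # probability mass concentrates on the best action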
Over the past couple of years, a plethora of new deep RL techniques have been proposed (Mnih et al., 2016; Ross et al., 2011). In this paper, we mainly focus on Proximal Policy Optimization (PPO) (Schulman et al., 2017) and Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016).

PPO is a variant of PG that enables multiple epochs of minibatch updates to improve the sample complexity. Vanilla PG performs one gradient update per data sample, while PPO uses a novel surrogate objective function to enable multiple epochs of minibatch updates. It alternates between sampling data through interaction with the environment and optimizing the surrogate objective function using stochastic gradient ascent. It performs updates that maximize the reward function while ensuring that the deviation from the previous policy is small. The loss function of PPO is defined as:

    L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \big[ \min\big( r_t(\theta) \hat{A}_t, \; \mathrm{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) \hat{A}_t \big) \big]    (4)

where r_t(\theta) is defined as the probability ratio \pi_\theta(a_t|s_t) / \pi_{\theta_{old}}(a_t|s_t), so that r(\theta_{old}) = 1. This term penalizes policy updates that move r_t(\theta) away from r(\theta_{old}). \hat{A}_t denotes the estimated advantage, which approximates how good a_t is compared to the average. The second term in the min function acts as a disincentive for moving r_t outside of [1-\varepsilon, 1+\varepsilon], where \varepsilon is a hyperparameter.
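Equation (4) is compact enough to render directly in code. The sketch below evaluates the clipped surrogate objective for given per-step log-probabilities and advantage estimates; all input values are placeholders.

    import numpy as np

    def ppo_clip_objective(new_logp, old_logp, adv, eps=0.2):
        ratio = np.exp(new_logp - old_logp)   # r_t(theta) = pi_theta / pi_theta_old
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
        # Equation (4): mean of the elementwise minimum; ascending this
        # objective keeps the new policy close to the old one.
        return np.mean(np.minimum(ratio * adv, clipped * adv))

    new_logp = np.log(np.array([0.30, 0.55, 0.20]))
    old_logp = np.log(np.array([0.25, 0.50, 0.40]))
    adv = np.array([1.0, 0.5, -0.8])
    print(ppo_clip_objective(new_logp, old_logp, adv))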
A3C uses an actor (usually a neural network) that interacts with the critic, which is another network that evaluates the action by computing the value function. The critic tells the actor how good its action was and how it should adjust. The update performed by the algorithm can be seen as \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \hat{A}_t.

2.3 Evolutionary Algorithms

Evolutionary algorithms are another technique that can be used to search for the best compiler pass ordering. They form a family of population-based meta-heuristic optimization algorithms inspired by natural selection. The main idea of these algorithms is to sample a population of solutions and use the good ones to direct the distribution of future generations. Two commonly used evolutionary algorithms are Genetic Algorithms (GA) (Goldberg, 2006) and Evolution Strategies (ES) (Conti et al., 2018).

GA generally requires a genetic representation of the search space, where the solutions are coded as integer vectors. The algorithm starts with a pool of candidates, then iteratively evolves the pool to include solutions with higher fitness through the three following strategies: selection, crossover, and mutation. Selection keeps a subset of solutions with the highest fitness values. These selected solutions act as parents for the next generation. Crossover merges pairs from the parent solutions to produce new offspring. Mutation perturbs the offspring solutions with a low probability. The process repeats until a solution that reaches the goal fitness is found, or after a certain number of generations.

ES works similarly to GA. However, the solutions are coded as real numbers in ES. In addition, ES is self-adapting: the hyperparameters, such as the step size or the mutation probability, are different for different solutions. They are encoded in each solution, so good settings propagate to the next generation together with good solutions. Recent work (Salimans et al., 2017) has used ES to update policy weights for RL and showed that it is a good alternative to gradient-based methods.
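As an illustration of the GA loop just described (selection, crossover, and mutation over integer-coded pass sequences), here is a minimal sketch. The fitness function is a hypothetical placeholder; a real system would invoke the HLS profiler on the circuit produced by the candidate sequence.

    import random

    K, N = 46, 12        # 46 passes (Table 1), sequence length N
    POP, GENS = 20, 50

    def fitness(seq):
        # Placeholder: would be the (negated) cycle count of the circuit
        # compiled with pass sequence `seq`.
        return -sum((s - 23) ** 2 for s in seq)

    def crossover(p1, p2):
        cut = random.randrange(1, N)
        return p1[:cut] + p2[cut:]

    def mutate(seq, rate=0.05):
        return [random.randrange(K) if random.random() < rate else s for s in seq]

    pop = [[random.randrange(K) for _ in range(N)] for _ in range(POP)]
    for _ in range(GENS):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: POP // 2]   # selection: keep the fittest half
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(POP - len(parents))]  # crossover + mutation
        pop = parents + children

    print(max(pop, key=fitness))    # best pass sequence found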
3 THE PROPOSED AUTOPHASE FRAMEWORK FOR AUTOMATIC PHASE ORDERING

We leverage an existing open-source HLS framework called LegUp (Canis et al., 2013) that compiles a C program into a hardware RTL design. In (Huang et al., 2013), an approach is devised to quickly determine the number of hardware execution cycles without requiring time-consuming logic simulation. We develop our RL simulator environment based on the existing harness provided by LegUp and validate our final results by going through the time-consuming logic simulation. AutoPhase takes a program (or multiple programs) and intelligently explores the space of possible passes to figure out an optimal pass sequence to apply. Table 1 lists all the passes used in AutoPhase. The workflow of AutoPhase is illustrated in Figure 4.

Figure 4: The block diagram of AutoPhase. The input programs are compiled to an LLVM IR using Clang/LLVM. The feature extractor and clock-cycle profiler are used to generate the input features (state) and the runtime improvement (reward), respectively, from the IR. The input features and runtime improvement are fed to the deep RL agent as input data to train on. The RL agent predicts the next best optimization passes to apply. After convergence, the HLS compiler is used to compile the LLVM IR to hardware RTL output. The HLS tool LegUp is invoked after the compiler optimization as a back-end pass, which transforms LLVM IR into hardware modules.

3.1 HLS Compiler

AutoPhase takes a set of programs as input and compiles them to a hardware-independent intermediate representation (IR) using the Clang front-end of the LLVM compiler. Optimization and analysis passes act as transformations on the IR, taking a program as input and emitting a new IR as output.
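As a concrete (assumed) example of applying a pass ordering to the IR, the sketch below shells out to LLVM's opt tool with legacy pass-manager flags drawn from Table 1. The exact flag spelling and file names are illustrative and depend on the LLVM version that LegUp ships with.

    import subprocess

    def apply_passes(input_ir, output_ir, passes):
        # Run `opt` with the given legacy pass flags on a textual IR file.
        subprocess.run(["opt", "-S", *passes, input_ir, "-o", output_ir], check=True)

    # Example: the favorable ordering discussed in Section 4
    # (-loop-rotate before -loop-unroll).
    apply_passes("norm.ll", "norm_opt.ll", ["-loop-rotate", "-licm", "-loop-unroll"])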
3.2 Clock-cycle Profiler

Once the hardware RTL is generated, one could run a hardware simulation to gather the cycle count of the synthesized circuit. This process is quite time-consuming, which hinders RL and all other optimization approaches. Therefore, we approximate the cycle count using the profiler in LegUp (Huang et al., 2013), which leverages software traces and runs 20× faster than hardware simulation. In LegUp, the frequency of the generated circuits is set as a compiler constraint that directs the HLS scheduling algorithm. In other words, the HLS tool will always try to generate hardware that can run at a certain frequency. In our experiment setting, without loss of generality, we set the target frequency of all generated hardware to 200 MHz. We experimented with lower frequencies too; the improvements were similar, but the cycle counts the different algorithms achieved were better, as more logic could be fitted into a single cycle.

3.3 IR Feature Extractor

Wang et al. (Wang & O'Boyle, 2018) proposed to convert a program into an observation by extracting all the features from the program. Similarly, in addition to the LegUp back-end tools, we developed analysis passes to extract 56 static features from the program, such as the number of basic blocks, branches, and instructions of various types. We use these features as partially observable states for the RL learning, and hope the neural network can capture the correlation between certain combinations of these features and certain optimizations. Table 2 lists all the features used.

3.4 Random Program Generator

As a data-driven approach, RL generalizes better if we train the agent on more programs. However, there are a limited number of open-source HLS examples online. Therefore, we expand our training set by automatically generating synthetic HLS benchmarks. We first generate standard C programs using CSmith (Yang et al., 2011), a random C program generator originally designed to generate test cases for finding compiler bugs. Then, we develop scripts to filter out programs that take more than five minutes to run on a CPU or fail the HLS compilation.

3.5 Overall Flow of AutoPhase

We integrate the compilation utilities into a simulation environment in Python with APIs similar to an OpenAI gym (Brockman et al., 2016). The overall flow works as follows (a condensed sketch follows at the end of this subsection):

1. The input program is compiled into LLVM IR using Clang/LLVM.
2. The IR Feature Extractor is run to extract salient program features.
3. LegUp compiles the LLVM IR into hardware RTL.
4. The Clock-cycle Profiler estimates a clock-cycle count for the generated circuit.
5. The RL agent takes the program features or the histogram of previously applied passes, together with the improvement in clock-cycle count, as input data to train on.
6. The RL agent predicts the next best optimization passes to apply.
7. New LLVM IR is generated after the new optimization sequence is applied.
8. The machine learning algorithm iterates through steps (2)-(7) until convergence.

Note that AutoPhase uses the LLVM compiler, and the passes used are listed in Table 1. However, adding support for any compiler or optimization passes in AutoPhase is straightforward: only the action and state definitions must be specified again.
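A condensed, gym-style sketch of steps (1)-(8) follows. The helpers compile_to_ir, extract_features, profile_cycles, and apply_pass are stub stand-ins for Clang/LLVM, the IR Feature Extractor, LegUp, and the Clock-cycle Profiler; they are illustrative names, not AutoPhase's actual API.

    import numpy as np

    def compile_to_ir(program):  return program       # stub: Clang front-end, step (1)
    def extract_features(ir):    return np.zeros(56)  # stub: IR Feature Extractor, step (2)
    def profile_cycles(ir):      return 1000.0        # stub: LegUp + profiler, steps (3)-(4)
    def apply_pass(ir, action):  return ir            # stub: LLVM transform pass, step (7)

    class PhaseOrderingEnv:
        """Gym-like environment for the loop in steps (1)-(8)."""
        def __init__(self, program, num_passes=46):
            self.program, self.K = program, num_passes

        def reset(self):
            self.ir = compile_to_ir(self.program)
            self.history = np.zeros(self.K)            # histogram of applied passes
            self.prev_cycles = profile_cycles(self.ir)
            return np.concatenate([extract_features(self.ir), self.history])

        def step(self, action):
            self.ir = apply_pass(self.ir, action)
            self.history[action] += 1
            cycles = profile_cycles(self.ir)
            reward = self.prev_cycles - cycles         # improvement in cycle count
            self.prev_cycles = cycles
            obs = np.concatenate([extract_features(self.ir), self.history])
            return obs, reward, False, {}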
4 CORRELATION OF PASSES AND PROGRAM FEATURES

Table 1: LLVM Transform Passes.

0 -correlated-propagation, 1 -scalarrepl, 2 -lowerinvoke, 3 -strip, 4 -strip-nondebug, 5 -sccp, 6 -globalopt, 7 -gvn, 8 -jump-threading, 9 -globaldce, 10 -loop-unswitch, 11 -scalarrepl-ssa, 12 -loop-reduce, 13 -break-crit-edges, 14 -loop-deletion, 15 -reassociate, 16 -lcssa, 17 -codegenprepare, 18 -memcpyopt, 19 -functionattrs, 20 -loop-idiom, 21 -lowerswitch, 22 -constmerge, 23 -loop-rotate, 24 -partial-inliner, 25 -inline, 26 -early-cse, 27 -indvars, 28 -adce, 29 -loop-simplify, 30 -instcombine, 31 -simplifycfg, 32 -dse, 33 -loop-unroll, 34 -lower-expect, 35 -tailcallelim, 36 -licm, 37 -sink, 38 -mem2reg, 39 -prune-eh, 40 -functionattrs, 41 -ipsccp, 42 -deadargelim, 43 -sroa, 44 -loweratomic, 45 -terminate

Table 2: Program Features.

 0: Number of BBs where total args for phi nodes > 5
 1: Number of BBs where total args for phi nodes is [1, 5]
 2: Number of BBs with 1 predecessor
 3: Number of BBs with 1 predecessor and 1 successor
 4: Number of BBs with 1 predecessor and 2 successors
 5: Number of BBs with 1 successor
 6: Number of BBs with 2 predecessors
 7: Number of BBs with 2 predecessors and 1 successor
 8: Number of BBs with 2 predecessors and successors
 9: Number of BBs with 2 successors
10: Number of BBs with >2 predecessors
11: Number of BBs with Phi node count in range (0, 3]
12: Number of BBs with more than 3 Phi nodes
13: Number of BBs with no Phi nodes
14: Number of Phi nodes at beginning of BB
15: Number of branches
16: Number of calls that return an int
17: Number of critical edges
18: Number of edges
19: Number of occurrences of 32-bit integer constants
20: Number of occurrences of 64-bit integer constants
21: Number of occurrences of constant 0
22: Number of occurrences of constant 1
23: Number of unconditional branches
24: Number of binary operations with a constant operand
25: Number of AShr insts
26: Number of Add insts
27: Number of Alloca insts
28: Number of And insts
29: Number of BBs with instruction count in [15, 500]
30: Number of BBs with fewer than 15 instructions
31: Number of BitCast insts
32: Number of Br insts
33: Number of Call insts
34: Number of GetElementPtr insts
35: Number of ICmp insts
36: Number of LShr insts
37: Number of Load insts
38: Number of Mul insts
39: Number of Or insts
40: Number of PHI insts
41: Number of Ret insts
42: Number of SExt insts
43: Number of Select insts
44: Number of Shl insts
45: Number of Store insts
46: Number of Sub insts
47: Number of Trunc insts
48: Number of Xor insts
49: Number of ZExt insts
50: Number of basic blocks
51: Number of instructions (of all types)
52: Number of memory instructions
53: Number of non-external functions
54: Total arguments to Phi nodes
55: Number of Unary operations

Similar to the case with many deep learning approaches, explainability is one of the major challenges we face when applying deep RL to the phase-ordering challenge. To analyze and understand the correlation of passes and program features, we use random forests (Breiman, 2001) to learn the importance of different features. A random forest is an ensemble of multiple decision trees. The prediction made by each tree can be explained by tracing the decisions made at each node and calculating the importance of different features in making the decision at each node. This helps us identify the effective features and passes to use, and shows whether our algorithms learn informative patterns from the data.

For each pass, we build two random forests to predict whether applying it would improve the circuit performance. The first forest takes the program features as inputs, while the second takes a histogram of previously applied passes. To gather the training data for the forests, we run PPO with a high exploration parameter on 100 randomly generated programs to generate feature-action-reward tuples. The algorithm assigns higher importance to the input features that affect the final prediction more.
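Assuming an off-the-shelf implementation such as scikit-learn, the per-pass importance analysis could look like the sketch below. The training arrays are random placeholders for the PPO-generated feature-action-reward tuples; the actual forests described here are trained on 150,000 real samples.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.integers(0, 50, size=(10_000, 56))  # placeholder program features (Table 2)
    y = rng.integers(0, 2, size=10_000)         # placeholder label: did this pass help?

    forest = RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X, y)
    row = forest.feature_importances_           # one heat-map row; sums to one
    print(np.argsort(row)[-5:])                 # the five most informative features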
4.1 Importance of Program Features

The heat map in Figure 5 shows the importance of different features to the decision of whether a pass should be applied. The higher the value is, the more important the feature is (the sum of the values in each row is one). The random forest is trained with 150,000 samples generated from the random programs. The index mappings of passes and features can be found in Tables 1 and 2. For example, the yellow pixel corresponding to feature index 17 and pass index 23 reflects that number-of-critical-edges greatly affects the decision on whether to apply -loop-rotate. A critical edge in a control flow graph is an edge that is neither the only edge leaving its source block, nor the only edge entering its destination block. Critical edges are commonly seen in loops as back edges, so the number of critical edges might roughly represent the number of loops in a program. The transform pass -loop-rotate detects a loop and transforms a while loop into a do-while loop to eliminate one branch instruction in the loop body. Applying the pass results in better circuit performance, as it reduces the total number of FSM states in a loop.

Figure 5: Heat map illustrating the importance of feature and pass indices.

Other expected behaviors are also observed in this figure. For instance, there is a correlation between the number of branches and the transform passes -loop-simplify, -tailcallelim (which transforms a call of the current function, i.e., self-recursion, followed by a return instruction, into a branch to the entry of the function, creating a loop), and -lowerswitch (which rewrites switch instructions with a sequence of branches). Other interesting behaviors are also captured, for example, the correlation between binary operations with a constant operand and -functionattrs, which marks different operands of a function as read-only (constant). Some correlations are harder to explain, for example, the number of BitCast instructions and -instcombine, which combines instructions into fewer, simpler instructions. This is actually a result of -instcombine reducing the loads and stores that call bitcast instructions for casting pointer types. Another example is the number of memory instructions and -sink, where -sink basically moves memory instructions into successor blocks and delays the execution of memory until needed. Intuitively, whether to apply -sink should depend on whether there is any memory instruction in the program. Our last example is the number of occurrences of constant 0 and -deadargelim, where -deadargelim helps eliminate dead/unused constant-zero arguments.

Overall, we observe that all the passes are correlated to some features and are able to affect the final circuit performance. We also observe that multiple features are not effective at directing decisions, and training with them could increase the variance, which would result in lower prediction accuracy. For example, the total number of instructions does not give a direct indication of whether applying a pass would be helpful: sometimes more instructions could improve the performance (for example, due to loop unrolling), while eliminating unnecessary code could also improve the performance. In addition, the importance of features varies among benchmarks, depending on the tasks they perform.

4.2 Importance of Previously Applied Passes

Figure 6 illustrates the impact of previously applied passes on the new pass to apply. The higher the value is, the more important having the old pass is. From this figure, we learn that for the programs we trained on, the passes -scalarrepl, -gvn, -scalarrepl-ssa, -loop-reduce, -loop-deletion, -reassociate, -loop-rotate, -partial-inliner, -early-cse, -adce, -instcombine, -simplifycfg, -dse, -loop-unroll, -mem2reg, and -sroa are more impactful on the performance than the rest of the passes, regardless of their order in the trajectory. Point (23, 23) has the highest importance, which implies that pass -loop-rotate is very helpful and should be included if not applied before. By examining thousands of the programs, we find that -loop-rotate indeed reduces the cycle count significantly. Interestingly, applying this pass twice is not harmful if the two applications are consecutive. However, applying this pass twice with some other passes between them is sometimes very harmful. Another interesting behavior our heat map captured is the fact that applying pass 33 (-loop-unroll) after (not necessarily consecutively) pass 23 (-loop-rotate) was much more useful than applying these two passes in the opposite order.

Figure 6: Heat map illustrating the importance of indices of previously applied passes and the new pass to apply.
5 PROBLEM FORMULATION

5.1 The RL Environment Definition

Assume the optimal number of passes to apply is N and there are K transform passes to select from in total; our search space S for the phase-ordering problem is then [0, K^N). Given M program features and the history of already applied passes, the goal of deep RL is to learn the next best optimization pass a to apply, such that the long-term cycle count of the generated hardware circuit is minimized. Note that the optimization state s is partially observable in this case, as the M program features cannot fully capture all the properties of a program.

Action Space: we define our action space A as {a ∈ Z : a ∈ [0, K)}, where K is the total number of transform passes.

Observation Space: two types of input features were considered in our evaluation: (1) program features o_f ∈ Z^M, listed in Table 2, and (2) the action history, which is a histogram of previously applied passes, o_a ∈ Z^K. After each RL step where pass i is applied, we call the feature extractor in our environment to return the new o_f, and update the action histogram element o_{a_i} to o_{a_i} + 1.
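Written as (assumed) OpenAI Gym spaces, the two definitions above might look as follows; the concatenated observation layout is our illustration rather than a confirmed detail of AutoPhase.

    import numpy as np
    from gym import spaces

    K, M = 46, 56                           # passes (Table 1), features (Table 2)
    action_space = spaces.Discrete(K)       # a in [0, K)
    observation_space = spaces.Box(         # o_f concatenated with the histogram o_a
        low=0, high=np.inf, shape=(M + K,), dtype=np.float32)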
Reward: the cycle count of the generated circuit is reported by the clock-cycle profiler at each RL iteration. Our reward is defined as R = c_prev - c_cur, where c_prev and c_cur represent the previous and the current cycle count of the generated circuit, respectively. It is possible to define a different reward for different objectives. For example, the reward could be defined as the negative of the area, in which case the RL agent will optimize for area. It is also possible to co-optimize multiple objectives (e.g., area, execution time, power, etc.) by defining a combination of different objectives.

5.2 Applying Multiple Passes per Action

An alternative to the action formulation above is to evaluate a complete sequence of passes of length N, instead of a single action a, at each RL iteration. Upon the start of a new training episode, the RL agent resets all pass indices p ∈ Z^N to the index value K/2. For pass p_i at index i, the next action to take is either to change to a new pass or not. By allowing positive and negative index updates for each p_i, we reduce the total number of steps required to traverse all possible pass indices. The sub-action space a_i for each pass is thus defined as [-1, 0, 1]. The total action space A is defined as [-1, 0, 1]^N. At each step, the RL agent predicts the updates [a_1, a_2, ..., a_N] to the N passes, and the current optimization sequence [p_1, p_2, ..., p_N] is updated to [p_1 + a_1, p_2 + a_2, ..., p_N + a_N].
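A small sketch of one step of this multiple-passes-per-action update; clipping the indices to [0, K) is our assumption, added only to keep the sketch well-defined at the boundaries.

    import numpy as np

    K, N = 46, 12
    p = np.full(N, K // 2)                    # episode reset: every index starts at K/2
    a = np.random.choice([-1, 0, 1], size=N)  # one sub-action per pass slot
    p = np.clip(p + a, 0, K - 1)              # p_i <- p_i + a_i (clipping is our assumption)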
5.3 Normalization Techniques

In order for the trained RL agent to work on new programs, we need to properly normalize the program features and rewards so that they represent a meaningful state across different programs. In this work, we experiment with two techniques: (1) taking the logarithm of the program features or rewards, and (2) normalizing to a parameter of the original input program that roughly captures the problem size. For technique (1), note that taking the logarithm of the program features not only reduces their magnitude, it also correlates them differently in the neural network: since w_1·log(o_f1) + w_2·log(o_f2) = log(o_f1^w1 · o_f2^w2), the network learns to correlate products of features rather than a linear combination of them. For technique (2), we normalize the program features to the total number of instructions in the input program (o_f_norm = o_f / o_f51), which is feature #51 in Table 2.
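The two techniques could be sketched as follows. The 0-based position of feature #51 and the use of log1p to keep zero-valued counts finite are our assumptions.

```python
# Sketch of the two feature-normalization techniques of Section 5.3.
import numpy as np

TOTAL_INSTR_IDX = 50  # feature #51 (total instructions), 0-based (assumed)

def normalize_log(features):
    """Technique (1): log-compress feature magnitudes.
    log1p instead of log keeps zero-valued counts finite (our choice)."""
    return np.log1p(features.astype(np.float64))

def normalize_by_total_instructions(features):
    """Technique (2): o_f / o_f51, so the observation describes the
    distribution of instruction kinds rather than absolute counts."""
    total = max(float(features[TOTAL_INSTR_IDX]), 1.0)  # avoid divide-by-zero
    return features.astype(np.float64) / total
```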
6 EVALUATION

To run our deep RL algorithms we use RLlib (Liang et al., 2017), an open-source library for reinforcement learning that offers both high scalability and a unified API for a variety of applications. RLlib is built on top of Ray (Moritz et al., 2018), a high-performance distributed execution framework targeted at large-scale machine learning and reinforcement learning applications. We ran the framework on a four-core Intel i7-4765T CPU with a Tesla K20c GPU for training and inference.

We set our frequency constraint in HLS to 200 MHz and use the number of clock cycles reported by the HLS profiler as the circuit performance metric. In (Huang et al., 2013), results showed a one-to-one correspondence between the clock cycle count and the actual hardware execution time under a fixed frequency constraint. Therefore, a better clock cycle count leads to better hardware performance.
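For illustration, training a PPO agent on an environment like the PhaseOrderEnv sketch from Section 5.1 could be wired up through RLlib roughly as follows. The configuration keys and entry point follow the classic ray.tune API, which has changed across RLlib releases, and load_program is a hypothetical benchmark loader; treat this as an outline, not the authors' training script.

```python
# Rough outline of driving a PhaseOrderEnv-style environment with RLlib's
# PPO trainer via the classic ray.tune entry point. Exact keys and entry
# points differ between RLlib releases; `load_program` is a hypothetical
# benchmark loader, and this is not the authors' actual training script.
import ray
from ray import tune
from ray.tune.registry import register_env

def env_creator(env_config):
    return PhaseOrderEnv(load_program(env_config["program"]))

register_env("autophase_env", env_creator)

ray.init()
tune.run(
    "PPO",
    stop={"training_iteration": 100},
    config={
        "env": "autophase_env",
        "env_config": {"program": "adpcm"},
        "model": {"fcnet_hiddens": [256, 256]},  # 256x256 FC net (Sec. 6.2)
        "num_workers": 3,  # fits the four-core machine described above
    },
)
```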
6.1 Performance

To evaluate the effectiveness of various algorithms for tackling the phase-ordering problem, we run them on nine real HLS benchmarks and compare the final HLS circuit performance and the sample efficiency against state-of-the-art approaches, which include random search, greedy algorithms (Huang et al., 2013), OpenTuner (Ansel et al., 2014), and genetic algorithms (Fortin et al., 2012). These benchmarks are adapted from CHStone (Hara et al., 2008) and the LegUp examples. They are: adpcm, aes, blowfish, dhrystone, gsm, matmul, mpeg2, qsort, and sha. For this evaluation, the input features/rewards were not normalized, the pass sequence length was set to 45, and each algorithm was run on a per-program basis. Table 3 lists the action and observation spaces used in all the deep RL algorithms.

Table 3: The observation and action spaces used in the different deep RL algorithms.

                    RL-PPO1           RL-PPO2         RL-PPO3                             RL-A3C            RL-ES
Deep RL Algorithm   PPO               PPO             PPO                                 A3C               ES
Observation Space   Program Features  Action History  Action History + Program Features  Program Features  Program Features
Action Space        Single-Action     Single-Action   Multiple-Action                    Single-Action     Single-Action

Figure 7: Circuit Speedup and Sample Size Comparison.

The bar chart in Figure 7 shows the percentage improvement of the circuit performance over the -O3 results on the nine real benchmarks from CHStone. The dots on the blue line in Figure 7 show the total number of samples for each program, i.e., the number of times the algorithm calls the simulator to gather the cycle count. -O0 and -O3 are the default compiler optimization levels. RL-PPO1 is a PPO explorer where we set all the rewards to 0 to test whether the rewards are meaningful. RL-PPO2 is the PPO agent that learns the next pass based on a histogram of applied passes. RL-A3C is the A3C agent that learns based on the program features. Greedy performs the greedy algorithm, which always inserts the pass that achieves the highest speedup at the best position (out of all possible positions it can be inserted into) in the current sequence. RL-PPO3 uses a PPO agent and the program features, but with the multiple-pass action space described in Section 5.2. OpenTuner runs an ensemble of six algorithms drawn from two families, particle swarm optimization (Kennedy, 2010) and GA, each with three different crossover settings. RL-ES is similar to the A3C agent that learns based on the program features, but it updates the policy network using evolution strategies instead of backpropagation. Genetic-DEAP (Fortin et al., 2012) is a genetic algorithm implementation. Random generates a sequence of 45 passes at once instead of sampling them one-by-one.
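For reference, the insert-at-best-position rule of the greedy baseline can be sketched as below; evaluate_cycles is a hypothetical wrapper around one HLS compile-and-profile run, and the loop structure is our reading of the description above rather than code from (Huang et al., 2013).

```python
# Sketch of the greedy baseline: at every step, try each candidate pass at
# every position of the current sequence and keep the single insertion that
# lowers the cycle count the most. `evaluate_cycles` is a hypothetical
# helper that compiles the program with the given sequence and profiles it.
def greedy_phase_ordering(program, passes, max_len=45):
    seq = []
    best_cycles = evaluate_cycles(program, seq)
    while len(seq) < max_len:
        best_candidate = None
        for p in passes:                     # every candidate pass...
            for pos in range(len(seq) + 1):  # ...at every insertion position
                trial = seq[:pos] + [p] + seq[pos:]
                cycles = evaluate_cycles(program, trial)
                if cycles < best_cycles:
                    best_cycles, best_candidate = cycles, trial
        if best_candidate is None:           # no insertion helps; stop early
            break
        seq = best_candidate
    return seq, best_cycles
```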
From Greedy, we see that always adding the pass that achieves the highest reward in the current sequence leads to sub-optimal circuit performance. RL-PPO2 achieves higher performance than RL-PPO1, which shows that deep RL captures useful information during training. Using the histogram of applied passes results in better sample efficiency, while using the program features with more samples results in a slightly higher speedup. RL-PPO2, for example, at the minor cost of a 4% lower speedup, achieves 50× better sample efficiency than OpenTuner. Using ES to update the policy is supposed to be more sample efficient for problems with sparse rewards like ours; however, our experiments did not benefit from that. Furthermore, RL-PPO3 with multiple action updates achieves a higher speedup than the other deep RL algorithms with a single action. One reason is that RL-PPO3 can explore more passes per compilation, as it applies multiple passes simultaneously between compilations, whereas the other deep RL algorithms apply a single pass at a time.

6.2 Generalization

With deep RL, the search should benefit from prior knowledge learned from other, different programs, and this knowledge should be transferable from one program to another. For example, as discussed in Section 4, applying the pass -loop-rotate is always beneficial, and -loop-unroll should be applied after -loop-rotate. Note that black-box search algorithms, such as OpenTuner, GA, and greedy algorithms, cannot generalize. For these algorithms, rerunning a new search with many compilations is necessary for every new program, as they do not learn any patterns from the programs to direct the search; they can be viewed as a smart random search.

To evaluate how generalizable deep RL could be across different programs and whether any prior knowledge could be useful, we train on 100 randomly-generated programs using PPO. Random programs are used for transfer learning due to the lack of sufficient benchmarks, and because they are the worst-case scenario, i.e., they are very different from the programs we use for inference. The improvement could be higher if we trained on programs similar to the ones we run inference on. We train a network with 256 × 256 fully connected layers and use the histogram of previously applied passes, concatenated with the program features, as the observation, and the passes as actions.

As described in Section 5.3, we experiment with two normalization techniques for the program features: (1) taking the logarithm of all the program features, and (2) normalizing the program features to the total number of instructions in the program. In each pass sequence, the intermediate reward was defined as the logarithm of the improvement in cycle count after applying each pass. The logarithm was chosen so that the RL agent does not give much larger weights to big rewards from programs with longer execution times.
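A sketch of that log-scaled intermediate reward is given below; how non-positive improvements are scored is our assumption, since the paper only motivates the logarithm for large positive rewards.

```python
# Sketch of the log-scaled intermediate reward used for generalization
# training. Scoring of non-positive improvements is our assumption.
import math

def intermediate_reward(prev_cycles, cur_cycles):
    improvement = prev_cycles - cur_cycles
    if improvement <= 0:
        return 0.0                  # assumed: no reward for regressions
    return math.log(improvement)    # compress large improvements
```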
Three approaches were evaluated: filtered-norm1 uses the filtered program features and passes (based on the analysis in Section 4, where we keep only the important features and passes) with normalization technique (1); original-norm2 uses all the program features and passes with normalization technique (2); and filtered-norm2 uses the filtered program features and passes from Section 4 with normalization technique (2). Filtering the features and passes might not be ideal, especially when different programs have different feature characteristics and impactful passes. However, reducing the number of features and passes helps to reduce the variance among all programs and significantly narrows the search space.

Figure 8: Episode reward mean as a function of step for the original approach, which uses all the program features and passes, and for the filtered approach, which filters the passes and features (with different normalization techniques). Higher values indicate faster circuit speed.

Figure 9: Circuit Speedup and Sample Size Comparison for deep RL Generalization.

Figure 8 shows the episode reward mean as a function of the step for the three approaches. We observe that filtered-norm2 and filtered-norm1 converge much faster and achieve a higher episode reward mean than original-norm2, which uses all the features and passes.
At roughly 8,000 steps, filtered-norm2 and filtered-norm1 already achieve a very high episode reward mean, with only minor improvements in later steps. Furthermore, the episode reward mean of the filtered approaches remains higher than that of original-norm2 even when we allowed the latter to train for 20 times more steps (i.e., 160,000 steps). This indicates that filtering the features and passes significantly improved the learning process. All three approaches learned to always apply the pass -loop-rotate, and to apply -loop-unroll after -loop-rotate. Another useful pass that the three approaches learned to apply is -loop-simplify, which performs several transformations that turn natural loops into a simpler form, enabling subsequent analyses and transformations.

We now compare the generalization results of filtered-norm2 and filtered-norm1 with the other black-box algorithms. We use the 100 randomly-generated programs as the training set and the nine real benchmarks from CHStone as the testing set for the deep RL-based methods. With the state-of-the-art black-box algorithms, we first search for the best pass sequences, i.e., those that achieve the lowest aggregated hardware cycle counts on the 100 random programs, and then directly apply them to the nine test-set programs. In Figure 9, the bar chart shows the percentage improvement of the circuit performance over -O3 on the nine real benchmarks, and the dots on the blue line show the total number of samples each inference takes for one new program.

This evaluation shows that deep RL-based inference achieves a higher speedup on new programs than the predetermined sequences produced by the state-of-the-art black-box algorithms. Predetermined sequences that are overfitted to the random programs can cause poor performance on unseen programs (e.g., -24% for Genetic-DEAP). Besides, normalization technique (2) works better than normalization technique (1) for deep RL generalization (4% vs. 3% speedup). This indicates that normalizing the different instruction counts to the total number of instructions, i.e., the distribution of the different instructions in technique (2), represents more universal characteristics across different programs, while taking the logarithm in technique (1) only suppresses the value ranges of the program features. Furthermore, when we use another 12,874 randomly generated programs as the testing set with filtered-norm2, the speedup is 6% compared to -O3.
7 CONCLUSIONS

In this paper, we propose an approach based on deep RL to improve the performance of HLS designs by optimizing the order in which the compiler applies optimization phases. We use random forests to analyze the relationship between program features and optimization passes. We then leverage this relationship to reduce the search space by identifying the optimization phases most likely to improve performance, given the program features. Our RL-based approach achieves 28% better performance than compiling with the -O3 flag after training for a few minutes, and a 24% improvement after training for less than a minute. Furthermore, we show that, unlike prior work, our solution shows potential to generalize to a variety of programs. While in this paper we have applied deep RL to HLS, we believe the same approach can be successfully applied to software compilation and optimization. Going forward, we envision using deep RL techniques to optimize a wide range of programs and systems.

ACKNOWLEDGEMENT

This research is supported in part by NSF CISE Expeditions Award CCF-1730628, the Defense Advanced Research Projects Agency (DARPA) through the Circuit Realization at Faster Timescales (CRAFT) Program under Grant HR0011-16-C0052, the Computing On Network Infrastructure for Pervasive Perception, Cognition and Action (CONIX) Research Center, NSF Grant 1533644, LANL Grant 531711, DOE Grant DE-SC0019323, and gifts from Alibaba, Amazon Web Services, Ant Financial, CapitalOne, Ericsson, Facebook, Futurewei, Google, IBM, Intel, Microsoft, Nvidia, Scotiabank, Splunk, VMware, and ADEPT Lab industrial sponsors and affiliates. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

REFERENCES

Agakov, F., Bonilla, E., Cavazos, J., Franke, B., Fursin, G., O'Boyle, M. F., Thomson, J., Toussaint, M., and Williams, C. K. Using machine learning to focus iterative optimization. In Proceedings of the International Symposium on Code Generation and Optimization, pp. 295-305. IEEE Computer Society, 2006.

Almagor, L., Cooper, K. D., Grosul, A., Harvey, T. J., Reeves, S. W., Subramanian, D., Torczon, L., and Waterman, T. Finding effective compilation sequences. ACM SIGPLAN Notices, 39(7):231-239, 2004.

Ansel, J., Kamil, S., Veeramachaneni, K., Ragan-Kelley, J., Bosboom, J., O'Reilly, U.-M., and Amarasinghe, S. OpenTuner: An extensible framework for program autotuning. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, pp. 303-316. ACM, 2014.

Bellman, R. A Markovian decision process. Journal of Mathematics and Mechanics, pp. 679-684, 1957.

Breiman, L. Random forests. Machine Learning, 45(1):5-32, 2001.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Canis, A., Choi, J., Aldham, M., Zhang, V., Kammoona, A., Czajkowski, T., Brown, S. D., and Anderson, J. H. LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems. ACM Transactions on Embedded Computing Systems (TECS), 13(2):24, 2013.

Conti, E., Madhavan, V., Such, F. P., Lehman, J., Stanley, K., and Clune, J. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In Advances in Neural Information Processing Systems, pp. 5032-5043, 2018.

Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A., Parizeau, M., and Gagné, C. DEAP: Evolutionary algorithms made easy. Journal of Machine Learning Research, 13:2171-2175, July 2012.

Fursin, G., Kashnikov, Y., Memon, A. W., Chamski, Z., Temam, O., Namolaru, M., Yom-Tov, E., Mendelson, B., Zaks, A., Courtois, E., et al. Milepost GCC: Machine learning enabled self-tuning compiler. International Journal of Parallel Programming, 39(3):296-327, 2011.

Goldberg, D. E. Genetic Algorithms. Pearson Education India, 2006.

Haj-Ali, A., Ahmed, N. K., Willke, T., Shao, S., Asanovic, K., and Stoica, I. Learning to vectorize using deep reinforcement learning. In Workshop on ML for Systems at NeurIPS, December 2019a.
Haj-Ali, A., Ahmed, N. K., Willke, T., Gonzalez, J., Asanovic, K., and Stoica, I. A view on deep reinforcement learning in system optimization. arXiv preprint arXiv:1908.01275, 2019b.

Haj-Ali, A., Ahmed, N. K., Willke, T., Shao, S., Asanovic, K., and Stoica, I. NeuroVectorizer: End-to-end vectorization with deep reinforcement learning. In International Symposium on Code Generation and Optimization (CGO), February 2020.

Hara, Y., Tomiyama, H., Honda, S., Takada, H., and Ishii, K. CHStone: A benchmark program suite for practical C-based high-level synthesis. In IEEE International Symposium on Circuits and Systems (ISCAS 2008), pp. 1192-1195, 2008.

Huang, Q., Lian, R., Canis, A., Choi, J., Xi, R., Brown, S., and Anderson, J. The effect of compiler optimizations on high-level synthesis for FPGAs. In 2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 89-96. IEEE, 2013.

Huang, Q., Lian, R., Canis, A., Choi, J., Xi, R., Calagar, N., Brown, S., and Anderson, J. The effect of compiler optimizations on high-level synthesis-generated hardware. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 8(3):14, 2015.
Huang, Q., Haj-Ali, A., Moses, W., Xiang, J., Stoica, I., Asanovic, K., and Wawrzynek, J. AutoPhase: Compiler phase-ordering for HLS with deep reinforcement learning. In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 308-308. IEEE, 2019.

Intel. Intel High-Level Synthesis Compiler, 2019. URL https://www.intel.com/content/www/us/en/software/programmable/quartus-prime/hls-compiler.html.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

Kennedy, J. Particle swarm optimization. Encyclopedia of Machine Learning, pp. 760-766, 2010.

Kulkarni, S. and Cavazos, J. Mitigating the compiler optimization phase-ordering problem using machine learning. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '12, 2012.

Lattner, C. and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization (CGO 2004), pp. 75-86. IEEE, 2004.

Liang, E., Liaw, R., Moritz, P., Nishihara, R., Fox, R., Goldberg, K., Gonzalez, J. E., Jordan, M. I., and Stoica, I. RLlib: Abstractions for distributed reinforcement learning. arXiv preprint arXiv:1712.09381, 2017.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937, 2016.

Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M. I., et al. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 561-577, 2018.

Muchnick, S. S. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.

Pan, Z. and Eigenmann, R. Fast and effective orchestration of compiler optimizations for automatic performance tuning. In Proceedings of the International Symposium on Code Generation and Optimization, pp. 319-332. IEEE Computer Society, 2006.

Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627-635, 2011.

Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Stephenson, M., Amarasinghe, S., Martin, M., and O'Reilly, U.-M. Meta optimization: Improving compiler heuristics with machine learning. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, PLDI '03, 2003.

Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057-1063, 2000.

Triantafyllis, S., Vachharajani, M., Vachharajani, N., and August, D. I. Compiler optimization-space exploration. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, pp. 204-215. IEEE Computer Society, 2003.

Wang, Z. and O'Boyle, M. Machine learning in compiler optimization. Proceedings of the IEEE, 106(11):1879-1901, November 2018.

Xilinx. Vivado High-Level Synthesis, 2019. URL https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html.

Yang, X., Chen, Y., Eide, E., and Regehr, J. Finding and understanding bugs in C compilers. In ACM SIGPLAN Notices, volume 46, pp. 283-294. ACM, 2011.