{"title": "Sense & Sensitivities: The Path to General-Purpose Algorithmic Differentiation", "book": "Proceedings of Machine Learning and Systems", "page_first": 58, "page_last": 69, "abstract": "We present Zygote, an algorithmic differentiation (AD) system for the Julia language. Zygote is designed to address the needs of both the machine learning and scientific computing communities, who have historically been siloed by their very different tools. As well as fostering increased collaboration between these communities, we wish to enable \\textit{differentiable programming} ($\\partial P$), in which arbitrary numerical programs can make use of gradient-based optimisation. We present and evaluate our proposed solutions to the performance/expressiveness tradeoffs in current systems, as well as our work applying AD to many common programming language features, which is applicable to work in other languages and systems.", "full_text": "SENSE & SENSITIVITIES: THE PATH TO GENERAL-PURPOSE ALGORITHMIC DIFFERENTIATION\r\nMichael J Innes\u00b9\r\nABSTRACT\r\nWe present Zygote, an algorithmic differentiation (AD) system for the Julia language. Zygote is designed to address the needs of both the machine learning and scientific computing communities, who have historically been siloed by their very different tools. As well as fostering increased collaboration between these communities, we wish to enable differentiable programming (\u2202P), in which arbitrary numerical programs can make use of gradient-based optimisation. 
We present and evaluate our proposed solutions to the performance/expressiveness tradeoffs in current systems, as well as our work applying AD to many common programming language features, which is applicable to work in other languages and systems.\r\n
\u00b9 Julia Computing, Inc., Edinburgh, United Kingdom. Correspondence to: Michael J Innes <mike.j.innes@gmail.com>.\r\n
Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019. Copyright 2019 by the author(s).\r\n
1 INTRODUCTION\r\n
Algorithmic Differentiation (AD)\u00b9 has a split personality. Its forward (Wengert, 1964) and later reverse (Speelpenning, 1980) modes were first developed for scientific computing, in languages like Fortran. Since then it has proven a sharp tool in the numerical computing toolbox, finding applications to the valuation of contracts in finance, inference in statistical models, fine-tuning of systems in engineering, process optimisation in operations research, state estimation in quantum mechanics, and much more. Supporting this, diverse implementations have flourished in many high-performance languages (Hascoet & Pascual, 2013; Utke et al., 2008; Hogan, 2014; Shiriaev & Griewank, 1996; Griewank et al., 1996).\r\n
\u00b9 Also expanded as automatic or analytic differentiation.\r\n
Alongside this, much recent innovation has come from the machine learning (ML) community, who independently rediscovered the reverse mode as \u2018backpropagation of errors\u2019 (Rumelhart et al., 1988) and built AD systems tailored to their use cases (Bergstra et al., 2010; Maclaurin et al., 2015; Tokui et al., 2015; Chen et al., 2015; Abadi et al., 2016; Neubig et al., 2017; Paszke et al., 2017). The difficulty of writing Fortran and C++ has led ML practitioners to use higher-level languages; they generally write relatively simple programs (network architectures), though they are typically more demanding about expressiveness and support for higher order differentiation. Where HPC developers are intolerant of AD overhead, ML practitioners typically vectorise (batch) their programs manually to amortise the high cost of the runtime, which is practical for simple matrix-multiply-based architectures.\r\n
Zygote is designed to bring these parallel universes together. It supports high-level and expressive semantics with full support for the control flow and data model of its host language, Julia (Bezanson et al., 2017). Yet a low-overhead source code transform (SCT) based implementation makes it applicable to scalar code. A major consequence is that existing Julia packages now form part of a differentiable library ecosystem, without needing to be rewritten for a given framework. Recent work in ML increasingly incorporates advanced numerical programs, such as physics engines (Degrave et al., 2019), ray tracers (Li et al., 2018) and scientific models (Innes et al., 2019), and we believe there is enormous value in domain experts and ML practitioners sharing code. We refer to the building of these complex end-to-end differentiable systems as Differentiable Programming, or \u2202P (Wikipedia contributors, 2019).\r\n
1.1 Convergence in AD Design\r\n
Operator-overloading and source-to-source approaches to AD have traditionally been distinct, but recent work in the ML community blurs this line. Graph building by operator overloading can be thought of as partial evaluation, eliding indirection in the host language (function calls, control flow) and applying a transformation to the resulting trace of numerical operations. But the trace need not be a pure Wengert list, and \u2018staging\u2019 more of the host language\u2019s constructs (for example, control flow) into the trace makes this look increasingly like source-to-source AD. Many state-of-the-art ML systems, which originally took very different approaches to AD, are converging towards these staged programming approaches (Agrawal et al., 2019; PyTorch Team, 2018; Frostig et al., 2018).\r\n
In a recent system like JAX (Frostig et al., 2018), the Python interpreter can be viewed as taking on the role of compiler. Like Julia\u2019s abstract interpreter, it evaluates code with partial information (types) in order to statically resolve polymorphic methods, and then takes a back seat for program runtime. There is an important difference, however. The tracing approach is inherently lossy with respect to program semantics (a non-lossy system would simply be a Python compiler). In particular, data dependent control flow, I/O, mutable data structures, and global variables can all lead to errors or surprising semantics. Users must be careful to respect referential transparency, avoiding many useful features of the host language.\r\n
Zygote extends the staged approach with support for a broad range of language features. We take a fully-staged compiler IR that preserves all semantics, carry out the derivative transformation (\u00a72.3), add support for Julia\u2019s features one by one (\u00a73), and finally apply Julia\u2019s optimisation procedures. Being semantically lossless, and avoiding the need for user annotations, is essential for our goal of building a differentiable library ecosystem, since otherwise differentiation would have to be manually enabled and verified correct for each new library.\r\n
Our approach and philosophy is closest to that of Tapenade, but with the twist that our source transformation operates on Julia IR \u201cin flight\u201d in the compiler rather than textual source code (\u00a72) (though it is nevertheless purely syntactic, and does not rely on non-local static information such as inferred types). Zygote is also comparable to the Swift for TensorFlow project, but does not require differentiable functions and types to be explicitly annotated, and operates on (lowered) surface syntax rather than typed compiler IR.\r\n
The elegant recursive formulation of AD was introduced by Stalin\u2207 (Pearlmutter & Siskind, 2008) and also used in Myia (van Merri\u00ebnboer et al., 2018), but not generalised beyond a simple \u03bb-calculus language; our work shows that this approach can be married with efficient Tapenade-like handling of control flow, as well as extending its semantics to handle practical language features like mutable data structures.\r\n
2 TRANSFORMING SSA-FORM IR\r\n
Historically, source code transform (SCT)-based AD systems have worked by parsing source code from files, transforming the abstract syntax tree (AST) and emitting a new source code file. The user can then inspect, modify and compile the derivative code at will. This is both a useful feature and a drawback, given that this caller-derives interface requires manual intervention and does not allow libraries to abstract over differentiation.\r\n
A compiler-integrated approach instead hides this process, generating and compiling necessary derivative code alongside the original. This has several benefits: gradient code will never be out of sync with the original program, and can be automatically updated even when functions are dynamically re-defined; a callee-derives interface means libraries can differentiate user-provided functions without the user being aware of it; and since differentiated code need not be represented textually, it frees us to use a more convenient representation.\r\n
Static Single Assignment (SSA) form (Cytron et al., 1991) turns out to be convenient for differentiation, being as expressive as an AST but with a greatly reduced number of features. This section outlines our proposed derivative transform for SSA form code, building on the Wengert list transformation that all reverse-mode ADs use at a minimum.\r\n
2.1 Notation & Background\r\n
The (partial) derivative of a function y = f(x) is typically written $\\frac{\\partial y}{\\partial x}$. An important special case is when the function output is a scalar l, typically a loss to be minimized. We write this as $\\frac{\\partial l}{\\partial x} = \\bar{x}$, known as a sensitivity (or, equivalent for our purposes, a gradient). Reverse mode AD propagates sensitivities without caring about the details of l, so this notation usefully abstracts over it. In forward mode the equivalent perturbation of an intermediate x by some scalar input m is written $\\frac{\\partial x}{\\partial m} = \\dot{x}$. This discussion focuses on reverse mode as the technically more difficult case.\r\n
For uniformity we do not specify the derivatives of component functions like sin(x) or a \u00d7 b directly in the rules of differentiation, but instead treat these as handled via a higher-order differentiation function J. Given a function $y = f(x_1, x_2, \\ldots)$, we write $y, B_y = J(f, x_1, x_2, \\ldots)$; J returns the usual result y as well as a pullback function $B_y$. Then $\\bar{x}_1, \\bar{x}_2, \\ldots = B_y(\\bar{y})$; the pullback accepts the gradient with respect to y and returns gradients with respect to each input $x_i$. Pullbacks are linear functions which implement the chain rule for f, as in equation 1, and for mathematical primitives they are easily written down. Some examples are shown in Table 1.\r\n
$$\\bar{x} = \\frac{\\partial l}{\\partial x} = \\frac{\\partial l}{\\partial y} \\frac{\\partial y}{\\partial x} = B_y(\\bar{y}) \\tag{1}$$\r\n
This notation has the benefit of making no distinction between program subroutines and basic mathematical functions. Indeed, the goal of our AD will be to produce a pullback for the whole program as if it had been built-in to begin with. We can then define differentiation recursively: If during differentiation of f we call J(g, ...), we can either look up a built-in gradient or generate an appropriate pullback for g via some AD technique (such as Innes 2020).\r\n
Table 1. Pullbacks for some simple mathematical functions.\r\n
FUNCTION              PULLBACK\r\n
$y = a + b$           $(\\bar{y}, \\bar{y})$\r\n
$y = a \\times b$     $(\\bar{y} \\times b, \\bar{y} \\times a)$\r\n
$y = \\sin(x)$        $\\bar{y} \\times \\cos(x)$\r\n
$y = \\exp(x)$        $\\bar{y} \\times y$\r\n
$y = \\log(x)$        $\\bar{y} / x$\r\n
2.2 Differentiating Wengert Lists\r\n
Consider the following mathematical function, which may be part of our target program. We assume that y is further used to calculate l, and that we know \u2202l/\u2202y.\r\n
$$y = f(a, b) = \\frac{a}{a + b^2}$$\r\n
We can rewrite this equivalently by naming each intermediate result, using the arrow y \u2190 f(x) to assign the name y to the value of f(x) (i.e. a let binding).\r\n
$y_1 \\leftarrow b^2$\r\n
$y_2 \\leftarrow a + y_1$\r\n
$y_3 \\leftarrow \\frac{a}{y_2}$\r\n
This form can be viewed as a simple programming language; it is often referred to as a Wengert list, tape or graph (Bartholomew-Biggs et al., 2000). The Wengert list is easy to differentiate. First wrap all function calls with J to create a primal version of f.\r\n
$y_1, B_1 \\leftarrow J(\\hat{\\ }, b, 2)$\r\n
$y_2, B_2 \\leftarrow J(+, a, y_1)$\r\n
$y_3, B_3 \\leftarrow J(/, a, y_2)$\r\n
Given the gradient $\\bar{y}_i$, we can call the pullback $B_i$ to get gradients for the inputs to $y_i$. Where a variable x is used multiple times, each corresponding pullback produces a contribution to the gradient (the $\\bar{a}_i$ below) which must be summed. This is motivated by the multivariable chain rule given in equation 2.\r\n
$$\\bar{x} = \\frac{\\partial l}{\\partial x} = \\frac{\\partial l}{\\partial y_1} \\frac{\\partial y_1}{\\partial x} + \\frac{\\partial l}{\\partial y_2} \\frac{\\partial y_2}{\\partial x} \\tag{2}$$\r\n
$$= B_{y_1}(\\bar{y}_1) + B_{y_2}(\\bar{y}_2) \\tag{3}$$\r\n
By applying these steps we can begin with the gradient $\\bar{y} = 1$ and proceed in reverse over the list to get \u2202y/\u2202a and \u2202y/\u2202b. This can be realised either by interpreting the Wengert expression in reverse, or by explicitly creating an adjoint expression as follows. (The underlined variables were defined in the primal, and the adjoint closes over them.)\r\n
$\\bar{y}_3 \\leftarrow 1$\r\n
$\\bar{a}_1, \\bar{y}_2 \\leftarrow \\underline{B_3}(\\bar{y}_3)$\r\n
$\\bar{a}_2, \\bar{y}_1 \\leftarrow \\underline{B_2}(\\bar{y}_2)$\r\n
$\\bar{a} \\leftarrow \\bar{a}_1 + \\bar{a}_2$\r\n
$\\bar{b}, \\_ \\leftarrow \\underline{B_1}(\\bar{y}_1)$\r\n
Realising this code as a function, with $\\bar{y}_3$ as an argument, creates the pullback for f. Inlining all function calls yields an efficient symbolic derivative; the J notation really is just notation.\r\n
$y_2 \\leftarrow a + b^2$\r\n
$\\bar{y}_2 \\leftarrow -\\frac{a}{y_2^2}$\r\n
$y \\leftarrow \\frac{a}{y_2}$\r\n
$\\bar{a} \\leftarrow \\frac{1}{y_2} + \\bar{y}_2$\r\n
$\\bar{b} \\leftarrow 2 b \\bar{y}_2$\r\n
This is enough to implement most common AD systems, which use operator overloading to build a Wengert list during program execution (known as a \u201cdynamic graph\u201d in the ML world). If numerical evaluation is interleaved with forward execution and reverse transformation, the adjoint list need not be explicitly realised. However, the reflection and dynamism required is still expensive; to avoid this we must begin to generalise the Wengert list to express more powerful programs.\r\n
2.3 Static Single Assignment\r\n
SSA form (Cytron et al., 1991) generalises the Wengert list with goto-like control flow, while preserving the explicit data flow that makes analysis straightforward. For example, consider a simple branching function:\r\n
$$f(x) = \\begin{cases} x & x > 0 \\\\ 0.01x & \\text{otherwise} \\end{cases}$$\r\n
Which we can represent as:\r\n
block #1: (x)\r\n
    br #2 unless x > 0\r\n
    br #3(x)\r\n
block #2:\r\n
    $y_1 \\leftarrow 0.01 \\times x$\r\n
    br #3($y_1$)\r\n
block #3: (y)\r\n
    return y\r\n
In SSA form the function is split into a series of basic blocks, delimited by block labels. Each block has a (possibly empty) Wengert list of operations to execute, and ends with a branch to another block (br) which may depend on a condition, or else returns a value to be used as the output of the function. Blocks are equivalent to a set of mutually-recursive closures (Steele Jr & Sussman, 1976; Appel, 1998; MLIR Contributors, 2019).\r\n
Primal code is created much as before, with the addition that dummy arguments are added to blocks to record control flow. In this case block 3 has two predecessors, and b tells us which of the two to branch to in the adjoint.\r\n
block #1: (x)\r\n
    br #2 unless x > 0\r\n
    br #3(x, 1)\r\n
block #2:\r\n
    $y_1, B_{y_1} \\leftarrow J(\\times, 0.01, x)$\r\n
    br #3($y_1$, 2)\r\n
block #3: (y, b)\r\n
    return y\r\n
An important concept is the control flow graph (CFG). Each block is a vertex, with directed edges for each possible branch between blocks. To generate an adjoint for our IR, we begin with the CFG. The adjoint CFG is the transpose of the primal CFG, having all edges reversed. To make this intuitive, consider unrolling a given execution of the function into a Wengert list; the reversed Wengert list must effectively run each (instruction of every) block in reverse order. Thus each time block A branches to block B in the primal, block $\\bar{B}$ must branch to $\\bar{A}$ in the adjoint.\r\n
Creating the adjoint follows the rules for Wengert lists as above, with the addition of differentiating branches. If A passes arguments (x, y) to B, then $\\bar{B}$ should pass $(\\bar{x}, \\bar{y})$ to $\\bar{A}$, just as if the branch were a function call. However, since A may branch to multiple different blocks, passing different arguments, $\\bar{A}$\u2019s argument list must include all gradients it may need. For example, if A also branches to C with arguments (y, z), $\\bar{A}$\u2019s argument list must be $(\\bar{x}, \\bar{y}, \\bar{z})$. $\\bar{B}$ will pass $(\\bar{x}, \\bar{y}, 0)$ and $\\bar{C}$ will pass $(0, \\bar{y}, \\bar{z})$.\u00b2\r\n
block #1: ($\\bar{y}$)\r\n
    br #3($\\bar{y}$) unless $\\underline{b} \\neq 1$\r\n
    br #2\r\n
block #2:\r\n
    $\\_, \\bar{x}_1 \\leftarrow \\underline{B_{y_1}}(\\bar{y})$\r\n
    br #3($\\bar{x}_1$)\r\n
block #3: ($\\bar{x}$)\r\n
    return $\\bar{x}$\r\n
\u00b2 In SSA it is valid to use a variable defined by a previous block (so long as the definition dominates the usage). We avoid special handling for these variables by turning them into explicit block arguments, so that all variables in a block are locally defined.\r\n
As in primitive pullbacks, the adjoint code closes over variables from the primal (the underlined variables). Note, however, that since blocks can execute multiple times, these variables actually refer to a set of values, one for each execution of the given block. The primal can be augmented to record these values on stacks, which the adjoint then pops when the variable is used, so that each run of an adjoint block sees the value of the corresponding primal definition. This is not the only possible approach; for example, the values could be recomputed (checkpointing), and mixed approaches are able to make time-space tradeoffs (Hascoet & Pascual, 2013). In a reversible programming model, such as reversible neural networks (Chang et al., 2017), the core adjoint transformation remains the same but primal values can be re-calculated in reverse, and a combination of approximate reversal and checkpointing is used for differentiation of ODEs (Rackauckas et al., 2018).\r\n
For a more complex example of these rules in practice we take a simple calculation of $x^n$, represented in Julia code as:\r\n
function pow(x, n)\r\n
    r = 1\r\n
    while n > 0\r\n
        n -= 1\r\n
        r *= x\r\n
    end\r\n
    return r\r\n
end\r\n
The primal code illustrates how loops are represented in SSA form, with mutable variables like r split into immutable $r_i$ and explicitly carried across loop iterations.\r\n
block #1: (x, n)\r\n
    br #2(n, 1, 1)\r\n
block #2: ($n_1$, $r_1$, b)\r\n
    br #4 unless $n_1 > 0$\r\n
block #3:\r\n
\r\n
all of Zygote\u2019s semantics and functionality are provided via its library of custom adjoints (that is, manual overrides of the J function), which encompass both core mathematical definitions and support for data structures and mutation, new numerical and mathematical types, hardware specialisations, and mixed-mode AD, typically expressed in only a few lines of code.\r\n
Where practical this section shows Julia code for the im-\r\n
                                         n \u2190n \u22121                                                       plementation, rather than a mathematical abstraction, to\r\n                                                      2         1                                                 demonstrate how these features look in practice. All of\r\n                                             r ,B        \u2190J(\u00d7,r ,x)\r\n                                               2     r2                 1                                         themworkcurrently; most are supported out of the box in\r\n                                                        br#2(n ,r ,2)                                             the core Zygote library, while \ufb01xed-point iterative compu-\r\n                                                                      2    2\r\n                                         block #4:                                                                tations and support for cross-language AD have working\r\n                                                        return r                                                  prototypes outside of the package.\r\n                                                                    1\r\n                                                                                                                  3.1     CustomAdjoints\r\n                     The adjoint code is similarly a loop that computes r\u00af and                                    Manually de\ufb01ning gradients is a crucial part of Zygote\u2019s\r\n                     x\u00af (note that x\u00af is the ith variable representing x\u00af, not the                                interface, and not just for supplying primitives: users are\r\n                                          i                                                                       encouraged to use custom adjoints to build entirely new\r\n                     gradient of xi). 
x is used once in each iteration of the loop,\r\n                     so we accumulate x\u00af across all iterations.3                                                  features, and we show some examples of their somewhat\r\n                                                                                                                  surprising expressive power.\r\n                                           block #1: (y\u00af)                                                         Gradient hooks allow an arbitrary function to be applied\r\n                                                         br#2(y\u00af,0)                                               to the gradient, for example hook(-, x) to reverse the\r\n                                           block #2: (r\u00af ,x\u00af )                                                    sign of x\u00af. There are many uses for this, including gradient\r\n                                                              1     1\r\n                                                         br#4unlessb 6= 1                                         clipping and debugging.\r\n                                                         br#3                                                       hook(f, x) = x\r\n                                           block #3:                                                                @adjoint hook(f, x) =\r\n                                                r\u00af , x\u00af   \u2190B (r\u00af )                                                      (x, dx -> (nothing, f(dx)))\r\n                                                  2    2         r2    1\r\n                                                     x\u00af3 \u2190 x\u00af1 + x\u00af2                                              Thefunctionnestlevelisabletodore\ufb02ectiononthegra-\r\n                                                         br#2(r\u00af ,x\u00af )\r\n                                                                     
 2     3                                     dient process itself; if called within a differentiated function\r\n                                           block #4:                                                              it will return the order of differentiation being performed.\r\n                                                         return x\u00af1\r\n                                                                                                                    nestlevel() = 0\r\n                                                                                                                    @adjoint nestlevel() =\r\n                                                                                                                        (nestlevel()+1, _ -> ())\r\n                     Storing pullbacks (rather than raw primal values) allows us\r\n                     to handle dynamic code where the de\ufb01nition of the func-                                      A simple implementation of checkpointing is similarly\r\n                     tion f is not known until runtime. We need not sacri\ufb01ce                                      straightforward.\r\n                     performance for this; where f can be statically resolved, the\r\n                     closure type tags can be elided and the pullback de\ufb01nition                                     checkpoint(f, x) = f(x)\r\n                     inlined so that only values are stored contiguously on the                                     @adjoint checkpoint(f, x) =\r\n                     stack, behaving at runtime similarly to Tapenade.                                                  
(f(x), dy -> J(f, x)[2](dy))\r\n                     3      DIFFERENTIATION SEMANTICS                                                             Theremaining functionality in this section is similarly pro-\r\n                     Zygote\u2019s core transform is simple and mechanical, and only                                   vided by custom adjoints.\r\n                     around200linesofcode. Itisworthemphasisingthatalmost                                         3.2     DataStructures & Mutation\r\n                          3Seemingly, so also is r. But note each loop iteration sees a\r\n                     different de\ufb01nition of r, so the gradients are independent. A bene\ufb01t                         AlthoughJulia\u2019s data model is fairly complex, we can de\ufb01ne\r\n                     of SSA form is that this distinction becomes syntactically clear,                            differentiation of data structures by starting with a simple\r\n                     and need not be handled specially.                                                           tuple like C = (x ,x ). If we call \ufb01rst(C) to retrieve the\r\n                                                                                                                                             1    2\r\n                                                                      Sense & Sensitivities\r\n               \ufb01rst element we must then \ufb01nd the gradient with respect to          output, thus allowing higher-order derivatives via nested\r\n                                                                             \u00af\r\n               C in the adjoint program. We create an adjoint object C,            application of J (as in J(J,f,x)).\r\n               which mirrors the structure of C while storing the gradient\r\n               of each internal element (x\u00af ,x\u00af ). 
Summing adjoint objects         3.3   ConcurrencyandParallelism\r\n                                            1   2\r\n               sumstheelements. The pullbacks for operations on C are              Julia supports a concurrency model based on communi-\r\n               as follows.                                                         cating sequential processes (CSP, Hoare 1978). A zero-\r\n                          FUNCTION                 PULLBACK                        argument function or closure (a thunk) can be scheduled\r\n                                                    \u00af           \u00af                  as a task (or coroutine), and executed independently of the\r\n                          C=(x ,x )          (\ufb01rst(C),second(C))\r\n                                  1  2                                             main thread. Tasks communicate with each other through\r\n                          y = \ufb01rst(C)                (y\u00af, 0)\r\n                          y = second(C)              (0,y\u00af)                        shared queues called channels. Typically, the main thread\r\n                                                                                   will create a series of tasks and wait for them all to \ufb01nish\r\n               Anyother(immutable)structdiffersonlyinnumberof\ufb01elds                 before continuing.\r\n               or names of accessor functions, making it straightforward           Zygote makes CSP differentiable by the following trans-\r\n               to generalise this.                                                 formation. Firstly, when a task is scheduled, its thunk f is\r\n               Tohandle mutation, consider a one-element \u201cbox\u201d structure           replaced by J(f), producing a pullback. Once the task is\r\n               B. 
Wecanget(B)toretrieve the current stored value, and              complete, we associate it with an adjoint task which will run\r\n               set(B,x) to erase that value and replace it with x. The             the pullback. During the reverse pass, we reach the point\r\n                               \u00af\r\n               adjoint object B is also a box, which we retrieve via lookup        where the original task was awaited in the primal code, and\r\n               rather than by pullback return values; a global lookup is           schedule the adjoint task. The adjoint task executes and\r\n               necessary to handle the non-local data\ufb02ow that mutation             communicates with other adjoint tasks as needed, \ufb01nally\r\n               introduces. The pullbacks are as follows.                                                                 \u00af\r\n                                                                                   producing a gradient of the thunk f.\r\n                          FUNCTION               PULLBACK                          Channels can be differentiated as in \u00a73.2; for each channel\r\n                                                 \u00af      \u00af                          c we create an empty adjoint channel c\u00af. Sending a value to\r\n                          x=get(B)          set(B,get(B)+x\u00af)                       c becomes receiving a sensitivity from c\u00afand vice versa.\r\n                                                   \u00af        \u00af\r\n                          set(B,x)       (x\u00af = get(B);set(B,0);x\u00af)\r\n               Amutablestruct can be seen as a boxed tuple or a tuple of           Julia supports shared-memory parallelism by multiplexing\r\n               boxes; in either case it generalises similarly to other mutable     tasks onto OS threads, so support for tasks means that mul-\r\n               data structures. 
For example, a stack can be implemented            tithreaded code is also differentiable. Julia uses the same\r\n               as a box containing a tuple-based linked list. In general we        concepts, though a slightly different API, for distributed\r\n               will want to use more ef\ufb01cient data structures (e.g. stacks in      / multi-node parellelism, so the same techniques can be\r\n               contiguous memory or hash maps), but the this formalism             straightforwardly transferred to differentiation of distributed\r\n               allows us to easily derive appropriate specialised pullbacks        code. In an experimental setting we were able to achieve a\r\n               for them.                                                           1.5\u00d7speedupwhenusingtwocorestogetthegradient of a\r\n                                                                                   simple function using map-reduce parallelism.\r\n               Onecaveat: pullbacks frequently close over their inputs (for        Care must be taken that accumulate/reset operations in the\r\n               example, both input arrays in matrix multiplication), and           adjoint are atomic, since there may otherwise be a race con-\r\n               if they are mutated the pullback will be incorrect. Arrays          dition due to multiple reads from the same array location in\r\n               must therefore either be immutable, be copied on capture,           the primal, or due to tasks sharing mutable state. Differen-\r\n               or have mutations recorded and reversed during the adjoint          tiation of parallel code at other levels of abstraction, such\r\n               program. 
This is generally not true for operations on other         as the level of parallel for loops or map-reduce, presents\r\n               data structures (which do not get captured), so things like                                                       \u00a8\r\n               stacks need no special support.                                     different challenges and opportunities (Huckelheim et al.,\r\n                                                                                                                                      \u00a8\r\n                                                                                   2019; Hovland, 1997; Naumann et al., 2008; Bucker et al.,\r\n               Closures are just structs with a call method (c2 Wiki Con-          2001).\r\n               tributors, 2018); the parts of the struct represent the closure\u2019s\r\n               environment. When calling closures we need to recognise             3.4   Mixed-ModeAD\r\n               a hidden zeroth argument, the closure environment, and              While reverse mode AD is a powerful tool, especially in\r\n               produce an adjoint for that object. In our compiler all func-       optimisation problems, there are many alternative ways to\r\n               tions actually accept this hidden argument\u2014which may be             calculate derivatives, and specialised approaches can have\r\n               empty as a special case\u2014so both closures and higher-order           advantagesinmanysituations. Forexample,Julia\u2019sforward-\r\n               functions are supported with no extra effort.                       
modeAD(Revelsetal.,2016)hasconstantmemoryover-\r\n               Given that adjoint code makes use of both stacks and clo-           head(comparedtoreversemode\u2019stape,linearinthenumber\r\n               sures, the above ensures that the AD can consume its own            of instructions executed) and has minimal time overhead,\r\n                                                                      Sense & Sensitivities\r\n               making it ideal for long-running computations with a small          and thus the usual gradient update z := z \u2212 \u03b7z\u00aflowers the\r\n               number of inputs. Similarly, TaylorSeries.jl (Benet et al.,         loss. (This is equivalent to differentiating a pair of two reals\r\n               2018) can calculate arbitrary-order forward-mode deriva-            (x,y).)\r\n               tives in one shot.                                                  This sensitivity is not the true complex derivative \u2202        =\r\n                                                                                                                                            \u2202z\r\n               Mixed mode is exposed by writing forwarddiff(f,                      \u2202 + \u2202 = \u2202 \u2212i \u2202 ,which(forholomorphicfunctions)\r\n                                                                                   \u2202x     \u2202iy    \u2202x     \u2202y\r\n               x). This calculates the same result as f(x), but addi-              will satisfy f(z+\u01eb) \u2248 f(z)+\u2202f\u01eb. 
BytheCauchy-Riemann\r\n               tionally calculates the Jacobian via forward mode, stores                                           \u2202z\r\n                                                                                   equations, \u2202f is conjugate to the sensitivity z\u00afof \u211cf(z) mak-\r\n               it, and applies it during the backwards pass using a cus-                       \u2202z\r\n               tom adjoint. Similarly, checkpointed AD is exposed via              ing it straightforward and ef\ufb01cient to calculate. In the more\r\n               checkpoint(f, x) (\u00a73.1). Zygote can be instructed                   general non-holomorphic case one needs either the equiv-\r\n               to always use forward mode (or another AD technique) on             alent 2 \u00d7 2 real Jacobian or the two Wirtinger derivatives\r\n                                                                                   (\u2202f, \u2202f ), both of which are readily derived from the sensi-\r\n               a given function, or even to have heuristics for the best            \u2202z \u2202z\u2217\r\n               method, so that for users of a library, differentiation is ef\ufb01-     tivities of \u211cf(z) and \u2111f(z).\r\n               cient by default.                                                   This generalises straightforwardly to Cm \u2192 Cn functions,\r\n               There are many more ways to exploit problem structure               meaning Zygote can be used for general complex differenti-\r\n               in AD. In nested optimisation problems, for example, as-            ation.\r\n               suming convergence of the inner solver makes some of its            3.6   Staged Programming\r\n               gradients analytical zeros, and avoids differentiation of the\r\n               solver operations. 
As another example, consider evaluating          Zygote does not require users to manually specify which\r\n               an in\ufb01nite Taylor series; in practice only a \ufb01nite number of        code should be staged and optimised \u2013 that is the job of\r\n               terms are considered, up to numerical precision. By default,        a compiler \u2013 but nor does it prevent users from explicitly\r\n               each differentiation of this \ufb01nite series will drop a term, re-     staging computation. In fact, Julia is an excellent tool for\r\n               ducing precision. However, one can de\ufb01ne a custom adjoint           staged- and meta-programming techniques, and Zygote sim-\r\n               that derives the Taylor series itself analytically, which is        ply works with this as any other language feature.\r\n               then evaluated to the same precision as the original function.      For example, many numerical libraries provide an einsum\r\n               This trick can be generalised to a \u201c\ufb01xed-point iteration\u201d op-       interface, allowing tensor operations to be expressed with\r\n               erator which has an appropriate adjoint de\ufb01ned (Schlenkrich         a syntax based on Einstein notation. The syntax is usually\r\n               et al., 2008). Similar concerns come up when differentiating        expressed as a string and, in dynamic interfaces like Py-\r\n               domain speci\ufb01c languages (DSLs), such as Halide (Li et al.,         Torch, parsing the string incurs an overhead each time the\r\n               2018); even if the DSL compiles to differentiable Julia code,       expression is run. 
While Julia provides a dynamic interface,\r\n               it is easier to exploit performance and numerical optimisa-         wedon\u2019t have to pay this cost: Einsum can be implemented\r\n               tions by differentiating at the highest level of abstraction        as a macro, explicitly parsing the notation at compile time\r\n               available.                                                          and leaving only raw tensor operations behind. Zygote sees\r\n               Asimilar problem is differentiating code in other languages,        only the \ufb01nal matrix multiply and sum operations, so this\r\n               for example Python code invoked via PyCall.jl (Johnson              has no overhead compared to writing them manually. The\r\n               et al., 2018). In this case, we can write an adjoint for the        sameistrueofJulia\u2019s other powerful metaprogramming and\r\n               low-level pycall function which invokes a Python AD,                staging tools, such as generated functions.\r\n               capturing its tape in a pullback. To a user, calling imported\r\n               Python functions inside a call to gradient then works               3.7   HardwareBackends\r\n               transparently.\r\n                                                                                   Zygote transforms generic programs and mathematical ex-\r\n               3.5   ComplexDifferentiation                                        pressions \u2013 written in terms of mathematical operators like\r\n               ComplexnumbersarenotaspecialcaseattheADlevel,but                    \u00d7,+etc. \u2013into new generic programs that calculate a gra-\r\n                                                                                   dient. Thus Zygote is completely agnostic to the data types\r\n               instead are treated as any other user-de\ufb01ned type. 
Zygote\u2019s         running through the program and how they are implemented\r\n               pre-de\ufb01ned rules for numerical operations (e.g. multiplica-         or represented in memory. A Zygote program written for\r\n               tion and addition) immediately generalize to the complex            \ufb02oating point numbers therefore works equally well with\r\n               numbers, and only rules for the real and imag functions             rational numbers, arbitrary-precision \ufb02oats and integers,\r\n               are needed in addition for full complex support.                    measurements, hardware-speci\ufb01c types like BFloat16,\r\n               Zygote de\ufb01nes the sensitivity of a complex number z =               and combinations of these.\r\n               x+yibyz\u00af=x\u00af+y\u00afi. Thisde\ufb01nitionis useful for gradient\r\n                                                                            \u2217       julia> gradient(x -> x\u02c62 + 3x + 1, 1/3)\r\n               descent since for small, real \u03b7, f(z + \u03b7z\u00af) \u2248 f(z) + \u03b7z\u00afz\u00af ,         (3.6666666666666665,)\r\n                                                                   Sense & Sensitivities\r\n                                                                               only possible with Zygote\u2019s semantics-preserving approach\r\n                julia> gradient(x -> x\u02c62 + 3x + 1, 1//3)                       to AD.\r\n                (11//3,)\r\n                                                                               Asidefromthecorrectnessbene\ufb01tsofworkingwithtypes,it\r\n                julia> gradient(x -> x\u02c62 + 3x + 1,                             is increasingly recognised that incorporating existing knowl-\r\n                                      1/3 \u00b1 0.01)                              edge and code into machine learning leads to richer and\r\n                (3.6666666666666665 \u00b1 0.02,)    
                               more powerful models; this is particularly valuable in sci-\r\n               Thesameistrueforarrays;theprogramgradient(x ->                  enti\ufb01c computing, where powerful explicit models exist for\r\n               \u03c3.(W x .+ b), x)worksequallywellwhetherW,b                      manysystemsthat need not be learned from scratch (Innes\r\n                     *                                                         et al., 2019).\r\n               and x are dense arrays, sparse arrays, arrays backed by\r\n               GPUmemory,ordistributed arrays stored over a cluster of         3.9   DeepLearning\r\n               hundreds of nodes. The operations \u00d7, broadcasting and so        Zygote is not, in itself, a deep learning library. Deep learn-\r\n               on are called on the adjoint arrays and thus launched on the\r\n               GPUorcluster as appropriate.                                    ing tools and interfaces \u2013 such as for common architectures\r\n               Julia\u2019s TPUsupport,inXLA.jl(Fischer&Saba,2018),takes            like LSTM (Gers et al., 1999) and gradient descent rules like\r\n               advantage of this composability. Rather than being tied to a    ADAM(Kingma&Ba,2014)\u2013areprovidedbyhigher-level\r\n               particular AD implementation, as in current TPU frontends,      libraries like Flux (Innes, 2018; Innes et al., 2018). Never-\r\n               XLA.jl compiles general Julia code to the TPU. When the         theless, many standard things are simple with Zygote alone.\r\n                                                                                                        \u00af \u00af\r\n               program being compiled happens to call gradient, ML             Getting the gradients (W,b) for a logistic regression is a\r\n               happens.                                                        
one-liner:\r\n               3.8  External Libraries                                           gradient(\r\n                                                                                    (W, b) -> loss(\u03c3.(W x .+ b), y), W, b)\r\n                                                                                                              *\r\n               Support for types and libraries distinguishes frameworks\r\n               from programming languages, and we support these in dif-        More complex architectures differ only in the details of\r\n               ferentiable programming too. For example, the Colors.jl         the forward pass, which can include loops and recursion,\r\n               package(Holyetal.,2018)providesrepresentationsofRGB             and reuse common patterns and layers from libraries. A\r\n               colours (among many other colour spaces), and functions         hand-written LSTM looks as follows:\r\n               over these colour spaces can be differentiated. By default,\r\n               each \ufb01eld of a structure is differentiated independently, as      for (x, y) in (xs, ys)\r\n                                                                                    forget = \u03c3.(Wf x + Uf h + bf)\r\n               in \u00a73.2.                                                                                
\r\n    julia> a = RGB(1, 0, 0);\r\n    julia> gradient(a -> a.r^2, a)\r\n    ((r = 2.0f0, g = nothing, b = nothing),)\r\n\r\nColors.jl also provides many useful procedures, such as for computing perceptual colour differences (Luo et al., 2001). These can also be differentiated and even used as loss functions in machine learning models.\r\n\r\n    julia> b = RGB(0, 1, 0);\r\n    julia> colordiff(a, b)\r\n    86.60823557376344\r\n    julia> gradient(b -> colordiff(a, b), b)\r\n    ((r = -1.77, g = 28.88, b = -0.04),)\r\n\r\nWe emphasise that colordiff comprises hundreds of lines of code including types, dispatch, control flow, table lookups, and other language features. It was written before Julia had any AD support, but nevertheless has not needed modifications in order to be safely differentiable. This is\r\n\r\n      input = \u03c3.(Wi * x + Ui * h + bi)\r\n      output = \u03c3.(Wo * x + Uo * h + bo)\r\n      cell\u2032 = tanh.(Wc * x + Uc * h + bc)\r\n      cell = forget .* cell + input .* cell\u2032\r\n      h = o .* tanh.(cell)\r\n      loss += distance(h, y)\r\n    end\r\n\r\nMore complex models with many parameters can be handled by bundling the weights into structures (as in autograd (Maclaurin et al., 2015), \u00a73.2). We can then get the gradient of a model m (which is made callable with an input x to invoke the forward pass) as follows.\r\n\r\n    gradient(m -> distance(m(x), y), m)\r\n\r\nThe model m is equivalent to a closure, where closed-over variables are trainable weights. In Flux we refer to this kind of closure as a \u201clayer\u201d, since they can be composed together just as in a high level library like Keras (Chollet et al., 2015), which effectively implements function combinators. However, they can also be freely mixed with more \u201cimperative\u201d code, as in this homebrew maxout layer (Goodfellow et al., 2013).\r\n\r\n    m1 = Chain(Dense(10, 5, relu), Dense(5, 2))\r\n    m2 = Chain(Dense(10, 5, relu), Dense(5, 2))\r\n    model = x -> softmax(max.(m1(x), m2(x)))\r\n\r\nWe suggest that this kind of layer is a fundamental unit of abstraction in differentiable programming (much as procedures, objects and functions are in their respective paradigms) and thus has relevance well beyond neural networks.\r\n\r\n3.10 Higher-Order Derivatives\r\n\r\nZygote naturally and intuitively supports higher-order derivatives, by differentiating a program that calls gradient.\r\n\r\n    gradient(x -> gradient(sin, x)[1], \u03c0/2)\r\n    (-1.0,)\r\n\r\nHowever, we note that alongside the first-order performance benefits of SCT ADs like Tapenade, Zygote also inherits some of their pitfalls with respect to higher-order differentiation. In particular, the differentiation transform tends to double the size of the original code, leading to exponential code in the order of differentiation, and correspondingly large compile times (tens of seconds for third-order derivatives); this is mitigated by the interpreted approach taken by many tracing ADs. We believe the most promising overall solution will be to interleave SCT differentiation with optimisation (particularly dead code elimination, algebraic simplification and common subexpression elimination), as implemented by Myia (van Merri\u00ebnboer et al., 2018).\r\n\r\nZygote can also differentiate, or be differentiated by, other ADs in Julia, similar to \u00a73.4. In many cases this is preferable to nested reverse mode; for example, for Hessians, forward-over-reverse better exploits the numerical and runtime properties of forward and reverse mode AD.\r\n\r\n4 PERFORMANCE CHARACTERISTICS\r\n\r\nTo evaluate performance we measure AD overhead; that is, the average time spent manipulating AD data structures rather than doing essential numerical work when evaluating a primitive operation (such as addition of two tensors). This metric gives an approximate threshold at which AD becomes the bottleneck in execution time, rather than the numerical workload. For example, overhead of 1\u00b5s is acceptable when running a large ResNet model on a GPU, where kernel launch times are in the microsecond range and large convolution operations take far longer. Conversely, scalar operations lie in the nanosecond range and 1\u00b5s overhead would contribute orders of magnitude to overall execution time.\r\n\r\nWe measure overhead by replacing all array operations with altered versions that do no work, other than computing shapes. All computation time then comes from overhead, either from the Julia runtime or Zygote, and this is averaged over the number of primitive operations (custom adjoints) in the computation. (In the case of the pow benchmark, we assume that the cost of the scalar numerical workload is negligible, making the overhead estimate conservative.) The results are shown in figure 1, and found to be on the order of 50ns or less. Overhead is primarily caused by (a) differentiation producing code that is harder for the Julia compiler to infer, resulting in worse optimisation and (b) the management of Zygote\u2019s heap-allocated stacks. Compiler improvements and stack pre-allocation strategies are planned to reduce this overhead even further, as well as to make optimisations more reliable in the presence of Julia\u2019s heuristic-based compiler.\r\n\r\nFigure 1. AD Overhead Benchmarks. [Bar chart: time (ns), on a scale of 0 to 50, for the pow, mlp, lstm and blackscholes benchmarks.]\r\n\r\nTo demonstrate the impact of AD overhead, we calculate gradients for a multi-layer perceptron using both a tracing AD, Tracker.jl, and Zygote, measuring the ratio of wallclock time taken over a range of batch sizes. Because Tracker\u2019s overhead is comparable to GPU kernel launch overhead, we expect it to take about twice as long at small batch sizes, which is indeed what we find to be the case (figure 2). As batch size grows, AD overhead is amortised, resulting in similar performance for both systems. Benchmarks were conducted on a 3.6GHz Intel i7-7700 with an NVIDIA GTX 1080 GPU.\r\n\r\nFigure 2. Tracing vs SCT Scaling. [Plot: Tracker/Zygote time ratio, on a scale of 1.2 to 2, against batch size from 10^1 to 10^3.]\r\n\r\nWhile Julia\u2019s overall suitability for any given AD use case (deep learning, probabilistic programming, computational fluid dynamics, finance ...) depends on domain-specific factors such as implementation quality of numerical kernels, these benchmarks show Zygote\u2019s suitability to build such a system in combination with other tools, and we note that in many domains common kernels are shared between systems (e.g. BLAS, or CUDNN in deep learning), making AD a bottleneck by default.\r\n\r\n5 CONCLUSION\r\n\r\nWe have presented Zygote, a tool for analytic differentiation of code in the Julia language. Zygote consolidates the best ideas from the many AD tools that came before it, aiming to create a unified interface that meets the needs of as many applications as possible. We have shown how this synthesis of flexibility and high performance can be achieved, especially the simplicity that can be found when using powerful tools from the programming language and compilers communities. Building on work by the AD community, we have also shown how SCT AD can be applied to a wide range of language features, from closures to concurrency.\r\n\r\nWe believe that there is huge potential at the intersection of machine learning and other fields, but current ML frameworks require that domain specific code be rewritten, greatly slowing experimentation. Julia is the first platform to make existing numerical libraries differentiable by default, enabling code sharing and reuse between domain experts and ML researchers in a growing differentiable library ecosystem.\r\n\r\nREFERENCES\r\n\r\nAbadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pp. 265\u2013283, 2016.\r\nAgrawal, A., Modi, A. N., Passos, A., Lavoie, A., Agarwal, A., Shankar, A., Ganichev, I., Levenberg, J., Hong, M., Monga, R., et al. Tensorflow eager: A multi-stage, python-embedded dsl for machine learning. arXiv preprint arXiv:1903.01855, 2019.\r\nAppel, A. W. SSA is functional programming. ACM SIGPLAN Notices, 33(4):17\u201320, 1998.\r\nBartholomew-Biggs, M., Brown, S., Christianson, B., and Dixon, L. Automatic differentiation of algorithms. Journal of Computational and Applied Mathematics, 124(1-2):171\u2013190, 2000.\r\nBenet, L., Sanders, D., et al. TaylorSeries.jl. https://github.com/JuliaDiff/TaylorSeries.jl, 2018. Accessed: 2018-09-22.\r\nBergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. Theano: a cpu and gpu math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy), volume 4. Austin, TX, 2010.\r\nBezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. Julia: A fresh approach to numerical computing. SIAM review, 59(1):65\u201398, 2017.\r\nB\u00fccker, H. M., Lang, B., Bischof, C. H., et al. Bringing together automatic differentiation and openmp. In Proceedings of the 15th international conference on Supercomputing, pp. 246\u2013251. ACM, 2001.\r\nc2 Wiki Contributors. Closures and objects are equivalent. wiki.c2.com/?ClosuresAndObjectsAreEquivalent, 2018. Accessed: 2018-09-22.\r\nChang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., and Holtham, E. Reversible architectures for arbitrarily deep residual neural networks. arXiv preprint arXiv:1709.03698, 2017.\r\nChen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.\r\nChollet, F. et al. Keras, 2015.\r\nCytron, R., Ferrante, J., Rosen, B. K., Wegman, M. N., and Zadeck, F. K. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems (TOPLAS), 13(4):451\u2013490, 1991.\r\nDegrave, J., Hermans, M., Dambre, J., et al. A differentiable physics engine for deep learning in robotics. Frontiers in neurorobotics, 13, 2019.\r\nFischer, K. and Saba, E. Automatic full compilation of julia programs and ml models to cloud tpus. arXiv preprint arXiv:1810.09868, 2018.\r\nFrostig, R., Johnson, M. J., and Leary, C. Compiling machine learning programs via high-level tracing, 2018.\r\nGers, F. A., Schmidhuber, J., and Cummins, F. Learning to forget: Continual prediction with lstm. 1999.\r\nGoodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.\r\nGriewank, A., Juedes, D., and Utke, J. Algorithm 755: Adol-c: a package for the automatic differentiation of algorithms written in c/c++. ACM Transactions on Mathematical Software (TOMS), 22(2):131\u2013167, 1996.\r\nHascoet, L. and Pascual, V. The tapenade automatic differentiation tool: principles, model, and specification. ACM Transactions on Mathematical Software (TOMS), 39(3):20, 2013.\r\nHoare, C. A. R. Communicating sequential processes. In The origin of concurrent programming, pp. 413\u2013443. Springer, 1978.\r\nHogan, R. J. Fast reverse-mode automatic differentiation using expression templates in c++. ACM Transactions on Mathematical Software (TOMS), 40(4):26, 2014.\r\nHoly, T., Jones, D. C., and contributors. Colors.jl. github.com/JuliaGraphics/Colors.jl, 2018. Accessed: 2018-09-22.\r\nHovland, P. D. Automatic differentiation of parallel programs. Number 2003. University of Illinois at Urbana-Champaign Champaign, IL, USA, 1997.\r\nH\u00fcckelheim, J., Hovland, P., Strout, M. M., and M\u00fcller, J.-D. Reverse-mode algorithmic differentiation of an openmp-parallel compressible flow solver. The International Journal of High Performance Computing Applications, 33(1):140\u2013154, 2019.\r\nInnes, M. Flux: Elegant machine learning with julia. J. Open Source Software, 3(25):602, 2018.\r\nInnes, M., Saba, E., Fischer, K., Gandhi, D., Rudilosso, M. C., Joy, N. M., Karmali, T., Singh, A. P., and Shah, V. Fashionable modelling with flux. arXiv preprint arXiv:1811.01457, 2018.\r\nInnes, M., Edelman, A., Fischer, K., Rackauckas, C., Saba, E., Shah, V. B., and Tebbutt, W. A differentiable programming system to bridge machine learning and scientific computing. CoRR, abs/1907.07587, 2019. URL http://arxiv.org/abs/1907.07587.\r\nInnes, M. J. Sense sensitivities: The path to general-purpose algorithmic differentiation, 2020.\r\nJohnson, S. G. et al. PyCall.jl. https://github.com/JuliaPy/PyCall.jl, 2018. Accessed: 2018-09-22.\r\nKingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\r\nLi, T.-M., Gharbi, M., Adams, A., Durand, F., and Ragan-Kelley, J. Differentiable programming for image processing and deep learning in halide. ACM Transactions on Graphics (TOG), 37(4):139, 2018.\r\nLuo, M. R., Cui, G., and Rigg, B. The development of the cie 2000 colour-difference formula: Ciede2000. Color Research & Application: Endorsed by Inter-Society Color Council, The Colour Group (Great Britain), Canadian Society for Color, Color Science Association of Japan, Dutch Society for the Study of Color, The Swedish Colour Centre Foundation, Colour Society of Australia, Centre Fran\u00e7ais de la Couleur, 26(5):340\u2013350, 2001.\r\nMaclaurin, D., Duvenaud, D., and Adams, R. P. Autograd: Effortless gradients in numpy. In ICML 2015 AutoML Workshop, 2015.\r\nMLIR Contributors. Block arguments vs phi nodes, 2019. URL https://github.com/tensorflow/mlir/blob/master/g3doc/Rationale.md#block-arguments-vs-phi-nodes. Accessed 15-August-2019.\r\nNaumann, U., Hasco\u00ebt, L., Hill, C., Hovland, P., Riehme, J., and Utke, J. A framework for proving correctness of adjoint message-passing programs. In European Parallel Virtual Machine/Message Passing Interface Users\u2019 Group Meeting, pp. 316\u2013321. Springer, 2008.\r\nNeubig, G., Dyer, C., Goldberg, Y., Matthews, A., Ammar, W., Anastasopoulos, A., Ballesteros, M., Chiang, D., Clothiaux, D., Cohn, T., et al. Dynet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980, 2017.\r\nPaszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.\r\nPearlmutter, B. A. and Siskind, J. M. Reverse-mode ad in a functional framework: Lambda the ultimate backpropagator. ACM Transactions on Programming Languages and Systems (TOPLAS), 30(2):7, 2008.\r\nPyTorch Team. TorchScript. pytorch.org/docs/stable/jit.html, 2018. Accessed: 2018-09-22.\r\nRackauckas, C., Ma, Y., Dixit, V., Guo, X., Innes, M., Revels, J., Nyberg, J., and Ivaturi, V. A comparison of automatic differentiation and continuous sensitivity analysis for derivatives of differential equation solutions. arXiv preprint arXiv:1812.01892, 2018.\r\nRevels, J., Lubin, M., and Papamarkou, T. Forward-mode automatic differentiation in julia. arXiv preprint arXiv:1607.07892, 2016.\r\nRumelhart, D. E., Hinton, G. E., Williams, R. J., et al. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1988.\r\nSchlenkrich, S., Walther, A., Gauger, N. R., and Heinrich, R. Differentiating fixed point iterations with adol-c: Gradient calculation for fluid dynamics. In Modeling, Simulation and Optimization of Complex Processes, pp. 499\u2013508. Springer, 2008.\r\nShiriaev, D. and Griewank, A. Adol-f: Automatic differentiation of fortran codes. Computational Differentiation: Techniques, Applications, and Tools, pp. 375\u2013384, 1996.\r\nSpeelpenning, B. Compiling fast partial derivatives of functions given by algorithms. Technical report, Illinois Univ., Urbana (USA). Dept. of Computer Science, 1980.\r\nSteele Jr, G. L. and Sussman, G. J. 
Lambda: The ultimate imperative. Technical report, MASSACHUSETTS INST OF TECH CAMBRIDGE ARTIFICIAL INTELLIGENCE LAB, 1976.\r\nTokui, S., Oono, K., Hido, S., and Clayton, J. Chainer: a next-generation open source framework for deep learning. In Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS), volume 5, pp. 1\u20136, 2015.\r\nUtke, J., Naumann, U., Fagan, M., Tallent, N., Strout, M., Heimbach, P., Hill, C., and Wunsch, C. Openad/f: A modular open-source tool for automatic differentiation of fortran codes. ACM Transactions on Mathematical Software (TOMS), 34(4):18, 2008.\r\nvan Merri\u00ebnboer, B., Breuleux, O., Bergeron, A., and Lamblin, P. Automatic differentiation in ml: Where we are and where we should be going. In Advances in Neural Information Processing Systems, pp. 8770\u20138780, 2018.\r\nWengert, R. E. A simple automatic derivative evaluation program. Communications of the ACM, 7(8):463\u2013464, 1964.\r\nWikipedia contributors. Differentiable programming, 2019. URL https://en.wikipedia.org/wiki/Differentiable_programming. Accessed 15-August-2019.\r\n", "award": [], "sourceid": 16, "authors": [{"given_name": "Mike", "family_name": "Innes", "institution": "Julia Computing"}]}