{"title": "Continuous Integration of Machine Learning Models with ease.ml/ci: Towards a Rigorous Yet Practical Treatment", "book": "Proceedings of Machine Learning and Systems", "page_first": 322, "page_last": 333, "abstract": "Continuous integration is an indispensable step of modern software engineering practices to systematically manage the life cycles of system development. Developing a machine learning model is no different: it is an engineering process with a life cycle, including design, implementation, tuning, testing, and deployment. However, most, if not all, existing continuous integration engines do not support machine learning as a first-class citizen.\n\nIn this paper, we present ease.ml/ci, to the best of our knowledge the first continuous integration system for machine learning. The challenge of building ease.ml/ci is to provide rigorous guarantees, e.g., single accuracy point error tolerance with 0.999 reliability, with a practical amount of labeling effort, e.g., 2K labels per test. 
We design a domain-specific language that allows users to specify integration conditions with reliability constraints, and develop simple novel optimizations that can lower the number of labels required by up to two orders of magnitude for test conditions popularly used in real production systems.", "full_text": "CONTINUOUS INTEGRATION OF MACHINE LEARNING MODELS WITH EASE.ML/CI: TOWARDS A RIGOROUS YET PRACTICAL TREATMENT\r\n\r\n
Cedric Renggli (1), Bojan Karla\u0161 (1), Bolin Ding (2), Feng Liu (3), Kevin Schawinski (4), Wentao Wu (5), Ce Zhang (1)\r\n\r\n
(1) Department of Computer Science, ETH Zurich, Switzerland; (2) Data Analytics and Intelligence Lab, Alibaba Group; (3) Huawei Technologies; (4) Modulos AG; (5) Microsoft Research. Correspondence to: Cedric Renggli <cedric.renggli@inf.ethz.ch>, Ce Zhang <ce.zhang@inf.ethz.ch>.\r\n
Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019. Copyright 2019 by the author(s).\r\n\r\n
ABSTRACT\r\n
Continuous integration is an indispensable step of modern software engineering practices to systematically manage the life cycles of system development. Developing a machine learning model is no different: it is an engineering process with a life cycle, including design, implementation, tuning, testing, and deployment. However, most, if not all, existing continuous integration engines do not support machine learning as a first-class citizen.\r\n
In this paper, we present ease.ml/ci, to the best of our knowledge the first continuous integration system for machine learning. The challenge of building ease.ml/ci is to provide rigorous guarantees, e.g., single accuracy point error tolerance with 0.999 reliability, with a practical amount of labeling effort, e.g., 2K labels per test. We design a domain-specific language that allows users to specify integration conditions with reliability constraints, and develop simple novel optimizations that can lower the number of labels required by up to two orders of magnitude for test conditions popularly used in real production systems.\r\n\r\n
1 INTRODUCTION\r\n
In modern software engineering (Van Vliet et al., 2008), continuous integration (CI) is an important part of the best practice to systematically manage the life cycle of the development efforts. With a CI engine, the practice requires developers to integrate (i.e., commit) their code into a shared repository at least once a day (Duvall et al., 2007). Each commit triggers an automatic build of the code, followed by running a pre-defined test suite. The developer receives a pass/fail signal from each commit, which guarantees that every commit that receives a pass signal satisfies properties that are necessary for product deployment and/or presumed by downstream software.\r\n\r\n
Developing machine learning models is no different from developing traditional software, in the sense that it is also a full life cycle involving design, implementation, tuning, testing, and deployment. As machine learning models are used in more task-critical applications and are more tightly integrated with traditional software stacks, it becomes increasingly important for the ML development life cycle also to be managed following a systematic, rigid engineering discipline. We believe that developing the theoretical and system foundation for such a life cycle management system will be an emerging topic for the SysML community.\r\n\r\n
[Figure 1. The workflow of ease.ml/ci. Panels: GitHub repository (./.travis.yml, ./.testset, ./ml_codes); steps (1) Define test script, (2) Provide N test examples, (3) Commit a new ML model/code, (4) Get pass/fail signal; test condition and reliability guarantees (an ml section with script, condition, reliability, mode, adaptivity, and steps fields). Example test condition: the new model has at least 2% higher accuracy, estimated within 1% error, with probability 0.9999. Technical contribution: provide guidelines on how large N is in a declarative, rigorous, but still practical way, enabled by novel system optimization techniques.]\r\n\r\n
In this paper, we take the first step towards building, to the best of our knowledge, the first continuous integration system for machine learning. The workflow of the system largely follows that of traditional CI systems (Figure 1), while it allows the user to define machine-learning specific test conditions such as \u201cthe new model can only change at most 10% of the predictions of the old model\u201d or \u201cthe new model must have at least 1% higher accuracy than the old model.\u201d After each commit of a machine learning model/program, the system automatically tests whether these test conditions hold, and returns a pass/fail signal to the developer. Unlike traditional CI, CI for machine learning is inherently probabilistic. As a result, all test conditions are evaluated with respect to an (\u03b5, \u03b4)-reliability requirement from the user, where 1 \u2212 \u03b4 (e.g., 0.9999) is the probability of a valid test and \u03b5 is the error tolerance (i.e., the length of the (1 \u2212 \u03b4)-confidence interval). The goal of the CI engine is to return a pass/fail signal that satisfies the (\u03b5, \u03b4)-reliability requirement.\r\n\r\n
Technical Challenge: Practicality. At first glance, there seems to exist a trivial implementation: for each committed model, draw N labeled data points from the testset, get an (\u03b5, \u03b4)-estimate of the accuracy of the new model, and test whether it satisfies the test conditions or not. The challenge of this strategy is the practicality associated with the label complexity (i.e., how large N is). To get an (\u03b5 = 0.01, \u03b4 = 1 \u2212 0.9999) estimate of a random variable ranging in [0, 1], if we simply apply Hoeffding's inequality, we need more than 46K labels from the user (similarly, 63K labels for 32 models in a non-adaptive fashion and 156K labels in a fully adaptive fashion, see Section 3). The technical contribution of this work is a collection of techniques that lower the number of samples that the system requires to achieve the same reliability by up to two orders of magnitude.\r\n\r\n
In this paper, we make contributions from both the system and machine learning perspectives.\r\n\r\n
  1. System Contributions. We propose a novel system architecture to support a new functionality complementing state-of-the-art ML systems. Specifically, rather than allowing users to compose ad hoc, free-style test conditions, we design a domain-specific language that is more restrictive but expressive enough to capture many test conditions of practical interest.\r\n
  2. Machine Learning Contributions. On the machine learning side, we develop simple, but novel, optimization techniques to optimize for test conditions that can be expressed within the domain-specific language that we designed. Our techniques cover different modes of interaction (fully adaptive, non-adaptive, and hybrid), as well as many popular test conditions that industrial and academic partners found useful. For a subset of test conditions, we are able to achieve up to two orders of magnitude savings on the number of labels that the system requires.\r\n\r\n
Beyond these specific technical contributions, conceptually, this work illustrates that enforcing and monitoring an ML development life cycle in a rigorous way does not need to be expensive. Therefore, ML systems in the near future could afford to support more sophisticated monitoring functionality to enforce the \u201cright behavior\u201d from the developer.\r\n\r\n
In the rest of this paper, we start by presenting the design of ease.ml/ci in Section 2. We then develop estimation techniques that can lead to strong probabilistic guarantees using test datasets with moderate labeling effort. We present the basic implementation in Section 3 and more advanced optimizations in Section 4. We further verify the correctness and effectiveness of our estimation techniques via an experimental evaluation (Section 5). We discuss related work in Section 6 and conclude in Section 7.\r\n\r\n
2 SYSTEM DESIGN\r\n
We present the design of ease.ml/ci in this section. We start by presenting the interaction model and workflow as illustrated in Figure 1. We then present the scripting language that enables user interactions in a declarative manner. We discuss the syntax and semantics of individual elements, as well as their physical implementations and possible extensions. We end with two system utilities, a \u201csample size estimator\u201d and a \u201cnew testset alarm,\u201d the technical details of which will be explained in Sections 3 and 4.\r\n\r\n
2.1 Interaction Model\r\n
ease.ml/ci is a continuous integration system for machine learning. It supports a four-step workflow: (1) the user describes test conditions in a test configuration script with respect to the quality of an ML model; (2) the user provides N test examples, where N is automatically calculated by the system given the configuration script; (3) whenever the developer commits/checks in an updated ML model/program, the system triggers a build; and (4) the system tests whether the test condition is satisfied and returns a \u201cpass/fail\u201d signal to the developer. When the current testset loses its \u201cstatistical power\u201d due to repetitive evaluation, the system also decides when to request a new testset from the user. The old testset can then be released to the developer as a validation set used for developing new models.\r\n\r\n
We also distinguish between two teams of people: the integration team, who provides the testset and sets the reliability requirement; and the development team, who commits new models. In practice, these two teams can be identical; however, we make this distinction in this paper for clarity, especially in the fully adaptive case. We call the integration team the user and the development team the developer.\r\n\r\n
2.2 An ease.ml/ci Script\r\n
ease.ml/ci provides a declarative way for users to specify requirements of a new machine learning model in terms of a set of test cases. ease.ml/ci then compiles such specifications into a practical workflow to enable evaluation of test cases with rigorous theoretical guarantees. We present the design of the ease.ml/ci scripting language, followed by its implementation as an extension to the .travis.yml format used by Travis CI.\r\n\r\n
Logical Data Model. The core part of an ease.ml/ci script is a user-specified condition for the continuous integration test. In the current version, such a condition is specified over three variables V = {n, o, d}: (1) n, the accuracy of the new model; (2) o, the accuracy of the old model; and (3) d, the percentage of new predictions that are different from the old ones (n, o, d \u2208 [0, 1]). A detailed overview of the exact syntax and its semantics is given in Appendix A.\r\n\r\n
Adaptive vs. Non-adaptive Integration. A prominent difference between ease.ml/ci and traditional continuous integration systems is that the statistical power of a test dataset will decrease when the result of whether a new model passes the continuous integration test is released to the developer. The developer, if she wishes, can adapt her next model to increase its probability of passing the test, as demonstrated by the recent work on adaptive analytics (Blum & Hardt, 2015; Dwork et al., 2015). As we will see, ensuring a probabilistic guarantee in the adaptive case is more expensive as it requires a larger testset. ease.ml/ci allows the user to specify whether the test is adaptive or not with a flag adaptivity (full, none, firstChange):\r\n\r\n
  \u2022 If the flag is set to full, ease.ml/ci releases whether the new model passes the test immediately to the developer.\r\n
  \u2022 If the flag is set to none, ease.ml/ci accepts all commits but sends the information of whether the model really passes the test to a user-specified, third-party email address that the developer does not have access to.\r\n
  \u2022 If the flag is set to firstChange, ease.ml/ci allows full adaptivity before the first time that the test passes (or fails), but stops afterwards and requires a new testset (see Section 3 for more details).\r\n\r\n
Example Scripts. An ease.ml/ci script is implemented as an extension to the .travis.yml file format used in Travis CI by adding an ml section. For example,\r\n\r\n
    ml:\r\n      - script     : ./test_model.py\r\n      - condition  : n - o > 0.02 +/- 0.01\r\n      - reliability: 0.9999\r\n      - mode       : fp-free\r\n      - adaptivity : full\r\n      - steps      : 32\r\n\r\n
This script specifies a continuous test process that, with probability larger than 0.9999, accepts the new commit only if the new model has two points higher accuracy than the old one. This estimation is conducted with an estimation error within one accuracy point in a \u201cfalse-positive free\u201d manner. We give a detailed definition, as well as a simple example, of the two modes fp-free and fn-free in Appendix A.2. The system will release the pass/fail signal immediately to the developer, and the user expects that the given testset can be used as many as 32 times before a new testset has to be provided to the system.\r\n\r\n
Similarly, if the user wants to specify a non-adaptive integration process, she can provide a script as follows:\r\n\r\n
    ml:\r\n      - script     : ./test_model.py\r\n      - condition  : d < 0.1 +/- 0.01\r\n      - reliability: 0.9999\r\n      - mode       : fp-free\r\n      - adaptivity : none -> xx@abc.com\r\n      - steps      : 32\r\n\r\n
It accepts each commit but sends the test result to the email address xx@abc.com after each commit. The assumption is that the developer does not have access to this email account and therefore cannot adapt her next model.\r\n\r\n
Discussion and Future Extensions. The current syntax of ease.ml/ci is able to capture many use cases that our users find useful in their own development process, including reasoning about the accuracy difference between the new and old models, and reasoning about the amount of changes in predictions between the new and old models in the test dataset. In principle, ease.ml/ci can support a richer syntax. We list some limitations of the current syntax that we believe are interesting directions for future work.\r\n\r\n
  1. Beyond accuracy: There are other important quality metrics for machine learning that the current system does not support, e.g., F1-score, AUC score, etc. It is possible to extend the current system to accommodate these scores by replacing Bennett's inequality with McDiarmid's inequality, together with the sensitivity of F1-score and AUC score. In this new context, more optimizations, such as using stratified samples, are possible for skewed cases.\r\n
  2. Ratio statistics: The current syntax of ease.ml/ci intentionally leaves out division (\u201c/\u201d), and it would be useful for a future version to enable relative comparison of qualities (e.g., accuracy, F1-score, etc.).\r\n
  3. Order statistics: Some users think that order statistics are also useful, e.g., to make sure the new model is among the top-5 models in the development history.\r\n\r\n
Another limitation of the current system is that it cannot detect domain drift or concept shift. In principle, this could be thought of as a process similar to CI: instead of fixing the test set and testing multiple models, monitoring concept shift fixes a single model and tests its generalization over multiple test sets over time.\r\n\r\n
The current version of ease.ml/ci does not provide support for all these features. However, we believe that many of them can be supported by developing similar statistical techniques (see Sections 3 and 4).\r\n\r\n
2.3 System Utilities\r\n
In traditional continuous integration, the system often assumes that the user has the knowledge and competency to build the test suite all by herself. This assumption is too strong for ease.ml/ci: among the current users of ease.ml/ci, we observe that even experienced software engineers in large tech companies can be clueless on how to develop a proper testset for a given reliability requirement. One prominent contribution of ease.ml/ci is a collection of techniques that provide practical, but rigorous, guidelines for the user to manage testsets: How large does the testset need to be? When does the system need a new, freshly generated testset? When can the system release the testset and \u201cdowngrade\u201d it into a development set? While most of these questions can be answered by experts based on heuristics and intuition, the goal of ease.ml/ci is to provide systematic, principled guidelines. To achieve this goal, ease.ml/ci provides two utilities that are not provided in systems such as Travis CI.\r\n\r\n
Sample Size Estimator. This is a program that takes as input an ease.ml/ci script, and outputs the number of examples that the user needs to provide in the testset.\r\n\r\n
New Testset Alarm. This subsystem is a program that takes as input an ease.ml/ci script as well as the commit history of machine learning models, and produces an alarm (e.g., by sending an email) to the user when the current testset has been used too many times and thus cannot be used to test the next committed model. Upon receiving the alarm, the user needs to provide a new testset to the system and can also release the old testset to the developer.\r\n\r\n
An impractical implementation of these two utilities is easy: the system alarms the user to request a new testset after every commit and estimates the testset size using the Hoeffding bound. However, this can result in testsets that require tremendous labeling effort, which is not always feasible.\r\n\r\n
What is \u201cPractical?\u201d Practicality is certainly user dependent. Nonetheless, from our experience working with different users, we observe that providing 30,000 to 60,000 labels for every 32 model evaluations seems reasonable for many users: 30,000 to 60,000 is what 2 to 4 engineers can label in a day (8 hours) at a rate of 2 seconds per label, and 32 model evaluations imply (on average) one commit per day in a month. Under this assumption, the user only needs to spend one day per month to provide test labels with a reasonable number of labelers. If the user is not able to provide this amount of labels, a \u201ccheap mode\u201d, where the number of labels per day is reduced by a factor of 10, is achieved for most of the common conditions by increasing the error tolerance by one or two percentage points.\r\n\r\n
Therefore, to make ease.ml/ci a useful tool for real-world users, these utilities need to be implemented in a more practical way. The technical contribution of ease.ml/ci is a set of techniques that we will present next, which can reduce the number of samples the system requests from the user by up to two orders of magnitude.\r\n\r\n
3 BASELINE IMPLEMENTATION\r\n
We describe the techniques to implement ease.ml/ci for user-specified conditions in the most general case. The techniques that we use involve the standard Hoeffding inequality and a technique similar to Ladder (Blum & Hardt, 2015) in the adaptive case. This implementation is general enough to support all user-specified conditions currently supported in ease.ml/ci; however, it can be made more practical when the test conditions satisfy certain properties. We leave optimizations for specific conditions to Section 4.\r\n\r\n
3.1 Sample Size Estimator for a Single Model\r\n
Estimator for a Single Variable. One building block of ease.ml/ci is the estimator of the number of samples one needs to estimate one variable (n, o, or d) to \u03b5 accuracy with probability 1 \u2212 \u03b4. We construct this estimator using the standard Hoeffding bound.\r\n\r\n
A sample size estimator $n : V \\times [0,1]^3 \\mapsto \\mathbb{N}$ is a function that takes as input a variable, its dynamic range, error tolerance and success rate, and outputs the number of samples one needs in a testset. With the standard Hoeffding bound,\r\n\r\n
    $n(v, r_v, \\epsilon, \\delta) = \\frac{-r_v^2 \\ln \\delta}{2 \\epsilon^2}$\r\n\r\n
where $r_v$ is the dynamic range of the variable v, \u03b5 the error tolerance, and 1 \u2212 \u03b4 the success probability.\r\n\r\n
Recall that we make use of the exact grammar used to define the test conditions. A formal definition of the syntax can be found in Appendix A.1.\r\n\r\n
Estimator for a Single Clause. Given a clause C (e.g., n \u2212 o > 0.01) with a left-hand side expression \u03a6, a comparison operator cmp (> or <), and a right-hand side constant, the sample size estimator returns the number of samples one needs to provide an (\u03b5, \u03b4)-estimation of the left-hand side expression. This can be done with a trivial recursion:\r\n\r\n
  1. $n(\\mathrm{EXP} = c * v, \\epsilon, \\delta) = n(v, r_v, \\epsilon / c, \\delta)$, where c is a constant. We have $n(c * v, \\epsilon, \\delta) = \\frac{-c^2 r_v^2 \\ln \\delta}{2 \\epsilon^2}$.\r\n
  2. $n(\\mathrm{EXP1} + \\mathrm{EXP2}, \\epsilon, \\delta) = \\max\\{ n(\\mathrm{EXP1}, \\epsilon_1, \\delta/2),\\ n(\\mathrm{EXP2}, \\epsilon_2, \\delta/2) \\}$, where $\\epsilon_1 + \\epsilon_2 < \\epsilon$. The same equality holds similarly for $n(\\mathrm{EXP1} - \\mathrm{EXP2}, \\epsilon, \\delta)$.\r\n\r\n
Estimator for a Single Formula. Given a formula F that is a conjunction over k clauses $C_1, \\ldots, C_k$, the sample size estimator needs to guarantee that it can satisfy each clause $C_i$. One way to build such an estimator is\r\n\r\n
  3. $n(F = C_1 \\wedge \\ldots \\wedge C_k, \\epsilon, \\delta) = \\max_i n(C_i, \\epsilon, \\delta / k)$.\r\n\r\n
Example. Given a formula F, we now have a simple algorithm for sample size estimation. For\r\n\r\n
    F :- n - 1.1 * o > 0.01 +/- 0.01 /\\ d < 0.1 +/- 0.01\r\n\r\n
the system solves an optimization problem:\r\n\r\n
    $n(F, \\epsilon, \\delta) = \\min_{\\epsilon_1 + \\epsilon_2 = \\epsilon,\\ \\epsilon_1, \\epsilon_2 \\in [0,1]} \\max\\{ \\frac{-\\ln(\\delta/4)}{2 \\epsilon_1^2},\\ \\frac{-1.1^2 \\ln(\\delta/4)}{2 \\epsilon_2^2},\\ \\frac{-\\ln(\\delta/2)}{2 \\epsilon^2} \\}$.\r\n\r\n
3.2 Non-Adaptive Scenarios\r\n
In the non-adaptive scenario, the system evaluates H models, without releasing the result to the developer. The result can be released to the user (the integration team).\r\n\r\n
Sample Size Estimation. Estimation of sample size is easy in this case because all H models are independent.\r\n\r\n
Taking \u03b4 = 0.0001 and \u03b5 = 0.05, we have $n(F, \\epsilon, \\delta / 2^H) = 6{,}279$. 
Assuming the developer checks in the best model\r\n               With probability 1 \u2212 \u03b4, ease.ml/ci returns the right an-           everyday, this means that every month the user needs to\r\n               swer for each of the H models, the number of samples               provide only fewer than seven thousand test samples, a\r\n               one needs for formula F is simply n(F,\u01eb, \u03b4 ). This follows         requirement that is not too crazy. However, if \u01eb = 0.01, this\r\n                                                            H\r\n               from the standard union bound. Given the number of mod-            blows up to 156,955, which is less practical. We will show\r\n               els that user hopes to evaluate (speci\ufb01ed in the steps \ufb01eld        howtotighten this bound in Section 4 for a sub-family of\r\n               of a ease.ml/ci script), the system can then return the            test conditions.\r\n               numberofsamplesinthetestset.                                       NewTestsetAlarm Similartothenon-adaptive scenario,\r\n               NewTestsetAlarm Thealarmforuserstoprovideanew                      the alarm for requesting a new testset is trivial to implement\r\n               testset is easy to implement in the non-adaptive scenario.         \u2014thesystemrequests a new testset when it reaches the pre-\r\n               The system maintains a counter of how many times the               de\ufb01ned budget. At that point, the system can release the\r\n               testset has been used. When this counter reaches the pre-          testset to the developer for future development.\r\n               de\ufb01ned budget (i.e., steps), the system requests a new             3.4   HybridScenarios\r\n               testset from the user. 
In the meantime, the old testset can be     One can obtain a better bound on the number of required\r\n               released to the developer for future development process.          samples by constraining the information being released to\r\n               3.3   Fully-Adaptive Scenarios                                     the developer. Consider the following scenario:\r\n               In the fully-adaptive scenario, the system releases the test         1. If a commit fails, returns Fail to the developer;\r\n               result (a single bit indicating pass/fail) to the developer.         2. If a commit passes, (1) returns Pass to the developer,\r\n               Because this bit leaks information from the testset to the               and (2) triggers the new testset alarm to request a new\r\n               developer, one cannot use union bound anymore as in the                  testset from the user.\r\n               non-adaptive scenario.                                             Comparedwiththefully adaptive scenario, in this scenario,\r\n               Atrivial strategy exists for such a case \u2014 for every model,        the user provides a new testset immediately after the devel-\r\n               uses a different testset. In this case, the number of samples      oper commits a model that passes the test.\r\n               requiredisH\u00b7n(F,\u01eb, \u03b4 ). Thiscanbeimprovedbyapplying                SampleSizeEstimation LetH bethemaximumnumber\r\n                                      H                                           of steps the system supports. Because the system will re-\r\n               aadaptiveargumentsimilartoLadder(Blum&Hardt,2015)                  quest a new testset immediately after a model passes the\r\n               as follows.                                                        
test, it is not really adaptive: As long as the developer con-\r\n               SampleSizeEstimation Forthefullyadaptive scenario,                 tinues to use the same testset, she can assume that the last\r\n               ease.ml/ciusesthefollowingwaytoestimatethesam-                     modelalwaysfails. Assume that the user is a deterministic\r\n               ple size for an H-step process. The intuition is simple.           function that returns a new model given the past history and\r\n               Assumethat a developer is deterministic or pseudo-random,          past feedback (a stream of Fail), there are only H possible\r\n               her decision on the next model only relies on all the previ-       states that we need to apply union bound. This gives us the\r\n               ous pass/fail signals and the initial model H . For H              sameboundasthenon-adaptive scenario: n(F,\u01eb, \u03b4 ).\r\n                                                                    0                                                                  H\r\n                                       H\r\n               steps, there are only 2   possible con\ufb01gurations of the past       New Testset Alarm        Unlike the previous two scenarios,\r\n               pass/failsignals. As a result, one only needs to enforce           the system will alarm the user whenever the model that she\r\n                                              H\r\n               the union bound on all these 2    possibilities. Therefore, the    provides passes the test or reaches the pre-de\ufb01ned budget\r\n               numberofsamplesoneneedsisn(F,\u01eb, \u03b4 ).\r\n                                                          2H                      H,whichevercomesearlier.\r\n               Is the Exponential Term too Impractical?              
The im-      Discussion     It might be counter-intuitive that the hybrid\r\n               proved sample size n(F,\u01eb, \u03b4 ) is much smaller than the\r\n                                             2H                                   scenario, which leaks information to the developer, has the\r\n               one, H \u00b7 n(F,\u01eb, \u03b4 ), required by the trivial strategy. Read-       samesamplesize estimator as the non-adaptive case. Given\r\n                                 H\r\n               ers might worry about the dependency on H for the fully            the maximumnumberofstepsthatthetestset supports, H,\r\n               adaptive scenario. However, for H that is not too large, e.g.,     the hybrid scenario cannot always \ufb01nish all H steps as it\r\n               H=32,theaboveboundcanstillleadtopractical number                   might require a new testset in H\u2032 \u226a H steps. In other\r\n               of samples as the \u03b4 is within a logarithm term. As an\r\n                                    2H                                            words, in contrast to the fully adaptive scenario, the hybrid\r\n               example, consider the following simple condition:                  scenario accommodates the leaking of information not by\r\n                              F :- n > 0.8 +/- 0.05.                              adding more samples, but by decreasing the number of steps\r\n               With H = 32, we have                                               that a testset can support.\r\n                                                                                  Thehybrid scenario is useful when the test is hard to pass\r\n                                                    H                             or fail. For example, imagine the following condition:\r\n                               n(F,\u01eb, \u03b4 ) = ln2       \u2212ln\u03b4.                                   
F :- n - o > 0.1 +/- 0.01\r\n                                        H              2\r\n                                       2            2\u01eb\r\n                                              Continuous Integration of Machine Learning Models with ease.ml/ci\r\n               That is, the system only accepts commits that increase the                  1-\u03b4      \u01eb           F1, F4           F2, F3\r\n               accuracy by 10 accuracy points. In this case, the developer                                  none    full     none     full\r\n                                                                                           0.99     0.1      404     1340     1753     5496\r\n               might take many developing iterations to get a model that                   0.99     0.05    1615     5358     7012    21984\r\n               actually satis\ufb01es the condition.                                            0.99     0.025   6457    21429    28045    87933\r\n                                                                                           0.99     0.01   40355   133930   175282   549581\r\n                                                                                           0.999    0.1      519     1455     2214     5957\r\n               3.5   Evaluation of a Condition                                             0.999    0.05    2075     5818     8854    23826\r\n                                                                                           0.999    0.025   8299    23271    35414    95302\r\n               Given a testset that satis\ufb01es the number of samples given                   0.999    0.01   51868   145443   221333   595633\r\n               by the sample size estimator, we obtain the estimates of the                0.9999   0.1      634     1570     2674     6417\r\n                                                                    \u02c6                      0.9999   0.05    2536     6279    10696    
25668\r\n               three variables used in a clause, i.e., n\u02c6, o\u02c6, and d. Simply               0.9999   0.025  10141    25113    42782   102670\r\n               using these estimates to evaluate a condition might cause                   0.9999   0.01   63381   156956   267385   641684\r\n                                                                                           0.99999  0.1      749     1685     3135     6878\r\n               both false positives and false negatives. In ease.ml/ci,                    0.99999  0.05    2996     6739    12538    27510\r\n               weinstead replace the point estimates by their correspond-                  0.99999  0.025  11983    26955    50150   110038\r\n               ing con\ufb01dence intervals, and de\ufb01ne a simple algebra over                    0.99999  0.01   74894   168469   313437   687736\r\n               intervals (e.g., [a,b]+[c,d] = [a+c,b+d]), which is used            Figure 2. Number of samples required by different conditions,\r\n               to evaluate the left-hand side of a single clause. A clause         H=32steps. Redfontindicates\u201cimpractical\u201dnumberofsamples\r\n               still evaluates to {True, False, Unknown}. 
The system               (see discussion on practicality in Section 2.3).\r\n               then maps this three-value logic into a two-value logic given       (F3: Signi\ufb01cant Quality Milestones)\r\n               user\u2019s choice of either fp-free or fn-free.\r\n                                                                                      F3              :- n - o > [c] +/- [epsilon]\r\n               3.6   UseCasesandPracticality Analysis                                 adaptivity :- firstChange\r\n               The baseline implementation of ease.ml/ci relies on                    mode            :- fp-free\r\n               standard concentration bounds with simple, but novel, twists           ([c] is large)\r\n               to the speci\ufb01c use cases. Despite its simplicity, this imple-       This condition is used for making sure that the repository\r\n               mentation can support real-world scenarios that many of our         onlycontainssigni\ufb01cantqualitymilestones(e.g.,logmodels\r\n               users \ufb01nd useful. We summarize \ufb01ve use cases and analyze            after 10 points of accuracy jump). Although the condition is\r\n               the number of samples required from the user. These use             syntactically the same as F2, it makes sense for the whole\r\n               casesaresummarizedfromobservingtherequirementsfrom                  process to be hybrid adaptive and false-positive free.\r\n               the set of users we have been supporting over the last two          (F4: No Signi\ufb01cant Changes)\r\n               years, ranging from scientists at multiple universities, to real       F4              :- d < [c] +/- [epsilon]\r\n               production applications provided by high-tech companies.               adaptivity :- full | none\r\n               ([c]and[epsilon]areplaceholdersforconstants.)                          
mode            :- fn-free\r\n               (F1: Lower Bound Worst Case Quality)                                   ([c] is large)\r\n                                                                                   This condition is used for safety concerns similar to F1.\r\n                  F1              :- n > [c] +/- [epsilon]                         Whenthemachinelearning application is end-user facing\r\n                  adaptivity :- none                                               or part of a larger application, it is important that its predic-\r\n                  mode            :- fn-free                                       tion will not change signi\ufb01cantly between two subsequent\r\n               This condition is used for quality control to avoid the cases       versions. Here, the process needs to be false-negative free.\r\n               that the developer accidentally commits a model that has            Meanwhile, we see use cases for both fully adaptive and\r\n               an unacceptably low quality or has obvious quality bugs.            non-adative scenarios.\r\n               Wesee many use cases of this condition in non-adaptive              (F5: Compositional Conditions)\r\n               scenario, most of which need to be false-negative free.                  
F5 :- F4 /\\ F2\r\n               (F2: Incremental Quality Improvement)                               Oneofthemostpopulartest conditions is a conjunction of\r\n                                                                                   twoconditions, F4 and F2: The integration team wants to\r\n                  F2              :- n - o > [c] +/- [epsilon]                     use F4 and F2 together so that the end-user facing applica-\r\n                  adaptivity :- full                                               tion will not experience dramatic quality change.\r\n                  mode            :- fp-free\r\n                  ([c] is small)                                                   Practicality Analysis      Howpractical is it for our baseline\r\n               This condition is used for making sure that the machine             implementation to support these conditions, and in which\r\n               learning application monotonically improves over time.              case that the baseline implementation becomes impractical?\r\n               This is important when the machine learning application is          When is the Baseline Implementation Practical? The\r\n               end-user facing, in which it is unacceptable for the quality to     baseline implementation, in spite of its simplicity, is practi-\r\n               drop. In this scenario, it makes sense for the whole process        cal in manycases. Figure2illustratesthenumberofsamples\r\n               to be fully adaptive and false-positive free.                       the system requires for H = 32 steps. 
We see that, for both\r\n                                                Continuous Integration of Machine Learning Models with ease.ml/ci\r\n                F1 and F4, all adaptive strategies are practical up to 2.5             TechnicalObservation2Thesecondtechnicalobservation\r\n                accuracy points, while for F2 and F3, the non-adaptive and             is that, to estimate the difference of predictions between\r\n                hybrid adaptive strategies are practical up to 2.5 accuracy            the new model and the old model, one does not need to\r\n                points and the fully adaptive strategy is only practical up to         have labels. Instead, a sample from the unlabeled dataset\r\n                5 accuracy points. As we see from this example, even with              is enough to estimate the difference. Moreover, to estimate\r\n                a simple implementation, enforcing a rigorous guarantee                n\u2212owhenonly10%datapointshavedifferentpredictions,\r\n                for CI of machine learning is not always expensive!                    oneonlyneedstoprovidelabelsto10%ofthewholetestset.\r\n                When is the Baseline Implementation Not Practical?                     4.1   Pattern 1: Difference-based Optimization\r\n                WecanseefromFigure2thestrongdependencyon\u01eb. This                        The\ufb01rst pattern that ease.ml/ci searches in a formula\r\n                is expected because of the O(1/\u01eb2) term in the Hoeffding               is whether it is of the following form\r\n                inequality. As a result, none of the adaptive strategy is\r\n                practical up to 1 accuracy point, a level of tolerance that                d < A +/- B /\\ n - o > C +/- D\r\n                is important for many task-critical applications of machine            which constrains the amount of changes that a new model\r\n                learning. 
It is also not surprising that the fully adaptive strat-     is allowed to have while ensuring that the new model is\r\n                egy requires more samples than the non-adaptive one, and               noworsethantheoldmodel. Thesetwoclauses popularly\r\n                therefore becomes impractical with higher error tolerance.             appear in test conditions from our users: For production-\r\n                4    OPTIMIZATIONS                                                     level systems, developers start from an already good enough,\r\n                Asweseefromtheprevious sections, the baseline imple-                   deployed model, and spend most of their time \ufb01ne-tuning\r\n                mentation of ease.ml/ci fails to provide a practical ap-               a machine learning model. As a result, the continuous inte-\r\n                proach for low error tolerance and/or fully adaptive cases.            gration test must have an error tolerance as low as a single\r\n                In this section, we describe optimizations that allow us to            accuracypoint. Ontheotherhand,thenewmodelwillnotbe\r\n                further improve the sample size estimator.                             different from the old model signi\ufb01cantly, otherwise more\r\n                                                                                       engaged debugging and investigations are almost inevitable.\r\n                High-level Intuition All of our proposed techniques in this            Assumption. One assumption of this optimization is that it\r\n                section are based on the same intuition: Tightening the sam-           is relatively cheap to obtain unlabeled data samples, whereas\r\n                ple size estimator in the worst case is hard to get better than        it is expensive to provide labels. 
This is true in many of the\r\n                       2\r\n                O(1/\u01eb ); instead, we take the classic system way of think-             applications. When this assumption is valid, both optimiza-\r\n                ing\u2014improvethethesamplesizeestimatorforasub-family                     tions in Section 4.1.1 and Section 4.1.2 can be applied to\r\n                of popular test conditions. Accordingly, ease.ml/ci ap-                this pattern; otherwise, both optimizations still apply but\r\n                plies different optimization techniques for test conditions of         will lead to improvement over only a subset.\r\n                different forms.\r\n                Technical Observation 1 The intuition behind a tighter                 4.1.1   Hierarchical Testing\r\n                sample size estimator relies on standard techniques of tight-          The\ufb01rst optimization is to test the rest of the clauses con-\r\n                ening Hoeffding\u2019s inequality for variables with small vari-            ditioned on d < A +/- B, which leads to an algorithm\r\n                ance. Speci\ufb01cally, when the new model and the old model                with two-level tests. The \ufb01rst level tests whether the dif-\r\n                is only different on up to (100 \u00d7 p)% of the predictions,              ference between the new model and the old model is small\r\n                which could be part of the test condition anyway, for data             enough, whereas the second level tests (n \u2212 o).\r\n                point i, the random variable n \u2212 o has small variance:                 Thealgorithm runs in two steps:\r\n                  \u0002            \u0003                    i     i\r\n                E (n \u2212o )2 <p,wheren ando arethepredictionsof\r\n                      i     i                   i       i\r\n                the new and old models on the data point i. 
This allows us                                       \u2032  \u03b4              \u02c6         \u2032\r\n                                                                                         1. (Filter) Get an (\u01eb , 2)-estimator d with n samples.\r\n                to apply the standard Bennett\u2019s inequality.                                                 \u02c6          \u2032\r\n                                                                                            Test whether d > A+\u01eb : If so, returns False;\r\n                Proposition 1 (Bennett\u2019s inequality). Let X ,...,X             be\r\n                                                                    1        n           2. (Test) Test F as in the baseline implementation (with\r\n                independent and square integrable random variables such                           \u03b4                                              \u2032\r\n                that for some nonnegativeconstantb, |X | \u2264 balmostsurely                    1\u22122probability), conditioned on d < A+2\u01eb .\r\n                                                            i\r\n                for all i < n. We have                                                 It is not hard to see why the above algorithm works \u2014 the\r\n                                                                                       \ufb01rst step only requires unlabeled data points and does not\r\n                      \fP                 \f\r\n                     \u0014                         \u0015           \u0012         \u0012     \u0013\u0013          need human intervention. 
In the second step, conditioned\r\n                      \f     X \u2212E[X]\f                            v      nb\u01eb                                               \u0002            \u0003\r\n                 Pr \f     i   i        i \f > \u01eb    \u22642exp \u2212 h                     ,      on d < p, we know that E (n \u2212o )2                 < p for each\r\n                      \f        n         \f                      b2      v                                                    i     i\r\n                                                                                       data point. Combined with |n \u2212 o | < 1, applying Ben-\r\n                                                                                                                          i     i\r\n                                                                                                                         h\f                   \f     i\r\n                                                                                                                          \f [                 \f\r\n                                    \u0002    \u0003                                             nett\u2019s inequality we have Pr        n\u2212o\u2212(n\u2212o) >\u01eb \u2264\r\n                where v = P E X2 andh(u) = (1+u)ln(1+u)\u2212u                                            \u0010 \u0011                  \f                   \f\r\n                                i      i                                                                \u01eb\r\n                for all positive u.                                                    2exp(\u2212nph p ).\r\n                                            Continuous Integration of Machine Learning Models with ease.ml/ci\r\n               As a result, the second step needs a sample size (for non-       so different in their predictions. 
adaptive scenario) of\r\nn = (ln H - ln(\u03b4/4)) / (p h(\u01eb/p)).\r\nWhen p = 0.1, 1 - \u03b4 = 0.9999, and d < 0.1, we only need 29K samples for 32 non-adaptive steps and 67K samples for 32 fully-adaptive steps to reach an error tolerance of a single accuracy point, 10\u00d7 fewer than the baseline (Figure 2).\r\n4.1.2 Active Labeling\r\nThe previous example gives the user a way to conduct 32 fully-adaptive fine-tuning steps with only 67K samples. Assuming that the developer performs one commit per day, this means that we require 67K samples per month to support the continuous integration service.\r\nOne potential challenge for this strategy is that all 67K samples need to be labeled before the continuous integration service can start working. This is sometimes a strong assumption that many users find problematic. In the ideal case, we hope to interleave the development effort with the labeling effort, and amortize the labeling effort over time.\r\nThe second technique our system uses relies on the observation that, to estimate (n - o), only the data points that have a different prediction between the new and old models need to be labeled. When we know that the new model's predictions differ from the old model's on only 10% of the data points, we only need to label 10% of all data points. It is easy to see that, every time the developer commits a new model, we only need to provide\r\nn = (-ln(\u03b4/4)) / (p h(\u01eb/p)) \u00d7 p\r\nlabels. When p = 0.1 and 1 - \u03b4 = 0.9999, then n = 2188 for an error tolerance of a single accuracy point. If the developer commits one model per day, the labeling team only needs to label 2,188 samples the next day. Given a well-designed interface that enables a labeling throughput of 5 seconds per label, the labeling team only needs to commit 3 hours a day! For a team with multiple engineers, this overhead is often acceptable, considering the guarantee provided by the system down to a single accuracy point.\r\nNotice that active labeling assumes a stationary underlying distribution. One way to enforce this in the system is to ask the user to provide a pool of unlabeled data points at the same time, and then only ask for labels when needed. In this way, we do not need to draw new samples over time.\r\n4.2 Pattern 2: Implicit Variance Bound\r\nIn many cases, the user does not provide an explicit constraint on the difference between a new model and an old model. However, many machine learning models are not [...] Take AlexNet, ResNet, GoogLeNet, AlexNet (Batch Normalized), and VGG for example: when applied to the ImageNet testset, these five models, developed by the ML community since 2012, only produce up to 25% different answers for top-1 correctness and 15% different answers for top-5 correctness! For a typical workload of continuous integration, it is therefore not unreasonable to expect that many consecutive commits would differ less than these ImageNet winners, which involved years of development.\r\nMotivated by this observation, ease.ml/ci will automatically match with the following pattern\r\nn - o > C +/- D.\r\nWhen the unlabeled testset is cheap to get, the system will use one testset to estimate d up to \u01eb = 2D: for a binary classification task, the system can use an unlabeled testset; for multi-class tasks, one can test either the difference of predictions on an unlabeled testset or the difference of correctness on a labeled testset. This gives us an upper bound of n - o. The system then tests n - o up to \u01eb = D on another testset (different from the one used to test d). When this upper bound is small enough, the system will trigger a similar optimization as in Pattern 1. Note that the first testset will be 16\u00d7 smaller than testing n - o directly up to \u01eb = D: 4\u00d7 due to a higher error tolerance, and 4\u00d7 because d has a 2\u00d7 smaller range than n - o.\r\nOne caveat of this approach is that the system does not know how large the second testset would be before execution. The system uses a technique similar to active labeling by incrementally growing the labeled testset every time a new model is committed, if necessary. Specifically, we optimize for test conditions following the pattern\r\nn > A +/- B,\r\nwhen A is large (e.g., 0.9 or 0.95). This can be done by first having a coarse estimation of the lower bound of n, and then conducting a finer-grained estimation conditioned on this lower bound. Note that this can only introduce improvement when the lower bound is large (e.g., 0.9).\r\n4.3 Tight Numerical Bounds\r\nFollowing (Langford, 2005), given a test condition consisting of n i.i.d. random variables drawn from a Bernoulli distribution, one can simply derive a tight bound on the number of samples required to reach (\u01eb, \u03b4) accuracy. Calculating the number of samples requires the probability mass function of the Binomial distribution (the sum of i.i.d. Bernoulli variables). Tight bounds are obtained by finding the minimum number of samples n needed for the worst-case unknown true mean p. This technique can also be extended to more complex queries, where the binomial distribution has to be replaced by a multinomial distribution. The exact analysis has, as for the simple case, no closed-form solution, and deriving efficient approximations is left as further work.\r\nFigure 3. Comparison of Sample Size Estimators in the Baseline Implementation and the Optimized Implementation.\r\nFigure 4. Impact of \u01eb, \u03b4, and p on the Label Complexity.\r\n5 EXPERIMENTS\r\nWe focus on empirically validating the derived bounds and show ease.ml/ci in action next.\r\n5.1 Sample Size Estimator\r\nOne key technique most of our optimizations rely on is that, by knowing an upper bound of the sample variance, we are able to achieve a tighter bound than simply applying the Hoeffding bound. This upper bound can be obtained either by using unlabeled data points to estimate the difference between the new and old models, or by using labeled data points but conducting a coarse estimation first. We now validate our theoretical bound and its impact on improving the label complexity.\r\nFigure 3 illustrates the estimated error and the empirical error by assuming different upper bounds p, for a model with accuracy around 98%. We run GoogLeNet (Jia et al., 2014) on the infinite MNIST dataset (Bottou, 2016) and estimate the true accuracy c. Assuming a non-adaptive scenario, we obtain a range of accuracies achieved by randomly taking n data points. We then estimate the interval \u01eb with the given number of samples n and probability 1 - \u03b4. We see that both the baseline implementation and ease.ml/ci dominate the empirical error, as expected, while ease.ml/ci uses significantly fewer samples. [1]\r\nFigure 4 illustrates the impact of this upper bound on improving the label complexity. We see that the improvement increases significantly when p is reasonably small: when p = 0.1, we can achieve almost 10\u00d7 improvement on the label complexity. Active labeling further increases the improvement, as expected, by another 10\u00d7.\r\n5.2 ease.ml/ci in Action\r\nWe showcase three different test conditions for a real-world incremental development of machine learning models submitted to the SemEval-2019 Task 3 competition. [2] The goal is to classify the emotion of the user utterance as one of the following classes: Happy, Sad, Angry or Others. The eight models, developed in an incremental fashion and submitted in that exact order to the competition (finally reaching rank 29/165), are made available together with a corresponding description of each iteration via a public repository. [3] The test data, consisting of 5,509 items, was published by the organizers of the competition after its termination. This represents a non-adaptive scenario, where the developer does not get any direct feedback whilst submitting new models.\r\nFigure 5 illustrates three similar, but different test conditions, which are implemented in ease.ml/ci. The first two conditions check whether the new model is better than the old one by at least 2 percentage points in a non-adaptive manner. The developer will therefore not get any direct feedback, as was the case during the competition. While condition (I) rejects false positives, condition (II) accepts false negatives. The third condition mimics the scenario where the user would get feedback after every commit without any false negative. All three queries were optimized by ease.ml/ci using Pattern 2 and exploiting the fact that between any two submissions there is no more than 10% difference in prediction.\r\nSimply using Hoeffding's inequality does not lead to a practical solution: for \u01eb = 0.02 and \u03b4 = 0.002, in H = 7 non-adaptive steps, one would need\r\nn > r_v^2 (ln H - ln(\u03b4/2)) / (2\u01eb^2) = 44,268\r\nsamples. This number even grows to up to 58K in the fully adaptive case!\r\n[1] The empirical error was determined by taking different testsets (with the same sample size) and measuring the gap between the \u03b4 and 1 - \u03b4 quantiles over the observed testing accuracies.\r\n[2] Competition website: https://www.humanizing-ai.com/emocontext.html\r\n[3] Github repository: https://github.com/zhaopku/ds3-emoContext\r\nFigure 5 panels. Non-Adaptive I: n - o > 0.02 +/- 0.02; adaptivity: full; reliability: 0.998; mode: fp-free (# Samples = 4713). Non-Adaptive II: n - o > 0.02 +/- 0.02; adaptivity: full; reliability: 0.998; mode: fn-free (# Samples = 4713). Adaptive: n - o > 0.018 +/- 0.022; adaptivity: full; reliability: 0.998; mode: fp-free (# Samples = 5204). Commit history rows: Iteration 1, Iteration 2,\r\n
Iteration 3, Iteration 4, Iteration 5, Iteration 6, Iteration 7, Iteration 8.\r\nFigure 5. Continuous Integration Steps in ease.ml/ci.\r\nFigure 6. Evolution of Development and Test Accuracy.\r\nAll the queries can be supported rigorously with the 5.5K test samples provided after the competition. The first two conditions can be answered within two percentage point error tolerance and 0.998 reliability. The fully-adaptive query in the third scenario can only achieve a 2.2 percentage point error tolerance, as the number of labels needed would be more than 6K with the same error tolerance as in the first two queries.\r\nWe see that, in all three scenarios, ease.ml/ci returns pass/fail signals that make intuitive sense. If we look at the evolution of the development and test accuracy over the eight iterations (see Figure 6), the developer would ideally want ease.ml/ci to accept her last commit, whereas all three queries will have the second last model chosen to be active, which correlates with the test accuracy evolution.\r\n6 RELATED WORK\r\nContinuous integration is a popular concept in software engineering (Duvall et al., 2007). Nowadays, it is one of the best practices that most, if not all, industrial development efforts follow. The emerging requirement of a CI engine for ML has been discussed informally in multiple blog posts and forum discussions (Lara, 2017; Tran, 2017; Stojnic, 2018a; Lara, 2018; Stojnic, 2018b). However, none of these discussions produce any rigorous solutions to testing the quality of a machine learning model, which arguably is the most important aspect of a CI engine for ML. This paper is motivated by the success of CI in industry, and aims at building the first prototype system for rigorous integration of machine learning models.\r\nThe baseline implementation of ease.ml/ci builds on intensive previous work on generalization and adaptive analysis. The non-adaptive version of the system is based on simple concentration inequalities (Boucheron et al., 2013) and the fully adaptive version of the system is inspired by Ladder (Blum & Hardt, 2015). Compared to the latter, ease.ml/ci is less restrictive on the feedback and more expressive given the specification of the test conditions. This leads to a higher number of test samples needed in general. It is well-known that the O(1/\u01eb^2) sample complexity of Hoeffding's inequality becomes O(1/\u01eb) when the variance \u03c3^2 of the random variable is of the same order as \u01eb (Boucheron et al., 2013). In this paper, we develop techniques to adapt the same observation to a real-world scenario (Pattern 1). The technique of only labeling the difference between models is inspired by disagreement-based active learning (Hanneke et al., 2014), which illustrates the potential of taking advantage of the overlapping structure between models to decrease labeling complexity. In fact, the technique we develop implies that one can achieve O(1/\u01eb) label complexity when the overlapping ratio between two models is p = O(\u221a\u01eb).\r\nThe key difference between ease.ml/ci and a differential privacy approach (Dwork et al., 2014) for answering statistical queries lies in the optimization techniques we design. By knowing the structure of the queries we are able to considerably lower the number of samples needed.\r\nConceptually, this work is inspired by the seminal series of work by Langford and others (Langford, 2005; K\u00e4\u00e4ri\u00e4inen & Langford, 2005) that illustrates the possibility for generalization bounds to be practically tight. The goal of this work is to build a practical system to guide the user in employing complicated statistical inequalities and techniques to achieve practical label complexity.\r\n7 CONCLUSION\r\nWe have presented ease.ml/ci, a continuous integration system for machine learning. It provides a declarative scripting language that allows users to state a rich class of test conditions with rigorous probabilistic guarantees. We have also studied the novel practicality problem in terms of labeling effort that is specific to testing machine learning models. Our techniques can reduce the amount of required testing samples by up to two orders of magnitude. We have validated the soundness of our techniques, and showcased their applications in real-world scenarios.\r\nAcknowledgements\r\nWe thank Zhao Meng and Nora Hollenstein for sharing their models for the SemEval'19 competition. CZ and the DS3Lab gratefully acknowledge the support from Mercedes-Benz Research & Development North America, MeteoSwiss, Oracle Labs, Swiss Data Science Center, Swisscom, Zurich Insurance, Chinese Scholarship Council, and the Department of Computer Science at ETH Zurich.\r\nREFERENCES\r\nBlum, A. and Hardt, M. The ladder: A reliable leaderboard for machine learning competitions. In International Conference on Machine Learning, pp. 1006\u20131014, 2015.\r\nBottou, L. The infinite MNIST dataset. https://leon.bottou.org/projects/infimnist, February 2016.\r\nBoucheron, S., Lugosi, G., and Massart, P. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.\r\nDuvall, P. M., Matyas, S., and Glover, A. Continuous integration: improving software quality and reducing risk. Pearson Education, 2007.\r\nDwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Foundations and Trends\u00ae in Theoretical Computer Science, 9(3\u20134):211\u2013407, 2014.\r\nDwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636\u2013638, 2015.\r\nHanneke, S. et al. Theory of disagreement-based active learning. Foundations and Trends\u00ae in Machine Learning, 7(2-3):131\u2013309, 2014.\r\nJia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.\r\nK\u00e4\u00e4ri\u00e4inen, M. and Langford, J. A comparison of tight generalization error bounds. In Proceedings of the 22nd international conference on Machine learning, pp. 409\u2013416. ACM, 2005.\r\nLangford, J. Tutorial on practical prediction theory for classification. Journal of machine learning research, 6(Mar):273\u2013306, 2005.\r\nLara, A. F. Continuous integration for ml projects. https://medium.com/onfido-tech/continuous-integration-for-ml-projects-e11bc1a4d34f, October 2017.\r\nLara, A. F. Continuous delivery for ml models. https://medium.com/onfido-tech/continuous-delivery-for-ml-models-c1f9283aa971, July 2018.\r\nStojnic, R. Continuous integration for machine learning. https://medium.com/@rstojnic/continuous-integration-for-machine-learning-6893aa867002, April 2018a.\r\nStojnic, R. Continuous integration for machine learning. https://www.reddit.com/r/MachineLearning/comments/8bq5la/d_continuous_integration_for_machine_learning/, April 2018b.\r\nTran, D. Continuous integration for data science. http://engineering.pivotal.io/post/continuous-integration-for-data-science/, February 2017.\r\nVan Vliet, H., Van Vliet, H., and Van Vliet, J. Software engineering: principles and practice, volume 13. John Wiley & Sons, 2008.\r\nA SYNTAX AND SEMANTICS\r\nA.1 Syntax of a Condition\r\nTo specify the condition, which will be tested by ease.ml/ci whenever a new model is committed, the user makes use of the following grammar:\r\n        c   :- floating point constant\r\n        v   :- n | o | d\r\n        op1 :- + | -\r\n        op2 :- *\r\n        EXP :- v | v op1 EXP | EXP op2 c\r\n        cmp :- > | <\r\n        C   :- EXP cmp c +/- c\r\n        F   :- C | C /\\ F\r\nF is the final condition, which is a conjunction of a set of clauses C. Each clause is a comparison between an expression over {n, o, d} and a constant, with an error tolerance following the symbol +/-. For example, two expressions that we focus on optimizing can be specified as follows:\r\n        n - o > 0.02 +/- 0.01 /\\ d < 0.1 +/- 0.01\r\nin which the first clause\r\n        n - o > 0.02 +/- 0.01\r\nrequires that the new model have an accuracy that is two points higher than the old model, with an error tolerance of one point, whereas the clause\r\n        d < 0.1 +/- 0.01\r\nrequires that the new model can only change 10% of the old predictions, with an error tolerance of 1%.\r\nA.2 Semantics of Continuous Integration Tests\r\nUnlike traditional continuous integration, all three variables used in ease.ml/ci, i.e., {n, o, d}, are random variables. As a result, the evaluation of an ease.ml/ci condition is inherently probabilistic. There are two additional parameters that the user needs to provide, which define the semantics of the test condition: (1) \u03b4, the probability with which the test process is allowed to be incorrect, which is usually chosen to be smaller than 0.001 or 0.0001 (i.e., 0.999 or 0.9999 success rate); and (2) mode, chosen from {fp-free, fn-free}, which specifies whether the test is false-positive free or false-negative free. The semantics are that, with probability 1 - \u03b4, the output of ease.ml/ci is free of false positives or false negatives.\r\nThe notion of false positives or false negatives is related to the fundamental trade-off between the \u201ctype I\u201d error and the \u201ctype II\u201d error in statistical hypothesis testing. Consider\r\n        x < 0.1 +/- 0.01.\r\nSuppose that the real unknown value of x is x*. Given an estimator x\u02c6, which, with probability 1 - \u03b4, satisfies\r\n        x\u02c6 \u2208 [x* - 0.01, x* + 0.01],\r\nwhat should be the testing outcome of this condition? There are three cases:\r\n  1. When x\u02c6 > 0.11, the condition should return False because, given x* < 0.1, the probability of having x\u02c6 > 0.11 > x* + 0.01 is less than \u03b4.\r\n  2. When x\u02c6 < 0.09, the condition should return True because, given x* > 0.1, the probability of having x\u02c6 < 0.09 < x* - 0.01 is less than \u03b4.\r\n  3. When 0.09 < x\u02c6 < 0.11, the outcome cannot be determined: even if x\u02c6 > 0.1, there is no way to tell whether the real value x* is larger or smaller than 0.1. In this case, the condition evaluates to Unknown.\r\nThe parameter mode allows the system to deal with the case that the condition evaluates to Unknown. In the fp-free mode, ease.ml/ci treats Unknown as False (thus rejects the commit) to ensure that whenever the condition evaluates to True using x\u02c6, the same condition is always True for x*. Similarly, in the fn-free mode, ease.ml/ci treats Unknown as True (thus accepts the commit). The false positive rate (resp. false negative rate) in the fn-free (resp. fp-free) mode is specified by the error tolerance.\r\n", "award": [], "sourceid": 162, "authors": [{"given_name": "Cedric", "family_name": "Renggli", "institution": "ETH Zurich"}, {"given_name": "Bojan", "family_name": "Karla\u0161", "institution": "ETH Z\u00fcrich"}, {"given_name": "Bolin", "family_name": "Ding", "institution": "\"Data Analytics and Intelligence Lab, Alibaba Group\""}, {"given_name": "Feng", "family_name": "Liu", "institution": "Huawei Technologies"}, {"given_name": "Kevin", "family_name": "Schawinski", "institution": "Modulos AG"}, {"given_name": "Wentao", "family_name": "Wu", "institution": "Microsoft Research"}, {"given_name": "Ce", "family_name": "Zhang", "institution": "ETH"}]}