Continuous Integration of Machine Learning Models with ease.ml/ci

A + ε′: If so, returns False;

2. (Test) Test F as in the baseline implementation (with 1 − δ/2 probability), conditioned on d < A + 2ε′.

Our analysis relies on Bennett's inequality:

Proposition 1 (Bennett's inequality). Let X_1, ..., X_n be independent and square-integrable random variables such that, for some nonnegative constant b, |X_i| ≤ b almost surely for all i ≤ n. We have

  Pr[ | Σ_i (X_i − E[X_i]) / n | > ε ] ≤ 2 exp( −(v / b²) · h(nbε / v) ),

where v = Σ_i E[X_i²] and h(u) = (1 + u) ln(1 + u) − u for all positive u.

It is not hard to see why the above algorithm works — the first step only requires unlabeled data points and does not need human intervention. In the second step, conditioned on d < p, we know that E[(n_i − o_i)²] < p for each data point. Combined with |n_i − o_i| ≤ 1, applying Bennett's inequality (with b = 1 and v < np) we have

  Pr[ | (n − o)^ − (n − o) | > ε ] ≤ 2 exp( −n · p · h(ε / p) ),

where (n − o)^ denotes the empirical estimate of n − o. As a result, the second step needs a sample size (for the non-adaptive scenario) of

  n = (ln H − ln(δ/4)) / ( p · h(ε / p) ).

When p = 0.1, 1 − δ = 0.9999, and d < 0.1, we only need 29K samples for 32 non-adaptive steps and 67K samples for 32 fully-adaptive steps to reach an error tolerance of a single accuracy point — 10× fewer than the baseline (Figure 2).

4.1.2 Active Labeling

The previous example gives the user a way to conduct 32 fully-adaptive fine-tuning steps with only 67K samples. Assuming that the developer performs one commit per day, this means that we require 67K samples per month to support the continuous integration service.

One potential challenge for this strategy is that all 67K samples need to be labeled before the continuous integration service can start working. This is sometimes a strong assumption that many users find problematic. In the ideal case, we hope to interleave the development effort with the labeling effort, and amortize the labeling effort over time.

The second technique our system uses relies on the observation that, to estimate (n − o), only the data points on which the new and the old model make different predictions need to be labeled. When we know that the new model's predictions differ from the old model's on only 10% of the data points, we only need to label 10% of all data points. It is easy to see that, every time the developer commits a new model, we only need to provide

  n = ( −ln(δ/4) / ( p · h(ε / p) ) ) × p

labels. When p = 0.1 and 1 − δ = 0.9999, then n = 2188 for an error tolerance of a single accuracy point. If the developer commits one model per day, the labeling team only needs to label 2,188 samples the next day. Given a well-designed interface that enables a labeling throughput of 5 seconds per label, the labeling team only needs to commit 3 hours a day! For a team with multiple engineers, this overhead is often acceptable, considering the guarantee provided by the system down to a single accuracy point.

Notice that active labeling assumes a stationary underlying distribution. One way to enforce this in the system is to ask the user to provide a pool of unlabeled data points at the same time, and then only ask for labels when needed. In this way, we do not need to draw new samples over time.

4.2 Pattern 2: Implicit Variance Bound

In many cases, the user does not provide an explicit constraint on the difference between a new model and an old model. However, many machine learning models are not so different in their predictions. Take AlexNet, ResNet, GoogLeNet, AlexNet (Batch Normalized), and VGG for example: when applied to the ImageNet test set, these five models, developed by the ML community since 2012, produce only up to 25% different answers for top-1 correctness and 15% different answers for top-5 correctness! For a typical continuous-integration workload, it is therefore not unreasonable to expect many consecutive commits to differ less than these ImageNet winners, which are years of development apart.

Motivated by this observation, ease.ml/ci will automatically match test conditions of the following pattern:

  n - o > C +/- D.

When an unlabeled test set is cheap to get, the system will use one test set to estimate d up to ε = 2D. For a binary classification task, the system can use an unlabeled test set; for multi-class tasks, one can either test the difference of predictions on an unlabeled test set or the difference of correctness on a labeled test set. This gives us an upper bound on n − o. The system then tests n − o up to ε = D on another test set (different from the one used to test d). When this upper bound is small enough, the system will trigger a similar optimization as in Pattern 1. Note that the first test set will be 16× smaller than the one needed to test n − o directly up to ε = D — 4× due to the higher error tolerance, and 4× due to the fact that d has a 2× smaller range than n − o.

One caveat of this approach is that the system does not know how large the second test set needs to be before execution. The system uses a technique similar to active labeling, incrementally growing the labeled test set every time a new model is committed, if necessary. Specifically, we optimize for test conditions following the pattern

  n > A +/- B,

when A is large (e.g., 0.9 or 0.95). This can be done by first conducting a coarse estimation of the lower bound of n, and then a finer-grained estimation conditioned on this lower bound. Note that this can only introduce an improvement when the lower bound is large (e.g., 0.9).

4.3 Tight Numerical Bounds

Following (Langford, 2005), for a test condition consisting of n i.i.d. random variables drawn from a Bernoulli distribution, one can directly derive a tight bound on the number of samples required to reach (ε, δ) accuracy. The calculation requires the probability mass function of the Binomial distribution (the sum of i.i.d. Bernoulli variables). The tight bound is obtained by taking the minimum number of samples n needed, maximized over the unknown true mean p. This technique can also be extended to more complex queries, where the Binomial distribution has to be replaced by a multinomial distribution. As in the simple case, the exact analysis has no closed-form solution, and deriving efficient approximations is left as future work.

Figure 3. Comparison of Sample Size Estimators in the Baseline Implementation and the Optimized Implementation.

5 EXPERIMENTS

We focus on empirically validating the derived bounds and then show ease.ml/ci in action.

5.1 Sample Size Estimator

Figure 4. Impact of ε, δ, and p on the Label Complexity.

One key technique most of our optimizations rely on is that, by knowing an upper bound on the sample variance, we are able to achieve a tighter bound than simply applying the Hoeffding bound.
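The sample-size formulas above are easy to evaluate numerically. The following sketch is our own illustration, not part of the system: it plugs in δ = 10⁻⁴ (i.e., 1 − δ = 0.9999), ε = 0.01 (one accuracy point), p = 0.1, and H = 32 non-adaptive steps, and treats the fully-adaptive case as ln|H| = 32 ln 2 (an assumption on our part that matches the 67K figure quoted in the text).

```python
import math

def h(u):
    # Bennett's h(u) = (1 + u) ln(1 + u) - u, defined for u > 0
    return (1 + u) * math.log(1 + u) - u

def sample_size(ln_H, delta, eps, p):
    # n = (ln H - ln(delta / 4)) / (p * h(eps / p))
    return (ln_H - math.log(delta / 4)) / (p * h(eps / p))

delta, eps, p = 1e-4, 0.01, 0.1  # 1 - delta = 0.9999, one accuracy point, d < 0.1

n_nonadaptive = math.ceil(sample_size(math.log(32), delta, eps, p))
n_adaptive    = math.ceil(sample_size(32 * math.log(2), delta, eps, p))  # ln|H| = 32 ln 2 (assumption)
# Active labeling: per commit, only the ~p fraction of differing points needs labels
n_per_commit  = math.ceil(sample_size(0.0, delta, eps, p) * p)

print(n_nonadaptive, n_adaptive, n_per_commit)  # roughly 29K, 68K, and 2.2K labels
```

Up to rounding, this reproduces the 29K / 67K / 2,188 numbers stated above.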
This upper bound can either be obtained by using unlabeled data points to estimate the difference between the new and the old model, or by using labeled data points but conducting a coarse estimation first. We now validate our theoretical bound and its impact on improving the label complexity.

Figure 3 illustrates the estimated error and the empirical error under different assumed upper bounds p, for a model with accuracy around 98%. We run GoogLeNet (Jia et al., 2014) on the infinite MNIST dataset (Bottou, 2016) and estimate the true accuracy c. Assuming a non-adaptive scenario, we obtain a range of accuracies achieved by randomly taking n data points. We then estimate the interval ε for the given number of samples n and probability 1 − δ. We see that both the baseline implementation and ease.ml/ci dominate the empirical error, as expected, while ease.ml/ci uses significantly fewer samples.¹

¹ The empirical error was determined by taking different test sets (with the same sample size) and measuring the gap between the δ and 1 − δ quantiles over the observed testing accuracies.

Figure 4 illustrates the impact of this upper bound on the label complexity. We see that the improvement increases significantly when p is reasonably small — when p = 0.1, we can achieve almost a 10× improvement in label complexity. Active labeling further increases the improvement, as expected, by another 10×.

5.2 ease.ml/ci in Action

We showcase three different test conditions on a real-world incremental development of machine learning models submitted to the SemEval-2019 Task 3 competition.² The goal is to classify the emotion of a user utterance as one of the following classes: Happy, Sad, Angry, or Others. The eight models, developed in an incremental fashion and submitted in that exact order to the competition (finally reaching rank 29/165), are made available, together with a corresponding description of each iteration, via a public repository.³ The test data, consisting of 5,509 items, was published by the organizers of the competition after its termination. This represents a non-adaptive scenario, where the developer does not get any direct feedback whilst submitting new models.

² Competition website: https://www.humanizing-ai.com/emocontext.html
³ Github repository: https://github.com/zhaopku/ds3-emoContext

Figure 5 illustrates three similar, but different, test conditions implemented in ease.ml/ci. The first two conditions check whether the new model is better than the old one by at least 2 percentage points in a non-adaptive manner. The developer will therefore not get any direct feedback, as was the case during the competition. While query (I) rejects false positives, condition (II) accepts false negatives. The third condition mimics the scenario where the user gets feedback after every commit, without any false negatives. All three queries were optimized by ease.ml/ci using Pattern 2, exploiting the fact that between any two submissions there is no more than 10% difference in prediction.

Simply using Hoeffding's inequality does not lead to a practical solution — for ε = 0.02 and δ = 0.002, in H = 7 non-adaptive steps, one would need

  n > r² (ln H − ln(δ/2)) / (2ε²) = 44,268

samples, where r = 2 is the range of n − o. This number even grows up to 58K in the fully adaptive case!

Figure 5. Continuous Integration Steps in ease.ml/ci. The three test scripts, and the resulting sample sizes, are:

  Non-Adaptive I (# Samples = 4713):
    - n - o > 0.02 +/- 0.02
    - adaptivity: full
    - reliability: 0.998
    - mode: fp-free

  Non-Adaptive II (# Samples = 4713):
    - n - o > 0.02 +/- 0.02
    - adaptivity: full
    - reliability: 0.998
    - mode: fn-free

  Adaptive (# Samples = 5204):
    - n - o > 0.018 +/- 0.022
    - adaptivity: full
    - reliability: 0.998
    - mode: fp-free

(The figure further shows the outcome for each of the eight commits in the history.)

Figure 6. Evolution of Development and Test Accuracy.

All the queries can be supported rigorously with the 5.5K test samples provided after the competition. The first two conditions can be answered within a two-percentage-point error tolerance and 0.998 reliability. The fully-adaptive query in the third scenario can only achieve a 2.2-percentage-point error tolerance, as the number of labels needed would be more than 6K with the same error tolerance as in the first two queries.

We see that, in all three scenarios, ease.ml/ci returns pass/fail signals that make intuitive sense. If we look at the evolution of the development and test accuracy over the eight iterations (see Figure 6), the developer would ideally want ease.ml/ci to accept her last commit, whereas all three queries leave the second-to-last model active, which correlates with the test accuracy evolution.

6 RELATED WORK

Continuous integration is a popular concept in software engineering (Duvall et al., 2007). Nowadays, it is one of the best practices that most, if not all, industrial development efforts follow. The emerging requirement of a CI engine for ML has been discussed informally in multiple blog posts and forum discussions (Lara, 2017; Tran, 2017; Stojnic, 2018a; Lara, 2018; Stojnic, 2018b). However, none of these discussions produces a rigorous solution for testing the quality of a machine learning model, which arguably is the most important aspect of a CI engine for ML. This paper is motivated by the success of CI in industry, and aims at building the first prototype system for rigorous continuous integration of machine learning models.

The baseline implementation of ease.ml/ci builds on intensive previous work on generalization and adaptive analysis. The non-adaptive version of the system is based on simple concentration inequalities (Boucheron et al., 2013), and the fully adaptive version of the system is inspired by Ladder (Blum & Hardt, 2015). Compared to the latter, ease.ml/ci is less restrictive on the feedback and more expressive in the specification of the test conditions. This leads to a higher number of test samples needed in general. It is well known that the O(1/ε²) sample complexity of Hoeffding's inequality becomes O(1/ε) when the variance σ² of the random variable is of the same order as ε (Boucheron et al., 2013). In this paper, we develop techniques to adapt the same observation to a real-world scenario (Pattern 1). The technique of only labeling the difference between models is inspired by disagreement-based active learning (Hanneke et al., 2014), which illustrates the potential of taking advantage of the overlapping structure between models to decrease labeling complexity. In fact, the technique we develop implies that one can achieve O(1/ε) label complexity when the overlapping ratio between two models is p = O(√ε).

The key difference between ease.ml/ci and a differential-privacy approach (Dwork et al., 2014) to answering statistical queries lies in the optimization techniques we design. By knowing the structure of the queries, we are able to considerably lower the number of samples needed.

Conceptually, this work is inspired by the seminal series of work by Langford and others (Langford, 2005; Kääriäinen & Langford, 2005) that illustrates the possibility for generalization bounds to be practically tight. The goal of this work is to build a practical system that guides the user in employing complicated statistical inequalities and techniques to achieve practical label complexity.

7 CONCLUSION

We have presented ease.ml/ci, a continuous integration system for machine learning. It provides a declarative scripting language that allows users to state a rich class of test conditions with rigorous probabilistic guarantees. We have also studied the novel practicality problem in terms of labeling effort that is specific to testing machine learning models. Our techniques can reduce the amount of required testing samples by up to two orders of magnitude. We have validated the soundness of our techniques, and showcased their applications in real-world scenarios.
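The tight numerical bound of Section 4.3 can be made concrete for the simple Bernoulli case. The sketch below is our own illustration (not the system's implementation): with toy parameters ε = 0.05 and δ = 0.1 and a deliberately coarse grid over the unknown true mean p, it searches for the smallest n such that a Binomial(n, p) mean estimate falls within ε of p with probability at least 1 − δ for every p on the grid, and compares the result against the Hoeffding sample size.

```python
import math

def fail_prob(n, p, eps):
    # Pr[|X/n - p| > eps] for X ~ Binomial(n, p), computed from the exact pmf
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if abs(k / n - p) > eps)

def tight_n(eps, delta, p_grid, n_max):
    # Smallest n (coarse scan in steps of 10; a real system would refine)
    # whose worst-case failure probability over the grid of means is <= delta.
    for n in range(10, n_max + 1, 10):
        if max(fail_prob(n, p, eps) for p in p_grid) <= delta:
            return n
    return n_max

eps, delta = 0.05, 0.1
n_hoeffding = math.ceil(math.log(2 / delta) / (2 * eps**2))
n_exact = tight_n(eps, delta, p_grid=[0.1, 0.3, 0.5, 0.7, 0.9], n_max=n_hoeffding)
print(n_hoeffding, n_exact)  # the exact binomial bound needs noticeably fewer samples
```

As the section notes, there is no closed form here: the exact bound comes from numerically minimizing n against the worst-case mean, which is why the demo resorts to a grid search.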
We\r\n most important aspect of a CI engine for ML. This paper have also studied the novel practicality problem in terms of\r\n is motivated by the success of CI in industry, and aims for labeling effort that is speci\ufb01c to testing machine learning\r\n building the \ufb01rst prototype system for rigorous integration models. Our techniques can reduce the amount of required\r\n of machine learning models. testing samples by up to two orders of magnitude. We have\r\n The baseline implementation of ease.ml/ci builds on validated the soundness of our techniques, and showcased\r\n intensive previous work on generalization and adaptive anal- their applications in real-world scenarios.\r\n ysis. The non-adaptive version of the system is based on Acknowledgements\r\n simple concentration inequalities (Boucheron et al., 2013)\r\n and the fully adaptive version of the system is inspired by WethankZhaoMengandNoraHollensteinforsharingtheir mod-\r\n Ladder (Blum & Hardt, 2015). Comparing to the second, els for the SemEval\u201919 competition. CZ and the DS3Lab gratefully\r\n ease.ml/ciislessrestrictive on the feedback and more acknowledge the support from Mercedes-Benz Research & De-\r\n expressive given the speci\ufb01cation of the test conditions. velopment North America, MeteoSwiss, Oracle Labs, Swiss Data\r\n This leads to a higher number of test samples needed in Science Center, Swisscom, Zurich Insurance, Chinese Scholarship\r\n general. It is well-known that the O(1/\u01eb2) sample com- Council, and the Department of Computer Science at ETH Zurich.\r\n Continuous Integration of Machine Learning Models with ease.ml/ci\r\n REFERENCES Stojnic, R. Continuous integration for machine\r\n Blum, A. and Hardt, M. The ladder: A reliable leaderboard learning. https://www.reddit.com/r/\r\n for machine learning competitions. In International Con- MachineLearning/comments/8bq5la/\r\n ference on Machine Learning, pp. 1006\u20131014, 2015. 
d continuous integration for machine learning/,\r\n April 2018b.\r\n Bottou, L. The in\ufb01nite MNIST dataset. https: Tran, D. Continuous integration for data science.\r\n //leon.bottou.org/projects/infimnist, http://engineering.pivotal.io/post/\r\n February 2016. continuous-integration-for-data-\r\n Boucheron, S., Lugosi, G., and Massart, P. Concentration science/,February2017.\r\n inequalities: A nonasymptotic theory of independence. Van Vliet, H., Van Vliet, H., and Van Vliet, J. Software\r\n Oxford university press, 2013. engineering: principles and practice, volume 13. John\r\n Duvall, P. M., Matyas, S., and Glover, A. Continuous in- Wiley & Sons, 2008.\r\n tegration: improving software quality and reducing risk.\r\n Pearson Education, 2007.\r\n Dwork, C., Roth, A., et al. The algorithmic foundations\r\n R\r\n of differential privacy. Foundations and Trends\r in\r\n Theoretical Computer Science, 9(3\u20134):211\u2013407, 2014.\r\n Dwork,C.,Feldman,V.,Hardt,M.,Pitassi, T., Reingold, O.,\r\n and Roth, A. The reusable holdout: Preserving validity\r\n in adaptive data analysis. Science, 349(6248):636\u2013638,\r\n 2015.\r\n Hanneke, S. et al. Theory of disagreement-based active\r\n R\r\n learning. FoundationsandTrends\rinMachineLearning,\r\n 7(2-3):131\u2013309, 2014.\r\n Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J.,\r\n Girshick, R., Guadarrama, S., and Darrell, T. Caffe:\r\n Convolutional architecture for fast feature embedding.\r\n arXiv preprint arXiv:1408.5093, 2014.\r\n \u00a8\u00a8 \u00a8\r\n Kaariainen, M. and Langford, J. A comparison of tight\r\n generalization error bounds. In Proceedings of the 22nd\r\n international conference on Machine learning, pp. 409\u2013\r\n 416. ACM, 2005.\r\n Langford, J. Tutorial on practical prediction theory for\r\n classi\ufb01cation. Journal of machine learning research, 6\r\n (Mar):273\u2013306, 2005.\r\n Lara, A. F. Continuous integration for ml\r\n projects. 
https://medium.com/onfido-\r\n tech/continuous-integration-for-ml-\r\n projects-e11bc1a4d34f,October2017.\r\n Lara, A. F. Continuous delivery for ml models. https:\r\n //medium.com/onfido-tech/continuous-\r\n delivery-for-ml-models-c1f9283aa971,\r\n July 2018.\r\n Stojnic, R. Continuous integration for machine learn-\r\n ing. https://medium.com/@rstojnic/\r\n continuous-integration-for-machine-\r\n learning-6893aa867002,April2018a.\r\n Continuous Integration of Machine Learning Models with ease.ml/ci\r\n A SYNTAXANDSEMANTICS estimator x\u02c6, which, with probability 1 \u2212 \u03b4, satis\ufb01es\r\n A.1 Syntax of a Condition \u2217 \u2217\r\n To specify the condition, which will be tested by x\u02c6 \u2208 [x \u22120.01,x +0.01],\r\n ease.ml/ci whenever a new model is committed, the what should be the testing outcome of this condition? There\r\n user makes use of the following grammar: are three cases:\r\n c :- floating point constant 1. When x\u02c6 > 0.11, the condition should return False\r\n v :- n | o | d because, given x\u2217 < 0.1, the probability of having\r\n op1 :- + | -\r\n op2 :- \u2217\r\n * x\u02c6 > 0.11 > x +0.01 is less than \u03b4.\r\n EXP :- v | v op1 EXP | EXP op2 c\r\n 2. When x\u02c6 < 0.09, the condition should return True\r\n cmp :- > | < because, given x\u2217 > 0.1, the probability of having\r\n C :- EXP cmp c +/- c \u2217\r\n x\u02c6 < 0.09 < x \u22120.01 is less than \u03b4.\r\n F :- C | C /\\ F 3. When0.09 < x\u02c6 < 0.11, the outcome cannot be deter-\r\n Fis the \ufb01nal condition, which is a conjunction of a set of mined: Even if x\u02c6 > 0.1, there is no way to tell whether\r\n \u2217\r\n clauses C. Each clause is a comparison between an expres- the real value x is larger or smaller than 0.1. In this\r\n sion over {n,o,d} and a constant, with an error tolerance case, the condition evaluates to Unknown.\r\n following the symbol +/-. 
For example, two expressions\r\n that we focus on optimizing can be speci\ufb01ed as follows: Theparametermodeallowsthesystemtodealwiththecase\r\n n - o > 0.02 +/- 0.01 /\\ d < 0.1 +/- 0.01 that the condition evaluates to Unknown. In the fp-free\r\n in which the \ufb01rst clause mode,ease.ml/citreatsUnknownasFalse(thusre-\r\n jects the commit) to ensure that whenever the condition eval-\r\n n - o > 0.02 +/- 0.01 uates to True using x\u02c6, the same condition is always True\r\n for x\u2217. Similarly, in the fn-free mode, ease.ml/ci\r\n requires that the new model have an accuracy that is two treats Unknown as True (thus accepts the commit). The\r\n points higher than the old model, with an error tolerance of false positive rate (resp. false negative rate) in the fn-free\r\n one point, whereas the clause (resp. fp-free) mode is speci\ufb01ed by the error tolerance.\r\n d < 0.1 +/- 0.01\r\n requires that the new model can only change 10% of the old\r\n predictions, with an error tolerance of 1%.\r\n A.2 Semantics of Continuous Integration Tests\r\n Unlike traditional continuous integration, all three variables\r\n usedinease.ml/ci,i.e.,{n,o,d},arerandomvariables.\r\n As a result, the evaluation of an ease.ml/ci condition\r\n is inherently probabilistic. There are two additional param-\r\n eters that the user needs to provide, which would de\ufb01ne\r\n the semantics of the test condition: (1) \u03b4, the probability\r\n with which the test process is allowed to be incorrect, which\r\n is usually chosen to be smaller than 0.001 or 0.0001 (i.e.,\r\n 0.999 or 0.9999 success rate); and (2) mode chosen from\r\n {fp-free, fn-free},whichspeci\ufb01eswhetherthetest\r\n is false-positive free or false-negative free. 
The semantics\r\n are, with probability 1 \u2212 \u03b4, the output of ease.ml/ci is\r\n free of false positives or false negatives.\r\n Thenotion of false positives or false negatives is related to\r\n the fundamental trade-off between the \u201ctype I\u201d error and the\r\n \u201ctype II\u201d error in statistical hypothesis testing. Consider\r\n x < 0.1 +/- 0.01.\r\n Suppose that the real unknown value of x is x\u2217. Given an\r\n", "award": [], "sourceid": 162, "authors": [{"given_name": "Cedric", "family_name": "Renggli", "institution": "ETH Zurich"}, {"given_name": "Bojan", "family_name": "Karla\u0161", "institution": "ETH Z\u00fcrich"}, {"given_name": "Bolin", "family_name": "Ding", "institution": "\"Data Analytics and Intelligence Lab, Alibaba Group\""}, {"given_name": "Feng", "family_name": "Liu", "institution": "Huawei Technologies"}, {"given_name": "Kevin", "family_name": "Schawinski", "institution": "Modulos AG"}, {"given_name": "Wentao", "family_name": "Wu", "institution": "Microsoft Research"}, {"given_name": "Ce", "family_name": "Zhang", "institution": "ETH"}]}
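The three-valued semantics above can be captured in a few lines. The sketch below is a hypothetical helper of our own, not the actual ease.ml/ci implementation: it evaluates a single clause `EXP cmp c +/- c` given a point estimate whose error is at most the tolerance with probability 1 − δ, and then applies the fp-free / fn-free mode.

```python
def evaluate_clause(estimate, cmp, constant, tolerance):
    # Returns True / False / "Unknown" following the three cases of A.2.
    if cmp == "<":
        if estimate < constant - tolerance:
            return True
        if estimate > constant + tolerance:
            return False
    elif cmp == ">":
        if estimate > constant + tolerance:
            return True
        if estimate < constant - tolerance:
            return False
    return "Unknown"

def apply_mode(outcome, mode):
    # fp-free: treat Unknown as False (reject the commit);
    # fn-free: treat Unknown as True (accept the commit).
    if outcome == "Unknown":
        return mode == "fn-free"
    return outcome

# The example from A.2: x < 0.1 +/- 0.01
print(evaluate_clause(0.12, "<", 0.1, 0.01))   # False (case 1)
print(evaluate_clause(0.08, "<", 0.1, 0.01))   # True (case 2)
print(evaluate_clause(0.095, "<", 0.1, 0.01))  # Unknown (case 3)
```

Under this reading, fp-free maps the Unknown outcome above to a rejection, which is exactly how the mode parameter resolves the indeterminate band [0.09, 0.11].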