{"title": "Model Assertions for Monitoring and Improving ML Models", "book": "Proceedings of Machine Learning and Systems", "page_first": 481, "page_last": 496, "abstract": "Machine learning models are increasingly deployed in mission-critical settings\nsuch as vehicles, but unfortunately, these models can fail in complex ways.  To\nprevent errors, ML engineering teams monitor and continuously improve these\nmodels.  We propose a new abstraction, model assertions, that adapts the\nclassical use of program assertions as a way to monitor and improve ML models.\nModel assertions are arbitrary functions over the model's input and output that\nindicate when errors may be occurring.  For example, a developer may write an\nassertion that an object's class should stay the same across frames of video.\nOnce written, these assertions can be used both for runtime monitoring and for\nimproving a model at training time.  In particular, we show that at runtime,\nmodel assertions can find high confidence errors, where a model returns\nthe wrong output with high confidence, which uncertainty-based monitoring\ntechniques would not detect.  We also propose two methods to use model\nassertions at training time.  First, we propose a bandit-based active learning\nalgorithm that can sample from data flagged by assertions and show that it can\nreduce labeling costs by up to 40% over traditional uncertainty-based methods.\nSecond, we propose an API for generating \"consistency assertions\" (e.g., the\nclass change example) and weak labels for inputs where the consistency\nassertions fail, and show that these weak labels can improve relative model\nquality by up to 46%.  
We evaluate both algorithms on four real-world tasks\nwith video, LIDAR, and ECG data.", "full_text": "MODEL ASSERTIONS FOR MONITORING AND IMPROVING ML MODELS

Daniel Kang*1, Deepti Raghavan*1, Peter Bailis1, Matei Zaharia1

ABSTRACT

ML models are increasingly deployed in settings with real world interactions such as vehicles, but unfortunately, these models can fail in systematic ways. To prevent errors, ML engineering teams monitor and continuously improve these models. We propose a new abstraction, model assertions, that adapts the classical use of program assertions as a way to monitor and improve ML models. Model assertions are arbitrary functions over a model's input and output that indicate when errors may be occurring, e.g., a function that triggers if an object rapidly changes its class in a video. We propose methods of using model assertions at all stages of ML system deployment, including runtime monitoring, validating labels, and continuously improving ML models. For runtime monitoring, we show that model assertions can find high confidence errors, where a model returns the wrong output with high confidence, which uncertainty-based monitoring techniques would not detect. For training, we propose two methods of using model assertions. 
First, we propose a bandit-based active learning algorithm that can sample from data flagged by assertions and show that it can reduce labeling costs by up to 40% over traditional uncertainty-based methods. Second, we propose an API for generating “consistency assertions” (e.g., the class change example) and weak labels for inputs where the consistency assertions fail, and show that these weak labels can improve relative model quality by up to 46%. We evaluate model assertions on four real-world tasks with video, LIDAR, and ECG data.

1 INTRODUCTION

ML is increasingly deployed in complex contexts that require inference about the physical world, from autonomous vehicles (AVs) to precision medicine. However, ML models can misbehave in unexpected ways. For example, AVs have accelerated toward highway lane dividers (Lee, 2018) and can rapidly change their classification of objects over time, causing erratic behavior (Coldewey, 2018; NTSB, 2019). As a result, quality assurance (QA) of models, including continuous monitoring and improvement, is of paramount concern.

Unfortunately, performing QA for complex, real-world ML applications is challenging: ML models fail for diverse reasons unknown before deployment. Thus, existing solutions that focus on verifying training, including formal verification (Katz et al., 2017), whitebox testing (Pei et al., 2017), monitoring training metrics (Renggli et al., 2019), and validating training code (Odena & Goodfellow, 2018), only give guarantees on a test set and perturbations thereof, so models can still fail on the huge volumes of deployment data that are not part of the test set (e.g., billions of images per day in an AV fleet). Validating input schemas (Polyzotis et al., 2019; Baylor et al., 2017) does not work for applications with unstructured inputs that lack meaningful schemas, e.g., images. Solutions that check whether model performance remains consistent over time (Baylor et al., 2017) only apply to deployments that have ground truth labels, e.g., click-through rate prediction, but not to deployments that lack labels.

As a step towards more robust QA for complex ML applications, we have found that ML developers can often specify systematic errors made by ML models: certain classes of errors are repetitive and can be checked automatically, via code. For example, in developing a video analytics engine, we noticed that object detection models can identify boxes of cars that flicker rapidly in and out of the video (Figure 1), indicating some of the detections are likely wrong. Likewise, our contacts at an AV company reported that LIDAR and camera models sometimes disagree. While seemingly simple, similar errors were involved with a fatal AV crash (NTSB, 2019). These systematic errors can arise for diverse reasons, including domain shift between training and deployment data (e.g., still images vs. video), incomplete training data (e.g., no instances of snow-covered cars), and noisy inputs.

To leverage the systematic nature of these errors, we propose model assertions, an abstraction to monitor and improve ML model quality. Model assertions are inspired by program assertions (Goldstine et al., 1947; Turing, 1949), one of the most common ways to monitor software. A model assertion is an arbitrary function over a model's input and output that returns a Boolean (0 or 1) or continuous (floating point) severity score to indicate when faults may be occurring.

*Equal contribution. 1Stanford University. Correspondence to: Daniel Kang <ddkang@stanford.edu>. Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019. Copyright 2019 by the author(s).
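To make the interface concrete, the definition above (a function of a model's inputs and outputs that returns a Boolean or a continuous severity score) can be illustrated with a short Python sketch. This is our own illustrative sketch, not OMG's actual API; the class-change check and the function names are assumptions for illustration:

```python
# Sketch of a model assertion as an arbitrary function over a model's
# recent outputs. It returns a continuous severity score, where 0 means
# no suspected error (an abstention, by the convention described here).
# The "class change" check below is illustrative, not OMG's API.

def class_change_severity(recent_classes):
    """Count how often a tracked object's predicted class changes
    across consecutive frames; 0 means no suspected error."""
    changes = 0
    for prev, curr in zip(recent_classes, recent_classes[1:]):
        if prev != curr:
            changes += 1
    return float(changes)

def class_change_boolean(recent_classes):
    """Boolean form of the same assertion: 1 if any change occurred."""
    return 1.0 if class_change_severity(recent_classes) > 0 else 0.0
```

Returning a count rather than a Boolean lets downstream algorithms rank flagged data by severity, which only requires the relative ordering of scores.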
We\r\n                                                                                    propose an API for writing consistency assertions about\r\n                                                                                    how attributes of a model\u2019s output should relate that\r\n                                                                                    can also provide weak labels for training.         Consistency\r\n                 (d) Frame1,SSD (e)Frame2,assertion         (f) Frame3,SSD          assertions specify that data should be consistent between\r\n                                     corrected                                      attributes and identi\ufb01ers, e.g., a TV news host (identi\ufb01er)\r\n               Figure1.Top row: example of \ufb02ickering in three consecutive           should have consistent gender (attribute), or that certain\r\n                frames of a video. The object detection method, SSD (Liu et al.,    predictions should (or should not) exist in temporally related\r\n                2016),failedtoidentifythecarinthesecondframe. Bottomrow:            outputs, e.g., cars in adjacent video frames (Figure 1). We\r\n                exampleofcorrectingtheoutputofamodel. Thecarboundingbox             demonstratethatthis APIcanapplytoarangeofdomains,\r\n                in the secondframecanbeinferredusingnearbyframesbasedon             including medical classi\ufb01cation and TV news analytics.\r\n                aconsistencyassertion.                                              These weak labels can be used to improve relative model\r\n                                                                                    quality by up to 46%withnoadditionalhumanlabeling.\r\n                scoretoindicatewhenfaultsmaybeoccurring. 
Forexample,                WeimplementmodelassertionsinaPythonlibrary,OMG1,\r\n                amodelassertionthatcheckswhetheranobject\ufb02ickersinand                that can be used with existing MLframeworks. Weevaluate\r\n                outofvideocouldreturnaBooleanvalueovereachframeor                   assertions on four MLapplications: understandingTVnews,\r\n                thenumberofobjectsthat\ufb02icker. Whileassertionsmaynot                 AVs,videoanalytics,andclassifyingmedicalreadings. We\r\n                offer a complete speci\ufb01cation of correctness, we have found         implementassertionsforsystematicerrorsreportedbyML\r\n                that assertions are easy to specify in many domains (\u00a72).           usersinthesedomains,includingcheckingforconsistency\r\n                Weexplore several ways to use model assertions, both at             betweensensors,domainknowledgeaboutobjectlocationsin\r\n                runtimeandtrainingtime.                                             videos,andmedicalknowledgeaboutheartpatterns. Across\r\n                                                                                    thesedomains,we\ufb01ndthatmodelassertionsweconsidercan\r\n                First, weshowthatmodelassertionscanbeusedforruntime                 bewrittenwithatmost60linesofcodeandwith88-100%\r\n                monitoring: theycanbeusedtologunexpectedbehavioror                  precision, that these assertions often \ufb01nd high-con\ufb01dence\r\n                automaticallytriggercorrectiveactions,e.g.,shuttingdown             errors (e.g., top 90th percentile by con\ufb01dence), and that our\r\n                anautopilot. 
Furthermore, modelassertions can often \ufb01nd             newalgorithmsforactivelearningandweaksupervisionvia\r\n                highcon\ufb01denceerrors,wherethemodelhashighcertainty                   assertions improvemodelqualityoverexistingmethods.\r\n                in an erroneousoutput;theseerrorsareproblematicbecause              Insummary,wemakethefollowingcontributions:\r\n                prior uncertainty-based monitoring would not \ufb02ag these\r\n                errors. Additionally,andperhapssurprisingly,wehavefound             1. We introduce the abstraction of model assertions for\r\n                that many groups are also interested in validating human-              monitoringandcontinuouslyimprovingMLmodels.\r\n                generatedlabels,whichcanbedoneusingmodelassertions.                 2. Weshowthatmodelassertionscan\ufb01ndhighcon\ufb01dence\r\n                Second, we show that assertions can be used for active                 errors, which wouldnotbe\ufb02aggedbyuncertaintymetrics.\r\n                learning,inwhichdataiscontinuouslycollectedtoimprove                3. We propose a bandit algorithm to select data points for\r\n                MLmodels. Traditional active learning algorithms select                active learning via modelassertionsandshowthatitcan\r\n                data to label based on uncertainty, with the intuition that            reducelabelingcostsbyupto40%.\r\n               \u201charder\u201d data where the model is uncertain will be more              4. We propose an API for consistency assertions that can\r\n                informative (Settles, 2009; Coleman et al., 2020). Model               automatically generate weak labels for data where the\r\n                assertions provideanothernaturalwayto\ufb01nd\u201chard\u201dexam-                    assertionfails, and showthatweaksupervisionviathese\r\n                ples. 
However, using assertions in active learning presents a challenge: how should the active learning algorithm select between data when several assertions are used? A data point can be flagged by multiple assertions or a single assertion can flag multiple data points, in contrast to a single uncertainty metric. To address this challenge, we present a novel bandit-based active learning algorithm (BAL). Given a set of data that have been flagged by potentially multiple model assertions, our bandit algorithm uses the assertions' severity scores as context (i.e., features) and maximizes the marginal reduction in the number of assertions fired (§3). We show that our bandit algorithm can reduce labeling costs by up to 40% over traditional uncertainty-based methods.

2 MODEL ASSERTIONS

We describe the model assertion interface, examples of model assertions, how model assertions can integrate into the ML development/deployment cycle, and its implementation in OMG.

1OMG is a recursive acronym for OMG Model Guardian.

2.1 Model Assertions Interface

We formalize the model assertions interface. Model assertions are arbitrary functions that can indicate when an error is likely to have occurred. They take as input a list of inputs and outputs from one or more ML models. They return a severity score, a continuous value that indicates the severity of an error of a specific type. By convention, the 0 value represents an abstention. Boolean values can be implemented in model assertions by only returning 0 and 1. The severity score does not need to be calibrated, as our algorithms only use the relative ordering of scores.

As a concrete example, consider an AV with a LIDAR sensor and camera and object detection models for each sensor. To check that these models agree, a developer may write:

def sensor_agreement(lidar_boxes, camera_boxes):
    failures = 0
    for lidar_box in lidar_boxes:
        if no_overlap(lidar_box, camera_boxes):
            failures += 1
    return failures

Notably, our library OMG can register arbitrary Python functions as model assertions.

2.2 Example Use Cases and Assertions

In this section, we provide use cases for model assertions that arose in discussions with industry and academic contacts, including AV companies and academic labs. We show examples of errors caught by the model assertions described in this section in Appendix A and describe how one might look for assertions in other domains in Appendix B.

Our discussions revealed two key properties in real-world ML systems. First, ML models are deployed on orders of magnitude more data than can reasonably be labeled, so a labeled sample cannot capture all deployment conditions. For example, the fleet of Tesla vehicles will see over 100× more images in a day than in the largest existing image dataset (Sun et al., 2017). Second, complex ML deployments are developed by large teams, of which some developers may not have the ability to manage all parts of the application. As a result, it is critical to be able to do QA collaboratively to cover the application end-to-end.

Analyzing TV news. We spoke to a research lab studying bias in media via automatic analysis. This lab collected over 10 years of TV news (billions of frames) and executed face detection every three seconds. These detections are subsequently used to identify the faces, detect gender, and classify hair color using ML models. Currently, the researchers have no method of identifying errors and manually inspect data. However, they additionally compute scene cuts. Given that most TV news hosts do not move much between scenes, we can assert that the identity, gender, and hair color of faces that highly overlap within the same scene are consistent (Figure 6, Appendix). We further describe how model assertions can be implemented via our consistency API for TV news in §4.

Autonomous vehicles (AVs). AVs are required to execute a variety of tasks, including detecting objects and tracking lane markings. These tasks are accomplished with ML models from different sensors, such as visual, LIDAR, or ultrasound sensors (Davies, 2018). For example, a vision model might be used to detect objects in video and a point cloud model might be used to do 3D object detection. Our contacts at an AV company noticed that models from video and point clouds can disagree. We implemented a model assertion that projects the 3D boxes onto the 2D camera plane to check for consistency. If the assertion triggers, then at least one of the sensors returned an incorrect answer.

Video analytics. Many modern, academic video analytics systems use an object detection method (Kang et al., 2017; 2019; Hsieh et al., 2018; Jiang et al., 2018; Xu et al., 2019; Canel et al., 2019) trained on MS-COCO (Lin et al., 2014), a corpus of still images. These still image object detection methods are deployed on video for detecting objects. None of these systems aim to detect errors, even though errors can affect analytics results. In developing such systems, we noticed that objects flicker in and out of the video (Figure 1) and that vehicles overlap in unrealistic ways (Figure 7, Appendix). We implemented assertions to detect these.

Medical classification. Deep learning researchers have created deep networks that can outperform cardiologists for classifying atrial fibrillation (AF, a form of heart condition) from single-lead ECG data (Rajpurkar et al., 2019). Our researcher contacts mentioned that AF predictions from DNNs can rapidly oscillate. The European Society of Cardiology guidelines for detecting AF require at least 30 seconds of signal before calling a detection (EHRA, 2010). Thus, predictions should not rapidly switch between two states. A developer could specify this model assertion, which could be implemented to monitor ECG classification deployments.

2.3 Using Model Assertions for QA

We describe how model assertions can be integrated with ML development and deployment pipelines. Importantly, model assertions are complementary to a range of other ML QA techniques, including verification, fuzzing, and statistical techniques, as shown in Figure 2.

First, model assertions can be used for monitoring and validating all parts of the ML development/deployment pipeline. Namely, model assertions are agnostic to the source of the output, whether they be ML models or human labelers. Perhaps surprisingly, we have found several groups to also be interested in monitoring human label quality.
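Because assertions are agnostic to whether outputs come from a model or a human labeler, the same machinery can check label quality. The sketch below is our own illustration (the helper names and the near-duplicate-box check are assumptions, not OMG's API); it flags human-labeled bounding boxes that nearly coincide, a plausible labeling mistake:

```python
# Sketch: model assertions are agnostic to the source of the output, so
# the same style of check can validate human labels instead of model
# predictions. Here, a hypothetical assertion counts pairs of labeled
# boxes that overlap almost entirely (likely duplicate annotations).

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def duplicate_label_severity(boxes, threshold=0.9):
    """Count pairs of human-labeled boxes that nearly coincide."""
    flagged = 0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) > threshold:
                flagged += 1
    return float(flagged)
```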
Thus, concretely, model assertions can be used to validate human labels (data collection) or historical data (validation), and to monitor deployments (e.g., to populate dashboards).

Second, model assertions can be used at training time to select which data points to label in active learning. We describe BAL, our algorithm for data selection, in §3.

Third, model assertions can be used to generate weak labels to further train ML models without additional human labels. We describe how OMG accomplishes this via consistency assertions in §4. Users can also register their own weak supervision rules.

Figure 2. A system diagram of how model assertions can integrate into the ML development/deployment pipeline. Users can collaboratively add to an assertion database. We also show how related work can be integrated into the pipeline. Notably, verification only gives guarantees on a test set and perturbations thereof, but not on arbitrary runtime data. [Diagram: pipeline stages — data collection and labeling; model development, training; statistical validation; deployment and monitoring — connected to an assertion database that ML developers add to, assertion-based data collection for active learning/weak supervision (OMG), and the model assertion runtime library (OMG); related work is placed at its stage: fuzzing (TensorFuzz), verification/robust ML (DeepXplore), and held-out set validation.]

2.4 Implementing Model Assertions in OMG

We implement a prototype library for model assertions, OMG, that works with existing Python ML training and deployment frameworks. We briefly describe OMG's implementation.

OMG logs user-defined assertions as callbacks. The simplest way to add an assertion is through AddAssertion(func), where func is a function of the inputs and outputs (see below). OMG also provides an API to add consistency assertions as described in §4. Given this database, OMG requires a callback after model execution that takes the model's input and output as input. Given the model's input and output, OMG will execute the assertions and record any errors. We assume the assertion signature is similar to the following; this assertion signature is for the example in Figure 1:

def flickering(recent_frames: List[PixelBuf],
               recent_outputs: List[BoundingBox]) -> Float

For active learning, OMG will take a batch of data and return indices for which data points to label. For weak supervision, OMG will take data and return weak labels where valid. Users can specify weak labeling functions associated with assertions to help with this.

In the following two sections, we describe two key methods that OMG uses to improve model quality: BAL for active learning and consistency assertions for weak supervision.

3 USING MODEL ASSERTIONS FOR ACTIVE LEARNING WITH BAL

We introduce an algorithm called BAL to select data for active learning via model assertions. BAL assumes that a set of data points has been collected and a subset will be labeled in bulk. We found that labeling services (sca, 2019) and our industrial contacts usually label data in bulk.

Given a set of data points that triggered model assertions, OMG must select which points to label. There are two key challenges which make data selection intractable in its full generality. First, we do not know the marginal utility of selecting a data point to label without labeling the data point. Second, even with labels, estimating the marginal gain of data points is expensive to compute as training modern ML models is expensive.

To address these issues, we make simplifying assumptions. We describe the statistical model we assume, the resource-unconstrained algorithm, our simplifying assumptions, and BAL. We note that, while the resource-unconstrained algorithm can produce statistical guarantees, BAL does not. We instead empirically verify its performance in Section 5.

Data selection as multi-armed bandits. We cast the data selection problem as a multi-armed bandit (MAB) problem (Auer et al., 2002; Berry & Fristedt, 1985). In MABs, a set of “arms” (i.e., individual data points) is provided and the user must select a set of arms (i.e., points to label) to achieve the maximal expected utility (e.g., maximize validation accuracy, minimize number of assertions that fire). MABs have been studied in a wide variety of settings (Radlinski et al., 2008; Lu et al., 2010; Bubeck et al., 2009), but we assume that the arms have context associated with them (i.e., severity scores from model assertions) and give submodular rewards (defined below). The rewards are possibly time-varying. We further assume there is an (unknown) smoothness parameter that determines the similarity between arms of similar contexts (formally, the α in the Hölder condition (Evans, 1998)). The following presentation is inspired by Chen et al. (2018).

Concretely, we assume the data will be labeled in T rounds and denote the rounds t = 1, ..., T. We refer to the set of n data points as N = {1, ..., n}. Each data point has a d-dimensional feature vector associated with it, where d is the number of model assertions. We refer to the feature vector as x_i^t, where i is the data point index and t is the round index; from here, we will refer to the data points as x_i^t. Each entry in a feature vector is the severity score from a model assertion. The feature vectors can change over time as the model predictions, and therefore assertions, change over the course of training.

The resource-unconstrained algorithm:

Input: T, B_t, N, R
Output: choice of arms S_t at rounds 1, ..., T
for t = 1, ..., T do
    if under-explored arms then
        Select arms S_t from under-explored contexts at random
    else
        Select arms S_t by highest marginal gain (Eq. 1):
        for i = 1, ..., B_t do
            S_t^i = argmax_{j ∈ N \ S_t^{i-1}} ΔR({j}, S_t^{i-1})
        end
    end
end

BAL:

Input: T, B_t, N, R
Output: choice of arms S_t at rounds 1, ..., T
for t = 1, ..., T do
    if t = 0 then
        Select data points uniformly at random from the d model assertions
    else
        Compute the marginal reduction r_m of the number of times model assertion m = 1, ..., d triggered from the previous round;
        if all r_m < 1% then
            Fall back to baseline method;
                                             continue;\r\n                   end                                                                         end\r\n                 Algorithm 1: A summary of the CC-MAB algorithm.                               fori=1,...,Bt do\r\n                 CC-MAB \ufb01rst explores under-explored arms, then                                    Selectmodelassertionmproportionaltor ;\r\n                                                                                                                                          m\r\n                 greedily selects arms with highest marginal gain. Full                            Selectxi thattriggersm,\r\n                 details are given in (Chen et al., 2018).                                          sampleproportionaltoseverityscorerank;\r\n                                                                                                   Addx toSt;\r\n                                                                                                         i\r\n                                                                                               end\r\n               Weassumethereisabudgetonthenumberofarms(i.e.,data                          end\r\n               pointstolabel),Bt,ateveryround. Theusermustselectaset                  end\r\n               ofarmsSt={x ,...,x          }suchthat|St|\u2264Bt. Weassume               Algorithm 2: BAL algorithm for data selection for\r\n                                s      s t\r\n                                 1      B                                           continuous training. BAL samples from the assertions\r\n               that the reward from the arms, R(St), is submodular in St.           
at random in the \ufb01rst round, then selects the assertions\r\n               Intuitively, submodularity implies diminishing marginal              that result in highest marginal reduction in the number\r\n               returns: adding the 100th data point will not improve the            of assertions that \ufb01re in subsequent rounds. BAL will\r\n               rewardasmuchasaddingthe10thdatapoint. Formally,we                    default to random sampling or uncertainty sampling if\r\n               \ufb01rst de\ufb01nethemarginalgainofaddinganextraarm:                         noneoftheassertionsreduce.\r\n                           \u2206R({m},A)=R(A\u222a{m})\u2212R(A).                       (1)\r\n               whereA\u2282Nisasubsetofarmsandm\u2208N isanadditional                       Resource-constrained algorithm.           We make simplify-\r\n               armsuchthat m6\u2208A. Thesubmodularity condition states                ing assumptions and use these to modify CC-MABforthe\r\n               that, for any A\u2282C\u2282N andm6\u2208C                                        resource-constrainedsetting. Oursimplifyingassumptions\r\n                                                                                  arethat1)datapointswithsimilarcontexts(i.e.,xt)areinter-\r\n                               \u2206R({m},A)\u2265\u2206R({m},C).                       (2)                                                         i\r\n                                                                                  changeable,2)datapointswithhigherseverityscoreshave\r\n                                                                                  higherexpectedmarginalgain,and3)reducingthenumber\r\n               Resource-unconstrainedalgorithm. 
Assuminganin\ufb01nite                 oftriggeredassertionswillincreaseaccuracy.\r\n               labelingandcomputationalbudget,wedescribeanalgorithm               Undertheseassumptions,wedonotrequireanestimateofthe\r\n               that selects data points to train on. Unfortunately, this algo-    marginalrewardforeacharm. Instead,wecanapproximate\r\n               rithmisnotfeasibleasitrequireslabelsforeverypointand               themarginalgainfromselectingarmswithsimilarcontexts\r\n               training the MLmodelmanytimes.                                     bythetotal number of these arms that were selected. This\r\n               If weassumethatrewardsforindividualarmscanbequeried,               hastwobene\ufb01ts. First,wecantrainamodelonasetofarms\r\n               thenarecentbanditalgorithm,CC-MAB(Chenetal.,2018)                  (i.e., data points) in batches instead of adding single arms at\r\n                                                2\u03b1d\r\n               can achieve a regret of O(cT 3\u03b1d log(T)) for \u03b1 to be the           atime. Second,wecanselectdatapointsofsimilarcontexts\r\n               smoothnessparameter. Aregret bound is the (asymptotic)             at random,withouthavingtocomputeitsmarginalgain.\r\n               difference with respect to an oracle algorithm.        Brie\ufb02y,     Leveragingtheseassumptions,wecansimplifyAlgorithm1\r\n               CC-MABexploresunder-exploredarmsuntilitiscon\ufb01dent                  to require less computation for training models and to not\r\n               that certain arms have highest reward. Then, it greedily takes     requirelabelsforalldatapoints. Ouralgorithmisdescribed\r\n               thehighestrewardarms. Fulldetailsaregivenin(Chenetal.,             in Algorithm2. Brie\ufb02y,weapproximatethemarginalgain\r\n               2018)andsummarizedinAlgorithm1.                                    
of selecting batches of arms and select arms proportional\r\n               Unfortunately, CC-MABrequiresaccesstoanestimateof                  to the marginal gain. We additionally allocate 25% of the\r\n               selecting a single arm. Estimating the gain of a single arm        budgetineachroundtorandomlysamplearmsthattriggered\r\n               requiresalabelandrequiresretrainingandreevaluatingthe              different model assertions, uniformly; this is inspired by\r\n               model, which is computationally infeasible for expensive-          \u01eb-greedy algorithms (Tokic & Palm, 2011). This ensures\r\n               to-train MLmodels,especiallymoderndeepnetworks.                    that no contexts (i.e., model assertions) are underexplored as\r\n                                                 ModelAssertionsforMonitoringandImprovingMLModels\r\n               training progresses. Finally, in some cases (e.g., with noisy     outputsy    for eachinput. Forexample,eachoutputcould\r\n                                                                                          i,j\r\n               assertions), it may not be possible to reduce the number of       be an object detected in a video frame. The user provides\r\n               assertions that \ufb01re. 
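To make the selection step concrete, here is a minimal Python sketch of one round of Algorithm 2, including the fallback for this noisy-assertion case. The function names, data structures, and the rank-weighting scheme are our own illustrative choices, not OMG's actual API:

```python
import random

EPSILON = 0.25       # fraction of the budget spent on uniform exploration
MIN_REDUCTION = 0.01 # fall back to the baseline if no assertion improves by 1%

def bal_select_round(budget, triggered, reductions, baseline_sampler):
    """One round of BAL-style data selection (illustrative sketch).

    triggered:  dict mapping assertion id -> data points that triggered it,
                sorted by descending severity score (assumed non-empty).
    reductions: dict mapping assertion id -> marginal reduction in trigger
                counts since the previous round.
    """
    # Fall back to random/uncertainty sampling if no assertion is improving.
    if all(r < MIN_REDUCTION for r in reductions.values()):
        return baseline_sampler(budget)

    selected = []
    explore_budget = int(EPSILON * budget)

    # Exploration: sample uniformly across assertions (epsilon-greedy style),
    # so no context is underexplored as training progresses.
    for _ in range(explore_budget):
        m = random.choice(list(triggered))
        selected.append(random.choice(triggered[m]))

    # Exploitation: pick assertions proportional to their marginal reduction,
    # then sample points weighted toward high severity-score ranks.
    assertions = list(triggered)
    weights = [max(reductions[m], 0.0) for m in assertions]
    for _ in range(budget - explore_budget):
        m = random.choices(assertions, weights=weights)[0]
        points = triggered[m]
        ranks = range(len(points), 0, -1)  # higher weight for higher severity
        selected.append(random.choices(points, weights=list(ranks))[0])
    return selected
```

Here, baseline_sampler stands in for the user-specified random or uncertainty sampler that BAL defaults to.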
In this case, BAL will default to random sampling or uncertainty sampling, as specified by the user.

4 CONSISTENCY ASSERTIONS AND WEAK SUPERVISION

Although developers can write arbitrary Python functions as model assertions in OMG, we found that many assertions can be specified using an even simpler, high-level abstraction that we call consistency assertions. This interface allows OMG to generate multiple Boolean model assertions from a high-level description of the model's output, as well as automatic correction rules that propose new labels for data that fail the assertion, to enable weak supervision.

The key idea of consistency assertions is to specify which attributes of a model's output are expected to match across many invocations of the model. For example, consider a TV news application that tries to locate faces in TV footage and then identify their name and gender (one of the real-world applications we discussed in §2.2). The ML developer may wish to assert that, within each video, each person should consistently be assigned the same gender, and should appear on the screen at similar positions on most nearby frames. Consistency assertions let developers specify such requirements by providing two functions:

• An identification function that returns an identifier for each model output. For example, in our TV application, this could be the person's name as identified by the model.
• An attributes function that returns a list of named attributes expected to be consistent for each identifier. In our example, this could return the gender attribute.

Given these two functions, OMG generates multiple Boolean assertions that check whether the various attributes of outputs with a common identifier match. In addition, it generates correction rules that can replace an inconsistent attribute with a guess at that attribute's value based on other instances of the identifier (we simply use the most common value). By running the model and these generated assertions over unlabeled data, OMG can thus automatically generate weak labels for data points that do not satisfy the consistency assertions. Notably, OMG provides another way of producing labels for training that is complementary to human-generated labels and other sources of weak labels. OMG is especially suited to unstructured sources, e.g., video. We show in §5 that these weak labels can automatically increase model quality.

4.1 API Details

The consistency assertions API supports ML applications that run over multiple inputs x_i and produce zero or more outputs y_{i,j} for each input. For example, each output could be an object detected in a video frame. The user provides two functions over outputs y_{i,j}:

• Id(y_{i,j}) returns an identifier for the output y_{i,j}, which is simply an opaque value.
• Attrs(y_{i,j}) returns zero or more attributes for the output y_{i,j}, which are key-value pairs.

In addition to checking attributes, we found that many applications also expect their identifiers to appear in a "temporally consistent" fashion, where objects do not disappear and reappear too quickly. For example, one would expect cars identified in the video to stay on the screen for multiple frames instead of "flickering" in and out in most cases. To express this expectation, developers can provide a temporal consistency threshold, T, which specifies that each identifier should not appear or disappear for intervals less than T seconds. For example, we might set T to one second for TV footage that frequently cuts across frames, or 30 seconds for an activity classification algorithm that distinguishes between walking and biking. The full API for adding a consistency assertion is therefore AddConsistencyAssertion(Id, Attrs, T).

Examples. We briefly describe how one can use consistency assertions in several ML tasks motivated in §2.2:

Face identification in TV footage: This application uses multiple ML models to detect faces in images, match them to identities, classify their gender, and classify their hair color. We can use the detected identity as our Id function and gender/hair color as attributes.

Video analytics for traffic cameras: This application aims to detect vehicles in street-traffic video, and suffers from problems such as flickering or changing classifications for an object. The model's output is bounding boxes with classes on each frame. Because we lack a globally unique identifier (e.g., license plate number) for each object, we can assign a new identifier for each box that appears and assign the same identifier as it persists through the video. We can treat the class as an attribute and set T as well to detect flickering.

Heart rhythm classification from ECGs: In this application, domain experts informed us that atrial fibrillation heart rhythms need to persist for at least 30 seconds to be considered a problem. We used the detected class as our identifier and set T to 30 seconds.

4.2 Generating Assertions and Labels from the API

Given the Id, Attrs, and T values, OMG automatically generates Boolean assertions to check for matching attributes and to check that when an identifier appears in the data, it persists for at least T seconds. These assertions are treated the same as user-provided ones in the rest of the system.

OMG also automatically generates corrective rules that propose a new label for outputs that do not match their identifier's other outputs on an attribute. The default behavior is to propose the most common value of that attribute (e.g., the class detected for an object on most frames), but users can also provide a WeakLabel function to suggest an alternative based on all of that object's outputs.

For temporal consistency constraints via T, OMG will assert by default that at most one transition can occur within a T-second window; this can be overridden. For example, an identifier appearing is valid, but an identifier appearing, disappearing, then appearing is invalid. If a violation occurs, OMG will propose to remove, modify, or add predictions. In the latter case, OMG needs to know how to generate an expected output on an input where the object was not identified (e.g., frames where the object flickered out in Figure 1). OMG requires the user to provide a WeakLabel function to cover this case, since it may require domain-specific logic, e.g., averaging the locations of the object on nearby video frames.

5 EVALUATION

5.1 Experimental Setup

We evaluated OMG and model assertions on four diverse ML workloads based on real industrial and academic use cases: analyzing TV news, video analytics, autonomous vehicles, and medical classification. For each domain, we describe the task, dataset, model, training procedure, and assertions. A summary is given in Table 1.

TV news. Our contacts analyzing TV news provided us 50 hour-long segments that were known to be problematic. They further provided pre-computed boxes of faces, identities, and hair colors; this data was computed from a range of models and sources, including hand-labeling, weak labels, and custom classifiers. We implemented the consistency assertions described in §4. We were unable to access the training code for this domain, so we were unable to perform retraining experiments for this domain.

Video analytics. Many modern video analytics systems use object detection as a core primitive (Kang et al., 2017; 2019; Hsieh et al., 2018; Jiang et al., 2018; Xu et al., 2019; Canel et al., 2019), in which the task is to localize and classify the objects in a frame of video. We focus on the object detection portion of these systems. We used a ResNet-34 SSD (Liu et al., 2016) (henceforth SSD) model pretrained on MS-COCO (Lin et al., 2014). We deployed SSD for detecting vehicles in the night-street (i.e., jackson) video that is commonly used (Kang et al., 2017; Xu et al., 2019; Canel et al., 2019; Hsieh et al., 2018). We used a separate day of video for training and testing. We deployed three model assertions: multibox, flicker, and appear. The multibox assertion fires when three boxes highly overlap (Figure 7, Appendix). The flicker and appear assertions are implemented with our consistency API as described in §4.

Autonomous vehicles. We studied the problem of object detection for autonomous vehicles using the NuScenes dataset (Caesar et al., 2019), which contains labeled LIDAR point clouds and associated visual images. We split the data into separate train, unlabeled, and test splits. We detected vehicles only. We use the open-source Second model with PointPillars (Yan et al., 2018; Lang et al., 2019) for LIDAR detections and SSD for visual detections. We improve SSD via active learning and weak supervision in our experiments. As NuScenes contains time-aligned point clouds and images, we deployed a custom assertion for 2D and 3D boxes agreeing, and the multibox assertion. We deployed a custom weak supervision rule that imputed boxes from the 3D predictions. While other assertions could have been deployed (e.g., flicker), we found that the dataset was not sampled frequently enough (at 2 Hz) for these assertions.

Medical classification. We studied the problem of classifying atrial fibrillation (AF) via ECG signals. We used a convolutional network that was shown to outperform cardiologists (Rajpurkar et al., 2019). Unfortunately, the full dataset used in (Rajpurkar et al., 2019) is not publicly available, so we used the CINC17 dataset (cin, 2017). CINC17 contains 8,528 data points that we split into train, validation, unlabeled, and test splits. We consulted with medical researchers and deployed an assertion that asserts that the classification should not change between two classes in under a 30-second time period (i.e., the assertion fires when the classification changes from A→B→A within 30 seconds), as described in §4.

    Task                       Model                             Assertions
    TV news                    Custom                            Consistency (§4, news)
    Object detection (video)   SSD (Liu et al., 2016)            Three vehicles should not highly overlap (multibox),
                                                                 identity consistency assertions (flicker and appear)
    Vehicle detection (AVs)    Second (Yan et al., 2018), SSD    Agreement of point cloud and image detections (agree), multibox
    AF classification          ResNet (Rajpurkar et al., 2019)   Consistency assertion within a 30 s time window (ECG)

Table 1. A summary of tasks, models, and assertions used in our evaluation.

5.2 Model Assertions can be Written with High Precision and Few LOC

We first asked whether model assertions could be written succinctly. To test this, we implemented the model assertions described above and counted the lines of code (LOC) necessary for each assertion. We count the LOC for the identity and attribute functions for the consistency assertions (see Table 1 for a summary of assertions). We counted the LOC with and without the shared helper functions (e.g., computing box overlap); we double counted the helper functions when used between assertions. As we show in Table 2, both consistency and domain-specific assertions can be written in under 25 LOC excluding shared helper functions and under 60 LOC when including helper functions. Thus, model assertions can be written with few LOC.

    Assertion    LOC (no helpers)    LOC (inc. helpers)
    news         7                   39
    ECG          23                  50
    flicker      18                  60
    appear       18                  35
    multibox     14                  28
    agree        11                  28

Table 2. Number of lines of code (LOC) for each assertion. Consistency assertions are on the top and custom assertions are on the bottom. All assertions could be written in under 60 LOC including helper functions, when double counting between assertions. The assertion main body could be written in under 25 LOC in all cases. The helper functions included utilities such as computing the overlap between boxes.

We then asked whether model assertions could be written with high precision. To test this, we randomly sampled 50 data points that triggered each assertion and manually checked whether that data point had an incorrect output from the ML model. The consistency assertions return clusters of data points (e.g., appear) and we report the precision for errors in both the identifier and ML model outputs and only the ML model outputs. As we show in Table 3, model assertions achieve at least 88% precision in all cases.

    Assertion    Precision (identifier and output)    Precision (model output only)
    news         100%                                 100%
    ECG          100%                                 100%
    flicker      100%                                 96%
    appear       100%                                 88%
    multibox     N/A                                  100%
    agree        N/A                                  98%

Table 3. Precision of our model assertions we deployed on 50 randomly selected examples. The top are consistency assertions and the bottom are custom assertions. We report both precision in the ML model outputs only and when counting errors in the identification function and ML model outputs for consistency assertions. As shown, model assertions can be written with 88-100% precision across all domains when only counting errors in the model outputs.

5.3 Model Assertions can Identify High-Confidence Errors

We asked whether model assertions can identify high-confidence errors, or errors where the model returns the wrong output with high confidence. High-confidence errors are important to identify as confidence is used in downstream tasks, such as analytics queries and actuation decisions (Kang et al., 2017; 2019; Hsieh et al., 2018; Chinchali et al., 2019). Furthermore, sampling solutions that are based on confidence would be unable to identify these errors.

To determine whether model assertions could identify high-confidence errors, we collected the 10 data points with highest confidence error for each of the model assertions deployed for video analytics. We then plotted the percentile of the confidence among all the boxes for each error.

As shown in Figure 3, model assertions can identify errors within the top 94th percentile of boxes by confidence (the flicker confidences were from the average of the surrounding boxes). Importantly, uncertainty-based methods of monitoring would not catch these errors.

Figure 3. Percentile of confidence of the top-10 ranked errors by confidence found by OMG for video analytics (shown per assertion: appear, multibox, flicker). The x-axis is the rank of the errors caught by model assertions, ordered by rank. The y-axis is the percentile of confidence among all the boxes. As shown, model assertions can find errors where the original model has high confidence (94th percentile), allowing them to complement existing confidence-based methods for data selection.

We further show that model assertions can identify errors in human labels, which effectively have a confidence of 1. These results are shown in Appendix E.

5.4 Model Assertions can Improve Model Quality via Active Learning

We evaluated OMG's active learning capabilities and BAL using the three domains for which we had access to the training code (video analytics, ECG, AVs).

Multiple model assertions. We asked whether multiple model assertions could be used to improve model quality via continuous data collection.
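As a concrete illustration of how the consistency assertions used throughout this evaluation (news, flicker, appear, ECG) operate, the attribute-matching check and the default most-common-value correction rule from §4 might be sketched as follows. This is a simplified sketch, not OMG's implementation; the function and variable names are ours, and the temporal (T-second) checks are omitted:

```python
from collections import Counter, defaultdict

def check_consistency(outputs, ident, attrs):
    """Find attribute-consistency violations and propose weak labels.

    outputs: list of model outputs y_ij.
    ident:   the Id function (returns an opaque identifier per output).
    attrs:   the Attrs function (returns key-value attribute pairs per output).
    Returns a list of (output, attribute, proposed_value) corrections, where
    the proposed value is the most common value across the identifier's
    outputs (the default correction rule described in Sec. 4).
    """
    by_id = defaultdict(list)
    for y in outputs:
        by_id[ident(y)].append(y)

    corrections = []
    for group in by_id.values():
        # Check every attribute key that appears in this identifier's group.
        for key in {k for y in group for k in attrs(y)}:
            values = [attrs(y)[key] for y in group if key in attrs(y)]
            majority, _ = Counter(values).most_common(1)[0]
            for y in group:
                if attrs(y).get(key, majority) != majority:
                    corrections.append((y, key, majority))
    return corrections
```

For the TV news example, Id could return the detected name and Attrs the gender attribute; outputs whose gender disagrees with the identifier's majority value receive that majority value as a weak label.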
We deployed three assertions over night-street and two assertions for NuScenes.

We used random sampling, uncertainty sampling with "least confident" (Settles, 2009), uniform sampling from data that triggered assertions, and BAL for the active learning strategies. We used the mAP metric for both datasets, which is widely used for object detection (Lin et al., 2014; He et al., 2017). We defer hyperparameters to Appendix C.

As we show in Figure 4, BAL outperforms both random sampling and uncertainty sampling on both datasets after the first round, which is required for calibration. BAL also outperforms uniform sampling from model assertions by the last round. For night-street, at a fixed accuracy threshold of 62%, BAL uses 40% fewer labels than random and uncertainty sampling. By the fifth round, BAL outperforms both random sampling and uncertainty sampling by 1.5% mAP. While the absolute change in mAP may seem small, doubling the model depth, which doubles the computational budget, on MS-COCO achieves a 1.7% improvement in mAP (ResNet-50 FPN vs. ResNet-101 FPN) (Girshick et al., 2018). These results are expected, as prior work has shown that uncertainty sampling can be unsuited for deep networks (Sener & Savarese, 2017).

Figure 4. Performance of random sampling, uncertainty sampling, uniform sampling from model assertions, and BAL for active learning: (a) active learning for night-street; (b) active learning for NuScenes. The round is the round of data collection (see §3). As shown in (a), BAL improves accuracy on unseen data and can achieve an accuracy target (62% mAP) with 40% fewer labels compared to random and uncertainty sampling for night-street. BAL also outperforms both baselines for the NuScenes dataset as shown in (b). We show figures with all rounds of active learning in Appendix D.

Single model assertion. Due to the limited data quantities for the ECG dataset, we were unable to deploy more than one assertion. We asked whether a single model assertion could be used to improve model quality. We ran five rounds of data labeling with 100 examples each round for the ECG dataset. We ran the experiment 8 times and report averages. We show results in Figure 5. As shown, data collection with a single model assertion generally matches or outperforms both uncertainty and random sampling.

Figure 5. Active learning results with a single assertion for the ECG dataset. As shown, with just a single assertion, model-assertion based active learning can match uncertainty sampling and outperform random sampling.

5.5 Model Assertions can Improve Model Quality via Weak Supervision

We used our consistency assertions to evaluate the impact of weak supervision using assertions for the domains we had weak labels for (video analytics, AVs, and ECG).

For night-street, we used 1,000 additional frames, with 750 frames that triggered flicker and 250 random frames, with a learning rate of 5×10⁻⁶ for a total of 6 epochs. For the NuScenes dataset, we used the same 350 scenes to bootstrap the LIDAR model as in the active learning experiments. We trained with 175 scenes of weakly supervised data for one epoch with a learning rate of 5×10⁻⁵. For the ECG dataset, we used 1,000 weak labels and the same training procedure as in active learning.

Table 4 shows that model assertion-based weak supervision can improve relative performance by 46.4% for video analytics and 33% for AVs. Similarly, the ECG classification can also improve with no human-generated labels. These results show that model assertions can be useful as a primitive for improving model quality with no additional data labeling.

    Domain                   Pretrained    Weakly supervised
    Video analytics (mAP)    34.4          49.9
    AVs (mAP)                10.6          14.1
    ECG (% accuracy)         70.7          72.1

Table 4. Accuracy of the pretrained and weakly supervised models for the video analytics, AV, and ECG domains. Weak supervision can improve accuracy with no human-generated labels.

6 RELATED WORK

MLQA. A range of existing ML QA tools focus on validat-
Nonetheless, we further asked whether a single           ing inputs via schemas or tracking performance over time\r\n                                                 ModelAssertionsforMonitoringandImprovingMLModels\r\n               (Polyzotisetal.,2019;Bayloretal.,2017). However,these              ods encode structure/inductive biases into training proce-\r\n               systemsapplytosituationswithmeaningfulschemas(e.g.,                dures or models (BakIr et al., 2007; Haussler, 1988; BakIr\r\n               tabular data) and ground-truth labels at test time (e.g., pre-     et al., 2007). While promising, designing algorithms and\r\n               dicting click-through rate). While model assertions could          modelswithspeci\ufb01cinductivebiasescanbechallengingfor\r\n               also apply to these cases, they also cover situations that do      non-experts. Additionally, these methods generally do not\r\n               notcontainmeaningfulschemasorlabelsattesttime.                     containruntimechecksforaberrantbehavior.\r\n               OtherMLQAsystemsfocusontrainingpipelines(Renggli                   WeakSupervision,Semi-supervisedLearning. Weaksu-\r\n               et al., 2019) or validating numerical errors (Odena &              pervision leverages higher-level and/or noisier input from\r\n               Goodfellow, 2018). These approaches are important at               humanexpertstoimprovemodelquality(Mintzetal.,2009;\r\n               \ufb01ndingpre-deploymentbugs,butdonotapplytotest-time                  Ratneretal.,2017;Jinetal.,2018). Insemi-supervisedlearn-\r\n               scenarios;theyarecomplementarytomodelassertions.                   ing, structural assumptions over the data are used to leverage\r\n               White-box testing systems, e.g., DeepXplore (Pei et al.,           unlabeled data (Zhu, 2011). 
However, to our knowledge,\r\n               2017), test ML models by taking inputs and perturbing              bothofthesemethodsdonotcontainruntimechecksandare\r\n               them. However,asdiscussed,avalidationsetcannotcover                notusedinmodel-agnosticactivelearningmethods.\r\n               all possibilities in the deployment set. Furthermore, these\r\n               systemsdonotgiveguaranteesundermodeldrift.                         7    DISCUSSION\r\n               Sinceourinitialworkshoppaper(Kangetal.,2018),several               While we believe model assertions are an important step\r\n               workshaveextendedmodelassertions(Arechigaetal.,2019;               towardsapracticalsolutionformonitoringandcontinuously\r\n               Henzingeretal.,2019).                                              improving ML models, we highlight three important\r\n               Veri\ufb01edML. Veri\ufb01cationhasbeenappliedtoMLmodelsin                   limitations of model assertions, which may be fruitful\r\n               simplecases. Forexample,Reluplex(Katzetal.,2017)can                directions for future work.\r\n               verify that extremely small networkswillmakecorrectcon-            First, certain model assertions may be dif\ufb01cult to express\r\n               trol decisions given a \ufb01xed set of inputs and other work has       in our current API. While arbitrary code can be expressed\r\n               shownthatsimilarlysmallnetworkscanbeveri\ufb01edagainst                 in OMG\u2019sAPI,certain temporal assertions may be better\r\n               minimalperturbationsofa\ufb01xedsetofinputimages(Raghu-                 expressedinacomplexeventprocessinglanguage(Wuetal.,\r\n               nathanetal.,2018). However,veri\ufb01cationrequiresaspeci\ufb01-             2006). 
Webelievethatdomain-speci\ufb01clanguagesformodel\r\n               cation, which may not be feasible to implement, e.g., even         assertions will be a fruitful area of future research.\r\n               humansmaydisagreeoncertainpredictions(Kirillovetal.,               Second,wehavenotthoroughlyevaluatedmodelassertions\u2019\r\n               2018). Furthermore, the largest veri\ufb01ed networks we are            performance in real-time systems. Model assertions may\r\n               awareof(Katzetal.,2017;Raghunathanetal.,2018;Wang                  addoverheadtosystemswhereactuationhastightlatency\r\n               et al., 2018; Sun et al., 2019) are orders of magnitude smaller    constraints, e.g., AVs. Nonetheless, model assertions can be\r\n               thanthenetworksweconsider.                                         usedoverhistoricaldataforthesesystems. Weareactively\r\n               SoftwareDebugging. Writingcorrectsoftwareandverify-                collaboratingwithanAVcompanytoexploretheseissues.\r\n               ingsoftwarehasalonghistory,withmanyproposalsfromthe                Third,certainissuesinMLsystems,suchasbiasintraining\r\n               researchcommunity. Wehopethatmanysuchpracticesare                  sets, are out of scope for model assertions. We hope that\r\n               adoptedindeployingmachinelearningmodels;wefocuson                  complementarysystems,suchasTFX(Bayloretal.,2017),\r\n               assertions in this work (Goldstine et al., 1947; Turing, 1949).    canhelpimprovequalityinthesecases.\r\n               Assertionshavebeenshowntoreducetheprevalenceofbugs,\r\n               whendeployedcorrectly(Kudrjavetsetal.,2006;Mahmood                 8    CONCLUSION\r\n               et al., 1984). 
There are many other such methods, such as\r\n               formalveri\ufb01cation(Kleinetal.,2009;Leroy,2009;Keller,               In this work, we introduced model assertions, a model-\r\n               1976),conductinglarge-scaletesting(e.g.,fuzzing)(Takanen           agnostic technique that allows domain experts to indicate\r\n               et al., 2008; Godefroid et al., 2012), and symbolic execution      errors in MLmodels. Weshowedthatmodelassertionscan\r\n               to trigger assertions (King, 1976; Cadar et al., 2008). Proba-     beusedatruntimetodetecthigh-con\ufb01denceerrors,which\r\n               bilistic assertions have been used to verify simple distribu-      priormethodswouldnotdetect. Weproposedmethodstouse\r\n               tional properties of programs, suchasdifferentiallyprivate         modelassertionsforactivelearningandweaksupervisionto\r\n               programsshouldreturnanexpectedmean(Sampsonetal.,                   improvemodelquality. Weimplementedmodelassertionsin\r\n               2014). However,MLdevelopersmaynotbeabletospecify                   anovellibrary, OMG,anddemonstratedthattheycanapply\r\n               distributions and data mayshift in deployment.                     to a wide range of real-world MLtasks,improvingmonitor-\r\n               StructuredPrediction,InductiveBias. SeveralMLmeth-                 ing, active learning, and weak supervision for MLmodels.\r\n                                                    ModelAssertionsforMonitoringandImprovingMLModels\r\n                Acknowledgements                                                       Cadar,C.,Dunbar,D.,Engler,D.R.,etal. Klee: Unassisted\r\n                This research was supported in part by af\ufb01liate members and               and automatic generation of high-coverage tests for\r\n                other supporters of the Stanford DAWN project\u2014Ant Financial,              complex systems programs. 
Facebook, Google, Infosys, NEC, and VMware—as well as Toyota Research Institute, Northrop Grumman, Cisco, SAP, and the NSF under CAREER grant CNS-1651570 and Graduate Research Fellowship grant DGE-1656518. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. Toyota Research Institute ("TRI") provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.

We further acknowledge Kayvon Fatahalian, James Hong, Dan Fu, Will Crichton, Nikos Arechiga, and Sudeep Pillai for their productive discussions on ML applications.

REFERENCES

AF classification from a short single lead ECG recording: the physionet/computing in cardiology challenge 2017, 2017. URL https://physionet.org/challenge/2017/.

Scale API: The API for training data, 2019. URL https://scale.ai/.

Arechiga, N., DeCastro, J., Kong, S., and Leung, K. Better AI through logical scaffolding. arXiv preprint arXiv:1909.06965, 2019.

Athalye, A., Engstrom, L., Ilyas, A., and Kwok, K. Synthesizing robust adversarial examples. ICML, 2018.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

BakIr, G., Hofmann, T., Schölkopf, B., Smola, A. J., Taskar, B., and Vishwanathan, S. Predicting structured data. MIT Press, 2007.

Baylor, D., Breck, E., Cheng, H.-T., Fiedel, N., Foo, C. Y., Haque, Z., Haykal, S., Ispir, M., Jain, V., Koc, L., et al. TFX: A TensorFlow-based production-scale machine learning platform. In SIGKDD. ACM, 2017.

Berry, D. A. and Fristedt, B. Bandit problems: sequential allocation of experiments (monographs on statistics and applied probability). London: Chapman and Hall, 5:71–87, 1985.

Bubeck, S., Munos, R., and Stoltz, G. Pure exploration in multi-armed bandits problems. In International Conference on Algorithmic Learning Theory, pp. 23–37. Springer, 2009.

Cadar, C., Dunbar, D., Engler, D. R., et al. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In OSDI, volume 8, pp. 209–224, 2008.

Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., and Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.

Canel, C., Kim, T., Zhou, G., Li, C., Lim, H., Andersen, D., Kaminsky, M., and Dulloor, S. Scaling video analytics on constrained edge nodes. SysML, 2019.

Chen, L., Xu, J., and Lu, Z. Contextual combinatorial multi-armed bandits with volatile arms and submodular reward. In Advances in Neural Information Processing Systems, pp. 3247–3256, 2018.

Chinchali, S., Sharma, A., Harrison, J., Elhafsi, A., Kang, D., Pergament, E., Cidon, E., Katti, S., and Pavone, M. Network offloading policies for cloud robotics: a learning-based approach. arXiv preprint arXiv:1902.05703, 2019.

Coldewey, D. Uber in fatal crash detected pedestrian but had emergency braking disabled, 2018. URL https://techcrunch.com/2018/05/24/uber-in-fatal-crash-detected-pedestrian-but-had-emergency-braking-disabled/.

Coleman, C., Yeh, C., Mussmann, S., Mirzasoleiman, B., Bailis, P., Liang, P., Leskovec, J., and Zaharia, M. Selection via proxy: Efficient data selection for deep learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJg2b0VYDr.

Davies, A. How do self-driving cars see? (and how do they see me?), 2018. URL https://www.wired.com/story/the-know-it-alls-how-do-self-driving-cars-see/.

EHRA. Guidelines for the management of atrial fibrillation: the task force for the management of atrial fibrillation of the European Society of Cardiology (ESC). European Heart Journal, 31(19):2369–2429, 2010.

Evans, L. C. Graduate studies in mathematics. In Partial Differential Equations. Am. Math. Soc., 1998.

Girshick, R., Radosavovic, I., Gkioxari, G., Dollár, P., and He, K. Detectron. https://github.com/facebookresearch/detectron, 2018.

Godefroid, P., Levin, M. Y., and Molnar, D. SAGE: whitebox fuzzing for security testing. Queue, 10(1):20, 2012.

Goldstine, H. H. and Von Neumann, J. Planning and coding of problems for an electronic computing instrument. 1947.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. ICLR, 2015.

Haussler, D. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36(2):177–221, 1988.

He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2980–2988. IEEE, 2017.

Henzinger, T. A., Lukina, A., and Schilling, C. Outside the box: Abstraction-based monitoring of neural networks. arXiv preprint arXiv:1911.09032, 2019.

Hirth, M., Hoßfeld, T., and Tran-Gia, P. Analyzing costs and accuracy of validation mechanisms for crowdsourcing platforms. Mathematical and Computer Modelling, 57(11-12):2918–2932, 2013.

Hsieh, K., Ananthanarayanan, G., Bodik, P., Venkataraman, S., Bahl, P., Philipose, M., Gibbons, P. B., and Mutlu, O. Focus: Querying large video datasets with low latency and low cost. In OSDI, pp. 269–286, 2018.

Jiang, J., Ananthanarayanan, G., Bodik, P., Sen, S., and Stoica, I. Chameleon: scalable adaptation of video analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pp. 253–266. ACM, 2018.

Jin, S., RoyChowdhury, A., Jiang, H., Singh, A., Prasad, A., Chakraborty, D., and Learned-Miller, E. Unsupervised hard example mining from videos for improved object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 307–324, 2018.

Kang, D., Emmons, J., Abuzaid, F., Bailis, P., and Zaharia, M. NoScope: optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11):1586–1597, 2017.

Kang, D., Raghavan, D., Bailis, P., and Zaharia, M. Model assertions for debugging machine learning. In NeurIPS MLSys Workshop, 2018.

Kang, D., Bailis, P., and Zaharia, M. BlazeIt: Fast exploratory video queries using neural networks. PVLDB, 2019.

Katz, G., Barrett, C., Dill, D. L., Julian, K., and Kochenderfer, M. J. Reluplex: An efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pp. 97–117. Springer, 2017.

Keller, R. M. Formal verification of parallel programs. Communications of the ACM, 19(7):371–384, 1976.

King, J. C. Symbolic execution and program testing. Communications of the ACM, 19(7):385–394, 1976.

Kirillov, A., He, K., Girshick, R., Rother, C., and Dollár, P. Panoptic segmentation. arXiv preprint arXiv:1801.00868, 2018.

Klein, G., Elphinstone, K., Heiser, G., Andronick, J., Cock, D., Derrin, P., Elkaduwe, D., Engelhardt, K., Kolanski, R., Norrish, M., et al. seL4: Formal verification of an OS kernel. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pp. 207–220. ACM, 2009.

Kudrjavets, G., Nagappan, N., and Ball, T. Assessing the relationship between software assertions and faults: An empirical investigation. In Software Reliability Engineering, 2006. ISSRE'06. 17th International Symposium on, pp. 204–212. IEEE, 2006.

Lang, A. H., Vora, S., Caesar, H., Zhou, L., Yang, J., and Beijbom, O. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12697–12705, 2019.

Lee, T. Tesla says autopilot was active during fatal crash in Mountain View. https://arstechnica.com/cars/2018/03/tesla-says-autopilot-was-active-during-fatal-crash-in-mountain-view/, 2018.

Leroy, X. Formal verification of a realistic compiler. Communications of the ACM, 52(7):107–115, 2009.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740–755. Springer, 2014.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. SSD: Single shot multibox detector. In European Conference on Computer Vision, pp. 21–37. Springer, 2016.

Lu, T., Pál, D., and Pál, M. Contextual multi-armed bandits. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 485–492, 2010.

Mahmood, A., Andrews, D. M., and McCluskey, E. J. Executable assertions and flight software. 1984.

Mintz, M., Bills, S., Snow, R., and Jurafsky, D. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pp. 1003–1011. Association for Computational Linguistics, 2009.

NTSB. Vehicle automation report, HWY18MH010, 2019. URL https://dms.ntsb.gov/public/62500-62999/62978/629713.pdf.

Odena, A. and Goodfellow, I. TensorFuzz: Debugging neural networks with coverage-guided fuzzing. arXiv preprint arXiv:1807.10875, 2018.

Pei, K., Cao, Y., Yang, J., and Jana, S. DeepXplore: Automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles, pp. 1–18. ACM, 2017.

Polyzotis, N., Zinkevich, M., Roy, S., Breck, E., and Whang, S. Data validation for machine learning. SysML, 2019.

Radlinski, F., Kleinberg, R., and Joachims, T. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pp. 784–791. ACM, 2008.

Raghunathan, A., Steinhardt, J., and Liang, P. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018.

Rajpurkar, P., Hannun, A. Y., Haghpanahi, M., Bourn, C., and Ng, A. Y. Cardiologist-level arrhythmia detection with convolutional neural networks. Nature Medicine, 2019.

Ratner, A., Bach, S., Varma, P., and Ré, C. Weak supervision: The new programming paradigm for machine learning, 2017. URL https://dawn.cs.stanford.edu/2017/07/16/weak-supervision/.

Renggli, C., Karla, B., Ding, B., Liu, F., Schawinski, K., Wu, W., and Zhang, C. Continuous integration of machine learning models with ease.ml/ci: Towards a rigorous yet practical treatment. SysML, 2019.

Sampson, A., Panchekha, P., Mytkowicz, T., McKinley, K. S., Grossman, D., and Ceze, L. Expressing and verifying probabilistic assertions. ACM SIGPLAN Notices, 49(6):112–122, 2014.

Sener, O. and Savarese, S. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.

Settles, B. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.

Sun, C., Shrivastava, A., Singh, S., and Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pp. 843–852, 2017.

Sun, X., Khedr, H., and Shoukry, Y. Formal verification of neural network controlled autonomous systems. In Proceedings of the 22nd ACM International Conference on Hybrid Systems: Computation and Control, pp. 147–156. ACM, 2019.

Takanen, A., Demott, J. D., and Miller, C. Fuzzing for software security testing and quality assurance. Artech House, 2008.

Taylor, L. and Nitschke, G. Improving deep learning using generic data augmentation. arXiv preprint arXiv:1708.06020, 2017.

Tokic, M. and Palm, G. Value-difference based exploration: adaptive control between epsilon-greedy and softmax. In Annual Conference on Artificial Intelligence, pp. 335–346. Springer, 2011.

Tran-Thanh, L., Venanzi, M., Rogers, A., and Jennings, N. R. Efficient budget allocation with accuracy guarantees for crowdsourcing classification tasks. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, pp. 901–908. International Foundation for Autonomous Agents and Multiagent Systems, 2013.

Turing, A. Checking a large routine. In Report on a Conference on High Speed Automatic Calculating Machines, pp. 67–69. Cambridge University Mathematics Lab, 1949.

Wang, J. and Perez, L. The effectiveness of data augmentation in image classification using deep learning. Convolutional Neural Networks Vis. Recognit, 2017.

Wang, S., Pei, K., Whitehouse, J., Yang, J., and Jana, S. Formal security analysis of neural networks using symbolic intervals. In USENIX Security Symposium, pp. 1599–1614, 2018.

Wu, E., Diao, Y., and Rizvi, S. High-performance complex event processing over streams. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pp. 407–418. ACM, 2006.

Xu, T., Botelho, L. M., and Lin, F. X. VStore: A data store for analytics on large videos. In Proceedings of the Fourteenth EuroSys Conference 2019, pp. 16. ACM, 2019.

Yan, Y., Mao, Y., and Li, B. SECOND: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.

Zhu, X. Semi-supervised learning. In Encyclopedia of Machine Learning, pp. 892–897. Springer, 2011.

[Figure 6 panels: (a) Frame 1. (b) Frame 2.]

(a) Example error flagged by multibox. SSD predicts three
SSD predicts three\r\n                                                                               truckswhenonlyoneshouldbedetected.\r\n               Figure6.Two example frames from the same scene with an\r\n               inconsistent attribute (the identity) from the TV news use case.\r\n                                   (a) Exampleerror1.                          (b) Exampleerror\ufb02aggedbyagree. SSDmissesthecaronthe\r\n                                                                               right and the LIDARmodelpredictsthetruckonthelefttobetoo\r\n                                                                               large.\r\n                                                                               Figure8.Examples of errors that the multibox and agree\r\n                                                                               assertions catch for the NuScenes dataset. LIDAR model boxes\r\n                                                                               areinpinkandSSDboxesareingreen. Bestviewedincolor.\r\n                                   (b) Exampleerror2.                          B CLASSESOFMODELASSERTIONS\r\n               Figure7.Examples errors when three boxes highly overlap (see    Wepresentanon-exhaustivelistofcommonclassesofmodel\r\n               multiboxinSection5). Bestviewedincolor.                         assertions in Table 5 and below. 
Namely, we describe how\r\n                                                                               onemightlookforassertionsinotherdomains.\r\n                                                                               Ourtaxonomizationisnotexactandseveralexampleswill\r\n                                                                               contain features from several classes of model assertions.\r\n               A EXAMPLESOFERRORS                                              Prior work on schema validation (Polyzotis et al., 2019;\r\n                    CAUGHTBYMODELASSERTIONS                                    Bayloretal.,2017)anddataaugmentation(Wang&Perez,\r\n                                                                               2017; Taylor & Nitschke, 2017) can be cast in the model\r\n               In this section, we illustrate several errors caught by the     assertion framework. Asthesehavebeenstudied,wedonot\r\n               modelassertionsusedinourevaluation.                             focusontheseclassesofassertionsinthiswork.\r\n               First, we showanexampleerrorintheTVnewsusecasein                Consistency assertions. An important class of model as-\r\n               Figure 6. Recall that these assertions were generated with      sertions checks the consistency across multiple models or\r\n               our consistency API (\u00a74). In this example, the identi\ufb01er is     sources of data. The multiple sources of data could be the\r\n               thebox\u2019ssceneidandtheattributeistheidentity.                    output of multiple ML models on the same data, multiple\r\n               Second,weshowanexampleerrorforthevisualanalytics                sensors, or multiple views of the same data. The output\r\n               usecaseinFigure7forthemultiboxassertion. 
Here,SSD               fromthevarioussourcesshouldagreeandconsistencymodel\r\n               erroneouslydetectsmultiplecarswhenthereshouldbeone.             assertions specify this constraint. These assertions can be\r\n                                                                               generatedviaourAPIasdescribedin\u00a74.\r\n               Third, we show two example errors for the AV use case in\r\n               Figure8fromthemultiboxandagreeassertions.                       Domainknowledgeassertions. Inmanyphysicaldomains,\r\n                                            ModelAssertionsforMonitoringandImprovingMLModels\r\n               Assertionclass   Assertion      Description                        Examples\r\n                                sub-class\r\n               Consistency      Multi-source   Modeloutputsfrommultiple           \u2022 Verifying human labels (e.g., number of\r\n                                               sourcesshouldagree                   labelers that disagree)\r\n                                                                                  \u2022 Multiplemodels(e.g.,numberofmodelsthat\r\n                                                                                    disagree)\r\n                                Multi-modal    Modeloutputsfrommultiple           \u2022 Multiple sensors (e.g., number of disagree-\r\n                                               modesofdatashouldagree               mentsfromLIDARandcameramodels)\r\n                                                                                  \u2022 Multipledatasources(e.g.,textandimages)\r\n                                Multi-view     Modeloutputsfrommultipleviews      \u2022 Videoanalytics(e.g.,resultsfromoverlapping\r\n                                               ofthesamedatashouldagree             viewsofdifferentcamerasshouldagree)\r\n                                                                                  \u2022 
Medicalimaging(e.g.,differentanglesshould\r\n                                                                                    agree)\r\n               Domain           Physical       Physicalconstraints                \u2022 Videoanalytics(e.g.,carsshouldnot\ufb02icker)\r\n               knowledge                       onmodeloutputs                     \u2022 Earthquakedetection(e.g.,earthquakesshould\r\n                                                                                    appearacrosssensorsinphysicallyconsistent\r\n                                                                                    ways)\r\n                                                                                  \u2022 Protein-protein interaction (e.g., number of\r\n                                                                                    overlappingatoms)\r\n                                Unlikely       Scenariosthatare                   \u2022 Video analytics (e.g., maximum con\ufb01dence\r\n                                scenario       unlikelytooccur                      of3vehiclesthathighlyoverlap),\r\n                                                                                  \u2022 Text generation (e.g., two of the same word\r\n                                                                                    shouldnotappearsequentially)\r\n               Perturbation     Insertion      Inserting certain types of data    \u2022 Visual analytics (e.g., synthetically adding a\r\n                                               shouldnotmodifymodeloutputs          car to a frame of video should be detected as\r\n                                                                                    acar),\r\n                                                                                  \u2022 LIDAR detection (e.g., similar to visual\r\n                                                                                    analytics)\r\n                   
             Similar        Replacingpartsoftheinputwith       \u2022 Sentimentanalysis(e.g.,classi\ufb01cationshould\r\n                                               similar data should not modify       notchangewithsynonyms)\r\n                                               modeloutputs                       \u2022 Objectdetection(e.g.,paintingobjectsdiffer-\r\n                                                                                    entcolorsshouldnotchangethedetection)\r\n                                Noise          Addingnoiseshouldnot               \u2022 Image classi\ufb01cation (e.g., small Gaussian\r\n                                               modifymodeloutputs                   noiseshouldnotaffectclassi\ufb01cation)\r\n                                                                                  \u2022 Timeseries(e.g.,smallGaussiannoiseshould\r\n                                                                                    notaffecttimeseriesclassi\ufb01cation)\r\n               Input            Schema         Inputsshould                       \u2022 Boolean features should not have inputs that\r\n               validation       validation     conformtoaschema                     arenot0or1\r\n                                                                                  \u2022 Allfeaturesshouldbepresent\r\n             Table5.Exampleofmodelassertions. We describe several assertion classes, sub-classes, and concrete instantiations of each class. In\r\n              parentheses,wedescribeapotentialseverityscoreoranapplication.\r\n                                                ModelAssertionsforMonitoringandImprovingMLModels\r\n               domainexpertscanexpressphysicalconstraintsorunlikely\r\n               scenarios. As an example of a physical constraint, when\r\n               predictinghowproteinswillinteract,atomsshouldnotphys-              60\r\n               ically overlap. 
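To make these assertion classes concrete, here is a minimal sketch of a physical-constraint assertion in the spirit of the "cars should not flicker" example from Table 5. The per-frame representation (a set of tracked object identifiers) and the `flicker_assertion` helper are illustrative, not part of the paper's API.

```python
from typing import Dict, List, Set

def flicker_assertion(frames: List[Set[str]]) -> List[str]:
    """Flag tracked object ids that disappear and then reappear across
    frames: real cars should not flicker in and out of existence."""
    flagged: List[str] = []
    last_seen: Dict[str, int] = {}  # object id -> last frame index it appeared in
    for t, objects in enumerate(frames):
        for obj in objects:
            # A gap of more than one frame between appearances is a flicker.
            if obj in last_seen and t - last_seen[obj] > 1:
                flagged.append(obj)
            last_seen[obj] = t
    return flagged

# Example: object "a" is detected, vanishes for one frame, then reappears.
frames = [{"a", "b"}, {"b"}, {"a", "b"}]
print(flicker_assertion(frames))  # ['a']
```

In a deployment, the flagged frames would be surfaced for runtime monitoring or queued for labeling, as in the active learning experiments.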
As an example of an unlikely scenario, boxes of the visible parts of cars should not highly overlap (Figure 7). In particular, model assertions over unlikely scenarios may not be 100% precise, i.e., they will be soft assertions.

Perturbation assertions. Many domains contain input and output pairs that can be perturbed (perhaps jointly) such that the output does not change. These perturbations have been widely studied through the lens of data augmentation (Wang & Perez, 2017; Taylor & Nitschke, 2017) and adversarial examples (Goodfellow et al., 2015; Athalye et al., 2018).

Input validation assertions. Domains that have schemas for the input data can have model assertions that validate the input data based on the schema (Polyzotis et al., 2019; Baylor et al., 2017). For example, boolean inputs that are encoded with integral values (i.e., 0 or 1) should never be negative. This class of assertions is an instance of preconditions for ML models.

C  HYPERPARAMETERS

Hyperparameters for active learning experiments. For night-street, we used 300,000 frames of one day of video for the training and unlabeled data. We sampled 100 frames per round for five rounds and used 25,000 frames of a different day of video for the test set. Due to the cost of obtaining labels, we ran each trial twice.

For the NuScenes dataset, we used 350 scenes to bootstrap the LIDAR model, 175 scenes for unlabeled/training data for SSD, and 75 scenes for validation (out of the original 850 labeled scenes). We trained for one epoch at a learning rate of 5×10⁻⁵. We ran 8 trials.

For the ECG dataset, we trained for 5 rounds of active learning with 100 samples per round. We used a learning rate of 0.001 until the loss plateaued, as the original training code did.

D  FULL ACTIVE LEARNING FIGURES

We show active learning results for all rounds in Figure 9.

Figure 9. Performance of random sampling, uncertainty sampling, uniform sampling from model assertions, and BAL for active learning (mAP vs. round of data collection; see §3): (a) active learning for night-street, (b) active learning for NuScenes. As shown, BAL improves accuracy on unseen data and can achieve the same accuracy (62% mAP) as random sampling with 40% fewer labels for night-street. BAL also outperforms both baselines for the NuScenes dataset. [Plots omitted.]

E  MODEL ASSERTIONS CAN IDENTIFY ERRORS IN HUMAN LABELS

We further asked whether model assertions could be used to identify errors in human-generated labels, i.e., where a human acts as the "ML model." While verification of human labels has been studied in the context of crowd-sourcing (Hirth et al., 2013; Tran-Thanh et al., 2013), several production labeling services (e.g., Scale (sca, 2019)) do not provide the annotator identification necessary to perform this verification. We instead deployed a model assertion in which we tracked objects across frames of a video using an automated method and verified that the same object had the same label in different frames.

We obtained labels for 1,000 random frames from night-street from Scale AI (sca, 2019), which is used by several autonomous vehicle companies. Scale returned 469 boxes, which we manually verified for correctness. Table 6 summarizes our results. There were no localization errors, but there were 32 classification errors, of which the model assertion caught 4 (12.5%). Thus, we see that model assertions can also be used to verify human labels.

Table 6. Number of labels, errors, and errors caught by model assertions for Scale-annotated images for the video analytics task. As shown, model assertions caught 12.5% of the errors in this data.

  Description     Number
  All labels      469
  Errors          32
  Errors caught   4
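The tracking-based label check described above can be sketched as follows. This is a minimal illustration assuming each detection is a (frame, tracked object id, human label) triple; the representation and the `inconsistent_labels` helper are ours, not the deployed code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def inconsistent_labels(
    detections: List[Tuple[int, str, str]],  # (frame, tracked object id, human label)
) -> Dict[str, List[str]]:
    """Group human labels by tracked object and flag any object whose
    label changes across frames: the same object should keep one class."""
    labels_per_object: Dict[str, List[str]] = defaultdict(list)
    for _frame, obj_id, label in detections:
        labels_per_object[obj_id].append(label)
    # An object with more than one distinct label violates the assertion.
    return {o: ls for o, ls in labels_per_object.items() if len(set(ls)) > 1}

detections = [
    (0, "obj1", "car"), (1, "obj1", "car"),
    (0, "obj2", "car"), (1, "obj2", "truck"),  # label flips: flagged
]
print(inconsistent_labels(detections))  # {'obj2': ['car', 'truck']}
```

Flagged objects would then be sent back for manual review, as was done for the 469 Scale-annotated boxes.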