{"title": "Predictive Precompute with Recurrent Neural Networks", "book": "Proceedings of Machine Learning and Systems", "page_first": 470, "page_last": 480, "abstract": "In both mobile and web applications, speeding up user interface response times can often lead to significant improvements in user engagement. A common technique to improve responsiveness is to precompute data ahead of time for specific features. However, simply precomputing data for all user and feature combinations is prohibitive at scale due to both network constraints and server-side computational costs. It is therefore important to accurately predict per-user feature usage in order to minimize wasted precomputation (an approach we call \u201cpredictive precompute\u201d). In this paper, we describe the novel application of recurrent neural networks (RNNs) for predictive precompute. We compare their performance with traditional machine learning models, and share findings from their use in large-scale production systems. We demonstrate that RNN models improve prediction accuracy, eliminate most feature engineering steps, and reduce the computational cost of serving predictions by an order of magnitude.", "full_text": "PREDICTIVE PRECOMPUTE WITH RECURRENT NEURAL NETWORKS\r\nHanson Wang[1], Zehui Wang[1], Yuanyuan Ma[1]\r\n\r\nABSTRACT\r\nIn both mobile and web applications, speeding up user interface response times can often lead to significant improvements in user engagement. A common technique to improve responsiveness is to precompute data ahead of time for specific activities. However, simply precomputing data for all user and activity combinations is prohibitive at scale due to both network constraints and server-side computational costs.\r\n
It is therefore important to accurately predict per-user application usage in order to minimize wasted precomputation (\u201cpredictive precompute\u201d). In this paper, we describe the novel application of recurrent neural networks (RNNs) for predictive precompute. We compare their performance with traditional machine learning models, and share findings from their large-scale production use at Facebook. We demonstrate that RNN models improve prediction accuracy, eliminate most feature engineering steps, and reduce the computational cost of serving predictions by an order of magnitude.\r\n\r\n[1] Facebook, Menlo Park, California, USA. Correspondence to: Hanson Wang <hansonw@fb.com>.\r\nProceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. Copyright 2020 by the author(s).\r\n\r\n1 INTRODUCTION\r\n\r\nThe relationship between application latency and user engagement is well-known; improving the responsiveness of an application by just a few seconds can result in significant increases in user engagement due to the limited attention span of users (Palmer, 2016).\r\n\r\nIn modern applications, the most common source of latency is data fetching. To serve the content required for a particular activity, one might first make a network request, retrieve the raw content from a database, and then apply some additional processing, each of which may take a significant amount of time. One common strategy to improve responsiveness from the user\u2019s perspective is to precompute the results (prefetching them ahead of time) so that they are immediately available to the user.\r\n\r\nFor any individual activity, the simplest precompute strategy is to always perform a precompute at application startup. However, this causes issues both client-side and server-side:\r\n\r\n\u2022 On the client side (especially on mobile clients), aggressive precompute adversely affects cellular data usage, application startup time, and battery usage, all of which negatively impact user engagement.\r\n\u2022 On the server side, the computational cost of the data fetches becomes significant at scale, especially during peak hours, when computational resources are relatively scarce.\r\n\r\nTo minimize computational costs, one solution is to use an approach we call predictive precompute: we can predict the probability that a user will access a particular activity given the current application state and their historical access logs. We then only precompute data when the probability surpasses a certain threshold, significantly reducing the proportion of wasted precompute.\r\n\r\nThe key to this approach is accurately predicting user access probabilities. Estimating probabilities translates well into a standard machine learning problem, where existing research offers some solutions: for example, Wang et al. (2015) describe a system to precompute links embedded in Twitter posts using linear regression on various content-specific features, and Sarker et al. (2019) demonstrate the effectiveness of decision tree models for context-aware smartphone usage predictions. However, it is often difficult to \u201cfeature engineer\u201d a user\u2019s historical access logs into a form that traditional machine learning methods can use.\r\n\r\nWe propose the use of deep learning models, particularly those based on recurrent neural networks (RNNs), as a novel improvement over previous methods. We demonstrate through offline experiments that RNNs are able to make effective use of historical data with minimal feature engineering to achieve superior prediction accuracy over traditional models. We confirm these benefits in production through an online experiment where RNNs yield a 7.81% increase in successful precompute over a traditional model.\r\n\r\nFinally, we highlight the benefits of the RNN computation model from a systems perspective. By eliminating the time-based aggregations used in traditional models in favor of a single hidden state, the overall computational cost of serving\r\n
predictions is reduced by a factor of 10x.\r\n\r\n2 RELATED WORK\r\n\r\nExisting literature describes relatively simple models to estimate access probabilities in the context of precompute:\r\n\r\n\u2022 Wang et al. (2015) describe a system to prefetch links embedded in social network feeds using linear regression on mostly content-based features.\r\n\u2022 Parate et al. (2013) describe using the CDF of the \u201ctime since last use\u201d as the probability estimate.\r\n\u2022 Sarker et al. (2019) showcase the effectiveness of decision tree models for context-aware smartphone usage prediction.\r\n\r\nHowever, there are a few common limitations in approaches like the ones mentioned above:\r\n\r\n\u2022 Making full use of historical access logs is difficult. Most classical machine learning models operate on fixed-length feature vectors, so a common approach is to compute aggregate functions based on the timestamps of previous accesses (e.g. time since last access, average time between historical accesses, number of accesses within a certain time window). However, the aggregations must be chosen manually through trial and error (\u201cfeature engineering\u201d).\r\n\u2022 Incorporating contextual features from historical access logs adds another dimension of difficulty, as any time-based aggregation can be combined with any subset of context dimensions (e.g. what was the number of accesses associated with this application surface?).\r\n\u2022 The act of performing the predictions themselves is resource-constrained: care must be taken to ensure that model serving does not become as computationally expensive as the precomputation itself. For example, computing and serving aggregation features like the examples mentioned above may require specialized infrastructure to remain efficient at scale.\r\n\r\nTo address these problems, we are able to draw inspiration from research in the recommendation domain, where historical user behavior is similarly predictive. In recent times, deep learning-based recommendation systems have become extremely widespread due to their effectiveness at handling complex relationships in large datasets (Zhang et al., 2017). Of particular interest are recommendation systems based on recurrent neural networks (RNNs) due to their innate ability to model sequential data:\r\n\r\n\u2022 Beutel et al. (2018) describe the use of RNN-based recommender systems for video recommendations based on user actions. Of special note is the improved handling of \u201ccontextual features\u201d (e.g. time, location, interface), which we find to be very relevant in the domain of predictive precompute as well.\r\n\u2022 Soh et al. (2017) describe the use of RNNs (specifically, gated recurrent units) to personalize recommendations in user interfaces based on a sequence of past user interactions, which is a very similar domain to the work in this paper.\r\n\r\nKatevas et al. (2017) have also demonstrated the effectiveness of RNN architectures to predict user responses to notifications based on a sequence of mobile sensor data.\r\n\r\nOur findings indicate that the results from the works above transfer well to the domain of predictive precompute. Using several test datasets, we compare RNN models with simpler models similar to the ones mentioned above and demonstrate the benefits of RNNs through online and offline experimentation.\r\n\r\n3 DEFINING PREDICTIVE PRECOMPUTE\r\n\r\nIn this paper, we will focus on precomputation of individual activities within large-scale applications, where we have datasets of access logs from previous user sessions.\r\n\r\nTo be more specific, we would like to have a function that estimates the probability that a user accesses a particular activity within an application session, based on the current session context and past access logs as inputs. We will estimate the access probability at the beginning of each session and then choose to trigger precomputation at that point if the probability is greater than some fixed threshold.\r\n\r\n3.1 Definitions\r\n\r\nSessions are defined as discrete time windows where the user is actively using the application; a session typically starts when the user opens the application. For simplicity, we consider each session to have a fixed length, e.g. 20 minutes, to avoid having to precisely measure when a session ends.\r\n\r\nContext describes session-specific information that may be predictive of user behavior. Examples of context include:\r\n\r\n\u2022 the current timestamp (including the hour of the day, day of the week, etc.)\r\n\u2022 the current application surface\r\n\u2022 indicators visible to the user, e.g. a \u201cbadge count\u201d indicating the number of unseen notifications\r\n\r\nAccess logs are a sequential record of past application sessions, keyed by a user identifier. For each session we will record the context at the start of the session, as well as an additional Boolean access flag indicating if the activity was accessed within that application session or not. Access logs will be used as the training dataset, where access flags are used as ground truth labels and contexts are used to extract\r\n
features.\r\n\r\n3.2 Problem Statement\r\n\r\nFor any given user, assume that we have access to logged data for n \u2212 1 previous user sessions, where C_i denotes the context and A_i \u2208 {0,1} denotes the access flag for the i-th session. If we refer to the current session as session n, we would like to estimate the probability of an access, P(A_n), given all known information past and present:\r\n\r\n    P(A_n | C_1, A_1, C_2, A_2, ..., C_{n-1}, A_{n-1}, C_n)\r\n\r\nThe remainder of this paper will primarily focus on how to best estimate P(A_n). We will train machine learning models treating each session as an individual data point, with the recorded value of A_n as the ground truth (label) and C_1, A_1, ..., C_{n-1}, A_{n-1}, C_n as the basis for features.\r\n\r\n3.2.1 Timeshifted Precompute\r\n\r\nAn interesting related problem occurs in the case where we wish to enable precomputation of data prior to the start of an application session (e.g. several hours in advance). We refer to this modified problem as timeshifted precompute.\r\n\r\nThe primary motivation of precomputing data further in advance is to shift computational cost from peak hours to off-peak hours. At scale, being able to shift a meaningful amount of computational cost to off-peak hours smooths out the peak/off-peak power curve and can reduce overall capacity requirements.\r\n\r\nIn this scenario we do not have access to any session-specific context and instead must rely on existing access logs alone to predict the probability that an access will occur within a pre-defined peak hours window of a particular day. If we denote access during peak hours of day d as PA_d \u2208 {0,1}, then we can state the probability estimate as:\r\n\r\n    P(PA_d | C_1, A_1, ..., C_n, A_n)\r\n\r\nEach training example corresponds to one user \u00d7 peak window pair, with the ground truth label being the presence of an access within the peak window.\r\n\r\n4 DATASETS\r\n\r\nIn this paper we analyze two real-world datasets where predictive precompute is being employed at Facebook, each consisting of a random sample of 10^6 users over 30 days. As a publicly available baseline, we also make use of the Mobile Phone Use dataset published by Pielot et al. (2017).\r\n\r\n4.1 Mobile Tab Access (MobileTab)\r\n\r\nUpon startup of the Facebook mobile application, we may choose to prefetch data for certain sections (\u201ctabs\u201d) if we can predict that the user is likely to access them.\r\n\r\nA session begins when the user starts the application and ends after a fixed window of 20 minutes. For this dataset, we selected a tab with moderate usage and recorded an access for every session where an interaction occurred with the tab within the time window.\r\n\r\nContext for this dataset includes the current time, unread notification count displayed over the tab icon (0-99), and the name of the active application tab at startup. Access logs consisting of the three context features and the access flag are stored over a 30-day period.\r\n\r\nTable 1 illustrates an example sequence of sessions for an individual user in the MobileTab dataset.\r\n\r\nTable 1. Sample data for MobileTab.\r\n\r\nTIMESTAMP     ACCESS FLAG   UNREAD   ACTIVE TAB\r\n1564642800    1             3        HOME\r\n1564642900    0             0        HOME\r\n1564643000    0             1        MESSAGES\r\n\r\n4.2 Timeshifted Data Queries (Timeshift)\r\n\r\nOn the Facebook website, data queries that are relatively static can be precomputed and cached several hours ahead of time (as described in the Timeshifted Precompute problem statement). During off-peak hours of the day, we predict if a user will require a particular data query result in a session during peak hours on the following day.\r\n\r\nA session begins upon the start of a website load and ends after a fixed 20-minute window. We selected a moderately used data query and recorded an access for every session where the data query was used.\r\n\r\nContext for this dataset includes only the session timestamp and a flag indicating whether or not the session occurred during peak hours. Any additional context quickly loses relevance by prediction time, where no context is available.\r\n\r\n4.3 Mobile Phone Use (MPU)\r\n\r\nPielot et al. (2017) have generously published a dataset containing data traces for 279 [1] mobile phone users over the course of four weeks. We borrow heavily from the work of Katevas et al. (2017) and also attempt to predict the probability that a user opens the app associated with a notification when it is received. In the context of predictive precompute, the OS could conceivably preload the app in the background for notifications with a high probability of interaction.\r\n\r\n[1] There are 342 users in the full dataset, but only 279 have an\r\n
Android version with full support for notification tracking.\r\n\r\nFor MPU we define each session to start with the appearance of a notification, with a fixed length of 10 minutes. An access occurs if the user opens the app associated with the notification. To provide a reasonable comparison against the previous datasets we ignore the auxiliary \u201csensors\u201d and focus on only the access logs associated with previous notifications.\r\n\r\nWe derive four context variables for each notification: the current time, the current screen state (on/off/unlocked), the application ID the notification was associated with, and the last opened application ID.\r\n\r\n4.4 Data Statistics\r\n\r\nTable 2 displays summary statistics for each dataset. In MobileTab and MPU each labeled example corresponds to a single session, while in Timeshift an example corresponds to a single peak period (one peak period per day and 30 days per user, for a total of 30M unique training examples). Although the MPU dataset has a very small sample of users, there is much more data per user (on average over 8,000 notification events per user), which still yields a sufficient amount of data for the purposes of model training.\r\n\r\nFigure 1 displays the CDF of access rates for each dataset. Note that for MobileTab and Timeshift a significant percentage of users (36% and 42% respectively) have no recorded accesses at all in 30 days; this is typical of real-world scenarios, where not all users may access a particular activity.\r\n\r\nTable 2. Summary of each dataset.\r\n\r\nDATA SET     POSITIVE RATE   SESSIONS   USERS\r\nMOBILETAB    11.1%           60.8M      1M\r\nTIMESHIFT    7.1%            38.5M      1M\r\nMPU          39.7%           2.34M      279\r\n\r\n[Figure 1 plot: percentage of users (y-axis) versus access rate (x-axis), with one curve each for MobileTab, Timeshift, and MPU.]\r\n\r\n5 BASELINE MODELS\r\n\r\nIn this section we will describe a selection of simpler machine learning models for comparison against the recurrent neural network model.\r\n\r\n5.1 Percentage-Based Model\r\n\r\nA very simple baseline model is to return the current access percentage based on all historical sessions for each user, disregarding additional context information. We can seed the percentage estimate for each user with the globally averaged access percentage across all sessions (\u03b1 \u2208 (0,1)):\r\n\r\n    P(A_n) = (\u03b1 + \u2211_{i=1}^{n-1} A_i) / n\r\n\r\nFor Timeshift the calculation is similar, except we average over accesses at peak rather than individual sessions:\r\n\r\n    P(PA_d) = (\u03b1 + \u2211_{i=1}^{d-1} PA_i) / d\r\n\r\n5.2 Prelude: Feature Engineering\r\n\r\nIn order to make use of traditional models we must first convert all the available raw context for each session into a fixed-length numerical vector. We use common feature engineering techniques to accomplish this.\r\n\r\n\u2022 One-hot encoding of categorical variables. For context variables such as the unread/notification counts, active tab, and application names, we use the standard technique of one-hot encoding [2]. Note that for the tab and application name features we first limit the number of distinct values to a reasonable range by hashing and taking the remainder modulo 97.\r\n\u2022 Time-based features. Given the raw timestamp, we additionally calculate the hour of day (0 - 23) and day of week (1 - 7) and apply a one-hot encoding to them.\r\n\u2022 Time-based aggregations. We can track the number of accesses, number of sessions, and their ratio (the access percentage) for each user across a variety of time windows. In our comparisons we use the last 28 days, 7 days, 1 day, and 1 hour as time windows. We can also filter past accesses to those whose contexts match the current session context, e.g. having the same active tab or notification count (or both). To maximize coverage, we calculate aggregations based on all (time window) \u00d7 (matching subset of context) combinations.\r\n\u2022 Time elapsed.\r\n
We can calculate the time difference (in seconds) from both the last access and the last session. As with the aggregation features, we also condition these on past events with a matching context subset.\r\n\r\nFigure 1. CDF of access rates across users.\r\n\r\n[2] https://en.wikipedia.org/wiki/One-hot\r\n\r\nTable 5 illustrates the importance of thorough feature engineering with GBDT models; evaluation metrics drop sharply once aggregation and time-elapsed features are removed.\r\n\r\n5.3 Logistic Regression (LR)\r\n\r\nAs a first step we use logistic regression (LR) on the features described above, treating each session as an individual data point. To give the aggregation-based features adequate warm-up time, only sessions from the latest 7 days of each dataset are used for training; of these, we take 90% of the users as the training set and leave 10% as a test set for evaluation. Details are explained in Section 8. An additional feature pre-processing step is to bucketize time-elapsed features into 50 buckets and one-hot encode them, due to their uneven distribution. To do so, we take \u230a50 \u00b7 ln(t) / 15\u230b where t is the time difference in seconds; note that the largest possible t (30 days) is about e^{14.76} seconds.\r\n\r\nTo train a logistic regression model, we use the scikit-learn (Pedregosa et al., 2011) LogisticRegression [3] API with the saga solver and default settings.\r\n\r\n[3] https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\r\n\r\n5.4 Gradient Boosted Decision Trees (GBDT)\r\n\r\nGradient boosted decision trees (GBDT) are a popular model type that provides solid results with minimal tuning. Training is similar to the logistic regression approach, but we skip the one-hot encoding step for time-elapsed features and also some categorical features (e.g. the time of day and the day of the week).\r\n\r\nWe use XGBoost 0.90 [4] (Chen & Guestrin, 2016) to train a decision tree model with mostly default settings, except for the tree depth hyperparameter. To determine the optimal tree depth, we split off 10% of the users from the training set as a validation set. We then use a simple exhaustive search over all possible depths in the range [1,10] to minimize the log loss objective over the validation dataset.\r\n\r\nWith the manually engineered numerical features, we find that GBDTs are very hard to beat. We tested simple neural network architectures (e.g. a multi-layer perceptron) and could not obtain significant gains over GBDT models.\r\n\r\n[4] https://github.com/dmlc/xgboost\r\n\r\n6 RECURRENT NEURAL NETWORKS\r\n\r\nWhereas traditional models treat the access prediction for individual sessions as independent events, the innovation of recurrent neural networks (RNNs) is to process events in a sequential manner while introducing a persistent hidden state to carry over information from previous events.\r\n\r\nAn RNN model accepts both the current feature vector and the hidden state as inputs, and produces both a prediction and an updated hidden state for the subsequent prediction.\r\n\r\nThe primary benefit of RNN architectures is to make obsolete the tedious and manual feature engineering steps described in Section 5.2, specifically the aggregation and time difference features. Instead, the RNN hidden state has the ability to automatically capture features based on past events.\r\n\r\nTraditionally, RNN architectures have been popularized in text and audio domains, as they are well suited to handle regular sequences of information like characters or audio samples (Graves et al., 2013). In this section we describe various technical considerations when applying RNNs in the domain of predictive precompute.\r\n\r\n6.1 Sequence Modeling\r\n\r\nRecall from Section 3.2 that for each user we have a sequence of n logged historical sessions, with contexts C_1, ..., C_n and access activity A_1, ..., A_n. Let t_1, ..., t_n also denote the UNIX timestamp of each session where t_1 < t_2 < \u00b7\u00b7\u00b7 < t_n. To model this using RNNs, we first introduce some preliminary concepts:\r\n\r\n\u2022 Feature extraction. Each step of the RNN model must receive a fixed-length feature vector. While we can omit all of the aggregation features described in Section 5.2, we still must construct a feature vector from each C_i consisting of the one-hot encoded categorical context variables (e.g. notification count) and time-based features (hour of day, day of week). Let f_i denote the feature vector of context C_i.\r\n\u2022 Representing different time intervals. Traditional RNN models usually operate on regular sequences where elements in the sequence represent fixed time steps or consecutive characters. However, for a sequence of user sessions, some sessions may be seconds apart, while others may be hours or days apart. To allow the network to react effectively to different timescales we input \u2206t_i = t_i \u2212 t_{i\u22121} to the recurrent network at each step, where \u2206t_1 = 0. We find in our datasets that \u2206t tends to follow a power-law distribution, so we apply the log/bucketing transform described in Section 5.3 here as well, denoted T(\u2206t_i).\r\n\u2022 Hidden states. Each user starts with an initial hidden state h_0, an all-zero vector. At the end of session i, the RNN model produces an updated h_i based on the previous hidden state h_{i\u22121} and inputs f_i, A_i, and \u2206t_i.\r\n\u2022 Update delays. To model real-world behavior accurately we must take into account two sources of delays: (1) that the ground truth A_i cannot be determined until the session ends (recall that each session has a fixed length, e.g.\r\n
20 minutes), and (2) that obtaining $h_i$ is not instantaneous (i.e. it takes some time, $\epsilon$). To address this we define a lag parameter, $\delta$, equal to the session length plus $\epsilon$.

• Functions for hidden updates and predictions. Abstractly, an RNN can be separated into two functions: an updater $\mathrm{RNN}_{\mathrm{update}}$ which produces new hidden states, and a feed-forward network $\mathrm{RNN}_{\mathrm{predict}}$ which produces predictions as output. It is often convenient to combine the two into a single model that produces both outputs simultaneously, but this separation is required in order to properly model the lag $\delta$.

[Figure 2: timeline diagram. MLP units at times $t_1$, $t_2$, $t_3$ output $P(A_1)$, $P(A_2)$, $P(A_3)$ from inputs $(h_0, f_1, T(0))$, $(h_1, f_2, T(t_2 - t_1))$ and $(h_1, f_3, T(t_3 - t_1))$. GRU units at times $t_1 + \delta$, $t_2 + \delta$, $t_3 + \delta$ take inputs $(f_i, A_i, T(\Delta t_i))$ and update $h_0$ to $h_1$, $h_2$, $h_3$.]

Figure 2. Modeling sequences of access logs with recurrent neural networks. Multilayer perceptron (MLP) units produce output probabilities at time $t_i$, while hidden state updates occur through the gated recurrent units (GRU) at time $t_i + \delta$ due to the delay. Note that because $t_3$ occurs before $t_2 + \delta$ it cannot make use of $h_2$ and uses $h_1$, $t_1$ as inputs instead.

Putting everything together, we can define a sequence of hidden states $h_0 = 0, h_1, \ldots, h_n$, with a recurrence relation:

$$h_i = \mathrm{RNN}_{\mathrm{update}}(h_{i-1}, [f_i; A_i; T(\Delta t_i)]) \qquad (1)$$

To obtain a prediction for $P(A_i)$, we use $\mathrm{RNN}_{\mathrm{predict}}$ with the latest known hidden vector accounting for update lag, denoted $h_k$, where $k$ is the maximum index such that $t_k < t_i - \delta$ (if no such $k$ exists, then we let $k = 0$ and $t_i - t_k = 0$):

$$P(A_i) = \mathrm{RNN}_{\mathrm{predict}}(h_k, [f_i; T(t_i - t_k)]) \qquad (2)$$

For timeshifted precompute (3.2.1), predictions do not have access to $f$ or $t$ and instead can only use $start_d$ and $h_k$, where $start_d$ marks the start of the peak period on day $d$ and $k$ is the maximum index such that $t_k < p_1 - \delta$:

$$P(PA_d) = \mathrm{RNN}_{\mathrm{predict}}(h_k, [T(start_d - t_k)]) \qquad (3)$$

Intuitively, we can think of the hidden vectors $h$ as an encoding of the sequences of contexts $C$, accesses $A$, and timestamps $t$. Many of the feature engineering techniques in section 5.2 can be seen as attempts to do this in a manual way. However, through training a recurrent neural net we hope to learn the optimal way to encode the entire sequence into a single vector rather than relying on manual tuning and heuristics. Another significant benefit is the incremental nature of hidden updates: with manual feature engineering we need to store and retrieve the entire sequence, but with RNNs we only need the last known hidden vector.

6.2  Model Architecture

A few options are available for hidden state updates ($\mathrm{RNN}_{\mathrm{update}}$). For this paper, we evaluated three options: a basic tanh-based recurrent unit, a gated recurrent unit (GRU) and a long short-term memory (LSTM) unit. Chung et al. (2014) found that GRU and LSTM units result in comparable performance on a number of example datasets, while tanh performance lags behind.

Empirically we find that GRUs provide the best performance over all of the datasets (at least without significant tuning).

The primary hyperparameter available when using RNN units is the dimensionality of the hidden vectors. Empirically, $d = 128$ seems to be a good dimensionality for all datasets.
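The bucketing transform $T$ and the lag-aware choice of $k$ in Equation (2) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `time_bucket` and `latest_known_index` are hypothetical, mapping gaps shorter than one second (including the $T(0)$ case) to bucket 0 is an assumption, and the lag of 1,260 seconds (a 20-minute session plus a 60-second $\epsilon$) is an example value only.

```python
import bisect
import math

NUM_BUCKETS = 50  # the text bucketizes time-elapsed features into 50 buckets


def time_bucket(dt_seconds):
    # T(dt) = floor(50 * ln(dt) / 15), as described in the text.
    # Assumption: gaps below one second (including 0) fall into bucket 0,
    # since ln(dt) is undefined or negative there.
    if dt_seconds < 1:
        return 0
    raw = math.floor(NUM_BUCKETS * math.log(dt_seconds) / 15)
    return min(raw, NUM_BUCKETS - 1)


def latest_known_index(timestamps, i, delta):
    # Returns the largest 1-based k with t_k < t_i - delta, or 0 when no
    # such session exists (the prediction then falls back to h_0).
    # `timestamps` must be sorted in strictly increasing order.
    t_i = timestamps[i - 1]
    # bisect_left counts how many timestamps are strictly below the cutoff.
    return bisect.bisect_left(timestamps, t_i - delta)


# Hypothetical session start times in seconds, and an example lag.
sessions = [0, 600, 1500, 4000]
lag = 1260
for i in range(1, len(sessions) + 1):
    k = latest_known_index(sessions, i, lag)
    gap = sessions[i - 1] - sessions[k - 1] if k > 0 else 0
    print(i, k, time_bucket(gap))
```

In this toy sequence the third session starts before the second session's hidden state is available, so it reuses $h_1$, which mirrors the situation drawn in Figure 2.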
Another possible modification is to stack multiple GRU or LSTM units on top of each other; however, we report similar findings to Beutel et al. (2018), where the addition of multiple GRU units did not provide a meaningful improvement over a single unit.

For $\mathrm{RNN}_{\mathrm{predict}}$, a simple architecture where the input vector and hidden vector are concatenated and passed into a feed-forward multilayer perceptron (MLP) provides good performance. Inspired by Beutel et al. (2018), we find that an element-wise multiplication of the hidden vector with a latent factor derived from the context provides a meaningful improvement:

$$h'_i = h_k \circ (1 + L([f_i; T(t_i - t_k)]))$$

where $k$ is the latest known index as described previously and $L$ is a linear transformation matrix.

As for the MLP layer, we find that a single hidden layer with 128 neurons combined with a rectified linear unit (ReLU) layer seems to be sufficient for the best performance; adding more layers does not lead to meaningful improvements. We can therefore summarize the formulation as:

$$P(A_i) = \sigma(b_2 + W_2 \cdot \mathrm{ReLU}(b_1 + W_1 [h'_i; f_i; T(t_i - t_k)]))$$

Here $\sigma$ denotes the sigmoid function while $b_1, b_2$ represent constant bias vectors and $W_1, W_2$ represent linear transformation matrices.

    import torch
    import torch.nn as nn

    class RNNClassifier(nn.Module):
        # i_n = feature vector dimensions
        # h_n = hidden dimensions
        # w_n = number of hidden neurons
        def __init__(self, i_n, h_n, w_n):
            # nn.Module must be initialized before submodules are assigned.
            super().__init__()
            self.L = nn.Linear(i_n, h_n)
            self.W_1 = nn.Linear(i_n + h_n, w_n)
            self.W_2 = nn.Linear(w_n, 1)
            self.Dropout = nn.Dropout(0.2)
            self.GRU = nn.GRUCell(i_n + 1, h_n)

        # Returns a new hidden vector h_{i+1}.
        # f_i includes both the feature vector
        # and the encoded time difference.
        def hidden_forward(self, h_i, f_i, A_i):
            return self.GRU(
                torch.cat((f_i, A_i), 1),
                h_i,
            )

        # Predicts P(A_{i+1}).
        def forward(self, h_k, f_i):
            cross_h_i = h_k * (1 + self.L(f_i))
            mlp = self.W_1(
                torch.cat((cross_h_i, f_i), 1),
            )
            mlp = torch.relu(self.Dropout(mlp))
            return torch.sigmoid(self.W_2(mlp))

Figure 3. Sample PyTorch code for key model definitions.

6.3  Loss Functions

For binary classification problems, log loss is the standard loss measurement function. For an individual access prediction $P(A_i)$ the log loss is defined as:

$$-[A_i \cdot \log(P(A_i)) + (1 - A_i) \cdot \log(1 - P(A_i))]$$

For traditional tasks, training is often optimized over the log loss averaged over all points in the training dataset. However, we find that this is suboptimal when comparing evaluation metrics over later days; it over-weights errors from predictions early in the sequence (when only a small number of access logs have been included in the hidden state). On the other hand, optimizing directly on later days (e.g. the last 7 days) appears to be suboptimal as well, possibly because the gradient is less stable for later elements in the sequence⁵. Empirically, we consistently find it is best to train on the log loss for the last 21 days out of the 30 days available for each user. We did not find weighting the loss with an exponential time decay to be of significant benefit.

⁵ https://en.wikipedia.org/wiki/Vanishing_gradient_problem

Each session is weighted equally, which does mean that users with more active sessions have greater influence over the model’s performance. Although this is desirable when considering computational costs as a whole, different weighting schemes can be explored if we wish to give more weight to inactive users as well.

7  RNN TRAINING

RNN models are trained using PyTorch v1.1⁶ with the Adam optimizer and a learning rate of 1e−3. We also include a dropout layer in the middle of the MLP set to 20% to prevent overfitting. Figure 3 displays sample PyTorch code for the key model definitions.

⁶ https://github.com/pytorch/pytorch/releases/tag/v1.1.0

Each dataset is randomly split into training and test groups by user, with 90% of users in the training group and 10% of users in the test group. We opted for a user-based split rather than a time-based split due to the limited number of days available; empirically this did not seem to introduce data leakage (validated through online results). Due to the small number of users in the MPU dataset, we used a k-fold cross-validation setup with $k = 4$ and trained a separate model on each split. Evaluation metrics are measured over the combined cross-validated predictions (from all 4 folds).

[Figure 4: plot for the MPU dataset. x-axis: sessions processed ($\times 10^7$, from 0 to 1.14); y-axis: log loss (0.55 to 0.65).]

Figure 4. Log loss vs. number of sessions processed (based on the full cross-validated dataset). Each vertical line represents the end of one epoch, with 8 epochs in total.

[Figure 5: histogram for the MPU dataset. x-axis: number of sessions (capped at 20,000); y-axis: number of users (0 to 40).]

Figure 5. Distribution of MPU session counts.

7.1  Minibatch Training

For MobileTab and Timeshift we can significantly increase training speed by using minibatch training with batches of 10 users. Note that for MPU this becomes ineffective due to the small number of users versus high number of sessions per user, and we fall back to processing users individually.

For each minibatch we compute predictions for the last 21 days and then calculate the average log loss over all prediction/label pairs. The loss gradient of each minibatch is then back-propagated to complete one training iteration. For the larger datasets, training converges in just one epoch, but for the MPU dataset a total of 8 epochs are required for convergence. Figure 4 displays the training log loss curves.

We find that a key optimization to speed up training is to evaluate minibatches via custom parallelism. While a standard approach is to pad user histories to a uniform length before evaluating them in batch, in practice the distribution of access history lengths has a very long tail (Figure 5). This results in an excessive amount of operations wasted on padding values. Instead, we can evaluate predictions and calculate gradients for each user on a separate thread and then accumulate the gradients afterwards. Models train twice as quickly with this approach versus the padded batch approach.

A final consideration for the MPU dataset is to truncate user histories to the most recent 10,000 sessions. This limits the effect of long tail users on training time without having a noticeable impact on model quality.

8  EVALUATION RESULTS

Unless otherwise noted, evaluation metrics are reported on the test dataset for each model. We are careful to use the same train / test split for all models (for MPU, the same 4-fold validation sets). We split based on users because it allows more historical data per user at training time. Another key consideration is the timeframe used for evaluation: evaluating on the full 30-day period does not accurately reflect real-world performance because the majority of users already have a full 30 days of history. For example, on any given day in the MobileTab dataset, less than 1% of sessions have no previous history in the previous 29 days. In practice, only 1% or fewer users do not already have previous logged history over the past 30 days. Therefore, we evaluate predictions on the last 7 days of testing data in each dataset to get a better estimate of performance in production.

As for the evaluation metric itself, the most important metrics in the predictive precompute domain are precision and recall. Precision corresponds to the percentage of precomputations that were followed by an actual access, while recall corresponds to the percentage of accesses that were successfully precomputed. In practice, improvements in recall are almost linearly correlated to reductions in application latency. Figure 6 shows the full precision-recall curve⁷ across all tested models for MobileTab.

⁷ As calculated via https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html

To obtain a single numerical metric for model comparison we use the area under the precision-recall curve (PR-AUC): Davis & Goadrich (2006) show that PR-AUC tends to be an effective measure when dealing with highly skewed datasets (MobileTab and Timeshift). Table 3 compares the PR-AUC across all tested models and datasets. When applying models in practice, we typically select a threshold that keeps wasted precomputations within an acceptable range (i.e. maximizing recall while constraining on precision; for example constraining precision to 50%). Table 4 compares the recall for each model at a fixed 50% precision, where the difference between models becomes more apparent for MobileTab and Timeshift.

[Figure 6: precision-recall curves. x-axis: recall (0 to 1); y-axis: precision (0 to 1); series: %Based, LR, GBDT, RNN.]

Figure 6. Precision-recall curve comparison for MobileTab.

Table 3. Comparison of PR-AUC values. The improvement percentage is calculated relative to the GBDT PR-AUC.

    MODEL               MOBILETAB    TIMESHIFT    MPU
    PERCENTAGE BASED    0.470        0.260        0.591
    LR                  0.546        0.290        0.683
    GBDT                0.578        0.311        0.686
    RNN                 0.596        0.335        0.767
    IMPROVEMENT         3.11%        7.72%        11.8%

Table 4. Comparison of recalls at 50% precision.

    MODEL               MOBILETAB    TIMESHIFT    MPU
    PERCENTAGE BASED    0.413        0.124        0.811
    LR                  0.596        0.153        0.906
    GBDT                0.616        0.176        0.917
    RNN                 0.642        0.209        0.977
    IMPROVEMENT         4.22%        18.8%        6.54%

9  ONLINE EXPERIMENTATION

While the offline experiments described above show clear improvements, we also have results from online experiments to verify that they carry over to production environments.

Table 5. Ablation study of feature engineering on GBDT models on the MPU dataset. A: time-based aggregations, E: time elapsed features, C: contextual features
    FEATURES    PR-AUC    RECALL@50%
    C           0.588     0.848
    E+C         0.642     0.883
    A+E+C       0.686     0.917
    RNN         0.767     0.977

For the MobileTab dataset, we productionized the RNN model to replace an existing production GBDT model as follows:

• The most recent hidden state for each user (a 128-element floating point vector) and session timestamp are stored in a real-time data store similar to Redis⁸.

• TorchScript⁹ versions of the MLP and GRU models are made available in a remote execution environment.

• At session startup time, the most recent hidden state along with the current context variables are retrieved and sent through the MLP part of the model to calculate an access probability $p$. We eagerly precompute and retrieve the tab contents if $p$ is greater than a fixed threshold, chosen to target a precision of 60%. This corresponds to a recall of about 51.1% in the RNN model vs. 47.4% in the GBDT model. This comes out to a 7.81% increase in “successful prefetches” (i.e. accesses that were successfully prefetched).

• Context variables are sent to a stream processing system similar to Apache Kafka¹⁰, tagged by a unique session ID. Tab accesses are also sent to the same system with a matching session ID. Events are buffered by session ID, and after a timer corresponding to the session length fires, the context $C_i$ and access flag $A_i$ are computed. We then retrieve the most recent hidden state for the user $h_i$ and execute the GRU part of the model to calculate and store a new hidden state.

⁸ https://redis.io
⁹ https://pytorch.org/docs/stable/jit.html
¹⁰ https://kafka.apache.org

We report several observations after monitoring the behavior of the productionized model over a period of about 90 days:

Relative production resources. RNN models are indeed more resource intensive; empirically the TorchScript model is about 9.5x more computationally intensive than a GBDT model. However, in practice, the most compute-intensive component is actually the serving of aggregate access percentages and time elapsed features, which requires about two orders of magnitude more compute than the model computation itself. One approach is to retrieve the 30-day access logs for each user to compute aggregations on the fly, but some users have hundreds or thousands of past accesses which makes this impractical. Instead, aggregations are computed using a stream processing service in combination with a key-value store. However, we still need to keep track of every combination of context values in order to serve context-dependent aggregations, which may result in thousands of unique keys per user. For example, MobileTab requires about 20 aggregation feature lookups for every individual prediction.

In contrast, with the RNN model we only need to make one key-value lookup to retrieve a 128-dimensional (512-byte) hidden vector for each prediction. By decreasing both the storage footprint and request volume, this reduces the overall serving computational cost by about 10x in practice.

Furthermore, if necessary, hidden states allow for very fine-grained control over resource usage via the hidden state dimensionality. In more resource constrained environments, we can easily train a model that has fewer hidden dimensions to trade off model quality for a smaller storage footprint per user. Neural network quantization methods can also be applied to store single bytes instead of floating-point numbers for each dimension.

[Figure 7: x-axis: days since experiment start (0 to 30); y-axis: PR-AUC; series: RNN, GBDT.]

Figure 7. Online PR-AUC for MobileTab.

invalidating all existing hidden states. Another approach may be to preserve the GRU parameters and hidden states and retrain only the MLP portion of the model, which is significantly faster to retrain.

Tradeoffs. While RNNs have a clear advantage in model performance and significantly reduce serving computational costs, the primary drawbacks are 1) the increased training time and 2) the increased amount of data required to train an effective model. The MPU dataset is a realistic baseline for the amount of data required, with $2 \times 10^6$ sessions, and takes about 10 hours to complete 8 training epochs. In contrast, the aggregation-based features described in Section 5.2 are very generically applicable to any predictive precompute use case, and GBDT models can be trained in minutes with just $10^4$ data points. Finally, hidden states are almost entirely “black-box” and are not easily explainable, whereas aggregation functions are easily human-interpretable.

10  CONCLUSION

We present a review of existing techniques that can be applied to predictive precompute problems as well as a selection of real-world datasets for comparison.

We demonstrate the novel use of recurrent neural network (RNN) models to achieve state-of-the-art results in this domain. In addition to achieving superior precision and recall metrics, RNNs significantly reduce the need for manual feature engineering due to the automatic encoding of historical information into hidden states.

We show that these advantages carry over to an online production environment, where models maintain consistent performance over an extended 90-day period. We highlight how RNN models can help decrease the computational cost of serving models by an order of magnitude by encoding all

Cold start behavior.
In our online experiment, we com-         prior history into a compact hidden state.\r\n               pared two groups of users starting with an empty history to    In closing, we hope that the techniques described in this pa-\r\n               comparethewarmupbehaviorbetweentheGBDTandRNN                   per make it easier for other applications to utilize predictive\r\n               models. We \ufb01nd that it takes about 14 days for the RNN         precompute.\r\n               modeltostabilize, and that it is consistently superior than\r\n               the GBDTmodel. Figure 7 displays the online PR-AUC for         10.1   Future Work\r\n               the \ufb01rst 30 days of the online experiment.\r\n               Long-term model quality and stability. Despite the fact        Reusable models.      The very simple percentage-based\r\n               that the training data only spans 30 days, we see that the     model described in 5.1 in some sense acts as a \u201cuniver-\r\n               empirical precision and recall are consistent with the results sal model\u201d that works as a solid baseline across all use cases\r\n               obtained from of\ufb02ine experiments, and continue to maintain     with almost no training. In a similar vein it may be pos-\r\n               the samelevelofquality(withnosignofdegradation)overa           sible to create a generic RNN-based model that uses only\r\n               90dayperiod. This suggests that the hidden states produced     past session timestamps and their access labels to produce\r\n               bythe RNNsarestable over long-term periods.                    high-quality estimates without any pre-training.\r\n               Retraining the model. While not tested in production, it       Interpretable hidden states. The hidden state model is\r\n               shouldbepossibletotrainnewerversionsoftheRNNmodel              practically a black box. 
Extracting interpretable relations\r\n               using the existing hidden states as the value for h , thus     from the hidden state could suggest ways of feature engi-\r\n                                                                   0          neering the dataset to enable the use of simpler models.\r\n               providing a path to replacing the production model without\r\n                                                    Predictive Precompute with Recurrent Neural Networks\r\n               REFERENCES                                                            Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cour-\r\n               Beutel, A., Covington, P., Jain, S., Xu, C., Li, J., Gatto, V.,        napeau, D., Brucher, M., Perrot, M., and Duchesnay, E.\r\n                  and Chi, E. H. Latent cross: Making use of context in               Scikit-learn: Machine learning in Python. Journal of\r\n                  recurrent recommender systems. In Proceedings of the               Machine Learning Research, 12:2825\u20132830, 2011.\r\n                  Eleventh ACM International Conference on Web Search                                                         `\r\n                                                                                   Pielot, M., Cardoso, B., Katevas, K., Serra, J., Matic, A., and\r\n                  and Data Mining, WSDM \u201918, pp. 46\u201354, New York,                     Oliver, N. Beyond interruptibility: Predicting opportune\r\n                  NY,USA,2018.ACM. ISBN978-1-4503-5581-0. doi:                        moments to engage mobile phone users. Proc. ACM\r\n                  10.1145/3159652.3159727. URL http://doi.acm.                        Interact. Mob. Wearable Ubiquitous Technol., 1(3):91:1\u2013\r\n                  org/10.1145/3159652.3159727.                                        91:25, September 2017. ISSN 2474-9567. doi: 10.1145/\r\n               Chen, T. and Guestrin, C. 
XGBoost: A scalable tree boost-              3130956. URLhttp://doi.acm.org/10.1145/\r\n                  ing system. In Proceedings of the 22nd ACM SIGKDD                   3130956.\r\n                  International Conference on Knowledge Discovery and              Sarker, I. H., Kayes, A. S. M., and Watters, P. Effectiveness\r\n                  Data Mining, KDD \u201916, pp. 785\u2013794, New York, NY,                    analysis of machine learning classi\ufb01cation models for\r\n                  USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi:                        predicting personalized context-aware smartphone usage.\r\n                  10.1145/2939672.2939785. URL http://doi.acm.                       Journal of Big Data, 6(1):57, Jul 2019. ISSN 2196-\r\n                  org/10.1145/2939672.2939785.                                       1115. doi: 10.1186/s40537-019-0219-y. URL https:\r\n                             \u00a8                                                       //doi.org/10.1186/s40537-019-0219-y.\r\n               Chung, J., Gulc\u00b8ehre, C\u00b8., Cho, K., and Bengio, Y. Empirical\r\n                  evaluationofgatedrecurrentneuralnetworksonsequence               Soh, H., Sanner, S., White, M., and Jamieson, G. Deep\r\n                  modeling. CoRR, abs/1412.3555, 2014. URL http:                      sequential recommendation for personalized adaptive\r\n                  //arxiv.org/abs/1412.3555.                                          user interfaces.   In Proceedings of the 22nd Interna-\r\n               Davis, J. and Goadrich, M.         The relationship between            tional Conference on Intelligent User Interfaces, IUI\r\n                  precision-recall and roc curves. In Proceedings of the             \u201917, pp. 589\u2013593, New York, NY, USA, 2017. ACM.\r\n                  23rd International Conference on Machine Learning,                  ISBN 978-1-4503-4348-0.          
doi:   10.1145/3025171.\r\n                  ICML \u201906, pp. 233\u2013240, New York, NY, USA, 2006.                     3025207. URLhttp://doi.acm.org/10.1145/\r\n                  ACM. ISBN 1-59593-383-2. doi: 10.1145/1143844.                      3025171.3025207.\r\n                  1143874. URLhttp://doi.acm.org/10.1145/                          Wang,Y., Liu, X., Chu, D., and Liu, Y. Earlybird: Mobile\r\n                  1143844.1143874.                                                    prefetching of social network feeds via content preference\r\n               Graves, A., Mohamed, A., and Hinton, G. E. Speech                      mining and usage pattern analysis. In Proceedings of the\r\n                  recognition with deep recurrent neural networks. CoRR,             16th ACM International Symposium on Mobile Ad Hoc\r\n                  abs/1303.5778, 2013.       URL http://arxiv.org/                   Networking and Computing, MobiHoc \u201915, pp. 67\u201376,\r\n                  abs/1303.5778.                                                      NewYork, NY, USA, 2015. ACM. ISBN 978-1-4503-\r\n                                                                                      3489-1. doi: 10.1145/2746285.2746312. URL http:\r\n                                                                 `\r\n               Katevas, K., Leontiadis, I., Pielot, M., and Serra, J. Contin-        //doi.acm.org/10.1145/2746285.2746312.\r\n                  ualpredictionofnoti\ufb01cationattendancewithclassicaland\r\n                  deep network approaches. CoRR, abs/1712.07120, 2017.             Zhang, S., Yao, L., and Sun, A. Deep learning based recom-\r\n                  URLhttp://arxiv.org/abs/1712.07120.                                 mendersystem: A survey and new perspectives. CoRR,\r\n                                                                                      abs/1707.07435, 2017. URL http://arxiv.org/\r\n               Palmer, O.      
How Does Page Load Time Impact En-                     abs/1707.07435.\r\n                  gagement?       Optimizely Blog, 2016.       URL https:\r\n                  //blog.optimizely.com/2016/07/13/\r\n                  how-does-page-load-time-impact-engagement/.\r\n                             \u00a8\r\n               Parate, A., Bohmer, M., Chu, D., Ganesan, D., and Marlin,\r\n                  B. M. Practical prediction and prefetch for faster access\r\n                  to applications on mobile phones. In Proceedings of the\r\n                  2013ACMInternationalJoint Conference on Pervasive\r\n                  and Ubiquitous Computing, UbiComp \u201913, pp. 275\u2013284,\r\n                  NewYork, NY, USA, 2013. ACM. ISBN 978-1-4503-\r\n                  1770-2. doi: 10.1145/2493432.2493490. URL http:\r\n                  //doi.acm.org/10.1145/2493432.2493490.\r\n               Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,\r\n                  Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,\r\n", "award": [], "sourceid": 179, "authors": [{"given_name": "Hanson", "family_name": "Wang", "institution": "Facebook"}, {"given_name": "Zehui", "family_name": "Wang", "institution": "Facebook"}, {"given_name": "Yuanyuan", "family_name": "Ma", "institution": "Facebook"}]}