{"title": "Solving Most Systems of Random Quadratic Equations", "book": "Advances in Neural Information Processing Systems", "page_first": 1867, "page_last": 1877, "abstract": "This paper deals with finding an $n$-dimensional solution $\\bm{x}$ to a system of quadratic equations $y_i=|\\langle\\bm{a}_i,\\bm{x}\\rangle|^2$, $1\\le i \\le m$, which in general is known to be NP-hard. We put forth a novel procedure, that starts with a \\emph{weighted maximal correlation initialization} obtainable with a few power iterations, followed by successive refinements based on \\emph{iteratively reweighted gradient-type iterations}. The novel techniques distinguish themselves from prior works by the inclusion of a fresh (re)weighting regularization. For certain random measurement models, the proposed procedure returns the true solution $\\bm{x}$ with high probability in time proportional to reading the data $\\{(\\bm{a}_i;y_i)\\}_{1\\le i \\le m}$, provided that the number $m$ of equations is some constant $c>0$ times the number $n$ of unknowns, that is, $m\\ge cn$. Empirically, the upshots of this contribution are: i) perfect signal recovery in the high-dimensional regime given only an \\emph{information-theoretic limit number} of equations; and, ii) (near-)optimal statistical accuracy in the presence of additive noise. Extensive numerical tests using both synthetic data and real images corroborate its improved signal recovery performance and computational efficiency relative to state-of-the-art approaches.", "full_text": "Solving Most Systems of Random\n\nQuadratic Equations\n\nGang Wang(cid:63),\u2217\n\nGeorgios B. Giannakis\u2217\n\nYousef Saad\u2020\n\nJie Chen(cid:63)\n\n(cid:63)Key Lab of Intell. Contr. and Decision of Complex Syst., Beijing Inst. of Technology\n\u2217Digital Tech. Center & Dept. of Electrical and Computer Eng., Univ. of Minnesota\n\n\u2020Department of Computer Science and Engineering, Univ. 
of Minnesota\n\n{gangwang, georgios, saad}@umn.edu; chenjie@bit.edu.cn.\n\nAbstract\n\nThis paper deals with finding an n-dimensional solution x to a system of quadratic equations y_i = |⟨a_i, x⟩|^2, 1 ≤ i ≤ m, which in general is known to be NP-hard. We put forth a novel procedure that starts with a weighted maximal correlation initialization obtainable with a few power iterations, followed by successive refinements based on iteratively reweighted gradient-type iterations. The novel techniques distinguish themselves from prior works by the inclusion of a fresh (re)weighting regularization. For certain random measurement models, the proposed procedure returns the true solution x with high probability in time proportional to reading the data {(a_i; y_i)}_{1≤i≤m}, provided that the number m of equations is some constant c > 0 times the number n of unknowns, that is, m ≥ cn. Empirically, the upshots of this contribution are: i) perfect signal recovery in the high-dimensional regime given only an information-theoretic limit number of equations; and, ii) (near-)optimal statistical accuracy in the presence of additive noise. Extensive numerical tests using both synthetic data and real images corroborate its improved signal recovery performance and computational efficiency relative to state-of-the-art approaches.\n\n1 Introduction\n\nOne is often faced with solving quadratic equations of the form y_i = |⟨a_i, x⟩|^2, or equivalently,\n\nψ_i = |⟨a_i, x⟩|, 1 ≤ i ≤ m   (1)\n\nwhere x ∈ R^n/C^n (hereafter, the symbol "A/B" denotes either A or B) is the wanted unknown n × 1 vector, while the given observations ψ_i and feature vectors a_i ∈ R^n/C^n are collectively stacked in the data vector ψ := [ψ_i]_{1≤i≤m} and the m × n sensing matrix A := [a_i]_{1≤i≤m}, respectively. 
Put differently, given information about the (squared) modulus of the inner products of the signal vector x with several known design vectors a_i, can one reconstruct exactly (up to a global phase factor) x, or alternatively, the missing phase of ⟨a_i, x⟩? In fact, much effort has been devoted to determining the number of such equations necessary and/or sufficient for the uniqueness of the solution x; see e.g., [1, 8]. It has been proved that m ≥ 2n − 1 (m ≥ 4n − 4) generic^1 (which includes the case of random vectors) real (complex) vectors a_i are sufficient for uniquely determining an n-dimensional real (complex) vector x [1, Theorem 2.8], [8], while in the real case m = 2n − 1 is shown to also be necessary [1]. In this sense, the number m = 2n − 1 of equations as in (1) can be regarded as the information-theoretic limit for such a quadratic system to be uniquely solvable.\n\n1. It is beyond the scope of the present paper to explain the meaning of generic vectors; interested readers are referred to [1].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn diverse physical sciences and engineering fields, it is impossible or very difficult to record phase measurements. The problem of recovering the signal or phase from magnitude measurements only, commonly known as phase retrieval, emerges naturally [10, 11]. Relevant application domains include, e.g., X-ray crystallography, astronomy, microscopy, ptychography, and coherent diffraction imaging [21]. In such setups, optical measurement and detection systems record solely the photon flux, which is proportional to the (squared) magnitude of the field, but not the phase. 
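The phaseless measurement model in (1) is simple to simulate. The following is a minimal NumPy sketch of the real Gaussian design; the sizes n, m and the seed are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 500                  # number of unknowns and of equations (illustrative)
x = rng.standard_normal(n)       # ground-truth signal, real Gaussian model
A = rng.standard_normal((m, n))  # sensing vectors a_i stacked as the rows of A
psi = np.abs(A @ x)              # amplitude observations psi_i = |<a_i, x>|
y = psi ** 2                     # intensity observations y_i = |<a_i, x>|^2
```

Note that x and -x produce identical observations, which is the global sign (phase) ambiguity discussed above.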
Problem (1) in its squared form, on the other hand, can be readily recast as an instance of nonconvex quadratically constrained quadratic programming, which subsumes as special cases several well-known combinatorial optimization problems involving Boolean variables, e.g., the NP-complete stone problem [2, Sec. 3.4.1]. A related task of this kind is that of estimating a mixture of linear regressions, where the latent membership indicators can be converted into the missing phases [29]. Although of simple form and of practical relevance across different fields, solving systems of nonlinear equations is arguably the most difficult problem in all of numerical computation [19, Page 355].\nNotation: Lower- (upper-) case boldface letters denote vectors (matrices), e.g., a ∈ R^n (A ∈ R^{m×n}). Calligraphic letters are reserved for sets. The floor operation ⌊c⌋ gives the largest integer no greater than the given real quantity c > 0, the cardinality |S| counts the number of elements in set S, and ‖x‖ denotes the Euclidean norm of x. 
Since for any phase φ ∈ R, the vectors x ∈ C^n and e^{jφ}x are indistinguishable given {ψ_i} in (1), let dist(z, x) := min_{φ∈[0,2π)} ‖z − xe^{jφ}‖ be the Euclidean distance of any estimate z ∈ C^n to the solution set {e^{jφ}x}_{0≤φ<2π} of (1); in particular, φ ∈ {0, π} in the real case.\n\n1.1 Prior contributions\n\nFollowing the least-squares (LS) criterion (which coincides with the maximum likelihood (ML) one assuming additive white Gaussian noise), the problem of solving quadratic equations can be naturally recast as an empirical loss minimization\n\nminimize_{z ∈ R^n/C^n} L(z) := (1/m) Σ_{i=1}^m ℓ(z; ψ_i/y_i)   (2)\n\nwhere one can choose to work with the amplitude-based loss ℓ(z; ψ_i) := (ψ_i − |⟨a_i, z⟩|)^2/2 [28, 30], the intensity-based one ℓ(z; y_i) := (y_i − |⟨a_i, z⟩|^2)^2/2 [3], or the related Poisson likelihood ℓ(z; y_i) := y_i log(|⟨a_i, z⟩|^2) − |⟨a_i, z⟩|^2 [7]. Either way, the objective functional L(z) is nonconvex; hence, it is generally NP-hard and computationally intractable to compute the ML or LS estimate.\nMinimizing the squared modulus-based LS loss in (2), several polynomial-time algorithms have been devised via convex programming for certain choices of design vectors a_i [4, 25]. Such convex paradigms first rely on the matrix-lifting technique to express all squared modulus terms as linear ones in a new rank-1 matrix variable, followed by solving a convex semidefinite program (SDP) after dropping the rank constraint. It has been established that perfect recovery and (near-)optimal statistical accuracy are achieved in noiseless and noisy settings, respectively, with an optimal-order number of measurements [4]. 
In terms of computational efficiency, however, such lifting-based convex approaches entail storing and solving for an n × n semidefinite matrix under m general SDP constraints, whose worst-case computational complexity scales as n^{4.5} log(1/ε) for m ≈ n [25], which is not scalable. Another recent line of convex relaxation [12], [13] reformulated the problem of phase retrieval as that of sparse signal recovery, and solved a linear program in the natural parameter vector domain. Although exact signal recovery can be established assuming an accurate enough anchor vector, its empirical performance is in general not competitive with state-of-the-art phase retrieval approaches.\nRecent proposals advocate suitably initialized iterative procedures for coping with certain nonconvex formulations directly; see e.g., the algorithms abbreviated as AltMinPhase, (R/P)WF, (M)TWF, and (S)TAF [16, 3, 7, 26, 28, 27, 30, 22, 6, 24], as well as a prox-linear algorithm [9]. These nonconvex approaches operate directly upon vector optimization variables, thus leading to significant computational advantages over their convex counterparts. With random features, they can be interpreted as performing stochastic optimization over acquired examples {(a_i; ψ_i/y_i)}_{1≤i≤m} to approximately minimize the population risk functional L(z) := E_{(a_i, ψ_i/y_i)}[ℓ(z; ψ_i/y_i)]. It is well documented that minimizing nonconvex functionals is generally intractable due to the existence of multiple critical points [17]. Assuming Gaussian sensing vectors, however, such nonconvex paradigms can provably locate the global optimum, and several of them also achieve optimal (statistical) guarantees. Specifically, starting with a judiciously designed initial guess, successive improvement is effected by means of a sequence of (truncated) (generalized) gradient-type iterations given by\n\nz^{t+1} := z^t − (μ_t/m) Σ_{i∈T^{t+1}} ∇ℓ(z^t; ψ_i/y_i), t = 0, 1, ...   (3)\n\nwhere z^t denotes the estimate returned by the algorithm at the t-th iteration, μ_t > 0 is a learning rate that can be pre-selected or found via, e.g., a backtracking line search strategy, and ∇ℓ(z^t; ψ_i/y_i) represents the (generalized) gradient of the modulus- or squared modulus-based LS loss evaluated at z^t. Here, T^{t+1} denotes some time-varying index set signifying the per-iteration gradient truncation.\nAlthough state-of-the-art (convex and nonconvex) approaches studied under Gaussian designs achieve optimal statistical guarantees in both noiseless and noisy settings, they empirically require, for stable recovery, a number of equations (several) times larger than the information-theoretic limit [7, 3, 30]. As a matter of fact, when there are numerous enough measurements (on the order of n up to some polylog factors), the squared modulus-based LS functional admits benign geometric structure in the sense that [23]: i) all local minimizers are also global; and, ii) there always exists a negative directional curvature at every saddle point. In a nutshell, the grand challenge in tackling systems of random quadratic equations remains to develop algorithms capable of achieving perfect recovery and statistical accuracy when the number of measurements approaches the information limit.\n\n1.2 This work\n\nBuilding upon but going beyond the scope of the aforementioned nonconvex paradigms, the present paper puts forward a novel iterative linear-time scheme, namely one running in time proportional to that required by the processor to scan all the data {(a_i; ψ_i)}_{1≤i≤m}, which we term reweighted amplitude flow and henceforth abbreviate as RAF. Our methodology is capable of solving noiseless random quadratic equations exactly, and yields an estimate of (near-)optimal statistical accuracy from noisy modulus observations. 
Exactness and accuracy hold with high probability and without extra assumptions on the unknown signal vector x, provided that the ratio m/n of the number of equations to that of the unknowns is larger than a certain constant. Empirically, our approach is shown able to ensure exact recovery of high-dimensional unstructured signals given a minimal number of equations, where m/n in the real case can be as small as 2. The new twist here is to leverage judiciously designed yet conceptually simple (re)weighting regularization techniques to enhance existing initializations and also gradient refinements. An informal depiction of our RAF methodology is given in two stages as follows, with rigorous details deferred to Section 3:\nS1) Weighted maximal correlation initialization: Obtain an initializer z^0 maximally correlated with a carefully selected subset S ⊊ M := {1, 2, ..., m} of feature vectors a_i, whose contributions toward constructing z^0 are judiciously weighted by suitable parameters {w_i^0 > 0}_{i∈S}.\nS2) Iteratively reweighted "gradient-like" iterations: Loop over 0 ≤ t ≤ T:\n\nz^{t+1} = z^t − (μ_t/m) Σ_{i=1}^m w_i^t ∇ℓ(z^t; ψ_i)   (4)\n\nfor some time-varying weighting parameters {w_i^t ≥ 0}, each possibly relying on the current iterate z^t and the datum (a_i; ψ_i).\nTwo attributes of the novel approach are worth highlighting next. First, albeit a variant of the spectral initialization devised in [28], the initialization here [cf. S1)] is distinct in that a different importance is attached to each selected datum (a_i; ψ_i). Likewise, the gradient flow [cf. S2)] judiciously weighs the search direction suggested by each datum (a_i; ψ_i). In this manner, more robust initializations and more stable overall search directions can be constructed even based solely on a rather limited number of data samples. Moreover, with particular choices of the weights w_i^t (e.g., taking 0/1 values), the developed methodology subsumes as special cases the recently proposed algorithms RWF [30] and TAF [28].\n\n2 Algorithm: Reweighted Amplitude Flow\n\nThis section explains the intuition and basic principles behind each stage of the advocated RAF algorithm in detail. For analytical concreteness, we focus on the real Gaussian model with x ∈ R^n and independent sensing vectors a_i ∈ R^n ∼ N(0, I) for all 1 ≤ i ≤ m. Nonetheless, the presented approach can be directly applied when the complex Gaussian and the coded diffraction pattern (CDP) models are considered.\n\n2.1 Weighted maximal correlation initialization\n\nA key enabler of general nonconvex iterative heuristics' success in finding the global optimum is to seed them with an excellent starting point [14]. Indeed, several smart initialization strategies have been advocated for iterative phase retrieval algorithms; see e.g., the spectral initialization [16], [3] as well as its truncated variants [7], [28], [9], [30], [15]. One promising approach is the one pursued in [28], which is also shown to be robust to outliers in [9]. To approach the information-theoretic limit, however, its performance may need further enhancement. Intuitively, it is increasingly challenging to improve the initialization (over the state-of-the-art) as the number of acquired data samples approaches the information-theoretic limit.\nIn this context, we develop a more flexible initialization scheme based on the correlation property (as opposed to the orthogonality in [28]), in which the added benefit is the inclusion of a flexible weighting regularization technique to better balance the useful information exploited in the selected data. 
Similar to related approaches of the same kind, our strategy entails estimating both the norm ‖x‖ and the direction x/‖x‖ of x. Leveraging the strong law of large numbers and the rotational invariance of the Gaussian vectors a_i (the latter suffices to assume x = ‖x‖e_1, with e_1 being the first canonical vector in R^n), it is clear that\n\n(1/m) Σ_{i=1}^m ψ_i^2 = (1/m) Σ_{i=1}^m |⟨a_i, ‖x‖e_1⟩|^2 = ((1/m) Σ_{i=1}^m a_{i,1}^2) ‖x‖^2 ≈ ‖x‖^2   (5)\n\nwhereby ‖x‖ can be estimated as sqrt(Σ_{i=1}^m ψ_i^2/m). This estimate proves very accurate even with a limited number of data samples because Σ_{i=1}^m a_{i,1}^2/m is unbiased and tightly concentrated.\nThe challenge thus lies in accurately estimating the direction of x, or seeking a unit vector maximally aligned with x. Toward this end, let us first present a variant of the initialization in [28]. Note that the larger the modulus ψ_i of the inner product between a_i and x is, the more correlated the known design vector a_i is deemed to be with the unknown solution x, hence bearing more useful directional information about x. Inspired by this fact and having available data {(a_i; ψ_i)}_{1≤i≤m}, one can sort all (absolute) correlation coefficients {ψ_i}_{1≤i≤m}, yielding ordered coefficients 0 < ψ_[m] ≤ ··· ≤ ψ_[2] ≤ ψ_[1]. Sorting m records takes time proportional to O(m log m).^2 Let S ⊊ M denote the set of selected feature vectors a_i to be used for computing the initialization, which is to be designed next. Fix a priori the cardinality |S| to some integer on the order of m, say, |S| := ⌊3m/13⌋. 
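The norm estimate in (5) and the selection of the index set S can be sketched in a few lines of NumPy; the helper names and the test sizes below are illustrative assumptions, with only the formulas taken from the text.

```python
import numpy as np

def norm_estimate(psi):
    # ||x|| ~= sqrt( (1/m) * sum_i psi_i^2 ), cf. (5); unbiased and concentrated
    return np.sqrt(np.mean(psi ** 2))

def select_set(psi, frac=3 / 13):
    # indices of the |S| = floor(3m/13) largest correlation coefficients psi_i
    card = int(np.floor(frac * len(psi)))
    return np.argsort(psi)[-card:]

rng = np.random.default_rng(1)
n, m = 200, 1000
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
psi = np.abs(A @ x)
S = select_set(psi)
```

With m/n = 5 as above, `norm_estimate(psi)` typically lands within a few percent of the true ‖x‖.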
It is then natural to define S to collect the a_i vectors that correspond to the largest |S| correlation coefficients {ψ_[i]}_{1≤i≤|S|}, each of which can be thought of as pointing to (roughly) the direction of x. Approximating the direction of x therefore boils down to finding a vector that maximizes its correlation with the subset S of selected directional vectors a_i. Succinctly, the wanted approximation vector can be efficiently found as the solution of\n\nmaximize_{‖z‖=1} (1/|S|) Σ_{i∈S} |⟨a_i, z⟩|^2 = z^*((1/|S|) Σ_{i∈S} a_i a_i^*) z   (6)\n\nwhere the superscript ^* represents the transpose or the conjugate transpose, as will be clear from the context. Upon scaling the unit-norm solution of (6) by the norm estimate sqrt(Σ_{i=1}^m ψ_i^2/m) obtained in (5), to match the magnitude of x, we develop what we henceforth refer to as the maximal correlation initialization.\nAs long as |S| is chosen on the order of m, the maximal correlation method outperforms the spectral ones in [3, 16, 7], and has comparable performance to the orthogonality-promoting method [28]. Its performance around the information limit, however, is still not the best that we can hope for. Recall from (6) that all selected directional vectors {a_i}_{i∈S} are treated the same in terms of their contributions to constructing the initialization. Nevertheless, according to our starting principle, the ordering information carried by the selected a_i vectors is not exploited by the initialization scheme in (6) and [28]. 
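The principal-eigenvector problem (6) can be solved with plain power iterations applied implicitly, never forming the n × n matrix. The sketch below is an assumption-laden illustration (function name, sizes, seed, and iteration count are all illustrative), returning only the unit-norm direction estimate.

```python
import numpy as np

def max_corr_direction(A, psi, frac=3 / 13, iters=100, seed=0):
    """Unweighted maximal correlation direction, cf. (6), via the power method.

    S holds the indices of the |S| largest amplitudes psi_i; the unit principal
    eigenvector of (1/|S|) * sum_{i in S} a_i a_i^T approximates x / ||x||.
    """
    m, n = A.shape
    S = np.argsort(psi)[-int(np.floor(frac * m)):]
    z = np.random.default_rng(seed).standard_normal(n)
    z /= np.linalg.norm(z)
    for _ in range(iters):
        z = A[S].T @ (A[S] @ z) / len(S)   # apply the PSD matrix implicitly
        z /= np.linalg.norm(z)
    return z
```

Each iteration costs O(n|S|), matching the complexity of the matrix-free power method discussed in the text.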
In other words, if for i, j ∈ S the correlation coefficient ψ_i of a_i is larger than the coefficient ψ_j of a_j, then a_i is deemed more correlated (with x) than a_j is, hence bearing more useful information about the direction of x. It is thus prudent to weight more heavily the selected a_i vectors associated with larger ψ_i values. Given the ordering information ψ_[|S|] ≤ ··· ≤ ψ_[2] ≤ ψ_[1] available from the sorting procedure, a natural way to achieve this goal is to weight each a_i vector with a simple monotonically increasing function of ψ_i, say, e.g., taking the weights w_i^0 := ψ_i^γ, ∀i ∈ S, with the exponent parameter γ ≥ 0 chosen to maintain the wanted ordering w_[|S|]^0 ≤ ··· ≤ w_[2]^0 ≤ w_[1]^0. In a nutshell, a more flexible initialization strategy, which we refer to as weighted maximal correlation, can be summarized as\n\nz̃^0 := arg max_{‖z‖=1} z^*((1/|S|) Σ_{i∈S} ψ_i^γ a_i a_i^*) z.   (7)\n\n2. f(m) = O(g(m)) means that there exists a constant C > 0 such that |f(m)| ≤ C|g(m)|.\n\nFor any given ε > 0, the power method or the Lanczos algorithm can be called for to find an ε-accurate solution to (7) in time proportional to O(n|S|) [20], assuming a positive eigengap between the largest and the second largest eigenvalues of the matrix (1/|S|) Σ_{i∈S} ψ_i^γ a_i a_i^*, which is often true when {a_i} are sampled from a continuous distribution. The proposed initialization is obtained upon scaling z̃^0 from (7) by the norm estimate in (5), to yield z^0 := sqrt(Σ_{i=1}^m ψ_i^2/m) z̃^0. By default, we take γ := 1/2 in all reported numerical implementations, yielding w_i^0 := sqrt(|⟨a_i, x⟩|) for all i ∈ S.\nRegarding the initialization procedure in (7), we next highlight two features, whereas technical details and theoretical performance guarantees are provided in Section 3:\nF1) The weights {w_i^0} in the maximal correlation scheme enable leveraging useful information that each feature vector a_i may bear regarding the direction of x.\nF2) Taking w_i^0 := ψ_i^γ for all i ∈ S and 0 otherwise, problem (7) can be equivalently rewritten as\n\nz̃^0 := arg max_{‖z‖=1} z^*((1/m) Σ_{i=1}^m w_i^0 a_i a_i^*) z   (8)\n\nwhich subsumes previous initialization schemes with particular selections of weights {w_i^0}. For instance, the spectral initialization in [16, 3] is recovered by choosing S := M and w_i^0 := ψ_i^2 for all 1 ≤ i ≤ m.\nFor comparison, define\n\nRelative error := dist(z, x)/‖x‖.\n\nThroughout the paper, all simulated results were averaged over 100 Monte Carlo (MC) realizations, and each simulated scheme was implemented with its pertinent default parameters. Figure 1 evaluates the performance of the developed initialization relative to several state-of-the-art strategies, with the information-limit number of data benchmarking the minimal number of samples required. It is clear that our initialization is: i) consistently better than the state-of-the-art; and, ii) stable as n grows, which is in contrast to the instability encountered by the spectral ones [16, 3, 7, 30]. It is worth stressing that the more than 5% empirical advantage (relative to the best) at the challenging information-theoretic benchmark is nontrivial, and is one of the main RAF upshots. This advantage becomes increasingly pronounced as the ratio m/n grows.\n\nFigure 1: Relative initialization error for i.i.d. a_i ∼ N(0, I_{1,000}), 1 ≤ i ≤ 1,999 (curves: reweighted maximal correlation, spectral initialization, truncated spectral in TWF, orthogonality promoting, and truncated spectral in RWF).\n\n2.2 Iteratively reweighted gradient flow\n\nFor independent data obeying the real Gaussian model, the direction that TAF moves along in stage S2) presented earlier is given by the following (generalized) gradient [28]:\n\n(1/m) Σ_{i∈T} ∇ℓ(z; ψ_i) = (1/m) Σ_{i∈T} (a_i^* z − ψ_i (a_i^* z)/|a_i^* z|) a_i   (9)\n\nwhere the dependence on the iteration count t is omitted for notational brevity, and the convention a_i^* z/|a_i^* z| := 0 is adopted when a_i^* z = 0.\nUnfortunately, the (negative) gradient of the average in (9) generally may not point toward the true solution x unless the current iterate z is already very close to x. Therefore, moving along such a descent direction may not drag z closer to x. To see this, consider an initial guess z^0 that already lies in a basin of attraction (i.e., a region within which there is only a unique stationary point) of x. Certainly, there are summands (a_i^* z − ψ_i (a_i^* z)/|a_i^* z|) a_i in (9) that could give rise to "bad/misleading" gradient directions due to the erroneously estimated signs a_i^* z/|a_i^* z| ≠ a_i^* x/|a_i^* x| [28], or (a_i^* z)(a_i^* x) < 0 [30]. 
Those gradients as a whole may drag z away from x, and hence out of the basin of attraction. Such an effect becomes increasingly severe as m approaches the information-theoretic limit of 2n − 1, thus rendering past approaches less effective in this case. Although this issue is somewhat remedied by TAF with a truncation procedure, its efficacy is limited due to misses of bad gradients and mis-rejections of meaningful ones around the information limit.\nTo address this challenge, reweighted amplitude flow adopts suitable gradient directions from all data samples {(a_i; ψ_i)}_{1≤i≤m} in a (timely) adaptive fashion, namely by introducing appropriate weights for all gradients to yield the update\n\nz^{t+1} = z^t − μ_t ∇ℓ_rw(z^t), t = 0, 1, ...   (10)\n\nThe reweighted gradient ∇ℓ_rw(z^t) evaluated at the current point z^t is given as\n\n∇ℓ_rw(z) := (1/m) Σ_{i=1}^m w_i ∇ℓ(z; ψ_i)   (11)\n\nfor suitable weights {w_i}_{1≤i≤m} to be designed next.\nTo that end, we observe that the truncation criterion [28]\n\nT := {1 ≤ i ≤ m : |a_i^* z|/|a_i^* x| ≥ α}   (12)\n\nwith some given parameter α > 0 suggests including only gradient components associated with |a_i^* z| of relatively large size. This is because gradients with sizable |a_i^* z|/|a_i^* x| offer reliable and meaningful directions pointing to the truth x with large probability [28]. As such, the ratio |a_i^* z|/|a_i^* x| can be somewhat viewed as a confidence score about the reliability or meaningfulness of the corresponding gradient ∇ℓ(z; ψ_i). Recognizing that confidence can vary, it is natural to distinguish the contributions that different gradients make to the overall search direction. 
An easy way is to attach large weights to the reliable gradients, and small weights to the spurious ones. Assume without loss of generality that 0 ≤ w_i ≤ 1 for all 1 ≤ i ≤ m; otherwise, lump the normalization factor achieving this into the learning rate μ_t. Building upon this observation and leveraging the gradient reliability confidence score |a_i^* z|/|a_i^* x|, the weight per gradient ∇ℓ(z; ψ_i) in RAF is designed to be\n\nw_i := 1/(1 + β_i/(|a_i^* z|/|a_i^* x|)), 1 ≤ i ≤ m   (13)\n\nin which {β_i > 0}_{1≤i≤m} are some pre-selected parameters.\nRegarding the proposed weighting criterion in (13), three remarks are in order, followed by the RAF algorithm summarized in Algorithm 1.\nR1) The weights {w_i^t}_{1≤i≤m} are adapted over time to z^t. One can also interpret the reweighted gradient flow z^{t+1} in (10) as performing a single gradient step to minimize the smooth reweighted loss (1/m) Σ_{i=1}^m w_i^t ℓ(z; ψ_i) with starting point z^t; see also [4] for related ideas successfully exploited in the iteratively reweighted least-squares approach to compressive sampling.\nR2) Note that the larger |a_i^* z|/|a_i^* x| is, the larger w_i will be. More importance will be attached to reliable gradients than to spurious ones. Gradients from almost all data points are judiciously accounted for, which is in sharp contrast to [28], where withdrawn gradients do not contribute the information they carry.\nR3) At the points {z} where a_i^* z = 0 for certain i ∈ M, the corresponding weight will be w_i = 0. That is, the losses ℓ(z; ψ_i) in (2) that are nonsmooth at such points z will be eliminated, to prevent their contribution to the reweighted gradient update in (10). 
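One reweighted gradient iteration can be sketched as follows. This is a minimal real-valued NumPy illustration (function name and defaults are assumptions); as in the practical algorithm, ψ_i stands in for the unknown |a_i^* x| in the confidence score, and the weight is written so that w_i = 0 whenever a_i^T z = 0, matching remark R3.

```python
import numpy as np

def raf_step(z, A, psi, mu=2.0, beta=10.0):
    """One iteratively reweighted gradient iteration, cf. (10), (11), (13).

    Assumes psi_i > 0; uses the confidence score |a_i^T z| / psi_i to weight
    each amplitude-based gradient (a_i^T z - psi_i * sign(a_i^T z)) a_i.
    """
    m = len(psi)
    Az = A @ z
    ratio = np.abs(Az) / psi
    w = ratio / (ratio + beta)        # = 1 / (1 + beta/ratio); equals 0 at ratio = 0
    grad = A.T @ (w * (Az - psi * np.sign(Az))) / m
    return z - mu * grad
```

Note that `np.sign(0) == 0`, which realizes the convention a_i^* z/|a_i^* z| := 0 adopted for the generalized gradient.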
Hence, the convergence analysis of RAF can be considerably simplified because it does not have to cope with the nonsmoothness of the objective function in (2).\n\n2.3 Algorithmic parameters\n\nTo optimize the empirical performance and facilitate numerical implementations, the choice of the pertinent algorithmic parameters of RAF is discussed here separately. The RAF algorithm entails four parameters. Our theory and all experiments are based on: i) |S|/m ≤ 0.25; ii) 0 ≤ β_i ≤ 10 for all 1 ≤ i ≤ m; and, iii) 0 ≤ γ ≤ 1. For convenience, a constant step size μ_t ≡ μ > 0 is suggested, but other step size rules such as backtracking line search with the reweighted objective work as well. As will be formalized in Section 3, RAF converges if the constant μ is not too large, with the upper bound depending in part on the selection of {β_i}_{1≤i≤m}. In the numerical tests presented in Sections 2 and 4, we take\n\n|S| := ⌊3m/13⌋, β_i ≡ β := 10, γ := 0.5, and μ := 2   (14)\n\nwhile larger step sizes μ > 0 can be afforded for larger m/n values.\n\nAlgorithm 1 Reweighted Amplitude Flow\n1: Input: Data {(a_i; ψ_i)}_{1≤i≤m}; maximum number of iterations T; step size μ_t = 2/6 and weighting parameter β_i = 10/5 for the real/complex Gaussian model; |S| = ⌊3m/13⌋, and γ = 0.5.\n2: Construct S to include the indices associated with the |S| largest entries among {ψ_i}_{1≤i≤m}.\n3: Initialize z^0 := sqrt(Σ_{i=1}^m ψ_i^2/m) z̃^0, with z̃^0 being the unit principal eigenvector of\n\nY := (1/m) Σ_{i=1}^m w_i^0 a_i a_i^*, where w_i^0 := ψ_i^γ if i ∈ S ⊆ M, and 0 otherwise, for all 1 ≤ i ≤ m.   (15)\n\n4: Loop: for t = 0 to T − 1,\n\nz^{t+1} = z^t − (μ_t/m) Σ_{i=1}^m w_i^t (a_i^* z^t − ψ_i (a_i^* z^t)/|a_i^* z^t|) a_i, where w_i^t := (|a_i^* z^t|/ψ_i)/(|a_i^* z^t|/ψ_i + β_i) for all 1 ≤ i ≤ m.   (16)\n\n5: Output: z^T.\n\n3 Main results\n\nOur main results, summarized in Theorem 1 next, establish exact recovery under the real Gaussian model; the proof is provided in the supplementary material. Our RAF approach, however, can be readily generalized to the complex Gaussian and CDP models.\nTheorem 1 (Exact recovery) Consider m noiseless measurements ψ = |Ax| for an arbitrary x ∈ R^n. If the data size m ≥ c_0|S| ≥ c_1 n and the step size μ ≤ μ_0, then with probability at least 1 − c_3 e^{−c_2 m}, the reweighted amplitude flow's estimates z^t in Algorithm 1 obey\n\ndist(z^t, x) ≤ (1/10)(1 − ν)^t ‖x‖, t = 0, 1, ...   (17)\n\nwhere c_0, c_1, c_2, c_3 > 0, 0 < ν < 1, and μ_0 > 0 are certain numerical constants depending on the choice of the algorithmic parameters |S|, β, γ, and μ.\nAccording to Theorem 1, a few interesting properties of our RAF algorithm are worth highlighting. To start, RAF recovers the true solution exactly with high probability whenever the ratio m/n of the number of equations to the number of unknowns exceeds some numerical constant. Expressed differently, RAF achieves the information-theoretically optimal order of sample complexity, which is consistent with the state-of-the-art including TWF [7], TAF [28], and RWF [30]. Notice that (17) also holds at t = 0, namely, dist(z^0, x) ≤ ‖x‖/10, therefore providing performance guarantees for the proposed initialization scheme (cf. Step 3 in Algorithm 1). Moreover, starting from this initial estimate, RAF converges linearly to the true solution x. 
That is, to reach any ε-relative solution accuracy (i.e., dist(z_T, x) ≤ ε\|x\|), it suffices to run at most T = O(log(1/ε)) RAF iterations (cf. Step 4). In conjunction with the per-iteration complexity O(mn), this confirms that RAF solves a quadratic system exactly in time O(mn log(1/ε)), which is linear in O(mn), the time required to read the entire data {(a_i; ψ_i)}_{1≤i≤m}. Given that the initialization stage can be performed in time O(n|S|) with |S| < m, the overall linear-time complexity of RAF is order-optimal.

4 Simulated tests

Our theoretical findings about RAF have been corroborated with comprehensive numerical tests, a sample of which is discussed next. Performance of RAF is evaluated relative to the state-of-the-art (T)WF, RWF, and TAF in terms of the empirical success rate over 100 Monte Carlo (MC) trials, where a trial is declared a success if the returned estimate z_T incurs a relative error

    \| ψ − |A z_T| \| / \|x\| ≤ 10^{−5},

where the modulus operator |·| is understood element-wise.

Figure 2: Function value L(z_T) by RAF for 100 MC realizations when m = 2n − 1.

The real Gaussian model and the physically realizable CDPs were simulated in this section. For fairness, all schemes were implemented with their suggested parameter values. The true signal vector x was randomly generated as x ∼ N(0, I), and the i.i.d. sensing vectors as a_i ∼ N(0, I). Each scheme obtained its initial guess from 200 power iterations, followed by a series of T = 2,000 (truncated/reweighted) gradient iterations. All experiments were performed using MATLAB on an Intel CPU @ 3.4 GHz (32 GB RAM) computer.
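The success criterion above, together with the additive-noise calibration described next for the NMSE curves, can be sketched in a few lines of NumPy. This is an illustrative harness only; the helper names `is_success` and `noisy_measurements` are ours.

```python
import numpy as np

def is_success(A, psi, zT, x, tol=1e-5):
    # A trial counts as a success if || psi - |A zT| || / ||x|| <= tol,
    # with the modulus |.| applied element-wise.
    return np.linalg.norm(psi - np.abs(A @ zT)) / np.linalg.norm(x) <= tol

def noisy_measurements(A, x, snr_db, rng=None):
    # psi_i = |<a_i, x>| + eta_i with eta ~ N(0, sigma^2 I_m), where sigma^2
    # is set so that SNR := 10 log10(||Ax||^2 / (m sigma^2)) equals snr_db.
    rng = np.random.default_rng() if rng is None else rng
    m = A.shape[0]
    Ax = A @ x
    sigma = np.sqrt(np.linalg.norm(Ax) ** 2 / (m * 10.0 ** (snr_db / 10.0)))
    return np.abs(Ax) + sigma * rng.standard_normal(m)
```

Because the criterion compares amplitudes, an estimate matching x up to global sign (the inherent ambiguity of the model) passes the test, as it should.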
For reproducibility, the MATLAB code of the RAF algorithm is publicly available at https://gangwg.github.io/RAF/.

To demonstrate the power of RAF in the high-dimensional regime, the function value L(z) in (2) evaluated at the returned estimate z_T for 100 independent trials is plotted (in negative logarithmic scale) in Fig. 2, where m = 2n − 1 = 9,999. It is self-evident that RAF succeeded in all trials even at this challenging information limit. To the best of our knowledge, RAF is the first algorithm that empirically recovers any solution exactly from a minimal number of random quadratic equations. The left panel in Fig. 3 further compares the empirical success rates of five schemes under the real Gaussian model with n = 1,000 and m/n varying from 1 to 5 in steps of 0.1. Evidently, the developed RAF achieves perfect recovery as soon as m is about 2n, where its competing alternatives do not work well. To demonstrate the stability and robustness of RAF in the presence of additive noise, the right panel in Fig. 3 depicts the normalized mean-square error

    NMSE := dist^2(z_T, x) / \|x\|^2

as a function of the signal-to-noise ratio (SNR) for m/n taking values {3, 4, 5}. The noise model

    ψ_i = |⟨a_i, x⟩| + η_i,  1 ≤ i ≤ m

with η := [η_i]_{1≤i≤m} ∼ N(0, σ^2 I_m) was employed, where σ^2 was set such that prescribed SNR := 10 log_{10}(\|Ax\|^2 / (m σ^2)) values on the x-axis were achieved.

To examine the efficacy and scalability of RAF under real-world conditions, the last experiment entails the Galaxy image³ depicted by a three-way array X ∈ R^{1,080×1,920×3}, whose first two coordinates encode the pixel locations and the third the RGB color bands. Consider the physically realizable CDP model with random masks [3].
Letting x ∈ R^n (n ≈ 2 × 10^6) be a vectorization of a certain band of X, the CDP model with K masks is

    ψ^{(k)} = |F D^{(k)} x|,  1 ≤ k ≤ K,

where F ∈ C^{n×n} is a DFT matrix, and the diagonal matrices D^{(k)} have their diagonal entries sampled uniformly at random from {1, −1, j, −j} with j := \sqrt{−1}. Each D^{(k)} represents a random mask placed after the object to modulate the illumination patterns [5]. Implementing K = 4 masks, each algorithm independently performs, over each band, 100 power iterations for an initial guess, which is refined by 100 gradient iterations. Recovered images of TAF (left) and RAF (right) are displayed in Fig. 4, whose relative errors were 1.0347 × 10^{−3} and 1.0715 × 10^{−3}, respectively. WF and TWF returned images with corresponding relative errors of 1.6870 and 1.4211, which are far from the ground truth.

³Downloaded from http://pics-about-space.com/milky-way-galaxy.

Figure 3: Real Gaussian model: Empirical success rate (Left); and, Relative MSE vs. SNR (Right).

Figure 4: Recovered Galaxy images after 100 gradient iterations of TAF (Left); and of RAF (Right).

5 Conclusion

This paper developed a linear-time algorithm called RAF for solving systems of random quadratic equations. Our procedure consists of two stages: a weighted maximal correlation initializer attainable with a few power or Lanczos iterations, and a sequence of scalable reweighted gradient refinements of a nonconvex, nonsmooth LS loss function. It was demonstrated that RAF achieves the optimal sample and computational complexity. Judicious numerical tests showcase its superior performance over state-of-the-art alternatives. Empirically, RAF solves a set of random quadratic equations with high probability so long as a unique solution exists.
Promising extensions include studying robust and/or sparse phase retrieval and matrix recovery via (stochastic) reweighted amplitude flow counterparts, and, in particular, exploiting the power of (re)weighting regularization techniques to enable more general nonconvex optimization, such as training deep neural networks [18].

Acknowledgments

G. Wang and G. B. Giannakis were partially supported by NSF grants 1500713 and 1514056. Y. Saad was partially supported by NSF grant 1505970. J. Chen was partially supported by the National Natural Science Foundation of China grants U1509215 and 61621063, and the Program for Changjiang Scholars and Innovative Research Team in University (IRT1208).

References

[1] R. Balan, P. Casazza, and D. Edidin, "On signal reconstruction without phase," Appl. Comput. Harmon. Anal., vol. 20, no. 3, pp. 345–356, May 2006.

[2] A. Ben-Tal and A. Nemirovski, Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. SIAM, 2001, vol. 2.

[3] E. J. Candès, X. Li, and M. Soltanolkotabi, "Phase retrieval via Wirtinger flow: Theory and algorithms," IEEE Trans. Inf. Theory, vol. 61, no. 4, pp. 1985–2007, Apr. 2015.

[4] E. J. Candès, T. Strohmer, and V. Voroninski, "PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming," Commun. Pure Appl. Math., vol. 66, no. 8, pp. 1241–1274, Nov. 2013.

[5] E. J. Candès, X. Li, and M. Soltanolkotabi, "Phase retrieval from coded diffraction patterns," Appl. Comput. Harmon. Anal., vol. 39, no. 2, pp. 277–299, Sept. 2015.

[6] J. Chen, L. Wang, X. Zhang, and Q.
Gu, "Robust Wirtinger flow for phase retrieval with arbitrary corruption," arXiv:1704.06256, 2017.

[7] Y. Chen and E. J. Candès, "Solving random quadratic systems of equations is nearly as easy as solving linear systems," in Adv. on Neural Inf. Process. Syst., Montréal, Canada, 2015, pp. 739–747.

[8] A. Conca, D. Edidin, M. Hering, and C. Vinzant, "An algebraic characterization of injectivity in phase retrieval," Appl. Comput. Harmon. Anal., vol. 38, no. 2, pp. 346–356, Mar. 2015.

[9] J. C. Duchi and F. Ruan, "Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval," arXiv:1705.02356, 2017.

[10] J. R. Fienup, "Phase retrieval algorithms: A comparison," Appl. Opt., vol. 21, no. 15, pp. 2758–2769, Aug. 1982.

[11] R. W. Gerchberg and W. O. Saxton, "A practical algorithm for the determination of phase from image and diffraction plane pictures," Optik, vol. 35, pp. 237–246, Nov. 1972.

[12] T. Goldstein and C. Studer, "PhaseMax: Convex phase retrieval via basis pursuit," arXiv:1610.07531v1, 2016.

[13] P. Hand and V. Voroninski, "An elementary proof of convex phase retrieval in the natural parameter space via the linear program PhaseMax," arXiv:1611.03935, 2016.

[14] R. H. Keshavan, A. Montanari, and S. Oh, "Matrix completion from a few entries," IEEE Trans. Inf. Theory, vol. 56, no. 6, pp. 2980–2998, Jun. 2010.

[15] Y. M. Lu and G. Li, "Phase transitions of spectral initialization for high-dimensional nonconvex estimation," arXiv:1702.06435, 2017.

[16] P. Netrapalli, P. Jain, and S. Sanghavi, "Phase retrieval using alternating minimization," in Adv. on Neural Inf. Process. Syst., Stateline, NV, 2013, pp. 2796–2804.

[17] P. M. Pardalos and S. A.
Vavasis, "Quadratic programming with one negative eigenvalue is NP-hard," J. Global Optim., vol. 1, no. 1, pp. 15–22, 1991.

[18] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton, "Regularizing neural networks by penalizing confident output distributions," arXiv:1701.06548, 2017.

[19] J. R. Rice, Numerical Methods in Software and Analysis. Academic Press, 1992.

[20] Y. Saad, Numerical Methods for Large Eigenvalue Problems: Revised Edition. SIAM, 2011.

[21] Y. Shechtman, Y. C. Eldar, O. Cohen, H. N. Chapman, J. Miao, and M. Segev, "Phase retrieval with application to optical imaging: A contemporary overview," IEEE Signal Process. Mag., vol. 32, no. 3, pp. 87–109, May 2015.

[22] M. Soltanolkotabi, "Structured signal recovery from quadratic measurements: Breaking sample complexity barriers via nonconvex optimization," arXiv:1702.06175, 2017.

[23] J. Sun, Q. Qu, and J. Wright, "A geometric analysis of phase retrieval," Found. Comput. Math., 2017 (to appear); see also arXiv:1602.06664, 2016.

[24] I. Waldspurger, "Phase retrieval with random Gaussian sensing vectors by alternating projections," arXiv:1609.03088, 2016.

[25] I. Waldspurger, A. d'Aspremont, and S. Mallat, "Phase recovery, MaxCut and complex semidefinite programming," Math. Program., vol. 149, no. 1, pp. 47–81, 2015.

[26] G. Wang and G. B. Giannakis, "Solving random systems of quadratic equations via truncated generalized gradient flow," in Adv. on Neural Inf. Process. Syst., Barcelona, Spain, 2016, pp. 568–576.

[27] G. Wang, G. B. Giannakis, and J. Chen, "Scalable solvers of random quadratic equations via stochastic truncated amplitude flow," IEEE Trans. Signal Process., vol. 65, no. 8, pp. 1961–1974, Apr. 2017.

[28] G. Wang, G. B. Giannakis, and Y. C.
Eldar, "Solving systems of random quadratic equations via truncated amplitude flow," IEEE Trans. Inf. Theory, 2017 (to appear); see also arXiv:1605.08285, 2016.

[29] X. Yi, C. Caramanis, and S. Sanghavi, "Alternating minimization for mixed linear regression," in Proc. Intl. Conf. on Mach. Learn., Beijing, China, 2014, pp. 613–621.

[30] H. Zhang, Y. Zhou, Y. Liang, and Y. Chi, "Reshaped Wirtinger flow and incremental algorithm for solving quadratic system of equations," J. Mach. Learn. Res., 2017 (to appear); see also arXiv:1605.07719, 2016.