Qi Lei, Lingfei Wu, Pin-Yu Chen, Alex Dimakis, Inderjit S. Dhillon, Michael J Witbrock
In this paper, we formulate the adversarial attacks with discrete input as an optimization task on a set function. We prove that this set function is submodular for some popular neural network text classifiers. This finding guarantees a $1-1/e$ approximation factor with the greedy algorithm. Meanwhile, we show how to use the gradient of the attacked classifier to guide the greedy search. Empirical studies with our proposed optimization scheme show significantly improved attack ability and efficiency, on three different text classification tasks over various baselines. We also use a joint sentence and word paraphrasing technique to maintain the original semantics and syntax. This is validated by a human subject evaluation in subjective metrics on the quality and semantic coherence of our generated adversarial text.