Computer Science > Computation and Language

arXiv:2308.00319 (cs)

[Submitted on 1 Aug 2023 (v1), last revised 10 Jan 2024 (this version, v2)]

Title: LimeAttack: Local Explainable Method for Textual Hard-Label Adversarial Attack

Authors: Hai Zhu, Zhaoqing Yang, Weiwei Shang, Yuren Wu

Abstract: Natural language processing models are vulnerable to adversarial examples. Previous textual adversarial attacks use gradients or confidence scores to compute a word importance ranking and generate adversarial examples. However, this information is unavailable in the real world. Therefore, we focus on a more realistic and challenging setting, named hard-label attack, in which the attacker can only query the model and obtain a discrete prediction label. Existing hard-label attack algorithms tend to initialize adversarial examples by random substitution and then use complex heuristic algorithms to optimize the adversarial perturbation. These methods require many model queries, and their attack success rate is limited by the adversarial initialization. In this paper, we propose a novel hard-label attack algorithm named LimeAttack, which leverages a local explainable method to approximate the word importance ranking and then adopts beam search to find the optimal solution. Extensive experiments show that LimeAttack achieves better attack performance than existing hard-label attacks under the same query budget. In addition, we evaluate the effectiveness of LimeAttack on large language models, and the results indicate that adversarial examples remain a significant threat to large language models. The adversarial examples crafted by LimeAttack are highly transferable and effectively improve model robustness in adversarial training.
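The two stages named in the abstract (a word importance ranking built from hard labels only, followed by beam search over substitutions) can be illustrated with a small sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: the toy classifier, the synonym table, and the function names `word_importance` and `beam_attack` are hypothetical, and a simple difference-in-means estimate over randomly masked inputs stands in for the local explainable method's surrogate model.

```python
import random

POSITIVE = {"great", "good", "wonderful"}
NEGATIVE = {"bad", "awful", "dull"}

def toy_classifier(words):
    """Hypothetical victim model that exposes only a hard label (1 or 0)."""
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return 1 if score > 0 else 0

def word_importance(words, predict, n_samples=500, seed=0):
    """Estimate per-word importance from hard labels on randomly masked copies.

    Assumption: a difference in agreement rates (word kept vs. word dropped)
    replaces the weighted linear surrogate a LIME-style method would fit.
    """
    rng = random.Random(seed)
    original = predict(words)
    kept = [[0, 0] for _ in words]     # [agreements, count] when word i is kept
    dropped = [[0, 0] for _ in words]  # same, when word i is dropped
    for _ in range(n_samples):
        mask = [rng.random() < 0.5 for _ in words]
        sample = [w for w, keep in zip(words, mask) if keep]
        agree = int(predict(sample) == original)
        for i, keep in enumerate(mask):
            bucket = kept[i] if keep else dropped[i]
            bucket[0] += agree
            bucket[1] += 1
    return [k[0] / max(k[1], 1) - d[0] / max(d[1], 1)
            for k, d in zip(kept, dropped)]

def beam_attack(words, predict, synonyms, beam_width=2):
    """Substitute words in importance order, keeping a small beam of candidates."""
    original = predict(words)
    scores = word_importance(words, predict)
    order = sorted(range(len(words)), key=lambda i: scores[i], reverse=True)
    beam = [list(words)]
    for i in order:
        candidates = []
        for cand in beam:
            for sub in synonyms.get(words[i], []):
                new = cand[:i] + [sub] + cand[i + 1:]
                if predict(new) != original:
                    return new  # label flipped: adversarial example found
                candidates.append(new)
        # Hard labels give no score to rank survivors, so this sketch simply
        # prefers the newest candidates; the paper's criterion is not reproduced.
        beam = (candidates + beam)[:beam_width]
    return None

sentence = "the movie was great and the plot was good but the ending was dull".split()
synonyms = {"great": ["fine"], "good": ["okay"]}  # hypothetical synonym table
adversarial = beam_attack(sentence, toy_classifier, synonyms)
print(" ".join(adversarial))
```

In this toy setup a single substitution is enough to flip the label, so the beam never grows; with a real model the beam would carry several partially perturbed sentences forward, and the queries spent on the importance estimate would count against the attack's query budget.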

Comments:	18 pages, 38th AAAI Main Track
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2308.00319 [cs.CL]
	(or arXiv:2308.00319v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2308.00319

Submission history

From: Hai Zhu
[v1] Tue, 1 Aug 2023 06:30:37 UTC (671 KB)
[v2] Wed, 10 Jan 2024 13:26:18 UTC (458 KB)
