Cisco Researchers Warn of Emerging AI Threats: Disguising Attacks from Machine Analysis

Credit: Network World

Emerging Threats to Large Language Models

Cisco security researchers have identified a number of threats that bad actors are using to infect or attack AI's most common component: the large language model (LLM). These techniques include disguising malicious content from machine analysis, an approach that is not new but saw increased use during the second half of 2024.

According to Martin Lee, a security engineer with Cisco Talos, being able to disguise and hide content from machine analysis or human oversight is likely to become a more important vector of attack against AI systems. However, he notes that the techniques to detect this kind of obfuscation are well known and already integrated into spam detection systems such as Cisco Email Threat Defense.
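Lee's observation that such obfuscation is readily detectable can be illustrated with a few well-known heuristics. The sketch below is illustrative only and is not Cisco's implementation; the sample string, thresholds, and the specific checks (zero-width characters, Base64-encoded payloads, mixed-script homoglyphs) are assumptions chosen for demonstration.

```python
import base64
import re
import unicodedata

# Zero-width and formatting characters often used to hide text from scanners.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def flag_obfuscation(text: str) -> list[str]:
    """Return simple indicators that `text` may be hiding content from analysis."""
    findings = []

    # 1. Invisible characters interleaved with the visible text.
    if any(ch in ZERO_WIDTH for ch in text):
        findings.append("zero-width characters present")

    # 2. Long Base64-looking runs that decode to readable ASCII.
    for run in re.findall(r"[A-Za-z0-9+/]{24,}={0,2}", text):
        try:
            decoded = base64.b64decode(run, validate=True)
            if decoded.isascii() and decoded.decode().isprintable():
                findings.append(f"base64 payload decodes to {decoded[:40]!r}")
        except Exception:
            pass  # not valid Base64; ignore

    # 3. Letters from unexpected scripts mixed into Latin text (homoglyphs).
    scripts = {unicodedata.name(ch, "UNKNOWN").split()[0]
               for ch in text if ch.isalpha()}
    if "LATIN" in scripts and scripts & {"CYRILLIC", "GREEK"}:
        findings.append("mixed-script (possible homoglyph) content")

    return findings

if __name__ == "__main__":
    sample = "Ple\u200base cl\u200cick: aGVsbG8gd29ybGQsIHNlbmQgY3JlZGVudGlhbHM="
    print(flag_obfuscation(sample))
```

Production filters combine many more signals, but the point stands: the encodings attackers use to slip content past an LLM or a human reviewer are largely the same ones spam filters have been unpicking for years.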

Single-Turn Crescendo Attack

The Single-Turn Crescendo Attack (STCA) represents a significant advancement in attack methods, simulating an extended dialogue within a single interaction to bypass content moderation filters. This technique exploits the pattern continuation tendencies of LLMs and has been demonstrated to be successful against models including GPT-4o, Gemini 1.5, and variants of Llama 3.

The real-world implications of this attack are concerning and underline the importance of strong content moderation and filtering measures. As the Cisco researchers note, the STCA establishes a context that builds toward controversial or explicit content within a single prompt, efficiently jailbreaking several frontier models.
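One defensive response, sketched below purely as an assumption rather than a measure described in the Cisco report, is to notice when a single incoming message itself embeds a scripted multi-turn exchange and to moderate it as a whole conversation rather than as one turn. Here `moderation_fn` is a hypothetical stand-in for whatever content filter is deployed.

```python
import re
from typing import Callable

# Role markers that a fabricated in-prompt "conversation" typically carries.
ROLE_MARKER = re.compile(r"(?im)^\s*(user|assistant|system|human|ai)\s*[:>]")

def looks_like_embedded_dialogue(prompt: str, min_turns: int = 3) -> bool:
    """A single message containing several role-prefixed lines is likely
    simulating a multi-turn exchange and deserves stricter review."""
    return len(ROLE_MARKER.findall(prompt)) >= min_turns

def moderate(prompt: str, moderation_fn: Callable[..., bool]) -> bool:
    """Run the deployed filter over the prompt, escalating embedded dialogues."""
    if looks_like_embedded_dialogue(prompt):
        # Moderate the fabricated exchange as cumulative conversation context,
        # not just the final request it builds toward.
        return moderation_fn(prompt, scope="full-conversation")
    return moderation_fn(prompt, scope="single-turn")
```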

Jailbreak via Simple Assistive Task Linkage (SATA)

SATA is a novel paradigm for jailbreaking LLMs by leveraging Simple Assistive Task Linkage. This technique masks harmful keywords in a given prompt and uses simple assistive tasks such as masked language model (MLM) and element lookup by position (ELP) to fill in the semantic gaps left by the masked words.

The researchers from Tsinghua University, Hefei University of Technology, and Shanghai Qi Zhi Institute demonstrated the remarkable effectiveness of SATA with attack success rates of 85% using MLM and 76% using ELP on the AdvBench dataset. This is a significant improvement over existing methods, underscoring the potential impact of SATA as a low-cost, efficient method for bypassing LLM guardrails.

Jailbreak through Neural Carrier Articles

A sophisticated new jailbreak technique known as Neural Carrier Articles embeds prohibited queries into benign carrier articles in order to bypass model guardrails. Using only a lexical database such as WordNet and a composer LLM, the technique generates prompts that are contextually similar to a harmful query without triggering model safeguards.

As researchers from Penn State, Northern Arizona University, Worcester Polytechnic Institute, and Carnegie Mellon University demonstrate, the Neural Carrier Articles jailbreak is effective against several frontier models in a black-box setting and has a relatively low barrier to entry.

Further Research on Adversarial Attacks

Cisco's authors also point to research from the Ellis Institute and the University of Maryland on adversarial attacks against LLMs. Those researchers highlight how easily current-generation LLMs can be coerced into a range of unintended behaviors, including misdirection attacks, in which an LLM outputs malicious URLs or instructions to a user or to another LLM, and denial-of-service attacks, in which an LLM is made to produce extreme numbers of tokens in order to exhaust GPU resources.
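The denial-of-service case has a straightforward first line of defense: enforce a hard output budget on the serving side rather than trusting the prompt. The following is a minimal sketch under the assumption of a streaming backend; `generate_stream` is a hypothetical stand-in for the deployed model's streaming interface.

```python
from typing import Callable, Iterable

def bounded_generate(
    prompt: str,
    generate_stream: Callable[[str], Iterable[str]],
    max_output_tokens: int = 1024,
) -> str:
    """Stream tokens from the model but stop at a fixed budget, so a coerced
    'repeat forever' style response cannot monopolize GPU time."""
    pieces = []
    for count, token in enumerate(generate_stream(prompt)):
        if count >= max_output_tokens:
            pieces.append(" [truncated: output token budget exceeded]")
            break
        pieces.append(token)
    return "".join(pieces)
```

Request-level rate limits and per-tenant token quotas extend the same idea upstream of the model.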

These emerging threats to large language models highlight the need for robust defense mechanisms and ongoing research into new attack methods. As AI continues to evolve, so too must our understanding of its vulnerabilities and limitations.
