AI

Prompt hacking is an Achilles' heel for AI

  • "Prompt hacking" is becoming a concern as hackers figure out how to manipulate LLMs to retrieve restricted information
  • Outsmarting an LLM in many environments can be done with little to no hacking experience
  • New security measures need to be put in place and LLMs themselves will have to adapt

It turns out that tricking a large language model (LLM) can be surprisingly easy. Manipulating the input given to an LLM to elicit an unintended response, known as "prompt hacking," is a growing threat to organizations as they implement more artificial intelligence (AI).

Most moderns LLMs and generative artificial intelligence (GenAI) lack reasoning capability, said Red Hat Chief Technologist Fatih Nar. And that is a problem.

“You can trick the model to give you an outcome that is not supposed to be given,” Nar told Fierce Network. “It's just like you're asking a kid to drive a car when that kid is not of legitimate age to drive a car, but you can force a kid to drive a car, right?"

A survey by Lakera.AI of over 1,000 security professionals found that nearly 90% of organizations are actively implementing or exploring LLM use cases, yet only about 5% expressed high confidence in their GenAI security frameworks.

There are a “broad range of environments using GenAI,” said Lakera CEO David Haber, including coding assistance, data analysis, internal knowledge base and search, customer service (chatbots) and content creation (writing, translation, etc.). Because GenAI models themselves are vulnerable to prompt attacks, Haber noted all use cases are “equally susceptible."

Surprisingly, 62% of respondents reported moderate to low concern about GenAI/LLM vulnerabilities, even though only 5% are confident in their security controls. 

"One of the biggest obstacles to securing AI systems right now is a lack of knowledge on the part of both engineers and security teams," said Nate Lee, CISO at Cloudsec.ai.

Hacking made easy

Lakera announced last month it raised $20 million in a Series A funding round for GenAI security product development and expansion into the U.S. The company’s nascent success can be attributed partly to the popularity of Gandalf, an educational game it created to teach users about the risks of prompt injection attacks on LLMs.

Through a web form, players are challenged to trick the underlying LLM—powered by the OpenAI API—into revealing in-game passwords, with each level presenting increasingly difficult scenarios.

Initial levels provide “less protection than what we expect enterprises to deploy today,” Haber said. The later levels, on the other hand, do replicate the best protections the existing GenAI models in the market can provide. 

Lakera tests various popular models for level seven of the game. This level has so far been solved by over 200,000 players, "which shows that a creative person (with no hacking background) can trick the models we are relying on today,” Haber added.

However, there are dozens of threads on Reddit and other sites where users have shared their solutions to Gandalf. The same could be true for hacking real-life LLMs being used across organizations. One hacker could find a good prompt and then tell other hackers about it. 

An easy search will lead to YouTube videos such as “Learn to Hack LLMs like ChatGPT,” or this presentation at the Applied Machine Learning Days (AMLD) 2024 where computer scientist Julia Bazińska demonstrated how to hack an LLM by “asking nicely.”

ChatGPT

In a recent blog, AWS Community Builder and Data Scientist Aboze Brain John showed an example of “jailbreaking” an LLM using a prompt shared by Jailbreakchat.

Prompt protection

Prompt hacking can range from benign tricks to malicious attempts to uncover private information, bypass content filters or propagate misinformation. There are two main types of prompt injections: direct, where the prompt is manipulated directly, and indirect, where the model's behavior is influenced through embedded attacks in external content.

To combat these vulnerabilities, organizations have adopted a range of security practices.

Lakera's survey revealed that 61% of organizations have implemented access control mechanisms, and 55% use data encryption to protect sensitive information. However, 28% of respondents were unsure about their security practices, and 13% reported having none.

John outlined other strategies for mitigating prompt injection risks in his blog, including enhanced input validation, secure interaction with external data sources and user education. Additionally, measures like data anonymization, advanced training techniques and careful prompt design can help prevent sensitive information leakage.

To reduce risk, LLMs themselves will have to adapt as well. As AI advances toward Artificial General Intelligence (AGI)—a form of AI that can perform any intellectual task a human can—combined with improved reasoning abilities, it will become much harder for these models to be tricked, Red Hat’s Nar said. AI development is also moving from fine-tuned AI models, which require specific training for different tasks, to zero-shot models that can handle new tasks without any prior training.

“Hopefully we'll have reasoning that will make it much harder to break LLM models, to get them to do something that they're not supposed to do,” Nar continued. “But it's a path. It's a journey. We are walking through it.”