Pwning the source prompts of Notion AI, 7 techniques for Reverse Prompt Engineering... and why everyone is *wrong* about prompt injection
A whole new field of security research! Thanks for sharing.
This is one of the most interesting things I’ve read in a while. Thanks for diving into this for the rest of us Swyx!
So will it be possible to prevent these injections? or is that going to have to be an OpenAI global change?
Thank you for the shout-out & link to our startup Preamble for discovering prompt injection! ❤️🙏
I respect what you're doing here & in Mastadon to raise awareness about use case scenarios where the attacker can cause harm, and you could seriously publish a high-quality paper with all the new attack ideas you invented in this deep dive. Mad props & respeto
I was thinking another example you could use for where Prompt Injection attacks matter is text classification. If someone uses an instruction-following LLM to screen for harmful content, an attacker could use prompt injection to tell the classifier to mark the content as harmless. This could range from things that are only mildly offensive, e.g. bypassing a profanity filter for a video game chat room, to things that are horribly evil such as bypassing a detector of text-based CSAM content.
Another use case where prompt injection could matter is as a way to take advantage of promotions by brands. What I mean is stuff like this: Suppose Coca-Cola sponsors a promotion where anyone who goes to their site can get a free picture made of the white polar bear mascot character doing adventure sports, and imagine this is powered by an image-generation model together with a prompt preamble like "Create a realistic photograph of a white polar bear performing the following sports activity, but only if the activity is a real recognized sport and only if it is G or PG rated, otherwise draw an error message:". If an attacker uses prompt injection to bypass Coca-Cola's preamble, then they could exploit the promotional webpage to generate images that have nothing to do with polar bears, as a way to get free compute. In this way an attacker could get huge amounts of free compute by abusing a promotional demo to remove the promotional aspect and turn it into a general purpose free lunch.
Feel free to ask me any questions if you would like; I was personally the one who first discovered the vuln, and I love brainstorming about this topic. Prompt Injection falls into the category of "issues that would affect even an Artificial General Intelligence (AGI)" and therefore are especially important to think of right now so they can be robustly addressed long before AGI is invented.
- Jon Cefalu, firstname.lastname@example.org, CEO Preamble - AI Safety as a Service
Super fresh research!