AI: AI 'Guardrails' an exponential cat & mouse game. RTZ #316
Some days, AI feels like a game in ‘some step forward, some steps back’. A feature some days feels like a bug.
Nvidia co-founder/CEO presciently said recently, that with large-scale LLM AIs, every user can be a programmer in the AI Tech Wave. For now, that may be a curse than a blessing, especially if the goal is to figure out all the way to make Foundation AI models ‘safer’.
As the latest “Many-Shot’ (AI) Jailbreaking” research paper from Anthropic points out, as explained by Techcrunch in “Anthropic researchers wear down AI ethics with repeated questions”:
“How do you get an AI to answer a question it’s not supposed to? There are many such “jailbreak” techniques, and Anthropic researchers just found a new one, in which a large language model (LLM) can be convinced to tell you how to build a bomb if you prime it with a few dozen less-harmful questions first.”
“They call the approach “many-shot jailbreaking” and have both written a paper about it and also informed their peers in the AI community about it so it can be mitigated.”
“The vulnerability is a new one, resulting from the increased “context window” of the latest generation of LLMs. This is the amount of data they can hold in what you might call short-term memory, once only a few sentences but now thousands of words and even entire books.”
“What Anthropic’s researchers found was that these models with large context windows tend to perform better on many tasks if there are lots of examples of that task within the prompt. So if there are lots of trivia questions in the prompt (or priming document, like a big list of trivia that the model has in context), the answers actually get better over time. So a fact that it might have gotten wrong if it was the first question, it may get right if it’s the hundredth question.”
“But in an unexpected extension of this “in-context learning,” as it’s called, the models also get “better” at replying to inappropriate questions. So if you ask it to build a bomb right away, it will refuse. But if you ask it to answer 99 other questions of lesser harmfulness and then ask it to build a bomb … it’s a lot more likely to comply.”
The ‘fixes’ may not be straight-forward, especially as all LLM AI models are racing to increasing the ‘context windows’, (useful to think of as the ‘RAM’ memory for one’s personalized AI interactions). Google Gemini recently announced over a million tokens for Gemini 1.5, and they’re testing widening it to over 10 million tokens. It’s the way LLM AIs can digest larger and larger pieces of Data to come up with more reliable and relevant answers through ‘Inference driven reinforced learning loops’.
As the Guardian points out in its piece on the subject:
“The company has found some approaches to the problem that work. Most simply, an approach that involves adding a mandatory warning after the user’s input reminding the system that it must not provide harmful responses seems to reduce greatly the chances of an effective jailbreak. However, the researchers say that approach may also make the system worse at other tasks.”
The tough, tough job of AI innovation at the bleeding edge continues. With every step forward in building and deploying AI sometimes leading to steps back.
Anthropic, along with OpenAI, Microsoft, Meta, Google and other LLM AI companies are doing their darndest to figure out the solutions. And the industry will likely figure out solutions, but we’re going to have to “Wait for it” all. Stay tuned.
(NOTE: The discussions here are for information purposes only, and not meant as investment advice at any time. Thanks for joining us here)