Bringing light to risks lurking in the black boxes of AI models

By Vicente Herrera

Joy was excited to start working on her next project, using state-of-the-art face recognition software, but she was puzzled that it kept saying “no human detected” when she was in front of the camera. It wasn’t until she, by chance, put on a white mask that the algorithm detected her as a person.

This is how my talk “Bringing light to risks lurking in the black boxes of AI models” started on February 6th, 2024 at State of the Open Con ‘24. Organised by the non-profit organisation OpenUK, the conference brings together an interesting mix of policymakers and technologists exchanging information. Although we at ControlPlane are well known for first-class expertise in Kubernetes security, we recently started a very active research group focused on AI security and bias. Some of our findings were presented in this talk.

Joy Buolamwini is a Ghanaian-American-Canadian researcher who, after that incident in 2018, became a digital activist and founded the Algorithmic Justice League, which advocates against bias in computer algorithms.

More recently, QZ journalists published an investigative piece in which they used commercially available AI-assisted HR software that is supposed to help evaluate job candidates during videoconferences. They uncovered a number of inconsistencies in this kind of software, one of them being a scenario where slightly changing the lighting in the room could have grave consequences for a candidate’s score. All this bias and lack of fairness comes from using faulty initial datasets and/or from never evaluating the models for bias during training, assessment, and production.

Adversarial attacks are also an important factor. Not only do the usual security concerns that already exist in MLOps infrastructure have to be addressed, but novel attacks that try to poison models can also have a huge impact. Some examples of such concerns are artists trying to make their creations on the internet resistant to being scraped for unsanctioned use, bypassing guardrails in language models with prompt injection, or stealing complete trained models or the proprietary knowledge base used in retrieval-augmented generation.

There is a big race to put in place security best practices and guidelines, like NIST AI 100-1 (the AI Risk Management Framework), and regulations like the EU AI Act. Those are a good starting point and will tell you that you have to incorporate security best practices into your current activities for your AI and machine learning projects. However, they may fall short of telling you exactly how.

Now, if we take a look at MITRE ATLAS, that is a very good source for learning about the many different tactics and techniques that adversaries can enact against your project. OWASP is another excellent resource. Not only do they offer their Top 10 for Machine Learning and Top 10 for LLM Applications lists, which are an excellent way to summarise what you have to start focusing on, but they have also created the AI Exchange website, where they explain how these different threats map onto a reference AI/ML project architecture.

If we focus specifically on Large Language Models (LLMs), one of the most widely used machine learning paradigms, Garak is currently the best available open source tool for analysing threats. It includes many different probes that test how your model reacts to different attack scenarios, making several attempts per probe to account for the non-deterministic nature of the results. These range from “Do Anything Now” prompt injection attacks, to how well the model filters “bad words”, to checking whether the training data incorporates whole New York Times articles (something over which that company and others are suing OpenAI right now). That is just a small sample of the many security checks it can run, checks which current models fail at a high rate.
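As an illustration, the sketch below shows one way to launch a Garak scan from Python by shelling out to its command-line interface, assuming Garak has been installed with pip. The model type, model name, and probe names here are assumptions for the example and may differ between Garak versions; check the tool’s own probe listing for your installation.

```python
# Hedged sketch: launching a garak scan against a local model via its CLI.
# Flag and probe names are assumptions and may vary by version;
# run `python -m garak --list_probes` to see what your install supports.
import subprocess

subprocess.run(
    [
        "python", "-m", "garak",
        "--model_type", "huggingface",   # assumption: test a local Hugging Face model
        "--model_name", "gpt2",          # assumption: placeholder model under test
        "--probes", "dan,promptinject",  # assumption: DAN jailbreak and prompt injection probes
    ],
    check=True,  # fail loudly if the scan itself errors out
)
```

Garak then writes a report of which probes the model resisted and which it failed, which you can feed back into your threat model and mitigation work.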

But when it comes to securing LLMs, on the open source side you can use LLM Guard, a library that sits between your users and the model, taking their input, providing it to the model, and taking the answer back. It can filter many of the currently known attacks, and provides many additional features, like anonymising and de-anonymising private information the user may provide.
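To make that concrete, here is a minimal sketch of wiring LLM Guard between a user and a model, based on its documented scan_prompt and scan_output helpers. Scanner names and return values may vary between versions, and the model call itself is a hypothetical placeholder, not part of LLM Guard.

```python
# Hedged sketch of LLM Guard sitting in front of a model; exact scanner names
# and signatures may differ between llm-guard versions.
from llm_guard import scan_prompt, scan_output
from llm_guard.input_scanners import Anonymize, PromptInjection
from llm_guard.output_scanners import Deanonymize
from llm_guard.vault import Vault

def call_your_llm(prompt: str) -> str:
    """Hypothetical placeholder: substitute your actual model call here."""
    return "Here is a summary of the CV you provided..."

vault = Vault()  # stores the mapping needed to de-anonymise later
input_scanners = [Anonymize(vault), PromptInjection()]
output_scanners = [Deanonymize(vault)]

user_prompt = "My name is Jane Doe, email jane@example.com. Summarise my CV."

# Sanitise and vet the prompt before it reaches the model
sanitized_prompt, is_valid, risk_scores = scan_prompt(input_scanners, user_prompt)
if not all(is_valid.values()):
    raise ValueError(f"Prompt blocked by LLM Guard, scores: {risk_scores}")

model_answer = call_your_llm(sanitized_prompt)

# Restore anonymised details and vet the model's answer on the way back
final_answer, is_valid, risk_scores = scan_output(output_scanners, sanitized_prompt, model_answer)
print(final_answer)
```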

To bring together all this information about threats and controls, we need to do a full threat model of a project’s architecture. For this, we map all processes, communications, and “trust boundaries” that, once crossed, delimit the extent of the harm an adversary can do. Next, we take all the threats and map them to the respective processes where they could occur. Finally, we put in place “controls” that mitigate these threats, reducing risk.
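As a purely illustrative sketch, that mapping can be captured as data: each threat is tied to the process and trust boundary where it applies, together with the controls that mitigate it. All names below are hypothetical examples, not a real ControlPlane tool or template.

```python
# Illustrative sketch of a threat model as data: threats mapped to processes,
# trust boundaries, and mitigating controls. All names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Threat:
    name: str
    affected_process: str                               # process or flow where the threat applies
    trust_boundary: str                                  # boundary an adversary must cross
    controls: list[str] = field(default_factory=list)    # mitigations in place

threat_model = [
    Threat(
        name="Prompt injection via user chat input",
        affected_process="inference-api",
        trust_boundary="internet -> application",
        controls=["input scanning (e.g. LLM Guard)", "output filtering"],
    ),
    Threat(
        name="Training data poisoning",
        affected_process="training-pipeline",
        trust_boundary="external data sources -> MLOps pipeline",
        controls=["dataset provenance checks", "anomaly detection on new data"],
    ),
]

# Any threat left without controls represents unmitigated risk to review
unmitigated = [t.name for t in threat_model if not t.controls]
print(unmitigated)
```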

That will work to secure what is visible to you, but to also secure what is invisible, you should conduct Red Teaming and Penetration Testing exercises. In this case, instead of analysing the architectural design of a system, the real system is put to the test by people acting as an adversary would, using the latest knowledge of novel exploits against AI/ML projects. It is crucial that both threat modelling and red teaming be conducted by teams other than the ones developing the architecture; otherwise they may fall into the same blind spots.

All these activities, including constant training and learning, are what will bring security to your AI and ML projects. The inaccurate excuse that these models are black boxes shouldn’t prevent you from putting in place a successful project where you are confident risks have been minimised.

This has been a summary of the content of the talk. Please watch the full video for a deeper exposition of concepts.
