No One Truly Knows How AI Systems Work. A New Discovery Could Change That

21.05.2024 18:00

Time.com

A new breakthrough by researchers at Anthropic allows them to peer inside neural networks, paving the way for safer AI systems.

Today’s artificial intelligence is often described as a “black box.” AI developers don’t write explicit rules for these systems; instead, they feed in vast quantities of data and the systems learn on their own to spot patterns. But the inner workings of the AI models remain opaque, and efforts to peer inside them to check exactly what is happening haven’t progressed very far. Beneath the surface, neural networks—today’s most powerful type of AI—consist of billions of artificial “neurons” represented as decimal-point numbers. Nobody truly understands what they mean, or how they work.

For those concerned about risks from AI, this fact looms large. If you don’t know exactly how a system works, how can you be sure it is safe?

On Tuesday, the AI lab Anthropic announced it had made a breakthrough toward solving this problem. Researchers developed a technique for essentially scanning the “brain” of an AI model, allowing them to identify collections of neurons, called “features,” that correspond to different concepts. And for the first time, they successfully used this technique on a frontier large language model: Anthropic’s Claude Sonnet, the lab’s second-most powerful system.

In one example, Anthropic researchers discovered a feature inside Claude representing the concept of “unsafe code.” By stimulating those neurons, they could get Claude to generate code containing a bug that could be exploited to create a security vulnerability. But by suppressing the neurons, the researchers found, Claude would generate harmless code.
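To make the idea of stimulating or suppressing a feature concrete, here is a minimal toy sketch. It assumes a feature can be treated as a fixed pattern of weights over a handful of neurons; the numbers and the function names `feature_strength` and `steer` are invented for illustration and are not Anthropic's code or method.

```python
def dot(a, b):
    """Dot product of two equal-length lists of numbers."""
    return sum(x * y for x, y in zip(a, b))


def feature_strength(activations, direction):
    """How strongly the feature's pattern is present in the activations."""
    return dot(activations, direction) / dot(direction, direction)


def steer(activations, direction, target):
    """Shift activations along the feature's pattern until its strength equals target.

    A target of 0.0 suppresses the feature; a large target stimulates it.
    """
    current = feature_strength(activations, direction)
    return [a + (target - current) * d for a, d in zip(activations, direction)]


# Hypothetical "unsafe code" feature spread across three neurons.
unsafe_code = [1.0, 0.0, 1.0]
activations = [0.5, 0.2, 0.3]

suppressed = steer(activations, unsafe_code, 0.0)
stimulated = steer(activations, unsafe_code, 5.0)
```

In this toy, only the part of the activations that lies along the feature's pattern changes; the neuron that is not part of the pattern keeps its original value.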

The findings could have big implications for the safety of both present and future AI systems. The researchers found millions of features inside Claude, including some representing bias, fraudulent activity, toxic speech, and manipulative behavior. And they discovered that by suppressing each of these collections of neurons, they could alter the model’s behavior.

As well as helping to address current risks, the technique could also help with more speculative ones. For years, the primary method available to researchers trying to understand the capabilities and risks of new AI systems has simply been to chat with them. This approach, sometimes known as “red-teaming,” can help catch a model being toxic or dangerous, allowing researchers to build in safeguards before the model is released to the public. But it doesn’t help address one type of potential danger that some AI researchers are worried about: the risk of an AI system becoming smart enough to deceive its creators, hiding its capabilities from them until it can escape their control and potentially wreak havoc.

“If we could really understand these systems—and this would require a lot of progress—we might be able to say when these models actually are safe, or whether they just appear safe,” Chris Olah, the head of Anthropic’s interpretability team who led the research, tells TIME.

“The fact that we can do these interventions on the model suggests to me that we’re starting to make progress on what you might call an X-ray, or an MRI [of an AI model],” Anthropic CEO Dario Amodei adds. “Right now, the paradigm is: let’s talk to the model, let’s see what it does. But what we’d like to be able to do is look inside the model as an object—like scanning the brain instead of interviewing someone.”

The research is still in its early stages, Anthropic said in a summary of the findings. But the lab struck an optimistic tone that the findings could soon benefit its AI safety work. “The ability to manipulate features may provide a promising avenue for directly impacting the safety of AI models,” Anthropic said. By suppressing certain features, it may be possible to prevent so-called “jailbreaks” of AI models, a type of vulnerability where safety guardrails can be disabled, the company added.

Researchers in Anthropic’s “interpretability” team have been trying to peer into the brains of neural networks for years. But until recently, they had mostly been working on far smaller models than the giant language models currently being developed and released by tech companies.

One of the reasons for this slow progress was that individual neurons inside AI models would fire even when the model was discussing completely different concepts. “This means that the same neuron might fire on concepts as disparate as the presence of semicolons in computer programming languages, references to burritos, or discussion of the Golden Gate Bridge, giving us little indication as to which specific concept was responsible for activating a given neuron,” Anthropic said in its summary of the research.

To get around this problem, Olah’s team of Anthropic researchers zoomed out. Instead of studying individual neurons, they began to look for groups of neurons that would all fire in response to a specific concept. This technique worked—and allowed them to graduate from studying smaller “toy” models to larger models like Anthropic’s Claude Sonnet, which has billions of neurons.
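A small toy example may help show why looking at groups of neurons is more informative than looking at one neuron. The sketch below assumes three made-up concepts, each stored as a pattern across three neurons; the values and helper names are hypothetical and chosen only for illustration, not taken from the research.

```python
# Each concept is a pattern over three neurons (illustrative values only).
FEATURES = {
    "semicolons": [1.0, 0.0, 1.0],
    "burritos": [1.0, 1.0, 0.0],
    "golden_gate_bridge": [0.0, 1.0, 1.0],
}


def neuron_activations(concept):
    """Pretend the model's neurons fire in exactly the concept's pattern."""
    return list(FEATURES[concept])


def dot(a, b):
    """Dot product of two equal-length lists of numbers."""
    return sum(x * y for x, y in zip(a, b))


def best_feature(activations):
    """Pick the concept whose pattern best matches the whole group of neurons."""
    scores = {name: dot(activations, pattern) for name, pattern in FEATURES.items()}
    return max(scores, key=scores.get)


# Neuron 0 fires for both semicolons and burritos, so on its own it is ambiguous.
neuron_0 = {c: neuron_activations(c)[0] for c in FEATURES}

# Reading all three neurons together recovers the concept.
recovered = {c: best_feature(neuron_activations(c)) for c in FEATURES}
```

Here every individual neuron responds to two of the three concepts, yet matching the full pattern identifies each concept correctly.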

Although the researchers said they had identified millions of features inside Claude, they cautioned that this number was nowhere near the true number of features likely present inside the model. Identifying all the features, they said, would be prohibitively expensive using their current techniques, because doing so would require more computing power than it took to train Claude in the first place. (Training Claude cost somewhere in the tens or hundreds of millions of dollars.) The researchers also cautioned that although they had found some features they believed to be related to safety, more study would still be needed to determine whether those features could reliably be manipulated to improve a model’s safety.

For Olah, the research is a breakthrough that proves the utility of his esoteric field, interpretability, to the broader world of AI safety research. “Historically, interpretability has been this thing on its own island, and there was this hope that someday it would connect with [AI] safety—but that seemed far off,” Olah says. “I think that’s no longer true.”
