
Erik Jones on Automatically Auditing Large Language Models

22:36
 
Content is provided by Michaël Trazzi. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Michaël Trazzi or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process described here: https://hu.player.fm/legal.

Erik is a PhD student at UC Berkeley working with Jacob Steinhardt, interested in making generative machine learning systems more robust, reliable, and aligned, with a focus on large language models. In this interview we talk about his paper "Automatically Auditing Large Language Models via Discrete Optimization", which he presented at ICML.

YouTube: https://youtu.be/bhE5Zs3Y1n8

Paper: https://arxiv.org/abs/2303.04381

Erik: https://twitter.com/ErikJones313

Host: https://twitter.com/MichaelTrazzi

Patreon: https://www.patreon.com/theinsideview

Outline

00:00 Highlights

00:31 Erik's background and research at Berkeley

01:19 Motivation for doing safety research on language models

02:56 Is it too easy to fool today's language models?

03:31 The goal of adversarial attacks on language models

04:57 Automatically Auditing Large Language Models via Discrete Optimization

06:01 Optimizing over a finite set of tokens rather than continuous embeddings

06:44 Goal is revealing behaviors, not necessarily breaking the AI

07:51 On the feasibility of solving adversarial attacks

09:18 Suppressing dangerous knowledge vs just bypassing safety filters

10:35 Can you really ask a language model to cook meth?

11:48 Optimizing a French-to-English translation example

13:07 Forcing toxic celebrity outputs just to test rare behaviors

13:19 Testing the method on GPT-2 and GPT-J

14:03 Adversarial prompts transferred to GPT-3 as well

14:39 How this auditing research fits into the broader AI safety field

15:49 Need for automated tools to audit failures beyond what humans can find

17:47 Auditing to avoid unsafe deployments, not for existential risk reduction

18:41 Adaptive auditing that updates based on the model's outputs

19:54 Prospects for using these methods to detect model deception

22:26 Preferring safety via alignment over auditing constraints alone; closing thoughts

Patreon supporters:

  • Tassilo Neubauer
  • MonikerEpsilon
  • Alexey Malafeev
  • Jack Seroy
  • JJ Hepburn
  • Max Chiswick
  • William Freire
  • Edward Huff
  • Gunnar Höglund
  • Ryan Coppolo
  • Cameron Holmes
  • Emil Wallner
  • Jesse Hoogland
  • Jacques Thibodeau
  • Vincent Weisser