What AI ethicists get wrong – and right – about AI safety
Backstory - on December 1st I wrote a thread on Twitter responding to Timnit Gebru’s article in WIRED, which criticized EA-funded AI safety work. Someone recommended I turn it into a blog post. This post comes quite late, but it is much more thorough and balanced than the original thread.
Two camps are raising alarm about AI. On one side is the AI safety community, which is closely associated with the Effective Altruism (EA) movement. People in AI safety have been raising alarm about the existential risk posed by future superintelligent AI systems. AI safety researchers mostly study technical problems around ensuring robustness and “value alignment”. People in AI ethics, on the other hand, have been raising alarm about present-day issues in how AI is deployed and used. AI ethicists study issues such as algorithmic bias, the lack of transparency and explainability in AI systems, and the growing dominance of a few big tech firms. The two groups are quite different demographically. In broad strokes, AI safety is mostly men while AI ethics is mostly women.1
Recently both groups have been raising concerns about large language models. Unfortunately, the two groups don’t get along. An article by AI ethicist Timnit Gebru recently appeared in WIRED entitled “Effective Altruism Is Pushing a Dangerous Brand of ‘AI Safety’”. The article makes many good points, but it is also unfair to the AI safety movement in several ways.
Did EA AI safety work contribute to the current LLM arms race?
Gebru is right that OpenAI has pioneered and promoted the use of large models, and that these models have many inherent problems that are not easily rectified. Extensive work went into preventing ChatGPT from providing certain types of content to users. Despite all this work, within a few hours of its release numerous ways were found to circumvent its controls. While ChatGPT represents an advance over its predecessor, GPT-3, it still contains unwanted bias at its core, since it was trained on bulk data scraped from the internet. ChatGPT is also famously prone to hallucination, giving information that sounds plausible but is incorrect. More recently, numerous articles have documented how Microsoft’s new Bing AI can exhibit unhinged and manipulative behavior. Despite these high-profile failings, hype about LLMs continues at a fever pitch, goaded onward by Silicon Valley interests who have invested millions in the technology.
The proliferation and use of large models is not easily controlled either. Open source alternatives to GPT-3 and DALL-E 2 have now been released in which users can easily disable the safety controls. The floodgates are now open for nefarious actors to use large models for fake news generation, non-consensual generation of sexual imagery, spam, and possibly hacking and other harmful activities.
Gebru is right to ask how things got this way and whether EA shares any blame.
OpenAI was founded in 2015 with Elon Musk and Peter Thiel as seed donors, among several others. While both Musk and Thiel have spoken at EA conferences, neither has any further involvement with EA, contrary to what Gebru suggests. From the very beginning, many EAs disagreed with OpenAI’s mission to make the source code for highly advanced AI systems freely available to all. Shortly after it was founded, OpenAI was criticized by Scott Alexander, one of the most influential bloggers in EA. Not long after, in 2016, Nick Bostrom told WIRED: "If you have a button that could do bad things to the world, you don’t want to give it to everyone." Other prominent figures such as Eliezer Yudkowsky would later take to Twitter to castigate OpenAI for initiating the current AI arms race.
In 2021, Yudkowsky told me he feared that a large enough language model, when combined with a few other pieces, could create an existentially dangerous system. Many others, such as Connor Leahy, now CEO at Conjecture, have voiced similar concerns. Effective Altruists are now in near universal agreement with Gebru that the current AI arms race is a bad thing and that large language models are dangerous.
Gebru argues that EA’s AI safety efforts have contributed to the current situation. Most notably, she points to a $30 million donation that Open Philanthropy, the largest EA funding body, gave to OpenAI in 2017. At the time, Open Philanthropy gave a detailed rationale for the donation. In exchange, Open Philanthropy CEO Holden Karnofsky was given a seat on the OpenAI board and allowed to participate in weekly all-hands meetings. Open Philanthropy said they believed the impact of the donation would arise from the influence it would give them within OpenAI, not from contributing to the progress of AI. They identified OpenAI as one of the two key players in AI (the other being DeepMind) and wanted to influence them to care more about safety. Not long after Open Philanthropy’s donation, 80,000 Hours, a prominent Effective Altruist organization focused on career development, started advertising software engineering roles at OpenAI, whether or not those roles involved AI alignment. Soon, many EAs started working at OpenAI and appearing on OpenAI publications. It’s hard to know, since we can’t re-run history, but it appears that the efforts of Open Philanthropy and 80,000 Hours resulted in more safety-consciousness at OpenAI while having minimal effect on the advancement of AI capabilities.
After the influx of EAs into OpenAI, OpenAI began to publish AI safety papers, for instance “AI Safety via Debate” (2018), “Testing Robustness Against Unforeseen Adversaries” (2019), “AI Safety Needs Social Scientists” (2019), and “Release Strategies and the Social Impacts of Language Models” (2019). When OpenAI announced GPT-2 in 2019, it did not make the full model available for several months, out of concerns about how it might be used.
Then, in late 2020, there was a mass exodus of effective altruists from OpenAI. AI safety expert Paul Christiano left to form the Alignment Research Center, while Dario Amodei, Chris Olah, Jack Clark, Amanda Askell, and several others left to form Anthropic. Few of these individuals have spoken publicly about why they left, but people have speculated that it was because of a shift away from research and towards productization.
Anthropic went on to raise over $580 million, with the bulk of that coming from EA-associated funders Jaan Tallinn, Dustin Moskovitz, and Sam Bankman-Fried. According to Jack Clark (private conversation), one of the main motivators for founding Anthropic was the realization that the shift towards large “foundation models” had made it harder for individuals, small organizations, and academic labs to work on AI safety at the frontier of AI. Clark believes that we cannot rely on corporations to properly do safety work, a view which is shared by many AI ethicists, including Gebru. The founding premise of Anthropic was to counter this by investing in infrastructure for training, testing, and interrogating state-of-the-art (“SoTA”) AI models. Right now, SoTA models are all transformers, but Clark acknowledges this could change. By investing in general-purpose infrastructure and highly skilled staff, Anthropic hopes to continue doing safety work at the cutting edge of AI.
Does current AI safety work advance AI capabilities?
“It's a dangerous myth that AI safety and AI capability are orthogonal vectors.” - Sam Altman, Twitter, November 16th, 2022
“Clued-in people disagree about whether Anthropic has already pivoted to building the Torment Nexus, but it’s probably only a matter of time.” - Scott Alexander, August 8, 2022.
"If you look at the people who are most worried about AI risk, they are the people who are working the hardest on AI... EA is doing more to accelerate AI than say, I am... I've met with some of these people, visited some of the institutions, and you hear across the same lunch table talk both the worries and the gung-ho-ness." - Tyler Cowen, at EA Global: DC, 2022
Gebru raises the legitimate concern that Anthropic may actually be helping push forward the development of dangerous AI systems such as large language models. It’s pretty clear that alignment and capabilities go hand-in-hand – more aligned systems are easier to work with and more economically viable. Thus, alignment techniques published by Anthropic could increase the economic viability of LLMs and increase profits at big tech companies deploying LLMs. In fact, it appears this has already happened - OpenAI’s paper on GPT-4 references Anthropic’s work on “Constitutional AI”, saying that such “rule-based” reward models are “one of our main tools for steering the model towards appropriate refusals”.
These sorts of concerns are not lost on those who work at Anthropic. In personal conversations, Anthropic staff have told me that the organization is extremely careful about what it works on and tries hard not to contribute to AI capabilities.
Even if Anthropic never publishes papers or code, people who work on large models may develop skills that they take elsewhere. Software engineers working at the cutting edge of AI learn the latest “black art” tricks for designing, training, and deploying large models at scale. The economic utility of this arcane and sometimes purely tacit knowledge is partly reflected in the salaries that individuals possessing it command. Software engineers at OpenAI receive total compensation starting at $300,000 and going up to $1,000,000 a year. Sources tell me that salaries at Anthropic are in a similar range, with some of the most senior staff receiving over $1,000,000 per year. Recently, Anthropic advertised an opening for a “prompt engineer” with compensation of $250,000 - $315,000 a year. The distorting impact of such high salaries should not be overlooked: they may attract individuals who are in it largely for the money, even if they profess purely altruistic motivations.
There’s also a concern that over time Anthropic may drift towards looking and acting more like a for-profit entity. Just recently, Google invested $400 million in Anthropic, and Salesforce has invested as well. Presumably these investors are looking for a return and are not investing solely for philanthropic reasons.
While all of these concerns are legitimate, my personal view is that Anthropic’s contribution to “making AI go well” comes out as positive on balance, considering the high quality of the alignment research they are producing. Anthropic’s first paper was entitled “A General Language Assistant as a Laboratory for Alignment”. They went on to publish papers on reinforcement learning from human feedback, predictability and surprise in large language models, red teaming, and scalable oversight. Anthropic’s research so far has all contributed towards making LLMs more truthful and less prone to bias, toxicity, and other unwanted behaviors. This is exactly the sort of work that AI ethicists such as Gebru should support.
The role of billionaires
The fact that several billionaires are funding AI safety research concerns people within AI ethics, including Gebru. Indeed, it’s worth asking whether these individuals are looking to control the future development of AI in a quest to maintain their power and dominance. This isn’t a stretch – power corrupts and one can observe arrogance and a “master of the universe” mentality in some tech billionaires.
It’s important not to jump to conclusions here, however. There is a growing recognition among the super-rich that they have a moral obligation to use their wealth to help others. Each of the billionaires mentioned has expressed specific altruistic motivations for donating to AI safety efforts. Elon Musk has warned that developing superintelligent AI is “summoning the demon”. Jaan Tallinn first became worried about the existential risk of AI after reading the writings of Eliezer Yudkowsky in 2009, and by 2011 he was actively funding AI safety efforts.
Some of these billionaires may have selfish motivations too, and that’s OK - Dustin Moskovitz recently admitted on Twitter that he has a selfish motivation for funding AI safety - he doesn’t want himself and everyone he knows to die.
It also makes sense that AI safety is not receiving lots of small donations, although it receives some from effective altruists. For most people there are many more salient issues occupying their headspace, such as local homelessness, natural disasters, or discrimination. Some people have trouble imagining how a computer could ever be as smart as a human, due to mistaken religious or dualistic thinking. Many other people can intuitively see the risk that superintelligent AI presents, and can easily conjure to mind vast armies of killer robots à la The Terminator, but then cast this off as science fiction that may only happen decades from now, if ever. Other people mistakenly think that AI alignment is easy - why not just program exactly what you want the AI to do and not program the things you don’t want it to do? People also raise a number of other common objections, such as the idea that higher intelligence will automatically lead to better understanding and appreciation of ethics, or that it would be easy to keep a superintelligent AI under control by installing off-switches and keeping it physically confined. Understanding the full scope of the risk from misaligned AI requires processing all these arguments and their counterarguments. This is what the philosopher Nick Bostrom did in his influential 2014 monograph, Superintelligence, which Elon Musk has specifically cited on many occasions. The book is famously dense and technical. It’s not surprising, therefore, that the billionaires funding AI safety are exclusively from technical backgrounds.
EA-funded AI safety started with the Machine Intelligence Research Institute, which for a long time focused on esoteric mathematical research under the banner of “agent foundations”. Today, EA-funded AI safety work is spread across many nonprofits including Anthropic, Redwood Research, Alignment Research Center, the Center for Human-Compatible AI, Ought, Conjecture, Aligned AI, and dozens of independent researchers and smaller entities. The bulk of this work is focused on large language models (LLMs), either trying to understand how they work internally or working to make them more aligned and truthful. Evaluating the impact of this work is difficult for many reasons. We don’t know if large language models are actually a step towards existentially dangerous “AGI” or, as Yann LeCun suggests, a detour. We also don’t know if the current safety work will translate to future, more advanced systems. It’s also unclear how neglected a lot of the work is - LLM alignment is presumably being studied in many corporate research labs, for instance, and in traditional academic institutions as well. Finally, the current work may be contributing to the problematic arms race dynamics around LLMs and to the advancement of AI capabilities towards existentially dangerous systems. It’s much less clear whether the current AI safety work on LLMs is actually reducing existential risk. Still, my personal appraisal is that the AI safety work being done remains net positive overall, and given the stakes at hand, AI safety research still seems underfunded in general.
Thank you to Charles He for contributing a few of the ideas discussed and Mike Levine for proofreading.
Some notable AI safety researchers are Eliezer Yudkowsky, Nick Bostrom, Stuart Russell, Stuart Armstrong, Paul Christiano, and Jacob Steinhardt. Some notable AI ethics researchers are Timnit Gebru, Emily Bender, Joanna Bryson, Meredith Whittaker, Marzyeh Ghassemi, Ruha Benjamin, Elizabeth Adams, Latanya Sweeney, and Cynthia Dwork. How this gender divide came about is best left for a sociologist to figure out.