A week ago, Anthropic quietly weakened their ASL-3 security requirements. Yesterday, they announced ASL-3 protections.
I appreciate the mitigations, but quietly lowering the bar at the last minute so you can meet requirements isn't how safety policies are supposed to work.
(This was originally a tweet thread (https://x.com/RyanPGreenblatt/status/1925992236648464774) which I've converted into a quick take. I also posted it on LessWrong.)
What is the change and how does it affect security?
9 days ago, Anthropic changed their RSP so that ASL-3 no longer requires being robust to employees trying to steal model weights if the employee has any access to "systems that process model weights".
Anthropic claims this change is minor (and calls insiders with this access "sophisticated insiders").
But, I'm not so sure it's a small change: we don't know what fraction of employees could get this access and "systems that process model weights" isn't explained.
Naively, I'd guess that access to "systems that process model weights" includes employees being able to operate on the model weights in any way other than through a trusted API (a restricted API that we're very confident is secure). If that's right, it could be a high fraction! So, this might be a large reduction in the required level of security.
If this does actually apply to a large fraction of technical employees, then I'm also somewhat skeptical that Anthropic can actually be "highly protected" from (e.g.) organized cybercrime groups without meeting the original bar: hacking an insider and using their access is typical!
Also, one of the easiest ways for security-aware employees to evaluate security is to think about how easily they could steal the weights. So, if you don't aim to be robust to employees, it might be much harder for employees to evaluate the level of security and then complain about not meeting requirements[1].
Anthropic's justification and why I disagree
Anthropic justified the change by
Hot Take: Securing AI Labs could actually make things worse
There's a consensus view that stronger security at leading AI labs would be a good thing. It's not at all clear to me that this is the case.
Consider the extremes:
In a maximally insecure world, where anyone can easily steal any model that gets trained, there's no profit or strategic/military advantage to be gained from doing the training, so nobody's incentivised to invest much to do it. We'd only get AGI if some sufficiently-well-resourced group believed it would be good for everyone to have an AGI, and were willing to fund its development as philanthropy.
In a maximally secure world, where stealing trained models is impossible, whichever company/country got to AGI first could essentially dominate everyone else. In this world there's huge incentive to invest and to race.
Of course, our world lies somewhere between these two. State actors almost certainly could steal models from any of the big 3; potentially organised cybercriminals/rival companies too, but most private individuals could not. Still, it seems that marginal steps towards a higher security world make investment and racing more appealing as the number of actors able to steal the products of your investment and compete with you for profits/power falls.
But I notice I am confused. The above reasoning predicts that nobody should be willing to make significant investments in developing AGI at current levels of cybersecurity, since if they succeeded, their AGI would immediately be stolen by multiple governments (and possibly rival companies/cybercriminals), which would probably nullify any return on the investment. What I observe is OpenAI raising $40 billion in its last funding round, with the explicit goal of building AGI.
So now I have a question: given current levels of cybersecurity, why are investors willing to pour so much cash into building AGI?
...maybe it's the same reason various actors are willing to invest in building open
I recently created a simple workflow to allow people to write to the Attorneys General of California and Delaware to share thoughts + encourage scrutiny of the upcoming OpenAI nonprofit conversion attempt.
Write a letter to the CA and DE Attorneys General
I think this might be a high-leverage opportunity for outreach. Both AG offices have already begun investigations, and AGs are elected officials primarily tasked with protecting the public interest, so they should care about what the public thinks and prioritizes. Unlike e.g. congresspeople, I don't think AGs often receive grassroots outreach (I found ~0 past examples of this), and an influx of polite, thoughtful letters may have some influence, especially from CA and DE residents, although I think anyone impacted by their decision should feel comfortable contacting them.
Personally I don't expect the conversion to be blocked, but I do think the value and nature of the eventual deal might be significantly influenced by the degree of scrutiny on the transaction.
Please consider writing a short letter — even a few sentences is fine. Our partner handles the actual delivery, so all you need to do is submit the form. If you want to write one on your own and can't find contact info, feel free to dm me.
I'm not sure how to word this properly, and I'm uncertain about the best approach to this issue, but I feel it's important to get this take out there.
Yesterday, Mechanize was announced, a startup focused on developing virtual work environments, benchmarks, and training data to fully automate the economy. The founders include Matthew Barnett, Tamay Besiroglu, and Ege Erdil, who are leaving (or have left) Epoch AI to start this company.
I'm very concerned we might be witnessing another situation like Anthropic, where people with EA connections start a company that ultimately increases AI capabilities rather than safeguarding humanity's future. But this time, we have a real opportunity for impact before it's too late. I believe this project could potentially accelerate capabilities, increasing the odds of an existential catastrophe.
I've already reached out to the founders on X, but perhaps there are people more qualified than me who could speak with them about these concerns. In my tweets to them, I expressed worry about how this project could speed up AI development timelines, asked for a detailed write-up explaining why they believe this approach is net positive and low risk, and suggested an open debate on the EA Forum. While their vision of abundance sounds appealing, rushing toward it might increase the chance we never reach it due to misaligned systems.
I personally don't have a lot of energy or capacity to work on this right now, nor do I think I have the required expertise, so I hope that others will pick up the slack. It's important we approach this constructively and avoid attacking the three founders personally. The goal should be productive dialogue, not confrontation.
Does anyone have thoughts on how to productively engage with the Mechanize team? Or am I overreacting to what might actually be a beneficial project?
Notes on some of my AI-related confusions[1]
It’s hard for me to get a sense for stuff like “how quickly are we moving towards the kind of AI that I’m really worried about?” I think this stems partly from (1) a conflation of different types of “crazy powerful AI”, and (2) the way that benchmarks and other measures of “AI progress” de-couple from actual progress towards the relevant things. Trying to represent these things graphically helps me orient/think.
First, it seems useful to distinguish the breadth or generality of state-of-the-art AI models from their level of ability on relevant capabilities. Once I separate these out, I can plot roughly where some definitions of "crazy powerful AI" apparently lie on these axes:
(I think there are too many definitions of "AGI" at this point. Many people would make that area much narrower, but possibly in different ways.)
Visualizing things this way also makes it easier for me[2] to ask: Where do various threat models kick in? Where do we get “transformative” effects? (Where does “TAI” lie?)
Another question that I keep thinking about is something like: “what are key narrow (sets of) capabilities such that the risks from models grow ~linearly as they improve on those capabilities?” Or maybe “What is the narrowest set of capabilities for which we capture basically all the relevant info by turning the axes above into something like ‘average ability on that set’ and ‘coverage of those abilities’, and then plotting how risk changes as we move the frontier?”
The most plausible sets of abilities like this might be something like:
* Everything necessary for AI R&D[3]
* Long-horizon planning and technical skills?
If I try the former, how does risk from different AI systems change?
And we could try drawing some curves that represent our guesses about how the risk changes as we make progress on a narrow set of AI capabilities on the x-axis. This is very hard; I worry that companies focus on benchmarks in ways that
I was extremely disappointed to see this tweet from Liron Shapira revealing that the Center for AI Safety fired a recent hire, John Sherman, for stating that members of the public would attempt to destroy AI labs if they understood the magnitude of AI risk. Capitulating to this sort of pressure campaign is not the right path for EA, which should focus on seeking the truth rather than playing along with social-status games. It is not even the right path for PR: it makes you look like you think the campaigners have valid points, which in this case they do not. This makes me think less of CAIS's decision-makers.
In some discussions I had with people at EAG, it was interesting to discover that there might be a significant lack of EA-aligned people in the AI hardware space, which seems to translate into difficulty getting industry contacts for co-development of hardware-level AI safety measures. To the degree that there are EA members at these companies, it might make sense to create some kind of communication space to exchange ideas between people working on hardware AI safety and people at hardware-relevant companies (think Broadcom, Samsung, Nvidia, GloFo, TSMC, etc.). Unfortunately, I feel that these fields (EE/computer engineering) are culturally not very receptive to EA ideas, and the boom in ML/AI has caused significant self-selection of people towards hotter topics.
I believe there could be significant benefit in accelerating realistic safety designs if these discussions can be moved into industry as quickly as possible.