
People often ask me for career advice related to AGI safety. This post (now also translated into Spanish) summarizes the advice I most commonly give. I’ve split it into three sections: general mindset, alignment research and governance work. For each of the latter two, I start with high-level advice aimed primarily at students and those early in their careers, then dig into more details of the field. See also this post I wrote two years ago, containing a bunch of fairly general career advice.

General mindset

In order to have a big impact on the world you need to find a big lever. This document assumes that you think, as I do, that AGI safety is the biggest such lever. There are many ways to pull on that lever, though—from research and engineering to operations and field-building to politics and communications. I encourage you to choose between these based primarily on your personal fit—a combination of what you're really good at and what you really enjoy. In my opinion the difference between being a great versus a mediocre fit swamps other differences in the impactfulness of most pairs of AGI-safety-related jobs.

How should you find your personal fit? To start, you should focus on finding work where you can get fast feedback loops. That will typically involve getting hands-on or doing some kind of concrete project (rather than just reading and learning) and seeing how quickly you can make progress. Eventually, once you've had a bunch of experience, you might notice a feeling of confusion or frustration: why is everyone else missing the point, or doing so badly at this? (Though note that a few top researchers commented on a draft to say that they didn't have this experience.) For some people that involves investigating a specific topic (for me, the question “what’s the best argument that AGI will be misaligned?”); for others it's about applying skills like conscientiousness (e.g. "why can't others just go through all the obvious steps?"). Being excellent seldom feels like you’re excellent, because your own abilities set your baseline for what feels normal.

What if you have that experience for something you don't enjoy doing? I expect that this is fairly rare, because being good at something is often very enjoyable. But in those cases, I'd suggest trying it until you observe that even a string of successes doesn't make you excited about what you're doing; and at that point, probably trying to pivot (although this is pretty dependent on the specific details).

Lastly: AGI safety is a young and small field; there’s a lot to be done, and still very few people to do it. I encourage you to have agency when it comes to making things happen: most of the time the answer to “why isn’t this seemingly-good thing happening?” or “why aren’t we 10x better at this particular thing?” is “because nobody’s gotten around to it yet”. And the most important qualifications for being able to solve a problem are typically the ability to notice it and the willingness to try. One anecdote to help drive this point home: a friend of mine has had four jobs at four top alignment research organizations; none of those jobs existed before she reached out to the relevant groups to suggest that they should hire someone with her skillset. And this is just what’s possible within existing organizations—if you’re launching your own project, there are far more opportunities to do totally novel things. (The main exception is when it comes to outreach and political advocacy. Alignment is an unusual field because the base of fans and supporters is much larger than the number of researchers, and so we should be careful to avoid alignment discourse being dominated by advocates who have little familiarity with the technical details, and come across as overconfident. See the discussion here for more on this.)

Alignment research

I’ll start with some high-level recommendations, then give a brief overview of how I see the field.

  1. Alignment is mentorship-constrained. If you have little research experience, your main priority should be finding the best mentor possible to help you gain research skills—e.g. via doing research in a professor’s lab, or internships at AI labs. Most of the best researchers and mentors aren't (yet) working on alignment, so the best option for mentorship may be outside of alignment—but PhDs are long enough, and timelines short enough, that you should make sure that your mentor would be excited about supervising some kind of alignment-relevant research. People can occasionally start doing great work without any mentorship; if you’re excited about this, feel free to try it, but focus on the types of research where you have fast feedback loops.
  2. You’ll need to get hands-on. The best ML and alignment research engages heavily with neural networks (with only a few exceptions). Even if you’re more theoretically-minded, you should plan to be interacting with models regularly, and gain the relevant coding skills. In particular, I see a lot of junior researchers who want to do “conceptual research”. But you should assume that such research is useless until it cashes out in writing code or proving theorems, and that you’ll need to do the cashing out yourself (with threat modeling being the main exception, although even then I think most threat modeling is not concrete enough to be useful). Perhaps once you’re a senior researcher with intuitions gained from hands-on experience you’ll be able to step back and primarily think about potential solutions at a high level, but that can’t be your plan as a junior researcher—it’ll predictably steer you away from doing useful work.
  3. You can get started quickly. People coming from fields like physics and mathematics often don’t realize how much shallower deep learning is as a field, and so think they need to spend a long time understanding the theoretical foundations first. You don’t: you can get started doing deep learning research with nothing more than first-year undergrad math, and pick up things you’re missing as you go along. (Coding skill is a much more important prerequisite, though.) You can also pick up many of the conceptual foundations of alignment as you go along, especially in more engineering-heavy roles. While I recommend that all alignment researchers eventually become familiar with the ideas covered in the Alignment Fundamentals curriculum, upskilling at empirical research should be a bigger priority for most people who have already decided to pursue a career in alignment research and who aren't already ML researchers.
    Some recommended ways to upskill at empirical research (roughly in order):
    1. MLAB
    2. ARENA
    3. Jacob Hilton’s deep learning curriculum
    4. Neel Nanda's guide to getting started with mechanistic interpretability
    5. Replicating papers
      Each of these teaches you important skills for good research: how to implement algorithms, how to debug code and experiments, how to interpret results, etc. Once you’ve implemented an algorithm or replicated a paper, you can then try to extend the results by improving the techniques somehow.
  4. Most research won’t succeed. This is true both on the level of individual projects, and also on the level of whole research directions: research is a very heavy-tailed domain. You should be looking hard for the core intuitions for why a given research direction will succeed, the absence of which may be hidden under mathematics or complicated algorithms (as I argue here). (You can think of this as a type of conceptual research, but intended to steer your own empirical or theoretical work, rather than intended as a research output in its own right.) In the next section I outline some of my views on which research directions are and aren't promising.

Alignment research directions

From my perspective, the most promising alignment research falls into three primary categories. I outline those below, as well as three secondary categories I think are valuable. Note that I expect the boundaries between all of these to blur over time as research on them progresses, and as we automate more and more things.

  1. Scalable oversight: finding ways to leverage more powerful models to produce better reward signals. Scalable oversight research may be particularly high-leverage if it ends up being adopted widely, e.g. as a tool for preventing hallucinations (like how alignment teams’ work on RLHF has now been adopted very widely).
    1. The theoretical paper I most often point people to is Irving et al.’s debate paper.
    2. The empirical paper I most often point people to is Saunders et al.’s critiques paper, which can be seen as the simplest case of the debate algorithm; Bowman et al. (2022) is also useful from a methodological perspective.
    3. The two other well-known algorithms in this area are iterated amplification and recursive reward modeling. My opinion is that people often overestimate the differences between these algorithms, and that standard presentations of them obfuscate the ways in which they’re structurally similar. I personally find debate the easiest to reason about (and it seems like others agree, since more papers build on it than on the others), hence why I most often recommend people work on that.
    4. Will scalable oversight just lead to more capabilities advances? This is an important question; one way I think about it is in terms of the generator-discriminator-critique gap from Saunders et al.’s critiques paper. Specifically, while I expect that closing the generator-discriminator gap is a dual-purpose advance (and could be good or bad depending on your other views), closing the discriminator-critique gap by producing correct human-comprehensible explanations should definitely be seen as an alignment advance.
  2. Mechanistic interpretability: finding ways to understand how networks function internally. While still only a small subfield of ML, I think of it as a way of pushing the whole field of ML from a “behaviorist” perspective that only focuses on inputs and outputs towards a “cognitivist” framework that studies what’s going on inside neural networks. It’s also much easier to do outside industry labs than scalable oversight work. To get started, check out Nanda's 200 Concrete Open Problems in Mechanistic Interpretability.
    1. Three strands of mechanistic interpretability work:
      1. Case studies: finding algorithms inside networks that implement specific capabilities. My favorite papers here are Olsson et al. (2022), Nanda et al. (2023), Wang et al. (2022) and Li et al. (2022); I’m excited to see more work which builds on the last in particular to find world-models and internally-represented goals within networks.
      2. Solving superposition: finding ways to train networks to have fewer overlapping concepts within individual neurons. The key resource here is Elhage (2022) (as well as other work in the Transformer Circuits thread).
      3. Scalable interpretability: finding algorithms to automatically identify or modify internal representations. My favorite papers: Meng et al. (2022) and Burns et al. (2023) (although some consider the latter to be closer to scalable oversight work).
  3. Alignment theory: finding formal frameworks we can use to reason about advanced AI. I want to flag that success at this type of research is even more heavy-tailed than the other research directions I’ve described: it seems to require exceptional mathematical skills, a deep understanding of ML theory, and nuanced philosophical intuitions. I'm not optimistic that any of the research directions listed here will work out, but they are attempting to address such fundamental problems that even partial successes could be a big deal.
    1. I’m most excited about Christiano’s work on formalizing heuristic arguments, Kosoy’s learning-theoretic agenda (particularly infra-bayesianism), and various work by Scott Garrabrant (e.g. geometric rationality, finite factored sets, and Cartesian frames).
    2. Historically most of the work in this category has been done by MIRI (e.g. work on functional decision theory and Garrabrant induction). Their output has dropped significantly lately, though; so I mainly think of them as having a handful of researchers pursuing their individual interests, rather than a unified research agenda.
    3. Why do I think alignment theory is worth pursuing? In large part because scientific knowledge is typically very interconnected. Alignment theory often seems disconnected from modern ML—but the motions of the stars once seemed totally disconnected from events on earth. And who could have guessed that understanding variation in the beaks of finches would advance our understanding of...well, basically everything in biology? In many domains there are key principles that explain a huge range of phenomena, and the main difficulty is finding a tractable angle of attack. That's why asking the right questions is often more important than actually getting concrete results. For example, asking "what is the optimal strategy in this specific formalization of a 2-player game?" is a large chunk of the work of inventing game theory.

Three other research areas that seem important, but less central:

  1. Evaluations: finding ways to measure how dangerous and/or misaligned models are.
    1. There’s been little published on this so far; the main thing to look at is the ARC evals (also discussed in section 2.9 of the GPT-4 system card). In general it seems like alignment evals are very difficult, so most people are focusing on evals for measuring dangerous capabilities instead.
    2. My own opinion is that evaluations will live or die by how simple and scalable they are. The best evals would be easily implementable even by people without any alignment background, and would meaningfully track improvements all the way from current systems up to superintelligences. In short, this is because the primary purpose of evals is to facilitate decision-making and coordination, and both of these benefit hugely from legible and predictable metrics.
  2. Unrestricted adversarial training: finding ways to generate inputs on which misaligned systems will misbehave.
    1. It seems like there are strong principled reasons to expect this to be difficult—in general you can only generate fake data which fools one model using a much more powerful model. But it may be possible to find unrestricted adversarial examples by leveraging mechanistic interpretability, as explored in this post by Christiano.
    2. The empirical paper I point people to most often is Ziegler et al. (2022) (see also the other papers they cite).
  3. Threat modeling: understanding and forecasting how AGI might lead to catastrophic outcomes.
    1. I most often point people to my own recent paper (Ngo et al., 2022). Other good work includes reports by Joe Carlsmith and Ajeya Cotra. (Cohen et al. (2022) make a peer-reviewed case for existential risk from AGI, but it’s too focused on outer alignment for me to buy into it.)
    2. One threat modeling research direction that seems valuable is understanding gradient hacking (and understanding cooperation between different models more generally). Another is to explore the specific ways that AGIs are most likely to be deployed in the real world, and what sorts of vulnerabilities they may be able to exploit.

By contrast, some lines of research which I think are overrated by many newcomers to the field, along with some critiques of them:

  1. Cooperative inverse reinforcement learning (the direction that Stuart Russell defends in his book Human Compatible); critiques here and here.
  2. John Wentworth’s work on natural abstractions; exposition and critique here, and another here.
  3. Work which relies on agents acting myopically, including by only making next-timestep predictions (e.g. work on the simulators abstraction, or on conditioning predictive models); critique here.

Governance work

I mentally split this into three categories: governance research, lab governance, and policy jobs. A few high-level takeaways for each:

  1. Governance research
    1. The main advice I give people who want to enter this field: pick one relevant topic and try to become an expert on it. There are about two dozen topics where I wish there were a world expert on applying this topic to making AGI go well, and no such person exists; I’ve made a list of those topics below. To learn about them I strongly recommend not just reading and absorbing ideas, but also writing about them. It’s very plausible that, starting off with no background in the field, within six months you could write a post or paper which pushes forward the frontier of our knowledge on how one of those topics is relevant to AGI governance.
    2. You don’t necessarily need to stick with your choice longer-term; my claim is mainly that it’s important to have some concrete topic to investigate. As you do so, you’ll gradually branch out to other topics which are tangentially relevant, and pick up a broader knowledge of the field (the Governance Fundamentals course is one good way of doing so). Eventually you’ll be able to do “strategy research” with much wider implications. But trying to do that from the beginning is a bad plan—it’ll go much better with a base of detailed expertise to work from.
    3. In general I think people overrate “analysis” and underrate “proposals”. There are many high-level factors which will affect AGI governance, and we could spend the rest of our lives trying to analyze them. But ultimately what we need is concrete mechanisms which actually move the needle, which are currently in short supply. Of course you need to do analysis in order to understand the factors which will influence proposals’ success, but you should always keep in mind the goal of trying to ground it out in something useful.
    4. Relatedly, I personally don’t think that quantitative modeling is very valuable. I have yet to see such a model of a big-picture question (e.g. compute projections, takeoff speed, timelines) whose conclusions substantively change my opinions about what the best governance proposals are. If such a model is a strong success it may shift my credences from, say, 25% to 75% in a given proposition. But that’s only a factor of 3 difference, whereas one plan for how to solve governance could be one or two orders of magnitude more effective than another. And in general models rarely move me that much, because even a few free parameters allow people to dramatically overfit to their intuitions; I’d typically prefer having a short summary of the core insights that the person doing the modeling learned during that process. So prioritize plans first, insights second and models last.
    5. Don’t be constrained too much by political feasibility, especially when formulating early versions of a plan. Almost nobody in the world has both good intuitions for how politics really works, and good intuitions for how crazy progress towards AGI will be. All sorts of possibilities will open up in the future—we just need to be ready with concrete proposals when they do. However, a deep understanding of the fundamental drivers of today’s policy decisions will be helpful in navigating when things start changing much faster.
  2. AI lab governance
    1. Leading labs are often amenable to carrying out proposals which don’t strongly trade off against their core capabilities work; the bottleneck is usually the agency and work required to actually implement the proposal. Thus interventions of the form “tell labs to care more about safety” generally don’t work very well, whereas interventions of the form “here is a concrete ask, here are the specific steps you’d need to take, here’s a person who’s agreed to lead the effort” tend to go well. This post conveys that idea particularly well.
    2. It’s hard for people outside labs to know enough details about what’s going on inside labs to be able to make concrete proposals, but I expect there are a few important cases where it’s possible. This probably looks fairly similar to the path I outlined in the section on governance research, of first gaining expertise on a specific topic, then generating specific proposals.
    3. There is a specific skill of getting things done inside large organizations that most EAs lack (due to lack of corporate experience, plus lack of people-orientedness), but which is particularly useful when pushing for lab governance proposals. If you have it, lab governance work may be a good fit for you.
  3. Policy-related jobs
    1. By this I mean going to work in government-related positions, with the goal of trying to get into a position where you can help make government regulation go well. I don’t have too much to say here, since it’s not my area of expertise. You should probably take fairly general advice (e.g. the advice here) about how to have a successful career in this area, and then figure out how to go faster under the assumption that people will get increasingly stressed about AI. Short master's degrees and policy fellowships are quick ways to fast-track towards mid-career policy roles; getting even a small amount of legible AI expertise (e.g. any CS/AI-related degree or job) is also helpful.
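
The arithmetic behind the point about quantitative modeling (item 4 under governance research) can be made concrete with a small sketch. The 25% to 75% credence shift and the "one or two orders of magnitude" range are taken from the text; the `expected_value` helper and the unit payoff are hypothetical, chosen only to make the comparison visible.

```python
# Illustrative only: the credences and the 10x-100x range come from the text;
# the payoff value and this helper are assumptions made for the comparison.

def expected_value(credence: float, payoff_if_true: float) -> float:
    """Expected value of a plan that only pays off if the proposition holds."""
    return credence * payoff_if_true

payoff = 1.0  # arbitrary units (assumption)

ev_before = expected_value(0.25, payoff)
ev_after = expected_value(0.75, payoff)

# How much a very successful model could change the expected value.
model_gain = ev_after / ev_before

# How much one plan might outperform another, per the text.
plan_gain_low, plan_gain_high = 10.0, 100.0

print(f"gain from a very successful model: {model_gain:.0f}x")
print(f"gain from a better plan: {plan_gain_low:.0f}x to {plan_gain_high:.0f}x")
```

Under these assumptions, even the best-case model moves the estimate by a factor of 3, while the choice of plan moves it by 10 to 100, which is the reasoning behind "plans first, insights second and models last".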

List of governance topics

Here are some topics where I wish we had a world expert on applying it to AGI safety. One example of what great work on one of these topics might look like: Baker’s paper on lessons from nuclear arms control (a topic which would have been on this list if he hadn’t written that).

One cluster of topics can be described roughly as “anything mentioned in Yonadav Shavit’s compute governance paper”, in particular:

  1. Tamper-evident logging in GPUs
  2. Global tracking of GPUs
  3. Proof-of-learning algorithms
  4. On-site inspections of models
  5. Detecting datacenters
  6. Building a suite for verifiable inference
  7. Measuring effective compute use (e.g. by measuring and controlling for algorithmic progress)
  8. Regulating large-scale decentralized training (if it becomes competitive with centralized training)
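
For readers unfamiliar with the first topic on this list, here is a minimal sketch of what "tamper-evident" means in software terms, using a generic hash chain. This is not the mechanism from the compute governance paper, and the function names (`entry_hash`, `append`, `verify`) and the log fields are hypothetical; it only illustrates the general idea that each entry commits to everything before it.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry (assumption)

def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the payload together with the previous entry's hash."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + body).hexdigest()

def append(log: list, payload: dict) -> None:
    """Add an entry whose hash commits to the whole history so far."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "hash": entry_hash(prev, payload)})

def verify(log: list) -> bool:
    """Recompute the chain; any edited payload breaks a stored hash."""
    prev = GENESIS
    for entry in log:
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True

# Hypothetical usage log for a training run.
log = []
append(log, {"step": 1, "gpu_hours": 100})
append(log, {"step": 2, "gpu_hours": 250})
intact = verify(log)

# Quietly under-reporting an earlier entry is detected on verification.
log[0]["payload"]["gpu_hours"] = 1
tamper_detected = not verify(log)
```

A software-only chain like this can be rebuilt from scratch by anyone who controls the whole log, so a real design would presumably need the latest hash anchored somewhere the operator cannot rewrite, such as in hardware or with an external party. That gap is part of why this is an open research topic.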

Another cluster: security-related topics such as

  1. Preventing neural network weight exfiltration (by third parties or an AI itself)
  2. Evaluating the possibility of autonomous replication across the internet
  3. Privilege escalation from within secure systems (e.g. if your coding assistant is misaligned, what could it achieve?)
  4. Datacenter monitoring (e.g. if unauthorized copies of a model were running on your servers, how would you know?)
  5. Detecting unauthorized communication channels between different copies of a model.
  6. Detecting tampering (e.g. if your training run had been modified, how would you know?)
  7. How vulnerable are nuclear command and control systems?
  8. Scalable behavior monitoring (e.g. how can we aggregate information across monitoring logs from millions of AIs?)

And a more miscellaneous (and less technical) third category:

  1. What regulatory apparatus within the US government would be most effective at regulating large training runs?
  2. What tools and methods does the US government have for auditing tech companies?
  3. What are the biggest gaps in the US export controls to China, and how might they be closed?
  4. What AI applications or demonstrations will society react to most strongly?
  5. What interfaces will humans use to interact with AIs in the future?
  6. How will AI most likely be deployed for sensitive tasks (e.g. advising world leaders) given concerns about privacy?
  7. How might political discourse around AI polarize, and what could mitigate that?
  8. What would it take to automate crucial infrastructure (factories, weapons, etc)?

Comments

Hi Richard! I truly appreciate your enlightening post! It struck me as highly informative, and I believe others will feel the same. In order to reach out to our Spanish-speaking individuals, I have proactively translated it into Spanish, as I'm sure they will also see the immense value in this information.

Thanks again!

Thanks! I'll update it to include the link.

I'll piggyback on this (excellent) post to mention that we're working on some of the governance questions mentioned here at Rethink Priorities.

  • For example: @Onni Aarne is working on hardware-enabled mechanisms for compute governance (this touches on a bunch of stuff that comes up in Yonadav's paper, like tamper-evident logging), and I am working on China's access to ML compute post October export controls. @MichaelA is supervising those projects.
    • We're definitely happy to hear from others who are working on these (or related) things, are considering working on these things, or are simply interested in these things! (You can reach any of us at <firstname>@rethinkpriorities.org.)
    • We expect to open a hiring round for another compute governance person soon™.
  • Our other projects are summarized in this two-pager, and some are also relevant to problems listed in this post.

Note: this comment is cross-posted on LessWrong.

Classification of AI safety work

Here I proposed a systematic framework for classifying AI safety work. This is a matrix, where one dimension is the system level:

  • A monolithic AI system, e.g., a conversational LLM
  • AGI lab (= the system that designs, manufactures, operates, and evolves monolithic AI systems and systems of AIs)
  • A cyborg, human + AI(s)
  • A system of AIs with emergent qualities (e.g., https://numer.ai/, but in the future, we may see more systems like this, operating on a larger scope, up to fully automatic AI economy; or a swarm of CoEms automating science)
  • A human+AI group, community, or society (scale-free consideration, supports arbitrary fractal nestedness): collective intelligence, e.g., The Collective Intelligence Project
  • The whole civilisation, e.g., Open Agency Architecture, or the Gaia network

Another dimension is the "time" of consideration:

  • Design time: research into how the corresponding system should be designed (engineered, organised): considering its functional ("capability", quality of decisions) properties, adversarial robustness (= misuse safety, memetic virus security), and security.  AGI labs: org design and charter.
  • Manufacturing and deployment time: research into how to create the desired designs of systems successfully and safely:
    • AI training and monitoring of training runs.
    • Offline alignment of AIs during (or after) training. 
    • AI strategy (= research into how to transition into the desirable civilisational state = design).
    • Designing upskilling and educational programs for people to become cyborgs is also here (= designing efficient procedures for manufacturing cyborgs out of people and AIs).
  • Operations time: ongoing (online) alignment of systems on all levels to each other, ongoing monitoring, inspection, anomaly detection, and governance.
  • Evolutionary time: research into how the (evolutionary lineages of) systems at the given level evolve long-term:
    • How the human psyche evolves when it is in a cyborg
    • How humans will evolve over generations as cyborgs
    • How AI safety labs evolve into AGI capability labs :/
    • How groups, communities, and society evolve.
    • Designing feedback systems that don't let systems "drift" into an undesired state over evolutionary time.
    • Considering system property: property of flexibility of values (i.e., the property opposite of value lock-in, Riedel (2021)).
    • IMO, it (sometimes) makes sense to think about this separately from alignment per se. Systems could be perfectly aligned with each other but drift into undesirable states and not even notice this if they don't have proper feedback loops and procedures for reflection.

There would be 6*4 = 24 slots in this matrix, and almost all of them have something interesting to research and design, and none of them is "too early" to consider.
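
As a rough sketch, the matrix can be written out explicitly. The level and time labels below are shortened from the two lists above; the variable names and the `slots_for_level` helper are hypothetical.

```python
from itertools import product

# The six system levels, abbreviated from the first list.
system_levels = [
    "monolithic AI system",
    "AGI lab",
    "cyborg (human + AI)",
    "system of AIs",
    "human+AI group, community, or society",
    "whole civilisation",
]

# The four "times" of consideration, abbreviated from the second list.
times = [
    "design",
    "manufacturing and deployment",
    "operations",
    "evolutionary",
]

# Every (system level, time) pair is one slot in the matrix.
slots = list(product(system_levels, times))

def slots_for_level(level: str) -> list:
    """Return the slots (one per time) for a single system level."""
    return [slot for slot in slots if slot[0] == level]

print(len(slots))
for level, time in slots_for_level("AGI lab"):
    print(f"{level} * {time} time")
```

Each research direction can then be described as a set of these pairs, which is how the classifications in the next section are written.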

Richard's directions within the framework

Scalable oversight: (monolithic) AI system * manufacturing time

Mechanistic interpretability: (monolithic) AI system * manufacturing time, also design time (e.g., in the context of the research agenda of weaving together theories of cognition and cognitive development, ML, deep learning, and interpretability through the abstraction-grounding stack, interpretability plays the role of empirical/experimental science work)

Alignment theory: Richard phrases it vaguely, but referencing primarily MIRI-style work reveals that he means primarily "(monolithic) AI system * design, manufacturing, and operations time".

Evaluations, unrestricted adversarial training: (monolithic) AI system * manufacturing, operations time

Threat modeling: system of AIs (rarely), human + AI group, whole civilisation * deployment time, operations time, evolutionary time

Governance research, policy research: human + AI group, whole civilisation * mostly design and operations time.

Takeaways

To me, it seems almost certain that many current governance institutions and democratic systems will not survive the AI transition of civilisation. Bengio recently hinted at the same conclusion.

Human+AI group design (scale-free: small group, org, society) and the civilisational intelligence design must be modernised.

Richard mostly classifies this as "governance research", which has a connotation that this is a sort of "literary" work and not science, with which I disagree. There is a ton of cross-disciplinary hard science to be done about group intelligence and civilisational intelligence design: game theory, control theory, resilience theory, linguistics, political economy (rebuilt as a hard science, of course, on the basis of resource theory, bounded rationality, economic game theory, etc.), cooperative reinforcement learning, etc.

I feel that the design of group intelligence and civilisational intelligence is an under-appreciated area by the AI safety community. Some people do this (Eric Drexler, davidad, the cip.org team, ai.objectives.institute, the Digital Gaia team, and the SingularityNET team, although the latter are less concerned about alignment), but I feel that far more work is needed in this area.

There is also a place for "literary", strategic research, but I think it should mostly concern deployment time of group and civilisational intelligence designs, i.e., the questions of transition from the current governance systems to the next-generation, computation and AI-assisted systems.

Also, operations and evolutionary time concerns of everything (AI systems, systems of AIs, human+AI groups, civilisation) seem to be under-appreciated and under-researched: alignment is not a "problem to solve", but an ongoing, manufacturing-time and operations-time process.

Thank you so much for your insightful and detailed list of ideas for AGI safety careers, Richard! I really appreciate your excellent post.

I would propose explicitly grouping some of your ideas and additional ones under a third category: “identifying and raising public awareness of AGI’s dangers.” In fact, I think this category may plausibly contain some of the most impactful ideas for reducing catastrophic and existential risks, given that alignment seems potentially difficult to achieve in a reasonable period of time (if ever) and the implementation of governance ideas is bottlenecked by public support.

For a similar argument that I found particularly compelling, please check out Greg Colbourn’s recent post: https://forum.effectivealtruism.org/posts/8YXFaM9yHbhiJTPqp/agi-rising-why-we-are-in-a-new-era-of-acute-risk-and

I don't actually think the implementation of governance ideas is mainly bottlenecked by public support; I think it's bottlenecked by good concrete proposals. And to the extent that it is bottlenecked by public support, that will change by default as more powerful AI systems are released.

I don't actually think the implementation of governance ideas is mainly bottlenecked by public support; I think it's bottlenecked by good concrete proposals. And to the extent that it is bottlenecked by public support, that will change by default as more powerful AI systems are released.

I appreciate Richard stating this explicitly. I think this is (and has been) a pretty big crux in the AI governance space right now.

Some folks (like Richard) believe that we're mainly bottlenecked by good concrete proposals. Other folks believe that we have concrete proposals, but we need to raise awareness and political support in order to implement them.

I'd like to see more work going into both of these areas. On the margin, though, I'm currently more excited about efforts to raise awareness [well], acquire political support, and channel that support into achieving useful policies. 

I think this is largely due to (a) my perception that this work is largely neglected, (b) the fact that a few AI governance professionals I trust have also stated that they see this as the higher priority thing at the moment, and (c) worldview beliefs around what kind of regulation is warranted (e.g., being more sympathetic to proposals that require a lot of political will).

I can see a worldview in which prioritizing raising awareness is more valuable, but I don't see the case for believing "that we have concrete proposals". Or at least, I haven't seen any; could you link them, or explain what you mean by a concrete proposal?

My guess is that you're underestimating how concrete a proposal needs to be before you can actually muster political will behind it. For example, you don't just need "let's force labs to pass evals", you actually need to have solid descriptions of the evals you want them to pass.

I also think that recent events have been strong evidence in favor of my position: we got a huge amount of political will "for free" from AI capabilities advances, and the best we could do with it was to push a deeply flawed "let's all just pause for 6 months" proposal.

Clarification: I think we're bottlenecked by both, and I'd love to see the proposals become more concrete. 

Nonetheless, I think proposals like "Get a federal agency to regulate frontier AI labs like the FDA/FAA" or even "push for an international treaty that regulates AI in the way that the IAEA regulates atomic energy" are "concrete enough" to start building political will behind them. Other (more specific) examples include export controls, compute monitoring, licensing for frontier AI models, and some others on Luke's list.

I don't think any of these are concrete enough for me to say "here's exactly how the regulatory process should be operationalized", and I'm glad we're trying to get more people to concretize these. 

At the same time, I expect that a lot of the concretization happens after you've developed political will. If the USG really wanted to figure out how to implement compute monitoring, I'm confident they'd be able to figure it out. 

More broadly, my guess is that we might disagree on how concrete a proposal needs to be before you can actually muster political will behind it. Here's a rough attempt at sketching out four possible "levels of concreteness". (First attempt; feel free to point out flaws.)

Level 1, No concreteness: You have a goal but no particular ideas for how to get there. (e.g., "we need to make sure we don't build unaligned AGI")

Level 2, Low concreteness: You have a goal with some vagueish ideas for how to get there. (e.g., "we need to make sure we don't build unaligned AGI, and this should involve evals/compute monitoring, or maybe a domestic ban on AGI projects and a single international project")

Level 3, Medium concreteness: You have a goal with high-level ideas for how to get there. (e.g., "We would like to see licensing requirements for models trained above a certain threshold. Still ironing out whether or not that threshold should be X FLOP, Y FLOP, or $Z, but we've got some initial research and some models for how this would work.")

Level 4, High concreteness: You have concrete proposals that can be debated. (e.g., "We should require licenses for anything above X FLOP, and we have some drafts of the forms that labs would need to fill out.")

I get the sense that some people feel like we need to be at "medium concreteness" or "high concreteness" before we can start having conversations about implementation. I don't think this is true.

Many laws, executive orders, and regulatory procedures have vague language (often at Level 2 or in between Level 2 and Level 3). My (loosely-held, mostly based on talking to experts and reading things) sense is that it's quite common for regulators to be like "we're going to establish regulations for X, and we're not yet exactly sure what they look like. Part of this regulatory agency's job is going to be to figure out exactly how to operationalize XYZ."

I also think that recent events have been strong evidence in favor of my position: we got a huge amount of political will "for free" from AI capabilities advances, and the best we could do with it was to push a deeply flawed "let's all just pause for 6 months" proposal.

I don't think this is clear evidence in favor of the "we are more bottlenecked by concrete proposals" position. My current sense is that we were bottlenecked both by "not having concrete proposals" and by "not having relationships with relevant stakeholders."

I also expect that the process of concretizing these proposals will likely involve a lot of back-and-forth with people (outside the EA/LW/AIS community) who have lots of experience crafting policy proposals. Part of the benefit of "building political will" is "finding people who have more experience turning ideas into concrete proposals."

Richard, I hope you turn out to be correct that public support for AI governance ideas will become less of a bottleneck as more powerful AI systems are released!

But I think it is plausible that we should not leave this to chance. Several of the governance ideas you have listed as promising (e.g., global GPU tracking, data center monitoring) are probably infeasible at the moment, to say the least. It is plausible that these ideas will only become globally implementable once a critical mass of people around the world become highly aware of and concerned about AGI dangers.

This means that timing may be an issue. Will the most detrimental of the AGI dangers manifest before meaningful preventative measures are implemented globally? It is plausible that before the necessary critical mass of public support builds up, a catastrophic or even existential outcome may already have occurred. It would then be too late.

The plausibility of this scenario is why I agree with Akash that identifying and raising public awareness of AGI’s dangers is an underrated approach.

Thanks for sharing!

If such a model is a strong success it may shift my credences from, say, 25% to 75% in a given proposition. But that’s only a factor of 3 difference, whereas one plan for how to solve governance could be one or two orders of magnitude more effective than another.

Do you have any thoughts on the value of models to determine the effectiveness of plans to solve governance?

I really appreciate specific career advice from people working in relevant jobs and the ideas and considerations outlined here, and am curating the post. (I'm also really interested in the discussion happening here.)

Personal highlights (note that I'm interested in hearing disagreement with these points!): 

  • The emphasis on fast feedback loops, especially for people who are newer to a field (see also the bit about becoming an expert in something for governance)
  • "the best option for mentorship may be outside of alignment—but PhDs are long enough, and timelines short enough, that you should make sure that your mentor would be excited about supervising some kind of alignment-relevant research."
  • This bit (I'd be interested in hearing disagreement, if there is much, though!): 
    • "You’ll need to get hands-on. The best ML and alignment research engages heavily with neural networks (with only a few exceptions). Even if you’re more theoretically-minded, you should plan to be interacting with models regularly, and gain the relevant coding skills. In particular, I see a lot of junior researchers who want to do “conceptual research”. But you should assume that such research is useless until it cashes out in writing code or proving theorems, and that you’ll need to do the cashing out yourself (with threat modeling being the main exception, since it forces a different type of concreteness). ..."
  • "You can get started quickly. People coming from fields like physics and mathematics often don’t realize how much shallower deep learning is as a field, and so think they need to spend a long time understanding the theoretical foundations first. You don’t..." [read the rest above]
  • The specific directions and research topics listed! (With links and commentary!)
  • On governance: 
    • "The main advice I give people who want to enter this field: pick one relevant topic and try to become an expert on it."
    • "In general I think people overrate “analysis” and underrate “proposals”."

"You’ll need to get hands-on. The best ML and alignment research engages heavily with neural networks (with only a few exceptions). Even if you’re more theoretically-minded, you should plan to be interacting with models regularly, and gain the relevant coding skills. In particular, I see a lot of junior researchers who want to do “conceptual research”. But you should assume that such research is useless until it cashes out in writing code or proving theorems, and that you’ll need to do the cashing out yourself (with threat modeling being the main exception, since it forces a different type of concreteness). ..."

This seems strongly true to me

Yeah, I agree on priors & some arguments about feedback loops, although note that I don't really have relevant experience. But I remember hearing someone try to defend something like the opposite claim to me in some group setting where I wasn't able to ask the follow-up questions I wanted to ask — so now I don't remember what their main arguments were and don't know if I should change my opinion.

I expect a bunch of more rationalist-type people disagree with this claim, FWIW. But I also think that they heavily overestimate the value of the types of conceptual research I'm talking about here.

CC https://www.lesswrong.com/posts/fqryrxnvpSr5w2dDJ/touch-reality-as-soon-as-possible-when-doing-machine that expands on "hands-on" experience in alignment. 

I don't know of any writing that directly contradicts these claims. I think https://www.lesswrong.com/s/v55BhXbpJuaExkpcD/p/3pinFH3jerMzAvmza indirectly contradicts these claims as it broadly criticizes most empirical approaches and is more open to conceptual approaches.

Loved reading this post, as a person considering working in AI Safety, this is a great resource and answers many questions, including some I hadn't thought of answering. Thanks so much for writing this! 

One question: I am curious to hear anyone's perspective on the following "conflict": 

Point 1: "There is a specific skill of getting things done inside large organizations that most EAs lack (due to lack of corporate experience, plus lack of people-orientedness), but which is particularly useful when pushing for lab governance proposals. If you have it, lab governance work may be a good fit for you."

Point 2: "You need to get hands on" and, related: "Coding skill is a much more important prerequisite, though." 

There may be exceptions, but I would guess (partly based on my own experience) that the kind of people who have a lot of experience getting things done in large organisations typically do not spend much time coding ML models. 

And yet, as I say, I believe both of these are necessary. If I want to influence a major AI / ML company, I will lack credibility in their eyes if I have no experience working with and in large organisations. But I will also lack credibility if I don't have an in-depth understanding of the models and an ability to discuss them specifically rather than just abstractly. 

Specific question: What might the typical learning curve be for the second aspect, to get to the point where I could get hands on with models? My starting point would be having studied FORTRAN in college (!! - yes, that long ago!) and only having one online course of Python. There may be others with different starting points. 

I suppose, ultimately, it still seems likely that it would be quicker even for a total novice to coding to reach some level of meaningful competence than for someone with no experience of organisations to become expert in how decisions are made and plans are approved or rejected, and how to influence this.

Also are there good online courses anyone would recommend? 

One question: I am curious to hear anyone's perspective on the following "conflict": 

The former is more important for influencing labs, the latter is more important for doing alignment research.

And yet, as I say, I believe both of these are necessary.

FWIW when I talk about the "specific skill", I'm not talking about having legible experience doing this, I'm talking about actually just being able to do it. In general I think it's less important to optimize for having credibility, and more important to optimize for the skills needed. Same for ML skill—less important for gaining credibility, more important for actually just figuring out what the best plans are.

Also are there good online courses anyone would recommend? 

See the resources listed here.

Thanks Richard, this is clear now.

And thank you (and others) for sharing the resources link - this indeed looks like a fantastic resource. 

Denis

 
