
Ryan Greenblatt

Member of Technical Staff @ Redwood Research

Bio

This other Ryan Greenblatt is my old account[1]. Here is my LW account.

  1. ^ Account lost to the mists of time and expired university email addresses.

Comments

Perceived counter-argument:

My proposed counter-argument, loosely based on the structure of yours.

Summary of claims

  • A reasonable fraction of computational resources will be spent based on the result of careful reflection.
  • I expect to be reasonably aligned with the result of careful reflection from other humans.
  • I expect to be much less aligned with the result of AIs-that-seize-control reflecting, due to less similarity and the potential for AIs to pursue relatively specific objectives from training (things like reward seeking).
  • Many arguments that human resource usage won't be that good seem to apply equally well to AIs and thus aren't differential.

Full argument

The vast majority of future value, from my perspective on reflection (where my perspective on reflection is probably somewhat utilitarian, but this is somewhat unclear), will come from agents who are trying to optimize explicitly for doing "good" things and are being at least somewhat thoughtful about it, rather than from those who incidentally achieve utilitarian objectives. (By "good", I just mean what seems to them to be good.)

At present, the moral views of humanity are a hot mess. However, it seems likely to me that a reasonable fraction of the total computational resources of our lightcone (perhaps 50%) will in expectation be spent based on the result of a process in which an agent or some agents think carefully about what would be best in a pretty deliberate and relatively wise way. This could involve eventually deferring to other smarter/wiser agents or massive amounts of self-enhancement. Let's call this a "reasonably-good-reflection" process.

Why think a reasonable fraction of resources will be spent like this?

  • If you self-enhance and get smarter, this sort of reflection on your values seems very natural. The same goes for deferring to other smarter entities. Further, entities in control might live for an extremely long time, so as long as they don't lock something in and eventually get around to being thoughtful, it should be fine.
  • People who don't reflect like this probably won't care much about having vast amounts of resources and thus the resources will go to those who reflect.
  • The argument for "you should be at least somewhat thoughtful about how you spend vast amounts of resources" is pretty compelling at an absolute level and will be more compelling as people get smarter.
  • Currently, a variety of moderately powerful groups are pretty sympathetic to this sort of view, and the power of these groups will be higher in the singularity.

I expect that I am pretty aligned (on reasonably-good-reflection) with the result of random humans doing reasonably-good-reflection, since I am also a human and many of the underlying arguments/intuitions that seem important to me seem likely to seem important to many other humans (given various common human intuitions) once those humans become wiser. Further, I really just care about the preferences of (post-)humans who end up caring most about using vast, vast amounts of computational resources (assuming I end up caring about these things on reflection), because the humans who care about other things won't use most of the resources. Additionally, I care "most" about the on-reflection preferences I have which are relatively less contingent and more common among at least humans, for a variety of reasons. (One way to put this is that I care less about worlds in which my preferences on reflection seem highly contingent.)

So, I've claimed that reasonably-good-reflection resource usage will be non-trivial (perhaps 50%) and that I'm pretty aligned with humans on reasonably-good-reflection. Supposing these, why think that most of the value is coming from something like reasonably-good-reflection preferences rather than other things, e.g. consumption driven by not-very-thoughtful indexical (selfish) preferences? Broadly, three reasons:

  • I expect huge returns to heavy optimization of resource usage (similar to spending altruistic resources today IMO, and in the future we'll be smarter, which will make this effect stronger).
  • I don't think that (even heavily optimized) not-very-thoughtful indexical preferences directly result in things I care that much about relative to things optimized for what I care about on reflection (e.g. it probably doesn't result in vast, vast, vast amounts of experience which is optimized heavily for goodness/$).
    • Consider how billionaires currently spend money, which doesn't seem to have much direct value, certainly not relative to their altruistic expenditures.
    • I find it hard to imagine that indexical self-ish consumption results in things like simulating 10^50 happy minds. See also my other comment. It seems more likely IMO that people with self-ish preferences mostly just buy positional goods that involve little to no experience (separately, I expect this means that people without self-ish preferences get more of the compute, but this is counted in my earlier argument, so we shouldn't double count it.)
  • I expect that indirect value "in the minds of the laborers producing the goods for consumption" is also small relative to things optimized for what I care about on reflection. (It seems pretty small or maybe net-negative (due to factory farming) today (relative to optimized altruism) and I expect the share will go down going forward.)

(Aside: I was talking about not-very-thoughtful indexical preferences. It seems likely to me that doing a reasonably good job of reflecting on selfish preferences gets you back to something like de facto utilitarianism (at least as far as how you spend the vast majority of computational resources), because personal identity and indexical preferences don't make much sense and the thing you end up thinking is more like "I guess I just care about experiences in general".)

What about AIs? I think there are broadly two main reasons to expect that what AIs do on reasonably-good-reflection will be worse from my perspective than what humans do:

  • As discussed above, I am more similar to other humans, and when I inspect the object level of how other humans think or act, I feel reasonably optimistic about the results of reasonably-good-reflection for humans. (It seems to me like the main thing holding me back from agreement with other humans is mostly biases/communication/lack of smarts/wisdom, given many shared intuitions.) However, AIs might be more different and thus result in less value. Further, the values of humans after reasonably-good-reflection seem close to saturating in goodness from my perspective (perhaps 1/3 or 1/2 of the value of optimizing purely for my own values), so it seems hard for AIs to do much better.
    • To better understand this argument, imagine that instead of humanity the question was between identical clones of myself and AIs. It's pretty clear I share the same values as the clones, so the clones do pretty much strictly better than AIs (up to self-defeating moral views).
    • I'm uncertain about the degree of similarity between myself and other humans. But mostly, the underlying similarity uncertainties also apply to AIs. So, e.g., maybe I currently think that on reasonably-good-reflection humans spend resources 1/3 as well as I would and AIs spend resources 1/9 as well. If I updated to think that other humans after reasonably-good-reflection only spend resources 1/10 as well as I do, I might also update to thinking AIs spend resources 1/100 as well. (A toy version of this correlated update is sketched just after this list.)
  • In many of the stories I imagine for AIs seizing control, very powerful AIs end up directly pursuing close correlates of what was reinforced in training (sometimes called reward-seeking, though I'm trying to point at a more general notion). Such AIs are reasonably likely to pursue relatively obviously valueless-from-my-perspective things on reflection. Overall, they might act more like an ultra-powerful corporation that just optimizes for power/money rather than like our children (see also here). More generally, AIs might in some sense be subjected to wildly higher levels of optimization pressure than humans while being better able to internalize these values (due to the lack of a genetic bottleneck), which can plausibly result in "worse" values from my perspective.
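
To make the correlated-update point concrete, here is a toy model. The specific functional form (other humans at a similarity factor s, AIs at roughly s², i.e. the similarity discount applied twice) is purely an illustrative assumption I'm making to match the numbers above, not something the argument depends on:

```python
# Toy model of the correlated update above (illustrative only; the s and s**2
# functional form is an assumption, not something argued for in the comment).

def values_from_similarity(s: float) -> tuple[float, float]:
    """Return (human_value, ai_value) as fractions of the value of my own
    reasonably-good-reflection, given a similarity factor s."""
    return s, s ** 2

print(values_from_similarity(1 / 3))   # humans ~1/3, AIs ~1/9
print(values_from_similarity(1 / 10))  # updating humans to 1/10 drags AIs to ~1/100
```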

Note that we're conditioning on safety/alignment technology failing to retain human control, so we should imagine correspondingly less human control over AI values.

I think the fraction of the computational resources of our lightcone used based on the result of a reasonably-good-reflection process seems similar between human control and AI control (perhaps 50%). It's possible to mess this up, of course, either by messing up the reflection or by locking in bad values too early. But, when I look at the balance of arguments, humans messing this up seems pretty similar to AIs messing this up to me. So, the main question is what the result of such a process would be. One way to put this is that I don't expect humans to differ substantially from AIs in terms of how "thoughtful" they are.

I interpret one of your arguments as being "Humans won't be very thoughtful about how they spend vast, vast amounts of computational resources. After all, they aren't thoughtful right now." To the extent I buy this argument, I think it applies roughly equally well to AIs. So naively, it just divides both sides rather than making AI look more favorable. (At least, if you accept that almost all of the value comes from being at least a bit thoughtful, which you also contest. See my arguments for that.)

In other words, agents optimizing for their own happiness, or the happiness of those they care about, seem likely to be the primary force behind the creation of hedonium-like structures. They may not frame it in utilitarian terms, but they will still be striving to maximize happiness and well-being for themselves and others they care about regardless. And it seems natural to assume that, with advanced technology, they would optimize pretty hard for their own happiness and well-being, just as a utilitarian might optimize hard for happiness when creating hedonium.

Suppose that a single misaligned AI takes control and it happens to care somewhat about its own happiness while not having any more "altruistic" tendencies that I or you would care about. (I think it's less likely than not that misaligned AIs which seize control care substantially about their own happiness, but let's suppose this for now.) (I'm saying "single misaligned AI" for simplicity; I get that a messier coalition might be in control.) It now has access to vast amounts of computation after sending out huge numbers of probes to take control over all available energy. This is enough computation to run absolutely absurd amounts of stuff.

What are you imagining it spends these resources on which is competitive with optimized goodness? Running >10^50 copies of itself which are heavily optimized for being as happy as possible while spending these resources?

If a small number of agents have a vast amount of power, and these agents don't (eventually, possibly after a large amount of thinking) want to do something which is de facto like the values I end up caring about upon reflection (which is probably, though not certainly, vaguely like utilitarianism in some sense), then from my perspective it seems very likely that the resources will be squandered.

If you're imagining something like:

  1. It thinks carefully about what would make "it" happy.
  2. It realizes it cares about having as many diverse good experience moments as possible in a non-indexical way.
  3. It realizes that heavy self-modification would result in these experience moments being better and more efficient, so it creates new versions of "itself" which are radically different and produce good experiences more efficiently.
  4. It realizes it doesn't care much about the notion of "itself" here and mostly just focuses on good experiences.
  5. It runs vast numbers of such copies with diverse experiences.

Then this is just something like utilitarianism by another name, reached via a different line of reasoning.

I thought your view was that step (2) in this process won't go like this. E.g., currently self-ish entities will retain indexical preferences. If so, then I don't see where the goodness can plausibly come from.

The fact that our current world isn't well described by the idea that what matters most is the number of explicit utilitarians strengthens my point here.

When I look at very rich people (people with >$1 billion), it seems like the dominant way they make the world better via spending money (not via making money!) is via thoughtful altruistic giving, not via consumption.

Perhaps your view is that with the potential for digital minds this situation will change?

(Also, it seems very plausible to me that the dominant effect on current welfare is driven mostly by the effect on factory farming and other animal welfare.)

I expect this trend to further increase as people get much, much wealthier and some fraction (probably most) of them get much, much smarter and wiser with intelligence augmentation.

Additionally, how are you feeling about voluntary commitments from labs (RSPs included) relative to alternatives like mandatory regulation by governments?

This is discussed in Holden's earlier post on the topic here.

Explicit +1 to what Owen is saying here.

(Given that I commented with some counterarguments, I thought I would explicitly note my +1 here.)

In particular, I am persuaded by the argument that, because evaluation is usually easier than generation, it should be feasible to accurately evaluate whether a slightly-smarter-than-human AI is taking unethical actions, allowing us to shape its rewards during training accordingly. After we've aligned a model that's merely slightly smarter than humans, we can use it to help us align even smarter AIs, and so on, plausibly implying that alignment will scale to indefinitely higher levels of intelligence, without necessarily breaking down at any physically realistic point.

This reasoning seems to imply that you could use GPT-2 to oversee GPT-4 by bootstrapping from a chain of models of scales between GPT-2 and GPT-4. However, this isn't true: the weak-to-strong generalization paper finds that this doesn't work, and indeed bootstrapping like this doesn't help at all for ChatGPT reward modeling (I believe it helps on chess puzzles and on nothing else they investigate).

I think this sort of bootstrapping argument might work if we could ensure that each model in the chain was sufficiently aligned and capable of reasoning that it would carefully reason about what humans would want if they were more knowledgeable and then rate outputs based on this. However, I don't think GPT-4 is either aligned enough or capable enough that we see this behavior. And I still think it's unlikely this works even under these generous assumptions (though I won't argue for this here).
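
For concreteness, here is a minimal schematic of the kind of bootstrapping chain being discussed: each model labels data for the next, slightly stronger model, which is finetuned on those labels and then becomes the supervisor for the next step. The `Model` representation, the `finetune` stub, and the model names are placeholders I'm assuming for illustration; this is a sketch of the setup, not the weak-to-strong generalization paper's actual method or code.

```python
# Schematic of iterated weak-to-strong bootstrapping (illustrative stubs only).
from dataclasses import dataclass
from typing import Callable, List, Tuple

Example = str   # placeholder: an unlabeled input (e.g. a prompt/response pair)
Label = float   # placeholder: a reward-model score or preference label


@dataclass
class Model:
    name: str
    predict: Callable[[Example], Label]  # the model's labeling behavior


def finetune(student_name: str, labeled_data: List[Tuple[Example, Label]]) -> Model:
    """Placeholder finetuning step: returns a 'student' that imitates the labels it saw.

    In the real setting this would be supervised finetuning of the larger model
    on the weaker supervisor's labels."""
    memorized = dict(labeled_data)
    return Model(name=student_name, predict=lambda x: memorized.get(x, 0.0))


def bootstrap_chain(weak_supervisor: Model,
                    student_names: List[str],
                    unlabeled_data: List[Example]) -> Model:
    """Each model in the chain labels the data for the next, stronger model."""
    supervisor = weak_supervisor
    for name in student_names:
        weak_labels = [(x, supervisor.predict(x)) for x in unlabeled_data]
        supervisor = finetune(name, weak_labels)  # student becomes the next supervisor
    return supervisor


# Usage: start from a GPT-2-level supervisor and walk up a chain of scales.
gpt2_level = Model(name="gpt2-level", predict=lambda x: float(len(x) % 2))  # dummy labeler
final = bootstrap_chain(
    gpt2_level,
    student_names=["gpt3-level", "gpt3.5-level", "gpt4-level"],
    unlabeled_data=["example input A", "example input B"],
)
print(final.name)  # "gpt4-level", supervised only via labels propagated up the chain
```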

In fact, it is difficult for me to name even a single technology that I think is currently underregulated by society.

The obvious examples would be synthetic biology, gain-of-function research, and similar.

I also think AI itself is currently massively underregulated, even entirely ignoring alignment difficulties. I think the probability of the creation of AI capable of accelerating AI R&D by 10x this year is around 3%. It would be extremely bad for US national interests if such an AI were stolen by foreign actors. This alone suffices to justify regulation ensuring very high levels of security, IMO. And this is setting aside ongoing IP theft and similar issues.

Sure, but there are many alternative explanations:

  • There is internal and external pressure to avoid downplaying AI safety.
  • Regulation is inevitable, so it would be better to ensure that you can at least influence it somewhat. Purely fighting against regulation might go poorly for you.
  • The leaders care at least a bit about AI safety either out of a bit of altruism or self interest. (Or at least aren't constantly manipulative to such an extent that they choose all words to maximize their power.)

Not to mention that Big Tech companies whose business plans might be most threatened by "AI pause" advocacy are currently seeing more general "AI safety" arguments as an opportunity to achieve regulatory capture...

Why do you think this? It seems very unclear if this is true to me.

I'm not sure that I buy that critics lack motivation. At least in the space of AI, there will be (and already are) people with immense financial incentive to ensure that x-risk concerns don't become very politically powerful.

Of course, it might be that the best move for these critics won't be to write careful and well-reasoned arguments, for whatever reason (e.g. this would draw more attention to x-risk, so ignoring it is better from their perspective).

Edit: this is mentioned in the post, but I'm a bit surprised that it isn't emphasized more.

because it feels very differently about "99% of humanity is destroyed, but the remaining 1% are able to rebuild civilisation" and "100% of humanity is destroyed, civilisation ends"

Maybe? This depends on what you think about the probability that intelligent life re-evolves on earth (it seems likely to me) and how good you feel about the next intelligent species on earth vs humans.

the particular focus on extinction increases the threat from AI and engineered biorisks

IMO, most x-risk from AI probably doesn't come from literal human extinction but instead from AI systems acquiring most of the control over long-run resources while some/most/all humans survive, but fair enough.
