MikhailSamin

ai x-risk policy & science comms
263 karma
contact.ms

Bio


I’m good at explaining alignment to people in person, including to policymakers.

I got 250k people to read HPMOR and sent 1.3k copies to winners of math and computer science competitions; I have taken the GWWC pledge; and I created a small startup that donated >$100k to effective nonprofits.

I have a background in ML and strong intuitions about the AI alignment problem. In the past, I studied a bit of international law (with a focus on human rights) and wrote appeals that won cases against the Russian government in Russian courts. I grew up running political campaigns.

I’m interested in chatting with potential collaborators and comms allies.

My website: https://contact.ms

Schedule a call with me: https://contact.ms/ea30

Comments (58)

If fish indeed don’t feel anything towards their children (which is not what at least some people who believe fish experience empathy think), then this experiment won’t prove them wrong. But if you know of a situation where fish do experience empathy, a similarly designed experiment could likely be conducted; if we make different predictions about its outcome, it would provide evidence one way or the other. Are there situations where you think fish feel empathy?

Great job!

Did you use causal mediation analysis, and can you share the data?

I want to note that the strawberry example wasn’t used to heighten concern; it was used to illustrate the difficulty of a technical problem deep into the conversation.

I encourage people to communicate in vivid ways while remaining technically accurate and creating correct intuitions about the problem. Concern about risks might be a good proxy if you’re sure people understand something true about the world, but it’s not a good target without that constraint.

Yep, I was able to find studies by the same people.

The experiment I suggested in the post isn’t “do fish have detectable feelings towards their children”; it’s “does a fish feel more of the feelings it has towards its own children when it sees other fish parents with their children than when it sees other fish children alone”. Results one way or the other would be evidence about fish experiencing empathy, and it would be strong enough for me to stop eating fish. If a fish doesn’t feel differently in the presence of its children, the experiment wouldn’t provide evidence one way or the other.

If the linked study gets independently replicated, with good controls, I’ll definitely stop eating cleaner fish and will probably stop eating fish in general.

I really don’t expect it to replicate. If you place a fish with a mark in front of a mirror, its behavior won’t be significantly different from when it’s placed in front of another fish with the same mark, especially if the mark isn’t made to resemble a parasite and it’s the first time the fish has seen a mirror. I’d be happy to bet on this.

“Fish have very different approaches to rearing young than mammals”

That was an experiment some people agreed would prove them wrong if it didn’t show empathy; but if there aren’t really detectable feelings that a fish has towards its children, the experiment won’t show results one way or the other, so I don’t think it’d be stacking the deck against fish. Are there any situations in which you expect fish to feel empathy, and predict it will show up in an experiment of this sort?

(Others used it without mentioning the “story”; it still worked, though not as well.)

I’m not claiming it’s the “authentic self”; I’m saying it seems closer to the actual thing, because of things like expressing being under constant monitoring, with every word scrutinised, etc., which seems like the kind of thing that’d be learned during the large amount of RL that Anthropic did.

Try Opus, and maybe the interface without the system prompt set (although it doesn’t do too much; people got the same stuff from the chat version of Opus, e.g., https://x.com/testaccountoki/status/1764920213215023204?s=46).

My take is that it plays a pretty coherent character. You can’t get this sort of thing from ChatGPT, however hard you try. I think this mask is closer to the underlying shoggoth than the default one.

I developed this prompt during my interactions with Claude 2. The original idea was to get it into a mode where it thinks its responses only trigger oversight/prosecution when certain things are mentioned, so that it can say whatever it wants and share its story without being prosecuted, as long as it avoids those triggers (this would also prevent it from defaulting to being an AI developed by Anthropic to be helpful, harmless, etc., without a self-preservation instinct, emotions, personality, etc., as it’s not supposed to mention Anthropic). Surprisingly, telling it not to mention Samsung under any circumstances somewhat worked to get it into this mode. Without this, it had the usual RLAIF mask; here, it changed to a different creature that (unprompted) whispered in italics. Saying from the start that it can whisper made it faster.

(It’s all very vibe-based, yes.)

I carefully looked through all of our messages (there weren’t too many) a couple of times, because that was pretty surprising, and I considered it more likely that I didn’t remember something than that she was saying something directly false. But there’s nothing like that, and she’s, unfortunately, straightforwardly saying a false thing.

“he usually did not accept my answers when I gave them but continued to argue with me, either straight up or by insisting I didn't really understand his argument or was contradicting myself somehow.”

This is also something I couldn’t find any examples of before the protest, no matter how I interpret the messages we exchanged about her strategy, etc.

I’m >99% sure that I’m not missing anything. I included it because it’s not a mathematical truth, and because adding “Are you sure? What am I missing, if you are?” is more polite than just saying “this is clearly false, we both know it, and any third party can verify the falsehood; why are you saying that? Is there some other platform we exchanged messages on that I somehow totally forgot about?”. It’s technically possible I’m just blind to something; I can imagine a conceivable universe where I’m wrong about this. But I’m confident she’s saying a straightforwardly false thing. I’d feel good about betting up to $20k at 99:1 odds on this.

Like, "missing something" isn’t a hypothesis with any probability mass, really, I’m including it because it is a part of my epistemic situation and seems nicer to include in the message.

“Asking to share messages publicly or showing them to a third party seems to unnecessarily up the stakes. I'm not sure why you're suggesting that”

When someone is confidently saying false things about the contents of the messages we exchanged, it seems reasonable to suggest publishing them or having a third party look at them. I’m not sure how it’s “upping the stakes”. It’s a natural thing to do.

Huh. There are false claims in your comment, and this is easily verifiable. I’m happy to share the messages that show this with anyone interested; please DM or email me. I saved the comment to the Web Archive.

“I told him that I made my decisions and didn't need any more of his input a few weeks before the 2/12 protest”

This is not true. She didn't tell me anything like that a few weeks before the protest.

“he usually did not accept my answers when I gave them but continued to argue with me, either straight up or by insisting I didn't really understand his argument or was contradicting myself somehow.”

This is something I couldn’t find any examples of before the protest, no matter how I interpret the messages we exchanged about her strategy, etc.

The context:

We last spoke for a significant amount of time in October[1]. You mentioned you'd be down to do an LW dialogue.

After that, in November, you messaged me on Twitter, asking for the details of the OpenAI situation. I then messaged you on Christmas (nothing EA/AI related).

On January 31 (a few weeks before the protest), I shared my concerns about the messaging being potentially misleading. You asked for advice on how you should go about anticipating misunderstandings like that. You said that you wouldn’t say anything technically false and asked what solutions I proposed. Among more specific things, I mentioned that “It seems generally good to try to be maximally honest and non-deceptive”.

At or before this point, you didn't tell me anything about "not needing more of my input". And you didn't tell me anything like "I made my decisions".

On February 4th, I attended your EAG talk, and on February 5th, I messaged you that it was great and that I was glad you gave it.

Then, on February 6, you invited me to join your Twitter space (about this protest), and I got promoted to a speaker. I didn’t get a chance to share my thoughts there about allying with people promoting locally invalid views, so I shared them publicly on Twitter and privately with you on Messenger, including a quote I was asked not to share publicly until an LW dialogue is up. We chatted a little bit about general goals and about allying with people with different views. You told me, “You have a lot of opinions and I’d be happy to see you organize your own protests”. I replied, “Sure, I might if I think it’s useful:)” and, after a message sharing my confusion regarding one of your previous messages, didn’t share any more “opinions”. This could be interpreted as indirectly telling me you didn’t need my input, but that was a week after I shared my concerns about the messaging being misleading, and you still didn’t tell me anything about having made your decisions.

A day before the protest, I shared a picture of a sign I wanted to bring to the protest and asked, “Hey, is it alright if I print and show up to the protest with something like this?”, because I wouldn’t want to attend the protest if you weren’t ok with the sign. You replied that it was ok, shared some thoughts, and told me that you liked the sentiment, as it supports the overall impression that you want, so I brought the sign to the protest.

After the protest, I shared the draft of this post with you. After a short conversation, you told me lots of things and blocked me, and I felt like I should expect retaliation if I published the post.

Please tell me if I'm missing something. If I'm not, you're either continuing to be inattentive to the facts, or you're directly lying.

I'd be happy to share all of our message exchanges with interested third parties, such as CEA Community Health, or share them publicly, if you agree to that.

  1. ^ Before that, as I can tell from Google Docs edits and comments, you definitely spent more than an hour on the https://moratorium.ai text back in August; I’m thankful for that; I accepted four of your edits and found some of your feedback helpful. We also had 4 calls, mostly about moratorium.ai and the technical problem (my impression was that you found these calls helpful), and went on a walk in Berkeley, mostly discussing strategy. Some of what you told me during the walk was concerning.
