I was introduced to this field when I read Nick Bostrom’s Superintelligence back in 2016, during my undergrad. I was pursuing a degree in Physics and Math at the time, but I happened to have also just taken a one-off ML course. I thought it was brilliant and important! But, since I was just a theoretical physicist, I didn’t think I’d actually be able to work on it…
Fast forward three years to 2019, the end of my first year in grad school: I started reading LessWrong and binged Rationality: From AI to Zombies. I decided that mathematical physics wasn’t for me and switched to my current advisor, Konstantin Matchev. He was seeing the light of ML at the same time and moving away from more traditional particle phenomenology research. We had a great time learning all about ML together (he is now the department’s resident ML expert and teaches its ML class every semester). But mostly I was thrilled to find out that all this AI and rationality stuff I had been reading about was actually coming in handy!
After another two years of this, it was 2021 and I had about five papers under my belt. I had pretty thoroughly absorbed Yudkowsky’s lessons on rationality, gotten up to speed on state-of-the-art ML, and just read The Alignment Problem. This reignited the old interest in alignment I’d gotten from reading Superintelligence. But now, it was popular! These ideas had huge immediate implications for society in terms of near-term alignment goals such as fairness, trust, interpretability, and so on.
All the while I was feeling the effects of reading about rationality. I had to relinquish many cherished beliefs once I was able to think more clearly about them, and so much the better. Looking back, I can’t believe how crazy some of the things I used to believe were, and of course I had always insisted that I was open-minded and “rational.” So it goes. That journey is more personal and should probably be left for another post, but I mention it here only to say that in late 2021 I finally fell into the rabbit hole of Effective Altruism, completing the interest trifecta of EA, Rationality, and AI Safety.
I read widely about EA, Existential Risks, Global Health & Development, Effective Giving, Global Priorities Research, Moral Philosophy, and of course AI safety, which brings us to the present day… I would consider working in any of those areas, but at present my skillset is best suited to work in AI, so AI safety is likely the area in which I could make the biggest impact.
Therefore this is now my main interest.
I’m currently busy wrapping up existing projects in my PhD, but I am actively learning about Deep Reinforcement Learning and trying to read as broadly as possible about Safety and Alignment. I’m still developing my ideas and am looking for a good project to work on in this area. If you would like to collaborate, send me an email at [email protected].
Some ideas I’ve had for projects include:
I am currently recruiting people to get up to speed with me on AI Safety research. If you’re interested in participating, email me at [email protected].