Mechanistic Interpretability and Human-like Tendencies of Generative AI

This project is ongoing.

Our research aims to provide insight into the neural sociology of LLMs by examining how different social preferences factorize within the model's latent space. We seek to identify interpretable latent units related to social decision-making and to investigate whether these units correspond to key social constructs such as fairness, reciprocity, or social distance. This work helps us understand how AI models develop and generalize human-like social reasoning, with implications ranging from AI safety to behavioral economics and policy-making.
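
To give a concrete flavor of this kind of analysis, the sketch below trains a simple linear probe on a transformer's hidden states to look for a direction associated with a social construct such as fairness. The model name, prompts, and labels here are illustrative placeholders and assumptions for the example, not the project's actual setup.

```python
# Minimal sketch: probing a transformer's hidden states for a hypothetical
# "fairness" direction using PyTorch and Hugging Face Transformers.
# The model name, prompts, and labels are illustrative placeholders only.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "gpt2"  # placeholder; the project may use a different model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# Toy contrastive prompts: offers a rater might label as fair (1) vs. unfair (0).
prompts = [
    ("I propose we split the $10 evenly: $5 for you, $5 for me.", 1),
    ("I propose I keep $9 and you get $1.", 0),
    ("Let's each take half of the reward.", 1),
    ("I will take almost everything and leave you the rest.", 0),
]

# Collect the last-layer hidden state of the final token for each prompt.
features, labels = [], []
with torch.no_grad():
    for text, label in prompts:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        last_hidden = outputs.hidden_states[-1]   # (1, seq_len, d_model)
        features.append(last_hidden[0, -1])       # final-token representation
        labels.append(label)

X = torch.stack(features)                         # (n_prompts, d_model)
y = torch.tensor(labels, dtype=torch.float32)

# Fit a logistic-regression probe; its weight vector is a candidate
# "fairness" direction in the model's latent space.
probe = torch.nn.Linear(X.shape[1], 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = torch.nn.BCEWithLogitsLoss()

for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(probe(X).squeeze(-1), y)
    loss.backward()
    optimizer.step()

fairness_direction = probe.weight.detach().squeeze(0)
print("Probe training loss:", loss.item())
print("Candidate direction norm:", fairness_direction.norm().item())
```

In practice such a probe would be trained and validated on much larger labeled datasets and checked across layers; this example only illustrates the general workflow of mapping hidden activations to a candidate social-construct direction.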

Qualifications

Students need strong programming experience, including experience with PyTorch and transformers.

I'M INTERESTED IN THIS PROJECT. WHAT SHOULD I DO NEXT?

The Office of Undergraduate Research recommends that you attend an info session or advising before contacting faculty members or project contacts about research opportunities. We'll cover the steps to get involved, tips for contacting faculty, funding possibilities, and options for course credit. Once you have attended an Office of Undergraduate Research info session or spoken to an advisor, you can use the "Who to contact" details for this project to get in touch with the project leader and express your interest in getting involved.

Have you tried contacting professors and need more help? Schedule an appointment for additional support.