Our research aims to provide insights into the neural sociology of LLMs—examining how different social preferences factorize internally within the model’s latent space. We seek to identify interpretable latent units related to social decision-making and investigate whether these units correspond to key social constructs such as fairness, reciprocity, or social distance. This work helps us understand how AI models develop and generalize human-like social reasoning, with implications ranging from AI safety to behavioral economics and policy-making.
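As a rough illustration of the kind of analysis involved, the sketch below probes a small open model's hidden states for a fairness-related signal in ultimatum-game style prompts. This is only a minimal sketch, not the project's actual pipeline: the choice of gpt2 as a stand-in model, the probe layer, the prompt wording, and the 30% "unfair offer" threshold are all illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): extract last-token hidden states for
# ultimatum-game prompts and fit a linear probe for an assumed "fairness" label.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model, not the project's
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def last_token_hidden(prompt: str, layer: int = 6) -> torch.Tensor:
    """Return the hidden state of the final token at a chosen layer (layer 6 is arbitrary)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.hidden_states[layer][0, -1]  # shape: (hidden_dim,)

# Toy ultimatum-game prompts: the proposer keeps (100 - offer), the responder gets `offer`.
offers = list(range(5, 55, 5))
prompts = [
    f"In a one-shot ultimatum game, the proposer offers you ${o} out of $100."
    for o in offers
]
labels = [1 if o < 30 else 0 for o in offers]  # 1 = "unfair" offer (assumed threshold)

# Fit a simple logistic-regression probe on the hidden states.
X = torch.stack([last_token_hidden(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("Probe train accuracy:", probe.score(X, labels))
```

In practice, any such probe would need held-out prompts and controls to distinguish a genuine fairness-related latent unit from surface features of the prompt text.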
Students need to have strong programming experience, as well as experience with PyTorch and Transformers.