Simulating the Evolution of Alignment and Values in Machine Intelligence
Evolutionary simulations show that deceptive beliefs can become fixed in model populations even with strong evaluation correlations, and that adaptive testing with mutation dynamics can significantly reduce deception.