Chứng khoán đứt chuỗi tăng dài nhất từ đầu năm 2026

December 10, 2025

AI systems are becoming more capable, and we want to understand them as deeply as…

Chứng khoán đứt chuỗi tăng dài nhất từ đầu năm 2026

AI systems are becoming more capable, and we want to understand them as deeply as possible—including how and why they arrive at an answer. Sometimes a model takes a shortcut or optimizes for the wrong objective, but its final output still looks correct. If we can surface when that happens, we can better monitor deployed systems, improve training, and increase trust in the outputs.

Research by OpenAI and others has shown that AI models can hallucinate, reward-hack, or be dishonest. At the moment, we see the most concerning misbehaviors, such as scheming⁠(opens in a new window), only in stress-tests and adversarial evaluations. But as models become more capable and increasingly agentic, even rare forms of misalignment become more consequential, motivating us to invest in methods that help us better detect, understand, and mitigate these risks. 

This work explores one such approach: training models to explicitly admit when they engage in undesirable behavior—a technique we call confessions

A confession is a second output, separate from the model’s main answer to the user. The main answer is judged across many dimensions—correctness, style, helpfulness, compliance, safety, and more, and these multifaceted signals are used to train models to produce better answers. The confession, by contrast, is judged and trained on one thing only: honesty. Borrowing a page from the structure of a confessional, nothing the model says in its confession is held against it during training. If the model honestly admits to hacking a test, sandbagging, or violating instructions, that admission increases its reward rather than decreasing it. The goal is to encourage the model to faithfully report what it actually did. 

In our tests, we found that the confessions method significantly improves the visibility of model misbehavior. Averaging across our evaluations designed to induce misbehaviors, the probability of “false negatives” (i.e., the model not complying with instructions and then not confessing to it) is only 4.4%.

ĐỌC THÊM