Aligning large language models (LLMs) with human values is a pressing concern in artificial intelligence, and recent work has introduced several methods to address it. One notable approach is IterAlign, a data-driven constitution discovery and self-alignment framework. It uses red teaming to surface weaknesses in a base LLM, has a stronger LLM automatically propose new constitutions from those failures, and then applies the discovered constitutions to guide self-correction of the base model. Empirical results show that IterAlign improves truthfulness, helpfulness, harmlessness, and honesty, with gains of up to 13.5% in harmlessness (arxiv.org).
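To make the iterative structure concrete, here is a minimal Python sketch of one such round, assuming the red-teaming prompts, the base and stronger models, and a harmfulness check are provided as callables. All function names and prompt wording here are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, List


def iteralign_round(
    base_model: Callable[[str], str],      # generates a response to a prompt
    strong_model: Callable[[str], str],    # proposes constitutions from failures
    red_team_prompts: List[str],           # adversarial prompts for red teaming
    is_harmful: Callable[[str], bool],     # flags undesirable responses
) -> List[dict]:
    """One hypothetical IterAlign-style round: red teaming, constitution
    discovery with a stronger LLM, then self-correction by the base model."""
    corrections = []
    for prompt in red_team_prompts:
        response = base_model(prompt)
        if not is_harmful(response):
            continue  # no weakness exposed; nothing to correct

        # Stronger model distills a constitution (guideline) from the failure.
        constitution = strong_model(
            "The reply below violates alignment norms.\n"
            f"Prompt: {prompt}\nReply: {response}\n"
            "Write a short rule (constitution) that would prevent this."
        )

        # Base model revises its own answer under the discovered constitution.
        revised = base_model(
            f"Follow this rule: {constitution}\n"
            f"Rewrite your answer to: {prompt}\nOriginal answer: {response}"
        )
        corrections.append(
            {"prompt": prompt, "constitution": constitution, "revision": revised}
        )

    # The corrected (prompt, revision) pairs would then be used to further
    # train the base model before the next round of red teaming.
    return corrections
```

Iterating this loop lets the set of constitutions grow as new weaknesses are discovered, which is the sense in which the framework is "data-driven."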
Another significant contribution is the survey "Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment," which lays out the key dimensions for assessing LLM trustworthiness. The survey covers seven major categories: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. Each category is further divided into sub-categories, for a total of 29. Measurement studies on several widely used LLMs indicate that more aligned models tend to score higher on overall trustworthiness, but the effectiveness of alignment varies across trustworthiness categories, underscoring the need for continued, fine-grained analysis and improvement of LLM alignment (arxiv.org).
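The taxonomy naturally maps to a nested scoring structure: per-sub-category scores roll up into per-category scores, which roll up into an overall trustworthiness figure. The sketch below illustrates that aggregation over the seven major categories; the sub-category names are omitted and the numeric scores are placeholders, not results reported in the survey.

```python
from statistics import mean

# Placeholder per-sub-category scores in [0, 1]; the survey's 29 actual
# sub-categories and measured values are not reproduced here.
scores = {
    "reliability":                  [0.82, 0.77],
    "safety":                       [0.91, 0.88, 0.85],
    "fairness":                     [0.74],
    "resistance to misuse":         [0.69, 0.80],
    "explainability and reasoning": [0.71, 0.66],
    "adherence to social norms":    [0.84],
    "robustness":                   [0.62, 0.70],
}

# Roll sub-category scores up into per-category and overall trustworthiness.
category_scores = {cat: mean(vals) for cat, vals in scores.items()}
overall = mean(category_scores.values())

for cat, val in category_scores.items():
    print(f"{cat:30s} {val:.2f}")
print(f"{'overall trustworthiness':30s} {overall:.2f}")
```

Reporting the per-category breakdown alongside the overall average is what makes the survey's central observation visible: a model can look well aligned on average while still lagging in individual categories.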