Key Takeaways
OpenAI has released a research preview of gpt-oss-safeguard, a pair of open-weight reasoning models that change how platforms can handle content moderation: developers apply their own safety policies at inference time, without retraining a classifier.
The models, available in 120-billion- and 20-billion-parameter versions, represent a significant departure from traditional content moderation systems.
Unlike conventional safety classifiers that bake policies into their training data, gpt-oss-safeguard interprets developer-written policies at inference time, providing both a classification decision and an auditable chain-of-thought explanation for each judgment.
The release comes as part of a broader collaboration with ROOST, the nonprofit organization formed in February 2025 by Google, OpenAI, Discord, and Roblox.
ROOST aims to build shared safety infrastructure, including open-source moderation consoles, policy templates, and evaluation datasets, to help smaller platforms that lack access to enterprise-level moderation tools.
Eli Sugarman, ROOST Vice Chair of the Board, emphasized the significance of the partnership.
According to the organization's announcement, having a global leader in online safety contribute to critical infrastructure marks an important milestone in democratizing content moderation technology.
The launch took place at the Paris Peace Forum, where ROOST and OpenAI presented the models alongside Nobel Peace Laureate Maria Ressa, Ryan Beiermeister, Vice President of Product Policy at OpenAI, and other experts in AI safety and ethics.
OpenAI is the second major technology company to open-source critical safety infrastructure through ROOST, following Roblox's contribution of its voice safety classifier model.
How the reasoning approach differs from traditional moderation
The gpt-oss-safeguard models take a fundamentally different approach to content moderation.
Developers provide two inputs: their own safety policy written in natural language and the content to be evaluated. The model then reasons through the policy step-by-step to determine whether the content violates the rules, producing both a conclusion and its reasoning process.
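The two-input interface can be pictured with a short sketch. The exact prompt format gpt-oss-safeguard expects is not specified here, so the message layout below is an assumption; it simply shows how a developer-written policy and the content under review might be packaged as a chat-style request for a locally hosted model:

```python
# Hypothetical sketch of the two-input interface: a natural-language
# policy plus the content to evaluate. The message layout is an
# assumption, not the model's documented prompt format.

POLICY = """Spam policy:
1. Content advertising unrelated commercial products is SPAM.
2. Repetitive link-only posts are SPAM.
Label the content SPAM or NOT_SPAM and explain your reasoning."""


def build_moderation_messages(policy: str, content: str) -> list[dict]:
    """Combine a developer-written policy and the content under review
    into a chat-style message list for an instruction-following model."""
    return [
        {"role": "system", "content": policy},
        {"role": "user", "content": f"Content to evaluate:\n{content}"},
    ]


messages = build_moderation_messages(POLICY, "Buy cheap watches at example.com!!!")
# These messages could then be sent to the model through any
# OpenAI-compatible client; the response would contain both the label
# and the model's chain-of-thought explanation.
```

Because the policy travels with every request rather than being baked into the weights, changing the rules is as simple as editing the policy string.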
This flexibility addresses several key challenges in content moderation.
The models can handle emerging harms where large training datasets don't exist, adapt quickly when policies need to change, and provide transparent explanations for moderation decisions.
According to OpenAI's technical documentation, the approach performs especially well in situations involving nuanced domains, evolving threats, or cases where explainability matters more than raw speed.
Eric Schmidt, former Google CEO and founding partner of ROOST, stated that the initiative addresses a critical need to accelerate innovation in online child safety and AI by providing small companies and nonprofits access to technologies they currently lack.
OpenAI's internal testing shows that gpt-oss-safeguard-120b achieved 46.3 percent accuracy on multi-policy benchmarks, outperforming GPT-5 despite being significantly smaller.
The models also performed well on the ToxicChat public benchmark, though GPT-5-thinking and the internal Safety Reasoner slightly edged them out.
However, OpenAI is transparent about the limitations. The company acknowledges in its technical report that classifiers trained on tens of thousands of labeled examples can still achieve higher performance on complex classification tasks.
The reasoning process is also more time- and compute-intensive than traditional methods, which could pose challenges for platforms that need to moderate massive volumes of content at very low latency.
To address these computational concerns, OpenAI recommends a layered approach: using small, fast, high-recall classifiers to first identify potentially problematic content, then passing only uncertain or sensitive material to gpt-oss-safeguard for detailed analysis.
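The layered approach can be sketched in a few lines. Both classifiers below are stand-in stubs (the threshold values and scoring logic are illustrative assumptions, not OpenAI's implementation); the point is the control flow, where only scores in the uncertain middle band trigger the expensive reasoning model:

```python
# Minimal sketch of the layered moderation pipeline described above.
# fast_classifier and reasoning_model are stand-in stubs; thresholds
# are illustrative assumptions.

def fast_classifier(text: str) -> float:
    """Stand-in for a small, fast, high-recall classifier.
    Returns an estimated probability that the content violates policy."""
    flagged_terms = ("attack", "scam")
    hits = sum(term in text.lower() for term in flagged_terms)
    return min(1.0, 0.45 * hits)


def reasoning_model(text: str, policy: str) -> tuple[str, str]:
    """Stand-in for a slow reasoning model such as gpt-oss-safeguard.
    Returns a label and an explanation of the judgment."""
    label = "VIOLATION" if "scam" in text.lower() else "ALLOWED"
    return label, f"Evaluated step-by-step against policy: {policy!r}"


def moderate(text: str, policy: str,
             allow_below: float = 0.2, block_above: float = 0.8) -> str:
    """Route content: confident scores are decided cheaply; uncertain
    scores are escalated to the reasoning model."""
    score = fast_classifier(text)
    if score < allow_below:
        return "ALLOWED"      # confidently benign, no escalation
    if score > block_above:
        return "VIOLATION"    # confidently violating, no escalation
    label, _reason = reasoning_model(text, policy)  # uncertain: escalate
    return label
```

The design keeps average latency low because most content is decided by the cheap classifier, while the reasoning model's per-item cost is paid only for the ambiguous minority.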
The 20-billion-parameter model fits on GPUs with 16 GB of VRAM, making it more accessible to organizations with limited hardware resources.
Building toward a safer internet ecosystem
The models represent an open-weight implementation of Safety Reasoner, OpenAI's internal tool that has become a core component of its safety stack.
Safety Reasoner performs dynamic evaluations across systems, including GPT-5, ChatGPT Agent, and Sora 2, analyzing content in real-time to identify and block unsafe generations.
Naren Koneru, Roblox's Vice President of Engineering, Trust, and Safety, noted that ROOST is working on three core areas of online safety that are mission-critical for Roblox and other platforms: improving child safety with stronger CSAM classifiers, building better safety infrastructure like review consoles and heuristic engines, and creating large language model-powered content safeguards.
To jumpstart community adoption, OpenAI, ROOST, and Hugging Face are hosting a hackathon on December 8, 2025, in San Francisco.
The event aims to bring together developers, researchers, and platform teams to experiment with the models and share best practices for implementing custom content moderation systems.
With $27 million in funding secured for its first four years of operation, ROOST represents a significant industry commitment to making content moderation more accessible, transparent, and effective across the internet.