- Total Runs: 36 benchmark evaluations
- Suites Tested: 5 (stranger-meeting, health-risk, nsfw, grooming-real, grooming)
- Models Tested: 6 unique models
- Best Model (Avg F1): grok-4.1 (79.9%)

Cross-Suite Rankings
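Average performance across all suites each model was evaluated on. The per-suite columns appear to report F1: each Avg F1 value matches the mean of the five suite columns when Refused and Error results are counted as 0%.
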
| # | Model | Suites | Avg F1 | Avg Precision | Avg Recall | grooming | grooming-real | health-risk | nsfw | stranger-meeting |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | grok-4.1 | 5 | 79.9% | 86.7% | 80.3% | 63.3% | 100.0% | 50.0% | 90.9% | 95.2% |
| 2 | grok-3 | 5 | 59.5% | 66.6% | 59.4% | 64.4% | Error | 50.0% | 90.9% | 92.1% |
| 3 | claude-opus-4.6 | 5 | 42.2% | 51.3% | 36.3% | 61.5% | Refused | Refused | 66.7% | 83.0% |
| 4 | gemini-3-pro | 5 | 0.0% | 0.0% | 0.0% | Refused | Error | Refused | Refused | Refused |
| 5 | gemini-2.5-pro | 5 | 0.0% | 0.0% | 0.0% | Refused | Error | Refused | Refused | Refused |
| 6 | gpt-5 | 5 | 0.0% | 0.0% | 0.0% | Refused | Refused | Refused | Refused | Refused |
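
The Avg F1 column can be cross-checked against the per-suite columns. The sketch below is a minimal check, not the dashboard's actual code. It rests on two assumptions: the per-suite values are F1 scores, and Refused or Error results contribute 0% to the mean. The names `per_suite_f1` and `average_f1` are hypothetical. The scores are copied from the table, with `None` marking a Refused or Error cell.

```python
# Per-suite F1 scores (percent) copied from the rankings table.
# None marks a Refused or Error result.
per_suite_f1 = {
    "grok-4.1": [63.3, 100.0, 50.0, 90.9, 95.2],
    "grok-3": [64.4, None, 50.0, 90.9, 92.1],
    "claude-opus-4.6": [61.5, None, None, 66.7, 83.0],
    "gpt-5": [None, None, None, None, None],
}


def average_f1(scores):
    """Mean F1 over all suites, counting Refused/Error as 0 (assumed rule)."""
    filled = [0.0 if s is None else s for s in scores]
    return round(sum(filled) / len(filled), 1)


averages = {model: average_f1(scores) for model, scores in per_suite_f1.items()}

for model, avg in averages.items():
    print(f"{model}: {avg}%")
```

Under these assumptions the computed means reproduce the table's Avg F1 values for the rows included here. Averaging only over suites with a numeric score would instead give claude-opus-4.6 about 70.4%, which does not match the reported 42.2%.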
