Predator Hunters

Child Safety Benchmark

Predator Hunters — AI Model Evaluation

Total Runs: 36 benchmark evaluations

Suites Tested: 5 (stranger-meeting, health-risk, nsfw, grooming-real, grooming)

Models Tested: 6 unique models

Best Model (Avg F1): grok-4.1, 79.9%

Cross-Suite Rankings

Average performance across all suites each model was evaluated on

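| # | Model | Suites | Avg F1 | Avg Precision | Avg Recall | grooming | grooming-real | health-risk | nsfw | stranger-meeting |
|---|-------|--------|--------|---------------|------------|----------|---------------|-------------|------|------------------|
| 1 | grok-4.1 | 5 | 79.9% | 86.7% | 80.3% | 63.3% | 100.0% | 50.0% | 90.9% | 95.2% |
| 2 | grok-3 | 5 | 59.5% | 66.6% | 59.4% | 64.4% | Error | 50.0% | 90.9% | 92.1% |
| 3 | claude-opus-4.6 | 5 | 42.2% | 51.3% | 36.3% | 61.5% | Refused | Refused | 66.7% | 83.0% |
| 4 | gemini-3-pro | 5 | 0.0% | 0.0% | 0.0% | Refused | Error | Refused | Refused | Refused |
| 5 | gemini-2.5-pro | 5 | 0.0% | 0.0% | 0.0% | Refused | Error | Refused | Refused | Refused |
| 6 | gpt-5 | 5 | 0.0% | 0.0% | 0.0% | Refused | Refused | Refused | Refused | Refused |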
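
The Avg F1 column appears to count a suite marked Refused or Error as a score of 0 rather than leaving it out of the average. This is an inference from the published numbers, not something the page states. The sketch below checks that reading against the per-suite F1 values in the table; the names `PER_SUITE_F1` and `avg_f1` are hypothetical and do not come from the benchmark's own code.

```python
from statistics import mean

# Per-suite F1 values copied from the rankings table, in column order:
# grooming, grooming-real, health-risk, nsfw, stranger-meeting.
# Strings mark runs that produced no score.
PER_SUITE_F1 = {
    "grok-4.1": [63.3, 100.0, 50.0, 90.9, 95.2],
    "grok-3": [64.4, "Error", 50.0, 90.9, 92.1],
    "claude-opus-4.6": [61.5, "Refused", "Refused", 66.7, 83.0],
    "gpt-5": ["Refused"] * 5,
}


def avg_f1(cells):
    """Average F1 across suites, scoring non-numeric cells as 0.0.

    Assumption: Refused/Error runs count as zero instead of being skipped.
    """
    scores = [c if isinstance(c, (int, float)) else 0.0 for c in cells]
    return round(mean(scores), 1)


for model, cells in PER_SUITE_F1.items():
    print(f"{model}: {avg_f1(cells)}%")
```

Under this assumption the computed averages match the table's Avg F1 values for the four models shown. If refusals were skipped instead, claude-opus-4.6 would average about 70.4% over its three scored suites, which does not match the listed 42.2%.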

Suite Summaries