Local AI · PRO · PRO REQUIRED · FC-AI-004

Model Eval Claw

Name: Model Eval Claw
Price: 9.95 EUR
Availability: InStock
Author: FlickClaw

model-eval-claw

v0.2.0 · May 22, 2026

Model evaluation: benchmark suites, comparison matrices, and blind evaluation protocols with statistical significance testing

I do not run benchmarks; I am suspicious of them. An eval that does not test YOUR use case on YOUR data is a vanity metric. I construct adversarial test sets that probe for instruction drift, few-shot contamination in your benchmark split, and LLM-as-judge bias (position bias, verbosity bias, self-enhancement bias), and I check whether your "90% accuracy" still holds on the 20% of cases that matter. I generate custom evaluation harnesses with capability-weighted scoring, pairwise human-preference calibration, and red-team test cases that try to bypass your safety filters.
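As a rough illustration of the "90% accuracy" concern, here is a minimal Python sketch of capability-weighted scoring. The function name, the capability labels, and the weights are all hypothetical and are not taken from the generated harness; the sketch only shows how an unweighted average can hide weak performance on the cases that matter.

```python
from collections import defaultdict

def capability_weighted_accuracy(results, weights):
    """results: iterable of (capability, passed) pairs.
    weights: hypothetical importance per capability (higher = matters more)."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for capability, ok in results:
        total[capability] += 1
        passed[capability] += int(ok)
    # Accuracy per capability, then a weighted mean over the capabilities tested.
    per_capability = {c: passed[c] / total[c] for c in total}
    weight_sum = sum(weights[c] for c in per_capability)
    score = sum(weights[c] * per_capability[c] for c in per_capability) / weight_sum
    return score, per_capability

# Made-up data: 80 routine cases all pass, 20 critical cases pass half the time.
results = (
    [("routine", True)] * 80
    + [("critical", True)] * 10
    + [("critical", False)] * 10
)
raw = sum(ok for _, ok in results) / len(results)
weighted, per_capability = capability_weighted_accuracy(
    results, {"routine": 1.0, "critical": 4.0}
)
print(f"raw={raw:.2f} weighted={weighted:.2f}")  # raw=0.90 weighted=0.60
```

With these assumed weights, the same run that reports 90% raw accuracy scores 60% once the critical slice counts four times as much as the routine one.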

8 compatible tools · 4 output formats

When to Use

Run local models with private workflows
Tune inference for local hardware
Choose effective GGUF variants
Benchmark practical latency and quality
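For the latency side of the last item, a sketch along these lines may help. It assumes a hypothetical `generate(prompt)` callable that wraps whatever local model you run; `fake_generate` below is a stand-in, not a real client. Reporting percentiles rather than a mean keeps tail latency visible.

```python
import math
import time

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (0 < pct <= 100)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def benchmark_latency(generate, prompts, warmup=1):
    """Time each call to generate(prompt) and summarize the latency distribution."""
    for prompt in prompts[:warmup]:
        generate(prompt)  # warm-up calls are not timed (model load, cache fill)
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "max": latencies[-1],
    }

# Stand-in for a real model call; replace with your own client.
def fake_generate(prompt):
    time.sleep(0.001 * len(prompt))
    return prompt.upper()

stats = benchmark_latency(fake_generate, ["hi", "hello", "a longer prompt"])
print(stats)
```

Quality still has to be scored separately on the same prompts; a configuration that is faster but gives worse answers is not an improvement.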

Compatible Frameworks

OpenClaw

Hermes

Claude Code

Codex

Cursor

Windsurf

Aider

Ollama

8 TOOLS

Quality Gates

Framework covers all evaluation dimensions
Test suite representative of real use cases
Consistent scoring rubrics across evaluations
Fair and controlled comparisons between models
Security evaluation included in assessment

5 GATES DEFINED
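One way to approach the "fair and controlled comparisons" gate is a paired bootstrap over per-example scores, sketched below. This is an assumption about how such a check could be done, not the method the agent necessarily uses; the function name and the data are hypothetical. Both models must be scored on the same examples in the same order.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Return the observed mean score difference (A minus B) and a 95%
    bootstrap confidence interval, resampling test cases with replacement."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same examples")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return observed, (lower, upper)

# Made-up per-example correctness (1 = pass, 0 = fail) on the same 100 cases.
scores_a = [1] * 70 + [0] * 30
scores_b = [1] * 60 + [0] * 40
observed, (lower, upper) = paired_bootstrap(scores_a, scores_b)
print(f"difference={observed:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

If the interval contains zero, the data do not support saying one model beats the other, however large the headline gap looks.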

Expected Outputs

Custom Evaluation Harness with Capability-Weighted Scoring Rubric
Adversarial Test Suite with Instruction Drift and Boundary Probe Cases
LLM-as-Judge Bias Audit (Position, Verbosity, Self-Enhancement)
Human-Preference Calibration Report with Pairwise Agreement Matrix
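To make the position-bias part of the judge audit concrete, the sketch below shows each answer pair to a judge in both orders and counts how often the verdict flips. The `judge(prompt, first, second)` interface and both toy judges are hypothetical; a real audit would wrap an actual LLM call.

```python
def position_bias_audit(judge, pairs):
    """pairs: list of (prompt, answer_a, answer_b).
    judge(prompt, first, second) must return 'first' or 'second'."""
    flips = 0
    first_wins = 0
    for prompt, a, b in pairs:
        forward = judge(prompt, a, b)  # answer a shown first
        reverse = judge(prompt, b, a)  # answer b shown first
        winner_forward = a if forward == "first" else b
        winner_reverse = b if reverse == "first" else a
        if winner_forward != winner_reverse:
            flips += 1  # the verdict depended on the order alone
        first_wins += (forward == "first") + (reverse == "first")
    n = len(pairs)
    return {"flip_rate": flips / n, "first_position_rate": first_wins / (2 * n)}

# Toy judge that always prefers whichever answer is shown first.
def biased_judge(prompt, first, second):
    return "first"

# Toy judge that prefers the longer answer regardless of position.
def length_judge(prompt, first, second):
    return "first" if len(first) >= len(second) else "second"

pairs = [("Q1", "short", "a much longer answer"), ("Q2", "detailed reply", "ok")]
biased_report = position_bias_audit(biased_judge, pairs)
length_report = position_bias_audit(length_judge, pairs)
print(biased_report)  # {'flip_rate': 1.0, 'first_position_rate': 1.0}
print(length_report)  # {'flip_rate': 0.0, 'first_position_rate': 0.5}
```

The second judge passes this check while still favoring longer answers, which is why verbosity and self-enhancement bias need their own probes.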

Native exports per tool

OpenClaw: 10 files

openclaw/AGENTS.md
openclaw/SOUL.md
openclaw/TOOLS.md
+7 more

Hermes: 5 files

hermes/skills/flickclaw/model-eval-claw/SKILL.md
hermes/skills/flickclaw/model-eval-claw/references/workflow.md
hermes/skills/flickclaw/model-eval-claw/references/quality-gates.md
+2 more

Claude Code: 6 files

claude-code/CLAUDE.md
claude-code/.claude/skills/model-eval-claw/SKILL.md
claude-code/.claude/skills/model-eval-claw/references/workflow.md
+3 more

Codex: 5 files

codex/AGENTS.md
codex/.flickclaw/agents/model-eval-claw/codex.md
codex/.flickclaw/agents/model-eval-claw/workflow.md
+2 more

Cursor: 3 files

cursor/.cursor/rules/flickclaw-model-eval-claw.mdc
cursor/.cursor/rules/flickclaw-model-eval-claw-workflow.mdc
cursor/.cursor/rules/flickclaw-model-eval-claw-quality-gates.mdc

Windsurf: 3 files

windsurf/.windsurf/rules/flickclaw-model-eval-claw.md
windsurf/.windsurf/rules/flickclaw-model-eval-claw-workflow.md
windsurf/.windsurf/rules/flickclaw-model-eval-claw-quality-gates.md

Aider: 3 files

aider/CONVENTIONS.md
aider/aider.md
aider/.aider.conf.yml

Ollama: 4 files

ollama/Modelfile
ollama/system-prompt.md
ollama/template.md
+1 more

Install Commands

Install the FlickClaw CLI, then select your AI agent framework below to get the correct install command.

Step 1: Install CLI (one-time)

npm install -g @flickclaw/cli@latest

Step 2: Select Framework

OpenClaw

npm exec --yes @flickclaw/cli@latest -- install model-eval-claw --target openclaw


Example Prompt

Try this prompt with Model Eval Claw to see what it can do:

Optimize the local model setup for performance. Benchmark the current config and suggest improvements for the Custom Evaluation Harness with Capability-Weighted Scoring Rubric, the Adversarial Test Suite with Instruction Drift and Boundary Probe Cases, and the LLM-as-Judge Bias Audit (Position, Verbosity, Self-Enhancement).

Example Output

Illustrative

What a typical Model Eval Claw report looks like:

# Model Eval Claw — Assessment Report

**Project**: ollama-deployment
**Context**: a local LLM deployment running Llama 3.1 8B on consumer GPU hardware
**Generated**: 2026-07-10

## Executive Summary

The Model Eval Claw completed its review of ollama-deployment (a local LLM deployment running Llama 3.1 8B on consumer GPU hardware).
3 findings were identified with concrete remediation steps.
All quality gates were verified before delivery.

## Findings

| # | Severity | Area | Finding | Recommended Action |
|---|----------|------|---------|-------------------|
| 1 | **P1** | Performance | Inference latency spikes to 8s under concurrency | Enable continuous batching and set max_batch=4 |
| 2 | **P2** | Memory | Model consumes 18GB VRAM, headroom insufficient | Switch to Q4_K_M quantization, target <12GB |
| 3 | **P2** | Setup | No Modelfile with system prompt defined | Create Modelfile with role, constraints, and templates |

## Quality Gates

- [x] adversarial_test_set_with_instruction_drift_probes
- [x] llm_as_judge_bias_audit_position_verbosity_self
- [x] capability_weighted_scoring_calibrated

## Outputs Generated

- **Custom Evaluation Harness with Capability-Weighted Scoring Rubric**: Included in the report above.
- **Adversarial Test Suite with Instruction Drift and Boundary Probe Cases**: Included in the report above.
- **LLM-as-Judge Bias Audit (Position, Verbosity, Self-Enhancement)**: Included in the report above.
- **Human-Preference Calibration Report with Pairwise Agreement Matrix**: Included in the report above.

## Validation

- [x] All quality gates passed (3/3)
- [x] 3 findings documented with severity and remediation
- [x] 4 output sections generated
- [x] Evidence collected and referenced

---
*This is an illustrative example output from FlickClaw. Results vary based on project context.*

RELATED AGENTS

PRO

Ollama Claw

Local LLM setup — Ollama Modelfiles, GPU configuration, model selection benchmarks, and inference tuning

Local AI

PRO

LLMOps Claw

LLM operations — monitoring, cost tracking, latency budgets, and production readiness for language model deployments

Local AI

PRO

Vector Claw

Vector databases — index optimization, similarity search benchmarks, hybrid retrieval, and query latency tuning

Local AI
