People are constantly asking chatbots to explain how to do things. Fixing a faucet, filing a tax return, asking for a raise – according to some estimates, about 8.5% of all conversations with ChatGPT are exactly these kinds of requests for step-by-step instructions.
And as the tasks AI systems take on become more complex, the ability to generate reliable instructions becomes increasingly important. But there's a serious problem here.
Challenges of Verifying AI Generated Procedural Instructions
Instructions Can't Be Tested in a Lab
How do you know if AI instructions will work? You can't verify this in a standard test environment – no one is going to file for divorce or rewire their apartment just to make sure the steps are correct. And superficial comparisons with a reference don't catch the errors that matter: a missed prerequisite or an incorrect order of operations that makes the whole process fall apart.
That is exactly why Allen AI released How2Everything – a framework for evaluating and improving how well models generate step-by-step instructions. It includes a complete pipeline for extracting real-world instructions from the web (351,000 examples from nearly a million pages across 14 topics), a benchmark of 7,000 examples for model testing, and an open evaluation model that checks whether instructions contain critical errors preventing a human from achieving their goal.
Experiments have shown that if you train models against this signal – that is, reward instructions with fewer critical failures – their performance on the How2Bench benchmark increases by more than 10 points, while their abilities on other tasks do not degrade.
The Web as a Data Source for Complex Tasks
The project demonstrates how web data can support a closed evaluation and improvement loop for model capabilities at scale. The web provides a virtually inexhaustible supply of open, naturally occurring real-world documents that can serve as anchor points when verification through execution is impossible.
By extracting and standardizing this data into a testable format and developing an evaluation protocol aimed at task-level validity that can be reproduced at scale, the team turned hard-to-measure behavior into a practical development loop.
Limitations of Current Benchmarks for Procedural Tasks
What's Wrong with Existing Benchmarks
Procedural instructions matter everywhere – in agents, for example, planning and tool use depend on the correct sequence of actions. But existing datasets are often limited in scope or data source, or rely on metrics that don't reflect whether a procedure will work in practice. How2Everything is designed to be broad, scalable, and focused on real-world validity.
How2Everything consists of three main components: How2Mine, a pipeline for extracting procedures from the web; How2Bench, a benchmark for evaluating models; and How2Score, an evaluation method and an open judge model called How2Judge. The team is also releasing training data and recipes for directly improving models.
How How2Mine Works
How2Mine is a pipeline for extracting and standardizing procedures from web pages at scale. It starts with the DCLM corpus, uses WebOrganizer to identify tutorial-style pages, and then applies stratified sampling to ensure diversity across 14 topics – from arts and design to crime and law, electronics, and transportation.
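The stratified sampling step can be sketched as follows – a minimal illustration of drawing up to a fixed number of pages per topic, where the topic labels and counts are placeholders, not actual WebOrganizer output:

```python
import random

def stratified_sample(pages_by_topic: dict, per_topic: int, seed: int = 0) -> list:
    """Draw up to `per_topic` pages from each topic bucket for diversity."""
    rng = random.Random(seed)
    sample = []
    for topic, pages in pages_by_topic.items():
        k = min(per_topic, len(pages))  # small topics contribute everything they have
        sample.extend(rng.sample(pages, k))
    return sample

# Placeholder buckets: a well-covered topic and an underrepresented one.
pages = {"arts": list(range(10)), "law": list(range(3))}
sample = stratified_sample(pages, 5)  # 5 from "arts" + all 3 from "law"
```

The point of stratification is that a uniform sample over raw web pages would be dominated by the most common topics; capping each bucket keeps all 14 topics represented.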
The pipeline then uses GPT-4.1 to process these pages through several stages: extracting candidate procedures from raw HTML; filtering out interface-dependent, illogical, or nonsensical procedures; applying heuristics (keeping only those with 5 to 15 steps); extracting resource lists; and final validation.
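The step-count heuristic mentioned above is simple to express – this is an illustrative sketch, not code from the How2Mine release, and the dictionary shape is an assumption:

```python
def passes_step_heuristic(procedure: dict, min_steps: int = 5, max_steps: int = 15) -> bool:
    """Keep only procedures whose step count falls in the accepted range."""
    steps = procedure.get("steps", [])
    return min_steps <= len(steps) <= max_steps

procedures = [
    {"goal": "change a flat tire", "steps": ["..."] * 8},
    {"goal": "press a button", "steps": ["..."] * 2},  # too short, filtered out
]
kept = [p for p in procedures if passes_step_heuristic(p)]
```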
Running How2Mine on 980,000 documents yields 351,162 structured procedures, each with a topic, goal, list of necessary resources, and reference steps. Processing at this scale required 252,000 API calls costing about $5,700.
Even after filtering, not every reference procedure is perfect. To check quality, the team validated the benchmark references using GPT-4.1, which rated 96.6% of them as valid.
How2Bench: Testing for Practical Utility
How2Bench is a benchmark for testing how well models generate procedures. It is built by sampling 500 procedures per topic from the How2Mine pool, with the remaining procedures reserved for training.
To evaluate a model, How2Bench provides a goal (e.g., "change a flat tire"), a list of available resources, and the number of steps N that the procedure must contain. The model must then generate exactly N steps, each a single sentence. This controlled setup makes results comparable across models.
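The controlled setup can be sketched like this – a hypothetical prompt builder and output check, not the official harness, with the prompt wording being an assumption:

```python
def build_prompt(goal: str, resources: list, n_steps: int) -> str:
    """Assemble the goal, resources, and required step count into one prompt."""
    return (
        f"Goal: {goal}\n"
        f"Available resources: {', '.join(resources)}\n"
        f"Write exactly {n_steps} steps, one sentence each."
    )

def has_exactly_n_steps(output: str, n_steps: int) -> bool:
    """Check the constraint that the model produced exactly N non-empty lines."""
    lines = [ln for ln in output.strip().splitlines() if ln.strip()]
    return len(lines) == n_steps

prompt = build_prompt("change a flat tire", ["jack", "lug wrench", "spare tire"], 6)
```

Fixing N removes length as a confound, so two models answering the same item are compared on content rather than on how many steps they chose to emit.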
Unlike many benchmarks that saturate quickly as models improve, How2Bench shows clear scaling trends with both model size and training progress – this makes it useful for tracking improvements long before a model reaches state-of-the-art performance.
How2Score: Not Just "Does It Sound Useful", but "Will It Work"
How2Score is an evaluation method designed to measure whether a procedure will work in practice – not just whether it sounds useful.
Specifically, How2Score checks whether a procedure contains any critical failure that would prevent a human from achieving the goal. Critical failures include missing steps, unnecessary actions that derail the process, contradictions, or vagueness so severe that the procedure becomes unusable – for example, skipping a legally required waiting period when selling real estate, or omitting critical cooking temperatures and times.
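A binary LLM-as-a-judge check in this spirit might look like the sketch below. The prompt wording and the `call_judge` callable are assumptions for illustration, not the released How2Score prompts:

```python
# Hypothetical judge template; the failure categories mirror the ones
# described in the text (missing steps, derailing actions, contradictions,
# unusable vagueness).
JUDGE_TEMPLATE = (
    "Goal: {goal}\n"
    "Procedure:\n{steps}\n\n"
    "Does this procedure contain a critical failure (a missing step, a "
    "derailing action, a contradiction, or vagueness so severe it is "
    "unusable) that would prevent a human from achieving the goal? "
    "Answer YES or NO."
)

def score_procedure(goal: str, steps: list, call_judge) -> float:
    """Return 1.0 if the judge finds no critical failure, else 0.0."""
    prompt = JUDGE_TEMPLATE.format(goal=goal, steps="\n".join(steps))
    verdict = call_judge(prompt).strip().upper()
    return 0.0 if verdict.startswith("YES") else 1.0
```

The binary framing is what makes the score actionable: a procedure either survives the critical-failure check or it does not, with no partial credit for sounding plausible.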
Using a proprietary model like GPT-5 for this is accurate but expensive at scale, and it hinders reproducibility for others – evaluating 7,000 examples with GPT-5 costs about $15.
To make How2Score practical for widespread use, the team performed distillation and created an open judge model called How2Judge. First, they validated their critical failure evaluation framework against human labeling – 200 examples annotated by three annotators. Then, they used GPT-5 to generate 73,000 evaluations and trained an open 8-billion-parameter model based on Qwen 3 to reproduce these decisions.
The resulting judge model agrees with GPT-5 in 90.5% of cases and matches the majority of human ratings in 80.5% of cases – accurate enough to provide inexpensive, reproducible evaluation and serve as a reward signal for training.
Improving Models via How2Everything
How2Everything is not just an evaluation framework; it is also designed to help improve models. A subset of procedures from How2Mine serves as training data, and the How2Score judge provides the reward signal, with generated procedures containing fewer critical failures receiving higher rewards.
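The reward loop described above can be sketched as follows – `generate_fn` and `judge_fn` are stand-ins for the policy model and the How2Judge scorer, not real APIs:

```python
def compute_rewards(goals: list, generate_fn, judge_fn) -> list:
    """Sample one procedure per goal and score it with the judge.

    Higher reward = fewer critical failures, so optimizing against
    this signal pushes the policy toward procedures that actually work.
    """
    rewards = []
    for goal in goals:
        procedure = generate_fn(goal)        # policy samples a procedure
        rewards.append(judge_fn(goal, procedure))  # judge scores it
    return rewards
```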
The framework yields a substantial boost in generating valid step-by-step procedures, as measured by How2Bench. Qwen3-4B-Inst improved from 30.3 to 43.5 (+13.2 points), Qwen3-8B-Inst from 38.5 to 48.6 (+10.1), and Olmo 3 7B Think from 27.3 to 37.9 (+10.6). Importantly, these improvements do not come at the expense of other abilities – results on 12 out-of-domain benchmarks show no systematic degradation.
Length Matters
One important observation: explicit length control is crucial during training. Without it, models learn to "game" the judge by producing longer, wordier outputs. An experiment showed inflated How2Bench scores paired with much longer procedures when length control was removed – a useful reminder that "LLM-as-a-judge" setups require careful design.
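One way to add explicit length control to a judge-based reward is to scale the score down when the output overshoots a reference length. The penalty form below is an illustrative assumption, not the paper's exact recipe:

```python
def length_controlled_reward(judge_score: float, n_tokens: int,
                             ref_tokens: int, penalty: float = 0.5) -> float:
    """Scale the judge's reward down when output exceeds the reference length.

    Outputs at or under the reference length keep the full score;
    longer outputs are penalized proportionally to the overshoot,
    so padding the procedure with extra words no longer pays off.
    """
    if n_tokens <= ref_tokens:
        return judge_score
    overshoot = (n_tokens - ref_tokens) / ref_tokens
    return judge_score * max(0.0, 1.0 - penalty * overshoot)
```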
What Was Released
The team is releasing all code and data associated with How2Everything, including the How2Mine pipeline and prompts, the full dataset of 351,000 procedures and the How2Bench split, the distilled How2Score judge (8B), and training recipes for fine-tuning with How2Score as a reward.
If you are building instruction-following systems, tool-using agents, or anything that relies on reliable step-by-step guides, How2Everything allows you to check whether your model's procedures will actually work, and to train directly to reduce critical failures.