The Problem I Actually Needed to Solve

I was not trying to build a perfect handwriting recognition system. I was trying to reduce teacher workload.

The real task was narrower and more practical: take a student scratchpad image, extract the handwritten math steps, and determine two things with reasonable confidence:

  • Did the student likely reach the correct answer?
  • Did the student likely show their work?

That distinction matters. In a classroom setting, teachers do not need a research-grade OCR benchmark winner. They need something reliable enough to turn a pile of handwritten practice into a faster review workflow. If the system can correctly identify the final answer most of the time, and if it can detect whether the student produced a visible chain of work, it already removes a meaningful amount of repetitive grading effort.

That framing shaped the entire system. The constraints were not abstract model-quality numbers. The constraints were low latency, messy handwriting, privacy, and classroom reliability.

Why I Chose Local Models Instead of a Bigger Remote Model

The most common question people ask about this work is why I did not just send images to a large cloud model. On paper, that sounds like the easier option. In practice, it was the wrong fit for the problem I was solving.

1. Cost Grows Faster Than It Looks

Cloud inference costs are easy to underestimate because the unit economics look small in isolation. In a classroom workflow, they compound quickly.

One student can generate multiple problems on a single assignment. A single class of 25 students doing 20 problems each creates hundreds of opportunities for OCR. In my working estimate, that is roughly 500 OCR events minimum and more if you need retries, rescans, or multiple image passes. If every one of those is sent to a paid remote model, the cost curve stops being theoretical and starts becoming operational.

That matters even more when the system is intended to sit in a regular teaching workflow. I did not want a design that looked affordable in a demo but became expensive as soon as it was used regularly.

2. Privacy Still Matters Even If You Minimize the Payload

Student handwriting is not just generic content. It is identifying data.

Even if the uploaded image can be stripped of names or wrapped in anonymized identifiers, the handwriting itself can still be treated as biometric or personally identifying information. That means sending it to the cloud introduces a real privacy boundary. It is still student work leaving the local environment and being transmitted to an external service.

For a classroom product, that boundary matters. I wanted to keep the grading path local where possible, both to reduce risk and to simplify the story around handling student data.

3. Speed Matters at Classroom Scale

Latency is not just a "nice to have" in this kind of system. It directly affects whether the workflow feels usable.

When a teacher is processing many student submissions, the difference between local processing and cloud processing is not just the time for a single API call. It is the total effect of network round-trips, service contention, retries, and batch waiting. Those delays stack up quickly when many handwritten images are being evaluated in one session.

Local processing removes that overhead. If the model is already loaded, the request can stay on the machine or local network. That cuts out internet transit and reduces the "why is this still spinning?" problem that makes grading tools feel unreliable.

4. A Cloud LLM Was Overkill for the Actual Goal

This is the most important reason.

I was not trying to build a system that transcribes every mark on the page perfectly. I was trying to build a system that reliably supports a teacher.

Those are different goals.

For this use case, the grading logic can be deterministic. Once I have a plausible sequence of equation states, I can use rule-based validation to check whether the transformations make sense and whether the final answer balances. The OCR step does not need to be flawless. It only needs to be good enough to preserve grading signal.

That changes the economics of the model decision. A large cloud LLM may be more capable in a broad sense, but that capability is not always useful here. If a smaller local model can extract enough structure for a deterministic verifier to finish the job, then the larger model is unnecessary overhead.

In other words, I was not optimizing for perfect transcription. I was optimizing for a system that could say, with useful confidence, "this student likely solved it correctly" and "this student likely showed work."

5. Local Network Dependence Is Better Than Internet Dependence

Classroom software has to keep working under ordinary, imperfect conditions.

If the grading flow depends on an external model provider, then internet problems become grading problems. If the school network is unstable, if an upstream service is degraded, or if the connection simply stalls, the grading pipeline slows down or fails.

A local-first design is not immune to operational issues, but it gives me tighter control over them. I would rather depend on a local runtime I can monitor and recover than on internet connectivity and an external service boundary I do not control.

That is why I treat the "local vs remote" decision as a systems decision, not a benchmark decision.

The Architecture

At a high level, the pipeline is straightforward:

  • A scratchpad image is captured as a PNG.
  • A local vision model reads the image and returns OCR-like output.
  • A cleanup layer normalizes and restructures that output into candidate math rows.
  • A math-aware verifier checks equation validity and transformation quality.
  • The system returns a score and a judgment about whether the student likely showed their work.

The important detail is that each stage assumes the previous stage can fail in small, messy ways.

I am currently running this through a local Ollama-backed path with two small models in play: ministral-3:3b as the primary path and glm-ocr as a fallback. That pairing reflects the core design choice of the system: use a fast primary model, but build the surrounding pipeline so a miss from the first model does not collapse the whole grading attempt.

Small Models Work If You Design for Failure

The most useful lesson from this work is that model size is only part of the story. The surrounding system design often matters more.

Small local models are imperfect. They skip rows. They misread symbols. They produce malformed LaTeX-like fragments. They sometimes return structure that is inconsistent enough to break naive parsing.

If the rest of the system expects clean output, the whole pipeline looks fragile.

If the rest of the system is built around cleanup, verification, and fallback, those same small models become much more usable.

That is the pattern I leaned into.

Dual-Model Strategy: Fast Primary, Deterministic Fallback

The first protection layer is model fallback.

The pipeline starts with the primary vision model. If that request fails, the system automatically retries with a fallback model instead of failing the grading attempt immediately. That matters for operational resilience, but it also matters because models fail differently. A model that struggles on one handwriting style may still produce a usable read on the same image when a different model is tried.

I also added a continuation probe for the glm-ocr path. In practice, one recurring failure mode was that the model would miss the bottom of the page, especially the final answer row. Rather than throwing out the attempt, the pipeline checks whether the cleaned result looks incomplete. If it does, it issues a second prompt that tells the model to read the image as one continuous problem from top to bottom, then merges the new result with the original scan.

That is not glamorous, but it is exactly the kind of engineering decision that makes a local model practical.

Post-Processing Is the Real Product

Raw OCR output is not directly usable for grading. The cleanup layer does a large share of the real work.

This part of the pipeline takes the model response and aggressively normalizes it into something the verifier can reason about. That includes:

  • Extracting useful data from mixed or semi-structured model output.
  • Splitting rows on embedded newlines when the model collapses multiple equations into one blob.
  • Splitting on commas only when those commas appear to separate actual equation states.
  • Repairing malformed LaTeX fragments and normalizing fractions.
  • Deduplicating repeated or overlapping rows.
  • Filtering out text that is not actually an equation.

The fraction handling is a good example of why this matters. OCR for handwritten math often produces broken fragments like malformed frac commands or partially escaped expressions. If that text is passed straight into validation, the grader fails for the wrong reason. If it is normalized first, the system can recover a large amount of usable signal.

This is one of the central themes of the project: the raw model output is not the product. The cleaned, structured, grading-ready representation is the product.

Math-Aware Validation Beats String Matching

After cleanup, the pipeline moves into deterministic validation.

This is where the system stops acting like a generic OCR reader and starts acting like a math grader.

Instead of checking whether the student's text exactly matches an expected string, the verifier checks equation states and transformations. It can substitute candidate values into each side of an equation, confirm whether the equation balances, and inspect whether one line reasonably follows from the previous line.

That matters because students do not all write the same way. Equivalent steps can look different on the page. A pure string comparison approach is brittle. A math-aware validation path is much more aligned with the actual grading task.

I also added targeted OCR correction logic for common symbol and digit confusions. If a row does not balance, the verifier can try small, plausible corrections, such as swapping commonly confused digits or correcting a missing negative sign. Those corrections are tracked with confidence scores so the system can distinguish "clearly valid" from "likely recovered after OCR cleanup."

There is also a "next-row lookahead" behavior for cases where one row is bad but the surrounding work is still meaningful. If the chain breaks because the previous row is invalid, the verifier can check whether the next row still forms a valid transformation. That lets the system preserve grading signal instead of treating one OCR error as a total failure.

This is the difference between building a brittle transcription pipeline and building a resilient grading pipeline.

What Made It Robust in Practice

Several practical controls made the system usable beyond the happy path.

Timeout handling matters because local model runtimes can stall, especially on cold starts. Health checks matter because it is not enough to assume the local runtime is available; the application needs to know whether the vision model is actually loaded. Model-specific prompts matter because different local models respond better to different instructions and cleanup strategies.

None of those decisions are individually dramatic. Together, they are what turn a small local model into a dependable part of a workflow.

This is also why I am skeptical of model-only comparisons. In a real product, "which model is best?" is often a less useful question than "which system degrades gracefully when the model is imperfect?"

Where the Small Local Approach Was Good Enough

In this project, the local-first approach was good enough when the goal was to reduce repetitive teacher effort rather than automate grading with absolute certainty.

If the system can:

  • Read most of the student's equation chain,
  • Recover from common OCR noise,
  • Verify whether the math is internally consistent,
  • And give a useful signal about whether the student showed their work,

then it already creates value.

That is the bar I care about.

I do not need it to replace human judgment. I need it to compress the amount of routine checking a teacher has to do before they step in.

What I Would Improve Next

There are still clear areas to improve.

I want better confidence calibration so the system can express uncertainty more precisely. I want a tighter error taxonomy by model so I can measure which failure modes belong to OCR, which belong to parsing, and which belong to grading logic. I also want a stronger regression harness built around a curated corpus of OCR edge cases, because this kind of system improves fastest when the weird failures are captured and replayed intentionally.

Those are normal next steps for a system like this. The important point is that they are system refinements, not a sign that the local-first approach was flawed.

The Main Lesson

The core lesson from this work is simple: bigger models are not automatically better for a practical workflow.

When the actual job is narrow, the output can be verified deterministically, and reliability matters as much as raw model capability, a small local model can be the right tool. The quality of the surrounding system — including cleanup, fallback logic, validation, and operational control — can matter more than jumping to a larger remote model.

That is what this project demonstrated for me.

I did not need a massive model to solve the problem. I needed a model that was good enough, cheap enough, private enough, and fast enough to fit the real environment where teachers would use it.