We Trained a Custom AI Model to Investigate Missing Parcels — Here’s Exactly How We Did It
A major UK retailer was haemorrhaging time and money manually reviewing courier photos. We built them a fine-tuned vision model that makes the call in seconds. This is the full story.
The Problem Nobody Talks About
Missing parcel claims are one of the most expensive and time-consuming operational problems in e-commerce. Every day, logistics teams sift through hundreds — sometimes thousands — of courier delivery photos trying to answer a single question: did this delivery actually happen properly, or not?
The client who came to us was dealing with exactly this. A high-volume retail operation with a dedicated investigations team spending the bulk of their working day reviewing photos from multiple courier partners. Each image had to be assessed manually, cross-referenced with the claim, and categorised. The process was slow, inconsistent across team members, and frankly — unsustainable at scale.
They’d looked at off-the-shelf computer vision tools. Nothing came close to handling the variability of real delivery photography. Dark doorsteps. Blurry dashcam screenshots. Parcels obscured by wheelie bins. A generic model would fail immediately.
They needed something trained specifically for their problem.
“A standard image classification model trained on generic data wasn’t going to cut it. We needed a model that understood the difference between a compliant delivery and a non-compliant one — in the context of messy, real-world courier photos.”
Why This Couldn’t Be Solved With Prompting Alone
Before going anywhere near model training, we tested the obvious cheaper routes. Could a multimodal foundation model — given a detailed prompt — reliably classify these images? We ran extensive tests. The results were inconsistent. On clear, well-lit photos it performed reasonably well. On the ambiguous cases — which are the ones that actually matter — it struggled.
The problem isn’t intelligence. It’s specificity. Foundation models are generalists. They haven’t seen thousands of examples of what this courier’s non-compliant delivery looks like, in this client’s context, under these policy rules. That knowledge has to be built in through training data.
Fine-tuning was the right call. We moved forward.
Step 1 — Defining What “Compliant” Actually Means
This was the hardest part of the entire project. Before a single line of training code ran, we sat down with the client’s investigations team for a series of working sessions. The goal: produce an unambiguous labelling guide that any human — or model — could apply consistently.
It sounds simple. It isn’t. Consider these real edge cases we had to resolve:
- Parcel placed in a communal corridor, door not visible — compliant or not?
- Photo taken from inside a vehicle showing a doorstep from distance — acceptable proof?
- Image shows a “safe place” note visible but no parcel — does the note count?
- Multiple parcels visible — how do you confirm which one is the claimed item?
- Photo is clearly timestamped, but the location data doesn’t match the delivery address — which takes precedence?
Every one of these had to be defined, agreed, and documented. The labelling guide became the foundation of the entire system. Without it, you get garbage training data. And garbage training data gives you a garbage model — regardless of how sophisticated the architecture is.
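A labelling guide only prevents garbage data if it is explicit and checkable. One way to enforce that — a hypothetical sketch, not the client’s actual guide, with illustrative category names and rulings — is to encode each agreed edge case as structured data that can be validated programmatically:

```python
from dataclasses import dataclass

# Illustrative only: the real guide's categories and rulings are not public.
LABELS = ("compliant", "non_compliant", "refer")

@dataclass(frozen=True)
class Ruling:
    scenario: str   # short description of the edge case
    label: str      # the agreed category from the guide
    rationale: str  # one-line justification, so labellers apply it consistently

GUIDE = [
    Ruling("parcel in communal corridor, door not visible", "refer",
           "delivery location cannot be confirmed from the photo alone"),
    Ruling("safe-place note visible but no parcel in frame", "non_compliant",
           "a note is not proof the parcel was actually left"),
]

def validate(guide):
    """Reject any ruling that uses a label outside the agreed set."""
    for r in guide:
        if r.label not in LABELS:
            raise ValueError(f"unknown label {r.label!r} in: {r.scenario}")
    return True
```

Keeping the guide in a form like this means every disputed image can be traced back to a named ruling and its rationale, rather than to one labeller’s memory of a meeting.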
Step 2 — Building the Training Dataset
With the labelling guide agreed, we worked through the client’s historical image archive. Thousands of delivery photos, spanning multiple courier partners and conditions. Each image was labelled against our three output categories.
Data quality checks ran throughout. Ambiguous images — where even experienced team members disagreed — were flagged and either resolved in committee or excluded. We weren’t going to let edge-case noise degrade the model’s confidence on the clear-cut majority.
We also deliberately balanced the dataset. Real delivery photo archives skew heavily compliant — most deliveries are fine. An unbalanced dataset produces a model that’s great at confirming compliant deliveries and terrible at catching the non-compliant ones, which is precisely the wrong failure mode. We adjusted for this.
Step 3 — Fine-Tuning the Vision Model
We fine-tuned a vision model on the labelled dataset — training it to recognise the visual patterns associated with each outcome. The architecture decision was driven by the real-world constraints: the model needed to run fast enough to process claims in near real-time, at scale, without requiring expensive inference infrastructure.
Training iterations revealed where the model was uncertain. We used those uncertainty signals to go back to the dataset, pull the relevant images, and tighten the labelling. Multiple rounds of this loop produced a model that was genuinely confident on the cases it should be confident on — and genuinely uncertain on the ones that warranted human review.
That second point is critical and often overlooked. A model that’s confidently wrong is far more dangerous than one that admits uncertainty. We optimised specifically for well-calibrated confidence, not just raw accuracy on the test set.
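The article doesn’t name the calibration technique used. One common post-hoc method is temperature scaling: dividing the model’s logits by a temperature fitted on a validation set softens over-confident probabilities without changing which class is predicted. A self-contained sketch:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature parameter; T > 1 softens
    over-confident probabilities but preserves the argmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5]               # raw model outputs (illustrative)
raw = softmax(logits)                  # over-confident distribution
calibrated = softmax(logits, temperature=2.0)  # softened distribution
```

After scaling, a “90% confident” prediction is right roughly 90% of the time — which is exactly what a downstream referral threshold depends on.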
What the Model Returns
When a missing parcel claim arrives, the system pulls the associated delivery photo and runs it through the model. The response comes back in seconds with one of three outcomes:
Compliant — Evidence of a valid delivery attempt is present. The claim is likely fraudulent or the result of a genuine mistake. Flag for follow-up with the customer.
Non-compliant — Delivery issue confirmed. The courier failed to meet the required standard. The claim is warranted — escalate to the courier partner for resolution.
Refer — The image is ambiguous or falls outside the model’s confident range. Send to a human investigator with the model’s reasoning attached.
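The routing above can be sketched as a simple threshold rule. The threshold value and class names here are illustrative assumptions, not the production configuration — in practice the cut-off would be tuned on held-out data against the cost of a wrong automatic call:

```python
def route(probs, threshold=0.85):
    """Map calibrated class probabilities to one of three outcomes.
    `probs` maps class name -> probability. Below the threshold,
    the claim goes to a human rather than risking a confident
    wrong call."""
    top_class = max(probs, key=probs.get)
    if probs[top_class] < threshold:
        return "refer"       # model unsure -> human investigator
    return top_class         # confident compliant / non_compliant call

clear = route({"compliant": 0.97, "non_compliant": 0.03})      # "compliant"
unsure = route({"compliant": 0.55, "non_compliant": 0.45})     # "refer"
```

The threshold is a business lever as much as a technical one: raising it sends more claims to humans and fewer to automation, trading throughput for caution.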
The Refer category is not a weakness. It’s a feature. A system that knows its own limits — and routes edge cases to humans rather than making a confident wrong call — is a production-ready system. The goal was never to remove humans entirely. It was to ensure humans deal only with the cases that genuinely need them.
The Business Impact
The investigations team went from reviewing every incoming claim manually to only handling the Refer category. The vast majority of claims — the clear-cut compliant and non-compliant ones — are now processed automatically, in seconds, with a documented audit trail attached.
- First-pass investigation time dramatically reduced
- Fraudulent claims caught that were previously settled to avoid admin overhead
- Consistent decisions — no more variation between team members
- Full audit trail on every classification for compliance and dispute purposes
- The system scales with claim volume — no additional headcount required
What This Actually Demonstrates
You do not need to be Google, Amazon, or a university research lab to train a production AI model. You need three things: a clearly defined problem, a well-labelled dataset, and people who understand how to build the system properly.
CCwithAI is a Manchester-based AI development company. We don’t resell access to ChatGPT with a markup. We don’t bolt AI wrappers onto existing software and call it innovation. We build custom AI systems — trained, fine-tuned, and deployed for specific business problems — for companies that need something that actually works.
This parcel investigation system is one example. The same approach applies to any business process that involves repetitive decision-making on visual, textual, or structured data. Quality control. Document classification. Customer intent detection. Compliance checking. If humans are doing it repeatedly by following a consistent set of rules — a model can be trained to do it faster, cheaper, and at scale.
Got a Problem That Needs a Real AI Solution?
Not a chatbot. Not a prompt wrapper. A system built specifically for your business.
We’ll tell you honestly whether AI is the right tool — and if it is, we’ll build it properly.
Book a Free Consultation