How to implement OCR in iOS without 3rd party frameworks
Learn the basics of getting OCR up and running, plus a few tips and extra things you may want to know.
Published: July 29, 2020
Quite recently I implemented OCR features in my Scan it app, and it was surprisingly easy given how difficult the task of recognizing text in an image is. In this tutorial I would like to present a short example you can build upon.
What is OCR exactly? Basically, it means getting text out of an image via some sophisticated technology.
The nice thing is that we don't have to add any 3rd party framework or library to implement OCR. We can use what iOS has offered since iOS 13.
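If your app's deployment target is lower than iOS 13, the text recognition API has to sit behind an availability check. A minimal sketch; makeTextRequestIfAvailable() is a hypothetical helper name, and returning nil on older systems is just one possible fallback:

```swift
import Vision

// VNRecognizeTextRequest only exists on iOS 13 and later. With a lower
// deployment target, its use must be wrapped in an availability check.
// The return type is the base class VNRequest, which is older than iOS 13.
func makeTextRequestIfAvailable() -> VNRequest? {
    if #available(iOS 13.0, *) {
        return VNRecognizeTextRequest()
    }
    // Older iOS versions: no built-in text recognition, so the caller
    // has to hide the feature or fall back to something else.
    return nil
}
```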
Import Vision
OCR (optical character recognition) is done with the Vision framework, which you need to import first.
import Vision
Then it is a matter of constructing the request, executing it and finally reading the results.
Create OCR request
Our first task is to initialize VNImageRequestHandler with the target image. It expects a CGImage, which you can get from a UIImage using its optional cgImage property:
guard let cgImage = image.cgImage else { return }
let requestHandler = VNImageRequestHandler(cgImage: cgImage)
This requestHandler lets you perform specific requests on the image you passed in.
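Two details are easy to miss at this step: cgImage is optional, and the resulting CGImage does not carry the UIImage's imageOrientation, so a photo taken in portrait can reach Vision rotated. A minimal sketch of a helper that handles both. makeRequestHandler(for:) and the CGImagePropertyOrientation extension are my own additions, not part of Vision; the initializer VNImageRequestHandler(cgImage:orientation:options:) is Vision's own:

```swift
import ImageIO
import UIKit
import Vision

// Maps UIKit's orientation enum onto the ImageIO one that Vision expects.
// The cases correspond one-to-one.
extension CGImagePropertyOrientation {
    init(_ orientation: UIImage.Orientation) {
        switch orientation {
        case .up: self = .up
        case .upMirrored: self = .upMirrored
        case .down: self = .down
        case .downMirrored: self = .downMirrored
        case .left: self = .left
        case .leftMirrored: self = .leftMirrored
        case .right: self = .right
        case .rightMirrored: self = .rightMirrored
        @unknown default: self = .up
        }
    }
}

// Hypothetical helper: returns nil when the UIImage has no CGImage backing
// (for example, one created from a CIImage) and passes the orientation along.
func makeRequestHandler(for image: UIImage) -> VNImageRequestHandler? {
    guard let cgImage = image.cgImage else { return nil }
    return VNImageRequestHandler(cgImage: cgImage,
                                 orientation: CGImagePropertyOrientation(image.imageOrientation),
                                 options: [:])
}
```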
Recognize text
Next, create the VNRecognizeTextRequest that will be performed. Its initializer accepts a closure that runs once the request completes; the closure's parameters are a VNRequest and an Error?.
let request = VNRecognizeTextRequest { (request, error) in
    if let error = error {
        print(error.localizedDescription)
        return
    }
    // Hand the finished request over to the processing method
    if let text = self.recognizeText(from: request) {
        print(text)
    }
}
The rest is done in the recognizeText method to keep the closure body readable. Its signature looks like this:
func recognizeText(from request: VNRequest) -> String? {
    // Implemented in the next section
}
Now let's kick off the OCR itself.
do {
    try requestHandler.perform([request])
} catch {
    print("Unable to perform the requests: \(error).")
}
This performs the request we created earlier and runs its closure. If all goes well, the recognizeText method is called and we can process the results.
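Note that perform(_:) is synchronous: it blocks the calling thread until all requests have finished, which can take a noticeable moment on a large image, so in an app you will likely want to call it off the main thread. A sketch; performInBackground(_:on:completion:) is a hypothetical helper and the .userInitiated quality of service is an assumption:

```swift
import Foundation
import Vision

// Hypothetical helper: performs the requests on a global background queue and
// reports the outcome (nil on success, otherwise the error) on the main queue.
func performInBackground(_ requests: [VNRequest],
                         on handler: VNImageRequestHandler,
                         completion: @escaping (Error?) -> Void) {
    DispatchQueue.global(qos: .userInitiated).async {
        let performError: Error?
        do {
            // Blocks this background thread, not the main thread.
            try handler.perform(requests)
            performError = nil
        } catch {
            performError = error
        }
        DispatchQueue.main.async {
            completion(performError)
        }
    }
}
```

Keep in mind that the request's own completion closure then runs on that background queue as well, so hop back to the main queue before showing the recognized text in the UI.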
Get the result
The following code goes inside the recognizeText method.
First we try to get the text observations from the request's results property, which is an optional [Any], like this:
guard let observations = request.results as? [VNRecognizedTextObservation] else {
    return nil
}
Each observation has a topCandidates(_:) method that returns up to the requested number of candidates (at most 10) for the recognized text. The candidates are already sorted by confidence, that is, by how confident the Vision framework is about each recognized string.
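If you are curious what the alternatives look like, you can request more than one candidate per observation. A sketch; candidates(of:count:) is a hypothetical helper, the default of 3 is arbitrary, and the count is clamped to the range 1...10 because 10 is the maximum topCandidates(_:) accepts:

```swift
import Vision

// Hypothetical helper: for each observation, returns up to `count` candidate
// strings with their confidence, best candidate first (Vision sorts them).
func candidates(of observations: [VNRecognizedTextObservation],
                count: Int = 3) -> [[(string: String, confidence: VNConfidence)]] {
    // topCandidates(_:) accepts at most 10, so clamp the requested count.
    let maxCount = min(max(count, 1), 10)
    return observations.map { observation in
        observation.topCandidates(maxCount).map { (string: $0.string, confidence: $0.confidence) }
    }
}
```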
In this example, we use compactMap to get an array of strings from the observations:
let recognizedStrings: [String] = observations.compactMap { (observation) in
    guard let topCandidate = observation.topCandidates(1).first else { return nil }
    return topCandidate.string.trimmingCharacters(in: .whitespaces)
}
return recognizedStrings.joined(separator: "\n")
And voila! We have the OCR results. I think it is just incredible that such a complicated task can be accomplished in just a few lines of code.
Final code
The final recognizeText method looks like this:
func recognizeText(from request: VNRequest) -> String? {
    guard let observations = request.results as? [VNRecognizedTextObservation] else {
        return nil
    }
    let recognizedStrings: [String] = observations.compactMap { (observation) in
        guard let topCandidate = observation.topCandidates(1).first else { return nil }
        return topCandidate.string.trimmingCharacters(in: .whitespaces)
    }
    return recognizedStrings.joined(separator: "\n")
}
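If you'd rather have every step in one place, here is a sketch that wires the handler, the request and the result processing into a single synchronous function. Assumptions: the name ocrText(from:) is my own; instead of a completion closure it reads request.results after perform(_:) returns, which works because perform(_:) is synchronous; and because it blocks, it should be called off the main thread.

```swift
import UIKit
import Vision

// Hypothetical all-in-one sketch of the steps in this article.
// Synchronous: perform(_:) blocks, so don't call this on the main thread.
func ocrText(from image: UIImage) -> String? {
    // 1. Handler for the image (cgImage is nil for images not backed by a CGImage)
    guard let cgImage = image.cgImage else { return nil }
    let requestHandler = VNImageRequestHandler(cgImage: cgImage, options: [:])

    // 2. Request without a completion closure; results are read after perform(_:)
    let request = VNRecognizeTextRequest()

    // 3. Run the request
    do {
        try requestHandler.perform([request])
    } catch {
        print("Unable to perform the requests: \(error).")
        return nil
    }

    // 4. Keep the best candidate of every observation
    guard let observations = request.results as? [VNRecognizedTextObservation] else {
        return nil
    }
    let recognizedStrings = observations.compactMap { observation in
        observation.topCandidates(1).first?.string.trimmingCharacters(in: .whitespaces)
    }
    return recognizedStrings.joined(separator: "\n")
}
```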
Possible improvements
There are a few small things we can do to improve recognition. For example, we can use the confidence property of the individual VNRecognizedText instances returned by the topCandidates method to filter out low-confidence results.
Confidence is a value between 0.0 and 1.0; the higher, the better. Finding the threshold best suited for your app requires some experimentation with your data. For example, I optimistically started by accepting only results with a confidence of 0.8 or better, but this left out a lot of perfectly usable recognized text, so I had to lower it and experiment again.
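A sketch of the recognizeText(from:) method from above with a confidence threshold added. The minimumConfidence parameter and its 0.5 default are my own assumptions, meant as a starting point for the kind of experimentation described above:

```swift
import Foundation
import Vision

// Variant of recognizeText(from:) that drops top candidates whose confidence
// is below `minimumConfidence` (0.0 to 1.0). The 0.5 default is arbitrary.
func recognizeText(from request: VNRequest,
                   minimumConfidence: VNConfidence = 0.5) -> String? {
    guard let observations = request.results as? [VNRecognizedTextObservation] else {
        return nil
    }
    let recognizedStrings: [String] = observations.compactMap { observation in
        guard let topCandidate = observation.topCandidates(1).first,
              topCandidate.confidence >= minimumConfidence else { return nil }
        return topCandidate.string.trimmingCharacters(in: .whitespaces)
    }
    return recognizedStrings.joined(separator: "\n")
}
```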
You can also help Vision by specifying languages via the recognitionLanguages property of VNRecognizeTextRequest. The order of the array specifies their priority. I think a sensible default is the user's preferred languages from Locale, but obviously this will vary based on your particular use case. If you want really precise OCR, you could even let the user choose which language to apply.
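A sketch of that Locale-based default. It assumes you only want languages Vision reports as supported for the request's current recognition level and revision, which the class method supportedRecognitionLanguages(for:revision:) returns. applyPreferredLanguages(to:) is a hypothetical helper, and comparing identifiers exactly is a simplification: a preference of "en-GB" will not match "en-US".

```swift
import Foundation
import Vision

// Hypothetical helper: uses the user's preferred languages, in order, but only
// those Vision reports as supported for this request's level and revision.
func applyPreferredLanguages(to request: VNRecognizeTextRequest) {
    let supported = (try? VNRecognizeTextRequest.supportedRecognitionLanguages(
        for: request.recognitionLevel,
        revision: request.revision)) ?? []
    // Exact identifier match; regional variants are not mapped onto each other.
    let preferred = Locale.preferredLanguages.filter { supported.contains($0) }
    // Keep Vision's default languages if none of the user's languages are supported.
    if !preferred.isEmpty {
        request.recognitionLanguages = preferred
    }
}
```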
VNRecognizeTextRequest also lets us set the customWords property, which is an array of strings. This may be useful when you expect the image to contain non-standard words, such as product names or jargon.
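A minimal configuration sketch. The helper name and the word list are purely illustrative. My understanding of Apple's documentation is that custom words are applied as part of language correction, so the sketch sets usesLanguageCorrection explicitly (it is already true by default):

```swift
import Vision

// Hypothetical configuration: words the standard lexicon would likely
// "correct" into ordinary words. The list is illustrative only.
func makeRequestWithCustomWords() -> VNRecognizeTextRequest {
    let request = VNRecognizeTextRequest()
    request.customWords = ["Xcode", "SwiftUI", "VNRecognizeTextRequest"]
    // Assumption: customWords is used during language correction, so keep it enabled.
    request.usesLanguageCorrection = true
    return request
}
```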
And lastly, there is (among other things) the minimumTextHeight property, which can help you filter out small text. The value is relative to the image height, and the default is 1/32, which should cover a lot of use cases. At the other extreme, a value of something like 0.75 would only match text that is at least 3/4 of the image height. This setting can have important performance implications, especially if you are doing real-time recognition, because raising it lets Vision skip smaller text.
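As an illustration, a request tuned for large text in live camera frames might look like this. The 0.1 threshold is an arbitrary assumption, and switching recognitionLevel to .fast (the default is .accurate) is an extra speed-for-accuracy trade-off not covered by the paragraph above, though it is commonly paired with real-time recognition:

```swift
import Vision

// Hypothetical preset for live camera frames where only large text matters,
// e.g. signs or headings. 0.1 is a value to tune, not a recommendation.
func makeLargeTextRequest() -> VNRecognizeTextRequest {
    let request = VNRecognizeTextRequest()
    // Relative to image height: ignore text shorter than 10% of the frame.
    request.minimumTextHeight = 0.1
    // .fast trades accuracy for speed, which usually suits real-time use.
    request.recognitionLevel = .fast
    return request
}
```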
Uses: Xcode 12 & Swift 5.3