How to implement OCR with Vision framework in iOS 13+

The basics of getting OCR up and running plus a few tips and extra stuff you may want to know.

Quite recently I successfully implemented OCR features in my Scan it app and it was surprisingly easy, given how difficult the task of recognizing text in an image is. In this tutorial I would like to present a short example you can build upon.

OCR (optical character recognition) is done with the Vision framework, which you need to import first.

import Vision

Then it is a matter of constructing the request, executing it and finally reading the results.

Our first task is to initialize a VNImageRequestHandler with the target image. This expects a CGImage, which you can get from a UIImage via its optional cgImage property:

let requestHandler = VNImageRequestHandler(cgImage: image)

This requestHandler will allow you to perform specific requests on the image you passed in.
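If you are starting from a UIImage, a minimal sketch of getting the CGImage first might look like this (uiImage is an assumed name for your source image; cgImage can be nil, for example when the image is backed by a CIImage):

// Assuming `uiImage` is the UIImage you want to recognize text in
guard let cgImage = uiImage.cgImage else { return }
let requestHandler = VNImageRequestHandler(cgImage: cgImage)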

Next up, create the VNRecognizeTextRequest that will be performed. Its initializer accepts a completion closure that runs once the request finishes; the closure's parameters are a VNRequest and an optional Error.

let request = VNRecognizeTextRequest { (request, error) in
    if let error = error {
        print(error.localizedDescription)
        return
    }
    self.recognizeText(from: request)
}

The rest will be done in a recognizeText method so the closure body stays readable. The signature looks like this:

func recognizeText(from request: VNRequest) -> String? {
}

Now let's kick off the OCR itself.

do {
    try requestHandler.perform([request])
} catch {
    print("Unable to perform the requests: \(error).")
}

This will perform the request we created earlier and run its completion closure. All being well, the recognizeText method gets called and we can process the results.
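One thing to keep in mind: perform(_:) runs synchronously and can take a while on larger images, so you will usually want to call it off the main thread. A minimal sketch using a background dispatch queue (the queue choice here is just an example):

DispatchQueue.global(qos: .userInitiated).async {
    do {
        // perform(_:) blocks the current thread until recognition finishes
        try requestHandler.perform([request])
    } catch {
        print("Unable to perform the requests: \(error).")
    }
}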

The following code goes inside the recognizeText method. First we try to get the text observations from the results property, which is an optional array of Any:

guard let observations = request.results as? [VNRecognizedTextObservation] else {
    return nil
}

Each observation has a topCandidates method that returns possibly multiple candidates for the recognized text; you specify how many you want, up to a maximum of 10. The candidates are already sorted by confidence, which expresses how confident the Vision framework is about the recognized text.

In this example, we can use compactMap to get an array of strings from the observations:

let recognizedStrings: [String] = observations.compactMap { (observation) in
    guard let topCandidate = observation.topCandidates(1).first else { return nil }
    return topCandidate.string.trimmingCharacters(in: .whitespaces)
}

return recognizedStrings.joined(separator: "\n")

And voila! We have the OCR results. I think it is just incredible that such a complicated task can be accomplished in just a few lines of code.

The final recognizeText method looks like this:

func recognizeText(from request: VNRequest) -> String? {
    guard let observations =
            request.results as? [VNRecognizedTextObservation] else {
        return nil
    }

    let recognizedStrings: [String] = observations.compactMap { (observation) in
        guard let topCandidate = observation.topCandidates(1).first else { return nil }

        return topCandidate.string.trimmingCharacters(in: .whitespaces)
    }

    return recognizedStrings.joined(separator: "\n")
}

Going further

There are a few small things we can use to make the recognition better. For example, we can use the confidence property of the individual VNRecognizedText instances returned by the topCandidates method to filter out low-confidence results.

Confidence is a value between 0.0 and 1.0; the higher, the better. Finding the threshold best suited to your app requires some experimentation with your own data. For example, I optimistically started by accepting only results with a confidence of 0.8 or better, but this left out a lot of perfectly usable recognized text, so I had to lower it and experiment again.
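As an illustration, the compactMap from earlier could be extended to drop low-confidence candidates; the 0.5 threshold below is just a placeholder, not a recommendation:

let minimumConfidence: VNConfidence = 0.5 // tune this for your own data

let recognizedStrings: [String] = observations.compactMap { (observation) in
    guard let topCandidate = observation.topCandidates(1).first,
          topCandidate.confidence >= minimumConfidence else { return nil }
    return topCandidate.string.trimmingCharacters(in: .whitespaces)
}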

You can also help Vision by specifying languages via the recognitionLanguages property on the VNRecognizeTextRequest. The order of the array specifies their priority. I think a sensible default is to use the user's preferred languages from Locale, but obviously this will vary based on your particular use case. If you wanted really precise OCR, you could even let the user choose which language to apply.
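For example, something along these lines (the concrete language codes are just an illustration, and the set of supported languages depends on the iOS version and recognition level):

// Explicit priority list: Czech first, then English
request.recognitionLanguages = ["cs-CZ", "en-US"]

// Or derive the list from the user's preferred languages
request.recognitionLanguages = Locale.preferredLanguages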

VNRecognizeTextRequest also lets us set the customWords property, which is an array of strings. This may be useful when you expect the image to contain non-standard words.
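A quick sketch (the words below are made up for illustration; note that customWords is only taken into account when language correction is enabled):

// Custom vocabulary is consulted only when language correction is on
request.usesLanguageCorrection = true
request.customWords = ["MyAppName", "SKU-1234"]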

And lastly there is (among other things) the minimumTextHeight property, which can help you filter out small text. The value is relative to the image height. The default is 1/32, which should cover a lot of use cases. The polar opposite would be something like 0.75 to only match text that is at least 3/4 of the image height. This setting can also have important performance implications, especially if you are doing real-time recognition.
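For example (the 0.1 value is arbitrary, just to show the idea):

// Ignore text smaller than 10 % of the image height
request.minimumTextHeight = 0.1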

Uses: Xcode 11 & Swift 5