How to implement OCR in iOS without 3rd party frameworks

Learn the basics of getting OCR up and running plus a few tips and extra stuff you may want to know.

Quite recently I successfully implemented OCR features in my Scan it app, and it was surprisingly easy given how difficult the task of recognizing text in an image is. In this tutorial I would like to present a short example you can build upon.

What is OCR exactly? Basically just getting text out of an image via some sophisticated technology.

The nice thing is that we don't have to add any kind of 3rd party framework or library to implement OCR. We can use what iOS offers since version 13.

Import Vision

OCR (optical character recognition) is done with the Vision framework, which you need to import first.

import Vision

Then it is a matter of constructing the request, executing it and finally reading the results.

Create OCR request

Our first task is to initialize VNImageRequestHandler with the target image. This expects a CGImage, which you can get from a UIImage using its optional property cgImage:

let requestHandler = VNImageRequestHandler(cgImage: image)

This requestHandler will allow you to perform specific requests on the image you passed in.

Recognize text

Next up, create the VNRecognizeTextRequest that will be performed. This accepts a closure which will be run once the request completes. Its parameters are VNRequest and Error?.

let request = VNRecognizeTextRequest { (request, error) in
    if let error = error {
        print("Text recognition failed: \(error)")
        return
    }

    _ = self.recognizeText(from: request)
}

The rest will be done in the recognizeText method so the closure body is more readable. Its signature looks like this:

func recognizeText(from request: VNRequest) -> String? {

Now let's kick off the OCR itself.

do {
    try requestHandler.perform([request])
} catch {
    print("Unable to perform the request: \(error).")
}

This will perform the request we created earlier and run the closure. All being well, the recognizeText method should be called and we can process the results.

Get the result

The following code goes inside the recognizeText method. First we try to get the text observations from the results property, which is of type [Any], like this:

guard let observations = request.results as? [VNRecognizedTextObservation] else {
    return nil
}

Each observation has a topCandidates method which returns possibly multiple variants of the recognized text, with a maximum count of 10. These are already sorted by confidence, which expresses how sure the Vision framework is about the recognized text.

In this example, we can use compactMap to get an array of strings from the observations:

let recognizedStrings: [String] = observations.compactMap { observation in
    guard let topCandidate = observation.topCandidates(1).first else { return nil }
    return topCandidate.string.trimmingCharacters(in: .whitespaces)
}

return recognizedStrings.joined(separator: "\n")

And voila! We have the OCR results. I think it is just incredible that such a complicated task can be accomplished in just a few lines of code.

Final code

The final recognizeText method looks like this:

func recognizeText(from request: VNRequest) -> String? {
    guard let observations = request.results as? [VNRecognizedTextObservation] else {
        return nil
    }

    let recognizedStrings: [String] = observations.compactMap { observation in
        guard let topCandidate = observation.topCandidates(1).first else { return nil }
        return topCandidate.string.trimmingCharacters(in: .whitespaces)
    }

    return recognizedStrings.joined(separator: "\n")
}
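For reference, here is a sketch of how the whole flow could be wrapped into a single helper. Note that performOCR and the background queue usage are my additions, not something prescribed by the framework; perform(_:) runs synchronously, so in a real app you would want to call it off the main thread.

```swift
import UIKit
import Vision

// Hypothetical helper combining the steps above into one function
func performOCR(on image: UIImage, completion: @escaping (String?) -> Void) {
    guard let cgImage = image.cgImage else {
        completion(nil)
        return
    }

    let requestHandler = VNImageRequestHandler(cgImage: cgImage)

    let request = VNRecognizeTextRequest { request, error in
        if let error = error {
            print("Text recognition failed: \(error)")
            completion(nil)
            return
        }

        guard let observations = request.results as? [VNRecognizedTextObservation] else {
            completion(nil)
            return
        }

        let recognizedStrings = observations.compactMap { observation in
            observation.topCandidates(1).first?.string.trimmingCharacters(in: .whitespaces)
        }
        completion(recognizedStrings.joined(separator: "\n"))
    }

    // perform(_:) is synchronous, so dispatch to a background queue
    DispatchQueue.global(qos: .userInitiated).async {
        do {
            try requestHandler.perform([request])
        } catch {
            print("Unable to perform the request: \(error)")
            completion(nil)
        }
    }
}
```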

Possible improvements

There are a few small things we can use to make the recognition better. For example, we can use the confidence property of the individual VNRecognizedText instances returned by the topCandidates method to filter out low-confidence results.

Confidence has a value between 0.0 and 1.0; the higher the better. This requires some experimentation with your data to find the confidence value best suited for your app. For example, I optimistically started with accepting only results with confidence 0.8 or better, but this left out a lot of perfectly usable recognized text so I had to lower it and experiment again.
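The filtering could be folded into the compactMap from earlier, for example like this (the 0.5 threshold is just a starting point to tune against your own data):

```swift
// Assumption: `observations` is the [VNRecognizedTextObservation] array from the request
let minimumConfidence: Float = 0.5 // made-up starting value; tune for your data

let recognizedStrings: [String] = observations.compactMap { observation in
    guard let topCandidate = observation.topCandidates(1).first,
          topCandidate.confidence >= minimumConfidence else { return nil }
    return topCandidate.string.trimmingCharacters(in: .whitespaces)
}
```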

You can also help Vision by specifying languages via the recognitionLanguages property on the VNRecognizeTextRequest. The order specifies their priority. I think a sensible default is to use the user's preferred languages from Locale, but obviously this will vary based on your particular use case. If you wanted really precise OCR, you could even let the user choose a language to apply.
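Seeding the request with the user's preferred languages might look like this sketch (whether Vision actually supports each of them is a separate question, so treat this as a starting point):

```swift
// Assumption: prioritize the languages the user has configured on the device
request.recognitionLanguages = Locale.preferredLanguages
// e.g. ["en-US", "cs-CZ"]; the first language has the highest priority
```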

VNRecognizeTextRequest also lets us set the customWords property, which is an array of strings. This may be useful when you expect the image to contain non-standard words.
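A quick sketch of supplying custom vocabulary (the words below are hypothetical examples; note that customWords only takes effect when language correction is enabled):

```swift
// Assumption: your images contain domain-specific words the language model doesn't know
request.customWords = ["ScanIt", "VNRequest"] // hypothetical examples
request.usesLanguageCorrection = true // customWords applies only with language correction on
```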

And lastly there is (among other things) the minimumTextHeight property, which can help you filter out small text. The value is relative to the image height. By the way, the default is 1/32, which should cover a lot of use cases. The polar opposite would be something like 0.75 to only match text that is at least 3/4 of the image height. This can have important performance implications, especially if you are doing real-time recognition.
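Setting it is a one-liner; the 0.1 here is just an illustrative threshold:

```swift
// Hypothetical threshold: ignore text shorter than 10% of the image height
request.minimumTextHeight = 0.1
```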

Uses: Xcode 12 & Swift 5.3



Filip Němeček @nemecek_f

iOS blogger and developer with interest in Python/Django. Telling other devs' stories with iOS Chat.