OCR is Dying, but Not in India
The AI world is hungry for data. For English, there is an abundance of digitised, high-quality content that researchers can use to train AI models. For native Indian languages like Hindi, Odia, Marathi and Telugu, however, digitised content is scarce, and most of it sits in libraries and old printed texts.
This is where optical character recognition (OCR) comes in. OCR has long been a cornerstone technique for converting written text in scanned documents, images, and PDFs into machine-readable data.
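To ground the term, a bare-bones OCR pipeline takes only a few lines with the open-source Tesseract engine. The sketch below is a minimal illustration, assuming Tesseract and the pytesseract and Pillow packages are installed; “scan.png” is a placeholder file name, not something from the article.

```python
# Minimal OCR sketch: turn a scanned page into machine-readable text.
# Assumes the Tesseract engine plus the pytesseract and Pillow packages
# are installed; "scan.png" is a placeholder for any scanned document.
from PIL import Image
import pytesseract

image = Image.open("scan.png")
text = pytesseract.image_to_string(image)  # defaults to English ("eng")
print(text)
```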
However, since modern large language models (LLMs) can extract text in several languages from uploaded PDFs or images, the relevance of OCR is in question. Homegrown initiatives like Bhashini and AI4Bharat, along with startups like Sarvam, have built frameworks and applications for converting text in scanned images into machine-readable formats.
Even so, these efforts fall short of the huge volumes of data that model builders need. Companies could digitise content using OCR, but it would take a lot of time and manual effort. At the same time, they still want high-quality data from the hundreds of thousands of books in Indic languages, which realistically only OCR can provide.
This is where LLMs have started to play a crucial role.
Are LLMs Killing OCR?
Indian startups like Sarvam AI have started training their models on synthetic data generated with Meta’s Llama 3.3, using the output of a large open model as training material for their own.
The success of such approaches is evident in projects like Sarvam AI’s Sarvam 2B, which was trained on 2 trillion synthetic Indic tokens. This demonstrates how such data can efficiently train smaller, purpose-built models while retaining high performance.
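As a rough sketch of the mechanics, a large “teacher” model is prompted for Indic-language text, and its output becomes training material for a smaller model. The example below uses the Hugging Face transformers library with Meta’s Llama 3.3 Instruct checkpoint; gated model access and substantial GPU memory are assumed, and the prompt is illustrative rather than Sarvam AI’s actual recipe.

```python
# Hedged sketch of synthetic data generation: prompt a large open model
# for Hindi text that can serve as training data for a smaller model.
# Assumes the transformers and accelerate packages, access to the gated
# Llama 3.3 weights, and enough GPU memory; not Sarvam AI's pipeline.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.3-70B-Instruct",
    device_map="auto",
)
prompt = "Write ten simple Hindi sentences about daily life in a village."
samples = generator(prompt, max_new_tokens=300)
print(samples[0]["generated_text"])  # candidate synthetic training text
```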
Collecting data with OCR is difficult, as it remains a largely manual process of scanning documents. Hamid Shojanazeri, partner engineering manager (PyTorch and Llama) at Meta, said synthetic data generation solves critical bottlenecks in domains where collecting real-world datasets is too costly or impractical. “Synthetic data is vital for advancing AI in privacy-sensitive areas or low-resource languages,” he added.
This is exactly why OCR is taking a back seat at startups focused on the English language.
Traditional OCR systems have been instrumental in digitising printed text, but they often struggle with handwritten content, complex layouts, and diverse fonts. Recent models like GPT-4o mini have identified text more accurately than conventional OCR tools, making the case against OCR even stronger.
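As an illustration of how a vision-capable model stands in for an OCR engine, the hedged sketch below sends a scanned page to GPT-4o mini through the OpenAI Python SDK and asks for a verbatim transcription; the file name and prompt wording are placeholders.

```python
# Hedged sketch: text extraction with a vision-capable LLM instead of OCR.
# Assumes the openai package (v1+) and an OPENAI_API_KEY in the
# environment; "page.png" is a placeholder scan.
import base64
from openai import OpenAI

client = OpenAI()
with open("page.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe all text in this image verbatim."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```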
In response, platforms like Amazon Textract combine OCR with machine learning (ML) to extract text and data from virtually any document, enhancing accuracy and functionality.
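For reference, Textract’s synchronous text-detection call looks roughly like the sketch below; it assumes AWS credentials and a default region are configured, and “document.png” is a placeholder.

```python
# Hedged sketch: OCR-plus-ML extraction with Amazon Textract via boto3.
# Assumes AWS credentials and a default region are configured;
# "document.png" is a placeholder document image.
import boto3

client = boto3.client("textract")
with open("document.png", "rb") as f:
    response = client.detect_document_text(Document={"Bytes": f.read()})

# Textract returns PAGE/LINE/WORD blocks; print the detected lines.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```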
Miguel Ríos Berríos, co-founder and CTO of Parcha, recently wrote on X that OCR remains relevant for simple text extraction tasks. However, in high-stakes applications like document verification, it’s being overtaken by more advanced AI models that integrate vision, language, and metadata analysis for real-time and adaptable decision-making.
“OCR and text-based rules only see half the picture. A document isn’t just its text content – it’s the relationship between visual elements, fonts’ consistency, official seals’ placement, and even the metadata traces left by editing software,” Berríos said. He added that modern vision models can process all these signals simultaneously, flagging subtle inconsistencies that traditional approaches miss.
Some believe that while LLMs are good at extracting text from clear images, they still struggle with handwritten text, low-quality scans, or unusual fonts. Arham Raza, AI engineer at Clouxi Plexi, said, “OCR systems like ABBYY FineReader are specifically designed to handle these issues and remain far superior in these scenarios.”
According to him, OCR is much faster at processing large batches of text, whereas LLMs can be slower and have token limits. “OCR is far from dead, especially for things like legal or medical documents!”
India Keeps OCR Afloat
India’s linguistic diversity, with 22 officially recognised languages and numerous dialects, presents unique challenges for digital accessibility. Many documents, historical records, and literary works are available only in printed or handwritten forms in various Indian languages. Digitising these is what Indian initiatives continue to focus on, and for now, OCR remains the most practical option.
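Notably, open-source OCR already ships trained data for many of these scripts. The sketch below, assuming the Hindi (“hin”) and Telugu (“tel”) Tesseract language packs are installed, shows how language codes can be combined for mixed-script pages; the file name is a placeholder.

```python
# Hedged sketch: Indic-script OCR with Tesseract language packs. Assumes
# the "hin" (Hindi) and "tel" (Telugu) traineddata files are installed;
# "manuscript.png" is a placeholder scan.
from PIL import Image
import pytesseract

print(pytesseract.get_languages())  # list the installed language packs
page = Image.open("manuscript.png")
text = pytesseract.image_to_string(page, lang="hin+tel")  # "+" mixes scripts
print(text)
```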
Ori Shachar, co-founder and CEO of Autom8Labs, wrote on LinkedIn, “After working with all the large LLM providers on analysing images of scanned text, I can declare the death of standard OCR applications. Those LLMs just extract the text and read it from the image seamlessly, doing a better job than OCR.”
But this holds mostly for English; for other languages, the output is still not as accurate as required.
Indian startups have long been dedicated to scaling their OCR capabilities. Although they can obtain plenty of Indic data through synthetic generation, the quality of text extracted from these books cannot be assured without OCR. Even this is gradually changing, though, with LLMs now able to detect Indic-language text in uploaded documents.
There are certainly experts who dispute that OCR is in trouble, pointing to problems with vision language models (VLMs) and LLMs such as hallucinations and the high cost per image. Even so, the future of OCR seems hazy: the cost of running such models is decreasing, and while LLMs might be overkill for many tasks, OCR might soon not be enough.
Mohit Pandey
Mohit writes about AI in simple, explainable, and sometimes funny words. He holds a keen interest in discussing AI with people building it for India, and for Bharat, while also talking a little bit about AGI.