Here’s an update on using LLMs for OCR without reaching for the same hammer (a generic model) for every nail. DeepSeek has released an OCR-focused model: https://github.com/deepseek-ai/DeepSeek-OCR
Check out the deep parsing mode, which parses images embedded in documents through secondary model calls; a rough sketch of that two-pass flow follows the excerpt below. It's very useful for data extraction, and the reported results are pretty impressive too:
Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode N text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy. These findings suggest promising directions for future applications, such as implementing optical processing for dialogue histories beyond k rounds in multi-turn conversations to achieve 10× compression efficiency.
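To make those ratios concrete, here's a back-of-the-envelope sketch. The 1,000-token page is a made-up number; only the ~10× and ~20× ratios come from the paper.

```python
# Illustrative arithmetic only: the page size is hypothetical, the ratios are from the paper.
text_tokens = 1_000                  # tokens to store a page as plain text (made-up figure)
near_lossless = text_tokens / 10     # ~100 vision tokens at the "near-lossless" ~10x ratio
lossy = text_tokens / 20             # ~50 vision tokens at ~20x, where accuracy drops to ~60%
print(f"~10x: {near_lossless:.0f} vision tokens, ~20x: {lossy:.0f} vision tokens")
```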
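As for the deep parsing mode mentioned above, the repo's README shows a Hugging Face `transformers` usage pattern, and the two-pass sketch below is my rough adaptation of it. Treat it as a starting point rather than the canonical API: the exact prompt strings, the `infer()` keyword arguments, and the figure-crop file names are assumptions to verify against the repo.

```python
# Two-pass "deep parsing" sketch, adapted from the usage example in the DeepSeek-OCR README.
# Assumptions (verify against the repo): exact prompt strings, infer() keyword arguments,
# and the cropped-figure file names below.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

# Pass 1: OCR the full page with grounding, so embedded figures are located and cropped.
model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown.",
    image_file="page.png",
    output_path="out/",
    base_size=1024, image_size=640, crop_mode=True,
    save_results=True,
)

# Pass 2 (the "deep parsing" part): secondary calls on each cropped figure region,
# turning charts/diagrams inside the document into text you can extract data from.
for crop in ["out/figure_0.png", "out/figure_1.png"]:  # hypothetical crop paths
    model.infer(
        tokenizer,
        prompt="<image>\nParse the figure.",
        image_file=crop,
        output_path="out/",
        save_results=True,
    )
```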