Multimodal-OCR3 is an advanced Optical Character Recognition (OCR) application that leverages multiple state-of-the-art multimodal models to extract text from images. Built with a user-friendly Gradio interface, it supports models like Nanonets-OCR2-3B, Chandra-OCR, olmOCR-2-7B-1025, and Dots.OCR, enabling robust text extraction with customizable generation parameters.
- Multiple OCR Models: Choose among four models: Nanonets-OCR2-3B, Chandra-OCR, olmOCR-2-7B-1025, and Dots.OCR.
- Gradio Interface: Intuitive web-based UI for uploading images and entering queries (a minimal skeleton is sketched after this list).
- Customizable Parameters: Adjust max new tokens, temperature, top-p, top-k, and repetition penalty for text generation.
- Real-time Streaming: View OCR output as it is generated.
- Example Inputs: Predefined example queries and images for quick testing.
- Custom Theme: Styled with a unique SteelBlue theme for an enhanced user experience.
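For orientation, here is a hypothetical skeleton of such an interface. It is not the actual `app.py`; the handler, labels, and default values are assumptions, and the real app applies its custom SteelBlue theme:

```python
import gradio as gr

# Hypothetical streaming handler; the real app runs the selected OCR model.
def run_ocr(model_name, image, query, max_new_tokens, temperature,
            top_p, top_k, repetition_penalty):
    yield f"[{model_name}] would stream extracted text here..."

with gr.Blocks() as demo:
    model = gr.Radio(
        ["Nanonets-OCR2-3B", "Chandra-OCR", "olmOCR-2-7B-1025", "Dots.OCR"],
        value="Nanonets-OCR2-3B", label="Model",
    )
    image = gr.Image(type="pil", label="Upload Image")
    query = gr.Textbox(label="Query", value="Perform OCR on the image.")
    with gr.Accordion("Advanced options", open=False):
        max_new_tokens = gr.Slider(1, 4096, value=1024, step=1, label="Max new tokens")
        temperature = gr.Slider(0.1, 2.0, value=0.7, step=0.05, label="Temperature")
        top_p = gr.Slider(0.05, 1.0, value=0.9, step=0.05, label="Top-p")
        top_k = gr.Slider(1, 100, value=50, step=1, label="Top-k")
        repetition_penalty = gr.Slider(1.0, 2.0, value=1.2, step=0.05, label="Repetition penalty")
    output = gr.Textbox(label="Raw Output")
    gr.Button("Submit").click(
        run_ocr,
        inputs=[model, image, query, max_new_tokens, temperature,
                top_p, top_k, repetition_penalty],
        outputs=output,
    )

demo.launch()
```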
- Clone the Repository:

  ```bash
  git clone https://github.com/PRITHIVSAKTHIUR/Multimodal-OCR3.git
  cd Multimodal-OCR3
  ```

- Set Up a Virtual Environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install Dependencies: Ensure you have Python 3.10+ installed, then install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

  The `requirements.txt` includes dependencies such as `torch`, `transformers`, `gradio`, and `flash-attn`. See the Requirements section for the full list.

- Download Models: The application automatically downloads and caches the required models from Hugging Face during the first run. Ensure you have sufficient disk space in the `./model_cache` directory (a sketch of pre-fetching a model into this cache follows the list).

- Run the Application: Launch the Gradio interface:

  ```bash
  python app.py
  ```

  This starts a local web server; access the interface via the provided URL (typically `http://localhost:7860`).
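If you want to pre-fetch a model into that cache before the first run, something along these lines should work (a minimal sketch using `huggingface_hub`, which is in the requirements; the repo ID is an assumption, so check `app.py` for the exact identifiers):

```python
from huggingface_hub import snapshot_download

# Download a model repo into the cache directory the app reads from.
# The repo ID below is illustrative; app.py defines the actual model IDs.
snapshot_download(
    repo_id="nanonets/Nanonets-OCR2-3B",
    cache_dir="./model_cache",
)
```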
- Access the Interface: Open the Gradio interface in your browser after running the application.
- Select a Model: Choose one of the supported models (e.g., Nanonets-OCR2-3B) from the radio buttons.
- Upload an Image: Upload an image containing the text you want to extract.
- Enter a Query: Provide a query (e.g., "Perform OCR on the image") in the text input box.
- Adjust Advanced Options (optional): Modify parameters like `max_new_tokens`, `temperature`, `top_p`, `top_k`, and `repetition_penalty` for fine-tuned results (see the sketch after this list).
- Submit: Click the "Submit" button to process the image and view the extracted text in real time.
- View Output: The raw text output and Markdown-formatted results appear in the output section.
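Under the hood, these options typically become keyword arguments to the model's `generate()` call, with real-time streaming handled by `transformers`' `TextIteratorStreamer` running generation in a background thread. The sketch below assumes a vision-language `model` and `processor` are already loaded; it is illustrative rather than a copy of `app.py`:

```python
from threading import Thread
from transformers import TextIteratorStreamer

def stream_ocr(model, processor, image, query, max_new_tokens=1024,
               temperature=0.7, top_p=0.9, top_k=50, repetition_penalty=1.2):
    # Build a multimodal chat prompt; template details vary per model.
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": query},
    ]}]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

    # Decode tokens as they are produced instead of waiting for the full output.
    streamer = TextIteratorStreamer(
        processor.tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    generation_kwargs = dict(
        **inputs, streamer=streamer, max_new_tokens=max_new_tokens,
        do_sample=True, temperature=temperature, top_p=top_p,
        top_k=top_k, repetition_penalty=repetition_penalty,
    )
    Thread(target=model.generate, kwargs=generation_kwargs).start()

    text = ""
    for chunk in streamer:
        text += chunk
        yield text  # Gradio renders each partial result as it arrives.
```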
The application includes example inputs for quick testing:
- Query: "Perform OCR on the image."
- Image: `examples/1.jpg`
- Output: Extracted text from the image.
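In Gradio, presets like this are usually registered with `gr.Examples`; a minimal, self-contained sketch (the component names are assumptions):

```python
import gradio as gr

with gr.Blocks() as demo:
    query = gr.Textbox(label="Query")
    image = gr.Image(type="filepath", label="Image")
    # Clicking an example row fills both input components.
    gr.Examples(
        examples=[["Perform OCR on the image.", "examples/1.jpg"]],
        inputs=[query, image],
    )
```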
Multimodal-OCR3 integrates the following models:
- Nanonets-OCR2-3B: A lightweight, efficient OCR model for text extraction.
- Chandra-OCR: A high-precision model optimized for complex documents.
- olmOCR-2-7B-1025: A robust model for diverse image types, developed by Allen AI.
- Dots.OCR: A custom-patched model for enhanced OCR performance.
All models are loaded in `torch.float16` or `torch.bfloat16` precision and use GPU acceleration via CUDA when available.
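Loading one of these models along those lines might look like the following sketch; the repo ID and auto class are assumptions (Dots.OCR in particular is custom-patched, so the real `app.py` may differ):

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "allenai/olmOCR-2-7B-1025"  # illustrative; see app.py for exact IDs
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(MODEL_ID, cache_dir="./model_cache")
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    cache_dir="./model_cache",
    # Prefer bfloat16 on GPUs that support it; otherwise fall back to float16.
    torch_dtype=torch.bfloat16
    if (device == "cuda" and torch.cuda.is_bf16_supported())
    else torch.float16,
).to(device).eval()
```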
The following packages are required to run Multimodal-OCR3:

```text
flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
transformers-stream-generator
huggingface_hub
qwen-vl-utils
sentencepiece
opencv-python
torch==2.6.0
transformers
torchvision
matplotlib
accelerate
requests
hf_xet
spaces
pillow
gradio
einops
peft
fpdf
timm
av
```
Install them using:

```bash
pip install -r requirements.txt
```

Note that the pinned `flash-attn` wheel is prebuilt for Python 3.10, CUDA 12, torch 2.6, and Linux x86_64; on other platforms you may need a different wheel or a source build.

Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch (`git checkout -b feature/your-feature`).
- Make your changes and commit (`git commit -m 'Add your feature'`).
- Push to the branch (`git push origin feature/your-feature`).
- Open a pull request.
Please ensure your code adheres to the project's coding standards and includes appropriate documentation.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
- Hugging Face for providing the pretrained models.
- Gradio for the intuitive web interface framework.
- PyTorch for the deep learning backend.
- The open-source community for contributions to dependencies like `transformers` and `flash-attn`.