Multimodal-OCR3 is an advanced Optical Character Recognition (OCR) application that leverages multiple state-of-the-art multimodal models to extract text from images. Built with a user-friendly Gradio interface, it supports models like Nanonets-OCR2-3B, Chandra-OCR, olmOCR-2-7B-1025, and Dots.OCR, enabling robust text extraction with customizable generation parameters.
- Multiple OCR Models: Choose among four models: Nanonets-OCR2-3B, Chandra-OCR, olmOCR-2-7B-1025, and Dots.OCR.
- Gradio Interface: Intuitive web-based UI for uploading images and entering queries (a minimal skeleton is sketched after this list).
- Customizable Parameters: Adjust max new tokens, temperature, top-p, top-k, and repetition penalty for text generation.
- Real-time Streaming: View OCR output as it is generated.
- Example Inputs: Predefined example queries and images for quick testing.
- Custom Theme: Styled with a unique SteelBlue theme for an enhanced user experience.
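For orientation, here is a hypothetical skeleton of such an interface. It is not the actual `app.py`; the handler, labels, and default values are assumptions, and the real app applies its custom SteelBlue theme:

```python
import gradio as gr

# Hypothetical streaming handler; the real app runs the selected OCR model.
def run_ocr(model_name, image, query, max_new_tokens, temperature,
            top_p, top_k, repetition_penalty):
    yield f"[{model_name}] would stream extracted text here..."

with gr.Blocks() as demo:
    model = gr.Radio(
        ["Nanonets-OCR2-3B", "Chandra-OCR", "olmOCR-2-7B-1025", "Dots.OCR"],
        value="Nanonets-OCR2-3B", label="Model",
    )
    image = gr.Image(type="pil", label="Upload Image")
    query = gr.Textbox(label="Query", value="Perform OCR on the image.")
    with gr.Accordion("Advanced options", open=False):
        max_new_tokens = gr.Slider(1, 4096, value=1024, step=1, label="Max new tokens")
        temperature = gr.Slider(0.1, 2.0, value=0.7, step=0.05, label="Temperature")
        top_p = gr.Slider(0.05, 1.0, value=0.9, step=0.05, label="Top-p")
        top_k = gr.Slider(1, 100, value=50, step=1, label="Top-k")
        repetition_penalty = gr.Slider(1.0, 2.0, value=1.2, step=0.05, label="Repetition penalty")
    output = gr.Textbox(label="Raw Output")
    gr.Button("Submit").click(
        run_ocr,
        inputs=[model, image, query, max_new_tokens, temperature,
                top_p, top_k, repetition_penalty],
        outputs=output,
    )

demo.launch()
```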
- Clone the Repository:

  ```bash
  git clone https://github.com/PRITHIVSAKTHIUR/Multimodal-OCR3.git
  cd Multimodal-OCR3
  ```

- Set Up a Virtual Environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install Dependencies: Ensure you have Python 3.10+ installed, then install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

  The `requirements.txt` includes dependencies such as `torch`, `transformers`, `gradio`, and `flash-attn`. See the Requirements section for the full list.

- Download Models: The application automatically downloads and caches the required models from Hugging Face during the first run. Ensure you have sufficient disk space in the `./model_cache` directory (a sketch of pre-fetching a model into this cache follows the list).

- Run the Application: Launch the Gradio interface:

  ```bash
  python app.py
  ```

  This starts a local web server; access the interface via the provided URL (typically `http://localhost:7860`).
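If you want to pre-fetch a model into that cache before the first run, something along these lines should work (a minimal sketch using `huggingface_hub`, which is in the requirements; the repo ID is an assumption, so check `app.py` for the exact identifiers):

```python
from huggingface_hub import snapshot_download

# Download a model repo into the cache directory the app reads from.
# The repo ID below is illustrative; app.py defines the actual model IDs.
snapshot_download(
    repo_id="nanonets/Nanonets-OCR2-3B",
    cache_dir="./model_cache",
)
```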
- Access the Interface: Open the Gradio interface in your browser after running the application.
- Select a Model: Choose one of the supported models (e.g., Nanonets-OCR2-3B) from the radio buttons.
- Upload an Image: Upload an image containing the text you want to extract.
- Enter a Query: Provide a query (e.g., "Perform OCR on the image") in the text input box.
- Adjust Advanced Options (optional): Modify parameters like `max_new_tokens`, `temperature`, `top_p`, `top_k`, and `repetition_penalty` for fine-tuned results (see the sketch after this list).
- Submit: Click the "Submit" button to process the image and view the extracted text in real time.
- View Output: The raw text output and Markdown-formatted results appear in the output section.
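Under the hood, these options typically become keyword arguments to the model's `generate()` call, with real-time streaming handled by `transformers`' `TextIteratorStreamer` running generation in a background thread. The sketch below assumes a vision-language `model` and `processor` are already loaded; it is illustrative rather than a copy of `app.py`:

```python
from threading import Thread
from transformers import TextIteratorStreamer

def stream_ocr(model, processor, image, query, max_new_tokens=1024,
               temperature=0.7, top_p=0.9, top_k=50, repetition_penalty=1.2):
    # Build a multimodal chat prompt; template details vary per model.
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": query},
    ]}]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

    # Decode tokens as they are produced instead of waiting for the full output.
    streamer = TextIteratorStreamer(
        processor.tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    generation_kwargs = dict(
        **inputs, streamer=streamer, max_new_tokens=max_new_tokens,
        do_sample=True, temperature=temperature, top_p=top_p,
        top_k=top_k, repetition_penalty=repetition_penalty,
    )
    Thread(target=model.generate, kwargs=generation_kwargs).start()

    text = ""
    for chunk in streamer:
        text += chunk
        yield text  # Gradio renders each partial result as it arrives.
```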
The application includes example inputs for quick testing:
- Query: "Perform OCR on the image."
- Image: `examples/1.jpg`
- Output: Extracted text from the image.
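In Gradio, presets like this are usually registered with `gr.Examples`; a minimal, self-contained sketch (the component names are assumptions):

```python
import gradio as gr

with gr.Blocks() as demo:
    query = gr.Textbox(label="Query")
    image = gr.Image(type="filepath", label="Image")
    # Clicking an example row fills both input components.
    gr.Examples(
        examples=[["Perform OCR on the image.", "examples/1.jpg"]],
        inputs=[query, image],
    )
```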
Multimodal-OCR3 integrates the following models:
- Nanonets-OCR2-3B: A lightweight, efficient OCR model for text extraction.
- Chandra-OCR: A high-precision model optimized for complex documents.
- olmOCR-2-7B-1025: A robust model for diverse image types, developed by Allen AI.
- Dots.OCR: A custom-patched model for enhanced OCR performance.
All models are loaded in `torch.float16` or `torch.bfloat16` precision and use GPU acceleration via CUDA when available.
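Loading one of these models along those lines might look like the following sketch; the repo ID and auto class are assumptions (Dots.OCR in particular is custom-patched, so the real `app.py` may differ):

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "allenai/olmOCR-2-7B-1025"  # illustrative; see app.py for exact IDs
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(MODEL_ID, cache_dir="./model_cache")
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    cache_dir="./model_cache",
    # Prefer bfloat16 on GPUs that support it; otherwise fall back to float16.
    torch_dtype=torch.bfloat16
    if (device == "cuda" and torch.cuda.is_bf16_supported())
    else torch.float16,
).to(device).eval()
```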
The following packages are required to run Multimodal-OCR3:

```text
flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
transformers-stream-generator
huggingface_hub
qwen-vl-utils
sentencepiece
opencv-python
torch==2.6.0
transformers
torchvision
matplotlib
accelerate
requests
hf_xet
spaces
pillow
gradio
einops
peft
fpdf
timm
av
```
Install them using:

```bash
pip install -r requirements.txt
```

Note that the pinned `flash-attn` wheel is prebuilt for Python 3.10, CUDA 12, torch 2.6, and Linux x86_64; on other platforms you may need a different wheel or a source build.

Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch (`git checkout -b feature/your-feature`).
- Make your changes and commit (`git commit -m 'Add your feature'`).
- Push to the branch (`git push origin feature/your-feature`).
- Open a pull request.
Please ensure your code adheres to the project's coding standards and includes appropriate documentation.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
- Hugging Face for providing the pretrained models.
- Gradio for the intuitive web interface framework.
- PyTorch for the deep learning backend.
- The open-source community for contributions to dependencies like `transformers` and `flash-attn`.