allenai/olmocr
Original summary
Toolkit for linearizing PDFs for LLM datasets/training

A toolkit for converting PDFs and other image-based document formats into clean, readable, plain text format.

Try the online demo: https://olmocr.allenai.org/

Features:
- Convert PDF, PNG, and JPEG based documents into clean Markdown
- Support for equations, tables, handwriting, and complex formatting
- Automatically removes headers and footers
- Convert into text with a natural reading order, even in the presence of figures, multi-column layouts, and insets
- Efficient, less than $200 USD per million pages converted

(Based on a 7B parameter VLM, so it requires a GPU)

News
- October 21, 2025 - v0.4.0 - New model release, boosts olmOCR-Bench score by ~4 points using synthetic data and introduces RL training.
- August 13, 2025 - v0.3.0 - New model release, fixes auto-rotation detection and hallucinations on blank documents.
- July 24, 2025 - v0.2.1 - New model release, scores 3 points higher on olmOCR-Bench, also runs significantly faster because it defaults to FP8, and needs far fewer retries per document.
- July 23, 2025 - v0.2.0 - New cleaned-up trainer code, makes it much simpler to train olmOCR models yourself.
- June 17, 2025 - v0.1.75 -…