Marker converts PDF, EPUB, and MOBI files to Markdown. It is 10x faster than Nougat, more accurate on most documents, and has a low risk of hallucination.
- Support for a range of PDF documents (optimized for books and scientific papers)
- Removes headers/footers/other artifacts
- Converts most equations to latex
- Formats code blocks and tables
- Support for multiple languages (although most testing is done in English). See settings.py for a language list.
- Works on GPU, CPU, or MPS
See also:
- https://github.com/MarkPDFdown/markpdfdown A high-quality PDF to Markdown tool based on large language model visual recognition.
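- gptpdf: Using GPT to parse PDF. A tool that uses GPT-4o to parse PDFs into Markdown. It can parse almost any PDF file nearly perfectly, including layout, math formulas, tables, images, and charts, at an average cost of $0.013 per page.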
- https://github.com/opendatalab/MinerU MinerU: A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and e-book extraction and converts the results to Markdown. MinerU also offers free OCR for parsing and converting scanned PDFs. [Try it online]
- https://github.com/oomol-lab/pdf-craft PDF Craft converts PDF files into various other formats. The project focuses on processing PDFs of scanned books. [v2ex]