Installation

Installing with pip

pip install lexoid

This installs both the Python library and the lexoid command-line entry point. See Command-Line Interface for CLI usage.

Environment Setup

To use LLM-based parsing, define the environment variables for the providers you intend to use (in a shell, .env file, or your container environment):

GOOGLE_API_KEY=your_google_api_key            # Gemini
OPENAI_API_KEY=your_openai_api_key            # OpenAI / GPT
ANTHROPIC_API_KEY=your_anthropic_api_key      # Claude
MISTRAL_API_KEY=your_mistral_api_key          # Mistral OCR
HUGGINGFACEHUB_API_TOKEN=your_huggingface_token
TOGETHER_API_KEY=your_together_api_key
OPENROUTER_API_KEY=your_openrouter_api_key
FIREWORKS_API_KEY=your_fireworks_api_key

Only the providers you actually use require keys. Local backends (Ollama, SmolDocling/granite-docling, PaddleOCR-VL) do not require an API key.
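Before running an LLM-based parse, it can help to confirm which keys are visible to the Python process. The following is a minimal sketch, not part of Lexoid: `PROVIDER_KEYS` and `configured_providers` are hypothetical names, and the provider labels are informal descriptions rather than identifiers Lexoid defines.

```python
import os

# Variable names come from the list above; the labels are informal
# descriptions, not identifiers that Lexoid itself defines.
PROVIDER_KEYS = {
    "GOOGLE_API_KEY": "Gemini",
    "OPENAI_API_KEY": "OpenAI / GPT",
    "ANTHROPIC_API_KEY": "Claude",
    "MISTRAL_API_KEY": "Mistral OCR",
    "HUGGINGFACEHUB_API_TOKEN": "Hugging Face",
    "TOGETHER_API_KEY": "Together",
    "OPENROUTER_API_KEY": "OpenRouter",
    "FIREWORKS_API_KEY": "Fireworks",
}


def configured_providers(env=None):
    """Return the labels of providers whose key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [
        label
        for var, label in PROVIDER_KEYS.items()
        if env.get(var, "").strip()
    ]


if __name__ == "__main__":
    found = configured_providers()
    print("Configured providers:", ", ".join(found) or "none (local backends only)")
```

An empty result is not necessarily an error, since the local backends need no key.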

Additional environment variables

  • DEFAULT_LLM — overrides the default LLM model. Default: gemini-2.5-flash.

  • DEFAULT_LOCAL_LM — overrides the default local model used by parse_with_local_model. Default: ds4sd/SmolDocling-256M-preview.

  • DEFAULT_STATIC_FRAMEWORK — overrides the default static-parsing framework. Default: pdfplumber.

  • DEFAULT_MAX_IMAGE_DIMENSION — maximum pixel dimension for resizing page/image inputs. Default: 1000.

  • OLLAMA_BASE_URL — base URL of the Ollama server. Default: http://localhost:11434.

  • OLLAMA_TIMEOUT — request timeout (seconds) for Ollama. Default: 120.
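To see which values would apply in a given environment, the variables and defaults listed above can be resolved with a small helper. This is only an illustrative sketch: how Lexoid reads these variables internally is not shown here, and `DEFAULTS` and `effective_settings` are hypothetical names.

```python
import os

# Defaults copied from the list above. This mirrors the documented
# behaviour; it is not Lexoid's own configuration code.
DEFAULTS = {
    "DEFAULT_LLM": "gemini-2.5-flash",
    "DEFAULT_LOCAL_LM": "ds4sd/SmolDocling-256M-preview",
    "DEFAULT_STATIC_FRAMEWORK": "pdfplumber",
    "DEFAULT_MAX_IMAGE_DIMENSION": "1000",
    "OLLAMA_BASE_URL": "http://localhost:11434",
    "OLLAMA_TIMEOUT": "120",
}


def effective_settings(env=None):
    """Resolve each variable from the environment, falling back to its default."""
    env = os.environ if env is None else env
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    # Environment values are strings, so convert the numeric ones explicitly.
    settings["DEFAULT_MAX_IMAGE_DIMENSION"] = int(settings["DEFAULT_MAX_IMAGE_DIMENSION"])
    settings["OLLAMA_TIMEOUT"] = float(settings["OLLAMA_TIMEOUT"])
    return settings


if __name__ == "__main__":
    for name, value in effective_settings().items():
        print(f"{name} = {value}")
```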

Optional Dependencies

Playwright (for web content retrieval)

To use Playwright for retrieving web content (instead of the bare requests library), install its browser dependencies after pip install lexoid:

playwright install --with-deps --only-shell chromium

LibreOffice (for DOCX to PDF on Linux)

On Linux, .doc/.docx to PDF conversion uses LibreOffice’s lowriter binary (because docx2pdf is unsupported on Linux). Install it from your distribution’s package manager, e.g.:

sudo apt-get install libreoffice

On macOS and Windows, docx2pdf is used automatically; it requires Microsoft Word or a compatible installation.
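The platform rule above can be checked ahead of time, before a conversion fails midway. The sketch below is an assumption about how such a check might look, not Lexoid's actual selection code; `pick_docx_converter` is a hypothetical helper.

```python
import platform
import shutil


def pick_docx_converter(system=None, which=shutil.which):
    """Return (converter_name, binary_path) following the rule described above.

    `system` defaults to platform.system(); `which` is injectable for testing.
    """
    system = platform.system() if system is None else system
    if system == "Linux":
        # On Linux the conversion relies on LibreOffice's lowriter binary.
        path = which("lowriter")
        if path is None:
            raise RuntimeError(
                "lowriter not found on PATH; install LibreOffice first"
            )
        return ("libreoffice", path)
    # macOS ("Darwin") and Windows fall through to docx2pdf.
    return ("docx2pdf", None)


if __name__ == "__main__":
    print(pick_docx_converter())
```

On a Linux machine without LibreOffice this raises immediately, which is easier to diagnose than a failed conversion.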

Ollama (for local LLM parsing)

Install Ollama, pull a vision-capable model, and keep the server running:

ollama pull gemma4
ollama serve

Then call parse(..., api_provider="ollama", model="gemma4:latest", max_processes=1). Lexoid forces max_processes=1 for Ollama-backed parsing to avoid local multiprocess contention.
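Before calling parse with the Ollama provider, it may be worth confirming that the server is reachable and the model has been pulled. The sketch below queries Ollama's `/api/tags` endpoint, which lists locally available models, and reuses the OLLAMA_BASE_URL and OLLAMA_TIMEOUT variables described earlier. `list_ollama_models` is a hypothetical helper, not a Lexoid function.

```python
import json
import os
import urllib.error
import urllib.request


def list_ollama_models(base_url=None, timeout=None):
    """Return names of locally pulled models, or None if the server is unreachable."""
    if base_url is None:
        base_url = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")
    if timeout is None:
        timeout = os.environ.get("OLLAMA_TIMEOUT", "120")
    url = base_url.rstrip("/") + "/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=float(timeout)) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON response.
        return None
    return [model.get("name") for model in payload.get("models", [])]


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama server not reachable; is `ollama serve` running?")
    else:
        print("Available models:", models)
```

If the model name passed to parse does not appear in the returned list, running `ollama pull` for it first is the likely fix.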

Building from Source

To build the .whl file:

make build

Local Development Setup

To install dependencies:

make install

Or, to also install the development dependencies:

make dev

To activate the virtual environment:

source .venv/bin/activate