API Reference

Core Function

parse

lexoid.api.parse(path: str, parser_type: str | ParserType = 'LLM_PARSE', raw: bool = False, pages_per_split: int = 4, max_processes: int = 4, **kwargs) -> List[Dict] | str

Parse a document using the specified strategy.

Parameters:
  • path – File path or URL to parse

  • parser_type – Parser type to use ("LLM_PARSE", "STATIC_PARSE", or "AUTO")

  • raw – If True, returns raw text; if False, returns structured data

  • pages_per_split – Number of pages per chunk for processing

  • max_processes – Maximum number of parallel processes

  • kwargs – Additional keyword arguments (listed below)

Returns:

A list of dictionaries containing page metadata and content when raw=False, or a raw text string when raw=True

Additional keyword arguments:

  • model (str): LLM model to use

  • framework (str): Static parsing framework

  • temperature (float): Temperature for LLM generation

  • depth (int): Depth for recursive URL parsing

  • as_pdf (bool): Convert input to PDF before processing

  • verbose (bool): Enable verbose logging

  • x_tolerance (int): X-axis tolerance for text extraction

  • y_tolerance (int): Y-axis tolerance for text extraction
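
To make the pages_per_split parameter more concrete, the sketch below shows one way a document's pages could be grouped into chunks of that size. It is an illustration only, not Lexoid's actual implementation: the helper name split_page_ranges and the 1-based inclusive ranges are assumptions made for this example.

```python
from typing import List, Tuple


def split_page_ranges(total_pages: int, pages_per_split: int = 4) -> List[Tuple[int, int]]:
    """Return inclusive, 1-based (first_page, last_page) ranges, one per chunk.

    Hypothetical helper for illustration; Lexoid's internal chunking may differ.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if pages_per_split < 1:
        raise ValueError("pages_per_split must be at least 1")
    return [
        (start, min(start + pages_per_split - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_split)
    ]


# A hypothetical 10-page document with the default pages_per_split=4
chunks = split_page_ranges(10, pages_per_split=4)
print(chunks)  # [(1, 4), (5, 8), (9, 10)]
```

Under this reading, a smaller pages_per_split yields more chunks, which presumably gives max_processes more units of work to run in parallel.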

Examples

Basic Usage

from lexoid.api import parse

# Basic parsing
result = parse("document.pdf")

# Raw text output
parsed_md = parse("document.pdf", raw=True)

# Automatic parser selection
result = parse("document.pdf", parser_type="AUTO")

LLM-Based Parsing

# Parse using GPT-4o
result = parse("document.pdf", parser_type="LLM_PARSE", model="gpt-4o")

# Parse using Gemini 1.5 Pro
result = parse("document.pdf", parser_type="LLM_PARSE", model="gemini-1.5-pro")

Static Parsing

# Parse using PDFPlumber (static frameworks are selected with `framework`, not `model`)
result = parse("document.pdf", parser_type="STATIC_PARSE", framework="pdfplumber")

# Parse using PDFMiner
result = parse("document.pdf", parser_type="STATIC_PARSE", framework="pdfminer")

Web Content

# Parse a single webpage
result = parse("https://example.com")

# Parse a webpage and the pages it links to
result = parse("https://example.com", depth=2)

Return Value Format

When raw=True, the function returns a raw text string.

When raw=False, the function returns a list of dictionaries:

[
    {
        "metadata": {
            "title": "filename",
            "page": page_number
        },
        "content": "parsed_content"
    },
    # ... one dict per page
]
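
Because parse can return either shape, calling code often has to handle both. The sketch below is one possible way to collapse either return value into a single string. The helper name to_text and the sample data are hypothetical, and the code assumes only the "metadata" and "content" keys shown above.

```python
from typing import Dict, List, Union

# Either of the two return shapes documented above
ParseResult = Union[str, List[Dict]]


def to_text(result: ParseResult, separator: str = "\n\n") -> str:
    """Collapse either return shape into one string (hypothetical helper)."""
    if isinstance(result, str):
        # raw=True: the result is already plain text
        return result
    # raw=False: join the per-page content in the order it was returned
    return separator.join(page["content"] for page in result)


# Sample data shaped like the structured return value; not real parser output
sample = [
    {"metadata": {"title": "document.pdf", "page": 1}, "content": "First page"},
    {"metadata": {"title": "document.pdf", "page": 2}, "content": "Second page"},
]

print(to_text(sample))
print(to_text("already raw text"))
```

The helper keeps pages in the order they were returned rather than re-sorting by page number, since a result that spans several sources (for example, a URL parsed with depth) may restart page numbering for each title.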