API Reference#
Parser Types#
Lexoid exposes three parser types via the lexoid.api.ParserType enum:
- LLM_PARSE: Sends document pages (rendered to images) to an LLM API.
- STATIC_PARSE: Non-LLM parsing via pdfplumber, pdfminer, or paddleocr for PDFs/images, plus dedicated handlers for HTML, plain text, CSV, spreadsheets, DOCX, and PPTX.
- AUTO: Routes to LLM_PARSE or STATIC_PARSE based on document characteristics and the configured routing priority. Falls back to the alternate parser type on failure.
Static parsing frameworks#
For PDF inputs, the following framework values are supported:
- pdfplumber (default): Text extraction with table/heading/hyperlink heuristics.
- pdfminer: Pure pdfminer-based text extraction.
- paddleocr: OCR-based extraction (used automatically for image inputs and as a fallback).
Core Function#
parse#
- lexoid.api.parse(path: str, parser_type: str | ParserType = 'AUTO', pages_per_split: int = 4, max_processes: int = 4, **kwargs) -> Dict#
Parse a document, image, audio file, or URL.
- Parameters:
path – File path or URL to parse.
parser_type – Parser type to use ("LLM_PARSE", "STATIC_PARSE", or "AUTO"). Default: "AUTO".
pages_per_split – Number of pages per chunk for processing. Default: 4.
max_processes – Maximum number of parallel processes. Default: 4. Automatically forced to 1 when parser_type="LLM_PARSE" and api_provider="ollama".
kwargs – Additional keyword arguments (see below).
Automatic input handling:
- .doc/.docx inputs with any parser_type other than STATIC_PARSE are first converted to PDF (as_pdf is forced to True).
- .xlsx and .pptx inputs with parser_type="LLM_PARSE" are automatically switched to STATIC_PARSE, and a warning is logged, because LLM parsing is not supported for these formats.
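The two rules above can be read as a small decision function. The sketch below is only an illustration of the documented behaviour, not Lexoid's actual code; the helper name resolve_input_handling and its return shape are hypothetical.

```python
from pathlib import Path
from typing import Tuple


def resolve_input_handling(path: str, parser_type: str = "AUTO") -> Tuple[str, bool]:
    """Return (effective_parser_type, as_pdf_forced) for a given input.

    Hypothetical helper mirroring the documented rules; it does not call Lexoid.
    """
    suffix = Path(path).suffix.lower()
    # Word documents are converted to PDF unless static parsing was requested.
    if suffix in {".doc", ".docx"} and parser_type != "STATIC_PARSE":
        return parser_type, True
    # Spreadsheets and slide decks cannot be LLM-parsed, so they go static.
    if suffix in {".xlsx", ".pptx"} and parser_type == "LLM_PARSE":
        return "STATIC_PARSE", False
    return parser_type, False


print(resolve_input_handling("report.docx", "AUTO"))
print(resolve_input_handling("slides.pptx", "LLM_PARSE"))
```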
Additional keyword arguments:
- model (str): LLM model to use. Defaults to the DEFAULT_LLM environment variable, or "gemini-2.5-flash" if it is unset.
- api_provider (str): API provider for LLM parsing. One of "gemini", "openai", "anthropic", "mistral", "huggingface", "together", "openrouter", "fireworks", "ollama", or "local". If not set, the provider is inferred from the model name.
- framework (str): Static parsing framework: "pdfplumber" (default), "pdfminer", or "paddleocr".
- temperature (float): Temperature for LLM generation. Default: 0.0.
- max_tokens (int): Max output tokens per LLM call. Defaults to 1024 (4096 for Ollama).
- system_prompt (str): Override the default parser system prompt.
- user_prompt (str): Override the default user prompt.
- depth (int): Depth for recursive URL parsing. Default: 1.
- as_pdf (bool): Convert input (image / webpage / DOCX) to PDF before processing.
- verbose (bool): Enable verbose logging during LLM parsing.
- x_tolerance (int): X-axis tolerance for pdfplumber text extraction.
- y_tolerance (int): Y-axis tolerance for pdfplumber text extraction.
- save_dir (str): Directory to save intermediate PDFs when as_pdf=True.
- save_filename (str): Filename used when saving the intermediate PDF for a webpage. Defaults to webpage_<timestamp>.pdf.
- page_nums (List[int]): Specific 1-indexed page numbers to parse (PDFs only).
- max_image_dimension (int): Maximum width/height (px) to which page images / input images are downscaled before parsing. Defaults to DEFAULT_MAX_IMAGE_DIMENSION (1000).
- api_cost_mapping (Union[dict, str]): Cost-per-million-tokens dictionary, or path to a JSON file. Sample at tests/api_cost_mapping.json. When provided, the token_cost key is added to the result.
- router_priority (str): Routing priority for AUTO mode. One of:
  - "speed" (default): Uses STATIC_PARSE for PDFs without images, else LLM_PARSE.
  - "accuracy": Prefers LLM_PARSE, except for PDFs with no images but with embedded/hidden hyperlinks (uses STATIC_PARSE since LLMs miss hidden links).
  - "cost": For PDFs that contain images, tries PaddleOCR first; if the extracted character count is at least character_threshold, the PaddleOCR result is returned, otherwise the document is re-parsed with LLM_PARSE. PDFs without images, and non-PDF inputs, fall back to the same routing as "speed".
- character_threshold (int): Minimum character count for a router_priority="cost" STATIC_PARSE result to be accepted. Default: 100.
- autoselect_llm (bool): When parser_type="AUTO", runs the ML-based DocumentRankedLLMSelector to choose the best LLM for the input document. Default: False.
- retry_on_fail (bool): When True (default), automatically retries with the alternate parser type / framework on failure.
- return_bboxes (bool): If True, attach per-segment bounding boxes (bboxes key on each segment). Default: False.
- bbox_framework (str): Framework used for bounding box extraction when return_bboxes=True. One of "auto" (default; chooses paddleocr or pdfplumber based on file content), "pdfplumber", or "paddleocr".
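To make the router_priority rules easier to compare, here is a rough sketch of the first routing decision as a pure function. It is a reading of the description above, not the library's implementation: the function name route_auto, its boolean inputs, and the "PADDLEOCR_FIRST" label are all hypothetical.

```python
def route_auto(priority: str, is_pdf: bool, has_images: bool,
               has_hidden_links: bool = False) -> str:
    """Return the parser AUTO mode would try first, per the documented rules."""

    def speed_route() -> str:
        # STATIC_PARSE only for PDFs without images; everything else goes to the LLM.
        return "STATIC_PARSE" if is_pdf and not has_images else "LLM_PARSE"

    if priority == "speed":
        return speed_route()
    if priority == "accuracy":
        # Image-free PDFs with hidden hyperlinks are the one static exception.
        if is_pdf and not has_images and has_hidden_links:
            return "STATIC_PARSE"
        return "LLM_PARSE"
    if priority == "cost":
        # PDFs with images try PaddleOCR before any LLM call (hypothetical label).
        if is_pdf and has_images:
            return "PADDLEOCR_FIRST"
        return speed_route()
    raise ValueError(f"unknown router_priority: {priority!r}")


for priority in ("speed", "accuracy", "cost"):
    print(priority, route_auto(priority, is_pdf=True, has_images=True))
```

The sketch covers only the initial choice; the fallback to the alternate parser on failure (retry_on_fail) is not modelled.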
Return value format: A dictionary with the following keys. Keys marked (optional) are only present in specific configurations; the others are present on the standard parsing path (and may hold an empty string / list / zeroed dict).
- raw: Full markdown content as a string.
- segments: List of dictionaries with per-segment metadata (e.g., page) and content. For PDFs, a segment is a page; for webpages, a segment is a section (heading and its content). When return_bboxes=True, each segment additionally carries a bboxes key (a list of (text, [x0, top, x1, bottom]) tuples normalized to [0, 1]).
- title: Title of the document (defaults to the input file’s basename).
- url: Original URL if the input was a URL, otherwise an empty string.
- parent_title: Title of the parent document when this result was produced by recursive crawling; otherwise an empty string.
- recursive_docs: List of recursively-parsed sub-documents. Empty unless depth > 1.
- token_usage (optional): Dictionary with input, output, total, and llm_page_count token statistics. Counts are zero when only STATIC_PARSE ran. Absent on the HTML/recursive-URL path: when path is a URL that is not a supported file-typed URL (e.g., .pdf/image) and as_pdf is not set, parse() returns the output of recursive_read_html directly, which does not include this key.
- parsers_used (optional): List of parser names that actually ran, one entry per chunk (e.g., ["LLM_PARSE", "STATIC_PARSE"]). Absent on the HTML/recursive-URL path for the same reason as token_usage.
- token_cost (optional): Estimated cost broken down by token category. Only present when api_cost_mapping is supplied and contains an entry for the resolved model.
- pdf_path (optional): Path to the intermediate PDF generated when as_pdf=True. To keep the file readable after parse() returns, also pass save_dir; otherwise the PDF is written inside a temporary directory that is removed on return.
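Because several keys are optional, code that consumes the result is safer when it reads them defensively. The following is a minimal sketch that works on a plain dictionary shaped like the return value described above; summarize_result is a hypothetical helper, and the sample dictionary is made-up illustration data, not real parser output.

```python
from collections import Counter
from typing import Any, Dict


def summarize_result(result: Dict[str, Any]) -> Dict[str, Any]:
    """Collect a few headline facts, tolerating absent optional keys."""
    usage = result.get("token_usage") or {}  # absent on the HTML/recursive-URL path
    return {
        "title": result.get("title", ""),
        "num_segments": len(result.get("segments", [])),
        "total_tokens": usage.get("total", 0),
        # Count how many chunks each parser handled.
        "parsers": dict(Counter(result.get("parsers_used", []))),
        "has_cost": "token_cost" in result,
    }


# Made-up result resembling a webpage parse (no token_usage / parsers_used).
html_like = {
    "raw": "# Hi",
    "segments": [{"metadata": {}, "content": "# Hi"}],
    "title": "Hi",
    "url": "https://example.com",
    "parent_title": "",
    "recursive_docs": [],
}
print(summarize_result(html_like))
```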
parse_with_schema#
- lexoid.api.parse_with_schema(path: str, schema: Dict | Type, api: str | None = None, model: str = 'gpt-4o-mini', example_schema: Dict | None = None, alternate_keys: Dict | None = None, fill_single_schema: bool = False, **kwargs) -> List#
Parse a document with an LLM to generate structured output conforming to a given schema. The schema can be a plain dict, a Python dataclass, or a Pydantic BaseModel (all are converted to JSON Schema internally).
In the default per-page mode, only PDF and image inputs are accepted (each page is rendered to an image and sent to the LLM together with the schema prompt). When fill_single_schema=True, the document is first parsed with parse() and the resulting markdown is sent to the LLM, so the broader set of file types supported by parse() (DOCX, HTML, URLs, etc.) becomes available.
- Parameters:
path – Path to the file to parse.
schema – dict, dataclass, or Pydantic BaseModel describing the desired output.
api – LLM API provider. One of "gemini", "openai", "anthropic", "mistral", "huggingface", "together", "openrouter", "fireworks", or "ollama". If not specified, inferred from the model name.
model – LLM model name. Default: "gpt-4o-mini".
example_schema – Optional example data illustrating the desired filled schema (improves few-shot extraction).
alternate_keys – Optional mapping of alternate key names that may appear in the document; helps the model match synonyms.
fill_single_schema – When True, the entire document is parsed once and the whole content is used to produce a single instance of the schema (rather than one instance per page).
kwargs – Additional keyword arguments (e.g., temperature, max_tokens).
Additional keyword arguments:
- temperature (float): Sampling temperature for LLM generation. Default: 0.0.
- max_tokens (int): Maximum number of tokens to generate. Default: 1024.
Return value format:
The function returns a Python list of the JSON values parsed from the model. The exact shape depends on the mode and on the schema:
- Default (per-page) mode: one entry per page, in page order. Each entry is the JSON value the model produced for that page. If the schema describes a single record, expect one dict per page; if the schema describes multiple records per page (e.g., table rows), expect a list of dicts per page. Concretely, indexing follows result[page_index][record_index] when each page contains multiple records.
- fill_single_schema=True: a single-element list whose lone element is the JSON value the model produced for the whole document (typically a single dict).
Because the shape is driven by the model’s output, callers that need to support both single-record and multi-record schemas should normalize the per-page entries themselves (e.g., wrap dict entries in a one-element list).
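The normalization suggested above can be sketched as follows. normalize_pages is a hypothetical helper name, and the sample data only stands in for what parse_with_schema might return; it is not real model output.

```python
from typing import Any, List


def normalize_pages(result: List[Any]) -> List[List[Any]]:
    """Ensure every per-page entry is a list of records."""
    normalized = []
    for entry in result:
        if isinstance(entry, list):
            normalized.append(entry)    # already multi-record
        else:
            normalized.append([entry])  # wrap a single record
    return normalized


# Page 0 yielded one record, page 1 yielded two (illustrative values).
pages = [{"total": 10.0}, [{"total": 2.5}, {"total": 4.0}]]
for page_index, records in enumerate(normalize_pages(pages)):
    for record_index, record in enumerate(records):
        print(page_index, record_index, record)
```

After normalization, result[page_index][record_index] indexing works uniformly for both schema styles.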
parse_to_latex#
- lexoid.api.parse_to_latex(path: str, api: str | None = None, model: str = 'gpt-4o-mini', **kwargs) -> str#
Convert a document (PDF or image) into a self-contained LaTeX string by feeding each rendered page to a vision-capable LLM. The first page emits the LaTeX preamble and \begin{document}; the last page closes the document with \end{document}.
- Parameters:
path – Path to the file to convert.
api – LLM API provider. If not specified, inferred from the model name.
model – LLM model name. Default: "gpt-4o-mini".
kwargs – Additional keyword arguments forwarded to the LLM call (e.g., temperature, max_tokens).
- Returns:
The concatenated LaTeX source as a single string.
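Since the output is assembled from per-page LLM responses, a quick structural check before compiling can catch a truncated document. This is a minimal sketch on a plain string; has_document_environment is a hypothetical helper and only checks for the two markers mentioned above, not for valid LaTeX.

```python
def has_document_environment(latex: str) -> bool:
    """True if \\begin{document} appears before a closing \\end{document}."""
    begin = latex.find(r"\begin{document}")
    end = latex.rfind(r"\end{document}")
    return begin != -1 and end != -1 and begin < end


sample = "\\documentclass{article}\n\\begin{document}\nHello\n\\end{document}\n"
print(has_document_environment(sample))
```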
parse_chunk#
- lexoid.api.parse_chunk(path: str, parser_type: ParserType, **kwargs) -> Dict#
Low-level entry point that parses a single file (or PDF chunk) with the given parser type. parse() orchestrates calls to parse_chunk over PDF splits; most users should call parse(). parse_chunk is wrapped by the retry_with_different_parser_type decorator, which implements the AUTO routing and fallback logic.
- Parameters:
path – The file path or URL.
parser_type – The ParserType to use.
kwargs – Same keyword arguments accepted by parse().
- Returns:
Dictionary containing parsed document data, plus a parser_used key indicating which parser type actually ran.
Examples#
Basic Usage#
from lexoid.api import parse
# Basic parsing (AUTO is the default parser_type)
result = parse("document.pdf")
# Raw text output
parsed_md = result["raw"]
# Segmented output with metadata
parsed_segments = result["segments"]
# Explicitly select the parser type
result = parse("document.pdf", parser_type="LLM_PARSE")
LLM-Based Parsing#
# Parse using GPT-4o
result = parse("document.pdf", parser_type="LLM_PARSE", model="gpt-4o")
# Parse using Gemini 2.5 Pro
result = parse("document.pdf", parser_type="LLM_PARSE", model="gemini-2.5-pro")
# Parse using Claude
result = parse("document.pdf", parser_type="LLM_PARSE", model="claude-3-5-sonnet-20241022")
# Parse using Mistral OCR
result = parse(
    "document.pdf",
    parser_type="LLM_PARSE",
    api_provider="mistral",
    model="mistral-ocr-latest",
)
# Parse using a local Ollama model
result = parse(
    "document.pdf",
    parser_type="LLM_PARSE",
    api_provider="ollama",
    model="gemma4:latest",
    max_processes=1,
)
# Local SmolDocling / granite-docling
result = parse(
    "document.pdf",
    parser_type="LLM_PARSE",
    api_provider="local",
    model="ds4sd/SmolDocling-256M-preview",
)
# Local PaddleOCR-VL
result = parse(
    "document.pdf",
    parser_type="LLM_PARSE",
    api_provider="local",
    model="PaddlePaddle/PaddleOCR-VL",
)
Static Parsing#
# Parse using pdfplumber (default static framework)
result = parse("document.pdf", parser_type="STATIC_PARSE", framework="pdfplumber")
# Parse using pdfminer
result = parse("document.pdf", parser_type="STATIC_PARSE", framework="pdfminer")
# OCR with PaddleOCR
result = parse("scanned.pdf", parser_type="STATIC_PARSE", framework="paddleocr")
AUTO Mode and Routing#
# Default AUTO with "speed" priority
result = parse("document.pdf", parser_type="AUTO")
# Accuracy-first routing (prefers LLM_PARSE)
result = parse("document.pdf", parser_type="AUTO", router_priority="accuracy")
# Cost-first routing (tries PaddleOCR before falling back to LLM)
result = parse(
    "document.pdf",
    parser_type="AUTO",
    router_priority="cost",
    character_threshold=100,
)
# Auto-select the best LLM for this document
result = parse("document.pdf", parser_type="AUTO", autoselect_llm=True)
Bounding Box Extraction#
result = parse(
    "document.pdf",
    return_bboxes=True,
    bbox_framework="auto",  # "auto", "pdfplumber", or "paddleocr"
)
for segment in result["segments"]:
    for text, bbox in segment.get("bboxes", []):
        print(text, bbox)  # bbox is [x0, top, x1, bottom] normalized to [0, 1]
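Normalized boxes usually need to be scaled back to pixels before they can be drawn on a page image. The sketch below assumes the [x0, top, x1, bottom] order and [0, 1] range described above; bbox_to_pixels and the page dimensions are hypothetical, chosen only for illustration.

```python
from typing import Sequence, Tuple


def bbox_to_pixels(bbox: Sequence[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized [x0, top, x1, bottom] box to integer pixel coordinates."""
    x0, top, x1, bottom = bbox
    return (
        round(x0 * width),
        round(top * height),
        round(x1 * width),
        round(bottom * height),
    )


# A box on a hypothetical 1000 x 2000 px rendering of the page.
print(bbox_to_pixels([0.1, 0.2, 0.5, 0.25], width=1000, height=2000))
```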
Parse with Schema#
from lexoid.api import parse_with_schema
# Plain dict schema (one filled instance per page)
schema = {
    "Disability Category": "string",
    "Participants": "int",
    "Ballots Completed": "int",
    "Accuracy": ["string"],
    "Time to complete": ["string"],
}
result = parse_with_schema(path="inputs/test_1.pdf", schema=schema, model="gpt-4o")
# Single instance for the whole document
result = parse_with_schema(
    path="inputs/test_1.pdf",
    schema=schema,
    model="gpt-4o",
    fill_single_schema=True,
)
# Pydantic schema with example data and alternate keys
from pydantic import BaseModel
class Invoice(BaseModel):
    invoice_number: str
    total: float
result = parse_with_schema(
    path="inputs/invoice.pdf",
    schema=Invoice,
    model="gpt-4o-mini",
    example_schema={"invoice_number": "INV-001", "total": 199.95},
    alternate_keys={"invoice_number": ["Invoice #", "Invoice No."]},
)
# Dataclass schema
from dataclasses import dataclass
@dataclass
class Receipt:
    merchant: str
    amount: float
result = parse_with_schema(path="receipt.pdf", schema=Receipt)
Parse to LaTeX#
from lexoid.api import parse_to_latex
latex_source = parse_to_latex("paper.pdf", model="gpt-4o")
with open("paper.tex", "w") as f:
    f.write(latex_source)
Web Content#
# Parse a webpage
result = parse("https://example.com")
# Parse a webpage and the pages linked within it
result = parse("https://example.com", depth=2)
# Render the webpage to PDF first, then parse
result = parse(
    "https://example.com",
    as_pdf=True,
    save_dir="output/",
    save_filename="example.pdf",
)
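With depth greater than 1, linked pages come back nested under recursive_docs. The sketch below walks such a tree; it assumes each sub-document has the same title / url / recursive_docs keys as the top-level result, which the reference implies but does not state outright. walk_docs is a hypothetical helper and the tree is made-up sample data.

```python
from typing import Any, Dict, Iterator, Tuple


def walk_docs(result: Dict[str, Any], level: int = 0) -> Iterator[Tuple[int, str, str]]:
    """Yield (nesting level, title, url) for a result and its sub-documents."""
    yield level, result.get("title", ""), result.get("url", "")
    for child in result.get("recursive_docs", []):
        yield from walk_docs(child, level + 1)


# Made-up tree standing in for parse("https://example.com", depth=2).
tree = {
    "title": "Home",
    "url": "https://example.com",
    "recursive_docs": [
        {"title": "About", "url": "https://example.com/about",
         "parent_title": "Home", "recursive_docs": []},
    ],
}
for level, title, url in walk_docs(tree):
    print("  " * level + title, url)
```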
Audio Transcription#
Audio inputs are routed to LLM_PARSE automatically and require a
Gemini model (currently the only provider supporting audio).
result = parse("interview.mp3", model="gemini-2.5-flash")
print(result["raw"])