API Reference
Core Function
parse
- lexoid.api.parse(path: str, parser_type: str | ParserType = 'LLM_PARSE', raw: bool = False, pages_per_split: int = 4, max_processes: int = 4, **kwargs) -> List[Dict] | str
Parse a document using the specified strategy.
- Parameters:
path – File path or URL to parse
parser_type – Parser type to use (“LLM_PARSE”, “STATIC_PARSE”, or “AUTO”)
raw – If True, returns raw text; if False, returns structured data
pages_per_split – Number of pages per chunk for processing
max_processes – Maximum number of parallel processes
kwargs – Additional keyword arguments
- Returns:
List of dictionaries containing page metadata and content, or raw text string
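To make the effect of pages_per_split concrete, the sketch below computes the page ranges a document would be divided into when chunked by a fixed page count. It illustrates the general idea only and is not taken from lexoid's implementation: `split_page_ranges` is a hypothetical helper, and the zero-based, half-open ranges are an assumption.

```python
from typing import List, Tuple


def split_page_ranges(total_pages: int, pages_per_split: int = 4) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges of at most pages_per_split pages.

    Hypothetical helper: shows fixed-size chunking, not lexoid's own code.
    """
    if pages_per_split < 1:
        raise ValueError("pages_per_split must be at least 1")
    return [
        (start, min(start + pages_per_split, total_pages))
        for start in range(0, total_pages, pages_per_split)
    ]


# A 10-page document with the default of 4 pages per chunk
print(split_page_ranges(10))  # [(0, 4), (4, 8), (8, 10)]
```

Under this model, smaller chunks give more units of work to spread across up to max_processes workers, at the cost of more separate parsing calls.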
Additional keyword arguments:
model (str): LLM model to use
framework (str): Static parsing framework
temperature (float): Temperature for LLM generation
depth (int): Depth for recursive URL parsing
as_pdf (bool): Convert input to PDF before processing
verbose (bool): Enable verbose logging
x_tolerance (int): X-axis tolerance for text extraction
y_tolerance (int): Y-axis tolerance for text extraction
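Because these options travel through **kwargs, a misspelled name or wrong type may go unnoticed. The following sketch checks a set of options against the names and types listed above before they are passed to parse. `check_parse_kwargs` is a hypothetical helper, not part of lexoid, and rejecting names outside the documented list is this sketch's own choice; the library may accept further options.

```python
from typing import Any, Dict

# Names and types copied from the keyword-argument list above.
DOCUMENTED_KWARGS = {
    "model": str,
    "framework": str,
    "temperature": float,
    "depth": int,
    "as_pdf": bool,
    "verbose": bool,
    "x_tolerance": int,
    "y_tolerance": int,
}


def check_parse_kwargs(**kwargs: Any) -> Dict[str, Any]:
    """Validate option names and types; return the options unchanged."""
    for name, value in kwargs.items():
        expected = DOCUMENTED_KWARGS.get(name)
        if expected is None:
            raise KeyError(f"undocumented keyword argument: {name}")
        # bool is a subclass of int, so reject it for non-bool options
        if expected is not bool and isinstance(value, bool):
            raise TypeError(f"{name} expects {expected.__name__}, got bool")
        # Accept an int where a float is documented, e.g. temperature=0
        if expected is float and isinstance(value, int):
            continue
        if not isinstance(value, expected):
            raise TypeError(
                f"{name} expects {expected.__name__}, got {type(value).__name__}"
            )
    return kwargs


options = check_parse_kwargs(model="gpt-4o", temperature=0, verbose=True)
```

The validated dictionary can then be unpacked into the call, for example `parse("document.pdf", **options)`.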
Examples
Basic Usage
from lexoid.api import parse
# Basic parsing
result = parse("document.pdf")
# Raw text output
parsed_md = parse("document.pdf", raw=True)
# Automatic parser selection
result = parse("document.pdf", parser_type="AUTO")
LLM-Based Parsing
# Parse using GPT-4o
result = parse("document.pdf", parser_type="LLM_PARSE", model="gpt-4o")
# Parse using Gemini 1.5 Pro
result = parse("document.pdf", parser_type="LLM_PARSE", model="gemini-1.5-pro")
Static Parsing
# Parse using PDFPlumber
result = parse("document.pdf", parser_type="STATIC_PARSE", framework="pdfplumber")
# Parse using PDFMiner
result = parse("document.pdf", parser_type="STATIC_PARSE", framework="pdfminer")
Web Content
# Parse webpage
result = parse("https://example.com")
# Parse webpage and the pages linked within the page
result = parse("https://example.com", depth=2)
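One way to picture the depth option is as a breadth-first walk over links that stops after a fixed number of levels. The sketch below runs over a hypothetical in-memory link graph instead of real pages. It assumes that depth=1 covers only the starting page and depth=2 adds the pages it links to, which matches the comment in the example above but is not confirmed against lexoid's implementation. `LINKS` and `urls_within_depth` are illustrative names.

```python
from typing import Dict, List

# Hypothetical link graph standing in for real web pages.
LINKS: Dict[str, List[str]] = {
    "https://example.com": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": [],
}


def urls_within_depth(start: str, depth: int = 1) -> List[str]:
    """Breadth-first list of URLs covered by a depth-limited crawl."""
    seen = [start]
    frontier = [start]
    # Each extra level of depth follows links one hop further.
    for _ in range(depth - 1):
        next_frontier = []
        for url in frontier:
            for linked in LINKS.get(url, []):
                if linked not in seen:
                    seen.append(linked)
                    next_frontier.append(linked)
        frontier = next_frontier
    return seen


print(urls_within_depth("https://example.com", depth=2))
```

With this graph, depth=2 yields the start page plus `/a` and `/b`, while `/c` is only reached at depth=3.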
Return Value Format
When raw=True, the function returns a raw text string.
When raw=False, the function returns a list of dictionaries:
[
    {
        "metadata": {
            "title": "filename",
            "page": page_number
        },
        "content": "parsed_content"
    },
    # ... one dict per page
]
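Since the return type depends on raw, calling code often needs to handle both shapes. The sketch below shows one way to do that using hand-written sample data in the documented format. `to_text` and `content_by_page` are hypothetical helpers, and the sample titles, page numbers (starting at 1 here), and contents are made up for illustration.

```python
from typing import Dict, List, Union


def to_text(result: Union[str, List[Dict]]) -> str:
    """Return plain text for either documented return shape."""
    if isinstance(result, str):
        # raw=True: the result is already a single string
        return result
    # raw=False: join the content of each page dictionary
    return "\n\n".join(page["content"] for page in result)


def content_by_page(result: List[Dict]) -> Dict[int, str]:
    """Map each page number in the metadata to that page's content."""
    return {page["metadata"]["page"]: page["content"] for page in result}


# Sample data shaped like the structured return value (values are invented)
sample = [
    {"metadata": {"title": "document.pdf", "page": 1}, "content": "First page"},
    {"metadata": {"title": "document.pdf", "page": 2}, "content": "Second page"},
]

print(to_text(sample))
print(content_by_page(sample)[2])  # Second page
```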