API Reference
Core Functions
parse
- lexoid.api.parse(path: str, parser_type: str | ParserType = "LLM_PARSE", pages_per_split: int = 4, max_processes: int = 4, **kwargs) → Dict
Parse a document using specified strategy.
- Parameters:
path – File path or URL to parse
parser_type – Parser type to use ("LLM_PARSE", "STATIC_PARSE", or "AUTO")
pages_per_split – Number of pages per chunk for processing
max_processes – Maximum number of parallel processes
kwargs – Additional keyword arguments
- Returns:
A dictionary containing the parsed content and metadata (see the return value format below)
Additional keyword arguments:
- model (str): LLM model to use
- framework (str): Static parsing framework
- temperature (float): Temperature for LLM generation
- depth (int): Depth for recursive URL parsing
- as_pdf (bool): Convert input to PDF before processing
- verbose (bool): Enable verbose logging
- x_tolerance (int): X-axis tolerance for text extraction
- y_tolerance (int): Y-axis tolerance for text extraction
- save_dir (str): Directory to save intermediate PDFs
- page_nums (List[int]): List of page numbers to parse
- api_cost_mapping (Union[dict, str]): Dictionary containing API cost details, or the path to a JSON file containing the cost details. A sample file is available at tests/api_cost_mapping.json
- router_priority (str): What the routing strategy should prioritize. Options are "speed" and "accuracy". The router directs a file to either "STATIC_PARSE" or "LLM_PARSE" based on its type and the selected priority. If the priority is "accuracy", it prefers LLM_PARSE unless the PDF has no images but contains embedded/hidden hyperlinks, in which case it uses STATIC_PARSE (because LLMs currently fail to parse hidden hyperlinks). If the priority is "speed", it uses STATIC_PARSE for documents without images and LLM_PARSE for documents with images.
- api_provider (str): The API provider to use for LLM parsing. Options are openai, huggingface, together, openrouter, and fireworks. This parameter is only relevant when using LLM parsing.
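To make the keyword arguments above concrete, here is a minimal sketch of bundling several of them for a parse() call. The file name and all values are illustrative, not prescribed by the library:

```python
# Illustrative kwargs bundle for parse(); values are examples only.
parse_kwargs = {
    "parser_type": "LLM_PARSE",
    "model": "gpt-4o-mini",
    "temperature": 0.0,               # deterministic LLM output
    "pages_per_split": 2,             # smaller chunks, more parallelism
    "as_pdf": True,                   # convert the input to PDF first
    "save_dir": "intermediate_pdfs",  # keep the intermediate PDF
}

# Requires lexoid to be installed and API credentials configured:
# from lexoid.api import parse
# result = parse("report.docx", **parse_kwargs)
```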
Return value format: A dictionary containing a subset or all of the following keys:
- raw: Full markdown content as a string
- segments: List of dictionaries with the metadata and content of each segment. For PDFs, a segment denotes a page. For webpages, a segment denotes a section (a heading and its content).
- title: Title of the document
- url: URL, if applicable
- parent_title: Title of the parent document, if recursively parsed
- recursive_docs: List of dictionaries for recursively parsed documents
- token_usage: Token usage statistics
- pdf_path: Path to the intermediate PDF generated when as_pdf is enabled and the save_dir kwarg is specified
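A short sketch of consuming such a dictionary follows. The result value is a hand-made stand-in; the segment field names used here (metadata, content) follow the description above but are assumptions worth checking against actual output:

```python
# Hand-made stand-in for a parse() result (values are illustrative).
result = {
    "raw": "# Annual Report\n\nRevenue grew 12% year over year.",
    "segments": [
        {"metadata": {"page": 1}, "content": "# Annual Report"},
        {"metadata": {"page": 2}, "content": "Revenue grew 12% year over year."},
    ],
    "title": "Annual Report",
    "token_usage": {"input": 1500, "output": 420},
}

# Full document as one markdown string
full_md = result["raw"]

# Per-segment access, e.g. finding which pages mention "Revenue"
hits = [
    seg["metadata"]["page"]
    for seg in result.get("segments", [])
    if "Revenue" in seg["content"]
]
```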
parse_with_schema
- lexoid.api.parse_with_schema(path: str, schema: Dict, api: str = "openai", model: str = "gpt-4o-mini", **kwargs) → List[List[Dict]]
Parses a PDF using an LLM to generate structured output conforming to a given JSON schema.
- Parameters:
path – Path to the PDF file.
schema – JSON schema to which the parsed output should conform.
api – LLM API provider to use ("openai", "huggingface", "together", "openrouter", or "fireworks").
model – LLM model name.
kwargs – Additional keyword arguments passed to the LLM (e.g., temperature, max_tokens).
- Returns:
A list where each element represents a page, which in turn contains a list of dictionaries conforming to the provided schema.
Additional keyword arguments:
- temperature (float): Sampling temperature for LLM generation.
- max_tokens (int): Maximum number of tokens to generate.
Return value format: A list of pages, where each page is represented as a list of dictionaries. Each dictionary conforms to the structure defined by the input schema.
Examples
Basic Usage

```python
from lexoid.api import parse

# Basic parsing
result = parse("document.pdf")

# Raw text output
parsed_md = result["raw"]

# Segmented output with metadata
parsed_segments = result["segments"]

# Automatic parser selection
result = parse("document.pdf", parser_type="AUTO")
```
LLM-Based Parsing

```python
# Parse using GPT-4o
result = parse("document.pdf", parser_type="LLM_PARSE", model="gpt-4o")

# Parse using Gemini 1.5 Pro
result = parse("document.pdf", parser_type="LLM_PARSE", model="gemini-1.5-pro")
```
Static Parsing

```python
# Parse using PDFPlumber
result = parse("document.pdf", parser_type="STATIC_PARSE", framework="pdfplumber")

# Parse using PDFMiner
result = parse("document.pdf", parser_type="STATIC_PARSE", framework="pdfminer")
```
Parse with Schema

```python
from lexoid.api import parse_with_schema

sample_schema = [
    {
        "Disability Category": "string",
        "Participants": "int",
        "Ballots Completed": "int",
        "Ballots Incomplete/Terminated": "int",
        "Accuracy": ["string"],
        "Time to complete": ["string"]
    }
]

pdf_path = "inputs/test_1.pdf"
result = parse_with_schema(path=pdf_path, schema=sample_schema, model="gpt-4o")
```
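The return value is a list of pages, each a list of schema-conforming dictionaries. A hand-made stand-in (illustrative values, not real output) shows how the result can be consumed:

```python
# Stand-in for a parse_with_schema() result (values are illustrative).
result = [
    [  # page 1
        {"Disability Category": "Blind", "Participants": 5},
        {"Disability Category": "Low Vision", "Participants": 13},
    ]
]

# Flatten all pages into a single list of rows
rows = [row for page in result for row in page]
total_participants = sum(row["Participants"] for row in rows)
```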
Web Content

```python
# Parse a webpage
result = parse("https://example.com")

# Parse a webpage and the pages linked from it
result = parse("https://example.com", depth=2)
```
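When depth > 1, linked pages land under the recursive_docs key of the result. A minimal sketch of collecting all raw markdown from such a tree, assuming each entry in recursive_docs has the same dictionary shape as a top-level result (an assumption worth verifying):

```python
def collect_raw(doc):
    """Gather the 'raw' markdown of a document and of all documents
    parsed recursively beneath it (depth-first)."""
    texts = [doc.get("raw", "")]
    for child in doc.get("recursive_docs", []):
        texts.extend(collect_raw(child))
    return texts

# Stand-in for a parse("https://example.com", depth=2) result
result = {
    "raw": "# Example Domain",
    "recursive_docs": [
        {"raw": "# Linked Page A", "recursive_docs": []},
        {"raw": "# Linked Page B", "recursive_docs": []},
    ],
}
all_pages = collect_raw(result)
```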