Overview

Crawleo provides multiple output formats optimized for different use cases. Choose the format that best fits your application’s needs.

Available Formats

Raw HTML

Original page source with all elements intact.

Enhanced HTML

Clean HTML with ads, scripts, and tracking removed (default).

Plain Text

Extracted text content without HTML markup.

Markdown

Structured Markdown optimized for LLMs (default).

Format Details

Raw HTML

Returns the complete, unmodified HTML source of the page.

Parameter: raw_html=true

Best for:
  • Full page preservation
  • Custom parsing and extraction
  • When you need exact page structure
Example output:
<!DOCTYPE html>
<html>
<head>
  <title>Example Page</title>
  <script src="analytics.js"></script>
</head>
<body>
  <nav>...</nav>
  <main>
    <h1>Welcome</h1>
    <p>Content here...</p>
  </main>
  <footer>...</footer>
</body>
</html>
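As a sketch of the "custom parsing" case, the snippet below pulls the title and top-level headings out of a raw HTML string using only Python's standard library. The class name `TitleAndHeadings` and the sample string are illustrative assumptions; the string simply mirrors the example output above and is not a real API response.

```python
from html.parser import HTMLParser


class TitleAndHeadings(HTMLParser):
    """Collects the <title> text and every <h1> text from an HTML string."""

    def __init__(self):
        super().__init__()
        self._current = None  # tag currently being captured, if any
        self.title = ""
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._current = tag
            if tag == "h1":
                self.headings.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h1":
            self.headings[-1] += data


# Sample raw HTML shaped like the example output (hypothetical data).
raw_html = """<!DOCTYPE html>
<html>
<head>
  <title>Example Page</title>
  <script src="analytics.js"></script>
</head>
<body>
  <main>
    <h1>Welcome</h1>
    <p>Content here...</p>
  </main>
</body>
</html>"""

parser = TitleAndHeadings()
parser.feed(raw_html)
print(parser.title)     # Example Page
print(parser.headings)  # ['Welcome']
```

For anything beyond simple tag capture, a dedicated HTML parsing library is likely a better fit; the standard-library parser is used here only to keep the sketch dependency-free.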

Enhanced HTML

Returns cleaned HTML with ads, tracking scripts, and unnecessary elements removed. Enabled by default.

Parameter: enhanced_html=true (default)

Best for:
  • Cleaner content processing
  • Reduced noise in extraction
  • When you want HTML structure without clutter
Example output:
<main>
  <h1>Welcome</h1>
  <p>Content here...</p>
</main>

Plain Text

Returns extracted text content without any HTML markup.

Parameter: page_text=true

Best for:
  • Simple text processing
  • Search and analysis tasks
  • When HTML structure isn’t needed
Example output:
Welcome

Content here...
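Because plain text needs no parsing, simple analysis can run on it directly. The following is a minimal sketch of a word-frequency count; the function name `top_words` and the sample string are assumptions for illustration, with the string copied from the example output above.

```python
import re
from collections import Counter


def top_words(text, n=3):
    """Return the n most frequent lowercase words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Sample plain-text output (hypothetical data, not a real response).
page_text = "Welcome\n\nContent here..."
print(top_words(page_text))
```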

Markdown

Returns content converted to structured Markdown format. Enabled by default.

Parameter: markdown=true (default)

Best for:
  • RAG pipelines
  • LLM consumption
  • Vector database ingestion
  • Minimal token usage
Example output:
# Welcome

Content here...

## Section Title

More content with [links](https://example.com) and **formatting**.
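For RAG pipelines, Markdown headings give natural chunk boundaries. The sketch below splits a Markdown string into (heading, body) pairs before embedding. The function name `split_markdown_sections` is a hypothetical helper, not part of Crawleo, and the sample mirrors the example output above.

```python
import re


def split_markdown_sections(markdown_text):
    """Split Markdown into (heading, body) pairs at ATX headings (#, ##, ...)."""
    sections = []
    heading, body = None, []
    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # Close the previous section if it has a heading or any text.
            if heading is not None or any(l.strip() for l in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    if heading is not None or any(l.strip() for l in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections


# Sample Markdown output (hypothetical data, not a real response).
sample = """# Welcome

Content here...

## Section Title

More content with [links](https://example.com) and **formatting**."""

for heading, body in split_markdown_sections(sample):
    print(f"{heading!r}: {body!r}")
```

This simple splitter treats any line starting with `#` as a heading, so it would also split on `#` lines inside fenced code blocks; a production chunker would need to track fences.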

Format Comparison

Format          Token Usage   Structure   Noise Level   Best For
Raw HTML        High          Full        High          Custom parsing
Enhanced HTML   Medium        Partial     Low           Clean extraction
Plain Text      Low           None        Low           Simple processing
Markdown        Low           Preserved   Minimal       LLM/RAG
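To make the token-usage column concrete, the sketch below compares rough token estimates for the same small page in each format. The four-characters-per-token ratio is a common rule of thumb for English text and an assumption here, not a Crawleo figure; the sample strings are adapted from the example outputs on this page.

```python
def estimate_tokens(text):
    """Very rough estimate: about 4 characters per token (assumed heuristic)."""
    return max(1, len(text) // 4)


# Sample outputs for one small page (hypothetical data).
samples = {
    "raw_html": (
        "<!DOCTYPE html>\n<html>\n<head>\n  <title>Example Page</title>\n"
        '  <script src="analytics.js"></script>\n</head>\n<body>\n'
        "  <nav>...</nav>\n  <main>\n    <h1>Welcome</h1>\n"
        "    <p>Content here...</p>\n  </main>\n  <footer>...</footer>\n"
        "</body>\n</html>"
    ),
    "enhanced_html": "<main>\n  <h1>Welcome</h1>\n  <p>Content here...</p>\n</main>",
    "page_text": "Welcome\n\nContent here...",
    "markdown": "# Welcome\n\nContent here...",
}

estimates = {name: estimate_tokens(text) for name, text in samples.items()}
for name, tokens in estimates.items():
    print(f"{name}: ~{tokens} tokens")
```

For real budgeting, count tokens with the tokenizer of the model you actually use; character-based estimates only indicate relative size.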

Recommendations by Use Case

Use Markdown format for optimal results:
  • Preserves document structure (headers, lists, links)
  • Minimal token usage
  • Clean content ready for embedding
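import requests

url = "https://example.com"   # page to crawl
api_key = "YOUR_API_KEY"      # placeholder, as in the curl example

response = requests.get(
    "https://api.crawleo.dev/crawl",
    params={"urls": url, "markdown": True},
    headers={"Authorization": f"Bearer {api_key}"},
)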
Use Markdown or Plain Text:
  • Markdown if structure matters (enabled by default)
  • Plain text for maximum token efficiency
# For structured content (default)
params = {"urls": url, "markdown": True}

# For simple text
params = {"urls": url, "page_text": True, "markdown": False}
Use Raw HTML or Enhanced HTML:
  • Raw HTML for full control
  • Enhanced HTML for cleaner starting point (default)
# Get both formats
params = {"urls": url, "raw_html": True, "enhanced_html": True}
Use Plain Text or Markdown:
  • Easy to process with NLP tools
  • No HTML parsing required

Multiple Formats

You can request multiple formats in a single request:
curl -X GET "https://api.crawleo.dev/crawl?urls=https://example.com&raw_html=true&enhanced_html=true&markdown=true&page_text=true" \
  -H "Authorization: Bearer YOUR_API_KEY"
The response will include all requested formats.
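The same multi-format request can be assembled in Python. This sketch only builds the URL and sends nothing; the helper name `build_crawl_url` is hypothetical, and rendering booleans as lowercase `true`/`false` is an assumption made to match the curl example above.

```python
from urllib.parse import urlencode

# Endpoint as shown in the examples on this page.
API_URL = "https://api.crawleo.dev/crawl"


def build_crawl_url(target_url, **formats):
    """Build a GET URL with each format flag rendered as lowercase true/false."""
    params = {"urls": target_url}
    for name, enabled in formats.items():
        params[name] = "true" if enabled else "false"
    return f"{API_URL}?{urlencode(params)}"


request_url = build_crawl_url(
    "https://example.com",
    raw_html=True,
    enhanced_html=True,
    markdown=True,
    page_text=True,
)
print(request_url)
```

Unlike the hand-written curl command, `urlencode` percent-encodes the target URL, which avoids ambiguity when the crawled URL itself contains `&` or `?`.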
Pro tip: Markdown and Enhanced HTML are enabled by default. Set parameters explicitly only to disable one of these defaults or to request an additional format.
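One way to apply this tip is to send only the flags that differ from the defaults. The sketch below assumes that `raw_html` and `page_text` are off by default, which is inferred from the text on this page rather than stated outright; `DEFAULTS` and `minimal_params` are hypothetical names.

```python
# Defaults as described on this page. raw_html and page_text being off by
# default is an assumption inferred from the text, not a documented value.
DEFAULTS = {
    "markdown": True,
    "enhanced_html": True,
    "raw_html": False,
    "page_text": False,
}


def minimal_params(target_url, wanted):
    """Return query params that set only the flags differing from DEFAULTS."""
    unknown = set(wanted) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown formats: {sorted(unknown)}")
    params = {"urls": target_url}
    for name, default in DEFAULTS.items():
        enabled = name in wanted
        if enabled != default:
            params[name] = enabled
    return params


# Defaults only: nothing extra needs to be sent.
print(minimal_params("https://example.com", {"markdown", "enhanced_html"}))
# {'urls': 'https://example.com'}

# Plain text only: disable both defaults and enable page_text.
print(minimal_params("https://example.com", {"page_text"}))
# {'urls': 'https://example.com', 'markdown': False, 'enhanced_html': False, 'page_text': True}
```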
Last modified on April 4, 2026