Overview

Crawleo provides a dedicated LangChain integration package that makes it easy to use Crawleo’s web search and crawling capabilities in your LangChain applications.

Installation

Install the langchain-crawleo package:
pip install langchain-crawleo

Available Tools

The package provides two LangChain tools:

CrawleoSearch

Web search tool powered by Crawleo’s Search API.

CrawleoCrawler

URL crawling tool powered by Crawleo’s Crawler API.

Quick Start

Basic Setup

from langchain_crawleo import CrawleoSearch, CrawleoCrawler

# Initialize tools with your API key
search_tool = CrawleoSearch(api_key="YOUR_API_KEY")
crawler_tool = CrawleoCrawler(api_key="YOUR_API_KEY")

Using Environment Variables

import os
from langchain_crawleo import CrawleoSearch, CrawleoCrawler

# Set environment variable
os.environ["CRAWLEO_API_KEY"] = "YOUR_API_KEY"

# Tools will automatically use the environment variable
search_tool = CrawleoSearch()
crawler_tool = CrawleoCrawler()

CrawleoSearch Tool

Perform web searches using the Search API:
from langchain_crawleo import CrawleoSearch

search = CrawleoSearch(api_key="YOUR_API_KEY")

# Basic search
results = search.invoke("latest AI news")
print(results)

# Search with options
results = search.invoke({
    "query": "machine learning tutorials",
    "count": 5,
    "get_page_text_markdown": True
})

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| query | str | Search query (required) |
| count | int | Number of results |
| get_page_text_markdown | bool | Return Markdown content |
| auto_crawling | bool | Crawl result pages |
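
As a quick sketch based on the parameters above (behavior assumed from their descriptions, not verified against the API), count and auto_crawling can be combined to fetch the page content of each hit in one call:
# Assumed usage: return 3 results and crawl each result page
results = search.invoke({
    "query": "python packaging best practices",
    "count": 3,
    "auto_crawling": True
})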

CrawleoCrawler Tool

Crawl specific URLs:
from langchain_crawleo import CrawleoCrawler

crawler = CrawleoCrawler(api_key="YOUR_API_KEY")

# Crawl a single URL
content = crawler.invoke("https://example.com")
print(content)

# Crawl with Markdown output
content = crawler.invoke({
    "urls": "https://example.com",
    "markdown": True
})

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| urls | str | Comma-separated URLs (required) |
| markdown | bool | Return Markdown content |
| raw_html | bool | Return raw HTML |
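
Multiple URLs go in a single comma-separated string, and raw HTML can be requested instead of Markdown; a minimal sketch based on the parameters above:
# Assumed usage: crawl two pages and return their raw HTML
content = crawler.invoke({
    "urls": "https://example.com,https://example.org",
    "raw_html": True
})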

Using with LangChain Agents

Create an Agent with Crawleo Tools

from langchain_openai import ChatOpenAI
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate
from langchain_crawleo import CrawleoSearch, CrawleoCrawler

# Initialize LLM
llm = ChatOpenAI(model="gpt-4")

# Initialize Crawleo tools
tools = [
    CrawleoSearch(api_key="YOUR_CRAWLEO_API_KEY"),
    CrawleoCrawler(api_key="YOUR_CRAWLEO_API_KEY")
]

# Create prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful research assistant with access to web search and crawling tools."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}")
])

# Create agent
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Run agent
response = agent_executor.invoke({
    "input": "Search for the latest Python 3.12 features and summarize them"
})
print(response["output"])

RAG Pipeline Example

Build a RAG pipeline with Crawleo:
from langchain_crawleo import CrawleoCrawler
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# 1. Crawl documentation
crawler = CrawleoCrawler(api_key="YOUR_CRAWLEO_API_KEY")
docs_content = crawler.invoke({
    "urls": "https://docs.example.com/guide,https://docs.example.com/api",
    "markdown": True
})

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_text(docs_content)

# 3. Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_texts(chunks, embeddings)
retriever = vectorstore.as_retriever()

# 4. Create RAG chain
llm = ChatOpenAI(model="gpt-4")
prompt = ChatPromptTemplate.from_template("""
Answer based on the following context:

{context}

Question: {question}
""")

def format_docs(docs):
    # Join retrieved Documents into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)

# 5. Query
response = rag_chain.invoke("How do I authenticate with the API?")
print(response.content)

Best Practices

Always use markdown=True or get_page_text_markdown=True for LLM applications to minimize token usage.
Implement retry logic for rate limit errors:
from tenacity import retry, stop_after_attempt, wait_exponential

# Retry up to 3 times with exponential backoff; narrow this to
# rate-limit exceptions if the package exposes a specific error type
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=30))
def search_with_retry(query):
    return search_tool.invoke(query)
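Call the wrapped function in place of a direct invoke; tenacity re-raises the last error once the attempts are exhausted:
results = search_with_retry("latest AI news")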
Cache results to avoid redundant API calls. LangChain's LLM cache skips repeated model invocations:
from langchain.globals import set_llm_cache
from langchain_community.cache import InMemoryCache

# Cache identical LLM calls in memory for the lifetime of the process
set_llm_cache(InMemoryCache())
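For the crawled content itself, a small in-process cache avoids repeated Crawler API calls for the same URL. A minimal sketch using functools.lru_cache; cached_crawl is a hypothetical helper, not part of langchain-crawleo:
from functools import lru_cache

@lru_cache(maxsize=128)
def cached_crawl(url: str):
    # Hypothetical helper: hits the Crawler API only the first time a URL is seen
    return crawler_tool.invoke({"urls": url, "markdown": True})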
