Firecrawl is a new kind of web crawler optimized for AI workflows. It lets developers scrape single pages, entire websites, or even broad swaths of the web, and it removes much of the complexity of traditional web scraping with features like JavaScript rendering, automatic markdown conversion, and integration with popular LLM frameworks.
In this article, I will describe Firecrawl's main functionality, share some practical advanced tips, cover its ecosystem integrations, and mention real-world applications, so you can start practicing on your own.
What Is Firecrawl?
Firecrawl is an AI-powered web crawler developed by Mendable.ai. It is an API service that crawls websites and converts them into clean LLM-ready data like markdown, JSON, etc.
Unlike traditional scrapers such as BeautifulSoup or Puppeteer, which parse pages blindly and often return messy data, Firecrawl takes an AI-driven approach that understands page context and extracts the main content intelligently. It can turn entire websites into clean markdown or structured data, which makes it ideal for LLM tasks.
Firecrawl has 3 core modes, which I'll discuss in detail in a bit. You generally find yourself using scrape mode to scrape a single URL, crawl mode to scrape an entire website, and map mode for URL discovery.
Key Features of Firecrawl
What makes Firecrawl stand out:
- It requires no sitemap, thanks to an intelligent navigation mechanism.
- It handles heavy JavaScript and dynamic web content.
- It outputs clean markdown, HTML, JSON, screenshots, and more, making it very flexible.
- It has built-in proxy, anti-bot, and caching mechanisms.
- It handles batch processing and concurrency for large-scale jobs.
- It is extremely customizable: you can exclude tags, crawl with custom headers, set crawl depth, and more.
- It integrates well with LLM frameworks like LangChain, LlamaIndex, and CrewAI.
- It is enterprise friendly, with configurable rate limits, concurrency controls, and reliability features.
How to Set Up Firecrawl
The fun part about Firecrawl is that you can get it working very quickly. Here are the main steps to set it up on your system:
API key configuration
- Sign up at firecrawl.dev to obtain an API key.
- Install the Python client library:
pip install firecrawl-py
If you plan to use the LangChain integration, this same firecrawl-py package is required (alongside langchain-community).
- Securely store the API key using environment variables:
import os
from firecrawl import FirecrawlApp
# Set your Firecrawl API key as an environment variable for security
api_key = os.getenv('FIRECRAWL_API_KEY')
# Create an instance of the FirecrawlApp class with your API key
app = FirecrawlApp(api_key=api_key)
Basic scraping example
You can run your first scrape with the .scrape_url() method:
response = app.scrape_url(url='https://firecrawl.dev', formats=['markdown'])
print(response)
This code retrieves the main content of the firecrawl.dev webpage as markdown.
Understanding Firecrawl Modes and Endpoints
Firecrawl has three core modes that define how broadly you want to scrape. In this section, I will walk you through each one.
Scrape mode
Scrape mode targets individual URLs. It is ideal for extracting product details or news articles. The json_options parameter enables schema-based LLM extraction: define the schema you want your final JSON to follow, set up your configuration, and start scraping:
from firecrawl import JsonConfig
from pydantic import BaseModel

# Define the expected data structure for the JSON extraction schema
class ExtractSchema(BaseModel):
    company_mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool

# Define the JSON configuration and set it to LLM extraction mode
json_config = JsonConfig(
    extractionSchema=ExtractSchema.model_json_schema(),
    mode="llm-extraction",
    pageOptions={"onlyMainContent": True}
)

# Scrape the URL, extract the data, and format it as JSON
llm_extraction_result = app.scrape_url(
    'https://firecrawl.dev',
    formats=["json"],
    json_options=json_config
)
print(llm_extraction_result)
Here's another kind of single-URL scraping, this time driving the page with actions and a CSS selector:
# Get a screenshot of the top of the overview page
scrape_result = app.scrape_url(
    'firecrawl.dev',
    formats=['markdown', 'html', 'screenshot'],
    actions=[
        {"type": "wait", "milliseconds": 3000},
        {"type": "click", "selector": "h1"},
        {"type": "wait", "milliseconds": 3000},
        {"type": "scrape"},
        {"type": "screenshot"}
    ],
)
print(scrape_result)
This script opens the website, waits 3000 milliseconds, clicks the first h1 element, waits again, and then scrapes the page and captures a screenshot. If you get timeout errors, try increasing the wait times along with the timeout parameter.
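For instance, here is a minimal sketch of that advice; the timeout argument (in milliseconds) follows the Firecrawl Python SDK, but verify the exact parameter name against your SDK version:

# Retry the scrape with a longer wait and an explicit timeout in milliseconds
scrape_result = app.scrape_url(
    'firecrawl.dev',
    formats=['markdown'],
    timeout=60000,  # give slow pages up to 60 seconds
    actions=[
        {"type": "wait", "milliseconds": 5000},  # wait longer before scraping
        {"type": "scrape"}
    ],
)
print(scrape_result)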
Crawl mode
The crawl mode allows you to crawl an entire website, including all accessible subpages, without requiring a sitemap. It returns a job ID to track progress and supports metadata retrieval when you want to grab extra info like headers and timestamps, rate control to throttle how fast you hit the site, and recursive depth settings to decide how many link levels to follow. The steps of a crawler are as follows:
- First, the crawler opens the given URL
- It scrapes it
- All hyperlinks pointing to subpages are scanned, and each one is handed to its own crawler
- The steps are repeated until a defined maximum is reached.
Here is an example of a crawler using Python requests:
import requests

url = "https://api.firecrawl.dev/v1/crawl"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_API_KEY"
}
data = {
    "url": "https://docs.firecrawl.dev",  # site to crawl
    "limit": 100,  # max pages
    "scrapeOptions": {
        "formats": ["markdown", "html"]  # output formats
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.status_code)
print(response.json())
This code tells Firecrawl to fetch up to 100 pages from https://docs.firecrawl.dev and return each page in both Markdown and HTML, then prints the response. Because crawling runs asynchronously, the response contains a job ID rather than the scraped pages themselves.
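Here is a minimal sketch of polling that job until it completes. The GET endpoint and status field follow Firecrawl's v1 API docs, but double-check them against the current reference:

import time

# `response` and `headers` come from the POST request above
job_id = response.json()["id"]
status_url = f"https://api.firecrawl.dev/v1/crawl/{job_id}"

while True:
    status = requests.get(status_url, headers=headers).json()
    if status.get("status") == "completed":
        break
    time.sleep(5)  # avoid hammering the API while the crawl runs

print(f"Crawled {len(status.get('data', []))} pages")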
If you want to learn more about APIs, check out our Introduction to APIs in Python course, Working with APIs in Python code-along, and our Mastering Python APIs: A Comprehensive Guide to Building and Using APIs in Python tutorial.
Map mode
Map mode lets you input a website and retrieve every URL on the site extremely fast. The /map endpoint transforms a single URL into a comprehensive sitemap, making it ideal when you need to prompt end users to select links to scrape, quickly uncover all site URLs, focus on pages related to a specific topic using the search parameter, or restrict your crawl to particular pages.
A consideration: Because this endpoint prioritizes speed, it may not capture every link.
# Map a website:
map_result = app.map_url('https://firecrawl.dev')
print(map_result)
This code tells Firecrawl to discover all the links on the https://firecrawl.dev site and return them as a list.
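To focus on a subset of pages, a sketch like the following uses the search parameter mentioned above; depending on your SDK version, it may be a keyword argument (as here) or passed in a params dict:

# Narrow the map to URLs related to a given topic
docs_links = app.map_url('https://firecrawl.dev', search='docs')
print(docs_links)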
Real-World Applications of Firecrawl
Firecrawl is a versatile tool, as we have seen, which makes it extremely handy in real-world scenarios. You can scrape job boards or news websites to compile up-to-date listings or articles for analysis, or to train a forecasting machine learning model. You can also perform sentiment analysis on scraped reviews, for example by collecting customer reviews from e-commerce or review sites like Amazon to analyze sentiment and inform business decisions.
You can take this further with price monitoring: tracking product prices across multiple platforms to identify trends or trigger alerts for price drops. Furthermore, you can use Firecrawl to extract technical documentation to train or inform AI agents, improving their knowledge base, or to ingest data into multi-agent AI pipelines, as CAMEL-AI does.
It can also supply structured web data to retrieval-augmented generation (RAG) systems to improve model performance and reduce hallucinations.
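To illustrate the price-monitoring idea, here is a hypothetical sketch that reuses the JSON extraction pattern from earlier; the product URL and schema fields are purely illustrative:

from pydantic import BaseModel
from firecrawl import JsonConfig

# Hypothetical schema for tracking a product listing (illustrative fields)
class PriceSchema(BaseModel):
    product_name: str
    price: float
    currency: str

price_result = app.scrape_url(
    'https://example.com/product/123',  # illustrative URL
    formats=['json'],
    json_options=JsonConfig(
        extractionSchema=PriceSchema.model_json_schema(),
        mode="llm-extraction"
    )
)
print(price_result)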
Advanced Techniques and Best Practices
To maximize Firecrawl's potential, here are some advanced techniques and best practices:
- Extract structured data for your LLM by using the /extract endpoint to return structured data based on a prompt or a schema, since LLMs handle this type of data better than raw HTML.
- Handle errors with retries by implementing retry logic with exponential backoff to manage transient API errors (see the sketch after this list).
- Optimize performance by using Firecrawl's asynchronous crawling to start large crawls without blocking your application. This is ideal for web apps or services.
- Filter crawls with parameters like max_depth or exclude to limit crawls to specific subdomains or page types. This helps reduce unnecessary data.
- Use Python tools to validate and clean the crawled data, for example by balancing the dataset or removing null values; pandas, NumPy, seaborn, and others work well for such tasks. If you want to learn more about data cleaning, consider our course on Cleaning Data in Python.
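Here is a minimal retry sketch with exponential backoff. It assumes any raised exception is transient; in production you would catch the SDK's specific error types:

import time

def scrape_with_retries(app, url, max_retries=3):
    # Back off exponentially between attempts: 1s, 2s, 4s, ...
    for attempt in range(max_retries):
        try:
            return app.scrape_url(url, formats=['markdown'])
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(2 ** attempt)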
Firecrawl vs. Traditional Scraping Tools
Firecrawl has numerous advantages that make it one of the best choices in most cases and help it stand out over traditional scraping tools like Scrapy, BeautifulSoup, and Puppeteer. It handles JavaScript-rendered content effortlessly, while tools like BeautifulSoup require add-ons like Selenium and sometimes manual inspection. Its built-in proxy rotation and rate-limit handling cut setup time compared to Scrapy's manual configuration, which is time-consuming in itself. Besides, Firecrawl's LLM extraction capabilities provide structured data, a feature absent from most traditional tools.
However, traditional tools may be preferable in specific cases. If you need local execution, tools like BeautifulSoup run entirely on your machine without API dependencies, which appeals to users who avoid cloud services. And ease of implementation sometimes comes at the cost of deep customization: Scrapy, for instance, offers fine-grained control for highly customizable scraping pipelines.
Integrations and Ecosystem Support
Firecrawl integrates well with many of the common LLM frameworks, with robust ecosystem support, which enhances its utility in AI workflows. You can integrate with LangChain using the FireCrawlLoader class, which makes it easy to pull web data into LangChain pipelines. You can also use it with LlamaIndex, whose FirecrawlReader supports loading web data for indexing and querying. Firecrawl is likewise a built-in tool for crawling websites within CrewAI's framework, and it supports many other frameworks like Flowise, Dify, and CAMEL-AI.
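As a quick sketch of the LangChain path (class and module per the langchain_community docs; requires pip install langchain-community firecrawl-py):

from langchain_community.document_loaders import FireCrawlLoader

# Load a single page into LangChain Document objects
loader = FireCrawlLoader(
    url="https://firecrawl.dev",
    api_key=api_key,  # reuses the key loaded from the environment earlier
    mode="scrape"  # "scrape" one URL, or "crawl" a whole site
)
docs = loader.load()
print(docs[0].metadata)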
Pricing Plans and Usage Limits
Firecrawl offers a range of pricing plans depending on the use case to accommodate different needs. Make sure to refer to the Firecrawl official pricing page, in case things change.
- Free tier: It includes a limited number of credits and is suitable for testing and small projects.
- Hobby tier: It offers more credits for individuals or small teams at a low cost.
- Standard, Growth, and Enterprise tiers: They provide increasing credits, higher rate limits, and advanced features like custom requests per minute (RPM).
The credit system works as follows:

| Plan | Credits | Features | Best For |
| --- | --- | --- | --- |
| Free | Limited | Basic scraping, testing | Beginners, small projects |
| Hobby | Moderate | More credits, standard features | Individual developers |
| Standard | High | Higher rate limits, advanced features | Growing teams |
| Growth | Very High | Scalable, custom options | Large projects |
| Enterprise | Unlimited | Custom RPM, dedicated support | High-volume, enterprise use |
Conclusion
Firecrawl is a game-changer for web data extraction, particularly for AI applications. By converting websites into clean, structured, LLM-ready data, it helps developers build smarter applications. Its ease of use, rich feature set, and extensive integrations make it a top choice for tasks ranging from price monitoring to RAG workflows.
I encourage you to explore Firecrawl's free tier to test its capabilities. Stay engaged with the community on Firecrawl's GitHub and Discord for updates, tips, and support.
With Firecrawl, web data is no longer a challenge. It’s a powerful asset for your AI innovations.
FAQs
Why am I getting a timeout error?
First, check that your API key is set correctly; then verify that the pages or selectors you are targeting actually exist. If they do, increase the wait times and the timeout parameter.
Does Firecrawl only use AI-powered search?
No. You can use different kinds of extraction, such as selector-based and LLM-based extraction.
Can Firecrawl handle websites with heavy JavaScript content?
Yes. Firecrawl renders JavaScript, so it handles even dynamic websites well.
When might I choose a traditional tool like BeautifulSoup over Firecrawl?
BeautifulSoup runs locally, is free, and handles simple static HTML well, all without an API dependency.
Can I test Firecrawl for free?
Yes. The free tier gives you a limited number of credits and requests per minute, which is enough for testing.