Kurs
Extracting and organizing data from various sources, such as websites and local documents (XML, HTML, JSON, Markdown), can be complex and time-consuming. Whether you're doing research, business analytics, or content aggregation, the manual effort can be overwhelming.
ScrapeGraphAI, a Python library for web scraping, simplifies this process.
It uses large language models (LLMs) and direct graph logic to create efficient scraping pipelines, automating data extraction and minimizing the need for extensive coding.
In this article, I’ll provide a brief introduction to ScrapeGraphAI and guide you through setting up your first pipeline.
Develop AI Applications
What Is ScrapeGraphAI?
ScrapeGraphAI is a powerful web scraping tool that uses large language models (LLMs) and direct graph logic to build scraping pipelines.
It can extract information from websites and various local document formats, such as XML, HTML, JSON, and Markdown.
Features
ScrapeGraphAI is designed to be user-friendly and efficient. Users simply specify the information they need, and ScrapeGraphAI handles the rest. It automates the creation of scraping pipelines based on user prompts, which reduces the need for manual coding.
The tool supports various document formats and can integrate with different LLMs through APIs. It is scalable, offering both single-page and multi-page scraping capabilities, making it suitable for both small and large-scale data extraction tasks.
Lastly, it is compatible with multiple LLM providers like OpenAI, Groq, Azure, and Gemini, as well as local models using Ollama.
Scraping pipelines
ScrapeGraphAI includes several types of scraping pipelines:
- SmartScraperGraph: This is a single-page scraper that only requires a user prompt and an input source.
- SearchGraph: This multi-page scraper extracts information from the top search results.
- SpeechGraph: This single-page scraper generates audio files from website content.
- ScriptCreatorGraph: This single-page scraper creates Python scripts for the extracted data.
- SmartScraperMultiGraph: This multi-page scraper works on multiple pages with a single prompt and a list of sources.
- ScriptCreatorMultiGraph: This multi-page scraper creates Python scripts for extracting information from multiple pages and sources.
ScrapeGraphAI Installation
ScrapeGraphAI simplifies setting up and running data extraction tasks from websites and local documents. Here's how to quickly install the library and build a basic scraping application.
Quick install
To install ScrapeGraphAI, run:
pip install scrapegraphai
Building a Simple ScrapeGraphAI Application
We'll build a simple scraping pipeline using SmartScraperGraph. First, I’ll describe the steps, and then I’ll provide the code.
Step 1: Define the task
Specify what information you want to extract. In this example, we'll extract all the article titles and URLs from my own Substack newsletter The Limitless Playbook 🧬.
Step 2: Select the scraping pipeline
Choose the appropriate pipeline. For a single page, we'll use SmartScraperGraph. Feel free to explore other pipelines for different use cases.
Step 3: Execute the pipeline
Run the pipeline using the .run()
method to extract the data.
Step 4: Review and utilize data:
Validate the extracted data. LLMs are powerful but might not always give perfect results initially. Adjust the prompt as needed to get the desired output.
Code
Here’s the complete code snippet that covers the steps outline earlier:
import json
import nest_asyncio
nest_asyncio.apply()
from scrapegraphai.graphs import SmartScraperGraph
# Define the configuration for the scraping pipeline
OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"
graph_config = {
"llm": {
"api_key": OPENAI_API_KEY,
"model": "gpt-3.5-turbo-0125",
"temperature": 0,
},
"verbose": True,
}
# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
prompt="List me all the articles with their respective urls",
source="<https://ryanocm.substack.com/archive>",
config=graph_config
)
# Run the pipeline
articles_data = smart_scraper_graph.run()
# Save the result to a JSON file
with open('articles_data.json', 'w') as json_file:
json.dump(articles_data, json_file, indent=4)
And here are the final results. Pretty good, right?
{
"articles": [
{
"title": "#112 | I Built an AI Tool to Help You Learn More and Take Action Faster",
"url": "<https://ryanocm.substack.com/p/112-i-built-an-ai-tool-to-help-you>"
},
{
"title": "#111 | Don't Fight When You Are Getting Flooded \\ud83c\\udf0a",
"url": "<https://ryanocm.substack.com/p/111-dont-fight-when-you-are-getting>"
},
{
"title": "#110 | How to process past fights in relationships",
"url": "<https://ryanocm.substack.com/p/110-how-to-process-past-fights-in>"
},
{
"title": "#109 | AI x Crypto(graphy)",
"url": "<https://ryanocm.substack.com/p/109-ai-x-cryptography>"
},
{
"title": "#108 | How to Love Intentionally \\u2764\\ufe0f",
"url": "<https://ryanocm.substack.com/p/108-how-to-love-intentionally>"
},
{
"title": "#107 | 16 Questions To Better Understand Who You Are \\ud83c\\udfad",
"url": "<https://ryanocm.substack.com/p/107-16-questions-to-better-understand>"
},
{
"title": "#106 | Three Types of Burnouts \\ud83e\\udd15",
"url": "<https://ryanocm.substack.com/p/106-three-types-of-burnouts>"
},
{
"title": "#105 | The Bagel Method in Relationships \\ud83e\\udd6f",
"url": "<https://ryanocm.substack.com/p/105-the-bagel-method-in-relationships>"
},
{
"title": "#104 | The 8 Play Personalities \\ud83c\\udfad",
"url": "<https://ryanocm.substack.com/p/104-the-8-play-personalities>"
},
{
"title": "#103 | The Top Ten Myths About Conflict by Relationship Experts Julie and John Gottman",
"url": "<https://ryanocm.substack.com/p/103-the-top-ten-myths-about-conflict>"
},
{
"title": "#102 | 7 Principles to Writing Well",
"url": "<https://ryanocm.substack.com/p/102-7-principles-to-writing-well>"
},
{
"title": "#101 | How to Remember More from Books",
"url": "<https://ryanocm.substack.com/p/101-how-to-remember-more-from-books>"
}
]
}
Conclusion
ScrapeGraphAI simplifies and automates web and document scraping, making data extraction easier and faster. Its compatibility with various LLMs and document formats makes it a versatile tool for all kinds of data tasks. With ScrapeGraphAI, you can focus on analyzing and using your data rather than collecting it.
To learn more, check out these resources:
And finally, please use ScrapeGraphAI responsibly and be aware of the scraping rules for the websites you target. Always follow the terms of service and legal guidelines for web scraping.
Earn a Top AI Certification
Ryan is a lead data scientist specialising in building AI applications using LLMs. He is a PhD candidate in Natural Language Processing and Knowledge Graphs at Imperial College London, where he also completed his Master’s degree in Computer Science. Outside of data science, he writes a weekly Substack newsletter, The Limitless Playbook, where he shares one actionable idea from the world's top thinkers and occasionally writes about core AI concepts.