ScrapeGraphAI Tutorial: Getting Started With AI Web Scraping

ScrapeGraphAI is an open-source tool that simplifies web scraping by automatically extracting structured data from websites, letting users retrieve that data through simple prompts.
Jul 28, 2024  · 7 min read

Extracting and organizing data from various sources, such as websites and local documents (XML, HTML, JSON, Markdown), can be complex and time-consuming. Whether you're doing research, business analytics, or content aggregation, the manual effort can be overwhelming.

ScrapeGraphAI, a Python library for web scraping, simplifies this process.

It uses large language models (LLMs) and direct graph logic to create efficient scraping pipelines, automating data extraction and minimizing the need for extensive coding.

In this article, I’ll provide a brief introduction to ScrapeGraphAI and guide you through setting up your first pipeline.

What Is ScrapeGraphAI?

ScrapeGraphAI is a powerful web scraping tool that uses large language models (LLMs) and direct graph logic to build scraping pipelines.

It can extract information from websites and various local document formats, such as XML, HTML, JSON, and Markdown.

Features

ScrapeGraphAI is designed to be user-friendly and efficient. Users simply specify the information they need, and ScrapeGraphAI handles the rest. It automates the creation of scraping pipelines based on user prompts, which reduces the need for manual coding.

The tool supports various document formats and can integrate with different LLMs through APIs. It is scalable, offering both single-page and multi-page scraping capabilities, making it suitable for both small and large-scale data extraction tasks.

Lastly, it is compatible with multiple LLM providers like OpenAI, Groq, Azure, and Gemini, as well as local models using Ollama.
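
As a rough illustration of what switching to a local model might look like, here is a sketch of a configuration for a model served through Ollama. The model name "ollama/mistral" and the "format" and "base_url" keys are assumptions on my part, and the exact keys can differ between ScrapeGraphAI versions, so check the documentation for the release you install.

```python
# Hypothetical configuration for a local model served by Ollama.
# Key names and values are assumptions and may differ between library versions.
graph_config = {
    "llm": {
        "model": "ollama/mistral",             # assumed "<provider>/<model>" naming
        "temperature": 0,
        "format": "json",                      # ask the model for JSON output
        "base_url": "http://localhost:11434",  # Ollama's default local address
    },
    "verbose": True,
}
```

No API key appears here because the model runs on your own machine; the rest of the pipeline setup stays the same.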

Scraping pipelines

ScrapeGraphAI includes several types of scraping pipelines:

  • SmartScraperGraph: This is a single-page scraper that only requires a user prompt and an input source.
  • SearchGraph: This multi-page scraper extracts information from the top search results.
  • SpeechGraph: This single-page scraper generates audio files from website content.
  • ScriptCreatorGraph: This single-page scraper generates a Python script for extracting the requested data.
  • SmartScraperMultiGraph: This multi-page scraper works on multiple pages with a single prompt and a list of sources.
  • ScriptCreatorMultiGraph: This multi-page scraper creates Python scripts for extracting information from multiple pages and sources.
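
To make the choice between these pipelines concrete, here is a small sketch that maps a task description to one of the pipeline names listed above. The helper choose_pipeline and its parameters are hypothetical and only encode the list above; they are not part of ScrapeGraphAI.

```python
def choose_pipeline(num_sources: int, output: str = "data") -> str:
    """Return the name of the pipeline that matches a task (illustrative helper).

    num_sources: number of pages/documents you already have; 0 means
                 "search the web for me".
    output:      "data" (structured data), "script" (generated Python
                 script), or "audio" (audio file).
    """
    if output not in {"data", "script", "audio"}:
        raise ValueError(f"unknown output type: {output!r}")
    if num_sources < 0:
        raise ValueError("num_sources must be zero or positive")

    if num_sources == 0:
        # No sources given: SearchGraph works from the top search results
        if output != "data":
            raise ValueError("this sketch only maps search tasks to data output")
        return "SearchGraph"
    if output == "audio":
        # SpeechGraph is listed as a single-page scraper
        if num_sources != 1:
            raise ValueError("audio output is only listed for a single page")
        return "SpeechGraph"
    if output == "script":
        return "ScriptCreatorGraph" if num_sources == 1 else "ScriptCreatorMultiGraph"
    return "SmartScraperGraph" if num_sources == 1 else "SmartScraperMultiGraph"
```

For example, choose_pipeline(1) returns "SmartScraperGraph", which is the pipeline used in the rest of this tutorial.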

ScrapeGraphAI Installation

ScrapeGraphAI simplifies setting up and running data extraction tasks from websites and local documents. Here's how to quickly install the library and build a basic scraping application.

Quick install

To install ScrapeGraphAI, run:

pip install scrapegraphai
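
Before moving on, it can help to confirm that the package is importable from the interpreter you plan to use. The helper below is a small sketch using only the standard library; the function name is_installed is my own and not part of ScrapeGraphAI.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found by the current interpreter."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("scrapegraphai"):
        print("scrapegraphai is available")
    else:
        print("scrapegraphai not found; run: pip install scrapegraphai")
```

If the check fails even after installing, the usual cause is that pip and python point to different environments.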

Building a Simple ScrapeGraphAI Application

We'll build a simple scraping pipeline using SmartScraperGraph. First, I’ll describe the steps, and then I’ll provide the code.

Step 1: Define the task

Specify what information you want to extract. In this example, we'll extract all the article titles and URLs from my own Substack newsletter, The Limitless Playbook 🧬.

Step 2: Select the scraping pipeline

Choose the appropriate pipeline. For a single page, we'll use SmartScraperGraph. Feel free to explore other pipelines for different use cases.

Step 3: Execute the pipeline

Run the pipeline using the .run() method to extract the data.

Step 4: Review and utilize data

Validate the extracted data. LLMs are powerful but might not always give perfect results initially. Adjust the prompt as needed to get the desired output.

Code

Here’s the complete code snippet that covers the steps outlined earlier:

import json

import nest_asyncio
from scrapegraphai.graphs import SmartScraperGraph

# Allow nested event loops; useful inside notebooks (e.g. Jupyter),
# where an asyncio loop is already running
nest_asyncio.apply()

# Define the configuration for the scraping pipeline
OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"  # replace with your own key
graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo-0125",
        "temperature": 0,  # keep the output as deterministic as possible
    },
    "verbose": True,  # print progress while the pipeline runs
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the articles with their respective urls",
    source="https://ryanocm.substack.com/archive",
    config=graph_config,
)

# Run the pipeline
articles_data = smart_scraper_graph.run()

# Save the result to a JSON file
with open("articles_data.json", "w") as json_file:
    json.dump(articles_data, json_file, indent=4)
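
If you'd rather not hardcode the key, one option is to read it from an environment variable. This is a minimal sketch: the variable name OPENAI_API_KEY and the helpers load_api_key and build_graph_config are assumptions for illustration, not something ScrapeGraphAI requires.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Read the API key from an environment variable, failing early if it is missing."""
    api_key = os.environ.get(var_name, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running the pipeline."
        )
    return api_key


def build_graph_config(api_key: str) -> dict:
    """Build the same configuration as in the snippet above, with the key passed in."""
    return {
        "llm": {
            "api_key": api_key,
            "model": "gpt-3.5-turbo-0125",
            "temperature": 0,
        },
        "verbose": True,
    }
```

With these in place, graph_config = build_graph_config(load_api_key()) would replace the hardcoded key, which keeps the secret out of the script and out of version control.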

And here are the final results. Pretty good, right?

{
    "articles": [
        {
            "title": "#112 | I Built an AI Tool to Help You Learn More and Take Action Faster",
            "url": "https://ryanocm.substack.com/p/112-i-built-an-ai-tool-to-help-you"
        },
        {
            "title": "#111 | Don't Fight When You Are Getting Flooded \ud83c\udf0a",
            "url": "https://ryanocm.substack.com/p/111-dont-fight-when-you-are-getting"
        },
        {
            "title": "#110 | How to process past fights in relationships",
            "url": "https://ryanocm.substack.com/p/110-how-to-process-past-fights-in"
        },
        {
            "title": "#109 | AI x Crypto(graphy)",
            "url": "https://ryanocm.substack.com/p/109-ai-x-cryptography"
        },
        {
            "title": "#108 | How to Love Intentionally \u2764\ufe0f",
            "url": "https://ryanocm.substack.com/p/108-how-to-love-intentionally"
        },
        {
            "title": "#107 | 16 Questions To Better Understand Who You Are \ud83c\udfad",
            "url": "https://ryanocm.substack.com/p/107-16-questions-to-better-understand"
        },
        {
            "title": "#106 | Three Types of Burnouts \ud83e\udd15",
            "url": "https://ryanocm.substack.com/p/106-three-types-of-burnouts"
        },
        {
            "title": "#105 | The Bagel Method in Relationships \ud83e\udd6f",
            "url": "https://ryanocm.substack.com/p/105-the-bagel-method-in-relationships"
        },
        {
            "title": "#104 | The 8 Play Personalities \ud83c\udfad",
            "url": "https://ryanocm.substack.com/p/104-the-8-play-personalities"
        },
        {
            "title": "#103 | The Top Ten Myths About Conflict by Relationship Experts Julie and John Gottman",
            "url": "https://ryanocm.substack.com/p/103-the-top-ten-myths-about-conflict"
        },
        {
            "title": "#102 | 7 Principles to Writing Well",
            "url": "https://ryanocm.substack.com/p/102-7-principles-to-writing-well"
        },
        {
            "title": "#101 | How to Remember More from Books",
            "url": "https://ryanocm.substack.com/p/101-how-to-remember-more-from-books"
        }
    ]
}
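
Step 4 calls for validating the output before relying on it. Here is a minimal sketch of such a check, assuming the result has the shape shown above (an "articles" list of objects with "title" and "url"). The function name find_invalid_articles and the specific checks are illustrative choices; adapt them to your own prompt.

```python
from urllib.parse import urlparse


def find_invalid_articles(result) -> list:
    """Return (index, reason) pairs for entries that fail basic sanity checks."""
    articles = result.get("articles") if isinstance(result, dict) else None
    if not isinstance(articles, list):
        return [(-1, "missing or non-list 'articles' key")]

    problems = []
    seen_urls = set()
    for index, article in enumerate(articles):
        if not isinstance(article, dict):
            problems.append((index, "entry is not an object"))
            continue

        # A usable entry needs a non-empty title
        title = article.get("title")
        if not isinstance(title, str) or not title.strip():
            problems.append((index, "empty or missing title"))

        # ...and an absolute http(s) URL that has not appeared before
        url = article.get("url")
        parsed = urlparse(url) if isinstance(url, str) else None
        if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append((index, "missing or malformed url"))
        elif url in seen_urls:
            problems.append((index, "duplicate url"))
        else:
            seen_urls.add(url)
    return problems
```

An empty list means every entry passed. If problems come back, that is usually a cue to tighten the prompt (for example, by asking explicitly for absolute URLs) and run the pipeline again.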

Conclusion

ScrapeGraphAI simplifies and automates web and document scraping, making data extraction easier and faster. Its compatibility with various LLMs and document formats makes it a versatile tool for all kinds of data tasks. With ScrapeGraphAI, you can focus on analyzing and using your data rather than collecting it.

And finally, please use ScrapeGraphAI responsibly and be aware of the scraping rules for the websites you target. Always follow the terms of service and legal guidelines for web scraping.
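
One concrete way to act on this is to check a site's robots.txt rules before scraping a URL. The sketch below uses Python's standard-library urllib.robotparser; the helper name is_allowed and the sample rules are made up for illustration. robots.txt is only one part of a site's rules, so it doesn't replace reading the terms of service.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check whether robots.txt rules (given as text) permit fetching a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("https://example.com/archive", "https://example.com/private/notes"):
        verdict = "allowed" if is_allowed(SAMPLE_RULES, url) else "disallowed"
        print(f"{url}: {verdict}")
```

To check a live site instead of a string, RobotFileParser also offers set_url() and read(), which download the site's robots.txt before you call can_fetch().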

Author
Ryan Ong

Ryan is a lead data scientist specialising in building AI applications using LLMs. He is a PhD candidate in Natural Language Processing and Knowledge Graphs at Imperial College London, where he also completed his Master’s degree in Computer Science. Outside of data science, he writes a weekly Substack newsletter, The Limitless Playbook, where he shares one actionable idea from the world's top thinkers and occasionally writes about core AI concepts.
