
Send URL index status from Google Search Console into BigQuery using parallel computing in R

Step 1: Getting the URLs

This one is going to be quick: we will use the xsitemap package, which crawls XML sitemaps.

# Install the sitemap crawler from GitHub
library(devtools)
devtools::install_github("pixgarden/xsitemap")

library(xsitemap)
library(urltools)
library(XML)
library(httr)

# Fetch and parse the sitemap; returns a data frame with one row per URL
upload <- xsitemapGet("https://www.rforseo.com/sitemap.xml")

Step 2: Calling the URL Inspection API in parallel

We use the parallel package to run one request per CPU core at the same time.
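As a minimal illustration of the pattern (independent tasks fanned out across cores), mclapply() works like lapply() but forks worker processes:

```r
library(parallel)

# Square four numbers, two at a time. Note: mclapply relies on forking,
# so on Windows mc.cores must be 1.
res_demo <- mclapply(1:4, function(i) i^2, mc.cores = 2)
unlist(res_demo)   # 1 4 9 16
```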

Warning: for the URL Inspection API, the quota is enforced per Search Console website property (i.e., per group of calls querying the same site).

It could be useful to create extra properties for URL directories, so the calls are spread across several quotas.
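A sketch of that idea, assuming you have verified one Search Console property per top-level directory (the directory property URL below is hypothetical):

```r
# Map each URL to a matching directory-level property so inspection calls
# are spread over several quotas. The "/blog/" property is hypothetical;
# anything else falls back to the domain property.
pick_property <- function(url) {
  if (grepl("^https://www.rforseo.com/blog/", url)) {
    "https://www.rforseo.com/blog/"
  } else {
    "sc-domain:rforseo.com"
  }
}

pick_property("https://www.rforseo.com/blog/post-1")   # directory property
pick_property("https://www.rforseo.com/about")         # domain property
```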

library(searchConsoleR)
library(lubridate)
library(parallel)
scr_auth()


res <- mclapply(1:nrow(upload), function(i) {
  cat(".")          # minimal progress indicator
  url <- upload[i, "loc"]
  result <- inspection(url, siteUrl = "sc-domain:rforseo.com", languageCode = NULL)

  # Concatenate the fields with "§", a separator unlikely to appear in URLs
  paste0(url, "§",
         result[["indexStatusResult"]][["verdict"]], "§",
         result[["indexStatusResult"]][["coverageState"]], "§",
         result[["indexStatusResult"]][["robotsTxtState"]], "§",
         result[["indexStatusResult"]][["indexingState"]], "§",
         now())

}, mc.cores = detectCores())      ## split this job across all available cores

res <- data.frame(unlist(res))

library(stringr)

res[, c("url", "verdict", "coverageState", "robotsTxtState", "indexingState", "date")] <- str_split_fixed(res$unlist.res., '§', 6)
res$unlist.res. <- NULL
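With the results in a tidy data frame, you can, for instance, pull out the URLs Google has not indexed. The verdict values below ("PASS", "NEUTRAL") follow the API's verdict field; the toy data frame stands in for the real res:

```r
# Toy results data frame standing in for the real API output
res_demo <- data.frame(
  url     = c("https://www.rforseo.com/", "https://www.rforseo.com/old-page"),
  verdict = c("PASS", "NEUTRAL")
)

# Keep only rows whose verdict is not "PASS", i.e. pages not indexed
not_indexed <- subset(res_demo, verdict != "PASS")
not_indexed$url   # "https://www.rforseo.com/old-page"
```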

Step 3: Save the data frame to a Google Cloud Storage bucket

# Load the package
library(googleCloudStorageR)
library(bigQueryR)


## set the default Google Cloud Storage bucket (named after the project id)
gcs_global_bucket("mindful-path-205008")

gcs_auth()

## custom upload function to ignore quotes and column headers
f <- function(input, output) {
  write.table(input, sep = ",", col.names = FALSE, row.names = FALSE, 
              quote = FALSE, file = output, qmethod = "double")}

## upload files to Google Cloud Storage
gcs_upload(res, name = "res.csv", object_function = f, bucket = "gsc_backup")
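To actually land the data in BigQuery, as the title promises, one option is to point bigQueryR at the CSV now sitting in the bucket. A sketch, assuming a dataset and table already exist on your project (gsc_data and index_status are hypothetical names):

```r
library(bigQueryR)

bqr_auth()
bqr_global_project("mindful-path-205008")

# Load the CSV straight from Google Cloud Storage into a BigQuery table
# (dataset and table names are hypothetical placeholders)
bqr_upload_data(datasetId    = "gsc_data",
                tableId      = "index_status",
                upload_data  = "gs://gsc_backup/res.csv",
                sourceFormat = "CSV")
```

This keeps the transfer server-side: BigQuery reads the object directly from the bucket instead of streaming rows from your machine.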