
Module: Introduction to PySpark

Chapter: Are you query-ious?

In this snippet I am testing grouping using only PySpark functions, with no SQL expressions.

# Testing groupBy with spark functions only
query = "FROM flights SELECT *"

all_flights = spark.sql(query)

# Get the air time in hours by origin and flight
flights10 = (
    all_flights
    .select("*", (all_flights.air_time.cast("int") / 60).alias("air_time_hrs"))
    .groupBy("origin", "flight")
    .sum("air_time_hrs")
)

# Show the results
flights10.show()

In this snippet, by contrast, I am testing the same grouping with the derived column written as a SQL expression, passed through expr().

# Testing groupBy with SQL expressions
from pyspark.sql.functions import expr

# Don't change this query
query = "FROM flights SELECT *"

# Get the air time in hours by origin and flight
flights10 = (
    spark.sql(query)
    .select("*", expr("(cast(air_time as int) / 60) as air_time_hrs"))
    .groupBy("origin", "flight")
    .sum("air_time_hrs")
)

# The same result can be obtained by selecting only the columns that are needed
flights10 = (
    spark.sql(query)
    .select("origin", "flight", expr("(cast(air_time as int) / 60) as air_time_hrs"))
    .groupBy("origin", "flight")
    .sum("air_time_hrs")
)

# Show the results
flights10.show()