Spark + s3a:// = ❤️

Graham Thomson · Jun 14, 2017

Typically our data science AWS workflows follow this sequence:

  1. Turn on EC2.
  2. Copy data from S3 to the local file system via awscli (see the example after this list).
  3. Code references local data via /path/to/data/.
  4. ???
  5. Profit.
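
Step 2 usually amounts to an awscli copy along these lines (the bucket and local path are placeholders, not from the original workflow):

aws s3 cp s3://some-s3-bucket/clients/client-x/data/ /path/to/data/ --recursive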

However, if the data you need to reference is relatively small (no more than 5 GB or so), you can use s3a:// and stream it directly from S3 into your code.

Say we have this script saved as visits_by_day.py:

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
    # Read the Parquet data directly from S3 via the s3a:// filesystem
    df = spark.read.parquet("s3a://some-s3-bucket/clients/client-x/data…")
    # Count visits per day and print the result
    df.groupBy("visit_day").count().orderBy("visit_day").show()

Then run via spark-submit with the hadoop-aws package:

spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.3 visits_by_day.py
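
The S3A connector also needs AWS credentials to reach the bucket. On an EC2 instance with an IAM instance profile attached it will normally pick them up automatically; otherwise, one option is to pass them as spark.hadoop.* properties, roughly like this (the key values are placeholders):

spark-submit \
  --packages org.apache.hadoop:hadoop-aws:2.7.3 \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY \
  visits_by_day.py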

Or the equivalent in Scala, VisitsByDay.scala:

import org.apache.spark.sql.SparkSession

/**
  * Created by grathoms on 6/14/17.
  */
object VisitsByDay {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    // Read the Parquet data directly from S3 via the s3a:// filesystem
    val df = spark.read.parquet("s3a://some-s3-bucket/clients/client-x/data…")
    // Count visits per day and print the result
    df.groupBy("visit_day").count().orderBy("visit_day").show()
  }
}
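
To run the Scala version, assuming the project is built with sbt, the submit looks roughly like this (the jar path and Scala version are placeholders):

# package the project, then submit the resulting jar
sbt package
spark-submit \
  --class VisitsByDay \
  --packages org.apache.hadoop:hadoop-aws:2.7.3 \
  target/scala-2.11/visits-by-day_2.11-1.0.jar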

Output:

+---------+------+
|visit_day| count|
+---------+------+
| 20170410|208823|
| 20170411|335355|
| 20170412|238535|
| 20170413|102363|
| 20170414|618847|
| 20170415|561687|
| 20170416|146944|
| 20170417|698453|
| 20170418|142700|
| 20170419|343261|
+---------+------+