Typically our data science AWS workflows follow this sequence:

- Turn on EC2.
- Copy data from S3 to the local machine's file system via `awscli` (see the sketch after this list).
- Code references local data via `/path/to/data/`.
- ???
- Profit.
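The copy step is usually a single `aws s3 cp` call. A minimal sketch, where the bucket and paths are placeholders borrowed from the example further down:

    # Pull the client's data down from S3 to the local file system.
    # Bucket name and paths here are placeholders.
    aws s3 cp s3://some-s3-bucket/clients/client-x/data/ /path/to/data/ --recursive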
However, if the data you need to reference is relatively small (roughly 5GB or less), you can use `s3a://` and stream the data directly from S3 into your code.
Say we have the following script as `visits_by_day.py`:
    from pyspark.sql import SparkSession

    if __name__ == '__main__':
        spark = SparkSession.builder.getOrCreate()
        # Read Parquet data directly from S3 via the s3a:// scheme.
        df = spark.read.parquet("s3a://some-s3-bucket/clients/client-x/data…")
        # Count records per visit day.
        df.groupBy("visit_day").count().orderBy("visit_day").show()
Then run it via `spark-submit` with the `hadoop-aws` package:

    spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.3 visits_by_day.py
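If the machine doesn't already supply AWS credentials (for example via an instance profile or environment variables), they can be passed to the S3A filesystem through Spark's Hadoop configuration. A minimal sketch; the key values are placeholders:

    # Placeholder credentials; prefer an instance profile or environment
    # variables over putting keys on the command line.
    spark-submit \
      --packages org.apache.hadoop:hadoop-aws:2.7.3 \
      --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
      --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY \
      visits_by_day.py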
Or in Scala, as `VisitsByDay.scala`:
    import org.apache.spark.sql.SparkSession

    /**
      * Created by grathoms on 6/14/17.
      */
    object VisitsByDay {

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().getOrCreate()
        // Read Parquet data directly from S3 via the s3a:// scheme.
        val df = spark.read.parquet("s3a://some-s3-bucket/clients/client-x/data…")
        // Count records per visit day.
        df.groupBy("visit_day").count().orderBy("visit_day").show()
      }
    }
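To run the Scala version, package it into a jar (for example with `sbt package`) and point `spark-submit` at the main class. The jar path below is hypothetical and depends on your project name and Scala version:

    # Build the jar, then submit it with the hadoop-aws package on the classpath.
    # The jar path is illustrative only.
    sbt package
    spark-submit \
      --class VisitsByDay \
      --packages org.apache.hadoop:hadoop-aws:2.7.3 \
      target/scala-2.11/visits-by-day_2.11-1.0.jar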
Output:
    +---------+------+
    |visit_day| count|
    +---------+------+
    | 20170410|208823|
    | 20170411|335355|
    | 20170412|238535|
    | 20170413|102363|
    | 20170414|618847|
    | 20170415|561687|
    | 20170416|146944|
    | 20170417|698453|
    | 20170418|142700|
    | 20170419|343261|
    +---------+------+