Extending Spark's Ingestion: Build Your Own Java Data Source with Jean Georges Perrin
The document outlines a presentation by Jean Georges Perrin on extending Apache Spark's ingestion capabilities using Java to create custom data sources. It discusses different data ingestion solutions, including leveraging existing libraries and building new ones, along with practical coding examples. The presentation targets software and data engineers looking to enhance Spark's versatility with non-standard data formats.
The speaker discusses extending Apache Spark's ingestion capabilities using Java, introducing themselves and engaging with the audience on their programming experiences.
Identifies challenges with existing data formats like CSV and JSON, encouraging users to explore Spark packages and custom solutions.
Step-by-step guidance on writing a custom data source in Java, including reading metadata and building Spark applications.
Covers the mechanics of importing photo metadata into Spark, including schema definition and data extraction processes.
Focuses on the relation structure between Spark and Java libraries, emphasizing data conversion and efficient schema management.
Provides resources for further learning, concluding insights on building custom data sources, and encouraging audience engagement.
And What About You?
• Who is a Java programmer?
• Who swears by Scala?
• Who has tried to build a custom data source?
• In Java?
• Who has succeeded?
Solution #1
• Look in the Spark Packages library: https://spark-packages.org/
• 48 data sources listed (some dups)
• Some with source code
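If one of those packages fits, it plugs in the same way as the built-in formats. As a rough illustration only (the coordinates, format name, and option below belong to the community spark-xml package and are given purely as an example, not as part of this talk's code):

  // Illustration: using an existing package from spark-packages.org.
  // The artifact is added at submit time, for example:
  //   spark-submit --packages com.databricks:spark-xml_2.11:0.4.1 ...
  Dataset<Row> xmlDf = spark.read()
      .format("xml")                  // short name registered by the package
      .option("rowTag", "photo")      // package-specific option (example value)
      .load("/path/to/xml/files");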
Code
  Dataset<Row> df = spark.read()          // Spark session
      .format("exif")                     // short name for your data source
      .option("recursive", "true")        // options
      .option("limit", "80000")
      .option("extensions", "jpg,jpeg")
      .load(importDirectory);             // where to start
Short name
• Optional: you can specify a full class name instead
• An alias to a class name
• Needs to be registered
• Not recommended during the development phase
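The registration step itself is not detailed on the slide. One common way to expose a short name in Spark 2.x is to implement DataSourceRegister and declare the class through the service loader; the sketch below is an assumption about how that could look, not the repository's actual code (in practice the provider class shown later would implement this interface itself):

  // Sketch only: advertising the "exif" alias via Spark's DataSourceRegister mechanism.
  import org.apache.spark.sql.sources.DataSourceRegister;

  public class ExifShortNameRegistration implements DataSourceRegister {
    @Override
    public String shortName() {
      return "exif";            // usable as spark.read().format("exif")
    }
  }

  // Registration file (classpath resource):
  //   META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
  // containing one line: the fully qualified name of the data source class.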
Project
• Information for registration of the data source
• Data source's short name
• Data source
• The app
• Relation
• Utilities
• An existing library you can embed/link to
Library code
• A library you already use, developed, or acquired
• No need for anything special
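The EXIF library itself is not reproduced in the deck. To make the later slides easier to follow, here is an assumed minimal shape of the two pieces the relation relies on, PhotoMetadata and ExifUtils (two separate source files); the field names are guesses based on the columns used later (GeoX, GeoZ), and the real classes in the repository may differ:

  // Assumed sketch of the embedded library's surface.
  public class PhotoMetadata implements java.io.Serializable {
    private String fileName;
    private Double geoX;   // exposed later as column "GeoX"
    private Double geoZ;   // exposed later as column "GeoZ" (used to sort by altitude)

    public String getFileName() { return fileName; }
    public void setFileName(String fileName) { this.fileName = fileName; }
    public Double getGeoX() { return geoX; }
    public void setGeoX(Double geoX) { this.geoX = geoX; }
    public Double getGeoZ() { return geoZ; }
    public void setGeoZ(Double geoZ) { this.geoZ = geoZ; }
  }

  public class ExifUtils {
    // Reads one photo and extracts its EXIF metadata into a bean.
    public static PhotoMetadata processFromFilename(String absolutePath) {
      PhotoMetadata photo = new PhotoMetadata();
      photo.setFileName(absolutePath);
      // a real implementation would parse the EXIF tags here
      return photo;
    }
  }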
Data Source Code
  public class ExifDirectoryDataSource implements RelationProvider {   // provides a relation
    @Override
    public BaseRelation createRelation(
        SQLContext sqlContext,                                         // needed by Spark, passed on to the relation
        Map<String, String> params) {                                  // all the options
      java.util.Map<String, String> javaMap =
          mapAsJavaMapConverter(params).asJava();                      // Scala-to-Java conversion (sorry!)
      ExifDirectoryRelation br = new ExifDirectoryRelation();          // the relation to be exploited
      br.setSqlContext(sqlContext);
      ...
      br.setPhotoLister(photoLister);
      return br;
    }
  }
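The elided part of createRelation is where the reader options are consumed. Purely as an assumption of what that step could look like (the PhotoLister type and its setters are hypothetical names, not necessarily those of the repository):

  // Hypothetical sketch of turning the reader options into a configured lister.
  // Option keys match the ones passed from spark.read() earlier; the .load(path)
  // argument arrives in the map under the key "path".
  String path = javaMap.get("path");
  boolean recursive = Boolean.parseBoolean(javaMap.getOrDefault("recursive", "false"));
  int limit = Integer.parseInt(javaMap.getOrDefault("limit", "-1"));
  String extensions = javaMap.getOrDefault("extensions", "jpg");

  PhotoLister photoLister = new PhotoLister();   // hypothetical helper class
  photoLister.setPath(path);
  photoLister.setRecursive(recursive);
  photoLister.setLimit(limit);
  photoLister.setExtensions(extensions);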
Relation
• Plumbing between
  – Spark
  – Your existing library
• Mission
  – Returns the schema as a StructType
  – Returns the data as an RDD<Row>
Relation
TableScan is the key; other, more specialized Scans are available.
  public class ExifDirectoryRelation extends BaseRelation
      implements Serializable, TableScan {
    private static final long serialVersionUID = 4598175080399877334L;

    @Override
    public RDD<Row> buildScan() {            // the data…
      ...
      return rowRDD.rdd();
    }

    @Override
    public StructType schema() {             // the schema: may/will be called first
      ...
      return schema.getSparkSchema();
    }

    @Override
    public SQLContext sqlContext() {         // SQL context
      return this.sqlContext;
    }
    ...
  }
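For reference, the more specialized scans mentioned above live in org.apache.spark.sql.sources and differ only in the buildScan signature (shown here as seen from Java); they let Spark push column pruning and filters down into the data source:

  // TableScan:            RDD<Row> buildScan()
  // PrunedScan:           RDD<Row> buildScan(String[] requiredColumns)
  // PrunedFilteredScan:   RDD<Row> buildScan(String[] requiredColumns, Filter[] filters)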
Relation – Schema
A utility function introspects a Java bean and turns it into a "Super" schema, which contains the required StructType for Spark.
  @Override
  public StructType schema() {
    if (schema == null) {
      schema = SparkBeanUtils.getSchemaFromBean(PhotoMetadata.class);
    }
    return schema.getSparkSchema();
  }
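SparkBeanUtils belongs to the example repository, not to Spark itself. Conceptually, the bean-to-StructType step can be done with plain JavaBeans introspection; the sketch below is a simplified assumption of the idea (the real utility also keeps extra metadata in its "Super" schema and honors @SparkColumn):

  // Simplified, assumed sketch: building a StructType from a bean's properties.
  import java.beans.Introspector;
  import java.beans.PropertyDescriptor;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.spark.sql.types.DataType;
  import org.apache.spark.sql.types.DataTypes;
  import org.apache.spark.sql.types.StructField;
  import org.apache.spark.sql.types.StructType;

  public class SimpleBeanSchemaBuilder {
    public static StructType schemaFromBean(Class<?> beanClass) throws Exception {
      List<StructField> fields = new ArrayList<>();
      for (PropertyDescriptor prop
          : Introspector.getBeanInfo(beanClass, Object.class).getPropertyDescriptors()) {
        Class<?> type = prop.getPropertyType();
        DataType sparkType;
        if (type == int.class || type == Integer.class) {
          sparkType = DataTypes.IntegerType;
        } else if (type == double.class || type == Double.class) {
          sparkType = DataTypes.DoubleType;
        } else if (type == java.sql.Timestamp.class) {
          sparkType = DataTypes.TimestampType;
        } else {
          sparkType = DataTypes.StringType;   // fallback for this sketch
        }
        fields.add(DataTypes.createStructField(prop.getName(), sparkType, true));
      }
      return DataTypes.createStructType(fields);
    }
  }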
Relation – Data
  @Override
  public RDD<Row> buildScan() {
    schema();
    List<PhotoMetadata> table = collectData();              // collect the data
    JavaSparkContext sparkContext =
        new JavaSparkContext(sqlContext.sparkContext());
    JavaRDD<Row> rowRDD = sparkContext.parallelize(table)   // creates the RDD by parallelizing the list of photos
        .map(photo -> SparkBeanUtils.getRowFromBean(schema, photo));
    return rowRDD.rdd();
  }

  // Scans the files and extracts EXIF information: the interface to your library…
  private List<PhotoMetadata> collectData() {
    List<File> photosToProcess = this.photoLister.getFiles();
    List<PhotoMetadata> list = new ArrayList<PhotoMetadata>();
    PhotoMetadata photo;
    for (File photoToProcess : photosToProcess) {
      photo = ExifUtils.processFromFilename(photoToProcess.getAbsolutePath());
      list.add(photo);
    }
    return list;
  }
Application
  // Normal imports, no reference to our data source
  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  public class PhotoMetadataIngestionApp {
    public static void main(String[] args) {
      PhotoMetadataIngestionApp app = new PhotoMetadataIngestionApp();
      app.start();
    }

    private boolean start() {
      SparkSession spark = SparkSession.builder()
          .appName("EXIF to Dataset")
          .master("local[*]")                          // local mode
          .getOrCreate();

      Dataset<Row> df = spark.read()                   // classic read
          .format("exif")
          .option("recursive", "true")
          .option("limit", "80000")
          .option("extensions", "jpg,jpeg")
          .load("/Users/jgp/Pictures");

      df = df                                          // standard dataframe API: getting my "highest" photos!
          .filter(df.col("GeoX").isNotNull())
          .filter(df.col("GeoZ").notEqual("NaN"))
          .orderBy(df.col("GeoZ").desc());

      // Standard output mechanism
      System.out.println("I have imported " + df.count() + " photos.");
      df.printSchema();
      df.show(5);
      return true;
    }
  }
Schema, Beans, and Annotations
• SparkBeanUtils.getSchemaFromBean() turns a Bean into a Schema, which can be augmented via the @SparkColumn annotation
• Schema.getSparkSchema() exposes the StructType Spark needs
• SparkBeanUtils.getRowFromBean() turns a Bean into a Row of columns
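The @SparkColumn annotation also comes from the example repository, not from Spark. As an assumed illustration of the idea only (the attribute names are guesses; check the repository for the actual definition), it lets the bean override what plain introspection would infer:

  // Assumed usage sketch of @SparkColumn on a bean getter.
  public class PhotoMetadata implements java.io.Serializable {
    private Double geoZ;

    @SparkColumn(name = "GeoZ", nullable = true)   // hypothetical attributes
    public Double getGeoZ() { return geoZ; }
    public void setGeoZ(Double geoZ) { this.geoZ = geoZ; }
  }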
Mass production
• Easily import from any Java Bean
• Conversion done by utility functions
• Schema is a superset of StructType
Conclusion
• Reuse the Bean-to-schema and Bean-to-data utilities in your project
• Building a custom data source for a REST server or a non-standard format is easy
• No need for a pricey conversion to CSV or JSON
• There is always a solution in Java
• On you: check for parallelism, optimization, extending the schema (order of columns)
Going Further
• Check out the code (fork/like): https://github.com/jgperrin/net.jgp.labs.spark.datasources
• Follow me @jgperrin
• Watch for my Java + Spark book, coming out soon!
• (If you come to the RTP area in NC, USA, come to a Spark meet-up and let's have a drink!)
Go raibh maith agaibh. (Thank you!)
Jean Georges "JGP" Perrin
@jgperrin
Don't forget to rate
More Reading & Resources
• My blog: http://jgp.net
• This code on GitHub: https://github.com/jgperrin/net.jgp.labs.spark.datasources
• Java code on GitHub: https://github.com/jgperrin/net.jgp.labs.spark
Abstract
EXTENDING APACHE SPARK'S INGESTION: BUILDING YOUR OWN JAVA DATA SOURCE
By Jean Georges Perrin (@jgperrin, Oplo)
Apache Spark is a wonderful platform for running your analytics jobs. It has great ingestion features from CSV, Hive, JDBC, etc. However, you may have your own data sources or formats you want to use. Your solution could be to convert your data to a CSV or JSON file and then ask Spark to ingest it through its built-in tools. However, for better performance, we will explore how to build a data source, in Java, to extend Spark's ingestion capabilities. We will first understand how Spark handles ingestion, then walk through the development of this data source plug-in.
Targeted audience: Software and data engineers who need to expand Spark's ingestion capability.
Key takeaways:
• Requirements, needs & architecture – 15%.
• Build the required tool set in Java – 85%.
Session hashtag: #EUdev6