Processing data stored in Redshift

Question

We're currently using Redshift as our data warehouse, and we're very happy with it. However, we now have a requirement to run machine learning against the data in the warehouse. Given the volume of data involved, I'd ideally run the computation where the data lives rather than moving the data around, but this doesn't seem possible with Redshift. I've looked at MADlib, but it is not an option because Redshift does not support UDFs (which MADlib requires). I'm currently looking at moving the data over to EMR and processing it with the Apache Spark machine learning library (or maybe H2O, or Mahout, or whatever). So my questions are:

Is there a better way?
If not, how should I make the data accessible to Spark? The options I've identified so far are: use Sqoop to load it into HDFS, use DBInputFormat, or do a Redshift export to S3 and have Spark read it from there. What are the pros and cons of these approaches (and any others) when using Spark?

Note that this is offline batch learning, but we'd like it to run as fast as possible so that we can iterate on experiments quickly.

Could you tell us how many times you need to read the data, and how big your data set is?
– Majid Darabi, Commented Dec 9, 2014 at 1:27
We need to be able to perform ad hoc analyses on the data, so effectively an unlimited number of reads. We have multiple fact tables, ranging in size from tens of millions to billions of records.
– deanj, Commented Dec 10, 2014 at 9:40

Josh Rosen · Accepted Answer · 2015-09-13 00:15:24Z

If you'd like to query Redshift data in Spark and you're using Spark 1.4.0 or newer, check out spark-redshift, a library which supports loading data from Redshift into Spark SQL DataFrames and saving DataFrames back to Redshift. If you're querying large volumes of data, this approach should perform better than JDBC because it will be able to unload and query the data in parallel. If you plan to run many different ML jobs on your Redshift data, then consider using spark-redshift to export it out of Redshift and save it to S3 in an efficient file format, such as Parquet.
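As a rough illustration of the approach above, here is a minimal PySpark-style sketch that loads a Redshift table through spark-redshift and writes it out as Parquet. It assumes the spark-redshift package and a Redshift JDBC driver are already on the Spark classpath and that S3 credentials are configured. The helper names, the table name, the JDBC URL, and the S3 paths are all hypothetical placeholders, not values from the question.

```python
def redshift_read_options(jdbc_url, table, temp_dir):
    """Collect the reader options for a spark-redshift load.

    All three values are placeholders supplied by the caller:
    jdbc_url  - JDBC URL of the Redshift cluster (hypothetical)
    table     - name of the Redshift table to read (hypothetical)
    temp_dir  - S3 location spark-redshift uses for unloaded data
    """
    return {"url": jdbc_url, "dbtable": table, "tempdir": temp_dir}


def export_table_to_parquet(sql_context, jdbc_url, table, temp_dir, parquet_path):
    """Load one Redshift table into a DataFrame and save it as Parquet.

    sql_context is assumed to be an existing SQLContext (Spark 1.4+)
    created elsewhere; this function does not start Spark itself.
    """
    reader = sql_context.read.format("com.databricks.spark.redshift")
    for key, value in redshift_read_options(jdbc_url, table, temp_dir).items():
        reader = reader.option(key, value)
    df = reader.load()

    # Save once in a columnar format so later ML jobs can read from S3
    # directly instead of unloading from Redshift each time.
    df.write.parquet(parquet_path)
    return df
```

A later job could then read the saved copy with `sql_context.read.parquet(parquet_path)` rather than going back to Redshift, which is the point of exporting once when many ML jobs will run over the same data.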

Disclosure: I'm one of the authors of spark-redshift.

Yuri Levinsky · Accepted Answer · 2014-12-09 15:22:45Z

You can run Spark alongside your existing Hadoop cluster by launching it as a separate service on the same machines. To access Hadoop data from Spark, use an hdfs:// URL (typically hdfs://:9000/path, but you can find the right URL on your Hadoop NameNode's web UI). Alternatively, you can set up a separate cluster for Spark and still have it access HDFS over the network; this will be slower than disk-local access, but may not be a concern if you are still running in the same local area network (e.g. you place a few Spark machines on each rack that you have Hadoop on). You can use the AWS Data Pipeline service or just a copy command to move data from Redshift to HDFS. That said, you may be able to use Redshift itself for machine learning, depending on the tool you're using or the algorithm you're implementing. Bear in mind that Redshift is less a database and more a data store, with all the pros and cons that implies.
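To make the hdfs:// access concrete, here is a small sketch in Python (PySpark style). It assumes an existing SparkContext is passed in. The NameNode host name and the path used in the usage note are hypothetical, and port 9000 is only the typical value mentioned above, so check the NameNode web UI for the real one.

```python
def hdfs_url(namenode_host, path, port=9000):
    """Build an hdfs:// URL from a NameNode host, a port, and a path.

    namenode_host is a placeholder; the default port of 9000 is only
    the commonly used value and may differ on a given cluster.
    """
    if not path.startswith("/"):
        path = "/" + path
    return "hdfs://{}:{}{}".format(namenode_host, port, path)


def count_lines(spark_context, url):
    """Read a text dataset from HDFS and count its lines.

    spark_context is assumed to be an existing SparkContext;
    textFile returns an RDD with one element per line.
    """
    return spark_context.textFile(url).count()
```

For example, `count_lines(sc, hdfs_url("namenode.example.internal", "/warehouse/facts"))` would count the lines of a dataset previously copied from Redshift into HDFS, given a SparkContext named `sc`.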
