We currently use Redshift as our data warehouse, and we're very happy with it. However, we now have a requirement to run machine learning against the data in the warehouse. Given the volume of data involved, I'd ideally run the computation where the data lives rather than move the data around, but this doesn't seem possible with Redshift. I've looked at MADlib, but it is not an option because Redshift does not support UDFs (which MADlib requires). I'm currently looking at moving the data over to EMR and processing it with the Apache Spark machine learning library (or maybe H2O, or Mahout, or whatever). So my questions are:
- is there a better way?
- if not, how should I make the data accessible to Spark? The options I've identified so far are: use Sqoop to load it into HDFS; use DBInputFormat; or export from Redshift to S3 (UNLOAD) and have Spark read it from there. What are the pros and cons of these approaches (and any others) when using Spark?
Note that this is offline batch learning, but we'd like it to run as fast as possible so that we can iterate on experiments quickly.
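To make the S3 option concrete, here is a minimal sketch of the export step I have in mind. It only builds the Redshift `UNLOAD` statement as a string; the helper name `build_unload_sql`, the table, the bucket path, and the IAM role ARN are all hypothetical placeholders, and I'm assuming role-based authorization and pipe-delimited gzip output.

```python
def build_unload_sql(query, s3_prefix, iam_role_arn, delimiter="|"):
    """Build a Redshift UNLOAD statement (hypothetical helper, illustration only).

    UNLOAD takes the SELECT as a quoted string, so any single quotes inside
    the query are doubled to keep the statement valid.
    """
    escaped_query = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped_query}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        # PARALLEL ON writes one or more files per slice, which Spark can
        # then read as separate input splits.
        f"DELIMITER '{delimiter}' GZIP ALLOWOVERWRITE PARALLEL ON;"
    )


# Placeholder values: none of these names come from our real setup.
sql = build_unload_sql(
    query="SELECT user_id, feature_a, label FROM events WHERE region = 'EU'",
    s3_prefix="s3://example-bucket/ml-export/events_",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftUnloadRole",
)
print(sql)
```

The resulting statement would be run against Redshift with any SQL client, after which Spark on EMR would presumably read the files under that S3 prefix as delimited text. Whether this beats Sqoop or DBInputFormat for throughput is exactly what I'm unsure about.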