Data Science Stack with MongoDB and RStudio
Building up an easy data science platform with RStudio Server on top of your MongoDB
Winston Chen – Lead Software Engineer
What does Fliptop do?
• Predictive lead scoring, using data science
  – Pull opportunity/lead/contact data from CRM
  – Aggregate company data and social data from various data sources and the internet
  – Over 3000 signals
  – Build conversion/revenue model
  – Predict lead conversion and revenue
Our Platform Stack
• Java/Scala
• Liftweb
• JMS/Storm
• MongoDB/MySQL
Our Machine Learning Stack
• Python
• NumPy/SciPy/pandas
• Bottle (RESTful server)
So, where is R then?
• Problem:
  – Data is stored in MongoDB
    • Sales Lead Data
    • Sales Opportunity Data
    • Sales Contact Data
  – It’s hard to view/digest/process data on the fly using the MongoDB console
    • (X) Text processing for insight extraction?
    • (X) Prototype cool machine learning algorithms on the fly?
• Solution:
  – R and RStudio Server
    • Why not Scala?
    • Why not Python/IPython?
MongoDB Console & Query
RStudio Server
Pull MongoDB data into an R data frame
• rmongodb (https://github.com/gerald-lindsly/rmongodb)
Transform into an R data frame
1 – Get the total count of your data set
2 – Construct Vectors for each column
3 – Loop through the cursor and insert values
Where are my apply functions? Too bad. We are using a Mongo cursor :P
4 – Go into a BSON sub-block to extract data (optional)
5 – Construct the data frame and return it
You can get the full example code here: http://goo.gl/tlyyXp
We now have a data frame, built from MongoDB BSON, to play with.
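The five steps above can be sketched roughly as follows. This is not the code from the slides (that lives behind the link above); it is a minimal illustration that assumes the rmongodb package is installed and a MongoDB server is reachable on localhost. The namespace `crm.leads`, the field names `email`, `score`, and `company.name`, and the helper names are all hypothetical. It also assumes `mongo.bson.to.list` returns sub-documents as nested lists.

```r
library(rmongodb)  # assumes the rmongodb package is installed

# Return x, or NA when the field is missing from a document.
field_or_na <- function(x) if (is.null(x)) NA else x

# Hypothetical helper: copy three illustrative fields of every document
# in `ns` ("database.collection") into a data frame.
leads_to_data_frame <- function(mongo, ns) {
  # 1 - total count, so the column vectors can be preallocated
  n <- mongo.count(mongo, ns)

  # 2 - one vector per column
  email   <- vector("character", n)
  score   <- vector("numeric", n)
  company <- vector("character", n)

  # 3 - loop through the cursor and fill the vectors
  cursor <- mongo.find(mongo, ns)
  i <- 0
  while (mongo.cursor.next(cursor)) {
    i <- i + 1
    doc <- mongo.bson.to.list(mongo.cursor.value(cursor))
    email[i] <- field_or_na(doc[["email"]])
    score[i] <- field_or_na(doc[["score"]])
    # 4 - (optional) reach into a sub-document
    company[i] <- field_or_na(doc[["company"]][["name"]])
  }
  mongo.cursor.destroy(cursor)

  # 5 - construct the data frame and return it
  data.frame(email, score, company, stringsAsFactors = FALSE)
}

# Usage: only runs the query when the connection succeeded.
mongo <- mongo.create(host = "127.0.0.1")
if (mongo.is.connected(mongo)) {
  leads <- leads_to_data_frame(mongo, "crm.leads")
  str(leads)
  mongo.destroy(mongo)
}
```

Preallocating the vectors from the count avoids growing them one element at a time inside the loop, which is the main reason for step 1.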
This is NOT a BIG DATA Stack
• It takes around 1 min to process 900 MB+ of BSON from Mongo.
• NOT a big data stack
  – Data should fit into RAM
• Most of the data in the business world is not big anyway.
• It works fine for us (m1.large machine in AWS)
  – CRM data is never big, not even after we pull in 3000+ additional signals.
  – The term ‘Big Data’ is seriously overrated; ‘Data Science’, however, is the key term here.
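A quick way to check the "fits into RAM" condition is to measure the data frame once it is loaded. The sketch below uses base R only. The stand-in data frame, the 7.5 GiB figure (the memory of an m1.large instance), and the 50% headroom factor are assumptions for illustration, not numbers from the talk.

```r
# Stand-in data; in practice this would be the data frame built from MongoDB.
df <- data.frame(
  score     = runif(1e5),
  converted = sample(c(TRUE, FALSE), 1e5, replace = TRUE)
)

# Size of the data frame in memory, in MiB.
size_mb <- as.numeric(object.size(df)) / 1024^2

# Assumed budget: half of 7.5 GiB, leaving headroom because R often
# copies objects when they are modified.
budget_mb <- 7.5 * 1024 * 0.5

fits <- size_mb < budget_mb
cat(sprintf("data frame: %.1f MiB, budget: %.0f MiB, fits: %s\n",
            size_mb, budget_mb, fits))
```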
@Fliptop, we now use RStudio for
• Data insight extraction
• Algorithm prototyping
If you REALLY want BIG Data
• Look into: HDFS + Pig/Hive + Hue (any other suggestions from the audience?)
Q&A
• Winston Chen
  – Personal Blog: http://winston.attlin.com/
  – Twitter: @wingchen83
  – winston@fliptop.com
• Fliptop is hiring Data Scientists. Please email: winston@fliptop.com
