Big data and data science overview

• Oxford English Dictionary:
  ◦ “An all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications”
• Defined by volume, variety, velocity
• 2008 computer scientist predictions:
  ◦ Big Data will “transform the activities of companies, scientific researchers, medical practitioners, and our nation’s defense and intelligence operations”
• According to the New York Times:
  ◦ Big data science “typically means applying the tools of artificial intelligence, like machine learning, to vast new troves of data beyond that captured in standard databases”

• Wider (more variables per observation)
• Longer (more observations)
• Wider and longer
• Complex subgroupings within wider or longer sets
• Many correlations
• Noisy
• Missing data

• Computational challenges of storage and statistical program memory
  ◦ R holds its working data in memory; on a laptop the workspace is limited to about 2 GB unless more RAM is added.
  ◦ Algorithm computing time grows according to scaling rules, many of which grow faster than linearly (polynomial or even exponential). For example, under quadratic scaling, if 2 GB takes 4 minutes, then 4 GB takes 16 minutes…
• Statistical challenges from data structure
  ◦ Wide data violates many statistical assumptions.
  ◦ Correlations among predictors also violate statistical assumptions and create problems for the underlying linear algebra calculations.
  ◦ Potential for lots of informative missing data that can’t be imputed using existing statistical methods.
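The runtime example above (2 GB in 4 minutes, 4 GB in 16 minutes) quadruples when the data doubles, which is what quadratic growth looks like. The sketch below is a minimal illustration, assuming runtime is proportional to data size raised to a fixed exponent; the function name `projected_minutes` and the choice of exponents are hypothetical, not taken from the slides.

```python
def projected_minutes(base_gb, base_minutes, new_gb, exponent):
    """Project runtime, assuming time grows as (data size) ** exponent."""
    return base_minutes * (new_gb / base_gb) ** exponent


# Baseline from the slide: 2 GB takes 4 minutes. Project the time for 4 GB.
for exponent, label in [(1, "linear"), (2, "quadratic"), (3, "cubic")]:
    minutes = projected_minutes(2, 4, 4, exponent)
    print(f"{label:>9}: 4 GB takes about {minutes:g} minutes")
```

Only the quadratic case reproduces the 16-minute figure. An algorithm with truly exponential scaling would slow down far more sharply than this as the data grows.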

• More computing resources
  ◦ Expensive
  ◦ Cloud computing
  ◦ Does not solve statistical issues posed by big data
• New statistical methods
  ◦ Rely on a new set of tools from computer science
  ◦ Work around limitations of existing multivariate data analysis methods
  ◦ Don’t always scale as big data grows
    • Still have computational issues
    • Need for larger and larger training sets for good performance

• Hadoop
  ◦ Open-source software for storage and processing of big data across computer cores/clusters
  ◦ Compatible with existing statistical software
• MapReduce
  ◦ Distributed computing strategy for big data processing and analyses
  ◦ Compute the problem in parallel and combine the partial answers into a final answer for shorter compute times
• SQL/NoSQL
  ◦ Database query languages and tools (SQL for relational databases, NoSQL for non-relational data stores) for:
    • Database construction/modifications
    • Pulling pieces of data for further analyses/reporting
• R
  ◦ Free open-source software with existing machine learning algorithms and a coding environment to create and test new machine learning algorithms
• Simulations
  ◦ Use data structure and relationship rules to create a dataset with a pre-specified structure
  ◦ Allows for testing and validation of new algorithms against datasets with known answers
  ◦ Useful for comparing existing algorithms with new algorithms
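The MapReduce idea above (compute pieces in parallel, then combine the answers) can be sketched with a word count. This is a single-process illustration in plain Python, not Hadoop code: the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` and the sample text are all hypothetical. In a real cluster the map calls would run on separate machines.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_chunks):
    """Shuffle: group the emitted values by key across all chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values into one final answer."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each chunk is mapped independently, so this step could run in parallel.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle_phase(mapped))


# Made-up input, split into two chunks as if stored on two nodes.
chunks = ["big data is big", "data science uses big data"]
counts = word_count(chunks)
print(counts)
```

The map step never looks at more than one chunk, which is what lets the work be spread across machines. Only the shuffle and reduce steps need to see results from every chunk.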

• Statistics
  ◦ Hypothesis testing (parametric and nonparametric) and experimental design
  ◦ Generalized linear models
  ◦ Longitudinal, time series, and survival models
  ◦ Bayesian methods
• Mathematics
  ◦ Multivariable calculus
  ◦ Linear algebra
  ◦ Probability theory
  ◦ Optimization
  ◦ Graph theory/discrete math
  ◦ Real analysis/topology
• Machine learning
  ◦ Technically, considered a branch of statistics
  ◦ Supervised, unsupervised, and semi-supervised models
  ◦ Serve to extend statistical models and relax assumptions on data
  ◦ Includes algorithms from topological data analysis and network analysis

• Data scientist: a professional who blends several different areas of expertise to draw insights from disparate data sources (particularly big data) so that inferences can be made about specific problems/decisions within the field of application
• Data science: a blend of statistical, machine learning, computer science, mathematical, and domain knowledge used to leverage data for decision-making in that domain (business, medical, social media…)

• Discuss problem with leadership to understand the problem and how results might be used.
  ◦ Providing a predictive algorithm that performs well but doesn’t provide insight into the problem might not be useful.
  ◦ There may be related items that leadership hasn’t considered, items that can enrich the project.
• Define data that needs to be pulled.
  ◦ May exist in database.
  ◦ May need to find elsewhere.
• Pull and clean data.
  ◦ Examine for errors or bias.
  ◦ Deal with missing data.
• Perform analyses and interpret output.
  ◦ Can be supervised (fit to outcome) or unsupervised (exploratory).
  ◦ Typically involves visualization of important results.
• Compile summary of actionable insights for leadership.
  ◦ Simplification
  ◦ Business value (no point in doing analysis if it can’t be implemented!)
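The "pull and clean data" step above can be sketched in a few lines. This is a minimal illustration on made-up records, assuming missing values are marked with `None`; the field names and the helpers `missing_counts`, `complete_cases`, and `mean_impute` are hypothetical. It shows two common options for missing data: dropping incomplete rows, or filling gaps with the observed mean.

```python
from statistics import mean

# Made-up records standing in for a database pull; None marks a missing value.
records = [
    {"id": 1, "age": 34, "spend": 120.0},
    {"id": 2, "age": None, "spend": 80.0},
    {"id": 3, "age": 51, "spend": None},
    {"id": 4, "age": 29, "spend": 95.5},
]


def missing_counts(rows, fields):
    """Count missing values per field, as a first check of data quality."""
    return {f: sum(1 for r in rows if r[f] is None) for f in fields}


def complete_cases(rows, fields):
    """Option 1: keep only rows with no missing value in the given fields."""
    return [r for r in rows if all(r[f] is not None for f in fields)]


def mean_impute(rows, field):
    """Option 2: fill missing values in one field with the observed mean."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [dict(r, **{field: fill if r[field] is None else r[field]}) for r in rows]


print(missing_counts(records, ["age", "spend"]))
print(len(complete_cases(records, ["age", "spend"])), "complete rows")
print(mean_impute(records, "age"))
```

Both options assume that whether a value is missing tells you nothing about the value itself. When missingness is informative, neither dropping rows nor mean imputation is safe, which is why the step also calls for examining the data for bias.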

• Mathematical/Statistical Background
  ◦ Graduate degree, typically in mathematics/statistics, computer science, or engineering
  ◦ Training in machine learning and algorithm design
  ◦ Experience with R and SAS statistical languages/programs
• Computer Science Background
  ◦ Python/MATLAB/other high-level computing languages
  ◦ Hadoop/MapReduce concepts
  ◦ SQL or NoSQL coding for database extraction/management
  ◦ Experience with structured or unstructured data
  ◦ Data mining/algorithm design
• Field of Application Expertise
  ◦ Intellectual curiosity
  ◦ Understanding of the industry of application (marketing, medical, finance…)
  ◦ Communication skills to relate findings to non-technical leaders

• From a quick Indeed.com search:
  ◦ Allstate Insurance
  ◦ Sprint
  ◦ Twitter
  ◦ APS Healthcare
  ◦ XOR Security
  ◦ LinkedIn
  ◦ IBM
  ◦ Intel
  ◦ Roche Pharmaceuticals
  ◦ Amazon
  ◦ Capital One

• According to NewVantage and others:
  ◦ 2016 revenue gained from data science is estimated at $130.1 billion.
  ◦ This is expected to grow to $203 billion by 2020.
• Individual company results vary according to:
  ◦ Team talent and expertise
  ◦ Data collected (and quality of data)
  ◦ Competitor strengths in data science
• Current and projected shortages of those with analytics talent will impact the market.
  ◦ Hubs of data science are emerging outside California: Boston, New York, Austin, Chicago, Jacksonville, Tampa, Charlotte, Atlanta…
  ◦ Across industries: healthcare, tech, finance, energy…
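The two revenue figures above imply an average yearly growth rate. The sketch below works it out, assuming both figures measure the same quantity and that growth compounds once a year over the four years from 2016 to 2020; the function name `implied_annual_growth` is hypothetical.

```python
def implied_annual_growth(start_value, end_value, years):
    """Compound annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the slide, in billions of dollars.
rate = implied_annual_growth(130.1, 203.0, 2020 - 2016)
print(f"Implied compound annual growth: {rate:.1%}")
```

The calculation is a back-of-the-envelope reading of the projection, not a figure reported by the source.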
