# Spark with Python programming
## Introduction
[Apache Spark](https://spark.apache.org/) is one of the hottest new trends in the technology domain. It is the framework with probably the highest potential to realize the fruit of the marriage between Big Data and Machine Learning.

It runs fast (up to 100x faster than traditional [Hadoop MapReduce](https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm) due to in-memory operation), offers robust, distributed, fault-tolerant data objects (called [RDDs](https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm)), and integrates beautifully with the world of machine learning and graph analytics through supplementary packages like [MLlib](https://spark.apache.org/mllib/) and [GraphX](https://spark.apache.org/graphx/).

Spark is implemented on top of [Hadoop/HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) and written mostly in [Scala](https://www.scala-lang.org/), a functional programming language that, like Java, runs on the JVM. In fact, Scala needs a recent Java installation on your system. However, for most beginners, Scala is not the first language they learn when venturing into the world of data science. Fortunately, Spark provides a wonderful Python integration, called **PySpark**, which lets Python programmers interface with the Spark framework and learn how to manipulate data at scale and work with objects and algorithms over a distributed file system.

In this article, we will learn the basics of Python integration with Spark. There are a lot of concepts (and they are constantly evolving), so we focus only on the fundamentals with a few simple examples. Readers are encouraged to build on these and explore more on their own.

## Short history
Apache Spark started as a research project at the UC Berkeley AMPLab in 2009 and was open-sourced in early 2010. The idea was to build a cluster management framework that could support different kinds of cluster computing systems. Many of the ideas behind the system were presented in various research papers over the years.
After being released, Spark grew into a broad developer community and moved to the Apache Software Foundation in 2013. Today, the project is developed collaboratively by a community of hundreds of developers from hundreds of organizations.

## Spark in a nutshell
One thing to remember is that Spark is not a programming language like Python or Java. It is a general-purpose distributed data processing engine, suitable for use in a wide range of circumstances. It is particularly useful for big data processing both at scale and at high speed.

Application developers and data scientists generally incorporate Spark into their applications to rapidly query, analyze, and transform data at scale. Some of the tasks most frequently associated with Spark include:
- ETL and SQL batch jobs across large data sets (often terabytes in size),
- processing of streaming data from IoT devices and nodes, data from various sensors, financial and transactional systems of all kinds, and
- machine learning tasks for e-commerce or IT applications.

At its core, Spark builds on top of the Hadoop/HDFS framework for handling distributed files. It is mostly implemented in Scala, a functional language that runs on the JVM. There is a core Spark data processing engine, but on top of that, there are many libraries developed for SQL-type query analysis, distributed machine learning, large-scale graph computation, and streaming data processing. Multiple programming languages are supported by Spark in the form of easy interface libraries: Java, Python, Scala, and R.

## Spark uses the MapReduce paradigm
The basic idea of distributed processing is to divide the data into small, manageable chunks (including some filtering and sorting), bring the computation close to the data, i.e. use small nodes of a large cluster for specific jobs, and then recombine the results. The dividing portion is called the ‘Map’ action and the recombination is called the ‘Reduce’ action. Together, they make up the famous ‘MapReduce’ paradigm, introduced by Google around 2004 in the paper “MapReduce: Simplified Data Processing on Large Clusters”.

For example, if a file has 100 records to be processed, 100 mappers can run together to process one record each. Or maybe 50 mappers can run together to process two records each. After all the mappers complete processing, the framework shuffles and sorts the results before passing them on to the reducers. A reducer cannot start while a mapper is still in progress. All the map output values that have the same key are assigned to a single reducer, which then aggregates the values for that key.

Here is a simple illustration of counting characters in a list of strings using the MapReduce principle.
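
The sketch below walks through the map, shuffle, and reduce phases by hand in plain Python; the input strings and the printed counts are purely illustrative.
```python
from collections import defaultdict
from functools import reduce

data = ["apple", "banana", "cherry"]

# Map phase: turn every string into (character, 1) pairs
mapped = [(ch, 1) for s in data for ch in s]

# Shuffle phase: group all values that share the same key (character)
groups = defaultdict(list)
for ch, one in mapped:
    groups[ch].append(one)

# Reduce phase: aggregate the values for each key
counts = {ch: reduce(lambda a, b: a + b, ones) for ch, ones in groups.items()}
print(counts)  # e.g. {'a': 4, 'p': 2, 'l': 1, 'e': 2, ...}
```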

Spark follows this idea with its powerful data structures: the RDD (Resilient Distributed Dataset) and the DataFrame.

## Set up Python for Spark
If you’re already familiar with Python and libraries such as Pandas and Numpy, then PySpark is a great extension/framework to learn in order to create more scalable, data-intensive analyses and pipelines that utilize the power of Spark in the background.

The exact process of installing and setting up a PySpark environment (on a standalone machine) is somewhat involved and can vary slightly depending on your system and environment. The goal is to get your regular Jupyter data science environment working with Spark in the background using PySpark.

Read this article for step-by-step details on the setup process.
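
As a quick sketch (assuming you have installed the `pyspark` and `findspark` packages, e.g. via pip, and have a working Java installation), a minimal notebook setup might look like this:
```python
import findspark
findspark.init()  # locate the local Spark installation and add it to sys.path

from pyspark import SparkContext
sc = SparkContext(master="local[*]")  # use all available local cores
print(sc.version)
# Note: only one SparkContext can be active at a time, so call sc.stop()
# before creating a different one later in the article.
```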

Alternatively, you can use the Databricks platform for practicing Spark. The company was founded by the original creators of Spark and offers an excellent ready-to-launch environment for distributed analysis with Spark.

But the idea is always the same. You are distributing (and replicating) your large dataset in small, fixed chunks over many nodes. You then bring the compute engine close to them so that the whole operation is parallelized, fault-tolerant, and scalable.

By working with PySpark and Jupyter notebooks, you can learn all these concepts without spending anything on the AWS or Databricks platforms. You can also easily interface with SparkSQL and MLlib for database manipulation and machine learning.
It will be much easier to start working with real-life large clusters if you have internalized these concepts beforehand!

## RDD and SparkContext
Many Spark programs revolve around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. The SparkContext resides in the driver program and manages the distributed data over the worker nodes through the cluster manager. The good thing about using PySpark is that all this complexity of data partitioning and task management is handled automatically behind the scenes, and the programmer can focus on the specific analytics or machine learning job at hand.

There are two ways to create RDDs:
- parallelizing an existing collection in your driver program, or
- referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat (a quick sketch of this second approach follows below).
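
The second approach is typically a one-liner; the sketch below assumes an existing `SparkContext` named `sc` and uses a purely hypothetical HDFS path.
```python
# Each element of the resulting RDD is one line of the text file
log_lines = sc.textFile("hdfs:///tmp/sample_logs/server.log")  # hypothetical path
```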

For illustration with a Python-based approach, we will give examples of the first type here. We can create a simple Python array of 20 random integers (between 0 and 9), using Numpy's `random.randint()`, and then create an RDD object as follows,
```python
from pyspark import SparkContext
import numpy as np

sc = SparkContext(master="local[4]")   # use 4 local cores
lst = np.random.randint(0, 10, 20)     # 20 random integers between 0 and 9
A = sc.parallelize(lst)                # distribute the data as an RDD
```
**Note the '4' in the argument. It denotes 4 computing cores (on your local machine) to be used for this `SparkContext` object**. If we check the type of the RDD object, we get the following,
```python
type(A)
>> pyspark.rdd.RDD
```
The opposite of parallelization is collection (with `collect()`), which brings all the distributed elements back to the driver node.
```python
A.collect()
>> [4, 8, 2, 2, 4, 7, 0, 3, 3, 9, 2, 6, 0, 0, 1, 7, 5, 1, 9, 7]
```
But `A` is no longer a simple Numpy array. We can use the `glom()` method to check how the partitions are created.
```python
A.glom().collect()
>> [[4, 8, 2, 2, 4], [7, 0, 3, 3, 9], [2, 6, 0, 0, 1], [7, 5, 1, 9, 7]]
```
Now stop the SparkContext, reinitialize it with 2 cores, and see what happens when you repeat the process.
```python
sc.stop()
sc = SparkContext(master="local[2]")
A = sc.parallelize(lst)
A.glom().collect()
>> [[4, 8, 2, 2, 4, 7, 0, 3, 3, 9], [2, 6, 0, 0, 1, 7, 5, 1, 9, 7]]
```
The RDD is now distributed over two chunks, not four! **You have learned about the first step in distributed data analytics, i.e. controlling how your data is partitioned into smaller chunks for further processing.**
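
You can also request a specific number of partitions explicitly: `parallelize()` accepts an optional second argument, and `getNumPartitions()` reports the result. A minimal sketch,
```python
B = sc.parallelize(lst, 5)   # request 5 partitions explicitly
B.getNumPartitions()
>> 5
```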

## Examples of basic operations with RDD
### Count the elements
```python
A.count()
>> 20
```
### The first element (`first`) and the first few elements (`take`)
```python
A.first()
>> 4
A.take(3)
>> [4, 8, 2]
```
### Removing duplicates using `distinct`
**NOTE**: This operation requires a **shuffle** in order to detect duplication across partitions. So, it is a slow operation. Don't overdo it.
```python
A_distinct = A.distinct()
A_distinct.collect()
>> [4, 8, 0, 9, 1, 5, 2, 6, 7, 3]
```

### To sum all the elements, use the `reduce` method
Note the use of a lambda function here,
```python
A.reduce(lambda x, y: x + y)
>> 80
```
### Or the direct `sum()` method
```python
A.sum()
>> 80
```
### Finding the maximum element with `reduce`
```python
A.reduce(lambda x, y: x if x > y else y)
>> 9
```
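As with `sum()`, RDDs also provide a direct `max()` method (and a `min()`), so the `reduce` above is mainly for illustration,
```python
A.max()
>> 9
```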
### Finding the longest word in a blob of text
```python
words = 'These are some of the best Macintosh computers ever'.split(' ')
wordRDD = sc.parallelize(words)
wordRDD.reduce(lambda w, v: w if len(w) > len(v) else v)
>> 'computers'
```
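
To tie this back to the MapReduce example from earlier, here is a quick sketch of the same character count expressed with Spark's `flatMap`, `map`, and `reduceByKey` transformations, which perform the map, shuffle, and reduce phases for you (reusing the `wordRDD` defined above),
```python
charCounts = (wordRDD
              .flatMap(lambda w: list(w))        # map: split every word into characters
              .map(lambda ch: (ch, 1))           # emit (character, 1) pairs
              .reduceByKey(lambda a, b: a + b))  # shuffle + reduce: sum counts per character
charCounts.collect()                             # list of (character, count) pairs; order may vary
```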