k-means algorithm implementation on Hadoop
Athens University of Economics and Business
Dept. of Management Science and Technology
Prof. Damianos Chatziantoniou
Lamprini Koutsokera | lkoutsokera@gmail.com
Stratos Gounidellis | stratos.gounidellis@gmail.com
BDSMasters
Tools 2
Running Hadoop on Ubuntu Linux (Single-Node Cluster) [1]

Prerequisites
• jdk-8uversion-linux-x64.tar.gz (Download)
• Directory modification: % tar zxvf jdk-8uversion-linux-x64.tar.gz

Hadoop system user
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

Disabling IPv6 (/etc/sysctl.conf)
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Configuring SSH
Create an RSA key pair with an empty password, then enable SSH access to the local machine for the hduser:
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
hduser@ubuntu:~$ ssh localhost

References: [1], [2] 3
Running Hadoop on Ubuntu Linux (Single-Node Cluster) [2]

Hadoop installation
$ cd /usr/local
$ sudo tar xzf hadoop-2.7.3.tar.gz
$ sudo mv hadoop-2.7.3 hadoop
$ sudo chown -R hduser:hadoop hadoop

Ownership and permissions of the Hadoop temp directory
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp

hadoop-env.sh
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"

conf/core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

References: [1], [2] 4
Running Hadoop on Ubuntu Linux (Single-Node Cluster) [3]

hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications
      can be specified when the file is created. The default is used if
      replication is not specified at create time.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>

yarn-site.xml
<configuration>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>2</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

References: [1], [2] 5
Running Hadoop on Ubuntu Linux (Single-Node Cluster) [4]

Formatting the HDFS filesystem via the NameNode
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format

Starting your single-node cluster
hduser@ubuntu:~$ /usr/local/hadoop/sbin/start-all.sh

Stopping your single-node cluster
hduser@ubuntu:~$ /usr/local/hadoop/sbin/stop-all.sh

References: [1], [2] 6
Setting up a Single Node Cluster - Hadoop Distributed File System (HDFS)

To reach the Hadoop web UI we browse to http://localhost:50070/, even though core-site.xml sets fs.defaultFS to hdfs://localhost:9000: the web UI and the HDFS RPC endpoint listen on different ports. Our HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. 7
From hardware/software to algorithms 8
Data Generation
• Isotropic Gaussian blobs
• 2,000,000 points
• centers = [[25, 25], [-1, -1], [-25, -25]]
• cluster_std = 3.5
9
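A minimal sketch of this generation step using scikit-learn's make_blobs (the output file name points.txt and the random seed are illustrative, not from the deck):

import numpy as np
from sklearn.datasets import make_blobs

# Three isotropic Gaussian blobs, 2,000,000 points in total.
centers = [[25, 25], [-1, -1], [-25, -25]]
points, labels = make_blobs(n_samples=2000000, centers=centers,
                            cluster_std=3.5, random_state=0)

# One point per line, so the file can later be streamed to the MapReduce job.
np.savetxt("points.txt", points, fmt="%.6f")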
K-means++: Calculation of initial centroids. Reference: [3] 10
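As a reference for this slide, here is a minimal sketch of k-means++ seeding [3], assuming points is the array generated above (the function name kmeans_pp_init is illustrative; on large data the seeding is typically run on a sample):

import numpy as np

def kmeans_pp_init(points, k, rng=None):
    rng = rng or np.random.default_rng(0)
    # First center: chosen uniformly at random from the data.
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest chosen center.
        d2 = np.min(((points[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Sample the next center with probability proportional to D(x)^2.
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.asarray(centers)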
K-means: Clustering algorithm. References: [4], [5] 11
K-means using Map Reduce

Do
  Map: the input is a data point, and the k centers are broadcast to every
       mapper. Emit the closest of the k centers for the input point.
  Reduce: the input is one of the k centers together with all data points
          having it as their closest center. Compute the new center from
          those points.
Until none of the new centers has changed

Reference: [6] 12
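A runnable single-machine sketch of this loop, with NumPy arrays standing in for the broadcast and the shuffle (on Hadoop, each pass of the loop is one MapReduce job; the sketch assumes no cluster ever becomes empty):

import numpy as np

def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # "Map": assign every point to its closest broadcast center.
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        closest = d2.argmin(axis=1)
        # "Reduce": recompute each center as the mean of its assigned points.
        new_centroids = np.array([points[closest == c].mean(axis=0)
                                  for c in range(len(centroids))])
        # "Until": stop once no center has changed.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids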
K-means using Map Reduce [1]

k = 3; the centroids are broadcast to every map function, and the map tasks run in parallel over the input points.

Mapper output (key, value):
[centroid1, point1]
[centroid2, point3]
[centroid3, point10]
[centroid2, point45]
. . .
[centroidx, pointx]

Reference: [6] 13
K-means using Map Reduce [2]

Combiner input (key, values):
[centroid1, point1, point4]
[centroid2, point3, point7, point9]
[centroid3, point10]
[centroid2, point45, point73]
. . .
[centroid3, pointx, . . .]

Combiner output (key, values):
[centroid1, partialsum1]
[centroid2, partialsum1]
[centroid3, partialsum1]
[centroid2, partialsum2]
. . .
[centroidx, partialsumx]

Reference: [6] 14
K-means using Map Reduce [3]

Reducer input (key, values):
[centroid1, partialsum1, . . .]
[centroid2, partialsum1, . . .]
[centroid3, partialsum1, . . .]

Reducer output (key, value):
[centroid1, centroid1']
[centroid2, centroid2']
[centroid3, centroid3']

Reference: [6] 15
From algorithms to coding 16
Python Coding [1]

The initial task of the project is to generate a set of more than one million data points to be used later as input for the k-means clustering algorithm. A Python script generates three isotropic Gaussian blobs for clustering, centered at the points [25, 25], [-1, -1] and [-25, -25].

The silhouette score is a useful criterion for determining the proper number of clusters: a silhouette close to 1 implies the datum is in an appropriate cluster, while a silhouette close to −1 implies the datum is in the wrong cluster. A second Python script calculates the silhouette score for numbers of clusters ranging from 2 to 6, as sketched below. 17
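A minimal sketch of that silhouette check, assuming points is the generated array (the subsample size is an assumption: silhouette computation is quadratic in the number of points, so scoring all 2,000,000 would be impractical):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Score a random subsample rather than the full data set.
sample = points[np.random.choice(len(points), 10000, replace=False)]
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(sample)
    print(k, silhouette_score(sample, labels))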
Python Coding [2] 18

To implement the k-means algorithm on Hadoop, mrjob is used: a Python package that allows writing multi-step MapReduce jobs in pure Python and running them on a Hadoop cluster. In our case mrjob runs on a single-node cluster.
• The mapper emits each data point together with the cluster to which it belongs.
• The combiner emits partial sums of batches of data points belonging to the same cluster.
• The reducer emits the new centroid of each cluster.
• If the centroids remain unchanged the algorithm terminates; otherwise, the steps are repeated from the beginning.
A separate Python script drives the k-means job on Hadoop. Before running k-means, the initial centroids are computed with the k-means++ algorithm proposed in 2007 by Arthur and Vassilvitskii [3]; a sketch of the mrjob step follows below.
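A minimal sketch of one k-means pass as an mrjob job (the class name MRKMeansStep and the --centroids file option are assumptions, not the authors' actual code):

from mrjob.job import MRJob
import numpy as np

class MRKMeansStep(MRJob):
    # One pass of k-means: reads points line by line, emits the new centroids.

    def configure_args(self):
        super(MRKMeansStep, self).configure_args()
        # File holding the current centroids, shipped to every task.
        self.add_file_arg('--centroids')

    def mapper_init(self):
        self.centroids = np.loadtxt(self.options.centroids)

    def mapper(self, _, line):
        point = np.array(line.split(), dtype=float)
        # Key: index of the closest centroid; value: (point, count).
        cid = int(((self.centroids - point) ** 2).sum(axis=1).argmin())
        yield cid, (point.tolist(), 1)

    def combiner(self, cid, values):
        # Partial sums per cluster, so less data is shuffled to the reducers.
        points, counts = zip(*values)
        yield cid, (np.sum(points, axis=0).tolist(), sum(counts))

    def reducer(self, cid, values):
        points, counts = zip(*values)
        # New centroid: mean of all points assigned to this cluster.
        yield cid, (np.sum(points, axis=0) / sum(counts)).tolist()

if __name__ == '__main__':
    MRKMeansStep.run()

The driver script re-runs this job, e.g. python kmeans_step.py -r hadoop --centroids centroids.txt points.txt, rewriting the centroids file after each pass until the centroids stop changing (the file names here are illustrative).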
Running the Code [1] 19
Running the Code [2] kmeans 20
Running the Code [3] 21
References
[1]: Noll, M. Running Hadoop on Ubuntu Linux (Single-Node Cluster). [online] Available at: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ [Accessed 29 Mar. 2017].
[2]: Stackoverflow.com (2017). Error launching job using mrjob on Hadoop. [online] Available at: http://stackoverflow.com/questions/25358793/error-launching-job-using-mrjob-on-hadoop [Accessed 29 Mar. 2017].
[3]: Arthur, D. and Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, January 7-9, 2007. New York: ACM, pp. 1027-1035.
[4]: Nlp.stanford.edu. K-means. [online] Available at: http://nlp.stanford.edu/IR-book/html/htmledition/k-means-1.html [Accessed 15 Mar. 2017].
[5]: Home.deib.polimi.it (n.d.). Clustering - K-means. [online] Available at: http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html [Accessed 8 Mar. 2017].
[6]: Shim, K. (2012). MapReduce Algorithms for Big Data Analysis. VLDB Conference.
Lamprini Koutsokera | lkoutsokera@gmail.com
Stratos Gounidellis | stratos.gounidellis@gmail.com
BDSMasters 22
