Hadoop and Spark

1. Overview

This user guide is based on Apache Hadoop 2.7.3 and Apache Spark 2.4.0. It guides users through the steps of deploying and launching Apache Hadoop and Apache Spark applications on the Aziz supercomputer.


2. Features

  • High-performance design with the DDN parallel file system for storage, using a similar block size.
  • Support for interactive and batch modes in PBS with minimal configuration requirements.
  • Smart installing and uninstalling scripts with built-in service startup and shutdown management.
  • On-the-fly node installation and configuration for Hadoop. These nodes are returned to the PBS pool as soon as the Hadoop job completes.
  • HDFS over InfiniBand and parallel file system support.
  • On-demand installation and configuration.
  • Usable storage of up to 3.1 Petabytes.

3. Setup Instructions

3.1 Prerequisites

  • Create an empty directory in the DDN file system. By default, the DDN directory for Aziz users is
/ddn/data/{Department}/{User ID}/{Temporary_Directory}
  • Configure the JAVA_HOME and Hadoop environment variables. Without these, you cannot run Java or Hadoop commands. The following table explains each variable and its function.
  JAVA_HOME*         The home directory for Java (Java Development Kit)
  HADOOP_HOME        Path of the Hadoop home directory
  HADOOP_CONF_DIR    Hadoop configuration directory
  PATH*              List of directories the system searches for a given command

  * Important variables to configure

The following command sets the variables according to the installation script and the default configuration. (If you are aware of better configuration settings, or have a path to a more recent version of Java, skip the following command and configure the variables yourself.)

module load /app/utils/modules/jdk-9.0.4
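
If the module is not available, or you prefer to set the variables yourself, the following is a minimal sketch. The JDK path is a placeholder, and the Hadoop paths are assumptions based on the installer's default location used in Appendix A; adjust them to your setup.

# Minimal sketch of setting the variables manually (paths are placeholders or assumptions)
export JAVA_HOME=/path/to/jdk-9.0.4                 # placeholder: location of your JDK
export HADOOP_HOME=/tmp/hadoop-$USER                # assumed: installer's default location (see Appendix A)
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop      # standard Hadoop 2.x configuration directory
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH   # make the java and hadoop commands available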

3.2 Launching PBS job

The instructions in this document can be used to run Hadoop and Spark applications on Aziz interactively or in batch mode. This document focuses on launching the Apache Hadoop and Spark environments interactively. Example PBS job files are provided in Appendix A and Appendix B for reference.

3.2.1 Executing PBS job in interactive mode
 qsub -I -l select=<nodes_num>:ncpus=24:mpiprocs=1 -l walltime=04:00:00 -q thin
3.2.2 Explanation of PBS job command options
  qsub    PBS command to submit jobs to a queue
  -I      Interactive job. Remove this option to run jobs in batch mode
  -l      Resource specifications
  -q      Specify the destination queue. It can be any of the following:
            • thin : Select nodes with 96 GB of memory
            • fat  : Select nodes with 256 GB of memory
3.2.3 Explanation of select statement of the command in section 3.2.1
  nodes_num   Number of nodes required for the job
  ncpus       Number of CPUs to be allocated for the job. For Hadoop and Spark jobs, this value should be 24.
  mpiprocs    Number of MPI processes to be allocated for the job. For Hadoop and Spark jobs, this value should be 1.
  walltime    The time required for the job to complete (format: HH:MM:SS, where HH=Hours, MM=Minutes, SS=Seconds)
3.2.4 An Example job to run a three-node cluster for 2 hours

We are requesting a three-node cluster for 2 hours using the following command.
Note that we are using the thin queue, as 96 GB is sufficient for most jobs.

 qsub -I -l select=3:ncpus=24:mpiprocs=1 -l walltime=02:00:00 -q thin
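
To submit the same request in batch mode, remove the -I option and place the commands in a PBS job script instead (complete examples are given in Appendix A and Appendix B). A minimal sketch, assuming a job file named hadoop_job.pbs:

 qsub hadoop_job.pbs     # submit the batch job script (file name is hypothetical)
 qstat -u $USER          # check the status of your submitted jobs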

3.3 Running the cluster installer script

3.3.1 Path of the cluster installer script

Following is the path of the cluster installer script. The script is the same for Apache Hadoop and Apache Spark cluster setups.

/app/common/bigdata/exec/install
3.3.2 Cluster installer script options
  -i          Install and configure. Available options: hadoop or spark
  -i hadoop   Install and configure an Apache Hadoop cluster
  -i spark    Install and configure an Apache Spark cluster
  -s          Start services. If this option is not specified, the setup tool only performs installation and
              configuration, and the user must manually start all services on the cluster, including
              formatting the HDFS Namenode (a sketch of the manual start-up sequence is given after the
              examples in section 3.3.3). This option performs the following actions:
                • Format the HDFS Namenode
                • Start the HDFS service on the Master and Slave nodes
                • Start the YARN service on the Master and Slave nodes
                • Start the Spark service on the Master and Slave nodes if a Spark install is selected
  -p          Specify the path of the DDN directory. This directory must be empty. Hadoop stores metadata
              and data blocks in it, and it is not user readable except for the log files. Specify only a
              DDN directory in this option; otherwise, unexpected failures and performance issues may occur.
  -u          Uninstall Apache Hadoop or Apache Spark. This option stops all Apache Hadoop and Apache Spark
              services and then uninstalls them from the Master and all Slave nodes. It performs the
              following actions:
                • Stop Apache Spark services if started
                • Stop Apache Hadoop services if started
                • Stop HDFS services if started
                • Uninstall Hadoop and Spark from the Slave nodes
                • Uninstall Hadoop and Spark from the Master node
3.3.3 Example commands to set up a cluster

Example 1: Install Apache Hadoop Cluster and start Services

# /app/common/bigdata/exec/install -i hadoop -s -p /ddn/data/test

Example 2: Install Apache Spark Cluster and start Services

# /app/common/bigdata/exec/install -i spark -s -p /ddn/data/test
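
If you run the installer without the -s option (see section 3.3.2), the services must be started manually. The following is a minimal sketch of the standard Hadoop/Spark start-up sequence; the paths are assumptions based on the installer's default layout under /tmp on the master node.

# Minimal sketch of a manual start-up when -s is omitted (paths are assumptions)
/tmp/hadoop-$USER/bin/hdfs namenode -format     # format the HDFS Namenode (first start only)
/tmp/hadoop-$USER/sbin/start-dfs.sh             # start HDFS on the Master and Slave nodes
/tmp/hadoop-$USER/sbin/start-yarn.sh            # start YARN on the Master and Slave nodes
/tmp/spark-$USER/sbin/start-all.sh              # start Spark services (Spark installs only)
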
3.3.4 Stopping Services and Uninstalling Cluster

Note: This is a very important step. If it is not performed, subsequent commands will fail, and the processes on the compute nodes will have to be terminated manually, which is a tedious task considering the size of the cluster. In addition, the installation and configuration take up considerable storage on the compute nodes, and this space will not be freed.

The following command will stop all services started during installation and uninstall the application from Master and all Slave nodes.

# /app/common/bigdata/exec/install -u
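
After uninstalling, you can optionally verify that no Hadoop or Spark processes are left on the node with the JDK's jps tool, which lists running Java processes:

 jps     # should no longer list NameNode, DataNode, ResourceManager, NodeManager or Spark daemons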

Frequently Asked Questions

Is this implementation of Hadoop/Spark different from the one available on the Apache website?

The implementation is a little different from the standard installation, but you will not find any difference in the architecture, classes, or functionality of this implementation compared to the standard ones.

Is it okay to copy existing applications from a PC/workstation or another cluster to the Aziz supercomputer?

Yes.
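
For example, an existing application JAR can be copied to your home directory on Aziz with scp; the file, user, and host names below are hypothetical:

 scp my_application.jar <username>@<aziz_login_node>:~/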

Is there a particular queue to execute Hadoop jobs?

No, there are no special queues for Hadoop jobs. You can launch a Hadoop job as a regular job on the Aziz supercomputer.

How much memory is available on compute nodes?

We have two types of compute nodes: 'thin' and 'fat'. Thin nodes have 96 GB of memory, and fat nodes have 256 GB. Both thin and fat nodes have 24 CPU cores each.
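
For memory-intensive jobs, request the fat nodes by changing the queue name, for example:

 qsub -I -l select=1:ncpus=24:mpiprocs=1 -l walltime=02:00:00 -q fat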


Appendix A

PBS Job File for Apache Hadoop built-in Example

#!/bin/bash
#PBS -N wordcount
#PBS -l select=1:ncpus=24
#PBS -l walltime=02:00:00
#PBS -q thin

module load /app/utils/modules/jdk-9.0.4

/app/common/bigdata/exec/install -i hadoop -s -p /ddn/data/test 

export PATH=/tmp/hadoop-$USER/bin:$PATH

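# Stage the local file big_text_file.txt into HDFS under /input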
hdfs dfs -mkdir /input
hdfs dfs -copyFromLocal big_text_file.txt /input

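# Run the built-in wordcount example; the results are written to /output in HDFS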
yarn jar /tmp/hadoop-$USER/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /output

hdfs dfs -copyToLocal /output ~/

/app/common/bigdata/exec/install -u

Appendix B

PBS Job file for Apache Spark wordcount example

#!/bin/bash
#PBS -N wordcount
#PBS -l select=1:ncpus=24
#PBS -l walltime=02:00:00
#PBS -q thin

module load /app/utils/modules/jdk-9.0.4
/app/common/bigdata/exec/install -i spark -s -p /ddn/data/test 

export PATH=/tmp/hadoop-$USER/bin:/tmp/spark-$USER/bin:$PATH

hdfs dfs -mkdir /input
hdfs dfs -copyFromLocal big_text_file.txt /input

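# Write a simple wordcount driver to wordcount.scala and run it with spark-shell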
cat > wordcount.scala << EOF
val text = sc.textFile("/input/big_text_file.txt")
val counts = text.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
counts.collect
System.exit(0)
EOF
spark-shell -i wordcount.scala

/app/common/bigdata/exec/install -u
