1. Overview
This user guide is based on Apache Hadoop 2.7.3 and Apache Spark 2.4.0. It guides users through the steps required to deploy and launch Apache Hadoop and Apache Spark applications on the Aziz supercomputer.
2. Features
- High performance design with the DDN parallel file system for storage, using a similar block size.
- Support for interactive and batch mode in PBS with minimal configuration requirements.
- Smart installation and uninstallation scripts with built-in service startup and shutdown management.
- On-the-fly node installation and configuration for Hadoop. These nodes are returned to the PBS pool as soon as the Hadoop job completes.
- HDFS over InfiniBand and parallel file system support.
- On-demand installation and configuration.
- Usable storage of up to 3.1 Petabytes.
3. Setup Instructions
3.1 Prerequisites
- Create an empty directory in the DDN file system. By default, the DDN directory for Aziz users would be
/ddn/data/{Department}/{User ID}/{Temporary_Directory}
- Configure JAVA_HOME and the Hadoop environment variables. Without them, you cannot run Java and Hadoop commands. The following table explains each variable and its function.
JAVA_HOME* | The home directory for Java (the Java Development Kit) |
HADOOP_HOME | Path of the Hadoop home directory |
HADOOP_CONF_DIR | Path of the Hadoop configuration directory |
PATH* | The list of directories the system searches for a given command |
The following command sets these variables according to the installation script and the default configuration. (If you prefer alternative configuration settings, or have a path to a newer version of Java, ignore the following command and configure the variables yourself; a manual sketch is given after the module command below.)
module load /app/utils/modules/jdk-9.0.4
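If you prefer to set the variables manually rather than (or in addition to) loading the module, the sketch below shows one possible way. The DDN path and the /tmp/hadoop-$USER location are taken from the examples later in this guide, the JDK path is a placeholder, and the etc/hadoop layout is the standard Hadoop 2.x default; adjust all of them to match your actual installation.
# Create an empty working directory on the DDN file system (example path)
mkdir -p /ddn/data/test
# Point JAVA_HOME at your JDK installation (placeholder path)
export JAVA_HOME=/path/to/jdk
# The appendices suggest the installer places Hadoop under /tmp/hadoop-$USER
export HADOOP_HOME=/tmp/hadoop-$USER
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# Make the java and hadoop commands available on the command line
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH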
3.2 Launching PBS job
The instructions in this document can be used to run Hadoop and Spark applications on Aziz interactively or in batch mode. This document focuses on launching the Apache Hadoop and Spark environments interactively. Example PBS job files are provided in Appendix A and Appendix B for reference.
3.2.1 Executing PBS job in interactive mode
qsub -I -l select={nodes_num}:ncpus=24:mpiprocs=1 -l walltime=04:00:00 -q thin
3.2.2 Explanation of PBS job command options
qsub | PBS command to submit jobs to a queue |
-I | Interactive Job. Remove this option to run jobs in batch mode |
-l | Resource Specifications |
-q | Specify the destination queue (for example, thin or fat; see the FAQ on compute node memory) |
3.2.3 Explanation of the select statement in the command from section 3.2.1
nodes_num | Number of nodes required for the job |
ncpus | Number of CPUs to be allocated for the job. For Hadoop and Spark jobs, this value should be 24. |
mpiprocs | The number of MPI processes to be allocated for the job. For Hadoop and Spark jobs, this value should be 1. |
walltime | Specify the required time for the job to complete (format: HH:MM:SS, where HH = hours, MM = minutes, SS = seconds) |
3.2.4 An Example job to run a three-node cluster for 2 hours
We are requesting a three-node cluster for 2 hours using the following command.
Note that we are using the thin queue, as 96 GB is sufficient for most jobs.
qsub -I -l select=3:ncpus=24:mpiprocs=1 -l walltime=02:00:00 -q thin
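To run the same request in batch mode, remove the -I option and submit a PBS job file instead (see Appendix A and Appendix B for complete examples). Assuming the job file is saved as hadoop_job.pbs (a name chosen here for illustration):
qsub hadoop_job.pbs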
3.3 Running the cluster installer script
3.3.1 Path of the cluster installer script
The following is the path of the cluster installer script. The same script is used for both Apache Hadoop and Apache Spark cluster setups.
/app/common/bigdata/exec/install
3.3.2 Cluster installer script options
-i | Install and configure a cluster. Available options: hadoop or spark |
-i hadoop | Install and configure Apache Hadoop Cluster |
-i spark | Install and configure Apache Spark Cluster |
-s | Start services. If this option is not specified, the setup tool will only perform installation and configuration, and the user must manually start all services on the cluster, including formatting the HDFS NameNode. When specified, this option formats the HDFS NameNode and starts all cluster services. |
-p | Specify the path of the DDN directory. Note that this directory should be empty. Hadoop stores metadata and data blocks in this directory. This directory is not user readable, except for the log files. Please specify only the DDN directory in this option; otherwise, unexpected failures and performance issues may occur. |
-u | Uninstall Apache Hadoop or Apache Spark. When specified, this option stops all Apache Hadoop and Apache Spark services and then uninstalls them from the Master and all Slave nodes. |
3.3.3 Example commands to set up a cluster
Example 1: Install Apache Hadoop Cluster and start Services
# /app/common/bigdata/exec/install -i hadoop -s -p /ddn/data/test
Example 2: Install Apache Spark Cluster and start Services
# /app/common/bigdata/exec/install -i spark -s -p /ddn/data/test
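After the installer reports success, it can be worth confirming that the cluster is actually up before submitting work. The commands below are standard Hadoop/YARN utilities (not part of the installer) and assume the Hadoop bin directory is on your PATH, as in the appendices.
# List the Java daemons running on the current node (NameNode, ResourceManager, etc.)
jps
# Show HDFS capacity and the number of live DataNodes
hdfs dfsadmin -report
# List the NodeManagers registered with YARN
yarn node -list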
3.3.4 Stopping Services and Uninstalling Cluster
Note: This is a very important step. If it is not performed, subsequent commands will fail, and the processes on the compute nodes will have to be terminated manually, which is a tedious task given the size of the cluster. In addition, the installation and configuration consume considerable storage on the compute nodes, and that space will not be freed.
The following command will stop all services started during installation and uninstall the application from Master and all Slave nodes.
# /app/common/bigdata/exec/install -u
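To confirm that the cleanup succeeded, you can check that no Hadoop or Spark daemons are left running on the node. This is only a suggested sanity check, not part of the uninstall script.
# jps ships with the JDK loaded earlier; after a successful uninstall it should list only Jps itself
jps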
Frequently Asked Questions
Is this implementation of Hadoop/Spark different from the one available on the Apache website?
Although the implementation is a little different from the standard installation, you will not find any difference in the architecture, classes, or functionality of this implementation compared to the standard one.
Is it okay to copy any existing applications from the PC/Workstation or another cluster to Aziz Supercomputer?
Yes.
Is there a particular queue to execute Hadoop jobs?
No, there are no special queues for Hadoop jobs. You can launch a Hadoop job as a regular job on the Aziz supercomputer.
How much memory is available on compute nodes?
We have two types of compute nodes: 'thin' and 'fat'. Thin nodes have 96 GB of memory, and fat nodes have 256 GB. Both thin and fat nodes have 24 CPU cores per node.
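If a job needs more memory than a thin node offers, a fat node can be requested by changing the queue in the qsub command. The example below assumes the fat nodes are reachable through a queue named fat, mirroring the thin queue used elsewhere in this guide; check the available queues on Aziz before relying on this name.
qsub -I -l select=1:ncpus=24:mpiprocs=1 -l walltime=02:00:00 -q fat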
Appendix A
PBS job file for the Apache Hadoop built-in wordcount example
#!/bin/bash
#PBS -N wordcount
#PBS -l select=1:ncpus=24
#PBS -l walltime=02:00:00
#PBS -q thin

module load /app/utils/modules/jdk-9.0.4
# Install and configure a Hadoop cluster and start its services
/app/common/bigdata/exec/install -i hadoop -s -p /ddn/data/test
export PATH=/tmp/hadoop-$USER/bin:$PATH
# Copy the input file into HDFS
hdfs dfs -mkdir /input
hdfs dfs -copyFromLocal big_text_file.txt /input
# Run the built-in wordcount example
yarn jar /tmp/hadoop-$USER/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /output
# Copy the results back to the home directory and uninstall the cluster
hdfs dfs -copyToLocal /output ~/
/app/common/bigdata/exec/install -u
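Once the job completes, the hdfs dfs -copyToLocal step places the results under ~/output. With the default single reducer, the word counts typically end up in a file named part-r-00000, so a quick way to inspect them is:
head ~/output/part-r-00000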
Appendix B
PBS Job file for Apache Spark wordcount example
#!/bin/bash
#PBS -N wordcount
#PBS -l select=1:ncpus=24
#PBS -l walltime=02:00:00
#PBS -q thin

module load /app/utils/modules/jdk-9.0.4
# Install and configure a Spark cluster and start its services
/app/common/bigdata/exec/install -i spark -s -p /ddn/data/test
export PATH=/tmp/hadoop-$USER/bin:/tmp/spark-$USER/bin:$PATH
# Copy the input file into HDFS
hdfs dfs -mkdir /input
hdfs dfs -copyFromLocal big_text_file.txt /input
# Write the wordcount driver to a local Scala script
cat << EOF > wordcount.scala
val text = sc.textFile("/input/big_text_file.txt")
val counts = text.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
counts.collect
System.exit(0)
EOF
# Run the script in the Spark shell, then clean up the cluster
spark-shell -i wordcount.scala
/app/common/bigdata/exec/install -u
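Because counts.collect only prints the results to standard output, they end up in the PBS output file for the job, which PBS names after the #PBS -N directive (wordcount.o followed by the job ID). After the job finishes, the counts can be viewed with, for example:
cat wordcount.o*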