In 45 Min, Set Up Hadoop (Pivotal HD) on a Multi-VM Cluster & Run Test Data

Getting started with Hadoop can take up a lot of time, but it doesn't have to.

Architects, developers, and operations people often want to get an environment up and running quickly, and it helps if that environment is built automatically, is realistic, allows easy experimentation with different configurations, and has a complete set of services.

In this post, I will show you some experimental, unofficial tips on how to do this, and it only takes about 45 minutes (if your downloads don't take forever). From that point, cleaning, changing configuration, and rebuilding the VMs takes less than 20 minutes. We will provide a thorough background, cover the prerequisites, and build the environment with free, public tools. We will also test it with sample data and provide additional insight on architectural elements like IP addresses, users, and provisioning variables. Our approach allows for testing Hadoop applications, experimenting with different Hadoop configurations, changing Hadoop services, and learning the architecture. Since the build is automated, we can also start over easily if something gets messed up.

Overview of Pivotal HD Options

With Pivotal HD, there are two main options. You can start with the Pivotal HD single-node VM. This VM contains all the components included in Pivotal HD and HAWQ as well as tutorials. It is a pre-configured installation and makes it easy for you to learn without having to build a full cluster.

There is a second option: to get the full power of Hadoop, you can use Pivotal HD Community on physical servers or in a virtual environment. This version has a 50-node limit and includes several other components, such as the Command Center. With this version, we can explore a multi-node cluster without needing significant physical resources to deploy it.

When you create a multi-VM Pivotal HD cluster using Pivotal HD Community, there are additional manual steps: you have to create multiple VMs, install the Pivotal Command Center (PCC), and then configure, deploy, and start Pivotal HD (PHD). If you want to modify the environment, you probably have to repeat all of these steps. Instead, we are going to automate the build of a multi-VM (node) Pivotal HD cluster.

Getting Started—Building the Pivotal HD Cluster

We are going to build the environment using Vagrant, an extremely helpful tool for automatically building VM environments. With Vagrant, you can define a multi-VM PHD environment in a single configuration file called a Vagrantfile and materialize the configuration with a single command (vagrant up). Vagrant will create the VMs and run a shell script to install Pivotal Command Center, install Pivotal HD, and start the cluster. At any moment, you can destroy the environment, apply changes, or start it again in just a couple of minutes.
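To make the structure concrete, here is a heavily trimmed, illustrative sketch of what such a Vagrantfile looks like for the pcc machine and one cluster node. The real file described below defines all four machines plus additional settings, so treat this only as a reading aid, not a replacement for it.

# Illustrative sketch only; use the actual Vagrantfile for real runs.
# $phd_provision_script is a shell heredoc defined earlier in the same file
# (hosts entries, NTP, and other settings common to every VM).
Vagrant.configure("2") do |config|
  config.vm.box = "CentOS-6.2-x86_64"
  config.vm.provision :shell, :inline => $phd_provision_script

  config.vm.define :phd1 do |phd1|
    phd1.vm.hostname = "phd1.localdomain"
    phd1.vm.network :private_network, ip: "10.211.55.101"
    phd1.vm.provider :virtualbox do |v|
      v.customize ["modifyvm", :id, "--memory", "1024"]
    end
  end

  config.vm.define :pcc do |pcc|
    pcc.vm.hostname = "pcc.localdomain"
    pcc.vm.network :private_network, ip: "10.211.55.100"
    # Only the Command Center VM runs the installation script for PCC and PHD.
    pcc.vm.provision :shell, :path => "pcc_provision.sh"
  end
end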

We will use a Vagrant configuration file I developed to create the multi-VM cluster. There are also two associated provisioning files that follow the Pivotal HD_v1.0_Guide.pdf instructions for installing Pivotal Command Center and Pivotal HD in the cluster. The first, phd_provision_script, is embedded within the Vagrantfile and defines provisioning settings that are common to all VMs, such as network IPs and NTP. The second, pcc_provision.sh, installs Pivotal Command Center and Pivotal HD on all VMs.

While the Vagrant configuration file sets up VirtualBox VMs, it should also work with VMware Fusion and Workstation, although that requires the commercial (but inexpensive) Vagrant VMware plugin. Our Vagrant configuration creates four CentOS 6.2 virtual machines: pcc, phd1, phd2, and phd3. The pcc machine is used as the Pivotal Command Center host, and the remaining three machines are used for the Pivotal HD cluster. By default, the configuration installs several Hadoop services, including HDFS, YARN, Pig, Zookeeper, HBase, the Greenplum Extension Framework (GPXF), and HAWQ (Pivotal's SQL-on-Hadoop engine). A recent Pivotal HD post provides a good overview of these pieces, along with this post and the related graphic below.

[Figure: Pivotal HD architecture]

Note: Hive is disabled by default. The Pivotal HD VMs are configured with 1024MB of memory each; to enable Hive, you have to increase this to at least 2048MB. Also, the DataLoader, HVE (Hadoop Virtualization Extensions), and USS (Unified Storage Service) are not part of this Vagrant configuration.
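As an illustration (using the same memory-setting pattern the Vagrantfile applies to every VM), raising a cluster node to 2048MB means changing its VirtualBox memory line before provisioning:

# In the Vagrantfile, inside the definition of each VM that should run Hive:
v.customize ["modifyvm", :id, "--memory", "2048"]   # default is "1024"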

Prerequisites and VM Set-Up

From a hardware standpoint, you need 64-bit architecture and at least 8GB of physical memory.

First, we install the latest versions of VirtualBox and Vagrant, and then we add a CentOS 6.2 box:

1. Install VirtualBox v4.2.16 or newer: https://www.virtualbox.org/wiki/Downloads

2. Install Vagrant v1.2.7 or newer: http://downloads.vagrantup.com/tags/v1.2.7

3. Add a CentOS 6.2 x86_64 box to your local Vagrant configuration:

> vagrant box add CentOS-6.2-x86_64 https://s3.amazonaws.com/Vagrant_BaseBoxes/centos-6.2-x86_64-201306301713.box

CentOS takes about 10 minutes to download. If you already have the box file, you can also add it from your local file system, as shown below.
Note: Keep the box name exactly 'CentOS-6.2-x86_64' or the Vagrantfile will not recognize it.
Note: Only CentOS 6.1 or newer is supported.
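For example, adding the box from a local copy looks like this (the path is just a placeholder for wherever you saved the .box file):

> vagrant box add CentOS-6.2-x86_64 /path/to/centos-6.2-x86_64-201306301713.box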

Check to confirm the vagrant box is installed:

> vagrant box list

Installing Pivotal HD Components

This section explains how to: (1) download and uncompress the PHD 1.0.1 CE distribution, (2) copy the Oracle JDK 6 installation binary into the uncompressed folder, (3) add the PHD Vagrant configuration files, and (4) run Vagrant to set up the cluster.

4. Download and uncompress Pivotal HD 1.0.1 (phd_1.0.1.0-19_community.tar.gz). The files are uncompressed into the PHD_1.0.1_CE folder.

> wget "http://bitcast-a.v1.o1.sjc1.bitgravity.com/greenplum/pivotal-sw/phd_1.0.1.0-19_community.tar.gz"
> tar -xzf ./phd_1.0.1.0-19_community.tar.gz
> cd PHD_1.0.1_CE

5. Download the Oracle JDK installer (jdk-6u45-linux-x64-rpm.bin) into the PHD_1.0.1_CE folder.

> wget --cookies=off --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com" "http://download.oracle.com/otn-pub/java/jdk/6u45-b06/jdk-6u45-linux-x64-rpm.bin"

6. Download the files mentioned earlier, Vagrantfile and pcc_provision.sh, into the PHD_1.0.1_CE folder.

> wget "https://gist.github.com/tzolov/6415996/download" -O gist.tar.gz
> tar --strip-components=1 -xzf ./gist.tar.gz

7. Within the PHD_1.0.1_CE folder run Vagrant and wait until the cluster is installed and started.

> vagrant up

Note: The first time you run it, the provisioning script will download PADS-1.1.0-8.tar.gz (i.e., HAWQ). This will take some time. Alternatively, if you have already downloaded PADS-1.1.0-8.tar.gz, just copy it into the PHD_1.0.1_CE folder.

If the installation fails while starting the cluster, run 'vagrant destroy -f' and then 'vagrant up' to try again. When this is done running, Vagrant will have created four CentOS 6.2 virtual machines:

  • pcc (10.211.55.100) is dedicated to the Pivotal Command Center;
  • phd1, phd2, phd3 (10.211.55.10[1..3]) are used for the Pivotal HD cluster and include HDFS, YARN, Pig, Zookeeper, HBase, and HAWQ.
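Before logging in anywhere, a quick check from the host confirms that all four VMs were created and are running; vagrant status is a standard Vagrant command, so nothing here is specific to this setup:

> vagrant status

All four machines (pcc, phd1, phd2, phd3) should be reported as running.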

Testing the Install—Access, Test Data, Service Management

To confirm the set-up, open the Pivotal Command Center Web UI at http://10.211.55.100:5000/status (user: gpadmin, password: gpadmin) and go to the dashboard as shown below.

[Figure: Pivotal Command Center dashboard]

You can also SSH to any of the VMs using the provided user accounts: root/vagrant, vagrant/vagrant, and gpadmin/gpadmin. Within the Pivotal Command Center VM (10.211.55.100), there is an Install and Configuration Manager (ICM) command-line utility, icm_client, you can use to manage the cluster.
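For example, you can open a shell on the Command Center VM through Vagrant, or go straight to a cluster node over the private network; these commands only use the accounts and IP addresses listed above:

> vagrant ssh pcc                    # shell on the Command Center VM as the vagrant user
> ssh vagrant@10.211.55.101          # or log in to phd1 directly (password: vagrant)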

We can also test the cluster by running jobs against sample data from the Pivotal HD demo project. The link to this project is at pivotalhd.cfapps.io, or go to pivotalhd.cfapps.io/getting-started/dataset.html for more detail.
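Independently of the demo project, a minimal HDFS smoke test is a quick way to confirm the file system is healthy. Run it as gpadmin on one of the cluster nodes (phd1 here); it is a generic check, not part of the demo data set:

phd1> hdfs dfs -mkdir -p /tmp/smoketest
phd1> hdfs dfs -put /etc/hosts /tmp/smoketest/
phd1> hdfs dfs -cat /tmp/smoketest/hosts
phd1> hdfs dfs -rm -r /tmp/smoketest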

Once you have run these jobs, you can access the job monitor from the top menu (as shown below) or access the Job History Management UI here: http://10.211.55.101:19888/jobhistory

[Figure: Pivotal HD job monitor]

To stop and start the cluster, or to destroy it and start over, you can issue the following commands:

Stop the cluster from the PCC node:

pcc> icm_client stop -l PHD_C1

Then shut down all VMs (from your host node):

> vagrant halt -f

When you need the cluster environment again, just run Vagrant without provisioning. This should take less than 2 minutes to come up.

> vagrant up --no-provision

SSH to PCC and start the cluster again:

pcc> icm_client start -l PHD_C1

To destroy the cluster completely:

> vagrant destroy -f

Additional Configuration Info for Services, IP Address, Users, and Config Variables

You can alter the list of services by changing the SERVICES variable in the pcc_provision.sh script.
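For example, to provision a lighter cluster without HBase, GPXF, and HAWQ, you could trim the variable to something like the following before running vagrant up (an illustrative value, not the shipped default):

SERVICES=hdfs,yarn,pig,zookeeper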

The default configuration applies the following Hadoop services topology:

phd1 - client, namenode, secondarynamenode, yarn-resourcemanager, mapreduce-historyserver, hbase-master, hive-server, hive-metastore, hawq-master, hawq-standbymaster, hawq-segment, gpxf-agent

phd1, phd2, phd3 - datanode, yarn-nodemanager, zookeeper-server, hbase-regionserver, hawq-segment, gpxf-agent

It is fairly easy to modify the default configuration to change the number of virtual machines or assign different Hadoop services. For example, to add a new machine, phd4, to the cluster: (1) append phd4 to the SLAVE_NODES variable (in pcc_provision.sh), (2) add the line '10.211.55.104 phd4.localdomain phd4' to the /etc/hosts entries in the phd_provision_script, and (3) add a new 'config.vm.define :phd4 do |phd4| …' statement in the Vagrantfile, as sketched below.
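Taken together, the three changes might look roughly like this; the surrounding values mirror the defaults described above, but treat it as a sketch rather than a drop-in patch:

# (1) pcc_provision.sh: append phd4 to the slave list
SLAVE_NODES=<existing list>,phd4

# (2) phd_provision_script (embedded in the Vagrantfile): extend /etc/hosts
10.211.55.104 phd4.localdomain phd4

# (3) Vagrantfile: define the new VM (mirroring the phd1-phd3 entries)
config.vm.define :phd4 do |phd4|
  phd4.vm.hostname = "phd4.localdomain"
  phd4.vm.network :private_network, ip: "10.211.55.104"
  phd4.vm.provider :virtualbox do |v|
    v.customize ["modifyvm", :id, "--memory", "1024"]
  end
end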

Hostnames and IP addresses are configured for each of the virtual machines. They are defined in the Vagrantfile (the /etc/hosts entries are created inside the phd_provision_script) and applied to all VMs (XXX.vm.provision :shell, :inline => $phd_provision_script).

10.211.55.100                  pcc
10.211.55.101                  phd1
10.211.55.102                  phd2
10.211.55.103                  phd3

Note: The IP addresses are explicitly assigned to each VM (xxx.vm.network :private_network, ip: "10.211.55.XX")

The following user accounts are created during the installation process.

User          Password      Description
root          vagrant       exists on all machines
vagrant       vagrant       exists on all machines
gpadmin       gpadmin       exists on all machines (the password on pcc is different)

Here are some additional, key variables to point out from https://gist.github.com/tzolov/6415996#file-pcc_provision-sh:

  • CLUSTER_NAME=PHD_C1 - the PHD cluster name
  • SERVICES=hdfs,yarn,pig,zookeeper,hbase,gpxf,hawq - the Hadoop services to install
  • MASTER_NODE=phd1 - hostname of the master VM
  • MASTER_AND_SLAVES=$MASTER_NODE,phd2,phd3 - hostnames of all slave VMs (by convention the master is also used as a slave)

To learn more about Pivotal HD, visit pivotalhd.cfapps.io.


31 comments on "In 45 Min, Set Up Hadoop (Pivotal HD) on a Multi-VM Cluster & Run Test Data"

  1. David Greco on said:

    Great job Christian!

  2. Hi Christian,

    I’m having an error during the deploy.

    It gives the error message ERRORS: GPHDClusterInstaller.py failed : Default failure.

    I’ve tried to look through the log files in the manual but have come up blank. Do you have any suggestions for me?

    Thanks!

    • lfuerst on said:

      Hi

      Have you been able to resolve this issue? And would you tell me how? I’m stuck at the same point :D

      Regards, Leo

      • Christian Tzolov on said:

        Hi leo,
        can you share some context? What is your host OS? How much memory do you have? Have you tried to size down your configuration (use only one phd1 slave for example)?
        Cheers, Christian

    • Christian Tzolov on said:

      The --strip-components parameter in the blog was misspelled.
      If you have the gist.tar.gz in your PHD_1.0.1_CE folder but you can't see the Vagrantfile and pcc_provision.sh, then please run: tar --strip-components=1 -xzf ./gist.tar.gz
      Run 'vagrant destroy -f' before you try to run 'vagrant up' again.

  3. Christian Tzolov on said:

    Thank you @David!

    @Renard, no I haven’t faced this particular issue but if you can share additional context (screen dump to start with) we should resolve it. Please use the comment zone under the https://gist.github.com/tzolov/6415996 gist or directly my email. ctzolov AT gopivotal DOT com

  4. Hi Christian,

    I want to install the "Pivotal HD Single Node VM" on my laptop. What are the system requirements? My laptop has 4 GB of RAM and runs Windows XP; is it possible with this combination? Basically, I want to learn Hadoop and other technologies using the platform.

    Thanks,
    Sam

  5. Christian Tzolov on said:

    Hi Sam,

    You can start immediately by downloading the single-node Pivotal HD from: http://gopivotal.com/pivotal-products/data/pivotal-hd#4 and then follow the excellent tutorials: http://pivotalhd.cfapps.io/getting-started/pivotalhd-vm.html

    I do not expect that WindowsXP will be a problem (have not tested it though) but you need to have at least 8GB of RAM.

  6. prabhukiran vempati on said:

    Hi Christian, great article.
    Is there any way to increase the default harddrive size of each of the created VM’s in Vbox.
    The default seems to be 15 GB
    Thanks
    Sincerely
    Kiran

  7. Christian Tzolov on said:

    Hi Kiran,
    You can easily modify the VMs' memory size. Just modify your Vagrantfile (e.g. https://gist.github.com/tzolov/6415996#file-vagrantfile).
    For VirtualBox, alter the "--memory" configuration property for each defined VM. For example, to increase the memory of the phd2 VM from 1GB to 4GB, you should change the configuration for phd2 from: v.customize ["modifyvm", :id, "--memory", "1024"] to: v.customize ["modifyvm", :id, "--memory", "4096"]
    In addition, you may want to alter your Hadoop configuration to make use of the additional memory as well.
    Cheers,
    Christian

  8. prabhukiran vempati on said:

    Thanks Christian,
    I was looking more into increasing the hard drive size.
    Thanks
    Kiran

  9. prabhukiran vempati on said:

    Hi Christian
    Unable to run pig sample as gpadmin and vagrant
    receiving below
    13/10/02 20:59:04 WARN pig.Main: Cannot write to log file: /home/gpadmin/sample-data/pig_1380740344363.log
    2013-10-02 20:59:06,410 [main] ERROR org.apache.pig.Main – ERROR 2997: Encountered IOException. File pigscript.pig does not exist
    2013-10-02 20:59:06,410 [main] ERROR org.apache.pig.Main – java.io.FileNotFoundException: File pigscript.pig does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:479)

  10. Christian Tzolov on said:

    Hi Kiran,
    Sorry, I did not get your question right!
    I am afraid that you cannot resize the disk from within Vagrant; you have to create/modify the VBox disk yourself. Let me know if you find another solution.

    Concerning the Pig script error: the "File pigscript.pig does not exist" error suggests a problem finding the script. Can you please share your run command line?
    Also, can you try to start the Pig grunt console (from the command line, type 'pig') and check if it works?

  11. Hi ,

    I am getting the following error while deploying the pivotal cluster.

    ERRORS: Default Failure

    • Christian Tzolov on said:

      Hi Hasan,

      Just found that the --strip-components parameter was misspelled in the blog. The issue is fixed now.

      If you have the gist.tar.gz in your PHD_1.0.1_CE folder but you can't see the Vagrantfile and pcc_provision.sh, then please run: tar --strip-components=1 -xzf ./gist.tar.gz

      If you have already performed 'vagrant up', you may want to clean it with 'vagrant destroy -f' before trying again.

      Cheers,
      Christian

  12. Prashanth Ayyavu on said:

    Hi Christian,
    I am facing the following error while deploying the cluster,

    ********************************************************************************
    * Deploy Cluster: PHD_C1
    ********************************************************************************
    [DONE] Cluster configuration successfully written to /home/gpadmin/ClusterConfigDir
    Verifying input
    Starting install
    [========== ] 10%ERRORS: GPHDClusterInstaller.py failed : Default Failure

    Here are my machine details,
    Host OS = Windows 7 Professional
    RAM = 16 GB
    Internet connection through proxy server. ( I added the proxy_http env var in the vagrantfile)
    Both vagrantfile and pcc_provision.sh are available under PHD_1.0.1_CE

    I didn see any errors while setting up the VMs or downloading the necessary packages. Only negative light during the VMs setup was the following
    ……..
    [phd1] Running: inline script
    Shutting down ntpd: [FAILED]
    Starting ntpd: [ OK ]
    ……..

    Any suggestions christian?

    Thanks,
    Prashanth

  13. lfuerst on said:

    Hi.

    My deploy finally worked, after redoing my complete setup of virtual machines. Now the "start" bothers me: after "icm_client start ..." the console tells me that the cluster started successfully, but Command Center says that only HDFS and the History Server started successfully.
    All other services can't be started. If I try to start any Java-based program/service I get a "Java heap exception" (also after varying the heap variables). I suppose that the Java heap is causing problems during the service startup…

    (CentOS 6.2 – 4GB ram for each vm)

    Regards, Leo

    • Christian Tzolov on said:

      Hi lfuerst, indeed it looks like your VMs are taking all the physical memory on your machine.
      How much physical memory do you have? Recommended is at least 8GB.
      You can reduce the amount of memory for your VMs. As you can see in the example Vagrant configuration (https://gist.github.com/tzolov/6415996#file-vagrantfile), I have allocated 2GB for phd1, 1GB for phd2 and phd3, and 356MB for the pcc node. This works on my notebook with 8GB of physical memory.
      Also, you can try stopping all background applications that may consume memory.

  14. saurabh on said:

    Hi , i am able to run hawq through the VM image that you guys have provided. I have a query, is it possible to access hive tables through hawq? if yes then how can i do so, one way is to create a external table and create a table in hawq using same file. Is there some other way too?

  15. Melvin on said:

    Hi Christian,

    Did you install the Virtual Box then install the VMWare inside the Virtual Box? Because I already download the Pivotal HD Single Node VM and run it using VMware. And it is running already. The problem is i don’t get where to start, how to start and where can i see the other platforms like Map Reduce and etc. Do I have to install it first?

    Regards,
    Melvin

    • Christian Tzolov on said:

      Hi Melvin,

      If you have the "PivotalHD single node VM" installed and you are only interested in learning how to use the services provided with the platform (MapReduce being one of them), then you should check this excellent tutorial: http://pivotalhd.cfapps.io/

      If you want to learn how to build a multi-node PHD VM cluster, then my blog will show you how to do it easily with Vagrant. For this you can use either the VirtualBox or the VMware Fusion 6 provider (you don't have to install them within each other!). Note that the VMware Fusion provider requires an inexpensive commercial Vagrant license.

      Cheers,
      Christian

  16. Hi Christian,

    I wonder if you have the instructions and scripts for installing the Pivotal HD 1.1. It will be great if you can write a blog on that.

    Thanks.

    • Christian Tzolov on said:

      Hi Kelvin,

      I have just updated the Vagrant scripts and written a blog draft for installing PHD 1.1 + HAWQ as a VM cluster. The scripts support both VirtualBox and VMware Fusion 6.

      Would you be interested in testing it before the blog is published?

      Cheers, Christian

  17. Koen Dejonghe on said:

    Hi Christian,
    Thank you very much for assembling this blog.
    I am having trouble to successfully deploy it, however.
    Everything works fine after executing 'vagrant up', but at the end of the provisioning script things go wrong.

    ********************************************************************************
    * HAWQ - post deploy configuration
    ********************************************************************************
    bash: /usr/local/hawq/greenplum_path.sh: No such file or directory
    ********************************************************************************
    * Start Cluster: PHD_C1
    ********************************************************************************
    bash: /usr/local/hawq/bin/gpssh-exkeys: No such file or directory
    Starting services
    SUCCESS: Start complete
    ********************************************************************************
    * Initialise HAWQ
    ********************************************************************************
    bash: /etc/init.d/hawq: No such file or directory

    The reason was that HAWQ was not installed on phd1.
    I’ve tried to install it manually by executing rpm Uvh on the package, but that didn’t work either since the /etc/init.d/hawq file was not created.

    Do you have any idea how to fix this ?

    Mac OSX 10.9.1 – 16GB

    • Christian Tzolov on said:

      Hi Koen,

      Thank you for giving it a try. I’ve updated the scripts to the latest Pivotal HD 1.1 + HAWQ version.
      New blog entry is coming soon but if you are interested i can share the draft version? I’ve successfully tested it with VirtualBox and VMWareFusion6 on similar HW configuration: Mac OSX 10.8.5 – 16GB

      - Christian
