Mid-Summer Progress Update from Meng - "Hadoop Provisioning on CloudStack Via Whirr" In this section I describe my progress on the project titled "Hadoop Provisioning on CloudStack Via Whirr".
Introduction It has been five weeks since GSoC 2013 kicked off. During these five weeks I have been constantly learning from the CloudStack community, both technically and personally. The whole community is very accommodating and willing to help newbies, and I am making steady progress with its help. This is my first experience working with such a large and interesting code base, which makes it a challenging and wonderful experience for me. Though I have slipped a little behind my schedule, I am making my best effort and hope to complete what I set out in my proposal by the end of this summer.
CloudStack Installation I spent two weeks or so on the CloudStack installation. In the beginning I was using Ubuntu. Since I was not familiar with Maven and a little scared by the various errors and exceptions during system deployment, I failed to deploy CloudStack by building from source. On Ian's advice I switched to CentOS and began to use RPM packages for installation, and things went much more smoothly. By the end of the second week, I submitted my first patch -- the CloudStack_4.1_Quick_Install_Guide.
Deploying a Hadoop Cluster on CloudStack via Whirr With CloudStack in place and the ability to register templates and add instances, I went ahead and used Whirr to deploy a Hadoop cluster on CloudStack. The properties in the cluster definition file are as follows:
whirr.cluster-name: the name of your Hadoop cluster.
whirr.store-cluster-in-etc-hosts: store all cluster IPs and hostnames in /etc/hosts on each node.
whirr.instance-templates: specifies your cluster layout. One node acts as both jobtracker and namenode (the Hadoop master); another two slave nodes each act as both datanode and tasktracker.
image-id: tells CloudStack which template to use to start the cluster.
hardware-id: the type of hardware to use for the cluster instances.
private/public-key-file: the key pair used to log in to each instance. Only RSA SSH keys are supported at the moment. Jclouds pushes this key pair to the set of instances on startup.
whirr.cluster-user: the name of the cluster admin user.
whirr.bootstrap-user: tells Jclouds which user name and password to use to log in to each instance for bootstrapping and customizing it. You must specify this property if the image you choose has a hardwired username/password (e.g. the default template CentOS 5.5(64-bit) no GUI (KVM) that comes with CloudStack has the hardcoded credential root:password); otherwise you can omit it.
whirr.env.repo: tells Whirr which repository to use to download packages.
whirr.hadoop.install-function / whirr.hadoop.configure-function: self-explanatory.
The output of this deployment, along with other details, can be found in this post on my blog. I also have a Whirr troubleshooting post there if you are interested.
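Pieced together from the property descriptions above, a minimal cluster definition file might look like the sketch below. All concrete values are placeholders for illustration, and the provider/endpoint lines and function names are my assumptions about a typical Whirr-on-CloudStack setup, not values taken from my actual deployment:

```properties
# Illustrative Whirr cluster definition (values are placeholders)
whirr.cluster-name=myhadoopcluster
whirr.store-cluster-in-etc-hosts=true
# 1 master (namenode + jobtracker), 2 slaves (datanode + tasktracker)
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,2 hadoop-datanode+hadoop-tasktracker
# assumed provider/endpoint settings for CloudStack
whirr.provider=cloudstack
whirr.endpoint=http://your-mgmt-server:8080/client/api
whirr.image-id=<your-template-id>
whirr.hardware-id=<your-service-offering-id>
# RSA keys only; Jclouds pushes these to the instances on startup
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.cluster-user=huser
# needed only when the template has a hardwired credential
whirr.bootstrap-user=root:password
whirr.env.repo=cdh4
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
```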
Elastic Map Reduce (EMR) Plugin Implementation Having completed the deployment of a Hadoop cluster on CloudStack using Whirr through the above steps, I began to dive into the EMR plugin development. My first API is launchHadoopCluster. Its implementation is quite straightforward: it invokes an external Whirr command on the management server's command line and piggybacks the Whirr output in the responses. This API has a structure like below: The following is the source code of launchHadoopClusterCmd.java. You can invoke this API through the following command in CloudMonkey: > launchHadoopCluster config=myhadoop.properties This is sort of launchHadoopCluster 0.0; other details can be found in this post. My ongoing work is modifying this API so that it calls the Whirr libraries instead of invoking Whirr externally on the command line. The first step is to add Whirr as a dependency of this plugin so that Maven will download Whirr automatically when the plugin is compiled. I am planning to replace the Runtime.getRuntime().exec() above with the following code snippet: LaunchClusterCommand command = new LaunchClusterCommand(); command.run(System.in, System.out, System.err, Arrays.asList(args)); Eventually, when a Hadoop cluster is launched, we can use Yarn to submit Hadoop jobs. Yarn exposes the following API for job submission: ApplicationId submitApplication(ApplicationSubmissionContext appContext) throws org.apache.hadoop.yarn.exceptions.YarnRemoteException In Yarn, an application is either a single job in the classical Map-Reduce sense or a DAG of jobs; in other words, an application can have many jobs. This fits well with the concepts in the EMR design: the term job flow in EMR is equivalent to the application concept in Yarn, and correspondingly a job flow step in EMR is equal to a job in Yarn. In addition, Yarn exposes the following API to query the state of an application.
ApplicationReport getApplicationReport(ApplicationId appId) throws org.apache.hadoop.yarn.exceptions.YarnRemoteException The above API can be used to implement the DescribeJobFlows API in EMR.
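Since DescribeJobFlows must report a job-flow state while getApplicationReport returns a Yarn application state, implementing DescribeJobFlows on top of this API needs a translation between the two state sets. The sketch below is one hypothetical mapping; the state names are modeled as plain strings here to keep the example self-contained, whereas the real code would switch on the YarnApplicationState enum carried in the ApplicationReport:

```java
// Hypothetical translation from a Yarn application state (as carried in an
// ApplicationReport) to the job-flow state that DescribeJobFlows would report.
public class JobFlowStates {
    public static String fromYarnState(String yarnState) {
        switch (yarnState) {
            case "NEW":
            case "SUBMITTED":
                return "STARTING";    // application accepted but not yet running
            case "RUNNING":
                return "RUNNING";
            case "FINISHED":
                return "COMPLETED";
            case "FAILED":
                return "FAILED";
            case "KILLED":
                return "TERMINATED";  // killed by the user, like a terminated job flow
            default:
                return "UNKNOWN";
        }
    }
}
```

A DescribeJobFlows implementation would call getApplicationReport for each tracked application and run the returned state through a mapping of this kind before building the response.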
Learning Jclouds As Whirr relies on Jclouds for cloud provisioning, it is important for me to understand which Jclouds features support Whirr and how Whirr interacts with Jclouds. I worked through the following questions: How does Whirr create user credentials on each node? Using the runScript feature provided by Jclouds, Whirr can execute a script at node boot-up; one of the options in the script overrides the login credentials with the ones provided in the cluster properties file. The following line from Whirr demonstrates this idea: final RunScriptOptions options = overrideLoginCredentials(LoginCredentials.builder().user(clusterSpec.getClusterUser()).privateKey(clusterSpec.getPrivateKey()).build()); How does Whirr start up instances in the beginning? The ComputeService APIs provided by Jclouds allow Whirr to create a set of nodes in a group (identified by the cluster name) and operate on them as a logical unit, without worrying about the implementation details of the cloud. Set<NodeMetadata> nodes = (Set<NodeMetadata>)computeService.createNodesInGroup(clusterName, num, template); The above call returns all the nodes the API was able to launch into a running state with port 22 open. How does Whirr differentiate nodes by roles and configure them separately? Jclouds commands ending in Matching are called predicate commands. They allow Whirr to decide which subset of nodes a command will affect. For example, the following code in Whirr runs a script with the specified options on the nodes that match the given condition: Predicate<NodeMetadata> condition; condition = Predicates.and(runningInGroup(spec.getClusterName()), condition); ComputeServiceContext context = getCompute().apply(spec); context.getComputeService().runScriptOnNodesMatching(condition, statement, options); The following is an example of how a node playing the jobtracker role in a Hadoop cluster is configured to open certain ports using the predicate commands.
Instance jobtracker = cluster.getInstanceMatching(role(ROLE)); // ROLE="hadoop-jobtracker" event.getFirewallManager().addRules( Rule.create() .destination(jobtracker) .ports(HadoopCluster.JOBTRACKER_WEB_UI_PORT), Rule.create() .source(HadoopCluster.getNamenodePublicAddress(cluster).getHostAddress()) .destination(jobtracker) .ports(HadoopCluster.JOBTRACKER_PORT) ); With the help of such predicate commands, Whirr can run different bootstrap and init scripts on nodes with distinct roles.
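The predicate mechanism described above can be illustrated with plain Java. This is only a stand-in sketch: it uses java.util.function.Predicate and a made-up Node class instead of the Guava Predicate and NodeMetadata that Jclouds actually works with, but it shows the same idea of combining conditions to narrow a command down to a subset of nodes:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Minimal stand-in for Jclouds NodeMetadata: a node with a group and a role.
class Node {
    final String group;
    final String role;
    Node(String group, String role) { this.group = group; this.role = role; }
}

public class PredicateDemo {
    // Combine conditions the way Whirr combines runningInGroup(...) with a
    // role condition, then keep only the nodes that satisfy both -- the set
    // a call like runScriptOnNodesMatching(...) would actually touch.
    public static List<Node> matching(List<Node> nodes, String group, String role) {
        Predicate<Node> inGroup = n -> n.group.equals(group);
        Predicate<Node> hasRole = n -> n.role.equals(role);
        return nodes.stream()
                .filter(inGroup.and(hasRole))
                .collect(Collectors.toList());
    }
}
```

In the same way, passing a predicate for the jobtracker role selects only the master node, which is how the firewall rules above end up applied to a single instance rather than the whole cluster.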
Great Lessons Learned I much appreciate the opportunity to work with CloudStack and learn from this lovable community. I can see myself constantly growing through this invaluable experience, both technically and psychologically. There were hard times when I was stuck on certain problems for days, and good times that made me want to scream with joy on seeing a problem cleared. This project is a great challenge for me. I am making progress steadily, though not smoothly. Along the way I learned the following great lessons: When you work in an open source community, do things in the open source way. There was a time when I locked myself up because I was stuck on problems and was not confident enough to ask about them on the mailing list. The more I isolated myself from the community, the less progress I made. The lack of communication on my side also prevented me from learning from other people and getting guidance from my mentor. CloudStack is evolving at a fast pace. Many APIs are being added and many patches are being submitted every day; that is why the community uses the word "SNAPSHOT" for each development version. At this moment I am learning to deal with fast code changes and upgrades. A large portion of my time is devoted to system installation and deployment, and I am getting used to treating system exceptions and errors as the common case. That is another reason why communication with the community is critical. In addition to the project itself, I am strengthening my technical toolkit at the same time. I learned to use some useful software tools: Maven, git, publican, etc. Reading the source code of Whirr taught me higher-level Java programming skills, e.g. generics, wildcards, the service loader, the Executor model, Future objects, etc. I was exposed to Jclouds, a useful cloud-neutral library for manipulating different cloud infrastructures. I gained a deeper understanding of cloud web services and learned the usage of several cloud clients, e.g. the Jclouds CLI, CloudMonkey, etc.
I am grateful that Google Summer of Code exists; it gives us students a sense of how fast real-world software development moves and provides us hands-on experience of coding in large open source projects. More importantly, it is a self-challenging process that strengthens our minds along the way.