This homework needs to be completed by 09:00 on 6 July 2017.

Discussion Question #2

Use an article such as the one below, which mentions excellence; the overall topic may be any topic that you are interested in.

Talk about the article you have chosen and provide the APA citation for the article [this article must be unique and not used by other students]. Tell us why you chose this article and how excellence played a part in the article. Also, tell us why excellence is important.

Optimizing big data processing performance in the public cloud: opportunities and approaches


Abstract:

Today's lightning-fast data generation from massive sources is calling for efficient big data processing, which imposes unprecedented demands on computing and networking infrastructures. State-of-the-art tools, most notably MapReduce, are generally run on dedicated server clusters to exploit data parallelism. For grass-roots users or non-computing professionals, the cost of deploying and maintaining a large-scale dedicated server cluster can be prohibitively high, not to mention the technical skills involved. On the other hand, public clouds allow general users to rent virtual machines and run their applications in a pay-as-you-go manner with ultra-high scalability and minimal upfront costs. This new computing paradigm has gained tremendous success in recent years, becoming a highly attractive alternative to dedicated server clusters. This article discusses the critical challenges and opportunities when big data meet the public cloud. We identify the key differences between running big data processing in a public cloud and in dedicated server clusters. We then present two important problems for efficient big data processing in the public cloud: resource provisioning (i.e., how to rent VMs) and VM-MapReduce job/task scheduling (i.e., how to run MapReduce after the VMs are constructed). Each of these two questions has a set of problems to solve. We present solution approaches for certain problems, and offer optimized design guidelines for others. Finally, we discuss our implementation experiences.



Today's lightning-fast data generation from massive sources and advanced data analytics have made information mining from big data possible. We have witnessed the success of many big data applications. For example, Amazon uses massive historical shipment-tracking data to recommend goods to sharply targeted customers, and Google uses billions of queries to predict flu trends, often one week earlier than the Centers for Disease Control and Prevention (CDC).

Big data processing, however, imposes unprecedented demands on the underlying computing and networking infrastructures. For input data at the petabyte or even exabyte scale, simply improving the power of individual machines, or scaling up, is hardly practical. State-of-the-art tools, most notably MapReduce, are generally run on dedicated server clusters to exploit data parallelism. They rely on a large number of machines working together, that is, scaling out, to process big data in a divide-and-conquer manner. Nevertheless, compared to conventional data processing, MapReduce-like tools are still new to the market, and there is much room for improvement. In particular, from the performance point of view, MapReduce has been criticized for its inefficiency.

Moreover, a growing number of big data applications come not only from the IT industry, but also from such other areas as the civil, environmental, financial, and health industries, to name but a few. For grass-roots users or non-computing professionals, the cost of deploying and maintaining large-scale dedicated server clusters can be prohibitively high, not to mention the technical skills involved. On the other hand, public clouds, also emerging in recent years, allow general users to rent computing and storage resources and run their applications in a pay-as-you-go manner. The massive resources available at a public cloud provider's data centers offer ultra-high scalability with minimal upfront costs for its users. This new computing paradigm has gained tremendous success, becoming a highly attractive alternative to dedicated server clusters.

While there have been flourishing studies on improving MapReduce performance in dedicated server clusters, the research in the context of the public cloud is still in its infancy. A public cloud has very different characteristics from dedicated server clusters. It emphasizes support for effective resource sharing and elastic pay-as-you-go services for general users, and machine virtualization plays a key role. For example, Amazon EC2, the most successful and widely used infrastructure-as-a-service (IaaS) cloud platform, relies heavily on Xen virtualization in its deployment. Each virtual machine (VM), known as an instance, functions as a virtual private server; a user can rent VMs of different configurations for different prices.

Figure 1. The MapReduce architecture.

This article discusses the unique challenges and opportunities when big data meet the public cloud. We identify the key differences between running big data processing in a public cloud and in dedicated server clusters. We then present two main questions for efficient big data processing under MapReduce in the public cloud: resource provisioning (i.e., how to rent VMs) and VM-MapReduce job/task scheduling (i.e., how to run MapReduce after the VMs are constructed). We systematically examine the key problems behind these two questions, including VM-job scheduling, shuffling, and data locality, as well as resource provisioning under CPU-I/O contention. We review potential solutions to some problems and offer optimized design guidelines for others. Finally, we discuss our implementation experiences. We release our code, scripts, and documentation as open source.

When Big Data Meet the Public Cloud

Among the many tools that scale out big data processing to parallel machines, MapReduce (proposed by Google) is arguably the most popular, and has become the de facto standard nowadays. Figure 1 illustrates the basic MapReduce structure. A MapReduce job consists of two phases (or tasks), map and reduce. Accordingly, there are two program functions: mapper and reducer. In the map phase, the input data is split into blocks. The mappers then scan the data blocks and produce intermediate data. The reduce phase starts with a shuffle sub-phase, which, run by the reducers, shuffles the intermediate data and moves it to the corresponding nodes for reducing. The reducers then process the shuffled data and generate the final results. For complex problems, multiple rounds of map and reduce can be executed.
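To make the map, shuffle, and reduce flow concrete, below is a minimal single-process Python sketch of the classic word-count pattern. The function names and the in-memory simulation are illustrative only; they are not Hadoop's API.

```python
from collections import defaultdict

def mapper(block):
    # Map phase: scan one data block and emit intermediate (key, value) pairs.
    for word in block.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle sub-phase: group all values of the same key, so each key can be
    # moved to the node responsible for reducing it.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: merge all intermediate values of one key.
    return key, sum(values)

blocks = ["big data meets the cloud", "big data in the public cloud"]
intermediate = [pair for block in blocks for pair in mapper(block)]
print(dict(reducer(k, v) for k, v in shuffle(intermediate).items()))
# {'big': 2, 'data': 2, 'meets': 1, 'the': 2, 'cloud': 2, 'in': 1, 'public': 1}
```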

There have been many practical MapReduce implementations, and the open source Hadoop is the most widely used to date. There have also been MapReduce-like tools that target specific application scenarios; for example, YARN for multi-stage cluster management with global resource management, Pregel for big graph processing, GraphLab for parallel machine learning, PowerGraph for machine learning on natural graphs, and Spark for distributed in-memory computation.

Today, Hadoop and other MapReduce-like tools generally run on dedicated server clusters. Thus, existing studies have mainly focused on optimization with such physical infrastructures. A public cloud, however, has very different characteristics from dedicated server clusters. Through machine virtualization, it effectively hides the many levels of implementation and platform details, making shared resources seem exclusive to the end users [2]. In such a virtualized environment, user applications share the underlying hardware by running in VMs. Each VM, during its initial creation (by the users), is provisioned with a certain amount of resources (e.g., CPU, memory, and I/O). The number and capacity of the VMs can be requested in a pay-as-you-go fashion, or even adjusted at runtime. In other words, the cloud's resource allocation is highly elastic, and an application may adjust the resources for processing its big data at different stages. The performance of such resources as CPUs and memory is largely the same on physical and virtual machines. However, this is no longer true for shared resources such as network bandwidth and I/O. Even worse, it is known that there are conflicts between networking, I/O, and CPU. More specifically, since virtualization often requires a hypervisor to intercept system calls, intensive networking and I/O operations interfere with the CPU, thereby affecting its performance [3]. As such, big data processing in the public cloud can differ dramatically from that in dedicated server clusters, and the above factors must be taken into account.

In this virtualized, shared, and highly elastic environment, two cascaded issues are to be addressed:

  • Resource provisioning: Given a MapReduce application, find the best way to construct VMs.

  • VM-job scheduling: Given the set of VMs that have been provisioned/constructed, find the best way to run MapReduce jobs/tasks.

Obviously, optimized resource provisioning and job scheduling depend on accurate cloud performance measurement and modeling. Modeling cloud performance is a difficult problem. The behavior of many cloud applications, such as game servers and web servers, changes frequently because it closely depends on user behavior. Big data processing applications differ in a major way, however: a big data application runs on the same type of data again and again. For example, Google flu prediction runs every day on user query data. Although the queries differ from day to day, the amount of data, the amount of data in each block used in MapReduce, and the way Google flu prediction processes these data stay the same. This makes it possible to predict the performance of VMs, map tasks, reduce tasks, and so on. As such, one round of data can be used to build the performance model, which then serves as the input for resource provisioning and scheduling.
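Because a big data job processes the same kind of data in the same way round after round, one measured round is enough to fit a simple per-phase model. The sketch below assumes idealized parallelism and hypothetical numbers; it only illustrates the idea of turning one profiling run into a predictor.

```python
def fit_model(profile):
    # One profiling round: total map time, total reduce time, block count.
    n = profile["blocks"]
    return {"map_per_block": profile["map_time"] / n,
            "reduce_per_block": profile["reduce_time"] / n}

def predict_runtime(model, blocks, mappers, reducers):
    # Idealized estimate: map waves followed by reduce waves.
    return (model["map_per_block"] * blocks / mappers
            + model["reduce_per_block"] * blocks / reducers)

# Hypothetical profiling data from one round of the job.
model = fit_model({"blocks": 100, "map_time": 7500.0, "reduce_time": 4000.0})
print(predict_runtime(model, blocks=200, mappers=32, reducers=16))  # ~968.75 s
```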

Resource Provisioning in the Public Cloud

We first look into the resource provisioning problem. There have been initial studies on running MapReduce in the cloud that address resource provisioning issues [4], [5]. Strategies that optimize the cluster size according to different job types and workloads have also been presented [6]. These pioneering studies treat different VM types as containers with given computing capacities. As discussed, in the public cloud, the labeled capacity may not be accurate, particularly for data-intensive applications.

Figure 2. The cumulative distribution function (CDF) of the processing time for the same MapReduce job under different cluster configurations in EC2.

To understand this, we construct two clusters in Amazon EC2:

  1. Small-16, with 16 EC2 small instances

  2. XLarge-2, with 2 EC2 extra large instances

Both clusters have identical CPU capacity and memory capacity. We also assume that the unit-time costs for using both clusters are identical. As such, one should expect their performance to be the same. We show that this is not the case when running MapReduce.

We execute the same MapReduce job on both clusters, and the results are shown in Fig. 2. When two extra large instances are employed in our cluster (referred to as XLarge-2), the job completes in 2732 to 2831 s. When 16 small instances are used (referred to as Small-16), the job completion time increases to 3042 s. In other words, the job runs 10 percent slower on Small-16 than on XLarge-2, indicating that a user will suffer a 10 percent capital loss if he/she does not construct the VMs wisely. This is because, with I/O-intensive operations, there is strong interference between CPU and I/O.
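As a quick check on the quoted figure, the measured completion times give

\[
\frac{3042-2831}{2831}\approx 7.5\%, \qquad \frac{3042-2732}{2732}\approx 11.3\%,
\]

so Small-16 is roughly 10 percent slower than XLarge-2, consistent with the claim above.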

In [3], the resource provisioning problem in cloud-based big data systems is remodeled, and an interference-aware solution that smartly allocates MapReduce jobs to different VMs is presented. The key is the remodeling phase, where the interference is captured in a parameter and formulated into the performance of the mapper, reducer, and so on. A trace-driven evaluation using Facebook data and four types of Amazon EC2 instances (small, medium, large, and extra large) shows that the interference-aware solution outperforms a state-of-the-art cost-aware resource provisioning algorithm [7] by 11 percent.

MapReduce VM-Job Scheduling

We next look into VM-job scheduling, assuming that the set of VMs has already been constructed. We first discuss how to schedule the MapReduce jobs and tasks, and how to assign them to the VMs. We then look into the shuffling sub-phase and examine how the online data-machine assignment can be optimized. Finally, we discuss the notion of data locality and offer design guidelines through cloud elasticity.

VM-Job Scheduling

In the default MapReduce implementation, each machine is assigned two mappers and one reducer, and multiple jobs are randomly assigned to the machines. Consider the example in Fig. 3. There are three VMs and two jobs. Job 1 has four map tasks and two reduce tasks. Job 2 has one map task and one reduce task. Assume the processing time to be 75 s for all map tasks and 100 s for all reduce tasks. If the VM assignment is not considered, we have Fig. 3a, which follows the default first-in first-out (FIFO) strategy of Hadoop. If we jointly consider VM assignment, we can achieve the schedule in Fig. 3b. It is easy to see that the completion time of job 2 in Fig. 3a is 250 s, whereas in Fig. 3b it becomes 175 s, a 30 percent improvement.
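The figure itself is not reproduced here, so the two schedules below are one consistent reconstruction of the example; a small Python replay confirms the completion times. The task placements are our own assumption, subject only to the stated durations, the three VMs, and the rule that a job's reduce tasks start after its map tasks finish.

```python
MAP, RED = 75, 100  # task durations in seconds, from the example

def completion_times(schedule):
    # schedule: list of (job, kind, vm, start); returns each job's finish time.
    done = {}
    for job, kind, vm, start in schedule:
        end = start + (MAP if kind == "map" else RED)
        done[job] = max(done.get(job, 0), end)
    return done

fifo = [  # Fig. 3a style: job 1 first, job 2 squeezed in afterward
    (1, "map", 1, 0), (1, "map", 2, 0), (1, "map", 3, 0),
    (1, "map", 1, 75), (2, "map", 2, 75),
    (1, "reduce", 1, 150), (1, "reduce", 2, 150), (2, "reduce", 3, 150)]
joint = [  # Fig. 3b style: job 2 runs straight through VM3 without waiting
    (1, "map", 1, 0), (1, "map", 2, 0), (2, "map", 3, 0),
    (1, "map", 1, 75), (1, "map", 2, 75), (2, "reduce", 3, 75),
    (1, "reduce", 1, 150), (1, "reduce", 2, 150)]
print(completion_times(fifo))   # {1: 250, 2: 250}
print(completion_times(joint))  # {1: 250, 2: 175}: job 2 improves by 30 percent
```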

The above example shows that the simple default setting is hardly optimal. This is a scheduling problem, but MapReduce has a unique parallel-sequential structure that is worth special consideration. More formally, we have:

MapReduce VM job scheduling (MVJS)

Given a set of MapReduce jobs with map tasks and reduce tasks (map tasks must precede reduce tasks) and a set of VMs with given performance, find a feasible schedule (when and where to run the tasks) that minimizes the total job completion time.

A series of works progressively investigates this problem. In [8], OffA, a scheduling algorithm for multiple MapReduce jobs on multiple machines, is developed. In [9], the study is extended to take the parallel-sequential structure into consideration, and an 8-approximation algorithm, H-MARES, is developed. Neither study considers machine assignment. The most recent work is [15], where MapReduce job/task scheduling is considered jointly with machine assignment, and a 3-approximation algorithm, MarS, is developed. MarS first relaxes MVJS into a linear programming problem with a readily available optimal solution. This gives a lower bound for MVJS and sorts out an optimized scheduling order. A greedy search is then conducted for a feasible solution that approximately solves MVJS.

Performance improvement can be observed at each of these steps of progress. For example, a gain of 20 percent for MarS over H-MARES is observed in most typical cases [15].

Online Optimal Shuffling

Looking further into the MapReduce structure, there is a shuffling sub-phase in the reduce task. The map function transforms the raw input data into (key, value) pairs, and the reduce function merges all intermediate values associated with the same intermediate key. The shuffling sub-phase starts after some map tasks finish and runs in parallel with the remaining map tasks. Intrinsically, the shuffling sub-phase re-assigns intermediate data to the appropriate VMs. In state-of-the-art implementations, a hash function is used for this re-assignment, which directly affects the load of the different machines. If the input data has a uniform distribution, the load can be expected to balance automatically. However, if the data is skewed, the performance can be poor.

To illustrate this, consider the example in Fig. 4. Assume that there are three VMs for a typical WordCount application (which counts the occurrences of words in a document). If the default hash function is used, the result is Fig. 4a. Here the words are distributed to the three VMs in a round-robin fashion: VM1 gets a, d, g; VM2 gets b, e; and VM3 gets c, f for further processing. Clearly, the load is unbalanced because the distribution of the words is highly skewed. Figure 4b shows a better distribution, which yields a shorter finishing time.

This again shows that there is large room for improvement toward an optimal strategy for big data processing. More formally, we have:

MapReduce shuffle scheduling (MSS)

Given a data stream produced by a MapReduce job and a set of VMs, find a partition strategy for the data stream that minimizes the maximum load over the VMs.

Figure 3. VM-job/task scheduling: a) without taking VM assignment into consideration; b) taking VM assignment into consideration.

To minimize the overall finishing time, the shuffle sub-phase should start as soon as possible. In other words, we need an online algorithm for problem MSS. From the theoretical point of view, MSS is related to the online minimum makespan problem, where jobs arrive one by one with their processing times and must be assigned to parallel machines. A unique constraint in the MapReduce context, however, is that the same key must go to the same machine.

A List-based Online Scheduling (LOS) algorithm is developed in [11]. Upon receiving a (key, location) pair, LOS decides which machine to assign that item to, without any knowledge of what items may arrive in the future. LOS assigns an unassigned key to the machine with the smallest load when it first arrives; for keys that have already been assigned, LOS follows the constraint that the same key must go to the same machine. It is shown that LOS has a competitive ratio of 2; that is, it yields an overall finishing time at most twice that of the optimal solution.
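A minimal sketch of the LOS idea follows, assuming a stream of (key, size) items rather than the (key, location) pairs described above; the names and the skewed toy stream are ours.

```python
def los_assign(stream, machines):
    # Online assignment in the spirit of LOS [11]: a new key goes to the
    # currently least-loaded machine; a previously seen key sticks to its machine.
    load = [0] * machines
    placement = {}
    for key, size in stream:
        if key not in placement:
            placement[key] = min(range(machines), key=lambda m: load[m])
        load[placement[key]] += size
    return placement, load

# Skewed stream: key 'a' dominates, as in the Fig. 4 discussion.
stream = [("a", 3), ("b", 1), ("a", 3), ("c", 1), ("d", 1), ("a", 3)]
placement, load = los_assign(stream, 3)
print(placement, "max load:", max(load))
# {'a': 0, 'b': 1, 'c': 2, 'd': 1} max load: 9
```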

Data Locality and Cloud Elasticity

An important notion in big data processing is data locality. Data parallelism scales out data processing to multiple machines, which incurs data movement from one VM to another. Massive data movement, involving slow networking and disk operations, can aggravate resource contention and introduce excessive delay. It is therefore desirable for data to be close to the target machines.

Wang et al. [14] investigate the data locality of map tasks in scheduling under heavy traffic. To balance data locality and load balancing while maximizing throughput and minimizing delay, the system is modeled as a queueing architecture. A scheduling algorithm based on the Join the Shortest Queue (JSQ) policy and the MaxWeight policy is then proposed, and it is proven to be throughput-optimal. Tan et al. [13] propose a stochastic optimization framework to optimize the data locality of reduce tasks. Based on the optimal solution under restricted conditions, a receding horizon control policy is proposed.

Note that the public cloud has capacity far beyond the requirement of any one big data processing application. Together with the elastic nature of the public cloud, this offers opportunities to change VM capacity at runtime [12] to accommodate different data intensities at different stages of MapReduce. In other words, rather than moving data, it is possible to change VM capacities over time according to the amount of data to be processed in the local VMs. In the initial investigation [12], for a given MapReduce job and a set of VMs, each with a set of CPUs, the number of CPUs in the different VMs is adjusted, when a task is about to execute, according to the location of the data to be processed by that task. Experiments on a real cluster and an EC2 cluster show that the throughput of Hadoop is improved by 41 and 15 percent, respectively.

Currently, runtime elastic VMs are not offered by existing cloud providers. We share our initial implementation experience in constructing runtime elastic VMs, where we adjust the number of CPUs in each VM. Using the Xen hypervisor, such adjustment introduces negligible delay. With runtime elastic VMs, we observe a 43 percent reduction in finishing time while the total number of CPU-hours remains similar. We believe that there are immense opportunities for both performance optimization and pricing here.
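As a rough illustration, Xen's xl toolstack can change a running guest's vCPU count (up to its configured maximum) with the xl vcpu-set command. The rebalancing policy below, which sizes each VM proportionally to its local data, is entirely our own assumption, as are the domain names.

```python
import subprocess

def set_vcpus(domain, count):
    # Adjust a running Xen guest's vCPU count via the xl toolstack.
    subprocess.run(["xl", "vcpu-set", domain, str(count)], check=True)

def rebalance(local_data_mb, total_cpus):
    # Hypothetical policy: give each VM a CPU share proportional to the
    # amount of data it will process locally, instead of moving the data.
    total = sum(local_data_mb.values())
    for domain, mb in local_data_mb.items():
        set_vcpus(domain, max(1, round(total_cpus * mb / total)))

# Example (requires a Xen host with these guest domains):
# rebalance({"slave-1": 800, "slave-2": 200, "slave-3": 200}, total_cpus=6)
```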

Figure 4. Shuffle scheduling: a) default shuffle scheduling; b) load-aware shuffle scheduling.

Figure 5. The implementation framework.

Implementation Experiences

So far we have systematically reviewed approaches to optimizing MapReduce for big data processing in the public cloud. We now present our implementation experience with an integrated optimization framework; our code, scripts, and documentation/manual are available as open source.

We implement Hadoop-1.2.0 running in the Amazon EC2 public cloud. In this implementation, we need to coordinate two components within Hadoop: JobTracker and TaskTracker. JobTracker manages all jobs in a Hadoop cluster, and as jobs are split into tasks, TaskTracker is used to manage the tasks on every machine. The implementation framework is shown in Fig. 5. We add a new big data processing arrangement component. In this component, there is a predictor module to evaluate the performance of jobs, tasks, and VMs. We register a MapReduce VM-job scheduler module with JobTracker so that JobTracker can call this module to make scheduling decisions. The MapReduce VM-job scheduler module makes decisions according to the algorithms we develop in this project. We show in Fig. 5 the event-driven steps of a Hadoop runtime. More specifically, there are four associated steps (a skeleton sketch follows the list):

  • When a job is submitted to Hadoop, JobTracker notifies MSJO that a job has been added. MSJO puts the job into a queue.

  • The MSJO scheduler is driven by events from JobTracker. While Hadoop is running, JobTracker keeps notifying MSJO of TaskTracker status; if a machine is idle, MSJO assigns a task to the TaskTracker of that machine.

  • After a task finishes, its TaskTracker informs JobTracker, which further notifies MSJO, and MSJO updates the job information.

  • If all tasks in a job have finished, MSJO removes the job, and JobTracker sends a job-completion event to the user application.
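To make the event flow concrete, here is a conceptual Python skeleton of the four steps. Hadoop's actual JobTracker/TaskTracker interfaces are Java, so the class and method names below are purely illustrative.

```python
from collections import deque

class Job:
    def __init__(self, name, tasks):
        self.name, self.remaining = name, deque(tasks)

class MSJOScheduler:
    def __init__(self):
        self.jobs = deque()

    def on_job_added(self, job):          # step 1: queue a submitted job
        self.jobs.append(job)

    def on_tracker_idle(self, tracker):   # step 2: assign a task to an idle machine
        for job in self.jobs:
            if job.remaining:
                print(f"assign {job.remaining.popleft()} of {job.name} to {tracker}")
                return

    def on_task_finished(self, job):      # step 3: update job information
        if not job.remaining:
            self.on_job_finished(job)

    def on_job_finished(self, job):       # step 4: remove job, notify the user
        self.jobs.remove(job)
        print(f"{job.name} complete")

sched = MSJOScheduler()
sched.on_job_added(Job("job-1", ["map-1", "map-2"]))
sched.on_tracker_idle("tracker-A")  # assign map-1 of job-1 to tracker-A
```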

As a working example, we evaluate the MarS and H-MARES algorithms discussed earlier with experiments on a cluster built with 16 Amazon EC2 small instances. We employ WordCount, a benchmark application, as the MapReduce program in our experiments. We use a document package from Wikipedia as the input data of the jobs. This package contains all English documents in Wikipedia as of 30 January 2010, with an uncompressed size of 43.7 GB. We build a job set containing 10 jobs, where the input data size of every job is less than 1 GB, and use it to evaluate the performance of our algorithms when jobs are small. In the experiments, the total weighted job completion time of H-MARES is 5224 s, while that of MarS is 4826 s. These results match our earlier simulation results well.

Currently, we are also developing runtime elastic VMs, and we show the benefit in a demo experiment. We employ WordCount as the MapReduce job, and the input is 12 GB of Wikipedia data. We compare two runtime strategies: static VM and elastic VM. In the static VM strategy, every slave node has one CPU. In the elastic VM strategy, every slave node has two CPUs at job start time, and the number of CPUs decreases to one after all map tasks finish. In the experiment, the total CPU count × running time of the two strategies is comparable. However, the elastic VM strategy finishes the job in 3934 s, while the static VM strategy takes 6890 s, an impressive improvement of 42.9 percent.
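The quoted improvement follows directly from the two completion times:

\[
\frac{6890-3934}{6890}\approx 0.429,
\]

that is, the 42.9 percent reduction reported above.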

Conclusion and Future Work

In this article, we discuss the user groups for which running big data applications in the public cloud is suitable. We show that, unlike companies such as Google that run their big data processing applications on dedicated server clusters, grass-roots users and non-computing professionals may have to resort to the public cloud. We illustrate the key differences between the public cloud and dedicated server clusters. We discuss two important problems for efficient big data processing in the public cloud. We present solution approaches for certain problems, and offer optimized design guidelines for others.

Nevertheless, the research and practice of big data processing in the public cloud remain in their infancy. Many differences between the public cloud and dedicated server clusters are left unexplored, and many open questions still lack clear winning solutions. As grass-roots users and non-computing professionals become aware of running big data applications, there will be abundant opportunities. In particular, we consider that a better understanding of the impact of networking on VM performance and cloud elasticity may improve, or even drastically change, the problem spaces in resource provisioning as well as MapReduce job scheduling.

Keywords

  • IEEE Keywords

Cloud computing, big data, servers, runtime, data processing, virtualization

  • INSPEC: Controlled Indexing

virtual machines, Big Data, cloud computing, parallel processing

  • INSPEC: Non-Controlled Indexing

VM-MapReduce job/task scheduling, performance optimization, data generation, Big Data processing, computing infrastructures, networking infrastructures, data parallelism, large-scale dedicated server clusters, virtual machines, public cloud, resource provisioning

Authors

Dan Wang

Department of Computing, Hong Kong Polytechnic University

Dan Wang [S'05, M'07, SM] ([email protected]) received his B.Sc. degree from Peking University, Beijing, China, in 2000, his M.Sc. degree from Case Western Reserve University, Cleveland, Ohio, in 2004, and his Ph.D. degree from Simon Fraser University, Burnaby, British Columbia, Canada, in 2007, all in computer science. He is currently an associate professor in the Department of Computing, Hong Kong Polytechnic University. His current research interests include green computing and big data in the cloud. His research work also spans from wireless sensor networks to Internet routing and applications.

Jiangchuan Liu

School of Computing Science, Simon Fraser University, and an EMC-Endowed Visiting Chair Professor of Tsinghua University

Jiangchuan Liu ([email protected]) received his B. Eng. (cum laude) from Tsinghua University in 1999 and his Ph.D. from Hong Kong University of Science and Technology in 2003. He was a co-recipient of the ACM TOMCCAP Nicolas D. Georganas Best Paper Award 2013, ACM Multimedia Best Paper Award 2012, and IEEE Communications Society Best Paper Award on Multimedia Communications 2009. He is currently an associate professor in the School of Computing Science, Simon Fraser University, and an EMC-Endowed Visiting Chair Professor of Tsinghua University. He serves on the Editorial Boards of IEEE Transactions on Multimedia, IEEE Communications Surveys and Tutorials, IEEE Access, and IEEE Internet of Things Journal. He is a TPC co-chair of IEEE/ACM IWQoS ‘ 14 and an Area Chair of ACM Multi-media’14. His research interests include multimedia systems and networks, cloud computing, social networking, wireless ad hoc and sensor networks, and peer-to-peer and overlay networks.

