
Assignment: MapReduce Algorithm Design, Computing Relative Frequencies

Dataset description

You will explore a set of 100,000 Wikipedia documents, found on Canvas as wikitext.txt. Each line in this file consists of the plain text extracted from one Wikipedia document. You will upload this file to a folder in S3.

Task

In this task you will use the Hadoop Streaming API to count the occurrences of each word in the documents. The map and reduce functions you will use can also be found on Canvas. You will upload these to a folder in your S3 bucket.
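If you prefer to script the uploads rather than use the S3 console, a few lines of boto3 will do it. A minimal sketch, assuming your AWS credentials are already configured; the bucket name and key layout here are hypothetical, so adjust them to your own:

import boto3

s3 = boto3.client('s3')
bucket = 'my-emr-bucket'  # hypothetical bucket name; use your own

# Upload the dataset and the two scripts to folders in the bucket.
for local_name, key in [('wikitext.txt', 'input/wikitext.txt'),
                        ('mapper.py', 'scripts/mapper.py'),
                        ('reducer.py', 'scripts/reducer.py')]:
    s3.upload_file(local_name, bucket, key)

Uploading through the S3 console works just as well and is all the assignment requires.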

You will use Amazon Web Services to create a cluster, then add a Hadoop Streaming step for the map and reduce functions provided. The Hadoop Streaming API is provided by Hadoop for writing map and reduce functions that read from the standard input stream and write to the standard output stream. This makes it possible to use MapReduce with languages other than Java, such as Python, Ruby, or Unix shell scripts.

Python functions will be used in this assignment. The mapper function loops through the lines of input, splits each line into a list of words, and emits each word followed by a count of one to stdout (the standard output stream). The code is shown below:

import sys

# Read lines from standard input; emit each word with a count of one.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print word + '\t' + str(1)

The reducer function loops through the sorted words from stdin (the standard input stream), using a dictionary to tally the total count for each word. Once the totals have been computed, a second loop prints them to stdout. The code for this function is shown below:

import sys

# Tally the total count for each word arriving on standard input.
wordCount = {}

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
        wordCount[word] = wordCount[word] + count
    except KeyError:
        # First time this word is seen.
        wordCount[word] = count

# Emit the final totals.
for word in wordCount:
    print word + '\t' + str(wordCount[word])
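Before moving to AWS, it can be worth sanity-checking the two scripts locally by simulating Hadoop's shuffle-and-sort phase with a pipe: the equivalent of running cat sample.txt | python mapper.py | sort | python reducer.py at a shell prompt. A minimal sketch, assuming the scripts are saved as mapper.py and reducer.py and a small sample of the dataset is in sample.txt (all three file names are illustrative):

import subprocess

# Map every line, sort the intermediate pairs (Hadoop's shuffle),
# then reduce -- the same flow the Streaming job runs on the cluster.
out = subprocess.check_output(
    'cat sample.txt | python mapper.py | sort | python reducer.py',
    shell=True)
print(out)

If the printed totals look right on the sample, the scripts should behave the same way under Hadoop Streaming.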

The steps for this assignment closely follow the steps in the previous AWS assignment. The primary difference is that you will need to create a Streaming program rather than a Hive program. Other than the requirements for creating the Streaming program, you can follow the same steps as in the AWS assignment.

After you've created the cluster, you can select the Add step button from the top of the console. A new window will pop up for creating the step. There will be a drop-down menu for step type; here, you should select Streaming program. Once this has been selected, the Add Step popup window will change. At this point you can enter the proper paths to the mapper function, the reducer function, the data directory, and the output directory. There are no special arguments required for this job, so you may leave that field empty. NOTE: For the output directory, you can select an existing folder, but you must still add a new directory name to the end of that path, otherwise your MapReduce job will fail. Amazon EMR requires that the output directory not already exist; the job creates it at runtime and fails if the path is already present.
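The console step is equivalent to submitting a Hadoop Streaming step through the EMR API. A sketch using boto3, with a hypothetical cluster id and hypothetical S3 paths that must match your own bucket layout:

import boto3

emr = boto3.client('emr')
emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',  # your cluster id
    Steps=[{
        'Name': 'Wikipedia word count',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'hadoop-streaming',
                '-files', 's3://my-emr-bucket/scripts/mapper.py,'
                          's3://my-emr-bucket/scripts/reducer.py',
                '-mapper', 'mapper.py',
                '-reducer', 'reducer.py',
                '-input', 's3://my-emr-bucket/input/',
                '-output', 's3://my-emr-bucket/output/run1/',  # must not exist yet
            ],
        },
    }])

Whichever route you take, the same rule about the output path applies: the run fails if the output directory already exists.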

You should download the syslog.gz file from the log files associated with the cluster step. You will have to look for this file in the directory you assigned for log files when you created your cluster. The file decompresses to a plain-text file when downloaded. Once you've downloaded it, open it and scroll toward the middle of the file to find the log entries for the map and reduce steps. Using the timestamps on these entries you can determine how long the mapper and reducer steps took to run.
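Hadoop syslog entries usually carry timestamps of the form YYYY-MM-DD HH:MM:SS,mmm, though the exact log lines vary by Hadoop and EMR version. A small helper for turning two timestamps copied from the log into an elapsed time in seconds; the example timestamps below are made up:

from datetime import datetime

def elapsed_seconds(start, end):
    # Parse Hadoop-style timestamps and return the difference in seconds.
    fmt = '%Y-%m-%d %H:%M:%S,%f'
    return (datetime.strptime(end, fmt)
            - datetime.strptime(start, fmt)).total_seconds()

print(elapsed_seconds('2019-03-01 17:02:10,512',
                      '2019-03-01 17:03:55,847'))  # 105.335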

You should also go to the directory you specified for the output and download the output file or files from your Streaming program. These are typically named part-00000, part-00001, and so on, with one file per reducer task. Open the file or files; if there is more than one, compare them.

What to submit

  •  You will need to submit the log file, syslog.txt, that you downloaded.
  •  You should also submit a short summary. This should include:
     o  How long, in seconds, did the mapper function take to run?
     o  How long, in seconds, did the reducer function take to run?
     o  How many nodes were used to run this MapReduce job?
     o  How many output files were generated by the reducer?
