Big Data "People You Might Know" social network friendship recommendation algorithm builds on the idea that if two people have a lot of mutual

Big Data

"People You Might Know" social network friendship recommendation algorithm builds on the idea that if two people have a lot of mutual friends, then the system should recommend them to friend each other. Let us use MapReduce to implement an algorithm that recommends N = 10 users who are not already friends with user i, but have the largest number of mutual friends in common with user i.

Input:

The input data is made up of multiple lines, each representing a list of friends of a user and being of the following format:

Here <User> is a unique integer ID corresponding to a unique user and <Friends> is a comma-separated list of unique IDs corresponding to the friends of the user.

A sample of the data (assume that the IDs in the friend list of each user are sorted in ascending order):

...

18: 0,4,16,30,89,2406,2411,12562

...

30: 0,10,12,16,18,83,15520

...

We can observe that both user 0 and user 16 have friended with user 18 and user 30, which means user 18 and user 30 share at least two common friends. If they have not friended each other, the recommendation algorithm may need to suggest them to do so.

Please also note that the friendships are mutual (i.e., edges are undirected): if A is friend with B, then B is also friend with A. So we can observe, as shown in the following example, that if user 30 appears in user 18's friend list, user 18 should also appear in user 30's friend list.

Output:

The output should contain one line per user in the following format:

where <User> is a unique ID corresponding to a user and <Recommendations> is a comma-separated list of tuples each consisting of the unique ID corresponding to a recommended friend that <User> might know and the number of mutual friends s/he shares with <User> (i.e., <ID, number of mutual friends>; not necessarily sort these tuples w.r.t. the second element).

Assume that we use two consecutive MapReduce jobs to achieve this goal and that each row in the input data is treated as a record by the first MapReduce job.

The first MapReduce job counts the number of common friends each co-occurring pair of users has (or how many users have a certain pair of users appearing concurrently in their friend lists).

The second MapReduce job that consumes the first job's output generates the unsorted list of recommended friends for <User>.

Also assume in this initial implementation, we delegate the decisions of whether recommended users have already friended the focal user and whether a recommended user is among top 10 users in terms of no. of mutual friends to some ad hoc analyses. In other words, you don't need to consider the filterings of 1) already friended users, and 2) recommended users with less mutual friends in these two MapReduce jobs. We will try to incorporate these two filterings after we learn more design possibilities for MapReduce.

Questions:

Please provide a description of how you are going to solve this problem. Specifically, describe 1) the key-value pairs emitted from the map functions of both MapReduce jobs, and 2) the main operations performed by the reduce functions of both jobs as well.

Can we use the reduce functions as the combine functions directly for both MapReduce jobs if we want to introduce the local aggregation refinement?