Big Data Processing

Assignment 2

Q1) Parallel Breadth-First Search on Large Graphs

Overview

Write an advanced MapReduce program which gives your chance to develop in-depth understanding of principles when solving complex problems on Hadoop execution platform and analyse solutions by applying the knowledge learned in this course to achieve the optimal outcome.

Assessment Details

Given the source node in a graph , an important task is to find the shortest path distance from to all other nodes in . This course introduces a solution known as Parallel Breadth-First Search (BFS) in details.

Task 1 Code Development

write a MapReduce program with python to implement BFS. Table-1 shows an example graph and its representation in graph.txt (where none means the corresponding node has no out-link neighbours). The distance.txt indicates the distance from source node to each node after initialization (the node #1 is the source, the distance from source to node #1 is 0, to other nodes is 9999).

Your code must represent graph and initialize the distance in the same way as in Table-1. Also, you must implement the code following the framework introduced in the lab for the k-means clustering. Failure to do so leads to 0 marks of assignment. Also, you are not allowed to use any python MapReduce library such as mrjob.

Table-1 Graph representation and initialization.

Graph graph.txt

node#: a list of out-link neighbours distance.txt

node#: the initial distance from source

1: 2 3 4 1: 0

2: 5 6 2: 9999

3: none 3: 9999

4: 7 8 4: 9999

5: 9 10 5: 9999

6: none 6: 9999

7: 3 11 12 7: 9999

8: none 8: 9999

9: none 9: 9999

10: none 10: 9999

11: none 11: 9999

12: none 12: 9999

The expected output of the example in Table-1 is

distance.txt

node#: the shortest path distance from source

1: 0

2: 1

3: 1

4: 1

5: 2

6: 2

7: 3

8: 3

9: 3

10: 3

11: 3

12: 3

Task 2 - Performance Analysis

Suppose you have a very large graph with millions of nodes such as a road network or social networks. In the framework introduced in the lab for the k-means clustering, the number of reducer task is set to 1. What is the disadvantage of this setting? What is your solution to address this disadvantage? Note your solution must be detailed and complete other than a high-level description. Write a report in a PDF file.

Format Requirements

Failure to follow the requirements incurs up to 6 marks penalty

If your student ID is s1234567, then please create a zip file named s1234567_BDP_A2.zip with the following files without sub-folders.

All Python files you have developed.

run.sh: a bash script to run your MapReduce job on the EMR master node.

report.pdf: a PDF file for task 2.

README: a text file that includes your student's name, student ID, and how to run your code.

Do NOT submit the Hadoop Streaming jar file.

Do NOT submit the given input files.

Any path in the shell scripts must be specified as follows:

-file ./distances.txt

-file ./mapper.py

-mapper ./mapper.py

-file ./reducer.py

-reducer ./reducer.py

-input /graph.txt

-output /output

Please assume the Hadoop Streaming jar file and all your Python files are in the same folder on the EMR master node.

Question 2)

For example:

M.txt N.txt X.txt

1, 0, 1, 2, 3

1, 1, 4, 5, 6 2, 0, 7, 8

2, 1, 9, 10

2, 2, 11, 12 3, 0, 120, 86

3, 1, 140, 210

For the output matrix, it must show row#column#value for every row and column. For example:

0 0 62

0 1 22

1 0 1

1 1 56

Note the format of input and output must comply with the requirement strictly. Failure to do so leads to 0 marks of assignment.

Task 2 - Performance Analysis

Use the developed code in Task 1 to conduct a series of tests. For each test, you need create matrices M, N and X of same size, e.g., 6 6. Task 2 ask you to test 5 different sizes including 6 6, 20 20, 50 50, 100 100, and 200 200, respectively. For each test, execute Task 1 on M, N and X with different numbers of reducers (i.e., 1, 3, 6, 9). Report the test results in a PDF file with the following information.

For each test, the results should include:

Map input records Map output records CPU time spent (ms)

It is a good practice to organize the test results in a table and a line chart. The clear and concise presentation will lead to the higher mark.

What is the impact of the matrix size to the performance? Explain your answer based on the test results

Can more reducers always benefit the performance? Explain your answer based on the test results.

Functional Requirements

Failure to follow the requirements incurs up to 5 marks penalty

The code must be well written using a good coding style.

The code must include sufficient comments which can clearly explain the major logic flow of the program.

DEADLINE: 25th September 23:59

Download Solution Now

Uploaded By : Pooja Dhaka
Posted on : December 24th, 2024
Downloads : 0
Views : 372

Download Solution Now

Choose a Plan

Premium

80 USD

All in Gold, plus:
30-minute live one-to-one session with an expert
- Understanding Marking Rubric
- Understanding task requirements
- Structuring & Formatting
- Referencing & Citing

Most
Popular

Gold

30 50 USD

Get the Full Used Solution
(Solution is already submitted and 100% plagiarised.
Can only be used for reference purposes)

Save 33%

Silver

20 USD

Journals
Peer-Reviewed Articles
Books
Various other Data Sources – ProQuest, Informit, Scopus, Academic Search Complete, EBSCO, Exerpta Medica Database, and more

Big Data Processing

Download Solution Now

Download Solution Now

Choose a Plan

Premium

Gold

Silver

Request a Call Back