Big Data Processing
Big Data Processing
Assignment 2
Q1) Parallel Breadth-First Search on Large Graphs
Overview
Write an advanced MapReduce program which gives your chance to develop in-depth understanding of principles when solving complex problems on Hadoop execution platform and analyse solutions by applying the knowledge learned in this course to achieve the optimal outcome.
Assessment Details
Given the source node in a graph , an important task is to find the shortest path distance from to all other nodes in . This course introduces a solution known as Parallel Breadth-First Search (BFS) in details.
Task 1 Code Development
write a MapReduce program with python to implement BFS. Table-1 shows an example graph and its representation in graph.txt (where none means the corresponding node has no out-link neighbours). The distance.txt indicates the distance from source node to each node after initialization (the node #1 is the source, the distance from source to node #1 is 0, to other nodes is 9999).
Your code must represent graph and initialize the distance in the same way as in Table-1. Also, you must implement the code following the framework introduced in the lab for the k-means clustering. Failure to do so leads to 0 marks of assignment. Also, you are not allowed to use any python MapReduce library such as mrjob.
Table-1 Graph representation and initialization.
Graph graph.txt
node#: a list of out-link neighbours distance.txt
node#: the initial distance from source
1: 2 3 4 1: 0
2: 5 6 2: 9999
3: none 3: 9999
4: 7 8 4: 9999
5: 9 10 5: 9999
6: none 6: 9999
7: 3 11 12 7: 9999
8: none 8: 9999
9: none 9: 9999
10: none 10: 9999
11: none 11: 9999
12: none 12: 9999
The expected output of the example in Table-1 is
distance.txt
node#: the shortest path distance from source
1: 0
2: 1
3: 1
4: 1
5: 2
6: 2
7: 3
8: 3
9: 3
10: 3
11: 3
12: 3
Task 2 - Performance Analysis
Suppose you have a very large graph with millions of nodes such as a road network or social networks. In the framework introduced in the lab for the k-means clustering, the number of reducer task is set to 1. What is the disadvantage of this setting? What is your solution to address this disadvantage? Note your solution must be detailed and complete other than a high-level description. Write a report in a PDF file.
Format Requirements
Failure to follow the requirements incurs up to 6 marks penalty
If your student ID is s1234567, then please create a zip file named s1234567_BDP_A2.zip with the following files without sub-folders.
All Python files you have developed.
run.sh: a bash script to run your MapReduce job on the EMR master node.
report.pdf: a PDF file for task 2.
README: a text file that includes your student's name, student ID, and how to run your code.
Do NOT submit the Hadoop Streaming jar file.
Do NOT submit the given input files.
Any path in the shell scripts must be specified as follows:
-file ./distances.txt
-file ./mapper.py
-mapper ./mapper.py
-file ./reducer.py
-reducer ./reducer.py
-input /graph.txt
-output /output
Please assume the Hadoop Streaming jar file and all your Python files are in the same folder on the EMR master node.
Question 2)
For example:
M.txt N.txt X.txt
1, 0, 1, 2, 3
1, 1, 4, 5, 6 2, 0, 7, 8
2, 1, 9, 10
2, 2, 11, 12 3, 0, 120, 86
3, 1, 140, 210
For the output matrix, it must show row#column#value for every row and column. For example:
0 0 62
0 1 22
1 0 1
1 1 56
Note the format of input and output must comply with the requirement strictly. Failure to do so leads to 0 marks of assignment.
Task 2 - Performance Analysis
Use the developed code in Task 1 to conduct a series of tests. For each test, you need create matrices M, N and X of same size, e.g., 6 6. Task 2 ask you to test 5 different sizes including 6 6, 20 20, 50 50, 100 100, and 200 200, respectively. For each test, execute Task 1 on M, N and X with different numbers of reducers (i.e., 1, 3, 6, 9). Report the test results in a PDF file with the following information.
For each test, the results should include:
Map input records Map output records CPU time spent (ms)
It is a good practice to organize the test results in a table and a line chart. The clear and concise presentation will lead to the higher mark.
What is the impact of the matrix size to the performance? Explain your answer based on the test results
Can more reducers always benefit the performance? Explain your answer based on the test results.
Functional Requirements
Failure to follow the requirements incurs up to 5 marks penalty
The code must be well written using a good coding style.
The code must include sufficient comments which can clearly explain the major logic flow of the program.
DEADLINE: 25th September 23:59