CP3403 Data Mining Practice and Analysis Assignment

Requirements (Tasks) 

The whole task of this assignment consists of the following procedural steps.

Step 1 : Set up a scenario (either an imagined but realistic business situation or an actual analysis problem) in which you are given a domain-specific dataset and asked to analyze it. The purpose of the analysis might be to understand (overview or learn about) the given data or to solve a specific analytical problem, depending on the scenario you made up.

Step 2 : Find and obtain your own domain-specific dataset that fits the scenario you made up. The dataset could be unique or publicly available. Some public datasets are available from the UCI machine learning repository (http://archive.ics.uci.edu/ml/).

Step 3 : Choose appropriate data mining techniques (algorithms) – see more details for each option in Step 4 below.

** Note: The order of the above three steps can be changed. For example, you may find an interesting dataset first and then set up a specific data-mining scenario that fits the analysis of the chosen dataset. **

Step 4 : You can select either of two options for this assignment. 

Option (1) – Programming-intensive Assignment 

  • Once you have your own domain-specific dataset and a chosen data mining algorithm, you need to design and implement the chosen algorithm in your preferred programming language.
  • A series of preprocessing steps will be required at this stage. The preprocessing procedure should be designed carefully (what kind of processing is required? How? Why?) to make your data ready to be fed to your program. Some parts of this preprocessing procedure can be included in your program as part of the “pre-data-mining module”.
  • Your final program must be a stand-alone data-mining tool designed for your own purpose of data analysis. Your program is expected to include the following modules (and may include more sub-modules if needed):
    • pre-data-mining module – designed for the necessary preprocessing and for getting the data ready to be fed to the next module (data-mining module). You don’t need to include all required preprocessing in this module. It is assumed that some initial preprocessing (e.g. cleaning noisy data) can be done externally using other software tools (e.g. Excel or Weka).
    • data-mining module – implements the chosen data mining algorithm. You can directly borrow the algorithm from a popular existing data mining method, or you can design your own algorithm (by amending an existing one).
    • post-mining module – presents/reports the output produced by the previous modules. The result can be a simple text report, optionally supplemented by a non-text visualization (e.g. graph, chart or diagram).
  • This programming-intensive assignment still requires an analysis. Try to find all the patterns you can detect with your implemented algorithm. Compare and contrast the results obtained with your chosen preprocessing scheme and algorithm against those obtained with other existing algorithms or other preprocessing methods.
  • Note: to compare the results of your program with those of another existing algorithm, you can use existing data mining tools (e.g. Weka) to obtain the other algorithm's results.
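
As an illustration only, the three-module structure described for Option (1) could be sketched as follows. This is a minimal sketch under assumptions that are not part of the assignment: Python as the language, k-means as the chosen algorithm, min-max scaling as the preprocessing step, and a tiny made-up numeric CSV sample. The function names (pre_mining, k_means, post_mining) and SAMPLE_CSV are hypothetical; your own program will differ depending on your dataset and algorithm.

```python
import csv
import io
import random


def pre_mining(csv_text):
    """Pre-data-mining module: parse numeric CSV text, drop incomplete rows, min-max scale."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if not record or any(cell.strip() == "" for cell in record):
            continue  # skip blank rows and rows with missing values
        rows.append([float(cell) for cell in record])
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    # Scale every attribute to [0, 1]; constant columns become 0.0.
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20, seed=0):
    """Data-mining module: basic k-means (assign to nearest centroid, then recompute means)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids are k of the data points
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def post_mining(labels, centroids):
    """Post-mining module: produce a simple text report of the clustering result."""
    lines = []
    for j, centroid in enumerate(centroids):
        center = ", ".join(f"{v:.2f}" for v in centroid)
        lines.append(f"cluster {j}: {labels.count(j)} points, centroid ({center})")
    return "\n".join(lines)


# Made-up sample data (two numeric attributes per row); one row has a missing value.
SAMPLE_CSV = """1.0,1.1
1.2,0.9
0.9,1.0
8.0,8.2
8.1,7.9
,5.0
7.9,8.0
"""

if __name__ == "__main__":
    data = pre_mining(SAMPLE_CSV)
    labels, centroids = k_means(data, k=2)
    print(post_mining(labels, centroids))
```

Keeping the three modules as separate functions makes it easy to swap in a different preprocessing scheme or algorithm later for the comparison analysis, without touching the reporting code.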

Option (2) – Analysis-intensive Assignment 

  • Once you have chosen your own domain-specific dataset, you need to design your own data-mining analysis scheme. This analysis scheme can consist of multiple steps:
    • Set up a strategy for preprocessing your data. A series of preprocessing steps will be required and needs to be designed carefully (what kind of processing is required? How? Why?). You may include multiple different preprocessing schemes for the comparison analysis.
    • Set up a strategy for data mining. You need to select one data mining area (clustering, classification, association rule mining) of your choice and select AT LEAST TWO existing data mining algorithms in that area. For example, if you chose clustering as your data mining area, you can apply two algorithms, DBSCAN and k-means, and compare the two results. Alternatively, you can design a combined algorithm which applies multiple algorithms from the same or different data mining areas in series. Your strategy can also apply different parameters to one algorithm. Another strategy is to apply multiple preprocessing (attribute selection) schemes to one algorithm.
  • You can choose one data mining tool (e.g. Weka) to analyze your chosen dataset. Apply the data-mining strategy you set up to your chosen (preprocessed) data using the tool and try to find all the patterns you can detect.
  • Run various comparison experiments, either by applying different data mining algorithms (or strategies) to the same dataset or by applying the same algorithm to differently preprocessed datasets.
  • Critically analyze the experimental results and discuss/demonstrate why a chosen algorithm (strategy) is superior/inferior to another algorithm (strategy).
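
To make the idea of a comparison experiment concrete, here is a small hedged sketch of one of the strategies above: applying the same algorithm to differently preprocessed versions of one dataset. It assumes Python, a 1-nearest-neighbour classifier evaluated with leave-one-out accuracy, and a deliberately tiny made-up dataset (RAW_FEATURES, LABELS) in which the second attribute has a much larger numeric range than the first. None of these choices come from the assignment; in practice you would run such experiments in your chosen tool (e.g. Weka) on your own data.

```python
def min_max_scale(rows):
    """Preprocessing scheme: rescale every attribute to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_accuracy_1nn(features, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, query in enumerate(features):
        # Find the closest other instance and use its label as the prediction.
        nearest = min(
            (j for j in range(len(features)) if j != i),
            key=lambda j: squared_distance(query, features[j]),
        )
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / len(features)


# Made-up illustrative data: attribute 1 separates the classes,
# attribute 2 has a large range and dominates unscaled distances.
RAW_FEATURES = [
    [0.10, 1000.0],
    [0.20, 3000.0],
    [0.30, 5000.0],
    [0.70, 1100.0],
    [0.80, 3100.0],
    [0.90, 5100.0],
]
LABELS = ["A", "A", "A", "B", "B", "B"]

if __name__ == "__main__":
    raw_accuracy = loo_accuracy_1nn(RAW_FEATURES, LABELS)
    scaled_accuracy = loo_accuracy_1nn(min_max_scale(RAW_FEATURES), LABELS)
    print(f"1-NN without scaling:      {raw_accuracy:.2f}")
    print(f"1-NN with min-max scaling: {scaled_accuracy:.2f}")
```

On this toy data the unscaled run misclassifies every instance because the large-range attribute dominates the distance, while the scaled run classifies all of them correctly. A critical analysis would explain a difference like this rather than only report the two numbers.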

