MIST6060 Business Intelligence & Data Mining Assignment

Subject Code: MIST6060
Country: United States of America

1. The following is a set of eight transactions in a convenience store.

a. Let minimum support = 25%; that is, a frequent itemset must have support count ≥ 2. Find all frequent itemsets using the Apriori algorithm. Show your computation steps in a way similar to the example in the lecture notes.
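For readers who want to check a hand computation, the level-wise procedure (count, filter, join, prune) can be sketched in Python. This is a minimal sketch, not the lecture-notes method or Weka's implementation. The transaction table is not reproduced on this page, so the eight transactions and item names below are hypothetical, as are the names `apriori` and `min_count`.

```python
import math
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every individual item seen in the data.
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count the transactions that contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join step: unite frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-item subsets are frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

# Hypothetical data: NOT the transactions from the assignment.
transactions = [
    {"bread", "milk"},
    {"bread", "chips", "soda"},
    {"milk", "chips", "soda"},
    {"bread", "milk", "chips"},
    {"bread", "milk", "soda"},
    {"chips", "soda"},
    {"bread", "milk", "chips", "soda"},
    {"eggs"},
]

# Minimum support of 25% over eight transactions gives a support count of 2.
min_count = math.ceil(0.25 * len(transactions))
frequent = apriori(transactions, min_count)

for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On this hypothetical data the loop stops at level 4, because the only 4-item candidate appears in a single transaction and so falls below the support count of 2.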

b. For the frequent itemset that contains the largest number of items, calculate confidence values for all the rules that can be generated from that itemset (again, similar to the example in the notes).
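The confidence of a rule X → Y generated from an itemset is the support count of the whole itemset divided by the support count of X. The sketch below enumerates every rule from one itemset under that definition. The transactions, item names, and the helper names `support_count` and `rule_confidences` are hypothetical and are not taken from the assignment data.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if set(itemset) <= set(t))

def rule_confidences(itemset, transactions):
    """Map (lhs, rhs) -> confidence for every rule lhs -> rhs from itemset.

    Assumes itemset occurs in at least one transaction, so no left-hand
    side has a support count of zero.
    """
    itemset = frozenset(itemset)
    whole = support_count(itemset, transactions)
    rules = {}
    # Every non-empty proper subset can serve as a left-hand side.
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            rhs = itemset - lhs
            rules[(lhs, rhs)] = whole / support_count(lhs, transactions)
    return rules

# Hypothetical data: NOT the transactions from the assignment.
transactions = [
    {"bread", "milk"},
    {"bread", "chips", "soda"},
    {"milk", "chips", "soda"},
    {"bread", "milk", "chips"},
    {"bread", "milk", "soda"},
    {"chips", "soda"},
    {"bread", "milk", "chips", "soda"},
    {"eggs"},
]

rules = rule_confidences({"bread", "milk", "chips"}, transactions)
for (lhs, rhs), conf in rules.items():
    print(sorted(lhs), "->", sorted(rhs), round(conf, 3))
```

A 3-itemset yields six rules (three with one item on the left, three with two), which is a quick way to check that none has been missed.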

2. Let us revisit the CongressVote data for association rules mining.

a. Run the Apriori algorithm in Weka to find: (i) the three 2-itemsets with the highest support, (ii) the three 3-itemsets with the highest support, (iii) any 4-itemsets, and (iv) the most significant association rule, explaining your choice (the answer depends on your explanation/justification). You can adjust the minimum support and confidence parameters to narrow down the output (hint: start with support = 0.4 and confidence = 0.8 and increase them gradually to reduce the number of itemsets and association rules displayed in the output). Show the output screen containing the relevant results (you will lose some points if you simply print and submit all output results).

b. Based on your findings, what can be concluded about the voting preferences of Democrats and Republicans? A (hypothetical) example of such a conclusion would be “Republicans tend to vote against the education spending bill.”

3. The TopUniversities.csv data file is taken from U.S. News & World Report (Sept. 18, 1995, p.126). It includes the data on some top universities with several attributes, used to compare and rank the universities.

a. Run Weka’s Hierarchical Clustering/Single Linkage clustering algorithm with 2 clusters. Use the Ignore attributes button (located right above the Start button) to exclude the University attribute in clustering computation. The reason for excluding this attribute is that it is an identity attribute and no clustering patterns or characteristics would be related to the identity attributes (this is also true for the other data mining tasks we learned in this course, such as classification and association rules). Show the screens with the dendrogram and the cluster memberships of the individual records. Comment on the usefulness of the results.
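As background on what single linkage does, the sketch below merges, at each step, the two clusters whose closest members are nearest to each other, and stops when the requested number of clusters remains. It is a plain-Python illustration of the general technique, not Weka's implementation. The points are hypothetical and do not come from TopUniversities.csv, and the function name `single_linkage` is an assumption made for this example.

```python
import math

def single_linkage(points, k):
    """Agglomerative clustering with single linkage; returns k lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-linkage distance.
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        # Merge cluster j into cluster i.
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# Hypothetical points: five evenly spaced records plus one distant record.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 10)]
clusters = single_linkage(points, 2)
print(clusters)
```

On these hypothetical points the five evenly spaced records chain into one cluster and the distant record ends up alone, which is the kind of pattern to look for in the dendrogram and the cluster membership output.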

b. Run Weka’s SimpleKMeans clustering algorithm with 2 clusters. Again, do not include the University attribute. Describe the main characteristics of the two clusters based on the cluster information in the output. List the names of the universities that are in the smaller cluster. Show all of the relevant output screens. Which one, single linkage or k-means, is better for clustering this dataset?

4. Download the hotel-reviews-train.arff and hotel-reviews-test.arff datasets, which include 150 and 57 records of hotel customer reviews, respectively. Each dataset has two attributes: a string attribute, text, containing customer reviews, and a class attribute, sentiment, classifying a review as negative or positive.

a. Run J48 decision trees and support vector machines (SMO) using the data in hotel-reviews-train.arff and hotel-reviews-test.arff as the training and test set, respectively. Follow basically the same steps in the “Classifying New Documents in Weka” section of the Text Mining lecture notes (pp.10-12). Specifically, set lowerCaseTokens and outputWordCounts to True and select Rainbow for stopwordsHandler. Use the default settings for J48 and SMO (e.g., 10-fold cross validation). Keep Null for Output predictions (i.e., skip step 7 in the lecture notes on p. 11). For J48, show the output screen with the tree model, test error rate, and confusion matrix. For SVM/SMO, show the test error rate and confusion matrix (the SVM model is too big to display).

b. (i) Based on the overall classification accuracy, which one, J48 decision trees or SVM/SMO, is better? (ii) If we are more concerned about negative reviews than positive ones, then which one, J48 or SVM/SMO, is better? Answer this question based on the confusion matrices (i.e., do not try to use ROC/AUC for evaluation).
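The two comparisons in part b read different cells of the confusion matrix: overall accuracy uses the diagonal, while a focus on negative reviews looks at how many actual negatives were caught. A minimal sketch, assuming rows are actual classes and columns are predicted classes in the order negative, positive. The counts and the names `model_a` and `model_b` are hypothetical and are not results from the hotel review data.

```python
def summarize(matrix):
    """Accuracy plus recall and precision for the negative class.

    matrix[actual][predicted], with index 0 = negative and 1 = positive
    (an assumed ordering; check the class order in the actual output).
    """
    (neg_as_neg, neg_as_pos), (pos_as_neg, pos_as_pos) = matrix
    total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos
    return {
        "accuracy": (neg_as_neg + pos_as_pos) / total,
        # Share of actual negative reviews that were classified as negative.
        "negative_recall": neg_as_neg / (neg_as_neg + neg_as_pos),
        # Share of predicted negatives that were actually negative.
        "negative_precision": neg_as_neg / (neg_as_neg + pos_as_neg),
    }

# Hypothetical confusion matrices: NOT output from J48 or SMO.
model_a = [[40, 10], [5, 45]]
model_b = [[45, 5], [12, 38]]

summary_a = summarize(model_a)
summary_b = summarize(model_b)
print(summary_a)
print(summary_b)
```

In this hypothetical pair, the first matrix has the higher accuracy while the second catches more of the actual negative reviews, so the two questions can have different answers.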

c. (i) Based on the J48 model, what are the top 5 key words that are most useful to identify customers’ sentiments? (ii) What are the top 8 key words based on the SVM model (hint: the SVM equation model is big, but only 8 coefficients’ absolute values are at least one)?
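The hint in part (ii) amounts to keeping the terms whose coefficient has an absolute value of at least one and ranking them by that absolute value. A minimal sketch of that filter follows. The words and weights are hypothetical and are not taken from the SMO output, and the function name `top_terms` is an assumption made for this example.

```python
def top_terms(weights, threshold=1.0):
    """Return (term, weight) pairs with |weight| >= threshold, largest magnitude first."""
    strong = {term: w for term, w in weights.items() if abs(w) >= threshold}
    return sorted(strong.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical coefficients: NOT values from the actual SVM model.
weights = {
    "clean": 1.4,
    "dirty": -1.8,
    "staff": 0.3,
    "rude": -1.1,
    "location": 0.7,
}

ranked = top_terms(weights)
print(ranked)
```

The sign of each coefficient indicates which class the word pushes a review toward; which sign corresponds to which class depends on how the classes are encoded in the model output.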


Uploaded By : Katthy Wills
Posted on : March 04th, 2023
