Your assignment should be submitted through the link provided on
Blackboard and should comprise two files: first, a single PDF report named Surname_studentid.pdf; second, a zip file containing your code, named Surname_studentid.zip. The zip file should contain working Python code for all parts of the assessment, in either .py or .ipynb format. You should also include a clearly written description of each file and its use in a Read Me.txt file in the same zip file.
Assessment task details and instructions
This coursework will allow you to use the machine learning and data mining techniques covered in this module to analyse datasets that interest you, to draw conclusions based on your analysis, and finally to present your results in the form of a report.
There are three tasks in this assignment. For each task, you should choose a dataset amenable to the data mining task. For example, for Task 1 you should choose a dataset which contains a potential target variable suitable for classification. You should use a different dataset for each task.
For each task, you must fully explain and document your full experimental process, including the exploratory data analysis, data preparation and cleaning and the algorithms selected. A writing frame is provided on page six of this document, and you should complete a report for each of the three tasks (please provide all three reports in one PDF document). Overall, your submission should be around 5,500 words excluding references.
A report framework is provided at the end of this brief to provide more guidance on the expected contents of the report.
Task 1 (40 Marks)
a) Apply two classification algorithms of your choice to your chosen dataset using Python. Compare the performance of the two algorithms, justifying your choice of performance metrics. You should critically evaluate your classification models, and recommend which, if any, would be appropriate for future deployment. (25 Marks)
Classification algorithms: logistic regression, k-nearest neighbours
b) Use Azure Machine Learning Designer to apply two classification algorithms to the same dataset that you used for part a). (15 Marks)
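For part a), the following is a minimal sketch of one possible classifier (k-nearest neighbours) written with the Python standard library only. It is an illustration, not a required or recommended implementation: the toy data, the class names and the helper names knn_predict and accuracy are all made up for this example, and in practice you would normally use a library implementation and a real dataset.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Sort by distance only, so labels never need to be comparable.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up data: two features per sample, two classes.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["low", "low", "low", "high", "high", "high"]
test_X = [(1.1, 0.9), (5.1, 5.0)]
test_y = ["low", "high"]

predictions = [knn_predict(train_X, train_y, x, k=3) for x in test_X]
print(predictions, accuracy(test_y, predictions))
```

Note that the model is evaluated on samples held out from training. Reporting accuracy on the training data alone would overstate performance.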
Task 2 (35 Marks)
Apply two clustering algorithms on a selected dataset of your choice using Python. You should provide an analysis and evaluation of the clusters identified and discuss which clustering method may be better suited to your data.
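One common way to evaluate clusters without ground-truth labels is the silhouette coefficient, which compares how close each point is to its own cluster with how close it is to the nearest other cluster. The sketch below is a plain-Python illustration under assumed conditions (Euclidean distance, made-up points, a hypothetical function name silhouette_score); it is one possible evaluation measure, not the only acceptable one.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself does not affect the sum).
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Made-up 2D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]    # mixes the two groups

print(silhouette_score(points, good_labels))
print(silhouette_score(points, bad_labels))
```

Values close to 1 indicate compact, well-separated clusters. Comparing the score obtained from each of your two clustering algorithms is one way to support the discussion of which method suits your data.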
Task 3 (25 Marks)
Using Python, apply text mining and sentiment analysis on a selected text dataset of your choice. Apply the needed preprocessing steps and analyze your results.
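As a rough illustration of the preprocessing and sentiment steps, the sketch below lower-cases text, tokenises it, removes stop words, and scores sentiment by summing word polarities. The stop-word list, the polarity lexicon and the function names are tiny hypothetical examples invented for this sketch. A real analysis would use an established lexicon or a trained model.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"a", "and", "i", "is", "it", "the", "this", "was"}
POLARITY = {"great": 1, "loved": 1, "good": 1, "terrible": -1, "boring": -1, "bad": -1}


def preprocess(text):
    """Lower-case the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def sentiment_score(text):
    """Sum the polarity of every token found in the lexicon (unknown words count as 0)."""
    return sum(POLARITY.get(token, 0) for token in preprocess(text))


def classify(text):
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for review in ["The film was GREAT, and I loved it!", "A boring, terrible plot."]:
    print(preprocess(review), sentiment_score(review), classify(review))
```

A lexicon approach like this ignores negation and context (for example "not good"), which is worth discussing when you analyse your results.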
Assessment Criteria
Assessment criteria are provided alongside this brief.
You should look at the assessment criteria to find out what we are specifically looking at during the assessment.
Assessed intended learning outcomes
On successful completion of this assessment, you will be able to:
Knowledge and Understanding
1. Critically assess diverse issues regarding the use of data mining and machine learning in real-world contexts, including ethics.
2. Design and create reports to present analytical and interpretative information in creative and effective ways.
3. Devise strategies for making effective use of analytical software such as Python and Azure Machine Learning Studio.
4. Learn about different algorithms, such as classification, clustering and text mining methods.
Practical, Professional or Subject Specific Skills
Diverse issues regarding the application of data mining techniques to real-world datasets
Discover patterns within a dataset using exploratory data analysis
Use of Python / Azure for data mining
Discover techniques to leverage Python's features and work with its libraries
Reporting and presentation of analytical and interpretative information
Word count/duration (if applicable): Your assessment should be 5,500 words (+/- 10%) in total across the three tasks.
Key elements of the report
You should have only one PDF file, but for each task you should include the following elements (so, for example, for Task 1 you will have a title, introduction, ..., references, and similarly for Task 2 you will have a different title, introduction, ..., references).
For each part of your work, please include screenshots of the related code and its corresponding output in your report and explain it. Failure to do so will result in a deduction of marks.
First page
Include your first name, last name, student ID, school name, and the University of Salford logo.
Title
The title should provide an overview of the focus of your problem and the expected solution (for each task you should have a separate title).
Introduction
This section contains a brief background to the topic and leads to the formulation of the specific question, based on your selected topic. The research question must be focused and clear. You should provide a brief summary of the academic literature relevant to the application of machine learning to your chosen dataset for the task.
Datasets
You are welcome to choose any datasets that interest you and that have enough data to enable meaningful analysis. In making your choice, you should be sure to consider what problems you would be able to solve by employing data mining on the dataset. In other words, you should ask yourself: How could I use data mining to answer one or more questions about the datasets?
Briefly describe the datasets you have used, their independent and dependent variables, their data types, and the link/source from which you downloaded/obtained the data.
Explanation and preparation of datasets (Exploratory Data Analysis)
Explain any preparation tasks (e.g., normalisation, dealing with missing values, handling class imbalance etc.) carried out on the datasets.
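Two of the preparation tasks mentioned above, dealing with missing values and normalisation, can be sketched as follows. This is a simplified standard-library illustration for a single numeric column, with hypothetical function names and made-up values. Mean imputation and min-max scaling are only two of several reasonable choices, and the right one depends on your data.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Made-up column with one missing value.
raw_column = [1.0, None, 3.0]
filled = impute_mean(raw_column)
scaled = min_max_scale(filled)
print(filled, scaled)
```

When a dataset is split into training and test sets, the imputation mean and the scaling range should be computed on the training portion only and then applied to the test portion, to avoid leaking information from the test set.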
Implementation in Python / Azure Machine Learning Designer
Implement your proposed approach using libraries available in Python or Azure (Azure applies to the classification task only). This section will include:
A brief description of the algorithms used.
The application of data-mining techniques to selected datasets that you choose using Python (or Azure Machine Learning Designer for Task 1b).
Explanation of the experimental procedure, including the setting and optimisation of model hyperparameters during training, and your approach to validation.
Visualisation of the results.
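For the validation point in the list above, one widely used approach is k-fold cross-validation: the data are split into k folds and each fold is used once for validation while the remaining folds are used for training. The sketch below only generates the index splits. The function name kfold_indices is hypothetical, and the sketch assumes the samples have already been shuffled.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    base_size, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base_size + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(kfold_indices(n_samples=7, n_folds=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

A hyperparameter value can then be chosen by training with each candidate value on every training split, averaging the score over the validation folds, and keeping the value with the best average.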
Results analysis and discussion
Explain and justify the performance metric you choose to use to evaluate the model(s).
A clear and compelling presentation of the results that you obtain, both from the data mining and any other analysis that you may perform.
Evaluate and discuss the results. For tasks that require you to use more than one algorithm, you should compare and discuss the results obtained from each.
You should also consider and discuss any ethical, legal or professional considerations in using machine learning and data mining on the datasets you have selected.
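To illustrate why the choice of performance metric needs justifying, the sketch below computes accuracy, precision, recall and F1 for a made-up imbalanced binary problem. The labels and the function name binary_metrics are invented for this example, and other metrics may suit your task better.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up imbalanced example: 3 positives, 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy looks reasonable even though the model finds only one of the three positive cases, which is why accuracy alone can be misleading on imbalanced data.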
Conclusions
The key points from the assignment must be synthesised within the conclusion. This must relate back to the introduction and the research question and provide an overall evaluation of the validity of the solution you have proposed.
References
You will list all publications referenced in the report. You should show evidence of sufficient readings related to your work.
References must follow the Harvard formatting system as in this guide:
APA 7th edition referencing
Appendices (Optional)
Appendices may be used to provide relevant supporting evidence for reference but should only be used if necessary. Students may wish to include in appendices, evidence which confirms the originality of their work or illustrates points of principle set out in the main text.
Links for obtaining your datasets
The following links are provided as examples of data repositories which may be useful to obtain datasets. You are welcome to use them if you wish or you can use any other resource. You may want to choose ones in domains you have existing experience or interest in.
You should not use the datasets we have used in any of the workshops.
We recommend avoiding very large datasets.
Some data repositories:
UCI Machine Learning Repository https://archive.ics.uci.edu/datasets
Harvard Dataverse https://dataverse.harvard.edu/
Google Dataset Search https://toolbox.google.com/datasetsearch
Microsoft Research Open Data https://msropendata.com/
Data.gov.uk https://data.gov.uk/
re3data.org https://www.re3data.org/
Assessment criteria
Overall level 0-29% 30-49% 50-69% 70-89% 90-100%
Descriptors (in ascending order): Extremely poor, Very poor, Poor, Inadequate, Unsatisfactory, Satisfactory, Good, Very Good, Excellent, Outstanding
Title, Introduction, Conclusion and References (15%)
0-29%: No title or very vague title.
30-49%: Uninformative title, vague introduction, unreliable conclusion. Inadequate attempt made at proper referencing, with many errors/omissions.
50-69%: Satisfactory title; introduction defines the studied problem and the intended tasks well; relevant literature is lacking; incomplete conclusion. Acceptable attempt made at proper referencing, with a number of errors/omissions.
70-89%: Informative and attractive title; clear setting of the scene and boundaries of the report in the introduction; some relevant literature; conclusion drawn persuasively from results analysis and discussion. Referencing good, but with some errors and omissions.
90-100%: Concise and appealing title; introduction presents an excellent clarity of focus of the report; relevant literature; conclusions are reliable and can be trusted by users. Referencing is perfect.
Explanation of datasets, legal and ethical issues (if any) (15%)
0-29%: Did not perform data preparation steps for ML correctly.
30-49%: Insufficient collection of primary information, datasets are barely explained. Direct download and no preparation of data for data mining task.
50-69%: Adequate engagement with relevant information collection. Adequate dataset explanation.
70-89%: Good information collection, relevant to the assignment. Datasets clearly explained. Detailed handling of data, also mentioned some ethical and legal issues.
90-100%: Information collection of very high standard, relevant to assignment. Concise and informative dataset explanation. Outstanding handling of data, also considered important legal and ethical issues.
Implementation in Python (or Azure Machine Learning Designer for Task 1b) (40%)
0-29%: No implementation.
30-49%: Implementation not justified for the task considered. Experimental implementation and setup are lacking detail, with little or no relevant description and discussion of relevant packages and functions, and no critique of designs.
50-69%: Basic descriptions of experiments, design, and statistics that could be conducted, with little or no critique.
70-89%: Good descriptions of experiments, design, and statistics that could be conducted, with basic critique. Detailed justification of decisions made regarding ethical principles throughout the data mining algorithmic selection and usage.
90-100%: Detailed descriptions of experiments, design, and statistics that could be conducted, with critique. Outstanding justification of decisions made regarding ethical principles throughout the design, build and use of business intelligence systems and data mining algorithmic selection.
Results analysis and discussion (30%)
0-29%: No results interpretation or discussion was presented in the report.
30-49%: Results are not presented professionally, little or no results analysis and discussion.
50-69%: Results are presented using proper means such as tables and graphs, results analysis and discussion is general and shallow.
70-89%: Results are clearly and informatively presented. Results analysis and discussion are specific and sufficient.
90-100%: Results are professionally presented at standard of a journal publication. Results are critically analysed and discussed. Valuable observation and finding are made from the results.