ITECH2303 Data wrangling Assignment

Added on: 2022-11-09 05:45:50

Task requirements

As you work through the tasks in a Jupyter notebook, ensure you write comments on what you are doing and why you are using the chosen functions/methods. Also create cells where written answers need to be added. You won’t need to create a Word document, but make a sequentially flowing notebook with the appropriate headings, subheadings, etc. Imagine that this document will be presented to your manager in a workplace.

 Data preparation [10 marks]

You are to download any 5 of the datasets from the Moodle shell, within the assessment section. There are 10 datasets, 2 from each of the given 5 genres as discussed on the previous page. You can choose any combination you wish, including 2 datasets from the same genre.

 

  1. Load the 5 datasets individually into a Jupyter Notebook
  2. Before combining the 5 datasets, it is a good idea to have a quick look through them:
    • Look at each dataset individually and make sure that they are all […]. Check how many records each one has, how many features it has, and whether the datasets share the same features.
    • Look at the data types and ensure that they are the same for each dataset.
    • Look at the ID features, e.g. “user_id”: how many unique keys does each ID feature have across all datasets?
  3. Combine the 5 datasets into an appropriate data structure (for instance a pandas dataframe), with an identifier attribute that indicates their source (dataset identifier).

  • Show that this process has been successful.
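
The preparation steps above could be sketched roughly as follows. This is a minimal illustration, not the required solution: the two small frames stand in for the downloaded files, and the genre names, the `source` column name and the column values are all assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for two of the five downloaded files. With the real data each
# frame would be read from its JSON file (for example with pd.read_json);
# the genre names and values below are made up for illustration.
frames = {
    "poetry": pd.DataFrame({
        "user_id": ["u1", "u2", "u2"],
        "book_id": [10, 10, 11],
        "review_id": ["r1", "r2", "r3"],
        "rating": [5, 3, 4],
    }),
    "comics": pd.DataFrame({
        "user_id": ["u2", "u3"],
        "book_id": [20, 21],
        "review_id": ["r4", "r5"],
        "rating": [2, 5],
    }),
}

# Quick look at each dataset before combining: size, columns, unique users.
for name, df in frames.items():
    print(name, df.shape, list(df.columns), df["user_id"].nunique())

# Do all datasets share the same features and the same data types?
same_columns = len({tuple(df.columns) for df in frames.values()}) == 1
same_dtypes = len({tuple(df.dtypes.astype(str)) for df in frames.values()}) == 1

# Combine, tagging every record with the dataset it came from.
# "source" is a hypothetical name for the dataset identifier attribute.
combined = pd.concat(
    [df.assign(source=name) for name, df in frames.items()],
    ignore_index=True,
)

# Evidence that the combination worked: row counts add up per source.
rows_match = len(combined) == sum(len(df) for df in frames.values())
per_source = combined["source"].value_counts().to_dict()
```

Comparing the combined row count with the sum of the individual counts is one simple way to show that no records were lost or duplicated during the merge.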

Initial analysis [20 marks] 

You will have already realised that the format of the data is JSON when you initially examined it to load it into your program - you will now need to decide which features you are going to keep in your dataframe; how you are going to deal with missing values; what are the appropriate data types and any new derived features you believe will help you to analyse the data in the following sections. 

  1. Check what missing values there are, deal with the missing values and justify your decisions.
  2. Check the feature data types, make any appropriate changes and justify your decisions.
  3. Determine which features (or even records) are useful and are worth keeping:
    • Use appropriate pandas functions to initially analyse the data, for instance descriptive statistics of each attribute. Comment on your analysis and any changes.
    • In terms of records, a user may have reviewed the same book more than once. Can we consider that as a duplicate record?
  4. In the following sections, you are going to need to aggregate data, so that you can view, graph and analyse the data more easily. Derive other features, which may include extra features that contain summary type data or categorical or binned data. For example, you are going to have to graph data based on dates or time series. Justify any newly created features.
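
One possible shape for these four steps is sketched below on a toy frame. The column names `date_added` and `rating_band`, the bin edges, and the policy of dropping records without a rating are assumptions chosen for illustration; your own choices need their own justification.

```python
import pandas as pd

# Toy data; "date_added" is an assumed name for one of the date features.
reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "book_id": [10, 10, 10, 11],
    "rating": [5, 5, None, 3],
    "date_added": ["2015-03-01", "2015-03-01", "2016-07-15", "2017-01-20"],
})

# 1. Count missing values per feature.
missing = reviews.isna().sum()

# One possible policy (an assumption): drop records that have no rating.
cleaned = reviews.dropna(subset=["rating"]).copy()

# 2. Fix data types: whole-number ratings and real datetimes.
cleaned["rating"] = cleaned["rating"].astype("int64")
cleaned["date_added"] = pd.to_datetime(cleaned["date_added"])

# 3. Descriptive statistics, then flag repeat reviews of a book by the same user.
rating_stats = cleaned["rating"].describe()
is_repeat = cleaned.duplicated(subset=["user_id", "book_id"], keep="first")
deduped = cleaned[~is_repeat].copy()

# 4. Derived features for later aggregation and graphing.
deduped["year"] = deduped["date_added"].dt.year
deduped["month"] = deduped["date_added"].dt.month
deduped["rating_band"] = pd.cut(
    deduped["rating"], bins=[0, 2, 3, 5], labels=["low", "mid", "high"]
)
```

Whether a repeat review counts as a duplicate is a judgement call: here only the first review per user and book is kept, but keeping the most recent one would be equally defensible.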

 GroupBy analysis [20 marks] 

Use the GroupBy function in pandas to analyse the data. Implement various aggregate functions that will provide interesting insights into the data.

Give 5 (FIVE) different interesting insights into the data by using the groupby function with other functions including aggregate, loc() etc. and combinations of them.

Look at the features and ask questions of the data set to try to find useful insights. Some things to think about:

There are “id” features, i.e. “user_id”, “book_id” and “review_id”, which identify a user, a book and a review respectively. There are also dates recording when each review was made. A user may review one or many books (also in different genres), and a book could have many reviews by many users.

So for example you could investigate such things as:

  • how many unique reviews a book has had, even over different time periods
  • which books have had the most reviews
  • which book genre gets the most book reviews, including the individual books
  • which month had the most book reviews, for whichever genre
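
As a rough sketch of how questions like these map onto groupby calls, the example below counts reviews per book, finds the most reviewed book in each genre, and finds each genre's busiest calendar month. The `source` and `date_added` column names and the toy values are assumptions.

```python
import pandas as pd

# Toy combined frame; "source" plays the role of the genre/dataset identifier.
reviews = pd.DataFrame({
    "source": ["poetry", "poetry", "poetry", "comics", "comics", "comics"],
    "book_id": [10, 10, 11, 20, 20, 20],
    "review_id": ["r1", "r2", "r3", "r4", "r5", "r6"],
    "date_added": pd.to_datetime([
        "2015-03-01", "2015-03-09", "2015-04-02",
        "2016-03-11", "2016-05-20", "2016-05-21",
    ]),
})

# Unique reviews per book, most reviewed first.
reviews_per_book = (
    reviews.groupby("book_id")["review_id"].nunique().sort_values(ascending=False)
)

# Most reviewed book within each genre: count per (genre, book),
# then pick the row with the largest count in every genre.
book_counts = (
    reviews.groupby(["source", "book_id"]).size().reset_index(name="n_reviews")
)
top_per_genre = book_counts.loc[book_counts.groupby("source")["n_reviews"].idxmax()]

# Calendar month with the most reviews in each genre.
monthly = (
    reviews.assign(month=reviews["date_added"].dt.month)
    .groupby(["source", "month"]).size().reset_index(name="n_reviews")
)
busiest_month = monthly.loc[monthly.groupby("source")["n_reviews"].idxmax()]
```

Chaining a second groupby with `idxmax` and `loc` onto the first count is one way to go beyond a single basic aggregate.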

Also the data set contains ratings and votes for each respective review. So furthermore you could investigate things like:

  • average ratings and votes per book,
  • maximum votes for the various books,
  • how many books got a maximum vote ...
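
The rating and vote questions could be approached with named aggregation, as in the sketch below. The votes column is called `n_votes` here purely as an assumption; substitute the actual feature name from your datasets.

```python
import pandas as pd

# Toy data; "n_votes" is an assumed name for the votes feature.
reviews = pd.DataFrame({
    "book_id": [10, 10, 11, 20, 20],
    "rating": [5, 3, 4, 2, 5],
    "n_votes": [7, 1, 7, 0, 3],
})

# Several aggregates per book in one pass, with readable output column names.
per_book = reviews.groupby("book_id").agg(
    mean_rating=("rating", "mean"),
    mean_votes=("n_votes", "mean"),
    max_votes=("n_votes", "max"),
    n_reviews=("rating", "count"),
)

# Books with at least one review that reached the overall maximum vote count.
top_vote = reviews["n_votes"].max()
books_at_max = per_book.loc[per_book["max_votes"] == top_vote]
```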

It is up to you to find 5 interesting groupby functions. Each one is worth a maximum of 4 marks, but the function combination must be sufficiently complex to get the full 4 marks. A basic function, for example DataFrame.groupby(FeatName).aggfunc(), will give you a maximum of 2 marks. You also need to comment on what you are doing.

Linechart display [15 marks]

You need to create a line chart based on the top 5 books, i.e. defined as having the largest sum of ratings across the time period of the combined datasets. 

For this graphing task, I have given you some steps to follow:

  1. Find the books with the largest sum of ratings over the whole given time period and choose the top 5.
  2. If you haven't done this already in the previous step, then take the year out of one of the date features and make a new feature with just the year for each record.
  3. Get all of the records with the book ids of the top 5 highest accumulated ratings, which you found in step 1.
  4. Using only the “book_id”, “rating” and your new “year” attribute, plot the separate ratings over time (in years) for each of the five top-rated books, and display them in a single line chart.
  5. Make sure that all the labels for the graph are present, including for the X and Y axes, and that the book id for each of the five books in the graph can be identified (you may want to use a legend, for example).
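
The five steps might look something like the sketch below. The data is invented, `date_added` is an assumed date feature name, and summing the ratings within each year is only one reading of "ratings over time"; a yearly mean would be another.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data with six books so that one of them falls outside the top 5.
reviews = pd.DataFrame({
    "book_id": [1, 1, 1, 2, 2, 3, 3, 4, 5, 6],
    "rating": [5, 4, 5, 3, 4, 5, 5, 2, 4, 1],
    "date_added": pd.to_datetime([
        "2014-01-05", "2015-02-10", "2016-03-15", "2014-04-20", "2016-05-25",
        "2015-06-30", "2016-07-04", "2015-08-09", "2016-09-14", "2014-10-19",
    ]),
})

# Step 2: derive a year feature from the date.
reviews["year"] = reviews["date_added"].dt.year

# Step 1: the five books with the largest sum of ratings.
top5 = reviews.groupby("book_id")["rating"].sum().nlargest(5).index

# Step 3: keep only the records of those books and the three needed features.
top_reviews = reviews.loc[reviews["book_id"].isin(top5), ["book_id", "rating", "year"]]

# Step 4: one column per book, one row per year (ratings summed within a year).
yearly = top_reviews.pivot_table(
    index="year", columns="book_id", values="rating", aggfunc="sum"
)

# Step 5: a single line chart with axis labels, a title and a legend.
fig, ax = plt.subplots()
yearly.plot(ax=ax, marker="o")
ax.set_xticks(list(yearly.index))
ax.set_xlabel("Year")
ax.set_ylabel("Sum of ratings")
ax.set_title("Yearly ratings of the top 5 books")
ax.legend(title="book_id")
```

Years in which a book received no reviews show up as gaps in its line, which is worth commenting on in the notebook.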

Histogram display [20 marks] 

Draw a histogram for the 5 separate datasets based on a date field.

You will need to chunk the date field, as you did for the line graph, into periods of time like years. (You can use the year feature that you have already created.) Combine the 5 datasets into a stacked type histogram.

Use whichever you prefer from either matplotlib (matplotlib.pyplot.hist), pandas (pandas.DataFrame.plot) or seaborn (seaborn.histplot).
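
As one hedged illustration of the matplotlib route, the sketch below stacks the yearly review counts of three toy datasets. The `source` and `year` column names and the values are assumptions; with the real data there would be five sources.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy combined frame with a dataset identifier and the derived year feature.
combined = pd.DataFrame({
    "source": ["poetry"] * 3 + ["comics"] * 2 + ["history"] * 3,
    "year": [2014, 2015, 2015, 2015, 2016, 2014, 2016, 2016],
})

# One array of years per dataset, in a fixed order so the legend matches.
sources = sorted(combined["source"].unique())
years_by_source = [
    combined.loc[combined["source"] == s, "year"].to_numpy() for s in sources
]

# One bin per calendar year, centred on the year value.
first, last = int(combined["year"].min()), int(combined["year"].max())
bins = [y - 0.5 for y in range(first, last + 2)]

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(years_by_source, bins=bins, stacked=True, label=sources)
ax.set_xticks(range(first, last + 1))
ax.set_xlabel("Year")
ax.set_ylabel("Number of reviews")
ax.set_title("Reviews per year by dataset")
ax.legend(title="Dataset")
```

With `stacked=True` the returned counts are cumulative, so the last array holds the total number of reviews per year across all datasets.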

 Sentiment analysis [15 marks] 

A code sample of a possible sentiment analysis implementation is given below. Please note, this is based on another dataset, so you will have to change the feature names. You can either use this supplied code example, or any other that you can find on the web. It must do Sentiment Analysis though. 

  1. Get the code running, show and explain the output;
  2. Briefly explain what sentiment analysis is;
  3. Discuss how you can use sentiment analysis to make best use of the book “review_text” feature in the dataset(s).
  4. Can you see any problems with using sentiment analysis, including any issues you come across while running it on your dataset(s)?
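
The supplied code sample is not reproduced here. To make the idea concrete, the sketch below is a deliberately tiny lexicon-based scorer written for illustration only: the word lists, the `polarity` function and the thresholds are all hypothetical, and a real notebook would more likely rely on an established sentiment library.

```python
import re
import pandas as pd

# Hypothetical mini-lexicons; real sentiment tools ship far larger ones.
POSITIVE = {"good", "great", "loved", "wonderful", "enjoyable"}
NEGATIVE = {"bad", "boring", "hated", "dull", "awful"}

def polarity(text):
    """Score in [-1, 1]: (positive - negative) / matched words, 0.0 if none match."""
    words = re.findall(r"[a-z']+", str(text).lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0

reviews = pd.DataFrame({"review_text": [
    "Loved it, a wonderful story",
    "Boring and dull",
    "It arrived on Tuesday",
    "Not good at all",
]})

# Score every review, then bin the scores into three sentiment labels.
reviews["polarity"] = reviews["review_text"].apply(polarity)
reviews["sentiment"] = pd.cut(
    reviews["polarity"],
    bins=[-1.01, -0.05, 0.05, 1.0],
    labels=["negative", "neutral", "positive"],
)
```

The last review, "Not good at all", is scored as positive because this toy scorer ignores negation. That kind of misclassification is exactly the sort of problem the final question asks you to look for.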

Uploaded by: Katthy Wills
Posted on: November 9th, 2022