# Visualization and data processing Assessment

Order Code: 468303
• Subject Code: INF20024
• University: Swinburne University of Technology
• Country: Australia

## A. Visualization: Section Marks 10

Import the data oneworld.csv into R. Show the class of all variables. Submit code and output. The objective of this question is to explore the relationship between the variables GDP and Infant Mortality by region.

1. Create a new ordered categorical variable GDPcat with three categories such that the proportion of countries in each category is 40%, 40% and 20%, respectively, in increasing order of magnitude of the variable GDP. Name the three categories "Low", "Medium" and "High". Show your code and the solution.

Hint: Remove any missing observations. There is a function in R that helps find such percentage values for variables. Marks (4)

2. Use a visualization tool to display the relationship between GDP and Infant Mortality, stratified by Regions, on a single plot. You must use ggplot2 for visualization. For plotting you could use either the variable GDP or the variable GDPcat. Comment on your observations. Hint: There are multiple possible correct responses, but you must justify your choice. Note that your choice of the GDP variable would dictate the choice of the plot function. Marks (6)
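The 40%/40%/20% split in question 1 can be approached with `quantile()` and `cut()`. The following is a minimal sketch, not a model answer: the data frame `world` and its values are hypothetical stand-ins for oneworld.csv, and the column names `Country` and `GDP` are assumptions about that file.

```r
# Hypothetical toy data standing in for oneworld.csv (values and names are assumptions).
world <- data.frame(
  Country = paste0("C", 1:12),
  GDP = c(500, 1200, NA, 3000, 4500, 800, 15000, 22000, 950, 7000, NA, 40000)
)

# Drop rows with a missing GDP value, as the hint suggests.
world <- world[!is.na(world$GDP), ]

# quantile() returns the cut points at the 0%, 40%, 80% and 100% levels.
breaks <- quantile(world$GDP, probs = c(0, 0.4, 0.8, 1))

# cut() bins GDP into ordered categories; include.lowest keeps the minimum value.
world$GDPcat <- cut(world$GDP, breaks = breaks,
                    labels = c("Low", "Medium", "High"),
                    include.lowest = TRUE, ordered_result = TRUE)

# Check the proportion of countries falling in each category.
prop.table(table(world$GDPcat))
```

With real data, ties at a cut point can make the observed proportions differ slightly from the targets, so the final `prop.table()` check is worth keeping.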

## B. Data Processing – I: Section Marks 15

1. Write an R function to identify the proportion of missing observations in a variable or column of tabular data. Marks (4)
2. Implement this function across all variables of the dataset airquality. This dataset is available with base R. Show your code and the result, that is, the proportion missing in each variable. Marks (2)
3. Specify a variable from this dataset that you would select for univariate missing value imputation. Hint: Justify your variable choice based on the count or proportion of missing observations, noting that univariate imputation reduces the natural variation of a variable. Marks (2)
4. Show your code, in base R, to replace all missing observations in the chosen variable with some value, based on the data. Justify the choice of replacement value. Hint: Read the appropriate section on your Weekly content page to perform this task. Marks (3)
5. Summarize the results after imputation and compare with the pre-imputed variable. Hint: In commenting you can use a descriptive statistics function or a visualization tool learned in this subject. Marks (4)
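The workflow in this section can be sketched in base R as follows. This is one possible approach under stated assumptions: the helper name `prop_missing` is hypothetical, the column `Solar.R` is picked purely for illustration (choosing and justifying the variable is part of the question), and the median is only one of several defensible replacement values.

```r
# Proportion of missing values in a single vector or column.
prop_missing <- function(x) {
  mean(is.na(x))
}

# Apply the function to every column of the built-in airquality data.
missing_by_var <- sapply(airquality, prop_missing)
print(round(missing_by_var, 3))

# Median imputation of one column (Solar.R is used here only as an illustration).
imputed <- airquality$Solar.R
imputed[is.na(imputed)] <- median(imputed, na.rm = TRUE)

# Compare the distribution before and after imputation.
summary(airquality$Solar.R)
summary(imputed)
```

Because every missing value receives the same number, the imputed variable has less spread than the original; comparing `sd()` or a pair of histograms before and after makes this visible.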

## C. Data Processing – II: Section Marks 20

Clinician scientists at Royal Melbourne Hospital are investigating fecal calprotectin (FC) as a non-invasive diagnostic alternative for Inflammatory Bowel Disease (IBD) and Acute Severe Ulcerative Colitis (ASUC). It would also contribute to standard of care. A subset of the dataset is provided as bowel.csv. The data has several missing values; the causes of missingness are often unknown.

In the following questions, show your R working and output.

1. In the bowel.csv dataset,
   - (a) Count the number of missing observations on the variable FC and in the overall dataset. Marks (2)
   - (b) Perform a univariate imputation on the variable FC. Your solution should include (i) code, (ii) result, and (iii) justification for the choice of imputation value you used. Hint: please consult the relevant topic page. Marks (3)
2. Select two variables that may have association with FC.
   - (a) Justify your choice using dplyr-based tool(s). Show your working in R. Marks (4)
   - (b) Use your investigation in (a) to replace all missing values in the variable FC. Show your code. Marks (6)
3. Present a comparison, with discussion, between the two types of imputation in the context of variable FC. You may use tools from topics in Weeks 2 and 3 for comparison. Marks (5)
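The two kinds of imputation contrasted in this section can be illustrated with dplyr. The sketch below is hedged: the data frame is a hypothetical toy stand-in for bowel.csv, the grouping column `Diagnosis` is an assumed name rather than a known column of that file, and the median is just one candidate imputation value.

```r
library(dplyr)

# Hypothetical toy data standing in for bowel.csv (column names are assumptions).
bowel <- data.frame(
  Diagnosis = c("IBD", "IBD", "IBD", "ASUC", "ASUC", "ASUC"),
  FC = c(100, NA, 300, 900, 1100, NA)
)

# Count missing values in FC and in the whole data frame.
sum(is.na(bowel$FC))
sum(is.na(bowel))

# Univariate imputation: one overall median replaces every missing FC.
bowel_uni <- bowel %>%
  mutate(FC = ifelse(is.na(FC), median(FC, na.rm = TRUE), FC))

# Conditional imputation: the median of FC within each Diagnosis group.
bowel_grp <- bowel %>%
  group_by(Diagnosis) %>%
  mutate(FC = ifelse(is.na(FC), median(FC, na.rm = TRUE), FC)) %>%
  ungroup()

# Group summaries help judge whether a variable is associated with FC.
bowel %>%
  group_by(Diagnosis) %>%
  summarise(median_FC = median(FC, na.rm = TRUE), n_missing = sum(is.na(FC)))
```

Comparing `bowel_uni$FC` and `bowel_grp$FC` side by side shows how the group-wise version keeps more of the between-group difference that a single overall value would flatten.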

## D. Text Analytics:  Section Marks 15

Mysterydocs.RData is a collection of unstructured text documents. The response to the questions below must include comments, wherever applicable. This question tests both implementation and conceptual understanding of data analytic tools that you have learned in this subject.

1. How many documents are there in the dataset? Justify your answer using an R function (show your code and result). Mark (1)
2. Using the methods of Week 5 Topic 2,
   - (a) clean this collection of texts and convert it into tabular data. You must use at least 5 cleaning steps, including stemming.
   - (b) show your code and the last six rows and first five columns (only) of the tabular data that you created. Marks (4)
3. Create a subset of the tabular data you constructed in Q.D2, retaining only those words that have occurred at least 200 times. Use a visualization tool to show the frequency distribution of words in this subset data. Hint: Select an appropriate visualization tool from your learnings of Week 3. Marks (3)
4. Use tools learned in this subject to quantify and depict any similarities between the documents.

   Please use the original tabular data that you created in Q.D2. You have to use an appropriate (relevant) weighting metric described on your content page. Hint: You have to use (at least) one quantitative measure and (at least) one visualization tool to justify your answer. For visualization of the similarity matrix you may use R functions such as levelplot() or image() or any other suitable plotting function. You would have to research the implementation of these functions. Show your code, results and comment on your findings. Marks (7)
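As an illustration of the similarity step, the sketch below builds a small document-term matrix in base R, applies TF-IDF weighting, and computes cosine similarity. Several things here are assumptions: the three toy documents stand in for Mysterydocs.RData, TF-IDF is assumed to be the weighting metric meant by the content page, and the cleaning steps (including stemming) are omitted to keep the example short.

```r
# Hypothetical toy corpus standing in for Mysterydocs.RData.
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the cat chased the mouse",
          d3 = "stocks fell as markets closed")

# Tokenise on whitespace and build a document-term count matrix.
tokens <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

# TF-IDF weighting: term frequency times log(N / document frequency).
tf <- dtm / rowSums(dtm)
idf <- log(nrow(dtm) / colSums(dtm > 0))
tfidf <- sweep(tf, 2, idf, `*`)

# Cosine similarity between every pair of documents.
norms <- sqrt(rowSums(tfidf^2))
cos_sim <- (tfidf %*% t(tfidf)) / (norms %o% norms)
round(cos_sim, 2)

# Heat map of the similarity matrix using base graphics.
image(cos_sim, axes = FALSE, main = "Cosine similarity")
```

Documents that share no terms get a similarity of zero, while each document has similarity one with itself, so the diagonal of the heat map is a useful sanity check.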

• Posted on : October 04th, 2022