Exploratory Data Analysis DATA2001
Subject Code: DATA2001
Exploratory Data Analysis (EDA) process
EDA¹ is about building an understanding of your data. It involves a systematic process that applies all the theory and concepts you have learnt in this module. According to Peng and Matsui (2017)², the goals of EDA are:
- To determine if there are any problems with the data (e.g. missing values or duplicates).
- To determine if the question you are asking can be answered with the data you have.
- To develop a sketch of the answer to your question.
The EDA process³ covers the following steps:
- Identify the question(s) to be addressed/explored: this involves asking why you are doing EDA, what questions you are hoping to answer, who will be using your answers, for what purpose, and how. Always start with a question or questions; do not expect insights to emerge from the data without questions guiding your analysis.
- Learn about your dataset (i.e. variables/features, types, instances/samples, etc.) before you even start applying any data analysis methods. Without knowing what the variables and instances in your dataset represent, you cannot expect to reach any meaningful answer to the question(s) driving the data analysis. Sometimes it is even necessary to understand the process by which the data was created (see footnote 3 below). In this step you ought to investigate whether there are any problems with the dataset that may have negative consequences for the analysis⁴. An in-depth knowledge of your dataset will help you identify a need to reformat some of the variables/features (e.g. from numerical to categorical) and/or derive new variables/features by combining the existing variables/features.
- Check for missing values and duplicates. Are there any missing values and/or duplicates, and what could explain them? Decide what you are going to do about missing values and/or duplicates.
- Generate univariate and bivariate descriptive statistics for your variables. Investigate those statistics (e.g. how the variables are distributed and the shape of their distributions, i.e. skewness, modality, kurtosis) before you move to the next step. The findings from this investigation should inform the next two steps.
- Create visualizations of your data (univariate and bivariate) to get a visual representation of how the data of all your variables are distributed. Use findings from the previous step to enrich your analysis. The findings of this step should help you generate hypotheses to be tested in the next step.
¹ John W. Tukey called EDA detective work and graphical detective work.
² https://bookdown.org/rdpeng/artofdatascience/
³ The EDA process outlined here assumes that the data required to address the question have been identified, mined, collected, prepared, cleansed, combined, and formatted so that they are in a rectangular form (where each column is a variable/feature, and each row is a complete and unique instance) ready for analysis (see Tidy Data.pdf). However, preparing data for analysis is often the most time- and resource-consuming step in the entire data analysis process. To learn more, check the following: link1 and link2.
⁴ For example, the dataset may violate some or all principles of Tidy Data (see Tidy Data.pdf).
- Formulate and test hypotheses, e.g. about all potentially informative associations between variables identified in the previous two steps, or any other hypothesis you may want to test (e.g. the statistical significance of correlations and associations between different variables).
- Discuss, compare, and synthesise your findings into an answer in the context of domain knowledge, practice, and reason. This step aims to make sense of and synthesise (blend or combine) all the findings from your analysis into an answer(s) to the analysis question(s) identified in Step 1.
- Share the answer(s) to your analysis question(s) using the agreed format. Make sure your answer(s) address all the questions raised by your target audience (see Step 1).
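As a rough illustration of the checking, descriptive-statistics, and hypothesis-testing steps above, the following Python sketch uses only the standard library. The dataset, the variable names (hours, score), and the helper functions are hypothetical and invented for this example. Dropping incomplete rows and using a permutation test are just one possible set of choices, not a prescribed method.

```python
import math
import random
import statistics

# Hypothetical toy dataset (invented for illustration): one dict per instance.
rows = [
    {"id": 1, "hours": 2.0, "score": 55.0},
    {"id": 2, "hours": 3.5, "score": 61.0},
    {"id": 3, "hours": 5.0, "score": 70.0},
    {"id": 4, "hours": None, "score": 66.0},  # missing value
    {"id": 5, "hours": 6.5, "score": 78.0},
    {"id": 5, "hours": 6.5, "score": 78.0},   # exact duplicate of the previous row
    {"id": 6, "hours": 8.0, "score": 85.0},
    {"id": 7, "hours": 1.0, "score": 50.0},
]


def count_missing(rows):
    """Number of missing (None) values per variable."""
    return {key: sum(r[key] is None for r in rows) for key in rows[0]}


def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def skewness(values):
    """Moment-based (population) skewness; 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(ssx * ssy)


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for H0: no linear association."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Check for missing values and duplicates.
missing = count_missing(rows)
unique_rows = drop_duplicates(rows)
# Assumed decision for this sketch: drop rows that still have a missing value.
complete = [r for r in unique_rows if None not in r.values()]

# Univariate descriptive statistics.
hours = [r["hours"] for r in complete]
score = [r["score"] for r in complete]
summary = {
    name: {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "skewness": skewness(values),
    }
    for name, values in (("hours", hours), ("score", score))
}

# Bivariate statistic and a test of its significance.
r = pearson(hours, score)
p = permutation_p_value(hours, score)

print("missing values per variable:", missing)
print("duplicate rows removed:", len(rows) - len(unique_rows))
print("summary:", summary)
print("Pearson r:", round(r, 3), "permutation p-value:", round(p, 4))
```

With so few instances, the result of any test should be read as a pointer for further investigation rather than as a conclusion. The decision about how to treat missing values and duplicates still has to be justified by what the data represent.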