Practical 1: Descriptive Statistics
Practical 1: Descriptive Statistics
Descriptive statistics are the first, and most important, stage in any quantitative analysis of data.
NOTE: SAVE YOUR WORK REGULARLY!
In this practical, you will begin your analysis of the Randolph Glacier Inventory dataset, gaining skills in basic data analysis and visualisation. The full RGI Version 4.0 can be accessed here: http://glims.org/RGI/index.html.
Your task today is to summarise and reduce the data, to find meaning in it and make sense of the numbers. Today, we will be analysing glaciers in Iceland. Note down your answers on the worksheet. Then, use your worksheet and Excel spreadsheet to complete the Moodle quiz for Practical 1. You should give all your answers to two decimal places.
Remember to bring these notes with you next week youll need them to complete Practical 2.
If you have any problems you can do two things:
Press F1 to access the rather useful Excel help function
Ask Amy or one of the demonstrators during the practical
Practical aims and objectives
To explore data on Icelandic glaciers and to use descriptive statistics to analyse the dataset;
To learn how to display the data;
To familiarise yourself with the basic functions of Excel,
If you do not finish the work today, you may complete it in your own time and seek help in the Helpdesk next week. You will need the results of Practical 1 before you can undertake Practical 2, so make sure you finish the practical before Practical 2. The deadline for the Practical 1 Moodle Quiz is the day of Lecture D2.
Task 1: Accessing and saving the data
Download the data from the Moodle page.
Save the RGIv4.0_Iceland_Alaska_WesternCanada_ArcticCanada _2020_2021.xlsx Excel document to your Y Drive or external USB drive and open it. Navigate to the Iceland tab at the bottom of the page.
You must save it to your Y Drive or a USB drive; the C: Drive is wiped when you log out and you will lose your data. Please note that you should have no spaces in the filename or worksheet names, as this can make Excel not function correctly.
Open the Excel spreadsheet and arrange your desktop so that you can edit this Word Document and the Excel spreadsheet at the same time.
Task 2: familiarisation with the data
Inspect the data. You will see that there are several tabs along the bottom of the page. Each worksheet has data on glaciers in one of 19 regions. These regions are identified in REF _Ref404852468 h * MERGEFORMAT Figure 1. Today, we will be working on glaciers in Iceland (Region 6). Click on each worksheet in turn and inspect the data. Note that the number of glaciers (and hence, number of rows) in each region varies.
Figure SEQ Figure * ARABIC 1. Regions of the Randolph Glacier Inventory. From Pfeffer et al., 2014
Figure 2. Glaciers in Iceland, mapped by the World Glacier Inventory.
Return to the Iceland tab in Excel. Ensure you are at the top of the spreadsheet. You will notice that the data are organised into columns, with a header at the top of the column ( REF _Ref404859303 h * MERGEFORMAT Table 1):
Table SEQ Table * ARABIC 1. The first 5 rows of the Icelandic data. Note that the data are arranged in rows and columns; each row represents one glacier with various specific attributes, such as slope, length and elevation (Z).
RGIID GLIMSID RGIFLAG BGNDATE ENDDATE CENLON CENLAT AREA ZMIN ZMAX ZMED SLOPE ASPECT Aspect_ cardinal LMAX GLACTYPE NAME
RGI40-06.00001 G336215E64817N 0 20020928 20020928 -23.7852 64.8174 4.903 624 1406 900 15 315 NW 3412 9099
RGI40-06.00002 G336231E64795N 10 20020928 20020928 -23.7686 64.7951 1.897 750 1394 1132 17.7 135 SE 1823 9099
RGI40-06.00003 G336196E64798N 0 20020928 20020928 -23.8043 64.7979 0.751 826 1410 1093 22.8 270 W 1244 9099
RGI40-06.00004 G336179E64804N 0 20020928 -9999999 -23.821 64.8036 0.067 830 950 877 17.8 270 W 334 9099
RGI40-06.00005 G336201E64805N 0 20020928 20020928 -23.7993 64.8046 0.978 923 1406 1140 16.1 315 NW 1960 9099
RGI40-06.00006 G336247E64800N 0 20020928 20020928 -23.753 64.8003 0.912 765 1339 1044 17.3 90 E 2009 9099
Each row represents one glacier, with all its attributes. So, glacier ID RGI40-06.00002 has an area of 1.897 km2, a length (LMAX) of 1823 m and a maximum elevation (ZMAX) of 1394 m above sea level.
The columns are each a particular attribute of each glacier:
RGIID is the region. It stands for: Randolph Glacier Inventory version 4.0., region 06 (Iceland), glacier number (#00001).
GLIMSID is the unique identifier given to each glacier. It uses their latitude and longitude.
RGIFLAG differentiates between glaciers and ice caps. It is a three-character code. Each character signifies something about the glacier. Nominal glaciers (character 1) are those whose outlines are nominal circles; no other attribute data exists. Character 3 describes the connectivity level for glaciers in Greenland. Glaciers that are physically detached from the ice sheet have a connectivity level of 0. A glacier is weakly connected if it is in contact with the ice sheet only at a well-defined divide in the accumulation zone, and strongly connected if the divide is indistinct in the accumulation zone and/or confluent with an ice-sheet outlet in the ablation zone.
So, if the RGIFLAG is 0, only character one is used and the glacier is Glacier or ice cap. If the RGIFLAG is 10, it is a Glacier and if the RGIFLAG is 11, it is an Ice Cap.
Value Character one (Status flag) Character two (form flag) Character three (connectivity flag)
0 Glacier or ice cap Glacier No connection
1 Glacier complex Ice cap Weak connection
2 Nominal glacier Strong connection
9 Not assigned Not assigned Not assigned
BGNDATE, ENDDATE are the date or image from which the survey was undertaken, in the form yyyymmdd. In this case, the 28th September 2002. When a single date is given in the source it is assigned to BgnDate. Missing dates are represented by -9999999.
CENLON, CENLAT are the longitude and latitude, in decimal degrees, representing the centre of the glacier.
AREA is the area of the glacier in km2.
ZMIN, ZMAX, ZMED is the minimum, maximum and median altitude (metres above sea level) for each glacier.
SLOPE is the mean slope of the glacier in degrees.
ASPECT is the mean aspect (orientation) (in degrees) for each glacier. Here, the average glacier aspect is categorised for you into the cardinal directions (N, NW, W, S, SE etc).
Aspect_cardinal here, I have written an IF statement that returns the cardinal direction of the glaciers (N, NW, etc). If you want to do this for any other region (for example, during your coursework), create a new column next to ASPECT in Column M and copy and paste the first cell (N2) in, and then drag it down.
LMAX is the maximum measured length of the glacier (metres). It is the longest surface flowline.
GLACTYPE is a description of the glacier. It is a four-digit type code following REF _Ref405555099 h * MERGEFORMAT Table 2 below. The first digit refers to the extent of snow around the perimeter of the glacier. The second digit refers to a glaciers calving environment (whether or not it terminates in the sea, on land, or in a lake). The third digit refers to whether the glacier has evidence of surging. The fourth digit refers to the ice divides. So, if a glacier has the glacier code 9099, then snow, surging and ice divides are not specified, and calving is normal. If it has the code 9199, then it is a tidewater glacier, calving icebergs into the ocean. If it is 9599, the glacier terminates in a floating ice shelf (see: http://www.antarcticglaciers.org/glaciers-and-climate/shrinking-ice-shelves/).
Table SEQ Table * ARABIC 2. Glacier Type code.
Snow Terminus type Surging Divides
Character 1 Character 2 Character 3 Character 4
0 normal Normal normal normal
1 hides 5-50% of perimeter Tidewater reported uncertain
2 hides > 50% of perimeter Freshwater signs compound
3 perennial snowfield Dry signs & reported ice cap
4 seasonal snowfield Regenerated - -
5 Shelf-terminating 8 specific remarks specific remarks specific remarks specific remarks
9 not specified not specified not specified not specified
NAME is the name of the glacier. Nearly all glaciers have no name.
Return to the Iceland tab and answer the following questions:
What is the area of glacier ID RGI40-06.00004?
What is the length of glacier RGI40-06.00002?
What is the slope of glacier RGI40-06.00017?
Task 3: Data description in Excel
You will notice that the table is massive and it would be difficult for you to describe the glaciers of Iceland from it. How would you know which is the largest or smallest? We also need to calculate the central tendency of the glaciers (what is the average size?) and the dispersion (how spread out is the data?). Think back to your lecture and take another look at your lecture notes to remind yourselves about these concepts.
If we start with the mean we need the formula:
However, rather than first summing the data (adding together) and dividing this by the number of data points, we can shortcut using Excel. We will now calculate the mean area for glaciers in Iceland.
Step 1: calculating the central tendency
What kind of data (ordinal, nominal, continuous, discrete, ratio, interval,) are the glacier areas? Explain your answer.
What kind of data are the glacier aspects? Explain your answer.
In an empty cell below the data in column G (G571) type in mean as a label. In the adjacent cell (i.e. H571) type in the formula (below) for calculating the mean glacier area and press Enter: =AVERAGE(H2:H569)
Youll notice that the range of numbers you typed in are highlighted with a blue box; you can use this to check that you have entered the correct cell references.
Now calculate the median glacier area in the cell below. Remember to label it in the cell to the left (column G) so that you can refer back to it later. Hint: =MEDIAN(H2:H569)
Now calculate the modal glacier area in the cell below. Remember to label it! Hint: =MODE(H2:H569)
We now have the three measures of central tendency calculated for us in Excel. Why are they different? What do the three different methods measure again? Describe what these three factors measure in the box below.
Mean:
Median:
Mode:
Step 2: Calculating the dispersion
We will now calculate the dispersion or spread around the central value of the glacier area data. Excel can do this quickly and easily.
What is the standard deviation? What does it measure?
Description:
Try writing your own formula to calculate the standard deviation in glacier area for glaciers in Iceland, using the =STDEV.S() function in Excel. Remember to use the = first and then enter the formula and cell references.
Standard deviation =
NOTE!
STDEV.S calculates standard deviation using the "n-1" method. STDEV assumes data is a sample only. This is the more usual method to use. For more information, see your Cheat Sheet.
When data representsan entire population, useSTDEVP orSTDEV.P. This divides the variance by n rather than n-1.
Now calculate the range of the data. There is no quick formula that does this in Excel so we need to do this manually.
The range is the maximum minus the minimum values in a dataset. If we calculate these values in Excel and take the smallest from the largest then we have the range. Remember to label all your values!
Use the formula =MAX() to calculate the maximum value in a dataset.
Use the formula =MIN() to calculate the minimum value in a dataset.
Then, in a new cell you must create your own formula to calculate the range in Excel. Label this with Range. This formula will always start with an = but then will be followed by the maximum value minus the minimum.
In Excel we can refer to the cells rather than the numerical values. For example, if your Maximum and Minimum values were in cells H575 and H576 respectively, to calculate the range you could type =H575-H576 in a new cell and Excel will calculate the answer. Alternatively, type = then highlight cell H575, then type - , highlight H576, and finally press Enter. Remember to label your answer in the adjacent cell in column G.
Use your results so far to complete REF _Ref405555466 h * MERGEFORMAT Table 3 below.
Table SEQ Table * ARABIC 3. Ranges for glacier area for glaciers in Iceland.
Mean glacier area (km2) Mode glacier area (km2) Median glacier area (km2) Standard deviation glacier area (km2) Maximum glacier area (km2) Minimum glacier area (km2) Range of glacier area (km2) Now the quartile deviation for glacier area for glaciers in Iceland should be calculated. For more information on Quartiles, see: https://www.excelfunctions.net/excel-quartile-exc-function.html
The formulas in Excel for calculating the interquartile range and quartile deviation are as follows:
Interquartile Range (IQR): (=Q3-Q1)
Quartile Deviation (QD): (=(Q3 Q1)/2)
Now we will calculate the IQR and QD for the Icelandic glaciers.
You will need to determine the upper and lower quartiles as part of this formula (Q3 and Q1)
Use the =QUARTILE.INC() function in Excel. When you type in =quartile.INC( you will note that Excel pops up =QUARTILE.INC(array,quart) next to the cell. This is Excel telling you what you need to fill in next. Hint: The 3rd Quartile would be: =QUARTILE.INC(H2:H569,3)
Remember:
The 1st quartile is the same as the 25th percentile;
The 2nd quartile is the same as the 50th percentile, or the median;
The 3rd quartile is the same as the 75th percentile;
The 4th quartile is the same as the maximum value.
Hint: array means the cells that want to calculate the quartile of (type in the references).
Quart means the quartile you wish to calculate (to determine what you need to type here click on QUARTILE in the pop up next to the cell and Excel will define values for Upper and Lower Quartiles). Type the cell references you want and dont forget to separate them from the Quartile with the comma. Then, as always, press Enter.
Calculate the Interquartile Range and quartile deviation. Use parentheses to calculate the upper part of the equation (interquartile range, Q3-Q1). In the row below, use a new formula to obtain the quartile deviation using the / symbol as divide by. Remember to label your results!
Use your results to complete the table below.
Table SEQ Table * ARABIC 4. Quartiles for glacier area for glaciers in Iceland.
1st quartile 2nd quartile 3rd quartile 4th quartile Interquartile range Quartile deviation Interquartile ranges are often shown on box plots ( REF _Ref58168968 h Figure 3):
Figure 3. Example of a box plot.
Excel has a function that can draw a box plot:
Select the Iceland glacier area data.
In Excel, click Insert > Insert Statistic Chart > Box and Whisker:
A box plot of Icelands glacier area will appear, as below:
As Icelands glacier area is very skewed, there will be many outliers, and the box plot will be difficult to see, it will appear as dots seen above. Therefore, we may want to exclude outliers.
Right-click one of the boxes/dots on the chart to select these and then, on the shortcut menu, click Format Data Series.
In the Format Data Series pane, with Series Options selected, make the changes that you want.
150373113678140026552151246759Untick show outlier points
0Untick show outlier points
Your box plot may now look something like this:
RECAP
Using your box plots, descriptive statistics, and interquartile ranges, what have you learned about the glaciers in Iceland?
We now need to calculate what the total glacierised area of Iceland is. Remembering to label your formula in the adjacent cell, calculate the total glacierised area using the =SUM() function. Note down the answer below.
Now use the =COUNT() function to count the number of glaciers. This function counts the number of cells with numbers in. Note down the answer below.
Step 3: Calculating the descriptive statistics
Finally, you can use Excel to calculate many measures of descriptive statistics in one quick step. This is a very useful function but will probably require you to update Excel and add some add ins. Follow this step-by-step process:
Click the file on the top left of the Excel window and select options
Select add ins, then Excel add ins and select go. A dialogue box like that in REF _Ref404861275 h * MERGEFORMAT Figure 4 should appear.
24544312976077
Figure SEQ Figure * ARABIC 4. Excel Add-ins
Select Go at the bottom left of the dialogue box (red arrow in Figure 2).
Ensure Analysis Toolpak and Analysis Toolpak VBA are ticked and click OK.
If these were not ticked you may need to restart Excel save your work first!
Now go to the Data tab in the Excel window and select Data Analysis
Note that there are lots of options for statistical applications here. The one you want is Descriptive Statistics. Click OK.
4389491324533
Figure SEQ Figure * ARABIC 5. The Data Analysis tool (highlighted with red arrow). Select the Descriptive Statistics from the box that opens.
The Input Range is the range of data to be analysed. Select all the glacier area data with your mouse. If you include the title then make sure the Labels in First Row box is ticked. Select an output cell in an empty area below your data (so that it wont overwrite your work) using the Output Range box and tick the Summary Statistics box.
Your selections should look like image below. Click OK and the results should display descriptive statistics, most of which you should be familiar with.
Figure SEQ Figure * ARABIC 6. Descriptive Statistics dialogue.
This is a good check of your results! Are all of the results in the descriptive statistics the same as those that you calculated yourself? If not, why not? Go back and inspect your cells to see if you have made any mistakes.
What have you learned about the distribution of your data?
Task 4: Histogram of your data
We will now make a graph showing the distribution of glacier sizes. We will use a histogram to do this.
First, we must bin the range of values into a series of small intervals. We will then count how many values fall into each bin. Excel will plot a chart with height determined by count and width determined by bin size.
Bins are consecutive, non-overlapping intervals of a variable (in this case, glacier area). The bins must be adjacent and of equal size. The histogram is drawn so that they touch each other, indicating that the data are continuous.
The glacier area data are very skewed, because most glaciers are very small. We will therefore conduct this analysis on a subset of glaciers: those below 10 km2.
First, write out your bins in 0.2 increments. You can do it automatically:
Write 0.2. In the cell below, write =[cell reference]+0.2, where [cell reference] is the cell where you just wrote your first value (0.2). You will note that this new cell now outputs 0.4. Click and drag the bottom corner of this cell to create an Array (a list of numbers) from 0.2 to 10, in 0.2 increments (see REF _Ref404865462 h Figure 13).
Next, use the histogram tool in the Data Analysis toolpack. Select histogram. The input range should be the full glacier area array (H2:H569). The Bin range is the array of bin values that you created. The Output Range should be adjacent to the label Histogram of your Bin Range. Your dialogue box should look similar to REF _Ref404865462 h * MERGEFORMAT Figure 13. Make sure that Labels isnt checked if you arent including labels.
Figure SEQ Figure * ARABIC 13. Histogram tool in the Data Analysis tool packSelect OK and you will have two arrays: the Bin and the Frequency. This is the number of glaciers with an area in that particular bin interval.
Ensure that nothing is selected. Go to Insert-Column-2D Column. A blank column chart will appear.
Under the new Chart tools that appears in your menu bar, choose Design Select data. This will select the data arrays to be used in generating your histogram ( REF _Ref404865662 h * MERGEFORMAT Figure 9).
Click Add on the dialogue and give it a series name, Glacier area. For the Series Values, type in the range for your Frequency array (for example, $H$615:$H$666).
Next, edit the Horizontal Category x axis labels. These should be your bin intervals.
Figure SEQ Figure * ARABIC 14. Editing the data source for a histogram
Select OK and your histogram should appear!
Under Chart Tools Chart Layouts, there are pictures of quick layouts. Select Layout 8, which has continuous bars.
Under the Layout tab under Chart Tools, use the dropdown menus to tidy up your graph. Remove the legend and add axis labels. Your histogram might look similar to the one below:
Figure SEQ Figure * ARABIC 15. Histogram of glacier area for Icelandic glaciers.
Refer back to the skew and kurtosis values for glacier area that you calculated earlier for your Descriptive Statistics. What does the histogram tell you? What about the skew and kurtosis of the data? What have you learned?
Skew:
Kurtosis:
What have you learned from your histogram?
Is the box plot, bar chart or the histogram better for displaying the data? What do they each tell you that the other does not?
Moodle Quiz
Once you have completed this practical, use your work to complete the quiz on Moodle. This is worth 1.5% of your final mark.
This must be completed online before Practical 2 but I would encourage you to do it this week so the work is fresh in your mind. Feedback is automatically generated when the quiz closes.
To access the online exercise log in to the GG1011 page in Moodle and find Week 21. Open up the quiz Practical 1 Moodle Quiz and follow the instructions. Note that this is timed! You get two attempts with the highest counting. Feedback will be available immediately after the quiz closure time.
The exercise is self-marking. If you feel the program has not marked your answer appropriately then please contact the course tutor.
Practical 2: Statistical difference tests using ExcelNOTE: SAVE YOUR WORK REGULARLY!
In this practical, you will analyse glaciers in Alaska thats Region 1 in REF _Ref404869786 h Figure 1 below. You will be comparing these glaciers to those in WesternCanadaUS and ArcticCanadaNorth (Regions 2 and 3 respectively). Theyre all in North America, but the characteristics of the glaciers are very different.
Note down all your answers on the Worksheet. Use your spreadsheet and Worksheet together to complete the Moodle quiz for Practical 2. Remember to save your work to your Y drive and also to email it to yourself so that you have it at home you will continue to use the same Excel spreadsheet next week.
Please note that you should have no spaces in the filename or worksheet names, as this can make Excel not function correctly.
Figure SEQ Figure * ARABIC 1. The World Glacier Inventory. Regions are indicated by boxes. Glaciers are highlighted in red.
Aims and Objectives
To calculate confidence limits for sampled data to indicate the accuracy of our sample mean with respect to the population mean.
To test whether TWO or more sample means are taken from different populations using the students t-test.
To gain experience in using inferential statistics in Excel.
Task 1: Exploratory data analysis: Alaskan Glaciers
You will be working on Glaciers that have been mapped in Alaska. The data are downloaded from the World Glacier Inventory.
Figure SEQ Figure * ARABIC 2. Glaciers mapped in Alaska in the World Glacier Inventory.
Step 1: Descriptive statistics
Use the same Excel file (saved to your Y drive) that you were working on last week. Navigate to the Alaska tab.
As with any dataset, the first thing to do is to make some initial descriptive observations.
Obtain the full set of descriptive statistics for the glacier median elevation (ZMED) for glaciers in Alaska. As last week, use the Descriptive Statistics function in Excel. This time make sure that you check the box Confidence Level for Mean 95% and summary statistics. Output your results in a clear area to the right of your dataset. Well keep the area under the data clear, in case we want to sort the data.
Task 2: Confidence intervals about the Sample Mean
Background
Confidence Intervals are used to assess the accuracy of our sample mean with respect to the likely population mean. In this case, we have observations of glaciers in a particular region (e.g., Alaska). However, these observations are subject to uncertainty and errors, and some glaciers may have been missed, been too small to count or mistaken for snow patches. Alternatively, some debris-covered glaciers may have been misclassified as ice-free ground.
We have calculated a sample mean for glacier median elevation in Alaska. Confidence intervals consist of a range of values (interval) that act as good estimates for the unknown population mean (in this case, the true glacier mean area).
The level of confidence (often 90%, 95% or 99%, most commonly 95%) of the confidence interval indicates the probability that the confidence range (interval) captures the true population mean given a distribution of samples. It does not describe any single sample.
Note: the level of confidence (e.g., 95%) does not predict the probability that the area for a newly observed glacier would be within the confidence interval. It does mean that if we went out and measured many more glaciers, 95% of these glaciers would be expected to be within the confidence interval.
We can compare confidence intervals. We may be interested in the difference between the means of two populations and whether we can infer from samples from the populations that the means are different.
Calculating confidence intervals
We will compare confidence intervals for glacier median elevation for three regions: Alaska, WesternCanadaUS and ArcticCanadaNorth.
-47625416368CI=xtsn00CI=xtsnConfidence intervals: Here you will be calculating the 95% t-confidence intervals for glacier data. The formula for confidence intervals (CI) is:
where
x = the mean,
t = the t statistic
(i.e. corresponding to your dataset at 95% confidence level - this is the number of standard deviations (1.96, see REF _Ref58175632 h Figure 5) either side of the mean needed to include 95% of the data. In the lecture we used the z statistic, which is similar but used for smaller numbers)
sn is the standard error (standard deviation over the number of observations).
Figure SEQ Figure * ARABIC 5. The 68-95-99.7 rule says that 95.45% of the data are within 2 standard deviations of the mean within a normal distribution. 95% of the data are within 1.96 standard deviations of the mean.
Remember that there are two values for confidence intervals an upper and lower bound as denoted by the symbol in the formula above.
Step 1: Confidence intervals
The easiest way to obtain confidence intervals is via the descriptive statistics function in Excel that you used earlier. Notice that in the box of descriptive statistics you calculated earlier there is a cell giving a Confidence Level (95%). This is the part of the equation above equal to:
CL= tsn
Calculate the Descriptive Statistics for Glacier ZMED for Arctic Canada North and WesternCanadaUS. Remember to update the cell references each time there are very different numbers of glaciers in each region! Output your results to the top right of your data in each worksheet (e.g., cell R1).
Insert a new worksheet (right click on the Alaska Tab and select Insert). Name your new Worksheet Confidence Intervals.
Copy and paste your three Glacier ZMED Descriptive Statistics tables into your new worksheet. Label each regions descriptive statistics.
Note that the 95% confidence level values that Excel gives are always close to 1.96 x the standard error.
For more information on the confidence interval, go here: https://www.mathsisfun.com/data/confidence-interval.html
Step 2: Confidence intervals
Give the range of confidence interval values where we can be 95% certain that this range includes the population mean of ZMED for each dataset. This is basically the Mean + the confidence level and the Mean - the confidence level, and is expressed as a range, e.g., 2109.02 to 2123.32 m above sea level. Use your answers to complete REF _Ref405878012 h Table 3 below:
Table SEQ Table * ARABIC 3 Results for calculation of confidence intervals for glacier ZMED. Give your answers to 2 d.p.Alaska Western Canada US Arctic Canada North
Mean (m) 1573.30
Standard error 2.99
Standard deviation 492.15
Confidence Level (95%) 5.86
Confidence interval (m) 1567.44 to 1579.16
What do the confidence intervals tell you about the likely difference between the true glacier median elevations for glaciers in different parts of North America?
Can we say that there is a statistical difference in glacier median elevation for glaciers in different regions of North America? Do the confidence intervals of the mean glacier median elevation overlap? What does this mean?
Note how the narrowest sample distribution gives the most precise estimate of the population mean.
Step 3: Bar chart with CI error bars
We will create a bar chart of the means of the glacier ZMED, using the 95% confidence levels as error bars.
First, arrange your data in a table to make it easy to plot a bar chart. I recommend arranging like below:
Region Mean ZMED Confidence level (95%)
Alaska WesternCanadaUSArcticCanadaNorthPopulate your table with your calculated data. You could use a cell reference to do this quickly and easily.
Use the Insert-Column tool to generate a bar chart of the means of glacier ZMED.
Under Chart Tools > Design, select the error bars function under Add Chart Element ( REF _Ref787590 h Figure 6).
Figure SEQ Figure * ARABIC 6. Adding error bars to a bar chart
Select More Error Bars Options as shown in the figure above and specify the Error Amount to be Custom.
Select the positive and negative error bars to be the values from your Confidence Level (95%) column.
Give your bar chart axes appropriate titles and remove the legend. Your bar chart might look like the one above.
Step 4: Interpreting the dataBased on these descriptive statistics, do you think that there is a difference in the median elevation of glaciers in these three regions? Is it significant or down to change? How sure can you be?
Task 3: Difference between sample means - significance testing (t-test)
Background
We now want to go one stage further to ask whether there is any statistical difference between glacier median elevation in the three regions. We can use a t-test to investigate this. T tests are parametric tests that investigate the difference between samples means. Some assumptions to remember:
T-tests assume that both samples are normally distributed, and are ratio or interval data.
T-tests assume that both samples have the same standard deviation (one sample is simply shifted relative to the other). If this is violated, we should assume unequal variances when we run the t-test.
If these assumptions are violated, then the test result may be uncertain. Some things to remember:
Slight violations are probably unimportant;Violations of unequal variance (standard deviation) are worse than violations of normality. However, we can check the assume unequal variances when we run the test if the standard deviations are different.
There are two types of t-test: Paired and Independent.
Paired t-tests (dependent) are for paired samples.
This is when measurements in one sample can be sensibly paired with measurements from another sample.
This might be because they are repeated measurements, or because there is some other way of joining them (e.g., IQ of older and younger brothers; measurements on the same glaciers in different decades).
Independent t-tests are for when there is no sensible way to pair off the measurements.
Secondly, the type of effect that you expect to find will influence your choice of test. One-tailed tests are to be used if you expect to find an effect in a certain direction. You would use a one tailed test if:
The first mean is larger than the secondThe first mean is smaller than the secondTwo-tailed tests are to be used if you dont know if one mean will be larger than another, but you are looking for any difference. You therefore expect to find that the first mean will be different from the second in either direction.
We therefore conduct simple descriptive statistics first before embarking on more complex t-tests; this will help us to decide whether the tests are appropriate and which kind of test to use.
You use a t-test to test a hypothesis. We first need to construct a null hypothesis (H0), which the experiment is designed to test.
We will also have an alternative hypothesis (H1), which might be that there is a difference. For example, if we were measuring the heights of pine trees growing in two different regions with different soils, we might have the following hypotheses:
H0: there is no difference in height between the two locations.
H1: Trees in location X are larger than trees in location Y.
This could be summarised as:
H0: 1 - 2 = 0
H1: 1 - 2 0
In the above example, we would select an Independent t-test, as there is no way to pair the samples, and a one-tailed test, as we expect mean tree height in location X to be larger than in location Y.
Our initial descriptive statistics tell us that we have not violated any assumptions, because tree height is normally distributed and not skewed, and that the standard deviations are similar.
Step 1: Distribution of the data
T-tests assume that the data are normally distributed. First, lets see if thats the case.
Referring back to your calculations in Task 3, what is the mean, standard deviation, skew and kurtosis for glacier median elevation in all three regions (Alaska, WesternCanadaUS and ArcticCanadaNorth? Note it down in REF _Ref405558810 h Table 4.
Table SEQ Table * ARABIC 4. Results for Glacier Median Elevation for three different regions of North America. Give your results to 2 d.p.Mean (m) Standard deviation Skew Kurtosis Count Min (m) Max (m)
Alaska WesternCanadaUSArcticCanadaNorthThe distribution of median elevations of glaciers in these regions can also be observed in a histogram ( REF _Ref441489168 h Figure 7). I have made one here for you to study. If you like, you can try and reproduce it, using the notes from Practical 1, or using the Histograms tool in Insert Statistic Chart (see REF _Ref787822 h Figure 12).
Figure SEQ Figure * ARABIC 7. Overlain histograms of glacier median elevations from three different regions.
Look at your descriptive statistics and at the histogram in REF _Ref441489168 h Figure 7. What have you learned about the skew and kurtosis of the data? Note down your observations here.
Now consider the assumptions listed in the background information for t-tests above. Does Median Elevation (ZMED) for these three regions violate these assumptions?
Step 2: T-tests on the glacier median elevation
You are now going to use Excel to undertake t-tests on these samples. The first thing to do is set up hypotheses. We want to see if there is a real difference in the means of glacier median elevation in Alaska and ArcticCanadaNorth. Set up and write down null and alternative hypotheses for your 1st test.
Use the Confidence Intervals you calculated above to help you define your hypotheses.
Null Hypothesis (H0):
Alternative Hypothesis (H1):
We are going to use the t-test to test between these samples in any case. Based on the conditions (see lectures notes) for use of different types of t-test, which form should you use and why?
Will you assume equal or unequal variance?
T-tests can be awkward when run between worksheets, so I recommend copying and pasting the ZMED columns into a new tab in your spreadsheet to run the t-tests. Add a row at the top and give both columns a name, so that you remember which variable is which.
You may have noticed that in the Data Analysis box (see first practical) there are options for t-tests. Open the box and select the correct type of t-test (two-sample test assuming unequal variances). Input the correct data to the variable ranges for Alaska and ArcticCanadaNorth (youll need to switch between the work sheets), select your chosen alpha (0.05 corresponds to a confidence level of 95%; confidence level plus alpha = 1) value and an appropriate output cell (choose a clear cell in your Confidence Intervals worksheet).
Question: What will your hypothesised mean difference be? Hint: remember what we are testing with the null hypothesis (i.e., that the two samples have the same mean).
Step 4: Interpreting the resultsExcel will now give you an output from the test. The left hand side of the box produced requires some explanation ( REF _Ref442942813 h Table 5). If the p-value is less than the alpha (the risk youre willing to take on making a wrong decision), then you reject the null hypothesis. If the p-value is greater than the alpha, then we fail to reject the null hypothesis. Remember:
If the p-value is low, the null must go!
If the p-value is high, the null will fly!
The confidence interval and the p-value will always lead you to the same conclusion. If the p-value is less than alpha (i.e., it is significant), then the confidence interval will NOT contain the hypothesised mean. The p-value is normally given as less than or greater than the confidence level: e.g., p<0.05. If the p-value is 0, this is due to the limited accuracy of the software and indicates that the p-value is extremely low. Just give it as p<0.05.Table SEQ Table * ARABIC 5. Interpreting the results from the t-test
Mean The mean of the two variables will be given. Check against your Descriptive Statistics. Are they the same?
Variance Sample Variance in the Descriptive Statistics
Observations Count in Descriptive Statistics
Pooled Variance Hypothesized Mean Difference If your null hypothesis is that the means are the same, this should be 0.
dfDegrees of freedom
t Stat t Stat stands for t-calc in the lectures: the calculated value of t for this test (i.e. the number of pooled standard deviations the two sample means are from each other).
The probability density curve for the t-distribution is symmetric. The area to the left (negative value) is the same as the area to the right (positive value). A negative t stat indicates that your sample mean (in this case, Alaska) is too small to have been drawn from your hypothesised population (in this case, Western Canada US). A positive t stat indicates that the sample mean is too large to have been drawn from your hypothesised population.
A larger value for t stat (positive or negative) indicates that there is less likelihood that the sample results are a result of chance. A large t stat indicates that the null hypothesis should be rejected and the alternate hypothesis accepted.
Whether you reject or fail to reject the null hypothesis depends on whether the t-Stat value is larger (in absolute terms) than the appropriate t-Critical value (choose the right one for your test). If the value for t Stat (positive or negative) is larger than the t-Critical value, then reject the null hypothesis.
P(T<=t) one-tail P(T<=t) means the probability that there is no significant difference between datasets; i.e. the probability of making a type 1 error.
If you are going to reject the null, this should be less than 5 or 1% (i.e. <0.05 or 0.01).
t Critical one-tail The t-critical value for a one-tailed t-test
P(T<=t) two-tail For a two-tailed t-test
t Critical two-tail The t-critical value for a two-tailed t-test
Compare your t Stat and t Critical values. Do you reject or accept your null hypothesis and why?
Remember, if our value for t stat is (positive or negative) larger than t-crit then we must reject. Please note that if its a negative value; this is fine! In most cases, we only care about the absolute value of the difference, or the distance from 0. It doesnt matter which direction.
What is your p-value? What does this mean?
Interpret your t-test. Is the t statistic small or large, +ve or -ve? What can you infer about the difference between your sampling means? See the lecture notes for the third lecture for help on this.
Moodle Quiz
Once you have completed this practical, use your work to complete the quiz on Moodle. This is worth 1.5% of your final mark. The deadlines are given in the Module D Handbook. You must complete it before the next practical. Deadline: 22nd February at 4:30pm.
I would encourage you to do it this week so the work is fresh in your mind. Feedback is automatically generated when the quiz closes.
To access the online exercise log in to the GG1011 page in Moodle. Open up the quiz Module D Moodle Quiz 2 and follow the instructions. Note that this is timed! You get two attempts with the highest counting.
The exercise is self-marking. If you feel the program has not marked your answer appropriately then please contact Amy.
Guidelines for the infographic coursework assignment
Deadline: 26th March, 12:15pm
Guidelines for the infographic coursework assignmentDesign anA3colourinfographic (300 words)explaining the characteristics of the Worlds glaciers using the Randolph Glacier Inventory, a dataset of all the Worlds glaciers. Include graphs, summary statistics and figures.
You should use the Randolph Glacier Inventory data that is available on Moodle. There are two spreadsheets; one we used in the practicals, and the second (RGIv4.0_worldwide_2020-2021.xlsx) that includes all the rest of the data. It is this second spreadsheet that you should use for your assessment. You shouldnot use any of the regions analysed during the practicals. Do not just reproduce the practicals!
Your infographic should be aimed at the general public (interested adults) and should highlight some interesting facts and figures about the Worlds glaciers.
Your infographic should both:
Part 1: Summarise the characteristics of glaciers in one particular region of the world (2/3 of the page), and;Part 2: Compare and contrast one glacier characteristic between this region and another different region (1/3 of the page).
Part 1 (50% of the infographic mark)You should devote abouttwo-thirdsof your infographics A3 page to conducting an analysis of all the glaciers in one region of your choice. Youshould not useany of the regions from the practicals. It should be a brand new region that you have not previously studied.
You can analyse any glacier characteristic (length, slope, elevation, area, latitude, glacier type etc.)
Part 1 of your infographicmustincludeat leastthe following five elements:
A bar chart (with error bars if appropriate, including the 95% confidence level)
A box plotSummary tables of selected descriptive statistics,
A histogram (commenting on skew, kurtosis, distribution, etc),
You can add additional charts to these if you wish, but you will not get extra credit; rather, focus on applying the tests correctly and without errors.
You can do a general analysis of all the glacier characteristics, or focus in on just one or two characteristics in more detail. It is up to you. I am keen for you to explore the data and to make original discoveries about it. If you want to include descriptive statistics about your second region here as well, you can do, and this can help set up Part 2 of the Infographic.
You could optionally include one of the maps or figures uploaded to the Moodle page, any interesting photographs or images you find of your glaciers, and anything else to engage and interest the reader.
You should use the words available and the figure captions to highlight key statistical information. This should demonstrate your understanding of the statistics that you have applied.
Part 2 (30% of the mark)You should devote aroundone-thirdof your infographics A3 page to comparing and contrastingone glacier characteristicbetween the region you analysed first and another region. You should usea t-test and should also includehistograms and a discussion of skew and kurtosisto justify your choice of statistical difference test.
You should state your null and alternative hypothesis, give your calculated and critical values, and state whether you reject or accept the null hypothesis.
Remember T-Tests can only be applied to normally-distributed datasets, you should refer to your skew and kurtosis values and a histogram to establish this as a key first step.
Poster Design (20% of the mark)20% of the marks are awarded for poster design, using the standard marking criteria from the UG handbook (see also Figure 1).
Word count / page limit:1 x A3 page (portrait or landscape), in colour.
Your infographic should have no more than300 wordsin total. You can include a few short statements or single paragraph highlighting interesting facts in the data for each element. Figure captions and table captions are not included in the wordcount.
You are not expected to read widely around the subject of glacier inventory techniques and data, but if you do, you may find some ideas about how to analyse your data. If you do some wider reading, include a bibliography at the bottom of your infographic.
Suggestions for creating infographics:First thoroughly analyse the data. What information are you going to show? Include summary statistics and interesting findings.
Catchy but descriptive title.
The figures and tables should have captions so that your infographic is easy to follow.
Remember to always label all your axes so that it is clear what each figure/graph shows!
Tell a story. Follow through to logical conclusion. What information are you trying to get across?
Draw by hand on paper first to design the layout.
Choose a limited colour palette. Limit to just 3 complimentary colours. Some colour palettes:http://www.colourlovers.com/palettesand this colour wheel is fun:https://kuler.adobe.com/create/color-wheel/Follow the Rule of Thirds - put a grid of thirds in the page.
Simple designs work best. No need for complicated graphics.
Use Publisher. They have some built-in guides to help you design your work. PowerPoint is also acceptable.
Assessment Marking CriteriaYour infographic will receive the following three marks:
Part 1 stats (50% of the mark)
Part 2 stats (30% of the mark)
Poster design (20% of the mark)
Your final mark for the A3 Infographic will be the sum total of these marks, scaled appropriately.
Marks will be awarded for:
Choice of appropriate data visualisation and statistical methods
Including all the requested statistical techniques
Correct application of statistical techniques
Visual appearance, layout and design
Appropriate labels and designs of charts, appropriately sized, with all axes clearly labelled.
Demonstration of clear understanding and knowledge
Spelling and grammar
Practical 3: Correlation and Regression
NOTE: SAVE YOUR WORK REGULARLY!
Practical Aims
To explore glacier aspect data
To investigate associations between variables
To gain experience writing equations in Excel
To start generating ideas for the assessed infographic
The practical
Open the Excel spreadsheet that you have been working on for the last two practicals. Rename it Practical_3 or similar. We will continue working on the Randolph Glacier Inventory. You will use some of the techniques you learned in this practical, and practicals 1 and 2, to create your Infographic. Make sure there are no spaces in the filename.
Task 1: Exploratory data analysis: Alaskan Glaciers
You will be working on Glaciers that have been mapped in Alaska. The data are downloaded from the World Glacier Inventory.
Figure SEQ Figure * ARABIC 2. Glaciers mapped in Alaska in the World Glacier Inventory.
Step 1: glacier aspects
We want to work out what the orientation (aspect) of glaciers in Alaska is. You will see that each glacier has an average aspect this is the main orientation of the glacier.
Figure SEQ Figure * ARABIC 3. Compass directions
What type of data is Aspect?
What implications does this have for data analysis?
Note that aspects are an example of circular statistics, so we cannot calculate mean, median, mode or other descriptive statistics. Aspect can be a very important control on glacier size, because it determines how much insolation energy the glacier receives. In the Northern Hemisphere, glaciers facing North receive less sunlight and have more shade, and tend to be larger. In the Southern Hemisphere, glaciers facing South tend to be larger.
Navigate to the top row of column headings and right-click on N. Insert a new column. Give it the heading Aspect_cardinal.
Navigate to the Iceland Tab and copy the cell N2.
Navigate back to the Alaska tab and paste that cell into N2.
Drag that formula down.
In a clear space, write some row headings like below:
Direction Number of glaciers Total area of glaciers
N =COUNTIF(N2:N27110,"N") =SUMIF($N$2:$N$27111,"N",$H$2:$H$27111)
NE E SE S SW W NW Total Use countif and sumif statements to calculate the number and total area of glaciers in each size category. Hint: to calculate the number of glaciers with a northerly aspect:
=COUNTIF(N2:N27110,"N")
Calculate the total number of glaciers and check it is correct.
Calculate the total area for glaciers in each aspect category and check it is correct. Hint:
=SUMIF($N$2:$N$27111,"N",$H$2:$H$27111)
Radar plots are a good way to show categorical data like these. They are often used in questionnaires, for example. Draw a Radar Plot of your Number of Glaciers. Select the Direction and Number of Glaciers columns and go to insert -> Charts -> Radar Plot.
If you have written your directions in order like above, they should automatically plot in the right place on the radar plot!
What do you notice about your data? What observations can you make?
Figure SEQ Figure * ARABIC 4. Radar plot of glacier aspects in Alaska, showing that more glaciers are orientated North than South.
How else could you investigate these data?
You could also draw a pie chart or bar chart of numbers or areas of glaciers in each aspect. You could also sort the data (Data -> sort) and draw box plots of areas of glaciers in each cardinal direction, calculate descriptive statistics for glacier elevations in each cardinal direction, etc.
Because glacier aspects are non-parametric, categorical data, if you wanted to compare numbers or areas of glaciers in each aspect category between two different regions for your coursework, it would require a chi squared test. You will not cover this until 2nd year so please avoid this comparison.
Note! Remember not to calculate mean or total aspect because it is a circular statistic!
Task 2: Correlation
Background
Correlation aims to quantify association between two variables. How much is a change in one variable matched by another? For example, when hurricane intensity increases how much does building damage increase?
Step 1
We want to investigate how glacier elevation range is related to glacier length. Are shorter glaciers steeper?
Navigate to the ArcticCanadaNorth tab and examine the data.
Calculate or complete the Descriptive Statistics for glacier area and glacier length. Complete the table below.
! Note! Values of -9 or -999 are null values. If your minimum value is -9, youll need to sort your dataset and remove any null values.
Glacier Area Glacier length
Mean Range Minimum Maximum Count What do you observe about the area range for glaciers in ArcticCanadaNorth?
What does this table tell you about the probability distribution of the glacier area range and length data? e.g., skew and kurtosis, is the data normally distributed? What type of data are these variables?
Before constructing a scatter plot you will need to determine which is the independent (x) and dependent variable (y)?
Independent variable:
Dependent variable:
Construct a scatter plot examining the relationship between Glacier Length (LMAX) and Glacier Area.
Make sure that your independent variable is on the x axis and your dependent variable is on the y axis.
Label your axes and remove the legend.
See REF _Ref412037811 h Figure 1 for a hint on how to do this.
Make sure that column titles arent included.
3848100797560
Figure SEQ Figure * ARABIC 1. Inserting a scatter plot.
Make a qualitative assessment of the correlation. What kind of correlation is this? (strong, weak, positive, negative)
Step 2
Time to make your assessment quantitative if our data is not normally-distributed (non-parametric) then preferentially, we should use Spearmans rank.
The instructions for Spearmans are available here: Spearman rank correlation in Excel: formula and graph (ablebits.com) and on Moodle under Week 23. You can attempt this in your own time, but you are not assessed on this component until 2nd year.
However, today, we will try out EXCELs PEARSON function.
Step 3
To become familiar with the Pearson Product Moment Correlation coefficient. Here we will try out Excels PEARSON function. But remember, this makes the assumption our data is normally-distributed!
So in this case, we are just using it to compare methods.
In an empty cell type =PEARSON(array1, array2) replacing array1 with the values of the independent variable (x) and array2 with the values of the dependent variable (y). Remember that an array in simply a list of numbers.
Remember to label your calculation so that you can refer back to it later! Give your answer below:
What does the Pearson Correlation Coefficient tell you? How strong is the association? What is its direction? Does it support your initial qualitative assertion above?
Task 3: Regression
While correlation can be a useful way of showing relationships between two variables, regression fits a model line to a plot of two or more variables that allows us to quantify that relationship. The equation of this line describes the association between the variables and, if significant (representative of the population rather than just the sample), allows prediction of values of y from values of x.
The R2 value is the coefficient of determination; it is a measure of Goodness-of-Fit and of how close the data are to a fitted regression line. So,
R2 = Explained variation / total variation
R2 is always between 0 and 1:
0 indicates that the model explains none of the variability of the response data (y) around its mean1 indicates that the model explains all of the variability of the response data (y) around its mean.
In general, the higher the R2 value, the better the model fits your data.
Below your Pearsons Correlation Coefficient, calculate the R2 value using the =RSQ(known_ys, known_xs) tool and give your answer below to 2 d.p.:
Describe what the R2 value is telling you in terms of the influence of the independent variable on the dependent variable.
You can also calculate the R2 value using the chart tools.
Add a linear trendline to your graph and calculate the R2 value for your correlation coefficient (Chart Tools -> Design -> Add Chart Element). Select a Linear Trendline.
Right click on your trendline and select format trendline.
Select the Display equation on chart and Display R-squared value on chart buttons to add the R-squared value and equation on the chart, (see REF _Ref405807152 h Figure 6). You could also make the trendline black and thicker to make it easier to see.
Excel will then fit a line through the data (least-squares regression) and calculate the equation for that line.
Task 4: Assessment
I want you to use this time during the practical to explore the RGIv4.0 worldwide 2020 21 excel sheet on Moodle under the GG1011D Assessment tab on Moodle. You must also open and save the GG1011D Assessment Guidance document. Read this in detail and decide upon which region you would like to study. Look back at your notes from the practicals to work out how you will approach the tasks.
As a reminder, this cannot be a region studied in the practicals thus far, i.e. Iceland, Alaska, Western Canada or Arctic Canada.
Please use this time in class to clarify any assessment queries with Amy before you leave or complete the Moodle quiz 2.
 
								