Module 6 - Data Classification

Data classification is all about grouping data values into classes and then representing those groupings with unique symbology on a map. There are six common data classification methods: equal intervals, quantiles, standard deviation, maximum breaks, natural breaks, and optimal. As you can see from my map below, we focused on four of these methods during our lab assignment:
  • Equal Interval – as the name denotes, this classification creates equal-width classes based on the range of values in the data. It takes the total data range (from the minimum to the maximum value) and divides it by the number of classes you would like to have, which makes it the easiest method to calculate by hand. For example, if a dataset has values between 0 and 50 and you want 5 classes, the intervals would be 0-10, 11-20, 21-30, 31-40, and 41-50. (A short code sketch after this list shows how these breaks can be computed.)
  • Quantile – this method divides the data into classes that each contain an equal number of observations. The class size is found by dividing the total number of observations by the number of classes. One drawback of this method is that very different values can end up in the same class, while similar values can be split across neighboring classes.
  • Standard Deviation – this method classifies data by how far values fall from the mean, placing class breaks at standard-deviation intervals above and below it. Most values land in the classes nearest the mean (the center of the bell, or Gaussian, curve), while the classes at the tails of the curve contain fewer observations. When this classification method is chosen, ArcGIS Desktop automatically assigns a diverging color ramp to represent the data diverging from the mean.
  • Natural Breaks – these classes are based on natural groupings inherent in the data, calculated with an optimization algorithm (Jenks). The goal is to make the values within each class as similar as possible while maximizing the differences between classes.
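For reference, here is a minimal sketch (my own illustration, not part of the lab, and the sample values and function names are invented) of how the break points for three of these methods can be computed with NumPy. Natural Breaks relies on the Jenks optimization, which is typically handled by GIS software or a dedicated library rather than calculated by hand.

```python
import numpy as np

def equal_interval_breaks(values, k):
    """Upper class limits: the data range split into k equal-width intervals."""
    lo, hi = values.min(), values.max()
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k + 1)]

def quantile_breaks(values, k):
    """Upper class limits: each class holds (roughly) the same number of observations."""
    return list(np.quantile(values, [i / k for i in range(1, k + 1)]))

def std_dev_breaks(values, n=2):
    """Breaks placed at whole standard deviations from the mean, out to +/- n."""
    mean, sd = values.mean(), values.std()
    return [mean + i * sd for i in range(-n, n + 1)]

# Hypothetical percent-65-and-over values for a handful of tracts (not the lab data)
pct_65 = np.array([2.5, 7.9, 11.3, 14.8, 18.2, 22.6, 30.1, 38.0, 79.0])
print(equal_interval_breaks(pct_65, 4))  # four equal-width classes between 2.5 and 79.0
print(quantile_breaks(pct_65, 4))        # roughly equal counts per class
print(std_dev_breaks(pct_65))            # breaks from mean - 2 SD up to mean + 2 SD
```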
After determining the best type of thematic symbolization to apply to our data set (choropleth, proportional symbol, isopleth, or dot density), we can then focus on the criteria for selecting a proper data classification method to produce the most accurate output possible. It is important to remember that each classification method has both advantages and disadvantages.

For our map deliverables we worked with census tract data on the senior population distribution in Miami-Dade County. We created two maps: in the first, the census data had already been normalized as the percent of the population ages 65 and above; in the second (shown below), we took the raw counts and manually normalized the data by area.
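As an illustration of that manual normalization step, here is a hedged sketch using GeoPandas. The file name, the count field POP_65ABV, and the projection are assumptions for the example, not the lab's actual values.

```python
import geopandas as gpd

# Hypothetical file and column names; the lab's actual fields may differ.
tracts = gpd.read_file("miami_dade_tracts.shp")   # census tract polygons with raw counts
tracts = tracts.to_crs(epsg=26917)                # project to a meter-based CRS (NAD83 / UTM zone 17N)

# Normalize the raw 65-and-over count by tract area to get a simple density measure
tracts["AREA_SQKM"] = tracts.geometry.area / 1_000_000           # square meters -> square kilometers
tracts["SENIORS_PER_SQKM"] = tracts["POP_65ABV"] / tracts["AREA_SQKM"]
```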

After comparing the data classification methods, I concluded that the Equal Interval classification best displayed the data for an audience looking to target the senior citizen population. I made this decision by examining the 'PCT_65ABV' column in the attribute table, where I noticed the values were not evenly distributed across the possible 0% to 100% range. The highest percentage was 79%, but the second highest dropped to 38%, and the rest of the values (the majority) fell between 0% and 38%. After studying each classification's symbology from a spatial perspective, the Equal Interval method was the only one whose class breaks followed the actual spread of values in the attribute table. Additionally, the Equal Interval classification from the first map is the one that looks most similar to all of the normalized classifications of the second map (shown below).

This is why I chose the second map (shown below), in which the population count was manually normalized by area, as the more accurate map to present to a panel of stakeholders. As we learned in this week's lesson, "Data Classification", non-normalized data shown on a choropleth map can be misleading. And as we learned in last week's lesson, "Spatial Statistics", we should always check the spread of our data to see whether it approximates a normal distribution. These steps matter because normalizing the data puts every areal unit on a comparable basis, so units of different sizes can be fairly compared against one another. Normalization also helps account for how point-based measures are aggregated into areal polygon features of varying size and shape, a source of bias known as the Modifiable Areal Unit Problem (MAUP). The other map did not take the MAUP into consideration, and we have to remember that most policy decisions are based on data aggregated by area.
