ESDA with GeoDa
Introduction
Exploratory spatial data analysis (ESDA) is a powerful tool for determining the suitability of data for statistical analysis and for developing hypotheses. The GeoDa analysis software is developed by Luc Anselin of the Center for Spatially Integrated Social Science and the School of Geographical Sciences, Arizona State University. It provides a dynamic environment of linked windows. These windows include several types of maps and various traditional statistical plots. For further information on GeoDa, please see the user’s guide and the other GeoDa guides in the references section.
The data for this exercise come from a study of fertility in Cairo, Egypt by Weeks et al. (2004). The dataset includes the 300 shiakas (census units) of the greater Cairo region. We will explore the census-derived variables used in this study. The complete set of variables, including those from remote sensing and other surveys, is described in Table 2 of the article. The primary variable of interest is total fertility rate (TFR). This variable is calculated from the population distribution of each shiaka and estimates the average number of children surviving to adulthood for each woman.
The primary goal of this exercise is to become familiar with the basics of GeoDa and its exploratory analysis capabilities. We are also interested in finding outliers in the dataset and developing preliminary hypotheses about fertility in Cairo.
Opening a Project
1. Start the GeoDa application by double-clicking the icon on your desktop. If the icon is not installed, use Windows Explorer to navigate to the GeoDa folder and double-click the “.exe” file.
2. When the application is running, start a new project by selecting the Open Project option in the File menu or by clicking on the Open Project button.
3. Use the open file button to navigate to your “Cairo” folder and select the “Cairo.shp” file.
4. Next, select “POLY_ID” as the Key Variable on the second line of the GeoDa Project Setting window. This variable must be a unique numeric identifier for each record in the dataset.
5. Press the OK button. A polygon map of the shiakas of Cairo should appear. The vertical line separating the legend from the map area may be moved to hide or show more of the legend. The entire map window may also be resized.
Making a Choropleth Map
There are four different kinds of choropleth maps available in GeoDa. Quantile maps partition the data values into a specified number of equally sized groups. Percentile maps are partitioned at the 1, 10, 50, 90, and 99 percentiles. Box maps, like box plots, highlight outliers. The upper and lower breaks, called fences or whiskers, can be specified as 1.5 or 3 times the interquartile range. The standard deviation map partitions the data based on standard deviations from the mean.
1. Begin by making a percentile map of the variable of interest, fertility. Choose the percentile option from the Map menu.
2. The table of values will open and the Variable Settings dialog box will be displayed. Select the variable “TFR96_03.” This is the total fertility rate (TFR) for 1996. Check the box below the variable list to make this the default variable. When the dialog looks like the figure below, press OK.
3. The percentile map should be displayed (it may be behind the data table). If you have to resize the legend area to see all the values, please do so. The number in parentheses to the right of each percentile range indicates the number of units in that class.
A similar procedure is followed for the other types of choropleth maps. There are two options for exporting maps from GeoDa for use in other applications. Maps can be copied to the Clipboard by selecting the Copy to Clipboard option from the Edit menu. Maps can be exported as a bitmap file by choosing the Export > Save Image as option from the File menu. These options can also be accessed through the popup menus of each map or graph.
We have set the “TFR96_03” variable as the default. This variable will automatically come up in any maps and graphs that are opened. The default variable can be changed or turned off by using the Select Variable option in the Edit menu. When the default variable box is left unchecked, GeoDa will ask for a variable each time a new view is opened.
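The percentile classes described in this section can be sketched outside GeoDa as follows. This is only an illustration: the values are made up, the linear-interpolation percentile rule is one common convention, and the source does not state which convention or boundary handling GeoDa itself uses.

```python
from bisect import bisect_right

def percentile_value(sorted_vals, p):
    """Percentile by linear interpolation (an assumed convention)."""
    k = (len(sorted_vals) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def percentile_classes(values, cuts=(1, 10, 50, 90, 99)):
    """Return the break values and the number of units in each class."""
    s = sorted(values)
    breaks = [percentile_value(s, p) for p in cuts]
    counts = [0] * (len(breaks) + 1)  # five breaks give six classes
    for v in values:
        counts[bisect_right(breaks, v)] += 1
    return breaks, counts

# Hypothetical fertility values, not the Cairo data.
tfr = [i / 10 for i in range(10, 110)]
breaks, counts = percentile_classes(tfr)
print(breaks)
print(counts)  # analogous to the counts shown in parentheses in the legend
```

The counts play the same role as the numbers in parentheses in the map legend: they show how few observations fall in the two extreme classes.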
Making Outlier Maps
The percentile map provides a good overall picture of the distribution of values. It highlights the tails of the distribution, but may not give a true indication of the numerical extremity of these values. The values belonging to the highest and lowest percentiles are not necessarily outliers. The box map specifically highlights extreme values.
1. Begin by opening another map window. The quickest way to get a new map is to press the Duplicate the main map button.
2. With the new map highlighted, select Box Map > Hinge = 1.5 from the Map menu. Resize the maps so that they are both visible.
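The fence rule behind the box map can be sketched in a few lines. The example below is a minimal illustration with made-up values. The quartile rule (linear interpolation) is an assumption, since the exact quartile definition GeoDa uses is not given here.

```python
def quartiles(values):
    """First and third quartiles by linear interpolation (assumed rule)."""
    s = sorted(values)

    def q(p):
        k = (len(s) - 1) * p
        lo = int(k)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (k - lo)

    return q(0.25), q(0.75)

def box_outliers(values, hinge=1.5):
    """Return the fences and the indices of values outside them."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lower = q1 - hinge * iqr
    upper = q3 + hinge * iqr
    low = [i for i, v in enumerate(values) if v < lower]
    high = [i for i, v in enumerate(values) if v > upper]
    return lower, upper, low, high

# Hypothetical fertility values, not the Cairo data.
tfr = [2.1, 2.4, 2.6, 2.8, 3.0, 3.1, 3.3, 3.5, 3.7, 5.0, 7.9]
_, _, low15, high15 = box_outliers(tfr, hinge=1.5)
_, _, low30, high30 = box_outliers(tfr, hinge=3.0)
print(high15)  # indices flagged as high outliers with hinge 1.5
print(high30)  # fewer indices are flagged with hinge 3.0
```

With these values the wider hinge of 3 flags only the most extreme observation, which is why the two hinge settings can produce different maps.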
Selection and Linking
We have visually identified some outliers with the box map, but more information is needed to rule them out of the analysis. In this section, we will select outliers on the map and inspect their data values. There are several shapes that can be used for the selection area, and these can be changed in the Options menu. Selected map polygons are indicated by yellow cross hatching. In other graph windows, selected values are colored yellow. Selected table records are highlighted blue.
1. Begin by selecting a couple of the outliers from one of the maps. Hold down the shift key to make multiple selections or use a selection area to get neighboring outliers.
2. If the data table is minimized, restore it. Otherwise, open the data table with the Table button.
3. You may scroll down in the table window to see the selected records. To bring the selected records together at the top of the page, right click in the table window to bring up a popup menu and select the first option, Promote.
The selected records should be brought together and displayed at the top of the table. The table may be sorted by any column by simply double-clicking on the column heading. Familiar table manipulations such as joins, field calculations, and selection by value may also be carried out with options from the Table menu.
This method for examining the values would suffice for a small number of observations, but a better way to select outliers would be to use a linked boxplot or histogram.
4. Open a boxplot and histogram using the respective options in the Explore menu. Each of these windows is linked to the existing maps and table.
5. Close the box map, and arrange the percentile map, box plot, and histogram so they are all visible. Select regions on the map and examine their distributions in the histogram and box plot. Likewise, select bars in the histogram and observations in the box plot and see how they are distributed on the map.
Linked views can be a powerful visualization tool. There are options associated with each type of graph. Right click in the graph area of the histogram and box plot to see these options on the popup menu. These plots make it easy to select all the outliers at once, and the selected records can be examined in the table.
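Conceptually, linking means that every view shares one set of selected record identifiers. The sketch below is a simplified model of that idea, not GeoDa's actual implementation: selecting a histogram bar yields a set of IDs, and the table view is then filtered by the same set. The IDs, values, and function names are hypothetical.

```python
def histogram_bins(values, n_bins):
    """Assign each value to an equal-width bin, numbered from 0."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        b = int((v - lo) / width) if width > 0 else 0
        bins.append(min(b, n_bins - 1))  # the maximum falls in the last bin
    return bins

def select_bin(ids, values, n_bins, bin_index):
    """IDs of the observations that fall in one histogram bar."""
    bins = histogram_bins(values, n_bins)
    return {i for i, b in zip(ids, bins) if b == bin_index}

# Hypothetical key values and fertility rates.
poly_id = [101, 102, 103, 104, 105, 106]
tfr = [2.0, 2.5, 3.0, 3.5, 4.0, 6.0]

# "Clicking" the right-most bar selects its observations...
selected = select_bin(poly_id, tfr, n_bins=4, bin_index=3)
# ...and any other view (here, the table) filters on the same ID set.
table_rows = [(pid, v) for pid, v in zip(poly_id, tfr) if pid in selected]
print(selected, table_rows)
```

A model like this suggests why a unique key variable matters for linked views: the shared selection is just a set of key values.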
Scatter Plots and Brushing
In this section we will open a scatter plot and examine a dynamic linking method called brushing. A brush is created by clicking the mouse and dragging it over a region. Press the Ctrl key on your keyboard before releasing the mouse button. The outline will blink several times, indicating that it is ready to be dragged over the map or graph. This is a dynamically linked and moveable selection area.
1. Open up a scatter plot using the Explore menu. For the X variable select “TFR86_03.”
2. Turn the scatter plot into a correlation plot by bringing up the scatter plot’s popup menu and selecting the ScatterPlot > Standardized data option. This correlation plot can be used to assess the correlation between any two variables. The slope displayed above the graph is now equivalent to the correlation coefficient.
3. Now let’s create a brush region using the instructions above. Begin in the scatter plot window, but try brushing in the map and other graphs too.
4. The brush can also be used to exclude observations from the slope calculation in the scatter plot window. Select the Exclude selected option from the scatter plot popup menu. Notice that the slope is now displayed twice at the top of the view. The slope value in blue at the upper left is calculated using all the observations, and the slope value in purple in the upper center is calculated excluding the brushed observations.
5. Create a brush in the scatter plot window and find the observations with the most leverage on the global slope value. A view of exclude selected brushing is shown below. Notice that the behavior of the brush remains the same in the other windows.
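Two ideas from the steps above can be checked numerically: the slope of a regression on standardized data equals the correlation coefficient, and excluding an observation shows how much it influences the slope. The sketch below uses made-up values and a simple leave-one-out comparison as a stand-in for brushing. The function names are illustrative only.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def slope(x, y):
    """Ordinary least squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def standardize(xs):
    m = mean(xs)
    sd = sqrt(sum((v - m) ** 2 for v in xs) / len(xs))
    return [(v - m) / sd for v in xs]

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def slope_excluding(x, y, excluded):
    """Slope recomputed without the 'brushed' observations."""
    keep = [i for i in range(len(x)) if i not in excluded]
    return slope([x[i] for i in keep], [y[i] for i in keep])

# Hypothetical values for two time periods, not the Cairo data.
x = [2.0, 2.5, 3.0, 3.5, 4.0, 8.0]
y = [2.2, 2.4, 3.1, 3.3, 4.1, 4.5]

std_slope = slope(standardize(x), standardize(y))
r = pearson_r(x, y)

full = slope(x, y)
changes = [abs(slope_excluding(x, y, {i}) - full) for i in range(len(x))]
most_influential = changes.index(max(changes))
print(std_slope, r, most_influential)
```

Here the observation with the extreme X value changes the slope the most when it is left out, which is the kind of point the Exclude selected brush is meant to reveal.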
Parallel Coordinates Plot
The parallel coordinates plot is a method for visualizing multivariate relationships. Combined with a map view, this plot is one way to visualize higher dimensional spatial relationships. A table of the available census variables is shown below. Remember that each variable is available for two time periods and the year is attached to the variable names.
Total Fertility Rate
Percent of females with at least intermediate education
Percent of males with at least intermediate education
Percent of women 15-29 that have never been married
Percent of males with higher occupational status
1. Open a parallel coordinate plot using the Explore menu.
2. Include all of the 1996 variables in the table above on the plot. Use the > and < buttons to transfer variables between lists one at a time. Use the >> and << buttons to transfer the entire list from one list to the other. The selection dialog should look similar to the diagram shown below. Press OK when all the variables are selected.
In the parallel coordinate plot, each line represents an observation. Each variable is scaled to fit on the same size axis. The maximum and minimum values for each variable are given in parentheses beneath the variable name.
3. Create a brush in a map window, and move it over different areas. Likewise, create a brush in the parallel coordinate plot and move it along some of the axes as shown in the diagram below.
The parallel coordinate plot is useful in showing relationships and pointing out data problems. Observations that behave differently from others may indicate transcription problems or areas of special interest.
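The scaling step behind a parallel coordinate plot can be sketched as a min-max rescaling of each variable, so that every axis runs from its own minimum to its own maximum. This is a minimal illustration under that assumption. The variable names and values below are hypothetical and are not taken from the Cairo dataset.

```python
def min_max_scale(column):
    """Rescale one variable so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.5] * len(column)  # a constant variable sits mid-axis
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical variables for three observations.
data = {
    "TFR": [2.0, 3.0, 5.0],
    "FEM_EDU": [60.0, 40.0, 20.0],
    "NEVER_MARRIED": [30.0, 35.0, 50.0],
}

scaled = {name: min_max_scale(col) for name, col in data.items()}

# Each observation becomes one line: its position on every axis in turn.
n_obs = len(next(iter(data.values())))
polylines = [[scaled[name][i] for name in data] for i in range(n_obs)]
for line in polylines:
    print(line)
```

A line that jumps between the top of one axis and the bottom of the next, as the first observation does here, is the kind of pattern that stands out when brushing.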
Creating a Weights Matrix
In order to perform spatial analysis of the data we need to define the spatial weights matrix (W). This matrix is an N by N matrix where each element represents the spatial proximity of the corresponding observations. Here, N is the number of observations. The matrix element in the first row and the second column would be a measure of proximity of the first and second observations. There are many ways to define these spatial neighborhoods, but normally the diagonal elements are set to zero and the matrices are symmetric.
For the Cairo example we will use the simplest and most often used W, the contiguity matrix. In this configuration, units that share a border are given a weight of one, and all others are set to zero.
1. Begin by selecting Weights > Create from the Tools menu.
2. Fill in the Creating Weights dialog to match the figure below. Select the Cairo shapefile as the input file. Save the output in the same directory and name it “Cairo Contiguity.” Choose a first order rook contiguity matrix. Press the Create button when you are finished.
3. Press Done in the SHP->GAL progress dialog to complete the process.
The contiguity matrix is used here because it is the most common and it generally defines neighborhoods well. As you become more proficient at spatial analysis it might be important to generate other W matrices. GeoDa provides a flexible and easy to use tool for the generation of several kinds of W from shapefiles. They are saved as ASCII text files in sparse formats.
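To make the structure of a contiguity matrix concrete, the sketch below builds a first order rook W for a small regular grid, where "sharing a border" simply means sharing a cell edge. The grid is a stand-in for real polygons, and the neighbor-list dictionary is only an illustration of a sparse representation, not the exact file layout GeoDa writes.

```python
def rook_neighbors(n_rows, n_cols):
    """Neighbor lists for a grid: cells that share an edge are neighbors."""
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            cell = r * n_cols + c
            nbrs = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    nbrs.append(rr * n_cols + cc)
            neighbors[cell] = sorted(nbrs)
    return neighbors

def to_matrix(neighbors):
    """Expand the neighbor lists into a full N by N matrix of 0s and 1s."""
    n = len(neighbors)
    w = [[0] * n for _ in range(n)]
    for i, nbrs in neighbors.items():
        for j in nbrs:
            w[i][j] = 1
    return w

neighbors = rook_neighbors(3, 3)  # sparse form: only the nonzero weights
W = to_matrix(neighbors)          # dense form: 9 by 9

for row in W:
    print(row)
```

The dense matrix has a zero diagonal and is symmetric, as described above, while the neighbor lists store the same information in far less space when N is large.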
Moran Scatter Plots
With our W defined, we can begin to perform some spatial analysis routines. The Moran scatter plot is a graph of the value at a location versus the average values of its neighbors. The neighborhood is defined by the W matrix that you have chosen.
The figure below shows an example Moran scatter plot. The upper right quadrant has been labeled “HH” to indicate that high values in a neighborhood of high values are plotted there. Likewise, the lower left quadrant contains low values surrounded by low values. The other quadrants indicate spatial outliers or units surrounded by unlike values. Points farthest away from the one to one line are most unlike their neighbors.
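The quantities plotted in a Moran scatter plot can be sketched as follows. Each value is standardized, its spatial lag is the average of its neighbors' standardized values, and the slope of the lag on the value is Moran's I when the weights are row-standardized. The six values and the chain of neighbors are made up for illustration, and the sketch assumes every unit has at least one neighbor.

```python
from math import sqrt

def standardize(values):
    m = sum(values) / len(values)
    sd = sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return [(v - m) / sd for v in values]

def spatial_lag(z, neighbors):
    """Average of the neighbors' values (row-standardized weights)."""
    return [sum(z[j] for j in nbrs) / len(nbrs) for nbrs in neighbors]

def quadrant(own_value, lag_value):
    """Label such as 'HH': own value first, neighborhood average second."""
    own = "H" if own_value > 0 else "L"
    around = "H" if lag_value > 0 else "L"
    return own + around

# Hypothetical units arranged in a line; each borders the adjacent units.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]

z = standardize(values)
lag = spatial_lag(z, neighbors)
quadrants = [quadrant(a, b) for a, b in zip(z, lag)]

# Slope of the lag on z; because z has mean zero this is sum(z*lag)/sum(z*z).
morans_i = sum(a * b for a, b in zip(z, lag)) / sum(a * a for a in z)
print(quadrants, morans_i)
```

In this toy example low values sit next to low values and high next to high, so every point falls in the LL or HH quadrant and the slope is strongly positive. Points labeled HL or LH would be the spatial outliers.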
1. Create a Moran scatter plot by selecting the Univariate Moran option from the Space menu.
2. Select the “Cairo Contiguity” weights file that we made in the last section.
3. Select different quadrants on the Moran scatter plot and check to see which regions of Cairo are homogeneous or heterogeneous.
Examination of the Moran scatter plot of TFR for Cairo indicates that there is substantial spatial autocorrelation in this variable, but that there are a few spatial outliers. As the week progresses we will learn statistical methods for characterizing and handling this type of spatial association.
Wrap Up
This was just a quick look at the exploratory capabilities available in GeoDa. Hopefully, you were able to get a good feel for the application and the linked windows environment. You should also have some ideas about the spatial distribution of fertility in Cairo, some relationships that might exist, and suspected outliers. Take some time to re-examine the results we produced and try new analyses or try using different variables. One final note: Suspected outliers can be easily marked for removal or closer inspection by using the Save Selected Obs. option from any popup menu. This command will create a dummy variable, setting the selected observations to one.
References
Anselin (2003) “An introduction to EDA with GeoDa.” http://sal.agecon.uiuc.edu/csiss/pdf/quicktour.pdf
Anselin (2003) “An introduction to spatial autocorrelation analysis with GeoDa.” http://sal.agecon.uiuc.edu/csiss/pdf/spauto.pdf
Anselin (2003) “GeoDa 0.9 user’s guide.” http://sal.agecon.uiuc.edu/csiss/pdf/geoda093.pdf
Anselin et al. (2004) “Web-based analytical tools for the exploration of spatial data.” Journal of Geographical Systems 6(2) 197-219
Weeks, JR et al. (2004) “The fertility transition in Egypt: Intraurban patterns in Cairo.” Annals of the Association of American Geographers 94(1) 74-93