Two specific problems with using the multiple regression model for spatial data are that the parameters are assumed to be constant over space and the error terms are assumed to be independent. Geographically weighted regression (GWR) is an extension of the traditional regression model for spatial data that takes location in space into account.
Statistical description of the GWR model
Estimation of the regression parameters
Weighted Regression
Consider a fixed regression point at location $s_0$, which may or may not be an observation point. The estimate of $\beta(s_0)$, the parameter vector at the regression point at location $s_0$, is obtained by weighted least squares.
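For reference, the weighted least squares estimator usually quoted in the GWR literature (for example Fotheringham, Brunsdon, & Charlton, 2002) has the form below. The notation is an assumption made here and may differ from the original: $X$ is the design matrix, $y$ the response vector, and $W(s_0)$ a diagonal matrix holding the weight $w_i(s_0)$ of each observation relative to the regression point.

```latex
\hat{\beta}(s_0) = \left( X^{T} W(s_0)\, X \right)^{-1} X^{T} W(s_0)\, y,
\qquad
W(s_0) = \operatorname{diag}\bigl( w_1(s_0), \ldots, w_n(s_0) \bigr)
```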
Spatial weighting functions
This ensures that observations closer to the regression point will have more influence on the parameter estimates than observations further away. One method is to rank the data points according to their distance from the regression point.
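The two weighting ideas mentioned above can be sketched as follows. This is a minimal illustration, not the weighting function used in the original: the Gaussian form, the exponential rank decay, the parameter names `bandwidth` and `decay`, and the distances are all assumptions.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Gaussian distance-decay weight: 1 at distance 0, shrinking with distance."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def rank_weights(distances, decay):
    """Weights that depend on the rank of each distance, not on its size.

    The nearest point has rank 1, the next nearest rank 2, and so on.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    weights = [0.0] * len(distances)
    for rank, i in enumerate(order, start=1):
        weights[i] = math.exp(-rank / decay)
    return weights


# Made-up distances from three data points to the regression point.
distances = [5.0, 1.0, 3.0]
by_distance = [gaussian_weight(d, bandwidth=2.0) for d in distances]
by_rank = rank_weights(distances, decay=2.0)
```

In both cases the closest point (the second one here) receives the largest weight. The rank-based version ignores how far apart the points actually are, which makes it insensitive to data density.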
Estimation of $\sigma^2$
To reduce this problem, spatial kernels can be constructed that vary their bandwidth according to the data density around the regression point, so that the bandwidth is greater where the data points are sparse than where the data are dense. In areas where the data are sparse, the kernel will have to expand to ensure that the sum of the weights is a constant C, while in areas where the data are dense, the kernel will have to shrink.
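One way to picture such an adaptive kernel is to search, at each regression point, for the bandwidth at which the weights sum to the chosen constant. The sketch below assumes a Gaussian kernel and a simple bisection search; the function names, the value C = 2 and the distances are illustrative only.

```python
import math


def gaussian_weight(distance, bandwidth):
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def weight_sum(distances, bandwidth):
    return sum(gaussian_weight(d, bandwidth) for d in distances)


def adaptive_bandwidth(distances, target_sum, lo=1e-6, hi=1e6, tol=1e-9):
    """Bisection search for the bandwidth whose weights sum to target_sum.

    Assumes all distances are positive and 0 < target_sum < len(distances).
    The sum of Gaussian weights grows with the bandwidth, so bisection applies.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if weight_sum(distances, mid) < target_sum:
            lo = mid  # kernel too narrow: expand
        else:
            hi = mid  # kernel too wide: shrink
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Made-up distances around two regression points.
dense = [0.1, 0.2, 0.3, 0.4]   # neighbours close by
sparse = [1.0, 2.0, 3.0, 4.0]  # neighbours far away
C = 2.0
h_dense = adaptive_bandwidth(dense, C)
h_sparse = adaptive_bandwidth(sparse, C)
```

With these values the kernel around the sparse point ends up wider than the one around the dense point, while both sets of weights sum to C.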
Choice of bandwidth
Cross-validation
By plotting the CV scores against bandwidths, guidance can be provided on choosing an appropriate bandwidth (Fotheringham, Brunsdon, & Charlton, 2002). If a bandwidth that minimizes the CV score is identified graphically, a more accurate value for that bandwidth can be obtained by using an optimization routine.
Akaike's Information Criterion
Cross-validation, a technique widely used in statistics with nonparametric modeling, involves refitting the model to predict each data point, leaving that data point out of the fitting process (Hastie, Tibshirani, & Friedman, 2001). The model can be rebuilt repeatedly with different values of bandwidth and the corresponding cross-validation score calculated.
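A leave-one-out cross-validation score for a candidate bandwidth can be sketched as below. The sketch assumes a single predictor, a Gaussian kernel with fixed bandwidth, and made-up data on a 4 x 4 grid; `local_fit` and `cv_score` are hypothetical helper names, not functions from the original code.

```python
import math


def local_fit(coords, x, y, point, bandwidth, skip=None):
    """Weighted simple linear regression centred on `point`.

    Returns (intercept, slope). Observation `skip`, if given, gets zero weight.
    """
    w = []
    for i, (u, v) in enumerate(coords):
        d = math.hypot(u - point[0], v - point[1])
        w.append(0.0 if i == skip else math.exp(-0.5 * (d / bandwidth) ** 2))
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def cv_score(coords, x, y, bandwidth):
    """Sum of squared errors when each point is predicted without itself."""
    total = 0.0
    for i in range(len(y)):
        b0, b1 = local_fit(coords, x, y, coords[i], bandwidth, skip=i)
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total


# Made-up data: 16 sites on a grid, one predictor.
coords = [(float(i), float(j)) for i in range(4) for j in range(4)]
x = [float((3 * i + 5 * j) % 7 + 1) for i in range(4) for j in range(4)]
y_linear = [2.0 + 3.0 * xi for xi in x]  # same line everywhere
y_varying = [2.0 + (1.0 + 0.5 * u) * xi for (u, v), xi in zip(coords, x)]  # slope drifts with u

candidates = [0.5, 1.0, 2.0, 4.0]
scores = {h: cv_score(coords, x, y_varying, h) for h in candidates}
best_bandwidth = min(candidates, key=lambda h: scores[h])
```

Evaluating the score over a grid of candidate bandwidths, as done here, corresponds to the graphical approach; an optimization routine would refine the value between grid points.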
Spatial Non-stationarity
$H_A$: $\beta_k$ is non-stationary over the region of interest. This procedure is repeated for all parameters $\beta_k$ for $k = 0, \ldots$
Chapter 3
A Proposed Extension to the GWR Model
The Expansion Method
Development of the LLGWR model
The estimate of $\beta^*(s_0)$, the parameter vector at the regression point at location $s_0$, is again obtained by weighted regression. In the GWR model, the spatial variability of the regression coefficients is accommodated by invoking weighted regression centered at a point of interest and with weights that decrease as the distance of observations from that point increases.
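A local linear extension of this idea is commonly written by expanding each coefficient to first order in the coordinates around the regression point and then minimizing a weighted sum of squares. The notation below is an assumption introduced for illustration and may not match the original: $(u, v)$ are spatial coordinates, $s_0 = (u_0, v_0)$, $\beta_k^{(u)}$ and $\beta_k^{(v)}$ are the gradient terms, $x_{ik}$ is the value of predictor $k$ at site $i$ (with $x_{i0} = 1$), and $p$ is the number of predictors.

```latex
\beta_k(u, v) \approx \beta_k(u_0, v_0)
  + \beta_k^{(u)}(u_0, v_0)\,(u - u_0)
  + \beta_k^{(v)}(u_0, v_0)\,(v - v_0),
\qquad k = 0, \ldots, p

\min \; \sum_{i=1}^{n} w_i(s_0)
  \Bigl[ y_i - \sum_{k=0}^{p}
    \bigl\{ \beta_k + \beta_k^{(u)} (u_i - u_0) + \beta_k^{(v)} (v_i - v_0) \bigr\} x_{ik}
  \Bigr]^2
```

Under this formulation each coefficient in the GWR model is accompanied by two additional gradient parameters, so the local design matrix has three times as many columns.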
Chapter 4
A Small Data Set taken from Soil Science
Data
Exploratory Data Analysis
- Global model
- Residuals
- Global models fitted over quadrants
A plot of the residuals $y_i - \hat{y}_i$, where $y_i$ is the $i$th observed value of water content and $\hat{y}_i$ the corresponding fitted value from model (4.1), is shown in Figure 4.4(a), and a plot of the residuals against the fitted values is shown in Figure 4.4(b). However, the spatial distribution of the residuals shown in Figure 4.5 appears to be non-random. Large positive residuals are located in the northeastern part of the map, and negative residuals are located in the south.
In general, standardized residuals greater than 2 in absolute value are considered potential outliers. The impact of these observations on the regression analysis was examined, but removing them made very little difference to the results. The field was divided into four quadrants as shown in Figure 4.7 and simple linear regression models were fitted to the data for each quadrant separately.
This provides a simple way of checking whether the modeled water/clay content relationship is likely to be stationary in space. The results of the global models fitted separately for each quadrant are presented in Table 4.3, and the scatterplots of water versus clay content with the corresponding fitted regression lines for each quadrant are presented in Figure 4.8.
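The quadrant check described above can be sketched as follows. How the field was actually divided is not stated here, so the sketch assumes a split at a chosen midpoint; the coordinates, predictor values and responses are made up, and `fit_by_quadrant` is a hypothetical helper.

```python
def simple_ols(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def quadrant_label(u, v, u_mid, v_mid):
    """Label a site NW, NE, SW or SE relative to the split point."""
    return ("N" if v >= v_mid else "S") + ("E" if u >= u_mid else "W")


def fit_by_quadrant(coords, x, y, u_mid, v_mid):
    """Fit a separate simple linear regression within each quadrant."""
    groups = {}
    for (u, v), xi, yi in zip(coords, x, y):
        xs, ys = groups.setdefault(quadrant_label(u, v, u_mid, v_mid), ([], []))
        xs.append(xi)
        ys.append(yi)
    return {label: simple_ols(xs, ys) for label, (xs, ys) in groups.items()}


# Made-up sites: three per quadrant, each quadrant following its own line.
coords = [(1, 1), (2, 1), (1, 2), (6, 1), (7, 2), (8, 1),
          (1, 6), (2, 7), (1, 8), (6, 6), (7, 7), (8, 8)]
x = [1.0, 2.0, 3.0] * 4
y = [1.0, 2.0, 3.0,   # SW: y = x
     4.0, 3.0, 2.0,   # SE: y = 5 - x
     3.0, 5.0, 7.0,   # NW: y = 1 + 2x
     2.5, 3.0, 3.5]   # NE: y = 2 + 0.5x
fits = fit_by_quadrant(coords, x, y, u_mid=4.5, v_mid=4.5)
```

Clearly different intercepts or slopes across quadrants, as in this artificial example, would suggest that a single global relationship does not hold over the whole field.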
Application of GWR
Thus, the parameters were estimated at each of the grid points, producing 5000 estimates for each parameter in space. The intercept coefficient estimates, as can be seen in Figure 4.10(a), show a clear pattern with higher values located in the north-west of the field and lower values located in the south. The standard errors of these estimates, as can be seen in Figure 4.10(b), are highest in the corners of the field.
For the estimated clay coefficients mapped in Figure 4.11(a), the highest values are in the southwest of the field and the lowest values are in the northwest. The standard errors of these estimates are highest in the northwest corner and along the southern edge of the field. A comparison of Figure 4.10(a) with Figure 4.11(a) shows that high intercept values correspond to low clay coefficient values and low intercept values correspond to high clay coefficient values.
This randomization was repeated 1000 times and the proportions of the variances $S^2(\hat{\beta}_k)$ for $k = 0, 1$ exceeding the actual variance obtained from the data at the correct sites were calculated and found to be 0.001 and 0.022, respectively. These proportions provide a measure of the probability of observing variation in the local parameter estimates at least as extreme as that observed for the actual data if the parameter were globally constant.
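The randomization procedure can be sketched as below. The sketch assumes a single predictor, a Gaussian kernel with fixed bandwidth, local fits at the observation sites rather than on a grid, and made-up data; it uses 99 permutations instead of 1000 only to keep it fast. The helper names are hypothetical.

```python
import math
import random


def local_fit(coords, x, y, point, bandwidth):
    """Weighted simple linear regression centred on `point`: (intercept, slope)."""
    w = [math.exp(-0.5 * (math.hypot(u - point[0], v - point[1]) / bandwidth) ** 2)
         for u, v in coords]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def coefficient_variances(coords, x, y, bandwidth):
    """Variance over space of the local intercepts and of the local slopes."""
    fits = [local_fit(coords, x, y, p, bandwidth) for p in coords]
    variances = []
    for k in (0, 1):
        values = [f[k] for f in fits]
        mean = sum(values) / len(values)
        variances.append(sum((b - mean) ** 2 for b in values) / len(values))
    return variances


def monte_carlo_test(coords, x, y, bandwidth, n_sims=99, seed=0):
    """Proportion of permutations whose variance is at least the observed one."""
    rng = random.Random(seed)
    observed = coefficient_variances(coords, x, y, bandwidth)
    exceed = [0, 0]
    shuffled = list(coords)
    for _ in range(n_sims):
        rng.shuffle(shuffled)  # reassign the observations to random sites
        simulated = coefficient_variances(shuffled, x, y, bandwidth)
        for k in (0, 1):
            if simulated[k] >= observed[k]:
                exceed[k] += 1
    return [e / n_sims for e in exceed]


# Made-up data: the slope drifts with the first coordinate.
coords = [(float(i), float(j)) for i in range(4) for j in range(4)]
x = [float((3 * i + 5 * j) % 7 + 1) for i in range(4) for j in range(4)]
y = [2.0 + (1.0 + 0.5 * u) * xi for (u, v), xi in zip(coords, x)]
proportions = monte_carlo_test(coords, x, y, bandwidth=1.5)
```

A small proportion for a parameter indicates that the observed spatial variation in its local estimates would be unusual if the parameter were constant over the region.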
Implementation of LLGWR
It was found that the parameter $\beta_0$ was significantly different from zero at all the locations and $\beta_1$ was significantly different from zero at 91% of the locations. Maps of the estimates of the parameters were produced to illustrate their variation over space. Parameters were therefore estimated at each of the grid points, yielding 5000 estimates for each parameter.
One thousand randomizations of the data were performed to test whether each parameter is stationary over the region of interest. It can be seen from Figure 4.14(a) that the clay coefficient has high estimates located in the southwest corner of the field. The standard errors of both the intercept and clay coefficient estimates, as shown in Figures 4.13(b) and 4.14(b), respectively, are lowest in the center of the field and highest in the corners.
Low values of the clay coefficient are located in the northwest corner, as well as along the southeast border of the field. Based on the results of the tests of the significance of individual parameters, it was possible to omit one coefficient from the model, because it was not significantly different from zero at all locations.
Kriging application
Comparative results
Chapter 5
A Large Data Set taken from Geology
Data
Exploratory Data Analysis
- Continuous variables
- Residuals
Argovian rock formations are found at locations furthest north and furthest south of the study area. Sequanian rock formations are mainly found in the west of the region, and Quaternary rock formations are mainly found in the northern half of the region. The sampled sites in the center of the study area are dominated by Kimmeridgian rock formations and there are only 4 sampled sites with Portlandian rock formations in total.
Quaternary and Argovian make up about 20% of the sites, and Portlandian only 1.5% of the sites. Histograms of metal concentrations expressed in parts per million for each metal are shown in Figure 5.4. A plot of the residuals $y_i - \hat{y}_i$, $i = 1, \ldots, 259$, where $y_i$ are the observed values of chromium concentration and $\hat{y}_i$ the fitted values from model (5.1), is shown in Figure 5.7(a), and the plot of residuals against the fitted values is shown in Figure 5.7(b).
However, the spatial distribution of the residuals shown in Figure 5.8 appears to be non-random, with a cluster of large negative residuals located in the eastern part of the study region and some large positive residuals located in the southwestern part. A plot of the standardized residuals, which are useful for outlier detection, is presented in Figure 5.9.
- Global model
The area of investigation was divided into quadrants, and model (5.1) was fitted separately to the data for each quadrant.
Application of GWR
All parameters were found to be significantly different from zero at most locations, except for the coefficient associated with L4i, which was found to be significant at only 18% of the locations. The main outcome of a GWR analysis is a set of local parameter estimates that can be mapped to show how the model parameters change in space. Parameters were thus estimated at each of the grid points, yielding 5600 estimates for each parameter in space.
These estimates, as well as the standard errors of the estimates, have been mapped using ArcGIS software and are shown in Figures 5.12 to 5.14. The Monte Carlo method described in Section 2.6 was used to determine whether or not the parameters showed significant non-stationarity.
Implementation of LLGWR
For each parameter, the hypotheses tested were $H_0$: the parameter is stationary throughout the region of interest, versus $H_A$: the parameter is non-stationary throughout the region of interest. Table 5.11 shows that some additional parameters were found to be significantly different from zero at more than half of the locations, suggesting that the inclusion of a linear expansion may be worthwhile.
Some parameters were found to be insignificant at most locations, so the model can be re-fitted by excluding these parameters. Among the parameter estimates, $\beta_0$ appears to be both significantly non-zero and non-stationary.
[Table: for each variable, the proportion of sites where the parameter is significantly different from zero and the p-value of the Monte Carlo test.]
Predicted values of chromium obtained from this model as well as the standard errors of predictions are mapped in Figure 5.19. From Figure 5.19 (a) it can be seen that high values of chromium concentration are predicted in the north-eastern and south-western regions.
Comparative results for training data set
Results of the validation data set
Chapter 6
Conclusions
It appears that the nature of the variability of the regression coefficients in the two data sets favors different models. It is therefore debatable whether the extension of the GWR model is worthwhile, and further investigation is needed. Local Linear Geographically Weighted Regression (LLGWR) is easy to implement, can add value in the analysis of certain data sets, and can easily be included in the GWR repertoire.
Further investigations involving the analysis of more datasets are required, especially datasets showing strong non-stationarity. An investigation into mixed LLGWR models, in which stationary parameters are modeled globally and non-stationary parameters are modeled locally, is also required.
Bibliography
...ing: Monte Carlo Studies and Application to Illicit Drug Market Modelling
Finding a predictive model for Iberian dung beetle species richness based on spatial and environmental variables.
Appendix A
Soil Science Data
Appendix B
GWR code
Appendix C
LLGWR code