Author: Adepeju, M.
Big Data Centre, Manchester Metropolitan University, Manchester, M15 6BH, UK
Date: 2022-08-10
Abstract
With increasingly limited availability of fine-grained spatially and temporally stamped point data, the stppSim package provides an alternative source of data for a wide range of research in the social and life sciences. It generates artificial spatio-temporal (ST) point patterns through the integration of microsimulation and agent-based models. It allows a user to define the behaviours of a set of ‘walkers’ (agents, objects, persons, etc.) whose interactions with the spatial (landscape) and temporal configurations produce new point events. The resulting point cloud and patterns can be measured, analyzed, and processed for testing and evaluating spatial and/or temporal models.
In many research contexts, access to fine-grained spatiotemporal (ST) point data has been severely restricted due to privacy concerns. The R package stppSim has been designed to address this challenge by presenting a framework that is capable of mimicking real-life data, through the integration of microsimulation and agent-based techniques, in order to generate new point events in space and time. The framework comprises a set of ‘walkers’ (agents, objects, persons, etc.) with modifiable movement characteristics, the landscape (spatial components), and the temporal configurations. It is the interaction between these three elements that gives rise to new point events, which can be processed, measured, and manipulated for further applications.
The package contains two key functions for synthesizing new data sets: (i) psim_artif and (ii) psim_real. The psim_artif function synthesizes point patterns purely from the user's specifications (simulation from scratch). This implies that the simulation does not depend on any existing point data. On the other hand, psim_real synthesizes point patterns based on a sample of a real data set. The latter is particularly applicable to situations where only a sparse version of a data set is available for an area. The function learns (or extracts) certain spatial and temporal characteristics of the sample data, and then extrapolates them to generate a full data set. The potential applications of stppSim include the generation of offending data sets (crime), disease infection data sets, and foraging point patterns of wild animals.
The simulation parameters relate to three elements, namely the walkers (agents), the landscape (spatial dimension), and the temporal dimension. The parameters are described as follows:
The walkers are defined primarily by the following characteristics:
Movement - Walkers can move in any direction and have the ability to detect obstructions (restrictions) on their path. The movements are controlled primarily by an in-built transition matrix (TM), which defines two operational states, namely an exploratory state (in which a walker is merely exploring the environment) and a performative state (in which a walker is performing an action). The stochastic properties of the TM ensure variations in the behavioural patterns amongst walkers. In order to switch from one state to another, a categorical distribution is assigned to the latent state variable \(z_t\). So, every time step may be assigned to either operational state, independent of the previous state: \[z_t \sim \mathrm{Categorical}(\Psi_{1}, \Psi_{2})\] such that \(\Psi_{i} = \Pr(z_t = i)\), where \(\Psi_{i}\) is the fixed probability of being in state \(i\) at time \(t\), and \(\sum_{i=1}^{2}\Psi_{i}=1\).
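The state-switching rule above can be illustrated with a few lines of base R. This is only a minimal sketch: the probabilities in psi are assumed values chosen for illustration, not the entries of the package's built-in transition matrix, and the variable names are hypothetical.

```r
# Illustrative sketch only: draw a walker's operational state at each time
# step from a categorical distribution. The probabilities are assumed
# values, not those of the package's built-in transition matrix.
set.seed(1)
states  <- c("exploratory", "performative")
psi     <- c(0.7, 0.3)   # assumed Psi_1 and Psi_2; they must sum to 1
n_steps <- 1000

# Each time step is drawn independently of the previous state
z <- sample(states, size = n_steps, replace = TRUE, prob = psi)

# Empirical state frequencies approach psi as n_steps grows
state_freq <- table(factor(z, levels = states)) / n_steps
print(state_freq)
```

Because each draw is independent, the proportion of time steps spent in each state converges to the corresponding value of psi.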
Spatial perception [s_threshold] - The perception range of a walker at a given location. The parameter s_threshold is generally updated as a walker moves to a new location. A natural way to choose this parameter is to plot the data and pick the value that best accords with one's prior expectations of s_threshold. For many applications this approach will be perfectly satisfactory. For psim_artif, a user is expected to define a value, but for psim_real, the optimal s_threshold value may be estimated from the sample data set.
Steps [step_length] - The maximum step taken by a walker from one point location to the next. This defines the speed of a walker across the space. The step_length should be defined carefully, particularly when movements are restricted to narrow paths, such as a route network. In this case, the value must be smaller than the width of the paths.
Proportional ratios [p_ratio] - Defines the spatial concentration of events generated by the walkers, expressed as the percentage of total events emanating from a small number of the most active origins. For example, a 20:80 proportional ratio implies that 20% of origins (walkers) generate 80% of the total point events. In other words, origins have strength values which can be used to model the expected (final) spatial distribution of events (points), named the spatial model.
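To make the 20:80 example concrete, the sketch below assigns strength values to 50 origins so that the 10 most active origins account for 80% of the events. It is written in base R under stated assumptions: strengths are taken to be equal within each group, which is a simplification, and none of the variable names (other than p_ratio and n_origin) come from the package.

```r
# Hypothetical sketch: origin strengths for a 20:80 proportional ratio.
# Equal strength within each group is an assumption made for simplicity.
n_origin  <- 50
p_ratio   <- 20                                # percent of origins that are most active
n_top     <- round(n_origin * p_ratio / 100)   # number of most active origins
share_top <- (100 - p_ratio) / 100             # share of events they generate

# Strength (sampling probability) of each origin
strength <- c(rep(share_top / n_top, n_top),
              rep((1 - share_top) / (n_origin - n_top), n_origin - n_top))

# Allocate events to origins in proportion to their strengths
set.seed(1)
n_events  <- 1000
origin_id <- sample(seq_len(n_origin), size = n_events,
                    replace = TRUE, prob = strength)
events_per_origin <- tabulate(origin_id, nbins = n_origin)

# Observed share of events from the most active origins (close to 0.8)
top_share <- sum(events_per_origin[seq_len(n_top)]) / n_events
print(top_share)
```

The observed share fluctuates around 0.8 from run to run because the allocation is random.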
The following are the key properties of a landscape:
Origins [coords] - Walkers emanate from origins. Origins may be distributed randomly across the landscape or exhibit a specific spatial concentration. Origins are defined in terms of xy coordinates. In a criminological application, a human offender can be modelled as a walker emanating from their residence (origin). Origins can exhibit two types of concentration: nucleated and dispersed (Hornby and Jones, 1991). A nucleated concentration is one in which all origins concentrate around one focal location, while a dispersed concentration has multiple focal locations and the origins can be completely random across the space (see Figure 1).
Figure 1: Type of origin concentration
Boundary [poly] - A landscape is bounded, defined either by a polygon shapefile (poly) or by the spatial extent of the sample point data.
Restrictions [restriction_feat] - Features constituting obstructions. These comprise two parts: (i) areas outside the boundary (poly), which have the maximum restriction value of 1, meaning that walkers cannot step outside the boundary; and (ii) features within the boundary that obstruct movement, e.g., certain land use types or land features, such as fenced areas and hills. Generating a restriction map typically involves two steps. Using the example of the boundary shapefile of the Camden area of London (UK), a restriction map can be generated as follows:
Step 1: Generate the boundary restriction
#load shapefile data
load(file = system.file("extdata", "camden.rda", package="stppSim"))
#extract boundary shapefile
boundary = camden$boundary # get boundary
#compute the restriction map
restrct_map <- space_restriction(shp = boundary,res = 20, binary = TRUE)
#plot the restriction map
plot(restrct_map)
Step 2: Set the restrct_map above as the basemap, and then stack the land use features on it to define the restrictions within the area
# get landuse data
landuse = camden$landuse
#compute the restriction map
full_restrct_map <- space_restriction(shp = landuse,
baseMap = restrct_map, res = 20, field = "restrVal", background = 1)
#plot the restriction map
plot(full_restrct_map)
Figure 2: Restriction map
In Figure 2, the land use feature has three classes, each with its own restriction value, e.g., Leisure (0.5), Sports (0.7), and Green (0.9).
Focal points [n_foci] - Locations (origins) of relatively higher importance, which usually present more opportunities for events to occur. This is only specified when using psim_artif. A user will typically specify the number of focal points to simulate. In the context of urban landscape configuration, a focal point is synonymous with a city/town centre. The major focal point (location) within a city (if one exists) can also be specified, using an additional parameter mfocal. The default value of mfocal is NULL. Also, the parameter foci_separation allows a user to specify the proximity of the focal points to one another (values range from 1 to 100), with 1 and 100 being the highest and lowest proximities, respectively.
The following parameters define the temporal dimension:
Long-term trend [trend] - Defines the long-term direction of the time series to be simulated. The trend can be stable, rising or falling. When it is rising or falling, an additional slope argument can be used to specify whether the trend slope is gentle or steep. These arguments apply only to simulation from scratch.
Seasonal patterns [first_pDate] - Defines the medium-term fluctuations of the total events over time. This is controlled by specifying the first seasonal peak point of the time series. A first peak at 90 days implies a seasonal cycle of 180 days.
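The relationship between the first peak and the cycle length can be sketched in base R. The cosine curve below is an assumed shape used purely for illustration; the package's own seasonal model may differ, and cycle_length and seasonal are hypothetical names.

```r
# Illustrative sketch: a seasonal curve whose first peak falls at
# first_pDate days, so that one full cycle spans twice that value.
# The cosine form is an assumption, not the package's actual model.
first_pDate  <- 90
cycle_length <- 2 * first_pDate   # 180 days

t <- 0:364                        # days since the start date
seasonal <- 0.5 * (1 - cos(2 * pi * t / cycle_length))

# The curve starts at a trough (day 0), peaks at day 90,
# and returns to a trough at day 180
round(seasonal[t %in% c(0, 90, 180)], 3)

# Approximate number of cycles in one year
n_cycles <- length(t) / cycle_length
print(n_cycles)
```

Calling plot(t, seasonal, type = "l") displays the curve. Increasing first_pDate lengthens the cycle, so fewer cycles fit within the same period.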
Figure 3: Global trends and patterns
Figure 3 shows the expected seasonal patterns based on different values of first_pDate, starting with 90 days, and then increasing the value successively by one month. The number of seasonal cycles decreases with later first_pDate values. Both the long-term trend and the seasonal patterns are learned from the sample data when simulating using psim_real.
The combination of the long-term trend and the seasonal patterns represents the temporal model of a simulation. This should be previewed and reviewed before starting the simulation.
stppSim
From the R console, type:
#To install from `CRAN`
install.packages("stppSim")
#To install the development version, type:
#(the `remotes` package must be installed first, e.g. install.packages("remotes"))
remotes::install_github("MAnalytics/stppSim")
Now, to load the package,
library(stppSim)
interactive argument
Both the psim_artif and psim_real functions have the argument interactive, which is set to FALSE by default. If the interactive argument is set to TRUE, queries are printed in the console (when running the functions) asking the user whether to show the spatial and temporal models of the simulation. The spatial model shows the locations of the origins as well as their strength distribution across the space. The strength distribution indicates the likely distribution of the points (events) to be simulated. Similarly, the temporal model shows the (smoothed) trend and seasonal pattern to be expected from the simulation. A user therefore has the opportunity to review both the spatial and the temporal patterns before proceeding with the simulation.
stpp from scratch
Three key arguments are required: n_events - the number of points to simulate, start_date - the start date of the time series, and poly - the polygon shapefile representing the boundary of the study area within which point patterns are to be simulated. For n_events, it is recommended that a vector of values is provided, rather than a single value. For example, n_events = c(200, 500, 1000, 2000). The output is generated as a list comprising a separate data frame for each value. Moreover, the length of n_events has little or no effect on the processing time.
Given the boundary shapefile of the Camden Borough of London (embedded in the package), the stpp can be generated as follows:
#load the data
load(file = system.file("extdata", "camden.rda",
package="stppSim"))
boundary <- camden$boundary # get boundary data
#specifying data sizes
pt_sizes = c(200, 1000, 2000)
#simulate data
artif_stpp <- psim_artif(n_events=pt_sizes, start_date = "2021-01-01",
poly=boundary, n_origin=50, resistance_feat = NULL,
field = NA,
n_foci=5, foci_separation = 10, mfocal = NULL,
conc_type = "dispersed",
p_ratio = 20, s_threshold = 50, step_length = 20,
trend = "stable", first_pDate=NULL,
slope = NULL,show.plot=FALSE, show.data=FALSE)
The processing time on an Intel Core i7-7500CPU @ 2.70GHz, 16.0GB RAM PC is 3.5 minutes. The processing time increases to 30.2 minutes if a landscape restriction is added, that is, if the argument resistance_feat = camden$landuse is used (with field = "restrVal").
To retrieve the result for any value of n_events, index the output object by the position of that value in the n_events vector. For example, to retrieve the result based on n_events = 1000, type:
stpp_1000 <- artif_stpp[[2]]
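The index 2 is used because 1000 is the second element of the vector passed to n_events. The position can also be looked up rather than hard-coded, as in the following sketch. Here mock_stpp is a hypothetical stand-in for the list returned by psim_artif, so that the snippet runs without performing a simulation; its contents do not reflect the structure of the real output.

```r
# Sketch: look up the list position of a given n_events value.
# `mock_stpp` is a hypothetical stand-in for the psim_artif() output
# (one data frame per requested size); it is not real simulated data.
pt_sizes  <- c(200, 1000, 2000)
mock_stpp <- lapply(pt_sizes, function(n) data.frame(id = seq_len(n)))

# Position of the requested size within the n_events vector
idx <- match(1000, pt_sizes)

# Retrieve the corresponding element of the list
stpp_1000 <- mock_stpp[[idx]]
nrow(stpp_1000)
```

With a real result, the same pattern would read artif_stpp[[match(1000, pt_sizes)]].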
The spatial patterns and clustering of events can be controlled by manipulating the arguments that control the properties of the spatial components (e.g., resistance_feat, n_origin, mfocal, foci_separation, n_foci, etc.) and those that control the properties of the walkers (e.g., step_length, s_threshold, p_ratio). Specifically, in order to add a focal location to the simulation (i.e., mfocal, see the explanation above), use the make_grids function to generate an interactive map from which the xy coordinates of every location in the map can be displayed and extracted. The interactive map includes an OpenStreetMap layer that allows a user to identify places more easily.
Figure 4 shows the spatial point patterns (spp) of n_events = 1000 at varying parameter settings. Note: the spatial pattern is likely to change each time the code is re-run, owing to the random elements in the function. Figure 4a is the result when all default arguments are used (as in the code above). Figure 4b is the result when two extra arguments are applied, namely resistance_feat = camden$landuse and mfocal = c(530000, 182250). The first argument limits the number of events simulated within the land use (restriction) features, and the second ensures that the origins are spatially concentrated around a focal point (indicated by the red dot on the map). Figure 4c is the result when, in addition to applying resistance_feat and mfocal (as above), foci_separation = 50 is set. This additional argument ensures that the origins are moderately far from one another. Lastly, Figure 4d is the result when, in addition to setting mfocal (as above), s_threshold and step_length are set to 250 and 50, respectively. These additional settings ensure that points are well spread out from their respective origins.