Learning GIS in Python - Robbery and Police Searches in Space

Violence and Police Activity in Space

Amongst stuffing my face with wine and cheese, I've used this Christmas break to learn more about geospatial modelling in Python.

This blog post is largely intended for my reference and to act as a useful example for others...as such, it may be messy! I'll try and tidy it into a Medium post in the coming weeks.

Space is an often disregarded dimension of modelling within policing research. As per Tobler's first law of geography, "everything is related to everything else, but near things are more related than distant things", and this is probably especially true of crime, that tenders to cluster in both time and space...treating your models as not having distinct physical locations that influence how they behave is likely to miss crucial information.

Nearly all of the below is adapted from a fantastic work in progress book, Geographic Data Science with PySAL and the PyData Stack - I've found it hugely helpful, and the code examples are very approachable. I'd also recommend browsing the Pysal documentation.

All of the below is based on public data:

  • Police recorded crime and searches from January 2019 through October 2020 (data.police.uk)
  • London MSOA geographic and demographic data (MOPAC)

The key libraries used are:

  • Standard Python libraries as included in Anaconda (statsmodels, pandas, sklearn, seaborn)
  • Geopandas - allowing you to read, write, and plot spatial data
  • Pysal - a collection of modules for geospatial data science
In [1]:
#both of the below are used to read your directory
import glob 
import os

#core DS libraries
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering
import numpy as np

#graphic libraries
import matplotlib.pyplot as plt  
from matplotlib import colors
import seaborn as sns             

#geographic analysis
import geopandas
from pysal.explore import esda   # Exploratory Spatial analytics
from pysal.lib import weights
import contextily                # Background tiles
from pysal.viz import splot
from splot.esda import plot_moran

#positive outcomes to obtain detection rate
positive_outcomes = [
       'Offender given conditional discharge', 'Offender fined',
       'Offender given a drugs possession warning',
       'Court result unavailable', 'Local resolution',
       'Offender given community sentence',
       'Offender given penalty notice', 'Offender given a caution',
       'Offender sent to prison', 'Court case unable to proceed',
       'Defendant found not guilty',
       'Offender given suspended prison sentence',
       'Awaiting court outcome', 'Offender otherwise dealt with',
       'Defendant sent to Crown Court', 'Offender deprived of property',
       'Offender ordered to pay compensation',
       'Offender given absolute discharge',
       'Formal action is not in the public interest',
       'Suspect charged as part of another case']

Spatial Data

We begin by importing our spatial border data. Spatial coordinates are just coordinates, so without understanding what those coordinates mean (for instance, where you are in a city, or in the world, at what altitude, etc), they're essentially points on a chart.

For us, this is provided by the Mayor's Office for Policing and Crime, and also conveniently contains some area characteristics. We use Geopandas to read the file.

Geospatial modelling relies on assigning events to a unit of space. You could theoretically make this as detailed as possible - for instance, meter squares - but given we're going to analyse how our units are interconnected, that's probably not computationally feasible (if everything is connected to everything, you're going to need a really big computer). You'll need to reach a suitable compromise. Helpfully, the UK government provides various statistical units, including border coordinates, for download. Lower Super Output Areas (LSOAs) and Middle Super Output Areas (MSOAs) contain populations of between 5,000 and 7,200, and as such should be (partly) comparable.

Geospatial data will use a specific Coordinate Reference System, or CRS%20is,between%20different%20spatial%20reference%20systems.) which will affect how your data is processed. Make sure you're using the right one - worldwide data that assumes it is on a sphere will behave very differently to data from a specific country on a flat plane.

UK policing data often uses National Grid Easting/Northing coordinates, rather than the more common Lat/Longs. Daata.Police.uk comes pre-processed into lat/long coordinates. Thankfully, whichever you have, geopandas can easily convert (or "re-project") to another system.

In [2]:
msoas = geopandas.read_file("statistical-gis-boundaries-london/MapInfo/MSOA_2011_London_gen_MHW.tab")
msoas = msoas.to_crs(epsg=4326)
C:\Users\Admin\Anaconda3\envs\geospatial\lib\site-packages\geopandas\geodataframe.py:422: RuntimeWarning: Sequential read of iterator was interrupted. Resetting iterator. This can negatively impact the performance.
  for feature in features_lst:

Our data on both stop and search and robberies was manually downloaded from data.police.uk, and extracted into our working folder. Given this is then returned into a folder per month, I have written a series of helper functions to read the files, assign them to an MSOA, and combine them into a total per MSOA.

To assign to an MSOA, we use a "spatial join": like a normal table join (think vlookup in Excel), this connects elements from one table, to elements from another, via a common column. In our case, the common column is the geographic location: which MSOA is our search/crime located in.

Crime data from data.police.uk is separated into "major" crime types, which is very helpful for anonymisation (we can't figure out who the victim was if we don't know specific crime types), but does mean that all violent and sexual crime is agglomerated into one - given I think it's unlikely searches deter sexual offences, that's unhelpful. As such, we'll focus on robbery, which is disaggregated. This isn't ideal, but robbery remains a serious, violent, high priority crime, and as such you'd hope they are connected, and robbery might in fact be a rough proxy for overall violence.

In [3]:
def read_and_add(file_path):
    df = pd.read_csv(file_path)
    robberies = df[df["Crime type"] == 'Robbery']
    #we create a geopandas object that uses the lat/long coordinates, using the appropriate CRS code.
    robberies = geopandas.GeoDataFrame(robberies, geometry=geopandas.points_from_xy(robberies.Longitude, robberies.Latitude), crs="epsg:4326")
    detected_mask = robberies["Last outcome category"].isin(positive_outcomes)
    # identify which crimes resulted in a detected outcome 
    robberies.loc[detected_mask, "Detected"] = 1
    detected_mask = robberies["Last outcome category"].isin(positive_outcomes)

    detected_df = robberies[detected_mask]
    robberies.loc[detected_mask, "Detected"] = 1

    dfsjoin_r = geopandas.sjoin(msoas, robberies) #Spatial join Points to polygons
    dfsjoin_d = geopandas.sjoin(msoas, detected_df) #Spatial join Points to polygons

    #aggregate to a pivot table that will count our outputs.
    pivot_r = pd.pivot_table(dfsjoin_r,index='MSOA11CD',fill_value=0,columns='Crime type',aggfunc={'Crime type':len})
    pivot_d = pd.pivot_table(dfsjoin_d,index='MSOA11CD',fill_value=0,columns='Crime type',aggfunc={'Crime type':len})
    all_pivot = pivot_r.join(pivot_d, rsuffix="detected").reset_index()
    return all_pivot.fillna(0)
In [4]:
def read_and_add_search(file_path):
    df = pd.read_csv(file_path)
    df = geopandas.GeoDataFrame(df, geometry=geopandas.points_from_xy(df.Longitude, df.Latitude), crs="epsg:4326")
    dfsjoin = geopandas.sjoin(msoas, df) #Spatial join Points to polygons
    pivot_d = dfsjoin.groupby(["MSOA11CD"])['Object of search'].count()

    #pivot_d = pd.pivot_table(dfsjoin,index='MSOA11CD',fill_value=0,columns='Object of search', aggfunc=len)
    return pd.DataFrame(pivot_d.fillna(0)).reset_index()
In [5]:
def merge_dfs(list_of_df):
    initial_df = list_of_df[0]
    remaining = list_of_df[1:]
    for pivot in remaining:
        initial_df.groupby("MSOA11CD", as_index=False).sum()
        initial_df = pivot.merge(initial_df.groupby("MSOA11CD", as_index=False).sum(), how="outer")
    print ("dataframe length is " + str(initial_df.shape[0]))
    return initial_df
In [6]:
#we iterate over all files and subfolders in our data directory, and use the appropriate helper functions based on their last characters.
list_of_pivot = []
list_of_searches = []
rootdir = os.getcwd()

for subdir, dirs, files in os.walk("data"):
    for file in files:
        filepath = subdir + os.sep + file

        if filepath.endswith("street.csv"):
        if filepath.endswith("search.csv"):

#now we concatenate all our individual dataframes into a single one
print("aggregating robbery")
final_pivot = merge_dfs(list_of_pivot)

print("aggregating searches")
final_search = merge_dfs(list_of_searches)
aggregating robbery
C:\Users\Admin\Anaconda3\envs\geospatial\lib\site-packages\pandas\core\generic.py:3889: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
  obj = obj._drop_axis(labels, axis, level=level, errors=errors)
dataframe length is 1705
aggregating searches
dataframe length is 1953
In [7]:
#finally, merge those into a single pivot with the sum of all incidents during our data, by MSOA.
final_pivot = final_pivot.groupby("MSOA11CD").sum()

final_search = final_search.groupby("MSOA11CD").sum()

final_join = final_search.join(final_pivot).reset_index().copy()
final_join.columns=["MSOA11CD", "search_count", "robbery_count","detected_robbery"]
In [8]:
MSOA11CD search_count robbery_count detected_robbery
0 E02000001 268 37 0.0
1 E02000002 173 38 2.0
2 E02000003 294 115 7.0
3 E02000004 216 19 0.0
4 E02000005 132 61 1.0
... ... ... ... ...
978 E02006927 581 68 7.0
979 E02006928 638 89 5.0
980 E02006929 1373 146 7.0
981 E02006930 168 56 2.0
982 E02006931 393 105 5.0

983 rows × 4 columns

In [9]:
MSOA11CD            E02000001E02000002E02000003E02000004E02000005E...
search_count                                                   477668
robbery_count                                                   60093
detected_robbery                                                 3054
dtype: object

Our final file then, contains nearly 500,000 searches and just over 60,000 robberies (of which a suspect was found in around 3,000) across London's 938 MSOAs.

In [10]:
msoas["MSOA11CD"] = msoas["MSOA11CD"].astype("string")
final_join["MSOA11CD"] = final_join["MSOA11CD"].astype("string")
In [11]:
msoas = msoas.merge(final_join, on=["MSOA11CD"])

We then re-combine this with our geometry file, before calculating a proportion of robbery detected figure, and a search per robbery rate - the final table is below.

In [12]:
msoas["robbery_solve"] = msoas["detected_robbery"] / msoas["robbery_count"]
msoas["search_rate"] = msoas["search_count"] / msoas["robbery_count"]
In [13]:
msoas.index = msoas["MSOA11NM"]

The data is now processed. Now is probably a good time to write this to a file to retrieve it in the future.

In [14]:

MSOA11CD MSOA11NM LAD11CD LAD11NM RGN11CD RGN11NM UsualRes HholdRes ComEstRes PopDen Hholds AvHholdSz geometry search_count robbery_count detected_robbery robbery_solve search_rate
City of London 001 E02000001 City of London 001 E09000001 City of London E12000007 London 7375 7187 188 25.5 4385 1.6 MULTIPOLYGON (((-0.09676 51.52325, -0.09644 51... 268 37 0.0 0.000000 7.243243
Barking and Dagenham 001 E02000002 Barking and Dagenham 001 E09000002 Barking and Dagenham E12000007 London 6775 6724 51 31.3 2713 2.5 POLYGON ((0.14811 51.59678, 0.14809 51.59640, ... 173 38 2.0 0.052632 4.552632
Barking and Dagenham 002 E02000003 Barking and Dagenham 002 E09000002 Barking and Dagenham E12000007 London 10045 10033 12 46.9 3834 2.6 POLYGON ((0.15065 51.58306, 0.14841 51.58075, ... 294 115 7.0 0.060870 2.556522
Barking and Dagenham 003 E02000004 Barking and Dagenham 003 E09000002 Barking and Dagenham E12000007 London 6182 5937 245 24.8 2318 2.6 POLYGON ((0.18511 51.56480, 0.18403 51.56391, ... 216 19 0.0 0.000000 11.368421
Barking and Dagenham 004 E02000005 Barking and Dagenham 004 E09000002 Barking and Dagenham E12000007 London 8562 8562 0 72.1 3183 2.7 POLYGON ((0.14990 51.56807, 0.15078 51.56778, ... 132 61 1.0 0.016393 2.163934
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Greenwich 034 E02006927 Greenwich 034 E09000011 Greenwich E12000007 London 8315 8241 74 33.0 3338 2.5 POLYGON ((0.02900 51.46779, 0.02995 51.46592, ... 581 68 7.0 0.102941 8.544118
Greenwich 035 E02006928 Greenwich 035 E09000011 Greenwich E12000007 London 7341 6410 931 136.0 2977 2.2 MULTIPOLYGON (((-0.00961 51.48366, -0.00979 51... 638 89 5.0 0.056180 7.168539
Greenwich 036 E02006929 Greenwich 036 E09000011 Greenwich E12000007 London 7490 7489 1 29.4 3333 2.2 POLYGON ((0.01619 51.49578, 0.01854 51.49498, ... 1373 146 7.0 0.047945 9.404110
Greenwich 037 E02006930 Greenwich 037 E09000011 Greenwich E12000007 London 6561 6557 4 75.6 2876 2.3 POLYGON ((0.00866 51.48917, 0.00837 51.48877, ... 168 56 2.0 0.035714 3.000000
Greenwich 038 E02006931 Greenwich 038 E09000011 Greenwich E12000007 London 9186 8973 213 46.1 4113 2.2 POLYGON ((-0.00201 51.48155, -0.00136 51.48145... 393 105 5.0 0.047619 3.742857

983 rows × 18 columns


Let's start exploring our data. Before doing anything, we'll use Pysal to create "spatial weights" - this let's us understand our objects in space, and how they interact. There are a myriad of different methods, but we'll just pick the 8 nearest MSOAs in space.

In [15]:
# Generate W from the GeoDataFrame
w = weights.distance.KNN.from_dataframe(msoas, k=8)
w.neighbors['City of London 001']
['Islington 023',
 'Southwark 002',
 'Islington 022',
 'Hackney 027',
 'Hackney 026',
 'Southwark 006',
 'Southwark 003',
 'Southwark 009']
In [16]:
# Row-standardization
w.transform = 'R'

We also create a spatial "lag" for our key values - this is one of the most straightforward measures of spatial relationships, and captures the products and weights of our value in neighbouring observations. Essentially, it's the value of our metric, weighted by the value of the metric in neighbourhing observations - so clusters will expect to be high, while a single high value surrounded by low values will be diminished.

In [17]:
msoas['w_search_count'] = weights.spatial_lag.lag_spatial(w, msoas['search_count'])
msoas['w_robbery_count'] = weights.spatial_lag.lag_spatial(w, msoas['robbery_count'])
msoas['w_rate'] = weights.spatial_lag.lag_spatial(w, msoas['search_rate'])

Let's start by mapping out our key crime and search figures, and seeing if anything stands out.

In [19]:
f, axs = plt.subplots(nrows=2, ncols=2, figsize=(20, 12))
        legend_kwds={"loc": 2},

axs[0,0].title.set_text('Search Count')

        legend_kwds={"loc": 2},

axs[0,1].title.set_text('Robbery Count')

        legend_kwds={"loc": 2},

axs[1,0].title.set_text('Detected Robbery Proportion')

        legend_kwds={"loc": 2},

#contextily pulls a background tile from online providers
axs[1,1].title.set_text('Search per Robbery')
for ax in axs.reshape(-1):