Analyzing Crime in LA
Introduction
The Los Angeles Police Department (LAPD) has asked for assistance in identifying patterns in criminal behavior. This analysis examines patterns in victim demographics, the types of crimes committed, and when crimes are most likely to occur.
The Data
The data covers crimes reported in Los Angeles, CA from January 2020 to June 2023 and is publicly available here. It includes the date and time, location, crime classification, victim demographics, and any weapon used for each crime.
# import packages and data
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('ticks')
crimes = pd.read_csv("data/crimes.csv")
Data Cleaning
crimes.info()
The steps used in cleaning the data were as follows:
- Crm Cd 4 contains no values; drop it, along with the Cross Street column (see the null-count check after this list).
- Replace nulls in Crm Cd 2, Crm Cd 3, Weapon Used Cd, Weapon Desc, and Premis Desc with 'NONE'.
- Replace null and negative values in Vict Age with 'unknown'.
- Replace invalid values in Vict Sex with 'X' (unknown).
- Replace invalid values in Vict Descent with 'X' (unknown).
- Drop the few remaining rows with nulls in the Premis columns.
- Convert the date and time columns to datetime format.
- Drop rows with a latitude and longitude of 0, 0.
- Convert the ID and code columns from numeric to strings.
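As a quick check motivating the first two steps, the per-column null counts can be inspected before any cleaning (a minimal sketch):
# Count nulls per column to see which columns need dropping or filling
print(crimes.isna().sum().sort_values(ascending=False))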
# Drop columns
crimes_clean = crimes.drop(['Crm Cd 4','Cross Street'], axis=1)
# Fill some nulls with 'none'
crimes_clean[['Crm Cd 2','Crm Cd 3','Weapon Used Cd','Weapon Desc','Premis Desc']] = crimes_clean[['Crm Cd 2','Crm Cd 3','Weapon Used Cd','Weapon Desc','Premis Desc']].fillna('NONE')
# Remove and replace negative/null Vict Age with unknown
def ages_pos(x):
    if x <= 0:
        return np.nan
    else:
        return x

crimes_clean['Vict Age'] = crimes_clean['Vict Age'].apply(ages_pos)
crimes_clean['Vict Age'] = crimes_clean['Vict Age'].fillna('unknown')
# Clean up invalid values in Vict Sex
sex_codes = ['F','M','X']
def sex_clean(x):
    if x not in sex_codes:
        return 'X'
    else:
        return x
crimes_clean['Vict Sex'] = crimes_clean['Vict Sex'].apply(sex_clean)
# Clean up invalid values in Vict Descent
desc_codes = ['A','B','C','D','F','G','H','I','J','K','L','O','P','S','U','V','W','X','Z']
desc_dict = {'A':'Other Asian', 'B':'Black', 'C':'Chinese', 'D':'Cambodian', 'F':'Filipino', 'G':'Guamanian',
'H':'Hispanic/Latin/Mexican', 'I':'American Indian/Alaska Native', 'J':'Japanese', 'K':'Korean',
'L':'Laotian', 'O':'Other', 'P':'Pacific Islander', 'S':'Samoan', 'U':'Hawaiian', 'V':'Vietnamese',
'W':'White', 'X':'Unknown', 'Z':'Asian Indian'}
def desc_clean(x):
    if x not in desc_codes:
        return 'X'
    else:
        return x
crimes_clean['Vict Descent'] = crimes_clean['Vict Descent'].apply(desc_clean)
crimes_clean['Vict Descent'] = crimes_clean['Vict Descent'].map(desc_dict)
# Drop remaining null rows
crimes_clean.dropna(inplace=True)
# Convert columns to date & time
crimes_clean['Date Rptd'] = pd.to_datetime(crimes_clean['Date Rptd'], format="%Y-%m-%d")
crimes_clean['DATE OCC'] = pd.to_datetime(crimes_clean['DATE OCC'], format="%m/%d/%Y %I:%M:%S %p")
crimes_clean['TIME OCC'] = crimes_clean['TIME OCC'].astype('str')
# Zero-pad TIME OCC to four digits (e.g. '30' -> '0030') before parsing as HHMM
crimes_clean['TIME OCC'] = pd.to_datetime(crimes_clean['TIME OCC'].str.zfill(4), format="%H%M")
# Remove invalid latitude/longitude
crimes_clean = crimes_clean[(crimes_clean['LAT'] != 0) & (crimes_clean['LON'] != 0)].copy()
# Convert nums to str
cols = ['DR_NO','AREA','Rpt Dist No','Crm Cd','Premis Cd','Crm Cd 1']
crimes_clean[cols] = crimes_clean[cols].astype('str')
After cleaning, there are 398,912 rows. While Vict Age would ideally be a numeric column, replacing its null values with 'unknown' made that impossible. Nearly a quarter of the rows have 'unknown' as the victim age, so a lot of information would be lost if they were removed.
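As a quick check on that figure, the share of rows with an unknown victim age can be computed directly (a minimal sketch against the cleaned frame):
# Share of rows where the victim age is 'unknown'
unknown_share = (crimes_clean['Vict Age'] == 'unknown').mean()
print(f"{unknown_share:.1%} of rows have an unknown victim age")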
Data Analysis
Victim Demographics
only_ages = crimes_clean[crimes_clean['Vict Age'] != 'unknown'].copy()
sns.histplot(data=only_ages, x='Vict Age', hue='Vict Sex', multiple='stack', bins=60)
plt.xlabel('Victim age')
plt.ylabel('Number of victims')
sns.despine()
plt.show()
Most crime victims are around 30 years old, and there may be a second, overlapping distribution centered around 50 years old. This holds for both sexes. Victims with an unknown sex are most likely to be around 20 years old.
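To put rough numbers behind that reading of the histogram, the median victim age by reported sex can be checked (a minimal sketch using the same only_ages subset; ages are cast back to numeric since the column is stored as object):
# Median victim age by reported sex, ignoring rows with an unknown age
ages_numeric = pd.to_numeric(only_ages['Vict Age'])
print(ages_numeric.groupby(only_ages['Vict Sex']).median())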
sns.countplot(data=crimes_clean, x='Vict Descent')
plt.xticks(rotation=90)
plt.xlabel('Victim descent')
plt.ylabel('Number of victims')
sns.despine()
plt.show()
Hispanic/Latin/Mexican, Unknown, White, and Black are the most common victim descent categories. But perhaps more important than the raw counts is whether they are proportionate to the population as a whole. Are people of a certain descent more likely to be the victim of a crime?
To answer this, demographic data from the 2020 U.S. Census was used, here. Additionally, information about the census's race categorization was used to map the descent categories in the crime data onto categories consistent with the census data, here.
# Preparing US Census data by calculating percentages and storing in a dataframe
census = pd.Series({'Hispanic/Latin/Mexican':1829991,'White':1125052,'Black':322553,'American Indian/Alaska Native':6614,'Asian':454585,'Hawaiian/Pacific Islander':4573,'Other':26351})
census = census.to_frame(name='number')
total = census.number.sum()
census['pct_pop'] = (census['number'] / total * 100).round(2)
census.reset_index(inplace=True)
# Mapping crime victim data to match census data, calculating percentages and storing in a dataframe
victim_map = {'Other Asian':'Asian', 'Korean':'Asian', 'Filipino':'Asian', 'Chinese':'Asian', 'Japanese':'Asian', 'Vietnamese':'Asian', 'Asian Indian':'Asian', 'Pacific Islander':'Hawaiian/Pacific Islander', 'Hawaiian':'Hawaiian/Pacific Islander', 'Laotian':'Asian', 'Cambodian':'Asian', 'Guamanian':'Hawaiian/Pacific Islander', 'Samoan':'Hawaiian/Pacific Islander'}
victims = crimes_clean['Vict Descent'].map(victim_map).fillna(crimes_clean['Vict Descent'])
victims = victims.value_counts(normalize=True).rename_axis('index').to_frame(name='pct_victims')
victims['pct_victims'] = (victims['pct_victims'] * 100).round(2)
victims.reset_index(inplace=True)
# Combine victim and census tables and pivot for readability
combined = victims.merge(census, how='left', on='index')
combined = combined.pivot_table(values=['pct_victims', 'pct_pop'], columns='index')
combined
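To answer the question above more directly, the two percentages can be compared as a ratio: a value above 1 means a group is over-represented among crime victims relative to its share of the city's population. This is a minimal sketch built on the combined table above:
# Ratio of victim share to population share; values > 1 indicate over-representation
ratio = (combined.loc['pct_victims'] / combined.loc['pct_pop']).round(2)
print(ratio.sort_values(ascending=False))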