Los Angeles, California. The City of Angels. Tinseltown. The Entertainment Capital of the World! As with any highly populated city, it isn't always glamorous and there can be a large volume of crime. You have been asked to support the Los Angeles Police Department (LAPD) by analyzing crime data to identify patterns in criminal behavior. They plan to use your insights to allocate resources effectively to tackle various crimes in different areas.
The Data
They have provided you with a single dataset (crimes.csv) to use. A summary and preview are provided below. It is a modified version of the original data, which is publicly available from Los Angeles Open Data.
| Column | Description | Column | Description |
|---|---|---|---|
| 'DR_NO' | Division of Records Number: Official file number made up of a 2-digit year, area ID, and 5 digits. | 'Crm Cd Desc' | Indicates the crime committed. |
| 'Date Rptd' | Date reported - MM/DD/YYYY. | 'Vict Sex' | Victim's sex: F: Female, M: Male, X: Unknown. |
| 'DATE OCC' | Date of occurrence - MM/DD/YYYY. | 'Vict Descent' | Victim's descent: A-Z |
| 'TIME OCC' | In 24-hour military time. | 'Weapon Desc' | Description of the weapon used (if applicable). |
| 'Status Desc' | Crime status. | 'LOCATION' | Street address of the crime. |
| 'AREA NAME' | The 21 Geographic Areas or Patrol Divisions. | | |
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
crimes = pd.read_csv("crimes.csv", dtype={"TIME OCC": str})
crimes.head()

Which hour has the highest frequency of crimes?
(Store as an integer variable called peak_crime_hour)
# convert the time of occurrence into the hour of the day (0-23) as an integer
crimes['the_hour'] = crimes['TIME OCC'].astype(int) // 100
# check the approach works
crimes[['TIME OCC', 'the_hour', 'Crm Cd Desc']].head(3)
# count the crimes in each hour and take the most frequent hour as a plain int
peak_crime_hour = int(crimes['the_hour'].value_counts().idxmax())
peak_crime_hour

I was initially surprised, having assumed that the highest frequency would be at night. I then realised that crime statistics probably suffer from survivorship bias: more crime at night is probably 'successful' and thus goes unreported, leaving no recorded occurrence time.
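To sanity-check the hour extraction and the value_counts().idxmax() step, here is a minimal sketch on a made-up sample. The `sample` frame and `toy_peak_hour` name are illustrative assumptions, not rows or variables from crimes.csv.

```python
import pandas as pd

# Hypothetical zero-padded time strings, mimicking 'TIME OCC' read with dtype=str
sample = pd.DataFrame({"TIME OCC": ["0005", "1130", "1130", "2359", "1145"]})

# Integer division by 100 drops the minutes: "1130" -> 1130 -> 11
sample["the_hour"] = sample["TIME OCC"].astype(int) // 100

# value_counts() tallies each hour; idxmax() returns the hour with the largest tally
hour_counts = sample["the_hour"].value_counts()

# int() converts the NumPy integer label into a plain Python int
toy_peak_hour = int(hour_counts.idxmax())

print(sample)
print(toy_peak_hour)
```

Reading the column as a string keeps leading zeros visible in the preview, and the conversion to int still handles values such as "0005" correctly.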
Which area has the largest frequency of night crimes (crimes committed between 10pm and 3:59am)?
(Save as a string variable called peak_night_crime_location)
I knew I could use boolean masking, but I could not combine the two conditions with '&' because the window wraps around midnight: no hour is both 22 or later and earlier than 4. Instead I can use OR, which covers 10pm up to midnight together with midnight up to, but not including, 4am.
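The difference between AND and OR for a window that crosses midnight can be seen on a toy Series. This is only a sketch: `hours` is a made-up stand-in for the hour column, and the `isin` variant is an alternative I am adding for comparison, not part of the original approach.

```python
import pandas as pd

# Hypothetical hours 0-23, one per row, standing in for the hour-of-day column
hours = pd.Series(range(24))

# AND can never be true here: no hour is both >= 22 and < 4
and_mask = (hours >= 22) & (hours < 4)

# OR selects both sides of midnight: 22, 23 and 0, 1, 2, 3
or_mask = (hours >= 22) | (hours < 4)

# An equivalent membership test that spells out the night hours explicitly
isin_mask = hours.isin([22, 23, 0, 1, 2, 3])

print(and_mask.sum())
print(hours[or_mask].tolist())
```

The explicit `isin` list is arguably easier to read at a glance, while the OR of two comparisons adapts more easily if the window boundaries change.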
# filter all crimes occurring between 10pm and 3:59am
night_crime_mask = np.logical_or(crimes['the_hour'] >= 22, crimes['the_hour'] < 4)
night_crime = crimes[night_crime_mask]
night_crime.head()
peak_night_crime_location = night_crime['AREA NAME'].value_counts().idxmax()
peak_night_crime_location

Identify the number of crimes committed against victims of different age groups.
(Save as a pandas Series called victim_ages, with age group labels "0-17", "18-25", "26-34", "35-44", "45-54", "55-64", and "65+" as the index and the frequency of crimes as the values.)
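Before binning the real column, it helps to see how pd.cut treats values that sit exactly on a bin edge. The sketch below uses made-up ages (an assumption, not data from crimes.csv) with the same edges and labels as the task. It also shows `include_lowest=True`, a real pd.cut parameter, as one option if an age of exactly 0 should count; whether it should is a judgment call about the data that the task description does not settle.

```python
import pandas as pd

# Hypothetical ages chosen to sit on the bin edges (not taken from crimes.csv)
ages = pd.Series([0, 1, 17, 18, 25, 26, 64, 65, 99])

bins = [0, 17, 25, 34, 44, 54, 64, float("inf")]
labels = ["0-17", "18-25", "26-34", "35-44", "45-54", "55-64", "65+"]

# right=True makes each bin (lower, upper], so 17 -> "0-17" and 18 -> "18-25".
# The first bin is (0, 17], so an age of exactly 0 falls outside it and becomes NaN.
default_groups = pd.cut(ages, bins=bins, labels=labels, right=True)

# include_lowest=True closes the first bin on the left, so 0 lands in "0-17".
inclusive_groups = pd.cut(ages, bins=bins, labels=labels, right=True,
                          include_lowest=True)

# Side-by-side comparison of the two binning choices
comparison = pd.DataFrame({"age": ages,
                           "default": default_groups,
                           "include_lowest": inclusive_groups})
print(comparison)

# Counts per label, in label order rather than by frequency
inclusive_counts = inclusive_groups.value_counts().sort_index()
print(inclusive_counts)
```

Values below the first edge (for example, any negative age) would become NaN under both variants, so they drop out of the counts either way.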
# going to use the pandas cut function I just learned
bins = [0, 17, 25, 34, 44, 54, 64, float('inf')]
labels = ["0-17", "18-25", "26-34", "35-44", "45-54", "55-64", "65+"]
crimes['age_group'] = pd.cut(crimes['Vict Age'], bins=bins, labels=labels, right=True)
victim_ages = crimes['age_group'].value_counts().sort_index()
victim_ages
# 'H' is not one of the documented 'Vict Sex' codes (F, M, X), so treat it as an error and replace it with NaN
crimes['Vict Sex'] = crimes['Vict Sex'].replace('H', np.nan)
# creating a dataframe for visualisation
central_crimes = crimes[crimes['AREA NAME'] == 'Central'][['age_group','Vict Sex']]
# plotting the above table, adding gender as a semantic variable - hue
sns.catplot(data=central_crimes,
            x='age_group',
            kind='count',
            hue='Vict Sex'
            )
plt.title("Central LA Crimes by Age Group")
plt.xlabel("Age Group")
plt.ylabel("Number of Crimes")
plt.tight_layout()