Los Angeles, California 😎. The City of Angels. Tinseltown. The Entertainment Capital of the World!
It is known for its warm weather, palm trees, sprawling coastline, and Hollywood, and for producing some of the most iconic films and songs. However, as with any highly populated city, it isn't always glamorous, and there can be a large volume of crime. That's where you can help!
You have been asked to support the Los Angeles Police Department (LAPD) by analyzing crime data to identify patterns in criminal behavior. They plan to use your insights to allocate resources effectively to tackle various crimes in different areas.
The Data
They have provided you with a single dataset to use. A summary and preview are provided below.
It is a modified version of the original data, which is publicly available from Los Angeles Open Data.
crimes.csv
| Column | Description |
|---|---|
| 'DR_NO' | Division of Records Number: Official file number made up of a 2-digit year, area ID, and 5 digits. |
| 'Date Rptd' | Date reported - MM/DD/YYYY. |
| 'DATE OCC' | Date of occurrence - MM/DD/YYYY. |
| 'TIME OCC' | In 24-hour military time. |
| 'AREA NAME' | The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for. For example, the 77th Street Division is located at the intersection of South Broadway and 77th Street, serving neighborhoods in South Los Angeles. |
| 'Crm Cd Desc' | Indicates the crime committed. |
| 'Vict Age' | Victim's age in years. |
| 'Vict Sex' | Victim's sex: F: Female, M: Male, X: Unknown. |
| 'Vict Descent' | Victim's descent. |
| 'Weapon Desc' | Description of the weapon used (if applicable). |
| 'Status Desc' | Crime status. |
| 'LOCATION' | Street address of the crime. |
# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
crimes = pd.read_csv("crimes.csv", dtype={"TIME OCC": str})
crimes.head()
Before we start, let's get some information about the dataset that we are working with.
crimes.info()
From the above, we can see that we are working with a pretty large and clean dataset!
We have a very small number of crimes (11) where the sex of the victim was not reported, as well as 10 crimes where the descent of the victim was not reported.
We also have over half of reported crimes in the dataset with a missing value for 'Weapon Desc', which makes sense as the crime may not have involved a weapon and therefore there would be no description.
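If we want these missing-value counts as numbers rather than reading them off the `info()` output, one option is `isna().sum()`. The sketch below uses a small made-up DataFrame (the rows are hypothetical, not taken from crimes.csv) so that it runs on its own; the same two calls should work on `crimes`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from crimes.csv
toy = pd.DataFrame({
    "Vict Sex": ["F", "M", np.nan, "X"],
    "Weapon Desc": [np.nan, "HAND GUN", np.nan, np.nan],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Share of missing values in each column (mean of a boolean mask)
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```

A share above 0.5 for a column corresponds to the "over half missing" situation described for 'Weapon Desc'.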
Which hour has the highest frequency of crimes?
The first question that we need to answer is which hour has the highest frequency of crimes.
To answer this question, we will use the 'TIME OCC' column.
The data here is stored as HHMM. Therefore, we need to process these values so that we can get an integer hour number e.g. 0635 -> 6.
We want to take only the first 2 characters of each string, as these hold the hour, and then convert them to integer type.
However, before performing any string operations, it is always a good idea to double check the format is as expected, otherwise we may encounter some errors.
assert crimes['TIME OCC'].str.len().eq(4).all(), "Some values still aren't 4 characters long."
print("All time values are exactly 4 characters long!")
Great! The column passed our test, and we have confirmed we have a clean 24-hour military time column that we can now begin to process to extract the hour as an integer.
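Had the check failed, for example because the column was loaded as integers and lost its leading zeros, one possible repair is to zero-pad the strings with `str.zfill`. This is only a sketch with made-up values, and `raw_times` and `padded_times` are illustrative names; it is not needed here because the column was read with `dtype=str`.

```python
import pandas as pd

# Hypothetical values as they might look after being read as integers
raw_times = pd.Series([635, 1200, 5, 2359]).astype(str)

# Left-pad with zeros so every value is a 4-character HHMM string
padded_times = raw_times.str.zfill(4)

# The same length check used on the real column now passes
assert padded_times.str.len().eq(4).all()
print(padded_times.tolist())  # ['0635', '1200', '0005', '2359']
```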
We will create a new column rather than directly replacing the values in the existing column as it is likely we are going to still need the HHMM format in our further analysis.
crimes['TIME OCC HOUR INT'] = crimes['TIME OCC'].str[:2].astype(int)
crimes['TIME OCC HOUR INT'].value_counts()
From the above output, we can see that hour 12 (12:00 to 12:59) has the highest number of reported crimes, with a total of 13663!
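By default `value_counts()` orders the result by count. If we would rather read the counts in chronological order, we can chain `sort_index()`. The sketch below illustrates this on a few made-up HHMM strings (`sample_times` is a hypothetical name and the values are not from crimes.csv).

```python
import pandas as pd

# Made-up HHMM strings for illustration only
sample_times = pd.Series(["0635", "1200", "1245", "2330", "0010", "1201"])

# Same slicing as for the real column: first two characters, cast to int
sample_hours = sample_times.str[:2].astype(int)

# value_counts() sorts by count; sort_index() reorders the result by hour
hourly_counts = sample_hours.value_counts().sort_index()
print(hourly_counts)
```

Here the index runs 0, 6, 12, 23, and hour 12 has the largest count (3).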
We can also extract the peak hour programmatically, rather than reading it off the value_counts() output for the crimes data and assigning it by hand.
peak_crime_hour = crimes['TIME OCC HOUR INT'].value_counts().idxmax()
print(f"The hour with the highest frequency of crimes is: {peak_crime_hour}")
Which area has the largest frequency of night crimes (crimes committed between 10pm and 3:59am)?
The next question we want to answer is which area has the largest frequency of night crimes.
To answer this question, we are going to again use the 'TIME OCC' column, this time filtering for values within our specified range which represent night time.
Then, we are going to count the crimes in each 'AREA NAME' to get the total number of night crimes per area.
Let's get started by filtering for our times of interest. Since we already have the integer hour values from our previous analysis step, we can reuse them here. Because the night window wraps past midnight, we keep rows where the hour is 22 or later, or 3 or earlier.
night_crimes = crimes[(crimes['TIME OCC HOUR INT'] >= 22) | (crimes['TIME OCC HOUR INT'] <= 3)]
Now that we have our filtered night crimes dataframe, we can simply take the value counts of the 'AREA NAME' column and take the index of the maximum count, which will be our peak night crime location.
peak_night_crime_location = night_crimes['AREA NAME'].value_counts().idxmax()
peak_night_crime_count = night_crimes['AREA NAME'].value_counts().max()
print(f"Area with highest reported night crimes: {peak_night_crime_location} ({peak_night_crime_count} reports)")
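To convince ourselves that the night window behaves as intended at its edges, we can try the same logic on a toy example. The sketch below is an equivalent formulation that uses `isin` with an explicit list of hours, followed by a per-area count. The rows are made up, the area names serve only as labels, and `toy`, `night_hours`, and `night_counts` are illustrative names.

```python
import pandas as pd

# Made-up rows chosen to sit on the boundaries of the night window
toy = pd.DataFrame({
    "AREA NAME": ["Central", "Central", "Hollywood", "Hollywood", "Central", "Hollywood"],
    "TIME OCC": ["2200", "0359", "0400", "2159", "0000", "2330"],
})
toy["hour"] = toy["TIME OCC"].str[:2].astype(int)

# Hours 22, 23, 0, 1, 2 and 3 cover 10:00pm through 3:59am
night_hours = [22, 23, 0, 1, 2, 3]
toy_night = toy[toy["hour"].isin(night_hours)]

# Count night crimes per area, largest first
night_counts = toy_night.groupby("AREA NAME").size().sort_values(ascending=False)
print(night_counts)
```

The rows at 0400 and 2159 fall just outside the window and are dropped, while 2200 and 0359 are kept.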