Los Angeles, California 😎. The City of Angels. Tinseltown. The Entertainment Capital of the World!
It is known for its warm weather, palm trees, sprawling coastline, and Hollywood, and for producing some of the most iconic films and songs. However, as with any highly populated city, it isn't always glamorous, and there can be a large volume of crime. That's where you can help!
You have been asked to support the Los Angeles Police Department (LAPD) by analyzing crime data to identify patterns in criminal behavior. They plan to use your insights to allocate resources effectively to tackle various crimes in different areas.
The Data
They have provided you with a single dataset to use. A summary and preview are provided below.
It is a modified version of the original data, which is publicly available from Los Angeles Open Data.
crimes.csv
Column | Description |
---|---|
'DR_NO' | Division of Records Number: Official file number made up of a 2-digit year, area ID, and 5 digits. |
'Date Rptd' | Date reported - MM/DD/YYYY. |
'DATE OCC' | Date of occurrence - MM/DD/YYYY. |
'TIME OCC' | In 24-hour military time. |
'AREA NAME' | The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for. For example, the 77th Street Division is located at the intersection of South Broadway and 77th Street, serving neighborhoods in South Los Angeles. |
'Crm Cd Desc' | Indicates the crime committed. |
'Vict Age' | Victim's age in years. |
'Vict Sex' | Victim's sex: F : Female, M : Male, X : Unknown. |
'Vict Descent' | Victim's descent. |
'Weapon Desc' | Description of the weapon used (if applicable). |
'Status Desc' | Crime status. |
'LOCATION' | Street address of the crime. |
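The analysis slices and compares 'TIME OCC' as text, which only gives correct hours when every value is a zero-padded, four-character string. The following is a minimal sketch of how padding could be restored if it were lost on load. The sample values are hypothetical and are not taken from crimes.csv.

```python
import pandas as pd

# Hypothetical sample values (not from crimes.csv) where leading zeros were lost
times = pd.Series(["5", "730", "1200", "2345"])

# Left-pad every value to four characters, so "730" becomes "0730"
padded = times.str.zfill(4)

# With consistent padding, the first two characters are always the hour
hours = padded.str[:2].astype(int)
print(hours.tolist())
```

Reading the column with `dtype={"TIME OCC": str}` keeps any leading zeros that are present in the file, so this step would only matter if the file itself stored unpadded values.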
# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
crimes = pd.read_csv("crimes.csv", parse_dates=["Date Rptd", "DATE OCC"], dtype={"TIME OCC": str})
crimes.head()
print("Which hour has the highest frequency of crimes? Store as an integer variable called 'peak_crime_hour'")
# Make a new column from the first two characters (the hour) of the TIME OCC column
crimes['HOUR OCC'] = crimes['TIME OCC'].str[:2].astype(int)
# value_counts() tallies each hour; idxmax() returns the hour with the largest tally
peak_crime_hour = crimes['HOUR OCC'].value_counts().idxmax()
print(peak_crime_hour, "\n")
# Show the frequency of crimes per hour with a countplot
# Set the palette before plotting; set_palette only affects plots drawn afterwards
sns.set_palette('pastel')
# figsize is (width, height) in inches
fig = plt.figure(figsize=(10,6))
# Keep the returned Axes in a variable so the axis labels can be set on it
a = sns.countplot(x='HOUR OCC', data=crimes)
# Add a figure-level title and leave room for it above the Axes title
fig.suptitle("Crime Occurrence by Hour in Military Time")
fig.subplots_adjust(top=0.85)
plt.title("Frequencies")
a.set_xlabel('Hour of the Day')
a.set_ylabel('Number of Occurrences')
# The palette applies, but all bars share one color; no hue variable splits them into multiple colors
plt.show()
print("Which area has the largest frequency of night crimes? Save as string called 'peak_night_crime_location'")
# Create a boolean mask for night crimes
night_mask = (crimes['TIME OCC'] >= '2200') | (crimes['TIME OCC'] <= '0359')
crimes['Night'] = night_mask
# Filter the night crimes
night_crimes = crimes[crimes['Night']]
# Find the area with the largest frequency of night crimes
peak_night_crime_location = night_crimes['AREA NAME'].value_counts().idxmax()
print(peak_night_crime_location, "\n")
print("Identify the number of crimes committed against victims of different age groups. Save as a pandas Series called 'victim_ages', with age groups '0-17', '18-25', '26-34', '35-44', '45-54', '55-64', and '65+' as the index and the frequency of crimes as the values.", '\n')
# The result should be a Series that maps each age group (bin) to its frequency, much like a dictionary
# pd.cut assigns each age to a bin, so the bin edges follow the listed age groups
# No loop is needed; pd.cut and value_counts handle every row at once
# Create bins for the ages
bins = [0, 17, 25, 34, 44, 54, 64, np.inf]
binlabels = ['0-17', '18-25', '26-34', '35-44', '45-54', '55-64', '65+']
# Bin the ages in the dataframe
crimes['Age Group'] = pd.cut(crimes['Vict Age'], bins=bins, labels=binlabels, include_lowest=True)
# Count the number of crimes in each age group
victim_ages = crimes['Age Group'].value_counts().sort_index()
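To see how the bin edges map onto the age labels, here is a small sketch that applies the same `bins` and labels to a handful of made-up ages (hypothetical values, not from crimes.csv). With the default `right=True`, each bin includes its upper edge, and `include_lowest=True` pulls age 0 into the first bin. Ages below 0 fall outside every bin and become NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on the bin edges
ages = pd.Series([-1, 0, 17, 18, 25, 26, 64, 65, 99])

bins = [0, 17, 25, 34, 44, 54, 64, np.inf]
labels = ['0-17', '18-25', '26-34', '35-44', '45-54', '55-64', '65+']

# Same call shape as in the analysis above
groups = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

# value_counts keeps empty categories; sort_index restores the label order
counts = groups.value_counts().sort_index()
print(counts)

# Rows that fell outside all bins (here only the age of -1)
print(groups.isna().sum())
```

Checking `isna().sum()` on the real 'Age Group' column is one way to find out how many records were left out of the totals.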
The Workflow
#LOCATION, Status Desc, Weapon Desc, Vict Descent, Vict Sex, Vict Age, Crm Cd Desc, AREA NAME,
#TIME OCC, DATE OCC, Date Rptd, DR_NO
# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
crimes = pd.read_csv("crimes.csv", parse_dates=["Date Rptd", "DATE OCC"], dtype={"TIME OCC": str})
crimes.head()
print("Which hour has the highest frequency of crimes? Store as an integer variable called 'peak_crime_hour'", "\n")
print(crimes['TIME OCC'].value_counts())
# The most common full time value looks like 1200
peak_crime_hour = crimes['TIME OCC'].value_counts().max()
# This doesn't work: .max() returns the largest count, and it counts whole 'HHMM' strings rather than hours
# Make a new column from the first two characters (the hour) of the TIME OCC column
crimes['HOUR OCC'] = crimes['TIME OCC'].str[:2].astype(int)
# Check the outcome
#print(crimes['HOUR OCC'])
# The outcome looks good; save as the indicated variable
peak_crime_hour = crimes['HOUR OCC'].max()
# Check again
#print(peak_crime_hour)
# This portion is wrong: .max() gives the latest hour in the data, not the most frequent one
# I jumped to a conclusion; visualize the variable first, then save it
# Show the frequency of crimes per hour with a countplot
# Set the palette before plotting; set_palette only affects plots drawn afterwards
sns.set_palette('pastel')
# figsize is (width, height) in inches
fig = plt.figure(figsize=(10,6))
# Keep the returned Axes in a variable so the axis labels can be set on it
a = sns.countplot(x='HOUR OCC', data=crimes)
# Add a figure-level title and leave room for it above the Axes title
fig.suptitle("Crime Occurrence by Hour in Military Time")
fig.subplots_adjust(top=0.85)
plt.title("Frequencies")
a.set_xlabel('Hour of the Day')
a.set_ylabel('Number of Occurrences')
# The palette applies, but all bars share one color; no hue variable splits them into multiple colors
plt.show()
# The peak looks like hour 12, but the variable may need to be saved a different way
peak_crime_hour = crimes[crimes['HOUR OCC'] == '12']
# Check
print(peak_crime_hour)
# This comes up as an empty DataFrame: 'HOUR OCC' holds integers, so no row equals the string '12'
peak_crime_hour = crimes['HOUR OCC'] == '12'
print(peak_crime_hour)
# This only creates a boolean Series that checks whether each row's hour matches '12'
# It is a mask over the rows, not the hour itself; count the hours and take the most frequent one instead
peak_crime_hour = crimes['HOUR OCC'].value_counts().idxmax()
print(peak_crime_hour)
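The attempts above mix up three different quantities. As a short illustration on a made-up Series of hours (hypothetical values, not from crimes.csv), the sketch below shows what each expression returns.

```python
import pandas as pd

# Hypothetical hours: 12 occurs three times, 8 twice, 23 once
hours = pd.Series([12, 12, 12, 23, 8, 8])

# Largest value in the column (the latest hour), not the most frequent
largest_hour = hours.max()

# Size of the biggest tally, i.e. how many times the top hour occurs
top_count = hours.value_counts().max()

# Index label of the biggest tally, i.e. the most frequent hour itself
top_hour = hours.value_counts().idxmax()

print(largest_hour, top_count, top_hour)
```

Only the last expression answers "which hour has the highest frequency". If a plain Python integer is required, wrapping the result in `int()` converts it from the NumPy integer type that pandas returns.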
print("Which area has the largest frequency of night crimes? Save as string called 'peak_night_crime_location'")
# Night crimes happen between 10 PM and 3:59 AM. Isn't that selection a bit arbitrary?
# Compare with the first question: there we wanted the highest frequency of crimes by hour, so we created a new column from the first two characters of a string
# For this we'll need to create and sort a new column, build a new DataFrame with both the new column and the area column, and chart it to find the largest frequency; a countplot shows frequencies
# Create a boolean mask for night crimes
night_mask = (crimes['TIME OCC'] >= '2200') | (crimes['TIME OCC'] <= '0359')
crimes['Night'] = night_mask
# Check
print(crimes['Night'])
#sort_night = crimes.sort_values(by=['Night', 'AREA NAME'])
#print(sort_night)
# This doesn't come out the way I want it to
#night_list = sorted(([crimes[crimes['Night']]] if [crimes[crimes['Night']]] is not None else float('inf')) for x in my_list)
# Stop right here: that tries to use list sorting methods on a pandas object
# Idea at this point: use dropna; the mask filter below turned out to be enough
# Filter the night crimes
night_crimes = crimes[crimes['Night']]
# Find the area with the largest frequency of night crimes
peak_night_crime_location = night_crimes['AREA NAME'].value_counts().idxmax()
print(peak_night_crime_location)
# I keep bypassing the graphing; these questions don't really make it mandatory (it's me, hi, I'm the problem, it's me)
# Start again
# The plan was to use isin; the same string comparison as above is used directly as the filter instead
night_crimes = crimes[(crimes['TIME OCC'] >= '2200') | (crimes['TIME OCC'] <= '0359')]
print(night_crimes)
# Count by area
night_crime_location_count = night_crimes['AREA NAME'].value_counts()
peak_night_crime_location = night_crime_location_count.idxmax()
print(peak_night_crime_location)
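The note about `isin` suggests an alternative way to write the night filter. The sketch below is one possible version on a toy DataFrame (the rows are made up for illustration and are not from crimes.csv). It selects the night hours by integer hour and checks that the result agrees with the string comparison used above.

```python
import pandas as pd

# Toy rows for illustration only; times sit on the edges of the night window
toy = pd.DataFrame({
    "TIME OCC": ["2200", "0359", "0400", "2159", "0000", "1200"],
    "AREA NAME": ["Central", "Central", "Hollywood", "Hollywood", "Hollywood", "Central"],
})

# Integer hour from the first two characters of the zero-padded time string
hour = toy["TIME OCC"].str[:2].astype(int)

# 10 PM through 3:59 AM covers these six hours
night_hours = [22, 23, 0, 1, 2, 3]
night = toy[hour.isin(night_hours)]

# The string comparison from the analysis, for comparison
string_mask = (toy["TIME OCC"] >= "2200") | (toy["TIME OCC"] <= "0359")

print(night)
print((hour.isin(night_hours) == string_mask).all())
```

Both filters pick the same rows here. The `isin` form states the night hours explicitly, while the string form depends on every time being a zero-padded four-character string.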
print("Identify the number of crimes committed against victims of different age groups. Save as a pandas Series called 'victim_ages', with age groups '0-17', '18-25', '26-34', '35-44', '45-54', '55-64', and '65+' as the index and the frequency of crimes as the values.", '\n')
# The result should be a Series that maps each age group (bin) to its frequency, much like a dictionary
# pd.cut assigns each age to a bin, so the bin edges follow the listed age groups
# No loop is needed; pd.cut and value_counts handle every row at once
# Create bins for the ages
bins = [0, 17, 25, 34, 44, 54, 64, np.inf]
binlabels = ['0-17', '18-25', '26-34', '35-44', '45-54', '55-64', '65+']
# Bin the ages in the dataframe
crimes['Age Group'] = pd.cut(crimes['Vict Age'], bins=bins, labels=binlabels, include_lowest=True)
# Count the number of crimes in each age group
victim_ages = crimes['Age Group'].value_counts().sort_index()
# This part is not needed: the task asks for a Series, which victim_ages already is
# Create a dictionary from the binned counts anyway
victim_dict = victim_ages.to_dict()
# Check
print(victim_dict)