Deciphering the structure of crime in LA: a quest to reduce crime rates


import pandas as pd
crimes = pd.read_csv("data/crimes.csv")
crimes.head()

💪 The Challenge

Use your skills to produce insights about crimes in Los Angeles.
Examples could include examining how crime varies by area, crime type, victim age, time of day, and victim descent.
You could build machine learning models to predict criminal activities, such as when a crime may occur, what type of crime, or where, based on features in the dataset.
You may also wish to visualize the distribution of crimes on a map.

Note:

To ensure the best user experience, we currently discourage using Folium and Bokeh in Workspace notebooks.

✍️ Judging criteria

This competition is for helping to understand how competitions work. This competition will not be judged.

✅ Checklist before publishing

Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
Remove redundant cells like the judging criteria, so the workbook is focused on your work.
Check that all the cells run without error.

⌛️ Time is ticking. Good luck!

Data review 1:

Import libraries, read the data file, and get info on variables and missing values.

import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import folium
#read the datafile
print('READ AND REVIEW DATA:')
crimes = pd.read_csv("data/crimes.csv")

# change  time columns to datetime 
crimes['Date Rptd']=pd.to_datetime(crimes['Date Rptd'])
crimes['DATE OCC']=pd.to_datetime(crimes['DATE OCC'])

#copy the original data frame as a backup
crimes_spare=crimes.copy()                              
#print information about the columns
print(crimes.info())
# get the number of missing data for all columns
print(crimes.isna().sum())
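
The raw counts from `isna().sum()` are easier to judge next to the column length. The sketch below shows one way to tabulate missing counts and percentages together. The helper name `missing_summary` and the toy frame are assumptions made for illustration; they are not part of the original notebook or of the real `crimes` data.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for `crimes`; the values are invented for illustration.
toy = pd.DataFrame({
    "DR_NO": [1, 2, 3, 4],
    "Vict Sex": ["M", None, "F", None],
    "Weapon Used Cd": [np.nan, np.nan, np.nan, 400.0],
})
print(missing_summary(toy))
```

On the real data the call would be `missing_summary(crimes)`.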

Data cleaning 1:

Check and clean victim's sex, victim's descent, the record number 'DR_NO', and the latitude/longitude coordinates.


print('CHECK AND CLEAN CATEGORICAL VARIABLES AND CRIME RECORDS:')

#check Victim's Sex

print(crimes['Vict Sex'].unique())           
# Fill missing sex by 'X' unknown
crimes['Vict Sex']=crimes['Vict Sex'].fillna('X')
#Replace unrecognized letters with X for unknown sex      
crimes['Vict Sex']=crimes['Vict Sex'].replace('H','X')
#Check again if ok

print(crimes['Vict Sex'].unique())

#check 'Vict descent'

#Fill missing descent by X meaning unknown
crimes['Vict Descent']=crimes['Vict Descent'].fillna('X')
#identify rows with a descent code outside the predefined categories

print (crimes['Vict Descent'].unique())
vict_desc_cats=['A','B','C','D','F','G','H','I','J','K','L','O','P','S','U','V','W','X','Z']
#Collect the category symbols outside the standard set
Desc_clean=set(crimes['Vict Descent']).difference(vict_desc_cats)

print(Desc_clean)
#Find the rows with symbols outside the standard set
Desc_clean_rows=crimes['Vict Descent'].isin(Desc_clean)
#Pinpoint the offending rows
inconsistent=crimes[Desc_clean_rows]

print(inconsistent['Vict Descent'])
#replace the invalid symbol with the unknown 'X' symbol
crimes['Vict Descent']=crimes['Vict Descent'].replace('-','X')
#check if ok

print(crimes['Vict Descent'].unique())


#Check the range of values for DR_NO

#extract year from datetime data
crimes['year']=pd.to_datetime(crimes['Date Rptd']).dt.year
crimes_spare['year']=pd.to_datetime(crimes_spare['Date Rptd']).dt.year
#use scatterplot to spot anomalous values
sns.scatterplot(data=crimes_spare, x='year', y='DR_NO')
plt.title('DR_NO with spurious values to be eliminated')                
plt.show()
#Eliminate the anomalous low values 
crimes=crimes[crimes['DR_NO']>0.5e8]
#validate
print('minimum value now: ',crimes['DR_NO'].min())
#check and clean latitude and longitude: drop rows with zero coordinates

crimes=crimes[(crimes.LAT!=0) & (crimes.LON !=0)]
print('Minimum LAT and maximum LON after cleaning')
print(crimes.LAT.min())
print(crimes.LON.max())
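
Both filters above drop rows silently, so it can help to record how many rows each one removes. The following is a minimal sketch of that idea. The helper `filter_with_report` and the toy values are hypothetical; only the two conditions (the `0.5e8` threshold on `DR_NO` and the non-zero coordinates) come from the cell above.

```python
import pandas as pd


def filter_with_report(df: pd.DataFrame, mask: pd.Series, label: str) -> pd.DataFrame:
    """Keep rows where `mask` is True and print how many rows were dropped."""
    kept = df[mask].copy()  # an explicit copy avoids chained-assignment warnings later
    print(f"{label}: dropped {len(df) - len(kept)} of {len(df)} rows")
    return kept


# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "DR_NO": [817, 200100501, 210200777, 220300123],
    "LAT": [34.05, 0.0, 34.10, 33.98],
    "LON": [-118.25, 0.0, -118.30, -118.40],
})

toy = filter_with_report(toy, toy["DR_NO"] > 0.5e8, "DR_NO threshold")
toy = filter_with_report(toy, (toy["LAT"] != 0) & (toy["LON"] != 0), "zero coordinates")
```

Returning a copy also means later column assignments act on an independent frame rather than on a slice of the original.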

Data Cleaning 1 recap:

Sex and victim's descent categories are rectified, the anomalous low values of DR_NO are eliminated, and rows with zero latitude or longitude are dropped.
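
The sex and descent fixes follow the same pattern: missing values and unrecognized symbols both become 'X'. The sketch below expresses that as one step, so that any symbol outside the allowed set is handled without being named individually. The helper `normalize_codes` is hypothetical, and the allowed set for 'Vict Sex' (`F`, `M`, `X`) is an assumption inferred from the cleaning above, which treats 'H' as unrecognized; the descent list is copied from the cleaning cell.

```python
import pandas as pd

# Descent codes copied from the cleaning cell; "X" stands for unknown.
VICT_DESC_CATS = ['A', 'B', 'C', 'D', 'F', 'G', 'H', 'I', 'J', 'K',
                  'L', 'O', 'P', 'S', 'U', 'V', 'W', 'X', 'Z']
# Assumed allowed values for 'Vict Sex' (not stated explicitly in the notebook).
VICT_SEX_CATS = ['F', 'M', 'X']


def normalize_codes(codes: pd.Series, allowed, unknown: str = "X") -> pd.Series:
    """Map missing values and any code outside `allowed` to `unknown`."""
    return codes.where(codes.isin(allowed), unknown)


# Toy series; the values are invented for illustration.
toy_descent = pd.Series(["H", "-", None, "W", "q"])
toy_sex = pd.Series(["M", "H", None, "F"])

print(normalize_codes(toy_descent, VICT_DESC_CATS).tolist())
print(normalize_codes(toy_sex, VICT_SEX_CATS).tolist())
```

Unlike replacing '-' by name, this also catches symbols that only appear in a later refresh of the data, at the cost of hiding them unless the unexpected set is printed first.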

###############################

Data cleaning 2:

Check remaining variables: area, rpt, crm code 1-4, vict age


print('CHECK AND CLEAN REMAINING VARIABLES:')

#check AREA code
print('AREA sequence',np.sort(crimes['AREA'].unique()))
#Area codes are in order from 1 to 21

#Dist No has no missing data, but check whether all values are in order

print('\n')

#check 'Vict Age'
#There are negative ages
#replace -2 and -1 with NaN to represent unknown values
#(assign the result back: inplace=True on a column selection stops working under pandas copy-on-write)
crimes['Vict Age']=crimes['Vict Age'].replace([-1,-2], np.nan)

# deal with missing values Crm Cd 1-4 when and if needed      

#Weapon code and weapon description have the same number of missing values,
#but let us cross-check that they agree
print('Weapons used:\n')
print('Cross-check that null and non-null values of Cd and Desc match:')
print(crimes[crimes['Weapon Used Cd'].isna()][['Weapon Used Cd','Weapon Desc']].drop_duplicates())
print(crimes[crimes['Weapon Desc'].isna()][['Weapon Used Cd','Weapon Desc']].drop_duplicates())
print(crimes[~crimes['Weapon Used Cd'].isna()][['Weapon Used Cd','Weapon Desc']].drop_duplicates())
print(crimes[~crimes['Weapon Desc'].isna()][['Weapon Used Cd','Weapon Desc']].drop_duplicates())
#OK all missing and non-missing values match
print ('Passed the test. All OK') 

#check 'Premis Cd' and 'Premis Desc' because their numbers of missing values differ
print('\n Premis:\n')
print(crimes[crimes['Premis Desc'].isna()][['Premis Cd','Premis Desc']].drop_duplicates())
print('See if for 256, 418 and 972 all premis descriptions are indeed null')
print(crimes[crimes['Premis Cd'].isin ([256.0,418.0,972.0])]['Premis Desc'])
#For these 3 'Premis Cd' codes there are no 'Premis Desc'
#For other missing 'Premis Cd', 'Premis Desc' is also missing
print ('Passed the test. All OK')

Data cleaning 2 recap:

Area list, weapons used and premises are reviewed and cross validated. The remaining variables with missing values such as crm cd 1 are examined to confirm that the missing data indeed represent unavailable data.
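
The cross-checks above are confirmed by reading the printed tables. As a complement, the sketch below shows how the same questions could be answered programmatically for any code/description pair. The helper `check_code_desc` is hypothetical, and the toy frame is invented for illustration (the description "STREET" is a placeholder, not a value taken from the dataset).

```python
import numpy as np
import pandas as pd


def check_code_desc(df: pd.DataFrame, code_col: str, desc_col: str) -> dict:
    """Summarize how well a code column and its description column agree."""
    code_na = df[code_col].isna()
    desc_na = df[desc_col].isna()
    # A code with more than one distinct description would break a 1:1 mapping.
    per_code = (
        df.dropna(subset=[code_col, desc_col])
        .groupby(code_col)[desc_col]
        .nunique()
    )
    return {
        # True when both columns are missing on exactly the same rows.
        "missing_together": bool((code_na == desc_na).all()),
        # Codes that are present but have no description.
        "codes_without_desc": sorted(df.loc[~code_na & desc_na, code_col].unique().tolist()),
        "one_desc_per_code": bool((per_code <= 1).all()),
    }


# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Premis Cd": [101.0, 101.0, 256.0, np.nan],
    "Premis Desc": ["STREET", "STREET", None, None],
})
print(check_code_desc(toy, "Premis Cd", "Premis Desc"))
```

On the real data the calls would be `check_code_desc(crimes, 'Weapon Used Cd', 'Weapon Desc')` and `check_code_desc(crimes, 'Premis Cd', 'Premis Desc')`.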
