import pandas as pd
crimes = pd.read_csv("data/crimes.csv")
crimes.head()

💪 The Challenge

  • Use your skills to produce insights about crimes in Los Angeles.
  • Examples could include examining how crime varies by area, crime type, victim age, time of day, and victim descent.
  • You could build machine learning models to predict criminal activities, such as when a crime may occur, what type of crime, or where, based on features in the dataset.
  • You may also wish to visualize the distribution of crimes on a map.

Note:

To ensure the best user experience, we currently discourage using Folium and Bokeh in Workspace notebooks.

✍️ Judging criteria

This competition is for helping to understand how competitions work. This competition will not be judged.

✅ Checklist before publishing

  • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
  • Remove redundant cells like the judging criteria, so the workbook is focused on your work.
  • Check that all the cells run without error.

⌛️ Time is ticking. Good luck!

Data review 1:

Import libraries, read the data file, and get info on variables and missing values.

import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import folium
#read the datafile
print('READ AND REVIEW DATA:')
crimes = pd.read_csv("data/crimes.csv")

# convert the date columns to datetime
crimes['Date Rptd']=pd.to_datetime(crimes['Date Rptd'])
crimes['DATE OCC']=pd.to_datetime(crimes['DATE OCC'])

#copy the original data frame as a backup
crimes_spare=crimes.copy()                              
#print information about the columns
print(crimes.info())
# get the number of missing data for all columns
print(crimes.isna().sum())
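
The raw counts from `isna().sum()` are easier to read next to the share of rows they represent. Below is a minimal sketch of such a summary; the helper name `missing_summary` and the toy frame are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for `crimes` (values are made up for illustration)
toy = pd.DataFrame({
    "Vict Sex": ["M", None, "F", None],
    "Weapon Used Cd": [np.nan, np.nan, 400.0, np.nan],
    "AREA": [1, 2, 3, 4],
})
print(missing_summary(toy))
```

On the real data the call would be `missing_summary(crimes)`, assuming `crimes` has been loaded as above.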

Data cleaning 1:

Check and clean victim sex, victim descent, and the crime record number 'DR_NO'.


print('CHECK AND CLEAN CATEGORICAL VARIABLES AND CRIME RECORDS:')

#check Victim's Sex

print(crimes['Vict Sex'].unique())           
# Fill missing sex with 'X' (unknown)
crimes['Vict Sex']=crimes['Vict Sex'].fillna('X')
#Replace the unrecognized code 'H' with 'X' (unknown)
crimes['Vict Sex']=crimes['Vict Sex'].replace('H','X')
#Check the result

print(crimes['Vict Sex'].unique())

#check 'Vict descent'

#Fill missing descent with 'X' (unknown)
crimes['Vict Descent']=crimes['Vict Descent'].fillna('X')
#Identify rows whose descent code is outside the predefined categories

print(crimes['Vict Descent'].unique())
vict_desc_cats=['A','B','C','D','F','G','H','I','J','K','L','O','P','S','U','V','W','X','Z']
#Collect the codes that are outside the standard set
Desc_clean=set(crimes['Vict Descent']).difference(vict_desc_cats)

print(Desc_clean)
#Flag the rows that carry those codes
Desc_clean_rows=crimes['Vict Descent'].isin(Desc_clean)
#Select the offending rows
inconsistent=crimes[Desc_clean_rows]

print(inconsistent['Vict Descent'])
#Replace the offending symbol '-' with 'X' (unknown)
crimes['Vict Descent']=crimes['Vict Descent'].replace('-','X')
#Check the result

print(crimes['Vict Descent'].unique())


#Check the range of values for DR_NO

#extract year from datetime data
crimes['year']=pd.to_datetime(crimes['Date Rptd']).dt.year
crimes_spare['year']=pd.to_datetime(crimes_spare['Date Rptd']).dt.year
#use scatterplot to spot anomalous values
sns.scatterplot(data=crimes_spare, x='year', y='DR_NO')
plt.title('DR_NO with spurious values to be eliminated')                
plt.show()
#Eliminate the anomalously low values
crimes=crimes[crimes['DR_NO']>0.5e8]
#Validate
print('minimum value now: ',crimes['DR_NO'].min())
#Check and clean latitude and longitude: drop rows where either coordinate is 0

crimes=crimes[(crimes.LAT!=0) & (crimes.LON !=0)]
print('Minimum LAT and maximum LON after cleaning')
print(crimes.LAT.min())
print(crimes.LON.max())

Data Cleaning 1 recap:

Victim sex and descent categories are rectified, the anomalously low values of DR_NO are eliminated, and rows with a zero latitude or longitude are dropped.
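
The sex and descent fixes above follow the same pattern: fill missing values, then map unexpected codes to 'X'. Below is a minimal sketch of that pattern as one reusable step. The helper name `normalize_codes` and the toy series are assumptions for illustration. Note that, unlike the cell above (which replaces only 'H' for sex and '-' for descent), this sketch maps every code outside the allowed list to 'X'.

```python
import pandas as pd

def normalize_codes(s: pd.Series, allowed, unknown: str = "X") -> pd.Series:
    """Map missing values and any code outside `allowed` to `unknown`."""
    s = s.fillna(unknown)
    # Keep values that are in the allowed set, replace everything else
    return s.where(s.isin(allowed), unknown)

# Descent categories as listed in the cleaning cell above
vict_desc_cats = ['A','B','C','D','F','G','H','I','J','K','L',
                  'O','P','S','U','V','W','X','Z']

# Toy series standing in for crimes['Vict Descent'] (made-up values)
toy = pd.Series(["W", "-", None, "H", "Q"])
cleaned = normalize_codes(toy, vict_desc_cats)
print(cleaned.tolist())
```

Being explicit about the allowed set means a new stray symbol in a later data refresh would be caught without editing the replacement rule, at the cost of silently folding it into 'X'.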

###############################

Data cleaning 2:

Check the remaining variables: area, reporting district, crime codes 1-4, and victim age.


print('CHECK AND CLEAN REMAINING VARIABLES:')

#check AREA code
print('AREA sequence',np.sort(crimes['AREA'].unique()))
#Area codes are in order from 1 to 21

#Dist No has no missing data; still to check that the values are all in order

print('\n')

#check 'Vict Age'
#There are negative ages
#Replace -2 and -1 with NaN to represent an unknown value
#(assign the result back; inplace=True on a selected column may not update the frame)
crimes['Vict Age']=crimes['Vict Age'].replace([-1,-2], np.nan)

# deal with missing values Crm Cd 1-4 when and if needed      

#'Weapon Used Cd' and 'Weapon Desc' have the same number of missing values,
#but cross-check that the missing entries fall on the same rows.
print('Weapons used:\n')
print('Cross-check that null and non-null values of Cd and Desc match:')
print(crimes[crimes['Weapon Used Cd'].isna()][['Weapon Used Cd','Weapon Desc']].drop_duplicates())
print(crimes[crimes['Weapon Desc'].isna()][['Weapon Used Cd','Weapon Desc']].drop_duplicates())
print(crimes[~crimes['Weapon Used Cd'].isna()][['Weapon Used Cd','Weapon Desc']].drop_duplicates())
print(crimes[~crimes['Weapon Desc'].isna()][['Weapon Used Cd','Weapon Desc']].drop_duplicates())
#OK all missing and non-missing values match
print ('Passed the test. All OK') 

#Check 'Premis Cd' and 'Premis Desc' because their numbers of missing values differ
print('\n Premis:\n')
print(crimes[crimes['Premis Desc'].isna()][['Premis Cd','Premis Desc']].drop_duplicates())
print('See if for 256, 418 and 972 all premise descriptions are indeed null')
print(crimes[crimes['Premis Cd'].isin([256.0,418.0,972.0])]['Premis Desc'])
#These 3 'Premis Cd' codes have no 'Premis Desc'
#Where 'Premis Cd' itself is missing, 'Premis Desc' is also missing
print ('Passed the test. All OK') 


Data cleaning 2 recap:

Area list, weapons used and premises are reviewed and cross validated. The remaining variables with missing values such as crm cd 1 are examined to confirm that the missing data indeed represent unavailable data.
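
The cross-checks above rely on reading the printed tables by eye. Below is a minimal sketch of the same check as a function that returns the rows where exactly one column of a code/description pair is missing. The helper name `missing_mismatch`, the toy frame, and the placeholder descriptions are assumptions for illustration, not values from the dataset.

```python
import numpy as np
import pandas as pd

def missing_mismatch(df: pd.DataFrame, code_col: str, desc_col: str) -> pd.DataFrame:
    """Return rows where exactly one of the two columns is missing."""
    mask = df[code_col].isna() ^ df[desc_col].isna()
    return df.loc[mask, [code_col, desc_col]]

# Toy frame standing in for `crimes` (made-up values)
toy = pd.DataFrame({
    "Weapon Used Cd": [400.0, np.nan, 102.0],
    "Weapon Desc": ["desc a", None, "desc b"],
    "Premis Cd": [101.0, 256.0, np.nan],
    "Premis Desc": ["desc c", None, None],
})

weapon_bad = missing_mismatch(toy, "Weapon Used Cd", "Weapon Desc")
premis_bad = missing_mismatch(toy, "Premis Cd", "Premis Desc")

# Weapons: code and description are missing on the same rows
assert weapon_bad.empty
# Premises: a code with no description shows up as a mismatch
print(premis_bad)
```

With a check like this, the "Passed the test" message could be printed only when the returned frame is empty, or when its codes are limited to the known exceptions 256, 418 and 972.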