import pandas as pd
crimes = pd.read_csv("data/crimes.csv")
crimes.head()
💪 The Challenge
- Use your skills to produce insights about crimes in Los Angeles.
- Examples could include examining how crime varies by area, crime type, victim age, time of day, and victim descent.
- You could build machine learning models to predict criminal activities, such as when a crime may occur, what type of crime, or where, based on features in the dataset.
- You may also wish to visualize the distribution of crimes on a map.
Note:
To ensure the best user experience, we currently discourage using Folium and Bokeh in Workspace notebooks.
✍️ Judging criteria
This competition exists to help you understand how competitions work. It will not be judged.
✅ Checklist before publishing
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- Remove redundant cells like the judging criteria, so the workbook is focused on your work.
- Check that all the cells run without error.
⌛️ Time is ticking. Good luck!
Data review 1:
- Import libraries
- Read the data file
- Get info on variables and missing values
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import folium
#read the datafile
print('READ AND REVIEW DATA:')
crimes = pd.read_csv("data/crimes.csv")
# change time columns to datetime
crimes['Date Rptd']=pd.to_datetime(crimes['Date Rptd'])
crimes['DATE OCC']=pd.to_datetime(crimes['DATE OCC'])
#copy the original data frame as a backup
crimes_spare=crimes.copy()
#print information about the columns
print(crimes.info())
# get the number of missing data for all columns
print(crimes.isna().sum())
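The raw missing-value counts printed above can be easier to scan as a sorted table with percentages. A minimal sketch, assuming a hypothetical helper `missing_summary` and a small made-up frame standing in for `crimes` (the values are illustrative, not taken from `crimes.csv`):

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column (hypothetical helper)."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually have gaps, worst first
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Illustrative data only; column names mirror the dataset, values are made up
demo = pd.DataFrame({
    "Vict Sex": ["M", None, "F", None],
    "Weapon Used Cd": [np.nan, np.nan, np.nan, 400.0],
    "AREA": [1, 2, 3, 4],
})
print(missing_summary(demo))
```

Calling `missing_summary(crimes)` in the notebook should give the same information as `crimes.isna().sum()`, restricted to the columns that have gaps.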
Data cleaning 1:
Check and clean victim's sex, victim's descent and the crime record number 'DR_NO'
print('CHECK AND CLEAN CATEGORICAL VARIABLES AND CRIME RECORDS:')
#check Victim's Sex
print(crimes['Vict Sex'].unique())
#Fill missing sex with 'X' (unknown)
crimes['Vict Sex']=crimes['Vict Sex'].fillna('X')
#Replace unrecognized letters with X for unknown sex
crimes['Vict Sex']=crimes['Vict Sex'].replace('H','X')
#Check again if ok
print(crimes['Vict Sex'].unique())
#check 'Vict descent'
#Fill missing descent with 'X' (unknown)
crimes['Vict Descent']=crimes['Vict Descent'].fillna('X')
#list the descent codes present in the data
print(crimes['Vict Descent'].unique())
vict_desc_cats=['A','B','C','D','F','G','H','I','J','K','L','O','P','S','U','V','W','X','Z']
#find the codes that fall outside the standard set
Desc_clean=set(crimes['Vict Descent']).difference(vict_desc_cats)
print(Desc_clean)
#flag the rows that carry those codes
Desc_clean_rows=crimes['Vict Descent'].isin(Desc_clean)
#show the offending rows
inconsistent=crimes[Desc_clean_rows]
print(inconsistent['Vict Descent'])
#replace the unrecognized '-' symbol with 'X' (unknown)
crimes['Vict Descent']=crimes['Vict Descent'].replace('-','X')
#check if ok
print(crimes['Vict Descent'].unique())
#Check the range of values for DR_NO
#extract year from datetime data
crimes['year']=pd.to_datetime(crimes['Date Rptd']).dt.year
crimes_spare['year']=pd.to_datetime(crimes_spare['Date Rptd']).dt.year
#use scatterplot to spot anomalous values
sns.scatterplot(data=crimes_spare, x='year', y='DR_NO')
plt.title('DR_NO with spurious values to be eliminated')
plt.show()
#Eliminate the anomalous low values
crimes=crimes[crimes['DR_NO']>0.5e8]
#validate
print('minimum value now: ',crimes['DR_NO'].min())
#remove rows with zero latitude or longitude
crimes=crimes[(crimes.LAT!=0) & (crimes.LON !=0)]
print('Minimum LAT and maximum LON after cleaning')
print(crimes.LAT.min())
print(crimes.LON.max())
Data Cleaning 1 recap:
Sex and victim's descent categories are rectified, the anomalous low values of DR_NO are eliminated, and rows with zero latitude or longitude are dropped.
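The category clean-up above handles the `'-'` symbol by name. As a more general sketch, anything outside an allowed set can be mapped to `'X'` in one step. The helper `normalize_codes` and the sample values are hypothetical; the list of descent codes is the one used in the cleaning cell.

```python
import pandas as pd

# Descent codes copied from the cleaning cell
VICT_DESC_CATS = ['A', 'B', 'C', 'D', 'F', 'G', 'H', 'I', 'J', 'K',
                  'L', 'O', 'P', 'S', 'U', 'V', 'W', 'X', 'Z']


def normalize_codes(codes: pd.Series, allowed, unknown: str = "X") -> pd.Series:
    """Map missing or unrecognized codes to `unknown` (hypothetical helper)."""
    # where() keeps values where the condition is True and replaces the rest;
    # missing values are never in `allowed`, so they are replaced as well
    return codes.where(codes.isin(allowed), unknown)


# Illustrative values only
demo = pd.Series(["H", "-", None, "W", "B"])
print(normalize_codes(demo, VICT_DESC_CATS).tolist())
```

The same helper could be applied to `Vict Sex` with an allowed set such as `['F', 'M', 'X']`. That set is an assumption here, since the cell above only treats `'H'` as unrecognized.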
###############################
Data cleaning 2:
Check remaining variables: area, rpt, crm code 1-4, vict age
print('CHECK AND CLEAN REMAINING VARIABLES:')
#check AREA code
print('AREA sequence',np.sort(crimes['AREA'].unique()))
#Area codes are in order from 1 to 21
#Dist No has no missing data, but check that all values are in order
print('\n')
#check 'Vict age'
#There are negative ages
#replace -2 and -1 with NaN to represent unknown values
#(assign the result back; inplace replace on a selected column may not update the frame)
crimes['Vict Age']=crimes['Vict Age'].replace([-1,-2], np.nan)
#deal with missing values in Crm Cd 1-4 when and if needed
#Weapon code and description have the same number of missing values,
#but cross-check that they are missing in the same rows
print('Weapons used:\n')
print('Cross check of all null and non-missing values for CD and Desc match:')
print(crimes[crimes['Weapon Used Cd'].isna()][['Weapon Used Cd','Weapon Desc']].drop_duplicates())
print(crimes[crimes['Weapon Desc'].isna()][['Weapon Used Cd','Weapon Desc']].drop_duplicates())
print(crimes[~crimes['Weapon Used Cd'].isna()][['Weapon Used Cd','Weapon Desc']].drop_duplicates())
print(crimes[~crimes['Weapon Desc'].isna()][['Weapon Used Cd','Weapon Desc']].drop_duplicates())
#OK all missing and non-missing values match
print ('Passed the test. All OK')
#check 'Premis Cd' and 'Premis Desc' because their numbers of missing values differ
print('\n Premis:\n')
print(crimes[crimes['Premis Desc'].isna()][['Premis Cd','Premis Desc']].drop_duplicates())
print('See if for 256, 418 and 972 all premis descriptions are indeed null')
print(crimes[crimes['Premis Cd'].isin([256.0,418.0,972.0])]['Premis Desc'])
#For these 3 'Premis Cd' codes there are no 'Premis Desc'
#For other missing 'Premis Cd', 'Premis Desc' is also missing
print ('Passed the test. All OK')
Data cleaning 2 recap:
Area list, weapons used and premises are reviewed and cross-validated, and negative victim ages are replaced with NaN. The remaining variables with missing values, such as crm cd 1, are examined to confirm that the missing data indeed represent unavailable data.
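The cross-validation above relies on reading the printed duplicates by eye. A minimal sketch of the same check as a boolean test, assuming a hypothetical helper `missing_pattern_mismatches` and made-up sample rows (column names follow the dataset, values are illustrative):

```python
import numpy as np
import pandas as pd


def missing_pattern_mismatches(df: pd.DataFrame, code_col: str, desc_col: str) -> pd.DataFrame:
    """Rows where exactly one of code/description is missing (hypothetical helper)."""
    mismatch = df[code_col].isna() != df[desc_col].isna()
    return df.loc[mismatch, [code_col, desc_col]]


# Illustrative data only
demo = pd.DataFrame({
    "Premis Cd": [101.0, 256.0, np.nan, 418.0],
    "Premis Desc": ["STREET", None, None, None],
})
bad = missing_pattern_mismatches(demo, "Premis Cd", "Premis Desc")
# Codes that have no description attached
print(bad["Premis Cd"].value_counts())
```

An empty result for `'Weapon Used Cd'` and `'Weapon Desc'` would confirm that the two columns are missing in exactly the same rows. For the premises columns, the returned rows would list the codes that lack a description.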