Analyzing Crimes in LA - Clustering for Predominant Patterns - End-to-End Project
  • Analyzing Crime in LA

    🌇🚔 Background

    As with any highly populated city, Los Angeles isn't always glamorous, and there can be a large volume of crime. That's where you can help!

    You have been asked to support the Los Angeles Police Department (LAPD) by analyzing their crime data to identify patterns in criminal behavior. They plan to use your insights to allocate resources effectively to tackle various crimes in different areas.

    You are free to use any methodologies that you like in order to produce your insights.

    The Data

    They have provided you with a single dataset to use. A summary and preview are provided below.

    The data is publicly available here.

    👮‍♀️ crimes.csv

    'DR_NO' - Division of Records Number: Official file number made up of a 2 digit year, area ID, and 5 digits.
    'Date Rptd' - Date reported - MM/DD/YYYY.
    'DATE OCC' - Date of occurrence - MM/DD/YYYY.
    'TIME OCC' - In 24 hour military time.
    'AREA' - The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. These Geographic Areas are sequentially numbered from 1-21.
    'AREA NAME' - The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for. For example, 77th Street Division is located at the intersection of South Broadway and 77th Street, serving neighborhoods in South Los Angeles.
    'Rpt Dist No' - A four-digit code that represents a sub-area within a Geographic Area. All crime records reference the "RD" that it occurred in for statistical comparisons. Find LAPD Reporting Districts on the LA City GeoHub at
    'Crm Cd' - Crime code for the offence committed.
    'Crm Cd Desc' - Definition of the crime.
    'Vict Age' - Victim Age (years)
    'Vict Sex' - Victim's sex: F: Female, M: Male, X: Unknown.
    'Vict Descent' - Victim's descent:
    • A - Other Asian
    • B - Black
    • C - Chinese
    • D - Cambodian
    • F - Filipino
    • G - Guamanian
    • H - Hispanic/Latin/Mexican
    • I - American Indian/Alaskan Native
    • J - Japanese
    • K - Korean
    • L - Laotian
    • O - Other
    • P - Pacific Islander
    • S - Samoan
    • U - Hawaiian
    • V - Vietnamese
    • W - White
    • X - Unknown
    • Z - Asian Indian
    'Premis Cd' - Code for the type of structure, vehicle, or location where the crime took place.
    'Premis Desc' - Definition of the 'Premis Cd'.
    'Weapon Used Cd' - The type of weapon used in the crime.
    'Weapon Desc' - Description of the weapon used (if applicable).
    'Status Desc' - Crime status.
    'Crm Cd 1' - Indicates the crime committed. Crime Code 1 is the primary and most serious one. Crime Code 2, 3, and 4 are respectively less serious offenses. Lower crime class numbers are more serious.
    'Crm Cd 2' - May contain a code for an additional crime, less serious than Crime Code 1.
    'Crm Cd 3' - May contain a code for an additional crime, less serious than Crime Code 1.
    'Crm Cd 4' - May contain a code for an additional crime, less serious than Crime Code 1.
    'LOCATION' - Street address of the crime.
    'Cross Street' - Cross Street of rounded Address
    'LAT' - Latitude of the crime location.
    'LON' - Longitude of the crime location.

    💪 The Challenge

    • Use your skills to produce insights about crimes in Los Angeles.
    • Examples could include examining how crime varies by area, crime type, victim age, time of day, and victim descent.
    • You could build machine learning models to predict criminal activities, such as when a crime may occur, what type of crime, or where, based on features in the dataset.
    • You may also wish to visualize the distribution of crimes on a map.

    Exploratory Data Analysis

    Data Cleaning and Validation

    The data was received as described, structured in 25 columns and over 400K rows. The following modifications have been made:

    1. Columns ‘Date Rptd’ and ‘DATE OCC’ were of ‘object’ data type, so they were converted to ‘datetime’ format.
    2. Missing-value analysis revealed 9 columns with missing values. They were imputed or dropped as follows:
      • Columns ‘Premis Cd’ and ‘Crm Cd 1’ have only 6 missing values, seemingly at random, so the rows containing them were dropped.
      • Columns ‘Vict Sex’ and ‘Vict Descent’ contain 53875 missing values (13.5% of the total observations), and these are clearly not missing at random: they all coincide with 0 in the ‘Vict Age’ column and with only 2 types of crime, VEHICLE – STOLEN and THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER). No simple assumption can be made about the true values, so the missing values in ‘Vict Sex’ and ‘Vict Descent’ were imputed with ‘X’ for ‘Unknown’.
      • Columns ‘Weapon Used Cd’ and ‘Weapon Desc’ contain 264119 missing values (66% of the total observations), also clearly not missing at random. Here too, no simple assumption can be made, so a new category ‘Unknown’ was created for ‘Weapon Desc’, with 0 as its code in ‘Weapon Used Cd’. This introduces a serious imbalance in the data, but the value 0 stands well apart from the other codes, which might itself point to a certain pattern.
      • Column ‘Premis Desc’ contains 236 missing values, which were imputed with the corresponding values of the column ‘Premis Cd’ converted to text.
      • Columns ‘Crm Cd 2’, ‘Crm Cd 3’ and ‘Crm Cd 4’ contain a huge number of missing values, ranging from 93% to 100% of the total observations; all were imputed with 0.
      • Column ‘Cross Street’ contains 335564 missing values (84% of the total observations). We can assume that in these cases the location is not at a cross street of the rounded address, so I decided to impute the missing values with the value ‘No’.
    3. The inconsistent value ‘H’ in the column ‘Vict Sex’ was replaced with ‘X’ for ‘Unknown’.
    4. Column ‘Vict Age’ contained 1 value of 120, 2 values of -2 and 30 values of -1, which were replaced with 12, 2 and 1 respectively.
    5. The ‘Vict Age’ column contains values ranging from 0 to 99. A new column ‘Vict_age_cat’ was created to categorize the age as follows:
      • Child: 0 to 12
      • Teen: 13 to 18
      • Young Adult: 19 to 35
      • Adult: 36 to 67
      • Senior: Greater than 67
    6. The following new columns were created from the ‘DATE OCC’ column:
      • ‘year_month’: the date rounded to year and month
      • ‘date_occ_year’: the year only
      • ‘date_occ_month’: the month only
      • ‘date_occ_day’: the day only
      • ‘weekday_name’: the name of the day of the week
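
    The imputation and correction steps 1 to 4 above could be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the notebook's hidden code: the helper name `clean_crimes` and all toy values are assumptions, while the column names follow the data dictionary.

```python
import numpy as np
import pandas as pd


def clean_crimes(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring cleaning steps 1-4 described above."""
    out = df.copy()
    # Step 1: parse the date strings into datetimes.
    for col in ["Date Rptd", "DATE OCC"]:
        out[col] = pd.to_datetime(out[col])
    # Step 2: drop the few rows missing 'Premis Cd' or 'Crm Cd 1'.
    out = out.dropna(subset=["Premis Cd", "Crm Cd 1"]).reset_index(drop=True)
    # Unknown victim sex and descent become 'X'.
    out["Vict Sex"] = out["Vict Sex"].fillna("X")
    out["Vict Descent"] = out["Vict Descent"].fillna("X")
    # No weapon recorded: code 0 with description 'Unknown'.
    out["Weapon Used Cd"] = out["Weapon Used Cd"].fillna(0)
    out["Weapon Desc"] = out["Weapon Desc"].fillna("Unknown")
    # Missing premise descriptions fall back to the premise code as text.
    code_as_text = out["Premis Cd"].astype(int).astype(str)
    out["Premis Desc"] = out["Premis Desc"].fillna(code_as_text)
    # Secondary crime codes default to 0, missing cross streets to 'No'.
    for col in ["Crm Cd 2", "Crm Cd 3", "Crm Cd 4"]:
        out[col] = out[col].fillna(0)
    out["Cross Street"] = out["Cross Street"].fillna("No")
    # Step 3: the stray 'H' in 'Vict Sex' is treated as unknown.
    out["Vict Sex"] = out["Vict Sex"].replace("H", "X")
    # Step 4: repair implausible ages.
    out["Vict Age"] = out["Vict Age"].replace({120: 12, -2: 2, -1: 1})
    return out


# Toy rows (made up) that exercise each rule; the third row lacks 'Premis Cd'.
toy = pd.DataFrame({
    "Date Rptd": ["03/16/2021", "07/02/2022", "07/05/2022"],
    "DATE OCC": ["03/15/2021", "07/01/2022", "07/04/2022"],
    "Vict Age": [-1, 120, 34],
    "Vict Sex": [None, "H", "F"],
    "Vict Descent": [None, "W", "H"],
    "Premis Cd": [101.0, 502.0, np.nan],
    "Premis Desc": [None, "APARTMENT", "STREET"],
    "Weapon Used Cd": [np.nan, 400.0, np.nan],
    "Weapon Desc": [None, "SOME WEAPON", None],
    "Crm Cd 1": [510.0, 624.0, 330.0],
    "Crm Cd 2": [np.nan, 998.0, np.nan],
    "Crm Cd 3": [np.nan, np.nan, np.nan],
    "Crm Cd 4": [np.nan, np.nan, np.nan],
    "Cross Street": [None, "BROADWAY", None],
})
cleaned = clean_crimes(toy)
print(cleaned[["Vict Age", "Vict Sex", "Premis Desc", "Weapon Desc", "Cross Street"]])
```

    Keeping each rule as a separate, commented statement makes it easy to compare the code against the numbered list and to change a single imputation choice later.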

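    The derived columns from steps 5 and 6 (age categories and date parts) could be built along these lines. This is a sketch under assumptions: the helper name `add_derived_columns` is hypothetical, the toy rows are made up, and storing ‘year_month’ as a pandas monthly period is one possible reading of "rounded to year and month".

```python
import pandas as pd


def add_derived_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: age categories plus date parts from 'DATE OCC'."""
    out = df.copy()
    # Right-inclusive bins: 0-12, 13-18, 19-35, 36-67, and above 67.
    out["Vict_age_cat"] = pd.cut(
        out["Vict Age"],
        bins=[-1, 12, 18, 35, 67, float("inf")],
        labels=["Child", "Teen", "Young Adult", "Adult", "Senior"],
    )
    occ = out["DATE OCC"]
    out["year_month"] = occ.dt.to_period("M")   # e.g. 2021-03
    out["date_occ_year"] = occ.dt.year
    out["date_occ_month"] = occ.dt.month
    out["date_occ_day"] = occ.dt.day
    out["weekday_name"] = occ.dt.day_name()
    return out


# Toy rows (made up) sitting on the category boundaries.
toy = pd.DataFrame({
    "Vict Age": [0, 13, 35, 67, 99],
    "DATE OCC": pd.to_datetime(
        ["2021-03-15", "2021-07-04", "2022-01-01", "2022-04-30", "2022-07-01"]
    ),
})
features = add_derived_columns(toy)
print(features)
```

    Using `pd.cut` with explicit edges keeps the category boundaries in one place, so they can be checked directly against the list in step 5.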

    Total Number of Crimes per Time

    The line plot illustrates the total number of crimes over time in Los Angeles. There are some fluctuations and potential trends or patterns worth exploring further. For example, the data starts with a notable decrease in the number of crimes, which might have been caused by the pandemic. Moreover, the number of crimes repeatedly decreases in April and increases in July.
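
    Since the plotting code for this figure is not shown, here is one way such a monthly total could be computed and drawn. It is only a sketch: the helper names `monthly_crime_counts` and `plot_monthly_counts` are hypothetical, the toy dates are made up, and aggregating by month of ‘DATE OCC’ is an assumption about how the original plot was built.

```python
import pandas as pd
import matplotlib.pyplot as plt


def monthly_crime_counts(df: pd.DataFrame, date_col: str = "DATE OCC") -> pd.Series:
    """Count records per calendar month of occurrence (hypothetical helper)."""
    months = df[date_col].dt.to_period("M")
    return months.value_counts().sort_index()


def plot_monthly_counts(counts: pd.Series):
    """Draw the monthly totals as a line plot and return the Axes."""
    fig, ax = plt.subplots(figsize=(12, 5))
    ax.plot(counts.index.to_timestamp(), counts.values, marker="o")
    ax.set_title("Total Number of Crimes per Month")
    ax.set_xlabel("Month of occurrence")
    ax.set_ylabel("Number of crimes")
    return ax


# Toy dates (made up) standing in for the cleaned dataset.
toy = pd.DataFrame({"DATE OCC": pd.to_datetime(
    ["2021-03-01", "2021-03-20", "2021-04-02",
     "2021-07-04", "2021-07-15", "2021-07-30"]
)})
counts = monthly_crime_counts(toy)
print(counts)
ax = plot_monthly_counts(counts)
```

    Note that months with no records at all do not appear in `counts`; outside a notebook, call `plt.show()` to display the figure.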


    Frequency of Top 10 Crime Types by Area in Los Angeles

    The heatmap provides a detailed view of how the top 10 crime types are distributed across different areas in Los Angeles. The color intensity indicates the frequency of each crime type within an area, allowing us to quickly identify which crimes are most prevalent in which areas. This kind of visualization can be particularly useful for law enforcement to target specific crime types in areas where they are most common.

    import matplotlib.pyplot as plt
    import seaborn as sns

    # crimes_clean is the cleaned DataFrame produced in the data cleaning section
    # Count each crime description and keep the 10 most frequent
    crime_type_counts = crimes_clean['Crm Cd Desc'].value_counts().nlargest(10).reset_index()
    crime_type_counts.columns = ['Crime Type', 'Count']
    # Filter the dataset for the top 10 crime types
    top_crimes = crime_type_counts['Crime Type'].tolist()
    top_crimes_data = crimes_clean[crimes_clean['Crm Cd Desc'].isin(top_crimes)]
    # Create a pivot table of record counts: crime type (rows) by area (columns)
    crime_type_area_pivot = top_crimes_data.pivot_table(index='Crm Cd Desc', columns='AREA NAME', aggfunc='size', fill_value=0)
    # Draw the heatmap with a sequential color palette
    plt.figure(figsize=(15, 8))
    sns.heatmap(crime_type_area_pivot, annot=True, fmt="d", cmap='Blues', linewidths=.5)
    plt.title('Frequency of Top 10 Crime Types by Area in Los Angeles')
    plt.xlabel('Area Name')
    plt.ylabel('Crime Type')
    plt.show()

    Top 10 Most Common Crime Types in Los Angeles

    This visualization provides a clear picture of which types of crimes are most frequently reported. Understanding the prevalence of different crime types can help in prioritizing and strategizing crime prevention and response efforts.
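
    The code behind this chart is not shown; a ranking like the one described could be produced roughly as below. The helper name `top_crime_types` is hypothetical, the toy rows are made up, and a horizontal bar chart is an assumption about the chart type.

```python
import pandas as pd
import matplotlib.pyplot as plt


def top_crime_types(df: pd.DataFrame, n: int = 10, col: str = "Crm Cd Desc") -> pd.Series:
    """Return the n most frequent crime descriptions with their counts."""
    return df[col].value_counts().nlargest(n)


# Toy rows (made up) standing in for the cleaned dataset.
toy = pd.DataFrame({
    "Crm Cd Desc": ["VEHICLE - STOLEN"] * 3 + ["BURGLARY"] * 2 + ["ROBBERY"],
})
top = top_crime_types(toy, n=2)
print(top)

# Horizontal bars, reversed so the most common type ends up at the top.
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(top.index[::-1], top.values[::-1])
ax.set_title("Most Common Crime Types in Los Angeles")
ax.set_xlabel("Number of crimes")
ax.set_ylabel("Crime type")
```

    With the real data, calling `top_crime_types(crimes_clean)` with the default `n=10` would give the ten most frequent types.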


    Trends of Top 10 Crime Types Over Time in Los Angeles

    The line plot illustrates the trends of the top 10 crime types over time in Los Angeles. Each line represents the frequency of a specific crime type from month to month. This visualization helps us identify patterns, such as increases or decreases in crime rates, and any seasonal trends or anomalies.

    From the plot, we can observe that some crime types show clear trends, while others fluctuate more randomly. This kind of analysis can be particularly insightful for predicting future crime rates and understanding the effectiveness of law enforcement strategies over time.
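
    A month-by-type table of the kind this plot relies on could be assembled as sketched below. The helper names `crime_type_trends` and `plot_trends` are hypothetical, the toy rows are made up, and grouping by month of ‘DATE OCC’ is an assumption about the original figure.

```python
import pandas as pd
import matplotlib.pyplot as plt


def crime_type_trends(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Monthly counts for the n most frequent crime types (one column per type)."""
    top = df["Crm Cd Desc"].value_counts().nlargest(n).index
    subset = df[df["Crm Cd Desc"].isin(top)]
    months = subset["DATE OCC"].dt.to_period("M").rename("year_month")
    return (
        subset.groupby([months, "Crm Cd Desc"])
        .size()
        .unstack(fill_value=0)   # months as rows, crime types as columns
        .sort_index()
    )


def plot_trends(trends: pd.DataFrame):
    """Draw one line per crime type and return the Axes."""
    fig, ax = plt.subplots(figsize=(14, 7))
    x = trends.index.to_timestamp()
    for crime_type in trends.columns:
        ax.plot(x, trends[crime_type].values, marker="o", label=crime_type)
    ax.set_title("Trends of Top Crime Types Over Time in Los Angeles")
    ax.set_xlabel("Month of occurrence")
    ax.set_ylabel("Number of crimes")
    ax.legend(title="Crime type")
    return ax


# Toy rows (made up); ROBBERY is too rare to make the top 2.
toy = pd.DataFrame({
    "DATE OCC": pd.to_datetime([
        "2021-03-01", "2021-03-15", "2021-03-20",
        "2021-04-02", "2021-04-10", "2021-04-11", "2021-04-12",
    ]),
    "Crm Cd Desc": [
        "BURGLARY", "BURGLARY", "VEHICLE - STOLEN",
        "BURGLARY", "VEHICLE - STOLEN", "VEHICLE - STOLEN", "ROBBERY",
    ],
})
trends = crime_type_trends(toy, n=2)
print(trends)
ax = plot_trends(trends)
```

    Having the counts in a wide table also makes it straightforward to compare individual months or to compute month-over-month changes per crime type.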