Competition: Alhassan Osman's Contribution to the Competition - Analyzing Crimes in LA
  • Analyzing Crime in LA

    🌇🚔 Background

    Los Angeles, California 😎. The City of Angels. Tinseltown. The Entertainment Capital of the World! Known for its warm weather, palm trees, sprawling coastline, and Hollywood, along with producing some of the most iconic films and songs!

    However, as with any highly populated city, it isn't always glamorous, and there can be a large volume of crime. That's where you can help!

    You have been asked to support the Los Angeles Police Department (LAPD) by analyzing their crime data to identify patterns in criminal behavior. They plan to use your insights to allocate resources effectively to tackle various crimes in different areas.

    You are free to use any methodologies that you like in order to produce your insights.

    The Data

    They have provided you with a single dataset to use. A summary and preview are provided below.

    The data is publicly available here.

    👮‍♀️ crimes.csv

    Column: Description
    'DR_NO': Division of Records Number. Official file number made up of a 2-digit year, area ID, and 5 digits.
    'Date Rptd': Date reported - MM/DD/YYYY.
    'DATE OCC': Date of occurrence - MM/DD/YYYY.
    'TIME OCC': Time of occurrence, in 24-hour military time.
    'AREA': The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. These Geographic Areas are sequentially numbered from 1-21.
    'AREA NAME': The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for. For example, 77th Street Division is located at the intersection of South Broadway and 77th Street, serving neighborhoods in South Los Angeles.
    'Rpt Dist No': A four-digit code that represents a sub-area within a Geographic Area. All crime records reference the "RD" that it occurred in for statistical comparisons. Find LAPD Reporting Districts on the LA City GeoHub at http://geohub.lacity.org/datasets/c4f83909b81d4786aa8ba8a74ab
    'Crm Cd': Crime code for the offence committed.
    'Crm Cd Desc': Definition of the crime.
    'Vict Age': Victim's age (years).
    'Vict Sex': Victim's sex: F: Female, M: Male, X: Unknown.
    'Vict Descent': Victim's descent:
    • A - Other Asian
    • B - Black
    • C - Chinese
    • D - Cambodian
    • F - Filipino
    • G - Guamanian
    • H - Hispanic/Latin/Mexican
    • I - American Indian/Alaskan Native
    • J - Japanese
    • K - Korean
    • L - Laotian
    • O - Other
    • P - Pacific Islander
    • S - Samoan
    • U - Hawaiian
    • V - Vietnamese
    • W - White
    • X - Unknown
    • Z - Asian Indian
    'Premis Cd': Code for the type of structure, vehicle, or location where the crime took place.
    'Premis Desc': Definition of the 'Premis Cd'.
    'Weapon Used Cd': The type of weapon used in the crime.
    'Weapon Desc': Description of the weapon used (if applicable).
    'Status Desc': Crime status.
    'Crm Cd 1': Indicates the crime committed. Crime Code 1 is the primary and most serious one. Crime Code 2, 3, and 4 are respectively less serious offenses. Lower crime class numbers are more serious.
    'Crm Cd 2': May contain a code for an additional crime, less serious than Crime Code 1.
    'Crm Cd 3': May contain a code for an additional crime, less serious than Crime Code 1.
    'Crm Cd 4': May contain a code for an additional crime, less serious than Crime Code 1.
    'LOCATION': Street address of the crime.
    'Cross Street': Cross street of the rounded address.
    'LAT': Latitude of the crime location.
    'LON': Longitude of the crime location.

    import pandas as pd
    crimes = pd.read_csv("data/crimes.csv")
    crimes.head(10)
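
The data dictionary describes 'DATE OCC' as MM/DD/YYYY and 'TIME OCC' as 24-hour military time. A minimal sketch of turning those into a parsed date, a weekday, and an hour-of-day column, assuming the file matches that description. The three sample rows are made up for illustration, and `add_time_features`, 'HOUR OCC', and 'WEEKDAY OCC' are hypothetical names, not part of the competition template:

```python
import pandas as pd

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a parsed date, weekday name, and hour of day."""
    out = df.copy()
    # errors="coerce" turns unparseable dates into NaT instead of raising
    out["DATE OCC"] = pd.to_datetime(out["DATE OCC"], format="%m/%d/%Y", errors="coerce")
    # Military time stored as a number: 30 means 00:30, 2215 means 22:15
    time_occ = pd.to_numeric(out["TIME OCC"], errors="coerce")
    out["HOUR OCC"] = time_occ // 100
    out["WEEKDAY OCC"] = out["DATE OCC"].dt.day_name()
    return out

# Made-up rows for illustration only
sample = pd.DataFrame({
    "DATE OCC": ["01/15/2022", "03/02/2022", "12/31/2022"],
    "TIME OCC": [30, 1245, 2215],
})
features = add_time_features(sample)
print(features)
```

If the dates in the actual file use a different layout, the `format` argument would need to change accordingly.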

    💪 The Challenge

    • Use your skills to produce insights about crimes in Los Angeles.
    • Examples could include examining how crime varies by area, crime type, victim age, time of day, and victim descent.
    • You could build machine learning models to predict criminal activities, such as when a crime may occur, what type of crime, or where, based on features in the dataset.
    • You may also wish to visualize the distribution of crimes on a map.
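
As one possible starting point for the "crime varies by area and crime type" idea, a cross-tabulation gives the share of each crime type within each area. This is only a sketch on made-up rows; the column names follow the data dictionary, and the values are not taken from crimes.csv:

```python
import pandas as pd

# Made-up rows for illustration only; real values come from crimes.csv
sample = pd.DataFrame({
    "AREA NAME": ["Central", "Central", "Central", "Hollywood", "Hollywood", "Pacific"],
    "Crm Cd Desc": ["BURGLARY", "VEHICLE - STOLEN", "BURGLARY",
                    "VEHICLE - STOLEN", "VEHICLE - STOLEN", "BURGLARY"],
})

# normalize="index" converts counts to shares, so each area's row sums to 1
type_share_by_area = pd.crosstab(
    sample["AREA NAME"], sample["Crm Cd Desc"], normalize="index"
)
print(type_share_by_area.round(2))
```

Working with shares rather than raw counts keeps high-volume areas from dominating the comparison.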

    Note:

    To ensure the best user experience, we currently discourage using Folium and Bokeh in Workspace notebooks.

    ✍️ Judging criteria

    This competition is intended to help you understand how competitions work. It will not be judged.

    ✅ Checklist before publishing

    • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
    • Remove redundant cells like the judging criteria, so the workbook is focused on your work.
    • Check that all the cells run without error.

    ⌛️ Time is ticking. Good luck!

    import pandas as pd
    
    # Load the dataset
    crimes = pd.read_csv("data/crimes.csv")
    
    # Display the shape of the dataset
    print("(rows, cols)")
    print(crimes.shape)
    
    # Display information about the dataset
    print("\nDataset Information:")
    crimes.info()  # info() prints directly and returns None, so no print() wrapper is needed
    
    # Display the summary statistics for numerical columns
    print("\nSummary - numeric Statistics:")
    print(crimes.describe())
    
    # Display the summary statistics for categorical columns
    print("\nSummary - category Statistics:")
    print(crimes.select_dtypes('object').describe())
    
    # Number of unique values in each column
    print("\nNumber of Unique Values:")
    print(crimes.nunique())
    
    # Crime counts by area
    crime_counts_by_area = crimes['AREA NAME'].value_counts()
    print("\nCrime Counts by Area:")
    print(crime_counts_by_area)
    
    # Crime counts by crime type
    crime_counts_by_type = crimes['Crm Cd Desc'].value_counts()
    print("\nCrime Counts by Crime Type:")
    print(crime_counts_by_type)
    
    # Crime counts by victim age
    crime_counts_by_age = crimes['Vict Age'].value_counts()
    print("\nCrime Counts by Victim Age:")
    print(crime_counts_by_age)
    
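    # Crime counts by time of day
    # 'TIME OCC' is military time stored as a number (e.g. 30 for 00:30),
    # so integer division by 100 gives the hour; value_counts() then tallies each hour
    crime_counts_by_time = (pd.to_numeric(crimes['TIME OCC']) // 100).value_counts().sort_index()
    print("\nCrime Counts by Time of Day (Hourly):")
    print(crime_counts_by_time)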
    
    # Crime counts by victim sex
    crime_counts_by_Sex = crimes['Vict Sex'].value_counts(ascending=False)
    print("\nCrime Counts by Victim Sex:")
    print(crime_counts_by_Sex)
    
    # Crime counts by victim Descent
    crime_counts_by_Descent = crimes['Vict Descent'].value_counts(ascending=False)
    print("\nCrime Counts by Victim Descent:")
    print(crime_counts_by_Descent)
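
The single-letter descent codes in these counts are easier to read once mapped to the labels from the data dictionary. A small sketch, where `DESCENT_LABELS` is a hypothetical name, the sample codes are made up, and the "Missing" label for empty values is an assumption:

```python
import pandas as pd

# Code-to-label mapping copied from the 'Vict Descent' entry in the data dictionary
DESCENT_LABELS = {
    "A": "Other Asian", "B": "Black", "C": "Chinese", "D": "Cambodian",
    "F": "Filipino", "G": "Guamanian", "H": "Hispanic/Latin/Mexican",
    "I": "American Indian/Alaskan Native", "J": "Japanese", "K": "Korean",
    "L": "Laotian", "O": "Other", "P": "Pacific Islander", "S": "Samoan",
    "U": "Hawaiian", "V": "Vietnamese", "W": "White", "X": "Unknown",
    "Z": "Asian Indian",
}

# Made-up codes for illustration; None stands in for a missing value
sample_codes = pd.Series(["H", "W", "H", "B", "X", None, "H"], name="Vict Descent")

# Codes without a mapping (including missing values) become NaN, then "Missing"
labelled = sample_codes.map(DESCENT_LABELS).fillna("Missing")
descent_counts = labelled.value_counts()
print(descent_counts)
```

On the real data the same mapping could be applied to `crimes['Vict Descent']` before calling `value_counts()`.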
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Set the style for the visualizations
    sns.set_style('darkgrid')
    
    # Plotting crime counts by area
    plt.figure(figsize=(12, 6))
    sns.countplot(y='AREA NAME', data=crimes, order=crimes['AREA NAME'].value_counts().index)
    plt.xlabel('Count')
    plt.ylabel('Area Name')
    plt.title('Crime Counts by Area')
    plt.show()
    
    # Plotting crime counts by crime type
    plt.figure(figsize=(12, 6))
    top_crime_types = crime_counts_by_type.head(10)
    top_crime_types.plot(kind='barh')
    plt.xlabel('Count')
    plt.ylabel('Crime Type')
    plt.title('Top 10 Crime Types')
    plt.show()
    
    # Plotting crime counts by victim Sex
    plt.figure(figsize=(24, 12))
    crime_counts_by_Sex.plot(kind='bar')
    plt.xlabel('Victim Sex')
    plt.ylabel('Count')
    plt.title('Crime Counts by Victim Sex')
    plt.show()
    
    # Plotting crime counts by victim descent
    plt.figure(figsize=(24, 12))
    crime_counts_by_Descent.plot(kind='bar')
    plt.xlabel('Victim Descent')
    plt.ylabel('Count')
    plt.title('Crime Counts by Victim Descent')
    plt.show()
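
The plots in this cell cover area, crime type, victim sex, and victim descent, but not time of day. A sketch of an hourly bar chart, assuming 'TIME OCC' holds military time as a number; `plot_hourly_counts` is a hypothetical helper and the sample times are made up:

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_hourly_counts(time_occ: pd.Series):
    """Bar chart of record counts per hour (0-23); returns the Axes and the counts."""
    # Integer division by 100 turns 1245 into hour 12 and 30 into hour 0
    hours = pd.to_numeric(time_occ, errors="coerce").dropna().astype(int) // 100
    # reindex keeps hours with no records on the axis as zero-height bars
    counts = hours.value_counts().reindex(range(24), fill_value=0)
    fig, ax = plt.subplots(figsize=(12, 6))
    ax.bar(counts.index, counts.values)
    ax.set_xticks(range(24))
    ax.set_xlabel("Hour of Day")
    ax.set_ylabel("Count")
    ax.set_title("Crime Counts by Hour of Day")
    return ax, counts

# Made-up times for illustration only
ax, hourly = plot_hourly_counts(pd.Series([30, 1200, 1215, 2359, 1245]))
```

In the notebook, passing `crimes['TIME OCC']` and then calling `plt.show()` would render the chart.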
    # Build a machine learning model to predict the type of crime.
    # Load the first 50,000 rows of the dataset to keep training time manageable
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    
    crimes = pd.read_csv("data/crimes.csv", nrows=50000)
    
    # Drop rows with missing values in the target variable 'Crm Cd Desc'
    crimes = crimes.dropna(subset=['Crm Cd Desc'])
    
    # Drop columns with a high missing value ratio, free-text addresses, and identifiers.
    # 'Crm Cd' and 'Crm Cd 1' are also dropped because they encode the target directly
    # and would leak the answer to the model.
    columns_to_drop = ['Crm Cd', 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4',
                       'LOCATION', 'Cross Street',
                       'DR_NO', 'Date Rptd', 'DATE OCC', 'AREA NAME', 'Rpt Dist No']
    crimes = crimes.drop(columns=columns_to_drop)
    
    # Fill missing values in 'Vict Sex' and 'Vict Descent' with the most frequent value
    for col in ['Vict Sex', 'Vict Descent']:
        crimes[col] = crimes[col].fillna(crimes[col].mode()[0])
    
    # Fill missing values in the description columns with 'Unknown'
    for col in ['Premis Desc', 'Weapon Desc', 'Status Desc']:
        crimes[col] = crimes[col].fillna('Unknown')
    
    # Convert categorical variables to numerical using label encoding
    # (a fresh encoder per column, so mappings are not overwritten)
    for col in ['Vict Sex', 'Vict Descent', 'Premis Desc', 'Weapon Desc', 'Status Desc']:
        crimes[col] = LabelEncoder().fit_transform(crimes[col])
    
    # Features and target variable, taken AFTER preprocessing so the cleaning above is used.
    # The target stays as text; scikit-learn classifiers accept string labels.
    X = crimes.drop(columns='Crm Cd Desc')
    y = crimes['Crm Cd Desc']
    
    # Display the preprocessed data
    print(crimes.head())
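
Label encoding replaces each category with an integer, and the fitted encoder is the only record of which integer means what. A sketch that keeps one encoder per column so codes can be mapped back to the original labels later; the `encoders` dictionary is a hypothetical addition and the sample rows are made up:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for illustration only
sample = pd.DataFrame({
    "Vict Sex": ["F", "M", "X", "F"],
    "Status Desc": ["Invest Cont", "Adult Arrest", "Invest Cont", "Invest Cont"],
})

encoders = {}            # column name -> fitted LabelEncoder
encoded = sample.copy()
for col in sample.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(sample[col])

# inverse_transform recovers the original labels from the integer codes
decoded_sex = list(encoders["Vict Sex"].inverse_transform(encoded["Vict Sex"]))
print(encoded)
print(decoded_sex)
```

LabelEncoder assigns integers in sorted label order, which implies an ordering that tree-based models tolerate but linear models may misread.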
    
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, accuracy_score
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Identify numeric and categorical columns
    numeric_columns = X_train.select_dtypes(include=['number']).columns
    categorical_columns = X_train.select_dtypes(exclude=['number']).columns
    
    # Define separate transformers for numeric and categorical columns
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean'))
    ])
    
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    
    # Use ColumnTransformer to apply the transformers to the respective columns
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_columns),
            ('cat', categorical_transformer, categorical_columns)
        ])
    
    # Define the full pipeline with the preprocessor and the model
    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(random_state=42))
    ])
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = model.predict(X_test)
    
    # Evaluate the model
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred))
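
An accuracy figure for a many-class target such as crime type is easier to interpret next to a simple baseline. A sketch of a majority-class baseline, which always predicts the most frequent training label; `majority_baseline_accuracy` is a hypothetical helper and the label series are made up:

```python
import pandas as pd

def majority_baseline_accuracy(y_train: pd.Series, y_test: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent training label."""
    most_common = y_train.value_counts().idxmax()
    return float((y_test == most_common).mean())

# Made-up labels for illustration only
y_train_demo = pd.Series(["THEFT", "THEFT", "THEFT", "BURGLARY", "ASSAULT"])
y_test_demo = pd.Series(["THEFT", "BURGLARY", "THEFT", "THEFT"])

baseline = majority_baseline_accuracy(y_train_demo, y_test_demo)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

Applied to the notebook's own `y_train` and `y_test`, the result gives a floor that the random forest's accuracy should clearly exceed to be considered useful.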