Competition - Employee Network Analysis

Background

We are working in the analytics department of a multinational company, and the head of HR wants our help mapping out the company's employee network using message data. They plan to use the network map to understand interdepartmental dynamics better and explore how the company shares information. The ultimate goal of this project is to think of ways to improve collaboration throughout the company.

299 Seconds (Less than 5 Min.s 😊) Executive Summary:

Number of Departments = 6, Number of Employees = 664, Message Count = 3509
Intra-departmental (i.e. within depts.) message count = 920 (26.22%)
Inter-departmental (i.e. among depts.) message count = 2589 (73.78%)

The most active departments, in terms of sent and received message counts, are Sales, Operations, and Admin. Marketing department has the lowest sent and received message counts. Receiving performance of Engineering and IT departments are relatively better compared to their sending performance.

Intra-departmental communication

Most messages (538) are sent within Sales department, then comes Operations and Admin with 198 and 172 messages respectively.
Marketing did not send any messages within the department. This could be normal though, as marketing people would not try to market anything to each other 😊. Following low scorer departments are Engineering and IT, both with 6 within-department messages.

Inter-departmental communication

There are 363 messages from Operations to Sales, 340 messages from Sales to Operations. Total (both sent and received) message count between these two departments is 703. Most messages (364) are sent from Sales to Admin. The message count in the reverse direction (i.e. from Admin to Sales) is 297. Total (both sent and received) message count between these two departments is 661.
Lowest communication activity took place between Marketing (M) and Engineering (E) (2 messages from M to E, and 0 from E to M), and between Marketing (M) and IT (0 messages from M to IT, and 3 from IT to M).

Ratings based connection counts

Top 3 sender employees with most outgoing connections:

Emp.id.#	Dept.		Outgoing connection count
598		Operations	77
144		Sales		75
128		Sales		71

Top 3 receiver employees with most incoming connections:

Emp.id.# 	Dept.		Incoming connection count
194		Admin		13
 32		Sales		11
249		Admin		9

Using graph theoretical network analysis, we selected the following relatively more important 11 nodes (i.e. employees) by combining the analytic power of several "node importance metrics" with the perspective of "connectivity analysis."

Emp.id.#	Dept.
128		Sales
509		Admin
260		Operations
389		Sales
605		Admin
586		Operations
144		Sales
337		Sales
598		Operations
317		Operations
734		Sales

We note that these important people are mostly from Sales (count=5), Operations (count=4), and Admin (count=2) departments, which agrees with above observations about intra- and inter-departmental communication.

Therefore, our humble recommendation to HR is that they should concentrate on Engineering, IT, and Marketing departments as they have relatively low messaging activity (both internally and with other departments) and look for ways to improve or boost communication of these 3 departments, both within themselves and with other departments.

0. Import necessary libraries and load data

%%capture
!pip install matplotlib-venn
from matplotlib_venn import venn2
import networkx as nx # for network analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Hidden output

df_messages = pd.read_csv('data/messages.csv', parse_dates= ['timestamp'])
df_messages

df_employees = pd.read_csv('data/employees.csv')
df_employees

1. "Get to know" the data

1.1 Warmup: Inspect df_messages and df_employees dataframes

For each dataframe, we can find out the size and type of the data in each column (i.e. variable) using the info() method as follows:

df_messages.info()

df_employees.info()

We observe that there are no missing values in either of the dataframes. Let's make sure:

df_messages.isna().sum()

df_employees.isna().sum()

We had better check each table for duplicated rows as well. We don't expect any duplicates in the df_employee table, as its "id" column is the key (i.e. unique identifier) for employees. However, we suspect that 😊 there could be some duplicates in the df_messages table. Let us see:

df_employees.duplicated().sum(),df_messages.duplicated().sum()

‌
‌
‌