Skip to content
0

Background

We are working in the analytics department of a multinational company, and the head of HR wants our help mapping out the company's employee network using message data. They plan to use the network map to understand interdepartmental dynamics better and explore how the company shares information. The ultimate goal of this project is to think of ways to improve collaboration throughout the company.

299 Seconds (Less than 5 Min.s 😊) Executive Summary:

Number of Departments = 6, Number of Employees = 664, Message Count = 3509 Intra-departmental (i.e. within depts.) message count = 920 (26.22%) Inter-departmental (i.e. among depts.) message count = 2589 (73.78%)

The most active departments, in terms of sent and received message counts, are Sales, Operations, and Admin. Marketing department has the lowest sent and received message counts. Receiving performance of Engineering and IT departments are relatively better compared to their sending performance.

Intra-departmental communication
  • Most messages (538) are sent within Sales department, then comes Operations and Admin with 198 and 172 messages respectively.
  • Marketing did not send any messages within the department. This could be normal though, as marketing people would not try to market anything to each other 😊. Following low scorer departments are Engineering and IT, both with 6 within-department messages.
Inter-departmental communication
  • There are 363 messages from Operations to Sales, 340 messages from Sales to Operations. Total (both sent and received) message count between these two departments is 703. Most messages (364) are sent from Sales to Admin. The message count in the reverse direction (i.e. from Admin to Sales) is 297. Total (both sent and received) message count between these two departments is 661.
  • Lowest communication activity took place between Marketing (M) and Engineering (E) (2 messages from M to E, and 0 from E to M), and between Marketing (M) and IT (0 messages from M to IT, and 3 from IT to M).
Ratings based connection counts
  • Top 3 sender employees with most outgoing connections:
Emp.id.# Dept. Outgoing connection count 598 Operations 77 144 Sales 75 128 Sales 71
  • Top 3 receiver employees with most incoming connections:
Emp.id.# Dept. Incoming connection count 194 Admin 13 32 Sales 11 249 Admin 9

Using graph theoretical network analysis, we selected the following relatively more important 11 nodes (i.e. employees) by combining the analytic power of several "node importance metrics" with the perspective of "connectivity analysis."

Emp.id.# Dept. 128 Sales 509 Admin 260 Operations 389 Sales 605 Admin 586 Operations 144 Sales 337 Sales 598 Operations 317 Operations 734 Sales

We note that these important people are mostly from Sales (count=5), Operations (count=4), and Admin (count=2) departments, which agrees with above observations about intra- and inter-departmental communication.

Therefore, our humble recommendation to HR is that they should concentrate on Engineering, IT, and Marketing departments as they have relatively low messaging activity (both internally and with other departments) and look for ways to improve or boost communication of these 3 departments, both within themselves and with other departments.

0. Import necessary libraries and load data

%%capture
!pip install matplotlib-venn
from matplotlib_venn import venn2
import networkx as nx # for network analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Hidden output
df_messages = pd.read_csv('data/messages.csv', parse_dates= ['timestamp'])
df_messages
df_employees = pd.read_csv('data/employees.csv')
df_employees

1. "Get to know" the data

1.1 Warmup: Inspect df_messages and df_employees dataframes

For each dataframe, we can find out the size and type of the data in each column (i.e. variable) using the info() method as follows:

df_messages.info()
df_employees.info()

We observe that there are no missing values in either of the dataframes. Let's make sure:

df_messages.isna().sum()
df_employees.isna().sum()

We had better check each table for duplicated rows as well. We don't expect any duplicates in the df_employee table, as its "id" column is the key (i.e. unique identifier) for employees. However, we suspect that 😊 there could be some duplicates in the df_messages table. Let us see:

df_employees.duplicated().sum(),df_messages.duplicated().sum()
β€Œ
β€Œ
β€Œ