Duplicate of Competition - Employee Network Analysis

How can the company improve collaboration?

# --- Standard library ---
import datetime
import json, random
from string import Template

# --- Data handling ---
import numpy as np
import pandas as pd
from dateutil.parser import parse

# --- Notebook display helpers (IPython.core.display is deprecated for these names) ---
import IPython.display
from IPython.display import display, HTML, Javascript

# --- Plotting: matplotlib / seaborn ---
import matplotlib.lines as lines
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap

# --- Plotting: plotly ---
import plotly as py
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly import subplots
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

# --- scikit-learn utilities ---
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.utils import shuffle
#from yellowbrick.cluster.elbow import kelbow_visualizer 
#from yellowbrick.cluster import SilhouetteVisualizer

# --- Create list of color palettes ---
COLORS_SET_B_G_R = [sns.color_palette('muted')[1], sns.color_palette('muted')[3], sns.color_palette('muted')[4]]
blue_palette=['#020f52','#041a8c','#0423c4','#3050f2','#25f7db','#3db8f5']
mixt_palette = ['#800000', '#008080']

# --- Plot color palettes ---
#sns.palplot(blue_palette)

html_contents ="""
<!DOCTYPE html>
<html lang="en">
    <head>
    <style>
    .toc h2{
        color: white;
        background: #3f4d63;
        font-weight: 600;
        font-family: Helvetica;
        font-size: 23px;
        padding: 6px 12px;
        margin-bottom: 2px;
    }
    
    .toc ol li{
        list-style:none;
        line-height:normal;
        }
     
    .toc li{
        background: #235f83;
        color: white;
        font-weight: 600;
        font-family: Helvetica;
        font-size: 18px;
        margin-bottom: 2px;
        padding: 6px 12px;
    }

    .toc ol ol li{
        background: #fff;
        color: #4d4d4d;
        font-weight: 400;
        font-size: 15px;
        font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
        margin-top: 0px;
        margin-bottom: 0px;
        padding: 3px 12px;
    } 
    
    .section_title{
        background-color: #3f4d63;
        color: white;
        font-family: Helvetica;
        font-size: 25px;
        padding: 6px 12px;
        margin-bottom: 5px;
    }
    .subsection_title{
        background: #235f83;
        color: white;
        font-family: Helvetica;
        font-size: 21px;
        padding: 6px 12px;
        margin-bottom: 0px;
    }
    .sidenote{
        font-size: 13px;
        border: 1px solid #d7d7d7;
        padding: 1px 10px 2px;
        box-shadow: 1px 1px 2px 1px rgba(0,0,0,0.3);
        margin-bottom: 3px;
    }
    </style>
    </head>
    <body>
        <div class="toc">
        
        <ol> 
        <h2> Table of Contents </h2>
        <li>1. Introduction </li>
        <ol>
            <li>1.1 Data information </li>
            <li>1.2 Features of data </li>
        </ol>
        <li>2. Study of the employees dataset</li>
        <li>3. Study of the messages dataset</li>
        <li>4. Unified dataset study </li>
        <ol> 
            <li>4.1 Which departments are present in each location?</li> 
            <li>4.2 Which departments are active at each hour?</li> 
            <li>4.3 Which departments are present in each age category? </li>
            <li>4.4 Does age category influence the length of the message sent? </li>
            <li>4.5 Does hour influence the length of the message sent? </li>
        </ol>
        <li>5. Answers to the objective questions </li>
         <ol>        
            <li>5.1 Which employees are the most active sending messages?</li> 
            <li>5.2 Which employees receive the most messages?</li> 
            <li>5.3 What is the weight of each department in the dataset?</li>
            <ol>        
                <li>5.3.1 Marketing department</li> 
                <li>5.3.2 Sales department</li> 
                <li>5.3.3 Operations department</li>
                <li>5.3.4 Administrative department </li>
                <li>5.3.5 IT department </li>
                <li>5.3.6 Engineering department </li>
            </ol>
        </ol>
        <li>6. Conclusions </li>
        </ol>
        </div>
    </body>
</html>
"""

HTML(html_contents)

1. Introduction

Executive summary: the figure below highlights four key conclusions. The rest of the notebook covers the details.

import warnings
warnings.filterwarnings("ignore")

from IPython.display import Image

Image(filename='dep.png')


1.1 Data information

This project is set in the analytics department of a multinational company, where the head of HR wants to map the company's employee network using message data. The goal is to use the network map to better understand interdepartmental dynamics and explore how the company shares information. The ultimate goal of this project is to think of ways to improve collaboration across the company. To study the interactions within the company, we will use two datasets: employees and messages.

`Messages has information on the sender, receiver, and time.`

sender - represents the employee id of the employee sending the message.
receiver - represents the employee id of the employee receiving the message.
timestamp - the date of the message.
message_length - the length in words of the message.

`Employees has information on each employee:`

id - represents the employee id of the employee.
department - is the department within the company.
location - is the country where the employee lives.
age - is the age of the employee.

Acknowledgments: Pietro Panzarasa, Tore Opsahl, and Kathleen M. Carley. "Patterns and dynamics of users' behavior and interaction: Network analysis of an online community." Journal of the American Society for Information Science and Technology 60.5 (2009): 911-932.

Project objectives:

Which departments are the most/least active?
Which employee has the most connections?
Identify the most influential departments and employees.
Using the network analysis, in which departments would you recommend the HR team focus to boost collaboration?
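As a rough illustration of the "most connections" question, the sketch below counts the number of distinct counterparts per employee, ignoring message direction. It uses only the standard library; the toy `messages` list and the helper name `count_connections` are hypothetical and not part of the notebook.

```python
from collections import defaultdict

# Hypothetical toy data: (sender, receiver) pairs. The real dataset has 3512 rows.
messages = [(1, 2), (1, 3), (2, 1), (1, 2), (4, 1)]

def count_connections(pairs):
    """Return {employee_id: number of distinct counterparts}, ignoring direction."""
    neighbours = defaultdict(set)
    for sender, receiver in pairs:
        if sender == receiver:
            continue  # a message to oneself is not a connection
        neighbours[sender].add(receiver)
        neighbours[receiver].add(sender)
    return {emp: len(others) for emp, others in neighbours.items()}

connections = count_connections(messages)
# Employee with the largest number of distinct counterparts
most_connected = max(connections, key=connections.get)
print(connections, most_connected)
```

Repeated messages between the same pair count once here, so this measures breadth of contact rather than message volume.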


1.2 Features of data

The messages dataset has 3512 entries and 4 columns. The employees dataset has 664 entries and 4 columns. There are no missing values.

| Feature | Data type | Unique | Description |
|---|---|---|---|
| `sender` | categorical | 85 | Employee id of the employee sending the message. |
| `receiver` | categorical | 617 | Employee id of the employee receiving the message. |
| `timestamp` | datetime | -- | Date of the message. |
| `message_length` | continuous | 79 | Length of the message in words. |
| `id` | categorical | 664 | Employee id of the employee. |
| `department` | categorical | 6 | Department within the company. |
| `location` | categorical | 5 | Country where the employee lives. |
| `age` | continuous | 1 | Age of the employee. |
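A summary like the one above can be derived directly from a DataFrame. The following is a minimal sketch, assuming pandas; the helper `summarize_features` and the toy frame are hypothetical and only stand in for the real `employees` and `messages` tables.

```python
import pandas as pd

def summarize_features(df):
    """One row per column: dtype, number of unique values, number of missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "unique": df.nunique(),
        "missing": df.isna().sum(),
    })

# Hypothetical toy frame with the same kind of columns as the employees dataset
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "department": ["Sales", "IT", "Sales", "Admin"],
    "age": [25, 40, 40, 58],
})
summary = summarize_features(sample)
print(summary)
```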

2. Study of the employees dataset

Next we study the employees dataset, which has 664 entries and 4 columns. It records where each employee is located, which department they belong to, and how old they are.

The first question is: where do the company's employees live?

According to the dataset, they are spread across 5 countries (US, UK, Brasil, France, Germany). Two countries are in the Americas and the remaining three are in Europe. The map shows one marker per country; marker size and color represent the number of employees located there.

employees = pd.read_csv('data/employees.csv')
employees.head()

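# Count employees per country; value_counts() sorts by count, descending.
mundo = employees['location'].value_counts()
mundo = mundo.reset_index()
mundo.columns = ['Location','Count']
# Approximate country centroids, listed in the order US, France, Germany, Brasil, UK.
# NOTE: this relies on value_counts() returning the countries in exactly that order.
mundo['Latitude'] = [37.09024,   46.227638, 51.165691, -15.77972, 55.378051]
mundo['Longitude'] = [-95.712891, 2.213749, 10.451526, -47.92972, -3.435973]
mundo.head()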
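Assigning coordinates by position works only while the country order stays fixed. A possibly more robust alternative is to look coordinates up by country name, sketched below. The coordinates are the ones used in this notebook; the exact spelling of the country labels, the helper `add_coordinates`, and the toy counts are assumptions.

```python
import pandas as pd

# Approximate country centroids (latitude, longitude).
# The keys assume how `location` is spelled in the data.
CENTROIDS = {
    "US": (37.09024, -95.712891),
    "France": (46.227638, 2.213749),
    "Germany": (51.165691, 10.451526),
    "Brasil": (-15.77972, -47.92972),
    "UK": (55.378051, -3.435973),
}

def add_coordinates(counts):
    """Attach Latitude/Longitude to a frame with a 'Location' column, matching by name."""
    out = counts.copy()
    out["Latitude"] = out["Location"].map({k: v[0] for k, v in CENTROIDS.items()})
    out["Longitude"] = out["Location"].map({k: v[1] for k, v in CENTROIDS.items()})
    # Fail loudly if a country has no known coordinates
    unknown = out.loc[out["Latitude"].isna(), "Location"].tolist()
    if unknown:
        raise ValueError(f"No coordinates for: {unknown}")
    return out

# Hypothetical counts in arbitrary order
toy_counts = pd.DataFrame({"Location": ["UK", "US", "Brasil"], "Count": [3, 10, 5]})
located = add_coordinates(toy_counts)
print(located)
```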

sns.color_palette("pastel")
sns.set_style("white")

fig = px.scatter_mapbox(
    mundo,
    lat="Latitude",
    lon="Longitude",
    size="Count",    # marker size encodes the number of employees
    color="Count",   # marker color encodes the same count
    size_max=25,
    zoom=1.1,
    title='Location of the regions represented in the dataset'
)

fig.update_layout(mapbox_style="open-street-map", width=820, height=820)
fig.show()

employees['location'].value_counts()
location = pd.DataFrame(employees['location'].value_counts())
location = location.reset_index()
location.columns = ['Location','Counts']
location.head()

employees['department'].value_counts()
department = pd.DataFrame(employees['department'].value_counts())
department = department.reset_index()
department.columns = ['Department','Counts']
department.head()

# Frequency of each distinct age
age = pd.DataFrame(employees['age'].value_counts())
age = age.reset_index()
age.columns = ['Age','Counts']
# Bin ages into three categories; intervals are right-inclusive: (20, 31], (31, 45], (45, 60]
age_bins = [20, 31, 45, 60]
age_labels = ["age_junior", "age_senior", "age_expert"]
# Bin the 'Age' column of the frequency table (employees['age'] has a different index)
age['age_categoric'] = pd.cut(age['Age'], bins=age_bins, labels=age_labels)
employees['age_categoric'] = pd.cut(employees['age'], bins=age_bins, labels=age_labels)
age.head()

employees['age_categoric'].value_counts()
age_cat = pd.DataFrame(employees['age_categoric'].value_counts())
age_cat = age_cat.reset_index()
age_cat.columns = ['Age Categoric','Counts']

The graph shows an imbalance in the data: more employees are based in the US and France than in the other countries. The conclusions may therefore depend on this factor.

Next, we show three columns in a single view: location, department and age.

The department chart also shows an imbalance: 24.25% of the employees belong to the Sales department. The age distribution is close to uniform, although strictly speaking it shows a slight positive skew. Ages range from 22 to 59 years. We convert this variable into a categorical one with three levels: junior, senior and expert. The horizontal bar chart of the categorical age shows that 40.96% of the employees are between 31 and 45 years old.
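The age categories and their percentage shares can be reproduced with `pd.cut` and `value_counts(normalize=True)`. The sketch below uses the same bin edges and labels as the notebook on a hypothetical list of ages. With the default `right=True`, the intervals are (20, 31], (31, 45] and (45, 60], so an employee aged exactly 31 is counted as junior.

```python
import pandas as pd

# Hypothetical ages; the real column is employees['age']
ages = pd.Series([22, 31, 32, 45, 46, 59])

age_cat = pd.cut(
    ages,
    bins=[20, 31, 45, 60],
    labels=["age_junior", "age_senior", "age_expert"],
)

# Percentage of rows in each category, rounded to two decimals
shares = age_cat.value_counts(normalize=True).mul(100).round(2)
print(age_cat.tolist())
print(shares)
```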


3. Study of the messages dataset

We now study the messages dataset, which has 3512 entries, 4 columns and no missing values. It records which employees send and receive each message, the length of the message, and the date on which it was sent.
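To relate messages to departments, the messages table can be joined to the employees table on the sender id. The sketch below is one possible way to do this with pandas; the toy rows, the `hour` column, and the `sender_department` name are hypothetical and only mirror the column names described earlier.

```python
import pandas as pd

# Hypothetical toy rows with the same columns as the real datasets
messages = pd.DataFrame({
    "sender": [1, 1, 2, 3],
    "receiver": [2, 3, 1, 1],
    "timestamp": ["2021-06-02 05:41:34", "2021-06-02 05:42:15",
                  "2021-06-03 14:00:00", "2021-06-03 14:30:10"],
    "message_length": [88, 72, 44, 37],
})
employees = pd.DataFrame({"id": [1, 2, 3], "department": ["Sales", "IT", "Sales"]})

# Parse the timestamp strings and extract the hour of day
messages["timestamp"] = pd.to_datetime(messages["timestamp"])
messages["hour"] = messages["timestamp"].dt.hour

# Attach the sender's department to every message
sender_info = employees.rename(columns={"id": "sender", "department": "sender_department"})
sent = messages.merge(sender_info, on="sender", how="left")

# Number of messages and mean message length per sending department
by_dept = sent.groupby("sender_department")["message_length"].agg(["count", "mean"])
print(by_dept)
```

A left join keeps every message even if a sender id were missing from the employees table, which makes such gaps visible as missing departments instead of silently dropping rows.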
