Three types of machine learning:

  • Unsupervised, for finding structure in unlabeled data
  • Supervised, for making predictions from labeled data using regression or classification
  • Reinforcement learning, which learns from feedback given by a real or synthetic environment

K-means is unsupervised. It randomly assigns each data point to one of k subgroups, computes each subgroup's center, reassigns points to the nearest center, and repeats. The algorithm stops when no data point changes cluster or when the maximum number of iterations is reached. Because the result depends on the random start, the run with the lowest total within-cluster sum of squares is kept as the best outcome.
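
As a rough illustration of that loop, here is a minimal hand-rolled sketch; the data matrix x and k = 3 are hypothetical stand-ins, and in practice you would simply call kmeans() as below.

# Minimal sketch of the k-means loop described above (illustrative only;
# empty clusters are not handled)
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)             # hypothetical 2-D data
k <- 3
cl <- sample(1:k, nrow(x), replace = TRUE)    # random initial assignment
repeat {
  # Recompute each cluster center as the mean of its points
  centers <- t(sapply(1:k, function(j) colMeans(x[cl == j, , drop = FALSE])))
  # Distance from every point to every center, then reassign to the nearest
  d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
  cl.new <- max.col(-d)
  if (all(cl.new == cl)) break                # stop when no point changes cluster
  cl <- cl.new
}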

# Create the k-means model: km.out
km.out <- kmeans(x, 3, 20) # data, number of clusters, maximum iterations

# Set up 2 x 3 plotting grid
par(mfrow = c(2, 3))

# Set seed
set.seed(1)

for(i in 1:6) {
  # Run kmeans() on x with three clusters and one start
  km.out <- kmeans(x, 3, nstart = 1) # with nstart > 1, kmeans() tries multiple initial configurations and reports the best one
  
  # Plot clusters
  plot(x, col = km.out$cluster, 
       main = km.out$tot.withinss, 
       xlab = "", ylab = "")
}
# Selecting the number of clusters
# Initialize total within sum of squares error: wss
wss <- 0

# For 1 to 15 cluster centers
for (i in 1:15) {
  km.out <- kmeans(x, centers = i, nstart = 20)
  # Save total within sum of squares to wss variable
  wss[i] <- km.out$tot.withinss
}

# Plot total within sum of squares vs. number of clusters
plot(1:15, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")

  1. Hierarchical clustering groups similar data points together based on their characteristics; in R it is implemented by hclust().
  2. It involves calculating the distances between data points using Euclidean or Manhattan distance and building a hierarchical structure, or tree-like diagram, called a dendrogram (see the plotting sketch after the code below).
  3. The dendrogram visualizes the relationships between clusters at different levels of the hierarchy.
  4. By cutting the dendrogram at a specific height, you can determine the number of clusters in your data.
  5. Finally, you can assign each data point to its respective cluster based on the chosen height.
# Create hierarchical clustering model: hclust.out
hclust.out <- hclust(dist(x))
# Cut by height
cutree(hclust.out, h = 7)
# Cut by number of clusters
cutree(hclust.out, k = 3)
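
The list above mentions visualizing the dendrogram; a short sketch of how to draw it and mark the cut height (h = 7 simply mirrors the cutree() call above, not a recommendation):

# Draw the dendrogram and mark the height used in cutree() above
plot(hclust.out)
abline(h = 7, col = "red")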

Whether you want balanced or unbalanced trees from your hierarchical clustering model depends on the context of the problem you're trying to solve. Balanced trees are preferable if you want roughly equal numbers of observations assigned to each cluster. If instead you want to detect outliers, an unbalanced tree is more desirable, because cutting an unbalanced tree can leave most observations in one cluster and only a few observations in the others.
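
One setting that influences tree balance is the linkage method passed to hclust(). As a sketch (reusing dist(x) from above, with an illustrative 3-cluster cut): complete and average linkage tend to produce relatively balanced trees, while single linkage tends to chain observations together and produce unbalanced ones.

# Compare cluster sizes under two linkage methods
hclust.complete <- hclust(dist(x), method = "complete")
hclust.single <- hclust(dist(x), method = "single")
table(cutree(hclust.complete, k = 3))  # sizes under complete linkage
table(cutree(hclust.single, k = 3))    # single linkage often gives one large cluster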

# Normalize if data have different distributions
# View column means
colMeans(pokemon)

# View column standard deviations
apply(pokemon, 2, sd)

# Scale the data
pokemon.scaled <- scale(pokemon)

# Create hierarchical clustering model: hclust.pokemon
hclust.pokemon <- hclust(dist(pokemon.scaled), method = "complete")
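
A possible follow-up, with 3 clusters assumed purely for illustration: cut the tree and cross-tabulate the assignments against a k-means solution on the same scaled data to see how much the two methods agree.

# Cut the tree and compare with k-means on the scaled data
cut.pokemon <- cutree(hclust.pokemon, k = 3)
km.pokemon <- kmeans(pokemon.scaled, centers = 3, nstart = 20)
table(cut.pokemon, km.pokemon$cluster)  # cross-tabulate the two assignments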

Principal Component Analysis (PCA) is a dimensionality reduction technique used to simplify complex datasets while retaining as much of the variance in the data as possible. It works by identifying the directions (principal components) along which the data vary the most. These components are orthogonal to each other and are ranked by the amount of variance they capture.

PCA transforms the original data into a new coordinate system defined by the principal components, allowing for visualization and analysis of high-dimensional data. It is most useful when the variables are strongly correlated; if the variables are only weakly correlated, PCA may not reduce the dimensionality by much. It is commonly used in fields such as image processing, genetics, and finance to uncover patterns, reduce noise, and improve computational efficiency.
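
A quick sanity check before running PCA (a sketch, assuming pokemon contains only numeric columns, as in the scaling step above): inspect the pairwise correlations to judge whether the variables are correlated enough for PCA to be worthwhile.

# Strong off-diagonal correlations suggest PCA will compress the data well
round(cor(pokemon), 2)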

pr.out <- prcomp(pokemon,
                 scale = TRUE,
                 center = TRUE)
biplot(pr.out) # biplot with directions

# Compute the proportion of variance explained (pve), e.g. for a scree plot
# Variability of each principal component: pr.var
pr.var <- pr.out$sdev^2

# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
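
A hypothetical use of pve: find how many components are needed to reach a target share of the total variance. The 75% threshold below is an assumption, not a rule.

# Smallest number of principal components explaining at least 75% of variance
which(cumsum(pve) >= 0.75)[1]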

Analyzing a real-world dataset: Wisconsin breast cancer data

url <- "https://assets.datacamp.com/production/course_1903/datasets/WisconsinCancer.csv"

# Download the data: wisc.df
wisc.df <- read.csv(url)

# Convert the features of the data: wisc.data
wisc.data <- as.matrix(wisc.df[3:32])

# Set the row names of wisc.data
row.names(wisc.data) <- wisc.df$id

# Create diagnosis vector (1 = malignant, 0 = benign)
diagnosis <- as.numeric(wisc.df$diagnosis == "M")
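
An optional sanity check (sketch): confirm the coding of the diagnosis vector and look at the share of malignant cases.

# Counts of benign (B) and malignant (M), and the proportion coded as 1
table(wisc.df$diagnosis)
mean(diagnosis)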

# Check column means and standard deviations
colMeans(wisc.data)
apply(wisc.data, 2, sd)

# Execute PCA, scaling if appropriate: wisc.pr
wisc.pr <- prcomp(wisc.data, scale = TRUE)

# Look at summary of results
summary(wisc.pr)

# Create a biplot of wisc.pr
biplot(wisc.pr)

# Scatter plot observations by components 1 and 2
plot(wisc.pr$x[, c(1, 2)], col = (diagnosis + 1), 
     xlab = "PC1", ylab = "PC2")

# Repeat for components 1 and 3
plot(wisc.pr$x[, c(1, 3)], col = (diagnosis + 1), 
     xlab = "PC1", ylab = "PC3")

# Set up 1 x 2 plotting grid
par(mfrow = c(1, 2))

# Calculate variability of each component
pr.var <- wisc.pr$sdev^2

# Variance explained by each principal component: pve
pve <- pr.var/sum(pr.var)

# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component", 
     ylab = "Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component", 
     ylab = "Cumulative Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")