Course Notes: Unsupervised Learning in R

    Three types of machine learning:

    • Unsupervised learning, for finding structure in unlabeled data
    • Supervised learning, for making predictions on labeled data with regression or classification
    • Reinforcement learning, where an agent learns from feedback given by a real or synthetic environment

    K-means is an unsupervised algorithm. It randomly assigns each data point to one of k subgroups, computes each subgroup's center, and reassigns points to the nearest center. It stops once no data points change cluster (or a maximum number of iterations is reached). When run several times, the solution with the lowest total within-cluster sum of squares is kept, since that is the best outcome.

    # Create the k-means model: km.out
    km.out <- kmeans(x, centers = 3, iter.max = 20)  # data, number of clusters, max iterations
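
    The fitted object bundles the pieces described above; a quick sketch of how to inspect them, assuming the model fit just created (components are documented in ?kmeans):

    # Inspect the fitted model
    km.out$cluster       # cluster assignment for each observation
    km.out$centers       # matrix of cluster centers
    km.out$tot.withinss  # total within-cluster sum of squares (lower is better)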
    
    # Set up 2 x 3 plotting grid
    par(mfrow = c(2, 3))
    
    # Set seed
    set.seed(1)
    
    for(i in 1:6) {
      # Run kmeans() on x with three clusters and a single random start.
      # With nstart = 1 only one initial configuration is tried, so the result
      # (and the total within-cluster SS shown in the plot title) varies between runs;
      # a larger nstart tries several configurations and reports the best one.
      km.out <- kmeans(x, centers = 3, nstart = 1)
      
      # Plot clusters
      plot(x, col = km.out$cluster, 
           main = km.out$tot.withinss, 
           xlab = "", ylab = "")
    }
    # Selecting the number of clusters
    # Initialize total within sum of squares error: wss
    wss <- 0
    
    # For 1 to 15 cluster centers
    for (i in 1:15) {
      km.out <- kmeans(x, centers = i, nstart = 20)
      # Save total within sum of squares to wss variable
      wss[i] <- km.out$tot.withinss
    }
    
    # Plot total within sum of squares vs. number of clusters
    plot(1:15, wss, type = "b", 
         xlab = "Number of Clusters", 
         ylab = "Within groups sum of squares")
    
    
    1. Hierarchical clustering is a method that groups similar data points together based on their characteristics.
    2. It involves calculating the distances between data points using Euclidean or Manhattan distance and building a hierarchical structure, visualized as a tree-like diagram called a dendrogram.
    3. The dendrogram visualizes the relationships between clusters at different levels of the hierarchy.
    4. By cutting the dendrogram at a specific height, you can determine the number of clusters in your data.
    5. Finally, you can assign each data point to its respective cluster based on the chosen height.
    # Create hierarchical clustering model: hclust.out
    hclust.out <- hclust(dist(x))
    # Cut by height
    cutree(hclust.out, h = 7)
    # Cut by number of clusters
    cutree(hclust.out, k = 3)

    Whether you want balanced or unbalanced trees for your hierarchical clustering model depends on the context of the problem you're trying to solve. Balanced trees are desirable if you want a roughly equal number of observations assigned to each cluster. If you want to detect outliers, on the other hand, an unbalanced tree is more useful, because pruning an unbalanced tree can leave most observations in one cluster and only a few observations in the others.
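
    One practical lever for tree shape is the linkage method passed to hclust() (as with method = "complete" below): single linkage tends to produce unbalanced, chained trees, while complete linkage tends toward more balanced ones. A minimal sketch, assuming x is the same matrix used in the k-means examples:

    # Compare tree shapes produced by different linkage methods
    hclust.complete <- hclust(dist(x), method = "complete")
    hclust.average  <- hclust(dist(x), method = "average")
    hclust.single   <- hclust(dist(x), method = "single")  # often unbalanced

    par(mfrow = c(1, 3))
    plot(hclust.complete, main = "Complete")
    plot(hclust.average,  main = "Average")
    plot(hclust.single,   main = "Single")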

    # Scale the data if features have very different means or standard deviations
    # View column means
    colMeans(pokemon)
    
    # View column standard deviations
    apply(pokemon, 2, sd)
    
    # Scale the data
    pokemon.scaled <- scale(pokemon)
    
    # Create hierarchical clustering model: hclust.pokemon
    hclust.pokemon <- hclust(dist(pokemon.scaled), method = "complete")
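
    A sketch of how the tree can then be cut and compared against a k-means solution on the same scaled data (the choice of 3 clusters is illustrative):

    # Cut the tree into 3 clusters and compare with k-means assignments
    cut.pokemon <- cutree(hclust.pokemon, k = 3)
    km.pokemon <- kmeans(pokemon.scaled, centers = 3, nstart = 20)

    # Cross-tabulate the two sets of cluster labels
    table(km.pokemon$cluster, cut.pokemon)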

    Principal Component Analysis (PCA) is a dimensionality reduction technique used to simplify complex datasets while retaining as much of the variance in the data as possible. It works by identifying the directions (principal components) along which the data varies the most. These components are orthogonal to each other and are ranked by the amount of variation they capture.

    PCA transforms the original data into a new coordinate system defined by the principal components, allowing for visualization and analysis of high-dimensional data. PCA works best when the variables are strongly correlated; if they are only weakly correlated, it may not reduce the dimensionality by much. It is commonly used in fields such as image processing, genetics, and finance to uncover patterns, reduce noise, and improve computational efficiency.
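
    A quick illustration of why correlation matters, using simulated data (not part of the course dataset): with two strongly correlated variables, the first principal component captures nearly all of the variance.

    # Simulate two strongly correlated variables and run PCA
    set.seed(42)
    a <- rnorm(100)
    b <- a + rnorm(100, sd = 0.1)  # nearly a copy of a
    pr.sim <- prcomp(cbind(a, b), scale = TRUE, center = TRUE)
    summary(pr.sim)  # PC1 should explain close to 100% of the variance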

    pr.out <- prcomp(pokemon,
                    scale = TRUE,
                    center = TRUE)
    biplot(pr.out)  # biplot: observations plus variable loadings (directions)
    
    # Compute the proportion of variance explained (pve) for a scree plot
    # Variability of each principal component: pr.var
    pr.var <- pr.out$sdev^2
    
    # Variance explained by each principal component: pve
    pve <- pr.var / sum(pr.var)
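
    The comment above mentions a scree plot but the plotting call itself is not shown; a minimal sketch using the pve vector just computed:

    # Scree plot: proportion of variance explained per component
    plot(pve, xlab = "Principal Component",
         ylab = "Proportion of Variance Explained",
         ylim = c(0, 1), type = "b")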

    Analyzing a real-world dataset: the Wisconsin breast cancer data

    url <- "https://assets.datacamp.com/production/course_1903/datasets/WisconsinCancer.csv"
    
    # Download the data: wisc.df
    wisc.df <- read.csv(url)
    
    # Convert the features of the data: wisc.data
    wisc.data <- as.matrix(wisc.df[3:32])
    
    # Set the row names of wisc.data
    row.names(wisc.data) <- wisc.df$id
    
    # Create diagnosis vector: 1 if malignant ("M"), 0 if benign
    diagnosis <- as.numeric(wisc.df$diagnosis == "M")
    
    # Check column means and standard deviations
    colMeans(wisc.data)
    apply(wisc.data, 2, sd)
    
    # Execute PCA, scaling if appropriate: wisc.pr
    wisc.pr <- prcomp(wisc.data, scale = TRUE)
    
    # Look at summary of results
    summary(wisc.pr)
    
    # Create a biplot of wisc.pr
    biplot(wisc.pr)
    
    # Scatter plot observations by components 1 and 2
    plot(wisc.pr$x[, c(1, 2)], col = (diagnosis + 1), 
         xlab = "PC1", ylab = "PC2")
    
    # Repeat for components 1 and 3
    plot(wisc.pr$x[, c(1, 3)], col = (diagnosis + 1), 
         xlab = "PC1", ylab = "PC3")
    
    # Set up 1 x 2 plotting grid
    par(mfrow = c(1, 2))
    
    # Calculate variability of each component
    pr.var <- wisc.pr$sdev^2
    
    # Variance explained by each principal component: pve
    pve <- pr.var/sum(pr.var)
    
    # Plot variance explained for each principal component
    plot(pve, xlab = "Principal Component", 
         ylab = "Proportion of Variance Explained", 
         ylim = c(0, 1), type = "b")
    
    # Plot cumulative proportion of variance explained
    plot(cumsum(pve), xlab = "Principal Component", 
         ylab = "Cumulative Proportion of Variance Explained", 
         ylim = c(0, 1), type = "b")