Generate data from a high-dimensional model. After your implementation is finished, you can come back here to experiment and see how the algorithm's performance depends on these values. (One interesting experiment: reduce `lambda` in `rpois` to get a sparser `beta` and see how the results change.)
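For instance, here is a minimal sketch of that experiment (the object name `beta_sparse` is hypothetical and not used below): a Poisson(`lambda`) draw equals zero with probability exp(-lambda), so a smaller `lambda` gives more zero coefficients.
# Sketch: smaller lambda in rpois gives a sparser coefficient vector,
# since P(Poisson(lambda) = 0) = exp(-lambda)
beta_sparse <- rpois(400, lambda = 0.5)
mean(beta_sparse == 0)   # should be near exp(-0.5), about 0.61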
set.seed(1) # comment/uncomment/change this
n <- 100
p <- 400
X <- matrix(rnorm(n*p), nrow = n)
beta <- rpois(p, lambda = 1.5)
SGD_starting_beta <- rnorm(p) # generate once
#beta <- sample(c(-1, 0, 0, 1), p, replace = TRUE)
sigma <- 2
y <- X %*% beta + rnorm(n, sd = sigma)
# sparsity level
sum(beta != 0)
## [1] 308
Question: Can we fit OLS to estimate beta? Why or why not?
Answer: No. Since there are more predictors than observations, i.e. \(p > n\), the matrix \(X^T X\) is singular (its rank is at most \(n < p\)), so the normal equations have no unique solution and the OLS estimator is not well-defined.
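As an optional sanity check (a sketch, not part of the assignment): R's `lm` will still run on this data, but with a rank-\(n\) design it reports `NA` for the coefficients it cannot identify.
# Optional check (sketch): lm() leaves p - n coefficients NA because
# X has rank at most n = 100 < p = 400
fit_ols <- lm(drop(y) ~ X - 1)
sum(is.na(coef(fit_ols)))   # expect p - n = 300 here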
# Loss function
least_squares_loss <- function(x, y, beta) {
  residuals <- y - x %*% beta
  sum(residuals^2)
}
# Gradient of the loss function
least_squares_gradient <- function(x, y, beta) {
  residuals <- y - x %*% beta
  -2 * t(x) %*% residuals
}
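An optional way to verify the gradient (a sketch, not part of the assignment): compare one coordinate of the analytic gradient against a centered finite-difference approximation of the loss.
# Optional gradient check (sketch): finite differences vs analytic gradient
b0 <- rnorm(p)
eps <- 1e-6
j <- 1                              # coordinate to check
e_j <- replace(rep(0, p), j, 1)
fd_j <- (least_squares_loss(X, y, b0 + eps * e_j) -
         least_squares_loss(X, y, b0 - eps * e_j)) / (2 * eps)
all.equal(fd_j, as.numeric(least_squares_gradient(X, y, b0)[j]))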
Computing the gradient on a subset of the sample:
batch_size <- 10
batch_indices <- sample(1:nrow(X), batch_size)
X_batch <- X[batch_indices, , drop = FALSE]
y_batch <- y[batch_indices]
gradient_batch <- least_squares_gradient(X_batch, y_batch, SGD_starting_beta)
dim(X_batch)
## [1] 10 400
length(y_batch)
## [1] 10
dim(gradient_batch)
## [1] 400 1
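Because the loss is defined as a sum over observations (not a mean), the expected value of a minibatch gradient is (batch_size / n) times the full-sample gradient, so the batch gradient is an unbiased estimate of the full gradient up to that known scaling. A quick optional comparison (a sketch; `full_gradient` is not used later):
# Optional comparison (sketch): full-sample gradient at the same point.
# Note the norms need not scale exactly by n / batch_size; only the
# expectation of the gradient vector does.
full_gradient <- least_squares_gradient(X, y, SGD_starting_beta)
c(full = sqrt(sum(full_gradient^2)), batch = sqrt(sum(gradient_batch^2)))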
Modify the code below to complete your implementation. In this implementation, normalize the gradient to make it a unit vector before taking a step. Recall that if `v` is a vector, then `u <- v / sqrt(sum(v^2))` is a unit vector in the direction of `v`. (Optional: for numerical stability you could also add a small positive number like `1e-8` to the denominator.)
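For instance, a minimal sketch of that normalization (the helper name `normalize` is hypothetical and not used in the loop below):
# Hypothetical helper: rescale v to unit length, guarding against a zero norm
normalize <- function(v, eps = 1e-8) v / (eps + sqrt(sum(v^2)))
sqrt(sum(normalize(rnorm(5))^2))   # essentially 1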
epochs <- 40 # decrease for early stopping?
batch_size <- 8 # for batch SGD
gamma <- 0.9
beta_hat <- SGD_starting_beta # generated once above
#beta_hat <- rnorm(p) # random start
num_batches <- ceiling(n/batch_size)
steps <- epochs * num_batches
SGD_path <- data.frame(step = 1:steps,
                       epoch = numeric(steps),
                       gamma = numeric(steps),
                       loss = numeric(steps),
                       MSE = numeric(steps),
                       beta_hat_norm = numeric(steps),
                       gradient_norm = numeric(steps))
start_time <- Sys.time()
step <- 1
for (epoch in 1:epochs) {
  # Pass over the sample in a random order each time
  # (a shuffle, i.e. sampling without replacement)
  shuffled_indices <- sample(1:n, n, replace = FALSE)
  for (batch in 1:num_batches) {
    batch_start <- 1 + batch_size * (batch - 1)
    batch_end <- min(batch_size * batch, nrow(X))
    batch_indices <- shuffled_indices[batch_start:batch_end]
    # Mini-batch
    X_batch <- X[batch_indices, , drop = FALSE]
    y_batch <- y[batch_indices]
    # Update beta_hat: step in the direction of the normalized minibatch
    # gradient, with a step size that decays geometrically over epochs
    gradient_batch <- least_squares_gradient(X_batch, y_batch, beta_hat)
    gradient_norm <- sqrt(sum(gradient_batch^2))
    beta_hat <- beta_hat - gamma^epoch * gradient_batch / (1e-8 + gradient_norm)
    LS_loss <- least_squares_loss(X, y, beta_hat)
    # Store algorithm path information
    SGD_path[step, ] <- c(step,
                          epoch,
                          gamma^epoch,
                          LS_loss,
                          mean((beta_hat - beta)^2),
                          sqrt(sum(beta_hat^2)),
                          gradient_norm)
    step <- step + 1
  }
}
print(Sys.time() - start_time)
## Time difference of 0.1787379 secs
[Plot of the algorithm's path, averaged over all batches in that epoch.]
Note: the `loss` is plotted on a logarithmic scale so that large initial changes don't obscure later variation.
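A minimal plotting sketch (assuming base R graphics; the original plotting code is not shown here):
# Sketch: plot the loss along the SGD path on a logarithmic y-axis
plot(SGD_path$step, SGD_path$loss, type = "l", log = "y",
     xlab = "step", ylab = "loss (log scale)")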
Question: Does the MSE of `beta_hat` level off? After roughly how many epochs? Is this happening because `gamma` decreased too fast?
Answer: Yes, in this example after roughly 20 epochs; and no, because the step size `gamma^epoch` is still above 0.1 at that point (it has not yet become numerically close to zero).
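A quick check of that claim:
# Step-size multiplier after 20 epochs with gamma = 0.9
0.9^20   # about 0.12, still well above zero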
Has the algorithm pulled more of the squared errors closer to zero compared to the (random) initial starting point?
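One way to look at this (a sketch, assuming "squared errors" refers to the coordinatewise errors \((\hat\beta_j - \beta_j)^2\)):
# Sketch: compare coordinatewise squared errors at the start and the end
par(mfrow = c(1, 2))
hist((SGD_starting_beta - beta)^2, main = "Random start", xlab = "squared error")
hist((beta_hat - beta)^2, main = "After SGD", xlab = "squared error")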
Compute the (Euclidean) distances between the following points:
# From algorithm starting point to end point
sqrt(sum((beta_hat - SGD_starting_beta)^2))
## [1] 21.06934
# From starting point to true beta
sqrt(sum((beta - SGD_starting_beta)^2))
## [1] 43.28581
# From end point to true beta
sqrt(sum((beta - beta_hat)^2))
## [1] 37.63315
Question: Does SGD do better than a random guess (the starting point) at estimating the true coefficients?
Answer: Yes. The algorithm's end point is closer to the true `beta` (distance about 37.6) than the random starting point was (distance about 43.3), so taking SGD steps has reduced the error of estimating the true coefficients.