Machine Learning Notes (2): Linear Regression

Introduction

This is a not-so-short note (rolling update) on machine learning and artificial intelligence. Main concepts and some references are presented. This note is based on Wikipedia, Andrew Ng's lecture notes for CS229 at Stanford, the Machine Learning course of SJTU IEEE-yangyang, and the Machine Learning Crash Course of Google.

The Learning Problem

  • Input variables / features: $x^{(i)}$
  • Target variable: $y^{(i)}$
  • A training example: $(x^{(i)}, y^{(i)})$
  • A training set: $\{(x^{(i)}, y^{(i)});\ i = 1, 2, ..., m\}$
  • The space of input values: $X$
  • The space of output values: $Y$

The goal is stated as:

Given a training set, learn a function $h: X \to Y$ so that $h(x)$ is a good predictor for the corresponding value of $y$. If there are $n$ features, e.g., $x = [x_1, x_2, ..., x_n]^T$, the training set will be $X_{m \times n}$ and $Y_{m \times 1}$. Note that $m$ is the number of samples and $n$ is the number of features of each sample.

Linear Regression

Linear regression is a very simple approach for supervised learning. We use linear regression to predict a quantitative response $Y$ from the predictor variable $X$. The relationship between $X$ and $Y$ is assumed to be linear. A linear function $h(x)$ can be used to map from $X$ to $Y$:

$$h(x) = \sum_{i=0}^n \theta_i x_i = \theta^T \phi(x)$$

The intercept term is $x_0 = 1$. Note that here $i$ indexes the features, not the samples.
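As a minimal sketch in NumPy (the feature values and $\theta$ below are made up for illustration), the hypothesis can be evaluated by prepending the intercept term $x_0 = 1$ to the feature vector:

```python
import numpy as np

# Hypothetical example: n = 2 features plus the intercept term x_0 = 1.
theta = np.array([0.5, 2.0, -1.0])   # [theta_0, theta_1, theta_2]
x = np.array([3.0, 4.0])             # raw feature vector [x_1, x_2]

x_with_intercept = np.concatenate(([1.0], x))  # prepend x_0 = 1
h = theta @ x_with_intercept                   # h(x) = theta^T x
print(h)  # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5
```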

  • multiple linear regression

Multiple linear regression is a generalization of linear regression by considering more than one independent variable ($X$: $n > 1$).

  • multivariate linear regression (General linear model)

The multivariate linear regression is a generalization of the multiple linear regression model to the case of more than one dependent variable ($Y$ will be a matrix).

Linear basis function models

In polynomial curve fitting, the feature extraction is represented by the polynomial basis functions $\phi_j(x) = x^j$, where $j$ is the polynomial order ($1 \leq j \leq M$). A small change in $x$ affects all basis functions, so the polynomial basis functions are global.

Consider the sigmoidal basis functions $\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$, where $\sigma(x) = \frac{e^x}{1+e^x}$. A small change in $x$ only affects nearby basis functions (those with centers $\mu_j$ near $x$).
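The following sketch (the choices of $M$, the centers $\mu_j$, and the scale $s$ are illustrative assumptions) contrasts the two feature maps; a perturbation of $x$ moves every polynomial feature, but only the sigmoidal features whose centers lie near $x$:

```python
import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x**j for j = 1..M (global: a change in x moves every column)."""
    return np.stack([x**j for j in range(1, M + 1)], axis=1)

def sigmoidal_basis(x, centers, s):
    """phi_j(x) = sigmoid((x - mu_j) / s) (local: only columns with nearby mu_j respond)."""
    z = (x[:, None] - centers[None, :]) / s
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-1.0, 1.0, 5)
print(polynomial_basis(x, M=3))
print(sigmoidal_basis(x, centers=np.linspace(-1.0, 1.0, 3), s=0.2))
```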

The Least Mean Squares (LMS) method

The cost function is defined as:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2$$

The problem is how to find the minimum of $J(\theta)$.
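As a small sketch (the toy data are assumed), the cost can be computed directly from this definition:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2, with X already containing the x_0 = 1 column."""
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals**2)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # m = 3 samples: intercept + one feature
y = np.array([1.0, 2.0, 3.0])
print(cost(np.array([0.0, 1.0]), X, y))  # 0.0: this theta fits the toy data exactly
```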

Gradient descent

Apply a gradient descent algorithm for every $j$:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

where $j$ indexes the basis functions ($1 \leq j \leq M$). Note that descent means the update direction is the opposite of the gradient ($-\nabla J$).

For a single sample, the LMS update rule (or Widrow-Hoff learning rule) can be obtained as:

$$\theta_j := \theta_j - \alpha (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)}$$

  • Batch gradient descent

For many samples, we use batch gradient descent:

$$\theta_j := \theta_j - \alpha \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)}$$

in the iteration:

Repeat until convergence {
  ... (for every j)
}

The iteration can be written as:

$$\theta_j^{k+1} = \theta_j^k - \alpha \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)}$$

where $i$ indexes the samples ($1 \leq i \leq m$) and $k$ is the iteration step. Note that $x_j^{(i)}$ is the $j$-th extracted feature of the sample $x^{(i)}$.

Iterate until $|\theta_j^{k+1} - \theta_j^{k}| < \epsilon$; then $\theta_j$ has converged. In every iteration, all samples are traversed. If a new sample is added, the iteration has to start from the first sample again.
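A minimal batch gradient descent sketch, assuming a toy data set and an illustrative learning rate and tolerance:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, eps=1e-6, max_iter=10000):
    """Update every theta_j using all m samples in each iteration."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        gradient = X.T @ (X @ theta - y)              # sum_i (h(x_i) - y_i) * x_i
        theta_new = theta - alpha * gradient
        if np.max(np.abs(theta_new - theta)) < eps:   # |theta^{k+1} - theta^k| < eps
            return theta_new
        theta = theta_new
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # intercept column + one feature
y = np.array([2.0, 3.0, 4.0])                        # y = 1 + x
print(batch_gradient_descent(X, y))                  # approximately [1.0, 1.0]
```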

  • Stochastic gradient descent

When the training set is large, we instead use stochastic gradient descent:

$$\theta_j^{k+1} = \theta_j^k - \alpha (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)}$$

in

Loop {
  for i = 1,m {
  ... (for every j)
  }
}

Iterate until $|\theta_j^{k+1} - \theta_j^{k}| < \epsilon$; then $\theta_j$ has converged. Each update of $\theta$ uses only one sample. If a new sample is added, the iteration continues using only the new sample. Finally, the new $\theta$ and $h_{\theta}$ are obtained.
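A corresponding stochastic gradient descent sketch on the same assumed toy data, updating $\theta$ after every individual sample:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=2000):
    """Update theta after each individual sample instead of after a full pass."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            error = X[i] @ theta - y[i]          # h_theta(x_i) - y_i
            theta = theta - alpha * error * X[i]
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(stochastic_gradient_descent(X, y))  # close to [1.0, 1.0]
```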

Feature scaling is a method used to standardize the range of independent variables or features of the data (data normalization). Feature scaling, e.g., rescaling and standardization, can be used to speed up the convergence of gradient descent. Feature standardization makes the values of each feature have zero mean (by subtracting the mean in the numerator) and unit variance, so that each feature contributes approximately proportionately to the final values.
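A small standardization sketch (the feature matrix below is assumed; columns are features):

```python
import numpy as np

def standardize(X):
    """Subtract the per-feature mean and divide by the per-feature standard deviation,
    so every column has zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

X = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 4000.0]])  # very different scales
X_scaled, mu, sigma = standardize(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # ~[0, 0] and [1, 1]
```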

The normal equations

To avoid iterations, we can use the normal equations. Consider $X_{m \times n}$ and $Y_{m \times 1}$ in matrix form; setting the derivative of $J(\theta)$ to zero gives:

$$\nabla_\theta J(\theta) = X^T X\theta - X^T \vec{y} = 0$$

Finally,

$$\theta = (X^T X)^{-1}X^T\vec{y}$$

In most situations of practical interest, the number of data points $m$ is larger than the dimensionality $n$ of the input space, and the matrix $X$ has full column rank. A matrix has full column rank when its columns are linearly independent. Then $X^TX$ is necessarily invertible and therefore positive definite, and the minimum of $J(\theta)$ is obtained at the critical point where $\nabla_\theta J(\theta) = 0$.
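A direct normal-equation sketch on assumed toy data; `np.linalg.solve` is used instead of forming the explicit inverse, which is numerically preferable but equivalent here:

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X theta = X^T y; equivalent to theta = (X^T X)^{-1} X^T y when X has full column rank."""
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # m = 3 > n = 2, full column rank
y = np.array([2.0, 3.0, 4.0])
print(normal_equation(X, y))  # [1.0, 1.0]
```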

If $m \leq n$ or $X$ is not of full column rank, $X^TX$ is not invertible. Regularization is a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting. For example, in ridge regression (quadratic regularization), the cost function is rewritten as:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^n \theta_j^2$$

According to $\nabla_\theta J(\theta) = 0$, $\hat{\theta}$ can be calculated as:

$$\hat{\theta} = (X^TX + \lambda I)^{-1}X^T\vec{y}$$
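The regularized closed form can be sketched the same way ($\lambda$ is an assumed hyperparameter; note that this simple version also regularizes the intercept term):

```python
import numpy as np

def ridge_regression(X, y, lam=0.1):
    """theta_hat = (X^T X + lambda*I)^{-1} X^T y; the added lambda*I keeps the matrix invertible."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(ridge_regression(X, y, lam=0.1))  # slightly shrunk relative to [1.0, 1.0]
```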

Direct methods: Solve the normal equations by Gaussian elimination or QR decomposition

  • Benefit: solve in a single step or very few steps
  • Shortcoming: not feasible when data stream in real time or are very large in volume

Iterative methods: use stochastic or batch gradient descent

  • Benefit: converge fast; more attractive for large practical problems
  • Shortcoming: the learning rate $\alpha$ should be carefully chosen.

For the probabilistic interpretation of least mean squares: under the assumption that the errors are independently and identically Gaussian distributed, LMS is equivalent to maximum likelihood estimation (MLE) of $\theta$.

Locally weighted linear regression (LWR)

The problem is how to find the minimum of $J(\theta)$, given as:

$$J(\theta) = \sum_{i=1}^m w^{(i)}(h_{\theta}(x^{(i)}) - y^{(i)})^2$$

where

$$w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$$

where $x$ is the query point for which we'd like to know its $y$. Essentially, higher weights are put on the training examples close to $x$.
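A locally weighted prediction sketch for a single query point ($\tau$ and the data are assumptions); the weighted least-squares $\theta$ comes from the weighted normal equation $X^T W X \theta = X^T W \vec{y}$ and is refit for each query:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Fit a weighted least-squares theta around x_query and predict y there."""
    # Gaussian weights: training points near the query (by raw feature distance) count more.
    distances = np.sum((X[:, 1:] - x_query[1:])**2, axis=1)   # skip the intercept column
    w = np.exp(-distances / (2.0 * tau**2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)          # weighted normal equation
    return x_query @ theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([1.2, 1.9, 3.2, 3.9])
print(lwr_predict(X, y, x_query=np.array([1.0, 2.5]), tau=0.5))
```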

Parametric and Nonparametric

A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs. Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don’t want to worry too much about choosing just the right features. (Artificial Intelligence: A Modern Approach)

Benefits of parametric models:

  • Simpler: These methods are easier to understand, and their results are easier to interpret.
  • Speed: Parametric models are very fast to learn from data.
  • Less Data: They do not require as much training data and can work well even if the fit to the data is not perfect.

Limitations of parametric models:

  • Constrained: By choosing a functional form, these methods are highly constrained to the specified form, because a parametric model has a fixed and finite number of parameters ($\theta$).
  • Limited Complexity: The methods are more suited to simpler problems.
  • Poor Fit: In practice the methods are unlikely to match the underlying mapping function.

Benefits of nonparametric models:

  • Flexibility: Capable of fitting a large number of functional forms.
  • Power: No assumptions (or weak assumptions) about the underlying function.
  • Performance: Can result in higher performance models for prediction.

Limitations of non-parametric models:

  • More data: Require a lot more training data to estimate the mapping function.
  • Slower: A lot slower to train as they often have far more parameters to train.
  • Overfitting: More of a risk to overfit the training data and it is harder to explain why specific predictions are made.

Locally weighted linear regression is a non-parametric learning algorithm. The term non-parametric means that the model structure is not specified a priori but is instead determined from the data. It does not mean that such models completely lack parameters, but that the number and nature of the parameters are flexible and not fixed in advance. If we use unweighted linear regression, once $\theta$ is determined, we no longer need to keep the training data around to make future predictions. In contrast, if weighted linear regression is applied, the entire training set has to be kept around.
