Introduction
Here is a not-so-short note (rolling update) on machine learning and artificial intelligence. Main concepts and some references are presented. This note is based on Wikipedia, Andrew Ng's CS229 Lecture Notes from Stanford, the Machine Learning course of SJTU IEEE-yangyang, and Google's Machine Learning Crash Course.
The Learning Problem
- Input variables / features: $x^{(i)}$
- Target variable: $y^{(i)}$
- A training example: $(x^{(i)}, y^{(i)})$
- A training set: $\{(x^{(i)}, y^{(i)});\ i = 1, \dots, m\}$
- The space of input values: $\mathcal{X}$
- The space of output values: $\mathcal{Y}$
Given a training set, learn a function $h: \mathcal{X} \to \mathcal{Y}$ so that $h(x)$ is a good predictor for the corresponding value of $y$. If there are $n$ features, like $x^{(i)} = (x_1^{(i)}, \dots, x_n^{(i)})$, the training set will be $X \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^{m}$. Note that $m$ is the number of samples and $n$ is the number of features of each sample.
Linear Regression
Linear regression is a very simple approach for supervised learning. We use linear regression to predict a quantitative response $y$ from the predictor variable $x$. The relationship between $x$ and $y$ is assumed to be linear. A linear function $h_\theta$ can be used to map from $x$ to $y$:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n = \sum_{j=0}^{n} \theta_j x_j = \theta^T x$$
The intercept term is $\theta_0$ (with the convention $x_0 = 1$). Note that $n$ here is not the number of samples.
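As a rough illustration, here is a minimal NumPy sketch of the hypothesis $h_\theta(x) = \theta^T x$ with the intercept handled by prepending $x_0 = 1$; the function name and the toy numbers are illustrative, not from the note:

```python
import numpy as np

def hypothesis(theta, x):
    """Linear hypothesis h_theta(x) = theta^T x, with x_0 = 1 handling the intercept."""
    x = np.concatenate(([1.0], x))   # prepend x_0 = 1 so theta[0] is the intercept term
    return theta @ x

theta = np.array([0.5, 2.0, -1.0])   # [theta_0, theta_1, theta_2]
x = np.array([3.0, 4.0])             # one sample with n = 2 features
print(hypothesis(theta, x))          # 0.5 + 2.0 * 3.0 - 1.0 * 4.0 = 2.5
```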
- multiple linear regression
Multiple linear regression is a generalization of linear regression obtained by considering more than one independent variable ($x \in \mathbb{R}^n$ with $n > 1$).
- multivariate linear regression (General linear model)
The multivariate linear regression is a generalization of the multiple linear regression model to the case of more than one dependent variable ($Y$ will be a matrix).
Linear basis function models
In polynomial curve fitting, the feature extraction is represented as the polynomial basis functions: $\phi_j(x) = x^j$, where $M$ is the polynomial order ($j = 0, 1, \dots, M$). A small change in $x$ affects all basis functions. Thus a polynomial basis function is global.
Consider the sigmoidal basis functions: $\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right)$, where $\sigma(a) = \frac{1}{1 + e^{-a}}$. A small change in $x$ only affects nearby basis functions (those with $\mu_j$ near $x$).
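A minimal sketch contrasting the two kinds of basis functions (the function names, the centers $\mu_j$, and the scale $s$ below are illustrative assumptions):

```python
import numpy as np

def poly_basis(x, M):
    """Polynomial basis phi_j(x) = x**j, j = 0..M; global: any change in x moves every phi_j."""
    return np.array([x ** j for j in range(M + 1)])

def sigmoid_basis(x, mus, s):
    """Sigmoidal basis phi_j(x) = sigma((x - mu_j) / s); local: only phi_j with mu_j near x respond strongly."""
    return 1.0 / (1.0 + np.exp(-(x - mus) / s))

x = 0.3
print(poly_basis(x, M=3))                                     # [1.0, 0.3, 0.09, 0.027]
print(sigmoid_basis(x, mus=np.linspace(0.0, 1.0, 5), s=0.1))  # large only for centers near x = 0.3
```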
The Least Mean Square (LMS) method
The cost function is defined as:
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
The problem is how to find the $\theta$ that minimizes $J(\theta)$.
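A small sketch of this cost on a toy data set (the design matrix layout, with the $x_0 = 1$ column included, and the numbers are illustrative):

```python
import numpy as np

def cost(theta, X, y):
    """LMS cost J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2; X already contains the x_0 = 1 column."""
    residuals = X @ theta - y
    return 0.5 * (residuals @ residuals)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # m = 3 samples: intercept column plus one feature
y = np.array([1.0, 2.0, 2.5])
print(cost(np.array([0.5, 0.7]), X, y))              # 0.03
```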
Gradient descent
Apply a gradient descent algorithm for every $j$:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
where $n$ is the number of basis functions ($j = 0, 1, \dots, n$) and $\alpha$ is the learning rate. Note that descent means the update direction is the opposite of the gradient ($-\frac{\partial}{\partial \theta_j} J(\theta)$).
For a single sample, the LMS update rule (or Widrow-Hoff learning rule) can be obtained as:
$$\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$$
- Batch gradient descent
For many samples, we use batch gradient descent. In each iteration:
Repeat until convergence {
    $\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$   (for every $j$)
}
The iteration can be written as:
$$\theta_j^{(t+1)} = \theta_j^{(t)} + \alpha \sum_{i=1}^{m} \left(y^{(i)} - h_{\theta^{(t)}}(x^{(i)})\right) \phi_j(x^{(i)})$$
where $m$ is the number of samples ($i = 1, \dots, m$) and $t$ is the iteration step. Note that $\phi_j$ is the feature extraction and $x^{(i)}$ is the $i$-th sample.
Iterate until $\theta$ stops changing (e.g., $\lVert \theta^{(t+1)} - \theta^{(t)} \rVert$ becomes sufficiently small); then $\theta$ reaches convergence. In every iteration, all samples are traversed. If a new sample is added, the iteration has to traverse all samples from the first one again.
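A minimal NumPy sketch of batch gradient descent under these assumptions (function name, learning rate, tolerance, and toy data are all illustrative, not from the note):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, tol=1e-6, max_iters=10000):
    """Batch GD: every iteration traverses all m samples to compute the full gradient."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iters):
        grad = X.T @ (X @ theta - y)                  # sum over all samples of (h_theta(x^(i)) - y^(i)) * x^(i)
        new_theta = theta - alpha * grad              # descent: step opposite to the gradient
        if np.linalg.norm(new_theta - theta) < tol:   # stop once theta barely changes
            return new_theta
        theta = new_theta
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 2.5, 3.5])
print(batch_gradient_descent(X, y))   # close to the least-squares solution
```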
- Stochastic gradient descent
When the training set is large, we instead use stochastic gradient descent:
In each iteration:
Loop {
    for i = 1, m {
        $\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$   (for every $j$)
    }
}
Iterate until $\theta$ stops changing; then $\theta$ reaches convergence. In each inner step, only one sample $i$ is used to update $\theta$. If a new sample is added, the iteration can continue using only the new sample. Finally, the new $\theta$, and hence the new $h_\theta$, is obtained.
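A matching sketch of stochastic gradient descent (again with illustrative names, learning rate, epoch count, and data; in practice the learning rate is often decayed over time):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=200):
    """SGD: each update uses a single sample i, so new data can be folded in without revisiting old samples."""
    theta = np.zeros(X.shape[1])
    m = X.shape[0]
    for _ in range(epochs):
        for i in range(m):
            error = y[i] - X[i] @ theta               # y^(i) - h_theta(x^(i))
            theta = theta + alpha * error * X[i]      # Widrow-Hoff / LMS update for one sample
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 2.5, 3.5])
print(stochastic_gradient_descent(X, y))
```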
Feature scaling is a method used to standardize the range of independent variables or features of data (data normalization). Feature scaling can be used to speed up the convergence of gradient descent, e.g., via rescaling and standardization. Feature standardization makes the values of each feature in the data have zero mean (by subtracting the mean in the numerator) and unit variance. Thus each feature contributes approximately proportionately to the final values.
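A short sketch of feature standardization per column (the helper name and example matrix are illustrative):

```python
import numpy as np

def standardize(X):
    """Feature standardization: zero mean and unit variance for each feature (column)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std   # keep mean/std so the same transform can be applied to new samples

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_scaled, mean, std = standardize(X)
print(X_scaled)   # each column now has mean 0 and standard deviation 1
```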
The normal equations
To avoid iterations, we can use the normal equations. Writing $X$ and $y$ in matrix form, set the derivative of $J(\theta)$ with respect to $\theta$ equal to zero:
$$\nabla_\theta J(\theta) = X^T X \theta - X^T y = 0$$
Finally,
$$\theta = \left(X^T X\right)^{-1} X^T y$$
In most situations of practical interest, the number of data points $m$ is larger than the dimensionality $n$ of the input space, and the matrix $X$ has full column rank. A matrix has full column rank when its columns are linearly independent. Then $X^T X$ is necessarily invertible and therefore positive definite. The minimum of $J(\theta)$ is obtained at the critical point where $\nabla_\theta J(\theta) = 0$.
If $m < n$ or $X$ does not have full column rank, $X^T X$ is not invertible. Regularization is a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting. E.g., in quadratic ridge regression, the cost function is rewritten as:
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2} \lVert \theta \rVert^2$$
Setting $\nabla_\theta J(\theta) = 0$, $\theta$ can be calculated as:
$$\theta = \left(X^T X + \lambda I\right)^{-1} X^T y$$
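A closed-form sketch covering both cases ($\lambda = 0$ gives the plain normal equations, $\lambda > 0$ gives ridge regression); for simplicity this sketch also penalizes the intercept term, and the names and data are illustrative:

```python
import numpy as np

def normal_equation(X, y, lam=0.0):
    """theta = (X^T X + lam * I)^(-1) X^T y; solved with np.linalg.solve instead of an explicit inverse."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 2.5, 3.5])
print(normal_equation(X, y))            # closed-form least-squares solution
print(normal_equation(X, y, lam=0.1))   # ridge-regularized solution
```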
Direct methods: Solve the normal equations by Gaussian elimination or QR decomposition
- Benefit: the solution is obtained in a single step or very few steps
- Shortcoming: not feasible when data are streaming in real time or the data set is very large
Iterative methods: use stochastic or batch gradient descent
- Benefit: converge fast; more attractive in large practical problems
- Shortcoming: the learning rate should be carefully chosen.
For the probabilistic interpretation of least mean squares: by the independence assumption, LMS is equivalent to maximum likelihood estimation (MLE) of $\theta$.
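Concretely, a brief sketch of the standard argument, assuming $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$ with i.i.d. noise $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$: the log-likelihood is
$$\ell(\theta) = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right) = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2 ,$$
so maximizing $\ell(\theta)$ over $\theta$ is the same as minimizing $J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$.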
Locally weighted linear regression (LWR)
The problem is how to find the $\theta$ minimizing $J(\theta)$, given as:
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} w^{(i)} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$
where
$$w^{(i)} = \exp\!\left(-\frac{\left(x^{(i)} - x\right)^2}{2\tau^2}\right)$$
where $x$ is the query point for which we'd like to know its $y$, and $\tau$ is a bandwidth parameter controlling how quickly the weight falls off with distance. Essentially, higher weights will be put on the training examples close to $x$.
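A minimal sketch of LWR under these definitions: for each query point the weights are recomputed and $\theta$ is refit via the weighted normal equations (the function name, bandwidth, and toy data are illustrative):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted linear regression: refit theta for each query point using Gaussian weights."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))  # w^(i): larger for samples near x_query
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)                 # weighted normal equations
    return x_query @ theta

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])   # intercept column plus one feature
y = np.array([1.0, 1.8, 3.2, 3.9])
print(lwr_predict(X, y, np.array([1.0, 2.0])))                   # prediction at the query point x = 2.0
```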
Parametric and Nonparametric
A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs. Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don’t want to worry too much about choosing just the right features. (Artificial Intelligence: A Modern Approach)
Benefits of parametric models:
- Simpler: These methods are easier to understand, and their results are easier to interpret.
- Speed: Parametric models are very fast to learn from data.
- Less Data: They do not require as much training data and can work well even if the fit to the data is not perfect.
Limitations of parametric models:
- Constrained: By choosing a functional form, these methods are highly constrained to the specified form, because a parametric model has a fixed and finite number of parameters.
- Limited Complexity: The methods are more suited to simpler problems.
- Poor Fit: In practice the methods are unlikely to match the underlying mapping function.
Benefits of nonparametric models:
- Flexibility: Capable of fitting a large number of functional forms.
- Power: No assumptions (or weak assumptions) about the underlying function.
- Performance: Can result in higher performance models for prediction.
Limitations of non-parametric models:
- More data: Require a lot more training data to estimate the mapping function.
- Slower: A lot slower to train as they often have far more parameters to train.
- Overfitting: More of a risk to overfit the training data and it is harder to explain why specific predictions are made.
Locally weighted linear regression is a non-parametric learning algorithm. The term non-parametric means the model structure is not specified a priori but is instead determined from data. It is not meant to imply that such models completely lack parameters, but that the number and nature of the parameters are flexible and not fixed in advance. If we use unweighted linear regression, once $\theta$ is determined, we no longer need to keep the training data around to make future predictions. In contrast, if weighted linear regression is applied, the entire training set has to be kept around.