4. Regression
predict a numerical value from a possibly infinite set of possible values
4.1 Interpolation vs. Extrapolation
Example: training values between -15°C and 32°C
Interpolating regression: only predicts values from the interval [-15°C, 32°C]; only reasonable/realistic values are predicted -> 'safer'
Extrapolating regression: may also predict values outside of this interval -> more ‘interesting’
4.2 Baseline prediction
classification | regression |
---|---|
predict most frequent label | predict average value, median, or mode; in any case: only interpolating regression |
If a model cannot beat this baseline, the model is probably flawed or the data does not contain enough information for modeling.
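A minimal sketch of such a baseline using scikit-learn's DummyRegressor (the toy data below is made up for illustration):

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Toy data, values chosen only for illustration
X = np.arange(10).reshape(-1, 1)
y = np.array([3.0, 4.1, 5.2, 5.9, 7.1, 8.0, 9.2, 9.8, 11.1, 12.0])

baseline = DummyRegressor(strategy="mean")   # strategy="median" is also possible
baseline.fit(X, y)                           # ignores X, only stores the statistic of y
print(baseline.predict(X[:3]))               # always the same value -> purely interpolating
```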
4.3 k Nearest Neighbors for Regression
model = KNeighborsRegressor(n_neighbors=k, weights='uniform')  # or weights='distance'
uniform: all neighbors are weighted equally, i.e., each of the k nearest neighbors contributes the same amount to the prediction.
distance: closer neighbors get a higher weight, so more distant neighbors contribute less to the prediction -> smoother predictions
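A short sketch comparing the two weighting schemes on toy data (values are made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-d regression data
X = np.linspace(0, 10, 20).reshape(-1, 1)
y = np.sin(X).ravel()

knn_uniform = KNeighborsRegressor(n_neighbors=3, weights="uniform").fit(X, y)
knn_distance = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X, y)

x_query = np.array([[2.5]])
print(knn_uniform.predict(x_query))    # plain average of the 3 nearest targets
print(knn_distance.predict(x_query))   # average weighted by inverse distance
```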
4.4 Performance Measures
4.4.1 Mean Absolute Error
"How much does the prediction differ from the actual value on average?" MAE = (1/N) * Σ |actual_i - predicted_i|
4.4.2 (Root) Mean Squared Error
More severe errors are weighted higher by MSE and RMSE
4.4.3 Correlation
Pearson’s correlation coefficient
Scores well if high actual values get high predictions and low actual values get low predictions. Caution: PCC is scale-invariant!
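A small sketch computing all three measures with scikit-learn/SciPy (toy values; note how scaling the predictions leaves the PCC unchanged):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([2.0, 4.0, 6.0, 8.0])   # toy actual values
y_pred = np.array([2.5, 3.5, 7.0, 7.5])   # toy predictions

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
pcc, _ = pearsonr(y_true, y_pred)
pcc_scaled, _ = pearsonr(y_true, 10 * y_pred)   # scale-invariance of PCC

print(mae, rmse, pcc, pcc_scaled)   # pcc == pcc_scaled
```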
4.5 Linear Regression
Assumption: target variable y is (approximately) linearly dependent on attributes
Typical performance measure used: Mean Squared Error, but we omit the denominator N (i.e., we minimize the sum of squared errors).
Linear Regression vs. k-NN Regression
Linear regression extrapolates, but k-NN interpolates.
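A sketch illustrating this difference on the temperature example from 4.1 (toy data with target y = 2x, chosen only to make the extrapolation visible):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Toy training data inside [-15, 32]
X = np.array([[-15.0], [0.0], [10.0], [20.0], [32.0]])
y = 2.0 * X.ravel()

lin = LinearRegression().fit(X, y)
knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)

x_outside = np.array([[50.0]])      # query outside the training interval
print(lin.predict(x_outside))       # ~100: extrapolates beyond observed targets
print(knn.predict(x_outside))       # stays within the observed target range
```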
Linear Regression and Overfitting
Occam’s Razor
4.6 Ridge Regression
4.7 Lasso Regression
Lasso vs. Ridge Regression
Ridge and Lasso are two regularization techniques for linear regression that prevent overfitting and improve the model's generalization performance.
Ridge Regression (L2 regularization): adds a penalty term proportional to the square of the magnitude of the coefficients (L2 norm) to the linear regression cost function.
Lasso Regression (L1 regularization): adds a penalty term proportional to the absolute value of the coefficients (L1 norm) to the linear regression cost function.
L1 regularization can perform feature selection: even in high-dimensional datasets, some coefficients are pushed exactly to zero.
L2 regularization usually only shrinks the coefficients and does not push them exactly to zero; it tends to reduce all coefficients more evenly.
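A sketch on synthetic data where only two of ten features matter; Lasso drives most irrelevant coefficients to exactly zero, Ridge only shrinks them (the alpha values are arbitrary example choices):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(100)   # only features 0 and 1 matter

ridge = Ridge(alpha=1.0).fit(X, y)    # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty

print("Ridge:", np.round(ridge.coef_, 3))   # all coefficients small but non-zero
print("Lasso:", np.round(lasso.coef_, 3))   # most coefficients exactly zero
```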
4.8 Isotonic Regression (Non-linear Problems)
Isotonic regression is a statistical method for regression problems that fits an order-preserving function to the data: the predictions respect the ordering of the input variable. Its main goal is to fit a monotonically non-decreasing (or non-increasing) function that minimizes the squared error between predictions and observations.
Example: f(x1) ≤ f(x2) for x1 < x2
Target function is monotonic -> Pool Adjacent Violators Algorithm (PAVA)
PAVA is an algorithm for isotonic regression: it repeatedly checks whether adjacent data points violate the ordering constraint and, if so, merges (pools) them, until the whole sequence is order-preserving, yielding a monotone approximation of the data.
Steps:
Step1: Identify adjacent violators, i.e., f(xi) > f(xi+1)
Step2: Replace them with new values f’(xi)=f’(xi+1) so that sum of squared errors is minimized…and pool them, i.e., they are going to be handled as one point
Step3: Repeat until no more adjacent violators are left
Note: if we assume an increasing function but the data is actually decreasing, PAVA will pool all points into a single constant (flat) line.
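scikit-learn's IsotonicRegression implements this (it is based on pool adjacent violators); a minimal sketch on noisy but increasing toy data:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

x = np.arange(10)
y = np.array([1.0, 2.0, 1.5, 3.0, 2.5, 4.0, 5.0, 4.5, 6.0, 7.0])   # noisy, increasing trend

iso = IsotonicRegression(increasing=True)
y_fit = iso.fit_transform(x, y)   # adjacent violators are pooled
print(y_fit)                      # non-decreasing sequence
```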
4.9 Polynomial Regression
Non-linear, non-monotonic problems
Basis expansion: introduce basis functions to extend the attribute space; the basis functions can be a set of predefined functions, e.g., Gaussian or sigmoid functions, that generate new attributes.
Overfitting
Larger feature sets lead to higher degrees of overfitting
Rule of thumb: Datasets should never be wider than long! (never more features than examples)
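A sketch of polynomial regression as basis expansion plus ordinary linear regression (degree 2 and the toy quadratic target are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = X.ravel() ** 2 + 0.5 * rng.randn(30)   # noisy quadratic target

# Basis expansion x -> (1, x, x^2), then plain linear regression
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print(poly_model.predict(np.array([[2.0]])))   # roughly 4
```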
4.10 Support Vector Regression SVR
Linear Regression: find a linear function that minimizes the distance to data points
SVM: find a linear function that maximizes the distance to the data points from different classes (find the hyperplane that maximizes the margin)
Many SVMs also support regression
Maximum margin hyperplane only applies to classification. However, idea of support vectors and kernel functions can be used for regression.
Basic method same as in linear regression: want to minimize error,
- Difference A: ignore errors smaller than ε and use absolute error instead of squared error
- Difference B: simultaneously aim to maximize flatness of the function
- User-specified parameter ε defines the "tube"
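A minimal SVR sketch; epsilon and C are arbitrary example values:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 0.3 * rng.randn(50)   # noisy linear target

# epsilon defines the tube: errors smaller than epsilon are ignored;
# C trades off flatness of the function against errors outside the tube
svr = SVR(kernel="linear", epsilon=0.5, C=1.0).fit(X, y)
print(svr.predict(np.array([[5.0]])))   # close to 10
```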
SVM vs. Polynomial Regression
polynomial regression | SVM |
---|---|
Create polynomial recombinations as additional features; Perform linear regression | Transform vector space with kernel function (e.g., polynomial kernel); Find linear hyperplane |
4.11 Local Regression
Assumption: non-linear problems are approximately linear in local areas -> use linear regression locally, only for the data point at hand (lazy learning)
Steps:
Given a data point, retrieve the k nearest neighbors -> compute a regression model using those neighbors -> locally weighted regression: uses distance as weight for error computation
Advantage: fits non-linear models well (good local approximation, often more exact than pure k-NN)
Disadvantage: runtime; for each test example, the k nearest neighbors must be retrieved and a local model computed
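A minimal sketch of these steps; the helper locally_weighted_predict and the inverse-distance weighting are illustrative choices, not a fixed prescription:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import NearestNeighbors

def locally_weighted_predict(X_train, y_train, x_query, k=15):
    """Lazy learning: fit a linear model only on the k nearest neighbors
    of x_query, weighting their errors by inverse distance."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dist, idx = nn.kneighbors(x_query.reshape(1, -1))
    weights = 1.0 / (dist.ravel() + 1e-8)          # closer neighbors count more
    local_model = LinearRegression()
    local_model.fit(X_train[idx.ravel()], y_train[idx.ravel()], sample_weight=weights)
    return local_model.predict(x_query.reshape(1, -1))[0]

X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel()                                     # non-linear target
print(locally_weighted_predict(X, y, np.array([2.5])))    # close to sin(2.5) ~ 0.60
```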
4.12 Regression Trees & Model Trees
Combining Decision Trees and Regression
Idea: split data first so that it becomes “more linear”
4.12.1 Regression Trees
Splitting criterion: minimize intra-subset variation
Termination criterion: standard deviation becomes small
Pruning criterion: based on numeric error measure
Prediction: each leaf predicts the average target value of its instances
Resulting model: piecewise constant function
4.12.2 Model Trees
Build a regression tree
– For each leaf -> learn linear regression function
Need linear regression function at each node
Prediction: go down tree, then apply function
Resulting model: piecewise linear function
The main difference between a regression tree and a model tree lies in what the leaves store: a regression tree stores the mean target value at each leaf, while a model tree stores a linear model at each leaf.
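scikit-learn provides regression trees via DecisionTreeRegressor (model trees are not part of scikit-learn); a small sketch of the piecewise constant behaviour, with max_depth chosen arbitrarily:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel()                      # non-linear target

# Each leaf predicts the mean target value of its instances
# -> piecewise constant prediction function
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.predict(np.array([[2.5], [7.5]])))
```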
4.12.3 Building the Tree
4.13 Artificial Neural Networks for Regression
More complex regression problems can be approximated by combining several perceptrons (in neural networks: hidden layers with non-linear activation functions!). This allows approximating arbitrary functions.
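A sketch with scikit-learn's MLPRegressor (one hidden layer; the layer size and max_iter are arbitrary example values):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X.ravel() ** 2                           # non-linear target

# Hidden layer with non-linear (ReLU) activation approximates the parabola
mlp = MLPRegressor(hidden_layer_sizes=(20,), activation="relu",
                   max_iter=5000, random_state=0)
mlp.fit(X, y)
print(mlp.predict(np.array([[1.5]])))        # roughly 2.25
```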