Numerical Feature Processing for Machine Learning in Python

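Numerical features are features on which arithmetic is meaningful, for example:

Age, income, account balance
Transaction count, transaction amount
Stock price, return, volatility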

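Left unprocessed, numerical features can directly cause the following problems:

Gradient descent converges slowly or not at all
Features with large magnitudes "dominate" the model
Nonlinear relationships go uncaptured by the model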

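Standardization

StandardScaler (Z-score standardization)

Rescales each feature so that:

Mean = 0
Standard deviation = 1

Formula: \(x' = \frac{x - \mu}{\sigma}\), where \(\mu\) and \(\sigma\) are the feature's mean and standard deviation.

This pulls every feature back onto the same scale, so each value expresses how many standard deviations it lies from its feature's mean.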

Python example

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
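As a quick sanity check of the formula, the sketch below computes the z-score by hand with NumPy and compares it with StandardScaler. The toy matrix (an age column and an income column) is made up for illustration and is not data from this article.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is age, column 1 is income.
X = np.array([[25.0, 3000.0],
              [35.0, 5000.0],
              [45.0, 10000.0]])

# Manual z-score, column by column.
# np.std defaults to the population standard deviation (ddof=0),
# which is what StandardScaler uses as well.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_manual = (X - mu) / sigma

X_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))  # the two results agree
print(X_sklearn.mean(axis=0))            # approximately 0 per column
print(X_sklearn.std(axis=0))             # approximately 1 per column
```

Both columns end up with mean 0 and standard deviation 1, even though their raw values differ by orders of magnitude.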

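Call fit on the training set only.
The test set should only go through transform, so that test-set statistics do not leak into preprocessing.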

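# Learn the mean and standard deviation from the training set only
scaler.fit(X_train)
# Apply those same training-set statistics to both splits
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)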
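A self-contained version of the same pattern might look like the sketch below. The synthetic data, the 75/25 split and the random seeds are assumptions chosen only so the snippet runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features (assumed for illustration): 200 samples, 3 columns.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Statistics are learned from the training split only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The training columns are centred at 0 by construction;
# the test columns are only close to 0, and that is expected.
print(X_train_scaled.mean(axis=0))
print(X_test_scaled.mean(axis=0))
```

The test-set means are not exactly zero because the scaler never saw those rows. That small mismatch is the price of an honest evaluation.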

Normalization

MinMaxScaler (0–1 scaling)

Compresses each feature into a fixed interval (0 to 1 by default):

\[x' = \frac{x - x_{min}}{x_{max} - x_{min}}\]

This puts every feature "into the same box".

📌 Python

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)
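To connect the formula to the library call, the sketch below applies min-max scaling by hand to a single made-up column and compares the result with MinMaxScaler. The three values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature with values 10, 20 and 40.
X = np.array([[10.0], [20.0], [40.0]])

# Manual min-max scaling, column by column.
x_min = X.min(axis=0)
x_max = X.max(axis=0)
X_manual = (X - x_min) / (x_max - x_min)

X_sklearn = MinMaxScaler().fit_transform(X)

print(X_manual.ravel())   # 0, 1/3 and 1
print(np.allclose(X_manual, X_sklearn))
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.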

StandardScaler vs MinMaxScaler comparison

Aspect	StandardScaler	MinMaxScaler
Output range	No fixed range	0 to 1
Robust to outliers	❌ No	❌ Even less
How commonly used	⭐⭐⭐⭐⭐	⭐⭐⭐
Typical uses	Linear models / SVM	Neural networks / images

📌 Rule of thumb

Not sure which one to use → start with StandardScaler
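The outlier row of the comparison can be illustrated with a small experiment. The sketch below uses a made-up income column with one extreme value and measures how much spread the four ordinary values keep after each scaler. It is a toy illustration rather than a benchmark.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical incomes; the last row is an extreme outlier.
X = np.array([[3000.0], [4000.0], [5000.0], [6000.0], [1_000_000.0]])

X_std = StandardScaler().fit_transform(X)
X_mm = MinMaxScaler().fit_transform(X)

# Spread (max minus min) of the four ordinary values after scaling.
spread_std = X_std[:4].max() - X_std[:4].min()
spread_mm = X_mm[:4].max() - X_mm[:4].min()

print(spread_std)  # tiny: the outlier inflates the standard deviation
print(spread_mm)   # tinier still: the outlier alone defines the maximum
```

In this example both scalers squeeze the ordinary values into a very narrow band, and MinMaxScaler squeezes them more, because a single point sets the whole range.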

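Binning (Discretization)

Why bin?

A numerical feature is sometimes:

Nonlinearly related to the target
Sensitive to extreme values
More naturally thought of as a set of ranges

For example:

Age: 18–25 / 26–35 / 36–50
Income: low / medium / high
Default risk: account-balance ranges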

Equal-width binning

import pandas as pd

df['age_bin'] = pd.cut(df['age'], bins=5)

Every interval has the same width
Sample counts per bin can be highly unbalanced
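The imbalance is easy to see on skewed data. The sketch below runs pd.cut on a made-up, right-skewed age column; the values and the choice of four bins are assumptions for illustration.

```python
import pandas as pd

# Hypothetical right-skewed ages: mostly young, with a few older people.
df = pd.DataFrame({'age': [18, 19, 20, 21, 22, 23, 25, 30, 45, 80]})

# Four bins of equal width spanning the observed min..max.
df['age_bin'] = pd.cut(df['age'], bins=4)

# Count samples per interval, ordered from the lowest interval up.
counts = df['age_bin'].value_counts().sort_index()
print(counts)
```

Most samples land in the lowest interval, while the upper intervals are nearly empty.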

Equal-frequency binning (quantile binning)

df['age_bin'] = pd.qcut(df['age'], q=5)

Each bin holds roughly the same number of samples
Commonly used in credit scoring and risk control
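For contrast, the sketch below applies pd.qcut to the same kind of skewed column. The age values are made up for illustration.

```python
import pandas as pd

# Hypothetical right-skewed ages, all distinct.
df = pd.DataFrame({'age': [18, 19, 20, 21, 22, 23, 25, 30, 45, 80]})

# Five bins whose edges are the 20%, 40%, 60% and 80% quantiles.
df['age_bin'] = pd.qcut(df['age'], q=5)

# Count samples per interval, ordered from the lowest interval up.
counts = df['age_bin'].value_counts().sort_index()
print(counts)
```

Each of the five bins receives two samples here, at the cost of very uneven interval widths. When a column contains many tied values, the quantile edges can coincide and qcut raises an error; passing duplicates='drop' is one way to handle that, though it yields fewer bins than requested.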

Binning with scikit-learn (recommended)

from sklearn.preprocessing import KBinsDiscretizer

kbd = KBinsDiscretizer(
    n_bins=5,
    encode='onehot',
    strategy='quantile'
)

X_binned = kbd.fit_transform(X[['age']])
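A self-contained variant is sketched below with a made-up age column. It uses encode='ordinal' instead of 'onehot' purely so the output is easy to read: 'onehot' returns a sparse matrix, while 'ordinal' returns one integer-valued bin index per sample.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical ages, sorted in ascending order.
X = pd.DataFrame({'age': [18, 19, 20, 21, 22, 23, 25, 30, 45, 80]})

kbd = KBinsDiscretizer(
    n_bins=5,
    encode='ordinal',      # bin index per sample instead of one-hot columns
    strategy='quantile'    # equal-frequency bin edges
)

X_binned = kbd.fit_transform(X[['age']])

print(X_binned.ravel())    # bin index of each sample
print(kbd.bin_edges_[0])   # learned edges for the 'age' column
```

Because the discretizer is a fitted transformer, the edges learned on the training set can be reused on new data with kbd.transform, which is the main practical advantage over calling pd.qcut separately on each split.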

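Pros and cons of binning

Pros

Reduces the influence of outliers
Improves model stability
Lets a linear model express nonlinear effects

Cons

Loses information
The number of bins has to be tuned

Commonly used with

Logistic regression
Credit scorecards
Rule-based models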

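Polynomial Features

Why polynomial features?

A linear model is essentially: \(y = w_1 x_1 + w_2 x_2\)

But real-world relationships often look like: \(y = x^2\) or \(y = x_1 \times x_2\)

Polynomial features = manually constructed nonlinearity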

Example (degree-2 polynomial)

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(
    degree=2,
    include_bias=False
)

X_poly = poly.fit_transform(X)

If the original features are:

x1, x2

they become:

x1, x2, x1², x1*x2, x2²

Risks of polynomial features

The number of features explodes
Overfitting becomes more likely
Computation becomes expensive
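To put a number on the explosion: with n input features and degree d, a full polynomial expansion without the bias column has C(n + d, d) - 1 output columns. The sketch below evaluates that count for a few sizes and cross-checks one case against scikit-learn. The helper name n_poly_features is made up for this example.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_features(n_features, degree):
    """Number of monomials of total degree 1..degree in n_features variables."""
    return comb(n_features + degree, degree) - 1


for n in (2, 10, 100):
    for d in (2, 3):
        print(n, d, n_poly_features(n, d))

# Cross-check one case against the transformer itself.
poly = PolynomialFeatures(degree=3, include_bias=False).fit(np.zeros((1, 10)))
print(poly.n_output_features_)
```

With 100 input features, degree 2 already produces 5150 columns and degree 3 produces 176850, which is why low degrees and regularization are usually recommended.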

Best practices

Standardize first
Keep degree ≤ 2 or 3
Combine with regularization (L1 / L2)

Complete recommended Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('model', LogisticRegression(penalty='l2'))
])

pipe.fit(X_train, y_train)
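An end-to-end run of the same pipeline is sketched below. The make_moons dataset, the split ratio, the random seeds and max_iter=1000 are assumptions chosen so the snippet runs on its own; they are not part of the original recipe.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic two-class data with a nonlinear decision boundary (assumed).
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    # LogisticRegression applies L2 regularization by default.
    ('model', LogisticRegression(max_iter=1000)),
])

# fit() learns the scaler statistics from the training split only.
pipe.fit(X_train, y_train)

# score() pushes the test split through the already-fitted steps.
acc = pipe.score(X_test, y_test)
print(acc)
```

Wrapping the steps in a Pipeline means the "fit on the training set only" rule is enforced automatically, including inside cross-validation.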
