
A Guide to NLP Text Classification

English corpora

1. Reading Data

Reading a txt file with native Python

f = open(filedir, mode='r')
  • f.read() reads the entire file at once.
  • f.readline() reads the file line by line, one line per call.
  • f.readlines() reads all lines at once, keeping the trailing \n on each line (see the sketch below).
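A minimal sketch comparing the three read modes (the filename is made up, and the file is assumed to be plain text with one sample per line):

with open('data.txt', mode='r', encoding='utf-8') as f:
    content = f.read()          # the whole file as one string
with open('data.txt', mode='r', encoding='utf-8') as f:
    first_line = f.readline()   # only the first line
with open('data.txt', mode='r', encoding='utf-8') as f:
    lines = f.readlines()       # a list of lines, each ending with '\n'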

Reading a txt file with pandas

pd.read_table(filedir, sep=delimiter, header=row number to use as the column names, names=['x', 'y'] supplied manually when the file has no header row)
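A hedged example (the tab delimiter and the column names 'text' and 'label' are assumptions about the file layout):

import pandas as pd

df = pd.read_table('data.txt', sep='\t', header=None, names=['text', 'label'])
print(df.head())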

2. Data Preprocessing

Tokenization

Removing stop words and low-frequency words

def optimize_words_dict(self, data, stop_words, threshold):
    freq_dict = {}
    for line in data:  # remove stop words and count word frequencies
        for word in line:
            if word in stop_words:
                continue
            if word not in freq_dict:
                freq_dict[word] = 1
            else:
                freq_dict[word] += 1
    words_list = []
    values = sorted(list(set(freq_dict.values())), reverse=True)
    for w in values:  # filter the vocabulary by frequency threshold (this loop could be optimized)
        if w < threshold:
            continue
        for k, v in freq_dict.items():
            if v == w:
                words_list.append((k, v))
    return words_list
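The nested loop above rescans freq_dict once per distinct frequency value. A sketch of an equivalent single-pass filter using collections.Counter (the function name is my own):

from collections import Counter

def filter_vocab(data, stop_words, threshold):
    # Count word frequencies in one pass, skipping stop words.
    freq = Counter(word for line in data for word in line if word not in stop_words)
    # Keep words whose frequency reaches the threshold, most frequent first.
    return sorted(((w, c) for w, c in freq.items() if c >= threshold),
                  key=lambda item: item[1], reverse=True)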

Filtering words with TF-IDF

from math import log10  # needed below

def tf_idf(self, data):
    doc_num = len(data)
    df = {}
    for sample in data:
        for word in set(sample):  # use set() so a word is counted once per sample
            if word not in df:
                df[word] = 1.0
            else:
                df[word] += 1.0
    for word in df:  # inverse document frequency: log10(total documents / documents containing the word)
        df[word] = log10(doc_num / df[word])
    res = {}
    index = 0
    for sample in data:
        res[index] = {}
        tf = {}
        for word in sample:  # term frequency: how often each word occurs in this sample
            if word not in tf:
                tf[word] = 1.0
            else:
                tf[word] += 1.0
        sample_len = len(sample)
        for word in sample:  # tf * idf for every word in this sample
            tf_idf = tf[word] / sample_len * df[word]
            res[index][word] = tf_idf
        index += 1

    return res  # res maps each sample index to {word: tf-idf value}

During preprocessing the goal is to prepare the vocabulary, so TF-IDF is computed with term frequencies taken over the entire corpus; during feature extraction, the TF-IDF variant of the bag-of-words model instead computes a TF-IDF value for each word within each individual sample.
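For reference, the quantities the code above computes are

$\text{tf}(w, d) = \dfrac{\text{count}(w, d)}{|d|}, \qquad \text{idf}(w) = \log_{10}\dfrac{N}{\text{df}(w)}, \qquad \text{tf-idf}(w, d) = \text{tf}(w, d) \cdot \text{idf}(w)$

where $N$ is the number of documents and $\text{df}(w)$ is the number of documents containing $w$.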

3. Building the Vocabulary

vocabulary = {'word': index}
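A minimal sketch turning the (word, frequency) list from the previous step into a word-to-index vocabulary (the example words are made up):

words_list = [('北京', 12), ('天气', 9), ('好', 7)]   # example output of the filtering step
vocab = {word: idx for idx, (word, freq) in enumerate(words_list)}
# vocab == {'北京': 0, '天气': 1, '好': 2}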

4. Feature Extraction

Bag-of-Words (BOW)

Suppose the vocabulary has 1000 entries; then we build a 1000-dimensional vector for every sample in the dataset.

The bag-of-words model comes in three variants:

  • whether each word of the vocabulary appears in the sample (binary)
  • how many times each word appears in the sample (counts)
  • the TF-IDF value of each word in the sample

As noted in the preprocessing step, TF-IDF over the whole corpus is used when building the vocabulary, whereas the TF-IDF variant of the bag-of-words model computes the value per word within each individual sample.

Example (a count-variant sketch follows below):
Vocabulary of 1000 entries. [北京, 天气, 真, 好, 北京] = [0, 1, 0, …, 0] (presence), or [0, 2, 0, …, 0] (counts), or [0, 0.4, 0, …, 0] (TF-IDF)
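A hedged sketch of the count variant, reusing the vocab mapping from step 3 (bow_vector is my own helper name):

def bow_vector(sample, vocab):
    # Count-based bag-of-words vector for one tokenized sample.
    vec = [0] * len(vocab)
    for word in sample:
        if word in vocab:          # out-of-vocabulary words are ignored
            vec[vocab[word]] += 1
    return vec

# bow_vector(['北京', '天气', '真', '好', '北京'], vocab)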

One-Hot Encoding

A feature with N states can be represented with N bits.

Example: Feature_2 has 2 states

Sample_1 = [0, 1], Sample_2 = [1, 0] …

Example: Feature_1 has 4 states

Sample_1 = [0, 0, 1, 0], Sample_2 = [0, 1, 0, 0] …

When a sample has several such features, their one-hot codes are concatenated (see the sketch below):

Sample_1 = [0, 1, 0, 0, 1, 0], Sample_2 = [1, 0, 0, 1, 0, 0] …
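A hedged sketch using scikit-learn's OneHotEncoder (the feature values below are made up):

from sklearn.preprocessing import OneHotEncoder

samples = [['red', 'yes'], ['blue', 'no'], ['green', 'yes']]  # Feature_1: 3 states, Feature_2: 2 states
encoder = OneHotEncoder()
print(encoder.fit_transform(samples).toarray())
# Each row concatenates the one-hot codes of the two features, e.g. [0. 0. 1. 0. 1.] for ['red', 'yes']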


A thought:
When unsure how large the vocabulary should be, consider the size of the corpus or the number of words per sample,
and make the vocabulary size some fixed proportion of that number.

Chinese corpora

1. Data Processing

  • Remove whitespace
  • Normalize punctuation (e.g., ellipsis → period)?
  • Remove meaningless symbols
  • Convert Traditional Chinese characters to Simplified (a cleaning sketch follows below)
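A minimal sketch of these cleaning steps (the regular expressions are only illustrative, and the Traditional-to-Simplified step assumes a third-party library such as opencc):

import re

def clean_text(text):
    text = re.sub(r'\s+', '', text)       # remove whitespace
    text = text.replace('……', '。')       # ellipsis -> period
    text = re.sub(r'[#*@~]+', '', text)   # drop meaningless symbols
    # Traditional -> Simplified would use a converter such as opencc; omitted here.
    return text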

KNN on the Handwritten Digits Dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
digits = datasets.load_digits()
X = digits.data
X.shape
(1797, 64)
y = digits.target
y.shape
(1797,)
one = X[100]
one = one.reshape(8, 8)
one
array([[ 0.,  0.,  0.,  2., 13.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  8., 15.,  0.,  0.,  0.],
       [ 0.,  0.,  5., 16.,  5.,  2.,  0.,  0.],
       [ 0.,  0., 15., 12.,  1., 16.,  4.,  0.],
       [ 0.,  4., 16.,  2.,  9., 16.,  8.,  0.],
       [ 0.,  0., 10., 14., 16., 16.,  4.,  0.],
       [ 0.,  0.,  0.,  0., 13.,  8.,  0.,  0.],
       [ 0.,  0.,  0.,  0., 13.,  6.,  0.,  0.]])
plt.imshow(one)
<matplotlib.image.AxesImage at 0x1a203a9ad0>

png

Splitting into Training and Test Sets

from sklearn.model_selection import train_test_split
train_test_split?
Signature: train_test_split(*arrays, **options)
Docstring:
Split arrays or matrices into random train and test subsets

Quick utility that wraps input validation and
``next(ShuffleSplit().split(X, y))`` and application to input data
into a single call for splitting (and optionally subsampling) data in a
oneliner.

Read more in the :ref:`User Guide <cross_validation>`.

Parameters
----------
*arrays : sequence of indexables with same length / shape[0]
    Allowed inputs are lists, numpy arrays, scipy-sparse
    matrices or pandas dataframes.

test_size : float, int or None, optional (default=None)
    If float, should be between 0.0 and 1.0 and represent the proportion
    of the dataset to include in the test split. If int, represents the
    absolute number of test samples. If None, the value is set to the
    complement of the train size. If ``train_size`` is also None, it will
    be set to 0.25.

train_size : float, int, or None, (default=None)
    If float, should be between 0.0 and 1.0 and represent the
    proportion of the dataset to include in the train split. If
    int, represents the absolute number of train samples. If None,
    the value is automatically set to the complement of the test size.

random_state : int, RandomState instance or None, optional (default=None)
    If int, random_state is the seed used by the random number generator;
    If RandomState instance, random_state is the random number generator;
    If None, the random number generator is the RandomState instance used
    by `np.random`.

shuffle : boolean, optional (default=True)
    Whether or not to shuffle the data before splitting. If shuffle=False
    then stratify must be None.

stratify : array-like or None (default=None)
    If not None, data is split in a stratified fashion, using this as
    the class labels.

Returns
-------
splitting : list, length=2 * len(arrays)
    List containing train-test split of inputs.

    .. versionadded:: 0.16
        If the input is sparse, the output will be a
        ``scipy.sparse.csr_matrix``. Else, output type is the same as the
        input type.

Examples
--------
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]

>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]

>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]
File:      /Applications/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py
Type:      function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(1437, 64)
(1437,)
(360, 64)
(360,)

Testing

KNNClf = KNeighborsClassifier(n_neighbors=3)
KNNClf.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')
res = KNNClf.predict(X_test)
res
array([3, 4, 8, 5, 5, 7, 0, 3, 1, 4, 7, 5, 5, 8, 0, 7, 4, 7, 1, 7, 9, 4,
       4, 7, 6, 1, 2, 9, 1, 3, 3, 3, 7, 0, 7, 0, 2, 8, 9, 1, 1, 5, 4, 8,
       9, 0, 4, 9, 4, 9, 7, 2, 7, 3, 3, 4, 1, 9, 9, 9, 0, 4, 0, 6, 1, 0,
       0, 3, 6, 2, 3, 2, 8, 5, 9, 3, 1, 1, 6, 9, 8, 1, 2, 3, 2, 6, 8, 8,
       4, 6, 8, 6, 3, 9, 2, 8, 3, 6, 5, 7, 1, 7, 3, 8, 8, 8, 0, 0, 9, 1,
       9, 8, 5, 1, 1, 0, 1, 6, 5, 1, 7, 6, 5, 7, 7, 2, 2, 7, 3, 1, 9, 5,
       9, 5, 5, 3, 8, 4, 9, 5, 4, 6, 0, 5, 4, 8, 6, 1, 2, 8, 0, 9, 0, 9,
       7, 9, 7, 0, 2, 8, 2, 4, 0, 6, 2, 6, 7, 5, 6, 3, 8, 8, 0, 3, 2, 0,
       6, 1, 0, 6, 0, 5, 9, 3, 3, 0, 4, 0, 4, 2, 4, 9, 0, 6, 7, 4, 6, 5,
       9, 7, 2, 2, 3, 3, 0, 3, 9, 4, 9, 8, 5, 6, 9, 0, 1, 3, 5, 0, 5, 1,
       6, 4, 6, 6, 6, 7, 9, 1, 0, 7, 6, 6, 7, 8, 0, 3, 8, 5, 6, 8, 1, 3,
       0, 3, 6, 0, 3, 5, 6, 7, 0, 6, 9, 7, 0, 0, 2, 1, 6, 6, 9, 1, 6, 9,
       8, 7, 0, 2, 5, 1, 8, 6, 4, 3, 8, 2, 9, 2, 8, 4, 6, 4, 8, 9, 3, 1,
       1, 5, 0, 7, 8, 2, 3, 8, 4, 7, 7, 7, 7, 9, 4, 1, 7, 0, 2, 9, 1, 8,
       2, 4, 1, 0, 7, 5, 6, 0, 7, 1, 1, 5, 7, 1, 1, 6, 5, 6, 2, 2, 3, 9,
       7, 5, 3, 1, 5, 9, 9, 2, 5, 1, 6, 8, 2, 2, 4, 2, 1, 0, 6, 1, 2, 9,
       7, 5, 3, 5, 6, 1, 8, 1])
y_test
array([3, 4, 8, 5, 5, 7, 0, 3, 1, 4, 7, 5, 5, 8, 0, 7, 4, 7, 1, 7, 9, 4,
       4, 7, 6, 1, 2, 9, 1, 3, 3, 3, 7, 0, 7, 0, 2, 8, 9, 1, 1, 5, 4, 8,
       9, 0, 4, 9, 4, 9, 7, 2, 7, 8, 3, 4, 1, 9, 9, 9, 0, 4, 0, 6, 1, 0,
       0, 3, 6, 2, 3, 2, 8, 5, 9, 3, 1, 1, 6, 9, 8, 1, 2, 3, 2, 6, 8, 8,
       4, 6, 8, 6, 3, 9, 2, 8, 3, 6, 5, 7, 1, 7, 9, 8, 8, 8, 0, 0, 9, 1,
       4, 8, 5, 1, 1, 0, 1, 6, 5, 1, 7, 6, 5, 7, 7, 2, 2, 7, 3, 1, 9, 5,
       9, 5, 5, 3, 8, 4, 9, 5, 4, 6, 0, 5, 4, 8, 6, 1, 2, 8, 0, 9, 0, 9,
       7, 9, 7, 0, 2, 8, 2, 4, 0, 6, 2, 6, 7, 5, 6, 3, 8, 8, 0, 3, 2, 0,
       6, 1, 0, 6, 0, 5, 9, 3, 3, 0, 4, 0, 4, 2, 4, 9, 0, 6, 7, 4, 6, 5,
       9, 7, 2, 2, 3, 3, 0, 3, 9, 4, 9, 8, 5, 6, 9, 0, 1, 3, 5, 0, 5, 1,
       6, 4, 6, 6, 6, 7, 9, 1, 0, 7, 6, 6, 7, 8, 0, 3, 8, 5, 6, 8, 1, 3,
       0, 3, 6, 0, 3, 5, 6, 7, 0, 6, 9, 7, 0, 0, 2, 1, 6, 6, 9, 1, 6, 9,
       8, 7, 0, 2, 5, 1, 8, 6, 4, 3, 8, 2, 9, 2, 8, 4, 6, 4, 8, 9, 3, 1,
       1, 5, 0, 7, 8, 2, 3, 8, 4, 7, 7, 7, 7, 7, 4, 1, 3, 0, 2, 9, 1, 8,
       2, 4, 1, 0, 7, 5, 6, 0, 7, 1, 1, 5, 7, 1, 1, 6, 5, 6, 2, 2, 3, 9,
       7, 5, 3, 1, 5, 9, 9, 2, 5, 1, 6, 8, 2, 2, 4, 2, 1, 0, 6, 1, 2, 9,
       7, 5, 3, 5, 6, 1, 8, 8])
num = sum(res == y_test)
num
354
print('accuracy ', num / len(y_test))
accuracy  0.9833333333333333

Using sklearn's built-in accuracy function

from sklearn.metrics import accuracy_score as accuracy
accuracy(res, y_test)
0.9833333333333333


KNN on the Iris Dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

Using the Iris dataset bundled with sklearn

iris = datasets.load_iris()
iris.keys()
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
X = iris.data
Y = iris.target
X.shape
(150, 4)
Y.shape
(150,)

Splitting the test data manually

shuffled_indexes = np.random.permutation(len(X))
shuffled_indexes
array([ 47,  10, 144,  57, 135,  58,  36, 136,  73, 109,  64,  15, 104,
       143, 108,  83,  23, 126, 125, 131, 137,  22,  16,  29, 118,  31,
        33,  52,  32, 132,  45,  38,  78, 139,  30,  37,  61,  97, 122,
        56, 107,  66, 114,  87,  43,  76,  84,  79, 142,  70,  77,  42,
         7, 138, 141, 120, 129,  44,  24,  53, 116,  13,  91, 119,  93,
         6,  60,  50,  67,  20,  54,  71,  89,  68,  21, 133, 148,  81,
        25,  48, 130, 127,  28,  90,  82, 146, 100, 105,  80,  94,  14,
        55, 111, 106, 101, 103,  35,  99,   3,  26,  69, 124,  95,  96,
       140,  46,  19,  34,  75,  59,   1, 117, 121,  49, 110,   0, 115,
        72,   9,  18, 149,  40, 145,  92, 123,  51,   4,  11,  39,  85,
        62, 147, 102,  74,   2,  86,   8,  17,   5,  27, 112, 128,  12,
        65,  41, 134,  63,  88, 113,  98])
test_ratio = 0.2
test_size = int(len(X) * test_ratio)

test_indexes = shuffled_indexes[:test_size]
train_indexes = shuffled_indexes[test_size:]

X_train = X[train_indexes]
Y_train = Y[train_indexes]

X_test = X[test_indexes]
Y_test = Y[test_indexes]
print(X_train.shape);print(Y_train.shape);print(X_test.shape);print(Y_test.shape)
(120, 4)
(120,)
(30, 4)
(30,)

Splitting with sklearn

from sklearn.model_selection import train_test_split
train_test_split?
(docstring output identical to the train_test_split help shown in the previous section)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)
(120, 4)
(120,)
(30, 4)
(30,)

Testing with sklearn's KNN classifier

from sklearn.neighbors import KNeighborsClassifier
KNeighborsClassifier?
Init signature:
KNeighborsClassifier(
    n_neighbors=5,
    weights='uniform',
    algorithm='auto',
    leaf_size=30,
    p=2,
    metric='minkowski',
    metric_params=None,
    n_jobs=None,
    **kwargs,
)
Docstring:     
Classifier implementing the k-nearest neighbors vote.

Read more in the :ref:`User Guide <classification>`.

Parameters
----------
n_neighbors : int, optional (default = 5)
    Number of neighbors to use by default for :meth:`kneighbors` queries.

weights : str or callable, optional (default = 'uniform')
    weight function used in prediction.  Possible values:

    - 'uniform' : uniform weights.  All points in each neighborhood
      are weighted equally.
    - 'distance' : weight points by the inverse of their distance.
      in this case, closer neighbors of a query point will have a
      greater influence than neighbors which are further away.
    - [callable] : a user-defined function which accepts an
      array of distances, and returns an array of the same shape
      containing the weights.

algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
    Algorithm used to compute the nearest neighbors:

    - 'ball_tree' will use :class:`BallTree`
    - 'kd_tree' will use :class:`KDTree`
    - 'brute' will use a brute-force search.
    - 'auto' will attempt to decide the most appropriate algorithm
      based on the values passed to :meth:`fit` method.

    Note: fitting on sparse input will override the setting of
    this parameter, using brute force.

leaf_size : int, optional (default = 30)
    Leaf size passed to BallTree or KDTree.  This can affect the
    speed of the construction and query, as well as the memory
    required to store the tree.  The optimal value depends on the
    nature of the problem.

p : integer, optional (default = 2)
    Power parameter for the Minkowski metric. When p = 1, this is
    equivalent to using manhattan_distance (l1), and euclidean_distance
    (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

metric : string or callable, default 'minkowski'
    the distance metric to use for the tree.  The default metric is
    minkowski, and with p=2 is equivalent to the standard Euclidean
    metric. See the documentation of the DistanceMetric class for a
    list of available metrics.
    If metric is "precomputed", X is assumed to be a distance matrix and
    must be square during fit. X may be a :term:`Glossary <sparse graph>`,
    in which case only "nonzero" elements may be considered neighbors.

metric_params : dict, optional (default = None)
    Additional keyword arguments for the metric function.

n_jobs : int or None, optional (default=None)
    The number of parallel jobs to run for neighbors search.
    ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
    ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
    for more details.
    Doesn't affect :meth:`fit` method.

Attributes
----------
classes_ : array of shape (n_classes,)
    Class labels known to the classifier

effective_metric_ : string or callble
    The distance metric used. It will be same as the `metric` parameter
    or a synonym of it, e.g. 'euclidean' if the `metric` parameter set to
    'minkowski' and `p` parameter set to 2.

effective_metric_params_ : dict
    Additional keyword arguments for the metric function. For most metrics
    will be same with `metric_params` parameter, but may also contain the
    `p` parameter value if the `effective_metric_` attribute is set to
    'minkowski'.

outputs_2d_ : bool
    False when `y`'s shape is (n_samples, ) or (n_samples, 1) during fit
    otherwise True.

Examples
--------
>>> X = [[0], [1], [2], [3]]
>>> y = [0, 0, 1, 1]
>>> from sklearn.neighbors import KNeighborsClassifier
>>> neigh = KNeighborsClassifier(n_neighbors=3)
>>> neigh.fit(X, y)
KNeighborsClassifier(...)
>>> print(neigh.predict([[1.1]]))
[0]
>>> print(neigh.predict_proba([[0.9]]))
[[0.66666667 0.33333333]]

See also
--------
RadiusNeighborsClassifier
KNeighborsRegressor
RadiusNeighborsRegressor
NearestNeighbors

Notes
-----
See :ref:`Nearest Neighbors <neighbors>` in the online documentation
for a discussion of the choice of ``algorithm`` and ``leaf_size``.

.. warning::

   Regarding the Nearest Neighbors algorithms, if it is found that two
   neighbors, neighbor `k+1` and `k`, have identical distances
   but different labels, the results will depend on the ordering of the
   training data.

https://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
File:           /Applications/anaconda3/lib/python3.7/site-packages/sklearn/neighbors/_classification.py
Type:           ABCMeta
Subclasses:     
KNNClf = KNeighborsClassifier(n_neighbors=3)
KNNClf.fit(X_train, Y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')
res = KNNClf.predict(X_test)
res
array([0, 1, 1, 0, 2, 1, 2, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 1, 1,
       1, 0, 2, 1, 0, 0, 1, 2])
Y_test
array([0, 1, 1, 0, 2, 1, 2, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 1, 1,
       1, 0, 2, 1, 0, 0, 1, 2])
acc = sum(res == Y_test)
print('accuracy ', acc / len(Y_test))
accuracy  1.0


Implementing KNN by Hand

import numpy as np
import matplotlib.pyplot as plt
raw_data_x = [[3.393533211, 2.331273381],
              [3.110073483, 1.781539638],
              [1.343808831, 3.368360954],
              [3.582294042, 4.679179110],
              [2.280362439, 2.866990263],
              [7.423436942, 4.696522875],
              [5.745051997, 3.533989803],
              [9.172168622, 2.511101045],
              [7.792783481, 3.424088941],
              [7.939820817, 0.791637231]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
x_train = np.array(raw_data_x)
y_train = np.array(raw_data_y)
x_train
array([[3.39353321, 2.33127338],
       [3.11007348, 1.78153964],
       [1.34380883, 3.36836095],
       [3.58229404, 4.67917911],
       [2.28036244, 2.86699026],
       [7.42343694, 4.69652288],
       [5.745052  , 3.5339898 ],
       [9.17216862, 2.51110105],
       [7.79278348, 3.42408894],
       [7.93982082, 0.79163723]])
y_train==1
array([False, False, False, False, False,  True,  True,  True,  True,
        True])
plt.scatter(x_train[y_train==1, 0], x_train[y_train==1, 1], color='r', label='1')
plt.scatter(x_train[y_train==0, 0], x_train[y_train==0, 1], color='g', label='0')
plt.legend()
<matplotlib.legend.Legend at 0x122311ad0>

png

x = np.array([8.093607318, 3.365731514])

plt.scatter(x_train[y_train==1, 0], x_train[y_train==1, 1], color='r', label='1')
plt.scatter(x_train[y_train==0, 0], x_train[y_train==0, 1], color='g', label='0')
plt.scatter(x[0], x[1], color='b')
plt.legend()
plt.show()

png

import math
distance = []
for each in x_train:
    d = math.sqrt(np.sum((each - x) ** 2))
    distance.append(d)
distance
[4.812566907609877,
 5.229270827235305,
 6.749798999160064,
 4.6986266144110695,
 5.83460014556857,
 1.4900114024329525,
 2.354574897431513,
 1.3761132675144652,
 0.3064319992975,
 2.5786840957478887]
distances = [math.sqrt(np.sum((each - x) ** 2)) for each in x_train]
distances
[4.812566907609877,
 5.229270827235305,
 6.749798999160064,
 4.6986266144110695,
 5.83460014556857,
 1.4900114024329525,
 2.354574897431513,
 1.3761132675144652,
 0.3064319992975,
 2.5786840957478887]
nearest = np.argsort(distances)
nearest
array([8, 7, 5, 6, 9, 3, 0, 1, 4, 2])
k = 6
topK_y = [y_train[i] for i in nearest[:k]]
topK_y
[1, 1, 1, 1, 1, 0]
import collections
votes = collections.Counter(topK_y)
votes
Counter({1: 5, 0: 1})
votes.most_common(1)
[(1, 5)]
predict_y = votes.most_common(1)[0][0]
predict_y
1
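The steps above can be gathered into one helper function (a sketch; knn_predict is my own name, reusing the numpy, math and collections imports already made):

def knn_predict(x_train, y_train, x, k=6):
    distances = [math.sqrt(np.sum((sample - x) ** 2)) for sample in x_train]  # Euclidean distances
    nearest = np.argsort(distances)                                           # indices sorted by distance
    topK_y = [y_train[i] for i in nearest[:k]]                                # labels of the k nearest points
    return collections.Counter(topK_y).most_common(1)[0][0]                   # majority vote

# knn_predict(x_train, y_train, x) -> 1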

Implementing KNN with scikit-learn

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
raw_data_x = [[3.393533211, 2.331273381],
              [3.110073483, 1.781539638],
              [1.343808831, 3.368360954],
              [3.582294042, 4.679179110],
              [2.280362439, 2.866990263],
              [7.423436942, 4.696522875],
              [5.745051997, 3.533989803],
              [9.172168622, 2.511101045],
              [7.792783481, 3.424088941],
              [7.939820817, 0.791637231]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

x_train = np.array(raw_data_x)
y_train = np.array(raw_data_y)

target = np.array([8.093607318, 3.365731514])
KNN_classifier = KNeighborsClassifier(n_neighbors=6)
KNN_classifier.fit(x_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=6, p=2,
                     weights='uniform')
target = target.reshape(1, -1)  # sklearn expects a 2-D array of shape (n_samples, n_features)
print(target)
[[8.09360732 3.36573151]]
KNN_classifier.predict(target)
array([1])
res = KNN_classifier.predict(target)
res[0]
1

Data Visualization and Analysis in Python


Introduction

Visualization matters a great deal in machine learning. When starting a new task, exploring the data visually helps you grasp its main characteristics. When analyzing model performance or reporting model results, visualization makes the analysis much more vivid. Sometimes, to understand a complex model, we can also map a high-dimensional space onto a more visually intuitive two- or three-dimensional plot.

In short, visualization is a relatively quick way to extract insight from data. This article uses popular libraries such as Pandas, Matplotlib, and seaborn to get you started.

Key Points

  • Common methods for univariate visualization
  • Common methods for multivariate visualization
  • t-SNE

The Dataset

First, import the required dependencies.

import numpy as np
import pandas as pd
import seaborn as sns
sns.set()

In the first article we used a telecom operator's customer churn dataset; this experiment uses the same dataset.

df = pd.read_csv('./data/telecom_churn.csv')
df.head()

State Account length Area code International plan Voice mail plan Number vmail messages Total day minutes Total day calls Total day charge Total eve minutes Total eve calls Total eve charge Total night minutes Total night calls Total night charge Total intl minutes Total intl calls Total intl charge Customer service calls Churn
0 KS 128 415 No Yes 25 265.1 110 45.07 197.4 99 16.78 244.7 91 11.01 10.0 3 2.70 1 False
1 OH 107 415 No Yes 26 161.6 123 27.47 195.5 103 16.62 254.4 103 11.45 13.7 3 3.70 1 False
2 NJ 137 415 No No 0 243.4 114 41.38 121.2 110 10.30 162.6 104 7.32 12.2 5 3.29 0 False
3 OH 84 408 Yes No 0 299.4 71 50.90 61.9 88 5.26 196.9 89 8.86 6.6 7 1.78 2 False
4 OK 75 415 Yes No 0 166.7 113 28.34 148.3 122 12.61 186.9 121 8.41 10.1 3 2.73 3 False

The last column, Churn, is our target feature. It is boolean: True means the company eventually lost the customer, and False means the customer was retained. Later on we will build models that predict Churn from the other features.

Univariate Visualization

Univariate analysis examines one variable at a time. When we analyze a feature on its own, what we usually care about most is the distribution of its values. Below we consider variables of different statistical types and the corresponding visualization tools.

Quantitative Features

Quantitative features take ordered numerical values. The values may be discrete, like integers, or continuous, like real numbers.

Histograms and Density Plots

A histogram groups values into bins of equal width. Its shape can hint at the nature of the underlying distribution, e.g. Gaussian or exponential, and it also lets you spot outliers when the distribution is otherwise regular. Knowing the distribution of feature values matters when the machine learning method you plan to use assumes a particular distribution type (usually Gaussian).

The simplest way to look at the distribution of a numerical variable is to plot its histogram with the DataFrame's hist() method.

features = ['Total day minutes', 'Total intl calls']
df[features].hist(figsize = (10, 4))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x12e3076d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12f4cbe90>]],
      dtype=object)

png

The plots show that Total day minutes is roughly Gaussian, while Total intl calls is clearly right-skewed (its right tail is longer).

Density plots, also known as kernel density estimates (KDE), are another way to understand the distribution of a numerical variable. They can be seen as a smoothed version of the histogram. Their main advantage over histograms is that they do not depend on bin size, which makes them easier to read.

Let's create density plots for the two variables above.

df[features].plot(kind='density', subplots=True, layout=(1, 2),
                  sharex=False, figsize=(10, 4), legend=False, title=features)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x133bcc050>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1343f1ad0>]],
      dtype=object)

png

sns.distplot(df['Total day calls'])
<matplotlib.axes._subplots.AxesSubplot at 0x12e4f8810>

png

You can also use seaborn's distplot() method to look at the distribution of a numerical variable, for example Total intl calls. By default it shows the histogram together with the density curve.

sns.distplot(df['Total intl calls'])
<matplotlib.axes._subplots.AxesSubplot at 0x12e5134d0>

png

In the plot above, the histogram bars are normalized so that their heights represent density rather than raw counts.

Box Plots

The main components of a box plot are the box, the whiskers, and individual points (outliers):

  • The box shows the interquartile range of the distribution. Its length is set by the 25th percentile ($Q1$, the lower quartile) and the 75th percentile ($Q3$, the upper quartile), and the horizontal line inside the box marks the median ($50\%$).
  • The whiskers are the lines extending from the box. They represent the overall spread of the points, specifically the points that fall within the interval $(Q1 - 1.5 \cdot \text{IQR},\ Q3 + 1.5 \cdot \text{IQR})$, where $\text{IQR} = Q3 - Q1$ is the interquartile range.
  • Outliers are the points beyond the whiskers; they are drawn individually along the central axis.

Use seaborn's boxplot() method to draw a box plot.

sns.boxplot(df['Total intl calls'])
<matplotlib.axes._subplots.AxesSubplot at 0x12e6e1750>

png

The plot shows that, in this dataset, a large number of international calls is quite rare.

Violin Plots

The last distribution plot we consider is the violin plot. The difference from a box plot is that the violin plot focuses on the smoothed overall distribution, whereas the box plot shows specific statistics of the individual samples.

Use the violinplot() method to draw a violin plot. In the figure below, the box plot is on the left and the violin plot on the right.

import matplotlib.pyplot as plt

_, ax = plt.subplots(1, 2, sharey=True, figsize=(10, 4))

sns.boxplot(df['Total intl calls'], ax=ax[0])
sns.violinplot(df['Total intl calls'], ax=ax[1])
<matplotlib.axes._subplots.AxesSubplot at 0x1380219d0>

png

Descriptive Statistics

Besides graphical tools, you can use the DataFrame's describe() method to obtain exact numerical statistics of the distribution.

df[features].describe()

Total day minutes Total intl calls
count 3333.000000 3333.000000
mean 179.775098 4.479448
std 54.467389 2.461214
min 0.000000 0.000000
25% 143.700000 3.000000
50% 179.400000 4.000000
75% 216.400000 6.000000
max 350.800000 20.000000

The output of describe() is largely self-explanatory; 25%, 50% and 75% are the corresponding percentiles.

Categorical and Binary Features

Categorical features reflect a qualitative property of a sample. They take a fixed number of values, and each value assigns an observation to a group called a category. If the values of a categorical variable are ordered, it is called an ordinal variable.

Binary features are a special case of categorical features with exactly two possible values.

Frequency Tables

Let's look at the distribution of the target variable Churn. First, get a frequency table with the value_counts() method.

df['Churn'].value_counts()
False    2850
True      483
Name: Churn, dtype: int64

The table shows that 2850 samples have Churn == False and 483 have Churn == True, so the proportions of loyal (Churn==0) and disloyal (Churn==1) customers are unequal. As we will see in later articles, such class imbalance can cause problems for the classification models we build; in that case the model may need a heavier penalty for misclassifying the minority class (here, Churn==1).

Bar Plots

The graphical counterpart of a frequency table is the bar plot. The easiest way to create one is with seaborn's countplot() function. Let's plot the distributions of two categorical variables.

fig, ax = plt.subplots(1, 2, figsize=[10, 4])

sns.countplot(df['Churn'], ax=ax[0])
sns.countplot(df['Customer service calls'], ax=ax[1])
<matplotlib.axes._subplots.AxesSubplot at 0x1382242d0>

png

The differences between bar plots and histograms are:

  • Histograms are suited to examining the distribution of numerical variables, while bar plots are for categorical features.
  • The x-axis of a histogram is numerical; the x-axis of a bar plot can be of any type: numbers, strings, booleans.
  • The x-axis of a histogram is a Cartesian axis, so the bin order is fixed; the order of the bars in a bar plot is not predefined.

The left plot clearly shows the imbalance of the target variable. The right plot shows that most customers resolved their problems within at most 2-3 calls to customer service. But since we want to predict the minority class (Churn==1), we are more interested in how the small group of unhappy customers behaves. So let's try some more interesting, multivariate visualizations and see whether they help with prediction.

Multivariate Visualization

Multivariate plots show the relationships between two or more variables in a single figure. As with univariate plots, the appropriate type of visualization depends on the types of the variables being analyzed.

Let's start with the interactions between quantitative variables.

Correlation Matrix

A correlation matrix reveals the correlations between the numerical variables in a dataset. This information matters because some machine learning algorithms (for example, linear and logistic regression) do not handle highly correlated input variables well.

First, compute the pairwise correlations between features with the DataFrame's corr() method. Then pass the resulting correlation matrix to seaborn's heatmap() method, which renders the values as a color-coded matrix.

numerical = list(set(df.columns) - set(['State', 'International plan', 'Voice mail plan', 'Area code', 'Churn', 'Customer service calls']))

corr = df[numerical].corr()
sns.heatmap(corr)
<matplotlib.axes._subplots.AxesSubplot at 0x1388bad50>

png

In the plot above, Total day charge is computed directly from Total day minutes, so it is a derived (dependent) variable. Besides Total day charge there are 3 more such variables: Total eve charge, Total night charge and Total intl charge. These 4 variables contribute no additional information, so we simply remove them.

numerical = list(set(numerical) - set(['Total day charge', 'Total eve charge', 'Total night charge', 'Total intl charge']))

corr = df[numerical].corr()
sns.heatmap(corr)
<matplotlib.axes._subplots.AxesSubplot at 0x1396822d0>

png

Scatter Plots

A scatter plot displays the values of two numerical variables as Cartesian coordinates in 2D space. Scatter plots can be drawn with matplotlib's scatter() method.

plt.scatter(df['Total day minutes'], df['Total night minutes'])
<matplotlib.collections.PathCollection at 0x139a17c10>

png

We get a scatter plot of two normally distributed variables. They look uncorrelated, since the shape of the point cloud is aligned with the axes.

seaborn's jointplot() method draws the scatter plot together with two marginal histograms, which can be more informative in some situations.

sns.jointplot(df['Total day minutes'], df['Total night minutes'])
<seaborn.axisgrid.JointGrid at 0x139888cd0>

png

The jointplot() method can also draw a smoothed version of the joint distribution.

sns.jointplot(df['Total day minutes'], df['Total night minutes'], kind='kde', color='g')
<seaborn.axisgrid.JointGrid at 0x139dc1890>

png

The plot above is essentially the bivariate version of the kernel density plot discussed earlier.

Scatterplot Matrix

In some situations we may want to draw a scatterplot matrix like the one below: its diagonal shows the distribution of each variable, and the remaining cells contain the pairwise scatter plots.

# %config InlineBackend.figure_format = 'png'
sns.pairplot(df[numerical])
<seaborn.axisgrid.PairGrid at 0x139cb5810>

png

Quantitative vs. Categorical

To make the plots more interesting, let's try to obtain new information useful for predicting Churn from the interaction of numerical and categorical features. More concretely, let's look at how the input variables relate to the target variable Churn. Use the hue parameter of the lmplot() method to specify the categorical feature of interest.

sns.lmplot('Total day minutes', 'Total night minutes', data=df, hue='Churn', fit_reg=False)
<seaborn.axisgrid.FacetGrid at 0x13e7df950>

png

It looks like the disloyal customers lean towards the upper right corner, i.e. customers who talk more both during the day and at night. This is far from clear-cut, though, and we won't draw any definite conclusions from this plot.

Now create box plots to visualize the distribution statistics of the numerical variables within the two disjoint groups: loyal customers (Churn=0) and churned customers (Churn=1).

numerical.append('Customer service calls')  # note: re-running this cell appends the column again
print(numerical)
fig, axes = plt.subplots(3, 4, figsize=[10, 7])
for index, feat in enumerate(numerical):
    ax = axes[int(index / 4), index % 4]
    sns.boxplot(df['Churn'], df[feat], ax=ax)
    ax.set_xlabel('')
    ax.set_ylabel(feat)
fig.tight_layout()
['Total day minutes', 'Total night minutes', 'Number vmail messages', 'Total eve calls', 'Account length', 'Total intl calls', 'Total eve minutes', 'Total night calls', 'Total day calls', 'Total intl minutes', 'Customer service calls', 'Customer service calls', 'Customer service calls', 'Customer service calls', 'Customer service calls', 'Customer service calls', 'Customer service calls', 'Customer service calls']



---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-47-539701aff8ff> in <module>
      3 fig, axes = plt.subplots(3, 4, figsize=[10, 7])
      4 for index, feat in enumerate(numerical):
----> 5     ax = axes[int(index / 4), index % 4]
      6     sns.boxplot(df['Churn'], df[feat], ax=ax)
      7     ax.set_xlabel('')


IndexError: index 3 is out of bounds for axis 0 with size 3

png

(The IndexError above comes from re-running the cell: each run appends 'Customer service calls' to numerical again, so the list outgrows the 3×4 grid of axes; with the intended 11 entries the loop completes and all box plots are drawn.) The plots show that the distributions that differ most between the two groups belong to three variables: Total day minutes, Customer service calls, and Number vmail messages. Later in this course we will learn to use Random Forest or Gradient Boosting to judge how important features are for classification, and we will see clearly that the first two features really are very important for the churn prediction model.

Create box plots and violin plots of the day-time call minutes for loyal and disloyal customers.

_, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=[10, 8])
sns.boxplot(x='Churn', y='Total day minutes', data=df, ax=axes[0][0])
sns.violinplot(x='Churn', y="Total day minutes", data=df, ax=axes[0][1])
sns.boxplot(x='Churn', y='Total night minutes', data=df, ax=axes[1][0])
sns.violinplot(x='Churn', y="Total night minutes", data=df, ax=axes[1][1])
<matplotlib.axes._subplots.AxesSubplot at 0x140f70290>

png

The plots show that disloyal customers tend to spend more time on the phone.

We can also spot an interesting pattern: on average, churned customers are more active users of the service. Perhaps they are unhappy with the tariffs, so one possible churn-prevention measure would be to lower call rates. Of course, the company would need additional economic analysis to find out whether this would actually pay off.

When you want to analyze a quantitative variable across two categorical dimensions at once, use seaborn's catplot() function. For example, let's visualize the interaction between Total day minutes and the two categorical variables Churn and Customer service calls in a single figure.

sns.catplot(x='Churn', y='Total day minutes', col='Customer service calls',
            data=df[df['Customer service calls'] < 8], kind='box', col_wrap=4, height=3, aspect=.8)
<seaborn.axisgrid.FacetGrid at 0x140728250>

png

The plot suggests that from the 4th customer-service call onward, Total day minutes may no longer be the main driver of churn (Churn==1). Perhaps, besides the tariff issue we guessed at earlier, there are other problems making customers unhappy with the service, and these may also lead to fewer day-time minutes.

Categorical vs. Categorical

As mentioned earlier, Customer service calls has many repeated values, so it can be treated either as a numerical variable or as an ordinal categorical variable. We have already looked at its distribution with a count plot; now we are interested in the relationship between this ordinal feature and the target variable Churn.

Use the countplot() method again to look at the distribution of customer service calls, this time passing hue='Churn' to add the categorical dimension to the plot.

sns.countplot(x='Customer service calls', hue='Churn', data=df[df["Customer service calls"] < 10])
<matplotlib.axes._subplots.AxesSubplot at 0x140cb7990>

png

The plot shows that the churn rate increases sharply once a customer has called support 4 or more times.

Use countplot() to look at the relationship between Churn and the two binary features International plan and Voice mail plan.

_, axes = plt.subplots(1, 2, sharey=True, figsize=(10, 4))

sns.countplot(x='International plan', hue='Churn', data=df, ax=axes[0])
sns.countplot(x='Voice mail plan', hue='Churn', data=df, ax=axes[1])
<matplotlib.axes._subplots.AxesSubplot at 0x142344290>

png

The plots show that churn is much higher when the international plan is enabled, i.e. International plan is an important feature. We do not observe a similar effect for Voice mail plan.

Cross Tabulation

Besides graphical tools for analyzing categorical data, there is a traditional statistical tool: the cross tabulation, a table that shows the frequency distribution of several categorical variables. It lets you look along a row or column to see how one variable is distributed conditional on another.

Use a cross tabulation to inspect the relationship between Churn and the categorical variable State.

pd.crosstab(df['State'], df['Churn']).T

State AK AL AR AZ CA CO CT DC DE FL ... SD TN TX UT VA VT WA WI WV WY
Churn
False 49 72 44 60 25 57 62 49 52 55 ... 52 48 54 62 72 65 52 71 96 68
True 3 8 11 4 9 9 12 5 9 8 ... 8 5 18 10 5 8 14 7 10 9

2 rows × 51 columns

The table shows that State takes 51 distinct values and that in each state only a small number of customers abandoned the operator. Next, compute each state's churn rate with groupby() and sort it from high to low.

df.groupby(['State'])['Churn'].agg([np.mean]).sort_values(by='mean', ascending=False).T

State NJ CA TX MD SC MI MS NV WA ME ... RI WI IL NE LA IA VA AZ AK HI
mean 0.264706 0.264706 0.25 0.242857 0.233333 0.219178 0.215385 0.212121 0.212121 0.209677 ... 0.092308 0.089744 0.086207 0.081967 0.078431 0.068182 0.064935 0.0625 0.057692 0.056604

1 rows × 51 columns

The table shows that the churn rate in New Jersey and California exceeds 25%, while in Hawaii and Alaska it is under 6%. However, these conclusions rest on very few samples, may apply only to this particular dataset, and should not be over-generalized.

Visualizing the Whole Dataset

So far we have studied different facets of the dataset, guessing at interesting features and visualizing a handful of them at a time. What if we want to display all the features at once while still being able to interpret the resulting visualization?

Dimensionality Reduction

Most real-world datasets have many features, each of which can be viewed as a dimension of the data space. We therefore often have to work with high-dimensional datasets, and visualizing a whole high-dimensional dataset is hard. To view a dataset as a whole, we need to reduce the number of dimensions used for visualization without losing too much information. This task is called dimensionality reduction. It is an unsupervised learning problem, because it derives new, low-dimensional features from the data itself, without any supervision signal such as labels.

Principal Component Analysis (PCA) is a well-known dimensionality reduction method, which we will discuss in a later lesson. Its limitation is that it is a linear algorithm, which implies certain restrictions on the data.

In contrast to linear methods, there are many nonlinear ones, collectively known as manifold learning. One of the best-known manifold learning methods is t-SNE.
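A hedged sketch of applying t-SNE to this dataset with scikit-learn (the dropped columns, the yes/no mapping, and the random_state value are illustrative assumptions, not output from the original notebook):

from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Keep numeric information only: drop the target and the State column,
# and convert the two yes/no plans to 0/1.
X = df.drop(['Churn', 'State'], axis=1)
X['International plan'] = X['International plan'].map({'Yes': 1, 'No': 0})
X['Voice mail plan'] = X['Voice mail plan'].map({'Yes': 1, 'No': 0})

# Scale the features, then embed them into 2D with t-SNE.
X_scaled = StandardScaler().fit_transform(X)
tsne_repr = TSNE(random_state=17).fit_transform(X_scaled)

# Color the points by the target variable.
plt.scatter(tsne_repr[:, 0], tsne_repr[:, 1],
            c=df['Churn'].map({False: 'blue', True: 'orange'}))
plt.show()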

Summary

This section introduced some common visualization methods from the Pandas, Matplotlib, and seaborn libraries, applied them to a visual analysis of the customer churn dataset, and discussed t-SNE dimensionality reduction. Visualization is a relatively quick way to mine insight from data, so it is well worth learning this skill and adding it to your everyday machine learning toolbox.