Clustering and K-Nearest Neighbors

The Concept of Clustering

For unlabeled data, the first thing to do is to find samples that share common features and assign them to the same group.

To this end, the dataset can be divided into any number of segments, where each segment can be represented by the center of mass (centroid) of its members.

To assign different members to the same group, we first need to define how the distance between elements is measured. Once a distance is defined, we can say that each member of a class is closer to its own class's centroid than to any other centroid.
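
For instance, a common choice is the Euclidean distance. Here is a minimal sketch of assigning a sample to its nearest centroid under that metric (the function name euclidean_distance and the sample data are ours, for illustration):

import numpy as np

def euclidean_distance(a, b):
    # Straight-line distance between two points of any dimension
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

# With this metric, a sample belongs to the centroid it is closest to
sample = [1.0, 2.0]
centroids = [[0.0, 0.0], [2.0, 2.0]]
nearest = min(centroids, key=lambda c: euclidean_distance(sample, c))
print(nearest)  # [2.0, 2.0]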

K-means

K-means is a common clustering algorithm and is relatively easy to implement. It is very straightforward and is usually applied as a first step in analyzing a dataset; after this processing, we obtain some prior knowledge about the dataset.

The Mechanism of K-means

The K-means algorithm tries to partition the given data into K disjoint groups, where each group is characterized by the mean of all its members. This point is usually called the centroid; it refers to the arithmetic entity of the same name and can be represented as a vector of arbitrary dimension.

K-means is a naive method: given the number of groups K, it searches for suitable centroids without any prior knowledge of where the clusters actually lie.
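
To make the centroid concrete: it is nothing more than the arithmetic mean of the group's members along each dimension. A minimal NumPy sketch with made-up data:

import numpy as np

# A hypothetical group of three 2-D members
group = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 0.0]])

centroid = group.mean(axis=0)  # arithmetic mean per dimension
print(centroid)                # [3. 2.]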

The Iteration Criterion of the Algorithm

The loss function of K-means is

$$E = \sum_{i=1}^{N} \min_{j \in \{1,\dots,K\}} \lVert x_i - \mu_j \rVert^2$$

where $x_i$ denotes each sample and $\mu_j$ denotes the centroid of group $j$; the nearest centroid is, by definition, that of the group $x_i$ belongs to.

The expression sums, over every point, the minimal (squared) distance from that point to the centroid of its group; the algorithm seeks assignments and centroids that minimize this sum.
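
As a sanity check, this cost can be computed directly in NumPy. The following sketch (the function name kmeans_cost and the toy data are ours) evaluates the formula above:

import numpy as np

def kmeans_cost(points, centroids):
    # Squared distance from every point to every centroid: shape [N, K]
    diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dist = np.sum(diffs ** 2, axis=2)
    # Sum of each point's squared distance to its nearest centroid
    return np.sum(np.min(sq_dist, axis=1))

points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0]])
centroids = np.array([[0.0, 0.5], [4.0, 4.0]])
print(kmeans_cost(points, centroids))  # 0.25 + 0.25 + 0.0 = 0.5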

K-Nearest Neighbors

K-nearest neighbors (KNN) is a simple and classic clustering method. It only needs to look at the class information of the surrounding points, and it assumes that all samples belong to already-known classes.
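
The idea fits in a few lines of NumPy: each query takes the majority label among its K nearest labeled neighbors. A minimal sketch (the helper knn_predict and the toy data are ours, for illustration):

import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    # Euclidean distance from the query to every labeled sample
    dists = np.sqrt(np.sum((train_x - query) ** 2, axis=1))
    # Labels of the k closest samples; the majority vote decides the class
    nearest_labels = train_y[np.argsort(dists)[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

train_x = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
train_y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(train_x, train_y, np.array([0.5, 0.5])))  # 0
print(knn_predict(train_x, train_y, np.array([5.5, 5.5])))  # 1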

Flowchart of K-nearest neighbors:

Introduction to the matplotlib Plotting Library

Plotting data is an integral part of the scientific disciplines, so we need a powerful framework capable of plotting our results. For this task we use the matplotlib library.

matplotlib official website: http://matplotlib.org

matplotlib is a 2D plotting library written in Python that produces publication-quality figures in a variety of hardcopy formats and in interactive environments across platforms.

The following generates two lists of 100 random numbers with TensorFlow and plots the resulting 100 points with matplotlib:

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

with tf.Session() as sess:
    fig, ax = plt.subplots()
    # eval() runs the random_normal ops in the default session opened above
    ax.plot(tf.random_normal([100]).eval(),
            tf.random_normal([100]).eval(), 'o')
    ax.set_title("Sample random plot for Tensorflow")
    plt.savefig("D:\\result.png")

The resulting plot is as follows:

Generating a Synthetic Dataset with scikit-learn

The complete code is as follows:

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

with tf.Session() as sess:
    fig, ax = plt.subplots()
    ax.set_title("Sample random plot for Tensorflow")
    # Four cluster centers for the synthetic blobs
    centers = [(-2, -2), (-2, 1.5), (1.5, -2), (2, 1.5)]
    data, features = datasets.make_blobs(
        n_samples=200, centers=centers,
        n_features=2,
        cluster_std=0.8,
        shuffle=False, random_state=42
    )
    '''To generate circles instead:
    data, features = datasets.make_circles(
        n_samples=100,
        noise=None,
        factor=0.8,
        random_state=None,
        shuffle=True
    )
    '''
    # Scatter-plot the generated samples
    ax.scatter(
        np.asarray(data).transpose()[0],
        np.asarray(data).transpose()[1],
        marker='o', s=250
    )
    plt.savefig("D:\\result.png")

The distribution of the generated data:

Complete K-means Implementation Code

import tensorflow as tf
import numpy as np
import time

import matplotlib
import matplotlib.pyplot as plt

# Note: on newer scikit-learn versions these generators are imported
# directly from sklearn.datasets
from sklearn.datasets.samples_generator import make_blobs
from sklearn.datasets.samples_generator import make_circles

DATA_TYPE = 'blobs'
N = 200
# Number of clusters; if we choose circles, 2 are enough
if (DATA_TYPE == 'circle'):
    K = 2
else:
    K = 4

# Maximum number of iterations, in case the convergence condition is not met
MAX_ITERS = 1000

start = time.time()

centers = [(-2, -2), (-2, 1.5), (1.5, -2), (2, 1.5)]
if (DATA_TYPE == 'circle'):
    data, features = make_circles(n_samples=200, shuffle=True, noise=0.01, factor=0.4)
else:
    data, features = make_blobs(n_samples=200, centers=centers, n_features=2,
                                cluster_std=0.8, shuffle=False, random_state=42)

# Plot the ground-truth centers of the blobs
fig, ax = plt.subplots()
ax.scatter(np.asarray(centers).transpose()[0],
           np.asarray(centers).transpose()[1], marker='o', s=250)
plt.show()

# Plot the centers together with the generated samples
fig, ax = plt.subplots()
if (DATA_TYPE == 'blobs'):
    ax.scatter(np.asarray(centers).transpose()[0],
               np.asarray(centers).transpose()[1], marker='o', s=250)
    ax.scatter(data.transpose()[0], data.transpose()[1],
               marker='o', s=100, c=features, cmap=plt.cm.coolwarm)
plt.show()

points = tf.Variable(data)
cluster_assignments = tf.Variable(tf.zeros([N], dtype=tf.int64))
# Use the first K samples as the initial centroids
centroids = tf.Variable(tf.slice(points.initialized_value(), [0, 0], [K, 2]))

sess = tf.Session()
sess.run(tf.initialize_all_variables())  # tf.global_variables_initializer() on later TF 1.x
sess.run(centroids)

# Tile centroids and points into [N, K, 2] tensors so the squared Euclidean
# distance from every point to every centroid is computed in one shot
rep_centroids = tf.reshape(tf.tile(centroids, [N, 1]), [N, K, 2])
rep_points = tf.reshape(tf.tile(points, [1, K]), [N, K, 2])
sum_squares = tf.reduce_sum(tf.square(rep_points - rep_centroids),
                            reduction_indices=2)

# Index of the nearest centroid for each point
best_centroids = tf.argmin(sum_squares, 1)

# Stop when no assignment changes between two iterations
did_assignments_change = tf.reduce_any(tf.not_equal(best_centroids, cluster_assignments))


def bucket_mean(data, bucket_ids, num_buckets):
    # Per-cluster mean: sum of the members of each bucket divided by its size
    total = tf.unsorted_segment_sum(data, bucket_ids, num_buckets)
    count = tf.unsorted_segment_sum(tf.ones_like(data), bucket_ids, num_buckets)
    return total / count


means = bucket_mean(points, best_centroids, K)

# Evaluate the change check before applying the centroid/assignment updates
with tf.control_dependencies([did_assignments_change]):
    do_updates = tf.group(
        centroids.assign(means),
        cluster_assignments.assign(best_centroids))

changed = True
iters = 0

fig, ax = plt.subplots()
if (DATA_TYPE == 'blobs'):
    colourindexes = [2, 1, 4, 3]
else:
    colourindexes = [2, 1]

while changed and iters < MAX_ITERS:
    fig, ax = plt.subplots()
    iters += 1
    [changed, _] = sess.run([did_assignments_change, do_updates])
    [centers, assignments] = sess.run([centroids, cluster_assignments])
    ax.scatter(sess.run(points).transpose()[0], sess.run(points).transpose()[1],
               marker='o', s=200, c=assignments, cmap=plt.cm.coolwarm)
    ax.scatter(centers[:, 0], centers[:, 1], marker='^', s=550,
               c=colourindexes, cmap=plt.cm.plasma)
    ax.set_title('Iteration ' + str(iters))
    plt.savefig("kmeans" + str(iters) + ".png")

ax.scatter(sess.run(points).transpose()[0], sess.run(points).transpose()[1],
           marker='o', s=200, c=assignments, cmap=plt.cm.coolwarm)
plt.show()

end = time.time()
print(("Found in %.2f seconds" % (end - start)), iters, "iterations")
print("Centroids:")
print(centers)
print("Cluster assignments:", assignments)
The output is as follows:

Centroids:
[[ 1.65289262 -2.04643427]
[-2.0763623 1.61204964]
[-2.08862822 -2.07255306]
[ 2.09831502 1.55936014]]
Cluster assignments: [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 3 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3]

The plot of the finally found centroids:
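
As a closing check, make_blobs also returned the ground-truth labels in features, so we can cross-tabulate them against the learned assignments. The cluster indices are arbitrary, so a clean result simply concentrates each row's mass in one column. A minimal sketch, reusing the features and assignments arrays from the script above:

import numpy as np

# Rows: ground-truth blob label; columns: learned cluster index
confusion = np.zeros((4, 4), dtype=int)
for truth, assigned in zip(features, assignments):
    confusion[truth, assigned] += 1
print(confusion)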