Instance segmentation combines object detection and semantic segmentation: the goal is to precisely identify every object in an image while telling apart different instances of the same category. For example, in an image containing several people, instance segmentation must not only recognize the category "person" but also separate the individuals from one another, producing an independent segmentation mask (Mask) for each person.
Instance segmentation reaches this goal by fusing the outputs of object detection and semantic segmentation. Concretely, it uses the category and localization information supplied by detection (such as bounding boxes and confidence scores) to extract each object's pixel-level mask from the semantic segmentation result. In short, the task of instance segmentation is to segment out each concrete object (i.e. each instance) within a category separately.
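To make this output structure concrete, the snippet below runs the off-the-shelf Mask R-CNN from torchvision on one image and prints the per-instance boxes, labels, and masks. This is only a minimal usage sketch: `test.jpg` is a placeholder path, the 0.5 thresholds are arbitrary, and on recent torchvision versions you may need the `weights=` argument instead of `pretrained=True`.

```python
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

# Off-the-shelf Mask R-CNN (newer torchvision: weights="DEFAULT" instead).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

img = convert_image_dtype(read_image("test.jpg"), torch.float)  # [3, H, W] in [0, 1]
with torch.no_grad():
    out = model([img])[0]  # one dict per input image

keep = out["scores"] > 0.5          # arbitrary confidence cut-off
boxes = out["boxes"][keep]          # [N, 4] (x1, y1, x2, y2), one per instance
labels = out["labels"][keep]        # [N] COCO category ids
masks = out["masks"][keep] > 0.5    # [N, 1, H, W] one binary mask per instance
print(boxes.shape, labels.shape, masks.shape)
```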
For example, with the rapid progress of fields such as autonomous driving in recent years, instance segmentation has attracted wide attention. In a driving scene, precisely distinguishing and segmenting pedestrians, vehicles, and other objects on the road is critical for environment perception and decision making. In addition, many instance segmentation systems also output detection results (such as bounding boxes) to give a more complete description of each object.
Readers interested in the combination of instance segmentation, semantic segmentation, and object detection can refer to the CVPR 2019 paper Hybrid Task Cascade (HTC), which proposes a hybrid task cascade framework that handles all three tasks jointly and offers a fresh perspective on multi-task learning.
Beyond autonomous driving, instance segmentation has important applications in many other domains as well.
Mask R-CNN came out in 2017. Its first author is Kaiming He (yes, that man), joined by Ross Girshick of Faster R-CNN fame, so the author list is a genuine all-star team. The paper won the ICCV 2017 Best Paper Award (the Marr Prize), and after its release the network topped the MS COCO leaderboards across the board: object detection, instance segmentation, and human keypoint detection. The architecture of Mask R-CNN is simple yet flexible and highly effective; it merely adds a few new branches on top of Faster R-CNN as the task requires.
Mask R-CNN was proposed to achieve efficient and accurate instance segmentation while inheriting Faster R-CNN's strengths in object detection. Concretely, its core motivation is to combine object detection with semantic segmentation: detect the objects in an image and localize their bounding boxes, and at the same time generate a precise pixel-level segmentation mask for each of them. By introducing RoI Align, Mask R-CNN also fixes the misalignment between feature-map regions and the corresponding image regions caused by RoI Pooling in Faster R-CNN, which markedly improves mask accuracy. On top of Faster R-CNN, Mask R-CNN adds a fully convolutional network (FCN) branch that predicts a segmentation mask for every region of interest (RoI); the design is simple and efficient and adds only a small computational overhead.
It also extends naturally to other tasks (such as human keypoint detection), making it a general-purpose vision framework. By predicting an independent binary mask for every class, Mask R-CNN preserves the spatial layout of each object, produces more accurate masks, and clearly outperformed the other models of its time on benchmarks such as COCO. In short, Mask R-CNN filled the gap between object detection and semantic segmentation with an instance segmentation solution that is both accurate and efficient; the per-class binary mask idea is sketched in code right below.
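The "independent binary mask per class" idea shows up most clearly in the mask loss: only the mask channel of the RoI's ground-truth class is supervised, with a per-pixel sigmoid and binary cross-entropy, so classes do not compete for pixels the way they would under a softmax. A small self-contained sketch with made-up shapes:

```python
import torch
import torch.nn.functional as F

# Made-up sizes: 4 RoIs, 81 classes, 28x28 masks.
num_rois, num_classes, m = 4, 81, 28
mask_logits = torch.randn(num_rois, num_classes, m, m)  # mask-head output
gt_class = torch.randint(1, num_classes, (num_rois,))   # GT class per RoI
gt_masks = torch.randint(0, 2, (num_rois, m, m)).float()

# Select, for each RoI, only the channel of its ground-truth class;
# the other classes' channels receive no gradient.
idx = torch.arange(num_rois)
picked = mask_logits[idx, gt_class]  # [num_rois, m, m]

# Per-pixel sigmoid + binary cross-entropy, as in the paper.
loss = F.binary_cross_entropy_with_logits(picked, gt_masks)
print(loss.item())
```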
The structure of Mask R-CNN is just as simple: on top of the RoIs obtained through RoIAlign (RoIPool in the original Faster R-CNN), it adds a parallel Mask branch (a small FCN). As shown in the figure below, Faster R-CNN attaches a Fast R-CNN detection head to each RoI (the class and box branches in the figure), and Mask R-CNN now runs a Mask branch in parallel with it.
Note that the Mask branch differs slightly between Mask R-CNN with and without an FPN backbone. With FPN, the class/box branches and the Mask branch do not share a single RoIAlign: during training, RoIAlign pools the proposals produced by the RPN (Region Proposal Network) to 7x7 for the class and box branches, but to 14x14 for the Mask branch, as in the sketch below.
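One way to see the two pooling resolutions side by side is with torchvision's MultiScaleRoIAlign, which torchvision's own Mask R-CNN implementation also uses; the FPN feature maps and the proposal box here are random placeholders:

```python
import torch
from torchvision.ops import MultiScaleRoIAlign

# Separate poolers for the two heads: 7x7 for class/box, 14x14 for masks.
box_roi_pool = MultiScaleRoIAlign(
    featmap_names=["0", "1", "2", "3"], output_size=7, sampling_ratio=2)
mask_roi_pool = MultiScaleRoIAlign(
    featmap_names=["0", "1", "2", "3"], output_size=14, sampling_ratio=2)

# Fake FPN maps P2-P5 for one 800x800 image (strides 4/8/16/32).
feats = {name: torch.randn(1, 256, 200 // 2 ** i, 200 // 2 ** i)
         for i, name in enumerate(["0", "1", "2", "3"])}
proposals = [torch.tensor([[10., 10., 300., 200.]])]  # (x1, y1, x2, y2)

box_feats = box_roi_pool(feats, proposals, image_shapes=[(800, 800)])
mask_feats = mask_roi_pool(feats, proposals, image_shapes=[(800, 800)])
print(box_feats.shape)   # torch.Size([1, 256, 7, 7])
print(mask_feats.shape)  # torch.Size([1, 256, 14, 14])
```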
Q: How does RoI Align in Mask R-CNN solve the misalignment between feature-map regions and the original image regions that RoI Pooling causes in Faster R-CNN?
A: RoI Pooling quantizes twice: it first rounds the floating-point proposal coordinates onto the feature-map grid, and then rounds again when dividing the RoI into pooling bins. Each rounding shifts the sampled features away from the true region, and the error is amplified by the feature stride, which is especially harmful for a pixel-level task like mask prediction. RoI Align drops both quantizations: it keeps the coordinates as floating-point values, places regularly spaced sampling points inside each bin, computes each point's value by bilinear interpolation from the four nearest feature cells, and then aggregates the points (by max or average). The pooled features therefore stay precisely aligned with the input region.
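The difference is easy to observe directly with torchvision's functional ops; the toy feature map and the box below are arbitrary:

```python
import torch
from torchvision.ops import roi_align, roi_pool

feat = torch.arange(64, dtype=torch.float32).reshape(1, 1, 8, 8)
# A box in image coordinates that does not land on the feature grid
# once scaled by spatial_scale = 1/4.
boxes = [torch.tensor([[1.3, 1.3, 18.7, 18.7]])]

# RoI Pool rounds the box and the bin boundaries to integer cells.
pooled = roi_pool(feat, boxes, output_size=2, spatial_scale=0.25)
# RoI Align keeps float coordinates, bilinearly interpolates
# sampling_ratio^2 points per bin, then averages them.
aligned = roi_align(feat, boxes, output_size=2, spatial_scale=0.25,
                    sampling_ratio=2)
print(pooled.squeeze())   # quantized result
print(aligned.squeeze())  # interpolated, alignment-preserving result
```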
"""
Model definitions
"""
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch
import numpy as np
import math
from model.resnet import resnet50
from model.rpn import RPN
#from model.lib.roi_align.roi_align.roi_align import RoIAlign
from model.lib.roi_align.roi_align.crop_and_resize import CropAndResize
from model.lib.bbox.generate_anchors import generate_pyramid_anchors
from model.lib.bbox.nms import torch_nms as nms
def log2_graph(x):
    """Implementation of log2; PyTorch of that era had no native log2 op."""
    return torch.div(torch.log(x), math.log(2.))
def ROIAlign(feature_maps, rois, config, pool_size, mode='bilinear'):
    """Implements ROI Align on the features.
    Params:
    - feature_maps: list of feature maps from different FPN levels,
      each [batch, channels, height, width]
    - rois: [batch, num_boxes, (x1, y1, x2, y2)] in image-pixel
      coordinates, possibly padded with zeros if there are not enough
      boxes to fill the array
    - config: model config; IMAGE_MAX_DIM and IMAGES_PER_GPU are read here
    - pool_size: output height/width of the pooled regions, usually 7 or 14
    - mode: unused; kept for API compatibility
    Output:
    Pooled regions of shape [batch, num_boxes, channels, pool_size, pool_size].
    """
"""
[ x2-x1 x1 + x2 - W + 1 ]
[ ----- 0 --------------- ]
[ W - 1 W - 1 ]
[ ]
[ y2-y1 y1 + y2 - H + 1 ]
[ 0 ----- --------------- ]
[ H - 1 H - 1 ]
"""
#feature_maps= [P2, P3, P4, P5]
rois = rois.detach()
crop_resize = CropAndResize(pool_size, pool_size, 0)
roi_number = rois.size()[1]
pooled = rois.data.new(
config.IMAGES_PER_GPU*rois.size(
1), 256, pool_size, pool_size).zero_()
rois = rois.view(
config.IMAGES_PER_GPU*rois.size(1),
4)
# Loop through levels and apply ROI pooling to each. P2 to P5.
x_1 = rois[:, 0]
y_1 = rois[:, 1]
x_2 = rois[:, 2]
y_2 = rois[:, 3]
    # FPN level assignment (Eq. 1 of the FPN paper):
    # k = k0 + log2(sqrt(w * h) / 224), with k0 = 4, clamped to [2, 5].
    roi_level = log2_graph(
        torch.div(torch.sqrt((y_2 - y_1) * (x_2 - x_1)), 224.0))
    roi_level = torch.clamp(
        torch.add(torch.round(roi_level), 4), min=2, max=5)
    # For a 1024 input: P2 is 256x256 (stride 4), P3 128x128 (stride 8),
    # P4 64x64 (stride 16), P5 32x32 (stride 32).
    for i, level in enumerate(range(2, 6)):
        # Select the RoIs assigned to this pyramid level.
        ixx = torch.eq(roi_level, level)
        ix = torch.unsqueeze(ixx, 1)
        level_boxes = torch.masked_select(rois, ix)
        if level_boxes.size()[0] == 0:
            continue
        level_boxes = level_boxes.view(-1, 4)
        # Every crop is taken from image 0 of the batch; like the original
        # code, this path effectively assumes IMAGES_PER_GPU == 1.
        box_indices = (level_boxes[:, 0] * 0).int()
        # CropAndResize expects normalized (y1, x1, y2, x2) boxes, so scale
        # by the image size and swap (x, y) -> (y, x).
        crops = crop_resize(feature_maps[i], torch.div(
            level_boxes, float(config.IMAGE_MAX_DIM)
        )[:, [1, 0, 3, 2]], box_indices)
        # Scatter the crops back to their original RoI positions.
        indices_pooled = ixx.nonzero()[:, 0]
        pooled[indices_pooled.data, :, :, :] = crops.data
pooled = pooled.view(config.IMAGES_PER_GPU, roi_number,
256, pool_size, pool_size)
pooled = Variable(pooled).cuda()
return pooled
# ---------------------------------------------------------------
# Heads
class MaskHead(nn.Module):
def __init__(self, config):
super(MaskHead, self).__init__()
self.config = config
self.num_classes = config.NUM_CLASSES
#self.crop_size = config.mask_crop_size
#self.roi_align = RoIAlign(self.crop_size, self.crop_size)
self.conv1 = nn.Conv2d(256, 256, kernel_size=3, padding=1, stride=1)
self.bn1 = nn.BatchNorm2d(256)
self.conv2 = nn.Conv2d(256, 256, kernel_size=3, padding=1, stride=1)
self.bn2 = nn.BatchNorm2d(256)
self.conv3 = nn.Conv2d(256, 256, kernel_size=3, padding=1, stride=1)
self.bn3 = nn.BatchNorm2d(256)
self.conv4 = nn.Conv2d(256, 256, kernel_size=3, padding=1, stride=1)
self.bn4 = nn.BatchNorm2d(256)
self.deconv = nn.ConvTranspose2d(256, 256, kernel_size=4, padding=1, stride=2, bias=False)
self.mask = nn.Conv2d(256, self.num_classes, kernel_size=1, padding=0, stride=1)
def forward(self, x, rpn_rois):
#x = self.roi_align(x, rpn_rois)
x = ROIAlign(x, rpn_rois, self.config, self.config.MASK_POOL_SIZE)
roi_number = x.size()[1]
# merge batch and roi number together
x = x.view(self.config.IMAGES_PER_GPU * roi_number,
256, self.config.MASK_POOL_SIZE,
self.config.MASK_POOL_SIZE)
x = F.relu(self.bn1(self.conv1(x)), inplace=True)
x = F.relu(self.bn2(self.conv2(x)), inplace=True)
x = F.relu(self.bn3(self.conv3(x)), inplace=True)
x = F.relu(self.bn4(self.conv4(x)), inplace=True)
        x = self.deconv(x)  # upsample MASK_POOL_SIZE -> 2 * MASK_POOL_SIZE (14 -> 28)
        # One mask-logit map per class; the sigmoid and per-class binary
        # cross-entropy are applied in the loss, as in the paper.
        rcnn_mask_logits = self.mask(x)
rcnn_mask_logits = rcnn_mask_logits.view(self.config.IMAGES_PER_GPU,
roi_number,
self.config.NUM_CLASSES,
self.config.MASK_POOL_SIZE * 2,
self.config.MASK_POOL_SIZE * 2)
return rcnn_mask_logits
class RCNNHead(nn.Module):
def __init__(self, config):
super(RCNNHead, self).__init__()
self.config = config
self.num_classes = config.NUM_CLASSES
#self.crop_size = config.rcnn_crop_size
#self.roi_align = RoIAlign(self.crop_size, self.crop_size)
self.fc1 = nn.Linear(1024, 1024)
self.fc2 = nn.Linear(1024, 1024)
self.class_logits = nn.Linear(1024, self.num_classes)
self.bbox = nn.Linear(1024, self.num_classes * 4)
        # A POOL_SIZE x POOL_SIZE convolution collapses each pooled RoI to
        # a 1x1x1024 vector, i.e. a fully connected layer over the RoI.
        self.conv1 = nn.Conv2d(256, 1024, kernel_size=self.config.POOL_SIZE, stride=1, padding=0)
        self.bn1 = nn.BatchNorm2d(1024, eps=0.001)
def forward(self, x, rpn_rois):
x = ROIAlign(x, rpn_rois, self.config, self.config.POOL_SIZE)
roi_number = x.size()[1]
x = x.view(self.config.IMAGES_PER_GPU * roi_number,
256, self.config.POOL_SIZE,
self.config.POOL_SIZE)
#print(x.shape)
#x = self.roi_align(x, rpn_rois, self.config, self.config.POOL_SIZE)
#x = crops.view(crops.size(0), -1)
x = self.bn1(self.conv1(x))
x = x.permute(0, 2, 3, 1).contiguous().view(x.size(0), -1)
x = F.relu(self.fc1(x), inplace=True)
x = F.relu(self.fc2(x), inplace=True)
#x = F.dropout(x, 0.5, training=self.training)
rcnn_class_logits = self.class_logits(x)
rcnn_probs = F.softmax(rcnn_class_logits, dim=-1)
rcnn_bbox = self.bbox(x)
rcnn_class_logits = rcnn_class_logits.view(self.config.IMAGES_PER_GPU,
roi_number,
rcnn_class_logits.size()[-1])
rcnn_probs = rcnn_probs.view(self.config.IMAGES_PER_GPU,
roi_number,
rcnn_probs.size()[-1])
rcnn_bbox = rcnn_bbox.view(self.config.IMAGES_PER_GPU,
roi_number,
self.config.NUM_CLASSES,
4)
return rcnn_class_logits, rcnn_probs, rcnn_bbox
#
# ---------------------------------------------------------------
# Mask R-CNN
class MaskRCNN(nn.Module):
"""
Mask R-CNN model
"""
def __init__(self, config):
super(MaskRCNN, self).__init__()
self.config = config
self.__mode = 'train'
feature_channels = 128
# define modules (set of layers)
# self.feature_net = FeatureNet(cfg, 3, feature_channels)
self.feature_net = resnet50().cuda()
#self.rpn_head = RpnMultiHead(cfg,feature_channels)
self.rpn = RPN(256, len(self.config.RPN_ANCHOR_RATIOS),
self.config.RPN_ANCHOR_STRIDE)
#self.rcnn_crop = CropRoi(cfg, cfg.rcnn_crop_size)
self.rcnn_head = RCNNHead(config)
#self.mask_crop = CropRoi(cfg, cfg.mask_crop_size)
self.mask_head = MaskHead(config)
self.anchors = generate_pyramid_anchors(self.config.RPN_ANCHOR_SCALES,
self.config.RPN_ANCHOR_RATIOS,
self.config.BACKBONE_SHAPES,
self.config.BACKBONE_STRIDES,
self.config.RPN_ANCHOR_STRIDE)
self.anchors = self.anchors.astype(np.float32)
self.proposal_count = self.config.POST_NMS_ROIS_TRAINING
# FPN
self.fpn_c5p5 = nn.Conv2d(
512 * 4, 256, kernel_size=1, stride=1, padding=0)
self.fpn_c4p4 = nn.Conv2d(
256 * 4, 256, kernel_size=1, stride=1, padding=0)
self.fpn_c3p3 = nn.Conv2d(
128 * 4, 256, kernel_size=1, stride=1, padding=0)
self.fpn_c2p2 = nn.Conv2d(
64 * 4, 256, kernel_size=1, stride=1, padding=0)
self.fpn_p2 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
self.fpn_p3 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
self.fpn_p4 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
self.fpn_p5 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
self.scale_ratios = [4, 8, 16, 32]
self.fpn_p6 = nn.MaxPool2d(
kernel_size=1, stride=2, padding=0, ceil_mode=False)
def forward(self, x):
# Extract features
C1, C2, C3, C4, C5 = self.feature_net(x)
        # FPN top-down pathway: upsample the coarser map and add the
        # lateral 1x1-conv connection from the corresponding ResNet stage.
        P5 = self.fpn_c5p5(C5)
        P4 = self.fpn_c4p4(C4) + F.upsample(P5,
                                            scale_factor=2, mode='bilinear')
        P3 = self.fpn_c3p3(C3) + F.upsample(P4,
                                            scale_factor=2, mode='bilinear')
        P2 = self.fpn_c2p2(C2) + F.upsample(P3,
                                            scale_factor=2, mode='bilinear')
# Attach 3x3 conv to all P layers to get the final feature maps.
        # For a 1024x1024 input: P2 is 256x256, P3 128x128, P4 64x64, P5 32x32.
P2 = self.fpn_p2(P2)
P3 = self.fpn_p3(P3)
P4 = self.fpn_p4(P4)
P5 = self.fpn_p5(P5)
# P6 is used for the 5th anchor scale in RPN. Generated by
# subsampling from P5 with stride of 2.
P6 = self.fpn_p6(P5)
# Note that P6 is used in RPN, but not in the classifier heads.
rpn_feature_maps = [P2, P3, P4, P5, P6]
self.mrcnn_feature_maps = [P2, P3, P4, P5]
rpn_class_logits_outputs = []
rpn_class_outputs = []
rpn_bbox_outputs = []
# RPN proposals
for feature in rpn_feature_maps:
rpn_class_logits, rpn_probs, rpn_bbox = self.rpn(feature)
rpn_class_logits_outputs.append(rpn_class_logits)
rpn_class_outputs.append(rpn_probs)
rpn_bbox_outputs.append(rpn_bbox)
rpn_class_logits = torch.cat(rpn_class_logits_outputs, dim=1)
rpn_class = torch.cat(rpn_class_outputs, dim=1)
rpn_bbox = torch.cat(rpn_bbox_outputs, dim=1)
rpn_proposals = self.proposal_layer(rpn_class, rpn_bbox)
# RCNN proposals
rcnn_class_logits, rcnn_class, rcnn_bbox = self.rcnn_head(self.mrcnn_feature_maps, rpn_proposals)
rcnn_mask_logits = self.mask_head(self.mrcnn_feature_maps, rpn_proposals)
# <todo> mask nms
return [rpn_class_logits, rpn_class, rpn_bbox, rpn_proposals,
rcnn_class_logits, rcnn_class, rcnn_bbox,
rcnn_mask_logits]
def proposal_layer(self, rpn_class, rpn_bbox):
# handling proposals
scores = rpn_class[:, :, 1]
#print(scores.shape)
# Box deltas [batch, num_rois, 4]
deltas_mul = Variable(torch.from_numpy(np.reshape(
self.config.RPN_BBOX_STD_DEV, [1, 1, 4]).astype(np.float32))).cuda()
deltas = rpn_bbox * deltas_mul
        # Keep only the top-scoring anchors before NMS.
        pre_nms_limit = min(6000, self.anchors.shape[0])
        scores, ix = torch.topk(scores, pre_nms_limit, dim=-1,
                                largest=True, sorted=True)
        # Expand the indices so gather picks up all 4 box coordinates.
        ix = torch.unsqueeze(ix, 2)
        ix = torch.cat([ix, ix, ix, ix], dim=2)
        deltas = torch.gather(deltas, 1, ix)
_anchors = []
for i in range(self.config.IMAGES_PER_GPU):
anchors = Variable(torch.from_numpy(
self.anchors.astype(np.float32))).cuda()
_anchors.append(anchors)
anchors = torch.stack(_anchors, 0)
pre_nms_anchors = torch.gather(anchors, 1, ix)
refined_anchors = apply_box_deltas_graph(pre_nms_anchors, deltas)
        # Clip to image boundaries; the clipped boxes come back stacked as
        # [batch, N, (x1, y1, x2, y2)] ready for the NMS step below.
height, width = self.config.IMAGE_SHAPE[:2]
window = np.array([0, 0, height, width]).astype(np.float32)
window = Variable(torch.from_numpy(window)).cuda()
refined_anchors_clipped = clip_boxes_graph(refined_anchors, window)
refined_proposals = []
scores = scores[:,:,None]
#print(scores.data.shape)
#print(refined_anchors_clipped.data.shape)
for i in range(self.config.IMAGES_PER_GPU):
indices = nms(
torch.cat([refined_anchors_clipped.data[i], scores.data[i]], 1), 0.7)
indices = indices[:self.proposal_count]
indices = torch.stack([indices, indices, indices, indices], dim=1)
indices = Variable(indices).cuda()
proposals = torch.gather(refined_anchors_clipped[i], 0, indices)
            # Pad with zero boxes so every image yields exactly proposal_count proposals.
            padding = self.proposal_count - proposals.size()[0]
proposals = torch.cat(
[proposals, Variable(torch.zeros([padding, 4])).cuda()], 0)
refined_proposals.append(proposals)
rpn_rois = torch.stack(refined_proposals, 0)
return rpn_rois
def apply_box_deltas_graph(boxes, deltas):
    """Applies the given deltas to the given boxes.
    boxes: [batch, N, 4] where each row is (y1, x1, y2, x2)
    deltas: [batch, N, 4] where each row is (dy, dx, log(dh), log(dw))
    Returns a list of four [batch, N] tensors: [y1, x1, y2, x2].
    """
# Convert to y, x, h, w
height = boxes[:, :, 2] - boxes[:, :, 0]
width = boxes[:, :, 3] - boxes[:, :, 1]
center_y = boxes[:, :, 0] + 0.5 * height
center_x = boxes[:, :, 1] + 0.5 * width
# Apply deltas
center_y += deltas[:, :, 0] * height
center_x += deltas[:, :, 1] * width
height *= torch.exp(deltas[:, :, 2])
width *= torch.exp(deltas[:, :, 3])
# Convert back to y1, x1, y2, x2
y1 = center_y - 0.5 * height
x1 = center_x - 0.5 * width
y2 = y1 + height
x2 = x1 + width
result = [y1, x1, y2, x2]
return result
def clip_boxes_graph(boxes, window):
    """
    boxes: list of four [batch, N] tensors (y1, x1, y2, x2), as returned
        by apply_box_deltas_graph
    window: [4] in the form (y1, x1, y2, x2)
    """
    # Split corners
    wy1, wx1, wy2, wx2 = window
    y1, x1, y2, x2 = boxes
    # Clip each corner to the window.
    y1 = torch.max(torch.min(y1, wy2), wy1)
    x1 = torch.max(torch.min(x1, wx2), wx1)
    y2 = torch.max(torch.min(y2, wy2), wy1)
    x2 = torch.max(torch.min(x2, wx2), wx1)
    # Stack as (x1, y1, x2, y2) to match the (x1, y1, x2, y2, score)
    # layout that the NMS call in proposal_layer expects.
    clipped = torch.stack([x1, y1, x2, y2], dim=2)
    return clipped
```
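Finally, a hedged sketch of how this module could be driven end to end. The `Config` class below is an assumption: it simply mirrors the Matterport-style fields that the code above reads, filled with the usual 1024-input defaults, and it is not part of the original file.

```python
import numpy as np
import torch
from torch.autograd import Variable

class Config:
    """Hypothetical config; the fields mirror what MaskRCNN reads above."""
    NUM_CLASSES = 81                       # COCO: 80 classes + background
    IMAGES_PER_GPU = 1
    IMAGE_MAX_DIM = 1024
    IMAGE_SHAPE = np.array([1024, 1024, 3])
    POOL_SIZE = 7                          # class/box head RoIAlign output
    MASK_POOL_SIZE = 14                    # mask head RoIAlign output
    POST_NMS_ROIS_TRAINING = 2000
    RPN_ANCHOR_SCALES = (32, 64, 128, 256, 512)
    RPN_ANCHOR_RATIOS = [0.5, 1, 2]
    RPN_ANCHOR_STRIDE = 1
    RPN_BBOX_STD_DEV = np.array([0.1, 0.1, 0.2, 0.2])
    BACKBONE_STRIDES = [4, 8, 16, 32, 64]  # strides of P2-P6
    BACKBONE_SHAPES = np.array(
        [[1024 // s, 1024 // s] for s in [4, 8, 16, 32, 64]])

model = MaskRCNN(Config()).cuda()
images = Variable(torch.randn(1, 3, 1024, 1024)).cuda()
(rpn_class_logits, rpn_class, rpn_bbox, rpn_proposals,
 rcnn_class_logits, rcnn_class, rcnn_bbox, rcnn_mask_logits) = model(images)
print(rpn_proposals.shape, rcnn_mask_logits.shape)
```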