实例分割

post on 16 Feb 2025 about 15697words require 53min
CC BY 4.0 （除特别声明或转载文章外）
如果这篇博客帮助到你，可以请我喝一杯咖啡~

实例分割（Instance Segmentation）

定义

实例分割是目标检测和语义分割的结合，旨在精确识别图像中的每个目标对象，并区分同一类别中的不同实例。例如，在一张包含多个人的图像中，实例分割不仅需要识别出“人”这一类别，还需要将每个人单独区分开来，为每个人生成独立的分割掩码（Mask）。

实例分割通过融合目标检测和语义分割的结果来实现这一目标。具体而言，它利用目标检测提供的目标类别和位置信息（如边界框和置信度），从语义分割的结果中提取出对应目标的像素级掩码。简而言之，实例分割的任务是将同一类别中的具体对象（即实例）分别分割出来。

举个例子,近年来，随着自动驾驶等领域的快速发展，实例分割任务受到了广泛关注。自动驾驶场景中，精确区分和分割道路上的行人、车辆等目标对于环境感知和决策至关重要。此外，一些实例分割任务还会输出检测结果（如边界框），以提供更全面的目标描述。

对实例分割、语义分割和目标检测混合任务感兴趣的读者，可以参考 CVPR 2019 的论文《Hybrid Task Cascade》（HTC），该研究提出了一种混合任务级联框架，能够同时处理这三种任务，为多任务学习提供了新的思路。

特点

能够精确地定位和区分同一类别的不同实例。
计算成本较高，因为需要对每个目标实例进行单独检测和分类。

应用

实例分割在以下领域有重要应用：

自动驾驶：用于检测和分割车辆、行人。
医学成像：用于检测组织和病理的特定边界。
机器人视觉：用于识别和隔离目标物体。

常见模型

Mask R-CNN

优点：能够同时进行目标检测和语义分割，具有较好的性能。
缺点：模型参数多，训练和推理速度较慢；大目标的边缘分割较为粗糙。

提出初衷

Mask R-CNN 是 2017 年发表的文章，一作是何恺明大神，没错就是那个男人，除此之外还有 Faster R-CNN 系列的大神 Ross Girshick，可以说是强强联合。该论文也获得了 ICCV 2017 的最佳论文奖（Marr Prize）。并且该网络提出后，又霸榜了 MS COCO 的各项任务，包括目标检测、实例分割以及人体关键点检测任务。Mask R-CNN 的结构很简洁而且很灵活效果又很好（仅仅是在 Faster R-CNN 的基础上根据需求加入一些新的分支）。 Mask R-CNN 的提出初衷是为了实现高效且精确的实例分割任务，同时继承 Faster R-CNN 在目标检测方面的优势。具体而言，Mask R-CNN 的核心动机是将目标检测与语义分割相结合，既能检测图像中的目标并定位其边界框，又能为每个目标生成精确的像素级分割掩码。此外，Mask R-CNN 通过引入 RoI Align 技术，解决了 Faster R-CNN 中 RoI Pooling 导致的特征图与原始图像区域对齐不精确的问题，显著提高了分割掩码的精度。Mask R-CNN 在 Faster R-CNN 的基础上增加了全卷积网络（FCN）分支，用于预测每个感兴趣区域（RoI）的分割掩码，这种设计不仅简单高效，且只增加了较小的计算开销。它还具备良好的多任务扩展性，能够轻松扩展到其他任务（如人体关键点检测），成为一种通用的视觉框架。通过为每个类别预测独立的二元掩码，Mask R-CNN 能够更好地提取目标的空间布局信息，从而生成更精确的分割掩码，并在 COCO 等数据集上取得了显著优于当时其他模型的性能。总之，Mask R-CNN 的提出填补了目标检测和语义分割之间的空白，提供了一种高精度、高效率的实例分割解决方案。 Mask R-CNN 的提出初衷是为了实现高效且精确的实例分割任务，同时继承 Faster R-CNN 在目标检测方面的优势。具体动机和背景如下：

结合目标检测与语义分割 在 Mask R-CNN 提出之前，目标检测（如 Faster R-CNN）和语义分割（如 FCN）是两个独立的领域。Mask R-CNN 的核心动机是将两者结合起来，既能够检测图像中的目标并定位其边界框，还能为每个目标生成精确的像素级分割掩码。
解决像素级对齐问题 Faster R-CNN 在处理像素级任务时存在局限性，尤其是其 RoI Pooling 操作会导致特征图与原始图像区域的对齐不精确。Mask R-CNN 通过引入 RoI Align 技术，解决了这一问题，显著提高了分割掩码的精度。
简单高效的架构设计 Mask R-CNN 在 Faster R-CNN 的基础上，增加了一个全卷积网络（FCN）分支，用于预测每个感兴趣区域（RoI）的分割掩码。这种设计不仅简单高效，而且只增加了较小的计算开销。
多任务学习的扩展性 Mask R-CNN 不仅适用于实例分割，还可以轻松扩展到其他任务，如人体关键点检测。这种多任务扩展性使其成为一种通用的视觉框架。
提升实例分割精度 通过为每个类别预测独立的二元掩码，Mask R-CNN 能够更好地提取目标的空间布局信息，从而生成更精确的分割掩码。这一改进使其在 COCO 等数据集上取得了显著优于当时其他模型的性能。

网络结构

Mask R-CNN 的结构也很简单，就是在通过 RoIAlign（在原 Faster R-CNN 中是 RoIPool）得到的 RoI 基础上并行添加一个 Mask 分支（小型的 FCN）。见下图，之前 Faster R-CNN 是在 RoI 基础上接上一个 Fast R-CNN 检测头，即图中 class, box 分支，现在又并行了一个 Mask 分支。

注意带和不带 FPN 结构的 Mask R-CNN 在 Mask 分支上略有不同，对于带有 FPN 结构的 Mask R-CNN 它的 class、box 分支和 Mask 分支并不是共用一个 RoIAlign。在训练过程中，对于 class, box 分支 RoIAlign 将 RPN（Region Proposal Network）得到的 Proposals 池化到 7x7 大小，而对于 Mask 分支 RoIAlign 将 Proposals 池化到 14x14 大小。

Q:Mask R-CNN 中的 RoI Align 技术是如何解决 Faster R-CNN 中 RoI Pooling 导致的特征图与原始图像区域对齐不精确的问题的呢？

模型代码

"""
Model definitions

"""
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch
import numpy as np
import math

from model.resnet import resnet50
from model.rpn import RPN
#from model.lib.roi_align.roi_align.roi_align import RoIAlign
from model.lib.roi_align.roi_align.crop_and_resize import CropAndResize
from model.lib.bbox.generate_anchors import generate_pyramid_anchors
from model.lib.bbox.nms import torch_nms as nms


def log2_graph(x):
    """Implementatin of Log2. pytorch doesn't have a native implemenation."""
    return torch.div(torch.log(x), math.log(2.))


def ROIAlign(feature_maps, rois, config, pool_size, mode='bilinear'):
    """Implements ROI Align on the features.
    Params:
    - pool_shape: [height, width] of the output pooled regions. Usually [7, 7]
    - image_shape: [height, width, chanells]. Shape of input image in pixels
    Inputs:
    - boxes: [batch, num_boxes, (x1, y1, x2, y2)] in normalized
             coordinates. Possibly padded with zeros if not enough
             boxes to fill the array.
    - Feature maps: List of feature maps from different levels of the pyramid.
                    Each is [batch, channels, height, width]
    Output:
    Pooled regions in the shape: [batch, num_boxes, height, width, channels].
    The width and height are those specific in the pool_shape in the layer
    constructor.
    """
    """
    [  x2-x1             x1 + x2 - W + 1  ]
    [  -----      0      ---------------  ]
    [  W - 1                  W - 1       ]
    [                                     ]
    [           y2-y1    y1 + y2 - H + 1  ]
    [    0      -----    ---------------  ]
    [           H - 1         H - 1      ]
    """
    #feature_maps= [P2, P3, P4, P5]
    rois = rois.detach()
    crop_resize = CropAndResize(pool_size, pool_size, 0)

    roi_number = rois.size()[1]

    pooled = rois.data.new(
            config.IMAGES_PER_GPU*rois.size(
            1), 256, pool_size, pool_size).zero_()

    rois = rois.view(
            config.IMAGES_PER_GPU*rois.size(1),
            4)

    # Loop through levels and apply ROI pooling to each. P2 to P5.
    x_1 = rois[:, 0]
    y_1 = rois[:, 1]
    x_2 = rois[:, 2]
    y_2 = rois[:, 3]

    roi_level = log2_graph(
        torch.div(torch.sqrt((y_2 - y_1) * (x_2 - x_1)), 224.0))

    roi_level = torch.clamp(torch.clamp(
        torch.add(torch.round(roi_level), 4), min=2), max=5)

    # P2 is 256x256, P3 is 128x128, P4 is 64x64, P5 is 32x32
    # P2 is 4, P3 is 8, P4 is 16, P5 is 32
    for i, level in enumerate(range(2, 6)):

        scaling_ratio = 2**level

        height = float(config.IMAGE_MAX_DIM)/ scaling_ratio
        width = float(config.IMAGE_MAX_DIM) / scaling_ratio

        ixx = torch.eq(roi_level, level)

        box_indices = ixx.view(-1).int() * 0
        ix = torch.unsqueeze(ixx, 1)
        level_boxes = torch.masked_select(rois, ix)
        if level_boxes.size()[0] == 0:
            continue
        level_boxes = level_boxes.view(-1, 4)

        crops = crop_resize(feature_maps[i], torch.div(
                level_boxes, float(config.IMAGE_MAX_DIM)
                )[:, [1, 0, 3, 2]], box_indices)

        indices_pooled = ixx.nonzero()[:, 0]
        pooled[indices_pooled.data, :, :, :] = crops.data

    pooled = pooled.view(config.IMAGES_PER_GPU, roi_number,
               256, pool_size, pool_size)
    pooled = Variable(pooled).cuda()
    return pooled


# ---------------------------------------------------------------
# Heads

class MaskHead(nn.Module):

    def __init__(self, config):
        super(MaskHead, self).__init__()
        self.config = config
        self.num_classes = config.NUM_CLASSES
        #self.crop_size = config.mask_crop_size

        #self.roi_align = RoIAlign(self.crop_size, self.crop_size)
        self.conv1 = nn.Conv2d(256, 256, kernel_size=3, padding=1, stride=1)
        self.bn1 = nn.BatchNorm2d(256)
        self.conv2 = nn.Conv2d(256, 256, kernel_size=3, padding=1, stride=1)
        self.bn2 = nn.BatchNorm2d(256)
        self.conv3 = nn.Conv2d(256, 256, kernel_size=3, padding=1, stride=1)
        self.bn3 = nn.BatchNorm2d(256)
        self.conv4 = nn.Conv2d(256, 256, kernel_size=3, padding=1, stride=1)
        self.bn4 = nn.BatchNorm2d(256)

        self.deconv = nn.ConvTranspose2d(256, 256, kernel_size=4, padding=1, stride=2, bias=False)
        self.mask = nn.Conv2d(256, self.num_classes, kernel_size=1, padding=0, stride=1)

    def forward(self, x, rpn_rois):
        #x = self.roi_align(x, rpn_rois)
        x = ROIAlign(x, rpn_rois, self.config, self.config.MASK_POOL_SIZE)

        roi_number = x.size()[1]

        # merge batch and roi number together
        x = x.view(self.config.IMAGES_PER_GPU * roi_number,
                   256, self.config.MASK_POOL_SIZE,
                   self.config.MASK_POOL_SIZE)

        x = F.relu(self.bn1(self.conv1(x)), inplace=True)
        x = F.relu(self.bn2(self.conv2(x)), inplace=True)
        x = F.relu(self.bn3(self.conv3(x)), inplace=True)
        x = F.relu(self.bn4(self.conv4(x)), inplace=True)
        x = self.deconv(x)
        rcnn_mask_logits = self.mask(x)

        rcnn_mask_logits = rcnn_mask_logits.view(self.config.IMAGES_PER_GPU,
                                                 roi_number,
                                                 self.config.NUM_CLASSES,
                                                 self.config.MASK_POOL_SIZE * 2,
                                                 self.config.MASK_POOL_SIZE * 2)

        return rcnn_mask_logits


class RCNNHead(nn.Module):
    def __init__(self, config):
        super(RCNNHead, self).__init__()
        self.config = config
        self.num_classes = config.NUM_CLASSES
        #self.crop_size = config.rcnn_crop_size

        #self.roi_align = RoIAlign(self.crop_size, self.crop_size)
        self.fc1 = nn.Linear(1024, 1024)
        self.fc2 = nn.Linear(1024, 1024)
        self.class_logits = nn.Linear(1024, self.num_classes)
        self.bbox = nn.Linear(1024, self.num_classes * 4)

        self.conv1 = nn.Conv2d(256, 1024, kernel_size=self.config.POOL_SIZE, stride=1, padding=0)
        self.bn1 = nn.BatchNorm2d(1024, eps=0.001)

    def forward(self, x, rpn_rois):
        x = ROIAlign(x, rpn_rois, self.config, self.config.POOL_SIZE)
        roi_number = x.size()[1]

        x = x.view(self.config.IMAGES_PER_GPU * roi_number,
                   256, self.config.POOL_SIZE,
                   self.config.POOL_SIZE)
        #print(x.shape)
        #x = self.roi_align(x, rpn_rois, self.config, self.config.POOL_SIZE)
        #x = crops.view(crops.size(0), -1)
        x = self.bn1(self.conv1(x))
        x = x.permute(0, 2, 3, 1).contiguous().view(x.size(0), -1)
        x = F.relu(self.fc1(x), inplace=True)
        x = F.relu(self.fc2(x), inplace=True)
        #x = F.dropout(x, 0.5, training=self.training)
        rcnn_class_logits = self.class_logits(x)
        rcnn_probs = F.softmax(rcnn_class_logits, dim=-1)

        rcnn_bbox = self.bbox(x)

        rcnn_class_logits = rcnn_class_logits.view(self.config.IMAGES_PER_GPU,
                                                   roi_number,
                                                   rcnn_class_logits.size()[-1])

        rcnn_probs = rcnn_probs.view(self.config.IMAGES_PER_GPU,
                                     roi_number,
                                     rcnn_probs.size()[-1])

        rcnn_bbox = rcnn_bbox.view(self.config.IMAGES_PER_GPU,
                                   roi_number,
                                   self.config.NUM_CLASSES,
                                   4)

        return rcnn_class_logits, rcnn_probs, rcnn_bbox


#
# ---------------------------------------------------------------
# Mask R-CNN

class MaskRCNN(nn.Module):
    """
    Mask R-CNN model
    """

    def __init__(self, config):
        super(MaskRCNN, self).__init__()
        self.config = config
        self.__mode = 'train'
        feature_channels = 128
        # define modules (set of layers)
#        self.feature_net = FeatureNet(cfg, 3, feature_channels)
        self.feature_net = resnet50().cuda()
        #self.rpn_head = RpnMultiHead(cfg,feature_channels)
        self.rpn = RPN(256, len(self.config.RPN_ANCHOR_RATIOS),
                             self.config.RPN_ANCHOR_STRIDE)
        #self.rcnn_crop = CropRoi(cfg, cfg.rcnn_crop_size)
        self.rcnn_head = RCNNHead(config)
        #self.mask_crop = CropRoi(cfg, cfg.mask_crop_size)
        self.mask_head = MaskHead(config)

        self.anchors = generate_pyramid_anchors(self.config.RPN_ANCHOR_SCALES,
                                                self.config.RPN_ANCHOR_RATIOS,
                                                self.config.BACKBONE_SHAPES,
                                                self.config.BACKBONE_STRIDES,
                                                self.config.RPN_ANCHOR_STRIDE)
        self.anchors = self.anchors.astype(np.float32)
        self.proposal_count = self.config.POST_NMS_ROIS_TRAINING
        # FPN
        self.fpn_c5p5 = nn.Conv2d(
            512 * 4, 256, kernel_size=1, stride=1, padding=0)
        self.fpn_c4p4 = nn.Conv2d(
            256 * 4, 256, kernel_size=1, stride=1, padding=0)
        self.fpn_c3p3 = nn.Conv2d(
            128 * 4, 256, kernel_size=1, stride=1, padding=0)
        self.fpn_c2p2 = nn.Conv2d(
            64 * 4, 256, kernel_size=1, stride=1,  padding=0)

        self.fpn_p2 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
        self.fpn_p3 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
        self.fpn_p4 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
        self.fpn_p5 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)

        self.scale_ratios = [4, 8, 16, 32]
        self.fpn_p6 = nn.MaxPool2d(
            kernel_size=1, stride=2, padding=0, ceil_mode=False)

    def forward(self, x):
        # Extract features
        C1, C2, C3, C4, C5 = self.feature_net(x)
        P5 = self.fpn_c5p5(C5)
        P4 = self.fpn_c4p4(C4) + F.upsample(P5,
                                            scale_factor=2, mode='bilinear')
        P3 = self.fpn_c3p3(C3) + F.upsample(P4,
                                            scale_factor=2, mode='bilinear')
        P2 = self.fpn_c2p2(C2) + F.upsample(P3,
                                            scale_factor=2, mode='bilinear')

        # Attach 3x3 conv to all P layers to get the final feature maps.
        # P2 is 256, P3 is 128, P4 is 64, P5 is 32
        P2 = self.fpn_p2(P2)
        P3 = self.fpn_p3(P3)
        P4 = self.fpn_p4(P4)
        P5 = self.fpn_p5(P5)
        # P6 is used for the 5th anchor scale in RPN. Generated by
        # subsampling from P5 with stride of 2.
        P6 = self.fpn_p6(P5)

        # Note that P6 is used in RPN, but not in the classifier heads.
        rpn_feature_maps = [P2, P3, P4, P5, P6]

        self.mrcnn_feature_maps = [P2, P3, P4, P5]

        rpn_class_logits_outputs = []
        rpn_class_outputs = []
        rpn_bbox_outputs = []
        # RPN proposals
        for feature in rpn_feature_maps:
            rpn_class_logits, rpn_probs, rpn_bbox = self.rpn(feature)
            rpn_class_logits_outputs.append(rpn_class_logits)
            rpn_class_outputs.append(rpn_probs)
            rpn_bbox_outputs.append(rpn_bbox)

        rpn_class_logits = torch.cat(rpn_class_logits_outputs, dim=1)
        rpn_class = torch.cat(rpn_class_outputs, dim=1)
        rpn_bbox = torch.cat(rpn_bbox_outputs, dim=1)

        rpn_proposals = self.proposal_layer(rpn_class, rpn_bbox)

        # RCNN proposals
        rcnn_class_logits, rcnn_class, rcnn_bbox = self.rcnn_head(self.mrcnn_feature_maps, rpn_proposals)
        rcnn_mask_logits = self.mask_head(self.mrcnn_feature_maps, rpn_proposals)
        # <todo> mask nms

        return [rpn_class_logits, rpn_class, rpn_bbox, rpn_proposals,
                rcnn_class_logits, rcnn_class, rcnn_bbox,
                rcnn_mask_logits]

    def proposal_layer(self, rpn_class, rpn_bbox):
        # handling proposals
        scores = rpn_class[:, :, 1]
        #print(scores.shape)
        # Box deltas [batch, num_rois, 4]
        deltas_mul = Variable(torch.from_numpy(np.reshape(
            self.config.RPN_BBOX_STD_DEV, [1, 1, 4]).astype(np.float32))).cuda()
        deltas = rpn_bbox * deltas_mul

        pre_nms_limit = min(6000, self.anchors.shape[0])

        scores, ix = torch.topk(scores, pre_nms_limit, dim=-1,
                                largest=True, sorted=True)

        ix = torch.unsqueeze(ix, 2)
        ix = torch.cat([ix, ix, ix, ix], dim=2)
        deltas = torch.gather(deltas, 1, ix)

        _anchors = []
        for i in range(self.config.IMAGES_PER_GPU):
            anchors = Variable(torch.from_numpy(
                self.anchors.astype(np.float32))).cuda()
            _anchors.append(anchors)
        anchors = torch.stack(_anchors, 0)

        pre_nms_anchors = torch.gather(anchors, 1, ix)
        refined_anchors = apply_box_deltas_graph(pre_nms_anchors, deltas)

        # Clip to image boundaries. [batch, N, (y1, x1, y2, x2)]
        height, width = self.config.IMAGE_SHAPE[:2]
        window = np.array([0, 0, height, width]).astype(np.float32)
        window = Variable(torch.from_numpy(window)).cuda()

        refined_anchors_clipped = clip_boxes_graph(refined_anchors, window)

        refined_proposals = []
        scores = scores[:,:,None]
        #print(scores.data.shape)
        #print(refined_anchors_clipped.data.shape)
        for i in range(self.config.IMAGES_PER_GPU):
            indices = nms(
                torch.cat([refined_anchors_clipped.data[i], scores.data[i]], 1), 0.7)
            indices = indices[:self.proposal_count]
            indices = torch.stack([indices, indices, indices, indices], dim=1)
            indices = Variable(indices).cuda()
            proposals = torch.gather(refined_anchors_clipped[i], 0, indices)
            padding = self.proposal_count - proposals.size()[0]
            proposals = torch.cat(
                [proposals, Variable(torch.zeros([padding, 4])).cuda()], 0)
            refined_proposals.append(proposals)

        rpn_rois = torch.stack(refined_proposals, 0)

        return rpn_rois


def apply_box_deltas_graph(boxes, deltas):
    """Applies the given deltas to the given boxes.
    boxes: [N, 4] where each row is y1, x1, y2, x2
    deltas: [N, 4] where each row is [dy, dx, log(dh), log(dw)]
    """
    # Convert to y, x, h, w
    height = boxes[:, :, 2] - boxes[:, :, 0]
    width = boxes[:, :, 3] - boxes[:, :, 1]
    center_y = boxes[:, :, 0] + 0.5 * height
    center_x = boxes[:, :, 1] + 0.5 * width
    # Apply deltas
    center_y += deltas[:, :, 0] * height
    center_x += deltas[:, :, 1] * width
    height *= torch.exp(deltas[:, :, 2])
    width *= torch.exp(deltas[:, :, 3])
    # Convert back to y1, x1, y2, x2
    y1 = center_y - 0.5 * height
    x1 = center_x - 0.5 * width
    y2 = y1 + height
    x2 = x1 + width
    result = [y1, x1, y2, x2]
    return result


def clip_boxes_graph(boxes, window):
    """
    boxes: [N, 4] each row is y1, x1, y2, x2
    window: [4] in the form y1, x1, y2, x2
    """
    # Split corners
    wy1, wx1, wy2, wx2 = window
    y1, x1, y2, x2 = boxes
    # Clip

    y1 = torch.max(torch.min(y1, wy2), wy1)
    x1 = torch.max(torch.min(x1, wx2), wx1)
    y2 = torch.max(torch.min(y2, wy2), wy1)
    x2 = torch.max(torch.min(x2, wx2), wx1)

    clipped = torch.stack([x1, y1, x2, y2], dim=2)
    return clipped

实例分割

实例分割（Instance Segmentation）

定义

特点

应用

常见模型

Mask R-CNN

提出初衷

网络结构

模型代码

相关资源

项目

博客

论文

实例分割

实例分割（Instance Segmentation）

** 定义**

特点

** 应用**

** 常见模型**

Mask R-CNN

提出初衷

网络结构

模型代码

相关资源

项目

博客

论文

Related posts

定义

应用

常见模型