
0x550 Task

1. Classification

1.1. Localization

The localization task is to identify the bounding box of a single given object.

2. Detection

Unlike the localization task, the detection task must find multiple objects in one image.

Task (Object Detection) The task of object detection is as follows:

  • input: an RGB image
  • output: a set of detected objects; for each object, a category label (from a fixed, known set of categories) and a bounding box (x, y, width, height)
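The output format above can be sketched as a small record type. This is only an illustration: the `Detection` class, the `CATEGORIES` tuple, and the `make_detection` helper are hypothetical names, and the category list is made up.

```python
from dataclasses import dataclass

# Hypothetical fixed, known set of categories (made up for illustration).
CATEGORIES = ("person", "dog", "car")


@dataclass(frozen=True)
class Detection:
    label: str     # category label, must come from CATEGORIES
    x: float       # bounding box (x, y, width, height)
    y: float
    width: float
    height: float


def make_detection(label, x, y, width, height):
    """Build a Detection, rejecting labels outside the known category set."""
    if label not in CATEGORIES:
        raise ValueError(f"unknown category: {label!r}")
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return Detection(label, x, y, width, height)


# The output for one image is a variable-length collection of detections.
detections = [
    make_detection("person", 10, 20, 50, 120),
    make_detection("dog", 80, 90, 40, 30),
]
```

An image with no objects would simply yield an empty list, which is what makes the output size variable.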

The challenges of the object detection task are:

  • multiple outputs: need to output a variable number of objects per image
  • multiple output types: need to predict both what (the category) and where (the bounding box)
  • large images: detection often needs a higher resolution, around 800x600 (compared with 224x224 for classification)

A simple approach is a sliding window: apply a CNN to many different crops of the image and classify each crop. However, the number of possible boxes is far too large to evaluate, so this is not feasible.
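To get a feel for "too many boxes", the sketch below counts every axis-aligned box with integer pixel coordinates in an image. It assumes all positions and all sizes are candidates, which is the worst case; the function name `count_boxes` is made up for this example.

```python
def count_boxes(height, width):
    """Count all axis-aligned boxes with integer coordinates in an image.

    A box of size h x w fits at (height - h + 1) * (width - w + 1) positions.
    Summing over every h and w gives H(H+1)/2 * W(W+1)/2.
    """
    return (height * (height + 1) // 2) * (width * (width + 1) // 2)


# An 800x600 image (width 800, height 600) has about 58 billion candidate boxes.
total = count_boxes(600, 800)
print(total)  # 57768120000
```

Running a CNN once per crop at this scale is out of reach, which is why the exhaustive version of the sliding window is not used in practice.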

Metric (IoU: Intersection over Union) Measures the overlap between the ground-truth box and the predicted box. An IoU above 0.5 is "decent", above 0.7 is "pretty good", and above 0.9 is "almost perfect".

\[\frac{\text{Area of Intersection}}{\text{Area of Union}}\]
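A minimal sketch of this ratio for two boxes in the (x, y, width, height) format used above. It assumes (x, y) is the corner with the smallest coordinates; the function name `iou` is chosen for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x, y, width, height).

    Assumes (x, y) is the box corner with the smallest coordinates.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extent along each axis; zero when the boxes do not overlap.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    # Union = sum of both areas minus the part counted twice.
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 2, 2)
prediction = (1, 1, 2, 2)
print(iou(ground_truth, prediction))  # 1/7, about 0.143
```

Identical boxes give an IoU of 1, and boxes that do not overlap give 0.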

Metric (mAP: Mean Average Precision) Compute the AP (area under the precision-recall curve) for each category, then take the mean over categories. For COCO mAP, repeat this at several IoU thresholds (0.50 to 0.95 in steps of 0.05) and average the results.
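The AP and mAP computation can be sketched as follows. This is an illustration under assumptions, not the exact COCO evaluation code: it assumes each detection has already been marked as a true or false positive by matching it to ground truth at some IoU threshold, and it uses one common convention (making precision non-increasing before integrating). The function names are made up for this example.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one category.

    Assumes detections were already matched to ground truth at a fixed
    IoU threshold, so is_true_positive[i] is known for each detection.
    """
    if num_ground_truth == 0 or not scores:
        return 0.0
    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Replace each precision by the best precision at any higher recall,
    # which makes the curve non-increasing.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Integrate precision over recall as a sum of rectangles.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_category):
    """Mean of the per-category APs; values are average_precision arguments."""
    aps = [average_precision(*args) for args in per_category.values()]
    return sum(aps) / len(aps)


# Made-up example: (scores, true-positive flags, number of ground-truth boxes).
results = {
    "cat": ([0.9, 0.8, 0.7], [True, False, True], 2),
    "dog": ([0.5], [True], 1),
}
print(mean_average_precision(results))
```

For a COCO-style score, the same computation would be repeated with the true-positive flags recomputed at each IoU threshold, and the resulting values averaged.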

3. Segmentation

3.1. Semantic Segmentation

3.2. Instance Segmentation

Segment each object instance (e.g. every person) within an image, so that separate instances of the same category get separate masks.

4. Scene Understanding

4.1. Depth Estimation

5. Generation

5.1. Text-to-Image