Using Detectron2 for Simultaneous Human Body Keypoint Detection and coco.names Object Detection

The detectron2 library for Python is Facebook's latest venture in open-sourcing computer vision technology, and it is already showing a lot of potential in fields such as human-robot interaction, intruder detection systems and crowd control.

Part of what makes this possible is Detectron2's ability to simultaneously detect parts of a human body (with left and right distinguished) and other objects according to the coco.names dataset. There is not much documentation out there on how to extract the body part detection data, and many Python IDEs' IntelliSense is not immensely helpful here, so the aim of this article is to show how this can be done in the most straightforward manner.

This article will be separated into two parts: coco.names object detection and human body keypoints detection.

Selecting the right model

For the purposes of this article, you may want to set the Detectron2 model to the following implementation:

COCO-Keypoints/keypoint_rcnn_X_101_32x8d_FPN_3x.yaml

To do this, find the lines shown below and edit your code accordingly.

cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_X_101_32x8d_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_X_101_32x8d_FPN_3x.yaml")

This will allow the Detectron2 detector to detect both objects and the keypoints of the human body.
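For context, here is a minimal sketch of the full setup that produces the outputs object used in the snippets that follow. The input image path and the score threshold are assumptions, not requirements:

import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_X_101_32x8d_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_X_101_32x8d_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # assumed detection threshold

predictor = DefaultPredictor(cfg)
im = cv2.imread("input.jpg")  # hypothetical input image
outputs = predictor(im)       # the "outputs" object used throughout this article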

coco.names Object Detection

The coco.names dataset covers labels for 80 different object classes, all of which can be detected by Detectron2's detector. For each detection, the class name, the XY coordinates and the width and height are output. This data can be extracted using the Python snippet shown below:

import time

# coco_names is assumed to hold the 80 class labels, e.g. read from a coco.names file
detections = []
for pred_idx in range(len(outputs["instances"].pred_boxes)):
    # Move the box tensor to the CPU; Detectron2 stores boxes as (x1, y1, x2, y2)
    box = outputs["instances"].pred_boxes[pred_idx].tensor.cpu().numpy()[0]
    class_idx = int(outputs["instances"].pred_classes[pred_idx])
    detection = {
        'date_time': time.ctime(),
        'class': coco_names[class_idx].replace('\n', ''),
        'x': str(box[0]),
        'y': str(box[1]),
        'w': str(box[2] - box[0]),  # width  = x2 - x1
        'h': str(box[3] - box[1])   # height = y2 - y1
    }
    detections.append(detection)

Note the tensor.cpu() call, which converts the tensor from a GPU-based tensor to a CPU-based one. Note also that Detectron2 stores bounding boxes in (x1, y1, x2, y2) format, which is why the width and height are obtained by subtraction. As a personal preference, I have chosen to extract the object detection data into a dictionary for better code readability and handling.

The Insight Behind Casting Floating Point Values to Strings

The motive behind casting the floating point values of the XY coordinates to strings is easy embedding in HTTP, Socket and WebSocket responses. This way, the Detectron2 algorithm can be hosted on a powerful server and accessed by many different embedded systems with limited image processing capacity (such as a Raspberry Pi, Arduino etc.) in a big-little configuration. Most high-level languages, including C++ and Python/MicroPython, offer easy casting from string to floating point, making this a practical method for preserving accuracy during data transfer.
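On the server side, a sketch of this (with the transport framework left out) can be as simple as serialising the dictionaries, since every value is already a string:

import json

# Every field is already a string, so the detection dicts serialise directly
payload = json.dumps(detections)
# payload can now be embedded in an HTTP, Socket or WebSocket response body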

For example, if you were to run a client Python program on a Raspberry Pi, you can decode an entire JSON response and easily extract all the data by converting it into native Python dictionaries, as sketched below.
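A minimal sketch of such a client follows; the server address and endpoint are hypothetical:

import requests

# Hypothetical endpoint where the Detectron2 server publishes its detections
response = requests.get("http://192.168.1.10:8000/detections")
for detection in response.json():
    # The string-encoded values cast back to floating point without loss
    x, y = float(detection['x']), float(detection['y'])
    w, h = float(detection['w']), float(detection['h'])
    print(detection['class'], x, y, w, h)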

Human Body Keypoints Detection

The output from the human body keypoints detection provides the body part name, its respective XY coordinates and the confidence of the detection. The following snippet shows how this data can be extracted:

import time

# COCO keypoint names in the order used by Detectron2's keypoint head
keypoint_names = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle"
]

for det_keypoints in outputs["instances"].pred_keypoints:
    # Move the (17, 3) keypoint tensor to the CPU once, rather than per keypoint
    keypoints = det_keypoints.cpu().numpy()
    person_keypoint = {"date_time": time.ctime()}
    for idx, name in enumerate(keypoint_names):
        person_keypoint[name] = {
            'x': str(keypoints[idx][0]),
            'y': str(keypoints[idx][1]),
            'conf': str(keypoints[idx][2])
        }

Note that in this dataset, the left and right body parts are given separately where applicable. The XY coordinates refer to the centre of the respective body part. By thresholding the conf value, low-confidence keypoints and false human detections can be filtered out, as sketched below.
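As a sketch, a simple confidence threshold can filter a person_keypoint dictionary down to its reliable parts. The cut-off value here is an assumption and should be tuned for your model and scene:

CONF_THRESHOLD = 0.05  # assumed cut-off; tune for your model and scene

# Keep only the body parts whose detection confidence clears the threshold
visible_parts = {
    name: part for name, part in person_keypoint.items()
    if name != "date_time" and float(part['conf']) >= CONF_THRESHOLD
}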

Conclusion

Detectron2 Keypoints Detection is a pretty impressive library when it comes to computer vision, and we have only barely scratched the surface of its capabilities. Whilst this can be a great springboard for many amazing robotics, crowd control, intrusion detection or human-robot interaction projects, much work needs to be done on increasing the accuracy and robustness of such frameworks. The real advantage of Detectron2 and Python is how simply they can be deployed in many different form factors: from RESTful APIs to WebSocket-based bots, or even running natively on the most digitally endowed of robotic systems.
