Virtual Dance Reality Stage using Semantic Segmentation.

Dec 7, 2021

This feature lets you share a stage virtually with another user. It relies on a simple idea: image background removal using the DeepLab architecture for semantic segmentation, a state-of-the-art deep learning model from Google. The goal of this project, built just for fun, is to share a stage virtually; one possible application is the Instagram Reels remix option.

In this article, we will look at how semantic segmentation is used for image background removal.

Table of Contents

Introduction
Semantic Segmentation
Conclusion

Introduction

Separating foreground from background involves identifying the foreground (in our case, a human) and removing the background, usually replacing it with plain white. This can be done manually in Photoshop, many apps offer it as an option, and deep learning algorithms are now the most common approach. Still, fully automated image background removal remains a challenging task, and existing products do not yet give fully satisfactory results.
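To make the idea concrete, here is a minimal sketch of the last step only: given a foreground mask, keep the person and paint everything else white. The function name `replace_background`, the tiny 2 x 2 image, and the hand-written mask are assumptions for illustration; in practice the mask would come from a segmentation model and the image would be an array handled by a library such as NumPy or OpenCV.

```python
WHITE = (255, 255, 255)

def replace_background(image, mask, fill=WHITE):
    """Keep pixels where the mask is 1 and paint the rest with `fill`.

    `image` is a list of rows of (R, G, B) tuples and `mask` is a list of
    rows of 0/1 values with the same shape (hypothetical toy format).
    """
    return [
        [pixel if keep else fill for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]

# Toy 2 x 2 image: the mask marks the diagonal pixels as foreground.
image = [
    [(10, 20, 30), (40, 50, 60)],
    [(70, 80, 90), (100, 110, 120)],
]
mask = [
    [1, 0],
    [0, 1],
]

result = replace_background(image, mask)
for row in result:
    print(row)
```

Foreground pixels pass through unchanged, and every background pixel becomes white.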

Semantic Segmentation

Semantic segmentation is a well-known computer vision and image processing task. It is a classification task in the sense that every pixel of the image is assigned to a class (here, either foreground or background). The model builds an understanding of the image: if an object is present, it indicates both what the object is and where it is, at the pixel level.
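As a rough illustration of "classifying each pixel", suppose a model has already produced one score per class for every pixel. The segmentation mask is then just the index of the highest score at each position. The class names, the `segment` helper, and the score values below are made up for the example and are not taken from any particular model.

```python
CLASSES = ("background", "person")  # assumed two-class setup

def segment(scores):
    """Return the index of the best-scoring class for every pixel.

    `scores[y][x]` is a list with one score per class (hypothetical layout).
    """
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]

# Toy 2 x 2 "image": each pixel holds [background_score, person_score].
scores = [
    [[2.0, 0.1], [0.3, 1.5]],
    [[0.2, 0.9], [1.1, 1.0]],
]

mask = segment(scores)
print(mask)  # [[0, 1], [1, 0]]
print([[CLASSES[c] for c in row] for row in mask])
```

A real network outputs the same kind of per-pixel scores, only for far more pixels (and sometimes more classes).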

The three basic components of Semantic Segmentation:

Deep CNN: A stack of convolutional layers, each of which slides a kernel (N x N) over the image and produces a feature map. In the deeper layers, activations are high around the object to classify (for example along edges and contours), but the maps are coarse because of repeated pooling. In short, the features from the deep layers can be used to classify pixels as foreground or background.
Atrous Convolution and Atrous Spatial Pyramid Pooling (ASPP): Atrous convolution, also known as dilated convolution, introduces a dilation rate that spaces out the kernel taps so the kernel covers a larger field of view without adding parameters. ASPP extends this by applying atrous convolutions with several dilation rates in parallel to capture the same object at different scales.
Decoder Neural Network: The up-sampling stage that converts the low-resolution feature maps back to a high-resolution output. Bilinear up-sampling is commonly used to smooth and refine the foreground edges.
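The components above can be sketched in a few lines of plain Python. This is a simplified toy, not DeepLab's implementation: the atrous convolution is 1-D and unpadded, the hypothetical `aspp_1d` only runs several rates side by side (a real ASPP pads and concatenates the outputs as feature channels), and `bilinear_upsample` assumes the corner pixels of input and output are aligned.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """1-D convolution (cross-correlation, as in CNNs) with taps `rate` apart."""
    span = (len(kernel) - 1) * rate + 1  # field of view of the dilated kernel
    return [
        sum(w * signal[i + k * rate] for k, w in enumerate(kernel))
        for i in range(len(signal) - span + 1)
    ]

def aspp_1d(signal, kernel, rates=(1, 2, 3)):
    """Apply the same kernel at several dilation rates (the ASPP idea)."""
    return {rate: atrous_conv1d(signal, kernel, rate) for rate in rates}

def bilinear_upsample(grid, out_h, out_w):
    """Resize a 2-D grid of numbers with bilinear interpolation."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for y in range(out_h):
        # Map the output row to a fractional input row (corners aligned).
        fy = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for x in range(out_w):
            fx = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            # Interpolate along x on both rows, then along y.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out

signal = [1, 2, 3, 4, 5, 6, 7]
edge_kernel = [1, 0, -1]  # three weights regardless of the dilation rate
pyramid = aspp_1d(signal, edge_kernel)
print(pyramid)  # {1: [-2, -2, -2, -2, -2], 2: [-4, -4, -4], 3: [-6]}

coarse = [[0.0, 1.0], [0.0, 1.0]]
fine = bilinear_upsample(coarse, 3, 3)
print(fine)  # each row is [0.0, 0.5, 1.0]
```

The kernel always has three weights, yet at rate 3 it spans seven input samples, which is the larger field of view mentioned above. The up-sampled grid shows how bilinear interpolation fills in smooth intermediate values between coarse predictions.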

Several architectures can be used as the base model for training, such as FCN, MobileNet, and U-Net.

Conclusion

This was just a basic overview of semantic segmentation, and there is a series of articles worth reading on the topic. I mainly followed these two: Background removal with deep learning and Modelling a Background Clean-up Deep Learning Model.
Semantic segmentation has many applications; I used it for image background removal to build the Virtual Dance Reality Stage, a fun feature that came with a lot of learning.
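For readers curious how a shared stage could be assembled once the segmentation is done, here is one possible sketch. It is an assumption about the general approach, not the project's actual code: the guest dancer's frame is blended onto the stage frame using a soft mask with values between 0 and 1, so edge pixels mix the two images. The names `blend_pixel` and `share_stage` and the one-row frames are hypothetical.

```python
def blend_pixel(guest, stage, alpha):
    """Mix two (R, G, B) pixels; alpha=1 keeps the guest, alpha=0 the stage."""
    return tuple(round(alpha * g + (1 - alpha) * s) for g, s in zip(guest, stage))

def share_stage(stage_frame, guest_frame, guest_alpha):
    """Composite the segmented guest onto the stage frame, pixel by pixel."""
    return [
        [blend_pixel(g, s, a) for s, g, a in zip(stage_row, guest_row, alpha_row)]
        for stage_row, guest_row, alpha_row in zip(stage_frame, guest_frame, guest_alpha)
    ]

# One-row toy frames: a black stage and a guest pixel colour repeated 3 times.
stage_frame = [[(0, 0, 0), (0, 0, 0), (0, 0, 0)]]
guest_frame = [[(200, 100, 50), (200, 100, 50), (200, 100, 50)]]
guest_alpha = [[1.0, 0.5, 0.0]]  # body, soft edge, background

combined = share_stage(stage_frame, guest_frame, guest_alpha)
print(combined)  # [[(200, 100, 50), (100, 50, 25), (0, 0, 0)]]
```

Running this on every video frame, with the mask predicted by the segmentation model, would place both dancers on the same stage.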

Here is a glimpse of the Virtual Dance Reality Stage:

Thank you for reading!

If you like my work, please share your ❤️ by supporting me on Medium, subscribing to my YouTube channel, and following my projects on GitHub.

For the source code, please email me at devashi882@gmail.com

References

Background removal with deep learning.
Modelling a Background Clean-up Deep Learning Model.
Understanding the DeepLab Model.