(Publisher of Peer Reviewed Open Access Journals)

International Journal of Advanced Technology and Engineering Exploration (IJATEE)

ISSN (Print):2394-5443    ISSN (Online):2394-7454
Volume-8 Issue-84 November-2021
Paper Title : A Mask-RCNN based object detection and captioning framework for industrial videos
Author Name : Manasi Namjoshi and Khushboo Khurana
Abstract :

Analyzing surveillance videos is a tiresome and burdensome activity for a human. Automating the analysis of surveillance videos, specifically industrial videos, can be very useful for productivity analysis, assessing the availability of raw materials and finished goods, fault detection, report generation, etc. To accomplish this task, we propose a video captioning and reporting method. In video captioning, summaries that describe the video are generated in understandable language; these descriptions are produced by understanding the events and objects present in the video. The method presented in this paper constructs a captioned video summary comprising frames and their descriptions. First, frames are extracted from the video by uniform sampling, which reduces the task of video captioning to image captioning. Then, a Mask Region-based Convolutional Neural Network (Mask-RCNN) is used to detect objects such as raw materials, products, and humans in the sampled video frames. Next, a template-based sentence generation method is applied to obtain the image captions. Finally, a report is generated outlining the products present and production details, such as the duration for which a product is present, the number of products detected, and the presence of an operator at the workstation. This framework can greatly help with bookkeeping, day-wise work analysis, tracking employees in a labor-intensive industry or factory, and remote monitoring, thereby reducing the human effort of video analysis. On the object classes of the created dataset, we obtained an average confidence score of 0.8975 and an average accuracy of 95.62%. Moreover, as the captions are template-based, the generated sentences are grammatically and semantically correct.
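To illustrate the pipeline described above, the following is a minimal sketch of the template-based caption step: given object detections for one uniformly sampled frame, a fixed sentence template is filled in. The class names ("product", "human"), the template wording, and the function name are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of template-based caption generation for one frame.
# Input: a list of (class_name, confidence) pairs produced by a detector
# such as Mask-RCNN. The class vocabulary and sentence template are
# illustrative assumptions, not the authors' actual templates.
from collections import Counter


def caption_frame(detections):
    """Fill a fixed sentence template from per-frame object detections."""
    counts = Counter(name for name, _conf in detections)
    if not counts:
        return "No objects are detected in this frame."
    # Describe each detected class with its count, in a stable order.
    parts = [f"{n} {name}(s)" for name, n in sorted(counts.items())]
    # Report operator presence, as in the paper's workstation reports.
    operator = (
        "An operator is present"
        if counts.get("human")
        else "No operator is present"
    )
    return (
        f"The frame contains {', '.join(parts)}. "
        f"{operator} at the workstation."
    )


detections = [("product", 0.95), ("product", 0.91), ("human", 0.88)]
print(caption_frame(detections))
```

Because every sentence is instantiated from a hand-written template, the output is grammatical by construction, which is the property the abstract highlights.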

Keywords : Object detection, Mask-RCNN, Video captioning, Video analysis, Image captioning.
Cite this article : Namjoshi M, Khurana K. A Mask-RCNN based object detection and captioning framework for industrial videos. International Journal of Advanced Technology and Engineering Exploration. 2021; 8(84):1466-1478. DOI:10.19101/IJATEE.2021.874394.