Video Analysis and Natural Language Description Generation System

2020 International Conference on Electronics and Sustainable Communication Systems (ICESC), 2020

Abstract

The project is centred on scene understanding from video input, removing the need to monitor the feed manually and continuously. The video is split into raw frames, and a combined 2D-3D CNN extracts a feature vector from them. The You Only Look Once version 3 (YOLOv3) algorithm identifies the objects present in each frame, and the count of each object type is stored. The pose of the people in the frames is estimated to identify their movements, from which the actions being performed are recognized. The words produced by these three stages form the input to an LSTM cell, which selects words according to their probabilities and confidence scores and composes a natural language sentence for the user. Finally, the generated output can be modified or replaced entirely by the user through a human-in-the-loop step, if required; the model retrains itself on this feedback and generates better results the next time. The central model is capable of identifying and discriminating between the types of elements required for this project. The project was built as a continuation of a previous system that performs object identification on live video input from drones. When poor network conditions make sending video data difficult, the data is sent in textual form instead.
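The abstract outlines a multi-stage pipeline: frame extraction, YOLOv3 object detection and counting, pose-based action recognition, and LSTM-driven sentence generation. The sketch below is a minimal, hypothetical illustration of how such stages could be wired together in Python with OpenCV and PyTorch; `detect_objects`, `recognize_actions`, `SentenceDecoder`, and the vocabulary are placeholder names invented for this example and do not reflect the authors' actual implementation.

```python
import cv2
import torch
import torch.nn as nn

# --- Hypothetical stage outputs --------------------------------------------
# The paper's YOLOv3 detector and pose-based action recognizer are not
# reproduced here; these stand-ins return the "words" each stage would
# contribute (object labels with counts, and action labels).
def detect_objects(frame):
    # Placeholder: a real implementation would run YOLOv3 on the frame.
    return {"person": 2, "car": 1}

def recognize_actions(frame):
    # Placeholder: a real implementation would estimate poses and
    # classify movements across frames.
    return ["walking"]

# --- Word-level LSTM decoder (assumed architecture) -------------------------
class SentenceDecoder(nn.Module):
    """Scores candidate output words step by step with an LSTM, in the
    spirit of the abstract. Vocabulary and sizes are illustrative only."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)      # (batch, seq, embed_dim)
        h, _ = self.lstm(x)         # (batch, seq, hidden_dim)
        return self.out(h)          # per-step word logits

# --- Minimal end-to-end pass over a video -----------------------------------
def describe_video(path, decoder, vocab):
    cap = cv2.VideoCapture(path)    # extract raw frames
    words = set()
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        for label, count in detect_objects(frame).items():
            words.add(f"{count} {label}")
        words.update(recognize_actions(frame))
    cap.release()

    # Map the collected stage "words" to token ids (unknown words skipped).
    ids = [vocab[w] for w in sorted(words) if w in vocab]
    tokens = torch.tensor([ids]) if ids else torch.zeros(1, 1, dtype=torch.long)
    # In the real system the highest-probability words would be assembled
    # into a grammatical sentence; here we only return the raw scores.
    return decoder(tokens)
```

The design mirrors the described data flow: detection and action labels act as candidate words, and the LSTM assigns probabilities from which a sentence would be selected. A human-in-the-loop correction step could then be added by retraining the decoder on the user-edited sentences.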
