The research in this project was carried out at the Fraunhofer Institute for Integrated Circuits, Erlangen, Germany, during 2016-17.
Automatic Video Tagging Using Deep Learning
Video tagging refers to extracting the key information from a video and summarizing it into a few key tags that represent the overall theme of the video. The aim is not only to understand what is present in each frame of the video, but also to identify the few key topics that best describe what the video is about. This work differs from related work in this domain in the following ways:
- We intend to work at a much higher level by understanding the overall context in the individual frames of a video. The context represents the interaction of the objects in a scene and their overall meaning. Examples of context include romance, fight, violence, nudity, action, etc.
- This work differs from typical event or scene recognition tasks, where each item belongs to a single event or scene.
- It also differs from most object recognition tasks, where the goal is to label everything visible in an image. Doing so would produce thousands of labels per video without answering what the video is really about.
- This work includes, but is not limited to, genre classification of movies. A movie typically has 2-3 genres, which do not reveal other content in the movie (e.g., violence, nudity, sex).
Possible applications of this work include:
- Video classification
- Detecting categories that are not allowed for a given movie classification (e.g., violence in kids' movies)
- Context-based video search
- Efficient archiving
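As an illustration of the item on detecting disallowed categories, the following is a minimal sketch rather than the project's actual implementation. It assumes the tagger has already produced a list of key tags for a video; the `DISALLOWED_TAGS` policy table, the audience label `"kids"`, and the function name `find_disallowed_tags` are hypothetical and exist only for this example.

```python
# Hypothetical policy: tags that should not appear for a given audience label.
# The audience labels and tag sets are assumptions made for illustration.
DISALLOWED_TAGS = {
    "kids": {"violence", "nudity", "sex"},
}


def find_disallowed_tags(key_tags, audience, policy=DISALLOWED_TAGS):
    """Return the generated tags that the policy disallows for this audience.

    Tags are compared case-insensitively and returned in their original order.
    An audience with no policy entry has no disallowed tags.
    """
    blocked = policy.get(audience, set())
    return [tag for tag in key_tags if tag.lower() in blocked]


# Example: key tags generated for a trailer, checked against the "kids" policy.
flagged = find_disallowed_tags(["Romance", "violence", "action", "nudity"], "kids")
print(flagged)  # ['violence', 'nudity']
```

A non-empty result would mark the video for manual review under the assumed policy.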
We have used deep learning to train a Convolutional Neural Network (CNN) to classify a number of scenes in a given video, understanding the context in each video frame and summarizing the results in a few key tags. Our tagging method is summarized in the following diagram:
Our current CNN model has been trained for 53 scenes/classes. However, we are continuously increasing the number of classes as we encounter new scenes in movies. Suggestions for new scenes/classes are welcome.
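The step of summarizing frame-level predictions into a few key tags can be sketched as follows. This is a minimal sketch under stated assumptions, not the method used in this project: it assumes the CNN yields one score per class for each frame, and it simply averages those scores over the video and keeps the highest-ranked classes. The function name `summarize_tags` and the parameters `top_k` and `min_mean_score` are hypothetical.

```python
def summarize_tags(frame_scores, class_names, top_k=5, min_mean_score=0.1):
    """Average per-frame class scores over a video and return the top tags.

    frame_scores: list with one entry per frame; each entry is a list holding
                  one score per class (assumed CNN output, e.g. softmax).
    class_names:  class labels, in the same order as the scores.
    Returns a list of (class_name, mean_score) pairs, best first.
    """
    if not frame_scores:
        return []

    num_classes = len(class_names)
    totals = [0.0] * num_classes
    for scores in frame_scores:
        if len(scores) != num_classes:
            raise ValueError("each frame needs exactly one score per class")
        for i, score in enumerate(scores):
            totals[i] += score

    # Mean score of every class across all frames of the video.
    means = [total / len(frame_scores) for total in totals]

    # Rank classes by mean score, drop weak ones, keep at most top_k.
    ranked = sorted(zip(class_names, means), key=lambda pair: pair[1], reverse=True)
    return [(name, mean) for name, mean in ranked if mean >= min_mean_score][:top_k]


# Example with three frames and three of the scene classes.
tags = summarize_tags(
    [[0.7, 0.1, 0.2], [0.6, 0.1, 0.3], [0.2, 0.7, 0.1]],
    ["action", "romance", "car crash"],
    top_k=2,
)
print([name for name, _ in tags])  # ['action', 'romance']
```

Averaging is only one possible aggregation. Counting how often a class appears among each frame's top predictions would be an equally plausible choice, and the overlap between classes mentioned in this section may call for a multi-label treatment.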
The scene classes are shown in the following table. Class names with the same color overlap with one another. For example, the action class overlaps with violence and sword fight.
Some results on individual movie frames, together with the top tags generated by our algorithm, are shown below.
For results on movies, please see the following movie trailers and their generated tags.
Key tags: Action, car crash, sci-fi, violence, weapon, plane, destruction, war, helicopter, bomb explosion, military, ship
Key tags: Romance, violence, action, car crash, horror, sex, child, nudity, outdoor/nature/forest, hospital, food, Club/bar, college/university, music, crowd
We are constructing our own data set and look forward to your feedback on it. We also welcome suggestions for other scenes/cases commonly found in movies. Any other suggestions will be highly appreciated.