Week 8 (Aug 1 - Aug 5)
This week we continued implementing action recognition. My task was to take the trained model and integrate it into our architecture so that it correctly predicts the action type for each player. Our secondary goal was to investigate whether action recognition would be robust enough to help us catch misdetections among frames incorrectly assumed to be “hit-frames” (frames in which a player is striking the ball). We implemented it as follows: the hit frame is fed into the player detection model, which produces a bounding box around both the top player and the bottom player. Since we already know which player is striking the ball in that frame, we crop that player out using their bounding box and feed the cropped image to the action recognition model. The model then predicts the action type and saves it to a JSON file, to be used later as part of the spatialized audio.
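To make the flow concrete, here is a minimal sketch of the subsystem. The `detect_players` and `classify_action` wrappers, the bounding-box format, and the JSON layout are illustrative assumptions, not our actual module interfaces.

```python
import json

# Hypothetical wrappers around our trained models (names are illustrative):
# detect_players(frame) -> {"top": (x1, y1, x2, y2), "bottom": (x1, y1, x2, y2)}
# classify_action(crop) -> predicted action label for the cropped player
from player_detection import detect_players
from action_recognition import classify_action

def predict_hit_action(frame, striking_player, out_path):
    """Crop the striking player out of a hit frame, classify their action,
    and save the result for the spatialized-audio stage."""
    boxes = detect_players(frame)
    box = boxes.get(striking_player)  # "top" or "bottom"
    if box is None:
        raise ValueError(f"{striking_player} player not detected in hit frame")
    x1, y1, x2, y2 = box
    crop = frame[y1:y2, x1:x2]  # frame assumed to be a NumPy image array (H, W, C)
    action = classify_action(crop)
    with open(out_path, "w") as f:
        json.dump({"player": striking_player, "action": action}, f)
    return action
```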
After coding this subsystem, I encountered a few issues. Primarily, the player detection model sometimes fails to detect the player at all, which causes the downstream steps to fail. To handle this, I implemented a buffer that also feeds a configurable number of frames surrounding the hit frame into the model, so that if the player is misdetected in the hit frame itself (the frame may be blurry), the system can fall back to detecting them in the neighboring frames. This proved successful and made action recognition work more consistently.

We found that action recognition worked especially well for the bottom player, who is closer to the camera and therefore appears much larger and at better image quality. To combat the weaker performance on the top player, we implemented a voting system over approximately ten frames surrounding the actual hit frame (sketched below). Each of these frames is run through the action recognition model, the predicted probabilities for each category are summed across frames, and the category with the highest aggregated probability becomes the final prediction. This significantly improved action recognition for both the top player and the bottom player.
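Here is a rough sketch of the voting logic combined with the detection fallback. As before, the wrapper names and interfaces are assumptions for illustration; `classify_action_probs` is assumed to return a probability vector over the action categories.

```python
import numpy as np

# Hypothetical wrappers (illustrative names, as in the sketch above):
# detect_players(frame) -> dict of bounding boxes, possibly missing a player
# classify_action_probs(crop) -> probability vector over the action categories
from player_detection import detect_players
from action_recognition import classify_action_probs

def vote_action(frames, hit_idx, player, window=5):
    """Sum action probabilities over the frames surrounding a hit frame and
    return the index of the highest-scoring category."""
    totals = None
    for i in range(max(0, hit_idx - window), min(len(frames), hit_idx + window + 1)):
        box = detect_players(frames[i]).get(player)  # "top" or "bottom"
        if box is None:
            continue  # detection failed on this frame; its neighbors stand in
        x1, y1, x2, y2 = box
        probs = classify_action_probs(frames[i][y1:y2, x1:x2])
        totals = probs if totals is None else totals + probs
    if totals is None:
        raise RuntimeError(f"{player} player not detected in any frame of the window")
    return int(np.argmax(totals))
```

With `window=5`, the vote spans up to eleven frames centered on the hit frame, which matches the roughly ten-frame window we settled on.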
We also chose to postpone one optimization for later development: reusing the bounding boxes that player detection already produces in another module of the architecture, rather than re-running the detection model here. We made this call in order to prioritize creating demo videos for our upcoming user studies.
Outside of the system development work, I also attended the SURE program’s final poster presentation. Two of this summer’s CEAL lab members are part of the SURE program, so we attended their presentations and also heard from other participants in the program. It’s always exciting to see students undertake interesting research work and develop creative solutions to pressing problems!