Week 7 (Jul 25 - Jul 29)
This week we decided to implement a new feature: action recognition. The goal of this feature is to provide richer spatialized audio for blind and low-vision users. Instead of hearing a default sound to indicate a strike, a distinct sound could be dedicated to each category of tennis player action. To decide the scope of this feature, we conducted a literature review of existing solutions to this problem. I was tasked with the review and found many papers that rely on richer inputs to recognize action types. For example, many models use HD video streams from four corners of the court. Since our goal is to process existing tennis broadcasts retroactively, which do not include synchronized footage from four cameras and sometimes are not even HD quality, this type of solution was not feasible.
After some research, I found an interesting journal paper that details a workable solution. The paper uses a well-known deep CNN architecture, Inception, for feature extraction, paired with a deep LSTM for action prediction. We decided instead to retrain only the final layer of a pretrained ResNet-50 using labeled video frames for each type of strike (a rough sketch of this transfer-learning setup is below). A main lesson from the paper was that prediction accuracy reaches 88% only when the task is limited to three categories (forehand, backhand, serve/smash), as opposed to sub-categories like "backhand slice" or "forehand volley", so we structured our annotated data accordingly and trained the model.
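To make the transfer-learning idea concrete, here is a minimal sketch of retraining only the final layer of an ImageNet-pretrained ResNet-50 for the three strike categories. This is not our actual training script: the directory layout (`frames/train` with one subfolder per category), the hyperparameters, and the use of a recent torchvision weights API are all illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Standard ImageNet preprocessing expected by the pretrained backbone.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Assumed layout: frames/train/forehand, frames/train/backhand,
# frames/train/serve_smash, each holding annotated broadcast frames.
train_data = datasets.ImageFolder("frames/train", transform=preprocess)
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)

# Load the pretrained backbone and freeze everything except the new head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 3)  # forehand / backhand / serve-smash
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

for epoch in range(5):  # a few epochs suffice when training only one layer
    for frames, labels in train_loader:
        frames, labels = frames.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(frames), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last batch loss {loss.item():.3f}")
```

The appeal of this approach for us is that the frozen backbone already encodes general visual features, so only a small three-way classifier has to be learned from our relatively modest set of annotated broadcast frames.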
Aside from system development work, the lab took a field trip this week to the Whitney Museum! I had a fantastic time and really enjoyed the artwork on display, and of course, the great company!