I have always been impressed by applications of gesture recognition, from the serious business of sterile manipulation of radiology images to the less serious exploration of the inner workings of cats. In the same way that voice assistants such as Siri or Alexa are now a regular fixture in the homes of many, over the coming years we will see a large increase in the role of gesture recognition in our daily lives.
To nurture this interest, and as part of Tessella’s Cortex program, I recently spent three months investigating a particular application of gesture recognition: the transcription of fingerspelled words in British Sign Language (BSL). The aim of the project was to collect videos of people signing the 26 letters of the BSL alphabet, and to create a machine learning model to classify these gestures.
Such ideas have been investigated before, but never at this scale with British Sign Language.
British Sign Language is an extremely interesting and rich language, composed of many thousands of gestures for individual words, as well as 26 manual representations of the letters of the alphabet.
Communicating in BSL depends heavily on the way in which signs are performed: facial expressions and the orientation of the signer’s upper body are just two of the mechanisms that can vastly affect the meaning of a phrase. BSL users may use fingerspelling (spelling using the 26 gestures for the letters of the alphabet) to introduce technical words or proper nouns, or to spell out acronyms.
Figure 1 - The letters of the British Sign Language alphabet.
It’s important to state that sign-language is much more than just fingerspelling, and the aim of this project is not to create a device to facilitate communication or perform “translation” of any kind; there is no substitute for learning sign-language, and even a few basic phrases can go a long way!
Before embarking on the project, I tried to learn as much about sign-language and Deaf culture as I could, gathering information from many members of the Deaf and Hard-of-hearing community. In the process, I learned a great deal, and I would like to thank all those who shared information along the way.
Part of the reason that this domain was chosen is that the classification of gestures from the BSL alphabet presents a manageable problem, with several interesting domain-specific points:
- Most gestures are two-handed
While American Sign Language allows for the entirety of the alphabet to be signed with one hand, BSL is different in that 25 of the 26 letters of the alphabet must be signed with two hands. This introduces the problem of occlusion, where one hand hides or partially obscures the other. Depending on the point of view of the observer, it may be difficult to tell which letter the user is signing.
- Some gestures are dynamic
The letters J and H rely on motion to be signed, which introduces an interesting temporal aspect to the problem.
Given that there is meaning in the movement of hands, the problem may best be tackled using a video-analysis approach, instead of relying solely on a static, image-analysis approach.
Figure 2 - The letter J
Unfortunately, there is no comprehensive, publicly available dataset of videos for the BSL alphabet. With this in mind, I set out to crowdsource the data from across Tessella using a web app I created, which allowed users to record their signs with a webcam and upload the videos directly to cloud storage. Most video-analysis approaches require large amounts of data, and with only tens of examples per class during the early stages of the project, I was unsure how to tackle this problem.
Fortuitously, two weeks before the project start date, a team at Google open-sourced a model to extract the coordinates of key landmarks from videos of hands. Using this model allowed me to reduce the dimensionality of the data, and therefore the amount of data that would need to be captured.
Figure 3 - The Google landmark extraction model being used in a game of rock, paper, scissors
With the help of over 40 colleagues and friends, I managed to collect almost 3000 individual videos, each containing a letter of the alphabet. After a lot of data cleaning (some signs were performed incorrectly, and in some videos the signer’s hands were not fully visible), I had just over 100 examples per class.
For every frame in the video, I applied the landmark extraction model, which gave the 2D coordinates of the 21 key landmarks of each hand in the frame. As BSL letters can be performed left-handed or right-handed, I could horizontally flip the sequence of landmarks, thus doubling the size of the dataset.
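Because the extracted landmark coordinates are normalised to the frame, mirroring a sequence amounts to reflecting each x-coordinate about the vertical centre line. Here is a minimal sketch of that augmentation, assuming landmarks are stored as (x, y) pairs in [0, 1] with the array shape shown (the shapes are illustrative, not the exact ones used in the project):

```python
import numpy as np

def flip_sequence(seq: np.ndarray) -> np.ndarray:
    """Mirror a landmark sequence horizontally.

    seq has shape (frames, hands, 21, 2), holding normalised (x, y)
    coordinates in [0, 1]. Reflecting x about the vertical centre
    line turns a right-handed sign into a left-handed one.
    """
    flipped = seq.copy()
    flipped[..., 0] = 1.0 - flipped[..., 0]
    return flipped

# A 30-frame, two-hand sequence of random landmarks:
seq = np.random.rand(30, 2, 21, 2)
mirrored = flip_sequence(seq)
# Flipping twice recovers the original sequence.
assert np.allclose(flip_sequence(mirrored), seq)
```

Both the original and the mirrored sequence keep the same letter label, which is what doubles the dataset for free.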
For every video, this sequence of landmarks was fed into a long short-term memory (LSTM) neural network to predict which letter was being signed. Without going into the details of an LSTM, the important thing to know is that it has a concept of “memory”: it uses contextual information from earlier in a sequence to make inferences about that sequence, instead of relying only on the most recent input.
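To make the “memory” idea concrete, here is a single LSTM step written out in NumPy. This is a toy sketch, not the project’s actual model: the weights are random and the layer sizes are illustrative (84 inputs corresponds to 2 hands × 21 landmarks × 2 coordinates).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    x       : input features for this frame, shape (input_dim,)
    h_prev  : previous hidden state, shape (hidden_dim,)
    c_prev  : previous cell state (the "memory"), shape (hidden_dim,)
    W, U, b : stacked parameters for the input, forget, output
              and candidate gates.
    """
    z = W @ x + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # update the memory
    h = sigmoid(o) * np.tanh(c)                        # expose part of it
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 84, 32
W = rng.normal(size=(4 * hidden_dim, input_dim))
U = rng.normal(size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)

h = c = np.zeros(hidden_dim)
for frame in rng.normal(size=(40, input_dim)):  # one 40-frame clip
    h, c = lstm_step(frame, h, c, W, U, b)
# h now summarises the whole sequence, and could feed a softmax
# layer over the 26 letters.
```

The cell state `c` carries information forward across frames, which is exactly what lets the network use the shape of the hand *before* occlusion when classifying a sign.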
Below is a confusion matrix showing the results of the training phase for the LSTM.
Figure 4 - A confusion matrix displaying the results of the final model.
The model has an accuracy of around 63%, and a strong diagonal indicates fairly good performance on most letters. There are, however, some weak points: the model struggles with the occluded letters L, M and N, and does not perform particularly well on the vowels I and O.
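Both the confusion matrix and the accuracy figure fall straight out of the predicted and true labels. A minimal sketch, using made-up labels for three classes rather than the project’s real evaluation data:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts videos of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy labels for three classes (0='A', 1='B', 2='C'):
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(f"accuracy = {accuracy:.2f}")  # 4 of 6 correct -> 0.67
```

A strong diagonal in the matrix is just this picture at 26×26 scale: most of the mass sits on cells where the prediction matches the true letter.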
The letters L, M and N are difficult to classify in part because the fingers performing them often become occluded by the signer’s palm, making it difficult for the model to determine how many fingers are resting on the palm.
Figure 5 - A still of me signing the letter N.
Above you can see a still of me signing a letter. It is extremely difficult to determine whether the letter being signed is an L, an N or a V (it’s actually an N). But take a look at the video below and observe the key landmarks in the period before the hand meets the palm, for a colleague signing the letter N.
Figure 6 - The key landmarks of two hands signing the letter N.
We can clearly see two fingers being raised, before being occluded by an open palm. It is this earlier information that our LSTM uses to infer that this sign is probably an N or a V, and not an L or an M.
A further complication, which also makes the letters I and O difficult to classify, is that the landmark extraction model performs poorly when one hand is in close proximity to the other. While Google’s landmark extraction model was instrumental to the success of this project given the small dataset, it was also a limitation.
From Letters to Words
The next step in the project was to see if this model could be applied to longer video sequences in order to extract which word or acronym a user was trying to spell. To achieve this, I applied the single-letter classifier over a 40-frame sliding window, advancing the window one frame at a time through the video. I did not know how well this approach would work, in part because the model had not been trained on such sequences, and the 40-frame window will have captured frames belonging to other letters, as well as the transitions between signs.
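The sliding-window pass can be sketched as follows. This is a simplified outline: `classify_letter` stands in for the trained LSTM and here just returns a dummy prediction, and the feature dimension is illustrative.

```python
import numpy as np

WINDOW = 40  # frames, matching the clips the model was trained on

def classify_letter(window):
    """Stand-in for the trained LSTM: returns a (letter, confidence)
    pair for one 40-frame window of landmark features."""
    return "A", float(window.mean())  # dummy prediction

def sliding_predictions(frames):
    """Run the letter classifier at every window position.

    frames has shape (n_frames, features); this yields one
    prediction per position, i.e. n_frames - WINDOW + 1 of them.
    """
    return [
        classify_letter(frames[i : i + WINDOW])
        for i in range(len(frames) - WINDOW + 1)
    ]

frames = np.random.rand(120, 84)   # a 120-frame fingerspelled word
preds = sliding_predictions(frames)
print(len(preds))  # 120 - 40 + 1 = 81 window positions
```

Plotting the per-position confidences against frame number produces exactly the kind of graph shown in the figure below.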
Figure 7 - A plot of the confidence of the model’s prediction against frame number for the video. The type of each marker represents the most probable letter as predicted by the model.
Hopefully you can see which word the signer was trying to spell here! The model predicts most of the letters with fairly good confidence, with a notable drop for the repeated letter L, which is unsurprising given the similarity of the letters L, M, N, R and V.
Graphs like these are not too difficult for humans to interpret, but it is harder for a machine to work out which word is being spelled. In my next post, I’ll discuss how the time-variation of the probability distributions shown above, combined with methods such as beam search, can be used to predict words from a sequence of letter gestures.
As discussed, the problem of occlusion was an issue throughout this project, not least due to the decreased accuracy of the hand tracking model when two hands were in close proximity. A good solution to this may have been to re-train the open-sourced model, or perhaps to explore alternative (albeit data-intensive) methods such as 3D Convolutional Neural Networks. Further interesting avenues of research for sign-language include the analysis of facial expressions, mouth movements, coarticulation and body position.
The process of creating this model was not straightforward and involved touching upon many aspects of data science and software development. A key to the success of this project was leveraging the power of Google’s hand tracking and landmark extraction model – many thanks to that team for open-sourcing the project. This raises a broader point, though. As new models become progressively more sophisticated and complex, it will become increasingly important for data scientists to be able to build on top of these models and avoid re-inventing the wheel. This means that data science practitioners should possess the requisite software skills to build on top of these models. It also means that teams open-sourcing their work should ensure that it is easy to utilise, well documented and reproducible.
During the course of this project, I learned a lot about the domain, and a great deal about the finer points of using machine learning for gesture classification. We now have a model that can classify, with good accuracy, videos of words fingerspelled using British Sign Language. There remains a great deal of work to be done in this area and in the field of video classification more broadly. I will be interested to see how the field develops over the coming years, and to what extent gesture recognition plays a role in our daily lives.