A jaw clench. An eye turn. A vocal crack. A sigh.
These subtle, sometimes barely perceptible microexpressions and vocalizations can communicate as much about a person's mental and emotional states as the words they choose to express.
Many brands rely on artificial intelligence to identify consumer sentiment and emotion as part of their market research and product development efforts. But current commercially available machine learning models are limited, often relying on overly simplistic techniques or single, isolated features to make their determinations.
Stevens Institute of Technology computer engineering seniors Mourad Deihim, Daniel Gural, and Jocelyn Ragukonis have devised an artificial neural network to close this gap.
Advised by electrical and computer engineering professor Yu-Dong Yao, the team improves on existing AI emotion recognition efforts by using deep learning to identify multiple concurrent emotions or sentiments gleaned from both audio and video sources simultaneously. This multimodal approach provides a more precise representation of the complexity of human expression.
Potential applications for this technology beyond market research include use in healthcare, security, and human resources.
Thinking beyond out-of-the-box solutions
"A lot of algorithms that currently exist come from some kind of foundation in old algorithm and computer vision techniques," said Gural. "They're basically drawing certain points on your face to understand how your face moves and reacts and how that can relay emotion."
Such software, he noted, tends to analyze only a single frame of video at a time, and for only a single emotion or sentiment. Other related software analyzes only words, text, or audio features in isolation.
But words, sound, motion, and emotion do not exist in a vacuum. A seemingly positive head nod, for example, conveys a very different sentiment if a speaker's voice cracks or adopts a particular tone. Emotions like happiness can coexist with states of fear, sadness, or surprise.
Focusing on such narrow types of input and output fails to recognize the many inseparable and dynamic modes of expression — such as hand gestures, body language, and the changing sound and dynamics of one's voice — or the complicated, overlapping emotions that together define the gestalt of human communication.
To paint a more accurate picture of human emotion, the team set out to develop an AI that could analyze multiple modes of expression for a combination of emotions simultaneously.
Recognizing the complexities of human emotion
The team began their project with an exhaustive investigation of emotion recognition research. But finding a data set large enough and of sufficient quality to use to build their algorithm models proved challenging.
The students ultimately landed on the Carnegie Mellon University Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) data set, which consists of more than 23,500 short video segments gleaned from YouTube. Each video has been analyzed (by humans) and classified for any or all of six sentiments or emotions: happiness, sadness, anger, disgust, surprise, and fear. The segments are also annotated for the strength of each emotion expressed.
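To make the annotation scheme concrete, the sketch below shows one plausible way a CMU-MOSEI-style label could be represented in Python: each clip carries an intensity score for each of the six emotions, and any score above zero counts as "present." The exact field names and scale here are assumptions for illustration, not the data set's actual file format.

```python
# Hypothetical representation of a CMU-MOSEI-style annotation:
# one intensity score per emotion, where 0 means the emotion is absent.
EMOTIONS = ["happiness", "sadness", "anger", "disgust", "surprise", "fear"]

def present_emotions(annotation, threshold=0.0):
    """Return the emotions a human annotator marked as present in a clip."""
    return [e for e in EMOTIONS if annotation.get(e, 0.0) > threshold]

# A clip can carry several overlapping emotions at once.
clip = {"happiness": 2.0, "surprise": 0.7, "sadness": 0.0}
print(present_emotions(clip))  # ['happiness', 'surprise']
```

Because each emotion is scored independently, a single clip can legitimately be both "happy" and "surprised" — exactly the overlap the students' multi-label approach is built around.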
To develop their neural network, the team chose to approach the problem in three parts, Deihim said, initially analyzing the visuals and audio from the video segments separately, then fusing the resulting models together to analyze both visuals and audio simultaneously.
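The article does not describe how the fusion stage works internally; a common baseline for this kind of third stage is late fusion, where each unimodal model reduces its input to a fixed-length embedding and a joint classifier sees their concatenation. The sketch below illustrates that general idea only — it is an assumption, not the team's actual architecture.

```python
# Minimal late-fusion sketch (an assumed baseline, not the team's design):
# each unimodal model produces a fixed-length embedding, and the fused
# classifier operates on their concatenation.
def late_fuse(audio_embedding, video_embedding):
    """Concatenate unimodal embeddings into one joint feature vector."""
    return list(audio_embedding) + list(video_embedding)

fused = late_fuse([0.2, 0.5], [0.9, 0.1, 0.4])
print(len(fused))  # 5
```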
This last, he noted, "is probably going to be the hardest part of it since it's not as widely researched as using unimodal models. But we're making progress."
A preprocessing algorithm broke each audio segment into 74 different sections, which were then analyzed for features such as pitch, sound, volume, and pronunciation. The video, meanwhile, was analyzed not only frame by frame but also by how each frame related to those before and after it. In contrast to traditional methods, which consider a single frame at a time, the students' more dynamic, multi-frame approach allows the software to more accurately capture which emotions or sentiments are being expressed.
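The multi-frame idea can be sketched as a sliding window: rather than handing the model one frame, you hand it each frame together with its neighbors, so motion between frames is visible to the classifier. This is a minimal illustration of the general technique, not the team's actual preprocessing code.

```python
def frame_windows(frames, context=1):
    """Pair each frame with its neighbors so a model sees motion across
    frames, not isolated snapshots. Clip boundaries are edge-padded."""
    padded = [frames[0]] * context + list(frames) + [frames[-1]] * context
    return [padded[i:i + 2 * context + 1] for i in range(len(frames))]

windows = frame_windows(["f0", "f1", "f2", "f3"])
print(windows[1])  # ['f0', 'f1', 'f2']
```

Each window keeps the previous and next frames alongside the current one, which is what lets a nod or a flinch register as movement rather than a single static pose.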
Composed of a series of algorithms, the students' neural network learns, over time, how to identify patterns of speech, movement, and sound from input audio and video segments and classify them into individual and overlapping emotions. Using the data set as a guide, it learns from its successes through trial and error.
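Classifying into individual and overlapping emotions typically means multi-label output: one independent sigmoid per emotion rather than a softmax that forces a single winner. The sketch below shows that standard pattern under assumed logits; the thresholds and values are illustrative, not the students' trained model.

```python
import math

EMOTIONS = ["happiness", "sadness", "anger", "disgust", "surprise", "fear"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classify(logits, threshold=0.5):
    """Independent sigmoids let several emotions be 'on' at once,
    unlike a softmax, which would force exactly one winner."""
    scores = {e: sigmoid(z) for e, z in zip(EMOTIONS, logits)}
    return {e: s for e, s in scores.items() if s >= threshold}

# Hypothetical raw network outputs for one fused audio+video clip:
print(sorted(classify([2.1, -1.3, -0.4, -2.0, 0.9, -0.7])))
# ['happiness', 'surprise']
```

Keeping the per-emotion scores (rather than just the labels) is also what supports the intensity quantification the team describes later.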
"We're letting the model brute-force it through learning and — like playing the game Memory — seeing how much it can get right," Gural said. "You do that millions of times to the model, and it gets really good at it, to the point where we really can't explain what it's looking for, just how it works."
Using emotion recognition in a variety of industries
The students' focus on emotion recognition for market research in particular was inspired by a conversation Ragukonis had with her aunt, a market researcher at Advanced Micro Devices.
Ragukonis's aunt relayed how she had to spend hours sifting through hundreds of hours of consumer interview videos for relevant emotional information and insights in order to prepare presentations — time that would be better spent gathering new consumer data instead.
"The idea with this project was to be able analyze interview videos to give people trying to compile their presentations the important information they need so they don't have to spend as much time doing that," Ragukonis said.
By automatically identifying customer emotion and sentiment, Gural added, AI could enable market researchers to quantify emotional responses at a scale beyond what a single human could observe or interpret.
The team consulted Ragukonis's aunt and other marketing professionals during the project to help hone their approach and use cases to best fit the marketers' needs. But the emotion-recognition AI they've developed has potential applications far beyond market research and product development.
In a hospital or nursing home, for example, the neural network could be adapted to monitor and assess a patient's condition and emotional state, allowing caregivers to better calibrate their diagnoses and approach to care based on a more comprehensive understanding of the person than direct observation alone can provide.
"A lot of times patients will act differently with a doctor versus with a nurse, for instance," Gural said.
The neural network is not limited to emotion recognition, either. Gural noted the algorithms could be adapted, for example, to analyze security camera video for suspicious people or activities, or perhaps even to identify people showing symptoms of COVID-19.
Figuring out how to format the CMU-MOSEI data in a way their models could use posed an early challenge for the team. Quantifying the end results, Deihim said, also proved challenging.
"We're not just saying this video shows happiness or this video shows sadness. It's supposed to quantify all the emotions and display them in a way where you can tell what is being shown in each video," Deihim said. "A lot of what's going on now is trying to figure out how to display what truly is accurate here."
While running parts of the model can give "snapshot looks" at how the model is functioning, Gural said, "we are waiting to pull the trigger on a full model run-through on AWS or one of our systems until we are sure that the model won't have any problems or is fully optimized."
An opportunity for success
Deihim, Gural, and Ragukonis are all enrolled in Stevens' Accelerated Master's Program, completing this senior design project while also taking undergraduate- and graduate-level courses simultaneously.
Although Deihim noted the difficulty of finding time to accomplish all these things at once, their hard work has paid off. While not yet fully optimized as of this writing, the students' AI shows a very high degree of accuracy not only for identifying individual emotions through both audio and video, but for identifying combinations of simultaneous emotions as well.
"It doesn't seem to drop that much accuracy when you are looking for one emotion versus all emotions, which is good," Gural said.
Yao, who also advised Gural and Ragukonis during a summer study abroad on AI at Nanjing University two years ago, praised the students' efforts.
"The team members worked together and collaborated very well," Yao said. "They have developed a good background in AI and deep learning, and their design work and experience in classification using deep learning will be beneficial for future work in various application domains."
All three students described the experience as a positive one that allowed them the chance to investigate a variety of different aspects of deep learning and AI.
"This is probably one of the biggest projects I've worked on," said Ragukonis. "I went in wanting to learn a lot, and I definitely have. I still have a lot to learn. When you feel like you're starting to understand something in a space, there's always something new to learn. But it is exciting."
Gural also credits Stevens for granting the team the support and space to work on a project usually reserved for students further along in their academic careers.
"I'm very thankful to Stevens for allowing us this opportunity. A lot of places would shut down undergrads trying to do a project like this for being too complex or too sophisticated. But they let three students like us sign up and do it. So I'm really appreciative of that."