Assistant Professor Jie Shen Brings Another CAREER Award to Stevens
Jie Shen, assistant professor in the Department of Computer Science, receives a prestigious 2023 NSF grant of $590,661 to make machine learning cost-efficient and fair
Defining an immense field
The NSF CAREER awards are given once a year by the National Science Foundation to standout junior faculty at U.S. institutions of higher learning, providing five years of funding for research and education. In 2022, of the 500 NSF CAREER grants awarded across the United States, four were won by Stevens faculty members.
Jie Shen, assistant professor in the Department of Computer Science, is starting 2023 strong by joining their ranks. His proposal, “Robustness, Active Learning, Sparsity, and Fairness in Classification,” won an NSF CAREER grant of $590,661.
Shen’s research focuses on the theory of machine learning. Machine learning as a field focuses on the design of programs that learn the rules of categorization from the data they categorize. As the problems humanity asks computers to solve grow more complex, so does the task of designing these programs. Machine learning theory tries to understand learning as a computational process: at its most basic, how to teach a program to find patterns. The fundamental question is how to get the most accurate information out of a data set with the fewest inputs, and where the weak spots in the system are when this is done incorrectly. Machine learning theory aims both to guide better machine learning and to understand how that learning occurs.
“At the very beginning [of machine learning], people tried to make some very simplified assumptions to explain some success for machine learning algorithms,” explained Shen. “As the time goes and we have more knowledge and background, we develop new theories, weaken some assumptions we had before, and make our theory more practical.”
Under the hood with machine learning algorithms
Currently, one way that algorithms are taught to sort data is by providing the computer with a massive set of pre-labeled data, which the algorithm analyzes for similarities and can then use to classify a new, unlabeled image. For example, imagine a data set containing 100 labeled pictures of either lemons or bananas. The computer would run through the images and their associated labels and see the pattern that a curved yellow shape is a banana and a round yellow shape is a lemon. It could apply this knowledge to identify either a lemon or a banana in a new image. However, this model has its limitations.
“In that model, it is assumed we have sufficient data and all the data is correct,” said Shen. “We assume we have a lot of correctly labeled data so the algorithm can make a correct prediction. But in modern machine learning applications, such assumptions do not hold. Sometimes we have incorrect labels.” Going back to the previous example, this would be if some of the lemons were labeled banana, or vice versa. Shen continued, “In healthcare, we don’t have sufficient data.” For example, maybe there was only one lemon in the whole set and the rest were bananas, so the algorithm could be biased toward identifying bananas. “These are limitations that we are facing in modern machine learning.”
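The effect of incorrect labels can be seen in a toy version of the lemon and banana example. The sketch below is only an illustration, not Shen’s method: it assumes each image has already been reduced to two made-up features (curvature and elongation, both between 0 and 1) and uses a simple nearest-centroid classifier, where every name and number is hypothetical.

```python
from statistics import mean


def train_centroids(examples):
    """Average the feature vectors of each label (nearest-centroid training)."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=squared_distance)


# Hypothetical (curvature, elongation) features for three bananas and three lemons.
bananas = [(0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]
lemons = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15)]

# Clean data: every fruit carries its true label.
clean = [(f, "banana") for f in bananas] + [(f, "lemon") for f in lemons]
# Noisy data: two of the three lemons are mislabeled as bananas.
noisy = [(f, "banana") for f in bananas + lemons[:2]] + [(lemons[2], "lemon")]

clean_model = train_centroids(clean)
noisy_model = train_centroids(noisy)

new_fruit = (0.4, 0.4)  # a roundish fruit, closer to the lemons
print(predict(clean_model, new_fruit))  # lemon
print(predict(noisy_model, new_fruit))  # banana
```

With the mislabeled lemons mixed in, the “banana” average drifts toward the lemons, and the same new fruit is classified differently.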
“We need new theory,” said Shen. “How can we design efficiently to meet these goals?”
Four goals in one
Shen’s goal is really four smaller projects: to design new algorithms that “can tolerate adversarial corruptions in the data, mitigate data annotation costs, circumvent the curse of high dimensionality, and fortify models with fairness guarantees.”
Currently, Shen is focusing on the corruption portion of his goals. He has created algorithms that increase the robustness of a dataset, recognizing and offsetting any irrelevant data so that it does not affect the result. “A fraction of the data can be arbitrary, can be malicious. I can present my algorithm, show the attacker, and allow him to corrupt the data, but the attacker cannot change the output of the algorithm.” A simple example of this at play could be an algorithm presenting the online rating of a beloved restaurant. Imagine a five-star restaurant that a competitor rates one star in a malicious attempt to drag down its rating. Shen’s algorithm can use machine learning to account for this outlier and prevent it from affecting the restaurant’s rating as a whole. “The majority [of the ratings] are five stars, so we know it is a good restaurant,” he said. “We can calculate a certain growth structure if we have some prior knowledge of a clean dataset. We [then] compare the statistics of the clean data to the training data. If the statistics look different, there must be a nontrivial amount of corruption and we can remove it.”
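The restaurant example can be made concrete with two textbook robust statistics, the median and the trimmed mean. This is a minimal sketch with made-up ratings, and it stands in for the general idea only; it is not the algorithm Shen describes. The `trimmed_mean` helper and the 20 percent trim level are assumptions chosen for illustration.

```python
from statistics import mean, median


def trimmed_mean(values, trim_fraction):
    """Drop the lowest and highest `trim_fraction` of the values, then average."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number dropped from each end
    return mean(ordered[k:len(ordered) - k])


# Hypothetical data: 20 genuine ratings plus 4 malicious one-star ratings.
honest = [5] * 18 + [4] * 2
attack = [1] * 4
ratings = honest + attack

print(mean(honest))                # 4.9   (the rating without any attack)
print(mean(ratings))               # 4.25  (the plain average is pulled down)
print(median(ratings))             # 5.0   (the majority still decides)
print(trimmed_mean(ratings, 0.2))  # 4.875 (close to the honest average)
```

The plain average moves whenever an attacker adds ratings, while the median and the trimmed mean stay near the honest value as long as the corrupted fraction is small.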
Another portion of Shen’s work focuses on cutting down the cost, in both money and time, of working with large datasets. Shen’s example focused on analyzing X-ray data to predict illness. A dataset for this application would have two parts: the X-ray image (the data) and the corresponding label of that image. For an image of a lung, for example, the label would say whether it shows a cancerous lung, a lung with pneumonia, a COVID-19-affected lung or a healthy lung. “Getting the data is easy, but the label is expensive.” The labels have to be applied by experts, doctors who need to be compensated for their time and might have to bring in other experts for validation. Shen’s proposal formalizes an adaptive strategy to overcome this issue. “We want to devise some efficient ways to save the labeling cost, and figure out how many labels we really need.”
The strategy works through successive iterations. “For some data points, even the initial model can do a very good job,” such as an initial separation of the lungs into very recognizable examples of their illness or health. “For the rest of them, the model is uncertain. We just need to highlight these uncertain points.” The experts then don’t have to look at every single sample in the data set, only the ones that confuse the algorithm. These edge-case images can be correctly labeled by the experts and then re-entered into the system to inform the model going forward. The process can be repeated as many times as necessary, producing a much more effective algorithm in less time and at lower cost. “Roughly speaking, with some assumptions, reduction can be from 1 million [labels] to 10 [labels]. Quantitatively speaking, from M to log of M.”
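A standard textbook case shows where a drop from M labels to roughly log M can come from. The sketch below assumes the simplest possible setting, which is not Shen’s actual algorithm: the samples can be sorted along one score, the true labels flip from negative to positive at a single unknown boundary, and the expert never makes a mistake. The function `learn_threshold`, the `expert` stand-in and the boundary value are all hypothetical.

```python
def learn_threshold(points, ask_label):
    """Locate the first positive point in sorted `points` by binary search.

    Only the point the current model is least sure about (the middle of the
    remaining uncertain range) is sent to the expert each round.
    Returns (index_of_first_positive, number_of_label_queries).
    """
    lo, hi = 0, len(points)  # the first positive index lies somewhere in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if ask_label(points[mid]):  # positive: the boundary is at mid or earlier
            hi = mid
        else:                       # negative: the boundary is after mid
            lo = mid + 1
    return lo, queries


M = 1_000_000
points = range(M)        # M unlabeled samples, already sorted by score
true_boundary = 654_321  # hypothetical; unknown to the learner


def expert(x):
    """Stand-in for the doctor: returns the true label of one sample."""
    return x >= true_boundary


index, queries = learn_threshold(points, expert)
print(index == true_boundary)  # True
print(queries <= 20)           # True: about log2(M) labels instead of M
```

In this idealized setting the expert labels at most 20 of the one million samples. The exact count depends on the base of the logarithm and on the assumptions, and noisy labels or more complex models need more careful strategies.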
In the future, Shen will be looking at the third and fourth parts of his grant application. The third tenet centers on the “curse” of high dimensionality and how it relates to sparsity in data sets. This concerns what happens as the analyzed datasets (such as images) become more complex. Data with lots of features is known as “high-dimensional,” and the higher the dimension, the more room there is for error in analysis because the algorithm could be looking at the “wrong” part of the data. To return to the lemon and banana example earlier, an image of a bunch of bananas displayed on a yellow sweater on golden sand would be a much harder image for the algorithm to parse than a cropped image of a single banana on a black background. The algorithm needs to know where in the dataset the relevant portion is.
Lastly, Shen plans to explore ways to make sure that machine learning algorithms are fair. One issue machine learning systems have run into is that they repeat the biases present in the data they are trained on. For example, an algorithm meant to match potential job-seekers to careers based on their resumes could recommend more women with medical backgrounds for nursing jobs and more men for doctor positions because it was trained on data that showed the current uneven gender breakdown for those roles in society and learned to see that pattern as “correct.”
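One common way to put a number on this kind of bias is the demographic parity gap: the difference between groups in how often the algorithm gives a particular recommendation. The sketch below uses made-up recommendation records and is only one of several fairness measures; the article does not say which notion of fairness Shen’s project adopts, and the function names here are hypothetical.

```python
def selection_rates(records, positive="doctor"):
    """Fraction of each group that received the `positive` recommendation."""
    counts = {}
    for group, recommendation in records:
        total, hits = counts.get(group, (0, 0))
        counts[group] = (total + 1, hits + (recommendation == positive))
    return {group: hits / total for group, (total, hits) in counts.items()}


def demographic_parity_gap(records, positive="doctor"):
    """Largest difference in selection rate between any two groups."""
    rates = selection_rates(records, positive)
    return max(rates.values()) - min(rates.values())


# Hypothetical output of a resume-matching model for 20 equally qualified people.
records = (
    [("woman", "nurse")] * 7 + [("woman", "doctor")] * 3
    + [("man", "doctor")] * 8 + [("man", "nurse")] * 2
)

print(selection_rates(records))                   # {'woman': 0.3, 'man': 0.8}
print(round(demographic_parity_gap(records), 2))  # 0.5
```

A gap of zero would mean both groups are recommended for doctor positions at the same rate; here the model does so for 80 percent of the men but only 30 percent of the women.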
Shen has plans to tackle all of these issues within the next five years of his grant.
Inspiring the next generation
Winning the CAREER grant was “exciting” for Shen. “I’m doing the research that I am really passionate about but might not get external funding as easily.” Shen hopes to add in more mentoring as well. His grant application included the outline of the development of “a new undergraduate program in artificial intelligence and machine learning that has the potential to inspire a transformation of nationwide STEM (science, technology, engineering, and mathematics) education.” While the full implementation of this program might be beyond the scope of the grant, “Our first step that is being practiced is to progressively improve the study plan of the undergraduates to align with artificial intelligence and machine learning.”
Shen’s work, along with programs such as Stevens’ Machine Learning Masters Programs, is ensuring that as machine learning becomes more and more ubiquitous, Stevens students are prepared to create algorithms that make cost-effective decisions that are accurate and fair.