Yangyang Yu, Ph.D. Candidate in Data Science
Bio
Yangyang Yu is a Ph.D. candidate in data science at Stevens Institute of Technology; she earned her master's degree at Syracuse University. Her research integrates cognitive science, language agent design, Bayesian statistics, and multimodal learning, with a focus on FinTech applications. She serves as a program committee member and workshop organizer at ACM AI in Finance, IJCAI, and COLING, and as a reviewer for NeurIPS, ICLR, AISTATS, ICASSP, CogSci, UIST, and other venues.
Skillset
She is skilled in large language models, cognitive science, language agent design, NLP, data mining, FinTech, and algorithmic trading.
Dissertation Summary
Aligning Multi-modal Object Representations to Human Cognition
People enhance their cognitive abilities by gradually accumulating life experiences, which stem from interactions with the intricate environment around them. This developmental journey is progressively shaped by the variety of events and objects they encounter. Understanding how human perceptions are formulated in response to external stimuli remains a persistently challenging topic in behavioral research, particularly in traditional laboratory environments. These settings frequently face constraints: limited experimental budgets, reliance on low-dimensional, human-interpretable features, and an inability to effectively represent the interrelationships among various types of stimuli. However, recent advances in deep learning and artificial intelligence, coupled with the proliferation of web data and online crowdsourcing platforms, are opening new avenues for better aligning human cognition with a wide range of perceived objects. In particular, these developments present a valuable opportunity to improve the quality and efficiency of research that studies and predicts the similarities and differences in human perceptions of diverse types of objects. This improvement can be realized through two primary approaches: 1) deep neural networks, designed to represent diverse objects and perform various tasks, provide expressive high-dimensional features of those objects; and 2) techniques such as tensor fusion and attention mechanisms enable a thorough treatment of the relevant entities and their interactions. Incorporating these methods has proven more effective at handling the complex and common cognitive formulation processes that involve interactions among multimodal entities. This dissertation demonstrates notable gains in both the quality and the efficiency of aligning the perception of multimodal objects with human cognition by employing these approaches.
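As a concrete illustration of the second approach, the sketch below shows bimodal tensor fusion via an outer product, a standard formulation in the multimodal-learning literature; it is a minimal assumption-level example, not the dissertation's exact architecture.

    # A minimal sketch of tensor fusion for two modalities. Each modality's
    # feature vector is augmented with a constant 1 so the outer product
    # retains unimodal terms alongside the bimodal interaction terms.
    import numpy as np

    def tensor_fusion(visual_feat: np.ndarray, text_feat: np.ndarray) -> np.ndarray:
        """Outer-product fusion of two modality embeddings."""
        v = np.concatenate(([1.0], visual_feat))   # keep unimodal visual terms
        t = np.concatenate(([1.0], text_feat))     # keep unimodal textual terms
        return np.outer(v, t).ravel()              # all pairwise interactions

    # Example: fuse a 4-d visual embedding with a 3-d textual embedding.
    rng = np.random.default_rng(0)
    fused = tensor_fusion(rng.normal(size=4), rng.normal(size=3))
    print(fused.shape)  # (5 * 4,) = (20,)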
In Essay 1, we combine matrix fusion models with diverse attention mechanisms to predict human perceptions substantially more accurately. The approach's efficacy is demonstrated on an experimental task of predicting first-impression scores for synthesized human faces. We apply several matrix factorization methods to integrate entities such as faces and traits, and high-dimensional features from deep neural networks are incorporated as multi-way side information into these fusion models. Furthermore, multiple attention mechanisms are implemented and compared to better represent and capture the cross-modality interactions between entities. The resulting fusion framework exhibits robustness, interpretability, and high precision, making it well suited to aligning varied machine-generated features of stimulus objects with human cognitive perceptions. The flexibility of its model architecture makes it particularly suitable for human behavioral prediction based on multiple interconnected entities.
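A hedged sketch of this kind of fusion framework appears below, assuming a gated-attention combination of latent face factors with projected deep-network side features; the layer sizes and attention form are illustrative choices, not the essay's exact specification.

    # Matrix factorization with deep side information and a simple
    # attention gate over the two face representations (illustrative only).
    import torch
    import torch.nn as nn

    class FusionMF(nn.Module):
        def __init__(self, n_faces, n_traits, side_dim, k=16):
            super().__init__()
            self.face_emb = nn.Embedding(n_faces, k)    # latent face factors
            self.trait_emb = nn.Embedding(n_traits, k)  # latent trait factors
            self.side_proj = nn.Linear(side_dim, k)     # project DNN face features
            self.attn = nn.Linear(2 * k, 1)             # gate side vs. latent view

        def forward(self, face_idx, trait_idx, side_feat):
            f_latent = self.face_emb(face_idx)
            f_side = self.side_proj(side_feat)
            t = self.trait_emb(trait_idx)
            # Attention gate over the two face views, conditioned on the trait.
            a = torch.sigmoid(self.attn(torch.cat([f_latent + f_side, t], dim=-1)))
            f = a * f_latent + (1 - a) * f_side
            return (f * t).sum(-1)                      # predicted impression score

    # Example: score 8 random (face, trait) pairs with 128-d DNN side features.
    model = FusionMF(n_faces=100, n_traits=10, side_dim=128)
    faces, traits = torch.randint(0, 100, (8,)), torch.randint(0, 10, (8,))
    print(model(faces, traits, torch.randn(8, 128)).shape)  # torch.Size([8])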
In Essay 2, we propose active learning strategies that select informative samples from a voluminous pool of unlabeled data to form a compact training dataset, improving training efficiency for large-scale behavioral studies within a framework of Bayesian matrix factorization with deep side information. This addresses a common issue in behavioral research: information sparsity, where time and budget constraints lead to inefficient training datasets. In our approach, we use Bayesian Probabilistic Matrix Factorization (BPMF) to model and predict human perceptual outcomes. The active learning component is key: it targets the most informative stimulus-attribute-score triplets for sampling, taking into account the uncertainty in the model's posterior parameters. This method significantly improves the efficiency of perceptual predictions across all stimulus-attribute combinations in a quality-assured way, utilizing both the Markov chain Monte Carlo (MCMC) simulated Bayesian posterior distribution and the adaptively learned training data. Empirical studies demonstrate its superior performance compared to passive learning, an advantage that is critical in real-world crowdsourced cognitive studies with limited resources. Our methodology also applies broadly to other human behavioral domains, such as online recommendation for social media and e-commerce or behavioral prediction systems: with minimal data collected from a small segment of existing customers or users, it efficiently generates high-quality behavioral forecasts for new users or objects.
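The acquisition step at the heart of this strategy can be sketched as follows, assuming MCMC posterior draws of the factor matrices are already available: the stimulus-attribute pair whose posterior predictive score varies most across draws is queried next. The BPMF sampler itself is omitted, and the variance criterion is one common choice rather than the essay's exact acquisition rule.

    # Uncertainty-driven acquisition over MCMC posterior samples (a sketch).
    import numpy as np

    def select_most_informative(U_samples, V_samples, unlabeled_pairs):
        """U_samples: (S, n_stimuli, k) MCMC draws of stimulus factors.
        V_samples: (S, n_attrs, k) MCMC draws of attribute factors.
        unlabeled_pairs: list of (stimulus, attribute) index pairs."""
        best_pair, best_var = None, -np.inf
        for i, j in unlabeled_pairs:
            # Posterior predictive draws of the score for pair (i, j).
            preds = np.einsum("sk,sk->s", U_samples[:, i], V_samples[:, j])
            if preds.var() > best_var:
                best_pair, best_var = (i, j), preds.var()
        return best_pair  # query this pair's human rating next

    # Example with S=200 posterior samples, 50 stimuli, 12 attributes, k=8.
    rng = np.random.default_rng(1)
    U = rng.normal(size=(200, 50, 8))
    V = rng.normal(size=(200, 12, 8))
    pool = [(i, j) for i in range(50) for j in range(12)]
    print(select_most_informative(U, V, pool))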
In Essay 3, we leverage the advanced content-generation capabilities and comprehensive pre-existing knowledge base of multimodal generative AI to better align object perceptions with human cognition. We term this method Bi-channel Multi-modal Object Matching with Large Language Model-powered Open-ended Concept Generation (Bi-OCGen). By integrating a series of extra concepts derived from interactions between humans and the multimodal generative model, this approach notably enhances the accuracy of object-matching tasks. Previous object-matching methods, which use textual features extracted from predefined labels or existing object descriptions, perform commendably but do not account for the impact of public beliefs that evolve over time. They are also limited in measuring correlation and similarity among objects based on key concepts that, while absent from the training textual dataset, are commonly recognized and understood through everyday human experience. Bi-OCGen tackles these limitations. It generates joint feature representations from object image-description pairs and extracts task-specific open-ended label sets. It then uses LLMs such as Flan-T5 to distill a set of concepts from objects, aiding in establishing their intercorrelations and distinguishing similarities. This concept set can enhance labeling for precise and complementary object matching in scenarios such as e-commerce. The Bi-OCGen model thus offers an open-ended, precision-enhanced framework for object matching and recommendation that incorporates multimodally fused object features and effectively leverages the latent commonsense present in human cognition.
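A toy sketch of the concept-augmentation idea follows. Here generate_concepts is a hand-written stand-in for the LLM call (Essay 3 uses models such as Flan-T5), the objects and concept lists are hypothetical, and TF-IDF replaces the essay's joint multimodal features purely to keep the example self-contained.

    # LLM-generated open-ended concepts are appended to each object's
    # description before embedding, so matching can exploit commonsense
    # attributes that never appear in the raw text.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def generate_concepts(description: str) -> list[str]:
        # Placeholder for an LLM call, e.g. prompting Flan-T5 with
        # "List concepts a shopper would associate with: {description}".
        lookup = {
            "leather office chair": ["ergonomic", "workspace", "seating"],
            "standing desk": ["ergonomic", "workspace", "height-adjustable"],
            "ceramic coffee mug": ["drinkware", "kitchen", "gift"],
        }
        return lookup.get(description, [])

    objects = ["leather office chair", "standing desk", "ceramic coffee mug"]
    docs = [f"{o} {' '.join(generate_concepts(o))}" for o in objects]
    sims = cosine_similarity(TfidfVectorizer().fit_transform(docs))
    # The shared open-ended concepts ("ergonomic", "workspace") now link the
    # chair and the desk, which share no token in their raw descriptions.
    print(sims.round(2))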
Academic Advisor
Jordan Suchow