May 13, 2009
Streaming Techniques for Statistical Modeling
Speaker: Dr. Yihua Wu, Google, Inc.
Time: Wednesday 05/13/2009, 3-4 PM
Location: Babbio 110
Biography:
Dr. Yihua Wu received her PhD in Computer Science from Rutgers, the State University of New Jersey, in 2007 and has been working at Google Inc. in New York since then. Her research interests are streaming techniques for statistical modeling of massive data, with applications to databases and networking. During her PhD, she extensively studied i) parametric modeling of skewed data sets; ii) graph modeling of individuals' communication patterns; iii) sequential change detection on data streams. Dr. Wu spent years of her PhD collaborating with researchers from AT&T Shannon Labs, Telcordia Applied Research, and Narus Inc. to develop space- and time-efficient streaming algorithms on real-world data sets, and holds two patents on this work. At Google, she designs and develops features and models to improve search quality.
Abstract: Streaming is an important paradigm for handling high-speed data sets that are too large to fit in main memory. Prior work on data streams has shown how to estimate simple statistical parameters, such as histograms, heavy hitters, and frequency moments. This talk focuses on a number of more sophisticated statistical analyses that are performed in near real-time, using limited resources.
I will first present how to model stream data parametrically; in particular, we fit hierarchical (binomial multifractal) and non-hierarchical (Pareto) power-law models on a data stream. This yields algorithms that are fast, space-efficient, and provide accuracy guarantees. I will also describe fast methods for performing online model validation at streaming speeds. I will then turn to detecting changes in models on data with unknown distributions. I adapt the sequential probability ratio test, a sound statistical method, to the online streaming case without an independence assumption. The resulting algorithm works seamlessly without the window limitations inherent in prior work, and is highly effective at detecting changes quickly. Furthermore, I formulate the local change detection problem, which has not been addressed earlier, and extend our streaming solution to it.
As concrete applications of our techniques, we complement our analytic and algorithmic results with experiments on network traffic data to demonstrate the practicality of our methods at line speeds, and the potential power of streaming techniques for statistical modeling in data mining.
For more information please contact:
Yingying Chen
Assistant Professor & NIS Graduate Program Director
Burchard Room 210
Phone: 201.216.8066
Fax: 201.216.8246
yingying.chen@stevens.edu