Research & Innovation

Dean’s Lecture Series Reveals That the Future of Artificial Intelligence Has Arrived

Microsoft’s Dr. Xuedong Huang unpacks HoloLens 2, a mixed reality device that can translate anyone’s speech into 60 languages, in their own voice, in real time

Imagine a world where medicine is made more precise with the aid of holograms, allowing doctors to digitally “see” into a patient’s body during a procedure. A world where you can give a keynote address in perfect Japanese, in your own voice, anywhere, at any time—even if you don’t speak Japanese.

This may sound like the stuff of some faraway future, but at the 2019 SES Dean’s Lecture Series, hosted by Dean Jean Zu on October 17, Dr. Xuedong Huang assured an audience of more than 200 faculty, students, and staff that “All of these technologies exist today. The future is here.”

Sponsored by the Schaefer School of Engineering and Science at Stevens Institute of Technology, Huang’s enthralling lecture—“Breaking Human Interaction Barriers—AI, HoloLens and Beyond”—revealed a future enriched by artificial intelligence.

Huang, a Microsoft Technical Fellow in Microsoft Cloud and AI, founded the company’s speech technology group in 1993. The group brought speech recognition to the mass market with the 1995 introduction of the company’s Speech Application Programming Interface (SAPI), which enabled speech recognition and speech synthesis on a personal computer. His ambitions have only soared since then.

Huang began his lecture with a chicken-or-egg proposition: We know that language sets us apart from other animals, but are we smarter because we have language, or do we have language because we are so smart? Huang left the origins of language to anthropologists and instead looked to the future of language and artificial intelligence with a presentation of Microsoft’s HoloLens 2, a mixed reality device that aims to bridge the gap between the physical and digital worlds.

“Language is the most important crown jewel that we have accomplished,” said Huang. “So, [it follows that] language will play an important part in AI.”

Huang takes inspiration from a simple granodiorite stone, engraved in 196 BC, that held the key to deciphering Egyptian hieroglyphs: the famed Rosetta Stone. While that artifact unlocked ancient Egyptian history, Huang now aims to break down language barriers on a much larger scale with the unveiling of HoloLens 2, a technology that can translate an individual’s speech into 60 languages, in their own voice, in real time.

“We’re using AI to achieve something that has never been imagined before,” said Huang.

The device is worn over the head, covering the eyes and ears, and lets the user hold face-to-face conversations with multiple people speaking different languages, even amid distracting background noise. The device also handles cross-talk, attributing transcription to individual speakers, and can be used anywhere from a noisy conference room to a virtual meeting. It can even pair transcription with the display of holograms, which could represent anything from a complex mechanical device studied in the classroom to a stage set to an avatar.

Machine learning powered the development of HoloLens 2, beginning with an index of the entire web as crawled by Microsoft’s search engine, Bing. In total, three trillion words were processed.

“Computers have read three trillion words; I don’t think a human has done that,” remarked Huang. On the other hand, he added, “It’s just amazing to see how humans can understand speech with much less data.”

A graduate of Hunan and Tsinghua Universities in China, Huang received his Ph.D. in electrical engineering from the University of Edinburgh, where he experienced the profound impact of language barriers first-hand.

“I was one of those students suffering, because we had fantastic Scottish professors with strong Scottish accents, and I had no clue,” Huang recalled.

Meanwhile, what he referred to as his “Scottish-Chinese accent” was being transcribed onto his PowerPoint presentation with remarkable accuracy. When he asked the audience to choose another language, someone called out, “Chinese!” He selected Mandarin from a drop-down menu, and new characters appeared on the screen as he dictated.

At once, impressed whispers arose from the audience. Those who could not read Chinese asked their neighbors about the translation and were told it was nearly spot-on, a remarkable feat given that Mandarin shares no syntactic or grammatical roots with English. Here was tangible AI, in action, in the auditorium. If he did not have the audience’s undivided attention before, he had it now.
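The recognize-and-translate pipeline Huang demonstrated is exposed publicly through Microsoft’s Azure speech services. As a rough illustration of the idea, not of the demo’s internals, here is a minimal sketch using the Azure Speech SDK for Python (the azure-cognitiveservices-speech package); the subscription key and region are placeholders.

```python
# Minimal sketch: recognize English speech from the microphone and
# translate it into Mandarin with Azure's Speech SDK.
# (pip install azure-cognitiveservices-speech)
import azure.cognitiveservices.speech as speechsdk

# Placeholders: substitute a real Azure Speech key and service region.
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription="YOUR_SPEECH_KEY",
    region="YOUR_REGION",
)
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("zh-Hans")  # Simplified Chinese

# With no audio config given, the recognizer listens on the default mic.
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config)
result = recognizer.recognize_once()  # capture a single utterance

if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print("Recognized:", result.text)                   # English transcript
    print("Mandarin:", result.translations["zh-Hans"])  # translated text
```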

Huang contrasted perceptive intelligence, a capability AI has achieved, with reasoning and meaning-making, tasks that, thus far, only humans can accomplish. “Our ability to understand speech, and to capture what is said, and the computer vision, is all the perceptive level,” he explained. “Most of the progress we have made is in the perceptive intelligence.”

Huang’s lecture captivated an audience that already has a strong investment in AI. More than a decade ago, Stevens designated six strategic foundational pillars that encompass what many leading researchers believe to be the critical future of technology and innovation. One of these strategic pillars is artificial intelligence, machine learning, and cybersecurity.

The interdisciplinary Stevens Institute for Artificial Intelligence (SIAI) was founded in 2018 to bring together more than 50 faculty members from all academic units at Stevens (engineering, business, systems, and arts and music) researching a variety of applications in AI and machine learning. SIAI hopes to amplify the impact of its research and analysis through collaborations with industry, government, foundations, and other academic partners.

As for Huang, he foresees a future in which AI will specifically transform the business world.

“One of the most important functions in the corporate world,” he said, “is meetings. With Microsoft language transcription services, this is being transformed. Just by using a conference microphone and video, we can do transcription. Unlimited vocabulary is happening, because we are able to see all the words ever published on the web.”

He played a video demonstration of Microsoft’s Azure speech system, which, using microphones and cameras, can transcribe the conversations of up to eight people in a conference room, assigning words to each speaker by voice recognition, and closing language barriers by providing real-time translations.
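Multi-speaker transcription of this kind is also available publicly through the same Azure Speech SDK. The sketch below shows its event-driven pattern for attributing words to individual speakers; again, the key and region are placeholders, and this is an illustrative outline rather than the system shown in the video.

```python
# Minimal sketch: transcribe a live conversation and label each final
# result with a speaker identity via Azure's ConversationTranscriber.
import time
import azure.cognitiveservices.speech as speechsdk

# Placeholders: substitute a real Azure Speech key and service region.
speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_SPEECH_KEY", region="YOUR_REGION")
speech_config.speech_recognition_language = "en-US"
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)

transcriber = speechsdk.transcription.ConversationTranscriber(
    speech_config=speech_config, audio_config=audio_config)

def on_transcribed(evt):
    # Each final result carries the recognized text plus a speaker label,
    # which is how words get attributed to individual participants.
    print(f"[{evt.result.speaker_id}] {evt.result.text}")

transcriber.transcribed.connect(on_transcribed)
transcriber.start_transcribing_async().get()
time.sleep(30)  # transcribe for half a minute, then stop
transcriber.stop_transcribing_async().get()
```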

Far from a Hollywood dystopia where machines take over, Huang’s vision is of machines helping professionals to become more effective and more creative in their jobs.

“We are here to help every person and every organization to achieve more,” said Huang. “We want our partners to be more successful; we’re not trying to replace their jobs. Microsoft has no ability to understand what is happening in email or in meetings. We still need humans for that understanding. Translational AI is real; understanding AI is human.”

“AI is going to change everything in this society,” said Huang. “Most of the progress we have made is in the perceptive intelligence. The ability to apply knowledge to reason, to understand context, is way out. I have no idea if this is going to happen in my lifetime. But the perceptive intelligence is real, is here.”