A Concept for the Future of Multimodal Voice User Interfaces

This bachelor thesis deals with the future and development of Voice User Interfaces (VUI). It shows how Voice User Interfaces will become more relevant in the future and how they will function. "A Concept for the Future of Multimodal Voice User Interfaces and Voice Assistants" introduces an interaction system that explains how Voice User Interfaces and Voice Assistants (VA) can be integrated into everyday work on the computer and the Internet in the future. The project introduces the user to Voice User Interfaces outside the "Home" context of use, where they currently exist in the form of Alexa and Google Home. Unlike existing voice user interfaces, our system operates in a new area and links programs and services across the board.

“By 2020, the average person will have more conversations with bots than with their spouse.”

Why Voice User Interfaces?

Language is one of the most natural and human forms of communication, and it conveys much more information quickly and easily. We are sure that voice user interfaces will play a very dominant role in the future, but today we associate voice only with assistants like Amazon Alexa, Google Assistant or Apple Siri. These voice assistants can currently solve only less complex tasks and thus remain "toys" in the living room. Although the first primitive speech computers date back almost 60 years, significant progress in deep learning, an area of artificial intelligence and natural language processing, has only been made in recent years. With our concept we therefore focused on the future, assuming that by then artificial intelligence and language processing will be even more advanced than what we see in current voice assistants.

How does a Voice User Interface become a paradigm?

The first apps came with the smartphone, and they changed our access to information significantly. Previously, people worked mainly on a desktop computer, where every Internet search took place. With the rise of mobile applications, the way we handle devices changed as well. Today it is possible to edit images, write emails and do research on a smartphone while on the go.

So what can and must a voice user interface look like for speech interaction to become comparably functional and applicable? We see voice assistants not merely as a pure voice user interface in the form of a smart speaker; above all, we consider them multimodal and firmly integrated into devices. By multimodal we mean the combination of different forms of interaction, such as mouse and keyboard on a computer.

But how must such a voice user interface and voice assistant be structured so that it can function properly?

Parameters for a Voice User Interface

Based on the results of our research, we were able to define various parameters and requirements for a good voice user interface. We are sure that a good VUI is only possible if it is firmly integrated into the system, and can thus access all devices, and if it has reached a certain level of artificial intelligence.


The voice user interface must be firmly linked to the system so that it can interact with other applications or interfaces, such as the Graphical User Interface. To support the user, it needs access to all areas of the computer or network.

Artificial Intelligence

The AI is responsible for the autonomy and independent action of the Voice User Interface. It also allows the Voice User Interface to recognize user behavior patterns and thus adapt to the user. With the help of a strong artificial intelligence, the Voice User Interface becomes more than just a "toy". There must also be a basis for natural language processing, so that the Voice User Interface can respond to more than just keywords.

For a voice assistant to become a truly helpful form of interaction, our concept differs from current concepts in the following two points:

One Assistant

The assistant must be customizable and accessible on every device. The user works with one assistant across several application areas. This also means that the assistant must be a single system. A distinction can be made, however, between the user's personal assistant in everyday life and a voice assistant belonging to another person or company.

Contextual Knowledge

Good contextual knowledge requires that the assistant has access to the data and the corresponding editing rights, depending on the state of the AI. Sensors also play an important role. Depending on the environment the user is in, different sensors must be installed in the room or on the corresponding device. A microphone and a speaker are not enough for this. The current situation is comparable to having a butler, which is what everyone currently wants from a voice assistant, who has no access to anything and whom the user neither knows, nor sees, nor can communicate with.

The steps of a voice assistant

Our research has also shown that a voice assistant can act in three stages. With each stage, the assistant's artificial intelligence and contextual knowledge increase, allowing it to perform more complex tasks.

Step I

External access:
The Voice Assistant is firmly integrated into the system and is able to access files and applications, for example to send and open files.

Step II

Internal access:
The assistant not only recognizes the file or application, but can also actively access its content. The content is made tangible for the assistant so that it can, for example, interpret it.

Step III

Full access:
At the last stage, the assistant gets full access to content and structure. The assistant processes and interprets the content independently and actively supports the user. At this point, artificial intelligence must be at the highest level.
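
One way to picture the three steps is as cumulative access levels, where each step includes everything the previous one allows. The following is only a minimal sketch under our own assumptions: the names `AccessLevel`, `REQUIRED_LEVEL` and `is_allowed`, as well as the example actions, are hypothetical and not part of the concept itself.

```python
from enum import IntEnum


class AccessLevel(IntEnum):
    """The three steps, ordered so that a higher step includes the lower ones."""
    EXTERNAL = 1  # Step I: open and send files, start applications
    INTERNAL = 2  # Step II: read and interpret the content of a file
    FULL = 3      # Step III: process content and structure independently


# Hypothetical example actions and the minimum step each one would need.
REQUIRED_LEVEL = {
    "open_file": AccessLevel.EXTERNAL,
    "send_file": AccessLevel.EXTERNAL,
    "summarize_content": AccessLevel.INTERNAL,
    "restructure_document": AccessLevel.FULL,
}


def is_allowed(action: str, level: AccessLevel) -> bool:
    """Return True if an assistant at `level` may perform `action`."""
    required = REQUIRED_LEVEL.get(action)
    if required is None:
        return False  # actions that are not listed are refused
    return level >= required
```

An assistant at Step I could then open or send a file, while interpreting the content of that file would only become possible from Step II onwards.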

Selected scenarios

Based on our research results, we designed different scenarios that build on our stage model and on the parameters we defined for a voice assistant. In these scenarios, the user works at a desktop computer with keyboard and mouse. With the integrated voice assistant, the user can perform simple tasks in parallel with the current workflow and thus works more effectively with the help of speech.
