It is science fiction shining into reality.
AI Technology Review press: Last night, Google held its 2018 developer conference (Google I/O 2018) in California. Among the many new products and features introduced, including Android P, Gmail, Gboard, and TPUv3, the most striking was undoubtedly Duplex, a new addition to Google Assistant: it can phone restaurants, hair salons, and other businesses to book appointments on a user's behalf.
In the two real voice recordings played at the conference, Duplex spoke to the person on the other end of the line with a natural human voice; the other party did not even realize that the caller was actually an AI. In the second recording it also handled an unexpected turn: it understood that no reservation was needed and went on to ask about the expected wait time. According to Google CEO Sundar Pichai, they plan to further extend Duplex with the ability to ask about business hours: once one user's Google Assistant has phoned a store to ask when it is open, the result can be shared with other users, saving query time for consumers and staff time for the store owner alike. This is, in fact, Google's design philosophy for Google Assistant: save users time and get things done for them.
Google also published the technical details of Duplex on the Google AI blog; AI Technology Review compiles them as follows.
Google Duplex: an AI system that accomplishes real-world tasks over the phone
For a long time, a goal of human-computer interaction has been natural conversation between the two, as if two people were talking to each other. In recent years, computers' ability to understand and generate natural speech has improved dramatically, thanks to technologies built on deep neural networks such as Google Voice Search and WaveNet.
Even so, today's state-of-the-art human-machine dialogue systems still speak with stiff synthetic voices and do not understand natural human language. In particular, automated phone systems that recognize only simple words and commands remain unsatisfying, let alone capable of conversing naturally with people. The caller has to adjust his own way of speaking to suit the system; the system cannot adapt to the caller.
Google Duplex, released today, incorporates new technology that can phone humans and complete a range of real-world tasks through natural conversation. The technology is currently aimed at specific tasks, such as scheduling certain kinds of appointments. Within these tasks, Duplex keeps the conversation as natural as possible: the human on the other end of the line can speak normally, as if talking to another person, without any adjustment (indeed, the other party may not even notice that the caller is not human).
An important point in this research is restricting Duplex to closed domains: scenarios narrow enough that the AI system can explore and learn them thoroughly. Correspondingly, after thorough training on these scenarios, Duplex can carry out natural conversations only within them; it cannot hold general conversations with people.
Even so, as the recordings above show, Duplex delivers surprisingly good performance on these tasks, and the conversation feels comfortable to the human on the line.
How to conduct a natural conversation
Conducting a natural conversation poses several difficulties: natural language is hard to understand; natural human behavior is hard to model; humans have low tolerance for delay, so processing must be fast; and generating natural-sounding speech requires properly mixing in filler words.
When humans talk to each other, they use more complex sentences than when talking to a computer. They often stop mid-sentence to correct part of what they said; they are ambiguous, rely on context, omit words, and sometimes pack multiple meanings into a single sentence. For example: "Tuesday through Thursday we open from 11am to 2pm, then reopen from 4pm to 9pm, and then Friday... no, on Fridays and Saturdays we open until 9pm, and Sundays from 1pm to 9pm."
In natural, spontaneous conversation, people also speak faster and less clearly than when talking to a computer, so speech recognition is harder and word error rates are higher. The problem is worse on phone calls, which often have background noise and poor audio quality.
In longer conversations, the same sentence can mean different things depending on context. For example, "ok for 4" might specify the party size for a reservation, or it might mean a time. The relevant context may lie several sentences back and be compounded by recognition errors on the call, making the conversation increasingly difficult to understand.
After recognizing what the other party means, the AI system decides what to say based on the task at hand and the state of the conversation. Natural conversations also follow some common conventions, including: elaborating when repeating ("It's for next Friday." "When?" "Next Friday, the 18th."), synchronizing ("Can you hear me?"), interruptions ("The number is 212..." "Sorry, can you say that again?"), and pausing ("Can you hold on a moment? [pause] Thank you!"; a one-second pause and a two-minute pause mean different things).
Thanks to recent advances in language understanding, interaction, timing control, and speech generation, Google Duplex's conversations sound quite natural.
To deal with the challenges above, the heart of Duplex is a recurrent neural network (RNN) built with TensorFlow Extended (TFX). To achieve high accuracy, Google trained Duplex's RNN on anonymized phone conversation data. The network uses not only the text output of Google's automatic speech recognition (ASR) system, but also features from the audio, the conversation history, and conversation parameters (such as the service being booked and the current time). Google trained a separate understanding model for each task, though some training data is shared across tasks. Finally, Google further refined the models using TFX's hyperparameter optimization.
Incoming speech is first processed by the automatic speech recognition (ASR) system. The resulting text is fed into the RNN together with context data and other inputs, and the generated response text is then spoken aloud by the text-to-speech (TTS) system.
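The pipeline just described can be sketched in a few lines of code. This is an illustrative skeleton only, not Google's implementation: the functions `run_asr`, `rnn_policy`, and `synthesize` are hypothetical stand-ins for the real ASR, RNN, and TTS components.

```python
# Illustrative sketch of a Duplex-style pipeline: ASR -> RNN -> TTS.
# Every component here is a hypothetical stand-in, not Google's code.

def run_asr(audio):
    """Stand-in ASR: turn incoming audio into text."""
    return "ok for 4"  # pretend transcription of the caller's speech

def rnn_policy(text, history, params):
    """Stand-in for the RNN that picks the next response from the
    ASR text, the conversation history, and task parameters."""
    if params.get("slot") == "party_size":
        return f"Great, a table for 4 at {params['time']}."
    return "Could you repeat that, please?"

def synthesize(text):
    """Stand-in TTS: turn response text into audio (here, a label)."""
    return f"<audio:{text}>"

def handle_turn(audio, history, params):
    text = run_asr(audio)
    history.append(("caller", text))
    reply = rnn_policy(text, history, params)
    history.append(("system", reply))
    return synthesize(reply)

history = []
out = handle_turn(b"...", history, {"slot": "party_size", "time": "7pm"})
```

Note how the same ASR text ("ok for 4") is disambiguated only by the task parameters passed alongside it, which mirrors the context-dependence problem described earlier.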
Generating natural speech
To control the intonation of speech according to context, Google uses a concatenative TTS engine together with a generative TTS engine (which uses Tacotron and WaveNet).
The system also generates filler words (such as "hmm" and "uh"), which makes the speech sound more natural. Fillers are added when the concatenative TTS has to join very different speech units, or when a pause needs to be lengthened; they let the system signal, in a natural way, "yes, I'm listening" or "I'm still thinking." (When humans talk, they often utter a few filler words while thinking.) Google's user studies confirmed that conversations with fillers feel more familiar and natural to people.
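A minimal sketch of this filler-insertion idea is below. The trigger condition, the filler list, and the function name are all illustrative assumptions; the article does not describe Google's actual logic at this level of detail.

```python
import random

# Sketch of inserting disfluencies ("hmm", "uh") into a response,
# as described above. The trigger conditions and names here are
# illustrative assumptions, not Google's actual logic.

FILLERS = ["hmm,", "uh,"]

def add_disfluency(response, needs_thinking_pause, rng=random.Random(0)):
    """Prefix a filler word when the system wants to signal
    'I'm still thinking' in a natural-sounding way."""
    if needs_thinking_pause:
        return f"{rng.choice(FILLERS)} {response}"
    return response

print(add_disfluency("let me check that date", needs_thinking_pause=True))
print(add_disfluency("we're open until nine", needs_thinking_pause=False))
```

In a real synthesizer the filler would be a distinct audio unit with its own prosody rather than a text prefix, but the control flow (decide whether a pause needs covering, then splice in a filler) is the same.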
On the other hand, the system's latency must also match human expectations. For example, after saying something simple like "hello" on the phone, people expect a quick, brief reply and are more sensitive to delay. When the system detects that a short latency is required, it switches to faster but less accurate models. In some extreme cases it does not even wait for the RNN, but uses fast approximation models directly (usually combined with a slower, more hesitant response, just as a human would hesitate while working out what the other party meant). This lets the system respond within 100 ms. Interestingly, Google found that in some cases adding latency makes the conversation sound more natural, such as when replying to a very complicated sentence.
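The latency trade-off described in this paragraph amounts to a simple dispatcher: when a quick reply is expected, route to a cheaper approximate model; otherwise run the full RNN. The threshold, model stubs, and names below are illustrative assumptions.

```python
# Sketch of latency-aware model selection, as described above: use a
# fast approximate model when a quick reply is expected, and the full
# RNN otherwise. The threshold and model stubs are made up.

FAST_BUDGET_MS = 100  # hypothetical budget for "must reply instantly"

def full_rnn(text):
    """Stand-in for the full (slow, accurate) RNN response model."""
    return "full-model reply to: " + text

def fast_approximation(text):
    """Stand-in for a cheaper, lower-accuracy response model."""
    return "quick reply to: " + text

def choose_reply(text, latency_budget_ms):
    if latency_budget_ms <= FAST_BUDGET_MS:
        # e.g. the caller just said "hello": answer immediately
        return fast_approximation(text)
    return full_rnn(text)

print(choose_reply("hello", 80))                          # fast path
print(choose_reply("we open 11 to 2, then 4 to 9", 500))  # full model
```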
The Google Duplex system can conduct complex conversations, and it completes most tasks fully automatically, without human involvement. The system also monitors itself: besides notifying the user when a task completes successfully, it can recognize tasks it cannot complete on its own (such as an unusually complicated booking). In such cases it signals a human operator and hands the task over to be completed by a human.
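This self-monitoring handoff can be sketched as a confidence check: complete the task automatically when the system is confident, escalate to a human operator when it is not. The confidence score and threshold are illustrative assumptions, not values the article gives.

```python
# Sketch of the self-monitoring handoff described above: tasks the
# system cannot complete confidently are escalated to a human
# operator. The confidence score and threshold are illustrative.

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off

def handle_task(task, confidence):
    """Complete the task automatically, or hand it to a human."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return ("auto", f"completed: {task}")
    return ("human", f"escalated to operator: {task}")

print(handle_task("book a table for 4", 0.95))
print(handle_task("unusually complicated booking", 0.30))
```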
To train the system to handle new situations, Google also used real-time supervised training. This resembles how many skills are taught: an instructor watches a student work and gives guidance as needed, ensuring the task is performed at the quality level the instructor requires. In Duplex, experienced human operators act as the instructors: when the system calls to handle a new, unfamiliar scenario, an operator can influence its behavior in real time. The process continues until the system reaches the desired quality, after which it can make calls fully autonomously.
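The instructor loop described here is a human-in-the-loop pattern: the operator may override the system's proposed action, and the corrected pair can be kept for later training. The function and variable names below are hypothetical; the article does not specify this mechanism's implementation.

```python
# Sketch of instructor-style real-time supervision, as described
# above: a human operator may override the system's proposed action
# on unfamiliar calls, and the (proposed, final) pair is logged so
# it can serve as training data. All names are illustrative.

def supervised_turn(proposed_action, operator_override, training_log):
    """Use the operator's correction when given, and record the
    (proposed, final) pair for later training."""
    final = operator_override if operator_override is not None else proposed_action
    training_log.append((proposed_action, final))
    return final

log = []
a = supervised_turn("ask for party size", None, log)        # no correction needed
b = supervised_turn("hang up", "ask about wait time", log)  # operator corrects
```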
Good for users, and for businesses too
Many merchants do not have their own online reservation system and still take reservations by phone. Duplex can help them: without changing their daily practices or training staff, their customers can easily book through Google Assistant. Duplex can also reduce no-shows by automatically reminding users of their phone-booked reservations and making it easy to cancel or reschedule.
In some cases a user may call a merchant just to ask about business hours, such as holiday hours, which are generally not shown on the store's online information page. Once Duplex has phoned and asked, the information can be published through Google's services, saving other users the trouble of making the same call to ask the same question, and saving the merchant's staff time. Meanwhile, businesses operate exactly as they always have; the new technology requires them to learn nothing and change nothing to enjoy the convenience.
For users, Google Duplex can of course help complete the various tasks it supports with ease. The user simply interacts with Google Assistant; Duplex makes the call in the background and fills in the required information automatically.
The user asks Google Assistant to make an appointment; Google Assistant then books with the merchant via Duplex.
Duplex also offers users an added convenience: it can act asynchronously as their agent toward the service provider. For example, when a call must be made outside business hours, or when the mobile signal is poor, Duplex becomes an additional way to get the task done. It can also make phone calls for people who cannot make them themselves: completing an appointment for a hearing-impaired user, or handling the call in another language for a user who does not speak it.
This summer, Google will begin testing Duplex within Google Assistant, starting with restaurant reservations, hair salon appointments, and inquiries about holiday business hours.
Yaniv Leviathan, head of the Google Duplex team, and Matan Kalman, the project's engineering lead, eating at a restaurant; the meal was booked by Duplex over the phone.
It has long been Google's goal to let people "interact with technology as naturally as they interact with each other." Google Duplex, which holds natural conversations with people in specific scenarios, is a step in that direction. Google hopes these domain-specific advances will also bring meaningful improvements to everyday interaction between humans and computers.
Via the Google AI blog; compiled by AI Technology Review.