On November 8, at the 5th World Internet Conference, which opened on November 7, Sogou announced the official debut of the “AI Synthetic Anchor”, which it developed with Xinhua News Agency and described as the world's first fully simulated intelligent synthetic news anchor. “Sogou avatar”, the artificial intelligence technology at its core, also drew wide attention.
According to the announcement, a user only needs to feed an existing news text to the “AI Synthetic Anchor”, and a synthesized Xinhua News Agency news anchor appears on screen. It reads the text in a voice that matches the real person's, and its lip movements and facial expressions are matched as well. In both look and sound, the resulting video differs little from a real broadcast by the Xinhua anchor.
During development, Sogou's engineers explored various approaches together with Xinhua News Agency's news anchors. Built on “Sogou avatar” technology, the “AI Synthetic Anchor” combines facial key point detection, facial feature extraction, face reconstruction, lip recognition, emotion transfer and other cutting-edge techniques with joint modeling and training on multimodal information such as voice and images.
According to Wang Yufeng, general manager of Sogou's Intelligent Voice Division, “Sogou avatar” is one of Sogou's core artificial intelligence technologies, born of Sogou's “natural interaction + knowledge computing” AI concept. It uses Sogou's AI capabilities to train an AI to resemble a person in appearance and expression, voice and speech habits, and logical thinking, and thereby to clone a human AI avatar that helps people express and transmit information more efficiently. This technology is the core that keeps the “AI Synthetic Anchor” running.
After the event, Sogou CEO Wang Xiaochuan was interviewed by Tencent Technology and other media. The following are excerpts from the interview (abridged without changing the original meaning):
Media: As the host said just now, will he lose his job?
Wang Xiaochuan: First of all, AI technology divides into perceptual technology and cognitive technology. Perception covers sound and images, and in perceptual technology machines basically have a chance to be as good as people. In cognitive technology, though, which covers reasoning, knowledge, thinking, and the logic built around language, a machine's processing power is limited. So when it comes to people's higher-level activities, machines cannot do them now.
Media: Is it possible to do it in the future?
Wang Xiaochuan: There is no such technology now. In a vertical field, subdivided into a specific area such as law or medical care, the narrower the field, the better the machine's chance of getting close to humans in it. You can't treat AI as one big word; it has many connotations. Once AI is raised to things related to human cognition, the machine can only serve as an aid and cannot replace human beings. But if it is just listening to a voice or looking at an image, as with Face++ or SenseTime today, the machine can replace some perceptual work. For humanity's higher-level activities, machines are not going to replace people. Higher-level activities are called cognition, lower-level activities are called perception, and today's machines can do perception. That is the first broad distinction.
Media: What do you make of the AI Synthetic Anchor? How do humans relate to machines: replacement or cooperation?
Wang Xiaochuan: The name we settled on with Xinhua is AI Synthetic Anchor. The technology involves three kinds of synthesis: voice synthesis, expression synthesis, and lip synthesis, with lip synthesis being the most notable. As for the word “virtual”, a drawn cartoon character is also called virtual. If we called this a virtual anchor, people who find it lifelike would ask how it can be called virtual, so we call it the AI Synthetic Anchor. On whether the AI Synthetic Anchor replaces or cooperates with humans: if the task is purely perceptual and does not involve organizing language or writing an in-depth script, it can come close to a real person.
Media: So people write the news copy, and the rest is handed over to it?
Wang Xiaochuan: If you want the delivery to be vivid, such as knowing where to sound angry and where to sound tender, that is hard for the machine, because it does not understand the content of the script or its true meaning. For visual and auditory expression alone, the AI Synthetic Anchor can come close to a real person, but the more closely the task depends on the content, the weaker the machine's role becomes.
Media: What is the difference between Sogou's AI Synthetic Anchor and Microsoft Xiaoice's anchor?
Wang Xiaochuan: Sogou's AI Synthetic Anchor uses the likeness of a real person. Xiaoice presents a virtual character; its voice differs from a real person's, and its expression and lip shape do not change. The AI Synthetic Anchor is genuinely synthesized from a real person.
Media: Beyond news anchoring, what other scenarios is this technology suited to?
Wang Xiaochuan: For example, we are now working on storytelling with Kai Shu. It used to be Kai Shu telling the stories; in the future it may be your father or mother telling them to you. In our broader vision the anchor is just one application; next we need to personalize it so it can become other people.
Media: What is Sogou considering as its next step?
Wang Xiaochuan: I once said that the future direction of the Sogou input method is assisted dialogue, that is, helping you converse. When Sogou went public last year, I received 3,000 WeChat messages every day. Replying by voice would mean speaking 3,000 replies, and the senders differ: some are journalists, some are old classmates, so the replies cannot all be the same. That is when you need an avatar. It can reply in your own style and take over the mechanical, repetitive work. Sogou has two concepts: one is to make the machine your avatar, the other is to make the machine your assistant. Sogou Search is meant to be your assistant. This is the core direction of Sogou AI.
Media: When it comes to language, isn't the threshold fairly low?
Wang Xiaochuan: The threshold for language is quite high. When Gaode Maps synthesized Lin Zhiling's voice, she had to record a great deal of speech, not just fixed phrases like “turn left” and “turn right”. Sogou now needs only 10 minutes of data, a very small amount, to synthesize a person's voice.
Media: Why? Is there any algorithmic breakthrough?
Wang Xiaochuan: So-called small data is in fact inseparable from big data. The machine first has to hear a great many voices and then work out how this person's voice characteristics differ from others', so the small rests on the big. We used to say that babies learn very fast, picking something up from a single picture. In fact the baby looks at that one picture after having seen a great many. So for a specific target, the less data needed the better, but you must have enough data in the general domain. The technology therefore requires both training on a large amount of speech and the ability to train a specific person's voice from little data. That is the technical barrier.
Media: Will the synthetic anchor be commercialized faster than other applied AI projects?
Wang Xiaochuan: The fastest is translation, because translation is a rigid need. It is not just the Translation Treasure. Sogou Search supports searching global information in Chinese, which is an application of translation technology. Translation is one of Sogou's missions: with the input method you type Chinese and it comes out in a foreign language, and with search, foreign-language content comes back in Chinese. We are a company that bridges information, so translation is very important, and it is the fastest technology to reach the market. The order is speech and images first, then translation. After translation comes the avatar: after training on a person's data, it helps that person express themselves. Last comes question answering, the personal assistant that can answer your questions. For consumers, this is the route of technological evolution.
Media: When Sogou does general-purpose training, where does the voice data come from? Is it speech from ordinary voice input, or something else?
Wang Xiaochuan: We have many partnerships and do a lot of annotation. For example, we now cooperate with Ximalaya, through which we can have material read aloud and collect as many different voices as possible.
Media: In addition to medical care, Sogou is also doing legal content search. What are the criteria for selecting these fields, and what is the plan for expanding vertical search next?
Wang Xiaochuan: First of all, the biggest is medical care. Law is a field where knowledge can clearly be structured, and its knowledge boundaries are relatively authoritative and standardized, so it is a field we may choose. But medical care far outweighs law.
Media: Will Sogou expand into other areas?
Wang Xiaochuan: In other fields, we still want to work hard to make information more authoritative and accurate. In short, in some areas the information on the Internet is not good enough, and we hope to apply new methods there, whether AI technology or other approaches.
Media: Why launch Sogou Hao? What is this content business about, and why enter this market?
Wang Xiaochuan: Because on today's platforms, such as Toutiao or Douyin, users consume content inside the platform and producers are in a cooperative relationship with it. That cooperation is more than a lightweight contract; producers genuinely publish their content onto the platform. A search engine is different: in the search model, content is crawled. With something like Toutiao Hao or Sogou Hao, we are keen to raise the share of content that comes through cooperation. It is more standardized; for example, there is a mutual understanding about how much advertising you run here. That improves the content and the quality of cooperation and makes the user experience better.
Media: Do you think you are late to this?
Wang Xiaochuan: For us, it will not be a strategic breakthrough point any time soon.
Media: Will this change in information flow bring greater revenue to Sogou?
Wang Xiaochuan: There will be some. Information flow advertising is mainly app-based, so it can supplement revenue through the Sogou app or browser. If the app's user base is particularly large, the returns are large.
Media: You said earlier that people who keep scrolling Douyin and Toutiao easily become addicted. Can Sogou Hao change that?
Wang Xiaochuan: First, a large part of Sogou Hao serves search, not just the information flow. Second, some things will not change: games, for example, we do not do. Given our interests and experience, games are not for us. They gratify you and draw you into a virtual world that is too illusory. People do need that kind of thing, but it needs guidance, and we do not have the ability to provide that guidance. We would rather put our strengths where we are good: we make expressing information easier, we do translation, and we do question-answering technology.
Media: How do you see Sogou Search in terms of traffic channels or at the user level?
Wang Xiaochuan: On channels, there are two places where we want to break through. One is using our own traffic channels. For example, when a user is typing with the input method, if we can identify their intent, we can directly provide better information that satisfies them, and they can even share it with others. We still have a lot of room to improve here, in connecting search with input behavior. Second, we hope to make search results more differentiated or more authoritative. Just as we did earlier with WeChat content, we are now focusing on medical and health content, and we hope differentiated content will lead users to come to us on their own, rather than our relying on partnerships with QQ Browser or mobile phone manufacturers, so that costs come down.
Media: Sogou's AI strategy is still relatively focused. Have you considered diversifying into software, research, or hardware, or into more areas of AI?
Wang Xiaochuan: No, we are not considering that. I think we are already quite broad. The core of the information civilization era is the understanding of knowledge and language. To do AI, I think you have to meet a few conditions; today AI is a big-company business. First, you need scenarios and data. If you have no scenarios and no data, only technology, it is very hard. Our data and scenarios lie in user expression and information acquisition, that is, in input and search, so we build around those scenarios. Second, it takes sustained investment. Many startups and companies without business models are in this position today: if the market fails to open up one day, the whole thing may collapse. We have enough funds to invest, but I also want that investment matched to visible business value. With translation we have not been thinking about business, and we are already expanding; we recently supported 500 simultaneous interpretations. In trying to expand, we are not thinking about business. The focus is still our mission: to make it easier to express and access information, and to let machines partially take over or provide services in the future.
Media: On simultaneous interpretation, is it possible for the machine to replace people?
Wang Xiaochuan: No, it can't. If you use good human interpreters, the machine cannot catch up. But in many situations there is no good simultaneous interpreter available, or you cannot assign every traveler abroad a personal interpreter. In those cases the machine can do the job, handling translation that is simple, repetitive work. Really good translation, though, takes knowledge and thinking. Machines are not strong at open-ended thinking. For closed thinking, as on a chessboard, the machine can manage, but in an open environment machines fall short.
Media: Will you consider making moves in multimedia search?
Wang Xiaochuan: Search is centered on language. If you leave text behind and work only with images, that is not enough. We can search by image and search by voice, but the core is reading and understanding. Multimedia is not where we gain the most or where our biggest breakthrough lies. Our breakthrough is in understanding language, which is harder, harder even than 5G.
Media: On hardware products, is it possible that you will become an OEM in the future?
Wang Xiaochuan: It is possible, but for now we first get our own products right and then open up. Amazon did the same: it built the Echo speaker itself first, and only then had the opportunity to work with others. Otherwise, if you are B2B2C from day one, you do not know where the customer is and have no direct contact with the customer, which is not good enough for a to-C company. First work it through thoroughly yourself, and then, when you find your own capacity is not enough, open up.
Media: How long will it take to get to the real AI personal assistant?
Wang Xiaochuan: We call the past 20 years the information age. Every era has its starting point. The agricultural era came with the earliest invention, the wheel: once the wheel existed, people could push carts and farm the land. That is how it began. Later the steam engine brought the industrial age, and later still computers and the Internet brought the information age. The defining feature of the information age is the ability to transmit information across regions, time and space. With e-mail or IM you can communicate remotely with anyone, or you can put information online and find it with search. So the input method, the search engine and communication software are the core applications of this era. As for the question you just asked, the AI personal assistant is a very important thing for the next 20 years.
Media: How long will it really take before it can help and assist people?
Wang Xiaochuan: Vertical fields are arriving gradually. It already genuinely helps people: translation for ordinary people is itself an AI assistant. That used to require a real person, and now a machine can do it. Going further, in our vertical scenarios we also build machines that generate automatic replies for sales companies or customer service companies. That has already begun, but it needs domain support: people have to train it on the domain knowledge before it can work; thinking ability alone is not enough. So the next step is to do this in a data-driven way. For now it only helps people and has not replaced them. You cannot yet see technology replacing people, but helping people has already begun.
Media: Is this sold as a solution?
Wang Xiaochuan: We are a to-C company. We will bring it to consumers and make it easier for them to use.
Media: What are the future usage scenarios for AI Synthetic Anchors, or for AI-synthesized personas?
Wang Xiaochuan: Our core capabilities today are dialogue and question answering, and the synthetic anchor can also be interactive. So in medical, legal and other human-computer interaction it can take on a role that makes communication friendlier, but the real service lies in the content, in bringing the service into it. The other scenario is the one you saw today: we handed it to Xinhua News Agency, their editors write the scripts, and the machine just reads the text. That is one-way, not two-way. So it covers only natural interaction; the knowledge computing capability has not been put into it yet.
Media: Is it right to understand Sogou's future strategy as AI + IoT (Internet of Things)?
Wang Xiaochuan: IoT is just an interface; I would not put it at such a high level. AI is the core that helps people express information. IoT is just one of the channels through which it helps you express information.
Media: Why doesn't Sogou make smart speakers?
Wang Xiaochuan: The core of this product is being cheap enough; it is not driven by technology. It can only be driven by capital, and there is no AI in it. It amounts to selling at a loss, a cash-burning business like ride-hailing and food delivery. We simply cannot do that; we do not have that ability.
Media: When will artificial intelligence reach the inflection point of making money?
Wang Xiaochuan: Artificial intelligence is a technology. Saying that a technology makes money does not hold; it has to become a business before it can be profitable. Moreover, AI technology is especially data-driven, so it is harder for small companies. A company that already has a scenario can do something extra on top of it. So this comes down to the social division of labor and the exchange of data, with small companies providing technical services to large ones. Judging by the current trend, AI really is a big-company business. On the other hand, if the government opens up a lot of data, small companies would gain access to data, and that could bring new investment opportunities.
Media: There is a new development with the science and technology innovation board. Will you invest in some companies? Would you list on it yourselves?
Wang Xiaochuan: Our model will not change. We will not invest to make money; that is not what we do, and professional institutions can do it. As for ourselves, we are already listed in the United States. For the domestic environment, it would mean a great deal if the science and technology innovation board succeeds, but that depends on policy. I only know that it has been announced, not how it will evolve in the end. I hope it succeeds; it would be revolutionary for China.
Media: In choosing 2C smart hardware products, which needs do you think are still unmet?
Wang Xiaochuan: First, there needs to be a better voice recorder. Teachers, media people covering a speech, two parties negotiating a contract: wherever there is business activity, this is a real need. But there is no branded product yet. If capabilities are upgraded, new products will follow. I think we will work to bring new technology to everyone, possibly in cooperation with hardware manufacturers.
Media: What's new in data and privacy protection?
Wang Xiaochuan: First, respecting and protecting privacy is a very serious topic; it has to conform to the law and to users' culture. But the one-sided view that privacy is absolutely inviolable is particularly harmful, because only when others know your individual characteristics can they give you better service. So, from the perspective of overall social value, individuals can open up some of their own data so that companies can provide better services, and this should be encouraged where it is safe. The EU's approach will therefore end up harming itself: consumers will not buy into it, services will not be able to improve, and it will fall behind.
Media: When Sogou starts a new business, what determines whether to start it? What is the most important consideration?
Wang Xiaochuan: First, we need to know what the future trend is. That has to be known; it is the starting point. Second is why we should be the ones to do it: how doing this relates to our values, our capabilities and our current state. Why us is a very serious question. Ideally there is an overarching mission behind it. If something is both in the trend and in your mission, you will work hard at it.
Media: At Sogou, are you more like a professional manager or a co-founder?
Wang Xiaochuan: Actually, I have both attributes. I carry the founder's role of spiritual leadership and setting direction, but because of the equity structure I also have to work like a professional manager. It is a unique position.