Smart Speaker Battle for the Beach (Part 2): The Fight over the Dialogue OS and the Reconstruction of the Service Ecosystem

This article was produced by NetEase Smart Studio (official account: smartman163). Focus on AI and read the next big era!

Editor's note: Smart speakers have swept the tech industry like a tide. With the giants piling in, a hundred-speaker battle is imminent. On August 15th, NetEase Intelligence released the special feature "Smart Speaker Battle for the Beach (Part 1): New World or Mirage?", based on interviews with manufacturers and senior industry professionals across the smart speaker ecosystem.

The previous installment covered the gap between smart speakers in China and the United States and its specific causes, the voice interaction era and deployment scenarios that the major manufacturers are really competing for behind the smart speaker, and the debate over smart home control.

In this second installment, we discuss the technical problems currently faced in building a smart speaker, as well as the content service and music copyright disputes behind smart speakers.

Text / Xiaoyan

IV. Technology First: Far-Field Recognition, Wake-Up Duration, and Voiceprint Recognition

To build a smart speaker, you must first solve the technical problems.

To achieve natural voice interaction on a smart speaker, the most critical technology is far-field speech recognition.

Far-field speech recognition involves a series of problems, including microphone arrays, noise reduction algorithms, recognition accuracy, and latency. "The microphone array itself is very mature. Whether it has 2, 4, 6, or 7 microphones, domestic manufacturers already produce it. The key lies in the software solution (noise reduction, sound source localization, and so on). If you want to build a smart speaker, many speech recognition companies will send people to pitch to you," said Liu Rui, director of artificial intelligence at NetEase.

But how many microphones should the array have? Opinions differ. It is generally believed that the more microphones (mics) there are, the better the sound pickup, but also the more complex the algorithm and the higher the demand on CPU frequency. Zhang Peng, who is responsible for the Pandora project, believes that the difference between 6 mics and 4 mics is not obvious while the cost is higher, and that there is still a noticeable gap between 2 mics and 4 mics. All things considered, a 4-mic array is a good choice.

"More microphones are not always better; the right fit is what counts," said Song Shaopeng, CEO of a smart speaker integration solutions provider. "Google Home uses only two microphones, but its algorithm is very good and so are the results. So the number of microphones has to be chosen by weighing the usage scenario, distance, cost, and even the system algorithms."

Currently, the 6+1 microphone array is the Amazon Echo certified solution, and many vendors use similar designs. According to Wei Qiang, general manager of Linglong Technology, the company's speaker currently uses a 7+1 microphone array. Such a solution is usually an integrated hardware and software package: beyond the hardware, it must be paired with noise reduction, background noise cancellation, and many other algorithms, and it even involves the enclosure structure and circuit design.

Although the industry already has many well-established integrated microphone array solutions, many problems remain in real-world use. Typical examples are dialect recognition and mixed Chinese-English recognition.

Chinese has many dialects, which leads to a large gap in experience between users in different regions when they talk to a smart speaker. Wei Qiang believes that the dialect problem is essentially a training data problem: with a large enough dialect corpus, it can be solved.

Another very typical problem arises when users speak Chinese to search for English songs, or for songs with mixed Chinese and English titles. The result is often an answer that has nothing to do with the request. Speech recognition companies need to find a breakthrough in switching between Chinese and English.

Besides far-field recognition, another technical issue that draws a lot of attention is wake word customization and wake-up duration.

At the current level of technology, customizing a wake word is not much of a problem. The difficulty is that the wake-up accuracy of a custom wake word is not as high as that of a standard wake word. Recently, Baidu acquired 100% of KITT.AI, a technology company specializing in wake words, to strengthen its capabilities in this area.

Wake-up duration is a difficulty on which the industry has not yet settled on a unified technical approach. In other words, after a smart speaker is woken up, should it stay in the listening state or go back to sleep? If it stays awake, it may misrecognize what it hears. For example, if a voice on TV says "alarm", the smart speaker will immediately raise the alarm. This is a real case that happened with Google Home.

"At present, the common practice in the industry is to set a wake window of 6 or 10 seconds, or simply to let users set the wake-up duration themselves," Liu Rui explained to NetEase Intelligence.
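
The fixed wake window that Liu Rui describes can be sketched in a few lines. This is only an illustration under stated assumptions: the class and method names are hypothetical, timestamps are passed in by the caller, and the 6-second default simply mirrors one of the figures quoted above. It does not describe how any particular vendor implements the feature.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WakeWindow:
    """Hypothetical helper: accept speech only for a short time after the wake word."""

    # Assumed default; the article cites 6 or 10 seconds, or a user-chosen value.
    window_seconds: float = 6.0
    # Timestamp (in seconds) of the most recent wake word, or None if never woken.
    woken_at: Optional[float] = field(default=None, init=False)

    def on_wake_word(self, now: float) -> None:
        """Record that the wake word was heard at time `now`."""
        self.woken_at = now

    def is_listening(self, now: float) -> bool:
        """True while `now` still falls inside the wake window."""
        if self.woken_at is None:
            return False
        return (now - self.woken_at) <= self.window_seconds

    def handle_speech(self, text: str, now: float) -> Optional[str]:
        """Return the text as a command inside the window; otherwise ignore it."""
        if not self.is_listening(now):
            return None
        return text


if __name__ == "__main__":
    window = WakeWindow(window_seconds=6.0)
    print(window.handle_speech("alarm", now=0.0))       # speech from the TV, speaker asleep
    window.on_wake_word(now=10.0)
    print(window.handle_speech("play music", now=12.0))  # inside the window
    print(window.handle_speech("alarm", now=20.0))       # window has expired
```

A shorter window reduces the chance of reacting to stray audio such as a television, at the cost of making users repeat the wake word more often, which is the trade-off the industry has not yet standardized.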

Beyond the technologies above, another one is on the rise in smart speakers: voiceprint recognition. In NetEase Intelligence's interviews, people across the industry all agreed on the future application prospects of this technology.

For Zhang Peng, voiceprint recognition provides an identity for the era of voice interaction, which lays the foundation for offering personalized services to individual members. "Voiceprint recognition will become standard for smart speakers and even for future voice interaction in general," is how Liu Rui positions the technology.

"However, voiceprint recognition has only just begun. There is no standard for the number of users that can be identified. From a technical point of view, the more users are identified, the higher the false recognition rate." Liu Rui said that current voiceprint recognition algorithms are still in the early stage of data accumulation and need further development. Wei Qiang believes that current voiceprint technology can only be used in relatively quiet acoustic environments and cannot yet be used for payment and other high-risk scenarios.
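
Liu Rui's point that false recognition rises with the number of identified users can be illustrated with a simple probability model. The sketch below assumes, purely for illustration, that every enrolled voiceprint has the same, independent chance of falsely matching an unknown speaker. The function name and the 1% rate are hypothetical and are not figures from any vendor or from the interviews.

```python
def false_match_probability(per_user_rate: float, enrolled_users: int) -> float:
    """Chance that an unknown voice falsely matches at least one enrolled user.

    Assumes each comparison is independent and has the same false match rate,
    which is a simplification of how real voiceprint systems behave.
    """
    if not 0.0 <= per_user_rate <= 1.0:
        raise ValueError("per_user_rate must be between 0 and 1")
    if enrolled_users < 0:
        raise ValueError("enrolled_users must be non-negative")
    # Probability that none of the comparisons falsely match, then its complement.
    return 1.0 - (1.0 - per_user_rate) ** enrolled_users


if __name__ == "__main__":
    # Hypothetical 1% false match rate per enrolled voiceprint.
    for users in (1, 2, 5, 10):
        print(users, round(false_match_probability(0.01, users), 4))
```

Under these assumptions the overall false match probability grows with every additional enrolled user, which is one reason a household speaker is an easier target for the technology than a payment system.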

"Voice interaction technology this year is like mobile phone touch technology in 2008. Back then, touch input was not responsive, games were hard to play, and the devices ran hot. But voice technology will certainly become more and more mature, and the problems above will be resolved," Song Shaopeng said.

V. From Cloud Services to Skills: Reconstruction of Ecological Chains

Beyond the smart speaker itself, more and more people believe that cloud content services will become the focus of future competition in voice interaction.

To connect content services to voice interaction devices, Amazon Alexa offers a good solution: expose the voice technology through an API. When you ask Echo what the weather is like today, it first processes the voice locally and uploads it to a cloud server. The server transcribes the speech into text, extracts keywords from the text to understand the meaning, and finds the corresponding answer by calling a weather information database. The answer is finally sent back to the speaker and read aloud. The whole process may take only a few seconds. The weather information here is a cloud content service, which Amazon calls a Skill. The latest data shows that the Alexa platform already has 15,000 skills.
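
The request flow described above (speech to text, keyword matching, then dispatch to a skill) can be sketched as follows. This is a toy illustration, not Amazon's actual API: every function and name here is hypothetical, speech recognition is replaced by a stand-in that treats the "audio" as UTF-8 text, and the weather skill returns a placeholder instead of querying a real database.

```python
from typing import Callable, Dict, Optional

# A skill is modeled as a function that takes the recognized text and returns a reply.
Skill = Callable[[str], str]


def weather_skill(utterance: str) -> str:
    # Placeholder: a real skill would call a weather information database here.
    return "Weather skill: here is today's forecast."


# Hypothetical registry that maps a trigger keyword to a skill.
SKILLS: Dict[str, Skill] = {"weather": weather_skill}


def speech_to_text(audio: bytes) -> str:
    # Stand-in for cloud speech recognition: the "audio" is just UTF-8 text.
    return audio.decode("utf-8")


def find_skill(text: str) -> Optional[Skill]:
    """Pick the first registered skill whose keyword appears in the text."""
    lowered = text.lower()
    for keyword, skill in SKILLS.items():
        if keyword in lowered:
            return skill
    return None


def handle_request(audio: bytes) -> str:
    """Run the whole pipeline and return the sentence the speaker would read aloud."""
    text = speech_to_text(audio)
    skill = find_skill(text)
    if skill is None:
        return "Sorry, no skill can handle that yet."
    return skill(text)


if __name__ == "__main__":
    print(handle_request(b"What is the weather like today?"))
    print(handle_request(b"Tell me a joke"))
```

In this picture, adding a new service means registering one more entry in the skill table, which is why a larger skill catalog translates directly into a more capable speaker.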

"For example, you tell the refrigerator that you are a little tired today, and it recommends food that you like and that is also particularly nutritious." Xie Dianxia, CEO of Haizhi Smart, said that all future business services will be upgraded into skills, and that skills can recommend things to you like an expert. Wei Qiang expressed a similar view: "In the mobile phone ecosystem, a handful of apps basically control the major entry points. With voice interaction, however, users can switch freely and naturally. This is very long-tail demand, and the more services available to the user, the better."

"From this point of view, all future apps will be reconstructed. The reconstruction may mean upgrading a single product into a skill, or it may mean that apps that used to be isolated islands are opened up to each other," Xie Dianxia said, speculating on the future forms of voice interaction.

At present, domestic giants and startups alike hope to build a platform similar to Alexa. Baidu, Alibaba, Tencent, NetEase, Xiaomi, and iFlytek have all entered the race. The goal is to create a Chinese-language voice dialogue platform that turns mobile Internet services into platform skills.

With the hundred-speaker battle under way, startups are not to be outdone and are doing their best to claim a share of the dialogue platform market. According to Long Mengzhu of AISpeech, the company will soon release DUI, a dialogue platform for developers. "We did a survey and found that about 60% of developers on the Alexa platform are still watching from the sidelines. Should they add speech recognition to their own products? On this question, developers at large companies have to wait for instructions from above, while small and medium customers need more customization." Long Mengzhu said that in the early days of voice interaction, in-depth one-on-one communication with developers is essential, and only startups can do this. DUI is currently open to 500 seed developers. It supports multi-turn dialogue, microphone noise reduction, speech recognition and output, and TTS speech synthesis; it integrates many third-party skills (such as chat, navigation, and weather); and it allows custom wake words. Most importantly, it offers one-on-one communication to meet developers' diverse needs, along with data consolidation to meet their operational needs.

The Yun Zhisheng team wants to integrate hardware, software, and services through chip-level solutions. According to Zhang Peng, Yun Zhisheng hopes to continue its "cloud, device, chip" product and technology architecture, so that customers can take the chip, fit it into a speaker enclosure, and use it directly, with Yun Zhisheng providing the complete solution. Xie Dianxia also believes that the smart speaker is essentially the MVP (minimum viable product) of a robot. It offers functions such as wake-up, horoscopes, fortune-telling, the traditional almanac, encyclopedia lookup, and recipe reading, and it can be embedded in all kinds of robots and smart home devices.

According to Zhu Junwen, CEO of the solution integrator Old Tree Blossom Technology, the focus of future competition in voice interaction is the cloud platform, and Internet companies are an important force. He believes that "in the future, pure speech engine technology will become a mature foundational technology, with little difference between one vendor and another. What the fight ultimately comes down to is still content services. This is a process of building an ecosystem."

Whether they call it XXUI or XXOS, all the major manufacturers hope to build a platform for content services so as to control the entry point to the era of voice interaction. Sonos, a wireless audio equipment manufacturer, believes instead that it can integrate across platforms.

Wang Hanhua explained to NetEase Intelligence that Sonos positions itself as one part of the smart speaker industry chain, focusing mainly on the software and hardware experience: acoustic and sound design, sound quality, and interconnection. As for the OS that carries content services, Sonos will cooperate with manufacturers at home and abroad and may even open access to all platforms.

However, competition among dialogue operating systems in the Chinese market is fierce. Can multiple OSs really be connected to a single piece of hardware? Is this path viable? Until Chinese dialogue platforms truly reach scale, a big question mark remains.

"In the end, only two or three platform-level companies will remain, and those two or three OSs will gather a large number of application scenarios and hardware," Wang Hanhua believes. "As with mobile phones, future smart speakers will also come in different price tiers, ranging from 1,000 to 6,000 yuan."

VI. Music Copyright Disputes: The Future Is Hard, and Surviving Is the First Step

For smart speakers, the dispute over music copyright has become a key issue even before voice dialogue platforms have taken shape.

According to Liu Rui, NetEase's director of artificial intelligence, the most basic function of a smart speaker is playing music, which requires each manufacturer to make sure its speaker has a large enough music catalog. At present, domestic music copyrights are basically in the hands of Tencent, Alibaba, and NetEase, and other smart speaker makers have to buy second-hand licenses. But there is a thorny problem: when record companies such as Sony license music to Internet companies, the license covers playback in the app only. Playback on other products is not compliant.

It is reported that many smart speakers are still scraping the music interfaces and music libraries of apps without any copyright license at all, which plants a hidden danger for future mass shipments of smart speakers.

"From the music copyright point of view, only large companies can afford to make smart speakers. But for now it is about who survives first, and after that it depends on users' demand for voice interaction." Liu Rui believes that smart speakers have only just started and that many difficulties lie ahead.

Wang Hanhua said, "Maybe smart speakers will become very vertical in the future. For example, vertical products such as music speakers and shopping speakers may appear for different scenarios, which would solve the problems of content services and revenue sharing."

Postscript

NetEase Intelligence has released two special reports under the title "Smart Speaker Battle for the Beach". Standing at the forefront of the voice interaction trend, they review the development of smart speaker technology, the integration of content services, and the music copyright disputes, in an attempt to present a true picture of smart speaker products, technologies, and the market. Of course, whether China's smart speaker market can see a real breakout depends on the sales channels and user experience of the products that follow. NetEase Intelligence will publish further follow-up reports on the experience and development of smart speakers. We welcome everyone to keep following us (official account: smartman163).
