Shenzhen ZTRON Microelectronics Co., Ltd
Telephone
0755-8299 4126

Personal consumer electronics

Smart speaker solution development


Smart speakers are a product of modern technology: speakers extended with voice recognition. They are widely used in homes, touch many aspects of daily life, and can fairly be said to have entered everyday routine. A smart speaker offers many functions that cover most day-to-day needs. A typical home smart speaker can set alarm clocks and play music on demand; once connected to the network, it can report the weather forecast, support online shopping, and make phone calls. It can also connect to third-party software and control smart household appliances, so that with a single spoken sentence the user can order takeout, book a service, hail a taxi, or reserve a table. Smart speakers can also switch modes for users of different ages to give more human answers; in children's mode, for example, the speaker adopts a friendlier tone that feels more familiar to children.


1. Introduction to smart speakers


Since Amazon launched its first smart speaker, the Echo, in 2014, smart speakers have sprung up everywhere. Abroad, Amazon, Google, Microsoft and Apple have successively released their own smart speakers; in China, companies such as Baidu, Alibaba, Tencent and Xiaomi have stepped into the field and released products one after another. The products of different manufacturers are noticeably homogeneous, but their focuses differ: JD.com and Alibaba are committed to improving their business ecosystems; Xiaomi is committed to building a smart home industry chain; Himalaya is committed to improving audio content and quality. There is still room for improvement in user experience and interactive entertainment, but with the development of the technology, smart speakers have good prospects, whether in the business ecosystem, the smart home industry chain, or audio resources.


Shenzhen smart speaker solution design company


2. The main technology of the smart speaker solution


The workflow of a smart speaker is voice wake-up, then internal processing, and finally output of the matching content. It mainly involves front-end signal processing, voice wake-up, voice interaction and related technologies.


1. Front-end signal processing


Front-end signal processing is the preparation before wake-up. While the speaker is working, the microphone array stays in a sound-pickup state; when sound is received, it is processed in four respects: speech detection, noise reduction, sound source localization, and beamforming.


Speech detection finds the starting position of the speech segment in the audio signal and filters out irrelevant non-speech signals, separating speech from non-speech segments. Noise reduction lessens the impact of noise on recognition and includes acoustic echo cancellation and dereverberation. The real environment contains many kinds of noise; noise reduction suppresses this interference and improves the signal-to-noise ratio. Because indoor speech is reflected many times by walls and other surfaces, the collected sound is a mixture, so dereverberation is applied. Sound source localization determines the user's position from the microphone array; it can drive azimuth indicator lights to enhance interaction, and it also serves as a precursor to beamforming by supplying the spatial filtering parameters. Beamforming uses spatial filtering to combine the signals from multiple microphones into one channel, enhancing the target speech signal and suppressing signals from other directions.
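As an illustration, the speech-detection stage can be sketched as a short-term energy gate. This is a toy stand-in, not a production detector (real front ends combine energy, zero-crossing rate, and model-based detectors); the frame length and threshold are arbitrary assumptions:

```python
import math

def detect_speech_frames(samples, frame_len=160, threshold=0.01):
    """Flag each frame whose short-term energy exceeds a threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        flags.append(energy > threshold)
    return flags

# 50 ms of silence followed by 50 ms of a 440 Hz tone at 16 kHz:
silence = [0.0] * 800
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(800)]
flags = detect_speech_frames(silence + tone)  # 5 False frames, then 5 True
```

Only the frames containing the tone exceed the energy threshold, so the speech segment's starting frame can be read off the flag sequence.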


2. Voice wake-up


Voice wake-up, also called keyword detection, detects target keywords in continuous speech; the number of target keywords is generally small. Wake-up performance is measured by the wake-up rate and the false wake-up rate, where the wake-up rate is the probability of detecting a wake word that actually occurs in the continuous speech stream. Commonly used implementations are DNN+HMM (deep neural network + hidden Markov model) and LSTM+CTC (long short-term memory network + connectionist temporal classification). Current open-source wake-up solutions typically provide an SDK, with the wake-up function divided into online and offline versions. In China, iFlytek is the main representative; there are also various small open-source speech recognition engines that implement a standalone wake-up function, though their performance is uneven.
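Whatever acoustic model produces the per-frame wake-word scores, a smoothing-and-threshold decision usually sits on top of them. A minimal sketch, where the window size and threshold are illustrative assumptions rather than values from any particular product:

```python
def smoothed_wake_decision(posteriors, window=3, threshold=0.8):
    """Return the frame index at which the moving average of per-frame
    wake-word posteriors first crosses the threshold, or -1 if it never does."""
    for i in range(window - 1, len(posteriors)):
        avg = sum(posteriors[i - window + 1:i + 1]) / window
        if avg >= threshold:
            return i
    return -1
```

Raising the threshold lowers the false wake-up rate but also lowers the wake-up rate, which is exactly the trade-off the two metrics above describe.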


3. Voice interaction


Voice interaction includes speech recognition, natural language understanding, dialogue management, natural language generation and speech synthesis.


Speech recognition technology, also known as automatic speech recognition (ASR), converts speech into text. The instructions the user gives are spoken, but speech cannot be analyzed directly and must first be converted into text. With the application of deep neural networks, the use of big data, and the spread of cloud computing, speech technology has entered daily life, in products such as iFlytek's engines, Alibaba's AliGenie, and Himalaya's Xiaoya.


The purpose of natural language understanding is to convert natural language into a form that computers can easily process: after receiving an instruction, identify the domain the user's command belongs to, then identify the user's intent within that domain, and finally perform entity extraction to determine the parameters of the intent. Current NLP algorithms for this are based on machine learning and draw on various language processing datasets, covering Chinese word segmentation, part-of-speech tagging, entity recognition, syntactic analysis, and automatic text classification.


Dialogue management is essential for continuous, multi-turn interaction. The usual approach is to keep the parameters parsed in the previous turn as global variables and carry them into the next turn; based on the current turn and certain conditions, the system decides whether to keep the fields from the previous turn or to clear the context.
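The carry-over-or-clear behavior just described can be sketched with a small context object; the clearing rule used here (reset whenever the domain changes) is an illustrative assumption:

```python
class DialogueContext:
    """Carry slot parameters across turns; clear them when the domain changes."""

    def __init__(self):
        self.domain = None
        self.slots = {}

    def update(self, domain, slots):
        if domain != self.domain:
            self.slots = {}          # topic changed: drop stale context
        self.domain = domain
        self.slots.update(slots)     # same topic: inherit earlier parameters
        return dict(self.slots)
```

A follow-up such as "What about tomorrow?" in the weather domain then inherits the city parsed in the previous turn, while switching to a music request clears the weather fields.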


Natural language generation gives computers expressive and writing abilities comparable to a human's: from some key information and its internal machine representation, a planning process automatically produces a high-quality natural language text. Speech synthesis, also called text-to-speech (TTS), lets the smart speaker read any given text aloud like a person. The main synthesis methods are parametric synthesis and concatenative (splicing) synthesis. Parametric synthesis is computationally light and flexible to deploy but sounds less natural; concatenative synthesis is close to human pronunciation but demands substantial storage and computing resources, so it can generally only be synthesized online.


4. Other key technologies


In addition to the key technologies above, there are technologies that are relatively mature but not yet widely used in smart speakers, such as voiceprint recognition, face detection, and face recognition. Voiceprint recognition, used for example in payment, characterizes a person's vocal and behavioral traits from the speech waveform. Face recognition is similar and can likewise confirm a user's identity. Face detection, on a speaker equipped with a camera, determines the user's position, enabling better interaction design and assisting sound source localization.

Smart speaker solution developer


3. Speech recognition technology of smart speakers


At present, most Internet companies have launched their own smart speakers, so the market keeps growing and competition keeps intensifying. Today's smart speakers differ little in appearance, so users pay more attention to performance, which is mainly reflected in voice interaction capability, response speed, and accuracy.


Realizing the many functions of a smart speaker usually requires several technologies working together. When the user utters a speech signal, the speaker must first receive it; this is where microphone array technology comes in. A speaker generally has 7 to 8 built-in microphones, which lets it correctly receive speech from multiple directions and suppress the effects of echo and noise. After the speech signal is obtained, it must be processed so that the machine can "understand" natural language, which involves natural language processing and speech recognition technology. Finally, the result the speaker computes must be synthesized back into a speech signal, which uses speech synthesis technology. Among all these technologies, the core is speech recognition.


Speech recognition technology in smart speakers is very complex, integrating psychology, linguistics, statistics and other disciplines. To study it, one needs to start from each important step. Here we mainly examine three parts: preprocessing, feature extraction, and training and recognition.


(1) Preprocessing


A speech signal is usually accompanied by environmental noise, which strongly affects recognition, so this noise must be removed first. The frequency of a speech signal is largely confined to a known range, so anti-aliasing filtering is used to separate the noise portion of the spectrum from the target speech signal and recover the target signal. At the same time, the analog signal is converted to a digital one.


In addition, because the power of the target signal is small while the noise power is large, noise dominates much of the input. Pre-emphasis is therefore applied to the target signal to raise its energy, essentially boosting the amplitude so that it is easier to distinguish from the noise.
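Pre-emphasis is commonly implemented as a first-order high-pass filter, y[n] = x[n] − α·x[n−1]; a minimal sketch with the conventional α = 0.97:

```python
def pre_emphasis(samples, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1], boosting the weaker
    high-frequency part of the speech signal."""
    out = [samples[0]]          # first sample passes through unchanged
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out
```

A slowly varying (low-frequency) input is attenuated almost to zero, while rapid sample-to-sample changes pass through nearly intact, which is exactly the boost described above.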


Endpoint detection is another important part of preprocessing. Environmental noise is present all the time, but the voice signal exists only for a limited period. Endpoint detection determines the starting position of the voice signal so that noise from non-voice periods is not mixed in. Short-term average amplitude and short-term average zero-crossing rate are two commonly used endpoint detection methods.
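Of the two methods just named, the short-term zero-crossing rate is the simpler to sketch; voiced speech tends to have a low rate, while noise-like segments have a high one:

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

zero_crossing_rate([1.0, -1.0, 1.0, -1.0])  # noise-like: rate 1.0
zero_crossing_rate([0.2, 0.4, 0.5, 0.3])    # voiced-like: rate 0.0
```

In a practical endpoint detector, this rate is combined with the short-term average amplitude, since neither feature alone separates speech from noise reliably.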


In addition, current speech recognition software falls into two modes. One is manual interception of speech, as with Siri on Apple phones, where the user must press and hold a specific button to capture the speech signal. The other is automatic interception, adopted by most smart speakers on the market, though with somewhat lower accuracy; generally the user speaks a specific trigger phrase before speech collection starts. With Xiaomi's Xiao Ai, for example, the command format is "Xiao Ai" + your question. Broadly speaking, the purpose of the preprocessing stage is to eliminate noise and lay the groundwork for the computer to understand natural language later.


(2) Feature extraction


The first step after collecting the speech signal is feature extraction, which divides the signal into multiple segments, extracts characteristic parameters of practical significance, and compiles statistics on them. The extracted features can represent the original segment, because unnecessary portions are discarded; feature extraction is thus also a form of data compression that simplifies subsequent computation. Recognition built on these features commonly uses the hidden Markov model, which contains hidden, unobservable parameters: here those hidden parameters are the semantics carried by the signal. Since the semantics strongly shape the speech signal, it is possible to work backwards from changes in the observable signal to infer the hidden semantics.
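The segmentation step can be sketched as overlapping framing, the usual precursor to computing per-frame features; the 25 ms window and 10 ms hop at 16 kHz used here are conventional illustrative values:

```python
def frame_signal(samples, frame_len=400, hop=160):
    """Split a waveform into overlapping frames: 25 ms windows with a
    10 ms hop at 16 kHz, the segmentation done before per-frame features."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]
```

Each frame is short enough to be treated as roughly stationary, so a compact parameter vector computed per frame can stand in for the raw samples in later stages.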


(3) Training and identification


Currently, speech recognition in smart speakers is highly accurate, but this relies on large amounts of data and training. Training the recognition network is equivalent to training the computer, with every user acting as a trainer. Extensive training and statistical computation produce answers that satisfy users in most cases; in this way, computers can carry out normal human-machine interaction without truly understanding natural language.


Deep learning is an important part of training the recognition network and is the key to artificial intelligence's self-learning. A defining feature of deep learning is multi-level computation: information is processed in stages, and the output of each layer serves as the input of the next, which is where the "depth" comes from. In practice, the number of layers must be controlled: with too few layers, the self-learning effect is poor, while too many make computation cumbersome and inefficient. In speech recognition, deep learning mainly learns the characteristics of the speech signal, which are then compared against the data used to train the recognition network to produce the final result.
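The layer-feeds-layer structure described above can be sketched as a tiny forward pass; the weights and layer sizes here are arbitrary toy values, not a trained model:

```python
def dense(vec, weights, biases):
    """One fully connected layer: weighted sums plus biases."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, biases)]

def relu(vec):
    return [max(0.0, x) for x in vec]

def forward(vec, layers):
    """Each layer's output becomes the next layer's input -- the 'depth'."""
    for weights, biases in layers:
        vec = relu(dense(vec, weights, biases))
    return vec
```

Adding a (weights, biases) pair to `layers` deepens the network by one level, which is exactly the trade-off between learning capacity and computation discussed above.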

Shenzhen smart speaker circuit board manufacturer


4. Deficiencies and Improvements of Speech Recognition Technology


1. Defects of current speech recognition technology


Although speech recognition technology has been widely used, it still has many shortcomings, mainly as follows.


1) The uncertainty of natural language. Natural language depends on semantics, context, and so on, and is therefore highly uncertain. Existing artificial intelligence is mostly top-down: programmers first lay down the rules by which the computer understands language. If the programmers make mistakes, the computer misunderstands natural language. Writing every grammatical rule into a program might allow a computer to understand language, but there are so many rules that writing them all down is practically impossible.


In addition, natural language carries a great deal of information. Depending on the situation, a word can be commendatory or derogatory, and the surrounding sentences strongly affect what a sentence actually means. Take the sentence "Help me.": it omits details of the subject and object, yet with context before and after it, people understand it easily. A machine doing speech recognition, however, cannot handle such elliptical sentences. The uncertainty of natural language greatly hinders the progress of speech recognition.


2) Environmental interference. Environmental noise, especially in public places, has a huge impact on recognition. In such environments, it is difficult for the computer to receive a usable speech signal, which greatly limits where speech recognition can be applied.


3) Non-standard pronunciation. The vocabulary keeps growing, and similar pronunciations are common, yet machines find such pronunciations hard to distinguish. In particular, some words blend with the pronunciation of the preceding word; when a user speaks quickly, the computer struggles to recognize them.


2. Improvement Direction of Speech Recognition Technology


To sum up, this article believes that the important improvement directions of speech recognition are as follows.


1) Targeting specific domains. Natural language is very complex, so building comprehensive speech recognition is very difficult. Research has found, however, that certain words appear very frequently and fairly fixedly within specific domains. Building speech recognition systems domain by domain is therefore currently a practical and valuable approach; eventually, the various domain systems can be combined into a relatively complete one.


2) Dynamic semantic analysis. Current speech recognition analyzes only a single sentence and cannot analyze dynamically by linking the user's earlier and later utterances. Future speech recognition could re-analyze in each new context during a question-and-answer exchange and predict the semantics of the incoming speech. Such improvements would turn computers from mere recipients of language into genuine users of it, making communication between humans and machines more natural.

Smart speaker PCBA price


5. Development direction of smart speakers


So far, there are many kinds of smart speakers on the market and the technology is maturing, but some factors still restrict development. The homogeneity among smart speaker brands dampens consumers' willingness to buy. In addition, the skills smart speakers provide fall far short of people's actual needs, either because they are underdeveloped or because too few third-party service platforms have been connected.


In the future, with the development of the Internet of Things, smart speakers will develop fully in hardware, software, and platforms. On the hardware side, the effort is to build smart homes and form an industry chain; on the software side, to explore personalized needs and expand product functions to cover life from entertainment to shopping, home furnishing, and social networking; on the platform side, to bring various third-party services onto smart speakers, grafting services onto different scenes of life to meet people's daily needs. Technically, sound quality should be improved, speech recognition accuracy raised, the human-computer interaction experience optimized, and a complete industry chain created.


Summary


Current speech recognition technology is not yet complete, but smart speakers built around it are already sufficient to meet people's needs. The continuous improvement of the underlying technologies and the growing demand for smart products point the way for the development of speech recognition. As the market expands, companies are bound to compete harder over speech recognition, accelerating its development. Through iterative updates, smart speakers will pay ever more attention to user experience and become an indispensable device in family life.


At present, smart speakers are still in a development stage. As the technology develops, a commercial ecosystem, a smart home ecosystem, and rich audio resources will be built, and more personalized services will be offered. Smart speakers will penetrate every aspect of people's lives, bringing more convenience and fun.


The above are the details of the smart speaker solution introduced for you by Shenzhen ZTRON Microelectronics Co., Ltd. If you need electronic function development for voice speakers, you can entrust it to us. We have rich experience in customized development of electronic products, can quickly evaluate the development cycle and IC price, and can also calculate PCBA quotations. We act as agents for many domestic and foreign chips, including MCUs, voice ICs, Bluetooth ICs and modules, and Wi-Fi modules. Our development capabilities cover software and hardware design such as PCB design, microcontroller development, custom software development, custom APP development, and WeChat official account development. We also undertake R&D of smart electronic products, household appliance design, beauty device development, IoT application development, smart home solution design, TWS solution development, Bluetooth audio development, children's toy development, and electronic education product development.
