Design of Cloud TV System Based on Intelligent Speech Recognition

To improve the operability of smart TVs, this paper proposes a cloud TV system design based on intelligent speech recognition. The system adds voice input and cloud network technology to a conventional smart TV, so that the TV can be operated after the speech is intelligently processed. Users can find and invoke TV functions automatically through voice input, which improves the operability of the smart TV, makes it more convenient to use, and makes it suitable for more people.

At present, with the rapid development of computer and Internet technologies, the trend of 3C convergence, and the digitization of television, the TV set, as the core appliance for home entertainment, has begun to evolve into an intelligent multimedia network TV. The smart network TV is a multi-functional network terminal through which users can obtain a wealth of information and services, but as application functions increase, operation becomes more complicated. Faced with the complex functions and difficult operation of a smart TV, the user is given only a paper manual or an electronic document played on the TV in the form of Flash; there is no detailed navigation function or detailed description to guide the user. Since the TV is aimed at every consumer, many users are unclear about how to operate many functions, and some functions cannot even be found. Among today's intelligent electronic products, intelligent voice design is a hot topic: this technology improves the operability of electronic products and brings more convenience to users. Therefore, designing a TV system based on intelligent speech recognition, using voice to achieve fast navigation to various functions, information, services, and other applications, has become a top priority.

The proposed system is a cloud TV system based on intelligent speech recognition. Input voice data is transmitted to the TV system, which preprocesses the analog voice into a digital voice signal and sends the digital voice data to the cloud according to the requirements of each module. After the cloud performs intelligent semantic recognition, it returns specific control commands to the TV for processing.

1. Overall system design

The structure of the TV system is shown in Figure 1. The system is divided into three modules: voice input, TV system processing, and cloud processing. With a network connection, voice is recorded by the microphone, converted into a specific audio format, and transmitted by the voice module to the central server in the cloud. The cloud server compares the received speech against many acoustic models representing specific characters, producing a set of candidate characters for the input speech. Using a character-based language model, the server then generates candidate character sequences that represent the different possible readings of the input speech. From these character sequences, and according to a vocabulary and a vocabulary-based language model, the server generates candidate word sequences, determines which word sequence best matches the input speech, and transmits the chosen word sequence back to the terminal TV system over the network. The TV system then routes the result to the appropriate module for processing (different modules of the TV system provide different functions). The TV hardware uses a MIPS-architecture CPU running the Linux operating system; voice is captured through two MIC interfaces, and a standard network interface is used for network communication.
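As a rough illustration of this flow, the sketch below shows the terminal side: digitized voice plus a module identifier goes up to the cloud, and the recognized word sequence or control command comes back. All names here (CLOUD_HOST, recognize_on_cloud, the JSON framing) are hypothetical; the article does not specify the actual wire format.

```python
# Minimal sketch of the terminal-side flow; endpoint and framing are assumed.
import json
import socket

CLOUD_HOST, CLOUD_PORT = "cloud.example.com", 9000  # hypothetical endpoint

def recognize_on_cloud(pcm_audio: bytes, module_id: int) -> dict:
    """Send digitized voice plus the active module ID to the cloud server
    and wait for the recognized word sequence / control command."""
    with socket.create_connection((CLOUD_HOST, CLOUD_PORT)) as sock:
        header = json.dumps({"module": module_id,
                             "length": len(pcm_audio)}).encode() + b"\n"
        sock.sendall(header + pcm_audio)
        # The cloud replies with the best-matching word sequence and a
        # control command for the TV to execute.
        return json.loads(sock.makefile().readline())
```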

2. Speech recognition system design

2.1 Basic knowledge of speech recognition

Speech recognition technology, also known as Automatic Speech Recognition (ASR), aims to convert the lexical content of human speech into computer-readable input such as key presses, binary codes, or character sequences. It differs from speaker identification and speaker verification, which attempt to identify or confirm who is speaking rather than the lexical content of what is said.

A speech recognition system is essentially a pattern recognition system, and speech recognition generally takes two steps. The first step is the "learning" or "training" phase, whose task is to establish an acoustic model for the basic recognition units and a language model for grammatical analysis. The second step is the "recognition" or "test" phase: according to the type of recognition system, a suitable recognition method is selected, the speech feature parameters required by that method are extracted through speech analysis, the features are compared with the system's models according to certain criteria and distance measures, and the recognition result is obtained through a decision step.

2.2 Speech recognition system design

The block diagram of the speech recognition system is shown in Figure 2. First, the analog voice signal from the TV microphone is preprocessed, since the cloud requires a digital voice signal. Preprocessing is handled by a voice IC and includes pre-filtering, sampling and quantization, digitization, windowing, endpoint detection, and pre-emphasis. After the speech signal is preprocessed, the next important step is feature parameter extraction, whose purpose is to extract a time series of speech feature vectors from the speech waveform. The extracted features are sent to the TV operating system, which decides whether they need to be transmitted to the cloud server; the cloud server intelligently analyzes the received voice, transmits the result back to the TV terminal, and the corresponding function is executed.
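The following is a minimal sketch of three of the preprocessing steps named above (pre-emphasis, framing, windowing), assuming 16-bit PCM at a hypothetical 8 kHz sampling rate; the article does not give the voice IC's actual parameters.

```python
import numpy as np

def preprocess(signal: np.ndarray, fs: int = 8000,
               frame_ms: int = 25, step_ms: int = 10) -> np.ndarray:
    """Pre-emphasize, frame, and window a speech signal.
    Assumes the signal is at least one frame long."""
    # Pre-emphasis: y[n] = x[n] - 0.97 * x[n-1] boosts high frequencies
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    frame_len = int(fs * frame_ms / 1000)
    step = int(fs * step_ms / 1000)
    n_frames = (len(emphasized) - frame_len) // step + 1
    frames = np.stack([emphasized[i * step : i * step + frame_len]
                       for i in range(n_frames)])
    # Hamming window reduces spectral leakage at the frame edges
    return frames * np.hamming(frame_len)
```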

2.3 Cloud Server Intelligent Processing

The cloud server mainly analyzes and processes the digitized voice data. Because the system's functions are relatively complex and the speech processing workload is very large, the design is built on a cloud computing server, with speech analysis and processing performed on the server side. The intelligent processing focuses mainly on semantic analysis of keywords and phrases for the TV system, handling each TV module separately to complete the function the user wants. Using a cloud computing server reduces the hardware cost of the TV terminal and increases processing speed, achieving intelligent handling of user commands.

2.3.1 Transmission Protocol between TV and Cloud

For a given TV system, each module has its own specific keywords, so when data is transmitted to the cloud, the module identifier must be sent along with the corresponding voice data.
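The article does not specify the protocol's framing, so the sketch below is one plausible layout with hypothetical field sizes: a module ID, a payload length, and the raw voice data.

```python
import struct

def pack_request(module_id: int, voice_data: bytes) -> bytes:
    # '!HI' = network byte order, 2-byte module ID, 4-byte payload length
    return struct.pack("!HI", module_id, len(voice_data)) + voice_data

def unpack_request(packet: bytes):
    """Inverse of pack_request, as the cloud side would apply it."""
    module_id, length = struct.unpack("!HI", packet[:6])
    return module_id, packet[6:6 + length]
```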

2.3.2 Main methods of speech training and recognition

After the cloud receives the data, the voice data needs to be recognized. Speech training and recognition is a process of pattern training and pattern matching. Pattern training processes a large amount of training data according to certain rules to obtain model parameters that reflect the essential characteristics of the data, and combines the parameters obtained from the training data into a pattern library. Pattern matching compares an unknown input pattern with the patterns in the library according to certain rules, and finds the pattern with the highest similarity, i.e. the best match. There are many methods for training and matching; at present, the more common ones include dynamic time warping (DTW), the hidden Markov model (HMM), and artificial neural networks (ANN).
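Of the three methods listed, DTW is the simplest to illustrate. The sketch below is a textbook dynamic-programming formulation of matching an unknown feature sequence against a pattern-library entry, not the system's own implementation (which uses HMMs, see 2.3.3).

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Align two feature sequences (frames x dims) and return the
    cumulative distance of the best warping path; the library pattern
    with the smallest distance is the best match."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```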

2.3.3 Hidden Markov Chain Model

The system uses the Hidden Markov Model (HMM) to train on and recognize speech. The HMM uses a Markov chain to model the changing statistical properties of the signal; essentially, it is a probabilistic model of a double stochastic process. The first stochastic process is the transition between states governed by the Markov chain, and the second is the stochastic correspondence between each state and the possible observations. In practical applications, an observer of this double stochastic process cannot see the states directly, only the observation values, and can only infer the existence and characteristics of the states through that second random process. The human speech process is itself a double stochastic process: the speech signal is an observable time-varying sequence, a stream of phoneme parameters emitted by the brain according to grammatical knowledge and speech intent, and that underlying stream corresponds to the unobservable states in the HMM. The HMM models this double stochastic process well and captures both the local stationarity and the overall non-stationarity of the speech signal, making it an ideal model for describing speech signals.
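A minimal sketch of the double stochastic process described above: hidden states follow a Markov chain (transition matrix A), and each state emits observations with probabilities B. The standard forward algorithm below scores how well an observation sequence fits a given HMM; toy discrete observations are assumed, whereas a real recognizer would use continuous acoustic features.

```python
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """P(observation sequence | HMM) via the forward algorithm.
    pi: initial state probs (N,), A: state transitions (N, N),
    B: emission probs (N, M), obs: sequence of observation indices."""
    alpha = pi * B[:, obs[0]]            # initialize with first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate, then weight by emission
    return alpha.sum()
```

In recognition, this likelihood would be computed for each candidate word or syllable model, and the model with the highest score wins.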

2.3.4 Intelligent speech recognition

The keyword recognition system used here is based on large-vocabulary continuous speech recognition (LVCSR), as shown in Figure 3. In this structure, after the speech passes through a continuous-speech syllable recognizer, a corresponding N-Best word or syllable lattice is generated, and a keyword search algorithm then searches the lattice for keywords. The process can be divided into five steps. In the first step, the phonetic primitives are searched, i.e. the pinyin sequence corresponding to the input speech is obtained; continuous decoding yields an N-Best syllable sequence or syllable lattice. In the second step, a keyword table is selected according to the TV terminal's active function module. In the third step, the syllable sequence from the first step is compared against the keyword table, and a keyword search produces putative hits (words that may be keywords). In the fourth step, the confidence of each putative hit from the third step is analyzed against other knowledge sources, giving the keyword recognition result. In the fifth step, the keyword result from the fourth step is intelligently processed, and the final output is produced according to the specific TV system function module.
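A simplified sketch of steps two and three follows: scanning a recognized pinyin (syllable) sequence for entries of a per-module keyword table. The table contents and hit structure are illustrative only, reusing the "shezhi" and "halibote" examples given later in section 3.3.

```python
# Hypothetical per-module keyword tables: syllable tuple -> action name.
KEYWORD_TABLES = {
    "main_menu": {("she", "zhi"): "OPEN_SETTINGS"},
    "video":     {("ha", "li", "bo", "te"): "SEARCH_MOVIE"},
}

def find_hits(syllables, module):
    """Return putative hits: keywords whose syllable sequence appears
    contiguously in the recognized syllable stream. These would then be
    passed to confidence scoring (step four)."""
    hits = []
    for kw, action in KEYWORD_TABLES[module].items():
        for i in range(len(syllables) - len(kw) + 1):
            if tuple(syllables[i:i + len(kw)]) == kw:
                hits.append((kw, action, i))
    return hits

# Example: find_hits(["qing", "she", "zhi"], "main_menu")
# -> [(("she", "zhi"), "OPEN_SETTINGS", 1)]
```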

3. TV intelligent speech recognition processing software flow

3.1 Recording detection

The flow chart of the TV's intelligent speech recognition processing is shown in Figure 4. When voice control is needed, the recording key must be pressed first. The system then checks whether the network is connected and whether the microphone is working normally. If either check fails, the system will not record, and it prompts the user to check the network or the microphone.
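A sketch of these pre-recording checks follows. Both probes are placeholders: the real system would query its own server and its platform audio driver, and the device path used here is an assumption.

```python
import os
import socket

def network_is_connected(host="8.8.8.8", port=53, timeout=2.0) -> bool:
    """Crude connectivity probe; hypothetical target host and port."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

def microphone_is_available(dev="/dev/snd") -> bool:
    """On the Linux platform described, check that an audio capture
    device node exists; the actual device path is an assumption."""
    return os.path.exists(dev)

def can_start_recording() -> bool:
    if not network_is_connected():
        print("Please check the network connection")
        return False
    if not microphone_is_available():
        print("Please check the microphone")
        return False
    return True
```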

3.2 Recording processing

After the device checks pass, recording is performed. Due to system limitations, the recording has a time limit and cannot be too long. The TV terminal preprocesses the voice recorded by the microphone and extracts its features, then transmits the voice together with the module identifier to the cloud server; the cloud server performs detailed processing and transmits the result back to the terminal TV.

3.3 Intelligent function processing

The TV terminal waits to receive data; if no data is received within 5 seconds, it is regarded as a time-out and the data processing fails. If data is received, the corresponding processing is performed: the cloud maintains keyword tables for each module, and the returned data is interpreted and handled by the corresponding module. For example, in the main function interface, if the voice input is "shezhi" (settings), the system enters the settings interface; in the video interface, if the input is "halibote" (Harry Potter), the system searches for Harry Potter movies.
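The terminal-side wait-and-dispatch logic might look like the sketch below: a 5-second receive timeout as the article states, then per-module handling of the returned action. The handler functions and reply strings are illustrative placeholders.

```python
import socket

def open_settings_interface():            # placeholder UI hooks
    print("Entering settings interface")

def search_video_library(title: str):
    print(f"Searching video library for {title}")

def wait_and_dispatch(sock: socket.socket, module: str):
    sock.settimeout(5.0)                  # no data within 5 s => time out
    try:
        reply = sock.recv(4096).decode()
    except socket.timeout:
        print("Time out: data processing failed")
        return
    if module == "main_menu" and reply == "OPEN_SETTINGS":
        open_settings_interface()
    elif module == "video" and reply == "SEARCH_MOVIE":
        search_video_library("Harry Potter")
```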

4. Experimental application

Since conditions vary while the TV system is in use, the accuracy of speech recognition also varies. To obtain relatively accurate data, the test is divided into several cases: one compares the TV system with and without audio playing, and the other varies the length of the input voice.

4.1 Noisy-environment test

This test covers two situations: one with no audio playing (or audio muted), and one with audio playing (since the audio content varies during playback, the results are averaged over various noisy environments). The experimental results are shown in Table 1.

4.2 Input keyword length test

Since the system performs intelligent speech recognition and analysis, voice input determines the system's actions; the keys are the accuracy of speech recognition and of the intelligent processing, and the length of the input keyword has a direct bearing on that accuracy. This experiment analyzes inputs of different lengths. The experimental results are shown in Table 2.

The two tests show that the system's recognition accuracy is quite high, and the experiments achieved the expected results. The key point is that even in difficult environments, the system's keyword tables and post-recognition intelligent processing allow it to handle commands well.

5. Conclusion

The system is based on efficient speech recognition technology and a stable MIPS hardware platform, with software designed on the Linux operating system. Extending the smart TV system with cloud computing to process voice data gives the system good real-time performance. Testing shows that the system judges voice input accurately, processes data quickly, and is highly stable. The system realizes intelligent speech recognition in a TV system; operating the TV by voice greatly improves its operability, making it convenient to use and more intelligent.
