How to detect when the person starts speaking using SpeechRecognition() in JavaScript

I’m using the SpeechRecognition API (native to each browser) to do voice searches on a website, and I noticed that Google can identify when the person starts talking (both with "Ok Google" and when the person clicks the button to talk). I tried to look at their code, but it is so minified and obfuscated that I can’t understand anything. I wanted to know if someone knows how to detect the person’s voice while they are talking into the microphone.

The idea would be to detect speech right after the start command.
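For reference, a `SpeechRecognition` instance itself fires events as detection progresses: `audiostart` when the microphone opens, `soundstart` when any sound is detected, and `speechstart` when that sound is recognized as speech. A minimal sketch (the function name and callback are assumptions; the `webkitSpeechRecognition` fallback covers Chrome):

```javascript
// Sketch: detect when the user starts speaking via the Web Speech API's
// speechstart event. Must run in a browser with microphone permission.
function watchSpeechStart(onSpeechStart) {
  var SR = window.SpeechRecognition || window.webkitSpeechRecognition;
  var recognition = new SR();
  recognition.continuous = true;
  // Fired when the microphone begins capturing audio.
  recognition.onaudiostart = function () { console.log('audio capture started'); };
  // Fired when sound recognized as speech is first detected.
  recognition.onspeechstart = function () { onSpeechStart(); };
  recognition.start();
  return recognition;
}
```

In a browser you would call `watchSpeechStart(function () { /* person started talking */ })`. These events only tell you when speech begins; to control the detection yourself you need a VAD, as described in the answer below.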

2 answers



You will need to develop a VAD (Voice Activity Detection)!

I have developed some VADs with satisfactory results; the methods I know and have tested are:

  • Zero Crossing Rate - Detects how many times the voice signal crosses the X axis. A low rate of crossings indicates speech is present; a high rate indicates no speech.
  • Energy - Detects the decibels/RMS of the signal. It is one of the simplest methods, but it has serious false-positive problems.
  • Band-pass pitch filter - Applies filters to the signal to capture only the human voice range. The human voice can produce sounds between roughly 80 and 1100 Hz, i.e. a broad spectrum of frequencies, which makes things more complicated.
  • Besides applying filters, it is important to capture the frequencies of each processed frame (pitch tracking). This will help a lot in some decisions and can refine your results when compared against the results of the other techniques.
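The first two techniques above take only a few lines each. A minimal sketch in JavaScript (sample values assumed normalized to [-1, 1]; the function names and thresholds are illustrative, not tuned):

```javascript
// Fraction of adjacent sample pairs in a frame whose signs differ.
function zeroCrossingRate(frame) {
  var crossings = 0;
  for (var i = 1; i < frame.length; i++) {
    if ((frame[i - 1] >= 0) !== (frame[i] >= 0)) crossings++;
  }
  return crossings / (frame.length - 1);
}

// Root-mean-square energy of a frame.
function rmsEnergy(frame) {
  var sum = 0;
  for (var i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return Math.sqrt(sum / frame.length);
}

// Combined per-frame decision: enough energy AND few crossings.
function isSpeech(frame, energyThreshold, zcrThreshold) {
  energyThreshold = energyThreshold || 0.02;
  zcrThreshold = zcrThreshold || 0.3;
  return rmsEnergy(frame) > energyThreshold &&
         zeroCrossingRate(frame) < zcrThreshold;
}
```

In practice you would feed these functions frames of a few hundred samples taken from the microphone (e.g. via the Web Audio API) and tune the thresholds against your own noise floor.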

Many algorithms use only the Zero Crossing Rate information; see a plot of this technique:

[Plot: signal amplitude compared with the Zero Crossing Rate contour]

The comparison between the amplitude of the signal and the contour of the axis crossings is clearly visible: in the image, the peaks of the ZCR (Zero Crossing Rate) fall exactly where speech is not present, the reciprocal of the amplitude, which is closely tied to the signal energy.

If you combine the techniques described here you will achieve good results. You will need to define thresholds for noise, frequencies, and axis crossings, as well as the duration in seconds or milliseconds of considerable silence (the person may be speaking a sentence with pauses between words).

Of course, we are talking about real-time processing: for each processed frame it is necessary to apply three or more techniques. The great advantage is that they are not complex and are computationally cheap, which will allow you to know where to cut the beginning and end of each sentence or word.
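The "considerable silence" idea above is usually implemented as a hangover: an utterance only ends after N consecutive silent frames, so pauses between words don't cut it off. A sketch of such a segmenter (the function name and per-frame boolean interface are assumptions; each call consumes one speech/silence decision from a per-frame VAD):

```javascript
// Sketch: turn per-frame speech/silence flags into utterance
// 'start'/'end' events, tolerating pauses up to hangoverFrames long.
function createSegmenter(hangoverFrames) {
  var inSpeech = false;
  var silentRun = 0;
  // Returns 'start', 'end', or null for each frame flag.
  return function (frameIsSpeech) {
    if (frameIsSpeech) {
      silentRun = 0;
      if (!inSpeech) { inSpeech = true; return 'start'; }
    } else if (inSpeech) {
      silentRun++;
      if (silentRun >= hangoverFrames) { inSpeech = false; return 'end'; }
    }
    return null;
  };
}
```

With, say, 10 ms frames and `hangoverFrames = 30`, speech only "ends" after 300 ms of silence, which is enough to bridge the gaps between words in a sentence.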

Just so you know, Google can understand "OK Google" because it has a speech recognition algorithm: whatever is said is transcribed into text. That is another, much more complex story...


I believe you can use this plugin for what you want.

<script src="//cdnjs.cloudflare.com/ajax/libs/annyang/1.1.0/annyang.min.js"></script>
<script>
    if (annyang) {
      // let's define the first command, which in your case would be "start"
      var commands = {
        'start': function() {
          $('#algo').animate({bottom: '-100px'});
        }
      };

      // add the commands to annyang
      annyang.addCommands(commands);

      // start listening, waiting for the commands
      annyang.start();
    }
</script>

https://www.talater.com/annyang/
