You will need to develop a VAD (voice Activity Detection) !
I have developed some with satisfactory results, the methods I know and have tested are:
- Zero Crossing Rate - It consists in detecting how many times the voice signal crossed the X axis, if it has low occurrence of crossings the speech this present, with high occurrence without speech found.
- Energy - It consists in detecting the decibels/rms, it is one of the simplest ways but with serious problem of false-positive.
- band Pitch Filter- Apply filters on the signal to capture only the human voice range, the human voice is able to reproduce sounds between 80 and 1100Hz, ie it is a broad spectrum of frequencies which makes things more complicated.
- Besides applying filters it is important to capture the frequencies of each processed frame (Pitch Track), this will help you and a lot in some decisions, can refine your results when faced with the result of other techniques.
Many algorithms use only Zero Crossing rate information, see a Plot of this technique:
It is visible the comparison between the amplitude of the signal with the contour of the axis crossing, in the image perceive the peaks of the ZCR(Zero Crossing rate) are exactly where the speech is not this present that is fully reciproco with the amplitude that is closely connected to the signal energy.
If you combine the techniques described here will achieve good results, you will need to define thresholds for noises, frequencies, axle crossings and time in seconds or milliseconds of considerable silence (the person may be speaking a sentence with pauses between each word).
Of course we are talking about real-time processing, for each processed frame it is necessary to apply three or more techniques, the great advantage is that they are not complex, are computationally efficient which will allow you to know where to cut the beginning and end of each sentence or word.
Just so you know google can understand "OK google" by having a speech recognition algorithm or whatever is said is transcribed in text, this is another story much more complex ....