How is a sound interpreted by a computer?

How is a song interpreted by the computer?

I understand that an image is a matrix of pixels, in which each element is a colour.

When I open a song in R, it is presented as two sequences (left and right) of frequencies, as shown in the image.

Is it correct to state that a song is an array with two columns and the number of rows being some unit of time?

[Image: plot produced by R showing the two sequences of values]

  • Great question! I don't have an answer, but I found some material (in English) that seems to cover the subject in depth, which I leave as a reference. By the way, I believe your interpretation is generally right, except that instead of "frequencies" I think this matrix contains "amplitudes".

  • Um... I would guess that they are the two channels of stereo sound. Other than that, I think each channel is represented by frequencies (or amplitudes) over time. I'm just not posting an answer because that's just my educated guess (I don't really know the details).

  • @brasofilo I know the answer; this is called audio fingerprinting. It is possible to recognize any piece of music using techniques like this - you must have already used the Shazam app on your phone, which does something similar. If you post a question about it I can try to explain it in detail...

  • @ederwander: I didn't know the name of the technique, thanks! I didn't know the app either, but I recently saw someone using it and thought "oh, YouTube tech".

  • Tip: there is a Digital Signal Processing - Sampling course provided by Unicamp on Coursera.

3 answers

This subject is very extensive and I will not go on at too much length, let alone into detail, especially on the physics side.

There are basically two forms in which sound is worked with on the computer: amplitude and frequency.

The amplitude representation is the most common (and the simplest), and it is the one I will focus on.

In nature, sound is nothing more than a continuous wave of pressure in a medium, normally air. This wave can be the composition of several other waves (think of two noises happening simultaneously); in the case of a song, it is the sum of the sounds of the various instruments.

Electronic devices are perfectly capable of capturing these continuous waves; microphones are nothing more than devices that do exactly that. However, the computer is terrible at handling continuous data. Note, for example, that real numbers have representation and precision problems on the computer.

So, after the microphone converts the sound to an electrical signal, your computer’s input device will interpret it and perform a process known as sampling.

Sampling is nothing more than measuring this signal at fixed time intervals. For example, in the case of audio CDs, sampling is performed at 44.1 kHz (44,100 times per second!) - if you want to know why this frequency was chosen, see the Nyquist-Shannon sampling theorem.

44,100 times per second, the computer takes a sample of that electrical signal and converts it to a numeric value (more on that in the next paragraph). This sample is nothing more than the signal's amplitude. An important characteristic of sampling is that it always distorts the signal, because there is loss of information. These losses can be minimised by increasing the sampling rate (studios, for example, use 192 kHz).
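To make the sampling idea concrete, here is a minimal sketch in base R (the tool mentioned in the question); the 440 Hz tone and the 10 ms duration are arbitrary choices for illustration.

    # Sample a "continuous" 440 Hz sine wave at 44.1 kHz (the CD rate)
    sample_rate <- 44100                                   # samples per second
    duration    <- 0.01                                    # 10 milliseconds
    t           <- seq(0, duration, by = 1 / sample_rate)  # instants at which we measure
    amplitudes  <- sin(2 * pi * 440 * t)                   # amplitude measured at each instant

    length(amplitudes)   # 442 discrete samples describe these 10 ms of audio
    head(amplitudes)     # the first few amplitude values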

Audio devices are usually able to work with various sample representations. The most common are: 16-bit integers, signed or unsigned; and floating point.

In each representation, some specific values have specific meanings, and have different manipulation characteristics:

  • Unsigned integer: the value in the middle of the range (2^15, for 16 bits) represents silence ("mute"). Its advantage is that manipulation is very fast (it depends only on the ALU), but the fact that silence is an "offset" value must be taken into account in several manipulations, including volume change, normalization, equalization, etc.

  • Signed integer: the value 0 represents silence. Easy handling and speed are its advantages, but some tasks are still complicated or require type conversions and special-case handling, making this type simpler to work with than the unsigned integer, but still not as simple as floating point.

  • Floating point: the value 0 represents silence. It is fast (modern computers have no problem working with floating point, except when performing divisions) and simple to work with. The maximum sample value of this type is 1 and the minimum is -1.
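To illustrate how these representations relate to each other, here is a small hedged sketch in R that maps signed 16-bit samples to the floating-point range [-1, 1] and back; dividing by 32768 is one common convention, not the only one.

    # Signed 16-bit samples: silence is 0, range is -32768 .. 32767
    int_samples   <- c(0L, 16384L, -16384L, 32767L, -32768L)

    # One common convention to map them to floating point in [-1, 1]
    float_samples <- int_samples / 32768

    # ...and back, rounding and clamping to the valid integer range
    back <- pmax(pmin(round(float_samples * 32768), 32767), -32768)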

Operations such as volume, equalization and normalization, operate on these samples.

Volume, for example, can be thought of as a real coefficient between 0 and 1 that multiplies each sample value, keeping it as it is (1), muting it (0), or producing some intermediate value.

Normalization and equalization are more complex. There are several normalization methods, which consider either the peak amplitude of the whole audio track or the signal power in a sliding window. Equalization usually involves a Fourier transform, which decomposes a wave into its frequency components.
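Using floating-point samples, here is a hedged sketch of the two simplest operations mentioned above: volume as a multiplying coefficient and peak normalization (the windowed, power-based methods and equalization are not shown). The 440 Hz test tone is just an illustration.

    # A short test signal in floating point ([-1, 1], 0 = silence)
    sample_rate <- 44100
    t       <- seq(0, 0.5, by = 1 / sample_rate)
    samples <- 0.4 * sin(2 * pi * 440 * t)

    # Volume: multiply every sample by a coefficient between 0 and 1
    half_volume <- 0.5 * samples

    # Peak normalization: scale so the loudest sample reaches full scale (1)
    normalized  <- samples / max(abs(samples))

    max(abs(half_volume))  # 0.2
    max(abs(normalized))   # 1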

Stereo Sound

It is nothing more than two recorded sounds, one for the left channel, the other for the right.
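That maps directly onto the two-column picture in the question. A small sketch, with made-up left and right signals, showing stereo audio in R as a matrix whose columns are the channels:

    sample_rate <- 44100
    t     <- seq(0, 1, by = 1 / sample_rate)
    left  <- sin(2 * pi * 440 * t)   # 440 Hz tone on the left channel
    right <- sin(2 * pi * 220 * t)   # 220 Hz tone on the right channel

    stereo <- cbind(left, right)     # one column per channel, one row per sample
    dim(stereo)                      # 44101 rows, 2 columns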

Representation in File

The simplest representation is so-called raw PCM, which stores each amplitude value one after the other. Because of that, extra information is needed to play the sound back without distorting it: the number of channels (stereo?), the sampling rate and the sample format (signed integer? floating point?).

The WAVE format is almost as simple as raw PCM, except that it adds headers and standardizes (at least within Microsoft products) the representation of uncompressed audio (as far as I know, compression support was never implemented by Microsoft within this format). The WAVE format is based on the container called RIFF, which is extensible and even supports metadata such as artist, title, etc.
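If you want to see exactly those pieces of information (channels, sampling rate, sample format) in R, one possibility - assuming the tuneR package is installed and that "musica.wav" is a hypothetical WAVE file you have - is:

    library(tuneR)

    w <- readWave("musica.wav")   # hypothetical file name

    w@samp.rate      # sampling rate, e.g. 44100
    w@bit            # bits per sample, e.g. 16
    w@stereo         # TRUE for two channels
    length(w@left)   # number of samples in the left channel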

Other formats like MP3, AAC, OGG, FLAC, etc., are much more complex and each requires a chapter to describe its basic functioning.

Lossless and Lossy

Even discretized (sampled), the sound can still be too large to store and distribute. Note, for example, that a 700 MB audio CD holds only 70 to 80 minutes of audio. But you can fit well over 5 hours on the same CD if you encode the audio as MP3.
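A quick back-of-the-envelope calculation makes those numbers plausible: uncompressed CD audio takes 44,100 samples per second x 2 channels x 2 bytes per sample, and a common MP3 bitrate is around 128 kbit/s (both figures here are just assumptions for the estimate).

    bytes_per_second <- 44100 * 2 * 2             # sample rate x channels x bytes per sample
    mb_per_minute    <- bytes_per_second * 60 / 1024^2
    mb_per_minute                                 # ~10 MB per minute, uncompressed
    700 / mb_per_minute                           # ~69 minutes fit on a 700 MB CD

    mp3_mb_per_minute <- 128 / 8 * 60 / 1024      # MP3 at 128 kbit/s
    700 / mp3_mb_per_minute / 60                  # roughly 12 hours of MP3 on the same CD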

Audio compression techniques are divided into two major groups: Lossless and lossy.

Lossless means "without losses": the decompressed sound is exactly the same as the sound before compression. This characteristic is shared by formats such as AIFF and FLAC. Drawing a parallel with images, PNG and BMP are lossless encodings.

Lossy formats usually accept small losses in frequency bands that are not relevant to their purpose and take psycho-acoustic models into account to fool our ears. A normal ear can perceive frequencies of up to 20 kHz, so frequencies above this value do not need to be encoded and can be discarded without affecting the perceived quality. Many ears cannot even tell the difference. As a result, the decompressed (or decoded, if you prefer) sound is not identical to the original, but its size is much smaller than both the original and the lossless version. Formats of this type include MP3, OGG, WMA and AAC. In the field of images, they can be compared to the JPG format.
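As a toy illustration of the idea of discarding inaudible frequencies (real codecs are far more sophisticated than this), here is a hedged sketch that removes everything above 20 kHz from a signal using R's fft; the 96 kHz sampling rate and the two test tones are arbitrary assumptions.

    fs <- 96000                                   # high sampling rate, so a 25 kHz tone can exist
    t  <- seq(0, 0.05, by = 1 / fs)
    x  <- sin(2 * pi * 1000 * t) +                # audible 1 kHz component
          0.5 * sin(2 * pi * 25000 * t)           # inaudible 25 kHz component

    X <- fft(x)
    N <- length(x)
    f <- (0:(N - 1)) * fs / N
    f <- pmin(f, fs - f)                          # actual frequency of each bin (mirrored half)
    X[f > 20000] <- 0                             # discard everything above 20 kHz

    x_filtered <- Re(fft(X, inverse = TRUE)) / N  # back to the time domain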

Conclusion

This can barely be called an introduction to the subject, but I tried to cover some concepts that are simple to understand and that will let you dig deeper. Many techniques used in image processing are also used in the processing of audio, video and electrical signals, and even in modelling the propagation of heat in a medium.

Personally, I am fascinated by the subject, although I am far from being someone with deep knowledge. ;D

I hope I helped clear up some doubts, and perhaps raised a few new ones too.

How is a song interpreted by the computer?

In the same way as a video is interpreted: by changing values over time.

To understand how this occurs, let’s establish some basic concepts:

Resolution per Unit

An easy way to understand the resolution per unit is by analyzing how an image is represented.

An image that uses only one bit per pixel would have only two colors, black and white:

[Image: the picture rendered with 1 bit per pixel, in black and white]

The same image, using 8 bits per color, can make use of more than 16 million combinations:

[Image: the same picture rendered with 8 bits per color]

The same concept applies to an audio file. The more bits you use to express a sample, the greater the number of amplitude values that can be represented:

[Image, top to bottom: original waveform, 16-bit sampling, 8-bit sampling]
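To see the same effect in numbers rather than pictures, here is a small hedged sketch that quantizes the same waveform to 8-bit and 16-bit levels (256 versus 65,536 possible amplitude values); the quantize helper is an illustrative function, not a library routine.

    t <- seq(0, 0.002, by = 1 / 44100)
    x <- sin(2 * pi * 440 * t)                  # "continuous" amplitudes in [-1, 1]

    quantize <- function(x, bits) {
      levels <- 2^(bits - 1)                    # levels available for each sign
      round(x * (levels - 1)) / (levels - 1)    # snap each sample to the nearest level
    }

    x8  <- quantize(x, 8)                       # only 256 possible values
    x16 <- quantize(x, 16)                      # 65,536 possible values

    max(abs(x - x8))                            # noticeable quantization error
    max(abs(x - x16))                           # much smaller error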

Frequency

However, the bit resolution of a sample is not everything. How often samples are taken is also important. Again, we will use images to represent this concept:

[Animation at 5 frames per second: Junkie XL, Elvis Presley - A Little Less Conversation (Elvis vs JXL)]

This video is a sequence of images presented at 5 frames per second. Now the same video at 25 frames per second:

[The same animation at 25 frames per second]

The same happens with an audio file. The higher the collection frequency (the sampling rate), the more data there is describing a given period of time.

[Image: waveform with a sample collected every 22 microseconds]

In the example above, a sample is collected every 22 microseconds - that is 44,100 per second.

I understand that an image is an array of pixels, in which each element is a color

Exactly. An audio file is therefore like a video, where the representation changes over time. However, unlike a video file, which has thousands (or in some cases millions) of pixels changing up to 30 times per second (in the case of a 30 FPS video), the audio file represents only one 'pixel' - which can change thousands of times per second.

When I open a song in R, it is presented as two sequences (left and right) of frequencies [...]

This is because you opened a stereo file, which contains the left and right channels. The same occurs in stereoscopic (3D) videos - two sequences are presented at the same time:

https://www.youtube.com/watch?v=gQ6vJGLfMxs

A 7.1 audio file would contain 8 channels.

Is it correct to state that a song is an array with two columns and the number of rows being some unit of time?

For a two-channel (stereo) file, yes. The number of rows would be the number of samples the file has.

A mono file would contain only one sequence.
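Putting numbers on that: the number of rows is the sample rate multiplied by the duration. A hedged sketch for a hypothetical 3-minute stereo song sampled at 44.1 kHz:

    sample_rate <- 44100          # samples per second
    duration    <- 3 * 60         # a 3-minute song, in seconds
    channels    <- 2              # stereo

    rows <- sample_rate * duration
    rows                          # 7,938,000 rows
    rows * channels               # ~15.9 million amplitude values in total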

@Start gave you an excellent initial overview of the entire audio capture and conversion process (+1). Still, regarding your question:

Is it correct to state that a song is an array with two columns and the number of rows being some unit of time?

Yes: if the audio is stereo it will be a two-column matrix; if it is mono it will be a vector. And yes, the number of rows is literally the audio's duration - I will get to that point soon.

Now that you know how audio is represented and how the sampling process works, it is possible to draw some conclusions from the figure you presented just by looking at it:

  1. The audio is stereo, because the figure shows two channels (left and right) - it is literally a matrix with two columns. If it were mono it would be just a vector of size N on the x axis, where N relates to the audio's duration. It is generally easier and faster to manipulate vectors (mono audio), and for this reason it is common to mix (join) the two columns of the matrix when some manipulation needs to be performed on the audio. Manipulations can be of any kind, from increasing or decreasing the volume to more complex things like applying auto-tune to the signal, changing the speed, etc. In real-time systems, where speed really matters, this process is commonly used, because otherwise you would be submitting the signal twice for processing - yes, twice: the right channel and then the left channel. After mixing the channels, the processing is done only once, hence twice as fast (see the sketch after this list).

  2. The amplitude is not represented in floating point; it is a short int, the signed integer ("inteiro com sinal") explained in @Vinícius's answer. Therefore the amplitude values for this type can range from -32768 to 32767 - take a look at the y axis of the plot, where the amplitude goes beyond 20000 positive/negative.

  3. Time is on the x axis. You can't tell exactly, but it clearly goes past 150; by eye, guess that it reaches 200, i.e. 200 samples. If you know the rate at which the signal was sampled (the sample rate), it is possible to know the audio's duration. Guessing that the sample rate is 44100 Hz (check in R, that information will be there), the duration of your audio would be 200 (my guess for the extent of the x axis) / (sample rate), which is the same as length(signal) / sample rate. This gives you the total duration of your audio in seconds, and it is directly connected with the sampling process.
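A small sketch in the spirit of points 1 and 3 above, assuming stereo is a two-column matrix of samples like the one in the question and that 44100 Hz is the guessed sample rate: mix down to mono by averaging the channels and estimate the duration.

    # Hypothetical stereo matrix: one column per channel, one row per sample
    sample_rate <- 44100
    t      <- seq(0, 2, by = 1 / sample_rate)
    stereo <- cbind(left  = sin(2 * pi * 440 * t),
                    right = sin(2 * pi * 220 * t))

    # Point 1: mix the two columns into a single mono vector (simple average)
    mono <- rowMeans(stereo)

    # Point 3: duration in seconds = number of samples / sample rate
    nrow(stereo) / sample_rate    # ~2 seconds
    length(mono) / sample_rate    # same value for the mono vector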
  • Thanks for the clarifications. Do you know how the conversion from stereo to mono is done? Is it simply the average of the matrix rows?

  • @Danielfalbel take a look at this answer of mine here, which explains how the mixing between the two channels is done; it applies to the stereo -> mono conversion.
