This subject is very extensive, so I will not go on at length, let alone into detail, especially on the physics side.
There are basically two representations in which sound is manipulated on a computer: amplitude and frequency. Amplitude is the most common (and the simplest), and it is the one I will focus on.
In nature, sound is nothing more than a continuous wave of pressure variations in a medium, normally air. This wave can be the composition of several other waves (think of two noises happening simultaneously); in the case of a song, it is the sum of the sounds of the various instruments.
Electronic devices are perfectly capable of capturing these continuous waves; microphones are nothing more than devices for exactly that. The computer, however, is terrible at handling continuous data. Note, for example, that real numbers have representation and precision problems on a computer.
So, after the microphone converts the sound to an electrical signal, your computer’s input device will interpret it and perform a process known as sampling.
Sampling is nothing more than measuring this signal at fixed time intervals. For example, in the case of audio CDs, sampling is performed at 44.1 kHz (44,100 times per second!) - if you want to know why this frequency was chosen, see the Nyquist-Shannon sampling theorem.
44,100 times per second, the computer takes a sample of that electrical signal and converts it to a numeric value (more on that below). This sample is nothing more than the signal's amplitude at that instant. An important characteristic of sampling is that it always distorts the signal, because information is lost. These losses can be minimized by increasing the sampling rate (studios, for example, use 192 kHz).
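As a minimal sketch of what sampling produces, the snippet below generates one second of a pure tone at the CD rate. The 440 Hz frequency and the use of numpy are illustrative assumptions of mine, not something from the original answer:

```python
import numpy as np

RATE = 44100                # samples per second (CD quality)
FREQ = 440.0                # an arbitrary test tone (A4)

t = np.arange(RATE) / RATE              # one second of sample instants
samples = np.sin(2 * np.pi * FREQ * t)  # amplitude at each instant, in [-1, 1]
print(len(samples))                     # 44100 samples for one second of audio
```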
Audio devices are usually able to work with various sample representations. The most common are: 16-bit integers, signed or unsigned; and 32-bit floating point.
In each representation, certain values have specific meanings, and each has different manipulation characteristics:

Unsigned integer: the value 2^15 (the midpoint of the range) is the "mute" value, i.e. silence. Its manipulation is very fast (it depends only on the ALU), but the fact that silence is an "offset" value must be accounted for in several manipulations, including volume change, normalization, equalization, etc.

Signed integer: the value 0 is the "mute" value. Easy handling and speed are its advantages, but some tasks are still complicated or require conversions between types and special cases to be handled. This makes it simpler to work with than the unsigned integer, though still not as simple as floating point.

Floating point: the value 0 is the "mute" value. It is fast (modern computers have no problem working with floating point, except perhaps for division operations) and simple to work with. The maximum sample value of this type is 1, and the minimum is -1.
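To make the differences concrete, here is a hedged sketch of conversions between these representations. The helper names and the use of numpy are my own assumptions for illustration:

```python
import numpy as np

def uint16_to_float(samples_u16: np.ndarray) -> np.ndarray:
    """Unsigned 16-bit -> float in [-1, 1]; silence is the offset 2**15."""
    return (samples_u16.astype(np.float32) - 32768.0) / 32768.0

def int16_to_float(samples_i16: np.ndarray) -> np.ndarray:
    """Signed 16-bit -> float in [-1, 1]; silence (0) maps to 0."""
    return samples_i16.astype(np.float32) / 32768.0

def float_to_int16(samples_f: np.ndarray) -> np.ndarray:
    """Float in [-1, 1] -> signed 16-bit, clipping anything out of range."""
    return np.clip(samples_f * 32767.0, -32768, 32767).astype(np.int16)
```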
Operations such as volume, equalization and normalization operate on these samples.

Volume, for example, can be thought of as a real coefficient between 0 and 1 that multiplies the sample value: keeping it as it is (1), muting it (0), or scaling it to some intermediate level.
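As a minimal sketch, assuming floating-point samples (where this really is a single multiplication):

```python
import numpy as np

def apply_volume(samples: np.ndarray, coefficient: float) -> np.ndarray:
    """coefficient in [0, 1]: 1 keeps the sound unchanged, 0 mutes it."""
    # With unsigned-integer samples one would first remove the silence
    # offset (2**15), scale, and then add the offset back.
    return samples * coefficient

demo = np.array([0.5, -0.8, 0.1], dtype=np.float32)
print(apply_volume(demo, 0.5))   # [ 0.25 -0.4   0.05]
```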
Normalization and equalization are more complex. There are several normalization methods, which may consider the peak amplitude of the whole audio track or the signal power within a sliding window. Equalization usually involves a Fourier transform, which decomposes a wave into its frequency components.
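Below is a hedged sketch of the simplest of these methods, peak normalization. The function name and numpy usage are assumptions of mine, not from the original answer:

```python
import numpy as np

def normalize_peak(samples: np.ndarray, target: float = 1.0) -> np.ndarray:
    """Scale the whole track so its loudest sample reaches `target`."""
    peak = np.max(np.abs(samples))
    if peak == 0.0:              # an all-silence track: nothing to scale
        return samples
    return samples * (target / peak)

demo = np.array([0.1, -0.4, 0.25], dtype=np.float32)
print(normalize_peak(demo))      # the loudest sample (-0.4) is scaled to -1.0
```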
Stereo Sound
It is nothing more than two recorded sounds, one for the left channel, the other for the right.
File Representation
The simplest representation is the so-called RAW PCM, which stores each amplitude value in sequence, one after the other. Because of that, extra information is needed to play the sound back without distorting it: the number of channels (mono? stereo?), the sampling rate, and the sample format (signed integer? floating point?).
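A minimal sketch of producing such a raw stream, assuming numpy and 16-bit signed stereo (the tone frequencies and file name are arbitrary choices of mine). Note how the two stereo channels are interleaved, and how none of the playback metadata ends up in the file:

```python
import numpy as np

rate = 44100
t = np.arange(rate) / rate
left = np.sin(2 * np.pi * 440.0 * t)     # left channel: 440 Hz tone
right = np.sin(2 * np.pi * 330.0 * t)    # right channel: 330 Hz tone

# Interleave the channels (L R L R ...) and convert to signed 16-bit.
stereo = np.empty(2 * rate, dtype=np.int16)
stereo[0::2] = (left * 32767).astype(np.int16)
stereo[1::2] = (right * 32767).astype(np.int16)

# RAW PCM is just the samples: whoever plays "tone.raw" must already
# know it is 44100 Hz, 2 channels, signed 16-bit.
with open("tone.raw", "wb") as f:
    f.write(stereo.tobytes())
```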
The WAVE format is almost as simple as RAW PCM, except that it adds headers and standardizes (at least within Microsoft products) the representation of uncompressed audio (as far as I know, Microsoft never implemented compression support within this format). The WAVE format is based on a container called RIFF, which is extensible and even supports metadata such as artist, title, etc.
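Python's standard wave module can write those headers for us; a minimal sketch (the tone and file name are again just illustrative):

```python
import wave
import numpy as np

rate = 44100
t = np.arange(rate) / rate
mono = (np.sin(2 * np.pi * 440.0 * t) * 32767).astype(np.int16)

# The RIFF/WAVE header records exactly what RAW PCM leaves implicit.
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 2 bytes per sample = 16-bit
    w.setframerate(rate)   # 44.1 kHz
    w.writeframes(mono.tobytes())
```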
Other formats like MP3, AAC, OGG, FLAC, etc., are much more complex, and each would require a chapter of its own to describe even its basic workings.
Lossless and Lossy
Even after being discretized (sampled), sound can still be too large to store and distribute. Note, for example, that a 700 MB audio CD holds only between 70 and 80 minutes of audio, yet you can fit well over 5 hours on the same CD if you encode the audio as MP3.
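A quick back-of-the-envelope check (the 128 kbit/s figure is just a typical MP3 bitrate, assumed here for illustration): CD-quality stereo costs 44,100 samples/s x 2 bytes x 2 channels = 176,400 bytes per second, roughly 10 MB per minute, so 700 MB holds about 70 minutes. An MP3 at 128 kbit/s needs only about 1 MB per minute, which is how the same disc fits many hours.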
Audio compression techniques are divided into two major groups: lossless and lossy.
Lossless means that the decompressed sound is exactly the same as the sound before compression. This characteristic is found in formats such as AIFF and FLAC. Drawing a parallel with images, PNG and BMP are lossless encodings.
Lossy formats usually incur small losses in frequency bands that are not relevant to their purpose, and take advantage of psychoacoustic models to fool our ears. A typical ear can perceive frequencies of up to 20 kHz, so frequencies above this value do not need to be encoded and can be discarded without noticeably affecting the sound quality (this, incidentally, is why the 44.1 kHz CD rate is enough: by the Nyquist criterion it captures everything up to 22.05 kHz). Many ears cannot even tell the difference. As a result, the decompressed (or decoded, if you prefer) sound is not identical to the original, but its size is much smaller than the original or a lossless encoding. Formats of this type include MP3, OGG, WMA and AAC. In the field of images, they can be compared to the JPEG format.
Conclusion
One can hardly call this an introduction to the subject, but I tried to cover some easily understood concepts that will allow you to go deeper. Many techniques used in image processing are also used in the processing of audio, video and electrical signals, and even in the propagation of heat in a medium.
Personally, I am fascinated by the subject, although I am far from being someone with deep knowledge. ;D
I hope I have helped clear up some doubts, and perhaps raised some new ones too.
Great question! I don't have an answer, but I found some material (in English) that seems to cover the subject in depth, which I leave as a reference. By the way, I believe your interpretation is generally right, except that instead of "frequencies" I think that matrix contains "amplitudes".
– mgibsonbr
Um... I would guess that they are the two channels of stereo sound. Other than that, I think each channel is represented by frequencies (or amplitudes) over time. I just won't post an answer because this is only my educated guess (I don't really know the details).
– Luiz Vieira
@brasofilo I know the answer; this is called audio fingerprinting. It is possible to recognize any song using techniques like this. You have probably used the Shazam app on your phone; they have something similar. If you write up a question I can try to explain this in detail...
– ederwander
@ederwander: I didn't know the name of the technique, thanks! I didn't know the app either, but I recently saw someone using it and thought "oh, YouTube tech".
– brasofilo
Tip: there is a Digital Signal Processing (Sampling) course offered by Coursera and Unicamp.
– Marcelo de Andrade