External and internal memory sorting algorithms

Asked

Viewed 536 times

11

I was researching the difference between memory sorting algorithms external and internal and found the following answer in Quora :

"In cases where we have to classify more data than can fit into Main memory, we need an external classification algorithm. Instead of doing all the grading in memory, we sort pieces of the data in memory each time, pour the results for a file, and so on until we have a file fully ordered."

So far I have understood the concept, I’m having a hard time understanding which sort algorithm is used in each type , there are several types of ordering but how do I know if this sort algorithm is memory primary or secondary ?

  • 1

    I think this may depend on each OS. Are you looking at any specific OS? (Windows, OS X, etc)

  • Actually no , I just want to know how to distinguish the ones that are used for each type same.

2 answers

10


In fact it can be said that they are the same for both external and internal use. It is likely that some algorithm is a little better in one case than other algorithms. I will not analyze one by one to reach that conclusion.

The choice of the most suitable algorithm depends on a number of factors. It is often said that the Quicksort is the most suitable for the general case. This does not mean that it is the best ever. It was created for use in RAM memory. But nothing prevents it from being used in secondary storage.

When going to secondary storage it is of utmost importance to access the disk as little as possible. So in theory you would need to find an algorithm that minimizes that, even if it has other higher costs.

But that is naive. In practice any classification algorithm that needs a lot of performance will combine different techniques to achieve the goal. It is obvious that one of these techniques is to put as much data as possible in memory and classify what gives there and go organizing all the portions. If I can get everything in the secondary memory into the primary, the best algorithm for that case will work well in memory. If you can’t put it all together, you need to manage the load as intelligently as possible. This has little to do with the algorithm of Sort in itself.

Note that the amount of memory that computers have today and the advent of SSD has greatly improved the condition of an algorithm not so optimized to achieve good results. If you have multiple processors the choice may tend more toward a specific algorithm that makes use of parallelism. In fact the proper hardware will make a lot of difference, probably more than the most suitable algorithm, as long as it doesn’t use a very bad one for any case.

You can do this transparently. You can use a memory Mapped file and do in memory everything you expect to do on file. The operating system manages what should be in memory and what needs to be on disk. This way you can focus on the algorithm and the memory management is on account of the operating system. Reading and writing is something you would have to do anyway. Of course it will be very efficient if you fit everything in memory or the data allow few page changes. Of course, this facility for having a cost, with the right algorithm can better control how you want to do paging, or can do much worse than the operating system would do. MMF is more used in computing than you think. All your executables are loaded into memory like this. All virtual memory works like this. In fact any memory allocation, in one way or another ends up calling an MMF.

If really the hardware is too restricted I remember that the Mergesort is usually good in cases like this. I’ve been doing some research and it seems that Bucketsort is even better, and has some variants that can better help each case. Has a study that seems to confirm this.

8

The points exposed by @Maniero are accurate. An image might help you visualize how such a process would work.

Any sort algorithm can be used in both primary and secondary memory if you segment your data. In the following example we have a computer that can load and manipulate 6 main memory addresses, and a secondary memory structure that supports 12 addresses.

  • MP - Main memory
  • MS - Secondary memory
  • RO - Result of Ordination

inserir a descrição da imagem aqui

In the first cycle, 6 addresses (1 - 6) are loaded into the main memory, ordered and returned to the secondary memory. (In the image above the addresses that have changed are marked with a light blue background.)

In the second cycle, the 3 final addresses of the previous transaction (4-6) plus 3 (7-9) undergo the same loading, sorting and storage.

This happens successively until all addresses of the secondary memory are loaded, ordered and written back. After all the secondary memory is sorted, the process assesses whether the cycle should be run again; if there has been at least one change, then the process will be repeated.

Secondary memory can be considered ordered if no sorting operation in primary memory results in address changes during a full cycle:

inserir a descrição da imagem aqui

Browser other questions tagged

You are not signed in. Login or sign up in order to post.