What are the pros and cons of indexing vectors by zero or one?

Asked

Viewed 1,089 times

15

Most programming languages I know have zero as the first index of a vector.

I know that several programming languages have content um as the first vector index. Until recently I thought this was limited to older languages (such as COBOL, Fortran, Algol and Smalltalk), but recently I discovered that Lua and Matlab also begin their arrays for 1.

I think the initial index decision should be something quite serious. What factors do language designers consider to determine base indexing? What are the pros and cons of each form?

  • 2

    Perhaps, and this is my opinion ( flagged issue ) which was originally thought more important than a subscript meant something analogous to the human reader. Putting something into the zero slot when explaining it is less easy to understand than putting something into a slot. Compiler / interpreter does some work of "translate" but it is not big problem.

  • I could almost understand :D

  • 3

    @Bigown For the fastest, I use English. With less importance of time, my own weak Portuguese. For answers, a mixture of time, Google Translate, and my son’s help. This, only weak method :-)

  • 5

    @Billwoodger do not worry, there are times when your text becomes much easier to understand than those of some of our "natives". don’t Worry, sometimes your text is easier to understand than some of our "Native"’s

3 answers

15


It is possible to enumerate advantages and disadvantages of each indexing style. But honestly, if someone says a way is much better and the other is one rubbish, this person is a bit fanatic... After all, even C programmers count the lines in their text editor starting at 1 ;-)

Let’s go to the examples:

Vector of 100 elements, starting at 0 as in C, C++, Java or Python and using open intervals at the end as in Python (that is, the last index is left out):

  1. The statement in C, C++, Java, etc. is lista[100] (indicating the amount of elements), but can only be indexed until lista[99] (use lista[100] equal to what was declared would be a mistake).
  2. Therefore it will be necessary to use size()-1 for access the last element.
  3. Access to sub-bands vector is easy. If I wanted to walk on the vector of 10 in 10 elements, just do 0*10, 1*10, 2*10, 3*10. In Python it looks like this: lista[0:10] includes the first 10 elements, from 0 to 9; lista[10:20] includes items 10 to 19; and lista[20:30] represents the elements from 20 to 29. Note that no "+1" or "-1" had to be done and to obtain the number of items just calculate fim - início.
  4. The second element will be called [1], which is kind of strange. The best way to interpret it is to think about indices as displacement: list[0] is the item at the beginning of the list, list[1] is from the beginning and skip 1 (therefore the second element), list[2] account from the beginning and skip 2 (therefore arriving at the third element). If the low-level implementation makes good use of this detail in the implementation (such as the duality of pointers and arrays in C), one can save machine instruction, making code more efficient.
  5. If you want circle in the first 5 elements, it is possible to increment the counter so: i = (i + 1) % 5
  6. One empty lane can be written as lista[10:10].
  7. One range of an element stays lista[10:11] (contains only the list element[10])
  8. You will need to add 1 to display the indexes from a list to the user, because it’s kind of weird to start a list with 0, unless it’s a log for debugging. It is also strange to present a list that ends in 9 but does not have 9 elements.
  9. A invalid index can be represented with -1 (but not with 0). For example, if a search does not find any element, many functions in popular languages return -1 or another negative value. Python uses -1 to represent the last element, so the most common is to play an exception when a search finds nothing.

Vector of 100 elements, starting at 1, using closed intervals (that is, the last index is included) as in Matlab:

  1. The first element is 1 and intuitively indexes are like ordinal numbers: first, second, third...
  2. The last element is 100 and the amount of elements is the same as the last index. Much more intuitive and avoids errors.
  3. Access to sub-bands is a little more complicated: lista(1:10), lista(11:20), lista(21:30). Note that it is necessary to use 10 and 11, 20 and 21, 30 and 31, etc. Since these values will probably be in variables, the code gets multiple "+1" and/or "-1".
  4. The amount of elements in a subband is fim - início + 1. For example, from 11 to 20 there are 10 elements, because it includes the two tips, both 11 and 20.
  5. If you want circle in the first 5 elements, the formula is: i = mod(i, 5) + 1 (now, instead of %, I used mod, since I’m using Matlab as an example). Note that the "+1" in this case is outside the mod operation, but is neither easier nor harder than indexing from 0.
  6. One empty lane (if supported) could be described as lista(10:9).
  7. One range of an element stays lista(10:10).
  8. You can display the indexes directly to the user, as most people expect a list to be numbered from 1 (this list looks like this).
  9. A invalid index can be represented with 0. For example, if a search does not find any element, returning 0 to represent its absence can be quite intuitive.

Note also that these are not the only possibilities. I combined indexing starting with 0 and subbands with open ending as in Python, but in Vimscript indexes start at 0 and subbands are lista[0:9], lista[10:19], lista[20:29]. Many people justify starting arrays with 0 based on subtracks, but this mixes two separate concepts.

Or a language may have little or no support for subtracks, which diminishes the importance of some of the features I listed above. The advantages of subtracks are well visible in Python, while the C language, which has no subtracks, was probably created by indexing from 0 because of pointer arithmetic. After all, if we have a pointer to the beginning of a vector, we don’t need to do anything else to access the first item (we’re already on it, so the displacement is 0).

Another curiosity is the Visual Basic, in which a statement lista(10) is a vector of 11 elements, numbered from 0 to 10. Advantage: the declaration contains a valid index. Downside: you need to mentally add 1 to know the number of items on the list.

Some languages allow you to choose any track, as in Fortran or Pascal. For example, in Pascal (or Delphi) it would look like this: array[-10 .. 10] of integer, which allows you to use arrays in a very creative way and well adapted to the problem to be solved, at least for fixed size arrays. Dynamic arrays are another story and depend on the Pascal dialect - they often imitate the C language and index from the same 0. An alternative would be to provide functions that return the array boundaries.

  • I remembered another interesting situation: search functions. A search function that does not find an element can return 0 if the indexes start with 1, which is interesting because 0 can be associated with the absence of the element. If 0 is the index of the first element, the function must return -1 or indicate otherwise the failure in the search (exception, null return, None return, etc.). Using -1 may be inconvenient because it is neither an absent value nor a "false" value, preventing you from using the if form (! find(...)).

  • An article with an interesting point of view, showing how the history of indexing by 0 can be easily forgotten and replaced by explanations "after the fact": http://exple.tive.org/blarg/2013/10/22/citation-needed/

12

Behold these notes of Dijkstra that I blatantly I copied from Soen.

Basically the convention helps to represent sequences without having to deal with corner cases. You know those horrible codes they have ifs to deal with the upper and lower limits? Breaking this note explains why there is another convention: Why we include the lower limit but not the upper limit in the range of functions that deal with indices (see functions such as substring; to get the first n string characters usually do something like string.substring(0,n)).


See also that article in Wikipedia listing the following advantages:

  1. Reduction of errors by an element
  2. Pointer arithmetic. The first element of an array is in position 0 - that is, in ponteiro, any other element is in position ponteiro + tamanho do elemento * índice.
  3. The question of a sequence being more conveniently represented by an open interval [0, n) than by a closed interval [1, n] according to the Dijkstra article.
  • 1

    Given that in a list of 100 elements the last one has the index 99 (and not 100), the reduction of errors by an element is somewhat doubtful; they can increase. The reduction of errors only occurs after taking the practice, but this also applies to indexing by 1 (and we have already practiced it since childhood in the first math classes or even before).

  • 1

    Hi @Marcus, I think the point of Dijkstra, as well as the authors of the linked article is that indexing by 0 combined with open interval usually, on average, avoid errors by one, mainly in logics of slicing. Thinking of your example, use 0..100 with an open interval (n < 100) at termination does not result in any problem. It is possible, of course, to argue that 1..100 with closed interval (n <= 100) would be similar. That said, I agree with your point, use size - 1 to get the last element can be quite counterintuitive. At bottom it is a matter of convention / preference.

5

Well, my second sign rejected (of two) :-)

I understand that this site has different rules of the OS - and I would like to take this opportunity to say that this site is my favorite, by a long margin, among the SE sites I visited (which are more than can be seen in my profile) - and it might not be so good to comment on a peripheral position. That’s a matter of opinion too :-)

Okay, how do we solve this question? First, take the four age languages that were mentioned, and identify why they used "1" as a starting point (at the same time, establish which, if any, has "arrays" in the sense that they are understood by many).

So take a handful of languages, preferably popular, and find out why they use zero. Find the mother tongue, where there must be an obvious connection.

Find out if any of the two sets of answers indicate a "grouping" of reason that in turn supports the idea of a specific reason for this trait.

Also find out for Moon and Mathematica (this may be the easy part) why they use a.

Join all results. Consider. Speculate. Theorize. Publish.

Work for a Computer Historian who specializes in Language Development.


Dijkstra

A man of opinions, all correct from his own point of view. For to him his knowledge is absolute, and his opinions are facts.

A man of opinions, Every one of them correct to his Own Absolute Certain Knowledge.

Some quotes from notes quoted in that reply.

I mention this experimental Evidence - for what it is worth - because some people Feel uncomfortable with conclusions that have not been Confirmed in Practice.


Many Programming Languages have been Designed without due Attention to this Detail [that we had Better Regard zero as a Most natural number].


[a Mathematical colleague of mine] accused a number of Younger computing Scientists of "pedantry" because - as they do by Habit - they Started numbering at zero. He Took consciously Adopting the Most sensible Convention as a provocation.


Dijkstra writes very well. It is a pleasure to read his texts, not only for the content but also for the hints of good humor.

But to say that starting at zero is the only way to do things, and that any other option would be neither natural nor rigorous is... a statement, and assumes that everything in society (or at work), especially computing, must occur through rigor, and lead to some natural conclusion that cannot be disputed.

Things don’t work that way.

I think Dijkstra would love to get in an elevator and see a button Piso 0. He might refuse to enter an elevator he used T to the ground floor or 1 as the top floor before the garage.

In fact, I like this idea. Imagine the Dijkstra pressing zero and going to the garage. It goes back to the elevator and presses zero again. Garage again. He would write well about it, saying that the problem is the design of the building. OK. That’s just my opinion and my imagination.

Open a file drawer. See the folders inside. If someone says "first", "last", "third from the bottom", "the one next to the red one" or something like that, we can find what we’re looking for, whether we count from scratch or from one. We may even need to look at the contents of two folders to find, but it will work if the information you gave us is correct.

It matters?

The language processor handles everything anyway. No matter what we write (as long as it’s appropriate for that language), the language processor will do what it has to do. In many examples there is no longer a meaning even in itself. An associative array doesn’t start with zero, or one. Today there are many "loop constructs" that do not have to specify an initial value, because when you think about it, it is reasonable to expect from a higher level language that "I would like to query all the values of this thing, compiler please start at the right place" function as expected.


No, I’m not spending any more effort on that question than she’s worth, I’m sorry. I marked it as CW, since these are some reactions to opinions, in an opinionated question, which can only be answered with uncertainty, I certainly don’t want any reputation for it, even if someone deems it worthy.

It’s a question for a bar on the beach, for when you’re stuck in an elevator with some programmers, or for your dreams, when you win the debate and the whole (programming) world agrees with you.


Initial Position 1 = Displacement 0

COBOL, that language that has been there for a while and that perhaps few understand, does not have "arrays", but it has "tables" with OCCURS.

01  A-TABLE.
    05  FILLER OCCURS 100 TIMES
        INDEXED BY AN-INDEX-NAME.
        10  SOME-DATA             PIC XX.

To reference an entry in the table by subscribing, you can use the date-item index AN-INDEX-NAME or use a literal or use a data-name (integers only, but we don’t have "ints" that way).

The fastest way to access an element is with a literal, because the compiler calculates the displacement. Literal 1 gains displacement 0. Literal 2 gains ((2-1) * element size) etc.

The data-item index you assign (SET) to a value. You make SET ... TO 1 and the compiler uses a zero value. You use SET ... TO 2 and the compiler uses ((2-1) * element size) etc. You can do SET ... UP or SET ... DOWN and the compiler adds or subtracts the element size appropriately.

Using a data-item index is the second fastest way. Most of the time.

Using a data-item as a subscript, the compiler needs to generate code to "convert" the value to an offset. A little more work. Always a date-name to subscribe with a value of 1 will be calculated for a displacement of 0.

Except. For special case, where the size of the element is a.

In this case, the compiler will use the data-name value directly, it will just "pretend" that the table starts a byte earlier. In this special case, the date-name-for-a-subscript is faster that using the date-item index.

We humans always use 1 for the first entry of a table. The compiler uses 0, or modifies its implementation of 0 in the special case of a single-byte table and uses 1 for the first entry.

Now, the situation is, with COBOL, our tables have fixed size. We need to know how many entries there are in the table, which are actually used. The compiler cannot tell us. If we have to know how many there are, we have to know that it is 0 when there is no input. If we had to subscribe from scratch, we would have an interesting test to see if we exceeded the number of entries in a table.

My guess would be that the above is not the reason why one’s "starts" subscription in design, but that design decided, because the purpose of the language was to make it easier for humans to program, that starting with one was an easier concept to get the hang of (at that stage in the history of computing).

To get a basis for one or the other point, you would probably have to take a look at the FLOWMATIC language, where much of the design was used for COBOL.

Maybe Alvezquemsabe was neither a conscious decision but something so "obvious" that it was not even much discussed.

However, it is unlikely that anyone alive knows the answer to COBOL. Or FORTRAN. Or ALGOL. SMALLTALK is a little younger. Perhaps an answer can be found by looking at all the design notes of COBOL. Maybe not.

It’s just a question of speculation and opinion, and the way I see it, there’s no "right" or "wrong" answer to why those languages of that time did it this way.

Dijkstra, who seems to think everyone <ol><li> should start at 0, certainly did not know.

Reply freely translated from the original in English, in the revision history.

  • 3

    Most people don’t Ever intend to design new Programming Languages. But some of us do, myself included. For us, this is a very Relevant Question, Even if the Answer turns out to be "a Matter of preference" (it’s not: Computers are as happy consuming Binary code, but Programs are Written by Humans). I wouldn’t Ask Every Question about language design here at Stackoverflow (Most would be more appropriate in a Platform like Discourse), but a Selected few are sure answerable objectively (I have Strong subjective reasons to Shun 1-based indices, but they don’t fit this site so I’m keeping to myself)

  • @mgibsonbr fair enough, but I don’t see Something like this as a good Starting-point for Even Approaching the Question over 1/0, Let alone Answering it. I think a language could be Designed without the Programmer having to know, and that would Deal with a Lot of "off by one" problems. To design a new language for Todays' new programmers, I think you have to research Pretty-Much in Modern times. I don’t think the Past will Tell you whether 0 or 1 is "right" or "Wrong", just how they Were used in a Given particular language. Implementation Detail.

  • 1

    Agreed. In Fact, one very good Reason to Choose between one or Another is how well it integrates with other Programs - potentially Written in Different languagues (Most of them contemporary). BTW: IMHO the Answer got a Lot Better after the Edit (those Facts about the Reasoning Behind COBOL implementations are very insightful), so I retracted my downvote.

  • @mgibsonbr no problem, Although I was thinking of avoiding Benefits if anyone Wanted to upvote, I also get protected from the downvotes (didn’t think of that). Inter-language Communication? Very interesting. Can’t discuss it here though :-)

  • @mgibsonbr and just a point, those are Facts about how COBOL (on IBM Mainframes) is implemented (for the one-byte table especially, the Rest is fairly general). I deliberately avoided rationalising, Reading history Backwards, to Say Why COBOL had 1 when it came into being.

  • 2

    So... who’s going to Translate this to Portuguese? :)

  • @bfavaretto, I think quoting Dijkstra just got me going (Coming as it Did after the Allusion that COBOL is a pre-Christian-era language). A bit of a non-answer, I have to admit. If there is interest I can do a little work on it, but I’m not sure how helpful it is to the Discussion (which shouldn’t be happening in this format....).

  • Regarding the Question, this is answerable: What factors do language designers consider to determine base indexing? What are the pros and cons of each form? Of Course there is no right Answer to "which is Better", but that’s not what’s being Asked. Note: your flag was Rejected from the review Queue (3 "Leave open" votes). Had a mod handled it, it would probably be marked helpful.

  • 2

    @bfavaretto "So... who’s going to Translate this to Portuguese?" There is Goes... Hopefully I didn’t mess up the COBOL part. Only left the Dijkstra quote as it is, Since I couldn’t figure out its Exact meaning ("correct with absolute certainty, at least as far as your knowledge goes"? " correct according to your knowledge, which is right and absolute"? etc)

  • 1

    @mgibsonbr I also can not translate the quote, I had the same doubt. Bill, you can explain the meaning to us?

  • @mgibsonbr Thank you very much. Sorry about the continuation...

  • @bfavaretto more Difficult than I thought.

Show 7 more comments

Browser other questions tagged

You are not signed in. Login or sign up in order to post.