How to search for a substring from an offset in Rust?


1

How can I find the index where a substring begins, starting the search from a given index in the string?

In C++, for example, the method std::string::find accepts an offset (index) at which the search should begin.

Is there something similar in Rust?

1 answer

3


I don’t know if this fits what you call "idiomatic," but let’s see...

Assuming we want to find the index of a substring starting from a given offset, we could do this:

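let string = "ABC ABC ABC";
let offset = 3; // number of characters to skip

let idx = string
    .chars()
    .skip(offset)
    .collect::<String>() // allocates a new String
    .find("ABC") // index within the new String
    .map(|n| n + offset); // Some(4) here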

See it working on the Rust Playground.

Basically, it works as follows:

  1. Create an iterator over the characters of the string (calling the method chars on string).
  2. From the iterator returned by chars, skip the number of elements given by offset (chaining skip).
  3. From the rest of the iterator, generate a new String. Note that it is owned and therefore requires a heap allocation (chaining collect::<String>).
  4. On the collected string, use the method find to get the index of the searched substring. This method returns an Option<usize>, which means it returns None when the substring is not found.
  5. If the substring is found, we need to add the offset skipped in step 2 to the index returned by find. For this, we use the method map, implemented by Option, which transforms the value in the Some case using the function passed. In the None case, the function is not called and map simply returns None. Note that find returns a byte index while offset counts characters, so the sum is only meaningful when every skipped character is one byte long, as in this ASCII example.
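The steps above can be wrapped in a small function. This is only a minimal sketch: the name find_from_char_offset is hypothetical (it is not part of the standard library or of the original snippet), and it assumes ASCII input so that character counts and byte indices coincide.

```rust
/// Hypothetical helper that follows the five steps listed above.
/// Assumes the skipped characters are ASCII (one byte each).
fn find_from_char_offset(haystack: &str, needle: &str, offset: usize) -> Option<usize> {
    haystack
        .chars()             // 1. iterate over the characters
        .skip(offset)        // 2. skip the first `offset` characters
        .collect::<String>() // 3. collect the rest into an owned String
        .find(needle)        // 4. index of the match inside the new String
        .map(|n| n + offset) // 5. add the skipped offset back
}

fn main() {
    let string = "ABC ABC ABC";
    println!("{:?}", find_from_char_offset(string, "ABC", 3));
    println!("{:?}", find_from_char_offset(string, "XYZ", 3));
}
```

Returning Option<usize> keeps the "not found" case explicit, the same way find itself does.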

This may look costly in terms of performance, but since Rust is so strict about its so-called zero-cost abstractions, I suppose several optimizations are applied during the intermediate phases of compilation. However, I did not run benchmarks to confirm this hypothesis.

  • Thank you very much. One question, though: if I call find on a slice, as in string[offset..].find, would that avoid the allocation?

  • 2

    You will not have the allocation, but keep in mind that with this notation you are no longer indexing by characters of the string, but by bytes of its UTF-8 encoding. A single character can take up to 4 bytes: á, for example, takes 2 bytes, while a takes 1 byte. Also, when slicing, you must make sure the range falls on valid character boundaries in terms of bytes, otherwise the slice operation panics... Anyway, it gets quite complicated. So, in this case, I think it's worth using the method chars (to ensure that we are skipping characters, not bytes).

  • 1

    See this playground for a drastic example that helps illustrate the "problem". See sections § 4.3 and § 8.2 of the Book to learn more. There is also the matter of Unicode normalization, which can affect even chars...

  • 1

    This discussion and this one may also be useful for more information on indexing string slices.

  • 1

    @suriyel, your comment prompted me to dig a little deeper into this subject. Here is one more good article I found: Unicode String Models.
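The slice-based approach raised in the comments could look roughly like the sketch below. Both function names are hypothetical, and the offset passed to the search is assumed to be a byte position. It uses str::get instead of the [offset..] notation so that an offset falling inside a multi-byte character gives None rather than a panic, and char_indices to translate a character count into a byte offset when that is what the caller has.

```rust
/// Hypothetical helper: `byte_offset` is a byte position, not a character count.
/// Returns None if the offset is out of range or not on a character boundary.
fn find_from_byte_offset(haystack: &str, needle: &str, byte_offset: usize) -> Option<usize> {
    haystack
        .get(byte_offset..)?      // borrowed slice, no allocation
        .find(needle)             // byte index inside the slice
        .map(|n| n + byte_offset) // both values are byte indices
}

/// Hypothetical helper: converts a character count into a byte offset.
fn char_to_byte_offset(s: &str, char_offset: usize) -> Option<usize> {
    s.char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len())) // allow an offset pointing at the end
        .nth(char_offset)
}

fn main() {
    let s = "áé ABC ABC";
    // 'á' and 'é' take 2 bytes each, so the first 3 characters span 5 bytes.
    let start = char_to_byte_offset(s, 3);
    println!("{:?}", start);
    println!("{:?}", start.and_then(|b| find_from_byte_offset(s, "ABC", b)));
    // Byte 1 falls inside 'á', so `get` yields None instead of panicking.
    println!("{:?}", find_from_byte_offset(s, "ABC", 1));
}
```

The returned index is a byte index into the original string, so it can be passed back to get or to another call of the same function.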
