Split by upper case letters

How can I use split to separate the words in the string "QueroSepararAsPalavrasNestaSentença" in Ruby?

  • Have you tested "QueroSepararAsPalavrasNestaSentença".split /(?=[A-Z])/?

  • Ah, the problem there seemed to be the cedilla (ç), but .split itself was working fine: http://rubyfiddle.com/riddles/9b484

  • "QueroSepararAsPalavrasNestaSentença".split(/(?=[A-Z])/).join(" ") ?

  • It worked :D thank you very much

2 answers

1

You can use the split method with the following regular expression: /(?=[A-ZÀ-Ú])/. It would look something like this:

expression = 'EssaÉUmaFraseÇÁrvore'

expression.split(/(?=[A-ZÀ-Ú])/)
=> ["Essa", "É", "Uma", "Frase", "Ç", "Árvore"]

1

It depends a lot on what the string you want to split looks like. If it has no accented uppercase letters, just use what has already been suggested in the comments:

"QueroSepararAsPalavrasNestaSentença".split /(?=[A-Z])/

This uses a regular expression (regex) containing a lookahead, which is written with the syntax (?=...). Basically, a lookahead checks whether something exists ahead of the current position. In this case, I am using the character class [A-Z], which matches one letter from A to Z.

The "trick" of the lookahead is that it only checks whether there is an uppercase letter ahead, without making that letter part of the match. So the regex matches only the position in the string that is followed by an uppercase letter, and the split happens at that position.

The result is that the string is broken at points immediately prior to an uppercase letter. In the case of the above string, the result is:

["Quero", "Separar", "As", "Palavras", "Nesta", "Sentença"]

If the string is ABCdeFG, the result will be ["A", "B", "Cde", "F", "G"]. Note that the break happens before each uppercase letter, so when several uppercase letters appear in a row, each one becomes a separate element of the array.
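To make the role of the lookahead more concrete, here is a small sketch comparing a split on [A-Z] alone with a split on the lookahead version. The helper names split_consuming and split_keeping are made up for this illustration only.

```ruby
# Hypothetical helper: plain character class, so the uppercase letter
# itself is the delimiter and gets removed from the result.
def split_consuming(text)
  text.split(/[A-Z]/)
end

# Hypothetical helper: lookahead, so the match is zero-width and the
# uppercase letter stays at the start of each piece.
def split_keeping(text)
  text.split(/(?=[A-Z])/)
end

p split_consuming("QueroSepararAsPalavras")
# => ["", "uero", "eparar", "s", "alavras"]
p split_keeping("QueroSepararAsPalavras")
# => ["Quero", "Separar", "As", "Palavras"]
p split_keeping("ABCdeFG")
# => ["A", "B", "Cde", "F", "G"]
```

Without the lookahead, each uppercase letter is consumed as the separator (and the leading Q even produces an empty first element), which is why the lookahead form is the one you want here.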


Accented characters

Another answer suggests using [A-ZÀ-Ú], which works in many cases, but there are some details to watch out for.

The range À-Ú covers several accented capital letters, but it also covers characters that are not letters at all, such as the MULTIPLICATION SIGN (×). Depending on the font, × can look a lot like a lowercase x, or even identical to it, but it is a different character and certainly not an uppercase letter.

In addition, this range leaves out Û, Ü and Ý. Obviously, if your text never contains any of the characters above, this will not be a problem.
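A quick way to check those edge cases yourself is to test individual characters against the class. This is only a sketch; in_accent_range? is a hypothetical helper name, and the characters are written as code point escapes so it is clear where each one sits relative to À (U+00C0) and Ú (U+00DA).

```ruby
# Hypothetical helper: does the class from the other answer match this character?
def in_accent_range?(char)
  char.match?(/[A-ZÀ-Ú]/)
end

p in_accent_range?("\u00C9") # É, inside À-Ú                  => true
p in_accent_range?("\u00D7") # ×, also inside À-Ú             => true
p in_accent_range?("\u00DB") # Û, just after Ú, so left out   => false
p in_accent_range?("\u00DC") # Ü                              => false
p in_accent_range?("\u00DD") # Ý                              => false
```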

But if you want a more "generic" option that accepts any uppercase letter, use a regex with Unicode properties:

sua_string.split /(?=\p{Lu})/

Here, \p{Lu} matches any character in the Unicode category "Letter, Uppercase". This includes characters from all alphabets covered by Unicode (not just our Latin alphabet).
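As a rough illustration of the Unicode property version, the sketch below runs it on a few made-up sample strings (accented Latin, Greek, Cyrillic, and one containing ×). The method name split_on_uppercase is just a hypothetical wrapper for this example.

```ruby
# Hypothetical wrapper around the \p{Lu} split.
def split_on_uppercase(text)
  text.split(/(?=\p{Lu})/)
end

p split_on_uppercase("EssaÉUmaÛltimaÁrvore")
# => ["Essa", "É", "Uma", "Ûltima", "Árvore"]
p split_on_uppercase("ΑλφαΒήταΓάμμα")
# => ["Αλφα", "Βήτα", "Γάμμα"]
p split_on_uppercase("ПриветМир")
# => ["Привет", "Мир"]
p split_on_uppercase("2×3")
# => ["2×3"]
```

Note that Û is now treated as an uppercase letter and × no longer causes a split, since it is a math symbol rather than a letter.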
