How to find hashtags in a string and store them in an array?

Viewed 301 times

9

I have a content posting system on our company's internal social network.

When a user types text containing hashtags, I need to detect all of them and store them in an array.

Example:

Hi, I’m posting this #question on #stackoverlow. I hope you find good #answers.

I want to get back:

array('question', 'stackoverlow', 'answers');

Note that hashtags containing accented characters must be handled as well.

Example:

#notícias
#sãoPaulo
  • 1

    http://stackoverflow.com/a/3060756/4056678

  • If a hashtag can start with or contain accented characters, that solution will fail. To avoid this you can use the regex `#([^\s]*)`

  • With the u modifier, as @rray showed, there is no problem.

4 answers

14


I believe this regex solves the problem. It matches a # followed by one or more word characters (\w: letters, digits and underscore). The i modifier makes the match case-insensitive, and the u modifier adds support for multibyte (UTF-8) characters.

<?php

   $str = '#pergunta no #stackoverlow #notícias 2015 #sãoPaulo';
   preg_match_all('/#\w+/iu', $str, $itens);

   echo "<pre>";
   print_r($itens);

Output:

Array
(
    [0] => Array
        (
            [0] => #pergunta
            [1] => #stackoverlow
            [2] => #notícias
            [3] => #sãoPaulo
        )

)

@Wallace Maxters asked how to leave the # out of the capture. @Guilherme Lautert suggested changing the regex to /(?<=#)\w+/iu, which uses a positive lookbehind: it checks that the # is present but does not include it in the match.
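A minimal sketch of the lookbehind variant, reusing the sample string from the answer. The variable names $matches and $tags are only illustrative, and the source file is assumed to be saved as UTF-8 so that the u modifier can interpret the accented letters:

```php
<?php

// Sample text taken from the answer above.
$str = '#pergunta no #stackoverlow #notícias 2015 #sãoPaulo';

// (?<=#) asserts that a # comes right before the match without consuming it,
// so the full matches in $matches[0] do not include the leading #.
preg_match_all('/(?<=#)\w+/u', $str, $matches);

$tags = $matches[0];
print_r($tags);
```

Under those assumptions $tags should hold pergunta, stackoverlow, notícias and sãoPaulo; "2015" is skipped because no # precedes it.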

Recommended reading

Meaning of ?: ?= ?! ?<= ?<! in a regex

  • Interesting that you didn't use parentheses. It really does reduce the number of items returned, which is good. +1

  • I had used parentheses at first... then I saw that it returned more things.

  • I also want a way to avoid duplicate hashtags, ignoring the case of the string. But that is another question.

  • Let me ask you something, @rray: I don't know if you noticed, but in the question the returned array of hashtags does not have the # in front. Is it very hard to remove it directly in the regular expression?

  • @Wallacemaxters. I'm trying. My idea is to make an empty group for the #, which I think would solve it, but I haven't managed to yet.

  • Unless you match the # and then a group, and take $matches[1], right?

  • 1

    /(?<=#)\w+/iu https://regex101.com/r/zO9kC9/1

  • @Guilhermelautert I don't have access, hehe, but thanks anyway. I couldn't remember the syntax for the empty group.

  • 1

    @Guilhermelautert, awesome :D, I tested it here and it worked.

  • Add it to your answer as a complement, for cases where you don't want to capture the # as well =D

  • @Guilhermelautert, I'll add it, thank you for this solution.

  • I didn't remember the u modifier either, so we're even, hehe.

  • Awesome, @Guilhermelautert. This is very good, I didn't know this construct was meant for this!

  • Nice, @rray. Just add the m flag, /#\w+/ium, if you receive multiline texts.

8

Building on @Renan's comment and adapting the answer given there:

$tweet = "this has a #hashtag a  #badhash-tag and a #goodhash_tag";

preg_match_all("/(#[^ #]+)/", $tweet, $matches);

var_dump( $matches );

So it looks for a # followed by anything that is not a space or another #.
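One consequence worth checking is how the negated character class treats punctuation. The sketch below, which uses a sentence modelled on the one in the question and illustrative variable names ($text, $loose, $strict), compares it with the \w-based pattern from the other answer:

```php
<?php

$text = 'Hi, I am posting this #question on #stackoverlow. I hope you find good #answers.';

// Negated class: everything up to the next space or # is kept,
// so punctuation glued to the tag becomes part of the match.
preg_match_all('/#[^ #]+/', $text, $loose);

// Word characters only: the match stops at the first non-word character.
preg_match_all('/#\w+/u', $text, $strict);

print_r($loose[0]);
print_r($strict[0]);
```

Here the first pattern keeps the trailing period in "#stackoverlow." and "#answers.", while the second returns the tags without it.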

regex101

3

Another way is to match the tag together with the # in the regex and keep only the captured group:

function extractTags($mensagem)
{
    // Matches tags such as #dia #feliz #chateado
    // Does not match special characters, e.g. #so-pt
    $pattern = '/#(\w+)/u';

    // Alternative to include other characters:
    // just add them inside the brackets
    //$pattern = '/#([\w-]+)/u';

    preg_match_all($pattern, $mensagem, $tags);

    // Use the array holding the groups captured by the parentheses
    return $tags[1];
}

I took this function from an answer I gave earlier to another question: PHP hashtag system
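To see what the commented-out alternative changes, here is a small sketch. The helper extractTagsWith is hypothetical (it is not part of the answer) and only passes the pattern in as a parameter so that both variants can be run on the same made-up message:

```php
<?php

// Hypothetical helper: same idea as extractTags, with the pattern as a parameter.
function extractTagsWith(string $pattern, string $message): array
{
    preg_match_all($pattern, $message, $tags);

    // Index 1 holds the text captured by the first pair of parentheses.
    return $tags[1];
}

$message = 'Posted on #so-pt and feeling #feliz';

// \w stops at the hyphen, so only "so" is captured for the first tag.
$wordOnly = extractTagsWith('/#(\w+)/u', $message);

// Adding "-" inside the brackets keeps the hyphenated tag whole.
$withHyphen = extractTagsWith('/#([\w-]+)/u', $message);

print_r($wordOnly);
print_r($withHyphen);
```

Which variant fits depends on whether hyphenated tags are allowed on the network in question.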

0

In PHP you can use the preg_match_all function with the regex below; it finds every word that starts with # and returns the results in $matches.

// Note: this character class covers only ASCII letters, digits and hyphens
preg_match_all('/#[A-Za-z0-9-]+/', $string, $matches);
var_dump( $matches );
  • 2

    Gilmar, it would be interesting to add some explanation of how the proposed code works.
