Problem capturing the title of a URL using a regular expression

Question

Asked 6 years, 9 months ago

I’m learning the concurrency part of the Go language. As an exercise, I had to use the Generator pattern to return a channel that delivers the title of each URL, read by a goroutine.

Inside the goroutine I wrote, the page is fetched with an HTTP GET, and the title is then extracted into a string with a regex. Initially, the code failed with an index out of range error (panic: runtime error: index out of range). I found that the cause was a line break between the <title> tags: my regular expression did not match across that line break with (.*?), because the dot (.) does not match line-break characters.

I discovered this by viewing the source of several sites and noticing that not every title sits on the same line as its <title> tags; it may also be spread over several lines, for example:

<title>
meusite
</title>

instead of <title>meusite</title>

So far, so good.
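The failure described above can be reproduced without any network access. This is only a minimal sketch: the two literal strings stand in for downloaded pages, and matchCount is a hypothetical helper that is not part of my program.

```go
package main

import (
	"fmt"
	"regexp"
)

// reSingleLine is the first pattern I tried: without the s flag,
// . does not match \n.
var reSingleLine = regexp.MustCompile(`<title>(.*?)</title>`)

// matchCount (hypothetical helper) reports how many submatch slots
// FindStringSubmatch returns for page; 0 means there was no match,
// because the returned slice is nil.
func matchCount(page string) int {
	return len(reSingleLine.FindStringSubmatch(page))
}

func main() {
	oneLine := "<title>meusite</title>"
	multiLine := "<title>\nmeusite\n</title>"

	fmt.Println(matchCount(oneLine))   // whole match plus one group
	fmt.Println(matchCount(multiLine)) // no match: indexing [1] would panic
}
```

With the multi-line input the returned slice is nil, so indexing it with [1] is what triggers the index out of range panic.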

Knowing this, I tried to improve my regex so that it matches titles on a single line as well as titles split across lines. Unfortunately I did not succeed: the code still did not return the titles the way I wanted.

Below is my source code:

// Concurrency patterns - Generator
// For more information about concurrency patterns, see the talk
// Google I/O 2012 - Go Concurrency Patterns

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
)

func tituloURL(urls ...string) <-chan string {
    ch := make(chan string)

    for _, url := range urls {
        go func(url string) {
            resp, _ := http.Get(url)
            html, _ := ioutil.ReadAll(resp.Body)

            //r, _ := regexp.Compile("<title>(.*?)<\\/title>")
            // r, _ := regexp.Compile("<title>(.|\n)*?<\\/title>")
            r, _ := regexp.Compile("<title>(.*?)|([^\\d])*?<\\/title>")
            //r, _ := regexp.Compile("<title>([\\s\\S]*?)<\\/title>")
            // r, _ := regexp.Compile("<title>(.|[\\s\\S])*?<\\/title>")

            ch <- r.FindStringSubmatch(string(html))[1]

        }(url)
    }
    return ch
}

func main() {
    t1 := tituloURL("https://www.github.com", "https://www.linkedin.com")
    t2 := tituloURL("https://www.instagram.com", "https://www.youtube.com")
    fmt.Println("First titles:", <-t1, "|", <-t2)
    fmt.Println("Second titles:", <-t1, "|", <-t2)
}

As you can see, I tried several regex patterns. Regexpal reported a match for them, but the code did not return the expected result.

Does anyone have a suggestion for another regex that fixes this error?

I’m counting on your help and look forward to your replies.
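For reference, what the uncommented pattern captures can be checked offline. The sketch below assumes a literal string in place of a downloaded page, and firstGroup is a hypothetical helper. As far as I can tell, the | has the lowest precedence, so the pattern is read as two separate alternatives, `<title>(.*?)` or `([^\d])*?<\/title>`, and group 1 ends up empty even though the regex reports a match.

```go
package main

import (
	"fmt"
	"regexp"
)

// reActive is the uncommented pattern from the listing above, unchanged.
var reActive = regexp.MustCompile(`<title>(.*?)|([^\d])*?<\/title>`)

// firstGroup (hypothetical helper) returns capture group 1 and whether
// the pattern matched at all.
func firstGroup(page string) (string, bool) {
	m := reActive.FindStringSubmatch(page)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	g, ok := firstGroup("<title>\nmeusite\n</title>")
	fmt.Printf("matched=%v group1=%q\n", ok, g)
}
```

This would be consistent with a tester such as Regexpal highlighting a match while the program still sends an empty string on the channel.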

You can use an HTML parser for this. Using regexps to parse HTML is not recommended.

– Ainar-G

2018/10/23 at 22:57

2 answers

by Anderson F. Viana • **111** points · Answer 1 · 2021-08-07T20:33:19+00:00

For . to match \n, you need to add the (?s) flag to the regex.

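package main

import (
    "fmt"
    "io"
    "net/http"
    "regexp"
    "strings"
)

// (?s) makes . match \n too, so titles split across lines are captured.
var reTitulo = regexp.MustCompile(`<title>(?s)(.*?)</title>`)

// tituloURL returns a channel that receives one value per URL.
// On any failure it sends an empty string, so the receiver never blocks forever.
func tituloURL(urls ...string) <-chan string {
    ch := make(chan string)

    for _, url := range urls {
        go func(url string) {
            resp, err := http.Get(url)
            if err != nil {
                ch <- ""
                return
            }
            defer resp.Body.Close()

            html, err := io.ReadAll(resp.Body)
            if err != nil {
                ch <- ""
                return
            }

            // FindSubmatch returns nil when there is no match; check before indexing.
            m := reTitulo.FindSubmatch(html)
            if m == nil {
                ch <- ""
                return
            }
            ch <- strings.TrimSpace(string(m[1]))
        }(url)
    }
    return ch
}

func main() {
    t1 := tituloURL("https://www.github.com", "https://www.linkedin.com")
    t2 := tituloURL("https://www.instagram.com", "https://www.youtube.com")
    fmt.Println("First titles:", <-t1, "|", <-t2)
    fmt.Println("Second titles:", <-t1, "|", <-t2)
}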

You can find more information on the regexp syntax at https://github.com/google/re2/wiki/Syntax.

The TrimSpace function from the strings package removes leading and trailing whitespace.
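The effect of (?s) together with TrimSpace can also be checked without touching the network. This is a minimal sketch under stated assumptions: the literal page string and the extractTitle helper are hypothetical and only serve as an illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// reTitle uses the s flag so that . also matches \n.
var reTitle = regexp.MustCompile(`(?s)<title>(.*?)</title>`)

// extractTitle (hypothetical helper) returns the trimmed title, or false
// when the page has no <title>...</title> pair.
func extractTitle(page string) (string, bool) {
	m := reTitle.FindStringSubmatch(page)
	if m == nil {
		return "", false
	}
	return strings.TrimSpace(m[1]), true
}

func main() {
	title, ok := extractTitle("<head>\n<title>\n  meusite\n</title>\n</head>")
	fmt.Printf("%q %v\n", title, ok)

	_, ok = extractTitle("<head></head>")
	fmt.Println(ok)
}
```

Returning a boolean next to the title avoids indexing a nil slice when a page has no title at all.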

by Ainar-G • **227** points · Answer 2 · 2018-10-27T10:10:23+00:00

If you want to use regexp:

// i: case-insensitive, s: . also matches \n.
var r = regexp.MustCompile(`(?is)<title>(.*?)</title>`)

// site holds the HTML of the page as a string.
matches := r.FindStringSubmatch(site)
if len(matches) == 2 {
    fmt.Printf("title: %q\n", matches[1])
} else {
    fmt.Println("no title!")
}

Playground: https://play.golang.org/p/s5239JVHMQu.
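To see what the two flags buy you, and where the pattern stops working, here is a small self-contained sketch. The three input strings and the titleOf helper are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
)

// r is the pattern from above: i = case-insensitive, s = . matches \n.
var r = regexp.MustCompile(`(?is)<title>(.*?)</title>`)

// titleOf (hypothetical helper) returns the captured title and whether
// the pattern matched.
func titleOf(site string) (string, bool) {
	matches := r.FindStringSubmatch(site)
	if len(matches) != 2 {
		return "", false
	}
	return matches[1], true
}

func main() {
	inputs := []string{
		"<TITLE>meusite</TITLE>",        // upper-case tags: matched thanks to i
		"<title>meu\nsite</title>",      // line break inside: matched thanks to s
		`<title id="t">meusite</title>`, // attribute on the tag: not matched
	}
	for _, site := range inputs {
		t, ok := titleOf(site)
		fmt.Printf("%q %v\n", t, ok)
	}
}
```

The last input is one reason a regexp is fragile here: the literal <title> in the pattern does not allow attributes on the opening tag.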

But the best way is to parse the HTML:

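import (
    "errors"
    "net/http"

    // The html and atom packages used below live in golang.org/x/net.
    "golang.org/x/net/html"
    "golang.org/x/net/html/atom"
)

// getTitle downloads site and returns the text of its <title> element.
func getTitle(site string) (title string, err error) {
    resp, err := http.Get(site)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    node, err := html.Parse(resp.Body)
    if err != nil {
        return "", err
    }

    title, ok := findTitle(node)
    if !ok {
        return "", errors.New("no title")
    }

    return title, nil
}

// findTitle walks the node tree depth-first and returns the text of the
// first non-empty <title> element it finds.
func findTitle(node *html.Node) (title string, ok bool) {
    if node.DataAtom == atom.Title && node.FirstChild != nil {
        return node.FirstChild.Data, true
    }

    for c := node.FirstChild; c != nil; c = c.NextSibling {
        title, ok = findTitle(c)
        if ok {
            return title, ok
        }
    }

    return "", false
}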