Convert Jsoup doc to string, apply Regex and return a value in String




I'm using an example I found called Androidjsoup to fetch the HTML source of a page, but I can't extract just the snippet I want, which sits inside a particular <script> element.

In short, Androidjsoup should run, fetch the HTML, apply a regex, and return the match in the String resultado1.

Below is my source, together with sample HTML from the page to be captured. The regex was taken from my PHP script.


package com.survivingwithandroid.jsoup;

import android.os.AsyncTask;
import android.os.Bundle;
import android.support.v7.app.ActionBarActivity;
import android.util.Log;
import android.view.Menu;
import android.view.MenuItem;
import android.view.View;
import android.widget.Button;
import android.widget.EditText;

import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class MainActivity extends ActionBarActivity {
    private EditText respText;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);

        // The resource ids were lost from the post, so they stay elided as "...".
        final EditText edtUrl = (EditText) findViewById(...);
        Button btnGo = (Button) findViewById(...);
        respText = (EditText) findViewById(...);
        btnGo.setOnClickListener(new View.OnClickListener() {
            @Override
            public void onClick(View view) {
                String siteUrl = edtUrl.getText().toString();
                // Network access must happen off the UI thread.
                new ParseURL().execute(siteUrl);
            }
        });
    }

    @Override
    public boolean onCreateOptionsMenu(Menu menu) {
        // Inflate the menu; this adds items to the action bar if it is present.
        getMenuInflater().inflate(..., menu);
        return true;
    }

    @Override
    public boolean onOptionsItemSelected(MenuItem item) {
        // Handle action bar item clicks here. The action bar will
        // automatically handle clicks on the Home/Up button, so long
        // as you specify a parent activity in AndroidManifest.xml.
        int id = item.getItemId();
        if (id == ...) {
            return true;
        }
        return super.onOptionsItemSelected(item);
    }

    private class ParseURL extends AsyncTask<String, Void, String> {

        @Override
        protected String doInBackground(String... strings) {
            StringBuffer buffer = new StringBuffer();
            try {
                Log.d("JSwa", "Connecting to [" + strings[0] + "]");
                Document doc = Jsoup.connect(strings[0]).get();
                Log.d("JSwa", "Connected to [" + strings[0] + "]");
                // Get document (HTML page) title
                String title = doc.title();
                Log.d("JSwA", "Title [" + title + "]");
                buffer.append("Title: " + title + "\r\n");

                // Get meta info
                Elements metaElems = doc.select("meta");
                buffer.append("META DATA\r\n");
                for (Element metaElem : metaElems) {
                    String name = metaElem.attr("name");
                    String content = metaElem.attr("content");
                    buffer.append("name [" + name + "] - content [" + content + "] \r\n");
                }

                Elements topicList = doc.select("h2.topic");
                buffer.append("Topic list\r\n");
                for (Element topic : topicList) {
                    String data = topic.text();
                    buffer.append("Data [" + data + "] \r\n");
                }

                // Walk every <script> element and dump its raw contents.
                Elements scriptElements = doc.getElementsByTag("script");
                buffer.append("Variavel resultado1\r\n");
                for (Element element : scriptElements) {
                    for (DataNode node : element.dataNodes()) {
                        String scriptdata = node.getWholeData();
                        buffer.append("ScriptData [" + scriptdata + "] \r\n");
                        //String resultado1
                    }
                }
            } catch (Throwable t) {
                // Log the failure instead of swallowing it silently.
                Log.e("JSwA", "Failed to parse page", t);
            }

            return buffer.toString();
        }

        @Override
        protected void onPreExecute() {
        }

        @Override
        protected void onPostExecute(String s) {
            // Show the collected text in the response field.
            respText.setText(s);
        }
    }
}
Sample page HTML


    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <script type="text/javascript">
        function var1() {
    <title>Link das Pessoas</title>

        etc valorM = (valores de xyz);
        etc valorE = (valores de xy);
        pegavalor(function() {
            ...funcoes_diversars(Conteudo dinamico e estatico... 
                ...Conteudo dinamico e estatico)
    <div>Conteudo #2</div>
    <script type="text/javascript">
        var google...


Regex to take the value of resultado1:


Note: I took this regex from PHP, and I don't know whether it needs any changes to work in Java.

If possible, I'd like code that would let me add other regexes to capture other values into strings, e.g. resultado2...


  • No, it's the URL; I'll extract other values as needed. Take a look at the regex. @re22

  • So far I've been able to narrow the search and get only the content of the script I want, using String procurarPor = "pegavalor(function()"; if (scriptdata.toLowerCase().contains(procurarPor.toLowerCase())) { However, I can't get the regex to work at all: it always returns java.util.regex.Matcher@53724000. I'll keep looking until a solution turns up here; if you find one first, please post it, though I don't think that will happen. =(

  • The problem was solved, though not in the way I expected: I read the entire page, pass its content to PHP, run the regex search there, and return the value. I would have answered my own question, but this doesn't do what I originally wanted, and others may have the same doubt. So I'm leaving the alternative here and keeping the question open for a possible answer.

1 answer


Since the content to be extracted is a URL, we can build the regular expression from the following parts:

  • Begins with http or https - (http|https) - the character | is the OR operator;
  • Is followed by :// - :\\/\\/ - the backslash (written \\ in a Java string literal) escapes the character that follows it;
  • Any text - .* - the . means "any character" and * is the quantifier for zero or more;
  • .com/pessoas - \\.com\\/pessoas - here the . is escaped so that it matches a literal dot instead of any character;

Putting all this together, the regular expression is:

(http|https):\\/\\/.*\\.com\\/pessoas

To isolate this content we use a capturing group, written with (). Applied to your code, the result is:

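import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern regex = Pattern.compile("((http|https):\\/\\/.*\\.com\\/pessoas)");
Matcher matcher = regex.matcher(scriptdata);

// find() runs the search; group(1) is the outer (...) group, i.e. the whole URL.
if (matcher.find()) {
  resultado1 = matcher.group(1);
}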
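
The question also asks for a way to add further regexes (resultado2 and so on). A minimal sketch of one possible approach follows, assuming each value can be captured as group 1 of its own pattern. The class name ScriptValueExtractor, the method extractAll, the valorM pattern, and the sample script text are all hypothetical and only illustrate the idea. Note that the matched text comes from calling find() and then group(); printing the Matcher object itself only shows the object's own string form, which is probably what the java.util.regex.Matcher@53724000 output in the comments was.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptValueExtractor {

    // Named patterns; group 1 of each pattern is the value to keep.
    // "resultado1" uses the expression from the answer above; "resultado2"
    // is a hypothetical pattern written against the sample page's valorM line.
    static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("resultado1",
                Pattern.compile("((http|https):\\/\\/.*\\.com\\/pessoas)"));
        PATTERNS.put("resultado2",
                Pattern.compile("valorM\\s*=\\s*\\(([^)]*)\\)"));
    }

    // Applies every pattern to the script text and returns name -> first match.
    // A name maps to null when its pattern does not match.
    static Map<String, String> extractAll(String scriptData,
                                          Map<String, Pattern> patterns) {
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            Matcher matcher = entry.getValue().matcher(scriptData);
            // find() must succeed before group(1) can be read.
            results.put(entry.getKey(), matcher.find() ? matcher.group(1) : null);
        }
        return results;
    }

    public static void main(String[] args) {
        // Hypothetical script contents standing in for node.getWholeData().
        String scriptData = "var link = \"http://www.example.com/pessoas\";\n"
                + "etc valorM = (valores de xyz);\n";
        for (Map.Entry<String, String> e : extractAll(scriptData, PATTERNS).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Inside doInBackground, the map returned by extractAll(scriptdata, PATTERNS) could be read with get("resultado1"), and a new value would only need one more PATTERNS.put(...) line. Because .* is greedy, the first pattern extends to the last ".com/pessoas" on the same line; if that over-matches on the real page, the reluctant form .*? may be a better fit.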
