Regex to match the title in a citation


1

I’m trying to capture the titles of all citations in scientific articles. My regex so far is:

(([A-ZÁÀÃÂÉÈÊÍÌÎÙÛÓÒÇ]{3,10}, ([A-Za-záàããâéèêêíìîúùóòõç- . ]. ( ){0,1}){1,3})(et al.){0,1}([Oo]rg. ){0,1}([Ee]d. ){0,1}(; [A-ZÁÀÃÂÉÈÊÍÌÎÚÛÒÓÒÇ]{2,10}, ([A-Za-záàãâéèêííìîúùóòõç- . ]. ( ){0,1}){1,3}){0,2}){0,1}([A-ZÁÀÂÉÈÊÊÍÌÎÚÓÒÇ() . ]{3,60}){0,1}([A-Za-Z0-9:-áàâéèêêíìîúóòç, ]{10,}. ( ){0,1})(([A-Za-záàãâéèêíìîúûóòç-: /]){1,30}(, ) d{4}. ){0,1}

Some examples of citations (the title is the sentence that follows the author names):

DI MAIO, P. The Missing Pragmatic Link in the Semantic Web. Business Intelligence Advisory Service Executive Update. v. 8, n. 7, 2008.

ECO, U. Lector in Fabula: interpretative cooperation in narrative text. Barcelona: Lumen, 1987

ECO, U. The concept of text. São Paulo: T. A. Q. /EDUSP, 1984.

ECO, U. Open work: form and indetermination in contemporary poetics. São Paulo: Perspectiva, 1988.

ECO, U. The limits of interpretation. São Paulo: Pioneer, 2000.

EDMONDS, B. The Pragmatic Roots of Context. In: PROC. OF THE 2ND INTERNATIONAL AND INTERDISCIPLINARY CONFERENCE ON MODELING AND USING CONTEXT. Berlin; Heidelberg; New York, v. 1688, 1999. Annals... v. 1688, p. 119-132, 1999.

BERNERS-LEE, T. Semantic Web Concepts. 2005a. Available in: http://www.w3.org/2005/Talks/0517-boit-tbl. Accessed in: Sep 25. 2014

BERNERS-LEE, T. Web for real people. 2005b. Available at . Accessed: 25 Sep. 2014.

BERNERS-LEE, T.; CAILLIAU, R. Worldwideweb: Proposal for a Hypertext Project. 1990. Available in: < http://www.w3.org/Proposal.html >. Accessed: Oct 13. 2014.

BERNERS-LEE, T.; HENDLER, J.; LASSILA, O. The semantic web: a new form of web content that is Meaningful to Computers will Unleash a Revolution of new Possibilities. New York: Scientific American, 2001. Available in: http://www.sciam.com/2001/050lissue/0501berners-lee.html. Accessed: Oct 13. 2014.

BLAIR, D. C. Information Retrieval and the Philosophy of Language. Annual Review of Information Science and Tecchnology, v. 37, pp. 3-50, Medford, 2003.

BLAIR, D. C. Wittgenstein, Language and Information: Back to the Rough Ground! Dordrecht: Springer, 2006.

BONFIM, M. AND. Text Document Recovery Using an Extended Probabilistic Template. Piracicaba: UNIMEP, 2006. 131 f. Dissertation (Master in Computer Science). Master in Computer Science. Methodist University of Piracicaba, 2006.

BORLUND, P. The Concept of Relevance in IR. Journal of the American Society for Information Science and Technology, v.54, p. 913-925, 2003.

BORST, W. N. Construction of Engineering ontologies. Thesis (Phd in Information and Knowledge Systems). University of Tweenty - Centre for Telematica and Information Technology, Enschede, Nederland, 1997.

BOUNDLESS. Boundless Psychology. 201X. Available in < https://www.boundless.com/psychology/textbooks/boundless-psychology-textbook/ > Accessed: 13 Aug. 2014.

BRATT, S. Semantic Web, and Other Technologies to Watch. 2008. Available in < http://www.w3.org/2008/Talks/1009-bratt-W3CSemTech/Overview.html > Accessed: 13 Aug. 2014.

BRÉAL, M. Semantics: English in the English language. New York: Henry Holt & Company, 1900.

BRICKLEY, D.; MILLER, L. FOAF Vocabulary Specification 0.9. 2007. Available in < http://xmlns.com/foaf/spec/20070524.html > Accessed: 17 May 2015.

BRITISH LIBRARY. Sample Data. Available in . Accessed: 12 Dec. 2014.

BRUYNE, P. de, HERMAN, J., SCHOUTHEETE, M. de. Dynamics of research in social sciences. Rio de Janeiro: Francisco Alves, 1977.

BUFREM, L. S, et al. Modelling practices for the socialization of information- the construction of knowledge in higher education. Perspectives in Information Science, Belo Horizonte, v.15, n.2, p.22-41, May/Aug. 2010.

These are not all of the cases; the complete list of citations is found here:

For testing: https://regex101.com/r/n2554R/1/

  • This is the best I could do: demos on Regex101 and Debuggex. But as you were told on SOen, this is a natural language problem that can’t be solved with a regex.

  • Use [A-zà-ḥ] to capture accented letters; don’t list them one by one like this => [A-ZÁÀÂÂÉÈÍÍÌÎÙÛÓÒÇ]

  • @danieltakeshi What is a "natural language problem"? Why can’t this be solved with a regex? I only started using regex a little while ago.

  • Because it’s too complex for a regex. Natural language is much more complex, since handling it requires an understanding of human languages. If you look at the list in the Regex101 demo, you will notice several citation styles that follow no common pattern. What a regex can do is find citations that follow a standard with clear rules, for example ABNT or IEEE. For all kinds of citations, it takes something like artificial intelligence.

  • One site I suggest for this problem is Kaggle, but before creating a competition you need to study data science and understand how the site’s competitions work; several of them pay prize money to whoever solves them. Note: with LaTeX you can get these fields more easily, but the article has to be written in LaTeX.

  • @Danieltakeshi Thanks for the clarification!

  • Did you manage to solve it?
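
A side note on the accented-letter ranges suggested in the comments: a range such as [A-z] also covers the six ASCII characters that sit between Z and a, including _ and ^. The sketch below shows one alternative, assuming a JavaScript engine that supports Unicode property escapes (ES2018 and later, with the u flag). The name upperSurname is only illustrative and is not part of the question's regex.

```javascript
// Assumption: the runtime supports Unicode property escapes (ES2018+, "u" flag).
// \p{Lu} matches any uppercase letter, accented or not.
// The pattern accepts uppercase words (hyphens allowed) separated by single spaces.
const upperSurname = /^[\p{Lu}\-]+(?: [\p{Lu}\-]+)*$/u;

console.log(upperSurname.test("BRÉAL"));       // true
console.log(upperSurname.test("BERNERS-LEE")); // true
console.log(upperSurname.test("DI MAIO"));     // true
console.log(upperSurname.test("Lumen"));       // false

// [A-z] spans code points 0x41 to 0x7A, which includes "_" and "^".
console.log(/^[A-z]$/.test("_"));              // true
```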


2 answers

2

Instead of using a regex, I suggest splitting the string into an array on ". " (a period followed by a space) and taking index [1], which will be precisely the title. See:

var strings = [
   "DI MAIO, P. The Missing Pragmatic Link in the Semantic Web. Business Intelligence Advisory Service Executive Update. v. 8, n. 7, 2008.",
   "ECO, U. Lector in Fabula: la cooperación interpretativa en el texto narrativo. Barcelona: Lumen, 1987",
   "ECO, U. O conceito de texto. São Paulo: T. A. Q. /EDUSP, 1984.",

   "ECO, U. Obra aberta: forma e indeterminação nas poéticas contemporâneas. São Paulo: Perspectiva, 1988.",
   "ECO, U. Os limites da interpretação. São Paulo: Pioneira, 2000.",
   "EDMONDS, B. The Pragmatic Roots of Context. In: PROC. OF THE 2ND INTERNATIONAL AND INTERDISCIPLINARY CONFERENCE ON MODELING AND USING CONTEXT. Berlin; Heidelberg; New York, v. 1688, 1999. Anais… v. 1688, p. 119-132, 1999.",
   "BERNERS-LEE, T. Semantic Web Concepts. 2005a. Disponível em: http://www.w3.org/2005/Talks/0517-boit-tbl. Acesso em: 25 set. 2014",
   "BERNERS-LEE, T. Web for real people. 2005b. Disponível em . Acesso em: 25 set. 2014.",
   "BERNERS-LEE, T.; CAILLIAU, R. WorldWideWeb: Proposal for a HyperText Project. 1990. Disponível em: < http://www.w3.org/Proposal.html >. Acesso em: 13 out. 2014.",
   "BERNERS-LEE, T.; HENDLER, J.; LASSILA, O. The semantic web: a new form of web content that is meaningful to computers will unleash a revolution of new possibilities. New York: Scientific American, 2001. Disponível em: http://www.sciam.com/2001/050lissue/0501berners-lee.html. Acesso em: 13 out. 2014."
]

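// For each citation, the text between the first and second ". " is taken as the title
for(var x=0; x<strings.length; x++){
   var titulo = strings[x].split(". ")[1];
   // Print the citation with the title highlighted, then the title on its own line
   document.querySelector("#res").innerHTML += strings[x].replace(titulo,"<span style='color:blue;'>"+titulo+"</span>")+"<br><b style='color: red;'>Title -></b> <b>"+titulo+"</b><br><br>";
}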
<div id="res"></div>

This assumes that the title itself contains no ". " (a period followed by a space).

The code would be this:

var string = "DI MAIO, P. The Missing Pragmatic Link in the Semantic Web. Business Intelligence Advisory Service Executive Update. v. 8, n. 7, 2008";

var titulo = string.split(". ")[1];
console.log(titulo);
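
The split idea can also be wrapped in a small function. The sketch below uses a hypothetical name, titleBySplit, and includes one citation from the question's list where the approach breaks: an author with two initials ("BLAIR, D. C.") contains an extra ". ", so index [1] is an initial rather than the title.

```javascript
// Hypothetical helper wrapping the split(". ") idea from this answer.
function titleBySplit(citation) {
  var parts = citation.split(". ");
  // parts[0] is the author block; parts[1] is the title when the assumption holds
  return parts.length > 1 ? parts[1] : null;
}

console.log(titleBySplit("ECO, U. The limits of interpretation. São Paulo: Pioneer, 2000."));
// -> "The limits of interpretation"

console.log(titleBySplit("BLAIR, D. C. Information Retrieval and the Philosophy of Language. Annual Review of Information Science and Tecchnology, v. 37, pp. 3-50, Medford, 2003."));
// -> "C" (an author initial, not the title)
```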


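Another way would be to manipulate the string directly: find the first lowercase letter (the author names are all uppercase), step back to the start of that word, and take everything up to the next ". ":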

var strings = [
   "DI MAIO, P. The Missing Pragmatic Link in the Semantic Web. Business Intelligence Advisory Service Executive Update. v. 8, n. 7, 2008.",
   "ECO, U. Lector in Fabula: la cooperación interpretativa en el texto narrativo. Barcelona: Lumen, 1987",
   "ECO, U. O conceito de texto. São Paulo: T. A. Q. /EDUSP, 1984.",
   "ECO, U. Obra aberta: forma e indeterminação nas poéticas contemporâneas. São Paulo: Perspectiva, 1988.",
   "ECO, U. Os limites da interpretação. São Paulo: Pioneira, 2000.",
   "EDMONDS, B. The Pragmatic Roots of Context. In: PROC. OF THE 2ND INTERNATIONAL AND INTERDISCIPLINARY CONFERENCE ON MODELING AND USING CONTEXT. Berlin; Heidelberg; New York, v. 1688, 1999. Anais… v. 1688, p. 119-132, 1999.",
   "BERNERS-LEE, T. Semantic Web Concepts. 2005a. Disponível em: http://www.w3.org/2005/Talks/0517-boit-tbl. Acesso em: 25 set. 2014",
   "BERNERS-LEE, T. Web for real people. 2005b. Disponível em . Acesso em: 25 set. 2014.",
   "BERNERS-LEE, T.; CAILLIAU, R. WorldWideWeb: Proposal for a HyperText Project. 1990. Disponível em: < http://www.w3.org/Proposal.html >. Acesso em: 13 out. 2014.",
   "BERNERS-LEE, T.; HENDLER, J.; LASSILA, O. The semantic web: a new form of web content that is meaningful to computers will unleash a revolution of new possibilities. New York: Scientific American, 2001. Disponível em: http://www.sciam.com/2001/050lissue/0501berners-lee.html. Acesso em: 13 out. 2014."
]

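for(var x=0; x<strings.length; x++){

   // Author names are all uppercase, so the first lowercase letter
   // is taken to belong to the first word of the title
   for(var y=0; y<strings[x].length; y++){

      var letra = strings[x][y];

      if(letra.match(/[a-z]/)){
         // Step back to the start of the title: 2 characters if the lowercase letter
         // follows a space (one-letter first word, e.g. "O conceito"), otherwise 1
         var titIni = y-(strings[x][y-1] == " " ? 2 : 1);
         break;
      }
   }

   // The title runs from titIni up to the next ". "
   var titulo = strings[x].substring(titIni,strings[x].indexOf(". ", titIni));
   document.querySelector("#res").innerHTML += strings[x].replace(titulo,"<span style='color:blue;'>"+titulo+"</span>")+"<br><b style='color: red;'>Title -></b> <b>"+titulo+"</b><br><br>";

}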
<div id="res"></div>

This also assumes that the title contains no ". " (a period followed by a space).

Code:

var string = "ECO, U. O conceito de texto. São Paulo: T. A. Q. /EDUSP, 1984.";

   for(var x=0; x<string.length; x++){

      var letra = string[x];
      
      if(letra.match(/[a-z]/)){
         var titIni = x-(string[x-1] == " " ? 2 : 1);
         break;
      }
   }

var titulo = string.substring(titIni,string.indexOf(". ", titIni));
console.log(titulo);
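
The same scan can be written as a function. This is only a sketch under the answer's assumptions (author names fully in uppercase, no ". " inside the title). The name titleByFirstLowercase is hypothetical, and \p{Ll} (which needs the u flag, ES2018 and later) is swapped in for [a-z] so that accented lowercase letters also count.

```javascript
// Hypothetical helper: returns the title, or null if no lowercase letter is found.
function titleByFirstLowercase(citation) {
  // Index of the first lowercase letter, assumed to be inside the title's first word
  var firstLower = citation.search(/\p{Ll}/u);
  if (firstLower < 1) return null; // no lowercase letter, or nothing before it

  // Step back to the start of the word: 2 if the letter follows a space, else 1
  var start = firstLower - (citation[firstLower - 1] === " " ? 2 : 1);
  var end = citation.indexOf(". ", start);
  // If no ". " follows, take the rest of the string
  return citation.substring(start, end === -1 ? citation.length : end);
}

console.log(titleByFirstLowercase("ECO, U. O conceito de texto. São Paulo: T. A. Q. /EDUSP, 1984."));
// -> "O conceito de texto"
```

A citation with lowercase particles in the author block, such as "BRUYNE, P. de, ..." from the question's list, would still defeat this scan, because the first lowercase letter is no longer in the title.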

1

The logic I came up with was to split each citation into three parts:

  • Authors
  • Title
  • Descriptions

From this, the following rules can be defined:

Authors: surname + comma + space + name + period, e.g. BERNERS-LEE, T. Multiple authors are separated by ; and the last author always ends with a period.

Title: anything with no ; or . in the middle, ending with a period.

Description : anything that comes after the title.

REGEX

^((?:.+?, .+?;)*?(?:[^;\s]+?, .+?)\.)([^;]+?\.).*$

Explanation

  • ((?:.+?, .+?;)*?(?:[^;\s]+?, .+?)\.) - captures the authors
    • (?:.+?, .+?;)*? - handles multiple authors, each of which always ends with ;
    • (?:[^;\s]+?, .+?)\. - matches the last author, who never has a ; and ends with .
  • ([^;]+?\.) - captures the title, which ends with .
  • .*$ - the description, up to the end of the line.

See it on Regex101
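
A minimal sketch of how the pattern above could be used from JavaScript. The pattern is copied unchanged; extractTitle is a hypothetical helper name. Group 2 starts with the space that follows the author block, so the sketch trims it and drops the final period. Note that, as written, [^;\s] does not allow a space inside the last author's surname, so a citation such as "DI MAIO, P. ..." does not match.

```javascript
// The pattern from this answer, unchanged.
const citationPattern = /^((?:.+?, .+?;)*?(?:[^;\s]+?, .+?)\.)([^;]+?\.).*$/;

// Hypothetical helper: returns the title without surrounding space or the
// final period, or null when the citation does not match the pattern.
function extractTitle(citation) {
  const match = citationPattern.exec(citation);
  return match ? match[2].trim().replace(/\.$/, "") : null;
}

console.log(extractTitle("ECO, U. The limits of interpretation. São Paulo: Pioneer, 2000."));
// -> "The limits of interpretation"

console.log(extractTitle("BERNERS-LEE, T.; CAILLIAU, R. Worldwideweb: Proposal for a Hypertext Project. 1990. Available in: < http://www.w3.org/Proposal.html >. Accessed: Oct 13. 2014."));
// -> "Worldwideweb: Proposal for a Hypertext Project"

console.log(extractTitle("DI MAIO, P. The Missing Pragmatic Link in the Semantic Web. Business Intelligence Advisory Service Executive Update. v. 8, n. 7, 2008."));
// -> null (the surname contains a space)
```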

  • I couldn’t understand the "?:" in the regex. What does it mean?

  • @Brunobrito when you use parentheses you create a group; by using (?: ...) you are basically saying that these parentheses will not create a group.

  • Guilherme, your logic worked perfectly; the problem is that I posted only a portion of the citations, and this solution does not cover all cases. I took some ideas from your regex and adapted them to mine, but I couldn’t find a way to match only the title: I need a way to "stop" the match once the first period is found. Link: https://regex101.com/r/PlkNWz/1 Do you know how I can fix this?

  • @Brunobrito this kind of thing should be put in the original question, because looking at the new list you can see that it breaks my logic as well as @dvd’s, since there are some entries where the last author ends with ;. When I have more time I’ll analyze what can be done.

  • Right, your answer is correct because it matches the question as asked, but I will edit it and add all the cases. My mistake.
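
The difference between ( ... ) and (?: ... ) discussed in the comments can be seen by comparing the match arrays. The two patterns below are simplified illustrations written for this comparison, not the answer's regex; they differ only in the grouping syntax.

```javascript
// Same text, two patterns: only the grouping syntax differs.
const text = "ECO, U. The limits of interpretation.";

const capturing = /^(\S+), (.+?)\. (.+)\.$/.exec(text);
const nonCapturing = /^(?:\S+), (?:.+?)\. (.+)\.$/.exec(text);

console.log(capturing.length);    // 4: the full match plus 3 captured groups
console.log(nonCapturing.length); // 2: the full match plus 1 captured group
console.log(nonCapturing[1]);     // "The limits of interpretation"
```

With non-capturing groups for the author parts, the title is always group 1, however many helper groups the pattern needs.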
