I’m trying to capture the titles of all citations in scientific articles. My regex looks like this:
(([A-ZÁÀÃÂÉÈÊÍÌÎÙÛÓÒÇ]{3,10}, ([A-Za-záàããâéèêêíìîúùóòõç- . ]. ( ){0,1}){1,3})(et al.){0,1}([Oo]rg. ){0,1}([Ee]d. ){0,1}(; [A-ZÁÀÃÂÉÈÊÍÌÎÚÛÒÓÒÇ]{2,10}, ([A-Za-záàãâéèêííìîúùóòõç- . ]. ( ){0,1}){1,3}){0,2}){0,1}([A-ZÁÀÂÉÈÊÊÍÌÎÚÓÒÇ() . ]{3,60}){0,1}([A-Za-Z0-9:-áàâéèêêíìîúóòç, ]{10,}. ( ){0,1})(([A-Za-záàãâéèêíìîúûóòç-: /]){1,30}(, ) d{4}. ){0,1}
Some examples of citations. The title is the part that follows the author names:
DI MAIO, P. The Missing Pragmatic Link in the Semantic Web. Business Intelligence Advisory Service Executive Update. v. 8, n. 7, 2008.
ECO, U. Lector in Fabula: interpretative cooperation in narrative text. Barcelona: Lumen, 1987
ECO, U. The concept of text. São Paulo: T. A. Q. /EDUSP, 1984.
ECO, U. Open work: form and indetermination in contemporary poetics. São Paulo: Perspectiva, 1988.
ECO, U. The limits of interpretation. São Paulo: Pioneer, 2000.
EDMONDS, B. The Pragmatic Roots of Context. In: PROC. OF THE 2ND INTERNATIONAL AND INTERDISCIPLINARY CONFERENCE ON MODELING AND USING CONTEXT. Berlin; Heidelberg; New York, v. 1688, 1999. Annals... v. 1688, p. 119-132, 1999.
BERNERS-LEE, T. Semantic Web Concepts. 2005a. Available in: http://www.w3.org/2005/Talks/0517-boit-tbl. Accessed in: Sep 25. 2014
BERNERS-LEE, T. Web for real people. 2005b. Available at . Accessed: 25 Sep. 2014.
BERNERS-LEE, T.; CAILLIAU, R. Worldwideweb: Proposal for a Hypertext Project. 1990. Available in: < http://www.w3.org/Proposal.html >. Accessed: Oct 13. 2014.
BERNERS-LEE, T.; HENDLER, J.; LASSILA, O. The semantic web: a new form of web content that is Meaningful to Computers will Unleash a Revolution of new Possibilities. New York: Scientific American, 2001. Available in: http://www.sciam.com/2001/050lissue/0501berners-lee.html. Accessed: Oct 13. 2014.
BLAIR, D. C. Information Retrieval and the Philosophy of Language. Annual Review of Information Science and Tecchnology, v. 37, pp. 3-50, Medford, 2003.
BLAIR, D. C. Wittgenstein, Language and Information: Back to the Rough Ground! Dordrecht: Springer, 2006.
BONFIM, M. AND. Text Document Recovery Using an Extended Probabilistic Template. Piracicaba: UNIMEP, 2006. 131 f. Dissertation (Master in Computer Science). Master in Computer Science. Methodist University of Piracicaba, 2006.
BORLUND, P. The Concept of Relevance in IR. Journal of the American Society for Information Science and Technology, v.54, p. 913-925, 2003.
BORST, W. N. Construction of Engineering ontologies. Thesis (Phd in Information and Knowledge Systems). University of Tweenty - Centre for Telematica and Information Technology, Enschede, Nederland, 1997.
BOUNDLESS. Boundless Psychology. 201X. Available in < https://www.boundless.com/psychology/textbooks/boundless-psychology-textbook/ > Accessed: 13 Aug. 2014.
BRATT, S. Semantic Web, and Other Technologies to Watch. 2008. Available in < http://www.w3.org/2008/Talks/1009-bratt-W3CSemTech/Overview.html > Accessed: 13 Aug. 2014.
BRÉAL, M. Semantics: English in the English language. New York: Henry Holt & Company, 1900.
BRICKLEY, D.; MILLER, L. FOAF Vocabulary Specification 0.9. 2007. Available in < http://xmlns.com/foaf/spec/20070524.html > Accessed: 17 May 2015.
BRITISH LIBRARY. Sample Data. Available in . Accessed: 12 Dec. 2014.
BRUYNE, P. de, HERMAN, J., SCHOUTHEETE, M. de. Dynamics of research in social sciences. Rio de Janeiro: Francisco Alves, 1977.
BUFREM, L. S, et al. Modelling practices for the socialization of information- the construction of knowledge in higher education. Perspectives in Information Science, Belo Horizonte, v.15, n.2, p.22-41, May/Aug. 2010.
These are not all the cases; the complete list of citations is found here:
For testing: https://regex101.com/r/n2554R/1/
That’s the best I could do: demos on Regex101 and Debuggex. But as you were told on SOen, this is a natural language problem that can’t be solved with a regex.
– danieltakeshi
Use [A-zà-ḥ] or even [A-zà-ḥ] to capture accented letters; don’t do this => [A-ZÁÀÂÂÉÈÍÍÌÎÙÛÓÒÇ]
– Paz
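As a hedged illustration of the range idea in the comment above: a range covers the accented letters in one span instead of listing them one by one, but note that `A-z` also covers the six ASCII punctuation characters that sit between `Z` and `a`. The Latin-1 ranges `À-Ö`, `Ø-ö` and `ø-ÿ` used below are an assumption chosen for this sketch, not the ranges from the comment.

```python
import re

# A-z is short, but it also spans the ASCII characters between
# 'Z' and 'a': [ \ ] ^ _ `
loose = re.compile(r"[A-z]")

# Assumed alternative: explicit ASCII letters plus the Latin-1 letter
# ranges, skipping U+00D7 (the multiplication sign) and U+00F7 (the
# division sign), which are not letters.
LETTERS = "A-Za-zÀ-ÖØ-öø-ÿ"
latin_letter = re.compile(rf"[{LETTERS}]")


def is_latin_word(text: str) -> bool:
    """Return True if text consists only of ASCII or Latin-1 letters."""
    return re.fullmatch(rf"[{LETTERS}]+", text) is not None
```

With these definitions, `loose` accepts `_` while `latin_letter` rejects it, and `is_latin_word` accepts both `BRÉAL` and `São`.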
@danieltakeshi What is a "natural language problem"? Why can’t this be solved with regex? I only started using regex a little while ago.
– Bruno Brito
Because it’s too complex for a regex. Natural language is much more complex, since handling it requires an understanding of human languages. If you look at the list in the Regex101 demo, you will notice several citation styles that follow no common pattern. What is possible with regex is searching for citations that follow a standard with clear rules, for example ABNT or IEEE. But covering all kinds of citations takes something like Artificial Intelligence.
– danieltakeshi
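To make the "one standard with clear rules" point concrete, here is a minimal sketch that assumes a single author shape, `SURNAME, I. I.` with authors separated by `; `, followed by a title that ends at the first sentence-ending punctuation mark. The names `UPPER`, `AUTHOR`, `CITATION` and `extract_title` are hypothetical, and this is not a full implementation of ABNT or any other norm.

```python
import re
from typing import Optional

# Uppercase ASCII plus uppercase Latin-1 letters (assumed sufficient here).
UPPER = "A-ZÀ-ÖØ-Þ"

# One author in the assumed "SURNAME, I. I." shape, e.g. "DI MAIO, P."
AUTHOR = rf"[{UPPER}][{UPPER} '-]*, (?:[A-Z]\. ?)+"

# One or more authors separated by "; ", then the title up to the first
# '.', '!' or '?' that is followed by whitespace or the end of the text.
CITATION = re.compile(
    rf"(?P<authors>{AUTHOR}(?:; {AUTHOR})*)"
    r"(?P<title>.+?)[.!?](?=\s|$)"
)


def extract_title(citation: str) -> Optional[str]:
    """Return the title if the citation fits the assumed shape, else None."""
    match = CITATION.match(citation)
    return match.group("title").strip() if match else None


print(extract_title("ECO, U. The concept of text. São Paulo: T. A. Q. /EDUSP, 1984."))
print(extract_title("BRITISH LIBRARY. Sample Data. Accessed: 12 Dec. 2014."))
```

The first call prints the title, while the second prints `None` because the entry has no `SURNAME, I.` author block. Entries like that one are why a single pattern cannot cover every citation style in the list.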
One site I suggest for this problem is Kaggle, but before creating a competition you need to study Data Science and better understand how the site’s competitions work. Several of them pay prize money to whoever solves them. Note: in LaTeX you can get these fields more easily, but the article has to be written in LaTeX.
– danieltakeshi
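The LaTeX note above presumably refers to bibliography databases such as BibTeX, where every field is labeled explicitly. As a rough sketch under that assumption, the entry below is a hypothetical `.bib` record built from one citation in the question, and `FIELD` and `parse_fields` are illustrative names; the pattern does not handle nested braces or quoted values.

```python
import re

# Hypothetical BibTeX entry built from one citation in the question.
ENTRY = """
@book{eco2000limits,
  author    = {Eco, U.},
  title     = {The limits of interpretation},
  address   = {São Paulo},
  publisher = {Pioneer},
  year      = {2000}
}
"""

# Matches `name = {value}` at the start of a line; assumes the value
# contains no nested braces.
FIELD = re.compile(r"^\s*(\w+)\s*=\s*\{([^{}]*)\}", re.MULTILINE)


def parse_fields(entry: str) -> dict:
    """Map each lowercased field name to its brace-delimited value."""
    return {name.lower(): value for name, value in FIELD.findall(entry)}


fields = parse_fields(ENTRY)
print(fields["title"])
```

Because the title is a named field, no guessing about where it starts or ends is needed, which is the advantage the note points to.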
@Danieltakeshi Thanks for the clarification!
– Bruno Brito
Did you manage to solve it?
– Sam