Vocabulary of our language, where to find VOLP data?

18

The applications of complete and reliable dictionaries are immense (no need even to list them here!)... Unlike English and many other languages that have no "official reference", our language is standardized and has an important reference vocabulary, one that would simplify the lives of developers and users of language databases.

Question

In Brazil (and reportedly in Portugal the situation is similar), school books, spell-checking software, etc. are all (indirectly) obliged, by law, to comply with the spelling set out in the VOLP, the Vocabulário Ortográfico da Língua Portuguesa (Orthographic Vocabulary of the Portuguese Language). It has ~381,000 entries (see link): where are they?

Does anyone know where or how I can get the VOLP (for download or on CD) in XML, SQL or another structured format? In fact it does not need to be "the VOLP" itself; a serious and reliable vocabulary (e.g. the Unitex or VERO base) with a flag marking the official VOLP words would do.
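To make the "flag" requirement concrete, here is a minimal sketch of what certification against such a vocabulary could look like. The file layout (one `word<TAB>flag` pair per line, with `1` meaning "official in the VOLP") and the function names `load_vocabulary` and `certify` are assumptions made for illustration; no such file is known to exist in this form.

```python
import re

def load_vocabulary(lines):
    """Parse hypothetical 'word<TAB>flag' lines into {word: is_official}."""
    vocab = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        word, _, flag = line.partition("\t")
        vocab[word.lower()] = (flag.strip() == "1")
    return vocab

# Runs of Unicode letters, optionally joined by hyphens (e.g. "guarda-chuva").
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def certify(text, vocab):
    """Split the distinct words of `text` into (official, unofficial, unknown)."""
    official, unofficial, unknown = [], [], []
    seen = set()
    for match in WORD_RE.finditer(text):
        word = match.group(0).lower()
        if word in seen:
            continue
        seen.add(word)
        if word not in vocab:
            unknown.append(word)      # not in the vocabulary at all
        elif vocab[word]:
            official.append(word)     # present and flagged as official
        else:
            unofficial.append(word)   # present but not flagged
    return official, unofficial, unknown
```

A real certifier would also have to deal with inflected forms, which is why the notes further down care about sources that list or generate inflections.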

  • Peter Krauss, this is not directly related to your question, but if your spell checking is not based on the VOLP, maybe this answer that I gave to another question will be useful somehow.

  • 1

    "I heard" that people have managed to extract the list from the site mentioned (and that it does not have all that many entries) using scripts, but I do not know how the legal aspects stand. It may well be true, because the output of the ajax URL http://www.academia.org.br/sistema_busca_palavras_portuguesas/volta_voca_org.asp?palavra=a should be very simple to parse, and a smart script would get around the limit of 200 entries per page without problems (if A exceeds 200, it uses AA, AB, ...; if IN in turn exceeds 200, INA .. INZ, and so on). Of course this is only hypothetical; the ideal is simply to buy the dictionary ;)

  • 7

    @Bacco Unfortunately, here in Brazil it is common that, by law, you need to follow a certain norm or technical standard, but the text of that standard has to be purchased and, being subject to copyright, cannot be copied. The same goes for various official data, such as postal codes. This is a huge obstacle to progress, because many computerized solutions become unviable and/or prohibitively expensive: even if the technical means are readily available, you are barred by the legal aspects.

  • @mgibsonbr my comment was motivated precisely by my disagreement with these absurdities, as is also the case with postal codes in Brazil (which were free and suddenly cost a fortune). Now, of course I won’t download anything illegally, right... I made it clear that my comment is merely hypothetical, if you know what I mean ;)

  • @Bacco OK, I was just agreeing with you ("but I don’t know how the legal aspects look"). It did not seem that you were encouraging anything illegal; sorry if I gave that impression!

  • @Raelgugelmincunha, my problem is not merely spell checking; it is much wider, and I know of several other applications that require the VOLP in order to "certify conformity with the Law". I think your suggestion fits better with, and is much more related to, the discussion we had about Metaphone.

  • 1

    @Bacco and mgibsonbr: I am editing the question... See if with the notes I can get some answers from our readers, even if only to discuss the subject.

  • 1

    There are some corpora of Portuguese text, extracted from various sources such as newspapers and magazines, that scientists use for various purposes (such as building spam analyzers, for example). Of course they are not official like the VOLP, but perhaps they are a useful alternative for needs other than dictionaries. Examples: CETEM (PT), CETEM/Folha (BR) and LAEL (BR)

  • 1

    @Luizvieira, the selection and organization of linguistic corpora is very important for establishing relevance (frequency of use), which the VOLP still neglects: in it we do not find relevant terms such as "environment" (which appears in federal laws, scientific and journalistic works, etc.), but we do find archaic terms of no relevance such as "half-gun". The focus of my question, however, is certification (sorry that I only made this explicit in the Notes/Contextualization): I need to certify which terms of a text are official and which are not.

  • @Peterkrauss Ah, ok. : ) I think your question was actually quite clear. I just wondered whether the corpus (this is the correct plural, right? sorry) linguistics could help with other kinds of need and, if so, whether it would be worth mentioning here. But since I’m not a real connoisseur of the subject, I just wanted to comment.

  • I tried to get in touch with the ABL through the site and by e-mail, without success. We still have to try by phone...

  • @Miguelangelo, I tried a long time ago; it doesn’t hurt to insist... Perhaps, taking advantage of the fact that you are in Rio, it is best to inquire by phone, asking them to at least confirm the conclusions we wrote here.
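The prefix-splitting idea floated hypothetically in the comments above (query A; if the page is full, query AA, AB, ... and so on) is a generic way to enumerate a search that caps its results. The sketch below illustrates only the recursion, against an in-memory stand-in: `make_capped_search` and `collect_entries` are made-up names, no network request is made, and the legal caveats raised in the comments are unchanged.

```python
import string

def make_capped_search(words, limit=200):
    """Stand-in for a search that returns at most `limit` matches, in sorted order."""
    ordered = sorted(words)
    def search(prefix):
        return [w for w in ordered if w.startswith(prefix)][:limit]
    return search

def collect_entries(search, prefix, limit=200, alphabet=string.ascii_lowercase):
    """Collect every entry starting with `prefix`, splitting the prefix when capped."""
    results = search(prefix)
    found = set(results)
    if len(results) < limit:
        return found  # the page was not full, so nothing was cut off
    # The page is full and may be truncated: retry with each longer prefix.
    for letter in alphabet:
        found |= collect_entries(search, prefix + letter, limit, alphabet)
    return found
```

For Portuguese the `alphabet` argument would need to include accented letters, "ç" and the hyphen, otherwise entries such as "água" would never be reached from a longer prefix.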

2 answers

10

(THIS IS NOT AN ANSWER!)

The following notes support the answers and also respond to the comments posted on the question. This is a wiki text: you can collaborate by reviewing and expanding it!

SUPPORTING NOTES

This is an open text (wiki) to support the general question of free access to the VOLP (or to the VOP), a heritage of the Portuguese-speaking nations.

Since most of us are unfamiliar with the subject, we need to start with some review of laws, open vocabularies and trustworthy dictionaries. The comments show that the question is not merely technical. The approach chosen for these notes was to follow the direction taken on similar questions, where more than one answer is discussed. All readers and respondents are invited to edit the text of these notes as well.

Copyright and duration

(in response to @Bacco) On the legal obligation and its duration. According to a consolidated reading of several sources on Wikipedia and the official laws on LexML:

  • The "Orthographic Agreement of 1990" was promulgated by the National Congress on April 18, 1995;

  • To implement the aforementioned Orthographic Agreement here in Brazil, Federal Decrees 6.583, 6.584 and 6.585 were issued, and the Amending Protocol to that Agreement was approved.

    • Decree no. 6.583 of 2008 has as an annex the "ORTHOGRAPHIC AGREEMENT OF THE PORTUGUESE LANGUAGE", which states in its Article 2 that "the signatory States shall take (...) the necessary steps to develop (...) a common spelling vocabulary for the Portuguese language". It further establishes that "authorised vocabularies shall record admissible alternative spellings (...) it is clear that only the consultation of vocabularies or dictionaries may indicate".

    • Decree 6.584 has as an annex the "AMENDING PROTOCOL TO THE ORTHOGRAPHIC AGREEMENT OF THE PORTUGUESE LANGUAGE": it gives new wording only to Articles 2 and 3, fixing as valid the vocabularies elaborated "until 1 January 1993".

    • The VOLP is edited by the Brazilian Academy of Letters (ABL), which allegedly has the legal responsibility for editing it: this attribution, like many others concerning official Portuguese spelling, pronunciation, etc., does not appear in the Eduardo Ramos Decree, no. 726 of 8/12/1900, the only norm cited for this purpose. According to migalhas.com.br, "... this position has thus been recognised without challenge for decades"; that is, there is no written law, only a "tradition" filling this gap.

  • The MEC (Ministry of Education) was the agency that pushed hardest for making it mandatory (and nearly succeeded for Brazilian textbooks) as of January 2013... Not by chance, Brazil had been represented in 1990 by its Minister of Education. But in 2012 the federal government, by Decree 7.875, postponed the obligation to 2016.

(in response to @mgibsonbr) On the contradictions between the right to charge for copyright (and not offer a download) and Brazilian constitutional law. It appears that the sale of the VOLP is, in theory, unconstitutional:

  • The principles behind LexML can be extended to the VOLP: the VOLP is cited in law, so it is part of the law. The government has an obligation to make it public and cannot charge citizens for access to the law.

  • The Brazilian citizen cannot claim "ignorance of the Law": the Federal Constitution (CF) would guarantee "mandatory publication" (art. 37), "right of access" (art. 5º, item XIV) and "obligation to grant access" (art. 216, § 2º).

  • Precedent: the standard ABNT NBR 9050:2005 (Accessibility of buildings) and, it seems, also NBR-15575-5 are the only ones that are open (a download of the text is offered). Complaints about ABNT abuses are old (see 1, 2, and dozens of others)... The complaints seem to have triggered a first opening initiative. The impact of the VOLP on every citizen, however, is much greater than the impact of restricting access to an ABNT standard, so it deserves greater attention.

Contextualization

There are dozens of "unofficial vocabularies", some of them reliable and even better structured than the VOLP, but they are not suitable for certification (none of those surveyed contains a flag marking the VOLP spellings):

  • Unitex Project: probably the most rigorous and solid "framework for dictionaries". The ideal would be to work with it... See the downloads in Unitex3.0.zip, with all the dictionaries; as of 2013 it had not yet been updated to the 1990 Orthographic Agreement.

  • VERO project of LibreOffice: perhaps not as strict as Unitex, but certainly the most complete (voluminous) today, and the one that has received the most contributions, reviews, checks, etc. Downloading it means downloading the latest version of the file with the "oxt" extension, which itself holds the dictionary's source data and everything else. The VERO project was built around the Hunspell software, which is used by LibreOffice, Firefox, Chrome and many other applications. VERO is the initiative that created the linguistic data (affixes, inflections, etc.) processed by Hunspell: the most current VERO file, for example Veroptbrv320aoc.oxt, carries all the data. To access it, simply rename the file to Veroptbrv320aoc.zip and unzip it. Running (a tip from Raimundo and his collaborators)  ./unmunch pt_BR.dic pt_BR.aff | sort -u > listaCompleta.txt  we obtain a complete list of all Portuguese words (more complete and coherent even than the VOLP). According to R.S.Moura,

"... VERO never received support from the ABL. Our lexicon is the result of the voluntary work of many selfless people who took up the cause of this Project, making available their academic materials, research and lists of terms, pointing out flaws and suggesting new words, during the eight years of VERO activity" (personal e-mail of April 2014, reproduction authorized).

  • Unix "words" feature: on Ubuntu it is known as the wordlist and lives in /usr/share/dict/brazilian (list with more). It can help to check words and to consolidate with other open dictionaries, but it does not seem very reliable or as active. It is installed with sudo apt-get install wbrazilian.
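The "rename to .zip and unzip" step described for the VERO file above can also be done without renaming, since an .oxt extension is a ZIP archive. The following is a minimal sketch; the function name `extract_hunspell_files` is made up, and because the internal folder layout of the VERO package is not described here, the code simply looks for any `.dic` and `.aff` member wherever it sits in the archive.

```python
import zipfile
from pathlib import Path

def extract_hunspell_files(oxt_path, out_dir):
    """Copy the .dic and .aff members of an .oxt (ZIP) archive into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(oxt_path) as archive:
        for name in archive.namelist():
            if name.endswith((".dic", ".aff")):
                # Keep only the base name, so members cannot escape out_dir.
                target = out_dir / Path(name).name
                target.write_bytes(archive.read(name))
                extracted.append(target)
    return sorted(extracted)
```

The two extracted files are the inputs that the unmunch command quoted above expects.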

Dictionaries proper: they have the character of an "ontology" (semantic description of words) rather than a "vocabulary"... Since in general they also cover the vocabulary, they can be as useful as vocabularies if they are reliable and complete:

  • Portuguese wiktionary.org: collaborative source... It can be evaluated through the download (ptwiktionary-latest-all-titles.gz), which is still incomplete, besides not having a "VOLP flag".

  • pt-PT, Dicionário de Candido de Figueiredo, 1913: a good starting point for creating a public-domain dictionary... "since this edition of the dictionary dates from 1913, under current copyright legislation the copyright on its contents has already expired, placing it entirely in the Public Domain"3.

  • 1

    Has anyone had contact with the VOC? http://voc.iilp.cplp.org/ Apparently it will be the "universal Portuguese vocabulary" replacing the VOLP (!).

  • ... It’s 2021 and nothing. A starting point would be to mobilize experts to resume the Unitex project, apparently the most serious and organized one, generating very reliable results. It just needs a little updating. https://github.com/UnitexGramLab/unitex-lingua/tree/master/pt-BR

0

I found very good material on the Brazilian Portuguese Lexicon - Lexporbr site.

After downloading the Brazilian Portuguese Lexicon - Alfa zip, we find inside it the file lexporbr_alfa_txt.txt, whose first column contains the words and whose second column contains the grammatical class of each one. Since I was looking for a list of nouns, it was easy to filter, resulting in 82,097 nouns. All the other grammatical classes are present too, of course.

EDIT

What I did to extract nouns from the above file was, in R:

# the file is assumed to have been converted to UTF-8 beforehand
d = read.csv('lexporbr_alfa_txt_utf8.txt', sep='\t', stringsAsFactors=FALSE)
n = d[d$cat_gram == 'nom',] # 82097
length(unique(n$ortografia)) # 82097 - just to double-check

ort = vector()
for (i in 1:nrow(n)) {
  if (i %% 1000 == 0) {
    print(i) # show progress, since this is a bit slow
    flush.console()
  }
  # all rows with the same spelling; columns 1, 2, 4 = ortografia, cat_gram, freq_orto
  dn = d[d$ortografia == n$ortografia[i],c(1,2,4)]
  # keep the word if it is only a noun, or if noun is its most frequent class
  if (nrow(dn) == 1 || dn$freq_orto[dn$cat_gram == 'nom'] == max(dn$freq_orto)) {
    ort = c(ort,n$ortografia[i])
  }
}
write.table(ort,'substantivos.txt',row.names=F,col.names=F,quote=F) # 78894 lines - full of junk, it's true, but it serves my current purpose.

If anyone can do this using apply, I would love to see it.

I hope it helps!
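The loop above rescans the whole table once per noun, which is why it is slow. The same rule (keep a spelling when its noun reading has the highest freq_orto among all its classes) can be applied in two linear passes by first recording the maximum frequency per spelling. This is a sketch in Python rather than the R apply version asked for; the column names ortografia, cat_gram and freq_orto come from the R code, while the function names `read_lexicon` and `dominant_nouns` and the `encoding` parameter are assumptions.

```python
import csv

def read_lexicon(path, encoding="utf-8"):
    """Read the tab-separated lexicon into a list of dict rows."""
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)

def dominant_nouns(rows, noun_tag="nom"):
    """Return spellings whose noun frequency equals the max over all their classes."""
    rows = list(rows)
    # Pass 1: highest frequency seen for each spelling, across all classes.
    max_freq = {}
    for row in rows:
        word, freq = row["ortografia"], float(row["freq_orto"])
        if freq > max_freq.get(word, float("-inf")):
            max_freq[word] = freq
    # Pass 2: keep noun rows that reach that maximum.
    return [
        row["ortografia"]
        for row in rows
        if row["cat_gram"] == noun_tag
        and float(row["freq_orto"]) == max_freq[row["ortografia"]]
    ]
```

A spelling that occurs only as a noun trivially reaches its own maximum, so the `nrow(dn) == 1` case of the R code needs no special handling here.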

  • Thank you, Rodrigo, for pointing out one more source (!), which in fact seems to be up to date (2019). After downloading it on Linux, it is best to convert it to UTF-8: iconv -f ISO-8859-1 -t utf8 lexporbr_alfa_txt.txt > lexporbr_alfa_txtUtf8.txt. On inspection it does not seem reliable as a reference vocabulary for certification; it is more of a statistical summary for academic purposes... Apparently it is a selection of findings from some large linguistic corpus, so it contains strange words and frequency values. See the word "mom" classified as a verb with freq_orto=1.

  • @Peterkrauss I used Geany to convert it to UTF-8 (unbelievable that ISO-8859 is still in use today!). Indeed, it has its problems. I am discarding nouns that also appear under other grammatical classes, which takes the list down from 82,000 to 73,000 words. For my use case that will be enough. Have you found a better source that includes grammatical classes?

  • Hello @Rodrigo! Yes, I found one years ago and used it at the time with LexML: Unitex. It is a little more up to date today and is still the most serious and organized. It is maintained in this git, but for those unfamiliar with it, it may be better to start from their software, https://UnitexGramLab.org/ . USP/NILC stopped updating in 2015; careful, this is of historical interest only: http://www.nilc.icmc.usp.br/nilc/projects/unitex-pb/web/dicionarios.html

  • @Peterkrauss I had already downloaded this Unitex, but the biggest file I found in it is DELAS_PB_2018.dic, which has only 75 thousand terms, verbs included. The USP website speaks of "67,500 canonical forms associated with their inflection rules" (that is, verbs included as well). That is far fewer than the 73,000 nouns from the link I posted. But if it is better curated, it is worth it. Is DELAS_PB_2018.dic the file to use? From what I see, the nouns are followed by ,Nxxx. Filtering this in R won’t be difficult. If that is the one, I will post the script here. And thank you!

  • Hello @Rodrigo, yes, it is DELAS_PB_2018.dic; although it is a large file, it contains only the canonical forms, not the inflections, which are generated by the software I mentioned. The Unitex project is nice because it gives consistency to the language: as in voc.iilp.cplp.org/, it generates the inflections by logical rule (inflection graphs), not as dictated by the linguistic corpus. Only the canonical forms come from the corpus. I think the DELAF (Dictionary of Simple Inflected Words) has everything; you need to use the software to generate it (as a last resort, the USP site has an old zip).

  • Check out Muniz's thesis: despite being from 2004, Unitex has kept the same concepts and most of the standards in force at that time. The graphs and the differences between DELAS and DELAF can be understood from it... See the thesis at https://doi.org/10.11606/D.55.2020.tde-19022020-151305 . Then, if you really like it, maybe we can build together a SQL or JSON version of Unitex, much more practical and easy to give

  • @Peterkrauss From what I’ve seen, creating an SQL or JSON version is far from trivial, and it’s well beyond my needs at the moment. If I need it in the future, I’ll call on you to take on this project. Thank you!
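The noun filter mentioned in these comments (entries where the lemma is followed by ,Nxxx) was never posted. A minimal sketch could look like the following, assuming each DELAS line has the shape `lemma,CODE...`, that a literal comma inside a lemma is escaped with a backslash, and that noun codes are an `N` not followed by another letter. These are assumptions about the file, and `delas_nouns` is a made-up name; the exact code inventory and the file's text encoding should be checked against the Unitex documentation.

```python
import re

# Lemma = everything up to the first unescaped comma; the code must start
# with "N" and must not continue with a letter (so longer N-initial codes,
# if any exist, are not mistaken for nouns).
NOUN_ENTRY = re.compile(r"^(?P<lemma>(?:\\.|[^,\\])+),N(?![A-Za-z])")

def delas_nouns(lines):
    """Return the lemmas of the entries whose grammatical code marks a noun."""
    nouns = []
    for line in lines:
        match = NOUN_ENTRY.match(line.strip())
        if match:
            # Drop the backslash from escaped characters such as "\,".
            nouns.append(re.sub(r"\\(.)", r"\1", match.group("lemma")))
    return nouns
```

Since DELAS holds only canonical forms, the result is a list of noun lemmas, not of every inflected noun form.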
