Believe me, what you want to do is called compilation. Yes, you need to compile the code to get an exact count. No, you do not need to generate bytecode with this compilation. I know of no alternative to building the abstract syntax tree if you want exact information; with regex, the best you can do is an estimate. To compile Java, you need to treat it as a context-free language.
Okay, I’ve said a lot but nothing practical yet. Well, that question brings a range of answers about context-free languages. Victor’s response is more intuitive, mine is more formal.
But why is Java context-free? Is there nothing in Java I can handle with a regular language?
Note: a regular language is the set of all strings recognized by a regular expression.
Basically because Java has self-nested constructions. What do you mean? Well, look at the image below:
Consider that N, in this case, is a named class declaration. Within it, you can declare another named class (you have an N nested inside another N). We call this property self-nesting. If V and X are both non-empty, you cannot reduce this nesting to a regular derivation.
If you want an example of code with three classes nested inside one another, see this snippet.
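As a minimal illustration of that kind of nesting, here is a sketch with three named classes declared inside one another. The names Outer, Middle, Inner and depth are hypothetical and chosen only for this example.

```java
// Three named classes, each declared inside the previous one.
// Every level repeats the same "class Name { ... }" shape,
// which is the self-nesting the grammar has to account for.
public class Outer {
    static class Middle {
        static class Inner {
            // Hypothetical helper: reports how deep this class is nested.
            static int depth() {
                return 3;
            }
        }
    }
}
```

Nothing stops you from adding a fourth or fifth level, which is why a fixed-depth pattern cannot describe all valid programs.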
Read the answers to that question before you continue reading here. Done? Okay, let's go.
To compile Java, it is very helpful to have a grammar that describes the language. I will leave the handling of comments out of the compilation, okay? This simplification does not hurt much, so let us continue.
A Java source file starts with the package and import declarations. Only after that comes a list of classes (declaring more than one in the same file is valid). If J is the starting nonterminal of our grammar for Java, we have this:
J -> P? Is C Cs
P -> "package" RDNS ";"
Is -> "" | I Is
I -> "import" "static"? (RDNS | RDNS "\.\*") ";"
Cs -> "" | C Cs
Sounds like German? Well, it's not exactly German (German is context-sensitive, you can't describe it with a context-free grammar).
All symbols that are not in quotes are nonterminal symbols; this means that they have productions that generate more symbols. Whatever is inside quotes is a terminal symbol; ";" represents a literal ; character.
With that explained, I can say that the notation I used is an adaptation of extended BNF, in which I place typical regular-expression metacharacters inside productions. For example, "static"? indicates the optional presence of the literal word static, and P? indicates that the nonterminal P may optionally occur at that point in the derivation of J.
In more detail, the symbols used so far:
J: nonterminal for a valid Java program
P: nonterminal for the package declaration
RDNS: nonterminal for a reversed DNS name; its production is not defined here
Is: nonterminal for Imports, a list of imports
I: nonterminal for a single Import
Cs: nonterminal for Classes, a list of classes
C: nonterminal for a single Class
And what does the production for a class look like?
C -> Ac? Nm Ex? Im? "{" Dsc "}"
Where Ac is the access level, Nm is any valid name, Ex is the inheritance clause, Im is the list of interfaces that this class implements, and Dsc refers to the declarations made inside the class.
I could spend the night writing the productions, but I don't intend to do that. I just wanted to show you how to write in that notation. Once the grammar is done, you build the syntax tree, where each non-leaf node is a nonterminal symbol and the children of a node represent the derivation applied to that node. With the syntax tree in hand, simply count the class nodes C and the method nodes.
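The counting step can be sketched as a recursive walk over the tree. This is only an illustration under assumptions: the Node class, the count function, and the symbol "M" for a method node are hypothetical names of mine (no method nonterminal was defined above), and the tree is built by hand instead of coming from a real parser.

```java
import java.util.Arrays;
import java.util.List;

public class SyntaxTreeCount {
    // Hypothetical node type: a grammar symbol plus the children produced by its derivation.
    static final class Node {
        final String symbol;
        final List<Node> children;

        Node(String symbol, Node... children) {
            this.symbol = symbol;
            this.children = Arrays.asList(children);
        }
    }

    // Counts every node labeled with the given symbol, visiting the whole tree.
    static int count(Node node, String symbol) {
        int total = node.symbol.equals(symbol) ? 1 : 0;
        for (Node child : node.children) {
            total += count(child, symbol);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hand-built tree: the first class holds a method and a nested class
        // (which holds another method); a second top-level class is empty.
        Node tree = new Node("J",
            new Node("C",
                new Node("Dsc",
                    new Node("M"),
                    new Node("C", new Node("Dsc", new Node("M"))))),
            new Node("Cs", new Node("C", new Node("Dsc"))));
        System.out.println("classes: " + count(tree, "C")); // classes: 3
        System.out.println("methods: " + count(tree, "M")); // methods: 2
    }
}
```

Because the walk follows the tree structure, nested classes are counted just like top-level ones, which is exactly what a flat text search cannot guarantee.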
After defining the grammar, you can use a compiler-compiler such as yacc and a lexical analyzer generator such as lex (which produces the yylex function) to write your Java compiler.
Recognizing named classes
I said that recognition with regular expressions gives, at best, an approximation. Shall we try? A class declaration begins with the optional modifiers (public? static?), followed by the reserved word class and then the class name:
((public|protected|private) )?(static )?class [_a-zA-Z][_a-zA-Z0-9]*
This does not take into account that the matched text may be inside a string or a comment.
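To make the "approximation" concrete, here is a small sketch using java.util.regex. It is my own illustration, not a recommended tool: the class name ClassRegexEstimate and the function countMatches are hypothetical, and the pattern is a variant of the one above that uses \s+ instead of single spaces so that each optional modifier carries its own trailing whitespace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClassRegexEstimate {
    // Variant of the class pattern: optional access modifier, optional static,
    // then the keyword class and an identifier, separated by whitespace.
    static final Pattern CLASS_DECL = Pattern.compile(
        "((public|protected|private)\\s+)?(static\\s+)?class\\s+[_a-zA-Z][_a-zA-Z0-9]*");

    // Counts how many times the pattern occurs in the given source text.
    static int countMatches(String source) {
        Matcher matcher = CLASS_DECL.matcher(source);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String source =
            "public class A {\n" +
            "    // class B used to live here\n" +
            "    static class C {}\n" +
            "}\n";
        // Only A and C are real classes, but the comment also matches.
        System.out.println(countMatches(source)); // prints 3
    }
}
```

The sample declares two classes, yet the count is three because the text "class B" inside the comment matches as well.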
Recognizing methods
I said that recognition with regular expressions gives, at best, an approximation. Shall we try? A method declaration starts with the optional modifiers (public? static?), followed by its return type, its name, an opening parenthesis, the list of arguments, and a closing parenthesis:
((public|protected|private) )?(static )?[_a-zA-Z](\.?[_a-zA-Z0-9])* [_a-zA-Z][_a-zA-Z0-9]*\(([_a-zA-Z](\.?[_a-zA-Z0-9])* [_a-zA-Z][_a-zA-Z0-9]*(, [_a-zA-Z](\.?[_a-zA-Z0-9])* [_a-zA-Z][_a-zA-Z0-9]*)*)?\)
Note that this does not take constructors into consideration.
For constructors: they cannot be static and have no return type; otherwise the pattern is identical to the one for a method:
((public|protected|private) )?[_a-zA-Z][_a-zA-Z0-9]*\(([_a-zA-Z](\.?[_a-zA-Z0-9])* [_a-zA-Z][_a-zA-Z0-9]*(, [_a-zA-Z](\.?[_a-zA-Z0-9])* [_a-zA-Z][_a-zA-Z0-9]*)*)?\)
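Here is a sketch of how rough the method pattern is in practice. Again this is only an illustration under assumptions: MethodRegexEstimate, findMatches and the constants TYPE, NAME and PARAM are hypothetical names of mine, and the pattern is a whitespace-tolerant variant (using \s+ and \s*) of the method expression above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MethodRegexEstimate {
    // Building blocks taken from the method expression: a (possibly dotted) type and a name.
    static final String TYPE = "[_a-zA-Z](\\.?[_a-zA-Z0-9])*";
    static final String NAME = "[_a-zA-Z][_a-zA-Z0-9]*";
    static final String PARAM = TYPE + "\\s+" + NAME;

    // Optional modifiers, return type, name, then an optional comma-separated parameter list.
    static final Pattern METHOD = Pattern.compile(
        "((public|protected|private)\\s+)?(static\\s+)?"
        + TYPE + "\\s+" + NAME
        + "\\s*\\((" + PARAM + "(\\s*,\\s*" + PARAM + ")*)?\\)");

    // Returns every piece of text the pattern accepts as a method header.
    static List<String> findMatches(String source) {
        List<String> found = new ArrayList<>();
        Matcher matcher = METHOD.matcher(source);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String source =
            "public int add(int a, int b) { return a + b; }\n" +
            "int[] values() { return null; }\n" +
            "Object make() { return new Object(); }\n";
        for (String match : findMatches(source)) {
            System.out.println(match);
        }
    }
}
```

On this sample the pattern finds add and make, misses values because of the array return type, and wrongly accepts the expression new Object() as if it were a method header. Misses and false positives can even cancel out in the total, which is why the count should be read as an estimate only.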
Create a function that opens the class files, checks each one against the regex you want, and stores the occurrences in a list holding the path of the class file and the regex match. Save the list as a txt file and you're done: you have just implemented the program you described.
– Paz
I implemented it with regular expressions; I'll edit the question afterwards.
– lipesmile