Implementing regular expressions in Java

Viewed 423 times

I need to make a program that reads a file and can count how many classes and methods there are.

I have already done the part that reads the file and splits it into lines; now I need to identify classes and methods.

I am assuming that classes and methods open their braces on the same line where they are declared, without a line break, e.g. public class foo {. It does not need to be an exact class and method counter like the ones code validators implement.

For classes, I thought of something like this: search anywhere in the line for the word class and check whether the line ends with {. Example:

public class Hello {

Regular Expression:

["class"]{$

Methods: check whether the line contains one of the data type words (int, double, float, bool, char) and ends with ){. Example: public int method(){ Regular expression:

["int""double""float""bool""char"] )$

I just don’t know how to apply this in Java. Do I simply take the String class's comparison methods and call them like this?

String.compareTo(["class"]{$);
String.compareTo(["int""double""float""bool""char"] )$);
// Or like this
String.equals(["class"]{$);
String.equals(["int""double""float""bool""char"] )$);
  • Create a function that opens the class files and checks each one against the regex you want. Store the occurrences in a list that holds each file's path and its regex matches, then save the list as a txt file. That's it: you have implemented the program you described.

  • I haven’t implemented the regular expressions yet; I’ll edit the question afterwards.
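
On the question's last point: String.compareTo and String.equals compare text literally and do not interpret patterns; in Java, regular expressions go through java.util.regex.Pattern. Below is a minimal sketch of the line-based heuristic described in the question, assuming the lines are already in a list. The class name LineHeuristics and both patterns are illustrative, and boolean stands in for bool, which is not a Java type.

```java
import java.util.List;
import java.util.regex.Pattern;

class LineHeuristics {
    // Line contains the word "class" and ends with "{" (trailing spaces allowed).
    static final Pattern CLASS_LINE = Pattern.compile("\\bclass\\b.*\\{\\s*$");
    // Line contains one of the listed types and ends with ")" followed by "{".
    static final Pattern METHOD_LINE =
            Pattern.compile("\\b(int|double|float|boolean|char)\\b.*\\)\\s*\\{\\s*$");

    // Counts the lines in which the pattern is found at least once.
    static long count(Pattern pattern, List<String> lines) {
        return lines.stream().filter(line -> pattern.matcher(line).find()).count();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "public class Hello {",
                "    public int method(){",
                "        return 1;",
                "    }",
                "}");
        System.out.println("classes: " + count(CLASS_LINE, lines));
        System.out.println("methods: " + count(METHOD_LINE, lines));
    }
}
```

This only covers methods whose line mentions one of the five listed types, so void methods and methods returning objects are not counted.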

1 answer


Believe me, what you want to do is called compilation. Yes, you need to compile the code to get a perfect count. No, you do not need to generate bytecode from this compilation. I know of no alternative to building the abstract syntax tree, which gives you the exact information; with regex, the best you can get is an estimate. To compile Java, you need to treat it as a context-free language.

Okay, I’ve said a lot but nothing practical yet. Well, that question has a range of answers about context-free languages. Victor’s answer is more intuitive; mine is more formal.

But why is Java context-free? Is there nothing in Java I can handle with a regular language?

Note: a regular language is the set of all strings recognized by a regular expression.

Basically because Java has self-nested constructions. What does that mean? Well, look at the image below:

[Image: self-nesting]

Consider that N, in this case, is a named class declaration. Within it, you can declare another named class (you have an N nested inside another N). We call this property self-nesting. If V and X are both non-empty, you cannot reduce this nesting to a regular derivation.

If you want an example of code with 3 classes nested inside one another, see this snippet.
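
For readers without access to that snippet, here is a small illustration of the same idea. It is not the linked code; the names Outer, Middle, Inner and the depth method are hypothetical. Each class declaration contains another class declaration, which is the N inside N pattern described above.

```java
// Three class declarations, each nested inside the previous one.
class Outer {
    static class Middle {
        static class Inner {
            // Nesting level of this class, counting Outer as level 1.
            static int depth() {
                return 3;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(Outer.Middle.Inner.depth());
    }
}
```

Nothing in the language limits how deep this can go, which is why matching each opening brace with its closing brace is beyond what a regular expression can do in general.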

Read the answers to that question before you continue here. Done? Okay, let's go.

To compile Java, it helps a great deal to have a grammar that describes the language. I will leave comments out of the compilation, okay? This simplification does not do much harm, so let us continue.

A Java source file starts with the package and import declarations. Only after that comes the list of classes (declaring more than one in the same file is valid). If J is the starting nonterminal of our grammar for Java, we have this:

J -> P? Is C Cs
P -> "package" RDNS ";"
Is -> "" | I Is
I -> "import" "static"? (RDNS | RDNS "\.\*") ";"
Cs -> "" | C Cs

Sounds German? Well, it’s not exactly German (German is context-sensitive, you can’t describe it with context-free grammar).

All symbols that are not in quotes are nonterminal symbols; this means that they will have productions that generate more symbols. What’s inside quotes are terminal symbols; ";" represents a ; literal.

With that explained, I can say that the notation I used is an adaptation of extended BNF, in which I place typical regular-expression metacharacters inside productions. For example, "static"? indicates the optional presence of the literal word static, and P? indicates that the nonterminal P may optionally occur at that point in the derivation of J.

More details on what I’ve written so far:

  1. J : nonterminal for a valid Java program (the start symbol)
  2. P : nonterminal for the package declaration
  3. RDNS : nonterminal for Reversed DNS; its production is not defined yet
  4. Is : nonterminal for Imports, an import list
  5. I : nonterminal for Import
  6. Cs : nonterminal for Classes, a list of classes
  7. C : nonterminal for Class

And what does the production for a class look like?

C -> Ac? "class" Nm Ex? Im? "{" Dsc "}"

Where Ac is the access level, Nm is any valid name, Ex is the inheritance clause, Im is the list of interfaces this class implements, and Dsc refers to the declarations made inside the class.

I could spend the night writing productions, but I don’t intend to. I just wanted to show you how to write in this notation. Once the grammar is done, you build the syntax tree, where each non-leaf node is a nonterminal symbol and the children of a node represent the derivation applied to that node. With the syntax tree, simply count the class nodes C and the method nodes.

After defining the grammar, you can use a compiler compiler such as yacc and a lexical analyzer such as the yylex routine that lex generates to write your Java compiler.
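
The counting step can be sketched as a recursive walk over the tree. This is only an illustration under assumptions: the Node class is hypothetical, the tree is built by hand instead of by a parser, and the symbol M for methods does not appear in the productions above.

```java
import java.util.List;

class SyntaxTreeCount {
    // Hypothetical tree node: a grammar symbol plus the nodes it derived.
    static final class Node {
        final String symbol;
        final List<Node> children;

        Node(String symbol, Node... children) {
            this.symbol = symbol;
            this.children = List.of(children);
        }
    }

    // Counts the nodes labelled with the given symbol, nested ones included.
    static int count(Node node, String symbol) {
        int total = node.symbol.equals(symbol) ? 1 : 0;
        for (Node child : node.children) {
            total += count(child, symbol);
        }
        return total;
    }

    public static void main(String[] args) {
        // A class with two methods and one nested class that has one method.
        Node tree = new Node("J",
                new Node("C",
                        new Node("M"),
                        new Node("M"),
                        new Node("C", new Node("M"))));
        System.out.println("classes: " + count(tree, "C"));
        System.out.println("methods: " + count(tree, "M"));
    }
}
```

Because the walk follows the tree structure, nested classes are counted correctly no matter how deep they are.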

Recognizing named classes

I said that recognizing with regular expressions gives, at best, an approximate count. Shall we try? A class declaration begins with the access modifiers (public? static?), followed by the reserved word class and then the class name:

((public|protected|private)\s+)?(static\s+)?class\s+[_a-zA-Z][_a-zA-Z0-9]*

This does not take into account that the matched text may be inside strings or comments.
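
A minimal sketch of applying the class expression in Java with java.util.regex, with the whitespace between tokens written as \s+. The class name ClassEstimate and the sample inputs are illustrative assumptions, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClassEstimate {
    // Optional access modifier, optional static, the word class, then a name.
    static final Pattern CLASS_DECL = Pattern.compile(
            "((public|protected|private)\\s+)?(static\\s+)?class\\s+[_a-zA-Z][_a-zA-Z0-9]*");

    // Counts every place in the source text where the expression matches.
    static int countClasses(String source) {
        Matcher matcher = CLASS_DECL.matcher(source);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String source = "public class Hello {\n    private static class Inner {\n    }\n}\n";
        System.out.println(countClasses(source));

        // Neither of these is a declaration, yet both are counted.
        String misleading = "String s = \"my class Fake\"; // class Ghost";
        System.out.println(countClasses(misleading));
    }
}
```

The second input shows the limitation mentioned above: text inside a string literal or a comment is counted just like a real declaration.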

Recognizing methods

Again, regular expressions give, at best, an approximate count. Shall we try? A method starts with the access modifiers (public? static?), followed by its return type, its name, an opening parenthesis, the argument list, and a closing parenthesis:

((public|protected|private)\s+)?(static\s+)?[_a-zA-Z](\.?[_a-zA-Z0-9])*\s+[_a-zA-Z][_a-zA-Z0-9]*\s*\(\s*([_a-zA-Z](\.?[_a-zA-Z0-9])*\s+[_a-zA-Z][_a-zA-Z0-9]*(\s*,\s*[_a-zA-Z](\.?[_a-zA-Z0-9])*\s+[_a-zA-Z][_a-zA-Z0-9]*)*)?\s*\)

Note that this does not take constructors into consideration.

For constructors: they cannot be static and have no return type; otherwise they look identical to a method.

((public|protected|private)\s+)?[_a-zA-Z][_a-zA-Z0-9]*\s*\(\s*([_a-zA-Z](\.?[_a-zA-Z0-9])*\s+[_a-zA-Z][_a-zA-Z0-9]*(\s*,\s*[_a-zA-Z](\.?[_a-zA-Z0-9])*\s+[_a-zA-Z][_a-zA-Z0-9]*)*)?\s*\)
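
A sketch of how the method and constructor expressions could be assembled and applied in Java, with whitespace between tokens written as \s+ or \s*. The class name MethodEstimate, the helper constants and the sample inputs are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MethodEstimate {
    // Building blocks: a possibly qualified type, a simple name, a parameter list.
    static final String TYPE = "[_a-zA-Z](\\.?[_a-zA-Z0-9])*";
    static final String NAME = "[_a-zA-Z][_a-zA-Z0-9]*";
    static final String PARAM = TYPE + "\\s+" + NAME;
    static final String PARAMS =
            "\\(\\s*(" + PARAM + "(\\s*,\\s*" + PARAM + ")*)?\\s*\\)";

    static final Pattern METHOD = Pattern.compile(
            "((public|protected|private)\\s+)?(static\\s+)?"
                    + TYPE + "\\s+" + NAME + "\\s*" + PARAMS);
    static final Pattern CONSTRUCTOR = Pattern.compile(
            "((public|protected|private)\\s+)?" + NAME + "\\s*" + PARAMS);

    // Counts every place in the source text where the pattern matches.
    static int countMatches(Pattern pattern, String source) {
        Matcher matcher = pattern.matcher(source);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String source = String.join("\n",
                "public class Calc {",
                "    public int add(int a, int b) {",
                "        return a + b;",
                "    }",
                "    static double half(double x) {",
                "        return x / 2;",
                "    }",
                "}");
        System.out.println("methods: " + countMatches(METHOD, source));
        System.out.println("constructors: "
                + countMatches(CONSTRUCTOR, "public Calc(int seed) {"));
        // An object creation is also accepted by the method expression.
        System.out.println("false positives: "
                + countMatches(METHOD, "Object o = new Object();"));
    }
}
```

The two expressions overlap: the method expression accepts public Calc(int seed) by reading public as a return type, and the constructor expression accepts the name and parameter list of any method declaration. The two counts therefore cannot simply be added together, which is one more reason the result is only an estimate.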
