System.OutOfMemoryException - Parser for large files

Viewed 260 times

2

I created a simple grammar to interpret a file whose format is much like JSON. However, when I try to parse the file I get a System.OutOfMemoryException. This is because of the size of the file I am trying to parse: it is 108 MB and has 4,682,073 lines.

When I parse smaller files, everything works normally. For this file, however, I noticed that once the memory used by the process reaches almost 2 GB, the exception is thrown and the program stops. The exception comes from the parser code generated by the ANTLR extension for Visual Studio.

How do I run the parser on really big files with ANTLR?

More information

The machine I’m running the parser on has 8 GB of memory and a 2.8 GHz processor (Intel Core 2 Duo).

Example of the problem

Example file for reading

(
    :field ("ObjectName"
        :field (
            :field ("{6BF621F9-A0E2-49BB-A86B-3DE4750954F4}")
            :field (Value)
            :field (Value)
            :field (
                :Time ("Sun Jan 26 10:08:33 2014")
                :last_modified_utc (1390730913)
                :By ("Any message")
                :From (localhost)
            )
            :field ("Applications/application_fw1")
            :field (false)
            :field (false)
        )
        :field ()
        :field ()
        :field ()
        :field (0)
        :field (true)
        :field (true)
    )
.
.
.
Thousands of other fields.
.
.
.
)

The grammar

grammar Objects;

/*
 * Parser Rules
 */


compileUnit
    : obj
    ;


obj
    : OPEN ID? (field)* CLOSE
    ;

field
    : ':'(ID)? obj
    ;


/*
 * Lexer Rules
 */


OPEN 
    : '(' 
    ;

CLOSE 
    : ')' 
    ;

ID
    : (ALPHA | ALPHA_IN_STRING)
    ;


fragment
INT_ID
    : ('0'..'9')
    ;

fragment
ALPHA_EACH
    : 'A'..'Z' | 'a'..'z' | '_' | INT_ID | '-' | '.' | '@'
    ;

fragment
ALPHA
    : (ALPHA_EACH)+
    ;

fragment
ALPHA_IN_STRING
    : ('"' ( ~[\r\n] )+ '"')
    ;



WS
    // :    ' ' -> channel(HIDDEN)
    : [ \t\r\n]+ -> skip  // skip spaces, tabs, newlines
    ;

Execution of parser

// text holds the contents of the 108 MB file to be read.
var input = new Antlr4.Runtime.AntlrInputStream(text);
var lexer = new ObjectsLexer(input);
var tokens = new Antlr4.Runtime.CommonTokenStream(lexer);
var parser = new ObjectsParser(tokens);

// Context for the compileUnit rule
// ERROR: the problem occurs here, when the tree for compileUnit starts being built.
// Execution never reaches the Visitor; the exception is thrown inside compileUnit()
var ctx = parser.compileUnit();


// Run the visitor
new ObjectsVisitor().Visit(ctx);
  • Can you please post an example of the code you are using to parse the file?

  • @Gypsy omorrisonmendez I added an example.

1 answer

2

A few settings can be changed to avoid the problem:

When parsing the compilation unit, the framework tries to load the whole file and the entire tree into memory. In theory, the address space of the application is 4 GB, but I believe the 2 GB limit is the maximum size of a data structure within the process.
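To see which address-space limit applies, it can help to check whether the process is actually running as 32-bit (for example because the project targets x86 or has "Prefer 32-bit" enabled). On Windows a 32-bit process gets 2 GB of user address space by default, which would match the point where the exception appears. A minimal sketch using only standard .NET APIs; the class name `ProcessInfo` is just an illustration:

```csharp
using System;

public static class ProcessInfo
{
    // Describes whether the current process runs as 32-bit or 64-bit.
    public static string Describe()
    {
        return Environment.Is64BitProcess ? "64-bit process" : "32-bit process";
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
        // IntPtr.Size is 4 in a 32-bit process and 8 in a 64-bit process.
        Console.WriteLine("Pointer size: " + IntPtr.Size + " bytes");
    }
}
```

If this reports a 32-bit process on a 64-bit machine, building the project for x64 may raise the ceiling, although the memory needed for the full parse tree would still be large.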

With the buffer eliminated, the file is read in segments and parsed the same way, so the memory problem is avoided.
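A minimal sketch of what such a setup might look like, assuming the ANTLR 4 C# runtime's `UnbufferedCharStream`, `UnbufferedTokenStream` and `CommonTokenFactory` classes and the `ObjectsLexer`/`ObjectsParser` classes generated from the grammar in the question. The `StreamingParse` class and the `path` parameter are hypothetical, and constructor overloads may differ between runtime versions:

```csharp
using System.IO;
using Antlr4.Runtime;

public static class StreamingParse
{
    // path: location of the large input file (hypothetical parameter).
    public static void Run(string path)
    {
        using (var file = File.OpenRead(path))
        {
            // Read characters on demand instead of loading the whole file into a string.
            var input = new UnbufferedCharStream(file);
            var lexer = new ObjectsLexer(input);

            // Tokens have to copy their own text, because the unbuffered
            // character stream does not keep characters it has already consumed.
            lexer.TokenFactory = new CommonTokenFactory(true);

            // Keep only the tokens the parser currently needs for lookahead.
            var tokens = new UnbufferedTokenStream(lexer);
            var parser = new ObjectsParser(tokens);

            // Do not build the parse tree in memory.
            parser.BuildParseTree = false;

            parser.compileUnit();
        }
    }
}
```

With `BuildParseTree` set to `false` there is no tree left to walk afterwards, so any processing has to happen while the parser runs.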

  • The exception stopped occurring after I set parser.BuildParseTree = false; the rest was not necessary. But this leaves my Visitor unable to navigate the tree (because there is no tree?!). Do you have any other suggestions?

  • I looked around, but at that point it gets too specific. I don’t know ANTLR that deeply.
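One possible direction for the Visitor problem raised in the comments, offered as a sketch rather than a confirmed solution: a listener registered with `Parser.AddParseListener` is notified while the parser is running, so it does not need a finished tree. The example assumes listener generation is enabled for the grammar, so that `ObjectsBaseListener` and `ObjectsParser.FieldContext` exist; `FieldCounter` and `ListenerParse` are hypothetical names.

```csharp
using Antlr4.Runtime;

// Hypothetical listener: counts field rules as the parser finishes each one.
public class FieldCounter : ObjectsBaseListener
{
    public long Count { get; private set; }

    public override void ExitField(ObjectsParser.FieldContext context)
    {
        // Called during parsing, each time a field rule has been matched.
        Count++;
    }
}

public static class ListenerParse
{
    // Parses the token stream without building a tree and returns the field count.
    public static long CountFields(ITokenStream tokens)
    {
        var parser = new ObjectsParser(tokens);
        parser.BuildParseTree = false;

        var counter = new FieldCounter();
        // Parse listeners fire during the parse itself, not on a later tree walk.
        parser.AddParseListener(counter);

        parser.compileUnit();
        return counter.Count;
    }
}
```

As far as I can tell, with `BuildParseTree = false` the context objects passed to the listener have no children, so accessors such as `context.ID()` would not return anything useful. The listener would have to rely on `context.Start` and `context.Stop`, or the grammar would need embedded actions.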
