Reading .cap files efficiently with Python

I have some .cap files captured with tcpdump. When I try to open them with Wireshark, the machine gets very slow, I imagine because it tries to load everything into RAM.

I would like to write a Python program to work with these dumps more efficiently. The first question is: what is the difference between .cap and .pcap?

I don't need to read the entire file at once. Suppose I want to read only the part of the .cap file between 9:15 and 11:12, instead of loading the whole file into memory. How can I do this in Python? Keep in mind that the files are .cap, not .pcap.

Here is the output of: "tcpdump -r /path/to/ficehiro.cap | less"

09:32:20.107281 IP iskcon.interactivedns.com.http > 192.168.91.34.47651: Flags [S.], seq 63
8820025, ack 2476676485, win 28960, options [mss 1380,sackOK,TS val 3245680284 ecr 42949413
64,nop,wscale 7], length 0
09:32:20.107308 IP 192.168.91.34.47651 > iskcon.interactivedns.com.http: Flags [.], ack 1, 
win 229, options [nop,nop,TS val 4294941466 ecr 3245680284], length 0
09:32:20.107357 IP 192.168.91.34.47651 > iskcon.interactivedns.com.http: Flags [P.], seq 1:
181, ack 1, win 229, options [nop,nop,TS val 4294941466 ecr 3245680284], length 180: HTTP: 
GET / HTTP/1.1
09:32:20.144075 IP ec2-52-73-252-184.compute-1.amazonaws.com.http > 192.168.91.34.47570: Fl
ags [.], seq 831563414:831564782, ack 387706135, win 75, options [nop,nop,TS val 499391566 
ecr 4294941090], length 1368: HTTP
09:32:20.144094 IP 192.168.91.34.47570 > ec2-52-73-252-184.compute-1.amazonaws.com.http: Fl
ags [.], ack 1368, win 816, options [nop,nop,TS val 4294941475 ecr 499391566], length 0
09:32:20.144368 IP ec2-52-73-252-184.compute-1.amazonaws.com.http > 192.168.91.34.47570: Fl
ags [.], seq 1368:2736, ack 1, win 75, options [nop,nop,TS val 499391566 ecr 4294941090], l
ength 1368: HTTP
09:32:20.144376 IP 192.168.91.34.47570 > ec2-52-73-252-184.compute-1.amazonaws.com.http: Fl
ags [.], ack 2736, win 838, options [nop,nop,TS val 4294941475 ecr 499391566], length 0
09:32:20.145197 IP ec2-52-73-252-184.compute-1.amazonaws.com.http > 192.168.91.34.47570: Fl
ags [.], seq 2736:4104, ack 1, win 75, options [nop,nop,TS val 499391566 ecr 4294941090], l
ength 1368: HTTP
09:32:20.145204 IP 192.168.91.34.47570 > ec2-52-73-252-184.compute-1.amazonaws.com.http: Fl
ags [.], ack 4104, win 861, options [nop,nop,TS val 4294941475 ecr 499391566], length 0
09:32:20.145214 IP ec2-52-73-252-184.compute-1.amazonaws.com.http > 192.168.91.34.47570: Fl
ength 1368: HTTP
09:32:20.145218 IP 192.168.91.34.47570 > ec2-52-73-252-184.compute-1.amazonaws.com.http: Fl
ags [.], ack 5472, win 883, options [nop,nop,TS val 4294941475 ecr 499391566], length 0
09:32:20.148032 IP ec2-52-73-252-184.compute-1.amazonaws.com.http > 192.168.91.34.47570: Fl
ags [.], seq 5472:6840, ack 1, win 75, options [nop,nop,TS val 499391566 ecr 4294941090],

[Image: another CAP file]

Wireshark memory consumption when opening a 1GB CAP:

[Image: Wireshark memory usage]

  • Try "tcpdump -r /path/to/ficehiro.cap | less" in the terminal

  • Thanks @Miguel. I need it in Python, because I will process the file later!

  • Can you post a sample of the file's contents, for example the first 10 lines? Run "head -10 file.cap" in the terminal

  • Dear @Miguel, head -10 file.cap -> the output was strange characters. Is that the right command? I ran it from the Ubuntu terminal

  • It prints the first 10 lines of a file

  • @Miguel:�ò���T�X�JJ�H���g��E<@*�ϋ�&#xA;���["P�#&������q zd&#xA;�u&�����T�X,�BB�g����H��E4@@����["��&#xA;��#P���&����Y&#xA;����u&�T�X]����g����H��E�@@����["��&#xA;��#P���&�����! u& GET / HTTP/1.1 User-Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:50.1.0) Gecko/20100101 Firefox/50.1.0 Accept: / Host: www.iskconbangalore.org

  • http://www.linfo.org/head.html. This may help: https://www.google.pt/search?cli=ms-android-huawei&ei=Jy63WL2mGYmtU7Kho4gC&q=read+cap+cap+file+with+python&oq=read+cap+file+with+python&gs_l=mobile-gws-Serp.12...13514.14049.0.15132.4.4.0.0.0.291.702.0j3j1.4.0....0...1c.1j4.64.mobile-gws-Serp..1.0.0.Yxlkol3wsai

  • I will try to take a screenshot of Wireshark (with the file open) and post the link.

  • More than 30 minutes loading in Wireshark. The dump is 2.6 GB

  • You are better off following https://www.google.pt/search?cli=ms-android-huawei&ei=Jy63WL2mGYmtU7Kho4gC&q=read+cap+with+python&oq=read+cap+cap+file+with+python&gs_l=mobile-gws-serp.12...13514.14049.0.15132.4.0.0.0.0.291.702.0j3j1.4.0.......0..1j4.4.64.mobile-gws-Serp..1.0.0.Yxlkol3wsai, for example: https://jon.oberheide.org/blog/2008/10/15/dpkt-tutorial-2-parsing-a-pcap-file/. I've never worked with this, so I don't think I'll be able to help you with it

  • @Miguel: I added the output of tcpdump -r /path/to/ficehiro.cap | less to the question.

  • @Miguel: I added an image of another CAP file

  • Apparently Wireshark has its own API for Lua integration: https://www.wireshark.org/docs/wsdg_html_chunked/wsluarm_modules.html That is, maybe Python is not the best for you. :)

  • Take a look at my answer; it will help you process giant captures...

2 answers

2


I only saw this question now. The efficient way is to read the file in pieces. With file pointers it is possible to set the start and end positions of a read. Python cannot manipulate memory pointers directly, but fortunately the built-in open function (which is presumably written in C) manages the file position internally, and that helps when reading files. With it you can read a file a given number of bytes at a time, that is, piece by piece, without loading the entire file. Here is how it is done:

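from scapy.all import *  # not used in this snippet; kept for later packet parsing
import dpkt              # not used in this snippet either

# a pcap file is binary data, so open it in binary mode
with open("capture21dez2016.pcap", "rb") as f:
    pcap = f.read(4096)  # read the first 4096-byte piece
    while pcap:

        # process each piece here

        pcap = f.read(4096)  # read the next piece; empty at end of file

# the with block closes the file automatically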

The example reads the file 4096 bytes at a time until it reaches the end. This avoids exhausting memory and is very useful when you have to go through giant files. The f.read() call remembers the current position, so each call continues from where the previous one stopped.
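The same loop can be wrapped in a generator so that the processing code is kept separate from the reading code. This is only a sketch: the names iter_chunks and count_bytes are hypothetical helpers, not part of any library.

```python
from functools import partial

def iter_chunks(path, chunk_size=4096):
    """Yield successive pieces of at most chunk_size bytes from the file."""
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read(chunk_size)
        # until it returns b"", which signals the end of the file
        for chunk in iter(partial(f.read, chunk_size), b""):
            yield chunk

def count_bytes(path, chunk_size=4096):
    """Example consumer: total file size, computed one piece at a time."""
    return sum(len(chunk) for chunk in iter_chunks(path, chunk_size))
```

Keep in mind that fixed-size pieces ignore packet boundaries: a 4096-byte piece may end in the middle of a packet, so it cannot be parsed as a packet by itself.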

You can also start reading a file from a given position using seek, which is very useful when you need to start reading at a specific byte. See an example:

arquivo.txt

A
B
C
D
E
F
G
H
I
J

Each line break here takes 2 bytes, assuming the file was saved with Windows line endings (\r\n: one byte for the carriage return and one for the line feed). As an example, what if I want to read arquivo.txt starting at byte offset 3?

>>> f = open('arquivo.txt', 'r+')
>>> f.seek(3)
>>> f.read(1)
'B'

seek(3) tells the file where to point, in this case positioning the pointer at byte offset 3 (the fourth byte, since offsets start at zero), and read(1) reads 1 byte of data from that position.

You may ask why offset 3 holds the letter B. Recall that each line break takes 2 bytes (\r\n), so the first line occupies:

A\r\n = 3 bytes

In other words, the letter B is the fourth byte of the file, at offset 3, which is what we did: we positioned the read at offset 3 and read the next 1 byte.

And if I want the letter F?

By the same principle, counting the characters and the line breaks, the letter F is the 16th byte (offset 15). To position the pointer and read it:

>>> f.seek(15)
>>> f.read(1)
'F'
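These offsets can be checked with a short sketch. It assumes that arquivo.txt uses \r\n line endings, and it opens the file in binary mode so that offsets count raw bytes. The temporary file stands in for arquivo.txt, and byte_at is a hypothetical helper.

```python
import os
import tempfile

# Stand-in for arquivo.txt: letters A to J, each followed by a
# 2-byte \r\n line ending (an assumption about the original file)
data = b"".join(letter.encode("ascii") + b"\r\n" for letter in "ABCDEFGHIJ")

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

def byte_at(path, offset):
    """Return the single byte stored at the given zero-based offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(1)

letter_b = byte_at(path, 3)    # A, \r and \n occupy offsets 0 to 2
letter_f = byte_at(path, 15)   # five 3-byte lines come before F
os.remove(path)
```

If the file had Unix line endings (a single \n), each line would take 2 bytes and B would sit at offset 2 instead.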

For a continuous text file with no line breaks:

arquivo2.txt

KLMNOPQRS

If I want the letter of the fifth byte:

>>> f = open('arquivo2.txt', 'r+')
>>> f.seek(4)
>>> f.read(1)
'O'

And if I want the entire file from the fifth byte?

>>> f.seek(4)
>>> f.read()
'OPQRS'

And if I want to move through the file backwards? You can pass 2 as the second argument, seek(X, 2), which makes the offset relative to the end of the file.

>>> f.seek(-6,2)
>>> f.read(1)
'N'
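One caveat for Python 3: files opened in text mode reject a nonzero offset relative to the end, so the file has to be opened in binary mode. A minimal sketch, using an in-memory buffer as a stand-in for arquivo2.txt (assumed to have no trailing line break):

```python
import io
import os

# In-memory stand-in for arquivo2.txt opened with mode "rb"
f = io.BytesIO(b"KLMNOPQRS")

f.seek(-6, os.SEEK_END)      # os.SEEK_END == 2: offset counted from the end
sixth_from_end = f.read(1)   # the byte 6 positions before the end

f.seek(-3, os.SEEK_END)
tail = f.read()              # everything from there to the end of the file
```

Using the named constant os.SEEK_END instead of the literal 2 makes the intent easier to read.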

With this concept you will be able to move around giant files efficiently...

Now all you have to do is read the file in pieces, or start from a certain position, and check which lines fall within the desired range. Once you have stored the data, end the loop with a break. This way you most likely will not need to go through the entire file, unless the desired data is on its last line, and even then you can devise some trick to decide whether to start reading from the beginning or from the end of the file.
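Applied to a capture, the idea could look like the sketch below. It assumes the file is in the classic pcap format (which tcpdump writes whatever extension the file name has) and that packets are stored in chronological order. It does not handle pcapng. The names PCAP_MAGICS and packets_between are hypothetical, and the time range is given as Unix timestamps in seconds.

```python
import struct

# First 4 bytes of a classic pcap file -> (byte order, timestamp fractions per second)
PCAP_MAGICS = {
    b"\xd4\xc3\xb2\xa1": ("<", 1_000_000),      # little-endian, microseconds
    b"\xa1\xb2\xc3\xd4": (">", 1_000_000),      # big-endian, microseconds
    b"\x4d\x3c\xb2\xa1": ("<", 1_000_000_000),  # little-endian, nanoseconds
    b"\xa1\xb2\x3c\x4d": (">", 1_000_000_000),  # big-endian, nanoseconds
}

def packets_between(path, start_ts, end_ts):
    """Yield (timestamp, raw_bytes) for packets with start_ts <= timestamp <= end_ts."""
    with open(path, "rb") as f:
        header = f.read(24)                      # 24-byte global file header
        if len(header) < 24 or header[:4] not in PCAP_MAGICS:
            raise ValueError("not a classic pcap file")
        endian, fractions = PCAP_MAGICS[header[:4]]
        # per-packet header: ts_sec, ts_fraction, captured length, original length
        record = struct.Struct(endian + "IIII")
        while True:
            raw = f.read(record.size)
            if len(raw) < record.size:
                break                            # end of file
            ts_sec, ts_frac, incl_len, _orig_len = record.unpack(raw)
            ts = ts_sec + ts_frac / fractions
            if ts > end_ts:
                break                            # past the range: stop early
            if ts < start_ts:
                f.seek(incl_len, 1)              # skip the packet data without reading it
                continue
            yield ts, f.read(incl_len)
```

To express a range such as 9:15 to 11:12, the bounds can be built with datetime.datetime(year, month, day, 9, 15).timestamp(), using the date of the capture. Only one packet is held in memory at a time, and packets before the range are skipped with seek rather than read.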

  • Thank you very much. I will study it calmly and come back to accept the answer!

  • Does scapy work 100% in Python 3?

  • I tried to adapt my code here: http://imgur.com/aMqxn3H Errors: http://imgur.com/a/QybqX

  • I need to separate the source and destination IPs into different text files so that I can calculate entropy afterwards!

  • Somehow you have to take the pieces of data and pass them to dpkt.pcap.Reader, isn't that it? It puts them in the format you want. I have no way to test it here...

  • Don't I have to use some kind of decode? I ask because pcap = F1.read(4096) is a string!

  • Isn't that what pcap.Reader does? It does the decoding. That part is up to you; I believe that after every pcap = f.read(4096) line you would have to do pcap_converted=dpkt.pcap.Reader(pcap)

  • I was able to do it another way too, what do you think? http://imgur.com/a/gMY4m

  • RAM consumption dropped from 97/98% to about 20%. Do you see any problem in the code?

  • Thanks for your solution too! Before, even with a 500 MB PCAP, RAM consumption already hit 97%. My laptop has 8 GB of RAM...

  • this code is part of an academic research project. I really appreciate your contribution! I’m running the algorithm now on all dumps to then fill the entropy charts! Thank you

2

To make things easier I simulated a .cap file by creating a text file, cap1.cap, where each line has only the leading characters (the time), based on what you posted here. It looks like this:

09:32:20.107281

09:32:20.107357

09:32:21.144075

09:32:21.144094

09:32:21.144368

09:33:21.144376

09:34:21.145197

09:35:00.145204

09:36:20.145214

09:36:20.145218

09:37:20.148032

Then I wrote code to read this file and print only the lines between 09:32:21 and 09:35:00. I tested it and it worked as expected; I believe that with some adaptations it will solve your problem. Code below.

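import datetime
import re

# start and end of the time range to print
inicio = datetime.datetime.strptime('09:32:21', '%H:%M:%S').time()
fim = datetime.datetime.strptime('09:35:00', '%H:%M:%S').time()

startprint = False  # becomes True once the first timestamped line is seen
with open('cap1.cap', 'r') as f:
    for line in f:
        str_begin = line[:8]  # the 'HH:MM:SS' prefix of the line
        if re.match(r'^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$', str_begin) is not None:
            t = datetime.datetime.strptime(str_begin, '%H:%M:%S').time()
            startprint = True
        if startprint:
            if t > fim:
                break  # times are in ascending order, so nothing later can match
            if inicio <= t <= fim:
                print(line, end='')  # the line already ends with a newline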

Result:

09:32:21.144075

09:32:21.144094

09:32:21.144368

09:33:21.144376

09:34:21.145197

09:35:00.145204
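The same filter can be written as a generator that accepts any iterable of lines, which makes it easy to feed it an open file or standard input. This is a sketch under the same assumption of ascending times. The name lines_between and the sample text are hypothetical.

```python
import datetime
import io
import re

TIME_PREFIX = re.compile(r"^\d{2}:\d{2}:\d{2}")

def lines_between(lines, start, end):
    """Yield the lines whose HH:MM:SS prefix lies between start and end, inclusive.

    Lines without a time prefix (wrapped continuation lines) inherit the
    time of the last timestamped line.
    """
    current = None
    for line in lines:
        match = TIME_PREFIX.match(line)
        if match:
            current = datetime.datetime.strptime(match.group(), "%H:%M:%S").time()
        if current is None:
            continue      # no timestamped line seen yet
        if current > end:
            break         # past the range: stop reading
        if current >= start:
            yield line

# Hypothetical sample standing in for tcpdump text output
sample = io.StringIO(
    "09:32:20.107281 first\n"
    "09:32:21.144075 second\n"
    "  wrapped continuation\n"
    "09:35:00.145204 third\n"
    "09:36:20.145214 fourth\n"
)
selected = list(lines_between(sample, datetime.time(9, 32, 21), datetime.time(9, 35, 0)))
```

Passing sys.stdin as the lines argument would let the script sit at the end of a pipe that starts with tcpdump -r, so the text output is filtered as it is produced. A file object yields its lines one at a time when iterated, so the whole file is not held in memory.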

  • Note that the code still has to scan the file line by line from the beginning until it reaches the time range (iterating over the file object does not load the whole file into memory), so I don't know whether it will be faster than the application you use.

  • One way to try to gain speed in Python would be to use Cython
