Define a specific log format in .goaccessrc - goaccess

I'm reading the goaccess man page but I'm missing simple examples. I have a customised nginx with the following config:
log_format timed_combined '$remote_addr - $remote_user [$time_local] '
'$ssl_protocol/$ssl_cipher '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent" '
'$request_time $upstream_response_time $pipe';
Here is an example log entry:
66.249.76.120 - - [20/Dec/2016:19:04:03 +0100] TLSv1.2/ECDHE-RSA-AES128-GCM-SHA256 "GET / HTTP/1.1" 200 27232 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.026 0.026 .
How do I have to configure the .goaccessrc to read that format?

You can add this to your config file or ~/.goaccessrc:
log-format %h %^[%d:%t %^] %^"%r" %s %b "%R" "%u" %T %^
date-format %d/%b/%Y
time-format %H:%M:%S
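If you want to sanity-check which fields that format picks up, here is a rough Python counterpart of the same mapping. It is only a sketch: the regular expression and the group names are my own and are not something goaccess uses internally. Like the `%^` specifiers, it skips the ident, user, timezone, TLS and trailing fields.

```python
import re
from datetime import datetime

# Hypothetical Python counterpart of the goaccess log-format above.
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ '                     # %h, then ident and user are skipped
    r'\[(?P<date>[^:]+):(?P<time>\S+) [^\]]+\] '  # [%d:%t %^]
    r'\S+ '                                       # $ssl_protocol/$ssl_cipher (skipped)
    r'"(?P<request>[^"]*)" '                      # "%r"
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '        # %s %b
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" '   # "%R" "%u"
    r'(?P<request_time>[\d.]+)'                   # %T; the rest of the line is ignored
)

entry = ('66.249.76.120 - - [20/Dec/2016:19:04:03 +0100] '
         'TLSv1.2/ECDHE-RSA-AES128-GCM-SHA256 "GET / HTTP/1.1" 200 27232 '
         '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
         '+http://www.google.com/bot.html)" 0.026 0.026 .')

fields = LINE_RE.match(entry).groupdict()
# date-format %d/%b/%Y and time-format %H:%M:%S correspond to this strptime call
when = datetime.strptime(fields['date'] + ' ' + fields['time'],
                         '%d/%b/%Y %H:%M:%S')
print(fields['host'], when, fields['status'], fields['request_time'])
```

If the sample entry parses here, the same field order should line up in goaccess; a mismatch usually means a field was added to or removed from the nginx log_format.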

Related

Parsing apache log files

I just started learning Python and would like to read an Apache log file and put parts of each line into different lists.
line from the file
172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"
According to the Apache website, the format is
%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"
I'm able to open the file and just read it as it is but I don't know how to make it read in that format so I can put each part in a list.
This is a job for regular expressions.
For example:
import re
line = '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"'
# raw string; [\d.]+ matches the dotted IP address
regex = r'([\d.]+) - - \[(.*?)\] "(.*?)" (\d+) - "(.*?)" "(.*?)"'
print(re.match(regex, line).groups())
The output would be a tuple with 6 pieces of information from the line (specifically, the groups within parentheses in that pattern):
('172.16.0.3', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827')
Use a regular expression to split a row into separate "tokens":
>>> row = """172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827" """
>>> import re
>>> list(map(''.join, re.findall(r'\"(.*?)\"|\[(.*?)\]|(\S+)', row)))
['172.16.0.3', '-', '-', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '-', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827']
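Since the question asks for each part in its own list, the tokens can be transposed with zip. This is a minimal sketch, assuming every line yields the same nine tokens; the second sample line and the list names are made up for illustration.

```python
import re

TOKEN_RE = re.compile(r'"(.*?)"|\[(.*?)\]|(\S+)')

def tokenize(row):
    # At most one of the three groups is non-empty per match, so joining picks it.
    return [''.join(groups) for groups in TOKEN_RE.findall(row)]

rows = [
    '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" '
    '"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"',
    # Invented second line, only here to show the transposition.
    '172.16.0.4 - - [25/Sep/2002:14:05:00 +0200] "GET /a HTTP/1.1" 200 12 "" '
    '"curl/7.0"',
]

# zip(*...) turns a list of rows into one tuple per column.
hosts, idents, users, times, requests, statuses, sizes, referers, agents = (
    list(column) for column in zip(*(tokenize(r) for r in rows)))

print(hosts)
print(statuses)
```

Lines that produce a different number of tokens would break the unpacking, so real log files probably need a length check before the zip.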
Another solution is to use a dedicated tool, e.g. http://pypi.python.org/pypi/pylogsparser/0.4
I have created a python library which does just that: apache-log-parser.
>>> import apache_log_parser
>>> line_parser = apache_log_parser.make_parser("%h <<%P>> %t %Dus \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %l %u")
>>> log_line_data = line_parser('127.0.0.1 <<6113>> [16/Aug/2013:15:45:34 +0000] 1966093us "GET / HTTP/1.1" 200 3478 "https://example.com/" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18)" - -')
>>> from pprint import pprint
>>> pprint(log_line_data)
{'pid': '6113',
'remote_host': '127.0.0.1',
'remote_logname': '-',
'remote_user': '',
'request_first_line': 'GET / HTTP/1.1',
'request_header_referer': 'https://example.com/',
'request_header_user_agent': 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18)',
'response_bytes_clf': '3478',
'status': '200',
'time_received': '[16/Aug/2013:15:45:34 +0000]',
'time_us': '1966093'}
RegEx seemed extreme and problematic considering the simplicity of the format, so I wrote this little splitter which others may find useful as well:
def apache2_logrow(s):
    ''' Fast split on Apache2 log lines
    http://httpd.apache.org/docs/trunk/logs.html
    '''
    row = [ ]
    qe = qp = None # quote end character (qe) and quote parts (qp)
    for s in s.replace('\r','').replace('\n','').split(' '):
        if qp:
            qp.append(s)
        elif '' == s: # blanks
            row.append('')
        elif '"' == s[0]: # begin " quote "
            qp = [ s ]
            qe = '"'
        elif '[' == s[0]: # begin [ quote ]
            qp = [ s ]
            qe = ']'
        else:
            row.append(s)
        l = len(s)
        if l and qe == s[-1]: # end quote
            if l == 1 or s[-2] != '\\': # don't end on escaped quotes
                row.append(' '.join(qp)[1:-1].replace('\\'+qe, qe))
                qp = qe = None
    return row
Add this in httpd.conf to convert the apache logs to json.
LogFormat "{\"time\":\"%t\", \"remoteIP\" :\"%a\", \"host\": \"%V\", \"request_id\": \"%L\", \"request\":\"%U\", \"query\" : \"%q\", \"method\":\"%m\", \"status\":\"%>s\", \"userAgent\":\"%{User-agent}i\", \"referer\":\"%{Referer}i\" }" json_log
CustomLog /var/log/apache_access_log json_log
CustomLog "|/usr/bin/python -u apacheLogHandler.py" json_log
Now your access logs are written in JSON format.
Use the Python code below to read the JSON logs as they are continuously updated.
apacheLogHandler.py
import time
f = open('apache_access_log.log', 'r')
for line in f:  # read all lines already in the file
    print(line.strip())
# keep waiting forever for more lines.
while True:
    line = f.readline()  # just read more
    if line:  # if you got something...
        print('got data:', line.strip())
    else:
        time.sleep(1)  # nothing new yet; wait before polling again
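The handler above only echoes the raw lines. To work with the individual fields, each line can be decoded with the standard json module. This is a minimal sketch, assuming one JSON object per line in the shape of the LogFormat shown above; the sample line and the helper name are invented. Apache does not produce strict JSON escaping (for example, it writes `\xhh` sequences for unprintable bytes in headers), so lines that fail to decode are skipped here rather than trusted.

```python
import json

def parse_json_log_line(line):
    """Return the decoded dict, or None if the line is blank or not valid JSON."""
    line = line.strip()
    if not line:
        return None
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None

# Invented sample in the shape produced by the json_log LogFormat.
sample = ('{"time":"[16/Aug/2013:15:45:34 +0000]", "remoteIP" :"127.0.0.1", '
          '"host": "example.com", "request_id": "-", "request":"/index.html", '
          '"query" : "", "method":"GET", "status":"200", '
          '"userAgent":"curl/7.0", "referer":"-" }')

record = parse_json_log_line(sample)
print(record['method'], record['request'], record['status'])
```

In the polling loop, `parse_json_log_line(line)` would replace the plain `print`, with `None` results ignored or counted.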
import re
HOST = r'^(?P<host>.*?)'
SPACE = r'\s'
IDENTITY = r'\S+'
USER = r'\S+'
TIME = r'(?P<time>\[.*?\])'
REQUEST = r'\"(?P<request>.*?)\"'
STATUS = r'(?P<status>\d{3})'
SIZE = r'(?P<size>\S+)'
REGEX = HOST+SPACE+IDENTITY+SPACE+USER+SPACE+TIME+SPACE+REQUEST+SPACE+STATUS+SPACE+SIZE+SPACE
def parser(log_line):
    match = re.search(REGEX, log_line)
    return (match.group('host'),
            match.group('time'),
            match.group('request'),
            match.group('status'),
            match.group('size'))
logLine = """180.76.15.30 - - [24/Mar/2017:19:37:57 +0000] "GET /shop/page/32/?count=15&orderby=title&add_to_wishlist=4846 HTTP/1.1" 404 10202 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"""
result = parser(logLine)
print(result)

Python parse GET|POST path from access logs

Suppose we have some access logs like this
83.198.250.175 - - [22/Mar/2009:07:40:06 +0100] "GET /images/ht1.gif HTTP/1.1" 200 61 "http://www.facades.fr/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Wanadoo 6.7; Orange 8.0)" "-"
65.33.94.190 - - [05/Apr/2003:17:26:27 -0500] "POST /samples/dem/tt.php?x=e2323 HTTP/1.0" 404 276
151.227.152.48 - - [02/Jul/2014:14:35:55 +0100] "GET /css/main.css HTTP/1.1" 200 4658 "http://stanmore.menczykowski.co.uk/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
10.143.2.119 64.103.161.112 - [06/Jan/1970:00:48:01 +0000] "GET /right_arrow.jpg HTTP/1.1" 304 0 "http://64.103.161.112/index_eth_diag.html" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36"
I need to extract the path that follows POST or GET (the path to the file).
The log format may vary, but the request type and path will always be present.
I tried the following, but it didn't always work because the log format is not always the same:
import re
parts = [
    r'(?P<host>\S+)', # host %h
    r'\S+', # indent %l (unused)
    r'(?P<user>\S+)', # user %u
    r'\[(?P<time>.+)\]', # time %t
    r'"(?P<request>.*)"', # request "%r"
    r'(?P<status>[0-9]+)', # status %>s
    r'(?P<size>\S+)', # size %b (careful, can be '-')
    r'"(?P<referrer>.*)"', # referrer "%{Referer}i"
    r'"(?P<agent>.*)"', # user agent "%{User-agent}i"
]
def get_structured_access_logs_list(access_logs):
    pattern = re.compile(r'\s+'.join(parts) + r'\s*\Z')
    # Initialize required variables
    log_data = []
    # Get components from each line of the log file into a structured dict
    for line in access_logs:
        try:
            log_data.append(pattern.match(line).groupdict())
        except AttributeError:  # no match: pattern.match() returned None
            pass
    return log_data
def parse_path(request_string):
    rx = re.compile(r'^(?:GET|POST)\s+([^?\s]+).*$', re.M)
    return rx.findall(request_string)
def get_file_paths(access_logs_list):
    file_path_set = set()
    for entry in access_logs_list:
        if 'request' in entry:
            matches = parse_path(entry['request'])  # a single request yields at most one match
            if matches:
                file_path_set.add(matches[0])
    return file_path_set
UPDATE:
After adjusting the code, the function 'get_file_paths' returns a set containing the full paths of the files accessed in the access logs:
import os
import re
def parse_path(request_string):
    rx = re.compile(r'"(?:GET|POST)\s+([^\s?]*)', re.M)
    return rx.findall(request_string)
def get_file_paths(access_logs):
    file_set = set()
    for line in access_logs:
        matches = parse_path(line)  # a single line yields at most one match
        if not matches:
            continue
        full_path = root_path + matches[0]  # root_path is assumed to be defined elsewhere
        if os.path.isfile(full_path):
            file_set.add(full_path)
    return file_set
Since your regex is very generic (\S and . are both very broad), why not use this directly:
"(?:GET|POST)\s+([^\s?]*)
[^\s?] matches any character that is neither whitespace nor a question mark.
See here a demo.
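For reference, here is how that pattern could be applied in Python. It is a small sketch using shortened versions of the sample lines from the question (the referrer and user agent are cut off, since the pattern never looks at them).

```python
import re

PATH_RE = re.compile(r'"(?:GET|POST)\s+([^\s?]*)')

# Sample lines from the question, truncated after the size field.
lines = [
    '83.198.250.175 - - [22/Mar/2009:07:40:06 +0100] "GET /images/ht1.gif HTTP/1.1" 200 61',
    '65.33.94.190 - - [05/Apr/2003:17:26:27 -0500] "POST /samples/dem/tt.php?x=e2323 HTTP/1.0" 404 276',
    '10.143.2.119 64.103.161.112 - [06/Jan/1970:00:48:01 +0000] "GET /right_arrow.jpg HTTP/1.1" 304 0',
]

# search() rather than match(), because the request is in the middle of the line
paths = [m.group(1) for m in map(PATH_RE.search, lines) if m]
print(paths)
```

Because the pattern only anchors on the opening quote and the method, it does not care how many fields come before or after the request.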
You may use
(?x)^
(?P<host>\S+) \s+ # host %h
\S+ \s+ # indent %l (unused)
(?P<user>\S+) \s+ # user %u
\[(?P<time>.*?)\] \s+ # time %t
"\S+\s+(?P<request>[^"?\s]*)[^"]*" \s+ # request "%r"
(?P<status>[0-9]+) \s+ # status %>s
(?P<size>\S+) (?:\s+ # size %b (careful, can be '-')
"(?P<referrer>[^"?\s]*[^"]*)" \s+ # referrer "%{Referer}i"
"(?P<agent>[^"]*)" (?:\s+ # user agent "%{User-agent}i"
"[^"]*" )? )? # unused
$
See the regex demo.
There are a lot of minor improvements I introduced (see [^"]* instead of .*). The major ones are the optional non-capturing groups, which match the referrer and agent fields that may be missing, and the request pattern (?P<request>[^"?\s]*), which only captures 0 or more chars other than whitespace, ? and ", while the subsequent [^"]*" matches the rest of the field.
Also, it makes sense to compile the pattern once at module level rather than inside the function, where it is recompiled on every call.
The (?x) modifier enables the free spacing mode making it possible to format the pattern on multiple lines and add comments.
Python demo:
import re
pattern = re.compile(r"""(?x)^
(?P<host>\S+) \s+ # host %h
\S+ \s+ # indent %l (unused)
(?P<user>\S+) \s+ # user %u
\[(?P<time>.*?)\] \s+ # time %t
"\S+\s+(?P<request>[^"?\s]*)[^"]*" \s+ # request "%r"
(?P<status>[0-9]+) \s+ # status %>s
(?P<size>\S+) (?:\s+ # size %b (careful, can be '-')
"(?P<referrer>[^"?\s]*[^"]*)" \s+ # referrer "%{Referer}i"
"(?P<agent>[^"]*)" (?:\s+ # user agent "%{User-agent}i"
"[^"]*" )?)? # optional argument (unused)
$""")
def get_structured_access_logs_list(access_logs):
    # Initialize required variables
    log_data = []
    # Get components from each line of the log file into a structured dict
    for line in access_logs:
        try:
            log_data.append(pattern.match(line).groupdict())
        except AttributeError:  # no match: pattern.match() returned None
            pass
    return log_data
lines = ['83.198.250.175 - - [22/Mar/2009:07:40:06 +0100] "GET /images/ht1.gif HTTP/1.1" 200 61 "http://www.facades.fr/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Wanadoo 6.7; Orange 8.0)" "-"',
'65.33.94.190 - - [05/Apr/2003:17:26:27 -0500] "POST /samples/dem/tt.php?x=e2323 HTTP/1.0" 404 276',
'151.227.152.48 - - [02/Jul/2014:14:35:55 +0100] "GET /css/main.css HTTP/1.1" 200 4658 "http://stanmore.menczykowski.co.uk/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"',
'10.143.2.119 64.103.161.112 - [06/Jan/1970:00:48:01 +0000] "GET /right_arrow.jpg HTTP/1.1" 304 0 "http://64.103.161.112/index_eth_diag.html" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36"']
for res in get_structured_access_logs_list(lines):
    print(res)
You can use this regex and get the path from group 1:
^.*?"(?:GET|POST) ([^\s?]+)
Demo

Varnish stops waiting after 1 second

This problem is driving me crazy: my Varnish instance stops waiting for a backend response after exactly 1 second.
The first call to every page results in a 503 Backend error.
Daemon is configured this way:
DAEMON_OPTS="-a :80 \
-T localhost:6082 \
-f /etc/varnish/default.vcl \
-S /etc/varnish/secret \
-p thread_pool_add_delay=2 \
-p thread_pools=4 \
-p thread_pool_min=200 \
-p thread_pool_max=4000 \
-p timeout_linger=50 \
-p connect_timeout=300 \
-p first_byte_timeout=300 \
-p between_bytes_timeout=300 \
-p send_timeout=900 \
-s malloc,3G"
and the VCL backend:
backend default { # Define one backend
.host = "127.0.0.1"; # IP or Hostname of backend
.port = "8080"; # Port Apache or whatever is listening
.probe = {
.url = "/";
.timeout = 1s;
.interval = 1s;
.window = 10;
.threshold = 8;
}
.first_byte_timeout = 60s; # How long to wait before we receive a first byte from our backend?
.connect_timeout = 60s; # How long to wait for a backend connection?
.between_bytes_timeout = 60s; # How long to wait between bytes received from our backend?
}
Here is the one call in the log:
* << Request >> 3440734
- Begin req 3440733 rxreq
- Timestamp Start: 1462781837.623325 0.000000 0.000000
- Timestamp Req: 1462781837.623325 0.000000 0.000000
- ReqStart 10.20.129.118 58572
- ReqMethod GET
- ReqURL xxxxx.html
- ReqProtocol HTTP/1.1
- ReqHeader Accept: image/jpeg, application/x-ms-application, image/gif, application/xaml+xml, image/pjpeg, application/x-ms-xbap, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, image/pjpeg, application/x-shockwave-flash, */*
- ReqHeader Referer: http://xxxxxx.html
- ReqHeader Accept-Language: it-IT
- ReqHeader User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)
- ReqHeader Accept-Encoding: gzip, deflate
- ReqHeader Host: xxxxxx
- ReqHeader DNT: 1
- ReqHeader Connection: Keep-Alive
- ReqHeader Cookie: fc_uid=p; __utma=127650066.830977012.1423064118.1426582505.1426588086.20; _ga=GA1.3.830977012.1423064118; _gat_UA-13041322-1=1; ZNPCQ003-38303300=71a0f671; ff607e18ab6c715f4bb35b5bbcbe1c56=d82989olp2v0ur3gpl2ouprko6; _ga=GA1.2.830977012.142306411
- ReqHeader X-Forwarded-For: 10.20.129.118
- VCL_call RECV
- ReqUnset Host: xxx
- ReqHeader Host: xxx
- ReqURL /xxxx.html
- ReqUnset Cookie: fc_uid=p; __utma=127650066.830977012.1423064118.1426582505.1426588086.20; _ga=GA1.3.830977012.1423064118; _gat_UA-13041322-1=1; ZNPCQ003-38303300=71a0f671; ff607e18ab6c715f4bb35b5bbcbe1c56=d82989olp2v0ur3gpl2ouprko6; _ga=GA1.2.830977012.142306411
- ReqHeader Cookie: fc_uid=p; __utma=127650066.830977012.1423064118.1426582505.1426588086.20; _ga=GA1.3.830977012.1423064118; _gat_UA-13041322-1=1; ZNPCQ003-38303300=71a0f671; ff607e18ab6c715f4bb35b5bbcbe1c56=d82989olp2v0ur3gpl2ouprko6; _ga=GA1.2.830977012.142306411
- ReqUnset Cookie: fc_uid=p; __utma=127650066.830977012.1423064118.1426582505.1426588086.20; _ga=GA1.3.830977012.1423064118; _gat_UA-13041322-1=1; ZNPCQ003-38303300=71a0f671; ff607e18ab6c715f4bb35b5bbcbe1c56=d82989olp2v0ur3gpl2ouprko6; _ga=GA1.2.830977012.142306411
- ReqHeader Cookie: fc_uid=p; __utma=127650066.830977012.1423064118.1426582505.1426588086.20; _ga=GA1.3.830977012.1423064118; _gat_UA-13041322-1=1; ZNPCQ003-38303300=71a0f671; ff607e18ab6c715f4bb35b5bbcbe1c56=d82989olp2v0ur3gpl2ouprko6; _ga=GA1.2.830977012.142306411
- ReqHeader Surrogate-Capability: key=ESI/1.0
- VCL_return hash
- ReqUnset Accept-Encoding: gzip, deflate
- ReqHeader Accept-Encoding: gzip
- VCL_call HASH
- VCL_return lookup
- VCL_call MISS
- VCL_return fetch
- Link bereq 3440735 fetch
- Timestamp Fetch: 1462781838.492085 0.868760 0.868760
- Timestamp Process: 1462781838.492101 0.868776 0.000016
- RespHeader Date: Mon, 09 May 2016 08:17:18 GMT
- RespHeader Server: Varnish
- RespHeader X-Varnish: 3440734
- RespProtocol HTTP/1.1
- RespStatus 503
- RespReason Service Unavailable
- RespReason Service Unavailable
- VCL_call SYNTH
- VCL_return deliver
- RespHeader Content-Length: 0
- Storage malloc Transient
- Debug "RES_MODE 2"
- RespHeader Connection: keep-alive
- Timestamp Resp: 1462781838.492145 0.868820 0.000044
- ReqAcct 985 0 985 153 0 153
- End
Any precious help? Suggestion?
Pretty sure the issue is here:
.timeout = 1s;
.interval = 1s;
Notice that both of them are set to 1s (1 second).
I think what's happening is that the backend probe times out, so Varnish flags the backend as "unhealthy" until it responds within a second.
Try modifying those values to check whether that's the issue or not.
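To illustrate why a 1s probe timeout can take a slow backend out of rotation, here is a rough Python model of the probe bookkeeping. It is a sketch only and not Varnish code: it assumes the documented meaning of .window and .threshold (the backend counts as healthy when at least .threshold of the last .window probes succeeded) and treats a probe as failed when the response takes longer than .timeout. The response times are invented.

```python
from collections import deque

def health_after_probes(response_times, timeout=1.0, window=10, threshold=8):
    """Simplified model: healthy if >= threshold of the last `window` probes beat the timeout."""
    recent = deque(maxlen=window)  # keeps only the newest `window` results
    for seconds in response_times:
        recent.append(seconds <= timeout)
    return sum(recent) >= threshold

# Invented timings: a backend that needs about 1.2s for "/" fails every 1s probe.
slow = [1.2] * 10
fast = [0.3] * 10

healthy_slow_strict = health_after_probes(slow)                 # .timeout = 1s
healthy_slow_relaxed = health_after_probes(slow, timeout=5.0)   # relaxed timeout
healthy_fast = health_after_probes(fast)
print(healthy_slow_strict, healthy_slow_relaxed, healthy_fast)
```

Under these assumptions the slow backend is only considered healthy once the probe timeout is raised above its typical response time, which is the experiment suggested above.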

How to generate regex to match access.log according to its format config?

The access.log format config may look like this:
'$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"'
Is there a way to generate a regex that matches the access.log according to it? I can write a regex based on an actual log line like:
'112.3.194.120 - - [17/Jan/2015:20:07:34 +0800] "GET /Introdction%20to%20Guitar/1%20-%202%20-%20Choosing%20the%20Right%20Guitar-%20Right-Handed%20vs%20Left-Handed%20(3-20).mp4 HTTP/1.1" 206 546849 "http://example.com/video/302/" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"'
but I can't write a regex from the format config. Could anyone help?
To build an expression from the config, replace config variables like $xxx with named groups like (?P<xxx>.*?) and escape delimiter characters:
import re
conf = '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"'
regex = ''.join(
    '(?P<' + g + '>.*?)' if g else re.escape(c)
    for g, c in re.findall(r'\$(\w+)|(.)', conf))
Now if you match a log entry against this expression:
log = '112.3.194.120 - - [17/Jan/2015:20:07:34 +0800] "GET /Introdction%20to%20Guitar/1%20-%202%20-%20Choosing%20the%20Right%20Guitar-%20Right-Handed%20vs%20Left-Handed%20(3-20).mp4 HTTP/1.1" 206 546849 "http://example.com/video/302/" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"'
m = re.match(regex, log)
your variables get captured in the matchObject.groupdict:
import pprint
pprint.pprint(m.groupdict())
result:
{'body_bytes_sent': '546849',
'http_referer': 'http://example.com/video/302/',
'http_user_agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36',
'remote_addr': '112.3.194.120',
'remote_user': '-',
'request': 'GET /Introdction%20to%20Guitar/1%20-%202%20-%20Choosing%20the%20Right%20Guitar-%20Right-Handed%20vs%20Left-Handed%20(3-20).mp4 HTTP/1.1',
'status': '206',
'time_local': '17/Jan/2015:20:07:34 +0800'}
If there are no delimiters in your log config, you'll have to use more specific sub-patterns, not just .*. This can be coded elegantly in a way similar to this:
# variable-specific patterns
patterns = {
    'remote_addr': r'(\d{1,3}\.){3}\d{1,3}',
    'body_bytes_sent': r'\d+',
    # etc
}
regex = ''.join(
    '(?P<%s>%s)' % (g, patterns.get(g, '.*?')) if g
    else re.escape(c)
    for g, c in re.findall(r'\$(\w+)|(.)', conf))

Regex search debug

This is the expression:
.*\[(\d*)/(\w*)/(\d*).*"(GET|POST)\s(https?://)[a-z].*?\.([a-z]+)[^\w.-].*200
The problem I am having is with the domain name: I get .net, .cgi, .com and .htm.
I only need the first domain extension that appears in each line, which in this case is .net and .com.
68.134.160.117 - - [09/Mar/2004:22:24:27 -0500] "GET http://www.glocksoft.net/cgi-bin/jenv.cgi HTTP/1.0" 200 1169 "-" "Mozilla/4.0"
220.175.18.42 - - [09/Mar/2004:22:47:30 -0500] "GET http://www.searchlikecrazy.com/cgi-bin/smartsearch.cgi?keywords=Web+Design%20&username=arongyi HTTP/1.0" 200 26166 "http://www.yourwindow.com/searchlikecrazy.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; MyIE2)"
Where am I getting the problem?
Thanks!
Seems like your regex works just fine for me with both examples you provided (or maybe I just got the question wrong). I tested it with the following script (sorry for long lines):
#!/usr/bin/env python
import re
lines = [
    '68.134.160.117 - - [09/Mar/2004:22:24:27 -0500] "GET http://www.glocksoft.net/cgi-bin/jenv.cgi HTTP/1.0" 200 1169 "-" "Mozilla/4.0"',
    '220.175.18.42 - - [09/Mar/2004:22:47:30 -0500] "GET http://www.searchlikecrazy.com/cgi-bin/smartsearch.cgi?keywords=Web+Design%20&username=arongyi HTTP/1.0" '
    '200 26166 "http://www.yourwindow.com/searchlikecrazy.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; MyIE2)"',
]
regex = re.compile(r'.*\[(\d*)/(\w*)/(\d*).*"(GET|POST)\s(https?://)[a-z].*?\.([a-z]+)[^\w.-].*200')
for line in lines:
    match = regex.match(line)
    if match:
        print(match.groups())
Output:
('09', 'Mar', '2004', 'GET', 'http://', 'net')
('09', 'Mar', '2004', 'GET', 'http://', 'com')
Python version: 2.7.1
I guess this is probably what you are looking for.
import re
log_line = '68.134.160.117 - - [09/Mar/2004:22:24:27 -0500] "GET https://www.blog.glocksoft.net/cgi-bin/jenv.cgi HTTP/1.0" 200 1169 "-" "Mozilla/4.0"'
print(re.search(r'GET\s\w{3,5}://((\w+\.?)+)', log_line).group(1))
print(re.search(r'GET\s\w{3,5}://((\w+\.?)+)', log_line).group(2))
Output:
www.blog.glocksoft.net
net
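An alternative to squeezing the host into a single regex is to pull out the request URL first and let urllib.parse split it. This is a sketch, assuming Python 3 and that "first domain" means the last label of the request URL's hostname; the helper name request_tld is made up.

```python
import re
from urllib.parse import urlsplit

REQUEST_RE = re.compile(r'"(?:GET|POST)\s+(\S+)')

def request_tld(log_line):
    """Return the last hostname label of the requested URL, or None."""
    m = REQUEST_RE.search(log_line)
    if not m:
        return None
    host = urlsplit(m.group(1)).hostname
    return host.rsplit('.', 1)[-1] if host else None

lines = [
    '68.134.160.117 - - [09/Mar/2004:22:24:27 -0500] "GET http://www.glocksoft.net/cgi-bin/jenv.cgi HTTP/1.0" 200 1169 "-" "Mozilla/4.0"',
    '220.175.18.42 - - [09/Mar/2004:22:47:30 -0500] "GET http://www.searchlikecrazy.com/cgi-bin/smartsearch.cgi?keywords=Web+Design%20&username=arongyi HTTP/1.0" 200 26166 "http://www.yourwindow.com/searchlikecrazy.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; MyIE2)"',
]

tlds = [request_tld(line) for line in lines]
print(tlds)
```

Only the request URL is inspected, so the .cgi in the path and the .htm in the referrer never come into play.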

Resources