Now that you have been introduced to grep, it is time to delve into a real-world example and find some useful information. We have a service that monitors one of our websites and uses Google-indexed URLs to check the availability of presentation pages on the site. I know you are asking yourself who would pay for something like that, and the answer is simple: our marketing people. They will buy anything as long as they hear some salesman say "market share" or "SEO". Well, this service provided our marketing people with a list of URLs that returned a 404 status, and marketing wanted to know why our servers were not delivering content.
The first place I looked was the HTTP access_log. The access_log does exactly what its name says: it logs each request a server receives along with the result of that request, for example 200 for a request handled successfully, 404 for a page not found, or 503 when the server is temporarily unable to handle the request.
So I started with a tail on the access log to get the date format (different logs use different date formats) for the day our servers were, according to this expensive service, not delivering the content they should have been delivering.
tail without any options outputs the last 10 lines of a file to standard output (usually your monitor, unless you redirect the output).
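To get a feel for that default, here is a minimal sketch that runs tail against a throwaway file instead of a real access_log. The file and its contents are made up for illustration; -n is the standard option for asking tail for a different number of lines.

```shell
# Hypothetical stand-in for a log file: 12 numbered lines in a temp file.
log=$(mktemp)
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$log"

# No options: the last 10 lines, here "line 3" through "line 12".
tail "$log"

# -n chooses how many lines to show; here only the last 3.
last3=$(tail -n 3 "$log")
echo "$last3"

rm -f "$log"
```

On a real server you would simply run tail access_log and read the date format off the lines it prints.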
So now I know what the date format for the access log looks like.
Since this expensive service only gives the date of its results and not a proper timestamp, I filter all log entries for 29 Aug 2016 and start paring down from there. I usually work in steps, checking what I get at each stage and building on my command as I go. Since our servers rotate the access_log weekly, I can safely grep 29/Aug and not worry about getting results from 29 Aug 2015.
grep 29/Aug access_log | less
Notice the | and less. Since I do not want to see over 100,000 lines scrolling across my monitor, I am piping (that is what using the | is called) the result of my grep into the program less so I only get one screen at a time.
So far so good but all I really want is to filter the 404 results from this day. Luckily, you can grep your grep result and keep chaining the process until you have filtered down to just the information you want. The next step is then
grep 29/Aug access_log | grep 400 | less
Which does not exactly return what I wanted.
GET /zh/search/?cou=DEU&cou=CHE&cou=FRA&cou=ESP&rpp=36&cou=IND HTTP/1.1" 200 14400
Notice that the 14400 at the end contains my 400 pattern, so grep returns a line with a 200 result. I am looking for 404 results. Being the smart guy I am, I notice that the status code comes right after HTTP/1.1", so I just need to change my pattern to reflect that and now try
grep 29/Aug access_log | grep HTTP/1.1" 404 | less
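and get a > prompt instead of my expected output. The reason can be traced to the double quotation mark and the blank space in my pattern: the shell sees an opening quote that is never closed and waits for more input. As soon as I use such patterns, I need to escape the special characters or protect the whole pattern from the shell by quoting it. So now I wrap the pattern in single quotes and hand it to grep with one of my favorite flags, -e, which tells grep that the next argument is the pattern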
grep 29/Aug access_log | grep -e 'HTTP/1.1" 404' | less
and now I get
GET /search/?cid=textiles-clothing%2Ftextiles%2Fservices-publishers-associations&cou%5B3%5D=JPN&ord=name HTTP/1.1" 404
my intended result. Notice that only lines in the access_log with the 404 status code right after HTTP/1.1" are returned. Also notice that I enclosed the pattern in single quotes; the quotes are what protect it from the shell, while -e marks it as the pattern for grep
-e 'PROTECTED PATTERN HERE'
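If you want to try the difference between the loose and the protected pattern without touching a production log, the following sketch may help. The two log lines are invented for illustration (the IP address, timestamps, byte count on the 404 line and shortened request paths are all assumptions), so treat it as a toy reproduction rather than our real data.

```shell
# Two invented log lines in a temp file standing in for access_log.
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.1 - - [29/Aug/2016:10:00:00 +0200] "GET /zh/search/?rpp=36&cou=IND HTTP/1.1" 200 14400
192.0.2.1 - - [29/Aug/2016:10:00:05 +0200] "GET /search/?ord=name HTTP/1.1" 404 512
EOF

# Too loose: 400 also matches inside the byte count 14400,
# so this returns the line with the 200 status.
loose=$(grep 29/Aug "$log" | grep 400)

# Single quotes stop the shell from interpreting the " and the space;
# -e tells grep that the next argument is the pattern.
strict=$(grep 29/Aug "$log" | grep -e 'HTTP/1.1" 404')

echo "loose:  $loose"
echo "strict: $strict"

rm -f "$log"
```

The loose variant prints only the 200 line and the strict variant prints only the 404 line, which is the behaviour described above.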
The pattern you give to -e can also be a regular expression, but that is for another article.
Points to take away from the article: you can use grep multiple times in the same command by piping the result of the first into the second, and you can keep chaining grep or any other command-line program with the pipe character |.
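As a small illustration of chaining beyond grep and less, here is a sketch that counts the 404s for one day and then tallies them per request. The log lines are invented (IP address, timestamps and paths are assumptions), and cut, sort and uniq are just one possible choice of tools to chain.

```shell
# Four invented log lines in a temp file standing in for access_log.
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.1 - - [29/Aug/2016:10:00:00 +0200] "GET /a HTTP/1.1" 404 512
192.0.2.1 - - [29/Aug/2016:10:00:01 +0200] "GET /b HTTP/1.1" 200 14400
192.0.2.1 - - [29/Aug/2016:10:00:02 +0200] "GET /a HTTP/1.1" 404 512
192.0.2.1 - - [30/Aug/2016:09:00:00 +0200] "GET /c HTTP/1.1" 404 512
EOF

# -c prints the number of matching lines instead of the lines themselves.
count=$(grep 29/Aug "$log" | grep -c -e 'HTTP/1.1" 404')
echo "404s on 29 Aug: $count"

# Keep chaining: cut out the request (the second "-delimited field),
# then count how often each distinct request appears.
top=$(grep 29/Aug "$log" | grep -e 'HTTP/1.1" 404' | cut -d'"' -f2 | sort | uniq -c | sort -rn)
echo "$top"

rm -f "$log"
```

Each stage only sees what the previous stage let through, so you can build the command up one pipe at a time and check the output as you go.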