Using Grep Recursive and Regex to Search Big Data Sets

Share on:

Overview

How do you find a needle in a haystack?

Suppose you have a big data set consisting of various types of data including phone numbers, email addresses, user identifiers, addresses, etc. Within this dataset you need to pick out specific strings. Normally this would be easy:

grep -air -e mystring -e anotherstring and you're done, right? Not so fast.

If the strings you are searching for are generic, common strings that appear in the dataset hundreds or thousands of times, you may end up with a lot of false positives. This is especially true when you're searching on partial strings, like a few digits of a phone number or a few characters from an email address. Let's look at how you can use grep to find what you're looking for while cutting down on the noise.

The Phone Number

Let's say you have a partial phone number: 470-xxx-xx32. Now look at the following grep command:

grep -Eairon '.{0,100}470[0-9-]{5,7}32.{0,100}' > ~/Documents/470xxxxx32.txt

The number could be formatted as above, or with no dashes, so there could be between 5 and 7 digits or dashes between the 470 and the 32 (470[0-9-]{5,7}32). This pattern is an extended regex or REGular EXpression (-E).

We want to make sure grep outputs only up to 100 characters before and after each match (-o and .{0,100}), as well as the line number (-n). This will make it easier to go back later and look at the context of the matches.

Using the preceding command, grep will recurse (-r) and search all files starting in the current directory and output the results to a file (> ~/Documents/470xxxxx32.txt). You can then take the output file and grep it further to find what you're looking for. Note that redirecting the output to a file will mean you will not see the output on screen. Remove the redirect if you don't want it.

Keep in mind that this will still match non-phone strings, such as long random identifiers (e.g. 47006776632). If you're still getting too many false positives, you can filter more aggressively by excluding 0 and 1 from the first digit of the phone number prefix using the following regex.

grep -Eairon '.{0,100}470[2-9-]{1,2}[0-9-]{4,5}32.{0,100}'

This would match 4705551232 but not 4700551232 or 4701551232.

This is better, but it's still going to match long strings of digits. This is where we need to think about how the phone numbers are stored in our datasets. In a mixed structure dataset, fields may be delimited by commas, single or double quotes, tabs, or even semicolons.

Some trial and error is needed here. Commas are a common delimeter, so you could try the regex:

grep -Eairon '.{0,100},470[2-9-]{1,2}[0-9-]{4,5}32,.{0,100}'

The command is almost identical to the one before it, but notice the commas before 470 and after the 32.

To try a double quote delimeter, you'll need to escape it using a backslash (\), like so:

grep -Eairon '.{0,100}\"470[2-9-]{1,2}[0-9-]{4,5}32\".{0,100}'

To use a single quote delimeter, you'll have to wrap the entire search string in double quotes and escape the single quotes. There's no way to escape a single quote within a single quoted string.

grep -Eairon ".{0,100}\'470[2-9-]{1,2}[0-9-]{4,5}32\'.{0,100}"

What about a tab delimeter? Just put an actual tab character in the search sting. Ctrl-V + Tab should work.

grep -Eairon '.{0,100} 470[2-9-]{1,2}[0-9-]{4,5}32 .{0,100}'

The Email Address

Now suppose you're looking for an email address with the pattern g*d@gmail.com where * is an unknown number of characters. This one is trickier because unlike a phone number, email addresses don't follow a predictable format. We'll have to make some assumptions about the length of the address.

grep -Eairon 'g.{2,15}d@gmail\.com'

This matches any string starting with 'g', followed by anywhere from 2 to 15 characters (the regex .{2,15}), followed by a 'd@gmail.com'. The period must be escaped.

If you suspect that the mystery phone number and email might be connected, you can try grepping against the output of the phone number search. For example:

grep -Eairon '470[2-9-]{1,2}[0-9-]{4,5}32' > ~/Documents/470xxxxx32.txt

grep -Eairon 'g.{2,15}d@gmail\.com' ~/Documents/470xxxxx32.txt

BTW, if you want to view an entire matching line of a file, you can use the sed command:

sed -n '116p' ~/Documents/470xxxxx32.txt

This will display line 116 of the file.

Files with Long Lines

Some datasets have extremely long lines, as in billions of characters on a single line, making them difficult to browse without a lot of horizontal scrolling. This is especially true of SQL database dumps. What we need is a way to jump straight to the interesting string. Here's how we do it.

Let's say database.sql is a single-line text file gigabytes in length. Consider the following command:

grep -Eoairnb 'ohthatsinterestingtellmemore@gmail\.com' database.sql

The -b flag prints the byte offset from the beginning of the file, allowing us to jump straight to the start of the match. Suppose grep returns an offset of 4211445779. We want to see 100 characters before and after the start of the matching string, and to do that we'll use the dd command:

dd if=database.sql ibs=1 skip=4211445679 count=100

if is the input file.

skip is the byte offset to start reading at. Notice that we subtract 100 from the offset reported by grep.

count is the number of bytes to read after the offset.

The command should display your match in context.

As with all things search related, you'll need to go through some trial and error, but these examples should be enough to get you on the right track.

Listen to this article