Tuesday, March 13, 2012

Large-scale matching exercise using MySQL

In my previous post (MySQL: A Few Metrics, 3 March 2012) I have mentioned some parameters of a task I recently faced. We are now going to examine that task in more detail.

We have two large text files, File1 and File2. They both contain text entries, one per line, over 400 million lines each. We know almost nothing about the content beyond that; it is definitely unsorted within each of the files, some lines may be repetitive. So for the purpose of this discussion let us say File1 is 430 million lines and File2 is 440 million lines.

To recap: the only machine I had available for this task was an VM that had plenty of disk - about 1 TB unused - but little processing power and only 512 MB RAM. It was running CentOS 6 and MySQL 5.1.52. First word of caution: if you intend to manipulate large tables it is advisable to ascertain that either /tmp has plenty of room for its invisible temporary files, or else change the temporary directory to something else. You can do that by setting the TMPDIR environment variable to the desirable location. On CentOS I just inserted the appropriate line towards the top of /etc/init.d/mysqld and that did the trick:

# Alternate temporary storage directory
export TMPDIR=/home/mysql/tmp

My first instinct was to first sort the two lists individually and then, after they are sorted, find matches as well as content exclusive to either list by doing one forward pass through both. I still believe that approach was sensible - however, the sorting phase proved to be more time-consuming than I expected. The most likely reason for that was that, as I already mentioned, inexact comparisons take qualitatively longer than exact ones - and sorting, no matter how you do it, is based upon inexact comparisons.

However, that same fact could be used to our advantage. We could do exact comparisons to determine the intersection of the two lists - and then separate the entries exclusive to either list.

Let us now run through a practical example that reflects what I ended up doing after some trials and errors. The names have been adjusted from those I used to make this text more readable. My apologies for any possible typos in that syntax.

Alright, let us get going now. We have our files: File1 (430 million lines) and File2 (440 million lines).

So first let us create the necessary tables:

mysql> CREATE TABLE f1_list (f1_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, text_line TEXT NOT NULL, INDEX (text_line(400) ASC));
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE TABLE f2_list (f2_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, text_line TEXT NOT NULL, INDEX (text_line(400) ASC));
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE common_list (cl_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, f1_id INT NOT NULL, f2_id INT NOT NULL, text_line TEXT NOT NULL, INDEX (f1_id ASC), INDEX (f2_id ASC), INDEX (text_line(400) ASC));
Query OK, 0 rows affected (0.00 sec)


Now let us populate f1_list and f2_list with the contents of File1 and File2 respectively. You will need to run roughly the following commands using full paths to File1 and File2:

mysql> LOAD DATA LOCAL INFILE "/home/my_user/big_files/File1" INTO TABLE f1_list (text_line);
Query OK, 430000000 rows affected (8 hours 13 min 10.01 sec)
Records: 430000000 Deleted: 0 Skipped: 0 Warnings: 0

mysql> LOAD DATA LOCAL INFILE "/home/my_user/big_files/File2" INTO TABLE f2_list (text_line);
Query OK, 440000000 rows affected (10 hours 01 min 15.0 sec)
Records: 440000000 Deleted: 0 Skipped: 0 Warnings: 0


The times are roughly consistent with what I saw; I'd estimate the import times as 8-12 hours per file.

Now let us extract the commonalities into common_list:

mysql> INSERT INTO common_list (f1_id, f2_id, text_line) SELECT f1_list.f1_id,
f2_list.f2_id, f1_list.text_line FROM f1_list, f2_list WHERE f1_list.text_line = f2_list.text_line;
Query OK, 427677003 rows affected, 2 warnings (16 hours 12 min 16.58 sec)
Records: 427677003 Duplicates: 0 Warnings: 0


Now let us extract the exclusive content:

mysql> DELETE FROM f1_list WHERE f1_id IN (SELECT f1_id FROM common_list);
Query OK, 427001324 rows affected (16 hours 24 min 41.20 sec)

mysql> DELETE FROM f2_list WHERE f2_id IN (SELECT f2_id FROM common_list);
Query OK, 427001129 rows affected (16 hours 39 min 44.47 sec)


And so by now it looks like we are done. We have content exclusive to List 1 listed in f1_list, content exclusive to List 2 in f2_list and the common content in common_list. The total processing time - even if the operations are performed serially as delineated above - can be capped at about 76 hours. I used a little padding there, too. For instance, I counted 20 hours for the processes that in reality took about 16. Thus it appears realistic to sort this mass of data within 3-4 days using a machine which in this day and age would be considered substandard in terms of its performance.

Saturday, March 3, 2012

MySQL: A Few Metrics

I have recently encountered a problem that involved dealing with massive lists of text strings. The strings were sometimes hundreds of characters long and the lists numbered hundreds of millions of them.

The following are some of the things I encountered. Bear in mind that none of those things are necessarily set in stone as MySQL has many a tunable parameter and it is entirely possible that by tuning some of them one can achieve a behavior very different from mine.

So moving on to the particulars. I had to compare two very large lists of strings, each encompassing hundreds of millions of lines (strings). The task was to find the intersection of the two sets as well as the lines exclusive to each set.

The only machine where I could play was a really weak VM with 512 MB or RAM assigned to it. That happened to be the only machine that had enough room - and even with about 900 GB to spare that was sometimes not enough considering MySQL's appetite for temporary space. Before I decided to use MySQL I tried doing it using Linux shell commands but after awhile I gave up realizing that since ultimately they do all their work in memory - or swap - the task would require designing a process with more stages to it than I'd be able to even think of. So, logically, the next candidate was an RDBMS of some sort - and for that I chose MySQL. So there I was trying to use a MySQL engine on a 512 MB RAM VM running CentOS 6. And the data sets I had to process were two files, over 50 GB each, with hundreds of millions of lines of text, each line to be treated as a separate entity (string).

So let us now delve into it. From now on, let us designate the initial files as File 1 and File 2. The numbering will stay the same throughout this discussion.

Listed below are few interesting facts I discovered while attacking this problem. Note that the syntax used below differs slightly from that used in real life as I changed table, column and index names to be more self-explainatory. I have not tested that syntax against a real database so there may be typos in it - though I will do my best to be careful.

1. Full text indexes on the text strings seem to take prohibitively long. Yet partial indexes seem to work within 8-12 hours even though the part (400) was likely quite long enough as I could not find strings even that long and most of them seemed to be much shorter.

For example, let us consider a table like the following:


Then you import the data. I used mysqlimport and it loaded over 400 million rows in about 6-10 hours. I have then considered using the FULLTEXT index on the text_line column (the actual text).

mysql> CREATE FULLTEXT INDEX file1_full_text ON file1_list (text_line ASC);

Well, I never found enough patience to wait for this one to finish. It had definitely run for over a day by the time I finally decided to terminate it.

2. Indexing on substring seems to be far more efficient. Consider the following:

mysql> CREATE INDEX text_100 ON file1_list ( text_line(100) ASC);Query OK,
427415951 rows affected (13 hours 22 min 55.84 sec)
Records: 427000000 Duplicates: 0 Warnings: 0


Indexing on text_line(400) takes about the same amount of time even though from my estimates most if not all lines had less than 400 characters to them. Thus at some point I simply standardized on indexing on a 400 character substring for the future analysis purposes.

3. Exact string comparisons are extremely fast.
For example:

mysql> SELECT count(*) FROM file1_list WHERE text_line = "abracadabra12345thisysysyysylkjhf";
| count(*) |
| 0 |
1 row in set (0.40 sec)
mysql> SELECT count(*) FROM file1_list;
| count(*) |
| 430967651 |
1 row in set (0.00 sec)


That means we can see whether our string matches any other out of hundreds of millions of indexed strings in under 1s.

4. Inexact comparisons are quite slow. For example:

mysql> SELECT count(*) FROM file1_list WHERE text_line != "abracadabra12345thisysysyysylkjhf";
| count(*) |
| 430967651 |
1 row in set (13 min 26.86 sec)


The complimentary exact comparison clearly went a bit faster, to put it mildly.

This is it for now. Next we will address some applications where these results came in handy.