Tuesday, March 13, 2012

Large-scale matching exercise using MySQL

In my previous post (MySQL: A Few Metrics, 3 March 2012) I have mentioned some parameters of a task I recently faced. We are now going to examine that task in more detail.

We have two large text files, File1 and File2. They both contain text entries, one per line, over 400 million lines each. We know almost nothing about the content beyond that; it is definitely unsorted within each of the files, some lines may be repetitive. So for the purpose of this discussion let us say File1 is 430 million lines and File2 is 440 million lines.

To recap: the only machine I had available for this task was an VM that had plenty of disk - about 1 TB unused - but little processing power and only 512 MB RAM. It was running CentOS 6 and MySQL 5.1.52. First word of caution: if you intend to manipulate large tables it is advisable to ascertain that either /tmp has plenty of room for its invisible temporary files, or else change the temporary directory to something else. You can do that by setting the TMPDIR environment variable to the desirable location. On CentOS I just inserted the appropriate line towards the top of /etc/init.d/mysqld and that did the trick:

# Alternate temporary storage directory
export TMPDIR=/home/mysql/tmp

My first instinct was to first sort the two lists individually and then, after they are sorted, find matches as well as content exclusive to either list by doing one forward pass through both. I still believe that approach was sensible - however, the sorting phase proved to be more time-consuming than I expected. The most likely reason for that was that, as I already mentioned, inexact comparisons take qualitatively longer than exact ones - and sorting, no matter how you do it, is based upon inexact comparisons.

However, that same fact could be used to our advantage. We could do exact comparisons to determine the intersection of the two lists - and then separate the entries exclusive to either list.

Let us now run through a practical example that reflects what I ended up doing after some trials and errors. The names have been adjusted from those I used to make this text more readable. My apologies for any possible typos in that syntax.

Alright, let us get going now. We have our files: File1 (430 million lines) and File2 (440 million lines).

So first let us create the necessary tables:

mysql> CREATE TABLE f1_list (f1_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, text_line TEXT NOT NULL, INDEX (text_line(400) ASC));
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE TABLE f2_list (f2_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, text_line TEXT NOT NULL, INDEX (text_line(400) ASC));
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE common_list (cl_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, f1_id INT NOT NULL, f2_id INT NOT NULL, text_line TEXT NOT NULL, INDEX (f1_id ASC), INDEX (f2_id ASC), INDEX (text_line(400) ASC));
Query OK, 0 rows affected (0.00 sec)


Now let us populate f1_list and f2_list with the contents of File1 and File2 respectively. You will need to run roughly the following commands using full paths to File1 and File2:

mysql> LOAD DATA LOCAL INFILE "/home/my_user/big_files/File1" INTO TABLE f1_list (text_line);
Query OK, 430000000 rows affected (8 hours 13 min 10.01 sec)
Records: 430000000 Deleted: 0 Skipped: 0 Warnings: 0

mysql> LOAD DATA LOCAL INFILE "/home/my_user/big_files/File2" INTO TABLE f2_list (text_line);
Query OK, 440000000 rows affected (10 hours 01 min 15.0 sec)
Records: 440000000 Deleted: 0 Skipped: 0 Warnings: 0


The times are roughly consistent with what I saw; I'd estimate the import times as 8-12 hours per file.

Now let us extract the commonalities into common_list:

mysql> INSERT INTO common_list (f1_id, f2_id, text_line) SELECT f1_list.f1_id,
f2_list.f2_id, f1_list.text_line FROM f1_list, f2_list WHERE f1_list.text_line = f2_list.text_line;
Query OK, 427677003 rows affected, 2 warnings (16 hours 12 min 16.58 sec)
Records: 427677003 Duplicates: 0 Warnings: 0


Now let us extract the exclusive content:

mysql> DELETE FROM f1_list WHERE f1_id IN (SELECT f1_id FROM common_list);
Query OK, 427001324 rows affected (16 hours 24 min 41.20 sec)

mysql> DELETE FROM f2_list WHERE f2_id IN (SELECT f2_id FROM common_list);
Query OK, 427001129 rows affected (16 hours 39 min 44.47 sec)


And so by now it looks like we are done. We have content exclusive to List 1 listed in f1_list, content exclusive to List 2 in f2_list and the common content in common_list. The total processing time - even if the operations are performed serially as delineated above - can be capped at about 76 hours. I used a little padding there, too. For instance, I counted 20 hours for the processes that in reality took about 16. Thus it appears realistic to sort this mass of data within 3-4 days using a machine which in this day and age would be considered substandard in terms of its performance.


Syarzhuk said...

How unique are the entries in the files? I mean, out of 440M lines how many unique ones?
If the answer is "thousands" or even "hundreds of thousands" you might be better off creating nonunique indices first and using them to compare the values later.

Syarzhuk said...

Nevermind the previous comment - you are using the indices. I wonder would performance be a bit better if you first create the tables without the INDEX (text_line(400) ASC, then load data, then create the indices through CREATE INDEX?