I have recently encountered a problem that involved dealing with massive lists of text strings. The strings were sometimes hundreds of characters long and the lists numbered hundreds of millions of them.
The following are some of the things I encountered. Bear in mind that none of those things are necessarily set in stone as MySQL has many a tunable parameter and it is entirely possible that by tuning some of them one can achieve a behavior very different from mine.
So moving on to the particulars. I had to compare two very large lists of strings, each encompassing hundreds of millions of lines (strings). The task was to find the intersection of the two sets as well as the lines exclusive to each set.
The only machine where I could play was a really weak VM with 512 MB or RAM assigned to it. That happened to be the only machine that had enough room - and even with about 900 GB to spare that was sometimes not enough considering MySQL's appetite for temporary space. Before I decided to use MySQL I tried doing it using Linux shell commands but after awhile I gave up realizing that since ultimately they do all their work in memory - or swap - the task would require designing a process with more stages to it than I'd be able to even think of. So, logically, the next candidate was an RDBMS of some sort - and for that I chose MySQL. So there I was trying to use a MySQL engine on a 512 MB RAM VM running CentOS 6. And the data sets I had to process were two files, over 50 GB each, with hundreds of millions of lines of text, each line to be treated as a separate entity (string).
So let us now delve into it. From now on, let us designate the initial files as File 1 and File 2. The numbering will stay the same throughout this discussion.
Listed below are few interesting facts I discovered while attacking this problem. Note that the syntax used below differs slightly from that used in real life as I changed table, column and index names to be more self-explainatory. I have not tested that syntax against a real database so there may be typos in it - though I will do my best to be careful.
1. Full text indexes on the text strings seem to take prohibitively long. Yet partial indexes seem to work within 8-12 hours even though the part (400) was likely quite long enough as I could not find strings even that long and most of them seemed to be much shorter.
For example, let us consider a table like the following:
mysql> CREATE TABLE file1_list (f1l_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, text_line TEXT NOT NULL);
Then you import the data. I used mysqlimport and it loaded over 400 million rows in about 6-10 hours. I have then considered using the FULLTEXT index on the text_line column (the actual text).
mysql> CREATE FULLTEXT INDEX file1_full_text ON file1_list (text_line ASC);
Well, I never found enough patience to wait for this one to finish. It had definitely run for over a day by the time I finally decided to terminate it.
2. Indexing on substring seems to be far more efficient. Consider the following:
mysql> CREATE INDEX text_100 ON file1_list ( text_line(100) ASC);Query OK,
427415951 rows affected (13 hours 22 min 55.84 sec)
Records: 427000000 Duplicates: 0 Warnings: 0
Indexing on text_line(400) takes about the same amount of time even though from my estimates most if not all lines had less than 400 characters to them. Thus at some point I simply standardized on indexing on a 400 character substring for the future analysis purposes.
3. Exact string comparisons are extremely fast. For example:
mysql> SELECT count(*) FROM file1_list WHERE text_line = "abracadabra12345thisysysyysylkjhf";
| count(*) |
| 0 |
1 row in set (0.40 sec)
mysql> SELECT count(*) FROM file1_list;
| count(*) |
| 430967651 |
1 row in set (0.00 sec)
That means we can see whether our string matches any other out of hundreds of millions of indexed strings in under 1s.
4. Inexact comparisons are quite slow. For example:
mysql> SELECT count(*) FROM file1_list WHERE text_line != "abracadabra12345thisysysyysylkjhf";
| count(*) |
| 430967651 |
1 row in set (13 min 26.86 sec)
The complimentary exact comparison clearly went a bit faster, to put it mildly.
This is it for now. Next we will address some applications where these results came in handy.