I have recently encountered a problem that involved dealing with massive lists of text strings. The strings were sometimes hundreds of characters long and the lists numbered hundreds of millions of them.
The following are some of the things I encountered. Bear in mind that none of those things are necessarily set in stone as MySQL has many a tunable parameter and it is entirely possible that by tuning some of them one can achieve a behavior very different from mine.
So moving on to the particulars. I had to compare two very large lists of strings, each encompassing hundreds of millions of lines (strings). The task was to find the intersection of the two sets as well as the lines exclusive to each set.
The only machine where I could play was a really weak VM with 512 MB or RAM assigned to it. That happened to be the only machine that had enough room - and even with about 900 GB to spare that was sometimes not enough considering MySQL's appetite for temporary space. Before I decided to use MySQL I tried doing it using Linux shell commands but after awhile I gave up realizing that since ultimately they do all their work in memory - or swap - the task would require designing a process with more stages to it than I'd be able to even think of. So, logically, the next candidate was an RDBMS of some sort - and for that I chose MySQL. So there I was trying to use a MySQL engine on a 512 MB RAM VM running CentOS 6. And the data sets I had to process were two files, over 50 GB each, with hundreds of millions of lines of text, each line to be treated as a separate entity (string).
So let us now delve into it. From now on, let us designate the initial files as File 1 and File 2. The numbering will stay the same throughout this discussion.
Listed below are few interesting facts I discovered while attacking this problem. Note that the syntax used below differs slightly from that used in real life as I changed table, column and index names to be more self-explainatory. I have not tested that syntax against a real database so there may be typos in it - though I will do my best to be careful.
1. Full text indexes on the text strings seem to take prohibitively long. Yet partial indexes seem to work within 8-12 hours even though the part (400) was likely quite long enough as I could not find strings even that long and most of them seemed to be much shorter.
For example, let us consider a table like the following:
mysql> CREATE TABLE file1_list (f1l_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, text_line TEXT NOT NULL);
Then you import the data. I used mysqlimport and it loaded over 400 million rows in about 6-10 hours. I have then considered using the FULLTEXT index on the text_line column (the actual text).
mysql> CREATE FULLTEXT INDEX file1_full_text ON file1_list (text_line ASC);
Well, I never found enough patience to wait for this one to finish. It had definitely run for over a day by the time I finally decided to terminate it.
2. Indexing on substring seems to be far more efficient. Consider the following:
mysql> CREATE INDEX text_100 ON file1_list ( text_line(100) ASC);Query OK,
427415951 rows affected (13 hours 22 min 55.84 sec)
Records: 427000000 Duplicates: 0 Warnings: 0
mysql>
Indexing on text_line(400) takes about the same amount of time even though from my estimates most if not all lines had less than 400 characters to them. Thus at some point I simply standardized on indexing on a 400 character substring for the future analysis purposes.
3. Exact string comparisons are extremely fast. For example:
mysql> SELECT count(*) FROM file1_list WHERE text_line = "abracadabra12345thisysysyysylkjhf";
+----------+
| count(*) |
+----------+
| 0 |
+----------+
1 row in set (0.40 sec)
mysql> SELECT count(*) FROM file1_list;
+-----------+
| count(*) |
+-----------+
| 430967651 |
+-----------+
1 row in set (0.00 sec)
mysql>
That means we can see whether our string matches any other out of hundreds of millions of indexed strings in under 1s.
4. Inexact comparisons are quite slow. For example:
mysql> SELECT count(*) FROM file1_list WHERE text_line != "abracadabra12345thisysysyysylkjhf";
+-----------+
| count(*) |
+-----------+
| 430967651 |
+-----------+
1 row in set (13 min 26.86 sec)
mysql>
The complimentary exact comparison clearly went a bit faster, to put it mildly.
This is it for now. Next we will address some applications where these results came in handy.
Saturday, March 3, 2012
Monday, February 27, 2012
MySQL's Mystery Files
MySQL is a very powerful and easy to use - and administer RDBMS. For light use it mostly just works out of the box - but if you intend to use it for something serious you need to tweak it some.
For one thing, MySQL makes use of temporary space - that is, temporary space outside of the designated areas where it stores the database files themselves. It appears to be configured as follows:
Source: Where MySQL Stores Temporary Files
One potential caveat is that it uses "hidden" files that can not be detected by normal file system level tools. Here's a discussion on how that is possible. In short - MySQL creates files which are not linked in the file system and hence can not be found and simply vanish if the process making use of them - i.e., the MySQL daemon process - disappears. In other words, you are guaranteed that no left-over space will be used. From the IT standpoint it is a nuisance though as you get file system space consumed without being able to find the files responsible for such consumption.
As to the size of such consumption - I was actively using a database amounting to no more than 300 GB and due to the fact that I only had about 900 GB to spare in the filesystem I was forced to do "repairing with keycache" which is way slower than regular reindexing - but regular reindexing requires lots of temporary space which I apparently lacked.
For one thing, MySQL makes use of temporary space - that is, temporary space outside of the designated areas where it stores the database files themselves. It appears to be configured as follows:
On Unix, MySQL uses the value of the TMPDIR environment variable as the path name of the directory in which to store temporary files. If TMPDIR is not set, MySQL uses the system default, which is usually /tmp, /var/tmp, or /usr/tmp.
On Windows, Netware and OS2, MySQL checks in order the values of the TMPDIR, TEMP, and TMP environment variables. For the first one found to be set, MySQL uses it and does not check those remaining. If none of TMPDIR, TEMP, or TMP are set, MySQL uses the Windows system default, which is usually C:\windows\temp\.
Source: Where MySQL Stores Temporary Files
One potential caveat is that it uses "hidden" files that can not be detected by normal file system level tools. Here's a discussion on how that is possible. In short - MySQL creates files which are not linked in the file system and hence can not be found and simply vanish if the process making use of them - i.e., the MySQL daemon process - disappears. In other words, you are guaranteed that no left-over space will be used. From the IT standpoint it is a nuisance though as you get file system space consumed without being able to find the files responsible for such consumption.
As to the size of such consumption - I was actively using a database amounting to no more than 300 GB and due to the fact that I only had about 900 GB to spare in the filesystem I was forced to do "repairing with keycache" which is way slower than regular reindexing - but regular reindexing requires lots of temporary space which I apparently lacked.
Wednesday, February 1, 2012
Some pracitcal uses of VPN
VPN is a term one hears often these days. However, while many people have some ideas about what it is used for - secure access, for instance - many probably lack vision of how they could benefit from its use. So for starters - what is VPN? You can use the Wiki link for a formal definition but in a less formal way one can define it as a network one can build to their own design provided one controls the server and the other machines one intends to network together have the capability to access that server via the internet.
So let us say you control a machine on the Internet with a public IP address. On it you can install a VPN server process. Then you can issue authorization to those you want to allow to join your VPN network.
Let us consider a practical example. I configure a server to serve a VPN with a private network defined as 192.168.20.0/255.255.255.0. Let us say the server gets the virtual IP address of 192.168.20.1, with the other addresses (192.168.20.2-254) available for grabs. So let us say my laptop gets an address of 192.168.20.2, my home machine gets an address of 192.168.20.5 and my office machine gets an address of 192.168.20.10.
Thus - using the same network protocols - I can collect the video feed off of my home computer to see what is going on at home, print to my office computer's printer - and do all of it from a WiFi point half the world away using my laptop.
Or - let us say - in addition to my office in Boston I decide to get one in Buenos Aires. No problem. I get another machine there - let's say, with a virtual IP address of 192.168.20.15 - and use it and the one at my office in Boston - 192.168.20.10 - to link the two networks. Now they are linked - via the internet but at the same time utilizing the VPN's security which is normally considered an unrbeakably secure way to communicate.
Those are just a couple of possible usage scenarios. I will try to cover this topic in more detail later on. For now just think of the VPN as a network you can define the way you like no matter where the computers who will join it happen to be geographically and topologically. So long as they have access to the internet and you allow them to join your VPN they can do so.
So let us say you control a machine on the Internet with a public IP address. On it you can install a VPN server process. Then you can issue authorization to those you want to allow to join your VPN network.
Let us consider a practical example. I configure a server to serve a VPN with a private network defined as 192.168.20.0/255.255.255.0. Let us say the server gets the virtual IP address of 192.168.20.1, with the other addresses (192.168.20.2-254) available for grabs. So let us say my laptop gets an address of 192.168.20.2, my home machine gets an address of 192.168.20.5 and my office machine gets an address of 192.168.20.10.
Thus - using the same network protocols - I can collect the video feed off of my home computer to see what is going on at home, print to my office computer's printer - and do all of it from a WiFi point half the world away using my laptop.
Or - let us say - in addition to my office in Boston I decide to get one in Buenos Aires. No problem. I get another machine there - let's say, with a virtual IP address of 192.168.20.15 - and use it and the one at my office in Boston - 192.168.20.10 - to link the two networks. Now they are linked - via the internet but at the same time utilizing the VPN's security which is normally considered an unrbeakably secure way to communicate.
Those are just a couple of possible usage scenarios. I will try to cover this topic in more detail later on. For now just think of the VPN as a network you can define the way you like no matter where the computers who will join it happen to be geographically and topologically. So long as they have access to the internet and you allow them to join your VPN they can do so.
Wednesday, January 18, 2012
Sparse Files
OK, I heard the term before but never had to really delve into them. That's an interesting concept, though. First delved into it needing a large (multi-terrabyte) file to house a file system. See here.
Then came across this article:
Sparse files – what, why, and how
I like the concept - though it does come with a few pitfalls, it seems. More on that later.
Then came across this article:
Sparse files – what, why, and how
I like the concept - though it does come with a few pitfalls, it seems. More on that later.
Friday, December 30, 2011
SSHFS and AutoFS
Mounting an SSH-accessible remote directory automatically is a nifty capability. Here's a nice description of how one can do that - specifically under Ubuntu but it will work the same just fine under most other Linux distributions:
Automatically mounting a remote directory in Ubuntu using autofs + sshfs
One thing not mentioned there - and something I keep forgetting about between the instances I need to recall it - is that the passwordless SSH login will fail unless the user directory on the SSH server is writable by the owner only! So, using the same terms as in the example above one should do the following:
1) Log into example.com as remoteusername
2) Execute the following command:
chmod g-w,o-w ~
Automatically mounting a remote directory in Ubuntu using autofs + sshfs
One thing not mentioned there - and something I keep forgetting about between the instances I need to recall it - is that the passwordless SSH login will fail unless the user directory on the SSH server is writable by the owner only! So, using the same terms as in the example above one should do the following:
1) Log into example.com as remoteusername
2) Execute the following command:
chmod g-w,o-w ~
Wednesday, December 14, 2011
The different flavors of 568
Yes, I admit this one was stupid. But cut me a little slack - last time I had to make network cables was many moons ago. And yes, I even got to make coax cables for 10 Mbit Ethernet. But no matter.
OK, so we got this spool of cable from a generic vendor with the cable labeled "ANSI/TIA-568-C.2". For some reason - looking at some pictures online, most likely - I decided that was to be wired according to the T568A layout. And wire them I did - a few cables, actually. Which worked fine in multiple jack with an exception of a Cisco Catalyst 2000 100 Mbit switch where they did not work at all.
Then I wired one cable to the T568B layout and it seems to work everywhere including that Catalyst 2900 switch. Given the T568C is said to have superseded T658B which in turn superseded T568A which is by now obsolete all of this seems to make sense. Hence it appears that the solution has been found.
Let me also take the time here to thank all those mailing list respondents whose guidance helped me figure this mess out. It always helps to have other minds to collect thoughts off of!
Reference:
TIA/EIA-568
Ethernet over twisted pair
How to wire Ethernet Cables
OK, so we got this spool of cable from a generic vendor with the cable labeled "ANSI/TIA-568-C.2". For some reason - looking at some pictures online, most likely - I decided that was to be wired according to the T568A layout. And wire them I did - a few cables, actually. Which worked fine in multiple jack with an exception of a Cisco Catalyst 2000 100 Mbit switch where they did not work at all.
Then I wired one cable to the T568B layout and it seems to work everywhere including that Catalyst 2900 switch. Given the T568C is said to have superseded T658B which in turn superseded T568A which is by now obsolete all of this seems to make sense. Hence it appears that the solution has been found.
Let me also take the time here to thank all those mailing list respondents whose guidance helped me figure this mess out. It always helps to have other minds to collect thoughts off of!
Reference:
TIA/EIA-568
Ethernet over twisted pair
How to wire Ethernet Cables
Monday, December 12, 2011
ClipGrab
Oh, how fast things change!
Just a little while ago it looked like DamnVid was all the rage - but now YouTube has changed something and it is dead in the water. However, it looks like ClipGrab is a viable alternative... for now.
Just a little while ago it looked like DamnVid was all the rage - but now YouTube has changed something and it is dead in the water. However, it looks like ClipGrab is a viable alternative... for now.
Labels:
ClipGrab,
DamnVid,
internet,
Linux,
multimedia,
Ubuntu Linux,
video,
YouTube
Subscribe to:
Comments (Atom)