Saturday, June 30, 2012

MooseFS: continuing progress

As you know, I have been getting multiple MooseFS installations up and running. See here and here. This is just a continuation of the series.

On one of the machines I was trying to configure as a MooseFS server, I decided to put the chunk server's storage into an XFS filesystem housed in a file. I found that necessary because a chunk server expects its storage area to be of a constant size - or else it gets confused about how much space it has at its disposal. Hence the best option is a dedicated filesystem. I had the "parent" XFS already filled with some content and occupying all the available disk - and at the time filesystem-in-a-file seemed like a reasonable solution. But it was not.

The way you accomplish this is you create a large file - in another XFS, as it were - then run mkfs.xfs on that file and mount it with the loop option (mount -o loop).
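A minimal sketch of those steps, assuming GNU coreutils and xfsprogs are installed. The image path, mount point and size are hypothetical (the post does not give them), and truncate creates a sparse backing file, which may differ from how the original file was made. By default the helper only prints the commands, since mkfs.xfs and mount need root.

```shell
#!/bin/sh
# Sketch of a filesystem-in-a-file setup. All paths and sizes passed in
# are hypothetical examples, not values from the original system.

# Print each command instead of running it unless DRY_RUN=0 is set,
# because the real commands need root and modify the system.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# setup_loop_fs IMAGE MOUNTPOINT SIZE
setup_loop_fs() {
    image=$1
    mountpoint=$2
    size=$3

    run truncate -s "$size" "$image"          # fixed-size (sparse) backing file
    run mkfs.xfs "$image"                     # build an XFS filesystem inside it
    run mkdir -p "$mountpoint"
    run mount -o loop "$image" "$mountpoint"  # attach it through a loop device
}
```

Calling `setup_loop_fs /tmp/chunks.img /mnt/chunks 1G` prints the four commands; running it as root with `DRY_RUN=0` would execute them.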

I did that and did a little testing on that filesystem. The performance for single-file reads and writes was quite good: though I don't remember the precise numbers at this point, it was on par with the sort of performance you would get out of a filesystem sitting directly "on the metal" (i.e., on the RAID, as was the case with this system). I never suspected that the latency associated with multi-file I/O in this two-tiered filesystem would become my undoing.

About a week into running in the configuration described above, the first "missing chunks" appeared. In general MooseFS displayed rather uneven performance and seemed only marginally usable, if that. Later, the number of those missing chunks increased. I am now sure it was due to latency-induced timeouts of some kind; however, as this email exchange indicates, even people who work with the code find this odd. That may well be because no one ever envisioned such a twisted configuration as the one I created.

So, finally, even though by then I had already put about 15 TB of data into that MooseFS, it was clearly not acceptable to keep running while losing massive amounts of data in the process. So I scrapped that, did away with the "parent" XFS - and, naturally, with the filesystem-in-a-file - repartitioned the underlying disk in such a way as to give the chunk server a dedicated partition and relaunched the MooseFS. It has been a few days now and at this point all is well: the data is written at over 15 MB/s, it can be read at about 50 MB/s and there has not been a single error message.

The lessons learned thus far appear to be the following: no VM's and no UNFSD server on the same hardware - at least so long as this hardware is in the low-end server class. And it also looks like MooseFS data needs to reside as close to the actual hardware as possible - i.e., no middle layers such as this filesystem-in-a-file.

And while this was a bit time consuming I now have multiple MooseFS installations running and ready for growth.

Monday, June 25, 2012


Boris Epstein
Malden (Metro Boston), Massachusetts, US (map)

Phone: 617.816.9654


Extensive experience in IT and programming and good practical ability to apply that experience to real-world problems. Have on several occasions successfully suggested and introduced solutions that have made a qualitative difference to the efficiency and reliability of the system in question. Capable of thinking outside the box and showing initiative. Good ability to see the big picture as well as the details of the immediate task.
Quick learner, capable of rapidly coming up to speed on new technologies. Have a good understanding of logistics, process and technology. Have experience working in small teams where being a jack of all trades is pretty much required and learning new technologies is almost an everyday constant.

Operating Systems: Linux/UNIX, Mac OS X, MS Windows, MS DOS, Apollo Aegis, VMS
Programming Languages: C/C++, Pascal, PERL, Ruby, JAVA, UNIX (C-Shell, Bourne, Korn), Tcl/Tk, PHP, Lisp/Scheme
Tools: Netbeans, Eclipse, TogetherJ, Emacs, MS Office, OpenOffice, LibreOffice, FrameMaker, VI
Technologies: MySQL, PostgreSQL, SYBASE, Xen, VirtualBox, OpenVPN, Oracle RDBMS, CVS, SVN, ClearCase, DSEE, NFS, NIS, LDAP, DHCP, Webmin, TCP/IP, routers/firewalls, etc.
Hardware: IBM-compatible PC, HP, SUN, Mac


May 2007 - present
Cambridge, MA
Work as something akin to a one-person IT department for a small cutting-edge research lab. Support communication and integration needs between the lab and various collaborators. Plan and implement the lab's hardware and software infrastructure while at the same time provide everyday support to lab's users. Participate in coding lab's internal utilities as well as the publicly available OpenMIMS Analysis Module. Integrate various software and hardware systems for the lab's needs. When necessary fulfill general office management duties such as scheduling meetings, placing orders, etc.
During my tenure of over 5 years have played a pivotal role in the lab's transition to an enterprise-level configuration featuring MooseFS-based distributed file storage, functionally segmented network architecture, Subversion-based code control, Bugzilla-based bug tracking, etc. Have a track record of setting up systems and services that sometimes ran for over two years with no downtime.
Beyond officially delineated responsibilities maintain my own privately-run VPN solution to provide connectivity to the lab's employees and collaborators outside the lab. Due to the distributed nature of the lab's activities this is a function critical to the lab's success.

April 2004 - May 2007
Travel. Educational activities. Occasional freelance projects. Provided technical assistance to several online projects including Cooperative Research and New England JAVA Users Group. In 2005, in the wake of Hurricane Katrina, helped coordinate volunteer relief activities in the Gulf and personally participated in those activities.

July 2001 - April 2004
Waltham, MA
Worked as a programmer involved in development and support of the ETMS air traffic management software for the FAA at the Volpe National Transportation Systems Center in Cambridge, MA. Specific area of concentration was the CDM component of the ETMS. The tasks included design and development of new functionality, as well as supporting legacy code. A large codebase together with the need for system reliability made for a challenging task. The code was originally written in Pascal and later migrated to C.
The system was in essence a large transportation management system receiving and reflecting frequent (mostly once-a-minute) updates regarding the status of flights operating over the US airspace. Near realtime requirements were in place for processing and analyzing data which made for an exciting and challenging task.

December 1999 - April 2001
Boston, MA
Performed multiple roles on a daily basis, including those of a senior system designer and developer, Windows NT and UNIX system administrator and web hosting support engineer. Key player in the technology planning and implementation area. As the sole expert in a number of areas, including object-oriented design and development, networking, network security and systems management, advised other team members on various technology issues.
Main tasks included web site backend implementation for clients, in-house product design and implementation as well as day-to-day activities mentioned above. Most of the coding was done using object-oriented technology, with JAVA as a programming language of choice.
Provided critical insight that greatly improved the stability and efficiency of the internal systems.
Clients included Davox Corporation (now part of Aspect Software), Pilates Store, Lobsters-Online, Inc. and others.

September 1999 - January 2000
Medford, MA
Worked as a UNIX administrator in a large-scale web and data server hosting facility. Was responsible for maintenance and troubleshooting of multiple industrial-scale UNIX servers on SUN, HP and IBM platforms. Performed database maintenance of SYBASE and ORACLE databases as well as data recovery and general server troubleshooting.

May 1999 - July 1999
GTE International (currently part of Level3 Communications)
Cambridge, MA
As part of the Y2K team worked on the remediation of custom SUN Solaris machines hosting clients' mission-critical WWW sites and applications. Complex upgrades and modifications had to be accomplished requiring an absolute minimum in customers' downtime. Tasks included upgrades of the OS, DB servers and various other third-party software. Custom scripting and coding was often required to facilitate the necessary transition.

November 1995 - May 1999
Cambridge, MA
Originally hired as an outside contractor. Accepted a permanent staff position within three months. Throughout my whole tenure was a critical part of a small and continually overtasked team. Responsibilities included day-to-day maintenance and support of a network of UNIX hosts. Maintained code control systems. Designed and implemented a data backup/archival system for in-house use. Created web pages using HTML, PERL and JAVA. With the emphasis on publicly available software restructured the environment to optimize and economize the development and production process. Modified publicly available software for local needs using C, Tcl/Tk, PERL, JAVA, etc. Was also involved in equipment and software installation and support at client sites. Worked with medical applications and protocols including DICOM, ISG's VRS graphical application and AWARE wavelet compression. Wrote system installation and maintenance scripts. Modified and integrated various third party software packages. Worked with multimedia devices in medical data capture/processing systems. Set up and supported LANs and WANs.

July 1992 - November 1995
Kenan Systems Corporation (currently part of Alcatel-Lucent)
Denver, CO - Cambridge, MA
Participated in database design of SYBASE databases. As part of a product team supported development effort in a heterogeneous UNIX environment. Oversaw the operation of a distributed development environment which included multiple geographically disjoint locations. Wrote a suite of PERL scripts that encompassed the local customizations to ClearCase as required by the project. Other responsibilities included release/code management using ClearCase code control system, SYBASE database administration, system design activities.

Summers of 1990 and 1991
Cambridge, MA
Participated in the development of ATMS (Automated Traffic Management System), an air traffic management and control system for the FAA. Coding was done in Pascal against a proprietary database. The network consisted of a multitude of Aegis hosts on the Apollo platform. The development was done under DSEE as an integrated code control and management environment.

Born in St. Petersburg, Russia in 1969. Have lived in the US since 18 years of age. Attended Tver University (website in Russian) in Tver, Russia; Boston University and University of Massachusetts at Amherst graduating in 1992 with a Bachelor's Degree in Mathematics/Computer Science.

Available upon request.

Tuesday, June 19, 2012

MooseFS taking shape

I am continuing to experiment with MooseFS. However, the final configuration looks somewhat different from what I had originally envisioned. For one thing, the idea of placing the master and metadata servers in VirtualBox VM's didn't quite work out. I guess that created just too many levels of execution, and as a result the overall load grew too much and performance suffered as soon as any serious load was applied.

So I switched to simply running all the processes (master, meta, chunkserver) on the same hardware and got rid of all the VM's. That worked fine. I defined a separate network - currently fully confined to the same host - in order to host the MooseFS installation. And MooseFS clients have started to run just fine. I got a performance of up to 80 MB/s for reading data from the MooseFS over a 1 Gbit/s network.

However, one problem remained. Running UNFSD on the same machine I got very poor performance. As few as 5 clients could drive it down to just 30 KB/s! And that on an 8-core, 48 GB RAM machine - while a MooseFS client would read about three orders of magnitude faster!

Surprisingly, the fix was simple: if I ran UNFSD on a separate physical machine the performance went back into the tens of MB/s range. So that was what I settled for. That NFS server machine is currently just a CentOS Linux MooseFS client sitting on the "general" network - different from the one hosting MooseFS - and sporting a mere 2 cores and 2 GB of RAM. So I guess for now I have a working solution.

Saturday, June 2, 2012


Just reporting that I started playing with it - and more than playing. So far so good. The architecture is really simple, the executables very lightweight. For more detail see here:

I am running it on several server-class boxes using VirtualBox VM's to emulate a network so as to be able to distribute it to multiple hardware boxes later on. Both the VirtualBox hosts and the VM's are running CentOS 6.

The only problems so far seem to have to do with integrating MooseFS with other technologies. I tried using the UNFS3 user-space NFS server to create an NFS gateway to the MooseFS installation. And so far it looks like the UNFS3 server does not scale well to multiple connections. In other words, you get excellent performance with one NFS client and decent performance with 2-3 connections, but above 5 it seems to go down the drain and accessing one's home directory over such an NFS connection becomes pretty much unfeasible. So at this point what's lacking is a good NFS gateway for situations where a MooseFS client is for some reason not available or not a workable solution. Or perhaps I will choose a different sharing method. Time will tell.

The Ultimate Boot CD

Just a boot CD distro with lots and lots of utilities. See here: Came in real handy for me when I was trying to test RAM in this server-class box which for some reason decided not to play with a regular memtest CD.

Wednesday, April 4, 2012

The mysterious evince

evince, otherwise known as "Document Viewer", is a pretty much standard feature of many a Linux distribution. So there I was, trying to use it on a mostly up-to-date Ubuntu 10.04 LTS 64-bit machine and I kept getting messages that looked like the following:

(evince:5691): EggSMClient-WARNING **: Failed to connect to the session manager: Authentication Rejected, reason : None of the authentication protocols specified are supported and host-based authentication failed

There were other messages as well, some stated that evince could not open the display(???) even though other X-applications, such as xterm, for instance, would run just fine from the same command line.

To make things even more bizarre that would happen for some users on the machine and not others. Attempts to play with Gnome settings - or even evince-specific settings - such as deleting ~/.gnome2 directory or specifically ~/.gnome2/evince subdirectory - appeared to be making no difference.

After some web searches I found a solution that seems to work. Hat tip to the participants of this discussion on the Ubuntu Forums. The following seemed to actually fix the problem for everybody:

sudo bash
cd /usr/bin
mv evince evince.bin
ln -s evince.bin evince

And don't ask me why!

Tuesday, March 13, 2012

Large-scale matching exercise using MySQL

In my previous post (MySQL: A Few Metrics, 3 March 2012) I mentioned some parameters of a task I recently faced. We are now going to examine that task in more detail.

We have two large text files, File1 and File2. They both contain text entries, one per line, over 400 million lines each. We know almost nothing about the content beyond that; it is definitely unsorted within each of the files, and some lines may be repeated. So for the purpose of this discussion let us say File1 is 430 million lines and File2 is 440 million lines.

To recap: the only machine I had available for this task was a VM that had plenty of disk - about 1 TB unused - but little processing power and only 512 MB RAM. It was running CentOS 6 and MySQL 5.1.52. First word of caution: if you intend to manipulate large tables it is advisable either to make sure that /tmp has plenty of room for MySQL's invisible temporary files or to change the temporary directory to something else. You can do the latter by setting the TMPDIR environment variable to the desired location. On CentOS I just inserted the appropriate line towards the top of /etc/init.d/mysqld and that did the trick:

# Alternate temporary storage directory
export TMPDIR=/home/mysql/tmp
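Before starting a long-running statement it may be worth checking how much room the temporary directory actually has. A small sketch, assuming a POSIX df and awk; it simply reports the free space on whatever filesystem holds TMPDIR (or /tmp if TMPDIR is unset):

```shell
#!/bin/sh
# Report free space on the filesystem that holds the temporary directory.
tmp_dir="${TMPDIR:-/tmp}"

# df -P prints one header line and one data line; column 4 is the
# available space, in 1024-byte blocks because of -k.
free_kb=$(df -Pk "$tmp_dir" | awk 'NR == 2 { print $4 }')

echo "$tmp_dir has $(( free_kb / 1024 )) MB free"
```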

My first instinct was to first sort the two lists individually and then, after they are sorted, find matches as well as content exclusive to either list by doing one forward pass through both. I still believe that approach was sensible - however, the sorting phase proved to be more time-consuming than I expected. The most likely reason for that was that, as I already mentioned, inexact comparisons take qualitatively longer than exact ones - and sorting, no matter how you do it, is based upon inexact comparisons.
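To make the sort-then-single-pass idea concrete, here is a toy sketch using the standard sort and comm utilities. The sample lines are made up, and this only illustrates the idea on tiny input; it is not the procedure used on the real files.

```shell
#!/bin/sh
# Toy illustration of "sort both lists, then one forward pass through both".
# File contents below are invented for the example.
export LC_ALL=C                       # same byte-wise ordering for sort and comm
work=$(mktemp -d)

printf '%s\n' pear apple fig apple > "$work/File1"
printf '%s\n' fig kiwi apple       > "$work/File2"

# Sort each list and drop repeated lines.
sort -u "$work/File1" > "$work/f1.sorted"
sort -u "$work/File2" > "$work/f2.sorted"

# comm reads both sorted files once, side by side.
comm -12 "$work/f1.sorted" "$work/f2.sorted" > "$work/common"    # lines in both
comm -23 "$work/f1.sorted" "$work/f2.sorted" > "$work/only_f1"   # only in File1
comm -13 "$work/f1.sorted" "$work/f2.sorted" > "$work/only_f2"   # only in File2
```

With GNU sort, the -T option chooses where its temporary files go and -S sets the memory buffer size, which matters once the input no longer fits in RAM.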

However, that same fact could be used to our advantage. We could do exact comparisons to determine the intersection of the two lists - and then separate the entries exclusive to either list.

Let us now run through a practical example that reflects what I ended up doing after some trial and error. The names have been adjusted from those I used to make this text more readable. My apologies for any possible typos in that syntax.

Alright, let us get going now. We have our files: File1 (430 million lines) and File2 (440 million lines).

So first let us create the necessary tables:

mysql> CREATE TABLE f1_list (f1_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, text_line TEXT NOT NULL, INDEX (text_line(400) ASC));
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE TABLE f2_list (f2_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, text_line TEXT NOT NULL, INDEX (text_line(400) ASC));
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE common_list (cl_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, f1_id INT NOT NULL, f2_id INT NOT NULL, text_line TEXT NOT NULL, INDEX (f1_id ASC), INDEX (f2_id ASC), INDEX (text_line(400) ASC));
Query OK, 0 rows affected (0.00 sec)


Now let us populate f1_list and f2_list with the contents of File1 and File2 respectively. You will need to run roughly the following commands using full paths to File1 and File2:

mysql> LOAD DATA LOCAL INFILE "/home/my_user/big_files/File1" INTO TABLE f1_list (text_line);
Query OK, 430000000 rows affected (8 hours 13 min 10.01 sec)
Records: 430000000 Deleted: 0 Skipped: 0 Warnings: 0

mysql> LOAD DATA LOCAL INFILE "/home/my_user/big_files/File2" INTO TABLE f2_list (text_line);
Query OK, 440000000 rows affected (10 hours 01 min 15.0 sec)
Records: 440000000 Deleted: 0 Skipped: 0 Warnings: 0


The times are roughly consistent with what I saw; I'd estimate the import times as 8-12 hours per file.

Now let us extract the commonalities into common_list:

mysql> INSERT INTO common_list (f1_id, f2_id, text_line) SELECT f1_list.f1_id,
f2_list.f2_id, f1_list.text_line FROM f1_list, f2_list WHERE f1_list.text_line = f2_list.text_line;
Query OK, 427677003 rows affected, 2 warnings (16 hours 12 min 16.58 sec)
Records: 427677003 Duplicates: 0 Warnings: 0


Now let us extract the exclusive content:

mysql> DELETE FROM f1_list WHERE f1_id IN (SELECT f1_id FROM common_list);
Query OK, 427001324 rows affected (16 hours 24 min 41.20 sec)

mysql> DELETE FROM f2_list WHERE f2_id IN (SELECT f2_id FROM common_list);
Query OK, 427001129 rows affected (16 hours 39 min 44.47 sec)


And so by now it looks like we are done. We have content exclusive to List 1 listed in f1_list, content exclusive to List 2 in f2_list and the common content in common_list. The total processing time - even if the operations are performed serially as delineated above - can be capped at about 76 hours. I used a little padding there, too. For instance, I counted 20 hours for the processes that in reality took about 16. Thus it appears realistic to sort this mass of data within 3-4 days using a machine which in this day and age would be considered substandard in terms of its performance.
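As a sanity check on that cap, the wall-clock times quoted in the five statements above can simply be added up. A small sketch using shell arithmetic, with fractions of a second dropped:

```shell
#!/bin/sh
# Sum the durations reported above: two loads, one insert, two deletes.
to_seconds() {    # to_seconds HOURS MINUTES SECONDS
    echo $(( $1 * 3600 + $2 * 60 + $3 ))
}

total=0
for t in "8 13 10" "10 1 15" "16 12 16" "16 24 41" "16 39 44"; do
    # $t is left unquoted on purpose so it splits into three arguments
    total=$(( total + $(to_seconds $t) ))
done

echo "$(( total / 3600 )) h $(( total % 3600 / 60 )) min"   # prints "67 h 31 min"
```

So the unpadded total is a little under 68 hours, which leaves roughly 8 hours of slack under the 76-hour figure.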

Saturday, March 3, 2012

MySQL: A Few Metrics

I have recently encountered a problem that involved dealing with massive lists of text strings. The strings were sometimes hundreds of characters long and the lists numbered hundreds of millions of them.

The following are some of the things I encountered. Bear in mind that none of those things are necessarily set in stone as MySQL has many a tunable parameter and it is entirely possible that by tuning some of them one can achieve a behavior very different from mine.

So moving on to the particulars. I had to compare two very large lists of strings, each encompassing hundreds of millions of lines (strings). The task was to find the intersection of the two sets as well as the lines exclusive to each set.

The only machine where I could play was a really weak VM with 512 MB of RAM assigned to it. That happened to be the only machine that had enough room - and even with about 900 GB to spare that was sometimes not enough considering MySQL's appetite for temporary space. Before I decided to use MySQL I tried doing it using Linux shell commands but after a while I gave up, realizing that since ultimately they do all their work in memory - or swap - the task would require designing a process with more stages to it than I'd be able to even think of. So, logically, the next candidate was an RDBMS of some sort - and for that I chose MySQL. So there I was trying to use a MySQL engine on a 512 MB RAM VM running CentOS 6. And the data sets I had to process were two files, over 50 GB each, with hundreds of millions of lines of text, each line to be treated as a separate entity (string).

So let us now delve into it. From now on, let us designate the initial files as File 1 and File 2. The numbering will stay the same throughout this discussion.

Listed below are a few interesting facts I discovered while attacking this problem. Note that the syntax used below differs slightly from that used in real life as I changed table, column and index names to be more self-explanatory. I have not tested that syntax against a real database so there may be typos in it - though I will do my best to be careful.

1. Full text indexes on the text strings seem to take prohibitively long. Yet partial (prefix) indexes get built within 8-12 hours, even though the 400-character prefix was likely long enough to cover entire strings: I could not find any strings that long and most of them seemed to be much shorter.

For example, let us consider a table like the following:


Then you import the data. I used mysqlimport and it loaded over 400 million rows in about 6-10 hours. I then considered using a FULLTEXT index on the text_line column (the actual text).

mysql> CREATE FULLTEXT INDEX file1_full_text ON file1_list (text_line ASC);

Well, I never found enough patience to wait for this one to finish. It had definitely run for over a day by the time I finally decided to terminate it.

2. Indexing on substring seems to be far more efficient. Consider the following:

mysql> CREATE INDEX text_100 ON file1_list (text_line(100) ASC);
Query OK, 427415951 rows affected (13 hours 22 min 55.84 sec)
Records: 427000000 Duplicates: 0 Warnings: 0


Indexing on text_line(400) takes about the same amount of time even though from my estimates most if not all lines had less than 400 characters to them. Thus at some point I simply standardized on indexing on a 400 character substring for the future analysis purposes.

3. Exact string comparisons are extremely fast.
For example:

mysql> SELECT count(*) FROM file1_list WHERE text_line = "abracadabra12345thisysysyysylkjhf";
| count(*) |
| 0 |
1 row in set (0.40 sec)
mysql> SELECT count(*) FROM file1_list;
| count(*) |
| 430967651 |
1 row in set (0.00 sec)


That means we can check whether our string matches any of hundreds of millions of indexed strings in under a second.

4. Inexact comparisons are quite slow. For example:

mysql> SELECT count(*) FROM file1_list WHERE text_line != "abracadabra12345thisysysyysylkjhf";
| count(*) |
| 430967651 |
1 row in set (13 min 26.86 sec)


The complementary exact comparison clearly went a bit faster, to put it mildly.

This is it for now. Next we will address some applications where these results came in handy.

Monday, February 27, 2012

MySQL's Mystery Files

MySQL is a very powerful and easy to use - and administer - RDBMS. For light use it mostly just works out of the box - but if you intend to use it for something serious you need to tweak it some.

For one thing, MySQL makes use of temporary space - that is, temporary space outside of the designated areas where it stores the database files themselves. It appears to be configured as follows:

On Unix, MySQL uses the value of the TMPDIR environment variable as the path name of the directory in which to store temporary files. If TMPDIR is not set, MySQL uses the system default, which is usually /tmp, /var/tmp, or /usr/tmp.

On Windows, Netware and OS2, MySQL checks in order the values of the TMPDIR, TEMP, and TMP environment variables. For the first one found to be set, MySQL uses it and does not check those remaining. If none of TMPDIR, TEMP, or TMP are set, MySQL uses the Windows system default, which is usually C:\windows\temp\.

Source: Where MySQL Stores Temporary Files

One potential caveat is that it uses "hidden" files that can not be detected by normal file system level tools. Here's a discussion on how that is possible. In short - MySQL creates files which are not linked in the file system and hence can not be found and simply vanish if the process making use of them - i.e., the MySQL daemon process - disappears. In other words, you are guaranteed that no left-over space will be used. From the IT standpoint it is a nuisance though as you get file system space consumed without being able to find the files responsible for such consumption.
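The mechanism is easy to reproduce by hand. Below is a Linux-only sketch in which a scratch file (a hypothetical stand-in for one of MySQL's temporary files) is kept open by the shell after its name has been removed:

```shell
#!/bin/sh
# Linux-only demonstration of an "invisible" temporary file.
scratch=$(mktemp)             # hypothetical stand-in for a MySQL temp file
exec 3>"$scratch"             # keep the file open on descriptor 3
rm "$scratch"                 # unlink it: the name is gone, the data is not

echo "still writable" >&3     # writes succeed and still consume disk space

# /proc shows the descriptor pointing at a file marked as deleted.
target=$(readlink "/proc/$$/fd/3")
echo "$target"

exec 3>&-                     # closing the descriptor releases the space
```

On a live system, `lsof +L1` lists open files whose link count has dropped below one, which is one way to track down which process is holding such space.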

As to the size of such consumption - I was actively using a database amounting to no more than 300 GB and due to the fact that I only had about 900 GB to spare in the filesystem I was forced to do "repairing with keycache" which is way slower than regular reindexing - but regular reindexing requires lots of temporary space which I apparently lacked.

Wednesday, February 1, 2012

Some practical uses of VPN

VPN is a term one hears often these days. However, while many people have some ideas about what it is used for - secure access, for instance - many probably lack vision of how they could benefit from its use. So for starters - what is VPN? You can use the Wiki link for a formal definition but in a less formal way one can define it as a network one can build to their own design provided one controls the server and the other machines one intends to network together have the capability to access that server via the internet.

So let us say you control a machine on the Internet with a public IP address. On it you can install a VPN server process. Then you can issue authorization to those you want to allow to join your VPN network.

Let us consider a practical example. I configure a server to serve a VPN with a private network defined as Let us say the server gets the virtual IP address of, with the other addresses ( available for grabs. So let us say my laptop gets an address of, my home machine gets an address of and my office machine gets an address of

Thus - using the same network protocols - I can collect the video feed off of my home computer to see what is going on at home, print to my office computer's printer - and do all of it from a WiFi point half the world away using my laptop.

Or - let us say - in addition to my office in Boston I decide to get one in Buenos Aires. No problem. I get another machine there - let's say, with a virtual IP address of - and use it and the one at my office in Boston - - to link the two networks. Now they are linked - via the internet but at the same time utilizing the VPN's security which is normally considered an unbreakably secure way to communicate.

Those are just a couple of possible usage scenarios. I will try to cover this topic in more detail later on. For now just think of the VPN as a network you can define the way you like no matter where the computers who will join it happen to be geographically and topologically. So long as they have access to the internet and you allow them to join your VPN they can do so.

Wednesday, January 18, 2012

Sparse Files

OK, I had heard the term before but never had to really delve into them. That's an interesting concept, though. I first delved into it when I needed a large (multi-terabyte) file to house a file system. See here.

Then came across this article:

Sparse files – what, why, and how

I like the concept - though it does come with a few pitfalls, it seems. More on that later.
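A quick way to see the concept in action, assuming GNU coreutils (truncate and stat -c): create a file with a large nominal length and compare that with the space it really occupies.

```shell
#!/bin/sh
# Create a sparse file and compare its apparent size with its disk usage.
sparse=$(mktemp)
truncate -s 1G "$sparse"                  # set the length without writing data

apparent=$(stat -c %s "$sparse")          # size in bytes, as ls -l shows it
# %b is the number of allocated blocks, %B the size of each such block.
allocated=$(( $(stat -c %b "$sparse") * $(stat -c %B "$sparse") ))

echo "apparent:  $apparent bytes"
echo "allocated: $allocated bytes"
rm "$sparse"
```

On a filesystem that supports sparse files the allocated figure stays far below the apparent one until data is actually written.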