Boris Epstein's Technology Blog: MediaWiki

Wednesday, October 3, 2007

MediaWiki hack

This story did have a successful resolution, although a somewhat dirty one. After signing up for the MediaWiki mailing list and asking my question (specifically, how does one get MediaWiki to search the text inside PDF files), I received several tentative recommendations but no definite advice. The only sure thing I came across was this hack by one MHART.

The hack basically amounts to forcing the upload module to index a PDF file as it is being uploaded. The text is extracted from the PDF by means of pdftotext(1) and appended to the upload description, so that it gets indexed along with it. The change goes into includes/SpecialUpload.php, in the function processUpload(). That function, by the way, seems a bit too long and convoluted for my taste and needs a redesign, but let's not dwell on that for now.

So here is the hack (the added code is the block under the "by MHART" comment):

. . .
if( $this->saveUploadedFile( $this->mUploadSaveName,
                             $this->mUploadTempName,
                             $hasBeenMunged ) ) {
    /**
     * Update the upload log and create the description page
     * if it's a new file.
     */
    $img = Image::newFromName( $this->mUploadSaveName );

    /*
     * Parsing the file if it is a PDF, by MHART
     */
    if (strtolower($finalExt) == "pdf") {
        $NewDesc = $this->mUploadDescription . "\r\n" . "<!-- . . .
        . . .
            . . . str_replace("-->","",$DocLine);
        }
        $NewDesc .= "\r\n" . " -->";
        $this->mUploadDescription = $NewDesc;
    }

    $success = $img->recordUpload( $this->mUploadOldVersion,
                                   $this->mUploadDescription,
                                   $this->mLicense,
                                   $this->mUploadCopyStatus,
                                   $this->mUploadSource,
                                   $this->mWatchthis );

...
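
For anyone trying to follow the PDF step in the listing above, here is a minimal sketch of what it appears to do: wrap the extracted text in an HTML comment and append it to the upload description, so the text is stored with the file page without being displayed. This is my own illustration, not MHART's code. The function name appendHiddenText() is hypothetical and is not part of MediaWiki, the sketch assumes the PDF text has already been extracted into a string, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper, not part of MediaWiki or of the original patch.
// Appends $pdfText to $description inside an HTML comment.
function appendHiddenText(string $description, string $pdfText): string
{
    $newDesc = $description . "\r\n" . "<!--";

    // Handle the text line by line, whatever the line-ending style.
    foreach (preg_split('/\r\n|\r|\n/', $pdfText) as $docLine) {
        // A literal "-->" in the PDF text would close the comment early.
        // Repeat the removal, because deleting one "-->" can join the
        // characters around it into a new one (e.g. "---->>").
        do {
            $docLine = str_replace("-->", "", $docLine, $count);
        } while ($count > 0);

        $newDesc .= "\r\n" . $docLine;
    }

    // Close the comment on its own line.
    return $newDesc . "\r\n" . " -->";
}
```

Calling appendHiddenText($this->mUploadDescription, $text) just before recordUpload() would be one way to wire it in, assuming $text holds the pdftotext output.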

Monday, October 1, 2007

In praise of MediaWiki

Well, something new to deal with every day... Or every week at least. The latest wonder I've come across is MediaWiki. Don't laugh - you may have had a Wiki site for years, but up until very recently I had only been a Wiki user, not a Wiki administrator.

What can I say - this is a very nice piece of software. Installs right off the bat, easy to administer. I read the code a bit and it mostly makes sense right away too - which is, obviously, quite nice to know.

There is only one drawback that I have discovered thus far - it does not seem to be configured to search within PDF files by default, and that is exactly the functionality we need. Under Linux there is a very nice text extractor for the purpose called pdftotext(1), which comes as part of xpdf. So technically it should not be difficult to implement this functionality in MediaWiki, though I am generally somewhat reluctant to touch code I don't own. The best I could get out of the users' forum was a suggestion to use the Lucene Search software, but that looks like a whole can of worms in and of itself - and is not even guaranteed to actually do what I need done. But I guess one way or another I will get through this one.
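
To give an idea of how little is involved on the extraction side, here is a rough sketch of calling pdftotext(1) from PHP. It assumes pdftotext is installed and on the PATH, that PHP is allowed to call exec(), and that the code runs on a Unix-like system. The function names pdftotextCommand() and extractPdfText() are made up for this example and are not part of MediaWiki.

```php
<?php
// Hypothetical helper: builds the shell command for one PDF file.
// Passing "-" as the output file makes pdftotext write to standard output.
function pdftotextCommand(string $pdfPath): string
{
    // escapeshellarg() quotes the path so spaces and shell
    // metacharacters in file names are passed through safely.
    return 'pdftotext ' . escapeshellarg($pdfPath) . ' -';
}

// Hypothetical helper: returns the extracted text, or null on failure
// (for example if pdftotext is missing or the file is not a valid PDF).
function extractPdfText(string $pdfPath): ?string
{
    $output = [];
    $status = 0;
    exec(pdftotextCommand($pdfPath), $output, $status);

    // exec() collects the output line by line, with trailing
    // whitespace removed from each line.
    return $status === 0 ? implode("\n", $output) : null;
}
```

Keeping the command construction in its own function makes it easy to check the quoting without having a PDF at hand.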

If any of you have any relevant experience - please advise; your advice will, as always, be appreciated. But this little difficulty notwithstanding, MediaWiki really is very nice technology that lets one have a Wiki-style reference site up and running in under an hour.
