Boris Epstein's Technology Blog: MediaWiki hack

Wednesday, October 3, 2007

MediaWiki hack

This story did have a successful resolution, although a somewhat dirty one. After signing up for the MediaWiki mailing list and asking my question (specifically, how does one get MediaWiki to search the text inside PDF files?), I received several tentative recommendations but no definite advice. The only sure thing I came across was this hack by one MHART.

The hack basically amounts to forcing the upload module to index a PDF file as it is being uploaded. The text is extracted from the PDF by means of pdftotext(1) and used for indexing. The change goes into the function processUpload() in includes/SpecialUpload.php. That function, by the way, seems too long and convoluted for my taste and could use a redesign, but let's not dwell on that for now.
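To make the extraction step concrete, here is a minimal sketch of how PHP might call pdftotext(1) and collect its output. The function names buildPdftotextCommand() and extractPdfText() are made up for illustration and come from neither MediaWiki nor MHART's patch. The sketch assumes a modern PHP and a pdftotext binary on the PATH.

```php
<?php
// Hypothetical helpers for illustration only; not MediaWiki or MHART code.

// Build the shell command that prints a PDF's text to standard output.
function buildPdftotextCommand(string $pdfPath): string
{
    // "-nopgbrk" suppresses form-feed page breaks; the trailing "-"
    // tells pdftotext to write the text to stdout instead of a file.
    return 'pdftotext -nopgbrk ' . escapeshellarg($pdfPath) . ' -';
}

// Run pdftotext and return the extracted text, or '' if nothing came back.
function extractPdfText(string $pdfPath): string
{
    $output = shell_exec(buildPdftotextCommand($pdfPath));
    return is_string($output) ? $output : '';
}
```

Quoting the path with escapeshellarg() matters here because the file name comes from an upload and may contain spaces or shell metacharacters.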

So here is the hack (the added code is the block under the "by MHART" comment):

. . .
if( $this->saveUploadedFile( $this->mUploadSaveName,
                             $this->mUploadTempName,
                             $hasBeenMunged ) ) {
    /**
     * Update the upload log and create the description page
     * if it's a new file.
     */
    $img = Image::newFromName( $this->mUploadSaveName );

    /*
     * Parsing the file if it is a PDF, by MHART
     */
    if (strtolower($finalExt) == "pdf") {
        $NewDesc = $this->mUploadDescription . "\r\n" . "","",$DocLine);
        }
        $NewDesc .= "\r\n" . " -->";
        $this->mUploadDescription = $NewDesc;
    }

    $success = $img->recordUpload( $this->mUploadOldVersion,
                                   $this->mUploadDescription,
                                   $this->mLicense,
                                   $this->mUploadCopyStatus,
                                   $this->mUploadSource,
                                   $this->mWatchthis );

...
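
The line that builds $NewDesc in the listing above looks incomplete, so the exact original code cannot be recovered from it. Judging by the closing " -->", the extracted text appears to be tucked into an HTML comment at the end of the upload description, where it stays invisible on the page but is still picked up by the search index. Below is a sketch of that idea under that assumption. The helper name appendHiddenText() is hypothetical, and the way it neutralizes "-->" is my own choice, not necessarily what MHART did.

```php
<?php
// Hypothetical helper for illustration only; not MediaWiki or MHART code.

// Append extracted text to an upload description inside an HTML comment.
function appendHiddenText(string $description, string $text): string
{
    // Break up any "-->" in the text so it cannot close the comment early.
    // Inserting a space (rather than deleting) avoids creating a new "-->"
    // from the characters on either side of the match.
    $safe = str_replace('-->', '-- >', $text);

    // Same "\r\n" separators as the listing above.
    return $description . "\r\n<!-- " . $safe . "\r\n -->";
}
```

With this in place, something like appendHiddenText($this->mUploadDescription, $text) would yield a description whose visible part is unchanged and whose hidden part carries the PDF text.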
