Babelworx: January 2008

Tuesday, 29 January 2008

New project in Google Code!

New app - pyOcrHelper!
I just checked a new project into Google Code - I called it pyOcrHelper (because I couldn't think of anything else). Basically, it's a python class which makes access to OCR software such as Tesseract or Ocropus easier, because you don't have to think about converting the image/document you have into the format required by Tesseract or by Ocropus - pyOcrHelper takes care of this for you.

What it can do currently:
The first release provides the basic functionality that I required - simply to be able to OCR scan any image file and (importantly) also PDFs (seeing as scanned documents are often sent as images embedded in PDF). As mentioned, this works (kind of). It badly needs documentation and probably also needs to be packaged in the openSUSE build service. There are a couple of loose dependencies which could probably be deleted altogether.

Next steps:
The next steps are to tighten up the code (a lot), to make the code readable and to start raising worthwhile exceptions instead of having class member functions bail out with sys.exit() after doing a sys.stderr.write(). I also want to do some work on output formats. Currently, Ocropus produces half usable HTML, but this could easily be improved upon - and XML can't be that hard to output either. Apart from that, there are other things that I might consider, like taking the opportunity to get to grips with pyqt4 and KDE4/Plasma. I'm thinking of a nice plasma desktop app where you can drop any file and have the OCRd version jump back out at you...
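The switch from sys.exit() to exceptions could look roughly like the sketch below. None of these names come from pyOcrHelper itself: OcrHelperError, UnsupportedFormatError, SUPPORTED and check_input are all made up for illustration, and the list of extensions is an assumption.

```python
import os


class OcrHelperError(Exception):
    """Base class for errors raised by the (hypothetical) helper."""


class UnsupportedFormatError(OcrHelperError):
    """Raised when the input file has an extension we cannot handle."""


# Illustrative list only - not the formats pyOcrHelper really supports.
SUPPORTED = {".png", ".jpg", ".tif", ".pdf"}


def check_input(path):
    """Return the lower-cased extension, or raise instead of exiting."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED:
        raise UnsupportedFormatError(
            "cannot OCR %r: unsupported extension %r" % (path, ext)
        )
    return ext
```

The point is that the caller gets to decide what happens next: a command line script can catch OcrHelperError and print the message, while a GUI can show a dialog, and neither gets killed by a sys.exit() buried in a class method.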

Similar applications:
Just spotted another python project on Google Code - Clarify, which is aimed at doing more or less the same as what I'm aiming at - but possibly with multithreading as well. Must have a look at the code and the results. Maybe I can learn something from it.

Saturday, 19 January 2008

Determining file type with python using python-magic as an alternative to the inbuilt mimetypes

Often enough during python programming, you need to read a file, or perform actions on a file. Before doing so, it is often necessary to find out what type of file it is you are dealing with - an mp3, a text file, an image file - whatever.

Using the command 'file' to determine the file type
Linux comes with a wickedly good utility called 'file'. It is fast and can handle lots of different file types. Using it is as simple as:

cf@opensuse:~/bin/python/playground> file myjpeg.jpg
myjpeg.jpg: JPEG image data, EXIF standard
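The same utility can also be called from python through subprocess. This is only a minimal sketch using Python 3's standard library; the helper name describe_with_file is made up, and it assumes that 'file' is installed and on the PATH (it returns None otherwise).

```python
import shutil
import subprocess


def describe_with_file(path):
    """Return the description printed by 'file', or None if 'file' is missing."""
    if shutil.which("file") is None:
        return None
    # --brief drops the leading "filename: " prefix from the output.
    result = subprocess.run(
        ["file", "--brief", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling describe_with_file("myjpeg.jpg") should then give back the text that 'file' prints after the filename.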

Try to get the filetype by looking at the file extension
One way of doing this is to look at the file's extension, though this sucks, for a number of reasons (particularly on linux/unix). To do so, you would take the filename as a string and use a regular expression to try to get the part of the string after the first dot (e.g. tar.gz from myfile.tar.gz or mp3 from myfile.mp3). This breaks for filenames containing extra dots - I'm thinking of files with version numbers etc - something like myfile-1.0.4-1.src.rpm. However, python brings an inbuilt helper called mimetypes, which could make life a bit easier.
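The problem with splitting on a dot is easy to see in a few lines. The function ext_after_first_dot is a made-up helper for this sketch; os.path.splitext is the standard library call, which only ever returns the part after the last dot.

```python
import os.path


def ext_after_first_dot(filename):
    """Naive approach: everything after the first dot counts as the extension."""
    _, dot, rest = filename.partition(".")
    return rest if dot else ""


for name in ("myfile.mp3", "myfile.tar.gz", "myfile-1.0.4-1.src.rpm"):
    # Show both guesses side by side for each filename.
    print(name, "->", repr(ext_after_first_dot(name)), repr(os.path.splitext(name)[1]))
```

For the rpm, the naive split gives '0.4-1.src.rpm', while splitext gives '.rpm'. Neither approach handles every case: splitext only sees '.gz' in myfile.tar.gz.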

Try to get the filetype using the inbuilt mimetypes
If we have a JPEG image file called myjpeg.jpg, we can try to read in information about the file type using mimetypes. Consider the following python code (after starting the interpreter):

>>> import os,mimetypes
>>> mimetypes.guess_type(os.path.join(os.getcwd(),"myjpeg.jpg"))
('image/jpeg', None)

As you can see, mimetypes reports that the file is a jpeg file. Note that guess_type only looks at the filename (or URL) and never opens the file, so it is really just a smarter version of the extension check. The second entry in the tuple returned by mimetypes.guess_type is the encoding (e.g. 'gzip' for a .gz file), or None if guess_type can't determine one. Compared to the output of the 'file' command shown above, this is pretty feeble, but apparently it is possible to get better results by using some of the other functions in mimetypes to map additional types and encodings. Try using the following for more information:

pydoc mimetypes
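A quick way to check how much mimetypes really knows about a file is to feed it names of files that don't exist. This is just a sketch; both filenames are invented for the example.

```python
import mimetypes

# No file called "does-not-exist.jpg" has to exist: only the name is inspected.
print(mimetypes.guess_type("does-not-exist.jpg"))

# For compressed files, the second tuple entry carries the encoding.
print(mimetypes.guess_type("archive.tar.gz"))
```

With the default type map, the first call should report image/jpeg and the second should report application/x-tar with gzip as the encoding. A text file renamed to .jpg would therefore still be reported as a jpeg.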

Using python-magic as an alternative to mimetypes

As mentioned above, linux has a really good utility for determining file type and other useful information from files - the 'file' utility. python-magic is a kind of python interface to the file utility - it brings its own shared library 'magic.so' and thus provides more information than mimetypes (though I emphasise that I haven't spent too much time on mimetypes - it's probably better than I'm describing). Anyhow, consider the following code:

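#!/usr/bin/env python
# Uses the 'magic' bindings that wrap libmagic (the library behind 'file').
import magic
import os

jpg = os.path.join(os.getcwd(), "myjpeg.jpg")

# Open a magic session with default flags and load the default database.
ms = magic.open(magic.MAGIC_NONE)
ms.load()

# Identify the file by its path.
file_type = ms.file(jpg)
print(file_type)

# Identify the same file from its first 4096 bytes. Binary mode keeps the
# image data intact, and the new names avoid shadowing the builtins
# 'type' and 'buffer'.
with open(jpg, "rb") as f:
    data = f.read(4096)

buffer_type = ms.buffer(data)
print(buffer_type)
ms.close()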

This code produced the following when run on the same 'myjpeg.jpg' file as 'file' above:

cf@opensuse:~/bin/python/playground> ./testmagic.py
JPEG image data, EXIF standard
JPEG image data, EXIF standard

The information is more or less the same as provided by 'file'. The major problem with python-magic is actually finding and installing the module. There are some Ubuntu and Debian packages available, but I was looking for an openSUSE package. After searching for ages, I decided to package it myself, from the Ubuntu sources. You can download the package from my openSUSE Build Service project.

Saturday, 5 January 2008

Messing about with a python implementation of wget

wget is a really wicked tool - I really miss it when I have to use other systems. So I thought, "why not try to implement wget in python?" - seeing as python is easy to install on lots of systems - even Nokia's S60 can handle it. Of course, writing something with anywhere near the functionality of wget is pretty much impossible - and anyway, I don't particularly need all of its functionality. The concrete requirement I have is to build a downloader module for currxchange (to download xml files containing currency exchange rates).

After messing about with python's urllib, I decided to use urllib2 - I only really needed urllib2.urlopen as this provides the info() function which can spit out the metadata about the upstream file - things like file_descriptor.info()["Content-Length"] are thus easy to access.
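The basic idea can be sketched as follows. Note that this is not the currxchange downloader: it uses Python 3's urllib.request (the successor to urllib2), and the function name download, its parameters and the report callback are all made up for illustration.

```python
from urllib.request import urlopen


def download(url, dest, chunk_size=8192, report=None):
    """Copy url to dest in chunks and return the number of bytes written."""
    response = urlopen(url)
    try:
        # Content-Length can be missing (e.g. chunked transfers), so allow None.
        length = response.info().get("Content-Length")
        total = int(length) if length is not None else None
        done = 0
        with open(dest, "wb") as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
                done += len(chunk)
                # Hand progress to the caller, e.g. to drive a progress bar.
                if report is not None:
                    report(done, total)
    finally:
        response.close()
    return done
```

A progress bar would plug in through report, which gets the bytes read so far and the total (or None if the server didn't send a Content-Length).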

I have the script kind of working with Kelvie Wong's cool ProgressBar module from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/168639 (note: the class posted as a comment right down at the bottom of the page). There is another pretty cool python class available for doing more or less exactly what I want to do (even with gui if you're into that) at http://www.python-forum.de/topic-9647.html. It's threaded and downloads the file in three different parts. Methinks it can resume downloads as well - pretty neat. Still, I'm going to keep messing about with my own class so I can tailor it for currxchange. From all this messing about you really appreciate how much work went into a utility like wget - amazing!

Friday, 4 January 2008

Medion MD96360 on openSUSE 10.3

Sandra got a brand new laptop before Christmas - a stylish shiny Medion in black with Swarovski studs on the front panel. Niiice... Anyhow, she didn't particularly want the Windows Vista Home Premium which came with it, so I happily clicked on "Hell no - I don't agree with this EULA" and popped in the openSUSE 10.3 GM 64 bit DVD.

Installing openSUSE 10.3 is ... well ... dead easy. In fact, since 10.1, I haven't really had any problems with openSUSE - now and again there are some quirky hardware parts which just don't work (tm) but nothing that has particularly bothered me. With this laptop, X wasn't detected properly and I got dropped back into the terminal. Installing the fglrx driver solved this (except for an annoying little problem which pops up now and again - more on this later). I was happy to see stuff like the inbuilt card readers working. Ouch - one more thing, now that I think of it - WLAN was a bitch to configure - I still haven't got it working properly. The inbuilt Ralink rt73 card was detected and can be used with iwconfig/iwlist, but it seems to have problems with wpa2 authentication. A USB WLAN stick later and WLAN was working... though this really isn't the proper solution...

Anyhow, those were just a few thoughts on the laptop - the overall verdict is "niiice". Small, light, comfortable to type on and wickedly fast. More to come on this...
