Python Programming, news on the Voidspace Python Projects and all things techie.

The Battle Against Spam - Akismet

The battle against spam in my guestbook continues. I blogged about investigating DNS blacklists to combat spam - and how they are basically ineffective for my guestbook [1].

Someone has pointed me to a very promising looking new web service called Akismet.

They promise to be almost 100% effective at catching comment spam, and say that currently 81% of all comments submitted to them are spam.

It's designed to work with the Wordpress Blog Tool, but it's not restricted to that - so I've written a Python interface to the Akismet API.

You'll need a Wordpress Key to use it. This script will allow you to plug akismet into any CGI script or web application, and there are full docs in the code. It's extremely easy to use, as the folks at akismet have implemented a nice and straightforward REST API.

Note

This module now has a homepage at : Akismet the Python Module.

Download it from akismet.py - 0.1.1.

Here's an example of how to use it :

from akismet import Akismet
api = Akismet(agent='Test Script')
# if apikey.txt is in place,
# the key will automatically be set
# or you can call ``api.setAPIKey()``
#
if api.key is None:
    print "No 'apikey.txt' file."
elif not api.verify_key():
    print "The API key is invalid."
else:
    # data should be a dictionary of values
    # They can all be filled in with defaults
    # from a CGI environment
    if api.comment_check(comment, data):
        print 'This comment is spam.'
    else:
        print 'This comment is ham.'

It also comes with an example CGI - which isn't that useful really.
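
To give a feel for how simple the REST API is, here is a rough sketch of what a comment check boils down to. It is written for current Python 3 and is not taken from akismet.py. The helper names `build_comment_check` and `is_spam` are made up for illustration, and the URL layout and field names are my understanding of the public Akismet documentation, so check them against the current docs before relying on this.

```python
from urllib.parse import urlencode


def build_comment_check(api_key, blog_url, user_ip, user_agent, comment):
    """Return (url, body) for an Akismet comment-check POST request.

    Assumption: the key goes in the host name and the call is a plain
    form-encoded POST, as described in the public API documentation.
    """
    url = 'https://%s.rest.akismet.com/1.1/comment-check' % api_key
    fields = {
        'blog': blog_url,            # the site the comment was posted to
        'user_ip': user_ip,          # IP address of the commenter
        'user_agent': user_agent,    # browser user agent of the commenter
        'comment_content': comment,  # the text to be checked
    }
    return url, urlencode(fields).encode('utf-8')


def is_spam(response_text):
    """The service answers with the text 'true' (spam) or 'false' (ham)."""
    return response_text.strip() == 'true'
```

The request itself is deliberately left out, so the sketch runs without a key or a network connection. Sending `body` to `url` with any HTTP client and passing the response text to `is_spam` would complete the round trip.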

My guestbook will shortly be revamped to use it.

[1]The blacklists don't recognise the IP address that posts are made from, or the websites they promote, as spam.

Posted by Fuzzyman on 2005-12-02 09:49:44


Sites Built with rest2web

I'm compiling a list of rest2web users for the rest2web documentation - as example sites.

If you would like me to include your site then please let me know the URL I should include (and any brief description).

This would be very helpful for me. I currently know of six websites built with rest2web (plus Voidspace of course).

Also - if there are features you would like to see, or suggestions you have, could you post them to the mailing list.

Posted by Fuzzyman on 2005-12-01 15:21:06


File Locking

For a long time I've been looking for a simple, cross-platform file locking solution. This is to solve some concurrency problems I have in some CGI scripts, especially downman, which manages the downloads I offer.

The problem occurs because I like to store my data as plain text files, and if the CGI is accessed simultaneously by more than one user then data can become inaccurate, or corrupted.

What I want to do is to read a file, amend it, and then save it back again - and guarantee that during that time, no other process is able to access the file.

I did find a nice module called XFile that is a cross platform file locking module. Under the hood it uses fcntl (for Unix like platforms) or the win32 API to do the locking.

The showstopper for me is that I want to lock the file so that I can read it, but another process can't. XFile provides :

LOCK_EX - the flag used to specify an exclusive lock. This cannot be used when the file is in a read-only mode.

It doesn't seem possible to lock a file in the way I want.

A bit of googling on the subject and I've come across a suggestion by the benevolent dictator himself, back in 1999 - in a Python for CGI presentation. It works by creating a directory with the same name as the file [1], and testing for that before giving a lock.

I've amended it a fair bit, including adding timeouts and a file like object with lock and unlock methods. This is simple, and almost hackish, but for situations where you're not going to have more than three or four concurrent users - it's a lot easier than other solutions (and just as effective).
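
The reason a directory works as a lock is that creating one either succeeds or fails as a single step, so two processes can't both succeed. Here is a minimal sketch of that idea on its own, written for current Python 3 rather than the Python 2 of the listing further down. The name `dir_lock` is made up for illustration, and unlike the class below it never forces the lock on timeout.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def dir_lock(filename, timeout=5.0, step=0.1):
    """Hold a lock on ``filename`` by creating the directory ``filename + '_'``."""
    lockdir = filename + '_'
    deadline = time.monotonic() + timeout
    while True:
        try:
            # mkdir either creates the directory or raises; only one
            # process can win, which is what makes it usable as a lock
            os.mkdir(lockdir)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError('could not lock %s' % filename)
            time.sleep(step)
    try:
        yield
    finally:
        # always remove the directory, even if the body raised
        os.rmdir(lockdir)
```

Used as `with dir_lock('data.txt'): ...`, the read, amend and write all happen inside the block, and the lock directory goes away even if something in the block raises.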

Note

The following code and docs are now part of the pathutils module.

##########################################################
# A set of objects for providing simple, cross-platform file locking

import os
import time

class LockError(IOError):
    """The generic error for locking - it is a subclass of ``IOError``."""

class Lock(object):
    """A simple file lock, compatible with windows and Unixes."""

    def __init__(self, filename, timeout=5, step=0.1):
        """
        Create a ``Lock`` object on file ``filename``

        ``timeout`` is the time in seconds to wait before timing out, when
        attempting to acquire the lock.

        ``step`` is the number of seconds to wait in between each attempt to
        acquire the lock.

        """

        self.timeout = timeout
        self.step = step
        self.filename = filename
        self.locked = False

    def lock(self, force=True):
        """
        Lock the file for access by creating a directory of the same name (plus
        a trailing underscore).

        The file is only locked if you use this class to acquire the lock
        before accessing.

        If ``force`` is ``True`` (the default), then on timeout we forcibly
        acquire the lock.

        If ``force`` is ``False``, then on timeout a ``LockError`` is raised.
        """

        if self.locked:
            raise LockError('%s is already locked' % self.filename)
        t = 0
        while t < self.timeout:
            t += self.step
            try:
                os.mkdir(self._mungedname())
            except os.error:
                time.sleep(self.step)
            else:
                self.locked = True
                return
        if force:
            self.locked = True
        else:
            raise LockError('Failed to acquire lock on %s' % self.filename)

    def unlock(self, ignore=True):
        """
        Release the lock.

        If ``ignore`` is ``True`` and removing the lock directory fails, then
        the error is suppressed. (This may happen if the lock was acquired
        via a timeout.)
        """

        if not self.locked:
            raise LockError('%s is not locked' % self.filename)
        self.locked = False
        try:
            os.rmdir(self._mungedname())
        except os.error:
            if not ignore:
                raise LockError('unlocking appeared to fail - %s' %
                    self.filename)

    def _mungedname(self):
        """
        Override this in a subclass if you want to change the way ``Lock``
        creates the directory name.
        """

        return self.filename + '_'

    def __del__(self):
        """Auto unlock when object is deleted."""
        if self.locked:
            self.unlock()

class LockFile(Lock):
    """
    A file like object with an exclusive lock, whilst it is open.

    The lock is provided by the ``Lock`` class, which creates a directory
    with the same name as the file (plus a trailing underscore), to indicate
    that the file is locked.

    This is simple and cross platform, with some limitations :

        * Unusual process termination could result in the directory
          being left.
        * The process acquiring the lock must have permission to create a
          directory in the same location as the file.
        * It only locks the file against other processes that attempt to
          acquire a lock using ``LockFile`` or ``Lock``.
    """


    def __init__(self, filename, mode='r', bufsize=-1, timeout=5, step=0.1,
        force=True):
        """
        Create a file like object that is locked (using the ``Lock`` class)
        until it is closed.

        The file is only locked against another process that attempts to
        acquire a lock using ``Lock`` (or ``LockFile``).

        The lock is released automatically when the file is closed.

        The filename, mode and bufsize arguments have the same meaning as for
        the built in function ``open``.

        The timeout and step arguments have the same meaning as for a ``Lock``
        object.

        The force argument has the same meaning as for the ``Lock.lock`` method.

        A ``LockFile`` object has all the normal ``file`` methods and
        attributes.
        """

        Lock.__init__(self, filename, timeout, step)
        # may raise an error if lock is ``False``
        self.lock(force)
        # may also raise an error
        self._file = open(filename, mode, bufsize)

    def close(self, ignore=True):
        """
        close the file and release the lock.

        ignore has the same meaning as for ``Lock.unlock``
        """

        self._file.close()
        self.unlock(ignore)

    def __getattr__(self, name):
        """delegate appropriate method/attribute calls to the file."""
        if name == '_file':
            # ``_file`` is missing if ``open`` failed - avoid endless recursion
            raise AttributeError(name)
        return getattr(self._file, name)

    def __setattr__(self, name, value):
        """Only allow attribute setting that don't clash with the file."""
        if not '_file' in self.__dict__:
            Lock.__setattr__(self, name, value)
        elif hasattr(self._file, name):
            return setattr(self._file, name, value)
        else:
            Lock.__setattr__(self, name, value)

    def __del__(self):
        """Auto unlock (and close file) when object is deleted."""
        if self.locked:
            self.unlock()
            self._file.close()

File Locking

Simple cross platform file locking is a common task, but it is not as easy as it should be.

One useful module is XFile, which is a cross platform file locking module. Under the hood it uses fcntl (for Unix like platforms) or the win32 API to do the locking.

Unfortunately, you can't gain an exclusive lock for read only access.

The following approach (as originally implemented by Guido van Rossum) provides a lock by creating a directory with the same name as the file (plus a trailing underscore). This is simple and cross platform, with some limitations :

  • Unusual process termination could result in the directory being left.
  • The process acquiring the lock must have permission to create a directory in the same location as the file.
  • It only locks the file against other processes that attempt to acquire a lock using LockFile or Lock.

LockError

The generic error for locking - it is a subclass of IOError.

Lock

A simple file lock, compatible with windows and Unixes.

You create a lock by calling :

lock = Lock(filename, timeout=5, step=0.1)

Create a Lock object on file filename

timeout is the time in seconds to wait before timing out, when attempting to acquire the lock.

step is the number of seconds to wait in between each attempt to acquire the lock.

Note

If you don't like the way Lock creates a directory by adding a _ to the filename, then you can subclass and override the _mungedname method.

A Lock object has the following methods.

lock

lock(force=True)

Lock the file for access by creating a directory of the same name (plus a trailing underscore).

The file is only locked if you use this class to acquire the lock before accessing.

If force is True (the default), then on timeout we forcibly acquire the lock.

If force is False, then on timeout a LockError is raised.

unlock

unlock(ignore=True)

Release the lock.

If ignore is True and removing the lock directory fails, then the error is suppressed. (This may happen if the lock was acquired via a timeout.)

unlock is called automatically when the Lock object is deleted.

LockFile

A file like object with an exclusive lock, whilst it is open.

The lock is provided by the Lock class, which creates a directory with the same name as the file (plus a trailing underscore), to indicate that the file is locked.

You create a new LockFile by calling :

lockedfile = LockFile(filename, mode='r', bufsize=-1, timeout=5,
    step=0.1, force=True)

This creates a file like object that is locked (using the Lock class) until it is closed.

The file is only locked against another process that attempts to acquire a lock using Lock (or LockFile).

The lock is released automatically when the file is closed.

The filename, mode and bufsize arguments have the same meaning as for the built in function open.

The timeout and step arguments have the same meaning as for a Lock object.

The force argument has the same meaning as for the Lock.lock method.

A LockFile object has all the normal file methods and attributes.

Usage Examples

Here are examples of using Lock and LockFile.

Where you just want exclusive access to a file, for a single read or write operation, LockFile is the class to use.

# force=False means an error is raised if
# we fail to acquire the lock
lockedfile = LockFile(filename, 'w', force=False)
lockedfile.write(the_file)

# close releases the lock
lockedfile.close()

If you want to read a file, then amend it, then Lock is the class to use.

lock = Lock(filename)
# force=False means an error is raised if
# we fail to acquire the lock
lock.lock(force=False)
handle = open(filename)
data = handle.read()
handle.close()

# we've read in the file
# now we do something to it
data = data.replace(something, something_else)

# now we write out the new data
handle = open(filename, 'w')
handle.write(data)
handle.close()

# finally, release the lock
lock.unlock()

[1]This part has actually stopped working since 1999. On Windoze at least you can't now create a directory with the same name as the file - so this

Posted by Fuzzyman on 2005-12-01 12:06:28


The New odict - Beta 2

Progress on the updated implementation of odict continues. (I hesitate to say new version, as it's just a heavy makeover for the old code - which was basically sound).

FancyODict is now a full implementation of an Ordered Dictionary, with custom callable sequence objects for keys, values, and items. These can be called like normal methods, but can also be accessed directly as sequence objects. This includes assigning to, indexing, and slicing - as well as all the other relevant sequence methods.
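
To make "callable sequence object" a bit more concrete, here is a stripped-down sketch of the idea. It is not the odict code: `Keys` and `TinyOrdered` are made-up names, it is written for current Python 3, and it only covers calling, indexing, slicing, length and iteration (no assignment).

```python
class Keys(object):
    """A view of the key order that can be called or used as a sequence."""

    def __init__(self, order):
        self._order = order   # the list that records key order

    def __call__(self):
        # d.keys() style - return a copy of the key list
        return list(self._order)

    def __getitem__(self, index):
        # d.keys[0] or d.keys[1:] style
        return self._order[index]

    def __len__(self):
        return len(self._order)

    def __iter__(self):
        return iter(self._order)


class TinyOrdered(dict):
    """Hypothetical stand-in, only here to host the ``keys`` attribute."""

    def __init__(self):
        dict.__init__(self)
        self._order = []
        # the instance attribute shadows the normal ``dict.keys`` method
        self.keys = Keys(self._order)

    def __setitem__(self, key, value):
        if key not in self:
            self._order.append(key)
        dict.__setitem__(self, key, value)
```

With this in place both `d.keys()` and `d.keys[0]` work on the same object, which is the convenience FancyODict is after.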

I've also added an optional index to OrderedDict.popitem.

I'm sure there are lots of ways this can be optimised for efficiency - but the new objects have got pretty full test coverage.

You can download the new version (for testing) from odict Beta 2

The following issues still remain :

  • FancyODict is a separate class from OrderedDict.

    Because this version is undoubtedly less efficient than OrderedDict, my current thinking is that I should leave them separate (and have both available). Being able to operate on the keys/values/items as sequences is for convenience only.

    Anyone got a suggestion for a better name than FancyODict ?

  • You can no longer access the key order directly. The old sequence attribute is deprecated and will eventually go away.

    You can currently alter the order (of keys, values and items) by passing an iterable into those methods.

    Someone has suggested that this "smells bad" - and it ought to be done through separate setkeys, setvalues, and setitems methods.

    I'm inclined to agree, but I don't feel strongly about it. Anyone else got any opinions ?

  • repr ought to return a value that eval could use to turn back into an OrderedDict.

    I have actually done an implementation of this, but it would mean that all the doctests need to be changed. I will do this at some point.

  • Slice assignment.

    The semantics for slice assignment are fiddly.

    For example, what do you do if in a slice assignment a key collides with an existing key ?

    My current implementation does what an ordinary dictionary does: the new value overwrites the previous one. This means that the dictionary can reduce in size as the assignment progresses.

    I think this is preferable to raising an error and preventing assignment. It does prevent an optimisation whereby I calculate the indexes of all the new items in advance.

    It also means you can't rely on the index of a key from a slice assignment, unless you know that there will be no key collisions.

    In general I'm against preventing programmers from doing things, so long as the documentation carries an appropriate warning.

    An example will probably help illustrate this :

d = OrderedDict()
d[1] = 1
d[2] = 2
d[3] = 3
d[4] = 4
d.keys()
[1, 2, 3, 4]

# fetching every other key
# using an extended slice
# this actually returns an OrderedDict
d[::2]
{1: 1, 3: 3}

# we can assign to every other key
# using an ordered dict
d[::2] = OrderedDict([(2, 9), (4, 8)])
len(d) == 4
False

d
{2: 9, 4: 8}

"""
Because of the key collisions the length of
d has changed - it now only has two keys instead
of four.
"""
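
For anyone who wants to poke at the collision behaviour without downloading the beta, here is a small sketch of the semantics using a plain list of keys and a plain dict. It is not the odict implementation: `slice_assign` is a made-up helper, it is written for current Python 3, and it assumes the new items must match the slice in size for every kind of slice.

```python
def slice_assign(order, data, slc, new_items):
    """Assign ``new_items`` (a list of (key, value) pairs) to ``slc``.

    ``order`` is the list of keys and ``data`` the plain dict of values.
    Both are modified in place.
    """
    indexes = list(range(*slc.indices(len(order))))
    if len(indexes) != len(new_items):
        raise ValueError('slice and new items differ in size')
    # drop the keys currently sitting in the slice
    for key in [order[i] for i in indexes]:
        order.remove(key)
        del data[key]
    # insert the new items - a colliding key replaces the existing entry,
    # so the total number of keys can go down
    for index, (key, value) in zip(indexes, new_items):
        if key in data:
            order.remove(key)
        order.insert(index, key)
        data[key] = value
```

Running it on the keys 1 to 4 with `slice(None, None, 2)` and the pairs `(2, 9)` and `(4, 8)` leaves just the keys 2 and 4, the same shrinkage as in the example above. With new keys that don't collide, the result keeps all four positions.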

Posted by Fuzzyman on 2005-12-01 10:39:13


Wired for Chaos Review and Python Search Engine Update

I've just put online my Review of Wired for Chaos, the new Cyberpunk novel by Brett Renwick.

To celebrate I've updated the domain lists used by CyberSearch - the Cyberpunk Search Engine and Skimpy the Python Search Engine.

The Python Search Engine now returns results from over twelve hundred domains relevant to Python programming. It's been used over fifteen hundred times, not bad for a hack.

Posted by Fuzzyman on 2005-11-29 11:58:04


Ordered Dictionary II

0.2 actually, or at least an experimental version of it.

This comes out of the discussion at comp.lang.python.

There are two implementations. The first is just an improvement of the OrderedDict class, with the following changes :

  • You can now slice, including :

    • Slicing, which returns an OrderedDict
    • Slice assignment (only with an OrderedDict - for an extended slice it must be the same size as the one you are assigning to)
    • Slice deletion
  • The sequence attribute is now deprecated

  • The keys, items, and values methods now take an optional argument so that you can set them

  • Several sequence methods are now implemented

    • sort
    • reverse
    • index
    • insert

There is also a subclass called FancyODict. This has a custom object as the keys attribute. This can be treated directly as a sequence - iterated over, sliced, assigned to etc. (Assigning to a key is effectively renaming it.) You can't delete keys though.

If Nicola Larosa approves this, I'll implement items and values as well. There are still some optimisations to be made as well.

Posted by Fuzzyman on 2005-11-28 16:41:40


Guestbook Spam and DNS Blacklists

My Guestbook has been getting some really nasty spam recently. I've been meaning to sort it out for ages.

I thought the answer was going to be a DNS blacklist, like the ones run by Spamhaus.

I've finally got around to working out how to do it. When an entry is made I can check the IP address the entry is made from. I can also check any URLs they post against the blacklist.

The way to do it is to first make sure you have an IP address (using socket.gethostbyname(domainname) if you haven't).

You then reverse the order of the four numbers in the IP address (so 1.2.3.4 becomes 4.3.2.1), and stick the result onto the front of the blacklist domain name. If you can successfully call socket.gethostbyname(mungedname), then that IP is on that blacklist. Otherwise you get a socket.gaierror.

Got all that ?

Here's the Python code to do it :

def blacklisted(ip):
    """
    Returns ``True`` if ip address is a blacklisted IP (i.e. from a spammer).

    IP should first be created by calling ``getname(ip)`` - where you specify
    the DNSBL host.

    Useful for vetting user added information posted to web applications.
    """

    # ``ip`` is already in the form '4.3.2.1.sbl-xbl.spamhaus.org'
    try:
        socket.gethostbyname(ip)
        return True
    except socket.gaierror:
        return False

def getname(ip, DNSBL_HOST):
    """Turns an IP address into a format used by DNS blacklists."""
    iplist = ip.split('.')
    iplist.reverse()
    ip = '.'.join(iplist) + '.' + DNSBL_HOST
    return ip

import socket
domain = 'www.voidspace.org.uk'
ip = socket.gethostbyname(domain)
mungedname = getname(ip, 'sbl-xbl.spamhaus.org')
if blacklisted(mungedname):
    print 'Looks like %s is a spammer.' % domain

I've turned this into an example CGI script.

It will look up an IP address, or domain name, with several common blacklists.

In order to check it is working, you can do a test with 127.0.0.2 as the IP address. Most blacklists will return a positive on this.
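
For reference, here is the same lookup as a small sketch in current Python 3, with the input validated and the resolver passed in so it can be tried without touching the network. The names `dnsbl_name` and `is_listed` are made up for this sketch and are not part of the CGI script.

```python
import ipaddress
import socket


def dnsbl_name(ip, dnsbl_host):
    """Turn '1.2.3.4' into '4.3.2.1.<dnsbl_host>', rejecting non-IPv4 input."""
    address = ipaddress.IPv4Address(ip)   # raises ValueError on bad input
    octets = str(address).split('.')
    return '.'.join(reversed(octets)) + '.' + dnsbl_host


def is_listed(ip, dnsbl_host, resolve=socket.gethostbyname):
    """Return True if ``ip`` resolves on the blacklist ``dnsbl_host``."""
    try:
        resolve(dnsbl_name(ip, dnsbl_host))
    except socket.gaierror:
        # no DNS record for the munged name means the IP is not listed
        return False
    return True
```

Calling `is_listed('127.0.0.2', 'sbl-xbl.spamhaus.org')` with the default resolver is the test lookup described above. Passing your own `resolve` function lets you check the logic offline.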

The only problem is - it doesn't work. At least, most of the spammer websites (and the IP addresses they post from) aren't blacklisted. sbl-xbl.spamhaus.org is supposed to include spammer websites - but the ones cluttering my guestbook aren't getting listed.

For example :

It seems like this problem is harder to solve than I thought. sigh

Anyway - just in case this is useful to you, I'll post the whole source below. It needs cgiutils.

It should be fairly obvious how it works.

#!/usr/bin/python -u
import os
import sys
import socket

import cgitb
cgitb.enable()

sys.path.append('../modules')
sys.path.append('modules')
from cgiutils import *

DNSBL_LIST = [
    'sbl-xbl.spamhaus.org',
    'relays.ordb.org',
    'dns.rfc-ignorant.org',
    'postmaster.rfc-ignorant.org',
    'http.dnsbl.sorbs.net',
    'misc.dnsbl.sorbs.net',
    'spam.dnsbl.sorbs.net',
    'bl.spamcop.net',
    'bsb.spamlookup.net',
    'opm.blitzed.org',
    ]

def blacklisted(ip):
    """
    Returns ``True`` if ip address is a blacklisted IP (i.e. from a spammer).

    IP should first be created by calling ``getname(ip)`` - where you specify
    the DNSBL host.

    Useful for vetting user added information posted to web applications.
    """

    # ``ip`` is already in the form '4.3.2.1.sbl-xbl.spamhaus.org'
    try:
        socket.gethostbyname(ip)
        return True
    except socket.gaierror:
        return False

def getname(ip, DNSBL_HOST):
    """Turns an IP address into a format used by DNS blacklists."""
    iplist = ip.split('.')
    iplist.reverse()
    ip = '.'.join(iplist) + '.' + DNSBL_HOST
    return ip

header = '''<html><head><title>Test Blacklist Lookup</title></head><body>
<center><h1>Spammer Check</h1>
By
<em><a href="http://www.voidspace.org.uk/python/index.shtml">Fuzzyman</a></em>
'''

form = '''<br><br>
**result**
<br><br>
<form action="**scriptname**" method="GET">
    Enter IP or web address to check :
    <input type="text" value="**ip**" name="ip">
</form>
'''

footer = '</center></body></html>'

if __name__ == '__main__':
    cgiprint(serverline)
    cgiprint()
    print header
    #
    ip = getrequest(['ip'])['ip']
    result = []
    if ip:
        if ip.startswith('http://'):
            ip = ip[7:]
        if ip.endswith('/'):
            ip = ip[:-1]
        try:
            newip = socket.gethostbyname(ip)
        except socket.error:
            result.append('<br><h2>Failed to resolve %s</h2><br>' % ip)
        else:
            result.append('<br><h2>Checking %s</h2><br>' % ip)
            for entry in DNSBL_LIST:
                test_ip = getname(newip, entry)
                if blacklisted(test_ip):
                    result.append('This address <strong>is blacklisted'
                        '</strong> by <em>%s</em>.' % entry)
                else:
                    result.append('This address <strong>is not blacklisted'
                        '</strong> by <em>%s</em>.' % entry)
    #
    rep = {'**result**': '<br><br>'.join(result),
        '**scriptname**': os.environ.get('SCRIPT_NAME', ''),
        '**ip**': ip}
    print replace(form, rep)
    print footer

It's worth noting that the different blacklists all keep lists for different purposes (relay mail servers, open proxies etc). For more information, see the relevant websites.

Posted by Fuzzyman on 2005-11-28 15:25:10

