proxycleaner.py

A function to sort out web pages fetched via a proxy.
Undoes URL modifications made by approx.py or the James Marshall CGI proxy.

Copyright Michael Foord
You are free to modify, use and relicense this code.
No warranty express or implied for the accuracy, fitness to purpose or otherwise for this code....
Use at your own risk !!!

E-mail michael AT foord DOT me DOT uk
Maintained at www.voidspace.org.uk/atlantibots/pythonutils.html

proxycleaner.py

Written in python - beauty matters.

Will clean HTML and CSS files modified by a cgi-proxy. It searches through a specified directory (and its sub-directories), modifying all html/css files it finds. It uses the directory specified inside the script - or in a config file.

To run - either edit the default settings in this script first, or create an external config file. See the example one provided and the explanations below.

If you are in a restricted or censored internet environment, your only way of freely browsing the web may be through a CGI proxy like approx.py or the James Marshall perl cgi-proxy. These modify web pages fetched through them, so that images and links are also fetched through the proxy. (In fact approx can turn this off - but it's not always convenient when you're browsing and saving pages.) If the proxy is private you may not want to give away the URL, or you might just want to store a better copy of the original. This script knows how these two proxies modify pages and can go through all the files in a directory and restore the URLs in web pages you've saved.

To read and modify the default settings, see the start of the source code (using any text editor). Only if no config file is available will proxycleaner use its built-in settings. You might need to delete or rename the example 'config.txt' file to make sure the right settings are used.

If a file called config.txt exists in the current working directory when proxycleaner is run, it will read the settings from this.
Alternatively, you can supply a path to a config file as the command line argument. If proxycleaner is run with '?', 'h', 'help', '-h' etc as the command line argument, it displays this message.
(In order to load the config file the ConfigObj module must be available. This is the file 'fullconfigobj.py' that comes with proxycleaner.)

Following is a brief description of the values that proxycleaner needs in order to work.

    proxyurl     The proxy URL we are cleaning
    directory    Which directory are we searching for files to modify
    overwrite    Should modified files overwrite the original ?

If overwrite is False we need to know an additional thing :

    alt_add      What should we add to the start of filenames that we clean

Because the files I change often include css files, where the location relative to the main file and an unchanged filename are important, I haven't provided an option to save modified files in a different directory. I recommend having overwrite on - but keeping an original copy of the files in case anything goes wrong. If anyone would prefer to be able to specify a target directory let me know - it would be easy to add.

proxycleaner cleans all the files from a source directory. If you use different machines, the chances are that your source directory will be in a different place on each machine. proxycleaner lets you specify which directory to use for which machine. It works out which machine you are using through three 'marker files'. See the keywords 'home', 'work' and 'thirdpc'. Each of these should point to a file that only exists on one of the computers. For example, on my home computer I have a file called 'c:\HOME', at work I have a file called 'c:\WORK'. When proxycleaner.py runs, it checks for these files one by one. It assumes that the first one it finds identifies the machine it's on - and uses the appropriate directory.

If you don't want to bother with all that, just set the value of 'setlocation' to '1' and put the directory you use in the setting for 'defaultwork'.
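A minimal sketch of the location check described above. It assumes the settings have already been read into a plain dict, and that the marker files are tried in the order work, home, thirdpc (the order is not stated here). The name pick_directory is made up for illustration and is not taken from proxycleaner's source.

```python
import os

def pick_directory(settings):
    """Return the directory to clean, or None if no location can be found.

    `settings` is a plain dict using the config-file keys. This helper is
    illustrative only; it is not proxycleaner's own code.
    """
    locations = ('work', 'home', 'thirdpc')   # setlocation 1, 2, 3
    choice = int(settings.get('setlocation', 0))
    if choice in (1, 2, 3):
        # Location set by hand: no marker files are checked.
        return settings['default' + locations[choice - 1]]
    for name in locations:
        # Assumed order; the first marker file that exists decides the machine.
        if os.path.isfile(settings[name]):
            return settings['default' + name]
    # No marker file found. What proxycleaner does in this case is not
    # documented, so the sketch just reports it.
    return None
```

With 'setlocation' set to 1, the function returns the 'defaultwork' directory without touching the file system.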
If the marker file that is found also contains a line that looks like :

    #proxy http://www.proxyurl.com

proxycleaner will use that instead of the 'proxyurl' entry in the settings/config file. This means I can distribute my example config file without having to give away my proxy location !

So the full config file looks like this :

    setlocation = 0
    # Set to 0 for proxycleaner to check for location using work, home and thirdpc marker files.
    # Otherwise set to 1, 2 or 3 to manually set location (1 is work, 2 is home, 3 is thirdpc)
    # This choice affects which directory is used - 'defaultwork', 'defaulthome', 'defaultthirdpc'

    # Marker Files
    # These files are only checked if 'setlocation' is 0
    work = 'C:\WORK'        # If this file exists then proxycleaner will search through the 'defaultwork' directory for files to modify
    home = 'C:\HOME'        # if this file exists.... 'defaulthome' is used
    thirdpc = '\PPC'        # ... 'defaultthirdpc'

    defaultwork = 'C:\Documents and Settings\michael\Desktop\Processing Files'
    defaulthome = 'C:\Documents and Settings\Voidspace\Desktop\Processing Files\new'
    defaultthirdpc = '\Storage Card\PP\docs'

    #overwrite = False
    overwrite = True        # Do you want your cleaned files to just overwrite the original ?
    alt_add = 'clean_'      # If overwrite is set to False then this attribute is added to the start of modified filenames
    proxyurl = ''           # your proxy location (url) - this is the url we are cleaning.

As you can see, in my config file I don't specify a proxyurl - that's set in my marker file using '#proxy'.
I use marker files - so setlocation is 0.

Anyway - I hope proxycleaner is useful. If anyone doesn't have python installed and wants a windoze executable then just get in touch... it can be done....

TODO/ISSUES

Throw an error if more than one command line argument is given ?
Could remove the forms and cookie stuff from pages modified using James Marshall's proxy.
Should we specify an alternative directory for saving changed files in ?
Remove the 'saved from..' line ?
Remove the 'resource modified by proxy' comment ?
Add verbosity levels using StandOut ?

CHANGELOG

03-09-04    Version 1.0.1
Slight change to clean up nntp links mangled by the perl cgiproxy.

13-08-04    Version 1.0.0
A reasonable attempt.
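
For illustration only: two of the rules described earlier, the '#proxy' line in a marker file and the overwrite / alt_add naming, might be sketched as below. Both function names are hypothetical. The exact matching of the '#proxy' line (whitespace-separated, first match wins) is an assumption rather than something taken from proxycleaner's source.

```python
import os

def proxy_from_marker(marker_path, default_proxyurl):
    """Return the URL from a '#proxy <url>' line in the marker file.

    Falls back to `default_proxyurl` (the 'proxyurl' setting) when the
    marker file has no such line. Assumed parsing, not the script's own.
    """
    with open(marker_path) as marker:
        for line in marker:
            parts = line.split()
            if len(parts) >= 2 and parts[0] == '#proxy':
                return parts[1]
    return default_proxyurl

def target_path(path, overwrite, alt_add):
    """Return where the cleaned copy of `path` should be written.

    The file stays in its own directory either way, so relative links
    between html and css files keep working.
    """
    if overwrite:
        return path
    folder, filename = os.path.split(path)
    return os.path.join(folder, alt_add + filename)
```

With overwrite off and alt_add set to 'clean_', a file called page.html would be written next to the original as clean_page.html.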