Cataloging eBay Purchases the Hard Way
January 9, 2010
As part of the wwiigis.org effort, I’m buying a ton of World War 2-related imagery from eBay, including a lot of neat aerial photography. Once I win an auction, I save the auction page in Firefox via ‘Save Page As’, which writes out an HTML page and a folder containing the images for that auction. While getting a separate blog post together, I wanted access to the auction title and auction photograph for each auction without having to manually go through a million folders. So what follows is a quick and dirty Python script that gets the job done, albeit in a pretty inefficient way. I’m sharing this script on the assumption that there’s one other person in the universe in a similar situation who doesn’t want to waste an hour in programming land.
Check out the latest images below (soon to be live on wwiigis.org):
And here’s the code:
import os
import re
from PIL import Image
# A script to extract auction titles and
# copy a representative image
# to a specified output directory.
# giencke@gmail.com
auction_title = re.compile(
    r'<title>(.*?)</title', re.IGNORECASE)
auction_images = re.compile(
    r'img.*?src="(.*?)"\s', re.IGNORECASE)
# Directory containing saved ebay pages
INPUT_DIRECTORY = "some dir"
# Output directory to store representative
# image from auction, maybe
OUTPUT_DIRECTORY = 'some dir'
# Output file to contain listing of files, and copied images
OUTPUT_TEXTFILE_NAME = 'images.txt'
# Copy image to OUTPUT_DIRECTORY?
COPY_IMAGE = True
# Some images aren't worth processing, those go here
EBAY_IMAGES_TO_IGNORE = ['logoEbay_x45.gif', 'noscript']
output_file_obj = open(
    os.path.join(OUTPUT_DIRECTORY,
                 OUTPUT_TEXTFILE_NAME), 'w')
# Recursively go through the image folders
# containing saved eBay pages
for ebay_folders in os.walk(INPUT_DIRECTORY):
    # The first tuple value is the directory
    parent_directory = ebay_folders[0]
    # Firefox names the image folder '<page name>_files'. Slice the
    # suffix off; str.strip('_files') would remove any of those
    # characters from both ends of the path.
    if not parent_directory.endswith('_files'):
        continue
    # Open the actual auction html
    ebay_page = '%s.htm' % parent_directory[:-len('_files')]
    try:
        in_html = open(ebay_page).read()
    except IOError:
        continue
    # The auction name lives in the html title
    title_match = auction_title.search(in_html)
    if title_match is None:
        continue
    auction_name = ''.join(
        title_match.groups()[0].split(' - ')[:-1])
    # This tuple will be used to contain the largest image
    largest_image = (0,)
    for image in set(auction_images.findall(in_html)):
        image_basename = os.path.basename(image)
        image_to_open = os.path.join(parent_directory,
                                     image_basename)
        # Open up the image to get dimensions, etc
        try:
            in_image = Image.open(image_to_open)
        except IOError:
            continue
        if image_basename not in EBAY_IMAGES_TO_IGNORE:
            # TODO: Check for preferred formats
            width, height = in_image.size
            image_dimensions = width * height
            if image_dimensions >= largest_image[0]:
                largest_image = (image_dimensions,
                                 image_to_open,
                                 '%sx%s' % (width, height))
    # Skip auctions where no usable image was found
    if len(largest_image) == 1:
        continue
    # Format for output text
    output_file_obj.write('%s || %s || %s\n' % (
        auction_name.title(), os.path.basename(largest_image[1]),
        largest_image[-1]))
    if COPY_IMAGE:
        out_image = Image.open(largest_image[1])
        out_image.save(os.path.join(
            OUTPUT_DIRECTORY, os.path.basename(largest_image[1])))
output_file_obj.close()
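One step that is easy to get wrong is turning a ‘something_files’ folder name back into the saved page name. Python’s str.strip removes a set of characters from both ends rather than a suffix, so it can eat into the auction name itself. Here’s a minimal sketch of a safer way to do that step. The helper name page_for_folder is hypothetical (it isn’t part of the script), and the ‘.htm’ extension is an assumption that simply matches what the script expects:

```python
def page_for_folder(folder, suffix='_files', extension='.htm'):
    """Return the saved page path for a '<name>_files' folder, or None.

    Hypothetical helper for illustration; not part of the original script.
    """
    if not folder.endswith(suffix):
        return None
    # Slice off exactly the suffix instead of a set of characters.
    return folder[:-len(suffix)] + extension


# str.strip() treats its argument as a set of characters, so it
# also removes matching letters from the name itself.
stripped = 'sale_files'.strip('_files')
fixed = page_for_folder('sale_files')
print(stripped)  # a
print(fixed)     # sale.htm
```

Folders that don’t end in ‘_files’ come back as None, so a loop over os.walk can skip them without trying to open a page that was never saved.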
Filed under: programming, wwiigis







