
Cataloging eBay Purchases the Hard Way

January 9, 2010

As part of the wwiigis.org effort, I’m buying a ton of World War 2-related imagery from eBay, including a lot of neat aerial photography. Once I win an auction, I save the auction page in Firefox via ‘Save Page As’, which writes out an HTML page and a folder containing the images for that auction. While putting together a separate blog post, I wanted access to the auction title and auction photograph for each auction without having to manually go through a million folders. What follows is a quick and dirty Python script that gets the job done, albeit in a pretty inefficient way. I’m sharing it on the assumption that there’s one other person in the universe in a similar situation who doesn’t want to waste an hour in programming land.
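To make that layout concrete, here is a minimal sketch of how a saved page and its image folder pair up. It assumes the folder is named after the page with a _files suffix and that the page ends in .htm, which is what the script further down relies on. The helper name page_for_folder is hypothetical and not part of the script.

```python
import os

FILES_SUFFIX = '_files'  # assumed folder suffix used by 'Save Page As'


def page_for_folder(folder):
    """Return the .htm path that should sit next to an image folder.

    Returns None when the folder does not look like a saved-page folder.
    """
    folder = folder.rstrip(os.sep)
    if not folder.endswith(FILES_SUFFIX):
        return None
    # Slice the suffix off. str.strip('_files') would instead remove any
    # of the characters _, f, i, l, e, s from both ends of the name.
    return folder[:-len(FILES_SUFFIX)] + '.htm'


print(page_for_folder('aerial photos_files'))
print(page_for_folder('misc'))
```

Slicing the suffix matters for auction names that begin or end with letters from "_files": a folder called self_files should map to self.htm, not to an empty name.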

Check out the latest images below (soon to be live on wwiigis.org):

And here’s the code:

import os
import re
from PIL import Image

# A script to extract auction titles and
# copy a representative image
# to a specified output directory.
# giencke@gmail.com

auction_title = re.compile(
    r'<title>(.*?)</title', re.IGNORECASE)
auction_images = re.compile(
    r'img.*?src="(.*?)"\s', re.IGNORECASE)

# Directory containing saved ebay pages
INPUT_DIRECTORY = "some dir"

# Output directory to store representative
# image from auction, maybe
OUTPUT_DIRECTORY = 'some dir'

# Output file to contain listing of files, and copied images
OUTPUT_TEXTFILE_NAME = 'images.txt'

# Copy image to OUTPUT_DIRECTORY?
COPY_IMAGE = True

# Some images aren't worth processing, those go here
EBAY_IMAGES_TO_IGNORE = ['logoEbay_x45.gif', 'noscript']

# Open the listing file for writing
output_file_obj = open(
    os.path.join(OUTPUT_DIRECTORY,
                 OUTPUT_TEXTFILE_NAME), 'w')

# Recursively go through image folders
# containing saved eBay pages
for ebay_folders in os.walk(INPUT_DIRECTORY):

  # The first tuple value is the directory
  parent_directory = ebay_folders[0]

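  # Open the actual auction html. Drop the '_files' suffix by
  # slicing: str.strip('_files') removes any of those characters
  # from both ends and can mangle the page name.
  page_base = parent_directory
  if page_base.endswith('_files'):
    page_base = page_base[:-len('_files')]
  ebay_page = '%s.htm' % page_base
  try:
    in_html = open(ebay_page).read()
  except IOError:
    continue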

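  # The auction name lives in the html title; skip pages without
  # one. Everything before the last ' - ' is the name.
  title_match = auction_title.search(in_html)
  if title_match is None:
    continue
  auction_name = ' - '.join(
      title_match.group(1).split(' - ')[:-1])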

  # This tuple will be used to contain the largest image
  largest_image = (0,)

  for image in set(auction_images.findall(in_html)):

    image_basename = os.path.basename(image)
    image_to_open = os.path.join(ebay_folders[0],
                                 image_basename)

    # open up the image to get dimensions, etc
    try:
      in_image = Image.open(image_to_open)
    except IOError:
      continue

    if image_basename not in EBAY_IMAGES_TO_IGNORE:
      # TODO: Check for preferred formats

      width, height = in_image.size
      image_dimensions = width * height
      if image_dimensions >= largest_image[0]:
        largest_image = (image_dimensions,
                         image_to_open,
                         '%sx%s' % (width, height))

  # No usable image was found for this auction, so skip it
  if len(largest_image) == 1:
    continue

  # Format for output text
  output_file_obj.write('%s || %s || %s\n' % (
      auction_name.title(), os.path.basename(largest_image[1]),
      largest_image[-1]))

  if COPY_IMAGE:
    out_image = Image.open(largest_image[1])
    out_image.save(os.path.join(
        OUTPUT_DIRECTORY, os.path.basename(largest_image[1])))
output_file_obj.close()
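
Since the point of images.txt is to feed another blog post, here is a hedged sketch of reading a line of it back. It assumes the "title || image || WIDTHxHEIGHT" format written by the script above. The helper name parse_listing_line is hypothetical, and the sample line is made up for illustration.

```python
def parse_listing_line(line):
    """Split one 'title || image || WxH' line into its parts."""
    # Split from the right so a ' || ' inside the title stays in the title
    title, image_name, dimensions = line.rstrip('\n').rsplit(' || ', 2)
    width, height = (int(n) for n in dimensions.split('x'))
    return title, image_name, width, height


sample = 'Wwii Aerial Photo Berlin || photo_1.jpg || 640x480\n'
print(parse_listing_line(sample))
```

Looping over the lines of the file and calling this on each one gives the title, the copied image name, and the pixel dimensions for every auction.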

Filed under: programming,wwiigis

