Wednesday, August 25, 2010

Importing data from Spreadsheets to App Engine

Google App Engine provides the remote_api mechanism for uploading and downloading data from the datastore. It's handy and lets you import different types of data, but requires a certain amount of setup, and well, sometimes I'm lazy and don't feel like going through that setup. So, another way that you can easily import data into your datastore is to store it in a Google spreadsheet, publish the sheet, and write a handler to import the spreadsheet rows as datastore entities.

For example, I created a spreadsheet to store information on Wave extensions, using one column for the URL and another column to indicate if they're featured or not.

Then, I published that spreadsheet using the Share->Publish menu, and constructed a URL for the JSON database-like output:

https://spreadsheets.google.com/feeds/list/0Ah0xU81penP1dDNwSFROSU5KVlFRbmo5cERsTElKTGc/od6/public/values?alt=json

To get the URL for your own public spreadsheet, just change the spreadsheet key (the long string there) and the worksheet ID (the first sheet is always 'od6').

That JSON includes an array of entry objects, and each entry object contains an object for each of the columns, e.g.

[...
 gsx$url: {$t: "http://api.rucksack.com/hostelwithme.xml"},
 gsx$featured: {$t: "yes"}
..]

Note: Column headers are stripped of whitespace and lowercased when converted to keys in the JSON feed, so I always just start off with them that way in the spreadsheet to make it painless to find them in the JSON.

Now, I write a simple handler that will pull in that JSON, parse each entry object, and convert them into datastore entities.

class ImportAppsActionHandler(BaseHandler):
 """ Handler for importing existing apps."""

 def get(self):
   user = users.get_current_user()
   # Need admin access to import
   if not user.is_current_user_admin():
     self.error(403)
   # Fetch JSON of published spreadsheet
   url = "http://spreadsheets.google.com/feeds/list/0Ah0xU81penP1dDNwSFROSU5KVlFRbmo5cERsTElKTGc/od6/public/values?alt=json"
   result = urlfetch.fetch(url)
   if result.status_code == 200:
     feed_obj = simplejson.loads(result.content)
     if "feed" in feed_obj:
       entries = feed_obj["feed"]["entry"]
       # Make an Application entity for each entry in feed
       for entry in entries:
         url = entry['gsx$url']['$t']
         featured = entry['gsx$featured']['$t']
         app = models.Application()
         app.url = url
         app.moderation_status = models.Application.MOD_APPROVED
         app.AddAuthor(user)
         app.AddMetadata()
         app.put()

   # Clear the memcache
   memcache.flush_all()

When I visit that handler, it imports the data, and works both on the local devapp server and the public server in the same way.

There are various caveats to this technique, of course. First, your spreadsheet needs to be published. If you wanted to do it with a private spreadsheet, for more sensitive data, you would need to use the full spreadsheets API and do an authentication dance. Second, your handler is limited to the typical 30 seconds limit for an App Engine request. If you wanted to use it to import many rows of data, you'd probably want to split it up across multiple requests by using the deferred task queue or re-directing with pagination.

But, hey, it was useful for my situation, so maybe it's useful for one more situation out there in the world. :)

1 comment:

Jairo Vasquez said...

I like your creative way to upload data to the datastore but please don't use in python that java camel style for methods :s