Google Chat History Downloader

Update 2011-11-09:
Gmail now officially supports downloading chat history via IMAP. Thanks to Steve for pointing it out. It can be enabled in the “Labels” section of Gmail settings.
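
For anyone taking the IMAP route, here is a minimal sketch of saving each chat as an .eml file with Python 3's standard imaplib. The folder name "[Gmail]/Chats" and the helper name save_chats are assumptions for illustration, not something confirmed above; check the folder list of your own account first.

```python
import os

# Assumption: the folder Gmail exposes once chats are shown in IMAP.
CHATS_FOLDER = '"[Gmail]/Chats"'


def save_chats(conn, out_dir, folder=CHATS_FOLDER):
    """Save every message in `folder` as <uid>.eml; return how many were written.

    `conn` is a logged-in imaplib.IMAP4_SSL connection (hypothetical helper).
    """
    os.makedirs(out_dir, exist_ok=True)
    status, _ = conn.select(folder, readonly=True)
    if status != "OK":
        raise RuntimeError("could not open %s" % folder)
    status, data = conn.uid("SEARCH", None, "ALL")
    written = 0
    for uid in data[0].split():
        path = os.path.join(out_dir, uid.decode("ascii") + ".eml")
        if os.path.exists(path):
            continue  # already downloaded on an earlier run
        status, parts = conn.uid("FETCH", uid, "(RFC822)")
        with open(path, "wb") as f:
            f.write(parts[0][1])  # raw RFC 822 message bytes
        written += 1
    return written

# Usage, with your own credentials:
#   import imaplib
#   conn = imaplib.IMAP4_SSL("imap.gmail.com")
#   conn.login("username@gmail.com", "password")
#   save_chats(conn, "chats")
#   conn.logout()
```

Like the script further down, it skips files that already exist, so an interrupted run can simply be restarted.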

Update 2011-08-30:

Based on the comments, this doesn’t work anymore. I’d recommend checking out this thread for solutions: http://www.google.com/support/forum/p/gmail/thread?tid=7a7d2d6da5be047f

I personally have been using a JavaScript-based solution for exporting recent chat data, but it still doesn’t solve the problem of violating the TOS and getting blocked. If there is enough interest, I’ll post my code.

A couple of weeks ago, I decided to migrate from one Google Account to another. I was able to transfer all of my emails without too much difficulty. However, after looking around for a while I could not find any way to export my Google Talk chat history, and I don’t think saved chats can be accessed over either IMAP or POP. I did notice, though, that the Gmail web interface lets you view a saved chat as a raw message.

There happens to be an old Python library for interacting with the Gmail web interface called libgmail. It does not scale well to large numbers of messages, however, so I had to write my own method that processes results one page at a time. I was also easily blocked when running it for a long time, so I added 13-second delays after every request so as not to get my account suspended. It took me a day and a half to export all of the messages. I’m not sure whether that is overkill, but I am tired of getting my account blocked.

Anyway, this program goes through and saves each chat history message as an .eml file. Once they are in that format, it is not very hard to get them into a different Gmail account, but I’ll save that for another post.

import os
import time
import libgmail # http://libgmail.sourceforge.net/

def thread_search(ga, searchType, **kwargs):
    index = 0
    while (index == 0) or index < threadListSummary[libgmail.TS_TOTAL]:
        threadsInfo = []
        items = ga._parseSearchResult(searchType, index, **kwargs)
        try:
            threads = items[libgmail.D_THREAD]
        except KeyError:
            break
        else:
            for th in threads:
                if not type(th[0]) is libgmail.types.ListType:
                    th = [th]
                threadsInfo.append(th)
            threadListSummary = items[libgmail.D_THREADLIST_SUMMARY][0]
            threadsPerPage = threadListSummary[libgmail.TS_NUM]
            index += threadsPerPage
        yield libgmail.GmailSearchResult(ga, (searchType, kwargs), threadsInfo)

ga = libgmail.GmailAccount("username@gmail.com", "password")
ga.login()

for page in thread_search(ga, "query", q="is:chat"):
    print "New Page"
    time.sleep(13)
    for thread in page:
        if thread.info[0] == thread.info[10]:
            # Common case: Chats that only span one message
            filename = "chats/%s_%s.eml" % (thread.id, thread.id)
            #only download the message if we don't have it already
            if os.path.exists(filename):
                print "already have %s" % filename
                continue
            print "Downloading raw message: %s" % filename,
            message = ga.getRawMessage(thread.id).decode('utf-8').lstrip()
            print "done."
            file(filename, 'wb').write(message)
            time.sleep(13)
            continue
        # Less common case: A thread that has multiple messages
        print "Looking up messages in thread %s" % thread.id
        time.sleep(13)
        for message in thread:
            filename = "chats/%s_%s.eml" % (thread.id, message.id)
            #only download the message if we don't have it already
            if os.path.exists(filename):
                print "already have %s" % filename
                continue
            print "Downloading raw message: %s" % filename,
            file(filename, 'wb').write(message.source.lstrip())
            print "done."
            time.sleep(13)

Message Queue

I wrote some code for a group project that I am kind of proud of. It’s not very clean code, but it accomplishes something cool. It’s a way for a website to send messages to a browser in real time, without the browser needing to constantly check whether the website has a message ready to be sent.

We did most of our project in PHP. Here is our PHP code:

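<?php
include_once("json.php");

// Fetch a URL with the curl command-line tool and return the first line of output.
function get_url($url) {
    $output = array();
    // exec() already takes $output by reference; escapeshellarg() keeps
    // request data such as $_GET['id'] from being interpreted by the shell.
    exec("curl " . escapeshellarg($url), $output);
    // An empty keep-alive response produces no output lines.
    return isset($output[0]) ? $output[0] : "";
}

function msgq_new() {
    $id = get_url("http://127.0.0.1:8888/new/");
    return $id;
}

function send_message($data) {
    return get_url("http://127.0.0.1:8888/post/" . urlencode(array2json($data)));
}

if (isset($_GET['action']) && $_GET['action'] == "wait_for_message") {
    header("Content-type: text/plain");
    echo get_url("http://127.0.0.1:8888/wait/" . $_GET['id']);
}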

Here is the JavaScript part. We were using the YUI library, but you could easily do this without it:

function wait_for_message() {
  var id = document.body.id;
  YAHOO.util.Connect.asyncRequest('GET', '/wait_for_message.php?id=' + id, {success: function(response) {
    wait_for_message();
    if (!response.responseText || response.responseText == "\n") return; // Server sent a nop
    var data = YAHOO.lang.JSON.parse(response.responseText);
    window[data.handler](data);
  }});
}
YAHOO.util.Event.addListener(window,'load', wait_for_message)

Here is the Python part:

from time import time
from random import uniform
from Queue import Queue,Empty
from socket import error
from threading import Thread
from urllib import unquote_plus
from wsgiref.simple_server import make_server


clients={} # A Queue and a http server thread for each client

def send_to_all(msg):
    print "sending %r to all" % msg
    for k, x in clients.items()[:]:
        if (time() - x.last_get > 300):
            # Client has not asked for any messages for 5 minutes
            # Delete them.
            x.active = False
            del clients[k]
            continue
        print "sending to %s" % k
        x.q.put(msg)
    print "finished sending messages."

def wait_for_message(q):
    try:
        #Wait for a new message.
        return q.get(True, uniform(55, 59))
    except Empty:
        return "" # no message within a minute, send keep-alive

def handle(environ, start_response):
    start_response('200 OK', [('Content-type', 'text/plain')])
    path = environ['PATH_INFO']

    if path.startswith("/wait/"):
        id = unquote_plus(path[len("/wait/"):])
        if id not in clients:
            clients[id] = Server()
        clients[id].last_get = time()
        print "%s is waiting for a message..." % id
        return wait_for_message(clients[id].q)

    if path.startswith("/new"):
        from hashlib import md5
        id = md5(str(time())).hexdigest()
        clients[id] = Server()
        return id

    if path.startswith("/post/"):
        msg = unquote_plus(path[len("/post/"):])
        send_to_all(msg)

    return ""

class Server(Thread):
    "A Queue and a http server thread for each client"
    def __init__(self):
        Thread.__init__(self)
        self.q = Queue()
        self.setDaemon(1)
        self.active = True
        self.last_get = time()
        self.start()
    def run(self):
        while self.active:
            self.httpd.handle_request()

def start():
    httpd = make_server('0.0.0.0', 8888, handle)
    Server.httpd = httpd
    httpd.serve_forever()

start()
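
Stripped of the HTTP handling, the core of the server is a per-client queue plus a blocking read with a timeout. The following is a rough Python 3 sketch of just that part; the Client class, the STALE_AFTER constant, and the now parameter are names made up for illustration and are not in the listing above.

```python
import time
from queue import Queue, Empty

STALE_AFTER = 300  # seconds without a poll before a client is dropped


class Client:
    """One browser: a message queue plus the time of its last poll."""
    def __init__(self):
        self.q = Queue()
        self.last_get = time.time()


def send_to_all(clients, msg, now=None):
    """Queue msg for every live client and forget the stale ones."""
    now = time.time() if now is None else now
    # Iterate over a copy so entries can be deleted inside the loop.
    for key, client in list(clients.items()):
        if now - client.last_get > STALE_AFTER:
            del clients[key]
            continue
        client.q.put(msg)


def wait_for_message(client, timeout):
    """Block until a message arrives; return "" as a keep-alive on timeout."""
    client.last_get = time.time()
    try:
        return client.q.get(True, timeout)
    except Empty:
        return ""
```

The timeout is what keeps the long poll alive: an empty reply tells the browser to ask again before any proxy in between gives up on the connection.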

This one moves all of my new drafts into my inbox every 15 minutes.

I don’t think there is a way to do this without polling.

import libgmail
import time
from getpass import getpass

name = raw_input("Gmail account name: ")
pw = getpass()

ga = libgmail.GmailAccount(name, pw)
ga.login()
while 1:
    for thread in ga.getMessagesByFolder("drafts", True):
        if len(thread) == 1: #only apply to new threads
            print thread.id, thread.subject
            thread._account._doThreadAction("ib", thread)
    time.sleep(15 * 60)

This one logs me into pop.org and returns the session cookies.

import urllib
import cookielib

class dummy_request:
    def get_full_url(self):
        return '/'

def extract_cookies(response):
    """Given a response, returns a dictionary of the cookies."""
    ns_headers = response.headers.getheaders("Set-Cookie")
    attrs_set = cookielib.parse_ns_headers(ns_headers)
    cookie_tuples = cookielib.CookieJar()._normalized_cookie_tuples(attrs_set)
    cookies = {}
    for tup in cookie_tuples:
        name, value, standard, rest = tup
        cookies[name] = value
    return cookies
    
def log_into_pop(username, password):
    url = 'http://www.peopleofpraise.org/user/login'
    args = {'edit[name]': username, 'edit[pass]': password}
    response = urllib.urlopen(url, urllib.urlencode(args))
    return extract_cookies(response)

if __name__ == '__main__':
    import getpass
    username = 'Collin Anderson'
    password = getpass.getpass()
    print log_into_pop(username, password)
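
The listing above leans on a private cookielib method. As a hedged alternative, here is a small Python 3 sketch that gets the same name-to-value dictionary from the standard http.cookies module. It takes the Set-Cookie header strings as a plain list, which is an assumption about how you collect them.

```python
from http.cookies import SimpleCookie


def extract_cookies(set_cookie_headers):
    """Return {name: value} from a list of Set-Cookie header strings."""
    cookies = {}
    for header in set_cookie_headers:
        jar = SimpleCookie()
        jar.load(header)  # attributes such as path or Max-Age are parsed and ignored
        for name, morsel in jar.items():
            cookies[name] = morsel.value
    return cookies

# With urllib.request, the headers could come from:
#   response.headers.get_all("Set-Cookie") or []
```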

Paul Graham’s Essays

Paul Graham’s Essays don’t show when they were written, so this one goes through all of the essays and finds out.

import feedparser
import re
import urllib

#Regular expression that looks for something like this: "March 2004"
#(any capitalized word followed by a year from 2000 to 2009)
date_search = re.compile('([A-Z][a-z]* 200[0-9])')

def open_url(url):
    response = urllib.urlopen(url)
    data = response.read()
    response.close()
    return data

rss_feed = feedparser.parse('http://www.aaronsw.com/2002/feeds/pgessays.rss')

for entry in rss_feed['entries']:
    print entry['title'],
    url = entry['link']
    page = open_url(url)
    dates = date_search.findall(page)
    if not dates:
        print 'Unknown'
    else:
        print dates[0]
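
The pattern in the script accepts any capitalized word before the year and stops at 2009, so it can pick up something like "Copyright 2009". A stricter variant, sketched here in Python 3 under the assumption that essays are dated with a full English month name, could look like this (first_date is a hypothetical helper):

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# A full month name followed by a four-digit year starting with 19 or 20.
DATE_RE = re.compile(r"\b(?:%s) (?:19|20)\d{2}\b" % MONTHS)


def first_date(page):
    """Return the first 'Month YYYY' string found in page, or None."""
    match = DATE_RE.search(page)
    return match.group(0) if match else None
```

For example, first_date("Copyright 2009. Written November 2012") returns "November 2012", where the original pattern would have returned "Copyright 2009".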

Web Changes

This one checks for changes to webpages and sends me an email if one of them has changed. In theory it also works for pages that require a login, if you know how to extract the cookie data.

import md5
import urllib2
import smtplib
import time
import getpass

smtpserver = 'smtp.umn.edu'
smtpuser = 'ande7966'
smtppass = getpass.getpass()
RECIPIENTS = ['cmawebsite@gmail.com']
SENDER = 'ande7966@umn.edu'

def send_email(message):
    session = smtplib.SMTP(smtpserver)
    session.starttls()
    session.login(smtpuser, smtppass)
    return session.sendmail(SENDER, RECIPIENTS, message)            

def check_for_changes():
    f = file('urls.txt', 'r')
    urls = []
    for line in f:
        line = line.split()
        if not line:
            continue
        data = {}
        data['url'] = line.pop(0)
        if len(line) > 0:
            data['hash'] = line.pop(0)
        if len(line) > 0:
            data['cookie'] = line.pop(0)
        urls.append(data)
    f.close()

    for row in urls:
        if not row:
            continue
        try:
            req = urllib2.Request(row['url'])
            if row.get('cookie'):
                # only send a Cookie header when urls.txt has one for this URL
                req.add_header('Cookie', row['cookie'])
            response = urllib2.urlopen(req)
            data = response.read()
            response.close()
            new_hash = md5.new(data).hexdigest()
        except Exception, e:
            print e
            if str(e) == "<urlopen error (11001, 'getaddrinfo failed')>":
                print "Internet connection Troubles."
                return
            if str(e) == "<urlopen error (10060, 'Operation timed out')>":
                print "Timed Out"
                continue
            new_hash = str(e).replace(' ', '_')
        if row.get('hash') != new_hash:
            print row['url'] + ' has changed: ' + new_hash
            try:
                print send_email('Subject: %s\r\n\r\n%s has changed: %s %s' % (row['url'], row['url'], row.get('hash'), new_hash))
            except Exception, e:
                print e.__class__
                print e
                return
            row['hash'] = new_hash

    f = file('urls.txt', 'w')
    for row in urls:
        f.write(row['url'] + ' ' + row.get('hash', '') + ' ' + row.get('cookie', ''))
        f.write('\n')
    f.close()

if __name__ == '__main__':
    try:
        while 1:
            print 'Checking now...'
            check_for_changes()
            print 'Checking again in 15 minutes'
            time.sleep(5 * 60)
            print 'Checking again in 10 minutes'
            time.sleep(5 * 60)
            print 'Checking again in 5 minutes'
            time.sleep(5 * 60)
    except Exception, e:
        import traceback
        traceback.print_exc()
    print 'done'
    raw_input()
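
The heart of the script is the urls.txt format (URL, last seen hash, optional cookie, separated by whitespace) and the hash comparison. Here is a small Python 3 sketch of just those two steps; parse_line and has_changed are hypothetical helper names, and the network fetch and email are left out.

```python
import hashlib


def parse_line(line):
    """Split one urls.txt line into url, hash, and cookie (missing fields are None)."""
    fields = line.split()
    fields += [None] * (3 - len(fields))  # pad short lines
    return {"url": fields[0], "hash": fields[1], "cookie": fields[2]}


def has_changed(row, body):
    """Hash the page body, store the new digest in row, and report whether it differs."""
    new_hash = hashlib.md5(body).hexdigest()
    changed = row["hash"] != new_hash
    row["hash"] = new_hash
    return changed
```

A URL that has never been fetched has no stored hash, so its first check always counts as a change, which matches how the script above behaves.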

Comet Chat Server

Here is a demo I wrote that shows how to use the Comet method of HTTP streaming. Of course, this was before it was named Comet.

from cgi import escape
from random import uniform
from Queue import Queue,Empty
from sets import Set
from socket import error
from threading import Thread
from urllib import unquote_plus
from wsgiref.simple_server import make_server

class Connection(Queue):
    """Handles the persistent connection between the client and server"""
    #This set could get messed up by multi-threading.
    objects=Set() #set of live connections

    def __init__(self,obj_up_hook=None):
        self.name=""
        self.obj_up_hook=obj_up_hook
        Queue.__init__(self)

    def __str__(self):
        return self.name

    def __repr__(self):
        return "Connection object: " + str(self)

    def online(self):
        self.objects.add(self)
        print '"' + str(self) + '" has joined'
        if self.obj_up_hook:
            self.obj_up_hook(self)

    def offline(self):
        self.objects.discard(self)
        print '"' + str(self) + '" has left'
        if self.obj_up_hook:
            self.obj_up_hook(self)

    def send_to_all(msg):
        """Sends a message to all online objects"""
        for x in Connection.objects:
            x.put(msg)
            
    send_to_all = staticmethod(send_to_all)

    def run(self,write,keep_alive=" "):
        """Waits for messages and outputs them until window is closed"""
        self.online()
        while 1:
            try:
                #Wait for a new message.
                m=self.get(True,uniform(10,15))
            except Empty:
                #The waiting timed out.
                m=keep_alive
            try:
                write(m)
            except error:
                #most likely the client closed the window
                self.offline()
                return

class ChatApp():
    """Handles a Request"""
    def __init__(self, environ, start_response):
        self.environ=environ
        self.start_response=start_response

    def index(self):
        """login page"""
        return """<script>
function submit(e){
 if(!e)e=window.event;
 if(e.keyCode==13){
  url="/main/"+input.value;
  location.href=url
 }
}
</script>
<body>
<table width=100% height=60%>
<td width=100% height=100%><center>Enter your name:<br><input id=input style="width:50%" onkeypress="submit(event)">"""

    def main(self,user):
        """main page"""
        return '''<script>
function submit(e){
 if(!e)e=window.event;
 if(e.keyCode==13){
  url="/ajax/'''+str(user)+'''?"+input.value;
  input.value="";
  if(window.ActiveXObject){ajax=new ActiveXObject("Microsoft.XMLHTTP")};
  if(window.XMLHttpRequest){ajax=new XMLHttpRequest()};
  ajax.open("GET",url,true);
  ajax.send(null);
 }
}
</script>
<body topmargin=0 bottommargin=0 leftmargin=0 rightmargin=0>
<table width=100% height=100% cellspacing=0 cellpadding=0>
<td width=80% height=100%>
<iframe id=thebox style="border-right:0;border-left:0;border-top:0;border-bottom:0" width=100% height=100% src="/top/'''+str(user)+'''"></iframe>
<td width=20% height=100%>
<iframe name=thelist style="border-right:0;border-left:0;border-top:0;border-bottom:0" width=100% height=100% src="/list/'''+str(user)+'''"></iframe>
<tr><td><input id=input style="width:100%" value="Type your message here" onkeypress="submit(event)">'''

    def refresh_online_list(self,connection):
        Connection.send_to_all('<script>u()</script>')
        
    def top(self,write,usern):
        """actual chat window"""
        #create another thread to serve new requests
        Server()

        write("""<body><script>
function u(){parent.frames["thelist"].location.reload();}
function s(str){document.write(str);window.scrollBy(0,100);}
</script>""")
        u=Connection(self.refresh_online_list)
        u.name=usern
        u.run(write)

    def onlinelist(self,user):
        """online list"""
        string =  "<b>" + str(len(Connection.objects)) + " Online:</b><br>"
        for u in Connection.objects:
            string += str(u)+"<br>\n"
        return string

    def ajax(self,user,message):
        """page that accepts messages"""
        #this escape function escapes all html and quotes
        print "received message: " + message
        print "sending message: " + self.esc(message)
        Connection.send_to_all("<script>s('" + "<b>" + self.esc(str(user)) + 
                ":</b> " + self.esc(message) + "<br>" + "')</script>")
        print "done"

    def esc(string):
        #for html:
        string = escape(string,True)
        #for javascript (order is important)
        string = string.replace("\\","\\\\")
        string = string.replace("'","\\'")
        return string

    esc = staticmethod(esc)

    def __iter__(self):
        print "received request"
        write = self.start_response('200 OK', [('Content-type', 'text/html')])
        patharray = self.environ['PATH_INFO'].split('/')
        if patharray==["",""]:
            yield self.index()
            return
        if patharray[1]=="favicon.ico":
            return
        command=patharray[1]
        user=patharray[2]
        if command=="main":
            yield self.main(user)
        elif command=="top":
            self.top(write,user)
            yield ""
        elif command=="list":
            yield self.onlinelist(user)
        elif command=="ajax":
            self.ajax(user,unquote_plus(self.environ['QUERY_STRING']))
            yield ""
        else:
            yield "unknown command: "+str(command)


class Server(Thread):
    """A thread that serves requests"""
    def __init__(self):
        Thread.__init__(self)
        self.setDaemon(1)
        self.start()
    def run(self):
        self.httpd.serve_forever()

def start():
    httpd = make_server('0.0.0.0', 9081, ChatApp)
    Server.httpd=httpd
    print "Serving HTTP on port 9081..."
    s=Server()
    s.join()#don't exit

if __name__ == '__main__':
    start()
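
Because every chat message ends up inside a single-quoted JavaScript string that is then passed to document.write, it has to be escaped twice: once for HTML and once for JavaScript. The following is a rough Python 3 sketch of that double escaping using html.escape. The newline handling is an addition that the server above does not have.

```python
from html import escape


def esc(text):
    """Escape text for a single-quoted JavaScript string that is written out as HTML."""
    # HTML layer: & < > " and ' all become entities, so no raw quote survives.
    text = escape(text, quote=True)
    # JavaScript layer: double the backslashes so they stay literal.
    text = text.replace("\\", "\\\\")
    # Addition: a raw line break would end the string literal early.
    text = text.replace("\n", "\\n").replace("\r", "\\r")
    return text
```

The order matters in the same way as in the original: the HTML escaping runs first, so the backslashes added afterwards are not themselves altered.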