Google Chat History Downloader

Update 2011-11-09:
Gmail now officially supports downloading chat history via IMAP. Thanks to Steve for pointing it out. It can be enabled in the “Labels” section of Gmail settings.

Update 2011-08-30:

Based on the comments, this doesn’t work anymore. I’d recommend checking out this thread for solutions: http://www.google.com/support/forum/p/gmail/thread?tid=7a7d2d6da5be047f

I personally have been using a JavaScript-based solution for exporting recent chat data, which still doesn’t solve the TOS / getting blocked problem. If there is enough interest, I’ll post my code.

A couple of weeks ago, I decided to migrate from one Google Account to another. I was able to transfer all of my emails without too much difficulty. However, I looked around for a while and could not find any way to export all of my Google Talk chat history. I don’t think there is any way to access saved chats from either IMAP or POP. I did notice, though, that the Gmail web interface lets you view saved chats as raw messages. There happens to be an old Python library for interacting with the Gmail web interface called libgmail.

I found, however, that it does not scale very well to large numbers of messages, so I had to write my own method that processes results one page at a time. I also found that I was easily blocked when running this over a long time, so I added 13-second delays after every request so as not to get my account suspended. It took me a day and a half to export all of the messages. I’m not sure if this is overkill or not, but I am tired of getting my account blocked.

Anyway, this program goes through and saves each chat history message as an .eml file. Once they are in that format, it is not super hard to get them into a different Gmail account, but I’ll save that for another post.

import os
import time
import libgmail # http://libgmail.sourceforge.net/

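def thread_search(ga, searchType, **kwargs):
    """Yield search results one page at a time instead of loading them all."""
    index = 0
    # threadListSummary is only assigned inside the loop; the index == 0 test
    # short-circuits, so it is never read before the first page is parsed.
    while (index == 0) or index < threadListSummary[libgmail.TS_TOTAL]:
        threadsInfo = []
        items = ga._parseSearchResult(searchType, index, **kwargs)
        try:
            threads = items[libgmail.D_THREAD]
        except KeyError:
            # No thread data on this page: stop paging.
            break
        else:
            for th in threads:
                # If the first element is not itself a list, wrap the entry
                # so that every item in threadsInfo has the same shape.
                if not type(th[0]) is libgmail.types.ListType:
                    th = [th]
                threadsInfo.append(th)
            threadListSummary = items[libgmail.D_THREADLIST_SUMMARY][0]
            threadsPerPage = threadListSummary[libgmail.TS_NUM]
            index += threadsPerPage
        yield libgmail.GmailSearchResult(ga, (searchType, kwargs), threadsInfo)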

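ga = libgmail.GmailAccount("username@gmail.com", "password")
ga.login()

# The files below are written into chats/, so make sure it exists.
if not os.path.isdir("chats"):
    os.makedirs("chats")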

for page in thread_search(ga, "query", q="is:chat"):
    print "New Page"
    time.sleep(13)
    for thread in page:
        if thread.info[0] == thread.info[10]:
            # Common case: Chats that only span one message
            filename = "chats/%s_%s.eml" % (thread.id, thread.id)
            #only download the message if we don't have it already
            if os.path.exists(filename):
                print "already have %s" % filename
                continue
            print "Downloading raw message: %s" % filename,
            message = ga.getRawMessage(thread.id).decode('utf-8').lstrip()
            print "done."
            file(filename, 'wb').write(message)
            time.sleep(13)
            continue
        # Less common case: A thread that has multiple messages
        print "Looking up messages in thread %s" % thread.id
        time.sleep(13)
        for message in thread:
            filename = "chats/%s_%s.eml" % (thread.id, message.id)
            #only download the message if we don't have it already
            if os.path.exists(filename):
                print "already have %s" % filename
                continue
            print "Downloading raw message: %s" % filename,
            file(filename, 'wb').write(message.source.lstrip())
            print "done."
            time.sleep(13)

55 thoughts on “Google Chat History Downloader”

  1. What I did was to use the archive.py script provided by libgmail and also add a delay (got locked out twice) – then import the mbox into Thunderbird, use import/export tools to export all chats as .txt files, then I wrote a Perl script which sorted all chats into folders by person.

    However – I noticed that a lot of chat logs had their timestamps missing and messages out of order. Did you notice this as well?

  2. Oh, I was just archiving my chat logs on my local computer for offline viewing – not re-importing them into an account. (I wasn’t actually transferring one account to another)

    I had a delay of 8 seconds – still got locked out. (And yours was even longer at 13 seconds) I have a suspicion that it may be a daily limit of downloaded messages rather than time? I can’t say, though.

    You didn’t say though – whether you got out-of-order messages in the downloaded chats?

    1. I guess I really didn’t notice if the chats were out of order or not. I thought that it downloaded every message in every thread in order. I also thought that Thunderbird did a fairly good job of displaying all of the messages in order, so I guess I’d say that they weren’t out of order.

      1. Oh, I didn’t mean the threads – but the ACTUAL chat log itself. For example, one of my downloaded logs was

        person: but only if you can do it and its easy
        person: i have a favor to ask
        person: hey!
        person: gracias!
        person: can you give me a call at xxxxxxxxxx?

        when in gmail itself, the chat log is:

        6:41 AM person: hey!
        i have a favor to ask
        but only if you can do it and its easy
        can you give me a call at xxxxxxxxxx?
        gracias!

        So – all out of order!

        Apparently someone else also had some issue with this. I guess it’s a problem with gmail then.. and maybe the reason why they disabled downloading chats via IMAP…

  3. Before I tell you my conclusions, here is some experimentation that I did with this script:

    First I ran it overnight, exactly as it appears above, with 2×13 seconds between chats. I got locked out after 4 hours, 525 chats.

    Then I decreased the interval to 1×10 seconds, but I was using my email pretty frequently while the script was running. It ran for 8 hours, 2550 chats without locking me out, until I closed the script.

    In the spirit of experimentation, I got rid of the “sleep” commands entirely, and ran the script while obsessively messing with my email account: opening emails, moving them between labels, archiving/unarchiving, deleting spam, etc. It ran for surprisingly long (about 1500 chats) before locking me out.

    So the safe ground is somewhere in between. It seems pretty clear that running the script overnight is a bad idea, since I suppose the predictability of the requests makes the server flag them as suspicious. However, if you are using your Google account while running the script, the requests are not as predictable and therefore not suspicious enough to lock you out.

    Regarding lines of chat being out of the proper order, I am getting the same results as Hoong. Strangely this issue does not affect group chats, even though my naive assumption would be that they are more difficult to order correctly than a normal chat between two people.

    However, it is possible to salvage the original order of the chats (not to mention in pretty HTML format instead of .eml). Collin’s script includes the “message id” of each chat in the file name. All you need is a simple script to run through the directory of .eml files, and load and save an HTML page from the following address, where {id} is replaced by the message id. The pages returned are actually very clean, without any javascript nonsense or anything like that.

    http://mail.google.com/mail/?ui=1&view=lg&msg={id}
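
A rough sketch of the lookup step described in this comment: recover the message id from each file name and substitute it into the URL. The helper names (`message_id_from_filename`, `log_view_url`, `urls_for_directory`) are made up for illustration, and the `threadid_messageid.eml` naming is the one used by the script in the post. It is written for current Python rather than the Python 2 of the original script. Fetching the pages would still need a logged-in Gmail session, which is not shown here.

```python
import os

# URL pattern quoted in the comment above; %s stands for the message id.
LOG_VIEW_URL = "http://mail.google.com/mail/?ui=1&view=lg&msg=%s"


def message_id_from_filename(filename):
    """Return the message id from a name like 'threadid_messageid.eml'.

    Returns None for files that do not follow that naming scheme.
    """
    stem, ext = os.path.splitext(os.path.basename(filename))
    if ext != ".eml" or "_" not in stem:
        return None
    # The message id is the part after the last underscore.
    return stem.rsplit("_", 1)[1]


def log_view_url(message_id):
    """Build the HTML log-view URL for one message id."""
    return LOG_VIEW_URL % message_id


def urls_for_directory(directory):
    """Yield (filename, url) for every matching .eml file in a directory."""
    for name in sorted(os.listdir(directory)):
        message_id = message_id_from_filename(name)
        if message_id is not None:
            yield name, log_view_url(message_id)
```

Calling `urls_for_directory("chats")` would list one URL per saved message, which could then be fetched with whatever delay feels safe.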

    Collin, perhaps you could write such a script for us, or extend the original to have it save html files instead of eml for people who would prefer that (or who just want their chats in the proper order).

    By the way, thank you very much for your work on this script. I wrote a more primitive version a few months ago, also based on libgmail, but never got very far with it for fear of getting locked out permanently. More recently I read through Gmail’s TOS looking for a ban on using third-party software (such as this script) to access your account, but there was no such ban. Oddly, I came across the following article in the google help pages:

    http://mail.google.com/support/bin/answer.py?hl=en&answer=13107
    “Using third party software applications that interact with Gmail directly violates the Terms of Use that all users must agree to before creating a Gmail address.”

    After going back to check the TOS again, I am sure this is a contradiction.

  4. Hello,

    Thank you for your script! It was very helpful. But like David, I also noticed the order of my chats was completely messed up. So I decided to modify your script to include his suggestion of downloading the messages as HTML files. So far, so good!

    I also included random sleep intervals to try to avoid the lock out problem.
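
The linked code is not reproduced here, so the following is only a minimal sketch of what a random sleep interval could look like. The function names and the 13-second base with a 5-second spread are assumptions chosen for illustration; nothing in this thread establishes which delays actually avoid a lockout.

```python
import random
import time


def jittered_delay(base=13.0, spread=5.0):
    """Return a delay in seconds drawn uniformly from [base - spread, base + spread]."""
    return random.uniform(base - spread, base + spread)


def polite_sleep(base=13.0, spread=5.0, sleep=time.sleep):
    """Sleep for a randomized interval and return the delay that was used.

    The sleep function is a parameter so the helper can be exercised
    without actually waiting.
    """
    delay = jittered_delay(base, spread)
    sleep(delay)
    return delay
```

In the script above, each `time.sleep(13)` call could be swapped for `polite_sleep()` so that requests no longer arrive at a fixed rhythm.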

    If you want to have a look at the code, I uploaded it here: http://tinypaste.com/ef35f. I didn’t care too much about the quality of the code, I just wanted to get the work done…

    Feel free to change, publish, use, do whatever with it :)

    1. Dear Raccoonette,

      I don’t know Python at all, but have installed it and tried your and Collin’s code. Not sure if I have done everything right – had to install a few additional packages, namely pysqlite2, libgmail, mechanize and I think something else – but now it works. Collin’s code works as advertised, although it messes up some chats, as described here a number of times. Your code seems to work, but instead of chats it saves files that look like this:

      top.location=”https://www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=http%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3D1&bsv=zpwhtygjntrz&scc=1&ltmpl=googlemail”;

      I realize this is not supported code or anything, but if there’s something simple I’m missing, I’d greatly appreciate your advice.

      Many thanks!

      1. NOTE: the text quoted above is actually surrounded by opening and closing script tags which this site has stripped

      2. That looks like the content of the page Gmail will send you to if you are not logged in, so it appears to be a problem with your cookies. Check to make sure that the my_username, my_password, and cookie_file variables inside Raccoonette’s code are all correct, and that you have recently logged in to Gmail from Firefox and chosen “remember me” or whatever option it gives you. I see that the comment in Raccoonette’s code specifically references Firefox 3, so if you are using 3.5 things might have changed – I will wait for him to clarify that.

      3. Hello, Crible-Crable,

        it is exactly like David said: it’s a problem with the cookies, Gmail is not recognizing you as signed in. You have to mark the “remember me” option when logging in to Gmail in Firefox 3 (I really don’t know if it works with Firefox 3.5 and definitely it won’t work with a cookie file from other browsers). The browser doesn’t even need to be open, as long as the cookie to stay signed in is set. Let me know if this fixes the issue and good luck =)

        @David: I’m a “she” ;)

      4. Hi David and Raccoonette-a-she :)

        I’m definitely logged in and have remember me checked, I’m using Firefox 3.0.13 (btw, tried with 3.5 – program says “pysqlite2._sqlite.OperationalError: database is locked”), also I’m definitely pointing the program to correct cookies.sqlite file, but I’m still getting the same results :(

        Tried clearing the cookies via Clear Private Data and even rudely deleting the file – same stuff :(

        Okay, I think I know what the problem is. You might have the setting “Always use https” checked inside Gmail. That option prevents you from accessing Gmail over an unencrypted connection, which is what the script is trying to do.

        So, log in to Gmail, go to Settings, tab General, and mark the radio button “Don’t always use https”. To make sure that you’ll have the right cookies, log out, clear the cookies (or at least the Gmail cookies), log in again marking the “Stay signed in” option in the login page.

        I hope it works now!
        Cheers!

      6. Hi folks,

        Thanks for all your work on this functionality.

        New to Python, I’m trying to give this a whirl myself. I ran into trouble with installing the pysqlite2 package. So a heads up to anyone else out there with Snow Leopard (I think this is where the issue is).

        I got this error during the build:
        “/Developer/SDKs/MacOSX10.4u.sdk/usr/include/stdarg.h:4:25: error: stdarg.h: No such file or directory”

        However, the stdarg.h file does exist…

        I searched around for stdarg.h errors and came across this: http://blog.coredumped.org/2009_09_01_archive.html
        Moyer says: “you have to change your default compiler from gcc 4.2 to gcc 4.0. This can be done by removing the symlink file in /usr/bin/gcc and re-linking that to gcc-4.0. This can be done by doing:

        sudo rm /usr/bin/gcc
        sudo ln -s /usr/bin/gcc-4.0 /usr/bin/gcc


        This worked for me. After spending waaaaay too much time trying to figure this one out, I hope this might save someone else the trouble…

        peace!

      1. Bump. It looked so promising!!! But in fact I’m having the same problem. Will try it on another PC as well as on this PC, but with a different account and let you know.

        Could it be that I installed any of the required supplementary Python packages incorrectly? I assume I would be getting different sorts of error messages…

        Many thanks!

      2. That is weird… Even after removing all cookies, after setting https to NOT be used, logging out and in again with the remember me checkbox marked? What about the labels you are using? Do they have any special characters? And could you try setting Gmail to use the simple HTML interface?

        I don’t think that it has to do with any supplementary packages, as the redirection text you are getting from Google seems to indicate that the problem is with the cookies. It tries to recognize the cookie, doesn’t get the expected response, and tries to redirect you to the authentication page again. But if you want to make sure, tell me the versions of the supplemental packages you have so that I can compare with mine.

        But before that, I ask you to try again to clear the cookies, log in, change all settings about https and the html interface, log out, clear the cookies and cache again, close the browser, log in again, close the browser and try to run the script. Maybe your browser is just being stubborn!

        Good luck!

      3. Bingo!

        The only things I haven’t tried before were
        1. basic HTML interface
        2. closing the browser before starting the script

        Tried these as you suggested – unfortunately together and not one by one – and everything worked! Then I restored the standard interface and kept the browser open and tried again, but the problem didn’t return, it just worked. I guess my browser was indeed stubborn!

        Thank you so much for your help, good luck :)

        Crible

  5. No prob!

    It took me three days to download all my chats (some 4500, I think), as GMail locked me out three times, but now it is done :)

  6. Hi Collin,
    do you have an executable for this downloader?
    I have never used Python before, but apparently your program has some dependencies which must be downloaded too. When I download a dependency that is missing, it says another dependency is missing, and so on until I got tired. Is there an executable that I can just run? Or is there a list of dependencies that I need to download?

  7. Hi again,
    never mind the message above, I managed to learn how to use easy_install.
    Now I can run the program, however I can’t log in and it gives an error message:
    GmailLoginFailure("Login failed. (Wrong username/password?)")
    even though I entered the right username and password.
    I tried giving the username as “username@gmail.com” and “username”, but neither works. Do you know why? Thanks

  8. Hi. I grabbed the latest version of libgmail and can login successfully using it.
    When I tried using this script, it always ceases to run at this point:
    try:
        threads = items[libgmail.D_THREAD]
    except KeyError:
        break
    I commented the code to print what it’s doing at the moment, and it ends with the KeyError. I’m not really Python-savvy (in fact, I started learning Python only because of this script, and it took a long time for me to figure out where to download all the plugins, e.g. libgmail, mechanize and the sqlite one, which I by the way found out to be obsolete, you can just do “from sqlite3 import dbapi2 as sqlite”) so I don’t know what a KeyError is and how to fix it. Any help, please?
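
The import swap mentioned in this comment can be written so that it works either way. This is a small sketch, not taken from any of the scripts in this thread: it prefers the `sqlite3` module that ships with Python 2.5 and later, and only falls back to the separately installed `pysqlite2` package if that is missing. The in-memory table is a throwaway example to show that the alias behaves the same.

```python
try:
    # Standard library module, bundled with Python 2.5 and later.
    from sqlite3 import dbapi2 as sqlite
except ImportError:
    # Older installs: fall back to the separately installed package.
    from pysqlite2 import dbapi2 as sqlite

# Quick check against a throwaway in-memory database.
connection = sqlite.connect(":memory:")
connection.execute("CREATE TABLE demo (value INTEGER)")
connection.execute("INSERT INTO demo VALUES (1)")
rows = connection.execute("SELECT value FROM demo").fetchall()
connection.close()
```

Code that was written against `pysqlite2` and refers to the module as `sqlite` should not need further changes after this swap.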

  9. KeyError means that libgmail.D_THREAD, which is “t”, in this case, is not found in the dictionary, items. That section of code I copied and modified from the _parseThreadSearch method in libgmail.py.
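
To make the KeyError concrete, here is a toy illustration. The dictionaries are invented stand-ins for what the search parser might return, not real libgmail output; only the key `"t"` comes from the explanation above.

```python
D_THREAD = "t"  # stands in for libgmail.D_THREAD


def threads_from_page(items):
    """Return the thread list from a parsed page, or [] if there is none."""
    try:
        return items[D_THREAD]
    except KeyError:
        # The page has no "t" entry, which is what stops the loop in the script.
        return []


page_with_threads = {"t": [["thread-a"], ["thread-b"]]}  # made-up data
page_without_threads = {"other": [0]}                    # made-up data

found = threads_from_page(page_with_threads)
missing = threads_from_page(page_without_threads)
```

So a KeyError at that point means the parsed page simply had no thread entry. It does not say why: it could be an empty search result, or a response that was not a search page at all.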

    I thought I heard from somewhere that libgmail no longer successfully logs in. I haven’t tried it in over 6 months, so I don’t know.

    Also, be aware that we found that this did download all of the text of the chats, but for some reason the individual messages within each chat log were completely out of order. I have no idea why.

    I appreciate your effort to install python and all of the dependencies. I’ve spent many hours trying to get things to install correctly.

  10. > KeyError
    I see. What does it mean, though? The search was not performed successfully?
    > libgmail
    If you grab the latest version from their svn, it works.
    > log out of order
    I tried using Raccoonette’s modification of your script when I got this error.
    Thanks for trying to help me ;_; (this is all part of my large plan to put all my chat histories in one place, that is, my Miranda profile; now it’s missing only Skype and Google Talk!)

  11. I have installed Python 2.6.5 and modified Raccoonette’s code to use the sqlite3 module that is installed with it. I have also downloaded the latest CVS version of libgmail.

    I have Firefox 3.5 and have located the cookie file. I am not sure if there is something with sqlite3 that causes the program to stall.

    I get a sqlite3.OperationalError: unable to open database file.

    Is someone able to help me? I am quite unfamiliar with python and been just mucking around…. Really want to download my old chats.

  12. I’m totally new to python and libgmail – could someone give me a step-by-step walkthrough for this?
    Thanks

  13. I’m also getting login errors with Raccoonette’s script using libgmail-modified from the Arch Linux AUR. libgmail-cvs is also available, but it outputs a different set of errors.

    Mostly I’m posting to suggest a different approach that is technically beyond my ability (my ability being following the directions in the script).

    The site below states that chat history can be saved to local storage for offline access using Google Gears. Is it possible to access and convert the history at this point with a script? Or is Google clever enough to encrypt the data once it is stored locally?
    http://www.aeinrst.info/backup-gmail-chat-history/

    hopefully that sparks the imagination of someone here. thanks

      1. Wow, awesome. I got stumped, because all the text is in messagesft_content while the date is only in messages. I was looking on how to link them… but their row ids are the same? Wow, I’m stupid.
        (I’m not sure if the replybychat th

      2. (oops, accident submit)
        (I wasn’t sure if the replybychat trick would work, either – I can “reply by chat” in gmail to non-chat messages. Anyway, good work and thanks for the hint!

        (Won’t stop me from doing mine for practice ;))

    1. If the replybychat thing doesn’t work (it worked for my 6000 chats!), I think that the “Rationale” field is always “2” for the chats. The query can be easily adapted for this :)

      1. Also note that the date timestamp is stored in milliseconds, so you have to divide it by 1000 if you want a standard UNIX timestamp.
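
A minimal sketch of that conversion, written for current Python. The function name is made up, and treating the stored value as milliseconds since the Unix epoch in UTC is an assumption based on the comment above rather than on any documentation of the offline database.

```python
from datetime import datetime, timezone


def gears_ms_to_datetime(ms):
    """Convert a millisecond timestamp to a timezone-aware UTC datetime."""
    seconds = ms / 1000.0  # standard UNIX timestamps are in seconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


example = gears_ms_to_datetime(1251763200000)  # 2009-09-01 00:00:00 UTC
```

Skipping the division would place every chat tens of thousands of years in the future, which is an easy way to spot the mistake.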

  14. There is a huge problem: If your chat is long, it will cut off, and say “This message has been clipped. click to download entire message”

    So how do you make offline gmail download the full messages???? :-(

    1. CandorZ – I’m sorry, I got this post mixed up with this webpage: http://martinml.com/en/how-to-download-and-backup-your-gtalk-gmail-chat-logs/ I apologize. However, I downloaded libgmail from SourceForge, installed it, and it doesn’t work; it says I have the wrong account info:
      (put the code into the file ‘gm’, and modified it with my user & pass; I’m sure it is correct info, I even tried my old gmail account with the same result)

      matt: /tmp $ python ./gm
      Traceback (most recent call last):
        File "./gm", line 26, in <module>
          ga.login()
        File "/usr/local/lib/python2.6/dist-packages/libgmail.py", line 320, in login
          raise GmailLoginFailure("Login failed. (Wrong username/password?)")
      libgmail.GmailLoginFailure: 'Login failed. (Wrong username/password?)'

      ~~~~~
      what’s wrong? is libgmail out of date?

  15. I get the following error.
    can someone please help??

    ======================

    Please wait, logging in…
    Traceback (most recent call last):
      File "libgmail.py", line 1578, in <module>
        ga.login()
      File "libgmail.py", line 305, in login
        pageData = self._retrievePage(req)
      File "libgmail.py", line 340, in _retrievePage
        req = ClientCookie.Request(urlOrRequest)
      File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.5-py2.6.egg/mechanize/_request.py", line 31, in __init__
        if not _rfc3986.is_clean_uri(url):
      File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.5-py2.6.egg/mechanize/_rfc3986.py", line 62, in is_clean_uri
        return not bool(BAD_URI_CHARS_RE.search(uri))
    TypeError: expected string or buffer
    ============================

  16. Hi,

    Can you please tell me how I can transfer my chat history from one Gmail account to another? I read the above, but I don’t know how to use the code.

    Thanks
    Badal

  17. I used the instructions that CandorZ left here and they did work. I had some trouble locating the downloaded files but managed to find them. After that I ran the script and got all the messages in web page form, as he explained in the Carnivorous…..blog. Good work and thanks
