Chapter 28. Downloading Entries

Client authors frequently download entries the wrong way. There are several ways of getting entries from the servers, each suited to a different task.

Language Support

First, support Unicode (UTF-8). If you write a client and release it at all, it will be used by people who need Unicode support. LiveJournal.com, and other LiveJournal installations, have a large community of users that do not necessarily keep their journal in English. The Russian community is huge, for example, and their journals require Unicode to post/view the entries.

An example journal backup tool named jbackup.pl is available in the SVN repository. It shows how to download entries and comments from the servers correctly and safely.

In general, there are four methods of downloading entries with the getevents protocol mode: lastn, syncitems, one, and day. These four methods are specified in the selecttype variable of the getevents call.
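As a rough illustration of how the four selecttype values differ, here is a minimal Python sketch that builds the request variables for each one. The helper name build_getevents_request is hypothetical, and the field names howmany, year, month, day, and itemid are assumptions that should be checked against the getevents reference; only selecttype, lastsync, and ver are named in this chapter.

```python
from datetime import date

def build_getevents_request(user, selecttype, **options):
    """Return the request variables for one getevents call (illustrative only)."""
    request = {"mode": "getevents", "user": user, "ver": 1, "selecttype": selecttype}
    if selecttype == "lastn":
        howmany = options.get("howmany", 20)
        # This chapter warns against asking for more than fifty entries.
        if howmany > 50:
            raise ValueError("lastn is for recent entries; use syncitems for a full download")
        request["howmany"] = howmany          # assumed field name
    elif selecttype == "day":
        when = options["when"]                # a datetime.date chosen in a calendar
        request.update(year=when.year, month=when.month, day=when.day)  # assumed field names
    elif selecttype == "one":
        request["itemid"] = options["itemid"]  # assumed field name
    elif selecttype == "syncitems":
        # Leaving lastsync out asks for everything from the beginning.
        if "lastsync" in options:
            request["lastsync"] = options["lastsync"]
    else:
        raise ValueError("unknown selecttype: %r" % (selecttype,))
    return request

recent = build_getevents_request("example", "lastn", howmany=5)
one_day = build_getevents_request("example", "day", when=date(2007, 3, 14))
```

Setting ver to 1 in one place, as the helper does, makes it harder to forget on any individual call.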

lastn This is most useful when you are showing the user a snapshot of their recent entries, when you want to fetch their most recent entry to verify that the entry you just posted was actually posted, or when you want to let the user edit their most recent entry.

You should not use this mode to download an entire journal. You cannot specify a huge number (anything greater than fifty) to get the entire journal in one call, unless the journal contains only a few dozen entries.

day This is useful for people who are writing calendars and want to get entries on a day that the user has clicked on. This should be used in conjunction with the getdaycounts protocol mode to figure out when the user has posted and then to get entries on that particular date.

You should never use this mode for enumerating someone's journal and downloading their entries, nor when you are going to re-upload the data. Always use syncitems for that.

If you do not specify a version, the server will assume the client does not understand Unicode. If, for some reason (a non-Unicode client, for example), the server is unable to send you a particular entry, it will instead send “(cannot be shown)” in place of the entry's subject and body. It does not tell you it has done this, so you may end up thinking that is the user's real entry and overwrite whatever they had when you re-upload it.
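One way a client might guard against this is sketched below in Python: always send ver set to 1, and refuse to re-upload anything that looks like the placeholder. The function names and the subject/event dictionary keys are hypothetical. The exact placeholder wording is also an assumption, so the check only looks for the phrase quoted above.

```python
PLACEHOLDER = "(cannot be shown)"  # phrase quoted in this chapter; exact server wording may differ

def with_unicode_version(request):
    """Return a copy of the request variables with ver set to 1."""
    updated = dict(request)
    updated["ver"] = 1
    return updated

def looks_like_placeholder(entry):
    """True if the subject or body contains the server's placeholder phrase."""
    subject = entry.get("subject", "")
    body = entry.get("event", "")
    return PLACEHOLDER in subject or PLACEHOLDER in body

def safe_to_reupload(entry):
    """Refuse to re-submit anything that may not be the user's real text."""
    return not looks_like_placeholder(entry)

request = with_unicode_version({"mode": "getevents", "selecttype": "one"})
suspect = {"subject": "(cannot be shown)", "event": "(cannot be shown)"}
print(safe_to_reupload(suspect))  # prints False
```

The check is deliberately conservative: a genuine entry that happens to contain the phrase would also be held back, which is safer than overwriting real text.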

one When you want to download a handful of entries scattered about, you can use this mode to get them. It is usually safe to download an entry with this mode and then to re-submit it to the server. Example: you use getdaycounts to show a calendar, then you use the day mode to show entries for that day, then you use this mode to get the real entry for editing.

syncitems If you are trying to download someone's entire journal, this is the mode to use. This mode is the only way you can account for edits that the user has made to their entries without using your client. This is also the most efficient way of downloading entries, because the server will send you a bunch at a time (say, 100). This mode is used in conjunction with the appropriately titled syncitems client protocol mode.

The syncitems client protocol mode returns a list of events modified/created/deleted after lastsync time, while getevents using selecttype syncitems returns the actual events.

The entries are returned in order of modification. So, in 2007 if you go back and edit an entry from 1999, it will show up when you do a sync and specify a lastsync of 2007. This is the only way to account for edits that the user makes on the web site or with another client.
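To make the ordering concrete, the following Python sketch parses a syncitems response into a list and picks the next lastsync value. It assumes the response arrives as flat key/value pairs named sync_count, sync_N_item, sync_N_action, and sync_N_time, and that times are "YYYY-MM-DD HH:MM:SS" strings, which sort correctly as plain text. Those names and formats, and the helper names, are assumptions to verify against the syncitems reference.

```python
def parse_syncitems(response):
    """Turn a flat syncitems response into a list of item dictionaries."""
    count = int(response.get("sync_count", 0))
    items = []
    for n in range(1, count + 1):
        items.append({
            "item": response["sync_%d_item" % n],
            "action": response.get("sync_%d_action" % n, ""),
            "time": response["sync_%d_time" % n],
        })
    return items

def journal_entries_only(items):
    """Keep only items whose name starts with "L-" (journal entries)."""
    return [item for item in items if item["item"].startswith("L-")]

def next_lastsync(items):
    """The most recent modification time seen so far."""
    return max(item["time"] for item in items)

# An entry written in 1999 but edited in 2007 carries its 2007 modification time.
response = {
    "sync_count": "2",
    "sync_1_item": "L-12", "sync_1_action": "update", "sync_1_time": "2007-05-01 09:30:00",
    "sync_2_item": "C-7", "sync_2_action": "create", "sync_2_time": "2007-05-02 18:00:00",
}
items = parse_syncitems(response)
print([item["item"] for item in journal_entries_only(items)])  # prints ['L-12']
print(next_lastsync(items))  # prints 2007-05-02 18:00:00
```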

If you want to download and re-submit a particular group of entries, perhaps within a particular time period, use syncitems: download the entire journal, then re-upload the subset you want to change. Do not walk through the journal with the day mode instead. A user may have used the site for a few years and written many entries, and with the day mode you would be hitting the server once for every day that the user has had a journal, whether or not they posted. A day-by-day download might take over a thousand separate requests, while a full syncitems download would take only about ten. Using syncitems substantially reduces the number of hits to the server. This is considerate, and it also means your bot is less likely to get itself banned for not being smart.
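The difference in request counts is simple arithmetic, sketched here in Python. The batch size of 100 is the illustrative figure used earlier in this chapter, not a guaranteed server limit, and the function names are hypothetical. The syncitems figure counts only the getevents calls, not the preliminary syncitems calls.

```python
import math

def day_mode_requests(years_with_journal):
    """One getevents call per calendar day, whether or not the user posted."""
    return years_with_journal * 365

def syncitems_requests(entry_count, batch_size=100):
    """One getevents call per batch of entries (batch size is an assumption)."""
    return max(1, math.ceil(entry_count / batch_size))

print(day_mode_requests(3))      # prints 1095
print(syncitems_requests(1000))  # prints 10
```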

Here is a pseudo-code example of how to use this mode properly to download someone's entire journal.

send client request “syncitems” with the “lastsync” variable not specified
get list of items back from request, save items into list for processing later
while size_of_list < sync_total {
    find most recent time in list
    call “syncitems” again, but set “lastsync” to most recent time
    push result items onto list
}
iterate through list and remove items that do not start with “L-” (L means “log” which is a journal entry)
create hash of journal itemids with data { downloaded => 0, time => whatever sync_X_time was }
while (any item in hash has downloaded == 0) {
    find the oldest “time” in this hash for items that have downloaded == 0
    …decrement this time by one second.
    mark this item as downloaded (so we don't use the same time twice and loop forever)
    send client request “getevents” with selecttype set to syncitems, lastsync set to oldest time minus 1 second
    mark each item you get back as downloaded in your hash
    put the entries you got into storage somewhere.
}
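
The pseudo-code above could be translated into Python roughly as follows. This is a sketch under stated assumptions, not a reference implementation: send_request is a hypothetical callable that takes a dictionary of request variables and returns the server's flat key/value response as a dictionary, and the response key names (sync_count, sync_N_item, events_count, events_N_itemid, events_N_subject, events_N_event) and the timestamp format are assumptions to verify against the protocol reference.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout

def one_second_before(timestamp):
    moment = datetime.strptime(timestamp, TIME_FORMAT) - timedelta(seconds=1)
    return moment.strftime(TIME_FORMAT)

def collect(response, times):
    """Record item name -> modification time from one syncitems response."""
    for n in range(1, int(response.get("sync_count", 0)) + 1):
        times[response["sync_%d_item" % n]] = response["sync_%d_time" % n]

def fetch_sync_list(send_request):
    """Phase 1: call syncitems until the whole list of changed items is known."""
    response = send_request({"mode": "syncitems", "ver": 1})  # no lastsync on the first call
    total = int(response.get("sync_total", 0))
    times = {}
    collect(response, times)
    while times and len(times) < total:
        before = len(times)
        newest = max(times.values())
        collect(send_request({"mode": "syncitems", "ver": 1, "lastsync": newest}), times)
        if len(times) == before:
            break  # no progress; stop rather than repeat the same time
    # Keep only journal entries ("L-" items).
    return {item: t for item, t in times.items() if item.startswith("L-")}

def download_entries(send_request, times):
    """Phase 2: fetch entries with getevents, oldest pending time first."""
    pending = dict(times)  # items not yet downloaded
    entries = {}
    while pending:
        oldest = min(pending, key=pending.get)
        # Popping marks the item as downloaded, so no time is used twice.
        lastsync = one_second_before(pending.pop(oldest))
        response = send_request({"mode": "getevents", "ver": 1,
                                 "selecttype": "syncitems", "lastsync": lastsync})
        for n in range(1, int(response.get("events_count", 0)) + 1):
            itemid = response["events_%d_itemid" % n]
            entries[itemid] = {
                "subject": response.get("events_%d_subject" % n, ""),
                "event": response["events_%d_event" % n],
            }
            pending.pop("L-%s" % itemid, None)
    return entries

def download_journal(send_request):
    return download_entries(send_request, fetch_sync_list(send_request))
```

Passing the transport in as send_request keeps the loop logic separate from networking and authentication, so it can be exercised against a stub before it ever touches a real server.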

You will have to call syncitems and getevents several times each to get the data you need. This is not a problem if you do it carefully. Also note that the server keeps track of the times you use when you call getevents; if you keep specifying the same time (an infinite loop), your client will be given an error message such as “Perhaps the client is broken?”. Finally, remember to set ver to 1!