
I'd like Caltrain to publish raw train data

I enjoy cool visualizations of transit system performance, especially this one for the Boston MBTA, so I wanted to make one for the train I often ride, Caltrain. The MBTA visualization reveals patterns in how the service actually operates, including how travel times between stations vary over the course of the day.

For example, here’s a Marey graph of a weekday Caltrain schedule (local trains in brown, limited in black, express in red). The horizontal axis shows departure and arrival times from morning (first train leaves San Jose at 4:30 AM) to evening (last train arrives in San Jose at 1:32 AM), with a few stations noted:

[Image: Caltrain weekday Marey graph]

But that’s just the planned schedule. I wanted to visualize the performance of real Caltrain trains over several days. I went through the whole process of using transit data APIs, parsing and massaging the unfiltered data, and trying to turn it into something useful — and then I set aside this project, and here’s why.

This post may be useful to you if you’re curious about what kind of data you can get from Caltrain, obstacles you might run into when building web scrapers (and what you can do about them), and the downsides of the scraped data I got. You can also poke around the scraper source code to see what I did.

Trying an API and resorting to scraping

Lots of transit agencies share data about their trains and buses and where those vehicles are going, including the MBTA, the London Underground, and Bay Area Rapid Transit (BART). Caltrain theoretically provides this too: the real-time info for developers page includes a link to the 511 real-time API. Great! The first step was to try out the 511 API, which I did with the rest-client gem for Ruby.
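A minimal request with rest-client looks something like the sketch below. The endpoint and parameter names are assumptions from memory, and the token and stop code are placeholders, so treat this as the shape of the call rather than the exact one I made:

require "rest-client"

# Sketch of a 511 real-time request. The endpoint and parameter names are
# assumptions; the token comes from registering as a developer, and the
# stop code here is a placeholder.
response = RestClient.get(
  "http://services.my511.org/Transit2.0/GetNextDeparturesByStopCode.aspx",
  params: { token: ENV["MY511_TOKEN"], stopcode: "70011" }
)
puts response.body  # XML describing upcoming departure times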

Now that I have a way to get the data, can I build my visualization? Not really. The 511 data only provides train departure times. At a glance this seems like it could be enough information, but there’s a big problem: it doesn’t tell you which train is departing. Caltrain is not a simple system that stops at every station — there are express trains that only make a few stops from beginning to end, limited trains that make more stops (in several service patterns!), and local trains that make all stops. How do we learn which train is where?

Well, Caltrain provides a real-time departure list on their website, and it includes all the different train numbers! This is pretty helpful, but there’s no API for accessing it, so I scraped the data directly from their website. Using the excellent Burp Suite to intercept my HTTP requests and figure out what I needed to do, I worked out how to get both the full list of stations and the data for each station. This took a bit of work, as Caltrain appears to be using an ancient ASP.NET CMS named “IronPoint”, but in the end I got the data. It looks like this:

s1<IRONPOINT>TIME</IRONPOINT>as of&nbsp;5:44 PM<IRONPOINT>TIME</IRONPOINT><IRONPOINT>ALERTS</IRONPOINT><IRONPOINT>ALERTS</IRONPOINT><IRONPOINT>TRAINS</IRONPOINT><table class="ipf-caltrain-table-trains" width="100%"  cellspacing="0" cellpadding="0" border="0"><tr class="ipf-st-ip-trains-table-dir-tr"><th class="ipf-st-ip-trains-table-dir-td1"><div>SOUTHBOUND</div></th><th class="ipf-st-ip-trains-table-dir-td2"><div>NORTHBOUND</div></th></tr><tr class="ipf-st-ip-trains-table-trains-tr"><td><table class="ipf-st-ip-trains-subtable"><tr class="ipf-st-ip-trains-subtable-tr"><td class="ipf-st-ip-trains-subtable-td-id">440</td><td class="ipf-st-ip-trains-subtable-td-type">Local</td><td class="ipf-st-ip-trains-subtable-td-arrivaltime">32 min.</td></tr><tr class="ipf-st-ip-trains-subtable-tr"><td class="ipf-st-ip-trains-subtable-td-id">442</td><td class="ipf-st-ip-trains-subtable-td-type">Local</td><td class="ipf-st-ip-trains-subtable-td-arrivaltime">92 min.</td></tr><tr class="ipf-st-ip-trains-subtable-tr"><td class="ipf-st-ip-trains-subtable-td-id">804</td><td class="ipf-st-ip-trains-subtable-td-type">Baby Bullet</td><td class="ipf-st-ip-trains-subtable-td-arrivaltime">114 min.</td></tr></table></td><td><table class="ipf-st-ip-trains-subtable"><tr class="ipf-st-ip-trains-subtable-tr"><td class="ipf-st-ip-trains-subtable-td-id">803</td><td class="ipf-st-ip-trains-subtable-td-type">Baby Bullet</td><td class="ipf-st-ip-trains-subtable-td-arrivaltime">13 min.</td></tr><tr class="ipf-st-ip-trains-subtable-tr"><td class="ipf-st-ip-trains-subtable-td-id">443</td><td class="ipf-st-ip-trains-subtable-td-type">Local</td><td class="ipf-st-ip-trains-subtable-td-arrivaltime">46 min.</td></tr><tr class="ipf-st-ip-trains-subtable-tr"><td class="ipf-st-ip-trains-subtable-td-id">445</td><td class="ipf-st-ip-trains-subtable-td-type">Local</td><td class="ipf-st-ip-trains-subtable-td-arrivaltime">106 min.</td></tr></table></td></tr></table><IRONPOINT>TRAINS</IRONPOINT><IRONPOINT>LINK</IRONPOINT><IRONPOINT>LINK</IRONPOINT>
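Pulling structure out of that blob is straightforward. Here’s a rough sketch of the idea using Nokogiri; raw_body stands in for the response body above, and the real scraper differs in the details:

require "nokogiri"

# Rough sketch: grab the HTML between the IRONPOINT markers, then read each
# train row (train number, service type, minutes until arrival).
html = raw_body[%r{<IRONPOINT>TRAINS</IRONPOINT>(.*)<IRONPOINT>TRAINS</IRONPOINT>}m, 1]
rows = Nokogiri::HTML.fragment(html).css("tr.ipf-st-ip-trains-subtable-tr")
rows.each do |row|
  number, type, eta = row.css("td").map { |td| td.text.strip }
  puts "train #{number} (#{type}): #{eta}"
end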

I hooked the scraper up to the Sequel gem and logged the very-slightly-processed raw data into an SQLite database. Time to sit back and watch the data roll in.
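The logging side amounted to very little code; it was something in the spirit of this sketch, with the schema simplified and the names illustrative:

require "sequel"

# Simplified sketch of the logging setup; the real schema stores more.
DB = Sequel.sqlite("caltrain.db")

DB.create_table? :readings do
  primary_key :id
  DateTime :created_at  # when the scrape ran
  DateTime :time        # the "as of" time reported by the page
end

# page_time is the parsed "as of" timestamp (hypothetical variable).
DB[:readings].insert(created_at: Time.now, time: page_time)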

Iterating and debugging

The basic scraper logged the data in a fairly raw form and had minimal error handling. Next I wanted to make the data more usable, allowing searching by individual trains or by time of day. I also planned to improve the error handling, both by logging more details and by making it easier for me to bucket the errors.

Of course, before I could even start thinking about these improvements, the script broke after fifteen minutes of running. It turns out it didn’t handle the server responding with no data. After I squashed that bug, I added a bit of logging to help me catch similar problems in the future.

Next I fixed the scraper to run at the correct part of each minute. I wanted to capture as much data as possible without making multiple requests per minute, but if I requested data just as the servers updated, I’d get data spanning two different minutes. My workaround: poll every five seconds, and once the minute changes, use that moment as the start of a scrape. From then on, scrape once per minute and everything stays aligned (there’s a sketch of this after the log below).

got 32 stations
looping now
retrieving departures
--- Times should be '11:45 AM'
--- Times all look good
retrieving departures
--- Times should be '11:46 AM'
--- Times all look good
retrieving departures
--- Times should be '11:47 AM'
--- Times all look good
[...]
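In sketch form, the alignment logic is just two loops, with do_scrape standing in for the real per-station requests:

# Poll every five seconds to find the minute boundary, then scrape once per
# minute from that offset so every request lands at the same point in the
# minute.
minute = Time.now.min
sleep 5 while Time.now.min == minute  # wait for the minute to roll over
loop do
  do_scrape  # hypothetical method wrapping the station requests
  sleep 60
end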

Another bug cropped up in the time parsing code. The scraped data included two kinds of times: the current time, in the form '11:30 AM', and the arrival time, in the form '15 min.'. I originally stored these in the database as raw strings, but they’re much easier to work with as machine-readable timestamps. My initial attempt parsed the clock time and assumed the current day; arrival times then added the specified number of minutes to that. This seemed OK, so I decided to check on the data the next day and went to sleep.

Well, there were weird things with the times. I had a bunch of entries right up until about 11:49 PM the day before, then some from 11:50–11:59 PM on the current day, followed by a bunch more from 12:00 AM onwards. This struck me as odd, since 11:50 PM on the current day wouldn’t occur for almost 24 hours! The cause was my server’s clock being slightly off, by about ten minutes: just past its midnight, the page was still reporting times from the previous day. I added logic to determine the correct day (sketched after the query output below), and the weird time traveling became a thing of the past. I also fixed the clock on my server.

sqlite> select id, created_at, time from readings where id > 108 limit 16;
109|2014-07-26 23:57:01.051938|2014-07-26 23:47:00.000000
110|2014-07-26 23:58:01.087952|2014-07-26 23:48:00.000000
111|2014-07-26 23:59:00.109986|2014-07-26 23:49:00.000000
112|2014-07-27 00:00:00.149401|2014-07-27 23:50:00.000000
113|2014-07-27 00:01:00.615091|2014-07-27 23:51:00.000000
114|2014-07-27 00:02:00.377811|2014-07-27 23:52:00.000000
[...]
120|2014-07-27 00:08:00.119957|2014-07-27 23:58:00.000000
121|2014-07-27 00:09:00.065927|2014-07-27 23:59:00.000000
122|2014-07-27 00:10:00.251709|2014-07-27 00:00:00.000000
123|2014-07-27 00:11:00.077243|2014-07-27 00:01:00.000000
124|2014-07-27 00:12:00.124230|2014-07-27 00:02:00.000000
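The fix boils down to parsing against the current time and then sanity-checking the result. Roughly, with parse_page_time as an illustrative name rather than the scraper’s actual code:

require "time"

# Illustrative fix for the day-rollover bug: parse "11:50 PM" against the
# current date, then pull it back a day if it lands absurdly far in the
# future (which happens when the source clock sits just before midnight).
def parse_page_time(str, now = Time.now)
  t = Time.parse(str, now)             # fills in today's date
  t -= 86_400 if t - now > 12 * 3600   # more than 12h ahead? it's yesterday's
  t
end

# Arrival times like "15 min." are just offsets from the page's own time:
arrival = parse_page_time("11:50 PM") + 15 * 60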

Another issue that kept coming back was server timeouts. Occasionally the server would do nothing after I made my request: no response, no failure, just nothing. The scraper would hang while waiting for these responses, and there were a few times when I missed several minutes of data because of it! I ended up with a reasonably robust system for handling this: the initial parallel requests got a 10-second timeout, and any request that didn’t complete was re-run afterward. Of course, I had to deal with timeouts there too; once, a re-request hung for five minutes before returning. The result was scraping code that took less than a minute to run, and if it couldn’t retrieve data for a station, it logged the error and kept scraping (outlined in code after the log below).

[...]
doing scrape at 2014-08-03 11:05:15 -0700
scrape times: request: 1.70s retries: 0.00s
finished retrieval at 2014-08-03 11:05:17 -0700
got reading time 2014-08-03T10:55:00-07:00 and creation time 2014-08-03T11:05:17-07:00
scrape complete
doing scrape at 2014-08-03 11:06:15 -0700
got an error for Atherton, making special request
got an error for Broadway, making special request
scrape times: request: 1.74s retries: 1.02s
finished retrieval at 2014-08-03 11:06:18 -0700
got reading time 2014-08-03T10:56:00-07:00 and creation time 2014-08-03T11:06:18-07:00
scrape complete
doing scrape at 2014-08-03 11:07:15 -0700
scrape times: request: 1.72s retries: 0.00s
finished retrieval at 2014-08-03 11:07:17 -0700
got reading time 2014-08-03T10:57:00-07:00 and creation time 2014-08-03T11:07:17-07:00
scrape complete
[...]
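In outline, the timeout handling looks like this. The real scraper issues the first round of requests in parallel, and fetch_station and log_error are illustrative names:

require "curb"

# Illustrative timeout-and-retry shape: cap every request at 10 seconds,
# then give stragglers one more capped attempt instead of hanging forever.
def fetch_station(url)
  Curl::Easy.perform(url) { |c| c.timeout = 10 }.body_str
rescue Curl::Err::CurlError => e
  log_error(url, e)  # hypothetical logger
  nil
end

bodies = urls.map { |u| [u, fetch_station(u)] }.to_h
misses = bodies.select { |_, body| body.nil? }.keys
misses.each { |u| bodies[u] = fetch_station(u) }  # single retry pass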

Pain and suffering

At this point, things were mostly working, and I was getting nice piles of data. Usually. It turns out that there are a lot of issues with trying to rely on this scraped data.

  1. The data doesn’t know everything

    Even though switching away from 511 got me train numbers, it didn’t get me train arrival data, which is really what I wanted. So stations at the end of the line (San Francisco, San Jose, Tamien, and Gilroy) could never have good arrival data. San Francisco and San Jose in particular have a reasonably long and somewhat variable run between the second-to-last station and the terminal, so having this data would be especially helpful.

    Other examples include express trains that turn into locals when there’s some kind of SNAFU. The real-time departure system doesn’t (or can’t) reflect this, so that data is always lost.

  2. The data is not raw

    Since the data I scraped is essentially the same data fed to the departure signs at individual stations, I didn’t get the full raw data. Instead, I got data that Caltrain had already slightly processed before publishing! For example, if a train is five minutes late at its second station, it will still show as arriving on time at the last few stations on its route. This meant that only the next few stations had data even worth considering.

    For example, the following train is 8 minutes late (scheduled: 19:38, actual: 19:46) as of reading 2642 at Palo Alto. But in San Jose, it shows as arriving a minute early as of that same reading. As the train gets closer, the estimate becomes more realistic, settling on 20:20, eight minutes later than scheduled.

    sqlite> select name from stations where id = 17 ;
    Palo Alto
    sqlite> select reading_id,station_id,arrival from timepoints where reading_id = 2642 and station_id = 17 and train_id = 77 ;
    2642|17|2014-07-29 19:46:00.000000
    sqlite> select name from stations where id = 9 ;
    San Jose Diridon
    sqlite> select reading_id,station_id,arrival from timepoints where reading_id > 2642 and arrival < '2014-07-30' and station_id = 9 and train_id = 77 order by arrival ;
    2642|9|2014-07-29 20:11:00.000000
    2643|9|2014-07-29 20:11:00.000000
    [...]
    2655|9|2014-07-29 20:11:00.000000
    2656|9|2014-07-29 20:12:00.000000
    2659|9|2014-07-29 20:12:00.000000
    2660|9|2014-07-29 20:12:00.000000
    2661|9|2014-07-29 20:12:00.000000
    2657|9|2014-07-29 20:13:00.000000
    [...]
    2673|9|2014-07-29 20:19:00.000000
    2675|9|2014-07-29 20:19:00.000000
    2676|9|2014-07-29 20:20:00.000000
    2677|9|2014-07-29 20:20:00.000000
    
  3. The data is often wrong

    For example, a southbound train was scheduled to stop at Palo Alto. The data showed the train as on time at first, and then spent ten minutes (from reading 2633, at 19:35, onwards) insisting it was two minutes away. A friend waiting at Menlo Park told me when he finally got on that train: the station signs had said it was arriving for about eight minutes before giving up entirely. The train actually arrived at the last predicted time (19:46), but there was no data at all for the couple of minutes before it arrived.

    sqlite> select name from stations where id = 17 ;
    Palo Alto
    sqlite> select reading_id,arrival from timepoints where station_id = 17 and reading_id > 2631 and train_id = 77 and arrival < '2014-07-30' ;
    2631|2014-07-29 19:37:00.000000
    2632|2014-07-29 19:37:00.000000
    2633|2014-07-29 19:37:00.000000
    2634|2014-07-29 19:38:00.000000
    2635|2014-07-29 19:39:00.000000
    2636|2014-07-29 19:40:00.000000
    2637|2014-07-29 19:41:00.000000
    2638|2014-07-29 19:42:00.000000
    2639|2014-07-29 19:43:00.000000
    2640|2014-07-29 19:44:00.000000
    2641|2014-07-29 19:45:00.000000
    2642|2014-07-29 19:46:00.000000
    sqlite> select time from readings where id = 2631 ;
    2014-07-29 19:33:00.000000
    sqlite> select time from readings where id = 2642 ;
    2014-07-29 19:44:00.000000
    

    Other oddities I observed include the data saying a train would depart from two different stations at the same time, for two separate pairs of stations (San Antonio and Mountain View, #11 and #19, and Palo Alto and California Ave, #17 and #21). Unless Caltrain has wormhole technology, I don’t think this is very likely:

    sqlite> select station_id,arrival from timepoints where reading_id = 2642 and train_id = 77 order by arrival ;
    17|2014-07-29 19:46:00.000000
    21|2014-07-29 19:46:00.000000
    19|2014-07-29 19:51:00.000000
    11|2014-07-29 19:51:00.000000
    4|2014-07-29 19:55:00.000000
    22|2014-07-29 19:58:00.000000
    3|2014-07-29 20:03:00.000000
    9|2014-07-29 20:11:00.000000
    

With all these problems, getting useful information out of this mass of scraped data is really hard. Collecting data with a pile of issues is easy; getting something you can use to view the day-in, day-out performance of the trains is much harder. One reason I abandoned this project was that I wasn’t sure this vaguely unreliable data would be useful for anything.

The other problem was that my script broke. A lot. I kept getting weird bits of unscraped data showing up, weird exceptions, and Ruby interpreter crashes (which appear to be related to my use of the curb gem). All of this was a lot to keep up with. One of the things I should have done from the start was make it easier to detect and view errors. While I ended up logging most of the necessary info, I had to dig through it manually. An automatic tool to find and present inconsistencies and errors would have made maintaining the scraper a lot easier.

What did I learn from all of this?

If I were going to keep working on this, I would tackle two things next: making the automated error detector, and building a tool to process the raw data into something easier to analyze. The automation would help me expend energy on the more interesting parts, and making the raw data usable would provide more motivation to keep working on this.

I learned that I enjoy writing web scrapers and trying to make them reliable, even when making 50k requests a day.

The other thing I learned is that I would be much happier if Caltrain just published their raw train data, and then I wouldn’t have needed to write this blog post.