I'd like Caltrain to publish raw train data

30 Aug 2014
I enjoy cool visualizations of transit system performance, especially this one for the Boston MBTA, so I wanted to make one for the train that I often use, Caltrain. The MBTA visualization shows patterns of how the service works, including the difference in times between station stops at different times of day.
For example, here’s a Marey graph of a weekday Caltrain schedule (local trains in brown, limited in black, express in red). The horizontal axis shows departure and arrival times from morning (first train leaves San Jose at 4:30 AM) to evening (last train arrives in San Jose at 1:32 AM), with a few stations noted:
But that’s just the planned schedule. I wanted to visualize the performance of real Caltrain trains over several days. I went through the whole process of using transit data APIs, parsing and massaging the unfiltered data, and trying to turn it into something useful — and then I set aside this project, and here’s why.
This post may be useful to you if you’re curious about what kind of data you can get from Caltrain, obstacles you might run into when building web scrapers (and what you can do about them), and the downsides of the scraped data I got. You can also poke around the scraper source code to see what I did.
Trying an API and resorting to scraping
Lots of transit agencies share data about their trains and buses and when those vehicles are going places, including the MBTA, the London Underground, and Bay Area Rapid Transit (BART). Caltrain theoretically provides this too: the real-time info for developers page includes a link to the 511 real-time API. Great! The first step is to try out the 511 API - I used the rest-client gem for Ruby to do this.
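Here's roughly what that first attempt looked like. The endpoint, token parameter, and response schema below are my reconstruction, not the exact 511 interface, so the live request is shown only as a comment; the parsing helper runs against a made-up sample response.

```ruby
require 'rexml/document'

# A hypothetical sample of a 511-style departures response; the real
# schema may differ.
SAMPLE = <<~XML
  <RTT>
    <Route Name="Caltrain">
      <Stop name="Palo Alto Caltrain Station">
        <DepartureTime>5</DepartureTime>
        <DepartureTime>35</DepartureTime>
      </Stop>
    </Route>
  </RTT>
XML

# With rest-client, the live request would look something like
# (token and stop code are placeholders):
#   require 'rest-client'
#   xml = RestClient.get('http://services.my511.org/Transit2.0/' \
#                        'GetNextDeparturesByStopCode.aspx',
#                        params: { token: 'YOUR_TOKEN', stopcode: '70171' })

# Pull the minutes-until-departure values out of the XML.
def departure_minutes(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//DepartureTime').map { |e| e.text.to_i }
end

puts departure_minutes(SAMPLE).inspect
```

Note that each value is just "minutes until a departure" — nothing identifies which train it belongs to, which is the problem described next.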
Now that I have a way to get the data, can I build my visualization? Not really. The 511 data only provides train departure times. At a glance this seems like it could be enough information, but there’s a big problem: it doesn’t tell you which train is departing. Caltrain is not a simple system that stops at every station — there are express trains that only make a few stops from beginning to end, limited trains that make more stops (in several service patterns!), and local trains that make all stops. How do we learn which train is where?
Well, Caltrain provides a real-time departure list on their website, and that includes all the different train numbers! This is pretty helpful, although there's no API for accessing it, so I scraped the data directly from their website. Using the excellent Burp Suite to intercept my HTTP requests and figure out what I needed to do, I worked out how to get both the full list of stations and the data for each station. This required a bit of work, as Caltrain appears to be using an ancient ASP.NET CMS named “IronPoint”, but in the end, I got the data. It looks like this:
Iterating and debugging
The basic scraper logged the data in a fairly raw form and had minimal error handling. Next I wanted to make the data more usable, allowing searching by individual trains or by time of day. I also planned to improve the error handling, both by logging more details and by making it easier for me to bucket the errors.
Of course, before I could even start thinking about these improvements, the script broke after fifteen minutes of running. It turns out it didn’t handle the server responding with no data. After I squashed that bug, I added a bit of logging to help me catch similar problems in the future.
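The fix was a simple guard plus a log line. This is an illustrative sketch, not the scraper's actual code — `parse_departures` and its placeholder parsing stand in for the real thing:

```ruby
require 'logger'

LOG = Logger.new($stdout)

# Treat a nil or empty body as a failed scrape for this cycle instead
# of letting the parser crash on it.
def parse_departures(body, station)
  if body.nil? || body.strip.empty?
    LOG.warn("empty response for #{station}, skipping this cycle")
    return nil
  end
  body.lines.map(&:strip)  # placeholder for the real parsing
end
```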
Next I fixed the scraper to run at the correct part of each minute. I wanted to make sure I got as much data as possible without having to make multiple requests per minute. If I requested data just as the servers updated, I’d get data from different minutes. My workaround for this: make a request every five seconds, and once the minute changes, use that as the time to start a scrape request. Then, do a scrape every minute, and everything is good.
got 32 stations
looping now
retrieving departures
--- Times should be '11:45 AM' ---
Times all look good
retrieving departures
--- Times should be '11:46 AM' ---
Times all look good
retrieving departures
--- Times should be '11:47 AM' ---
Times all look good
[...]
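The minute-alignment loop above can be sketched like this (the names are mine, not the actual scraper's):

```ruby
# How long until the next minute boundary, for a given clock reading.
def seconds_until_next_minute(now = Time.now)
  60 - now.sec
end

# The loop itself (not run here) would look something like:
#   last_minute = Time.now.min
#   loop do
#     sleep 5                        # poll the clock every five seconds
#     if Time.now.min != last_minute # minute just rolled over
#       last_minute = Time.now.min
#       do_scrape                    # then scrape once per minute
#     end
#   end
```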
Another bug cropped up in the time parsing code. The scraped data included two kinds of times: the current time, in the form "11:30 AM", and the arrival time, in the form "15 min.". I originally stored these in the database as raw strings, but they're much easier to work with as real timestamps. My initial attempt parsed the clock time and just assumed the current day; arrival times added the specified number of minutes. This seemed OK, so I decided to check on the data the next day and went to sleep.
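A sketch of that first parsing attempt — the format strings are my reconstruction of the two scraped forms, not something Caltrain documents:

```ruby
require 'time'

# Parse either a clock time like "11:30 AM" (assumed to be today) or a
# relative time like "15 min." (added to the current reading time).
def parse_scraped_time(str, now)
  if str =~ /(\d+)\s*min/
    now + $1.to_i * 60
  else
    t = Time.strptime(str, '%I:%M %p')
    Time.new(now.year, now.month, now.day, t.hour, t.min)
  end
end
```

The "assume the current day" part is exactly what broke overnight, as described next.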
Well, there were weird things with the times. I had a bunch of entries right up until about 11:49 PM the day before. Then I had some from 11:50–11:59 PM on the current day, followed by a bunch more from 12:00 AM onwards. This struck me as odd, as 11:50 PM wouldn’t occur for almost 24 hours! This happened because my server’s time was slightly off — about ten minutes. I added logic to determine the current day, and weird time traveling became a thing of the past. I also fixed the clock on my server.
sqlite> select id, created_at, time from readings where id > 108 limit 16;
109|2014-07-26 23:57:01.051938|2014-07-26 23:47:00.000000
110|2014-07-26 23:58:01.087952|2014-07-26 23:48:00.000000
111|2014-07-26 23:59:00.109986|2014-07-26 23:49:00.000000
112|2014-07-27 00:00:00.149401|2014-07-27 23:50:00.000000
113|2014-07-27 00:01:00.615091|2014-07-27 23:51:00.000000
114|2014-07-27 00:02:00.377811|2014-07-27 23:52:00.000000
[...]
120|2014-07-27 00:08:00.119957|2014-07-27 23:58:00.000000
121|2014-07-27 00:09:00.065927|2014-07-27 23:59:00.000000
122|2014-07-27 00:10:00.251709|2014-07-27 00:00:00.000000
123|2014-07-27 00:11:00.077243|2014-07-27 00:01:00.000000
124|2014-07-27 00:12:00.124230|2014-07-27 00:02:00.000000
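The day-correction logic can be sketched as a 12-hour window around the reading time. This is my reconstruction of the idea, not the scraper's exact code:

```ruby
# If the parsed clock time is more than 12 hours ahead of the reading
# time, it must belong to the previous day (and vice versa). This fixes
# the "11:50 PM tomorrow" entries shown above.
def correct_day(parsed, reference)
  if parsed - reference > 12 * 3600
    parsed - 24 * 3600
  elsif reference - parsed > 12 * 3600
    parsed + 24 * 3600
  else
    parsed
  end
end
```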
Another set of issues that kept coming back was server timeouts. Occasionally the server would do nothing after I made my request. No response, no failure, just nothing. This caused some weird interactions with the scraper, which would hang while waiting for these responses. There were a few times where I missed several minutes of data because of this! I ended up with a reasonably robust system for handling this. I gave the initial parallel requests a 10-second timeout, and all requests that didn’t complete would be re-run afterward. Of course, I had to deal with timeouts there too — once, a re-request hung for five minutes before returning. My result: I wrote scraping code that took less than a minute to run, and if it couldn’t retrieve data for a station, it logged the error and continued scraping.
[...]
doing scrape at 2014-08-03 11:05:15 -0700
scrape times: request: 1.70s retries: 0.00s
finished retrieval at 2014-08-03 11:05:17 -0700
got reading time 2014-08-03T10:55:00-07:00 and creation time 2014-08-03T11:05:17-07:00
scrape complete
doing scrape at 2014-08-03 11:06:15 -0700
got an error for Atherton, making special request
got an error for Broadway, making special request
scrape times: request: 1.74s retries: 1.02s
finished retrieval at 2014-08-03 11:06:18 -0700
got reading time 2014-08-03T10:56:00-07:00 and creation time 2014-08-03T11:06:18-07:00
scrape complete
doing scrape at 2014-08-03 11:07:15 -0700
scrape times: request: 1.72s retries: 0.00s
finished retrieval at 2014-08-03 11:07:17 -0700
got reading time 2014-08-03T10:57:00-07:00 and creation time 2014-08-03T11:07:17-07:00
scrape complete
[...]
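The timeout-and-retry scheme can be sketched like this. `fetcher` stands in for the real curb-based request, and the names are mine rather than the scraper's:

```ruby
require 'timeout'

# Give every station request a timeout, collect the stations that
# failed, and re-request those afterward (also with a timeout) so one
# hung connection can't stall the whole scrape.
def scrape_all(stations, timeout: 10, &fetcher)
  results, failed = {}, []
  stations.each do |s|
    begin
      results[s] = Timeout.timeout(timeout) { fetcher.call(s) }
    rescue StandardError => e
      failed << s
      warn "got an error for #{s}, making special request (#{e.class})"
    end
  end
  failed.each do |s|
    begin
      results[s] = Timeout.timeout(timeout) { fetcher.call(s) }
    rescue StandardError
      warn "giving up on #{s} for this cycle"  # log and keep scraping
    end
  end
  results
end
```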
Pain and suffering
At this point, things were mostly working, and I was getting nice piles of data. Usually. It turns out that there are a lot of issues with trying to rely on this scraped data.
The data doesn’t know everything
Even though I got the train numbers by switching away from 511, I still didn't get train arrival data, which is what I really wanted. So stations at the end of the line — San Francisco, San Jose, Tamien, and Gilroy — could never have good arrival data. San Francisco and San Jose, in particular, have a reasonably long and somewhat variable run between the second-to-last station and the terminal, so having this data would be especially helpful.
Other examples include things like express trains that turn into locals when there’s some kind of SNAFU. The real-time departure system doesn’t or can’t handle this, so this data is always lost.
The data is not raw
Since the data I scraped is essentially the same data fed to the departure signs at individual stations, I didn’t get the full raw data. Instead, I got data that Caltrain already slightly processed before publishing! For example, if a train is five minutes late at the second station, it will still show as arriving on-time at the last few stations on its route. This meant that only the next few stations had data even worth considering.
For example, the following train is 8 minutes late (scheduled: 19:38, actual: 19:46) as of reading 2642 at Palo Alto. But, in San Jose, it’s shown as being one minute early as of reading 2642. As the train gets closer, the time becomes more realistic, settling on 20:20, which is eight minutes later than scheduled.
sqlite> select name from stations where id = 17 ;
Palo Alto
sqlite> select reading_id,station_id,arrival from timepoints where reading_id = 2642 and station_id = 17 and train_id = 77 ;
2642|17|2014-07-29 19:46:00.000000
sqlite> select name from stations where id = 9 ;
San Jose Diridon
sqlite> select reading_id,station_id,arrival from timepoints where reading_id > 2642 and arrival < '2014-07-30' and station_id = 9 and train_id = 77 order by arrival ;
2642|9|2014-07-29 20:11:00.000000
2643|9|2014-07-29 20:11:00.000000
[...]
2655|9|2014-07-29 20:11:00.000000
2656|9|2014-07-29 20:12:00.000000
2659|9|2014-07-29 20:12:00.000000
2660|9|2014-07-29 20:12:00.000000
2661|9|2014-07-29 20:12:00.000000
2657|9|2014-07-29 20:13:00.000000
[...]
2673|9|2014-07-29 20:19:00.000000
2675|9|2014-07-29 20:19:00.000000
2676|9|2014-07-29 20:20:00.000000
2677|9|2014-07-29 20:20:00.000000
The data is often wrong
For example, a southbound train was scheduled to stop at Palo Alto. The data showed the train as on-time at first, and then it spent ten minutes (from reading 2633, at 19:35, onwards) claiming the train was two minutes late. A friend waiting at Menlo Park told me when he finally boarded; the station signs there had said the train was arriving for about eight minutes before giving up entirely. The train actually arrived at the last predicted time (19:46), but there was no data at all for the couple of minutes before it arrived.
sqlite> select name from stations where id = 17 ;
Palo Alto
sqlite> select reading_id,arrival from timepoints where station_id = 17 and reading_id > 2631 and train_id = 77 and arrival < '2014-07-30' ;
2631|2014-07-29 19:37:00.000000
2632|2014-07-29 19:37:00.000000
2633|2014-07-29 19:37:00.000000
2634|2014-07-29 19:38:00.000000
2635|2014-07-29 19:39:00.000000
2636|2014-07-29 19:40:00.000000
2637|2014-07-29 19:41:00.000000
2638|2014-07-29 19:42:00.000000
2639|2014-07-29 19:43:00.000000
2640|2014-07-29 19:44:00.000000
2641|2014-07-29 19:45:00.000000
2642|2014-07-29 19:46:00.000000
sqlite> select time from readings where id = 2631 ;
2014-07-29 19:33:00.000000
sqlite> select time from readings where id = 2642 ;
2014-07-29 19:44:00.000000
Other oddities I observed include the data saying a single train would depart two pairs of stations at exactly the same time (San Antonio and Mountain View, stations #11 and #19, and Palo Alto and California Ave, stations #17 and #21). Unless Caltrain has wormhole technology, I don't think this is very likely:
sqlite> select station_id,arrival from timepoints where reading_id = 2642 and train_id = 77 order by arrival ;
17|2014-07-29 19:46:00.000000
21|2014-07-29 19:46:00.000000
19|2014-07-29 19:51:00.000000
11|2014-07-29 19:51:00.000000
4|2014-07-29 19:55:00.000000
22|2014-07-29 19:58:00.000000
3|2014-07-29 20:03:00.000000
9|2014-07-29 20:11:00.000000
With all these problems, extracting anything useful from this mass of scraped data is really hard. Getting data with a pile of issues is easy; getting something you can use to view the day-in, day-out performance of the trains is much harder. One reason I abandoned this project was that I wasn't sure this vaguely unreliable data would be useful for anything.
The other problem was that my script broke. A lot. I kept getting weird bits of unscraped data showing up, weird exceptions, and Ruby interpreter crashes (which appear to be related to my use of the curb gem). All of this was a lot to keep up with. One of the things I should have done from the start was make it easier to detect and view errors. While I ended up logging most of the necessary info, I had to dig through it manually. An automatic tool to find and present inconsistencies and errors would have made maintaining the scraper a lot easier.
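Such a tool could start very small — for instance, a check for the wormhole-style duplicates shown earlier. This is a hypothetical sketch over plain hashes rather than the real sqlite rows:

```ruby
# Flag any (reading, train) where one train is shown arriving at two
# different stations at the same moment. Returns the offending
# [reading_id, train_id, arrival] keys.
def duplicate_arrivals(timepoints)
  timepoints
    .group_by { |tp| [tp[:reading_id], tp[:train_id], tp[:arrival]] }
    .select { |_, rows| rows.map { |r| r[:station_id] }.uniq.size > 1 }
    .keys
end
```

Run nightly over the day's readings, even a check this simple would have surfaced most of the oddities above without any manual digging.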
What did I learn from all of this?
If I were going to keep working on this, I would tackle two things next: making the automated error detector, and building a tool to process the raw data into something easier to analyze. The automation would help me expend energy on the more interesting parts, and making the raw data usable would provide more motivation to keep working on this.
I learned that I enjoy writing web scrapers and trying to make them reliable, even when making 50k requests a day.
The other thing I learned is that I would be much happier if Caltrain just published their raw train data, and then I wouldn’t have needed to write this blog post.