Hansblog Sometimes code, security, transit, other projects. Also @n0nst1ck.

Making an IoT Garage Door Opener

I live in a house with a garage and a few roommates. Because it’s California, nobody actually stores cars in the garage. We do keep stuff like bicycles and car covers there though! For a while, we had a wifi-based opener built by one of my roommates (based on a Particle Photon), but it was so IoT that it needed internet access and an in-home server for the web interface. Once the server in our house disappeared, we needed something new. This is what the old one looked like (until I disassembled it):

The original garage door opener

The most important feature the old garage door opener had was that by joining the local wifi, you could open and close the door. Since I was rebuilding it, I set a few other goals, such as making it work entirely on local wifi. This means if our internet goes down, we can still open the door!

There are three main parts to put together:

  1. Connect to the local wifi network
  2. Run a webserver that we can access
  3. Do some Electronics Things when we get certain web requests

The board I ended up using for the new one is the C.H.I.P., a $9 board that runs Linux, has onboard wifi and Bluetooth (solving part 1), and most importantly was in my roommate’s spare parts bin. Since it’s running Linux, that makes it easy to use modern web frameworks like Flask, neatly handling part 2. It also supports digital input and output pins (which gives a start on part 3).

The C.H.I.P. was already set up with their custom Debian install, so I connected to it via the serial port. After logging in, I used nmtui to connect to the wifi network used by our phones. The rest of the work was getting the application and associated hardware up and running. In the rest of this post, I’ll talk about my meandering path to getting something working, and close it out by presenting my final working schematic in addition to the webapp running the opener.

Toggling some outputs

The C.H.I.P. has a piece of hardware known as a GPIO (General Purpose Input/Output) expander. This is a very long way to say it can output 0s and 1s or read a digital signal from a pin. If you’ve ever used digital pins on an Arduino before, you’ve used a GPIO without even knowing it! Normally, GPIO devices have to be controlled in very low-level software, often by writing bits directly into memory. The next step was figuring out what I have to do on the C.H.I.P. to control its GPIO expander.

Linux makes our lives easier

Since I already had some vaguely-working circuitry from the previous opener, I left that alone and started by trying to get input / output working on the C.H.I.P. The documentation notes that there are eight GPIO ports provided by an I/O expander, and there are some fancy libraries for using it. I ignored this and promptly investigated the shell commands below. There was a lot of stuff about exporting the right port number based on your kernel version. I had no idea what this was talking about, and I decided to look further.

I discovered that sometime in the past several years, Linux grew a GPIO interface in SysFS! The documentation provides a clear explanation about how you can use this interface. From what I can tell, it’s designed for slower applications that aren’t timing-critical. In other words, it’ll work great for a garage door opener.

How do you use it?

The first step is figuring out which devices you have. If you look in /sys/class/gpio/, there are a bunch of entries like gpiochip408. Each entry is a GPIO controller known to the kernel. The 408 is the number of the first GPIO line on that controller, and it maps to one of the physical pins. Each controller can have a bunch of pins, numbered consecutively, which means the 8th pin on gpiochip408 would be 415.
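As a rough illustration of that numbering, here is a minimal Python sketch that lists the gpiochip entries and works out a pin number from a chip's base. It assumes each gpiochip directory exposes the standard sysfs `base` and `ngpio` files. The function names are made up for this example and are not part of the opener's code.

```python
import os


def list_gpio_chips(sysfs_root="/sys/class/gpio"):
    """Return (name, base, ngpio) for every gpiochipN entry under sysfs_root."""
    chips = []
    for entry in sorted(os.listdir(sysfs_root)):
        if not entry.startswith("gpiochip"):
            continue
        chip_dir = os.path.join(sysfs_root, entry)
        # "base" is the first GPIO number, "ngpio" is how many lines the chip has.
        with open(os.path.join(chip_dir, "base")) as f:
            base = int(f.read().strip())
        with open(os.path.join(chip_dir, "ngpio")) as f:
            ngpio = int(f.read().strip())
        chips.append((entry, base, ngpio))
    return chips


def pin_number(base, offset):
    """Global GPIO number for the pin at a zero-based offset on a chip."""
    return base + offset
```

With a base of 408, `pin_number(408, 7)` gives the number of the 8th pin.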

Running the shell command echo 408 > /sys/class/gpio/export will cause the kernel to make that GPIO available to you as a filesystem object! Note the fancy new /sys/class/gpio/gpio408 folder. This folder contains a bunch of files like value, direction, and active_low. You can read the value from value when the direction is in, or write 0 / 1 to it when the direction is out.

That’s pretty much it for things we will need to configure to make our GPIOs work. One of the interesting things about this abstraction is that it doesn’t get super technical and isn’t tied to the underlying electronics. This means that while you can’t express stuff like “this is an open-collector output, and you can enable or disable the pull-up”, it does make it extremely easy to work with, even from a shell!

How did I use it?

On the C.H.I.P. board I have, the GPIO expander starts at #408, which means it’s called /sys/class/gpio/gpiochip408 on the filesystem. Depending on your kernel version, this number might change! I started out by using just one pin. I figured out which pin mapped to which port by using the pinout in the documentation. For me, 408 maps to XIO-P0. The setup code in Python looked something like:

try:
    with open("/sys/class/gpio/export", "w") as f:
        f.write("408")
except IOError:
    pass # The write fails if the pin is already exported
with open("/sys/class/gpio/gpio408/direction", "w") as f:
    f.write("out")
with open("/sys/class/gpio/gpio408/value", "w") as f:
    f.write("1") # The default value of the GPIO expander

In actuality, I tested this with the following shell commands, which turn the port on and off every second:

echo 408 > /sys/class/gpio/export
echo out > /sys/class/gpio/gpio408/direction
while true ; do
    echo 1 > /sys/class/gpio/gpio408/value
    sleep 1
    echo 0 > /sys/class/gpio/gpio408/value
    sleep 1
done

I then used my multimeter to measure the voltage between XIO-P0 and GND (ground). It worked! The pin toggled between 3.3V and 0V, which is the expected range for this chip.

Making sure the analog parts work

I hooked up the pin to an LED from my parts box in series with a 300 ohm or so resistor as a basic test:

LED test schematic

The LED worked, but it was pretty dim. Some poking around on the internet led to me discovering that the GPIO expander hardware couldn’t produce a lot of current, which meant that this was expected behavior. I tried it with the existing board, but it didn’t really work. The current provided by the GPIO expander (the Texas Instruments PCF8574A) was so low it couldn’t even fully turn on the transistor that we used to switch power to the relays. According to the diagram on page 12 of the datasheet, the expander will only source 100 microamps! For context, a normal LED draws 20 milliamps, or 200 times more current.
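To put rough numbers on this, here is a small back-of-the-envelope calculation in Python. The 2.0 V LED forward voltage is an assumed typical value, not something measured on this circuit. The 100 microamp and 20 milliamp figures are the ones quoted above.

```python
def led_current_amps(supply_v, forward_v, resistor_ohms):
    """Current an LED with a series resistor would draw from an ideal supply."""
    return (supply_v - forward_v) / resistor_ohms


SOURCE_LIMIT_A = 100e-6  # what the expander can source, per the datasheet figure
TYPICAL_LED_A = 20e-3    # the "normal LED" figure

# 2.0 V forward drop is an assumption; the 300 ohm resistor is approximate too.
wanted = led_current_amps(3.3, 2.0, 300.0)
print("LED would like about %.1f mA" % (wanted * 1e3))
print("A 20 mA LED needs %.0fx the source limit" % (TYPICAL_LED_A / SOURCE_LIMIT_A))
```

Even with the resistor limiting things to a few milliamps, the LED wants far more current than the expander can source, which fits with it being dim.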

Luckily, my roommate had a relay board with optoisolators on it. Optoisolators are normally used to electrically disconnect two circuits, such as when you have low-power electronics driving a high-power motor. They do this by having an internal LED turn on a phototransistor, in essence acting as a light-controlled switch! There is no electrical connection between the two sides. While the opener doesn’t need any electrical isolation, the low current required to enable a high-current output is useful to us. I hooked the optoisolator inputs up between the GPIO expander and 3.3V power, and it worked! The optoisolator input draws about 400 microamps, which is too much for the expander to source, but not too much for it to sink.

Schematic of first attempt with optoisolated relays

If you have questions about reading schematics, SparkFun has a good primer on it! In this schematic, pins 1 through 8 on the J2 correspond to XIO-P0 through XIO-P7.

If you’re wondering why there are two relays with a capacitor in series on one of them, it’s all due to how the actual garage door opener works. It measures the capacitance on the line and takes different actions based on the value read. Shorting the line opens / closes the door, and putting a 1µF capacitor in series toggles the light.

One interesting quirk of the GPIO expander is that all of the I/O pins are high by default (it says this on the first page!). One thing to think about is that we don’t want our garage door to toggle when we power on our board, so we want the default state to be safe. By providing power to one side of the optoisolator and hooking the negative side of the LED to the GPIO expander, there will be no current flowing through it by default. This means that the relay should be in the same position if the power is off completely or if the power has been turned on and the GPIO hasn’t been configured yet.

“Hardware isn’t hard”

Now that I’d worked out how to make a pin toggle its output whenever I wanted, I figured that the rest of the hardware would be easy and all I had left was the software (spoilers: I’d made some assumptions about the hardware that came back to bite me, but I’ll get to that later). I got a copy of the previous opener’s Python script from my roommate so I could make it less cranky and work with this new hardware.

The opener program itself is a small webapp written in Python using the Flask framework. It exposes three endpoints: a page that displays two buttons (door and light), an endpoint that toggles the door, and an endpoint that toggles the light. It’s really basic, but it does the job.

One of the annoying parts of our old opener was that it didn’t “debounce” the “open door” link. Debouncing is the process by which we make sure what humans perceive as a single button press is interpreted by our program as a single button press. This was a problem when you pulled the page up on your phone, but it loaded slowly and you impatiently hit the “door” button again (which is of course not a thing that has ever happened to me). One of the easiest ways to implement debouncing is to have the program wait after it detects the first button click. By ignoring all additional button clicks in the next second or two, the number of false clicks goes way down! I implemented this with some basic synchronization primitives in Python:

import threading
import time

class GpioToggler(object):
    def __init__(self):
        self.lock = threading.Lock()
        # Other initialization here

    def toggle_pin(self, pin):
        # Non-blocking acquire: fails if a toggle is already in progress
        has_lock = self.lock.acquire(False)
        if has_lock:
            t = threading.Thread(target=self.worker,
                                 args=(pin,))
            t.start()
            return 200
        else:
            return 409

    def worker(self, pin):
        try:
            print("Toggling pin %s" % pin)
            # Toggle the GPIOs here
            time.sleep(1.25)
            print("Unlocking!")
        except Exception as e:
            print("Something broke: %s" % e)
        finally:
            # Release so the next click can go through
            self.lock.release()

I also abstracted out the hardware interface to make it easy to change the GPIO pin number in the future. While pins 408 and 409 work on this version of the C.H.I.P., that might not be the case on a Raspberry Pi or even a new Linux version on the C.H.I.P.! It’s not quite using a config file, but it’ll be really easy to abstract out when I want to.

class GpioToggler(object):
    def __init__(self):
        self.pins = {}
        self.set_up_pin("garage", "408")
        self.set_up_pin("light", "409")

    def set_up_pin(self, name, which, value="0"):
        self.pins[name] = which
        # All the GPIO sysfs operations happen here
        self.set_up_pin_raw(which, value)

    def toggle_pin(self, pin):
        if pin not in self.pins:
            print("Bad pin %s" % pin)
            return 400
        # Do the pin toggling here
        return 200

@app.route('/garage/door')
def garage():
    toggler.toggle_pin("garage")
    return redirect('/garage')

@app.route('/garage/light')
def light():
    toggler.toggle_pin("light")
    return redirect('/garage')
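If the pin names ever did move into a config file, one possible shape is a small JSON object mapping names to GPIO numbers. The sketch below is hypothetical: the file format and the `load_pin_map` helper are assumptions for illustration, not part of the opener's code.

```python
import json


def load_pin_map(config_text):
    """Parse a JSON object mapping pin names to sysfs GPIO numbers."""
    raw = json.loads(config_text)
    # Store numbers as strings, since they get written to sysfs files as text.
    return {str(name): str(number) for name, number in raw.items()}


# Example config contents (hypothetical format).
example_config = '{"garage": 408, "light": 409}'
pins = load_pin_map(example_config)
```

Each entry in `pins` could then be handed to something like `set_up_pin(name, which)` at startup.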

The “last thing” left to do was plug in the relays and make sure everything functioned. I plugged it in, clicked on the “door” button, and it worked! I could hit the button, the relay would click on, and then click off. I was done! One last thing to check: does it open the door when I reboot the whole thing? This is a really important thing to consider, as the last thing you want your garage door opener to do is open the garage when you aren’t expecting it, such as after a power outage. It turned out that the door would in fact open when the device got rebooted, which meant I had to go back to the drawing board.

Hardware requires problem-solving

Why did the door open when I rebooted the C.H.I.P.? The symptom I saw was that when the board was turned on, the relays would click on for a moment and then turn off as soon as the on-board electronics had powered up, about a quarter of a second later. I wasn’t sure exactly what was causing this, but my guess was that the GPIO expander wasn’t turning itself on as fast as it should. This meant that for a brief period of time, the expander would sink current from the optoisolator, allowing the relays to engage (and open the door accidentally). I was going to have to come up with a significantly more clever solution.

This whole project has been building stuff from my roommate’s junk bin, and I wasn’t about to stop now. The junk bin had a bunch of transistors (specifically the PN2222, which is an NPN-type transistor) which we’d used on the original board, so I set to work figuring out how to use a few of these in concert with the GPIO expander. My goal was to find some configuration that would not toggle the relays on startup while also working with the default-high output of the GPIO expander.

I eventually settled on a design that used one transistor per optoisolator as a switch. The positive side of the optoisolator LED was connected to 3.3V power, and the negative side was connected to the collector on a transistor. The base (which you can think of as the “switch input”) was hooked up to the GPIO expander. The trick was, instead of hooking up the emitter to ground, I connected it to the GPIO expander. What this means is that during startup, the base and emitter will be at the same potential the whole time. Transistors do all of their switching work based on there being a potential (i.e. non-zero voltage) between the base and emitter, so it stays completely off during startup!

Final garage door opener schematic

To use this from software, I first have to set the door and light GPIO outputs (the transistor bases) to logic 0. Once I’ve done that, I can set the GPIO pin connected to the emitters to logic 0 as well, allowing it to sink current when the door or light transistors are switched back on. I can then toggle the two switch outputs at will, and it will operate the relays successfully! Note that if I did this in the opposite order and set the emitter GPIO to logic 0 first, I would accidentally turn on both relays. This is because the switch inputs would still be high relative to the emitter, and setting the emitter GPIO to logic 0 creates a potential across each transistor’s base-emitter junction, allowing it to switch on and let current flow through the LED.

This is the good-case timing diagram:

door          ______...---‾‾‾‾ ... ‾‾‾\____________________/‾‾\______
light         ______...---‾‾‾‾ ... ‾‾‾\____________/‾‾\______________
emitter       ______...---‾‾‾‾ ... ‾‾‾‾‾‾‾‾\_________________________

door relay    ________________ ... ________________________/‾‾\______
light relay   ________________ ... ________________/‾‾\______________
                    ^ power-on       ^             ^ light is clicked
                                     | program starts

And here’s what happens if we do it out of order, accidentally triggering both relays on program start:

door          ______...---‾‾‾‾ ... ‾‾‾‾‾‾‾‾‾\_______/‾‾\_____________
light         ______...---‾‾‾‾ ... ‾‾‾‾‾‾‾‾‾\________________________
emitter       ______...---‾‾‾‾ ... ‾‾‾‾‾\____________________________

door relay    ________________ ... _____/‾‾‾\_______/‾‾\_____________
light relay   ________________ ... _____/‾‾‾\________________________
                    ^ power-on          ^           ^ door is clicked
                                        | program starts
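The safe ordering from the first diagram can be sketched in Python roughly as follows. This is an illustration under assumptions, not the opener's actual startup code: `write_value(pin, value)` stands in for a helper that writes to /sys/class/gpio/gpio<pin>/value, the emitter pin number 410 is a guess, and the half-second pulse length is arbitrary.

```python
def startup_sequence(write_value, door_pin="408", light_pin="409",
                     emitter_pin="410"):
    """Bring the outputs up in the safe order.

    write_value(pin, value) is a hypothetical helper that writes "0" or "1"
    to the pin's sysfs value file. The emitter pin number is an assumption.
    """
    # 1. Pull both transistor bases low while the emitter is still high,
    #    so there is no base-emitter potential to switch them on.
    write_value(door_pin, "0")
    write_value(light_pin, "0")
    # 2. Only now pull the shared emitter low so the transistors can
    #    sink current once a base goes high.
    write_value(emitter_pin, "0")


def pulse(write_value, pin, sleep, seconds=0.5):
    """Switch one relay on briefly, then off again (duration is assumed)."""
    write_value(pin, "1")
    sleep(seconds)
    write_value(pin, "0")
```

Swapping steps 1 and 2 reproduces the second diagram, where both relays click at program start.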

Putting it all together

The new hardware design required some additional code for the startup logic, but only a little bit! Once I verified that the script worked, I added a systemd unit file for the script to start it automatically on boot.

I used a small breadboard, the relay board, some double-sided tape, and a bunch of jumper wires to get the whole thing packaged up and ready to install. It now lives on top of my garage door opener, quietly waiting for one of us to click the button and have it do something useful. It seems a little silly to use an entire Linux system just to have a web page where I can click “open the door”, but this is the world we now live in. It’s definitely easier to glue together Python and GPIO files than to write low-level microcontroller code that serves a webpage. I put the code for the opener up on the internet so you can run your own wireless opener!

There are definitely future things I want to do with this! It would be nice to use the on-board Bluetooth chip to do BLE-based opening and provide iBeacon / Eddystone support. It would also be nice if I had an integrated app for my phone instead of having to pull up a web page. It could be fun to support user authentication and give time-based access to people or guests. It’d be helpful if it had audit logging. But these are all projects for another day!

Thanks to all who helped me review this before posting!

I'd like Caltrain to publish raw train data

I enjoy cool visualizations of transit system performance, especially this one for the Boston MBTA, so I wanted to make one for the train that I often use, Caltrain. The MBTA visualization shows patterns of how the service works, including the difference in times between station stops at different times of day.

For example, here’s a Marey graph of a weekday Caltrain schedule (local trains in brown, limited in black, express in red). The horizontal axis shows departure and arrival times from morning (first train leaves San Jose at 4:30 AM) to evening (last train arrives in San Jose at 1:32 AM), with a few stations noted:

Caltrain weekday Marey graph

But that’s just the planned schedule. I wanted to visualize the performance of real Caltrain trains over several days. I went through the whole process of using transit data APIs, parsing and massaging the unfiltered data, and trying to turn it into something useful — and then I set aside this project, and here’s why.

This post may be useful to you if you’re curious about what kind of data you can get from Caltrain, obstacles you might run into when building web scrapers (and what you can do about them), and the downsides of the scraped data I got. You can also poke around the scraper source code to see what I did.

Trying an API and resorting to scraping

Lots of transit agencies share data about their trains and buses and when those vehicles are going places, including the MBTA, the London Underground, and Bay Area Rapid Transit (BART). Caltrain theoretically provides this too: the real-time info for developers page includes a link to the 511 real-time API. Great! The first step is to try out the 511 API - I used the rest-client gem for Ruby to do this.

Now that I have a way to get the data, can I build my visualization? Not really. The 511 data only provides train departure times. At a glance this seems like it could be enough information, but there’s a big problem: it doesn’t tell you which train is departing. Caltrain is not a simple system that stops at every station — there are express trains that only make a few stops from beginning to end, limited trains that make more stops (in several service patterns!), and local trains that make all stops. How do we learn which train is where?

Well, Caltrain provides a real-time departure list on their website, and that includes all the different train numbers! This is pretty helpful, although it doesn’t have an API for accessing it, so I scraped the data directly from their website. Using the excellent Burp Suite to intercept my HTTP requests and figure out what I needed to do, I found out how to both get the full list of stations, and how to get the data for each station. This required a bit of work, as Caltrain appears to be using an ancient ASP.NET CMS named “IronPoint”, but in the end, I got the data. It looks like this:

s1<IRONPOINT>TIME</IRONPOINT>as of&nbsp;5:44 PM<IRONPOINT>TIME</IRONPOINT><IRONPOINT>ALERTS</IRONPOINT><IRONPOINT>ALERTS</IRONPOINT><IRONPOINT>TRAINS</IRONPOINT><table class="ipf-caltrain-table-trains" width="100%"  cellspacing="0" cellpadding="0" border="0"><tr class="ipf-st-ip-trains-table-dir-tr"><th class="ipf-st-ip-trains-table-dir-td1"><div>SOUTHBOUND</div></th><th class="ipf-st-ip-trains-table-dir-td2"><div>NORTHBOUND</div></th></tr><tr class="ipf-st-ip-trains-table-trains-tr"><td><table class="ipf-st-ip-trains-subtable"><tr class="ipf-st-ip-trains-subtable-tr"><td class="ipf-st-ip-trains-subtable-td-id">440</td><td class="ipf-st-ip-trains-subtable-td-type">Local</td><td class="ipf-st-ip-trains-subtable-td-arrivaltime">32 min.</td></tr><tr class="ipf-st-ip-trains-subtable-tr"><td class="ipf-st-ip-trains-subtable-td-id">442</td><td class="ipf-st-ip-trains-subtable-td-type">Local</td><td class="ipf-st-ip-trains-subtable-td-arrivaltime">92 min.</td></tr><tr class="ipf-st-ip-trains-subtable-tr"><td class="ipf-st-ip-trains-subtable-td-id">804</td><td class="ipf-st-ip-trains-subtable-td-type">Baby Bullet</td><td class="ipf-st-ip-trains-subtable-td-arrivaltime">114 min.</td></tr></table></td><td><table class="ipf-st-ip-trains-subtable"><tr class="ipf-st-ip-trains-subtable-tr"><td class="ipf-st-ip-trains-subtable-td-id">803</td><td class="ipf-st-ip-trains-subtable-td-type">Baby Bullet</td><td class="ipf-st-ip-trains-subtable-td-arrivaltime">13 min.</td></tr><tr class="ipf-st-ip-trains-subtable-tr"><td class="ipf-st-ip-trains-subtable-td-id">443</td><td class="ipf-st-ip-trains-subtable-td-type">Local</td><td class="ipf-st-ip-trains-subtable-td-arrivaltime">46 min.</td></tr><tr class="ipf-st-ip-trains-subtable-tr"><td class="ipf-st-ip-trains-subtable-td-id">445</td><td class="ipf-st-ip-trains-subtable-td-type">Local</td><td class="ipf-st-ip-trains-subtable-td-arrivaltime">106 
min.</td></tr></table></td></tr></table><IRONPOINT>TRAINS</IRONPOINT><IRONPOINT>LINK</IRONPOINT><IRONPOINT>LINK</IRONPOINT>
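To give a sense of what pulling the useful fields out of a response like this involves, here is a minimal Python sketch (the actual scraper was written in Ruby). The regular expressions are written against the sample above and are an assumption about the format in general. The sketch also ignores which direction column each train sits in.

```python
import re

# One train row: id cell, type cell, then "<n> min." in the arrival cell.
ROW_RE = re.compile(
    r'td-id">(?P<train>\d+)</td>'
    r'<td class="ipf-st-ip-trains-subtable-td-type">(?P<kind>[^<]+)</td>'
    r'<td class="ipf-st-ip-trains-subtable-td-arrivaltime">'
    r'(?P<minutes>\d+)\s*min\.'
)
# The "as of 5:44 PM" stamp, with either &nbsp; or whitespace after "of".
TIME_RE = re.compile(r"as of(?:&nbsp;|\s)+(\d{1,2}:\d{2} [AP]M)")


def parse_departures(body):
    """Return (as_of, rows), where rows are (train, kind, minutes) tuples."""
    match = TIME_RE.search(body)
    as_of = match.group(1) if match else None
    rows = [
        (int(m.group("train")), m.group("kind"), int(m.group("minutes")))
        for m in ROW_RE.finditer(body)
    ]
    return as_of, rows
```

A real HTML parser would be sturdier than regular expressions if the markup changed.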

I hooked up the scraper to the Sequel gem and logged the very-slightly-processed raw data into an sqlite database. Time to sit back and watch the data roll in.

Iterating and debugging

The basic scraper logged the data in a fairly raw form and had minimal error handling. Next I wanted to make the data more usable, allowing searching by individual trains or by time of day. I also planned to improve the error handling, both by logging more details and by making it easier for me to bucket the errors.

Of course, before I could even start thinking about these improvements, the script broke after fifteen minutes of running. It turns out it didn’t handle the server responding with no data. After I squashed that bug, I added a bit of logging to help me catch similar problems in the future.

Next I fixed the scraper to run at the correct part of each minute. I wanted to make sure I got as much data as possible without having to make multiple requests per minute. If I requested data just as the servers updated, I’d get data from different minutes. My workaround for this: make a request every five seconds, and once the minute changes, use that as the time to start a scrape request. Then, do a scrape every minute, and everything is good.

got 32 stations
looping now
retrieving departures
--- Times should be '11:45 AM'
--- Times all look good
retrieving departures
--- Times should be '11:46 AM'
--- Times all look good
retrieving departures
--- Times should be '11:47 AM'
--- Times all look good
[...]
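The minute-alignment trick could look something like this in Python. It is a sketch, not the scraper's Ruby code: `probe_minute` is a hypothetical callable that makes one request and returns the "as of" time string the server reports.

```python
import time


def wait_for_minute_change(probe_minute, sleep=time.sleep, poll_seconds=5):
    """Poll every poll_seconds until probe_minute() returns a new value.

    probe_minute is a hypothetical callable that performs one request and
    returns the server's reported time, for example '11:45 AM'.
    """
    first = probe_minute()
    while True:
        sleep(poll_seconds)
        current = probe_minute()
        if current != first:
            # The server just rolled over; scrape once a minute from here.
            return current
```

Once this returns, starting a once-a-minute scrape loop keeps every request a similar distance from the server's update.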

Another bug cropped up in the time parsing code. The scraped data included two kinds of times: the current time, in the form “11:30 AM”, and the arrival time, in the form “15 min.” I originally stored this data in the database as raw strings, but it’s a lot easier to work with if they are stored as computer-readable times. My initial attempt was to parse the time and then just use the current day. Arrival times then added the specified number of minutes. This seemed OK, so I decided to check on the data the next day and went to sleep.

Well, there were weird things with the times. I had a bunch of entries right up until about 11:49 PM the day before. Then I had some from 11:50–11:59 PM on the current day, followed by a bunch more from 12:00 AM onwards. This struck me as odd, as 11:50 PM wouldn’t occur for almost 24 hours! This happened because my server’s time was slightly off — about ten minutes. I added logic to determine the current day, and weird time traveling became a thing of the past. I also fixed the clock on my server.

sqlite> select id, created_at, time from readings where id > 108 limit 16;
109|2014-07-26 23:57:01.051938|2014-07-26 23:47:00.000000
110|2014-07-26 23:58:01.087952|2014-07-26 23:48:00.000000
111|2014-07-26 23:59:00.109986|2014-07-26 23:49:00.000000
112|2014-07-27 00:00:00.149401|2014-07-27 23:50:00.000000
113|2014-07-27 00:01:00.615091|2014-07-27 23:51:00.000000
114|2014-07-27 00:02:00.377811|2014-07-27 23:52:00.000000
[...]
120|2014-07-27 00:08:00.119957|2014-07-27 23:58:00.000000
121|2014-07-27 00:09:00.065927|2014-07-27 23:59:00.000000
122|2014-07-27 00:10:00.251709|2014-07-27 00:00:00.000000
123|2014-07-27 00:11:00.077243|2014-07-27 00:01:00.000000
124|2014-07-27 00:12:00.124230|2014-07-27 00:02:00.000000
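One possible way to pick the right day, sketched in Python: try yesterday, today, and tomorrow, and keep whichever candidate lands closest to the server's own clock. This is an assumption about how such logic might work, not a description of the fix that went into the scraper.

```python
from datetime import datetime, timedelta


def resolve_reading_time(clock_text, server_now):
    """Attach a date to a time like '11:50 PM', using the day nearest server_now."""
    parsed = datetime.strptime(clock_text, "%I:%M %p").time()
    candidates = [
        datetime.combine(server_now.date() + timedelta(days=offset), parsed)
        for offset in (-1, 0, 1)
    ]
    # A clock that is a few minutes off still lands on the right day.
    return min(candidates, key=lambda c: abs(c - server_now))


def arrival_time(reading_time, minutes_text):
    """Turn an arrival like '15 min.' into an absolute time."""
    minutes = int(minutes_text.split()[0])
    return reading_time + timedelta(minutes=minutes)
```

For reading 112 above, a server clock just past midnight on 2014-07-27 and a reported time of 11:50 PM would resolve to 2014-07-26 rather than jumping almost a day ahead.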

Another set of issues that kept coming back was server timeouts. Occasionally the server would do nothing after I made my request. No response, no failure, just nothing. This caused some weird interactions with the scraper, which would hang while waiting for these responses. There were a few times where I missed several minutes of data because of this! I ended up with a reasonably robust system for handling this. I gave the initial parallel requests a 10-second timeout, and all requests that didn’t complete would be re-run afterward. Of course, I had to deal with timeouts there too — once, a re-request hung for five minutes before returning. My result: I wrote scraping code that took less than a minute to run, and if it couldn’t retrieve data for a station, it logged the error and continued scraping.

[...]
doing scrape at 2014-08-03 11:05:15 -0700
scrape times: request: 1.70s retries: 0.00s
finished retrieval at 2014-08-03 11:05:17 -0700
got reading time 2014-08-03T10:55:00-07:00 and creation time 2014-08-03T11:05:17-07:00
scrape complete
doing scrape at 2014-08-03 11:06:15 -0700
got an error for Atherton, making special request
got an error for Broadway, making special request
scrape times: request: 1.74s retries: 1.02s
finished retrieval at 2014-08-03 11:06:18 -0700
got reading time 2014-08-03T10:56:00-07:00 and creation time 2014-08-03T11:06:18-07:00
scrape complete
doing scrape at 2014-08-03 11:07:15 -0700
scrape times: request: 1.72s retries: 0.00s
finished retrieval at 2014-08-03 11:07:17 -0700
got reading time 2014-08-03T10:57:00-07:00 and creation time 2014-08-03T11:07:17-07:00
scrape complete
[...]
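The overall shape of that retry logic might look like the following Python sketch. It is not the Ruby implementation: `fetch(station, timeout)` is a hypothetical callable that returns the response body or raises on error, and it is assumed to enforce its own per-request timeout.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(stations, fetch, timeout=10):
    """Fetch every station in parallel, then retry the failures once.

    fetch(station, timeout) is a hypothetical callable that returns the
    response body, or raises if the request fails or times out.
    """
    results, errors = {}, {}

    def attempt(station):
        try:
            return station, fetch(station, timeout), None
        except Exception as exc:  # any failure marks the station for a retry
            return station, None, exc

    with ThreadPoolExecutor(max_workers=8) as pool:
        first_pass = list(pool.map(attempt, stations))

    for station, body, exc in first_pass:
        if exc is not None:
            # Re-run the failed request once, sequentially.
            _, body, exc = attempt(station)
        if exc is None:
            results[station] = body
        else:
            errors[station] = exc  # record it and keep going
    return results, errors
```

Because every request carries a timeout, including the retries, one stuck station cannot hold up the whole scrape.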

Pain and suffering

At this point, things were mostly working, and I was getting nice piles of data. Usually. It turns out that there are a lot of issues with trying to rely on this scraped data.

  1. The data doesn’t know everything

    Even though I got the train numbers by switching away from 511, I didn’t get train arrival data, which is really what I wanted. So, stations at the end of the line - San Francisco, San Jose, Tamien, and Gilroy - could never have good data for the arrival. San Francisco and San Jose, in particular, have a reasonably long and somewhat variable time between the second-to-last station and the terminal, so having this data would be helpful.

    Other examples include things like express trains that turn into locals when there’s some kind of SNAFU. The real-time departure system doesn’t or can’t handle this, so this data is always lost.

  2. The data is not raw

    Since the data I scraped is essentially the same data fed to the departure signs at individual stations, I didn’t get the full raw data. Instead, I got data that Caltrain already slightly processed before publishing! For example, if a train is five minutes late at the second station, it will still show as arriving on-time at the last few stations on its route. This meant that only the next few stations had data even worth considering.

    For example, the following train is 8 minutes late (scheduled: 19:38, actual: 19:46) as of reading 2642 at Palo Alto. But, in San Jose, it’s shown as being one minute early as of reading 2642. As the train gets closer, the time becomes more realistic, settling on 20:20, which is eight minutes later than scheduled.

    sqlite> select name from stations where id = 17 ;
    Palo Alto
    sqlite> select reading_id,station_id,arrival from timepoints where reading_id = 2642 and station_id = 17 and train_id = 77 ;
    2642|17|2014-07-29 19:46:00.000000
    sqlite> select name from stations where id = 9 ;
    San Jose Diridon
    sqlite> select reading_id,station_id,arrival from timepoints where reading_id > 2642 and arrival < '2014-07-30' and station_id = 9 and train_id = 77 order by arrival ;
    2642|9|2014-07-29 20:11:00.000000
    2643|9|2014-07-29 20:11:00.000000
    [...]
    2655|9|2014-07-29 20:11:00.000000
    2656|9|2014-07-29 20:12:00.000000
    2659|9|2014-07-29 20:12:00.000000
    2660|9|2014-07-29 20:12:00.000000
    2661|9|2014-07-29 20:12:00.000000
    2657|9|2014-07-29 20:13:00.000000
    [...]
    2673|9|2014-07-29 20:19:00.000000
    2675|9|2014-07-29 20:19:00.000000
    2676|9|2014-07-29 20:20:00.000000
    2677|9|2014-07-29 20:20:00.000000
    
  3. The data is often wrong

    For example, a southbound train was scheduled to stop at Palo Alto. The data showed the train as on-time at first, and then it spent ten minutes (from reading 2633, at 19:35 onwards) saying it was two minutes late. A friend, waiting at Menlo Park, told me when he finally got on that train. The station signs had said it was arriving for about eight minutes before giving up entirely. The train itself actually arrived at the last predicted time (19:46), but there was no data for the couple of minutes before it arrived.

    sqlite> select name from stations where id = 17 ;
    Palo Alto
    sqlite> select reading_id,arrival from timepoints where station_id = 17 and reading_id > 2631 and train_id = 77 and arrival < '2014-07-30' ;
    2631|2014-07-29 19:37:00.000000
    2632|2014-07-29 19:37:00.000000
    2633|2014-07-29 19:37:00.000000
    2634|2014-07-29 19:38:00.000000
    2635|2014-07-29 19:39:00.000000
    2636|2014-07-29 19:40:00.000000
    2637|2014-07-29 19:41:00.000000
    2638|2014-07-29 19:42:00.000000
    2639|2014-07-29 19:43:00.000000
    2640|2014-07-29 19:44:00.000000
    2641|2014-07-29 19:45:00.000000
    2642|2014-07-29 19:46:00.000000
    sqlite> select time from readings where id = 2631 ;
    2014-07-29 19:33:00.000000
    sqlite> select time from readings where id = 2642 ;
    2014-07-29 19:44:00.000000
    

    Other oddities I observed include the data saying a train was scheduled to depart two different station pairs (San Antonio and Mountain View, #11 and #19, and Palo Alto and California Ave, #17 and #21) at the same time. Unless Caltrain has wormhole technology, I don’t think this is very likely:

    sqlite> select station_id,arrival from timepoints where reading_id = 2642 and train_id = 77 order by arrival ;
    17|2014-07-29 19:46:00.000000
    21|2014-07-29 19:46:00.000000
    19|2014-07-29 19:51:00.000000
    11|2014-07-29 19:51:00.000000
    4|2014-07-29 19:55:00.000000
    22|2014-07-29 19:58:00.000000
    3|2014-07-29 20:03:00.000000
    9|2014-07-29 20:11:00.000000
    

With all these variables, trying to get useful data out of this mass of scraped data is really hard. Getting data with a pile of issues is easy; getting something that you can use to view the day-in, day-out performance of the trains is much harder. One reason I abandoned this project was that I wasn’t sure that having this vaguely unreliable data would be useful for anything.

The other problem was that my script broke. A lot. I kept getting weird bits of unscraped data showing up, weird exceptions, and Ruby interpreter crashes (which appear to be related to my use of the curb gem). All of this was a lot to keep up with. One of the things I should have done from the start was make it easier to detect and view errors. While I ended up logging a lot of the necessary info, I accessed it manually. Having an automatic tool to find and present inconsistencies and errors would have made the process of maintaining the scraper a lot easier.
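As one small example of what such a tool could check, here is a Python sketch that flags the "wormhole" case from earlier, where one train is predicted at two stations at the same moment. The function name and input shape are assumptions for illustration. The input mirrors the (station_id, arrival) rows from the timepoints query above.

```python
from collections import defaultdict


def find_shared_arrivals(timepoints):
    """Return arrival times that more than one station shares.

    timepoints is an iterable of (station_id, arrival) pairs for a single
    train in a single reading.
    """
    by_arrival = defaultdict(list)
    for station_id, arrival in timepoints:
        by_arrival[arrival].append(station_id)
    # Keep only the arrival times claimed by two or more stations.
    return {arrival: ids for arrival, ids in by_arrival.items() if len(ids) > 1}
```

Run over every reading, a check like this would surface the suspicious rows without anyone having to dig through the database by hand.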

What did I learn from all of this?

If I were going to keep working on this, I would tackle two things next: making the automated error detector, and building a tool to process the raw data into something easier to analyze. The automation would help me expend energy on the more interesting parts, and making the raw data usable would provide more motivation to keep working on this.

I learned that I enjoy writing web scrapers and trying to make them reliable, even when making 50k requests a day.

The other thing I learned is that I would be much happier if Caltrain just published their raw train data, and then I wouldn’t have needed to write this blog post.