I recently purchased a Chromebook for use as a travel laptop. I wanted something light, portable, and secure that I wouldn’t have to back up or worry about losing. I bought the Asus Chromebook Flip C302, and it’s been great! Except for one fatal flaw…
Sure, you can change this setting once you’ve logged in. If you use guest mode, though, you have to change it every time. I decided that this was Not Ok and set out to fix my beleaguered Chromebook, freeing it from the shackles of being polite.
After a pile of Google searches, some educated guesses, and some light hardware disassembly, I was able to permanently change the setting while keeping Chrome OS’s secure boot feature on. Curiosity and a desire to stick it to Canada can get you a long way!
I spent a good amount of time trying to solve my problem the straightforward way. I tried changing various Chrome OS settings as a guest user and as the primary logged-in user to no avail. None of the settings I touched kept their impact after a logout or reboot. Searching around the internet for things like “chromebook permanently change locale” didn’t return useful results beyond the basic settings I’d already tried tweaking.
It was pretty clear to me that these settings needed to be changed at a lower level.
Finding the real setting
I performed a bunch of searches like “chrome os change default region”, adding other search terms including “firmware”, “locale”, and “system”. I ignored any results that looked like a standard help forum or website, instead preferring results that looked more technical. Because Chrome OS is open source, it was likely that the setting I wanted was in the code, but I didn’t know where to start.
I no longer remember the original search, but I eventually stumbled across a post to the Chromium OS (the open-source part of Chrome OS) code review mailing list titled recovery: Read locale by region database. It’s likely the search results have changed by now anyways. I usually save results like these so that I can still find them after they fall out of favor with the search engines.
At this point, I wasn’t familiar with any of the internals, but whatever this change was looked like it related to region settings that would apply even in recovery mode. I followed the breadcrumbs from here to the first link for the source code commit review. The first few lines of the commit message were even more promising:
Chromebooks have a “VPD” dictionary-like area in firmware. ChromeOS keeps and
reads data, including initial locale, by calling program ‘vpd’.
However this becomes an issue that these data can’t be updated after device
shipped, and also a maintenance issue.
This confirmed some of my suspicions, including that there was an initial configuration in firmware for the region! It also confirmed that it wasn’t easily updated after the device was shipped. It was also the first time I’d heard about VPD, which meant it was time to do more research.
Vital Product Data
I looked up vpd on Google and figured out that it probably meant “vital product data”. The Wikipedia page for it was pretty barren but corroborated it being a system-level configuration tool of some kind.
I narrowed my search to chrome os vpd. The first result was a pretty dry page in the Chromium source that has a bunch of stuff about datatypes and encodings and whatnot. While I quickly moved on to other pages, it turned out that a more careful reading of this would have revealed useful info!
The next page I looked at was somebody’s random script. The example output from dump_vpd_log had variables in it that directly referenced the system language, so it had to be on the right path.
I suspected this script was meant to run on a Chromebook, so I gave running dump_vpd_log a shot through the default terminal (crosh). Unsurprisingly, it wasn’t present as a command, and because my Chromebook wasn’t in developer mode, I couldn’t run commands as root.
I put my Chromebook into developer mode and tried running dump_vpd_log --full --stdout as root again. Sure enough, it dumped out a similar list of settings, including:
These settings reflected what I saw when I was using my Chromebook, cementing my belief that these were the correct things to change.
First try at changing VPD settings
Knowing that the VPD was almost certainly the thing I wanted to change, I went back and read the VPD page I found earlier. There’s a great part towards the bottom that lists a bunch of command examples for the vpd utility. I ran vpd -l (which should list all the values) and it matched the settings I saw earlier!
I made a backup of the ROM by running flashrom -r vpd.bin, although I will note that I didn’t have a plan to restore it in case everything broke. Then I tried to update a variable! I guessed that this would fail, as I hadn’t done anything to make the ROM writeable and also because being able to write to this from Chrome OS without hardware tweaks seems like it would be bad.
root@localhost:/home/chronos/user # flashrom -r vpd.bin
flashrom v0.9.9 : cfd7dfc : Apr 06 2018 05:12:10 UTC on Linux 3.18.0-17554-g9194949d4df2 (x86_64)
flashrom v0.9.9 : cfd7dfc : Apr 06 2018 05:12:10 UTC on Linux 3.18.0-17554-g9194949d4df2 (x86_64)
Calibrating delay loop... OK.
coreboot table found at 0x7ab80000.
WARNING: SPI Configuration Lockdown activated.
Reading flash... SUCCESS
root@localhost:/home/chronos/user # vpd -s initial_locale=en-US
As I expected, it didn’t work. On to the next challenge!
Making the VPD writeable
I had vague memories of some hardware switch to enable writes, so it was back to searching the internet. Searching for “chrome os write to rom” brought up this page, which noted that the various bits of firmware (and things that get touched by flashrom) live on an SPI ROM which is write-protected. It also noted that I could disable that via a switch or a screw. Some more searching around turned up this Reddit post about how to disable write protect on the C302. I’ve replicated the steps below.
After booting the laptop back up, I tried the vpd -s initial_locale=en-US again… and it worked! I assume, because it didn’t output anything. vpd -l now showed the new setting too. I rebooted the laptop and went through the Chrome OS setup screen. Sure enough, more of it was en-US than it was before! There were still several Canadian remnants, which makes sense since I didn’t change all of the firmware settings.
Testing out all the various settings was somewhat annoying, because it involved rebooting every time I tweaked a variable. Some of them (like keyboard_layout) were tricky to figure out. The settings I ended up with were:
While writing this blog post, I discovered that non-region settings are now deprecated, but it looks like manufacturers (including Asus) specify them anyways. The Chrome OS regions.py file, used to generate the defaults for each region, has a set of reasonable defaults for all the different locales it supports. The comments near the top of the file (for the Region object) specifically call out the initial_locale, initial_timezone, and keyboard_layout settings as being superseded by region. I wish I had tested this while I had the write-protect screw removed, but I didn’t, and I’m too lazy to pull it apart again.
Putting it back together
There was one last question I had: would Chrome OS’s secure boot feature still work after I’d monkeyed with the VPD settings in firmware? There was only one way to find out: put the write protect screw back in, set it back to verified mode, and see if everything worked. And it did! This was pretty important to me, as I didn’t want to run an unsecured device.
This made sense – settings like region and serial number are per-device and set in the factory. It’s possible they could be tied to the hardware with a signature of some kind, but given their (relative) lack of sensitivity, it seems they’re left unchecked.
What have we learned?
Beyond proving that the locale change was possible, this project showed the power of searching for information and just trying things. The original attempt only took about two hours to get working, including time spent trying random searches and clicking to the second or third page of results.
Writing up this post took an order of magnitude longer. Many of the original things I tried were just cargo-culting commands seen on websites. I ignored various options and didn’t test things that ended up being important. In particular, I didn’t spend a lot of time figuring out which settings needed to be changed in the VPD and which didn’t. There’s always more to explore!
My hope is that this post is useful to somebody, whether they just want to change the default region on their device or do other fun things with the VPD area. Particularly with open-source software, the answer may only be a short search away!
Changing your Chromebook’s locale in three easy steps*
*may be more than three steps
A warning! If you are making modifications like this to your Chromebook, you should be careful to return the device to the normal verified mode! With the write protect switch off or when you’re in developer mode, it’s trivial to bypass any protections on the device, such as your password.
Another warning! This is going to clear all of the data on your device. It’s a Chromebook, so it’s probably all backed up in the cloud, but don’t say I didn’t warn you.
Chrome OS in developer mode
Developer mode access allows you full root access to the underlying Linux system. It usually involves a “hidden” key combination, but is otherwise a built-in part of most Chrome OS devices.
My Chromebook (the Asus Chromebook Flip C302, aka cave) has no special instructions, so you can follow the generic ones provided by Google. If you have a different device, try looking it up here and seeing if that works.
For the C302, follow these steps:
1. Hold down the Escape and Refresh keys.
2. Click the power button. You can release the Escape and Refresh keys at this point.
3. Once you see the Recovery screen, hit Ctrl-D.
4. Wait a while. The Chromebook will beep at you once or twice.
5. Eventually, it’ll boot back up to the normal Chrome OS login screen.
6. Hit Ctrl-Alt-F2 (F2 is the right-facing arrow next to the refresh key) to get to a login prompt.
7. Log in with the user chronos, and then sudo -s to get a root shell.
As an alternative to steps 6 and 7, you can:
Go through the first setup screen and select “Browse as Guest” as soon as possible.
Make sure a Chrome browser is open.
Hit Ctrl-Alt-T to open crosh.
Run shell to get a real shell and then sudo -s to get a root shell.
To get out of developer mode, turn the Chromebook off and then on again. It’ll say “OS verification is OFF” and “Press SPACE to re-enable”. Unsurprisingly, hit space, and your Chromebook will be reset to as good as new (assuming you haven’t modified the underlying OS install at all)! If you have, maybe try these steps to fix it.
Enabling EC firmware writes
From the beginning, Chromebooks have provided some sort of mechanism to alter the internals and play around. Beyond just developer mode, there’s also a way to allow writes to the embedded controller firmware. Because allowing flash ROM writes bypasses most of the internal secure boot controls, Google specifically designed it to be an annoying process to perform. The most common mechanism is a conductive metal screw that bridges two traces as a physical “on” switch. This is what shipped in the original Chromebook, and it’s also what ships in the C302.
The C302’s screw is located on the motherboard, hidden under a piece of fabric tape. Since we want to get write access to the EC firmware, we’re going to remove this screw. If you’re following along and you don’t have a C302, you can probably find documentation for your specific device elsewhere; the concept is the same.
Open the case.
There are ten Torx T4 screws on the bottom that have to be removed, highlighted by the blue circles. Once you’ve done that, there are two more (much smaller) Phillips screws hidden under the rubber feet at the back side by the hinge, highlighted by the red circles. You can carefully peel the rubber feet up with a spudger.
Peel back the tape.
With the motherboard at the back and the batteries at the front, you can see a giant copper heatpipe with a heatsink in the middle. If you look just down and to the right, there’s a small rubber block on top of the fabric tape. Carefully pull this tape back to reveal the screw.
Remove the screw.
You can see a nice big Phillips screw next to a label that says WP. This is the write protect screw; remove it! You’ll note that under the screw there are two disconnected copper pads. By removing the screw, we’ve cut this particular circuit, changing the behavior of the Chromebook’s internal electronics and software.
To change the settings, you’ll have to run the Chromebook with the bottom off for a little bit. It’s generally safe to just put it on a piece of cardboard or some other non-conductive material. You can also loosely fit the bottom back on and run it like that.
Changing the VPD settings
First, we have to figure out what settings (if any) are present.
On my C302, there were four: region, keyboard_layout, initial_locale, and initial_timezone. I might have been able to get away with deleting all the non-region ones, but they still seemed to have some impact despite being deprecated.
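If you'd rather not eyeball the listing, here is a minimal Python sketch for picking those four keys out of saved `vpd -l` output. It assumes each line looks like `"key"="value"`; the helper names and the sample text are made up for illustration and are not a dump from a real device.

```python
import re

# Keys this post cares about; the names come from the C302's VPD.
LOCALE_KEYS = ("region", "keyboard_layout", "initial_locale", "initial_timezone")

# Assumed line format: "key"="value", one pair per line.
_PAIR = re.compile(r'^"(?P<key>[^"]+)"="(?P<value>.*)"$')


def parse_vpd_listing(text):
    """Turn `vpd -l`-style text into a dict, skipping lines that don't match."""
    settings = {}
    for line in text.splitlines():
        match = _PAIR.match(line.strip())
        if match:
            settings[match.group("key")] = match.group("value")
    return settings


def locale_settings(settings):
    """Keep only the locale-related keys that are actually present."""
    return {key: settings[key] for key in LOCALE_KEYS if key in settings}


# Made-up sample text, only here to exercise the parser.
SAMPLE = '''"serial_number"="EXAMPLE123"
"initial_locale"="en-CA"
"initial_timezone"="America/Toronto"
'''

if __name__ == "__main__":
    print(locale_settings(parse_vpd_listing(SAMPLE)))
```

Any key missing from the result simply isn't set on that device, which is worth knowing before deciding what to overwrite.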
The next step is to figure out some new values for it. regions.py, from the Chrome OS source tree, has a list of all the currently supported regions. If you can read Python, it’s simple, well-documented code and you should just read it! If not:
Look for things starting with Region(. What follows will be the value for region, keyboard_layout, initial_timezone, and initial_locale.
And after running the appropriate vpd -s commands, the settings are changed! Rebooting will show if they’ve taken effect. Be sure to put the write protect screw back in and take the Chromebook out of developer mode! If you don’t, the security of your device (and data) is at risk.
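To avoid typos when setting several values, the commands can be generated first and reviewed before running them. This is only a sketch: `build_vpd_commands` is a hypothetical helper, it just builds the `vpd -s key=value` invocations used earlier in this post, and it never executes anything.

```python
import shlex


def build_vpd_commands(settings):
    """Return one `vpd -s key=value` argument list per setting."""
    commands = []
    for key, value in settings.items():
        if not key or "=" in key:
            raise ValueError(f"bad VPD key: {key!r}")
        commands.append(["vpd", "-s", f"{key}={value}"])
    return commands


if __name__ == "__main__":
    # Only initial_locale=en-US comes from this post; look up the other
    # values for your region in regions.py before adding them here.
    wanted = {"initial_locale": "en-US"}
    for argv in build_vpd_commands(wanted):
        print(shlex.join(argv))
```

Printing the commands rather than running them keeps the root shell step manual, which seems like the safer default when writing to firmware.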
I live in a house with a garage and a few roommates. Because it’s California, nobody actually stores cars in the garage. We do keep stuff like bicycles and car covers there though! For a while, we had a wifi-based opener built by one of my roommates (based on a Particle Photon), but it was so IoT that it needed internet access and an in-home server for the web interface. Once the server in our house disappeared, we needed something new. This is what the old one looked like (until I disassembled it):
The most important feature the old garage door opener had was that by joining the local wifi, you could open and close the door. Since I was rebuilding it, I set a few other goals, such as making it work entirely on local wifi. This means if our internet goes down, we can still open the door!
There are three main parts to put together:
Connect to the local wifi network
Run a webserver that we can access
Do some Electronics Things when we get certain web requests
The board I ended up using for the new one is the C.H.I.P., a $9 board that runs Linux, has onboard wifi and Bluetooth (solving part 1), and most importantly was in my roommate’s spare parts bin. Since it’s running Linux, that makes it easy to use modern web frameworks like Flask, neatly handling part 2. It also supports digital input and output pins (which gives a start on part 3).
The C.H.I.P. was already set up with their custom Debian install, so I connected to it via the serial port. After logging in, I used nmtui to connect to the wifi network used by our phones. The rest of the work was getting the application and associated hardware up and running. In the rest of this post, I’ll talk about my meandering path to getting something working, and close it out by presenting my final working schematic in addition to the webapp running the opener.
Toggling some outputs
The C.H.I.P. has a piece of hardware known as a GPIO (General Purpose Input/Output) expander. This is a very long way to say it can output 0s and 1s or read a digital signal from a pin. If you’ve ever used digital pins on an Arduino before, you’ve used a GPIO without even knowing it! Normally, GPIO devices have to be controlled in very low-level software, often by writing bits directly into memory. The next step was figuring out what I had to do on the C.H.I.P. to control its GPIO expander.
Linux makes our lives easier
Since I already had some vaguely-working circuitry from the previous opener, I left that alone and started by trying to get input / output working on the C.H.I.P. The documentation notes that there are eight GPIO ports provided by an I/O expander, and there are some fancy libraries for using it. I ignored this and promptly investigated the shell commands below. There was a lot of stuff about exporting the right port number based on your kernel version. I had no idea what this was talking about, and I decided to look further.
I discovered that sometime in the past several years, Linux grew a GPIO interface in SysFS! The documentation provides a clear explanation about how you can use this interface. From what I can tell, it’s designed for slower applications that aren’t timing-critical. In other words, it’ll work great for a garage door opener.
How do you use it?
The first step is figuring out which devices you have. If you look in /sys/class/gpio/, there are a bunch of entries like gpiochip408. These are the individual GPIO drivers available. The 408 is the number of the first GPIO port that driver provides, and it maps to one of the physical pins. Each device can have a bunch of pins, numbered consecutively from that base, which means the 8th pin on gpiochip408 would be 415.
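The numbering is just base plus offset. A tiny Python sketch of that arithmetic, with hypothetical helper names:

```python
import re


def chip_base(entry_name):
    """Extract the base number from a name like 'gpiochip408'."""
    match = re.fullmatch(r"gpiochip(\d+)", entry_name)
    if match is None:
        raise ValueError(f"not a gpiochip entry: {entry_name!r}")
    return int(match.group(1))


def pin_number(entry_name, index):
    """Global GPIO number of the pin at zero-based `index` on a chip."""
    if index < 0:
        raise ValueError("pin index must be non-negative")
    return chip_base(entry_name) + index


if __name__ == "__main__":
    print(pin_number("gpiochip408", 0))  # first pin
    print(pin_number("gpiochip408", 7))  # eighth pin
```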
Running the shell command echo 408 > /sys/class/gpio/export will cause the kernel to make that GPIO available to you as a filesystem object! Note the fancy new /sys/class/gpio/gpio408 folder. This folder contains a bunch of files like value, direction, and active_low. You can read the value from value when the direction is in, or write 0 / 1 to it when the direction is out.
That’s pretty much it for things we will need to configure to make our GPIOs work. One of the interesting things about this abstraction is that it doesn’t get super technical and isn’t tied to the underlying electronics. This means that while you can’t express stuff like “this is an open-collector output, and you can enable or disable the pull-up”, it does make it extremely easy to work with, even from a shell!
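Because it's all just files, wrapping a pin in Python takes very little code. The class below is a minimal sketch, not the opener's actual code: `SysfsPin` and the `root` parameter are my own names, and `root` exists only so the class can be pointed at a scratch directory when there is no real hardware around.

```python
import os
import tempfile


class SysfsPin:
    """One GPIO line driven through the sysfs files described above."""

    def __init__(self, number, root="/sys/class/gpio"):
        self.number = number
        self.root = root
        self.path = os.path.join(root, f"gpio{number}")

    def _write(self, path, text):
        with open(path, "w") as handle:
            handle.write(text)

    def export(self):
        # The kernel creates gpioN/ once the number is written to export.
        if not os.path.isdir(self.path):
            self._write(os.path.join(self.root, "export"), str(self.number))

    def set_direction(self, direction):
        if direction not in ("in", "out"):
            raise ValueError(f"direction must be 'in' or 'out', not {direction!r}")
        self._write(os.path.join(self.path, "direction"), direction)

    def write(self, value):
        self._write(os.path.join(self.path, "value"), "1" if value else "0")

    def read(self):
        with open(os.path.join(self.path, "value")) as handle:
            return int(handle.read().strip())


def demo_with_fake_sysfs():
    """Exercise the class against a temp directory instead of real hardware."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "gpio408"))  # pretend it is exported
        pin = SysfsPin(408, root=root)
        pin.export()
        pin.set_direction("out")
        pin.write(1)
        high = pin.read()
        pin.write(0)
        return high, pin.read()
```

On a real board you would drop the `root` argument and run as a user allowed to write under /sys/class/gpio.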
How did I use it?
On the C.H.I.P. board I have, the GPIO expander starts at #408, which means it’s called /sys/class/gpio/gpiochip408 on the filesystem. Depending on your kernel version, this number might change! I started out by using just one pin. I figured out which pin mapped to which port by using the pinout in the documentation. For me, 408 maps to XIO-P0. The setup code in Python looked something like:
In actuality, I tested this with the following shell commands, which turn the port on and off every second:
I then used my multimeter to measure the voltage between XIO-P0 and GND (ground). It worked! The pin toggled between 3.3V and 0V, which is the expected range for this chip.
Making sure the analog parts work
I hooked up the pin to an LED from my parts box in series with a 300 ohm or so resistor as a basic test:
The LED worked, but it was pretty dim. Some poking around on the internet led to me discovering that the GPIO expander hardware couldn’t produce a lot of current, which meant that this was expected behavior. I tried it with the existing board, but it didn’t really work. The current provided by the GPIO expander (the Texas Instruments PCF8574A) was so low it couldn’t even fully turn on the transistor that we used to switch power to the relays. According to the diagram on page 12 of the datasheet, the expander will only source 100 microamps! For context, a normal LED draws 20 milliamps, or 200 times more current.
Luckily, my roommate had a relay board with optoisolators on it. Optoisolators are normally used to electrically disconnect two circuits, such as when you have low-power electronics driving a high-power motor. They do this by having an internal LED turn on a phototransistor, in essence acting as a light-controlled switch! There is no electrical connection between the two sides. While the opener doesn’t need any electrical isolation, the low current required to enable a high-current output is useful to us. I hooked the optoisolator inputs up between the GPIO expander and 3.3V power, and it worked! It draws about 400 microamps, which is too much for the expander to source, but not too much for it to sink.
If you have questions about reading schematics, SparkFun has a good primer on it! In this schematic, pins 1 through 8 on the J2 correspond to XIO-P0 through XIO-P7.
If you’re wondering why there are two relays with a capacitor in series on one of them, it’s all due to how the actual garage door opener works. It measures the capacitance on the line and takes different actions based on the value read. Shorting the line opens / closes the door, and putting a 1µF capacitor in series toggles the light.
One interesting quirk of the GPIO expander is that all of the I/O pins are high by default (it says this on the first page of the datasheet!). One thing to think about is that we don’t want our garage door to toggle when we power on our board, so we want the default state to be safe. By providing power to one side of the optoisolator and hooking the negative side of the LED to the GPIO expander, there will be no current flowing through it by default. This means that the relay should be in the same position if the power is off completely or if the power has been turned on and the GPIO hasn’t been configured yet.
“Hardware isn’t hard”
Now that I’d worked out how to make a pin toggle its output whenever I wanted, I figured that the rest of the hardware would be easy and all I had left was the software (spoilers: I’d made some assumptions about the hardware that came back to bite me, but I’ll get to that later). I got a copy of the previous opener’s Python script from my roommate so I could make it less cranky and work with this new hardware.
The opener program itself is a small webapp written in Python using the Flask framework. It exposes three endpoints: a page that displays two buttons (door and light), an endpoint that toggles the door, and an endpoint that toggles the light. It’s really basic, but it does the job.
One of the annoying parts of our old opener was that it didn’t “debounce” the “open door” link. Debouncing is the process by which we make sure what humans perceive as a single button press is interpreted by our program as a single button press. This was a problem when you pulled the page up on your phone, but it loaded slowly and you impatiently hit the “door” button again (which is of course not a thing that has ever happened to me). One of the easiest ways to implement debouncing is to have the program wait after it detects the first button click. By ignoring all additional button clicks in the next second or two, the number of false clicks goes way down! I implemented this with some basic synchronization primitives in Python:
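The wait-and-ignore idea can be sketched as follows. This is one possible shape for it, assuming a lock plus a timestamp; the `Debouncer` name, the two-second default, and the injectable `clock` are illustrative choices rather than the opener's real code.

```python
import threading
import time


class Debouncer:
    """Accept one click, then ignore further clicks for `hold_seconds`."""

    def __init__(self, hold_seconds=2.0, clock=time.monotonic):
        self._hold = hold_seconds
        self._clock = clock
        self._lock = threading.Lock()  # web requests may arrive concurrently
        self._last_accepted = None

    def try_accept(self):
        """Return True if the caller should act on this click."""
        with self._lock:
            now = self._clock()
            recent = (
                self._last_accepted is not None
                and now - self._last_accepted < self._hold
            )
            if recent:
                return False
            self._last_accepted = now
            return True
```

A request handler would call `try_accept()` first and only toggle the relay when it returns True; a second impatient tap inside the hold window becomes a no-op.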
I also abstracted out the hardware interface to make it easy to change the GPIO pin number in the future. While pins 408 and 409 work on this version of the C.H.I.P, that might not be the case on a Raspberry Pi or even a new Linux version on the C.H.I.P.! It’s not quite using a config file, but it’ll be really easy to abstract out when I want to.
The “last thing” left to do was plug in the relays and make sure everything functioned. I plugged it in, clicked on the “door” button, and it worked! I could hit the button, the relay would click on, and then click off. I was done! One last thing to check: does it open the door when I reboot the whole thing? This is a really important thing to consider, as the last thing you want your garage door opener to do is open the garage when you aren’t expecting it, such as after a power outage. It turned out that the door would in fact open when the device got rebooted, which meant I had to go back to the drawing board.
Hardware requires problem-solving
Why did the door open when I rebooted the C.H.I.P.? The symptom I saw was that when turning the board on, the relays would click on for a moment and then turn off as soon as the on-board electronics had powered up, about a quarter of a second later. I wasn’t sure exactly what was causing this, but my guess was that the GPIO expander wasn’t turning itself on as fast as it should. This meant that for a brief period of time, the expander would sink current from the optoisolator, allowing the relays to engage (and open the door accidentally). I was going to have to come up with a significantly more clever solution.
This whole project has been building stuff from my roommate’s junk bin, and I wasn’t about to stop now. The junk bin had a bunch of transistors (specifically the PN2222, which is an NPN-type transistor) which we’d used on the original board, so I set to work figuring out how to use a few of these in concert with the GPIO expander. My goal was to find some configuration that would not toggle the relays on startup while also working with the default-high output of the GPIO expander.
I eventually settled on a design that used one transistor per optoisolator as a switch. The positive side of the optoisolator LED was connected to 3.3V power, and the negative side was connected to the collector on a transistor. The base (which you can think of as the “switch input”) was hooked up to the GPIO expander. The trick was, instead of hooking up the emitter to ground, I connected it to the GPIO expander. What this means is that during startup, the base and emitter will be at the same potential the whole time. Transistors do all of their switching work based on there being a potential (i.e. non-zero voltage) between the base and emitter, so it stays completely off during startup!
To use this from software, I first have to set the two switch outputs (door and light) to logic 0. Once I’ve done that, I can set the GPIO pin connected to the emitter to logic 0 as well, allowing it to sink current when the door or light transistors are switched back on. I can then toggle the two switch outputs at will, and it will operate the relay successfully! Note that if I did this in the opposite order and set the emitter GPIO to logic 0 first, I would accidentally turn on both relays. This is because the switch inputs would still be high relative to the emitter, so setting the emitter GPIO to logic 0 creates a potential across the transistor, allowing it to switch on and let current flow through the LED.
This is the good-case timing diagram:
door        ______...---‾‾‾‾ ... ‾‾‾\____________________/‾‾\______
light       ______...---‾‾‾‾ ... ‾‾‾\____________/‾‾\______________
emitter     ______...---‾‾‾‾ ... ‾‾‾‾‾‾‾‾\_________________________
door relay  ________________ ... ________________________/‾‾\______
light relay ________________ ... ________________/‾‾\______________
                  ^ power-on        ^            ^ light is clicked
                                    | program starts
And here’s what happens if we do it out of order, accidentally triggering both relays on program start:
door        ______...---‾‾‾‾ ... ‾‾‾‾‾‾‾‾‾\_______/‾‾\_____________
light       ______...---‾‾‾‾ ... ‾‾‾‾‾‾‾‾‾\________________________
emitter     ______...---‾‾‾‾ ... ‾‾‾‾‾\____________________________
door relay  ________________ ... _____/‾‾‾\_______/‾‾\_____________
light relay ________________ ... _____/‾‾‾\________________________
                  ^ power-on          ^           ^ door is clicked
                                      | program starts
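The two orderings can also be checked with a toy simulation. This is a deliberately simplified model, assuming a relay is energized exactly when its switch pin is high while the emitter pin is low; the `Board` class and function names are hypothetical and nothing here talks to real hardware.

```python
from dataclasses import dataclass, field


@dataclass
class Board:
    """Toy model: every expander pin starts high, as at power-on."""

    pins: dict = field(
        default_factory=lambda: {"door": 1, "light": 1, "emitter": 1}
    )
    ever_on: set = field(default_factory=set)  # relays that have energized

    def relay_on(self, name):
        # The transistor conducts only when its base sits above its emitter.
        return self.pins[name] == 1 and self.pins["emitter"] == 0

    def drive(self, name, value):
        self.pins[name] = value
        for relay in ("door", "light"):
            if self.relay_on(relay):
                self.ever_on.add(relay)


def safe_startup(board):
    """Switch outputs low first, emitter last: nothing energizes."""
    board.drive("door", 0)
    board.drive("light", 0)
    board.drive("emitter", 0)


def unsafe_startup(board):
    """Emitter low first: both relays fire until the switches go low."""
    board.drive("emitter", 0)
    board.drive("door", 0)
    board.drive("light", 0)
```

After `safe_startup`, `ever_on` stays empty and a later `drive("door", 1)` energizes only the door relay; after `unsafe_startup`, both relays have already fired once.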
Putting it all together
The new hardware design required some additional code for the startup logic, but only a little bit! Once I verified that the script worked, I added a systemd unit file for the script to start it automatically on boot.
I used a small breadboard, the relay board, some double-sided tape, and a bunch of jumper wires to get the whole thing packaged up and ready to install. It now lives on top of my garage door opener, quietly waiting for one of us to click the button and have it do something useful. It seems a little silly to use an entire Linux system just to have a web page where I can click “open the door”, but this is the world we now live in. It’s definitely easier to glue together Python and GPIO files than to write low-level microcontroller code that serves a webpage. I put the code for the opener up on the internet so you can run your own wireless opener!
There are definitely future things I want to do with this! It would be nice to use the on-board Bluetooth chip to do BLE-based opening and provide iBeacon / Eddystone support. It would also be nice if I had an integrated app for my phone instead of having to pull up a web page. It could be fun to support user authentication and give time-based access to people or guests. It’d be helpful if it had audit logging. But these are all projects for another day!
Thanks to all who helped me review this before posting!
I enjoy cool visualizations of transit system performance, especially this one for the Boston MBTA, so I wanted to make one for the train that I often use, Caltrain. The MBTA visualization shows patterns of how the service works, including the difference in times between station stops at different times of day.
For example, here’s a Marey graph of a weekday Caltrain schedule (local trains in brown, limited in black, express in red). The horizontal axis shows departure and arrival times from morning (first train leaves San Jose at 4:30 AM) to evening (last train arrives in San Jose at 1:32 AM), with a few stations noted:
But that’s just the planned schedule. I wanted to visualize the performance of real Caltrain trains over several days. I went through the whole process of using transit data APIs, parsing and massaging the unfiltered data, and trying to turn it into something useful — and then I set aside this project, and here’s why.
This post may be useful to you if you’re curious about what kind of data you can get from Caltrain, obstacles you might run into when building web scrapers (and what you can do about them), and the downsides of the scraped data I got. You can also poke around the scraper source code to see what I did.
Now that I have a way to get the data, can I build my visualization? Not really. The 511 data only provides train departure times. At a glance this seems like it could be enough information, but there’s a big problem: it doesn’t tell you which train is departing. Caltrain is not a simple system that stops at every station — there are express trains that only make a few stops from beginning to end, limited trains that make more stops (in several service patterns!), and local trains that make all stops. How do we learn which train is where?
Well, Caltrain provides a real-time departure list on their website, and that includes all the different train numbers! This is pretty helpful, although it doesn’t have an API for accessing it, so I scraped the data directly from their website. Using the excellent Burp Suite to intercept my HTTP requests and figure out what I needed to do, I found out how to both get the full list of stations, and how to get the data for each station. This required a bit of work, as Caltrain appears to be using an ancient ASP.NET CMS named “IronPoint”, but in the end, I got the data. It looks like this:
I hooked up the scraper to the Sequel gem and logged the very-slightly-processed raw data into an sqlite database. Time to sit back and watch the data roll in.
Iterating and debugging
The basic scraper logged the data in a fairly raw form and had minimal error handling. Next I wanted to make the data more usable, allowing searching by individual trains or by time of day. I also planned to improve the error handling, both by logging more details and by making it easier for me to bucket the errors.
Of course, before I could even start thinking about these improvements, the script broke after fifteen minutes of running. It turns out it didn’t handle the server responding with no data. After I squashed that bug, I added a bit of logging to help me catch similar problems in the future.
Next I fixed the scraper to run at the correct part of each minute. I wanted to make sure I got as much data as possible without having to make multiple requests per minute. If I requested data just as the servers updated, a single scrape could end up mixing data from two different minutes. My workaround for this: make a request every five seconds, and once the minute changes, use that as the time to start a scrape request. Then, do a scrape every minute from that point on, and everything is good.
got 32 stations
--- Times should be '11:45 AM'
--- Times all look good
--- Times should be '11:46 AM'
--- Times all look good
--- Times should be '11:47 AM'
--- Times all look good
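The alignment trick described above can be sketched roughly like this. It is an illustration, not the scraper's actual code: `find_minute_boundary`, `seconds_until_next_scrape`, and the `read_minute` callable (a stand-in for "fetch a page and return the time it displays") are all hypothetical names, and the clock and sleep are injectable only so the sketch can be exercised without actually waiting.

```ruby
# Hypothetical helper: poll until the displayed minute changes, then return
# the local clock time at which the change was first observed.
# `read_minute` stands in for a real request that returns a string like "11:45 AM".
def find_minute_boundary(read_minute,
                         clock: -> { Time.now },
                         sleeper: ->(seconds) { sleep(seconds) },
                         poll_interval: 5)
  last_seen = read_minute.call
  loop do
    sleeper.call(poll_interval)
    current = read_minute.call
    return clock.call if current != last_seen
  end
end

# Seconds to wait from `now` until the next scrape, where scrapes happen
# every `period` seconds counted from the `anchor` time found above.
def seconds_until_next_scrape(anchor, now, period: 60)
  elapsed = (now - anchor) % period
  period - elapsed
end
```

With the anchor in hand, the main loop would sleep for `seconds_until_next_scrape(anchor, Time.now)` before each scrape, so every request lands at the same offset within the minute.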
Another bug cropped up in the time parsing code. The scraped data included two kinds of times: the current time, in the form 11:30 AM, and the arrival time, given as a number of minutes from now in the form 15 min. (with the trailing period). I originally stored this data in the database as raw strings, but it’s a lot easier to work with if they are stored as computer-readable times. My initial attempt was to parse the current time and just attach the current day to it. For arrival times, I then added the specified number of minutes to that result. This seemed OK, so I decided to check on the data the next day and went to sleep.
Well, there were weird things with the times. I had a bunch of entries right up until about 11:49 PM the day before. Then I had some from 11:50–11:59 PM on the current day, followed by a bunch more from 12:00 AM onwards. This struck me as odd, as 11:50 PM wouldn’t occur for almost 24 hours! This happened because my server’s time was slightly off — about ten minutes. I added logic to determine the current day, and weird time traveling became a thing of the past. I also fixed the clock on my server.
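One way to pick the right day is sketched below. This is an assumption about how such logic could look, not the logic the scraper actually used: it tries the day before, the day of, and the day after a reference time (for example, the server clock) and keeps whichever candidate is closest. The helper names `resolve_clock_time` and `resolve_arrival` are hypothetical.

```ruby
require "time"

# Hypothetical helper: attach a date to a clock string such as "11:50 PM".
# Of the three candidate days around `reference`, keep the one that puts
# the parsed time closest to `reference`.
def resolve_clock_time(clock_string, reference)
  day = 24 * 60 * 60
  candidates = [-day, 0, day].map do |offset|
    # The third argument supplies the date parts missing from the string.
    Time.strptime(clock_string, "%I:%M %p", reference + offset)
  end
  candidates.min_by { |candidate| (candidate - reference).abs }
end

# Hypothetical helper: turn "15 min." into an absolute time relative to the
# resolved reading time. Raises if the string contains no digits.
def resolve_arrival(minutes_string, reading_time)
  minutes = Integer(minutes_string[/\d+/], 10)
  reading_time + minutes * 60
end
```

With this approach, a reading of 11:50 PM seen when the server clock says a few minutes past midnight resolves to the previous day instead of jumping almost 24 hours into the future.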
Another set of issues that kept coming back was server timeouts. Occasionally the server would do nothing after I made my request. No response, no failure, just nothing. This caused some weird interactions with the scraper, which would hang while waiting for these responses. There were a few times where I missed several minutes of data because of this! I ended up with a reasonably robust system for handling this. I gave the initial parallel requests a 10-second timeout, and all requests that didn’t complete would be re-run afterward. Of course, I had to deal with timeouts there too — once, a re-request hung for five minutes before returning. My result: I wrote scraping code that took less than a minute to run, and if it couldn’t retrieve data for a station, it logged the error and continued scraping.
doing scrape at 2014-08-03 11:05:15 -0700
scrape times: request: 1.70s retries: 0.00s
finished retrieval at 2014-08-03 11:05:17 -0700
got reading time 2014-08-03T10:55:00-07:00 and creation time 2014-08-03T11:05:17-07:00
doing scrape at 2014-08-03 11:06:15 -0700
got an error for Atherton, making special request
got an error for Broadway, making special request
scrape times: request: 1.74s retries: 1.02s
finished retrieval at 2014-08-03 11:06:18 -0700
got reading time 2014-08-03T10:56:00-07:00 and creation time 2014-08-03T11:06:18-07:00
doing scrape at 2014-08-03 11:07:15 -0700
scrape times: request: 1.72s retries: 0.00s
finished retrieval at 2014-08-03 11:07:17 -0700
got reading time 2014-08-03T10:57:00-07:00 and creation time 2014-08-03T11:07:17-07:00
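The timeout-and-retry structure described above might look something like the following. This is a minimal sketch using only Ruby's standard library rather than the curb gem, and it is not the scraper's real code: `scrape_all` and the `fetch` callable (a stand-in for the real per-station request) are hypothetical, and the retry timeout value is an assumption.

```ruby
require "timeout"

# Hypothetical sketch: `fetch` stands in for the real per-station request.
# Returns [results, errors]: results maps station => data, and errors maps
# station => exception for stations that failed both attempts.
def scrape_all(stations, fetch, first_timeout: 10, retry_timeout: 10)
  results = {}
  failed = []

  # First pass: one thread per station, each bounded by a timeout.
  threads = stations.map do |station|
    Thread.new do
      begin
        [station, Timeout.timeout(first_timeout) { fetch.call(station) }, nil]
      rescue StandardError => e
        [station, nil, e]
      end
    end
  end
  threads.each do |thread|
    station, data, error = thread.value
    if error
      failed << station
    else
      results[station] = data
    end
  end

  # Second pass: retry failures one at a time, again with a timeout, and
  # record the error instead of letting one station stall the whole scrape.
  errors = {}
  failed.each do |station|
    begin
      results[station] = Timeout.timeout(retry_timeout) { fetch.call(station) }
    rescue StandardError => e
      errors[station] = e
    end
  end

  [results, errors]
end
```

One caveat: `Timeout.timeout` cannot always interrupt a call that is blocked inside native code, so it is worth also setting the HTTP client's own connect and read timeouts.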
Pain and suffering
At this point, things were mostly working, and I was getting nice piles of data. Usually. It turns out that there are a lot of issues with trying to rely on this scraped data.
The data doesn’t know everything
Even though I got the train numbers by switching away from 511, I didn’t get train arrival data, which is really what I wanted. So, stations at the end of the line (San Francisco, San Jose, Tamien, and Gilroy) could never have good data for the arrival. San Francisco and San Jose, in particular, have a reasonably long and somewhat variable time between the second-to-last station and the terminal, so having this data would be helpful.
Other examples include things like express trains that turn into locals when there’s some kind of SNAFU. The real-time departure system doesn’t or can’t handle this, so this data is always lost.
The data is not raw
Since the data I scraped is essentially the same data fed to the departure signs at individual stations, I didn’t get the full raw data. Instead, I got data that Caltrain had already slightly processed before publishing! For example, if a train is five minutes late at the second station, it will still show as arriving on time at the last few stations on its route. This meant that only the next few stations had data even worth considering.
For example, the following train is 8 minutes late (scheduled: 19:38, actual: 19:46) as of reading 2642 at Palo Alto. But, in San Jose, it’s shown as being one minute early as of reading 2642. As the train gets closer, the time becomes more realistic, settling on 20:20, which is eight minutes later than scheduled.
sqlite> select name from stations where id = 17 ;
sqlite> select reading_id,station_id,arrival from timepoints where reading_id = 2642 and station_id = 17 and train_id = 77 ;
sqlite> select name from stations where id = 9 ;
San Jose Diridon
sqlite> select reading_id,station_id,arrival from timepoints where reading_id > 2642 and arrival < '2014-07-30' and station_id = 9 and train_id = 77 order by arrival ;
The data is often wrong
For example, a southbound train was scheduled to stop at Palo Alto. The data showed the train as on time at first, and then it spent ten minutes (from reading 2633, at 19:35 onwards) saying it was two minutes late. A friend, waiting at Menlo Park, told me when he finally got on that train. The station signs had said it was arriving for about eight minutes before giving up entirely. The train itself actually arrived at the last predicted time (19:46), but there was no data for the couple of minutes before it arrived.
sqlite> select name from stations where id = 17 ;
sqlite> select reading_id,arrival from timepoints where station_id = 17 and reading_id > 2631 and train_id = 77 and arrival < '2014-07-30' ;
sqlite> select time from readings where id = 2631 ;
sqlite> select time from readings where id = 2642 ;
Other oddities I observed include the data saying a train was scheduled to depart two different station pairs (San Antonio and Mountain View, #11 and #19, and Palo Alto and California Ave, #17 and #21) at the same time. Unless Caltrain has wormhole technology, I don’t think this is very likely:
sqlite> select station_id,arrival from timepoints where reading_id = 2642 and train_id = 77 order by arrival ;
With all these variables, trying to get useful data out of this mass of scraped data is really hard. Getting data with a pile of issues is easy; getting something that you can use to view the day-in, day-out performance of the trains is much harder. One reason I abandoned this project was that I wasn’t sure that having this vaguely unreliable data would be useful for anything.
The other problem was that my script broke. A lot. I kept getting weird bits of unscraped data showing up, I got weird exceptions, and I got Ruby interpreter crashes (which appear to be related to my use of the curb gem). All of this was a lot to keep up with. One of the things I should have done from the start was make it easier to detect and view errors. While I ended up logging a lot of the necessary info, I accessed it manually. Having an automatic tool to find and present inconsistencies and errors would have made the process of maintaining the scraper a lot easier.
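As a small illustration of what one such automated check could look like, the sketch below flags the "wormhole" case from earlier, where one train is listed at several stations at the same time within a single reading. It is purely hypothetical: the `duplicate_arrivals` name and the row shape (hashes with `:station_id` and `:arrival` keys) are assumptions, not the scraper's real schema.

```ruby
# Hypothetical check: given the timepoint rows for one train in one reading,
# return a hash of arrival time => station ids for every arrival time that
# is shared by more than one station.
# Assumed row shape: { station_id: Integer, arrival: String }.
def duplicate_arrivals(rows)
  rows.group_by { |row| row[:arrival] }
      .select { |_arrival, group| group.size > 1 }
      .transform_values { |group| group.map { |row| row[:station_id] } }
end
```

Running a handful of checks like this after every scrape and logging anything they return would surface suspicious readings without manual digging through the database.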
What did I learn from all of this?
If I were going to keep working on this, I would tackle two things next: making the automated error detector, and building a tool to process the raw data into something easier to analyze. The automation would help me expend energy on the more interesting parts, and making the raw data usable would provide more motivation to keep working on this.
I learned that I enjoy writing web scrapers and trying to make them reliable, even when making 50k requests a day.
The other thing I learned is that I would be much happier if Caltrain just published their raw train data, and then I wouldn’t have needed to write this blog post.