On Sunday evening, I checked TxDOT’s Crash Query Tool to see whether all of January 2024’s crashes had been uploaded to the system. A full month’s worth of crashes were present, so I grabbed the data and began the process of updating my crash map.
Everything worked seamlessly as usual – I filtered the CSV file down to only pedestrians and cyclists, added the January 2024 data to my map, pushed the update to Google Cloud, and made sure the map was up and running.
Then I started checking the data and putting together a quick recap of what pedestrian and cyclist safety looked like in January. I immediately noticed significant discrepancies between the raw data I pulled down from TxDOT and what my data looked like after it had been cleaned and filtered for the dashboard.
To perform my checks, I had two CSV files open – one with the raw data from TxDOT and one that had already been filtered by my program. I applied filters to the raw TxDOT data so that only pedestrians and cyclists were present. Next, I created pivot tables in each file to compare the counts of each “Injury Type” for pedestrians and cyclists. When I saw the discrepancy, a pit formed in my stomach.
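This kind of side-by-side check can also be done in a few lines of pandas instead of spreadsheet pivot tables. A minimal sketch, using made-up column names and tiny stand-in data (actual TxDOT CRIS exports label their columns differently):

```python
import pandas as pd

# Stand-in data; real TxDOT exports use different column names (assumption).
raw = pd.DataFrame({
    "Person Type": ["Pedestrian", "Pedestrian", "Pedalcyclist", "Driver"],
    "Injury Type": ["Fatal", "Serious", "Minor", "Not Injured"],
})
filtered = pd.DataFrame({
    "Person Type": ["Pedestrian", "Pedalcyclist"],
    "Injury Type": ["Fatal", "Minor"],
})

# Keep only pedestrians and cyclists in the raw export, then count
# each injury type in both files and line the counts up side by side.
ped_cyc = raw[raw["Person Type"].isin(["Pedestrian", "Pedalcyclist"])]
counts = pd.concat(
    {"raw": ped_cyc["Injury Type"].value_counts(),
     "filtered": filtered["Injury Type"].value_counts()},
    axis=1,
).fillna(0).astype(int)
print(counts)
```

Any row where the “raw” and “filtered” columns disagree flags an injury type that is being dropped somewhere in the pipeline.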
My raw TxDOT data had a total of 90 incidents involving pedestrians (70) and cyclists (20), while the filtered data I had originally included in my map dashboard had only 44. This meant that every injury type was being under-counted even beyond the under-counting that already exists in crash data! For example, the raw data in the image below shows that 5 pedestrians were killed in San Antonio during January. However, the filtered data I used to run the first update of the map showed only 1 pedestrian killed.

I went back and looked at the filtering function I use to isolate pedestrian- and cyclist-involved crashes from the thousands of rows of car crashes in the raw data. Sure enough, I spotted my error: I was removing rows that didn’t have a value in every single column! In the image below, you can see that line 58 is commented out, which means it no longer has any effect on the function. But that line was removing rows that should have stayed in the data. It also means that every single time I’ve used the function over the past year, valid data was being removed!
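The post doesn’t show the function itself, but this is a very common pandas pitfall: a bare `dropna()` discards a row if *any* column is blank, even an unrelated optional field. A minimal sketch of the bug pattern, with hypothetical column names:

```python
import pandas as pd

# A crash record missing an optional field (e.g. a blank street name)
# should still be counted as a pedestrian or cyclist incident.
crashes = pd.DataFrame({
    "Person Type": ["Pedestrian", "Pedalcyclist", "Pedestrian"],
    "Injury Type": ["Fatal", "Minor", "Serious"],
    "Street Name": ["Main St", None, "Broadway"],
})

def filter_vulnerable_users(df, drop_incomplete=False):
    """Keep pedestrian/cyclist rows; optionally reproduce the bug."""
    if drop_incomplete:
        # The buggy step: dropna() with no arguments removes a row if
        # even one unrelated column is empty.
        df = df.dropna()
    return df[df["Person Type"].isin(["Pedestrian", "Pedalcyclist"])]

print(len(filter_vulnerable_users(crashes, drop_incomplete=True)))  # 2 rows
print(len(filter_vulnerable_users(crashes)))                        # 3 rows
```

If some columns really must be present, the safer call is `dropna(subset=[...])`, which only checks the columns you actually care about.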

I haven’t gone back to previous years to quantify how much data was under-counted, but I now know for sure that important data was missing from previous updates. I’m disappointed that I’m only realizing this now, and I will absolutely perform a more robust review and quality check of my data before running any future updates to the dashboard.
The crash map and dashboard that’s currently available is now up to date and fully accurate. If anybody notices any more discrepancies, please let me know so I can correct them as soon as possible.