Guest Blog from Stephen Smay, Solutions Architect at Datawatch
Here at Datawatch, most of our customers use our product to enable fast and effective data preparation in the world of financial or medical data, but data is data, and the applications in other areas are nearly limitless.
Being a fan of data myself, I’m always on the lookout for interesting or unique uses for our product. Also being a fan of college football and true crime stories, I was intrigued when a friend sent me a suspect and arrest report for various types of assault during the time period of 2010 through 2016. He obtained this data via a Freedom of Information Act (FOIA) request from the Waco Police Department. The significance of this document revolves around an ongoing scandal at Baylor University regarding institutional suppression of multiple accusations of sexual and physical abuse by players on the school’s football team.
While people don’t often equate self-service data prep with investigative journalism or criminal investigations, this use case presented a fantastic example of how self-service data prep products can be used to replace hours of tedious, mind-numbing work and result in faster time to insight from your raw data. In the criminal justice world, decreasing or eliminating these time-consuming tasks allows investigators to identify suspects and close cases faster.
All the information is on the document: offenses with dates, whether it was suspected or an actual arrest, the individual’s name, and if known, his/her date of birth, age, and school/place of employment. But there were a number of problems with the source data, the first of which is that it is a PDF, and a messily formatted one at that. Some fields are empty, and others are too short, causing values to ‘bleed’ into subsequent columns. It’s what is known as semi-structured data, and it doesn’t copy-and-paste properly to Excel, and it doesn’t even export quite right from Adobe Acrobat Professional. Those problems might not be so critical if not for the sheer volume of it. It is 401-pages containing about 35 cases per page, for an estimated total of fourteen-thousand individual records. That’s a lot to clean up in Excel.
Fortunately, working at Datawatch, I have access to Monarch, which was purpose-built to extract messy data from PDF and other text-based files. With Monarch, I was able to identify the appropriate fields quickly and transform the entire document from a semi-structured mess into orderly rows and columns. #databliss
At that point, I would run into another problem: I had 13,900 records with names, but how could I determine who of those actually played on the Baylor football team?
Once again, Monarch provided the solution, this time making use of the ability to open data directly from the Internet. The website cfbstats.com provided the football rosters from 2016 back to 2008, and I was able to connect directly to the URL for each year. Unfortunately, the data from Waco PD went all the way
to 2001, and I wanted to check as far as possible, so I found another site: sports-reference.com for the rosters from 2007 back to 2000. After some formatting and calculations to normalize the tables (e.g. separating first and last names into their own columns, where one site had last-comma-first and the other had first-last format) from the different sites, I appended all the years together to create one master roster from 2000 to 2016.
Because any player could have been on the team multiple years, the next step was to pare that massive roster down to one entry per player, and include the first and last years he played. A simple grouping in Monarch facilitated that transformation down to 584 players.
At that, I joined the list of players to the table from the police report, and found 42 matches, but I wasn’t done there. Since I’d joined based solely on name, some matches could have been coincidence, especially for common first and last names, like someone named Joe Jackson. With that in mind, I created a filter to check if the suspect’s age was between 18 and 26, and wrote a formula to determine if the date of the police incident occurred within the dates of the player’s Baylor career.
Finally, I arrived at a list of 16 separate incidents, with 11 different men who were probably players on the Baylor University football team. The dates for these incidents range from March of 2010 to August of 2015, with seven of those players being on the team the season following the incident.
The resulting list would give a journalist or crime analyst a very interesting starting point for an investigation. These kinds of investigations often show up in white-knuckled thrillers in films and TV shows, (think of the recent box office hit, Spotlight), where this monotonous process of poring through data is sped up with a montage. In real-life, we don’t have the luxury of hitting the montage button to fast-forward through the work. However, you can accelerate the job with a product like Monarch. This took twenty minutes and showcases a fascinating way to use self-service data prep to solve complex and interesting problems, outside the world of accounting and finance.