Our group began with the knowledge that we wanted to do a project dealing with traffic and transportation that would be applicable to large groups of individuals. While searching the internet for datasets we could use or websites we could scrape, we stumbled across a dataset containing taxicab data from 2013: tip amounts, fare rates, drop-off and pickup locations, time of ride, distance of ride, etc. We felt this particular set had a ton of useful and easily accessible information we could use to draw interesting conclusions about fares and tip amounts, as well as possibly traffic patterns. Thousands of taxis move about the city every day transporting thousands of people, so there are bound to be interesting things to discover.
Luciano Arango ’16
Computer Science Concentrator
Justin Oliver ’16
Applied Math Concentrator
Ali Monfre ’16
Applied Math Concentrator
Green Bay, WI
Our original data set was acquired by Chris Whong through a FOIL request to the NYC Taxi and Limousine Commission. He then made the data available through torrent links on his website at http://chriswhong.com/open-data/foil_nyc_taxi/. We read in a random sample of 240,000 taxi trips from 2013 and plotted their pickup locations below. You can very clearly see Manhattan and some of the other boroughs outlined by the taxi trips, as well as the heavily visited JFK and LaGuardia airports.
We began our analysis of tip data by plotting a simple histogram of tips as a percentage of the total cost of the ride (fare + tolls + taxes). It became immediately clear that the vast majority of trips had a reported tip percentage of 0%, which seemed rather unusual.
We quickly discovered, however, that the vast majority of the trips with a 0% tip were trips where the passenger paid with cash rather than card. In fact, the average tip percentage for trips paid in cash was considerably less than 1%, whereas the average tip percentage for trips paid by card was almost 18%. This led us to the hypothesis that taxi drivers were likely under-reporting tips when they were paid in cash, as they could simply pocket the money and not be taxed, whereas when the trip was paid by card the taxi company was given the exact breakdown of fare versus tip. Thus, for the sake of accurately analyzing tip data, we decided to look only at trips where the passenger paid by card. The tip data was now much more reasonable and showed predictable spikes around 10%, 15%, 20%, 25%, 30%, and 35%, which makes sense given the pre-selectable values that appear on the screen when paying by card.
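The cash-versus-card comparison can be sketched roughly as follows. This is a minimal illustration rather than our original code: the field names (`payment_type`, `fare`, `tolls`, `taxes`, `tip`) and the sample rows are assumptions invented for the example.

```python
from collections import defaultdict

def tip_percentage(trip):
    """Tip as a percentage of the total ride cost (fare + tolls + taxes)."""
    total = trip["fare"] + trip["tolls"] + trip["taxes"]
    if total <= 0:
        return None  # skip malformed rows rather than divide by zero
    return 100.0 * trip["tip"] / total

def mean_tip_by_payment(trips):
    """Average tip percentage for each payment type."""
    buckets = defaultdict(list)
    for trip in trips:
        pct = tip_percentage(trip)
        if pct is not None:
            buckets[trip["payment_type"]].append(pct)
    return {kind: sum(vals) / len(vals) for kind, vals in buckets.items()}

# Hypothetical rows, invented purely for illustration.
sample_trips = [
    {"payment_type": "cash", "fare": 10.0, "tolls": 0.0, "taxes": 0.5, "tip": 0.0},
    {"payment_type": "cash", "fare": 20.0, "tolls": 0.0, "taxes": 0.5, "tip": 0.0},
    {"payment_type": "card", "fare": 9.5, "tolls": 0.0, "taxes": 0.5, "tip": 2.0},
    {"payment_type": "card", "fare": 19.5, "tolls": 0.0, "taxes": 0.5, "tip": 4.0},
]

averages = mean_tip_by_payment(sample_trips)

# Keep only card trips for the rest of the tip analysis.
card_only = [t for t in sample_trips if t["payment_type"] == "card"]
```

Grouping before averaging is what exposes the gap: a single overall mean would blend the near-zero cash tips with the card tips and hide the reporting difference.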
Now that we had clearly identified the issue of under-reported tips, we thought it would be useful to develop a tool for predicting this fraud. Using the un-skewed card data, we were able to build a model that predicts the tip a driver should receive on average based on various features of the trip, such as the time of day, the length of the trip, the pickup and drop-off locations, etc. As it turns out, the most important features in determining the tip a taxi driver will receive on average are the length of the trip, the time of day the trip is taken, and the speed of the trip. The pickup and drop-off locations were also rather significant.
In the end, the random forest classifier was able to predict almost 40% of tips to the nearest percent. However, in predicting whether or not a cab driver is committing fraud, we are really only concerned with whether the reported tip is unreasonably low. We therefore reduced the problem to predicting whether the tip would be greater or less than 10%. In this new scenario, the random forest classifier was approximately 88% accurate. This makes it a potentially valuable tool for predicting and cracking down on the under-reporting of cash tips.
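The simplified above-or-below-10% evaluation can be sketched as below. This is an illustration under assumptions, not our actual pipeline: the function names and sample numbers are made up, the predicted percentages stand in for the output of whatever model is used, and the `flag_suspicious` step is one hypothetical way such a tool might be applied to reported cash tips.

```python
def is_low_tip(tip_pct, threshold=10.0):
    """True when a tip percentage falls below the threshold."""
    return tip_pct < threshold

def binary_accuracy(actual_pcts, predicted_pcts, threshold=10.0):
    """Share of trips where prediction and reality land on the same side of the threshold."""
    matches = sum(
        is_low_tip(a, threshold) == is_low_tip(p, threshold)
        for a, p in zip(actual_pcts, predicted_pcts)
    )
    return matches / len(actual_pcts)

def flag_suspicious(reported_pcts, predicted_pcts, threshold=10.0):
    """Indices of trips with a low reported tip where the model expected a tip at or above the threshold."""
    return [
        i
        for i, (r, p) in enumerate(zip(reported_pcts, predicted_pcts))
        if is_low_tip(r, threshold) and not is_low_tip(p, threshold)
    ]

# Hypothetical tip percentages, invented for illustration.
reported = [0.0, 18.0, 20.0, 5.0]
predicted = [17.0, 19.0, 8.0, 6.0]

accuracy = binary_accuracy(reported, predicted)
suspicious = flag_suspicious(reported, predicted)
```

A flagged trip is only a candidate for review: the model predicts an average tip, so an individual low tip may still be genuine.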
We first wanted to see if we could find any trends in the speed of taxis over the course of the day, week, and month in order to determine any correlations with trip time, cost of trip, and other features.
Looking first at speeds over the course of the week, it is clear that trips are slightly faster on the weekend than during the week. This makes sense given that Manhattan is likely much busier during the work week, when people are commuting into and out of the city. However, this trend was not reflected in any significant change in fare data or trip time.
Next we decided to look at speeds over the course of the day. There was a very clear increase in average speed until approximately 5:00 AM, with a minimum average speed around noon. Again, this makes sense given that midday is when the city would likely be busiest with tourists and other travelers. Thus speed varies fairly significantly over the course of the day, but much less over the course of the week or month.
In fact, we can very clearly see this trend in speeds over the course of the day mirrored in the number of trips over the course of the day. The time with the fewest trips was 5:00 AM, though the peak number of trips was slightly later than the speed data would suggest, closer to 7:00 PM at the end of the work day.
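The hour-of-day comparison of speed and trip volume can be sketched like this. It is a minimal example under assumptions: each trip is represented as a hypothetical `(pickup timestamp, miles, seconds)` tuple, the timestamp format is assumed, and the sample rows are invented.

```python
from collections import defaultdict
from datetime import datetime

def hourly_speed_and_count(trips):
    """Return {hour: (mean speed in mph, number of trips)} keyed by pickup hour."""
    speeds = defaultdict(list)
    for pickup, miles, seconds in trips:
        if seconds <= 0:
            continue  # a zero-length trip has no meaningful speed
        hour = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S").hour
        speeds[hour].append(miles * 3600.0 / seconds)
    return {h: (sum(v) / len(v), len(v)) for h, v in sorted(speeds.items())}

# Hypothetical trips, invented for illustration.
sample = [
    ("2013-01-28 05:10:00", 5.0, 600),  # 30 mph
    ("2013-01-28 12:05:00", 1.0, 600),  # 6 mph
    ("2013-01-28 12:40:00", 2.0, 900),  # 8 mph
    ("2013-01-28 12:50:00", 3.0, 0),    # dropped: no duration
]

summary = hourly_speed_and_count(sample)
```

Returning the trip count alongside the mean speed makes it easy to plot both series against the same hourly axis and compare where each one peaks.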
We created a Tips Heatmap so drivers can see the areas of the city that tip the most at various times of the day.
Click on the image below to use it!
We also wanted to build a model that would be useful for the average person. So, we decided to try to predict the base fare for any given ride a person would want to take in and around the NYC area. The two factors that largely determine a fare quote are the trip time and the trip distance, so we needed to build a model that would use machine learning to essentially predict both and return a base fare amount. We figured that using the pick-up and drop-off locations we would be able to query the Google Maps API to get an estimated trip distance, and that using the time of day and day of the week we would be able to predict trip time based upon traffic patterns.
We found, however, that the day of the week and month of the year had very little effect on the average fare rate; on the other hand, as you can see above, fare rate did vary based upon the time of day. Therefore, using all 440,000 trips from January 28th, 2013, we built a random forest classifier with 50 trees to take in the pick-up location, drop-off location, trip distance, and time of day and output a predicted fare for any given day of the year.
We can see that trip distance is the most important factor for predicting fare, followed by time of day and location (presumably where the estimated trip time comes from). We then plotted our results to see the correlation visually.
There is a clear correlation between predicted fare and actual fare, although there is still an average variation of about ten dollars. We found, however, that we could predict a person’s fare within a range of 5 dollars on any day of the year with around 78% accuracy. We also found it interesting that there were so many actual and predicted fares at exactly $52.00, so we plotted a histogram of fares.
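The "within a range of 5 dollars" measure can be sketched as below. This is an illustration under assumptions: we read the range here as a predicted fare landing within $5 of the actual fare, and the function names and sample fares are invented for the example.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted fares, in dollars."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def share_within(actual, predicted, tolerance=5.0):
    """Fraction of predictions that fall within `tolerance` dollars of the actual fare."""
    hits = sum(abs(a - p) <= tolerance for a, p in zip(actual, predicted))
    return hits / len(actual)

# Hypothetical fares in dollars, invented for illustration.
actual_fares = [10.0, 52.0, 8.5, 30.0]
predicted_fares = [12.0, 52.0, 15.0, 26.0]

mae = mean_absolute_error(actual_fares, predicted_fares)
within_five = share_within(actual_fares, predicted_fares)
```

The two numbers answer different questions: the mean error is pulled up by a few badly missed trips, while the within-tolerance share describes how often a rider would get a usable estimate.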
The histogram confirms that there’s an unusual spike in the data at $52.00. After a little research, we discovered that $52.00 is the flat rate for traveling from Manhattan to JFK Airport, explaining the spike.
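Spotting a spike like this does not require a plot; counting exact fare values is enough. The sketch below is illustrative only, with a hypothetical helper name and invented sample fares.

```python
from collections import Counter

def most_common_fares(fares, top=3):
    """Return the `top` most frequent fare values as (fare, count) pairs."""
    counts = Counter(round(f, 2) for f in fares)
    return counts.most_common(top)

# Hypothetical fares in dollars, invented for illustration.
sample_fares = [52.0, 8.5, 52.0, 12.0, 52.0, 8.5, 6.0]

top_fares = most_common_fares(sample_fares, top=2)
```

Because metered fares vary continuously with distance and time, any single value that dominates the counts is a hint that a flat rate or a data-entry default is involved.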