For my first project as part of the Metis Data Science Bootcamp, I was tasked with drafting a mock proposal to solve a problem by using the MTA turnstile dataset.
Introduction
The NYC Department of Transportation and Department of Environmental Protection wish to reduce the carbon footprint and congestion of New York City roads. One way to do this would be to reduce the need for car ownership by adding to an already significant subway infrastructure. Currently, fewer than half of the city’s inhabitants own cars, but there is still room to lower car ownership further, as some neighborhoods and boroughs are better served by the subway than others.
The purpose of this exploratory data analysis is to pinpoint neighborhoods that have low ridership per capita. The NYC DOT can use this information to gauge which neighborhoods warrant more investment in subway lines, stations, and tracks. By having more New York City residents utilize the subway system in a convenient manner, more residents will benefit financially by not feeling obligated to purchase a car and the city will benefit from reduced congestion and a smaller carbon footprint.
Design
After extracting the MTA turnstile dataset, I cleaned and filtered the data, computed ridership levels by station, and mapped the stations to zip codes and neighborhoods via Google API and GeoPandas.
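The cleaning step relies on the fact, noted in the Data section, that the ENTRIES column is cumulative per turnstile. Below is a minimal sketch of turning cumulative readings into per-interval counts and summing them by station. It uses plain Python so it runs without dependencies; the sample rows, the function name `new_entries_by_station`, and the timestamp format are assumptions for illustration, not the project's actual code. Real turnstile data also contains counter resets, which this sketch does not handle.

```python
from collections import defaultdict

# Hypothetical rows: (C/A, UNIT, SCP, STATION, timestamp, cumulative ENTRIES)
rows = [
    ("A002", "R051", "02-00-00", "59 ST", "2021-03-01 00:00", 1000),
    ("A002", "R051", "02-00-00", "59 ST", "2021-03-01 04:00", 1010),
    ("A002", "R051", "02-00-00", "59 ST", "2021-03-01 08:00", 1050),
    ("A002", "R051", "02-00-01", "59 ST", "2021-03-01 00:00", 500),
    ("A002", "R051", "02-00-01", "59 ST", "2021-03-01 04:00", 520),
]


def new_entries_by_station(rows):
    """Difference cumulative ENTRIES per turnstile, then sum per station."""
    # Collect the cumulative readings of each turnstile (C/A, UNIT, SCP).
    readings = defaultdict(list)
    for ca, unit, scp, station, timestamp, entries in rows:
        readings[(ca, unit, scp, station)].append((timestamp, entries))

    totals = defaultdict(int)
    for (_, _, _, station), series in readings.items():
        # "YYYY-MM-DD HH:MM" strings sort chronologically.
        series.sort()
        # NEW_ENTRIES: difference between consecutive cumulative readings.
        for (_, previous), (_, current) in zip(series, series[1:]):
            totals[station] += current - previous
    return dict(totals)


station_totals = new_entries_by_station(rows)
print(station_totals)
```

The first reading of each turnstile has no predecessor, so it contributes no count; only differences between consecutive readings are summed.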
Then, I used population data by neighborhood to control for differences in neighborhood size and calculate a ridership per capita value for each neighborhood. I wanted to see which areas are better equipped to handle most of their population going car-less, and which ones have room for improvement. I then identified neighborhoods that fell below a pre-determined threshold (explained in further detail in the Algorithms section) and looked for patterns and trends. Finally, I built plots and charts to analyze my findings.
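The per capita calculation and threshold check can be sketched as follows. This is an illustration under stated assumptions: the neighborhood names, ridership totals, and populations are made up, and the function names are hypothetical. The 4.94 value is the threshold quoted in the Results section; how it was derived and what time window it covers are not reproduced here.

```python
# Threshold quoted in the Results section; its derivation is not shown here.
THRESHOLD = 4.94

# Hypothetical totals, for illustration only.
ridership = {
    "Neighborhood A": 500_000,
    "Neighborhood B": 90_000,
    "Neighborhood C": 240_000,
    "Neighborhood D": 20_000,
}
population = {
    "Neighborhood A": 50_000,
    "Neighborhood B": 30_000,
    "Neighborhood C": 40_000,
    "Neighborhood D": 10_000,
}


def ridership_per_capita(ridership, population):
    """Divide ridership by population, skipping missing or zero populations."""
    return {
        name: ridership[name] / population[name]
        for name in ridership
        if population.get(name)
    }


def below_threshold(per_capita, threshold):
    """Return neighborhoods under the threshold, lowest ridership first."""
    flagged = [name for name, value in per_capita.items() if value < threshold]
    return sorted(flagged, key=per_capita.get)


per_capita = ridership_per_capita(ridership, population)
flagged = below_threshold(per_capita, THRESHOLD)
print(flagged)
```

Sorting the flagged neighborhoods from lowest to highest value puts the ones furthest below the threshold first.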
Data
As mentioned in the Design section, I used the MTA turnstile dataset as my main source for analysis, along with neighborhood population data. Each row contains entry and exit counts, reported at 4-hour intervals and broken down by turnstile, time/day, and subway station. Below is a snippet of the MTA turnstile dataset:
The following notes add more color as to the scope of data in use for this project:
Data Filtering
Data Cleaning
The ENTRIES column provided is cumulative, so I created a new column, NEW_ENTRIES, that takes the difference between entry counts at incremental time intervals for each turnstile (C/A, UNIT, SCP). Then, NEW_ENTRIES gets summed up for total entries for the mentioned time intervals.
Data Mapping
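The Design section says stations were mapped to zip codes and neighborhoods via Google API and GeoPandas; the exact calls are not shown in this write-up. As a dependency-free stand-in for the underlying idea (testing which neighborhood boundary contains a station's coordinates), here is a ray-casting point-in-polygon sketch. It is a swapped-in technique rather than the project's GeoPandas code, and the rectangles, station names, and coordinates are hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    `polygon` is a list of (lon, lat) vertices in order.
    """
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangles standing in for neighborhood boundaries.
neighborhoods = {
    "Neighborhood A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "Neighborhood B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}

# Hypothetical station coordinates as (lon, lat).
stations = {
    "Station 1": (1.0, 1.0),
    "Station 2": (3.0, 0.5),
    "Station 3": (5.0, 5.0),
}


def assign_neighborhood(stations, neighborhoods):
    """Map each station to the first neighborhood containing it, else None."""
    return {
        name: next(
            (hood for hood, poly in neighborhoods.items()
             if point_in_polygon(lon, lat, poly)),
            None,
        )
        for name, (lon, lat) in stations.items()
    }


station_to_neighborhood = assign_neighborhood(stations, neighborhoods)
print(station_to_neighborhood)
```

Stations that fall outside every boundary map to None, which makes unmatched stations easy to spot before aggregating ridership by neighborhood.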
Algorithms
Results
Taking Manhattan as an example, this chart represents the ridership per capita, broken down by neighborhood, pre- and post-COVID. The point is to show that the pandemic did not significantly alter trends. Neighborhoods such as Lower Manhattan, Chelsea, and Clinton have the most subway usage, while Central Harlem could use more investment.
Here, we can see the ridership per capita for each neighborhood vs its respective population. While Manhattan is home to neighborhoods with high ridership, as expected, the majority of neighborhoods that require attention are in Queens. In addition, this phenomenon occurs in a variety of Queens neighborhoods, regardless of size.
If we take a closer look at where these neighborhoods are, a clear pattern emerges in Queens. Each dot represents a neighborhood that misses the 4.94 threshold; the larger the dot, the lower the ridership. Queens not only contains a large number of the neighborhoods that fail to meet the threshold, but also many of those that miss it by a significant margin.
In total, I identified 20 neighborhoods in NYC that warrant additional investment in subway infrastructure. Most of these neighborhoods are in Queens and I recommend starting in this borough. By targeting these neighborhoods, I believe that more residents will feel encouraged to use the subway and be less reliant on cars.
To enhance this project in the future, I would use cab data to identify neighborhoods with high cab traffic and low subway ridership, and take a closer look at subway lines (for example, whether a train line runs infrequently or whether a neighborhood lacks access to many lines).
To see my project in further detail, please visit my GitHub Repo.
Tools Used