Background: In this data analysis assignment, you will apply the concepts and techniques you’ve learned
throughout the course to a real-world dataset. Your task is to explore, clean, analyze, and visualize the data to
derive meaningful insights and provide actionable recommendations. This assignment will test your ability to
manipulate data, use statistical methods, create visualizations, and communicate your findings effectively.
You will be working with a dataset related to Taxi Trip Records in New York City, and your analysis will
help address specific questions or problems within that domain. Get ready to put your data analysis skills to
the test and showcase your ability to make data-driven decisions.
Tools, Languages, Libraries: In this course assignment, you are required to perform data analysis
using the Python programming language. The analysis should be conducted within a Jupyter Notebook
environment. You have the flexibility to choose your preferred tools and platforms for completing the
assignment. Whether you decide to work on your local machine or utilize cloud-based machine learning
services such as Google Colab, Azure Machine Learning, AWS SageMaker, Databricks, or any other, the
choice is yours. The objective is to empower you with the freedom to leverage the tools and resources that
best suit your workflow while showcasing your data analysis skills.
Dataset: Yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported
passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and
Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger
Enhancement Programs (TPEP/LPEP).
For this assignment, you will be using the Yellow Taxi trip records from January, March, and June 2023.
Task 1: Data Loading (8 points)
- Load the NYC taxi datasets for January, March, and June 2023 into pandas DataFrames.
Use the data dictionary to understand what information is captured in each column.
- Compare the three months of data, then identify and discuss three distinct trends.
For the rest of the tasks use the data from January 2023 only.
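As a starting point for Task 1, the loading and comparison steps might be sketched as follows. This is a minimal sketch, not a required solution: the file names in `FILES` are assumed local paths, the helper names `load_months` and `monthly_summary` are hypothetical, and the column names (`trip_distance`, `fare_amount`, `tip_amount`) should be checked against the data dictionary.

```python
import pandas as pd

# Assumed local paths to the three monthly files; adjust to wherever you saved them.
FILES = {
    "2023-01": "yellow_tripdata_2023-01.parquet",
    "2023-03": "yellow_tripdata_2023-03.parquet",
    "2023-06": "yellow_tripdata_2023-06.parquet",
}


def load_months(files):
    """Read each monthly file into its own DataFrame, keyed by month label."""
    return {month: pd.read_parquet(path) for month, path in files.items()}


def monthly_summary(frames):
    """Build one row per month with a few headline statistics to compare."""
    rows = {}
    for month, df in frames.items():
        rows[month] = {
            "trips": len(df),
            "mean_distance": df["trip_distance"].mean(),
            "mean_fare": df["fare_amount"].mean(),
            "mean_tip": df["tip_amount"].mean(),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Usage (needs the files above and a Parquet engine such as pyarrow):
# frames = load_months(FILES)
# print(monthly_summary(frames))
```

A side-by-side table like this is only a first look; the trends you discuss should be backed by further breakdowns (for example by hour, zone, or payment type).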
Task 2: Data Exploration and Pre-processing (12 points)
- Check for missing values in the dataset. Handle them appropriately and explain why you chose that approach.
- Identify two columns that have “noisy” (erroneous) values. Explain why you think they are noisy.
Identify how many such values exist in the dataset.
- Identify 2 columns that are highly correlated and explain their correlation.
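The three checks above could be approached along these lines. This is a hedged sketch: the plausibility rules in `count_noisy` (non-positive distances, negative fares) are assumptions chosen for illustration, and you should define and justify your own. All function names are hypothetical.

```python
import pandas as pd


def missing_report(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    report = pd.DataFrame({"missing": counts, "percent": 100 * counts / len(df)})
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


def count_noisy(df):
    """Count rows that violate simple, assumed plausibility rules."""
    return {
        "trip_distance <= 0": int((df["trip_distance"] <= 0).sum()),
        "fare_amount < 0": int((df["fare_amount"] < 0).sum()),
    }


def most_correlated_pair(df):
    """Return (col_a, col_b, r) for the numeric pair with the largest |Pearson r|."""
    corr = df.select_dtypes("number").corr()
    cols = list(corr.columns)
    best = None
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if pd.notna(r) and (best is None or abs(r) > abs(best[2])):
                best = (a, b, float(r))
    return best
```

Note that `most_correlated_pair` only reports the strongest linear relationship; explaining why the two columns move together is still part of the task.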
Task 3: Featurization (6 points)
Using the existing features create new features:
- Create a flag feature indicating whether the trip takes place during rush hour.
- Create a feature that encodes the “complexity” of the trip by comparing the actual distance of the
trip to the straight-line distance of the trip.
- Calculate the pickup and drop-off frequency in each taxi zone.
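One possible way to build these three features is sketched below, under several assumptions that you should state and adjust in your own work. Rush hour is assumed to mean weekday pickups between 07:00 and 09:59 or between 16:00 and 18:59. The trip records identify locations by taxi zone ID (`PULocationID`, `DOLocationID`), so the straight-line distance here is approximated by the haversine distance between zone centroids; the `centroids` lookup table (indexed by zone ID, with `lat` and `lon` columns) is hypothetical and would have to be derived separately, for example from the TLC taxi zone geometry.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; trip_distance is reported in miles


def add_rush_hour_flag(df, pickup_col="tpep_pickup_datetime"):
    """Add is_rush_hour: 1 for weekday pickups at 07-09h or 16-18h (assumed definition)."""
    ts = pd.to_datetime(df[pickup_col])
    weekday = ts.dt.dayofweek < 5  # Monday=0 ... Friday=4
    hour = ts.dt.hour
    peak = hour.between(7, 9) | hour.between(16, 18)
    return df.assign(is_rush_hour=(weekday & peak).astype(int))


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


def add_complexity(df, centroids):
    """Add complexity = trip_distance / straight-line distance between zone centroids.

    centroids: hypothetical DataFrame indexed by zone ID with 'lat' and 'lon' columns.
    """
    pu = centroids.reindex(df["PULocationID"])
    do = centroids.reindex(df["DOLocationID"])
    straight = haversine_miles(pu["lat"].to_numpy(), pu["lon"].to_numpy(),
                               do["lat"].to_numpy(), do["lon"].to_numpy())
    # Same-zone or unknown-zone trips have no usable straight-line distance: keep NaN.
    straight = np.where(straight > 0, straight, np.nan)
    return df.assign(straight_line_miles=straight,
                     complexity=df["trip_distance"].to_numpy() / straight)


def zone_frequency(df):
    """Number of pickups and drop-offs per taxi zone."""
    pickups = df["PULocationID"].value_counts().rename("pickups")
    dropoffs = df["DOLocationID"].value_counts().rename("dropoffs")
    return pd.concat([pickups, dropoffs], axis=1).fillna(0).astype(int)
```

A ratio close to 1 suggests a nearly direct route, while larger values suggest detours; because centroids only approximate the true endpoints, short trips can produce ratios below 1.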
Task 4: Data Analysis (18 points)
Answer specific questions or perform analyses based on the dataset. For example:
- Rank the vendors by popularity.
- What are the peak travel hours?
- What is the average distance of the trips on weekdays and weekends?
- What is the average number of passengers in a trip on weekdays and weekends?
- What is the correlation between the fare amount and the tip?
- What is the correlation between the fare amount and the number of passengers?
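The example questions above map fairly directly onto pandas group-by and correlation operations. The following is a minimal sketch, assuming the column names `VendorID`, `tpep_pickup_datetime`, `trip_distance`, `passenger_count`, `fare_amount`, and `tip_amount` from the data dictionary; the function name `analyze` is hypothetical, and weekends are assumed to mean Saturday and Sunday.

```python
import pandas as pd


def analyze(df):
    """Compute the example Task 4 statistics and return them in a dict."""
    ts = pd.to_datetime(df["tpep_pickup_datetime"])
    # dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
    day_type = ts.dt.dayofweek.map(lambda d: "weekend" if d >= 5 else "weekday")
    return {
        # Vendors ordered by number of trips, most popular first.
        "vendor_rank": df["VendorID"].value_counts(),
        # Trip counts per pickup hour, busiest hour first.
        "trips_by_hour": ts.dt.hour.value_counts().sort_values(ascending=False),
        # Mean distance and passenger count for weekdays vs. weekends.
        "by_day_type": df.groupby(day_type)[["trip_distance", "passenger_count"]].mean(),
        # Pearson correlations (missing values are ignored pairwise).
        "fare_tip_corr": df["fare_amount"].corr(df["tip_amount"]),
        "fare_passenger_corr": df["fare_amount"].corr(df["passenger_count"]),
    }
```

Running this on uncleaned data will let the noisy values from Task 2 distort the averages and correlations, so it is worth applying it after pre-processing and pairing each number with a visualization.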
Task 5: Conclusion (6 points)
- Summarize the key findings from your data analysis.
- Reflect on any challenges you faced during the analysis.
- Suggest possible next steps or additional analyses that could be performed.