COMS BC1016 Final Project
Fall 2025
Barnard College
The final project for BC1016 provides an opportunity to bring together, apply, and communicate your knowledge of data science and statistics from this course. You will work in groups of 2 to choose one of the provided datasets to analyze and submit a writeup of your analysis and conclusions.
Project Milestones and Deadlines
- Group Declaration: Deadline Friday, Nov 7
Please read the "Group Guidelines" section for guidelines on how to form your final project groups and complete the Google Form to indicate your group (Individual or Group).
- Project Proposal: Due Wednesday, Nov 19 at 11:59pm
Each group will select a final project notebook and dataset and complete the introduction section of the report.
- Progress Report: Due Monday, Dec 1 at 11:59pm
At this point, groups should be about 60% done with the final project. For the progress report, groups should list what analysis remains and how they plan to approach it. Groups should also share any issues with their analysis that they have questions about or may need assistance with.
- Final Project Report: Due Friday, Dec 12 at 11:59pm
Groups will submit the completed reports along with a completed peer review.
Note: We will require all students to complete a peer review to share how work was distributed among team members. Any major discrepancies in the distribution of work will be factored into individual grades on this assignment.
Final Project Grading Breakdown:
- Project Proposal - 10%
- Progress Report - 25%
- Final Report - 65%
Group Guidelines
Groups of 2 students in the same lab section
Ideally, students will find a partner from the same lab section since we will have lab group work time and project consultations during the last lab.
Groups of 2 students from different lab sections
If you want to work in a group with a student not in your lab section, you will need to follow extra steps to receive approval:
- Both individuals agree on a lab section they will both attend during the last full week of the semester (12/3 or 12/4).
- In a single email to both 1016 professors (Prof Lee and Prof Megjhani) and your two lab TAs, state your names, what lab section each of you is in, and which lab section you will be attending for the final project consultation in the last lab. To ensure it's not missed, please title your email "[BC1016] Final Project Alice and Bob" where Alice and Bob are replaced with your names.
- One of the professors confirms via email that you have permission to form this group.
Groups of 3 students
If you would like to work in a group of 3, please note that the final report will have an extra requirement to account for the extra member. Groups of 3 will be required to write up analysis and results of 2 Hypothesis Tests (rather than only 1 Hypothesis Test required for groups of 2).
To receive approval to work in a group of 3, you will need to follow extra steps:
- All three members agree on a lab section they will all attend during the last full week of the semester (12/3 or 12/4).
- In a single email to both 1016 professors (Prof Lee and Prof Megjhani) and your lab TAs, state your names, what lab section each of you is in, and which lab section you will be attending for the final project consultation in the last lab. To ensure it's not missed, please title your email "[BC1016] Final Project Alice, Bob, and Carol" where Alice, Bob, and Carol are replaced with your names.
- One of the professors confirms via email that you have permission to form this group.
Datasets
Please note you may not copy or reference any public analysis of any of these datasets. Doing so will result in a zero.
- Inside Airbnb (https://insideairbnb.com/about/) collects and publishes data on Airbnb listings in major cities across the world. For this dataset, we've downloaded and cleaned the data for 13 of the cities they have listed. You may choose to analyze one (or multiple) of the provided cities. If you are interested in a city that we did not clean, you may request permission from the instructors to use the data for that city.
- NYC Restaurant Inspections - Link to data
The New York City Department of Health and Mental Hygiene (DOHMH) provides a database of all violations, both confirmed and still being reviewed, from all restaurant and college cafeteria inspections done in the past three years (https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j/about_data).
The restaurant inspection data provided was downloaded in Oct 2025 and has been separated into three datasets: Grade, Location, and Violation.
- The city of Seattle makes available its database of pet licenses issued from Jan 2017 to Oct 2025 as part of the city's ongoing Open Data Initiative (https://data.seattle.gov/City-Administration/Seattle-Pet-Licenses/jguv-t9rb/about_data). We have also prepared two additional datasets. The first is the Statistics of Income (SOI) dataset for WA from the 2022 tax year (https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2022-zip-code-data-soi), which features the number of tax returns received by the IRS from each zip code broken out by several income brackets. The second is the Seattle Parks and Recreation Park Addresses (https://data.seattle.gov/Community-and-Culture/Seattle-Parks-And-Recreation-Park-Addresses/v5tj-kqhc/about_data).
- Nidula Elgiriyewithana uploaded this dataset onto Kaggle in 2023. A brief summary of the dataset, quoted from its original description, is provided below:
"This dataset contains a comprehensive list of the most famous songs of 2023 as listed on Spotify. The dataset offers a wealth of features beyond what is typically available in similar datasets. It provides insights into each song's attributes, popularity, and presence on various music platforms. The dataset includes information such as track name, artist(s) name, release date, Spotify playlists and charts, streaming statistics, Apple Music presence, Deezer presence, Shazam charts, and various audio features."
Report Sections
We have provided a template with a copy of the instructions for the written component of the final report (Link). The teaching staff can assist in brainstorming analyses to consider, but your team should ultimately decide what question is most interesting to you. The report includes the following sections:
Introduction
- (250-300 words) Introduce the dataset to familiarize your reader with the data / variables involved, including:
- who collected the dataset and why, when and where it was collected
- what information is included in the dataset (e.g., what each row represents and what attributes are included)
- the variables most relevant to your analysis (hypothesis test, prediction analysis, plots for data exploration)
- (150-200 words) Explicitly state your hypothesis test and prediction questions:
- Hypothesis test (groups of three are required to have 2 hypothesis tests)
- What is the null hypothesis?
- What is the alternative hypothesis?
- How do you plan to test your hypothesis?
- Prediction Analysis
- What two attributes will you analyze the relationship between?
- What is your prediction testing question?
- What you expect to learn overall
- What do your hypothesis test and prediction analysis help you answer about the data?
Exploratory Data Analysis
Here you will include exactly 2 plots and 2 tables and associated descriptions of each output. For each plot and table, provide a brief (2-4 sentence) explanation of what the result tells us and why we should care about it:
- 1 quantitative plot (e.g., scatter plot, histogram, etc.)
- 1 qualitative plot (e.g., bar chart comparing groups)
- 1 table made using an aggregate function (pivot or group)
- 1 table made as a result of a join
As general advice, readers should be able to take away as much information as possible from the plots themselves. Be deliberate about how you format these plots and focus on selecting one that best represents the most interesting result (rather than including multiple plots/tables to demonstrate the work you've done). The plots and tables should relate to the hypothesis test and prediction questions you hope to answer. You should judiciously show only the attributes that are important for your questions and thus hide rows or columns that take away from key information.
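As a rough illustration of the two required tables, the sketch below builds one table with an aggregate function (a group-by mean) and one table with a join. It uses only the Python standard library so that it runs anywhere. In your notebook you would more likely use the table methods from the course materials. The rows, column names, and the helper functions `group_mean` and `inner_join` are all hypothetical and are not taken from any of the provided datasets.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical listings table: one row per listing (made-up values).
listings = [
    {"listing_id": 1, "neighborhood": "Harlem", "price": 120.0},
    {"listing_id": 2, "neighborhood": "Harlem", "price": 80.0},
    {"listing_id": 3, "neighborhood": "Astoria", "price": 100.0},
    {"listing_id": 4, "neighborhood": "Astoria", "price": 140.0},
    {"listing_id": 5, "neighborhood": "Astoria", "price": 90.0},
]

# Hypothetical lookup table: one row per neighborhood.
neighborhoods = [
    {"neighborhood": "Harlem", "borough": "Manhattan"},
    {"neighborhood": "Astoria", "borough": "Queens"},
]


def group_mean(rows, key, value):
    """Aggregate: row count and mean of `value` for each distinct `key`."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key]].append(row[value])
    return [
        {key: k, "count": len(v), f"mean_{value}": fmean(v)}
        for k, v in sorted(buckets.items())
    ]


def inner_join(left, right, key):
    """Join: add the columns of the matching `right` row to each `left` row."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


# Table 1: aggregate (mean price per neighborhood).
by_neighborhood = group_mean(listings, "neighborhood", "price")

# Table 2: join the aggregate table with the neighborhood lookup table.
joined = inner_join(by_neighborhood, neighborhoods, "neighborhood")

for row in joined:
    print(row)
```

Note that the joined table keeps only the columns needed for the question at hand, in line with the advice above to hide rows or columns that take away from key information.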
Hypothesis Test
- State the hypothesis test (restate the null and alternative hypothesis)
- Decide on a significance level
- Explain your choice of test
- Implement your test in code (note: you do not need to include the code in your written report)
- Interpret your results and state whether the data is consistent with the null or not
- For groups of three, include a summary of what you learned from both tests taken together (not just what you learned from each one).
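The steps above can be sketched in code as follows. This is a minimal sketch of one possible test, a two-sided permutation test on the difference in group means. It is not the required method, and your choice of test should follow from your own hypothesis. The data values, the significance level of 0.05, the number of shuffles, and the function names are all assumptions made for illustration.

```python
import random
from statistics import fmean

# Hypothetical data: one numeric measurement for each of two groups.
group_a = [12.1, 14.3, 11.8, 15.0, 13.6, 14.9, 12.7, 13.9]
group_b = [10.2, 11.5, 12.0, 10.9, 11.1, 12.4, 10.7, 11.8]

# Null hypothesis: both groups come from the same distribution.
# Alternative hypothesis: the group means differ.
ALPHA = 0.05         # significance level, chosen before running the test
NUM_SHUFFLES = 5000  # number of simulated samples under the null


def mean_difference(a, b):
    """Test statistic: difference between the two group means."""
    return fmean(a) - fmean(b)


def permutation_p_value(a, b, num_shuffles, seed=0):
    """Share of shuffled labelings at least as extreme as the observed one."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean_difference(a, b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(num_shuffles):
        rng.shuffle(pooled)  # shuffling breaks any link between label and value
        shuffled_a, shuffled_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean_difference(shuffled_a, shuffled_b)) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / num_shuffles


p_value = permutation_p_value(group_a, group_b, NUM_SHUFFLES)
print(f"observed difference in means: {mean_difference(group_a, group_b):.3f}")
print(f"p-value: {p_value:.4f}")
if p_value < ALPHA:
    print("The data are not consistent with the null hypothesis.")
else:
    print("The data are consistent with the null hypothesis.")
```

In the written report, the code itself can be omitted. What matters is the stated hypotheses, the significance level, the reason for the choice of test, and the interpretation of the p-value.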
Prediction
- Restate your prediction question and the method you'll use to test
- Implement your predictive model in code (note: you do not need to include the code in the written report)
- Evaluate your model fit and/or performance using appropriate diagnostic tools and measurements (for linear regression, for example, a residual plot)
- You should show work to calculate the model numerically (for example, the slope and intercept of the regression line). Do not merely use fit_line = True.
- Even if your scatter plot does not seem to indicate a linear relationship, please fit a linear regression anyway and generate a residual plot to show further evidence of non-linearity.
- Interpret your results. Does the model fit the data? Can you use it to predict the outcome of new datapoints? What changes might need to be made to get a better model?
When exploring prediction questions for your final project, we generally recommend applying linear regression to numerical data. We do not recommend applying linear regression to binary categorical data (data with only two values).
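To illustrate what calculating the model numerically can look like, the sketch below computes the correlation coefficient from standard units, derives the slope and intercept of the regression line from it, and then computes the residuals and their root mean squared error. It is one way to show the work, under the assumption that your model is a simple linear regression with one predictor. The data values and function names are hypothetical, and the sketch uses only the Python standard library.

```python
from math import sqrt
from statistics import fmean, pstdev

# Hypothetical paired numerical data (made-up values).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def standard_units(values):
    """Convert values to standard units: (value - mean) / SD."""
    mean, sd = fmean(values), pstdev(values)
    return [(v - mean) / sd for v in values]


def correlation(xs, ys):
    """Correlation r: mean of the products of the standard units."""
    products = [a * b for a, b in zip(standard_units(xs), standard_units(ys))]
    return fmean(products)


def regression_line(xs, ys):
    """Slope and intercept of the least-squares regression line."""
    r = correlation(xs, ys)
    slope = r * pstdev(ys) / pstdev(xs)
    intercept = fmean(ys) - slope * fmean(xs)
    return slope, intercept


slope, intercept = regression_line(x, y)
fitted = [slope * v + intercept for v in x]
residuals = [observed - fit for observed, fit in zip(y, fitted)]
rmse = sqrt(fmean([res ** 2 for res in residuals]))

print(f"r = {correlation(x, y):.4f}")
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
print(f"RMSE = {rmse:.4f}")
for x_value, res in zip(x, residuals):
    print(f"x = {x_value:.1f}, residual = {res:+.4f}")
```

Plotting the residuals against `x` gives the residual plot. A patternless band around zero suggests a linear model is reasonable, while a curve or funnel shape is evidence that it is not.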
Conclusion
This section should be roughly 200 words.
- Briefly restate the results of your hypothesis test(s) and prediction procedures
- State what you conclude given the results of all of your analyses.
- State at least one potential limitation of your data or analysis. Are there potential sources of bias in your data?
Report Tips
- Write the report text in a collaborative text editor (such as Google Docs) to be able to work on components of the report with your partner(s).
- Communicating your analytical approach is a core component of how your final project is evaluated. Spelling errors, typos, and grammatical errors will be factored into your grade.
- A large part of communicating is being selective about which components of your analysis to report. Your final report shouldn't be a walkthrough of your entire process. Rather, you should home in on the charts and tables that are most important for communicating your results. While you should absolutely explore many paths in your analysis, be judicious about which ones you write about in your final report. If you were reading this report for the first time, what would you find useful for understanding the results? Are there any parts that are unnecessary or distract from the main point?