COMS BC1016 Final Project
Spring 2026
Barnard College
The final project for BC1016 provides an opportunity to bring together, apply, and communicate your knowledge of data science and statistics from this course. You will work in groups of 2 to choose one of the provided datasets to analyze and submit a writeup of your analysis and conclusions.
AI and Usage of Outside Resources
As a reminder, the syllabus states:
For your final project, AI generated text is not permitted as part of your written descriptions in your final report. Your report must include your own original writing and reflections. Violations can result in a failing grade for the assignment and/or the course.
Please note this policy includes using generative AI to produce analysis.
You are not allowed to submit AI generated code.
If you would like to do anything more advanced than what we have covered in class, you must include a brief explanation of where you learned these concepts. This applies, for instance, if you have taken prior statistics classes and would like to perform a more complicated analysis, or if you have experience with (and would like to use) Python libraries we have not covered in class.
Project Milestones and Deadlines
- Group Declaration: Deadline Wednesday, April 1
Please complete one of the two Google Forms to indicate your group preference: Group Declaration (if you know what group you want to be in) or Group Matching (if you do not have a group and would like us to form one for you).
- Project Proposal: Due Friday, April 17 at 11:59pm
Each group will select a final project notebook and dataset to work on for the final project and complete the graphs and tables required in the Exploratory Data Analysis section. Groups do not need to include sentence explanations of their graphs and tables in the project proposal. Based on your exploratory data analysis, you will state the hypothesis (null and alternative) you are planning to test.
- Progress Report: Due Monday, April 27 at 11:59pm
At this point, groups should be about 60% done with the final project. For the progress report, groups should complete the exploratory data analysis and hypothesis testing sections and should begin the prediction section. For whatever is not completed in the prediction section by the time of the progress report, groups should list how they plan to approach it and what analysis they plan to do. Additionally, groups should share any issues with their analysis that they have questions about or may need assistance with.
- Final Project Report: Due Friday, May 8 at 11:59pm
Groups will submit the completed reports along with a completed peer review.
Note: We will require all students to complete a peer review to share how work was distributed among team members. Any major discrepancies in the distribution of work will be factored into individual grades on this assignment.
Final Project Grading Breakdown:
- Group Declaration - 1%
- Project Proposal - 9%
- Progress Report - 25%
- Final Report - 65%
Final Project Groups
This project is intended to be worked on in groups of two. Ideally, students will find a partner from the same lab section, though you may choose to work with someone from a different lab section if you prefer. Please note that the last two labs are "Final Project Work Time" and "Final Project Consultations". All group members are required to attend the same lab section for these final two labs.
Groups of 3 students:
If you would like to work in a group of 3, please note that the final report will have an extra requirement to account for the extra member. Groups of 3 will be required to write up analysis and results of 2 Hypothesis Tests (rather than only 1 Hypothesis Test required for groups of 2).
Working Alone:
You may choose to work alone. However, please note that you will have the same project components and requirements as a group of 2.
Project Proposal
Once your groups have been formed, you will be tasked with exploring the data and coming up with the hypothesis question(s) you will be testing. You will create the 2 graphs and 2 tables for the Exploratory Data Analysis section of the final report and state your null and alternative hypotheses. For the project proposal, you are not required to write the sentence explanations for the exploratory data analysis. Grading will be based on completion and readability.
Progress Report
Before the "Final Project Consultations", you will be asked to submit a progress report. The main purpose of the progress report is to identify any bottlenecks groups are running into. This will give groups and the teaching staff information to work through any issues before the semester ends. The progress report, similar to the project proposal, will be graded based on completion and readability.
For the progress report, you will be required to complete the Exploratory Data Analysis (graphs, tables, and sentence explanations) and the Hypothesis Testing sections. You are not required to have completed the prediction section, but you should list your plan and what remains to be done for the prediction section. If you are running into any issues with your final project, you should include this in your progress report.
Datasets
Please note you may not copy or reference any existing or public analysis of any of these datasets. Doing so will result in a zero.
-
The New York City Department of Health and Mental Hygiene (DOHMH) provides a database of all violations, both confirmed and still being reviewed, from all restaurant and college cafeteria inspections done in the past three years.
The restaurant health data provided was downloaded in October 2025 and has been separated into three datasets: Grade, Location, and Violation.
-
The city of Seattle makes available its database of pet licenses issued from Jan 2017 to Oct 2025 as part of the city's ongoing Open Data Initiative. We have also prepared two additional datasets. The first is the Statistics of Income (SOI) dataset for WA from the 2022 tax year, which features the number of tax returns received by the IRS from each zip code broken out by several income brackets. The second is the Seattle Parks and Recreation Park Addresses.
-
Iconic musician Taylor Swift needs no introduction. This dataset comes from the taylor R package by W. Jake Thompson and is a curated set of data scraped from Genius and the Spotify API. In this project you will have access to both song-level metadata and lyrical content from Taylor Swift's many albums and eras.
-
Brian Mubia scraped and curated a collection of recipes from the cooking website allrecipes as part of the tastyR package. This data includes ingredients, cooking times, nutritional profiles, and community feedback.
If your group would like to analyze a dataset that is not one of the ones provided, you must get instructor approval. Once you have approval, you may use the template starter notebook.
Report Sections
Starter notebooks will include an outline of the required components of the final report. The teaching staff can assist in brainstorming analyses to consider, but your team should ultimately decide what question is most interesting to you. The report includes the following sections:
Introduction
- Introduce the dataset to familiarize your reader with the data / variables involved, including:
- who collected the dataset and why, when and where it was collected
- what information is included in the dataset (e.g., what each row represents and what attributes are included)
- the variables most relevant to your analysis (hypothesis test, prediction analysis, plots for data exploration)
- Explicitly state your hypothesis test and prediction questions:
- Hypothesis test (groups of three are required to have 2 hypothesis tests)
- What is the null hypothesis?
- What is the alternative hypothesis?
- How do you plan to test your hypothesis?
- Prediction Analysis
- What two attributes will you analyze the relationship between?
- What is your prediction testing question?
- What you expect to learn overall
- What do your hypothesis test and prediction analysis help you answer about the data?
Exploratory Data Analysis
Here you will include exactly 2 plots and 2 tables and associated descriptions of each output. For each plot and table, provide a brief (2-4 sentence) explanation of what the result tells us and why we should care about it:
- 1 quantitative plot (e.g., scatter plot, histogram, etc.)
- 1 qualitative plot (e.g., bar chart comparing groups)
- 1 table made using an aggregate function (pivot or group)
- 1 table made as a result of a join
As general advice, readers should be able to take away as much information as possible from the plots themselves. Be deliberate about how you format these plots and focus on selecting one that best represents the most interesting result (rather than including multiple plots/tables to demonstrate the work you've done). The plots and tables should relate to the hypothesis test and prediction questions you hope to answer. You should judiciously show only the attributes that are important for your questions and thus hide rows or columns that take away from key information.
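To make the two table requirements concrete, here is a minimal sketch of a join followed by a grouped aggregate. It uses only the Python standard library and hypothetical toy data (`trips`, `stations`) that is unrelated to the provided datasets; the helper names `inner_join` and `group_mean` are made up for illustration. In your notebook you would use the table methods covered in class instead.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data, unrelated to the course datasets.
trips = [
    {"station_id": 1, "minutes": 10},
    {"station_id": 1, "minutes": 14},
    {"station_id": 2, "minutes": 30},
    {"station_id": 3, "minutes": 12},
]
stations = [
    {"station_id": 1, "neighborhood": "North"},
    {"station_id": 2, "neighborhood": "South"},
    {"station_id": 3, "neighborhood": "North"},
]

def inner_join(left, right, key):
    """Pair each left row with the right row that shares the same key value."""
    # Assumes each key value appears at most once in `right`.
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]

def group_mean(rows, group_col, value_col):
    """Average value_col within each distinct value of group_col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[value_col])
    return {label: mean(values) for label, values in groups.items()}

# Table made as a result of a join: each trip gains its station's neighborhood.
joined = inner_join(trips, stations, "station_id")

# Table made using an aggregate function: mean trip length per neighborhood.
mean_minutes = group_mean(joined, "neighborhood", "minutes")
```

Note that the aggregate table only became possible after the join, because `neighborhood` lives in a different table than `minutes`. A join that unlocks a new grouping like this is usually more informative than a join done only to satisfy the requirement.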
Hypothesis Test
- State the hypothesis test (restate the null and alternative hypothesis)
- Decide on a significance level
- Explain your choice of test
- Implement your test in code
- Interpret your results and state whether or not the data are consistent with the null hypothesis
- For groups of three, include a summary of what you learned from both tests (not just what you learned from each one).
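As an illustration of the steps above, here is a minimal sketch of one possible test: a permutation test for a difference in group means. It assumes your question compares a numerical attribute across two groups, which may not match your hypothesis. The data, the function name `permutation_test`, the number of shuffles, and the 0.05 significance level are all hypothetical choices made for this example.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, num_shuffles=2000, seed=0):
    """Return the observed |difference in means| and an empirical p-value.

    Under the null hypothesis the group labels do not matter, so we
    repeatedly shuffle the pooled values and re-split them into two groups.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(num_shuffles):
        rng.shuffle(pooled)
        simulated = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if simulated >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / num_shuffles

# Hypothetical toy measurements for two groups.
group_a = [12, 15, 14, 10, 13, 16]
group_b = [22, 25, 19, 24, 21, 23]

alpha = 0.05  # significance level, chosen before looking at the result
observed, p_value = permutation_test(group_a, group_b)
reject_null = p_value < alpha
```

The significance level is fixed before the test is run, and the conclusion is phrased in terms of the null hypothesis: if `p_value` is below `alpha`, the data are inconsistent with the null; otherwise they are consistent with it. Neither outcome proves the alternative hypothesis.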
Prediction
- Restate your prediction question and the method you'll use to test
- Implement your predictive model in code
- Evaluate your model fit and/or performance using appropriate diagnostic tools and measurements (for a linear regression, for example, a residual plot and the correlation coefficient)
- You should show work to calculate the model numerically (for example, the slope and intercept of the regression line). Do not merely use fit_line = True.
- Even if your scatter plot does not seem to indicate a linear relationship, please fit a linear regression and generate a residual plot. This demonstrates that you know how to apply linear regression and provides further evidence of non-linearity.
- Interpret your results. Does the model fit the data? Can you use it to predict the outcome of new datapoints? What changes might need to be made to get a better model?
When exploring prediction questions for your final project, it’s generally recommended to apply linear regression to numerical data, and it is not recommended to apply linear regression to categorical data that is binary (only has two values).
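As a sketch of what "showing work" for a regression line can look like, the example below computes the correlation coefficient, slope, intercept, residuals, and root mean squared error by hand using only the Python standard library. The data and the function names (`correlation`, `regression_line`, `residuals`) are hypothetical and chosen for illustration; your notebook should use the tools from class on your own dataset.

```python
import math
from statistics import mean, pstdev

def correlation(x, y):
    """Correlation r: the mean of the products of x and y in standard units."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    return mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)])

def regression_line(x, y):
    """Slope and intercept of the least-squares line, computed from r."""
    r = correlation(x, y)
    slope = r * pstdev(y) / pstdev(x)
    intercept = mean(y) - slope * mean(x)
    return slope, intercept

def residuals(x, y, slope, intercept):
    """Observed y minus the y predicted by the line, for each point."""
    return [b - (slope * a + intercept) for a, b in zip(x, y)]

# Hypothetical toy data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = correlation(x, y)
slope, intercept = regression_line(x, y)
resid = residuals(x, y, slope, intercept)
rmse = math.sqrt(mean([e ** 2 for e in resid]))
```

Plotting `resid` against `x` gives the residual plot. For a good linear fit the residuals should scatter around zero with no visible pattern; a curve or a funnel shape suggests that a straight line is not the right model.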
Conclusion
Your report should end with a section that summarizes your results and draws conclusions.
- Briefly restate the results of your hypothesis test(s) and prediction procedures
- State what you conclude given the results of all of your analyses.
- State at least one potential limitation of your data or analysis. Are there potential sources of bias in your data?
Report Tips
- Communicating your analytical approach is a core component of how your final project is evaluated. Spelling errors, typos, and grammatical errors will be factored into your grade.
- A large part of communicating is being selective about which components of your analysis to report. Your final report shouldn't be a walkthrough of your entire process. Rather, you should home in on the charts and tables that are most important for communicating your results. While you should absolutely explore many paths in your analysis, be judicious about which ones you write about in your final report. If you were reading this report for the first time, what would you find useful for understanding the results? Are there any parts that are unnecessary or distract from the main point?