COMS BC1016 Final Project

Fall 2025
Barnard College


The final project for BC1016 provides an opportunity to bring together, apply, and communicate your knowledge of data science and statistics from this course. You will work in groups of 2 to choose one of the provided datasets to analyze and submit a writeup of your analysis and conclusions.

Project Milestones and Deadlines

  1. Group Declaration: Deadline Friday, Nov 7
    Please read the "Group Guidelines" section to learn how to form your final project group, then complete the Google Form to indicate your group (Individual or Group).
  2. Project Proposal: Due Wednesday, Nov 19 at 11:59pm
    Each group will select a final project notebook and dataset to work on for the final project and complete the introduction section of the report.
  3. Progress Report: Due Monday, Dec 1 at 11:59pm
    At this point, groups should be about 60% done with the final project. For the progress report, groups should list what analysis remains and how they plan to approach it. Groups should also describe any issues with their analysis that they have questions about or need assistance with.
  4. Final Project Report: Due Friday, Dec 12 at 11:59pm
    Groups will submit the completed reports along with a completed peer review.
Note: We will require all students to complete a peer review to share how work was distributed among team members. Any major discrepancies in the distribution of work will be factored into individual grades on this assignment.

Final Project Grading Breakdown:

Group Guidelines

Groups of 2 students in the same lab section
Ideally, students will find a partner from the same lab section since we will have lab group work time and project consultations during the last lab.

Groups of 2 students from different lab sections
If you want to work in a group with a student not in your lab section, you will need to follow extra steps to receive approval:

Groups of 3 students
If you would like to work in a group of 3, please note that the final report has an extra requirement to account for the extra member. Groups of 3 must write up the analysis and results of 2 hypothesis tests (groups of 2 need only 1).
To receive approval to work in a group of 3, you will need to follow extra steps:

Datasets

Please note you may not copy or reference any public analysis of any of these datasets. Doing so will result in a zero.

Report Sections

We have provided a template with a copy of the instructions for the written component of the final report (Link). The teaching staff can assist in brainstorming analyses to consider, but your team should ultimately decide which question is most interesting to you. The report includes the following sections:

Introduction

  1. (250-300 words) Introduce the dataset to familiarize your reader with the data / variables involved, including:
    1. who collected the dataset and why, when and where it was collected
    2. what information is included in the dataset (e.g., what each row represents and what attributes are included)
    3. the variables most relevant to your analysis (hypothesis test, prediction analysis, plots for data exploration)
  2. (150-200 words) Explicitly state your hypothesis test and prediction questions:
    1. Hypothesis test (groups of three are required to have 2 hypothesis tests)
      1. What is the null hypothesis?
      2. What is the alternative hypothesis?
      3. How do you plan to test your hypothesis?
    2. Prediction Analysis
      1. What two attributes will you analyze the relationship between?
      2. What is your prediction testing question?
  3. What you expect to learn overall
    1. What do your hypothesis test and prediction analysis help you answer about the data?

Exploratory Data Analysis

Here you will include exactly 2 plots and 2 tables and associated descriptions of each output. For each plot and table, provide a brief (2-4 sentence) explanation of what the result tells us and why we should care about it:

  1. 1 quantitative plot (e.g., scatter plot, histogram, etc.)
  2. 1 qualitative plot (e.g., bar chart comparing groups)
  3. table made using an aggregate function (pivot or group)
  4. table made as a result of a join
As general advice, readers should be able to take away as much information as possible from the plots themselves. Be deliberate about how you format these plots and focus on selecting one that best represents the most interesting result (rather than including multiple plots/tables to demonstrate the work you've done). The plots and tables should relate to the hypothesis test and prediction questions you hope to answer. You should judiciously show only the attributes that are important for your questions and thus hide rows or columns that take away from key information.
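As a rough illustration of the two table types above, the sketch below builds a grouped (aggregate) table and a joined table using only the Python standard library. The rows, column names (`station_id`, `duration_min`, `neighborhood`), and helper functions (`group_mean`, `inner_join`) are hypothetical and not taken from any of the project datasets; in your notebook you would more likely use the group and join methods of the table library from class.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows for illustration only; not from any project dataset.
trips = [
    {"station_id": 1, "duration_min": 12},
    {"station_id": 1, "duration_min": 18},
    {"station_id": 2, "duration_min": 7},
    {"station_id": 2, "duration_min": 9},
    {"station_id": 3, "duration_min": 30},
]
stations = [
    {"station_id": 1, "neighborhood": "A"},
    {"station_id": 2, "neighborhood": "B"},
]

def group_mean(rows, key, value):
    """Aggregate: one output row per distinct key, holding the mean of value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return [{key: k, f"mean_{value}": mean(v)} for k, v in sorted(groups.items())]

def inner_join(left, right, key):
    """Join: keep left rows whose key appears in right, adding right's columns.

    Assumes each key value appears at most once in `right`.
    """
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]

mean_by_station = group_mean(trips, "station_id", "duration_min")
joined = inner_join(mean_by_station, stations, "station_id")

for row in joined:
    print(row)
```

Note that station 3 has no match in `stations`, so the inner join drops it. Rows silently lost in a join like this are worth checking for before you describe the resulting table.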

Hypothesis Test

  1. State the hypothesis test (restate the null and alternative hypothesis)
  2. Decide on a significance level
  3. Explain your choice of test
  4. Implement your test in code (note: you do not need to include the code in your written report)
  5. Interpret your results and state whether the data are consistent with the null hypothesis
  6. For groups of three, include a summary of what you learned from both tests (not just what you learned from each one).
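The steps above could be carried out in many ways. As one possible sketch, the code below runs a two-sided permutation test for a difference in group means using only the Python standard library. The choice of a permutation test, the function name `permutation_test`, the 0.05 significance level, and the data values are all assumptions made for illustration; your own test should follow from your hypotheses and your dataset.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, num_shuffles=5000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns (observed_difference, empirical_p_value).
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(num_shuffles):
        # Shuffling the pooled values simulates the null hypothesis that
        # group labels are unrelated to the measured values.
        rng.shuffle(pooled)
        simulated = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(simulated) >= abs(observed):
            at_least_as_extreme += 1

    return observed, at_least_as_extreme / num_shuffles

# Made-up values for illustration only; not from any project dataset.
group_a = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5]
group_b = [10.2, 11.8, 12.5, 11.1, 10.9, 12.0]

significance_level = 0.05  # decided before looking at the results
observed, p_value = permutation_test(group_a, group_b)

print(f"observed difference in means: {observed:.2f}")
print(f"empirical p-value: {p_value:.4f}")
if p_value < significance_level:
    print("The data are inconsistent with the null hypothesis at this level.")
else:
    print("The data are consistent with the null hypothesis at this level.")
```

The p-value here is the fraction of shuffles whose difference is at least as far from zero as the observed one. That fraction is the quantity to compare against the significance level chosen in step 2.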

Prediction

  1. Restate your prediction question and the method you'll use to test
  2. Implement your predictive model in code (note: you do not need to include the code in the written report)
  3. Evaluate your model fit and/or performance using appropriate diagnostic tools and measurements (such as linear regression)
    • You should show your work to calculate the model numerically (for example, the slope and intercept of the regression line). Do not merely use fit_line=True.
    • Even if your scatter plot does not appear to show a linear relationship, fit a linear regression anyway and generate a residual plot. A visible pattern in the residuals provides further evidence of non-linearity.
  4. Interpret your results. Does the model fit the data? Can you use it to predict the outcome of new datapoints? What changes might need to be made to get a better model?
When exploring prediction questions for your final project, we generally recommend applying linear regression to numerical data. We do not recommend applying it to binary categorical data (data with only two values).
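As a hedged sketch of what computing the model numerically might look like, the code below derives the slope and intercept of the regression line from the correlation coefficient and the standard deviations, then computes the residuals that a residual plot would display. The helper names (`standard_units`, `correlation`, `regression_line`, `residuals`) and the data values are assumptions made for illustration; they are not part of the provided notebooks.

```python
from math import sqrt
from statistics import mean, pstdev

def standard_units(values):
    """Convert values to standard units: (value - mean) / SD."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def correlation(x, y):
    """Correlation coefficient r: mean of the products in standard units."""
    return mean(a * b for a, b in zip(standard_units(x), standard_units(y)))

def regression_line(x, y):
    """Return (slope, intercept) of the least-squares regression line."""
    slope = correlation(x, y) * pstdev(y) / pstdev(x)
    intercept = mean(y) - slope * mean(x)
    return slope, intercept

def residuals(x, y):
    """Observed y minus the y predicted by the regression line."""
    slope, intercept = regression_line(x, y)
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

# Made-up values for illustration only; not from any project dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

slope, intercept = regression_line(x, y)
errors = residuals(x, y)
rmse = sqrt(mean(e ** 2 for e in errors))  # typical size of a prediction error

print(f"r = {correlation(x, y):.4f}")
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
print(f"RMSE = {rmse:.4f}")
```

Plotting `errors` against `x` gives the residual plot. Residuals scattered around zero with no visible pattern support a linear fit, while a curve or a fan shape suggests that a linear model is not appropriate.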

Conclusion

This section should be roughly 200 words.

  1. Briefly restate the results of your hypothesis test(s) and prediction procedures
  2. State what you conclude given the results of all of your analyses.
  3. State at least one potential limitation of your data or analysis. Are there potential sources of bias in your data?

Report Tips

  1. Write the report text in a collaborative text editor (such as Google Docs) to be able to work on components of the report with your partner(s).
  2. Communicating your analytical approach is a core component of how your final project is evaluated. Spelling errors, typos, and grammatical errors will be factored into your grade.
  3. A large part of communicating is being selective about which components of your analysis to report. Your final report shouldn't be a walkthrough of your entire process. Rather, you should home in on the charts and tables that are most important for communicating your results. While you should absolutely explore many paths in your analysis, be judicious about which ones you write about in your final report. If you were reading this report for the first time, what would you find useful for understanding the results? Are there any parts that are unnecessary or distract from the main point?