Page View Time Series Visualizer -- Bar Plot Assistance Needed

I’m about two-thirds of the way through the “Page View Time Series Visualizer” project, but I’m not sure why two of my bar plot tests are failing. I have been developing this locally on my laptop.

Here is my code thus far (omitting the box plot part because I haven’t done that yet):

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from pandas.plotting import register_matplotlib_converters
import numpy as np
register_matplotlib_converters()

# Import data (Make sure to parse dates. Consider setting index column to 'date'.)
df = pd.read_csv(
    filepath_or_buffer='fcc-forum-pageviews.csv',
    parse_dates=['date'],
    index_col='date'
)

# Clean data
# thanks to https://towardsdatascience.com/10-examples-that-will-make-you-use-pandas-query-function-more-often-a8fb3e9361cb
low_end = df.value.quantile(0.025)
high_end = df.value.quantile(0.975)
df = df.query(f"value > {low_end} and value < {high_end}")


def draw_line_plot():
    df_line = df.copy()
    fig = plt.figure(figsize=(15, 5))
    x_values = df_line.index.tolist()
    y_values = df_line['value'].tolist()
    plt.plot(x_values, y_values, 'r')  # 'r' for red line
    plt.title('Daily freeCodeCamp Forum Page Views 5/2016-12/2019')
    plt.xlabel('Date')
    plt.ylabel('Page Views')
    # plt.show()
    # # Save image and return fig (don't change this part)
    fig.savefig('line_plot.png')
    return fig


def draw_bar_plot():
    # Copy and modify data for monthly bar plot
    # This includes the cleaned data, which explains why some dates are missing
    df_bar = df.copy()
    df_bar = df_bar.reset_index()

    # adds the missing months, seen here:  https://stackoverflow.com/questions/43408621/add-a-row-at-top-in-pandas-dataframe
    new_rows = []
    new_rows.insert(0, {'date': pd.to_datetime(
        '2016-04-01 00:00:00'), 'value': 0})
    new_rows.insert(0, {'date': pd.to_datetime(
        '2016-03-01 00:00:00'), 'value': 0})
    new_rows.insert(0, {'date': pd.to_datetime(
        '2016-02-01 00:00:00'), 'value': 0})
    new_rows.insert(0, {'date': pd.to_datetime(
        '2016-01-01 00:00:00'), 'value': 0})

    df_bar = pd.concat([pd.DataFrame(new_rows), df_bar], ignore_index=True)

    # adds in year and month columns
    df_bar['year'] = df_bar['date'].dt.strftime('%Y')
    df_bar['month'] = df_bar['date'].dt.strftime('%m')

    df_bar = df_bar.groupby(['year', 'month'])['value'].mean()
    df_bar = df_bar.reset_index(drop=False)

    # https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
    df_bar['month_name'] = pd.to_datetime(
        df_bar['month'], format='%m').dt.month_name()

    print(df_bar)
    # https://stackoverflow.com/questions/51879686/pandas-only-recognizes-one-column-in-my-data-frame

    # this should be good for the most part
    bar_plot = sns.barplot(
        data=df_bar,
        x='year',
        y='value',
        hue='month_name',
        # https://www.codecademy.com/article/seaborn-design-ii
        palette=sns.color_palette("Paired", 12)
    )
    bar_plot.set(
        title='Monthly freeCodeCamp Forum Page Views 5/2016-12/2019',
        xlabel='Years',
        ylabel='Average Page Views',
    )
    plt.legend(
        title='Months'
    )
    fig = bar_plot.figure
    # plt.show()
    # # Draw bar plot

    # # Save image and return fig (don't change this part)
    fig.savefig('bar_plot.png')
    return fig

And here’s the output from the failing/erroring tests (again, omitting the box plot parts because I haven’t done that yet):

======================================================================
FAIL: test_bar_plot_legend_labels (test_module.BarPlotTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/mgermaine93/Desktop/CODE/fcc-code-challenges/data-analysis-with-python/page-view-time-series-visualizer/test_module.py", line 53, in test_bar_plot_legend_labels
    self.assertEqual(
AssertionError: Lists differ: ['Jan[111 chars]mber', 'January', 'February', 'March', 'April'[199 chars]ber'] != ['Jan[111 chars]mber']

First list contains 24 additional elements.
First extra element 12:
'January'

  ['January',
   'February',
   'March',
   'April',
   'May',
   'June',
   'July',
   'August',
   'September',
   'October',
   'November',
-  'December',
-  'January',
-  'February',
-  'March',
-  'April',
-  'May',
-  'June',
-  'July',
-  'August',
-  'September',
-  'October',
-  'November',
-  'December',
-  'January',
-  'February',
-  'March',
-  'April',
-  'May',
-  'June',
-  'July',
-  'August',
-  'September',
-  'October',
-  'November',
   'December'] : Expected bar plot legend labels to be months of the year.

======================================================================
FAIL: test_bar_plot_number_of_bars (test_module.BarPlotTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/mgermaine93/Desktop/CODE/fcc-code-challenges/data-analysis-with-python/page-view-time-series-visualizer/test_module.py", line 76, in test_bar_plot_number_of_bars
    self.assertEqual(actual, expected,
AssertionError: 193 != 49 : Expected a different number of bars in bar chart.

----------------------------------------------------------------------

The output of the final print(df_bar) line in the code is as follows:

    year month          value month_name
0   2016    01       0.000000    January
1   2016    02       0.000000   February
2   2016    03       0.000000      March
3   2016    04       0.000000      April
4   2016    05   19432.400000        May
5   2016    06   21875.105263       June
6   2016    07   24109.678571       July
7   2016    08   31049.193548     August
8   2016    09   41476.866667  September
9   2016    10   27398.322581    October
10  2016    11   40448.633333   November
11  2016    12   27832.419355   December
12  2017    01   32785.161290    January
13  2017    02   31113.071429   February
14  2017    03   29369.096774      March
15  2017    04   30878.733333      April
16  2017    05   34244.290323        May
17  2017    06   43577.500000       June
18  2017    07   65806.838710       July
19  2017    08   47712.451613     August
20  2017    09   47376.800000  September
21  2017    10   47438.709677    October
22  2017    11   57701.566667   November
23  2017    12   48420.580645   December
24  2018    01   58580.096774    January
25  2018    02   65679.000000   February
26  2018    03   62693.774194      March
27  2018    04   62350.833333      April
28  2018    05   56562.870968        May
29  2018    06   70117.000000       June
30  2018    07   63591.064516       July
31  2018    08   62831.612903     August
32  2018    09   65941.733333  September
33  2018    10  111378.142857    October
34  2018    11   78688.333333   November
35  2018    12   80047.483871   December
36  2019    01  102056.516129    January
37  2019    02  105968.357143   February
38  2019    03   91214.483871      March
39  2019    04   89368.433333      April
40  2019    05   91439.903226        May
41  2019    06   90435.642857       June
42  2019    07   97236.566667       July
43  2019    08  102717.310345     August
44  2019    09   97268.833333  September
45  2019    10  122802.272727    October
46  2019    11  143166.428571   November
47  2019    12  150733.500000   December

And when I do plt.show(), my bar plot looks like this:

[Image: bar_plot]

It’s clear to me that somehow the legend labels and the number of bars aren’t matching up, but I’m not entirely sure how that’s happening. Perhaps a second set of eyes will help?

Thank you in advance for any assistance!

That can’t be right: the error clearly states that your legend contains duplicate months…
Plus, I copy-pasted your code into Replit and it shows the wrong legend there as well.

However, I have trouble pinpointing what causes the error, so I can only give some advice.
First off, don’t add the missing months. You are adding incorrect data, and if someone uses your charts down the line, this could cause trouble. On top of that, imagine more months were missing: padding them in by hand would not scale, so there must be a better way.
And that better way is the “hue_order” argument of the plot, which lets you tell it to show the months in a specific order.
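Here is a minimal sketch of what that could look like. The `monthly` frame holds made-up values that start in May 2016, and the variable names are only illustrative; it is not a drop-in replacement for draw_bar_plot().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Made-up monthly averages starting in May 2016 (illustrative values only).
periods = pd.date_range("2016-05-01", "2017-12-01", freq="MS")
monthly = pd.DataFrame({
    "year": periods.year,
    "month_name": periods.month_name(),
    "value": range(100, 100 + len(periods)),
})

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(
    data=monthly,
    x="year",
    y="value",
    hue="month_name",
    hue_order=MONTHS,  # legend runs January to December although the data starts in May
    ax=ax,
)

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
```

With hue_order set, there is no need to pad the data with zero rows just to get the months into calendar order.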

Also, you shouldn’t reset the index at the start. The DataFrame already has a DatetimeIndex, which makes extracting months and years as easy as df.index.month.
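As a rough sketch of that idea, the year and month can be read straight off the index and used as group keys. The `views` frame below is made-up stand-in data, not the project's CSV, and the names are only illustrative.

```python
import pandas as pd

# Made-up stand-in for the cleaned page-view data (not the real CSV).
dates = pd.date_range("2016-05-09", "2016-07-31", freq="D")
views = pd.DataFrame({"value": range(len(dates))}, index=dates)
views.index.name = "date"

# Group by year and month taken directly from the DatetimeIndex;
# reset_index() is only used afterwards to turn the group keys into columns.
monthly = (
    views.groupby([views.index.year, views.index.month])["value"]
    .mean()
    .rename_axis(["year", "month"])
    .reset_index()
)

print(monthly)
```

The result has one row per (year, month) pair that actually occurs in the data, so months before May 2016 simply do not appear.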

Try those tips and see if they fix anything. Again, I cannot point to anything specific that is wrong, but something in your code results in a legend containing the months for every year, meaning 4 sets of all months instead of just one.
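One possible explanation, offered as an assumption because it was not checked against test_module.py: sns.barplot() draws on the current Axes when no ax is passed, so if draw_bar_plot() gets called several times in the same process, each call stacks more bars and legend entries onto the same Axes. The numbers in the test output are at least consistent with that (36 labels is 3 x 12, and 193 is 4 x 48 plus one extra rectangle). The sketch below shows the effect with plain matplotlib bars and hypothetical function names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt


def draw_on_current_axes():
    # Hypothetical helper: like sns.barplot() without ax=, this reuses
    # whatever Axes happens to be current.
    plt.bar(["a", "b", "c"], [1, 2, 3])
    return plt.gcf()


def draw_on_fresh_figure():
    # Hypothetical helper: a new figure per call keeps earlier bars out.
    fig, ax = plt.subplots()
    ax.bar(["a", "b", "c"], [1, 2, 3])
    return fig


plt.close("all")
draw_on_current_axes()
shared = draw_on_current_axes()
print(len(shared.axes[0].patches))  # 6: the second call added to the first

draw_on_fresh_figure()
fresh = draw_on_fresh_figure()
print(len(fresh.axes[0].patches))   # 3: each call starts clean
```

If that is what is happening here, creating the figure inside draw_bar_plot() with plt.subplots() and passing ax=ax to sns.barplot() would be one way to keep the calls separate.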

There are two distinct problem areas.

First, your data reading, cleaning, and processing are not exactly correct. I had to fix problems from the beginning (from reading the CSV onward) to get the date information correct. Print the DataFrame after each step to make sure that you are getting the correct results.

Second, sns.barplot() creates one bar plot. When you give it 48 months, it makes 48 bars with 48 legend entries. You need to use the bar version of sns.catplot() to make 4 separate one-year bar plots on one categorical plot with a common legend of 12 entries.
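A minimal sketch of the catplot() route, using made-up monthly values and a hypothetical helper name. In this sketch, x="year" with hue="month_name" gives one group of bars per year; catplot() is a figure-level function, so each call builds its own figure and returns a FacetGrid, and by default its legend is attached to the figure rather than to the Axes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Made-up monthly averages from May 2016 to December 2019 (illustrative only).
periods = pd.date_range("2016-05-01", "2019-12-01", freq="MS")
monthly = pd.DataFrame({
    "year": periods.year,
    "month_name": periods.month_name(),
    "value": range(100, 100 + len(periods)),
})


def draw_grouped_bars(data):
    # Hypothetical helper, not the project's draw_bar_plot().
    grid = sns.catplot(
        data=data,
        kind="bar",
        x="year",
        y="value",
        hue="month_name",
        hue_order=MONTHS,
    )
    grid.set_axis_labels("Years", "Average Page Views")
    return grid


first = draw_grouped_bars(monthly)
second = draw_grouped_bars(monthly)
print(first.figure is second.figure)  # False: each call gets its own figure
```

The figure to save and return would then be grid.figure.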