Data Analysis Project - Sea Level Predictor: tests enforce an over-complicated solution

Hello, here is the project in question:

Among the main criteria in the challenge is to plot two “lines of best fit” over a scatter plot. They are basically diagonal straight lines, with start and end points described in the project requirements. I won’t go into detail about how those points are calculated, but drawing each line can be as simple as providing the coordinates of its start and end points on the X and Y axes. However, the tests are not satisfied with that approach. They demand an array of coordinates, with a point for every year on the X axis, which draws just the same diagonal straight line.
The tests should check that the line is positioned correctly, spans the exact width, and crosses each year at the exact point, but they shouldn’t force the user into an unnecessarily complicated approach to get that result. Neither the requirements nor the project description says that this approach must be used.
Both approaches produce the same result; the only difference is how it is achieved.

I’m curious what the objection is. To me, this was the most straightforward data analysis project. The spec said “plot the points,” so I would expect a test for that. It even said to use scipy.stats.linregress() to calculate the regression for the lines of best fit. The tests check these and a few other things that have to be correct for these to be correct, so the tests seem appropriate.

There are other ways to draw a line of best fit, but when the spec says do it this way, you do it this way. It’s hard to test any line of best fit instead of a specific one. But if you’re not using a linear least squares regression, expect funny looks.

I think this is actually one of the more useful data analysis projects because it presents a use case that many people reach for in Excel: take a list of data, plot it, then draw a curve for analysis and prediction. It does a good job of translating that use case to Python, pandas, and friends.

I have no critique of the project as a whole; I enjoyed working on it. My only concern is how the tests correspond to the given requirements, or more precisely, the two tests I mentioned in my initial post.
I use the markdown README file to work out what the project requires me to do. Here is one of the requirements in question (I have no objection to the requirement itself, only to the test that handles it):

  • Use the linregress function from scipy.stats to get the slope and y-intercept of the line of best fit. Plot the line of best fit over the top of the scatter plot. Make the line go through the year 2050 to predict the sea level rise in 2050.

I did as instructed: I used the scipy linregress function to get the slope and intercept for the line of best fit. I plotted the line over the scatter plot by providing the coordinates of its start and end, which are calculated from the slope and intercept and the requirement to reach through the year 2050. It draws the exact line the project expects. The test, however, expects me to plot the line with an array of points, one for every year, which ultimately produces the same line. I fail to find where the project requires this unnecessarily complicated solution (other than the tests enforcing it).

I understand what you’re saying now, but I’m curious as to how you plotted the line without the arrays of points in the first place (there are ways to do it). This project uses the traditional way of graphing lines and lines of best fit: calculate the curve function (linregress), iterate over the inputs to get the outputs, then draw the points. matplotlib's plot() function expects iterables for its domain and range values, so even if you just gave it the endpoints, you would still have to get the first and last year, calculate the first and last sea level, and pass them as iterables. I’m sure there are other ways that could just use the first values and the slope.
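A minimal sketch of that traditional workflow, using made-up numbers in place of the project's CSV. The `years` and `levels` arrays are assumptions for illustration only, not the real data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import linregress

# Hypothetical data standing in for the project's year and sea level columns.
years = np.array([2000, 2001, 2002, 2003, 2004])
levels = np.array([7.0, 7.1, 7.3, 7.4, 7.6])

# 1. Calculate the line's function (slope and intercept).
fit = linregress(years, levels)

# 2. Evaluate it at every year from the first data point through 2050.
x_line = np.arange(years.min(), 2051)  # stop is exclusive, so 2051 includes 2050
y_line = fit.slope * x_line + fit.intercept

# 3. Draw the data and the computed points.
fig, ax = plt.subplots()
ax.scatter(years, levels)
ax.plot(x_line, y_line, "r")
```

Every year between the first data point and 2050 ends up stored in the plotted line, which is the data a test can read back.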

But, plotting all the points, just the endpoints, or first point and slope can be done in two lines of code (I think; I know the first one can). So it boils down to the project creator thinking traditionally and assuming that all the points would be available for testing in the plot. It does make some sense to test this way, so that if someone decided to fit some other function to the endpoints, the test would realize that plot is not a line (I know you could fit a function to all the correct data points too, but this is getting fantastic already…).

I was totally clueless about many aspects of the project. I’m not a native English speaker, and “line of best fit” is nothing I can recall from school. Data visualization is still a blurry subject to me, and I still have some trouble understanding how the various Python libraries in that curriculum operate. I googled how the slope and intercept are used to calculate line-of-best-fit values and plot them with the frameworks the project assumes, and I used the easiest and most understandable solution I could find. Obviously I’m not familiar with the traditional way of doing this.

Here is a snippet of how I plotted the first line, before I changed it to satisfy the tests:

    from scipy.stats import linregress

    # df stands for the DataFrame I use

    # using data from the two columns in conjunction with linregress()
    line_data = linregress(df.Year, df['CSIRO Adjusted Sea Level'])

    # extract slope and intercept values
    slope = line_data.slope
    intercept = line_data.intercept

    # min and max points for the x axis
    xmin = df.Year.min()
    xmax = 2050
    line_range = [xmin, xmax]

    # finally plotting the line, only providing x/y coords for start and end points
    ax.plot(line_range, [slope * x + intercept for x in line_range], 'y')

The line is a straight diagonal line. All it takes is providing the start and end points to the plot function. To match the tests, instead of using the two endpoints (xmin and xmax), I made an array from those same endpoints (range(xmin, xmax)) and fed it to the plot method. To me this seems like an unnecessary complication of the process.
From your explanation, I assume this is not how the project creator anticipated the linregress function would be used?
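To make the difference concrete, here is a small sketch with a made-up slope and intercept (assumed values, not the project's real fit). Both calls draw the same straight line, but the resulting line objects store different amounts of data, which is what a test reading `get_xdata()` would see. Note that `range()` and `np.arange()` exclude their stop value, so reaching 2050 needs a stop of 2051.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical fit results, for illustration only.
slope, intercept = 0.06, -119.0
xmin, xmax = 1880, 2050

fig, ax = plt.subplots()

# Two-point version: only the endpoints are stored in the line.
ends = [xmin, xmax]
two_point, = ax.plot(ends, [slope * x + intercept for x in ends], "y")

# Every-year version: one stored point per year, 2050 included.
all_years = np.arange(xmin, xmax + 1)
every_year, = ax.plot(all_years, slope * all_years + intercept, "r")

n_two = len(two_point.get_xdata())
n_all = len(every_year.get_xdata())
```

The two lines overlap exactly on screen; only the stored x and y arrays differ in length.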

Well, I think there’s no right or wrong here, just tradition. I’ve seen it done this way all the way back to some very ancient Fortran 77 code (as in, it had been transcribed from punch cards at some point) that would print tables of data like this project expects.

You could just do something like

    slope, intercept, r, p, stderr = linregress(x=dfr["year"], y=dfr["csiro"])
    # years: NumPy array or pandas Series of x values, so the arithmetic is element-wise
    ax.plot(years, slope * years + intercept, label="recent")

and this will do the endpoints and everything in between.

As you have realized, since this is a line, anything beyond two points (or a point and a slope) is surplus information, but it’s just the way I’ve always seen it done. According to Stack Overflow, matplotlib only added the axline() function, which takes two points or a point and a slope, in version 3.3.0.
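For reference, a point-slope call to `axline()` might look like the sketch below. It assumes matplotlib 3.3.0 or later and uses a made-up slope and intercept. One caveat: `axline()` draws an infinite line across the axes, so the view limits have to be set separately to frame the years of interest.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical fit results, for illustration only.
slope, intercept = 0.06, -119.0

# Any point on the line will do; here, the line's value at the first year.
x0 = 1880
y0 = slope * x0 + intercept

fig, ax = plt.subplots()
line = ax.axline((x0, y0), slope=slope, color="y")

# The line itself is unbounded, so frame the view explicitly.
ax.set_xlim(1880, 2050)
ax.set_ylim(y0, slope * 2050 + intercept)
```

Because the line is defined by a point and a slope rather than by stored per-year coordinates, a test that reads the plotted x and y arrays would probably not find the yearly values it expects.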

What would be interesting is if you could write new tests to check your way of plotting the line. You should be able to create a test to verify the endpoints and generate an equation of a line easily. But looking at matplotlib's Line2D documentation doesn’t indicate a clear path forward. I would assume that since there is no attribute representing the line’s equation, that it is generated on the fly as the line is drawn and interpolation is needed between pairs of coordinates. If that’s the case, then testing the internal points without having them explicitly plotted in the line would require digging down into comparing pixel coordinates in the image.

This really gets into the “how much should we test” area of development. A good article on this can be found here; even though it’s about F#, the ideas are generally applicable.

I’m a big fan of using the full range of available points to create the line of best fit. Sure, two points work fine when you are making a linear fit, but a linear fit is only one of a wide array of possible curve fitting techniques. You can try higher-order polynomial regression, logistic regression, or one of a myriad of other techniques. If you use the full range of available points, your code is far more flexible and it’s easier to experiment with those other fits.
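As a rough illustration of that flexibility, the sketch below uses `numpy.polyfit` and `numpy.polyval` on made-up data. Only the `degree` argument changes between a straight line and a quadratic; the x input (the full `x_line` array) stays the same. The helper name `fitted_curve` is hypothetical.

```python
import numpy as np

# Hypothetical data that happens to be exactly quadratic.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x ** 2

# Full range of x values to evaluate the fitted curve on.
x_line = np.linspace(x.min(), x.max(), 50)

def fitted_curve(degree):
    """Fit a polynomial of the given degree and evaluate it on x_line."""
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x_line)

linear_y = fitted_curve(1)     # straight line of best fit
quadratic_y = fitted_curve(2)  # curve of best fit, same plotting input
```

Passing `x_line` with either result to `ax.plot()` draws the fit. If only the two endpoints were plotted, the quadratic would show up as a straight segment.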

You overestimate my abilities :)
My experience with Python is limited to the few projects of Python for Everybody and the Data Analysis sections of FCC. I’m only able to work through the logic thanks to what FCC has taught me with JS so far. I can read between the lines of the Python tests thanks to the JS Quality Assurance certificate, but designing my own would require going through the Quality Assurance with Python course ^^.
Thanks for the code snippet you provided; it looks very clear and simple. I could probably rewrite my solution with a similar approach.

EDIT: actually, I can see how the test could work logically, although I wouldn’t be able to put it down in code.
It would use ax.get_lines()[lineIndex].get_xdata() and ax.get_lines()[lineIndex].get_ydata() to retrieve the arrays containing the X and Y coordinates. If it has the slope and intercept, the test can check that each X coordinate corresponds to its Y coordinate (which would be Xcoord*slope+intercept). It would need to get the length of xdata and ydata and check that all values satisfy the equation.
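A rough sketch of that check, written as a plain function rather than as part of the project's actual test suite. The function name `line_matches_fit` and the sample slope and intercept are hypothetical; `get_lines()`, `get_xdata()`, and `get_ydata()` are the matplotlib calls mentioned above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def line_matches_fit(ax, line_index, slope, intercept):
    """Return True if every stored point of the line satisfies y = slope*x + intercept."""
    line = ax.get_lines()[line_index]
    x = np.asarray(line.get_xdata(), dtype=float)
    y = np.asarray(line.get_ydata(), dtype=float)
    if len(x) < 2 or len(x) != len(y):
        return False
    return bool(np.allclose(y, slope * x + intercept))

# Hypothetical fit results, for illustration only.
slope, intercept = 0.06, -119.0
fig, ax = plt.subplots()

# Line 0: two endpoints only.
ax.plot([1880, 2050], [slope * 1880 + intercept, slope * 2050 + intercept])
# Line 1: one point per year.
years = np.arange(1880, 2051)
ax.plot(years, slope * years + intercept)
# Line 2: a line that does not follow the fit.
ax.plot([1880, 2050], [0.0, 1.0])
```

A check like this accepts both the two-point and the every-year version. It does not verify that the line starts at the first year and reaches 2050, so the first and last x values would need their own assertions.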

I wouldn’t really be in favor of that sort of a test. It encourages inflexible coding on the student’s part.

Wouldn’t it be the opposite, more flexible, since it would allow solutions that draw the line with two points? Currently it only allows solutions that draw the line with a set array of points.

No, it would encourage the user to write less flexible code that only works for plotting the result of linear curve fitting.

Making the test more flexible so that the learner can write less flexible code is not a very good idea, in my opinion.

OK, now I get what you mean by not being flexible.
The flaw in my logic was that I assumed the line of best fit would always be straight. While this is true for the project, it isn’t valid in all cases, and a line of best fit can take various shapes (it took me some googling to find additional examples).
