Newbie - Question about Correlation

Question I’m attempting to answer: Is there a strong relationship between prior arrests and the prediction of recidivism?

Table Data (n=~4,750) has (among many others) the following relevant columns:
Prior Arrests - Values range from 0 to 38
Recidivism Prediction - Values range from 1 (unlikely to go back to jail) to 10 (highly likely to go back to jail)
Recidivism Range - 1-3=Low, 4-6=Medium, and 7-10=High

Ok…my first instinct when looking at this question is to calculate the correlation between the Prior Arrests and Recidivism Prediction columns. I use the =CORREL() function in Excel and come up with a value of .42, at which point I deduce that there is NOT a STRONG correlation between the two.

However, the trainer uses a pivot table with Recidivism Range in the rows (Low, Med, High) and Average of Prior Arrests in the data field. This shows an obvious correlation between the two. Since I couldn’t run the correlation on text values, I then equated a 1 to Low, a 2 to Med, and a 3 to High in a separate column and then re-ran the correlation between this row and Prior Arrests and it showed a correlation of .40.

MY Question: After watching the video, I understand how the presenter came to his conclusion, but I am wondering WHY the =CORREL() function did not produce a higher value, given that there seems to be a strong relationship between the two. Am I thinking of the correlation function in the wrong way? How would I know when to use the correlation function and/or when to use the technique shown by the trainer?

2 Likes

I hope this will be relevant for someone.

I do not know why there is a Recidivism Range, as this just decreases the precision of your calculations. That is, all it does is maps three numbers down to one number of the same precision. Which, in a sense, does not count as mapping.

Your first use of the CORREL function is correct. However, you might find it more meaningful to, instead, use regression, in the form of Excel’s trend line. That is, plot your scatter plot, then use the trend line option in Excel to fit a polynomial to the data.

If you get Excel to show you the equation of this line, it will display it with an r value. The closer this is to 1, the more of a representation there is between your data, and the polynomial. However, if you require a high degree polynomial to get a high r value, then there is a weak correlation between your arrays.

Hope this helps.