Question about len() in problem

I am working on a Python problem for an online course. I need to extract floats from a .txt file and find their average; however, when I print the length of the floats it gives me 7, even though there are more than 7 floats. This is messing with my code and returning the wrong value. Would someone look at my code and tell me why it returns a length of 7? Where is my mistake?

I'm posting my code below, followed by all the floats that are printed when I print only the floats.

fname = input("Enter file name: ")
fh = open("mbox-short.txt")
for line in fh:
    if not line.startswith("X-DSPAM-Confidence:"):
        continue
    num = line.find(".")
    numstr = line[num-1:]
    sum = 0
    print(float(numstr))
    el = (float(numstr))
    if sum < el:
      sum+=el
    length = len(numstr)
    res = sum/length
print("Average spam confidence: ", res)

Output when printing the floats:

0.8475
0.6178
0.6961
0.7565
0.7626
0.7556
0.7002
0.7615
0.7601
0.7605
0.6959
0.7606
0.7559
0.7605
0.6932
0.7558
0.6526
0.6948
0.6528
0.7002
0.7554
0.6956
0.6959
0.7556
0.9846
0.8509
0.9907

Have you tried printing the string that's giving that length? Try printing it with some specific characters added at the start and end, to mark where it actually begins and ends. There might be some characters (like spaces or newline characters) that are not easy to notice.

I ran .split() inside the print function to get rid of the whitespace:

   print(numstr.split())

It returned this, and the length was still 7. I also ran rsplit(), with the same result. I don't understand why len() is returning something less than the actual number of floats.

['0.8475']
['0.6178']
['0.6961']
['0.7565']
['0.7626']
['0.7556']
['0.7002']
['0.7615']
['0.7601']
['0.7605']
['0.6959']
['0.7606']
['0.7559']
['0.7605']
['0.6932']
['0.7558']
['0.6526']
['0.6948']
['0.6528']
['0.7002']
['0.7554']
['0.6956']
['0.6959']
['0.7556']
['0.9846']
['0.8509']
['0.9907']

Can you post a copy of the text file contents?

The content of the file exceeded the FCC content limit. A link to the .txt file is below:
https://www.py4e.com/code3/mbox-short.txt?PHPSESSID=239cd7100f27f8ce0c0321fb5fa24666

That might still contain a newline character; split() would not turn it into a separate item:

>>> '0.8475\n'.split()
['0.8475']
>>> '0.8475'.split()
['0.8475']

Try printing it like this:

print('"{}"'.format(numstr))

There's one more way, print(repr(numstr)), which shows the newline character as a visible \n:

>>> print(repr('0.8475\n'))
'0.8475\n'

It appears you are still capturing the new line character.
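Here is a small sketch of where the 7 comes from, assuming each matching line in the file looks like the hypothetical sample line below (it is typed in by hand, not read from mbox-short.txt):

```python
# Hypothetical sample line, shaped like the matching lines in the file.
line = "X-DSPAM-Confidence: 0.8475\n"

num = line.find(".")
numstr = line[num - 1:]      # from the digit before the dot to the end of the line

print(repr(numstr))          # repr() makes the trailing newline visible
raw_length = len(numstr)                 # counts the newline as a character
stripped_length = len(numstr.strip())    # length without surrounding whitespace
print(raw_length, stripped_length)
```

Either way, len(numstr) counts the characters in one string; it says nothing about how many floats were found in the file.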

Honestly, I am not sure why you are thinking that calculating sum/length is going to get you the average of the floats. You need a count of how many floats you have. That count will be the divisor. Also, why are you only adding the float to sum if sum is less than the float?

I want to make sure I understand where I'm at and what I need to do from here.
WHERE I'M AT:
So far my code takes strings of floats (each with a trailing \n) from the .txt file (code below):

fname = input("Enter file name: ")
fh = open("mbox-short.txt")
for line in fh:
    if not line.startswith("X-DSPAM-Confidence:"):
        continue
    num = line.find(".")
    numstr = line[num-1:]

WHAT I NEED TO DO:
First, I need to find a way to iterate through the string floats being collected from the .txt file, eliminate the whitespace, and convert each one to a float and, when done with that, set up a counter to count the converted floats.

Second, I need to iterate through the converted floats and then add each one to the previous one, saving the total sum in a variable and dividing that by the counter for the floats.

I think this is a good plan. Where are the flaws?

This does not technically guarantee it is a float. You just found a line with a dot in it and took the character before the dot through the end of the line. If you really want to validate that numstr is a float, you can try to convert it to a float. If doing so does not cause an error, you have a float. Otherwise you will get an error, meaning you do not have a float.
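As a rough sketch of that check (parse_float is a hypothetical helper name, and the sample strings are made up for illustration):

```python
def parse_float(text):
    """Return text converted to a float, or None if the conversion fails."""
    try:
        return float(text)
    except ValueError:
        # float() raises ValueError for strings that are not valid numbers.
        return None

print(parse_float("0.8475\n"))       # float() tolerates surrounding whitespace
print(parse_float("example.com\n"))  # contains a dot, but is not a number
```

Catching ValueError specifically, rather than using a bare except, keeps unrelated errors from being silently swallowed.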

You already have that logic with your for loop.

If you follow the logic I first mentioned above, the extra whitespace will not cause a problem.

This is your first logic flaw. You need to think very carefully about when/where in your code you need to initialize your counter.

You are already iterating through the file of possible floats, so there is no need to iterate through anything a second time. If you keep a running sum of the floats while iterating over the file, and have properly initialized a counter that you increment whenever a float is found, then once the file has been completely iterated you can calculate the average by dividing the running sum of floats by your counter variable.
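One possible shape for that single pass, as a sketch only. The list of lines is a hypothetical stand-in for the file, and the names total and count are my own:

```python
# Hypothetical sample lines standing in for the contents of the file.
lines = [
    "From: someone@example.com\n",
    "X-DSPAM-Confidence: 0.8475\n",
    "X-DSPAM-Probability: 0.0000\n",
    "X-DSPAM-Confidence: 0.6178\n",
]

total = 0.0
count = 0

for line in lines:
    if not line.startswith("X-DSPAM-Confidence:"):
        continue
    value = float(line[line.find(".") - 1:])  # float() ignores the trailing newline
    total += value   # running sum of the floats found so far
    count += 1       # number of floats found so far

average = total / count
print("Average spam confidence:", average)
```

The divisor is count, the number of floats seen, not the length of any string.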

I will add one small note about my last post. I said you will need to test if numstr is a float. After thinking about that for a bit, I realize that since the line containing the number always has the same format, you can simply convert it without a worry of it not converting to a float.

I am testing my code to see whether I am accumulating the floats and the counter correctly.

#Assigning file name user inputs to fname
fname = input("Enter file name: ")
#assigning action of opening file to fh
fh = open(fname) 
#looping through fh
for line in fh:
#for every line strip off the white space
    line = line.rstrip()
    #initialize counter and sum for accum. tot sum of floats and number of float instances
    counter = 0
    sum = 0
    #Include exception to insure loop focuses on specific instance of float
    if not line.startswith("X-DSPAM-Confidence:"):
        continue
    #identify decimal point
    num = line.find("0.")
    try:
    #use try to test if value is float through attempting to convert to float
        numstr = float(line[num:])
    #conversion is successful add current sum    
        sum = sum + numstr
    #note occurrence of float by adding to counter
        counter = counter + 1
    #if try fails return top of loop and run again
    except:
        continue  
    print(counter)  
    print(sum)      

It prints this:

1
0.8475
1
0.6178
1
0.6961
1
0.7565
1
0.7626
1
0.7556
1
0.7002
1
0.7615
1
0.7601
1
0.7605
1
0.6959
1
0.7606
1
0.7559
1
0.7605
1
0.6932
1
0.7558
1
0.6526
1
0.6948
1
0.6528
1
0.7002
1
0.7554
1
0.6956
1
0.6959
1
0.7556
1
0.9846
1
0.8509
1
0.9907

I know this means my counter is in the wrong place, as well as my sum variable,
but I don't get it. I put them in try because if try is successful (and it is a float) they'll add one to counter.

This is incorrect. My logic is that if try completes successfully during an iteration of the loop, then one is added to counter, and the current converted value is added to sum and the result is assigned back to the variable sum.

Where am I wrong here?

You are resetting the values of counter and sum to 0 on every iteration of the loop. I already hinted at where to initialize such variables. You should initialize these variables only once.
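A stripped-down illustration of the difference, using a hypothetical list of three values instead of the file (reset_count and kept_count are made-up names):

```python
values = [0.8475, 0.6178, 0.6961]  # hypothetical sample values

# Counter reset at the top of every iteration: it never grows past 1.
for value in values:
    reset_count = 0
    reset_count += 1

# Counter created once, before the loop starts: it keeps accumulating.
kept_count = 0
for value in values:
    kept_count += 1

print(reset_count, kept_count)
```

The same reasoning applies to the sum variable.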

Okay! I got it. I can't initialize my variables sum/counter inside the loop, but they need to accumulate within the loop. Thanks for your help!

Glad you figured it out.

Here are a few ways to solve the problem:

# Approach 1: keep a running sum and a counter
floats_sum = 0
floats_count = 0
with open("mbox-short.txt") as fh:
    for line in fh:
        if line.startswith("X-DSPAM-Confidence:"):
            decimal_point_index = line.find(".")
            num = float(line[decimal_point_index - 1:])
            floats_sum += num
            floats_count += 1
avg = floats_sum / floats_count
print("Average spam confidence: ", avg)

# Approach 2: collect the floats in a list, then use sum() and len()
floats = []
with open("mbox-short.txt") as fh:
    for line in fh:
        if line.startswith("X-DSPAM-Confidence:"):
            line_sections = line.split(":")
            floats.append(float(line_sections[1]))
avg = sum(floats) / len(floats)
print("Average spam confidence: ", avg)

# Approach 3: pull the numbers out of the whole file with a regular expression
import re
with open("mbox-short.txt", "r") as textfile:
    text = textfile.read()
# Raw string so that \d and \. reach the regex engine unchanged
floats = re.findall(r"X-DSPAM-Confidence: (\d\.\d+)", text)
avg = sum(float(num) for num in floats) / len(floats)
print("Average spam confidence: ", avg)