Question about a .split() issue

I'm working on a Python problem asking me to extract the hour of the day for each of the email messages in a text file and count the number of times each hour appears.

This is my code so far.

name = input("Enter file:")
if len(name) < 1:
    name = "mbox-short.txt"
handle = open(name)
for line in handle:
    words = line.split()
    if line.startswith("From"):
        date = words[5:6]
      
        print(date)

It returns this:
['09:14:16']
[]
['18:10:48']
[]
['16:10:39']
[]
['15:46:24']
[]
['15:03:18']
[]
['14:50:18']
[]
['11:37:30']
[]
['11:35:08']
[]
['11:12:37']
[]
['11:11:52']
[]
['11:11:03']
[]
['11:10:22']
[]
['10:38:42']
[]
['10:17:43']
[]
['10:04:14']
[]
['09:05:31']
[]
['07:02:32']
[]
['06:08:27']
[]
['04:49:08']
[]
['04:33:44']
[]
['04:07:34']
[]
['19:51:21']
[]
['17:18:23']
[]
['17:07:00']
[]
['16:34:40']
[]
['16:29:07']
[]
['16:23:48']
[]

I need to obtain the number before the first colon, but I'm lost about where to begin. Strings are immutable. I've tried .rstrip() as well as running .split(":"). Neither of these is working. I know I have to use key/value pairs in some way because the data needs to go into a dict()…but I'm unsure how.

Any hints would be greatly appreciated. I'm attaching a link to the text file referenced in the code.

Thanks!

Which date? The one that is after Date: below the line that starts with Author:?

Are you aware that date is a list with one string element? You need to be targeting the element with split and not the list.
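
To make the list-versus-element point concrete, here is a minimal sketch. The sample lines are assumptions about what the lines in mbox-short.txt look like (the address is made up); only the time 09:14:16 is taken from the output above. The empty lists in that output are presumably header lines such as "From: ..." that have fewer than six words, since slicing past the end of a list gives [] rather than an error.

```python
# Hypothetical sample lines; the address is made up for illustration.
line = "From someone@example.com Sat Jan  5 09:14:16 2008\n"
header = "From: someone@example.com\n"

words = line.split()
as_slice = words[5:6]    # a slice always gives a list: ['09:14:16']
as_item = words[5]       # an index gives the string itself: '09:14:16'

# A list has no .split() method, but the string element does.
hour = as_item.split(":")[0]

# Slicing past the end of a short list gives an empty list, not an error.
short_slice = header.split()[5:6]

print(as_slice, as_item, hour, short_slice)
```

Indexing with words[5] instead of slicing with words[5:6] is what lets .split(":") be called on the time.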

How are the counts to be displayed?

I am working toward counting the dates. Right now, I'm just trying to get a successful return of all months appearing after the email and the element "From "

My code below returns a date

ls = list()
name = input("Enter file:")
if len(name) < 1:
    name = "mbox-short.txt"
handle = open(name)
for line in handle:
    if line.startswith("From"):
      date = line.find(":")
      if int((line[date-2:date])):
        ls.append(line[date-2:date])   
    print(ls)    

However, it is the same one:

['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']
['09']

It's getting stuck.

I assume this has to do with the way I'm trying to identify the integer in the string:

      if int((line[date-2:date])):
        ls.append(line[date-2:date])   

I feel like the code is too convoluted, but I'm not sure how else to identify this, and I want to make sure I can draw this element out before I move on to counting it and organizing it in a data structure (I imagine a dictionary with a key of the month and a value of the counter).

Am I correct about any of this?

I thought you said you were trying to extract the hour:

It is.

The reason you see ['09'] multiple times before the script errors out is that you have only added one element to the list over the 38 lines your script handles, and print(ls) runs for every line of the file, not just the lines that start with "From".
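
A small sketch of why the script then errors out, assuming the file also contains header lines of the form "From: address" (the sample lines and address below are made up). str.find(":") returns the position of the first colon only, so on a header line the two characters before it are letters, and int() raises an exception on them instead of returning something false.

```python
# Hypothetical sample lines; the address is made up for illustration.
envelope = "From someone@example.com Sat Jan  5 09:14:16 2008\n"
header = "From: someone@example.com\n"

pos = envelope.find(":")            # index of the first colon only
first = envelope[pos - 2:pos]       # two characters before it: '09'

pos2 = header.find(":")             # here the first colon follows "From"
second = header[pos2 - 2:pos2]      # two characters before it: 'om'

try:
    int(second)                     # int('om') raises; it does not return False
    failed = False
except ValueError:
    failed = True

print(first, second, failed)
```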

My algorithm for creating a dictionary of hours of the day with corresponding counts is as follows:

1 Open file
2 Create dictionary to hold counts of hours
3 For each line:
3.1 If the line starts with “From”:
3.1.1 Split the line into an array using white space as delimiter. If the length of the array is equal to 7:
3.1.1.1 The correct line that starts with “From” has been found, so split the 6th element of the line into an array containing the hours, minutes and seconds using : as delimiter.
3.1.1.2 Get the hour of the time by referencing the 1st element of the time array.
3.1.1.3 If the counts dictionary does not contain the hour as a key, then add it and set its value to 1, otherwise increment the existing hour key value by 1.
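
The steps above can be sketched roughly as follows. This is one possible reading, not the definitive implementation: step 1 (opening the file) is replaced by a list of made-up sample lines so the sketch runs on its own, and the function name count_hours is hypothetical.

```python
def count_hours(lines):
    """Count messages per hour, following the numbered steps above."""
    counts = {}                                    # step 2
    for line in lines:                             # step 3
        if line.startswith("From"):                # step 3.1
            words = line.split()                   # step 3.1.1
            if len(words) == 7:
                time_parts = words[5].split(":")   # step 3.1.1.1
                hour = time_parts[0]               # step 3.1.1.2
                if hour not in counts:             # step 3.1.1.3
                    counts[hour] = 1
                else:
                    counts[hour] += 1
    return counts


# Hypothetical sample data standing in for the opened file.
sample = [
    "From a@example.com Sat Jan  5 09:14:16 2008\n",
    "From: a@example.com\n",
    "From b@example.com Fri Jan  4 18:10:48 2008\n",
    "From c@example.com Fri Jan  4 09:05:31 2008\n",
]
result = count_hours(sample)
print(result)
```

With a real file, the open file handle can be passed in place of the sample list, since iterating over a file yields its lines.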

I'm working on assembling the list of strings that I will eventually iterate through, keeping a counter of their appearance, which I will eventually plug into a dictionary as a key/value.

That said, with this code,

# input text file into program create backstop in case line is too small
name = input("Enter file:")
if len(name) < 1:
    name = "mbox-short.txt"
#create dictionary and handle for input content
dic = dict()
handle = open(name)
#run for loop on line
for line in handle:
#if line starts with FROM, split elements at colon, separate off relevant parts of element
#in "relevant" variable, run for-in loop on "time" in order to eliminate errant "From" 
# element from "time", append to "lst"
    if line.startswith("From"):
        linespl = line.split(":")
        relevant = linespl[0].split( )
        time = relevant[-1]
        time = time.split(" ")
        print(time)

I am returning this:

['09']
['From']
['18']
['From']
['16']
['From']
['15']
['From']
['15']
['From']
['14']
['From']
['11']
['From']
['11']
['From']
['11']
['From']
['11']
['From']
['11']
['From']
['11']
['From']
['10']
['From']
['10']
['From']
['10']
['From']
['09']
['From']
['07']
['From']
['06']
['From']
['04']
['From']
['04']
['From']
['04']
['From']
['19']
['From']
['17']
['From']
['17']
['From']
['16']
['From']
['16']
['From']
['16']
['From']

So I added this to eliminate that pesky “From”…

        lst = list()
        for num in time:
            if int(time[num]):
                 lst.append(num)
        print(lst)

I intended this to squeeze out the “From” string: I assumed int() on it would be false, so it would not qualify for append() to lst.

I can tell by the error message

line 43, in <module>
    if int(time[num]):
TypeError: list indices must be integers or slices, not str

that my assumptions were not based in logic, but I don't understand why it doesn't work. I'm calling int() on a string within a list…what am I missing?
Is, from a broad standpoint, my logic reasonable?
If not, where am I off?

Thanks for your patience and help.

Where did you add that code? Please show your new full code.

Did you not attempt to implement the algorithm I gave you?
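
On the TypeError quoted above: a loop of the form for num in time hands back the elements themselves, not their positions, so time[num] tries to index the list with a string. Separately, int() on a non-numeric string raises ValueError rather than returning a false value. The sketch below shows both, then filters with str.isdigit(), which is a swapped-in technique and not something from this thread; the sample values are made up.

```python
time = ["From"]               # what the code produces for a "From:" header line

for num in time:
    element = num             # num is the string 'From', not an index
    try:
        time[num]             # indexing a list with a string
        message = ""
    except TypeError as exc:
        message = str(exc)

# One way to keep only numeric strings without calling int() on everything.
items = ["09", "From", "18"]  # hypothetical mixed values
digits_only = [item for item in items if item.isdigit()]

print(element, message, digits_only)
```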

I am posting what I'm working with so far. I am returning the correct key, but I am unsure about my ideas moving forward.
Can you tell me whether what I've done so far is going to lead me in the right direction and whether my ideas about moving forward are valid?

My code and my comments explaining my intentions are below:

# input text file into program create backstop in case line is too small
name = input("Enter file:")
if len(name) < 1:
    name = "mbox-short.txt"
#create dictionary and handle for input content
dic = dict()
handle = open(name)
#run for loop on line
for line in handle:
#if line starts with FROM, split elements at colon, separate off relevant parts of element
#in "relevant" variable, run for-in loop on "time" in order to eliminate errant "From" 
# element from "time", append to "lst"
    if line.startswith("From "):
        linespl = line.split(":")
        relevant = linespl[0].split( )
        time = relevant[-1]
        #print(time)
        lst = list()
        print(lst)
        ##this is problematic, variable "time" is primed, 
        # access and count it in "dic" data type
        for word in dic:
            if dic[word] == time[word]:
                dic[word] = dic[word]+1
            else: 
                dic[word] = 1
# If the counts dictionary does not contain the hour as a key, then add it 
# and set its value to 1, otherwise increment the existing hour key value by 1.

I do not understand the purpose of this code. Each time you capture a new time you create a new empty list? Why?

This for loop will never iterate over anything because the dictionary is empty.
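
A minimal sketch of that point: iterating over an empty dictionary runs the loop body zero times, so nothing inside it can ever add the first key. A key is added by assigning to it directly. The hour value here is made up.

```python
counts = {}

iterations = 0
for word in counts:       # an empty dict yields no keys
    iterations += 1       # never reached

hour = "09"               # hypothetical hour taken from a line
counts[hour] = 1          # assigning to a new key is what adds it

print(iterations, counts)
```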

I'm stuck on 3.1.1 of your algorithm.
I applied your suggestions, but I get absolutely no output when I run the code and I'm completely lost. I don't understand.
My code is below. What am I doing wrong?

# input text file into program create backstop in case line is too small
name = input("Enter file:")
if len(name) < 1:
    name = "mbox-short.txt"
#create dictionary and handle for input content
lst = list()
dic = dict()

fhand = open(name)
for line in fhand:
    if line.startswith("From "):
        linespl = line.split(" ")
        if len(linespl) == 7:
            date = linespl[5].split(":")
            print(date)
            time = date[0]
            print(time)

Try using line.split()

Your version would only split on a single space character.
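
A sketch of the difference, assuming the lines in the file pad single-digit days with an extra space (the sample line below is made up to show that). With split(" ") the double space produces an empty string, so the list has 8 items, the length check against 7 fails, and nothing is printed.

```python
# Hypothetical sample line; note the two spaces before the day number.
line = "From someone@example.com Sat Jan  5 09:14:16 2008\n"

by_space = line.split(" ")    # splits at every single space
by_default = line.split()     # splits on runs of whitespace, drops the newline

print(len(by_space), by_space)
print(len(by_default), by_default)
```

With the default split(), the time lands at index 5 and the trailing newline is gone.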

On a separate note, it is important to properly name your variables, so it is clear what value they represent.
Instead of:

date = linespl[5].split(":")

I would write:

time_array = linespl[5].split(":")

and then instead of:

time = date[0]

I would write:

hour = time_array[0]

This is what I've got now.

# input text file into program create backstop in case line is too small
name = input("Enter file:")
if len(name) < 1:
    name = "mbox-short.txt"
#create dictionary and handle for input content
lst = list()
counts = dict()
fhand = open(name)
for line in fhand:
    if line.startswith("From "):
        email_info = line.split()
        #print(email_info)
    if len(email_info) == 7:
        time_arr = email_info[5].split(":")
        print(time_arr)
        hours = time_arr[0]
        #print(hours)
        lst_hours.append(hours)

Now I return the hours taken out of time_arr
I append() the hours to lst_hours

NOW, here is what I think I need to do

With lst_hours, I can run a for-in loop to count the frequency of each hour in the dictionary, using the .get() method to add the hour if it doesn't already exist OR add one to the existing value

Do I have it right?

It is not hours, it is just hour (one hour).

Also, there is no need to use a list at this point. You have the dictionary (counts), so you just need to implement the last part of my algorithm:

3.1.1.3 If the counts dictionary does not contain the hour as a key, then add it and set its value to 1, otherwise increment the existing hour key value by 1.
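
For that last step, dict.get with a default is one compact way to write it. A minimal sketch with a made-up hour: get returns the default when the key is missing (without adding the key), and the assignment then creates or updates the entry.

```python
counts = {}

before = counts.get("09", 0)             # key missing, so the default 0 comes back
still_empty = len(counts) == 0           # get() itself did not add the key

counts["09"] = counts.get("09", 0) + 1   # first sighting: 0 + 1
counts["09"] = counts.get("09", 0) + 1   # second sighting: 1 + 1

print(before, still_empty, counts)
```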

I solved the issue and I have some questions. First, I´ll post my code below:

# input text file into program create backstop in case line is too small
name = input("Enter file:")
if len(name) < 1:
    name = "mbox-short.txt"
#open name,create dictionary and handle for input content

fhand = open(name)
lst = list()
counts = dict()

#search each line of handle for particular phrase
#split() this phrase when found
for line in fhand:
    if line.startswith("From "):
        email_info = line.split()

        #print(email_info)

#separate off the time component of phrase
        time_arr = email_info[5].split(":")
        
        #print(time_arr)

        hours = time_arr[0]
#iterate through dict() and add one if it appears
        counts[hours] = counts.get(hours, 0)+ 1
#loop through dict() and append k and v to list()
for key,val in counts.items():
    lst.append((key,val))
#Sort list 
sorted(lst)

#print(lst)

#Run loop in key values in list and print them out
for key,val in lst:
    print(key,val)
        

You had mentioned that, after my hours variable, I didn't need to use a list() and could just test for the existence of the hour in the counts dictionary: if it exists, add one, and if not, add it and give it a value.

I could not figure out how to do it without list().

The problem I ran into was that Python did not allow me to add things to a dictionary element. This was possible with a list()

I imagine that this is a result of ignorance. Therefore, would you mind demonstrating a way of completing my project without using list()?

Thanks so much!

Good job solving it.

Below is what I meant by not using an extra list:

name = input("Enter file:")
if len(name) < 1:
    name = "mbox-short.txt"
file = open(name)
counts = {}
for line in file:
    if line.startswith("From"):
        words = line.split()
        if len(words) == 7:
            time = words[5]
            time_list = time.split(':')
            hour = time_list[0]
            counts[hour] = counts.get(hour, 0) + 1
for hour, count in counts.items():
    print(hour, count)

The only reason to use a list would be to sort the results in a specific way.

You are using sorted but it is not doing anything for you. How were you wanting to sort the list before printing? If you want to sort by hour, then you could do:

name = input("Enter file:")
if len(name) < 1:
    name = "mbox-short.txt"
file = open(name)
counts = {}
for line in file:
    if line.startswith("From"):
        words = line.split()
        if len(words) == 7:
            time = words[5]
            time_list = time.split(':')
            hour = time_list[0]
            counts[hour] = counts.get(hour, 0) + 1
counts_sorted = sorted(counts.items(), key=lambda item: int(item[0]))
for hour, count in counts_sorted:
    print(hour, count)
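
On why the bare sorted(lst) call was not doing anything: sorted() returns a new list and leaves its argument untouched, so the result has to be assigned or used. list.sort() is the in-place alternative. A short sketch with made-up (hour, count) pairs:

```python
pairs = [("18", 1), ("09", 2), ("16", 3)]   # hypothetical (hour, count) pairs

sorted(pairs)                  # result is discarded; pairs is unchanged
unchanged = list(pairs)

ordered = sorted(pairs)        # keep the returned list instead
pairs.sort()                   # or sort the existing list in place

print(unchanged)
print(ordered)
print(pairs)
```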