Combining separate components into a program

I wrote components for a web scraper (it started simple, but the website is a nasty one to repeatedly get data from). All components work fine separately, but I am looking to combine them into one program. This should be a fundamental thing to do, but I am overlooking something or making it unnecessarily difficult. Can someone please help me?

The components are:

  • url_read() - reads a page number from a file ‘page.txt’ and creates a URL to be used in driver.get(URL, headers, proxy)

  • get_proxies() - returns a set of proxies to be used in driver.get(URL, headers, proxy)

  • create_header() - returns a header with changing user agents to be used in driver.get(URL, headers, proxy)

  • scrape() - starts with driver.get(URL=<result from url_read>, headers=<result from create_header>, proxy=<result from get_proxies>), scrapes the data from the site and writes it to a df. It then checks whether there are more pages and writes the next page number to the file ‘page.txt’, so that it survives a crash or a kick by the server and can be used in the next run to scrape the next page (each iteration overwrites the stored page number with the next one)
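
The checkpoint step in the last bullet could look roughly like this. It is only a minimal sketch: `save_page` is a hypothetical helper name, and the only detail taken from the description is the file name ‘page.txt’.

```python
from pathlib import Path


def save_page(page_number, path="page.txt"):
    """Overwrite the checkpoint file with the page to scrape next."""
    # write_text replaces any previous content, so the file always holds one number
    Path(path).write_text(str(page_number))
```

After scraping page 3 and finding a further page, scrape() would call `save_page(4)`, so a restart after a crash picks up at page 4.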

I tried defining functions for all of these, but I cannot get the results from the first 3 functions into driver.get() (read: I don’t know how to, and cannot find an understandable explanation :flushed:). I thought I needed to return the result, but I cannot get the result out of the function and pass it as an argument to a call inside another function.
Once that is solved, the next problem presents itself:
How should I repeat this sequence - can I call the function scrape() again, for example:

while page_number >1: 
    scrape()

But then: how do I include the other functions here, so that the input is generated correctly? Should I define a separate function for driver.get() and call this before each scrape?

I am really stuck and hope someone can coach me (it might take a few more questions to me and explanations from me…).
Thanks a lot!

Do you have any draft code already?

Some additional questions:

  • What does it mean that the components (functions?) are working fine separately? Does manually taking their results and running driver.get work as expected, without issues?
  • What’s stopping you from using the results of those three functions in another function? Just from the description this sounds like a straightforward thing to do.
  • Are there any errors (exceptions raised)?

Thanks @sanity for your swift reply, I will try to explain myself better:

I have the complete code, with all components/pieces/functions ready. I wanted to be sure they are all doing what they should do.

I wrote each component mentioned as a little program on its own, with the intention of using it as a function. So, for example, url_read() checks if a file exists, looks into that file for a page number and creates a URL (for example ‘http://forum.freecodecamp.org/#p=2’).
Manually taking this result and putting it into driver.get() works, and I can scrape the page.
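
For illustration, a url_read() along those lines might look like the sketch below. The actual function is not shown in this thread, so several details are assumptions: the `path` and `default_page` parameters, the fallback to page 1 when the file is missing, and the file holding nothing but a single integer. `BASE_URL` simply reuses the example URL above.

```python
from pathlib import Path

BASE_URL = "http://forum.freecodecamp.org/#p="  # example URL pattern from the post


def url_read(path="page.txt", default_page=1):
    """Return the URL for the page number stored in `path`."""
    file = Path(path)
    if file.exists():
        # assumption: the file contains just one integer
        page_number = int(file.read_text().strip())
    else:
        # assumption: start at the first page when there is no file yet
        page_number = default_page
    return f"{BASE_URL}{page_number}"
```

Because the function returns the URL, the caller can keep it in a variable, for example `url = url_read()`, and hand that variable to whatever needs it.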

This is exactly my problem: it should be straightforward but I cannot think of a way the returned URL from the function can be used in driver.get(), which lives in another function scrape() (and so on for the results of the other functions)…

No exceptions at all

The overall process could look like the following: a new function first prepares the data by calling url_read, get_proxies and create_header. This assumes those functions return what is required. Then it calls scrape, passing in those values as arguments (the scrape function should be modified to accept the needed parameters), and scrape handles the rest.

Another option might be to do the preparation step inside the scrape function itself.

So like:

def url_read():
    *check file, build url*  
    return url

def create_header():
    *build header*
    return header

def get_proxies():
    *look for proxies*
    return proxy

def scrape(url, header, proxy):
    driver.get(URL = url, headers = header, proxies = proxy)
    *scrape and add to df*
    *check page number and write to file*

# define a function that calls the preparation functions and passes their results to the scrape function
def mother_function():
    url = url_read() 
    header = create_header()
    proxy = get_proxies()
    scrape(url,  header,  proxy)

# call mother_function
mother_function()

I tried this with a few simple functions and print statements in the scrape function, and it seems to work :slight_smile:
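
A runnable version of that experiment could look like this. It is a sketch with stub bodies only: every returned value is a made-up placeholder, and scrape() prints its inputs instead of calling driver.get, so the flow of return values into arguments can be checked without a browser or a network connection.

```python
def url_read():
    return "http://forum.freecodecamp.org/#p=2"  # placeholder value


def create_header():
    return {"User-Agent": "example-agent/1.0"}  # placeholder value


def get_proxies():
    return {"http": "http://127.0.0.1:8080"}  # placeholder value


def scrape(url, header, proxy):
    # A real scrape() would call driver.get here; this stub only reports its inputs.
    print("url:", url)
    print("header:", header)
    print("proxy:", proxy)
    return url, header, proxy


def mother_function():
    # Each return value is stored in a local name and then passed on as an argument.
    url = url_read()
    header = create_header()
    proxy = get_proxies()
    return scrape(url, header, proxy)


mother_function()
```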

Now this process needs to be repeated for the next page. Since a page number is written to a file in the scrape function (it overwrites the previous page number), is it smart to call mother_function from the scrape function? For example:

def scrape(url, header, proxy):
    driver.get(URL = url, headers = header, proxies = proxy)
    *scrape and add to df*
    *check page number and write to file*
    if page_number > 1:
        mother_function()

It feels odd calling the mother_function from a function which was called by the mother_function, creating a loop.

Although it might look strange it’d probably work. Even if that’s not perfect it’s a good step and it’s better than something not working.

After some tinkering I’d try to put a while loop in mother_function, ending the loop based either on the value returned by the scrape function or on an additional function that just checks whether there’s a number in the file.
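
The while-loop idea could be sketched as follows. To keep the example runnable on its own, it makes some simplifying assumptions that are not in the thread: the page number is passed around directly instead of going through ‘page.txt’, the pretend site has `LAST_PAGE` pages, and scrape() returns the next page number, or None when there is nothing left.

```python
LAST_PAGE = 3  # assumption: the pretend site has three pages


def url_read(page_number):
    return f"http://forum.freecodecamp.org/#p={page_number}"


def scrape(url, page_number):
    """Pretend to scrape `url`; return the next page number, or None when done."""
    print("scraping", url)
    if page_number < LAST_PAGE:
        return page_number + 1
    return None


def mother_function(start_page=1):
    """Scrape page after page in a loop instead of through recursive calls."""
    scraped = []
    page_number = start_page
    while page_number is not None:
        url = url_read(page_number)
        scraped.append(page_number)
        page_number = scrape(url, page_number)
    return scraped
```

With the loop living in mother_function, scrape() never has to call its own caller, so the call stack stays flat no matter how many pages there are.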

Thanks a lot, it does work indeed, ending in an infinite loop and stopping after reaching the maximum recursion depth :slight_smile:
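
That behaviour can be reproduced with two stub functions that keep calling each other. This is a sketch, not the real scraper: `run_until_limit` is a hypothetical helper, and the stubs do nothing except recurse. Python raises RecursionError once the call stack is deeper than the limit reported by `sys.getrecursionlimit()`.

```python
import sys


def scrape(page):
    mother(page + 1)  # no stop condition: always asks for another round


def mother(page=1):
    scrape(page)


def run_until_limit():
    """Return True if the mutual recursion ends in RecursionError."""
    try:
        mother()
    except RecursionError:
        return True
    return False


print("recursion limit:", sys.getrecursionlimit())
print("hit the limit:", run_until_limit())
```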

Since the url_read function is reading the page number and using this to build the URL, I had this function return the page number too:

def url_read():
   *check file, build url*
   return url, page_number

I also have the scrape() function check the page number and call mother_function() again. The scrape function is the last step in the program, so it felt right to do the check here.
To prevent an infinite loop, I added a stop for when the page number reaches a multiple of 20 (I will play with this to find a balance between the maximum number of repeats, the time it takes and how much the server accepts within a time frame):

def scrape(url, header, proxy, page):
    driver.get(URL = url, headers = header, proxies = proxy)
    *scrape and add to df*
    *check page number and write to file*
    # call mother() again, repeating all functions, until the page number is a multiple of 20; stop once that is reached
    if page % 20 != 0:
        mother()

The mother function is now:

def mother():
    url, page = url_read()
    header = create_header()
    proxy = get_proxies()
    scrape(url, header, proxy, page)

(The return value of url_read() is a tuple. It was fun to see how many different ways there are to split and use the variables; I chose this one since it does not need to call url_read() multiple times.)
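
For comparison, here are three of those ways side by side. It is a small sketch in which url_read() is a stub returning a fixed page number rather than reading ‘page.txt’, and the names ending in `_2` and `_3` exist only to keep the options apart.

```python
def url_read():
    page_number = 2  # placeholder instead of reading 'page.txt'
    url = f"http://forum.freecodecamp.org/#p={page_number}"
    return url, page_number  # the comma packs both values into one tuple


# Option 1: unpack both values from a single call (the approach chosen above)
url, page = url_read()

# Option 2: keep the tuple and index into it
result = url_read()
url_2, page_2 = result[0], result[1]

# Option 3: call the function once per value; this works, but with the real
# url_read() it would read the file twice
url_3 = url_read()[0]
page_3 = url_read()[1]
```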

Thank you for the lesson using functions, this is very valuable to me! I will play around with it more and no doubt come up with more in the future :wink:
