Master's program scholarship scraper

So I want to create a scraper that crawls through websites and extracts data on master's scholarships matching custom parameters, then outputs an .xlsx or .csv file. Any recommendations, or experience with this type of data extraction?

I tried running this with `scrapy runspider`:

full_masters_scholarship_spider.py

import scrapy
import pandas as pd
from urllib.parse import urlparse

class FullMastersScholarshipSpider(scrapy.Spider):
    name = 'full_masters_scholarship_spider'
    start_urls = ['https://www.google.com']

    def parse(self, response):
        # Search for relevant keywords on Google
        search_query = "full free master's program scholarship stipend accommodation english"
        yield scrapy.Request(f'https://www.google.com/search?q={search_query}', callback=self.parse_search_results)

    def parse_search_results(self, response):
        # Find all the search result links
        links = response.css('div.g > div > a::attr(href)').getall()

        # Filter the links to only include scholarship websites
        scholarship_links = [link for link in links if 'scholarship' in link.lower()]

        # Extract the website name and scholarship details for each link
        data = []
        for link in scholarship_links:
            website_name = urlparse(link).netloc.split('.')[0].capitalize()
            # Scrape the scholarship details from the website
            scholarship_details = self.scrape_scholarship_details(link)
            data.append({
                'Website': website_name,
                'Website Link': link,
                'Scholarship Details': scholarship_details
            })

            # Stop scraping if we have at least 100 results
            if len(data) >= 100:
                break

        # Create a DataFrame and save it to an Excel file
        df = pd.DataFrame(data)
        df.to_excel('full_masters_scholarships.xlsx', index=False)

    def scrape_scholarship_details(self, url):
        # This method should contain the logic to scrape the scholarship details from the website
        # It should return a string containing the relevant scholarship information
        return "Full tuition fee waiver, monthly stipend of $1,500, and free on-campus accommodation."

In return I'm getting a blank Excel file, along with this in the terminal:

PS D:\> scrapy runspider full_masters_scholarship_spider.py
2024-04-12 17:35:54 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: scrapybot)
2024-04-12 17:35:54 [scrapy.utils.log] INFO: Versions: lxml 5.2.1.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.1.2, Twisted 24.3.0, Python 3.12.3 (tags/v3.12.3:f6650f9, Apr  9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.5, Platform Windows-10-10.0.19045-SP0
2024-04-12 17:35:56 [scrapy.addons] INFO: Enabled addons:
[]
2024-04-12 17:35:56 [py.warnings] WARNING: C:\Users\Mehdi\AppData\Local\Programs\Python\Python312\Lib\site-packages\scrapy\utils\request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2024-04-12 17:35:56 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-04-12 17:35:57 [scrapy.extensions.telnet] INFO: Telnet Password: 8f49bb6f1d0ce44d
2024-04-12 17:35:57 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2024-04-12 17:35:57 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2024-04-12 17:35:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-04-12 17:35:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-04-12 17:35:57 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-04-12 17:35:57 [scrapy.core.engine] INFO: Spider opened
2024-04-12 17:35:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-04-12 17:35:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-04-12 17:35:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.com> (referer: None)
2024-04-12 17:35:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.com/search?q=full%20free%20master%27s%20program%20scholarship%20stipend%20accommodation%20english> (referer: https://www.google.com)
2024-04-12 17:36:00 [scrapy.core.engine] INFO: Closing spider (finished)
2024-04-12 17:36:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 809,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 35170,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 2.788749,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 4, 12, 11, 36, 0, 435029, tzinfo=datetime.timezone.utc),
 'httpcompression/response_bytes': 121664,
 'httpcompression/response_count': 2,
 'log_count/DEBUG': 3,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2024, 4, 12, 11, 35, 57, 646280, tzinfo=datetime.timezone.utc)}
2024-04-12 17:36:00 [scrapy.core.engine] INFO: Spider closed (finished)
PS D:\>

Did you log out what you get back from the request?

I haven’t tried scraping a Google search, but you might have to pre-render it.

https://docs.scrapy.org/en/latest/topics/dynamic-content.html#topics-javascript-rendering

So Copilot suggested that I narrow the scraping down to a fixed list of websites:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# List of websites to scrape
websites = [
    ("Scholarships360", "https://www.scholarships360.org"),
    ("College Board", "https://www.collegeboard.org"),
    ("Fastweb", "https://www.fastweb.com"),
    ("Scholly", "https://myscholly.com"),
    ("ScholarshipOwl", "https://scholarshipowl.com"),
    ("Cappex", "https://www.cappex.com"),
    ("Going Merry", "https://www.goingmerry.com"),
    ("Jlv College Counseling", "https://jlvcollegecounseling.com"),
    ("Niche", "https://www.niche.com"),
    ("Chegg", "https://www.chegg.com"),
]

# Prepare a list to store the website links and names
results = []

# Loop through the websites
for name, url in websites:
    # Send a GET request to the website
    response = requests.get(url)

    # Parse the website's content using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all 'a' tags (which define hyperlinks) in the parsed content
    a_tags = soup.find_all('a')

    # Loop through the 'a' tags
    for tag in a_tags:
        # Get the link (href attribute) and name (string content) of the website
        link = tag.get('href')
        link_name = tag.string

        # Append the link and name to the list
        results.append([name, url, link, link_name])

# Convert the list to a pandas DataFrame
df = pd.DataFrame(results, columns=['Website Name', 'Website URL', 'Link', 'Link Name'])

# Write the DataFrame to an Excel file
df.to_excel('websites.xlsx', index=False)

And this time I'm getting an Excel file with something in it, at least!

In the terminal I'm getting this:

PS D:\> scrapy runspider full_masters_scholarship_spider.py
2024-04-13 18:34:31 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: scrapybot)
2024-04-13 18:34:31 [scrapy.utils.log] INFO: Versions: lxml 5.2.1.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.1.2, Twisted 24.3.0, Python 3.12.3 (tags/v3.12.3:f6650f9, Apr  9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.5, Platform Windows-10-10.0.19045-SP0
2024-04-13 18:34:32 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.scholarships360.org:443
2024-04-13 18:34:32 [urllib3.connectionpool] DEBUG: https://www.scholarships360.org:443 "GET / HTTP/1.1" 301 None
2024-04-13 18:34:32 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): scholarships360.org:443
2024-04-13 18:34:33 [urllib3.connectionpool] DEBUG: https://scholarships360.org:443 "GET / HTTP/1.1" 200 None
2024-04-13 18:34:33 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.collegeboard.org:443
2024-04-13 18:34:34 [urllib3.connectionpool] DEBUG: https://www.collegeboard.org:443 "GET / HTTP/1.1" 200 8440
2024-04-13 18:34:34 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.fastweb.com:443
2024-04-13 18:34:36 [urllib3.connectionpool] DEBUG: https://www.fastweb.com:443 "GET / HTTP/1.1" 200 16531
2024-04-13 18:34:36 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): myscholly.com:443
2024-04-13 18:34:38 [urllib3.connectionpool] DEBUG: https://myscholly.com:443 "GET / HTTP/1.1" 200 None
2024-04-13 18:34:38 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): scholarshipowl.com:443
2024-04-13 18:34:40 [urllib3.connectionpool] DEBUG: https://scholarshipowl.com:443 "GET / HTTP/1.1" 200 None
2024-04-13 18:34:40 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.cappex.com:443
2024-04-13 18:34:41 [urllib3.connectionpool] DEBUG: https://www.cappex.com:443 "GET / HTTP/1.1" 301 None
2024-04-13 18:34:41 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.appily.com:443
2024-04-13 18:34:41 [urllib3.connectionpool] DEBUG: https://www.appily.com:443 "GET / HTTP/1.1" 200 None
2024-04-13 18:34:41 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.goingmerry.com:443
2024-04-13 18:34:43 [urllib3.connectionpool] DEBUG: https://www.goingmerry.com:443 "GET / HTTP/1.1" 200 38645
2024-04-13 18:34:43 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): jlvcollegecounseling.com:443
2024-04-13 18:34:43 [urllib3.connectionpool] DEBUG: https://jlvcollegecounseling.com:443 "GET / HTTP/1.1" 200 None
2024-04-13 18:34:43 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.niche.com:443
2024-04-13 18:34:44 [urllib3.connectionpool] DEBUG: https://www.niche.com:443 "GET / HTTP/1.1" 403 4277
2024-04-13 18:34:44 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.chegg.com:443
2024-04-13 18:34:45 [urllib3.connectionpool] DEBUG: https://www.chegg.com:443 "GET / HTTP/1.1" 403 4679
Usage
=====
  scrapy runspider [options] <spider_file>
runspider: error: No spider found in file: full_masters_scholarship_spider.py

PS D:\>

Now how do I modify the spider further to get the results I want?

Well that is completely different. Before, you were using Google and had a search parameter. Now you are just scraping a list of sites.

I don’t know what it is you are trying to scrape from the sites. Right now it seems you are just getting the links from the landing page.
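If you stick with that requests/BeautifulSoup version, you could at least filter the hrefs and resolve the relative ones before saving, so the spreadsheet isn't full of nav-bar links. A rough sketch (the keyword list is just an example, not anything definitive):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Example filter terms -- adjust to whatever criteria you care about.
KEYWORDS = ('scholarship', 'fellowship', 'funding')


def extract_scholarship_links(base_url, html):
    """Return (absolute_url, link_text) pairs whose URL or text mentions a keyword."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for tag in soup.find_all('a', href=True):
        href = urljoin(base_url, tag['href'])   # resolve relative URLs like /page
        text = tag.get_text(strip=True)
        haystack = (href + ' ' + text).lower()
        if any(kw in haystack for kw in KEYWORDS):
            results.append((href, text))
    return results
```

That still only gets you candidate links from whatever page you fetched, though; it doesn't find the actual award details.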

Yes, it's a completely different approach, and this way I'm only getting website URLs and subdirectories. What I want is multi-step crawling: use a search engine to find master's scholarship websites, then crawl at least 1,000 of them and find the scholarships that provide full funding, accommodation, and a stipend. The reason for crawling the whole of Google, or at least thousands of websites, is that the criteria I'm setting are very rare; if you search the entire scholarship database of a single website, there will maybe be 2 or 3 matches. I searched manually like this 6 or 7 years ago and it was very draining, not to mention the eye strain and the diminishing returns. So that's why I'm trying this approach.

I hope you're getting what I'm trying to do. Even the free Code Llama 70B can't really produce this kind of multi-step crawling code; it eventually ran into an error and stopped responding.

And I don't have a GPT-4 subscription!

Considering you are relying on AI-generated code, that sounds like a very ambitious project.

Even if you get the links to the sites, you still have to know where the data you are looking for is located on each site. The actual information might be gated behind a sign-up process as well.

What you are trying to do doesn’t seem feasible.

Surely there are already sites that aggregate this information in some form, or counseling services provided for students.


Just as an aside, I would suggest you learn about the tools and language you are using, as well as look up information on web scraping.

You should never rely on AI to write code for you, and that goes doubly if you don't know what you are doing. It lies, and it simply doesn't have the capacity for creative problem solving.

I think you're trying way too hard on a problem that doesn't have a proper solution. Even if you get the links from the home page, there's no guarantee that they actually lead to a real scholarship opportunity. I'm almost certain there's a better way to do this.