We want to build an aggregator product in a very active niche.
I simply want to fetch the title, image and first paragraph of the articles matching my keywords on the websites.
I have researched three methods: RSS feeds, mixnode [dot] com and web scraping.
But I am wondering which technology to use:
RSS feeds provide the title, image and description, but not all websites offer images in their feeds.
I found a technology called mixnode [dot] com, but my challenge is that it hosts web pages as a database with rows and columns, which cannot offer real-time updates for my app, and I need updates every 3 hours.
Web scraping is a good idea, but how do I handle the issue of HTML structure changes?
I have researched some websites and apps that aggregate with images, like:
bounce [dot] ng (looks like a Nigerian app)
How do these platforms (Bounce [dot] ng, mix [dot] com) do it?
If you have better insight into how to build this effectively, please share. Thanks.
I am not allowed to post links, so I have to use [dot].
I don’t know what your familiarity with various web technologies is, but I can make some suggestions for you that will at least give you some idea of where to head next.
Of the three approaches, you’ve already identified that RSS won’t suit your needs because of the images issue - but if you already know what to do with RSS feeds and that is otherwise your best option, you could consider using a bank of dummy images to feed into your aggregator to use for those that don’t offer images in their RSS. Since it’s very niche, surely images related to that niche will always be broadly relevant…just a thought
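To make the fallback idea concrete, here is a rough sketch using only Python's standard library. Everything in it is an assumption for illustration: the sample feed, the `parse_items` helper, the placeholder URL, and the guess that a feed exposes images through a Media RSS `media:content` tag or an `enclosure` tag. Real feeds vary, so treat this as a starting point rather than a finished parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder; swap in images that suit your niche.
FALLBACK_IMAGE = "https://example.com/placeholder.png"
MEDIA_NS = "{http://search.yahoo.com/mrss/}"  # Media RSS namespace

# Made-up feed: one item carries an image, the other does not.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Post with image</title>
      <description>First paragraph.</description>
      <media:content url="https://example.com/a.jpg" medium="image"/>
    </item>
    <item>
      <title>Post without image</title>
      <description>Another paragraph.</description>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml, fallback=FALLBACK_IMAGE):
    """Return title, description and image for each <item> in the feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        image = None
        media = item.find(MEDIA_NS + "content")
        enclosure = item.find("enclosure")
        if media is not None and media.get("url"):
            image = media.get("url")
        elif enclosure is not None and enclosure.get("type", "").startswith("image/"):
            image = enclosure.get("url")
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            # Use the placeholder only when the feed gives no image.
            "image": image or fallback,
        })
    return items


for entry in parse_items(SAMPLE_FEED):
    print(entry["title"], "->", entry["image"])
```

In a real app you would download the feed first and pass its text to `parse_items`; the fallback only kicks in for items that carry no image of their own.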
Whether you use DIY web scraping or a third party solution like mixnode depends on whether you have the budget for mixnode and are happy with their offering (it sounds like you’re not totally happy with their solution for your needs), or whether you know, or can learn, how to do web scraping.
Personally, I’d do the web scraping solution, because I LOVE WEB SCRAPING!
You are correct in wondering what happens when a page you are scraping changes its structure…that’s just par for the course, I’m afraid. You have to monitor your data and keep your scraper up to date with changes. These would presumably be few and far between, though.
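One way to do that monitoring, sketched here under some assumptions: each scraped article is a dict with `title`, `image` and `first_paragraph` keys (hypothetical names), and a field that goes missing in a large share of a batch probably means the site's HTML changed. The 20% threshold is arbitrary.

```python
REQUIRED_FIELDS = ("title", "image", "first_paragraph")  # assumed record shape


def missing_fields(record):
    """List the required fields that are absent or empty in one record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def check_batch(records, max_failure_rate=0.2):
    """Return (ok, broken), where broken maps field name -> missing count.

    A field counts as broken when it is missing in more than
    max_failure_rate of the batch. An empty batch is treated as
    suspicious, since a scraper that finds nothing is often broken.
    """
    if not records:
        return False, {}
    counts = {}
    for record in records:
        for field in missing_fields(record):
            counts[field] = counts.get(field, 0) + 1
    total = len(records)
    broken = {f: n for f, n in counts.items() if n / total > max_failure_rate}
    return not broken, broken


# Made-up batch: half of the records lost their image.
batch = [
    {"title": "A", "image": "a.jpg", "first_paragraph": "Text"},
    {"title": "B", "image": "b.jpg", "first_paragraph": "Text"},
    {"title": "C", "image": None, "first_paragraph": "Text"},
    {"title": "D", "image": None, "first_paragraph": "Text"},
]
ok, broken = check_batch(batch)
if not ok:
    print("Possible structure change, check selectors for:", broken)
```

Running a check like this after every scrape (for example, every 3 hours) means you hear about a layout change from your own alert rather than from empty cards in the app.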
In terms of technologies…well, you can probably do web scraping in most modern languages. Node.js has some good solutions, including Puppeteer (which would be my first suggestion). Python has Beautiful Soup and Scrapy as good options.
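To give a feel for what the scraping step looks like, here is a minimal sketch in Python. It uses the standard library's `html.parser` so it runs without installing anything; Beautiful Soup would make the same thing shorter. The `ArticleParser` class, the `extract_article` helper and the sample page are all made up for illustration. The sketch also assumes the page may carry Open Graph `og:title` and `og:image` meta tags, which many sites publish for link previews and which tend to survive layout redesigns; that is an assumption about the target sites, not something confirmed for any of the ones mentioned here.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collect og: meta tags, <title>, the first <img> and the first non-empty <p>."""

    def __init__(self):
        super().__init__()
        self.og = {}
        self.title = ""
        self.first_img = None
        self.first_p = ""
        self._in_title = False
        self._in_p = False
        self._p_done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("property") or "").startswith("og:"):
            self.og[attrs["property"]] = attrs.get("content") or ""
        elif tag == "title":
            self._in_title = True
        elif tag == "img" and self.first_img is None and attrs.get("src"):
            self.first_img = attrs["src"]
        elif tag == "p" and not self._p_done:
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            self._in_p = False
            # Stop collecting once a paragraph with real text has closed.
            if self.first_p.strip():
                self._p_done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.first_p += data


def extract_article(html):
    """Prefer Open Graph tags; fall back to <title>, first <img>, first <p>."""
    parser = ArticleParser()
    parser.feed(html)
    return {
        "title": parser.og.get("og:title") or parser.title.strip() or None,
        "image": parser.og.get("og:image") or parser.first_img,
        "first_paragraph": parser.first_p.strip() or None,
    }


# Made-up page for illustration.
SAMPLE_HTML = """<html><head>
<title>Fallback title | Site</title>
<meta property="og:title" content="Real title">
<meta property="og:image" content="https://example.com/cover.jpg">
</head><body>
<img src="/logo.png">
<p>First paragraph of the article.</p>
<p>Second paragraph.</p>
</body></html>"""

print(extract_article(SAMPLE_HTML))
```

Pages that render their content with JavaScript will not show up in the raw HTML, which is where a headless browser such as Puppeteer comes in.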
It really just depends on your familiarity with web app development, servers, databases etc.
There is a lot of frustration in web-scraping, especially if you aggregate multiple sites, but if you like the detective work involved it’s a heap of fun!
Thanks, JacksonBates, for your help. I thought of using a bank of dummy images for feeds without images, but I just don’t like it.
I really wonder how bounce.ng came up with their solution.
Thanks, JacksonBates, for your time.