Web Scraping With PHP

I arrived at this web page… web-scraping-with-php-crawl-web-pages

I’d love to pursue this project and get a grasp of web scraping with a PHP script, but the project requires Composer, and I won’t use third-party applications for dependency management or installation. Is there a way to approach this without a third-party dependency manager? Do you have any ideas?

I don’t use PHP, but having to use a dependency manager is not uncommon. A lot of languages have some form of dependency manager (npm, pip, cargo, whatever).

Why do you not want to use Composer? If you look at the docs of different libraries, it is the recommended way of installing them.

You asked me, and so you deserve an answer.

Have you ever looked through all of the code that gets called when you use PSR-4 autoloading? The fact that it’s slow doesn’t appear to be up for debate. PSR-4 autoloading is so slow and so over-engineered that it has turned some of the best world-class PHP developers into apologists who spend their time writing elaborate excuses for their frameworks being so slow.

Over-engineering is just the tip of the iceberg. The biggest technical problem with PSR-4 autoloading is interversion dependency. What this means is that you can download a package, assuming it comes from a developer whom you trust. Little do you know that 70 other developers have contributed to the code, all of whom are perfectly anonymous.

To give you a concrete example, I use a framework that has one vendor autoload folder with a single package in it, downloaded from Packagist. From this one downloaded package, we ended up with more than 200 copyright notices (from all sorts of different people)!!

The idea of downloading something that has potentially been tampered with by dozens of other perfectly anonymous developers is a security nightmare. Interversion dependency is not unique to Packagist; it is also used by NPM. That’s NPM, the package manager that has been abandoned by the guy who invented it. Even though NPM focuses on NodeJS and Packagist focuses on PHP, both systems are fundamentally the same. You know what, search for “NPM” under Google’s News tab and read some of the articles that come up.

The entire NPM system is riddled with malicious code, and what is happening to NPM today is an indication of what could happen to Packagist tomorrow. Why wait? Google “Packagist” today and see what is being written. Over the last three weeks, numerous downloadable plugins in the WordPress ecosystem have also been found to contain malicious code and/or vulnerabilities. I can’t begin to imagine what hackery could be injected into my sites by using Composer.

I don’t want to be accused of not contributing to the conversation, and perhaps this is contrary to your own agenda on this site, but the hackers of today are skilled to the eyeballs, determined, and well resourced, and they have all the time in the world to execute their diabolical plans, given the rewards that are at stake.

I believe that there is an alternative to autoloading and/or third-party package managers.

Thanks, and be well.

As I said, I don’t use PHP, so I can’t give you any specific advice.

I feel like you are addressing separate issues.

  1. Slowness of autoloader. I don’t know anything about this, but if you know of a better solution, or can write one, use that instead.

  2. Dependencies. All valid points, if a bit too alarmist for my taste. If you want to write your own parser and scraper, no one is stopping you.
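
To illustrate that Composer-free route, here is a rough sketch that relies only on PHP’s bundled DOM extension (DOMDocument and DOMXPath). The helper name extractLinks() and the inline sample HTML are made up for this example; they are not taken from the linked tutorial, and this is not meant as a complete scraper.

```php
<?php
// Sketch only: uses PHP's bundled DOM extension (ext-dom), no Composer packages.
// extractLinks() is a hypothetical helper name, not part of any library.

function extractLinks(string $html): array
{
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous error handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    $xpath = new DOMXPath($doc);

    // Select every <a> element that carries an href attribute.
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = [
            'href' => $anchor->getAttribute('href'),
            'text' => trim($anchor->textContent),
        ];
    }

    return $links;
}

// Inline sample markup (an assumption for illustration) so the sketch runs offline.
$html = '<html><body><a href="/a">First</a> '
      . '<p><a href="https://example.com/b"> Second </a></p></body></html>';

foreach (extractLinks($html) as $link) {
    echo $link['href'], ' => ', $link['text'], PHP_EOL;
}
```

For a live page, the markup would have to be fetched first, for example with file_get_contents($url) (which needs allow_url_fopen enabled) or with the cURL extension. Both are part of standard PHP distributions, though they may need enabling, so no package manager is involved.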

You say you believe there is an alternative to autoloading and third-party package managers. If so, why are you not using it?