There are two possibilities:
- Read the site and analyze it with Regular Expressions
- Syntactically parse the HTML with DOM or SimpleXML
The first option is the easiest but not the safest: if you do not take precautions when building the regular expressions, the developer of the target site only has to change a comma (literally) and your application may stop working.
It is also slower, because you work almost by brute force, matching various patterns and manipulating array structures, often multidimensional ones.
For this approach, file_get_contents() is often enough:
$html = file_get_contents( 'http://www.site.com' );
You then pass $html as the subject of successive calls to preg_match(), preg_match_all(), preg_replace(), or whichever functions suit you best, as many times as you need.
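As a rough sketch of the regular-expression approach, the snippet below pulls the title and the links out of a page. The inline $html string is a made-up stand-in for what file_get_contents() would return, and the patterns are only illustrative; they assume quoted href attributes and would need adjusting for a real site.

```php
<?php
// Hypothetical page content; in practice this would come from file_get_contents().
$html = '<html><head><title>Example Site</title></head>'
      . '<body><a href="/a">First</a> <a class="x" href="/b">Second</a></body></html>';

// Capture the page title (i = case-insensitive, s = dot also matches newlines).
$title = null;
if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m)) {
    $title = trim($m[1]);
}

// Capture every quoted href value and its link text.
// PREG_SET_ORDER groups the captures per match instead of per group.
$links = [];
preg_match_all(
    '~<a\b[^>]*\bhref\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)</a>~is',
    $html,
    $matches,
    PREG_SET_ORDER
);
foreach ($matches as $match) {
    $links[] = [
        'href' => $match[2],
        'text' => trim(strip_tags($match[3])),
    ];
}

print_r($title);
print_r($links);
```

Note how much the pattern depends on the exact markup: an unquoted attribute or a nested tag the pattern did not foresee is enough to miss a link.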
The second possibility is more complicated if you choose DOM, but it is safer because you work with the HTML hierarchy, much as you would in JavaScript: you list nodes, iterate over collections of children, and so on.
It is complicated because the DOM extension is a large and very detailed set of classes.
If the target site is simpler, you can choose SimpleXML, which is related to DOM but much less powerful and therefore much simpler.
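A minimal sketch of the DOM route, assuming the same kind of page. The $html string and the "menu" id are invented for illustration; DOMDocument, DOMXPath and the libxml functions are part of PHP's standard DOM and libxml extensions.

```php
<?php
// Hypothetical page content; in practice this would come from file_get_contents().
$html = '<html><body><ul id="menu"><li><a href="/a">First</a></li>'
      . '<li><a href="/b">Second</a></li></ul></body></html>';

$dom = new DOMDocument();

// Real-world HTML is rarely valid; buffer parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Walk the node collection, much like getElementsByTagName() in JavaScript.
$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    ];
}

// XPath lets you address nodes by their position in the hierarchy.
$xpath = new DOMXPath($dom);
$items = $xpath->query('//ul[@id="menu"]/li');
$count = $items->length;

print_r($links);
echo $count, PHP_EOL;
```

Because the lookup follows the document structure rather than the raw text, small changes in spacing, quoting or attribute order on the target site do not affect it.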
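For comparison, here is a small SimpleXML sketch. It assumes the fragment is well-formed XML (the $xhtml string is invented for illustration), since simplexml_load_string() returns false on markup that is not.

```php
<?php
// Hypothetical well-formed fragment; SimpleXML cannot parse sloppy HTML directly.
$xhtml = '<ul id="menu"><li><a href="/a">First</a></li>'
       . '<li><a href="/b">Second</a></li></ul>';

$xml = simplexml_load_string($xhtml);
if ($xml === false) {
    throw new RuntimeException('Markup is not well-formed XML');
}

// Child elements are exposed as properties, attributes as array keys.
$links = [];
foreach ($xml->li as $li) {
    $links[] = [
        'href' => (string) $li->a['href'],
        'text' => (string) $li->a,
    ];
}

print_r($links);
```

If the page is not well-formed, one option is to load it with DOMDocument::loadHTML() first and then convert the result with simplexml_import_dom().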
The entire site? A specific page or pages of the site? Just that specific site?
– Daniel Omine