Reading HTML programmatically is commonly known as "screen scraping". The idea is to automate the task of reading web pages and extracting the data in them so that useful information can be analysed. The main problem is that HTML on the web does not always follow the standards, so you need code that cleans up the HTML for you.
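As a minimal sketch of this clean-up step, the following Python example uses the third-party requests and BeautifulSoup libraries; the URL is only a placeholder for whatever page you want to scrape. BeautifulSoup is one of several parsers that tolerate malformed HTML, which is what makes it useful here.

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page; the URL is a placeholder, not a real scraping target.
response = requests.get("https://example.com/")
response.raise_for_status()

# BeautifulSoup's built-in "html.parser" tolerates malformed markup,
# repairing unclosed tags and other deviations from the standards.
soup = BeautifulSoup(response.text, "html.parser")

# Extract something useful, e.g. the page title and all link targets.
print(soup.title.string if soup.title else "(no title)")
for link in soup.find_all("a", href=True):
    print(link["href"])
```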
Robots.txt
Webmasters can create a file called robots.txt that tells screen-scraping bots which pages, if any, they may read. It is good practice to respect it.
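Python's standard library can check robots.txt for you via urllib.robotparser. The sketch below assumes a hypothetical user-agent name and placeholder URLs:

```python
from urllib.robotparser import RobotFileParser

# The site and the user-agent name are placeholders for illustration.
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # downloads and parses the robots.txt file

# Ask whether our bot is allowed to fetch a given page before scraping it.
if parser.can_fetch("MyScraperBot", "https://example.com/some/page.html"):
    print("Allowed to fetch the page")
else:
    print("robots.txt disallows fetching this page")
```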