Sadly, .NET has no simple built-in way (without third-party modules) to parse HTML into a tree structure that can be easily navigated to extract information. Again, InternetExplorer.Application provides an interface onto the DOM, but it does not give you full control over the process of downloading data, which is what we're after here. It's also extraordinarily slow: using Measure-Object, you can determine that parsing the innerHTML of an element with regular expressions can be 10-20 times faster than returning the row and column objects via the API and iterating over their innerText.
So let's assume we're using regular expressions. How do we go about this?
Consider the following regular expression:
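A minimal sketch of such a pattern, assuming $tagnames holds the tag names you care about (the names and HTML fragment here are illustrative, not from any particular page):

```powershell
# Match an opening tag whose name is any entry in $tagnames, capturing
# the tag name in group 1 and the attribute text in group 2.
$tagnames = @("input", "form")                      # assumed example names
$tagregex = "<($($tagnames -join '|'))\s+(.*?)>"    # non-greedy run up to '>'

$html  = '<form action="/login"><input type="hidden" name="__VIEWSTATE" value="abc">'
$found = [regex]::Matches($html, $tagregex)
foreach ($m in $found) {
    "{0} -> {1}" -f $m.Groups[1].Value, $m.Groups[2].Value
}
```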
Given an array $tagnames, we match an element that is any one of these tagnames, followed by the shortest possible number of characters, and then a closing right angled bracket. This does not match (a) elements that are syntactically incorrect or (b) elements that do not have any attributes.
We can then extract the attributes via the following regular expression:
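A sketch of such an attribute pattern (the pair names and values below are made-up examples):

```powershell
# Extract name/value pairs: the value is either double-quoted (group 2)
# or an unquoted run of non-space characters (group 3).
$attrregex = '(\w+)\s*=\s*(?:"([^"]*)"|(\S+))'
$attrtext  = 'type="hidden" name=__VIEWSTATE value="abc"'

$attrs = @{}
foreach ($m in [regex]::Matches($attrtext, $attrregex)) {
    # Group 2 succeeded if the value was quoted; otherwise take group 3.
    $value = if ($m.Groups[2].Success) { $m.Groups[2].Value } else { $m.Groups[3].Value }
    $attrs[$m.Groups[1].Value] = $value
}
```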
This extracts, from the second match group of the first regular expression, the name and value pairs that constitute the attributes. Values can be enclosed in double quotes, or be a run of characters not containing a space. You can extend this to accept the single quote as a valid enclosing character as well; I leave that as an exercise for the reader.
Let's suppose we want to extract all the input tags (including input type=hidden), filter them, and post some of them back to the application. Fields such as __VIEWSTATE and others beginning with underscores are a good example: they need to be transmitted along with the session state cookie (which is not easily extracted from IE itself, but is easily obtained via System.Net.HttpWebRequest) in order to retain our login session. Using these regular expressions and iterating over the matches from the second one allows us to pull out the name/value/type fields and filter them as required, prior to posting them back.
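Putting the two regular expressions described above together, that extraction step might be sketched like this (the HTML fragment and the leading-underscore filter are illustrative assumptions):

```powershell
# Collect every <input ...> element, parse its attributes, and keep the
# ASP.NET state fields -- those whose name starts with two underscores.
$html = '<input type="hidden" name="__VIEWSTATE" value="dDwt" /><input type="text" name="user" value="" />'

$fields = @()
foreach ($tag in [regex]::Matches($html, '<input\s+(.*?)/?>')) {
    $attrs = @{}
    foreach ($a in [regex]::Matches($tag.Groups[1].Value, '(\w+)\s*=\s*(?:"([^"]*)"|(\S+))')) {
        $v = if ($a.Groups[2].Success) { $a.Groups[2].Value } else { $a.Groups[3].Value }
        $attrs[$a.Groups[1].Value] = $v
    }
    if ($attrs["name"] -like "__*") { $fields += $attrs }
}
```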
So how do we post them back to the form? Well, a POST request needs to send the form data to the server in its body, and that data must be URL-encoded.
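A sketch of building that post data, assuming the extracted input fields are held in an array of name/value hashtables (here called $fields; the sample values are made up):

```powershell
# System.Web must be loaded for UrlEncode on Windows PowerShell;
# PowerShell 7 resolves HttpUtility without it, so ignore a load failure.
Add-Type -AssemblyName System.Web -ErrorAction SilentlyContinue

$fields = @(
    @{ name = "__VIEWSTATE"; value = "dDwt+/=" },   # assumed sample values
    @{ name = "user";        value = "alice smith" }
)
# URL-encode each name and value, then join the pairs with '&'.
$pdata = ($fields | ForEach-Object {
    "{0}={1}" -f [System.Web.HttpUtility]::UrlEncode($_.name),
                 [System.Web.HttpUtility]::UrlEncode($_.value)
}) -join "&"
```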
The above translates an array of hashtables containing name/value pairs into post data. Note that you must first run:
Add-Type -AssemblyName System.Web
in order to load the assembly containing the UrlEncode function.
Next, you must translate the post data to the appropriate character encoding before sending it off. Here is an example of that translation:
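A one-line sketch of that conversion (the sample string is illustrative):

```powershell
# Convert the post data string into its UTF-8 byte representation.
$pdata = "__VIEWSTATE=dDwt&user=alice"   # assumed sample post data
$bytez = [System.Text.Encoding]::UTF8.GetBytes($pdata)
```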
This translates the string $pdata into a byte array representing the string in the UTF-8 character encoding (the one most commonly used by web servers).
Finally, you must let the server know how much data you will be sending (so it can allocate memory for your request) and what format the data is in:
$req.ContentType = "application/x-www-form-urlencoded";
$req.ContentLength = $bytez.Length;
$reqstream = $req.GetRequestStream();
$reqstream.Write($bytez, 0, $bytez.Length);
$reqstream.Close();
This completes your postback to the server and you now read back the data in the usual way (described in part 1).
Now you can check the results of your download and react in your program accordingly, saving the file or textual data to an appropriate location.