
Web Form/Database problem

  • 27-04-2005 6:31pm
    #1
    Closed Accounts Posts: 3,322 ✭✭✭


    Basically what I'm trying to ask is: is there a way of viewing a page's source if you know the exact page name, without going into the web page and right-clicking 'View Source'? Like doing it through cmd in DOS or something. I have a few million web pages I need the source of, so I was thinking of writing some sort of script to automate it.

    Thanks!


Comments

  • Registered Users Posts: 1,268 ✭✭✭hostyle


    Perl and LWP


  • Closed Accounts Posts: 17,208 ✭✭✭✭aidan_walsh


    Just use your script to send an HTTP GET request for the page and read everything after the first blank line. Everything above that is the HTTP header and irrelevant to your needs.
    GET <url> HTTP/1.0
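
    A minimal PHP sketch of that idea, sending the request by hand over a raw socket with fsockopen and discarding everything up to the first blank line (the host and path are placeholders):

    <?php
    // Sketch: fetch a page with a hand-written HTTP GET over a raw socket.
    $host = 'www.boards.ie';
    $path = '/';

    $fp = fsockopen($host, 80, $errno, $errstr, 30);
    if (!$fp) {
        die("$errstr ($errno)");
    }

    // HTTP/1.0 request as described above; the Host header matters for
    // virtually hosted sites (see further down the thread).
    fwrite($fp, "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n");

    // Skip the headers: everything up to the first blank line.
    while (($line = fgets($fp)) !== false) {
        if (rtrim($line, "\r\n") === '') {
            break;
        }
    }

    // Everything that's left is the page source.
    $body = '';
    while (!feof($fp)) {
        $body .= fread($fp, 8192);
    }
    fclose($fp);

    echo $body;
    ?>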
    


  • Closed Accounts Posts: 4,655 ✭✭✭Ph3n0m


    php is your friend

    <?php
    // Fetch the page source over HTTP and append it to a local file.
    // Note: this relies on allow_url_fopen being enabled in php.ini.
    $filename = 'test.txt';
    $handle = fopen("http://www.boards.ie/", "rb");
    $contents = '';
    while (!feof($handle)) {
        // Read the source in 8192-byte chunks until end of stream.
        $contents .= fread($handle, 8192);
    }
    fclose($handle);

    $writer = fopen($filename, 'a');
    fwrite($writer, $contents);
    fclose($writer);

    echo "done";
    ?>
    
    

    just get the previous code into a loop, feeding it URLs and creating text files dynamically, and you have your own page-source storer :)
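
    For example (a sketch: the urls.txt input file and the numbered output filenames are made up for illustration):

    <?php
    // Read a hypothetical urls.txt (one URL per line) and save each
    // page's source to its own numbered file.
    $urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    foreach ($urls as $i => $url) {
        $contents = file_get_contents($url);
        if ($contents !== false) {
            file_put_contents("page$i.txt", $contents);
        }
    }
    echo "done";
    ?>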


  • Closed Accounts Posts: 3,322 ✭✭✭Repli


    Thanks to everyone for the suggestions =D
    At the moment I've done something quick with wget; it's working alright, but I'm probably gonna use something like PHP because I will be adding to this.
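
    Roughly like this, reading the list of URLs from a file (the filename is just an example):

    wget -i urls.txt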

    $contents .= fread($handle, 8192);

    Ph3n0m just a quick question - is that 8192 = 8mb? So the most you can read is 8mb of source at a time? Thanks


  • Closed Accounts Posts: 4,655 ✭✭✭Ph3n0m


    http://ie.php.net/manual/en/function.fread.php

    fread() reads up to length bytes (in this case 8192, i.e. 8 KB rather than 8 MB) and stops when length bytes have been read or end of file is reached, whichever comes first. The while loop keeps calling it until EOF, so the total amount read isn't limited.


  • Closed Accounts Posts: 756 ✭✭✭Zaph0d


    How would you do this if you wanted to store enough info to render the page offline? e.g. images, frames etc.
    Would you need code to find the source of all referenced URLs in the HTML retrieved by the original HTTP GET?
    Would this have to be recursive?


  • Moderators, Science, Health & Environment Moderators Posts: 8,962 Mod ✭✭✭✭mewso


    Knowing it's a complete waste of time, but VB.Net is also your friend:-
    ' Build the request and read the whole response stream into a string.
    Dim myReq As System.Net.HttpWebRequest = CType(System.Net.WebRequest.Create("http://boards.ie/"), System.Net.HttpWebRequest)
    Dim myStream As System.IO.Stream = myReq.GetResponse.GetResponseStream
    Dim objReader As New System.IO.StreamReader(myStream)
    Dim PageContent As String = objReader.ReadToEnd
    objReader.Close()
    


  • Registered Users Posts: 1,268 ✭✭✭hostyle


    Zaph0d wrote:
    How would you do this if you wanted to store enough info to render the page offline? e.g. images, frames etc.
    Would you need code to find the source of all referenced URLs in the HTML retrieved by the original HTTP GET?
    Would this have to be recursive?

    Google for web spider / slurper / offline downloader software.
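
    If you'd rather roll your own, here's a very rough sketch of the recursive idea (yes, it has to recurse): a crude regex scan that only follows absolute img/frame/iframe URLs; a real spider needs proper HTML parsing and relative-URL resolution. wget's recursive mode does this kind of thing out of the box.

    <?php
    // Rough sketch of a recursive fetcher. Saves each page/resource
    // under a hashed filename; depth limit and start URL are
    // illustrative only.
    function fetch_tree($url, $depth)
    {
        $data = @file_get_contents($url);
        if ($data === false) {
            return;
        }
        file_put_contents(md5($url), $data);
        if ($depth <= 0) {
            return;
        }
        // Crude scan for embedded resources (ignores CSS, scripts, links).
        preg_match_all('/(?:img|frame|iframe)[^>]*src\s*=\s*["\']([^"\']+)["\']/i',
                       $data, $m);
        foreach ($m[1] as $src) {
            // Only absolute URLs; relative ones would need resolving
            // against $url first.
            if (preg_match('#^https?://#i', $src)) {
                fetch_tree($src, $depth - 1);
            }
        }
    }

    fetch_tree('http://www.boards.ie/', 1);
    ?>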


  • Registered Users Posts: 4,003 ✭✭✭rsynnott


    Just use your script to send an HTTP GET request for the page and read everything after the first blank line. Everything above that is the HTTP header and irrelevant to your needs.
    GET <url> HTTP/1.0
    

    Careful; if the site uses virtual hosting, you should use HTTP/1.1 and send a Host header. Otherwise the server may not know which site you want, and it may not work.
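
    i.e. a request shaped roughly like this (host and path are placeholders); with HTTP/1.1 you'll also want Connection: close, since persistent connections are the default and the server would otherwise hold the socket open:

    GET /index.html HTTP/1.1
    Host: www.example.com
    Connection: close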

