
URL organising software

  • 05-05-2010 7:59pm
    #1
    Closed Accounts Posts: 10


    I'm looking for a program that does the following task:

    It must scan a plain text document, identify all URLs in it and organise them as a list in a separate text document. It should be able to pick out links ending in a specific TLD, so I end up with a list of .com domains and just the domain name, for example "http://www.example.com" and not "http://www.example.com/directory/page.html".

    I would like to know whether anyone has come across a program that does this, or do I need to pay someone to build it for me?

    Also, roughly what sort of cost would I be looking at to have this made? And lastly, could I build the program myself with a crash course in Perl, or whatever the appropriate language would be?

    Sorry for the long post, I appreciate any feedback.


Comments

  • Closed Accounts Posts: 9,700 ✭✭✭tricky D


    This should be possible in Excel or OpenOffice's Calc. I don't have Excel to hand, but in Calc you take the URL list and paste it in, and the Text Import dialog comes up.* Use '/' as the separator instead of the usual comma, tab etc. Now you have the 'http:' parts in the first column, an empty second column and the stripped domains in the third. Discard the following columns, then copy and paste the third column into a text file.

    If you want to keep the http:// prefix, paste '//' into all of the second-column cells (after the first few you can recopy and paste them in a few, or a load, at a time), copy the first three columns, paste back into Notepad++ or TextPad and replace ' // ' with '//'. Job done.

    * I think the Excel way of doing this is to paste the URLs in so they're all in column 1, then use the Text to Columns feature.
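
    Just to make that column layout concrete, here is the same split-on-'/' idea in a couple of lines of code (this snippet is only an illustration of the logic, not part of the spreadsheet steps):

        #!/usr/bin/perl
        # Splitting a URL on '/' gives: part 0 = "http:", part 1 = "" (empty),
        # part 2 = the domain, and the rest is the path to discard.
        use strict;
        use warnings;

        my $url   = 'http://www.example.com/directory/page.html';
        my @parts = split '/', $url;
        print "$parts[2]\n";              # www.example.com
        print "$parts[0]//$parts[2]\n";   # http://www.example.com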


  • Closed Accounts Posts: 10 Garreth


    Hi tricky D, thanks for the reply.

    The initial document that I want the software to scan contains a huge amount of text, and among it are the 2000 or so links I want organised into a list.

    After the software has compiled the list I could apply what you have suggested to clean it up, but I still need that first step. Sorry if I wasn't clear enough about this in my original post.


  • Closed Accounts Posts: 8,015 ✭✭✭CreepingDeath


    Garreth wrote: »
    I'm looking for a program that does the following task:

    It must scan a plain text document, identify all URLs in it and organise them as a list in a separate text document. It should be able to pick out links ending in a specific TLD, so I end up with a list of .com domains and just the domain name, for example "http://www.example.com" and not "http://www.example.com/directory/page.html".

    You should be able to do it easily enough with any language or application that supports regular expressions.
    Just define the pattern you want to search for and the portion you want back.
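
    As a very rough sketch of that idea (the pattern below and the choice of .com are my own guesses for illustration, nothing more), a few lines of Perl reading from standard input would do the core matching:

        #!/usr/bin/perl
        # Sketch only: find http/https links and keep just the scheme + host,
        # dropping any /directory/page.html part; the .com ending is assumed.
        use strict;
        use warnings;

        while (my $line = <STDIN>) {
            while ($line =~ m{(https?://[\w.-]+\.com)\b}g) {
                print "$1\n";    # $1 holds only the captured domain portion
            }
        }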


  • Moderators, Science, Health & Environment Moderators Posts: 10,079 Mod ✭✭✭✭marco_polo


    Yep, you could definitely do it with regular expressions, as the lads have said. In fact, I would be surprised if it came to more than 10 lines or so.

    http://www.troubleshooters.com/codecorn/littperl/perlreg.htm

    http://www.cs.tut.fi/~jkorpela/perl/regexp.html


  • Registered Users Posts: 6,509 ✭✭✭daymobrew


    Garreth wrote: »
    Also, roughly what sort of cost would I be looking at to have this made? And lastly, could I build the program myself with a crash course in Perl, or whatever the appropriate language would be?
    Perl all the way!

    If you could post a sample text file and the possible search options (e.g. is the TLD specified on the command line or somewhere else?), someone might throw something together for a bit of fun.
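
    In the meantime, here is a rough sketch of the sort of thing I mean (the output file urls.txt, the usage line and the exact pattern are all just assumptions for illustration):

        #!/usr/bin/perl
        # Sketch: take an input file and a TLD on the command line, pull out
        # every http/https link ending in that TLD, keep only the domain part
        # and write the de-duplicated list to urls.txt.
        use strict;
        use warnings;

        my ($infile, $tld) = @ARGV;
        die "usage: $0 input.txt tld (e.g. com)\n"
            unless defined $infile and defined $tld;

        open my $in,  '<', $infile    or die "Can't read $infile: $!";
        open my $out, '>', 'urls.txt' or die "Can't write urls.txt: $!";

        my %seen;
        while (my $line = <$in>) {
            # Capture scheme + host only, dropping any trailing /path/page.html
            while ($line =~ m{(https?://[\w.-]+\.\Q$tld\E)\b}g) {
                print {$out} "$1\n" unless $seen{$1}++;   # skip repeats
            }
        }
        close $out;

    Running it as something like "perl geturls.pl mynotes.txt com" would then leave the list of .com domains in urls.txt (those file names are made up).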


  • Closed Accounts Posts: 9,700 ✭✭✭tricky D


    Garreth wrote: »
    Hi tricky D, thanks for the reply.

    The initial document that I want the software to scan contains a huge amount of text, and among it are the 2000 or so links I want organised into a list.

    After the software has compiled the list I could apply what you have suggested to clean it up, but I still need that first step. Sorry if I wasn't clear enough about this in my original post.

    To get all the links on their own, you could use the Page Data Bookmarklets. The 'List All Links' one will give you all the links; you might need to clean that up a bit. Then follow my previous instructions.

    However, as daymobrew says, Perl is perfect for this, but only if you already know it or are willing to learn it.


  • Closed Accounts Posts: 1,759 ✭✭✭Dr.Silly


    How about this, and at the end just sort your column?

