
URL organising software

  • 05-05-2010 7:59pm
    #1
    Closed Accounts Posts: 10


    I'm looking for a program that does the following task:

    It must scan a plain text document, identify all URLs in it and organise them as a list in a separate text document. It should be able to pick out links ending in a specific TLD, so I end up with a list of .com domains and just the domain name, for example "http://www.example.com" and not "http://www.example.com/directory/page.html".

    I would like to know whether anyone has come across a program that does this, or do I need to pay someone to build it for me?

    Also, roughly what sort of cost would I be looking at to have this made? And lastly, could I build the program myself with a crash course in Perl, or whatever the appropriate language would be?

    Sorry for the long post, I appreciate any feedback.


Comments

  • Closed Accounts Posts: 9,700 ✭✭✭tricky D


    This should be possible in Excel or OpenOffice's Calc. I don't have Excel to hand, but in Calc you take the URL list and paste it in, and the Text Import dialog comes up.* Use '/' as the separator instead of the usual comma, tab etc. Now you have the 'http:' parts in the first column, an empty second column and the stripped domains in the third. Discard the following columns, then copy and paste the third column into a text file.

    If you want to keep the http:// prefix, paste '//' into all of the second-column cells (after the first few you can recopy and paste them in a few, or a load, at a time), copy the first three columns, paste back into Notepad++ or TextPad and replace ' // ' with '//'. Job done.

    * I think the Excel way of doing this is to paste the URLs in so they're all in column 1, then use the Text to Columns feature.
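
    Just to make that column layout concrete, here is the same split-on-'/' idea in a couple of lines of code (this snippet is only an illustration of the logic, not part of the spreadsheet steps):

        #!/usr/bin/perl
        # Splitting a URL on '/' gives: part 0 = "http:", part 1 = "" (empty),
        # part 2 = the domain, and the rest is the path to discard.
        use strict;
        use warnings;

        my $url   = 'http://www.example.com/directory/page.html';
        my @parts = split '/', $url;
        print "$parts[2]\n";              # www.example.com
        print "$parts[0]//$parts[2]\n";   # http://www.example.com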


  • Closed Accounts Posts: 10 Garreth


    Hi tricky D, thanks for the reply.

    The initial document that I want the software to scan contains a huge amount of text, and among it are the 2000 or so links I want organised into a list.

    After the software has compiled the list I could apply what you have suggested to clean it up, but I still need that first step. Sorry if I wasn't clear enough about this in my original post.


  • Closed Accounts Posts: 8,015 ✭✭✭CreepingDeath


    Garreth wrote: »
    I'm looking for a program that does the following task:

    It must scan a plain text document, identify all URLs in it and organise them as a list in a separate text document. It should be able to pick out links ending in a specific TLD, so I end up with a list of .com domains and just the domain name, for example "http://www.example.com" and not "http://www.example.com/directory/page.html".

    You should be able to do it easily enough with any language or application that supports regular expressions.
    Just define the pattern you want to search for and the portion you want back.
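
    As a very rough sketch of that idea (the pattern below and the choice of .com are my own guesses for illustration, nothing more), a few lines of Perl reading from standard input would do the core matching:

        #!/usr/bin/perl
        # Sketch only: find http/https links and keep just the scheme + host,
        # dropping any /directory/page.html part; the .com ending is assumed.
        use strict;
        use warnings;

        while (my $line = <STDIN>) {
            while ($line =~ m{(https?://[\w.-]+\.com)\b}g) {
                print "$1\n";    # $1 holds only the captured domain portion
            }
        }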


  • Moderators, Science, Health & Environment Moderators Posts: 10,079 Mod ✭✭✭✭marco_polo


    Yep, you could definitely do it with regular expressions, as the lads have said. In fact, I would be surprised if it came to more than 10 lines or so.

    http://www.troubleshooters.com/codecorn/littperl/perlreg.htm

    http://www.cs.tut.fi/~jkorpela/perl/regexp.html


  • Registered Users Posts: 6,509 ✭✭✭daymobrew


    Garreth wrote: »
    Also, roughly what sort of cost would I be looking at to have this made? And lastly, could I build the program myself with a crash course in Perl, or whatever the appropriate language would be?
    Perl all the way!

    If you could post a sample text file and the possible search options (e.g. is the TLD specified on the command line or somewhere else?), someone might throw something together for a bit of fun.
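
    In the meantime, here is a rough sketch of the sort of thing I mean (the output file urls.txt, the usage line and the exact pattern are all just assumptions for illustration):

        #!/usr/bin/perl
        # Sketch: take an input file and a TLD on the command line, pull out
        # every http/https link ending in that TLD, keep only the domain part
        # and write the de-duplicated list to urls.txt.
        use strict;
        use warnings;

        my ($infile, $tld) = @ARGV;
        die "usage: $0 input.txt tld (e.g. com)\n"
            unless defined $infile and defined $tld;

        open my $in,  '<', $infile    or die "Can't read $infile: $!";
        open my $out, '>', 'urls.txt' or die "Can't write urls.txt: $!";

        my %seen;
        while (my $line = <$in>) {
            # Capture scheme + host only, dropping any trailing /path/page.html
            while ($line =~ m{(https?://[\w.-]+\.\Q$tld\E)\b}g) {
                print {$out} "$1\n" unless $seen{$1}++;   # skip repeats
            }
        }
        close $out;

    Running it as something like "perl geturls.pl mynotes.txt com" would then leave the list of .com domains in urls.txt (those file names are made up).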


  • Closed Accounts Posts: 9,700 ✭✭✭tricky D


    Garreth wrote: »
    Hi tricky D, thanks for the reply.

    The initial document that I want the software to scan contains a huge amount of text, and among it are the 2000 or so links I want organised into a list.

    After the software has compiled the list I could apply what you have suggested to clean it up, but I still need that first step. Sorry if I wasn't clear enough about this in my original post.

    To get all the links on their own, you could use the Page Data Bookmarklets. The 'List All Links' one will give you all the links; you might need to clean that up a bit. Then follow my previous instructions.

    However, as daymobrew says, Perl is perfect for this, but only if you already know it or are willing to learn it.


  • Closed Accounts Posts: 1,759 ✭✭✭Dr.Silly


    How about this, and at the end just sort your column?

