Conversion to Web problem!!!

  • 28-09-2007 10:50am
    #1
    Closed Accounts Posts: 552 ✭✭✭


    I work in a publications department for a largish company. We are currently making our files available for the Web. The problem is that we use InDesign as our design and typesetting package, which is what we should be using, since we concentrate mainly on books.

    But the company we use to build the search engine for our books is useless, in my opinion. Basically we are exporting all the files to RTF, then stripping the images out of the InDesign file and placing them into the RTF for them. The images could be vector art drawn inside InDesign, which makes preparation for print easier without external programs.

    The thing is, we could just export the PDF to RTF and all the images would be there, but PDFs are problematic at best because a lot of detail, like Character and Paragraph Styles, is lost when a file is converted to PDF. So when you export from the PDF, the styles come up with names like CM+147.

    The company we have hired to do the conversion won't use the RTFs from the PDF, understandably, because they can't make the CSS work. But at the end of the day we are the ones left converting all the files to RTF and placing the images in, and we are the ones who have to convert all the styles for the web.

    That isn't my job though. We hired this company to put our data on our website with a built-in search function. I tried to find some of our publications online today and it was horrible. I searched for the key words and the document didn't come up. I went through the archive structure in the left pane and found it there. Other documents matched the keywords, I got hits on them and could see they were marked with a tick mark, but the one thing I was looking for wasn't marked.

    We used to have PDFs up there, but the people who buy our books found it very slow to open a PDF in their browser, probably due to their internet connection or computer. So we wanted a totally "TEXT" file they could search, with images.

    Ok I've rambled on enough. Here's my question:

    Does anyone know of a company that will take our InDesign files, convert them for the web, construct a good Search Engine for our website?

    It doesn't matter if these are separate companies or one company.

    How would you proceed with this problem?


Comments

  • Registered Users Posts: 35,524 ✭✭✭✭Gordon


    This is probably best suited for the webmaster forum.

    I worked in a company similar to yours but we dealt with jpgs. Jpgs have the ability to hold metadata (there are programs other than Photoshop that can load and extract the metadata, but I can't remember which). However, your problem is then getting the client to search the metadata. In this case we hired a PHP programmer to write a database, a front end, and a client that uploads the jpgs and a separate keyword CSV file to the database and links them. So it looks like your company isn't doing a good job.

    One thing you can look into while getting another company is solutions like Coppermine and Gallery, which are free PHP/MySQL programs. They have the ability to utilise keywords on jpgs and other filetypes. However, I am not sure whether or not they will strip the metadata in the jpg.

    Can InDesign do a batch export of all your files into jpgs? I know Photoshop can do this easily.


  • Closed Accounts Posts: 552 ✭✭✭Hank_Scorpio


    Many thanks for your reply. If this needs to be moved to another forum then that is fine by me. Can I just be sent the link or something when it is, so I can find it again? Thanks.

    And jpgs aren't really a viable option for us. We need people to be able to search the text and have it highlighted, etc., with the ability to copy and paste text from it.

    InDesign can export to a lot of formats, including jpg, and Adobe Bridge (comes free with most Adobe products) can add all sorts of metadata to images.

    We have well over 50 titles, each ranging from 100 to 3,200 pages. That would be a lot of metadata to add, wouldn't it?


  • Registered Users Posts: 3,594 ✭✭✭forbairt


    Haven't completely read through this .. but would you not just be after adobe golive ? :)


  • Closed Accounts Posts: 552 ✭✭✭Hank_Scorpio


    I know what you mean with GoLive. It's truly great at exporting to HTML, but it does have its pitfalls. There is a lot of tidying up to do with the GoLive documents. One thing that is amazing is the Live Update feature: when we make changes to the InDesign file, the GoLive files get updated automatically, as they are directly linked with the GoLive package. On the downside, when you do these updates some information is cut off the end of the documents, so you have to go and edit the HTML to show this data. It's a good way, but alas, here's the final problem: InD CS3 does not have GoLive support, as Adobe bought Macromedia and now offers support for Dreamweaver instead, which doesn't handle the automatic update and needs CSS or something similar set up. Actually, that gives me an idea! I'll be back shortly with my idea and whether it worked.

    Thanks for making me think!


  • Registered Users Posts: 3,594 ✭✭✭forbairt


    I haven't actually used go live ... so don't exactly know what I'm talking about in this instance ....

    I do however think there's nearly always going to have to be some level of tidy up to get things looking just right ...

    I'm actually surprised you're having so many problems with all this ... it sounds like a fairly straightforward process for a web development company ...


  • Closed Accounts Posts: 552 ✭✭✭Hank_Scorpio


    I can't believe that we are the ones having to do all the tidying up, to be honest. I thought we'd just send them the files and they would produce them for our website. But when I went to use it the other day, I typed in my search and got hits, yet the actual document I was looking for, which had my search terms in its title, did not come up as a hit. Tell me that's not completely wrong. The first thing on the hit list should be the actual title, not useless references from other books. Surely?

    I am working on making the files more ready for HTML, so to speak, in InDesign, so I may as well learn PHP or whatever I need and do it myself. I can't see the point in paying people any more for something that I can do on my own. I don't mean to sound like a know-it-all or a brain that can pick anything up quickly, that's not the case at all. It'll be a long road but so worth it in the end, I think.


  • Registered Users Posts: 3,594 ✭✭✭forbairt


    There should be some kinda weighting given to it all ... titles would be first matches ... followed possibly by description of the book (from the back of the book or whatever) ... these would be the defaults I'd assume, as well as author and so on ... then you'd perform a full body text search .. which could get quite intensive
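
    A rough sketch of that kind of weighting, in Python rather than PHP just to keep it short. The weights, the `Book` fields and the function names are all made up for illustration, and it only splits on whitespace, so treat it as the idea rather than a real search engine:

```python
from dataclasses import dataclass

# Hypothetical weights: a hit in the title counts more than one in the
# description, which counts more than one in the body text.
FIELD_WEIGHTS = {"title": 10.0, "description": 3.0, "body": 1.0}


@dataclass
class Book:
    title: str
    description: str
    body: str


def score(book: Book, query: str) -> float:
    """Sum weighted counts of each query word across the book's fields."""
    words = query.lower().split()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        # Naive tokenising: lowercase and split on whitespace only.
        tokens = getattr(book, field).lower().split()
        total += weight * sum(tokens.count(word) for word in words)
    return total


def search(books: list[Book], query: str) -> list[Book]:
    """Return the books with at least one hit, best score first."""
    scored = [(score(book, query), book) for book in books]
    hits = [(s, b) for s, b in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [book for _, book in hits]
```

    With weights like these, a book with the search word in its title outranks a book that only mentions the word a few times in the body.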


  • Closed Accounts Posts: 552 ✭✭✭Hank_Scorpio


    Exactly, the books would need to be indexed for metadata that is searchable from the search engine. It's ok, I'm beginning to learn how to export basic HTML out of InDesign. Then I'm going to learn how to build CSS with Dreamweaver. Then I'm going to go on an indexing course. Then I'm going to learn how to add metadata to our files.

    Where do I learn to build search engines? I thought there were people and companies out there that can do this sort of thing so I don't have to?


  • Registered Users Posts: 3,594 ✭✭✭forbairt


    I build them :D ... joke .. actually not .. but booked up over the next while ...

    Anyways ... you'll want to decide what languages you want to use ...

    Main options are php / asp ... after that you can go down the jsp / ruby road if you want ...

    currently for something like that I'd probably do php / mysql

    What you actually need to do is have all the text in a database

    At its most basic .. you'll have
    ID
    Book Title
    Book Description
    Book Body

    this is truly at its most basic ...
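
    Something like the following is what that table could look like. This is only a sketch using Python's built-in sqlite3 instead of php / mysql, and the table, column and function names are my own guesses at the four fields listed above:

```python
import sqlite3

# In-memory database for illustration only; a real site would use a file
# or a server database such as the MySQL mentioned above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE books (
        id          INTEGER PRIMARY KEY,
        title       TEXT NOT NULL,
        description TEXT,
        body        TEXT
    )
    """
)


def add_book(title, description, body):
    """Insert one book and return its generated ID."""
    cur = conn.execute(
        "INSERT INTO books (title, description, body) VALUES (?, ?, ?)",
        (title, description, body),
    )
    conn.commit()
    return cur.lastrowid


def find_books(term):
    """Return (id, title) for books mentioning the term in any field."""
    # LIKE does a plain substring match; % and _ in the term act as wildcards.
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT id, title FROM books "
        "WHERE title LIKE ? OR description LIKE ? OR body LIKE ? "
        "ORDER BY id",
        (pattern, pattern, pattern),
    )
    return rows.fetchall()
```

    A plain LIKE search like this scans every row, so it would be slow over thousands of pages. It is only meant to show the shape of the data.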

    Take a look at this script here ...
    http://www.devpapers.com/article/306
    And away you go ...

    I guess if the guys are asking for the books as RTF ... files they are just doing a very simple import into a database .. and having chapters ... in there ... they are probably also ... having some kinda include for images there as well.

    The problem I'm assuming is they've lost all the original formatting of the books now ?

    If you need a hand with any of this .. just drop me a pm .. I'm online .. too much :D


  • Registered Users Posts: 3,594 ✭✭✭forbairt


    BTW as Gordon said ... you'd probably be better posting this in the webmaster forums ...


  • Closed Accounts Posts: 552 ✭✭✭Hank_Scorpio


    Cool, I've bookmarked the DevPapers site. Thanks for that, it could come in handy. If I do need a hand with any of it I will drop you a PM. Thanks, you're the first person since April who has offered to help, so I guess my incessant ranting is starting to pay off.


  • Closed Accounts Posts: 552 ✭✭✭Hank_Scorpio


    Well the reason I posted it here is because it's coming from InDesign to the Web. So I am also looking for InD to Web people too. I didn't want to double up on the postings, and initially I couldn't find a Web forum. So... sorry if it's in the wrong place.


  • Registered Users Posts: 35,524 ✭✭✭✭Gordon


    I misunderstood your first post, and don't know inDesign very well, but I've checked out a few ideas and may have a temporary solution..

    You say that you would prefer not to use pdfs, but you may be able to get a site up and running if you use a content management system and import your pdfs.

    If you are willing to dabble in a bit of Content Management System learning (very easy, you just need to set up a test server - see the Webmaster forum for more info) then you can set up Joomla which is free.

    Once you have joomla working you can check this mod out. It costs 15 dollars but looks like a good solution for indexing your pdfs that you have uploaded to your joomla site.

    Another option is to download another CMS called Drupal and check this link out as a similar option. However, this looks pretty intense as you need to have a pdftotext program installed on your server and do various bits and bobs that the Webmaster forum can help you out on.

    However, this may all be redundant if you are not willing to output to pdf and use them.

    Incidentally, how do you output to rtf on inDesign? I have the program and can't find how to do it! Also, I can't find how to output to basic html either :confused:.

    If you want this moved over to the Webmaster maybe the mod of this forum wouldn't mind if I did it myself. I really think that you would be better off there.


  • Registered Users Posts: 3,594 ✭✭✭forbairt


    not had much time today ....

    PDF's being large in file size ... comes down to the images contained within them I'd say ... text and the font embedded shouldn't really size the PDF up that much ... if downloading for end customers is a problem then you should potentially decide to chop them up ... books have chapters for a reason :P

    Also there are about 6 different settings for exporting to pdfs if memory serves me correctly ... print / press / high quality and so on ... I don't have InDesign ... played with the previous demos but haven't had a need for it just yet

    A free opensource indexing system
    http://www.htdig.org/

    Check out the FAQ ... section 4.9 ... http://www.htdig.org/FAQ.html#q4.9
    So it should index your pdfs for you ... it's also a search engine, as opposed to a content management system, which is what Joomla and Drupal are ..

    I don't know how easy it is to set up exactly ... but if I was going to do it I'd probably do it this way ... the alternatives will have you mucking about too much importing into their systems

    BTW how are you releasing your books ... free for download ? or do users have to log in ? :)


  • Closed Accounts Posts: 552 ✭✭✭Hank_Scorpio


    Thanks for all those links and all the help it is really appreciated. I will look into each of them and see how it goes.

    The reason that the PDFs can't be used is that the search engine searches the text in all the PDFs, so it takes ages to search through literally thousands upon thousands of pages of text. It is just too slow. Even when indexed by Google, our PDFs are too big for Google.

    If you're exporting to RTF from InDesign:

    With the text insertion point inside the text box, choose File>Export, then select RTF. This option is not there if the text box is not active.

    For basic HTML:
    Window>Tags

    Create tags named after basic HTML elements (without quotes): "p", "em", "strong", "h1" etc. When you have this done, you map the Styles to the Tags. Now each bit of your text is tagged with that element.

    Note: If you have one word in bold, italic or bold italic inside a paragraph then you will need to have a character style applied. You can download a script called "Preserve Local Formatting" which will create character styles and apply them to the bits of text with local formatting inside paragraphs that are styled.

    File>Cross Media Export>XML

    With the XML file, edit it, put in basic HTML tags at the top and bottom. Tidy up a bit and you're done.
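
    If you have a lot of files, the "put HTML tags at the top and bottom" step could be scripted. Here is a rough Python sketch. It assumes the export has a single root element wrapping elements already tagged as p, h1, em and so on, and the function name `xml_to_html` is just made up for this example:

```python
import html
import xml.etree.ElementTree as ET


def xml_to_html(xml_text: str, page_title: str) -> str:
    """Wrap the children of the XML root element in a minimal HTML page."""
    root = ET.fromstring(xml_text)
    # Serialise each tagged element (p, h1, em, ...) as it is, dropping
    # only the outer root element that came from the export.
    parts = [
        ET.tostring(child, encoding="unicode", method="html").strip()
        for child in root
    ]
    body = "\n".join(parts)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f"<head><title>{html.escape(page_title)}</title></head>\n"
        f"<body>\n{body}\n</body>\n"
        "</html>\n"
    )
```

    Any tags in the export that are not real HTML elements are passed through untouched, so the tidying up mentioned above would still be needed.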

    You can export to Dreamweaver too, but you will need to have some CSS in place. The tags will be there though.


  • Registered Users Posts: 3,594 ✭✭✭forbairt


    with HTDIG ... you only index the file when it's been updated ... so after your initial index of the file ... it won't need to be reindexed until it's changed again, to the best of my knowledge ...
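
    The general idea of "only reindex what has changed" can be sketched like this in Python. This is not how ht://Dig itself does it, just an illustration using file modification times, and the `indexed_mtimes` dict and function names are hypothetical:

```python
import os


def stale_files(paths, indexed_mtimes):
    """Return paths that are new or modified since they were last indexed.

    indexed_mtimes maps a path to the modification time recorded when the
    file was last indexed (a hypothetical bookkeeping dict).
    """
    stale = []
    for path in paths:
        mtime = os.path.getmtime(path)
        if indexed_mtimes.get(path) != mtime:
            stale.append(path)
    return stale


def mark_indexed(paths, indexed_mtimes):
    """Record the current modification time of each freshly indexed file."""
    for path in paths:
        indexed_mtimes[path] = os.path.getmtime(path)
```

    On each run you would index only what `stale_files` returns and then call `mark_indexed` on those files.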

    A solution for Google and other search engines indexing your files ... hmm... these files are all going to be publicly available ? .. (still in shock about that) :P ..

    break up the pdfs into chapters as I said ... reduce the quality of the images embedded in the files ... and the size should reduce heavily


  • Closed Accounts Posts: 552 ✭✭✭Hank_Scorpio


    As I said, the breaking down of the books isn't the problem for size. The problem comes when you are searching and it has to search over 50 titles for references, scanning well over 50,000 pages of text.

    There was a massive uproar from people who bought this service, as the PDF search and find was immensely slow. So we are changing it for the better. And as I've said, the search engine that was developed is crap.

    The books would only be available through our website, for which you need a member login. They can't be printed and they are for cross-referencing legislation and law only.


  • Registered Users Posts: 3,594 ✭✭✭forbairt


    HTDig .. keeps all the keywords and so on in a database .. which makes for fast access, so it's not searching the PDFs as such every time :)
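
    To show why a keyword database is fast, here is a tiny inverted index in Python. It is only a sketch of the general technique, not HTDig's actual storage format, and the function names are invented for the example:

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def lookup(index, query):
    """Return the documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    # Copy the first set so intersecting does not modify the index.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

    The slow part (reading every page) happens once in `build_index`. A search is then just a few dictionary lookups, no matter how many pages the books have.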

    Problems I could see .. if you're using shared hosting for 30 quid for the year ... you'll probably not get the fastest searches ... so you may need to have a dedicated host somewhere ... :)


  • Closed Accounts Posts: 18,056 ✭✭✭✭BostonB


    Sorry for dragging this up, and it's probably no longer an issue.

    I reckon that to have a fast and useful search, there's no way of avoiding having to parse the PDF/RTF files and build up useful metadata for searching on. You could write a small application with VBA or similar to automate this, but someone would have to maintain and test this metadata. It would then need to be stored in an online database which can be queried. It's likely there are styles in the original PDFs which would be useful as metadata. I worked on systems like this in the past, building up XML metadata on CBT source material for use in online searches. You can then track which searches are done and further refine the metadata to better match the searches.
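
    As a rough illustration of the tracking part, here is a small Python sketch (rather than VBA). The `SearchLog` class and its method names are hypothetical. It counts searches and keeps a separate tally of the ones that returned nothing, which are the obvious candidates for new metadata:

```python
from collections import Counter


class SearchLog:
    """Count searches, and separately count those that found nothing."""

    def __init__(self):
        self.counts = Counter()
        self.misses = Counter()

    def record(self, query, hit_count):
        # Normalise case and spacing so "Stamp  Duty" and "stamp duty" match.
        key = " ".join(query.lower().split())
        self.counts[key] += 1
        if hit_count == 0:
            self.misses[key] += 1

    def top_misses(self, n=10):
        """Return the n most frequent searches that returned no results."""
        return self.misses.most_common(n)
```

    Reviewing `top_misses()` every so often would show which terms people expect to find but the metadata does not yet cover.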

