
Zip File Parser

  • 08-06-2006 2:15pm
    #1
Closed Accounts Posts: 2,268 ✭✭✭mountainyman


    Hello,

I need to write a parser in VB6 that will read text from a zip file. Does anyone know of any documentation that discusses what the data stored in a zip file looks like?

If so, thanks; if not, thanks anyway.

    MM


Comments

  • Closed Accounts Posts: 17,208 ✭✭✭✭aidan_walsh




  • Closed Accounts Posts: 2,268 ✭✭✭mountainyman


Thanks very much


  • Registered Users Posts: 15,443 ✭✭✭✭bonkey




  • Closed Accounts Posts: 2,268 ✭✭✭mountainyman


Thanks Bonkey,
You obviously don't know what a parser is or what I am trying to do, but thanks for your help.

    Read<>Write

    Thanks Again


  • Registered Users Posts: 15,443 ✭✭✭✭bonkey


mountainyman wrote:
Thanks Bonkey,
You obviously don't know what a parser is or what I am trying to do, but thanks for your help.

    Read<>Write

    Thanks Again

    I missed where you said parser. D'oh. My (dumb) bad.


  • Closed Accounts Posts: 2,268 ✭✭✭mountainyman


    bonkey wrote:
    I missed where you said parser. D'oh. My (dumb) bad.
Anyhoo, if anyone is reading: I need to page across the file and read it without unzipping.

    Of course if I could unzip it that'd be too easy.

I think that this may be impossible (not is impossible, may be impossible).

After all, just because you can parse a text file does not imply that you can parse a zip file. The contents of a text file are nothing like the contents of a zip file.

Now, there may be an API call that I can use to read the zip, and this is my only hope.

    Thanks for Listening.


  • Registered Users Posts: 21,264 ✭✭✭✭Hobbes


mountainyman wrote:
Now, there may be an API call that I can use to read the zip, and this is my only hope.

    Thanks for Listening.

XP supports opening zip files as file folders, so it's possible it's accessible via an API call that way.

Although probably not much help here, the ability to parse a zip file is built into the Java API.

Had a quick look; this may help you find what you need:

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_vstechart/html/vbcompii.asp
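
A minimal sketch of the zip support built into the Java API that Hobbes mentions, using java.util.zip; the archive name here is a placeholder:

```java
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ListZip {
    public static void main(String[] args) throws Exception {
        // "results.zip" is a placeholder archive name.
        ZipFile zip = new ZipFile("results.zip");
        Enumeration<? extends ZipEntry> entries = zip.entries();
        while (entries.hasMoreElements()) {
            ZipEntry e = entries.nextElement();
            // Entry name plus sizes, read from the archive's central directory.
            System.out.println(e.getName() + "  compressed=" + e.getCompressedSize()
                    + "  uncompressed=" + e.getSize());
        }
        zip.close();
    }
}
```

ZipFile reads the central directory at the end of the archive, so listing entries like this decompresses nothing.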


  • Closed Accounts Posts: 2,268 ✭✭✭mountainyman


Thanks for your help, guys. .NET is not an option, unfortunately, but I will certainly examine the MSDN stuff.

Thanks


  • Registered Users Posts: 15,443 ✭✭✭✭bonkey


    OK...having made my initial blunder, I'm getting totally confused here, and I'm not at all sure what it is you want/need to do.

    Using a Windows API call (should one exist) to manipulate a ZIP file is no different to using a 3rd-party library to manipulate a zip file.

If that's what you want to do, then OK... but your description said that you wanted to write a parser to read the text in a zip. You later clarified this to say you want to read the text without unzipping.

It doesn't matter if you use PKZIP, built-in Windows functionality, some other third-party library, or a hand-rolled solution: the only way you can read something that has been compressed is to decompress it... and unless you write the hand-rolled solution, you're not writing a parser, you're using one.

    You can decompress on the fly, in memory, and never write a file from the output, but you're still applying the decompression routine - you're still unzipping.

If what you want is the functionality to unzip, then what you want is a Visual Basic library with the zip/unzip functionality already built in. That's the type of stuff my original suggestion would point you towards. A couple of examples I found by quickly refining the search:

    http://www.vbaccelerator.com/home/VB/Code/Libraries/Compression/Zipping_Files/article.asp
    http://www.vbaccelerator.com/home/VB/Utilities/VBPZip/Info-ZIP_Unzip_DLL_(Renamed_vbuzip10_dll).asp

If this isn't what you want, because you're not supposed to use someone else's zip functionality - because you're supposed to write your own parser - then there is simply no way you can use a Windows API or other API call to do it, because that's exactly the same thing.

    jc


  • Closed Accounts Posts: 2,268 ✭✭✭mountainyman


    Hi All

It is obviously me that isn't explaining things properly. I currently have a parser written in VB6 which takes a text file of a pretty weird format and processes it into a CSV.

This weird format comes from an ancient proprietary system of my client's, which is used to model interactions of molecules.

My client now wants to change their product so that it will zip up the weirdly formatted text files, as these files are huge - over 2 GB in some cases.

The reason they want to do this is that they are modelling in a more complex way and using a visual display. So where two years ago they would manually generate one file, now they run batches that generate, say, ten.

    The files are larger because they are now asking more complex questions.

    So basically heavier usage of the system. They are using the parser more than they thought they would.

    I would like to work with these larger numbers of weird files to import and reformat them.

    In order to do that I would like to
A) Find the zip file
B) Read a chunk of that file
C) Process that chunk
D) Repeat till EOF

If this is impossible without unzipping, I will be narked.

If it is impossible without unzipping, can I:
A) Unzip a part of the zip file
B) Process that part
C) Destroy that part
D) Repeat till EOF

    I hope that is clearer.

    MM
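
MM's A-to-D loop is essentially a streaming read. The thread's language is VB6, but sketched in Java (whose built-in zip support came up earlier), it would look something like this; the archive name, chunk size, and process() helper are all hypothetical:

```java
import java.io.FileInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ChunkedZipRead {
    public static void main(String[] args) throws Exception {
        // Hypothetical archive name; the chunk size is arbitrary.
        ZipInputStream zin = new ZipInputStream(new FileInputStream("batch01.zip"));
        byte[] chunk = new byte[64 * 1024];
        ZipEntry entry;
        while ((entry = zin.getNextEntry()) != null) {             // A) find the file in the zip
            int n;
            while ((n = zin.read(chunk, 0, chunk.length)) != -1) { // B) read a chunk
                process(chunk, n);                                 // C) process that chunk
            }                                                      // D) repeat till EOF
            zin.closeEntry();
        }
        zin.close();
    }

    // Hypothetical stand-in for the existing text-to-CSV parsing step.
    static void process(byte[] buf, int len) {
        // Only 'len' bytes of decompressed text are in memory at any time.
    }
}
```

As bonkey points out, this is still unzipping - the decompression routine runs over every byte - but the 2 GB file is never materialised on disk or held whole in memory, which is exactly the second A-to-D list: unzip a part, process it, throw it away, repeat till EOF.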


  • Closed Accounts Posts: 17,208 ✭✭✭✭aidan_walsh


    Basically, zip files look like this:

    Header
    Entry
    -Entry details
    -Entry data <- This is where the actual compressed file is
    Entry
    ...
    EOF.

What you want to do is open the relevant entry, get at the data inside, and move on. But decompressing a 2 GB file is going to take time.

The only way I can think of offhand that doesn't require decompressing the entire file is to process buffered sections of the compressed data in memory, do as much processing as you can on those, and then move on. Obviously, though, this is really only practical as a read-only solution - without a way of mapping the data, you would have to decompress the entire file if you want to edit it.
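
To put concrete bytes on that outline: each entry begins with a 30-byte local file header (signature "PK\3\4"), followed by the file name, an optional extra field, and then the compressed data, and the entry details are repeated in a central directory at the end of the archive. A sketch (again Java, file name hypothetical) that reads just the first header; note that archives written as streams store zero sizes here and put the real values in a descriptor after the data:

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LocalHeaderPeek {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name; reads only the first local file header.
        DataInputStream in = new DataInputStream(new FileInputStream("results.zip"));
        byte[] fixed = new byte[30];                // fixed-size part of the header
        in.readFully(fixed);
        ByteBuffer b = ByteBuffer.wrap(fixed).order(ByteOrder.LITTLE_ENDIAN);

        if (b.getInt(0) != 0x04034b50) {            // "PK\3\4", little-endian
            throw new Exception("not a local file header");
        }
        int method   = b.getShort(8) & 0xffff;      // 0 = stored, 8 = deflated
        long csize   = b.getInt(18) & 0xffffffffL;  // compressed size
        long usize   = b.getInt(22) & 0xffffffffL;  // uncompressed size
        int nameLen  = b.getShort(26) & 0xffff;
        int extraLen = b.getShort(28) & 0xffff;

        byte[] name = new byte[nameLen];
        in.readFully(name);                         // entry name
        in.skipBytes(extraLen);                     // the compressed data starts here
        System.out.println(new String(name, "US-ASCII") + " method=" + method
                + " compressed=" + csize + " uncompressed=" + usize);
        in.close();
    }
}
```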


  • Registered Users Posts: 21,264 ✭✭✭✭Hobbes


    I think what might be better to do is to convert that CSV file into some kind of binary format and parse that.

That should cause a huge drop in size. Also, the earlier link I posted shows you how to compress a file's contents within your own program, and not just into a zip file.
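
A rough sketch of the kind of conversion Hobbes means, assuming (hypothetically) that each record is a handful of numeric fields; fixed-width binary is usually much smaller than high-precision decimal text, and record N then starts at a known byte offset:

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;

public class CsvToBinary {
    public static void main(String[] args) throws Exception {
        // Hypothetical output name and record layout: two doubles per record.
        // "1234.567890123" is 14 bytes as text but always 8 as a double,
        // and record N starts exactly at byte N * 16.
        DataOutputStream out = new DataOutputStream(new FileOutputStream("records.bin"));
        out.writeDouble(1234.567890123);
        out.writeDouble(-0.000000042);
        out.close();
    }
}
```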


  • Closed Accounts Posts: 17,208 ✭✭✭✭aidan_walsh


    Hobbes wrote:
I think what might be better to do is to convert that CSV file into some kind of binary format and parse that.

That should cause a huge drop in size. Also, the earlier link I posted shows you how to compress a file's contents within your own program, and not just into a zip file.
    Well, yeah, assuming the company doesn't have anything else that also acts on the files as they are. But I guess they wouldn't be zipping them in that case...


  • Registered Users Posts: 27,163 ✭✭✭✭GreeBo


Well, I guess it depends on the structure of a zip file.
From my interpretation of zipping, part of a zip file tells you nothing;
you need to decompress the whole thing to get something sensible.

Looking at this problem from another perspective, can you modify whatever is generating these files to just create more, smaller files that are easier to work with?
    Are the files/results sequential?
    Can they be volumised?


  • Registered Users Posts: 907 ✭✭✭tibor


    You have a system that is constantly doing reads/writes to large text files,
    which is leading to heavy load on your system, possibly impacting other things running on it?

    And your solution is to compress these large files?
    And read/write from the compressed large files in an attempt to reduce load?

    Are you serious?



    edit: if that is *REALLY* what you want, zlib FTW


  • Registered Users Posts: 21,264 ✭✭✭✭Hobbes


aidan_walsh wrote:
Well, yeah, assuming the company doesn't have anything else that also acts on the files as they are. But I guess they wouldn't be zipping them in that case...

Just create a program to turn them back into the insanely large files again. Tibor has a point, though.


  • Registered Users Posts: 683 ✭✭✭Gosh




  • Closed Accounts Posts: 80 ✭✭Torak


mountainyman wrote:
In order to do that I would like to
A) Find the zip file
B) Read a chunk of that file
C) Process that chunk
D) Repeat till EOF

If this is impossible without unzipping, I will be narked.


It IS impossible without unzipping. As Mr. Bonkey said, if you want to interpret the data then it has to be unzipped.

It is, however (regardless of the difficulties involved), possible to process the contents of an entry in a zip file while it is being decompressed, rather than storing it all in memory: you throw each chunk away after processing it.

Let us call this ability a stream. Note that the ZipInputStream in Java is not what you want, as it gives you an entry at a time. You need a similar product which gives you entries in chunks that are configurable.
    potential result from google

I have no idea if the program does what you ask. If it does not, then chances are you will need to follow the very first link, learn the decompression algorithm, and implement it in a fashion that allows a visitor to access each chunk...
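
One existing building block for such a stream is Java's Inflater, which decompresses raw deflate data - the format inside an ordinary zip entry - in whatever sized pieces the caller supplies. A minimal sketch; the buffer size and the feed() helper are assumptions for illustration:

```java
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class DeflateChunks {
    // Feed one block of an entry's compressed bytes; emits decompressed
    // chunks of at most 32 KB, which can be parsed and then thrown away.
    static void feed(Inflater inf, byte[] compressed, int len) throws DataFormatException {
        inf.setInput(compressed, 0, len);
        byte[] out = new byte[32 * 1024];           // output chunk size is arbitrary
        while (!inf.needsInput() && !inf.finished()) {
            int n = inf.inflate(out);
            // hand out[0..n) to the parser here, then forget it
        }
    }

    public static void main(String[] args) {
        // true = raw deflate with no zlib header, as stored in zip entries.
        Inflater inf = new Inflater(true);
        // ...read the zip file a block at a time and call feed() with the
        // bytes of the entry you care about.
    }
}
```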
    tibor wrote:
    You have a system that is constantly doing reads/writes to large text files,
    which is leading to heavy load on your system, possibly impacting other things running on it?

    And your solution is to compress these large files?
    And read/write from the compressed large files in an attempt to reduce load?

    Are you serious?

You assume too much. It is quite possible, and indeed probable all things considered, that the issue is the storage of the large files rather than the processing of them. The solution required, then, is to remove the storage/transport issue without overly impacting the processing time.


  • Registered Users Posts: 907 ✭✭✭tibor


    Torak wrote:
    You assume too much.

Do I?
mountainyman wrote:
So basically heavier usage of the system. They are using the parser more than they thought they would.


  • Closed Accounts Posts: 80 ✭✭Torak


tibor wrote:
Do I?
mountainyman wrote:
So basically heavier usage of the system. They are using the parser more than they thought they would.

Absolutely -- and hence, perhaps, because the parser is used far more, and in a far more intensive fashion, the output is larger and is therefore causing storage problems.

TBH I don't want to bicker with you, and I apologise that my initial response was written somewhat rudely. It is no excuse; however, I was quite tired...

My thought process stands -- although at first glance it seems to be what you are saying, I believe that there is enough information in what is not said directly to determine that the real cause of the problem is not a lack of clock cycles.

    I could be completely wrong..


  • Registered Users Posts: 15,443 ✭✭✭✭bonkey


    Torak wrote:
It is quite possible, and indeed probable all things considered, that the issue is the storage of the large files rather than the processing of them. The solution required, then, is to remove the storage/transport issue without overly impacting the processing time.

If that's the case, then my immediate reaction is that buying a bigger disk array is probably cheaper than paying a programmer to write an application to use the existing disk more efficiently.

As a second option, I'd use NTFS file compression rather than a system like ZIP. That way the data is stored in a compressed format - which can be massively effective with large text files - but can still be read from / written to as though it were a regular file.

    Similarly, if the problem is having enough memory in the machine to handle loading one complete file (thus leading to the "block" approach), my immediate thought would be first to buy more memory. My second thought would be to buy disk, store the files uncompressed or NTFS-compressed, and write your code to block-access the uncompressed files.

    While many people argue that throwing hardware at a problem is the wrong way to solve it, sometimes hardware *is* the problem.

    jc


  • Closed Accounts Posts: 80 ✭✭Torak


    bonkey wrote:
If that's the case, then my immediate reaction is that buying a bigger disk array is probably cheaper than paying a programmer to write an application to use the existing disk more efficiently.

True, as long as transporting the large files is not also an issue. Different physical locations are a fact of life, and it may be an issue here...

Perhaps network bandwidth in the office is being chewed up transporting the files across the network, and it is now taking two hours every morning just to get the application to start because everybody starts the app at the same time...

Who knows, really.
    bonkey wrote:
As a second option, I'd use NTFS file compression rather than a system like ZIP. That way the data is stored in a compressed format - which can be massively effective with large text files - but can still be read from / written to as though it were a regular file.

True. Makes sense, and it's more than worth investigating, I'd imagine.
    bonkey wrote:
    Similarly, if the problem is having enough memory in the machine to handle loading one complete file (thus leading to the "block" approach), my immediate thought would be first to buy more memory. My second thought would be to buy disk, store the files uncompressed or NTFS-compressed, and write your code to block-access the uncompressed files.

Unless an amount of this work is performed on people's workstations, in which case this might require a full upgrade of every system on everybody's desk. The inexpensive memory upgrade can quickly become a nightmare for everybody involved.
    bonkey wrote:
    While many people argue that throwing hardware at a problem is the wrong way to solve it, sometimes hardware *is* the problem.

Absolutely...

My point was that I assumed the OP had done his research correctly, at least to some degree, and that his post here was concerned with evaluating a specific solution, either as an option or for implementation after exhausting other options.

The comment previously posted by tibor was based on the statement "heavier use of the system", which could describe a hell of a lot of symptoms. Simplifying it in the manner specified was an assumption, and there are many questions that need to be asked to get to the root cause...

    that's all..

    respectfully,
    T

