
Here's a doozy of a Java problem... (i18n)

  • 07-06-2005 8:14pm
    #1
    Registered Users Posts: 597 ✭✭✭


    Anyone got a solution to this problem:

    In a nutshell: RandomAccessFile.readLine() naively assumes that a byte is a char. This of course is pants for multi-byte encoded files.

    Some detail:
    We have an app that uses RandomAccessFile to read huge data files where each line is a record. RandomAccessFile is used so that we can efficiently jump to a line in the file by byte offset, then read X amount of lines. This reading is done using RandomAccessFile.readLine().

    The problem is that I now need to support multibyte character encodings (Unicode, Big5, Shift_JIS, etc.) but RandomAccessFile.readLine() naively assumes that 1 byte is a char.

    I'm thinking of extending RandomAccessFile to be constructed with a Charset and adding a new method readEncodedLine(). But I'm unsure of the impl: I can read a number of bytes into an array and convert them to a String using Charset.decode(), but how many bytes do I read at a time, and how do I know if I've split a multibyte char between reads?
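One possible shape for the readEncodedLine() idea is sketched below. It is only a sketch under a stated assumption: the newline byte 0x0A never occurs inside a multibyte character. UTF-8 guarantees this; UTF-16 and UTF-32 do not, and other encodings should be checked before relying on it. Under that assumption the split-character question goes away, because the bytes are collected up to the newline and decoded in a single call. The class name EncodedLineReader is made up for illustration, and the helper is a static method rather than a RandomAccessFile subclass to keep it short.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;

// Hypothetical helper; the class and method names are illustrative only.
final class EncodedLineReader {

    private EncodedLineReader() {}

    /**
     * Reads raw bytes from the current file position up to the next '\n'
     * (or end of file) and decodes them in one call, so a multibyte
     * character is never split across two decode calls.
     * ASSUMPTION: byte 0x0A only ever means "newline" in this encoding.
     * Returns null when the file pointer is already at end of file.
     */
    static String readEncodedLine(RandomAccessFile file, Charset charset) throws IOException {
        ByteArrayOutputStream lineBytes = new ByteArrayOutputStream();
        boolean sawAnyByte = false;
        int b;
        while ((b = file.read()) != -1) {
            sawAnyByte = true;
            if (b == '\n') {
                break;
            }
            lineBytes.write(b);
        }
        if (!sawAnyByte) {
            return null;
        }
        String line = new String(lineBytes.toByteArray(), charset);
        // Drop a trailing '\r' so "\r\n" line endings are handled too.
        if (line.endsWith("\r")) {
            line = line.substring(0, line.length() - 1);
        }
        return line;
    }
}
```

After each call the file pointer sits on the first byte of the next line, so getFilePointer() still gives a usable byte offset. The single-byte read() calls are unbuffered and will be slow on huge files; reading into a byte[] buffer and scanning it for 0x0A would be the obvious next step.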


    The only other approach I can think of is to use a BufferedReader around an encoding-aware InputStreamReader. But this is gonna be pretty inefficient, as I'll potentially have to call readLine() a lot just to get to the data position (line number) where I need to begin reading the required data.

    Anyone got any ideas?


Comments

  • Closed Accounts Posts: 857 ✭✭✭davros


    Just a question... at the moment, you say you know the byte offset for any particular line. That implies you know the exact length of each line.

    In the multibyte character encoding scenario, you don't know the length of a line in bytes but you still plan to jump to the start of a particular line using RandomAccessFile's byte offset?

    If I imagine your data file is UTF-8 encoded, say, even if you know exactly how many characters there are per line, you can't say how many bytes offset to a particular line without examining every character in between to count its number of bytes.
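To make the point above concrete, here is a rough sketch of the "examine everything in between" step done once, up front: a single pass over the file that records the byte offset at which each line starts. It assumes, as with UTF-8, that the byte 0x0A only ever means a newline; the class name LineOffsetIndex is invented for the example.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of any library.
final class LineOffsetIndex {

    private LineOffsetIndex() {}

    /**
     * Scans the file once and returns the byte offset of the start of
     * each line, so offsets.get(n) is where line n (0-based) begins.
     * ASSUMPTION: byte 0x0A never appears inside a multibyte character.
     */
    static List<Long> build(String path) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            long position = 0;
            boolean atLineStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    offsets.add(position);
                    atLineStart = false;
                }
                position++;
                if (b == '\n') {
                    atLineStart = true;
                }
            }
        }
        return offsets;
    }
}
```

The offsets are counted in bytes, not characters, so they can be passed straight to RandomAccessFile.seek() regardless of how many bytes each character takes. For very large files the index could be written to disk rather than kept in a list.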

    Sorry, that's not very helpful. The problem sounds very interesting and I'm curious to hear solutions but I've never used random file access myself.


  • Closed Accounts Posts: 92 ✭✭tempest


    bambam wrote:
    In a nutshell: RandomAccessFile.readLine() naively assumes that a byte is a char. This of course is pants for multi-byte encoded files.

    I don't really think it's naive.... It's pretty well documented and that's just the way it works. :)

    How about something like the following. Basically wrap the random access file in an InputStream and create a buffered reader around that.

    Needs a bit of work to support the InputStream contract, but it should work.
    
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.RandomAccessFile;
    
    public class RandomAccessInputStream extends InputStream {
    
        private RandomAccessFile file = null;
        
        public RandomAccessInputStream(final RandomAccessFile file) {
            this.file = file;
        }
    
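    	/**
    	 * @see java.io.InputStream#read()
    	 */
    	public int read() throws IOException {
    		return file.read();
    	}

    	/**
    	 * Bulk read, so a wrapping reader does not fall back to one call per byte.
    	 * @see java.io.InputStream#read(byte[], int, int)
    	 */
    	public int read(final byte[] b, final int off, final int len) throws IOException {
    		return file.read(b, off, len);
    	}

    	/**
    	 * Skips forward relative to the current position. seek() takes an
    	 * absolute offset, so the current file pointer has to be added.
    	 * @see java.io.InputStream#skip(long)
    	 */
    	public long skip(final long n) throws IOException {
    		if (n <= 0) {
    			return 0;
    		}
    		final long position = file.getFilePointer();
    		// Never move past the end of the file; report what was actually skipped.
    		final long skipped = Math.max(0, Math.min(n, file.length() - position));
    		file.seek(position + skipped);
    		return skipped;
    	}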
    
    }
    
    import java.io.BufferedReader;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.RandomAccessFile;
    import java.io.UnsupportedEncodingException;
    
    public class MainClass {
    
    	public static void main(final String[] args) {
            
            final long byteOffset = 5400L;
            final String encoding = "UTF-8";
            
    		try {
    			RandomAccessFile file = new RandomAccessFile("afile", "r");
    			RandomAccessInputStream is = new RandomAccessInputStream(file);
    			
    			is.skip(byteOffset);
    			
    			InputStreamReader reader = new InputStreamReader(is, encoding);
    			BufferedReader bufferedReader = new BufferedReader(reader);
    			
    			String line = bufferedReader.readLine();
    		} catch (FileNotFoundException e) {
                e.printStackTrace();
    		} catch (UnsupportedEncodingException e) {
                e.printStackTrace();
    		} catch (IOException e) {
                e.printStackTrace();
    		}
            
    	}
    }
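For comparison, a similar effect can probably be had without a hand-written wrapper: java.nio.channels.Channels.newInputStream() can wrap the file's own channel, and that channel shares its position with the RandomAccessFile. The sketch below seeks to a byte offset and reads up to a given number of lines from there. The class name OffsetLineReader and its parameters are invented for the example, and it assumes byteOffset points at the first byte of a line (not into the middle of a multibyte character).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; names and parameters are illustrative only.
final class OffsetLineReader {

    private OffsetLineReader() {}

    /**
     * Seeks to byteOffset and reads at most maxLines decoded lines.
     * ASSUMPTION: byteOffset is the first byte of a line.
     */
    static List<String> readLines(String path, long byteOffset, int maxLines, Charset charset)
            throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(byteOffset);
            // The channel shares its position with the RandomAccessFile,
            // so the stream starts reading at byteOffset.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(Channels.newInputStream(file.getChannel()), charset));
            String line;
            while (lines.size() < maxLines && (line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

One caveat applies to any BufferedReader-based approach, including the wrapper above: the reader buffers ahead, so after reading, the file pointer no longer marks the end of the last line returned. If the next offset is needed, it has to be tracked separately.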
    
    


  • Registered Users Posts: 597 ✭✭✭bambam


    That looks interesting tempest, think I'll have a play with it.
    thx, Bam

