Here's a doozy of a Java problem... (i18n)

bambam · 07-06-2005 08:14PM #1

Anyone got a solution to this problem:

In a nutshell: RandomAccessFile.readLine() naively assumes that a byte is a char. This of course is pants for multi-byte encoded files.

Some detail:
We have an app that uses RandomAccessFile to read huge data files where each line is a record. RandomAccessFile is used so that we can efficiently jump to a line in the file by byte offset, then read X amount of lines. This reading is done using RandomAccessFile.readLine().

The problem is that I now need to support multibyte character encodings (Unicode, Big5, Shift_JIS etc.) but RandomAccessFile.readLine() naively assumes that 1 byte is a char.

I'm thinking of extending RandomAccessFile to be constructed with a Charset and adding a new method readEncodedLine(). But I'm unsure of the impl. I can read a number of bytes into an array and convert to a String using Charset.decode(), but how many bytes do I read at a time, and how do I know if I split a multibyte char between reads?
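On the split-character question, here is a minimal sketch (not a definitive implementation) of incremental decoding with java.nio's CharsetDecoder. When a chunk ends in the middle of a multibyte sequence, the decoder leaves those trailing bytes unconsumed in the input buffer, and compact() carries them over to the next chunk. The class and method names (ChunkedDecodeDemo, decodeInChunks) are hypothetical, and the byte array stands in for successive reads from the file.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical demo class, not part of the JDK or of the original app.
class ChunkedDecodeDemo {

    /** Decodes data in chunks of chunkSize bytes, carrying incomplete sequences forward. */
    static String decodeInChunks(byte[] data, int chunkSize, Charset charset) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be at least 1");
        }
        if (data.length == 0) {
            return "";
        }
        CharsetDecoder decoder = charset.newDecoder();
        // A few spare bytes leave room for a partial character carried over from the previous chunk.
        ByteBuffer in = ByteBuffer.allocate(chunkSize + 8);
        CharBuffer out = CharBuffer.allocate(
                (int) Math.ceil((chunkSize + 8) * decoder.maxCharsPerByte()));
        StringBuilder result = new StringBuilder();

        int pos = 0;
        while (pos < data.length) {
            int n = Math.min(chunkSize, data.length - pos);
            in.put(data, pos, n);
            pos += n;
            in.flip();

            boolean endOfInput = (pos == data.length);
            CoderResult r = decoder.decode(in, out, endOfInput);
            if (r.isError()) {
                throw new IllegalArgumentException("Undecodable input: " + r);
            }
            out.flip();
            result.append(out);
            out.clear();

            // Bytes of an incomplete character stay in 'in'; move them to the front.
            in.compact();
        }
        decoder.flush(out);
        out.flip();
        result.append(out);
        return result.toString();
    }
}
```

With this pattern the chunk size does not have to line up with character boundaries, so it can be chosen purely for I/O efficiency.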

The only other approach I can think of is to use a BufferedReader around an encoding aware InputStreamReader. But, this is gonna be pretty inefficient as I'll have to potentially call readLine() a lot in order to get to the data position (line number) I need to begin reading the required data.

Anyone got any ideas?

davros · 07-06-2005 11:50PM

Just a question... at the moment, you say you know the byte offset for any particular line. That implies you know the exact length of each line.

In the multibyte character encoding scenario, you don't know the length of a line in bytes but you still plan to jump to the start of a particular line using RandomAccessFile's byte offset?

If I imagine your data file is UTF-8 encoded, say, even if you know exactly how many characters there are per line, you can't say how many bytes offset to a particular line without examining every character in between to count its number of bytes.
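To make the point concrete, a small sketch comparing character count with UTF-8 byte count. ByteLengthDemo and utf8Length are hypothetical names, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: same number of chars, different number of bytes.
class ByteLengthDemo {

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "abc";
        String mixed = "a\u00e9\u4e2d"; // 'a', e-acute, and one CJK character

        // prints "3 chars, 3 bytes"
        System.out.println(ascii.length() + " chars, " + utf8Length(ascii) + " bytes");
        // prints "3 chars, 6 bytes"
        System.out.println(mixed.length() + " chars, " + utf8Length(mixed) + " bytes");
    }
}
```

So a byte offset to a given line can only come from counting encoded bytes, not characters.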

Sorry, that's not very helpful. The problem sounds very interesting and I'm curious to hear solutions but I've never used random file access myself.

tempest · 08-06-2005 09:10AM

bambam wrote:

In a nutshell: RandomAccessFile.readLine() naively assumes that a byte is a char. This of course is pants for multi-byte encoded files.

I don't really think it's naive.... It's pretty well documented and that's just the way it works.

How about something like the following. Basically wrap the random access file in an InputStream and create a buffered reader around that.

Needs a bit of work to support the InputStream contract, but it should work.


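import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;

/** Adapts a RandomAccessFile to the InputStream API so it can feed a Reader. */
public class RandomAccessInputStream extends InputStream {

    private final RandomAccessFile file;

    public RandomAccessInputStream(final RandomAccessFile file) {
        this.file = file;
    }

    /**
     * @see java.io.InputStream#read()
     */
    public int read() throws IOException {
        return file.read();
    }

    /**
     * Bulk read; without this override InputStream falls back to one read() call per byte.
     *
     * @see java.io.InputStream#read(byte[], int, int)
     */
    public int read(final byte[] b, final int off, final int len) throws IOException {
        return file.read(b, off, len);
    }

    /**
     * Skips forward relative to the current position and never past the end of the file.
     *
     * @see java.io.InputStream#skip(long)
     */
    public long skip(final long n) throws IOException {
        if (n <= 0) {
            return 0;
        }
        final long current = file.getFilePointer();
        // seek() takes an absolute position, so add the current offset
        final long skipped = Math.max(0L, Math.min(n, file.length() - current));
        file.seek(current + skipped);
        return skipped;
    }

}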

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;

public class MainClass {

    public static void main(final String[] args) {

        final long byteOffset = 5400L;
        final String encoding = "UTF-8";

        // try-with-resources (Java 7+) closes the file even if reading fails
        try (RandomAccessFile file = new RandomAccessFile("afile", "r")) {
            // Jump straight to the absolute byte offset where the wanted line starts
            file.seek(byteOffset);

            RandomAccessInputStream is = new RandomAccessInputStream(file);
            InputStreamReader reader = new InputStreamReader(is, encoding);
            BufferedReader bufferedReader = new BufferedReader(reader);

            // Note: the readers buffer ahead, so file.getFilePointer() will
            // usually be past the end of this line after the call
            String line = bufferedReader.readLine();
            System.out.println(line);
        } catch (IOException e) {
            // Also covers FileNotFoundException and UnsupportedEncodingException
            e.printStackTrace();
        }

    }
}
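If you need the file pointer to stay exactly at the start of the next line (the buffered reader approach reads ahead), another option is the readEncodedLine() idea from the first post: collect raw bytes up to the next '\n', then decode the whole line in one go so no character is ever split. The following is only a sketch under some assumptions: EncodedLineReader is a hypothetical class, and it relies on the byte 0x0A never appearing inside a multibyte character. That holds for UTF-8, and as far as I know for Big5 and Shift_JIS, but not for UTF-16.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the JDK.
class EncodedLineReader {

    private final RandomAccessFile file;
    private final Charset charset;

    EncodedLineReader(RandomAccessFile file, Charset charset) {
        this.file = file;
        this.charset = charset;
    }

    /**
     * Reads bytes up to the next '\n' (or end of file) and decodes them with
     * the configured charset. Returns null when already at end of file.
     */
    String readEncodedLine() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        // One read() per byte keeps the sketch short; a real version would buffer.
        while ((b = file.read()) != -1) {
            sawAny = true;
            if (b == '\n') {
                break;
            }
            buf.write(b);
        }
        if (!sawAny) {
            return null;
        }
        byte[] bytes = buf.toByteArray();
        int len = bytes.length;
        // Drop the '\r' of a Windows-style "\r\n" line ending
        if (len > 0 && bytes[len - 1] == '\r') {
            len--;
        }
        return new String(bytes, 0, len, charset);
    }
}
```

After each call, file.getFilePointer() is the byte offset of the next line, so the offsets can be recorded and reused for later seeks.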

bambam · 09-06-2005 09:13AM

That looks interesting tempest, think I'll have a play with it.
thx, Bam
