Japanese encoding in Perl

Alligator Wine · 21-04-2006 4:21pm #1

Hi, it's my first time dealing with a foreign character set in Perl and I'm having a few problems.

For the moment I just want to read in a file which is encoded in Unicode and output each line to another file (later I want to deal with regexes, but I wanted to start off simple). When I open the new output file it doesn't match the original. Could someone please tell me where I'm going wrong? Cheers + here's the code I've written so far.

#!/usr/bin/perl

$inFile=$ARGV[0];
$outFile=$ARGV[1];

open (IN, $inFile) || die "cannot open file: $inFile\n";
chomp(@lines = <IN>);
close (IN);

open (OUT, "$outFile") || die "cannot write to file: $outFile\n";
for ($i = 0; $i <= $#lines; $i++) {
  print OUT "$lines[$i]\n";
}

daymobrew · 21-04-2006 5:51pm

What if you do it without using chomp?

#!/usr/bin/perl

$inFile=$ARGV[0];
$outFile=$ARGV[1];

open (IN, $inFile) || die "cannot open file: $inFile\n";
open (OUT, "$outFile") || die "cannot write to file: $outFile\n";
while ( <IN> )
{
  print OUT;
}
close (IN);
close (OUT);

Don't forget to close the OUT filehandle (I realise that it will be closed anyway when the script ends).

Alligator Wine · 22-04-2006 11:31am

It doesn't make any difference whether I use chomp or not. Same with closing the output file.

The problem is with character encoding and what perl interprets as a character. I'm just wondering how people normally handle foreign character sets such as Japanese.

daymobrew · 22-04-2006 2:42pm

I don't have to deal with non-English data so I don't know the right way.
The perllocale page (perldoc perllocale) might help. It explains how to make perl assume a different locale when working on data.
Also look at:
perldoc perluniintro (Perl Unicode introduction)
perldoc perlunicode (Unicode support in perl)

Google Groups is a good place to look too.

MrScruff · 24-04-2006 11:57am

Your code works for me using a UTF-8 Japanese XML file, with the addition of a ">", i.e.

open (OUT, ">$outFile") || die "cannot write to file: $outFile\n";

What encoding is your input file using?
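
If it turns out not to be UTF-8, one option is to tell Perl the encoding explicitly with an I/O layer on each open, as described in perluniintro. The following is only a minimal sketch under some assumptions: the sub name copy_with_encoding and the UTF-8 defaults are made up for illustration, and you would pass whatever your file really uses (for example "shiftjis", "euc-jp" or "UTF-16", all names known to the Encode module).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy a text file line by line, decoding it from $in_enc on input and
# encoding it as $out_enc on output. The sub name and the UTF-8 defaults
# are assumptions for illustration, not part of the original script.
sub copy_with_encoding {
    my ($in_file, $out_file, $in_enc, $out_enc) = @_;
    $in_enc  ||= 'UTF-8';
    $out_enc ||= 'UTF-8';

    # Three-argument open: explicit mode ("<" or ">") plus an :encoding layer.
    open(my $in, "<:encoding($in_enc)", $in_file)
        or die "cannot open file: $in_file: $!\n";
    open(my $out, ">:encoding($out_enc)", $out_file)
        or die "cannot write to file: $out_file: $!\n";

    # Lines are decoded to Perl characters on read and re-encoded on write.
    while (my $line = <$in>) {
        print {$out} $line;
    }

    close($in);
    close($out) or die "cannot close file: $out_file: $!\n";
    return;
}

# Usage: perl copy.pl in.txt out.txt [input-encoding] [output-encoding]
copy_with_encoding(@ARGV) if @ARGV >= 2;
```

With the same encoding on both sides this should reproduce the input; with different ones it converts between them. Because the lines are decoded into characters rather than raw bytes, later regexes should match whole Japanese characters instead of individual bytes.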
