1 Sep 2004 09:55
Re: language-specific harvesting of texts from the Web
Stuart A Yeates <stuart.yeates <at> computing-services.oxford.ac.uk>
2004-09-01 07:55:11 GMT
Marco Baroni wrote:

>> One situation where your approach may not work so well is when a
>> language's websites use multiple character encodings. Unfortunately,
>> this is quite common in languages that have non-Roman writing systems,
>
> At least for Japanese, our way to get around this problem in our
> web-mining scripts was to look for the charset declaration in the html
> code of each page, and then to convert (inside the script) the page from
> that charset to utf8.
>
> I would be interested in hearing about other ways to deal with multiple
> encodings.

textcat (http://odur.let.rug.nl/~vannoord/TextCat/) is a language and
encoding guesser which reliably guesses text language and encoding based
solely on examples and statistics. It knows 69 natural languages and is
open source.

I've had good experience using the built-in Java encoding converters
(readers and writers shipped for ~100 encodings as standard) to convert
between encodings. They are freely available.

cheers
stuart

-- 
Stuart Yeates            stuart.yeates <at> computing-services.oxford.ac.uk
OSS Watch                http://www.oss-watch.ac.uk/
Oxford Text Archive      http://ota.ahds.ac.uk/
Humbul Humanities Hub    http://www.humbul.ac.uk/
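
For illustration, the charset-declaration approach quoted above, combined with the built-in Java converters, might look roughly like the sketch below. This is not the code from either poster's scripts. The class name CharsetSniffer, the method names, the regular expression, and the 4096-byte scan window are all assumptions made for the example. It also assumes the page uses an ASCII-compatible encoding, so the declaration itself is readable before the real charset is known. It looks for a declaration only; it does no statistical guessing of the kind textcat performs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: find the declared charset of an HTML page and
// re-encode the page as UTF-8 using the JDK's built-in converters.
final class CharsetSniffer {

    // Matches e.g. content="text/html; charset=Shift_JIS" or <meta charset="utf-8">.
    // The pattern is an assumption for this sketch, not a full HTML parser.
    private static final Pattern CHARSET_DECL = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9_:.\\-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the first declaration found near the
    // start of the page, or the fallback if none is found or the JVM
    // does not support the declared name.
    static Charset declaredCharset(byte[] page, Charset fallback) {
        int n = Math.min(page.length, 4096); // scan window is an arbitrary choice
        // ISO-8859-1 maps every byte to a char, so ASCII markup survives
        // even when the real encoding is still unknown.
        String head = new String(page, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET_DECL.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through.
            }
        }
        return fallback;
    }

    // Decodes the page with the declared (or fallback) charset and
    // re-encodes it as UTF-8.
    static byte[] toUtf8(byte[] page, Charset fallback) {
        Charset cs = declaredCharset(page, fallback);
        return new String(page, cs).getBytes(StandardCharsets.UTF_8);
    }
}
```

A caller would read the raw bytes of each harvested page and pass them to toUtf8 together with a fallback charset chosen for the target language. Pages whose declaration is missing or wrong would still need a guesser such as textcat.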