May 09, 2005

Comments

Precision Blogger

For me this has been a common problem with Netscape. Do you "refresh the page view" when this happens? That usually fixes the problem for me.
- The Precision Blogger
http://precision-blogging.blogspot.com

Mark Gisleson

I'm pretty sure this stems from the Mac and Windows using slightly different character sets. For a real mess, try cutting and pasting Word text into a blog sometime.

Dan Gillmor

I'm using Firefox and Safari. Happens with both...

Joe Abley

http://www.fourmilab.ch/webtools/demoroniser/

Skott Klebe

The web page uses the UTF-8 charset, which is reasonably universal - that is, it's a Unicode character set that includes scads of international characters. I run Firefox, which I assume defaults to UTF-8 (at least, I don't remember ever changing that setting), and the page looks fine. If I change my character set to Western (ISO 8859-1), I get exactly the result you're showing.
The good news? If you change your setting to UTF-8 (on Firefox, it's View | Character Encoding | Unicode), you will not only get better results on this page, but on almost any other page as well.

Matt

Dan, usually the problem creeps in when people compose the posts one place (like Word, as Mark mentioned above) before posting them into their blog entry field. It's usually curly quotes, which can be hard to spot if you're not looking for them. Word also doesn't make the problem easy to fix. (To keep straight quotes in Word, you have to go into the Tools > AutoCorrect Options submenu, and uncheck "Replace straight quotes with curly quotes" under both "AutoFormat" and "AutoFormat as you type." I've created a Word macro that finds and replaces my curly quotes automatically.)
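
For anyone who would rather clean pasted text outside Word, the same find-and-replace idea can be sketched in Python. This is not Matt's macro (which isn't shown in the comment); the name `straighten_quotes` and the mapping table are made up for illustration, and the sketch only handles the four common curly quote characters.

```python
# Hypothetical helper: map curly ("smart") quotes to plain ASCII quotes.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (also used as apostrophe)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def straighten_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(CURLY_TO_STRAIGHT)


print(straighten_quotes("We\u2019d like \u201cplain\u201d quotes"))
```

Other Word substitutions, such as dashes and ellipses, would need their own entries in the table.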

Alex Harford

Dan, the strange thing is that it doesn't happen to me when I view the page in Thunderbird through RSS. I have it set up to view the article only, not load the web page.

I was confused about your post, so I went to have a look in Firefox, and it's broken for me as well.

Don't give up on unmediated, it's a good site!

Alex

PS. What's wrong with my site's url http://wasabi.dynu.com:8080/blog ?

Don Marti

That page does have some special characters that don't pass validation. One more reason to run the W3C validator on new designs and software, and every once in a while on new content.

A stand-alone tool like demoroniser would fix this, but non-standard characters are really something that the CMS or blog software should be checking for and fixing -- for example, replacing word processor quotes with the correct, standard characters. Making your page comply with web standards doesn't mean that you have to go back to "typewriter quotes".

How bout these here real quotation marks? he concluded.

Jim Wilde

Hi Dan,

Please do not give up on del.icio.us! There is a lot of development action happening in real time on del.icio.us. Yeah, one of those cool social experiments where software is developed as you interact. I just love it. I use it as something of a wrapper for a couple of functions in my own tool - Ideascape. So I try to stay involved with almost every delicious development. Please be patient and great things will happen!

Try giving feedback to MS or Apple.

Eric

Works for me.

The page clearly advertises itself as UTF-8, but your browser interprets it in some iso-latin variant for some reason.

What is selected in Safari's View / Text Encoding menu? Mine is set to "Default" and works like a charm. Try it. Or try UTF-8.

Jeremy Pepper

Come back, Dan, come back.

I think I figured out the problem - I usually type my posts up in Word or Outlook, and then cut and paste them into Blogger.

I do write in Edit HTML, instead of WYSIWYG, but I guess on the smart quotes, I get burnt. I'll be more diligent (observant?).

Thanks!

Larry Rosenstein

It's a problem with the data being sent by the server, so browser-side tweaking won't do it. It looks like the text was converted to UTF-8 twice.

I dumped the page to a file and where it is supposed to say "We'd", between the "e" and "d" were the following bytes: c3 a2 c2 80 c2 99. Decoding this as UTF-8 twice produces U+2019, which is RIGHT SINGLE QUOTATION MARK in Unicode parlance.
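
The double decoding described above can be checked with a few lines of Python 3. This is only a sketch of the arithmetic: the byte values are the ones quoted in the comment, and the intermediate ISO 8859-1 step is an assumption about how the second pass gets back to raw bytes.

```python
raw = bytes.fromhex("c3a2c280c299")  # the six bytes found between "e" and "d"

# First pass: decode the bytes as UTF-8. This yields three code points,
# U+00E2 U+0080 U+0099, rather than one quotation mark.
once = raw.decode("utf-8")

# All three code points are below U+0100, so encoding them as ISO 8859-1
# turns each one back into a single byte: e2 80 99.
inner = once.encode("iso-8859-1")

# Second pass: decode as UTF-8 again to recover the intended character.
twice = inner.decode("utf-8")

print([hex(ord(c)) for c in once])
print(inner.hex())
print(hex(ord(twice)))
```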

F Ho

I found the same problem (letter a with caret, followed by a special symbol that appears to be the Euro symbol) when I was looking at some articles in Ars Technica using a PC with Firefox. So this problem does not appear to be limited to a Mac. I tried the suggestion above to switch to UTF-8, but it had no effect. I then looked at the 'unmediated' page, and in this case, I see the letter a with caret, but without the following symbol, so the problem was not as bad there.

I tried reading those pages with Internet Explorer (which was already set to UTF-8). The Ars Technica page looked the same; but for the 'unmediated' page, '$ $ ' shows up after the letter a with caret.

Isn't the demoroniser fix mentioned above intended for people generating web pages, and not for processing incoming web page?

Ben Gertzfield

This is a really common problem I see nowadays, which results from programmers assuming that any string they input into their program is encoded in ISO 8859-1 when it's actually sometimes UTF-8.

What's happened here is that each right single quote character (U+2019 in Unicode) has been first encoded in UTF-8. Then *each byte* of the multibyte UTF-8 sequence was accidentally misinterpreted as *separate* single-byte characters encoded in ISO 8859-1. Finally, the incorrect characters were re-encoded in UTF-8.

Here's exactly what happened for the ’ (U+2019, the right single quotation mark in "We’d like.."), which is \u2019 in Python syntax. I'm doing the following by opening Terminal on my Mac and running 'python'.

[ben@misc-dhcp32:~]% python
>>> s = u"We\u2019d like"
>>> utf8 = s.encode("utf-8")
>>> utf8
'We\xe2\x80\x99d like'

(note how the Unicode character U+2019 is correctly represented as the three bytes 0xe2 0x80 0x99 in UTF-8)

>>> wrong_s = unicode(utf8, 'iso-8859-1')

(Now we've incorrectly interpreted each byte of the UTF-8 string as a separate ISO 8859-1 character when converting to Unicode -- this happens a lot when lazy programmers assume that any string they come across must be ISO 8859-1, since it's the default on English Windows.

The single Unicode character U+2019 has become *three* Unicode characters U+00E2 U+0080 U+0099, which are rendering for me as â, € and ™. Interestingly, U+0080 is *not* the Euro symbol (that's U+20AC), but in Windows-1252 encoding, 0x80 is the Euro symbol. Some weird mapping is going on there in Safari to come up with that result).

>>> wrong_s.encode("utf-8")
'We\xc3\xa2\xc2\x80\xc2\x99d like'

Now we end up with 6 bytes of UTF-8, which is exactly what appears in the source of the HTML in that page. It's also three legitimate Unicode characters expressed in Unicode:

>>> wrong_s
u'We\xe2\x80\x99d like'

This is why you see three garbage characters on that page.
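
For readers on Python 3, where `unicode()` and the `u"..."` repr no longer exist, the same round trip might look like the sketch below. It follows the steps in the session above; the variable names `double` and `seen` and the final Windows-1252 check are additions for illustration.

```python
s = "We\u2019d like"

utf8 = s.encode("utf-8")              # three bytes for U+2019: e2 80 99
wrong_s = utf8.decode("iso-8859-1")   # each byte wrongly read as one character
double = wrong_s.encode("utf-8")      # re-encoded: six bytes where three belong

print(utf8)
print(double)

# Reading the same three bytes as Windows-1252 instead of ISO 8859-1 gives
# U+00E2, U+20AC (euro sign) and U+2122 (trade mark sign), which matches the
# characters that show up on screen.
seen = b"\xe2\x80\x99".decode("cp1252")
print([f"U+{ord(c):04X}" for c in seen])
```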

Ben

Anna

Not sure this adds anything useful but:

Garbage chars might also happen whenever the page's stated encoding (in Content-Type meta tag at top of page) doesn't match the actual encoding. (fyi, Dan's weblog says it's charset=utf-8 which seems to be standard. )

"Isn't the demoroniser fix mentioned above intended for people generating web pages, and not for processing incoming web page?"

Yes, since that is where the mismatch is created. It's nice to be able to point site owners to it though.

(In case anyone else thinks, wrongly, that this will work: running the text through Notepad (pasting it into Notepad, then copying it back out of Notepad) didn't help.)

I was poking around this weekend looking at this issue, and it was hard slogging due to terminology - does "convert" mean "encode so you can get your special characters back, when it's converted back" or does it mean "transform, permanently, into normal text"?
Although of course it's not permanent if Microsoft can get its paws on it.

For those who haven't seen it - Doin' The Backspace Mambo With Microsoft Word 2000 - "...The bile rises into the back of the throat, you begin to quiver a bit as you reach for the mouse..."

Anspar Jonte

I see those characters frequently from Mac users who paste text into a listserv e-mail.

Christian McDonald

We had the same problem as our authors often use MS Word to compose their Movable Type blog entries. I'm not sure this is the most elegant way to handle it, but we resolved it by adding to our html files.

Christian McDonald

OK, so that last post didn't include the code, of course: meta http-equiv="content-type" content="text/html; charset=utf-8"

Bruce

It's been my experience that this problem occurs when users paste text from a Microsoft application into a textbox on a web page, or upload an MS-edited file directly to a web server.

Windows uses a close variant of the standard charset "ISO 8859-1 (Latin 1)" called "Microsoft Windows Codepage : 1252 (Latin I)". The problem is that this charset uses character addresses for various display characters (such as curly quotes and emdashes) that, in ISO 8859-1, are reserved for future control characters (think new line markers) that are yet to be defined. When text is placed into a form control (as in a web page) or into a document which is then rendered in ISO 8859-1 (or UTF-8) the Microsoft extensions are rendered incorrectly.
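
The difference between the two charsets can be seen by decoding the same bytes both ways. The following is a small Python 3 sketch, not part of any tool mentioned in this thread; the six byte values are a hand-picked sample from the 0x80 to 0x9F range where the charsets disagree.

```python
import unicodedata

# Bytes that Windows-1252 uses for curly quotes and dashes.
for byte in (0x91, 0x92, 0x93, 0x94, 0x96, 0x97):
    raw = bytes([byte])
    as_cp1252 = raw.decode("cp1252")        # a printable punctuation mark
    as_latin1 = raw.decode("iso-8859-1")    # a code point in U+0080..U+009F
    print(
        f"0x{byte:02x}",
        unicodedata.name(as_cp1252),
        "| ISO 8859-1 category:",
        unicodedata.category(as_latin1),    # "Cc" means control character
    )
```

Each line of output pairs the Windows-1252 character name with the Unicode category of the ISO 8859-1 reading, which is the control-character category for every byte in this sample.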

This is yet another example of Microsoft extending a standard for their own purposes. The ostensible reason is that their charset allows their customers to create more compelling text with the use of characters that are not available with the standard charsets. Their solution does accomplish that, but users of other platforms (and occasionally third-party applications on Microsoft platforms) suffer when this "enhancement" creeps into documents meant to be shared across platforms.

Cynics argue that this is one way that Microsoft attempts to lock in customers: by making sure that MS-generated documents are only fully compatible between MS products. I'm not certain that this is a true motive, but the results certainly tend to support the idea.

Bruce

By the way, it is certainly possible to create a filter/converter to address this problem. I have one written in perl, but it is designed to run in a web server context. As you are unlikely to make use of it yourself since your blog runs on typepad, I suggest that you contact typepad and request a feature enhancement.

Bruce

RE: demoronizer

I haven't heard of that particular perl script, but it looks like it'll do the trick. It's the same thing that I implemented myself, but probably their version is better. (Mine was never meant for distribution, strictly an in-house thing.) So yeah, give it a shot.

Anna

A Perl script isn't ideal since you can't just tell other people "go run this", since it requires that they install perl, which requires that they either trust you or know what they're doing.

Wish there was an online form for demoronizing, akin to the disemvoweller ...

Bruce, is your script online somewhere? (online as in "readable", not necessarily as in "runnable")

yatta

That is weird, Dan. I guess no one ever noticed it before and no one ever said anything. Thanks for spotting it. We'll go clean up the mess.

Patrick Hall

I suppose one horribly hackish but nonetheless effective solution would be to make a javascript bookmarklet (or even better, at least for Firefox, a Greasemonkey extension) that would convert the mangled sequences into quotations.
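
The conversion such a script would perform can be sketched as follows. This is written in Python rather than JavaScript purely for illustration, the name `repair_double_encoding` is hypothetical, and it assumes the text was mangled by exactly one round of UTF-8 being misread as Windows-1252 or ISO 8859-1.

```python
def repair_double_encoding(text: str) -> str:
    """Undo one round of UTF-8 misread as a single-byte charset.

    Hypothetical helper: tries Windows-1252 first, then ISO 8859-1, and
    returns the input unchanged if neither reversal yields valid UTF-8.
    """
    for charset in ("cp1252", "iso-8859-1"):
        try:
            return text.encode(charset).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text  # leave anything we cannot repair untouched


mangled = "We\u00e2\u20ac\u2122d like"  # a-circumflex, euro sign, trade mark sign
fixed = repair_double_encoding(mangled)
print(ascii(fixed))
```

The sketch works on a whole string at once, so a page that mixes mangled and correctly encoded non-ASCII text would be left as-is; a real script would probably need to repair one short run of characters at a time.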

Kat

RE: Dan, usually the problem creeps in when people compose the posts one place (like Word, as Mark mentioned above) before posting them into their blog entry field. It's usually curly quotes, which can be hard to spot if you're not looking for them. Word also doesn't make the problem easy to fix. (To keep straight quotes in Word, you have to go into the Tools > AutoCorrect Options submenu, and uncheck "Replace straight quotes with curly quotes" under both "AutoFormat" and "AutoFormat as you type." I've created a Word macro that finds and replaces my curly quotes automatically.)

Posted by: Matt | May 9, 2005 09:14 AM

Thank you for posting on this problem, Dan. And for the great response by Matt! Following this advice, I was able to get WORD and Blogger to like each other! Thank you both *so* much!
