UPDATED
I snipped the above from the unmediated site to show a problem I occasionally run into when browsing on my Mac: Quotes and apostrophes (and some other punctuation marks) get garbled with odd characters.
I'm not sure what causes this, but it causes me to avoid returning to such sites.
UPDATE: From the comments -- is this the answer? (There are some terrific, detailed comments if you're into the nuts and bolts of why this happens.)
For me this has been a common problem with Netscape. Do you "refresh the page view" when this happens? That usually fixes the problem for me.
- The Precision Blogger
http://precision-blogging.blogspot.com
Posted by: Precision Blogger | May 09, 2005 at 08:29 AM
I'm pretty sure this stems from the Mac and Windows using slightly different character sets. For a real mess, try cutting and pasting Word text into a blog sometime.
Posted by: Mark Gisleson | May 09, 2005 at 08:40 AM
I'm using Firefox and Safari. Happens with both...
Posted by: Dan Gillmor | May 09, 2005 at 08:46 AM
http://www.fourmilab.ch/webtools/demoroniser/
Posted by: Joe Abley | May 09, 2005 at 08:51 AM
The web page uses the UTF-8 charset, which is reasonably universal - that is, it's a Unicode character set that includes scads of international characters. I run Firefox, which I assume defaults to UTF-8 (at least, I don't remember ever changing that setting), and the page looks fine. If I change my character set to Western (ISO 8859-1), I get exactly the result you're showing.
The good news? If you change your setting to UTF-8 (on Firefox, it's View | Character Encoding | Unicode), you will not only get better results on this page, but on almost any other page as well.
Posted by: Skott Klebe | May 09, 2005 at 09:12 AM
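A minimal sketch of the mismatch described above, written for Python 3 (an assumption; the thread's own transcript further down uses Python 2). The sample string and variable names are illustrative only. It decodes the same UTF-8 bytes two ways, roughly what a browser does under the Unicode and Western (ISO 8859-1) settings.

```python
# Bytes for "We'd like" with a curly apostrophe (U+2019), as a UTF-8 page
# would send them. The sample text is illustrative.
page_bytes = "We\u2019d like".encode("utf-8")

# Browser set to Unicode (UTF-8): the three bytes e2 80 99 become one character.
as_utf8 = page_bytes.decode("utf-8")

# Browser set to Western (ISO 8859-1): each byte becomes its own character.
as_latin1 = page_bytes.decode("iso-8859-1")

print(repr(as_utf8))
print(repr(as_latin1))
```

The second result is two characters longer than the first, which is where the extra garbage on screen comes from.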
Dan, usually the problem creeps in when people compose the posts one place (like Word, as Mark mentioned above) before posting them into their blog entry field. It's usually curly quotes, which can be hard to spot if you're not looking for them. Word also doesn't make the problem easy to fix. (To keep straight quotes in Word, you have to go into the Tools > AutoCorrect Options submenu, and uncheck "Replace straight quotes with curly quotes" under both "AutoFormat" and "AutoFormat as you type." I've created a Word macro that finds and replaces my curly quotes automatically.)
Posted by: Matt | May 09, 2005 at 09:14 AM
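The same find-and-replace idea can be sketched outside Word. This is not Matt's macro; it is a small Python 3 illustration, and the function name `straighten_quotes` and the choice of four characters are assumptions.

```python
# Curly ("smart") quotes mapped to their plain ASCII equivalents.
# Only four characters are covered here; a real filter would likely want more.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(SMART_TO_PLAIN)


print(straighten_quotes("\u201cWe\u2019d like\u201d"))
```

Running pasted text through something like this before posting sidesteps the encoding question entirely, at the cost of losing the typographic quotes.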
Dan, the strange thing is that it doesn't happen to me when I view the page in Thunderbird through RSS. I have it set up to view the article only, not load the web page.
I was confused about your post, so I went to have a look in Firefox, and it's broken for me as well.
Don't give up on unmediated, it's a good site!
Alex
PS. What's wrong with my site's url http://wasabi.dynu.com:8080/blog ?
Posted by: Alex Harford | May 09, 2005 at 09:39 AM
That page does have some special characters that don't pass validation. One more reason to run the W3C validator on new designs and software, and every once in a while on new content.
A stand-alone tool like demoroniser would fix this, but non-standard characters are really something that the CMS or blog software should be checking for and fixing -- for example, replacing word processor quotes with the correct, standard characters. Making your page comply with web standards doesn't mean that you have to go back to "typewriter quotes".
“How ’bout these here real quotation marks?” he concluded.
Posted by: Don Marti | May 09, 2005 at 09:42 AM
Hi Dan,
Please do not give up on del.icio.us! There is a lot of development action happening in real time on del.icio.us. Yeah, one of those cool social experiments where software is developed as you interact. I just love it. I use it as something of a wrapper for a couple of functions in my own tool - Ideascape. So I try to stay involved with almost every delicious development. Please be patient and great things will happen!
Try giving feedback to MS or Apple.
Posted by: Jim Wilde | May 09, 2005 at 09:53 AM
Works for me.
The page clearly advertises itself as UTF-8, but your browser interprets it in some iso-latin variant for some reason.
What is selected in safari/View/Text encoding menu? Mine is set to 'Default' and works like a charm. Try it. Or try UTF-8.
Posted by: Eric | May 09, 2005 at 09:57 AM
Come back, Dan, come back.
I think I figured out the problem - I usually type my posts up in Word or Outlook, and then cut and paste them into Blogger.
I do write in Edit HTML, instead of WYSIWYG, but I guess on the smart quotes, I get burnt. I'll be more diligent (observant?).
Thanks!
Posted by: Jeremy Pepper | May 09, 2005 at 10:18 AM
It's a problem with the data being sent by the server, so browser-side tweaking won't do it. It looks like the text was converted to UTF-8 twice.
I dumped the page to a file, and where it is supposed to say "We'd", the following bytes sat between the "e" and the "d": c3 a2 c2 80 c2 99. Decoding this as UTF-8 twice produces U+2019, which is RIGHT SINGLE QUOTATION MARK in Unicode parlance.
Posted by: Larry Rosenstein | May 09, 2005 at 10:30 AM
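The double decode Larry describes can be checked in a few lines. This is a sketch in Python 3 (an assumption; he does not say what tool he used), and the variable names are illustrative.

```python
import unicodedata

# The six bytes found between "e" and "d" in the saved page.
raw = bytes.fromhex("c3 a2 c2 80 c2 99")

# First UTF-8 decode: three characters, U+00E2 U+0080 U+0099.
once = raw.decode("utf-8")

# ISO 8859-1 maps code points 0-255 straight back to single bytes, so encoding
# with it recovers the inner byte string e2 80 99, which is then decoded as
# UTF-8 a second time.
twice = once.encode("iso-8859-1").decode("utf-8")

print([hex(ord(c)) for c in once])
print(hex(ord(twice)), unicodedata.name(twice))
```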
I found the same problem (letter a with caret, followed by a special symbol that appears to be the Euro symbol) when I was looking at some articles in Ars Technica using a PC with Firefox. So this problem does not appear to be limited to a Mac. I tried the suggestion above to switch to UTF-8, but it had no effect. I then looked at the 'unmediated' page, and in this case, I see the letter a with caret, but without the following symbol, so the problem was not as bad there.
I tried reading those pages with Internet Explorer (which was already set to UTF-8). The Ars Technica page looked the same; but for the 'unmediated' page, '$ $ ' shows up after the letter a with caret.
Isn't the demoroniser fix mentioned above intended for people generating web pages, and not for processing incoming web pages?
Posted by: F Ho | May 09, 2005 at 10:43 AM
This is a really common problem I see nowadays, which results from programmers assuming that any string they input into their program is encoded in ISO 8859-1 when it's actually sometimes UTF-8.
What's happened here is that each right single quote character (U+2019 in Unicode) has been first encoded in UTF-8. Then *each byte* of the multibyte UTF-8 sequence was accidentally misinterpreted as *separate* single-byte characters encoded in ISO 8859-1. Finally, the incorrect characters were re-encoded in UTF-8.
Here's exactly what happened for the ’ (U+2019, the right single quotation mark in "We’d like.."), which is \u2019 in Python syntax. I'm doing the following by opening Terminal on my Mac and running 'python'.
[ben@misc-dhcp32:~]% python
>>> s = u"We\u2019d like"
>>> utf8 = s.encode("utf-8")
>>> utf8
'We\xe2\x80\x99d like'
(note how the Unicode character U+2019 is correctly represented as the three bytes 0xe2 0x80 0x99 in UTF-8)
>>> wrong_s = unicode(utf8, 'iso-8859-1')
(Now we've incorrectly interpreted each byte of the UTF-8 string as a separate ISO 8859-1 character when converting to Unicode -- this happens a lot when lazy programmers assume that any string they come across must be ISO 8859-1, since it's the default on English Windows.
The single Unicode character U+2019 has become *three* Unicode characters U+00E2 U+0080 U+0099, which are rendering for me as â, € and ™. Interestingly, U+0080 is *not* the Euro symbol (that's U+20AC), but in Windows-1252 encoding, 0x80 is the Euro symbol. Some weird mapping is going on there in Safari to come up with that result).
>>> wrong_s.encode("utf-8")
'We\xc3\xa2\xc2\x80\xc2\x99d like'
Now we end up with 6 bytes of UTF-8, which is exactly what appears in the source of the HTML in that page. Decoded as UTF-8, those bytes are three legitimate Unicode characters:
>>> wrong_s
u'We\xe2\x80\x99d like'
This is why you see three garbage characters on that page.
Ben
Posted by: Ben Gertzfield | May 09, 2005 at 10:54 AM
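Ben's session above is Python 2, where `unicode(...)` and the `u"..."` prefix were the way to get Unicode strings. For readers on Python 3, the same steps might look like the sketch below. The variable names follow his transcript; `double_encoded`, `euro` and `trademark` are added here for illustration.

```python
s = "We\u2019d like"

# Step 1: correct UTF-8 encoding, three bytes for U+2019.
utf8 = s.encode("utf-8")

# Step 2: the mistake. Python 3 spelling of unicode(utf8, 'iso-8859-1'):
# every byte is read as a separate ISO 8859-1 character.
wrong_s = utf8.decode("iso-8859-1")

# Step 3: the wrong characters are encoded as UTF-8 again, giving six bytes.
double_encoded = wrong_s.encode("utf-8")

print(utf8)
print(double_encoded)

# The Windows-1252 aside: bytes 0x80 and 0x99 are the Euro and trademark
# signs in that code page.
euro = b"\x80".decode("cp1252")
trademark = b"\x99".decode("cp1252")
print(euro, trademark)
```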
Not sure this adds anything useful but:
Garbage chars might also happen whenever the page's stated encoding (in the Content-Type meta tag at the top of the page) doesn't match the actual encoding. (FYI, Dan's weblog says it's charset=utf-8, which seems to be standard.)
"Isn't the demoroniser fix mentioned above intended for people generating web pages, and not for processing incoming web page?"
Yes, since that is where the mismatch is created. It's nice to be able to point site owners to it though.
(In case anyone else thinks, wrongly, that this will work: running the text through Notepad (pasting it into Notepad, then copying it back out of Notepad) didn't help.)
I was poking around this weekend looking at this issue, and it was hard slogging due to terminology - does "convert" mean "encode so you can get your special characters back, when it's converted back" or does it mean "transform, permanently, into normal text"?
Although of course it's not permanent if Microsoft can get its paws on it.
For those who haven't seen it - Doin' The Backspace Mambo With Microsoft Word 2000 - "...The bile rises into the back of the throat, you begin to quiver a bit as you reach for the mouse..."
Posted by: Anna | May 09, 2005 at 12:59 PM
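One rough way to test for the stated-versus-actual mismatch mentioned above is to try decoding the page body with the charset it declares. This is a Python 3 sketch under stated assumptions: the helper name `decodes_cleanly` is hypothetical, and the sample bodies are made up for illustration.

```python
def decodes_cleanly(body: bytes, declared_charset: str) -> bool:
    """Return True if body is valid under the charset the page declares."""
    try:
        body.decode(declared_charset)
    except UnicodeDecodeError:
        return False
    return True


# "We'd like" with a curly apostrophe, saved two different ways.
cp1252_body = "We\u2019d like".encode("cp1252")   # apostrophe is byte 0x92
utf8_body = "We\u2019d like".encode("utf-8")      # apostrophe is e2 80 99

print(decodes_cleanly(cp1252_body, "utf-8"))      # Windows text labelled UTF-8
print(decodes_cleanly(utf8_body, "utf-8"))
print(decodes_cleanly(utf8_body, "iso-8859-1"))
```

The check only works in one direction. ISO 8859-1 accepts every possible byte, so UTF-8 text mislabelled as ISO 8859-1 passes even though it will display as garbage.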
I see those characters frequently from Mac users who paste text into a listserv e-mail.
Posted by: Anspar Jonte | May 09, 2005 at 02:36 PM
We had the same problem as our authors often use MS Word to compose their Movable Type blog entries. I'm not sure this is the most elegant way to handle it, but we resolved it by adding to our html files.
Posted by: Christian McDonald | May 10, 2005 at 06:39 AM
OK, so that last post didn't include the code, of course: meta http-equiv="content-type" content="text/html; charset=utf-8"
Posted by: Christian McDonald | May 10, 2005 at 06:40 AM
It's been my experience that this problem occurs when users paste text from a Microsoft application into a textbox on a web page, or upload an MS-edited file directly to a web server.
Windows uses a close variant of the standard charset "ISO 8859-1 (Latin 1)" called "Microsoft Windows Codepage : 1252 (Latin I)". The problem is that this charset uses character addresses for various display characters (such as curly quotes and emdashes) that, in ISO 8859-1, are reserved for future control characters (think new line markers) that are yet to be defined. When text is placed into a form control (as in a web page) or into a document which is then rendered in ISO 8859-1 (or UTF-8) the Microsoft extensions are rendered incorrectly.
This is yet another example of Microsoft extending a standard for their own purposes. The ostensible reason is that their charset allows their customers to create more compelling text with the use of characters that are not available with the standard charsets. Their solution does accomplish that, but users of other platforms (and occasionally third-party applications on Microsoft platforms) suffer when this "enhancement" creeps into documents meant to be shared across platforms.
Cynics argue that this is one way that Microsoft attempts to lock in customers: by making sure that MS generated documents are only fully compatible between MS products. I'm not certain that this is a true motive, but the results certainly tend to support the idea.
Posted by: Bruce | May 10, 2005 at 09:56 AM
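The overlap Bruce describes is easy to see byte by byte. The sketch below assumes Python 3 and its built-in `cp1252` and `iso-8859-1` codecs; the five byte values are an illustrative selection, not a complete list.

```python
import unicodedata

# Byte values Windows-1252 uses for curly quotes and the em dash.
SAMPLE_BYTES = (0x91, 0x92, 0x93, 0x94, 0x97)

rows = []
for value in SAMPLE_BYTES:
    raw = bytes([value])
    windows_char = raw.decode("cp1252")       # printable punctuation
    latin1_char = raw.decode("iso-8859-1")    # Python maps these to C1 control code points
    rows.append((value, windows_char, unicodedata.category(latin1_char)))

for value, windows_char, latin1_category in rows:
    print(hex(value), repr(windows_char), latin1_category)
```

Each byte decodes to visible punctuation under Windows-1252 and to a control character (category `Cc`) under Python's ISO 8859-1 codec, which is why the same file looks fine in one place and broken in another.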
By the way, it is certainly possible to create a filter/converter to address this problem. I have one written in perl, but it is designed to run in a web server context. As you are unlikely to make use of it yourself since your blog runs on typepad, I suggest that you contact typepad and request a feature enhancement.
Posted by: Bruce | May 10, 2005 at 10:03 AM
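Bruce's Perl filter isn't shown in the thread, so the following is not his code. It is a rough Python 3 sketch of what such a converter might do, with a hypothetical function name `to_utf8` and a simple guessing rule chosen for illustration: accept the input if it is already valid UTF-8, otherwise assume Windows-1252.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as UTF-8, guessing between UTF-8 and Windows-1252."""
    try:
        # Already valid UTF-8: pass it through unchanged.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume Windows-1252. The few byte values that code page
        # leaves undefined become U+FFFD instead of raising an error.
        text = raw.decode("cp1252", errors="replace")
    return text.encode("utf-8")


pasted_from_word = b"We\x92d like"            # Windows-1252 curly apostrophe
print(to_utf8(pasted_from_word))
print(to_utf8("We\u2019d like".encode("utf-8")))
```

The guess is a heuristic, not a guarantee. Some Windows-1252 byte sequences also happen to be valid UTF-8, and a filter like this would pass them through untouched.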
RE: demoronizer
I haven't heard of that particular perl script, but it looks like it'll do the trick. It's the same thing that I implemented myself, but probably their version is better. (Mine was never meant for distribution, strictly an in-house thing.) So yeah, give it a shot.
Posted by: Bruce | May 10, 2005 at 11:44 AM
A Perl script isn't ideal since you can't just tell other people "go run this", since it requires that they install perl, which requires that they either trust you or know what they're doing.
Wish there was an online form for demoronizing, akin to the disemvoweller ...
Bruce, is your script online somewhere? (online as in "readable", not necessarily as in "runnable")
Posted by: Anna | May 10, 2005 at 12:26 PM
That is weird, Dan. I guess no one ever noticed it before and no one ever said anything. Thanks for spotting it. We'll go clean up the mess.
Posted by: yatta | May 11, 2005 at 09:20 AM
I suppose one horribly hackish but nonetheless effective solution would be to make a javascript bookmarklet (or even better, at least for Firefox, a Greasemonkey extension) that would convert the mangled sequences into quotations.
Posted by: Patrick Hall | May 12, 2005 at 05:33 PM
RE: Dan, usually the problem creeps in when people compose the posts one place (like Word, as Mark mentioned above) before posting them into their blog entry field. It's usually curly quotes, which can be hard to spot if you're not looking for them. Word also doesn't make the problem easy to fix. (To keep straight quotes in Word, you have to go into the Tools > AutoCorrect Options submenu, and uncheck "Replace straight quotes with curly quotes" under both "AutoFormat" and "AutoFormat as you type." I've created a Word macro that finds and replaces my curly quotes automatically.)
Posted by: Matt | May 9, 2005 09:14 AM
Thank you for posting on this problem, Dan. And for the great response by Matt! Following this advice, I was able to get WORD and Blogger to like each other! Thank you both *so* much!
Posted by: Kat | June 06, 2005 at 04:56 PM