Welcome to Geeklog, Anonymous Saturday, December 21 2024 @ 01:35 pm EST

Geeklog Forums

charsets confused between utf-8 and iso-8859-1


brett

Anonymous
Hi,

Our blog uses Geeklog (which overall has been a great product). We are http://www.worldfootynews.com, covering Australian Rules football around the world, and we mainly write in English but also try to capture the correct spelling of clubs and cities in various languages. As far as I can tell our MySQL is in utf-8 and most of our writers post in that. But we had a problem when we changed our webhost site because suddenly a lot of characters like in Sk���¥ne became junk (who knows how this will appear to you, but SHOULD have a little circle above the 'a' in Skane). I spent many many hours over months trying to solve the problem, researching the web etc, and finally concluded the server might be by default outputting our pages in iso-8859-1 (western european?). I tried hacking our geeklog code changing charsets wherever it was iso to utf-8.

Nothing worked until finally I realised I just needed to have the default_charset set to utf-8 and have logged in users select a utf-8 language in their account preferences. All the characters worked and the world was good. This story explained it all to my readers:

http://www.worldfootynews.com/article.php/20061204230043692

But just a few weeks later my offsider upgraded us from an geeklog (about 2 years old) to 1.4. Amongst much other pain and suffering, the characters have gone bad again. So I've checked the default charset is utf-8 and the language is en-gb and my preferences are english_utf-8 and nothing seems to work. In fact only when I set the language to iso-8859-1 do the characters display properly.

Any ideas much appreciated. Did any language things change between say 1.1 and 1.4?

The closest thing I've found is the following link, but I don't think it helps in this case:

http://www.geeklog.net/article.php?story=200410120657418

I've also noticed our RSS has stopped, probably due to similar issues. A validator tells us this:

http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fwww.worldfootynews.com


Any help much appreciated.

Brett

 Quote

brett

Anonymous

Just a slight correction - the secondary issue of the RSS had the wrong link. It should've been:

http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fwww.worldfootynews.com%2Fbackend%2Fwfn.rdf

So again, something wrong with character sets.

 Quote

Status: offline

Dirk

Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Well, with your modifications and the back and forth, it's hard to tell from here what the character set really is on your site (and/or the database) now.

I don't think anything has changed in the character set handling in Geeklog up to and including 1.4.0sr5-1. We did change things a bit in 1.4.1, though, so maybe that would be worth a try (not on the live site, of course).

Also check your theme. If it has a meta tag for the character set, remove it - Geeklog is sending a proper HTTP header now (as of 1.4.0). Maybe you have conflicting character sets there.

Hmm, can't really think of anything else right now. HTH

bye, Dirk
 Quote

brett

Anonymous
sad
Thanks Dirk. Still broken unfortunately. I downloaded a full copy of our site so I could sift through it for virtually every reference to iso-8859-1. Nothing was obviously wrong, but I changed any occurences to utf-8 just in case. None of that helped. We only have the professional theme. For some bizarre reason as of a few weeks ago our site needs the user to by default set to iso-8859-1 or if logged in, set their preferences to English (which sets to iso-8859-1) not English_UTF-8 (which sets to utf-8). Stories like this one demonstrate the problem:

http://www.worldfootynews.com/article.php/20060127004101396


*** We use cPanel and phpMyAdmin shows for localhost:

mySQL charset: UTF-8 Unicode (utf8)

MySQL connection collation: utf8_unicode_ci

Under Character sets and collations it has various highlighted settings, but under utf8 it has:

utf8_general_ci Unicode (multilingual), case insensitive

Although if I go right down into the database it shows all the gl tables as latin1_swedish_ci, e.g.

gl_stories (type) MyISAM (collation) latin1_swedish_ci


However all of the database settings above (from *** down), I'm fairly confident, have not changed since before our Geeklog software upgrade and subsequent problems with char sets.


Basically I've given up. I've spent days looking at this and can't make any sense of it. It took me ages to get the site working under utf-8 properly, and just when I did the GL upgrade seems to have broken it, but I can't find any evidence for that in any of the source. There seems to be no good reason for the problems. Maybe my host changed something? I surrender - I'll tell all my users to use iso-8859-1 and change the default to that since utf-8 won't work for us. Even though our database appears to be utf-8 it needs to be read as iso-8859-1 as off a few weeks ago.

No, I haven't tried the beta version of the next GL release. I don't really know how to do it offline as a non-live version. Not sure if a cPanel interface lets me do that sort of thing.

Thanks for trying to help. I guess sometimes one has to move on with a less than satisfactory outcome.
 Quote

brett

Anonymous

I'd be interested to hear if anyone else is successfully using Geeklog 1.4 and utf-8 charsets.
 Quote

Status: offline

samstone

Forum User
Full Member
Registered: 09/29/02
Posts: 820
I don't know whether assigning the right collation will help. Since you use CPanel:

For a new installation, I create a database in CPanel. Then go to PHPMyAdmin, select the related database, and click on the "Operation" tab to change the collation to utf8_general_ci. Then when you run installation the tables are all reated with utf-8 collation.

Changing the existing collation is is what I haven't done, but have read somewhere on the web:

Export (backup) the tables and data. In PHPMyAdmin, select the related database and delet all tables. Change the databases collation to utf8_general_ci under the "Operation" tab. Then import the tables and data. You will see all tables in the new colation.

Hope this helps.

Sam

 Quote

brett

Anonymous

Scary stuff, but I reckon I'll give it a go.

As far as I know, we never changed the charsets and yet it worked under older GL but not the new one. But for the lack of any other ideas about what to do, I'll give it a try. Will do some more reading first so I hopefully don't stuff up the database.

Thanks.
 Quote

brett

Anonymous

Well, with much fear in my heart, since I haven't really played with the database before, I went into cPanel, exported the database to my local drive, had a look at it with Notepad (under XP), and it said it is utf8, I set the collation to utf8_general_ci instead of the latin1_swedish_ci one shown there, I deleted all my tables (and checked that indeed my website was now gone), again set collation to utf8_general_ci just in case it had changed in the process, then imported the database again from the text file I'd exported, then held my breath.

The good news is the database successfully imported so has not been lost.

The bad news is that nothing has changed. Articles with non-English characters like Malmö appear with the empty square for the "o" when viewed as utf-8, and come out ok when viewed as western european (by selecting in the browser or by setting language in user preferences in geeklog to english not english_utf-8).

Strangely if I look at the tables row-by-row then the collation is still listed as latin1_swedish_ci, even though there is no way I forgot to set it to utf8_general_ci via the operations tab. I've gone back and checked and it still says collation as utf8 on the operations menu.

So it remains a bizarre and frustrating problem.

I'm grasping at straws. Could cPanel or phpMyAdmin have a bug in it?

I'm way out of my depth here, but having a read of this:

http://dev.mysql.com/doc/refman/5.0/en/charset-general.html

and

http://dev.mysql.com/doc/refman/5.0/en/charset-mysql.html

I get the impression that collation just refers to sort orders when there are sort commands, and in fact should have nothing to do with my problem. But interestingly it notes that latin1_swedish_ci is the default collation for charset latin1, so it does lend support to the idea that somehow SQL is giving out the stories as latin1 not utf-8. The inconsistency in all this remains that I'm pretty sure it was working fine up until we upgraded Geeklog, and I'm pretty sure the database was not touched in any way during that process, which suggests the problem really is back within the Geeklog code, yet everywhere I've got utf-8 not iso.

Either that or somehow I'm not forcing the charset to utf-8 when I load in the database, even though the tab suggests so.

Or maybe I'm crazy!
 Quote

Status: offline

mevans

Forum User
Full Member
Registered: 02/08/04
Posts: 393
Location:Texas
Brett,

I recently converted a site to UTF-8 from ISO-8859-1, here is how I did it. It seems to work just fine.

First, make sure you MySQL database version is at least 4.1.
  • Dump my database using mysqldump or phpMyAdmin
  • Use some tool to do a global search / replace of the SQL dump file. You want to replace all CHARSET=utf8mb4 with CHARSET=utf8mb4. I used the replace command which is included in the MySQL distribution, but there are several Win32 programs that will do this as well.
  • You have to make sure you have all valid UTF-8 characters in your database. I used the iconv program to validate / fix the data. iconv is available under Unix and also Win32. Using the following command:

    iconv –c –f utf-8 –t utf-8 < input.sql > output.sql

  • Change the collation for the actual database using phpMyAdmin to utf_general_ci
  • Drop all the old tables with phpMyAdmin (do not drop the database).
  • Import the newly created output.sql into the database
  • Double check all the tables and fields are utf8 in the database structure
  • Change Geeklog's config.php to use utf-8 charset
  • Change Geeklog's config.php to use english_utf-8 as the langauge
  • Remove all the non-utf-8 language files
  • Using phpMyAdmin - look at gl_users table to see how many users have selected a language preference
  • Run a few SQL queries to fix the language preferences (for example)

    UPDATE gl_users set language="english_utf-8" WHERE language="english";
    UPDATE gl_users set language="danish_utf-8" WHERE language="danish";
    Do this for each of the languages

  • If you are not running Geeklog v1.4.1, you will need to edit the mysql.class.php source file and make one small change:

    Find the following lines:
    Text Formatted Code

            // Connect to MySQL server
            $this->_db = mysql_connect($this->_host,$this->_user,$this->_pass) or di
    e('Cannnot connect to DB server');
     

    Add this line immediately after the above
    Text Formatted Code

                   @mysql_query ("SET NAMES 'utf8'", $this->_db);
     

    Save the file.

  • Double check a few stories / forum posts to make sure all is well


This is very similar to what you have already tried, but there are 2 main differences, changing the collation of each table to utf-8 and also validating the data with iconv.

Let me know if this helps!

Thanks!
Mark
 Quote

brett

Anonymous

Marky Mark!

I've been battling this problem for so long and I think you might have given me the solution!

I've started going through all the steps that you wrote out for me so beautifully, but while I waited for clunky old Notepad to do the find and replace all on the CHARSET=utf8mb4 part, I jumped to the mysql.class.php change you suggested, since I recall seeing somewhere else that set names was a useful command, but I didn't know where I could put it previously. So I added it in, and went back to look at the usual problems stories on my site, and under western european they were now having problems, so the excitement rose, and I set my browser to utf8 and YES, they look correct!

So with language set to english_utf8 the site now tells the browser utf8 and it is indeed utf8!

I know I've skipped out the steps inbetween, so I'll go through and finish that when I get a chance tomorrow, just to make sure everything is consistent.

Interesting that you say that before GL 1.4.1 I'd need this change to mysql.class.php. This makes me suspect that it was in changing to 1.4.0 from an old version that caused the problem. Admittedly someone did such I try 1.4.1 as a test but I wasn't sure how to do that.

I'll report back how the rest of this goes, but it seems the problem is well on the way to being solved. Very very much appreciated, as is the support of all the Geeklog developers who have enabled us to set up what we think is a great little not-for-profit website that we know is appreciated by a few thousand people around the world. And now I can continue to publish their names and cities with the correct characters!

Cheers,
Brett
 Quote

brett

Anonymous

A further update. It seems to all be working correctly now. The one step I didn't do was the iconv test of the database.

I had trouble downloading all the parts and it was late and I impatiently just went without it. I've since got it working (note for anyone using these instructions in future, cutting and pasting the command gave bad characters and iconv spat them, so since I'm on XP I opened a dos prompt, went to the iconv.exe directory, and copied my sql file there, and TYPED in the command):

iconv -c -f utf-8 -t utf-8 < world_blogtest3.sql > output.sql

The input file was 3645 kB and the output was 3651 kB. So I'll import that back in.

Anyway, thanks again!

Brett
 Quote

All times are EST. The time is now 01:35 pm.

  • Normal Topic
  • Sticky Topic
  • Locked Topic
  • New Post
  • Sticky Topic W/ New Post
  • Locked Topic W/ New Post
  •  View Anonymous Posts
  •  Able to post
  •  Filtered HTML Allowed
  •  Censored Content