Geeklog Forums
Geeklog Handling of Two-Byte Characters in RSS feeds
alank
Anonymous
The Geeklog installation in question is the latest security release of 1.5.2. However, the issue has been present in previous versions as well.
What Geeklog calls Content Syndication has a character limit; the default limit is 150 characters. Geeklog seems to treat a character as one byte. However, a character set can have characters that are two bytes long, so cutting the text at a fixed byte count can split a two-byte character in half. A broken character tends to break feed readers.
Perhaps there's a way to change a feed reader so that it can manage the issue on the client side. However, I am wondering if there's something that could be done on the server side to prevent Geeklog from serving feeds with broken characters.
Any thoughts? Thanks.
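For anyone following along, the breakage described above can be reproduced in a few lines. This is a minimal sketch, assuming UTF-8 text and PHP's mbstring extension; the sample string and the cut-off of 4 are made up for illustration and are not taken from Geeklog.

```php
<?php
// Assumption: UTF-8 input and the mbstring extension are available.
$text = "caf\xC3\xA9 au lait";   // "café au lait"; "é" is two bytes (C3 A9)

// Byte-based cut: keeps "caf" plus only the first byte of "é".
$byteCut = substr($text, 0, 4);

// Character-based cut: keeps the whole "é".
$charCut = mb_substr($text, 0, 4, 'UTF-8');

// Check whether each result is still valid UTF-8.
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
```

The byte-based result ends in a dangling lead byte, which is the kind of invalid sequence a strict feed parser may reject.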
alank
Anonymous
Right, so I didn't count on a response, tho, I was hoping for one. I came up with a quick and dirty fix as follows, editing the lib-syndication.php file in the backend systems folder:
function SYND_truncateSummary( $text, $length )
{
    if( $length == 0 )
    {
        return '';
    }
    else
    {
        $text = stripslashes( $text );
        $text = trim( $text );
        $text = preg_replace( "/(\015)/", "", $text );
        if(( $length > 3 ) && ( MBYTE_strlen( $text ) > $length ))
        {
            $text = substr( $text, 0, $length - 3 ) . '...';
        }
        // Check if we broke html tag and storytext is now something
        // like "blah blah <a href= ...". Delete "<*" if so.
        if( strrpos( $text, '<' ) > strrpos( $text, '>' ))
        {
            $text = substr( $text, 0, strrpos( $text, '<' ) - 1 )
                  . ' ...';
        }
        $text = substr($text, 0, strrpos($text, ' '));//added code
        return $text;
    }
}
The "added code" near the bottom simply finds the nearest space at the end of the text and lops off the text to the right of that.
The assumption is that the first instance of a damaged character is always going to be to the right of the last space in the string.
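One caveat with that assumption, sketched below for UTF-8 text: if the truncated string contains no space at all (common in Japanese or Chinese), strrpos() returns false and the summary comes back empty. The function name trimToLastSpace and the sample strings are hypothetical, used only to illustrate the "added code" line.

```php
<?php
// Hypothetical stand-in for the "added code" line; not Geeklog's actual function.
function trimToLastSpace(string $text): string
{
    $pos = strrpos($text, ' ');
    if ($pos === false) {
        // No space anywhere: nothing to the left of a "last space" to keep.
        return '';
    }
    return substr($text, 0, $pos);
}

// Broken byte sits after the last space, so it gets lopped off.
$latin = trimToLastSpace("blah blah caf\xC3");

// Two CJK characters plus a stray lead byte, and no spaces at all.
$cjk = trimToLastSpace("\xE6\x97\xA5\xE6\x9C\xAC\xE8");

var_dump($latin); // string(9) "blah blah"
var_dump($cjk);   // string(0) ""
```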
alank
Anonymous
Oh, I forgot to add back the dots:
Text Formatted Code
$text = substr($text, 0, strrpos($text, ' ')) . ' ...'; //added code
Dirk
Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Quote by: alank
Right, so I didn't count on a response, tho, I was hoping for one.
Whoa. At least give me a chance to understand the problem first. It wasn't clear (to me at least) what you were talking about. There is no 150 character limit in the Content Syndication, for example ...
I guess the proper solution would be to use our MBYTE_substr string function there.
Btw, you may want to report bugs in our bugtracker. Things can easily get lost in the forum traffic.
bye, Dirk
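To illustrate the suggestion, here is a minimal sketch of character-aware truncation using PHP's mb_* functions directly. It is not Geeklog's MBYTE_substr implementation; the function name truncateSummary and the UTF-8 default are assumptions made for the example.

```php
<?php
// Hypothetical helper; assumes the mbstring extension is loaded.
function truncateSummary(string $text, int $length, string $encoding = 'UTF-8'): string
{
    if ($length <= 0) {
        return '';
    }
    $text = trim($text);
    // Count and cut in characters, not bytes, so no character is split.
    if ($length > 3 && mb_strlen($text, $encoding) > $length) {
        $text = mb_substr($text, 0, $length - 3, $encoding) . '...';
    }
    return $text;
}

// "naïve café society" cut to 10 characters including the dots.
$summary = truncateSummary("na\xC3\xAFve caf\xC3\xA9 society", 10);
var_dump($summary);
var_dump(mb_check_encoding($summary, 'UTF-8')); // bool(true)
```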
alank
Anonymous
Thanks for getting back. I thought there was a default limit (a suggested size, more like). In any event, it's arbitrary and can split multibyte characters apart and make the feed invalid. I spent most of my time trying to work out a solution on the client side, not wanting to mess around with the Geeklog code, but I found that much more difficult: it needed some kind of preprocessing that seemed more trouble than it was worth. I would be interested in seeing how the MBYTE_substr function might be used to resolve this in a more elegant way.
Sorry for not submitting a bug report. I wasn't sure it was a bug.
alank
Anonymous
Yes, as suggested, this seems to be just the ticket:
Text Formatted Code
$text = stripslashes( $text );
$text = trim( $text );
$text = strip_tags( $text );
$text = preg_replace( "/(\015)/", "", $text );
if(( $length > 3 ) && ( MBYTE_strlen( $text ) > $length ))
{
    $text = MBYTE_substr( $text, 0, $length - 3 ) . '...';
}
I added the strip_tags bit to get better results in the face of image tag content, which eats up characters, and the like.
Dirk
Quote by: alank
I added the strip_tags bit to get better results in the face of image tag content, which eats up characters, and the like.
Actually, the result should contain HTML tags, so that's on purpose.
Looks like a proper fix is a bit more complex, as it also requires a working MBYTE_strrpos function. See if you can make sense of this changeset or wait for 1.6.0b2 ...
bye, Dirk
alank
Anonymous
Quote by: Dirk
Actually, the result should contain HTML tags, so that's on purpose.
Looks like a proper fix is a bit more complex, as it also requires a working MBYTE_strrpos function. See if you can make sense of this changeset or wait for 1.6.0b2 ...
bye, Dirk
I think I'll return the strip_tags to the client side.
The revised code seems alright, but I wonder what it means if the $mb_enabled value returned by MBYTE_checkEnabled() is false, even though the character encoding is actually utf-8 (or some other multibyte encoding). Would it be useful to implement mb_check_encoding for this purpose? Or maybe mb_detect_encoding?
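On the mb_check_encoding question, here is a sketch of how such a check might look, assuming the site charset is already known from configuration (the $siteCharset variable is a placeholder, not a Geeklog setting). mb_check_encoding() validates a string against a named encoding, whereas mb_detect_encoding() only picks from a candidate list, so the former seems the better fit when the charset is known. Both belong to mbstring, so neither would help if the extension itself is missing.

```php
<?php
$siteCharset = 'UTF-8';   // placeholder for the configured charset

$good = "caf\xC3\xA9";    // valid UTF-8
$bad  = "caf\xC3";        // truncated two-byte sequence

$goodOk = null;
$badOk  = null;
if (function_exists('mb_check_encoding')) {
    // Validate against the known charset.
    $goodOk = mb_check_encoding($good, $siteCharset);
    $badOk  = mb_check_encoding($bad, $siteCharset);
    var_dump($goodOk, $badOk);

    // Strict detection returns false when no candidate encoding matches.
    var_dump(mb_detect_encoding($bad, ['UTF-8'], true));
}
```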
alank
Anonymous
I realized that it was important to strip the tags before the length was applied, so that the visible text would be fitted into the feed description. That can be achieved only within Geeklog. So, without altering the $text value, I was able to get what I needed this way:
if(( $length > 3 ) && ( MBYTE_strlen( $text ) > $length ))
{
    $text = MBYTE_substr( strip_tags( $text ), 0, $length - 3 ) . '...';
}
That way the characters inside html tags would not be counted.
alank
Anonymous
Quote by: alank
if(( $length > 3 ) && ( MBYTE_strlen( $text ) > $length ))
{
    $text = MBYTE_substr( strip_tags( $text ), 0, $length - 3 ) . '...';
}
That way the characters inside html tags would not be counted.
I take that back. It did alter the $text variable. Crumbs.
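One possible way around that, offered only as a rough sketch: walk the text token by token, copy tags through uncounted, and count only the visible characters against the limit. The function truncateVisible is hypothetical and not part of Geeklog; it assumes UTF-8, the mbstring extension, and reasonably well-formed tags.

```php
<?php
// Hypothetical helper: truncate by visible characters, leaving tags in place.
function truncateVisible(string $html, int $limit, string $encoding = 'UTF-8'): string
{
    // Split into tags and text runs; the capture group keeps tags as tokens.
    $tokens = preg_split('/(<[^>]*>)/', $html, -1,
                         PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $out = '';
    $remaining = $limit;
    foreach ($tokens as $token) {
        if ($token[0] === '<' && substr($token, -1) === '>') {
            $out .= $token;               // tag: copy through, do not count
            continue;
        }
        $len = mb_strlen($token, $encoding);
        if ($len <= $remaining) {
            $out .= $token;               // whole text run still fits
            $remaining -= $len;
        } else {
            // Cut inside this text run, on a character boundary.
            $out .= mb_substr($token, 0, $remaining, $encoding) . '...';
            break;
        }
    }
    return $out;
}

$input  = "<b>h\xC3\xA9llo</b> w\xC3\xB6rld";   // "<b>héllo</b> wörld"
$result = truncateVisible($input, 7);
var_dump($result);
```

The caller's string is left untouched because PHP passes strings by value. The sketch does not re-close tags that were open at the cut point and drops any tags after it, so a real fix would still need to handle those cases.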