Geeklog Forums

Geeklog Handling of Two-Byte Characters in RSS feeds


alank

Anonymous
The Geeklog installation in question is the latest security release, 1.5.2, but the issue has been present in earlier versions as well.

What Geeklog calls Content Syndication has a character limit; the default is 150 characters. Geeklog appears to treat one character as one byte. However, a character set can contain characters that are two bytes or longer, so truncating at a byte offset can cut a multibyte character in half. A broken character tends to break feed readers.
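A minimal sketch of the failure described above, assuming the feed text is UTF-8. The helper name `is_valid_utf8` and the sample string are illustrative only and are not part of Geeklog.

```php
<?php
// Returns true when $s is well-formed UTF-8.
// PCRE's /u modifier refuses to match against malformed input.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// "café" written with explicit bytes: the "é" is the two-byte sequence C3 A9.
$text = "caf\xC3\xA9";

$byteLength = strlen($text);        // strlen counts bytes, so this is 5, not 4
$cut        = substr($text, 0, 4);  // keeps "caf" plus only the first byte of "é"

$wholeIsValid = is_valid_utf8($text); // the untouched string is well-formed
$cutIsValid   = is_valid_utf8($cut);  // the byte-wise cut is not
```

A feed that contains `$cut` is no longer well-formed UTF-8, which is the kind of input a strict XML parser may reject.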

Perhaps there's a way to change a feed reader so that it can manage the issue on the client side. However, I am wondering if there's something that could be done on the server side to prevent Geeklog from serving feeds with broken characters.

Any thoughts? Thanks.

alank

Anonymous
Right, so I didn't count on a response, tho, I was hoping for one. I came up with a quick and dirty fix as follows, editing the lib-syndication.php file in the backend systems folder:

Text Formatted Code

function SYND_truncateSummary( $text, $length )
{
    if( $length == 0 )
    {
        return '';
    }
    else
    {
        $text = stripslashes( $text );
        $text = trim( $text );
        $text = preg_replace( "/(\015)/", "", $text );
        if(( $length > 3 ) && ( MBYTE_strlen( $text ) > $length ))
        {
            $text = substr( $text, 0, $length - 3 ) . '...';
        }

        // Check if we broke html tag and storytext is now something
        // like "blah blah <a href= ...". Delete "<*" if so.
        if( strrpos( $text, '<' ) > strrpos( $text, '>' ))
        {
            $text = substr( $text, 0, strrpos( $text, '<' ) - 1 )
                  . ' ...';
        }
        $text = substr($text, 0, strrpos($text, ' '));//added code
        return $text;
    }
}
 

The "added code" near the bottom simply finds the last space in the text and lops off everything to the right of it.

The assumption is that the first instance of a damaged character is always going to be to the right of the last space in the string.
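The last-space workaround can be sketched on its own as follows. This is a standalone illustration, not the Geeklog function: `trim_to_last_space` and its `$maxBytes` parameter are hypothetical names. It also guards the case where the cut text contains no space at all, in which `strrpos` returns `false` and the one-line version would leave nothing of the summary.

```php
<?php
// Hypothetical helper: cut at a byte offset, then back up to the last space
// so that a character split by the byte-wise cut is discarded with its word.
function trim_to_last_space(string $text, int $maxBytes): string
{
    if (strlen($text) <= $maxBytes) {
        return $text; // short enough, nothing to cut
    }

    $cut       = substr($text, 0, $maxBytes); // may end in half a character
    $lastSpace = strrpos($cut, ' ');

    if ($lastSpace === false) {
        // No space to fall back on: return only the ellipsis
        // rather than a possibly broken fragment.
        return '...';
    }

    return substr($cut, 0, $lastSpace) . ' ...';
}

// "naïve café au lait" with explicit UTF-8 bytes for the accented letters.
$summary = "na\xC3\xAFve caf\xC3\xA9 au lait";
$short   = trim_to_last_space($summary, 11); // the cut lands inside "é"
```

The approach only works for text that separates words with ASCII spaces; languages written without spaces would always hit the fallback branch.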

alank

Anonymous
Oh, I forgot to add back the dots:

Text Formatted Code
$text = substr($text, 0, strrpos($text, ' ')) . ' ...'; //added code

Dirk

Site Admin
Quote by: alank

Right, so I didn't count on a response, tho, I was hoping for one.


Whoa. At least give me a chance to understand the problem first. It wasn't clear (to me at least) what you were talking about. There is no 150 character limit in the Content Syndication, for example ...

I guess the proper solution would be to use our MBYTE_substr string function there.

Btw, you may want to report bugs in our bugtracker. Things can easily get lost in the forum traffic.

bye, Dirk

alank

Anonymous
Thanks for getting back. I thought there was a default limit (more like a suggested size). In any event, the limit is arbitrary, so it can split multibyte characters apart and make the feed invalid. I spent most of my time trying to work out a solution on the client side, not wanting to mess around with the Geeklog code, but I found that much more difficult: it needs some kind of preprocessing that seemed more trouble than it was worth. I would be interested in seeing how the MBYTE functions might be used to resolve this in a more elegant way.

Sorry for not submitting a bug report. I wasn't sure it was a bug.

alank

Anonymous
Yes, as suggested, this seems to be just the ticket:

Text Formatted Code
        $text = stripslashes( $text );
        $text = trim( $text );
        $text = strip_tags( $text );
        $text = preg_replace( "/(\015)/", "", $text );
        if(( $length > 3 ) && ( MBYTE_strlen( $text ) > $length ))
        {
            $text = MBYTE_substr( $text, 0, $length - 3 ) . '...';
        }
 

I added the strip_tags call to get better results with image tags and the like, whose markup eats up characters.
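For readers without a Geeklog checkout, the character-aware truncation can be sketched in plain PHP. The functions `char_length`, `char_prefix`, and `truncate_summary` are hypothetical stand-ins for the MBYTE wrappers, whose real implementation may differ. The sketch assumes UTF-8 text and uses the mbstring functions when they are loaded, with a PCRE fallback otherwise.

```php
<?php
// Hypothetical stand-in for a multibyte-aware strlen: counts characters, not bytes.
function char_length(string $text): int
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($text, 'UTF-8');
    }
    // Without mbstring: count whole UTF-8 characters with PCRE.
    return (int) preg_match_all('/./us', $text);
}

// Hypothetical stand-in for a multibyte-aware substr from offset 0.
function char_prefix(string $text, int $count): string
{
    $count = max(0, $count);
    if (function_exists('mb_substr')) {
        return mb_substr($text, 0, $count, 'UTF-8');
    }
    // Without mbstring: match up to $count whole characters from the start.
    if (preg_match('/^.{0,' . $count . '}/us', $text, $m) === 1) {
        return $m[0];
    }
    return '';
}

// Same shape as the condition in the post: only cut when the text is
// longer than $length characters, and reserve three characters for the dots.
function truncate_summary(string $text, int $length): string
{
    if ($length > 3 && char_length($text) > $length) {
        return char_prefix($text, $length - 3) . '...';
    }
    return $text;
}

// Six "é" characters, twelve bytes in UTF-8.
$sample = str_repeat("\xC3\xA9", 6);
$result = truncate_summary($sample, 5); // two whole characters plus "..."
```

Because the cut is made on character boundaries, the result stays well-formed regardless of how many bytes each character occupies.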

Dirk

Site Admin
Quote by: alank

I added the strip_tags call to get better results with image tags and the like, whose markup eats up characters.


Actually, the result should contain HTML tags, so that's on purpose.

Looks like a proper fix is a bit more complex, as it also requires a working MBYTE_strrpos function. See if you can make sense of this changeset or wait for 1.6.0b2 ...

bye, Dirk

alank

Anonymous
I think I'll move the strip_tags call back to the client side.

The revised code seems alright, but I wonder what it means if the $mb_enabled value returned by MBYTE_checkEnabled() is false even though the character encoding is actually utf-8 (or some other multibyte encoding). Would it be useful to implement mb_check_encoding for this purpose? Or maybe mb_detect_encoding?
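One way to explore that question is a validity check that does not depend on mbstring being loaded. This is only a sketch: `looks_like_utf8` is a hypothetical name, and how Geeklog itself decides the value of `$mb_enabled` is not shown here. Note that such a check reports whether a string is well-formed UTF-8, not which encoding the site is configured for; plain ASCII passes as well.

```php
<?php
// Hypothetical helper: report whether $text is well-formed UTF-8.
function looks_like_utf8(string $text): bool
{
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($text, 'UTF-8');
    }
    // Without mbstring, PCRE's /u modifier offers a comparable check:
    // it fails on malformed UTF-8 instead of matching.
    return preg_match('//u', $text) === 1;
}

$whole  = looks_like_utf8("caf\xC3\xA9"); // complete two-byte "é"
$broken = looks_like_utf8("caf\xC3");     // "é" cut after its first byte
$latin1 = looks_like_utf8("caf\xE9");     // "é" as a single ISO-8859-1 byte
```

A check like this could be run on the truncated summary as a last safeguard, whereas `mb_detect_encoding` only guesses among candidate encodings and needs mbstring in the first place.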

alank

Anonymous
I realized that it matters to strip the tags before the length is applied, so that the visible text fits into the feed description. That can be achieved only within Geeklog. So, without altering the $text value, I was able to get what I needed this way:
Text Formatted Code

        if(( $length > 3 ) && ( MBYTE_strlen( $text ) > $length ))
        {
            $text = MBYTE_substr( strip_tags( $text ), 0, $length - 3 ) . '...';
        }
 

That way the characters inside html tags would not be counted.

alank

Anonymous
I take that back. It did alter the $text variable. Crumbs.
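A rough sketch of how the visible text could be limited while the tags stay in place. This is not Geeklog code: `truncate_visible` is a hypothetical helper that assumes UTF-8 input, and it makes no attempt to close tags that are still open at the cut point.

```php
<?php
// Hypothetical helper: keep at most $limit visible characters,
// copying HTML tags through without counting them.
function truncate_visible(string $html, int $limit): string
{
    // Split into tags and text runs; captured tags are kept in the result.
    $parts = preg_split(
        '/(<[^>]*>)/u',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    if ($parts === false) {
        return ''; // input was not well-formed UTF-8
    }

    $out       = '';
    $remaining = $limit;

    foreach ($parts as $part) {
        if (preg_match('/^<[^>]*>$/', $part) === 1) {
            $out .= $part; // a tag: pass it through at no cost
            continue;
        }

        // One array entry per character, so multibyte characters stay whole.
        $chars = preg_split('//u', $part, -1, PREG_SPLIT_NO_EMPTY);

        if (count($chars) <= $remaining) {
            $out       .= $part;
            $remaining -= count($chars);
        } else {
            $out .= implode('', array_slice($chars, 0, $remaining)) . '...';
            break; // everything after the cut, tags included, is dropped
        }
    }

    return $out;
}

// example.com is a placeholder URL.
$html  = '<a href="http://example.com/">caf' . "\xC3\xA9" . ' au lait</a> more';
$short = truncate_visible($html, 6);
```

The link markup is not counted against the limit, so six visible characters survive. The closing `</a>` is lost in this example, which a real implementation would have to repair.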
