Geeklog Forums

Google problem

Filipino

Forum User
Chatty
Registered: 08/27/02
Posts: 50
Our organization has been covering Philippine news since 1994. Earlier this year we switched to a CMS (Geeklog), as it proved easier for posting 200 or more items a day. We have encountered a serious problem. As we are the leading source of Philippine news, organizations such as the BBC and CNN like to trawl our input and highlight it on their sites to expand upon their own coverage. Recently we offered our site to Google as a source for their new (still in beta) news service, which in turn reproduces items something like this:
Oracle turns grid computing Philippines Daily News, Philippines - 1 hour ago By Leo Magno. ORACLE Corp. is updating its flagship database product by becoming the first software company to adopt the concept ...
Google and, to a lesser degree, the BBC have told us that our output as produced by Geeklog is not suitable for trawling and does not produce acceptable output. For us this is quite a blow, as we have invested heavily (in time and effort) in converting 8 of our news sites to Geeklog, and full integration with the various 'News Highlighting' services is vital to the service we offer. Can anyone offer any advice or a resolution? Wayne

ScurvyDawg

Forum User
Full Member
Registered: 11/06/02
Posts: 523
I would say you might want to offer to pay one of the developers to remedy this; it seems your organization has done pretty well for free so far. How about investing in the community? My two cents.

Dirk

Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Quote by Filipino: Google and, to a lesser degree, the BBC have told us that our output as produced by Geeklog is not suitable for trawling and does not produce acceptable output.
It's hard to offer advice when you don't understand the problem ... What exactly were Google's and the BBC's complaints about? In another thread, I've mentioned that Geeklog could do better with regard to using structural markup, e.g. using H1 tags for headlines, etc. But that is mainly a theme issue and could be resolved by creating a new theme. Is that what they were talking about? bye, Dirk

squatty

Forum User
Full Member
Registered: 01/21/02
Posts: 269
...does not produce acceptable output.
Well, what is "acceptable output"? I assume they have detailed the specifications. Can you elaborate on what they are looking for?
In a world without walls and fences, who needs Windows and Gates?

squatty

Forum User
Full Member
Registered: 01/21/02
Posts: 269
Oops... looks like Dirk beat me to the punch. I guess great minds do think alike ;-)
In a world without walls and fences, who needs Windows and Gates?

Filipino

Forum User
Chatty
Registered: 08/27/02
Posts: 50
Quote by ScurvyDawg: I would say you might want to offer to pay one of the developers to remedy this, seems your organization has done pretty well for free so far. How about investing in the community. My two cents.
I only wish we were a rich organization. Sadly, like most non-profit groups, we struggle; had it not been for the support of LSoft, we would never have survived so long. Our organization is made up of volunteers, and my wife and I work 18 hours a day on the services we offer. Also, being a registered third-world charity, our budget is based on a third-world price structure despite the fact that we are located in Europe. Provided development costs were not too great, we would be happy to pay; we have never been a begging-bowl organization.

Filipino

Forum User
Chatty
Registered: 08/27/02
Posts: 50
Quote by Dirk: It's hard to offer advice when you don't understand the problem ... What exactly were Google's and the BBC's complaints about? In another thread, I've mentioned that Geeklog could do better with regards to using structural markup, e.g. using H1 tags for headlines, etc. But that is mainly a theme issue and could be resolved by creating a new theme. Is that what they were talking about? bye, Dirk
It seems it is impossible to extract the output from a page or separate the individual stories. I am told this is because the site's layout (which I presume to be GL's) is not to industry standard; I have no idea what that means. With the BBC technical people I tried one story per page, but it was still not possible to automatically 'extract' the title and first lines of the story. On the front page, where there are 5 to 10 stories, there was no way to distinguish one from another. Perhaps I chose the wrong type of program to display 'news', but I wanted something that could take input from a variety of sources by email and present it in a neat, readable manner. Take this agency story on my page: this was, I understand, unreadable by Google, yet could be written on another site. So you think it could be the theme, Dirk? Wayne

Filipino

Forum User
Chatty
Registered: 08/27/02
Posts: 50
Try this as a comparison. This was successfully trawled by Google: here. Ours was not: here. This is how Google reproduced it:
Eagles collar Tams in UAAP ABS CBN News, Philippines - 3 hours ago ... U flat-footed, leading to a humbling 87-63 win over the erstwhile solo leader Sunday night in the 66th University Athletics Association of the Philippines at ...
Squatty, I was not given specifications for how the site should be laid out. Our site is trawled continuously for new additions; when they are made, they are supposed to be extracted automatically. I run a smaller site where pages are produced with Dreamweaver, and they do trawl all right. I am not sure what the spider looks for in the page makeup. Wayne

Dirk

Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Quote by Filipino: is not to industry standard - I have no idea what that means.
Well, then you should ask them. It does, however, sound like what I was talking about. Since most (if not all) Geeklog themes do not use H1, H2, etc. tags for the story titles, it's next to impossible for software to determine what the story title is and where the next story starts. You may want to compare the HTML of sites that are listed by Google and the BBC to see what kind of markup they use, e.g. for the date, abstracts, and other data that Google may be interested in. I don't know which theme you're using, but I was playing around with the Classic theme the other day to make it use H1 tags for the title of the featured story and H2 tags for the other story titles. With some modest use of CSS, this looked almost exactly like the original theme. bye, Dirk
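To illustrate why heading tags make story titles easy for software to find, here is a minimal sketch in Python (standard library only). It is not how Google's or the BBC's crawlers are known to work; the class name `HeadlineParser`, the function `extract_headlines`, and the sample pages are all hypothetical. It simply shows that titles marked up with H1/H2 can be picked out mechanically, while a title sitting in a table cell gives a parser nothing to key on.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of every <h1> and <h2> element (hypothetical helper)."""

    HEADING_TAGS = {"h1", "h2"}

    def __init__(self):
        super().__init__()
        self.headlines = []       # list of (tag, text) tuples, in page order
        self._current_tag = None  # heading tag currently being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> and <h1> both match.
        if tag in self.HEADING_TAGS:
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current_tag:
            self.headlines.append((tag, "".join(self._buffer).strip()))
            self._current_tag = None


def extract_headlines(html):
    """Return [(tag, headline_text), ...] for all h1/h2 elements in html."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines


# Hypothetical front page using structural markup.
SAMPLE_PAGE = """
<h1>Featured story title</h1>
<p>Featured story intro.</p>
<h2>Second story title</h2>
<p>Second story intro.</p>
"""

# Hypothetical front page where the title is just bold text in a table cell.
TABLE_PAGE = "<table><tr><td>&nbsp;<b>Story title</b></td></tr></table>"

print(extract_headlines(SAMPLE_PAGE))
print(extract_headlines(TABLE_PAGE))
```

With the first page the sketch returns one entry per story title; with the table-based page it returns an empty list, because nothing in the markup says which text is a headline.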

Filipino

Forum User
Chatty
Registered: 08/27/02
Posts: 50
Quote by Dirk: I don't know which theme you're using, but I was playing around with the Classic theme the other day to make it use H1 tags for the title of the featured story and H2 tags for the other story titles. With some modest use of CSS, this looked almost exactly like the original theme.
Thanks Dirk. We use a customized theme, heavily CSS-based. As far as I know it does use h1, h2 tags, but I will have to get someone to check that. Dirk, I have just found out that one site which is trawled uses PHPNuke, and their output is read. Not being familiar with it, I am not sure how their screen output differs from Geeklog's. I have contacted Google to ask for a detailed analysis of what is wrong; as they are eager to use our news references, I am sure they will assist. Wayne

Dirk

Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Quote by Filipino: Try this as a comparison this was successfully trawled by Google: here ours was not here
Well, so much for the H1 theory ... That sports site actually uses
Text Formatted Code
<p CLASS="newshead">Eagles collar Tams in UAAP
Which is just plain stupid, IMO. However, since that headline is in a new paragraph, it's easier to find than the headline on your site, where it is in a table cell (and preceded by a non-breaking space). If you don't get any information from Google, try looking at other news sites and see if they use the same CSS class names ("text", "newshead") - maybe that's all they're looking for. bye, Dirk
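As a rough illustration of the class-name idea, the following Python sketch (standard library only) collects the text of paragraphs whose class is "newshead" or "text". Whether any real news crawler looks for these class names is exactly the open question in this thread, so treat it as a guess under that assumption. `ClassTextParser`, `extract_by_class`, and the sample story text are hypothetical; only the headline markup comes from the sports site quoted above.

```python
from html.parser import HTMLParser


class ClassTextParser(HTMLParser):
    """Collect text from <p> elements whose class is in `wanted` (hypothetical)."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = []        # list of (class_name, text) tuples
        self._current = None   # class of the <p> being captured, if any
        self._buffer = []

    def _flush(self):
        # Store the captured paragraph, collapsing runs of whitespace.
        if self._current is not None:
            text = " ".join("".join(self._buffer).split())
            self.found.append((self._current, text))
        self._current = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        self._flush()  # a new <p> implicitly ends an unclosed one
        # Attribute names are lowercased by HTMLParser, so CLASS= also works.
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self._current = name
                break

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def close(self):
        super().close()
        self._flush()  # pick up a trailing paragraph that was never closed


def extract_by_class(html, wanted):
    """Return [(class_name, text), ...] for matching <p> elements."""
    parser = ClassTextParser(wanted)
    parser.feed(html)
    parser.close()
    return parser.found


# The headline line mirrors the markup quoted above; the rest is made up.
SAMPLE = """
<p CLASS="newshead">Eagles collar Tams in UAAP
<p class="text">First lines of the story.</p>
<p class="byline">Not captured</p>
"""

print(extract_by_class(SAMPLE, ["newshead", "text"]))
```

The sketch treats a new paragraph as the end of the previous one, which is why the unclosed headline paragraph is still picked up cleanly.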

Mark Beihoffer

Anonymous
I'd like to see this remedied as well, and have been working on a standards-compliant theme for Dragonfly Networks that will have the appropriate HTML structure tags.

So far, the site's main page is validating using the W3C's validator service, which is the first step. Now I'm stuck with the problem of creating a cascading style sheet theme that supports HTML structural markup.

I'm hoping to spend a few hours on it this week, and once the main page (I can't guarantee the other pages will validate as some of the PHP functions have proven to be inscrutable to me) is generating valid HTML structural markup, I'm sure Google et al. will have no problem spidering Geeklog sites.

When the theme is validating and finished I'm going to release it back to the Geeklog community - hopefully people will then be able to create their own custom themes with a simplified look, that are crawler-friendly, accessible (i.e. fonts will properly resize in their browser), and useful.

I know my theme doesn't look as polished as most Geeklog themes, but it's far more focused on generating validating, structural markup than it is on looking flashy. ;-)


Filipino

Forum User
Chatty
Registered: 08/27/02
Posts: 50
Yes, this is what I was told: that the pages generated did not 'validate' to industry standards, which I presume, Mark, is what you describe in your post. I am working with an organization in the Philippines to investigate this problem. It seems that one publication uses PHPNuke for its online publication and that does comply. I have a test version of PHPNuke on our site but I am not expert enough to see what the differences are in the output. Wayne

ScurvyDawg

Forum User
Full Member
Registered: 11/06/02
Posts: 523
Wow ..... Good thread. Sorry about my abruptness earlier. I like what you're saying, Mark. Cool. I have a friend who has a site that comes up as #1 in Google for certain somewhat familiar search terms. For instance, he is #1 under "information warfare" in Google, and that is because of its basic HTML presentation. It is like a simple example of a perfect document for getting spidered by the Google robot. I can't wait to see what you come up with, Mark. Sounds exciting. This is such a vibrant community.

ScurvyDawg

Forum User
Full Member
Registered: 11/06/02
Posts: 523
LOL. It seems the Google database is a constantly moving and changing creature, never one to remain static. I went and looked to see if it was still number one, as I had not checked in about 10 days or so. Sorry, his site is not on the first page any more... LOL. You need to search under the term "information warfare solutions" without the quotes. He is still #1 for that search term.

Filipino

Forum User
Chatty
Registered: 08/27/02
Posts: 50
ScurvyDawg... this is not a search engine problem; it is a new news service that Google is operating (at the moment in beta). Try this link to see what I am talking about. We think the problem is two-way: the front page has the x number of latest posted items, which are shown again on what I shall call the 'Read More' page. The problem seems to be that the front page and its multiple articles are confusing the trawl engine. To be fair, one of the newspapers in the Philippines uses two-line titles... the first line is lost, so the Google news trawler takes the second line as the article's title. We put the Geeklog site, along with ours (www.balita.org) and a number of sites where the trawling is working, through validator.w3.org, and all come up with numerous errors, but none seemed serious enough to explain why one site would register over another. The strange part of the equation is that one site that is trawled uses PHPNuke, and output is successfully trawled from that. Dirk thought the problem was the theme; yesterday we tried numerous themes with GL and all produced this selfsame unreadable output. We tried using a test setup of PHPNuke and it was read first time. Our organization has no intention of moving to PHPNuke, and we will stay firmly with GL as we consider it a far better interface. This will be ongoing and we will find out what the problem is. Wayne

Mark

Anonymous
A site does not need to be in valid HTML/XHTML to be in Google's news section. Many of them are far from it. For example, this was the number one news item on the link you posted. I had to manually set the doctype and encoding before it could even check because there's no DOCTYPE or CHARSET. It shows the page has 280 errors. As Dirk previously posted and this page also shows, they use a class to help the bot locate the beginning of an article: <p class="newshead">. In the article's body, it has this: <p class='text'>. All you have to do is ask Google if these are the standard classes their bot looks for or check some more sites to see if they have other ones. After you find the proper classes add them to your template then your problem should be solved.

Mark Beihoffer

Anonymous
You could probably get this to work if you came up with a CSS header class, like h1.newshead or h3.text - then use <h1 class="newshead">My News Headline Here</h1> and see if it works with Google News. The benefit to this would be that it would work with the Google News service and would also be correctly spidered by the major search engines. In fact, maybe I'll try this myself...
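A small Python sketch (standard library only) of why combining the two ideas could be attractive: a headline written as an H1 with the "newshead" class would be found both by a parser that keys on heading tags and by one that keys on class names. This is only an illustration of that reasoning, not a description of what Google News actually checks; `MarkupAudit` and `audit` are hypothetical names.

```python
from html.parser import HTMLParser


class MarkupAudit(HTMLParser):
    """Count elements that a tag-based or class-based headline finder could use."""

    HEADING_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.by_heading = 0  # elements that are h1/h2/h3
        self.by_class = 0    # elements carrying class "newshead"

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag in self.HEADING_TAGS:
            self.by_heading += 1
        if "newshead" in classes:
            self.by_class += 1


def audit(html):
    """Return how many headline candidates each heuristic would find."""
    parser = MarkupAudit()
    parser.feed(html)
    parser.close()
    return {"heading": parser.by_heading, "class": parser.by_class}


# Combined markup as suggested above: found by both heuristics.
print(audit('<h1 class="newshead">My News Headline Here</h1>'))
# Class only, as on the sports site: found by the class heuristic alone.
print(audit('<p CLASS="newshead">Eagles collar Tams in UAAP'))
# Title in a table cell: found by neither.
print(audit("<td>&nbsp;<b>Headline</b></td>"))
```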

Filipino

Forum User
Chatty
Registered: 08/27/02
Posts: 50
Quote by Mark: A site does not need to be in valid HTML/XHTML to be in Google's news section. Many of them are far from it. For example, this was the number one news item on the link you posted. I had to manually set the doctype and encoding before it could even check because there's no DOCTYPE or CHARSET. It shows the page has 280 errors. As Dirk previously posted and this page also shows, they use a class to help the bot locate the beginning of an article: <p class="newshead">. In the article's body, it has this: <p class='text'>. All you have to do is ask Google if these are the standard classes their bot looks for or check some more sites to see if they have other ones. After you find the proper classes add them to your template then your problem should be solved.
Hi, you made this sound as if I were the one putting up the problems... A programmer in the Philippines is working with Google on this, using a dummy copy of our site. They say the problem is not the CSS; they are saying it is the unconventional way the PHP organizes the page that is generated on the fly. It looks as if certain parts will have to be modified to conform with valid output. Yes, I agree with what you say about ABS/CBN, but that is readable by Google's news bot. I am working with Google on this problem; are you? So a statement such as "A site does not need to be in valid HTML/XHTML to be in Google's news section." is not a lot of help when ours is not. Even the Geeklog site has been tried, just in case it was us... even that could not be read. Wayne

Mark Beihoffer

Anonymous
Hi Wayne - I totally understand your frustration in this - it is sometimes a mystery to me how Google's finer points work - I've never tried to get one of my sites to be listed on their news feeds, but I can definitely see how it'd be very frustrating to get the feedback you're getting from Google.

Just so we're sure, I'm not the same Mark that replied to you that time - funny how we're both posting in the same thread here. :-)

I wonder if adding header classes to the CSS would help? I might give it a whirl this weekend if I get time - but I'm moving this weekend so it might take me a week or so.

I like your site - it's very nicely formatted and laid out - don't blame Geeklog for the Google problems, they're doing the best they can on a free project I think. And I think they do it really well...

I'm hoping that I can convince Dirk and the other guys that Geeklog 2.0 needs to be standards-compliant and accessible - I love the flashy layouts that come with the default Geeklog themes but honestly I'd prefer something simple - my ideal in web design is sites like Kuro5hin.org, really easy on the eyes, simple layouts, not too many graphics, and best of all, they're crawler-friendly.

One of my biggest wishes about Geeklog 2.0 (whew, this is turning into a rant! sorry.) would be for the URI conventions to *all* be crawler-friendly - not only would this help with say Google's news trawler, but it would also help all the other (more "crippled" ;-)) search engines to successfully crawl the site.

Anyway sorry for the extended rant, I'm in kind of a raving mood tonight and I wanted to make sure you're not giving up on Geeklog just yet, despite your struggles with the Google news feed.