I've been spending some time looking into search engine optimization for BlogCFC. I hadn't given this much thought until recently, when I launched a new blog and noticed that the majority of my posts weren't being indexed in Google, and the ones that were weren't being indexed as I thought they would be.
It turns out that Google is fairly picky about how it indexes sites, especially dynamic ones like those that use blogCFC as the engine for their blogs. I've been discussing with Ray Camden changes that can be made to blogCFC that will make it more search engine friendly, especially for Google. The changes I'm proposing are all to the core CFC and related layout files and don't get into server-side changes that you can make, such as SES/friendly URLs. That's a topic all its own, and one beyond the simple changes I'm proposing. Ray's agreed to consider them for a future release.
The main issue as I see it is that Google and other search engines aren't indexing individual blog entries, the ones you would associate with an entry's permalink. Instead, Google seems to index entries wherever it finds them. This means that sometimes it will index a particular entry from your blog's homepage (undesirable, since the homepage changes and entries drop off), from calendar links, from category pages, etc. The point is that Google isn't being provided with one single version of any of your blogCFC posts. Instead, it's left to its own devices to crawl your site and index the content.
The calendar pod makes this worse. When Google encounters the calendar on a page, it attempts to follow all of the links. While this is a slight problem because a post can show up under day, month and year links, it's an even bigger deal with the > and < links, which move the calendar backward and forward in time. Google could go on crawling these links backward and forward indefinitely. Luckily, Google is coded such that it won't let itself get caught in this sort of endless indexing of dynamic content. The bad news, though, is that when it realizes it's caught like this, it stops indexing your content. Google has stated on their site that while the engine will index dynamic content, it may limit how much it indexes on any given site.
With this in mind, I'm of the opinion that it's better to code the calendar component so that you can't move forward or backward past the range covered by your entries. That is, you can't move back to a time before the first entry was made, and you can't move forward in time past the current month.
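To make the calendar idea concrete, here's a rough sketch of the bounds check. It's written in Python rather than CFML purely for illustration, and it isn't how blogCFC's calendar pod is actually implemented. The function names (month_index, calendar_nav, clamp_month) and the idea of passing in the date of the first entry are my own assumptions.

```python
from datetime import date


def month_index(d):
    """Collapse a date to a single month counter so months compare easily."""
    return d.year * 12 + (d.month - 1)


def calendar_nav(shown, first_entry, today):
    """Decide whether the < and > links should be rendered.

    shown       -- any date inside the month the calendar is displaying
    first_entry -- date of the oldest blog entry (hypothetical input)
    today       -- the current date
    Returns (show_prev, show_next).
    """
    current = month_index(shown)
    show_prev = current > month_index(first_entry)
    show_next = current < month_index(today)
    return show_prev, show_next


def clamp_month(year, month, first_entry, today):
    """Pull an out-of-range month request back into the valid range.

    Covers the case where a crawler already holds a URL for a month
    outside the range, even though the links are no longer rendered.
    Returns (year, month).
    """
    idx = year * 12 + (month - 1)
    idx = max(month_index(first_entry), min(idx, month_index(today)))
    return idx // 12, idx % 12 + 1
```

With a first entry in June 2005 and a current month of September 2005, the June calendar would render only the > link, the September calendar only the < link, and a request for January 1999 would be answered with June 2005. A crawler following the calendar then has a finite number of pages to visit.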
Here are the recommendations I'm making that should improve the search engine friendliness of blogCFC. If you have any additional suggestions/comments, please make them here and I'll be sure to compile them and pass them along to Ray:
I'm sure there's a lot more that can be done, but this should serve as a good starting point for making blogCFC more search engine friendly.
This is along the lines of what I noticed as well - only some of my posts were making it into Google, and the ones that did weren't coming from the individual entries but rather from what seemed like random calendar pages.
Given the number of people using blogCFC these days, I think getting these issues cleared up is going to result in a lot more traffic for everyone, not to mention a lot more CF-related content in Google.
One thing though: you mention marking the blog main page as non-indexable. Isn't this going to put many blogs in the position of having an unlisted home/main page? My gut feeling is that provided the main page is clearly indicated as being such within a search engine listing, visitors generally understand and are tolerant of any discrepancies between the actual content and that listed in the search engine results. This seems to be a common problem and applies (of course) to many news sites and portals, as well as blogs.
Having said that, my gut feeling applies mainly to people who understand how Web apps and indexing work. The less experienced may well find such discrepancies baffling. But then, what to do about the main page?
I had the same thought about whether or not to make the main blog page indexable. In the end, I was worried that if Google indexed the main page, it would only index that version of those entries, creating some gaps. I'm not certain, though, as I don't really know how Google treats this. It will be interesting to see how blogs implementing this show up in Google after they are next indexed. That's part of the reason Ray is keeping some of this stuff in beta, so we get a chance to test and tweak before it becomes a final release.
I'd love to hear any additional thoughts you have here.