uGuX SEO

Using WhiteHat SEO to achieve BlackHat SEO results (minus the penalties)

Splogs (or Snews) Using Web-Stemming

Published by uGuX on February 11, 2008 10:39 am under Blog, SEO (Search Engine Optimization), Scripts

Most readers are familiar with the current trend of splogs (spam blogs) stuffed with three AdSense units above the fold.

These pages rely on the blackhat SEO technique of scraping RSS feeds from multiple blogs. The scraped pages then show up in Google's results (if the SEO-er is innovative enough).

If you use partial feeds, you’re safe, right?

Here’s a Python Webstemmer that takes it all to a new level.

“Snews”: scraping the news sites

Here’s their claimed accuracy:

New York Times 488.8/552.2 (88%)
Newsday 373.7/454.7 (82%)
Washington Post 342.6/367.3 (93%)
Boston Globe 332.9/354.9 (93%)
ABC News 299.7/344.4 (87%)
BBC 283.3/337.4 (84%)
Los Angeles Times 263.2/345.5 (76%)
Reuters 188.2/206.9 (91%)
CBS News 171.8/190.1 (90%)
Seattle Times 164.4/185.4 (89%)
NY Daily News 144.3/147.4 (98%)
International Herald Tribune 125.5/126.5 (99%)
Channel News Asia 119.5/126.2 (94%)
CNN 65.3/73.9 (89%)
Voice of America 58.3/62.6 (94%)
Independent 58.1/58.5 (99%)
Financial Times 55.7/56.6 (98%)
USA Today 44.5/46.7 (96%)
NY1 35.7/37.1 (95%)
1010 Wins 14.3/16.1 (88%)
Total 3829.1/4349.2 (88%)

It’s fairly accurate: 88% overall, and that is while scraping professional news sources. If you read a lot of news online, you know how fragmented the text on those pages is, with news items broken up by random ads. Now, how much easier would it be to scrape WordPress blogs that all share the exact same template structure? It would not be hard at all.
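As a quick sanity check on the figures above, here is a minimal Python sketch that parses a few rows copied from the table and divides the first figure by the second. Treating the percentage as that simple ratio is an assumption on my part; the listed per-site percentages can differ from it by a point, so the tool may compute them differently. The names `parse_rows` and `ratio_percent` are made up for this example.

```python
import re

# A few rows copied verbatim from the table above.
TABLE = """\
New York Times 488.8/552.2 (88%)
Washington Post 342.6/367.3 (93%)
Total 3829.1/4349.2 (88%)
"""

# site name, first figure, second figure, listed percentage
ROW = re.compile(r"^(?P<site>.+?) (?P<num>[\d.]+)/(?P<den>[\d.]+) \((?P<pct>\d+)%\)$")


def parse_rows(text):
    """Yield (site, first figure, second figure, listed percent) per line."""
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            yield m["site"], float(m["num"]), float(m["den"]), int(m["pct"])


def ratio_percent(num, den):
    """Assumed reading of the table: first figure over second, in percent."""
    return 100.0 * num / den


for site, num, den, listed in parse_rows(TABLE):
    print(f"{site}: computed {ratio_percent(num, den):.1f}%, listed {listed}%")
```

On this reading, the "88% average" is the ratio of the Total row rather than a mean of the per-site percentages.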

Below is how text is broken up:

$ cat cnn.txt
!UNMATCHED: 200511210103/www.cnn.com/                                             (unmatched page)
!UNMATCHED: 200511210103/www.cnn.com/privacy.html                                 (unmatched page)
!UNMATCHED: 200511210103/www.cnn.com/interactive_legal.html                       (unmatched page)
…
!MATCHED: 200603010455/www.cnn.com/2006/HEALTH/02/09/billy.interview/index.html   (matched page)
PATTERN: 200511210103/www.cnn.com/2005/POLITICS/11/20/bush.murtha/index.html      (layout pattern name)
SUB-0: CNN.com - Too busy to cook? Not so fast - Feb 9, 2006                      (supplementary section)
TITLE: Too busy to cook? Not so fast                                              (article title)
SUB-10: Leading chef shares his secrets for speedy, healthy cooking               (supplementary section)
SUB-17: Corporate Governance                                                      (supplementary section)
SUB-17: Lifestyle (House and Home)
SUB-17: New You Resolution
SUB-17: Billy Strynkowski
MAIN-20: (CNN) -- A busy life can put the squeeze on healthy eating. But that     (main text)
         doesn't have to be the case, according to Billy Strynkowski, executive
         chef of Cooking Light magazine. He says cooking healthy, tasty meals
         at home can be done in 20 minutes or less.
MAIN-20: CNN's Jason White interviewed Chef Billy to learn his secrets for
         healthy cooking on the run.
…
SUB-25: Health care difficulties in the Big Easy                                  (supplementary section)

!MATCHED: 200603010455/www.cnn.com/2006/EDUCATION/02/28/teaching.evolution.ap/index.html  (another matched page)
PATTERN: 200511210103/www.cnn.com/2005/POLITICS/11/20/bush.murtha/index.html      (layout pattern name)
SUB-0: CNN.com - Evolution debate continues - Feb 28, 2006                        (supplementary section)
TITLE: Evolution debate continues                                                 (article title)
SUB-17: Schools                                                                   (supplementary section)
SUB-17: Education
MAIN-20: SALT LAKE CITY (AP) -- House lawmakers scuttled a bill that would have   (main text)
         required public school students to be told that evolution is not
         empirically proven -- the latest setback for critics of evolution.


…
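To make the structure of that output concrete, here is a rough Python sketch that groups the labelled lines into one record per matched page. It assumes the real file looks like the listing above minus the parenthesised annotations, with continuation lines indented under their MAIN or SUB label. `parse_pages` and the dictionary keys are hypothetical names for this example and are not part of Webstemmer.

```python
import re

# Excerpt of the listing above, without the parenthesised annotations.
SAMPLE = """\
!UNMATCHED: 200511210103/www.cnn.com/privacy.html
!MATCHED: 200603010455/www.cnn.com/2006/HEALTH/02/09/billy.interview/index.html
PATTERN: 200511210103/www.cnn.com/2005/POLITICS/11/20/bush.murtha/index.html
SUB-0: CNN.com - Too busy to cook? Not so fast - Feb 9, 2006
TITLE: Too busy to cook? Not so fast
SUB-17: Billy Strynkowski
MAIN-20: (CNN) -- A busy life can put the squeeze on healthy eating. But that
         doesn't have to be the case, according to Billy Strynkowski, executive
         chef of Cooking Light magazine. He says cooking healthy, tasty meals
         at home can be done in 20 minutes or less.
MAIN-20: CNN's Jason White interviewed Chef Billy to learn his secrets for
         healthy cooking on the run.
"""

# A label such as "!MATCHED", "TITLE", "MAIN-20" or "SUB-17" at line start.
LABEL = re.compile(r"^(!?[A-Z]+(?:-\d+)?): (.*)$")


def parse_pages(text):
    """Return one dict per !MATCHED page; !UNMATCHED pages are skipped."""
    pages = []
    current = None  # record being filled, or None outside a matched page
    last = None     # list holding the latest MAIN/SUB entry, for continuations
    for raw in text.splitlines():
        if not raw.strip():
            continue
        m = LABEL.match(raw)
        if m is None:
            # Indented continuation line: glue it onto the previous entry.
            if last is not None:
                last[-1] += " " + raw.strip()
            continue
        label, value = m.group(1), m.group(2).strip()
        if label == "!MATCHED":
            current = {"page": value, "pattern": None, "title": None,
                       "main": [], "sub": []}
            pages.append(current)
            last = None
        elif label == "!UNMATCHED":
            current, last = None, None
        elif current is None:
            continue
        elif label == "PATTERN":
            current["pattern"], last = value, None
        elif label == "TITLE":
            current["title"], last = value, None
        elif label.startswith("MAIN"):
            current["main"].append(value)
            last = current["main"]
        elif label.startswith("SUB"):
            current["sub"].append(value)
            last = current["sub"]
    return pages


pages = parse_pages(SAMPLE)
for page in pages:
    print(page["title"])
    print("\n\n".join(page["main"]))
```

Each record ends up with the article title and the main-text paragraphs separated from the supplementary sections, which is the separation the listing above shows.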

1 Comment so far

  1. Asian on June 27th, 2008

    Hi,
    Your blog has some pretty useful SEO info. But when you make up terms like “SNews” for splogs, you should check on Google whether anything else already goes by that name. SNews is a very popular open source CMS. It has nothing to do with splogs.
    Regards,
    Neo



Copyright © 2008 uGuX