Creating RSS feeds from web pages with Ruby

If you are anything of a geek, you like to stay up to date about your craft. There is nothing better than reading about other people’s experiences to improve your own skills and insights. The easiest way to keep following someone’s blog is to add it to your favourite RSS aggregator. Unfortunately, even in 2015, you still find blogs that don’t serve an RSS feed of their content.

But don’t be sad. You’re not the only one. So the idea came to me that it shouldn’t be difficult to parse a web page and create an RSS feed from it that I can serve myself.

In search of a solution

But before reinventing the wheel, I went looking for an existing solution. Since I’m annoyed by this problem, surely someone else has been as well and has created a solution for it. As it turns out, someone did. I came across the gem Feedalizer, written by Christoffer Sawicki.

So I thought: “Great! I’ll just use that one!” Unfortunately, the gem was abandoned and hadn’t been updated since 2008. Since I couldn’t find the original source code anymore, I downloaded the Ruby gem and extracted the code from it.

Reviving the project

Analyzing the code brought the following points to my attention:

* the gem structure was outdated, which is logical, since things have changed since 2008
* it used Hpricot for parsing the HTML pages. Hpricot development stopped in 2013

So I basically started refactoring the gem (on the plus side, it had tests) and replaced Hpricot with Oga. And there it was, an old gem revived from the dead.

It was a long shot, but I contacted Christoffer to ask if he would be OK with granting me owner access to the gem. This way, I could push a new version out. And what do you know, he did (you’ve got to love open source)!

A new, updated version saw the light of day for anyone interested in using it 🙂

Feedalizer usage

Installing is the easiest part. Just type gem install feedalizer and you are good to go. Now let’s take my own blog as an example. On the homepage, you will find an overview of the last 7 articles published. We will use this to create an RSS feed. The HTML markup for one article in the list is as follows:

<article itemscope="" itemtype="http://schema.org/BlogPosting">
  <header>
    <span itemprop="author" class="hidden">
      <a href="https://plus.google.com/118158441381960305269?rel=author">Michaël Rigart</a>
    </span>
    <span itemprop="datePublished" class="hidden">
      2015-08-06
    </span>
    <span itemprop="url" class="hidden">
      http://www.michaelrigart.be/en/blog/useful-ruby-gems-to-improve-your-code-quality-and-skills.html
    </span>
    <ul class="post-meta">
      <li>August 6th, 2015</li>
      <li class="comment-1">
        <span class="icon"></span> 
        <a href="http://www.michaelrigart.be/en/blog/useful-ruby-gems-to-improve-your-code-quality-and-skills.html#disqus_thread">Comments</a>
      </li>
      <li itemscope="" itemtype="http://schema.org/Person">
        Written by 
        <span itemprop="name">
          <a itemprop="url" href="http://www.michaelrigart.be/en/about-me.html">Michaël Rigart</a>
        </span>
      </li>
    </ul>
    <h3 itemprop="name">
      <a title="Read more about Useful Ruby gems to improve your code quality and skills" href="http://www.michaelrigart.be/en/blog/useful-ruby-gems-to-improve-your-code-quality-and-skills.html">Useful Ruby gems to improve your code quality and skills</a>
    </h3>
  </header>
  <div class="entry-content">
    <p itemprop="description">How do you prevent your application to deteriorate over time? Or even improve your coding skills? That's a question a lot of developers ask.</p>
    <a title="Read more about Useful Ruby gems to improve your code quality and skills" class="btn btn-small btn-subtle" href="http://www.michaelrigart.be/en/blog/useful-ruby-gems-to-improve-your-code-quality-and-skills.html">Read more about Useful Ruby gems to improve your code quality and skills</a>
  </div>
</article>

From this block, we can extract all the necessary data we need to build our own RSS feed. Take a look at the implementation:

#!/usr/bin/env ruby

require 'feedalizer'
require 'time'

feedalize('http://www.michaelrigart.be/en/blog.html') do
  feed.title = 'Michael Rigart'
  feed.description = 'Tech and findings'

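  # Every <article> element on the page becomes one RSS item.
  scrape_items('//article') do |rss_item, html_element|
    # The hidden url span is padded with whitespace in the markup, so strip it.
    rss_item.link = html_element.xpath('header//span[@itemprop="url"]').text.strip
    rss_item.title = html_element.xpath('header//h3//a').text
    rss_item.date = Time.parse(html_element.xpath('header//span[@itemprop="datePublished"]').text)

    # Use the teaser paragraph as the item description.
    rss_item.description = html_element.xpath('div[@class="entry-content"]//p').text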
  end

  output!
end

We start our script off by setting the proper Ruby shebang, followed by requiring the feedalizer and time libraries. The time library is used to parse the publish date of posts.
Then comes the real work.

We call the feedalize method, which is basically a wrapper around the Feedalizer object. It takes the URL of the HTML page you want to turn into an RSS feed and a code block.

Inside the code block we start by setting the feed title and description. In this example, we set them manually. Next we call the scrape_items method. The scrape_items method takes an XPath expression and an optional limit. The XPath expression defines which wrapper elements hold the article details (in our example, the articles are wrapped inside the HTML article tag). By default, Feedalizer limits itself to the first 15 elements it encounters.

The scrape_items method passes a new RSS item and the HTML element it extracted to the block, once for every matching element. Inside the block we extract the data we need. In this example, we set the URL, title, publish date and description. As you can see, we search for the data by using further XPath expressions, this time relative to the article element.
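One detail worth spelling out: in the markup above, the text of the datePublished span is surrounded by newlines and indentation. The sketch below shows one way to normalise such a value before parsing it, using only Ruby's standard library. The helper name parse_publish_date is hypothetical and not part of Feedalizer; returning nil for unparseable input is an assumption about how you might want to handle bad data.

```ruby
require 'time'

# Hypothetical helper (not part of Feedalizer): turn the raw text of the
# datePublished span into a Time, or nil when it cannot be parsed.
def parse_publish_date(raw_text)
  Time.parse(raw_text.strip)
rescue ArgumentError
  nil
end

# The span text as it appears in the markup, including the whitespace.
raw = "\n      2015-08-06\n    "
date = parse_publish_date(raw)
puts date.strftime('%Y-%m-%d') if date
```

Inside the scrape_items block you could then skip or log items whose date comes back as nil, instead of letting one malformed article abort the whole feed.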

After all the items have been processed, we simply print the result to STDOUT by calling the output! method. You can also use the regular output method, which returns the XML as a string. This way, you can save the output to a file. By replacing output! with File.open('/tmp/rss.xml', 'w') { |file| file.write(output) }, the XML will be stored in a file called rss.xml in the /tmp folder.
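If the file is going to be picked up by a web server while the script regenerates it, it may be worth writing it in two steps so a reader never sees a half-written feed. The following is a minimal sketch, not something Feedalizer provides: write_feed is a hypothetical helper, and feed_xml is a placeholder string standing in for the value returned by output.

```ruby
require 'tmpdir'

# Hypothetical helper: write the XML to a temporary file next to the target
# and then rename it into place, so the target is replaced in one step.
def write_feed(xml, path)
  tmp_path = "#{path}.tmp"
  File.write(tmp_path, xml)
  File.rename(tmp_path, path)
  path
end

# Placeholder standing in for the string returned by Feedalizer's output.
feed_xml = '<rss version="2.0"><channel><title>Example</title></channel></rss>'

target = File.join(Dir.tmpdir, 'rss.xml')
write_feed(feed_xml, target)
puts "Feed written to #{target}"
```

In the real script you would call write_feed(output, '/tmp/rss.xml') at the end of the feedalize block instead of output!.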

Of course, you want to add the feed to your preferred RSS aggregator. You could do this by setting up a web server like Apache or Nginx and serving the generated XML files.

Since I’m picking up this project, let me know through a GitHub issue if you encounter any problems or have any new ideas or suggestions, and I’ll have a look at it.