Scraping with RSS.it.com [HACK]

With RSS.it.com, you can use XPath to scrape websites that do not offer a dedicated RSS feed. (Note that some websites do not consent to scraping and prohibit it in their terms of service, so choose your websites carefully.)

01.

Find a URL

The first step is to find the URL of the page you want a continuous feed of. I chose our Why Us? page as a safe option.

02.

Add feed

On our website, click Subscription Management on the homepage (once you are logged in) and switch to Add a feed or category. From there, enter the feed URL and the category you want it to be in.

Here comes the crazy part. Instead of adding the feed the normal way, click Type of feed source, right below the category field. From there, select HTML + XPath (Web scraping). More input fields will appear. First, add the XPath for the feed title, which is usually the <title> or the <h1> tag, so you can use either //title or //h1. Alternatively, you can enter a static feed title, no big deal.
(Note: a leading // in XPath selects matching nodes anywhere in the document, no matter how deeply they are nested.)
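
To see what the two feed-title expressions pick up, here is a minimal sketch using Python's standard library. The page markup is made up for illustration, and ElementTree only accepts well-formed markup and a subset of XPath, so //title is written as .//title relative to the root element. This is not how RSS.it.com evaluates the expression internally, only a way to try the idea locally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """<html>
  <head><title>Why Us? | Example Site</title></head>
  <body>
    <h1>Why Us?</h1>
    <div class="section"><h2>Fast</h2><p>Feeds refresh often.</p></div>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# ElementTree paths are relative to an element, so the document-wide
# //title and //h1 become .//title and .//h1 starting from the root.
title_from_title_tag = root.findtext(".//title")
title_from_h1 = root.findtext(".//h1")

print(title_from_title_tag)
print(title_from_h1)
```

The <title> tag often carries the site name as well, so //h1 can give a cleaner feed title on pages that have exactly one <h1>.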

The most important part is the XPath for finding news items, and this is where you need to be careful. For example, on our Why Us? page, each news item sits in a <div class="section"> tag, so the XPath would be //div[@class="section"]. That expression only matches when the class attribute is exactly "section". Because a single div can carry multiple classes, you can use the contains function instead: //div[contains(@class, "section")]. Keep in mind that contains performs a substring match, so it also catches similarly named classes (a hypothetical "subsection", for instance), and that extra whitespace in the class attribute can break exact matches. But we’ll leave that to you.
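
The difference between the exact match and the contains match can be sketched as follows. The markup and class names are invented for the example. Python's standard-library ElementTree does not implement the XPath contains() function, so the substring and token tests are emulated in plain Python; the comments name the XPath expression each one stands for.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one plain item, one with two classes,
# and one with a similarly named class.
PAGE = """<body>
  <div class="section">A</div>
  <div class="section highlight">B</div>
  <div class="subsection">C</div>
</body>"""

root = ET.fromstring(PAGE)
divs = root.findall(".//div")

# //div[@class="section"]: the attribute must equal the string exactly.
exact = [d.text for d in root.findall(".//div[@class='section']")]

# //div[contains(@class, "section")]: plain substring test (emulated).
substring = [d.text for d in divs if "section" in d.get("class", "")]

# Whole-token match (emulated). The usual XPath 1.0 idiom for this is
# //div[contains(concat(" ", normalize-space(@class), " "), " section ")]
token = [d.text for d in divs if "section" in d.get("class", "").split()]

print(exact, substring, token)
```

Here the exact match misses the two-class div, the substring match also picks up "subsection", and the token match selects only the divs that really carry the "section" class.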

The next field is the XPath for the news title, evaluated relative to each news item. In our case, it is descendant::h2. (descendant:: selects all descendants (children, grandchildren, etc.) of the current node.)

The item content field takes the XPath for the content of the news item. In our case, it is .//p. (The .// abbreviation selects matching descendants of the current node, so .//p finds every <p> inside the item.)

Let's go over the other options briefly. The XPath for the news link points to the link of the news item; in our case, it is descendant::a/@href. Likewise, the XPath for the news date and the XPath for the news image select the date and the image of the news item. The field names are self-explanatory.
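
Putting the item, title, content, and link fields together, the extraction can be sketched roughly like this. The markup is hypothetical, and since ElementTree supports neither the descendant:: axis nor attribute selection with /@href, the sketch uses the equivalent .//h2 and .//a forms and reads href from the element. It is an illustration of the idea, not RSS.it.com's actual implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with two news items.
PAGE = """<body>
  <div class="section">
    <h2>Fast updates</h2>
    <p>Feeds refresh often.</p>
    <a href="/why-us#fast">Read more</a>
  </div>
  <div class="section">
    <h2>Any website</h2>
    <p>No native feed needed.</p>
    <a href="/why-us#any">Read more</a>
  </div>
</body>"""

root = ET.fromstring(PAGE)
items = []

# Items: //div[@class="section"]
for node in root.findall(".//div[@class='section']"):
    # Each field below is looked up relative to the current item node.
    link = node.find(".//a")                    # descendant::a
    items.append({
        "title": node.findtext(".//h2"),        # descendant::h2
        "content": node.findtext(".//p"),       # .//p (first paragraph only)
        "link": link.get("href") if link is not None else None,  # .../@href
    })

for item in items:
    print(item)
```

Because every field is resolved relative to its own item node, the title of one item can never be paired with the link of another.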

03.

Additional options

Now that the scraping part is done, it's time for the additional options. You can supply a username and password for pages that require authentication, or set cookies. You can also change the request method (GET or POST). There are other options you can explore.
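
To give a feel for what these options mean at the HTTP level, here is a small sketch that builds (but does not send) a request with credentials, a cookie, and the POST method. The URL, credentials, cookie, and form data are all made up, and the use of HTTP Basic authentication is an assumption: this post does not say which scheme RSS.it.com uses behind the scenes.

```python
import base64
import urllib.request

# All values below are hypothetical placeholders.
URL = "https://example.com/members/news"
USERNAME, PASSWORD = "alice", "s3cret"

# HTTP Basic auth (assumed scheme): base64 of "username:password".
credentials = base64.b64encode(f"{USERNAME}:{PASSWORD}".encode()).decode()

request = urllib.request.Request(
    URL,
    data=b"page=1",                      # form body sent with the POST
    headers={
        "Authorization": f"Basic {credentials}",
        "Cookie": "session=abc123",      # cookie copied from a logged-in browser
    },
    method="POST",                       # switch to "GET" (and drop data) if needed
)

# Inspect the prepared request; nothing is sent over the network here.
print(request.get_method())
print(request.get_header("Cookie"))
```

Passing the request to urllib.request.urlopen would actually send it; that step is left out so the sketch runs without touching a real site.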

04.

And Voilà!

Everything is set up now. Press the Add button and you are good to go. The website will refresh the feed every few minutes to keep you up to date. Until then, sit back and relax.