The purpose of this tutorial is to get you acquainted with the basic functions of RTILA.
This tutorial assumes that you have already installed RTILA. Please get the latest version from the repository:
There are some concepts you need to know before you scrape a page. First, identify whether you will extract from a list page or a detail page.
List pages are usually built with repeating structures such as categories, pagination, or repeating blocks of information like lists, grids or rows.
Sometimes those pages contain valuable information themselves. Other times they only serve as a place to grab the links to the inner pages they point to.
Here are a couple of list page samples.
For these types of pages you need to set the RTILA project type to List Pages. This is the default setting.
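To make the "links to the inner pages" role concrete, here is a minimal Python sketch of treating a list page purely as a source of detail-page links. It is not how RTILA works internally. The markup, the book slugs and the names LIST_PAGE_HTML and detail_links are made up for illustration; only the site URL comes from this tutorial.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# URL of the list page used later in this tutorial.
LIST_PAGE_URL = "http://books.toscrape.com/"

# Hypothetical, simplified list-page markup: one <li> per repeating block.
LIST_PAGE_HTML = """
<ol>
  <li><h3><a href="catalogue/sample-book-one_1/index.html">Sample Book One</a></h3></li>
  <li><h3><a href="catalogue/sample-book-two_2/index.html">Sample Book Two</a></h3></li>
</ol>
"""

def detail_links(markup, page_url):
    """Return the absolute URL of every link found in the repeating blocks."""
    root = ET.fromstring(markup.strip())
    links = []
    for anchor in root.findall("./li/h3/a"):
        # hrefs on list pages are often relative, so resolve them against the page URL
        links.append(urljoin(page_url, anchor.get("href")))
    return links

links = detail_links(LIST_PAGE_HTML, LIST_PAGE_URL)
for link in links:
    print(link)
```

Each printed URL would then be a detail page to visit in a second pass.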
RTILA allows you to select which elements you want to scrape, and generates a CSS selector automatically.
This works out of the box, but you can fine-tune the CSS selector manually if needed.
To do this, you need to open the Chrome Inspector and try to identify the main structure of the repeating blocks. This will allow you to select the desired elements easily.
Let’s use the books page as an example. Open this URL:
As you can see, each book sits inside its own block of markup.
If you are not comfortable searching the DOM structure, you can use the select element button.
This is how the DOM will look.
RTILA has a real time element inspector too.
In the image, the full CSS selector generated by RTILA is:
LI.col-xs-6.col-sm-4.col-md-3.col-lg-3 > ARTICLE.product_pod > DIV.product_price > P.price_color
How should we read this? For now, you can discard the classes to make it simpler:
LI > ARTICLE > DIV > P
This gives us the route or path to the element we want to extract.
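If it helps to see that reading done mechanically, the following Python sketch splits the selector at each ">" and separates the tag from its classes. The function names split_steps and strip_classes are hypothetical, and the sketch only handles simple selectors made of tags, classes and ">"; it is not a full CSS parser.

```python
FULL_SELECTOR = (
    "LI.col-xs-6.col-sm-4.col-md-3.col-lg-3 > ARTICLE.product_pod"
    " > DIV.product_price > P.price_color"
)

def split_steps(selector):
    """Split a child-combinator selector into (tag, [classes]) steps."""
    steps = []
    for part in selector.split(">"):
        # "P.price_color" -> tag "P", classes ["price_color"]
        tag, *classes = part.strip().split(".")
        steps.append((tag, classes))
    return steps

def strip_classes(selector):
    """Drop every .class so that only the tag path remains."""
    return " > ".join(tag for tag, _ in split_steps(selector))

# Print one step per line: parent first, target element last.
for tag, classes in split_steps(FULL_SELECTOR):
    print(tag, classes)

simplified = strip_classes(FULL_SELECTOR)
print(simplified)  # LI > ARTICLE > DIV > P
```

Read left to right, each step is a direct child of the one before it, and the last step is the element whose content gets extracted.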
RTILA accepts manual input for CSS selectors.
Press the green + button and enter: LI > ARTICLE > DIV > P
However, there is a problem: we only need the price, but some unneeded data is being picked up as well. This is why we want to use classes or identifiers, when available, to create more accurate selectors.
Look at this partial DOM image. Press CTRL+F in the inspector and type in the selector. You will get the same result that RTILA showed before: 40 elements, twice the 20 prices we want.
How can we get a narrower selector? Let’s refine it by adding a class.
For example we could write:
LI > ARTICLE > DIV > P.price_color
LI > ARTICLE > DIV > P:first-child
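The difference between the loose and the refined selectors can be checked outside the browser too. The sketch below is an approximation under stated assumptions: the markup is a simplified, made-up stand-in for one book block (the second P as an availability line is an assumption), and because Python's standard library has no CSS engine, each selector is hand-translated into an ElementTree path.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup for one book, modeled on the selector path.
BOOK = """
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
  <article class="product_pod">
    <h3><a title="Sample Book">Sample Book</a></h3>
    <div class="product_price">
      <p class="price_color">10.00</p>
      <p class="instock availability">In stock</p>
    </div>
  </article>
</li>
"""

# A list page with 20 identical books.
page = ET.fromstring("<ol>" + BOOK * 20 + "</ol>")

# LI > ARTICLE > DIV > P
loose = page.findall("./li/article/div/p")

# LI > ARTICLE > DIV > P.price_color
# Note: this compares the whole class attribute, which is enough here
# because the price paragraph carries a single class.
by_class = page.findall("./li/article/div/p[@class='price_color']")

# LI > ARTICLE > DIV > P:first-child
# Note: p[1] means "first <p> among its <p> siblings", which coincides
# with :first-child in this markup.
first_p = page.findall("./li/article/div/p[1]")

print(len(loose), len(by_class), len(first_p))  # 40 20 20
```

The loose path matches two paragraphs per book, while either refinement leaves exactly one match per book.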
There is no single right selector among these; the final decision is yours.
OK, now that we know how to use a class, you could try to create a simple selector like: P.price_color
That will target the 20 items we need. Let’s see what happens.
We are getting an error message. Why?
On list pages we should select every piece of information from the same base structure. In our case, that means each selector starts from the same LI element as its base.
The base element depends on the DOM structure, so it will vary from project to project. In this case we could use LI or ARTICLE.
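One way to picture why a shared base matters is to think of each base element as one row of results, with every field looked up inside it. The Python sketch below illustrates that idea only; it is not a description of RTILA's internals. The markup, titles, prices, the H3 > A title path and the name scrape_rows are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-book list page; titles and prices are made up.
PAGE = """
<ol>
  <li><article>
    <h3><a title="Sample Book One">Sample Book One</a></h3>
    <div><p class="price_color">10.00</p><p class="instock">In stock</p></div>
  </article></li>
  <li><article>
    <h3><a title="Sample Book Two">Sample Book Two</a></h3>
    <div><p class="price_color">12.50</p><p class="instock">In stock</p></div>
  </article></li>
</ol>
"""

def scrape_rows(markup, base_path, field_paths):
    """Build one row per base element; each field is looked up inside it."""
    root = ET.fromstring(markup.strip())
    rows = []
    for base in root.findall(base_path):
        row = {}
        for name, path in field_paths.items():
            # findtext searches relative to this base element only
            row[name] = base.findtext(path)
        rows.append(row)
    return rows

rows = scrape_rows(
    PAGE,
    base_path="./li",  # the LI base element
    field_paths={
        # LI > ARTICLE > H3 > A
        "title": "./article/h3/a",
        # LI > ARTICLE > DIV > P.price_color
        "price": "./article/div/p[@class='price_color']",
    },
)
for row in rows:
    print(row)
```

Because both fields are resolved inside the same LI, each title stays paired with its own price. A selector with no base, such as a bare P.price_color, gives no such anchor to tie the price to a row.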
Time to practice
Open RTILA. We will now create your first project.
Press the +New Project button and choose Quick.
Navigate to: http://books.toscrape.com/
Choose the Inspection radio button, and press Select (Empty Header Format).
Press the green + button and click on a book image. You can click on any of them.
RTILA will build the selector for you. Visually, you will see the images inside a yellow border, and a purple selector box with the number 20 inside.
Press the green + button again and choose one book’s title.
And again, do the same for the prices.
I’m sure you got the point. See, it’s not too difficult.
Click the Save button when you’re done. RTILA will start scraping and show you the results page when it has finished.
You can go back to this results page at any time.
Here you will be able to preview or export the results to a file.
Feel free to explore this page and then close it when you are done.
Good. So how did it go?
I hope you grasped the basics and now understand what’s going on when selecting elements from the browser.
CONTINUE READING: Tutorial 2 – Creating bridges with RTILA