Welcome to the RTILA help page
Please post questions, follow discussions and share your knowledge in the RTILA Visual Web Scraping Facebook group.
You can download the latest version from the official GitHub repository at:
RTILA is available for Mac, Linux, and Windows.
For the best experience, download the latest version for your operating system.
This software is actively developed, so some options could vary slightly.
With RTILA you can scrape text, URLs, email addresses, images or download any type of binary files from web pages, such as pdfs, docs, etc.
RTILA is very easy to use, but learning some scraping concepts will help you to succeed in your scraping projects.
You may purchase RTILA online at codecanyon.
Once you complete the payment, you will receive the license key file via email.
Paste your serial number to activate your license.
Please note that there are 3 different versions of RTILA: for Windows, Mac OS, and Linux.
Register the software using the required key for your Operating System.
The main window
This is the main screen. The main RTILA options are on the top of the screen and are Project selector, New, Settings, Inspector, Extractor, and Results.
You will most likely use almost all of them from the left to the right on each project.
The Project selector will be populated with all of your projects and you will be able to select your previous projects from here.
The New button will allow you to create a new project from scratch, or choose a pre-configured template. I recommend that you start by creating some projects from the «Books to Scrape» template and learning from those preset projects.
You can import your own projects or ones created by other RTILA users. Remember to create a New – Manual empty project before importing a new project. Importing a project overwrites the current one.
Starting a project from a template is an easy task, just click any one of them and you will see some additional info on the right panel (Name, URL and Type). Those projects can be customized later with your own parameters, but these are a good start.
At the top, there is a Search bar to find the project templates easily. Don’t worry, there are just a few of them for now.
Just choose one and click Import Template. Depending on the template you choose, you will have to go to Settings or Inspector to finish the configuration with your queries.
For now, just import a template and click Extractor. That will give you an idea about the information you are getting from the scrape.
You will see the project details are automatically populated in the options page.
Do you want to give it a try? Just press Extractor and RTILA will start scraping a sample site. Chrome will be launched and it will work for a couple of minutes. Click the RTILA Results button on the top right and wait until it is finished.
This will open a window with some stats – in my case – 1000 pages scraped in around 2 minutes. Click the Preview button to check the results, or Export the data to a file (CSV is recommended).
Now we will explore the Project Options pages. No need to start from scratch, I’ll get a template project and explain how each option works. Close the results page and go to the main RTILA screen.
Options – General page
On this screen you will set the Project name and URL to inspect on all of your projects (mandatory fields). Set a meaningful project name because it will be the unique text reference to find it later from the Project selector control.
The URL to inspect should be the starting point of the page you want to scrape. Usually this will be a listing page or a detail page. Listing pages are usually built with repeating structures like categories, paginations or repeating blocks of information such as lists, grids or rows.
Detail pages are usually the last pages to scrape (it could be a product page for example).
This detail is important because RTILA has to know if you are searching for a repeating structure, or if you are just scraping separate elements from a single page. You can learn more about Listing or Detail pages in this post.
The bridge is a great RTILA weapon to allow relations between pages. It’s the glue that will allow you to reference pages from other pages. You can learn more about bridges in this post. A project should be set as Bridge when you want to reference/call it from another project.
Bridges allow us to scrape sites with complex navigation. Do not worry about this concept now, because RTILA will allow you to scrape some data and reuse it on a second project, without the use of bridges. You can wait until you feel familiar with the basic scraping techniques, then learn the bridge system later.
NOTE: The bridge feature has been removed as of version 3.4.0. You can automate the browser navigation from the automation panel.
The next option is Silent mode, which scrapes the page without a graphical user interface. This means Chrome will be hidden. It’s probably fine to keep the default (No) during development. Silent mode will be useful when a project is scheduled to run every 5 minutes, for example.
The strategy option will allow you to choose between Extract data from the start or Extract only updated values. The last option will compare the results and show you only the differences.
Purge results cleans the results automatically. This limits the total number of stored sessions and discards the older scrapes, which is useful when you only want to keep the newer ones. Sometimes you will use RTILA to monitor a site with frequent or regular scrapes, and this gives you some auto-cleaning capabilities.
Adding empty results controls whether a row is kept when a particular page does not provide any results. Sometimes you can get no values, or the page could return a 404 error. By default, RTILA will discard that row, but you can force it to create an empty one.
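To picture what purging does, here is a minimal Python sketch that keeps only the newest sessions. It is just an illustration of the idea: the session tuples and the purge_sessions helper are hypothetical names of mine, not part of RTILA.

```python
from datetime import datetime


def purge_sessions(sessions, keep):
    """Return only the `keep` newest sessions (hypothetical helper).

    `sessions` is a list of (finished_at, label) tuples.
    """
    if keep <= 0:
        return []
    # Sort newest first, then keep the first `keep` entries.
    newest_first = sorted(sessions, key=lambda session: session[0], reverse=True)
    return newest_first[:keep]


# Illustrative data: three scrapes taken five minutes apart.
sessions = [
    (datetime(2024, 1, 1, 9, 0), "scrape 1"),
    (datetime(2024, 1, 1, 9, 5), "scrape 2"),
    (datetime(2024, 1, 1, 9, 10), "scrape 3"),
]
kept = purge_sessions(sessions, keep=2)
print([label for _, label in kept])
```

With a limit of 2, the oldest scrape is the one that gets discarded.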
Too many details? Focus on Project name, URL, and Project type. Those are the basic ones.
Options – Advanced page
Don’t be scared. This screen offers you many options, but as before, I’ll help you to set the basic ones.
The Device, Width, and Height options will help you set or emulate a specific device. You can set them as you like.
Instances, Concurrent tabs, and Wait time are REALLY important ones. They will set your actual scraping speed.
An instance is a full Chrome execution. An instance can open a certain number of tabs. That means 3 instances with 5 tabs each will open 15 pages at the same time.
My recommendation is, don’t abuse this. Set those values as slow as you can and don’t send multiple requests to a site. In some countries, people host their sites on shared hosting and multiple requests per second could leave you blocked.
It does not matter if you scrape for 2 minutes, or 20 minutes. To start testing, you can set to 1 instance, 1 tab and 3 seconds wait time. This is slower but safe.
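If you want to get a feeling for these numbers before touching the settings, the arithmetic can be sketched in Python. The first function is just the instances × tabs rule described above. The second one is my own rough estimate: it assumes every tab loads one page, waits the configured time, and then moves on, which may not match exactly how RTILA schedules its requests.

```python
def concurrent_pages(instances, tabs_per_instance):
    """Pages open at the same time: every instance opens its own tabs."""
    return instances * tabs_per_instance


def pages_per_minute(instances, tabs_per_instance, wait_seconds, load_seconds=0.0):
    """Rough upper bound on throughput (assumption: one page per tab per cycle)."""
    cycle_seconds = wait_seconds + load_seconds
    if cycle_seconds <= 0:
        raise ValueError("wait_seconds + load_seconds must be positive")
    return concurrent_pages(instances, tabs_per_instance) * 60 / cycle_seconds


print(concurrent_pages(3, 5))        # the aggressive example from the text
print(pages_per_minute(1, 1, 3))     # the safe starting point: 1 instance, 1 tab, 3 s
print(pages_per_minute(3, 5, 3))     # same wait time, 15 pages in parallel
```

The gap between the last two numbers is the reason to start slowly: the aggressive setup sends many times more requests to the same site in the same minute.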
Timeout and Retries should be self-explanatory: they set how long to wait and how many times to retry, respectively.
Next, you will see some switches. On almost all the checkboxes, you have an explanation, but I’ll explain some interesting ones.
Disable image loadings will allow you to discard loading photos on the sites, this could help you improve the load times.
Checksum calculation will prevent you getting duplicates. This is a very nice option when you are dealing with «Load more…» buttons or with tricky paginations that could cause duplicates and you don’t want to investigate further.
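The idea behind a checksum is simple: compute a fingerprint of every scraped row and skip any row whose fingerprint was already seen. The Python sketch below shows that general technique. I don’t know which hash or which fields RTILA uses internally, so SHA-256 and the helper names here are assumptions for illustration only.

```python
import hashlib


def row_checksum(row):
    """Stable SHA-256 digest of a row's field names and values."""
    # Sort the keys so the same data always produces the same digest.
    canonical = "\x1f".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drop_duplicates(rows):
    """Keep the first occurrence of every distinct row."""
    seen = set()
    unique = []
    for row in rows:
        digest = row_checksum(row)
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


# Illustrative rows: the second one repeats the first (as a «Load more…» button might cause).
rows = [
    {"title": "Book A", "price": "10.00"},
    {"price": "10.00", "title": "Book A"},
    {"title": "Book B", "price": "12.50"},
]
unique_rows = drop_duplicates(rows)
print(len(unique_rows))
```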
Do you have a list of pages to scrape? Just paste them here and you are done. Do you remember the URL that we set on the Options, URL to inspect? That one will be discarded when you set a list of URLs to crawl.
There are some nice filters available, basically for string manipulation. The most important one is the possibility to generate URLs to avoid automating paginations or creating URLs that follow a numeric pattern.
Here is an example: let’s imagine that on some site all of the pages we want to visit follow the pattern page1.html, page2.html and so on.
These parameters will generate a hundred URLs, from 1 to 100. Just insert a unique string identifier and RTILA will recreate all the pages for the feed.
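If it helps to see what the generator produces, this is the same idea written as a small Python sketch. The {n} placeholder, the example.com address and the generate_urls helper are my own choices for illustration; in RTILA you type your own unique string identifier in the filter.

```python
def generate_urls(pattern, start, end, placeholder="{n}"):
    """Replace the placeholder with every number from start to end (inclusive)."""
    if placeholder not in pattern:
        raise ValueError(f"pattern must contain the placeholder {placeholder!r}")
    return [pattern.replace(placeholder, str(number)) for number in range(start, end + 1)]


# Hypothetical site that follows the page1.html, page2.html... pattern.
urls = generate_urls("https://example.com/page{n}.html", 1, 100)
print(len(urls))
print(urls[0])
print(urls[-1])
```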
Do you need to upload the results to a remote site? Set the FTP credentials and the results will be automatically sent.
As we will see later, RTILA has scheduling options. You could set some scheduled scraping and deliver the results remotely.
Activate or deactivate the FTP feature from the «Auto upload last results» checkbox. Unchecked means, the FTP will remain disabled.
The Mailer page will allow you to configure the settings for email delivery.
You will probably use this when you set RTILA on an external server to monitor a site or scrape a site at regular intervals, using this option with the Scheduler.
Yes, we talked a bit about the Scheduler before.
You probably need to scrape the same page or site from time to time, to grab data, or just to monitor some data.
You can combine the scheduler capabilities with the FTP options or with the Mailer.
Do not forget to set a convenient value on the Purge results parameter (see Options – General page) to avoid storing hundreds or thousands of scraped sessions. You probably don’t need to store unlimited ones when the scheduler is active.
A proxy is a 3rd party server that enables you to route your request through their servers and use their IP address. When using a proxy, the website you are making the request to no longer sees your IP address but the IP address of the proxy.
The syntax of the proxies is USER:PASSWORD@IP:PORT
You can set one proxy per line, RTILA will perform the proxy rotation.
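To make the syntax concrete, here is a small Python sketch that parses a list in the USER:PASSWORD@IP:PORT format and takes the proxies in turn. The credentials and addresses are made up, and the simple round-robin order is an assumption: RTILA performs the rotation for you and may choose proxies differently.

```python
from itertools import cycle


def parse_proxy(line):
    """Split USER:PASSWORD@IP:PORT into its four parts."""
    credentials, _, address = line.strip().rpartition("@")
    user, _, password = credentials.partition(":")
    host, _, port = address.rpartition(":")
    if not (user and password and host and port.isdigit()):
        raise ValueError(f"expected USER:PASSWORD@IP:PORT, got {line!r}")
    return {"user": user, "password": password, "host": host, "port": int(port)}


# One proxy per line, as in the RTILA proxy box (made-up values).
proxy_list = """alice:secret@192.0.2.10:8080
bob:hunter2@192.0.2.11:3128"""

proxies = [parse_proxy(line) for line in proxy_list.splitlines() if line.strip()]

# Round-robin rotation: each request takes the next proxy, wrapping around.
rotation = cycle(proxies)
picked = [next(rotation)["host"] for _ in range(3)]
print(picked)
```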
Please, consider using proxies or some VPN software while scraping but again, be gentle with the site you are scraping on.
This window will allow you to set GET/POST requests (the two most common HTTP methods). GET is used to request data from a specified resource. POST is used to send data to a server to create/update a resource.
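If the difference between the two methods is new to you, this short Python sketch builds one request of each kind with the standard library, without sending anything. The example.com addresses and the field names are placeholders of mine, not values that RTILA expects.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: the parameters travel in the URL itself.
get_request = Request("https://example.com/search?" + urlencode({"q": "books"}))

# POST: the parameters travel in the request body.
post_body = urlencode({"user": "demo"}).encode("utf-8")
post_request = Request("https://example.com/login", data=post_body)

# Nothing is sent here; we only inspect what would be sent.
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```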
There are some extensions available on RTILA, probably the most useful one is the Adblock extension. This will prevent loading annoying ads or popups on some sites.
Another extension is Hola Free VPN, but if you are getting serious with scraping you should use a dedicated VPN or proxies. You can try this free extension.
You can’t add new extensions to RTILA, this may be allowed in the future.
From the Settings menu, you can perform some useful actions related to projects.
Clone will duplicate your project. This will allow you to start working with an exact copy of the project.
Import and Export load or save a project from/to a file. Be sure you import or export to an equivalent version of RTILA, because incompatibility issues could occur between very different versions.
The export file is a JSON file.
IMPORTANT: be sure to create/select a NEW empty project before importing a project file, because Import overwrites the actual/current project.
Clear navigation removes all cookies, sessions, browsing history, etc.
The Remove option eliminates the current project, all the results, and related data files. This action is permanent.
It will not delete the images or downloaded files.
Those can be found at C:\Users\YourUser\Downloads\RTILA in their own project folder/scrape.
I recommend that you Export the project and the CSV results before deleting it.
The Inspector module
Hopefully now you get an idea about the configuration possibilities and you want to start a real project.
After opening the Inspector, you may either start selecting data or creating automation actions.
I will use an example to teach you the basics.
Go to New and select Manual to create an empty project.
Give it a name of your choice so that it can be identified uniquely.
Paste this URL: http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
And change Project type to Detail, because this is a single product page with different properties such as name, price, description, but they are not related. We just want to grab single values for now.
Save the information and press the Inspector button.
A Chrome instance will open the above URL and a couple of panels will be available.
On every scrape, you will need to use CSS selectors to grab the information from the page.
The selectors are strings of text that target HTML elements or structures. Here is a highly recommended source to learn some of them:
Do you need to learn them all now? Of course it’s not needed. Some websites can be tricky though and the knowledge will help you.
RTILA has a point and click interface and it will try to guess the better selector for you.
TIP: If you are not comfortable with CSS selectors, I recommend this Chrome extension: Selector Gadget.
Target an element
Press the +Add button.
Move your mouse to the book title and click the left mouse button.
You will see a new property is created, and some new information is visible on the Selector panel.
It was not hard to get this one.
Let’s check what we have on the panel.
Type: Detail pages. We set this before on the Project General Options page.
Name: Property 1. This was auto-generated, we should change it to a UNIQUE name. In this case, it could be: BookName
CSS selector: This was auto-generated too. RTILA considers that H1 is a good selector in this case. We will talk about selectors in more depth later.
Locked: After clicking on the title, RTILA locked the selector. Now we can move the mouse safely and no other selectors will overwrite the one we got before. You can unlock and select another part of the site if you misclicked the book’s title.
The purple box: Shows the active property, and the number inside indicates how many elements the selector is grabbing. In this case, the site has only one H1 element, that is, the title string. Later we will add other selectors, the selected one will be purple, the other will appear dark gray. You can navigate all the selectors by clicking on these boxes.
The Add button: Just creates a new selector. It’s a button with a dropdown menu that will allow you to clone a property or delete it.
Advanced: This will allow you to access the Filters and Settings menu. We will discuss those later.
The Save button: Well, you can guess this one. Please save all of your jobs by clicking Save at the end. After saving you can close the Chrome instance.
The Preview grid: as you can see on the image, it shows our first property content.
Go to your project and try to grab the price.
That is, Press the Add button, go to the site and point and click as the image below.
A second selector is created, I just named it: Price.
You can check the purple box, there is also a number 1, this means we are only getting «one Price» value.
The book could have a discounted price or other recommended readings could appear on this page, showing their own prices. We are sure that we are getting just a single value.
In case we get multiple values we will need to refine the selector.
Press Save and close the Chrome instance.
Exporting the scraped data
Let’s do a quick real scrape. On RTILA main page press Extractor, and after a few seconds, press Results.
From here you can validate your extraction. This was a simple one, have a quick look at the Preview button, or better, go to Export and choose CSV.
Open the resulting file with your favorite CSV editor (or a text editor will do the job).
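If you prefer to check the file with a script, a few lines of Python will do. This is only a sketch: I am assuming the export has a header row with the property names we created (BookName and Price), and the sample row is typed in by hand so the snippet runs on its own. To read your real file, replace the io.StringIO part with open("your-export.csv", newline="", encoding="utf-8").

```python
import csv
import io

# Hand-typed stand-in for an exported file (assumed layout: header row + one row per page).
sample_export = "BookName,Price\r\nA Light in the Attic,£51.77\r\n"

rows = list(csv.DictReader(io.StringIO(sample_export)))
for row in rows:
    # Strip the currency symbol so the price can be used as a number.
    price = float(row["Price"].lstrip("£"))
    print(row["BookName"], price)
```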
That was all.
But as you can imagine, some sites could have a tricky HTML markup and you will need to help RTILA with the adequate selectors. So…
How to use the CSS selectors
If you are new to scraping, probably you want to try with Facebook, LinkedIn, Google Maps, or Amazon.
However, they are not a good place to learn to scrape because they require some tricky interactions. That’s why we are starting with this simple book page.
Open your Chrome for a moment (not RTILA), and go to our test site: http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
Press F12 and go to the Elements tab.
You will find the source of the site we scraped before. You can inspect the code and try to find a couple of lines that I colored in yellow and green respectively.
On yellow, you can see an HTML element. It’s between the < > characters and as you already know, it’s the H1 tag.
Usually, HTML elements are not too specific, which means, if you try to grab some data that is a simple paragraph <p>, you will get many elements because the p tag is really common.
Now we are not talking about styling with CSS (forget about the right Styles panel), we are using CSS to target specific parts of the page.
How could we be sure to get the correct h1 tag? Could we be more specific?
Yes probably, look at the line below the h1, it has a parent div, that is contained on another div, that is contained on an article HTML tag, that is contained on another div.
Yes, all the HTML code is nested, and you can use the HTML tags, but usually, you will support them with other specific elements that are classes and ids.
For example, let’s say I want to grab the price, that is inside a P element. We may think that the P element could do the job, and we try the P selector on RTILA.
Well, the price is there, but as you see… what a disaster!
Many other parts of the page are using paragraphs. We need to narrow our selection.
I go back to the developer tools to inspect and try the RTILA suggested selector.
By pressing CTRL+F we can test a selector.
I write p.price_color and I see on the search results that I get a single element. RTILA did that for us before, just by pointing and clicking. Do you remember the small purple square with a number inside?
Now we know that the p element was not unique, but the p element with the class price_color is.
We will write classes with a period at the start, that is: .price_color
And we can join the HTML element with the class, that is: p.price_color
It’s as easy as that.
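You can reproduce the same test outside the browser. The Python sketch below counts how many elements match a plain tag selector and a tag.class selector, using only the standard library. The HTML is a simplified fragment I wrote to resemble the book page, and the tiny matcher only understands these two selector forms, so treat it as an illustration rather than a real CSS engine.

```python
from html.parser import HTMLParser


class SelectorCounter(HTMLParser):
    """Count start tags matching a simple `tag` or `tag.class` selector."""

    def __init__(self, selector):
        super().__init__()
        tag, _, css_class = selector.partition(".")
        self.tag = tag
        self.css_class = css_class  # empty string means "any class"
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and (not self.css_class or self.css_class in classes):
            self.count += 1


def count_matches(html, selector):
    counter = SelectorCounter(selector)
    counter.feed(html)
    return counter.count


# Simplified, hand-written fragment (not the full page source).
SAMPLE = """
<div class="product_main">
  <h1>A Light in the Attic</h1>
  <p class="price_color">£51.77</p>
  <p class="instock availability">In stock (22 available)</p>
  <p>Some description text</p>
</div>
"""

print(count_matches(SAMPLE, "p"))
print(count_matches(SAMPLE, "p.price_color"))
```

The first count is the «disaster» case with several paragraphs, and the second is the single element we actually want, like the number in the purple box.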
How do we talk to the CSS selector in the correct language?
Think about this. We want to say: hey! I see an identifier named «content_inner» that is unique, and it has a child named «article», which has a child div, and so on.
We just have to chain the elements, there are some rules, but the basic ones are:
Use > to introduce a direct child.
Use a # to target an identifier.
Use a . to target a class.
Now, try to follow me:
#content_inner > article > div.row > div.product_main > p:first-of-type
Let’s validate the selector:
Seems to be working, right?
Now, the CSS selector searches for a unique identifier, which has a descendant (a div with a product_main class) with some paragraphs inside. We have checked our site, and the Price is always contained in the first one.
You are probably thinking… man, you are crazy. Who will use those weird and long CSS selectors? I’m sure I can grab just an identifier or a class very near my data, and have no need to worry about learning this.
Well, that’s why I don’t recommend starting to practice on Facebook.
Remember, apparently all the websites are the same, but internally the markup could be very different. Sometimes scraping 100,000 pages will be easy, but there are times that scraping 200 pages is a real pain.
And we are not just talking about selectors. Many other things could occur, like delayed loading, «Load more» buttons, paginations, applets… captchas! Welcome to the scraping world.
Do you want to be good at scraping? Learn as much as you can about CSS selectors, it will take you just a couple of days.
The advanced menu
From time to time, you will have to filter or apply some conditions to work with the data.
Now you will discover some advanced features that will allow you to avoid having to post-process the extracted data.
Filters & conditions
The filters window is split into 2 panels: filters, and conditions.
In our example, we were grabbing the stock. However, the number alone is probably enough, so we want to do some cleaning on that string.
Open the filters menu and create a filter with this regular expression: \d+ or just use the one that comes preconfigured in Data filter labeled as «Integer».
Check the real-time preview. The string shows only the numerical value.
If you scroll down at the filters pane, you will see a place to play with the scraped value.
In this example, I am going to attach some text to it.
Write on the Actions area:
FIELD_VALUE="We have " + FIELD_VALUE + " books"
And now our resulting value will be «We have 22 books» instead of «In stock (22 available)».
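To see both steps in one place, here is the same transformation sketched in Python: first the \d+ filter, then the text added around the value. The variable names are mine; inside RTILA you only type the regular expression and the action line.

```python
import re

raw_value = "In stock (22 available)"

# Filter step: keep only the first run of digits, as the \d+ expression does.
match = re.search(r"\d+", raw_value)
field_value = match.group(0) if match else ""

# Action step: wrap the filtered value in some text.
field_value = "We have " + field_value + " books"
print(field_value)
```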
Mastering both of them (regular expressions and actions) could take some months of hard study.
Well, now you know how it works.
The advanced button was… advanced.
If you are not comfortable with more study, just Google or search at StackOverflow. You will be able to reuse some snippets easily.
The conditions panel will allow you to filter the scraped information based on a criterion.
For example, I can set a condition like this:
CONDITION: Contains – bestseller
Now RTILA will check whether the book’s description meets that criterion, and discard the pages that do not contain the word «bestseller» in their description.
You can add as many conditions as you need.
The current operator for this panel is AND, that means if there is more than one condition, all of them need to be TRUE to get the content.
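The AND rule is easy to picture with a short Python sketch. The page descriptions below are invented, the second condition («poetry») is added just to show two conditions working together, and I am assuming a case-sensitive match, which may not be how RTILA compares text.

```python
def contains(word):
    """Build a condition that is true when `word` appears in the text."""
    def condition(text):
        return word in text  # assumption: case-sensitive comparison
    return condition


def passes_all(text, conditions):
    """AND operator: every condition must be true to keep the content."""
    return all(condition(text) for condition in conditions)


conditions = [contains("bestseller"), contains("poetry")]

# Invented descriptions for three scraped pages.
pages = {
    "page-1": "A bestseller collection of poetry for children.",
    "page-2": "A bestseller thriller.",
    "page-3": "Classic poetry.",
}

kept = [name for name, description in pages.items() if passes_all(description, conditions)]
print(kept)
```

Only the page that satisfies both conditions survives; pages that meet just one of them are discarded.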
You can select a property and go to Advanced – Settings
A new popup will allow you to configure the following:
Default value: by default this value is empty, but you can set some string as default value when the scraper didn’t get any data.
Required: RTILA will stop if this value is not found. You can use it as a FLAG to bypass captchas or other issues that prevent you from continuing to get data.
Download: this will try to download a file resource, like an image, a pdf file, etc. The path of the download will be located at your user downloads folder:
C:\Users\YourUSER\Downloads\RTILA\the project name\download identifier\
In my case a sample path could be:
You can find this path on the results page of each project. The download folder will remain even if you delete a project.
Monitor: this will help you to check for a value change. You will probably activate this function together with the Mailer settings, to get an email notification when there are changes on the monitored value.
Content: choose between Inner Text, Inner HTML, or Attribute (custom attribute). Sometimes you will need to get the HTML code or just an attribute. You can define that here.
You will probably need to parse the scraped information when you select Inner HTML or Attribute. You can do that by adding a Filter, as we discussed before.
Handling tricky selectors
Let’s try an exercise.
On our project, we need to extract the image of a product gallery.
We want to download the images from the gallery, and before we do that we inspect the source (Pressing F12 on Chrome).
We need to explore the second image. Right-click and choose Inspect.
On this particular slider, the image is not an HTML element, it’s a CSS background property!
A selector can grab the inner text (between tags), the HTML code itself, or the value of an attribute.
On this site, the CSS classes are not meaningful, but there is one at the top: product-briefing that we will use as starting point.
This is my CSS selector:
That is the div[style^="back"] part.
There is another «child» selector that I should mention: div:nth-child(2), this one references the second child of a div. We are targeting the second image, the first image is div:nth-child(1) and the third: div:nth-child(3).
But, how will we set all this information on RTILA?
Just paste the selector onto the panel. This time we coded it manually because the site markup is not easy to target.
Now, go to Advanced – Settings, and change Content to Attribute, and Custom Attribute to Style.
I have also checked the Download box because I wish to download the image, not only to grab the URL.
On the img2 preview now, we can find all the information, with the URL to the image, and other properties.
We should parse that.
Go to Advanced – Filters and create a regular expression with the predefined filter: URL.
A regular expression will appear; we will use the provided one as a default.
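To understand what this filter is doing, here is a Python sketch that pulls the image address out of a style attribute. The style string and the regular expression are my own examples; the predefined URL filter in RTILA ships its own expression, which may be written differently.

```python
import re

# Example value of a style attribute that starts with "back" (made-up address).
style_value = 'background-image: url("https://example.com/images/photo-2.jpg"); background-size: cover;'

# Capture whatever sits inside url(...), with or without surrounding quotes.
URL_IN_STYLE = re.compile(r"""url\(\s*['"]?(.*?)['"]?\s*\)""")

match = URL_IN_STYLE.search(style_value)
image_url = match.group(1) if match else None
print(image_url)
```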
Press Save and check the preview.
Everything seems to be perfect now.
I hope this small example gives you an idea about what you can achieve with RTILA.
It’s true – you need a bit of CSS, but I have a goodie for you.
SelectorGadget is an open-source tool that helps you with CSS selectors generation and discovery on complicated sites. Just install the Chrome Extension. It’s free and very easy to use.
Don’t be scared about this small tutorial, you are on the Advanced RTILA menus.
Sometimes scraping can be a challenge, always go step by step.
The automation module
Currently, this module is being developed and improved. Probably you will find some minor differences in the visuals but this will be easy to follow.
For learning purposes, I’ll scrape the blog on this page.
We will scrape some data from the articles, and learn how to iterate or use the pagination.
I set a simple project by filling the name and URL to inspect.
Press Save and click on the Inspector button.
We learned before how to grab the desired information, we used the Inspector panel for that.
But now, we have to learn how to navigate the site and tell RTILA when we are ready to scrape the data.
This is what we need:
LOOP Infinite
    Go inside each article
        For each article, scrape it and go back
    When a Next page button exists
        Click on next page
Start clicking on Add New – Loop.
An infinite loop will be selected by default. No other actions are needed on this step.
Next, we should click on every article link to navigate inside the content.
Add a new Command action and grab the CSS selector by pointing and clicking on the website.
In this case, the selector is:
main#genesis-content > article > h2 > a
And this will grab 9 links from the page.
Set the Element count to Multiple, because we want to iterate this on every link.
From this step, RTILA will move inside the articles.
We should scrape on this step, and go back to the «main page» again.
In this case, add a Command with the action «Go back» and set the Extract results to «Before the command starts».
This is very important because we want to scrape just NOW, before leaving the page.
See how I nested the actions. You have to do the same, grabbing the commands and dropping them inside the parent nodes.
You did the navigation!
You probably don’t want to try, because now we are not scraping any data. But if you Save all and click on Extractor, RTILA will start navigating alone on the 9 articles from the first page.
Next page action
Let’s work with the Next page button now.
The Next page button appears when there are remaining articles on the blog, this helps us to paginate.
We should check if the button is there, and in that case, we should click it.
Create a Condition from Add New. Choose a good selector for the pagination link, and click on the «Stop if the condition is not met» checkbox.
After scraping the results, if there is no «Next button» we have reached the last page.
Why did I choose that selector? This is a preview of the source, showing the class I want to target: .pagination-next and the link.
You can learn more about selectors in the previous section.
What do we do when there is a Next page button/link? We click on it.
Here is the full structure:
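If you think better in code, the nested actions can be read like this Python sketch. It does not drive a browser: the blog is simulated with a plain list, and the function and field names are mine. Each comment names the RTILA action that the line stands for.

```python
def crawl(pages):
    """Walk a simulated blog: visit every article, scrape it, then paginate."""
    results = []
    if not pages:
        return results
    page_index = 0
    while True:                                   # Loop: infinite
        page = pages[page_index]
        for article in page["articles"]:          # Command (Multiple): open each article link
            results.append(article["title"])      # extract before the "Go back" command starts
            # Command "Go back": return to the listing (nothing to do in this simulation)
        has_next = page_index + 1 < len(pages)    # Condition: is there a Next page link?
        if not has_next:
            break                                 # "Stop if the condition is not met"
        page_index += 1                           # Command: click the Next page link
    return results


# Simulated site with two listing pages (invented titles).
blog = [
    {"articles": [{"title": "Post 1"}, {"title": "Post 2"}]},
    {"articles": [{"title": "Post 3"}]},
]
scraped = crawl(blog)
print(scraped)
```

Notice that the loop itself never ends on its own. The condition on the Next page link is what stops it, exactly as in the action tree.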
Save all the actions that we set on this panel, activate the Allow Navigation checkbox, and click an article.
Why? Because we set how to navigate the site, but we still didn’t select any information!
Now you should know how to target the data.
For this small tutorial, I grabbed 4 values: PostName, Date, Author and ReadingTime.
TIP: remember to disable the Allow Navigation when you start to grab the CSS selectors.
Save from the Inspector panel, and close the window.
You will be back to the RTILA main page.
You probably have to set the base URL again (RTILA probably changed it as we moved inside an article before).
Once you have finished selecting the necessary data to be scraped, and your project is fully configured you can press the Extractor button.
When Silent mode is set to No, a regular Chrome instance (or more than one, depending on your settings) will appear and start to navigate scraping the data automatically.
If some property was set as Required and it’s not found, RTILA will wait for your manual interaction. This could help to solve some interruption manually, like a captcha or an expired session.
The results page
Once your project is fully extracted RTILA will provide your results on this page.
Each scraping will generate a new results row, showing the amount of data, time of completion, and duration of the process.
You can open this window while the project is currently scraping.
This will allow you to check a Preview of the data, or you can perform a partial Export.
Once the project is finished, you can export any of the scrapes to a file.
Note that the preview window shows 10 records by default.
This does not mean that you only got 10 results; you can increase this value manually.
Click the Export button to save the results in CSV, JSON or HTML format.
You can keep the export data or delete it by clicking the trash can.
If you only need to keep a certain number of scrapes you can set your desired number on Purge results, from the Options – General panel.
The Clear All button will delete all the results.
The Bulk Export button will allow you to save all the results to a folder at once.