Welcome to the RTILA help page
Please post questions, follow discussions and share your knowledge in the RTILA Visual Web Scraping Facebook group.
You can download the latest version from the official GitHub repository at:
https://github.com/IKAJIAN/rtila-releases/releases
RTILA is available for Mac, Linux, and Windows.
For the best experience, try to download the latest version for your operating system.
With RTILA you can scrape text, URLs, email addresses, and images, or download any type of binary file from web pages, such as PDFs, DOCs, etc.
RTILA is very easy to use, but learning some scraping concepts will help you to succeed in your scraping projects.
In the latest version of RTILA, the interface changed and now has 3 separate panels (Properties, Automations and Preview); there will probably be some layout changes in the following weeks to make the user interface more comfortable.
This software is actively developed, so some options could vary slightly.
Don’t worry if the screens look a bit different on newer versions; you will still find everything. Read this help to get the basics and you will be fine.
Please, note that I’m just an RTILA user and I wrote this document in my free time.
Registering RTILA
You can purchase RTILA online at CodeCanyon.
Once you complete the payment, you will receive the license key file via email.
After starting RTILA, the registration screen will pop up as shown above.
Paste your serial number to activate your license.
Please note that there are 3 different versions of RTILA: for Windows, Mac OS, and Linux.
Register the software using the required key for your Operating System.
RTILA 5 minutes tour
RTILA Launcher
When you start RTILA, the launcher will appear.
Click on the Start RTILA Studio button to open the main interface.
The main window
This is the main screen.
To set up a new (quick) project, you only have to enter a URL and press Start.
From the NEW button, you will be able to create new projects, import them from a file, or import some samples from the cloud (public templates).
The New button will allow you to create a new project from scratch, or choose an already pre-configured template. I recommend starting with some projects from the "Books to Scrape" template and learning from those preset projects.
Starting a project from a template is an easy task, just click any one of them and you will see some additional info on the right panel (Name, URL and Type). Those projects can be customized later with your own parameters, but these are a good start.
You can import your own projects or ones created by other RTILA users. Remember to create a New – Manual empty project before importing a new project. Importing a project overwrites the current one.
From the left menu, you can manage your projects, compile bots, create flows, export/view your scrape results and manage your browser profiles.
If you have some projects configured, the initial screen will look like this, with a top bar with some extra features.
From the Options menu, you can set all the initial project settings.
You will see that some project details are already set on the Options page.
Let’s check some templates first.
Go to NEW and import the book template shown in the image below: Basic list example
You can test the imported project from the main screen:
Do you want to give it a try? Just press RUN and RTILA will start scraping a sample site. Chrome will be launched automatically.
Once it is finished, click RTILA Results in the left menu.
And press the PREVIEW button.
This will open a window with some extracted data. You can click on Export to save the data to a file (CSV is recommended).
Now we will explore the Project Options page. Close the results window and go to the main RTILA screen.
Options – General page
On this screen, you will set the Project name and the URL to inspect for each of your projects (mandatory fields). Set a meaningful project name, because it is the unique text reference you will use to find the project later in the Project selector control.
The URL to inspect should be the starting point of the page you want to scrape.
The Silent mode option will scrape the site without a graphical user interface, meaning Chrome will be hidden. It’s probably fine to leave the default (No) during development. Silent mode is useful when a project is scheduled to run every 5 minutes, for example.
Purge results cleans the results automatically. It limits the total number of stored sessions and discards the older scrapes, which is useful when you only want to keep the newest ones. Sometimes you will use RTILA to monitor a site with frequent or regular scrapes; this option gives you some auto-cleaning capability.
Instances, Concurrent tabs, and Wait times are REALLY important. They will set your actual scraping speed.
An instance is a full Chrome execution, and each instance can open a certain number of tabs. That means 3 instances with 5 tabs each will open 15 pages at the same time.
My recommendation: don’t abuse this. Keep those values as low as you can and don’t flood a site with requests. In some countries, people host their sites on shared hosting, and multiple requests per second could get you blocked.
It does not matter whether you scrape for 2 minutes or 20 minutes. To start testing, you can set 1 instance, 1 tab, and a 3-second wait time. This is slower but safe.
Timeout and retries should be self-explanatory: they set the maximum wait time and the number of retries, respectively.
Don’t set your timeout too low, or the browser will not wait long enough for the data to load.
Too many details? Focus on Project name and URL to inspect. Those are the basic ones.
Disable image loading lets you skip downloading photos on the sites, which can improve load times.
On some sites, you can disable JavaScript or CSS loading. Sometimes this helps with the scraping itself, or… breaks your project. Yes, disabling the CSS or JavaScript could prevent some data from loading on the site.
Reports option
RTILA has a handy Reports option.
You can reach it from the main window.
Proxies
A proxy is a 3rd party server that enables you to route your request through their servers and use their IP address. When using a proxy, the website you are making the request to no longer sees your IP address but the IP address of the proxy.
The syntax for proxies is USER:PASSWORD@IP:PORT
You can set one proxy per line; RTILA will perform the proxy rotation.
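For example, a list of two proxies would look like this (placeholder credentials and documentation IP addresses, not real proxies):

user1:pass1@203.0.113.10:8080
user1:pass1@203.0.113.11:3128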
Please consider using proxies or some VPN software while scraping, but again, be gentle with the site you are scraping.
We recommend proxies from BrightData – Luminati. Their Data Center proxies plan at $0.60/GB is probably enough to start with.
External proxy rotation allows you to use services like scraperapi or rocketscrape.
Extensions
You can install extensions in RTILA. Just add one extension URL per line.
Probably the most useful ones are Adblock and Hola Free VPN: the first prevents annoying ads and popups from loading on some sites, and the second is a free VPN service.
If you are getting serious with scraping you should use a dedicated VPN or proxy.
JS Actions
This is an advanced screen. Sometimes it’s useful to run some JavaScript after the page has loaded but before the scraping starts. You can add your JavaScript snippets here.
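For example, here is a minimal sketch of the kind of snippet you could add (the cookie banner and its selector are made up for illustration; adapt them to your target site):

// Dismiss a hypothetical cookie banner before scraping starts.
var banner = document.querySelector('#accept-cookies'); // assumed selector
if (banner) { banner.click(); }
// Scroll to the bottom to trigger lazy-loaded content.
window.scrollTo(0, document.body.scrollHeight);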
If you are not a PRO scraper yet, do not worry if your legs start shaking. If you are a JavaScript beginner, it is a fantastic language to learn. Your scrapes will become awesome with a bit of knowledge; scrapers are always learning new things.
Logs screen
You can check the logs from RTILA or from the logs directory on your disk.
This will help you to debug the scraping process.
Captcha solver
Do you need to solve captchas? Use an external service like 2captcha.com.
Enter their API key in RTILA and you are done.
The Inspector module
Hopefully, you now have an idea of the configuration possibilities and want to start a real project.
After opening the Inspector, you may either start selecting data or creating automation actions.
I will use an example to teach you the basics.
Go to New and select Manual to create an empty project.
Give it a name of your choice so that it can be identified uniquely.
Paste this URL: http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
Save the information and press the Inspect button.
A Chrome instance will open the above URL and a couple of panels will be available.
On every scrape, you will need to use CSS selectors to grab the information from the page.
The selectors are strings of text that target HTML elements or structures. Here is a highly recommended source to learn some of them:
https://www.w3schools.com/cssref/css_selectors.asp
Do you need to learn them all now? Of course not. Some websites can be tricky though, and the knowledge will help you.
RTILA has a point-and-click interface and will try to guess the best selector for you.
TIP: If you are not comfortable with CSS selectors, I recommend this Chrome extension: SelectorGadget.
Let’s start.
Target an element
Click on the Add new property button.
Move your mouse to the book title and click the left mouse button.
You will see a new property is created, and some new information is visible on the Selector panel.
It was not hard to get this one.
Let’s check what we have on the left panel.
The main panel contains a dataset.
A dataset means a group of elements to extract from a page. In this case, we are working to extract the details of a book, which means… we will grab a title, a price, an author, and so on.
Name: Property 1. This was auto-generated; we should change it to a UNIQUE name. In this case, it could be BookName.
CSS selector: This was auto-generated too. RTILA considers that H1 is a good selector in this case. We will talk about selectors in more depth later.
The purple box: Shows the active property, and the number inside indicates how many elements the selector is grabbing. In this case, the site has only one H1 element, which is the title string. Later we will add other selectors; the selected one will be purple and the others will appear dark gray. You can navigate all the selectors by clicking on these boxes.
The Add new property button: Just creates a new selector.
The Save button: Well, you can guess this one. Please save all of your jobs by clicking Save at the end. After saving you can close the Chrome instance.
The Preview grid: as you can see on the image, it shows our first property content.
Go to your project and try to grab the price.
That is, press the Add new property button, go to the site, and point and click as in the image below.
A second selector is created; I named it Price.
Check the purple box: it also shows the number 1, which means we are getting only one Price value.
The book could have a discounted price, or other recommended readings could appear on this page showing their own prices. Here we are sure we are getting just a single value.
If we get multiple values, we will need to refine the selector.
Before saving the project, let’s change the Dataset from List to Detail.
Go to the Dataset settings on the top left of the Inspector panel.
And choose Details on the dropdown.
Press Save and close, and let’s save the full project now.
Scraping and exporting data
Let’s do a quick real scrape. On RTILA main window, press RUN.
Wait for a few seconds and click on Results from the left menu.
From here you can validate your extraction. This was a simple one, have a quick look at the Preview button, or better, go to Export and choose CSV.
Open the resulting file with your favorite CSV editor (or a text editor will do the job).
That was all.
But as you can imagine, some sites can have tricky HTML markup, and you will need to help RTILA with adequate selectors. So…
How to use the CSS selectors
If you are new to scraping, you probably want to try Facebook, LinkedIn, Google Maps, or Amazon.
However, they are not a good place to learn scraping; they require some tricky interactions. That’s why we are starting with this simple book page.
Open your Chrome for a moment (not RTILA), and go to our test site: http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
Press F12 and go to the Elements tab.
You will find the source of the site we scraped before. You can inspect the code and try to find a couple of lines that I colored in yellow and green respectively.
In yellow, you can see an HTML element. It’s between the < > characters and as you already know, it’s the H1 tag.
Usually, HTML elements are not too specific, which means, if you try to grab some data that is a simple paragraph <p>, you will get many elements because the p tag is really common.
Here we are not talking about styling with CSS (forget about the Styles panel on the right); we are using CSS selectors to target specific parts of the page.
How could we be sure to get the correct h1 tag? Could we be more specific?
Yes, probably. Look at the line below the h1: it has a parent div, which is contained in another div, which is contained in an article HTML tag, which is contained in another div.
All HTML code is nested, and you can target HTML tags, but usually you will combine them with more specific attributes: classes and IDs.
For example, let’s say I want to grab the price, which is inside a p element. We may think that the p element could do the job, so we try the p selector in RTILA.
Well, the price is there, but as you see… what a disaster!
Many other parts of the page are using paragraphs. We need to narrow our selection.
Go back to the developer tools to inspect the page and try the selector RTILA suggested.
By pressing CTRL+F we can test a selector.
I write p.price_color and I see on the search results that I get a single element. RTILA did that for us before, just by pointing and clicking. Do you remember the small purple square with a number inside?
Now we know that the p element was not unique, but the p element with the class price_color is.
We will write classes with a period at the start, that is: .price_color
And we can join the HTML element with the class, that is: p.price_color
It’s as easy as that.
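You can also test a selector from the DevTools Console tab; document.querySelectorAll is standard in every browser:

// Run in the console on the book page; 1 means the selector targets a single element.
document.querySelectorAll('p.price_color').length;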
How do we express all this in the CSS selector language?
Think about it this way. We want to say: hey! I see an identifier named "content_inner" that is unique, and it has a child named "article", which has a child div, and so on.
We just have to chain the elements. There are some rules, but the basic ones are:
Use > to target a direct child.
Use a # to target an identifier.
Use a . to target a class.
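For example:

div > p (paragraphs that are direct children of a div)
#content_inner (the element with id="content_inner")
.price_color (any element with the class price_color)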
Now, try to follow me:
#content_inner > article > div.row > div.product_main > p:first-of-type
Let’s validate the selector:
Seems to be working, right?
Now, the CSS selector searches for a unique identifier, which has a descendant (a div with the product_main class) with some paragraphs inside. We have checked our site, and the price is always contained in the first paragraph.
You are probably thinking… man, you are crazy. Who will use those weird and long CSS selectors? I’m sure I can grab just an identifier or a class very near my data, and have no need to worry about learning this.
Well, that’s why I don’t recommend starting to practice on Facebook.
Remember, all websites look the same on the surface, but internally the markup can be very different. Sometimes scraping 100,000 pages will be easy, but there are times when scraping 200 pages is a real pain.
And we are not just talking about selectors; many other things can occur, like delayed loading, load-more buttons, pagination, applets… captchas! Welcome to the scraping world.
Do you want to be good at scraping? Learn as much as you can about CSS selectors, it will take you just a couple of days.
Start here:
https://www.w3schools.com/css/css_combinators.asp
https://www.w3schools.com/cssref/css_selectors.asp
The Filters & Conditions menus
From time to time, you will have to filter or apply some conditions to work with the data.
Now you will discover some advanced features that will allow you to avoid having to post-process the extracted data.
Filters & conditions
Let’s grab the stock information.
Click on Add new property and select the stock value in the browser as in the image below.
In our example, we grabbed the stock string. However, we probably only need the number, so we want to do some cleaning on that string.
Open the Filters menu and create a filter with the regular expression \d+, or just use the preconfigured Data filter labeled "Integer".
Press Save.
Check the real-time preview. The string shows only the numerical value, in this case, the value 22.
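Conceptually, the filter just keeps the match of the regular expression. A quick JavaScript sketch of what happens to our example string:

// \d+ matches the first run of digits in the scraped string.
var raw = "In stock (22 available)";
var stock = raw.match(/\d+/)[0]; // "22"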
If you scroll down in the filters pane, you will see a place to play with the scraped value: the Actions control.
In this example, I am going to attach some text to it.
Write in the Actions area:
FIELD_VALUE = "We have " + FIELD_VALUE + " books"
And now our resulting value will be We have 22 books instead of In stock (22 available).
The Actions component will accept any JavaScript code. This is completely optional but you can get the idea, HTML cleanup, trimming strings, math operations, replacements… anything you want.
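A few illustrative snippets, assuming (as in the example above) that RTILA exposes the scraped string as FIELD_VALUE:

FIELD_VALUE = FIELD_VALUE.trim(); // strip surrounding whitespace
FIELD_VALUE = FIELD_VALUE.replace(/<[^>]+>/g, ""); // crude HTML tag cleanup
FIELD_VALUE = (parseFloat(FIELD_VALUE) * 2).toFixed(2); // math on a numeric value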
You are probably thinking that in the last 2 minutes I mainly wrote about Regular Expressions and JavaScript.
Mastering both of them could take some months of hard study.
Well, now you know how it works.
The advanced button was… advanced.
If you are not comfortable with more study, just Google it or search on StackOverflow. You will be able to reuse some snippets easily.
The conditions panel will allow you to filter the scraped information based on a criterion.
For example, I can set a condition like this:
CONDITION: Contains – bestseller
Now RTILA will check whether the book’s description meets that criterion, and discard the pages whose description does not contain the word "bestseller".
You can add as many conditions as you need.
The current operator for this panel is AND, which means that if there is more than one condition, all of them need to be TRUE for the content to be kept.
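In other words, the panel behaves like a logical AND over all conditions; a conceptual sketch in JavaScript (not RTILA’s actual code):

// Hypothetical results of three condition checks for one page.
var conditions = [true, true, false];
var keep = conditions.every(function (met) { return met; }); // false: this page is discarded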
Advanced Settings (legacy version)
You can select a property and go to Advanced – Settings
A new popup will allow you to configure the following:
Default value: empty by default, but you can set a string to be used as the default value when the scraper does not get any data.
Required: RTILA will stop if this value is not found. You can use it as a FLAG to bypass captchas or other issues that prevent you from continuing to get data.
Download: this will try to download a file resource, like an image, a PDF file, etc. The downloads will be located in your user Downloads folder:
C:\Users\YourUSER\Downloads\RTILA\the project name\download identifier\
In my case a sample path could be:
C:\Users\Usuario\Downloads\RTILA\ecommerce/693/rubikcube.jpg
You can find this path on the results page of each project. The download folder will remain even if you delete a project.
Monitor: this will help you to check for a value change. You will probably activate this function together with the Mailer settings, to get an email notification when there are changes on the monitored value.
Content: choose between Inner Text, Inner HTML, or Attribute (custom attribute). Sometimes you will need to get the HTML code or just an attribute. You can define that here.
You will probably need to parse the scraped information when you select Inner HTML or Attribute. You can do that by adding a Filter, as we discussed before.
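If you want to see what the three content types correspond to, you can experiment in the DevTools console on our book page (these are standard DOM properties; the element below is the price paragraph from earlier):

var el = document.querySelector('p.price_color');
el.innerText; // Inner Text: the visible text, e.g. "£51.77"
el.innerHTML; // Inner HTML: the HTML markup inside the element
el.getAttribute('class'); // Attribute: a named attribute, here "price_color"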
Handling tricky selectors
Let’s try an exercise.
On our project, we need to extract the image of a product gallery.
We want to download the images from the gallery, and before we do that, we inspect the source (pressing F12 in Chrome).
We need to explore the second image. Right-click and choose Inspect.
On this particular slider, the image is not an HTML element, it’s a CSS background property!
A selector can grab the inner text (between tags), the HTML code itself, or the value of an attribute.
On this site, the CSS classes are not meaningful, but there is one at the top, product-briefing, that we will use as a starting point.
This is my CSS selector:
.product-briefing>div>div>div>div:nth-child(2)>div>div>div[style^="back"]
Don’t worry about the exact selector pattern; I’m just following the path to the element that has an attribute named style whose value starts with the string "back".
That is the div[style^="back"] part.
There is another "child" selector that I should mention: div:nth-child(2). This one references the second child of a div. We are targeting the second image; the first image is div:nth-child(1) and the third is div:nth-child(3).
But, how will we set all this information on RTILA?
Just paste the selector onto the panel. This time we coded it manually because the site markup is not easy to target.
Now, go to Advanced – Settings, and change Content to Attribute, and Custom Attribute to Style.
I have also checked the Download box because I wish to download the image, not only to grab the URL.
In the img2 preview, we can now see all the information, including the URL of the image and other properties.
We should parse that.
Go to Advanced – Filters and create a regular expression with the predefined filter: URL.
A regular expression will appear; we will use the provided one as a default.
Press Save and check the preview.
Everything seems to be perfect now.
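If you are curious what the URL filter is doing, conceptually it is just another regular expression match; here is a sketch with a made-up style value (RTILA’s built-in pattern may differ):

var style = 'background-image: url("https://example.com/images/rubikcube.jpg");'; // hypothetical scraped value
var url = style.match(/https?:\/\/[^"')\s]+/)[0]; // "https://example.com/images/rubikcube.jpg"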
I hope this small example gives you an idea about what you can achieve with RTILA.
It’s true, you need a bit of CSS, but I have a goodie for you.
SelectorGadget is an open-source tool that helps you with CSS selector generation and discovery on complicated sites. Just install the Chrome extension. It’s free and very easy to use.
Don’t be scared by this small tutorial; you are in the advanced RTILA menus.
Sometimes scraping can be a challenge, always go step by step.
The automation module
Currently, this module is being developed and improved. You will probably find some minor visual differences, but this will be easy to follow.
For learning purposes, I’ll scrape the blog on this page.
We will scrape some data from the articles, and learn how to iterate or use the pagination.
I set up a simple project by filling in the name and the URL to inspect.
Press Save and click on the Inspector button.
We learned before how to grab the desired information, we used the Inspector panel for that.
But now, we have to learn how to navigate the site and tell RTILA when we are ready to scrape the data.
This is what we need:
LOOP Infinite
    Go inside each article
        For each article, scrape it and go back
    When a Next page button exists
        Click on next page
Creating actions
Start by clicking on Add New – Loop.
An infinite loop will be selected by default. No other actions are needed on this step.
Next, we should click on every article link to navigate inside the content.
Add a new Command action and grab the CSS selector by pointing and clicking on the website.
In this case, the selector is:
main#genesis-content > article > h2 > a
And this will grab 9 links from the page.
Set the Element count to Multiple, because we want to iterate over every link.
From this step, RTILA will move inside the articles.
At this step we should scrape, then go back to the "main page" again.
In this case, add a Command with the action «Go back» and set the Extract results to «Before the command starts».
This is very important because we want to scrape just NOW, before leaving the page.
See how I nested the actions. You have to do the same, grabbing the commands and dropping them inside the parent nodes.
You did the navigation!
You probably don’t want to try it yet, because we are not scraping any data. But if you Save all and click on Extractor, RTILA will start navigating on its own through the 9 articles on the first page.
Next page action
Let’s work with the Next page button now.
The Next page button appears when there are remaining articles on the blog; this lets us paginate.
We should check if the button is there, and in that case, we should click it.
Create a Condition from Add New. Choose a good selector for the pagination link, and click on the «Stop if the condition is not met» checkbox.
After scraping the results, if there is no "Next" button, we have reached the last page.
Why did I choose that selector? This is a preview of the source, showing the class I want to target: .pagination-next and the link.
You can learn more about selectors in the previous section.
What do we do when there is a Next page button/link? We click on it.
Here is the full structure:
Save all the actions that we set on this panel, activate the Allow Navigation checkbox, and click an article.
Why? Because we set how to navigate the site, but we still didn’t select any information!
Now you should know how to target the data.
For this small tutorial, I grabbed 4 values: PostName, Date, Author and ReadingTime.
TIP: remember to disable Allow Navigation when you start to grab the CSS selectors.
Save from the Inspector panel, and close the window.
You will be back to the RTILA main page.
You will probably have to set the base URL again (RTILA probably changed it when we moved inside an article earlier).
Extracting data
Once you have finished selecting the data to be scraped and your project is fully configured, you can press the Extractor button.
When Silent mode is set to No, a regular Chrome instance (or more than one, depending on your settings) will appear and start navigating the site, scraping the data automatically.
If some property was set as Required and it is not found, RTILA will wait for your manual interaction. This lets you resolve some interruptions manually, like a captcha or an expired session.
The results page
Once your project is fully extracted, RTILA will present your results on this page.
Each scraping will generate a new results row, showing the amount of data, time of completion, and duration of the process.
You can open this window while the project is currently scraping.
This will allow you to check a Preview of the data, or you can perform a partial Export.
Once the project is finished, you can export any of the scrapes to a file.
Just note that the preview window shows 10 records by default.
This does not mean you only got 10 results; you can increase this value manually.
Click the Export button to save the results in CSV, JSON, or HTML format.
You can keep the exported data or delete it by clicking the trash can.
If you only need to keep a certain number of scrapes you can set your desired number on Purge results, from the Options – General panel.
The Clear All button will delete all the results.
The Bulk Export button will allow you to save all the results to a folder at once.