RTILA Advanced Scraping I

Reading time: 7 minutes

Today, I’m going to analyze a site and demonstrate how to scrape it with RTILA.

As this is for learning purposes, this project will not be available for download. Take a seat, and enjoy the lecture.

I’ll work with the Spanish version of the site, but there should not be a difference with other local versions.

I don’t mention the site name or the URL, but you can see the screenshots and figure it out.

Let’s do it.

This is how the site looks after a query. We want to scrape the job listings.

As you can see, the job listings appear in the left column.

When you click on a company block, a second column on the right will appear with additional information.

And there is more!

Some companies show a link that you can follow. It opens the company details in a new tab.

In the image above, you can see the link with the text: Consejo Superior de Investigaciones Científicas.

After clicking on it, more information will appear, like the links to their website or social networks, employees, etc.

This is how the company information looks:

The main listings have pagination buttons too. They look like this:

And that’s the presentation.

Sounds like a lot of work. Where do we start?

We will start with the automation part.

Let’s create the navigation to the page and fill the search form.

These actions should be self-explanatory.

The input selector is detected perfectly with RTILA’s point and click. It should be: input#text-input-what.

Here we can see the full interaction: the first block, which we already saw in the previous image, and the block with the LOOP.

Inside the loop, we click the listings.

It’s as easy as this:

This powerful event will click on every listing element.

Maybe you are thinking, “ok… this action does all that, so why do we need the loop?”

Easy, because the site has pagination.

We use one loop to iterate over all the pages, but we don’t need a second loop to click all the elements. Just set Element count to Multiple.

Remember to add a pause after every click. This is mandatory almost all the time.

What do we do next? Following the commands, you’ll see another block.

That block checks for the existence of a company link. If it exists, we click on it and open the new content in a new tab. We will cover this in more detail later.

The next step is to scrape the data. Notice that this step is indented inside the listings event, so it runs after each listing is clicked. This nesting is necessary.

The last event clicks on the next page, which covers the pagination automation.

How do we check for the existence of a link?

We create a new selector, and we try to grab it.

After this, a condition is set.

Exists? Click on it.
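RTILA handles this check through its interface, but the logic behind it is simple. Here is a minimal sketch in plain JavaScript, assuming a hypothetical helper called clickIfExists and a selector that you supply yourself (neither is part of RTILA):

```javascript
// Hypothetical helper: it only mirrors the "exists? click on it" logic.
// `root` is anything with querySelector (pass `document` in a browser console).
function clickIfExists(root, selector) {
  const el = root.querySelector(selector);
  if (el === null) {
    return false; // nothing matched, so the click is skipped
  }
  el.click();
  return true;
}
```

The return value tells you whether the click happened, which is handy when you try a selector in the Chrome console before building the condition.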

Great, the job is almost done.

You can see some JavaScript code in this automation.

This is because we have to scrape in the new tabs (pages other than the main one), and we are creating this task dynamically with code.

‘Scary?!’
‘Yes.’
‘Don’t be scared!’
‘I’m not really scared’
‘You should be!’
‘Well I am a little bit scared’
‘Well don’t be!’

The init block contains:

window.localStorage.setItem('companyLink', '');

This accesses the localStorage object and initializes the companyLink value to an empty string.

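// Remove the social links so that only the business website link remains.
document.querySelectorAll('a[data-tn-element="companyLink"]').forEach(function (e) {
  var socialHosts = ["facebook.com", "twitter.com", "linkedin.com", "...more"];
  if (socialHosts.indexOf(new URL(e.href).hostname.replace("www.", "")) !== -1) {
    e.remove();
  }
});

// Store the first remaining link, if any, under the companyLink key.
var RTILA_e = document.querySelector('a[data-tn-element="companyLink"]');
if (RTILA_e) {
  window.localStorage.setItem('companyLink', RTILA_e.href);
}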

With this code, we remove the social URLs from Facebook, Twitter and LinkedIn.

After this, we grab the first remaining URL and store it under the companyLink key.
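If you want to try the filtering logic without touching the page, here is a small sketch that works on a plain list of URLs. The names SOCIAL_HOSTS, hostOf, isSocialHost and pickCompanyUrl are made up for this example, and the host list is only a sample. Unlike the snippet above, it also treats subdomains (for example es.linkedin.com) as social and skips values that are not valid absolute URLs, which is an assumption about what you would want.

```javascript
// Sample list only; extend it with whatever hosts you want to ignore.
const SOCIAL_HOSTS = ['facebook.com', 'twitter.com', 'linkedin.com'];

// Returns the lowercase hostname, or null when href has no usable host.
function hostOf(href) {
  try {
    const host = new URL(href).hostname.toLowerCase();
    return host === '' ? null : host;
  } catch (err) {
    return null; // not an absolute URL
  }
}

// True when host is a social host or a subdomain of one.
function isSocialHost(host, socialHosts) {
  return socialHosts.some((h) => host === h || host.endsWith('.' + h));
}

// Returns the first non-social URL, or '' when none is left
// (the same empty value that the init block stores).
function pickCompanyUrl(hrefs, socialHosts = SOCIAL_HOSTS) {
  for (const href of hrefs) {
    const host = hostOf(href);
    if (host !== null && !isSocialHost(host, socialHosts)) {
      return href;
    }
  }
  return '';
}
```

In the browser you could feed it the real links with Array.from(document.querySelectorAll('a[data-tn-element="companyLink"]'), (a) => a.href).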

We can test this code in the Chrome console. A good sample page is a company with a business website and a couple of social networks.

Paste the code in the console.

And that did the job. The code removed the social links, and setItem stored the business URL.

You can verify this under Application – Local storage.

The value is there.

How should we send this value to RTILA?

Go to the inspector panel and create a new selector. For this example, I named it Company URL.

Click Advanced – Filters, create a new filter and add this code to the Actions field:

FIELD_VALUE=window.localStorage.getItem('companyLink');

getItem reads the stored value, and assigning it to FIELD_VALUE stores it in the Company URL selector.
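One detail worth knowing: getItem returns null when the key was never set. The sketch below walks through the whole round trip (init, store, read) with a fallback to an empty string. MemoryStorage and readCompanyLink are hypothetical names; MemoryStorage is only a tiny stand-in for window.localStorage so the example can run outside a browser.

```javascript
// In-memory stand-in for window.localStorage (an assumption for this sketch).
// Like the real API, it stores strings and returns null for a missing key.
class MemoryStorage {
  constructor() {
    this.items = new Map();
  }
  setItem(key, value) {
    this.items.set(String(key), String(value));
  }
  getItem(key) {
    const k = String(key);
    return this.items.has(k) ? this.items.get(k) : null;
  }
}

// Hypothetical helper: read the stored link, falling back to '' when missing.
function readCompanyLink(storage) {
  const value = storage.getItem('companyLink');
  return value === null ? '' : value;
}

const storage = new MemoryStorage();
const beforeInit = readCompanyLink(storage);             // key not set yet
storage.setItem('companyLink', '');                      // what the init block does
storage.setItem('companyLink', 'https://example.com/');  // what the company tab stores
const afterStore = readCompanyLink(storage);             // what the filter would read
```

In the RTILA filter, the same fallback could probably be written as FIELD_VALUE = window.localStorage.getItem('companyLink') || ''; assuming the Actions field accepts any JavaScript expression.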

We are done.

The last piece handles the pagination.

A good selector could be:

.pagination > .pagination-list > li:last-child > a

Remember to test at the very last page to verify that the selector does not capture anything and the stop condition is set.
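A quick way to run that test is from the Chrome console. This is only a sketch: the selector is the one above, while findNextPageLink and isLastPage are hypothetical helper names, and whether the selector really matches nothing on the last page depends on the site's markup.

```javascript
// Selector taken from the text above; everything else is illustrative.
const NEXT_PAGE_SELECTOR = '.pagination > .pagination-list > li:last-child > a';

// Returns the "next page" link, or null when the selector matches nothing.
// `root` is anything with querySelector (pass `document` in the console).
function findNextPageLink(root) {
  return root.querySelector(NEXT_PAGE_SELECTOR);
}

// Stop condition for the pagination loop: no matching link means last page.
function isLastPage(root) {
  return findNextPageLink(root) === null;
}
```

Run isLastPage(document) on a middle page and again on the very last page. If it does not return true on the last page, the selector still captures something there and the loop would not stop.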

It’s recommended to add some wait time after every click.

This post is not part of the RTILA documentation. I’m just sharing a bit of knowledge.

It’s not recommended to create and fill values from code. RTILA will probably solve this in the very near future using bridges and merging tables.

Sometimes the target website can be a challenge. You can reach us for RTILA custom projects. We offer premium support at very affordable prices.

As this is a very small company, I appreciate your support with a Google review here:

https://search.google.com/local/writereview?placeid=ChIJfU54LSH_Xw0RjGgTSmeXSoQ

Thanks for reading. You are really brave, because this lecture was probably a bit tough.

Enjoy the tips.
