If you read my friend and colleague Dmitry Sychev’s blog post, you understand the concepts of web scraping. If you’re working with a site that doesn’t have an API, web scraping is an effective tool for crawling a page to gather data.
That post reminded me of another tool that some of you might not be aware of.
Selenium is actually a driver: a piece of software that you can use to launch and control an “automated browser” through an API of sorts. It can locate elements in the DOM and manipulate them using virtual clicks, scrolls, and drags. You know that long, boring process you have to go through to log into your bank account? Selenium can completely automate that.
Creating a Selenium Project using Python
I’m going to assume you know how to create a basic Python project. Beyond that, you’ll need the Selenium library:
pip install selenium
You’ll also need to download the Selenium Driver:
Place the driver file in the same directory you plan to run your Python script from. In this example, I’m going to create a script called “selenium_test.py”.
Excellent. Now let’s write some code.
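from selenium import webdriver
from selenium.webdriver.chrome.service import Service
driver_path = "./chromedriver"
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(driver_path))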
Here we import the web driver, set our driver path, then create the driver.
Once these components are loaded into memory, we can begin to execute commands.
The call driver.get("https://bengarlock.com") instructs Selenium to open a browser and navigate to the URL we specify. The script then locates the title of the page, prints it to our console, and quits.
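For reference, the steps just described can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact script: the names fetch_title and main are made up for illustration, and it assumes Selenium 4, where the driver path is passed through a Service object.

```python
def fetch_title(driver, url):
    """Open url in the automated browser and return the page title."""
    try:
        driver.get(url)        # navigate to the page
        return driver.title    # text of the page's <title> element
    finally:
        driver.quit()          # close the browser even if something failed


def main():
    # Imported here so fetch_title can be exercised without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # Assumes chromedriver sits in the same directory as this script.
    driver = webdriver.Chrome(service=Service("./chromedriver"))
    print(fetch_title(driver, "https://bengarlock.com"))
```

Calling main() at the bottom of selenium_test.py would run the whole thing. Wrapping quit() in a finally block is a design choice: it keeps stray browser windows from piling up when a page fails to load.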
The browser you’re seeing in this gif is completely automated. All I’m doing is running my script. This is a basic example, but you can see how powerful this tool can be. For scraping applications, testing, performance analytics, or anything else that requires automation, Selenium is a solution.
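The same handful of calls covers the login chore mentioned earlier. Here is a rough sketch, assuming a page whose form fields have the ids "username", "password", and "submit". Those ids and the log_in helper are hypothetical, so you would swap in whatever the real page uses.

```python
def log_in(driver, login_url, username, password):
    """Fill in a login form, submit it, and return the resulting page title."""
    driver.get(login_url)
    # "id" is the locator strategy that Selenium's By.ID constant stands for.
    driver.find_element("id", "username").send_keys(username)  # hypothetical id
    driver.find_element("id", "password").send_keys(password)  # hypothetical id
    driver.find_element("id", "submit").click()                # hypothetical id
    return driver.title
```

Real pages often need a wait between the click and reading the title, so treat this as a starting point and not a finished login script.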
Is it Evil?
Not on its own, but it can be. Like all tools, it depends on how it’s used. A lot of companies use Selenium to run automated tests of their software or for legitimate data gathering. The potential problem with Selenium is that it can be used to pull a lot of data out of websites that may not want their data scraped. Ever notice that “robots.txt” file in the public directory of your Ruby project? That’s you telling the world which URLs you don’t want crawled. Here’s Amazon’s as an example.
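A script can read those rules before it crawls anything. Here is a minimal sketch using Python’s standard urllib.robotparser module. The robots.txt content, the bot name, and the example.com URLs are made up for illustration; they are not Amazon’s actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network call
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-selenium-bot", "https://example.com/private/data"))  # False
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-selenium-bot", "https://example.com/blog"))          # True
```

Against a live site, you could instead point the parser at the site’s robots.txt with set_url() and read(), then run the same can_fetch() check before handing a URL to Selenium.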
The moral of the story: be careful how you use Selenium and other tools like it, especially if you’re at a company that has to follow strict compliance standards.