Web Scraping Using Different Libraries and Tips to Avoid Getting Blocked While Scraping
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
History:
The history of web scraping dates back nearly to the time when the World Wide Web was born.
- After the birth of the World Wide Web in 1989, the first web robot, World Wide Web Wanderer, was created in June 1993. It was intended only to measure the size of the web.
- In December 1993, the first crawler-based web search engine, JumpStation, was launched. As there were not so many websites available on the web, search engines at that time used to rely on their human website administrators to collect and edit the links into a particular format. In comparison, JumpStation brought a new leap, being the first WWW search engine that relied on a web robot.
- In 2000, the first Web API and API crawler appeared. API stands for Application Programming Interface. It is an interface that makes it much easier to develop a program by providing the building blocks. In 2000, Salesforce and eBay launched their own APIs, which enabled programmers to access and download some of the data available to the public. Since then, many websites have offered web APIs for people to access their public databases.
The three major tools for web scraping with Python are:
i) BeautifulSoup
ii) Selenium
iii) Scrapy
Each tool has its pros and cons. Let's look at them and summarize which one suits you best for extracting information from a website.
BeautifulSoup
Pros:
- It is easy to import and to master.
- Many websites serve mostly static HTML (often generated by PHP); BeautifulSoup works well in these situations.
Cons:
- It is inefficient when large chunks of data are fetched.
Selenium
Pros:
- Selenium is used for both scraping and testing.
- For web pages that rely on JavaScript and JSON and are more dynamic in nature, Selenium plays a more pivotal role than BeautifulSoup.
Cons:
- It is also inefficient when a large amount of data is fetched.
- It is difficult to import and master.
Scrapy
Pros:
- Scrapy is good when you are dealing with a large amount of data.
- Scrapy is easy to import.
- It is efficient.
Cons:
- It is hard to master.
Tips to avoid getting blocked during scraping
1. ROBOTS.TXT
First of all, you have to understand what the robots.txt file is and what it does. It tells search engine crawlers which pages or files they can or can't request from a site. This is used mainly to avoid overloading a website with requests. The file provides the standard rules for scraping that site. Many websites allow Google to crawl their pages. You can find the robots.txt file at the root of a website, for example http://example.com/robots.txt.
Sometimes a website has User-agent: * followed by Disallow: / in its robots.txt file, which means it doesn't want any crawler to scrape it.
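Here is a minimal sketch of checking these rules from Python with the standard library's urllib.robotparser. The robots.txt content and the user-agent name "my-scraper" are made up for illustration; in a real scraper you would download the file from the target site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-scraper" is a made-up user-agent name for this example.
allowed_public = parser.can_fetch("my-scraper", "http://example.com/index.html")
allowed_private = parser.can_fetch("my-scraper", "http://example.com/private/data.html")

# Crawl-delay, when present, is the number of seconds the site asks you to wait.
delay = parser.crawl_delay("my-scraper")
```

With this sample file, the public page may be fetched, the page under /private/ may not, and the requested delay between requests is 10 seconds.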
Basically, an anti-scraping mechanism works on one fundamental question: is it a bot or a human? To answer it, the mechanism checks certain criteria before making a decision.
Points referred to by an anti-scraping mechanism:
- If you are scraping pages faster than a human possibly can, you will fall into a category called "bots".
- If you follow the same pattern while scraping. For example, you go through every page of the target domain just to collect images or links.
- If you are scraping using the same IP for a certain period of time.
- If the User-Agent is missing. Maybe you are using a headless browser or an HTTP client that does not send a browser-like User-Agent header.
If you keep these points in mind while scraping a website, you will be able to scrape most websites on the web.
2. IP Rotation
Using a single IP is the easiest way for anti-scraping mechanisms to catch you red-handed. If you keep using the same IP for every request, you will be blocked, so spread your requests across different IPs. You should have a pool of at least 10 IPs before you start making HTTP requests. To avoid getting blocked you can use proxy rotating services like Scrapingdog or any other proxy service. Below is a small Python code snippet which can be used to create a pool of IP addresses before making a request.
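from bs4 import BeautifulSoup
import requests

# Assumption: a two-letter country code; the original snippet left this undefined.
country_code = "us"
url = "https://www.proxynova.com/proxy-server-list/country-" + country_code + "/"

proxies = []  # the pool of proxies found on the page
respo = requests.get(url, timeout=10).text
soup = BeautifulSoup(respo, "html.parser")

# Each table row is expected to hold one proxy: IP, port, ..., country.
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    proxy = {}
    try:
        # The cell text may be wrapped in document.write(...), so strip that away.
        proxy["ip"] = (cells[0].text.replace("\n", "").replace("document.write(", "")
                       .replace(")", "").replace("'", "").replace(";", ""))
    except IndexError:
        proxy["ip"] = None
    try:
        proxy["port"] = cells[1].text.replace("\n", "").replace(" ", "")
    except IndexError:
        proxy["port"] = None
    try:
        proxy["country"] = cells[5].text.replace("\n", "").replace(" ", "")
    except IndexError:
        proxy["country"] = None
    # Rows without enough <td> cells (such as header rows) have no port and are skipped.
    if proxy["port"] is not None:
        proxies.append(proxy)

print(proxies)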
This prints a list of dictionaries, each with three keys: ip, port, and country. The proxies are listed according to a country code. You can find country codes here.
But for websites that have advanced bot detection mechanisms, you have to use either mobile or residential proxies. You can again use Scrapingdog for such services. The number of IPs in the world is fixed. By using these services you get access to millions of IPs, which can be used to scrape millions of pages. This is the best thing you can do to scrape successfully for a longer period of time.
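Once you have a pool, you still need to rotate through it. The following is a minimal sketch under stated assumptions: the pool entries are hypothetical (they use documentation-range IP addresses, not real proxies) and follow the ip/port/country shape of the list built by the snippet above. The helper names to_requests_proxies and next_proxies are mine.

```python
import itertools

# Hypothetical pool in the same shape as the scraped proxy list
# (documentation-range IP addresses, not real proxies).
proxy_pool = [
    {"ip": "203.0.113.10", "port": "8080", "country": "US"},
    {"ip": "203.0.113.11", "port": "3128", "country": "US"},
    {"ip": "203.0.113.12", "port": "8000", "country": "US"},
]


def to_requests_proxies(proxy: dict) -> dict:
    """Convert one pool entry into the mapping that requests accepts as proxies=."""
    address = f"http://{proxy['ip']}:{proxy['port']}"
    return {"http": address, "https": address}


# cycle() loops over the pool endlessly, so each call hands out the next proxy.
rotation = itertools.cycle(proxy_pool)


def next_proxies() -> dict:
    """Return the proxies mapping for the next proxy in the rotation."""
    return to_requests_proxies(next(rotation))


first = next_proxies()
second = next_proxies()
```

Each request can then be sent through a different proxy, for example requests.get(url, proxies=next_proxies(), timeout=10). Free proxies fail often, so in practice you would also want to drop entries that time out.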
3. User-Agent
The User-Agent request header is a character string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent. Some websites block requests whose User-Agent doesn't belong to a major browser. If the User-Agent is not set, many websites won't let you view their content. You can find your user agent by searching for "what is my user agent" on Google.
You can also check your user-agent string here:
http://www.whatsmyuseragent.com/
Anti-scraping mechanisms treat user agents much as they treat IPs when banning. If you use the same user agent for every request, you will be banned in no time. What is the solution? It is pretty simple: either create a list of User-Agents or use a library like fake-useragent. I have used both techniques, but for efficiency purposes I urge you to use the library.
A user-agent string listing to get you started can be found here:
http://www.useragentstring.com/pages/useragentstring.php
https://developers.whatismybrowser.com/useragents/explore/
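The hand-made list approach can be sketched with only the standard library. The user-agent strings below are illustrative examples in the usual browser format and will go out of date, so treat them as placeholders to be refreshed from the listings above. The function names are mine.

```python
import random
import urllib.request

# Illustrative user-agent strings; replace them with current ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_headers() -> dict:
    """Pick a different User-Agent at random for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def build_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying a random User-Agent."""
    return urllib.request.Request(url, headers=random_headers())


req = build_request("http://example.com/")
# urllib stores header names capitalized as "User-agent".
chosen = req.get_header("User-agent")
```

The same random_headers() dictionary can be passed as headers= to requests.get if you use the requests library instead of urllib.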
4. Make Scraping Slower, Keep Random Intervals in Between
As you know, humans and bots crawl websites at very different speeds. Bots can scrape websites at a very fast pace. Making fast, unnecessary, or random requests to a website is not good for anyone. A website may even go down because of this overload of requests.
To avoid this mistake, make your bot sleep programmatically between requests. This makes your bot look more human to the anti-scraping mechanism and also avoids harming the website. Scrape only a small number of pages at a time and keep the number of concurrent requests low. Pause for around 10 to 20 seconds and then continue scraping. As I said earlier, respect the robots.txt file.
Use auto-throttling mechanisms, which automatically adjust the crawling speed based on the load on both the spider and the website that you are crawling. Adjust the spider to an optimum crawling speed after a few trial runs. Do this periodically because the environment does change over time.
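A random pause between requests could look like the sketch below. The 10 to 20 second range comes from the advice above; the function names and the injectable sleep and fetch parameters are my own choices, added so the logic can be tried without actually waiting or hitting a website.

```python
import random
import time


def polite_sleep(min_seconds=10.0, max_seconds=20.0, sleep=time.sleep):
    """Pause for a random time between min_seconds and max_seconds; return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def crawl(urls, fetch, min_seconds=10.0, max_seconds=20.0, sleep=time.sleep):
    """Fetch URLs one at a time, pausing for a random interval between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the very first request
            polite_sleep(min_seconds, max_seconds, sleep)
        results.append(fetch(url))
    return results
```

Here fetch is any function that downloads one URL, for example a small wrapper around requests.get. Because the delay is random rather than fixed, the request timing does not form an obvious pattern.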
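If you use Scrapy, one of the tools covered earlier, its built-in AutoThrottle extension is one such mechanism. The fragment below is a sketch of a project's settings.py, assuming a Scrapy project; the numbers are illustrative starting points to tune after your trial runs, not recommended values.

```python
# settings.py of a Scrapy project (assumed setup; values are illustrative)

AUTOTHROTTLE_ENABLED = True            # turn the AutoThrottle extension on
AUTOTHROTTLE_START_DELAY = 5           # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # upper bound on the delay when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
DOWNLOAD_DELAY = 2                     # AutoThrottle will not go below this delay
ROBOTSTXT_OBEY = True                  # respect robots.txt rules
```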
5. Change in Scraping Pattern & Detect website change
Generally, humans don’t perform repetitive tasks as they browse through a site with random actions. But web scraping bots will crawl in the same pattern because they are programmed to do so. As I said earlier some websites have great anti-scraping mechanisms. They will catch your bot and will ban it permanently.
Now, how can you protect your bot from being caught? This can be achieved by incorporating some random clicks on the page, mouse movements, and other random actions that make a spider look like a human.
Another problem is that many websites change their layouts for various reasons, and when that happens your scraper will fail to bring back the data you are expecting. For this, you should have a monitoring system that detects changes in their layouts and alerts you. You can then use this information to update your scraper accordingly.
One of my friends is working in a large online travel agency and they crawl the web to get prices of their competitors. While doing so they have a monitoring system that mails them every 15 minutes about the status of their layouts. This keeps everything on track and their scraper never breaks.
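One simple way to detect a layout change is to fingerprint the structure of a page and compare it with a stored baseline. The sketch below is my own illustration, not the travel agency's system: it records every opening tag together with its class attribute, ignores the text, and hashes the result. The HTML snippets are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class StructureCollector(HTMLParser):
    """Record each opening tag with its class attribute, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        self.parts.append(f"{tag}.{css_class}")


def layout_fingerprint(html: str) -> str:
    """Hash the sequence of tags and classes so that only structure matters."""
    collector = StructureCollector()
    collector.feed(html)
    collector.close()
    return hashlib.sha256("|".join(collector.parts).encode("utf-8")).hexdigest()


def layout_changed(old_html: str, new_html: str) -> bool:
    return layout_fingerprint(old_html) != layout_fingerprint(new_html)


# Hypothetical snapshots of a price page.
baseline = '<div class="price"><span class="amount">100</span></div>'
same_layout = '<div class="price"><span class="amount">120</span></div>'
new_layout = '<section class="cost"><b>120</b></section>'

price_only_changed = layout_changed(baseline, same_layout)
redesign_detected = layout_changed(baseline, new_layout)
```

A changed price leaves the fingerprint untouched, while a redesign changes it. A scheduled job could run this check and send an alert whenever layout_changed returns True. Pages with a varying number of repeated items would need a looser comparison, such as checking only the elements your scraper relies on.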