Scraping Job Postings from Indeed
So you're looking for a job and want to search smarter rather than harder? Why not build a web scraper that will collect and parse job posting data for you? Set it and forget it, then return to your data treasures in a neat and tidy format! So, how do you do it? Let's have a look at this together!
[Before we begin, it's worth noting that many websites limit or outright prohibit scraping data from their pages. Depending on where and how users attempt to scrape information, they can face legal consequences. Many websites have a dedicated page at www.[site].com/robots.txt that lists their scraping restrictions. Be especially cautious with sites that store user data — sites like Facebook, LinkedIn, and even Craigslist do not appreciate data being scraped off their pages. So, scrape carefully.]
For this project, I wanted to look at data science jobs posted in a number of cities on indeed.com, a job aggregator that updates several times per day. I scraped information from Indeed's pages using Python's "requests" and "BeautifulSoup" libraries, then assembled my data into a dataframe using the "pandas" library for further cleaning and review.
Taking a look at the URL and Page layout
Let's start with a sample page from Indeed.
There are a few items to note about the URL structure:
Note that “q=” starts the string for the “what” field on the website, with “+” separating search terms (e.g., looking for “data+scientist” jobs).
When a salary is specified, the URL percent-encodes the figure: “%24” stands for the dollar sign and “%2C” for the comma, so the salary appears as the encoded dollar sign, the digits before the comma, the encoded comma, and then the rest of the number (e.g., “%2420%2C000” = $20,000).
Note that the string for city of interest starts with “&l=,” with “+” separating search terms if city is more than one word (i.e. “New+York”).
“&start=” is a placeholder for the search result where you want to start (i.e., start by looking at the 10th result)
This URL structure will come in handy when we build a scraper to loop over and collect data from a series of pages, so keep it in mind.
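The URL pieces above can be assembled programmatically. Here's a minimal sketch: the parameter names (`q`, `l`, `start`) come from the structure just described, while the helper function name and base URL are illustrative assumptions.

```python
# Sketch of building an Indeed search URL from query, city, and start offset.
# urlencode handles the "+" for spaces and "%2C"-style escaping for us.
from urllib.parse import urlencode

def build_indeed_url(query, city, start=0, base="https://www.indeed.com/jobs"):
    """Assemble a search URL; base and helper name are illustrative."""
    params = {"q": query, "l": city, "start": start}
    return f"{base}?{urlencode(params)}"

url = build_indeed_url("data scientist", "New York", start=10)
# e.g. https://www.indeed.com/jobs?q=data+scientist&l=New+York&start=10
```

Using `urlencode` rather than hand-gluing strings means multi-word queries and cities are escaped correctly without special-casing them.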
Each page of search results lists 15 job postings. Five of these are “sponsored” jobs, which Indeed displays out of order relative to the rest of the results. The remaining ten results are specific to the current page.
HTML tags are used to code all of the information on this page. HTML (HyperText Markup Language) is the markup that tells your web browser how to display the contents of a given page when you visit it.
This includes the basic structure and order of the document. HTML tags often have attributes, which are useful for keeping track of what details can be found where on a page's layout.
By right-clicking on a page and selecting "Inspect" from the menu that appears, Chrome users can examine the HTML structure of the page. A panel will appear on the right-hand side of the window, containing a long list of nested HTML tags that hold the data currently displayed in your browser. In the upper-left corner of this panel is a small box with an arrow icon; it turns blue when you click it (shown in the screenshot below). This lets you hover your cursor over items on the page to see both the tag associated with each item and where that item sits in the page's HTML.
I've hovered over one of the job listings in the screenshot above to illustrate how the job's entire contents are contained inside a &lt;div&gt; tag with attributes like class="row result", id="pj_7a21e2c11afb0428", and so on. Fortunately, we won't need to know every attribute of every tag to extract our data, but knowing how to read a page's HTML structure is useful.
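Those tag attributes are exactly what BeautifulSoup keys on when extracting data. Here's a minimal sketch against an inline HTML snippet; the class names ("row result", "jobtitle", "company") mirror what the inspector shows, but Indeed's markup changes over time, so treat them as assumptions rather than current fact.

```python
# Sketch: pull job cards out of HTML by matching tag attributes.
# The sample markup below is a hand-made stand-in for one Indeed job card.
from bs4 import BeautifulSoup

sample_html = """
<div class="row result" id="pj_7a21e2c11afb0428">
  <a class="jobtitle" href="/rc/clk?jk=123">Data Scientist</a>
  <span class="company">Acme Analytics</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
for card in soup.find_all("div", class_="result"):
    # get_text(strip=True) drops the surrounding whitespace from the tag text
    title = card.find("a", class_="jobtitle").get_text(strip=True)
    company = card.find("span", class_="company").get_text(strip=True)
    print(title, "-", company)
```

Note that `class_="result"` matches any element whose class list contains "result", which is handy for multi-class attributes like "row result".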
Now we'll use Python to extract the HTML from the page and start working on our scraper.
Putting Together the Scraper Components
Now that we've looked at the page's basic structure and learned a bit about its HTML structure, we can start thinking about writing code to extract the details we want. We'll start by importing our libraries. Notice that I'm also importing "time," which can be a useful way of staggering page requests so that a site's servers aren't overwhelmed while scraping data.
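Here is a sketch of those imports, plus a small fetch helper that uses "time" to pause between requests. The helper's name and the one-second default delay are my own choices, not something prescribed by the article.

```python
# Imports for the scraper: time for pacing, requests for HTTP,
# BeautifulSoup for parsing, pandas for assembling the results later.
import time

import requests
import pandas as pd
from bs4 import BeautifulSoup

def fetch_page(url, delay=1.0):
    """Fetch one results page, pausing first so requests are staggered."""
    time.sleep(delay)                      # be polite to the site's servers
    response = requests.get(url, timeout=10)
    response.raise_for_status()            # fail loudly on HTTP errors
    return BeautifulSoup(response.text, "html.parser")
```

Calling `fetch_page` inside a loop over `&start=` offsets gives each request a built-in pause, which keeps the scraper from hammering the server.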