LinkedIn web scraping



I recently discovered a new R package for connecting to the LinkedIn API. Unfortunately, the LinkedIn API seems pretty limited to begin with: for example, you can only get basic data on companies, and this is detached from data on individuals. I'd like to get data on all employees of a given company, which you can do manually on the site but is not possible through the API. A tool that recognised LinkedIn's pagination (see the end of a results page) would be perfect.

Does anyone know any web scraping tools or techniques applicable to the current format of the LinkedIn site, or ways of bending the API to carry out more flexible analysis? Preferably in R or web based, but certainly open to other approaches.


Posted 2015-05-13T21:01:03.070

Reputation: 460


Web scraping LinkedIn is against their terms of service. See LinkedIn's "DOs and DON'Ts", under DON'T: "Use manual or automated software, devices, scripts, robots, other means or processes to access, “scrape,” “crawl” or “spider” the Services or any related data or information".

– Brian Spiering – 2015-05-23T21:03:07.677



Beautiful Soup is specifically designed for parsing pages in web crawling and scraping, but it is written for Python, not R.
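Beautiful Soup only parses HTML you have already fetched (typically with `requests` or `urllib`), so here is a minimal sketch on a static snippet. The class names are invented for illustration, not LinkedIn's actual markup, and note that running a scraper against LinkedIn itself would violate their terms of service:

```python
from bs4 import BeautifulSoup

# Static HTML standing in for one fetched page of results;
# the markup and class names here are hypothetical.
html = """
<div class="results">
  <li class="result"><a href="/profile/1">Alice</a></li>
  <li class="result"><a href="/profile/2">Bob</a></li>
</div>
<ul class="pagination">
  <li><a href="?page=2">2</a></li>
  <li><a href="?page=3">3</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selectors pick out the records and the links to the remaining pages
names = [a.get_text() for a in soup.select("li.result a")]
next_pages = [a["href"] for a in soup.select("ul.pagination a")]

print(names)       # ['Alice', 'Bob']
print(next_pages)  # ['?page=2', '?page=3']
```

Iterating over pages is then a matter of feeding each href in `next_pages` back into your HTTP client and re-parsing the result.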



Reputation: 1,185


I didn't think Beautiful Soup allowed you to iterate over pages, but it turns out you can. Thanks

– christopherlovell – 2015-05-14T06:24:01.073


Scrapy is a great Python library that can help you scrape different sites faster and give your code a better structure. Not all sites can be parsed with classic tools, because they may build their content dynamically with JavaScript. For those, it is better to use Selenium (a test framework for websites that also works as a great web scraping tool), for which a Python wrapper is available. Searching around will turn up a few tricks for using Selenium inside Scrapy, which keeps your code clear and organized while still letting you use Scrapy's tooling.

I think Selenium would be a better scraper for LinkedIn than the classic tools: there is a lot of JavaScript and dynamic content on the site. Also, if you want to authenticate with your account and scrape all available content, you will run into a lot of problems handling the login with simple libraries like requests or urllib.



Reputation: 219


I like rvest in combination with the SelectorGadget Chrome plug-in for selecting the relevant sections.

I've used rvest to build small scripts that paginate through forums:

  1. Look for the "Page n of m" object
  2. Extract m
  3. Based on the page structure, build a list of links from page 1 to page m
  4. Iterate the scraper through the full list of links
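Those four steps are language-agnostic; here is a minimal sketch of the same logic in Python (the language of the other answers' tools), where the regex and the URL template are assumptions about a hypothetical forum, not any particular site:

```python
import re

def page_links(pager_text, url_template):
    """Turn a 'Page n of m' label into a full list of page URLs."""
    # Steps 1-2: find the "Page n of m" object and extract m
    m = int(re.search(r"Page\s+\d+\s+of\s+(\d+)", pager_text, re.I).group(1))
    # Step 3: build links for pages 1..m from the site's URL pattern
    return [url_template.format(page=i) for i in range(1, m + 1)]

# Step 4 would then iterate the scraper over this list
links = page_links("Page 1 of 3", "https://example.com/forum?page={page}")
print(links)
# ['https://example.com/forum?page=1',
#  'https://example.com/forum?page=2',
#  'https://example.com/forum?page=3']
```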



Reputation: 41


I would also go with Beautiful Soup if you know Python. If you would rather code in JavaScript/jQuery (and are familiar with node.js), you may want to check out CoffeeScript (there is a tutorial). I have already used it successfully on several occasions for scraping web pages.



Reputation: 1


lxml is a nice web scraping library in Python. Beautiful Soup is essentially a wrapper that can sit on top of lxml, so lxml itself is faster than both Scrapy and Beautiful Soup, and in my experience it has a much easier learning curve.

This is an example of a scraper I built with it for a personal project; it can iterate over web pages.
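For illustration (this is a generic sketch on invented markup, not that project), lxml can pull records and the pagination label out of a page with XPath:

```python
from lxml import html

# Invented markup standing in for one fetched page of results
page = html.fromstring("""
<ul>
  <li class="result"><a href="/p/1">Alice</a></li>
  <li class="result"><a href="/p/2">Bob</a></li>
</ul>
<span class="pager">Page 1 of 12</span>
""")

# XPath queries extract the records and the "Page n of m" text
names = page.xpath("//li[@class='result']/a/text()")
pager = page.xpath("//span[@class='pager']/text()")[0]

print(names)  # ['Alice', 'Bob']
print(pager)  # Page 1 of 12
```

From the pager text you can extract m and iterate over the remaining pages, exactly as the rvest answer above describes.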



Reputation: 7,606