Parsing LinkedIn html pages with PowerShell

A couple of weeks ago we posted a job opening on LinkedIn (were looking for a person to be in charge of our Jelastic‘s professional services), and it turned out that while LinkedIn jobs attract a lot of applications, the site itself does not make it easy to process them afterwards. You get CVs in email, and they also post a list of applicants with email addresses, phone numbers, titles, etc. – but there is no way to export the list to, say, Excel. In our case, we really wanted to have the data exported, so we could jointly work on a shared spreadsheet and everyone involved could grade each applicant and add notes to the table.

Being a PowerShell guy, I wrote the script below that does the scraping for me. :) Basically, I just saved the page with the list of applicants to my local disk and found that in their html, each applicant information is contained in vcard element, which has class name with LinkedIn URL and the actual name, and then elements with email and phone number:

So all my script has to do is: create an IE object and then use it to find the corresponding fields, then create custom objects from them, add them to the collection, and export it to CSV. Here’s the code – hope it helps you solve similar tasks when other sites do not provide good export capabilities:

$ie = new-object -com "InternetExplorer.Application"

# The easiest way to accomodate for slowness of IE
Start-Sleep -Seconds 1

$ie.Navigate("D:\Temp\LinkedIn.htm")

# The easiest way to accomodate for slowness of IE
Start-Sleep -Seconds 1

$doc = $ie.Document

# Get a collection of vcard elements
$cards = $doc.body.getElementsByClassName("vcard")

# This will be our collection of parsed objects
$processesCards = @()

# Iterate through the collection
for ($i=0; $i -lt $cards.length; $i++) {

 $itm = $cards.item($i)

 # Get the 'name' element that has the applicant name and URL
 $name = $itm.getElementsByClassName("name").item(0).
                       getElementsByTagName("a").item(0)

 # If you want you can output the name to the screen 
 # so you know where you are
 $name.outerText

 # Get the phone number and email address
 $phone = $itm.getElementsByClassName("phone").item(0)
 $email = `
   $itm.getElementsByClassName("trk-applicant-email").item(0)

 # Below is PowerShell v3 notation. 
 # In v2, replace '[pscustomobject]' with 
 # 'new-object psobject -Property' 
 $obj = [pscustomobject] @{"name"=$name.outerText; 
                           "url"=$name.href; 
                           "email"=$email.outerText; 
                           "phone"= $phone.outerText }

 $processesCards += $obj

}

# Export to CSV - which you can open in Excel
$processesCards | Export-Csv D:\Temp\linkedin.csv 
About these ads

0 Responses to “Parsing LinkedIn html pages with PowerShell”



  1. Leave a Comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s




My Recent Tweets

Legal

The posts on this blog are provided “as is” with no warranties and confer no rights. The opinions expressed on this site are mine and mine alone, and do not necessarily represent those of my employer - WSO2 or anyone else for that matter. All trademarks acknowledged.

© 2007-2014 Dmitry Sotnikov

April 2012
M T W T F S S
« Mar   Aug »
 1
2345678
9101112131415
16171819202122
23242526272829
30  

Follow

Get every new post delivered to your Inbox.

Join 2,329 other followers

%d bloggers like this: