Hey PowerShell, how popular is this Baby Name ?
When I did see this blogpost : Baby Names, Nameless Keys, and Mumbling , I also decided to "scrape" the Social Security Administration's Popular Baby Names site
I already did some webpage scraping like this in PowerShell : generate a random list of surnames . (Yes,Yes breaking rule seven ;-) )
but here we do a bot more processing, and make easy to work with the data we did scrape of the webpage.
While going on from the technique from that post, as this time the layout of the table is a bit different, also I get some more columns and convert it to an array of Objects so we can query easy on the data in PowerShell and also be able to export the data to other applications.
The Table on the Webpage looks like this :
After the text scraping and processing we can use it in PowerShell like this :
PoSH> $list | where {$_.name -eq 'Marc'}
Name : Marc
Count : 6335
Percent : 0.0507 %
Sex : Male
Rank : 289
PoSH> $list | where {$_.rank -eq 1}
Name : Jacob
Count : 179896
Percent : 1.4409 %
Sex : Male
Rank : 1
Name : Emily
Count : 149420
Percent : 1.2525 %
Sex : Female
Rank : 1
PoSH> $list | where {$_.Percent -gt 1} | Format-Table -AutoSize
Name Count Percent Sex Rank
---- ----- ------- --- ----
Jacob 179896 1.4409 % Male 1
Michael 165257 1.3236 % Male 2
Joshua 151094 1.2102 % Male 3
Emily 149420 1.2525 % Female 1
Matthew 148038 1.1857 % Male 4
Andrew 131862 1.0562 % Male 5
Christopher 129095 1.0340 % Male 6
Joseph 126394 1.0124 % Male 7
Daniel 125929 1.0086 % Male 8
Madison 123729 1.0372 % Female 2
PoSH> $list | Export-Csv -not nameslist.csv ; ii nameslist.csv
PoSH>
And export to Csv to do some Excel analyzing :
Ok, how does this work :
First we get the Source of the Page as HTML :
$decade = 2000
$wc = new-Object System.Net.WebClient
$nl = $wc.DownloadString("http://www.ssa.gov/OACT/babynames/decades/names$($decade)s.html")
Extract the data
As the data in the Table on the website is a but Awkward formatted ( Male and Female in one row ) we need to do some processing, to keep the Regex simple I decided to just grab all the name,percent and score cells, in one long array :
# Process list
$r = [regex]'="15%">(.*?)</td>'
$m = $r.Matches($nl)
As you can see here, this is not realy usefull yet, as it is not realy structured :
PoSH> $r = [regex]'="15%">(.*?)</td>'
PoSH> $m = $r.Matches($nl)
PoSH> $m.count
6000
PoSH> $m[0]
Groups : {="15%">Jacob</td>, Jacob}
Success : True
Captures : {="15%">Jacob</td>}
Index : 7697
Length : 17
Value : ="15%">Jacob</td>
PoSH> $m[0..10] |% {$_.groups[1].value}
Jacob
179,896
1.4409
Emily
149,420
1.2525
Michael
165,257
1.3236
Madison
123,729
PoSH>
So we need to do some more processing,
For this we loop trough the array make new objects, and adding the properties using add-member, as we know the order of the objects (Male name, Count, Percent,Female name,Count,Percent) I first get the Male name and use the MoveNext() Method of the $foreach variable, that is provided in a Foreach loop, here to skip iterations of the loop .
Also note that I convert the Count to an Integer (to be able to do a good sort on it later ) , and format the Percent at 4 digits behind comma.
# Convert to arrray of Objects
$script:list = @()
Foreach ($i in 0..($m.count -1) ) {
# Male
$Record = new-Object -typename System.Object
$Record | add-Member -memberType noteProperty -name Name -Value $m[$i].groups[1].value
[void] $foreach.MoveNext()
$Record | add-Member -memberType noteProperty -name Count -Value ([int]($m[$foreach.current].groups[1].value))
[void] $foreach.MoveNext()
$Record | add-Member -memberType noteProperty -name Percent -Value ("{0:p4}" -f (([double]$m[$foreach.current].groups[1].value) / 100))
$Record | add-Member -memberType noteProperty -name Sex -Value Male
$script:list += $record
# Female
[void] $foreach.MoveNext()
$Record = new-Object -typename System.Object
$Record | add-Member -memberType noteProperty -name Name -Value $m[$foreach.current].groups[1].value
[void] $foreach.MoveNext()
$Record | add-Member -memberType noteProperty -name Count -Value ([int]($m[$foreach.current].groups[1].value))
[void] $foreach.MoveNext()
$Record | add-Member -memberType noteProperty -name Percent -Value ("{0:p4}" -f (([double]$m[$foreach.current].groups[1].value) / 100))
$Record | add-Member -memberType noteProperty -name Sex -Value Female
$script:list += $record
}
Allready does look better ;-) :
PoSH> $list | select -First 5
Name Count Percent Sex
---- ----- ------- ---
Jacob 179896 1.4409 % Male
Emily 149420 1.2525 % Female
Michael 165257 1.3236 % Male
Madison 123729 1.0372 % Female
Joshua 151094 1.2102 % Male
PoSH>
Add Ranking :
As we lost the ranking during this proces we are going to re-generate that by sorting the list and adding the Rank Property :
# Sort
$list = $list | sort -desc count
# Add rank
$list | where {$_.Sex -eq 'Male'} |% {$rank = 0} {$rank++;$_ | add-Member -memberType noteProperty -name Rank -Value $rank}
$list | where {$_.Sex -eq 'Female'} |% {$rank = 0} {$rank++;$_ | add-Member -memberType noteProperty -name Rank -Value $rank}
And then we are done and can use this list to do some Babyname statistics ;-)
and you can see how high your name was in this list.
I hope this post showed how you can convert unstructured data into structured data in PowerShell and how easy it is after that to work with it in PowerShell and to Export this data.
I added the complete code also as an attachement to this post.
Enjoy,
Greetings /\/\o\/\/