In this post, we'll be making our first small little web scraping app.
Before we get started, let's just talk a little bit about web scraping and what it is. The most simlified definition for web scraping is "extracting data from websites", which is somewhat implied by the name. It has always been very much of a grey area. Going into a legal discussion is beyond the scope of this article, though I will recommend this blog post going into deeper detail about that.
So, to introduce today's project, we'll be building a simple GitHub follower counter, to count how many followers a user has on GitHub through the terminal.
Initialising
First, let's make a directory for this repository.
mkdir github-follower-counter
cd github-follower-counter
Open it in your code editor. If you're using Visual Studio Code you can simply do code .
yarn init -y
# For NPM
# npm init -y
Install puppeteer
yarn add puppeteer
# For NPM
# npm i puppeteer
Getting started with the code
First off, let's import puppeteer into our project.
const puppeteer = require('puppeteer')
Now, let's get the terminal arguments from the user. To do this, we can use process.argv
let username = process.argv[2]
if (username == null) return console.log('Error! Please specify a user!')
Next, let's create our getFollowers function.
const getFollowers = async(user=`https://github.com/${username}`) => {
}
Inside it, let's launch the browser, open a new tab, and navigate to the URL.
let browser = await puppeteer.launch()
let page = await browser.newPage()
await page.goto(user)
Inside it, let's evaluate the page.
let githubFollowers = await page.evaluate(() => {
})
Now, let's get the follower count. If we navigate over to GitHub, and right click < view page source (or ctrl+u). We can see the code of the website.
Inside of here, we can see that the span
element, with the class of text-bold text-gray-dark
has the current follower count.
Back to our code, let's do
const followerCount = document.querySelector('span.text-bold').innerHTML
Now, let's output the results. There is an error however. If a user does not exist, then it will show us as "optional" on the follower count. To prevent this, we can do...
if (followerCount == 'optional') return('Error! Incorrect username, make sure to double check your spelling.')
else return(`That user has a total of ${followerCount} followers!`)
Next, back to our function, let's output this.
let githubFollowers = await page.evaluate(() => {
const followerCount = document.querySelector('span.text-bold').innerHTML
if (followerCount == 'optional') return('Error! Incorrect username, make sure to double check your spelling.')
else return(`That user has a total of ${followerCount} followers!`)
})
console.log(githubFollowers)
})
Make sure to close the browser window as well.
await browser.close()
At the bottom, don't forget to call this function.
getFollowers()
And you should be good to go! Make sure to run node index.js
followed by a username to test it out!
_Note: a far better way to do this is to use the GitHub API. This was primarily a way on how to select and get certain elements, if you're looking to make an actual project with this, then the GitHub API is the way to go!
Thanks for reading, Happy Thanksgiving.