Stack Overflow API
A Reflection on My First Data Engineer Project
Background
Lucky me, soon I’ll start working for a big corporation as a Data Engineer. This is my very first official job after following the Software Engineering pathway. I’ll be a full-time employee, not an intern, so I want to prepare for the position carefully. I’ve been searching around a lot. Unlike Software Development, there aren’t many resources guiding you on how to start a Data Engineering project, which was a bit of a hindrance when I wanted to sit down and work on one.
After a lot of Googling around, I finally came up with a simple idea to start with. Data Engineering, in the simplest terms, is the process of finding, cleaning, and preparing data for later use. Therefore, I followed exactly these steps to create this Stack Overflow API project.
Stack Overflow API is basically an API that scrapes the Stack Overflow website for the newest unanswered questions, transforms the data to suit our purpose, extracts it into multiple formats, and serves it through different endpoints.
Project Structure
The project structure is really simple: the main app and the notebooks. I use Jupyter Notebook (JN) to scrape, transform, and load the data; JN makes the process fluid and fast. After that, FastAPI takes care of loading the data and creating different endpoints that serve the extracted data.
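To make that split concrete, a layout like this is what I mean (the folder and file names here are illustrative, not the project’s actual ones):

```
stackoverflow-api/
├── app/
│   └── main.py            # FastAPI app and its endpoints
├── notebooks/
│   └── etl.ipynb          # scrape, transform, and save the data
└── data/
    ├── questions.csv      # transformed data, handy for analysis
    └── questions_data.py  # data as a Python list, imported by the API
```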
Data Pipeline
Scraping / Extracting
There are many ways to scrape the web, such as Beautiful Soup, Selenium, and Requests-HTML. I use Requests-HTML because of its ease of use.
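As a rough sketch of what that step can look like (the URL and the CSS selectors below are assumptions on my part, since Stack Overflow’s markup changes over time, so verify them against the live page):

```python
from requests_html import HTMLSession

session = HTMLSession()
# Assumed URL for the unanswered-questions listing.
response = session.get("https://stackoverflow.com/questions?tab=Unanswered")

questions = []
# ".s-post-summary" / ".s-link" are assumed class names for a
# question card and its title link; inspect the page to confirm.
for summary in response.html.find(".s-post-summary"):
    title_link = summary.find(".s-link", first=True)
    if title_link:
        href = title_link.attrs.get("href", "")
        questions.append({
            "title": title_link.text,
            "url": f"https://stackoverflow.com{href}",
        })

print(questions[:5])
```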
Transforming the Data
The information is scraped by targeting HTML classes and then organized into a list. Relying on classes was tricky at some points, but it worked out in the end.
The data was transformed and saved into .csv and .py files. The .py file contains the data in list form, and the .csv file can be used for analysis, for example.
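A minimal sketch of that save step, assuming the `questions` list from the scraping sketch above (the file names are again illustrative):

```python
import csv

# Write the scraped questions to a CSV file for later analysis.
with open("questions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(questions)

# Write the same data as a Python literal so the API can import it.
with open("questions_data.py", "w", encoding="utf-8") as f:
    f.write(f"questions = {questions!r}\n")
```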
Loading the Data
It’s not quite accurate to name this step Loading, since loading usually means loading the data into a data warehouse. However, we don’t have a data warehouse at the moment; we only use the data for the API here. I’m just trying to make these steps resemble an ETL process, so please excuse me for this 😅.
As mentioned above, the API runs on FastAPI. FastAPI imports the data list from the .py file and distributes the data through different endpoints.
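Here’s a minimal sketch of what those endpoints could look like, assuming the hypothetical `questions_data.py` module from the transform step (the endpoint paths are my own illustration, not necessarily the project’s):

```python
from fastapi import FastAPI, HTTPException

# Hypothetical module produced by the transform step; adjust the
# import to match the actual generated .py file.
from questions_data import questions

app = FastAPI(title="Stack Overflow API")

@app.get("/questions")
def list_questions():
    """Return all scraped unanswered questions."""
    return questions

@app.get("/questions/{index}")
def get_question(index: int):
    """Return a single question by its position in the list."""
    if index < 0 or index >= len(questions):
        raise HTTPException(status_code=404, detail="Question not found")
    return questions[index]
```

With that in place, `uvicorn main:app --reload` serves the API locally, and visiting /questions returns the scraped data as JSON.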
Conclusion
It was a really nice experience doing this project. It gave me a bit more concrete knowledge to later build more complex projects. Although the project is simple, it’s actually quite a big accomplishment in my opinion 😄. I’ll create more Data Engineering projects for sure! Stay tuned!