Excel, Python: Basic data scraping for NBL Australia basketball game stats
I recently went to Onlinejobsph again to look for quick gigs I could do on the side using my skills. This time, I found a post looking for someone to scrape basketball data.
I reached out and got a reply about the site they wanted to scrape and the data they needed. I replied probing for more details so I could decide how much to charge and scope the requirement, but after a few days I had received no response and the listing had been removed.
The post stated they weren't interested in it being done by AI because they had already tried that. I mentioned that I do use AI, but I have enough working knowledge of Python to tell whether the code it produces is garbage or not.
Since I didn't want my initial effort to go to waste, I started building the script anyway as practice.
The site is NBL, a basketball league from Australia. Luckily, they don't have heavy anti-bot measures (or maybe they do, but I haven't triggered them during testing).
The first thing I work out when scraping is how the data is generated and presented: what the API endpoints are and which parameters are required to produce the result set.
Poking around, I found that the site looks for an ID that gets passed to their API. If you pass a valid ID, it returns a JSON structure containing the data about the match.
I won't go into details, but those who know how the site works, know.
Since this is for learning, I didn't look at how other people might have scraped the site. I only used Google/AI to make my code more efficient and less redundant.
The script first asks the user for a URL. From that URL, a function extracts the ID that will be passed to the API.
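The extraction step can be sketched like this. Since I'm not documenting the real URL structure, this is a hypothetical version that assumes the ID is the trailing run of digits in the URL path:

```python
import re


def extract_match_id(url: str):
    """Pull the match ID out of a game URL.

    Hypothetical: assumes the ID is the last run of digits in the
    path (with or without a trailing slash). The real NBL URL
    structure may differ.
    """
    match = re.search(r"(\d+)/?$", url)
    return match.group(1) if match else None
```

If the URL doesn't end in an ID, the function returns None so the caller can prompt the user again instead of hitting the API with garbage.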
Once we have the ID, another function fetches the JSON response from the API and saves the data to a file inside a folder named json. My assumption is that we're not interested in fetching data as the game happens, only the final results.
The rationale is that I want to avoid rate limits (from re-scraping), since we're only processing finished games.
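A minimal fetch-and-cache sketch of that idea, using the stdlib rather than any particular HTTP library. The API_URL here is a placeholder, not the real NBL endpoint:

```python
import json
import urllib.request
from pathlib import Path

# Placeholder endpoint -- the real NBL API URL is deliberately not shown.
API_URL = "https://api.example.com/matches/{match_id}"


def fetch_match_json(match_id: str, out_dir: str = "json") -> dict:
    """Fetch the match JSON once and cache it on disk.

    Finished games never change, so if the file already exists in the
    json folder we read it back instead of hitting the API again,
    which sidesteps rate limits on repeated runs.
    """
    cache = Path(out_dir) / f"{match_id}.json"
    if cache.exists():
        return json.loads(cache.read_text())
    with urllib.request.urlopen(API_URL.format(match_id=match_id), timeout=30) as resp:
        raw = resp.read().decode("utf-8")
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_text(raw)
    return json.loads(raw)
```

The cache check runs before any network call, so parsing logic can be iterated on freely against games that have already been downloaded.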
A for loop then runs over the player_match_statistics_data to get the details that build each row. Since the player and team data are nested, only specific fields are fetched using get; I'm not interested in external_player_image and the like.
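The row-building step looks roughly like this. The field names below are illustrative stand-ins, not the real JSON keys:

```python
def build_player_row(entry: dict) -> dict:
    """Flatten one player_match_statistics entry into a flat row.

    Hypothetical keys for illustration only. Using .get() means a
    missing nested field becomes None instead of raising a KeyError,
    and unwanted fields like external_player_image are simply never
    requested.
    """
    player = entry.get("player") or {}
    team = entry.get("team") or {}
    return {
        "player_name": player.get("full_name"),
        "team_name": team.get("name"),
        "points": entry.get("points"),
        "rebounds": entry.get("rebounds_total"),
        "assists": entry.get("assists"),
    }
```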
Once everything is ready, each built row is appended to the pmsd_rows list. The rows and headers are then passed to a pandas DataFrame. The same idea was used for the team_match_statistics and play_by_play data, just with a slightly different implementation.
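The final assembly step is a small helper; the headers here are an illustrative subset, not the actual column list:

```python
import pandas as pd


def rows_to_frame(pmsd_rows: list, headers: list) -> pd.DataFrame:
    """Collect the accumulated per-player rows into a DataFrame.

    The same pattern is reused for team_match_statistics and
    play_by_play, just with different headers and row builders.
    """
    return pd.DataFrame(pmsd_rows, columns=headers)
```

From there the DataFrame can be written out with to_csv or to_excel depending on what the client wants.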
