Excel, Python: Basic data scraping for NBL Australia basketball game stats

I recently went to Onlinejobsph again to look for quick gigs that can be done on the side using my skills. This time, I found a post looking for someone who can scrape basketball data.

I reached out and got a reply about the site they want to scrape and the data they need. I replied back probing for more details so I could decide on the scope and how much to charge, but after a few days I received no response and the listing was removed.

The post stated they're not interested in it being done by AI because they've already tried that. I mentioned that I do use AI, but I have enough working knowledge of Python to tell whether the code it produces is garbage or not.

Since I didn't want my initial effort to go to waste, I started building the script anyway as practice.

The site is the NBL, Australia's basketball league. Luckily, they don't have heavy anti-bot measures (or maybe they do, but I haven't triggered them in testing).

The first thing I work out when scraping is how the data is generated and presented: what the API endpoints are, and which parameters are required to generate the result set.

Poking around, I found that the site looks for an ID that gets passed to their API. If you pass a valid ID, the API returns a JSON structure containing the data about the match.

I won't go into details, but those who know how the site works, know.

Since this is for learning/knowledge, I did not look for how other people might have scraped from the site. I only used Google/AI for making my code more efficient and less redundant.

The script first asks the user for a URL. From that URL, a function looks for the ID that will be passed to the API.
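As a sketch of that ID-extraction step (the exact NBL URL layout isn't reproduced here, so the regex simply grabs the last long run of digits in the URL, which is an assumption on my part):

```python
import re


def extract_match_id(url: str):
    """Pull the match ID out of a game URL.

    Assumption: the ID is the last run of five or more digits
    anywhere in the URL; the real NBL URL layout may differ.
    Returns None when no candidate ID is found.
    """
    matches = re.findall(r"\d{5,}", url)
    return matches[-1] if matches else None
```

Returning None instead of raising lets the caller decide whether to re-prompt the user for a valid link.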

Once we have the ID, a function fetches the JSON response from the API and saves it to a file inside a folder named json. My assumption is that we're not interested in fetching the data live as the game happens, but rather in the final results.

The rationale is that I want to avoid rate limits (from re-scraping), since we're only processing finished games: if the JSON file already exists on disk, there's no need to hit the API again.
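A minimal sketch of that fetch-and-cache step, using only the standard library. The endpoint URL below is a placeholder (I'm deliberately not reproducing the real one), and the cache-before-fetch check is what keeps re-runs from re-scraping finished games:

```python
import json
import urllib.request
from pathlib import Path

# Placeholder -- the real NBL API endpoint is deliberately not shown here.
API_URL = "https://example.invalid/api/match/{match_id}"
CACHE_DIR = Path("json")


def fetch_match_json(match_id: str) -> dict:
    """Return the match JSON, caching it under json/<match_id>.json.

    Finished games never change, so a cached file is always reused;
    the API is only hit on a cache miss.
    """
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file = CACHE_DIR / f"{match_id}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())

    with urllib.request.urlopen(API_URL.format(match_id=match_id), timeout=30) as resp:
        raw = resp.read().decode()
    cache_file.write_text(raw)  # save first, so a re-run skips the network
    return json.loads(raw)
```

Saving the raw response text (rather than a re-serialized dict) means the cache file is byte-for-byte what the API returned, which helps when debugging the parser later.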



Once we have the JSON file with the game data, I created a function that takes the ID as a parameter, reads the JSON, and pulls out the keys/values. Since I got 'ghosted' on the gig, there's no actual information on what they want aside from the play-by-play data on the page.

For this learning project, we'll take the data from the player_match_statistics, team_match_statistics, and play_by_play.


For player_match_statistics, there's nested data for player and team, so instead of typing out every possible header, I just took the first item in the index and fetched all the keys from it as headers, up to index position 40.

I then extended the list of headers to include the column names I want.
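That header-building step might look something like this sketch. The 40-key cutoff mirrors the post, but the extra column names appended at the end are my own placeholder choices, not the actual NBL schema:

```python
def build_player_headers(player_stats: list) -> list:
    """Derive column headers from the first stats record.

    Headers = the first 40 keys of the first record, plus hand-picked
    columns for the values pulled out of the nested player/team dicts.
    """
    headers = list(player_stats[0].keys())[:40]
    headers.extend(["player_name", "team_name"])  # assumed extra columns
    return headers
```

This relies on every record sharing the same key order as the first one, which held up in my testing but is worth keeping in mind if the API changes.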


A for loop then runs over the player_match_statistics data to build each row. Since player and team are nested, only specific details are fetched using get; I'm not interested in external_player_image and the like.

After everything is ready, the base list is appended to the pmsd_rows list. The rows and headers are then passed to a pandas DataFrame. The same idea was used for the team_match_statistics and play_by_play data, just with a slightly different implementation.
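A sketch of that row-building loop. The nested key names here (player, team, first_name, last_name, name) are guesses at the JSON shape; .get() keeps missing keys from raising and lets us skip fields like external_player_image entirely:

```python
def build_player_rows(player_stats: list) -> list:
    """Flatten each player record into one row.

    Flat stat values come first; the hand-picked values from the
    nested player/team dicts are appended at the end, matching the
    extended header list.
    """
    pmsd_rows = []
    for record in player_stats:
        # flat stat columns first: skip the nested dicts entirely
        row = [v for v in record.values() if not isinstance(v, dict)]
        player = record.get("player", {})
        team = record.get("team", {})
        row.append(f'{player.get("first_name", "")} {player.get("last_name", "")}'.strip())
        row.append(team.get("name", ""))
        pmsd_rows.append(row)
    return pmsd_rows
```

The resulting rows and headers then go straight into pandas.DataFrame(rows, columns=headers).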


Once the pandas DataFrames are ready, the export to Excel is set up and we're done.
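The export step can be sketched like this, with one sheet per DataFrame. The filename pattern and sheet names are my own choices, and pandas needs openpyxl installed for .xlsx output:

```python
from pathlib import Path

import pandas as pd  # third-party: pip install pandas openpyxl


def export_to_excel(frames: dict, match_id: str) -> Path:
    """Write each DataFrame to its own sheet in one workbook.

    frames maps sheet name -> DataFrame, e.g. the three datasets
    player_match_statistics, team_match_statistics, play_by_play.
    """
    out_path = Path(f"nbl_match_{match_id}.xlsx")
    with pd.ExcelWriter(out_path) as writer:
        for sheet_name, frame in frames.items():
            # Excel caps sheet names at 31 characters
            frame.to_excel(writer, sheet_name=sheet_name[:31], index=False)
    return out_path
```

Using one workbook with three sheets keeps all of a game's data in a single file named by its match ID.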

If I were more passionate about this, or the pay were decent (the project was priced at 2,500 PHP on Onlinejobs, roughly 40 USD, which I think is too low if the scope gets bigger), my next plans would be as follows:

- Add an option to export to CSV or export to a local sqlite file.
- Add a batch export mode where the user will just need to feed the script a txt file containing links and it will process those links.
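The batch mode could be as simple as this sketch, where process_link is a hypothetical stand-in for whatever single-URL scrape function the script already has:

```python
from pathlib import Path


def batch_scrape(links_file: str, process_link) -> int:
    """Feed each non-blank line of a .txt file to process_link.

    Blank lines and '#' comment lines are skipped; returns how many
    links were processed.
    """
    count = 0
    for line in Path(links_file).read_text().splitlines():
        url = line.strip()
        if not url or url.startswith("#"):
            continue
        process_link(url)
        count += 1
    return count
```

Taking the per-link function as a parameter keeps the batch loop trivially testable and independent of the scraping code.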


A quick demo of the script is available below.
The first scrape is for a link that already has saved JSON; the second is a fresh scrape.



Thanks for checking it out.



