DataSolve 1: How to combine columns from multiple Excel files using Python

 



In this new series of blog entries, I will be taking questions I see online and trying to provide answers for them. This is learning by doing, and a way of familiarizing myself with the tools and terms of the field.


I started learning Python from online tutorials before considering data analytics. The reason is that I saw this programming language as something relatively easy to pick up, with good job potential if taken seriously, and useful for automating tasks. I consider myself at a beginner-to-intermediate level since I haven't fully grasped the concept of classes.


Python is covered in the final part of the data analytics journey under Refocus.


For this initial release, I stumbled upon a question in a Facebook group, 'Data Engineering Pilipinas', and figured it was a good way to practice my Python for data-related tasks and create a blog entry here. You may need a fairly good understanding of Python to follow the items below.


From the screenshot, here's what is required:


"The objectives are as follows:

(1) Combine excel files with identical columns

(2) Handling variations in the case sensitivity of sheet names, particularly with multiple files containing sheet names such as 'DATABASE', and 'Database'."


There are no sample files, so I made a list of things that we need to accomplish here:

1. Only combine Excel files with identical columns. We are assuming that there are no tables on the sheets, only ranges.


2. Handle case sensitivity for sheet names, and maybe trim whitespace too. We will not take into consideration possible misspellings of sheet names per file.


In an actual work scenario, we may want to do a fuzzy search to catch possible misspellings of sheet names. We may also want to combine the columns as long as at least two columns match. Again, this will vary depending on the data we are working with and what the task requires.


Assumption/s:

We only care about a 'database' or 'data' sheet existing in a file, and every sheet with that name should have exactly two columns. If a sheet named 'database' or 'data' has more than two columns, we don't want it.


The code: https://gist.github.com/merkked/cb861a8e6c68244472d5dce044803cf1

Explanations are provided below for what each line of the code does.


First, we import the libraries we need and filter the files we want using glob. pandas is imported as pd (an alias) because we don't want to type the full name every time. We set our variable excel_files to the result of glob, which will be a list of files that match our wildcard.


The *.xls* pattern means we want any file with .xls in its name. The asterisks are used because there are different Excel formats and we want to catch them all for checking: .xls, .xlsx, .xlsm, etc.
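A minimal sketch of this setup, using the variable name from the explanation (the gist may differ slightly in details):

```python
import glob

# Match any Excel variant in the current folder: .xls, .xlsx, .xlsm, ...
# glob.glob returns a list of matching file names
excel_files = glob.glob("*.xls*")
```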


sheet_search is a list of sheet names that we want to look for in the Excel files.

df_list is an empty list where we will store dataframes later on.

columns is a variable set to 2 (an integer) because our assumption is that we only want 'database' sheets with 2 columns. We could have hardcoded this further down, but we want to avoid that as much as possible.
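Put together, the three variables described above might look like this (names come from the explanation; the exact values in the gist are assumptions):

```python
sheet_search = ["database", "data"]  # sheet names to look for
df_list = []                         # will hold one dataframe per matching sheet
columns = 2                          # expected number of columns in a match
```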


On line 10, we check whether a folder named 'output' already exists, and if not, we create it. Our goal is to create a new Excel file with the combined information, and since our glob looks for files based on the wildcard, we don't want the output file to be rescanned by the script. Our glob is not set to scan directories recursively.
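The folder check could be sketched like this (the folder name comes from the text; the exact check used in the gist is an assumption):

```python
import os

# Create the 'output' folder if it doesn't exist yet, so the combined
# file lands outside the reach of our *.xls* glob pattern
if not os.path.exists("output"):
    os.mkdir("output")
```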


Since excel_files is a list, we go through each item on that list on line 17. If we have Excel files named raw_1.xlsx, raw2.xlsm, and raw3_.xls, we will open them one at a time.


We have created a loop so the process for raw_1.xlsx will be repeated for the rest of the matching files.


We then create sheet_names_original, where we store all the names of the sheets in the specific Excel file we are working on. This variable is a list.


On line 19, we use a list comprehension to create a new list named sheet_names. It is identical to sheet_names_original, except all the sheet names of raw_1.xlsx are converted to lowercase and any surrounding whitespace is stripped.
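The normalization step can be sketched as follows. In the actual script, sheet_names_original would come from the workbook itself (for example, pd.ExcelFile(file).sheet_names); the sample names here are made up:

```python
# Sample sheet names as they might appear across files (hypothetical)
sheet_names_original = ["DATABASE ", "Summary", " Database"]

# Lowercase and strip whitespace so comparisons ignore case and padding
sheet_names = [name.lower().strip() for name in sheet_names_original]
# sheet_names is now ['database', 'summary', 'database']
```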


Once we have that, we make another for loop to check each item on the sheet_search list to see if it is present in sheet_names. We are checking whether the words 'database' and 'data' are present among the sheets of raw_1.xlsx.


If there's a match, that means we are on the right track.


We want to know where that match is located in the list, so we create a variable named index. Let's say we found out that 'database' is present in one of the sheets of raw_1.xlsx. We want to know its position because it's not unlikely that raw_1.xlsx has multiple sheets.


We then create a dataframe by reading the Excel file and retrieving the original sheet name using the index we created earlier. We can't use the cleaned items from sheet_names because they are in lowercase, and pandas will not be able to locate a sheet unless the name is an exact match. Simply put, 'database' is different from 'DATAbase'. The header is also set to None because we only want to combine the columns into a new file.
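The index lookup and read could look like the sketch below. The pd.read_excel call is shown as a comment because it needs a real file on disk; the sheet names are hypothetical:

```python
# Hypothetical sheet names for one workbook
sheet_names_original = ["Summary", "DATAbase"]
sheet_names = [name.lower().strip() for name in sheet_names_original]
sheet_search = ["database", "data"]

for target in sheet_search:
    if target in sheet_names:
        # Position of the match, so we can recover the original spelling
        index = sheet_names.index(target)
        original_name = sheet_names_original[index]
        # In the real script, the read would be something like:
        # raw = pd.read_excel(file, sheet_name=original_name, header=None)
        break
```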


On line 27, we check whether the 'database' or 'data' sheet we found matches our criterion of exactly 2 columns. If it does, we set a new dataframe named df with dropna() to remove rows with empty cells. We then print a message and append df to the list we created earlier.
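Here is a sketch of that check, with a small made-up dataframe standing in for a sheet read with header=None (the print message is an assumption):

```python
import pandas as pd

# Hypothetical frame standing in for a sheet read with header=None
raw = pd.DataFrame([["Name", "Score"], ["Ana", 10], [None, None]])
df_list = []
columns = 2

# shape[1] is the number of columns in the dataframe
if raw.shape[1] == columns:
    df = raw.dropna()  # drop rows containing empty cells
    print(f"Match found, keeping {len(df)} rows")
    df_list.append(df)
```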


Finally, we are ready to combine everything into a new file. We concat all of the items in df_list and save the result to a file named output.xlsx with a sheet named database. We don't want the index and header, so both are set to False.
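The final step can be sketched like this. The to_excel call is shown as a comment because writing requires an Excel engine such as openpyxl; the dataframes are hypothetical:

```python
import pandas as pd

# Two hypothetical dataframes collected from different files
df_list = [
    pd.DataFrame([["Ana", 10], ["Ben", 8]]),
    pd.DataFrame([["Cara", 9]]),
]

# Stack them vertically into one dataframe
combined = pd.concat(df_list)

# Save without index and header, as described above:
# combined.to_excel("output/output.xlsx", sheet_name="database",
#                   index=False, header=False)
```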


Done. You should have a new file based on the criteria we have set. This is not a beginner-friendly walkthrough, since I did not go through every available option and command. That's what the documentation, Google search, and AI are for.


The beauty of learning a language is customizing it to the final output you need. This is intended to be as generic as possible, because the code you will need depends on what is required of you. I am aware of things I skipped or could have done another way.


On the next edition of DataSolve, we will look into an alternative scenario where the script checks all of the files in a directory, looks for files with similar sheet names, and combines them if they have the same number of non-empty columns. This means not restricting the match to two columns only, and not restricting the search to only 'database' and 'data'.


See you!
