Automating Data Cleanup with Python Scripting


This Python script was created to resolve an issue that I encountered during my practicum at CBRE. I was assigned to geocode a fairly large dataset (430,000+ rows) of clients based on a postal code field. To make the dataset more manageable, I split it into 29 smaller Excel sheets based on the store location and client gender. During the first run of geocoding in ArcGIS Pro, only about 85% of the dataset received matches, which seemed rather low. I looked at the rows that had no matches and noticed that many of them had spaces in the postal code (ex. V5C 3R8). This was problematic as the geocoding locator I was using had a postal code format with no spaces.

Removing the spaces in the postal codes manually would take extremely long due to the size of the dataset - I decided to write the following Python script to parse a directory for csv files, loop through each row in a csv file, and remove any spaces from the postal code field. The updated csv file would then be written to the same directory as a new file with a "_NEW" suffix. Using this script allowed me to edit all 29 files, loop through all 430,000+ rows and remove any spaces in under one minute. Running the updated dataset through the geocoder again yielded significant improvements, with over 98% of the dataset matched. The script allows for use with other similar datasets as the directory path and postal code field index can be edited in the code.

								

	import os
	import csv

	# select the directory to parse
	directory = 'C:\\Users\\User\\Documents\\New_Folder'

	# set the row index of the postal code field
	fieldindex = 9

	# loop through directory looking for csv files
	for filename in os.listdir(directory):
		if filename.endswith(".csv"):

			# get file path of csv
			input = os.path.join(directory, filename)

			# set output file path
			output = directory + '\\' + filename[:-4] + '_NEW.csv'

			# set up csv reader and writer
			with open(input, 'r') as infile, open(output, 'w', newline='') as outfile:
				csvfile = csv.reader(infile, delimiter=',')
				writer = csv.writer(outfile, delimiter=',')

				# write the header to the output file
				header = next(infile)
				outfile.write(header)

				# loop through csv rows
				for row in csvfile:

					# add row values to a list
					rowvalues = []
					for value in row:
						rowvalues.append(value)

					# remove space if postal code contains space
					postalcode = row[fieldindex]
					if len(postalcode) == 7:
						postalcode = postalcode[0:3] + postalcode[4:7]

					# update the row with edited postal code
					rowvalues.pop(fieldindex)
					rowvalues.insert(fieldindex, postalcode)

					# write the row to output file
					writer.writerow(rowvalues)

								
							

Download the Python script here.