The Audyssey: a Mythic Journey in Data Cleaning

ash ryan
5 min readJul 7, 2023

--

AI generated painting featuring a paladin facing off a basilisk with Matrix code falling everywhere.

In the vast realm of public datasets, I embarked on a quest to find the perfect project that would challenge my data cleaning skills. Armed with the means to clean, I sifted through numerous sources until I stumbled upon a hidden gem — an unconventional dataset compiled from Audible audiobooks. Curiously, no one had attempted to tame this particular dataset before, and its potential intrigued me. With a quick glance, I knew that I had found my next conquest.

Garbled Glyphs and the Cache of 100 Tongues

As I delved into the dataset, I found myself facing an intricate puzzle — tens of thousands of rows of data entangled in a mojibake curse. Automatic importing methods proved futile in unraveling this mess, compelling me to take matters into my own hands. Determined, I manually imported the data into Excel, but the encoding hurdle still awaited me. Though it seemed the most obvious solution, I triumphed by overriding the encoding to UTF-8, unveiling the true contents of the dataset.

Mojibake: Corrupted text caused by decoding with an unintended character encoding.

What I discovered was a linguistic kaleidoscope spanning dozens of different languages across a staggering 80,000 rows of data. To narrow my focus and ensure success in my endeavors, I employed a Common Table Expression (CTE) to partition the rows based on language. I purged everything except the English audiobooks, bidding farewell to the excess data and the ‘Language’ column. With my feet on solid ground, I braced myself for the next challenge that lay ahead.

With RowNum as (
Select *,
Row_number() over (
Partition by Language
Order by Language
) row_num
From audible_uncleaned..audible_uncleaned$
)
DELETE
From RowNum
Where language <> 'english'


Alter Table audible_uncleaned..audible_uncleaned$
Drop Column Language

The Case of the Superfluous Accolades

As I unraveled the intricate layers of the dataset, I discovered a peculiar affliction plaguing the authors and the narrators. The ‘Author’ column suffered from an incessant repetition of “Written by:” preceding each author’s name, while the ‘Narrator’ column shared a similar fate with “Narrated by:” preceding each narrator’s name. It was as if the accreditations of these talented individuals had become entangled in a web of redundancy.

A snapshot of an Excel document showing entries from the Author and Narrator columns described before.

Armed with my SQL magic, I summoned the power of SUBSTRING() and wielded the CHARINDEX() artifact to perform a precise incision. By selecting everything up to, and including, the colon (“:”) and skillfully slicing it away, I eradicated the redundant parasite that had ensnared the authors and narrators. Finally, they could breathe freely, their accreditations no longer mired in superfluous repetition.

Update audible_uncleaned..audible_uncleaned$
Set Author = Substring(Author, Charindex(':', Author) +1, Len(Author))

Update audible_uncleaned..audible_uncleaned$
Set Narrator = Substring(Narrator, Charindex(':', Narrator) +1, Len(Narrator))

Conquering Chaos: Separating the Merged Madness

My journey continued, and I stumbled upon a column of monstrosity — the Reviews column. Star ratings and the number of ratings received were ruthlessly fused together without a shred of mercy.

A snapshot of an Excel document showing the Review column as described before.

Determined to break the binding spell that afflicted this beastly column, I brandished the powerful artifact known as CHARINDEX() once again, infused with the arcane might of SUBSTRING(). However, before I could perform the extraction, I had to create space for each half to exist independently.

Swiftly, I altered the table within which this epic quest unfolded, birthing two new columns: “# of Ratings” and “Stars”. With unwavering determination, I plunged my SQL sword, slicing the beast right down the middle. The halves fell into their newfound abodes, now able to coexist in perfect harmony, liberated from the horrors of a hodgepodge existence.

Alter Table audible_uncleaned..audible_uncleaned$
Add #ofRatings nvarchar(255);

Alter Table audible_uncleaned..audible_uncleaned$
Add Stars nvarchar(255);

Update audible_uncleaned..audible_uncleaned$
Set #ofRatings = Substring(Rating, Charindex('stars', Rating) +5, Len(Rating))

Update audible_uncleaned..audible_uncleaned$
Set Stars = Left(Rating, Charindex('stars', Rating) +4)

Fractals of Worth: the True Value Within

As I entered the final column, a sense of anticipation filled the air. A corrupted automaton guarded the ‘Price’ column, spewing out bewildering numbers that defied logic, ranging from exorbitant figures to a surplus of zeroes after each decimal. Suspecting a deceitful illusion, I realized these numbers were a result of a data typing error rather than the actual prices.

A snapshot of an Excel document showing the Price column as described before.

Armed with determination, I initially attempted to alter the column’s datatype to decimal, only to be met with a fierce counterattack — an error message mocking my efforts. Undeterred, I turned to a powerful relic in my arsenal, the legendary ISNUMERIC(), which revealed a startling truth: the presence of strings disguised as “Free” whenever an audiobook had no cost. This revelation became the key to disabling the corrupted automaton and rebalancing the column.

Error message saying: Error converting data type nvarchar to numeric.

With swift precision, I employed the transformative REPLACE() spell to convert all “free” strings into zeroes. Then, defying the automaton’s fury, I altered the column to decimal format once more. A clicking sound resonated through the air, signifying progress, but the battle wasn’t over. The prices remained inflated and burdened by excessive zeroes.

ALTER TABLE audible_uncleaned..audible_uncleaned$
ALTER COLUMN Price DECIMAL(18, 2);

SELECT *
FROM audible_uncleaned..audible_uncleaned$
WHERE ISNUMERIC(Price) = 0;

In my darkest hour, an epiphany shone brightly. I divided the ‘Price’ column into a hundred parts, retaining only a fraction and rounding it down to two decimal places. Miraculously, the corrupted automaton shattered into countless fragments, mirroring the liberation of the ‘Price’ column. With its demise, the true glimmering prices of each audiobook emerged, heralding the completion of my valiant quest. The database stood cleansed of evil inconsistencies and redundancies, ready to illuminate decisions and insights in the realm of audiobooks.

Update audible_uncleaned..audible_uncleaned$ 
Set Price = Replace(Price, 'Free', 0.00);

Update audible_uncleaned..audible_uncleaned$
Set Price = Convert(float, Round(Price/100, 2));

Audiobook Dataset from: https://www.kaggle.com/datasets/snehangsude/audible-dataset

--

--