Pandas .melt(): A Guide For The Perplexed

pd.melt() can be confusing because of its mediocre documentation and unintuitively-named parameters. It’s nevertheless important for getting dataframes into ‘tidy’ format for plotting purposes. 

First, we’ll discuss the id_vars and value_vars parameters. Pass into id_vars a list of columns that you DO NOT WANT CHANGED. These columns will remain as they are. Pass into value_vars a list of columns you want to be combined into a single column. pd.melt() takes the columns passed into value_vars and automatically generates a new column containing the names of those old columns. By default this column is called ‘variable‘. An additional new column ‘value‘ is generated containing the values from the old columns. 

Imagine you have a dataframe with columns 'asset‘, ‘US Dollar‘, ‘Euro‘, ‘Japanese Yen‘. The values correspond to the price of a cryptoasset like Bitcoin or Ethereum in a quote currency, like USD.

If we run df.melt(id_vars='asset', value_vars=['US Dollar', 'Euro', 'Japanese Yen']), we’ll get back a dataframe with an ‘asset’ column, a ‘variable‘ column containing ‘US Dollar‘, ‘Euro‘, ‘Japanese Yen‘, and a column ‘value‘ containing the value of each cryptoasset in each of the denominations. 

pd.melt() gives you some control over how the columns are renamed. By default pd.melt() will assume that any column not passed into id_vars is to be transformed. So instead of the above code, we could run pd.melt(id_vars='airline', var_name='asset', value_name='asset_price'), which will produce the same dataframe but with more appropriately named columns. 

How Do I Find Where A Maximum Value Lives in a DataFrame?

Say we’ve created the following dataframe:

df = pd.DataFrame(np.random.randint(0,100, size=(10,4)), columns=list('ABCD'))

Now we want to accomplish two things. We want to loop through all the rows and find which column contains the largest (or smallest) integer for each row, and we want to loop through the columns to find the row containing the largest (or smallest) integer.

Looping through rows to find the column where the largest integer lives:

for idx in df.index:
    print(df.loc[idx].idxmax())

The same thing for the smallest integer:

for idx in df.index:
    print(df.loc[idx].idxmin())

Looping over all the columns to find the row where the largest integer occurs:

for col in df.columns:
    print(df.loc[:, col].idxmax())

The same thing for the smallest integer:

for col in df.columns:
    print(df.loc[:, col].idxmin())