Python pipes

python
calmcodes
Let’s learn some from calmcode, thanks Vincent Warmerdam!
Author

Jakob Johannesson

Published

August 1, 2024

Pipes is a lovely feature that makes your code more readable and less verbose

Pipes make code easy to read and maintain. They also make it easier to understand the flow of data through your code. See this mermaid flow, it is just like you would think of transforming a dataframe.

flowchart LR

DataFrame --> Action1
Action1 --> Action2
Action2 --> Action3

Replace the arrow with a “then”. Do this, then that, then this and finish off with action3. As you can see, we are not writing spaghetti code anymore!

Pipes in python

Code
import pandas as pd

df=pd.read_csv("https://raw.githubusercontent.com/TheEconomist/big-mac-data/master/output-data/big-mac-raw-index.csv")


def set_dtypes(dataf):
    return (dataf
            .assign(date=lambda d: pd.to_datetime(d['date']))
            .sort_values(['currency_code', 'date']))

def remove_outliers(dataf):
    min_row_country=32
    countries = (dataf
                .groupby('currency_code')
                .agg(n=('name','count'))
                .loc[lambda d: d['n'] >= min_row_country]
                .index)
    return (dataf
            .loc[lambda d: d['currency_code'].isin(countries)])

df.pipe(set_dtypes).pipe(remove_outliers).head()
date iso_a3 currency_code name local_price dollar_ex dollar_price USD EUR GBP JPY CNY
197 2005-06-01 ARE AED UAE 9.0 3.67250 2.450647 -0.05014 -0.31606 -0.28769 0.04608 0.93167
270 2006-05-01 ARE AED UAE 9.0 3.67260 2.450580 -0.11850 -0.34945 -0.32824 0.09889 0.87318
311 2007-01-01 ARE AED UAE 10.0 3.67260 2.722867 -0.05783 -0.28519 -0.28976 0.18236 0.92440
352 2007-06-01 ARE AED UAE 10.0 3.67285 2.722681 -0.09244 -0.34659 -0.32055 0.18937 0.88229
393 2008-06-01 ARE AED UAE 10.0 3.67285 2.722681 -0.15181 -0.49039 -0.40429 0.03904 0.48854

Amazing! Now we can see that the code is much more readable and easier to understand.

What if we did it in the old way and look at them both?

Code
import pandas as pd

df=pd.read_csv("https://raw.githubusercontent.com/TheEconomist/big-mac-data/master/output-data/big-mac-raw-index.csv")


df2 = (df
  .assign(date=lambda d: pd.to_datetime(d['date']))
  .sort_values(['currency_code', 'date'])
  .groupby('currency_code')
  .agg(n=('date', 'count')))

df.loc[lambda d: d['currency_code'].isin(df2[df2['n'] >= 32].index)].head()
date iso_a3 currency_code name local_price dollar_ex dollar_price USD EUR GBP JPY CNY
0 2000-04-01 ARG ARS Argentina 2.50 1.000000 2.500000 0.11607 0.05007 -0.16722 -0.09864 1.09091
1 2000-04-01 AUS AUD Australia 2.59 1.680000 1.541667 -0.31176 -0.35246 -0.48645 -0.44416 0.28939
2 2000-04-01 BRA BRL Brazil 2.95 1.790000 1.648045 -0.26427 -0.30778 -0.45102 -0.40581 0.37836
3 2000-04-01 GBR GBP Britain 1.90 0.632911 3.002000 0.34018 0.26092 0.00000 0.08235 1.51076
4 2000-04-01 CAN CAD Canada 2.85 1.470000 1.938776 -0.13448 -0.18566 -0.35417 -0.30099 0.62152

Not so tidy, harder to read and to grasp what is going on here, compared to the pipes. Also pytest is something you point towards functions, so it aint going to work in the bad code example right above here.

Transforming data

One of the most common tasks is to aggregate, filter, select, mutate and do conditional changes to the data.

Aggregating columns

Tip

df.groupby(‘name’).sum(‘dollar_price’)

Code
df.groupby('name').sum('dollar_price')
df.groupby("name")["dollar_price"].sum()
df.groupby("name").agg(total=("dollar_price", "sum")).reset_index().head()
name total
0 Argentina 134.292819
1 Australia 160.908677
2 Azerbaijan 35.182460
3 Bahrain 51.703457
4 Brazil 170.168869

Mutating a column

Tip

df[‘mean_price_country’] = df.groupby(‘name’)[‘local_price’].transform(‘mean’)

Code
import pandas as pd

df=pd.read_csv("https://raw.githubusercontent.com/TheEconomist/big-mac-data/master/output-data/big-mac-raw-index.csv")

df=df.assign(
    is_good_price = lambda dataframe: dataframe['dollar_price'].map(lambda dollar_price: True if dollar_price <= 4 else False) 
)

df['eur_price'] = df.groupby('name')['EUR'].transform('sum')*df.groupby('name')['local_price'].transform('sum')
df[['local_price', 'date', "name"]].set_index('date').head()
local_price name
date
2000-04-01 2.50 Argentina
2000-04-01 2.59 Australia
2000-04-01 2.95 Brazil
2000-04-01 1.90 Britain
2000-04-01 2.85 Canada

Dropping a column

Code
df

df.drop("date",axis=1).head(2)
iso_a3 currency_code name local_price dollar_ex dollar_price USD EUR GBP JPY CNY is_good_price eur_price
0 ARG ARS Argentina 2.50 1.00 2.500000 0.11607 0.05007 -0.16722 -0.09864 1.09091 True -161638.542969
1 AUS AUD Australia 2.59 1.68 1.541667 -0.31176 -0.35246 -0.48645 -0.44416 0.28939 True -1049.043475
Tip

df.drop(“date”,axis=1)

Replace blank space in columns

Tip

data.columns = [column.replace(” “,”_“) for column in data.columns]

Filter some data using query

Tip

Best solution:

df.query(“name == ‘United States’”)

Note

Alternative solution:

df[df[‘name’]==‘United States’]

Code
df=pd.read_csv("https://raw.githubusercontent.com/TheEconomist/big-mac-data/master/output-data/big-mac-raw-index.csv")

df.query("name == 'United States'").head(4)
date iso_a3 currency_code name local_price dollar_ex dollar_price USD EUR GBP JPY CNY
27 2000-04-01 USA USD United States 2.24 1.0 2.24 0.0 -0.05914 -0.25383 -0.19238 0.87345
55 2001-04-01 USA USD United States 2.24 1.0 2.24 0.0 -0.00955 -0.21285 -0.05524 0.87345
87 2002-04-01 USA USD United States 2.35 1.0 2.35 0.0 -0.01107 -0.18558 0.16603 0.85314
119 2003-04-01 USA USD United States 2.46 1.0 2.46 0.0 -0.17477 -0.21761 0.12672 1.05745

Lambda function

This example is not functioning, just for a visual

Code
df.assign(
    name_uppercase = lambda dataframe: dataframe['name'].map(lambda name: name.upper()),
).assign(
    name_uppercase_double = lambda dataframe: dataframe['name_uppercase'].map(lambda name: name.upper()+"-"+name.upper())
)

Case when

Code
import pandas as pd

list = ['Ana', 'Kevin', 'Wang', 'Tao', '']
print(pd.DataFrame([{'index': idx, 'name': char} for idx,char in enumerate(list)]))
   index   name
0      0    Ana
1      1  Kevin
2      2   Wang
3      3    Tao
4      4       

What if you want to replace some values in a dataframe? You can do it using a function and then an apply fuction. I think this is pretty nice to do as you can do it row-wise. It is easy to read the code and understand the conditions while it is easy to update the conditons also.

Tip

This is also good as you can use pytest on the functions and therefore make sure that they work as expected.

Code
df=pd.read_csv("https://raw.githubusercontent.com/TheEconomist/big-mac-data/master/output-data/big-mac-raw-index.csv")

# Define the custom function
def update_price(row):
    if row['name'] == 'Argentina' and (row['local_price'] < 3 or row['local_price'] >= 10):
        return 20
    return row['local_price']

# Apply the function to the DataFrame row-wise
df['local_price'] = df.apply(update_price, axis=1)
df.head(3)
date iso_a3 currency_code name local_price dollar_ex dollar_price USD EUR GBP JPY CNY
0 2000-04-01 ARG ARS Argentina 20.00 1.00 2.500000 0.11607 0.05007 -0.16722 -0.09864 1.09091
1 2000-04-01 AUS AUD Australia 2.59 1.68 1.541667 -0.31176 -0.35246 -0.48645 -0.44416 0.28939
2 2000-04-01 BRA BRL Brazil 2.95 1.79 1.648045 -0.26427 -0.30778 -0.45102 -0.40581 0.37836

Alternative solution, not as nice

Code
import pandas as pd
import numpy as np

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'age': [42, 52, 36, 24, 73], 
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df

conditions = [
    (df["age"].lt(10)),
    (df["age"].ge(10) & df["age"].lt(20)),
    (df["age"].ge(20) & df["age"].lt(30)),
    (df["age"].ge(30) & df["age"].lt(50)),
    (df["age"].ge(50)),
]
choices = ["baby", "kid", "young", "mature", "grandpa"]

df["elderly"] = np.select(conditions, choices)

df.head(3)
name age preTestScore postTestScore elderly
0 Jason 42 4 25 mature
1 Molly 52 24 94 grandpa
2 Tina 36 31 57 mature