Python pipes

python

calmcodes

Let’s learn some from calmcode, thanks Vincent Warmerdam!

Author

Jakob Johannesson

Published

August 1, 2024

Pipes is a lovely feature that makes your code more readable and less verbose

Pipes make code easy to read and maintain. They also make it easier to understand the flow of data through your code. See this mermaid flow, it is just like you would think of transforming a dataframe.

flowchart LR

DataFrame --> Action1
Action1 --> Action2
Action2 --> Action3

Replace the arrow with a “then”. Do this, then that, then this and finish off with action3. As you can see, we are not writing spaghetti code anymore!

Pipes in python

Code

import pandas as pd

df=pd.read_csv("https://raw.githubusercontent.com/TheEconomist/big-mac-data/master/output-data/big-mac-raw-index.csv")


def set_dtypes(dataf):
    return (dataf
            .assign(date=lambda d: pd.to_datetime(d['date']))
            .sort_values(['currency_code', 'date']))

def remove_outliers(dataf):
    min_row_country=32
    countries = (dataf
                .groupby('currency_code')
                .agg(n=('name','count'))
                .loc[lambda d: d['n'] >= min_row_country]
                .index)
    return (dataf
            .loc[lambda d: d['currency_code'].isin(countries)])

df.pipe(set_dtypes).pipe(remove_outliers).head()

	date	iso_a3	currency_code	name	local_price	dollar_ex	dollar_price	USD	EUR	GBP	JPY	CNY
197	2005-06-01	ARE	AED	UAE	9.0	3.67250	2.450647	-0.05014	-0.31606	-0.28769	0.04608	0.93167
270	2006-05-01	ARE	AED	UAE	9.0	3.67260	2.450580	-0.11850	-0.34945	-0.32824	0.09889	0.87318
311	2007-01-01	ARE	AED	UAE	10.0	3.67260	2.722867	-0.05783	-0.28519	-0.28976	0.18236	0.92440
352	2007-06-01	ARE	AED	UAE	10.0	3.67285	2.722681	-0.09244	-0.34659	-0.32055	0.18937	0.88229
393	2008-06-01	ARE	AED	UAE	10.0	3.67285	2.722681	-0.15181	-0.49039	-0.40429	0.03904	0.48854

Amazing! Now we can see that the code is much more readable and easier to understand.

What if we did it in the old way and look at them both?

Code

import pandas as pd

df=pd.read_csv("https://raw.githubusercontent.com/TheEconomist/big-mac-data/master/output-data/big-mac-raw-index.csv")


df2 = (df
  .assign(date=lambda d: pd.to_datetime(d['date']))
  .sort_values(['currency_code', 'date'])
  .groupby('currency_code')
  .agg(n=('date', 'count')))

df.loc[lambda d: d['currency_code'].isin(df2[df2['n'] >= 32].index)].head()

	date	iso_a3	currency_code	name	local_price	dollar_ex	dollar_price	USD	EUR	GBP	JPY	CNY
0	2000-04-01	ARG	ARS	Argentina	2.50	1.000000	2.500000	0.11607	0.05007	-0.16722	-0.09864	1.09091
1	2000-04-01	AUS	AUD	Australia	2.59	1.680000	1.541667	-0.31176	-0.35246	-0.48645	-0.44416	0.28939
2	2000-04-01	BRA	BRL	Brazil	2.95	1.790000	1.648045	-0.26427	-0.30778	-0.45102	-0.40581	0.37836
3	2000-04-01	GBR	GBP	Britain	1.90	0.632911	3.002000	0.34018	0.26092	0.00000	0.08235	1.51076
4	2000-04-01	CAN	CAD	Canada	2.85	1.470000	1.938776	-0.13448	-0.18566	-0.35417	-0.30099	0.62152

Not so tidy, harder to read and to grasp what is going on here, compared to the pipes. Also pytest is something you point towards functions, so it aint going to work in the bad code example right above here.

Transforming data

One of the most common tasks is to aggregate, filter, select, mutate and do conditional changes to the data.

Aggregating columns

Tip

df.groupby(‘name’).sum(‘dollar_price’)

Code

df.groupby('name').sum('dollar_price')
df.groupby("name")["dollar_price"].sum()
df.groupby("name").agg(total=("dollar_price", "sum")).reset_index().head()

	name	total
0	Argentina	134.292819
1	Australia	160.908677
2	Azerbaijan	35.182460
3	Bahrain	51.703457
4	Brazil	170.168869

Mutating a column

Tip

df[‘mean_price_country’] = df.groupby(‘name’)[‘local_price’].transform(‘mean’)

Code

import pandas as pd

df=pd.read_csv("https://raw.githubusercontent.com/TheEconomist/big-mac-data/master/output-data/big-mac-raw-index.csv")

df=df.assign(
    is_good_price = lambda dataframe: dataframe['dollar_price'].map(lambda dollar_price: True if dollar_price <= 4 else False) 
)

df['eur_price'] = df.groupby('name')['EUR'].transform('sum')*df.groupby('name')['local_price'].transform('sum')
df[['local_price', 'date', "name"]].set_index('date').head()

	local_price	name
date
2000-04-01	2.50	Argentina
2000-04-01	2.59	Australia
2000-04-01	2.95	Brazil
2000-04-01	1.90	Britain
2000-04-01	2.85	Canada

Dropping a column

Code

df

df.drop("date",axis=1).head(2)

	iso_a3	currency_code	name	local_price	dollar_ex	dollar_price	USD	EUR	GBP	JPY	CNY	is_good_price	eur_price
0	ARG	ARS	Argentina	2.50	1.00	2.500000	0.11607	0.05007	-0.16722	-0.09864	1.09091	True	-161638.542969
1	AUS	AUD	Australia	2.59	1.68	1.541667	-0.31176	-0.35246	-0.48645	-0.44416	0.28939	True	-1049.043475

Tip

df.drop(“date”,axis=1)

Replace blank space in columns

Tip

data.columns = [column.replace(” “,”_“) for column in data.columns]

Filter some data using query

Tip

Best solution:

df.query(“name == ‘United States’”)

Note

Alternative solution:

df[df[‘name’]==‘United States’]

Code

df=pd.read_csv("https://raw.githubusercontent.com/TheEconomist/big-mac-data/master/output-data/big-mac-raw-index.csv")

df.query("name == 'United States'").head(4)

	date	iso_a3	currency_code	name	local_price	dollar_ex	dollar_price	EUR	GBP	JPY	CNY
27	2000-04-01	USA	USD	United States	2.24	1.0	2.24	-0.05914	-0.25383	-0.19238	0.87345
55	2001-04-01	USA	USD	United States	2.24	1.0	2.24	-0.00955	-0.21285	-0.05524	0.87345
87	2002-04-01	USA	USD	United States	2.35	1.0	2.35	-0.01107	-0.18558	0.16603	0.85314
119	2003-04-01	USA	USD	United States	2.46	1.0	2.46	-0.17477	-0.21761	0.12672	1.05745

Lambda function

This example is not functioning, just for a visual

Code

df.assign(
    name_uppercase = lambda dataframe: dataframe['name'].map(lambda name: name.upper()),
).assign(
    name_uppercase_double = lambda dataframe: dataframe['name_uppercase'].map(lambda name: name.upper()+"-"+name.upper())
)

Case when

Code

import pandas as pd

list = ['Ana', 'Kevin', 'Wang', 'Tao', '']
print(pd.DataFrame([{'index': idx, 'name': char} for idx,char in enumerate(list)]))

   index   name
0      0    Ana
1      1  Kevin
2      2   Wang
3      3    Tao
4      4

What if you want to replace some values in a dataframe? You can do it using a function and then an apply fuction. I think this is pretty nice to do as you can do it row-wise. It is easy to read the code and understand the conditions while it is easy to update the conditons also.

Tip

This is also good as you can use pytest on the functions and therefore make sure that they work as expected.

Code

df=pd.read_csv("https://raw.githubusercontent.com/TheEconomist/big-mac-data/master/output-data/big-mac-raw-index.csv")

# Define the custom function
def update_price(row):
    if row['name'] == 'Argentina' and (row['local_price'] < 3 or row['local_price'] >= 10):
        return 20
    return row['local_price']

# Apply the function to the DataFrame row-wise
df['local_price'] = df.apply(update_price, axis=1)
df.head(3)

	date	iso_a3	currency_code	name	local_price	dollar_ex	dollar_price	USD	EUR	GBP	JPY	CNY
0	2000-04-01	ARG	ARS	Argentina	20.00	1.00	2.500000	0.11607	0.05007	-0.16722	-0.09864	1.09091
1	2000-04-01	AUS	AUD	Australia	2.59	1.68	1.541667	-0.31176	-0.35246	-0.48645	-0.44416	0.28939
2	2000-04-01	BRA	BRL	Brazil	2.95	1.79	1.648045	-0.26427	-0.30778	-0.45102	-0.40581	0.37836

Alternative solution, not as nice

Code

import pandas as pd
import numpy as np

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'age': [42, 52, 36, 24, 73], 
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df

conditions = [
    (df["age"].lt(10)),
    (df["age"].ge(10) & df["age"].lt(20)),
    (df["age"].ge(20) & df["age"].lt(30)),
    (df["age"].ge(30) & df["age"].lt(50)),
    (df["age"].ge(50)),
]
choices = ["baby", "kid", "young", "mature", "grandpa"]

df["elderly"] = np.select(conditions, choices)

df.head(3)

	name	age	preTestScore	postTestScore	elderly
0	Jason	42	4	25	mature
1	Molly	52	24	94	grandpa
2	Tina	36	31	57	mature