Easy data reconciliation in Python

Easy data reconciliation in Python

In this post, I would like to share all the common methods in python for data reconciliation. Consider scenarios like below: 1. Finding symmetric differences, 2. Combining data with overlaps. All the code below are for illustration purposes only, but the fundamentals can be scaled to more complex datasets.

import pandas as pd

import numpy as np

# turn jupyter notebook warnings off

import warnings

warnings.filterwarnings("ignore")

Working example one: Two dataframes with reconciliation column key containing non-duplicated elements

df1 = pd.DataFrame({'tradeid':range(5),'profit':range(1000,2000,200)})

df2 = pd.DataFrame({'tradeid':range(2,7,1), 'stock':   ['APL','MST','JNJ','TSL','BAB']})

display(df1)

display(df2)
No alt text provided for this image

Solution one: Since ‘tradeid’ is always unique, drop_duplicates() is the best way to spot the “exclusive-or” differences between two dataframes

pd.concat([df1,df2],axis=0).drop_duplicates(subset='tradeid',keep=False)
No alt text provided for this image

Solution two: Since ‘tradeid’ is always unique, we can also use a set function to find our differences

diff = set(df1.tradeid).symmetric_difference(set(df2.tradeid)) 

Then use .isin() method to select the rows from a list. Also python set is not searchable by hash values; thus converting to list

pd.concat([df1.loc[df1['tradeid'].isin(list(diff)),:],df2.loc[df2['tradeid'].isin(list(diff)),:]])
No alt text provided for this image

Solution three: Similar to solution two, this time we are finding symmetric differences using np.array

diff = np.setxor1d(np.array(df1.tradeid), np.array(df2.tradeid)) 

pd.concat([df1.loc[df1['tradeid'].isin(list(diff)),:],df2.loc[df2['tradeid'].isin(list(diff)),:]])
No alt text provided for this image


Working example two: Two dataframes with reconciliation column key containing duplicated elements

df1 = pd.DataFrame({'tradeid':[0,2,2,3,4],'profit':range(1000,2000,200)})

df2 = pd.DataFrame({'tradeid':[2,3,4,5,5], 'stock':['APL','MST','JNJ','TSL','BAB']})

display(df1)

display(df2)
No alt text provided for this image

Solution: Use the pandas merge function with the indicator flag

df3 = df1.merge(df2, on='tradeid',how='outer',indicator=True)

display(df3)

df3 = df3.loc[df3["_merge"] != 'both', :]

# cleanup to make df3 our final result

del df3["_merge"]

df3
No alt text provided for this image


Working example three: Finding symmetric differences for 2 lists containing duplicates

list1 = [0,1,1,2,3]

list2 = [0,1,3,4,4] 

Solution is embedded using functional programming

list3 = list(filter(lambda x:x not in list2, list1))

list4 = list(filter(lambda x:x not in list1, list2))

list3 + list4 

Out: [2, 4, 4]

p.s. list comprehension does not work here somehow [list1.remove(x) for x in list1 if x in list2]


Working example four: Combining data with overlaps

d1 = pd.DataFrame({'tradeid':[0,1,2],'profit':[1000,np.nan,2000],'stock':['APL','JNJ',np.nan]})

d2 = pd.DataFrame({'tradeid':[3,1,2],'profit':[1000,1500,np.nan],'stock':['APL',np.nan,np.nan]})

display(d1)

display(d2) 

Subsequent dataset will combine and overwrite the original

oned2.combine_first(d1)
No alt text provided for this image


To view or add a comment, sign in

More articles by Changyang Xu

  • 什么样的V才能抄底?

    “V”是指某种价格曲线,或者策略遇到了最大回撤。无论如何,在最大回撤处必须承认是发生了某种风险事件。如果拍脑袋交易,很容易两边打脸或者一条路走到黑。我的建议是,必须深刻知道V型背后的原理。这样就算做错短期被质疑,也有信念能拿到证明自己是对的…

  • 投资一定要有view

    过去的文章大多涉及alpha,这篇主要讲beta的交易一定要有方向和view。因为本质上很多东西是完全不兼容的。大盘急跌时,有让出筹码止损的人,也有冒着更大风险加杠杆抄底的人。基本面和量价模型在某些层面也不兼容。基本面最好的分组几乎和市值最…

  • 量化基金经理六点看未来AI发展

    以下观点有些借鉴腾讯千问智谱的2026圆桌会议 - https://www.xiaoyuzhoufm.

  • 国家队对市场的调控

    基金的每日份额是我从上交所网站里用爬虫整理的 (https://www.sse.

  • 股市是不是过热了?

    先不说寒武纪估值约等于英伟达的2倍 先不说智谱IPO流动性差,最高点能换要一个百度 先不说商业航天讲着上千倍市盈率的理想故事 先不说把存储算力的短期缺库存,周期概念换到成长涨价概念,来拔估值的…

  • 风险的认知

    人往往对任意一个时点的的信息格外看重,比如2025年对中证1000的超额是多少。要知道,当下的股价只是成千上万种可能性其中的一个,看过去或者未来都没有意义。…

  • 接下去的进步要靠无知

    AI时代最不缺的就是信息获取。本来人从1岁活到99岁的认知曲线被大幅提前。“知道太多”的副作用是让人变得更加愤世嫉俗;或者因能更精确推演失败,而止步不前。过去的50年里,无数成功创业者靠从头学起,再加上独创的见解,力压当时的行业巨头。但是当…

  • A股宽基的估值到底高不高?

    从时间序列上,A股从20年前的稀缺资源超高估值,到10年前的体系去杠杆高增长,再到当下突破了估值最高点,其实没有可比性。当下的底部代表有医药,消费,金融;顶部代表有电子,有色,军工,通信。最高点往往意味着时代正在发生某种范式转移,但承受或者…

  • 把自己长期留在牌桌上

    之前的文章资管 vs 自营聊到人和人之差只在于留在牌桌上时间的长短。但这件事说起来容易,做起来难。 “科学家也是人,而且一点儿都不缺乏人性。当数据和欲望相冲突的时候,后者往往会胜出。”——布莱恩基廷(宇宙学家)…

  • 严格基于统计做重大决策

    上一篇我们讨论过资管 vs 自营。主观上赌对市场Smart…

Others also viewed

Explore content categories