pandas provides various methods for combining and comparing Series orDataFrame.
concat(): Merge multiple Series or DataFrame objects along a shared index or column
DataFrame.join(): Merge multiple DataFrame objects along the columns
DataFrame.combine_first(): Update missing values with non-missing values in the same location
merge(): Combine two Series or DataFrame objects with SQL-style joining
merge_ordered(): Combine two Series or DataFrame objects along an ordered axis
merge_asof(): Combine two Series or DataFrame objects by near instead of exact matching keys
Series.compare() and DataFrame.compare(): Show differences in values between two Series or DataFrame objects
concat()#
The concat() function concatenates an arbitrary amount ofSeries or DataFrame objects along an axis whileperforming optional set logic (union or intersection) of the indexes onthe other axes. Like numpy.concatenate
, concat()takes a list or dict of homogeneously-typed objects and concatenates them.
In [1]: df1 = pd.DataFrame( ...: { ...: "A": ["A0", "A1", "A2", "A3"], ...: "B": ["B0", "B1", "B2", "B3"], ...: "C": ["C0", "C1", "C2", "C3"], ...: "D": ["D0", "D1", "D2", "D3"], ...: }, ...: index=[0, 1, 2, 3], ...: ) ...: In [2]: df2 = pd.DataFrame( ...: { ...: "A": ["A4", "A5", "A6", "A7"], ...: "B": ["B4", "B5", "B6", "B7"], ...: "C": ["C4", "C5", "C6", "C7"], ...: "D": ["D4", "D5", "D6", "D7"], ...: }, ...: index=[4, 5, 6, 7], ...: ) ...: In [3]: df3 = pd.DataFrame( ...: { ...: "A": ["A8", "A9", "A10", "A11"], ...: "B": ["B8", "B9", "B10", "B11"], ...: "C": ["C8", "C9", "C10", "C11"], ...: "D": ["D8", "D9", "D10", "D11"], ...: }, ...: index=[8, 9, 10, 11], ...: ) ...: In [4]: frames = [df1, df2, df3]In [5]: result = pd.concat(frames)In [6]: resultOut[6]: A B C D0 A0 B0 C0 D01 A1 B1 C1 D12 A2 B2 C2 D23 A3 B3 C3 D34 A4 B4 C4 D45 A5 B5 C5 D56 A6 B6 C6 D67 A7 B7 C7 D78 A8 B8 C8 D89 A9 B9 C9 D910 A10 B10 C10 D1011 A11 B11 C11 D11
Note
concat() makes a full copy of the data, and iterativelyreusing concat() can create unnecessary copies. Collect allDataFrame or Series objects in a list before usingconcat().
frames = [process_your_file(f) for f in files]result = pd.concat(frames)
Note
When concatenating DataFrame with named axes, pandas will attempt to preservethese index/column names whenever possible. In the case where all inputs share acommon name, this name will be assigned to the result. When the input names donot all agree, the result will be unnamed. The same is true for MultiIndex,but the logic is applied separately on a level-by-level basis.
Joining logic of the resulting axis#
The join
keyword specifies how to handle axis values that don’t exist in the firstDataFrame.
join='outer'
takes the union of all axis values
In [7]: df4 = pd.DataFrame( ...: { ...: "B": ["B2", "B3", "B6", "B7"], ...: "D": ["D2", "D3", "D6", "D7"], ...: "F": ["F2", "F3", "F6", "F7"], ...: }, ...: index=[2, 3, 6, 7], ...: ) ...: In [8]: result = pd.concat([df1, df4], axis=1)In [9]: resultOut[9]: A B C D B D F0 A0 B0 C0 D0 NaN NaN NaN1 A1 B1 C1 D1 NaN NaN NaN2 A2 B2 C2 D2 B2 D2 F23 A3 B3 C3 D3 B3 D3 F36 NaN NaN NaN NaN B6 D6 F67 NaN NaN NaN NaN B7 D7 F7
join='inner'
takes the intersection of the axis values
In [10]: result = pd.concat([df1, df4], axis=1, join="inner")In [11]: resultOut[11]: A B C D B D F2 A2 B2 C2 D2 B2 D2 F23 A3 B3 C3 D3 B3 D3 F3
To perform an effective “left” join using the exact index from the originalDataFrame, result can be reindexed.
In [12]: result = pd.concat([df1, df4], axis=1).reindex(df1.index)In [13]: resultOut[13]: A B C D B D F0 A0 B0 C0 D0 NaN NaN NaN1 A1 B1 C1 D1 NaN NaN NaN2 A2 B2 C2 D2 B2 D2 F23 A3 B3 C3 D3 B3 D3 F3
Ignoring indexes on the concatenation axis#
For DataFrame objects which don’t have a meaningful index, the ignore_index
ignores overlapping indexes.
In [14]: result = pd.concat([df1, df4], ignore_index=True, sort=False)In [15]: resultOut[15]: A B C D F0 A0 B0 C0 D0 NaN1 A1 B1 C1 D1 NaN2 A2 B2 C2 D2 NaN3 A3 B3 C3 D3 NaN4 NaN B2 NaN D2 F25 NaN B3 NaN D3 F36 NaN B6 NaN D6 F67 NaN B7 NaN D7 F7
Concatenating Series and DataFrame together#
You can concatenate a mix of Series and DataFrame objects. TheSeries will be transformed to DataFrame with the column name asthe name of the Series.
In [16]: s1 = pd.Series(["X0", "X1", "X2", "X3"], name="X")In [17]: result = pd.concat([df1, s1], axis=1)In [18]: resultOut[18]: A B C D X0 A0 B0 C0 D0 X01 A1 B1 C1 D1 X12 A2 B2 C2 D2 X23 A3 B3 C3 D3 X3
Unnamed Series will be numbered consecutively.
In [19]: s2 = pd.Series(["_0", "_1", "_2", "_3"])In [20]: result = pd.concat([df1, s2, s2, s2], axis=1)In [21]: resultOut[21]: A B C D 0 1 20 A0 B0 C0 D0 _0 _0 _01 A1 B1 C1 D1 _1 _1 _12 A2 B2 C2 D2 _2 _2 _23 A3 B3 C3 D3 _3 _3 _3
ignore_index=True
will drop all name references.
In [22]: result = pd.concat([df1, s1], axis=1, ignore_index=True)In [23]: resultOut[23]: 0 1 2 3 40 A0 B0 C0 D0 X01 A1 B1 C1 D1 X12 A2 B2 C2 D2 X23 A3 B3 C3 D3 X3
Resulting keys
#
The keys
argument adds another axis level to the resulting index or column (creatinga MultiIndex) associate specific keys with each original DataFrame.
In [24]: result = pd.concat(frames, keys=["x", "y", "z"])In [25]: resultOut[25]: A B C Dx 0 A0 B0 C0 D0 1 A1 B1 C1 D1 2 A2 B2 C2 D2 3 A3 B3 C3 D3y 4 A4 B4 C4 D4 5 A5 B5 C5 D5 6 A6 B6 C6 D6 7 A7 B7 C7 D7z 8 A8 B8 C8 D8 9 A9 B9 C9 D9 10 A10 B10 C10 D10 11 A11 B11 C11 D11In [26]: result.loc["y"]Out[26]: A B C D4 A4 B4 C4 D45 A5 B5 C5 D56 A6 B6 C6 D67 A7 B7 C7 D7
The keys
argument cane override the column nameswhen creating a new DataFrame based on existing Series.
In [27]: s3 = pd.Series([0, 1, 2, 3], name="foo")In [28]: s4 = pd.Series([0, 1, 2, 3])In [29]: s5 = pd.Series([0, 1, 4, 5])In [30]: pd.concat([s3, s4, s5], axis=1)Out[30]: foo 0 10 0 0 01 1 1 12 2 2 43 3 3 5In [31]: pd.concat([s3, s4, s5], axis=1, keys=["red", "blue", "yellow"])Out[31]: red blue yellow0 0 0 01 1 1 12 2 2 43 3 3 5
You can also pass a dict to concat() in which case the dict keys will be usedfor the keys
argument unless other keys
argument is specified:
In [32]: pieces = {"x": df1, "y": df2, "z": df3}In [33]: result = pd.concat(pieces)In [34]: resultOut[34]: A B C Dx 0 A0 B0 C0 D0 1 A1 B1 C1 D1 2 A2 B2 C2 D2 3 A3 B3 C3 D3y 4 A4 B4 C4 D4 5 A5 B5 C5 D5 6 A6 B6 C6 D6 7 A7 B7 C7 D7z 8 A8 B8 C8 D8 9 A9 B9 C9 D9 10 A10 B10 C10 D10 11 A11 B11 C11 D11
In [35]: result = pd.concat(pieces, keys=["z", "y"])In [36]: resultOut[36]: A B C Dz 8 A8 B8 C8 D8 9 A9 B9 C9 D9 10 A10 B10 C10 D10 11 A11 B11 C11 D11y 4 A4 B4 C4 D4 5 A5 B5 C5 D5 6 A6 B6 C6 D6 7 A7 B7 C7 D7
The MultiIndex created has levels that are constructed from the passed keys andthe index of the DataFrame pieces:
In [37]: result.index.levelsOut[37]: FrozenList([['z', 'y'], [4, 5, 6, 7, 8, 9, 10, 11]])
levels
argument allows specifying resulting levels associated with the keys
In [38]: result = pd.concat( ....: pieces, keys=["x", "y", "z"], levels=[["z", "y", "x", "w"]], names=["group_key"] ....: ) ....: In [39]: resultOut[39]: A B C Dgroup_key x 0 A0 B0 C0 D0 1 A1 B1 C1 D1 2 A2 B2 C2 D2 3 A3 B3 C3 D3y 4 A4 B4 C4 D4 5 A5 B5 C5 D5 6 A6 B6 C6 D6 7 A7 B7 C7 D7z 8 A8 B8 C8 D8 9 A9 B9 C9 D9 10 A10 B10 C10 D10 11 A11 B11 C11 D11
In [40]: result.index.levelsOut[40]: FrozenList([['z', 'y', 'x', 'w'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
Appending rows to a DataFrame#
If you have a Series that you want to append as a single row to a DataFrame, you can convert the row into aDataFrame and use concat()
In [41]: s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"])In [42]: result = pd.concat([df1, s2.to_frame().T], ignore_index=True)In [43]: resultOut[43]: A B C D0 A0 B0 C0 D01 A1 B1 C1 D12 A2 B2 C2 D23 A3 B3 C3 D34 X0 X1 X2 X3
merge()#
merge() performs join operations similar to relational databases like SQL.Users who are familiar with SQL but new to pandas can reference acomparison with SQL.
Merge types#
merge() implements common SQL style joining operations.
one-to-one: joining two DataFrame objects ontheir indexes which must contain unique values.
many-to-one: joining a unique index to one ormore columns in a different DataFrame.
many-to-many : joining columns on columns.
Note
When joining columns on columns, potentially a many-to-many join, anyindexes on the passed DataFrame objects will be discarded.
For a many-to-many join, if a key combination appearsmore than once in both tables, the DataFrame will have the Cartesianproduct of the associated data.
In [44]: left = pd.DataFrame( ....: { ....: "key": ["K0", "K1", "K2", "K3"], ....: "A": ["A0", "A1", "A2", "A3"], ....: "B": ["B0", "B1", "B2", "B3"], ....: } ....: ) ....: In [45]: right = pd.DataFrame( ....: { ....: "key": ["K0", "K1", "K2", "K3"], ....: "C": ["C0", "C1", "C2", "C3"], ....: "D": ["D0", "D1", "D2", "D3"], ....: } ....: ) ....: In [46]: result = pd.merge(left, right, on="key")In [47]: resultOut[47]: key A B C D0 K0 A0 B0 C0 D01 K1 A1 B1 C1 D12 K2 A2 B2 C2 D23 K3 A3 B3 C3 D3
The how
argument to merge() specifies which keys areincluded in the resulting table. If a key combination does not appear ineither the left or right tables, the values in the joined table will beNA
. Here is a summary of the how
options and their SQL equivalent names:
Merge method | SQL Join Name | Description |
---|---|---|
|
| Use keys from left frame only |
|
| Use keys from right frame only |
|
| Use union of keys from both frames |
|
| Use intersection of keys from both frames |
|
| Create the cartesian product of rows of both frames |
In [48]: left = pd.DataFrame( ....: { ....: "key1": ["K0", "K0", "K1", "K2"], ....: "key2": ["K0", "K1", "K0", "K1"], ....: "A": ["A0", "A1", "A2", "A3"], ....: "B": ["B0", "B1", "B2", "B3"], ....: } ....: ) ....: In [49]: right = pd.DataFrame( ....: { ....: "key1": ["K0", "K1", "K1", "K2"], ....: "key2": ["K0", "K0", "K0", "K0"], ....: "C": ["C0", "C1", "C2", "C3"], ....: "D": ["D0", "D1", "D2", "D3"], ....: } ....: ) ....: In [50]: result = pd.merge(left, right, how="left", on=["key1", "key2"])In [51]: resultOut[51]: key1 key2 A B C D0 K0 K0 A0 B0 C0 D01 K0 K1 A1 B1 NaN NaN2 K1 K0 A2 B2 C1 D13 K1 K0 A2 B2 C2 D24 K2 K1 A3 B3 NaN NaN
In [52]: result = pd.merge(left, right, how="right", on=["key1", "key2"])In [53]: resultOut[53]: key1 key2 A B C D0 K0 K0 A0 B0 C0 D01 K1 K0 A2 B2 C1 D12 K1 K0 A2 B2 C2 D23 K2 K0 NaN NaN C3 D3
In [54]: result = pd.merge(left, right, how="outer", on=["key1", "key2"])In [55]: resultOut[55]: key1 key2 A B C D0 K0 K0 A0 B0 C0 D01 K0 K1 A1 B1 NaN NaN2 K1 K0 A2 B2 C1 D13 K1 K0 A2 B2 C2 D24 K2 K0 NaN NaN C3 D35 K2 K1 A3 B3 NaN NaN
In [56]: result = pd.merge(left, right, how="inner", on=["key1", "key2"])In [57]: resultOut[57]: key1 key2 A B C D0 K0 K0 A0 B0 C0 D01 K1 K0 A2 B2 C1 D12 K1 K0 A2 B2 C2 D2
In [58]: result = pd.merge(left, right, how="cross")In [59]: resultOut[59]: key1_x key2_x A B key1_y key2_y C D0 K0 K0 A0 B0 K0 K0 C0 D01 K0 K0 A0 B0 K1 K0 C1 D12 K0 K0 A0 B0 K1 K0 C2 D23 K0 K0 A0 B0 K2 K0 C3 D34 K0 K1 A1 B1 K0 K0 C0 D0.. ... ... .. .. ... ... .. ..11 K1 K0 A2 B2 K2 K0 C3 D312 K2 K1 A3 B3 K0 K0 C0 D013 K2 K1 A3 B3 K1 K0 C1 D114 K2 K1 A3 B3 K1 K0 C2 D215 K2 K1 A3 B3 K2 K0 C3 D3[16 rows x 8 columns]
You can Series and a DataFrame with a MultiIndex if the names ofthe MultiIndex correspond to the columns from the DataFrame. Transformthe Series to a DataFrame using Series.reset_index() before merging
In [60]: df = pd.DataFrame({"Let": ["A", "B", "C"], "Num": [1, 2, 3]})In [61]: dfOut[61]: Let Num0 A 11 B 22 C 3In [62]: ser = pd.Series( ....: ["a", "b", "c", "d", "e", "f"], ....: index=pd.MultiIndex.from_arrays( ....: [["A", "B", "C"] * 2, [1, 2, 3, 4, 5, 6]], names=["Let", "Num"] ....: ), ....: ) ....: In [63]: serOut[63]: Let NumA 1 aB 2 bC 3 cA 4 dB 5 eC 6 fdtype: objectIn [64]: pd.merge(df, ser.reset_index(), on=["Let", "Num"])Out[64]: Let Num 00 A 1 a1 B 2 b2 C 3 c
Performing an outer join with duplicate join keys in DataFrame
In [65]: left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})In [66]: right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})In [67]: result = pd.merge(left, right, on="B", how="outer")In [68]: resultOut[68]: A_x B A_y0 1 2 41 1 2 52 1 2 63 2 2 44 2 2 55 2 2 6
Warning
Merging on duplicate keys significantly increase the dimensions of the resultand can cause a memory overflow.
Merge key uniqueness#
The validate
argument checks whether the uniqueness of merge keys.Key uniqueness is checked before merge operations and can protect against memory overflowsand unexpected key duplication.
In [69]: left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})In [70]: right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})In [71]: result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")---------------------------------------------------------------------------MergeError Traceback (most recent call last)Cell In[71], line 1----> 1 result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")File ~/work/pandas/pandas/pandas/core/reshape/merge.py:170, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate) 155 return _cross_merge( 156 left_df, 157 right_df, (...) 167 copy=copy, 168 ) 169 else:--> 170 op = _MergeOperation( 171 left_df, 172 right_df, 173 how=how, 174 on=on, 175 left_on=left_on, 176 right_on=right_on, 177 left_index=left_index, 178 right_index=right_index, 179 sort=sort, 180 suffixes=suffixes, 181 indicator=indicator, 182 validate=validate, 183 ) 184 return op.get_result(copy=copy)File ~/work/pandas/pandas/pandas/core/reshape/merge.py:813, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, indicator, validate) 809 # If argument passed to validate, 810 # check if columns specified as unique 811 # are in fact unique. 812 if validate is not None:--> 813 self._validate_validate_kwd(validate)File ~/work/pandas/pandas/pandas/core/reshape/merge.py:1657, in _MergeOperation._validate_validate_kwd(self, validate) 1653 raise MergeError( 1654 "Merge keys are not unique in left dataset; not a one-to-one merge" 1655 ) 1656 if not right_unique:-> 1657 raise MergeError( 1658 "Merge keys are not unique in right dataset; not a one-to-one merge" 1659 ) 1661 elif validate in ["one_to_many", "1:m"]: 1662 if not left_unique:MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
If the user is aware of the duplicates in the right DataFrame but wants toensure there are no duplicates in the left DataFrame, one can use thevalidate='one_to_many'
argument instead, which will not raise an exception.
In [72]: pd.merge(left, right, on="B", how="outer", validate="one_to_many")Out[72]: A_x B A_y0 1 1 NaN1 2 2 4.02 2 2 5.03 2 2 6.0
Merge result indicator#
merge() accepts the argument indicator
. If True
, aCategorical-type column called _merge
will be added to the output objectthat takes on values:
Observation Origin
_merge
valueMerge key only in
'left'
frame
left_only
Merge key only in
'right'
frame
right_only
Merge key in both frames
both
In [73]: df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})In [74]: df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})In [75]: pd.merge(df1, df2, on="col1", how="outer", indicator=True)Out[75]: col1 col_left col_right _merge0 0 a NaN left_only1 1 b 2.0 both2 2 NaN 2.0 right_only3 2 NaN 2.0 right_only
A string argument to indicator
will use the value as the name for the indicator column.
In [76]: pd.merge(df1, df2, on="col1", how="outer", indicator="indicator_column")Out[76]: col1 col_left col_right indicator_column0 0 a NaN left_only1 1 b 2.0 both2 2 NaN 2.0 right_only3 2 NaN 2.0 right_only
Overlapping value columns#
The merge suffixes
argument takes a tuple of list of strings to append tooverlapping column names in the input DataFrame to disambiguate the resultcolumns:
In [77]: left = pd.DataFrame({"k": ["K0", "K1", "K2"], "v": [1, 2, 3]})In [78]: right = pd.DataFrame({"k": ["K0", "K0", "K3"], "v": [4, 5, 6]})In [79]: result = pd.merge(left, right, on="k")In [80]: resultOut[80]: k v_x v_y0 K0 1 41 K0 1 5
In [81]: result = pd.merge(left, right, on="k", suffixes=("_l", "_r"))In [82]: resultOut[82]: k v_l v_r0 K0 1 41 K0 1 5
DataFrame.join()#
DataFrame.join() combines the columns of multiple,potentially differently-indexed DataFrame into a single resultDataFrame.
In [83]: left = pd.DataFrame( ....: {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"] ....: ) ....: In [84]: right = pd.DataFrame( ....: {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"] ....: ) ....: In [85]: result = left.join(right)In [86]: resultOut[86]: A B C DK0 A0 B0 C0 D0K1 A1 B1 NaN NaNK2 A2 B2 C2 D2
In [87]: result = left.join(right, how="outer")In [88]: resultOut[88]: A B C DK0 A0 B0 C0 D0K1 A1 B1 NaN NaNK2 A2 B2 C2 D2K3 NaN NaN C3 D3
In [89]: result = left.join(right, how="inner")In [90]: resultOut[90]: A B C DK0 A0 B0 C0 D0K2 A2 B2 C2 D2
DataFrame.join() takes an optional on
argument which may be a columnor multiple column names that the passed DataFrame is to bealigned.
In [91]: left = pd.DataFrame( ....: { ....: "A": ["A0", "A1", "A2", "A3"], ....: "B": ["B0", "B1", "B2", "B3"], ....: "key": ["K0", "K1", "K0", "K1"], ....: } ....: ) ....: In [92]: right = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]}, index=["K0", "K1"])In [93]: result = left.join(right, on="key")In [94]: resultOut[94]: A B key C D0 A0 B0 K0 C0 D01 A1 B1 K1 C1 D12 A2 B2 K0 C0 D03 A3 B3 K1 C1 D1
In [95]: result = pd.merge( ....: left, right, left_on="key", right_index=True, how="left", sort=False ....: ) ....: In [96]: resultOut[96]: A B key C D0 A0 B0 K0 C0 D01 A1 B1 K1 C1 D12 A2 B2 K0 C0 D03 A3 B3 K1 C1 D1
To join on multiple keys, the passed DataFrame must have a MultiIndex:
In [97]: left = pd.DataFrame( ....: { ....: "A": ["A0", "A1", "A2", "A3"], ....: "B": ["B0", "B1", "B2", "B3"], ....: "key1": ["K0", "K0", "K1", "K2"], ....: "key2": ["K0", "K1", "K0", "K1"], ....: } ....: ) ....: In [98]: index = pd.MultiIndex.from_tuples( ....: [("K0", "K0"), ("K1", "K0"), ("K2", "K0"), ("K2", "K1")] ....: ) ....: In [99]: right = pd.DataFrame( ....: {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=index ....: ) ....: In [100]: result = left.join(right, on=["key1", "key2"])In [101]: resultOut[101]: A B key1 key2 C D0 A0 B0 K0 K0 C0 D01 A1 B1 K0 K1 NaN NaN2 A2 B2 K1 K0 C1 D13 A3 B3 K2 K1 C3 D3
The default for DataFrame.join is to perform a left joinwhich uses only the keys found in thecalling DataFrame. Other join types can be specified with how
.
In [102]: result = left.join(right, on=["key1", "key2"], how="inner")In [103]: resultOut[103]: A B key1 key2 C D0 A0 B0 K0 K0 C0 D02 A2 B2 K1 K0 C1 D13 A3 B3 K2 K1 C3 D3
Joining a single Index to a MultiIndex#
You can join a DataFrame with a Index to a DataFrame with a MultiIndex on a level.The name
of the Index with match the level name of the MultiIndex.
In [104]: left = pd.DataFrame( .....: {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, .....: index=pd.Index(["K0", "K1", "K2"], name="key"), .....: ) .....: In [105]: index = pd.MultiIndex.from_tuples( .....: [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")], .....: names=["key", "Y"], .....: ) .....: In [106]: right = pd.DataFrame( .....: {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, .....: index=index, .....: ) .....: In [107]: result = left.join(right, how="inner")In [108]: resultOut[108]: A B C Dkey Y K0 Y0 A0 B0 C0 D0K1 Y1 A1 B1 C1 D1K2 Y2 A2 B2 C2 D2 Y3 A2 B2 C3 D3
Joining with two MultiIndex#
The MultiIndex of the input argument must be completely usedin the join and is a subset of the indices in the left argument.
In [109]: leftindex = pd.MultiIndex.from_product( .....: [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"] .....: ) .....: In [110]: left = pd.DataFrame({"v1": range(12)}, index=leftindex)In [111]: leftOut[111]: v1abc xy num a x 1 0 2 1 y 1 2 2 3b x 1 4 2 5 y 1 6 2 7c x 1 8 2 9 y 1 10 2 11In [112]: rightindex = pd.MultiIndex.from_product( .....: [list("abc"), list("xy")], names=["abc", "xy"] .....: ) .....: In [113]: right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)In [114]: rightOut[114]: v2abc xy a x 100 y 200b x 300 y 400c x 500 y 600In [115]: left.join(right, on=["abc", "xy"], how="inner")Out[115]: v1 v2abc xy num a x 1 0 100 2 1 100 y 1 2 200 2 3 200b x 1 4 300 2 5 300 y 1 6 400 2 7 400c x 1 8 500 2 9 500 y 1 10 600 2 11 600
In [116]: leftindex = pd.MultiIndex.from_tuples( .....: [("K0", "X0"), ("K0", "X1"), ("K1", "X2")], names=["key", "X"] .....: ) .....: In [117]: left = pd.DataFrame( .....: {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=leftindex .....: ) .....: In [118]: rightindex = pd.MultiIndex.from_tuples( .....: [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")], names=["key", "Y"] .....: ) .....: In [119]: right = pd.DataFrame( .....: {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=rightindex .....: ) .....: In [120]: result = pd.merge( .....: left.reset_index(), right.reset_index(), on=["key"], how="inner" .....: ).set_index(["key", "X", "Y"]) .....: In [121]: resultOut[121]: A B C Dkey X Y K0 X0 Y0 A0 B0 C0 D0 X1 Y0 A1 B1 C0 D0K1 X2 Y1 A2 B2 C1 D1
Merging on a combination of columns and index levels#
Strings passed as the on
, left_on
, and right_on
parametersmay refer to either column names or index level names. This enables mergingDataFrame instances on a combination of index levels and columns withoutresetting indexes.
In [122]: left_index = pd.Index(["K0", "K0", "K1", "K2"], name="key1")In [123]: left = pd.DataFrame( .....: { .....: "A": ["A0", "A1", "A2", "A3"], .....: "B": ["B0", "B1", "B2", "B3"], .....: "key2": ["K0", "K1", "K0", "K1"], .....: }, .....: index=left_index, .....: ) .....: In [124]: right_index = pd.Index(["K0", "K1", "K2", "K2"], name="key1")In [125]: right = pd.DataFrame( .....: { .....: "C": ["C0", "C1", "C2", "C3"], .....: "D": ["D0", "D1", "D2", "D3"], .....: "key2": ["K0", "K0", "K0", "K1"], .....: }, .....: index=right_index, .....: ) .....: In [126]: result = left.merge(right, on=["key1", "key2"])In [127]: resultOut[127]: A B key2 C Dkey1 K0 A0 B0 K0 C0 D0K1 A2 B2 K0 C1 D1K2 A3 B3 K1 C3 D3
Note
When DataFrame are joined on a string that matches an index level in botharguments, the index level is preserved as an index level in the resultingDataFrame.
Note
When DataFrame are joined using only some of the levels of a MultiIndex,the extra levels will be dropped from the resulting join. Topreserve those levels, use DataFrame.reset_index() on those levelnames to move those levels to columns prior to the join.
Joining multiple DataFrame#
A list or tuple of :class:`DataFrame`
can also be passed to join()to join them together on their indexes.
In [128]: right2 = pd.DataFrame({"v": [7, 8, 9]}, index=["K1", "K1", "K2"])In [129]: result = left.join([right, right2])
DataFrame.combine_first()#
DataFrame.combine_first() update missing values from one DataFramewith the non-missing values in another DataFrame in the correspondinglocation.
In [130]: df1 = pd.DataFrame( .....: [[np.nan, 3.0, 5.0], [-4.6, np.nan, np.nan], [np.nan, 7.0, np.nan]] .....: ) .....: In [131]: df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5.0, 1.6, 4]], index=[1, 2])In [132]: result = df1.combine_first(df2)In [133]: resultOut[133]: 0 1 20 NaN 3.0 5.01 -4.6 NaN -8.22 -5.0 7.0 4.0
merge_ordered()#
merge_ordered() combines order data such as numeric or time series datawith optional filling of missing data with fill_method
.
In [134]: left = pd.DataFrame( .....: {"k": ["K0", "K1", "K1", "K2"], "lv": [1, 2, 3, 4], "s": ["a", "b", "c", "d"]} .....: ) .....: In [135]: right = pd.DataFrame({"k": ["K1", "K2", "K4"], "rv": [1, 2, 3]})In [136]: pd.merge_ordered(left, right, fill_method="ffill", left_by="s")Out[136]: k lv s rv0 K0 1.0 a NaN1 K1 1.0 a 1.02 K2 1.0 a 2.03 K4 1.0 a 3.04 K1 2.0 b 1.05 K2 2.0 b 2.06 K4 2.0 b 3.07 K1 3.0 c 1.08 K2 3.0 c 2.09 K4 3.0 c 3.010 K1 NaN d 1.011 K2 4.0 d 2.012 K4 4.0 d 3.0
merge_asof()#
merge_asof() is similar to an ordered left-join except that mactches are on thenearest key rather than equal keys. For each row in the left
DataFrame,the last row in the right
DataFrame are selected where the on
key is lessthan the left’s key. Both DataFrame must be sorted by the key.
Optionally an merge_asof() can perform a group-wise merge by matching theby
key in addition to the nearest match on the on
key.
In [137]: trades = pd.DataFrame( .....: { .....: "time": pd.to_datetime( .....: [ .....: "20160525 13:30:00.023", .....: "20160525 13:30:00.038", .....: "20160525 13:30:00.048", .....: "20160525 13:30:00.048", .....: "20160525 13:30:00.048", .....: ] .....: ), .....: "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"], .....: "price": [51.95, 51.95, 720.77, 720.92, 98.00], .....: "quantity": [75, 155, 100, 100, 100], .....: }, .....: columns=["time", "ticker", "price", "quantity"], .....: ) .....: In [138]: quotes = pd.DataFrame( .....: { .....: "time": pd.to_datetime( .....: [ .....: "20160525 13:30:00.023", .....: "20160525 13:30:00.023", .....: "20160525 13:30:00.030", .....: "20160525 13:30:00.041", .....: "20160525 13:30:00.048", .....: "20160525 13:30:00.049", .....: "20160525 13:30:00.072", .....: "20160525 13:30:00.075", .....: ] .....: ), .....: "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG", "AAPL", "GOOG", "MSFT"], .....: "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01], .....: "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03], .....: }, .....: columns=["time", "ticker", "bid", "ask"], .....: ) .....: In [139]: tradesOut[139]: time ticker price quantity0 2016-05-25 13:30:00.023 MSFT 51.95 751 2016-05-25 13:30:00.038 MSFT 51.95 1552 2016-05-25 13:30:00.048 GOOG 720.77 1003 2016-05-25 13:30:00.048 GOOG 720.92 1004 2016-05-25 13:30:00.048 AAPL 98.00 100In [140]: quotesOut[140]: time ticker bid ask0 2016-05-25 13:30:00.023 GOOG 720.50 720.931 2016-05-25 13:30:00.023 MSFT 51.95 51.962 2016-05-25 13:30:00.030 MSFT 51.97 51.983 2016-05-25 13:30:00.041 MSFT 51.99 52.004 2016-05-25 13:30:00.048 GOOG 720.50 720.935 2016-05-25 13:30:00.049 AAPL 97.99 98.016 2016-05-25 13:30:00.072 GOOG 720.50 720.887 2016-05-25 13:30:00.075 MSFT 52.01 52.03In [141]: pd.merge_asof(trades, quotes, on="time", by="ticker")Out[141]: time ticker price quantity bid ask0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.961 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.982 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.933 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.934 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
merge_asof() within 2ms
between the quote time and the trade time.
In [142]: pd.merge_asof(trades, quotes, on="time", by="ticker", tolerance=pd.Timedelta("2ms"))Out[142]: time ticker price quantity bid ask0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.961 2016-05-25 13:30:00.038 MSFT 51.95 155 NaN NaN2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.933 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.934 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
merge_asof() within 10ms
between the quote time and the trade time andexclude exact matches on time. Note that though we exclude the exact matches(of the quotes), prior quotes do propagate to that point in time.
In [143]: pd.merge_asof( .....: trades, .....: quotes, .....: on="time", .....: by="ticker", .....: tolerance=pd.Timedelta("10ms"), .....: allow_exact_matches=False, .....: ) .....: Out[143]: time ticker price quantity bid ask0 2016-05-25 13:30:00.023 MSFT 51.95 75 NaN NaN1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.982 2016-05-25 13:30:00.048 GOOG 720.77 100 NaN NaN3 2016-05-25 13:30:00.048 GOOG 720.92 100 NaN NaN4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
compare()#
The Series.compare() and DataFrame.compare() methods allow you tocompare two DataFrame or Series, respectively, and summarize their differences.
In [144]: df = pd.DataFrame( .....: { .....: "col1": ["a", "a", "b", "b", "a"], .....: "col2": [1.0, 2.0, 3.0, np.nan, 5.0], .....: "col3": [1.0, 2.0, 3.0, 4.0, 5.0], .....: }, .....: columns=["col1", "col2", "col3"], .....: ) .....: In [145]: dfOut[145]: col1 col2 col30 a 1.0 1.01 a 2.0 2.02 b 3.0 3.03 b NaN 4.04 a 5.0 5.0In [146]: df2 = df.copy()In [147]: df2.loc[0, "col1"] = "c"In [148]: df2.loc[2, "col3"] = 4.0In [149]: df2Out[149]: col1 col2 col30 c 1.0 1.01 a 2.0 2.02 b 3.0 4.03 b NaN 4.04 a 5.0 5.0In [150]: df.compare(df2)Out[150]: col1 col3 self other self other0 a c NaN NaN2 NaN NaN 3.0 4.0
By default, if two corresponding values are equal, they will be shown as NaN
.Furthermore, if all values in an entire row / column, the row / column will beomitted from the result. The remaining differences will be aligned on columns.
Stack the differences on rows.
In [151]: df.compare(df2, align_axis=0)Out[151]: col1 col30 self a NaN other c NaN2 self NaN 3.0 other NaN 4.0
Keep all original rows and columns with keep_shape=True
In [152]: df.compare(df2, keep_shape=True)Out[152]: col1 col2 col3 self other self other self other0 a c NaN NaN NaN NaN1 NaN NaN NaN NaN NaN NaN2 NaN NaN NaN NaN 3.0 4.03 NaN NaN NaN NaN NaN NaN4 NaN NaN NaN NaN NaN NaN
Keep all the original values even if they are equal.
In [153]: df.compare(df2, keep_shape=True, keep_equal=True)Out[153]: col1 col2 col3 self other self other self other0 a c 1.0 1.0 1.0 1.01 a a 2.0 2.0 2.0 2.02 b b 3.0 3.0 3.0 4.03 b b NaN NaN 4.0 4.04 a a 5.0 5.0 5.0 5.0