AMZ DIGICOM

Digital Communication

pandas merge(): Merging DataFrames

The pandas function DataFrame.merge() is used to merge two DataFrames on common keys. This makes it possible to combine data from different sources efficiently in order to carry out more complete analyses.

Syntax of the pandas merge() function

The pandas merge() function accepts a whole series of parameters that influence how the DataFrames are combined. Its general syntax is as follows:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

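The same operation is available both as the top-level function pd.merge(left, right, ...) and as the method left.merge(right, ...). The following is a minimal sketch, assuming two small hypothetical DataFrames named left and right (they are not part of the examples in this article), that suggests the two spellings give the same result:

```python
import pandas as pd

# Hypothetical example data; names and values are illustrative only
left = pd.DataFrame({'key': ['A', 'B'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['B', 'C'], 'y': [3, 4]})

# Function form: both DataFrames are passed as arguments
via_function = pd.merge(left, right, how='inner', on='key')

# Method form: the calling DataFrame plays the role of `left`
via_method = left.merge(right, how='inner', on='key')

# Compare the two results
print(via_function.equals(via_method))
```

Which form to use is mostly a matter of style; the method form can be convenient when chaining several operations.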

Note

The pandas merge() function is similar to the SQL JOIN operation in relational databases. If you already know a database language such as SQL, you will probably find it easier to understand how merge() works. Note, however, that the behavior can differ: if both key columns contain null values, those rows are also matched with each other.
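The difference from SQL regarding null keys can be illustrated with a short sketch. The DataFrames left and right and their column names are hypothetical and chosen only for this illustration:

```python
import pandas as pd

# Hypothetical data: each DataFrame has one row whose key is missing
left = pd.DataFrame({'key': ['A', None], 'v1': [1, 2]})
right = pd.DataFrame({'key': [None, 'B'], 'v2': [3, 4]})

# In SQL, NULL never equals NULL, so an INNER JOIN would return no rows here.
# pandas matches the two rows with missing keys against each other.
merged = pd.merge(left, right, how='inner', on='key')
print(merged)
```

If this behavior is not wanted, one option is to drop the rows with missing keys first, for example with left.dropna(subset=['key']).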

Relevant parameters

With the parameters that pandas merge() accepts, you specify not only which DataFrames to combine, but also the type of join and other details.

Parameter | Description | Default value
left | First DataFrame to merge |
right | Second DataFrame to merge |
how | Type of join to perform (inner, outer, left or right) | inner
on | Column or index level to use as the key; must be present in both DataFrames | None
left_on | Column or index level of the left DataFrame to use as the key | None
right_on | Column or index level of the right DataFrame to use as the key | None
left_index | If True, the index of the left DataFrame is used as the key | False
right_index | If True, the index of the right DataFrame is used as the key | False
sort | If True, the keys of the resulting DataFrame are sorted lexicographically | False
suffixes | Suffixes used to make columns with the same name unique | ("_x", "_y")
copy | If False, copying is avoided | True
indicator | Adds a column indicating the origin of each row after the merge (both, left_only, right_only) | False
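As a rough illustration of two of these parameters, the sketch below combines suffixes and indicator in one call. The DataFrames left and right and the suffixes '_left' and '_right' are assumptions made for this example only:

```python
import pandas as pd

# Hypothetical data: both DataFrames have a non-key column named 'value'
left = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
right = pd.DataFrame({'key': ['B', 'C'], 'value': [3, 4]})

merged = pd.merge(
    left,
    right,
    how='outer',
    on='key',
    suffixes=('_left', '_right'),  # disambiguates the two 'value' columns
    indicator=True,                # adds a '_merge' column with the row origin
)
print(merged)
```

The resulting '_merge' column can then be used to filter, for instance, the rows that exist only in the left DataFrame.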

Using pandas merge()

A few examples help to understand how the pandas merge() function works.

INNER JOIN

An INNER JOIN combines two pandas DataFrames and returns only the rows whose keys are present in both DataFrames. All other rows are excluded from the result. To demonstrate this, two DataFrames are first created with example data:

import pandas as pd
# Example DataFrames
df1 = pd.DataFrame({
    'Clé': ['A', 'B', 'C'],
    'Valeur1': [1, 2, 3]
})
df2 = pd.DataFrame({
    'Clé': ['B', 'C', 'D'],
    'Valeur2': [4, 5, 6]
})
print(df1)
print(df2)


The two resulting DataFrames look like this:

  Clé  Valeur1
0   A        1
1   B        2
2   C        3
  Clé  Valeur2
0   B        4
1   C        5
2   D        6

We can now perform an INNER JOIN using the merge() function:

# Inner join (INNER JOIN)
result = pd.merge(df1, df2, how='inner', on='Clé')
print(result)


The output shows that in this example only the rows with keys B and C are included in the final DataFrame, because they are present in both original DataFrames.

  Clé  Valeur1  Valeur2
0   B        2        4
1   C        3        5

OUTER JOIN

An OUTER JOIN merges two DataFrames while retaining all the rows of both. If a key has no match in one of the DataFrames, the missing values are filled with NaN.

# Outer join (OUTER JOIN)
résultat = pd.merge(df1, df2, how='outer', on='Clé')
print(résultat)


As expected, the merged DataFrame includes all the rows of both DataFrames. For key A, which only exists in df1, and key D, which only exists in df2, the missing values are inserted as NaN.

  Clé  Valeur1  Valeur2
0   A      1.0      NaN
1   B      2.0      4.0
2   C      3.0      5.0
3   D      NaN      6.0

Note

The other common JOIN variants, such as LEFT JOIN and RIGHT JOIN, work in much the same way.
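As a sketch of two such variants, the example below recreates df1 and df2 from the earlier examples and applies how='left' and how='right'. The variable names left_join and right_join are chosen here for illustration:

```python
import pandas as pd

# Same example data as in the INNER JOIN section
df1 = pd.DataFrame({'Clé': ['A', 'B', 'C'], 'Valeur1': [1, 2, 3]})
df2 = pd.DataFrame({'Clé': ['B', 'C', 'D'], 'Valeur2': [4, 5, 6]})

# LEFT JOIN: keeps every row of df1; keys without a match in df2 get NaN
left_join = pd.merge(df1, df2, how='left', on='Clé')
print(left_join)

# RIGHT JOIN: keeps every row of df2; keys without a match in df1 get NaN
right_join = pd.merge(df1, df2, how='right', on='Clé')
print(right_join)
```

In the left join, key A has no partner and its Valeur2 is NaN; in the right join, the same happens to Valeur1 for key D.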

Using left_on and right_on

Sometimes the two DataFrames have different names for their key columns. In this case, you can use the parameters left_on and right_on to indicate which columns should be used. To demonstrate this, two new DataFrames are first created:

import pandas as pd
# Create the example DataFrames
df3 = pd.DataFrame({
    'Clé': ['A', 'B', 'C'],
    'Valeur1': [1, 2, 3]
})
df4 = pd.DataFrame({
    'Clé2': ['B', 'C', 'D'],
    'Valeur2': [4, 5, 6]
})
print(df3)
print(df4)


The two DataFrames look like this:

  Clé  Valeur1
0   A        1
1   B        2
2   C        3
  Clé2  Valeur2
0    B        4
1    C        5
2    D        6

To perform the JOIN with different key names, the parameters left_on and right_on are now specified:

# Join with different key column names
result = pd.merge(df3, df4, how='inner', left_on='Clé', right_on='Clé2')
print(result)


By explicitly passing left_on='Clé' and right_on='Clé2', the corresponding key columns are used for the join. Both key columns are kept in the result.

  Clé  Valeur1 Clé2  Valeur2
0   B        2    B        4
1   C        3    C        5
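Because both key columns appear in the result of an inner join on left_on and right_on, they carry identical values. One possible follow-up, sketched here with the hypothetical variable name cleaned, is to drop the redundant column afterwards:

```python
import pandas as pd

# Same example data as above
df3 = pd.DataFrame({'Clé': ['A', 'B', 'C'], 'Valeur1': [1, 2, 3]})
df4 = pd.DataFrame({'Clé2': ['B', 'C', 'D'], 'Valeur2': [4, 5, 6]})

result = pd.merge(df3, df4, how='inner', left_on='Clé', right_on='Clé2')

# After an inner join both key columns hold the same values,
# so the second one can be removed without losing information
cleaned = result.drop(columns='Clé2')
print(cleaned)
```

For outer, left or right joins the two key columns can differ (one side may be NaN), so dropping one of them is not always safe there.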

Using indexes as keys

You can also use the indexes of the DataFrames as join keys by setting the parameters left_index and right_index to True. Two new DataFrames with custom indexes are first created:

df5 = pd.DataFrame({
    'Valeur1': [1, 2, 3]
}, index=['A', 'B', 'C'])
df6 = pd.DataFrame({
    'Valeur2': [4, 5, 6]
}, index=['B', 'C', 'D'])
print(df5)
print(df6)


The DataFrames created by the code above look like this:

   Valeur1
A        1
B        2
C        3
   Valeur2
B        4
C        5
D        6

A JOIN can now be carried out on the basis of the indexes:

# Join on the indexes
result = pd.merge(df5, df6, how='inner', left_index=True, right_index=True)
print(result)


As expected, the result is a JOIN based on the indexes of the DataFrames:

   Valeur1  Valeur2
B        2        4
C        3        5
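Column keys and index keys can also be mixed. The sketch below, which reuses the data of df3 and df6 under the hypothetical names left_df and right_df, joins a key column on the left with the index on the right by combining left_on with right_index=True:

```python
import pandas as pd

# Left DataFrame with the key stored in a regular column
left_df = pd.DataFrame({'Clé': ['A', 'B', 'C'], 'Valeur1': [1, 2, 3]})

# Right DataFrame with the key stored in the index
right_df = pd.DataFrame({'Valeur2': [4, 5, 6]}, index=['B', 'C', 'D'])

# Match the 'Clé' column of the left side against the index of the right side
mixed = pd.merge(left_df, right_df, how='inner', left_on='Clé', right_index=True)
print(mixed)
```

This can save a reset_index() or set_index() call when only one of the two DataFrames keeps its keys in the index.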

The merge() function is an essential tool for efficiently combining DataFrames according to different join rules. It is modeled on SQL joins and offers a great deal of flexibility for handling datasets in Python.
