The Python Pandas function DataFrame.dropna() removes from a DataFrame all rows or columns that contain missing values (NA). It therefore plays a crucial role in data preparation and cleaning.
Syntax of Pandas dropna()
The function dropna() accepts up to six parameters. The basic syntax is very simple:
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False, ignore_index=False)
Relevant parameters
The behavior of DataFrame.dropna() can be controlled through the parameters passed to it. The most important parameters are summarized in the following table:
| Parameter | Description | Default value |
|---|---|---|
| axis | Determines whether rows (0 or index) or columns (1 or columns) are deleted | 0 |
| how | Indicates whether all values (all) or at least one value (any) must be NA for a row or column to be deleted | any |
| thresh | Minimum number of non-NA values that a row or column must have to be kept | None |
| subset | Restricts the check for missing values to the given labels on the other axis (for example, column names when rows are deleted); if None, all are taken into account | None |
| inplace | Determines whether the operation modifies the original DataFrame instead of returning a new one | False |
| ignore_index | If True, the resulting axis is labeled from 0 to N-1 | False |
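The difference between the two values of how, and the effect of inplace, can be illustrated with a minimal sketch. The DataFrame sample and its columns x and y are hypothetical and chosen only for this illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: row 1 is entirely NaN, row 2 is partly NaN
sample = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [4.0, np.nan, np.nan],
})

# how="any" (the default) drops every row with at least one NaN
any_dropped = sample.dropna(how="any")

# how="all" drops only rows in which every value is NaN
all_dropped = sample.dropna(how="all")

# inplace=True modifies the DataFrame itself and returns None
in_place = sample.copy()
result = in_place.dropna(inplace=True)

print(any_dropped.index.tolist())  # [0]
print(all_dropped.index.tolist())  # [0, 2]
print(result)                      # None
print(in_place.index.tolist())     # [0]
```

Because inplace=True returns None, assigning its result to a variable is a common source of errors. Working with the returned copy (the default) is usually easier to follow.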
Using Pandas DataFrame.dropna()
Pandas dropna() is used to clean data before an analysis by deleting rows or columns with missing values. This helps avoid bias in statistical analyses. It also makes it easier to create charts and reports, because missing values can in some cases lead to misleading representations.
Deleting rows with missing values
In the following code example, we consider a DataFrame that contains NaN values:
import pandas as pd
import numpy as np
# Create a DataFrame with sample data
données = {
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
}
df = pd.DataFrame(données)
print(df)
The DataFrame looks like this:
A B C
0 1.0 5.0 9
1 2.0 NaN 10
2 NaN NaN 11
3 4.0 8.0 12
In the next step, we apply the Pandas function dropna():
# Delete all rows containing at least one NaN value
df_cleaned = df.dropna()
print(df_cleaned)
The execution of the code gives the following result:
A B C
0 1.0 5.0 9
3 4.0 8.0 12
Only the rows with index 0 and 3 remain in the DataFrame, because all the other rows contained NaN values.
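Note that the remaining rows keep their original labels. As a brief sketch, the ignore_index parameter from the table above can relabel them. This assumes pandas 2.0 or later, where the parameter is available. On older versions, chaining reset_index(drop=True) gives the same result.

```python
import numpy as np
import pandas as pd

# Same example data as in the article
df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
    "C": [9, 10, 11, 12],
})

# Default behavior: surviving rows keep their original labels
kept_labels = df.dropna()

# ignore_index=True (assumes pandas 2.0 or later) relabels rows 0..N-1
relabeled = df.dropna(ignore_index=True)

print(kept_labels.index.tolist())  # [0, 3]
print(relabeled.index.tolist())    # [0, 1]
```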
Deleting columns with missing values
Deleting columns with missing values works in the same way. To do this, simply set the parameter axis to 1:
# Delete all columns containing at least one NaN value
df_cleaned_columns = df.dropna(axis=1)
print(df_cleaned_columns)
In the result, we see that only column "C" remains, because it is the only one that contains no NaN values:
C
0 9
1 10
2 11
3 12
Using thresh
If you only want to delete the rows that have fewer than two non-NaN values, you can use the parameter thresh:
# Delete all rows containing fewer than two non-NaN values
df_thresh = df.dropna(thresh=2)
print(df_thresh)
After the code is executed, the row with index 1 is now retained, because it contains two non-NaN values. Only the row with index 2, which has a single non-NaN value, is deleted:
A B C
0 1.0 5.0 9
1 2.0 NaN 10
3 4.0 8.0 12
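The thresh parameter can also be combined with axis=1 to filter columns instead of rows. The following is a minimal sketch using the same example data. The threshold of three is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

# Same example data as in the article
df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
    "C": [9, 10, 11, 12],
})

# Keep only the columns that have at least three non-NaN values.
# A has three, B has two, C has four, so B is dropped.
df_thresh_cols = df.dropna(axis=1, thresh=3)

print(df_thresh_cols.columns.tolist())  # ['A', 'C']
```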
Using subset
The parameter subset specifies the columns in which missing values are searched for. Only rows that have missing values in the specified columns are deleted.
# Delete all rows containing a NaN in column "A"
df_subset = df.dropna(subset=['A'])
print(df_subset)
Only the row with index 2 was deleted, because it contained a NaN value in column "A". The other rows are kept, even if they contain NaN in other columns:
A B C
0 1.0 5.0 9
1 2.0 NaN 10
3 4.0 8.0 12
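subset also accepts several columns and can be combined with how. As a short sketch on the same example data, the first call drops a row if either listed column is missing, while the second drops it only if both are missing:

```python
import numpy as np
import pandas as pd

# Same example data as in the article
df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
    "C": [9, 10, 11, 12],
})

# Drop rows that have a NaN in column A or in column B
either_missing = df.dropna(subset=["A", "B"])

# Drop rows only when both A and B are NaN
both_missing = df.dropna(subset=["A", "B"], how="all")

print(either_missing.index.tolist())  # [0, 3]
print(both_missing.index.tolist())    # [0, 1, 3]
```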

