The Pandas DataFrame is a Python data structure that lets you create and manipulate tables. This article explains how the data structure is built, along with its main methods and properties.
How do Pandas DataFrames work?
Pandas DataFrames are at the heart of the Python Pandas library and enable flexible, efficient data analysis. A DataFrame is a two-dimensional tabular data structure composed of indexed rows and labeled columns. This organization makes it possible to structure data clearly and manipulate it easily, much as in spreadsheet programs such as Excel or LibreOffice. Each column of a DataFrame can hold a different Python data type, which lets you store heterogeneous data such as numeric values, strings, or Booleans within a single table.
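As a minimal sketch of this heterogeneity, the snippet below builds a small DataFrame from made-up sample data (the column names and values are hypothetical, chosen only for illustration) and inspects the data type of each column:

```python
import pandas as pd

# Hypothetical sample data: strings, integers, floats and Booleans side by side.
df = pd.DataFrame({
    "name": ["Ada", "Ben", "Cleo"],
    "age": [36, 41, 29],
    "score": [88.5, 92.0, 79.25],
    "active": [True, False, True],
})

# Each column keeps its own data type.
print(df.dtypes)

# shape is (number of rows, number of columns).
print(df.shape)
```

The exact type names printed for the text column can vary between Pandas versions, which is why no expected output is shown here.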
Tip
Pandas DataFrames are built on NumPy arrays, which allows efficient data handling and fast computation. They nevertheless differ from NumPy data structures in certain respects, including their ability to hold heterogeneous data and the number of dimensions they support: a DataFrame is always two-dimensional, whereas a NumPy array can have any number of dimensions. NumPy structures are therefore particularly well suited to managing large quantities of numeric values, while Pandas DataFrames are better suited to general-purpose handling of mixed data.
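The contrast can be sketched as follows. The array shape and the column names are arbitrary assumptions for the example:

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype for all elements and any number of dimensions.
cube = np.zeros((2, 3, 4))
print(cube.ndim, cube.dtype)

# A DataFrame always has two dimensions, but every column may have its own dtype.
df = pd.DataFrame({"label": ["a", "b"], "value": [1.5, 2.5]})
print(df.ndim)

# Converting mixed columns into one array forces a common dtype (here: object).
mixed = df.to_numpy()
print(mixed.dtype)
```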
How is a Pandas DataFrame structured?
A DataFrame consists of three main elements: the data itself, the row index, and the column names. The row index (or simply the index) serves as a unique identifier for each row. By default, rows are indexed with integers, but these can be replaced by strings. Note that Pandas DataFrames are zero-indexed, that is, the default index starts at 0.
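The three elements can be inspected directly. The following sketch uses a hypothetical two-row table and then swaps the default integer index for string labels:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"item": ["pen", "book"], "price": [1.5, 12.0]})

default_index = list(df.index)
print(default_index)        # [0, 1]
print(list(df.columns))     # ['item', 'price']

# Replace the default integer index with string labels.
df.index = ["a", "b"]
print(df.loc["b", "item"])  # book
```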


Note
Although Pandas DataFrames are among the most popular and useful Python data structures, they are not part of the core language and must therefore be imported. This is done with the line import pandas or from pandas import DataFrame at the start of your file. You can also use import pandas as pd if you want to reference the module under a shorter name (in this case "pd").
Using Pandas DataFrames
Pandas DataFrames offer a multitude of techniques and methods for processing, analyzing, and visualizing data efficiently. In what follows, you will learn some of the most important concepts and methods for manipulating data with Pandas DataFrames.
Creating a Pandas DataFrame
If you have already stored the desired data in a Python list or a Python dictionary, you can very easily create a DataFrame from it. To do so, simply pass the existing data structure as an argument to the constructor: pandas.DataFrame([data]). How Pandas interprets your data, however, depends on the structure you pass to the constructor. For example, you can create a Pandas DataFrame from a Python list:
import pandas
liste = ["Ahmed", "Beatrice", "Candice", "Donovan", "Elisabeth", "Frank"]
df = pandas.DataFrame(liste)
print(df)
# Output:
#            0
# 0      Ahmed
# 1   Beatrice
# 2    Candice
# 3    Donovan
# 4  Elisabeth
# 5      Frank
As you can see in the example above, a plain list only lets you create a DataFrame with a single, unlabeled column. This is why it is recommended to create DataFrames from dictionaries containing lists. In this case, the keys are interpreted as column names and the lists as the corresponding data. Here is an example:
import pandas
données = {
    'Nom': ['Arthur', 'Bruno', 'Christophe'],
    'Âge': [34, 30, 55],
    'Salaire': [75000.0, 60000.5, 90000.3],
}
df = pandas.DataFrame(données)
print(df)
# Output:
#           Nom  Âge  Salaire
# 0      Arthur   34  75000.0
# 1       Bruno   30  60000.5
# 2  Christophe   55  90000.3
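Two further input shapes are worth a brief sketch. A plain list can still receive a column label through the constructor's columns parameter, and a list of dictionaries is read row by row. The variable names below are hypothetical:

```python
import pandas as pd

# A plain list plus an explicit column label.
names = ["Ahmed", "Beatrice", "Candice"]
df_named = pd.DataFrame(names, columns=["Nom"])
print(df_named)

# A list of dictionaries: each dictionary is one row, the keys become column names.
rows = [{"Nom": "Arthur", "Âge": 34}, {"Nom": "Bruno", "Âge": 30}]
df_rows = pd.DataFrame(rows)
print(df_rows)
```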
With this method, the DataFrame immediately has the desired format and column titles. If you do not want to rely on Python's built-in data structures, you can also load your data from an external source such as a CSV file or an SQL database. To do this, simply call the appropriate Pandas function:
import pandas
from sqlalchemy import create_engine
# DataFrame from a CSV file:
csv = pandas.read_csv("fichiers.csv/donnees.csv")
# DataFrame from an SQL query (the credentials in the URL are placeholders):
engine = create_engine('postgresql://nom_d_utilisateur:mot_de_passe@localhost:5432/ma_base_de_données')
sql = pandas.read_sql_query('SELECT * FROM tabelle', engine)
The DataFrames csv and sql in the example above now contain all the data from the file donnees.csv and the SQL table tabelle, respectively. When creating a DataFrame from an external source, you can also specify additional details, for example whether a numeric index should be included in the DataFrame. You will find more details on the additional arguments of both functions in the official Pandas documentation.
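As one illustration of such additional arguments, the sketch below reads a small in-memory CSV (hypothetical content, wrapped in io.StringIO so that no file is needed) once with the default numeric index and once with the id column used as the index via index_col:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file.
csv_text = "id,Nom,Âge\n10,Arthur,34\n20,Bruno,30\n"

# Default: Pandas adds a numeric index starting at 0.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df)

# index_col: use an existing column as the index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col="id")
print(indexed_df)

# When writing back, index=False leaves the index out of the CSV.
csv_out = default_df.to_csv(index=False)
print(csv_out)
```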
Tip
To create a Pandas DataFrame from an SQL table, you must use Pandas in combination with a Python SQL module such as SQLAlchemy. Establish a connection to the database using the SQL module you have chosen and pass it to read_sql_query().
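For a self-contained trial run without a database server, the following sketch swaps SQLAlchemy for the sqlite3 module from the Python standard library, whose connections read_sql_query() also accepts. The table name staff and its contents are assumptions made for the example:

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (nom TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("Arthur", 34), ("Bruno", 30)],
)
conn.commit()

# The query result is returned as a DataFrame.
sql_df = pd.read_sql_query("SELECT * FROM staff ORDER BY age", conn)
print(sql_df)

conn.close()
```

For other database systems such as PostgreSQL, an SQLAlchemy engine as in the example above remains the usual route.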
Pandas DataFrames: displaying data
With Pandas DataFrames, you can display not only the whole table but also individual rows and columns, and you can choose exactly which rows and columns you want to see. The following example shows how to display one or several rows or columns:
# Assumes df has at least seven rows and the columns "Profession" and "Âge"
# Display row 0
print(df.loc[0])
# Print rows 3 through 6 (a loc slice includes both endpoints)
print(df.loc[3:6])
# Print rows 3 and 6
print(df.loc[[3, 6]])
# Print the "Profession" column
print(df["Profession"])
# Print the "Profession" and "Âge" columns
print(df[["Profession", "Âge"]])
# Select several rows and columns at once
print(df.loc[[3, 6], ['Profession', 'Âge']])
As the example shows, a column is referenced simply by putting its name in quotes inside square brackets, as with Python dictionaries. To reference a row, by contrast, we use the loc attribute. With loc it is also possible to apply logical conditions to filter the data. The following block of code shows this: only the rows whose "Âge" value is greater than 30 are displayed:
print(df.loc[df['Âge'] > 30])
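The selection techniques above can be tried on a small, self-contained table. The data below is hypothetical; the last line also combines two conditions, which requires the & operator and parentheses around each condition:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Nom": ["Ana", "Bo", "Cy", "Di", "Eli"],
    "Âge": [25, 31, 47, 29, 38],
    "Profession": ["nurse", "baker", "pilot", "chef", "clerk"],
})

# Label slice: includes both endpoints, so this yields rows 1, 2 and 3.
rows_1_to_3 = df.loc[1:3]

# Boolean mask: keep only the rows where the condition is True.
older = df.loc[df["Âge"] > 30]

# Mask plus column list: filter rows and pick columns in one step.
older_jobs = df.loc[df["Âge"] > 30, ["Nom", "Profession"]]

# Several conditions: combine with & (and) or | (or), each in parentheses.
thirties = df.loc[(df["Âge"] > 30) & (df["Âge"] < 40)]

print(older_jobs)
print(thirties)
```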
You can also use the iloc attribute to select rows and columns based on their position in the DataFrame. Because positions are counted from zero, the following example displays the cell at row position 3 and column position 4, that is, the fourth row and the fifth column:
print(df.iloc[3, 4])
# Output:
# Paris
print(df.iloc[[3, 4, 6], 4])
# Output (Name/dtype footer omitted):
# 3        Paris
# 4    Marseille
# 6         Lyon
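The difference between loc and iloc becomes visible as soon as the index is not the default 0, 1, 2. The sketch below uses a hypothetical table with the index labels 10, 20 and 30:

```python
import pandas as pd

# Hypothetical sample data with a non-default integer index.
df = pd.DataFrame(
    {"Nom": ["Ana", "Bo", "Cy"], "Ville": ["Paris", "Marseille", "Lyon"]},
    index=[10, 20, 30],
)

# loc works with labels: the row labeled 20.
by_label = df.loc[20, "Ville"]

# iloc works with positions: second row, second column.
by_position = df.iloc[1, 1]

print(by_label, by_position)  # Marseille Marseille

# Unlike loc, an iloc slice excludes its end: positions 0 and 1 only.
first_two = df.iloc[0:2]
print(list(first_two.index))  # [10, 20]
```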
Pandas DataFrames: iterating over rows
When processing data in Python, it is very often necessary to iterate over the rows of a Pandas DataFrame, for example to apply the same operation to every record. Pandas offers two different methods for iterating over the rows of a DataFrame: itertuples() and iterrows(). Both methods have their advantages and disadvantages in terms of performance and ease of use.
The iterrows() method returns, for each row of the DataFrame, a tuple containing the index and the corresponding Series. A Series is another Pandas data structure, built on NumPy, that resembles a Python list in many respects but offers better performance. Individual elements of the Series are accessed by column name, which makes data manipulation considerably easier.
Although Pandas Series are much more efficient than Python lists, this data structure still carries a certain performance overhead. This is why the itertuples() method is mainly recommended for very large DataFrames. Unlike iterrows(), itertuples() returns the whole row, including the index, as a tuple, which is much more efficient than a Series. In these tuples, individual elements are accessed with dot notation, like the attributes of an object.
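What each method hands back can be checked on a single row. This sketch uses hypothetical data and fetches only the first item from each iterator:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"Nom": ["Alice", "Bob"], "Salaire": [70000.0, 80000.5]})

# iterrows() yields (index, Series) pairs; values are read by column name.
index, row = next(df.iterrows())
print(type(row).__name__)  # Series
print(row["Nom"])          # Alice

# itertuples() yields named tuples; values are read with dot notation.
tup = next(df.itertuples())
print(tup._fields)         # ('Index', 'Nom', 'Salaire')
print(tup.Nom)             # Alice

# With index=False and name=None, itertuples() yields plain tuples.
plain = next(df.itertuples(index=False, name=None))
print(plain)               # ('Alice', 70000.0)
```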
Another important difference between Series and tuples is that tuples are immutable (they cannot be modified). If you want to use itertuples() to iterate over a DataFrame and modify values, you must address the DataFrame itself through the at attribute together with the index stored in the tuple. This attribute works very similarly to loc. The following example illustrates the differences between iterrows() and itertuples():
import pandas
df = pandas.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Âge': [25, 30, 35],
    'Salaire': [70000.0, 80000.5, 90000.3]
})
# iterrows(): each row is a Series; here it is a copy, so changing it leaves df untouched
for index, row in df.iterrows():
    row['Salaire'] += 1000
    print(f"Index : {index}, Âge : {row['Âge']}, Salaire : {row['Salaire']}")
# itertuples(): tuples are immutable, so write the new value into df directly with at[]
for tup in df.itertuples():
    df.at[tup.Index, 'Salaire'] += 1000
    print(f"Index : {tup.Index}, Âge : {tup.Âge}, Salaire : {df.loc[tup.Index, 'Salaire']}")
# Both loops print the same output, but only the second one actually updates df
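Which of the two loops really changes the DataFrame can be verified with a short check. The sketch below uses a reduced, hypothetical version of the table and records the salary column after each loop:

```python
import pandas as pd

# Hypothetical sample data with mixed column types.
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Salaire": [70000.0, 80000.0]})

# Modifying the Series from iterrows() only changes a temporary copy of the row.
for index, row in df.iterrows():
    row["Salaire"] += 1000
after_iterrows = df["Salaire"].tolist()
print(after_iterrows)    # [70000.0, 80000.0]

# Writing through at[] changes the DataFrame itself.
for tup in df.itertuples():
    df.at[tup.Index, "Salaire"] += 1000
after_itertuples = df["Salaire"].tolist()
print(after_itertuples)  # [71000.0, 81000.0]
```

When the same operation applies to every row, a column-wise expression such as df["Salaire"] += 1000 gives the same result without an explicit loop.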

