Technologies | August 24, 2023

Python Pandas and DataFrame: The Power Couple for Modern Data Analysis 

How can you easily read, process and analyze data in Python? Explore the free Pandas library.


Pandas is a free Python library that greatly expands your data analysis and processing capabilities. It is one of the most important tools in the Python ecosystem and is widely used across industries. In this article, you will learn how to easily load, process and analyze data in Python – a skill valued in many areas, from finance to engineering.

Python – Pandas Library in Data Science. Why is it worth finding out more?

The Pandas library was created to make it possible to work with various types of data that are not always complete or require proper processing. Pandas provides flexible and easy-to-use data structures and tools (albeit not always efficient ones… but later in the article I will also tell you how to deal with that). Alongside libraries such as NumPy, Matplotlib, Seaborn and scikit-learn, it has earned popularity and recognition among academics, analysts, engineers, and data enthusiasts. 

Introduction to Pandas in Python

It is an ideal tool for managing and analyzing data (with the help of extra libraries) in small and medium-sized datasets. Processing large datasets, common in Big Data work, is also possible, but as the amount of data grows, so does the likelihood of memory and performance problems. 

When will the Pandas library work best?

The Pandas library is well suited to working with different data types and sources (see the short sketch after this list): 

  • Tabular data with columns of various types (e.g. Excel, SQL). 
  • Time-series data. 
  • Data with labels of rows and columns (labeled data). 
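As a minimal sketch of the last two cases, here is a labeled time series built with pd.date_range (the dates and values are made up for illustration): 

import pandas as pd

# Daily timestamps used as row labels
dates = pd.date_range('2023-08-01', periods=4, freq='D')
sales = pd.Series([120, 98, 143, 110], index=dates)
print(sales)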

Examples of applications of the Pandas library

Here are some of the most common ones: 

  • Loading various data formats (CSV, Excel, SQL, flat files, etc.). 
  • Filtering, sorting and other operations on data. 
  • Cleaning data (deleting NaN values, i.e. Not a Number entries, averaging, replacing values, etc.). 
  • Quickly and efficiently calculating statistics and performing operations on data. 
  • Visualization of data with charts. 

Before we start using the Pandas library, let’s make sure that it is installed in the Python environment. 


How to install the Pandas library

Installing Pandas is very simple and can be done with the pip tool, which is the default Python package manager. I assume that Python is already installed on your computer, and if not – here’s how to do it: BeginnersGuide/Download – Python Wiki. Then just open the terminal and type the following command: 

pip install pandas 

You may be required to grant permissions to execute this command (e.g., using sudo on Unix systems or running the terminal with administrator rights on Windows). If you are using a Python virtual environment (e.g., venv or conda), you will need to activate it before installing the package. You can read more about virtual environments in the documentation.  
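As a hedged sketch, creating and activating a virtual environment before installing could look like this on Unix-like systems (the environment name .venv is just a common convention): 

python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install pandas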

If you are using Anaconda, you can install Pandas using the command: 

conda install pandas 

How to import Python Pandas into your project

Once the Pandas library is installed, you can start using it. The first step is to import it into our script or project. Importing the Pandas library is no different from importing any other Python library. 

We can do so using the code below: 

import pandas as pd 

In Python, Pandas is typically imported under the alias 'pd', a short and commonly accepted abbreviation. Now, whenever we want to use Pandas functions, the 'pd' abbreviation is sufficient instead of typing the full word 'pandas'. It is similar for the NumPy library ('np'), which I wrote about in a previous article. 

For example, if we wanted to create a DataFrame (one of the key data structures in Pandas), the code would look like this: 

import pandas as pd

data = {
    'column_1': [3, 2, 0, 1],
    'column_2': [0, 3, 7, 2]
}

example_df = pd.DataFrame(data)
print(example_df)

As you can see in the example, I used the DataFrame class, the basic structure provided by Pandas. In the next section, we will take a closer look at the two fundamental data structures – Series and DataFrame. 

Pandas DataFrames and Series

The main purpose of the Pandas library is to facilitate working with data, which is why Pandas introduces two data structures: Series and DataFrame. Understanding these structures is key to using this library effectively. 

Pandas Series 

A Series is a one-dimensional data structure built on top of an array (ndarray), similar to a list or a single column in a table. Each element in a Series (e.g. an integer, string, object, or tuple) is assigned an identifier called an index. A Series stores data of one type. 

Here is an example of creating a series that contains a list of items: 

import pandas as pd

vals_sr = pd.Series(["Val_1", "Val_2", "Val_3", "Val_4", "Val_5"])
print(vals_sr)

By default, the index consists of integers starting from zero. The index can be changed, e.g. by assigning labels. In that case, we extend the code that creates the Series with the index parameter. The code looks like this: 

import pandas as pd

vals_sr = pd.Series(["Val_1", "Val_2", "Val_3", "Val_4", "Val_5"], index=["A", "B", "C", "D", "E"])
print(vals_sr)

It is worth remembering that the number of labels must match the number of elements in the Series; otherwise the Python interpreter will raise a ValueError. If you do not want to display the entire Series, but only check which indexes have been assigned, or display only the values, you can use the following code snippet: 

import pandas as pd

vals_sr = pd.Series(["Val_1", "Val_2", "Val_3", "Val_4", "Val_5"], index=["A", "B", "C", "D", "E"])
print(vals_sr)
print(vals_sr.index)   # returns an Index object
print(vals_sr.values)  # returns an ndarray object

Pandas DataFrame

A DataFrame is a two-dimensional data structure similar to a table in a database or an Excel spreadsheet. A DataFrame consists of rows and columns – each column in a DataFrame is a Series. As you can probably guess, although a single column holds only one data type, a DataFrame can contain many columns, each with a different type. An example would be creating a DataFrame from data on transactions made by customers, identified by ID. 

import pandas as pd

ct_data = {
    'client_id': ['C34P', 'C35S', 'C35P', 'C97S', 'C58S'],
    'card_transactions': [11, 244, 31, 458, 63]
}

client_transaction_df = pd.DataFrame(ct_data)
print(client_transaction_df)

Loading data from different sources

One of the most important benefits of Pandas is the ease with which you can load data from a variety of sources and file formats. The most popular ones include: 

  • CSV 
  • Excel (.xlsx) 
  • SQL 
  • Flat files (e.g., text files) 

Data from the CSV file can be loaded into the DataFrame with the pd.read_csv() function. 

import pandas as pd

df = pd.read_csv('path_to_your_file.csv')
print(df)
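read_csv accepts many useful parameters. As a hedged sketch, a semicolon-separated file with comma decimals (a common format in European locales) could be loaded like this; the file name is a placeholder: 

df = pd.read_csv('path_to_your_file.csv', sep=';', decimal=',', encoding='utf-8')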

Similarly, we can load an Excel file using the pd.read_excel() function (reading .xlsx files additionally requires the openpyxl package). 

import pandas as pd

df = pd.read_excel('path_to_your_file.xlsx')
print(df)

To load the result of an SQL query, we must first create a connection to the database. Using an SQLite database as an example, we can do it this way: 

import pandas as pd
import sqlite3

# Establishing a database connection
conn = sqlite3.connect("database_name.db")

# Executing the query
df = pd.read_sql_query("SELECT * FROM my_table", conn)
print(df)

If you are using a different database, you will need to install the right Python driver and replace sqlite3.connect with the appropriate connection object. If you want to learn more about SQLite, I encourage you to read the sqlite3 documentation.  
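For databases such as PostgreSQL or MySQL, a common approach is to create a SQLAlchemy engine and pass it to read_sql_query. Below is a minimal sketch; the connection string is a hypothetical placeholder, and the relevant driver (e.g. psycopg2 for PostgreSQL) must be installed separately: 

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string -- replace with your own credentials
engine = create_engine("postgresql://user:password@localhost:5432/my_database")
df = pd.read_sql_query("SELECT * FROM my_table", engine)
print(df)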

In the next section, we will discuss the basic operations on data using Series and DataFrames. 

Pandas – operations on data

Now that you know how to load data using the Pandas library, we will focus on selecting, filtering and sorting data, and on using the apply and map functions. 

Selecting, filtering and sorting data

Selecting specific data from the DataFrame is one of the basic and most commonly used operations. Pandas allows you to select data in many ways: 

  • Column selection: df['column_name'] 
  • Selecting rows by position: df.iloc[index] 
  • Selecting rows by index label: df.loc[label] 

See a sample snippet of the code below: 

import pandas as pd

ct_data = {
    'client_id': ['C34P', 'C35S', 'C35P', 'C97S', 'C58S'],
    'count': [11, 244, 31, 458, 63]
}

df = pd.DataFrame(ct_data)
print(df)

# Selecting the 'client_id' column
print(df['client_id'])

# Selecting the first row by position
print(df.iloc[0])

# Selecting the row with the index label 0
print(df.loc[0])
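With the default integer index, iloc and loc look identical, so here is a minimal sketch that sets a labeled index to show the difference (it reuses the DataFrame above): 

df_labeled = df.set_index('client_id')
print(df_labeled.loc['C35S'])  # row selected by label
print(df_labeled.iloc[1])      # the same row selected by position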

Filtering is the process of selecting a subset of data based on given criteria. For example, we may want to select only those customers whose transaction count is greater than 60: 

filtered_df = df[df['count'] > 60]
print(filtered_df)
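Conditions can be combined with the element-wise operators & (and) and | (or); each condition must be wrapped in parentheses. A small sketch based on the same DataFrame: 

# Customers with more than 60 transactions whose ID ends with 'S'
filtered_df = df[(df['count'] > 60) & (df['client_id'].str.endswith('S'))]
print(filtered_df)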

Data sorting is a simple process that we can perform using the sort_values() method: 

sorted_df = df.sort_values('count')
print(sorted_df)
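sort_values sorts in ascending order by default; the ascending parameter reverses this, and passing a list of column names sorts by several keys: 

sorted_desc = df.sort_values('count', ascending=False)  # largest values first
sorted_multi = df.sort_values(['client_id', 'count'])   # sort by two columns
print(sorted_desc)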

Column operations: adding, deleting, renaming

To add a new column to the DataFrame, we can simply assign the data to the new column, as in the example below: 

df['amount'] = [1200, 4500, 3000, 28000, 700]  # We add a column with the total amounts of the transactions
print(df)

To delete a column, we will use the drop() method: 

df = df.drop('amount', axis=1)
print(df)

You can rename a column using the rename() method: 

df = df.rename(columns={'client_id': 'client_code', 'count': 'quantity'})
print(df)

Using apply and map functions

The apply and map functions allow you to apply a chosen function to every element of a Series or DataFrame. For example, we can apply a function from the NumPy library to the 'quantity' column: 

import numpy as np

df['log_quantity'] = df['quantity'].apply(np.log)
print(df)
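apply also accepts anonymous functions. As a small sketch, the hypothetical 10% discount below is purely illustrative: 

# Apply a custom lambda element-wise (illustrative discount calculation)
df['discounted'] = df['quantity'].apply(lambda x: round(x * 0.9, 1))
print(df)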

The map method is used similarly, but it works only on a Series. It is often used to replace values based on a dictionary. For example, we may want to prefix each customer identifier with an additional digit. 

code_map = {
    'C34P': '0C34P',
    'C35S': '1C35S',
    'C35P': '1C35P',
    'C97S': '0C97S',
    'C58S': '0C58S'
}

df['client_code'] = df['client_code'].map(code_map)
print(df)

Pandas: data cleaning and repairing missing data

Working with data from real sources practically always involves cleaning or correcting it. The data often contains gaps, duplicates, or data types unsuitable for analysis. In this part, I will discuss these problems and show you how to deal with them using Pandas tools. 

Handling missing data (NaN) 

Missing data is marked as NaN (Not a Number). Pandas offers several methods to handle it, such as: 

  • Filling in missing data with a specific value 
  • Deleting rows with missing data 

The fillna() method allows you to fill in missing data with a specific value or using a specific strategy (e.g. 'forward fill' – ffill, 'backward fill' – bfill): 

import pandas as pd
import numpy as np

data = {
    'A': [1, 2, np.nan],
    'B': [5, np.nan, np.nan],
    'C': [1, 2, 3]
}

df = pd.DataFrame(data)

df_filled_zeros = df.fillna(value=0)  # We fill in the missing data with the value 0
print(df_filled_zeros)
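Since forward and backward fill were mentioned above, here is a minimal sketch of both; in recent pandas versions the dedicated ffill() and bfill() methods are the preferred spelling: 

df_ffill = df.ffill()  # propagate the last valid value forward
df_bfill = df.bfill()  # propagate the next valid value backward
print(df_ffill)
print(df_bfill)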

It is also possible to delete rows with missing data. In a large dataset with few "broken" rows, this should not have a large impact on data quality, but in a small dataset, deleting several rows can significantly affect the subsequent analysis. If you do decide to delete the selected rows, you can use the dropna() method: 

df_dropped = df.dropna()  # Removing rows with missing data
print(df_dropped)

Sometimes our dataset contains numerous duplicates that are unnecessary from the analysis perspective. Removing them is especially worthwhile when they make up a large share of the set: getting rid of them reduces memory usage and makes operations, e.g. on DataFrame columns, more efficient. Pandas provides the drop_duplicates() method, which allows you to easily remove duplicates (note the deliberately repeated row in the example below): 

data = {
    'client_id': ['C34P', 'C35S', 'C35P', 'C35P', 'C97S', 'C58S'],
    'count': [11, 244, 31, 31, 458, 63]
}

df = pd.DataFrame(data)
df = df.drop_duplicates()  # We remove duplicates; the repeated 'C35P' row is dropped
print(df)
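By default, drop_duplicates compares whole rows. The subset parameter restricts the comparison to chosen columns, and keep controls which occurrence survives: 

# Keep only the first row per client, regardless of the other columns
df_unique_clients = df.drop_duplicates(subset=['client_id'], keep='first')
print(df_unique_clients)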

You can learn more about working with missing data from the extensive documentation on Pydata.org: Working with missing data — pandas 2.0.3 documentation (pydata.org)

If you work on data provided by other people or companies, you may encounter numeric data (integers, floating point numbers) presented as strings of characters. For example, the integer 200 is stored in the DataFrame as the string '200'. The interpreter will treat it as text, not a number. If you want to perform statistical or mathematical operations on such data, you must change the column's data type (in this case from str to int). You can do this using the astype() method: 

import pandas as pd

data = {
    'client_id': ['C34P', 'C35S', 'C35P', 'C97S', 'C58S'],
    'count': ['11', '244', '31', '458', '63']  # numbers stored as strings
}

df = pd.DataFrame(data)
df['count'] = df['count'].astype(int)  # We change the data type of the 'count' column to int
print(df)
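If the column may contain malformed entries, pd.to_numeric with errors='coerce' is a more forgiving alternative: invalid values become NaN instead of raising an error. 

df['count'] = pd.to_numeric(df['count'], errors='coerce')  # invalid entries become NaN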

In conclusion, cleaning data is usually a necessary step in the process of data processing and analysis. Pandas comes with many tools to make it easier. If you want to learn more about working with text files, here is a link to the documentation: Working with text data — pandas 2.0.3 documentation (pydata.org). In the next part, I will move on to the topic of aggregation and grouping data. 

Basic statistical operations and data grouping

Once we have clean and properly formatted data, we can proceed to the analysis. In this paragraph, I will show you some basic statistics, data grouping, and operations on indexes and multi-indexes. 

Basic statistics calculations 

Pandas offers functions for calculating basic statistics, such as:  

  • mean – the average value,  
  • median – the median value in the data set,  
  • mode – the most common value in the data set, 
  • standard deviation. 

Using data samples generated from a Gaussian (normal) distribution, we will create a DataFrame and compute the statistics listed above. 

import pandas as pd
import numpy as np

data = {
    'A': np.random.normal(0, 1, 100),
    'B': np.random.normal(1, 2, 100),
    'C': np.random.normal(-1, 0.5, 100)
}

df = pd.DataFrame(data)

print(df.mean())    # mean
print(df.median())  # median
print(df.mode())    # mode
print(df.std())     # standard deviation
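A quick way to obtain several of these statistics at once is the describe() method, which summarizes every numeric column: 

print(df.describe())  # count, mean, std, min, quartiles and max per column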

Of course, this example data only serves to show how the Pandas tools work; I leave testing their full capabilities on real data to you. To familiarize yourself with the specification of the discussed functions and the many others available on a DataFrame, I encourage you to go through the documentation: pandas.DataFrame.mean — pandas 2.0.3 documentation (pydata.org) 

Grouping data with the groupby function 

The groupby() function allows you to group data based on specific columns. If you use SQL on a daily basis, grouping should not be new to you. After grouping the data, we can calculate statistics for each group: 

import pandas as pd

data = {
    'product_id': ['product_34P', 'product_34P', 'store_35S', 'product_35P', 'store_97S', 'product_35P', 'product_34P', 'store_58S'],
    'count': [12, 24, 36, 60, 18, 48, 20, 72],
    'price': [1.2, 0.5, 0.75, 1.25, 2.0, 1.3, 0.55, 0.8],
    'store': ['EU1', 'UK1', 'EU2', 'EU2', 'UK2', 'EU1', 'UK2', 'EU1']
}

df = pd.DataFrame(data)
grouped = df.groupby('product_id')

# Average count and price for each product (the non-numeric 'store' column is excluded)
print(grouped[['count', 'price']].mean())
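groupby also supports a different aggregation per column through agg(): 

# Total quantity and average price per product
print(grouped.agg({'count': 'sum', 'price': 'mean'}))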

Indexing in Pandas: operations on indexes and multi-indexes 

Indexes are an important part of the discussed data structures in the Pandas library. They allow quick access to data. Basic operations include, for example: 

  • resetting the index,  
  • setting a new index, 
  • sorting by index. 

Multi-indexes allow you to index across many levels, which is especially useful for hierarchical data, because the data can then be analyzed at different levels of the hierarchy. For example, with a dataset of product sales in different countries and regions, a multi-index lets you observe global sales trends as well as more detailed trends at the level of selected countries or regions – see the example code below, which reuses the DataFrame with product sales data created earlier. 

# Setting 'product_id' as the index
df = df.set_index('product_id')
print(df)

# Resetting the index
df = df.reset_index()
print(df)

# Setting a multi-index
df = df.set_index(['product_id', 'count'])
print(df)
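With the multi-index in place, .loc can select at one or both levels; the labels below come from the data above: 

print(df.loc['product_34P'])        # all rows for one product (first index level)
print(df.loc[('product_34P', 12)])  # a single row addressed by both levels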

Data visualization using Pandas, NumPy and Matplotlib libraries 

Data visualization is a key element of any data analysis. It helps to better understand the data structure and makes it easier to present results. Pandas offers built-in data visualization tools based on the Matplotlib library. Example code for creating several charts is below: 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sample data
df = pd.DataFrame({
    'A': np.random.randn(1000),
    'B': np.random.randn(1000),
    'C': np.random.randn(1000)
})

# Line chart
df['A'].plot(kind='line')
plt.show()

# Histogram
df['B'].plot(kind='hist', bins=20)
plt.show()

# Scatter plot
df.plot(kind='scatter', x='A', y='B')
plt.show()

# Bar chart (continuous values are binned first, so each bar covers a range)
pd.cut(df['C'], bins=10).value_counts().sort_index().plot(kind='bar')
plt.show()

Analyzing data using Pandas

After discussing the basic features and operations that Pandas offers, we will move on to a practical example of analyzing a real dataset. For this purpose, we will use the public dataset on the survival of Titanic passengers. The process can be split into the following steps: 

  • Loading and initial inspection of the data 
  • Data cleansing 
  • Data analysis 
  • Data visualization 

You can download the titanic.csv data file here:  

We will start by loading the data and examining its structure: 

import pandas as pd

# Loading the data
df = pd.read_csv('titanic.csv')

# Displaying the first 5 rows
print(df.head())

# Basic information about the data
print(df.info())

Then we will perform a basic data cleansing – for example, removing the missing values: 

# Removing rows with missing data
df = df.dropna()
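A word of caution: in the commonly used version of this dataset, the Cabin column is mostly empty, so a blanket dropna() removes most rows. A gentler sketch (the column names Age and Cabin assume the popular Kaggle variant of the file) fills or drops selectively instead: 

# Fill missing ages with the median and drop the sparse 'Cabin' column
df['Age'] = df['Age'].fillna(df['Age'].median())
df = df.drop(columns=['Cabin'])
df = df.dropna()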

Either way, once the missing data has been handled, we can assume that the data is ready for analysis. Let's now see how the ticket class influenced the chances of survival: 

# Grouping data by ticket class and calculating the average survival rate
print(df.groupby('Pclass')['Survived'].mean())

We can also visualize our results to better understand the dependencies in the data: 

import matplotlib.pyplot as plt

# Bar chart of survival rate by ticket class
df.groupby('Pclass')['Survived'].mean().plot(kind='bar')
plt.ylabel('Survival Rate')
plt.show()

To sum up, this simple example shows how easy it is to use Pandas to load, clean, analyze, and visualize your data. I encourage you to familiarize yourself with the sample dataset (CSV file attached) and analyze selected dependencies on your own. 

When working with a DataFrame, you may come across the distinction between a "view" and a "copy" of the original DataFrame. This matters when we try to assign new values to a dataset: 

  • When using the view – the changes will affect the original DataFrame, 
  • When we use a copy – we will not change the original DataFrame. 

It is worth using the .loc[] or .iloc[] accessors to select data and assign values, to avoid unnecessary problems. 
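A minimal sketch of the problem, reusing the hypothetical 'count' column from earlier examples: chained indexing may modify a temporary copy and trigger a SettingWithCopyWarning, while a single .loc call modifies the original reliably. 

# Chained indexing may operate on a copy and leave df unchanged
df[df['count'] > 60]['count'] = 0

# Safe: one .loc call both selects and assigns on the original DataFrame
df.loc[df['count'] > 60, 'count'] = 0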

Furthermore, Pandas can "deduce" data types automatically when data is loaded, e.g. from a file or database. However, the result is not always accurate or in line with the user's assumptions. Therefore, it is worth checking the data types in the created DataFrame with the .dtypes attribute. In the event of a mismatch, use the .astype() method that you are already familiar with and convert the selected data types. 

When it comes to missing data (i.e. NaN values), performing operations on it may lead to incorrect or unexpected results; for example, adding a number to NaN yields NaN. So remember to handle missing data, e.g. with the .dropna() and .fillna() methods, to delete or fill in missing values accordingly. 

Here are some of the most popular functions in the Pandas library: 

Loading and saving data:  

  • read_csv(), to_csv(), read_excel(), to_excel(), read_sql(), to_sql() 

Data selection:  

  • .loc[], .iloc[] 

Data manipulation:  

  • drop(), rename(), set_index(), reset_index(), pivot(), melt() 

Data cleansing:

  • dropna(), fillna(), replace(), duplicated(), drop_duplicates() 

Data analysis: 

  • describe(), value_counts(), groupby(), corr() 

Stats: 

  • mean(), median(), min(), max(), std(), quantile() 

Operations on strings: 

  • str.lower(), str.upper(), str.contains(), str.replace(), str.split(), str.join() 

Data visualization:

  • plot(), hist(), boxplot() 

Of course, the functions above offer extensive parameterization, so I encourage you to go through the documentation to better understand them and adapt their use to your needs (see the documentation: General functions).  


Fundamentals of Pandas – summary 

Pandas is a popular library that is indispensable for working with and analyzing data in the Python environment. The introduction above certainly does not cover the entire topic, but it shows the basic capabilities and points out possible directions for exploring the library further. Pandas is particularly effective when combined with NumPy, Matplotlib, Seaborn and other libraries, depending on the task at hand. 

