Introduction to Pandas: A complete guide to Python’s most popular data analysis library.

In recent years, the Python programming language has gained a lot of popularity and, because of its qualities, has become many developers' first choice for coding. Python has an extremely large number of packages available for almost every task. One of them is pandas, which is widely used for data analysis in Python. In this blog, we will go through the fundamentals of pandas.

Pandas is an open source Python library that provides high-performance, easy-to-use, flexible and expressive data structures designed to make working with structured (tabular, multi-dimensional, potentially heterogeneous) and time series data easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Pandas is built on NumPy, which is also an open source Python library.

Pandas is well suited for different kinds of data, such as:

  • Tabular data, such as an SQL table or Excel spreadsheet.
  • Ordered and unordered time series data.
  • Homogeneous or heterogeneous arbitrary matrix data.
  • Any other form of statistical data.

For a data scientist, working with data consists of tasks like cleaning the data, modelling it, interpreting the results and organizing them in a suitable format for further visualization. Pandas is well suited to all of these tasks. Before diving into the practicals, let's first install the package on our system.

Installation

Before using pandas in Python code, we need to install it, as it does not come preinstalled with Python (unless you are using Anaconda, which bundles it). You can install pandas using pip by running the following command in the command prompt. The commands in this blog were run on Windows; pip works the same way on macOS and Linux, although the command may be pip3 depending on your setup. I will provide links wherever possible.

 pip install pandas 

This command should run without any error. If an error comes up, the installation was not successful.
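One quick way to confirm the installation is to import the package from Python and print its version. This is only a minimal check, and the version printed will depend on what pip installed on your machine.

```python
# Check that pandas can be imported and report which version was installed.
import pandas as pd

print(pd.__version__)
```

If the import fails with a ModuleNotFoundError, pandas was most likely installed for a different Python interpreter than the one you are running.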

Pandas is so popular for data analysis because of its powerful data structures. There are two primary data structures in pandas:

  • Series (1-dimensional).
  • DataFrame (2-dimensional).

These two data structures handle the vast majority of typical use cases in fields like finance, statistics, social science, and many areas of engineering.

Series

A Series is a one-dimensional labeled array, typically backed by a NumPy array, that is capable of holding data of any type (integer, string, float and even Python objects). The axis labels are collectively called the index. In simpler terms, a Series is like a single column in an Excel spreadsheet.

A pandas Series can be created using the Series() constructor, which has many parameters; the most important ones are described below.

  • _data_ : Accepts various forms of data, such as an ndarray, list, dictionary or constant.
  • _index_ : The values to use as index labels. If not provided, the default is the integers 0 to n-1 (like np.arange(n)), where n is the number of elements in the data.
  • _dtype_ : The data type. If None, the data type is inferred from the data.
  • _copy_ : Whether to copy the input data. By default it is False.
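To make the parameters above a little more concrete, here is a small sketch that passes different kinds of data to the constructor. The variable names and values are made up purely for illustration.

```python
import pandas as pd

# data as a list, no index given: labels default to 0, 1, 2
s_default = pd.Series([10, 20, 30])
print(list(s_default.index))  # [0, 1, 2]

# data as a dictionary: the keys become the index labels
s_dict = pd.Series({"a": 1.5, "b": 2.5})
print(list(s_dict.index))  # ['a', 'b']

# data as a constant: the value is repeated for every label in index
s_const = pd.Series(7, index=["x", "y", "z"])
print(s_const.tolist())  # [7, 7, 7]

# dtype given explicitly instead of being inferred
s_float = pd.Series([1, 2, 3], dtype="float64")
print(s_float.dtype)  # float64
```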

Creating series

#importing numpy and pandas
import numpy as np
import pandas as pd

#creating a numpy array of values
data = np.array([15000, 250000, 500000, 1000000])
#creating a list of index labels
index_val = ['2011', '2012', '2013', '2014']
#creating the series
ser = pd.Series(data, index=index_val)
ser
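As a rough illustration of what the index labels are for, the sketch below rebuilds the same series and looks up a value by its label. It also shows that the values can be pulled back out as a NumPy array.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.array([15000, 250000, 500000, 1000000]),
                index=["2011", "2012", "2013", "2014"])

# Look up a single value by its index label
print(ser["2013"])  # 500000

# The labels and the underlying values are available separately
print(list(ser.index))  # ['2011', '2012', '2013', '2014']
print(type(ser.to_numpy()))  # <class 'numpy.ndarray'>
```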

DataFrame

A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A DataFrame consists of three components: rows, columns and the data. It can also be considered a combination of multiple pandas Series, as every column in a DataFrame is a Series.

A pandas DataFrame can be created using the DataFrame() constructor, which has many parameters; the most important ones are described below.

  • _data_ : Accepts various forms of data, such as an ndarray, Series, map, list, dictionary, constant or another DataFrame.
  • _index_ : The values to use as row labels. If not provided, the default is the integers 0 to n-1 (like np.arange(n)), where n is the number of rows in the data.
  • _columns_ : The column labels. If not provided and the data carries no column labels of its own, the default is the integers 0 to n-1, where n is the number of columns.
  • _dtype_ : A single data type to apply to every column. If None, the type of each column is inferred.
  • _copy_ : Whether to copy the input data. The default depends on the pandas version and the type of input (older versions default to False).
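The parameters above can be tried out with a short sketch like the following. The column names and values here are invented just for the example.

```python
import pandas as pd

# data as a list of lists with no index or columns given:
# both default to 0, 1, 2, ...
df_default = pd.DataFrame([[1, "a"], [2, "b"]])
print(list(df_default.index))    # [0, 1]
print(list(df_default.columns))  # [0, 1]

# data as a dictionary: the keys become column labels,
# and index sets the row labels
df_dict = pd.DataFrame({"price": [10, 20], "qty": [3, 4]}, index=["r1", "r2"])
print(list(df_dict.columns))  # ['price', 'qty']
print(df_dict.shape)          # (2, 2)

# dtype is applied to every column
df_float = pd.DataFrame([[1, 2], [3, 4]], columns=["x", "y"], dtype="float64")
print(df_float.dtypes)
```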

Creating dataframe

#importing pandas
import pandas as pd

#creating a list of lists, one inner list per row
data = [['1','Thor','Mjolnir'],['2', 'Cap America', 'Shield'],['3', 'Iron Man', 'Armour'],['4', 'Black Widow', 'Combat'], ['5', 'Hawk Eye', 'Arrow']]
#creating a list of column names
columns = ['Avenger number', 'Name', 'Weapon']
#creating the dataframe; columns is passed by keyword because the second positional argument is index
df = pd.DataFrame(data, columns=columns)
df
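Earlier it was mentioned that every column in a dataframe is a series. A small sketch of that idea, using a shortened version of the same data, could look like this:

```python
import pandas as pd

data = [["1", "Thor", "Mjolnir"], ["2", "Cap America", "Shield"]]
df = pd.DataFrame(data, columns=["Avenger number", "Name", "Weapon"])

# Selecting one column returns a Series with the dataframe's row labels
names = df["Name"]
print(type(names).__name__)  # Series
print(names.tolist())        # ['Thor', 'Cap America']

# shape is (number of rows, number of columns)
print(df.shape)  # (2, 3)
```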

Pandas has a lot to offer, and data structures like Series and DataFrame are too powerful to cover fully in a single blog, so this is enough for now. I will cover the practical use of Series and DataFrame, and maybe practical data analysis using pandas, in a future post. Until then, keep learning.

Thank you for reading. Enjoy Python.
