Python Pandas Interview Questions

Here's a list of frequently asked Python Pandas interview questions and answers:

1. What is Pandas?

Pandas is an open-source Python library providing high-performance data manipulation tools. Its name comes from "Panel Data," a term in econometrics referring to multidimensional data. Created by Wes McKinney in 2008, Pandas simplifies data analysis by handling five crucial steps: loading, manipulating, preparing, modeling, and analyzing data regardless of its source.

2. What are the different Data Structures in Pandas?

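Pandas is built around two primary data structures, Series and DataFrame. The Index object supports both, and the older Panel type is no longer available: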

  • Series: A one-dimensional labeled array holding any data type. Think of it like a single column in a spreadsheet.
  • DataFrame: A two-dimensional labeled data structure, similar to a spreadsheet or SQL table. It can have columns of different data types (numbers, text, booleans, etc.).
  • Index: An immutable array used for labeling axes (rows and columns) in both Series and DataFrames.
  • Panel (Removed): A three-dimensional data structure, deprecated in Pandas 0.20 and removed in 0.25. Use a DataFrame with a MultiIndex (or a library such as xarray) instead.

3. Define a Pandas Series.

A Series is a one-dimensional labeled array. The row labels are called the index. You can easily create a Series from lists, tuples, or dictionaries using the pd.Series() method. A Series can only have one column.

4. How to calculate the standard deviation of a Pandas Series?

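Use the std() method. By default it returns the sample standard deviation (ddof=1, dividing by n - 1); pass ddof=0 for the population standard deviation.

Syntax

Series.std(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs)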
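
As a rough illustration of how ddof changes the result, the sketch below computes both variants on a small made-up Series (the values are chosen only so the population result comes out as a round number).

```python
import pandas as pd

# Illustrative data: mean is 5, squared deviations sum to 32
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = s.std()            # ddof=1 (default): divides by n - 1
population_std = s.std(ddof=0)  # ddof=0: divides by n

print(sample_std)
print(population_std)
```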

5. Define a Pandas DataFrame.

A DataFrame is Pandas' most common data structure. It's a two-dimensional array with labeled axes (rows and columns). Key features include:

  • Columns can have different data types (integers, booleans, strings, etc.).
  • It's essentially a dictionary of Series, with both row and column indices.

6. What are the significant features of the Pandas library?

  • Memory Efficiency: Optimized for handling large datasets efficiently.
  • Data Alignment: Automatic and explicit alignment of data based on labels or indices.
  • Reshaping: Easy reorganization and transformation of data structures.
  • Merge and Join: Combining datasets from different sources, similar to SQL joins.
  • Time Series Support: Powerful tools for working with time-stamped data.
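
To make the merge and join feature more concrete, here is a minimal sketch using two small made-up tables. The names employees, salaries, and emp_id are assumptions for illustration only.

```python
import pandas as pd

# Two hypothetical tables that share the key column "emp_id"
employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
salaries = pd.DataFrame({"emp_id": [2, 3, 4], "salary": [50000, 60000, 70000]})

# Inner join: keep only keys present in both tables
inner = employees.merge(salaries, on="emp_id", how="inner")

# Left join: keep every employee; missing salaries become NaN
left = employees.merge(salaries, on="emp_id", how="left")

print(inner)
print(left)
```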

7. Explain Reindexing in Pandas.

Reindexing conforms a DataFrame to a new set of labels. Labels in the new index that were not in the old one get NaN (Not a Number) as their values, and labels absent from the new index are dropped. It returns a new object unless the new index is identical to the old one and copy=False.

8. What Pandas tool creates a scatter plot matrix?

scatter_matrix(), available as pandas.plotting.scatter_matrix. A scatter plot matrix visualizes pairwise relationships between the numeric variables in a dataset using a grid of scatter plots.

9. How to create a DataFrame in Pandas?

You can create a DataFrame using:

  • Lists:
Example

import pandas as pd
a = ['Python', 'Pandas']
info = pd.DataFrame(a)
print(info)
Output

        0
0  Python
1  Pandas
  • Dictionaries of lists (or NumPy arrays):
Example

import pandas as pd
info = {'ID' :[101, 102, 103],'Department' :['B.Sc','B.Tech','M.Tech']}
info = pd.DataFrame(info)
print(info)
Output

    ID  Department
0  101        B.Sc
1  102      B.Tech
2  103      M.Tech

10. Explain Categorical Data in Pandas.

A categorical data type represents categorical variables (like gender, country). It's useful for:

  • Saving memory with string variables having few unique values.
  • Specifying the logical order of categories (e.g., "low," "medium," "high").
  • Signaling to other libraries that a column is categorical.
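
The points above can be sketched as follows, assuming a small made-up Series of "low"/"medium"/"high" labels. The variable names are illustrative.

```python
import pandas as pd

sizes = pd.Series(["low", "high", "medium", "low", "high"])

# Declare the categories and their logical order explicitly
ordered_type = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
sizes_cat = sizes.astype(ordered_type)

# Each value is stored as a small integer code pointing into the categories
print(sizes_cat.cat.codes.tolist())

# Sorting and comparisons follow the declared order, not alphabetical order
print(sizes_cat.sort_values().tolist())
print((sizes_cat > "low").tolist())
```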

11. How to create a Series from a dictionary?

Pass the dictionary to the pd.Series() constructor. If you don't specify an index, the dictionary keys become the index, in the dictionary's insertion order (very old Pandas versions sorted the keys instead).

Example

import pandas as pd
info = {'x': 0., 'y': 1., 'z': 2.}
a = pd.Series(info)
print(a)
Output

x    0.0
y    1.0
z    2.0
dtype: float64

12. How to create a copy of a Pandas Series?

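Use Series.copy(deep=True), the default, for a deep copy that duplicates both the data and the index. deep=False creates a shallow copy: a new Series object that refers to the original's data and index instead of duplicating them.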

13. How to create an empty DataFrame?

Example

import pandas as pd
info = pd.DataFrame()
print(info)
Output

Empty DataFrame
Columns: []
Index: []

14. How to add a column to a Pandas DataFrame?

You can add a new column using a Pandas Series or by creating it from existing columns.

Example

import pandas as pd
info = {'one': pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
        'two': pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
info = pd.DataFrame(info)
info['three'] = pd.Series([20, 40, 60], index=['a', 'b', 'c'])
info['four'] = info['one'] + info['three']
print(info)

15. Adding Indices, Rows, or Columns to a Pandas DataFrame

Adding an Index:

If you don't specify an index when creating a DataFrame, it defaults to a numerical index (0, 1, 2...). To use your own labels, pass the index argument to pd.DataFrame(), assign to df.index, or call set_index().

Adding Rows:

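Use .loc (label-based indexing) to add a row by assigning to a label that does not exist yet. .iloc (integer-based indexing) can overwrite existing rows but cannot enlarge the DataFrame, and .ix was removed in Pandas 1.0. To append several rows at once, use pd.concat().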

Adding Columns:

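Assign to a new column name with df['new'] = values or df.loc[:, 'new'] = values. Use insert() to place the column at a specific position, or assign() to get a new DataFrame. .iloc cannot create new columns.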
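
A minimal sketch of adding rows and columns, assuming a small made-up DataFrame with string row labels. Note that only .loc and plain column assignment can enlarge a DataFrame; .iloc cannot.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Ben"], "score": [90, 85]}, index=["r1", "r2"])

# Add one row by assigning to a label that does not exist yet
df.loc["r3"] = ["Cy", 70]

# Add a column derived from an existing one
df["passed"] = df["score"] >= 80

# Append several rows at once by concatenating another DataFrame
more = pd.DataFrame({"name": ["Di"], "score": [95], "passed": [True]}, index=["r4"])
df = pd.concat([df, more])

print(df)
```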

16. Deleting Indices, Rows, or Columns from a Pandas DataFrame

Deleting an Index:

  • Reset the index using df.reset_index().
  • Remove the index name with df.index.name = None.
  • Remove duplicate index values by resetting and dropping duplicates.
  • Remove an index along with its row using df.drop().

Deleting a Column:

Use the drop() method with axis=1 (or axis='columns') to delete a column. Set inplace=True to modify the DataFrame directly. Note that df.drop_duplicates() removes duplicate rows, not columns; its subset parameter only controls which columns are compared.

Removing a Row:

Use df.drop_duplicates() to remove duplicate rows. Use the drop() method specifying the index of the rows to remove.
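
The deletion options above might look like this in practice. The DataFrame and its labels are made up for illustration; each call returns a new object and leaves df unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "city": ["Oslo", "Rome", "Rome", "Lima"],
        "temp": [5, 20, 20, 25],
        "note": ["a", "b", "b", "c"],
    },
    index=pd.Index(["w", "x", "y", "z"], name="station"),
)

no_note = df.drop(columns="note")   # delete a column
no_w = df.drop(index="w")           # delete a row by its label
deduped = df.drop_duplicates()      # drop rows whose values repeat an earlier row
flat = df.reset_index(drop=True)    # discard the index, back to 0, 1, 2...

print(deduped)
```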

17. Renaming Indices or Columns of a Pandas DataFrame

Use the rename() method. Provide a dictionary mapping old names to new names for indices or columns. Use inplace=True to modify the DataFrame directly.
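
For example, a short sketch with made-up labels, showing that rename() returns a new DataFrame unless inplace=True is passed:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

# Map old names to new names; labels not in the mapping are left alone
renamed = df.rename(columns={"a": "alpha"}, index={"x": "first"})

print(renamed)
print(df.columns.tolist())  # the original is unchanged
```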

18. Iterating over a Pandas DataFrame

Several methods are available:

  • iterrows(): Iterates row by row.
  • items(): Iterates column by column as (label, Series) pairs. The older iteritems() alias was removed in Pandas 2.0.
  • itertuples(): Iterates over rows as namedtuples.

Choose the method best suited to your task.
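
A small sketch of the three iteration styles on a made-up DataFrame, using items(), the current name for column-wise iteration. The last line is a reminder that a vectorized operation is usually preferable to an explicit loop when one exists.

```python
import pandas as pd

df = pd.DataFrame({"player": ["Ana", "Ben"], "score": [90, 85]})

# iterrows(): (index label, row as a Series)
row_pairs = [(idx, row["player"]) for idx, row in df.iterrows()]

# items(): (column label, column as a Series)
column_names = [label for label, column in df.items()]

# itertuples(): rows as namedtuples, typically faster than iterrows()
tuples = [(t.player, t.score) for t in df.itertuples(index=False)]

# Vectorized alternative to looping over rows
doubled = df["score"] * 2

print(row_pairs, column_names, tuples, doubled.tolist())
```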

19. Getting Items in Series A Not Present in Series B

Use the isin() method with boolean indexing:

Example

import pandas as pd
p1 = pd.Series([2, 4, 6, 8, 10])
p2 = pd.Series([8, 10, 12, 14, 16])
p1[~p1.isin(p2)]
Output

0    2
1    4
2    6
dtype: int64

20. Getting Items Not Common to Both Series A and Series B

Use NumPy's union1d() and intersect1d():

Example

import pandas as pd
import numpy as np
p1 = pd.Series([2, 4, 6, 8, 10])
p2 = pd.Series([8, 10, 12, 14, 16])
p_u = pd.Series(np.union1d(p1, p2))
p_i = pd.Series(np.intersect1d(p1, p2))
p_u[~p_u.isin(p_i)]
Output

0     2
1     4
2     6
5    12
6    14
7    16
dtype: int64

21. Getting Minimum, 25th Percentile, Median, 75th Percentile, and Maximum of a Numeric Series

Use NumPy's percentile() function:

Example

import pandas as pd
import numpy as np
p = pd.Series(np.random.normal(14, 6, 22))
np.percentile(p, q=[0, 25, 50, 75, 100])

22. Getting Frequency Counts of Unique Items of a Series

Use the value_counts() method:

Example

import pandas as pd
import numpy as np
p = pd.Series(np.take(list('pqrstu'), np.random.randint(6, size=17)))
p.value_counts()

23. Converting a NumPy Array to a DataFrame of a Given Shape

Reshape the NumPy array using reshape() and then create a DataFrame:

Example

import pandas as pd
import numpy as np
p = pd.Series(np.random.randint(1, 7, 35))
info = pd.DataFrame(p.values.reshape(7, 5))
print(info)

24. Converting a Series to a DataFrame

Use the to_frame() method:

Example

import pandas as pd
s = pd.Series(["a", "b", "c"], name="vals")
s.to_frame()

25. What is a Pandas NumPy Array?

NumPy (Numerical Python) is a library for numerical computations in Python, providing efficient operations on arrays. Pandas uses NumPy arrays internally for its data structures.

26. Converting a DataFrame to a NumPy Array

Use the to_numpy() method.
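
A brief sketch with made-up data. When columns have different numeric types, the resulting array uses a common dtype that can hold all of them.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

arr = df.to_numpy()  # int and float columns are upcast to float64

print(arr)
print(arr.dtype, arr.shape)
```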

27. Converting a DataFrame to an Excel File

Use the to_excel() method. For multiple sheets, use ExcelWriter.

28. Sorting a DataFrame

You can sort DataFrames in two ways:

By Label (Index): Use the sort_index() method. By default, it sorts row labels (index) in ascending order. You can specify the axis and ascending parameters to control sorting.

By Value: Use the sort_values() method. Specify the column name(s) to sort by using the by parameter.
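
Both approaches, sketched on a made-up DataFrame whose index is deliberately out of order:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Cy", "Ana", "Ben"], "score": [70, 90, 85]},
    index=[2, 0, 1],
)

by_label = df.sort_index()                              # rows ordered by index: 0, 1, 2
by_score = df.sort_values(by="score", ascending=False)  # highest score first

print(by_label)
print(by_score)
```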

29. What is a Time Series in Pandas?

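Time series data represents data points indexed in time order. Pandas provides tools for working with it, including date ranges, date-based slicing, resampling, shifting, and rolling windows. Forecasting itself is usually done with separate libraries (for example statsmodels), with Pandas preparing the data.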
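
As an illustration of typical time series handling, the sketch below builds a made-up daily series and applies date slicing, resampling, and a rolling window.

```python
import pandas as pd

# Six consecutive daily observations (illustrative values)
idx = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

sliced = ts["2024-01-02":"2024-01-03"]      # date-string slicing includes both ends
two_day_sum = ts.resample("2D").sum()       # downsample into 2-day bins
rolling_mean = ts.rolling(window=3).mean()  # 3-day moving average

print(two_day_sum)
print(rolling_mean)
```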

30. What is a Time Offset?

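In Pandas, a time offset (DateOffset) is a calendar-aware increment of time, such as one business day or one month end, that can be added to or subtracted from a timestamp. It is distinct from a UTC offset, the difference between local time and Coordinated Universal Time (UTC), which Pandas handles through time zone support (tz_localize() and tz_convert()).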
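
Within Pandas, offsets are usually expressed with DateOffset objects that follow calendar rules. A minimal sketch with illustrative dates:

```python
import pandas as pd

# Adding one calendar month clips to the last valid day (2024 is a leap year)
plus_month = pd.Timestamp("2024-01-31") + pd.DateOffset(months=1)

# Roll forward to the end of the current month
month_end = pd.Timestamp("2024-02-10") + pd.offsets.MonthEnd(1)

# One business day after Friday 2024-03-01 is Monday 2024-03-04
next_bday = pd.Timestamp("2024-03-01") + pd.offsets.BDay(1)

print(plus_month, month_end, next_bday)
```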

31. Define Time Periods.

Time periods represent durations like days, months, quarters, or years. Pandas' Period class helps manage these.
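
A short sketch of the Period class, assuming monthly periods and illustrative dates:

```python
import pandas as pd

p = pd.Period("2024-03", freq="M")  # the whole month of March 2024

next_p = p + 1          # arithmetic moves in units of the frequency
start = p.start_time    # first instant of the period
days = p.days_in_month  # number of days the period spans

# A regular sequence of periods
months = pd.period_range("2024-01", periods=3, freq="M")

print(next_p, start, days)
print(months)
```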

32. How to Convert a String to a Date

Use the strptime() method of the datetime class (from the datetime module):

Example

from datetime import datetime
dmy_str1 = 'Saturday, July 14, 2018'
dmy_str2 = '14/7/17'
dmy_str3 = '14-07-2017'
dmy_dt1 = datetime.strptime(dmy_str1, '%A, %B %d, %Y')  # weekday name, month name, day, 4-digit year
dmy_dt2 = datetime.strptime(dmy_str2, '%d/%m/%y')       # day/month/2-digit year
dmy_dt3 = datetime.strptime(dmy_str3, '%d-%m-%Y')       # day-month-4-digit year
print(dmy_dt1)
print(dmy_dt2)
print(dmy_dt3)
Output

2018-07-14 00:00:00
2017-07-14 00:00:00
2017-07-14 00:00:00
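
When the strings live in a Series or DataFrame column, pd.to_datetime() is the usual Pandas counterpart to strptime(). A minimal sketch, assuming made-up day/month/year strings:

```python
import pandas as pd

raw = pd.Series(["14/07/17", "15/07/17"])

# Parse a whole column at once with an explicit format
dates = pd.to_datetime(raw, format="%d/%m/%y")

# A single string returns a Timestamp
single = pd.to_datetime("14-07-2017", format="%d-%m-%Y")

print(dates)
print(single)
```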


33. What is Data Aggregation?

Data aggregation applies functions (like sum, min, max) across rows or columns to summarize data.
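
For instance, on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_sums = df.sum()                       # one value per column
row_sums = df.sum(axis=1)                 # one value per row
summary = df.agg(["min", "max", "mean"])  # several aggregations at once

print(col_sums)
print(row_sums)
print(summary)
```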

34. What is a Pandas Index?

Indexing in Pandas means selecting particular rows and columns of a DataFrame, which is why it is often referred to as subset selection. The Index object holds the axis labels that make these lookups efficient.

35. Define Multiple Indexing.

Multiple indexing (hierarchical indexing) allows you to have multiple levels of indices for rows or columns, useful for higher-dimensional data.
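
A minimal sketch of a two-level index, using made-up year and quarter labels:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1")],
    names=["year", "quarter"],
)
sales = pd.Series([100, 120, 130], index=idx)

year_2023 = sales.loc["2023"]       # select by the outer level only
one_cell = sales.loc[("2024", "Q1")]  # select by both levels
wide = sales.unstack("quarter")     # pivot the inner level into columns

print(year_2023)
print(wide)
```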

36. Define Reindexing.

Reindexing changes the index of a DataFrame. Use the reindex() method. Missing values in the new index will be filled with NaN.

Syntax

DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=np.nan, limit=None, tolerance=None)
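
A short sketch of reindex() with and without fill_value, on made-up labels:

```python
import pandas as pd

df = pd.DataFrame({"v": [1, 2, 3]}, index=["a", "b", "c"])

# "b" is dropped; "d" is new, so its value is NaN (and the column becomes float)
with_nan = df.reindex(["a", "c", "d"])

# Supply a replacement for missing labels instead of NaN
with_zero = df.reindex(["a", "c", "d"], fill_value=0)

print(with_nan)
print(with_zero)
```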

37. How to Set the Index.

You can set the index when creating a DataFrame or later using the set_index() method.

38. How to Reset the Index.

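Use the reset_index() method. It replaces the current index with the default numerical index (0, 1, 2...) and moves the old index into a column; pass drop=True to discard it instead.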
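
Setting and resetting the index can be sketched together on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"id": [101, 102], "dept": ["B.Sc", "B.Tech"]})

indexed = df.set_index("id")               # "id" becomes the row labels
restored = indexed.reset_index()           # "id" returns as an ordinary column
dropped = indexed.reset_index(drop=True)   # "id" is discarded entirely

print(indexed)
print(restored)
print(dropped)
```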

39. Describe Data Operations in Pandas.

Pandas offers various data operations:

  • Row and Column Selection: Selecting specific rows and columns (resulting in a Series if only one row or column is selected).
  • Data Filtering: Selecting rows based on boolean conditions.
  • Handling Null Values: Pandas represents missing data as NaN (Not a Number).
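
The three operations above might look like this on a made-up DataFrame with one missing score:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Ben", "Cy"], "score": [90.0, np.nan, 70.0]})

col = df["score"]              # a single column comes back as a Series
high = df[df["score"] > 80]    # boolean filtering; NaN compares as False
missing = df["score"].isna()   # detect null values
filled = df["score"].fillna(0) # replace nulls with a value
clean = df.dropna()            # or drop rows containing nulls

print(high)
print(filled)
print(clean)
```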

40. Define GroupBy in Pandas.

The groupby() function groups data based on specified criteria, allowing for aggregated calculations or other operations on those groups.

Syntax

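DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, dropna=True)

Older releases also accepted squeeze=False, which was removed in Pandas 2.0, and axis is deprecated in recent releases.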
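
A minimal split-apply-combine sketch, assuming a made-up table of departments and salaries:

```python
import pandas as pd

df = pd.DataFrame({"dept": ["A", "B", "A", "B"], "salary": [10, 20, 30, 40]})

# One aggregate per group
totals = df.groupby("dept")["salary"].sum()

# Several aggregates per group
stats = df.groupby("dept")["salary"].agg(["mean", "max"])

print(totals)
print(stats)
```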