Python Pandas Interview Questions
Here's a list of frequently asked Python Pandas interview questions and answers:
1. What is Pandas?
Pandas is an open-source Python library providing high-performance data manipulation tools. Its name comes from "Panel Data," a term in econometrics referring to multidimensional data. Created by Wes McKinney in 2008, Pandas simplifies data analysis by handling five crucial steps: loading, manipulating, preparing, modeling, and analyzing data regardless of its source.
2. What are the different Data Structures in Pandas?
Pandas has two primary data structures, Series and DataFrame, along with a few supporting types:
- Series: A one-dimensional labeled array holding any data type. Think of it like a single column in a spreadsheet.
- DataFrame: A two-dimensional labeled data structure, similar to a spreadsheet or SQL table. It can have columns of different data types (numbers, text, booleans, etc.).
- Index: An immutable array used for labeling axes (rows and columns) in both Series and DataFrames.
- Panel (removed): A three-dimensional data structure that was deprecated in pandas 0.20 and removed in 0.25. A DataFrame with a MultiIndex is the usual replacement.
3. Define a Pandas Series.
A Series is a one-dimensional labeled array. The row labels are called the index. You can easily create a Series from lists, tuples, or dictionaries using the pd.Series() constructor. A Series can only have one column.
4. How to calculate the standard deviation of a Pandas Series?
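Use the std() method. By default it returns the sample standard deviation (ddof=1); pass ddof=0 for the population standard deviation.
Syntax
Series.std(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs)
Older pandas versions also accepted a level parameter, which was removed in pandas 2.0.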
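As a minimal sketch of the call above, the example below compares the default sample standard deviation with the population version. The sample values are hypothetical and were chosen only because the arithmetic is easy to check by hand.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration.
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = s.std()            # ddof=1 by default: divides by n - 1
population_std = s.std(ddof=0)  # ddof=0: divides by n

print(sample_std)
print(population_std)
```

The two results differ only in the divisor, so the gap between them shrinks as the Series gets longer.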
5. Define a Pandas DataFrame.
A DataFrame is Pandas' most common data structure. It's a two-dimensional array with labeled axes (rows and columns). Key features include:
- Columns can have different data types (integers, booleans, strings, etc.).
- It's essentially a dictionary of Series, with both row and column indices.
6. What are the significant features of the Pandas library?
- Memory Efficiency: Optimized for handling large datasets efficiently.
- Data Alignment: Automatic and explicit alignment of data based on labels or indices.
- Reshaping: Easy reorganization and transformation of data structures.
- Merge and Join: Combining datasets from different sources, similar to SQL joins.
- Time Series Support: Powerful tools for working with time-stamped data.
7. Explain Reindexing in Pandas.
Reindexing conforms a DataFrame to a new index. Labels that were not present in the old index are filled with NaN (Not a Number). It returns a new object unless the new index is equivalent to the current one and copy=False.
8. What Pandas tool creates a scatter plot matrix?
Use scatter_matrix() from the pandas.plotting module. A scatter plot matrix visualizes relationships between multiple variables in a dataset using a grid of scatter plots.
9. How to create a DataFrame in Pandas?
You can create a DataFrame using:
- Lists:
Example
import pandas as pd
a = ['Python', 'Pandas']
info = pd.DataFrame(a)
print(info)
Output
0
0 Python
1 Pandas
- Dictionaries of lists or NumPy arrays:
Example
import pandas as pd
info = {'ID' :[101, 102, 103],'Department' :['B.Sc','B.Tech','M.Tech']}
info = pd.DataFrame(info)
print(info)
Output
ID Department
0 101 B.Sc
1 102 B.Tech
2 103 M.Tech
10. Explain Categorical Data in Pandas.
A categorical data type represents categorical variables (like gender, country). It's useful for:
- Saving memory with string variables having few unique values.
- Specifying the logical order of categories (e.g., "low," "medium," "high").
- Signaling to other libraries that a column is categorical.
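The points above can be sketched with an ordered categorical. The survey-style values are hypothetical; the example only shows that the column stores integer codes and respects the declared order.

```python
import pandas as pd

# Hypothetical answers with only three distinct values.
levels = pd.Series(["low", "high", "medium", "low", "high"])

# An ordered categorical records the logical order low < medium < high.
dtype = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
cat = levels.astype(dtype)

codes = cat.cat.codes.tolist()          # integer codes stored in place of the strings
in_order = cat.sort_values().tolist()   # sorted by category order, not alphabetically
above_low = (cat > "low").tolist()      # comparisons respect the declared order

print(codes)
print(in_order)
print(above_low)
```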
11. How to create a Series from a dictionary?
Pass the dictionary to the pd.Series() constructor. If you don't specify an index, the dictionary keys become the index in insertion order (older pandas versions sorted the keys instead).
Example
import pandas as pd
info = {'x': 0., 'y': 1., 'z': 2.}
a = pd.Series(info)
print(a)
Output
x 0.0
y 1.0
z 2.0
dtype: float64
12. How to create a copy of a Pandas Series?
Use Series.copy(deep=True) for a deep copy (copying data and indices). deep=False creates a shallow copy.
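A small sketch of the deep-copy case, using a hypothetical three-element Series. It deliberately does not rely on shallow-copy behavior, because whether a change made through a shallow copy shows up in the original depends on the pandas version and its copy-on-write setting.

```python
import pandas as pd

original = pd.Series([1, 2, 3], index=["a", "b", "c"])

# A deep copy owns its own data and index.
deep = original.copy(deep=True)
deep["a"] = 100

print(original["a"])  # the original keeps its old value
print(deep["a"])      # only the copy was changed
```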
13. How to create an empty DataFrame?
Example
import pandas as pd
info = pd.DataFrame()
print(info)
Output
Empty DataFrame
Columns: []
Index: []
14. How to add a column to a Pandas DataFrame?
You can add a new column using a Pandas Series or by creating it from existing columns.
Example
import pandas as pd
info = {'one': pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
'two': pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
info = pd.DataFrame(info)
info['three'] = pd.Series([20, 40, 60], index=['a', 'b', 'c'])
info['four'] = info['one'] + info['three']
print(info)
15. Adding Indices, Rows, or Columns to a Pandas DataFrame
Adding an Index:
Pandas allows adding indices. If you don't specify an index when creating a DataFrame, it defaults to a numerical index (0, 1, 2...).
Adding Rows:
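Use .loc (label-based indexing) to insert a row by assigning to a label that does not exist yet, or pd.concat() to append the rows of another DataFrame. .iloc (integer-position indexing) can overwrite existing rows but cannot enlarge the DataFrame. The older .ix indexer, which mixed both styles, was deprecated and then removed in pandas 1.0.
Adding Columns:
Assign to a new column name (df['new'] = values or df.loc[:, 'new'] = values), or use insert() to place the column at a specific position. As with rows, .iloc cannot create a new column.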
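One way to add rows and columns is sketched below. It uses label assignment with .loc, pd.concat(), plain column assignment, and insert(); the column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"ID": [101, 102], "Department": ["B.Sc", "B.Tech"]})

# Add a row by assigning to a label that does not exist yet.
df.loc[2] = [103, "M.Tech"]

# Add the rows of another DataFrame by concatenating.
extra = pd.DataFrame({"ID": [104], "Department": ["MBA"]})
df = pd.concat([df, extra], ignore_index=True)

# Add columns by plain assignment, or at a chosen position with insert().
df["Year"] = [1, 2, 3, 4]
df.insert(0, "Campus", "Main")  # a scalar is broadcast to every row

print(df)
```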
16. Deleting Indices, Rows, or Columns from a Pandas DataFrame
Deleting an Index:
- Reset the index using df.reset_index(). Pass drop=True to discard the old index instead of keeping it as a column.
- Remove the index name with df.index.name = None.
- Remove duplicate index values by resetting the index and dropping duplicates.
- Remove an index label along with its row using df.drop().
Deleting a Column:
Use the drop() method with axis=1 (or axis='columns') to delete a column. Set inplace=True to modify the DataFrame directly instead of returning a new one. To remove rows that repeat a value in a particular column, use df.drop_duplicates() with the subset parameter.
Removing a Row:
Use df.drop_duplicates() to remove duplicate rows. Use the drop() method, specifying the index labels of the rows to remove.
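The deletion options above can be sketched as follows, on a hypothetical DataFrame with one repeated row. Each call returns a new object, so the original df is left unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    {"ID": [101, 102, 102, 103],
     "Department": ["B.Sc", "B.Tech", "B.Tech", "M.Tech"]},
    index=["a", "b", "c", "d"],
)

no_dupes = df.drop_duplicates()          # drops row "c", which repeats row "b"
no_row = df.drop(index="a")              # drops the row labeled "a"
no_col = df.drop(columns="Department")   # same as df.drop("Department", axis=1)
fresh_index = df.reset_index(drop=True)  # discards the labels, back to 0..n-1

print(no_dupes)
print(no_row)
print(no_col)
print(fresh_index)
```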
17. Renaming Indices or Columns of a Pandas DataFrame
Use the rename() method. Provide a dictionary mapping old names to new names for indices or columns. Use inplace=True to modify the DataFrame directly.
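A short sketch of rename() with both mappings at once. The labels are hypothetical, and the call returns a new DataFrame because inplace is left at its default.

```python
import pandas as pd

df = pd.DataFrame({"ID": [101, 102], "Dept": ["B.Sc", "B.Tech"]}, index=["a", "b"])

# Map old names to new names; anything not in the mapping is left unchanged.
renamed = df.rename(
    columns={"Dept": "Department"},
    index={"a": "first", "b": "second"},
)

print(renamed)
```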
18. Iterating over a Pandas DataFrame
Several methods are available:
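- iterrows(): Iterates row by row, yielding (index, Series) pairs.
- items(): Iterates column by column, yielding (column name, Series) pairs. It replaces iteritems(), which was removed in pandas 2.0.
- itertuples(): Iterates over rows as namedtuples.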
Choose the method best suited to your task.
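The three styles of iteration are sketched below on a hypothetical two-row DataFrame. The column loop uses items(), which is the name current pandas versions provide for column-wise iteration.

```python
import pandas as pd

df = pd.DataFrame({"ID": [101, 102], "Department": ["B.Sc", "B.Tech"]})

# Row by row: each row arrives as an (index, Series) pair.
for idx, row in df.iterrows():
    print(idx, row["ID"], row["Department"])

# Column by column: each column arrives as a (name, Series) pair.
column_names = []
for name, column in df.items():
    column_names.append(name)
    print(name, column.tolist())

# Rows as namedtuples, with fields named after the columns.
ids = []
for record in df.itertuples(index=False):
    ids.append(record.ID)
    print(record.ID, record.Department)
```

For large DataFrames, vectorized operations are generally preferable to any of these loops.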
19. Getting Items in Series A Not Present in Series B
Use the isin() method with boolean indexing:
Example
import pandas as pd
p1 = pd.Series([2, 4, 6, 8, 10])
p2 = pd.Series([8, 10, 12, 14, 16])
p1[~p1.isin(p2)]
Output
0 2
1 4
2 6
dtype: int64
20. Getting Items Not Common to Both Series A and Series B
Use NumPy's union1d() and intersect1d():
Example
import pandas as pd
import numpy as np
p1 = pd.Series([2, 4, 6, 8, 10])
p2 = pd.Series([8, 10, 12, 14, 16])
p_u = pd.Series(np.union1d(p1, p2))
p_i = pd.Series(np.intersect1d(p1, p2))
p_u[~p_u.isin(p_i)]
Output
0 2
1 4
2 6
5 12
6 14
7 16
dtype: int64
21. Getting Minimum, 25th Percentile, Median, 75th Percentile, and Maximum of a Numeric Series
Use NumPy's percentile() function:
Example
import pandas as pd
import numpy as np
p = pd.Series(np.random.normal(14, 6, 22))
np.percentile(p, q=[0, 25, 50, 75, 100])
22. Getting Frequency Counts of Unique Items of a Series
Use the value_counts() method:
Example
import pandas as pd
import numpy as np
p = pd.Series(np.take(list('pqrstu'), np.random.randint(6, size=17)))
p.value_counts()
23. Converting a NumPy Array to a DataFrame of a Given Shape
Reshape the NumPy array using reshape() and then create a DataFrame. In the example, the array comes from a Series via its .values attribute:
Example
import pandas as pd
import numpy as np
p = pd.Series(np.random.randint(1, 7, 35))
info = pd.DataFrame(p.values.reshape(7, 5))
print(info)
24. Converting a Series to a DataFrame
Use the to_frame() method:
Example
import pandas as pd
s = pd.Series(["a", "b", "c"], name="vals")
s.to_frame()
25. What is a Pandas NumPy Array?
NumPy (Numerical Python) is a library for numerical computations in Python, providing efficient operations on arrays. Pandas uses NumPy arrays internally for its data structures.
26. Converting a DataFrame to a NumPy Array
Use the to_numpy() method.
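A minimal sketch of the conversion, using a hypothetical DataFrame with one integer and one float column. It shows that the resulting array has a single dtype shared by all columns.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

arr = df.to_numpy()

print(arr.shape)  # rows by columns
print(arr.dtype)  # int and float columns are upcast to one common dtype
```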
27. Converting a DataFrame to an Excel File
Use the to_excel() method. For multiple sheets, use ExcelWriter.
28. Sorting a DataFrame
You can sort DataFrames in two ways:
By Label (Index): Use the sort_index() method. By default, it sorts row labels (index) in ascending order. You can specify the axis and ascending parameters to control sorting.
By Value: Use the sort_values() method. Specify the column name(s) to sort by using the by parameter.
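Both kinds of sorting can be sketched on a small DataFrame. The labels and scores are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"ID": [103, 101, 102], "Score": [70, 90, 80]},
    index=["c", "a", "b"],
)

by_label = df.sort_index()                              # row labels, ascending
by_label_desc = df.sort_index(ascending=False)          # row labels, descending
by_value = df.sort_values(by="Score", ascending=False)  # highest score first

print(by_label)
print(by_label_desc)
print(by_value)
```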
29. What is a Time Series in Pandas?
Time series data represents data points indexed in time order. Pandas provides tools for working with it, such as date ranges, resampling, shifting, and rolling windows. Forecasting future values is typically done with separate modeling libraries that build on these structures.
30. What is a Time Offset?
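In general usage, a time offset is the difference between a local time and Coordinated Universal Time (UTC), which is crucial for handling time zone differences. In pandas, the term more often refers to DateOffset objects, which represent calendar-aware increments such as a month end or a business day.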
31. Define Time Periods.
Time periods represent durations like days, months, quarters, or years. Pandas' Period class helps manage these.
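As a rough illustration of the Period class, the sketch below builds a monthly period and a range of quarters. The dates are arbitrary examples.

```python
import pandas as pd

# A single period covering the whole of July 2017.
month = pd.Period("2017-07", freq="M")
next_month = month + 1  # arithmetic moves in whole periods

print(month.start_time)
print(month.end_time)
print(next_month)

# A range of four consecutive quarterly periods.
quarters = pd.period_range("2017Q1", periods=4, freq="Q")
quarter_labels = [str(q) for q in quarters]
print(quarter_labels)
```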
32. How to Convert a String to a Date
Use the strptime() method from the datetime module:
Example
from datetime import datetime
dmy_str1 = 'Saturday, July 14, 2018'
dmy_str2 = '14/7/17'
dmy_str3 = '14-07-2017'
dmy_dt1 = datetime.strptime(dmy_str1, '%A, %B %d, %Y')
dmy_dt2 = datetime.strptime(dmy_str2, '%d/%m/%y')  # day/month/two-digit year
dmy_dt3 = datetime.strptime(dmy_str3, '%d-%m-%Y')  # day-month-four-digit year
print(dmy_dt1)
print(dmy_dt2)
print(dmy_dt3)
Output
2018-07-14 00:00:00
2017-07-14 00:00:00
2017-07-14 00:00:00
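When the strings live in a Series or DataFrame column, pandas' own pd.to_datetime() is an alternative to calling strptime() on each value. The sketch below is one way to use it; the date strings are hypothetical.

```python
import pandas as pd

dates = pd.Series(["14-07-2017", "15-07-2017", "16-07-2017"])

# Parse the whole column at once with an explicit day-month-year format.
parsed = pd.to_datetime(dates, format="%d-%m-%Y")
print(parsed)

# errors="coerce" turns unparseable strings into NaT instead of raising.
mixed = pd.Series(["14-07-2017", "not a date"])
coerced = pd.to_datetime(mixed, format="%d-%m-%Y", errors="coerce")
print(coerced)
```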
33. What is Data Aggregation?
Data aggregation applies functions (like sum, min, max) across rows or columns to summarize data.
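A brief sketch of aggregation in both directions, using a hypothetical two-column DataFrame: agg() applies several functions down each column, while sum(axis=1) aggregates across the columns of each row.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# One result row per function, computed down each column.
col_summary = df.agg(["sum", "min", "max"])

# One value per row, computed across the columns.
row_totals = df.sum(axis=1)

print(col_summary)
print(row_totals)
```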
34. What is a Pandas Index?
A Pandas Index holds the axis labels of a Series or DataFrame. Selecting data through these labels is called indexing, also referred to as subset selection, and it provides efficient access to rows and columns.
35. Define Multiple Indexing.
Multiple indexing (hierarchical indexing) allows you to have multiple levels of indices for rows or columns, useful for higher-dimensional data.
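The idea can be sketched with a two-level row index. The level names and the sales figures are hypothetical.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("2017", "Q1"), ("2017", "Q2"), ("2018", "Q1"), ("2018", "Q2")],
    names=["year", "quarter"],
)
sales = pd.Series([10, 20, 30, 40], index=index)

year_2017 = sales.loc["2017"]       # every row under the outer label "2017"
single = sales.loc[("2018", "Q2")]  # one value addressed by both levels
wide = sales.unstack("quarter")     # inner level pivoted into columns

print(year_2017)
print(single)
print(wide)
```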
36. Define Reindexing.
Reindexing changes the index of a DataFrame. Use the reindex() method. Labels in the new index that were not in the old one get NaN values.
Syntax
DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=np.nan, limit=None, tolerance=None)
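A minimal sketch of reindex() on a hypothetical DataFrame, showing the default NaN fill for a new label and the fill_value alternative.

```python
import pandas as pd

df = pd.DataFrame({"score": [90, 80, 70]}, index=["a", "b", "c"])

# "d" is not in the old index, so its row is filled with NaN.
reordered = df.reindex(["c", "a", "d"])

# fill_value supplies a replacement for the missing rows instead of NaN.
filled = df.reindex(["c", "a", "d"], fill_value=0)

print(reordered)
print(filled)
```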
37. How to Set the Index.
You can set the index when creating a DataFrame or later using the set_index() method.
38. How to Reset the Index.
Use the reset_index() method. This creates a new numerical index (0, 1, 2...) and moves the old index into a column, unless you pass drop=True.
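Setting and resetting the index can be sketched as a round trip. The data below is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"ID": [101, 102, 103], "Department": ["B.Sc", "B.Tech", "M.Tech"]})

indexed = df.set_index("ID")      # "ID" becomes the row labels
restored = indexed.reset_index()  # "ID" returns as a column; index is 0..n-1 again

print(indexed)
print(restored)
```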
39. Describe Data Operations in Pandas.
Pandas offers various data operations:
- Row and Column Selection: Selecting specific rows and columns (resulting in a Series if only one row or column is selected).
- Data Filtering: Selecting rows based on boolean conditions.
- Handling Null Values: Pandas represents missing data as NaN (Not a Number).
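The operations listed above can be sketched together on a hypothetical DataFrame with one missing score.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [101, 102, 103], "Score": [88.0, np.nan, 72.0]})

one_column = df["Score"]           # selecting a single column gives a Series
one_row = df.loc[0]                # selecting a single row also gives a Series
passed = df[df["Score"] > 80]      # boolean filter keeps only matching rows
missing = df["Score"].isna()       # True where the value is missing
cleaned = df.fillna({"Score": 0})  # replace missing scores with 0

print(passed)
print(missing)
print(cleaned)
```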
40. Define GroupBy in Pandas.
The groupby() function groups data based on specified criteria, allowing for aggregated calculations or other operations on those groups.
Syntax
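DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
This signature comes from older pandas versions. The squeeze parameter was removed in pandas 2.0, and newer versions add the observed and dropna parameters.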
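A short sketch of a typical groupby() call, using only the by argument so it does not depend on version-specific parameters. The departments and scores are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["B.Sc", "B.Tech", "B.Sc", "B.Tech"],
    "Score": [80, 90, 70, 100],
})

# Split the rows by department, then compute the mean score of each group.
mean_by_dept = df.groupby("Department")["Score"].mean()

# Several aggregations at once give one column per function.
summary = df.groupby("Department")["Score"].agg(["min", "max", "count"])

print(mean_by_dept)
print(summary)
```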