We are all Data Engineers.
We all work with data.
The amount of data is growing fast, both in business and in daily life. It has to be extracted from different places, merged, filtered and sent to someone.
And do it AS FAST AS POSSIBLE.
You probably have a lot of data to analyze as well.
Most likely, you don't like repeating these operations over and over again. Doing everything manually is tedious.
Python and pandas might be the tools that you need.
Pandas lets you:
- read data from heterogeneous sources (CSV, Excel, databases, Parquet etc.),
- analyze the data,
- operate on the data,
- manipulate the data,
- supplement it with other data,
- filter and sort.
After you are done with your operations, pandas lets you store the result in your favorite format: Excel, CSV, Parquet. Whatever you like.
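To make the read, filter, sort and store cycle concrete, here is a minimal sketch. The column names and values are made up for illustration, and the CSV is read from an in-memory string so the example runs without any files.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path.
raw_csv = io.StringIO(
    "order_id,customer,amount\n"
    "1,alice,120.5\n"
    "2,bob,35.0\n"
    "3,carol,310.0\n"
    "4,alice,80.0\n"
)

orders = pd.read_csv(raw_csv)

# Keep orders of at least 80 and sort them from largest to smallest.
large_orders = (
    orders[orders["amount"] >= 80]
    .sort_values("amount", ascending=False)
    .reset_index(drop=True)
)

# Store the result as CSV text. to_excel() and to_parquet() follow the same
# pattern but need an extra engine installed (for example openpyxl or pyarrow).
csv_output = large_orders.to_csv(index=False)
print(csv_output)
```

Passing a file name to `to_csv` instead of leaving it out would write the result to disk.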
It is not pure gold, though. What is pandas not good at?
From my perspective, pandas is not perfect for operating on JSON files. Doing data analysis on JSON may be OK, but manipulating JSON and creating a new JSON file using pandas? There are faster ways.
I found it much easier to work with the json Python package than to read the JSON structure into a DataFrame and manipulate the DataFrame.
But maybe it is just my preference.
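As an illustration of that preference, here is a small sketch that edits a nested JSON document with the standard library json package only. The document and its keys are hypothetical.

```python
import json

# Hypothetical nested JSON document.
raw = '{"user": {"name": "alice", "tags": ["a", "b"]}, "active": true}'

doc = json.loads(raw)

# Change the nested structure directly with plain dict and list operations.
doc["user"]["tags"].append("c")
doc["user"]["email"] = "alice@example.com"
del doc["active"]

# Serialize the modified structure back to JSON text.
result = json.dumps(doc, indent=2, sort_keys=True)
print(result)
```

Nested structures like this stay as ordinary dictionaries and lists, so there is no need to flatten them into rows and columns first.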
People smarter than me say that pandas is not great at dealing with large data sets. It needs a lot of memory, and there are other tools better suited for the job.
This can mean that using pandas in a production environment with large data sets is not a good idea. You probably would not want to build your ETL processing on this library. It is better suited to data analysis than to building highly efficient transformations that operate on large data sets.
Magic of pandas
Why even bother with pandas? Why not do the operation only in Excel? Or only in SQL in the database? Or only on a CSV file?
Imagine that you don't have just one Excel file to analyze.
You have dozens of them.
You probably would not want to do your task manually, but rather automate it. Pandas is a tool that helps you automate your data engineering tasks.
Imagine a different scenario.
You need to read data from one source, transform it and write it to another place. It may not be production code. It might just be code that makes your (developer) life easier.
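A minimal sketch of that kind of automation could look like this. It uses CSV files created in a temporary folder so it runs anywhere, and the file names and columns are hypothetical. For real Excel files, `pd.read_excel` would take the place of `pd.read_csv`, provided an engine such as openpyxl is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Create three hypothetical monthly report files so the sketch is runnable.
    for month, amount in [("2024-01", 100), ("2024-02", 150), ("2024-03", 90)]:
        pd.DataFrame({"month": [month], "amount": [amount]}).to_csv(
            folder / f"report_{month}.csv", index=False
        )

    # Read every matching file and stack the results into one DataFrame.
    frames = [pd.read_csv(path) for path in sorted(folder.glob("report_*.csv"))]
    combined = pd.concat(frames, ignore_index=True)

print(combined)
```

The loop does not care whether the folder holds three files or dozens of them.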
It may be the tool to do an ad-hoc analysis or produce a quick report for your manager or analyst.
You can also extend this scenario: read data not from one source but from several, then merge (join) it, transform it, filter it and store it.
Sometimes pandas is just the right tool for the job. It supports your efficiency by making it easier to work with different sources, or when you need to do some data analysis and technical analysis.
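The extended scenario can be sketched as follows. Both sources, their column names and the `is_large` rule are invented for illustration; the second source is built in memory to stand in for data that might come from a database or an API.

```python
import io

import pandas as pd

# Source 1: hypothetical CSV export of orders.
orders = pd.read_csv(io.StringIO(
    "order_id,customer_id,amount\n"
    "1,10,120.0\n"
    "2,11,35.0\n"
    "3,10,310.0\n"
))

# Source 2: hypothetical customer records from another system.
customers = pd.DataFrame(
    [{"customer_id": 10, "country": "PL"}, {"customer_id": 11, "country": "DE"}]
)

# Join both sources on the shared key.
merged = orders.merge(customers, on="customer_id", how="left")

# Filter, then add a derived column as a simple transformation.
polish = merged[merged["country"] == "PL"].copy()
polish["is_large"] = polish["amount"] > 200

# Store the result, here as CSV text.
report = polish.to_csv(index=False)
print(report)
```

`how="left"` keeps every order even when no matching customer exists; other join types are available through the same parameter.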
You can easily imagine that you can automate your daily data analytics and integration tasks using Python and pandas.
Write once, don't repeat yourself (DRY) and have time for other activities.
Top 5 high-level pandas features
- It reads heterogeneous data sources.
- It gives you a common interface to operate on data from different sources.
- It is a way to automate your data engineering / analysis tasks.
- It has a large number of options. Most likely, it will meet even your most sophisticated demands.
- The DataFrame and Series concepts, for working with tables and columns, are a very powerful interface to the data.
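To illustrate the last point, here is a tiny sketch of how a table (DataFrame) and a column (Series) relate. The data is made up.

```python
import pandas as pd

# A DataFrame is a table with named columns.
df = pd.DataFrame({"name": ["alice", "bob"], "age": [31, 25]})

# Selecting a single column returns a Series that shares the table's row index.
ages = df["age"]

print(type(df).__name__)
print(type(ages).__name__)
print(ages.mean())
```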
Takeaways:
- For whom is pandas a good idea?
If you are a data analyst, data engineer, integration specialist, business analyst or quality assurance specialist, this is a tool that you may find interesting to:
- do data exploration
- automate your daily data-related tasks (Python + pandas is a powerful package)
- combine data from multiple sources
- Why is pandas a good idea for data analysis?
DRY. Build the solution once and reuse it. Learn one framework and operate on multiple sources using a common (DataFrame) interface.
- What operations might you need?
Transforming a DataFrame is a big subject. By transforming I mean adding columns, filtering the data, making calculations based on the data or grouping the data.
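The four kinds of transformation mentioned above can be sketched in a few lines. The table, its columns and the threshold of 5 units are assumptions made for the example.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south", "north"],
        "units": [10, 3, 7, 12, 1],
        "unit_price": [2.5, 4.0, 2.5, 4.0, 2.5],
    }
)

# Add a column calculated from existing columns.
sales["revenue"] = sales["units"] * sales["unit_price"]

# Filter rows: keep sales of at least 5 units.
bigger_sales = sales[sales["units"] >= 5]

# Group and aggregate: total revenue per region.
revenue_by_region = bigger_sales.groupby("region")["revenue"].sum()
print(revenue_by_region)
```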