Why do people prefer Pandas to SQL?



I've been using SQL since 1996, so I may be biased. I've used MySQL and SQLite 3 extensively, but have also used Microsoft SQL Server and Oracle.

The vast majority of the operations I've seen done with Pandas can be done more easily with SQL. This includes filtering a dataset, selecting specific columns for display, applying a function to values, and so on.

SQL has the advantage of having an optimizer and data persistence. SQL also has error messages that are clear and understandable. Pandas has a somewhat cryptic API, in which sometimes it's appropriate to use a single [ stuff ], other times you need [[ stuff ]], and sometimes you need a .loc. Part of the complexity of Pandas arises from the fact that there is so much overloading going on.
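As a small illustration of the overloading mentioned above (a toy DataFrame with made-up column names; the same `[]` operator does three different jobs):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

s = df["a"]                 # single brackets -> a Series
sub = df[["a"]]             # double brackets -> a one-column DataFrame
row = df.loc[0]             # .loc -> label-based selection (here, row 0 as a Series)
filtered = df[df["a"] > 1]  # the same [] operator, now doing boolean filtering
```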

So I'm trying to understand why Pandas is so popular.


Posted 2018-07-12T09:25:51.067

Reputation: 601

Comments are not for extended discussion; this conversation has been moved to chat.

– Sean Owen – 2018-07-22T03:43:33.750

Pandas docs on this are here.

– Alexei Martianov – 2019-09-03T09:33:05.613



The real first question is why people are more productive with DataFrame abstractions than with pure SQL abstractions.

TL;DR: SQL is not geared around the (human) development and debugging process; DataFrames are.

The main reason is that DataFrame abstractions allow you to construct SQL statements whilst avoiding verbose and illegible nesting. The pattern of writing nested routines, commenting them out to check them, and then uncommenting them is replaced by single lines of transformation. You can naturally run things line by line in a repl (even in Spark) and view the results.

Consider the example of adding a new transformed (string-mangled) column to a table, then grouping by it and doing some aggregations. The SQL gets pretty ugly. Pandas can solve this but is missing some things when it comes to truly big data or, in particular, partitions (though this may have improved recently).
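A rough sketch of that pattern in pandas (column names and data are invented for illustration); each transformation is one readable line, whereas the equivalent SQL needs a nested subquery or CTE:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice Smith", "Bob Jones", "Ann Smith"],
                   "amount": [10, 20, 30]})

# add the string-mangled column, then group by it and aggregate
result = (
    df.assign(surname=df["name"].str.split().str[-1])  # new derived column
      .groupby("surname", as_index=False)["amount"]
      .sum()
)
```

Every intermediate step of the chain can be pulled out and inspected on its own in a REPL.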

DataFrames should be viewed as a high-level API to SQL-like routines, even though with pandas they are not rendered to any SQL planner at all.


You can probably have many technical discussions around this, but I'm considering the user perspective below.

One simple reason why you may see a lot more questions around Pandas data manipulation as opposed to SQL is that to use SQL, by definition, means using a database, and a lot of use-cases these days quite simply require bits of data for 'one-and-done' tasks (from .csv, web api, etc.). In these cases loading, storing, manipulating and extracting from a database is not viable.

However, considering cases where the use-case may justify using either Pandas or SQL, you're certainly not wrong. If you want to do many, repetitive data manipulation tasks and persist the outputs, I'd always recommend trying to go via SQL first. From what I've seen the reason why many users, even in these cases, don't go via SQL is two-fold.

Firstly, the major advantage pandas has over SQL is that it's part of the wider Python universe, which means in one fell swoop I can load, clean, manipulate, and visualize my data (I can even execute SQL through Pandas...). The other is, quite simply, that all too many users don't know the extent of SQL's capabilities. Every beginner learns the 'extraction syntax' of SQL (SELECT, FROM, WHERE, etc.) as a means to get their data from a DB to the next place. Some may pick up some of the more advanced grouping and iteration syntax. But after that there tends to be a pretty significant gulf in knowledge, until you get to the experts (DBAs, Data Engineers, etc.).
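A minimal sketch of executing SQL through pandas (using an in-memory SQLite database; the table and column names are invented):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 4), (2, 5), (3, 6)])

# run SQL and get the result back as a DataFrame in one step
df = pd.read_sql("SELECT a, b FROM t WHERE a > 1", conn)
```

From here the result is an ordinary DataFrame, so cleaning, plotting, and modelling can continue in the same script.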

tl;dr: It's often down to the use-case, convenience, or a gap in knowledge around the extent of SQL's capabilities.



Reputation: 456

I think SQL largely being set based plays a large part, when a lot of people from other technical areas are used to handling data line by line. Also consider that data is mostly just data to pandas but different SQL engines support different built in functions which can get wildly annoying fast if you have to chop and change during your workday – Dave – 2018-07-12T14:26:29.907

I wouldn't say it's not viable. If you can get the data into a pandas data frame, you can probably shove it into a PostgreSQL DB. But for one and done, it's probably more effort and time than you would save. – jpmc26 – 2018-07-13T07:37:46.763

I agree that some ETL approaches appear to be programmer-centric decisions. That is, they prefer to manipulate the data then present this "perfect" payload to the database. However, as you indicate, if it can be done via several SQL queries, then the extra programmatic layer is unnecessary. Exactly what I faced recently. As the OP and your answer indicates, it could be that "old-school" or DBA-centric people look at it and say, why not do it in SQL (even just several simple queries!). That said, I've found pandas to be very powerful for extremely diverse data sets. – SaltySub2 – 2018-07-13T08:45:36.760

@SaltySub Just a point on shifting things out of the programmatic layer into SQL: It's a fair point and can be perfectly valid, but going as far as burying application logic in SQL procedures can bring its own special flavour of headache. – Electric Head – 2018-07-13T18:23:53.420

@ElectricHead I agree that there needs to be a right balance. If a series of SQL queries can perform the tasks adequately, it can definitely be easier and more efficient. Conversely, as you indicate, if one has to place a huge amount of logic into SQL procedures, etc. then pandas should be strongly considered. Particularly as above if you are using different database flavours - SQL syntax differences can get very hairy then. – SaltySub2 – 2018-07-15T03:26:59.033

Even with a good IDE with auto-completion, SQL tends to suffer from pushing the user to use massive amounts of nesting. For example, column assignment/creation followed by extra steps (maybe aggregation etc) are super easy to write AND read in a dataframe-like language (they are just single lines). In most SQL context you need to use nesting and it becomes unreadable. In fact, you often see people using various scripting languages to create SQL factories just to put the queries together.

Ideal world would be something Spark/pandas with a good SQL plan printer (i.e. is readable by humans). – mathtick – 2018-09-05T17:25:47.920


As much as there is overlap in the application of these two things, this is comparing apples to oranges.

pandas is a data analysis toolkit implemented in Python, a general-purpose programming language. SQL is a domain-specific language for querying relational data (usually in a relational database management system, of which SQLite, MySQL, Oracle, SQL Server, PostgreSQL, etc. are examples).

SQL implies

  • working with data in an RDBMS* which may or may not be appropriate for the workload, even if it's just a small SQLite database,
  • database domain knowledge (as an end user, developer and/or administrator; the suggestion that "SQL is faster" I often see is a massive over-simplification), and
  • overcoming the not-insignificant learning curve in using SQL effectively, particularly in specialist applications such as data analysis (as opposed to creating simple reports of simple data).

* It's worth underlining the fact that SQL is so domain-specific it's becoming much less relevant to working with increasingly common alternatives to relational databases such as NoSQL databases. This represents a fundamental shift in how data is stored and structured, and there is really no universally common way of accessing it like the development of SQL standardisation aimed to achieve.

Python on the other hand (pandas is fairly "pythonic", so it holds true here) is flexible and accessible to people from various backgrounds. It can be used as a "scripting language", as a functional language, and as a fully featured OOP language. Visualisation capabilities and data source interoperability are built into pandas, but you're free to incorporate whatever Python can do into your workflow (which is most things); the scientific Python ecosystem has ballooned and includes great tools such as Jupyter Notebook and essential scipy libraries such as matplotlib and numpy (which pandas builds on). Significant elements of pandas' data analysis are R-inspired, and you won't generally find statisticians umming and ahhing about whether they use R (or possibly, increasingly, pandas!) over putting everything in a database and writing their analyses in SQL.

I'm not saying pandas is better than SQL or vice versa, but SQL is a very domain-specific tool whereas pandas is part of a giant, flexible and accessible ecosystem. I work with geospatial data systems, of which relational databases are a huge part, and SQL is a powerful and essential tool. However, pandas is an equally if not more essential part of my day-to-day toolkit and SQL is often relegated to fetching data -- perhaps with some pre-processing -- so I can do things with it in pandas.

Electric Head


Reputation: 311

This is the only true answer, it should be the chosen one. SQL and Pandas are two different things, I do not understand what comparison people are trying to make. – gented – 2018-07-13T06:51:20.957

I suspect it's an end-user perspective of writing something code-like to fetch and massage some data from somewhere and spit out some numbers. I'm not entirely surprised; I've had first hand experience of how data analysts presented with an old but otherwise unremarkable Oracle database haven't even the first idea of what it is and how to connect to it let alone get data out.

I believe it betrays a fundamental lack of understanding of the technology - I've actually added a bit to hopefully emphasise how quickly misunderstanding the scope of SQL falls down. – Electric Head – 2018-07-13T18:00:46.153

I'd challenge your bit about being irrelevant to NoSQL situations. Consider for example the strides PostgreSQL has made with its JSON storage. – jpmc26 – 2018-07-14T20:31:08.627

I tried to choose my words carefully; PostgreSQL is still an RDBMS despite doing a lot of things well (as SQL Server is despite supporting graphs). But, I've relaxed the wording a touch because it's still a good point: there is some crossover and, importantly, SQL APIs do exist for some NoSQL systems. It is crossover though, SQL is not a universal language and not all data is structured relationally. – Electric Head – 2018-07-15T08:25:32.137

I think you can do everything in SQL which is possible in pandas. SQL is not flexible but is so much optimised. – Media – 2018-07-18T00:55:31.117

I feel this has been adequately explained already in various answers and comments, so I'm reluctant to repeat this... but they are different tools for different jobs. A trivial example to highlight why this comparison doesn't really mean anything: simple time series analysis is possible in both pandas and SQL, but the latter is not going to help you plot the data on a figure. I'm not even going to get into the integrated capabilities of numpy or scipy. – Electric Head – 2018-07-18T23:59:53.370


First, pandas is not actually that popular. I use both pandas and SQL. First I try to understand the task: if it can be done in SQL, I prefer SQL because it is more efficient than pandas. Try working on large data (10,000,000 rows × 50 columns). Try to do some groupby operation in both SQL and pandas. You will understand.

I use pandas where it comes in handy, like splitting a column's values into an array and doing some work on it (such as choosing only some values from that array). This kind of task is relatively hard to code in SQL, but pandas will ease your task.
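A minimal sketch of that split-and-filter pattern (the data and column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"tags": ["red,blue,green", "blue,yellow"]})

# split each string into a list, then keep only the values matching a condition
df["tag_list"] = df["tags"].str.split(",")
df["b_tags"] = df["tag_list"].apply(lambda xs: [t for t in xs if t.startswith("b")])
```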

Ankit Seth


Reputation: 1 569

Is this inefficiency specific to pandas? I've done quite a bit of in-memory data manipulation in C# and found it pretty easy and efficient, provided it fit memory and was one-shot (i.e. no need to incrementally update indexes as the data changes). – CodesInChaos – 2018-07-13T11:59:55.807

pandas is meant to be convenient over fast, but that's not to say it can't be fast if you use it right. In the end, executing a SQL query on data in a database isn't magic - it requires resources like anything, it's just that (if you do it right!) you're hopefully making use of resources on carefully configured, beefy database servers. Getting your pipeline right in pandas or similar (e.g. streaming data rather than loading it all into memory) is going to determine how successful some efforts are. – Electric Head – 2018-07-13T18:05:40.270

@CodesInChaos There is this answer on pandas vs SQL - http://qr.ae/TUIpzE. It describes the advantages and disadvantages of using pandas.

– Ankit Seth – 2018-07-14T02:38:55.097

@CodesInChaos from my personal experience pandas (as well as python in general) is orders of magnitude slower and less memory efficient than C# or Java all other things like algorithm, optimizations, hardware etc. being equal. One may argue that when it comes to DS or ML python is just glue for C-implemented number grinders but even as glue it slows things down quite a bit unless it's JIT-compiled (PyPy, Cython, numba, Weld) which is a separate source of pain when it comes to complex solutions and third-party package support.

– ayorgo – 2020-07-01T15:32:25.320

I should add that Pandas with large datasets using Dask is now very performant. – vy32 – 2020-08-23T16:13:06.603


I'm one of those people who would use (in my case) R's dplyr (the language, not necessarily the tool) in every case if I could even though I know my SQL.

The major benefit I see in Pandas/dplyr/data.table pipelines is that the operations are atomic and can be read top to bottom.

In SQL you need to parse the whole script, jumping around (what's being summarized, what's being joined and how - left? inner? right? - are there any filters applied?) to fully grasp what is happening.

In Pandas et al. each step of the pipeline is self-contained: it does something with the input data and returns output data. This sequential process makes it easier to reason about what's happening, since there is a clearly defined state after each operation rather than just at the query level.

And yes, you can use WITH statements and such, but it requires much more code and it is not as clear which object is being used, compared to piping.
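A toy example of such a pipeline in pandas (invented data), where every intermediate result is a named, inspectable object:

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})

# each step reads top to bottom and can be examined on its own in a REPL
filtered = df[df["value"] > 1]
summed = filtered.groupby("group", as_index=False)["value"].sum()
```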

Robin Gertenbach


Reputation: 181


I'm fairly new to Pandas/Python but have 20+ years as a SQL Server DBA, architect, administrator, etc. I love Pandas and I'm pushing myself to always try to make things work in Pandas before returning to my comfy, cozy SQL world.

Why RDBMSs Are Better: The advantage of RDBMSs is their years of experience optimizing query speed and data read operations. What's impressive is that they can do this while simultaneously balancing the need to optimize write speed and manage highly concurrent access. Sometimes these additional overheads tilt the advantage to Pandas when it comes to simple, single-user use cases. But even then, a seasoned DBA can tune a database to be highly optimized for read speed over write speed. DBAs can take advantage of things like optimizing data storage, strategic disk page sizing, page filling/padding, data controller and disk partitioning strategies, optimized I/O plans, in-memory data pinning, pre-defined execution plans, indexing, data compression, and many more. I get the impression from many Pandas developers that they don't understand the depth that's available there. What I think usually happens is that if a Pandas developer never has data that's big enough to need these optimizations, they don't appreciate how much time those can save out of the box. The RDBMS world has 30 years of experience optimizing this, so if raw speed on large datasets is needed, RDBMSs can't be beat.

Why Is Python/Pandas Better: That said, speed isn't everything and in many use cases isn't the driving factor. It depends on how you're using the data, whether it's shared, and whether you care about the speed of the processing. RDBMS's are generally more rigid in their data structures and put a burden on the developer to be more deterministic with data shapes. Pandas lets you be more loose here. Also, and this is my favorite reason, you're in a true programming language. Programming languages give you infinitely more flexibility to apply advanced logic to the data. Of course there's also the rich ecosystem of modules and 3rd party frameworks that SQL can't come close to. Being able to go from raw data all the way to web presentation or data visualization in one code base is VERY convenient. It's also much more portable. You can run Python almost anywhere including public notebooks that can extend the reach of your results to get to people more quickly. Databases don't excel at this.

My Advice? If you find yourself graduating to bigger and bigger datasets, you owe it to yourself to take the plunge and learn how RDBMSs can help. I've seen million-row, multi-table-join, summed-aggregate queries tuned from 5 minutes down to 2 seconds. Having this understanding in your tool belt just makes you a more well-rounded data scientist. You may be able to do everything in Pandas today, but some day you may have an assignment where an RDBMS is the best choice.



Reputation: 101


Things Pandas can do that SQL can't do

  1. df.describe()
  2. Plotting, e.g. df['population'].plot(kind='hist')
  3. Use a dataframe directly for training machine learning algorithms

Things Pandas can do that I wasn't aware SQL could do as well

  1. Export to csv: df.to_csv('foobar.csv'). This is important when you want to show something to a business owner who wants to work with Excel. And there is df.to_excel as well. But in SQL, you can do SELECT a,b,a+b INTO OUTFILE '/tmp/result.txt' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' FROM test_table; (thank you, vy32!)
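A quick sketch of items 1 and the CSV export from the pandas side (toy data; note that to_csv returns a string when no path is given):

```python
import pandas as pd

df = pd.DataFrame({"population": [100, 200, 300, 400]})

summary = df.describe()            # count, mean, std, min, quartiles, max
csv_text = df.to_csv(index=False)  # pass a filename instead to write to disk
```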

Martin Thoma


Reputation: 15 590

Nice. Although most of these seem like functions that could be implemented in SQL. (SQL does have directly CSV export.) – vy32 – 2018-07-18T19:51:00.993

Could you please send me a query which exports to CSV? (I only know tools which do this for some SQL-based databases, but I've never seen a query ... so I doubt that this is part of the SQL specification) – Martin Thoma – 2018-07-18T20:19:36.667

SELECT a,b,a+b INTO OUTFILE '/tmp/result.txt' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' FROM test_table; See https://dev.mysql.com/doc/refman/8.0/en/select-into.html – vy32 – 2018-07-19T14:17:12.387

Thank you so much, vy! I think I'll adjust my answer when I'm home :-) – Martin Thoma – 2018-07-19T16:52:41.503

Sure thing. Remember, the file ends up on the SQL server, not the client. – vy32 – 2018-07-20T18:21:50.973

Oh, and, obviously, you can export to CSV with SQLite by simply changing the delimiter and doing a select. – vy32 – 2018-07-20T18:22:45.440


The only thing not covered in these answers that I'd like to mention is that it also depends on how you're using SQL. Take arcpy for example. For some reason, none of the arcpy.da functions have an executemany feature. This is really strange, because pretty much every other Python SQL library does. The WHERE statement in the arcpy.da functions is also limited to around 120 characters. This essentially means that if you have any relatively high number of things you're trying to do with your database, your only real choice is to call your chosen arcpy.da function multiple times, changing the WHERE statement each time. There are a few tricks you can use to make this process go faster - you can iterate over chunks of your dataset, for example - but literally every single one of these tricks is much, much slower than just using one arcpy.da.SearchCursor to load your entire table into a pandas data frame, and then manipulating it using pandas, numpy, and, if your data really is that massive, dask. I need to emphasize here that pandas isn't just a little faster in this case. It's disgustingly faster. It's so much faster that I was literally laughing at myself for not doing it sooner. Using pandas dropped one script's execution time from well over an hour - I forget if the jump was from 3.5 hours or from 1.5 hours - to literally 12 minutes.

One thing to note is that while I could have done this with SQL, it would have taken me a lot longer to learn. I would have either had to learn operations specifically for SQL in Access (that's where the data for this script ended up, and SQL in Access wasn't as robust as I needed it to be when I was actually looking into doing this), or I would have had to write all my data to a SQLite3 database, manipulate it there, and then put it in Access. While this might have given me similar performance results, it would have made my script harder to modify in the future.

So yeah, sometimes Pandas is just strictly better than the SQL options you have at your disposal. Everything I would have needed to do in SQL was done with a function in pandas. You can also use SQL syntax with pandas if you want to. There's little reason not to use pandas and SQL in tandem.

One more thing I want to mention about Pandas and numpy is that both of these libraries are set-based by nature. You can loop through the dataframes and series built with these libraries, but it's really hard to modify data in these structures that way, so you'll end up writing more efficient, set-based code with both of these libraries purely because it's so much easier to do. Being "guided", if not rail-roaded, into using set-based approaches is not something I've experienced with SQL.
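A tiny example of the set-based style (invented data): the whole column is transformed in one vectorized expression, with no row-by-row loop.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

# one expression over the whole column, rather than mutating rows in a loop
df["price_with_tax"] = df["price"] * 1.2
```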

One more massive thing I forgot to mention with Pandas. Money. Pandas is a tool that a lot of Data Science jobs want you to know how to use. Pretty much every Data Science job I've looked at has paid more than database management type jobs. The only exception to this that I've noticed is in Data Engineering, but I've seen far less of those job postings. Pandas looks like it makes you more money at a glance.




Perhaps sad that when it comes to modern jobs it's about having the right buzzwords in your resume as opposed to the approaches you take to solving a problem (assuming you can learn said buzzword relatively fast). It's like the buzzword is more important than the problem solving. When the problem solving for X should involve learning and using technology A, B, C, not the reverse. I wonder if most development teams now smash out things because of buzzword-ism and trendiness, then think about problem solving as a secondary, or "old-school" thing because you didn't know/use said buzzword. – SaltySub2 – 2018-07-13T08:49:55.253

@SaltySub yeah you're right. We do a lot with SQL. It's just that for this particular instance, I was already writing code in python and I needed to learn Pandas anyway. IMO pandas works better with python than SQL does. I just didn't put that in my answer because they probably perform the same. NoSQL looks like it's just a buzzword. I couldn't find a single example of where a non-relational database is actually better than a relational database. – None – 2018-07-13T13:39:25.400

I think I know what you're getting at with describing pandas/numpy as set based; it's much the same as matlab, scilab, etc. in that they are massively optimised for bulk array operations and stink when using procedural constructs like looping. However, SQL is by and large all about set operations and a user is going to struggle until they understand that. Basically I don't really understand the distinction? SQL railroads you far more on that front. – Electric Head – 2018-07-13T18:19:14.183

@ElectricHead in my experience if you're writing your own function involving sql in python, it's easier to just misuse your cursor and write bad queries than it is using pandas/numpy. Gotta remember that not all sql modules/libraries are made the same. In my case, with arcpy.da.SearchCursors and the like, there's really not a good way to do something to a bunch of records efficiently because of weird limitations. If I use pandas/numpy there becomes one good way to do things, and that's what I want when using python. – None – 2018-07-13T18:23:39.850

@ElectricHead it's extremely difficult to write a loop to modify rows in a data frame programmatically in Pandas. It's really easy to do with SQL stuff. – None – 2018-07-13T18:26:08.717

Ahhh, OK. You mean a homespun SQL pipeline via a python dbapi implementation vs using numpy/pandas? In which case, yeah gotcha, no argument from me there; care required! It read to me as vs plain SQL which you obviously need to understand set operations with, but will find that out quite quickly when running silly queries from a database client. – Electric Head – 2018-07-13T18:33:14.693

@Steve Yes, won't stop people trying to dynamically modify stuff in loops in pandas or similar though :) I think understanding SQL helps working in pandas effectively (it's not like they hide the similarity in some concepts though). – Electric Head – 2018-07-13T18:35:44.627

@Steve Indeed pandas is powerful too... I guess one of my frustrations is developers and management both, including myself, not spending adequate time evaluating solutions and chasing trends (where money is involved to promote self/company). But even in lean prototyping/ mvp one would have to lay appropriate groundwork for scaling. SQL, noSQL and Pandas... all have their purposes for the appropriate tasks and projects at different stages. For the past year plus, noSQL for a lean prototype/ mvp certainly helped me in more ways than one. SQL would have been overkill for that. – SaltySub2 – 2018-07-15T03:32:13.240


I'll attempt to answer this question based on my own experience. In contrast to the other answers, I prefer SQL for deep learning and big-data-related things. There are numerous reasons for that. As can be seen here,

Pandas provides an intuitive, powerful, and fast data analysis experience on tabular data. However, because Pandas uses only one thread of execution and requires all data to be in memory at once, it doesn’t scale well to datasets much beyond the gigabyte scale.

SQL engines usually keep the keys or special columns in data structures like $B^+$ trees in order to facilitate CRUD operations. Such a data structure keeps track of the status of all the data in the database. This is not something pandas can do, because it can't access all the data simultaneously. On the other hand, there are operations it can't do even with the chunksize parameter of read_csv. As an example, you can't have direct batch operations over datasets too large for your memory to accommodate. Any other task which depends on your entire dataset needs extra coding. All of these can be handled in SQL without extra coding, just with a simple query. Simple SQL operations are used without any fear about memory.
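For reference, a minimal sketch of the chunked reading mentioned above (a toy in-memory CSV stands in for a large file):

```python
import pandas as pd
from io import StringIO

# stand-in for a file too large to load at once
data = StringIO("x\n" + "\n".join(str(i) for i in range(10)))

# process the input in fixed-size pieces; each chunk is a small DataFrame
total = 0
for chunk in pd.read_csv(data, chunksize=4):
    total += chunk["x"].sum()
```

As the answer notes, this works for streaming-style aggregations, but anything that needs the whole dataset at once still requires extra coding.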

Another difference is that CRUD operations in SQL can be applied in a distributed fashion, with different authorisation policies, which is not possible in pandas.

This is not meant to say which is better; it all depends on the task. For large-scale computation I prefer SQL, and for small ones I prefer pandas.

There are other things that are not in pandas which are really important for a fast data-extraction experience; I'll refer to them later. For now, just take a look here.



Reputation: 12 077


I thought I would add that I do a lot of time-series based data analysis, and the pandas resample and reindex methods are invaluable for doing this. Yes, you can do similar things in SQL (I tend to create a DateDimension table for helping with date-related queries), but I just find the pandas methods much easier to use.
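For instance, a toy sketch of resample (the dates and values are invented):

```python
import pandas as pd

idx = pd.date_range("2018-01-01", periods=6, freq="D")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# regroup the daily series into 2-day bins and sum each bin
two_daily = s.resample("2D").sum()
```

The equivalent in SQL typically means joining against a date-dimension table and grouping, which is considerably more verbose.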

Also, as others have said, the rest of my modelling is in Python, and I often have web calls or CSV files.

Ken Syme


Reputation: 206


Pandas is more popular because Python, in the form of Jupyter notebooks, is the most popular toolbox used by data scientists in the neural network area. Python is becoming "the" language. It's even possible to use a SQL backend, but you're not bound to SQL only with pandas.



Reputation: 101


This thread is rather old, but a few points

  • Spark and most data platforms add SQL because it is the de facto/ubiquitous language of data analysis
  • I disagree that you can do everything in SQL -- some tasks, like applying lambda transforms, aren't feasible without hacks

I think what has made Pandas popular is that it's built on Python, accessible from notebooks, and has an easy API you can learn quickly.
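A sketch of the lambda-transform point above (made-up data; an arbitrary Python function applied per value, which plain SQL can't express without vendor-specific UDF machinery):

```python
import pandas as pd

df = pd.DataFrame({"text": ["hello world", "pandas vs sql"]})

# any Python lambda can be mapped over a column directly
df["n_words"] = df["text"].apply(lambda t: len(t.split()))
```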

Doug F


Reputation: 66

Now there's a good master's project: adding user-defined functions in Python to SQL, on the fly. – vy32 – 2020-08-23T16:11:37.857

@vy32 - a lot of the big RDBMSs like Oracle have a way to add a UDF, but it's usually very painful. Postgres has a way as well -- it's not beautiful but it works. A more first-class way to integrate lambdas in a DB would be a huge win! https://www.postgresql.org/docs/8.2/plpython-funcs.html

– Doug F – 2020-08-24T05:20:31.117