WWCode Talks Tech #16: Optimizing Python Code


Written by WWCode HQ

In this episode, Women Who Code Boston Director Anna Astori explains options for speeding up your Python data processing operations. It is valuable for those who are just starting out in data science and Python development, and an excellent refresher for those who are already experienced.

What is fast code, and how do we define it? More importantly, how can we measure it? There are three aspects against which we can measure code: time, CPU consumption, and memory consumption. For each of those, there are a lot of tools available in Python. For time, for instance, you can use the time or timeit modules. Measuring time is a little tricky, because your operating system can interfere while you're doing it. You must be aware of what is happening and how to account for it to avoid skewed results. There's a great read called Falsehoods Programmers Believe About Time; it has been republished on multiple platforms on the web and is still regarded as a standard reference.
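For instance, a minimal sketch of timing a snippet with timeit from within a script (the statement being timed is just an illustration, not code from the talk):

```python
import timeit

# Repeating the statement many times, and taking the best of several repeats,
# helps smooth out interference from the operating system.
elapsed = timeit.timeit("sum(range(1_000))", number=100_000)
best = min(timeit.repeat("sum(range(1_000))", number=100_000, repeat=5))
print(f"total: {elapsed:.3f} s, best of 5: {best:.3f} s")
```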

For CPU measurements, there is the very popular cProfile module, and for memory, memory-profiler. What can you do when you start a project, you're working with lots of data, and suddenly you realize your code is hitting a wall and taking forever to run? There are several things, ranging from very straightforward tweaks you can apply in your code all the way to more generalized and sophisticated approaches. The first suggestion that comes to mind is to look out for places where you can replace for-loops with list comprehensions. What do I mean by this, and what value do they bring? Let's take the example of a dummy function that squares integers with a for-loop. All it does is create an output list, iterate over each integer within a limit, append its square to the output list, and finally return the list.
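A sketch of what such a function might look like, together with a cProfile run as mentioned above (the function name and limit are illustrative reconstructions, not code from the talk):

```python
import cProfile

def square_integers_with_for_loop(limit):
    # Create an output list, then append the square of each integer below the limit.
    output = []
    for i in range(limit):
        output.append(i * i)
    return output

# cProfile reports how much time is spent in each function and method call.
cProfile.run("square_integers_with_for_loop(1_000_000)")
```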

When I run this function, you can see sample code showing how to call the timeit function within your Python code and print its output to standard output. On my machine, the output was about 9.3 seconds. Can we do better? Yes, we can, with a list comprehension. Here is a new function which achieves the same thing using a list comprehension. You can see my code calling timeit again and printing out the result. This time it's slightly over six seconds which, in relative terms, is a great improvement. If I were working with real-life, really large data input, the benefit would be even more noticeable. That shows the benefit of list comprehensions in Python. They are a great tool.
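A minimal sketch of that comparison (repeating the loop version for completeness; the input size and repeat count are illustrative, so your timings will differ):

```python
import timeit

def square_integers_with_for_loop(limit):
    output = []
    for i in range(limit):
        output.append(i * i)
    return output

def square_integers_with_list_comprehension(limit):
    # The same computation expressed as a list comprehension.
    return [i * i for i in range(limit)]

for name in ("square_integers_with_for_loop",
             "square_integers_with_list_comprehension"):
    elapsed = timeit.timeit(f"{name}(1_000_000)", globals=globals(), number=100)
    print(f"{name}: {elapsed:.2f} s")
```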

It might be a good idea to try to avoid unnecessary for-loops, and even list comprehensions, altogether. One pattern I come across sometimes in real-life applications, but a lot more often in LeetCode-style coding interview problems, is knowing that you're going to iterate over, say, a list of items, do some computation for each one of them, and store the results in a predefined list or array. A nice trick you can use in Python is the multiplication shortcut. If I have a rows list of input elements, I can simply multiply a default zero value by the length of this input list, and my computation-per-row list would look something like the sketch below. That avoids growing the list with loops or list comprehensions altogether. There are some caveats to this approach as well, which I'm not going to go into here because of the scope of the lightning talk.
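A minimal sketch of the multiplication shortcut (the rows data and the per-row computation are made up for illustration):

```python
rows = [3, 7, 2, 9]                     # illustrative input list

# Preallocate the output with a default value instead of growing it with append.
computation_per_row = [0] * len(rows)

# Results are then written by index into the preallocated list.
for index, row in enumerate(rows):
    computation_per_row[index] = row * row

print(computation_per_row)              # [9, 49, 4, 81]
```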

What else can we do? Another area I would suggest exploring is the Python built-ins, especially if you are just starting out in coding or data science. Sometimes it feels a little tempting to write this logic yourself, because you can. I would recommend against it. The built-in functions I'm talking about include very simple things like sum, max, map, filter, and reduce (which, in Python 3, lives in the functools module). The great thing about them is that they're implemented in C under the hood and optimized for various scenarios, so they'll do things a lot more efficiently than code I might have written myself. Another example I would like to mention also shows up a lot in real-life applications when working with textual data, as I had to do, and in LeetCode-style problems where you might need to reverse an input string or find all the non-repeating characters.
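A minimal sketch of the built-ins point, comparing a hand-written accumulation loop with the built-in sum (the data and repeat counts are illustrative):

```python
import timeit

def manual_sum(values):
    # Hand-rolled accumulation in pure Python.
    total = 0
    for value in values:
        total += value
    return total

values = list(range(1_000_000))

print(timeit.timeit(lambda: manual_sum(values), number=10))  # slower
print(timeit.timeit(lambda: sum(values), number=10))         # faster: sum runs in C
```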

You might also be tempted just to start building the output string character by character, but that might not be a very efficient way to do it. It can be much more efficient to append the characters to an intermediate list and then call the string join method on it. Strings are immutable, so when you build one character by character, Python has to create a new string object for every intermediate version, which takes up a lot of memory. Imagine working with really large data input: those intermediate strings will take up a lot of memory and slow down your program.
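A minimal sketch of the list-then-join pattern (the de-duplication task here is just an illustration):

```python
def unique_characters(text):
    # Collect characters in a list and join once at the end,
    # instead of growing an immutable string with repeated concatenation.
    seen = set()
    parts = []
    for char in text:
        if char not in seen:
            seen.add(char)
            parts.append(char)
    return "".join(parts)

print(unique_characters("optimization"))  # optimzan
```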

The last one I wanted to talk about is operator.itemgetter, which is very helpful and handy in the following kind of example. I have a list of tuples with the first and last names of users, and I want to sort them by last name rather than by first name, which would be the default. In that case, I can use operator.itemgetter as the sort key, and as the sample output below shows, it does this very quickly and very nicely.
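A minimal sketch of sorting by last name with itemgetter (the names are illustrative):

```python
from operator import itemgetter

users = [("Ada", "Lovelace"), ("Grace", "Hopper"), ("Katherine", "Johnson")]

# itemgetter(1) extracts the second element of each tuple (the last name),
# so sorted() orders the users by last name instead of the default first element.
by_last_name = sorted(users, key=itemgetter(1))
print(by_last_name)
# [('Grace', 'Hopper'), ('Katherine', 'Johnson'), ('Ada', 'Lovelace')]
```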

Another thing to think about in your code: if you're working with many objects and their attributes, and you're referencing those attributes quite a few times in one piece of code, it might be a good idea to assign them to local variables. Imagine a rectangle object that has height and width attributes. I would assign them to rectangle_height and rectangle_width variables and then reuse those later in my code to compute the surface, and perhaps further down the perimeter or something else.
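A minimal sketch of that pattern (the Rectangle class and variable names are illustrative):

```python
class Rectangle:
    def __init__(self, height, width):
        self.height = height
        self.width = width

def describe(rectangle):
    # Cache the attribute lookups in local variables once,
    # then reuse the locals instead of re-reading rectangle.height and rectangle.width.
    rectangle_height = rectangle.height
    rectangle_width = rectangle.width

    surface = rectangle_height * rectangle_width
    perimeter = 2 * (rectangle_height + rectangle_width)
    return surface, perimeter

print(describe(Rectangle(3, 4)))  # (12, 14)
```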

It's probably not going to shave off a lot of execution time, but to put it concisely: if you reference the object's attributes directly, Python has to retrieve the object first and then its attribute, whereas with local variables it skips a step. It can come in handy. Along the lines of thinking carefully about your objects and data structures, my general suggestion is always to explore, learn, and know the data structures and objects you're using. If you're working with Python dictionaries and want to check whether a key is in a dictionary, Python 3 allows two versions of the syntax: checking for the key in the dictionary itself, or in its keys() view. In Python 3, the first one is going to be a little faster. It all depends on the general logic of your code, but if you're not using the actual keys for anything else later in your code, just go for the faster, simpler version.
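A minimal sketch of the two membership checks (assuming the comparison in the talk is between the dictionary itself and its keys() view; the data and timings are illustrative):

```python
import timeit

inventory = {f"item_{i}": i for i in range(1_000)}

# Faster: membership test directly against the dictionary.
print(timeit.timeit(lambda: "item_500" in inventory, number=1_000_000))

# Slightly slower: the keys() call adds a method lookup and view creation.
print(timeit.timeit(lambda: "item_500" in inventory.keys(), number=1_000_000))
```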

Another dilemma that I see pop up sometimes, and that trips up even experienced developers, is the choice between the list and set data structures. When you know you're going to look elements up in a container, which data structure should you go for? We know that checking whether an element is in a set is super efficient, so let's make great use of that. On the other hand, watch out: if you get a list as an input, which is very frequent in Python, you might be tempted to turn that list into a set first and then do the lookups from there.

However, if you're only looking up one element, what will happen? You'll have to iterate over every list element first so that those elements can be added to the set. You'll create extra overhead that you don't need, and thus you won't get any benefit from the fast set lookup. Know your data structures and your objects well, and choose an approach wisely. What happens if you've replaced your loops with list comprehensions, and local variables don't help anymore? It might sound like you're working with a very big application with real-life data at this point. If you're not familiar yet with some of the fascinating libraries that are out there for Python, like NumPy and Pandas, they definitely should be on your radar.
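A minimal sketch of the trade-off (the data is illustrative; whether the conversion pays off depends on how many lookups you do):

```python
values = list(range(1_000_000))

# Single lookup: converting to a set first iterates over the whole list,
# so it costs more than simply searching the list once.
found = 999_999 in set(values)     # O(n) conversion + O(1) lookup
found = 999_999 in values          # O(n) lookup, no conversion overhead

# Many lookups: the one-time conversion is quickly amortized.
values_set = set(values)
hits = [v in values_set for v in (5, 500, 999_999)]   # each check is O(1)
```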

NumPy uses its own implementation of arrays, which are more compact and faster than Python lists. That's openly documented; you can find it on the web, and it's really interesting. Similarly, Pandas is super popular. It uses mechanisms like vectorization that also let you avoid loops altogether and make things run a lot faster. Pandas is also interesting because, even though some of its default data structures are not designed to be super efficient, it's such a mature library that there are many ways to scale it to large data sets and fine-tune its efficiency, and these are even described in its official documentation.
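A minimal sketch of vectorization with NumPy (assuming NumPy is installed; the same idea carries over to Pandas Series and DataFrame columns):

```python
import numpy as np

values = np.arange(1_000_000)

# Vectorized: the squaring runs in optimized C loops inside NumPy,
# with no explicit Python-level for-loop.
squares = values ** 2

# Equivalent pure-Python version for comparison (much slower):
squares_list = [v * v for v in range(1_000_000)]
```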

Lastly, one more thing I wanted to mention, especially if you have a long-running, really large application: what might be beneficial is a just-in-time, or JIT, compiler. JIT compilers collect information about the data types your code is using and use it to generate very specific machine code that helps your program run a lot faster. On the other hand, you won't see a benefit from JIT compilation if you apply it to one-off short scripts, but if it's a big program, it could be your friend. There are two really popular options. One is PyPy, which is a Python implementation of Python and claims to be at least a few times faster than the standard CPython implementation. The other one is Numba, which is a JIT compiler for numerical Python code.
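A minimal sketch of the Numba approach (assuming Numba is installed; the function and limit are illustrative):

```python
from numba import njit

@njit
def sum_of_squares(limit):
    # On the first call, Numba compiles this function to machine code
    # specialized for the argument types it actually sees.
    total = 0
    for i in range(limit):
        total += i * i
    return total

print(sum_of_squares(1_000_000))
```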