I recently completed a fruitful book camp based on the Python Tricks book by Dan Bader. There were 81 tricks that were new to me or that I found highly remarkable. It would not be practical to list them all here, and doing so would probably not comply with the publisher’s copyright. Luckily, I had my data-driven glasses with me during the five-day book camp.
Dan was kind enough to mention from time to time which Python Tricks affect memory, speed and performance when data is processed at large scale. This is how the Python Big Data Tricks compilation was born.
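There are several ways to create a copy in Python:

    import copy

    a = ['foo', 'foo']
    b = a.copy()           # shallow copy via the list method
    c = a[:]               # shallow copy via slicing
    d = list(a)            # shallow copy via the constructor
    e = copy.copy(a)       # shallow copy via the copy module
    f = copy.deepcopy(a)   # deep copy: nested objects are copied too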
Creating deep copies is slower and requires more space, because every nested object is duplicated as well. In one benchmark, it was 270 times slower than the slice approach.
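The exact factor depends on the data and the Python version, so it is worth measuring on your own workload. Here is a minimal sketch using `timeit`; the sample data and the `time_copy` helper are assumptions made for illustration, not the original benchmark.

```python
import copy
import timeit

# Hypothetical sample data: a list of small nested lists,
# so that deepcopy actually has nested objects to duplicate.
data = [[i, i + 1] for i in range(1_000)]


def time_copy(label, func, number=100):
    """Return the total seconds needed to call func `number` times."""
    seconds = timeit.timeit(func, number=number)
    print(f"{label:<18}{seconds:.4f} s")
    return seconds


slice_time = time_copy("a[:]", lambda: data[:])
method_time = time_copy("a.copy()", lambda: data.copy())
list_time = time_copy("list(a)", lambda: list(data))
shallow_time = time_copy("copy.copy(a)", lambda: copy.copy(data))
deep_time = time_copy("copy.deepcopy(a)", lambda: copy.deepcopy(data))

# A shallow copy shares the inner lists; a deep copy duplicates them.
shares_inner = data[:][0] is data[0]
duplicates_inner = copy.deepcopy(data)[0] is not data[0]
```

The last two lines show why the deep copy costs more: it is the only variant that creates new inner lists.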
namedtuples are great for creating immutable classes in Python, and they are more space-efficient than regular classes.
    >>> from collections import namedtuple
    >>> Goodie = namedtuple('Goodie', [
    ...     'url',
    ...     'followers',
    ... ])
    >>> goodie = Goodie('datagoodie.com', 5765776523764)
    >>> goodie.followers
    5765776523764
A beautiful benchmark on space efficiency:
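For a rough check on your own machine, here is a small sketch with `sys.getsizeof`. The `GoodieClass` comparison class is hypothetical, and the byte counts vary between Python versions, so only the relative order is meaningful.

```python
import sys
from collections import namedtuple

Goodie = namedtuple('Goodie', ['url', 'followers'])


class GoodieClass:
    """Hypothetical regular class with the same two fields."""

    def __init__(self, url, followers):
        self.url = url
        self.followers = followers


nt = Goodie('datagoodie.com', 5765776523764)
obj = GoodieClass('datagoodie.com', 5765776523764)

# sys.getsizeof is shallow, so the instance __dict__ of the
# regular class has to be added to get a fair comparison.
nt_size = sys.getsizeof(nt)
obj_size = sys.getsizeof(obj) + sys.getsizeof(obj.__dict__)
print(f"namedtuple: {nt_size} bytes, regular class: {obj_size} bytes")

# namedtuple instances are immutable: assignment raises AttributeError.
try:
    nt.followers = 0
    is_immutable = False
except AttributeError:
    is_immutable = True
```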
- generators work like list comprehensions but produce a stream of values instead of building a whole list
- they allow for maintainable data-processing pipelines
- use generators for memory efficiency because they produce values on the fly, e.g.

    >>> # use a generator to go from
    ... sum(x * 2 for x in range(3))
    6
- create data pipelines with iterator chains (dbader.org)
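An iterator chain can be sketched like this. The three generator functions are illustrative names chosen for this example; each stage pulls one value at a time from the previous one, so no intermediate list is ever built.

```python
def integers(limit):
    """Yield 0, 1, ..., limit - 1, one value at a time."""
    for i in range(limit):
        yield i


def squared(seq):
    """Yield the square of each incoming value."""
    for i in seq:
        yield i * i


def negated(seq):
    """Yield the negation of each incoming value."""
    for i in seq:
        yield -i


# Chain the stages; nothing is computed until the pipeline is consumed.
pipeline = negated(squared(integers(5)))
result = list(pipeline)
print(result)  # [0, -1, -4, -9, -16]
```

Adding or removing a processing step only means wrapping or unwrapping one more generator in the chain.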
There is a huge variety of array types in Python:
- go for a generic array structure like a list when you begin your project, and then change to a more efficient data structure as the data load becomes critical
- use NumPy/Pandas for a great choice of fast array implementations for scientific calculations and data analysis
- use array.array for more space efficiency (strictly typed)
- tuples require less space than lists
- bytes are immutable, bytearrays are mutable. Converting a bytearray to bytes copies the whole buffer (O(n)), so it is slow for large data
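To get a feel for the space differences, here is a small sketch that stores the same integers in a list, a tuple and an `array.array`. Note that `sys.getsizeof` only measures the container itself, not the integer objects that the list and tuple point to, so the real gap is even larger; the element count is an arbitrary choice.

```python
import array
import sys

n = 10_000
as_list = list(range(n))
as_tuple = tuple(as_list)
as_array = array.array('i', as_list)  # 'i' = signed C int, stored inline

list_size = sys.getsizeof(as_list)
tuple_size = sys.getsizeof(as_tuple)
array_size = sys.getsizeof(as_array)
print(f"list: {list_size}, tuple: {tuple_size}, array: {array_size} bytes")

# array.array is strictly typed: other types are rejected.
try:
    as_array.append('not an int')
    rejects_wrong_type = False
except TypeError:
    rejects_wrong_type = True

# bytearray is mutable; bytes(...) creates an immutable full copy.
buffer = bytearray(b'abc')
buffer[0] = ord('x')
frozen = bytes(buffer)
```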
- you can turn regular primitives into binary blobs with struct.Struct. Doing so lets you keep more data in memory or send it in a packet over a network.
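A minimal sketch of packing and unpacking with `struct.Struct`. The record layout (an id, a follower count and a score) is a hypothetical example, not something prescribed by the book.

```python
import struct

# Hypothetical record: unsigned 32-bit id, unsigned 64-bit follower
# count, 32-bit float score. '<' means little-endian, standard sizes,
# no padding between fields.
record = struct.Struct('<IQf')

blob = record.pack(42, 5765776523764, 0.5)
print(record.size, len(blob))  # 16 16

# unpack returns a tuple with the original values.
user_id, followers, score = record.unpack(blob)
```

Each record takes exactly 16 bytes here, which is far less than the equivalent Python objects would need.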
- you can use lists as stacks, using append() and pop() to add and remove the latest element at the end of the list
- collections.deque is great for push/pop at the end AND at the beginning (both O(1)), but performs poorly at random access in the middle (O(n))
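Both approaches side by side, as a short sketch with made-up sample values:

```python
from collections import deque

# A list as a stack: push and pop at the end.
stack = []
stack.append('eat')
stack.append('sleep')
stack.append('code')
top = stack.pop()      # removes and returns 'code'

# A deque supports fast push/pop at both ends.
d = deque(['b', 'c'])
d.append('d')          # push at the end
d.appendleft('a')      # push at the beginning
first = d.popleft()    # removes and returns 'a'
last = d.pop()         # removes and returns 'd'
```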
- in distributed environments, queues can be used to define whether elements are mutated synchronously or asynchronously
- for priority queues, use heapq. Or use queue.PriorityQueue when several threads share the queue, since it adds locking on top of a heap
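A minimal sketch of both priority-queue options: `heapq` operates on a plain list, while `queue.PriorityQueue` wraps a heap with locking so that threads can share it. The task tuples are made-up sample data.

```python
import heapq
from queue import PriorityQueue

# (priority, task) tuples: the lowest priority number comes out first.
tasks = [(2, 'code'), (1, 'eat'), (3, 'sleep')]

# heapq: functions that maintain the heap invariant on a regular list.
heap = []
for task in tasks:
    heapq.heappush(heap, task)
heap_order = [heapq.heappop(heap) for _ in range(len(tasks))]

# queue.PriorityQueue: same ordering, with synchronized put/get.
pq = PriorityQueue()
for task in tasks:
    pq.put(task)
pq_order = []
while not pq.empty():
    pq_order.append(pq.get())

print(heap_order)  # [(1, 'eat'), (2, 'code'), (3, 'sleep')]
```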
- deconstruct your functions and data structures with Python’s disassembler (the dis module) to see the bytecode they compile to
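A short sketch of the `dis` module in action. The `greet` function is a hypothetical example, and the exact opcodes printed differ between CPython versions.

```python
import dis


def greet(name):
    """Hypothetical function to disassemble."""
    return 'Hello, ' + name + '!'


# Print a human-readable listing of the function's bytecode.
dis.dis(greet)

# The instructions are also available programmatically.
opnames = [instr.opname for instr in dis.get_instructions(greet)]
```

Comparing the listings of two implementations of the same logic can hint at why one of them is slower.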
Please let me know if you have other great tricks and code examples to make Big Data development with Python more efficient.