Lesson 7, Bit 4: Databases and Pickles

Databases

A database is a file that is organized for storing data. Many databases are organized like a dictionary in the sense that they map from keys to values. The biggest difference between a database and a dictionary is that the database is on disk (or other permanent storage), so it persists after the program ends.

The module dbm provides an interface for creating and updating database files. As an example, I'll create a database that contains captions for image files.

Opening a database is similar to opening other files:

Code Notes
import dbm
db = dbm.open('captions', 'c')

The mode 'c' means that the database should be created if it doesn't already exist. The result is a database object that can be used (for most operations) like a dictionary.

When you create a new item, dbm updates the database file.

db['cleese.png'] = 'Photo of John Cleese.'

When you access one of the items, dbm reads the file:

Code                  Result
db['cleese.png']      b'Photo of John Cleese.'

The result is a bytes object, which is why it begins with b. A bytes object is similar to a string in many ways. When you get farther into Python, the difference becomes important, but for now we can ignore it.

If you make another assignment to an existing key, dbm replaces the old value:

Code                                 Result
db['cleese.png'] = 'A silly walk'
db['cleese.png']                     b'A silly walk'

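Which dictionary methods are available depends on the underlying database implementation; items, for example, does not work with every one. The keys method is available, so you can loop over the keys with a for loop:

for key in db.keys():
    print(key, db[key])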

As with other files, you should close the database when you are done:

db.close()
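The steps above can be collected into one short program. This is a minimal sketch rather than the only way to do it. It assumes a with statement is used so the database is closed automatically, it writes to a temporary directory so that no files are left behind, and it decodes the bytes keys and values back into strings. The jones.png entry is a hypothetical second item added only for illustration.

```python
import dbm
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # The database lives in a throwaway directory (an assumption of this sketch).
    path = os.path.join(tmpdir, 'captions')

    # 'c' creates the database if it does not exist yet.
    with dbm.open(path, 'c') as db:
        db['cleese.png'] = 'Photo of John Cleese.'
        db['cleese.png'] = 'A silly walk'           # replaces the old value
        db['jones.png'] = 'Photo of Terry Jones.'   # hypothetical extra entry

    # Reopen read-only; keys and values come back as bytes objects.
    with dbm.open(path, 'r') as db:
        captions = {key.decode(): db[key].decode() for key in db.keys()}

print(captions['cleese.png'])  # A silly walk
```

Leaving the with block closes the database, so no explicit call to close is needed.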

Pickling

A limitation of dbm is that the keys and values have to be strings or bytes. If you try to use any other type, you get an error.

The pickle module can help. It’s part of the Python standard library, so it’s always available. It’s fast; the bulk of it is written in C, like the Python interpreter itself. It can store arbitrarily complex Python data structures.

What can the pickle module store?

  • All the native datatypes that Python supports: booleans, integers, floating point numbers, strings, bytes objects, byte arrays, and None.
  • Lists, tuples, dictionaries, and sets containing any combination of native datatypes.
  • Lists, tuples, dictionaries, and sets containing any combination of lists, tuples, dictionaries, and sets containing any combination of native datatypes (and so on, to the maximum nesting level that Python supports).
  • Functions (with caveats).

First, we will use the dumps method to show you what pickling effectively does to an object.

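pickle.dumps takes an object as a parameter and returns a bytes object containing its pickled representation (dumps is short for "dump string", although in Python 3 the result is a bytes object rather than a string):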

Code Result
import pickle

t = [1, 2, 3]

pickle.dumps(t)
b'\x80\x03]q\x00(K\x01K\x02K\x03e.'

So what just happened?

The pickle module takes a Python data structure and serializes the data structure using a data format called "the pickle protocol."

The pickle protocol is Python-specific; there is no guarantee of cross-language compatibility. You probably couldn't take a pickle created in Python and do anything useful with it in Perl, PHP, Java, or any other language.

Not every Python data structure can be serialized by the pickle module. The pickle protocol has changed several times as new data types have been added to the Python language, but there are still limitations.
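To make the limits concrete, here is a small sketch. It round-trips a nested structure using pickle.loads (the inverse of dumps, covered later in this lesson) and then tries to pickle a lambda, which fails because a lambda has no name that pickle can look up again later. The variable names data, restored, and lambda_pickled are made up for this illustration.

```python
import pickle

# A nested structure built only from native datatypes and containers.
data = {
    'name': 'shoplist',
    'items': ['apple', 'mango', 'carrot'],
    'counts': (1, 2, 3),
    'tags': {'fruit', 'vegetable'},
}

# Pickle to bytes, then rebuild an equal object from those bytes.
restored = pickle.loads(pickle.dumps(data))
print(restored == data)  # True

# Functions are pickled by name, and a lambda has no importable name.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as err:
    lambda_pickled = False
    print('could not pickle the lambda:', type(err).__name__)
```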

As a result of these changes, there is no guarantee of compatibility between different versions of Python itself. Newer versions of Python support the older serialization formats, but older versions of Python do not support newer formats (since they don't support the newer data types).

Unless you specify otherwise, the functions in the pickle module use a default protocol version, pickle.DEFAULT_PROTOCOL, which is not necessarily the highest version available (pickle.HIGHEST_PROTOCOL). A newer protocol gives you more flexibility in the types of data you can serialize, but it also means that the resulting file will not be readable by older versions of Python that do not support that version of the pickle protocol.
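If a pickle has to be read by an older Python, you can ask for an older protocol explicitly. The sketch below assumes protocol 2 as the "older" target purely for illustration; pick whatever version your oldest reader supports.

```python
import pickle

t = [1, 2, 3]

# No protocol argument: the module's default protocol is used.
default_bytes = pickle.dumps(t)

# An explicit, older protocol (2 is an arbitrary choice for this sketch).
old_bytes = pickle.dumps(t, protocol=2)

# Pickles written with protocol 2 or later start with b'\x80' and the version byte.
print(old_bytes[:2])  # b'\x80\x02'

# Both pickles load back to an equal list.
print(pickle.loads(default_bytes) == pickle.loads(old_bytes) == t)  # True

# The default and the highest protocol depend on the running Python version.
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
```

The same protocol argument is accepted by pickle.dump when writing to a file.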

In Python 3, pickle always produces bytes, whichever protocol is used. Be sure to open your pickle files in binary mode; a file opened in text mode cannot accept the pickled bytes.

Pickles and Saving to a File

Now that we have used the dumps function to see what pickling actually does to our objects, let's see how we can use the dump function (note: no "s" in dump) to write a pickled object to a file.

The dump method accepts a minimum of two arguments: the object that we are pickling and the file in which we are storing the data, like this:

pickle.dump(object, file)

Here is an example where we take a shopping list and save it for later in a file.

Line  Code                                      Notes
1     import pickle                             Import the pickle module.
2     shoplistfile = 'shoplist.data'            The name of the file where we will store the object.
3     shoplist = ['apple', 'mango', 'carrot']   The list of things to buy. Notice that this is a Python list object.
4     fout = open(shoplistfile, 'wb')           Open the file for writing. The b after the w stands for "binary".
5     pickle.dump(shoplist, fout)               Dump the object to the file.
6     fout.close()                              Close the file.

Hooray! We successfully pickled the list.

Reading Pickle Data from a File

Now that we have a pickled file, we want to be able to get that data back.  We can use the load method to accomplish this.  First we need to open our file in binary mode.  Then we can use the load method to "unpickle" the data and make it usable again.

The load function accepts the open file object as its argument. Let's continue with our shopping list example.

Line  Code                             Notes
7     fin = open(shoplistfile, 'rb')   Open the file for reading in binary mode.
8     storedlist = pickle.load(fin)    Load the object from the file.
9     print(storedlist)                Display the list.
10    fin.close()                      Close the file.

Here is our output:

['apple', 'mango', 'carrot']

It's our original list!  We can pick up where we left off and do whatever we want with it.
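The same round trip can be written with with statements, which close each file automatically even if an error occurs. This is a sketch of one common style, not a required one; the temporary directory is an assumption added here so the example cleans up after itself.

```python
import os
import pickle
import tempfile

shoplist = ['apple', 'mango', 'carrot']

with tempfile.TemporaryDirectory() as tmpdir:
    shoplistfile = os.path.join(tmpdir, 'shoplist.data')

    # Write the pickle in binary mode; the file is closed when the block ends.
    with open(shoplistfile, 'wb') as fout:
        pickle.dump(shoplist, fout)

    # Read it back, again in binary mode.
    with open(shoplistfile, 'rb') as fin:
        storedlist = pickle.load(fin)

print(storedlist)  # ['apple', 'mango', 'carrot']
```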

Note that the unpickled list is not the original list; it is a different object with the same value. Here's an example where we pickle a list, then unpickle it.

Code Result
import pickle

t = [1, 2, 3]

s = pickle.dumps(t)

s
b'\x80\x03]q\x00(K\x01K\x02K\x03e.'
t2 = pickle.loads(s)

t2
[1, 2, 3]

Although the new object has the same value as the old, it is not (in general) the same object:

Code          Result
t == t2       True
t is t2       False

In other words, pickling and then unpickling has the same effect as copying the object.
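More precisely, the copy is a deep one: nested objects are copied too. The sketch below compares a pickle round trip with copy.deepcopy on a nested list; the names t, t2, and t3 are chosen just for this illustration.

```python
import copy
import pickle

t = [[1, 2], [3, 4]]

t2 = pickle.loads(pickle.dumps(t))  # copy made by pickling and unpickling
t3 = copy.deepcopy(t)               # copy made by the copy module

# Changing an inner list of the copy leaves the original untouched.
t2[0].append(99)

print(t)   # [[1, 2], [3, 4]]
print(t2)  # [[1, 2, 99], [3, 4]]
print(t3 == t, t3 is t)  # True False
```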

You can use pickle to store non-strings in a database. In fact, this combination is so common that it has been encapsulated in a module called shelve, although we don't have time to discuss it in this course.
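Combining the two by hand is straightforward: pickle the value to bytes before storing it, and unpickle it after reading it back. The following is a minimal sketch under a few assumptions: the database name lists and the key shoplist are hypothetical, and a temporary directory is used so no files are left behind.

```python
import dbm
import os
import pickle
import tempfile

shoplist = ['apple', 'mango', 'carrot']

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'lists')

    # dbm only accepts strings or bytes, so store the pickled bytes.
    with dbm.open(path, 'c') as db:
        db['shoplist'] = pickle.dumps(shoplist)

    # Reading returns the bytes; unpickling rebuilds the list.
    with dbm.open(path, 'r') as db:
        stored = pickle.loads(db['shoplist'])

print(stored)  # ['apple', 'mango', 'carrot']
```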

Multiple Pickles in one File

We can store multiple pickles in a single file as well.  The caveat here is that you have to remember which order they went in, because it follows a "first-in / first-out" rule:

Code Result
import pickle

list1 = ['apple', 'mango', 'carrot']
list2 = ['bread', 'bagel']
list3 = ['cake', 'cookies', 'pie']

fout = open('list.dat', 'wb')

pickle.dump(list1, fout)
pickle.dump(list2, fout)
pickle.dump(list3, fout)

fout.close()

That saved each list as a pickle into one file. Now let's extract them:

Code Result
fin = open('list.dat', 'rb')

pickled_list1 = pickle.load(fin)
pickled_list2 = pickle.load(fin)
pickled_list3 = pickle.load(fin)

fin.close()

print(pickled_list1)
print(pickled_list2)
print(pickled_list3)
['apple', 'mango', 'carrot']
['bread', 'bagel']
['cake', 'cookies', 'pie']

First in, first out: the objects come back in the order they were written, so you have to keep track of that order.
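If you don't know how many pickles a file holds, one approach is to keep calling pickle.load until it raises EOFError, which it does when there is nothing left to read. The helper load_all below is a hypothetical name for this sketch, and the file is written to a temporary directory so the example is self-contained.

```python
import os
import pickle
import tempfile


def load_all(fin):
    """Return every pickled object in the open binary file, in the order written."""
    items = []
    while True:
        try:
            items.append(pickle.load(fin))
        except EOFError:
            # No more pickles in the file.
            break
    return items


lists = [['apple', 'mango', 'carrot'], ['bread', 'bagel'], ['cake', 'cookies', 'pie']]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'list.dat')

    # Write the pickles one after another.
    with open(path, 'wb') as fout:
        for item in lists:
            pickle.dump(item, fout)

    # Read them all back without knowing the count in advance.
    with open(path, 'rb') as fin:
        loaded = load_all(fin)

print(loaded[1])  # ['bread', 'bagel']
```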