Lesson 7, Bit 2: Reading and Searching Data from Files

Text Files and Lines

A text file can be thought of as a sequence of lines, much like a Python string can be thought of as a sequence of characters. To break the file into lines, there is a special character that represents the "end of the line" called the newline character.

We've seen as far back as Lesson 1 that the newline character is \n. Remember that even though this looks like two characters, it is actually a single character. When we look at the variable by entering "stuff" in the interpreter, it shows us the \n in the string, but when we use print to show the string, we see the string broken into two lines by the newline character.

Code Output
stuff = '1\n2'
print(stuff)
1
2

You can also see that the length of the string '1\n2' is three characters because the newline character is a single character.

Code Output
stuff = '1\n2'
print(len(stuff))
3

So when we look at the lines in a file, we need to imagine a special invisible character, the newline, at the end of each line.

So the newline character separates the characters in the file into lines.
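
The idea that newlines divide text into lines can be checked directly on a string. The short sketch below uses a made-up three-line string (the variable names are only illustrative) and the string method splitlines, which this lesson has not introduced, to break the text at each newline.

```python
# A made-up string holding three lines; each line ends with a newline character.
text = 'first\nsecond\nthird\n'

# Count the newline characters: there is one per line.
newline_count = text.count('\n')

# splitlines breaks the string at each newline and drops the newline characters.
lines = text.splitlines()

print(newline_count)
print(lines)
```

Each item in the resulting list is one line of the original text, with the invisible newline removed.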

Reading Files

While the file handle does not contain the data for the file, it is quite easy to construct a for loop to read through and count each of the lines in a file:

Code Output
fin = open('words.txt')

count = 0

for line in fin:
    count += 1

print("Line Count", count)
Line Count 113809

We can use the file handle as the sequence in our for loop. Our for loop simply counts the number of lines in the file, and then we print the count. The rough translation of the for loop into English is, "for each line in the file represented by the file handle, add one to the count variable."

The reason that the open function does not read the entire file is that the file might be quite large, with many gigabytes of data. A call to open takes about the same amount of time regardless of the size of the file. It is the for loop that actually causes the data to be read from the file.

When the file is read using a for loop in this manner, Python takes care of splitting the data in the file into separate lines using the newline character. Python reads each line through the newline and includes the newline as the last character in the line variable for each iteration of the for loop.

Because the for loop reads the data one line at a time, it can efficiently read and count the lines in very large files without running out of main memory to store the data. The above program can count the lines in any size file using very little memory since each line is read, counted, and then discarded.
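
To see that each line really does arrive with its newline attached, here is a minimal sketch. It assumes a tiny three-word sample file written to a temporary folder (so it does not depend on words.txt); the names sample_path and last_chars are illustrative, and writing and closing files is covered elsewhere in this lesson.

```python
import os
import tempfile

# Build a small sample file in a temporary folder (an assumption for this demo).
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
fout = open(sample_path, 'w')
fout.write('aa\naah\naahed\n')
fout.close()

count = 0
last_chars = []

fin = open(sample_path)
for line in fin:
    count += 1
    # Record the final character of each line; it should be the newline.
    last_chars.append(line[-1])
fin.close()

print("Line Count", count)
print(last_chars)
```

Only one line is held in the variable line at a time, which is why the same loop works on files far larger than this sample.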

The readline Method

The file object provides several methods for reading, including readline, which reads characters from the file until it gets to a newline and returns the result as a string:

Code Result
fin = open('words.txt')
fin.readline()
'aa\n'

The first word in this particular list is "aa", which is a kind of lava. The sequence \n represents a single whitespace character, the newline, that separates this word from the next.

When we print the result of fin.readline(), we cannot see the \n itself because it renders as a line break. But if we immediately print something else, we can see the line break:

Code Output
fin = open('words.txt')
print(fin.readline())
print("hi")
aa

hi

The file object keeps track of where it is in the file, so if you call readline again, you get the next word:

Code Result
fin.readline()
'aah\n'

The next word is "aah", which is a perfectly legitimate word, so stop looking at me like that.

If it's the whitespace that's bothering you, we can get rid of it with the string method strip:

Code Result
line = fin.readline()

word = line.strip()

word
'aahed'

If you recall back to Lesson 5, strip will remove any whitespace (spaces, tabs, or newlines) from the beginning and end of a string. 
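
As a quick refresher, the sketch below applies strip and its one-sided relatives, rstrip and lstrip, to a made-up line that has whitespace on both ends. The sample string and variable names are only for illustration.

```python
# A made-up line with spaces in front and a space, tab, and newline behind.
line = '  aahed \t\n'

both = line.strip()    # removes whitespace from both ends
right = line.rstrip()  # removes whitespace from the right end only
left = line.lstrip()   # removes whitespace from the left end only

print(repr(both))
print(repr(right))
print(repr(left))
```

Using repr in the print calls makes the remaining whitespace visible in the output.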

You can also use a file object as part of a for loop. This program reads words.txt and prints each word, one per line:

Code Output
fin = open('words.txt')

for line in fin:
    word = line.strip()
    print(word)
aa
aah
aahed
aahing
aahs
aal
aalii
aaliis
aals
aardvark
(...)
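
The for loop above behaves roughly like calling readline over and over until the file runs out. The sketch below spells that out, assuming a three-word stand-in for words.txt built with io.StringIO (a standard-library object that acts like an open text file). It relies on the fact that readline returns an empty string once the end of the file is reached.

```python
import io

# A stand-in for an open text file (an assumption so the example is self-contained).
fin = io.StringIO('aa\naah\naahed\n')

words = []

while True:
    line = fin.readline()

    # An empty string means there is nothing left to read.
    if line == '':
        break

    words.append(line.strip())

print(words)
```

In practice the for loop is shorter and easier to read, so this version is mainly useful for understanding what the loop does for us.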

The readlines Method

If you want to store all lines in a list of lines, you can use the readlines method:

Code Result
fin = open('words.txt')

words = fin.readlines()

words
['aa\n', 'aah\n', 'aahed\n', 'aahing\n', 'aahs\n', 'aal\n', 'aalii\n', 'aaliis\n', 'aals\n', 'aardvark\n', (...)]

When we use the len function to see the length of the list called words, we get the full number of lines in the file words.txt:

Code Result
len(words)
113809

Just like the number we got when we counted each line in the file!

So now, if we want to access the 15th line in this file, we use index 14 (remember that indices start at 0, so the 15th item is at index 14):

Code Result
words[14]
'aasvogel\n'

Because words is now a list, you can use all of the list operations (indexing, slicing, len, append, sort, etc.) on it.
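
Here is a small sketch of working with the list that readlines returns. It assumes a made-up five-word stand-in for words.txt built with io.StringIO (a standard-library object that acts like an open text file); the names first_three and cleaned are illustrative.

```python
import io

# A stand-in for an open text file (an assumption so the example is self-contained).
fin = io.StringIO('aa\naah\naahed\naahing\naahs\n')

words = fin.readlines()

# Slicing works just as it does on any other list.
first_three = words[:3]

# Build a second list with the trailing newline removed from each word.
cleaned = []
for word in words:
    cleaned.append(word.strip())

# List methods such as sort are available too.
cleaned.sort(reverse=True)

print(first_three)
print(cleaned)
```

Note that each item in words is still a string, so string methods such as strip apply to the individual items rather than to the list itself.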

The read Method

If you know the file is relatively small compared to the size of your main memory, you can read the whole file into one string using the read method on the file handle.

Code Output
fin = open('words.txt')

words = fin.read()

print(len(words))
1016714
print(words[:20])
aa
aah
aahed
aahing

In this example, the entire contents (all 1,016,714 characters) of the file words.txt are read directly into the variable words. We use string slicing to print out the first 20 characters of the string data stored in words.

When the file is read in this manner, all the characters, including all of the lines and newline characters, are one big string in the variable words. Remember that the read method should only be used this way if the file data will fit comfortably in the main memory of your computer.

If the file is too large to fit in main memory, you should write your program to read the file in chunks using a for or while loop.
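
One way to read in chunks is to pass a size to the read method, which then returns at most that many characters per call and an empty string at the end of the file. The sketch below assumes a tiny stand-in file built with io.StringIO and a deliberately small, hypothetical chunk_size of 8 so the chunks are easy to follow; a real program would likely use a much larger value.

```python
import io

# A stand-in for an open text file (an assumption so the example is self-contained).
fin = io.StringIO('aa\naah\naahed\naahing\n')

chunk_size = 8   # hypothetical value chosen only for this demo
total_chars = 0
chunk_count = 0

while True:
    chunk = fin.read(chunk_size)

    # An empty string means the whole file has been read.
    if chunk == '':
        break

    total_chars += len(chunk)
    chunk_count += 1

print("Characters:", total_chars)
print("Chunks:", chunk_count)
```

Only one chunk is held in memory at a time, so the total is computed without ever storing the whole file.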

Searching Through a File

When you are searching through data in a file, it is a very common pattern to read through a file, ignoring most of the lines and only processing lines which meet a particular condition. We can combine the pattern for reading a file with string methods to build simple search mechanisms.

For these examples, we are going to use the file called mbox.txt.  You can download it from https://online.cscc.edu/apps/python/book/mbox.txt.  This file is a record of e-mail activity from various individuals in an open source project development team.

For example, if we wanted to read a file and only print out lines which started with the prefix "From:", we could use the string method startswith to select only those lines with the desired prefix:

Code Output
fin = open('mbox.txt')

for line in fin:
    if line.startswith("From:"):
        print(line)
From: stephen.marquard@uct.ac.za

From: louis@media.berkeley.edu

From: zqian@umich.edu

From: rjlowe@iupui.edu

From: zqian@umich.edu

...

The output looks great since the only lines we are seeing are those which start with "From:", but why are we seeing the extra blank lines? Oh yeah, we forgot about that invisible newline character. Each of the lines ends with a newline, so the print function prints the string in the variable line, which includes a newline, and then adds another newline of its own, resulting in the double spacing effect we see.

We could use string slicing to print all but the last character, but a simpler approach is to use the strip method, which strips whitespace from both ends of a string, or the rstrip method, which strips whitespace from only the right side, as follows:

Code Output
fin = open('mbox.txt')

for line in fin:
    line = line.strip()
    if line.startswith("From:"):
        print(line)
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
...
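
The options for avoiding the double spacing can be compared side by side on a single line. This is a minimal sketch using one made-up line of text; the end='' argument to print, which tells print not to add its own newline, has not been covered in this lesson and is shown only as an alternative.

```python
# One line as it would arrive from the file, newline included.
line = 'From: zqian@umich.edu\n'

# Option 1: slice off the last character. This assumes the last
# character really is a newline, which may not hold for the final
# line of a file.
sliced = line[:-1]

# Option 2: strip whitespace from the right side.
stripped = line.rstrip()

print(repr(line))    # repr makes the \n visible
print(sliced)
print(stripped)

# Option 3: leave the line alone and ask print not to add a newline.
print(line, end='')
```

The rstrip approach is the most forgiving of the three, since it does nothing harmful when the newline is missing.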

As your file processing programs get more complicated, you may want to structure your search loops using continue. The basic idea of the search loop is that you are looking for "interesting" lines and effectively skipping "uninteresting" lines. And then when we find an interesting line, we do something with that line.

We can structure the loop to follow the pattern of skipping uninteresting lines as follows:

Code Output
fin = open('mbox.txt')

for line in fin:
    line = line.strip()

    # Skip uninteresting line
    if not line.startswith("From:"):
        continue

    # Process 'interesting' line
    else:
        print(line)
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
...

The output of the program is the same. In English, the uninteresting lines are those which do not start with "From:", which we skip using continue. For the "interesting" lines (i.e., those that start with "From:") we perform the processing on those lines.

We can use the find string method to simulate a text editor search that finds lines where the search string is anywhere in the line. Since find looks for an occurrence of a string within another string and either returns the position of the string or -1 if the string was not found, we can write the following loop to show lines which contain the string "@uct.ac.za" (i.e., they come from the University of Cape Town in South Africa):

Code Output
fin = open('mbox.txt')

for line in fin:
    line = line.strip()

    # Skip uninteresting line
    if line.find("@uct.ac.za") == -1:
        continue

    # Process 'interesting' line
    else:
        print(line)
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
X-Authentication-Warning: nakamura.uits.iupui.edu: apache set sender to stephen.marquard@uct.ac.za using -f
From: stephen.marquard@uct.ac.za
Author: stephen.marquard@uct.ac.za
From david.horwitz@uct.ac.za Fri Jan  4 07:02:32 2008
...
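
To make the return values of find concrete, the sketch below runs the same search over a few made-up lines. It assumes a four-line stand-in for mbox.txt built with io.StringIO (a standard-library object that acts like an open text file); the sample lines, including the subject text, are invented for illustration, and matches is an illustrative name.

```python
import io

# A made-up stand-in for mbox.txt (an assumption so the example is self-contained).
fin = io.StringIO(
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n'
    'Subject: hypothetical sample subject\n'
    'From: louis@media.berkeley.edu\n'
    'Author: stephen.marquard@uct.ac.za\n'
)

matches = []

for line in fin:
    line = line.strip()

    # Skip uninteresting line: find returns -1 when the text is absent.
    if line.find('@uct.ac.za') == -1:
        continue

    # Process 'interesting' line
    matches.append(line)

# find returns the index where the match starts, or -1 if there is none.
found_at = 'From: zqian@umich.edu'.find('@')
missing = 'From: zqian@umich.edu'.find('@uct.ac.za')

print(matches)
print(found_at, missing)
```

Collecting the matching lines in a list, rather than printing them right away, makes it easy to count or reuse them later.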

Letting the User Choose the File Name

We really do not want to have to edit our Python code every time we want to process a different file. It would be more usable to ask the user to enter the file name string each time the program runs so they can use our program on different files without changing the Python code.

This is quite simple to do by reading the file name from the user using input as follows:

file_name = input('Enter the file name: ')
fin = open(file_name)

count = 0

for line in fin:
    line = line.strip()

    if not line.startswith('Subject:'):
        continue
    else:
        count += 1

print("There were", count, "subject lines in", file_name)

We read the file name from the user and place it in a variable named file_name and open that file. Now we can run the program repeatedly on different files.

Enter the file name: mbox.txt
There were 1797 subject lines in mbox.txt
Enter the file name: mbox-short.txt
There were 27 subject lines in mbox-short.txt

Before peeking at the next section, take a look at the above program and ask yourself, "What could possibly go wrong here?" or "What might our friendly user do that would cause our nice little program to ungracefully exit with a traceback, making us look not-so-cool in the eyes of our users?"

What if our user types something that is not a file name?

Enter the file name: missing.txt

Traceback (most recent call last):
  File "examples.py", line 28, in <module>
    fin = open(file_name)
FileNotFoundError: [Errno 2] No such file or directory: 'missing.txt'

Do not laugh. Users will eventually do every possible thing they can do to break your programs, either on purpose or with malicious intent. As a matter of fact, an important part of any software development team is a person or group called Quality Assurance (or QA for short) whose very job it is to do the craziest things possible in an attempt to break the software that the programmer has created.

The QA team is responsible for finding the flaws in programs before we have delivered the program to the end users who may be purchasing the software or paying our salary to write the software. So the QA team is the programmer's best friend.

So now that we see the flaw in the program, we can elegantly fix it using the try/except structure, but we will learn more about that later in this lesson. For now, let us move on to learn how to write files.