Lesson 7, Bit 2: Reading and Searching Data from Files
Text Files and Lines
A text file can be thought of as a sequence of lines, much like a Python string can be thought of as a sequence of characters. To break the file into lines, there is a special character that represents the "end of the line" called the newline character.
We've seen as far back as Lesson 1 that the newline character is
\n. Remember that even though this looks like two characters, it is
actually a single character. When we look at the variable by entering "stuff" in
the interpreter, it shows us the \n in the string, but when we use
print to show the string, we see the string broken into two lines by the newline
character.
| Code | Output |
|---|---|
| stuff = '1\n2' | |
| print(stuff) | 1 |
| | 2 |
You can also see that the length of the string '1\n2' is three
characters because the newline character is a single character.
| Code | Output |
|---|---|
| stuff = '1\n2' | |
| print(len(stuff)) | 3 |
So when we look at the lines in a file, we need to imagine that there is a special invisible character, the newline, at the end of each line. The newline character is what separates the characters in the file into lines.
Reading Files
While the file handle does not contain the data for the file, it is quite
easy to construct a for loop to read through and count each of the
lines in a file:
| Code | Output |
|---|---|
fin = open('words.txt') |
Line Count 113809 |
We can use the file handle as the sequence in our for loop. Our
for loop simply counts the number of lines in the file, and the program prints
the total once the loop finishes. The rough translation of the for loop into English is,
"for each line in the file represented by the file handle, add one to the count
variable."
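As a rough sketch of the counting loop described above: since words.txt may not be at hand, this version first writes a tiny three-line stand-in file. The file name sample_words.txt and the helper count_lines are assumptions made for illustration only.

```python
import os
import tempfile

# Hypothetical stand-in for words.txt: a tiny file with three words.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_words.txt')
fout = open(sample_path, 'w')
fout.write('aa\naah\naahed\n')
fout.close()


def count_lines(path):
    """Count the lines in a file by looping over its file handle."""
    fin = open(path)          # opening the file does not read its data
    count = 0
    for line in fin:          # each pass through the loop reads one line
        count = count + 1
    fin.close()
    return count


print('Line Count', count_lines(sample_path))
```

Run against the real words.txt, the same loop should report the file's full line count.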
The reason that the open function does not read the entire file
is that the file might be quite large, with many gigabytes of data. The
call to open takes the same amount of time regardless of the size
of the file. It is the for loop that actually causes the data to be read from
the file.
When the file is read using a for loop in this manner, Python
takes care of splitting the data in the file into separate lines using the
newline character. Python reads each line through the newline and includes the
newline as the last character in the line variable for each iteration of the
for loop.
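To see the newline that each line carries, a small sketch can display every line with repr, which shows \n as text instead of rendering it as a line break. The file sample_words.txt and the list lines_seen are made-up names used only for this illustration.

```python
import os
import tempfile

# Hypothetical stand-in for words.txt.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_words.txt')
fout = open(sample_path, 'w')
fout.write('aa\naah\naahed\n')
fout.close()

lines_seen = []
fin = open(sample_path)
for line in fin:
    print(repr(line))         # repr makes the trailing \n visible
    lines_seen.append(line)   # keep the lines so we can inspect them later
fin.close()
```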
Because the for loop reads the data one line at a time, it can
efficiently read and count the lines in very large files without running out of
main memory to store the data. The above program can count the lines in any size
file using very little memory since each line is read, counted, and then
discarded.
The readline Method
The file object provides several methods for reading, including
readline, which reads characters from the file until it gets to a
newline and returns the result as a string:
| Code | Result |
|---|---|
| fin = open('words.txt') | |
| fin.readline() | 'aa\n' |
The first word in this particular list is "aa", which is a kind of lava. The
sequence \n represents the newline character, a single whitespace character
that separates this word from the next.
When we print the result of fin.readline(), we cannot see the
\n because it is rendered as a line break. But if we
immediately display another character, we can see the line break:
| Code | Output |
|---|---|
fin = open('words.txt') |
aa |
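One way to make that line break visible, sketched here with a made-up stand-in file rather than the real words.txt, is to stop print from adding its own newline and then print a marker character. The marker '*' and the file name are assumptions for illustration.

```python
import os
import tempfile

# Hypothetical stand-in for words.txt.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_words.txt')
fout = open(sample_path, 'w')
fout.write('aa\naah\naahed\n')
fout.close()

fin = open(sample_path)
line = fin.readline()
# end='' stops print from adding its own newline, so the only line
# break in the output comes from the \n stored at the end of line.
print(line, end='')
print('*')                    # the marker lands on the following line
fin.close()
```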
The file object keeps track of where it is in the file, so if you call
readline again, you get the next word:
| Code | Result |
|---|---|
| fin.readline() | 'aah\n' |
The next word is "aah", which is a perfectly legitimate word, so stop looking at me like that.
If it's the whitespace that's bothering you, we can get rid of it with the
string method strip:
| Code | Result |
|---|---|
line = fin.readline() |
'aahed' |
If you recall back to Lesson 5, strip will remove any whitespace (spaces, tabs, or newlines) from the beginning and end of a string.
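A quick sketch of what strip does and does not remove, using made-up strings rather than lines from words.txt:

```python
# A line as readline might return it, with extra whitespace added
# on both sides for illustration.
line = '  aahed \t\n'
word = line.strip()           # removes whitespace from both ends
print(repr(word))

# Whitespace in the middle of the string is left alone.
phrase = ' two words \n'.strip()
print(repr(phrase))
```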
You can also use a file object as part of a for loop. This
program reads words.txt and prints each word, one per line:
| Code | Output |
|---|---|
fin = open('words.txt') |
aa |
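A minimal sketch of such a loop, assuming a small stand-in file in place of words.txt. The list words_printed is not needed for printing; it is an addition here so the result can be inspected afterwards.

```python
import os
import tempfile

# Hypothetical stand-in for words.txt.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_words.txt')
fout = open(sample_path, 'w')
fout.write('aa\naah\naahed\n')
fout.close()

words_printed = []
fin = open(sample_path)
for line in fin:
    word = line.strip()       # drop the newline at the end of the line
    print(word)               # print adds exactly one newline per word
    words_printed.append(word)
fin.close()
```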
The readlines Method
If you want to store all lines in a list of lines, you can use the
readlines method:
| Code | Result |
|---|---|
| fin = open('words.txt') | |
| words = fin.readlines() | |
| words | ['aa\n', 'aah\n', 'aahed\n', 'aahing\n', 'aahs\n', 'aal\n', 'aalii\n', 'aaliis\n', 'aals\n', 'aardvark\n', (...)] |
When we use the len function to see the length of the list
called words, we get the full number of lines in the file
words.txt:
| Code | Result |
|---|---|
| len(words) | 113809 |
Just like the number we got when we counted each line in the file!
So now, if we want to access the 15th line in this file, we can display the item at index 14 (remember that indices start at 0, so the 15th item is at index 14):
| Code | Result |
|---|---|
| words[14] | 'aasvogel\n' |
Because words is now a list, you can use all of the list operations
(indexing, slicing, len, methods such as append and sort, etc.) on it.
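As a small sketch of treating the result of readlines as an ordinary list, the following uses a three-line stand-in file (an assumption for illustration) instead of words.txt:

```python
import os
import tempfile

# Hypothetical stand-in for words.txt.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_words.txt')
fout = open(sample_path, 'w')
fout.write('aa\naah\naahed\n')
fout.close()

fin = open(sample_path)
words = fin.readlines()       # every line, newline included, in one list
fin.close()

print(len(words))             # how many lines the file has
print(repr(words[1]))         # indexing: the second line
print(words[:2])              # slicing: the first two lines
print('aah\n' in words)       # membership test; note the \n in each item
```

Each item in the list still ends with its newline, so comparisons against list items need to include the \n or strip it first.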
The read Method
If you know the file is relatively small compared to the size of your main
memory, you can read the whole file into one string using the read
method on the file handle.
| Code | Output |
|---|---|
fin = open('words.txt') |
1016714 |
print(words[:20]) |
aa |
In this example, the entire contents (all 1,016,714 characters) of the file
words.txt are read directly into the variable words.
We use string slicing to print out the first 20 characters of the string data
stored in words.
When the file is read in this manner, all the characters, including all of the
lines and newline characters, are one big string in the variable
words. Remember that the read method should only be used this way
if the file data will fit comfortably in the main memory of
your computer.
If the file is too large to fit in main memory, you should write your program
to read the file in chunks using a for or while
loop.
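One possible shape for reading in chunks is sketched below. It relies on the fact that read accepts an optional size and returns an empty string at the end of the file. The helper name count_chars_in_chunks, the chunk size, and the stand-in file are assumptions for illustration.

```python
import os
import tempfile

# Hypothetical stand-in for a file too large to read in one go.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_words.txt')
fout = open(sample_path, 'w')
fout.write('aa\naah\naahed\n')
fout.close()


def count_chars_in_chunks(path, chunk_size=1024):
    """Count the characters in a file while holding only one chunk in memory."""
    total = 0
    fin = open(path)
    while True:
        chunk = fin.read(chunk_size)   # read at most chunk_size characters
        if chunk == '':                # an empty string means end of file
            break
        total = total + len(chunk)
    fin.close()
    return total


# A deliberately tiny chunk size so the loop runs several times.
print(count_chars_in_chunks(sample_path, 4))
```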
Searching Through a File
When you are searching through data in a file, it is a very common pattern to read through a file, ignoring most of the lines and only processing lines which meet a particular condition. We can combine the pattern for reading a file with string methods to build simple search mechanisms.
For these examples, we are going to use the file called
mbox.txt. You can download it from https://online.cscc.edu/apps/python/book/mbox.txt.
This file is a record of e-mail activity from various individuals in an open
source project development team.
For example, if we wanted to read a file and only print out lines which
started with the prefix "From:", we could use the string method
startswith to select only those lines with the desired prefix:
| Code | Output |
|---|---|
fin = open('mbox.txt') |
From: stephen.marquard@uct.ac.za |
The output looks great since the only lines we are seeing are those which start with "From:", but why are we seeing the extra blank lines? Oh yeah, we forgot about that invisible newline character. Each of the lines ends with a newline, so print displays the string in the variable line, which includes a newline, and then print adds another newline of its own, resulting in the double spacing effect we see.
We could use string slicing to print all but the last character, but a simpler
approach is to use the rstrip method, which strips whitespace from the right
side of a string (strip would also work, since it strips both sides), as follows:
| Code | Output |
|---|---|
fin = open('mbox.txt') |
From: stephen.marquard@uct.ac.za |
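A minimal sketch of this search, assuming a four-line stand-in for mbox.txt. The sample text reuses addresses that appear in this lesson plus one made-up address and subject, and the helper from_lines is a name chosen for illustration.

```python
import os
import tempfile

# Hypothetical miniature of mbox.txt; the subject and the example.com
# address are invented for this sketch.
mbox_path = os.path.join(tempfile.mkdtemp(), 'mbox_sample.txt')
fout = open(mbox_path, 'w')
fout.write('From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n')
fout.write('From: stephen.marquard@uct.ac.za\n')
fout.write('Subject: made-up subject line\n')
fout.write('From: someone@example.com\n')
fout.close()


def from_lines(path):
    """Return the lines that start with 'From:', without trailing whitespace."""
    matches = []
    fin = open(path)
    for line in fin:
        line = line.rstrip()              # remove the newline on the right
        if line.startswith('From:'):
            matches.append(line)
    fin.close()
    return matches


for line in from_lines(mbox_path):
    print(line)                           # single spaced: one newline per line
```

The first line of the sample begins with "From " (no colon), so it is not selected.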
As your file processing programs get more complicated, you may want to
structure your search loops using continue. The basic idea of the
search loop is that you are looking for "interesting" lines and effectively
skipping "uninteresting" lines. And then when we find an interesting line, we do
something with that line.
We can structure the loop to follow the pattern of skipping uninteresting lines as follows:
| Code | Output |
|---|---|
fin = open('mbox.txt') |
From: stephen.marquard@uct.ac.za |
The output of the program is the same. In English, the uninteresting lines are those which do not start with "From:", which we skip using continue. For the "interesting" lines (i.e., those that start with "From:") we perform the processing on those lines.
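The skip-with-continue pattern might look like the sketch below. The stand-in file and the helper interesting_lines are assumptions for illustration; the sample subject and the example.com address are invented.

```python
import os
import tempfile

# Hypothetical miniature of mbox.txt.
mbox_path = os.path.join(tempfile.mkdtemp(), 'mbox_sample.txt')
fout = open(mbox_path, 'w')
fout.write('From: stephen.marquard@uct.ac.za\n')
fout.write('Subject: made-up subject line\n')
fout.write('From: someone@example.com\n')
fout.close()


def interesting_lines(path):
    """Collect the 'From:' lines, skipping all others with continue."""
    kept = []
    fin = open(path)
    for line in fin:
        line = line.rstrip()
        # Skip uninteresting lines
        if not line.startswith('From:'):
            continue
        # Process the interesting line
        kept.append(line)
    fin.close()
    return kept


for line in interesting_lines(mbox_path):
    print(line)
```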
We can use the find string method to simulate a text editor
search that finds lines where the search string is anywhere in the line. Since
find looks for an occurrence of a string within another string and either
returns the position of the string or -1 if the string was
not found, we can write the following loop to show lines which contain the
string "@uct.ac.za" (i.e., they come from the University of Cape Town in South
Africa):
| Code | Output |
|---|---|
fin = open('mbox.txt') |
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16
2008 |
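A sketch of the find-based search, again assuming a small stand-in for mbox.txt. The helper lines_containing is a hypothetical name, and the example.com address is invented.

```python
import os
import tempfile

# Hypothetical miniature of mbox.txt.
mbox_path = os.path.join(tempfile.mkdtemp(), 'mbox_sample.txt')
fout = open(mbox_path, 'w')
fout.write('From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n')
fout.write('From: stephen.marquard@uct.ac.za\n')
fout.write('From: someone@example.com\n')
fout.close()


def lines_containing(path, text):
    """Return the lines in which text appears anywhere."""
    found = []
    fin = open(path)
    for line in fin:
        line = line.rstrip()
        if line.find(text) == -1:   # -1 means text is not in this line
            continue
        found.append(line)
    fin.close()
    return found


for line in lines_containing(mbox_path, '@uct.ac.za'):
    print(line)
```

Unlike startswith, find matches the search string wherever it occurs, so both the "From " and "From:" lines are selected here.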
Letting the User Choose the File Name
We really do not want to have to edit our Python code every time we want to process a different file. It would be more usable to ask the user to enter the file name string each time the program runs so they can use our program on different files without changing the Python code.
This is quite simple to do by reading the file name from the user using
input as follows:

    # Ask the user which file to process
    file_name = input('Enter the file name: ')
    fin = open(file_name)
    count = 0
    for line in fin:
        line = line.strip()
        # Skip every line that is not a subject line
        if not line.startswith('Subject:'):
            continue
        else:
            count += 1
    print("There were", count, "subject lines in", file_name)

We read the file name from the user and place it in a variable named
file_name and open that file. Now we can run the program repeatedly
on different files.

    Enter the file name: mbox.txt
    There were 1797 subject lines in mbox.txt

    Enter the file name: mbox-short.txt
    There were 27 subject lines in mbox-short.txt

Before peeking at the next section, take a look at the above program and ask yourself, "What could possibly go wrong here?" or "What might our friendly user do that would cause our nice little program to ungracefully exit with a traceback, making us look not-so-cool in the eyes of our users?"
What if our user types something that is not a file name?

    Enter the file name: missing.txt
    Traceback (most recent call last):
      File "examples.py", line 28, in <module>
        fin = open(file_name)
    FileNotFoundError: [Errno 2] No such file or directory: 'missing.txt'

Do not laugh. Users will eventually do every possible thing they can do to break your programs, either on purpose or with malicious intent. As a matter of fact, an important part of any software development team is a person or group called Quality Assurance (or QA for short) whose very job it is to do the craziest things possible in an attempt to break the software that the programmer has created.
The QA team is responsible for finding the flaws in programs before we have delivered the program to the end users who may be purchasing the software or paying our salary to write the software. So the QA team is the programmer's best friend.
So now that we see the flaw in the program, we can elegantly fix it using the try/except structure, but we will learn more about that later in this lesson. For now, let us move on to learn how to write files.
