Text Files and Lines

A text file can be thought of as a sequence of lines, much like a Python string can be thought of as a sequence of characters. To break the file into lines, there is a special character that represents the "end of the line" called the newline character.

We've seen as far back as Lesson 1 that the newline character is \n. Remember that even though this looks like two characters, it is actually a single character. When we look at the variable by entering "stuff" in the interpreter, it shows us the \n in the string, but when we use print to show the string, we see the string broken into two lines by the newline character.

Code	Output
`stuff = '1\n2' print(stuff)`	`1 2`

You can also see that the length of the string '1\n2' is three characters because the newline character is a single character.

Code	Output
`stuff = '1\n2' print(len(stuff))`	`3`

So when we look at the lines in a file, we need to imagine a special invisible character, the newline, at the end of each line. It is this character that separates the characters in the file into lines.
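As a small sketch of this idea, the following uses a made-up string (not taken from any of the lesson's files) to show that each newline is a single character and that it is what divides the text into lines:

```python
# A hypothetical three-line string; the two '\n' characters separate the lines.
text = 'first\nsecond\nthird'

# repr() shows the escape sequence; print() renders it as a line break.
print(repr(text))
print(text)

# Each '\n' counts as exactly one character.
newline_count = text.count('\n')

# splitlines() breaks the string apart at the newline characters.
lines = text.splitlines()
print(newline_count, lines)
```

Two newline characters produce three lines, just as the newlines in a file divide it into lines.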

Reading Files

While the file handle does not contain the data for the file, it is quite easy to construct a for loop to read through and count each of the lines in a file:

Code	Output
`fin = open('words.txt') count = 0 for line in fin: count += 1 print("Line Count", count)`	`Line Count 113809`

We can use the file handle as the sequence in our for loop. Our for loop simply counts the number of lines in the file, and the program then prints the total. The rough translation of the for loop into English is, "for each line in the file represented by the file handle, add one to the count variable."

The reason that the open function does not read the entire file is that the file might be quite large, with many gigabytes of data. The call to open takes about the same amount of time regardless of the size of the file. It is the for loop that actually causes the data to be read from the file.

When the file is read using a for loop in this manner, Python takes care of splitting the data in the file into separate lines using the newline character. Python reads each line through the newline and includes the newline as the last character in the line variable for each iteration of the for loop.

Because the for loop reads the data one line at a time, it can efficiently read and count the lines in very large files without running out of main memory to store the data. The above program can count the lines in any size file using very little memory since each line is read, counted, and then discarded.
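Here is a self-contained sketch of the same counting pattern. Because words.txt may not be on your computer, it first writes a tiny sample file (the name sample_words.txt and its contents are assumptions made for illustration). It also uses a with block, which this lesson has not introduced, only so that the file is closed automatically:

```python
import os
import tempfile

# Create a small sample file so the sketch runs on its own
# (a hypothetical stand-in for words.txt).
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_words.txt')
with open(sample_path, 'w') as fout:
    fout.write('aa\naah\naahed\n')

count = 0
last_line = ''
with open(sample_path) as fin:   # opening does not read the data yet
    for line in fin:             # only one line is held in memory at a time
        count += 1
        last_line = line         # each line still ends with its newline

print('Line Count', count)
print(repr(last_line))
```

Only the current line is kept in the variable line, which is why the same loop works for files of any size.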

The readline Method

The file object provides several methods for reading, including readline, which reads characters from the file until it gets to a newline and returns the result as a string:

Code	Result
`fin = open('words.txt') fin.readline()`	`'aa\n'`

The first word in this particular list is "aa", which is a kind of lava. The sequence \n represents a single whitespace character, the newline, that separates this word from the next.

When we print the result of fin.readline(), we cannot see the \n because it renders as a line break. But if we immediately print something else, we can see the line break:

Code	Output
`fin = open('words.txt') print(fin.readline()) print("hi")`	`aa hi`

The file object keeps track of where it is in the file, so if you call readline again, you get the next word:

Code	Result
`fin.readline()`	`'aah\n'`

The next word is "aah", which is a perfectly legitimate word, so stop looking at me like that.

If it's the whitespace that's bothering you, we can get rid of it with the string method strip:

Code	Result
`line = fin.readline() word = line.strip() word`	`'aahed'`

If you recall back to Lesson 5, strip will remove any whitespace (spaces, tabs, or newlines) from the beginning and end of a string.
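The sketch below walks through readline and strip on a tiny sample file (sample_words.txt is a hypothetical stand-in for words.txt). It also shows what happens when there is nothing left to read: readline returns an empty string.

```python
import os
import tempfile

# Write a two-word sample file (a hypothetical stand-in for words.txt).
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_words.txt')
with open(sample_path, 'w') as fout:
    fout.write('aa\naah\n')

fin = open(sample_path)
first = fin.readline()     # reads up to and including the first newline
second = fin.readline()    # the file object remembers its position
at_end = fin.readline()    # empty string: no more data in the file
fin.close()

# strip removes the trailing newline from each line.
words = [first.strip(), second.strip()]
print(repr(first), repr(second), repr(at_end))
print(words)
```

An empty string from readline is different from a blank line in the file, which would come back as '\n'.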

You can also use a file object as part of a for loop. This program reads words.txt and prints each word, one per line:

Code	Output
`fin = open('words.txt') for line in fin: word = line.strip() print(word)`	`aa aah aahed aahing aahs aal aalii aaliis aals aardvark (...)`

The readlines Method

If you want to store all lines in a list of lines, you can use the readlines method:

Code	Result
`fin = open('words.txt') words = fin.readlines()`	`['aa\n', 'aah\n', 'aahed\n', 'aahing\n', 'aahs\n', 'aal\n', 'aalii\n', 'aaliis\n', 'aals\n', 'aardvark\n', (...)]`

When we use the len function to see the length of the list called words, we get the full number of lines in the file words.txt:

Code	Result
`len(words)`	`113809`

Just like the number we got when we counted each line in the file!

So now, if we want to access the 15th line in this file, we can display the item at index 14 (remember that indices start at 0, so the 15th item is at index 14):

Code	Result
`words[14]`	`'aasvogel\n'`

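Because words is now a list, you can use all of the list operations (indexing, slicing, append, sort, etc.) on it. Each item in the list is still a string, so string methods such as strip, find, and replace work on the individual items.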
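A minimal sketch of working with the list returned by readlines, again using a small made-up sample file rather than the real words.txt:

```python
import os
import tempfile

# Four-word sample file (a hypothetical stand-in for words.txt).
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_words.txt')
with open(sample_path, 'w') as fout:
    fout.write('aa\naah\naahed\naahing\n')

with open(sample_path) as fin:
    words = fin.readlines()       # a list of strings, one per line

third = words[2]                  # list indexing: the third line is at index 2
first_two = words[:2]             # list slicing

# Build a new list with the newline stripped from every item.
clean = []
for word in words:
    clean.append(word.strip())    # strip is a string method, used per item

print(len(words), repr(third), first_two, clean)
```

Note that readlines holds every line in memory at once, so it suits files that fit comfortably in main memory.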

The read Method

If you know the file is relatively small compared to the size of your main memory, you can read the whole file into one string using the read method on the file handle.

Code	Output
`fin = open('words.txt') words = fin.read() print(len(words))`	`1016714`
`print(words[:20])`	`aa aah aahed aahing`

In this example, the entire contents (all 1,016,714 characters) of the file words.txt are read directly into the variable words. We use string slicing to print out the first 20 characters of the string data stored in words.

When the file is read in this manner, all the characters, including all of the lines and newline characters, are one big string in the variable words. Remember that the read method should only be used this way if the file data will fit comfortably in the main memory of your computer.

If the file is too large to fit in main memory, you should write your program to read the file in chunks using a for or while loop.
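One way to read in chunks is to pass a size to read and loop until it returns an empty string. The sketch below assumes a chunk size of 256 characters and a generated sample file; both are arbitrary choices for illustration, and a real program would pick a much larger chunk size.

```python
import os
import tempfile

# Generate a 1000-character sample file (hypothetical data).
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_data.txt')
with open(sample_path, 'w') as fout:
    fout.write('abcdefghij' * 100)

chunk_size = 256      # assumed value; tune for real files
total_chars = 0
chunks_read = 0

with open(sample_path) as fin:
    while True:
        chunk = fin.read(chunk_size)   # at most chunk_size characters
        if chunk == '':                # empty string means end of file
            break
        total_chars += len(chunk)      # process the chunk, then discard it
        chunks_read += 1

print(total_chars, 'characters in', chunks_read, 'chunks')
```

Only one chunk is in memory at a time, so memory use stays bounded no matter how large the file is.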

Searching Through a File

When you are searching through data in a file, it is a very common pattern to read through the file, ignoring most of the lines and only processing lines which meet a particular condition. We can combine the pattern for reading a file with string methods to build simple search mechanisms.

For these examples, we are going to use the file called mbox.txt. You can download it from https://online.cscc.edu/apps/python/book/mbox.txt. This file is a record of e-mail activity from various individuals in an open source project development team.

For example, if we wanted to read a file and only print out lines which started with the prefix "From:", we could use the string method startswith to select only those lines with the desired prefix:

Code	Output
`fin = open('mbox.txt') for line in fin: if line.startswith("From:"): print(line)`	`From: stephen.marquard@uct.ac.za From: louis@media.berkeley.edu From: zqian@umich.edu From: rjlowe@iupui.edu From: zqian@umich.edu ...`

The output looks great since the only lines we are seeing are those which start with "From:", but why are we seeing the extra blank lines? Oh yeah, we forgot about that invisible newline character. Each of the lines ends with a newline, so the print function prints the string in the variable line, which includes a newline, and then adds another newline of its own. The result is the double spacing effect we see.

We could use string slicing to print all but the last character, but a simpler approach is to use the rstrip method, which strips whitespace from the right side of a string, or the strip method, which strips it from both sides, as follows:

Code	Output
`fin = open('mbox.txt') for line in fin: line = line.strip() if line.startswith("From:"): print(line)`	`From: stephen.marquard@uct.ac.za From: louis@media.berkeley.edu From: zqian@umich.edu From: rjlowe@iupui.edu From: zqian@umich.edu ...`
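The sketch below compares the ways of avoiding the double spacing. The second string is a made-up indented line, included only to show where rstrip and strip differ:

```python
line = 'From: stephen.marquard@uct.ac.za\n'
indented = '   From: not a real header\n'   # hypothetical indented line

# Option 1: keep the line's own newline and tell print not to add one.
print(line, end='')

# Option 2: remove the trailing newline before printing.
print(line.rstrip())

# rstrip only touches the right side; strip also removes leading whitespace.
trimmed_right = indented.rstrip()
trimmed_both = indented.strip()

right_matches = trimmed_right.startswith('From:')   # leading spaces remain
both_matches = trimmed_both.startswith('From:')     # leading spaces removed
print(right_matches, both_matches)
```

The choice matters when combined with startswith: after strip, an indented line can match a prefix that it would not match after rstrip.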

As your file processing programs get more complicated, you may want to structure your search loops using continue. The basic idea of the search loop is that you are looking for "interesting" lines and effectively skipping "uninteresting" lines. And then when we find an interesting line, we do something with that line.

We can structure the loop to follow the pattern of skipping uninteresting lines as follows:

Code	Output
`fin = open('mbox.txt') for line in fin: line = line.strip() # Skip uninteresting line if not line.startswith("From:"): continue # Process 'interesting' line else: print(line)`	`From: stephen.marquard@uct.ac.za From: louis@media.berkeley.edu From: zqian@umich.edu From: rjlowe@iupui.edu From: zqian@umich.edu ...`

The output of the program is the same. In English, the uninteresting lines are those which do not start with "From:", which we skip using continue. For the "interesting" lines (i.e., those that start with "From:") we perform the processing on those lines.

We can use the find string method to simulate a text editor search that finds lines where the search string is anywhere in the line. Since find looks for an occurrence of a string within another string and either returns the position of the string or -1 if the string was not found, we can write the following loop to show lines which contain the string "@uct.ac.za" (i.e., they come from the University of Cape Town in South Africa):

Code	Output
`fin = open('mbox.txt') for line in fin: line = line.strip() # Skip uninteresting line if line.find("@uct.ac.za") == -1: continue # Process 'interesting' line else: print(line)`	`From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008 X-Authentication-Warning: nakamura.uits.iupui.edu: apache set sender to stephen.marquard@uct.ac.za using -f From: stephen.marquard@uct.ac.za Author: stephen.marquard@uct.ac.za From david.horwitz@uct.ac.za Fri Jan 4 07:02:32 2008 ...`
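To see the search loop with its indentation intact, here is a sketch that runs on a short list of sample lines instead of the real mbox.txt. The list sample_lines is an assumption standing in for the file:

```python
# A few sample lines standing in for mbox.txt (hypothetical data).
sample_lines = [
    'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n',
    'From: louis@media.berkeley.edu\n',
    'Author: stephen.marquard@uct.ac.za\n',
]

matches = []
for line in sample_lines:
    line = line.strip()
    # Skip uninteresting lines: find returns -1 when the text is absent.
    if line.find('@uct.ac.za') == -1:
        continue
    # Process 'interesting' lines.
    matches.append(line)

# find returns the index where the match begins.
first_pos = matches[0].find('@uct.ac.za')

# The in operator asks the same yes/no question without the position.
same_answer = ('@uct.ac.za' in sample_lines[1]) == (sample_lines[1].find('@uct.ac.za') != -1)

print(matches)
print(first_pos, same_answer)
```

When only a yes/no answer is needed, the in operator reads more naturally. find is useful when the position itself matters.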

Letting the User Choose the File Name

We really do not want to have to edit our Python code every time we want to process a different file. It would be more usable to ask the user to enter the file name string each time the program runs so they can use our program on different files without changing the Python code.

This is quite simple to do by reading the file name from the user using input as follows:

    file_name = input('Enter the file name: ')
    fin = open(file_name)
    count = 0
    for line in fin:
        line = line.strip()
        if not line.startswith('Subject:'):
            continue
        else:
            count += 1
    print("There were", count, "subject lines in", file_name)

We read the file name from the user and place it in a variable named file_name and open that file. Now we can run the program repeatedly on different files.

    Enter the file name: mbox.txt
    There were 1797 subject lines in mbox.txt

    Enter the file name: mbox-short.txt
    There were 27 subject lines in mbox-short.txt
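The same counting logic can also be written as a function that takes the file name as a parameter, which makes it easy to try on several files. This is only a sketch: the function name count_prefix_lines and the sample file are assumptions made for illustration, not part of the program above.

```python
import os
import tempfile

def count_prefix_lines(file_name, prefix):
    """Return how many lines in file_name start with prefix."""
    count = 0
    with open(file_name) as fin:
        for line in fin:
            if line.startswith(prefix):
                count += 1
    return count

# Build a tiny sample mailbox file (hypothetical contents).
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_mbox.txt')
with open(sample_path, 'w') as fout:
    fout.write('From: a@example.com\n')
    fout.write('Subject: first message\n')
    fout.write('Body text\n')
    fout.write('Subject: second message\n')

subject_count = count_prefix_lines(sample_path, 'Subject:')
print('There were', subject_count, 'subject lines in', sample_path)
```

In an interactive program, the value returned by input would simply be passed in as file_name.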

Before peeking at the next section, take a look at the above program and ask yourself, "What could possibly go wrong here?" or "What might our friendly user do that would cause our nice little program to ungracefully exit with a traceback, making us look not-so-cool in the eyes of our users?"

What if our user types something that is not a file name?

    Enter the file name: missing.txt
    Traceback (most recent call last):
      File "examples.py", line 28, in <module>
        fin = open(file_name)
    FileNotFoundError: [Errno 2] No such file or directory: 'missing.txt'

Do not laugh: users will eventually do every possible thing they can do to break your programs, either by accident or with malicious intent. As a matter of fact, an important part of any software development team is a person or group called Quality Assurance (or QA for short) whose very job it is to do the craziest things possible in an attempt to break the software that the programmer has created.

The QA team is responsible for finding the flaws in programs before we have delivered the program to the end users who may be purchasing the software or paying our salary to write the software. So the QA team is the programmer's best friend.

So now that we see the flaw in the program, we can elegantly fix it using the try/except structure, which we will learn more about later in this lesson. For now, let us move on to learn how to write our files.
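For readers who want a brief preview of the try/except fix mentioned above, here is one possible sketch. The helper name open_or_none and the error message are assumptions made for illustration; the full treatment comes later in the lesson.

```python
import os
import tempfile

def open_or_none(file_name):
    """Return an open file handle, or None if the file does not exist."""
    try:
        return open(file_name)
    except FileNotFoundError:
        # Report the problem instead of ending with a traceback.
        print('File cannot be opened:', file_name)
        return None

# A path inside a fresh, empty temporary directory, so it cannot exist.
missing_path = os.path.join(tempfile.mkdtemp(), 'missing.txt')

fin = open_or_none(missing_path)
if fin is None:
    print('Nothing to count.')
else:
    fin.close()
```

The program now reports the problem and carries on, rather than exiting with a traceback.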