Installing Canopy

If you haven't installed it yet, download and install Enthought Canopy. It's a big download, and you won't need it until the end of the lab exercises. At the end of installation, leave the checkbox checked that asks if you want Canopy to be your default Python environment:

Files

Setting the current working directory

Throughout this lab, you'll need the contents of this zip file. Unzip it into the same folder that you're going to be coding in.

Unfortunately, since we're working with files and graphics, you'll no longer be able to use the visualizer (but it'll be back next week). Instead, I recommend that you paste the lines directly into the Canopy interpreter. Since we're working with files, you'll need to make sure the contents of the zip file are in Canopy's current working directory. The easiest way to do this is:

Open any file in the desired directory using Canopy.
Right click on the interpreter window and select "Change to Editor Directory" (marked with a black arrow in the image below).

You can also change your current working directory by clicking on the displayed current working directory (marked with a red arrow in the image below):

For those of you not using Canopy, you can ask your interpreter for its current working directory using the command import os; print(os.getcwd()). You can also set the current working directory by using import os; os.chdir(path), where path is the directory you want, e.g. os.chdir("C:\\windows\\"). Note that Windows paths must have DOUBLE backslashes for technical reasons having to do with strings that we'll discuss next week.
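
As a rough sketch of those two calls used together in a script: the temporary directory below is only a stand-in for wherever you unzipped the lab files, so swap in your own folder.

```python
import os
import tempfile

# Remember where we started so we can switch back at the end.
original_dir = os.getcwd()
print("Started in: " + original_dir)

# Stand-in target (an assumption for this demo): replace it with the
# folder that holds the unzipped lab files.
target_dir = tempfile.gettempdir()
os.chdir(target_dir)
print("Now in: " + os.getcwd())

# Quick check that a file you need is visible from the working directory.
has_input_file = os.path.exists("plaintext.txt")
print("plaintext.txt found here? " + str(has_input_file))

# Switch back so that later code is unaffected.
os.chdir(original_dir)
```

If the check prints False for your real lab folder, the open() calls in the exercises below will fail for the same reason.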

Basic file operations

Given the input file plaintext.txt, with contents as shown below:

this 
is a plain text file
it contains this text
spread out
over five lines

What is the output of the following program? If the program crashes on the first line, then your current working directory isn't set up correctly. Flag me down and I'll help out.

f = open("plaintext.txt", "rb")
print(f)
print("Tell #0: " + str(f.tell()))

line_one = f.readline()
print(line_one)
print("Tell #1: " + str(f.tell()))

line_two = f.readline()
print(line_two)
print("Tell #2: " + str(f.tell()))

f.seek(-9, 1)
print("Tell #3: " + str(f.tell()))
print(f.readline())

next_three_bytes = f.read(3)
print(next_three_bytes)
print("Tell #4: " + str(f.tell()))

rest_of_the_file = f.read(5)
print(rest_of_the_file)
print("Tell #5: " + str(f.tell()))

f.seek(8)
print("Tell #6: " + str(f.tell()))

for line in f:
    print(line)
    print("Tell: " + str(f.tell()))

f.close()
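
If the arguments to seek are unfamiliar, here is a small sketch on a throwaway file (demo_seek.txt is a made-up name, not part of the lab zip). It assumes only the standard behavior of file objects: seek(n) jumps to an absolute byte offset, while seek(n, 1) moves relative to the current position.

```python
import os
import tempfile

# Build a tiny two-line file to experiment on (hypothetical demo file).
demo_path = os.path.join(tempfile.gettempdir(), "demo_seek.txt")
with open(demo_path, "wb") as out:
    out.write(b"abc\ndefgh\n")

f = open(demo_path, "rb")

first_line = f.readline()        # reads up to and including the newline
pos_after_first = f.tell()       # 4: three letters plus the newline

f.seek(-2, 1)                    # move 2 bytes back from the current position
pos_after_relative = f.tell()    # 2

chunk = f.read(3)                # the 3 bytes starting at offset 2

f.seek(0)                        # absolute seek: back to the very start
pos_after_absolute = f.tell()    # 0

f.close()
os.remove(demo_path)

print(pos_after_first, pos_after_relative, pos_after_absolute)
```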

Give the contents of output.txt after the following program completes.

input_file = open("plaintext.txt", "r")
output_file = open("output.txt", "w")
all_lines = input_file.readlines()
print(all_lines)
all_lines.reverse()
output_file.writelines(all_lines)
input_file.close()
output_file.close()

Writing complicated data types

One very important thing when programming is to avoid reinventing the wheel. In this exercise, we'll explore various ways of writing lists to a file.

Consider the following code that writes two lists to a text file.

some_list = [6, 3, 10, 5, 16, 8, 4, 2, 1]
another_list = [101, 102, 103, 104]
output_file = open("some_lists.txt", "w")

output_file.write(str(len(some_list)))

for x in some_list:
    output_file.write(str(x))

output_file.write(str(len(another_list)))

for x in another_list:
    output_file.write(str(x))

output_file.close()

Why would it be hard to recover the original lists from the output file? Run this program and look at the generated output for some clues.

Since the write command doesn't add newline characters, we can do so manually. As an example, see the following code:

some_list = [6, 3, 10, 5, 16, 8, 4, 2, 1]
another_list = [101, 102, 103, 104]
output_file = open("some_lists.txt", "w")

output_file.write(str(len(some_list)) + "\n")

for x in some_list:
    output_file.write(str(x) + "\n")

output_file.write(str(len(another_list)) + "\n")

for x in another_list:
    output_file.write(str(x) + "\n")

output_file.close()

Is it possible to recover the two original lists using this file?

Write a program that is able to recover the original two lists by reading in the some_lists.txt generated by program #2 (immediately above).

The code you wrote for part 3 is often called a parser. Writing parsers is boring, and debugging them is the worst. Luckily, Python has a built-in module called pickle that does all the work for you. As an example, see the code below:
```
import pickle

some_list = [6, 3, 10, 5, 16, 8, 4, 2, 1]
another_list = [101, 102, 103, 104]
output_file = open("pickled_lists.txt", "w")

pickle.dump(some_list, output_file)
pickle.dump(another_list, output_file)

output_file.close()
```
Run the code above and open pickled_lists.txt.

To read a pickle file that was generated using pickle.dump, you can use pickle.load, as shown in the example below:

import pickle

input_file = open("pickled_lists.txt", "r")

some_list = pickle.load(input_file)
another_list = pickle.load(input_file)

print(some_list)
print(another_list)

input_file.close()

Compare this program to the one that you wrote for part 3. Much better!

As you noticed in part 4 above, pickled files are human readable. However, data can be stored more efficiently if we're willing to sacrifice human readability. To do so, we can specify a protocol for the pickle.dump method by adding a third parameter.

import pickle

numbers = []
for i in range(10000):
    numbers.append(i)

output_file = open("pickled_numbers.txt", "wb")
pickle.dump(numbers, output_file, protocol=0)
output_file.close()

output_file = open("pickled_numbers_binary.txt", "wb")
pickle.dump(numbers, output_file, protocol=2)
output_file.close()

Run the code above and compare the size of the two files generated. Open pickled_numbers_binary.txt and observe that the encoding used is opaque to your human eyes.
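
One way to compare the sizes without leaving Python is sketched below. It uses pickle.dumps, which returns the pickled bytes in memory instead of writing a file, so this is a variation on the program above rather than the same thing. For the files on disk, os.path.getsize("pickled_numbers.txt") gives the size in bytes.

```python
import pickle

numbers = list(range(10000))

# pickle.dumps returns the pickled bytes instead of writing them to a file.
text_pickle = pickle.dumps(numbers, protocol=0)
binary_pickle = pickle.dumps(numbers, protocol=2)

print("protocol 0: " + str(len(text_pickle)) + " bytes")
print("protocol 2: " + str(len(binary_pickle)) + " bytes")

# Both encodings decode back to the same list.
same_data = pickle.loads(text_pickle) == pickle.loads(binary_pickle)
print(same_data)
```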

You might ask "Why would I ever want to pickle something?" The answer is that your programs will often generate complex objects (for example, lists of lists), and generating these objects may take a very long time. By saving them to a file, your program can pick up where it left off last time.
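
A minimal sketch of that "pick up where it left off" idea follows. The names slow_computation, load_or_compute, and table_cache.pkl are invented for illustration, and the "slow" step here is just a small list of lists standing in for real work.

```python
import os
import pickle
import shutil
import tempfile

compute_calls = []  # records each time the slow step actually runs


def slow_computation():
    # Stand-in for something expensive; builds a small list of lists.
    compute_calls.append(1)
    return [[i * j for j in range(3)] for i in range(4)]


def load_or_compute(path):
    # Reuse the pickled result if it exists, otherwise compute and save it.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = slow_computation()
    with open(path, "wb") as f:
        pickle.dump(result, f, protocol=2)
    return result


demo_dir = tempfile.mkdtemp()  # fresh scratch folder for the demo
cache_path = os.path.join(demo_dir, "table_cache.pkl")

first_table = load_or_compute(cache_path)   # computes and saves
second_table = load_or_compute(cache_path)  # loads from the pickle file

print(first_table == second_table)
print("slow step ran " + str(len(compute_calls)) + " time(s)")

shutil.rmtree(demo_dir)  # clean up the scratch folder
```

In a real program you would keep the pickle file around between runs instead of deleting it at the end.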

CSV files

The file RBS_library.csv (in the .zip file you downloaded) was generated using Excel. Open it up using Excel, or if you don't have Excel, you can open a Google docs version at this link.

import csv
f = open('RBS_library.csv', 'rb')
csv_reader = csv.reader(f, delimiter = ',')

print(csv_reader.line_num)
dummy = next(csv_reader)
print(csv_reader.line_num)
headers = next(csv_reader)
print(csv_reader.line_num)

plasmid_id = []
GFP_off = []

for row in csv_reader:  
    plasmid_id.append(row[0])
    GFP_off.append(float(row[3]))

print(plasmid_id)
print(GFP_off)

Run the code above (ensure that the current working directory is the folder containing the csv file) and make sure you understand the results of the print statements.
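
To experiment with csv.reader without the lab file, here is a sketch that reads a tiny in-memory table. The column names and values are made up and are not taken from RBS_library.csv. It is written for Python 3, where io.StringIO stands in for an open text file and next(reader) fetches one row at a time.

```python
import csv
import io

# Made-up data: one title row, one header row, then two data rows.
sample_text = (
    u"Example library export\n"
    u"plasmid,rbs,gfp_on,gfp_off\n"
    u"pA,r1,10.5,0.5\n"
    u"pB,r2,20.0,1.25\n"
)

reader = csv.reader(io.StringIO(sample_text), delimiter=",")

title_row = next(reader)             # skip the title row
headers = next(reader)               # column names
lines_read_so_far = reader.line_num  # counts the lines consumed so far

names = []
values = []
for row in reader:
    names.append(row[0])
    values.append(float(row[3]))     # every field arrives as a string

print(headers)
print(names)
print(values)
```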

Exceptions and URLs

Python provides a built-in module called urllib2 which allows you to treat URLs somewhat like readable files. You can use read(), readline(), and readlines() just as with files.

Try running the following code:

import urllib2

urls = ['http://trololololololololololo.com', 'http://en.wikipedia.org/wiki/Eduard_Khil', 'http://en.wikipedia.org/w/index.php?action=raw&title=Eduard_Khil']

for url in urls:
    try:
        u = urllib2.urlopen(url)
        print("First 100 bytes of " + url + ":")
        print(u.read(100))
    except Exception:
        print("Could not read " + url)

The reason that the normal looking Wikipedia link (http://en.wikipedia.org/wiki/Eduard_Khil) fails is that Wikipedia refuses connections unless they are from a specific list of web browsers (this is perhaps to keep bandwidth costs from being run up by lazily implemented automated web crawlers). One fix is to use the special URL (http://en.wikipedia.org/w/index.php?action=raw&title=Eduard_Khil) in position 2 of the list. Another approach is to lie about the identity of our browser. As an example, try running the following code (appended to the bottom of the other urllib code above).
```
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

try:
    u = opener.open(urls[1])
    print("First 100 bytes of " + urls[1] + ":")
    print(u.read(100))
except Exception:
    print("Could not read " + urls[1])
```
In this version of the code, Python masquerades as an old version of Firefox, and Wikipedia gleefully gives us the data we request.
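
For reference, here is a sketch of the same idea under Python 3, where urllib2 was split into urllib.request and urllib.error. The helper names build_request and read_first_bytes are invented for this example. It catches urllib.error.URLError instead of using a bare except, so unrelated problems (such as a typo in a variable name) still show up as errors.

```python
import urllib.error
import urllib.request


def build_request(url):
    # Attach the browser-like User-Agent header to the request.
    return urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})


def read_first_bytes(url, count=100):
    # Return the first `count` bytes of the page, or None if the fetch fails.
    try:
        response = urllib.request.urlopen(build_request(url), timeout=10)
        try:
            return response.read(count)
        finally:
            response.close()
    except urllib.error.URLError as error:
        print("Could not read " + url + ": " + str(error))
        return None


# Building the request does not touch the network, so we can inspect it.
request = build_request("http://en.wikipedia.org/wiki/Eduard_Khil")
print(request.get_header("User-agent"))
```

Calling read_first_bytes(request.full_url) performs the actual download and needs a network connection.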

Numpy

Why does the following code crash? Remove the problematic line and rerun the code. What would you need to do to achieve the intended functionality of the problematic line?

import numpy as np

list_of_numbers = []
for i in range(100, 200):
    list_of_numbers.append(i)
    
array_of_numbers = np.arange(0, 99)

print(list_of_numbers)
print(array_of_numbers)
print(list_of_numbers ** 2)
print(array_of_numbers ** 2)

Why does the program below fail? How can it be fixed?

import numpy
import math

x = numpy.arange(1, 10)
y = math.log(x)

Below are two programs. The first works fine, but the second does not. Why not?

from numpy import *

t = arange(0, 2*pi, 0.01)
y = cos(t)

from numpy import *
from math import *

t = arange(0, 2*pi, 0.01)
y = cos(t)

Drawing Exercises

Word of warning: Python will behave a bit differently depending on whether or not you're running commands from the interpreter window or from a file. If you're in doubt, always put .show() at the end of your code, and close all figures before running your code.

Basic Exercises

Throughout this section, you may find it helpful to consult the official PyPlot tutorial.

1. Write a program that draws x in red, x * ln(x) in black, and x² in green over the range x = [1.7, 10]. Natural logarithm is just log(x) in numpy.
2. Augment your answer to #1 so that the axes include only values in the x range [1.7, 10] and the y range [1, 50].
3. Augment your answer to #2 so that there are now two plots in one figure. The top subplot should be exactly your answer to #2. In the bottom subplot, plot ln(x) as a blue dotted line over the same x range [1.7, 10], using a suitable y range. How do you determine suitable values for the y range (other than by looking at the answer)?

Plotting Census Data

1. The file texas_counties.csv gives the population of various counties in Texas. Column number 10 (starting from 0) gives the percentage change in population between 2011 and 2012. Your job is to plot a histogram of this percentage change across all of the counties in Texas. Your histogram should have 20 bins, and should look something like the plot below:
2. Augment your answer to #1 so that only counties with a population of greater than 10,000 are included in the histogram. Your histogram should still have 20 bins, and should look something like the plot below:

A little extra

Draw and show

This is an optional exercise to demonstrate more precisely how draw and show work. It's perhaps more confusing than it's worth, but it's here in the event that you're curious.

In the Canopy interpreter, what happens after you run EACH of the following lines:
```
figure(1)
plot([1, 2, 3, 4], [1, 4, 9, 16])
xlabel('x axis')
clf()
```
Close the figure, then create a .py file called draw_test.py. Inside this file, put only the single line:
```
figure(1)
```
Try to run the program. What error do you get?
Now modify draw_test.py so that the code reads:
```
import matplotlib.pyplot as plt
plt.figure(1)
```
Try to run the program. What happens? How is this different from what you observed when directly typing in the same code using the interpreter?
Now modify draw_test.py once more so that it reads:
```
import matplotlib.pyplot as plt
plt.figure(1)
plt.show()
```
What happens when you run the code now?
Next up, without closing the figure, create and run a new file called draw_test_2.py containing the following code:
```
import matplotlib.pyplot as plt
plt.figure(1)
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.show()
plt.xlabel('x axis')
```
As you see, nothing happens -- this is because Python does not update already existing figures unless you use a special command (called draw, see below).
Now close the figure, and rerun draw_test_2.py. You should see the plot show up, because this time there was no figure(1) when the program began. Note that the show() command does not need to appear after the final change to the graph (i.e. the x-axis label does show up, even though it was created after the show command).
If you want to alter a figure that already exists (i.e. is already showing) at the time that a program begins, you should use the plt.draw() command to tell matplotlib to update the drawing. Close the figure, and create a new figure by simply typing figure(1) in the Canopy interpreter.
Now, without closing the figure, create a .py file containing the following code and run it:
```
import matplotlib.pyplot as plt
plt.figure(1)
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.draw()
plt.xlabel('x axis')
```
What do you observe? The x-axis label does NOT appear! This is because the draw command came before the xlabel command. Unlike the initial show command that creates the figure, order is very important.
This minor inconsistency in the behavior of show() and draw() may seem a little confusing. Luckily, you don't really need to understand it, but I wanted to point it out in case you run into this behavior on your own later.