CSV files in Python


CSV PythonViews 1858

To understand how to work with CSV files in Python, we need to know why we actually need CSV files?

Need for file input and output

Let’s say we have a program that accepts an input and produces an output. For now, let’s say a list is passed as an input and in turn, we are getting another list as an output. Think about how do you pass the input. If both input and output lists are of smaller length, for the input you can manually type the values using the keyboard, for the output you can print it using the print function.

What if, the list is a larger length say some 10,000. Will you manually type the values? Even if you can print the output, using the print statement, is it possible for you to check all the values in the console.

One of the most common ways to share such large data between the programs is through text files. And the most popular format for data is CSV.

We don’t have to write our own CSV parser from scratch. There are several CSV libraries out there for us to use. The Python has its own csv library, which we can use in most of the cases. If data requires a lot of numerical analysis then we can use the Pandas library.

In this article, let’s learn how the CSV files work and the Python built-in CSV libraries. We will also learn how to parse CSV from text files, read from and write into CSV files and how CSV parsing works using the Pandas library.

How a CSV file differs from a text file?

Comma Separated Value is the abbreviated form of CSV. A CSV file is a plain text file that contains tabular data in a specific structure. Since it is a plain text file, it can contain only the text data which is nothing but ASCII or Unicode characters.

As already said, the CSV file contains the tabular data and each data value is separated by a comma. An example of CSV data format is:

Username,Identifier,Firstname,Lastname
booker12,9012,Rachel,Booker
grey07,2070,Laura,Grey

You can see that each value is separated by a comma. The first line of each column describes what information the column is going is contain. The remaining lines contain the actual data.

In general, the character which separates each value in a CSV file is called a delimiter. The comma is not the only delimiter used for CSV files. The other popularly used delimiters are tab(\t), pipe(|), colon(:) and semi-colon(;). You should know the delimiter used, in order to parse a CSV file.

How do CSV files get created?

We can export data from spreadsheets or databases in a CSV format and import these CSV files into programs. A program that handles a large volume of data can save its output into a CSV file. For example, let’s say we have a data-mining program that handles a large volume of data, we can export the result of the program into a CSV format, and then import it into a spreadsheet for data analysis,  graph generation, or creation and publication of a report.

CSV files are easy to work with any language that supports text file input, and string manipulation like Python.

Python’s built-in CSV library

With Python built-in CSV library, we can work with a variety of CSV formats and  CSV files generated by Excel. The built-in CSV library provides functionality to read from and write into CSV files. The CSV library has objects and code which we can import in our programs to read, write, and process data from and into CSV files.

Reading CSV files

We can read the CSV files using the reader object. With the open() function, the CSV file can be opened as a text file. The return value of the open() function is a file object, which is passed as a parameter to the reader.

Here is the sample staff_list.csv file:

Username,	Identifier,	First name,	Last name
booker12,	9012,		Rachel,		Booker
grey07,		2070,		Laura,		Grey
johnson81,	4081,		Craig,		Johnson
jenkins46,	9346,		Mary,		Jenkin

The below code reads the CSV file in python:

import csv

filename='staff_list.csv'
line_count=0

with open(filename,'r') as csvfile:
    csvreader=csv.reader(csvfile)
    for row in csvreader:
        if line_count==0:
            print(f"The column names are {', '.join(row)}")
            line_count+=1
        else:
            print(f"Username of {row[2]} {row[3]} is {row[0]} and id is {row[1]}")
            line_count+=1
    print(f"The total number of processed lines is {line_count}")

The output will be:

The column names are Username, Identifier, First name, Last name
Username of Rachel Booker is booker12 and id is 9012
Username of Laura Grey is grey07 and id is 2070
Username of Craig Johnson is johnson81 and id is 4081
Username of Mary Jenkins is jenkins46 and id is 9346
The total number of processed lines is 5

For the open() method we are passing an additional parameter called ‘r’ along with the filename. ‘r’ character means that we are opening the file in the ‘read’ mode.

The value returned by the reader object is iterable. With for loop, we are processing the data returned by the reader object. Since the first row contains header or column names, we are handling it in a different way.

We can also deal with column names like below method:

import csv

filename='staff_list.csv'

fields=[]
with open(filename,'r') as csvfile:
    csvreader=csv.reader(csvfile)
    fields=next(csvreader)
    print(f"The column names are {', '.join(fields)}")
    for row in csvreader:
        print(f"Username of {row[2]} {row[3]} is {row[0]} and id is {row[1]}")

Since the return value of the reader object is iterable, the next method returns the current row and points the iterator to the next row. We can process the remaining rows, which is the data row, using for loop. The output is still the same.

The column names are Username, Identifier, First name, Last name
Username of Rachel Booker is booker12 and id is 9012
Username of Laura Grey is grey07 and id is 2070
Username of Craig Johnson is johnson81 and id is 4081
Username of Mary Jenkins is jenkins46 and id is 9346

Reading CSV file into a dictionary

Instead of dealing with each row individually, we can read CSV data into an ordered dictionary and process the data.

The input csv file ‘staff_list.csv’ has the below records:

Username,Identifier,First name,Last name
booker12,9012,Rachel,Booker
grey07,2070,Laura,Grey
johnson81,4081,Craig,Johnson
jenkins46,9346,Mary,Jenkins

The code to read CSV file into a dictionary is

import csv

filename='staff_list.csv'
line_count=0

with open(filename,'r') as csvfile:
    csvreader=csv.DictReader(csvfile)
    for row in csvreader:
        if line_count==0:
            print(f"The column names are {', '.join(row)}")
            line_count+=1
        print(f'Username of {row["First name"]} {row["Last name"]} is {row["Username"]} and id is {row["Identifier"]}')
        line_count+=1
    print(f"The total number of processed lines is {line_count}")

The output is the same as before

The column names are Username, Identifier, First name, Last name
Username of Rachel Booker is booker12 and id is 9012
Username of Laura Grey is grey07 and id is 2070
Username of Craig Johnson is johnson81 and id is 4081
Username of Mary Jenkins is jenkins46 and id is 9346
The total number of processed lines is 5

We are reading the headers in the same way as before. But if you notice, we are reading data rows using the dictionary keys. Where do these keys come from? These keys are actually the column names. If you don’t have a header row in your CSV file, you should actually set a list of your own keys using the optional parameter fieldnames.

Optional csv reader parameters

With Python’s in-built csv reader, we can handle different formats of csv files by using some additional parameters like the below ones:

  • delimiter: The character that separates each data value. The default value is the comma(“,”) but there are other delimiters as well.
  • quotechar: The character that surrounds the field, that contains the delimiter. The double-quote      (‘ ” ‘) is the default value.
  • escapechar: When quotes aren’t used, escapechar is used to escape the delimiter character. Providing no escape character is the default value.

Let’s see an example where these optional parameters can be used.

Name,address,phone
James Butt, 6649 N Blue Gum St, New Orleans, Orleans, LA, 70116, 504-621-8927
Josephine Darakjy, 4 B Blue Ridge Blvd, Brighton, Livingston, MI, 48116, 810-292-9388
Art Venere, 8 W Cerritos Ave #54, Bridgeport, Gloucester, NJ, 8014, 856-636-8749

In the above example, the column names are Name, address, and phone. The comma delimits each data value. But the data value in the address field contains its own comma. How to deal with this?

There are three possible solutions on how to solve this:

Using different delimiter

We can use other delimiters like the pipe(|), colon(:), semi-colon(;). So that comma can be used in the address field.

Name|address|phone
James Butt| 6649 N Blue Gum St, New Orleans, Orleans, LA| 70116 504-621-8927
Josephine Darakjy| 4 B Blue Ridge Blvd, Brighton, Livingston, MI, 48116| 810-292-9388
Art Venere| 8 W Cerritos Ave #54, Bridgeport, Gloucester, NJ, 8014| 856-636-8749

And specify the delimiter in the code.

import csv

filename='staff_list.csv'
line_count=0

with open(filename,'r') as csvfile:
    csvreader=csv.DictReader(csvfile,delimiter='|') #delimiter is | here
    for row in csvreader:
        if line_count==0:
            print(f"The column names are {', '.join(row)}")
            line_count+=1
        print(f'Address of {row["Name"]}  is {row["address"]} and phone is {row["phone"]}')
        line_count+=1
    print(f"The total number of processed lines is {line_count}")

Using quotechar

We can wrap the address field in quotes so that the delimiter character will be ignored inside the quoted strings. Quotechar will work fine if we don’t have quotes inside the data.

Name, address, phone
James Butt,"6649 N Blue Gum St, New Orleans, Orleans, LA, 70116", 504-621-8927
Josephine Darakjy,"4 B Blue Ridge Blvd, Brighton, Livingston, MI, 48116", 810-292-9388
Art Venere,"8 W Cerritos Ave #54, Bridgeport, Gloucester, NJ, 8014", 856-636-8749

Using escapechar

By specifying escapechar, it nullifies the interpretation of the delimiter. If escapechar is used, in the code we have to use the optional parameter escapechar.

Name, address, phone
James Butt,6649 N Blue Gum St\, New Orleans\, Orleans\, LA\, 70116, 504-621-8927
Josephine Darakjy,4 B Blue Ridge Blvd\, Brighton\, Livingston\, MI\, 48116, 810-292-9388
Art Venere,8 W Cerritos Ave #54\, Bridgeport\, Gloucester\, NJ\, 8014, 856-636-8749

Code is:

import csv

filename='staff_list.csv'
line_count=0

with open(filename,'r') as csvfile:
    csvreader=csv.reader(csvfile,escapechar="\\") 
    for row in csvreader:
        if line_count==0:
            print(f"The column names are {', '.join(row)}")
            line_count+=1
        else:
            print(f'Address of {row[0]}  is {row[1]} and phone is {row[2]}')
            line_count+=1
    print(f"The total number of processed lines is {line_count}")

Writing CSV files

With writer object, we can write into CSV files in python. To write each row we can use writerow function.

import csv

filename='staff_list.csv'

with open(filename,'w') as csvfile:
    csvwriter=csv.writer(csvfile,delimiter=',',quotechar='"',quoting=csv.QUOTE_MINIMAL)
    csvwriter.writerow(['Username','Identifier','First name','Last name'])
    csvwriter.writerow(['booker12','9012','Rachel','Booker'])
    csvwriter.writerow(['grey07','2070','Laura','Grey'])
    csvwriter.writerow(['johnson81','4081','Craig','Johnson'])

Here we have opened the file name in write mode. The quotechar is used to wrap the fields which contain the delimiter, with the character mentioned. However, the usage of quotechar is determined by the quoting parameter. The quoting parameter can take the below values:

  • QUOTE_MINIMAL: This is the default value. Quotes only the fields which contain the delimiter character.
  • QUOTE_ALL: Quotes all the fields.
  • QUOTE_NONNUMERIC: Quotes all the fields which contain the text data. Non-numeric field will be converted to float data type.
  • QUOTE_NONE: Instead of quoting escapes the delimiter. We must use escapechar if quoting is set to QUOTE_NONE.

The output file created by the above code will contain the below data as a result.

Username,Identifier,First name,Last name

booker12,9012,Rachel,Booker

grey07,2070,Laura,Grey

johnson81,4081,Craig,Johnson

Writing CSV file from a dictionary

We can write CSV file from a dictionary as we can read from a dictionary.

import csv

filename='staff_list.csv'
fields=['Username','Identifier','First Name','Last name']

with open(filename,'w') as csvfile:
    csvwriter=csv.DictWriter(csvfile,fieldnames=fields)
    csvwriter.writeheader()
    csvwriter.writerow({'Username':'booker12','Identifier':'9012','First Name':'Rachel','Last name':'Booker'})
    csvwriter.writerow({'Username':'grey07','Identifier':'2070','First Name':'Laura','Last name':'Grey'})
    csvwriter.writerow({'Username':'johnson81','Identifier':'4081','First Name':'Craig','Last name':'Johnson'})

Fieldnames is must required when using DictWriter. DictWriter doesn’t know what are the keys to be used if we don’t provide fieldnames. Writeheader function writes the column names.

The output file generated by the above code contains the below data:

Username,Identifier,First Name,Last name

booker12,9012,Rachel,Booker

grey07,2070,Laura,Grey

johnson81,4081,Craig,Johnson

Parsing CSV files with pandas

We can parse CSV files not only with Python’s in-built csv library but also with Pandas. If your data requires a lot of analysis then Pandas in recommended.

pandas provide easy to use data structures and high-performance data analysis tools. pandas is an open-source Python library and can be downloaded for all Python installations. pandas even comes with Anaconda distribution and can be used in Jupyter notebook in order to share data, code, analyze results, visualizations, and narrative text.

The below command is to install pandas and its dependencies in Anaconda:

conda install pandas

If you don’t have Ananconda, then the pip command to install pandas is:

pip install pandas

Reading CSV file in Pandas

I have added two more fields to our existing staff_list.csv file. Below is the data:

Username,Identifier,First name,Last name,Hire Date,Salary
booker12,9012,Rachel,Booker,03/15/14,50000.00
grey07,2070,Laura,Grey,06/01/15,60000.00
johnson81,4081,Craig,Johnson,05/12/14,45000.00
jenkins46,9346,Mary,Jenkins,11/01/13,55000.00
smith79,5079,Jamie,Smith,05/23/13,80000.00
jones84,6078,Terry,Jones,08/12/14,72000.00

The read _csv method of pandas takes the CSV file as an input and reads the data into DataFrame. The below code is to read CSV data into pandas DataFrame using pandas:

import pandas
df=pandas.read_csv('staff_list.csv')
print(df)

The output will be:

    Username  Identifier First name Last name Hire Date   Salary
0   booker12        9012     Rachel    Booker  03/15/14  50000.0
1     grey07        2070      Laura      Grey  06/01/15  60000.0
2  johnson81        4081      Craig   Johnson  05/12/14  45000.0
3  jenkins46        9346       Mary   Jenkins  11/01/13  55000.0
4    smith79        5079      Jamie     Smith  05/23/13  80000.0
5    jones84        6078      Terry     Jones  08/12/14  72000.0

There are new things to note here:

  • pandas recognized the first line of data as columns names even though we didn’t handle the header in the code as we do in python in-built csv library.
  • We didn’t specify anything about index in our code, still, the DataFrame data is indexed. The index starts at 0. If needed, we can specify what the index column has to be.
  • The Salary column and the Identifier column are considered a float column. Check the type of value in a column to validate that.
    print(type(df['Salary'][0]),type(df['Identifier'][0]),type(df['Username'][0]))

Reading CSV files with other optional parameters

To change the index column use the index_col optional parameter.

import pandas
df=pandas.read_csv('staff_list.csv',index_col='Username')
print(df)

Now the username will be considered as an index in the DataFrame.

           Identifier First name Last name Hire Date   Salary
Username
booker12         9012     Rachel    Booker  03/15/14  50000.0
grey07           2070      Laura      Grey  06/01/15  60000.0
johnson81        4081      Craig   Johnson  05/12/14  45000.0
jenkins46        9346       Mary   Jenkins  11/01/13  55000.0
smith79          5079      Jamie     Smith  05/23/13  80000.0
jones84          6078      Terry     Jones  08/12/14  72000.0

If you check the type of the Hire Date column, it’s still a string. If you want pandas to recognize it as a data field then use the optional parameters parse_dates. parse_data takes a list of columns that will be treated as data columns.

import pandas
df=pandas.read_csv('staff_list.csv',index_col='Username',parse_dates=['Hire Date'])
print(df)

Check the output for the difference in date format.

           Identifier First name Last name  Hire Date   Salary
Username
booker12         9012     Rachel    Booker 2014-03-15  50000.0
grey07           2070      Laura      Grey 2015-06-01  60000.0
johnson81        4081      Craig   Johnson 2014-05-12  45000.0
jenkins46        9346       Mary   Jenkins 2013-11-01  55000.0
smith79          5079      Jamie     Smith 2013-05-23  80000.0
jones84          6078      Terry     Jones 2014-08-12  72000.0

Check the type of value in a date column to validate that.

print(type(df['Hire Date'][0]))

If your data doesn’t have column names then you can use an optional parameter called names to set a list of column names. If you want to override the column names then use an additional parameter called header and assign 0 to it, along with names parameter.

import pandas
df=pandas.read_csv('staff_list.csv',index_col='Username',
    parse_dates=['Hired on'],
    header=0,
    names=['Username','Id','First name','Last name','Hired on','Salary'])
print(df)

The column names Identifier and Hire Date will be changed to Id and Hired on. If you are changing the column names, change it in all the places where it’s used. Notice that we have changed the parse_dates value based on the column name to be changed.

             Id First name Last name   Hired on   Salary
Username
booker12   9012     Rachel    Booker 2014-03-15  50000.0
grey07     2070      Laura      Grey 2015-06-01  60000.0
johnson81  4081      Craig   Johnson 2014-05-12  45000.0
jenkins46  9346       Mary   Jenkins 2013-11-01  55000.0
smith79    5079      Jamie     Smith 2013-05-23  80000.0
jones84    6078      Terry     Jones 2014-08-12  72000.0

Writing CSV files in Pandas:

With pandas writing the csv data also an easier task. Let’s write the Dataframe data into CSV file by changing the column names.

import pandas
df=pandas.read_csv('staff_list.csv',index_col='Username',
    parse_dates=['Hired on'],
    header=0,
    names=['Username','Id','First name','Last name','Hired on','Salary'])
df.to_csv('modified_staff_list.csv')

The to_csv method takes the name of the csv file where we want our data to be written, as a parameter and writes the Dataframe data into it.

The output CSV file contains the below data:

Username,Id,First name,Last name,Hired on,Salary
booker12,9012,Rachel,Booker,2014-03-15,50000.0
grey07,2070,Laura,Grey,2015-06-01,60000.0
johnson81,4081,Craig,Johnson,2014-05-12,45000.0
jenkins46,9346,Mary,Jenkins,2013-11-01,55000.0
smith79,5079,Jamie,Smith,2013-05-23,80000.0
jones84,6078,Terry,Jones,2014-08-12,72000.0

Conclusion:

Most of the CSV parsing, reading, and writing can be handled by Python’s builtin csv library. Go for pandas, if your large set of data, and it involves a lot of analysis since pandas provide quick and easy CSV handling capabilities.

There are other libraries as well like ANTLRPLY, and PlyPlus, for parsing CSV files. Use regular expressions if simple String manipulation doesn’t work.

Translate »