Pandas CSV error: Error tokenizing data. C error: EOF inside string starting at line

Tokenizing Error

Recently, I burned about 3 hours trying to load a large CSV file into Python Pandas using the read_csv function, only to consistently run into the following error:

ParserError                               Traceback (most recent call last)
<ipython-input-6-b51ad8562823> in <module>()
...
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)()

Error tokenizing data. C error: EOF inside string starting at line XXXX

The key line is the last one: "Error tokenizing data. C error: EOF inside string starting at line".

There was an erroneous character about 5,000 lines into the CSV file that prevented the Pandas CSV parser from reading the entire file. "EOF inside string" means the parser reached the end of the file while it was still inside a quoted field, which usually points to an unbalanced quote character. Excel had no problems opening the file, and no amount of saving, re-saving, or changing encodings helped. Manually removing the offending line worked, but another character 6,000 lines further into the file then caused the same issue.
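One way to locate such lines before loading the file is sketched below. It assumes the culprit is an unbalanced double quote and that no quoted field legitimately spans several lines. The helper name find_unbalanced_quote_lines is made up for this illustration and is not part of Pandas.

```python
def find_unbalanced_quote_lines(lines, quotechar='"'):
    """Return the 1-based numbers of lines with an odd count of quotechar.

    Escaped quotes written as "" contribute two characters, so they do not
    trigger a false positive. Assumption: fields never span multiple lines.
    """
    suspects = []
    for number, line in enumerate(lines, start=1):
        if line.count(quotechar) % 2 == 1:
            suspects.append(number)
    return suspects


# Small in-memory sample; line 3 is missing its closing quote.
sample = [
    'id,comment\n',
    '1,"all good"\n',
    '2,"missing closing quote\n',
    '3,plain text\n',
]
print(find_unbalanced_quote_lines(sample))  # prints [3]

# For a real file, pass the open file object instead of a list:
# with open("data.csv", encoding="utf-8", newline="") as f:  # hypothetical path
#     print(find_unbalanced_quote_lines(f))
```

The line numbers it reports are only candidates to inspect by hand; a file with genuine multi-line quoted fields would produce false positives.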

The solution was to use the parameter engine='python' in the read_csv function call. The Pandas CSV parser can use two different "engines" to parse CSV files: Python or C (the default).

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None,
                header='infer', names=None, 
                index_col=None, usecols=None, squeeze=False, 
                ..., engine=None, ...)

engine : {'c', 'python'}, optional

Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.

The Python engine is described as "slower, but more feature complete" in the Pandas documentation. However, in this case, the Python engine solved the problem without a massive time impact (the file was approximately 200,000 rows).
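To keep the speed of the C engine for files that parse cleanly, the engine switch can be wrapped in a small fallback. This is a minimal sketch, not something from the Pandas documentation: the function name read_csv_with_fallback is invented here, and it assumes the caller does not pass engine in the keyword arguments.

```python
import pandas as pd


def read_csv_with_fallback(path, **kwargs):
    """Read a CSV with the default C engine, retrying with the Python engine.

    Assumption: `kwargs` does not already contain an `engine` entry.
    """
    try:
        # First attempt: default (C) engine, which is the faster of the two.
        return pd.read_csv(path, **kwargs)
    except pd.errors.ParserError:
        # Second attempt: the slower Python engine. It can raise its own
        # ParserError, which is deliberately left to propagate.
        return pd.read_csv(path, engine="python", **kwargs)


# Usage (hypothetical file name):
# df = read_csv_with_fallback("large_file.csv")
```

The Python engine is not guaranteed to succeed where the C engine fails, so it is still worth checking the row count of the result against the source file.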

UnicodeDecodeError

The other big problem with reading in CSV files that I come across is the error:

"UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position XX: invalid start byte"

Character encoding issues can usually be fixed by opening the CSV file in Sublime Text and choosing "Save with Encoding" and then UTF-8. Adding encoding='utf8' to the pandas.read_csv call then allows Pandas to open the CSV file without trouble. This error appears regularly for me with files saved in Excel.

encoding : str, default None

Encoding to use for UTF when reading/writing (ex. 'utf-8'). List of Python standard encodings
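The re-save step can also be done in Python instead of a text editor. The sketch below assumes the file was written in Windows-1252 (cp1252), which is a guess based on byte 0x96 being an en dash in that code page and on the file coming from Excel. The function name reencode_to_utf8 and the file names are hypothetical.

```python
def reencode_to_utf8(src_path, dst_path, source_encoding="cp1252"):
    """Write a UTF-8 copy of src_path, decoding it with source_encoding.

    Assumption: the source really is in `source_encoding`; a wrong guess
    either raises UnicodeDecodeError or silently produces wrong characters.
    """
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=source_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Usage (hypothetical file names):
# reencode_to_utf8("export_from_excel.csv", "export_utf8.csv")
# df = pandas.read_csv("export_utf8.csv", encoding="utf-8")
```

If the source encoding is known, passing it straight to read_csv (for example encoding='cp1252') avoids the intermediate file altogether.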

22 Comments

Thank you for this article! Both errors occurred on my data set and your solutions fixed it for me. You saved me the 3 hours.

Thank you, I’m a complete n00b to python and the above article was perfect.

#works like a charm
, engine='python')

Thanks, this was driving me mad! Even turned my ad-blocker off for you 🙂

Hi Shane,

I tried encoding the csv in sublime text and changed engine="python"
I still get an error 🙁
ParserError: Expected 1 fields in line 34, saw 2

Is there something I need to change about my code?

This is my code:
data = pd.read_csv('Technical_Assessment_Package_Data_Analyst/survey_data.csv', encoding='utf8', engine="python")

Thanks in advance.

Thanks for the reply Shane.
But I figured a way out. Not sure how it helped, but it did.
I first saved the file in the UTF-8 CSV format in Excel itself, and then encoded the new file again in Sublime.

Thanks again.

Thanks – saved a lot of time – thanks

Good info.

engine=python Life saver.. Thanks a lot .. saved me lot of time.

Hi, thanks a lot, this worked for me but gave me another error: "ParserError: unexpected end of data". Can you please tell me why this is happening? My line is as follows: `df = pd.read_csv("test_view.csv", engine='python')`

Did you figure this issue out, I’ve had the same experience with what is a very clean dataset.

Thanks, you saved my time!

Helped a ton. You are the kind of hero the internet needs

Thank you so much, it helped me a lot!!!


data = pd.read_csv("file name", engine='python', error_bad_lines=False)
Try this line as well if you still get an error.
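A note on the suggestion above: error_bad_lines was deprecated in pandas 1.3 and removed in pandas 2.0, where on_bad_lines='skip' takes its place. The sketch below uses a small made-up in-memory sample rather than a real file, and it assumes pandas 1.3 or later.

```python
import io

import pandas as pd

# Made-up sample: the third line has one field too many.
raw = "a,b\n1,2\n3,4,5\n6,7\n"

# on_bad_lines="skip" silently drops rows with too many fields,
# so check afterwards how many rows were lost.
df = pd.read_csv(io.StringIO(raw), engine="python", on_bad_lines="skip")
print(df)
```

Skipping bad lines discards data without any message, so it is better suited to a quick look at a file than to a final load.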


Thank you so much man. I was getting really frustrated. You are my hero.

Thank you so much

Step 1: Install Sublime Text. The website is https://www.sublimetext.com/
Step 2: Open your dataset in Sublime Text.
Step 3: Click the File menu -> Save with Encoding -> UTF-8, then close the dataset.
Step 4: Upload your dataset to Jupyter Notebook.
Step 5: pd.read_csv('csv file name', sep=';', engine='python', error_bad_lines=False)

Thank you so much Shane! This saved me a ton of banging my head against the wall. Hope you enjoy the coffee 😉

I had been searching for a solution for the past hour, and engine='python' worked like magic for me.

Thanks

Thanks it is working