Tokenizing Error
Recently, I burned about 3 hours trying to load a large CSV file into Python Pandas using the read_csv function, only to consistently run into the following error:
ParserError                               Traceback (most recent call last)
<ipython-input-6-b51ad8562823> in <module>()
...
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)()
Error tokenizing data. C error: EOF inside string starting at line XXXX
“Error tokenizing data. C error: EOF inside string starting at line”.
There was an erroneous character about 5000 lines into the CSV file that prevented the Pandas CSV parser from reading the entire file. Excel had no problems opening the file, and no amount of saving/re-saving/changing encodings was working. Manually removing the offending line worked, but ultimately, another character 6000 lines further into the file caused the same issue.
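The “EOF inside string” message generally means the parser met an opening quote character that was never closed before the file ended. Rather than hunting for the offending lines by hand, something like the following could narrow the search. This is a minimal sketch using only the standard library; the function name `find_odd_quote_lines` and the sample data are hypothetical, and a legitimate quoted field that spans several lines will also be flagged, so treat the output as a list of candidates rather than a verdict.

```python
import io

def find_odd_quote_lines(lines, quotechar='"'):
    """Return 1-based numbers of lines holding an odd number of quote characters."""
    suspects = []
    for lineno, line in enumerate(lines, start=1):
        # A balanced line has an even count; escaped quotes ("") add two.
        if line.count(quotechar) % 2 == 1:
            suspects.append(lineno)
    return suspects

# Hypothetical sample: line 3 opens a quote that is never closed.
sample = io.StringIO('id,comment\n1,"fine"\n2,"broken\n3,ok\n')
print(find_odd_quote_lines(sample))  # [3]
```

For a real file, pass an open file object instead of the `StringIO` sample, since the function only needs something that yields lines.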
The solution was to pass the parameter engine='python' in the read_csv function call. The Pandas CSV parser can use two different “engines” to parse CSV files: Python or C (the default).
pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, ..., engine=None, ...)
engine : {‘c’, ‘python’}, optional
Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.
The Python engine is described as “slower, but more feature complete” in the Pandas documentation. However, in this case, the python engine sorts the problem, without a massive time impact (overall file size was approx 200,000 rows).
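Since the C engine is the faster of the two, one option is to keep it as the first choice and only fall back to the Python engine when parsing fails. The following is a minimal sketch of that idea, not a guaranteed fix: the helper name `read_csv_with_fallback` and the sample data are hypothetical, and the Python engine can still raise its own errors on a badly broken file.

```python
import io
import pandas as pd

def read_csv_with_fallback(source, **kwargs):
    """Try the fast C engine first; retry with the Python engine on ParserError."""
    try:
        return pd.read_csv(source, engine="c", **kwargs)
    except pd.errors.ParserError:
        # A file-like object was partly consumed by the failed attempt,
        # so rewind it before retrying. A path string needs no rewinding.
        if hasattr(source, "seek"):
            source.seek(0)
        return pd.read_csv(source, engine="python", **kwargs)

# Hypothetical, well-formed sample; a real call would pass a file path.
sample = io.StringIO("id,name\n1,alpha\n2,beta\n")
df = read_csv_with_fallback(sample)
print(df.shape)  # (2, 2)
```

Any extra keyword arguments (for example `encoding`) are passed through unchanged to both attempts.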
UnicodeDecodeError
The other big problem with reading in CSV files that I come across is the error:
“UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0x96 in position XX: invalid start byte”
Character encoding issues can usually be fixed by opening the CSV file in Sublime Text and choosing “Save with Encoding” and then UTF-8. Adding encoding='utf8' to the pandas.read_csv call then allows Pandas to open the CSV file without trouble. This error appears regularly for me with files saved in Excel.
encoding : str, default None
Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of Python standard encodings
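An alternative to re-saving the file is to tell Pandas which encoding the file already uses. Byte 0x96 is an en dash in Windows-1252 (Python codec name `cp1252`), which fits with files that come out of Excel on Windows. The sketch below tries a short list of candidate encodings on the raw bytes; the function name `detect_encoding`, the candidate list, and the sample bytes are assumptions for illustration, and a successful decode does not prove the guess is right.

```python
def detect_encoding(raw, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes raw without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None

# Hypothetical sample: the en dash (U+2013) is stored as the single byte 0x96.
raw = "price,range\n10,5\u20136\n".encode("cp1252")
print(detect_encoding(raw))  # cp1252
```

The returned name can then be passed as the `encoding` argument of `read_csv`. For a real file, read the bytes with `open(path, "rb").read()`, or only the first few megabytes if the file is large.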
Thank you for this article! Both errors occurred on my data set and your solutions fixed it for me. You saved me the 3 hours.
Thank you, I’m a complete n00b to python and the above article was perfect.
#works like a charm
, engine='python')
Thanks, this was driving me mad! Even turned my ad-blocker off for you 🙂
Hi Shane,
I tried encoding the CSV in Sublime Text and changed to engine="python"
I still get an error 🙁
ParserError: Expected 1 fields in line 34, saw 2
Is there something I need to change about my code?
This is my code:
data = pd.read_csv('Technical_Assessment_Package_Data_Analyst/survey_data.csv', encoding='utf8', engine="python")
Thanks in advance.
Hi Reshma,
Sounds like the code is okay – but the error is mentioning that your CSV data is ill formatted – specifically on line 34 – you should have a look – have you an extra comma or an unescaped string in line 34?
Shane
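For “Expected N fields in line M, saw K” errors, checking how many fields each row has can show quickly which lines disagree with the rest of the file. This is a minimal sketch with the standard library `csv` module; the function name `rows_with_unexpected_width` and the sample data are hypothetical, and it assumes the most common field count is the correct one.

```python
import csv
import io
from collections import Counter

def rows_with_unexpected_width(lines, delimiter=","):
    """Return (line_number, field_count) for rows whose width differs from the most common one."""
    reader = csv.reader(lines, delimiter=delimiter)
    widths = []
    for row in reader:
        if row:  # skip completely blank lines
            widths.append((reader.line_num, len(row)))
    if not widths:
        return []
    # Assume the field count seen most often is the intended one.
    expected = Counter(n for _, n in widths).most_common(1)[0][0]
    return [(lineno, n) for lineno, n in widths if n != expected]

# Hypothetical sample: line 3 has one field too many.
sample = io.StringIO("id,name\n1,alpha\n2,beta,extra\n3,gamma\n")
print(rows_with_unexpected_width(sample))  # [(3, 3)]
```

If most rows are flagged, the delimiter is probably wrong (for example a semicolon-separated file read with a comma), and passing a different `delimiter` is worth trying.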
Thanks for the reply Shane.
But I figured a way out. Not sure how it helped, but it did.
I first saved the file in the UTF-8 CSV format in Excel itself, and then encoded the new file again in Sublime.
Thanks again.
Thanks – saved a lot of time – thanks
Good info.
engine=python Life saver.. Thanks a lot .. saved me lot of time.
Hi, thanks a lot, this worked for me but gave me another error: “ParserError: unexpected end of data”. Can you please tell me why this is happening? My line is as follows: `df = pd.read_csv("test_view.csv", engine='python')`
Did you figure this issue out? I’ve had the same experience with what is a very clean dataset.
Thanks, you saved my time!
Helped a ton. You are the kind of hero the internet needs
Thank you so much,It helped me a lot!!!
data=pd.read_csv("file name", engine='python', error_bad_lines=False)
Try this line also if you still get an error.
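Note that `error_bad_lines` was deprecated in pandas 1.3 and removed in pandas 2.0; on those versions the equivalent option is `on_bad_lines='skip'`. A minimal sketch, assuming pandas 1.3 or later and a hypothetical in-memory sample in place of a real file:

```python
import io
import pandas as pd

# Hypothetical sample: the third data row has one field too many.
sample = io.StringIO("id,name\n1,alpha\n2,beta\n3,gamma,extra\n4,delta\n")

# on_bad_lines="skip" drops rows with too many fields instead of raising.
df = pd.read_csv(sample, engine="python", on_bad_lines="skip")
print(df["id"].tolist())  # [1, 2, 4]
```

Skipped rows are dropped silently, so it is worth comparing the row count of the result against the line count of the file to see how much data was lost.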
Thank you so much man. I was getting really frustrated. You are my hero.
Thank you so much
Step 1: Install Sublime Text. The website link is: https://www.sublimetext.com/
Step 2: After installing Sublime Text, open your dataset in it.
Step 3: Click the File menu -> Save with Encoding -> UTF-8, then close the dataset.
Step 4: Upload your dataset to Jupyter Notebook.
Step 5: pd.read_csv('csv file name', sep=';', engine='python', error_bad_lines=False)
Thank you so much Shane! This saved me a ton of banging my head against the wall. Hope you enjoy the coffee 😉
I had been searching for a solution for the past hour, and engine='python' worked like magic for me.
Thanks
Thanks it is working