I recently wanted to geocode a large number of addresses (circa 60k) in Ireland as part of a visualisation of the Irish property market. Geocoding can be achieved simply in R using the geocode() function from the ggmap library. The geocode() function uses Google’s Geocoding API to turn addresses from text into latitude and longitude pairs.
There is a usage limit on the geocoding service for free users of 2,500 addresses per IP address per day. This hard limit cannot be overcome without employing a new IP address or paying for a business account. To ease the pain of restarting an R process every 2,500 addresses per day, I’ve built a script that geocodes addresses up to the API query limit every day, with a few handy features:
- Once it hits the geocoding limit, it patiently waits for Google’s servers to let it proceed.
- The script pings Google once per hour during the down time to start geocoding again as soon as possible.
- A temporary file containing the current data state is maintained during the process. Should the script be interrupted, it will start again from the place it left off once any problems with the data or connection have been rectified.
The R script assumes that you are starting with a database that is contained in a single *.csv file, “input.csv”, where the addresses are contained in the “address” column. Feel free to use/modify to suit your own devices!
Comments are included where possible:
# Geocoding script for large list of addresses.
# Shane Lynn 10/10/2013

# load up the ggmap library
library(ggmap)

# get the input data (read addresses as strings, not factors)
infile <- "input"
data <- read.csv(paste0('./', infile, '.csv'), stringsAsFactors = FALSE)

# get the address list, and append "Ireland" to the end to increase accuracy
# (change or remove this if your addresses already include a country etc.)
addresses = data$address
addresses = paste0(addresses, ", Ireland")

# define a function that will process Google's server responses for us.
getGeoDetails <- function(address){
  # use the geocode function to query Google's servers
  geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
  # now extract the bits that we need from the returned list
  answer <- data.frame(lat=NA, long=NA, accuracy=NA, formatted_address=NA, address_type=NA, status=NA)
  answer$status <- geo_reply$status

  # if we are over the query limit - want to pause for an hour
  while(geo_reply$status == "OVER_QUERY_LIMIT"){
    print("OVER QUERY LIMIT - Pausing for 1 hour at:")
    time <- Sys.time()
    print(as.character(time))
    Sys.sleep(60*60)
    geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
    answer$status <- geo_reply$status
  }

  # return NAs if we didn't get a match:
  if (geo_reply$status != "OK"){
    return(answer)
  }
  # else, extract what we need from the Google server reply into a dataframe:
  answer$lat <- geo_reply$results[[1]]$geometry$location$lat
  answer$long <- geo_reply$results[[1]]$geometry$location$lng
  if (length(geo_reply$results[[1]]$types) > 0){
    answer$accuracy <- geo_reply$results[[1]]$types[[1]]
  }
  answer$address_type <- paste(geo_reply$results[[1]]$types, collapse=',')
  answer$formatted_address <- geo_reply$results[[1]]$formatted_address

  return(answer)
}

# initialise a dataframe to hold the results
geocoded <- data.frame()
# find out where to start in the address list (if the script was interrupted before):
startindex <- 1
# if a temp file exists - load it up and count the rows!
tempfilename <- paste0(infile, '_temp_geocoded.rds')
if (file.exists(tempfilename)){
  print("Found temp file - resuming from index:")
  geocoded <- readRDS(tempfilename)
  # rows 1..nrow(geocoded) are already done, so resume at the next one
  startindex <- nrow(geocoded) + 1
  print(startindex)
}

# Start the geocoding process - address by address. geocode() function takes care of query speed limit.
# seq() counts down when startindex is past the end of the list, so guard against that.
remaining <- if (startindex <= length(addresses)) seq(startindex, length(addresses)) else integer(0)
for (ii in remaining){
  print(paste("Working on index", ii, "of", length(addresses)))
  # query the google geocoder - this will pause here if we are over the limit.
  result = getGeoDetails(addresses[ii])
  print(result$status)
  result$index <- ii
  # append the answer to the results file.
  geocoded <- rbind(geocoded, result)
  # save temporary results as we are going along
  saveRDS(geocoded, tempfilename)
}

# now we add the latitude and longitude to the main data
data$lat <- geocoded$lat
data$long <- geocoded$long
data$accuracy <- geocoded$accuracy

# finally write it all to the output files (the "../data/" directory must already exist)
saveRDS(data, paste0("../data/", infile ,"_geocoded.rds"))
write.table(data, file=paste0("../data/", infile ,"_geocoded.csv"), sep=",", row.names=FALSE)
Let me know if you find a use for the script, or if you have any suggestions for improvements.
Please be aware that it is against the Google Geocoding API terms of service to geocode addresses without displaying them on a Google map. Please see the terms of service for more details on usage restrictions.
Hi Ann – what’s the error that you are getting? Have you got your csv file in the same place as your R script? And finally, have you set the “working directory” of R to the same directory?
As another try – change read.csv(paste0('./', infile, '.csv')) to just read.csv(paste0(infile, '.csv'))
Hope this works!
Hi Shane, thanks for this highly useful code. What do the rows of your input file look like? I mean, how is the address typed in each row? I need to add the US state after the primary address; for example, the primary address is “abcd high school”, followed by Oklahoma, US. Could you please help me out here?
The address can be just a string column in the csv file. So something like:
id, address, other_column
0, "Test Address, Test Town, State", "other value"
1, "Test Address2, Test Town2, State", "other value"
Does that make sense? You can use paste() to add a state at the end if you have it in another column etc.
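As a minimal sketch of the paste() suggestion, assuming the school name and the state sit in two separate columns (the data frame and the column names `address`, `state` and `full_address` are made up for illustration):

```r
# Hypothetical input: the primary address and the state live in separate columns.
schools <- data.frame(
  address = c("abcd high school", "efgh elementary school"),
  state   = c("Oklahoma", "Texas"),
  stringsAsFactors = FALSE
)

# Build one geocodable string per row: "<address>, <state>, US".
schools$full_address <- paste(schools$address, schools$state, "US", sep = ", ")

print(schools$full_address)
```

The resulting `full_address` column could then be passed to the geocoder in place of the raw address column.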
Hello Shane, thank you for the great code, it works great!
You’re very welcome Riodrig. Glad that you found it useful!
Address lati long formatted address
1004B Jessica’s Court,Bel Air,MD, 39.53376 -76.37507 1099 Jessicas Ct, Bel Air, MD 21014, USA
I need one more column called formatted address. How do I change the code?
Here is the error – Error in gzfile(file, mode) : cannot open the connection
In addition: Warning message:
In gzfile(file, mode) :
cannot open compressed file ‘../data/C:/Users/*******/Documents/April 2017/Addresses.csv_geocoded.rds’, probable reason ‘Invalid argument’
As a beginner, I find the error confusing and unclear. I followed your script without any changes and created my own csv file for testing purposes. All went well until I neared the end, unfortunately. Maybe the script should not be run exactly as shown, or maybe I am missing an important part of the instructions. Advice and suggestions are welcome.
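The path in that error message suggests that `infile` was set to a full path including the ".csv" extension, which the script then glues onto "../data/". A rough sketch of one way around this, assuming you are happy to write the output next to the input file (the path below is a made-up stand-in, and `stem`, `bad_path` and `good_path` are illustrative names):

```r
# Hypothetical full path, standing in for whatever was assigned to `infile`.
infile <- "C:/Users/someone/Documents/Addresses.csv"

# What the script effectively builds when `infile` is a full path:
bad_path <- paste0("../data/", infile, "_geocoded.rds")

# Keep only the file name, minus its extension, for naming the outputs.
stem <- tools::file_path_sans_ext(basename(infile))

# Write next to the input file instead of assuming ../data/ exists.
good_path <- file.path(dirname(infile), paste0(stem, "_geocoded.rds"))

print(bad_path)
print(good_path)
```

Alternatively, keeping `infile` as a bare name such as "input", setting the working directory to the folder holding the csv, and creating a "data" folder one level up should also match what the script expects.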
Hi Shanelynn,
Thank you for the script, it has been very helpful for the purposes of my project. I noticed above in the comments that someone spotted the error on line 77 and it is now correct here. However, I accessed the script from R-bloggers and the error is still there. I wanted to let you know in case you had any control over fixing the error on R-bloggers as well. Thank you!
Hi Emily – unfortunately R-bloggers is out of my control, which is a tough one – hopefully some people manage to find the original like you!
[…] a fifth option, you ask? … I am going to test the R package (ggmap), which has a geocoding function linked to the Google API. I am just looking to make sure […]
Hello,
I am getting an error when I run lines 63–73, which says:
Error in if (location == "") return(failedGeocodeReturn(output)) :
missing value where TRUE/FALSE needed
Thoughts? Ideas? Fixes?
Thanks!
Hi 🙂 I noticed many people mentioning the repeating of addresses when it continues from index – would simply adding a +1 to the condition not work? Like this:
if (file.exists(tempfilename)){
  print("Found temp file - resuming from index")
  geocoded <- readRDS(tempfilename)
  startindex <- nrow(geocoded)+1
  print(startindex)
}
I tried it and it seems to work for me, but maybe I’m missing something :/
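One edge case worth checking with the +1 approach: if every address is already in the temp file, the start index ends up one past the end of the list, and seq() then counts downwards instead of returning nothing. A small sketch of a guard, where `remaining_indices` is a hypothetical helper name and not part of the original script:

```r
# Hypothetical helper: which row indices still need geocoding?
remaining_indices <- function(n_done, n_total) {
  startindex <- n_done + 1
  if (startindex <= n_total) seq(startindex, n_total) else integer(0)
}

# seq() counts DOWN when from > to, so the unguarded call revisits rows 6 and 5.
naive <- seq(5 + 1, 5)
print(naive)

# Guarded versions: a partial run resumes at row 3, a finished run does nothing.
print(remaining_indices(2, 5))
print(remaining_indices(5, 5))
```

Looping with `for (ii in remaining_indices(nrow(geocoded), length(addresses)))` would then skip the loop entirely when there is nothing left to do.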
Hi, Shane, thanks for your amazing code. Super helpful.
My dummy question about tempfilename. I should use any, I suppose. But in which format should I save?
Sincerely
Oleksiy
Hey Oleksiy, I would recommend using a comma-separated values (CSV) file for this!
Hi Shane,
Thanks for the code! I have a couple of questions, but I’m brand new to programming, so forgive my ignorance.
1) I have received the following error: “Error in data$Address (from DB_Geocode_trial1) : object of type ‘closure’ is not subsettable”. What is this? And how do I fix it?
2) I am working on a water-related project in which I need to find coordinates for “Habitations” (similar to a village) in India. In other words, I don’t have street names. Is this code meant to accomplish a task like this? If not, do you have any ideas for how I can modify it?
Sorry that I have so many questions. I appreciate any and all assistance!
Thanks in advance,
Brooke
Hi Brooke. It sounds like your data didn’t read in correctly for the first error – make sure your CSV file is correctly formatted, and the data shows up correctly if you run “head(data)”. For the water project, the geocoder from google should work fine with areas rather than street addresses, as long as those habitations are marked on Google. To help it along, you can add “India” to the end of the geocoding string.
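For anyone puzzled by the “closure” wording: a likely explanation is that no data frame called `data` existed when that line ran, so R fell back to its built-in data() function, and a function cannot be subset with `$`. A small sketch that reproduces the message on purpose and then shows the head() check (the inline CSV text and the name `input_df` are made up for illustration):

```r
# With no data frame named `data`, the name resolves to the function utils::data().
msg <- tryCatch(
  utils::data$Address,
  error = function(e) conditionMessage(e)
)
print(msg)

# After a successful read, head() shows the columns that are really available.
input_df <- read.csv(text = "id,Address\n1,Test Address", stringsAsFactors = FALSE)
print(head(input_df))
print(names(input_df))
```

If head() fails or the expected column is missing from names(), the problem is with reading the file rather than with the geocoding itself.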
Out of curiosity, is it possible to request the route between two points?
Hi Max, this is actually a topic for another post coming up – for this you use the Google Directions API.
Hi Shane, is there code for getting the city and country from latitude and longitude data? Please let me know.
Hi Rajanna, this is called “reverse-geocoding” – you’ll need to look at the specific API for that – Google has one if you look around the documentation!
This worked for me and I don’t even have addresses. I passed it “Region, Country”, sometimes “,Region, Country” and it’s working well.
Thank you!
[…] about the fifth option? … I will test the R package (ggmap) which has a geocoding function related to the Google API. I’m just trying to make sure the […]
This was extremely helpful, thank you for sharing this!
[…] a recent project, I ported the “batch geocoding in R” script over to Python. The script allows geocoding of large numbers of string addresses to […]
Hi i am getting this Error: is.character(location) is not TRUE . any ideas??
I know this is months beyond the date you posted, but when you load the .csv file, make sure you add “stringsAsFactors = FALSE” in your read.csv line… something like this –> data <- read.csv('file_location.csv', stringsAsFactors = FALSE)
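To see why this matters, here is a small sketch using inline CSV text (made up for illustration) that reads the same data both ways. Note that from R 4.0 onwards read.csv defaults to stringsAsFactors = FALSE, so this mainly bites on older versions:

```r
# Made-up CSV content with a quoted address column.
csv_text <- "id,address\n0,\"Test Address, Test Town\"\n1,\"Test Address2, Test Town2\""

# Read once with factors (the old default) and once with plain strings.
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_string <- read.csv(text = csv_text, stringsAsFactors = FALSE)

print(class(as_factor$address))
print(class(as_string$address))

# A factor column can also be converted after the fact.
recovered <- as.character(as_factor$address)
print(is.character(recovered))
```

geocode() checks that its input is a character vector, which is where the “is.character(location) is not TRUE” error comes from when it is handed a factor.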
This is great! Would it be possible to adapt the code for the gmapsdistance function? Is this something you have already done? That is, I need to calculate multiple travel distances (in time)…rather than geocode addresses. Thanks for your thoughts!
Hi Shane,
Thanks for the useful code. When I tried it with 252 addresses it worked fine. After 4 days I used a new input.csv file with 221 rows, and it started giving the error below:
Error in if (location == "") return(failedGeocodeReturn(output)) :
missing value where TRUE/FALSE needed
It also seems to be using the values from the old 252-row file. Please provide a solution; I am waiting for your reply.
Thanks Regards
Prakash Hullathi
Hi Prakash – it looks as if there is a missing address or location in your input file. That, or a comma in the wrong place – you need to examine the file to find any errors in the input data.
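A quick way to look for missing or blank addresses before starting a run might look like the following sketch (the data frame is made up, and `is_missing` and `clean` are illustrative names):

```r
# Made-up input with one NA address and one whitespace-only address.
input <- data.frame(
  id = 1:4,
  address = c("Test Address, Test Town", NA, "   ", "Test Address2, Test Town2"),
  stringsAsFactors = FALSE
)

# Flag rows whose address is NA or empty once surrounding whitespace is removed.
is_missing <- is.na(input$address) | trimws(input$address) == ""

print(which(is_missing))

# Keep only the rows that can actually be geocoded.
clean <- input[!is_missing, ]
print(clean)
```

Another possible cause, given how the script resumes: a leftover temp file from an earlier run (here, `input_temp_geocoded.rds` with 252 rows) would push the start index past the end of a shorter 221-row list, and indexing beyond the end of the address vector returns NA. Deleting the old temp file before switching to a new input file may be enough to fix it.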
Hi,
What is approximate runtime for 10K locations?
It processed around 900 records in half hour for me. This was with Sys.sleep(1) instead of Sys.sleep(60*60)
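As a back-of-the-envelope estimate from those figures, assuming the throughput stays constant and ignoring any daily quota (the variable names are illustrative):

```r
# Observed throughput from the comment above: about 900 records in 30 minutes.
records_done  <- 900
minutes_taken <- 30
target        <- 10000

# Records per minute, then hours needed for the full list at that pace.
rate_per_min <- records_done / minutes_taken
est_hours    <- target / rate_per_min / 60
print(round(est_hours, 1))

# Under the free limit of 2,500 addresses per day mentioned in the post,
# the quota rather than the speed sets the minimum number of days.
est_days <- ceiling(target / 2500)
print(est_days)
```

So at that pace 10K locations would take a little under six hours of pure querying, while the 2,500-per-day limit described in the post would spread the same job over four days.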
HI Shane,
Thanks for the extremely useful code.
I am using this for schools. However it is taking an extremely long time to do, around 3 schools every hour (sometimes no schools in an hour), before it hits the query limit.
Below is an example of the format I am using. As you can see, some may have longer addresses. Could this be the issue? Or is it normal for it to take this long?
PRESENTATION PRIMARY SCHOOL, TERENURE, DUBLIN 6W, Ireland
BEACAN MIXED N S, BEKAN, CLAREMORRIS, CO MAYO, Ireland
Many Thanks,
PMc
Change where it says Sys.sleep(60*60) to Sys.sleep(1)
This will then retry every 1 second.
The limit is so many records per second.
Shane you legend. Worked perfectly. I changed the time to retry every 1 second and is currently processing 4800 records. No more time consuming geo coding! 🙂
No bother Brad – glad you found it useful!
I’ve noticed it doesn’t work with foreign characters such as á æ ü.
It falls over on that record.
I got round it by using a replace all in Excel so á with a – æ with ae etc.
(These are Scandinavian characters. I’m not sure if it is an issue for other foreign countries)
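The same replace-all workaround can be done in R rather than Excel. This is only a sketch of the substitution step, not a fix for whatever causes the failure; the helper name `transliterate` and the sample place names are made up, and the characters are written as Unicode escapes so the snippet does not depend on the file encoding:

```r
# Characters to replace and their plain ASCII stand-ins
# (a-acute, the ae ligature, and u-umlaut, as in the comment above).
from <- c("\u00e1", "\u00e6", "\u00fc")
to   <- c("a", "ae", "u")

# Hypothetical helper: apply each literal replacement in turn.
transliterate <- function(x, from, to) {
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

# Made-up examples containing the characters above.
places  <- c("Sk\u00e6rb\u00e6k", "Z\u00fcrich")
cleaned <- transliterate(places, from, to)
print(cleaned)
```

Extending `from` and `to` with further pairs covers other characters; uppercase variants would need their own entries.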
Thanks for this code Shane! It worked perfectly for me all summer, but since a week or two, I keep getting an OVER QUERY LIMIT message on the first query, and never get past that. I already tried changing the order of the requests and checking the input data for any weirdness, and changing Sys.sleep to 5. No luck.. Could this have something to do with the changes at Google since 11 June? Is there a way to work an API key into the code? Experiences, anyone?
Thanks,
Jessica
Hi Jessica, thanks for getting in touch. Unfortunately the “ggmap” library for R is quite restrictive and doesn’t let you put in an API key (or at least didn’t). Overall I’ve found the Python scripts easier for this since you can add a key. I might look at rejigging this script to use just an HTTP request rather than the ggmap library, and that way we can use the API key effectively.
Looking into this further Jessica, I think there has been a change to the google API – there may no longer be the option to not use an API key – see https://manifesto.co.uk/google-maps-api-pricing-changes/
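For anyone who wants to try the plain HTTP route with a key, here is a rough sketch of just the URL-building step. It assumes the standard Geocoding API JSON endpoint with its `address` and `key` query parameters; `build_geocode_url` is a hypothetical helper name and "YOUR_API_KEY" is a placeholder, and no request is actually sent:

```r
# Hypothetical helper: build a Geocoding API request URL for one address.
build_geocode_url <- function(address, api_key) {
  paste0(
    "https://maps.googleapis.com/maps/api/geocode/json",
    # reserved = TRUE also escapes commas and other reserved characters
    "?address=", utils::URLencode(address, reserved = TRUE),
    "&key=", utils::URLencode(api_key, reserved = TRUE)
  )
}

url <- build_geocode_url("Test Address, Test Town, Ireland", "YOUR_API_KEY")
print(url)
```

Fetching and parsing the reply would need an HTTP/JSON package on top of this (for instance `jsonlite::fromJSON(url)`, assuming jsonlite is installed), and the parsed list should expose the same `status` and `results` fields that the script reads from the ggmap reply.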
Thanks for the swift replies. Yes, this is the conclusion I came to as well..
I found a possible solution on Github – which uses a newer version of ggmap not yet on CRAN (https://github.com/dkahle/ggmap/issues/83 scroll all the way down to the last two posts from four days ago). Still considering whether it’s worth setting up an API key and billing, or whether to ditch Google altogether and use DSK or OSM instead for geocoding.
Ah – great find Jessica! I’ll update the post shortly with this update.
Hi, I am getting this error. I’ve narrowed it down to the following snip.
for (ii in seq(startindex, length(addresses))){
print(paste("Working on index", ii, "of", length(addresses)))
#query the google geocoder – this will pause here if we are over the limit.
result = getGeoDetails(addresses[ii])
print(result$status)
result$index <- ii
#append the answer to the results file.
geocoded <- rbind(geocoded, result)
#save temporary results as we are going along
saveRDS(geocoded, tempfilename)
}
$ operator is invalid for atomic vectors
Do I need to change the data type? That might be a good approach, but I'm still new, so I'm still investigating.
I'm also seeing that there is a change to the Google API. Is this approach still worth exploring? Or is your code broken at the moment? Nice job though m8. It was good while it lasted. 🙂
Hi Richard, it seems that the Google API for geocoding has changed substantially – an API key is now required for all geocoding activities, and the ggmap library used by this script doesn’t support API keys. I’ll see if I can do an update to manually request the JSON from Google and use an API key in the coming weeks! I think this causes the error with “result” in your snippet. For now though, I’d recommend using Python for this – I have a similar script that will do the trick!
Actually Richard – for a workaround, see the link from Jessica in the last comment – you can patch ggmap with the devtools library and install a different version that supports Google API keys.
Heads up;
> geocode(data$addresses, output = data$geocode, source = "dsk")
Error in geocode(data$addresses, output = data$geocode, source = "dsk") :
datasciencetoolkit.org terminated its map service, sorry!
Looks like DSK is discontinued.
I did get google to work ALMOST, HOWEVER, they want my billing information and I’m not comfortable with that since i’m not 100% sure if I did it right. The last thing I want is to charge a work project to my personal credit card on a hunch. 😉 But the repeating functions do work now. I just get Access Denied errors now.
geocode failed with status REQUEST_DENIED, location =
Betcha it’s my credit card they want. That’s a no from me for now.
[…] nominatim : github.com/hrbrmstr/nominatim ; ggmap::geocode ; geocodeHERE::geocodeHERE_simple ; geonames package; also google r street address geolocation. Excellent article with example code (using the ggmap package): shanelynn.ie/masivo-geocodificación-con-r-y-google-maps […]
Just want to say, if you are having to wait to geocode 2,500 addresses a day, a better solution would be to set up your own geocoder using PostGIS and Census TIGER shapefiles. I set that up for a project that needed to geocode 20 million addresses.
At least for US addresses, I assume for European ones there is a comparable open-source solution.