I recently wanted to geocode a large number of addresses (circa 60k) in Ireland as part of a visualisation of the Irish property market. Geocoding is simple in R using the geocode() function from the ggmap library, which uses Google’s Geocoding API to turn text addresses into latitude and longitude pairs.
There is a usage limit on the geocoding service for free users of 2,500 addresses per IP address per day. This hard limit cannot be overcome without using a new IP address or paying for a business account. To ease the pain of restarting an R process every 2,500 addresses / day, I’ve built a script that geocodes addresses up to the API query limit every day, with a few handy features:
- Once it hits the geocoding limit, it patiently waits for Google’s servers to let it proceed.
- The script pings Google once per hour during the down time to start geocoding again as soon as possible.
- A temporary file containing the current data state is maintained during the process. Should the script be interrupted, it will start again from the place it left off once any problems with the data or connection have been rectified.
The R script assumes that you are starting with a database that is contained in a single *.csv file, “input.csv”, where the addresses are contained in the “address” column. Feel free to use/modify to suit your own devices!
Comments are included where possible:
# Geocoding script for large list of addresses.
# Shane Lynn 10/10/2013

#load up the ggmap library
library(ggmap)

# get the input data
infile <- "input"
data <- read.csv(paste0('./', infile, '.csv'))

# get the address list, and append "Ireland" to the end to increase accuracy
# (change or remove this if your addresses already include a country etc.)
addresses = data$address
addresses = paste0(addresses, ", Ireland")

#define a function that will process Google's server responses for us.
getGeoDetails <- function(address){
   #use the geocode function to query Google's servers
   geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
   #now extract the bits that we need from the returned list
   answer <- data.frame(lat=NA, long=NA, accuracy=NA, formatted_address=NA, address_type=NA, status=NA)
   answer$status <- geo_reply$status

   #if we are over the query limit - want to pause for an hour
   while(geo_reply$status == "OVER_QUERY_LIMIT"){
       print("OVER QUERY LIMIT - Pausing for 1 hour at:")
       time <- Sys.time()
       print(as.character(time))
       Sys.sleep(60*60)
       geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
       answer$status <- geo_reply$status
   }

   #return NAs if we didn't get a match:
   if (geo_reply$status != "OK"){
       return(answer)
   }
   #else, extract what we need from the Google server reply into a dataframe:
   answer$lat <- geo_reply$results[[1]]$geometry$location$lat
   answer$long <- geo_reply$results[[1]]$geometry$location$lng
   if (length(geo_reply$results[[1]]$types) > 0){
       answer$accuracy <- geo_reply$results[[1]]$types[[1]]
   }
   answer$address_type <- paste(geo_reply$results[[1]]$types, collapse=',')
   answer$formatted_address <- geo_reply$results[[1]]$formatted_address

   return(answer)
}

#initialise a dataframe to hold the results
geocoded <- data.frame()
# find out where to start in the address list (if the script was interrupted before):
startindex <- 1
#if a temp file exists - load it up and count the rows!
tempfilename <- paste0(infile, '_temp_geocoded.rds')
if (file.exists(tempfilename)){
       print("Found temp file - resuming from index:")
       geocoded <- readRDS(tempfilename)
       #resume at the row after the last one saved, so that no index is repeated
       startindex <- nrow(geocoded) + 1
       print(startindex)
}

# Start the geocoding process - address by address. geocode() function takes care of query speed limit.
#only the indices from startindex onwards still need to be geocoded
remaining <- seq_along(addresses)
remaining <- remaining[remaining >= startindex]
for (ii in remaining){
   print(paste("Working on index", ii, "of", length(addresses)))
   #query the google geocoder - this will pause here if we are over the limit.
   result = getGeoDetails(addresses[ii])
   print(result$status)
   result$index <- ii
   #append the answer to the results file.
   geocoded <- rbind(geocoded, result)
   #save temporary results as we are going along
   saveRDS(geocoded, tempfilename)
}

#now we add the latitude and longitude to the main data
data$lat <- geocoded$lat
data$long <- geocoded$long
data$accuracy <- geocoded$accuracy

#finally write it all to the output files
saveRDS(data, paste0("../data/", infile ,"_geocoded.rds"))
write.table(data, file=paste0("../data/", infile ,"_geocoded.csv"), sep=",", row.names=FALSE)
Let me know if you find a use for the script, or if you have any suggestions for improvements.
Please be aware that it is against the Google Geocoding API terms of service to geocode addresses without displaying them on a Google map. Please see the terms of service for more details on usage restrictions.
Thanks for sharing your code. It was very helpful.
Hi, do you happen to know how I should change the script to take into account that some of my records do not have addresses with a house number… just a street name? Right now it geocodes only addresses with numbers; when there is a record with just a street name, I get NA.
Thanks for the nice write up. I usually use Bing for those purposes as it has a limit of 25,000 requests. See http://stackoverflow.com/questions/17361909/determing-the-distance-between-two-zip-codes-alternatives-to-mapdist. Even though this example is with routes, the package “taRifx.geo” also has a function for geocoding by using Bing.
This is awesome! Thanks for posting. I can use this for my project. 🙂
I’ve been using this code and would love to discuss it a little more in-depth. I recently used it to code ~6100 records. Unfortunately, I couldn’t get the geocoding process to start up again once I hit the 2500 limit. I also found one bug – when the code stores the index to restart the geocoding process, it repeats the index. (Example: code runs through 1180 and pauses. I re-run the code, and it picks up at 1180, repeating the data pull for that index.) Can we discuss sometime soon? Thanks!
Hey Lisa, I’m glad you found the code useful! Strange how the geocoding process wouldn’t start again – was it timing out or reporting a limit from the Google servers? Good catch on the bug – to be honest I was being lazy not completely fixing that one up – leads to a few repeats in the data!
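The repeated index Lisa describes comes from resuming at the number of rows already saved rather than at the row after it. A minimal sketch of the resume calculation, assuming the temp file holds one row per address already geocoded (remainingIndices is a hypothetical helper, not part of the original script):

```r
# Hypothetical helper (not part of the original script): given the number of
# rows already saved in the temp file, return the indices still to be geocoded.
remainingIndices <- function(n_done, n_total) {
  # Resume at the row AFTER the last saved one.
  # seq_len() also copes with the case where nothing is left to do.
  todo <- seq_len(n_total)
  todo[todo > n_done]
}

# Example: 1180 rows already saved out of 6100 addresses.
todo <- remainingIndices(1180, 6100)
print(head(todo, 3))   # first indices that still need a query
print(length(todo))    # number of addresses left
```

Filtering a full index vector, rather than calling seq(start, end), avoids R counting backwards when the start index is already past the end of the address list.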
Hi again – it said “over query limit” and timed out for another hour (after I had already waited an hour). Any suggestions on a fix, or is this another issue do you think? Thanks!
Hi Shane
Nice code – am hitting slight hitch. At the R prompt if I type
> str1 = “9 Abbey Street Lower Dublin 1”
> geocode(str1)
I get the ‘correct’ lon lat -6.249584 53.35223
However, if I use this script pointing to the same address in input.csv it gives me lat lon for somewhere in the midlands. Any suggestions?
Thanks
Heya Rowan, thanks for the comment. To be honest, I don’t have an answer for your question. The script simply uses the “geocode” function from ggmap to get the coordinates, so I don’t know why the answer would be different. Is this possibly some sort of versioning issue? Can you recreate the issue every time? And is it just that address? The only thing I can think of is that the script appends “Ireland” to the end of the address to increase accuracy – perhaps it’s throwing things off track on this occasion – can you try again after editing line 13?
Dear Shane,
Many thanks for your R code – it works perfectly – I used it for geocoding “Asking Prices” retrieved from the web for the Real Estate Market in Austria (PhD). I use a lookup table to avoid double geocoding of the same address.
Many thanks again, Gerhard
Good to hear Ger. Glad it was helpful. I was meaning to implement a lookup table for this same reason. Doing something similar now in JavaScript, which I may put online shortly!
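The lookup-table idea can be sketched as follows: geocode each distinct address once, then map every row back to its result. This is only an illustration; fakeGeoDetails is a made-up stand-in for getGeoDetails() so that the example runs without contacting Google, and its coordinates are meaningless.

```r
# Hypothetical stand-in for getGeoDetails(), so the sketch runs offline.
# It counts how often it is called and returns made-up coordinates.
calls <- 0
fakeGeoDetails <- function(address) {
  calls <<- calls + 1
  data.frame(lat = nchar(address), long = -nchar(address),
             stringsAsFactors = FALSE)
}

addresses <- c("1 Main St, Ireland", "2 High St, Ireland", "1 Main St, Ireland")

# Geocode each distinct address once...
unique_addresses <- unique(addresses)
lookup <- do.call(rbind, lapply(unique_addresses, fakeGeoDetails))
lookup$address <- unique_addresses

# ...then map every original row back to its entry in the lookup table.
geocoded <- lookup[match(addresses, lookup$address), ]
rownames(geocoded) <- NULL
print(geocoded)
print(calls)   # number of (fake) geocoding requests actually made
```

With many repeated addresses, this keeps the number of requests, and therefore the share of the daily quota used, down to the number of distinct addresses.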
Thanks for posting, very useful!
On line 77 – you have data$long <- geocoded$lat when it should say geocoded$long –
Easy fix!! Thanks for the great code!
I have the same issue as rowan… Did you figure it out?
Good spot on the error on line 77 – have updated!!
Hi,
I’m a fresh student of R, so I’m discovering how everything works. I have a silly doubt.
Here:
# get the input data:
infile <- "input"
data <- read.csv(paste0('./', infile, '.csv'))
I should input my data normally as:
infile<- read.csv("C:/Users/University/Project/Local.csv", sep=";")
and then run the rest normally? I'm doing that and I receive this error: "Error in file(file, "rt") : invalid 'description' argument"
Thank you for your help!
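One likely reading of the error above: in the script, infile is meant to hold a file name (a string), not the loaded data, so passing a data frame through paste0() into read.csv() fails. A minimal sketch, assuming a semicolon-separated file with an address column (the temporary file and its contents are made up for illustration):

```r
# Write a small semicolon-separated file so the sketch is self-contained.
# In practice this would be your own file, e.g. Local.csv.
csv_path <- file.path(tempdir(), "Local.csv")
writeLines(c("address;price", "9 Abbey Street Lower Dublin 1;100000"), csv_path)

# infile stays a plain string: the script reuses it to name the temp
# and output files. Only the read.csv() call changes.
infile <- sub("\\.csv$", "", basename(csv_path))
data <- read.csv(csv_path, sep = ";", stringsAsFactors = FALSE)

# The rest of the script can then run as before.
addresses <- paste0(data$address, ", Ireland")
print(infile)
print(addresses)
```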
Hi Shane, firstly thanks for sharing your code.
I’m having problems when the query limit is exceeded. The function gets stuck in an infinite loop in the “while”, because even after 24 hours, when it calls geocode() for the same iteration of the ‘for’ (address), the status doesn’t change. I really don’t know why, but I think it’s because geocode() is “using stored information”. Do you have any suggestion?
Nice code, thanks.
I had same issue as Rowan, restarting and ending up with a larger vector that couldn’t be appended to original data. My workaround was to delete the temp file before restarting the process. Geocoding address is stored with workspace, so it doesn’t request it again from Google server (assuming you save workspace).
A slightly better solution is to comment out the temp file save and read.
An even better solution would be to delete the last row of the temp file and start from there?
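Another option, if a restart has already left repeated rows in the results, is to drop the duplicates before merging back into the original data. A small sketch with made-up values, assuming each result row carries the index column that the script adds:

```r
# Made-up example of a results table where index 3 was geocoded twice
# because the script was restarted.
geocoded <- data.frame(index = c(1, 2, 3, 3, 4),
                       lat   = c(53.1, 53.2, 53.3, 53.3, 53.4))

# Keep the first row for each index so the table lines up with the input again.
geocoded <- geocoded[!duplicated(geocoded$index), ]
rownames(geocoded) <- NULL
print(geocoded)
```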
Just to add a small piece of related info on what Mark and Rowan wondered:
Personally I didn’t find a function to force a “fresh” geocode without workspace deletion. So Q: “What if I won’t delete my workspace (or create a new one) and I want to force a fresh geocode?”
A: ggmap stores locations in .GeocodedInformation in the workspace, and rm(.GeocodedInformation) can be used to remove them. After that, a new/fresh geocode can be initiated.
Thanks Sam and Mark – good points. I’ll put the code up on github, and if you have made changes that fix that bug – might be good to push changes up! Might be possible to force geocode() to not use the cached information.
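The cache removal described above could be wrapped in a small guard so that it does not fail when no cache exists yet. This is a sketch that takes the comment's description of .GeocodedInformation at face value; the first line creates a dummy object purely so the example runs without ggmap loaded.

```r
# Stand-in for the cache that ggmap is described as keeping in the workspace,
# created here only so the sketch runs without ggmap loaded.
assign(".GeocodedInformation", data.frame(), envir = .GlobalEnv)

# Remove the cache if present, so that the next geocode() call has no
# stored answer to fall back on.
if (exists(".GeocodedInformation", envir = .GlobalEnv, inherits = FALSE)) {
  rm(list = ".GeocodedInformation", envir = .GlobalEnv)
}
print(exists(".GeocodedInformation", envir = .GlobalEnv, inherits = FALSE))
```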
Hey Shane, thank you for the great code.
I get the error message “Error in geo_reply$status : $ operator is invalid for atomic vectors” when I get to running the code with my own data. Has this come up before?
Thank you!
Hey Hi Shane,
Thank you so much for your code.
I’m doing a project and wanted to get the Lat/Longs for some streets to make my project more meaningful.
I tried the code but i keep getting an error is.character(location) – my address(ii) value is all string. I can’t fathom what is the cause – though I suspect the data. Appreciate if I could share my screen shot with you for some suggestions.
Hey, just tried something which seems to run but the output does lack the Lat/Long. Line 72 modified to :
result = getGeoDetails(as.character(substitute(addresses[ii]))). Sorry for the multiple messages.
No problem Shanthi – thanks for the changes. I will update this post with the most recent R updates soon, and potentially also port this code into python.
I have a not so smart python code that sorta works.. If you wanna have a peep - lemme know :
Shane,
Thanks for the code. I have been using it quite a bit recently. I do have one question though. I have quite a few addresses to geocode and I’m wondering if you know how to set up the code so that I can go over the 2500 limit per day. I have signed up with Google to do this and created a project but I don’t know how to implement it in the code I have (i.e. your code). I don’t mind paying a few dollars to complete my list of addresses but I just can’t see how to do it. Any ideas?
Many thanks,
Barry
Hi Barry, thanks for getting in touch. Glad the code is useful. I actually don’t think that the R library “ggmap” that is used here allows you to specify the API key for quota usage on Google. (which is a shame). You might want to look at using Google Fusion Tables to upload your data and geocode from there – for simplicity. I’m also uploading shortly a Python equivalent for this code which will be a handy alternative! Hope this helps.
Hi Shane,
thanks for sharing the code with us. Works like a charm, even though I’m missing the possibility to add an API key within the ggmap package. But this has nothing to do with your code. In case someone wants to transform German addresses, one small hint:
You need to encode into UTF-8 before passing it into the function; just add the function enc2utf8 in line 66 and it works like a charm.
line 66: result = getGeoDetails(enc2utf8(addresses[ii]))
Hi, Thanks for this code and it helps a lot with my project, I actually have an error when running the script,
“Working on index 93 of 953”
contacting http://maps.googleapis.com/maps/api/geocode/json?address=T8W5J8%20%20%20%20,%20CANADA&sensor=false….Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=T8W5J8%20%20%20%20,%20CANADA&sensor=false
Error in geo_reply$status : $ operator is invalid for atomic vectors
In addition: Warning messages:
1: In readLines(connect, warn = FALSE) :
InternetOpenUrl failed: ‘A connection with the server could not be established’
2: In geocode(address, output = “all”, messaging = TRUE, override_limit = TRUE) :
geocoding failed for “T8W5J8 , CANADA”.
if accompanied by 500 Internal Server Error with using dsk, try google.
Do you know how can I solve this problem? Thanks for your help!!!!!
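The "$ operator is invalid for atomic vectors" error suggests that, when the request itself fails, geocode() hands back something that is not a list, so geo_reply$status cannot be read. One possible guard is sketched below; replyStatus is a hypothetical helper, and the assumption that a failed request yields a non-list value such as NA is inferred from the error message rather than from the ggmap documentation.

```r
# Hypothetical helper: pull the status out of a geocode(..., output = 'all')
# reply, treating anything that is not a list as a failed request.
replyStatus <- function(geo_reply) {
  if (!is.list(geo_reply) || is.null(geo_reply$status)) {
    return("REQUEST_FAILED")   # made-up label, not a Google status code
  }
  geo_reply$status
}

print(replyStatus(NA))                    # what a failed request appears to return
print(replyStatus(list(status = "OK")))   # a well-formed reply
```

Inside getGeoDetails(), using a check like this in place of reading geo_reply$status directly would let the loop record a failure for that address and carry on, rather than stopping the whole run.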
Hi Shane,
Did you have a chance to finalize the python equivalent code for this?
Hi Sabya, Actually, after many months, I have and it’s uploaded now.
Thank you for the code! What changes I should make to your code, if I don’t want to bypass the API limit, and allow R to collect the geocodes for several days? Thanks!!!
Hi Sophie, great that you found it useful. Weirdly, the ggmap library does not allow you to actually insert an API key and use the system without limits. I’m in the middle of a post of how to do this with python. If you’d like to use a key, I would advise building a URL request with the geocode url, and your key at the end. The response will be very similar. Have a look at: http://stackoverflow.com/questions/34402979/increase-the-api-limit-in-ggmaps-geocode-function-in-r
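Building the request URL by hand, as suggested above, might look like the sketch below. buildGeocodeUrl is a hypothetical helper and "YOUR_API_KEY" is a placeholder; the sketch only constructs the URL and does not send the request.

```r
# Build a Geocoding API request URL with an API key appended.
# "YOUR_API_KEY" is a placeholder, and buildGeocodeUrl is a hypothetical helper.
buildGeocodeUrl <- function(address, key) {
  paste0("https://maps.googleapis.com/maps/api/geocode/json",
         "?address=", utils::URLencode(address, reserved = TRUE),
         "&key=", key)
}

url <- buildGeocodeUrl("9 Abbey Street Lower, Dublin 1, Ireland", "YOUR_API_KEY")
print(url)
```

Encoding with reserved = TRUE also escapes characters such as # and commas, which would otherwise change the meaning of the URL. The JSON response could then be read with a JSON package of your choice and should contain the same status and results fields that the script already uses.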
Hey Sabya, yes, that’s incoming, but not done yet. Unfortunately I have been a bit short of time to work on the blog of late.
Thank you for the code, Shane. I am a novice in R, and hence, my question below may appear less intelligent; I apologize in advance.
I am trying to convert 8,615 U.S. addresses to their latitude and longitude. Interestingly, at index 2694, Google Maps API encounters a bad request and reports the error mentioned below.
Question: I expected the “address” dataframe to have been populated with the correct values up to index 2693, but the dataframe does not seem to have been created. Can you please tell me where I have gone wrong? Thanks, Shane.
[1] “Working on index 2694 of 8615”
contacting http://maps.googleapis.com/maps/api/geocode/json?address=#2050N%201620%2026th%20Street%20Santa%20Monica%20CA%2090404,%20USA&sensor=false…query max exceeded, see ?geocode. current total = 2559
.Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=#2050N%201620%2026th%20Street%20Santa%20Monica%20CA%2090404,%20USA&sensor=false
Error in geo_reply$status : $ operator is invalid for atomic vectors
In addition: Warning messages:
1: In readLines(connect, warn = FALSE) :
cannot open: HTTP status was ‘400 Bad Request’
2: In geocode(address, output = “all”, messaging = TRUE, override_limit = TRUE) :
geocoding failed for “#2050N 1620 26th Street Santa Monica CA 90404, USA”.
if accompanied by 500 Internal Server Error with using dsk, try google.
Shane: please ignore my last comment. I could export “geocoded” dataframe and obtain the latitude and longitude of the 2693 addresses.
Glad you got sorted, Vivek. If you’re doing this sort of work I’d also recommend looking at the Python version that I’ve recently posted for polling the Google API. Best of luck!
Hey Shane!
I’m using your awesome code but for some reason I’m getting this error back from R:
I made sure that ggmap is well installed, but nothing.
[1] “Working on index 1 of 2496”
Error in getGeoDetails(address[ii]) : could not find function “geocode”
any idea what could have been wrong?
Hi Paolo, Check to make sure that the “ggmap” library is installed and loaded!
I have the opposite. I have the latitude and longitude. I would like to find the district (D1, D2, D3…). Do you know how can I do it?
Hi Junior, You’re looking for a “reverse geocode” I think… have a look at https://developers.google.com/maps/documentation/javascript/examples/geocoding-reverse
I can’t get the code to work at all…..from
infile <- "input"
data <- read.csv(paste0('./', infile, '.csv'))
that i get a warning saying error in / in paste0.
As a layman of codes, does this work with street address in a csv? what change should i make to the following part to get it work with my csv file?
#define a function that will process googles server responses for us.
getGeoDetails <- function(address){
#use the gecode function to query google servers
geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
#now extract the bits that we need from the returned list
answer <- data.frame(lat=NA, long=NA, accuracy=NA, formatted_address=NA, address_type=NA, status=NA)
answer$status <- geo_reply$status
i only have 1800 entries so i guess the later parts aren't really relevant.
Hi Naiyu, yes the program should work with addresses in a CSV file. You’ll need to have a look at your input file format here, I’m not sure what the error is. If you need a quick job, have a look at fusion tables from Google as an alternative for geocoding.
Hi, I’m having the same error. How did you fix this??