This work was done by Adrian Liaw.
For this project, I'm going to wrangle the map data of Taipei, my hometown.
You can download the dataset via MapZen Metro Extracts (Taipei, Taiwan).
# These are some libraries we're going to use soon
import re
import xml.etree.ElementTree as ET
from collections import defaultdict
from pymongo import MongoClient
try:
    # Speeds things up a bit; ujson is written in pure C
import ujson as json
except ImportError:
import json
# Regular Expression constant
PROBLEMCHARS = re.compile(r"[=\+/&<>;'\"\?%#$@\,\. \t\r\n]")
db = MongoClient("localhost", 27017)["map"]
After exploring the dataset, I think there are three main problems:
Values in the raw OSM XML are all strings, but some of them would be better stored as arrays or sub-documents.
For instance, the tag `cuisine` should be stored as an array, since a restaurant could serve more than one style of cuisine. In OSM, these values are usually separated by "," or ";", like `{"cuisine": "Italian;French"}`.
This also applies to many other tags (like `operator`; some bus routes are operated by multiple agencies), so we can write a generalised function:
def separate_into_list(tags, field, delim="[,;,、]"):
"""Separate the value of some tag into a list (array) instead of storing pure string.
Arguments:
tags -- dict, A dict of tags, {k: v, k: v ...}
field -- str, The tag to separate, e.g. "cuisine", "operator"
Keyword Arguments:
    delim -- str or re object, Separator for the value, defaults to "[,;,、]"
    ("," and "、" are common separators in our language)
    Returns:
    dict -- Part of the resulting document. {k: v} if there's nothing to separate; {k: [v1, v2, ...]} otherwise.
Modifies:
tags -- Deletes the field.
"""
delim = re.compile(delim)
if field not in tags: return {}
value = tags.pop(field)
if not delim.search(value): return {field: value}
return {field: [frag.strip() for frag in delim.split(value)]}
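A quick sanity check, with a made-up tags dict just for illustration:
tags = {"cuisine": "Italian;French", "name": "Some Bistro"}  # hypothetical values
separate_into_list(tags, "cuisine")
# -> {'cuisine': ['Italian', 'French']}
# tags is now {'name': 'Some Bistro'} -- the field was popped off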
Now we can call it from wherever we want. Let's build functions for `operator`, `cuisine`, `ref`, `source` and `phone`:
def process_operator(tags):
"""
{"operator": "國光客運、大都會客運"}
Should become:
{"operator": ["國光客運", "大都會客運"]}
"""
    # Don't split on the comma in company names like "Co., Ltd"
if "Co., Ltd" in tags.get("operator", ""):
return {"operator": tags.pop("operator")}
return separate_into_list(tags, "operator")
def process_cuisine(tags):
"""
{"cuisine": "Italian; French"}
Should become:
{"cuisine": ["italian", "french"]}
"""
if tags.get("cuisine"):
# It should be case-insensitive
tags["cuisine"] = tags["cuisine"].lower()
        # I don't know why, but some values look like this: "PIZZA_,PASTA"
return separate_into_list(tags, "cuisine", r"[;,、]|(?:_?,_?)")
return {}
def process_ref(tags):
"""
    Some subway stations have multiple refs; these stations are transfer stations.
    Roads are another example: some roads also have multiple refs.
"""
return separate_into_list(tags, "ref")
def process_source(tags):
return separate_into_list(tags, "source")
def process_phone_number(tags):
return separate_into_list(tags, "phone")
*By the way, these functions prefixed with "process_" are going to be called for each element; they return partial documents that are fragments of the resulting document. Finally, for each element, we'll start with an empty document (dict) and use the `update` method to merge in these fragments.*
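To make that concrete, here's a minimal sketch of the merging loop, with made-up tag values (the real version is the `process_tags` function further down):
# Minimal sketch of the merge pattern, using made-up tag values
document = {}
tags = {"operator": "國光客運、大都會客運", "ref": "1550"}
for processor in [process_operator, process_ref]:
    document.update(processor(tags))
document.update(tags)  # whatever's left goes in as plain fields
# document == {"operator": ["國光客運", "大都會客運"], "ref": "1550"}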
On the other hand, some tags should be combined; they should end up as a sub-document of the main document.
For example: the address. Addresses in OSM are separated into several tags: `addr:street`, `addr:housenumber`, `addr:city`, etc.
Normally this type of tag has a key with a colon in it, but not all tags with a colon belong to this type.
Another common case is multilingual tags such as `name`; you can find tons of tags like `name:en` and `name:ja`. They should be combined into an object and stored as `names`, so you can then access these values via `names.en`, `names.ja` and so on.
Let's again write a general function for this:
def as_subdocument(tags, prefix, into):
"""Combine tags with some prefix into an object, and store as a sub-document (or nested document).
Arguments:
tags -- dict, A dict of tags, {k: v, k: v ...}
    prefix -- str, All the tags with a key starting with the given prefix are going to be merged.
into -- str, A key for the resulting document to store the sub-document.
Returns:
dict -- Part of the resulting document.
Empty dict if no tags matching the prefix; {into: {k (without prefix): v, k: v ...}} otherwise.
Modifies:
    tags -- Deletes the matched fields.
"""
document = defaultdict(lambda: {})
for k in list(tags):
if k.startswith(prefix):
document[into][k[len(prefix):]] = tags.pop(k)
return document
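A quick check with some hypothetical multilingual names:
tags = {"name": "新店區", "name:en": "Xindian District", "name:ja": "新店区"}
result = as_subdocument(tags, "name:", "names")
# result["names"] == {"en": "Xindian District", "ja": "新店区"}
# tags is now {"name": "新店區"} -- the plain "name" tag doesn't match the prefix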
Again, use it to construct other functions.
def process_address(tags):
"""
{"addr:full": "11656臺北市文山區新光路二段32號",
"addr:country": "TW",
"addr:housenumber": "32"}
Should become:
{"address": {
"full": "11656臺北市文山區新光路二段32號",
"country": "TW",
"housenumber": "32"
}}
"""
    # A bare "address" tag should really be "addr:full"
if tags.get("address"):
tags["addr:full"] = tags.pop("address")
return as_subdocument(tags, "addr:", "address")
def process_names(tags):
"""
{"name:zh": "新店區",
"name:en": "Xindian District",
"name:ja": "新店区"}
Should become:
{"names": {
"zh": "新店區",
"en": "Xindian District",
"ja": "新店区"
}}
"""
return as_subdocument(tags, "name:", "names")
def process_alt_names(tags):
return as_subdocument(tags, "alt_name:", "alt_names")
def process_old_names(tags):
return as_subdocument(tags, "old_name:", "old_names")
def process_official_names(tags):
return as_subdocument(tags, "official_name:", "official_names")
def process_refs(tags):
# This is also about multilingual
return as_subdocument(tags, "ref:", "refs")
def process_GNS(tags):
return as_subdocument(tags, "GNS:", "GNS")
def process_building_props(tags):
# building:levels building:height etc.
return as_subdocument(tags, "building:", "building_props")
Well, this next one is a very localised problem. It matters to us Taiwanese because convenience stores are a part of our lives; we can do a lot of things there. They are everywhere in Taiwan, and I really mean EVERYWHERE.
There are four main convenience store companies in Taiwan: 7-Eleven, Family Mart, Hi-Life and OK Mart. When we talk about convenience stores, we always mean these four, not others.
So I had the idea of labelling these convenience stores correctly in our data, which may be helpful if we're going to do some analysis on them. The problem here is that these stores have varying names, since the data were edited by lots of different users. For 7-Eleven, for instance, some people wrote 7-ELEVEn, 7-11 or Seven-Eleven.
Also, many nodes were labelled as `{"shop": "convenience"}`, but a lot of them are not what we "expect".
Our task here is to label those four companies' stores as a stand-alone group, and to tag each with the unified company name or brand.
def process_conv_stores(tags):
"""Identify the convenience store company based on the name, and clean it.
{"shop": "convenience",
"name": "7 eleven"}
Should become:
{"shop": "convenience_store",
"brand": "7-Eleven"}
"""
if tags.get("shop") != "convenience" or tags.get("name") == None:
return {}
name = tags["name"].lower()
# 7 Eleven, seven-eleven, 7-11, 統一超商(company's legal name in our language, but we never say this)
if (("7" in name or "seven" in name) and ("11" in name or "eleven" in name)) or "統一" in name:
output = {"shop": "convenience_store", "brand": "7-Eleven"}
# Family Mart, FamilyMart, Family-Mart, 全家便利商店, 全家(for short, we always say this)
elif ("family" in name and "mart" in name) or "全家" in name:
output = {"shop": "convenience_store", "brand": "FamilyMart"}
# Hi-Life, HiLife, hi life, 萊爾富(again, we say this)
elif ("hi" in name and "life" in name) or "萊爾富" in name:
output = {"shop": "convenience_store", "brand": "Hi-Life"}
# OK, ok mart, OK‧MART
elif "ok" in name:
output = {"shop": "convenience_store", "brand": "OK·MART"}
else:
return {}
del tags["shop"]
if "brand" in tags: del tags["brand"]
# We're not going to drop "name", keep it to the end
return output
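A couple of made-up examples (the store names here are hypothetical) to exercise the branches:
tags = {"shop": "convenience", "name": "7-ELEVEn 景美門市"}  # hypothetical store name
process_conv_stores(tags)
# -> {'shop': 'convenience_store', 'brand': '7-Eleven'}
tags = {"shop": "convenience", "name": "全家便利商店"}
process_conv_stores(tags)
# -> {'shop': 'convenience_store', 'brand': 'FamilyMart'}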
As a heavy public transportation user, I take buses and the MRT (Taipei Mass Rapid Transit) every day. It's a good idea to include public transit information in the further analysis.
These route data are stored as relations in OSM XML. I'm going to separate each route relation into three parts: stops, depots and path, where stops are bus stops or MRT stations (nodes), depots are bus depots and MRT depots (closed ways / area ways), and path is an array of open ways.
In this `process_route` function, I'm going to take the element itself as the argument, because we need the `<member>`s under `<relation>`s:
def process_route(element):
"""Break a route relation into an object of three parts
Argument:
element: <relation></relation>
Returns:
dict -- Part of the resulting document, containing "route_content", which contains "stops", "depots" and "path".
"""
document = {"route_content": defaultdict(lambda: [])}
    for member in element.iter("member"):
        # After a bit of exploring, I found these stop roles don't differ much in practice
if member.get("role").lower() in ["stop", "backward_stop", "forward_stop", "platform"]:
document["route_content"]["stops"].append(member.get("ref"))
elif member.get("role") == "depot":
document["route_content"]["depots"].append(member.get("ref"))
else:
document["route_content"]["path"].append(member.get("ref"))
return document
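A tiny hand-written relation (made-up refs), just to see the shape of the output:
xml = """
<relation id="1">
  <member type="node" role="stop" ref="100"/>
  <member type="node" role="depot" ref="200"/>
  <member type="way" role="" ref="300"/>
</relation>
"""
doc = process_route(ET.fromstring(xml))
# doc["route_content"]["stops"]  == ["100"]
# doc["route_content"]["depots"] == ["200"]
# doc["route_content"]["path"]   == ["300"]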
Another type of relation I tried to deal with is boundary. This will be useful if we want to analyse based on administrative areas:
def process_boundary(element):
"""Break a boundary relation into an object of three parts
Argument:
element: <relation></relation>
Returns:
dict -- Part of the resulting document, containing "boundary_data",
which contains "admin_centre"(or "label"), "boundary", "subarea"
"""
document = {"boundary_data": defaultdict(lambda: [])}
    for member in element.iter("member"):
if member.get("role") in ["admin_centre", "label"]:
document["boundary_data"][member.get("role")] = member.get("ref")
elif member.get("role") in ["outer", "inner"]:
# Cities like New Taipei City have a ring-like boundary, should include "outer" or "inner"
document["boundary_data"]["boundary"].append(
{"type": member.get("role"), "ref": member.get("ref")}
)
elif member.get("role") == "subarea":
document["boundary_data"]["subareas"].append(member.get("ref"))
return document
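And the same kind of smoke test for boundaries (again, made-up refs):
xml = """
<relation id="2">
  <member type="node" role="admin_centre" ref="100"/>
  <member type="way" role="outer" ref="200"/>
  <member type="relation" role="subarea" ref="300"/>
</relation>
"""
doc = process_boundary(ET.fromstring(xml))
# doc["boundary_data"]["admin_centre"] == "100"
# doc["boundary_data"]["boundary"]     == [{"type": "outer", "ref": "200"}]
# doc["boundary_data"]["subareas"]     == ["300"]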
Those are the problems I've solved in the auditing phase; let's wrap it all up and import the data into the database.
This is how the final `shape_element` function looks; some additional helper functions are defined right after it.
def shape_element(elem):
"""Shape the element into dictionary like this:
{
"id": "2085444960",
"element": "node",
"loc": [121.524852, 25.0265463],
"name": "混_hun",
"created": {
"uid": "23731",
"version": "2",
"user": "Imrehg",
"changeset": "18946405",
"timestamp": {"$date": "2013-11-17T03:54:33Z"}
},
"address": {
"street": "和平東路一段104巷",
"housenumber": "6"
},
"amenity": "cafe",
"website": "http://huncoworkingspace.blogspot.tw/",
"wifi": "free",
"internet_access": "wlan",
"cuisine": "coffee_shop",
}
"""
if elem.tag in ["node", "way", "relation"]:
document = {}
# process_element_meta deals with attributes of the element
document.update(process_element_meta(elem))
# process_tags runs through the tags auditing functions like process_address, process_conv_stores
document.update(process_tags(elem))
if elem.tag == "way":
# process_nds grabs nodes (<nd>) in a way element into "node_refs"
document.update(process_nds(elem))
if elem.tag == "relation":
# Two special relations, route and boundary
if document.get("route") in ["bus", "subway", "railway"]:
document.update(process_route(elem))
elif document.get("boundary") == "administrative":
document.update(process_boundary(elem))
else:
# Otherwise, do a generalised transformation
document.update(process_relation(elem))
return document
def process_element_meta(element):
"""Extracts xml attributes from the element, and turn it into an appropriate form
Argument:
element: An XML element, could be node, way, or relation
Returns:
dict -- Part of the resulting document, including the element's metadata
"""
document = {"element": element.tag, "created": {}}
for key, val in element.attrib.items():
if key not in ["lon", "lat", "timestamp", "id"]:
document["created"][key] = val
elif key == "timestamp":
# Found out that we can use MongoDB's Extended JSON format to store something as Date if we use mongoimport
document["created"]["timestamp"] = {"$date": element.get("timestamp")}
elif key == "id":
document["id"] = element.get("id")
if element.tag == "node":
# Can get benefits of 2d indexes and geospatial queries
document["loc"] = [float(element.get("lon")), float(element.get("lat"))]
return document
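Here's what it produces for a made-up node (all attribute values here are hypothetical):
xml = '<node id="1" lat="25.04" lon="121.51" uid="42" user="someone" version="3" changeset="99" timestamp="2015-01-01T00:00:00Z"/>'
doc = process_element_meta(ET.fromstring(xml))
# doc["id"]      == "1"
# doc["element"] == "node"
# doc["loc"]     == [121.51, 25.04]
# doc["created"]["timestamp"] == {"$date": "2015-01-01T00:00:00Z"}
# doc["created"] also holds uid, user, version and changeset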
def process_tags(element):
"""Get all tags and feed them to process functions we just wrote
Argument:
element: An XML element, could be node, way, or relation
Returns:
dict -- Part of the resulting document, including the element's tags
"""
# Get all the tags into a dictionary
tags = {}
    for tag in element.iter("tag"):
if PROBLEMCHARS.search(tag.get("k")):
            # It turns out all the keys with problematic characters
            # just have dots where there should be colons
if "." in tag.get("k"):
tag.set("k", tag.get("k").replace(".", ":"))
else:
continue
tags[tag.get("k")] = tag.get("v")
document = {}
for processor in [process_operator,
process_cuisine,
process_ref,
process_source,
process_phone_number,
process_address,
process_names,
process_alt_names,
process_old_names,
process_official_names,
process_refs,
process_GNS,
process_building_props,
process_conv_stores]:
document.update(processor(tags))
# Remaining tags should be added as normal fields, like in Lesson 6
document.update(tags)
return document
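A quick check of the dot-to-colon repair, using a made-up node (this relies on the processor functions defined above):
xml = """
<node id="1" lat="25.0" lon="121.5">
  <tag k="addr.street" v="和平東路一段"/>
  <tag k="name:en" v="Somewhere"/>
</node>
"""
doc = process_tags(ET.fromstring(xml))
# doc["address"] == {"street": "和平東路一段"}  -- the dot was repaired to a colon
# doc["names"]   == {"en": "Somewhere"}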
def process_nds(element):
"""
<way ...>
<nd ref="12345678">
<nd ref="90123456">
</way>
Should become:
{...
"node_refs": ["12345678", "90123456"]}
"""
document = {"node_refs": []}
    for nd in element.iter("nd"):
document["node_refs"].append(nd.get("ref"))
return document
def process_relation(element):
"""
<relation ...>
<member type="node" role="foo" ref="12345678">
<member type="way" role="" ref="90123456">
</relation>
Should become:
{...
"members": [
{"type": "node",
"role": "foo",
"ref": "12345678"},
{"type": "way",
"role": "",
"ref": "90123456"}
]}
"""
document = {"members": []}
    for member in element.iter("member"):
document["members"].append(member.attrib)
return document
Final step: dump the documents into a file, then import them into MongoDB:
with open("taipei_taiwan.osm.json", "w") as output:
    for _, elem in ET.iterparse("taipei_taiwan.osm"):
        document = shape_element(elem)
        if document:
            json.dump(document, output)
            # Newline-delimit the documents so mongoimport reads one per line
            output.write("\n")
!mongoimport -d map -c taipei taipei_taiwan.osm.json
File sizes:
taipei_taiwan.osm ........ 128 MB
taipei_taiwan.osm.json ... 141 MB
Number of documents: 664962
(Actually, you can see it in the output cell of In[14].)
Number of nodes, ways, relations:
list(db.taipei.aggregate([
{"$group": {"_id": "$element", "count": {"$sum": 1}}},
{"$sort": {"count": -1}},
]))
Number of MRT stations:
db.taipei.find({"station": "subway"}).count()
Number of bus stops:
db.taipei.find({"highway": "bus_stop"}).count()
Convenience stores count for each company:
list(db.taipei.aggregate([
{"$match": {"shop": "convenience_store"}},
{"$group": {"_id": "$brand", "count": {"$sum": 1}}},
{"$sort": {"count": -1}},
]))
As I said, convenience stores play an important role in our lives, and it's important for a convenience store company to decide where to open a new store. If a place has a considerable number of people passing by or staying, and there's no convenience store there yet, it's definitely a great spot.
Our dataset has some amenity data, so we can infer whether a location is likely to attract many customers. Places like schools, MRT stations, or bus stops with many routes passing through may see large numbers of people.
The benefit of using our data is that we have all sorts of features, like restaurants, schools, hospitals and bus stops, not just convenience stores, so we can draw inferences from multiple perspectives.
However, one challenge is how to measure the importance of each place, since different kinds of places should be weighted differently. Another problem is distance: distances should be calculated along streets and roads, the way people would actually walk, not as the direct distance between two points.
Density of convenience stores (stores/km²):
# The area we selected is about 1048 square kilometres
db.taipei.find({"shop": "convenience_store"}).count() / 1048
Top 3 nearest convenience stores from an MRT station:
# Must create a 2d index before we use geospatial queries
db.taipei.create_index([("loc", "2d")])
list(db.taipei.aggregate([
{"$geoNear": {
"near": db.taipei.find_one({"station": "subway", "name": "七張"})["loc"],
"query": {"shop": "convenience_store"},
"distanceField": "distance",
"num": 3
}},
{"$project": {
"_id": 0,
"id": 1,
"brand": 1,
"distance": 1,
"loc": 1,
"address": 1
}}
]))
# For this station (七張), there's a FamilyMart right beside the entrance, and another one across the road.
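One caveat about the numbers above: with a legacy 2d index, the distanceField values come back in coordinate degrees, not metres. If we need real-world distances, a standard haversine helper (nothing project-specific, just the usual formula) can compute metres from the returned loc pairs:
from math import radians, sin, cos, asin, sqrt

def haversine(loc1, loc2):
    """Great-circle distance in metres between two [lon, lat] pairs."""
    lon1, lat1 = map(radians, loc1)
    lon2, lat2 = map(radians, loc2)
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))
For example, haversine(station["loc"], store["loc"]) gives the straight-line distance in metres, which is still not the walking distance mentioned earlier, but at least it's in familiar units.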
Top 5 bus stops served by the most routes:
list(db.taipei.aggregate([
{"$match": {"route": "bus"}},
{"$unwind": "$route_content.stops"},
{"$group": {"_id": "$route_content.stops", "count": {"$sum": 1}}},
{"$sort": {"count": -1}},
{"$limit": 5},
{"$lookup": {"from": "taipei",
"localField": "_id",
"foreignField": "id",
"as": "stop"}},
{"$unwind": "$stop"},
{"$project": {"_id": "$stop.id",
"name": "$stop.name",
"count": 1}}
]))
Well, after all this exploring and wrangling, I'm sure there's still a whole bunch of data missing in the Taipei area, but I can see activity on OpenStreetMap; the local OpenStreetMap community in Taiwan is pretty active. I really liked this project and spent a lot of time on it. It's fun and interesting, and I kept trying to dig deeper and deeper.