This work was done by Adrian Liaw.
For this project, I'm going to wrangle the map data of Taipei, my hometown.
You can download the dataset via MapZen Metro Extracts (Taipei, Taiwan).
# These are some libraries we're going to use soon
import re
import xml.etree.ElementTree as ET
from collections import defaultdict
from pymongo import MongoClient
try:
    # Speeds things up a bit; ujson is written in pure C
import ujson as json
except ImportError:
import json
# Regular Expression constant
PROBLEMCHARS = re.compile(r"[=\+/&<>;'\"\?%#$@\,\. \t\r\n]")
db = MongoClient("localhost", 27017)["map"]
After exploring the dataset, I think there are three main problems:
Values in the raw OSM XML are all strings, but some of them would be better stored as arrays or sub-documents.
For instance, the tag `cuisine` should be stored as an array, since a restaurant could serve more than one style of cuisine. In OSM, these values are usually separated by "," or ";", like `{"cuisine": "Italian;French"}`.
This also applies to many other tags (like `operator`; some bus routes are operated by multiple agencies), so we can write a generalised function:
def separate_into_list(tags, field, delim="[,;,、]"):
"""Separate the value of some tag into a list (array) instead of storing pure string.
Arguments:
tags -- dict, A dict of tags, {k: v, k: v ...}
field -- str, The tag to separate, e.g. "cuisine", "operator"
Keyword Arguments:
    delim -- str or re object, Separator for the value, defaults to "[,;,、]"
    ("," and "、" are common separators in our language)
    Returns:
    dict -- Part of the resulting document. {k: v} if there's nothing to separate; {k: [v1, v2, ...]} otherwise.
Modifies:
tags -- Deletes the field.
"""
delim = re.compile(delim)
if field not in tags: return {}
value = tags.pop(field)
if not delim.search(value): return {field: value}
return {field: [frag.strip() for frag in delim.split(value)]}
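A quick sanity check, with a made-up tags dict just for illustration:
tags = {"cuisine": "Italian;French", "name": "Some Bistro"}  # hypothetical values
separate_into_list(tags, "cuisine")
# -> {'cuisine': ['Italian', 'French']}
# tags is now {'name': 'Some Bistro'} -- the field was popped off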
Now we can call it from wherever we want. Let's build functions for `operator`, `cuisine`, `ref`, `source` and `phone`:
def process_operator(tags):
"""
{"operator": "國光客運、大都會客運"}
Should become:
{"operator": ["國光客運", "大都會客運"]}
"""
    # Don't split on the comma in company names like "Co., Ltd"
if "Co., Ltd" in tags.get("operator", ""):
return {"operator": tags.pop("operator")}
return separate_into_list(tags, "operator")
def process_cuisine(tags):
"""
{"cuisine": "Italian; French"}
Should become:
{"cuisine": ["italian", "french"]}
"""
if tags.get("cuisine"):
# It should be case-insensitive
tags["cuisine"] = tags["cuisine"].lower()
        # I don't know why, but some values look like this: "PIZZA_,PASTA"
return separate_into_list(tags, "cuisine", r"[;,、]|(?:_?,_?)")
return {}
def process_ref(tags):
"""
    Some subway stations have multiple refs; these stations are transfer stations.
    Roads are another example: some roads also have multiple refs.
"""
return separate_into_list(tags, "ref")
def process_source(tags):
return separate_into_list(tags, "source")
def process_phone_number(tags):
return separate_into_list(tags, "phone")
*By the way, these functions prefixed with "process_" are going to be called for each element; they return partial documents that are fragments of the resulting document. Finally, for each element, we'll start with an empty document (dict) and use the `update` method to merge in these fragments.*
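To make that concrete, here's a minimal sketch of the merging loop, with made-up tag values (the real version is the `process_tags` function further down):
# Minimal sketch of the merge pattern, using made-up tag values
document = {}
tags = {"operator": "國光客運、大都會客運", "ref": "1550"}
for processor in [process_operator, process_ref]:
    document.update(processor(tags))
document.update(tags)  # whatever's left goes in as plain fields
# document == {"operator": ["國光客運", "大都會客運"], "ref": "1550"}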
On the other hand, some tags should be combined; they should end up as a sub-document of the main document.
For example: the address. Addresses in OSM are separated into several tags: `addr:street`, `addr:housenumber`, `addr:city`, etc.
Normally this type of tag has a key with a colon in it, but not all tags with a colon belong to this type.
Another common case is multilingual tags such as `name`; you can find tons of tags like `name:en` and `name:ja`. They should be combined into an object and stored as `names`, so you can then access these values via `names.en`, `names.ja` and so on.
Let's again write a general function for this:
def as_subdocument(tags, prefix, into):
"""Combine tags with some prefix into an object, and store as a sub-document (or nested document).
Arguments:
tags -- dict, A dict of tags, {k: v, k: v ...}
    prefix -- str, All the tags with a key starting with the given prefix are going to be merged.
into -- str, A key for the resulting document to store the sub-document.
Returns:
dict -- Part of the resulting document.
Empty dict if no tags matching the prefix; {into: {k (without prefix): v, k: v ...}} otherwise.
Modifies:
    tags -- Deletes the matched fields.
"""
document = defaultdict(lambda: {})
for k in list(tags):
if k.startswith(prefix):
document[into][k[len(prefix):]] = tags.pop(k)
return document
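A quick check with some hypothetical multilingual names:
tags = {"name": "新店區", "name:en": "Xindian District", "name:ja": "新店区"}
result = as_subdocument(tags, "name:", "names")
# result["names"] == {"en": "Xindian District", "ja": "新店区"}
# tags is now {"name": "新店區"} -- the plain "name" tag doesn't match the prefix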
Again, use it to construct other functions.
def process_address(tags):
"""
{"addr:full": "11656臺北市文山區新光路二段32號",
"addr:country": "TW",
"addr:housenumber": "32"}
Should become:
{"address": {
"full": "11656臺北市文山區新光路二段32號",
"country": "TW",
"housenumber": "32"
}}
"""
    # A bare "address" tag should really be "addr:full"
if tags.get("address"):
tags["addr:full"] = tags.pop("address")
return as_subdocument(tags, "addr:", "address")
def process_names(tags):
"""
{"name:zh": "新店區",
"name:en": "Xindian District",
"name:ja": "新店区"}
Should become:
{"names": {
"zh": "新店區",
"en": "Xindian District",
"ja": "新店区"
}}
"""
return as_subdocument(tags, "name:", "names")
def process_alt_names(tags):
return as_subdocument(tags, "alt_name:", "alt_names")
def process_old_names(tags):
return as_subdocument(tags, "old_name:", "old_names")
def process_official_names(tags):
return as_subdocument(tags, "official_name:", "official_names")
def process_refs(tags):
# This is also about multilingual
return as_subdocument(tags, "ref:", "refs")
def process_GNS(tags):
return as_subdocument(tags, "GNS:", "GNS")
def process_building_props(tags):
# building:levels building:height etc.
return as_subdocument(tags, "building:", "building_props")
Well, this next one is a very localised problem. It matters to us Taiwanese because convenience stores are a part of our lives; we can do a lot of things there. They are everywhere in Taiwan, and I really mean EVERYWHERE.
There are four main convenience store companies in Taiwan: 7-Eleven, Family Mart, Hi-Life and OK Mart. When we talk about convenience stores, we always mean these four, not others.
So I had the idea of labelling these convenience stores correctly in our data, which may be helpful if we're going to do some analysis on them. The problem here is that these stores have varying names, since the data were edited by lots of different users. For 7-Eleven, for instance, some people wrote 7-ELEVEn, 7-11 or Seven-Eleven.
Also, many nodes were labelled as `{"shop": "convenience"}`, but a lot of them are not what we "expect".
Our task here is to label those four companies' stores as a stand-alone group, and to tag each with the unified company name or brand.
def process_conv_stores(tags):
"""Identify the convenience store company based on the name, and clean it.
{"shop": "convenience",
"name": "7 eleven"}
Should become:
{"shop": "convenience_store",
"brand": "7-Eleven"}
"""
if tags.get("shop") != "convenience" or tags.get("name") == None:
return {}
name = tags["name"].lower()
# 7 Eleven, seven-eleven, 7-11, 統一超商(company's legal name in our language, but we never say this)
if (("7" in name or "seven" in name) and ("11" in name or "eleven" in name)) or "統一" in name:
output = {"shop": "convenience_store", "brand": "7-Eleven"}
# Family Mart, FamilyMart, Family-Mart, 全家便利商店, 全家(for short, we always say this)
elif ("family" in name and "mart" in name) or "全家" in name:
output = {"shop": "convenience_store", "brand": "FamilyMart"}
# Hi-Life, HiLife, hi life, 萊爾富(again, we say this)
elif ("hi" in name and "life" in name) or "萊爾富" in name:
output = {"shop": "convenience_store", "brand": "Hi-Life"}
# OK, ok mart, OK‧MART
elif "ok" in name:
output = {"shop": "convenience_store", "brand": "OK·MART"}
else:
return {}
del tags["shop"]
if "brand" in tags: del tags["brand"]
# We're not going to drop "name", keep it to the end
return output
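A couple of made-up examples (the store names here are hypothetical) to exercise the branches:
tags = {"shop": "convenience", "name": "7-ELEVEn 景美門市"}  # hypothetical store name
process_conv_stores(tags)
# -> {'shop': 'convenience_store', 'brand': '7-Eleven'}
tags = {"shop": "convenience", "name": "全家便利商店"}
process_conv_stores(tags)
# -> {'shop': 'convenience_store', 'brand': 'FamilyMart'}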
As a heavy public transportation user, I take buses and the MRT (Taipei Mass Rapid Transit) every day. It's a good idea to include public transit information in the further analysis.
These route data are stored as relations in OSM XML. I'm going to separate each route relation into three parts: stops, depots and path, where stops are bus stops or MRT stations (nodes), depots are bus depots and MRT depots (closed ways / area ways), and path is an array of open ways.
In this `process_route` function, I'm going to take the element itself as the argument, because we need the `<member>`s under `<relation>`s:
def process_route(element):
"""Break a route relation into an object of three parts
Argument:
element: <relation></relation>
Returns:
dict -- Part of the resulting document, containing "route_content", which contains "stops", "depots" and "path".
"""
document = {"route_content": defaultdict(lambda: [])}
    for member in element.iter("member"):
        # After a bit of exploring, I found these stop roles don't differ much in practice
if member.get("role").lower() in ["stop", "backward_stop", "forward_stop", "platform"]:
document["route_content"]["stops"].append(member.get("ref"))
elif member.get("role") == "depot":
document["route_content"]["depots"].append(member.get("ref"))
else:
document["route_content"]["path"].append(member.get("ref"))
return document
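A tiny hand-written relation (made-up refs), just to see the shape of the output:
xml = """
<relation id="1">
  <member type="node" role="stop" ref="100"/>
  <member type="node" role="depot" ref="200"/>
  <member type="way" role="" ref="300"/>
</relation>
"""
doc = process_route(ET.fromstring(xml))
# doc["route_content"]["stops"]  == ["100"]
# doc["route_content"]["depots"] == ["200"]
# doc["route_content"]["path"]   == ["300"]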
Another type of relation I tried to deal with is boundary. This will be useful if we want to analyse based on administrative areas:
def process_boundary(element):
"""Break a boundary relation into an object of three parts
Argument:
element: <relation></relation>
Returns:
dict -- Part of the resulting document, containing "boundary_data",
which contains "admin_centre"(or "label"), "boundary", "subarea"
"""
document = {"boundary_data": defaultdict(lambda: [])}
    for member in element.iter("member"):
if member.get("role") in ["admin_centre", "label"]:
document["boundary_data"][member.get("role")] = member.get("ref")
elif member.get("role") in ["outer", "inner"]:
# Cities like New Taipei City have a ring-like boundary, should include "outer" or "inner"
document["boundary_data"]["boundary"].append(
{"type": member.get("role"), "ref": member.get("ref")}
)
elif member.get("role") == "subarea":
document["boundary_data"]["subareas"].append(member.get("ref"))
return document
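And the same kind of smoke test for boundaries (again, made-up refs):
xml = """
<relation id="2">
  <member type="node" role="admin_centre" ref="100"/>
  <member type="way" role="outer" ref="200"/>
  <member type="relation" role="subarea" ref="300"/>
</relation>
"""
doc = process_boundary(ET.fromstring(xml))
# doc["boundary_data"]["admin_centre"] == "100"
# doc["boundary_data"]["boundary"]     == [{"type": "outer", "ref": "200"}]
# doc["boundary_data"]["subareas"]     == ["300"]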
Those are the problems I've solved in the auditing phase; let's wrap it all up and import the data into the database.
This is how the final `shape_element` function looks; some additional helper functions are defined right after it.
def shape_element(elem):
"""Shape the element into dictionary like this:
{
"id": "2085444960",
"element": "node",
"loc": [121.524852, 25.0265463],
"name": "混_hun",
"created": {
"uid": "23731",
"version": "2",
"user": "Imrehg",
"changeset": "18946405",
"timestamp": {"$date": "2013-11-17T03:54:33Z"}
},
"address": {
"street": "和平東路一段104巷",
"housenumber": "6"
},
"amenity": "cafe",
"website": "http://huncoworkingspace.blogspot.tw/",
"wifi": "free",
"internet_access": "wlan",
"cuisine": "coffee_shop",
}
"""
if elem.tag in ["node", "way", "relation"]:
document = {}
# process_element_meta deals with attributes of the element
document.update(process_element_meta(elem))
# process_tags runs through the tags auditing functions like process_address, process_conv_stores
document.update(process_tags(elem))
if elem.tag == "way":
# process_nds grabs nodes (<nd>) in a way element into "node_refs"
document.update(process_nds(elem))
if elem.tag == "relation":
# Two special relations, route and boundary
if document.get("route") in ["bus", "subway", "railway"]:
document.update(process_route(elem))
elif document.get("boundary") == "administrative":
document.update(process_boundary(elem))
else:
# Otherwise, do a generalised transformation
document.update(process_relation(elem))
return document
def process_element_meta(element):
"""Extracts xml attributes from the element, and turn it into an appropriate form
Argument:
element: An XML element, could be node, way, or relation
Returns:
dict -- Part of the resulting document, including the element's metadata
"""
document = {"element": element.tag, "created": {}}
for key, val in element.attrib.items():
if key not in ["lon", "lat", "timestamp", "id"]:
document["created"][key] = val
elif key == "timestamp":
# Found out that we can use MongoDB's Extended JSON format to store something as Date if we use mongoimport
document["created"]["timestamp"] = {"$date": element.get("timestamp")}
elif key == "id":
document["id"] = element.get("id")
if element.tag == "node":
# Can get benefits of 2d indexes and geospatial queries
document["loc"] = [float(element.get("lon")), float(element.get("lat"))]
return document
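Here's what it produces for a made-up node (all attribute values here are hypothetical):
xml = '<node id="1" lat="25.04" lon="121.51" uid="42" user="someone" version="3" changeset="99" timestamp="2015-01-01T00:00:00Z"/>'
doc = process_element_meta(ET.fromstring(xml))
# doc["id"]      == "1"
# doc["element"] == "node"
# doc["loc"]     == [121.51, 25.04]
# doc["created"]["timestamp"] == {"$date": "2015-01-01T00:00:00Z"}
# doc["created"] also holds uid, user, version and changeset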
def process_tags(element):
"""Get all tags and feed them to process functions we just wrote
Argument:
element: An XML element, could be node, way, or relation
Returns:
dict -- Part of the resulting document, including the element's tags
"""
# Get all the tags into a dictionary
tags = {}
    for tag in element.iter("tag"):
if PROBLEMCHARS.search(tag.get("k")):
            # It turns out all the keys with problematic characters
            # just have dots where there should be colons
if "." in tag.get("k"):
tag.set("k", tag.get("k").replace(".", ":"))
else:
continue
tags[tag.get("k")] = tag.get("v")
document = {}
for processor in [process_operator,
process_cuisine,
process_ref,
process_source,
process_phone_number,
process_address,
process_names,
process_alt_names,
process_old_names,
process_official_names,
process_refs,
process_GNS,
process_building_props,
process_conv_stores]:
document.update(processor(tags))
# Remaining tags should be added as normal fields, like in Lesson 6
document.update(tags)
return document
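A quick check of the dot-to-colon repair, using a made-up node (this relies on the processor functions defined above):
xml = """
<node id="1" lat="25.0" lon="121.5">
  <tag k="addr.street" v="和平東路一段"/>
  <tag k="name:en" v="Somewhere"/>
</node>
"""
doc = process_tags(ET.fromstring(xml))
# doc["address"] == {"street": "和平東路一段"}  -- the dot was repaired to a colon
# doc["names"]   == {"en": "Somewhere"}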
def process_nds(element):
"""
<way ...>
<nd ref="12345678">
<nd ref="90123456">
</way>
Should become:
{...
"node_refs": ["12345678", "90123456"]}
"""
document = {"node_refs": []}
    for nd in element.iter("nd"):
document["node_refs"].append(nd.get("ref"))
return document
def process_relation(element):
"""
<relation ...>
<member type="node" role="foo" ref="12345678">
<member type="way" role="" ref="90123456">
</relation>
Should become:
{...
"members": [
{"type": "node",
"role": "foo",
"ref": "12345678"},
{"type": "way",
"role": "",
"ref": "90123456"}
]}
"""
document = {"members": []}
    for member in element.iter("member"):
document["members"].append(member.attrib)
return document
Final step: dump the documents into a file, then import them into MongoDB:
with open("taipei_taiwan.osm.json", "w") as output:
    for _, elem in ET.iterparse("taipei_taiwan.osm"):
        document = shape_element(elem)
        if document:
            json.dump(document, output)
            # Newline-delimit the documents so mongoimport reads one per line
            output.write("\n")
!mongoimport -d map -c taipei taipei_taiwan.osm.json
File sizes:
taipei_taiwan.osm ........ 128 MB
taipei_taiwan.osm.json ... 141 MB
Number of documents: 664962
(Actually, you can see it in the output cell of In[14].)
Number of nodes, ways, relations:
list(db.taipei.aggregate([
{"$group": {"_id": "$element", "count": {"$sum": 1}}},
{"$sort": {"count": -1}},
]))
Number of MRT stations:
db.taipei.find({"station": "subway"}).count()
Number of bus stops:
db.taipei.find({"highway": "bus_stop"}).count()
Convenience stores count for each company:
list(db.taipei.aggregate([
{"$match": {"shop": "convenience_store"}},
{"$group": {"_id": "$brand", "count": {"$sum": 1}}},
{"$sort": {"count": -1}},
]))
As I said, convenience stores play an important role in our lives, and it's important for a convenience store company to decide where to open a new store. If a place has a considerable number of people passing by or staying, and there's no convenience store there yet, it's definitely a great spot.
Our dataset has some amenity data, so we can infer whether a location is likely to attract many customers. Places like schools, MRT stations, or bus stops with many routes passing through may see large numbers of people.
The benefit of using our data is that we have all sorts of features, like restaurants, schools, hospitals and bus stops, not just convenience stores, so we can draw inferences from multiple perspectives.
However, one challenge is how to measure the importance of each place, since different kinds of places should be weighted differently. Another problem is distance: distances should be calculated along streets and roads, the way people would actually walk, not as the direct distance between two points.
Density of convenience stores (stores/km²):
# The area we selected is about 1048 square kilometres
db.taipei.find({"shop": "convenience_store"}).count() / 1048
Top 3 nearest convenience stores from an MRT station:
# Must create a 2d index before we use geospatial queries
db.taipei.create_index([("loc", "2d")])
list(db.taipei.aggregate([
{"$geoNear": {
"near": db.taipei.find_one({"station": "subway", "name": "七張"})["loc"],
"query": {"shop": "convenience_store"},
"distanceField": "distance",
"num": 3
}},
{"$project": {
"_id": 0,
"id": 1,
"brand": 1,
"distance": 1,
"loc": 1,
"address": 1
}}
]))
# For this station (七張), there's a FamilyMart right beside the entrance, and another one across the road.
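One caveat about the numbers above: with a legacy 2d index, the distanceField values come back in coordinate degrees, not metres. If we need real-world distances, a standard haversine helper (nothing project-specific, just the usual formula) can compute metres from the returned loc pairs:
from math import radians, sin, cos, asin, sqrt

def haversine(loc1, loc2):
    """Great-circle distance in metres between two [lon, lat] pairs."""
    lon1, lat1 = map(radians, loc1)
    lon2, lat2 = map(radians, loc2)
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))
For example, haversine(station["loc"], store["loc"]) gives the straight-line distance in metres, which is still not the walking distance mentioned earlier, but at least it's in familiar units.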
Top 5 bus stops served by the most routes:
list(db.taipei.aggregate([
{"$match": {"route": "bus"}},
{"$unwind": "$route_content.stops"},
{"$group": {"_id": "$route_content.stops", "count": {"$sum": 1}}},
{"$sort": {"count": -1}},
{"$limit": 5},
{"$lookup": {"from": "taipei",
"localField": "_id",
"foreignField": "id",
"as": "stop"}},
{"$unwind": "$stop"},
{"$project": {"_id": "$stop.id",
"name": "$stop.name",
"count": 1}}
]))
Well, after all this exploring and wrangling, I'm sure there's still a whole bunch of data missing in the Taipei area, but I can see activity on OpenStreetMap; the local OpenStreetMap community in Taiwan is pretty active. I really liked this project and spent a lot of time on it. It's fun and interesting, and I kept trying to dig deeper and deeper.