[1] | 1 | .. _joins_advanced: |
---|
| 2 | |
---|
| 3 | Section 19: More Spatial Joins |
---|
| 4 | ============================== |
---|
| 5 | |
---|
| 6 | In the last section we saw the :command:`ST_Centroid(geometry)` and :command:`ST_Union([geometry])` functions, and some simple examples. In this section we will do some more elaborate things with them. |
---|
| 7 | |
---|
| 8 | .. _creatingtractstable: |
---|
| 9 | |
---|
| 10 | Creating a Census Tracts Table |
---|
| 11 | ------------------------------ |
---|
| 12 | |
---|
| 13 | The workshop ``\data\`` directory contains a file, ``nyc_census_sociodata.sql``, that includes attribute data but no geometry. The table includes interesting socioeconomic data about New York: commute times, incomes, and educational attainment. There is just one problem: the data are summarized by "census tract" and we have no census tract spatial data! |
---|
| 14 | |
---|
| 15 | In this section we will: |
---|
| 16 | |
---|
| 17 | * Load the ``nyc_census_sociodata.sql`` table |
---|
| 18 | * Create a spatial table for census tracts |
---|
| 19 | * Join the attribute data to the spatial data |
---|
| 20 | * Carry out some analysis using our new data |
---|
| 21 | |
---|
| 22 | Loading nyc_census_sociodata.sql |
---|
| 23 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
---|
| 24 | |
---|
| 25 | #. Open the SQL query window in PgAdmin |
---|
| 26 | #. Select **File->Open** from the menu and browse to the ``nyc_census_sociodata.sql`` file |
---|
| 27 | #. Press the "Run Query" button |
---|
| 28 | #. If you press the "Refresh" button in PgAdmin, the list of tables should now include a ``nyc_census_sociodata`` table |
---|
| 29 | |
---|
| 30 | Creating a Census Tracts Table |
---|
| 31 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
---|
| 32 | |
---|
| 33 | As we saw in the previous section, we can build up higher-level geometries from the census blocks by summarizing on substrings of the ``blkid`` key. To get census tracts, we need to group on the first 11 characters of the ``blkid``. |
---|
| 34 | |
---|
| 35 | :: |
---|
| 36 | |
---|
| 37 | 360610001009000 = 36 061 000100 9 000 |
---|
| 38 | |
---|
| 39 | 36 = State of New York |
---|
| 40 | 061 = New York County (Manhattan) |
---|
| 41 | 000100 = Census Tract |
---|
| 42 | 9 = Census Block Group |
---|
| 43 | 000 = Census Block |
---|
| 44 | |
---|
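To see how these pieces line up with character positions, the breakdown above can be reproduced with ``SubStr`` on the sample key. This is only an illustrative sketch: the column aliases are made up for the example, and the positions assume the 15-character layout shown above.

.. code-block:: sql

  -- Split the sample block id into its parts (aliases are illustrative)
  SELECT
    SubStr('360610001009000', 1, 2)  AS state,        -- 36
    SubStr('360610001009000', 3, 3)  AS county,       -- 061
    SubStr('360610001009000', 6, 6)  AS tract,        -- 000100
    SubStr('360610001009000', 12, 1) AS block_group,  -- 9
    SubStr('360610001009000', 13, 3) AS block,        -- 000
    SubStr('360610001009000', 1, 11) AS tractid;      -- 36061000100

The last column is the 11-character prefix (state, county and tract) that identifies a census tract.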
| 45 | Create the new table using the :command:`ST_Union` aggregate: |
---|
| 46 | |
---|
| 47 | .. code-block:: sql |
---|
| 48 | |
---|
| 49 | -- Make the tracts table |
---|
| 50 | CREATE TABLE nyc_census_tract_geoms AS |
---|
| 51 | SELECT |
---|
| 52 | ST_Union(the_geom) AS the_geom, |
---|
| 53 | SubStr(blkid,1,11) AS tractid |
---|
| 54 | FROM nyc_census_blocks |
---|
| 55 | GROUP BY tractid; |
---|
| 56 | |
---|
| 57 | -- Index the tractid |
---|
| 58 | CREATE INDEX nyc_census_tract_geoms_tractid_idx ON nyc_census_tract_geoms (tractid); |
---|
| 59 | |
---|
| 60 | -- Update the geometry_columns table |
---|
| 61 | SELECT Populate_Geometry_Columns(); |
---|
| 62 | |
---|
| 63 | Join the Attributes to the Spatial Data |
---|
| 64 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
---|
| 65 | |
---|
| 66 | Join the table of tract geometries to the table of tract attributes with a standard attribute join: |
---|
| 67 | |
---|
| 68 | .. code-block:: sql |
---|
| 69 | |
---|
| 70 | -- Make the tracts table |
---|
| 71 | CREATE TABLE nyc_census_tracts AS |
---|
| 72 | SELECT |
---|
| 73 | g.the_geom, |
---|
| 74 | a.* |
---|
| 75 | FROM nyc_census_tract_geoms g |
---|
| 76 | JOIN nyc_census_sociodata a |
---|
| 77 | ON g.tractid = a.tractid; |
---|
| 78 | |
---|
| 79 | -- Index the geometries |
---|
| 80 | CREATE INDEX nyc_census_tract_gidx ON nyc_census_tracts USING GIST (the_geom); |
---|
| 81 | |
---|
| 82 | -- Update the geometry_columns table |
---|
| 83 | SELECT Populate_Geometry_Columns(); |
---|
| 84 | |
---|
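Because the statement above uses an inner join, any tract geometry without a matching attribute row is left out of ``nyc_census_tracts`` without warning. A quick check, sketched here on the assumption that ``tractid`` is never NULL in ``nyc_census_sociodata``, counts such tracts (the ``tracts_without_attributes`` alias is just an illustrative name):

.. code-block:: sql

  -- Count tract geometries that found no attribute row
  SELECT Count(*) AS tracts_without_attributes
  FROM nyc_census_tract_geoms g
  LEFT JOIN nyc_census_sociodata a
  ON g.tractid = a.tractid
  WHERE a.tractid IS NULL;

A result of zero means every tract geometry picked up its attributes.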
| 85 | .. _interestingquestion: |
---|
| 86 | |
---|
| 87 | Answer an Interesting Question |
---|
| 88 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
---|
| 89 | |
---|
| 90 | Answer an interesting question! "List the top 10 New York neighborhoods ordered by the proportion of people who have graduate degrees." |
---|
| 91 | |
---|
| 92 | .. code-block:: sql |
---|
| 93 | |
---|
| 94 | SELECT |
---|
| 95 | Round(100.0 * Sum(t.edu_graduate_dipl) / Sum(t.edu_total), 1) AS graduate_pct, |
---|
| 96 | n.name, n.boroname |
---|
| 97 | FROM nyc_neighborhoods n |
---|
| 98 | JOIN nyc_census_tracts t |
---|
| 99 | ON ST_Intersects(n.the_geom, t.the_geom) |
---|
| 100 | WHERE t.edu_total > 0 |
---|
| 101 | GROUP BY n.name, n.boroname |
---|
| 102 | ORDER BY graduate_pct DESC |
---|
| 103 | LIMIT 10; |
---|
| 104 | |
---|
| 105 | We sum up the statistics we are interested in, then divide them at the end. To avoid divide-by-zero errors, we exclude tracts that have an ``edu_total`` count of zero. |
---|
| 106 | |
---|
| 107 | :: |
---|
| 108 | |
---|
| 109 | graduate_pct | name | boroname |
---|
| 110 | --------------+-------------------+----------- |
---|
| 111 | 40.4 | Carnegie Hill | Manhattan |
---|
| 112 | 40.2 | Flatbush | Brooklyn |
---|
| 113 | 34.8 | Battery Park | Manhattan |
---|
| 114 | 33.9 | North Sutton Area | Manhattan |
---|
| 115 | 33.4 | Upper West Side | Manhattan |
---|
| 116 | 33.3 | Upper East Side | Manhattan |
---|
| 117 | 32.0 | Tribeca | Manhattan |
---|
| 118 | 31.8 | Greenwich Village | Manhattan |
---|
| 119 | 29.8 | West Village | Manhattan |
---|
| 120 | 29.7 | Central Park | Manhattan |
---|
| 121 | |
---|
| 122 | |
---|
| 123 | .. _polypolyjoins: |
---|
| 124 | |
---|
| 125 | Polygon/Polygon Joins |
---|
| 126 | --------------------- |
---|
| 127 | |
---|
| 128 | In our interesting query (in :ref:`interestingquestion`) we used the :command:`ST_Intersects(geometry_a, geometry_b)` function to determine which census tract polygons to include in each neighborhood summary. This leads to the question: what if a tract falls on the border between two neighborhoods? It will intersect both, and so will be included in the summary statistics for **both**. |
---|
| 129 | |
---|
| 130 | .. image:: ./screenshots/centroid_neighborhood.png |
---|
| 131 | |
---|
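One way to gauge how common this is, sketched with the tables built earlier in this section, is to list the tracts that intersect more than one neighborhood. The ``neighborhood_count`` alias is just an illustrative name.

.. code-block:: sql

  -- Tracts that would be counted in more than one neighborhood summary
  SELECT
    t.tractid,
    Count(*) AS neighborhood_count
  FROM nyc_census_tracts t
  JOIN nyc_neighborhoods n
  ON ST_Intersects(n.the_geom, t.the_geom)
  GROUP BY t.tractid
  HAVING Count(*) > 1
  ORDER BY neighborhood_count DESC;

Note that :command:`ST_Intersects` is also true for polygons that merely touch along a boundary, so this list includes tracts that only share an edge with a neighboring area.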
| 132 | To avoid this kind of double counting there are two methods: |
---|
| 133 | |
---|
| 134 | * The simple method is to ensure that each tract only falls in **one** summary area (using :command:`ST_Centroid(geometry)`) |
---|
| 135 | * The complex method is to divide crossing tracts at the borders (using :command:`ST_Intersection(geometry,geometry)`) |
---|
| 136 | |
---|
| 137 | Here is an example of using the simple method to avoid double counting in our graduate education query: |
---|
| 138 | |
---|
| 139 | .. code-block:: sql |
---|
| 140 | |
---|
| 141 | SELECT |
---|
| 142 | Round(100.0 * Sum(t.edu_graduate_dipl) / Sum(t.edu_total), 1) AS graduate_pct, |
---|
| 143 | n.name, n.boroname |
---|
| 144 | FROM nyc_neighborhoods n |
---|
| 145 | JOIN nyc_census_tracts t |
---|
| 146 | ON ST_Contains(n.the_geom, ST_Centroid(t.the_geom)) |
---|
| 147 | WHERE t.edu_total > 0 |
---|
| 148 | GROUP BY n.name, n.boroname |
---|
| 149 | ORDER BY graduate_pct DESC |
---|
| 150 | LIMIT 10; |
---|
| 151 | |
---|
| 152 | Note that the query takes longer to run now, because the :command:`ST_Centroid` function has to be run on every census tract. |
---|
| 153 | |
---|
| 154 | :: |
---|
| 155 | |
---|
| 156 | graduate_pct | name | boroname |
---|
| 157 | --------------+-------------------+----------- |
---|
| 158 | 49.2 | Carnegie Hill | Manhattan |
---|
| 159 | 39.5 | Battery Park | Manhattan |
---|
| 160 | 34.3 | Upper East Side | Manhattan |
---|
| 161 | 33.6 | Upper West Side | Manhattan |
---|
| 162 | 32.5 | Greenwich Village | Manhattan |
---|
| 163 | 32.2 | Tribeca | Manhattan |
---|
| 164 | 31.3 | North Sutton Area | Manhattan |
---|
| 165 | 30.8 | West Village | Manhattan |
---|
| 166 | 30.1 | Downtown | Brooklyn |
---|
| 167 | 28.4 | Cobble Hill | Brooklyn |
---|
| 168 | |
---|
| 169 | Avoiding double counting changes the results! |
---|
| 170 | |
---|
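The complex method can be sketched as well. The query below is an untested outline, not a definitive implementation: it assumes people are spread evenly across each tract, so that a tract's counts can be split in proportion to the area falling inside each neighborhood, and it assumes the geometries are valid enough for :command:`ST_Intersection` to run without errors. The ``share`` and ``pieces`` names are illustrative.

.. code-block:: sql

  -- Weight each tract's counts by the fraction of its area in the neighborhood
  SELECT
    Round((100.0 * Sum(edu_graduate_dipl * share) / Sum(edu_total * share))::numeric, 1) AS graduate_pct,
    name, boroname
  FROM (
    SELECT
      n.name, n.boroname,
      t.edu_graduate_dipl, t.edu_total,
      ST_Area(ST_Intersection(n.the_geom, t.the_geom)) / ST_Area(t.the_geom) AS share
    FROM nyc_neighborhoods n
    JOIN nyc_census_tracts t
    ON ST_Intersects(n.the_geom, t.the_geom)
    WHERE t.edu_total > 0
  ) AS pieces
  GROUP BY name, boroname
  -- Skip neighborhoods whose tracts only touch the boundary (zero shared area)
  HAVING Sum(edu_total * share) > 0
  ORDER BY graduate_pct DESC
  LIMIT 10;

The cast to ``numeric`` is needed because :command:`ST_Area` returns double precision, and the two-argument form of ``Round`` only accepts ``numeric``. Expect this version to run noticeably slower than the centroid approach, since it computes a polygon intersection for every tract and neighborhood pair.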
| 171 | |
---|
| 172 | .. _largeradiusjoins: |
---|
| 173 | |
---|
| 174 | Large Radius Distance Joins |
---|
| 175 | --------------------------- |
---|
| 176 | |
---|
| 177 | A query that is fun to ask is "How do the commute times of people near (within 500 meters) subway stations differ from those of people far away from subway stations?" |
---|
| 178 | |
---|
| 179 | However, the question runs into some problems of double counting: many people will be within 500 meters of multiple subway stations. Compare the population of New York: |
---|
| 180 | |
---|
| 181 | .. code-block:: sql |
---|
| 182 | |
---|
| 183 | SELECT Sum(popn_total) |
---|
| 184 | FROM nyc_census_blocks; |
---|
| 185 | |
---|
| 186 | :: |
---|
| 187 | |
---|
| 188 | 8008278 |
---|
| 189 | |
---|
| 190 | With the population of New York within 500 meters of a subway station: |
---|
| 191 | |
---|
| 192 | .. code-block:: sql |
---|
| 193 | |
---|
| 194 | SELECT Sum(popn_total) |
---|
| 195 | FROM nyc_census_blocks census |
---|
| 196 | JOIN nyc_subway_stations subway |
---|
| 197 | ON ST_DWithin(census.the_geom, subway.the_geom, 500); |
---|
| 198 | |
---|
| 199 | :: |
---|
| 200 | |
---|
| 201 | 10556898 |
---|
| 202 | |
---|
| 203 | There are more people close to the subway than there are people! Clearly, our simple SQL is making a big double-counting error. You can see the problem by looking at the picture of the buffered subways. |
---|
| 204 | |
---|
| 205 | .. image:: ./screenshots/subways_buffered.png |
---|
| 206 | |
---|
| 207 | The solution is to ensure that we have only distinct census blocks before passing them into the summarization portion of the query. We can do that by breaking our query up into a subquery that finds the distinct blocks, wrapped in a summarization query that returns our answer: |
---|
| 208 | |
---|
| 209 | .. code-block:: sql |
---|
| 210 | |
---|
| 211 | SELECT Sum(popn_total) |
---|
| 212 | FROM ( |
---|
| 213 | SELECT DISTINCT ON (blkid) popn_total |
---|
| 214 | FROM nyc_census_blocks census |
---|
| 215 | JOIN nyc_subway_stations subway |
---|
| 216 | ON ST_DWithin(census.the_geom, subway.the_geom, 500) |
---|
| 217 | ) AS distinct_blocks; |
---|
| 218 | |
---|
| 219 | :: |
---|
| 220 | |
---|
| 221 | 4953599 |
---|
| 222 | |
---|
| 223 | That's better! So about 62% of the population of New York is within 500 meters (about a 5-7 minute walk) of the subway. |
---|
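The same idea can also be written with ``EXISTS``, which keeps a block as soon as one nearby station is found and so never produces duplicate rows in the first place. This is an alternative sketch, assuming each row of ``nyc_census_blocks`` is a distinct block:

.. code-block:: sql

  -- Sum each block at most once, if any station lies within 500 meters
  SELECT Sum(census.popn_total)
  FROM nyc_census_blocks census
  WHERE EXISTS (
    SELECT 1
    FROM nyc_subway_stations subway
    WHERE ST_DWithin(census.the_geom, subway.the_geom, 500)
  );

Under that assumption it should return the same total as the ``DISTINCT ON`` version.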
| 224 | |
---|
| 225 | |
---|
| 226 | |
---|