joins_advanced.rst @ 6

Revision 1, 7.8 KB checked in by djay, 14 years ago (diff)
Initial import of the svn tree

Rev	Line
[1]	1	.. _joins_advanced:
	2
	3	Section 19: More Spatial Joins
	4	==============================
	5
	6	In the last section we saw the :command:`ST_Centroid(geometry)` and :command:`ST_Union([geometry])` functions, and some simple examples. In this section we will do some more elaborate things with them.
	7
	8	.. _creatingtractstable:
	9
	10	Creating a Census Tracts Table
	11	------------------------------
	12
	13	In the workshop ``\data\`` directory, is a file that includes attribute data, but no geometry, ``nyc_census_sociodata.sql``. The table includes interesting socioeconomic data about New York: commute times, incomes, and education attainment. There is just one problem. The data are summarized by "census tract" and we have no census tract spatial data!
	14
	15	In this section we will
	16
	17	* Load the ``nyc_census_sociodata.sql`` table
	18	* Create a spatial table for census tracts
	19	* Join the attribute data to the spatial data
	20	* Carry out some analysis using our new data
	21
	22	Loading nyc_census_sociodata.sql
	23	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	24
	25	#. Open the SQL query window in PgAdmin
	26	#. Select File->Open from the menu and browse to the ``nyc_census_sociodata.sql`` file
	27	#. Press the "Run Query" button
	28	#. If you press the "Refresh" button in PgAdmin, the list of tables should now include at ``nyc_census_sociodata`` table
	29
	30	Creating a Census Tracts Table
	31	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	32
	33	As we saw in the previous section, we can build up higher level geometries from the census block by summarizing on substrings of the ``blkid`` key. In order to get census tracts, we need to summarize grouping on the first 11 characters of the ``blkid``.
	34
	35	::
	36
	37	360610001009000 = 36 061 00100 9000
	38
	39	36 = State of New York
	40	061 = New York County (Manhattan)
	41	000100 = Census Tract
	42	9 = Census Block Group
	43	000 = Census Block
	44
	45	Create the new table using the :command:`ST_Union` aggregate:
	46
	47	.. code-block:: sql
	48
	49	-- Make the tracts table
	50	CREATE TABLE nyc_census_tract_geoms AS
	51	SELECT
	52	ST_Union(the_geom) AS the_geom,
	53	SubStr(blkid,1,11) AS tractid
	54	FROM nyc_census_blocks
	55	GROUP BY tractid;
	56
	57	-- Index the tractid
	58	CREATE INDEX nyc_census_tract_geoms_tractid_idx ON nyc_census_tract_geoms (tractid);
	59
	60	-- Update the geometry_columns table
	61	SELECT Populate_Geometry_Columns();
	62
	63	Join the Attributes to the Spatial Data
	64	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	65
	66	Join the table of tract geometries to the table of tract attributes with a standard attribute join
	67
	68	.. code-block:: sql
	69
	70	-- Make the tracts table
	71	CREATE TABLE nyc_census_tracts AS
	72	SELECT
	73	g.the_geom,
	74	a.*
	75	FROM nyc_census_tract_geoms g
	76	JOIN nyc_census_sociodata a
	77	ON g.tractid = a.tractid;
	78
	79	-- Index the geometries
	80	CREATE INDEX nyc_census_tract_gidx ON nyc_census_tracts USING GIST (the_geom);
	81
	82	-- Update the geometry_columns table
	83	SELECT Populate_Geometry_Columns();
	84
	85	.. _interestingquestion:
	86
	87	Answer an Interesting Question
	88	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	89
	90	Answer an interesting question! "List top 10 New York neighborhoods ordered by the proportion of people who have graduate degrees."
	91
	92	.. code-block:: sql
	93
	94	SELECT
	95	Round(100.0 * Sum(t.edu_graduate_dipl) / Sum(t.edu_total), 1) AS graduate_pct,
	96	n.name, n.boroname
	97	FROM nyc_neighborhoods n
	98	JOIN nyc_census_tracts t
	99	ON ST_Intersects(n.the_geom, t.the_geom)
	100	WHERE t.edu_total > 0
	101	GROUP BY n.name, n.boroname
	102	ORDER BY graduate_pct DESC
	103	LIMIT 10;
	104
	105	We sum up the statistics we are interested, then divide them together at the end. In order to avoid divide-by-zero errors, we don't bother bringing in tracts that have a population count of zero.
	106
	107	::
	108
	109	graduate_pct \| name \| boroname
	110	--------------+-------------------+-----------
	111	40.4 \| Carnegie Hill \| Manhattan
	112	40.2 \| Flatbush \| Brooklyn
	113	34.8 \| Battery Park \| Manhattan
	114	33.9 \| North Sutton Area \| Manhattan
	115	33.4 \| Upper West Side \| Manhattan
	116	33.3 \| Upper East Side \| Manhattan
	117	32.0 \| Tribeca \| Manhattan
	118	31.8 \| Greenwich Village \| Manhattan
	119	29.8 \| West Village \| Manhattan
	120	29.7 \| Central Park \| Manhattan
	121
	122
	123	.. _polypolyjoins:
	124
	125	Polygon/Polygon Joins
	126	---------------------
	127
	128	In our interesting query (in :ref:`interestingquestion`) we used the :command:`ST_Intersects(geometry_a, geometry_b)` function to determine which census tract polygons to include in each neighborhood summary. Which leads to the question: what if a tract falls on the border between two neighborhoods? It will intersect both, and so will be included in the summary statistics for both.
	129
	130	.. image:: ./screenshots/centroid_neighborhood.png
	131
	132	To avoid this kind of double counting there are two methods:
	133
	134	* The simple method is to ensure that each tract only falls in one summary area (using :command:`ST_Centroid(geometry)`)
	135	* The complex method is to divide crossing tracts at the borders (using :command:`ST_Intersection(geometry,geometry)`)
	136
	137	Here is an example of using the simple method to avoid double counting in our graduate education query:
	138
	139	.. code-block:: sql
	140
	141	SELECT
	142	Round(100.0 * Sum(t.edu_graduate_dipl) / Sum(t.edu_total), 1) AS graduate_pct,
	143	n.name, n.boroname
	144	FROM nyc_neighborhoods n
	145	JOIN nyc_census_tracts t
	146	ON ST_Contains(n.the_geom, ST_Centroid(t.the_geom))
	147	WHERE t.edu_total > 0
	148	GROUP BY n.name, n.boroname
	149	ORDER BY graduate_pct DESC
	150	LIMIT 10;
	151
	152	Note that the query takes longer to run now, because the :command:`ST_Centroid` function has to be run on every census tract.
	153
	154	::
	155
	156	graduate_pct \| name \| boroname
	157	--------------+-------------------+-----------
	158	49.2 \| Carnegie Hill \| Manhattan
	159	39.5 \| Battery Park \| Manhattan
	160	34.3 \| Upper East Side \| Manhattan
	161	33.6 \| Upper West Side \| Manhattan
	162	32.5 \| Greenwich Village \| Manhattan
	163	32.2 \| Tribeca \| Manhattan
	164	31.3 \| North Sutton Area \| Manhattan
	165	30.8 \| West Village \| Manhattan
	166	30.1 \| Downtown \| Brooklyn
	167	28.4 \| Cobble Hill \| Brooklyn
	168
	169	Avoiding double counting changes the results!
	170
	171
	172	.. _largeradiusjoins:
	173
	174	Large Radius Distance Joins
	175	---------------------------
	176
	177	A query that is fun to ask is "How do the commute times of people near (within 500 meters) subway stations differ from those of people far away from subway stations?"
	178
	179	However, the question runs into some problems of double counting: many people will be within 500 meters of multiple subway stations. Compare the population of New York:
	180
	181	.. code-block:: sql
	182
	183	SELECT Sum(popn_total)
	184	FROM nyc_census_blocks;
	185
	186	::
	187
	188	8008278
	189
	190	With the population of the people in New York within 500 meters of a subway station:
	191
	192	.. code-block:: sql
	193
	194	SELECT Sum(popn_total)
	195	FROM nyc_census_blocks census
	196	JOIN nyc_subway_stations subway
	197	ON ST_DWithin(census.the_geom, subway.the_geom, 500);
	198
	199	::
	200
	201	10556898
	202
	203	There's more people close to the subway than there are people! Clearly, our simple SQL is making a big double-counting error. You can see the problem looking at the picture of the buffered subways.
	204
	205	.. image:: ./screenshots/subways_buffered.png
	206
	207	The solution is to ensure that we have only distinct census blocks before passing them into the summarization portion of the query. We can do that by breaking our query up into a subquery that finds the distinct blocks, wrapped in a summarization query that returns our answer:
	208
	209	.. code-block:: sql
	210
	211	SELECT Sum(popn_total)
	212	FROM (
	213	SELECT DISTINCT ON (blkid) popn_total
	214	FROM nyc_census_blocks census
	215	JOIN nyc_subway_stations subway
	216	ON ST_DWithin(census.the_geom, subway.the_geom, 500)
	217	) AS distinct_blocks;
	218
	219	::
	220
	221	4953599
	222
	223	That's better! So a bit over half the population of New York is within 500m (about a 5-7 minute walk) of the subway.
	224
	225
	226

Note: See TracBrowser for help on using the repository browser.

PostGIS.fr

Bienvenue sur PostGIS.fr

source: trunk/workshop-foss4g/joins_advanced.rst @ 6

Download in other formats: