joins.rst @ 15

Revision 1, 11.6 KB checked in by djay, 14 years ago (diff)
Initial import of the svn tree

Rev	Line
[1]	1	.. _joins:
	2
	3	Section 12: Spatial Joins
	4	=========================
	5
	6	Spatial joins are the bread-and-butter of spatial databases. They allow you to combine information from different tables by using spatial relationships as the join key. Much of what we think of as "standard GIS analysis" can be expressed as spatial joins.
	7
	8	In the previous section, we explored spatial relationships using a two-step process: first we extracted a subway station point for 'Broad St'; then, we used that point to ask further questions such as "what neighborhood is the 'Broad St' station in?"
	9
	10	Using a spatial join, we can answer the question in one step, retrieving information about the subway station and the neighborhood that contains it:
	11
	12	.. code-block:: sql
	13
	14	SELECT
	15	subways.name AS subway_name,
	16	neighborhoods.name AS neighborhood_name,
	17	neighborhoods.boroname AS borough
	18	FROM nyc_neighborhoods AS neighborhoods
	19	JOIN nyc_subway_stations AS subways
	20	ON ST_Contains(neighborhoods.the_geom, subways.the_geom)
	21	WHERE subways.name = 'Broad St';
	22
	23	::
	24
	25	subway_name \| neighborhood_name \| borough
	26	-------------+--------------------+-----------
	27	Broad St \| Financial District \| Manhattan
	28
	29	We could have joined every subway station to its containing neighborhood, but in this case we wanted information about just one. Any function that provides a true/false relationship between two tables can be used to drive a spatial join, but the most commonly used ones are: :command:`ST_Intersects`, :command:`ST_Contains`, and :command:`ST_DWithin`.
	30
	31	Join and Summarize
	32	------------------
	33
	34	The combination of a ``JOIN`` with a ``GROUP BY`` provides the kind of analysis that is usually done in a GIS system.
	35
	36	For example: "What is the population and racial make-up of the neighborhoods of Manhattan?" Here we have a question that combines information from about population from the census with the boundaries of neighborhoods, with a restriction to just one borough of Manhattan.
	37
	38	.. code-block:: sql
	39
	40	SELECT
	41	neighborhoods.name AS neighborhood_name,
	42	Sum(census.popn_total) AS population,
	43	Round(100.0 * Sum(census.popn_white) / Sum(census.popn_total),1) AS white_pct,
	44	Round(100.0 * Sum(census.popn_black) / Sum(census.popn_total),1) AS black_pct
	45	FROM nyc_neighborhoods AS neighborhoods
	46	JOIN nyc_census_blocks AS census
	47	ON ST_Intersects(neighborhoods.the_geom, census.the_geom)
	48	WHERE neighborhoods.boroname = 'Manhattan'
	49	GROUP BY neighborhoods.name
	50	ORDER BY white_pct DESC;
	51
	52	::
	53
	54	neighborhood_name \| population \| white_pct \| black_pct
	55	---------------------+------------+-----------+-----------
	56	Carnegie Hill \| 19909 \| 91.6 \| 1.5
	57	North Sutton Area \| 21413 \| 90.3 \| 1.2
	58	West Village \| 27141 \| 88.1 \| 2.7
	59	Upper East Side \| 201301 \| 87.8 \| 2.5
	60	Greenwich Village \| 57047 \| 84.1 \| 3.3
	61	Soho \| 15371 \| 84.1 \| 3.3
	62	Murray Hill \| 27669 \| 79.2 \| 2.3
	63	Gramercy \| 97264 \| 77.8 \| 5.6
	64	Central Park \| 49284 \| 77.8 \| 10.4
	65	Tribeca \| 13601 \| 77.2 \| 5.5
	66	Midtown \| 70412 \| 75.9 \| 5.1
	67	Chelsea \| 51773 \| 74.7 \| 7.4
	68	Battery Park \| 9928 \| 74.1 \| 4.9
	69	Upper West Side \| 212499 \| 73.3 \| 10.4
	70	Financial District \| 17279 \| 71.3 \| 5.3
	71	Clinton \| 26347 \| 64.6 \| 10.3
	72	East Village \| 77448 \| 61.4 \| 9.7
	73	Garment District \| 6900 \| 51.1 \| 8.6
	74	Morningside Heights \| 41499 \| 50.2 \| 24.8
	75	Little Italy \| 14178 \| 39.4 \| 1.2
	76	Yorkville \| 57800 \| 31.2 \| 33.3
	77	Inwood \| 50922 \| 29.3 \| 14.9
	78	Lower East Side \| 104690 \| 28.3 \| 9.0
	79	Washington Heights \| 187198 \| 26.9 \| 16.3
	80	East Harlem \| 62279 \| 20.2 \| 46.2
	81	Hamilton Heights \| 71133 \| 14.6 \| 41.1
	82	Chinatown \| 18195 \| 10.3 \| 4.2
	83	Harlem \| 125501 \| 5.7 \| 80.5
	84
	85
	86	What's going on here? Notionally (the actual evaluation order is optimized under the covers by the database) this is what happens:
	87
	88	#. The ``JOIN`` clause creates a virtual table that includes columns from both the neighborhoods and census tables.
	89	#. The ``WHERE`` clause filters our virtual table to just rows in Manhattan.
	90	#. The remaining rows are grouped by the neighborhood name and fed through the aggregation function to :command:`Sum()` the population values.
	91	#. After a little arithmetic and formatting (e.g., ``GROUP BY``, ``ORDER BY``) on the final numbers, our query spits out the percentages.
	92
	93	.. note::
	94
	95	The ``JOIN`` clause combines two ``FROM`` items. By default, we are using an ``INNER JOIN``, but there are four other types of joins. For further information see the `join_type <http://www.postgresql.org/docs/8.1/interactive/sql-select.html>`_ definition in the PostgreSQL documentation.
	96
	97	We can also use distance tests as a join key, to create summarized "all items within a radius" queries. Let's explore the racial geography of New York using distance queries.
	98
	99	First, let's get the baseline racial make-up of the city.
	100
	101	.. code-block:: sql
	102
	103	SELECT
	104	100.0 * Sum(popn_white) / Sum(popn_total) AS white_pct,
	105	100.0 * Sum(popn_black) / Sum(popn_total) AS black_pct,
	106	Sum(popn_total) AS popn_total
	107	FROM nyc_census_blocks;
	108
	109	::
	110
	111	white_pct \| black_pct \| popn_total
	112	---------------------+---------------------+------------
	113	44.6586020115685295 \| 26.5945063345703034 \| 8008278
	114
	115
	116	So, of the 8M people in New York, about 44% are "white" and 26% are "black".
	117
	118	Duke Ellington once sang that "You / must take the A-train / To / go to Sugar Hill way up in Harlem." As we saw earlier, Harlem has far and away the highest African-American population in Manhattan (80.5%). Is the same true of Duke's A-train?
	119
	120	First, note that the contents of the ``nyc_subway_stations`` table ``routes`` field is what we are interested in to find the A-train. The values in there are a little complex.
	121
	122	.. code-block:: sql
	123
	124	SELECT DISTINCT routes FROM nyc_subway_stations;
	125
	126	::
	127
	128	A,C,G
	129	4,5
	130	D,F,N,Q
	131	5
	132	E,F
	133	E,J,Z
	134	R,W
	135
	136	.. note::
	137
	138	The ``DISTINCT`` keyword eliminates duplicate rows from the result. Without the ``DISTINCT`` keyword, the query above identifies 491 results instead of 73.
	139
	140	So to find the A-train, we will want any row in ``routes`` that has an 'A' in it. We can do this a number of ways, but today we will use the fact that :command:`strpos(routes,'A')` will return a non-zero number if 'A' is in the routes field.
	141
	142	.. code-block:: sql
	143
	144	SELECT DISTINCT routes
	145	FROM nyc_subway_stations AS subways
	146	WHERE strpos(subways.routes,'A') > 0;
	147
	148	::
	149
	150	A,B,C
	151	A,C
	152	A
	153	A,C,G
	154	A,C,E,L
	155	A,S
	156	A,C,F
	157	A,B,C,D
	158	A,C,E
	159
	160	Let's summarize the racial make-up of within 200 meters of the A-train line.
	161
	162	.. code-block:: sql
	163
	164	SELECT
	165	100.0 * Sum(popn_white) / Sum(popn_total) AS white_pct,
	166	100.0 * Sum(popn_black) / Sum(popn_total) AS black_pct,
	167	Sum(popn_total) AS popn_total
	168	FROM nyc_census_blocks AS census
	169	JOIN nyc_subway_stations AS subways
	170	ON ST_DWithin(census.the_geom, subways.the_geom, 200)
	171	WHERE strpos(subways.routes,'A') > 0;
	172
	173	::
	174
	175	white_pct \| black_pct \| popn_total
	176	---------------------+---------------------+------------
	177	42.0805466940877366 \| 23.0936148851067964 \| 185259
	178
	179	So the racial make-up along the A-train isn't radically different from the make-up of New York City as a whole.
	180
	181	Advanced Join
	182	-------------
	183
	184	In the last section we saw that the A-train didn't serve a population that differed much from the racial make-up of the rest of the city. Are there any trains that have a non-average racial make-up?
	185
	186	To answer that question, we'll add another join to our query, so that we can simultaneously calculate the make-up of many subway lines at once. To do that, we'll need to create a new table that enumerates all the lines we want to summarize.
	187
	188	.. code-block:: sql
	189
	190	CREATE TABLE subway_lines ( route char(1) );
	191	INSERT INTO subway_lines (route) VALUES
	192	('A'),('B'),('C'),('D'),('E'),('F'),('G'),
	193	('J'),('L'),('M'),('N'),('Q'),('R'),('S'),
	194	('Z'),('1'),('2'),('3'),('4'),('5'),('6'),
	195	('7');
	196
	197	Now we can join the table of subway lines onto our original query.
	198
	199	.. code-block:: sql
	200
	201	SELECT
	202	lines.route,
	203	Round(100.0 * Sum(popn_white) / Sum(popn_total), 1) AS white_pct,
	204	Round(100.0 * Sum(popn_black) / Sum(popn_total), 1) AS black_pct,
	205	Sum(popn_total) AS popn_total
	206	FROM nyc_census_blocks AS census
	207	JOIN nyc_subway_stations AS subways
	208	ON ST_DWithin(census.the_geom, subways.the_geom, 200)
	209	JOIN subway_lines AS lines
	210	ON strpos(subways.routes, lines.route) > 0
	211	GROUP BY lines.route
	212	ORDER BY black_pct DESC;
	213
	214	::
	215
	216	route \| white_pct \| black_pct \| popn_total
	217	-------+-----------+-----------+------------
	218	S \| 30.1 \| 59.5 \| 32730
	219	3 \| 34.3 \| 51.8 \| 201888
	220	2 \| 33.6 \| 45.5 \| 535414
	221	5 \| 32.1 \| 45.1 \| 407324
	222	C \| 41.3 \| 35.9 \| 430194
	223	4 \| 34.7 \| 30.9 \| 328292
	224	B \| 36.1 \| 30.6 \| 261186
	225	Q \| 52.9 \| 26.3 \| 259820
	226	J \| 29.5 \| 23.6 \| 126764
	227	A \| 42.1 \| 23.1 \| 370518
	228	Z \| 29.5 \| 21.5 \| 81493
	229	D \| 39.8 \| 20.9 \| 233855
	230	G \| 44.8 \| 20.0 \| 138602
	231	L \| 53.9 \| 17.1 \| 104140
	232	6 \| 52.7 \| 16.3 \| 257769
	233	1 \| 54.8 \| 12.6 \| 659028
	234	F \| 60.0 \| 8.6 \| 438212
	235	M \| 50.0 \| 7.8 \| 166721
	236	E \| 69.4 \| 5.3 \| 86118
	237	R \| 57.7 \| 4.8 \| 389124
	238	7 \| 42.4 \| 3.8 \| 107543
	239
	240
	241	As before, the joins create a virtual table of all the possible combinations available within the constraints of the ``JOIN ON`` restrictions, and those rows are then fed into a ``GROUP`` summary. The spatial magic is in the ``ST_DWithin`` function, that ensures only census blocks close to the appropriate subway stations are included in the calculation.
	242
	243	Function List
	244	-------------
	245
	246	`ST_Contains(geometry A, geometry B) <http://postgis.org/docs/ST_Contains.html>`_: Returns true if and only if no points of B lie in the exterior of A, and at least one point of the interior of B lies in the interior of A.
	247
	248	`ST_DWithin(geometry A, geometry B, radius) <http://postgis.org/docs/ST_DWithin.html>`_: Returns true if the geometries are within the specified distance of one another.
	249
	250	`ST_Intersects(geometry A, geometry B) <http://postgis.org/docs/ST_Intersects.html>`_: Returns TRUE if the Geometries/Geography "spatially intersect" - (share any portion of space) and FALSE if they don't (they are Disjoint).
	251
	252	`round(v numeric, s integer) <http://www.postgresql.org/docs/7.4/interactive/functions-math.html>`_: PostgreSQL math function that rounds to s decimal places
	253
	254	`strpos(string, substring) <http://www.postgresql.org/docs/current/static/functions-string.html>`_: PostgreSQL string function that returns an integer location of a specified substring.
	255
	256	`sum(expression) <http://www.postgresql.org/docs/8.2/static/functions-aggregate.html#FUNCTIONS-AGGREGATE-TABLE>`_: PostgreSQL aggregate function that returns the sum of records in a set of records.
	257
	258	.. rubric:: Footnotes
	259
	260	.. [#PostGIS_Doco] http://postgis.org/documentation/manual-1.5/
	261

Note: See TracBrowser for help on using the repository browser.

PostGIS.fr

Bienvenue sur PostGIS.fr

source: trunk/workshop-foss4g/joins.rst @ 15

Download in other formats: