Urban sound has a huge influence over how we perceive places. Yet, city planning is concerned mainly with noise, simply because annoying sounds come to the attention of city officials in the form of complaints, whereas general urban sounds do not come to the attention as they cannot be easily captured at city scale. To capture both unpleasant and pleasant sounds, we applied a new methodology that relies on tagging information of georeferenced pictures to the cities of London and Barcelona. To begin with, we compiled the first urban sound dictionary and compared it with the one produced by collating insights from the literature: ours was experimentally more valid (if correlated with official noise pollution levels) and offered a wider geographical coverage. From picture tags, we then studied the relationship between soundscapes and emotions. We learned that streets with music sounds were associated with strong emotions of joy or sadness, whereas those with human sounds were associated with joy or surprise. Finally, we studied the relationship between soundscapes and people's perceptions and, in so doing, we were able to map which areas are chaotic, monotonous, calm and exciting. Those insights promise to inform the creation of restorative experiences in our increasingly urbanized world.
Studies have found that long-term exposure to urban noise (in particular, to traffic noise) results into sleeplessness and stress , increased incidence of learning impairments among children , and increased risk of cardiovascular morbidity such as hypertension  and heart attacks [4,5].
Because those health hazards are likely to reduce life expectancy, a variety of technologies for noise monitoring and mitigation have been developed over the years. However, those solutions are costly and do not scale at the level of an entire city. City officials typically measure noise by placing sensors at a few selected points. They do so mainly because they have to comply with the environmental noise directive , which requires the management of noise levels only from specific sources, such as road traffic, railways, major airports and industry. To fix the lack of scalability of a typical solution based on sensors, in distinct fields, researchers have worked on ways of making noise pollution estimation cheap. They have worked, for example, on epidemiological models to estimate noise levels from a few samples , on capturing samples from smartphones or other pervasive devices [8–12], and on mining geolocated data readily available from social media (e.g. Foursquare, Twitter) .
All this work has focused, however, on the negative side of urban sounds. Pleasant sounds have been left out from the urban planning literature, yet they have been shown to positively impact city dwellers’ health [14,15]. Only a few researchers have been interested in the whole ‘urban soundscape’. In the World Soundscape Project,1 for example, composer Raymond Murray Schafer et al. defined soundscape for the first time as an environment of sound (or sonic environment) with emphasis on the way it is perceived and understood by the individual, or by a society . That early work eventually led to a new International Standard, ISO 12913, where soundscape is defined as [the] acoustic environment as perceived or experienced and/or understood by a person or people, in context . Since that work, there remains a number of unsolved challenges though.
First, there is no shared vocabulary of urban sounds. Back in the early days of the World Soundscape Project, scholars collected sound-related terms and provided a classification of sounds , but that classification was meant to be neither comprehensive nor systematic. Signal processing techniques for automatically classifying sounds have recently used labelled examples [18,19], but, again, those training labels are not organized in any formal taxonomy.
Second, studying the relationship between urban sounds and people's perceptions is hard. So far, the assumption has been that a good proxy for perceptions is noise level. But perceptions depend on a variety of factors; for example, on what one is doing (e.g. whether one is at a concert). Therefore, policies focusing only on the reduction of noise levels might well fall short.
Finally, urban sounds cannot be captured at scale and, consequently, they are not considered when planning cities . That is because the collection of data for managing urban acoustic environments has mainly been relegated to small-scale surveys [21–24].
To partly address those challenges, we used georeferenced social media data to map the soundscape of an entire city, and related that mapping to people's emotional responses. We did so by extending previous work that captured urban smellscapes from social media  with four main contributions:
— We collected sound-related terms from different online and offline sources and arranged those terms in a taxonomy. The taxonomy was determined by matching the sound-related terms with the tags on 1.8 million georeferenced Flickr pictures in Barcelona and London, and by then analysing how those terms co-occurred across the pictures to obtain a term classification (co-occurring terms are expected to be semantically related). In so doing, we compiled the first urban sound dictionary and made it publicly available: in it, terms are best classified into six top-level categories (i.e. transport, mechanical, human, music, nature, indoor), and those categories closely resemble the manual classification previously derived by aural researchers over decades.
— Upon our picture tags, we produced detailed sound maps of Barcelona and London at the level of street segment. By looking at different segment types, we validate that, as one expects, pedestrian streets host people, music and indoor sounds, whereas primary roads are about transport and mechanical sounds.
— For the first time, to the best of our knowledge, we studied the relationship between urban sounds and emotions. By matching our picture tags with the terms of a widely used word-emotion lexicon, we determined people's emotional responses across the city, and how those responses related to urban sound: fear and anger were found on streets with mechanical sounds, whereas joy was found on streets with human and music sounds.
— Finally, we studied the relationship between a street's sounds and the perceptions people are likely to have of that street. Perceptions came from soundwalks conducted in two cities in the UK and Italy: locals were asked to identify sound sources and report them along with their subjective perceptions. Then, from social media data, we determined a location's expected perception based on the sound tags at the location.
The main idea behind our method was to search for sound-related words (mainly words reflecting potential sources of sound) on georeferenced social media content. To that end, we needed to get hold of two elements: the sound-related words and the content against which to match those words.
2.1 Sound words
We obtained sound-related words from the most comprehensive research project in the field—the World Soundscape Project—and from the most popular crowdsourced online repository of sounds—Freesound.
2.1.1 Schafer's words
The World Soundscape Project is an international research project that initiated the modern study of acoustic ecology. In his book The soundscape, the founder of the project, R. Murray Schafer, coined the term soundscape and emphasized the importance of identifying pleasant sounds and using them to create healthier environments. He described how to classify sounds, appreciating their beauty or ugliness, and offered exercises (e.g. ‘soundwalks’) to help people become more sensitive to sounds. An entire chapter was dedicated to the classification of urban sounds, which was based on literary, anthropological and historical documents. Sounds were classified into six categories: natural, human, societal (e.g. domestic sounds), mechanical, quiet and indicators (e.g. horns and whistles). Our work also used that classification: to associate words with each category, three annotators independently hand-coded the book's sections dedicated to the category, and the intersection of the three annotation sets (which is more conservative than the union) was considered, resulting in a list of 236 English terms.
2.1.2 Crowdsourced words
Freesound is the largest public online collaborative repository of audio samples: 130 K sounds annotated with 1.5 M tags are publicly available through an API. Out of the unique tags (which were 65 K), we considered only those that occurred more than 100 times (the remaining ones were too sparse to be useful), resulting in 2.2 K tags, which still amounted to 76% of the total volume as the tag frequency distribution was skewed. However, those tags covered many topics (including user names, navigational markers, sound quality descriptions and synthesized sound effects) and reflected ambiguous words at times (e.g. ‘fan’ might be a person or a mechanical device) and, as such, needed to be further filtered to retain only words related to sounds or physical sound sources. One annotator manually performed that filtering, which resulted into a final set of 229 English terms.
In addition to that set of words, there is an online repository specifically tailored to urban sounds called Favouritesounds.2 This site hosts crowdsourced maps of sounds for several cities in the world: individuals upload recordings of their favourite sounds, place them on the city map and annotate them with free-text descriptions. By manually parsing the 6 K unique words contained in those descriptions, we extracted 243 English terms.
2.2 Georeferenced content
Having two sets of sound-related words at hand, we needed social media data against which those words had to be matched. 17 M Flickr photos taken between 2005 and 2015 along with their tags were made publicly available in London and Barcelona. In those two cities, we identified each street segment from OpenStreetMap3 (OSM is a global group of volunteers who maintain free crowdsourced online maps). We then collated tags in each segment together by considering the augmented area of the segment's polyline, an area with an extra space of 22.5 m on each side to account for positioning errors typically present in georeferenced pictures [25,26].
We found that, among the three crowdsourced repositories, Freesound words matched most of the picture tags and offered the widest geographical coverage (figure 1), in that they matched 2.12 M tags and covered 141 K street segments in London. In addition, Freesound's words offered a far better coverage than Schafer's did, with a broad distribution of tags over street segments (figure 2).
Because the words of the other online repository considerably overlapped with Freesound's (67% Favoritesounds tags are also in Freesound), we worked only with Freesound (to ensure effective coverage) and with Schafer's classification (to allow for comparability with the literature).
To discover similarities, contrasts and patterns, sound words needed to be classified. Schafer already did so. Our Schafer's words are classified into seven main categories—nature, human, society, transport, mechanical, indicators and quiet—and each category might have a subcategory (e.g. society includes the subcategories indoor and entertainment). By contrast, Freesound's words are not classified. However, by looking at which Freesound words co-occur in the same locations, we could discover similarities (e.g. nature words could co-occur in parks, whereas transport words in trafficked streets). The use of community detection to extract word categories had been successfully tested in previous work that extracted categories of smell words . Compared with other clustering techniques (e.g. LDA , K-means ), a community detection technique has the advantage of being fully non-parametric and quite resilient to data sparsity. Therefore, we also applied it here. We first built a co-occurrence network where nodes were Freesound's words, and undirected edges were weighted with the number of times the two words co-occurred in the same Flickr pictures as tags. The semantic relatedness among words naturally emerged from the network's community structure: semantically related nodes ended up being both highly clustered together and weakly connected to the rest of the network. To determine the clustering structure, we could have used any of the literally thousands of different community detection algorithms that have been developed in the last decade . None of them always returns the ‘best’ clustering. However, because Infomap had shown very good performance across several benchmarks , we opted for it to obtain the initial partition of our network . Infomap's partitioning resulted in many clusters containing semantically related words, but it also resulted in some clusters that were simply too big to possibly be semantically homogeneous. To further split those clusters, we applied the community detection algorithm by Blondel et al. , which has been found to be the second best performing algorithm . This algorithm stops when no ‘node switch’ between communities increases the overall modularity , which measures the overall quality of the resulting partitions.4 The result of those two steps is the grouping of sound words in hierarchical categories. Because a few partitions of words could have been too fine-grained, we manually double-checked whether this was the case and, if so, we merged all those subcommunities that were under the same hierarchical partition and that contained strongly related sound words.
Figure 3 sketches the resulting classification in the form of a sound wheel. This wheel has six main categories (inner circle), each of which has a hierarchical structure with variable depth from 0 to 3. For brevity, the wheel reports only the first level fully (inner circle), whereas it reports samples for the two other levels. Despite spontaneously emerging from word co-occurrences and being fully data-driven, the classification in the wheel strikingly resembles Schafer's. The three categories human, nature and transport are all in both categorizations. The category quiet is missing, because it does not match any tag in Freesound, as one would expect. The remaining categories are all present but arranged at a different level: music and indoor are at the first level in the wheel, whereas they are at the second level in Schafer's categorization; the mechanical category in the wheel collates two of Schafer's categories into one: mechanical and indicator.
Freesound not only offered a classification similar to Schafer's and to recent working groups’ classifications [33,18] (speaking to its external validity), but also offered a richer vocabulary of words. By looking at the fraction tagc/tag of sound words in category c that matched at least one georeference picture tag (tagc) over the total number of tags in the city (tag), we saw that Freesound resulted in a full representation of all sound categories (figure 4), whereas Schafer's resulted in a patchy representation of many categories. Therefore, given its effectiveness, Freesound was chosen as the sound vocabulary for the creation of the urban sound wheel. Only the wheel's top-level categories were used. The full taxonomy is, however, available online5 for those wishing to explore specialized aspects of urban sounds (e.g. transport, nature).
With our sound categorization, we were able to determine, for each street segment j, its sound profile soundj in the form of a six-element vector. Given sound category c, the element soundj,c is 3.1where tagj,c is the number of tags at segment j that matched sound category c, and tagj is the total number of tags at segment j. To make sure the sound categories c we had chosen resulted in reasonable outcomes, we verified whether different street types were associated with sound profiles one would expect (§3.1), and whether those profiles matched official noise pollution data (§3.2).
3.1 Street types
One way of testing whether the six-category classification makes sense in the city context is to see which pairs of categories do not tend to co-occur spatially (e.g. nature and transport should be on separate streets). Therefore, for each street segment, we computed the pairwise Spearman rank correlation ρ between the fraction of sound tags in category c1 and that of sound tags in category c2, across all segments (figure 5). That is, we computed ρj(soundj,c1,soundj,c2) across all j's. We found that the correlations were either zero or negative. This meant that the categories were either orthogonal (i.e. the categories of human, indoor, music, mechanical show correlations close to zero) or geographically sorted in expected ways (with ρ=−0.50, nature and transport are seen, on average, on distinct segments).
To visualize the geographical sorting of sounds, we marked each street segment with the sound category that had the highest z-score in that segment (figure 6). The z-scores reflect the extent to which the fraction of sound tags in category c at street segment j deviated from the average fraction of sound tags in c at all the other segments: 3.2where μ(soundc) and σ(soundc) are the mean and standard deviation of the fractions of tags in sound category c across all segments. We then reported the most prominent sound at each street segment in figure 6: traffic was associated with street junctions and main roads, nature with parks or greenery spots and human and music with central parts or with pedestrian streets.
One, indeed, expects that different street types (table 1 reports the most frequent types in OSM) would be associated with different sounds. To verify that, we computed the average z-score of a sound category c for the segments with street type t: 3.3where St is the set of segments of (street) type t. Figure 7 reports the average values of those z-scores. Each clock-like representation refers to a street type, and the sound categories unfold along the clock: positive (negative) z-score values are marked in green (red) and suggest a presence of a sound category higher (lower) than the average one. By looking at the positive values, we saw that primary, secondary and tertiary streets (which contain cars) were associated with transport sounds; construction sites with mechanical sounds; footways and tracks (often embedded in parks) were associated with nature sounds; residential and pedestrian streets were associated with human, music and indoor sounds. Then, by looking at the negative values, we learned that primary, secondary, tertiary and construction streets were not associated with nature; and the other street types were not associated with sounds related to transport.
3.2 Noise pollution
The most studied aspect of urban sounds is the issue of noise pollution. Despite the importance of that issue, there are no reliable and high-coverage noise measurement data for world-class cities. There is a great number of participatory sensing applications that manage databases of noise levels in several cities, and some of them are publicly accessible [8–12], but all of them offer a limited geographical coverage of a city.
Barcelona is an exception, however. In 2009, the city council started a project, called Strategic Noise Map, whose main goal was to monitor noise levels and ultimately find new ways of limiting sound pollution. The project has a public API6 that returns noise values at the level of street segment for the whole city. For each segment, we collected the four dB values provided: three yearly averages for the three times of the day (day: from 07.00 to 21.00; evening: from 21.00 to 23.00; and night: from 23.00 to 07.00), and one aggregate value, the equivalent-weighted level (EWL), that averages those three values adding a 5 dB penalty to the evening period, and a 10 dB to the night period. With a practice akin to the one used for air quality indicators [34,35], those noise level values are estimated by a prediction model that is bootstrapped with field measurements . In the case of Barcelona, the model is bootstrapped with 2.300 short-span noise measurements lasting at most 15 min, usually taken during daytime, and with 100 long-span ones lasting from 24 h to a few days.
To see whether noise pollution was associated with specific sound categories, we considered the street segments with at least N tags and computed, across all the segments, the Spearman rank correlations ρj(EWLj, soundj,c) between segment j's EWL values (in dB) and j's fraction of picture tags that matched category c7 (figure 8). The idea was to determine not only which categories were associated with noise pollution but also how many tags were needed to have a significant association. We found that noise pollution was positively correlated (p<0.01) with traffic (0.1<ρ<0.3) and negatively correlated with nature (−0.1<ρ<−0.2), and those results did hold for low values of N, suggesting that only a few hundred tags were needed to build a representative sound profile of a street.
4. Emotional and perceptual layers
Sounds can be classified in ways that reflect aspects other than semantics—they may be classified according to, for example, their emotional qualities or the way they are perceived. Therefore, we now show how social media helps extracting the emotional layer (§4.1) and the perceptual layer (§4.2) of urban sound.
4.1 Emotional layer
Looking at a location through the lens of social media makes it possible to characterize places from different points of view. Sound has a highly celebrated link with emotions, especially music sound [38,39], and it has a considerable effect on our feelings and our behaviour.
One way of extracting emotions from georeferenced content is to use a word-emotion lexicon known as EmoLex . This lexicon classifies words into eight primary emotions: it contains binary associations of 6468 terms with the their typical emotional responses. The eight primary emotions (anger, fear, anticipation, trust, surprise, sadness, joy and disgust) come from Plutchik's psychoevolutionary theory , which is commonly used to characterize general emotional responses. We opted for EmoLex instead of other commonly used sentiment dictionaries (such as LIWC ) as it made it possible to study finer-grained emotions.
We matched our Flickr tags with the words in EmoLex and, for each street segment, we computed its emotion profile. The profile consisted of all Plutchik's primary emotions, in that each of its elements was associated with an emotion: 4.1where tagj,e is the number of tags at segment j that matched primary emotion e. We then computed the corresponding z-score: 4.2By computing the Spearman rank correlation ρj(zsoundj,c, zemotionj,e), we determined which sound was associated with which emotion. From figure 9, we see that joyful words were associated with streets typically characterized by music and human sounds, whereas they were absent in streets with traffic. Traffic was, instead, associated with words of fear, anticipation and anger. Interestingly, words of sadness (together with those of joy) were associated with streets with music, words of trust with indoors and words of surprise with streets typically characterized by human sounds.
4.2 Perceptual layer
From our social media data, we knew the extent to which a potential source of sound was present on a street. If we knew how people usually perceived that source as well, we could have estimated how the street was likely to be perceived.
One way of determining how people usually perceive sounds in the city context is to run soundwalks. These were introduced in the late 1960s  and are still common among acoustic researchers nowadays [44,45]. Therefore, to determine people's perceptions, one of the authors conducted soundwalks across eight areas in Brighton and Hove (UK) and 11 areas in Sorrento (Italy) in April and October. They involved 37 participants (UK: 16 males, five females, μage=38.6, δage=11.5; Italy: 10 males, six females, μage=34.7, δage=7.1) with a variety of backgrounds (e.g. acousticians, architects, planning professionals, local authorities and environmental officers). The experimenter led the participants along a predefined route and stopped at selected locations. At each of the locations, participants were asked to listen to the acoustic environment for two minutes and to complete a structured questionnaire (table 2) inquiring about sound sources’ notability , soundscape attributes , overall soundscape quality [46,47] and soundscape appropriateness . The questionnaire classified urban sounds into five categories (traffic, individuals, crowds, nature, other) as it is typically done in soundwalks [49,50], and the perceptions of such sounds into eight categories (pleasant, chaotic, vibrant, uneventful, calm, annoying, eventful and monotonous, after Axelsson et al.'s  work).
Those soundwalks resulted in 342 tuples, each of which represents a participant's report about sounds and perceptions at a given location. Each tuple had 13 [1,10] values: five values reflecting the extent to which the five sound categories were reported to be present, and the other eight reflecting the extent to which the eight perceptions were reported. More technically, soundk,c is the score for sound category c at tuple k, and perceptionk,f is the score for perception category f at tuple k. The frequency distributions of soundk,c (figure 10) suggest that the participants experienced both streets with only a few sounds, and streets with many. In addition, they rarely experienced crowds and came across traffic and, only at times, nature. Instead, the frequency distributions of perceptionk,f (figure 11) suggest that the participants experienced streets with very diverse perceptual profiles, resulting in the use of the full [1,10] score range for all perceptions.
To see which sounds participants tended to experience together, we computed the rank cross-correlation ρk(soundk,c1,soundk,c2) (figure 12a). Amid crowds, the participants reported high score in the category ‘individuals’. These two sound categories—individuals and crowds—had similar sound profiles so much so that the category ‘crowds’ could be experimentally replaced by the category ‘individuals’ in the specific instance of those soundwalks. Furthermore, as one would expect, the presence of traffic was associated with the absence of individuals, crowds and nature.
To then see which perceptions participants tended to experience together, we computed the rank cross-correlation ρ(perceptionk,f1,perceptionk,f2) (figure 12b). Perceptions meant to have opposite meanings indeed resulted in negative correlations (pleasant versus annoying, eventful versus uneventful, vibrant versus monotonous and calm versus chaos). Interestingly, with their near-zero correlation, pleasantness and eventfulness were orthogonal—when a place was eventful, nothing could have been said about its pleasantness.
To see which sounds participants experienced together with which perception, we computed the rank correlation ρk(soundk,c,perceptionk,f) (figure 14a). On average, vibrant areas tended to be associated with crowds, pleasant areas with individuals, calm areas with nature and annoying and chaotic areas with traffic. In a similar way, Axelsson et al. studied the principal components of their perceptual data  and found very similar results: they found that two components best explain most of the variability in the data (figure 13).
Finally, to map how streets are likely to be perceived, we needed to estimate a street's expected perception given the street's sound profile. The sound profiles came from our social media data, whereas the expected perception could have been computed from our soundwalks’ data. We had already computed the correlations between sounds and perceptions (figure 14a). However, those correlations are not expected values (accounting for, e.g. whether a perception is frequent or rare) but they simply are strength measures. Therefore, we computed the probability of perception f given sound category c as 4.3To compute the composing probabilities, we needed to discretize our [1,10] values taken during the soundwalks, and did so by segmenting them into quartiles. We then computed 4.4and 4.5where Q4(c) is the number of times the sound category c occurred in the fourth quartile of its score; Q4(c*) is the number of times any sound occurred in its fourth quartile; and Q4(c∧f) is the number of times sound c as well as perception f occurred in their fourth quartiles.
The conclusions drawn from the resulting conditional probabilities (figure 14b) did not differ from those drawn from the previously shown sound–perception correlations (figure 14a). As opposed to the correlation values, none of the conditional probabilities were very high (all below 0.33). This is because the conditional probabilities were estimated through the gathering of perceptual data in the wild8 and, as such, the mapping between perception and sound did not result in fully fledged probability values. Those values are best interpreted not as raw values but as ranked values. For example, nature sounds were associated with calm only with a probability 0.34, yet calm is the strongest perception related to nature as it ranks first.
The advantage of conditional probabilities over correlations is that they offer principled numbers that are properly normalized and could be readily used in future studies. They could be used, for example, to draw an aesthetics map, a map that reflects the emotional qualities of sounds. In the maps of figure 15, we associated each segment with the colour corresponding to the perception with the highest value of , where pj(c)=soundj,c, which is the fraction of tags at segment j that matched sound category c. pj( f) is effectively the probability that perception f is associated with street segment j, and the strongest f is associated with j. By mapping the probabilities of sound perceptions in London (figure 15a) and Barcelona (figure 15b), we observed that trafficked roads were chaotic, whereas walkable parts of the city were exciting. More interestingly, in the soundscape literature, monotonous areas have not necessarily been considered pleasant (they fall into the annoying quadrant of figure 13), yet the beaches of Barcelona were monotonous (and rightly so), but might have also been pleasant.
A project called SmellyMaps mapped urban smellscapes from social media , and this work—called ChattyMaps—has three main similarities with it. First, the taxonomy of sound and that of smell were both created using community detection algorithms, and both closely resembled categorizations widely used by researchers and practitioners in the corresponding fields. Second, the ways that social media data were mapped onto streets (e.g. buffering of segments, use of longitude/latitude coordinates on the pictures) are the same. Third, in both works, the validation was done with official data (i.e. with air quality data and noise pollution data). However, the two works differ as well, and they do so in three main ways. First, as opposed to SmellyMaps, ChattyMaps studied a variety of urban layers: not only the urban sound layer, but also the emotional, perceptual and sound diversity layers. Second, smell words were derived from smellwalks (as no other source was available), whereas sound words were derived from the online platform of Freesound. Third, because SmellyMaps showed that picture tags were more effective than tweets in capturing geographical-salient information, ChattyMaps entirely relied on Flickr tags.
Our approach comes with a few limitations, mainly because of data biases. The urban soundscape is multifaceted: the sounds we hear and the way we perceive them change considerably with small variations of, for example, space (e.g. simply turning a corner) and time (e.g. day versus night). By contrast, social media data have limited resolution and coverage, and that results in false positives. At times, sound tags do not reflect real sounds because of either misannotations or the figurative use of tags (figure 16a). Fortunately, those cases occur rarely. By manually inspecting 100 photos with sound tags, no false-positive was found: 87 pictures were correctly tagged and 13 referred to sounds that were plausible yet hard to ascertain.
Even when tags refer to sounds likely present in an area, they might do so partially. For example, the tags on the picture of figure 16b consisted of traffic terms (rightly) but not of nature terms, and that was a partial view of that street's soundscape. This risk shrinks as the number of sound tags for the segment increases. Indeed, let us stick with the same example: figure 16c was taken a few metres away from figure 16b, and its tags consisted of nature terms.
To partly mitigate noise at boundary regions, we did two things. First, as described in §2, we added a buffer of 22.5 m around each segment's bounding box. This has been commonly done in previous work dealing with georeferenced digital content [26,25]. It is hard to measure automatically how many tags are needed to get high confidence sound profiles, but we estimated it to be around 20–25 tags (figure 8), if official air quality data are used for validation.
Second, we associated sound distributions (and not individual sounds) with street segments. The six-dimensional sound vector was normalized in [0, 1] to have a probabilistic interpretation. In figure 16b,c, nature sounds were predominant, yet traffic-related sounds varied from 20% to 2% depending on the different parts of that street.
More generally, to have a more comprehensive view of this phenomenon, we determined each segment's sound diversity by computing the Shannon index 5.1where soundj,c is the fraction of tags at segment j that matched sound category c. After removing zero diversity values (often associated with segments having only one tag, which made 28% of segments in Barcelona, and 35% segments in London), we saw that the frequency distribution of diversity (figure 17a) had two peaks in 1 (for both cities) and in 1.5 for London and in 2.0 for Barcelona. Then, by mapping those values (figure 18), we saw that the values close to the first peak were associated with parks and suburbs, and those close to the second peak (and higher) were associated with the central parts of the two cities. Furthermore, the diversity did not depend on the number of tags per segment and became stable for segments with at least 10 tags (figure 17b).
We showed that social media data make it possible to effectively and cheaply track urban sounds at scale. Such a tracking was effective, because the resulting sounds were geographically sorted across street types in expected ways, and they matched noise pollution levels. The tracking was also cheap because it did not require the creation of any additional service or infrastructure. Finally, it worked at the scale of an entire city, and that is important, not least because, before our work, there had been nothing in sonography corresponding to the instantaneous impression which photography can create … The microphone samples details and gives the close-up but nothing corresponding to aerial photography .
However, whereas landscapes can be static, soundscapes are dynamic . Their perceptions are affected by demography (e.g. personal sensitivity to noise, age), context (e.g. city layout) and time (e.g. day versus night, weekdays versus weekends). Future studies could partly address those issues by collecting additional data and by comparing models of urban sounds generated from social media with those generated from geographic information system techniques.
Nonetheless, no matter what data one has, fully capturing soundscapes might well be impossible. Our work has focused on identifying potential sonic events. To use a food metaphor, if those events are the raw ingredients, then the aural architecture (which comes with the acoustic properties of trees, buildings, streets) is the cooking style, and the soundscape is the dish .
To unite hitherto isolated studies in a new synergy, in the future, we will conduct a comprehensive multi-sensory research of cities, one in which visual [55,56], olfactory  and sound perceptions are explored together.
The ultimate goal of this work is to empower city managers and researchers to find solutions for an ecologically balanced soundscape where the relationship between the human community and its sonic environment is in harmony, as Schafer famously (and prophetically) remarked in the late 1970s .
We declare we have no competing interests.
F.A. received funding through the People Programme (Marie Curie Actions) of the European Union's 7th Framework Programme FP7/2007-2013 under REA grant agreement no. 290110, SONORUS ‘Urban Sound Planner’.
We thank the Barcelona City Council for making the noise pollution data available.
One contribution to a special feature ‘City analytics: mathematical modelling and computational analytics for urban behaviour’.
↵3 A segment is often a street's portion between two road intersections but, more generally, it includes any outdoor place (e.g. highways, squares, steps, footpaths, cycle-ways).
↵4 If one were to apply Blondel's algorithm right from the start, the resulting clusters would be less coherent than those produced by our approach.
↵6 Interactive Map of Noise Pollution in Barcelona (http://w20.bcn.cat:1100/WebMapaAcustic/mapa˙soroll.aspx?lang=en).
↵8 It has been shown that, in soundwalks, perception ratings are affected by not only sounds, but also visual cues (e.g. greenery has been found to modulate soundscape ‘tranquillity’ ratings [52,53]).
- Received December 14, 2015.
- Accepted February 19, 2016.
© 2016 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.