1st ACM European Summer School on Data Science, Athens, July 2017
Over the past decade there has been a growing public fascination with the complex “connectedness” of modern society. This connectedness is found in many contexts: in the rapid growth of the Internet and the Web, in the ease with which global communication now takes place, and in the ability of news and information as well as epidemics and financial crises to spread around the world with surprising speed and intensity. These are phenomena that involve networks and the aggregate behavior of groups of people; they are based on the links that connect us and the ways in which each of our decisions can have subtle consequences for the outcomes of everyone else. This crash course is an introduction to the analysis of complex networks, made possible by the availability of big data, with a special focus on social networks and their structure and function. Drawing on ideas from computing and information science, complex systems, mathematical and statistical modelling, economics and sociology, this lecture sketches the emerging field of study that is growing at the interface of all these areas, addressing fundamental questions about how the social, economic, and technological worlds are connected.
• Big graph data and social, information, biological and technological networks
• The architecture of complexity and how real networks differ from random networks: node degree and long tails, social distance and small worlds, clustering and triadic closure. Comparing real networks and random graphs. The main models of network science: small world and preferential attachment.
• The power of complex networks: Strong and weak ties, community structure and long-range bridges. Robustness of networks to failures and attacks. Cascades and spreading. Network models for diffusion and epidemics. The strength of weak ties for the diffusion of information. The strength of strong ties for the diffusion of innovation.
• Practical network analytics with Cytoscape and Gephi. Simulation of network processes with NetLogo.
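As a rough illustration of the contrast between random graphs and preferential attachment named in the outline above, the following minimal sketch builds one graph of each kind with the same number of nodes and edges, then compares their maximum degree and average clustering coefficient. It is not taken from the lecture material, which uses Cytoscape, Gephi and NetLogo. All function names, the graph sizes and the seed are assumptions chosen for illustration, and the preferential-attachment generator is a simplified Barabási–Albert-style construction.

```python
import random


def random_graph(n, m, rng):
    """G(n, m) random graph: m distinct edges chosen uniformly at random."""
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)
        edges.add((min(u, v), max(u, v)))
    return edges


def preferential_attachment(n, k, rng):
    """Simplified preferential attachment: each new node links to k distinct
    existing nodes, picked with probability proportional to their degree."""
    # Seed graph: a clique on the first k + 1 nodes.
    edges = {(u, v) for u in range(k + 1) for v in range(u + 1, k + 1)}
    # Each node appears in 'stubs' once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    stubs = [x for e in edges for x in e]
    for new in range(k + 1, n):
        targets = set()
        while len(targets) < k:
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.add((t, new))
            stubs.extend((t, new))
    return edges


def adjacency(n, edges):
    """Undirected adjacency sets for nodes 0..n-1."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient (nodes of degree < 2 count as 0)."""
    total = 0.0
    for nbrs in adj.values():
        d = len(nbrs)
        if d < 2:
            continue
        # Count edges among the neighbours, i.e. closed triads at this node.
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (d * (d - 1))
    return total / len(adj)


def max_degree(adj):
    return max(len(nbrs) for nbrs in adj.values())


# Illustrative parameters (assumptions, not values from the lecture).
N, K = 2000, 3
rng = random.Random(42)
pa_edges = preferential_attachment(N, K, rng)
er_edges = random_graph(N, len(pa_edges), rng)
pa_adj = adjacency(N, pa_edges)
er_adj = adjacency(N, er_edges)

print("edges per graph:        ", len(pa_edges))
print("max degree (random):    ", max_degree(er_adj))
print("max degree (pref. att.):", max_degree(pa_adj))
print("avg clustering (random):    ", round(average_clustering(er_adj), 4))
print("avg clustering (pref. att.):", round(average_clustering(pa_adj), 4))
```

Both graphs have the same average degree, so any difference in maximum degree reflects how the edges are distributed: preferential attachment concentrates links on a few hubs, which is the long-tail effect the outline refers to. Neither toy model is meant to reproduce the high clustering of real social networks.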
Data science has created unprecedented opportunities but also new risks. Data science techniques might expose sensitive traits of individuals and invade their privacy, and this information could be used to discriminate against people based on their presumed characteristics, or profiles. Sophisticated data-driven machine learning algorithms yield classification and prediction models of behavioral traits of individuals, such as credit score, insurance risk, health status, personal preferences and orientations, on the basis of personal data disseminated in the digital environment by users, with or sometimes without their awareness. Such automated decision-making systems are often “black boxes”, mapping a user’s features into a class label or a ranking value without exposing the reasons.
This is worrying not only for the lack of transparency, which undermines the trust of stakeholders, but also for possible social biases and prejudices hidden in the training data and learned by the algorithms, which may lead to discriminatory decisions or unfair actions. Gartner says that, by 2018, half of business ethics violations will occur through improper use of Big Data analytics.
Often, the achievements of data science are the result of re-interpreting available data for analysis goals that differ from the original reasons motivating data collection. Examples include mobile phone call records, originally collected by telecom operators for billing and operations, used for accurate and timely demography and human mobility analysis at country or regional scale. This re-purposing of data clearly shows the importance of legal compliance and of data ethics technologies and safeguards to protect privacy and anonymity, secure data, engage users, avoid discrimination and misuse, and account for transparency and fair use, so that the opportunities of data science can be seized while the associated risks are controlled. This is the focus of my lecture.