Big data, a general term for the massive amount of digital data being collected from all sorts of sources, is too large, raw, or unstructured for analysis through conventional relational database techniques. Almost 90% of the world's data today was generated during the past two years, with 2.5 quintillion bytes of data added each day.7 Moreover, approximately 90% of it is unstructured. Still, the overwhelming amount of big data from the Web and the cloud offers new opportunities for discovery, value creation, and rich business intelligence for decision support in any organization. Big data also means new challenges involving complexity, security, and risks to privacy, as well as a need for new technology and human skills. Big data is redefining the landscape of data management, from extract, transform, and load, or ETL, processes to new technologies (such as Hadoop) for cleansing and organizing unstructured data in big-data applications.
Although the business sector is leading big-data-application development, the public sector has begun to derive insight to help support decision making in real time from fast-growing in-motion data from multiple sources, including the Web, biological and industrial sensors, video, email, and social communications.3 Many white papers, journal articles, and business reports have proposed ways governments can use big data to help them serve their citizens and overcome national challenges (such as rising health care costs, job creation, natural disasters, and terrorism).9 There is also some skepticism as to whether it can actually improve government operations, as governments must develop new capabilities and adopt new technologies (such as Hadoop and NoSQL) to transform it into information through data organization and analytics.4
Here, we ask whether governments are able to implement some of today's big-data applications associated with the business sector. We first compare the two sectors in terms of goals, missions, decision-making processes, decision actors, organizational structure, and strategies (see the table here), then turn to several current applications in technologically advanced countries, including Australia, Japan, Singapore, South Korea, the U.K., and the U.S. Also examined are some business-sector big-data applications and initiatives that can be implemented by governments. Finally, we suggest ways for governments of follower countries to pursue their own future big-data strategies and implementations.
Business and Government Compared
Although the primary missions of businesses and governments are not in conflict, they do reflect different goals and values. In business, the main goal is to earn profits by providing goods and services, developing/sustaining a competitive edge, and satisfying customers and other stakeholders by providing value. In government, the main goal is to maintain domestic tranquility, achieve sustainable development, secure citizens' basic rights, and promote the general welfare and economic growth.
Most businesses aim to make short-term decisions with a limited number of actors in a competitive market environment. Decision making in government usually takes much longer and is conducted through consultation and mutual consent of a large number of diverse actors, including officials, interest groups, and ordinary citizens. Many well-defined steps are therefore required to reduce risk and increase the efficiency and effectiveness of government decision making.18 It follows that big-data applications likewise differ between public and private sectors.
Dataset Attributes Compared
The big-data environment reflects the evolution of IT-enabled decision-support systems: data processing in the 1960s, information applications in the 1970s and 1980s, decision-support models in the 1990s, data warehousing and mining in the 2000s, and big data today. The big-data era is at an early stage, as most related technology and analytics applications were first introduced only around 2010.4
The attributes and challenges of big data have been described in terms of "three Vs": volume, velocity, and variety (see Figure 1). Volume is big data's primary attribute, as terabytes or even petabytes of it are generated by organizations in the course of doing business while also complying with government regulations. Velocity is the speed at which data is generated, delivered, and processed; big data is so large and so difficult to manage and extract value from that conventional information technologies are not effective for its management.13 Variety means data comes in all forms: structured (as in traditional relational databases queried through SQL); semi-structured (with tags and markers but without the formal structure of a database); and unstructured (unorganized data with no business intelligence behind it).
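The three forms of variety can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the permit record, its field names, and the describe helper are hypothetical and do not come from any system discussed here.

```python
import csv
import io
import json

# Structured: fixed columns, as in a relational table (CSV stands in for a table).
structured = "permit_id,city,issued\n1001,Syracuse,2011-05-02\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: tagged fields, but records need not share one schema.
semi_structured = '{"permit_id": 1001, "notes": {"inspector": "A. Lee"}}'
doc = json.loads(semi_structured)

# Unstructured: free text with no tags; any fields would have to be inferred.
unstructured = "Inspector visited the vacant property and found broken windows."


def describe(value):
    """Label a parsed value by how much structure it carries (illustrative only)."""
    if isinstance(value, list) and value and all(isinstance(r, dict) for r in value):
        return "structured"
    if isinstance(value, dict):
        return "semi-structured"
    return "unstructured"


kinds = [describe(rows), describe(doc), describe(unstructured)]
print(kinds)
```

The same fact (a permit for a property) appears three ways; only the first can be queried by column name without further parsing or inference.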
The concept of big data has evolved to imply not only a vast amount of the data but also the process through which organizations derive value from it. Big data, synonymous today with business intelligence, business analytics, and data mining, has shifted business intelligence from reporting and decision support to prediction and next-move decision making.13 New data-management systems aim to meet the challenges of big data; for example, Hadoop, an open-source platform, is the most widely applied technology for managing storage and access, overhead associated with large datasets, and high-speed parallel processing.22 However, Hadoop is a challenge for many businesses, especially small- and mid-size ones, as applications require expertise and experience not widely available and may thus need outsourced help. Finding the right talent to analyze big data is perhaps the greatest challenge for business organizations, as required skills are neither simple nor solely technology-oriented. Searching for and finding competent data scientists (in data mining, visualization, analysis, manipulation, and discovery) is difficult and expensive for most organizations.
Other commercial big-data technologies include the Cassandra database, a Dynamo-based tool that can store two million columns in a single row, allowing inclusion of a large amount of data without prior knowledge of how it is formatted.13 Another challenge for businesses is deciding which technology is best for them: open source technology (such as Hadoop) or commercial implementations (such as Cassandra, Cloudera, Hortonworks, and MapR).
Governments deal not only with general issues of big-data integration from multiple sources and in different formats and cost but also with some special challenges. The biggest is collecting data; governments have difficulty, as the data not only comes from multiple channels (such as social networks, the Web, and crowdsourcing) but from different sources (such as countries, institutions, agencies, and departments). Sharing data and information between countries is a special challenge, as shown by the terrorist bombing attack on the Boston Marathon in April 2013. National governments must be prepared and willing to share data and build systems for crime prevention and fighting. As reported in the public media, the Boston Marathon tragedy might have been prevented if the Russian secret services had shared critical information about the terror suspects with U.S. intelligence agencies. In addition, sharing information across national boundaries involves language translation and interpretation of text semantics (meaning of content) and sentiment (emotional content) so the true meaning is not lost. Dealing with language requires sophisticated and costly tools.
Data sharing within a country among different government departments and agencies is another challenge. The most important difference between government data and business data is in scale and scope, both of which have been growing steadily for years. Governments, both local and national, accumulate an enormous amount of data in the process of implementing laws and regulations and performing public services and financial transactions; the attributes, values, and challenges of this data differ from those of data in the business sector.
Government big-data issues can be categorized as silo, security, and variety. Each government agency or department typically has its own warehouse, or silo, of confidential or public information, with agencies often reluctant to share what they might consider proprietary data. The "tower of Babel" in which each system keeps its data isolated from other systems complicates trying to integrate complementary data among government agencies and departments. Communication failure is sometimes the issue for data integration;19 for example, in the U.K., a coalition of police departments and hospitals intended to share data on violent crimes has been reported as a failure due to a lack of communication among participating organizations.19 Another challenge for sharing and organizing government data involves finding a cohesive format that would allow for analytics in the legacy systems of different agencies. Even though most government data is structured, rather than semi-structured or unstructured, collecting it from multiple channels and sources is a further challenge. Then there is the lack of standardized solutions, software, and cross-agency solutions for extracting useful information from discrete datasets in multiple government agencies and insufficient funding due to government austerity measures to develop and implement these solutions.
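The problem of finding a cohesive format across agency silos can be sketched in a few lines. The Python below is a minimal illustration, not a description of any real system: the two silos, their field names, the date conventions, and the shared schema are all assumptions made up for the example.

```python
from datetime import date, datetime

# Hypothetical records from two agency silos describing the same kind of event.
police_silo = [{"incident_no": "P-17", "occurred": "03/02/2013", "category": "ASSAULT"}]
hospital_silo = [{"admission_id": 881, "admitted_on": "2013-02-03", "injury_cause": "assault"}]


def from_police(rec):
    """Map a police record to the shared schema (day/month/year dates assumed)."""
    return {
        "source": "police",
        "source_id": str(rec["incident_no"]),
        "event_date": datetime.strptime(rec["occurred"], "%d/%m/%Y").date(),
        "event_type": rec["category"].lower(),
    }


def from_hospital(rec):
    """Map a hospital record to the shared schema (ISO 8601 dates assumed)."""
    return {
        "source": "hospital",
        "source_id": str(rec["admission_id"]),
        "event_date": date.fromisoformat(rec["admitted_on"]),
        "event_type": rec["injury_cause"].lower(),
    }


# Once both silos use one schema, records can be pooled and analyzed together.
integrated = [from_police(r) for r in police_silo] + [from_hospital(r) for r in hospital_silo]


def same_day_events(records, event_type):
    """Return dates on which more than one source reported the given event type."""
    by_date = {}
    for rec in records:
        if rec["event_type"] == event_type:
            by_date.setdefault(rec["event_date"], set()).add(rec["source"])
    return {d: srcs for d, srcs in by_date.items() if len(srcs) > 1}


matches = same_day_events(integrated, "assault")
print(matches)
```

Each silo needs its own adapter because identifiers, date formats, and category vocabularies differ; the cross-agency analysis only becomes possible after that mapping is agreed on, which is an organizational task as much as a technical one.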
Governments must also address related legality, security, and compliance requirements when using data. There is a fine line between collecting and using big data for predictive analysis and ensuring citizens' rights of privacy. In the U.S., the USA PATRIOT Act allows legal monitoring and sometimes spying on citizens; the Electronic Communications Privacy Act allows email access without warrant; the proposed Cyber Intelligence Sharing and Protection Act (CISPA) (not enacted as of February 2014) raises concern, as it might position the U.S. government toward the ultimate big-data end game: access to all data for all entities in the U.S.14 Even though the intent is to prevent attacks from both domestic and foreign sources against networks and systems, CISPA raises concerns of misconstrued profiling and/or inappropriate use of information.
Data security is the primary attribute of government big data, as collecting, storing, and using it requires special care. However, most big-data technologies today, including Cassandra and Hadoop, lack sufficient security tools, making security another challenge for governments.
Compliance in highly regulated industries (such as financial services and health care) is yet another obstacle for gathering data for big-data government projects; for example, U.S. health-care regulations must be addressed when extracting knowledge from health-related big data. The two U.S. laws posing perhaps the greatest obstacle to big-data analytics in health care are the Health Insurance Portability and Accountability Act (HIPAA) and the Health Information Technology for Economic and Clinical Health Act (HITECH). HIPAA protects the privacy of individually identifiable health information, provides national standards for securing electronic data and patient records, and sets rules for protecting patient identity and information in analyzing patient safety events. HITECH expanded HIPAA in 2009 to protect the health records and electronic use of health information by various institutions. Together, these laws limit the amount and types of health records used for big-data analytics in health care. Because big data by definition involves large-scale data, these laws complicate collecting data and performing analysis on such a scale. As of February 2014, health-care information in the U.S. intended for big-data analytics is collected only from volunteers willing to share their own.
Businesses use big data to address customer needs and behavior, develop unique core competencies, and create innovative products and services. Governments use it, along with predictive analytics, to enhance transparency, increase citizen engagement in public affairs, prevent fraud and crime, improve national security, and support the well-being of people through better education and health care.
Choosing and implementing technology to extract value and finding skilled personnel are constant challenges for businesses and governments alike. However, the challenges for governments are more acute, as they must look to break down departmental silos for data integration, implement regulations for security and compliance, and establish sufficient control towers (such as the Federal Data Center in the U.S.).
Comparing the big-data applications of leading e-government countries can reveal where current and future applications are focused and serve as a guide for follower countries looking to initiate their own big-data applications:
U.S. To manage real-time analysis of high-volume streaming data, the U.S. government and IBM collaborated in 2002 to develop a massively scalable, clustered infrastructure.1 The resulting platforms, IBM InfoSphere Streams and IBM Big Data, are widely used by government agencies and business organizations for discovery and visualization of information from thousands of real-time sources, encompassing application development and systems management built on Hadoop, stream computing, and data warehousing.
In 2009, the U.S. government launched Data.gov as a step toward government transparency and accountability. It is a warehouse containing 420,894 datasets (as of August 2012) covering transportation, economy, health care, education, and human services and the data source for multiple applications: 1,279 by governments, 236 by citizens, and 103 mobile-oriented.21 In 2010, the President's Council of Advisors on Science and Technology (the primary mechanism the federal government uses to coordinate its unclassified networking and information technology research investments) spelled out a big-data strategy in its report Designing a Digital Future: Federally Funded Research and Development in Networking and Information Technology.15 In 2012, the Obama Administration announced the Big Data Research and Development Initiative,12 a $200 million investment involving multiple federal departments and agencies, including the White House Office of Science and Technology Policy, National Science Foundation (NSF), National Institutes of Health (NIH), Department of Defense (DoD), Defense Advanced Research Projects Agency, Department of Energy, Health and Human Services, and U.S. Geological Survey. The main objectives were to advance state-of-the-art core big-data technologies; accelerate discovery in science and engineering; strengthen national security; transform teaching and learning; and expand the work force needed to develop and use big-data technologies.11
As of February 2014, NIH has accumulated hundreds of terabytes of data for human genetic variations on Amazon Web Services, enabling researchers to access and analyze huge amounts of data without having to develop their own supercomputing capability. In 2012, NSF joined NIH to launch the Core Techniques and Technologies for Advancing Big Data Science & Engineering program, aiming to advance core scientific and technological means of managing, analyzing, visualizing, and extracting useful information from large, diverse, distributed, heterogeneous datasets. Several federal agencies have launched their own big-data programs. The Internal Revenue Service has been integrating big-data analytic capabilities into its Return Review Program (RRP), which by analyzing massive amounts of data allows it to detect, prevent, and resolve tax-evasion and fraud cases.10 DoD is also spending millions of dollars on big-data-related projects; one goal is developing autonomous robotic systems (learning machines) by harnessing big data.
Local governments have also initiated big-data projects; for example, in 2011, Syracuse, NY, in collaboration with IBM, launched a Smarter City project to use big data to help predict and prevent vacant residential properties.7 Michigan's Department of Information Technology constructed a data warehouse to provide a single source of information about the citizens of Michigan to multiple government agencies and organizations to help provide better services.
European Union. In 2010, the European Commission initiated its "Digital Agenda for Europe" to address how to deliver sustainable economic and social benefits to EU citizens from a single digital market through fast and ultra-fast interoperable Internet applications.5 In 2012, in its "Digital Agenda for Europe and Challenges for 2012," the European Commission made big-data strategy part of the effort, emphasizing the economic potential of public data locked in filing cabinets and data centers of public agencies; ensuring data protection and increasing individuals' trust; developing the Internet of things, or communication between devices without direct human intervention; and assuring Internet security and secure treatment of data and online exchanges.5
U.K. The U.K. government was one of the earliest in the EU to implement big-data programs, establishing the U.K. Horizon Scanning Centre (HSC) in 2004 to improve the government's ability to deal with cross-departmental and multi-disciplinary challenges.17 In 2011, the HSC's Foresight International Dimensions of Climate Change effort addressed climate change and its effect on the availability of food and water, regional tensions, and international stability and security by performing an in-depth analysis on multiple data channels. Another U.K. government initiative was the creation of the public website http://data.gov.uk in 2009, opening to the public more than 1,000 existing datasets from seven government departments initially, later increased to 8,633 datasets.
The Netherlands, Switzerland, the U.K., and 17 other countries launched a collaborative project with IBM called DOME to develop a supercomputing system able to handle a dataset in excess of one exabyte per day derived from the Square Kilometer Array (SKA) radio telescope.3 The project aims to investigate emerging technologies for exascale computing, data transport and storage, and streaming analytics required to read, store, and analyze all the raw data collected daily. This big-data project, headquartered at Manchester's Jodrell Bank Observatory in England, aims to address a range of scientific questions about the observable universe.
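To give a sense of what "one exabyte per day" demands of such a system, the figure can be converted into a sustained data rate. The Python below is a back-of-the-envelope sketch only; it assumes decimal units (1 exabyte = 10^18 bytes) and a hypothetical 10-terabyte storage drive, neither of which is specified in the project description.

```python
EXABYTE = 10**18            # decimal exabyte, in bytes (assumed unit convention)
TERABYTE = 10**12           # decimal terabyte, in bytes
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_bytes_per_second(bytes_per_day):
    """Average rate needed to keep up with a steady daily data volume."""
    return bytes_per_day / SECONDS_PER_DAY


rate = sustained_rate_bytes_per_second(1 * EXABYTE)
rate_tb_per_second = rate / TERABYTE

# Hypothetical 10 TB drives, used only to make the daily storage volume tangible.
DRIVE_CAPACITY = 10 * TERABYTE
drives_per_day = EXABYTE // DRIVE_CAPACITY

print(f"{rate_tb_per_second:.2f} TB/s sustained")
print(f"{drives_per_day} drives of 10 TB filled per day")
```

Under these assumptions the raw stream averages roughly 11.6 terabytes every second, which is why the project treats data transport, storage, and streaming analytics as research problems rather than procurement decisions.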
Asia. The United Nations' 2012 E-Government Survey gave high marks to several Asian countries, notably South Korea, Singapore, and Japan.20 Australia also ranked highly. These leaders have launched diverse initiatives on big data and deployed numerous projects:
South Korea. The Big Data Initiative, launched in 2011 by the President's Council on National ICT Strategies (the highest-level coordinating body for government ICT policy),16 aims to converge knowledge and administrative analytics through big data. Its Big Data Task Force was created to play the lead role in building the necessary infrastructure. The Big Data Initiative aims to establish pan-government big-data-network-and-analysis systems; promote data convergence between the government and the private sectors; build a public data-diagnosis system; produce and train talented professionals; guarantee privacy and security of personal information and improve relevant laws; develop big-data infrastructure technologies; and develop big-data management and analytical technologies.
Many South Korean ministries and agencies have proposed related action plans; for example, the Ministry of Health and Welfare initiated the Social Welfare Integrated Management Network to analyze 385 different types of public data from 35 agencies, comprehensively managing welfare benefits and services provided by the central government, as well as by local governments, to deserving recipients. The Ministry of Food, Agriculture, Forestry, and Fisheries and the Ministry of Public Administration and Security, or MOPAS, plan to launch the Preventing Foot and Mouth Disease Syndrome system, harnessing big data related to animal disease overseas, customs/immigration records, breeding-farm surveys, livestock migration, and workers in the livestock industry. Another system MOPAS is planning is the Preventing Disasters System to forecast disasters based on past damage records and automatic and real-time forecasts of weather and/or seismic conditions. Moreover, the Korean Bioinformation Center plans to develop and operate the National DNA Management System to integrate massive DNA and medical patient information to provide customized diagnosis and medical treatment to individuals.
Singapore. In 2004, to address national security, infectious diseases, and other national concerns, the Singapore government launched the Risk Assessment and Horizon Scanning (RAHS) program within the National Security Coordination Centre.6 Collecting and analyzing large-scale datasets, it proactively manages national threats, including terrorist attacks, infectious diseases, and financial crises. The RAHS Experimentation Center (REC), which opened in 2007, focuses on new technological tools to support policy making for RAHS and enhance and maintain RAHS through systematic upgrades of the big-data infrastructure. A notable REC application is exploration of possible scenarios involving importation of avian influenza into Singapore and assessment of the threat of outbreaks occurring throughout southeast Asia.
Aiming to create value through big-data research, analysis, and applications, the Singapore government also launched the portal site http://data.gov.sg/ to provide access to publicly available government data gathered from more than 5,000 datasets from 50 ministries and agencies.
Japan. The Japanese government has initiated several programs to use accumulated large-scale data. From 2005 to 2011, the Ministry of Education, Culture, Sports, Science, and Technology (MEXT), in association with universities and research institutes, operated the New IT Infrastructure for the Information-explosion Era project (the so-called Info-plosion). Since 2011, the government's top priority has been to address the consequences of the Fukushima earthquake, tsunami, and nuclear-power-plant disaster and the reconstruction and rehabilitation of affected areas, as well as relief of related social and economic consequences. MEXT has been collaborating with the country's National Science Foundation to enhance research and leverage big-data technologies for preventing, mitigating, and managing natural disasters.
The Council of Information and Communications and the ICT Strategy Committee, both branches of the Ministry of Internal Affairs and Communications, designated "big data applications" as a crucial mission for 2020 Japan. A big-data expert group was formed to search for technical solutions and manage institutional issues in deploying big data.
Australia. The Australian Government Information Management Office (AGIMO) provides public access to government data through the Government 2.0 program, which runs the http://data.gov.au/ website to support repository and search tools for government big data. The government expects to save time and resources by using automated tools that let users search, analyze, and reuse enormous amounts of data.
Implementations and Initiatives Compared
Reviewing big-data projects and initiatives in leading countries (see Figure 2) identifies three notable big-data trends. First, most projects operated or implemented today can only marginally be classified as big-data applications, as outlined in the figure's upper-left quadrant. The majority of government data projects in these countries appear to share structured databases of stored data; they do not use real-time, in-motion, and unstructured or semi-structured data. Second, large and complex datasets are becoming the norm for public-sector organizations. Governments expect big data to enhance their ability to serve their citizens and address major national challenges involving the economy, health care, job creation, natural disasters, and terrorism. However, the majority of big-data applications are in the citizen (participation in public affairs) and business sectors, rather than in the government sector. And third, most big-data initiatives in the government sector, especially in the U.S. (such as the National Science Foundation's and National Institutes of Health's Big Data program), are just getting under way or being planned for future implementation. This means big-data application projects in the government sector are still at an early stage of development, with only a handful of projects in operation (such as the U.S.'s RRP, Singapore's RAHS, and the U.K.'s HSC).
Elected officials, administrators, and citizens all seem to recognize that being able to manage and create value from large streams of data from different sources and in many forms (structured/stored, semi-structured/tagged, and unstructured/in-motion) represents a new form of competitive differentiation. Most governments operating or planning big-data projects need to take a step-by-step approach for setting the right goals and realistic expectations. Success depends on their ability to integrate and analyze information (through new technologies like Hadoop), develop supporting systems (such as big-data control towers), and support decision making through analytics.4
Here, we have explored the challenges governments face and the opportunities they find in big data. Such insights can also help follower countries in trying to deploy their own big-data systems. Moreover, follower countries may be able to leapfrog the leaders' applications through careful analysis of their successes and failures, as well as exploit future opportunities in mobile services.
Follower countries should therefore be cognizant of several insights regarding big-data applications in the public sector:
National priorities. All big-data projects in leading countries' governments share similar goals (such as easy and equal access to public services, better citizen participation in public affairs, and transparency). The main concerns with big-data applications converge on security, speed, interoperability, analytics capabilities, and lack of competent professionals. However, each government has its own priorities, opportunities, and threats based on its unique environment (such as terrorism and health care in the U.S., natural disasters in Japan, and national defense in South Korea).1
Analytics agency. For data that cuts across departmental boundaries, a top-down approach is needed to manage and integrate big data. Governments should look to establish big-data control towers to integrate accumulated datasets, structured or unstructured, from departmental silos. Moreover, governments need to establish an advanced analytics agency responsible for developing strategies for how big data can be managed through new technology platforms and analytics and how to secure skilled professional staff.
Real-time analysis. Governments need to manage real-time analysis of in-motion big data while protecting individual citizens' privacy and security. They should also explore new technological playgrounds (such as cloud computing, advanced analytics, security technologies, and legislation).
Global collaboration. Much government data is global in nature and can be used to prevent and solve global issues; for example, the Group on Earth Observations (GEO) is a collaborative international intergovernmental effort to integrate and share Earth-observation data. Its Global Earth Observation System of Systems (GEOSS), a global public infrastructure that generates comprehensive, near-real-time environmental data, intends to provide information and analyses for a wide range of global users and decision makers. Governments also need to share data related to security threats, fraud, and illegal activities. Such big data needs not only translation technologies but an international collaborative effort to share and integrate data.
ICT big brothers. Finally, governments should collaborate with "ICT big brothers" like EMC, IBM, and SAS; for example, Amazon Web Services hosts many public datasets, including Japanese and U.S. census data, and many genomic and medical databases.
1. Accenture. Build It and They Will Come? Chicago, 2012; http://www.accenture.com/SiteCollectionDocuments/PDF/Accenture-Digital-Citizen-FullSurvey.pdf
2. Braham Group Inc. Maximizing the Value Provided By a Big Data Platform. Salt Lake City, UT, June 2012; http://public.dhe.ibm.com/common/ssi/ecm/en/iml14324usen/IML14324USEN.PDF
5. European Commission. A Digital Agenda for Europe. Brussels, Aug. 26, 2010; http://ec.europa.eu/digital-agenda/
7. IBM. IBM's Smarter Cities Challenge: Syracuse. Dec. 2011; http://smartercitieschallenge.org/city_syracuse_ny.html
9. McKinsey Global Institute. Big Data: The Next Frontier for Innovation, Competition, and Productivity. New York, May 2011; http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
10. National Information Society Agency. Evolving World on Big Data: Global Practices. May 2012; http://www.koreainformationsociety.com/2013/11/koreas-national-information-society.html
11. Office of Science and Technology Policy, Executive Office of the President. Fact Sheet: Big Data Across the Federal Government. Washington, D.C., Mar. 29, 2012; http://www.whitehouse.gov/administration/eop/ostp
12. Office of Science and Technology Policy, Executive Office of the President. Obama Administration Unveils 'Big Data' Initiative: Announces $200 Million in New R&D Investments. Washington, D.C., Mar. 29, 2012; http://www.whitehouse.gov/administration/eop/ostp
14. Plant, R. CISPA: Information without representation? Big Data Republic, Apr. 24, 2013; http://www.bigdatarepublic.com/author.asp?section_id=2635&doc_id=262480
15. President's Council of Advisors on Science and Technology. Designing a Digital Future: Federally Funded Research and Development in Networking and Information Technology. Washington, D.C., Dec. 2010; http://www.whitehouse.gov/sites/default/files/microsites/ostp/pcast-nitrd-report-2010.pdf
17. Sherry, S. 33B pounds drive U.K. government big data agenda. Big Data Republic, Nov. 16, 2012; http://www.bigdatarepublic.com/author.asp?section_id=2642&doc_id=254471
19. Stonebraker, M. What does 'big data' mean? BLOG@CACM, Sept. 21, 2012; http://cacm.acm.org/blogs/blog-cacm/155468-what-does-big-data-mean/fulltext
20. United Nations. E-government Survey 2012: E-government for the People, 2012; http://www.un.org/en/development/desa/publications/connecting-governments-to-citizens.html
21. U.S. Government. Data.gov; http://www.data.gov
©2014 ACM 0001-0782/14/03