Much has been said of the wide diversity of today’s data. In part, that refers to the different data volumes and structures (structured, semi-structured, or unstructured data) that organizations leverage for analytics projects. But it also refers to the variety of data sources, or the many locations where data can originate (or be stored) before being used for a task at hand.
What is a data source?
A data source is the digital or physical location where data originates or is stored. That location shapes both how the data is represented (e.g. as a data table or a data object) and its connectivity properties.
In many cases, a data source refers to the first location where the data originated, though the movement and ingestion of data can change its source. When that happens, the new data source has not only a new location, but new connection characteristics.
Additionally, a data source may be stored on the single computer where it will be used (as is the case for desktop-based flat files or applications), or in an offsite location where it can be accessed by many different computers. For the latter case, an eCommerce giant like Amazon is a great example: as consumers search the Amazon website for various products, the site's connection to its backend database dictates what is actually available or out of stock, updating with each purchase made.
What are the data source types?
Despite the many different data sources that exist (and continue to be created), data sources can be broadly categorized into machine data sources and file data sources.
Machine data sources are unique to their machine, which means they cannot be easily shared. Users can connect to the data using the information found within the machine data source, and query the data using the Data Source Name (DSN). The DSN is a pointer to actual data in its respective databases or applications, regardless of whether it exists locally, on a remote server, is in a single, physical location, or is virtualized.
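To make the idea of a DSN as a pointer concrete, here is a minimal sketch in Python. It is purely illustrative: the registry, the DSN name, and all connection details are hypothetical, and a real DSN would be managed by the operating system's driver manager (e.g. ODBC) rather than an in-memory dictionary.

```python
# Illustrative sketch: a DSN acts as a named pointer that resolves to the
# actual connection details, so applications only need to know the name.
# All names and values below are hypothetical.

DSN_REGISTRY = {
    "sales_db": {
        "driver": "PostgreSQL",
        "host": "db.internal.example.com",
        "port": 5432,
        "database": "sales",
    },
}

def resolve_dsn(name: str) -> dict:
    """Look up the connection details registered under a DSN."""
    try:
        return DSN_REGISTRY[name]
    except KeyError:
        raise LookupError(f"No data source registered under DSN {name!r}")

# An application asks for the data source by name, not by location:
details = resolve_dsn("sales_db")
print(details["host"])  # db.internal.example.com
```

The point of the indirection is that the application never hard-codes where the data lives; if the database moves, only the registry entry changes.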
File data sources, on the other hand, are not unique to their machine and therefore are transferable across devices. Unlike machine data sources, file data sources do not have a DSN since they are not registered to individual applications or systems.
Under those two umbrella categorizations, some of the most common data sources include:
Databases are the most common data source today. A database is defined not only by the data it contains, but also by the database management system used to create and process that data. Databases can be hosted on-premises or in the cloud. Some of the most common cloud databases include Snowflake, Amazon Redshift, and Google BigQuery, while traditional on-premises databases have included Oracle and SAS.
Flat files are named as such because they hold data in a flat, non-hierarchical structure: each line holds one record, so the file is easily readable as rows and columns, and it maintains a uniform format that follows rules on data input. One of the most common flat files is the comma-separated values (CSV) file, named as such because each value in a record is separated by a comma.
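The row-and-column structure of a flat file can be seen in a few lines of Python using the standard library's csv module. The file contents here are made up for illustration.

```python
import csv
import io

# A minimal CSV flat file: one record per line, values separated by commas,
# with the first line naming the columns.
raw = "id,product,price\n1,keyboard,49.99\n2,mouse,19.99\n"

# csv.DictReader maps each row onto the column names from the header line.
rows = list(csv.DictReader(io.StringIO(raw)))

print(rows[0]["product"])  # keyboard
print(len(rows))           # 2
```

Because every line follows the same format, any tool that understands the delimiter can read the file, which is a big part of why CSV remains so widely used for data exchange.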
A web service is hosted by a server, which receives requests via web service calls triggered by user interactions. The requests are made through Remote Procedure Calls (RPCs), which allow processes on different workstations to communicate. There are two main types of web services: SOAP web services and REST web services; the former is an XML-based protocol, while the latter accepts different data formats, such as plain text, HTML, JSON, and XML.
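The format difference between the two styles is easiest to see side by side. The sketch below, using only Python's standard library, renders the same record as the JSON a REST service might return and as a heavily simplified SOAP-style XML envelope; the field names and the envelope structure are illustrative, not a real SOAP schema.

```python
import json
import xml.etree.ElementTree as ET

record = {"user": "ada", "status": "active"}

# A REST service commonly returns the payload as JSON...
json_body = json.dumps(record)

# ...while a SOAP service wraps the payload in an XML envelope
# (greatly simplified here; real SOAP uses specific namespaces).
envelope = ET.Element("Envelope")
body = ET.SubElement(envelope, "Body")
for key, value in record.items():
    ET.SubElement(body, key).text = value
xml_body = ET.tostring(envelope, encoding="unicode")

print(json_body)
print(xml_body)
```

Both bodies carry identical information; the choice between them is a matter of protocol, tooling, and convention rather than expressiveness.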
Self-service applications are abundant in today’s business world, and their data underpins many integral analytics projects. Self-service applications include both desktop-based applications, such as Tableau, and cloud-based (SaaS) applications, such as Salesforce or Marketo. A data source from a self-service application can refer to any such application, regardless of brand.
Why are data sources important?
A data source is important because it defines exactly how a physical connection can be established. This critical information includes, for example, the location of the database and the timeout duration, and may also include credentials and sign-on information.
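Those pieces of information are often bundled into a single connection string. The sketch below assembles an ODBC-style string from the fields mentioned above; the field names and values are illustrative assumptions, and real drivers each document their own accepted keywords.

```python
# Sketch: a data source definition bundles the details needed to establish
# a physical connection. Field names and values here are illustrative.

def build_connection_string(host, database, timeout_seconds, user, password):
    """Assemble an ODBC-style key=value connection string."""
    parts = {
        "Server": host,
        "Database": database,
        "Connection Timeout": str(timeout_seconds),
        "Uid": user,
        "Pwd": password,
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())

conn_str = build_connection_string(
    host="db.example.com", database="analytics",
    timeout_seconds=30, user="report_user", password="secret",
)
print(conn_str)
```

In practice, credentials would come from a secrets manager rather than being written inline; the point is only that the data source definition gathers everything a client needs in one place.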
In this way, a data source is like a key that unlocks the accessibility of that data for use in analytics, web services, transfer to other databases, etc. Well-defined, accessible data sources are critical to ensuring that data can be moved to exactly where it needs to be.
How are connections established between data sources?
There are a variety of ways to establish connections between data sources that allow for the transfer of data. These include File Transfer Protocol (FTP), HyperText Transfer Protocol (HTTP), or any of the many Application Programming Interfaces (APIs) provided by specific applications.
HTTP and FTP are often compared to one another because they are both file transfer protocols used to transfer data between a client and a server. However, there are several key differences between the two. Principally, an HTTP connection is used to access websites, while an FTP connection is used to transfer files from one host to another. Additionally, HTTP establishes only a data connection, while FTP establishes both a data connection and a control connection. In general, HTTP excels at transferring smaller files like webpages, whereas FTP can quickly transfer large files.
APIs allow applications to communicate with one another. A commonly used metaphor is that of a waiter: much like a waiter delivering your order to the kitchen and carrying your meal back to you, an API carries a request to a server, where the necessary actions are performed, and then returns the response to your desktop or mobile device. APIs typically adhere to HTTP and REST standards, which are broadly accessible and understood.
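The waiter metaphor can be sketched as a tiny request dispatcher. Everything here is a toy: the route table, the endpoint path, and the handler are hypothetical stand-ins for what a real web framework and backing database would do.

```python
# Toy sketch of the request/response cycle behind an API call: the client
# sends a request, a handler on the server does the work, and the response
# travels back. The endpoint and data are hypothetical.

def get_order_status(order_id: str) -> dict:
    # In a real service this would query a database.
    return {"order_id": order_id, "status": "shipped"}

ROUTES = {("GET", "/orders"): get_order_status}

def handle_request(method: str, path: str, **params) -> dict:
    """Dispatch a request to the matching handler, like a waiter carrying
    an order to the kitchen and the meal back to the table."""
    handler = ROUTES.get((method, path))
    if handler is None:
        return {"error": "not found", "code": 404}
    return handler(**params)

response = handle_request("GET", "/orders", order_id="A-123")
print(response["status"])  # shipped
```

The client never sees the handler or the database; it only knows the method, the path, and the shape of the response, which is exactly the decoupling an API provides.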
What are point-to-point integrations?
A point-to-point integration refers to a singular connection built between one application and another, using either an API or FTP. These straightforward integrations make sense for organizations when the integrations are relatively few and shallow in scope, but can quickly snowball out of control if overused.
Even though connecting various data sources is important, an abundance of point-to-point integrations can lead to a complicated web of connections, which may be built using various coding languages. Ultimately, that leaves an organization with high technical debt and prevents efficient scale.
What is a data hub?
There are several solutions to the problem of point-to-point integrations, one of the most popular of which is the data hub.
A data hub operates on a hub-and-spoke model; that is, all sorts of data sources within the organization (data lakes, data warehouses, SaaS applications, etc.) are connected to a central “hub.” In this way, a data hub embraces the diversity of data sources that an organization will inevitably accumulate, but grounds them with a singular platform.
Think of a data hub as a gateway through which data moves, either virtually (through search) or temporarily and physically as it passes from one application to the next. In that way, a data hub acts as both a map and a transport system for the wide-ranging data sources throughout the organization. It allows users to easily search for, access, and process data, no matter the type of data or whether it lives in the cloud or on-premises.
The most obvious benefit of a data hub is its ability to break down data silos by linking data through a singular data hub. Gone are the days of searching for data through various sources; instead, a data hub gives users the data they need at their fingertips.
What is the role of data engineering in managing data sources?
Data engineers, responsible for the development, integrity, and maintenance of an organization’s data infrastructure, are some of the main players involved in maintaining data sources and ensuring efficient connectivity between them (which, as we discussed, may involve adopting a modern strategy such as a data hub).
However, data engineers must also be aware of who in the organization needs access to which data source and for what purpose. For example, data scientists may need training data for their machine learning models; a business analyst may need to collect data from both Marketo and Salesforce to better predict sales forecasts. In either case, data engineers need to ensure that the data is readily accessible in order to prevent long delays.
Traditionally, data engineers would work overtime to source, prepare, and join data from various sources in order to best serve the business. The problem is, the wait got too long. The business began to stall. And data engineers became overwhelmed with the amount of work that they were required to take on.
Now, data engineering work is best thought of as organization-wide work. Of course, data engineers still handle the highly-technical work, such as migrating data from one database to another, but with the help of modern data engineering platforms, business analysts are now able to set up their own pipelines connecting to data sources, cleanse and enrich that data for use, and even set up schedules for repeat use.
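The kind of pipeline an analyst might set up, connecting to a source, cleansing, and enriching, can be sketched as a sequence of steps applied in order. The step functions, field names, and sample data below are all hypothetical; real platforms express these steps visually rather than in code.

```python
# Illustrative sketch of a self-service data pipeline: an ordered list of
# steps, each transforming the records. Step logic and fields are made up.

def cleanse(record: dict) -> dict:
    """Trim whitespace and normalize casing in string fields."""
    return {k: v.strip().lower() if isinstance(v, str) else v
            for k, v in record.items()}

def enrich(record: dict) -> dict:
    """Derive a new field from existing ones."""
    record["full_name"] = f"{record['first']} {record['last']}"
    return record

def run_pipeline(records, steps):
    """Apply each step to every record, in order."""
    for step in steps:
        records = [step(r) for r in records]
    return records

raw = [{"first": "  Ada ", "last": "Lovelace"}]
out = run_pipeline(raw, [cleanse, enrich])
print(out[0]["full_name"])  # ada lovelace
```

Scheduling, in this picture, just means running `run_pipeline` automatically whenever new records arrive; the pipeline definition itself is reusable.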
The switch in mentality across many organizations has been a game-changer, and opened up new worlds in terms of the accessibility of data sources. Now, with the help of a data engineering platform, data engineers are provisioning more than they are preparing, and data analysts are getting their data faster than ever.
Get started with a data engineering platform
The Trifacta platform is routinely recognized as the leader in data engineering by analysts and end-users alike. Trifacta automatically presents visual representations of your data based on its content, choosing the most compelling visual profile. This allows an immediate understanding of the data at a glance, with no waiting around for IT to fulfill requirements or searching and filtering spreadsheets.
Since Trifacta is powered by machine learning, the platform is smart enough to recognize what the user is trying to do. If they want to standardize all versions or misspellings of California, for example, the tool will automatically suggest things like “CA” or “Calif.”
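Conceptually, that kind of standardization maps spelling variants of the same entity onto one canonical value. The tiny sketch below is not Trifacta's machine-learning approach, just a hand-built lookup table that illustrates the idea; the variants listed are assumptions.

```python
# Illustrative sketch of value standardization: map spelling variants of the
# same entity to one canonical value. The variant list is hypothetical; a
# real tool would learn or suggest these mappings rather than hard-code them.

CANONICAL = {
    "ca": "California",
    "calif": "California",
    "cali": "California",
    "california": "California",
}

def standardize(value: str) -> str:
    """Return the canonical form of a value, or the value itself if unknown."""
    key = value.strip().lower().rstrip(".")
    return CANONICAL.get(key, value)

print(standardize("CA"))      # California
print(standardize("Calif."))  # California
```

The hard part in practice is discovering the variants in the first place, which is where the platform's suggestions save the user from building tables like this by hand.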
And in order to accelerate analytics, Trifacta allows users to build scheduled data pipelines that feed standardized data into the analytics application of their choice as soon as new data is ready. The result is efficient and comprehensive data standardization and data preparation for all types of data and analytics projects.