Tag Archives | data

Depth First Search algorithm in Python (Multiple Examples)

Depth First Search is a popular graph traversal algorithm. In this tutorial, we will understand how it works, along with examples, and how we can implement it in Python.


Introduction

Graphs and trees are among the most important data structures used across Computer Science.
They represent data in the form of nodes, which are connected to other nodes through ‘edges’.

As with other data structures, traversing all the elements or searching for a particular element is one of the fundamental operations on a graph or a tree. Depth First Search is one such graph traversal algorithm.

The Depth First Search Algorithm

Depth First Search begins by looking at the root node (an arbitrary node) of a graph. If we are performing a traversal of the entire graph, it visits the first child of a root node, then, in turn, looks at the first child of this node and continues along this branch until it reaches a leaf node.

Next, it backtracks and explores the other children of the parent node in a similar manner. This continues until we have visited all the nodes of the graph and there is no parent node left to explore.

[Animation: the order in which Depth First Search visits the nodes of an example graph. Source: Wikipedia]

However, if we are searching for a particular element, then at each step we compare it with the node we are currently at.
If the element is not found at that node, the same process of exploring each branch and backtracking continues.

This continues until either all the nodes of the graph have been visited, or we have found the element we were looking for.
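The search variant described above can be sketched as follows. This is only an illustration under assumptions: the graph is taken to be a dictionary mapping each node to a list of its children (the representation discussed in the next section), and `dfs_search` is a hypothetical helper name, not part of any library.

```python
def dfs_search(graph, source, target):
    """Return True if target can be reached from source, exploring depth first."""
    visited = set()
    stack = [source]
    while stack:
        node = stack.pop()
        if node == target:
            # the comparison step: stop as soon as the element is found
            return True
        if node in visited:
            continue
        visited.add(node)
        # push children in reverse so the first-listed child is explored first
        stack.extend(reversed(graph.get(node, [])))
    return False


example = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
print(dfs_search(example, "A", "D"))  # True
print(dfs_search(example, "B", "C"))  # False
```

The function returns as soon as the target is found, so in the best case it visits only a single branch of the graph.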

Representing a graph

Before we try to implement the DFS algorithm in Python, it is necessary to first understand how to represent a graph in Python.

There are various versions of a graph. A graph may have directed edges (defining the source and destination) between two nodes, or undirected edges. The edges between nodes may or may not have weights. Depending on the application, we may use any of the various versions of a graph.

For the purpose of traversal through the entire graph, we will use graphs with directed edges (since we need to model parent-child relation between nodes), and the edges will have no weights since all we care about is the complete traversal of the graph.

Now there are various ways to represent a graph in Python; two of the most common ways are the following:

  1. Adjacency Matrix
  2. Adjacency List

Adjacency Matrix

Adjacency Matrix is a square matrix of shape N x N (where N is the number of nodes in the graph).
Each row represents a node, and each of the columns represents a potential child of that node.
Each (row, column) pair represents a potential edge.

Whether or not the edge exists depends on the value of the corresponding position in the matrix.
A non-zero value at the position (i,j) indicates the existence of an edge between nodes i and j, while the value zero means there exists no edge between i and j.

The values in the adjacency matrix may either be binary or real numbers.
We can use binary values in a non-weighted graph (1 means the edge exists, and 0 means it doesn’t).
Real values suit a weighted graph, where each entry stores the weight of the edge between the nodes given by its row and column.

E.g., a value of 10 at position (2,3) indicates there exists an edge bearing weight 10 between nodes 2 and 3.

In Python, we can represent the adjacency matrices using a 2-dimensional NumPy array.
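As a small illustration, here is one way such a matrix might be built with NumPy. The four-node graph and the mapping of indices 0 to 3 onto the labels A to D are assumptions made for this sketch.

```python
import numpy as np

# Hypothetical 4-node directed graph; indices 0..3 stand for nodes A..D.
nodes = ["A", "B", "C", "D"]
adj = np.zeros((4, 4), dtype=int)

for src, dst in [(0, 1), (0, 2), (1, 2), (2, 3)]:
    adj[src, dst] = 1  # 1 marks an edge from the row node to the column node

print(adj)

# The children of node A are the columns holding a non-zero value in row 0.
children_of_a = [nodes[j] for j in np.flatnonzero(adj[0])]
print(children_of_a)  # ['B', 'C']
```

For a weighted graph, the same array could store the edge weights instead of 1s.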

Adjacency List

Adjacency List is a collection of several lists. Each list represents a node in the graph, and stores all the neighbors/children of this node.

In Python, an adjacency list can be represented using a dictionary where the keys are the nodes of the graph, and their values are a list storing the neighbors of these nodes.

We will use this representation for our implementation of the DFS algorithm.

Let’s take an example graph and represent it using a dictionary in Python.

The given graph has the following four edges:

  1. A -> B
  2. A -> C
  3. B -> C
  4. C -> D

Let’s now create a dictionary in Python to represent this graph.

graph = {"A": ["B", "C"],
         "B": ["C"],
         "C": ["D"]}

Now that we know how to represent a graph in Python, we can move on to the implementation of the DFS algorithm.

Implementing Depth First Search (a non-recursive approach)

We will consider the graph example shown in the animation in the first section.

Let’s define this graph as an adjacency list using the Python dictionary.

graph = {"A":["D","C","B"],
"B":["E"],
"C":["G","F"],
"D":["H"],
"E":["I"],
"F":["J"]}

One of the expected orders of traversal for this graph using DFS would be: A B E I C F J G D H

Let’s implement a method that accepts a graph and traverses through it using DFS. We can achieve this either with recursion or with a non-recursive, iterative approach.
In this section, we’ll look at the iterative method.

We will use a stack and a list to keep track of the visited nodes.
We’ll begin at the root node, append it to the path and mark it as visited. Then we will add all of its neighbors to the stack.
At each step, we will pop out an element from the stack and check if it has been visited.
If it has not been visited, we’ll add it to the path and add all of its neighbors to the stack.

def dfs_non_recursive(graph, source):
    if source is None or source not in graph:
        return "Invalid input"
    path = []
    stack = [source]
    while stack:
        s = stack.pop()
        if s in path:
            # already visited; skipping it also prevents endless loops on cyclic graphs
            continue
        path.append(s)
        # leaf nodes have no entry in the dictionary, so they add nothing to the stack
        for neighbor in graph.get(s, []):
            stack.append(neighbor)
    return " ".join(path)

Our user-defined method takes the dictionary representing the graph and a source node as input.
Note that the source node has to be one of the keys of the dictionary; otherwise, the method returns the string “Invalid input”.

Let’s call this method on our defined graph, and verify that the order of traversal matches with that demonstrated in the figure above.

DFS_path = dfs_non_recursive(graph, "A")
print(DFS_path)

Output:

A B E I C F J G D H

Thus the order of traversal of the graph is in the ‘Depth First’ manner.
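Checking membership in a list takes time proportional to its length, so for larger graphs it can help to track visited nodes in a set. The sketch below is a variation on the traversal above, not the article's original method; the function name `dfs_with_visited_set` and the small cyclic graph are assumptions made for illustration.

```python
def dfs_with_visited_set(graph, source):
    visited = set()   # constant-time membership checks
    order = []        # nodes in the order they are first visited
    stack = [source]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # nodes without an entry in the dictionary are treated as leaves
        stack.extend(graph.get(node, []))
    return order


# A small cyclic graph: A -> B -> C -> A
cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(dfs_with_visited_set(cyclic, "A"))  # ['A', 'B', 'C']
```

Because every node is expanded at most once, the loop terminates even when the graph contains a cycle.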

DFS using a recursive method

We can implement the Depth First Search algorithm using a popular problem-solving approach called recursion.

Recursion is a technique in which the same problem is divided into smaller instances, and the same method is recursively called within its body.

We will define a base case inside our method, which is – ‘If the leaf node has been visited, we need to backtrack’.

Let’s implement the method:

def recursive_dfs(graph, source, path=None):
    # use None instead of a mutable default so that repeated calls start with a fresh path
    if path is None:
        path = []
    if source not in path:
        path.append(source)
        if source not in graph:
            # leaf node, backtrack
            return path
        for neighbour in graph[source]:
            path = recursive_dfs(graph, neighbour, path)
    return path

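Now we can create our graph (the same as in the previous section, but with each node’s neighbors listed left to right, since the recursive version visits neighbors in list order) and call the recursive method.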

graph = {"A": ["B", "C", "D"],
         "B": ["E"],
         "C": ["F", "G"],
         "D": ["H"],
         "E": ["I"],
         "F": ["J"]}

path = recursive_dfs(graph, "A")
print(" ".join(path))

Output:

A B E I C F J G D H

The order of traversal is again in the Depth-First manner.

Depth First Search on a Binary Tree

What is a Binary Tree?

A binary tree is a special kind of graph in which each node has at most two children, called the left child and the right child.
The trees we build in this section are binary search trees, which add an ordering property: every value in the left subtree of a node is less than the node’s value.
Similarly, every value in the right subtree is greater than the node’s value.

Thus every value in the left branch of the root node is smaller than the value at the root, and those in the right branch will have a value greater than that at the root.

Let’s understand how we can represent a binary tree using Python classes.

Representing Binary Trees using Python classes

We can create a class to represent each node in a tree, along with its left and right children.
Using the root node object, we can traverse the whole tree.

We will also define a method to insert new values into a binary tree.

class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

    def insert(self, value):
        # compare against None explicitly so that falsy values such as 0 can be inserted
        if value is None:
            return
        if value < self.value:
            if self.left is None:
                self.left = Node(value)
            else:
                self.left.insert(value)
        elif value > self.value:
            if self.right is None:
                self.right = Node(value)
            else:
                self.right.insert(value)
        # a value equal to self.value is already in the tree, so there is nothing to do

Let’s now create a root node object and insert values in it to construct a binary tree like the one shown in the figure in the previous section.

root = Node(7)
root.insert(2)
root.insert(25)
root.insert(9)
root.insert(80)
root.insert(0)
root.insert(5)
root.insert(15)
root.insert(8)

This will construct the binary tree shown in the figure above.
It will also ensure that the properties of a binary search tree, i.e., ‘at most 2 children per node’ and ‘left < root < right’, are satisfied no matter in what order we insert the values.

Implementing DFS for a binary tree

Let’s now define a recursive function that takes as input the root node and displays all the values in the tree in the ‘Depth First Search’ order.

def dfs_binary_tree(root):
    if root is None:
        return
    else:
        print(root.value, end=" ")
        dfs_binary_tree(root.left)
        dfs_binary_tree(root.right)

We can now call this method and pass the root node object we just created.

dfs_binary_tree(root)

Output:

This order is also called the ‘preorder traversal’ of a binary tree.
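Preorder is one of three common depth-first orders on a binary tree; the other two, inorder and postorder, differ only in when the current node is emitted relative to its children. The sketch below uses a small stand-in class, `TreeNode`, and a five-value tree, both of which are assumptions made so the example is self-contained.

```python
class TreeNode:
    """Minimal stand-in node class for this sketch."""

    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def inorder(node):
    # left subtree, then the node itself, then the right subtree
    if node is None:
        return []
    return inorder(node.left) + [node.value] + inorder(node.right)


def postorder(node):
    # both subtrees first, the node itself last
    if node is None:
        return []
    return postorder(node.left) + postorder(node.right) + [node.value]


tree = TreeNode(7, TreeNode(2, TreeNode(0), TreeNode(5)), TreeNode(25))
print(inorder(tree))    # [0, 2, 5, 7, 25]
print(postorder(tree))  # [0, 5, 2, 25, 7]
```

On a binary search tree, the inorder traversal yields the values in ascending order.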

Depth First Search using networkx

So far, we have been writing our own logic for representing graphs and traversing them.
But, as with many other common tasks, the Python ecosystem offers a library to handle graphs as well. It is called ‘networkx’.

‘networkx’ is a Python package to represent graphs using nodes and edges, and it offers a variety of methods to perform different operations on graphs, including the DFS traversal.

Let’s first look at how to construct a graph using networkx.

Constructing a graph in networkx

To construct a graph in networkx, we first create a graph object and then add all the nodes in the graph using the ‘add_node()’ method, followed by defining all the edges between the nodes, using the ‘add_edge()’ method.

Let’s construct the following graph using ‘networkx’.

import networkx as nx

G = nx.Graph()  # create a graph

G.add_node(1)  # add single node
G.add_node(2)
G.add_node(3)
G.add_node(4)
G.add_node(5)
G.add_nodes_from([6, 7, 8, 9])  # add multiple nodes

Now that we have added all the nodes let’s define the edges between these nodes as shown in the figure.

# adding edges
G.add_edge(5, 8)
G.add_edge(5, 4)
G.add_edge(5, 7)
G.add_edge(8, 2)
G.add_edge(4, 3)
G.add_edge(4, 1)
G.add_edge(7, 6)
G.add_edge(6, 9)

Visualizing the graph

Now that we have constructed the graph by defining its nodes and edges, let’s draw it with networkx’s ‘draw()’ method and verify that it is constructed the way we wanted. We will use matplotlib to show the graph.

import matplotlib.pyplot as plt

nx.draw(G, with_labels=True, font_weight='bold')
plt.show()

Output:

The orientation may be a little different than our design, but it resembles the same graph, with the nodes and the same edges between them.

Let’s now perform DFS traversal on this graph.

Graph traversal in networkx – DFS

‘networkx’ offers a range of methods for traversing a graph in different ways. We will use the ‘dfs_preorder_nodes()’ method to traverse the graph in the Depth First Search order.

The expected order from the figure should be:
5, 8, 2, 4, 3, 1, 7, 6, 9

Let’s call the method and see in what order it prints the nodes.

dfs_output = list(nx.dfs_preorder_nodes(G, source=5))
print(dfs_output)

Output:

[5, 8, 2, 4, 3, 1, 7, 6, 9]

Thus the order of traversal by networkx is along our expected lines.

Now that we have understood the depth-first search or DFS traversal well, let’s look at some of its applications.

Topological sorting using Depth First Search

Topological sorting is one of the important applications of graphs used to model many real-life problems where the beginning of a task is dependent on the completion of some other task.

For instance, we may represent a number of jobs or tasks using nodes of a graph.
Some of the tasks may be dependent on the completion of some other task. This dependency is modeled through directed edges between nodes.
A graph with directed edges is called a directed graph.

If we want to perform a scheduling operation from such a set of tasks, we have to ensure that the dependency relation is not violated, i.e., any task that comes later in a chain of tasks is performed only after all the tasks before it have finished.
We can achieve this kind of order through the topological sorting of the graph.

Note that for topological sorting to be possible, there has to be no directed cycle present in the graph, that is, the graph has to be a directed acyclic graph or DAG.
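The acyclicity requirement can itself be checked with DFS. Below is a minimal sketch, assuming the graph is given as a plain dictionary from each node to a list of its successors; `has_cycle` is a hypothetical helper name. It colors nodes white (unvisited), gray (on the current DFS path), or black (fully explored), and reports a cycle when an edge leads back to a gray node.

```python
def has_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2
    # collect every node, including those that only appear as successors
    nodes = set(graph) | {n for children in graph.values() for n in children}
    color = {n: WHITE for n in nodes}

    def visit(node):
        color[node] = GRAY  # node is on the current DFS path
        for child in graph.get(node, []):
            if color[child] == GRAY:
                return True  # edge back to the current path: a directed cycle
            if color[child] == WHITE and visit(child):
                return True
        color[node] = BLACK  # node and all of its descendants are fully explored
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)


print(has_cycle({"A": ["B"], "B": ["C"]}))              # False
print(has_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))  # True
```

A graph for which this check returns False is a DAG and therefore admits at least one topological ordering.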

Let’s take an example of a DAG and perform topological sorting on it, using the Depth First Search approach.

Let’s say each node in the above graph represents a task in a factory to produce a product. The directed arrows between the nodes model the dependency of each task on the completion of the previous tasks.

Hence, whatever ordering of tasks we choose, tasks A and E must have been completed before task C begins.

Similarly, for performing the task I, the tasks A, E, C, and F must have been completed. Since there is no inward arrow on node H, the task H can be performed at any point without the dependency on completion of any other task.

We can construct such a directed graph using networkx’s ‘DiGraph’ class.

dag = nx.DiGraph()
dag.add_nodes_from(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'])
dag.add_edges_from([('A', 'B'), ('A', 'E'), ('B', 'D'), ('E', 'C'),
                    ('D', 'G'), ('C', 'G'), ('C', 'I'), ('F', 'I')])

Note that we have used the methods ‘add_nodes_from()’ and ‘add_edges_from()’ to add all the nodes and edges of the directed graph at once.

We can now write a function to perform topological sorting using DFS.

We will begin at a node with no inward arrow, and keep exploring one of its branches until we hit a leaf node, and then we backtrack and explore other branches.

Once we explore all the branches of a node, we will mark the node as ‘visited’ and push it to a stack.

Once every node is visited, we can perform repeated pop operations on the stack to give us a topologically sorted ordering of the tasks.

Now let’s translate this idea into a Python function:

def dfs(dag, start, visited, stack):
    if start in visited:
        # node and all its branches have been visited
        return stack, visited
    if dag.out_degree(start) == 0:
        # if leaf node, push and backtrack
        stack.append(start)
        visited.append(start)
        return stack, visited
    # traverse all the branches
    for node in dag.neighbors(start):
        if node in visited:
            continue
        stack, visited = dfs(dag, node, visited, stack)
    # now, push the node if not already visited
    if start not in visited:
        print("pushing %s" % start)
        stack.append(start)
        visited.append(start)
    return stack, visited

def topological_sort_using_dfs(dag):
    visited = []
    stack = []
    start_nodes = [i for i in dag.nodes if dag.in_degree(i) == 0]
    # print(start_nodes)
    for s in start_nodes:
        stack, visited = dfs(dag, s, visited, stack)
    print("Topological sorted:")
    while len(stack) != 0:
        print(stack.pop(), end=" ")

We have defined two functions – one for recursive traversal of a node, and the main topological sort function that first finds all nodes with no dependency and then traverses each of them using the Depth First Search approach.
Finally, it pops out values from the stack, which produces a topological sorting of the nodes.

Let’s now call the function ‘topological_sort_using_dfs()’

topological_sort_using_dfs(dag)

Output :

If we look closely at the output order, we’ll find that every job appears only after all of its dependencies.

We can also compare this with the output of a topological sort method included in the ‘networkx’ module called ‘topological_sort()’.

topological_sorting = nx.topological_sort(dag)

for n in topological_sorting:

print(n, end=' ')

Output:

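It looks like the ordering produced by networkx’s sort method is the same as the one produced by our method. Keep in mind that a DAG can have several valid topological orderings, so the two may differ for other graphs or library versions; what matters is that every task appears after its dependencies.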
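Since more than one ordering can be valid, a direct way to judge any candidate is to check it against the edges. The helper below, `is_topological_order`, is a hypothetical name for a minimal sketch; it takes the edge list of the DAG defined earlier and an ordering of its nodes, and the two orderings tested are illustrative inputs.

```python
def is_topological_order(edges, order):
    # map each node to its position in the proposed ordering
    position = {node: i for i, node in enumerate(order)}
    # every edge (u, v) requires u to come before v
    return all(position[u] < position[v] for u, v in edges)


edges = [('A', 'B'), ('A', 'E'), ('B', 'D'), ('E', 'C'),
         ('D', 'G'), ('C', 'G'), ('C', 'I'), ('F', 'I')]

print(is_topological_order(edges, list("HFAECIBDG")))  # True
print(is_topological_order(edges, list("GDBICEAFH")))  # False
```

The second ordering is the reverse of the first, so it places every task before its dependencies and fails the check.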

Finding connected components using DFS

A graph has another important property called the connected components. A connected component in an undirected graph is a maximal set of nodes in which each vertex is connected to every other vertex through a path.

Let’s look at the following example:

In the graph shown above, there are three connected components; each of them has been marked in pink.

Let’s construct this graph in Python, and then chart out a way to find connected components in it.

graph = nx.Graph()
graph.add_nodes_from(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'])

graph.add_edges_from([('A', 'B'), ('B', 'E'), ('A', 'E')])  # component 1
graph.add_edges_from([('C', 'D'), ('D', 'H'), ('H', 'F'), ('F', 'C')])  # component 2
graph.add_edge('G', 'I')  # component 3

Let’s also visualize it while we are at it.

import matplotlib.pyplot as plt

nx.draw(graph, with_labels=True, font_weight='bold')
plt.show()

Output:

To find connected components using DFS, we will maintain a shared list called ‘visited’, and every time we encounter a node that has not been visited, we will run DFS from it to collect the connected component it is a part of.

We will mark every node in that component as ‘visited’ so that we do not revisit it while looking for another connected component.

We will repeat this procedure for every node; the number of times we start a DFS from an unvisited node will be equal to the number of connected components in the graph.

Let’s write this logic in Python and run it on the graph we just constructed:

def find_connected_components(graph):
    visited = []
    connected_components = []
    for node in graph.nodes:
        if node not in visited:
            cc = []  # connected component
            visited, cc = dfs_traversal(graph, node, visited, cc)
            connected_components.append(cc)
    return connected_components

def dfs_traversal(graph, start, visited, path):
    if start in visited:
        return visited, path
    visited.append(start)
    path.append(start)
    for node in graph.neighbors(start):
        visited, path = dfs_traversal(graph, node, visited, path)
    return visited, path

Let’s use our method on the graph we constructed in the previous step.

connected_components = find_connected_components(graph)
print("Total number of connected components =", len(connected_components))

for cc in connected_components:
    print(cc)

Output:

Conclusion

In this blog, we understood the DFS algorithm and used it in different ways.

We began by understanding how a graph can be represented using common data structures and implemented each of them in Python.

We then implemented the Depth First Search traversal algorithm using both the recursive and non-recursive approach.

Next, we looked at a special form of a graph called the binary tree and implemented the DFS algorithm on the same.
Here we represented the entire tree using node objects constructed from the Python class we defined to represent a node.

Then we looked at Python’s offering for representing graphs and performing operations on them – the ‘networkx’ module.
We used it to construct a graph, visualize it, and traverse it with the module’s own DFS method. We compared the output with the order we expected.

Finally, we looked at two important applications of the Depth First Search traversal namely, topological sort and finding connected components in a graph.


Python correlation matrix tutorial

In this blog, we will go through an important descriptive statistic of multi-variable data called the correlation matrix. We will learn how to create, plot, and manipulate correlation matrices in Python. We will be looking at the following topics:
1 What is the correlation matrix?
1.1 What is the correlation coefficient?
2 Finding the correlation matrix of the given data
3 Plotting the correlation matrix
4 Interpreting the correlation matrix
5 Adding title and labels to the plot
6 Sorting the correlation matrix
7 Selecting negative correlation pairs
8 Selecting strong correlation pairs (magnitude greater than 0.5)
9 Converting a covariance matrix into the correlation matrix
10 Exporting the correlation matrix to an image
11 Conclusion


What is the correlation matrix?

A correlation matrix is a table representing the ‘correlations’ between pairs of variables in a given dataset.

We will construct such a correlation matrix by the end of this blog.

Each row and column represents a variable, and each value in this matrix is the correlation coefficient between the variables represented by the corresponding row and column.

The correlation matrix is an important data analysis tool: it summarizes the data so that we can understand the relationships between variables and make decisions accordingly.

Computing and analyzing the correlation matrix is also an important pre-processing step in Machine Learning pipelines where dimensionality reduction is desired on high-dimensional data.

We mentioned how each cell in the correlation matrix is a ‘correlation coefficient‘ between the two variables corresponding to the row and column of the cell.

Let us understand what a correlation coefficient is before we move ahead.

What is the correlation coefficient?

A correlation coefficient is a number that denotes the strength of the relationship between two variables.

There are several types of correlation coefficients, but the most common of them all is the Pearson’s coefficient denoted by the Greek letter ρ (rho).

It is defined as the covariance between two variables divided by the product of the standard deviations of the two variables:

$$\rho_{X,Y} = \frac{\mathrm{COV}(X, Y)}{\sigma_X \sigma_Y}$$

Here the covariance between X and Y, COV(X, Y), is further defined as the ‘expected value of the product of the deviations of X and Y from their respective means’.
The formula for covariance makes this clearer:

$$\mathrm{COV}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$

So the formula for Pearson’s correlation then becomes:

$$\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$

The value of ρ lies between -1 and +1.
Values nearing +1 indicate the presence of a strong positive relation between X and Y, whereas those nearing -1 indicate a strong negative relation between X and Y.
Values near zero indicate the absence of a linear relationship between X and Y.
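To make the definition concrete, the coefficient can be computed by hand from the deviations of each variable from its mean. The sketch below is an illustration with made-up data; `pearson` is a hypothetical helper name, and it uses population statistics (dividing by N) in both the numerator and the denominator.

```python
import numpy as np


def pearson(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()          # deviations of X from its mean
    dy = y - y.mean()          # deviations of Y from its mean
    cov = (dx * dy).mean()     # expected value of the product of deviations
    return cov / (x.std() * y.std())


x = [1, 2, 3, 4, 5]
print(pearson(x, [2, 4, 6, 8, 10]))  # approximately +1: perfect positive relation
print(pearson(x, [10, 8, 6, 4, 2]))  # approximately -1: perfect negative relation
```

Any consistent choice of normalization (N or N-1) gives the same coefficient, because the factor cancels between the covariance and the product of standard deviations.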

Finding the correlation matrix of the given data

Let us generate random data for two variables and then construct the correlation matrix for them.

import numpy as np

np.random.seed(10)

# generating 10 random values for each of the two variables
X = np.random.randn(10)
Y = np.random.randn(10)

# computing the correlation matrix
C = np.corrcoef(X, Y)
print(C)

Output:

Since we compute the correlation matrix of 2 variables, its dimensions are 2 x 2.
The value 0.02 indicates there doesn’t exist a relationship between the two variables. This was expected since their values were generated randomly.

In this example, we used NumPy’s `corrcoef` method to generate the correlation matrix.
It also accepts a 2-D array holding more than two variables, but it returns a plain array without any feature labels.

Hence, going ahead, we will use pandas DataFrames to store the data and to compute the correlation matrix on them.
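For reference, `np.corrcoef` treats each row of a 2-D array as one variable, so several variables can be stacked and passed at once. The three variables below are synthetic and exist only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = 2 * a + rng.normal(scale=0.1, size=50)  # strongly tied to a
c = rng.normal(size=50)                      # generated independently of a and b

# each row of the stacked array is one variable
C3 = np.corrcoef(np.vstack([a, b, c]))
print(C3.shape)  # (3, 3)
print(np.round(C3, 2))
```

The result is an unlabeled array, so it is up to the reader to remember which row corresponds to which variable.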

Plotting the correlation matrix

For this explanation, we will use a data set that has more than just two features.

We will use the Breast Cancer data, a popular binary classification dataset used in introductory ML lessons.
We will load this dataset from scikit-learn’s datasets module.
It is returned in the form of NumPy arrays, but we will convert them into a Pandas DataFrame.

from sklearn.datasets import load_breast_cancer
import pandas as pd

breast_cancer = load_breast_cancer()
data = breast_cancer.data
features = breast_cancer.feature_names

df = pd.DataFrame(data, columns=features)
print(df.shape)
print(features)

There are 30 features in the data, all of which are listed in the output above.

Our goal is now to determine the relationship between each pair of these columns. We will do so by plotting the correlation matrix.

To keep things simple, we’ll only use the first six columns and plot their correlation matrix.
To plot the matrix, we will use a popular visualization library called seaborn, which is built on top of matplotlib.

import seaborn as sns
import matplotlib.pyplot as plt

# taking all rows but only 6 columns
df_small = df.iloc[:, :6]

correlation_mat = df_small.corr()
sns.heatmap(correlation_mat, annot=True)
plt.show()

Output:

The plot shows a 6 x 6 matrix and color-fills each cell based on the correlation coefficient of the pair representing it.

Pandas DataFrame’s corr() method is used to compute the matrix. By default, it computes the Pearson’s correlation coefficient.
We could also use other methods such as Spearman’s coefficient or Kendall Tau correlation coefficient by passing an appropriate value to the parameter 'method'.
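The effect of the 'method' parameter can be seen on a tiny example. The DataFrame below is made up for illustration: y grows with x, but not linearly, so the rank-based Spearman coefficient is 1 while Pearson's coefficient is lower.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3  # monotonic but non-linear in x

pearson_xy = toy.corr(method="pearson").loc["x", "y"]
spearman_xy = toy.corr(method="spearman").loc["x", "y"]

print(round(pearson_xy, 3), round(spearman_xy, 3))
```

Spearman's coefficient only looks at the ordering of the values, which is why it reports a perfect relationship here.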

We’ve used seaborn’s heatmap() method to plot the matrix. The parameter ‘annot=True‘ displays the values of the correlation coefficient in each cell.

Let us now understand how to interpret the plotted correlation coefficient matrix.

Interpreting the correlation matrix

Let’s first reproduce the matrix generated in the earlier section and then discuss it.

Keep the following points in mind with regard to correlation matrices such as the one shown above:

  1. Each cell in the grid represents the value of the correlation coefficient between two variables.
  2. The value at position (a, b) represents the correlation coefficient between features at row a and column b. This will be equal to the value at position (b, a)
  3. It is a square matrix – each row represents a variable, and all the columns represent the same variables as rows, hence the number of rows = number of columns.
  4. It is a symmetric matrix – this makes sense because the correlation between a,b will be the same as that between b, a.
  5. All diagonal elements are 1. Since diagonal elements represent the correlation of each variable with itself, it will always be equal to 1.
  6. The axes ticks denote the feature each of them represents.
  7. A large positive value (near to 1.0) indicates a strong positive correlation, i.e., if the value of one of the variables increases, the value of the other variable increases as well.
  8. A large negative value (near to -1.0) indicates a strong negative correlation, i.e., the value of one variable decreases with the other’s increasing and vice-versa.
  9. A value near to 0 (whether positive or negative) indicates the absence of any linear correlation between the two variables. This alone does not prove that the variables are independent, since they may still be related in a non-linear way.
  10. Each cell in the above matrix is also represented by shades of a color. Here darker shades of the color indicate smaller values while brighter shades correspond to larger values (near to 1).
    This scale is given with the help of a color-bar on the right side of the plot.
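Several of the structural points above (square shape, symmetry, unit diagonal, values within [-1, 1]) can be verified numerically. The sketch below uses random data with four hypothetical features rather than the Breast Cancer data.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 4))          # 100 samples, 4 hypothetical features
corr = np.corrcoef(data, rowvar=False)    # rowvar=False: columns are the variables

print(corr.shape[0] == corr.shape[1])         # square matrix
print(np.allclose(corr, corr.T))              # symmetric
print(np.allclose(np.diag(corr), 1.0))        # every variable correlates perfectly with itself
print(np.all(np.abs(corr) <= 1.0 + 1e-12))    # all values lie in [-1, 1]
```

Each check prints True for any dataset whose features are not constant.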

Adding title and labels to the plot

We can tweak the generated correlation matrix, just like any other Matplotlib plot. Let us see how we can add a title to the matrix and labels to the axes.

correlation_mat = df_small.corr()
sns.heatmap(correlation_mat, annot=True)

plt.title("Correlation matrix of Breast Cancer data")
plt.xlabel("cell nucleus features")
plt.ylabel("cell nucleus features")
plt.show()

Output:

If we want, we could also move the title to the bottom by specifying its y position.

correlation_mat = df_small.corr()
sns.heatmap(correlation_mat, annot=True)

plt.title("Correlation matrix of Breast Cancer data", y=-0.75)
plt.xlabel("cell nucleus features")
plt.ylabel("cell nucleus features")
plt.show()

Output:

Sorting the correlation matrix

If the given data has a large number of features, the correlation matrix can become very big and hence difficult to interpret.

Sometimes we might want to sort the values in the matrix and see the strength of correlation between various feature pairs in an increasing or decreasing order.
Let us see how we can achieve this.

First, we will convert the given matrix into a one-dimensional Series of values.

correlation_mat = df_small.corr()
corr_pairs = correlation_mat.unstack()
print(corr_pairs)

Output:

The unstack method on the Pandas DataFrame returns a Series with a MultiIndex. That is, each value in the Series is identified by more than one index, which in this case are the row and column labels, i.e., the feature names.

Let us now sort these values using the sort_values() method of the Pandas Series.

sorted_pairs = corr_pairs.sort_values(kind="quicksort")
print(sorted_pairs)

Output:

We can see each value is repeated twice in the sorted output. This is because our correlation matrix was a symmetric matrix, and each pair of features occurred twice in it.

Nonetheless, we now have the sorted correlation coefficient values of all pairs of features and can make decisions accordingly.
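If the repeated entries get in the way, one option is to keep only one of each mirrored pair and to drop the self-correlations. The sketch below is one possible approach on a made-up three-column DataFrame; it keeps a pair only when its first label sorts before its second.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 5, 4, 5],
    "c": [5, 3, 4, 1, 2],
})

pairs = toy.corr().unstack()

# keep (a, b) but not (b, a) or (a, a): the first label must sort before the second
keep = np.array([first < second for first, second in pairs.index])
unique_pairs = pairs[keep].sort_values()

print(unique_pairs)
```

For n features this leaves n(n-1)/2 entries, one per unordered pair.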

Selecting negative correlation pairs

We may want to select feature pairs having a particular range of values of the correlation coefficient.
Let’s see how we can choose pairs with a negative correlation from the sorted pairs we generated in the previous section.

negative_pairs = sorted_pairs[sorted_pairs < 0]
print(negative_pairs)

Output:

Selecting strong correlation pairs (magnitude greater than 0.5)

Let us use the same approach to choose strongly related features. That is, we will select those feature pairs whose correlation coefficient values are greater than 0.5 or less than -0.5.

strong_pairs = sorted_pairs[abs(sorted_pairs) > 0.5]

print(strong_pairs)

Output:

Converting a covariance matrix into the correlation matrix

We have seen the relationship between the covariance and correlation between a pair of variables in the introductory sections of this blog.

Let us understand how we can compute the covariance matrix of a given data in Python and then convert it into a correlation matrix. We’ll compare it with the correlation matrix we had generated using a direct method call.

Pandas offers ‘DataFrame.cov()’ for this, but to see the mechanics we’ll use NumPy’s cov() method.

cov = np.cov(df_small.T)
print(cov)

Output:

We’re passing the transpose of the matrix because the method expects a matrix in which each of the features is represented by a row rather than a column.

So we have gotten our numerator right.
Now we need to compute a 6×6 matrix in which the value at i, j is the product of standard deviations of features at positions i and j.

We’ll then divide the covariance matrix by this standard deviations matrix to compute the correlation matrix.

Let us first construct the standard deviations matrix.

# compute standard deviations of each of the 6 features
# np.cov normalizes by N-1 by default, so we use ddof=1 here as well
stds = np.std(df_small.to_numpy(), axis=0, ddof=1)  # shape = (6,)

# entry (i, j) is the product stds[i] * stds[j]
stds_matrix = np.outer(stds, stds)
print("standard deviations matrix of shape:", stds_matrix.shape)

Output:

Now that we have the covariance matrix of shape (6,6) for the 6 features, and the matrix of pairwise products of standard deviations, also of shape (6,6), we can divide the two and see if we get the desired resultant correlation matrix.

new_corr = cov/stds_matrix

We have stored the new correlation matrix (derived from a covariance matrix) in the variable new_corr.

Let us check if we got it right by plotting the correlation matrix and juxtaposing it with the earlier one generated directly using the Pandas method corr().

plt.figure(figsize=(18,4))

plt.subplot(1,2,1)

sns.heatmap(correlation_mat, annot = True)

plt.title("Earlier correlation matrix (from Pandas)")

plt.xlabel("cell nucleus features")

plt.ylabel("cell nucleus features")

plt.subplot(1,2,2)

sns.heatmap(new_corr, annot = True)

plt.title("Newer correlation matrix (from Covariance mat)")

plt.xlabel("cell nucleus features")

plt.ylabel("cell nucleus features")

plt.show()

Output:

We can compare the two matrices and notice that they are identical.
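A numerical check can complement the visual comparison. The sketch below uses synthetic data rather than the breast cancer features (an assumption made to keep it self-contained). It derives the standard deviations from the diagonal of the covariance matrix and compares the result against np.corrcoef:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))    # 100 samples, 3 synthetic features

cov = np.cov(X.T)                # sample covariance (divides by N - 1)
stds = np.sqrt(np.diag(cov))     # standard deviations consistent with cov
corr_from_cov = cov / np.outer(stds, stds)

# Compare with the correlation matrix NumPy computes directly
print(np.allclose(corr_from_cov, np.corrcoef(X.T)))  # True
```

np.outer builds the same matrix of pairwise products as the nested list comprehension, without hard-coding the number of features.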

Exporting the correlation matrix to an image

Plotting the correlation matrix in a Python script is not enough. We might want to save it for later use.
We can save the generated plot as an image file on disk using the plt.savefig() method.

correlation_mat = df_small.corr()

sns.heatmap(correlation_mat, annot = True)

plt.title("Correlation matrix of Breast Cancer data")

plt.xlabel("cell nucleus features")

plt.ylabel("cell nucleus features")

plt.savefig("breast_cancer_correlation.png")

After you run this code, you can see an image file with the name ‘breast_cancer_correlation.png’ in the same working directory.

Conclusion

In this tutorial, we learned what a correlation matrix is and how to generate one in Python. We began by focusing on the concept of a correlation matrix and the correlation coefficients.

Then we generated the correlation matrix as a NumPy array and then as a Pandas DataFrame. Next, we learned how to plot the correlation matrix and manipulate the plot labels, title, etc. We also discussed various properties used for interpreting the output correlation matrix.

We also saw how we could perform certain operations on the correlation matrix, such as sorting the matrix, finding negatively correlated pairs, finding strongly correlated pairs, etc.

Then we discussed how we could compute the covariance matrix of the data and generate the correlation matrix from it by dividing it by the products of the standard deviations of the individual features.
Finally, we saw how we could save the generated plot as an image file.


NumPy loadtxt tutorial (Load data from files)

In a previous tutorial, we talked about NumPy arrays and saw how they make reading, parsing and performing operations on numeric data a cakewalk. In this tutorial, we will discuss the NumPy loadtxt method, which is used to parse data from text files and store it in an n-dimensional NumPy array. We can then perform on it all the operations that are possible on a NumPy array. np.loadtxt offers a lot of flexibility in the way we read data from a file through options such as the data type of the resulting array, the delimiter that separates one data entry from the others, skipping/including specific rows, etc. We’ll look at each of those options in the following tutorial.


Specifying the file path

Let’s look at how we can specify the path of the file from which we want to read data.

We’ll use a sample text file for our code examples, which lists the weights (in kg) and heights (in cm) of 100 individuals, each on a row.

I will use several variants of this file to explain different features of the loadtxt function.

Let’s begin with the simplest representation of the data in a text file. We have 100 lines (or rows) of data in our text file, each of which comprises 2 floating-point numbers separated by a space.

The first number on each row represents the weight and the second number represents the height of an individual.

Here’s a little glimpse from the file:

110.90 146.03
44.83 211.82
97.13 209.30
105.64 164.21

This file is stored as `weight_height_1.txt`.
Our task is to read the file and parse the data in a way that we can represent in a NumPy array.
We’ll import the NumPy package and call the loadtxt method, passing the file path as the value of its first parameter, fname.

import numpy as np

data = np.loadtxt("./weight_height_1.txt")

Here we are assuming the file is stored at the same location from where our Python code will run (‘./’ represents the current directory). If that is not the case, we need to specify the complete path of the file (e.g., “C://Users/John/Desktop/weight_height_1.txt”).

We also need to ensure each row in the file has the same number of values.

The file can have any extension, not just .txt, as long as it contains text. We can also pass a generator instead of a file path (more on that later).

The function returns an n-dimensional NumPy array of values found in the text.

Here our text had 100 rows with each row having 2 float values, so the returned object data will be a NumPy array of shape (100, 2) with the float data type.

You can verify this by checking the ‘shape’ and ‘dtype’ attributes of the returned data:

print("shape of data:",data.shape)

print("datatype of data:",data.dtype)

Output:

shape of data: (100, 2)
datatype of data: float64
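To experiment without the sample file, np.loadtxt also accepts a file-like object. The following is a minimal sketch that assumes an in-memory io.StringIO buffer holding the four rows shown in the excerpt above:

```python
import io
import numpy as np

# Four sample rows copied from the file excerpt, held in memory
sample = io.StringIO(
    "110.90 146.03\n"
    "44.83 211.82\n"
    "97.13 209.30\n"
    "105.64 164.21\n"
)

data = np.loadtxt(sample)
print(data.shape)   # (4, 2)
print(data.dtype)   # float64
```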

Specifying delimiters

A delimiter is a character or a string of characters that separates individual values on a line.

For example, in our earlier file, we had the values separated by a space, so in that case, the delimiter was a space character (” “).

However, some other files may have a different delimiter, for instance, CSV files generally use comma (“,”) as a delimiter. Another file may have a semicolon as a delimiter.

So we need our data loader to be flexible enough to identify such delimiters in each row and extract the correct values from them.

This can be achieved by passing our delimiter as a parameter to the np.loadtxt function.

Let us consider another file, ‘weight_height_2.txt’. It has the same data content as the previous one, but this time the values in each row are separated by a comma:

110.90, 146.03
44.83, 211.82
97.13, 209.30

We’ll call the np.loadtxt function the same way as before, except that now we pass an additional parameter – ‘delimiter’:

import numpy as np

data = np.loadtxt("./weight_height_2.txt", delimiter = ",")

This function will return the same array as before.

  • In the previous section, we did not pass a delimiter value because np.loadtxt() treats any whitespace as the delimiter by default.
  • If the values on each row were separated by a tab, the delimiter would be specified using the escape sequence “\t”.

You can verify the results again by checking the shape of the data array and also printing the first few rows:

print("shape of array", data.shape)

print("First 5 rows:\n", data[:5])

Output:

Dealing with 2 delimiters

Now there may be a situation where a file contains more than one delimiter.

For example, let’s imagine each of our lines contained a 3rd value representing the date of birth of the individual in dd-mm-yyyy format:

110.90, 146.03, 3-7-1981
44.83, 211.82, 1-2-1986
97.13, 209.30, 14-2-1989

Now suppose we want to extract the dates, months and years as 3 different values into 3 different columns of our NumPy array. So should we pass “,” as the delimiter or should we pass “-”?

We can pass only 1 value to the delimiter parameter in the np.loadtxt method!

No need to worry, there is always a workaround. Let’s use a third file, ‘./weight_height_3.txt’, for this example.

We’ll use a naive approach first, which has the following steps:

  1. read the file
  2. eliminate one of the delimiters in each line and replace it with one common delimiter (here comma)
  3. append the line into a running list
  4. pass this list of strings to the np.loadtxt function instead of passing a file path.

Let’s write the code:

#reading each line from file and replacing "-" by ","
with open("./weight_height_3.txt") as f_input:
    text = [l.replace("-", ",") for l in f_input]

#calling the loadtxt method with "," as delimiter
data = np.loadtxt(text, delimiter=",")

  • Note that we are passing a list of strings as input and not a file path.
  • When calling the function we still pass the delimiter parameter with the value “,” as we’ve replaced all instances of the second delimiter ‘-’ by a comma.
  • The returned NumPy array should now have 5 columns

You can once again validate the results by printing the shape and the first five lines:

print("Shape of data:", data.shape)

print("First five rows:\n",data[:5])

Output:

Notice how we have 3 additional columns in each row indicating the date, month and year of birth

Also notice the new values are all floating-point values; however date, month or year make more sense as integers!
We’ll look at how to handle such data type inconsistencies in the coming section.

A general approach for multiple delimiters

In this section, we will look at a general approach for working with multiple delimiters.

Also, we’ll learn how we can use generators instead of file paths – a more efficient solution for multiple delimiters, than the one we discussed in the previous section.

The problem with reading the entire file at once and storing its lines as a list of strings is that it doesn’t scale well. For instance, if a file has a million lines, storing them all in a list at once consumes unnecessary additional memory.

Hence we will use generators to get rid of any additional delimiter.
A generator ‘yields’ a sequence of values on the fly, i.e., it will read the lines of a file as required instead of reading them all at once.

So let’s first define a generator function that takes in a file path and a list of delimiters as a parameter.

def generate_lines(filePath, delimiters=[]):
    with open(filePath) as f:
        for line in f:
            line = line.strip() #removes newline character from end
            for d in delimiters:
                line = line.replace(d, " ")
            yield line

Here we go through each of the delimiters one by one in each line and replace it with a blank space " ", which the np.loadtxt function treats as a delimiter by default (it splits on any whitespace).

We will now call this generator function and pass the returned generator object to the np.loadtxt method in place of the file path.

gen = generate_lines("./weight_height_3.txt", ["-",","])

data = np.loadtxt(gen)

Note that we did not need to pass any additional delimiter parameter, as our generator function replaced all instances of the delimiters in the passed list by a space, which is the default delimiter.

We can extend this idea and specify as many delimiters as needed.
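As a variation on the same idea, the replacements can be done in one pass with a regular expression. This is only a sketch: the function name generate_lines_regex is hypothetical, and it reads from an in-memory list of lines rather than from weight_height_3.txt so that it runs on its own.

```python
import re
import numpy as np

def generate_lines_regex(lines, delimiters):
    """Yield each line with every listed delimiter replaced by a space."""
    # re.escape guards against delimiters that are regex metacharacters
    pattern = re.compile("|".join(re.escape(d) for d in delimiters))
    for line in lines:
        yield pattern.sub(" ", line.strip())

# Two sample rows in the format of weight_height_3.txt
rows = ["110.90, 146.03, 3-7-1981", "44.83, 211.82, 1-2-1986"]

data = np.loadtxt(generate_lines_regex(rows, ["-", ","]))
print(data.shape)  # (2, 5)
```

Note that treating "-" as a delimiter only makes sense when the data contains no negative numbers, because the minus sign would be replaced as well.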

Specifying the data type

Unless specified otherwise, the np.loadtxt function of the NumPy package assumes the values in the passed text file to be floating-point values by default.

So if you pass a text file that has characters other than numbers, the function will throw an error, stating it was expecting floating-point values.

We can overcome this by specifying the data type of the values in the text file using the dtype parameter.

In the previous example, we saw the date, month and year were being interpreted as floating-point values, however, we know that these values can never exist in decimal form.

Let’s look at a new file ‘./weight_height_4.txt’ which has only 1 column for the date of birth of individuals in the dd-mm-yyyy format:

13-2-1991
17-12-1990
18-12-1986

So we’ll call the loadtxt method with “-” as the delimiter:

data = np.loadtxt("./weight_height_4.txt", delimiter="-")

print(data[:3])

print("datatype =",data.dtype)

If we look at the output of the above lines of code, we’ll see that each of the 3 values has been stored as floating-point values by default and the data type of the array is ‘float64’

We can alter this behavior by passing the value ‘int’ to the ‘dtype’ parameter. This will ask the function to store the extracted values as integers, and hence the data type of the array will also be int.

data = np.loadtxt("./weight_height_4.txt", delimiter="-", dtype="int")

print(data[:3])

print("datatype =",data.dtype)

Output:

But what if there are columns with different data types?

Let’s say we have the first two columns having float values and the last column having integer values.

In that case, we can pass a comma-separated datatype string specifying the data type of each column (in order of their existence) to the dtype parameter.

However, in such a case the function will return a one-dimensional structured NumPy array in which each row is a tuple-like record, since a regular NumPy array as a whole can have only 1 data type.

Let’s try this on ‘weight_height_3.txt’ file where the first two columns (weight, height) had float values and the last 3 values (date, month, year) were integers:
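One way such a call might look is sketched below. It works on three sample rows held in memory rather than on the actual file, and both the inline rows and the dtype string "f8,f8,i4,i4,i4" (two 64-bit floats followed by three 32-bit integers) are assumptions for illustration:

```python
import numpy as np

# Three sample rows in the format of weight_height_3.txt
rows = [
    "110.90, 146.03, 3-7-1981",
    "44.83, 211.82, 1-2-1986",
    "97.13, 209.30, 14-2-1989",
]

# Replace both delimiters by spaces so the default whitespace splitting applies
cleaned = (r.replace("-", " ").replace(",", " ") for r in rows)

# One type per column: float, float, int, int, int
data = np.loadtxt(cleaned, dtype="f8,f8,i4,i4,i4")

print(data.shape)        # (3,) : one record per row
print(data.dtype.names)  # ('f0', 'f1', 'f2', 'f3', 'f4')
print(data["f4"])        # the year column as integers
```

Each column of the structured array is accessed by its automatically assigned field name (f0, f1, ...) rather than by a second integer index.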

Output:

Ignoring headers

In some cases (especially CSV files), the first line of the text file may have ‘headers’ describing what each column in the following rows represents. While reading data from such text files, we may want to ignore the first line because we cannot (and should not) store them in our NumPy array.

In such a case, we can use the ‘skiprows’ parameter and pass the value 1, asking the function to ignore the first 1 line(s) of the text file.

Let’s try this on a CSV file – ‘weight_height.csv’:

Weight (in Kg), Height (in cm)
73.847017017515,241.893563180437
68.7819040458903,162.310472521300
74.1101053917849,212.7408555565

Now we want to ignore the header line i.e the first line of the file:

data = np.loadtxt("./weight_height.csv", delimiter=",", skiprows=1)

print(data[:3])

Output:

Likewise, we can pass any positive integer n to the skiprows parameter asking to ignore first n rows from the file.

Ignoring the first column

Sometimes, we may also want to skip the first column because we are not interested in it. For example, if our text file had the first column as “gender”, and if we don’t need to include the values of this column when extracting the data, we need a way to ask the function to do the same.

np.loadtxt has no skipcols parameter, analogous to skiprows, with which we could express this need. However, it has another parameter called ‘usecols’, where we specify the indices of the columns to be retained.

So if we want to skip the first column, we can simply supply the indices of all the columns except the first (remember indexing begins at zero)

Enough talking, let’s get to work!

Let’s look at the content of a new file ‘weight_height_5.txt’ which has an additional gender column that we want to ignore.

Male, 110.90, 146.03
Male, 44.83, 211.82


Female, 78.67, 158.74
Male, 105.64, 164.21

We’ll first determine the number of columns in the file from the first line and then pass a range of column indices excluding the first one:

with open("./weight_height_5.txt") as f:
    #determining number of columns from the first line of text
    n_cols = len(f.readline().split(","))

data = np.loadtxt("./weight_height_5.txt", delimiter=",",usecols=np.arange(1, n_cols))

print("First five rows:\n",data[:5])

Here we are supplying a range of indices beginning at 1 (the second column) and ending at n_cols - 1 (the last column), since np.arange excludes its stop value n_cols.

Output:

We can generalize the use of the usecols parameter by passing a list of indices of only those columns that we want to retain.
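For instance, to keep only the weight, month and year from rows with five comma-separated values, we could pass the list [0, 3, 4]. The sketch below assumes two in-memory sample rows instead of a file on disk:

```python
import io
import numpy as np

# Two sample rows: weight, height, date, month, year
sample = io.StringIO(
    "110.90, 146.03, 3,7,1981\n"
    "44.83, 211.82, 1,2,1986\n"
)

# Retain only columns 0 (weight), 3 (month) and 4 (year)
data = np.loadtxt(sample, delimiter=",", usecols=[0, 3, 4])
print(data.shape)  # (2, 3)
```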

Load first n rows

Just as we can skip the first n rows using the skiprows parameter, we can also choose to load only the first n rows and skip the rest. This can be achieved using the max_rows parameter of the np.loadtxt method.

Let us suppose that we want to read only the first 10 rows from the text file ‘weight_height_2.txt’. We’ll call the np.loadtxt method along with the max_rows parameter and pass the value 10.

data = np.loadtxt("./weight_height_2.txt", delimiter=",",max_rows = 10)

print("Shape of data:",data.shape)

Output:

As we can see, the returned NumPy array has only 10 rows which are the first 10 rows of the text file.

If we use the max_rows parameter along with the skiprows parameter, then the specified number of rows will be skipped and the next n rows will be extracted, where n is the value we pass to max_rows.
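A small sketch of this combination, assuming generated in-memory data with one header line followed by ten made-up rows:

```python
import io
import numpy as np

# One header line followed by ten generated rows (values are made up)
text = "Weight (in Kg), Height (in cm)\n" + "\n".join(
    f"{60 + i}, {150 + i}" for i in range(10)
)

# Skip the header, then read only the next 3 rows
data = np.loadtxt(io.StringIO(text), delimiter=",", skiprows=1, max_rows=3)
print(data.shape)  # (3, 2)
```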

Load specific rows

np.loadtxt has no parameter for loading only specific rows from the text file.

However, we can achieve this by defining a generator that accepts row indices and returns lines at those indices. We’ll then pass this generator object to our np.loadtxt method.

Let’s first define the generator:

def generate_specific_rows(filePath, row_indices=[]):
    with open(filePath) as f:
        # using enumerate to track line no.
        for i, line in enumerate(f):
            #if line no. is in the row index list, then return that line
            if i in row_indices:
                yield line

Let’s now use the np.loadtxt function to read the 2nd, 4th and 100th lines of the file ‘weight_height_2.txt’:

gen = generate_specific_rows("./weight_height_2.txt",row_indices = [1, 3, 99])

data = np.loadtxt(gen, delimiter=",")

print(data)

This should return a NumPy array having 3 rows and 2 columns.

Skip the last row

If you want to exclude the last line of the text file, you can achieve this in multiple ways. You can either define another generator that yields lines one by one and stops right before the last one, or you can use an even simpler approach – just figure out the number of lines in the file, and pass 1 less than that count to the max_rows parameter.

But how will you figure out the number of lines?
Follow along!

with open("./weight_height_2.txt") as f:
    n = len(list(f))

print("n =", n)

Now n contains the number of lines present in the `weight_height_2.txt` file. That value should be 100.

We will now read the text file as we used to, using the np.loadtxt method along with the max_rows parameter with value n – 1.

data = np.loadtxt("./weight_height_2.txt", delimiter=",",max_rows=n - 1)

print("data shape =",data.shape)

Output:

As we can see, the original text file had 100 rows, but when we read data from the file, its shape is (99, 2) since it skipped the last row from the file.
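The generator-based alternative mentioned at the start of this section could look roughly like the sketch below. The name all_but_last is hypothetical, and the input is an in-memory buffer with three sample rows rather than the real file. The generator holds back one line of lookahead, so the file does not need to be counted first:

```python
import io
import numpy as np

def all_but_last(lines):
    """Yield every line except the final one, keeping one line of lookahead."""
    iterator = iter(lines)
    try:
        previous = next(iterator)
    except StopIteration:
        return  # empty input: nothing to yield
    for current in iterator:
        yield previous
        previous = current
    # the loop ends with the last line still held in `previous`, so it is never yielded

sample = io.StringIO("110.90, 146.03\n44.83, 211.82\n97.13, 209.30\n")
data = np.loadtxt(all_but_last(sample), delimiter=",")
print(data.shape)  # (2, 2)
```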

Skip specific columns

Suppose you wanted to ignore some of the columns while loading data from a text file by specifying the indices of such columns.

While the np.loadtxt method provides a parameter to specify which columns to retain (usecols), it doesn’t offer a way to do the opposite i.e specify which columns to skip. However, we can always find a workaround!

We shall first define the indices of columns to be ignored, and then using them we will derive the list of indices to be retained as the two sets would be mutually exclusive.

We will then pass this derived indices list to the usecols parameter.

Here is pseudocode for the entire process:

  1. Find the number of columns in the file n_cols (explained in an earlier section)
  2. Define the list of indices to be ignored
  3. Create a range of indices from 0 to n_cols - 1, and eliminate the indices of step 2 from this range
  4. Pass this new list to usecols parameter in np.loadtxt method

Let’s create a wrapper function loadtext_without_columns that implements all the above steps:

def loadtext_without_columns(filePath, skipcols=[], delimiter=","):
    with open(filePath) as f:
        n_cols = len(f.readline().split(delimiter))

    #define a range from 0 to n_cols - 1
    usecols = np.arange(0, n_cols)

    #remove the indices found in skipcols
    usecols = set(usecols) - set(skipcols)

    #sort the new indices in ascending order
    usecols = sorted(usecols)

    #load the file and retain indices found in usecols
    data = np.loadtxt(filePath, delimiter = delimiter, usecols = usecols)

    return data

To test our code, we will work with a new file `weight_height_6.txt` which has 5 columns – the first two columns indicate weight and height and the remaining 3 indicate the date, month and year of birth of the individuals.

All the values are separated by a single delimiter – comma:

110.90, 146.03, 3,7,1981
44.83, 211.82, 1,2,1986
97.13, 209.30, 14,2,1989


105.64, 164.21, 3,6,2000

Suppose we were not interested in the height and the date (day) of birth of the individual, and so we wanted to skip the columns at positions 1 and 2.

Let’s call our wrapper function specifying our requirements:

data = loadtext_without_columns("./weight_height_6.txt",skipcols = [1, 2], delimiter = ",")

# print first 5 rows
print(data[:5])

Output:

We can see that our wrapper function only returns 3 columns – weight, month and year. It has ensured that the columns we specified have been skipped!

Load 3D arrays

So far we’ve been reading the contents of the file as a 2D NumPy array. This is the default behavior of the np.loadtxt method, and there’s no additional parameter that we can specify to interpret the read data as a 3D array.

So the simplest approach to solve this problem would be to read the data as a NumPy array and use NumPy’s reshape method to reshape the data in any shape of any dimension that we desire.

We just need to be careful that if we want to interpret it as a multidimensional array, we should make sure it is stored in the text file in an appropriate manner and that after reshaping the array, we’d get what we actually desired.

Let us take an example file – ‘weight_height_7.txt’.

This is the same file as ‘weight_height_2.txt’. The only difference is that this file has 90 rows, and each 30-row block represents a different section or class to which individuals belong.

So there are a total of 3 sections (A, B and C) – each having 30 individuals whose weights and heights are listed on a new row.

The section names are denoted with a comment just before the beginning of each section (you can check this at lines 1, 32 and 63).

The comment statements begin with ‘#’ and these lines are ignored by np.loadtxt when reading the data. We can also specify any other identifier for comment lines using the parameter ‘comments’

Now when you read this file and print its shape, it would display (90,2) because that is how np.loadtxt reads the data – it arranges multi-row data into a 2D array.

But we know that there is a logical separation between each group of 30 individuals, and we would want the shape to be (3, 30, 2) – where the first dimension indicates the sections, the second one represents each of the individuals in that section and the last dimension indicates the number of values associated to each of these individuals (here 2 for weight & height).

Using NumPy reshape method

So we want our data to be represented as a 3D array.

We can achieve this by simply reshaping the returned data using NumPy’s reshape method.

data = np.loadtxt("./weight_height_7.txt",delimiter=",")

print("Current shape = ",data.shape)

data = data.reshape(3,30,2)

print("Modified shape = ",data.shape)

print("fifth individual of section B - weight, height =",data[1,4,:])

Output:

Notice how we are printing the details of a specific individual using 3 indices

The returned result belongs to the 5th individual of section B – this can be validated from the text:

#section B
100.91, 155.55
72.93, 150.38
116.68, 137.15
86.51, 172.15
59.85, 155.53
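The same idea can be tried on a much smaller scale. The sketch below assumes a made-up in-memory file with 3 sections of 2 individuals each. The comment lines are skipped automatically, and passing -1 to reshape lets NumPy work out the number of individuals per section:

```python
import io
import numpy as np

# Miniature version of the sectioned file: 3 sections, 2 individuals each
text = """#section A
1, 2
3, 4
#section B
5, 6
7, 8
#section C
9, 10
11, 12
"""

data = np.loadtxt(io.StringIO(text), delimiter=",")  # '#' lines are ignored
print(data.shape)                # (6, 2)

cube = data.reshape(3, -1, 2)    # (sections, individuals, values)
print(cube.shape)                # (3, 2, 2)
print(cube[1, 0, :])             # first individual of section B
```

This only gives a meaningful result when every section holds the same number of rows, since reshape simply cuts the 2D array into equal blocks.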

Comparison with alternatives

While numpy.loadtxt is an extremely useful utility for reading data from text files, it is not the only one!

There are many alternatives out there that can do the same task as np.loadtxt, and many of them are better than np.loadtxt in some respects. Let’s briefly look at 3 such alternative functions.

numpy.genfromtxt

  1. This is the most discussed and the most used method alongside np.loadtxt
  2. There’s no major difference between the two; the one that stands out is np.genfromtxt’s ability to smoothly handle missing values.
  3. In fact, NumPy’s documentation describes np.loadtxt as “an equivalent function (to np.genfromtxt) when no data is missing”.
  4. So the two methods are very similar, except that np.genfromtxt can do more sophisticated processing of the data in a text file.
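As a brief illustration of that difference, the sketch below feeds three made-up rows with empty fields to np.genfromtxt, once with its defaults and once with the filling_values parameter:

```python
import io
import numpy as np

# Three rows with one missing weight and one missing height
text = ",146.03\n44.83,211.82\n97.13,\n"

# By default, missing float entries become nan
with_nan = np.genfromtxt(io.StringIO(text), delimiter=",")

# filling_values substitutes a value of our choice instead
with_zero = np.genfromtxt(io.StringIO(text), delimiter=",", filling_values=0)

print(with_nan)
print(with_zero)
```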

numpy.fromfile

  1. np.fromfile is commonly used when working with data stored in binary files, with no delimiters.
  2. It can read plain text files but does so with a lot of issues (go ahead and try reading the files we discussed using np.fromfile)
  3. While it is faster in execution time than np.loadtxt, it is generally not a preferred choice when working with well-structured data in a text file.
  4. Besides, NumPy’s documentation mentions np.loadtxt as a ‘more flexible (than np.fromfile) way of loading data from a text file’.

pandas.read_csv

  1. pandas.read_csv is the most popular choice of Data Scientists, ML Engineers, Data Analysts, etc. for reading data from text files.
  2. It offers way more flexibility than np.loadtxt or np.genfromtxt.
  3. However, you cannot pass a generator to pandas.read_csv as we did with np.loadtxt.
  4. In terms of speed of execution, pandas.read_csv does better than np.loadtxt.
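For comparison, here is a minimal sketch of reading similar data with pandas.read_csv and converting the result to a NumPy array. The in-memory sample with a header line is an assumption for illustration:

```python
import io
import pandas as pd

# Small CSV sample with a header line, held in memory
text = "Weight (in Kg),Height (in cm)\n73.85,241.89\n68.78,162.31\n"

df = pd.read_csv(io.StringIO(text))  # the header row becomes the column names
data = df.to_numpy()                 # plain NumPy array, as np.loadtxt would return

print(df.columns.tolist())
print(data.shape)  # (2, 2)
```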

Handling Missing Values

As discussed in our section comparing np.loadtxt with other options, np.genfromtxt handles missing values by default. We do not have any direct way of handling missing values in np.loadtxt

Here we’ll look at an indirect (and a slightly sophisticated) way of handling missing values with the np.loadtxt method.

The converters parameter:

  • np.loadtxt has a converters parameter that is used to specify the preprocessing (if any) required for each of the columns in the file.
  • For example, if the text file stores the height column in centimeters and we want to store them as inches, we can define a converter for the heights column.
  • The converters parameter accepts a dictionary where the keys are column indices and the values are methods that accept the column value, ‘convert’ it and return the modified value.
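The centimeters-to-inches case described above might be sketched as follows. The sample rows are held in memory, and the constant name CM_PER_INCH is an illustrative choice:

```python
import io
import numpy as np

CM_PER_INCH = 2.54

sample = io.StringIO("110.90, 146.03\n44.83, 211.82\n")

# Column 1 holds heights in cm; convert each value to inches while loading
converters = {1: lambda s: float(s) / CM_PER_INCH}

data = np.loadtxt(sample, delimiter=",", converters=converters)
print(data)
```

Only column 1 passes through the converter; column 0 (weight) is parsed as a float in the usual way.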

How can we use converters to handle missing values?

  • We first need to decide on the fill value, i.e., the value to be used in the positions where the actual values are missing. Let’s say we want to fill in the missing height and weight values with 0, so our fill_value will be 0.
  • Next, we can define a converter for each column in the file, which checks if there is some value or an empty string in that column and if it’s an empty string, it will fill it with our fill_value.
  • To do this, we’ll have to find the number of columns in the text file, and we have already discussed how to achieve this in an earlier section.

We’ll use the file ‘weight_height_8.txt’ which is the same as ‘weight_height_2.txt’ but has several missing values.

, 146.03
44.83, 211.82
97.13,
69.87, 207.73
, 158.87
99.25, 195.41

Let’s write the code to fill in these missing values’ positions with 0.

# finding number of columns in the file
with open("./weight_height_8.txt") as f:
    n_cols = len(f.readline().split(","))

print("Number of columns", n_cols)

# defining converters for each of the column (using 'dictionary
# comprehension') to fill each missing value with fill_value

fill_value = 0

converters = {i: lambda s: float(s.strip() or fill_value) for i in range(n_cols)}

data = np.loadtxt("./weight_height_8.txt", delimiter=",",converters = converters)

print("data shape =",data.shape)

print("First 5 rows:\n",data[:5])

Output:

The missing height and weight values have been correctly replaced with a 0. No magic!

Conclusion

numpy.loadtxt is undoubtedly one of the most standard choices for reading well-structured data stored in a text file. It offers us great flexibility in choosing various options for specifying the way we want to read the data, and wherever it doesn’t – remember there’s always a workaround!


Anonymized Data Is Not Anonymous

We have all more or less accepted that we are living in some kind of dime-store George Orwell novel where our every movement is tracked and recorded in some way. Everything we do today, especially if there’s any kind of gadget or electronics involved, generates data that is of interest to someone. That data is constantly being gathered and stored, used by someone to build up a picture of the world around us. The average person today is much more aware of the importance of their own data security. We all understand that the wrong data in the wrong hands can be used to wreak havoc on both individuals and society as a whole. Now that there is a much greater general awareness of the importance of data privacy, it is much more difficult for malicious actors to unscrupulously gather sensitive data from us, as most people know not to hand it over.


Data Protection Laws

In most jurisdictions, there are laws and regulations in place that govern how personal data can be collected, stored, shared, and accessed.

While these laws are severely lacking in a number of areas, the trend in recent years has been to increasingly protect individuals from corporate negligence and excess, which has been welcomed by most consumers.

Probably the best-known data protection law is the famed GDPR, or the General Data Protection Regulation which came into force in 2018. Though in theory it has power only within the EU, in practice the law applies to every company that deals with EU citizens.

Its strict privacy requirements have made many businesses reconsider how they handle data, threatening misbehavers with fines that can climb into billions of euros (up to 4% of the company’s annual turnover).

Unlike the EU, the US has no single regulation on the federal level to protect the data of its citizens. Acknowledging that, some states have passed their own privacy laws.

Probably the most extensive of them to date is the CCPA, or the California Consumer Privacy Act.

The act takes effect at the start of 2020 and grants the citizens of California many of the same rights that EU citizens have come to enjoy.

It will allow Californians to know what data is collected about them and where it is used, to opt out of the sale of their data, and to request its deletion.

Anonymized Data

One common theme that has emerged in the regulations from different jurisdictions is the notion of anonymized data. As the name implies, this is data that cannot be tied to a specific individual.

A set of anonymized data might be presented as belonging to a particular individual, but the identity of the subject is not revealed in the data.

Data anonymization presents an attractive common ground between the rights of consumers and those that want to make use of their personal data.

After all, information about who we are and what we do has long been the driving force behind many of today’s largest companies, including Google, Facebook, and Amazon.

But private corporations are not the only beneficiaries of our data. By removing any personally identifiable information from a dataset and anonymizing it, researchers are able to work with large and detailed datasets that contain a wealth of information without having to compromise any individual’s privacy.

By anonymizing data, we are also able to encourage people to share data that they would otherwise hold on to. Businesses and governments can access and trade vast amounts of data without infringing anyone’s privacy, thanks to anonymization.

Meanwhile, users don’t have to worry about data they generate being recorded and revealing information about them personally.

Data Anonymization Techniques

There are many ways to anonymize data, varying in cost and difficulty.

Perhaps the easiest technique is simply to remove some of the user’s direct identifiers, that is, the main pieces of personal information. For instance, an insurance company could delete a customer’s name and date of birth, and call the data as good as anonymized.

Another method is to generalize the data of multiple users to reduce their precision. For instance, you could remove the last digits of a postcode or present a person’s age in a range rather than the exact number.

It is one of the methods Google uses to achieve k-anonymity – this elaborate term simply means that a certain number of people (defined by the letter k) should share the same property, such as ZIP code.
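Generalization and the k value can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the sample records, the ten-year age bands, and the three-character postcode prefix are invented here, and it is not how Google or any other company actually implements the idea.

```python
from collections import Counter


def generalize(record):
    """Coarsen the quasi-identifiers: ten-year age band, postcode prefix."""
    low = (record["age"] // 10) * 10
    return (f"{low}-{low + 9}", record["postcode"][:3] + "**")


def k_anonymity(records):
    """Return k, the size of the smallest group sharing the same generalized values."""
    groups = Counter(generalize(r) for r in records)
    return min(groups.values())


# Made-up sample records, purely for illustration.
people = [
    {"age": 34, "postcode": "90210"},
    {"age": 37, "postcode": "90211"},
    {"age": 31, "postcode": "90245"},
    {"age": 52, "postcode": "10001"},
    {"age": 58, "postcode": "10027"},
]

print(generalize(people[0]))  # ('30-39', '902**')
print(k_anonymity(people))    # 2
```

Here every generalized combination is shared by at least two people, so this toy dataset is 2-anonymous with respect to age and postcode.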

One more way is to introduce noise into the dataset. By noise I mean swapping the values of certain properties between individuals or groups.

For example, this method could switch your car ownership details with another person. Your profile would change, but the whole dataset would remain intact for statistical analysis.
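As a rough sketch of this kind of swapping (the function name, field names, and records below are hypothetical), the values of one column can be shuffled between records so that individual profiles change while the column's overall distribution stays the same:

```python
import random


def swap_attribute(records, field, seed=0):
    """Return copies of the records with one field's values shuffled between them."""
    values = [r[field] for r in records]
    random.Random(seed).shuffle(values)  # seeded only to make the demo repeatable
    return [{**r, field: v} for r, v in zip(records, values)]


# Made-up sample records, purely for illustration.
owners = [
    {"id": 1, "car": "sedan"},
    {"id": 2, "car": "none"},
    {"id": 3, "car": "pickup"},
    {"id": 4, "car": "none"},
]

swapped = swap_attribute(owners, "car")

# The set of values is unchanged, so counts and averages over the column still hold.
print(sorted(r["car"] for r in swapped) == sorted(r["car"] for r in owners))
```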

Finally, you can further protect the anonymized data you need to share by sampling it – that is, releasing the dataset in small batches. In theory, sampling helps to reduce the risk of re-identification.

Even if the data is enough to identify you as an individual, statistically there should be at least several other people with the same characteristics as you. Without having the whole dataset, there is no way to tell which person it really is.

Other data anonymization techniques exist, but these are some of the main ones.

Deanonymization

So, anonymization makes everyone a winner, right? Well, not quite.

Anyone who has worked extensively with data can testify as to just how little information is needed to identify a specific individual out of a database of many thousands.

One of the consequences of the massive volumes of data that now exist on all of us is that different data sources can be cross-referenced to identify common elements.

In some cases, this cross-referencing can instantly deanonymize entire data sets, depending on how exactly they have been anonymized.
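To make the idea of cross-referencing concrete, here is a minimal Python sketch. Everything in it is invented for illustration (the names, the fields, and the two tiny datasets); real attacks work on far larger data, but the principle of joining on shared attributes is the same.

```python
def link(anonymized, public, keys):
    """Match anonymized records to public records with identical quasi-identifiers."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    matches = {}
    for row in anonymized:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches[row["record_id"]] = candidates[0]
    return matches


# Hypothetical "anonymized" records: names removed, other attributes kept.
anonymized = [
    {"record_id": "r1", "zip": "02139", "birth_year": 1985, "gender": "F", "diagnosis": "asthma"},
    {"record_id": "r2", "zip": "02139", "birth_year": 1990, "gender": "M", "diagnosis": "diabetes"},
]

# Hypothetical public dataset, such as a voter roll or a social media profile dump.
public = [
    {"name": "Alice Example", "zip": "02139", "birth_year": 1985, "gender": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1990, "gender": "M"},
    {"name": "Carl Example", "zip": "02139", "birth_year": 1990, "gender": "M"},
]

print(link(anonymized, public, ["zip", "birth_year", "gender"]))
# {'r1': 'Alice Example'}
```

Record r2 stays ambiguous only because two people happen to share the same three attributes; each extra attribute in the data makes such collisions less likely.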

Researchers were able to recover surnames of US males from a database of genetic information by simply making use of publicly available internet resources.

A publicly available dataset of London’s bike-sharing service could be used not only to track trips but also who actually made them.

Anonymized Netflix movie ratings were mapped to individuals by cross-referencing them with IMDB data, thus revealing some very private facts about users. These are only a few of the many similar examples.

Since the introduction of the GDPR, a number of businesses have been looking for ways of continuing to handle large volumes of customer data without falling afoul of the new regulations.

Many organizations have come to view anonymized datasets as a means of potentially circumventing the regulations. After all, if data isn’t tied to specific individuals, it can’t infringe on their privacy.

No Such Thing as Anonymous

According to new research conducted by researchers from Imperial College London, along with their counterparts at Belgium’s Université Catholique de Louvain, it is incredibly hard to properly anonymize data.

In order for data to be completely anonymous, it needs to be presented in isolation. You can use a VPN or change your IP address, for example (you can find more information about proxy servers on Proxyway).

If enough anonymized data is given about an individual, all it takes is a simple cross-reference with other databases to ascertain who the data concerns.

Using their own prediction model, the researchers made a startling discovery: it would take only 15 pieces of demographic information to re-identify 99.98% of Americans.

What is more, only four base attributes (ZIP code, date of birth, gender, and number of children) would be needed to confidently identify 79.4% of the population of Massachusetts. According to the study, releasing data in small samples is not enough to protect an individual from detection.

Bear in mind that while researchers can deanonymize the records of an entire state from a handful of attributes, data brokers like Experian are selling anonymized datasets that contain hundreds of data points for each individual.

According to the researchers’ work, this data is anonymized in name only and anyone with the capacity to handle large datasets also has the resources to easily deanonymize them.

It doesn’t matter what methods are used to anonymize data. Even the more advanced techniques like k-anonymity might not be sufficient – not to mention that they are expensive.

In most cases, all that happens is that only immediately identifiable data like names and addresses are removed. This is far from enough.

The researchers’ findings urge us not to fall into a false sense of security. They also challenge the methods companies use to anonymize data in light of the strict regulatory requirements set forth by the GDPR and the forthcoming CCPA.

Wrap-Up

The long battle to get the average internet user to care about their data and privacy has been a tiring one. Anyone who has worked in cybersecurity over the last couple of decades can testify as to how much things have improved, but there is still a long way to go.

The notion that people’s data can be anonymized and rendered harmless is both incorrect and dangerous. It is important that people properly understand the implications of handing their data over. Don’t give up your data under the false impression that it can’t be tied to you.


SSH port forwarding (tunneling) in Linux

In this tutorial, we will cover SSH port forwarding in Linux. This is a function of the SSH utility that Linux administrators use to create encrypted and secure relays across different systems. SSH port forwarding, also called SSH tunneling, is used to create a secure connection between two or more systems. Applications can then use these tunnels to transmit data. Your data is only as secure as its encryption, which is why SSH port forwarding is a popular mechanism to use. Read on to find out more and see how to set up SSH port forwarding on your own systems.


What is SSH port forwarding?

To put it simply, SSH port forwarding involves establishing an SSH tunnel between two or more systems and then configuring the systems to transmit a specified type of traffic through that connection.

There are a few different things you can do with this: local forwarding, remote forwarding, and dynamic port forwarding. Each configuration requires its own steps to set up, so we will go over each of them later in the tutorial.

Local port forwarding is used to make an external resource available on the local network. An SSH tunnel is established to a remote system, and traffic from the local network can use that tunnel to transmit data back and forth, accessing the remote system and network as if it were a part of the local network.

Remote port forwarding is the exact opposite. An SSH tunnel is established but the remote system is able to access your local network.

Dynamic port forwarding sets up a SOCKS proxy server. You can configure applications to connect to the proxy and transmit all data through it. The most common use for this is for private web browsing or to make your connection seemingly originate from a different country or location.

SSH port forwarding can also be used to set up a virtual private network (VPN). You’ll need an extra program for this called sshuttle. We cover the details later in the tutorial.

Why use SSH port forwarding?

Since SSH creates encrypted connections, this is an ideal solution if you have applications that transmit data in plaintext or use an unencrypted protocol. This holds especially true for legacy applications.

It’s also popular to use it for connecting to a local network from the outside. For example, an employee can use SSH tunnels to connect to a company’s intranet.

You may be thinking this sounds like a VPN. The two are similar, but SSH tunnels are created for specific traffic, whereas VPNs are more for establishing general connections.

SSH port forwarding will allow you to access remote resources by just establishing an SSH tunnel. The only requirement is that you have SSH access to the remote system and, ideally, public key authentication configured for password-less SSHing.

How many sessions are possible?

Technically, you can specify as many port forwarding sessions as you’d like. TCP port numbers run from 1 to 65,535, and you are able to forward any of them that you want.

When forwarding traffic, be cognizant of the services that use certain ports. For example, port 80 is reserved for HTTP. So you would only want to forward traffic on port 80 if you intend to forward web requests.

The port you forward on your local system doesn’t have to match that of the remote server. For example, you can forward port 8080 on localhost to port 80 on the remote host.

If you don’t care what port you are using on the local system, select one between 2,000 and 10,000, since these ports are less commonly used. Ports below 1024 are privileged: they are typically reserved for well-known protocols, and binding to them requires root privileges.

Local forwarding

Local forwarding involves forwarding a port from the client system to a server. It allows you to configure a port on your system so that all connections to that port will get forwarded through the SSH tunnel.

Use the -L switch in your ssh command to specify local port forwarding. The general syntax of the command is like this:

ssh -L local_port:remote_ip:remote_port user@hostname.com

Check out the example below:

ssh -L 80:example1.com:80 example2.com

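This command opens port 80 on your local system and forwards every connection made to it through the SSH server example2.com to port 80 on example1.com. Anyone who points a web browser at http://localhost on this system will, in the background, have the request carried over the SSH tunnel and answered by example1.com. Because port 80 is a privileged port, binding it locally requires root; pick a higher local port such as 8080 if you run the command as a regular user.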

Such a command is useful when configuring external access to a company intranet or other private network resources.

Test SSH port forwarding

To see if your port forwarding is working correctly, you can use the netcat command. On the client machine (the system where you ran the ssh -L command), point netcat at the local end of the tunnel with this syntax:

nc -v localhost local_port

If the port is forwarded and data is able to traverse the connection successfully, netcat will return with a success message. If it doesn’t work, the connection will be refused or time out.

If you’re having trouble getting the port forwarding to work, make sure you’re able to ssh into the remote server normally and that you have configured the ports correctly. Also, verify that the connection isn’t being blocked by a firewall.
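If you want to script the same kind of check, a small Python helper can stand in for netcat. This is a minimal sketch using only the standard library; the function name is made up, and the host and port are whatever you would otherwise pass to nc.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


# Example call (the port number is only an example):
#   port_is_open("localhost", 8080)
```

Like nc, this only shows that something accepts connections on that port. It does not prove that the service at the far end of the tunnel is answering correctly.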

Persistent SSH tunnels (Using Autossh)

Autossh is a tool that can be used to create persistent SSH tunnels. The only prerequisite is that you need to have public key authentication configured between your systems, unless you want to be prompted for a password every time the connection dies and is reestablished.

Autossh may not be installed by default on your system, but you can quickly install it using apt, yum, or whatever package manager your distribution uses.

sudo apt-get install autossh

The autossh command is going to look pretty much identical to the ssh command we ran earlier.

autossh -L 80:example1.com:80 example2.com


Autossh will make sure that tunnels are automatically re-established in case they close because of inactivity, remote machine rebooting, network connection being lost, etc.

Remote forwarding

Remote port forwarding is used to give a remote machine access to your system. For example, if you want a service on your local computer to be accessible by systems on your company’s private network, you could configure remote port forwarding to accomplish that.

To set this up, issue an ssh command with the following syntax:

ssh -R remote_port:local_ip:local_port user@hostname.com

If you have a local web server on your computer and would like to grant access to it from a remote network, you could forward port 8080 (common http alternative port) on the remote system to port 80 (http port) on your local system.

ssh -R 8080:localhost:80 geek@likegeeks.com


Dynamic forwarding

SSH dynamic port forwarding will make SSH act as a SOCKS proxy server. Rather than forwarding traffic to one fixed destination and port (the way local and remote port forwarding do), this will forward traffic to whatever host and port each application requests through the proxy.

If you have ever used a proxy server to visit a blocked website or view location-restricted content (like viewing stuff on Netflix that isn’t available in your country), you probably used a SOCKS server.

It also provides privacy, since you can route your traffic through a SOCKS server with dynamic port forwarding and prevent anyone from snooping log files to see your network traffic (websites visited, etc).

To set up dynamic port forwarding, use the ssh command with the following syntax:

ssh -D local_port user@hostname.com

So, if we wanted to forward traffic on port 1234 to our SSH server:

ssh -D 1234 geek@likegeeks.com

Once you’ve established this connection, you can configure applications to route traffic through it. For example, in your web browser’s SOCKS proxy settings, type the loopback address (127.0.0.1) and the port you configured for dynamic port forwarding, and all traffic will be forwarded through the SSH tunnel to the remote host (in our example, the likegeeks.com SSH server).

Multiple forwarding

For local port forwarding, if you’d like to set up more than one port to be forwarded to a remote host, you just need to specify each rule with a new -L switch each time. The command syntax is like this:

ssh -L local_port_1:remote_ip:remote_port_1 -L local_port_2:remote_ip:remote_port_2 user@hostname.com

For example, if you want to forward ports 8080 and 4430 to 192.168.1.1 ports 80 and 443 (HTTP and HTTPS), respectively, you would use this command:

ssh -L 8080:192.168.1.1:80 -L 4430:192.168.1.1:443 user@hostname.com

For remote port forwarding, you can set up more than one port to be forwarded by specifying each new rule with its own -R switch. The command syntax is like this:

ssh -R remote_port1:local_ip:local_port1 -R remote_port2:local_ip:local_port2 user@hostname.com
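When the list of rules grows, it can help to generate the command rather than type it. The following Python sketch is one way to do that; the helper name and its parameters are hypothetical, and it only builds the argument list, repeating -L or -R once per rule.

```python
def build_ssh_command(destination, local=(), remote=()):
    """Build an ssh argument list with one -L or -R flag per forwarding rule."""
    argv = ["ssh"]
    for listen_port, host, port in local:
        argv += ["-L", f"{listen_port}:{host}:{port}"]
    for listen_port, host, port in remote:
        argv += ["-R", f"{listen_port}:{host}:{port}"]
    argv.append(destination)
    return argv


cmd = build_ssh_command(
    "user@hostname.com",
    local=[(8080, "192.168.1.1", 80), (4430, "192.168.1.1", 443)],
)
print(" ".join(cmd))
# ssh -L 8080:192.168.1.1:80 -L 4430:192.168.1.1:443 user@hostname.com
```

The resulting list can be passed to subprocess.run(cmd) if you want Python to start the tunnel as well.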

List port forwarding

You can see what SSH tunnels are currently established with the lsof command.

lsof -i | egrep '\<ssh\>'

SSH tunnels

In this screenshot, you can see that there are 3 SSH tunnels established. Add the -n flag to have IP addresses listed instead of resolving the hostnames.

lsof -i -n | egrep '\<ssh\>'


Limit forwarding

By default, SSH port forwarding is pretty open. You can freely create local, remote, and dynamic port forwards as you please.

But if you don’t trust some of the SSH users on your system, or you’d just like to enhance security in general, you can put some limitations on SSH port forwarding.

There are a couple of different settings you can configure inside the sshd_config file to put limitations on port forwarding. To configure this file, edit it with vi, nano, or your favorite text editor:

sudo vi /etc/ssh/sshd_config

PermitOpen can be used to specify the destinations to which port forwarding is allowed. If you only want to allow forwarding to certain IP addresses or hostnames, use this directive. The syntax is as follows:

PermitOpen host:port

PermitOpen IPv4_addr:port

PermitOpen [IPv6_addr]:port

AllowTCPForwarding can be used to turn SSH port forwarding on or off, or specify what type of SSH port forwarding is permitted. Possible configurations are:

AllowTCPForwarding yes #default setting

AllowTCPForwarding no #prevent all SSH port forwarding

AllowTCPForwarding local #allow only local SSH port forwarding

AllowTCPForwarding remote #allow only remote SSH port forwarding

To see more information about these options, you can check out the man page:

man sshd_config

Low latency

The only real problem that arises with SSH port forwarding is that there is usually a bit of latency. You probably won’t notice this as an issue if you’re doing something minor, like accessing text files or small databases.

The problem becomes more apparent when doing network intensive activities, especially if you have port forwarding set up as a SOCKS proxy server.

The reason for the latency is that SSH is tunneling TCP over TCP. This is a terribly inefficient way to transfer data and will result in slower network speeds.

You could use a VPN to prevent the issue, but if you are determined to stick with SSH tunnels, there is a program called sshuttle that corrects the issue. Ubuntu and Debian-based distributions can install it with apt-get:

sudo apt-get install sshuttle

If the package manager on your distribution doesn’t have sshuttle in its repository, you can clone it from GitHub:

git clone https://github.com/sshuttle/sshuttle.git

cd sshuttle

./setup.py install

Setting up a tunnel with sshuttle is different from the normal ssh command. To set up a tunnel that forwards all traffic (akin to a VPN):

sudo sshuttle -r user@remote_ip -x remote_ip 0/0 -vv


Break the connection with a ctrl+c key combination in the terminal. Alternatively, to run the sshuttle command as a daemon, add the -D switch to your command.

Want to make sure that the connection was established and the internet sees you at the new IP address? You can run this curl command:

curl ipinfo.io


I hope you find the tutorial useful. Keep coming back.


15+ examples for Linux cURL command

In this tutorial, we will cover the cURL command in Linux. Follow along as we guide you through the functions of this powerful utility with examples to help you understand everything it’s capable of. The cURL command is used to download or upload data to a server, using one of its 20+ supported protocols. This data could be a file, email message, or web page.

What is cURL command?

cURL is an ideal tool for interacting with a website or API, sending requests and displaying the responses to the terminal or logging the data to a file. Sometimes it’s used as part of a larger script, handing off the retrieved data to other functions for processing. Since cURL can be used to retrieve files from servers, it’s often used to download part of a website. It performs this function well, but sometimes the wget command is better suited for that job. We’ll go over some of the differences and similarities between wget and cURL later in this article. We’ll show you how to get started using cURL in the sections below.


Download a file

The most basic command we can give to cURL is to download a website or file. cURL will use HTTP as its default protocol unless we specify a different one. To download a website, just issue this command:

curl http://www.google.com

Of course, enter any website or page that you want to retrieve.


Doing a basic command like this with no extra options will rarely be useful, because this only tells cURL to retrieve the source code of the page you’ve provided.

When we ran our command, our terminal filled with HTML and other web scripting code, which is not particularly useful to us in this form.

Let’s download the website as an HTML document instead, so that the content can be displayed in a browser. Add the --output option, followed by a file name, to cURL to achieve this.

Now the website we downloaded can be opened and displayed in a web browser.


If you’d like to download an online file, the command is about the same. But make sure to append the --output option to cURL as we did in the example above.

If you fail to do so, cURL will send the binary output of the online file to your terminal, which will likely cause it to malfunction.

Here’s what it looks like when we initiate the download of a 500KB Word document.

The Word document begins to download and the current progress of the download is shown in the terminal. When the download completes, the file will be available in the directory we saved it to.

In this example, no directory was specified, so it was saved to our present working directory (the directory from which we ran the cURL command).

Also, did you notice the -L option that we specified in our cURL command? It was necessary in order to download this file, and we go over its function in the next section.

Follow redirect

If you get an empty output when trying to cURL a website, it probably means that the website told cURL to redirect to a different URL. By default, cURL won’t follow the redirect, but you can tell it to with the -L switch.

curl -L www.likegeeks.com


In our research for this article, we found it was necessary to specify the -L on a majority of websites, so be sure to remember this little trick. You may even want to append it to the majority of your cURL commands by default.

Stop and resume download

If your download gets interrupted, or if you need to download a big file but don’t want to do it all in one session, cURL provides an option to stop and resume the transfer.

To stop a transfer manually, you can just end the cURL process the same way you’d stop almost any process currently running in your terminal, with a ctrl+c combination.

Our download began but was interrupted with ctrl+c. Now let’s resume it with the following syntax:

curl -C - example.com/some-file.zip --output MyFile.zip

The -C switch is what resumes our file transfer, but also notice that there is a dash (-) directly after it. This tells cURL to resume the file transfer, but to first look at the already downloaded portion in order to see the last byte downloaded and determine where to resume.


Our file transfer was resumed and then proceeded to finish downloading successfully.
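In HTTP terms, resuming amounts to asking the server for the bytes that come after the ones already on disk. The Python sketch below illustrates that offset calculation only; it is not cURL's actual implementation, and the helper name is made up.

```python
import os


def resume_range_header(path):
    """Return a Range header value that continues after the bytes already on disk."""
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    return f"bytes={offset}-"


# For a partial file of 1,024 bytes this yields "bytes=1024-",
# meaning "send everything from byte 1024 onward".
```

This only works when the server supports range requests; if it doesn't, the download has to start over.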

Specify timeout

If you want cURL to abandon what it’s doing after a certain amount of time, you can specify a timeout in the command. This is especially useful because some operations in cURL don’t have a timeout by default, so one needs to be specified if you don’t want it getting hung up indefinitely.

You can specify a maximum time to spend executing a command with the -m switch. When the specified time has elapsed, cURL will exit whatever it’s doing, even if it’s in the middle of downloading or uploading a file.

cURL expects your maximum time to be specified in seconds. So, to timeout after one minute, the command would look like this:

curl -m 60 example.com

Another type of timeout that you can specify with cURL is the amount of time to spend connecting. This helps make sure that cURL doesn’t spend an unreasonable amount of time attempting to contact a host that is offline or otherwise unreachable.

It, too, accepts seconds as an argument. The option is written as --connect-timeout.

curl --connect-timeout 60 example.com

Using a username and a password

You can specify a username and password in a cURL command with the -u switch. For example, if you wanted to authenticate with an FTP server, the syntax would look like this:

curl -u username:password ftp://example.com

curl authenticate

You can use this with any protocol, but FTP is frequently used for simple file transfers like this.

If we wanted to download the file displayed in the screenshot above, we just issue the same command but use the full path to the file.

curl -u username:password ftp://example.com/readme.txt


Use proxies

It’s easy to direct cURL to use a proxy before connecting to a host. cURL will expect an HTTP proxy by default, unless you specify otherwise.

Use the -x switch to define a proxy. Since no protocol is specified in this example, cURL will assume it’s an HTTP proxy.

curl -x 192.168.1.1:8080 http://example.com

This command would use 192.168.1.1 on port 8080 as a proxy to connect to example.com.

You can use it with other protocols as well. Here’s an example of what it’d look like to use an HTTP proxy to cURL to an FTP server and retrieve a file.

curl -x 192.168.1.1:8080 ftp://example.com/readme.txt

cURL supports many other types of proxies and options to use with those proxies, but expanding further would be beyond the scope of this guide. Check out the cURL man page for more information about proxy tunneling, SOCKS proxies, authentication, etc.

Chunked download large files

We’ve already shown how you can stop and resume file transfers, but what if we wanted cURL to only download a chunk of a file? That way, we could download a large file in multiple chunks.

It’s possible to download only certain portions of a file, in case you needed to stay under a download cap or something like that. The --range flag is used to accomplish this.

Sizes must be written in bytes. So if we wanted to download the latest Ubuntu .iso file in 100 MB chunks, our first command would look like this:

curl --range 0-99999999 http://releases.ubuntu.com/18.04/ubuntu-18.04.3-desktop-amd64.iso --output ubuntu-part1

The second command would need to pick up at the next byte and download another 100 MB chunk.

curl --range 100000000-199999999 http://releases.ubuntu.com/18.04/ubuntu-18.04.3-desktop-amd64.iso --output ubuntu-part2

Repeat this process until all the chunks are downloaded. The last step is to combine the chunks into a single file, which can be done with the cat command.

cat ubuntu-part? > ubuntu-18.04.3-desktop-amd64.iso
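Working out the byte offsets by hand gets tedious, so here is a small Python sketch that generates them. The 250 MB total size is a made-up figure for illustration, and URL is a placeholder; in practice you would take the real size from the server's Content-Length header.

```python
def byte_ranges(total_size, chunk_size):
    """Yield inclusive 'start-end' strings that together cover total_size bytes."""
    for start in range(0, total_size, chunk_size):
        end = min(start + chunk_size, total_size) - 1
        yield f"{start}-{end}"


# Hypothetical 250 MB file split into 100 MB chunks.
ranges = list(byte_ranges(250_000_000, 100_000_000))

for number, byte_range in enumerate(ranges, start=1):
    # Zero-padded part numbers keep the files in order when a shell glob sorts them.
    print(f"curl --range {byte_range} URL --output part{number:02d}")
```

Zero-padding matters once you have ten or more chunks, because the shell sorts glob matches as text and would otherwise place part10 before part2.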

Client certificate

To access a server using certificate authentication instead of basic authentication, you can specify a certificate file with the --cert option.

curl --cert path/to/cert.crt:password ftp://example.com

cURL has a lot of options for the format of certificate files.

There are more certificate related options, too: --cacert, --cert-status, --cert-type, etc. Check out the man page for a full list of options.

Silent cURL

If you’d like to suppress cURL’s progress meter and error messages, the -s switch provides that feature. It will still output the data you request, so if you’d like the command to be 100% silent, you’d need to direct the output to a file.

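Combine this command with the -O flag to save the file in your present working directory under its remote name. This will ensure that cURL produces no output at all. Note that -O needs a file name in the URL to know what to call the saved file.

curl -s -O http://example.com/index.html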

Alternatively, you could use the --output option to choose where to save the file and specify a name.

curl -s http://example.com --output index.html


Get headers

Grabbing the headers of a remote address is very simple with cURL. You just need to use the -I option.

curl -I example.com

If you combine this with the -L option, cURL will return the headers of every address that it’s redirected to.

curl -I -L example.com

Multiple headers

You can pass headers to cURL with the -H option. And to pass multiple headers, you just need to use the -H option multiple times. Here’s an example:

curl -H 'Connection: keep-alive' -H 'Accept-Charset: utf-8' http://example.com

Post (upload) file

POST is a common way for websites to accept data. For example, when you fill out a form online, there’s a good chance that the data is being sent from your browser using the POST method. To send data to a website in this way, use the -d option.

curl -d 'name=geek&location=usa' http://example.com

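To send the contents of a file as the POST data, rather than text typed on the command line, prefix the file name with @. Note that -d strips carriage returns and newlines from the file; use --data-binary if the content must be sent unchanged.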

curl -d @filename http://example.com

Use as many -d flags as you need in order to specify all the different data or filenames that you are trying to upload.

You can use the -T option if you want to upload a file to an FTP server.

curl -T myfile.txt ftp://example.com/some/directory/

Send an email

Sending an email is simply uploading data from your computer (or another device) to an email server. Since cURL is able to upload data, we can use it to send emails. There are a slew of options, but here’s an example of how to send an email through an SMTP server:

curl smtp://mail.example.com --mail-from me@example.com --mail-rcpt john@domain.com --upload-file email.txt

Your email file would need to be formatted correctly. Something like this:
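At a minimum, the file needs header lines (From, To, Subject), a blank line, and then the body. As a hedged sketch, Python's standard email library can generate such a file. The addresses come from the command above, while the subject and body text are placeholders.

```python
from email.message import EmailMessage


def build_email():
    """Return a simple plain-text message as a string: headers, blank line, body."""
    msg = EmailMessage()
    msg["From"] = "me@example.com"
    msg["To"] = "john@domain.com"
    msg["Subject"] = "Test message"  # placeholder subject
    msg.set_content("Hello John,\nthis was sent with cURL.\n")  # placeholder body
    return msg.as_string()


text = build_email()
print(text)

# To save it under the name used in the cURL command:
#   with open("email.txt", "w") as f:
#       f.write(text)
```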

As usual, more granular and specialized options can be found in the man page of cURL.

Read email message

cURL supports IMAP (and IMAPS) and POP3, both of which can be used to retrieve email messages from a mail server.

Login using IMAP like this:

curl -u username:password imap://mail.example.com

This command will list available mailboxes, but not view any specific message. To view a message, specify its UID with the -X option.

curl -u username:password imap://mail.example.com -X 'UID FETCH 1234'

Difference between cURL and wget

Sometimes people confuse cURL and wget because they’re both capable of retrieving data from a server. But this is the only thing they have in common.

We’ve shown in this article what cURL is capable of. wget provides a different set of functions. wget is the best tool for downloading websites and is capable of recursively traversing directories and links to download entire sites.

For downloading websites, use wget. If using some protocol other than HTTP or HTTPS, or for uploading files, use cURL. cURL is also a good option for downloading individual files from the web, although wget does that fine, too.

I hope you find the tutorial useful. Keep coming back.


30 Examples for Awk Command in Text Processing

In the previous post, we talked about the sed command and saw many examples of using it in text processing. It does the job well, but it has some limitations. Sometimes you need something more powerful that gives you more control over processing data. This is where the awk command comes in. The awk command, or GNU awk specifically, provides a scripting language for text processing. With the awk scripting language, you can do the following: a) Define variables, b) Use string and arithmetic operators, c) Use control flow and loops, d) Generate formatted reports. In fact, you can process log files that contain millions of lines and output a readable report.


Awk Options

The awk command is used like this:

$ awk options program file

Awk can take the following options:

-F fs     To specify a field separator.

-f file     To specify a file that contains awk script.

-v var=value     To declare a variable.

We will see how to process files and print results using awk.

Read AWK Scripts

To define an awk script, use braces surrounded by single quotation marks like this:

$ awk '{print "Welcome to awk command tutorial "}'

Because no input file is given, awk reads from standard input. For every line you type, it prints the welcome string we provided.

To terminate the program, press Ctrl+D. It may look tricky, but don’t panic; the best is yet to come.

Using Variables

With awk, you can process text files. Awk assigns some variables for each data field found:

  • $0 for the whole line.
  • $1 for the first field.
  • $2 for the second field.
  • $n for the nth field.

Whitespace, such as a space or a tab, is the default separator between fields in awk.

Check this example and see how awk processes it:

$ awk '{print $1}' myfile

The above example prints the first word of each line.

Sometimes the separator in a file is neither a space nor a tab but something else. You can specify it using the -F option:

$ awk -F: '{print $1}' /etc/passwd

This command prints the first field in the passwd file. We use the colon as a separator because the passwd file uses it.

Using Multiple Commands

To run multiple commands, separate them with a semicolon like this:

$ echo "Hello Tom" | awk '{$2="Adam"; print $0}'

The first command sets the $2 field to Adam. The second command prints the entire line.

Reading The Script From a File

You can type your awk script in a file and specify that file using the -f option.

Our file contains this script:

{print $1 " home at " $6}

$ awk -F: -f testfile /etc/passwd

Here we print the username and his home path from /etc/passwd, and surely the separator is specified with capital -F which is the colon.

You can also write your awk script file across multiple lines like this:

{

text = " home at "

print $1 text $6

}

$ awk -F: -f testfile /etc/passwd
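The whole round trip can be sketched in one go. The snippet below writes a script to a temporary file (created with mktemp, an assumption about your system), runs it with -f against a made-up passwd-style line, and cleans up afterwards.

```shell
# Write a small awk script to a temporary file.
script=$(mktemp)
cat > "$script" <<'EOF'
{
  text = " home at "
  print $1 text $6
}
EOF

# Hypothetical sample line instead of the real /etc/passwd.
result=$(printf 'alice:x:1000:1000::/home/alice:/bin/bash\n' | awk -F: -f "$script")
rm -f "$script"
echo "$result"   # alice home at /home/alice
```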

Awk Preprocessing

If you need to create a title or a header for your result, you can use the BEGIN keyword. Its block runs before any data is processed:

$ awk 'BEGIN {print "Report Title"}'

Let’s apply it to a file so that we can see the result:

$ awk 'BEGIN {print "The File Contents:"}

{print $0}' myfile

Awk Postprocessing

To run a script after processing the data, use the END keyword:

$ awk 'BEGIN {print "The File Contents:"}

{print $0}

END {print "File footer"}' myfile

This is useful for adding a footer, for example.

Let’s combine them together in a script file:

BEGIN {

print "Users and their corresponding home"

print " UserName \t HomePath"

print "___________ \t __________"

FS=":"

}

{

print $1 " \t " $6

}

END {

print "The end"

}

First, the BEGIN block prints the header and sets FS. The main block then prints the username and home path for each record, and the END block prints the footer.

$ awk -f myscript /etc/passwd
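The same three-part structure also works inline. In this sketch the main block counts records in a variable (count, an illustrative name) and the END block reports the total. The two input lines are made up.

```shell
result=$(printf 'alice:/home/alice\nbob:/home/bob\n' |
  awk 'BEGIN { FS = ":"; print "User" }   # runs once, before any input
       { print $1; count++ }              # runs for every record
       END { print "Total: " count }')    # runs once, after all input
echo "$result"
```

This prints the header, one line per user, and then "Total: 2".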

Built-in Variables

We saw that the data field variables $1, $2, $3, etc. extract data fields, and we also dealt with the field separator FS.

But these are not the only variables; there are more built-in variables.

The following list shows some of the built-in variables:

FIELDWIDTHS     Specifies the width of each field (a GNU awk extension).

RS     Specifies the record separator.

FS     Specifies the field separator.

OFS     Specifies the output field separator.

ORS     Specifies the output record separator.

By default, OFS is a single space. You can set the OFS variable to the separator you need:

$ awk 'BEGIN{FS=":"; OFS="-"} {print $1,$6,$7}' /etc/passwd
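To see OFS and ORS side by side, here is a short sketch on made-up input. The comma in print inserts OFS between values, and ORS is written after each record in place of the usual newline.

```shell
# OFS joins the printed values; ORS ends each printed record.
result=$(printf 'a b\nc d\n' | awk 'BEGIN { OFS = "-"; ORS = ";" } { print $1, $2 }')
echo "$result"   # a-b;c-d;
```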

Sometimes the fields are laid out in fixed-width columns without a separator. In these cases, the FIELDWIDTHS variable solves the problem.

Suppose we have this content:

1235.96521

927-8.3652

36257.8157

$ awk 'BEGIN{FIELDWIDTHS="3 4 3"}{print $1,$2,$3}' testfile

Look at the output. There are 3 output fields per line, and the length of each field matches exactly what we assigned in FIELDWIDTHS.
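FIELDWIDTHS is specific to GNU awk. If your awk does not support it, a different technique gives the same split: the standard substr() function. The sketch below is a swapped-in alternative, not the FIELDWIDTHS mechanism itself, and it cuts the first sample line into the same widths of 3, 4, and 3 characters.

```shell
# substr(string, start, length) with 1-based positions:
# characters 1-3, 4-7, and 8-10 of the line.
result=$(printf '1235.96521\n' | awk '{ print substr($0, 1, 3), substr($0, 4, 4), substr($0, 8, 3) }')
echo "$result"   # 123 5.96 521
```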

Suppose that your data are distributed on different lines like the following:

Person Name
123 High Street
(222) 466-1234

Another person
487 High Street
(523) 643-8754

In the above example, awk fails to process the fields properly with its defaults because the fields are separated by newlines and not spaces.

You need to set FS to the newline (\n) and RS to an empty string, so that blank lines are treated as record separators.

$ awk 'BEGIN{FS="\n"; RS=""} {print $1,$3}' addresses

Awesome! We can read the records and fields properly.
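Here is a self-contained sketch of the same idea that supplies the address data through printf, with one blank line between the two records. It also prints NR to show that each three-line block counts as a single record.

```shell
result=$(printf 'Person Name\n123 High Street\n(222) 466-1234\n\nAnother person\n487 High Street\n(523) 643-8754\n' |
  awk 'BEGIN { FS = "\n"; RS = "" } { print NR ": " $1 " / " $3 }')
echo "$result"
```

Each output line shows the record number, the name, and the phone number.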

More Variables

There are some other variables that help you to get more information:

ARGC     Retrieves the number of passed parameters.

ARGV     Retrieves the command line parameters.

ENVIRON     Array of the shell environment variables and corresponding values.

FILENAME    The file name that is processed by awk.

NF     Fields count of the line being processed.

NR    The total number of records processed so far.

FNR     The number of the current record within the current input file.

IGNORECASE     To ignore the character case.

You can review the previous shell scripting post to learn more about these variables.

Let’s test them.

$ awk 'BEGIN{print ARGC,ARGV[1]}' myfile

The ENVIRON variable retrieves the shell environment variables like this:

$ awk '

BEGIN{

print ENVIRON["PATH"]

}'

You can also pass bash variables to awk without using ENVIRON, by means of the -v option:

$ echo | awk -v home="$HOME" '{print "My home is " home}'
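Both ways of getting a shell value into awk can be sketched together. The names name, who, and APP_MODE are made up for illustration. Quoting the shell variable matters when its value contains spaces.

```shell
# Pass a shell variable with -v; the quotes keep the space inside one argument.
name="Ada Lovelace"
result=$(awk -v who="$name" 'BEGIN { print "Hello, " who }')
echo "$result"       # Hello, Ada Lovelace

# Read an environment variable through ENVIRON instead.
result_env=$(APP_MODE=test awk 'BEGIN { print ENVIRON["APP_MODE"] }')
echo "$result_env"   # test
```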

The NF variable holds the number of fields in the current record, so $NF refers to the last field without you needing to know its position:

$ awk 'BEGIN{FS=":"; OFS=":"} {print $1,$NF}' /etc/passwd

Prefixing NF with a dollar sign, as in $NF, uses it as a data field variable.
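A short sketch on made-up input shows the difference between the two: NF is a count, while $NF is the content of the last field.

```shell
# First line has 3 fields, second line has 2.
result=$(printf 'a b c\nd e\n' | awk '{ print NF, $NF }')
echo "$result"
```

The first line prints "3 c" and the second prints "2 e".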

Let’s take a look at these two examples to know the difference between FNR and NR variables:

$ awk 'BEGIN{FS=","}{print $1,"FNR="FNR}' myfile myfile

In this example, the awk command is given two input files: the same file, processed twice. The output is the first field value and the FNR variable.

Now, check the NR variable and see the difference:

$ awk '

BEGIN {FS=","}

{print $1,"FNR="FNR,"NR="NR}

END{print "Total",NR,"processed lines"}' myfile myfile

The FNR variable resets to 1 when awk reaches the second file, but the NR variable keeps counting.
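To reproduce this without preparing myfile, the sketch below creates a two-line temporary file (using mktemp, which is assumed to be available) and passes it to awk twice.

```shell
# Temporary two-line input file, passed twice on the command line.
f=$(mktemp)
printf 'a\nb\n' > "$f"
result=$(awk '{ printf "%s FNR=%d NR=%d\n", $1, FNR, NR }' "$f" "$f")
rm -f "$f"
echo "$result"
```

FNR goes 1, 2, 1, 2 while NR goes 1, 2, 3, 4.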

User Defined Variables

Variable names can be made of letters, digits, and underscores, but they can’t begin with a digit.

You can assign a variable as in shell scripting like this:

$ awk '

BEGIN{

test="Welcome to LikeGeeks website"

print test

}'

Structured Commands

The awk scripting language supports the if conditional statement.

The testfile contains the following:

10

15

6

33

45

$ awk '{if ($1 > 30) print $1}' testfile

It is that simple.

You should use braces if you want to run multiple statements:

$ awk '{

if ($1 > 30)

{

x = $1 * 3

print x

}

}' testfile

You can also type the statements on the same line, separating them with semicolons inside the braces.
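A one-line version of the previous script might look like the following sketch. It uses three of the sample values from testfile, supplied through printf so that it runs on its own.

```shell
# Same logic as the multi-line script, written on one line.
result=$(printf '10\n33\n45\n' | awk '{ if ($1 > 30) { x = $1 * 3; print x } }')
echo "$result"
```

Only 33 and 45 pass the test, so the output is 99 and 135.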

While Loop

You can use the while loop to iterate over data with a condition.

$ cat testfile

124 127 130

112 142 135

175 158 245

118 231 147

$ awk '{

sum = 0

i = 1

while (i < 4)

{

sum += $i

i++

}

average = sum / 3

print "Average:",average

}' testfile

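On each pass, the while loop adds the value of the current field, $i, to the sum variable and then increments i. It stops when the loop condition becomes false, and the sum is then divided by the number of fields to get the average.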

You can exit the loop using the break command like this:

$ awk '{

tot = 0

i = 1

while (i < 5)

{

tot += $i

if (i == 3)

break

i++

}

average = tot / 3

print "Average is:",average

}' testfile

The for Loop

The awk scripting language supports for loops:

$ awk '{

total = 0

for (var = 1; var < 4; var++)

{

total += $var

}

avg = total / 3

print "Average:",avg

}' testfile
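The loops above hard-code the number of fields. As a sketch of a more flexible variant, you can use NF as the loop bound so the same script works for lines with any number of fields. The second input line here is made up to show a different field count.

```shell
# Loop from field 1 to the last field (NF) and divide by the field count.
result=$(printf '124 127 130\n10 20\n' |
  awk '{ total = 0
         for (i = 1; i <= NF; i++) total += $i
         print "Average:", total / NF }')
echo "$result"
```

This prints "Average: 127" for the first line and "Average: 15" for the second.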

Formatted Printing

The printf command in awk allows you to print formatted output using format specifiers.

The format specifiers are written like this:

%[modifier]control-letter

This list shows the format specifiers you can use with printf:

c     Prints a number as its corresponding character.

d     Prints an integer value.

e     Prints a number in scientific notation.

f     Prints a floating-point value.

o     Prints an octal value.

s     Prints a text string.

Here we use printf to format our output:

$ awk 'BEGIN{

x = 100 * 100

printf "The result is: %e\n", x

}'

This example prints the result in scientific notation.

We are not going to try every format specifier. You know the concept.
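The modifier part of a specifier is easiest to understand from a sketch. Below, a minus sign left-aligns within a width of 5, a plain width of 5 right-aligns, and .2 limits a float to two decimal places. The values are arbitrary.

```shell
# %-5s  left-aligned string, padded to 5 characters
# %5d   right-aligned integer, padded to 5 characters
# %.2f  float rounded to 2 decimal places
result=$(awk 'BEGIN { printf "%-5s|%5d|%.2f\n", "ab", 42, 3.14159 }')
echo "$result"   # ab   |   42|3.14
```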

Built-In Functions

Awk provides several built-in functions like:

Mathematical Functions

If you love math, you can use these functions in your awk scripts:

sin(x) | cos(x) | sqrt(x) | exp(x) | log(x) | rand()

You can call them like any other function:

$ awk 'BEGIN{x=exp(5); print x}'

String Functions

There are many string functions. We will examine one of them as an example, and the rest are used the same way:

$ awk 'BEGIN{x = "likegeeks"; print toupper(x)}'

The function toupper converts character case to upper case for the passed string.
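As a quick sketch of a few more standard string functions, the following combines length(), substr(), and index() with toupper() on the same sample string.

```shell
# length(s)        number of characters in s
# substr(s, 1, 4)  4 characters starting at position 1
# index(s, t)      1-based position of t in s (0 if absent)
result=$(awk 'BEGIN { s = "likegeeks"; print length(s), toupper(substr(s, 1, 4)), index(s, "geeks") }')
echo "$result"   # 9 LIKE 5
```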

User Defined Functions

You can define your own functions and use them like this:

$ awk '

function myfunc()

{

printf "The user %s has home path at %s\n", $1,$6

}

BEGIN{FS=":"}

{

myfunc()

}' /etc/passwd

Here we define a function called myfunc, then we use it in our script to print output using the printf function.
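Functions can also take parameters and return a value, which avoids reading $1 and $6 directly inside the function body. The sketch below is one possible variant. The function name home_line, its parameter names, and the sample passwd-style line are all made up for illustration.

```shell
result=$(printf 'alice:x:1000:1000::/home/alice:/bin/bash\n' |
  awk '
  # Hypothetical helper: builds the message from its two arguments.
  function home_line(user, path) {
    return sprintf("The user %s has home path at %s", user, path)
  }
  BEGIN { FS = ":" }
  { print home_line($1, $6) }')
echo "$result"   # The user alice has home path at /home/alice
```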

I hope you like the post.

Thank you.

likegeeks.com
