Creating an AWS-enabled local Spark

Install pyspark

We need to choose the Spark version. It can be 2.4 or later; in our case it is 2.4.6.

The installation method is with conda:

conda install -c conda-forge pyspark=2.4.6

Install java

We need Java, and the right version of it. There is a problem with Java 272, which ships with Amazon Linux 2, so we first have to remove that version and install the older one.

Query for the currently installed OpenJDK:

rpm -qa | grep java

You will see something like

java-1.8.0-openjdk.x86_64 1:

Then remove it with

yum remove jdk1.8

Going for Java 265

yum -v list java-1.8.0-openjdk-headless  --show-duplicates
yum -v list java-1.8.0-openjdk  --show-duplicates
yum install java-1.8.0-openjdk-
The headless package is installed as a dependency by the command above.

Update alternatives

alternatives --config java

AWS-enabled local Spark

Check which version of hadoop-common you have:

ls -l /opt/anaconda3/envs/advanced/lib/python2.7/site-packages/pyspark/jars/hadoop*

In our case the jars are version 2.7.3, which means we have to stick to the AWS SDK for Hadoop 2.7.3. Download hadoop-aws-2.7.3.jar and its dependency aws-java-sdk-1.7.4.jar. Great tutorial found here
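The same version check can be done from Python. This is a minimal sketch: the helper names hadoop_common_version and bundled_hadoop_version are hypothetical, and it assumes the jars sit in the jars directory of the installed pyspark package, as in the path above.

```python
import re
from pathlib import Path

# Matches names such as "hadoop-common-2.7.3.jar" and captures the version part.
_HADOOP_COMMON_JAR = re.compile(r"^hadoop-common-(\d+(?:\.\d+)*)\.jar$")


def hadoop_common_version(jar_names):
    """Return the hadoop-common version found among jar file names, or None."""
    for name in jar_names:
        match = _HADOOP_COMMON_JAR.match(Path(name).name)
        if match:
            return match.group(1)
    return None


def bundled_hadoop_version():
    """Inspect the jars shipped with the installed pyspark package."""
    import pyspark  # imported lazily so the helper above works without pyspark

    jars_dir = Path(pyspark.__file__).parent / "jars"
    return hadoop_common_version(p.name for p in jars_dir.glob("hadoop-common-*.jar"))
```

The hadoop-aws jar you download should carry the same version number that this reports.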

So the final code to get Spark running is:

import os

from pyspark.sql import SparkSession, SQLContext


def create_local_spark():
    jars = [

    aws_1 = [

    jars_string = ",".join(jars + aws_1)
    pyspark_shell = "--jars {} --driver-memory 4G pyspark-shell".format(jars_string)

    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_shell
    os.environ["PYSPARK_PYTHON"] = "/opt/anaconda3/envs/advanced/bin/python"

    spark_session = SparkSession.builder.appName("ZZZZZ").getOrCreate()
    hadoop_conf = spark_session._jsc.hadoopConfiguration()

    hadoop_conf.set("", "true")
    hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoop_conf.set("fs.s3a.server-side-encryption-algorithm", "AES256")
    hadoop_conf.set("", "com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    hadoop_conf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")

    spark_context = spark_session.sparkContext
    sql_context = SQLContext(spark_context)
    # df ="s3a://hello/world/")
    return spark_context, sql_context
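A wrong jar path only shows up later as a class-not-found error when Spark starts, so it can help to validate the paths before building PYSPARK_SUBMIT_ARGS. The sketch below is one way to do that; build_submit_args is a hypothetical helper, and the jar paths you pass in are your own.

```python
import os


def build_submit_args(jar_paths, driver_memory="4G"):
    """Build a PYSPARK_SUBMIT_ARGS value, failing early if a jar is missing."""
    missing = [path for path in jar_paths if not os.path.isfile(path)]
    if missing:
        raise FileNotFoundError("missing jars: {}".format(", ".join(missing)))
    return "--jars {} --driver-memory {} pyspark-shell".format(
        ",".join(jar_paths), driver_memory
    )
```

The result can be assigned to os.environ["PYSPARK_SUBMIT_ARGS"] before the Spark session is created.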

Know when and who is doing ssh

I want to know whenever someone logs in over SSH on my servers, so I have set up a Telegram notification. Here is how it works.

Put this in /etc/ssh/sshrc:


telegram-send -g "Access $SSH_TTY $SSH_CONNECTION `id`" &

Of course, you need to set up telegram-send first.
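For a more readable message, the variables can be unpacked before sending. sshd sets SSH_CONNECTION to four space-separated fields: client IP, client port, server IP, and server port. The following is a rough sketch in Python; describe_ssh_login is a hypothetical helper, and its output would still have to be passed to telegram-send.

```python
import os


def describe_ssh_login(environ=None):
    """Format a short login notice from the variables sshd sets for a session."""
    environ = os.environ if environ is None else environ
    parts = environ.get("SSH_CONNECTION", "").split()
    if len(parts) != 4:
        return "Access from unknown client"
    client_ip, client_port, server_ip, server_port = parts
    tty = environ.get("SSH_TTY", "no tty")
    return "Access {} from {}:{} to {}:{}".format(
        tty, client_ip, client_port, server_ip, server_port
    )
```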

Compare two filesystems

Let's run these commands on both machines:

# On the new instance

find / -xdev | sort > new.txt

# On the old instance

find / -xdev | sort > old.txt

Pull the files locally

scp -i ~/.ssh/somekey [email protected]:/new.txt  /tmp/new.txt
scp -i ~/.ssh/somekey [email protected]:/old.txt  /tmp/old.txt

Then use this great delta tool to compare the files

delta -s /tmp/new.txt /tmp/old.txt

Using md5 sums

Alternatively, compare the files using md5 sums. This is slow!

# On the new instance

find / -xdev -type f -exec  md5sum {} \; > new-files.txt
find / -xdev -type d | sort >
sort  -k2  new-files.txt > 

# On the old instance

find / -xdev -type f -exec  md5sum {} \; > old-files.txt
find / -xdev -type d | sort > sorted.old-folders.txt
sort  -k2  old-files.txt > sorted.old-files.txt
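Once both listings are on one machine, they can be compared by path and checksum. The sketch below is one possible approach, not the only one. The function names are hypothetical, and it assumes plain md5sum output lines: a 32-character hash, two separator characters, then the path, with no escaped file names.

```python
def load_md5_listing(lines):
    """Parse md5sum output lines into a {path: hash} dictionary."""
    listing = {}
    for line in lines:
        line = line.rstrip("\n")
        if len(line) < 35:
            continue  # too short to hold a hash, the separator, and a path
        listing[line[34:]] = line[:32]
    return listing


def compare_listings(old, new):
    """Return sorted lists of paths that were added, removed, or changed."""
    old_paths, new_paths = set(old), set(new)
    added = sorted(new_paths - old_paths)
    removed = sorted(old_paths - new_paths)
    changed = sorted(p for p in old_paths & new_paths if old[p] != new[p])
    return added, removed, changed
```

To use it, open old-files.txt and new-files.txt, pass each file object to load_md5_listing, and pass the two dictionaries to compare_listings. Sorting the input first is not required, because the comparison is keyed by path.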

How to pull files from remote sftp

Create a script with the commands to pull the files and then remove them from the remote server:

get -r upload/* incoming/
rm upload/*

You will need a cron entry:

0 5 * * * /usr/bin/sftp -b [email protected]

I recommend using systemd so that you can have logs

Where it is clearer to put a line break


Create an AP on Ubuntu


