The framework ‘Microsoft.NETCore.App’, version ‘6.0.0’ was not found.

New Linux box, old home directory.

Attempting to execute dotnet ef database update repeatedly failed with the error, “The framework ‘Microsoft.NETCore.App’, version ‘6.0.0’ was not found.” Individual dotnet commands (dotnet --version, dotnet build, etc.) were working, which made the failure all the more confusing.

Google was not my friend today: the error as a search term produced lots of noise, a few GitHub issues describing old bugs, and plenty of red herrings.

I finally stumbled across the problem: the value of the environment variable DOTNET_ROOT was wrong. The value was /opt/dotnet-sdk-bin-5.0, but the installed version was 6.0.

Version 5.0 had been installed initially, but I upgraded it during the same login session. While /etc/env.d/90dotnet-sdk-bin-6.0 was installed properly, and used the correct value, it would not take effect until I logged out and back in (or rebooted).
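If you hit the same thing, a quick sanity check looks something like this (the SDK paths are from my install; yours will differ):

echo "$DOTNET_ROOT"
dotnet --list-sdks

export DOTNET_ROOT=/opt/dotnet-sdk-bin-6.0

The first two commands show what the current session thinks the SDK root is versus what is actually installed; the export fixes the current session without having to log out.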

Shame on Microsoft for their terrible, uninformative errors.

Recursively delete directories unless a specific file is present

There are several ways to do this, but my Google-fu may be weak because it took me much too long to figure this out.

I want to recursively delete directories with a specific name (or names) within a directory structure, UNLESS the matched directory contains a sentinel file.

In my case I want to make a C# directory structure “cleaner-than-clean” by removing all ‘bin’ and ‘obj’ directories, leaving just the user-generated files behind.  This is pretty easy to achieve:

#!/usr/bin/env bash

dir=/path/to/project

find "$dir" -type d \
    \( -name 'bin' -o -name 'obj' \) \
    -print

This says “find things under $dir that are directories (-type d) and are named either ‘bin’ (-name 'bin') or (-o) named ‘obj’ (-name 'obj').”  The parentheses force the two -name tests to be treated as a single condition, so the expression is true if either name matches.  If the final result is true then print the path.

Notice that I’ve escaped (\) the parentheses because I’m using bash. Most UNIX shells require these to be escaped, but yours may not. I’ve also escaped the newline at the end of each line; a single-line command may be spread over several lines this way, making it easier to read.

‘bin’ is also the conventional name for a directory of non-build executables, like helper scripts.  I do have some, including this cleaner-than-clean cleaning script that I’m working out, and don’t want to delete those by accident. The above command would find them, if they were in the directory tree.

find allows you to prune (-prune) the search tree, skipping selected directories according to certain criteria, but it doesn’t support the concept of peeking into sub-directories. Bummer.

You may, however, execute independent commands (-exec) and use the results of those commands to affect find’s behavior, including -prune. We can exec the test command, which can tell us whether our sentinel file exists.

#!/usr/bin/env bash

dir=/path/to/project
sentinel=.keep

find "$dir" \
    -type d \
    \( -name bin -o -name obj \) \
    ! -exec test -e "{}/$sentinel" ';' \
    -print

The new line executes test to see if the current path ({}) contains a file called $sentinel (I’ve defined $sentinel to be .keep but any filename will do), which returns true if it exists. The line is negated (!) so if the sentinel is found further actions are skipped.

The final step is to actually delete the directory. We call rm -Rf (-R = recursive, -f = force) because we just want the whole thing gone, no questions asked. The trailing plus (+) tells find that rm can accept multiple paths in a single call, rather than calling rm once for each path.

#!/usr/bin/env bash

dir=/path/to/project
sentinel=.keep

find "$dir" \
    -type d \
    \( -name bin -o -name obj \) \
    ! -exec test -e "{}/$sentinel" ';' \
    -print \
    -exec rm -Rf '{}' \+
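To protect a particular directory from the sweep, drop the sentinel file into it before running the script. For example, to keep the helper-script ‘bin’ directory mentioned earlier:

touch /path/to/project/bin/.keep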

Linux, Solaris, Windows

Linux: Because rebooting is for adding hardware

Solaris: Because you don’t need to reboot to add hardware

Windows: Because rebooting is for adding hardware, adding software, regularly scheduled downtime, and should also be done on a daily basis to keep the machine running.

[attribution unknown]

CNAMEs in Samba

I’m documenting something that wasn’t easy to uncover.

TL;DR – if you want to create a CNAME in Samba to replace an existing DNS record, you must delete the A record first.

Background

I have an Active Directory domain running on Samba.  I’ve had an underpowered file server, simply called ‘files’, for a while.  I finally had a chance to upgrade it to some newer hardware with a rather large SSD.

Since this, like all my home projects, is a side-project that takes several days to complete I chose to build the new server (‘concord’) and get it running while leaving ‘files’ in-place.

I like to have servers named after their roles, because it makes things easy, but we have a lot more computers than formal roles in the house.  We’ve finally settled on a naming convention: Windows names are places in Washington, Apple products are from California, and Linux products are from Massachusetts.  (I am aware that Unix was birthed in New Jersey but… Ew.  At least X came from MIT, that’s good enough for me.)

I also have a number of dependencies on the name ‘files’ including, most crucially, my own brain.  Muscle memory is hard to overcome (“ls /net/files/… damn ^H/net/concord/…”) and I don’t want to relearn a server name.

That left me with three problems to solve: follow the naming standard, use a “taken” name for the server, and build said server while the needed name is still available on the network.

The obvious answer is to use CNAMEs.  I planned to set up ‘files’ as an alias to ‘concord’.  Similar practice would carry us forward through an indefinite number of role-swaps in the future.

After copying all of our data from ‘files’ to ‘concord’ I confidently shut ‘files’ down and added my CNAME.  This is where things went wrong.

The Problem

After shutting ‘files’ down, I started by creating the CNAME:

dc1 # samba-tool dns add 192.168.1.2 ad.jonesling.us files CNAME concord.ad.jonesling.us -U administrator
Password for [AD\administrator]: ******
Record added successfully

That’s all well and good.  Let’s test it out from another computer:

natick $ nslookup
> files
Server:     dc1
Address:    2001:470:1f07:583:44a:52ff:fe4a:8cee#53

Name:   files.ad.jonesling.us
Address: 192.168.1.153
files.ad.jonesling.us   canonical name = concord.ad.jonesling.us.

Crap.  That’s the correct canonical name, but the wrong IP address – it’s the old IP address of ‘files’.

Some googling uncovered someone with a similar issue back in 2012, but they “solved” it by creating static A records instead.  That’s not a great solution, certainly not what I want.

I thought about it for a few minutes.  I got a success message, but was the record actually created?  How can I tell?  What happens if I insert it again?

dc1 # samba-tool dns add 192.168.1.2 ad.jonesling.us files CNAME concord.ad.jonesling.us -U administrator
Password for [AD\administrator]: ******

ERROR(runtime): uncaught exception - (9711, 'WERR_DNS_ERROR_RECORD_ALREADY_EXISTS')
  File "/usr/lib/python3.7/site-packages/samba/netcmd/__init__.py", line 186, in _run
    return self.run(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/samba/netcmd/dns.py", line 945, in run
    raise e
  File "/usr/lib/python3.7/site-packages/samba/netcmd/dns.py", line 941, in run
    0, server, zone, name, add_rec_buf, None)

Well, it was inserted somewhere, that much is clear.

What happens if I dig it?  nslookup gave us a canonical name, but I want to see the actual DNS record.  Maybe it contains a clue.

First, let’s dig the CNAME:

dc1 # dig @dc1 files.ad.jonesling.us IN CNAME

; <<>> DiG 9.14.8 <<>> @dc1 files.ad.jonesling.us IN CNAME
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10370
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 7a0aa65a623d5d3bdbdc39075f2eff9d5b81dbd9ed05c9d0 (good)
;; QUESTION SECTION:
;files.ad.jonesling.us. IN CNAME

;; ANSWER SECTION:
files.ad.jonesling.us. 900 IN CNAME concord.ad.jonesling.us.

;; Query time: 8 msec
;; SERVER: 192.168.1.2#53(192.168.1.2)
;; WHEN: Sat Aug 08 15:40:13 EDT 2020
;; MSG SIZE rcvd: 100

The ANSWER SECTION line shows the alias.  That looks right.

But what about ‘files’?

dc1 # dig @dc1 files.ad.jonesling.us

; <<>> DiG 9.14.8 <<>> @dc1 files.ad.jonesling.us
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42296
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 0352365b5c07ecdace1ebf3c5f2effa6da5d32bfe9002b32 (good)
;; QUESTION SECTION:
;files.ad.jonesling.us. IN A

;; ANSWER SECTION:
files.ad.jonesling.us. 3600 IN A 192.168.1.153

;; Query time: 8 msec
;; SERVER: 192.168.1.2#53(192.168.1.2)
;; WHEN: Sat Aug 08 15:40:22 EDT 2020
;; MSG SIZE rcvd: 94

Ah.  That looks like a conflict.  Both records exist, and one has primacy over the other.

‘files’ was assigned an address via DHCP (I never gave it a static address), so I didn’t expect that I would need to delete anything.  But thinking about it, I realized that Samba doesn’t know that ‘files’ isn’t coming back.  (That makes me wonder what kind of graveyard DNS becomes, with friends’ phones and laptops popping in from time to time.)

So, can we delete the old A record, and what happens if we do?

The Solution

We delete the address.  It looks like it’s working:

dc1 # samba-tool dns delete 192.168.1.2 ad.jonesling.us files A 192.168.1.153 -U administrator
Password for [AD\administrator]:
Record deleted successfully

Was that the problem all along?

dc1 # dig @dc1 files.ad.jonesling.us

; <<>> DiG 9.14.8 <<>> @dc1 files.ad.jonesling.us
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38286
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 1610fb8ec07db8e3a43976ed5f2effdffeb142b30ca93848 (good)
;; QUESTION SECTION:
;files.ad.jonesling.us. IN A

;; ANSWER SECTION:
files.ad.jonesling.us. 900 IN CNAME concord.ad.jonesling.us.
concord.ad.jonesling.us. 3600 IN A 192.168.1.82

;; Query time: 15 msec
;; SERVER: 192.168.1.2#53(192.168.1.2)
;; WHEN: Sat Aug 08 15:41:20 EDT 2020
;; MSG SIZE rcvd: 116

That looks pretty good!
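So, to recap, the order matters: delete the stale A record first, then add the CNAME.  Using the same commands as above:

dc1 # samba-tool dns delete 192.168.1.2 ad.jonesling.us files A 192.168.1.153 -U administrator
dc1 # samba-tool dns add 192.168.1.2 ad.jonesling.us files CNAME concord.ad.jonesling.us -U administrator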

Securing WordPress: The Basics

This is the first in an occasional series of documents on WordPress.


WordPress is ubiquitous but fragile.  There are few alternatives that provide the easy posting, wealth of plugins, and integration of themes, while also being (basically) free to use.

It’s also a nerve-wracking exercise in keeping bots and bad actors out.  Some of the historical security holes are legendary.  It doesn’t take long to find someone whose comments section was bombed by a spammer, or whose site was outright defaced.  (I will reluctantly raise my own hand, having experienced both in years past.)

Most people that use WordPress nowadays rely on 3rd parties to host it.  This document isn’t for them; hosted security is mostly outside of your control.  That’s generally a good thing: professionals are keeping you up to date and covered by best practices.

The rest of us muddle through security and updates in piece-meal fashion, occasionally stumbling over documents like this one.

Things To Look Out For

As a rule, good server hygiene demands that you keep an eye on your logs.  Tools like goaccess help you analyze usage, but nothing beats a peek at the raw logs for noticing issues cropping up.
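If you want a quick way to watch for the noisier requests described below, something as simple as this works (adjust the log path for your own Apache layout):

tail -f /var/log/apache2/access.log | grep -E 'wp-login\.php|xmlrpc\.php'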

The Good Bots

Sleepy websites like mine show a high proportion of “good” bots like Googlebot, compared to human traffic.  They’re doing good things like crawling (indexing) your site.

In my case they are the primary visitor base to my site, generating hundreds or even thousands of individual requests per day.  Hopefully your own WordPress site has a better visitor-to-bot ratio than mine.

We don’t want to block these guys from their work; they’re actually helpful.

The Bad Bots

You’ll also see bad bots, possibly lots of them.  Most are attempting to guess user credentials so they can post things on your WordPress site.

Some are fairly up-front about it:

...
132.232.47.138 [07:51:14] "POST /xmlrpc.php HTTP/1.1"
132.232.47.138 [07:51:14] "POST /xmlrpc.php HTTP/1.1"
132.232.47.138 [07:51:15] "POST /xmlrpc.php HTTP/1.1"
132.232.47.138 [07:51:16] "POST /xmlrpc.php HTTP/1.1"
132.232.47.138 [07:51:16] "POST /xmlrpc.php HTTP/1.1"
132.232.47.138 [07:51:18] "POST /xmlrpc.php HTTP/1.1"
...

They’ll hammer your server like that for hours.

Blocking their individual IP addresses at the firewall is devastatingly effective… for about five minutes.  Another bot from another IP will pop up soon.  Blocking individual IPs is a game of whack-a-mole.

Some are part of a “slow” botnet, hitting the same page from a unique IP address each time.  These are part of the large botnets you read about.

83.149.124.238 [05:01:06] "GET /wp-login.php HTTP/1.1" 200
83.149.124.238 [05:01:06] "POST /wp-login.php HTTP/1.1" 200
188.163.45.140 [05:03:38] "GET /wp-login.php HTTP/1.1" 200
188.163.45.140 [05:03:39] "POST /wp-login.php HTTP/1.1" 200
90.150.96.222 [05:04:30] "GET /wp-login.php HTTP/1.1" 200
90.150.96.222 [05:04:32] "POST /wp-login.php HTTP/1.1" 200
178.89.251.56 [05:04:42] "GET /wp-login.php HTTP/1.1" 200
178.89.251.56 [05:04:43] "POST /wp-login.php HTTP/1.1" 200

These are more insidious: patient and hard to spot on a heavily-trafficked blog.

Keeping WordPress Secure

You (hopefully) installed WordPress to a location outside of your “htdocs” document tree.  If not, you should fix that right away!  (Consider this “security tip #0” because without this you’re basically screwed.)

Security tip #1 is to make sure auto updates are enabled.  The slight risk of a botched release being automatically applied is much lower than the risk of a critical security patch being applied too late.

Running old software is like putting a medieval lock on your front door: it offers little real security.

Once an exploit is patched, the prior releases become easy targets as people deconstruct the patch and reverse-engineer the exploit(s) – assuming an exploit wasn’t published before the patch was released.
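If you manage your own install, one way to make sure core auto-updates stay on is a single line in wp-config.php.  Minor releases already update automatically by default; this standard WordPress constant opts into the rest:

define( 'WP_AUTO_UPDATE_CORE', true );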

Locking WordPress Down

Your Apache configuration probably contains a section similar to this:

<Directory "/path/to/wordpress">
    ...
    Require all granted
    ...
</Directory>

We’re going to add some items between <Directory></Directory> tags to restrict access to the most vulnerable pieces.

You Can’t Attack Things You Can’t Reach

We’ll start by invoking the Principle of Least Privilege: people should only be able to do the things they must do, and nothing more.

xmlrpc.php is an API for applications to talk to WordPress.  Unfortunately it doesn’t carry extra security, so if you’re a bot it’s great to hammer with your password guesses – you won’t be blocked, and no one will be alerted.

Most people don’t need it.  Unless you know you need it, you should disable it completely.

<Directory "/path/to/wordpress">
    ...
    <Files xmlrpc.php>
        <RequireAll>
            Require all denied
        </RequireAll>
    </Files>
</Directory>

There are WordPress plugins that purport to “disable” xmlrpc.php, but they deny access from within WordPress.  That means that you’ve still paid a computational price for executing xmlrpc.php, which can be steeper than you expect, and you’re still at risk of exploitable bugs within it.  Denying access to it at the server level is much safer.

You Can’t Log In If You Can’t Reach the Login Page

This next change will block anyone from outside your LAN from logging in.  That means that if you’re away from home you won’t be able to log in, either, without tunneling back home.

<Directory "/path/to/wordpress">
    ...
    <Files wp-login.php>
        <RequireAll>
            Require all granted
            # remember that X-Forwarded-For may contain multiple
            # addresses, don't just search for ^192...
            Require expr %{HTTP:X-Forwarded-For} =~ /\b192\.168\.1\./
        </RequireAll>
    </Files>
</Directory>

If you’re not using a public-facing proxy, and don’t need to look at X-Forwarded-For, you can simplify this a little:

<Directory "/path/to/wordpress">
    ...
    <Files wp-login.php>
        <RequireAll>
            Require all granted
            Require ip 192.168.1
        </RequireAll>
    </Files>
</Directory>

This will also prevent third parties from registering on your blog (and, if you require registration, from submitting comments).  That may be important to you.

Restart Apache

After inserting these blocks, you should execute Apache’s ‘configtest’ followed by reload:

$ sudo apache2ctl configtest
apache2      | * Checking apache2 configuration ...     [ ok ]
$ sudo apache2ctl reload
apache2      | * Gracefully restarting apache2 ...      [ ok ]

Now test your changes from outside your network:

[Screenshot: a request for xmlrpc.php now returns Forbidden]
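From the command line, a quick check with curl (substitute your own hostname) should show something like this:

$ curl -i https://www.example.com/xmlrpc.php
HTTP/1.1 403 Forbidden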

Apache’s access log should show a ‘403’ (Forbidden) status:

... "GET /xmlrpc.php HTTP/1.1" 403 ...

And just like that, you’ve made your WordPress blog a lot more secure.

Interestingly, after making just these changes on my own site, the attacks immediately dropped off by 90%.  I guess that the better-written bots realized that I’m not a good target anymore and stopped wasting their time, preferring lower-hanging fruit.

Failed to retrieve directory listing

[Screenshot: Filezilla connection log showing the “Failed to retrieve directory listing” error]

I occasionally run a local vsftp daemon on my development machine for testing.  I don’t connect to it directly — it’s used to back up unit tests that need an FTP connection.  No person connects to it, least of all me, and the scripts that do connect are looking at small, single-use directories.

I needed to test a new feature: FTPS, aka FTP with SSL (not to be confused with SFTP, a very different beast).  Several of our vendors will be requiring it soon; frankly, I’m surprised they haven’t required it sooner.  But I digress.

To start this phase of the project I needed to make sure that my local vsftp daemon supports FTPS so that I can run tests against it.  So I edit /etc/vsftpd/vsftpd.conf to add some lines to my config, and restart:

rsa_cert_file=/etc/ssl/private/vsftpd.pem
rsa_private_key_file=/etc/ssl/private/vsftpd.pem
ssl_enable=YES

But Filezilla bombs with an opaque error message:

Status: Resolving address of localhost
Status: Connecting to 127.0.0.1:21...
Status: Connection established, waiting for welcome message...
Status: Initializing TLS...
Status: Verifying certificate...
Status: TLS connection established.
Status: Logged in
Status: Retrieving directory listing...
Command: PWD
Response: 257 "/home/dad" is the current directory
Command: TYPE I
Response: 200 Switching to Binary mode.
Command: PASV
Response: 227 Entering Passive Mode (127,0,0,1,249,239).
Command: LIST
Response: 150 Here comes the directory listing.
Error: GnuTLS error -15: An unexpected TLS packet was received.
Error: Disconnected from server: ECONNABORTED - Connection aborted
Error: Failed to retrieve directory listing

I clue in pretty quickly that “GnuTLS error -15: An unexpected TLS packet was received” is actually a red herring, so I drop the SSL from the connection and get a different error:

Response: 150 Here comes the directory listing.
Error: Connection closed by server
Error: Failed to retrieve directory listing

Huh, that’s not particularly helpful, shame on you Filezilla.  I drop down further to a command-line FTP client to get the real error:

$ ftp localhost
Connected to localhost.
220 (vsFTPd 3.0.3)
Name (localhost:dad): 
530 Please login with USER and PASS.
530 Please login with USER and PASS.
SSL not available
331 Please specify the password.
Password:
230 Login successful.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> ls
200 PORT command successful. Consider using PASV.
150 Here comes the directory listing.
421 Service not available, remote server has closed connection
ftp> quit

Ah.  Now we’re getting somewhere.

A quick perusal turned up a stackexchange answer with the assertion that “the directory causing this behaviour had too many files in it (2,666).”  My own directory is much smaller, about a hundred files.  According to this bug report, however, the real maximum may be as few as 32 files.  It’s not clear to me whether this is a kernel bug, a vsftpd bug, or just a bad interaction between recent kernels and vsftpd.

Happily, there is a work-around: add “seccomp_sandbox=NO” to vsftpd.conf.

Since vsftpd’s documentation is spare, and actual examples are hard to come by, here’s my working config:

listen=YES
local_enable=YES
write_enable=YES
chroot_local_user=YES
allow_writeable_chroot=YES
seccomp_sandbox=NO
ssl_enable=YES
rsa_cert_file=/etc/ssl/private/vsftpd.pem
rsa_private_key_file=/etc/ssl/private/vsftpd.pem
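For completeness, the restart mentioned above is just the usual service restart; the exact command depends on your distribution and init system (this assumes systemd and a service named vsftpd):

$ sudo systemctl restart vsftpd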

Perl’s Open3, Re-Explained

I recently went spelunking into a core Perl module that I previously knew nothing about, IPC::Open3.  After fifteen years of developing in Perl I finally had a reason to use it.

If you’re reading this, it’s probably because you went looking for information on how to use open3 because the module’s documentation is bad.  I mean it’s really, really terrible.

Not only will you not know how to use open3 after reading the docs, you may become convinced that maybe open3 isn’t the module that you need, or maybe it would work but you’d just be better off looking for something else because this is too damn hard to use.

Fear not, intrepid reader, because if I can figure it out so can you.  But I will try to save you some of the legwork I went through.  There’s precious little information scattered online, because this isn’t a popular package.  My loss is your gain; hopefully this helps you.

Why IPC::Open3?

When Would I Use IPC::Open3?

open3 is used when you need to open three pipes to another process.  That might be obvious from the name as well as the package’s synopsis:

$pid = open3( \*CHLD_IN,
              \*CHLD_OUT,
              \*CHLD_ERR,
              'some cmd and args',
              'optarg', ...
            );

Why would you do that?  The most obvious situation is when you want to control STDIN, STDOUT, and STDERR simultaneously.  The example I provide below, which is not contrived by the way but adapted from real production code, does exactly that.

There Are Lots Of Modules To Make This Easier, Why Should I Use IPC::Open3?

IPC::Open3 is part of the Perl core.  There’s a lot to be said for using a library that’s already installed and doesn’t have external dependencies vs. pulling in someone’s write-once-read-never Summer of Code academic project.

In addition, the modules that I found only served to hide the complexity of Open3, but they did it badly and didn’t really remove much code compared to what I came up with.

What Else Do I Need?

One of the things that’s not obvious from the Open3 docs is that you’re not going to use IPC::Open3 by itself.  You need a couple of other packages (also part of core) in order to use it effectively.

How I Used IPC::Open3

In our example, we’re going to fork a separate process (using open3) to encrypt a file stream using gpg.  gpg will accept a stream of data, encrypt it, and output to a stream.  We also want to capture errors sent to STDERR.

In a terminal, using bash, this would be really easy:

gpg --encrypt < some_file > some_file.pgp 2> err.out

We could do all of this in Perl by writing temporary files, passing special file handle references into gpg as arguments, and capturing STDERR the old fashioned way, all using a normal open().  But where’s the fun in that?

First, let’s use the packages we’ll need:

use IO::Handle;
use IO::Select;
use IPC::Open3;

IO::Handle allows us to operate on handles using object methods.  I don’t typically use it, but this code really appreciates it.  IO::Select does the same for select, but it helps even more than IO::Handle here.

use constant INPUT_BUF_SZ  => 2**12;
use constant OUTPUT_BUF_SZ => 2**20;

You might want to experiment to find the best buffer sizes.  The input buffer should not be larger than the pipe buffer on your particular system, else you’ll block trying to put two pounds of bytes into a one pound buffer.

Now, using IO::Handle we’ll create file handles for the stdin, stdout, and stderr that our forked process will read and write to:

my ( $in,
     $out,
     $err,
   ) = ( IO::Handle->new,
         IO::Handle->new,
         IO::Handle->new
       );

Call open3, which (like fork) gives us the PID of our new process.

Note: If we don’t call waitpid later on we’ll create a zombie after we’re done.

my $pid = open3( $in, $out, $err, '/usr/bin/gpg', @gpg_options );

if ( !$pid ) {
    die "failed to open pipe to gpg";
}

One of the features of IO::Select is that it allows us to find out whether a handle is ready or would block. This is important when the output stream is dependent on the input stream, and each stream depends on a pipe of limited size.

We’re going to repeatedly loop over the handles, looking for a stream that is active, and read/write a little bit before continuing to loop.  We do this until both our input and output are exhausted.  It’s pretty likely that they’ll be exhausted at different times, i.e. we’ll be done with the input sometime before we’re done with the output.

As we exhaust each handle we remove it from the selection of possible handles, so that the main loop terminates naturally.

The value passed to can_write and can_read is the number of seconds to wait for the handle to be ready.  Non-zero timeouts cause a noticeable delay, while not setting it at all will cause us to block until the handle is ready, so for now we’ll leave it at zero.

# $unencrypted_fh and $encrypted_fh should be defined as
# handles to real files

my $sel = IO::Select->new;

$sel->add( $in, $out, $err );

# loop until we don't have any handles left

while ( my @handles = ( $sel->handles) ) {
    # read until there's nothing left
    #
    # write in small chunks so we don't overfill the buffer
    # and accidentally cause the pipe to block, which will
    # block us
    while ( my @ready = ( $sel->can_write(0) ) ) {
        for my $fh ( @ready ) {
            if ( $fh == $in ) {
                # read a small chunk from your source data
                my $read = read( $unencrypted_fh,
                                 my $bytes,
                                 INPUT_BUF_SZ,
                               );

                # and write it to our forked process
                #
                # if we're out of bytes to read, close the
                # handle
                if ( !$read ) {
                    $sel->remove( $fh );
                    $fh->close;
                }
                else {
                    syswrite( $fh, $bytes );
                }
            }
            else {
                die "unexpected filehandle for input";
            }
        }
    }

    while ( my @ready = ( $sel->can_read(0) ) ) {
        # fetch the contents of STDOUT and send it to the
        # destination
        for my $fh ( @ready ) {
            # this buffer can be much larger, though in the
            # case of gpg it will generally be much smaller
            # than the input was. The process will block if
            # the output pipe is full, so you want to pull as
            # much out as you can.

            my $read = sysread( $fh, my $bytes, OUTPUT_BUF_SZ );

            if ( !$read ) {
                $sel->remove( $fh );
                $fh->close;
            }
            elsif ( $fh == $out ) {
                # $encrypted_fh is whatever we're throwing output
                # into

                syswrite( $encrypted_fh, $bytes ) if $read;
            }
            elsif ( $fh == $err ) {
                print STDERR $bytes;
            }
            else {
                die "unexpected filehandle for output";
            }
        }
    }
}

# IO::Handle won't complain if we close a handle that's
# already closed
$sel->remove( $in );  $in->close;
$sel->remove( $out ); $out->close;
$sel->remove( $err ); $err->close;

waitpid( $pid, 0 );

That’s actually about it.

I keep my buffer for input small, as pipe buffers tend to be small.  If you overload your pipe your program will hang indefinitely (or until an alarm goes off, if you set one).  4096 bytes seems to be the limit, though your own limit may be different.  When in doubt, be conservative and go smaller.

The output buffer can afford to be bigger, up to the limit of available memory (but don’t do that).  In our example of encryption gpg will consume much more than it produces, so a larger buffer doesn’t really buy you anything; if we were decrypting it would be the reverse, and a larger buffer would help immensely.

NiFi HTTP Service

I’m attempting to set up an HTTP server in NiFi to accept uploads and process them on-demand.  This gets tricky because I want to submit the files using an existing web application that will not be served from NiFi, which leads to trouble with XSS (Cross-Site Scripting) and setting up CORS (Cross Origin Resource Sharing [1]).

The trouble starts with just trying to PUT or POST a simple file.  The error in Firefox reads:

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource (Reason: CORS header 'Access-Control-Allow-Origin' missing).

You can serve up the JavaScript that actually performs the upload from NiFi and side-step XSS, but you may still run into trouble with CORS.  You’ll have trouble even if NiFi and your other web server live on the same host (using different ports, of course), as they’re considered different origins for the purposes of XSS prevention.

[Screenshot: HandleHttpResponse processor configuration]

To make this work, you’ll need to enable specific headers in the HandleHttpResponse processor.  Neither the need to set some headers, nor the headers that need to be set, are documented by NiFi at this time (so far as I can tell).

  1. Open the configuration of the HandleHttpResponse processor
  2. Add the following headers as property names, with the values shown (see below for notes regarding the values)
    Access-Control-Allow-Origin: *
    
    Access-Control-Allow-Methods: PUT, POST, GET, OPTIONS
    
    Access-Control-Allow-Headers: Accept, Accept-Encoding, Accept-Language, Connection, Content-Length, Content-Type, DNT, Host, Referer, User-Agent, Origin, X-Forwarded-For

You may want to review the value for Access-Control-Allow-Origin, as the wildcard may allow access to unexpected hosts.  If your server is public-facing (why would you do that with NiFi?) then you certainly don’t want a wildcard here.  The wildcard makes configuration much simpler if NiFi is strictly interior-facing, though.

The specific values to set for Access-Control-Allow-Methods depend on what you’re doing.  You’ll probably need OPTIONS for most cases.  I’m serving up static files so I need GET, and I’m receiving uploads that may or may not be chunked, so I need POST and PUT.

The actual set of headers needed for Access-Control-Allow-Headers varies.  A wildcard is not an acceptable value here, so you’ll have to list every header you need separately, and there are a bunch of possible headers.  See [3] for an explanation and a fairly comprehensive list of possible headers.  Our list contains a small subset that covers our basic test cases; your mileage may vary.

You may also want to set up a RouteOnAttribute processor to ignore OPTIONS requests (${http.method:equals('OPTIONS')}); otherwise you might see a bunch of zero-byte files in your flow.
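A quick way to confirm the headers are coming back is to issue a preflight request by hand; the hostname, port, and path below are placeholders for however your HandleHttpRequest listener is configured:

$ curl -i -X OPTIONS \
    -H 'Origin: http://webapp.example.com' \
    -H 'Access-Control-Request-Method: POST' \
    http://nifi.example.com:8081/upload

The response should include the Access-Control-Allow-* headers configured above; if they’re missing, the browser will keep blocking the real request.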

References:

[1] https://developer.mozilla.org/en-US/docs/Web/HTTP/Access_control_CORS

[2] http://stackoverflow.com/questions/24371734/firefox-cors-request-giving-cross-origin-request-blocked-despite-headers

[3] http://stackoverflow.com/questions/13146892/cors-access-control-allow-headers-wildcard-being-ignored