The British Library suffered a major cyber attack in October 2023 that encrypted and destroyed servers, exfiltrated 600GB of data, and has disrupted library services for the four months since. Yesterday, the Library published an 18-page report on the lessons they are learning. (There are also some community annotations on the report on Hypothes.is.)
Their investigation found the attackers likely gained access through compromised credentials on a remote access server and had been monitoring the network for days prior to the destructive activity. The attack was a typical ransomware job: get in, search for personal data and other sensitive records to copy out, and encrypt the remainder while destroying your tracks. The Library did not pay the ransom and has started the long process of recovering its systems.
The report describes in some detail how the Library recognized that its conglomeration of disparate systems, accumulated over the years, left it vulnerable to service outages and even cybersecurity attacks. The Library had started a modernization effort to address these problems, but the attack dramatically exposed these vulnerabilities and accelerated its plans to replace infrastructure and strengthen processes and procedures.
The report concludes with lessons learned for the library and other institutions to enhance cyber defenses, response capabilities, and digital modernization efforts. The library profession should be grateful to the British Library for their openness in the report, and we should take their lessons to heart.
Note! Simon Bowie has some great insights on the LSE Impact blog, including how the hack can be seen as a call for libraries to invest more in controlling their own destinies.
The report admits that some information needed to determine the attackers’ exact path is likely lost. Their best-effort estimate is that a set of compromised credentials was used on a Microsoft Terminal Services server (now called Remote Desktop Services). Multi-factor authentication (MFA, sometimes called 2FA) was used in some areas of the network, but connections to this server were not covered. The attackers tripped at least one security alarm, but the sysadmin released the hold on the account after running malware scans.
Starting in the overnight hours from Friday to Saturday, the attackers copied 600GB of data off the network. This seems to be mostly personnel files and personal files that Library staff stored on the servers. Looking back at network flows, the network provider could see this traffic, but it is unclear whether it tripped any alarms at the time. Although their Integrated Library System (an Aleph 500 system, according to Marshall Breeding’s Library Technology Guides site) was affected, the report does not make clear whether patron demographic data or circulation activity was taken.
Reading between the lines a little bit, it sounds like the Library had a relatively flat network with few boundaries between systems: “our historically complex network topology … allowed the attackers wider access to our network than would have been possible in a more modern network design, allowing them to compromise more systems and services.” Elevated privileges on one system led to elevated privileges on many systems, which allowed the attackers to move freely across the network. Systems are not structured like that today—now tending to follow the model of “least privilege”—and it seems like the Library is moving away from the flat structure towards a segmented one.
As the report notes, recovery isn’t just a matter of restoring backups to new hardware. The system can’t go back to the vulnerable state it was in. It also seems like some software systems themselves are not recoverable due to age. The British Library’s program is one of “Rebuild and Renew” — rebuilding with fresh infrastructure and replacing older systems with modern equivalents. In the never-let-a-good-crisis-go-to-waste category, “the substantial disruption of the attack creates an opportunity to implement a significant number of changes to policy, processes, and technology that will address structural issues in ways that would previously have been too disruptive to countenance.”
The report notes “a risk that the desire to return to ‘business as usual’ as fast as possible will compromise the changes”, and this point is well taken. Somewhere I read that the definition of “personal character” is the ability to see an action through after the emotion of the commitment to action has passed. The British Library was a successful institution, and it will want to return to that position of being seen as a thriving institution as quickly as possible. This will need to be a continuous process. What is cutting edge today will become legacy tomorrow. As our layers of technology get stacked higher, the bottom layers get squeezed and compressed into thin slivers that we tend to assume will always exist. We must maintain visibility in those layers and invest in their maintenance and robustness.
They also found “viable sources of backups … that were unaffected by the cyber-attack and from which the Library’s digital and digitised collections, collection metadata and other corporate data could be recovered.” That is fortunate—even if the older systems have to be replaced, they have the data to refill them.
They describe their new model as “a robust and resilient backup service, providing immutable and air-gapped copies, offsite copies, and hot copies of data with multiple restoration points on a 4/3/2/1 model.” I’m familiar with the 3/2/1 strategy for backups (three copies of your data on two distinct media with one stored off-site), but I hadn’t heard of the 4/3/2/1 strategy. Judging from this article from Backblaze, the additional layer accounts for a fully air-gapped or unavailable-online copy. An example is the AWS S3 “Object Lock” service, a cloud version of Write-Once-Read-Many (WORM) storage. Although the backed-up object is online and can be read (“Read-Many”), there are technical controls that prevent its modification until a set period of time elapses (“Write-Once”). Presumably, the time period is long enough to find and extricate anyone who has compromised the systems before the object lock expires.
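For a concrete sense of what that object lock looks like in practice, here is a minimal sketch using the AWS CLI (the bucket and key names are made up; this is my illustration, not anything from the report). The bucket has to be created with Object Lock enabled, and after that S3 refuses to let anyone overwrite or delete the object until the retention date passes:

# Hypothetical example -- not from the British Library report.
# Upload a backup object that cannot be modified or deleted until the
# retention date passes (WORM-style protection). The bucket must have
# been created with Object Lock enabled.
aws s3api put-object \
  --bucket example-library-backups \
  --key collection-metadata/2024-03-01.tar.gz \
  --body collection-metadata-2024-03-01.tar.gz \
  --object-lock-mode COMPLIANCE \
  --object-lock-retain-until-date 2024-09-01T00:00:00Z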
The lessons include the need for better network monitoring, external security expertise retention, multi-factor authentication, and intrusion response processes. The need for comprehensive multi-factor authentication is clear. (Dear reader: if you don’t have a comprehensive plan to manage credentials—including enforcement of MFA—then this is an essential takeaway from this report.)
Another outcome of the recovery is better processes for refreshing hardware and software systems as they age. Digital technology is not static. (And certainly not as static as putting a printed book on a climate-controlled shelf.) It is difficult (at least for me) to envision the kind of comprehensive change management that will be required to build a culture of adaptability and resilience to reduce the risk of this happening again.
I admire the British Library’s willingness to publish this report that describes in a frank manner their vulnerabilities, the impacts of the attack, and what they are doing to address the problems. I hope they continue to share their findings and plans with the library community. Here are some things I hope to learn:
Cyber security is a group effort. It would be easy to pin this chaos on the tech who removed a block on the account that may have been the beachhead for this attack. As this report shows, the organization allowed this environment to flourish, culminating in that one bit-flip that brought the organization down.
I’ve never been in that position, but I am mindful that I could someday be in a similar position looking back at what my actions or inactions allowed to happen. I’ll probably be at risk of being in that position until the day I retire and destroy my production work credentials. I hope the British Library staff and all involved in the recovery are treating themselves well. Those of us on the outside are watching and cheering them on.
Not much in this first year, but I’ve already started a running list for 2024.
I’m using Obsidian as my personal knowledge management tool. Obsidian creates a wiki-like experience for a directory of Markdown files on the local computer. (See an earlier post on DLTJ about my use of Obsidian.) I was using iCloud Drive to synchronize that directory of Markdown files across my laptop, phone, and tablet. Based on some web searching, I’m guessing that a recent change by Apple on how iCloud Drive synchronizes between machines caused files that hadn’t been accessed in a while to disappear—some sort of selective sync function.
My initial reaction—to move my knowledgebase folder out of iCloud Drive and onto the local hard drive—may have hampered the recovery…the files might have been marked as in the cloud and could have been downloaded. But moving my knowledgebase folder and purchasing an Obsidian Sync subscription was what I did. (I have no regrets…I’ll keep the Obsidian Sync subscription. Obsidian has a good, independent development team, and I’m happy to support them.) But how to get the missing files back?
My laptop has two independent backups: a USB-connected hard disk on my office desk using the built-in MacOS Time Machine, and ArqBackup uploading to an encrypted AWS S3 bucket. I was away from my desk for the week, so I first tried restoring from ArqBackup. That didn’t work; more on that below. When I got home, I attempted a recovery from the Time Machine drive.
Because I didn’t know which files went missing when, I thought the best approach was to successively overlay restores of the knowledgebase directory onto each other. Doing so would mean that files I had intentionally deleted would reappear in the new knowledgebase directory. Still, that is only a dozen or so out of the potentially hundreds of missing files—a much better option than having an unknown number of missing files.
First step: restore the knowledgebase directory for each backup set into a separate directory.
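One way to script that step is with macOS’s tmutil; a sketch of the idea looks like this (the account name and the volume folder inside the backup are illustrative, and browsing the Time Machine drive in Finder gets to the same place):

# Sketch: restore the Knowledgebase folder from every Time Machine backup
# set into its own directory, named for the backup set.
# (Paths are illustrative; the volume folder inside the backup varies.)
tmutil listbackups | while read -r backup; do
  stamp=$(basename "$backup")
  tmutil restore -v "$backup/Macintosh HD - Data/Users/me/Knowledgebase" \
    "./${stamp}-Knowledgebase"
done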
Each iteration through the backup sets showed the number of restored files.
As I scanned through the items copied for each day, there were no points where there was a sudden drop, so I think the file loss was gradual over some time. The results of this first step were a bunch of directories that looked like “2023-12-26-032655.backup-Knowledgebase”.
Next: copy the contents of each backup set, in the order in which the backup sets were created, on top of an empty Knowledgebase directory.
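Because the restored directories sort chronologically by name, the layering can be a loop as simple as this (a sketch of the idea):

# Layer each restored snapshot onto one directory, oldest first. Later
# backups overwrite earlier copies of a file, and files that only exist
# in older backups survive the overlay.
mkdir -p restored-Knowledgebase
for dir in *.backup-Knowledgebase; do
  cp -R "$dir"/. restored-Knowledgebase/
done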
Now let’s run some commands to see what we’re dealing with. First, the number of files in the restored Knowledgebase directory: 3,418. Next, use the diff command to see the differences between the active knowledgebase in my home directory and the restored knowledgebase, and look for the string “Only in restored-Knowledgebase”: 480 files restored that aren’t in the active knowledgebase.
Now let’s review a list of files that are only in the active knowledgebase. This is a reality check to make sure we’re on the right path, and indeed this lists only the files that were created since the last backup to the Time Machine drive.
One last command shows the files that exist in both the active knowledgebase and the restored knowledgebase but don’t have the same contents.
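The checks were commands along these lines (a sketch; the path to the active knowledgebase is illustrative):

ACTIVE=~/Knowledgebase    # illustrative path to the active knowledgebase

# How many files made it into the layered restore?
find restored-Knowledgebase -type f | wc -l

# Files that only exist in the restore: the ones iCloud lost, plus the
# files I had intentionally deleted.
diff -rq "$ACTIVE" restored-Knowledgebase | grep 'Only in restored-Knowledgebase'

# Reality check: files only in the active knowledgebase should be the ones
# created since the last Time Machine backup.
diff -rq "$ACTIVE" restored-Knowledgebase | grep "Only in $ACTIVE"

# Files that exist in both places but whose contents differ.
diff -rq "$ACTIVE" restored-Knowledgebase | grep 'differ$'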
There was only one file with minor differences…a file that I changed in between backups. Happy that everything seems in order, I just copy the contents of the restored knowledgebase on top of the active knowledgebase.
As I mentioned above, my first attempt at restoring was using ArqBackup. ArqBackup behaved in a way that I didn’t expect…I could use the user interface to restore a file at any point in time, or a directory at the point of the latest backup set. What I couldn’t do was what I did with Time Machine: restore a directory at a specific point in time. This seems to be a function of how ArqBackup stores its data. What this means, though, is that ArqBackup is less a solution for restoring directories (or a whole system) at a point in time and more a disaster-recovery option for when the laptop and the Time Machine drive are unavailable.
I ended up with a series of ffmpeg commands that does most of the grunt work and learned a lot about ffmpeg command graphs along the way.
Here are the steps:
Some hard-learned lessons along the way:
- The subtitles filter does not play nicely with filtergraphs that have more than one video input. I needed to create a separate step for burning the title card “subtitles” into the intro video and then concat that video with the session recording and outro videos.
- The xfade filter does not like it when its source video is trimmed. No matter what variations of filters I used, it was always a hard cut between the intro video and the session recording. To solve this, I made a separate step for clipping the longer meeting recording to just the session content. I used a lossless Constant Rate Factor (-crf) to not lose too much detail with the multiple encoding steps.

I’m documenting the steps here in case they are helpful to someone else…perhaps I’ll need this pipeline again someday.
Each room of the conference was assigned a Zoom meeting. These Zoom meetings allowed remote participants to join the session, and the meetings were set to record. This meant, though, that several minutes in the recording at the start and end of the session were not useful content. (Sometimes the Zoom meeting/recording for the same room would just continue from one session to the next, so multiple sessions would end up on one recording.) The valuable part of each recording would need to be clipped from the larger whole.
ffmpeg -y \
-i FOLIO\ Roadmap.mkv \
-ss 0:01:22 -to 0:50:26 \
-c:v libx264 -crf 0 \
FOLIO\ Roadmap.trimmed.mp4
Each option means:
- -y: overwrite the output file if it already exists.
- -i: the input recording file.
- -ss and -to: the start and end points of the clip. Could have also used -t for duration. Note that the placement of -ss here relative to the -i input filename means ffmpeg will perform a frame-accurate seek. This avoids the problem of blank or mostly blank frames until the next keyframe is found in the file. See How to Cut Video Using FFmpeg in 3 Easy Ways (Extract/Trim) for a discussion.
- -c:v libx264 -crf 0: re-encode the video with the H.264 codec at a lossless constant rate factor.

Most of the recordings from Zoom output to Full HD (1920x1080) resolution, but some were recorded to quite squirrely dimensions.
(1920 by 1008, 1920 by 1030 … 1760 by 900, really?)
To find the resolution of each recording file, I used the ffprobe command:
find . -type f -name '*.mkv' -exec sh -c '
for file do
printf "%s:" "$file"
args=( -v error -select_streams v:0 -show_entries stream=width,height -of csv=s=x:p=0 "$file")
ffprobe "${args[@]}"
done
' exec-sh {} +
- The find command gets a list of all .mkv files and pipes the list into a sub-shell.
- The ffprobe command outputs the width and height of each file…

Line 4 is necessary because some filenames have spaces, and spaces in filenames for ffmpeg in bash can be a little challenging.
I moved the recordings that weren’t 1920x1080 to a separate directory and ran an ffmpeg command to add letterboxing/rescaling as needed to get the output to Full HD resolution. The -ss and -to options can also be used to clip the video to the correct length at the same time.
ffmpeg -y \
-i 'zoom-recording-tuesday-2pm-room-701.mp4' \
-ss 17868 -to 20104 \
-vf 'scale=(iw*sar)*min(1920/(iw*sar)\,1080/ih):ih*min(1920/(iw*sar)\,1080/ih),
pad=1920:1080:(1920-iw*min(1920/iw\,1080/ih))/2:(1080-ih*min(1920/iw\,1080/ih))/2' \
-crf 0 \
'FOLIO Migrations.trimmed.mp4'
Each session recording has a 15-second title card with the session’s name. The 15 second video itself is just a PowerPoint animation of the conference logo sliding to the right half of the frame and a red box fading in on the left side of the frame. Each animation element was assigned a timing, and the resulting “presentation” was exported from PowerPoint to a video file. The music comes from Ecrett, so I have high hopes that it will pass the music copyright bar. The audio track was added to the video using—you guessed it—ffmpeg:
ffmpeg \
-i WOLFcon\ 2023\ Intro\ Title\ Card.mov \
-i WOLFcon\ 2023\ Intro\ audio.mp3 \
-c:v copy \
-map 0:v:0 -map 1:a:0 \
WOLFcon\ 2023\ Intro\ Title\ Card\ with\ audio.mov
So with the blank title card video done, the next step is to burn/overlay the text of the session title into the video.
I messed with ffmpeg’s drawtext filter for a while because the alternative—the subtitles filter—seemed too complicated. One thing that subtitles does nicely, though, is wrap the text to a given area on the video frame…sometimes complexity is a good thing.
The open source Aegisub Advanced Subtitle Editor was immensely useful in creating the subtitle definition file.
I can simply replace the text of the session title in the last line of the subtitle definition file, then feed it into ffmpeg.
The subtitle definition (so-called “.ass”) file generated by Aegisub is a text file, and it looks like this:
[Script Info]
; Script generated by Aegisub 3.2.2
; http://www.aegisub.org/
Title: Default Aegisub file
ScriptType: v4.00+
WrapStyle: 0
ScaledBorderAndShadow: yes
YCbCr Matrix: None
PlayResX: 1920
PlayResY: 1080
[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Left side middle,Helvetica Neue,72,&H00FFFFFF,&H000000FF,&H00000000,&H00000000,0,0,0,0,100,100,0,0,1,0.5,0,4,50,920,10,1
[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:04.00,0:00:13.00,Left side middle,,0,0,0,,FOLIO Roadmap Update
Just the last line needs to change for each session title. Another ffmpeg command overlays the subtitles onto the title card video:
ffmpeg -y \
-i ../WOLFcon\ 2023\ Intro\ Title\ Card\ with\ audio.mov \
-vf "subtitles=FOLIO Roadmap.ass" \
FOLIO\ Roadmap.intro.mp4
The -vf option applies the subtitles video filter with the session-specific ASS file. Now we have all of the pieces to make the final recording:
ffmpeg -y \
-i FOLIO\ Roadmap.intro.mp4 \
-i FOLIO\ Roadmap.trimmed.mp4 \
-i ../WOLFcon\ 2023\ Outro\ Title\ Card\ with\ audio.mov \
-filter_complex "[0:v]fps=30/1,setpts=PTS-STARTPTS[v0];
[1:v]fps=30/1,settb=AVTB,format=yuva420p,fade=in:st=0:d=1:alpha=1,setpts=PTS-STARTPTS+((14)/TB)[v1];
[2:v]fps=30/1,settb=AVTB,format=yuva420p,fade=in:st=0:d=1:alpha=1,setpts=PTS-STARTPTS+((2959)/TB)[v2];
[v0][v1]overlay,format=yuv420p[vfade1];
[vfade1][v2]overlay,format=yuv420p[fv];
[0:a]asetpts=PTS-STARTPTS[a0];
[1:a]asettb=AVTB,asetpts=PTS-STARTPTS+((14)/TB),compand=.3:1:-90/-60|-60/-40|-40/-30|-20/-20:6:0:-90:0.2[a1];
[2:a]asetpts=PTS-STARTPTS+((2959)/TB)[a2];
[a0][a1]acrossfade=d=1[afade1];
[afade1][a2]acrossfade=d=1[fa];" \
-map "[fv]" -map "[fa]" \
-crf 0 -ac 2 \
FOLIO\ Roadmap.complete.mp4
There are some filter commands here to cross-fade the video and audio between video segments that are butting up next to each other. There is an excellent description of the ffmpeg cross-fade options, and I’m using the “traditional” method.
- [v0]: take the first input (the intro title card), set the video to 30 frames-per-second, and anchor the “presentation timestamp” at the 0-th frame.
- [v1]: the session recording. The format filter sets an alpha channel to make the fade work, the fade filter makes the cross-fade with a d (duration) of 1 second, and the setpts filter offsets the start of the video to 14 seconds after the 0-th frame. (The title card video is 15 seconds, so making the recording fade in at the 14 second mark gives us that 1 second of overlap.)
- [v2]: the outro video. The parameters are nearly identical to the previous line with just the starting time difference. (That number varies by the length of the session recording.)
- The overlay filter combines the [v0] and [v1] videos. This works because of the alpha channel and offset start of the second input. The output is tagged as [vfade1].
- The same overlay combines [vfade1] and [v2], and the result becomes the final video stream, tagged [fv].
- The audio streams get the equivalent treatment (offset presentation timestamps and acrossfade filters), ending in the final audio stream tagged [fa].
- The -map options send the final video pipeline ([fv]) and audio pipeline ([fa]) to the output.

With the files ready, it is time to upload them to YouTube. The youtube-upload script is useful as a tool for batch uploading the videos. There are a couple of caveats to be aware of; more on those below.
The command looks like this:
youtube-upload \
--title="FOLIO Project Roadmap" \
--description="Multi-line description goes here." \
--category="Education" \
--default-language="en" --default-audio-language="en" \
--client-secrets=./client_secret.json \
--credentials-file=./credentials_file.json \
--playlist="WOLFcon 2023" \
--embeddable=True --privacy=public \
'FOLIO Roadmap.complete.mp4'
That should all be self-explanatory.
One thing to be aware of is the authentication step.
The client_secret.json file is downloaded from the Google API Console when the YouTube API project is created; that API project will need to be set up and this credentials file saved before running this script. Also, the credentials_file.json won’t exist when this command is first run, and you’ll be prompted to go to a specific URL to authorize the YouTube API project.
After that, the credentials file will exist and you won’t be prompted again.
And since I already had the session metadata in a spreadsheet, it was easy to write a formula that put all of the pieces together:
="youtube-upload --title='"&B2&"' --description='"&C2&"' --category='Education' --default-language='en' --default-audio-language='en' --client-secrets=./client_secret.json --credentials-file=./credentials_file.json --playlist='WOLFcon 2023' --embeddable=True --privacy=public '"&A2&"'"
Then it is just a matter of copying and pasting the calculated command lines into the terminal.
In my experience, “open” is built into the ethos of libraries. I mean…even if we look at just the last century, we have the Library of Congress starting the National Union Catalog project in 1901—that was about sharing the contents of cards in the catalog—and ALA establishing a code of practice for interlibrary loan in 1917.
My career in libraries has always been about the open; I started in 1991 at the same time OhioLINK was forming, and I remember many trips to Columbus, Ohio, to work out processes and share tips-and-tricks with each other. I was even giving away code and adapting code from others before the phrase “open source” came into common use. Over the course of my career, I’ve worked on or with several open source projects: FEDORA, Islandora, ArchivesSpace, CollectionSpace, FOLIO and ReShare.
Standards are also an important part of “open” — in order to ease the process of us working together and our systems working together, it helps to have a common starting point to build on. I’ve been working on NISO projects and committees for most of my career, and it warms my professional heart to see better services come about for patrons because there is agreement on how the pieces should be put together.
The biggest advantage is having a seat at the table as decisions are made. Working in the open means bringing the best of your experience and the needs of your patrons to bear as products, services, and software are designed. It is so much easier to have that input at the front end rather than trying to retrofit a system to your needs after the fact. With many voices and perspectives in the creation process, it also reduces the chances that something important will be missed.
The biggest risk is time and patience. Having many people involved in the design process means that it takes time to listen to those perspectives and effort to synthesize the way forward for the group. There will be misunderstanding and there will be compromise. There may even be paths that you want to pursue, but the group isn’t willing to follow. And of course there is the risk that the path the group follows may not be fruitful.
There seem to be more variations of open now. Last month’s article from OCLC Research had a catalog of openness: open access, open data, open educational resources, open science, and open source. So in those you cover publishing, research activity and outputs, educational materials, and software systems.
You are at a crossroads. There is a lot of new stuff coming at you, and the temptation will be to make the new work like the old. I was involved in the early design process for FOLIO and I’ve watched how those apps evolved, and then how the ReShare apps came about. What I said earlier about the librarians and library technologists pouring their experience into their design is true, and it continues today. So I think you should take a risk to open yourselves up to new and hopefully more efficient and effective workflows. I’m pretty sure those already in the community will be welcoming and help you with the process. And then, once you’ve got your feet under you, see where you can bring your experience and perspective to the ongoing development work.
The progress of the open access movement is fascinating. There is the phrase that progress happens one retirement at a time — I used to chuckle at that, but after 30 years in the profession that phrase is less funny and more stinging. The slow but steady progress seems to be real, though. It has reached the stage where government mandates are making it happen. See, for instance, last year’s memo from the White House Office of Science and Technology Policy giving guidance to federal departments to make the results of taxpayer-supported research immediately available to the American public at no cost. Also the announcement earlier this month from the EU on a policy to require publicly-funded research to be made available at no cost to readers and at no cost to authors.
In one important way, GALILEO is in a privileged position right now with FOLIO and ReShare. Many of us have been involved in the projects for a long time, and we’ve lived through the process that got those platforms where they are today. We can’t see them clearly from the outside anymore. If there is any room left in your implementation plans, I encourage you to note where you struggle to find and understand what you need to know. Those are the places where feedback can help us improve the process for the next libraries that come into the project. Even if you don’t have time to make the improvements now—and I expect you won’t—just jotting those ideas down in a notebook and coming back to them after your implementation will help.
(You can get updates about reports from the Congressional Research Service via RSS or Mastodon; see an earlier blog post for details.)
Most recently updated in March 2023, this 3-page Challenges with Identifying Minors Online report has brief sections describing current efforts to identify children, potential challenges when identifying minors, and policy considerations for Congress. I think it can be viewed as a high-level summary of the subsequent reports.
Congress has passed laws like COPPA (Children’s Online Privacy Protection Act of 1998) to protect minors online, but identifying their ages remains challenging. While some sites require entering a birthdate, others are exploring options like ID verification. However, requiring government IDs may exclude many minors, and fake student IDs could be used. Creating a national digital ID system raises privacy concerns. AI-based age checks have accuracy issues, and using data brokers’ information raises data collection concerns. The report describes how Congress, as it considers further protections, may stumble into unintended consequences like limiting content access or increasing data collection; those factors should be weighed against protecting minors.
In mid-August, the Congressional Research Service published three reports with greater depth. The first, Online Age Verification (Part I): Current Context, is a 3-page report about online age verification laws that have been proposed or enacted at both the state and federal levels. It provides an overview of different approaches taken in various state laws targeting social media platforms or pornography sites. Requirements range from estimating a user’s age to definitively verifying age through ID checks. Enforcement also varies, with some laws allowing private lawsuits while others give enforcement authority solely to state officials.
The report notes constitutional free speech concerns around age verification laws. Imposing requirements on websites to check users’ ages could discourage them from hosting certain content or force platforms offline if compliance is too costly. The Supreme Court has expressed worries that protecting minors should not be used to excessively burden adult communication.
The second report, Online Age Verification (Part II): Constitutional Background (also 3 pages), describes how age verification laws requiring online platforms to verify users’ ages may face First Amendment challenges. (Such requirements could burden free speech rights.) The report says that laws establishing age verification obligations are more likely to be deemed constitutional if they are content-neutral and narrowly tailored to a vital government interest like protecting minors. It also outlines how the Supreme Court has ruled on previous federal laws around age verification. Content-based laws like the Communications Decency Act (CDA) have been struck down for not being narrowly tailored and burdening lawful adult speech. While the Child Online Protection Act (COPA) survived a preliminary injunction, courts ultimately found it unconstitutional due to age verification methods not being fully effective and imposing high compliance costs. Overall, the key impact is that future age verification policies must be carefully crafted to avoid First Amendment issues highlighted by past Supreme Court decisions.
The final report in the series is 4 pages long: Online Age Verification (Part III): Select Constitutional Issues. The document discusses potential constitutional challenges to laws requiring age verification for online services. It analyzes how such laws may impact the free speech rights of website operators, adult users, and minor users. Laws targeting pornography or material harmful to minors are likely content-based, while laws targeting social media may be content-neutral. However, laws with content-based exceptions could still face First Amendment challenges. The government must show it has a compelling interest, such as protecting minors, and that the law is narrowly tailored. Past courts have found general interests in protecting children insufficient without proof of specific harms. Age verification laws could also impact minors’ access to a wide range of online speech.
I remember learning about the CRS in library school, but what got me interested in them again was a post on Mastodon about an Introduction to Cryptocurrency report that they produced. At just 2 pages long, it was a concise yet thorough review of the topic, ranging from how they work to questions of regulation. Useful stuff! And that wasn’t the only useful report I (re-)discovered on the site.
The problem is that no automated RSS/Atom feed of CRS reports exists. Use your favorite search engine to look for “Congressional Research Service RSS or Atom”; you’ll find a few attempts to gather selected reports or comprehensive archives that stopped functioning years ago. And that is a real shame because these reports are good, taxpayer-funded work that should be more widely known. So I created a syndication feed in Atom:
You can subscribe to that in your feed reader to get updates. I’m also working on a Mastodon bot account that you can follow and automated saving of report PDFs in the Internet Archive Wayback Machine.
The CRS website is very resistant to scraping, so I’m having to run this on my home machine (read more below). I’m also querying it for new reports just twice a day (8am and 8pm Eastern U.S. time) to avoid being conspicuous and tripping the bot detectors. The feed is a static XML document updated at those times; no matter how many people subscribe, the CRS won’t see increased traffic on their search site. So while I hope to keep it updated, you’ll understand if it misses a batch run here or there.
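On the home machine, that schedule is just an ordinary cron entry, something like this (the script path is hypothetical, and it assumes the machine’s clock is set to Eastern time):

# Hypothetical crontab entry: rebuild the feed at 8am and 8pm local time.
0 8,20 * * * $HOME/crs-feed/build_feed.py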
Also, hopefully, looking at the website’s list of reports only twice a day won’t raise flags with them and get my home IP address banned from the service. If the feed stops being updated over an extended time, that is probably why.
There is no tracking embedded in the Atom syndication feed or the links to the CRS reports. I have no way of knowing the number of people subscribing to the feed, nor do I see which reports you click on to read. (I suppose I could set up stats on the AWS CloudFront distribution hosting the feed XML file, but really…what’s the point?)
If you are not interested in the technology behind how the feed was built, you can stop reading now. If you want to hear more about techniques for overcoming hostile (or poorly implemented) websites, read on. You can also see the source code on GitHub.
The CRS website is a dynamic JavaScript application that goes back and forth with the server to build the contents of web pages. The website itself sends nicely formatted JSON documents to the JavaScript running in the browser based on your search parameters. That should make this easy, right? Just bypass the JavaScript front end and parse the JSON output directly.
In fact, you can do this yourself.
Go to https://crsreports.congress.gov/search/results?term=&r=2203112&orderBy=Date&isFullText=true& in your browser and see the 15 most recent reports.
Try to reach that URL with a program, though, and you’ll get back an HTTP 403 error.
(In my case, I was using the Python Requests library.)
And I tried everything I could think of.
I even tried getting the curl command line with the headers that the browser was using from the Firefox web developer tools:
…and still got denied. So I gave up and used Selenium to run a headless browser to get the JSON content.
And that worked.
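My actual code drives a headless browser through Selenium (it is in the GitHub repository mentioned above), but the core idea can be sketched with headless Chrome by itself: let a real browser engine make the request and dump whatever comes back. I haven’t tested whether bare headless Chrome gets the same free pass as the Selenium-driven browser, so treat this as an illustration of the approach rather than the actual pipeline:

# Illustration only -- not the Selenium code I actually run. A real browser
# engine fetches the search-results URL, so the server is more likely to
# answer with the results than with the HTTP 403 a plain HTTP client gets.
# The JSON comes back wrapped in a minimal HTML page, so it still needs a
# little cleanup. (The chrome binary name varies: chromium, google-chrome,
# or the full macOS application path.)
chrome --headless --disable-gpu --dump-dom \
  'https://crsreports.congress.gov/search/results?term=&r=2203112&orderBy=Date&isFullText=true&' \
  > latest-results.html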
So with the headless browser, I got this working on my local machine. That isn’t really convenient, though…even though my computer is on most working hours, something like this should be run on a server in the cloud. Something like AWS Lambda is ideal. So I took a detour to learn about Headless Chrome AWS Lambda Layer (for Python). This is a technique to run Chrome on a server, just like I was doing on my local machine.
So I got the code working on AWS Lambda. It was a nice bit of work…I was pleased to learn about a new AWS skill (Layers for Lambda). But I hit another wall…this time at Cloudflare, a content distribution network that sits in front of the CRS website with protections to stop bots like mine from doing what I was trying to do. Instead of the JSON response, I got Cloudflare’s HTML page asking me to solve a captcha to prove my bot’s humanness. And look…I love y’all, but I won’t be answering captcha challenges twice a day to get the report syndication feed published.
So after all of that, I decided to just run the code locally. If you know of something I missed that could bypass obstacles 1 and 2 (and won’t get the FBI knocking at my door), please let me know.
How much is 0.0001% of the GPT-3 training set? It is a quarter of an inch (half a centimeter) off sea level on a climb up Mount Everest. (Source: Wolfram Alpha) It is almost 8 feet (2.5 meters) of a journey from Washington, DC, to San Francisco, California. (Source: Wolfram Alpha) In contrast, the content from the New York Times is 0.036% of the training dataset, or 9/10ths of a mile (1.4km) on that journey.
(A note about assumptions: OpenAI hasn’t published the contents of the training data for GPT-3.5—which is used in ChatGPT—and GPT-4. So this post uses the data from GPT-3 as listed in Wikipedia.)
You can use the search tool near the bottom of the Washington Post article to see where your favorite website ranks. But also read the article to explore what is in the C4 version of the Common Crawl. As much as OpenAI is trying to put guardrails on the output, the model itself is trained on some pretty offensive stuff.
One gut reaction that I find I’m suppressing is calling out bad corporate behavior. Companies of all sizes had a Twitter presence, and I could publicly @-tag them with messages of outrage, disappointment, rejection, and sometimes support. It felt cathartic, and I occasionally had a nebulous feeling of community when others liked, retweeted, and replied. In retrospect, it may not have been all that useful. Do we go back to filling out web forms in a company’s “about” section or—gasp!—writing a letter and putting it in the mail?
A good portion of the accounts I followed and had notifications enabled for have made it over to Mastodon: Have I Been Pwned and The Oatmeal are here now, for instance. Some, notably local government and law enforcement, county road crews, and the nearest National Weather Service Office, are not here. Even their Twitter presence is becoming less useful.
As a long-time internet user, I also miss the uniqueness of Twitter as a meeting point. Our online personalities are—at the same time—more diffuse and more siloed than ever. Not everyone was on IRC, nor could IRC support everyone talking at the same time, for instance. Twitter was unique in that one’s reach had near-infinite potential with very little effort required. And maybe Twitter would have ultimately collapsed under the weight of that infinite potential…it certainly had its problems and challenges, even in the best of times.
When I last wrote about Twitter here on December 19, 2022, I said:
The past eight weeks on Twitter have been emotionally tiring, and I wondered why. On reflection, mourning seems like the most appropriate label for the emotion I’m feeling. I had invested time and effort into cultivating a network of friends and acquaintances. Now it is being destroyed; that network was a guest in someone else’s kingdom.
My sense of mourning has changed: it’s no longer about that lost network of friends and acquaintances in someone else’s kingdom; it is now about the lost potential and wondering if we’ve learned the lessons we need to from that Twitter era.
This week we look at internet governance: how it got to be the way it is, why it is unique, and threats (both historic and current) to how it operates.
Also on DLTJ in the past week:
Feel free to send this newsletter to others you think might be interested in the topics. If you are not already subscribed to DLTJ’s Thursday Threads, visit the sign-up page. If you would like a more raw and immediate version of these types of stories, follow me on Mastodon where I post the bookmarks I save. Comments and tips, as always, are welcome.
Yes, I’m quoting myself here. That DLTJ article was prompted by a whitepaper from Packet Clearing House in response to a request from the Ukrainian government to cut off Russia from the internet. As the quote above points out, the article goes on to talk about how the multi-stakeholder nature of the internet makes it difficult to exert much of any central control over its operation. And that is probably a good thing. If you didn’t read this article a year ago, I hope you’ll go back and read it now.
Huawei’s proposed “New IP” internet lacks transparency and inclusivity, and it introduces centralized control and the potential for abuse. The multi-stakeholder governance structure of the internet is one of its most important and unique aspects. Doing away with that should not be taken lightly.
If you want a way-out-there idea, if we centralized control of the internet with the governments we have on earth now, it becomes all that much harder to invite an independent Martian colony or extra-terrestrial culture to join the network; not so with the current multi-stakeholder governance.
See…told you it was a way-out-there idea.
ICANN and IANA are parts of the multi-stakeholder governance structure. While imperfect, they do work towards broad consensus with the goal of ensuring the broadest connectivity. That is a laudable goal.
One can question Google’s motives, but I interpret this as an honest caution and not a bid for an internet giant to consolidate control.
This is a very recent example of a country trying to cut itself off from the internet. In this case, in a country where most people get to the internet via mobile networks, it is relatively easy to do since there are few chokepoints.
It is not just India—and interference ranges from complete shutdowns to impacting specific apps. Doing something like this in the U.S. is harder because of the diversity of internet connectivity options.
But it doesn’t mean the U.S. government didn’t try (and won’t try again). Read on…
Lest you think something like what is happening in India can’t happen in the United States, 12 years ago the U.S. Senate proposed an internet kill switch. That legislation died and—as near as I can tell—hasn’t been proposed again at this scale. The recent discussions of banning an app—TikTok—come really close, though. (And instead of banning apps, can we do something about the pervasive personal data collection and distribution instead?)