The British Library suffered a major cyber attack in October 2023 that encrypted and destroyed servers, exfiltrated 600GB of data, and has disrupted library services for the four months since. Yesterday, the Library published an 18-page report on the lessons they are learning. (There are also some community annotations on the report on Hypothes.is.)
Their investigation found the attackers likely gained access through compromised credentials on a remote access server and had been monitoring the network for days prior to the destructive activity. The attack was a typical ransomware job: get in, search for personal data and other sensitive records to copy out, and encrypt the remainder while destroying your tracks. The Library did not pay the ransom and has started the long process of recovering its systems.
The report describes in some detail how the Library recognized that its conglomeration of disparate systems, accumulated over the years, left it vulnerable to service outages and even cybersecurity attacks. The Library had started a modernization effort to address these problems, but the attack dramatically exposed these vulnerabilities and accelerated its plans to replace infrastructure and strengthen processes and procedures.
The report concludes with lessons learned for the library and other institutions to enhance cyber defenses, response capabilities, and digital modernization efforts. The library profession should be grateful to the British Library for their openness in the report, and we should take their lessons to heart.
Note! Simon Bowie has some great insights on the LSE Impact blog, including how the hack can be seen as a call for libraries to invest more in controlling their own destinies.
The report admits that some information needed to determine the attackers’ exact path is likely lost. Their best-effort estimate is that a set of compromised credentials was used on a Microsoft Terminal Services server (now called Remote Desktop Services). Multi-factor authentication (MFA, sometimes called 2FA) was used in some areas of the network, but connections to this server were not covered. The attackers tripped at least one security alarm, but the sysadmin released the hold on the account after running malware scans.
Starting in the overnight hours from Friday to Saturday, the attackers copied 600GB of data off the network. This seems to be mostly personnel files and personal files that Library staff stored on the servers. Looking back at network flows, the network provider could see this traffic, but it is unclear whether it tripped any alarms at the time. Although their Integrated Library System (an Aleph 500 system, according to Marshall Breeding’s Library Technology Guides site) was affected, the report does not make clear whether patron demographic data or circulation activity was taken.
Reading between the lines a little bit, it sounds like the Library had a relatively flat network with few boundaries between systems: “our historically complex network topology … allowed the attackers wider access to our network than would have been possible in a more modern network design, allowing them to compromise more systems and services.” Elevated privileges on one system led to elevated privileges on many systems, which allowed the attackers to move freely across the network. Systems are not structured like that today—now tending to follow the model of “least privilege”—and it seems like the Library is moving away from the flat structure towards a segmented one.
As the report notes, recovery isn’t just a matter of restoring backups to new hardware. The system can’t go back to the vulnerable state it was in. It also seems like some software systems themselves are not recoverable due to age. The British Library’s program is one of “Rebuild and Renew” — rebuilding with fresh infrastructure and replacing older systems with modern equivalents. In the never-let-a-good-crisis-go-to-waste category, “the substantial disruption of the attack creates an opportunity to implement a significant number of changes to policy, processes, and technology that will address structural issues in ways that would previously have been too disruptive to countenance.”
The report notes “a risk that the desire to return to ‘business as usual’ as fast as possible will compromise the changes”, and this point is well taken. Somewhere I read that the definition of “personal character” is the ability to see an action through after the emotion of the commitment to action has passed. The British Library was a successful institution, and it will want to return to that position of being seen as a thriving institution as quickly as possible. This will need to be a continuous process. What is cutting edge today will become legacy tomorrow. As our layers of technology get stacked higher, the bottom layers get squeezed and compressed into thin slivers that we tend to assume will always exist. We must maintain visibility in those layers and invest in their maintenance and robustness.
They also found “viable sources of backups … that were unaffected by the cyber-attack and from which the Library’s digital and digitised collections, collection metadata and other corporate data could be recovered.” That is fortunate—even if the older systems have to be replaced, they have the data to refill them.
They describe their new model as “a robust and resilient backup service, providing immutable and air-gapped copies, offsite copies, and hot copies of data with multiple restoration points on a 4/3/2/1 model.” I’m familiar with the 3/2/1 strategy for backups (three copies of your data on two distinct media with one stored off-site), but I hadn’t heard of the 4/3/2/1 strategy. Judging from this article from Backblaze, the additional layer accounts for a fully air-gapped or unavailable-online copy. An example is the AWS S3 “Object Lock” service, a cloud version of Write-Once-Read-Many (WORM) storage. Although the backed-up object is online and can be read (“Read-Many”), there are technical controls that prevent its modification until a set period of time elapses (“Write-Once”). Presumably, the time period is long enough to find and extricate anyone who has compromised the systems before the object lock expires.
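For a concrete sense of what that object lock looks like in practice, here is a minimal sketch using the AWS CLI (the bucket and key names are made up; this is my illustration, not anything from the report). The bucket has to be created with Object Lock enabled, and after that S3 refuses to let anyone overwrite or delete the object until the retention date passes:

# Hypothetical example -- not from the British Library report.
# Upload a backup object that cannot be modified or deleted until the
# retention date passes (WORM-style protection). The bucket must have
# been created with Object Lock enabled.
aws s3api put-object \
  --bucket example-library-backups \
  --key collection-metadata/2024-03-01.tar.gz \
  --body collection-metadata-2024-03-01.tar.gz \
  --object-lock-mode COMPLIANCE \
  --object-lock-retain-until-date 2024-09-01T00:00:00Z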
The lessons include the need for better network monitoring, external security expertise retention, multi-factor authentication, and intrusion response processes. The need for comprehensive multi-factor authentication is clear. (Dear reader: if you don’t have a comprehensive plan to manage credentials—including enforcement of MFA—then this is an essential takeaway from this report.)
Another outcome of the recovery is better processes for refreshing hardware and software systems as they age. Digital technology is not static. (And certainly not as static as putting a printed book on a climate-controlled shelf.) It is difficult (at least for me) to envision the kind of comprehensive change management that will be required to build a culture of adaptability and resilience to reduce the risk of this happening again.
I admire the British Library’s willingness to publish this report that describes in a frank manner their vulnerabilities, the impacts of the attack, and what they are doing to address the problems. I hope they continue to share their findings and plans with the library community. Here are some things I hope to learn:
Cyber security is a group effort. It would be easy to pin this chaos on the tech who removed a block on the account that may have been the beachhead for this attack. As this report shows, the organization allowed this environment to flourish, culminating in that one bit-flip that brought the organization down.
I’ve never been in that position, but I am mindful that I could someday be in a similar position looking back at what my actions or inactions allowed to happen. I’ll probably be at risk of being in that position until the day I retire and destroy my production work credentials. I hope the British Library staff and all involved in the recovery are treating themselves well. Those of us on the outside are watching and cheering them on.
Not much in this first year, but I’ve already started a running list for 2024.
I’m using Obsidian as my personal knowledge management tool. Obsidian creates a wiki-like experience for a directory of Markdown files on the local computer. (See an earlier post on DLTJ about my use of Obsidian.) I was using iCloud Drive to synchronize that directory of Markdown files across my laptop, phone, and tablet. Based on some web searching, I’m guessing that a recent change by Apple on how iCloud Drive synchronizes between machines caused files that hadn’t been accessed in a while to disappear—some sort of selective sync function.
My initial reaction—to move my knowledgebase folder out of iCloud Drive and onto the local hard drive—may have hampered the recovery…the files might have been marked as in the cloud and could have been downloaded. But moving my knowledgebase folder and purchasing an Obsidian Sync subscription was what I did. (I have no regrets…I’ll keep the Obsidian Sync subscription. Obsidian has a good, independent development team, and I’m happy to support them.) But how to get the missing files back?
My laptop has two independent backups: a USB-connected hard disk on my office desk using the built-in MacOS Time Machine, and ArqBackup uploading to an encrypted AWS S3 bucket. I was away from my desk for the week, so I first tried restoring from ArqBackup. That didn’t work; more on that below. When I got home, I attempted a recovery from the Time Machine drive.
Because I didn’t know which files went missing when, I thought the best approach was to successively overlay restores of the knowledgebase directory onto each other. Doing so would mean that files I had intentionally deleted would reappear in the new knowledgebase directory. Still, that is only a dozen or so out of the potentially hundreds of missing files—a much better option than having an unknown number of missing files.
First step: restore the knowledgebase directory for each backup set into a separate directory.
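One way to script that step is with macOS’s tmutil; a sketch of the idea looks like this (the account name and the volume folder inside the backup are illustrative, and browsing the Time Machine drive in Finder gets to the same place):

# Sketch: restore the Knowledgebase folder from every Time Machine backup
# set into its own directory, named for the backup set.
# (Paths are illustrative; the volume folder inside the backup varies.)
tmutil listbackups | while read -r backup; do
  stamp=$(basename "$backup")
  tmutil restore -v "$backup/Macintosh HD - Data/Users/me/Knowledgebase" \
    "./${stamp}-Knowledgebase"
done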
Each iteration through the backup sets showed the number of restored files.
As I scanned through the items copied for each day, there were no points where there was a sudden drop, so I think the file loss was gradual over some time. The results of this first step were a bunch of directories that looked like “2023-12-26-032655.backup-Knowledgebase”.
Next: copy the contents of each backup set, in the order in which the backup sets were created, on top of an empty Knowledgebase directory.
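Because the restored directories sort chronologically by name, the layering can be a loop as simple as this (a sketch of the idea):

# Layer each restored snapshot onto one directory, oldest first. Later
# backups overwrite earlier copies of a file, and files that only exist
# in older backups survive the overlay.
mkdir -p restored-Knowledgebase
for dir in *.backup-Knowledgebase; do
  cp -R "$dir"/. restored-Knowledgebase/
done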
Now let’s run some commands to see what we’re dealing with. First, the number of files in the restored Knowledgebase directory: 3,418. Next, use the diff command to see the differences between the active knowledgebase in my home directory and the restored knowledgebase, and look for the string “Only in restored-Knowledgebase”: 480 files restored that aren’t in the active knowledgebase.
Now let’s review a list of files that are only in the active knowledgebase. This is a reality check to make sure we’re on the right path, and indeed this lists only the files that were created since the last backup to the Time Machine drive.
One last command shows the files that exist in both the active knowledgebase and the restored knowledgebase but don’t have the same contents.
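The checks were commands along these lines (a sketch; the path to the active knowledgebase is illustrative):

ACTIVE=~/Knowledgebase    # illustrative path to the active knowledgebase

# How many files made it into the layered restore?
find restored-Knowledgebase -type f | wc -l

# Files that only exist in the restore: the ones iCloud lost, plus the
# files I had intentionally deleted.
diff -rq "$ACTIVE" restored-Knowledgebase | grep 'Only in restored-Knowledgebase'

# Reality check: files only in the active knowledgebase should be the ones
# created since the last Time Machine backup.
diff -rq "$ACTIVE" restored-Knowledgebase | grep "Only in $ACTIVE"

# Files that exist in both places but whose contents differ.
diff -rq "$ACTIVE" restored-Knowledgebase | grep 'differ$'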
There was only one file with minor differences…a file that I changed in between backups. Happy that everything seems in order, I just copy the contents of the restored knowledgebase on top of the active knowledgebase.
As I mentioned above, my first attempt at restoring was using ArqBackup. ArqBackup behaved in a way that I didn’t expect…I could use the user interface to restore a file at any point in time, or a directory at the point of the latest backup set. What I couldn’t do was what I did with Time Machine: restore a directory at a specific point in time. This seems to be a function of how ArqBackup stores its data. What this means, though, is that ArqBackup is less a solution for restoring directories (or a whole system) at a point in time and more a disaster-recovery option for when the laptop and the Time Machine drive are unavailable.
I ended up with a series of ffmpeg commands that does most of the grunt work and learned a lot about ffmpeg command graphs along the way.
Here are the steps:
Some hard-learned lessons along the way:
- The subtitles filter does not play nicely with filtergraphs that have more than one video input. I needed to create a separate step for burning the title card “subtitles” into the intro video and then concat that video with the session recording and outro videos.
- The xfade filter does not like it when its source video is trimmed. No matter what variations of filters I used, it was always a hard cut between the intro video and the session recording. To solve this, I made a separate step for clipping the longer meeting recording to just the session content. I used a lossless Constant Rate Factor (-crf) to not lose too much detail with the multiple encoding steps.

I’m documenting the steps here in case they are helpful to someone else…perhaps I’ll need this pipeline again someday.
Each room of the conference was assigned a Zoom meeting. These Zoom meetings allowed remote participants to join the session, and the meetings were set to record. This meant, though, that several minutes in the recording at the start and end of the session were not useful content. (Sometimes the Zoom meeting/recording for the same room would just continue from one session to the next, so multiple sessions would end up on one recording.) The valuable part of each recording would need to be clipped from the larger whole.
ffmpeg -y \
-i FOLIO\ Roadmap.mkv \
-ss 0:01:22 -to 0:50:26 \
-c:v libx264 -crf 0 \
FOLIO\ Roadmap.trimmed.mp4
Each option means:
- -y: overwrite the output file if it already exists.
- -i: the input recording file.
- -ss and -to: the start and end points of the clip. Could have also used -t for duration. Note that the placement of -ss here relative to the -i input filename means ffmpeg will perform a frame-accurate seek. This avoids the problem of blank or mostly blank frames until the next keyframe is found in the file. See How to Cut Video Using FFmpeg in 3 Easy Ways (Extract/Trim) for a discussion.
- -c:v libx264 -crf 0: re-encode the video with the H.264 codec at a lossless constant rate factor.

Most of the recordings from Zoom output to Full HD (1920x1080) resolution, but some were recorded to quite squirrely dimensions.
(1920 by 1008, 1920 by 1030 … 1760 by 900, really?)
To find the resolution of each recording file, I used the ffprobe command:
find . -type f -name '*.mkv' -exec sh -c '
for file do
printf "%s:" "$file"
args=( -v error -select_streams v:0 -show_entries stream=width,height -of csv=s=x:p=0 "$file")
ffprobe "${args[@]}"
done
' exec-sh {} +
- The find command gets a list of all .mkv files and pipes the list into a sub-shell.
- The ffprobe command outputs the width and height of each file…

Line 4 is necessary because some filenames have spaces, and spaces in filenames for ffmpeg in bash can be a little challenging.
I moved the recordings that weren’t 1920x1080 to a separate directory and ran an ffmpeg command to add letterboxing/rescaling as needed to get the output to Full HD resolution. The -ss and -to options can also be used to clip the video to the correct length at the same time.
ffmpeg -y \
-i 'zoom-recording-tuesday-2pm-room-701.mp4' \
-ss 17868 -to 20104 \
-vf 'scale=(iw*sar)*min(1920/(iw*sar)\,1080/ih):ih*min(1920/(iw*sar)\,1080/ih),
pad=1920:1080:(1920-iw*min(1920/iw\,1080/ih))/2:(1080-ih*min(1920/iw\,1080/ih))/2' \
-crf 0 \
'FOLIO Migrations.trimmed.mp4'
Each session recording has a 15-second title card with the session’s name. The 15 second video itself is just a PowerPoint animation of the conference logo sliding to the right half of the frame and a red box fading in on the left side of the frame. Each animation element was assigned a timing, and the resulting “presentation” was exported from PowerPoint to a video file. The music comes from Ecrett, so I have high hopes that it will pass the music copyright bar. The audio track was added to the video using—you guessed it—ffmpeg:
ffmpeg \
-i WOLFcon\ 2023\ Intro\ Title\ Card.mov \
-i WOLFcon\ 2023\ Intro\ audio.mp3 \
-c:v copy \
-map 0:v:0 -map 1:a:0 \
WOLFcon\ 2023\ Intro\ Title\ Card\ with\ audio.mov
So with the blank title card video done, the next step is to burn/overlay the text of the session title into the video.
I messed with ffmpeg’s drawtext filter for a while because the alternative—the subtitles filter—seemed too complicated. One thing that subtitles does nicely, though, is wrap the text to a given area on the video frame…sometimes complexity is a good thing.
The open source Aegisub Advanced Subtitle Editor was immensely useful in creating the subtitle definition file.
I can simply replace the text of the session title in the last line of the subtitle definition file, then feed it into ffmpeg.
The subtitle definition (so-called “.ass”) file generated by Aegisub is a text file, and it looks like this:
[Script Info]
; Script generated by Aegisub 3.2.2
; http://www.aegisub.org/
Title: Default Aegisub file
ScriptType: v4.00+
WrapStyle: 0
ScaledBorderAndShadow: yes
YCbCr Matrix: None
PlayResX: 1920
PlayResY: 1080
[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Left side middle,Helvetica Neue,72,&H00FFFFFF,&H000000FF,&H00000000,&H00000000,0,0,0,0,100,100,0,0,1,0.5,0,4,50,920,10,1
[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:04.00,0:00:13.00,Left side middle,,0,0,0,,FOLIO Roadmap Update
Just the last line needs to change for each session title. Another ffmpeg command overlays the subtitles onto the title card video:
ffmpeg -y \
-i ../WOLFcon\ 2023\ Intro\ Title\ Card\ with\ audio.mov \
-vf "subtitles=FOLIO Roadmap.ass" \
FOLIO\ Roadmap.intro.mp4
The -vf option applies the subtitles video filter with the session-specific ASS file. Now we have all of the pieces to make the final recording:
ffmpeg -y \
-i FOLIO\ Roadmap.intro.mp4 \
-i FOLIO\ Roadmap.trimmed.mp4 \
-i ../WOLFcon\ 2023\ Outro\ Title\ Card\ with\ audio.mov \
-filter_complex "[0:v]fps=30/1,setpts=PTS-STARTPTS[v0];
[1:v]fps=30/1,settb=AVTB,format=yuva420p,fade=in:st=0:d=1:alpha=1,setpts=PTS-STARTPTS+((14)/TB)[v1];
[2:v]fps=30/1,settb=AVTB,format=yuva420p,fade=in:st=0:d=1:alpha=1,setpts=PTS-STARTPTS+((2959)/TB)[v2];
[v0][v1]overlay,format=yuv420p[vfade1];
[vfade1][v2]overlay,format=yuv420p[fv];
[0:a]asetpts=PTS-STARTPTS[a0];
[1:a]asettb=AVTB,asetpts=PTS-STARTPTS+((14)/TB),compand=.3:1:-90/-60|-60/-40|-40/-30|-20/-20:6:0:-90:0.2[a1];
[2:a]asetpts=PTS-STARTPTS+((2959)/TB)[a2];
[a0][a1]acrossfade=d=1[afade1];
[afade1][a2]acrossfade=d=1[fa];" \
-map "[fv]" -map "[fa]" \
-crf 0 -ac 2 \
FOLIO\ Roadmap.complete.mp4
There are some filter commands here to cross-fade the video and audio between video segments that are butting up next to each other. There is an excellent description of the ffmpeg cross-fade options, and I’m using the “traditional” method.
- [v0]: take the first input (the intro title card), set the video to 30 frames-per-second, and anchor the “presentation timestamp” at the 0-th frame.
- [v1]: the session recording. The format filter sets an alpha channel to make the fade work, the fade filter makes the cross-fade with a d (duration) of 1 second, and the setpts filter offsets the start of the video to 14 seconds after the 0-th frame. (The title card video is 15 seconds, so making the recording fade in at the 14 second mark gives us that 1 second of overlap.)
- [v2]: the outro video. The parameters are nearly identical to the previous line with just the starting time difference. (That number varies by the length of the session recording.)
- The overlay filter combines the [v0] and [v1] videos. This works because of the alpha channel and offset start of the second input. The output is tagged as [vfade1].
- The same overlay combines [vfade1] and [v2], and the result becomes the final video stream, tagged [fv].
- The audio streams get the equivalent treatment (offset presentation timestamps and acrossfade filters), ending in the final audio stream tagged [fa].
- The -map options send the final video pipeline ([fv]) and audio pipeline ([fa]) to the output.

With the files ready, it is time to upload them to YouTube. The youtube-upload script is useful as a tool for batch uploading the videos. There are a couple of caveats to be aware of; more on those below.
The command looks like this:
youtube-upload \
--title="FOLIO Project Roadmap" \
--description="Multi-line description goes here." \
--category="Education" \
--default-language="en" --default-audio-language="en" \
--client-secrets=./client_secret.json \
--credentials-file=./credentials_file.json \
--playlist="WOLFcon 2023" \
--embeddable=True --privacy=public \
'FOLIO Roadmap.complete.mp4'
That should all be self-explanatory.
One thing to be aware of is the authentication step.
The client_secret.json file is downloaded from the Google API Console when the YouTube API project is created; that API project will need to be set up and this credentials file saved before running this script. Also, the credentials_file.json won’t exist when this command is first run, and you’ll be prompted to go to a specific URL to authorize the YouTube API project.
After that, the credentials file will exist and you won’t be prompted again.
And since I already had the session metadata in a spreadsheet, it was easy to write a formula that put all of the pieces together:
="youtube-upload --title='"&B2&"' --description='"&C2&"' --category='Education' --default-language='en' --default-audio-language='en' --client-secrets=./client_secret.json --credentials-file=./credentials_file.json --playlist='WOLFcon 2023' --embeddable=True --privacy=public '"&A2&"'"
Then it is just a matter of copying and pasting the calculated command lines into the terminal.
In my experience, “open” is built into the ethos of libraries. I mean…even if we look at just the last century, we have the Library of Congress starting the National Union Catalog project in 1901—that was about sharing the contents of cards in the catalog—and ALA establishing a code of practice for interlibrary loan in 1917.
My career in libraries has always been about the open; I started in 1991 at the same time OhioLINK was forming, and I remember many trips to Columbus, Ohio, to work out processes and share tips-and-tricks with each other. I was even giving away code and adapting code from others before the phrase “open source” came into common use. Over the course of my career, I’ve worked on or with several open source projects: FEDORA, Islandora, ArchivesSpace, CollectionSpace, FOLIO and ReShare.
Standards are also an important part of “open” — in order to ease the process of us working together and our systems working together, it helps to have a common starting point to build on. I’ve been working on NISO projects and committees for most of my career, and it warms my professional heart to see better services come about for patrons because there is agreement on how the pieces should be put together.
The biggest advantage is having a seat at the table as decisions are made. Working in the open means bringing the best of your experience and the needs of your patrons to bear as products, services, and software are designed. It is so much easier to have that input at the front end rather than trying to retrofit a system to your needs after the fact. With many voices and perspectives in the creation process, it also reduces the chances that something important will be missed.
The biggest risk is time and patience. Having many people involved in the design process means that it takes time to listen to those perspectives and effort to synthesize the way forward for the group. There will be misunderstanding and there will be compromise. There may even be paths that you want to pursue, but the group isn’t willing to follow. And of course there is the risk that the path the group follows may not be fruitful.
There seem to be more variations of open now. Last month’s article from OCLC Research had a catalog of openness: open access, open data, open educational resources, open science, and open source. So in those you cover publishing, research activity and outputs, educational materials, and software systems.
You are at a crossroads. There is a lot of new stuff coming at you, and the temptation will be to make the new work like the old. I was involved in the early design process for FOLIO and I’ve watched how those apps evolved, and then how the ReShare apps came about. What I said earlier about the librarians and library technologists pouring their experience into their design is true, and it continues today. So I think you should take a risk to open yourselves up to new and hopefully more efficient and effective workflows. I’m pretty sure those already in the community will be welcoming and help you with the process. And then, once you’ve got your feet under you, see where you can bring your experience and perspective to the ongoing development work.
The progress of the open access movement is fascinating. There is the phrase that progress happens one retirement at a time — I used to chuckle at that, but after 30 years in the profession that phrase is less funny and more stinging. The slow but steady progress seems to be real, though. It has reached the stage where government mandates are making it happen. See, for instance, last year’s memo from the White House Office of Science and Technology Policy giving guidance to federal departments to make the results of taxpayer-supported research immediately available to the American public at no cost. Also the announcement earlier this month from the EU on a policy to require publicly-funded research to be made available at no cost to readers and at no cost to authors.
In one important way, GALILEO is in a privileged position right now with FOLIO and ReShare. Many of us have been involved in the projects for a long time, and we’ve lived through the process that got those platforms where they are today. We can’t see them clearly from the outside anymore. If there is any room left in your implementation plans, I encourage you to note where you struggle to find and understand what you need to know. Those are the places where feedback can help us improve the process for the next libraries that come into the project. Even if you don’t have time to make the improvements now—and I expect you won’t—just jotting those ideas down in a notebook and coming back to them after your implementation will help.
(You can get updates about reports from the Congressional Research Service via RSS or Mastodon; see an earlier blog post for details.)
Most recently updated in March 2023, this 3-page Challenges with Identifying Minors Online report has brief sections describing current efforts to identify children, potential challenges when identifying minors, and policy considerations for Congress. I think it can be viewed as a high-level summary of the subsequent reports.
Congress has passed laws like COPPA (Children’s Online Privacy Protection Act of 1998) to protect minors online, but identifying their ages remains challenging. While some sites require entering a birthdate, others are exploring options like ID verification. However, requiring government IDs may exclude many minors, and fake student IDs could be used. Creating a national digital ID system raises privacy concerns. AI-based age checks have accuracy issues, and using data brokers’ information raises data collection concerns. The report describes how Congress, as it considers further protections, may stumble into unintended consequences like limiting content access or increasing data collection; those factors should be weighed against protecting minors.
In mid-August, the Congressional Research Service published three reports with greater depth. The first, Online Age Verification (Part I): Current Context, is a 3-page report about online age verification laws that have been proposed or enacted at both the state and federal levels. It provides an overview of different approaches taken in various state laws targeting social media platforms or pornography sites. Requirements range from estimating a user’s age to definitively verifying age through ID checks. Enforcement also varies, with some laws allowing private lawsuits while others give enforcement authority solely to state officials.
The report notes constitutional free speech concerns around age verification laws. Imposing requirements on websites to check users’ ages could discourage them from hosting certain content or force platforms offline if compliance is too costly. The Supreme Court has expressed worries that protecting minors should not be used to excessively burden adult communication.
The second report, Online Age Verification (Part II): Constitutional Background (also 3 pages), describes how age verification laws requiring online platforms to verify users’ ages may face First Amendment challenges. (Such requirements could burden free speech rights.) The report says that laws establishing age verification obligations are more likely to be deemed constitutional if they are content-neutral and narrowly tailored to a vital government interest like protecting minors. It also outlines how the Supreme Court has ruled on previous federal laws around age verification. Content-based laws like the Communications Decency Act (CDA) have been struck down for not being narrowly tailored and burdening lawful adult speech. While the Child Online Protection Act (COPA) survived a preliminary injunction, courts ultimately found it unconstitutional due to age verification methods not being fully effective and imposing high compliance costs. Overall, the key impact is that future age verification policies must be carefully crafted to avoid First Amendment issues highlighted by past Supreme Court decisions.
The final report in the series is 4 pages long: Online Age Verification (Part III): Select Constitutional Issues. The document discusses potential constitutional challenges to laws requiring age verification for online services. It analyzes how such laws may impact the free speech rights of website operators, adult users, and minor users. Laws targeting pornography or material harmful to minors are likely content-based, while laws targeting social media may be content-neutral. However, laws with content-based exceptions could still face First Amendment challenges. The government must show it has a compelling interest, such as protecting minors, and that the law is narrowly tailored. Past courts have found general interests in protecting children insufficient without proof of specific harms. Age verification laws could also impact minors’ access to a wide range of online speech.
I remember learning about the CRS in library school, but what got me interested in them again was a post on Mastodon about an Introduction to Cryptocurrency report that they produced. At just 2 pages long, it was a concise yet thorough review of the topic, ranging from how they work to questions of regulation. Useful stuff! And that wasn’t the only useful report I (re-)discovered on the site.
The problem is that no automated RSS/Atom feed of CRS reports exists. Use your favorite search engine to look for “Congressional Research Service RSS or Atom”; you’ll find a few attempts to gather selected reports or comprehensive archives that stopped functioning years ago. And that is a real shame because these reports are good, taxpayer-funded work that should be more widely known. So I created a syndication feed in Atom:
You can subscribe to that in your feed reader to get updates. I’m also working on a Mastodon bot account that you can follow and automated saving of report PDFs in the Internet Archive Wayback Machine.
The CRS website is very resistant to scraping, so I’m having to run this on my home machine (read more below). I’m also querying it for new reports just twice a day (8am and 8pm Eastern U.S. time) to avoid being conspicuous and tripping the bot detectors. The feed is a static XML document updated at those times; no matter how many people subscribe, the CRS won’t see increased traffic on their search site. So while I hope to keep it updated, you’ll understand if it misses a batch run here or there.
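On the home machine, that schedule is just an ordinary cron entry, something like this (the script path is hypothetical, and it assumes the machine’s clock is set to Eastern time):

# Hypothetical crontab entry: rebuild the feed at 8am and 8pm local time.
0 8,20 * * * $HOME/crs-feed/build_feed.py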
Also, hopefully, looking at the website’s list of reports only twice a day won’t raise flags with them and get my home IP address banned from the service. If the feed stops being updated over an extended time, that is probably why.
There is no tracking embedded in the Atom syndication feed or the links to the CRS reports. I have no way of knowing the number of people subscribing to the feed, nor do I see which reports you click on to read. (I suppose I could set up stats on the AWS CloudFront distribution hosting the feed XML file, but really…what’s the point?)
If you are not interested in the technology behind how the feed was built, you can stop reading now. If you want to hear more about techniques for overcoming hostile (or poorly implemented) websites, read on. You can also see the source code on GitHub.
The CRS website is a dynamic JavaScript application that goes back and forth with the server to build the contents of web pages. The website itself sends nicely formatted JSON documents to the JavaScript running in the browser based on your search parameters. That should make this easy, right? Just bypass the JavaScript front end and parse the JSON output directly.
In fact, you can do this yourself.
Go to https://crsreports.congress.gov/search/results?term=&r=2203112&orderBy=Date&isFullText=true& in your browser and see the 15 most recent reports.
Try to reach that URL with a program, though, and you’ll get back an HTTP 403 error.
(In my case, I was using the Python Requests library.)
And I tried everything I could think of.
I even tried getting the curl command line with the headers that the browser was using from the Firefox web developer tools:
…and still got denied. So I gave up and used Selenium to run a headless browser to get the JSON content.
And that worked.
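My actual code drives a headless browser through Selenium (it is in the GitHub repository mentioned above), but the core idea can be sketched with headless Chrome by itself: let a real browser engine make the request and dump whatever comes back. I haven’t tested whether bare headless Chrome gets the same free pass as the Selenium-driven browser, so treat this as an illustration of the approach rather than the actual pipeline:

# Illustration only -- not the Selenium code I actually run. A real browser
# engine fetches the search-results URL, so the server is more likely to
# answer with the results than with the HTTP 403 a plain HTTP client gets.
# The JSON comes back wrapped in a minimal HTML page, so it still needs a
# little cleanup. (The chrome binary name varies: chromium, google-chrome,
# or the full macOS application path.)
chrome --headless --disable-gpu --dump-dom \
  'https://crsreports.congress.gov/search/results?term=&r=2203112&orderBy=Date&isFullText=true&' \
  > latest-results.html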
So with the headless browser, I got this working on my local machine. That isn’t really convenient, though…even though my computer is on most working hours, something like this should be run on a server in the cloud. Something like AWS Lambda is ideal. So I took a detour to learn about Headless Chrome AWS Lambda Layer (for Python). This is a technique to run Chrome on a server, just like I was doing on my local machine.
So I got the code working on AWS Lambda. It was a nice bit of work…I was pleased to learn about a new AWS skill (Layers for Lambda). But I hit another wall…this time at Cloudflare, a content distribution network that sits in front of the CRS website with protections to stop bots like mine from doing what I was trying to do. Instead of the JSON response, I got Cloudflare’s HTML page asking me to solve a captcha to prove my bot’s humanness. And look…I love y’all, but I won’t be answering captcha challenges twice a day to get the report syndication feed published.
So after all of that, I decided to just run the code locally. If you know of something I missed that could bypass obstacles 1 and 2 (and won’t get the FBI knocking at my door), please let me know.
How much is 0.0001% of the GPT-3 training set? It is a quarter of an inch (half a centimeter) off sea level on a climb up Mount Everest. (Source: Wolfram Alpha) It is almost 8 feet (2.5 meters) of a journey from Washington, DC, to San Francisco, California. (Source: Wolfram Alpha) In contrast, the content from the New York Times is 0.036% of the training dataset, or 9/10ths of a mile (1.4km) on that journey.
(A note about assumptions: OpenAI hasn’t published the contents of the training data for GPT-3.5—which is used in ChatGPT—and GPT-4. So this post uses the data from GPT-3 as listed in Wikipedia.)
You can use the search tool near the bottom of the Washington Post article to see where your favorite website ranks. But also read the article to explore what is in the C4 version of the Common Crawl. As much as OpenAI is trying to put guardrails on the output, the model itself is trained on some pretty offensive stuff.
One gut reaction that I find I’m suppressing is calling out bad corporate behavior. Companies of all sizes had a Twitter presence, and I could publicly @-tag them with messages of outrage, disappointment, rejection, and sometimes support. It felt cathartic, and I occasionally had a nebulous feeling of community when others liked, retweeted, and replied. In retrospect, it may not have been all that useful. Do we go back to filling out web forms in a company’s “about” section or—gasp!—writing a letter and putting it in the mail?
A good portion of the accounts I followed and had notifications enabled for have made it over to Mastodon: Have I Been Pwned and The Oatmeal are here now, for instance. Some, notably local government and law enforcement, county road crews, and the nearest National Weather Service Office, are not here. Even their Twitter presence is becoming less useful.
As a long-time internet user, I also miss the uniqueness of Twitter as a meeting point. Our online personalities are—at the same time—more diffuse and more siloed than ever. Not everyone was on IRC, nor could IRC support everyone talking at the same time, for instance. Twitter was unique in that one’s reach had near-infinite potential with very little effort required. And maybe Twitter would have ultimately collapsed under the weight of that infinite potential…it certainly had its problems and challenges, even in the best of times.
When I last wrote about Twitter here on December 19, 2022, I said:
The past eight weeks on Twitter have been emotionally tiring, and I wondered why. On reflection, mourning seems like the most appropriate label for the emotion I’m feeling. I had invested time and effort into cultivating a network of friends and acquaintances. Now it is being destroyed; that network was a guest in someone else’s kingdom.
My sense of mourning has changed: it’s no longer about that lost network of friends and acquaintances in someone else’s kingdom; it is now about the lost potential and wondering if we’ve learned the lessons we need to from that Twitter era.
This week we look at internet governance: how it got to be the way it is, why it is unique, and threats (both historic and current) to how it operates.
Also on DLTJ in the past week:
Feel free to send this newsletter to others you think might be interested in the topics. If you are not already subscribed to DLTJ’s Thursday Threads, visit the sign-up page. If you would like a more raw and immediate version of these types of stories, follow me on Mastodon where I post the bookmarks I save. Comments and tips, as always, are welcome.
Yes, I’m quoting myself here. That DLTJ article was prompted by a whitepaper from Packet Clearing House in response to a request from the Ukrainian government to cut off Russia from the internet. As the quote above points out, the article goes on to talk about how the multi-stakeholder nature of the internet makes it difficult to exert much of any central control over its operation. And that is probably a good thing. If you didn’t read this article a year ago, I hope you’ll go back and read it now.
Huawei’s proposed “New IP” internet lacks transparency and inclusivity, and it introduces centralized control and the potential for abuse. The multi-stakeholder governance structure of the internet is one of its most important and unique aspects. Doing away with that should not be taken lightly.
If you want a way-out-there idea, if we centralized control of the internet with the governments we have on earth now, it becomes all that much harder to invite an independent Martian colony or extra-terrestrial culture to join the network; not so with the current multi-stakeholder governance.
See…told you it was a way-out-there idea.
ICANN and IANA are parts of the multi-stakeholder governance structure. While imperfect, they do work towards broad consensus with the goal of ensuring the broadest connectivity. That is a laudable goal.
One can question Google’s motives, but I interpret this as an honest caution and not a bid for an internet giant to consolidate control.
This is a very recent example of a country trying to cut itself off from the internet. In this case, in a country where most people get to the internet via mobile networks, it is relatively easy to do since there are few chokepoints.
It is not just India—and interference ranges from complete shutdowns to impacting specific apps. Doing something like this in the U.S. is harder because of the diversity of internet connectivity options.
But it doesn’t mean the U.S. government didn’t try (and won’t try again). Read on…
Lest you think something like what is happening in India can’t happen in the United States, 12 years ago the U.S. Senate proposed an internet kill switch. That legislation died and—as near as I can tell—hasn’t been proposed again at this scale. The recent discussions of banning an app—TikTok—come really close, though. (And instead of banning apps, can we do something about the pervasive personal data collection and distribution instead?)