Can AI tell a story that we dare to tell?: Exploring NLP + Vision Models for Artistic Generation.
Choosing the Ideal Approach to Tell the Stories of Labor Workers with Artificial Intelligence
Do you remember the summer of 2022?
It was a peculiar summer, not only for me but for a multitude of individuals who had the chance to experience the power of AI at their fingertips. It was a summer of diffusion models: Disco Diffusion, Stable Diffusion, and the then-viral AI profile-generating app, Lensa. It was as if we had stumbled upon a new ice cream flavor. You’re excited and want to give it a try, but you also question whether it’ll last.
In the midst of this peculiar summer of generative AI models, I landed an even more peculiar opportunity: to create a groundbreaking Art + Tech piece as a technologist collaborator. When someone offers to *pay* you to make art, you take it and run with it! This extraordinary opportunity was graciously granted by the Backslash Art fellowship at Cornell Tech.
I was fortunate enough to work with the outstanding and respected artist Jen Liu, who has been exploring labor and gender issues through her multimedia artworks. Collaborating on her work as an AI technologist (machine learning engineer + just figuring everything out), I was uniquely positioned to share the narratives of labor workers, activists, and feminists who grappled with unethical working environments that caused detrimental physical and mental health consequences, or who fearlessly voiced their concerns, only to be met with political liquidation.
That’s how I got to spend last summer grappling with the question:
Can AI tell a story that we dare to tell?
At the beginning of the project, I was bombarded with eye-opening and heartbreaking stories of labor workers and activists, stories that had been muted and removed. From news outlets and Chinese social media platforms to comments and literature, so many stories had been taken down and ignored. Could we use AI as the storyteller to speak of the things that we dare to tell, and as the protector of the people? Is it a mere tool of proliferation, or can it take on risky roles on our behalf?
🚨 *This post is an archive of a project I worked on from July 2022 to July 2023. I used archaic generative models, including BERT, LSTM, vanilla Stable Diffusion, and GPT-2. While these models are outdated now, I found value in seeing their raw results, before the social and political corrections that later models apply to their outputs. Results shown here include fine-tuned models.*
Video Generation with AI
Our exploration began with diffusion models, aiming to generate video segments that AI would later distort and expand. I remember explaining the process to Jen (the artist): the model starts with a page of Gaussian noise and gradually refines the image based on the prompt, step by step, until a clear picture emerges. Jen pointed out the parallels between this process and how we were trying to build something from the ‘liquidated’. For me, the murky underwater footage of flooded areas visually echoed the initial noise state where these models begin. So murky; where do we go from there? Although these initial experiments didn’t make it into the final work due to limitations in control and cultural inclusivity, they led us to a deeper exploration of AI’s potential in storytelling.
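For readers curious what this looked like in code, here is a minimal sketch using the diffusers library; the checkpoint name and prompt are illustrative, not the project’s actual assets. Each call runs the full noise-to-image denoising loop internally:

```python
import torch
from diffusers import StableDiffusionPipeline

# A vanilla Stable Diffusion checkpoint, the kind available in 2022.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Internally the pipeline starts from pure Gaussian noise and denoises
# it step by step toward the prompt: the "murky to clear" process above.
image = pipe("murky underwater view of a flooded village",
             num_inference_steps=50).images[0]
image.save("frame_000.png")
```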
It started with something like this:

First issue: inconsistency and erratic changes between frames.
What’s happening here is inconsistent image generation for each frame (yes, I generated every frame individually and assembled them into a 25fps video). The black pixels around the edges are outline detection carried over from the previous frame, which wasn’t blended smoothly because of a dramatic camera-angle change. Overall, there’s no ‘connection’ between the frames.
I used a Latent Diffusion Model to generate each frame of the video as an image, then applied FILM: Frame Interpolation for Large Motion (research from Google and the University of Washington) to interpolate between those generated frames for smoother transitions.
Result:

FILM generates and fills in the in-between frames to create a smoother transition. This part of the work is archived in this GitHub repo.
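Here is a rough sketch of the two-stage idea, with illustrative file names, prompt, and hyperparameters. Conditioning each keyframe on the previous one (img2img at low strength) reduces the frame-to-frame jumps, and FILM then fills in the in-betweens:

```python
import os
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Generate sparse keyframes, each initialized from the previous frame
# so consecutive frames stay visually related.
os.makedirs("keyframes", exist_ok=True)
frame = Image.open("underwater_still.png").convert("RGB").resize((512, 512))
for i in range(12):
    frame = pipe(prompt="flooded village under murky green water",
                 image=frame,       # previous frame as the init image
                 strength=0.35,     # low strength = smaller jumps between frames
                 num_inference_steps=50).images[0]
    frame.save(f"keyframes/frame_{i:03d}.png")

# Then interpolate between the keyframes with FILM (run from the
# google-research/frame-interpolation repo), roughly:
#   python3 -m eval.interpolator_cli --pattern "keyframes" \
#       --model_path pretrained_models/film_net/Style/saved_model \
#       --times_to_interpolate 3 --output_video
```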
Second issue: unintended Western features in face generation

Now the frame transitions were smoother, but no matter what prompt and initial image we fed in, the model kept adding unintended ‘Western’ features to the faces.
So, painfully, I fine-tuned the last few layers on real pictures of Asian women the artist had gathered; a sketch of this kind of partial fine-tune follows the result below.
Result:

Finally, some East Asian women showing up! The style changes were intentional, made by changing the prompt every 50th frame.
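Here is a minimal sketch of that kind of partial fine-tune, assuming a diffusers-style UNet; the choice of unfrozen blocks and the learning rate are illustrative, not the exact layers I used:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet = pipe.unet

# Freeze everything, then unfreeze only the last up-block and the output
# convolution, so training on a small curated photo set nudges facial
# features without overwriting everything else the model knows.
for p in unet.parameters():
    p.requires_grad = False
for p in unet.up_blocks[-1].parameters():
    p.requires_grad = True
for p in unet.conv_out.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in unet.parameters() if p.requires_grad), lr=1e-5
)
# Training loop (omitted): encode each photo to latents, add noise at a
# random timestep, predict that noise with the UNet, and minimize the
# MSE between predicted and true noise.
```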
Again, this was the early summer of 2022, before all the GUIs for video generation (not even Automatic1111 existed yet). There were many other issues I was working through, including camera-angle control, masking areas to limit generation, and more.
Text Generation with AI
Concurrently, we kept ideating on how to deliver the labor workers’ stories with AI. What about text? It’s important to note that this was the pre-ChatGPT era.
I fine-tuned models including Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, as well as BERT and GPT-2, on diverse texts: social media posts, Chinese literature, and corporate announcements (to add the voices of those who own the factories where these laborers work).
LSTMs struggled with the complex relationships in the diverse corpus, producing inconsistent text. BERT’s bidirectional context awareness improved consistency, but coherence remained elusive. GPT-2 (which, unlike its successors from GPT-3.5 onward, was open-sourced and fine-tunable on our machines) was the most promising. All of them made grammatical errors that required human correction. For example, subject-verb agreement broke almost every time the verb sat far from its subject.
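As a rough illustration of that era’s workflow, here is a minimal GPT-2 fine-tuning sketch with the Hugging Face transformers library, assuming the mixed corpus has been concatenated into a single corpus.txt; the file name and hyperparameters are illustrative, and this is not the project’s actual code:

```python
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Causal language modeling on the mixed corpus: social media posts,
# Chinese literature, and corporate announcements in one text file.
dataset = TextDataset(tokenizer=tokenizer, file_path="corpus.txt",
                      block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-finetuned",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```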
Pre-trained models indeed performed better at natural generation, with better grammar, indicating that the more a model had ‘studied’ the ‘unliquidated’ content online, the better it got. I’m still pondering AI’s reliance on existing, mainstream narratives and its potential to perpetuate the silencing of marginalized voices.
While I cannot share the code, here’s a peek into what it looked like:

4: Ever since I became a cloud, saw no change in the appearance of my face’s shape & texture. Skin looks fresh and radiant without makeup anymore!
^ This text was generated by a checkpoint that learned its style from Asian cosmetics commercials + Chinese literature.
8: Ever since I became a cloud, the power of electricity has been on display. When it came to my desktops and laptop — even those with 128GB storage — we were limited by how many users could control their devices via third-party apps or games through our app Store (which isn’t designed specifically for desktop use;
^ This text was generated by a checkpoint trained on tech company websites + Chinese literature.
My homeland is collapsing, fighting for its future. Everyone is fighting for something. You have to come home now, everyone must come home to find peace. Let the people find peace, and unite, and unite, and fight against the evil from within.
^ This text was generated with higher weights on Chinese literature.
The architecture of these archaic text-generation methods used in our process is illustrated here, as presented at the Open Studio at Cornell University:

Again, can AI tell a story that we dare to tell?
Throughout the project, I was driven by the desire to explore AI’s potential to tell the stories we dared not tell after the social ‘liquidations’. And I had to confront the harsh reality that the stories we wanted to tell were already a minority voice in the AI’s training data (they had been taken down from the web, so they couldn’t have been included much). Out-of-the-box generative models were more a proof of the liquidation than a protector.
The proof was in the sanitization of difficult realities. When prompted with terms like ‘factory workers’, ‘Chinese factory working environment’, ‘data centers’, or ‘health impacts’, these models generated only the most superficial representations: pristine assembly lines, smiling workers, gleaming server rooms. The stark reality of cancer-causing working environments, the reported health deterioration of workers, the systematic exploitation: these truths were scrubbed from the outputs, replaced by corporate stock photos and PR-friendly narratives. Each sterilized generation revealed not just a gap in the training data but a systematic pattern of digital truth-washing.
The same boundaries emerged in our attempts to incorporate Asian literary traditions. Even after implementing LoRA fine-tuning, a technique specifically designed to adapt language models to new domains, the results highlighted the glaring absence of these voices in the AI’s pre-training data. The model could be taught new patterns, but it couldn’t fill gaps in cultural understanding that were never present in its initial training.
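For reference, this is roughly what a LoRA setup looks like with the peft library; the target module and ranks are illustrative. Only the small low-rank adapter matrices are trained, while the base weights stay frozen:

```python
from peft import LoraConfig, get_peft_model
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Inject trainable low-rank adapters into GPT-2's attention projections.
config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"],
                    lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # a fraction of a percent is trainable
```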
However, I’m not interested in merely adding to the well-worn narrative about minority representation in AI. Yes, I successfully mitigated these biases through technical interventions: fine-tuning algorithms, synthesizing missing data, implementing careful output guardrails. But these solutions only addressed the surface level of generation, becoming exercises in political correction — as evidenced by Google’s Gemini controversy, where such corrections led to historical revisionism rather than genuine representation.
Instead, I want to explore a more provocative possibility: could computers become active agents in salvaging, protecting, and even producing diversity? What if generative models could transcend their role as mirrors of existing biases to become advocates for the marginalized? What if their very naiveté — their inability to fully grasp human social contexts — could be transformed into a form of cultural neutrality?
The closest I’ve come to realizing this protective potential was through abstraction, in this project, ‘The Land At The Bottom Of The Sea’:

Lastly, Steganography
To protect these stories, we are archiving the raw content (videos, texts, and pictures) collected throughout The Land At The Bottom Of The Sea via steganography. Each pixel of the video carries information about the real stories, which the viewer can decode and reconstruct. Unlike cryptography, steganography conceals the very fact that something is hidden. Viewers must acknowledge the liquidated stories and access them with a specific algorithm. Through this collaboration with the audience, we attempt to protect the liquidated stories that AI cannot yet tell properly.
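To make the idea concrete, here is a textbook least-significant-bit (LSB) sketch that hides a UTF-8 message in the lowest bit of each pixel channel. Our actual encoding scheme isn’t published, but the principle is the same: the carrier image looks unchanged to the eye.

```python
import numpy as np
from PIL import Image

def embed(carrier_path: str, message: str, out_path: str) -> None:
    """Hide `message` in the least-significant bits of an image's pixels."""
    pixels = np.array(Image.open(carrier_path).convert("RGB"))
    bits = np.unpackbits(np.frombuffer(message.encode("utf-8"), dtype=np.uint8))
    flat = pixels.flatten()
    flat[:len(bits)] = (flat[:len(bits)] & 0xFE) | bits  # overwrite LSBs
    # Save losslessly (PNG); lossy formats would destroy the hidden bits.
    Image.fromarray(flat.reshape(pixels.shape)).save(out_path, "PNG")

def extract(stego_path: str, n_bytes: int) -> str:
    """Read `n_bytes` of hidden text back out of the stego image."""
    flat = np.array(Image.open(stego_path).convert("RGB")).flatten()
    return np.packbits(flat[:n_bytes * 8] & 1).tobytes().decode("utf-8")

secret = "the story they took down"
embed("still.png", secret, "stego.png")
print(extract("stego.png", len(secret.encode("utf-8"))))
```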
Read and see our work at the Backslash Art fellowship:
https://backslash.org/art/jen-liu-the-land-at-the-bottom-of-the-sea
Exhibitions and links where the work has been shown so far:
Taipei Biennial 2023 : link
Sculpture Center 2024 : link
MAIIAM Chiang Mai
Met Museum
Blindspot Gallery