[00:13:07] (PS1) Chico Venancio: prometheus: tools: scrape paws metrics into prometheus [puppet] - https://gerrit.wikimedia.org/r/441514 (https://phabricator.wikimedia.org/T195030)
[00:13:41] (CR) jerkins-bot: [V: -1] prometheus: tools: scrape paws metrics into prometheus [puppet] - https://gerrit.wikimedia.org/r/441514 (https://phabricator.wikimedia.org/T195030) (owner: Chico Venancio)
[00:16:41] (Abandoned) EBernhardson: [WIP] Rework elasticsearch ferm for multi-instance [puppet] - https://gerrit.wikimedia.org/r/441337 (owner: EBernhardson)
[00:19:06] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404)
[00:22:26] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy
[00:43:30] (PS5) EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - https://gerrit.wikimedia.org/r/441321
[00:43:31] (PS8) EBernhardson: [WIP] Split instance define out of elasticsearch class [puppet] - https://gerrit.wikimedia.org/r/441338
[00:43:33] (PS36) EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - https://gerrit.wikimedia.org/r/440049
[00:43:37] (CR) jerkins-bot: [V: -1] prometheus/elasticsearch support multiple exporters per host [puppet] - https://gerrit.wikimedia.org/r/441321 (owner: EBernhardson)
[00:55:04] (PS1) Jforrester: Stop loading the MwEmbedSupport extension, part I [mediawiki-config] - https://gerrit.wikimedia.org/r/441518
[00:55:06] (PS1) Jforrester: Stop loading the MwEmbedSupport extension, part II [mediawiki-config] - https://gerrit.wikimedia.org/r/441519
[00:55:08] (PS1) Jforrester: Stop loading the MwEmbedSupport extension, part III [mediawiki-config] - https://gerrit.wikimedia.org/r/441520
[00:55:10] (PS1) Jforrester: Stop loading the MwEmbedSupport extension, part IV [mediawiki-config] - https://gerrit.wikimedia.org/r/441521
[00:56:54] (CR) Jforrester: [C: -2] "wmf.10 isn't even cut yet, let alone deployed everywhere. :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/441518 (owner: Jforrester)
[01:46:08] (PS2) Chico Venancio: prometheus: tools: scrape paws metrics into prometheus [puppet] - https://gerrit.wikimedia.org/r/441514 (https://phabricator.wikimedia.org/T195030)
[01:46:09] (CR) jerkins-bot: [V: -1] prometheus: tools: scrape paws metrics into prometheus [puppet] - https://gerrit.wikimedia.org/r/441514 (https://phabricator.wikimedia.org/T195030) (owner: Chico Venancio)
[03:04:57] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200)
[03:06:06] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy
[03:10:29] (PS1) 20after4: Fix phabricator rate limiting [puppet] - https://gerrit.wikimedia.org/r/441525
[03:11:00] (CR) jerkins-bot: [V: -1] Fix phabricator rate limiting [puppet] - https://gerrit.wikimedia.org/r/441525 (owner: 20after4)
[03:14:21] (PS2) 20after4: Fix phabricator rate limiting [puppet] - https://gerrit.wikimedia.org/r/441525 (https://phabricator.wikimedia.org/T197922)
[03:27:27] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 875.22 seconds
[03:39:17] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 253.25 seconds
[06:06:17] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[06:06:36] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[06:07:26] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0
[06:07:36] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[06:19:26] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[06:19:36] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[06:21:36] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0
[06:21:47] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[07:53:42] (CR) Gergő Tisza: [C: 1] "Not that it matters much, but why deploymentwiki (which AIUI is more for coordination than testing) and not meta or commons?" [mediawiki-config] - https://gerrit.wikimedia.org/r/438079 (https://phabricator.wikimedia.org/T143628) (owner: C. Scott Ananian)
[08:31:06] (CR) Dzahn: "it's still code freeze and i'll be on vacation for a while. please ping somebody else in Service Operations to get it merged" [puppet] - https://gerrit.wikimedia.org/r/441525 (https://phabricator.wikimedia.org/T197922) (owner: 20after4)
[08:35:31] (PS1) Elukey: profile::prometheus::alerts: remove old checks [puppet] - https://gerrit.wikimedia.org/r/441535
[08:48:57] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[08:49:06] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[08:51:05] (PS1) Giuseppe Lavagetto: rake: add ability to check syntax of dhcp files [puppet] - https://gerrit.wikimedia.org/r/441537
[08:51:13] (CR) Muehlenhoff: [C: -1] "That patch is incomplete, you also need to remove the SSH key, the expiry_date and expiry_contact and switch to ensure: absent" [puppet] - https://gerrit.wikimedia.org/r/441434 (https://phabricator.wikimedia.org/T197895) (owner: RobH)
[08:51:40] (CR) jerkins-bot: [V: -1] rake: add ability to check syntax of dhcp files [puppet] - https://gerrit.wikimedia.org/r/441537 (owner: Giuseppe Lavagetto)
[08:56:32] Operations, TimedMediaHandler-Transcode: Backport libvpx 1.7.0, ffmpeg packages for VP9 -row-mt option - https://phabricator.wikimedia.org/T190333#4307186 (MoritzMuehlenhoff) I started to look into that; the ffmpeg version in Debian stable (3.2) doesn't yet support row-mt. I'll see whether I can sanely b...
[09:00:06] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[09:01:06] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0
[09:04:41] (CR) 20after4: "I wonder if it even makes sense to have this file deployed by puppet. It would maybe make more sense to check the file into our phabricato" [puppet] - https://gerrit.wikimedia.org/r/441525 (https://phabricator.wikimedia.org/T197922) (owner: 20after4)
[09:16:01] (PS2) Giuseppe Lavagetto: rake: add ability to check syntax of dhcp files [puppet] - https://gerrit.wikimedia.org/r/441537
[09:16:03] (PS1) Giuseppe Lavagetto: test: valid dhcp change [puppet] - https://gerrit.wikimedia.org/r/441539
[09:16:05] (PS1) Giuseppe Lavagetto: test: invalid dhcp change [puppet] - https://gerrit.wikimedia.org/r/441540
[09:16:35] (CR) jerkins-bot: [V: -1] rake: add ability to check syntax of dhcp files [puppet] - https://gerrit.wikimedia.org/r/441537 (owner: Giuseppe Lavagetto)
[09:16:37] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404)
[09:16:43] (CR) jerkins-bot: [V: -1] test: valid dhcp change [puppet] - https://gerrit.wikimedia.org/r/441539 (owner: Giuseppe Lavagetto)
[09:16:59] <_joe_> grr
[09:17:08] (CR) jerkins-bot: [V: -1] test: invalid dhcp change [puppet] - https://gerrit.wikimedia.org/r/441540 (owner: Giuseppe Lavagetto)
[09:17:46] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy
[09:46:22] (CR) ArielGlenn: "As long as the values themselves are a separate configuration file that lives in puppet, I don't see why this can't be deployed with phab " [puppet] - https://gerrit.wikimedia.org/r/441525 (https://phabricator.wikimedia.org/T197922) (owner: 20after4)
[09:47:37] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[09:47:46] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[09:48:46] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[09:48:47] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0
[09:49:56] (PS3) Giuseppe Lavagetto: rake: add ability to check syntax of dhcp files [puppet] - https://gerrit.wikimedia.org/r/441537
[09:50:51] (PS2) Giuseppe Lavagetto: test: valid dhcp change [puppet] - https://gerrit.wikimedia.org/r/441539
[09:52:28] <_joe_> hashar: ^^ it works! thanks for your help
[09:52:46] (PS2) Giuseppe Lavagetto: test: invalid dhcp change [puppet] - https://gerrit.wikimedia.org/r/441540
[09:53:29] (CR) jerkins-bot: [V: -1] test: invalid dhcp change [puppet] - https://gerrit.wikimedia.org/r/441540 (owner: Giuseppe Lavagetto)
[09:54:51] <_joe_> \o/
[09:57:22] Operations, Mail, Wikimedia-Mailing-lists: investigate caching of mailman listinfo pages - https://phabricator.wikimedia.org/T197819#4303598 (Peachey88) Is this something we should report upstream as well?
[10:03:37] _joe_: that is nice !!! :]
[10:05:40] Operations, procurement, LDAP: Certificate Renewal for corp.wikimedia.org - https://phabricator.wikimedia.org/T197840#4307281 (Peachey88)
[10:14:59] Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban), and 2 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#4307288 (mmodell) So apparently this is resolved... I'm...
[10:37:13] Operations, Gadgets, MediaWiki-Cache, Performance-Team, Patch-For-Review: test.wp is using test2.wp's message cache - https://phabricator.wikimedia.org/T197450#4307317 (Legoktm) Has someone tried running a full scap? I'm not sure if l10nupdate is enough (but I don't understand exactly what th...
[10:39:35] (PS3) Giuseppe Lavagetto: Add switch to allow building images that match a glob pattern [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/420711 (https://phabricator.wikimedia.org/T186416)
[10:42:33] (CR) Giuseppe Lavagetto: [C: 2] Add switch to allow building images that match a glob pattern [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/420711 (https://phabricator.wikimedia.org/T186416) (owner: Giuseppe Lavagetto)
[10:49:29] (PS2) Giuseppe Lavagetto: Honour .dockerignore when copying the build context [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/420769 (https://phabricator.wikimedia.org/T183546)
[10:49:48] (CR) Vgutierrez: [C: 1] "LGTM" [debs/pybal] - https://gerrit.wikimedia.org/r/433736 (owner: Mark Bergsma)
[10:50:16] (CR) jerkins-bot: [V: -1] Honour .dockerignore when copying the build context [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/420769 (https://phabricator.wikimedia.org/T183546) (owner: Giuseppe Lavagetto)
[10:53:42] (CR) Mobrovac: rake: add ability to check syntax of dhcp files (2 comments) [puppet] - https://gerrit.wikimedia.org/r/441537 (owner: Giuseppe Lavagetto)
[11:05:05] (CR) Vgutierrez: [C: 1] "LGTM" [debs/pybal] - https://gerrit.wikimedia.org/r/434161 (owner: Mark Bergsma)
[11:18:39] (CR) Vgutierrez: [C: 1] "LGTM" [debs/pybal] - https://gerrit.wikimedia.org/r/434162 (owner: Mark Bergsma)
[11:20:15] Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban), and 2 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#4307385 (Paladox) Could it be traffic? But also it could...
[11:56:28] Operations, Deployments, HHVM, Patch-For-Review, and 3 others: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#4307439 (MoritzMuehlenhoff) Does that actually still make sense at this point? We'll get rid of HHVM in 6-9 months an...
[12:11:13] (CR) Aklapper: [C: 1] Fix phabricator rate limiting [puppet] - https://gerrit.wikimedia.org/r/441525 (https://phabricator.wikimedia.org/T197922) (owner: 20after4)
[13:16:10] (PS3) Giuseppe Lavagetto: Honour .dockerignore when copying the build context [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/420769 (https://phabricator.wikimedia.org/T183546)
[13:23:01] (CR) Giuseppe Lavagetto: [C: 2] Honour .dockerignore when copying the build context [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/420769 (https://phabricator.wikimedia.org/T183546) (owner: Giuseppe Lavagetto)
[13:31:36] (PS4) Giuseppe Lavagetto: rake: add ability to check syntax of dhcp files [puppet] - https://gerrit.wikimedia.org/r/441537
[13:33:08] (CR) Giuseppe Lavagetto: rake: add ability to check syntax of dhcp files (2 comments) [puppet] - https://gerrit.wikimedia.org/r/441537 (owner: Giuseppe Lavagetto)
[13:33:37] (Abandoned) Giuseppe Lavagetto: test: valid dhcp change [puppet] - https://gerrit.wikimedia.org/r/441539 (owner: Giuseppe Lavagetto)
[13:33:48] (Abandoned) Giuseppe Lavagetto: test: invalid dhcp change [puppet] - https://gerrit.wikimedia.org/r/441540 (owner: Giuseppe Lavagetto)
[13:50:02] Operations, CX-cxserver, Citoid, RESTBase, and 3 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#4307820 (Arrbee)
[13:50:37] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[13:52:47] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy
[14:53:22] Operations, Gadgets, MediaWiki-Cache, Performance-Team, Patch-For-Review: test.wp is using test2.wp's message cache - https://phabricator.wikimedia.org/T197450#4308001 (Jdforrester-WMF) >>! In T197450#4307317, @Legoktm wrote: > Has someone tried running a full scap? I'm not sure if l10nupdate...
[14:54:40] Operations, Wikimedia-Mailing-lists: New mail list for Signpost team - https://phabricator.wikimedia.org/T197732#4308007 (herron) a:herron
[14:57:58] (CR) Chad: [C: 1] Gerrit: Set cache for groups [puppet] - https://gerrit.wikimedia.org/r/441391 (owner: Paladox)
[14:58:15] (CR) Chad: [C: 1] Gerrit: Increase changeid_project and ldap_usernames caches [puppet] - https://gerrit.wikimedia.org/r/441397 (owner: Paladox)
[15:05:02] Operations, Wikimedia-Mailing-lists: New mail list for Signpost team - https://phabricator.wikimedia.org/T197732#4308031 (herron) Open>Resolved Hi @Brianhe, wikipedia-en-signpost-priv@lists.wikimedia.org has been created and the system should have sent the initial password directly to you. Since...
[15:10:35] Operations, Wikimedia-Mailing-lists: New closed communication public policy mailing list needed - https://phabricator.wikimedia.org/T196041#4308036 (herron) Open>Resolved Hopefully no news is good news! I'll set this to resolved, but please don't hesitate to re-open if any follow up is needed.
[15:24:45] Operations, procurement, LDAP: Certificate Renewal for corp.wikimedia.org - https://phabricator.wikimedia.org/T197840#4308130 (herron) Hey @robh it looks like you handled the renewal last year in T167346. Would it make any sense to move this cert to Let's Encrypt now that wildcard certificates are s...
[15:31:53] (CR) Mobrovac: rake: add ability to check syntax of dhcp files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/441537 (owner: Giuseppe Lavagetto)
[15:44:49] Operations, Proton, SRE-Access-Requests, Patch-For-Review: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group - https://phabricator.wikimedia.org/T197857#4308214 (RobH) p:Triage>Normal
[15:52:21] (PS2) RobH: remove oliver's access to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/441434 (https://phabricator.wikimedia.org/T197895)
[15:53:51] Operations, Research, Research-collaborations, Research-management, and 2 others: Remove shell access for ironholds on 2018-06-30 - https://phabricator.wikimedia.org/T197895#4308267 (RobH)
[16:39:44] Operations, TimedMediaHandler-Transcode: Backport libvpx 1.7.0, ffmpeg packages for VP9 -row-mt option - https://phabricator.wikimedia.org/T190333#4308359 (brion) Thanks! Unfortunately it looks like 3.2 doesn't include support for the option. We'll either need to update to 3.3 or 3.4 or add in a backport...
[16:52:52] (CR) Muehlenhoff: [C: -1] "He also needs to be added to the "absent" list on the top, sorry for not spotting that earlier. Rest looks good." [puppet] - https://gerrit.wikimedia.org/r/441434 (https://phabricator.wikimedia.org/T197895) (owner: RobH)
[17:26:33] (PS1) Urbanecm: Whitelist two Indian government websites [mediawiki-config] - https://gerrit.wikimedia.org/r/441563 (https://phabricator.wikimedia.org/T197944)
[17:47:34] (PS3) RobH: remove oliver's access to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/441434 (https://phabricator.wikimedia.org/T197895)
[17:51:10] (CR) Muehlenhoff: [C: 1] remove oliver's access to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/441434 (https://phabricator.wikimedia.org/T197895) (owner: RobH)
[17:53:34] Operations, Cloud-VPS, cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4308460 (Bstorm) Open>Resolved This seem pretty good at this point, so I'll close this task for now.
[18:03:46] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200)
[18:04:47] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy
[18:42:37] Operations, TimedMediaHandler-Transcode: Backport libvpx 1.7.0, ffmpeg packages for VP9 -row-mt option - https://phabricator.wikimedia.org/T190333#4308522 (MoritzMuehlenhoff) I created a libvpx 1.7 backport, backported the patch to support mt-row to 3.2 (so that we can stick closely to the 3.2 packages s...
[18:48:01] (PS2) Bmansurov: Increase Schema:CitationUsage sampling rate to 15% [mediawiki-config] - https://gerrit.wikimedia.org/r/440867 (https://phabricator.wikimedia.org/T191086)
[18:48:03] (PS1) Bmansurov: Increase Schema:CitationUsage sampling rate to 100% [mediawiki-config] - https://gerrit.wikimedia.org/r/441567 (https://phabricator.wikimedia.org/T191086)
[18:50:34] (CR) Bmansurov: [C: -1] "To be deployed on 6/26." [mediawiki-config] - https://gerrit.wikimedia.org/r/441567 (https://phabricator.wikimedia.org/T191086) (owner: Bmansurov)
[18:52:42] (PS1) Bmansurov: Stop collecting data for Schema:CitationUsage [mediawiki-config] - https://gerrit.wikimedia.org/r/441568 (https://phabricator.wikimedia.org/T191086)
[18:53:40] (CR) Bmansurov: [C: -1] "To be deployed on 7/4." [mediawiki-config] - https://gerrit.wikimedia.org/r/441568 (https://phabricator.wikimedia.org/T191086) (owner: Bmansurov)
[18:54:31] (CR) Bmansurov: [C: -1] "To be deployed on 7/3." [mediawiki-config] - https://gerrit.wikimedia.org/r/441568 (https://phabricator.wikimedia.org/T191086) (owner: Bmansurov)
[18:55:57] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1015, Errmsg: Error Cant lock file (errno: 22 Invalid argument) on query. Default database: commonswiki. [Query snipped]
[19:08:27] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 930.72 seconds
[19:09:57] I am checking that
[19:10:07] maybe tokudb corruption?
[19:35:00] Operations, JADE, TechCom, Goal, and 3 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381#4308608 (Harej)
[19:54:13] !log applying transaction manually on dbstore1002 due to weird bug with savepoint on tokudb+image table
[19:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:17] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[19:54:33] that should fix the replication issue, never have seen such a thing before
[19:55:46] for the record, we were getting ERROR 1015 (HY000) at line 14134: Can't lock file (errno: 22 "Invalid argument") on a tokudb table
[19:56:03] maybe there is a limit on the number of savepoints that row replication was injecting
[19:56:08] for tokudb
[19:56:34] and the innodb -> tokudb was breaking replication
[19:57:04] anyway, we are getting rid of tokudb on that host, hopefully this doesn't repeat
[19:57:12] (PS1) Thcipriani: Scap: UpdateInterwikiCache fix subclassing [mediawiki-config] - https://gerrit.wikimedia.org/r/441571 (https://phabricator.wikimedia.org/T196642)
[19:57:38] I exported the transaction, grep -v the savepoints and then skip the counter
[20:08:17] PROBLEM - Device not healthy -SMART- on db2056 is CRITICAL: cluster=mysql device=cciss,0 instance=db2056:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2056&var-datasource=codfw%2520prometheus%252Fops
[21:11:37] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 254.89 seconds
[21:22:37] anyone care to review https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/436431/ ?
[21:22:57] it's a deployment-prep-only change already cherry-picked there (and has been for a while)
[21:23:18] old value is a nonexistent host and has been for a while
[22:12:49] (CR) C. Scott Ananian: "> Not that it matters much, but why deploymentwiki (which AIUI is" [mediawiki-config] - https://gerrit.wikimedia.org/r/438079 (https://phabricator.wikimedia.org/T143628) (owner: C. Scott Ananian)
[22:15:24] (CR) Gergő Tisza: [C: 1] "Sure, I mean why not beta meta or beta commons instead of deploymentwiki?" [mediawiki-config] - https://gerrit.wikimedia.org/r/438079 (https://phabricator.wikimedia.org/T143628) (owner: C. Scott Ananian)
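Editor's note on the 19:57:38 message ("I exported the transaction, grep -v the savepoints and then skip the counter"): the log does not show the commands that were actually run on dbstore1002. The sketch below only illustrates the filtering step, assuming the exported transaction is plain-text SQL with one statement per line. The function name `strip_savepoints` and the sample statements are hypothetical, and the "skip the counter" step is not covered.

```python
import re

# Assumption: savepoint statements sit on their own line in the export.
# Matches "SAVEPOINT x" and "RELEASE SAVEPOINT x" at the start of a line.
SAVEPOINT_RE = re.compile(r"^\s*(RELEASE\s+)?SAVEPOINT\b", re.IGNORECASE)


def strip_savepoints(sql_text: str) -> str:
    """Return sql_text with savepoint statements removed (hypothetical helper)."""
    kept = [line for line in sql_text.splitlines() if not SAVEPOINT_RE.match(line)]
    return "\n".join(kept)


if __name__ == "__main__":
    # Made-up sample transaction, not taken from the incident.
    sample = "\n".join([
        "BEGIN;",
        "SAVEPOINT sp1;",
        "UPDATE image SET img_size = 1 WHERE img_name = 'Example.jpg';",
        "RELEASE SAVEPOINT sp1;",
        "COMMIT;",
    ])
    print(strip_savepoints(sample))
```

Unlike a bare `grep -v`, this only drops lines that begin with a savepoint statement, so a data row that merely contains the word "savepoint" is kept.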