[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T0000).
[00:00:04] niedzielski, Zoranzoki21, and ebernhardson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:30] 👍
[00:01:45] \o
[00:14:13] (PS2) Dzahn: smokeping: replace target einsteinium with authdns1001 [puppet] - https://gerrit.wikimedia.org/r/473282 (https://phabricator.wikimedia.org/T202782)
[00:14:15] (PS1) Dzahn: bienvenida: add cache-control headers with max-age 1 hour [puppet] - https://gerrit.wikimedia.org/r/473306 (https://phabricator.wikimedia.org/T202592)
[00:15:04] Who is conducting the SWAT today?
[00:15:18] (PS2) Dzahn: bienvenida: add cache-control headers with max-age 1 hour [puppet] - https://gerrit.wikimedia.org/r/473306 (https://phabricator.wikimedia.org/T202592)
[00:15:48] (CR) Dzahn: [C: 2] "per chat in -traffic" [puppet] - https://gerrit.wikimedia.org/r/473306 (https://phabricator.wikimedia.org/T202592) (owner: Dzahn)
[00:20:49] niedzielski: I can SWAT if you're still available
[00:21:01] thcipriani: yes please!
[00:22:32] (CR) Thcipriani: [C: -1] Add new throttle rule for Art+Feminism Event on 2018-11-17 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/473255 (https://phabricator.wikimedia.org/T209324) (owner: Zoranzoki21)
[00:23:05] wikibase sure triggers a good amount of tests :)
[00:24:41] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:25:03] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[00:25:07] yeah, unfortunately it takes a bit
[00:25:13] (PS3) Dzahn: smokeping: replace target einsteinium with authdns1001 [puppet] - https://gerrit.wikimedia.org/r/473282 (https://phabricator.wikimedia.org/T202782)
[00:25:37] i'm still coming up to speed with the repo but it looks really nice inside!
[00:26:53] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:30:15] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 27.48 ms
[00:30:41] (CR) Dzahn: [C: 2] smokeping: replace target einsteinium with authdns1001 [puppet] - https://gerrit.wikimedia.org/r/473282 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:32:17] (PS4) Dzahn: smokeping: replace target einsteinium with authdns1001 [puppet] - https://gerrit.wikimedia.org/r/473282 (https://phabricator.wikimedia.org/T202782)
[00:37:38] sooo close
[00:42:31] !log restarted smokeping on netmon1002 and netmon2001
[00:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:37] \o/
[00:53:24] yay, ok let me check a few pages
[00:53:53] niedzielski: live on mwdebug1002 now
[01:00:50] i think all is well. thank you thcipriani
[01:01:07] niedzielski: okie doke, Good to sync everywhere?
[01:01:20] thcipriani: yes please
[01:01:24] * thcipriani does
[01:02:38] !log thcipriani@deploy1001 Synchronized php-1.33.0-wmf.3/extensions/Wikibase/client/includes: SWAT: [[gerrit:473166|Update: use wikibase-debug logger instead of "PageRandomLookup"]] T208796 (duration: 00m 56s)
[01:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:41] T208796: Use wikibase-debug Logstash channel to log unexpected page_random values - https://phabricator.wikimedia.org/T208796
[01:02:46] ^ niedzielski live everywhere
[01:03:00] thank you!
[01:03:10] yw :)
[02:06:12] Operations, DBA, JADE, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight) @Marostegui Pinging for review of these two files, https://phabricator.wikimedia.org/diffusion/EJ...
[02:14:37] Operations, Research-Programs, SRE-Access-Requests, Epic: Server Access for 3 formal collaborators - https://phabricator.wikimedia.org/T209298 (Dzahn) It's fine to share public keys here right on the ticket since they are public and will be added to public repos either way.
[02:27:02] was CSP for private/fishbowl wikis got somewhat harsher? or is it still console log-only?
[02:28:06] uh nvm it loads
[02:28:15] just full of console warning for meta.wikimedia.org, which is bit weird tho
[02:28:48] and enwiki and kowiki
[03:30:25] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 826.93 seconds
[04:16:03] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhprof]
[04:25:53] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 244.40 seconds
[04:41:33] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:49:57] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/opsvar-lag_datasource=eqiad+prometheus/opsvar-mirror_name=main-eqiad_to_main-codfw
[05:39:25] PROBLEM - puppet last run on cumin1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks]
[05:39:51] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui]
[05:40:31] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[05:40:33] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer]
[05:40:40] ok so
[05:40:45] I think Gerrit is having issues
[05:40:53] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[05:41:01] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[05:41:18] oh
[05:41:20] it's down
[05:41:21] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2],Exec[git_pull_analytics.wikimedia.org]
[05:42:03] PROBLEM - puppet last run on db1124 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[05:42:11] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 5 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy]
[05:43:01] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[05:43:05] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[05:43:05] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[05:43:05] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/reportupdater],Exec[git_pull_wikimedia/discovery/golden]
[05:43:15] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[05:43:25] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks]
[05:43:43] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer]
[05:43:49] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 6 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy],Exec[git_pull_research/landing-page]
[05:44:27] Operations, Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (Joe)
[05:45:43] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki]
[05:46:09] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot]
[05:46:49] <_joe_> !log restarting gerrit
[05:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:13] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_statistics_mediawiki],Exec[git_pull_analytics/reportupdater]
[05:47:21] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot]
[05:48:45] PROBLEM - puppet last run on db1125 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[05:49:21] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks]
[05:49:44] Operations, Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (Joe) I was still quite asleep, but I saw a series of broken pipes from sockets and jetty refusing to manage any new connection in the logs, so I just restarted gerrit. It is now working, so we can lower the...
[05:49:54] Operations, Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (Joe) p:Unbreak!>High
[06:00:11] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:00:55] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:02:52] Operations, Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (Legoktm) [21:45:43] I didn't realize it was upgraded today, I had been encountering weird behavior a few hours ago that I wasn't sure about ... [21:58:11] I couldn't get https://gerrit.wi...
[06:03:31] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:03:39] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:03:47] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:04:05] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:04:11] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:06:01] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:06:21] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:06:27] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:07:31] RECOVERY - puppet last run on db1124 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:07:37] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:08:37] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:09:59] RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:11:51] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:12:41] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:12:47] RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:13:37] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:14:11] RECOVERY - puppet last run on db1125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:14:22] (CR) Legoktm: [C: 2] Add PHP version information to log entries [mediawiki-config] - https://gerrit.wikimedia.org/r/472498 (https://phabricator.wikimedia.org/T209076) (owner: Legoktm)
[06:14:47] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:15:44] (Merged) jenkins-bot: Add PHP version information to log entries [mediawiki-config] - https://gerrit.wikimedia.org/r/472498 (https://phabricator.wikimedia.org/T209076) (owner: Legoktm)
[06:16:17] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:16:41] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:18:47] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:22:09] (CR) jenkins-bot: Add PHP version information to log entries [mediawiki-config] - https://gerrit.wikimedia.org/r/472498 (https://phabricator.wikimedia.org/T209076) (owner: Legoktm)
[06:22:17] hrm
[06:22:39] (PS1) Marostegui: db-codfw.php: Depool pc2005 [mediawiki-config] - https://gerrit.wikimedia.org/r/473332 (https://phabricator.wikimedia.org/T208383)
[06:24:05] (PS1) Marostegui: pc2010: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/473333 (https://phabricator.wikimedia.org/T208383)
[06:24:09] (CR) Marostegui: [C: 2] db-codfw.php: Depool pc2005 [mediawiki-config] - https://gerrit.wikimedia.org/r/473332 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[06:24:33] (PS1) Legoktm: Revert "Add PHP version information to log entries" [mediawiki-config] - https://gerrit.wikimedia.org/r/473334
[06:24:41] (CR) Legoktm: [C: 2] Revert "Add PHP version information to log entries" [mediawiki-config] - https://gerrit.wikimedia.org/r/473334 (owner: Legoktm)
[06:25:21] (CR) Marostegui: [C: 2] pc2010: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/473333 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[06:25:38] (Merged) jenkins-bot: db-codfw.php: Depool pc2005 [mediawiki-config] - https://gerrit.wikimedia.org/r/473332 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[06:26:00] (Merged) jenkins-bot: Revert "Add PHP version information to log entries" [mediawiki-config] - https://gerrit.wikimedia.org/r/473334 (owner: Legoktm)
[06:26:10] legoktm: I have merged your change in deploy1001
[06:26:52] marostegui: sorry, I thought I'd sync out my patch pretty easily, and then it didn't work on mwdebug :/
[06:27:10] ok
[06:27:11] thanks
[06:27:33] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool pc2005 - T208383 (duration: 01m 04s)
[06:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:36] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383
[06:27:41] legoktm: Mine got merged before yours but looks like between the notification and my actual git rebase yours got merged too
[06:27:49] great
[06:27:53] I also tried to git fetch as well
[06:28:08] yeah, i did fetch and rebase
[06:30:57] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:31:41] PROBLEM - puppet last run on mw2285 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt]
[06:31:59] !log Stop MySQL on pc2005 to clone it to pc2008 - T208383
[06:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:42] (CR) jenkins-bot: db-codfw.php: Depool pc2005 [mediawiki-config] - https://gerrit.wikimedia.org/r/473332 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[06:36:44] (CR) jenkins-bot: Revert "Add PHP version information to log entries" [mediawiki-config] - https://gerrit.wikimedia.org/r/473334 (owner: Legoktm)
[06:36:47] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (elukey) a:RobH>elukey
[06:40:26] !log Deploy schema change on s6 codfw master, this will generate lag on s6 codfw -T203709
[06:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:30] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[06:42:05] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[06:52:35] !log Deploy schema change on s4 codfw master, this will generate lag on s4 codfw - T203709
[06:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:39] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[06:57:09] RECOVERY - puppet last run on mw2285 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[07:07:26] !log Deploy schema change on s2 codfw master, this will generate lag on s2 codfw - T203709
[07:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:30] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[07:15:13] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:18:59] !log Deploy schema change on s7 codfw master, this will generate lag on s7 codfw - T203709
[07:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:02] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[07:26:13] (PS1) Elukey: Add an-worker1078-95 basic settings [puppet] - https://gerrit.wikimedia.org/r/473359 (https://phabricator.wikimedia.org/T207192)
[07:41:39] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[07:42:45] !log Deploy schema change on s3 codfw master, this will generate lag on s3 codfw - T203709
[07:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:49] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[07:42:54] (CR) Elukey: [C: 2] Add an-worker1078-95 basic settings [puppet] - https://gerrit.wikimedia.org/r/473359 (https://phabricator.wikimedia.org/T207192) (owner: Elukey)
[07:46:26] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1078.eqiad.wmnet'...
[07:46:49] (PS6) Vgutierrez: certcentral: switch to active/passive [puppet] - https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161)
[07:52:35] (PS7) Vgutierrez: certcentral: switch to active/passive [puppet] - https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161)
[07:59:04] (PS8) Vgutierrez: certcentral: switch to active/passive [puppet] - https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161)
[07:59:21] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for WMDE-leszek - https://phabricator.wikimedia.org/T208717 (MoritzMuehlenhoff) Resolved>Open Leszek; you're now using the same SSH key in Cloud VPS as in the production cluster. This is a security risk as WMCS...
[08:02:41] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:02:48] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for WMDE-leszek - https://phabricator.wikimedia.org/T208717 (WMDE-leszek)
[08:04:11] (CR) Vgutierrez: "pcc seems happy https://puppet-compiler.wmflabs.org/compiler1002/13471/" [puppet] - https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161) (owner: Vgutierrez)
[08:04:13] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for WMDE-leszek - https://phabricator.wikimedia.org/T208717 (WMDE-leszek) Thanks @MoritzMuehlenhoff for your attention and noticing my sloppiness. Changed the ssh key, for the one to be only used for production access.
[08:04:40] (CR) Filippo Giunchedi: [C: 2] hieradata: rollout syslog_exporter in eqiad [puppet] - https://gerrit.wikimedia.org/r/473173 (https://phabricator.wikimedia.org/T206633) (owner: Filippo Giunchedi)
[08:04:49] (PS2) Filippo Giunchedi: hieradata: rollout syslog_exporter in eqiad [puppet] - https://gerrit.wikimedia.org/r/473173 (https://phabricator.wikimedia.org/T206633)
[08:07:25] !log rollout rsyslog_exporter to eqiad
[08:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:08] !log Deploy schema change on s3 codfw master, this will generate lag on s3 codfw - T205913
[08:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:11] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913
[08:12:35] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[08:13:42] (PS1) Elukey: Set a different mac address for an-worker1078's DHCP [puppet] - https://gerrit.wikimedia.org/r/473387 (https://phabricator.wikimedia.org/T207192)
[08:14:32] !log Deploy schema change on s4 codfw master, this will generate lag on s4 codfw - T205913
[08:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:34] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913
[08:14:35] (CR) Elukey: [C: 2] Set a different mac address for an-worker1078's DHCP [puppet] - https://gerrit.wikimedia.org/r/473387 (https://phabricator.wikimedia.org/T207192) (owner: Elukey)
[08:17:48] !log Deploy schema change on s6 codfw master, this will generate lag on s6 codfw - T205913
[08:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:18] !log Deploy schema change on s2 codfw master, this will generate lag on s2 codfw - T205913
[08:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:19] !log Deploy schema change on s2 codfw master, this will generate lag on s7 codfw - T205913
[08:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:21] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913
[08:23:17] Operations, Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (Joe) a:Joe>None
[08:24:29] !log Deploy schema change on s5 codfw master, this will generate lag on s5 codfw - T205913
[08:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:15] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): Relabel labvirt1016.eqiad.wmnet as cloudvirt1016.eqiad.wmnet - https://phabricator.wikimedia.org/T209427 (ArielGlenn) p:Triage>Normal
[08:28:29] Operations, decommission, User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (ArielGlenn) p:Triage>Normal
[08:35:55] PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 682.87 seconds
[08:36:03] PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 690.42 seconds
[08:36:19] PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 706.59 seconds
[08:36:21] PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 708.14 seconds
[08:36:27] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 714.76 seconds
[08:41:10] ^ checking
[08:41:42] Operations: change my email address in the techcom alias - https://phabricator.wikimedia.org/T209391 (ArielGlenn) Open>Resolved p:Triage>Normal a:ArielGlenn Done.
[08:42:53] Operations, ops-eqiad: Degraded RAID on cloudelastic1003 - https://phabricator.wikimedia.org/T209408 (ArielGlenn) p:Triage>Normal
[08:44:20] banyek: can you help checking what's going on please?
[08:44:52] yes
[08:46:14] I thought it is the schema change
[08:46:40] I found it and fixed it
[08:46:55] RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 0.32 seconds
[08:47:03] RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 0.16 seconds
[08:47:21] RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[08:47:21] RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 0.29 seconds
[08:47:29] RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 0.24 seconds
[08:50:59] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:51:36] (PS1) Elukey: Set MAC address of 10G interface for an-worker1078 [puppet] - https://gerrit.wikimedia.org/r/473404
[08:52:41] (CR) Elukey: [C: 2] Set MAC address of 10G interface for an-worker1078 [puppet] - https://gerrit.wikimedia.org/r/473404 (owner: Elukey)
[08:56:35] RECOVERY - Check systemd state on ms-be1035 is OK: OK - running: The system is fully operational
[09:11:59] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[09:17:44] !log Deploy schema change on db2053 - T86339
[09:17:45] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Refactor puppet WDQS module - https://phabricator.wikimedia.org/T208201 (Mathew.onipe)
[09:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:48] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339
[09:17:48] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Cleanup wdqs puppet profile to include the new changes based on refactoring - https://phabricator.wikimedia.org/T208395 (Mathew.onipe) Open>Resolved
[09:18:02] (PS1) Giuseppe Lavagetto: mediawiki::web::vhost: remove the set_handler feature flag [puppet] - https://gerrit.wikimedia.org/r/473411
[09:18:03] Operations, Graphite, Patch-For-Review, Performance-Team (Radar), Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (fgiunchedi)
[09:22:32] !log updated stretch netinst image for 9.6 point release
[09:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:35] win 63
[09:23:54] <_joe_> lose 64
[09:24:23] hah
[09:24:58] windows overflow... I got RIP in paravoid's computer
[09:25:08] (CR) Gehel: "Minor comments inline, otherwise LGTM." (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[09:25:46] (CR) Vgutierrez: [C: 2] certcentral: switch to active/passive [puppet] - https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161) (owner: Vgutierrez)
[09:25:54] (PS9) Vgutierrez: certcentral: switch to active/passive [puppet] - https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161)
[09:27:46] if i think a mailing list email failed to land in my inbox, should i file a phab ticket about it? or is it really not worth it?
[09:28:48] (CR) Filippo Giunchedi: "I'm not sure if nginx::status_site makes sense on its own without the collector? I'm for absenting the status site for now and we can rein" [puppet] - https://gerrit.wikimedia.org/r/473302 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[09:29:50] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 2485 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/opsvar-lag_datasource=eqiad+prometheus/opsvar-mirror_name=main-eqiad_to_main-codfw
[09:29:59] (CR) Gehel: "Minor comments inline, otherwise, LGTM" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) (owner: Herron)
[09:30:50] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:31:38] (CR) Filippo Giunchedi: [C: 1] hiera: diamond::remove on openstack control role [puppet] - https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[09:34:12] (CR) Filippo Giunchedi: [C: 1] "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/473295 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[09:36:22] (CR) Gehel: "minor comment inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: Herron)
[09:36:52] PROBLEM - puppet last run on certcentral2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:37:14] (PS1) Vgutierrez: certcentral: Fix cron ensure data type [puppet] - https://gerrit.wikimedia.org/r/473418 (https://phabricator.wikimedia.org/T209161)
[09:37:19] :(
[09:37:45] pcc misled me :(
[09:38:01] (CR) Filippo Giunchedi: [C: 1] "LGTM, modulo Gehel's comments" [puppet] - https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) (owner: Herron)
[09:40:14] (CR) Vgutierrez: [C: 2] "pcc happy https://puppet-compiler.wmflabs.org/compiler1002/13473/ and showing the expecting values" [puppet] - https://gerrit.wikimedia.org/r/473418 (https://phabricator.wikimedia.org/T209161) (owner: Vgutierrez)
[09:40:25] (PS2) Vgutierrez: certcentral: Fix cron ensure data type [puppet] - https://gerrit.wikimedia.org/r/473418 (https://phabricator.wikimedia.org/T209161)
[09:41:14] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[09:42:24] PROBLEM - MariaDB Slave SQL: s8 on db1124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.pagelinks: Cant find record in pagelinks, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1087-bin.003425, end_log_pos 688342348
[09:43:50] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1078.eqiad.wmnet'...
[09:47:26] (03PS3) 10Volans: remote: refactor Remote.query() API [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884) [09:48:53] (03CR) 10Volans: "I've also add the __len__ and __str__ to the RemoteHosts too as it seems useful in both contexts (and I have a use case in an upcoming CR)" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [09:49:50] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 227, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:51:20] PROBLEM - MariaDB Slave Lag: s8 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 675.92 seconds [09:52:02] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:53:29] (03PS1) 10Banyek: mariadb: depool db2046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473428 (https://phabricator.wikimedia.org/T85757) [09:54:13] 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10aborrero) [09:54:16] (03CR) 10Filippo Giunchedi: "Fails PCC https://puppet-compiler.wmflabs.org/compiler1002/13474/logstash1007.eqiad.wmnet/change.logstash1007.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [09:54:55] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10aborrero) [09:54:58] 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - 
https://phabricator.wikimedia.org/T209460 (10aborrero) [09:55:47] 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10aborrero) p:05Triage>03Normal [09:56:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Relabel labvirt1016.eqiad.wmnet as cloudvirt1016.eqiad.wmnet - https://phabricator.wikimedia.org/T209427 (10GTirloni) a:03GTirloni [09:57:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Relabel labvirt1016.eqiad.wmnet as cloudvirt1016.eqiad.wmnet - https://phabricator.wikimedia.org/T209427 (10GTirloni) a:05GTirloni>03None [09:58:00] PROBLEM - puppet last run on scb1003 is CRITICAL: CRITICAL: Puppet has 17 failures. Last run 3 minutes ago with 17 failures. Failed resources (up to 3 shown): Exec[chown /srv/deployment/cpjobqueue for deploy-service],Package[mathoid/deploy],Exec[chown /srv/deployment/mathoid for deploy-service],Package[citoid/deploy] [09:58:03] (03CR) 10Marostegui: [C: 031] mariadb: depool db2046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473428 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [09:58:41] (03CR) 10Banyek: [C: 032] mariadb: depool db2046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473428 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [09:58:58] (03CR) 10Banyek: [V: 032 C: 032] mariadb: depool db2046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473428 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [09:59:27] (03CR) 10Phuedx: [C: 04-1] "From the AC of T208755:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [09:59:28] !log depooling db2046 (T85757) [09:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:31] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:02:35] banyek: are you 
checking the above alert? [10:02:50] (03CR) 10Muehlenhoff: "Agreed, I don't think we need the status site, it's fine to simply remove diamond::collector::nginx from my PoV: Also. yesterday a dedicat" [puppet] - 10https://gerrit.wikimedia.org/r/473302 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [10:03:05] the s8 lag? [10:03:17] yes [10:03:27] the broken replication [10:04:13] yes, checking [10:04:34] thanks [10:04:44] but finishing depooling first as the change is already merged [10:04:51] it's just a scap [10:05:25] (03CR) 10jenkins-bot: mariadb: depool db2046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473428 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [10:06:37] 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10Krenair) > The document is: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron_ideal_model (edits welcome). I put in some basic ones: https://wikitech.wikimedi... [10:07:04] !log banyek@deploy1001 Synchronized wmf-config/db-codfw.php: T85757: depooling db2046 (duration: 00m 55s) [10:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:07] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:08:59] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1078.eqiad.wmnet'... [10:10:17] (03CR) 10Muehlenhoff: [C: 031] "This is fine to merge. 
The current metrics from https://grafana.wikimedia.org/dashboard/db/ntp-time-servers are a bit of a regression comp" [puppet] - 10https://gerrit.wikimedia.org/r/473295 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [10:13:03] (03CR) 10Ema: [C: 031] Remove Diamond from caches [puppet] - 10https://gerrit.wikimedia.org/r/472632 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [10:14:59] (03CR) 10Muehlenhoff: [C: 031] "Looks good (will need a manual rebase as the underlying patch was changed)" [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [10:18:04] (03PS1) 10Vgutierrez: certcentral: Fix ferm ssh-rsync ensure data type [puppet] - 10https://gerrit.wikimedia.org/r/473430 (https://phabricator.wikimedia.org/T209161) [10:18:31] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Fix ferm ssh-rsync ensure data type [puppet] - 10https://gerrit.wikimedia.org/r/473430 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez) [10:19:50] (03PS1) 10Ema: fifo-log-demux 0.1 [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/473432 (https://phabricator.wikimedia.org/T204225) [10:21:32] (03PS2) 10Vgutierrez: certcentral: Fix ferm ssh-rsync ensure data type [puppet] - 10https://gerrit.wikimedia.org/r/473430 (https://phabricator.wikimedia.org/T209161) [10:21:41] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10MoritzMuehlenhoff) That point release has happened and I upgraded our netinst images earlier the day, so this should be fine to re-install now. 
[10:21:48] (03PS6) 10Gehel: Increase tilerator num_workers maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos) [10:22:04] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Fix ferm ssh-rsync ensure data type [puppet] - 10https://gerrit.wikimedia.org/r/473430 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez) [10:22:34] (03PS1) 10Ladsgroup: Start reading from change_tag_def on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473433 (https://phabricator.wikimedia.org/T208846) [10:24:45] RECOVERY - puppet last run on pc2008 is OK: OK: Puppet is currently enabled, last run 12 minutes ago with 0 failures [10:25:06] (03CR) 10Zoranzoki21: Add new throttle rule for Art+Feminism Event on 2018-11-17 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473255 (https://phabricator.wikimedia.org/T209324) (owner: 10Zoranzoki21) [10:27:15] (03PS3) 10Vgutierrez: certcentral: Fix ferm ssh-rsync ensure data type [puppet] - 10https://gerrit.wikimedia.org/r/473430 (https://phabricator.wikimedia.org/T209161) [10:27:31] (03PS7) 10Gehel: Increase tilerator num_workers maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos) [10:27:41] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Fix ferm ssh-rsync ensure data type [puppet] - 10https://gerrit.wikimedia.org/r/473430 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez) [10:29:40] (03CR) 10Gehel: "puppet compiler looks good: https://puppet-compiler.wmflabs.org/compiler1002/13478/" [puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos) [10:30:22] (03PS4) 10Vgutierrez: certcentral: Fix ferm ssh-rsync ensure data type [puppet] - 10https://gerrit.wikimedia.org/r/473430 (https://phabricator.wikimedia.org/T209161) [10:30:50] 10Operations, 10Graphite, 10Patch-For-Review, 10Performance-Team (Radar), 10Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (10hashar) @fgiunchedi can we add the statsd proxy for 
the servers running Zuul? My previous comment above T88997#4676750 has all the relevant bits. It... [10:32:15] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1078.eqiad.wmnet'] ` and were **ALL** successful. [10:33:05] 10Operations, 10Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (10hashar) That might be related to Gerrit 2.15.6 upgrade T205784 . I am not familiar with Jetty though but we can at least dig in the logs on the cobalt server. [10:33:13] (03CR) 10DCausse: remote: refactor Remote.query() API (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [10:33:35] (03CR) 10Vgutierrez: [C: 032] "pcc looks sane https://puppet-compiler.wmflabs.org/compiler1002/13479/" [puppet] - 10https://gerrit.wikimedia.org/r/473430 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez) [10:35:09] RECOVERY - MariaDB Slave SQL: s8 on db1124 is OK: OK slave_sql_state Slave_SQL_Running: Yes [10:36:12] (03CR) 10Gehel: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [10:36:28] (03CR) 10Gehel: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [10:37:35] RECOVERY - puppet last run on certcentral2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:39:38] 10Operations, 10Graphite, 10Patch-For-Review, 10Performance-Team (Radar), 10Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (10fgiunchedi) >>! In T88997#4745665, @hashar wrote: > @fgiunchedi can we add the statsd proxy for the servers running Zuul? My previous comment above T88... 
[10:39:45] RECOVERY - MariaDB Slave Lag: s8 on db1124 is OK: OK slave_sql_lag Replication lag: 44.88 seconds [10:41:01] (03CR) 10Giuseppe Lavagetto: "Seems overall sound and quite a good idea :P I have some minor implementation questions, that can be answered inline. None of my comments " (033 comments) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/473432 (https://phabricator.wikimedia.org/T204225) (owner: 10Ema) [10:41:33] (03CR) 10Filippo Giunchedi: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [10:42:15] (03CR) 10Filippo Giunchedi: "PCC fails https://puppet-compiler.wmflabs.org/compiler1002/13481/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [10:43:37] (03CR) 10Filippo Giunchedi: [C: 032] Revert "logstash: temp stop managing indices" [puppet] - 10https://gerrit.wikimedia.org/r/473226 (owner: 10Filippo Giunchedi) [10:43:45] (03PS2) 10Filippo Giunchedi: Revert "logstash: temp stop managing indices" [puppet] - 10https://gerrit.wikimedia.org/r/473226 [10:46:23] 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10aborrero) [10:46:27] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10aborrero) [10:47:16] PROBLEM - puppet last run on certcentral1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/certcentral-certs-sync] [10:47:32] vgutierrez: no joy? 
[10:47:36] 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10aborrero) [10:47:46] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10aborrero) [10:47:48] volans: yup, cc2001 is happy [10:47:55] cc1001 is sad for other reasons [10:47:58] fix incoming :) [10:48:13] ehehe, ack [10:48:21] 10Operations, 10Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (10hashar) We had a lot of stack overflow errors on random pages such as gittiles, that is due to some pages being very long to prettify. Then at 5:36:43 UTC we had a lot of: ` [2018-11-14 05:36:43,643] [HTTP-8... [10:48:44] (03PS1) 10Elukey: Set correct MAC address for an-worker1079 [puppet] - 10https://gerrit.wikimedia.org/r/473438 [10:48:58] (03PS1) 10Vgutierrez: certcentral: Run certcentral-certs-sync with user cercentral [puppet] - 10https://gerrit.wikimedia.org/r/473439 (https://phabricator.wikimedia.org/T209161) [10:49:12] jouncebot: next [10:49:12] In 1 hour(s) and 10 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T1200) [10:49:17] volans: basically the security of keyholder is too tight and root has no access to the needed SSH key [10:49:24] (and that's OK) [10:49:33] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Run certcentral-certs-sync with user cercentral [puppet] - 10https://gerrit.wikimedia.org/r/473439 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez) [10:49:43] ok [10:50:01] (03PS1) 10WMDE-Fisch: Make AdvancedSearch the default on de-, fa-, ar-, and hu-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473440 (https://phabricator.wikimedia.org/T207640) [10:50:09] (03CR) 10Elukey: [C: 032] Set correct MAC address for an-worker1079 [puppet] - 10https://gerrit.wikimedia.org/r/473438 (owner: 
10Elukey) [10:50:39] (03PS2) 10Vgutierrez: certcentral: Run certcentral-certs-sync with user cercentral [puppet] - 10https://gerrit.wikimedia.org/r/473439 (https://phabricator.wikimedia.org/T209161) [10:50:43] (03PS3) 10Filippo Giunchedi: Revert "logstash: temp stop managing indices" [puppet] - 10https://gerrit.wikimedia.org/r/473226 [10:50:59] and I'm a lazy bastard... so I'm unable to configure my editor properly [10:51:08] hence the -1 by jenkins-bot [10:51:34] <_joe_> vgutierrez: s/configure/choose/ [10:51:45] (03CR) 10Volans: "reply inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [10:51:57] _joe_: I'm only human, I cannot use emacs [10:53:02] (03PS1) 10Marostegui: db-codfw.php: Add pc2010 as spare [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473444 (https://phabricator.wikimedia.org/T208383) [10:54:16] (03CR) 10Vgutierrez: [C: 032] "pcc is happy https://puppet-compiler.wmflabs.org/compiler1002/13482/" [puppet] - 10https://gerrit.wikimedia.org/r/473439 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez) [10:54:27] (03PS3) 10Vgutierrez: certcentral: Run certcentral-certs-sync with user cercentral [puppet] - 10https://gerrit.wikimedia.org/r/473439 (https://phabricator.wikimedia.org/T209161) [10:54:31] there's emacs for rebel^Wvim users with viper [10:54:32] * godog runs [10:54:38] PROBLEM - Host ganeti1006 is DOWN: PING CRITICAL - Packet loss = 100% [10:54:50] <_joe_> wat? [10:54:50] that's me ^ [10:54:51] (03CR) 10Tim Eulitz: [C: 031] "😎👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473440 (https://phabricator.wikimedia.org/T207640) (owner: 10WMDE-Fisch) [10:54:56] <_joe_> akosiaris: oh ok :P [10:54:58] RECOVERY - Host ganeti1006 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [10:55:06] <_joe_> did you restart networking? 
[10:55:10] reboot [10:55:16] RECOVERY - Check systemd state on ganeti1006 is OK: OK - running: The system is fully operational [10:55:23] <_joe_> akosiaris: I fixed the network interfaces on that host [10:55:24] but I 'll need to restart networking on ganeti1007 and ganeti1008 [10:55:37] <_joe_> but 1007 and 1008 are still unfixed and have the typo [10:55:40] oh, that was you ? ok [10:55:44] <_joe_> I was wondering where that was originated [10:55:47] I did PEBKAC on this one [10:55:50] <_joe_> yes, I wrote you as much :) [10:56:00] <_joe_> akosiaris: why is that unpuppetized? [10:56:05] manual action on my part. [10:56:20] cause it was a mess to puppetize it. I have https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/351603/ already [10:56:22] (03PS4) 10Vgutierrez: certcentral: Run certcentral-certs-sync with user cercentral [puppet] - 10https://gerrit.wikimedia.org/r/473439 (https://phabricator.wikimedia.org/T209161) [10:56:29] I guess I need to find the time to puppetize it correctly [10:56:36] <_joe_> ack, I can take a look and try to help [10:56:38] the mess btw is not this, is the current status quo [10:56:59] it's a rabbit hole essentially [10:57:10] maybe I can revisit it with a smaller scope and avoid being dragged down this time around [10:58:11] (03PS4) 10Zoranzoki21: Fix adding vendor files by default for commiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471533 (https://phabricator.wikimedia.org/T207058) [10:58:17] (03PS5) 10Zoranzoki21: Fix adding vendor files by default for commiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471533 (https://phabricator.wikimedia.org/T207058) [10:58:34] RECOVERY - puppet last run on scb1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:02:16] RECOVERY - puppet last run on certcentral1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:02:47] (03PS3) 10Pmiazga: Prod: Enable Schema.org page split test at 1% sampling 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [11:03:54] (03PS4) 10Pmiazga: Prod: Enable Schema.org page split test at 1% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [11:07:46] RECOVERY - Check systemd state on ganeti1007 is OK: OK - running: The system is fully operational [11:08:04] RECOVERY - Check systemd state on ganeti1008 is OK: OK - running: The system is fully operational [11:09:06] !log Deploy schema change on db2046 (T85757) [11:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:09] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [11:09:13] heh, I did not even have to restart networking over there after all. systemctl reset-failed fixed it (after making sure an ifdown analytics; ifup analytics worked fine) [11:10:06] (03PS5) 10Pmiazga: Prod: Enable Schema.org page split test at 1% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [11:11:48] !log draining ganeti1005 for reboot/kernel security update [11:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:59] (03CR) 10MSantos: [C: 031] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos) [11:18:49] (03PS2) 10Muehlenhoff: Remove Diamond from caches [puppet] - 10https://gerrit.wikimedia.org/r/472632 (https://phabricator.wikimedia.org/T183454) [11:20:37] (03CR) 10Muehlenhoff: [C: 032] Remove Diamond from caches [puppet] - 10https://gerrit.wikimedia.org/r/472632 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [11:25:38] 10Operations, 10Release Pipeline, 10Release-Engineering-Team: Design pipeline image versioning scheme - https://phabricator.wikimedia.org/T209088 (10akosiaris) I think we should support multiple tags per image (docker anyway does support that and they cost next to nothing on the registry level AFAIK) * Keep... [11:26:06] PROBLEM - Check systemd state on cp2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:26:47] (03PS6) 10Phuedx: Prod: Enable Schema.org page split test at 1% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [11:27:08] PROBLEM - Check systemd state on cp2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:27:22] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond] [11:28:12] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond] [11:28:15] (03CR) 10Phuedx: [C: 031] "PS6 adds a comment explaining the list of wikis with the sampling ratio set to 0." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [11:29:14] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "I think I was off by a week about when the train pauses? There are deploys this week, so I’ve added this to Thursday’s EU SWAT now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471738 (https://phabricator.wikimedia.org/T207854) (owner: 10Lucas Werkmeister (WMDE)) [11:31:12] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:31:20] <_joe_> moritzm: it seems your change create some issues? [11:31:30] <_joe_> I see puppet failure on cachess [11:33:18] PROBLEM - Check systemd state on cp3049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:33:28] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond] [11:35:02] PROBLEM - Check systemd state on cp2011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:35:05] looking, the diamond prerm fails some times, that's a bug in the deb we can't really fix [11:35:24] zeljkof: just to let you know tommorrow during EU swat I can "run the show", will be watching lucas and tarro.w on their first deploys ! 
:) [11:35:30] (03PS7) 10Reedy: Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm) [11:36:28] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm) [11:36:40] RECOVERY - Check systemd state on cp3049 is OK: OK - running: The system is fully operational [11:37:02] (03CR) 10Addshore: "We deployed wikidata data access to all wiktionaries a week or so ago." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm) [11:37:14] RECOVERY - Check systemd state on cp2015 is OK: OK - running: The system is fully operational [11:37:36] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [11:37:52] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond] [11:38:32] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:38:52] (03CR) 10Addshore: [C: 031] "looks sane to me!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [11:39:20] RECOVERY - Check systemd state on cp2001 is OK: OK - running: The system is fully operational [11:40:47] addshore: cool! 
[11:41:04] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 276 bytes in 0.379 second response time [11:41:50] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10MoritzMuehlenhoff) Icinga is flagging broken memory on 1053, simply leaving a note here as that host is up for decom anyway. [11:43:03] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] "Note this is only for Wikipedias, not for other wikis in these languages. I believe this is exactly what should happen, just want to make " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473440 (https://phabricator.wikimedia.org/T207640) (owner: 10WMDE-Fisch) [11:43:26] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:43:54] RECOVERY - Check systemd state on cp2011 is OK: OK - running: The system is fully operational [11:44:36] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:36] addshore: gj, you broke addWiki a year ago :D [11:44:59] Reedy: i did? [11:45:06] i do remember touching it [11:45:07] Yup [11:45:08] See https://phabricator.wikimedia.org/T209474 [11:45:14] It's almost a year to the day [11:45:15] Haha [11:45:32] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [11:45:39] wait, using it with --wiki=aawiktionary no longer works? [11:45:47] That's what I pasted :) [11:45:48] that did work before, there was another ticket about it too [11:47:05] i dont even see where that error message comes form? 
[11:47:18] https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/commit/955db59c950e286284f25a119a4930b0e07079e1 [11:47:23] i need to pull >.> [11:47:40] https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/addWiki.php#L72-L77 [11:47:58] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:48:52] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:48:56] Reedy: can actually see which bit of that condition is broken with just my eyes...? [11:49:30] $this->getOption( 'wiki' ) doesn't get the wiki? [11:49:35] I didn't pay enough attention [11:49:37] It seems so [11:50:59] let me have a poke [11:52:27] addshore: You can probably just abuse $wgDBname [11:52:50] true [11:53:00] else... [11:53:00] if ( isset( $this->mOptions['wiki'] ) ) { [11:53:00] $bits = explode( '-', $this->mOptions['wiki'] ); [11:53:00] if ( count( $bits ) == 1 ) { [11:53:00] $bits[] = ''; [11:53:01] } [11:53:03] define( 'MW_DB', $bits[0] ); [11:53:07] define( 'MW_PREFIX', $bits[1] ); [11:53:25] $wgDBname sounds nicer [11:53:40] PROBLEM - Check systemd state on cp2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:53:53] Reedy: patch is up [11:53:54] PROBLEM - puppet last run on cp2007 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond] [11:53:58] ta [11:57:03] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10elukey) @Cmjohnson @RobH I tried this morning to configure the an-workers with https://gerrit.wikimedia.org/r/#/c/473359/ but then... 
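Note on the `--wiki` handling pasted at 11:53: the snippet splits the option value on `-` into a database name and an optional table prefix, defaulting the prefix to an empty string when no `-` is present. A minimal Python sketch of that same logic follows, as an illustration only. The function name `split_wiki_option` is hypothetical, and the PHP quoted in the log remains the actual code under discussion.

```python
from typing import Tuple


def split_wiki_option(wiki: str) -> Tuple[str, str]:
    """Mirror the pasted PHP: split a --wiki value into (db name, prefix).

    Hypothetical helper for illustration; not part of MediaWiki.
    """
    # Like PHP's explode('-', ...) with no limit, split on every '-'.
    bits = wiki.split("-")
    # With no '-' in the value there is no prefix, so default it to ''.
    if len(bits) == 1:
        bits.append("")
    # The PHP defines MW_DB from bits[0] and MW_PREFIX from bits[1].
    return bits[0], bits[1]
```

Under this reading, a plain value such as `aawiktionary` yields the database name unchanged and an empty prefix, which is consistent with the suggestion in the log that `$wgDBname` would be a simpler source for the same value.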
[11:57:33] (03CR) 10Banyek: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473444 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [11:59:38] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 848.31 seconds [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T1200). [12:00:05] raynor and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:11] o/ [12:00:41] raynor and Amir1: you're both deployers, right? so go ahead, self organize and deploy your patches :) [12:00:41] o/ [12:00:47] I'm around if you need me [12:00:58] sure, raynor, you go first [12:01:02] ok thx [12:01:39] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.3/extensions/WikimediaMaintenance/addWiki.php: Unbreak adding TMH tables (duration: 00m 55s) [12:01:39] (03CR) 10Pmiazga: [C: 032] Prod: Enable Schema.org page split test at 1% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [12:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:48] RECOVERY - Check systemd state on cp2007 is OK: OK - running: The system is fully operational [12:02:52] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/WikimediaMaintenance/addWiki.php: Unbreak adding TMH tables (duration: 00m 53s) [12:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:14] (03Merged) 10jenkins-bot: Prod: Enable Schema.org page split test at 1% sampling [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [12:03:16] merging, Amir1 - I'll let you know once I'm done, I think I'll need ~10m [12:03:52] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [12:04:15] Sure [12:04:44] (03CR) 10jenkins-bot: Prod: Enable Schema.org page split test at 1% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [12:05:36] raynor: i have the list of enwiki articles ready [12:06:01] SEO 1% is on mwdebug10023 [12:06:04] 1002* [12:07:20] (03PS8) 10Urbanecm: Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) [12:07:24] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::vhost: remove the set_handler feature flag [puppet] - 10https://gerrit.wikimedia.org/r/473411 [12:07:38] (03CR) 10Urbanecm: "> Patch Set 7:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm) [12:08:06] Urbanecm: fyi, addwiki is broken yet again :D [12:08:22] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [12:08:34] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [12:08:42] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:08:55] Reedy, thank you [12:08:59] I amended config for yuewiktionary, so tests are hopefully not broken and it is in wikidataclient.dblist [12:09:10] <_joe_> arturo: are you looking at labstore1004? 
[12:10:16] _joe_: no, we are looking at labnet1001, but could be realted [12:10:20] related* [12:11:04] phuedx, - works to me on debug [12:11:07] you? [12:11:13] raynor: checking [12:11:32] I also checked the output against the https://search.google.com/structured-data/testing-tool -> it's ok [12:11:38] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [12:11:53] 10Operations, 10Release Pipeline, 10Release-Engineering-Team: Design pipeline image versioning scheme - https://phabricator.wikimedia.org/T209088 (10fselles) +1 latest should be avoided for production, in my experience is also problematic for development (since you don't know which version are you running c... [12:12:45] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::vhost: remove the set_handler feature flag [puppet] - 10https://gerrit.wikimedia.org/r/473411 [12:14:58] there is one problem - the mainEntityUrl still points to http [12:15:07] raynor: i've tested a number of articles from the list (https://quarry.wmflabs.org/query/31164) and i see the json+ld block [12:15:25] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10GTirloni) [12:15:26] sameAs, and mainEntity [12:15:33] addshore: ^ [12:15:46] * addshore reads up [12:16:07] marostegui: you around? we may have connection overloading on m5-master [12:16:09] context: https://phabricator.wikimedia.org/T209352 [12:16:23] raynor: I thought that was already discussed on a gerrit patch? I believe i spotted it this morning or last night? [12:16:30] comments by Lucas_WMDE [12:17:07] addshore: you're correct [12:17:10] raynor: https://phabricator.wikimedia.org/T153563 [12:17:13] ah, right, yes, there is -2 there [12:17:22] raynor: https://gerrit.wikimedia.org/r/#/c/473292/ even [12:17:26] sorry, my bad, looks good, lets roll [12:17:41] woo! 
[12:17:43] go go go [12:17:51] yeah, I'm deploying and I'm overcautious :) [12:18:02] raynor: i'll resolve https://phabricator.wikimedia.org/T209352 as invalid with a comment [12:18:31] phuedx, w8 [12:18:53] that ask has two issues, first is the https, and the second one is that the betacluster seo points to prod wikibase [12:19:11] !log pmiazga@deploy1001 Synchronized wmf-config: SWAT: [[gerrit:473079]|Enable Schema.org page split test at 1% sampling (T208755)]] (duration: 00m 54s) [12:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:13] T208755: Launch A/B test for sameAs property - https://phabricator.wikimedia.org/T208755 [12:19:23] Amir1, I'm done, over to you [12:19:31] Thank you! [12:20:31] zeljkof, thanks for your presence, I feel much more confident with SWATs when you're around [12:21:01] raynor: I'm glad I could help by not doing anything ;) [12:21:06] (03PS2) 10Ladsgroup: Start reading from change_tag_def on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473433 (https://phabricator.wikimedia.org/T208846) [12:21:15] (03CR) 10Giuseppe Lavagetto: jobqueue_redis: Purge role jobqueue_redis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/473029 (https://phabricator.wikimedia.org/T198220) (owner: 10Effie Mouzeli) [12:22:17] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/13485/mwdebug2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/473411 (owner: 10Giuseppe Lavagetto) [12:22:41] oic [12:22:51] raynor: thanks. 
i won't resolve it [12:22:57] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473433 (https://phabricator.wikimedia.org/T208846) (owner: 10Ladsgroup) [12:24:19] (03Merged) 10jenkins-bot: Start reading from change_tag_def on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473433 (https://phabricator.wikimedia.org/T208846) (owner: 10Ladsgroup) [12:27:24] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [12:30:33] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::vhost: remove the set_handler feature flag [puppet] - 10https://gerrit.wikimedia.org/r/473411 (owner: 10Giuseppe Lavagetto) [12:30:43] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:473433|Start reading from change_tag_def on wikidatawiki (T208846)]] (duration: 00m 55s) [12:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:47] T208846: Start reading from change_tag_def on wikidatawiki - https://phabricator.wikimedia.org/T208846 [12:31:28] (03PS4) 10Ladsgroup: Revert the language of votewiki to English (en) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473148 (https://phabricator.wikimedia.org/T207560) (owner: 10Huji) [12:32:40] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473148 (https://phabricator.wikimedia.org/T207560) (owner: 10Huji) [12:32:52] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active [12:33:53] (03Merged) 10jenkins-bot: Revert the language of votewiki to English (en) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473148 (https://phabricator.wikimedia.org/T207560) (owner: 10Huji) [12:35:11] (03CR) 10jenkins-bot: Start reading from change_tag_def on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473433 (https://phabricator.wikimedia.org/T208846) 
(owner: 10Ladsgroup) [12:35:16] (03CR) 10jenkins-bot: Revert the language of votewiki to English (en) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473148 (https://phabricator.wikimedia.org/T207560) (owner: 10Huji) [12:35:56] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:473148|Revert the language of votewiki to English (en) (T207560)]] (duration: 00m 55s) [12:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:59] T207560: Carry out the 2018 fawiki elections on votewiki - https://phabricator.wikimedia.org/T207560 [12:36:34] !log EU SWAT is done [12:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:27] !log installing python security updates on trusty [12:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:48] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/WikimediaMaintenance/addWiki.php: Unbreak adding wiktionary (duration: 00m 53s) [12:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:50] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.3/extensions/WikimediaMaintenance/addWiki.php: Unbreak adding wiktionary (duration: 00m 52s) [12:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:23] !log installing python3.4 security updates on trusty (Debian already fixed) [12:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:08] addshore: Params are the wrong way round :P [12:49:23] Reedy: your the wrong way around [12:49:26] what do you mean? 
:P [12:50:43] addshore: You're looking for $wgDBname in 'wiktionary' [12:50:46] Not the other way round [12:50:53] bwhahahahaa [12:51:22] So the earlier change probably wasn't needed [12:51:22] classic [12:56:29] !log installing gettext "security" updates for trusty [12:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T1300) [13:00:49] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.3/extensions/WikimediaMaintenance/addWiki.php: Fixing addshores code... (duration: 00m 55s) [13:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:47] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/WikimediaMaintenance/addWiki.php: Fixing addshores code... (duration: 00m 53s) [13:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:24] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:07:38] !log installing ghostscript security updates on stretch [13:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:51] Reedy, it's still a little broken [13:14:05] the newproject announcement emails it's generating say 'for a in ' [13:14:20] That's probably because I just commented out half of the script to make it run the second half [13:14:25] between the 'a' and 'in' is supposed to be the project name :/ [13:14:29] hah [13:14:42] oh [13:14:44] new wiki time [13:14:51] classic addWiki fix [13:15:01] \o/ ping me when yue.wikt loads to the real wiki [13:15:11] Though I didn't think I commented out name/lang [13:15:12] but meh [13:15:49] revi, the emails used to get some arbitrary delay so they'd be working by the time the announcement went out [13:15:53] (03PS9) 10Reedy: Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm) [13:15:58] (03CR) 10Reedy: [C: 032] Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm) [13:16:01] but that code broke and we took it out [13:16:22] newproject email was broken when the last batch of new wiki was created [13:16:23] IIRC [13:16:29] revi: shit happens :) [13:16:37] yeah [13:16:38] lol [13:17:03] (03Merged) 10jenkins-bot: Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm) [13:17:19] (03PS8) 10Reedy: Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm) [13:17:24] (03CR) 10Reedy: [C: 032] Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm) 
[13:18:20] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm) [13:18:23] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm) [13:18:56] (03PS21) 10Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) [13:19:09] (03CR) 10jenkins-bot: Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm) [13:20:20] (03PS9) 10Reedy: Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm) [13:20:25] (03CR) 10Reedy: [C: 032] Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm) [13:21:04] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 928.43 seconds [13:21:41] (03Merged) 10jenkins-bot: Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm) [13:22:05] (03PS3) 10Reedy: Initial configuration for punjabiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468815 (https://phabricator.wikimedia.org/T204477) (owner: 10Urbanecm) [13:22:09] (03CR) 10Reedy: [C: 032] Initial configuration for punjabiwikimedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468815 (https://phabricator.wikimedia.org/T204477) (owner: 10Urbanecm) [13:23:28] (03Merged) 10jenkins-bot: Initial configuration 
for punjabiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468815 (https://phabricator.wikimedia.org/T204477) (owner: 10Urbanecm) [13:25:50] Urbanecm: About? Can you rebase the shnwiki patch? [13:26:33] (03PS1) 10Reedy: Stop breaking blame for wikimedia special cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473487 [13:32:02] (03PS16) 10Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) [13:32:36] (03PS4) 10Reedy: Initial configuration for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467323 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm) [13:32:42] (03PS5) 10Reedy: Initial configuration for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467323 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm) [13:33:12] (03PS6) 10Reedy: Initial configuration for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467323 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm) [13:33:16] (03CR) 10Reedy: [C: 032] Initial configuration for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467323 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm) [13:34:03] (03CR) 10jenkins-bot: Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm) [13:34:05] (03CR) 10jenkins-bot: Initial configuration for punjabiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468815 (https://phabricator.wikimedia.org/T204477) (owner: 10Urbanecm) [13:34:37] (03Merged) 10jenkins-bot: Initial configuration for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467323 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm) [13:34:51] (03CR) 10jenkins-bot: Initial configuration for shnwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/467323 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm) [13:35:18] (03PS1) 10Reedy: Add new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473488 [13:36:24] PROBLEM - DPKG on scandium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:37:40] (03PS2) 10Reedy: Add new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473488 (https://phabricator.wikimedia.org/T206777) [13:37:42] (03PS1) 10Muehlenhoff: Remove Diamond on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/473489 (https://phabricator.wikimedia.org/T183454) [13:37:51] (03CR) 10Reedy: [C: 032] Add new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473488 (https://phabricator.wikimedia.org/T206777) (owner: 10Reedy) [13:39:34] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10GTirloni) [13:42:41] (03PS1) 10Muehlenhoff: Remove Diamond from Kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/473490 (https://phabricator.wikimedia.org/T183454) [13:42:48] (03CR) 10Effie Mouzeli: jobqueue_redis: Purge role jobqueue_redis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/473029 (https://phabricator.wikimedia.org/T198220) (owner: 10Effie Mouzeli) [13:45:05] !log reedy@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided) [13:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:35] !log plugin and JVM upgrade completed on elasticsearch / cirrus / codfw - T209293 [13:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:38] T209293: Prepare a deb package with the experimental highlighter 5.5.2.4 - https://phabricator.wikimedia.org/T209293 [13:47:28] (03PS8) 10Gehel: Increase tilerator num_workers maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos) 
[13:48:10] !log reedy@deploy1001 Synchronized langlist: shn (duration: 00m 52s) [13:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:34] !log reedy@deploy1001 Synchronized dblists/: new wikis! (duration: 00m 53s) [13:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:37] (03CR) 10jenkins-bot: Add new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473488 (https://phabricator.wikimedia.org/T206777) (owner: 10Reedy) [13:50:00] (03CR) 10Gehel: [C: 032] Increase tilerator num_workers maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos) [13:50:47] !log reedy@deploy1001 Synchronized static/images/project-logos/: (no justification provided) (duration: 00m 53s) [13:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:03] !log restarting tilerator on maps1004 for config change [13:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:50] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: new wikis (duration: 00m 53s) [13:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:55] jouncebot: next [13:51:55] In 0 hour(s) and 8 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T1400) [13:52:00] RECOVERY - DPKG on scandium is OK: All packages OK [13:52:04] (03CR) 10Marostegui: [C: 032] db-codfw.php: Add pc2010 as spare [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473444 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [13:53:28] (03Merged) 10jenkins-bot: db-codfw.php: Add pc2010 as spare [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473444 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [13:54:14] PROBLEM - Host labservices1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:54:23] (03PS2) 10Ema: 
fifo-log-demux 0.1 [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/473432 (https://phabricator.wikimedia.org/T204225) [13:54:36] !log reedy@deploy1001 Synchronized multiversion/MWMultiVersion.php: (no justification provided) (duration: 00m 53s) [13:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:00] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.666 second response time [13:55:14] RECOVERY - Host labservices1002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [13:55:51] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Add pc2010 as spare - T208383 (duration: 00m 53s) [13:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:54] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 [13:56:08] (03CR) 10Filippo Giunchedi: [C: 031] Remove Diamond from Kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/473490 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [13:56:19] (03CR) 10Filippo Giunchedi: [C: 031] Remove Diamond on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/473489 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [13:56:20] is it just me that shnwiki loads internal errors? [13:56:33] Nope [13:56:38] TZ issue [13:56:50] fails for me too [13:56:50] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:51] ok! 
[13:56:59] (03PS1) 10Reedy: Fix shnwiki TZ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473494 [13:57:25] I could enter editing area but cannot save [13:57:42] !log reedy@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided) [13:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:50] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 0.686 second response time [13:58:08] lol even preview fails [13:58:13] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10GTirloni) [13:58:15] (03PS2) 10Reedy: Remove shnwiki TZ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473494 (https://phabricator.wikimedia.org/T206777) [13:58:22] (03CR) 10Reedy: [C: 032] Remove shnwiki TZ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473494 (https://phabricator.wikimedia.org/T206777) (owner: 10Reedy) [13:58:32] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:59:04] 10Operations, 10User-Elukey: Apply interface::rps to all the mc hosts - https://phabricator.wikimedia.org/T209489 (10elukey) p:05Triage>03Normal [13:59:46] Amir1: We have a huge increase of queries in enwiki slaves (I haven't checked other sections). I am checking the timeline, can that be related to your change? [13:59:50] banyek: can you check other sections? [14:00:04] hashar: Your horoscope predicts another unfortunate MediaWiki train - European version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T1400). [14:00:06] marostegui: mine should only affect wikidata [14:00:26] marostegui: yes [14:00:44] Reedy: Anything from your changes that could potentially affect enwiki? 
[14:01:01] Not AFAIK [14:01:56] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104&from=now-24h&to=now this is scary [14:02:11] (03PS1) 10Reedy: Remove punjabi from MWMultiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473498 (https://phabricator.wikimedia.org/T204477) [14:02:13] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Fix shnwiki TZ (duration: 00m 54s) [14:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:26] addshore: you around? [14:02:31] shnwiki works fine now [14:02:37] \o [14:02:48] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10MoritzMuehlenhoff) @Krinkle You've edited https://grafana.wikimedia.org/dashboard/db/cluster-board-graphite back in October the last time, that d... [14:02:48] marostegui: whats happening? [14:03:04] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/473079/ was deployed around the time we had the huge increase on reads on enwiki [14:03:07] (03PS2) 10Reedy: Remove punjabi from MWMultiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473498 (https://phabricator.wikimedia.org/T204477) [14:03:18] addshore: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104&from=now-24h&to=now [14:03:33] marostegui: QPS increased? [14:03:36] yep [14:03:38] could be the ceo thing [14:03:41] like crazy [14:03:41] *looks at the time* [14:03:52] banyek: does it happen on other sections too? 
[14:03:59] (03CR) 10jenkins-bot: db-codfw.php: Add pc2010 as spare [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473444 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [14:04:01] (03CR) 10jenkins-bot: Remove shnwiki TZ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473494 (https://phabricator.wikimedia.org/T206777) (owner: 10Reedy) [14:04:02] that was a peak yesterday at s3: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&from=now-24h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1123&var-port=9104 [14:04:11] (03PS3) 10Ema: fifo-log-demux 0.1 [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/473432 (https://phabricator.wikimedia.org/T204225) [14:04:11] marostegui: I am checking [14:04:26] marostegui: just selects? [14:04:31] not just one host at a time I am s3 [14:04:37] and all the new wikis doesn't have standard new wiki main pages meh [14:04:40] banyek: that doesn't look related (the one from yestrday) [14:04:56] addshore: checking [14:05:06] (03PS1) 10Reedy: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473499 [14:05:06] 12:19 pmiazga@deploy1001: Synchronized wmf-config: SWAT: [[gerrit:473079]|Enable Schema.org page split test at 1% sampling (T208755)]] (duration: 00m 54s) [14:05:07] T208755: Launch A/B test for sameAs property - https://phabricator.wikimedia.org/T208755 [14:05:08] (03CR) 10Reedy: [C: 032] Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473499 (owner: 10Reedy) [14:05:18] (03CR) 10Ema: fifo-log-demux 0.1 (033 comments) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/473432 (https://phabricator.wikimedia.org/T204225) (owner: 10Ema) [14:05:18] perhaps [14:05:34] addshore: can we revert? [14:05:53] phuedx: raynor ^^ [14:06:02] wait, is that even on enwiki? 
[14:06:05] *looks* [14:06:11] s2,s3,s4,s5 good so far [14:06:26] 'default' => 0.01,, so yes enwiki [14:06:31] (03Merged) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473499 (owner: 10Reedy) [14:06:45] addshore: let's revert that just in case, so we can either confirm or discard [14:06:51] I [14:06:55] I'm here, let me read [14:07:02] s6 good [14:07:28] !log reedy@deploy1001 Synchronized wmf-config/interwiki.php: Updating interwiki cache (duration: 02m 25s) [14:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:31] !log starting plugin and JVM upgrade on elasticsearch / cirrus / eqiad - T209293 [14:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:34] T209293: Prepare a deb package with the experimental highlighter 5.5.2.4 - https://phabricator.wikimedia.org/T209293 [14:07:45] s7 has the same pattern too [14:07:56] yup, i think it must be that patch [14:08:49] raynor: okay to revert? 
[14:09:10] (03Abandoned) 10Reedy: Remove punjabi from MWMultiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473498 (https://phabricator.wikimedia.org/T204477) (owner: 10Reedy) [14:10:04] yeah, I think you can revert, we enabled A/B test only to 1% (it means 0.5% of page requests should be affected) [14:10:13] !log Wiki created T205714 T207584 T205713 T206916 [14:10:18] * addshore will revert [14:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:20] T205714: Prepare and check storage layer for yuewiktionary - https://phabricator.wikimedia.org/T205714 [14:10:21] T206916: Prepare and check storage layer for shnwiki - https://phabricator.wikimedia.org/T206916 [14:10:21] T207584: Prepare and check storage layer for punjabiwikimedia - https://phabricator.wikimedia.org/T207584 [14:10:21] T205713: Prepare and check storage layer for liwikinews - https://phabricator.wikimedia.org/T205713 [14:10:26] (03PS1) 10Filippo Giunchedi: prometheus: aggregate rsyslog_queue_full rate [puppet] - 10https://gerrit.wikimedia.org/r/473501 (https://phabricator.wikimedia.org/T206633) [14:10:32] (03PS1) 10Addshore: Revert "Prod: Enable Schema.org page split test at 1% sampling" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473502 [14:10:39] (03PS2) 10Addshore: Revert "Prod: Enable Schema.org page split test at 1% sampling" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473502 [14:10:50] (03CR) 10jerkins-bot: [V: 04-1] prometheus: aggregate rsyslog_queue_full rate [puppet] - 10https://gerrit.wikimedia.org/r/473501 (https://phabricator.wikimedia.org/T206633) (owner: 10Filippo Giunchedi) [14:11:21] (03CR) 10Addshore: [C: 032] Revert "Prod: Enable Schema.org page split test at 1% sampling" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473502 (owner: 10Addshore) [14:11:43] Reedy: are you done with your syncing ? 
:P [14:11:47] Yup [14:11:54] coolio [14:11:55] addshore - we do some queries (we need to load the page_random) to verify if the page is in sampling session [14:12:16] (03PS2) 10Filippo Giunchedi: prometheus: aggregate rsyslog_queue_full rate [puppet] - 10https://gerrit.wikimedia.org/r/473501 (https://phabricator.wikimedia.org/T206633) [14:12:40] (03Merged) 10jenkins-bot: Revert "Prod: Enable Schema.org page split test at 1% sampling" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473502 (owner: 10Addshore) [14:13:18] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: aggregate rsyslog_queue_full rate [puppet] - 10https://gerrit.wikimedia.org/r/473501 (https://phabricator.wikimedia.org/T206633) (owner: 10Filippo Giunchedi) [14:13:55] syncing [14:14:35] !log addshore@deploy1001 Synchronized wmf-config: Revert Prod: Enable Schema.org page split test at 1% sampling (duration: 00m 54s) [14:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:44] seems recovering https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104&from=now-3h&to=now [14:15:17] we'll see now that the patch has been deployed [14:15:52] marostegui: is it possible to see what the queries werE? 
[14:16:09] banyek, marostegui - if this our patch, what are the next step, definitely we have to fix our patch and use some caching not to do that many selects [14:17:32] https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/8b339fcb93c3ab790620c6823106740880ae2f53/client/includes/Store/Sql/PageRandomLookup.php#L42 [14:18:02] this the query we were doing - `select page_random from page where page_id = ${}` [14:18:44] the QPS still looks kind of recovery, still need a few more mins to see though i think [14:18:51] (03PS1) 10Volans: Add Icinga module [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) [14:19:00] (03CR) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473499 (owner: 10Reedy) [14:19:02] (03CR) 10jenkins-bot: Revert "Prod: Enable Schema.org page split test at 1% sampling" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473502 (owner: 10Addshore) [14:19:28] (03CR) 10Volans: elasticsearch_cluster: multi-cluster/multi-instance support (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [14:20:21] marostegui: also, curious, how did you spot that? did alarms go off, or did you just happen to spot it? 
[14:20:43] (03CR) 10Gehel: remote: refactor Remote.query() API (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [14:21:03] (03PS2) 10Volans: Add Icinga module [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) [14:21:14] addshore: As part of my workflow I monitor tendril _pretty_ often to check the query values and to check if they are under normal values [14:21:49] addshore: I always have tendril and icinga opened on one of my monitors [14:22:16] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.33 seconds [14:22:28] 10Operations, 10Graphite: grafana access control - https://phabricator.wikimedia.org/T108546 (10fgiunchedi) AFAICT graphite's web interface has been behind ldap auth since 2011 (`e88fdf13` in puppet.git) but `/render` has always been open. Also nowadays you can't really edit/explore metrics in grafana unless y... 
[14:23:18] We are not yet recovered [14:23:52] marostegui: might not be that patch then [14:23:55] * addshore looks in SAL [14:24:17] it looks like it starts between 12:30 and 12:40 on db1089 [14:24:32] I am holding the train https://phabricator.wikimedia.org/T209429 it smells bad :( [14:24:35] 12:36 Amir1: EU SWAT is done [14:24:35] 12:35 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert the language of votewiki to English (en) (T207560) (duration: 00m 55s) [14:24:35] 12:30 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Start reading from change_tag_def on wikidatawiki (T208846) (duration: 00m 55s) [14:24:36] will post to wikitech-l [14:24:36] T207560: Carry out the 2018 fawiki elections on votewiki - https://phabricator.wikimedia.org/T207560 [14:24:36] T208846: Start reading from change_tag_def on wikidatawiki - https://phabricator.wikimedia.org/T208846 [14:24:42] hashar: yes please [14:24:55] Amir1: ^^ you're in that time slot [14:25:02] addshore: I asked Amir1 and he said it should only affect wikidata [14:25:32] the first patch can't possibly cause this, the second one only affects wikidata [14:25:37] so, 12:19 is what we just reverted, and before that is 12:02 of reedy syncing something totally unrelated [14:26:28] this is also affecting API hosts like db1080 [14:26:37] okay, hmmm [14:26:54] and NOT affecting recentchanges slaves [14:27:11] (03PS1) 10Reedy: Add shnwiki to InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473510 (https://phabricator.wikimedia.org/T206777) [14:27:54] marostegui, addshore - to make it clear, our patch was enabling the code that was doing one more SQL query (to fetch page_random value for given page_id). [14:28:16] raynor: and the revert should have stopped that right? (unless i missed something?) [14:28:26] yes, it should stop that [14:29:10] (03CR) 10Gehel: "minor comments inline." 
(032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [14:29:20] (03PS2) 10Reedy: Add shnwiki to InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473510 (https://phabricator.wikimedia.org/T206777) [14:29:36] !log roll-restart swift-proxy in codfw to pick up statsd changes [14:29:37] marostegui: no increase in connections though? only in queries [14:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:22] addshore: we do have more connections yes [14:30:41] there was a small descent in the graphs, but they're crawling up again [14:31:52] marostegui: i dont see a connection rate increase on https://grafana.wikimedia.org/dashboard/db/mysql?var-server=db1089&var-port=9104&var-dc=eqiad%20prometheus%2Fops&orgId=1&from=now-3h&to=now ? [14:32:15] addshore: https://grafana.wikimedia.org/dashboard/db/mysql?var-server=db1089&var-port=9104&var-dc=eqiad%20prometheus%2Fops&orgId=1&from=now-24h&to=now&panelId=37&fullscreen [14:32:31] aaah process list [14:32:35] yeh :) [14:34:41] * addshore has run out of places to look [14:34:51] I am checking performance schema to see if I find something interesting [14:35:12] <_joe_> yeah it seems a flock of small queries rather than some huge ones [14:35:18] yep [14:35:22] <_joe_> tendril doesn't show anything significant [14:35:41] yeah, and there is not a change on rows read patterns and things like that [14:35:56] <_joe_> marostegui: uhm be prepared for a shock [14:36:00] <_joe_> https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104&from=now-7d&to=now [14:36:08] <_joe_> this happens daily [14:36:16] <_joe_> always at the same time [14:36:18] ?? 
[14:36:29] O_o [14:36:31] <_joe_> the query skyrocketing [14:36:42] <_joe_> now, I have a candidate for that [14:36:42] * marostegui goes to check cronjobs in mwmaint [14:37:03] <_joe_> marostegui: either a cronjob, or some peculiar memcached key expiring was my guess [14:37:13] it is getting progressively bigger each day / set of days [14:37:25] could be parsercache expirations? [14:37:39] <_joe_> elukey: didn't we move the TTL of the translatewiki key to 1 day? [14:37:53] _joe_ it goes out with this week's train [14:37:56] <_joe_> marostegui: why on s1 though? [14:37:56] it looks like this was also happening 3 months ago https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104&from=now-90d&to=now [14:38:06] _joe_: yeah, it should affect all sections, good point [14:39:00] <_joe_> elukey: might be related to the errors hashar was seeing, and for which he stopped the train? [14:39:08] <_joe_> anyways, yes, check cronjobs [14:39:22] <_joe_> else it's someone else's cronjob :P [14:39:34] <_joe_> scrape_wikipedia.sh [14:39:37] yeah, I am checking cronjobs [14:40:08] hashar: it's a shame there is no stacktrace etc for https://phabricator.wikimedia.org/T209429 [14:40:44] <_joe_> this has been happening since nov 6th, more or less [14:41:15] (03PS1) 10Filippo Giunchedi: swift: turn on statsd_exporter in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/473519 (https://phabricator.wikimedia.org/T205870) [14:42:23] _joe_: the same pattern exists in august and september [14:42:31] pff [14:42:35] so, once we know that it's not the SEO thing, addshore - once QPS gets back to normal, could you redeploy the patch once again please? 
[14:42:38] _joe_ in theory it shouldn't, the error seems to be fetching a "" key [14:43:07] why does it only affect s1 [14:43:19] it happened on s7 too [14:43:28] let me check the pattern there too [14:43:44] yes, s7 https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&from=now-90d&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1086&var-port=9104 [14:43:48] (03CR) 10Mathew.onipe: "This is good!. Just few nitpicks.." (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [14:43:54] yep [14:43:58] s7 has centralauth [14:44:15] during august and sept it was always at 10:40-10:50 on s7 [14:44:44] <_joe_> ok, it must definitely be a cron [14:44:48] in august and sept it was also always 10:40 ish [14:44:51] ^^ for s1 [14:45:13] timezones have changed now and we are seeing it at 14:40 / 14:44 ish [14:45:32] I was thinking about refreshlinks cron, but I think that got moved to a different hour [14:45:33] <_joe_> addshore: no, grafana uses utc unless you tell it otherwise [14:45:35] let me check [14:46:06] _joe_: of course! so whatever is causing this changed when it was doing whatever it was doing :P [14:46:11] Yeah, refreshlinks is at: 0 0 1 * * [14:48:01] Anomie put this in the deployment section [14:48:16] <_joe_> addshore: can just be slower progressing through the wikis :) [14:48:36] <_joe_> Amir1: he put what? [14:48:43] • Anomie will be running refreshExternallinksIndex.php for https://phabricator.wikimedia.org/T209373. [14:48:58] https://wikitech.wikimedia.org/wiki/Deployments#Week_of_November_12th [14:49:00] Anomie will be running refreshExternallinksIndex.php for T209373. 
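The schedule quoted above, `0 0 1 * *`, has the usual five cron fields (minute, hour, day of month, month, day of week), so it fires at 00:00 on the first day of each month and would not account for a spike that recurs every day. The sketch below checks this by brute force. It is a simplified Python matcher written for this illustration, not a full cron implementation: the helper names are made up, only `*` and plain integers are handled (no ranges, lists or steps), and cron's special rule for entries that restrict both day of month and day of week is not implemented.

```python
from datetime import datetime, timedelta


def field_matches(field, value):
    """Match one cron field; only '*' and plain integers are supported here."""
    return field == "*" or int(field) == value


def cron_matches(expr, when):
    """Return True if the five-field cron expression fires at minute `when`."""
    minute, hour, dom, month, dow = expr.split()
    # cron numbers Sunday as 0; isoweekday() numbers Sunday as 7.
    cron_dow = when.isoweekday() % 7
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(dom, when.day)
        and field_matches(month, when.month)
        and field_matches(dow, cron_dow)
    )


def firing_times(expr, start, days):
    """List every minute in [start, start + days) at which expr fires."""
    hits = []
    t = start
    end = start + timedelta(days=days)
    while t < end:
        if cron_matches(expr, t):
            hits.append(t)
        t += timedelta(minutes=1)
    return hits


# The refreshlinks schedule from the log, checked over November 2018.
monthly = firing_times("0 0 1 * *", datetime(2018, 11, 1), 30)
print(monthly)  # [datetime.datetime(2018, 11, 1, 0, 0)]
```

For comparison, a daily job such as `0 0 * * *` yields one hit per day over the same window, which is the kind of pattern the graphs in this discussion would point to.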
[14:49:01] T209373: Run maintenance/refreshExternallinksIndex.php on all wikis - https://phabricator.wikimedia.org/T209373 [14:49:01] But if this has been happening before, we can discard that I think [14:50:04] It is definitely recovering now [14:50:08] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104&from=1542195716887&to=1542206985063 [14:51:00] Amir1, marostegui: What's going on that that maintenance is being talked about? FYI, at the moment nothing is running for that task. I ran group 0 yesterday and it completed in 3 minutes. I was planning on running group 1 after the train window. [14:51:06] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::appserver: install php-fpm everywhere [puppet] - 10https://gerrit.wikimedia.org/r/473231 (https://phabricator.wikimedia.org/T208433) (owner: 10Giuseppe Lavagetto) [14:51:22] addshore: I think we can push again raynor's patch [14:52:48] marostegui: will do [14:52:52] thanks [14:52:57] did you see anything as the wikiadmin user in the process list that looked different than the run of the mill stuff? I know they were all short-lived, but still [14:53:04] (03PS1) 10Addshore: Revert "Revert "Prod: Enable Schema.org page split test at 1% sampling"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473528 [14:53:10] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10mforns) Thank you @elukey! [14:53:12] \o/ [14:53:12] (03CR) 10Addshore: [C: 032] Revert "Revert "Prod: Enable Schema.org page split test at 1% sampling"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473528 (owner: 10Addshore) [14:53:24] apergos: nope :( [14:53:43] mutante: around much today? I want to talk about https://phabricator.wikimedia.org/T99531 again! [14:54:31] addshore - do you have time? 
if not I can push it :) [14:54:42] (03Merged) 10jenkins-bot: Revert "Revert "Prod: Enable Schema.org page split test at 1% sampling"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473528 (owner: 10Addshore) [14:55:01] raynor: i can [14:55:15] awesome, thx [14:55:29] let me know once it's there, I'll just quickly check it still works [14:55:55] (03PS2) 10Reedy: Stop breaking blame for wikimedia special cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473487 [14:56:15] !log addshore@deploy1001 Synchronized wmf-config: Prod: Enable Schema.org page split test at 1% sampling (again) (duration: 00m 54s) [14:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:21] yar: ^^ done [14:56:23] raynor: ^^ [14:56:56] thx, let me check that [14:57:02] (03CR) 10Giuseppe Lavagetto: [C: 031] install_server: reimage rdb2001, rdb2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472970 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [14:58:16] it's 7 am in sf, add shore, just bear in mind :-D [14:58:36] apergos: silly timezones :) forgot it was so early [14:58:51] (03CR) 10jenkins-bot: Revert "Revert "Prod: Enable Schema.org page split test at 1% sampling"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473528 (owner: 10Addshore) [14:58:55] I have an sf clock added to my gnome clock, otherwise I would be hopeless :-D [14:58:55] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) I have added pc2010 as spare host with the following line on db-codfw.php - we can change it if we want t... 
[14:59:16] i have one on my calendar, but apparently i don't look at it that often [14:59:17] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) [14:59:25] lol [14:59:49] (03CR) 10Filippo Giunchedi: [C: 032] swift: turn on statsd_exporter in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/473519 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [15:00:12] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531 (10MasinAlDujailiWMDE) Did someone ask for a zone file? I have a zone file! Here, take a zone file! ;-) {F27219415} [15:02:09] (03PS1) 10Muehlenhoff: Record extended account date for nathante [puppet] - 10https://gerrit.wikimedia.org/r/473530 [15:03:25] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531 (10Addshore) 05stalled>03Open [15:04:11] (03PS2) 10Muehlenhoff: Record extended account date for nathante [puppet] - 10https://gerrit.wikimedia.org/r/473530 [15:04:13] addshore, it works, thank you [15:04:20] raynor: woo! [15:04:23] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) [15:05:22] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) @Cmjohnson any ETA to get these racked&installed? 
Thanks [15:05:25] (03CR) 10Effie Mouzeli: "https://puppet-compiler.wmflabs.org/compiler1002/13487/" [puppet] - 10https://gerrit.wikimedia.org/r/473029 (https://phabricator.wikimedia.org/T198220) (owner: 10Effie Mouzeli) [15:06:12] (03PS1) 10Andrew Bogott: Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/473531 (https://phabricator.wikimedia.org/T204745) [15:06:19] (03CR) 10Muehlenhoff: [C: 032] Record extended account date for nathante [puppet] - 10https://gerrit.wikimedia.org/r/473530 (owner: 10Muehlenhoff) [15:06:30] (03CR) 10Effie Mouzeli: "https://puppet-compiler.wmflabs.org/compiler1002/13488/rdb2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/472970 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [15:06:57] (03CR) 10Andrew Bogott: [C: 032] Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/473531 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [15:07:06] (03PS2) 10Andrew Bogott: Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/473531 (https://phabricator.wikimedia.org/T204745) [15:07:22] !log ladsgroup@mwmaint1002:/srv/mediawiki-staging/php-1.33.0-wmf.4$ mwscript sql.php --wiki=incubatorwiki extensions/Wikibase/client/sql/entity_usage.sql (T209207) [15:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:25] T209207: Enable arbitrary access on Incubator - https://phabricator.wikimedia.org/T209207 [15:07:25] (03PS2) 10Muehlenhoff: Remove Diamond on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/473489 (https://phabricator.wikimedia.org/T183454) [15:08:03] (03PS3) 10Muehlenhoff: Remove Diamond on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/473489 (https://phabricator.wikimedia.org/T183454) [15:11:10] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 
(10Addshore) [15:12:08] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational [15:12:59] !log roll-restart swift on ms-be1* to pick up statsd changes [15:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:20] (03CR) 10Muehlenhoff: [C: 032] Remove Diamond on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/473489 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:13:46] (03PS3) 10Giuseppe Lavagetto: mediawiki::appserver: install php-fpm everywhere [puppet] - 10https://gerrit.wikimedia.org/r/473231 (https://phabricator.wikimedia.org/T208433) [15:15:09] (03PS1) 10Ladsgroup: Add incubatorwiki to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473534 (https://phabricator.wikimedia.org/T209207) [15:16:12] (03PS3) 10Niedzielski: Prod: increase Schema.org page split test to 5% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473221 (https://phabricator.wikimedia.org/T208755) [15:16:14] (03PS3) 10Niedzielski: Prod: increase Schema.org page split test to 25% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473224 (https://phabricator.wikimedia.org/T208755) [15:16:16] (03PS3) 10Niedzielski: Prod: increase Schema.org page split test to 50% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473225 (https://phabricator.wikimedia.org/T208755) [15:16:18] (03PS2) 10Niedzielski: Prod: increase Schema.org page split test to 100% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473227 (https://phabricator.wikimedia.org/T208755) [15:16:37] (03PS8) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) [15:16:44] (03CR) 10Effie Mouzeli: [C: 032] jobqueue_redis: Purge role jobqueue_redis [puppet] - 10https://gerrit.wikimedia.org/r/473029 (https://phabricator.wikimedia.org/T198220) (owner: 10Effie 
Mouzeli) [15:16:53] (03CR) 10jerkins-bot: [V: 04-1] Prod: increase Schema.org page split test to 50% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473225 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [15:17:20] (03PS2) 10Effie Mouzeli: jobqueue_redis: Purge role jobqueue_redis [puppet] - 10https://gerrit.wikimedia.org/r/473029 (https://phabricator.wikimedia.org/T198220) [15:19:03] (03CR) 10Herron: logstash: add rsyslog-shipper kafka input config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [15:26:05] (03PS1) 10Banyek: Revert "mariadb: depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473536 [15:30:01] (03PS1) 10Giuseppe Lavagetto: php::fpm: explicitly depend on the php-fpm package [puppet] - 10https://gerrit.wikimedia.org/r/473538 [15:30:26] (03PS1) 10Banyek: mariadb: depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473539 (https://phabricator.wikimedia.org/T85757) [15:31:14] (03CR) 10Marostegui: [C: 04-1] mariadb: depool db1088 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473539 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [15:31:37] (03CR) 10Giuseppe Lavagetto: [C: 032] php::fpm: explicitly depend on the php-fpm package [puppet] - 10https://gerrit.wikimedia.org/r/473538 (owner: 10Giuseppe Lavagetto) [15:32:01] (03PS4) 10Volans: remote: refactor Remote.query() API [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884) [15:32:03] (03PS3) 10Volans: Add Icinga module [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) [15:32:13] (03CR) 10Volans: "done" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [15:32:39] (03CR) 10Volans: "Replies inline, some done." 
(036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [15:32:57] (03PS2) 10Banyek: mariadb: depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473539 (https://phabricator.wikimedia.org/T85757) [15:33:02] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.973 second response time [15:33:43] (03PS1) 10GTirloni: cloudvps: reimage+rename labvirt1016 as cloudvirt1016 [puppet] - 10https://gerrit.wikimedia.org/r/473540 (https://phabricator.wikimedia.org/T209426) [15:33:56] (03PS1) 10GTirloni: cloudvps: rename+reimage labvirt1016 as cloudvirt1016 [dns] - 10https://gerrit.wikimedia.org/r/473541 (https://phabricator.wikimedia.org/T209426) [15:34:18] (03CR) 10Marostegui: mariadb: depool db1088 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473539 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [15:34:53] (03PS2) 10GTirloni: cloudvps: reimage+rename labvirt1016 as cloudvirt1016 [puppet] - 10https://gerrit.wikimedia.org/r/473540 (https://phabricator.wikimedia.org/T209426) [15:35:17] (03PS3) 10Herron: logstash::input::kafka add support for SSL/TLS options [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) [15:35:41] (03PS3) 10Banyek: mariadb: depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473539 (https://phabricator.wikimedia.org/T85757) [15:36:02] (03CR) 10Marostegui: [C: 031] mariadb: depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473539 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [15:36:19] (03CR) 10Herron: logstash::input::kafka add support for SSL/TLS options (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [15:36:24] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:36:37] (03CR) 
10GTirloni: [C: 032] cloudvps: reimage+rename labvirt1016 as cloudvirt1016 [puppet] - 10https://gerrit.wikimedia.org/r/473540 (https://phabricator.wikimedia.org/T209426) (owner: 10GTirloni) [15:37:04] (03CR) 10Marostegui: [C: 031] Revert "mariadb: depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473536 (owner: 10Banyek) [15:37:20] (03CR) 10GTirloni: [C: 032] cloudvps: rename+reimage labvirt1016 as cloudvirt1016 [dns] - 10https://gerrit.wikimedia.org/r/473541 (https://phabricator.wikimedia.org/T209426) (owner: 10GTirloni) [15:37:30] PROBLEM - puppet last run on rdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:37:39] (03PS9) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) [15:37:43] (03CR) 10Effie Mouzeli: [C: 032] install_server: reimage rdb2001, rdb2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472970 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [15:37:47] (03PS1) 10Addshore: WIP DNM: Add wikiba.se [dns] - 10https://gerrit.wikimedia.org/r/473543 (https://phabricator.wikimedia.org/T99531) [15:38:01] (03PS3) 10Effie Mouzeli: install_server: reimage rdb2001, rdb2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472970 (https://phabricator.wikimedia.org/T206450) [15:38:45] (03CR) 10Banyek: [C: 032] Revert "mariadb: depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473536 (owner: 10Banyek) [15:39:04] !log repooling db2046 (T85757) [15:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:07] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [15:39:20] (03PS2) 10Banyek: Revert "mariadb: depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473536 [15:39:23] (03CR) 10Banyek: [V: 032 C: 032] Revert "mariadb: depool db2046" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/473536 (owner: 10Banyek) [15:40:00] PROBLEM - Host lvs2009 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:13] (03PS1) 10Giuseppe Lavagetto: php::fpm: require the package for pool.d too [puppet] - 10https://gerrit.wikimedia.org/r/473544 [15:40:40] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) @Vgutierrez Recabling done as you requested on both servers [15:40:53] <_joe_> ok that's papaul's work :) [15:41:10] PROBLEM - puppet last run on rdb2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:41:23] <_joe_> jiji: ^^ [15:41:33] no biggie [15:41:39] (03CR) 10Giuseppe Lavagetto: [C: 032] php::fpm: require the package for pool.d too [puppet] - 10https://gerrit.wikimedia.org/r/473544 (owner: 10Giuseppe Lavagetto) [15:41:40] !log banyek@deploy1001 Synchronized wmf-config/db-codfw.php: T85757: repool db2046 (duration: 00m 52s) [15:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:47] <_joe_> yup just wanted to make sure you saw it :) [15:41:56] I saw it [15:42:49] I will merge the fix [15:43:16] (03CR) 10jenkins-bot: Revert "mariadb: depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473536 (owner: 10Banyek) [15:43:22] but I have a downtime for it on icinga [15:43:28] iirc [15:44:41] <_joe_> uhm that might be an icinga bug then [15:44:54] <_joe_> better call volans! [15:45:56] _joe_: you know I know better :) it's not the old bug [15:46:16] there#s no current downtime on rdb2*, maybe it expired? [15:46:43] <_joe_> moritzm: I think it's the same icinga bug jaime encountered [15:47:30] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.368 second response time [15:48:32] arcane discovered ;) [15:48:34] addshore: it looks like you are following option 2. that seems right to me. 
that being said, please get a review for your patch from traffic team [15:48:36] it's all good [15:49:02] moritzm: yes, I had run it on einsteinium [15:49:09] but we switched to icinga1001 [15:49:14] so it was no good anymore :p [15:50:29] !log roll restart swift-proxy in eqiad to apply statsd changes [15:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:34] how's our overall status? [15:50:49] jouncebot: now [15:50:49] For the next 0 hour(s) and 9 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T1400) [15:50:57] MW train still running btw? [15:51:00] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:51:05] <_joe_> bblack: I think hashar stopped it [15:51:17] (03CR) 10Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [15:51:51] I will restart pdfrender [15:52:18] <_joe_> thanks [15:53:04] !Restarting pdfrender on scb*.eqiad.wmnet [15:53:07] !log Restarting pdfrender on scb*.eqiad.wmnet [15:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:20] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.200 second response time [15:54:39] (03PS1) 10Banyek: mariadb: productionize dbproxy101[2-7].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) [15:54:58] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ` lvs2010.codfw.wmnet ` The log can... 
[15:55:16] I have a swap of the TLS certs in the US planned for now-ish [15:55:25] but I can hold/defer if there's other risks ongoing [15:55:35] (or not wanting interference in various graphs) [15:57:25] (03PS2) 10BBlack: Switch unified cert to globalsign-2018 at US edges [puppet] - 10https://gerrit.wikimedia.org/r/473211 (https://phabricator.wikimedia.org/T206804) [15:57:30] ^ that [15:58:39] !log rebooting restbase-dev1004 for kernel security update and OpenJDK security update [15:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:19] assuming other risks are not sufficient to block! :) [16:01:19] (03PS4) 10Effie Mouzeli: install_server: reimage rdb2001, rdb2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472970 (https://phabricator.wikimedia.org/T206450) [16:02:05] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2010.codfw.wmnet'] ` Of which those **FAILED**: ` ['lvs2010.codfw.wmnet'] ` [16:02:12] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error executing row event: Table liwikinews.echo_event doesnt exist [16:06:38] 10Operations, 10Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (10thcipriani) >>! In T209456#4745707, @hashar wrote: > We had a lot of stack overflow errors on random pages such as gittiles, that is due to some pages being very long to prettify. Those happened with some f... 
[16:08:07] !log rebooting restbase-dev1005 for kernel security update and OpenJDK security update [16:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:29] !log starting replacement of GlobalSign unified TLS cert at US edges (affects all public TLS termination for US traffic edges) - T206804 [16:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:32] T206804: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 [16:10:55] !log disabling puppet as precaution on all caches (cumin A:cp) - T206804 [16:10:55] (03PS1) 10Herron: add dummy logstash kafka input password to pacify PCC [labs/private] - 10https://gerrit.wikimedia.org/r/473553 [16:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:34] (03CR) 10BBlack: [C: 032] Switch unified cert to globalsign-2018 at US edges [puppet] - 10https://gerrit.wikimedia.org/r/473211 (https://phabricator.wikimedia.org/T206804) (owner: 10BBlack) [16:12:41] (03CR) 10Herron: [V: 032 C: 032] add dummy logstash kafka input password to pacify PCC [labs/private] - 10https://gerrit.wikimedia.org/r/473553 (owner: 10Herron) [16:12:58] (03PS3) 10BBlack: Switch unified cert to globalsign-2018 at US edges [puppet] - 10https://gerrit.wikimedia.org/r/473211 (https://phabricator.wikimedia.org/T206804) [16:13:32] it'd be nice if wikibugs and/or log actually noted when patches were Submit-ed, instead of just when they're uploaded and/or a review vote changes [16:15:04] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 963.41 seconds [16:16:21] (03CR) 10Cwhite: [C: 032] Remove Diamond from Kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/473490 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:16:47] !log rebooting restbase-dev1006 for kernel security update and OpenJDK security update [16:16:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:08] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) Redirection settings are confirmed correct. Looking around other settings in the docs. [16:18:10] (03PS1) 10Ema: Add module to install and configure fifo-log-demux [puppet] - 10https://gerrit.wikimedia.org/r/473554 (https://phabricator.wikimedia.org/T204225) [16:18:12] (03PS1) 10Ema: trafficserver: configure fifo-log-demux [puppet] - 10https://gerrit.wikimedia.org/r/473555 (https://phabricator.wikimedia.org/T204225) [16:20:34] RECOVERY - Host lvs2009 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [16:23:18] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) Nothing. I guess this is just more digging, then, unless both systems are somehow broken. 
[16:23:27] (03CR) 10Cwhite: "> Agreed, I don't think we need the status site, it's fine to simply" [puppet] - 10https://gerrit.wikimedia.org/r/473302 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:23:44] RECOVERY - HTTPS Unified RSA on cp2018 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 341660 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 372 days) [16:23:52] RECOVERY - HTTPS Unified ECDSA on cp2006 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 318852 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 372 days) [16:24:02] RECOVERY - HTTPS Unified ECDSA on cp2018 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 341642 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 372 days) [16:24:02] incoming recovery spam for the expiring certificates, sorry! [16:24:08] RECOVERY - HTTPS Unified ECDSA on cp2025 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 339595 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 372 days) [16:24:12] RECOVERY - HTTPS Unified ECDSA on cp2012 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 334671 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 372 days) [16:24:14] PROBLEM - puppet last run on lvs2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:24:14] PROBLEM - puppet last run on cp1073 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:24:19] (03PS12) 10Cwhite: hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) [16:24:28] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:24:32] RECOVERY - HTTPS Unified RSA on cp2006 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 318813 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 372 days) [16:24:32] RECOVERY - HTTPS Unified RSA on cp2012 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 334652 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 372 days) [16:24:46] (03CR) 10Cwhite: [C: 032] hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:24:46] PROBLEM - puppet last run on cp1074 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:24:54] PROBLEM - puppet last run on cp2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:25:00] I guess "-b 21" was a little much, is the few concurrent puppetfails that are going on, we'll see [16:25:24] cp1073/4 may be something else, those are ATS test servers I'm not doing anything on.. [16:25:39] yes those are my fault! 
[16:25:40] !log rebooting ganeti1005 for kernel security update [16:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:20] cp2009 doesn't actually have any local log of a puppet agent failure, in spite of the alert above, that's odd [16:27:21] (03PS1) 10GTirloni: cloudvps: hieradata for cloudvirt1016 [puppet] - 10https://gerrit.wikimedia.org/r/473557 (https://phabricator.wikimedia.org/T209426) [16:27:33] oh duh, also ATS :) [16:27:44] (03CR) 10Fsero: [V: 032 C: 031] fifo-log-demux 0.1 (031 comment) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/473432 (https://phabricator.wikimedia.org/T204225) (owner: 10Ema) [16:28:11] (03CR) 10GTirloni: [C: 032] cloudvps: hieradata for cloudvirt1016 [puppet] - 10https://gerrit.wikimedia.org/r/473557 (https://phabricator.wikimedia.org/T209426) (owner: 10GTirloni) [16:28:18] icinga-wm: ping? [16:28:33] there should be more recoveries, maybe they're slow to recheck at this point [16:29:20] RECOVERY - puppet last run on cp1073 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:29:34] RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:29:52] RECOVERY - puppet last run on cp1074 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:30:00] RECOVERY - puppet last run on cp2009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:30:20] (03CR) 10Muehlenhoff: [C: 031] "I don't have a strong preference, but you have a point about the unpuppetised status page lingering around otherwise. 
I'd say let's merge " [puppet] - 10https://gerrit.wikimedia.org/r/473302 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:31:31] (03CR) 10Herron: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [16:33:20] !log [Done] replacement of GlobalSign unified TLS cert at US edges complete - T206804 [16:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:23] T206804: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 [16:33:44] RECOVERY - puppet last run on rdb2001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:34:07] (03PS1) 10GTirloni: cloudvps: cleanup labvirt1016 [dns] - 10https://gerrit.wikimedia.org/r/473560 (https://phabricator.wikimedia.org/T209426) [16:34:16] (03PS2) 10Dzahn: wikistats (vps): use mariadb classes, fix old FIXME [puppet] - 10https://gerrit.wikimedia.org/r/470953 [16:34:30] (03CR) 10GTirloni: [C: 032] cloudvps: cleanup labvirt1016 [dns] - 10https://gerrit.wikimedia.org/r/473560 (https://phabricator.wikimedia.org/T209426) (owner: 10GTirloni) [16:36:38] (03PS1) 10Muehlenhoff: Remove Diamond from Elastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/473561 (https://phabricator.wikimedia.org/T183454) [16:36:46] (03CR) 10Dzahn: [C: 032] wikistats (vps): use mariadb classes, fix old FIXME [puppet] - 10https://gerrit.wikimedia.org/r/470953 (owner: 10Dzahn) [16:37:22] RECOVERY - puppet last run on rdb2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:37:53] that [done] reminded me we could totally #hashtag !log entries :) [16:39:56] yep, it's Twitter [16:44:38] RECOVERY - puppet last run on lvs2009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:48:12] can someone with deployment access take care of https://phabricator.wikimedia.org/T209495 [16:48:13] ? 
[16:48:18] PROBLEM - Host cloudvirt1016 is DOWN: PING CRITICAL - Packet loss = 100% [16:48:36] <_joe_> uh? [16:48:41] <_joe_> is this expected? [16:48:51] IIRC that host was recently reimaged [16:48:52] 16:34 <+wikibugs> (CR) GTirloni: [C: 2] cloudvps: cleanup labvirt1016 [dns] ? [16:48:56] <_joe_> well [16:49:13] should this have paged? [16:49:17] <_joe_> yes [16:49:25] <_joe_> cloud hosts page when they go down [16:49:38] <_joe_> but it shouldn't have paged because it should've been downtimed [16:49:39] it did page me fwiw [16:49:46] I suspect the cleanup commit (which killed 1/2 hostnames for that host) may have cleaned up a hostname the icinga check was using to ping it with? I donno [16:49:47] <_joe_> it did page everyone [16:50:06] labvirt1016 is downtimed [16:50:23] yeah but cloudvirt1016 is what paged [16:50:23] <_joe_> but not cloudvirt1016 :P [16:50:24] RECOVERY - Host cloudvirt1016 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:50:47] cloudvirt1016 doesn't show up in icinga for me [16:51:00] hm pages [16:51:00] sadly a silly race condition / limitation in icinga, you can't downtime hosts/services that don't exist yet [16:51:16] <_joe_> godog: right [16:51:23] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/473561 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:51:35] <_joe_> so they'd need to set notifications to 'disabled' while installing [16:51:40] bblack: I'm renaming a server [16:51:53] not sure if there's anything I can do :) [16:51:56] <_joe_> gtirloni: ok when renaming a server you will need to merge a puppet patch :) [16:52:23] ? [16:52:24] <_joe_> gtirloni: hieradata/hosts/.yaml with profile::base::notifications: disabled [16:52:31] PROBLEM - TFTP service on install2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* [16:52:31] <_joe_> before you rename it [16:52:56] ok, I'll review our instructions and add that. Thanks! 
[16:53:10] <_joe_> gtirloni: that's just because your hosts page everyone [16:53:16] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [16:54:21] got it, thanks and sorry about the noise [16:54:26] (03CR) 10Cwhite: [C: 031] Remove Diamond from Elastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/473561 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:54:31] RECOVERY - TFTP service on install2002 is OK: PROCS OK: 1 process with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* [16:55:01] <_joe_> gtirloni: ofc you also need to remove that afterwards [16:55:07] <_joe_> as you want to get paged :) [16:56:44] _joe_: just had another issue with the docker-registry and pulling an image, and managed to get 2 people to reproduce it in 2 different locations and machines etc [16:56:59] this time 1 layer of 1 image would just refuse to download, and kept retrying [16:57:08] ticket worthy? or? [16:57:54] <_joe_> addshore: I guess so [16:58:01] * addshore goes to write it up [16:58:08] <_joe_> addshore: our docker registry is bound to be redone sooner than later [17:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T1700). [17:00:04] stephanebisson and Amir1: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:11] Hi [17:00:18] I can SWAT today [17:00:24] o/ [17:01:56] 10Operations: issue pulling 1 layer of docker-registry.wikimedia.org/releng/composer-php71:latest - https://phabricator.wikimedia.org/T209507 (10Addshore) [17:01:59] _joe_: ^^ done [17:02:21] 10Operations, 10Core Platform Team Backlog (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Joe) [17:04:21] is someone working on cloudvirt1016? (saw notice of raid array failure and then reboot) [17:05:43] robh i think gtirloni is or andrewbogott [17:05:57] ok, i just dont like assuming its handled, felt the need to check! [17:08:49] 10Operations, 10Traffic: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (10BBlack) From IRC for posterity: ` 17:01 < bblack> the straw pseudo-code still seems like a legit approach to me, but in practice there's some rough edges to it. chiefly, that cross my mind imme... [17:11:53] robh: gtirloni is renaming labvirt1016 to cloudvirt1016 [17:12:28] !log sbisson@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/WikimediaEvents/includes/WikimediaEventsHooks.php: SWAT: [[gerrit:473394|Fix EditAttemptStepSamplingRate variable export]] (duration: 00m 54s) [17:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:08] (03PS1) 10Bstorm: cloudstore: install with the default kernel for cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/473566 (https://phabricator.wikimedia.org/T193655) [17:13:35] (03PS1) 10GTirloni: cloudvps: Add cloudvirt1016 to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/473567 (https://phabricator.wikimedia.org/T209426) [17:14:01] andrewbogott: no worries, i just saw alerts and was worried =] [17:14:16] once confirmed it was folks doing stuff, i stopped worrying. 
[17:14:22] (03CR) 10Bstorm: [C: 032] cloudstore: install with the default kernel for cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/473566 (https://phabricator.wikimedia.org/T193655) (owner: 10Bstorm) [17:14:45] (03PS2) 10GTirloni: cloudvps: Add cloudvirt1016 to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/473567 (https://phabricator.wikimedia.org/T209426) [17:16:06] (03CR) 10GTirloni: [C: 032] cloudvps: Add cloudvirt1016 to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/473567 (https://phabricator.wikimedia.org/T209426) (owner: 10GTirloni) [17:19:12] !log sbisson@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/MobileFrontend/resources/mobile.editor.common/schemaEditAttemptStep.js: SWAT: [[gerrit:473395|schemaEditAttemptStep.js: Use correct config var name for sampling rate]] (duration: 00m 54s) [17:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:16] (03CR) 10Sbisson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473534 (https://phabricator.wikimedia.org/T209207) (owner: 10Ladsgroup) [17:19:24] 10Operations, 10Traffic, 10Patch-For-Review: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (10BBlack) [17:20:34] 10Operations, 10Traffic, 10Patch-For-Review: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (10BBlack) 05Open>03Resolved [17:20:39] (03Merged) 10jenkins-bot: Add incubatorwiki to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473534 (https://phabricator.wikimedia.org/T209207) (owner: 10Ladsgroup) [17:21:20] Amir1: Is your change testable? [17:21:32] yup [17:22:17] Amir1: It's now on mwdebug1001 [17:22:45] Testing [17:23:08] AndyRussG: hi! Kind reminder of the druid question :) [17:23:20] elukey: thanks! [17:23:41] yes .... 
rrrg apologies again for not getting to it yet, I'll definitely do so today :) [17:24:11] I am pinging since your peak season is getting closer and I prefer not to rush :) [17:25:31] stephanebisson: it's fine, please move forward [17:25:57] (03CR) 10jenkins-bot: Add incubatorwiki to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473534 (https://phabricator.wikimedia.org/T209207) (owner: 10Ladsgroup) [17:26:42] !log sbisson@deploy1001 Synchronized dblists/wikidataclient.dblist: SWAT: [[gerrit:473534|Add incubatorwiki to wikidataclient.dblist]] (duration: 00m 48s) [17:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:45] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic: Server Access for 3 formal collaborators - https://phabricator.wikimedia.org/T209298 (10RyanSteinberg) public key: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBN1rS7OObcft7lDa9+H45kLfkdGHwlJ6rL2Fm2IPsMB preferred shell login: ryanmax [17:26:48] Amir1: done [17:26:57] Thanks! [17:27:01] And that concludes SWAT for now [17:27:56] (03CR) 10Cwhite: "> I don't have a strong preference, but you have a point about the" [puppet] - 10https://gerrit.wikimedia.org/r/473302 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [17:31:26] !log Running importImage.php for 'Opening ceremony of First accusation protest against presumption of guilt of judicial branch.webm' per request T209495 [17:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:31] T209495: Please upload large file to Wikimedia Commons - https://phabricator.wikimedia.org/T209495 [17:32:37] anything wrong going on ? 
I have to send a hotfix for the train ( https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/473551/1/includes/libs/objectcache/MemcachedPeclBagOStuff.php ) [17:34:50] I guess since SWAT has just finished, I will take over from now [17:35:05] <_joe_> hashar: that's not exactly a fix [17:35:11] 10Operations, 10Traffic: Renew Digicert Unified in 2019 - https://phabricator.wikimedia.org/T209515 (10BBlack) p:05Triage>03Normal [17:35:15] <_joe_> you're transforming a non-fatal failure in a fatal? [17:35:33] <_joe_> oh no just reporting the stack trace [17:35:35] <_joe_> ok [17:35:38] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) This was just stuck at a prompt. Stupid mistake, the output after that stage of boot was redirected to the... [17:35:41] yeah that is for the stacktrace [17:35:43] but indeed it is not a fix [17:35:55] <_joe_> still: a step to get to a fix [17:36:21] yup :) [17:36:25] <_joe_> I /think/ it's a limitation of mcrouter btw [17:36:57] my theory is that some piece of 1.33.0-wmf.4 code ends up trying to get a cache key with string(0) "" [17:37:33] !log T207377 downtime and reboot cloudnet1004 (cloudnet1003 is the active one already) [17:37:37] <_joe_> hashar: uhm, maybe, wouldn't be so sure, but the trace should give you an answer [17:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:41] T207377: Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 [17:37:49] 10Operations, 10Traffic: Renew Digicert Unified in 2019 - https://phabricator.wikimedia.org/T209515 (10BBlack) [17:37:56] <_joe_> hashar: it can be the key is an unprintable pack of bits [17:38:43] 10Operations, 10ops-codfw: Decommission asw-c8-codfw - https://phabricator.wikimedia.org/T209066 (10Papaul) [17:39:01] 10Operations, 10ops-codfw: Decommission asw-c8-codfw - 
https://phabricator.wikimedia.org/T209066 (10Papaul) [17:39:32] _joe_: it also lists the server as ":" :/ [17:40:25] that might be an empty host name + empty port, separated by a colon… [17:40:44] (though other logstash entries I saw also had a unix socket path as server) [17:41:15] 10Operations, 10ops-codfw: Decommission asw-c8-codfw - https://phabricator.wikimedia.org/T209066 (10Papaul) 05Open>03Resolved Done [17:44:13] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.181 second response time [17:46:39] (03PS1) 10Andrew Bogott: Horizon: move wmf-research-tools to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/473570 (https://phabricator.wikimedia.org/T204745) [17:47:44] (03CR) 10Andrew Bogott: [C: 032] Horizon: move wmf-research-tools to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/473570 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [17:48:22] (03CR) 10Gehel: "mostly minor comments (though I would very much appreciate simplifying the tests)" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [17:48:49] 10Operations, 10Traffic: Renew Digicert Unified in 2019 - https://phabricator.wikimedia.org/T209515 (10BBlack) Also, we should pre-downtime the unified ssl checks in icinga early next week before the US Thanksgiving holidays, so that nobody's pestered by a spam of WARNING alerts, which I believe are set to tri... 
[17:49:12] 10Operations, 10ops-codfw: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10Papaul) [17:53:28] !log T207377 downtime and reboot cloudnet1003 (cloudnet1004 is the active one already) [17:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:31] T207377: Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 [17:58:32] (03PS4) 10Herron: logstash::input::kafka add support for SSL/TLS options [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) [17:58:57] !log hashar@deploy1001 Started scap: php-1.33.0-wmf.4/includes/libs/objectcache/MemcachedPeclBagOStuff.php Add trace to debug memcached bad key error - T209429 [17:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:00] T209429: memcached error: A BAD KEY WAS PROVIDED/CHARACTERS OUT OF RANGE - https://phabricator.wikimedia.org/T209429 [17:59:31] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.332 second response time [17:59:59] deploying a wmf.4 change and 17:59:18 Updating LocalisationCache for 1.33.0-wmf.3 using 30 thread(s) ... 
[18:00:59] (03PS5) 10Herron: logstash::input::kafka add support for SSL/TLS options [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) [18:01:34] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10aborrero) [18:04:06] (03PS6) 10Herron: logstash::input::kafka add support for SSL/TLS options [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) [18:05:29] oh [18:05:31] I am rusty [18:05:36] ran a full scap :( [18:05:44] (03CR) 10Herron: [C: 032] logstash::input::kafka add support for SSL/TLS options [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [18:09:40] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic: Server Access for 3 formal collaborators - https://phabricator.wikimedia.org/T209298 (10Afandian) public key: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCsJPyatQmAgubnM6ChTohZdEYTOfVJjzpsOtiVrBcwTOVBwEl3qcORlMEF0MMk+BdMfiMd12jmfxGWuOhzJAZ8iPDE9Bk... [18:09:46] 10Operations, 10Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (10mmodell) So the problem seems to have started between 02:00 and 02:12 UTC. There was a fairly large spike in outgoing traffic on eth0 between 02:10 and 02:12 at which point cpu load gradually falls off as a... 
[18:12:24] jouncebot: now [18:12:24] No deployments scheduled for the next 1 hour(s) and 47 minute(s) [18:12:30] jouncebot: next [18:12:30] In 1 hour(s) and 47 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T2000) [18:15:35] PROBLEM - High CPU load on API appserver on mw1314 is CRITICAL: CRITICAL - load average: 82.02, 36.44, 22.51 [18:17:34] (03PS1) 10Bstorm: sonofgridengine: stretch bastions want libboost-dev [puppet] - 10https://gerrit.wikimedia.org/r/473574 (https://phabricator.wikimedia.org/T200557) [18:17:44] 10Operations, 10Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (10mmodell) @thcipriani is going to try installing https://gerrit-review.googlesource.com/admin/repos/plugins/javamelody to hopefully collect some more useful data about the state of the JVM. At this point we... [18:17:46] (03PS1) 10Andrew Bogott: nova: remove labvirt1013 and 1014 from the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/473575 [18:18:21] hashar: hows your sync going? 
:) [18:19:11] (03PS10) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) [18:20:34] PROBLEM - dhclient process on cloudstore1008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.25: Connection reset by peer [18:20:34] PROBLEM - Check systemd state on cloudstore1008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.25: Connection reset by peer [18:20:48] PROBLEM - Disk space on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused [18:22:26] PROBLEM - Check the NTP synchronisation status of timesyncd on cloudstore1008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.25: Connection reset by peer [18:22:26] PROBLEM - puppet last run on cloudstore1008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.25: Connection reset by peer [18:24:12] PROBLEM - DPKG on cloudstore1008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.25: Connection reset by peer [18:25:14] RECOVERY - DPKG on cloudstore1008 is OK: All packages OK [18:25:34] RECOVERY - Check systemd state on cloudstore1008 is OK: OK - running: The system is fully operational [18:25:34] RECOVERY - dhclient process on cloudstore1008 is OK: PROCS OK: 0 processes with command name dhclient [18:25:42] (03CR) 10Andrew Bogott: [C: 032] nova: remove labvirt1013 and 1014 from the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/473575 (owner: 10Andrew Bogott) [18:27:26] RECOVERY - puppet last run on cloudstore1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:28:08] PROBLEM - MD RAID on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused [18:29:58] addshore: sorry. 
I ran a full sync :/ [18:30:38] wait on scap-cdb-rebuild [18:32:36] (03PS1) 10Vgutierrez: hieradata: Add lvs2010 specific settings [puppet] - 10https://gerrit.wikimedia.org/r/473576 (https://phabricator.wikimedia.org/T209337) [18:33:04] !log hashar@deploy1001 Finished scap: php-1.33.0-wmf.4/includes/libs/objectcache/MemcachedPeclBagOStuff.php Add trace to debug memcached bad key error - T209429 (duration: 34m 07s) [18:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:08] T209429: memcached error: A BAD KEY WAS PROVIDED/CHARACTERS OUT OF RANGE - https://phabricator.wikimedia.org/T209429 [18:33:18] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:33:34] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:33:38] PROBLEM - configured eth on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused [18:34:18] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:34:34] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:34:50] addshore: sync completed finally [18:35:26] PROBLEM - Check systemd state on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused [18:35:26] PROBLEM - dhclient process on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused [18:37:18] PROBLEM - puppet last run on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused [18:37:18] PROBLEM - Check the NTP synchronisation status of timesyncd on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused [18:39:01] 10Operations, 10decommission, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10jijiki) [18:39:08] RECOVERY - High CPU load on API appserver on mw1314 is OK: OK - load average: 13.12, 15.56, 18.65 [18:39:16] PROBLEM - DPKG on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused [18:39:29] vgutierrez: I guess downtime expired for lvs2010 [18:39:35] 10Operations, 10decommission, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10jijiki) [18:39:44] yeah [18:39:47] thx volans [18:40:01] np I was about to renew it if you were not around ;) [18:40:56] 10Operations, 10decommission, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10jijiki) [18:42:36] PROBLEM - IPMI Sensor Status on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused [18:43:58] I know I know [18:44:24] PROBLEM - Long running screen/tmux on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused [18:44:58] (03CR) 10BBlack: [C: 031] "LGTM as a starting point for testing our first bnxt_en LVS. 
Note interface_tweaks was blindly updated for this card back in https://gerri" [puppet] - 10https://gerrit.wikimedia.org/r/473576 (https://phabricator.wikimedia.org/T209337) (owner: 10Vgutierrez) [18:45:09] thx bblack [18:45:51] (03CR) 10Vgutierrez: [C: 032] hieradata: Add lvs2010 specific settings [puppet] - 10https://gerrit.wikimedia.org/r/473576 (https://phabricator.wikimedia.org/T209337) (owner: 10Vgutierrez) [18:46:51] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T208706 (10Volans) Adding analytics, Luca and Otto in case it was missed. Also puppet has issues because of RO filesystem. [18:47:24] 10Operations, 10decommission, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10jijiki) "Reimage rdb2001, rdb2002 to stretch and change their role to spare::system" https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/472714/ [18:50:31] PROBLEM - Disk space on cloudstore1009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.26: Connection reset by peer [18:51:15] I am going to deploy a fix for the train blocker ( T209429 ), wait a bit to confirm then promote group1 to 1.33.0-wmf.4 [18:51:16] T209429: memcached error: A BAD KEY WAS PROVIDED/CHARACTERS OUT OF RANGE - https://phabricator.wikimedia.org/T209429 [18:51:31] PROBLEM - ensure kvm processes are running on labvirt1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args /usr/bin/kvm [18:52:18] that's me, no cause for alarm [18:52:28] RECOVERY - Check the NTP synchronisation status of timesyncd on cloudstore1008 is OK: OK: synced at Wed 2018-11-14 18:52:26 UTC. [18:57:05] RECOVERY - DPKG on lvs2010 is OK: All packages OK [18:57:17] RECOVERY - MD RAID on lvs2010 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [18:57:19] 10Operations, 10Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (10hashar) For monitoring it seems we monitor the JVM with JMX. 
There is a task for Gerrit at: T184086 The CI Jenkins used to have Java melody but that is apparently no more enabled. [18:57:47] RECOVERY - Disk space on lvs2010 is OK: DISK OK [18:59:39] RECOVERY - Check systemd state on lvs2010 is OK: OK - running: The system is fully operational [19:01:27] RECOVERY - configured eth on lvs2010 is OK: OK - interfaces up [19:02:31] RECOVERY - puppet last run on lvs2010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:03:17] RECOVERY - dhclient process on lvs2010 is OK: PROCS OK: 0 processes with command name dhclient [19:03:19] PROBLEM - configured eth on cloudstore1009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.26: Connection reset by peer [19:05:11] PROBLEM - Check systemd state on cloudstore1009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.26: Connection reset by peer [19:05:11] PROBLEM - dhclient process on cloudstore1009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.26: Connection reset by peer [19:05:23] RECOVERY - configured eth on cloudstore1009 is OK: OK - interfaces up [19:05:49] RECOVERY - Disk space on cloudstore1009 is OK: DISK OK [19:05:57] (03Abandoned) 10Dzahn: rename tegmen.wikimedia.org to icinga2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/471897 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn) [19:06:15] RECOVERY - Check systemd state on cloudstore1009 is OK: OK - running: The system is fully operational [19:06:15] RECOVERY - dhclient process on cloudstore1009 is OK: PROCS OK: 0 processes with command name dhclient [19:06:50] (03PS1) 10Effie Mouzeli: cumin: create alias for role redis::misc [puppet] - 10https://gerrit.wikimedia.org/r/473582 [19:07:09] PROBLEM - Host lvs2010 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:27] RECOVERY - Host lvs2010 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [19:12:19] I am going to deploy [mediawiki/core@wmf/1.33.0-wmf.4] JobQueue: Actually return 
the value from getRootJobCacheKey() [19:12:19] https://gerrit.wikimedia.org/r/473579 [19:12:43] RECOVERY - IPMI Sensor Status on lvs2010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [19:14:22] !log hashar@deploy1001 Synchronized php-1.33.0-wmf.4/includes/jobqueue/JobQueue.php: Actually return the value from getRootJobCacheKey() - T209429 (duration: 00m 53s) [19:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:26] T209429: memcached error: A BAD KEY WAS PROVIDED/CHARACTERS OUT OF RANGE - https://phabricator.wikimedia.org/T209429 [19:14:30] anomie: hotfix deployed thanks [19:19:39] PROBLEM - Check systemd state on ms-be2038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:22:45] PROBLEM - PyBal BGP sessions are established on lvs2010 is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=codfw%2520prometheus%252Fops [19:24:11] jouncebot: next [19:24:11] In 0 hour(s) and 35 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T2000) [19:24:29] 10Operations, 10Gerrit, 10Release-Engineering-Team (Watching / External): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10thcipriani) Seems like the metrics reporter plugin hasn't received any updates for 8 months now: https://gerrit.googlesource.com/plugins/metrics-reporter-... 
[19:24:33] that's expected [19:24:36] the lvs2010 stuff [19:24:40] sorry about the noise [19:24:56] I am going to wait for the Americans train window in 30 minutes from now [19:34:41] (03Abandoned) 10Herron: logstash::input::kafka: add topics_pattern support [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [19:35:00] 10Operations, 10Gerrit: Add javamelody to gerrit - https://phabricator.wikimedia.org/T209526 (10thcipriani) p:05Triage>03High [19:35:17] (03PS1) 10Herron: logstash::input::kafka: add topics_pattern support [puppet] - 10https://gerrit.wikimedia.org/r/473588 (https://phabricator.wikimedia.org/T206454) [19:35:34] 10Operations, 10Gerrit, 10Release-Engineering-Team (Kanban): Add javamelody to gerrit - https://phabricator.wikimedia.org/T209526 (10thcipriani) a:03thcipriani [19:36:14] (03PS1) 10Thcipriani: Add JavaMelody plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/473589 (https://phabricator.wikimedia.org/T209526) [19:37:03] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) [19:37:11] (03CR) 10Herron: "abandoned for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/473588/" [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [19:37:19] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) 05Open>03Resolved [19:37:27] RECOVERY - Check the NTP synchronisation status of timesyncd on lvs2010 is OK: OK: synced at Wed 2018-11-14 19:37:25 UTC. 
[19:42:22] (03CR) 10Herron: "this is a continuation of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/473138/" [puppet] - 10https://gerrit.wikimedia.org/r/473588 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [19:42:30] (03CR) 10Herron: "PCC is a noop https://puppet-compiler.wmflabs.org/compiler1002/13498/" [puppet] - 10https://gerrit.wikimedia.org/r/473588 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [19:43:29] (03CR) 10Herron: [C: 032] logstash::input::kafka: add topics_pattern support [puppet] - 10https://gerrit.wikimedia.org/r/473588 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [19:47:19] (03PS11) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) [19:56:02] (03PS2) 10Bstorm: sonofgridengine: stretch bastions want libboost-dev [puppet] - 10https://gerrit.wikimedia.org/r/473574 (https://phabricator.wikimedia.org/T200557) [19:57:30] (03CR) 10Bstorm: [C: 032] sonofgridengine: stretch bastions want libboost-dev [puppet] - 10https://gerrit.wikimedia.org/r/473574 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [20:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T2000) [20:04:53] (03CR) 10Herron: [C: 032] logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [20:05:01] (03PS12) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) [20:08:17] going to run train for group1 [20:08:52] (03PS1) 10Hashar: group1 wikis to 1.33.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473597 [20:08:54] (03CR) 10Hashar: [C: 032] group1 wikis to 1.33.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473597 
(owner: 10Hashar) [20:10:42] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473597 (owner: 10Hashar) [20:12:51] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.4 [20:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:07] (03PS1) 10Herron: Revert "logstash: add rsyslog-shipper kafka input config" [puppet] - 10https://gerrit.wikimedia.org/r/473598 [20:13:44] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.4 (duration: 00m 52s) [20:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:15] (03CR) 10Herron: [C: 032] "reverting because $kafka_config['brokers']['string'] expands to plaintext ports" [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [20:14:36] (03CR) 10Herron: [C: 032] Revert "logstash: add rsyslog-shipper kafka input config" [puppet] - 10https://gerrit.wikimedia.org/r/473598 (owner: 10Herron) [20:16:20] (03CR) 10Cwhite: [C: 032] rename tegmen to icinga2001 in puppet, DHCP, and use stretch [puppet] - 10https://gerrit.wikimedia.org/r/471898 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn) [20:18:51] (03PS2) 10Cwhite: rename tegmen to icinga2001 in puppet, DHCP, and use stretch [puppet] - 10https://gerrit.wikimedia.org/r/471898 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn) [20:20:37] (03PS3) 10Cwhite: rename tegmen to icinga2001 in puppet, DHCP, and use stretch [puppet] - 10https://gerrit.wikimedia.org/r/471898 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn) [20:22:50] (03CR) 10Dzahn: [C: 031] rename tegmen to icinga2001 in puppet, DHCP, and use stretch [puppet] - 10https://gerrit.wikimedia.org/r/471898 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn) [20:23:07] (03CR) 10Cwhite: [C: 032] rename tegmen to icinga2001 in puppet, DHCP, and use stretch 
[puppet] - 10https://gerrit.wikimedia.org/r/471898 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn) [20:23:25] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473597 (owner: 10Hashar) [20:25:37] (03PS3) 10Cwhite: rename tegmen.wikimedia.org to icinga2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/472225 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn) [20:25:46] group1 looks good so far [20:25:53] (03CR) 10Cwhite: [C: 032] rename tegmen.wikimedia.org to icinga2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/472225 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn) [20:27:45] (03PS2) 10Cwhite: smokeping: replace tegmen with icinga2001 target [puppet] - 10https://gerrit.wikimedia.org/r/471899 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn) [20:28:24] (03PS1) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/473607 (https://phabricator.wikimedia.org/T206454) [20:28:43] (03CR) 10Cwhite: [C: 032] smokeping: replace tegmen with icinga2001 target [puppet] - 10https://gerrit.wikimedia.org/r/471899 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn) [20:29:22] XioNoX: fyi, we are renaming a smokeping target(separate from the other day) [20:29:36] since this is the same server just changing names.. we have to do it at once.. with the DNS change [20:29:46] might be an alert but should be gone soon [20:43:05] 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cwhite on cumin1001.eqiad.wmnet for hosts: ` tegmen.mgmt.codfw.wmnet ` The log can be found in `... 
[20:43:07] 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['tegmen.mgmt.codfw.wmnet'] ` Of which those **FAILED**: ` ['tegmen.mgmt.codfw.wmnet'] ` [20:43:33] 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cwhite on cumin1001.eqiad.wmnet for hosts: ` tegmen.wikimedia.org ` The log can be found in `/va... [20:44:46] hashar: Does group 1 seem stable enough that I can start some maintenance scripts, or should I just wait for tomorrow? [20:47:29] (03PS2) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/473607 (https://phabricator.wikimedia.org/T206454) [20:51:58] (03CR) 10Volans: [C: 031] "Syntactically correct, I'll leave it to you and Moritz for the logically correct, maybe a global one that includes both roles and both DC" [puppet] - 10https://gerrit.wikimedia.org/r/473582 (owner: 10Effie Mouzeli) [20:52:25] (03PS1) 10Thcipriani: Add javamelody plugin and dependencies [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/473610 (https://phabricator.wikimedia.org/T209526) [20:53:59] (03PS3) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/473607 (https://phabricator.wikimedia.org/T206454) [20:56:22] (03PS1) 10Thcipriani: Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) [20:56:56] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani) [20:57:21] (03CR) 1020after4: [C: 032] Add JavaMelody 
plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/473589 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani) [20:57:29] (03CR) 1020after4: [C: 032] Add javamelody plugin and dependencies [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/473610 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani) [20:58:15] (03PS2) 10Thcipriani: Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) [20:58:50] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani) [20:59:16] 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cwhite on cumin1001.eqiad.wmnet for hosts: ` tegmen.wikimedia.org ` The log can be found in `/va... [20:59:27] 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['tegmen.wikimedia.org'] ` Of which those **FAILED**: ` ['tegmen.wikimedia.org'] ` [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor I � Unicode. All rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T2100). 
[21:00:09] (03PS3) 10Thcipriani: Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) [21:00:40] (03PS4) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/473607 (https://phabricator.wikimedia.org/T206454) [21:00:54] (03CR) 1020after4: [C: 031] Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani) [21:04:16] (03CR) 10Thcipriani: "puppet compiler run: https://puppet-compiler.wmflabs.org/compiler1002/13502/" [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani) [21:04:35] (03CR) 10Herron: [C: 032] logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/473607 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [21:05:58] 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cwhite on cumin1001.eqiad.wmnet for hosts: ` tegmen.wikimedia.org ` The log can be found in `/va... [21:06:00] 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['tegmen.wikimedia.org'] ` Of which those **FAILED**: ` ['tegmen.wikimedia.org'] ` [21:09:03] 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cwhite on cumin1001.eqiad.wmnet for hosts: ` tegmen.wikimedia.org ` The log can be found in `/va... 
[21:14:39] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:15:17] (03PS1) 10Herron: kafka_config: add ssl_string to documentation section [puppet] - 10https://gerrit.wikimedia.org/r/473614 [21:20:31] (03PS4) 10Dzahn: Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani) [21:20:33] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), 10User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Imarlier) @Gehel have the patches referenced above been deployed? [21:20:43] 10Operations, 10Performance-Team, 10Wikidata, 10Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (10Imarlier) Doesn't appear to have solved the issue, but I need to verify that the patches have actually been deployed: https://logstash.wikimedia.or... [21:20:47] (03PS1) 10Herron: logstash: remove unnecessary newline from kafka input template [puppet] - 10https://gerrit.wikimedia.org/r/473616 [21:21:16] (03CR) 10Thcipriani: [V: 032] Add JavaMelody plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/473589 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani) [21:21:26] 10Operations, 10Performance-Team, 10Traffic: Stop oversampling Asian countries - https://phabricator.wikimedia.org/T204365 (10Imarlier) 05Open>03Resolved Resolved a long time ago, just forgot to close out the ticket. 
[21:21:36] (03CR) 10Thcipriani: [V: 032] Add javamelody plugin and dependencies [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/473610 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani) [21:21:54] (03CR) 10Herron: [C: 032] kafka_config: add ssl_string to documentation section [puppet] - 10https://gerrit.wikimedia.org/r/473614 (owner: 10Herron) [21:24:17] (03PS2) 10Herron: logstash: remove unnecessary newline from kafka input template [puppet] - 10https://gerrit.wikimedia.org/r/473616 [21:25:10] (03CR) 10Herron: [C: 032] logstash: remove unnecessary newline from kafka input template [puppet] - 10https://gerrit.wikimedia.org/r/473616 (owner: 10Herron) [21:26:18] (03CR) 10Dzahn: [C: 032] Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani) [21:26:25] (03PS5) 10Dzahn: Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani) [21:26:54] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@ab2fa18]: deploy javamelody on gerrit2001 [21:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:05] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@ab2fa18]: deploy javamelody on gerrit2001 (duration: 00m 11s) [21:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:55] thcipriani: "deploying" the symlink [21:28:10] mutante: thank you! 
[21:28:44] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@ab2fa18]: deploy javamelody on cobalt [21:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:54] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@ab2fa18]: deploy javamelody on cobalt (duration: 00m 09s) [21:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:06] Notice: /Stage[main]/Gerrit::Jetty/File[/var/lib/gerrit2/review_site/lib/javamelody-deps_deploy.jar]/ensure: created [21:30:08] great! [21:30:37] 10Operations, 10Performance-Team, 10Wikidata, 10Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (10Smalyshev) The patch has been deployed, and doesn't look like it prevents the issue: ` 18:05:29.346 [update 4] WARN org.wikidata.query.rdf.tool.U... [21:31:53] 10Operations, 10Performance-Team, 10Wikidata, 10Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (10Smalyshev) > So, an interesting thing: in at least some of these cases, there is a web request that is making it to wikidata, and that is returning... [21:33:12] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), 10User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Smalyshev) @Imarlier yes the patches have been deployed though we don't use RC A... [21:34:39] !log restart gerrit to load JavaMelody dependency library [21:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:20] 10Operations, 10Performance-Team, 10Wikidata, 10Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (10Imarlier) >>! In T207718#4748289, @Smalyshev wrote: >> So, an interesting thing: in at least some of these cases, there is a web request that is ma... 
[21:36:37] thcipriani: Nice! :) [21:37:21] 10Operations, 10ops-codfw, 10ops-eqiad, 10ops-ulsfo: Devices with wmf* names and status active - https://phabricator.wikimedia.org/T209074 (10Cmjohnson) @ayounsi should they all have "inventory" status? [21:39:59] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 3 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy] [21:41:05] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [21:41:33] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [21:44:53] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:50:04] 10Operations, 10Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (10thcipriani) [21:50:10] 10Operations, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Add javamelody to gerrit - https://phabricator.wikimedia.org/T209526 (10thcipriani) 05Open>03Resolved Now available for gerrit admins at: https://gerrit.wikimedia.org/r/monitoring [21:58:49] 10Operations, 10Performance-Team, 10Wikidata, 10Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (10Smalyshev) > It could be -- how quickly does it retry? Immediately? Or is there a delay? I don't think there's a delay for NoHttpResponseException... 
[22:02:32] (03PS3) 10Zoranzoki21: Add new throttle rule for Art+Feminism Event on 2018-11-17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473255 (https://phabricator.wikimedia.org/T209324) [22:03:08] 10Operations, 10ops-codfw, 10ops-eqiad, 10ops-ulsfo: Devices with wmf* names and status active - https://phabricator.wikimedia.org/T209074 (10RobH) spares should be 'planned' https://wikitech.wikimedia.org/wiki/Server_Lifecycle#States [22:03:37] (03PS1) 10Bstorm: sonofgridengine: fighting through the dependency quirks [puppet] - 10https://gerrit.wikimedia.org/r/473628 (https://phabricator.wikimedia.org/T200557) [22:04:43] (03CR) 10Bstorm: [C: 032] sonofgridengine: fighting through the dependency quirks [puppet] - 10https://gerrit.wikimedia.org/r/473628 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [22:06:39] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [22:06:39] 10Operations, 10ops-eqiad, 10Cloud-Services, 10DC-Ops: labvirt1018 -> cloudvirt1018: update physical label, network port description, netbox - https://phabricator.wikimedia.org/T207319 (10Andrew) a:05Andrew>03Cmjohnson [22:09:23] (03PS4) 10Zoranzoki21: Disable FlaggedRevs, enable RC patrol and add rights on srwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472745 (https://phabricator.wikimedia.org/T209251) [22:10:39] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:11:43] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [22:14:29] those were gerrit (git clone) related [22:14:45] 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['icinga2001.wikimedia.org'] ` Of which those **FAILED**: ` ['icinga2001.wikimedia.org'] ` 
[22:17:15] hi mutante :) [22:18:27] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) just got off the phone with HP and they are stating that they are not seeing any issues with the raid battery in the logs I have sent. They suggest it's our reporting tool. [22:18:30] anomie: sorry I have missed your ping. I think group1 is stable so far, I haven't noticed much [22:19:14] anomie: so I guess it is fine to run maintenance scripts ): [22:19:15] :) [22:21:36] I am off for some sleep & [22:25:25] the page https://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Map_of_Virginia_highlighting_Arlington_County.svg/120px-Map_of_Virginia_highlighting_Arlington_County.svg.png is giving an "Error: 429, Too Many Requests" error [22:25:55] I guess there should be a grafana dashboard showing the scalers' load, but I don't see it [22:26:39] maybe someone has an idea what's going on? [22:35:01] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:39:05] Often that just means a render failure followed by so many attempts that the rate limiter kicked in (as in that scenario every view is a render attempt) [22:40:25] so the 429 is just masking the real error [22:40:37] makes sense [22:41:47] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [22:42:06] is that for the base image? [22:42:23] it also shows a 429 if I change the url to a different size [22:42:47] * bawolff's knowledge is pre-Thumbor and outdated [22:44:31] RECOVERY - Long running screen/tmux on lvs2010 is OK: OK: No SCREEN or tmux processes detected.
[22:44:32] * Platonides didn't know about thumbor… [22:45:48] reading https://wikitech.wikimedia.org/wiki/Thumbor#Throttling I understand it is throttling the original [22:47:55] so, for some reason, thumbor is not generating a thumbnail for this [22:48:09] the next question should be “why?” [22:48:26] (too many points to render? :S) [22:51:39] If it's a huge file, running out of time, OOM, etc. would be a possible reason [22:52:22] otherwise it would mean that librsvg is crashing for some reason [22:52:37] I have no idea where thumbor logs are [22:58:47] "Thumbor logs go to /srv/log/thumbor on the Thumbor servers." [22:59:13] which isn't helpful for laymen like me :P [22:59:35] I would expect it is also sent out [23:12:33] 10Operations, 10JADE, 10TechCom, 10Epic, and 3 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381 (10Harej) p:05Triage>03Normal [23:33:22] (03PS1) 10Thcipriani: Gerrit: add basic robots.txt for proxy [puppet] - 10https://gerrit.wikimedia.org/r/473638 (https://phabricator.wikimedia.org/T209456) [23:33:43] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:39:28] (03CR) 10Paladox: [C: 031] Gerrit: add basic robots.txt for proxy [puppet] - 10https://gerrit.wikimedia.org/r/473638 (https://phabricator.wikimedia.org/T209456) (owner: 10Thcipriani) [23:39:31] 10Operations, 10JADE, 10TechCom, 10Epic, and 3 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381 (10awight) [23:39:35] 10Operations, 10DBA, 10JADE, 10Epic, and 2 others: [Epic] Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (10awight) 05Open>03Resolved This was addressed for now, by an agreement between our team and SRE to not install JADE on wikis with revision table size >= 100GB. Th...
[23:41:39] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [23:49:15] (03PS1) 10Bstorm: sonofgridengine: Try directly setting a docker install [puppet] - 10https://gerrit.wikimedia.org/r/473641 (https://phabricator.wikimedia.org/T200557) [23:50:28] (03CR) 10Bstorm: [C: 032] sonofgridengine: Try directly setting a docker install [puppet] - 10https://gerrit.wikimedia.org/r/473641 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [23:51:29] (03PS4) 10Volans: Add Icinga module [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) [23:51:57] (03CR) 10Volans: "Done, thanks for the review, totally agree." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [23:52:57] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.