[00:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T0000).
[00:00:04] <jouncebot>	 niedzielski, Zoranzoki21, and ebernhardson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:30] <niedzielski>	 👍
[00:01:45] <ebernhardson>	 \o
[00:14:13] <wikibugs>	 (PS2) Dzahn: smokeping: replace target einsteinium with authdns1001 [puppet] - https://gerrit.wikimedia.org/r/473282 (https://phabricator.wikimedia.org/T202782)
[00:14:15] <wikibugs>	 (PS1) Dzahn: bienvenida: add cache-control headers with max-age 1 hour [puppet] - https://gerrit.wikimedia.org/r/473306 (https://phabricator.wikimedia.org/T202592)
[00:15:04] <niedzielski>	 Who is conducting the SWAT today?
[00:15:18] <wikibugs>	 (PS2) Dzahn: bienvenida: add cache-control headers with max-age 1 hour [puppet] - https://gerrit.wikimedia.org/r/473306 (https://phabricator.wikimedia.org/T202592)
[00:15:48] <wikibugs>	 (CR) Dzahn: [C: 2] "per chat in -traffic" [puppet] - https://gerrit.wikimedia.org/r/473306 (https://phabricator.wikimedia.org/T202592) (owner: Dzahn)
[00:20:49] <thcipriani>	 niedzielski: I can SWAT if you're still available
[00:21:01] <niedzielski>	 thcipriani:  yes please!
[00:22:32] <wikibugs>	 (CR) Thcipriani: [C: -1] Add new throttle rule for Art+Feminism Event on 2018-11-17 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/473255 (https://phabricator.wikimedia.org/T209324) (owner: Zoranzoki21)
[00:23:05] <thcipriani>	 wikibase sure triggers a good amount of tests :)
[00:24:41] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:25:03] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[00:25:07] <niedzielski>	 yeah, unfortunately it takes a bit
[00:25:13] <wikibugs>	 (PS3) Dzahn: smokeping: replace target einsteinium with authdns1001 [puppet] - https://gerrit.wikimedia.org/r/473282 (https://phabricator.wikimedia.org/T202782)
[00:25:37] <niedzielski>	 i'm still coming up to speed with the repo but it looks really nice inside!
[00:26:53] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:30:15] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 27.48 ms
[00:30:41] <wikibugs>	 (CR) Dzahn: [C: 2] smokeping: replace target einsteinium with authdns1001 [puppet] - https://gerrit.wikimedia.org/r/473282 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:32:17] <wikibugs>	 (PS4) Dzahn: smokeping: replace target einsteinium with authdns1001 [puppet] - https://gerrit.wikimedia.org/r/473282 (https://phabricator.wikimedia.org/T202782)
[00:37:38] <thcipriani>	 sooo close
[00:42:31] <mutante>	 !log restarted smokeping on netmon1002 and netmon2001 
[00:42:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:37] <thcipriani>	 \o/
[00:53:24] <niedzielski>	 yay, ok let me check a few pages
[00:53:53] <thcipriani>	 niedzielski: live on mwdebug1002 now
[01:00:50] <niedzielski>	 i think all is well. thank you thcipriani 
[01:01:07] <thcipriani>	 niedzielski: okie doke, Good to sync everywhere?
[01:01:20] <niedzielski>	 thcipriani: yes please
[01:01:24] * thcipriani does
[01:02:38] <logmsgbot>	 !log thcipriani@deploy1001 Synchronized php-1.33.0-wmf.3/extensions/Wikibase/client/includes: SWAT: [[gerrit:473166|Update: use wikibase-debug logger instead of "PageRandomLookup"]] T208796 (duration: 00m 56s)
[01:02:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:41] <stashbot>	 T208796: Use wikibase-debug Logstash channel to log unexpected page_random values - https://phabricator.wikimedia.org/T208796
[01:02:46] <thcipriani>	 ^ niedzielski live everywhere
[01:03:00] <niedzielski>	 thank you!
[01:03:10] <thcipriani>	 yw :)
[02:06:12] <wikibugs>	 Operations, DBA, JADE, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight) @Marostegui Pinging for review of these two files, https://phabricator.wikimedia.org/diffusion/EJ...
[02:14:37] <wikibugs>	 Operations, Research-Programs, SRE-Access-Requests, Epic: Server Access for 3 formal collaborators - https://phabricator.wikimedia.org/T209298 (Dzahn) It's fine to share public keys here right on the ticket since they are public and will be added to public repos either way.
[02:27:02] <revi>	 did CSP for private/fishbowl wikis get somewhat harsher? or is it still console log-only?
[02:28:06] <revi>	 uh nvm it loads
[02:28:15] <revi>	 just full of console warnings for meta.wikimedia.org, which is a bit weird tho
[02:28:48] <revi>	 and enwiki and kowiki
[03:30:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 826.93 seconds
[04:16:03] <icinga-wm>	 PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhprof]
[04:25:53] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 244.40 seconds
[04:41:33] <icinga-wm>	 RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:49:57] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[05:39:25] <icinga-wm>	 PROBLEM - puppet last run on cumin1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks]
[05:39:51] <icinga-wm>	 PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui]
[05:40:31] <icinga-wm>	 PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[05:40:33] <icinga-wm>	 PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer]
[05:40:40] <legoktm>	 ok so
[05:40:45] <legoktm>	 I think Gerrit is having issues
[05:40:53] <icinga-wm>	 PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[05:41:01] <icinga-wm>	 PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[05:41:18] <legoktm>	 oh
[05:41:20] <legoktm>	 it's down
[05:41:21] <icinga-wm>	 PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2],Exec[git_pull_analytics.wikimedia.org]
[05:42:03] <icinga-wm>	 PROBLEM - puppet last run on db1124 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[05:42:11] <icinga-wm>	 PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 5 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy]
[05:43:01] <icinga-wm>	 PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[05:43:05] <icinga-wm>	 PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[05:43:05] <icinga-wm>	 PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[05:43:05] <icinga-wm>	 PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/reportupdater],Exec[git_pull_wikimedia/discovery/golden]
[05:43:15] <icinga-wm>	 PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[05:43:25] <icinga-wm>	 PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks]
[05:43:43] <icinga-wm>	 PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer]
[05:43:49] <icinga-wm>	 PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 6 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy],Exec[git_pull_research/landing-page]
[05:44:27] <wikibugs>	 Operations, Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (Joe)
[05:45:43] <icinga-wm>	 PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki]
[05:46:09] <icinga-wm>	 PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot]
[05:46:49] <_joe_>	 !log restarting gerrit
[05:46:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:13] <icinga-wm>	 PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_statistics_mediawiki],Exec[git_pull_analytics/reportupdater]
[05:47:21] <icinga-wm>	 PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot]
[05:48:45] <icinga-wm>	 PROBLEM - puppet last run on db1125 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[05:49:21] <icinga-wm>	 PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks]
[05:49:44] <wikibugs>	 Operations, Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (Joe) I was still quite asleep, but I saw a series of broken pipes from sockets and jetty refusing to manage any new connection in the logs, so I just restarted gerrit. It is now working, so we can lower the...
[05:49:54] <wikibugs>	 Operations, Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (Joe) p:Unbreak!>High
[06:00:11] <icinga-wm>	 RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:00:55] <icinga-wm>	 RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:02:52] <wikibugs>	 Operations, Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (Legoktm) [21:45:43] <legoktm> I didn't realize it was upgraded today, I had been encountering weird behavior a few hours ago that I wasn't sure about ... [21:58:11] <legoktm> I couldn't get https://gerrit.wi...
[06:03:31] <icinga-wm>	 RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:03:39] <icinga-wm>	 RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:03:47] <icinga-wm>	 RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:04:05] <icinga-wm>	 RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:04:11] <icinga-wm>	 RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:06:01] <icinga-wm>	 RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:06:21] <icinga-wm>	 RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:06:27] <icinga-wm>	 RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:07:31] <icinga-wm>	 RECOVERY - puppet last run on db1124 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:07:37] <icinga-wm>	 RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:08:37] <icinga-wm>	 RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:09:59] <icinga-wm>	 RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:11:51] <icinga-wm>	 RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:12:41] <icinga-wm>	 RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:12:47] <icinga-wm>	 RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:13:37] <icinga-wm>	 RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:14:11] <icinga-wm>	 RECOVERY - puppet last run on db1125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:14:22] <wikibugs>	 (CR) Legoktm: [C: 2] Add PHP version information to log entries [mediawiki-config] - https://gerrit.wikimedia.org/r/472498 (https://phabricator.wikimedia.org/T209076) (owner: Legoktm)
[06:14:47] <icinga-wm>	 RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:15:44] <wikibugs>	 (Merged) jenkins-bot: Add PHP version information to log entries [mediawiki-config] - https://gerrit.wikimedia.org/r/472498 (https://phabricator.wikimedia.org/T209076) (owner: Legoktm)
[06:16:17] <icinga-wm>	 RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:16:41] <icinga-wm>	 RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:18:47] <icinga-wm>	 RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:22:09] <wikibugs>	 (CR) jenkins-bot: Add PHP version information to log entries [mediawiki-config] - https://gerrit.wikimedia.org/r/472498 (https://phabricator.wikimedia.org/T209076) (owner: Legoktm)
[06:22:17] <legoktm>	 hrm
[06:22:39] <wikibugs>	 (PS1) Marostegui: db-codfw.php: Depool pc2005 [mediawiki-config] - https://gerrit.wikimedia.org/r/473332 (https://phabricator.wikimedia.org/T208383)
[06:24:05] <wikibugs>	 (PS1) Marostegui: pc2010: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/473333 (https://phabricator.wikimedia.org/T208383)
[06:24:09] <wikibugs>	 (CR) Marostegui: [C: 2] db-codfw.php: Depool pc2005 [mediawiki-config] - https://gerrit.wikimedia.org/r/473332 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[06:24:33] <wikibugs>	 (PS1) Legoktm: Revert "Add PHP version information to log entries" [mediawiki-config] - https://gerrit.wikimedia.org/r/473334
[06:24:41] <wikibugs>	 (CR) Legoktm: [C: 2] Revert "Add PHP version information to log entries" [mediawiki-config] - https://gerrit.wikimedia.org/r/473334 (owner: Legoktm)
[06:25:21] <wikibugs>	 (CR) Marostegui: [C: 2] pc2010: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/473333 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[06:25:38] <wikibugs>	 (Merged) jenkins-bot: db-codfw.php: Depool pc2005 [mediawiki-config] - https://gerrit.wikimedia.org/r/473332 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[06:26:00] <wikibugs>	 (Merged) jenkins-bot: Revert "Add PHP version information to log entries" [mediawiki-config] - https://gerrit.wikimedia.org/r/473334 (owner: Legoktm)
[06:26:10] <marostegui>	 legoktm: I have merged your change in deploy1001
[06:26:52] <legoktm>	 marostegui: sorry, I thought I'd sync out my patch pretty easily, and then it didn't work on mwdebug :/
[06:27:10] <legoktm>	 ok
[06:27:11] <legoktm>	 thanks
[06:27:33] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool pc2005 - T208383 (duration: 01m 04s)
[06:27:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:36] <stashbot>	 T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383
[06:27:41] <marostegui>	 legoktm: Mine got merged before yours but looks like between the notification and my actual git rebase yours got merged too
[06:27:49] <legoktm>	 great
[06:27:53] <legoktm>	 I also tried to git fetch as well
[06:28:08] <marostegui>	 yeah, i did fetch and rebase
[06:30:57] <icinga-wm>	 PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:31:41] <icinga-wm>	 PROBLEM - puppet last run on mw2285 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt]
[06:31:59] <marostegui>	 !log Stop MySQL on pc2005 to clone it to pc2008 - T208383
[06:32:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:42] <wikibugs>	 (CR) jenkins-bot: db-codfw.php: Depool pc2005 [mediawiki-config] - https://gerrit.wikimedia.org/r/473332 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[06:36:44] <wikibugs>	 (CR) jenkins-bot: Revert "Add PHP version information to log entries" [mediawiki-config] - https://gerrit.wikimedia.org/r/473334 (owner: Legoktm)
[06:36:47] <wikibugs>	 Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (elukey) a:RobH>elukey
[06:40:26] <marostegui>	 !log Deploy schema change on s6 codfw master, this will generate lag on s6 codfw - T203709
[06:40:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:30] <stashbot>	 T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[06:42:05] <icinga-wm>	 RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[06:52:35] <marostegui>	 !log Deploy schema change on s4 codfw master, this will generate lag on s4 codfw - T203709
[06:52:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:39] <stashbot>	 T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[06:57:09] <icinga-wm>	 RECOVERY - puppet last run on mw2285 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[07:07:26] <marostegui>	 !log Deploy schema change on s2 codfw master, this will generate lag on s2 codfw - T203709
[07:07:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:30] <stashbot>	 T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[07:15:13] <icinga-wm>	 PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:18:59] <marostegui>	 !log Deploy schema change on s7 codfw master, this will generate lag on s7 codfw - T203709
[07:19:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:02] <stashbot>	 T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[07:26:13] <wikibugs>	 (PS1) Elukey: Add an-worker1078-95 basic settings [puppet] - https://gerrit.wikimedia.org/r/473359 (https://phabricator.wikimedia.org/T207192)
[07:41:39] <icinga-wm>	 RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[07:42:45] <marostegui>	 !log Deploy schema change on s3 codfw master, this will generate lag on s3 codfw - T203709
[07:42:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:49] <stashbot>	 T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[07:42:54] <wikibugs>	 (CR) Elukey: [C: 2] Add an-worker1078-95 basic settings [puppet] - https://gerrit.wikimedia.org/r/473359 (https://phabricator.wikimedia.org/T207192) (owner: Elukey)
[07:46:26] <wikibugs>	 Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1078.eqiad.wmnet'...
[07:46:49] <wikibugs>	 (PS6) Vgutierrez: certcentral: switch to active/passive [puppet] - https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161)
[07:52:35] <wikibugs>	 (PS7) Vgutierrez: certcentral: switch to active/passive [puppet] - https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161)
[07:59:04] <wikibugs>	 (PS8) Vgutierrez: certcentral: switch to active/passive [puppet] - https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161)
[07:59:21] <wikibugs>	 Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for WMDE-leszek - https://phabricator.wikimedia.org/T208717 (MoritzMuehlenhoff) Resolved>Open Leszek; you're now using the same SSH key in Cloud VPS as in the production cluster. This is a security risk as WMCS...
[08:02:41] <icinga-wm>	 PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:02:48] <wikibugs>	 Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for WMDE-leszek - https://phabricator.wikimedia.org/T208717 (WMDE-leszek)
[08:04:11] <wikibugs>	 (CR) Vgutierrez: "pcc seems happy https://puppet-compiler.wmflabs.org/compiler1002/13471/" [puppet] - https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161) (owner: Vgutierrez)
[08:04:13] <wikibugs>	 Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for WMDE-leszek - https://phabricator.wikimedia.org/T208717 (WMDE-leszek) Thanks @MoritzMuehlenhoff for your attention and noticing my sloppiness. Changed the ssh key, for the one to be only used for production access.
[08:04:40] <wikibugs>	 (CR) Filippo Giunchedi: [C: 2] hieradata: rollout syslog_exporter in eqiad [puppet] - https://gerrit.wikimedia.org/r/473173 (https://phabricator.wikimedia.org/T206633) (owner: Filippo Giunchedi)
[08:04:49] <wikibugs>	 (PS2) Filippo Giunchedi: hieradata: rollout syslog_exporter in eqiad [puppet] - https://gerrit.wikimedia.org/r/473173 (https://phabricator.wikimedia.org/T206633)
[08:07:25] <godog>	 !log rollout rsyslog_exporter to eqiad
[08:07:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:08] <marostegui>	 !log Deploy schema change on s3 codfw master, this will generate lag on s3 codfw - T205913
[08:08:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:11] <stashbot>	 T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913
[08:12:35] <icinga-wm>	 RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[08:13:42] <wikibugs>	 (PS1) Elukey: Set a different mac address for an-worker1078's DHCP [puppet] - https://gerrit.wikimedia.org/r/473387 (https://phabricator.wikimedia.org/T207192)
[08:14:32] <marostegui>	 !log Deploy schema change on s4 codfw master, this will generate lag on s4 codfw - T205913
[08:14:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:34] <stashbot>	 T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913
[08:14:35] <wikibugs>	 (CR) Elukey: [C: 2] Set a different mac address for an-worker1078's DHCP [puppet] - https://gerrit.wikimedia.org/r/473387 (https://phabricator.wikimedia.org/T207192) (owner: Elukey)
[08:17:48] <marostegui>	 !log Deploy schema change on s6 codfw master, this will generate lag on s6 codfw - T205913
[08:17:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:18] <marostegui>	 !log Deploy schema change on s2 codfw master, this will generate lag on s2 codfw - T205913
[08:19:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:19] <marostegui>	 !log Deploy schema change on s7 codfw master, this will generate lag on s7 codfw - T205913
[08:22:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:21] <stashbot>	 T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913
[08:23:17] <wikibugs>	 Operations, Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (Joe) a:Joe>None
[08:24:29] <marostegui>	 !log Deploy schema change on s5 codfw master, this will generate lag on s5 codfw - T205913
[08:24:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:15] <wikibugs>	 Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): Relabel labvirt1016.eqiad.wmnet as cloudvirt1016.eqiad.wmnet - https://phabricator.wikimedia.org/T209427 (ArielGlenn) p:Triage>Normal
[08:28:29] <wikibugs>	 Operations, decommission, User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (ArielGlenn) p:Triage>Normal
[08:35:55] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 682.87 seconds
[08:36:03] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 690.42 seconds
[08:36:19] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 706.59 seconds
[08:36:21] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 708.14 seconds
[08:36:27] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 714.76 seconds
[08:41:10] <marostegui>	 ^ checking
[08:41:42] <wikibugs>	 Operations: change my email address in the techcom alias - https://phabricator.wikimedia.org/T209391 (ArielGlenn) Open>Resolved p:Triage>Normal a:ArielGlenn Done.
[08:42:53] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on cloudelastic1003 - https://phabricator.wikimedia.org/T209408 (ArielGlenn) p:Triage>Normal
[08:44:20] <marostegui>	 banyek: can you help check what's going on please?
[08:44:52] <banyek>	 yes
[08:46:14] <banyek>	 I thought it is the schema change
[08:46:40] <marostegui>	 I found it and fixed it
[08:46:55] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 0.32 seconds
[08:47:03] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 0.16 seconds
[08:47:21] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[08:47:21] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 0.29 seconds
[08:47:29] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 0.24 seconds
[08:50:59] <icinga-wm>	 PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:51:36] <wikibugs>	 (PS1) Elukey: Set MAC address of 10G interface for an-worker1078 [puppet] - https://gerrit.wikimedia.org/r/473404
[08:52:41] <wikibugs>	 (CR) Elukey: [C: 2] Set MAC address of 10G interface for an-worker1078 [puppet] - https://gerrit.wikimedia.org/r/473404 (owner: Elukey)
[08:56:35] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1035 is OK: OK - running: The system is fully operational
[09:11:59] <icinga-wm>	 RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[09:17:44] <marostegui>	 !log Deploy schema change on db2053 - T86339
[09:17:45] <wikibugs>	 Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Refactor puppet WDQS module - https://phabricator.wikimedia.org/T208201 (Mathew.onipe)
[09:17:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:48] <stashbot>	 T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339
[09:17:48] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Cleanup wdqs puppet profile to include the new changes based on refactoring - https://phabricator.wikimedia.org/T208395 (10Mathew.onipe) 05Open>03Resolved
[09:18:02] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki::web::vhost: remove the set_handler feature flag [puppet] - 10https://gerrit.wikimedia.org/r/473411
[09:18:03] <wikibugs>	 10Operations, 10Graphite, 10Patch-For-Review, 10Performance-Team (Radar), 10Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (10fgiunchedi)
[09:22:32] <moritzm>	 !log updated stretch netinst image for 9.6 point release
[09:22:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:35] <paravoid>	 win 63
[09:23:54] <_joe_>	 lose 64
[09:24:23] <paravoid>	 hah
[09:24:58] <vgutierrez>	 windows overflow... I got RIP in paravoid's computer
[09:25:08] <wikibugs>	 (03CR) 10Gehel: "Minor comments inline, otherwise LGTM." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans)
[09:25:46] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] certcentral: switch to active/passive [puppet] - 10https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez)
[09:25:54] <wikibugs>	 (03PS9) 10Vgutierrez: certcentral: switch to active/passive [puppet] - 10https://gerrit.wikimedia.org/r/473229 (https://phabricator.wikimedia.org/T209161)
[09:27:46] <addshore>	 if i think a mailing list email failed to land in my inbox, should i file a phab ticket about it? or is it really not worth it?
[09:28:48] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I'm not sure if nginx::status_site makes sense on its own without the collector? I'm for absenting the status site for now and we can rein" [puppet] - 10https://gerrit.wikimedia.org/r/473302 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite)
[09:29:50] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 2485 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/opsvar-lag_datasource=eqiad+prometheus/opsvar-mirror_name=main-eqiad_to_main-codfw
[09:29:59] <wikibugs>	 (03CR) 10Gehel: "Minor comments inline, otherwise, LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[09:30:50] <icinga-wm>	 PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:31:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite)
[09:34:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/473295 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite)
[09:36:22] <wikibugs>	 (03CR) 10Gehel: "minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[09:36:52] <icinga-wm>	 PROBLEM - puppet last run on certcentral2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:37:14] <wikibugs>	 (03PS1) 10Vgutierrez: certcentral: Fix cron ensure data type [puppet] - 10https://gerrit.wikimedia.org/r/473418 (https://phabricator.wikimedia.org/T209161)
[09:37:19] <vgutierrez>	 :(
[09:37:45] <vgutierrez>	 pcc misled me :(
[09:38:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, modulo Gehel's comments" [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[09:40:14] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] "pcc happy https://puppet-compiler.wmflabs.org/compiler1002/13473/ and showing the expected values" [puppet] - 10https://gerrit.wikimedia.org/r/473418 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez)
[09:40:25] <wikibugs>	 (03PS2) 10Vgutierrez: certcentral: Fix cron ensure data type [puppet] - 10https://gerrit.wikimedia.org/r/473418 (https://phabricator.wikimedia.org/T209161)
[09:41:14] <icinga-wm>	 RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[09:42:24] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s8 on db1124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.pagelinks: Cant find record in pagelinks, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1087-bin.003425, end_log_pos 688342348
[09:43:50] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1078.eqiad.wmnet'...
[09:47:26] <wikibugs>	 (03PS3) 10Volans: remote: refactor Remote.query() API [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884)
[09:48:53] <wikibugs>	 (03CR) 10Volans: "I've also added the __len__ and __str__ to the RemoteHosts too as it seems useful in both contexts (and I have a use case in an upcoming CR)" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans)
[09:49:50] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 227, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:51:20] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 675.92 seconds
[09:52:02] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:53:29] <wikibugs>	 (03PS1) 10Banyek: mariadb: depool db2046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473428 (https://phabricator.wikimedia.org/T85757)
[09:54:13] <wikibugs>	 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10aborrero)
[09:54:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Fails PCC https://puppet-compiler.wmflabs.org/compiler1002/13474/logstash1007.eqiad.wmnet/change.logstash1007.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[09:54:55] <wikibugs>	 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10aborrero)
[09:54:58] <wikibugs>	 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10aborrero)
[09:55:47] <wikibugs>	 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10aborrero) p:05Triage>03Normal
[09:56:29] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Relabel labvirt1016.eqiad.wmnet as cloudvirt1016.eqiad.wmnet - https://phabricator.wikimedia.org/T209427 (10GTirloni) a:03GTirloni
[09:57:51] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Relabel labvirt1016.eqiad.wmnet as cloudvirt1016.eqiad.wmnet - https://phabricator.wikimedia.org/T209427 (10GTirloni) a:05GTirloni>03None
[09:58:00] <icinga-wm>	 PROBLEM - puppet last run on scb1003 is CRITICAL: CRITICAL: Puppet has 17 failures. Last run 3 minutes ago with 17 failures. Failed resources (up to 3 shown): Exec[chown /srv/deployment/cpjobqueue for deploy-service],Package[mathoid/deploy],Exec[chown /srv/deployment/mathoid for deploy-service],Package[citoid/deploy]
[09:58:03] <wikibugs>	 (03CR) 10Marostegui: [C: 031] mariadb: depool db2046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473428 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek)
[09:58:41] <wikibugs>	 (03CR) 10Banyek: [C: 032] mariadb: depool db2046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473428 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek)
[09:58:58] <wikibugs>	 (03CR) 10Banyek: [V: 032 C: 032] mariadb: depool db2046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473428 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek)
[09:59:27] <wikibugs>	 (03CR) 10Phuedx: [C: 04-1] "From the AC of T208755:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski)
[09:59:28] <banyek>	 !log depooling db2046 (T85757)
[09:59:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:31] <stashbot>	 T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[10:02:35] <marostegui>	 banyek: are you checking the above alert?
[10:02:50] <wikibugs>	 (03CR) 10Muehlenhoff: "Agreed, I don't think we need the status site, it's fine to simply remove diamond::collector::nginx from my PoV: Also. yesterday a dedicat" [puppet] - 10https://gerrit.wikimedia.org/r/473302 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite)
[10:03:05] <banyek>	 the s8 lag?
[10:03:17] <marostegui>	 yes
[10:03:27] <marostegui>	 the broken replication
[10:04:13] <banyek>	 yes, checking
[10:04:34] <marostegui>	 thanks
[10:04:44] <banyek>	 but finishing depooling first as the change is already merged 
[10:04:51] <banyek>	 it's just a scap
[10:05:25] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: depool db2046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473428 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek)
[10:06:37] <wikibugs>	 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10Krenair) > The document is: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron_ideal_model (edits welcome). I put in some basic ones: https://wikitech.wikimedi...
[10:07:04] <logmsgbot>	 !log banyek@deploy1001 Synchronized wmf-config/db-codfw.php: T85757: depooling db2046 (duration: 00m 55s)
[10:07:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:07] <stashbot>	 T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[10:08:59] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1078.eqiad.wmnet'...
[10:10:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] "This is fine to merge. The current metrics from https://grafana.wikimedia.org/dashboard/db/ntp-time-servers are a bit of a regression comp" [puppet] - 10https://gerrit.wikimedia.org/r/473295 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite)
[10:13:03] <wikibugs>	 (03CR) 10Ema: [C: 031] Remove Diamond from caches [puppet] - 10https://gerrit.wikimedia.org/r/472632 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[10:14:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] "Looks good (will need a manual rebase as the underlying patch was changed)" [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite)
[10:18:04] <wikibugs>	 (03PS1) 10Vgutierrez: certcentral: Fix ferm ssh-rsync ensure data type [puppet] - 10https://gerrit.wikimedia.org/r/473430 (https://phabricator.wikimedia.org/T209161)
[10:18:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] certcentral: Fix ferm ssh-rsync ensure data type [puppet] - 10https://gerrit.wikimedia.org/r/473430 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez)
[10:19:50] <wikibugs>	 (03PS1) 10Ema: fifo-log-demux 0.1 [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/473432 (https://phabricator.wikimedia.org/T204225)
[10:21:32] <wikibugs>	 (03PS2) 10Vgutierrez: certcentral: Fix ferm ssh-rsync ensure data type [puppet] - 10https://gerrit.wikimedia.org/r/473430 (https://phabricator.wikimedia.org/T209161)
[10:21:41] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10MoritzMuehlenhoff) That point release has happened and I upgraded our netinst images earlier today, so this should be fine to re-install now.
[10:21:48] <wikibugs>	 (03PS6) 10Gehel: Increase tilerator num_workers maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos)
[10:22:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] certcentral: Fix ferm ssh-rsync ensure data type [puppet] - 10https://gerrit.wikimedia.org/r/473430 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez)
[10:22:34] <wikibugs>	 (03PS1) 10Ladsgroup: Start reading from change_tag_def on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473433 (https://phabricator.wikimedia.org/T208846)
[10:24:45] <icinga-wm>	 RECOVERY - puppet last run on pc2008 is OK: OK: Puppet is currently enabled, last run 12 minutes ago with 0 failures
[10:25:06] <wikibugs>	 (03CR) 10Zoranzoki21: Add new throttle rule for Art+Feminism Event on 2018-11-17 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473255 (https://phabricator.wikimedia.org/T209324) (owner: 10Zoranzoki21)
[10:27:15] <wikibugs>	 (03PS3) 10Vgutierrez: certcentral: Fix ferm ssh-rsync ensure data type [puppet] - 10https://gerrit.wikimedia.org/r/473430 (https://phabricator.wikimedia.org/T209161)
[10:27:31] <wikibugs>	 (03PS7) 10Gehel: Increase tilerator num_workers maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos)
[10:27:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] certcentral: Fix ferm ssh-rsync ensure data type [puppet] - 10https://gerrit.wikimedia.org/r/473430 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez)
[10:29:40] <wikibugs>	 (03CR) 10Gehel: "puppet compiler looks good: https://puppet-compiler.wmflabs.org/compiler1002/13478/" [puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos)
[10:30:22] <wikibugs>	 (03PS4) 10Vgutierrez: certcentral: Fix ferm ssh-rsync ensure data type [puppet] - 10https://gerrit.wikimedia.org/r/473430 (https://phabricator.wikimedia.org/T209161)
[10:30:50] <wikibugs>	 10Operations, 10Graphite, 10Patch-For-Review, 10Performance-Team (Radar), 10Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (10hashar) @fgiunchedi can we add the statsd proxy for the servers running Zuul? My previous comment above T88997#4676750  has all the relevant bits.  It...
[10:32:15] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1078.eqiad.wmnet'] `  and were **ALL** successful.
[10:33:05] <wikibugs>	 10Operations, 10Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (10hashar) That might be related to Gerrit 2.15.6 upgrade T205784 . I am not familiar with Jetty though but we can at least dig in the logs on the cobalt server.
[10:33:13] <wikibugs>	 (03CR) 10DCausse: remote: refactor Remote.query() API (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans)
[10:33:35] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] "pcc looks sane https://puppet-compiler.wmflabs.org/compiler1002/13479/" [puppet] - 10https://gerrit.wikimedia.org/r/473430 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez)
[10:35:09] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s8 on db1124 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[10:36:12] <wikibugs>	 (03CR) 10Gehel: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[10:36:28] <wikibugs>	 (03CR) 10Gehel: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[10:37:35] <icinga-wm>	 RECOVERY - puppet last run on certcentral2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:39:38] <wikibugs>	 10Operations, 10Graphite, 10Patch-For-Review, 10Performance-Team (Radar), 10Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (10fgiunchedi) >>! In T88997#4745665, @hashar wrote: > @fgiunchedi can we add the statsd proxy for the servers running Zuul? My previous comment above T88...
[10:39:45] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s8 on db1124 is OK: OK slave_sql_lag Replication lag: 44.88 seconds
[10:41:01] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "Seems overall sound and quite a good idea :P I have some minor implementation questions, that can be answered inline. None of my comments " (033 comments) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/473432 (https://phabricator.wikimedia.org/T204225) (owner: 10Ema)
[10:41:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[10:42:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: "PCC fails https://puppet-compiler.wmflabs.org/compiler1002/13481/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[10:43:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] Revert "logstash: temp stop managing indices" [puppet] - 10https://gerrit.wikimedia.org/r/473226 (owner: 10Filippo Giunchedi)
[10:43:45] <wikibugs>	 (03PS2) 10Filippo Giunchedi: Revert "logstash: temp stop managing indices" [puppet] - 10https://gerrit.wikimedia.org/r/473226
[10:46:23] <wikibugs>	 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10aborrero)
[10:46:27] <wikibugs>	 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10aborrero)
[10:47:16] <icinga-wm>	 PROBLEM - puppet last run on certcentral1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/certcentral-certs-sync]
[10:47:32] <volans>	 vgutierrez: no joy?
[10:47:36] <wikibugs>	 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10aborrero)
[10:47:46] <wikibugs>	 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10aborrero)
[10:47:48] <vgutierrez>	 volans: yup, cc2001 is happy
[10:47:55] <vgutierrez>	 cc1001 is sad for other reasons
[10:47:58] <vgutierrez>	 fix incoming :)
[10:48:13] <volans>	 ehehe, ack
[10:48:21] <wikibugs>	 10Operations, 10Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (10hashar) We had a lot of stack overflow errors on random pages such as gittiles, that is due to some pages being very long to prettify. Then at 5:36:43 UTC we had a lot of: ` [2018-11-14 05:36:43,643] [HTTP-8...
[10:48:44] <wikibugs>	 (03PS1) 10Elukey: Set correct MAC address for an-worker1079 [puppet] - 10https://gerrit.wikimedia.org/r/473438
[10:48:58] <wikibugs>	 (03PS1) 10Vgutierrez: certcentral: Run certcentral-certs-sync with user cercentral [puppet] - 10https://gerrit.wikimedia.org/r/473439 (https://phabricator.wikimedia.org/T209161)
[10:49:12] <marostegui>	 jouncebot: next
[10:49:12] <jouncebot>	 In 1 hour(s) and 10 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T1200)
[10:49:17] <vgutierrez>	 volans: basically the security of keyholder is too tight and root has no access to the needed SSH key
[10:49:24] <vgutierrez>	 (and that's OK)
[10:49:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] certcentral: Run certcentral-certs-sync with user cercentral [puppet] - 10https://gerrit.wikimedia.org/r/473439 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez)
[10:49:43] <volans>	 ok
[10:50:01] <wikibugs>	 (03PS1) 10WMDE-Fisch: Make AdvancedSearch the default on de-, fa-, ar-, and hu-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473440 (https://phabricator.wikimedia.org/T207640)
[10:50:09] <wikibugs>	 (03CR) 10Elukey: [C: 032] Set correct MAC address for an-worker1079 [puppet] - 10https://gerrit.wikimedia.org/r/473438 (owner: 10Elukey)
[10:50:39] <wikibugs>	 (03PS2) 10Vgutierrez: certcentral: Run certcentral-certs-sync with user cercentral [puppet] - 10https://gerrit.wikimedia.org/r/473439 (https://phabricator.wikimedia.org/T209161)
[10:50:43] <wikibugs>	 (03PS3) 10Filippo Giunchedi: Revert "logstash: temp stop managing indices" [puppet] - 10https://gerrit.wikimedia.org/r/473226
[10:50:59] <vgutierrez>	 and I'm a lazy bastard... so I'm unable to configure my editor properly
[10:51:08] <vgutierrez>	 hence the -1 by jenkins-bot
[10:51:34] <_joe_>	 vgutierrez: s/configure/choose/
[10:51:45] <wikibugs>	 (03CR) 10Volans: "reply inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans)
[10:51:57] <vgutierrez>	 _joe_: I'm only human, I cannot use emacs
[10:53:02] <wikibugs>	 (03PS1) 10Marostegui: db-codfw.php: Add pc2010 as spare [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473444 (https://phabricator.wikimedia.org/T208383)
[10:54:16] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] "pcc is happy https://puppet-compiler.wmflabs.org/compiler1002/13482/" [puppet] - 10https://gerrit.wikimedia.org/r/473439 (https://phabricator.wikimedia.org/T209161) (owner: 10Vgutierrez)
[10:54:27] <wikibugs>	 (03PS3) 10Vgutierrez: certcentral: Run certcentral-certs-sync with user cercentral [puppet] - 10https://gerrit.wikimedia.org/r/473439 (https://phabricator.wikimedia.org/T209161)
[10:54:31] <godog>	 there's emacs for rebel^Wvim users with viper
[10:54:32] * godog runs
[10:54:38] <icinga-wm>	 PROBLEM - Host ganeti1006 is DOWN: PING CRITICAL - Packet loss = 100%
[10:54:50] <_joe_>	 wat?
[10:54:50] <akosiaris>	 that's me ^
[10:54:51] <wikibugs>	 (03CR) 10Tim Eulitz: [C: 031] "😎👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473440 (https://phabricator.wikimedia.org/T207640) (owner: 10WMDE-Fisch)
[10:54:56] <_joe_>	 akosiaris: oh ok :P
[10:54:58] <icinga-wm>	 RECOVERY - Host ganeti1006 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[10:55:06] <_joe_>	 did you restart networking?
[10:55:10] <akosiaris>	 reboot
[10:55:16] <icinga-wm>	 RECOVERY - Check systemd state on ganeti1006 is OK: OK - running: The system is fully operational
[10:55:23] <_joe_>	 akosiaris: I fixed the network interfaces on that host
[10:55:24] <akosiaris>	 but I'll need to restart networking on ganeti1007 and ganeti1008 
[10:55:37] <_joe_>	 but 1007 and 1008 are still unfixed and have the typo
[10:55:40] <akosiaris>	 oh, that was you ? ok
[10:55:44] <_joe_>	 I was wondering where that originated
[10:55:47] <akosiaris>	 I did PEBKAC on this one
[10:55:50] <_joe_>	 yes, I wrote you as much :)
[10:56:00] <_joe_>	 akosiaris: why is that unpuppetized?
[10:56:05] <akosiaris>	 manual action on my part. 
[10:56:20] <akosiaris>	 cause it was a mess to puppetize it. I have https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/351603/ already
[10:56:22] <wikibugs>	 (03PS4) 10Vgutierrez: certcentral: Run certcentral-certs-sync with user cercentral [puppet] - 10https://gerrit.wikimedia.org/r/473439 (https://phabricator.wikimedia.org/T209161)
[10:56:29] <akosiaris>	 I guess I need to find the time to puppetize it correctly
[10:56:36] <_joe_>	 ack, I can take a look and try to help
[10:56:38] <akosiaris>	 the mess btw is not this, it's the current status quo
[10:56:59] <akosiaris>	 it's a rabbit hole essentially
[10:57:10] <akosiaris>	 maybe I can revisit it with a smaller scope and avoid being dragged down this time around
[10:58:11] <wikibugs>	 (03PS4) 10Zoranzoki21: Fix adding vendor files by default for commiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471533 (https://phabricator.wikimedia.org/T207058)
[10:58:17] <wikibugs>	 (03PS5) 10Zoranzoki21: Fix adding vendor files by default for commiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471533 (https://phabricator.wikimedia.org/T207058)
[10:58:34] <icinga-wm>	 RECOVERY - puppet last run on scb1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:02:16] <icinga-wm>	 RECOVERY - puppet last run on certcentral1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:02:47] <wikibugs>	 (03PS3) 10Pmiazga: Prod: Enable Schema.org page split test at 1% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski)
[11:03:54] <wikibugs>	 (03PS4) 10Pmiazga: Prod: Enable Schema.org page split test at 1% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski)
[11:07:46] <icinga-wm>	 RECOVERY - Check systemd state on ganeti1007 is OK: OK - running: The system is fully operational
[11:08:04] <icinga-wm>	 RECOVERY - Check systemd state on ganeti1008 is OK: OK - running: The system is fully operational
[11:09:06] <banyek>	 !log Deploy schema change on db2046 (T85757)
[11:09:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:09] <stashbot>	 T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[11:09:13] <akosiaris>	 heh, I did not even have to restart networking over there after all. systemctl reset-failed fixed it (after making sure an ifdown analytics; ifup analytics worked fine)
[11:10:06] <wikibugs>	 (03PS5) 10Pmiazga: Prod: Enable Schema.org page split test at 1% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski)
[11:11:48] <moritzm>	 !log draining ganeti1005 for reboot/kernel security update
[11:11:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:59] <wikibugs>	 (03CR) 10MSantos: [C: 031] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos)
[11:18:49] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove Diamond from caches [puppet] - 10https://gerrit.wikimedia.org/r/472632 (https://phabricator.wikimedia.org/T183454)
[11:20:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Remove Diamond from caches [puppet] - 10https://gerrit.wikimedia.org/r/472632 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[11:25:38] <wikibugs>	 10Operations, 10Release Pipeline, 10Release-Engineering-Team: Design pipeline image versioning scheme - https://phabricator.wikimedia.org/T209088 (10akosiaris) I think we should support multiple tags per image (docker anyway does support that and they cost next to nothing on the registry level AFAIK)  * Keep...
[11:26:06] <icinga-wm>	 PROBLEM - Check systemd state on cp2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:26:47] <wikibugs>	 (03PS6) 10Phuedx: Prod: Enable Schema.org page split test at 1% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski)
[11:27:08] <icinga-wm>	 PROBLEM - Check systemd state on cp2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:27:22] <icinga-wm>	 PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond]
[11:28:12] <icinga-wm>	 PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond]
[11:28:15] <wikibugs>	 (03CR) 10Phuedx: [C: 031] "PS6 adds a comment explaining the list of wikis with the sampling ratio set to 0." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski)
[11:29:14] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "I think I was off by a week about when the train pauses? There are deploys this week, so I’ve added this to Thursday’s EU SWAT now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471738 (https://phabricator.wikimedia.org/T207854) (owner: 10Lucas Werkmeister (WMDE))
[11:31:12] <icinga-wm>	 PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:31:20] <_joe_>	 moritzm: it seems your change created some issues?
[11:31:30] <_joe_>	 I see puppet failures on caches
[11:33:18] <icinga-wm>	 PROBLEM - Check systemd state on cp3049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:33:28] <icinga-wm>	 PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond]
[11:35:02] <icinga-wm>	 PROBLEM - Check systemd state on cp2011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:35:05] <moritzm>	 looking, the diamond prerm fails sometimes, that's a bug in the deb we can't really fix
[11:35:24] <addshore>	 zeljkof: just to let you know tomorrow during EU swat I can "run the show", will be watching lucas and tarro.w on their first deploys ! :)
[11:35:30] <wikibugs>	 (03PS7) 10Reedy: Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm)
[11:36:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm)
[11:36:40] <icinga-wm>	 RECOVERY - Check systemd state on cp3049 is OK: OK - running: The system is fully operational
[11:37:02] <wikibugs>	 (03CR) 10Addshore: "We deployed wikidata data access to all wiktionaries a week or so ago." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm)
[11:37:14] <icinga-wm>	 RECOVERY - Check systemd state on cp2015 is OK: OK - running: The system is fully operational
[11:37:36] <icinga-wm>	 RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[11:37:52] <icinga-wm>	 PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond]
[11:38:32] <icinga-wm>	 RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:38:52] <wikibugs>	 (03CR) 10Addshore: [C: 031] "looks sane to me!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski)
[11:39:20] <icinga-wm>	 RECOVERY - Check systemd state on cp2001 is OK: OK - running: The system is fully operational
[11:40:47] <zeljkof>	 addshore: cool!
[11:41:04] <icinga-wm>	 RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 276 bytes in 0.379 second response time
[11:41:50] <wikibugs>	 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10MoritzMuehlenhoff) Icinga is flagging broken memory on 1053, simply leaving a note here as that host is up for decom anyway.
[11:43:03] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 031] "Note this is only for Wikipedias, not for other wikis in these languages. I believe this is exactly what should happen, just want to make " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473440 (https://phabricator.wikimedia.org/T207640) (owner: 10WMDE-Fisch)
[11:43:26] <icinga-wm>	 RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:43:54] <icinga-wm>	 RECOVERY - Check systemd state on cp2011 is OK: OK - running: The system is fully operational
[11:44:36] <icinga-wm>	 PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:44:36] <Reedy>	 addshore: gj, you broke addWiki a year ago :D
[11:44:59] <addshore>	 Reedy: i did?
[11:45:06] <addshore>	 i do remember touching it
[11:45:07] <Reedy>	 Yup
[11:45:08] <Reedy>	 See https://phabricator.wikimedia.org/T209474
[11:45:14] <Reedy>	 It's almost a year to the day
[11:45:15] <Reedy>	 Haha
[11:45:32] <icinga-wm>	 RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[11:45:39] <addshore>	 wait, using it with --wiki=aawiktionary no longer works?
[11:45:47] <Reedy>	 That's what I pasted :)
[11:45:48] <addshore>	 that did work before, there was another ticket about it too
[11:47:05] <addshore>	 I don't even see where that error message comes from?
[11:47:18] <Reedy>	 https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/commit/955db59c950e286284f25a119a4930b0e07079e1
[11:47:23] <addshore>	 i need to pull >.>
[11:47:40] <Reedy>	 https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/addWiki.php#L72-L77
[11:47:58] <icinga-wm>	 RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:48:52] <icinga-wm>	 PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:48:56] <addshore>	 Reedy: can actually see which bit of that condition is broken with just my eyes...?
[11:49:30] <addshore>	 $this->getOption( 'wiki' ) doesn't get the wiki?
[11:49:35] <Reedy>	 I didn't pay enough attention
[11:49:37] <Reedy>	 It seems so
[11:50:59] <addshore>	 let me have a poke
[11:52:27] <Reedy>	 addshore: You can probably just abuse $wgDBname
[11:52:50] <addshore>	 true
[11:53:00] <Reedy>	 else...
[11:53:00] <Reedy>	 		if ( isset( $this->mOptions['wiki'] ) ) {
[11:53:00] <Reedy>	 			$bits = explode( '-', $this->mOptions['wiki'] );
[11:53:00] <Reedy>	 			if ( count( $bits ) == 1 ) {
[11:53:00] <Reedy>	 				$bits[] = '';
[11:53:01] <Reedy>	 			}
[11:53:03] <Reedy>	 			define( 'MW_DB', $bits[0] );
[11:53:07] <Reedy>	 			define( 'MW_PREFIX', $bits[1] );
[11:53:25] <addshore>	 $wgDBname sounds nicer
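A minimal Python sketch of what the PHP fragment pasted above appears to do: split the `--wiki` value on `-` into a database name and an optional table prefix, padding with an empty prefix when no `-` is present. The function name `split_wiki_option` is hypothetical, and the sketch mirrors only the pasted lines, not the rest of the maintenance code.

```python
from typing import Tuple


def split_wiki_option(wiki: str) -> Tuple[str, str]:
    """Split a '--wiki' value into (db_name, table_prefix).

    Hypothetical helper mirroring the pasted fragment: the value is split
    on '-', and a missing prefix is represented by an empty string.
    """
    bits = wiki.split("-")
    if len(bits) == 1:
        # No '-' in the value, so there is no prefix part.
        bits.append("")
    # Only the first two parts are used, as in the pasted fragment.
    return bits[0], bits[1]


if __name__ == "__main__":
    print(split_wiki_option("aawiktionary"))
    print(split_wiki_option("wikidb-mw_"))
```

With a plain name such as `aawiktionary`, the prefix comes back empty, which is the common case discussed in the channel.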
[11:53:40] <icinga-wm>	 PROBLEM - Check systemd state on cp2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:53:53] <addshore>	 Reedy: patch is up
[11:53:54] <icinga-wm>	 PROBLEM - puppet last run on cp2007 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond]
[11:53:58] <Reedy>	 ta
[11:57:03] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10elukey) @Cmjohnson @RobH   I tried this morning to configure the an-workers with https://gerrit.wikimedia.org/r/#/c/473359/ but then...
[11:57:33] <wikibugs>	 (03CR) 10Banyek: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473444 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui)
[11:59:38] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 848.31 seconds
[12:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T1200).
[12:00:05] <jouncebot>	 raynor and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:11] <Amir1>	 o/
[12:00:41] <zeljkof>	 raynor and Amir1: you're both deployers, right? so go ahead, self organize and deploy your patches :)
[12:00:41] <raynor>	 o/
[12:00:47] <zeljkof>	 I'm around if you need me
[12:00:58] <Amir1>	 sure, raynor, you go first
[12:01:02] <raynor>	 ok thx
[12:01:39] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.33.0-wmf.3/extensions/WikimediaMaintenance/addWiki.php: Unbreak adding TMH tables (duration: 00m 55s)
[12:01:39] <wikibugs>	 (03CR) 10Pmiazga: [C: 032] Prod: Enable Schema.org page split test at 1% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski)
[12:01:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:48] <icinga-wm>	 RECOVERY - Check systemd state on cp2007 is OK: OK - running: The system is fully operational
[12:02:52] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/WikimediaMaintenance/addWiki.php: Unbreak adding TMH tables (duration: 00m 53s)
[12:02:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:14] <wikibugs>	 (03Merged) 10jenkins-bot: Prod: Enable Schema.org page split test at 1% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski)
[12:03:16] <raynor>	 merging, Amir1 - I'll let you know once I'm done, I think I'll need ~10m
[12:03:52] <icinga-wm>	 RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[12:04:15] <Amir1>	 Sure
[12:04:44] <wikibugs>	 (03CR) 10jenkins-bot: Prod: Enable Schema.org page split test at 1% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski)
[12:05:36] <phuedx>	 raynor: i have the list of enwiki articles ready
[12:06:01] <raynor>	 SEO 1% is on mwdebug10023
[12:06:04] <raynor>	 1002*
[12:07:20] <wikibugs>	 (03PS8) 10Urbanecm: Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546)
[12:07:24] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki::web::vhost: remove the set_handler feature flag [puppet] - 10https://gerrit.wikimedia.org/r/473411
[12:07:38] <wikibugs>	 (03CR) 10Urbanecm: "> Patch Set 7:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm)
[12:08:06] <Reedy>	 Urbanecm: fyi, addwiki is broken yet again :D
[12:08:22] <icinga-wm>	 PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[12:08:34] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[12:08:42] <icinga-wm>	 PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:08:55] <Urbanecm>	 Reedy, thank you
[12:08:59] <Urbanecm>	 I amended config for yuewiktionary, so tests are hopefully not broken and it is in wikidataclient.dblist
[12:09:10] <_joe_>	 arturo: are you looking at labstore1004?
[12:10:16] <arturo>	 _joe_: no, we are looking at labnet1001, but could be realted
[12:10:20] <arturo>	 related*
[12:11:04] <raynor>	 phuedx: works for me on debug
[12:11:07] <raynor>	 you?
[12:11:13] <phuedx>	 raynor: checking
[12:11:32] <raynor>	 I also checked the output against the https://search.google.com/structured-data/testing-tool -> it's ok
[12:11:38] <icinga-wm>	 RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[12:11:53] <wikibugs>	 10Operations, 10Release Pipeline, 10Release-Engineering-Team: Design pipeline image versioning scheme - https://phabricator.wikimedia.org/T209088 (10fselles) +1  latest should be avoided for production, in my experience is also problematic for development (since you don't know which version are you running c...
[12:12:45] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mediawiki::web::vhost: remove the set_handler feature flag [puppet] - 10https://gerrit.wikimedia.org/r/473411
[12:14:58] <raynor>	 there is one problem - the mainEntityUrl still points to http
[12:15:07] <phuedx>	 raynor: i've tested a number of articles from the list (https://quarry.wmflabs.org/query/31164) and i see the json+ld block
[12:15:25] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10GTirloni)
[12:15:26] <raynor>	 sameAs, and mainEntity
[12:15:33] <raynor>	 addshore: ^
[12:15:46] * addshore reads up
[12:16:07] <arturo>	 marostegui: you around? we may have connection overloading on m5-master
[12:16:09] <raynor>	 context: https://phabricator.wikimedia.org/T209352
[12:16:23] <addshore>	 raynor: I thought that was already discussed on a gerrit patch? I believe i spotted it this morning or last night?
[12:16:30] <addshore>	 comments by Lucas_WMDE 
[12:17:07] <phuedx>	 addshore: you're correct
[12:17:10] <phuedx>	 raynor: https://phabricator.wikimedia.org/T153563
[12:17:13] <raynor>	 ah, right, yes, there is -2 there
[12:17:22] <phuedx>	 raynor: https://gerrit.wikimedia.org/r/#/c/473292/ even
[12:17:26] <raynor>	 sorry, my bad, looks good, let's roll
[12:17:41] <addshore>	 woo!
[12:17:43] <phuedx>	 go go go
[12:17:51] <raynor>	 yeah, I'm deploying and I'm overcautious :) 
[12:18:02] <phuedx>	 raynor: i'll resolve https://phabricator.wikimedia.org/T209352 as invalid with a comment
[12:18:31] <raynor>	 phuedx, w8
[12:18:53] <raynor>	 that task has two issues, first is the https, and the second one is that the betacluster seo points to prod wikibase
[12:19:11] <logmsgbot>	 !log pmiazga@deploy1001 Synchronized wmf-config: SWAT: [[gerrit:473079|Enable Schema.org page split test at 1% sampling (T208755)]] (duration: 00m 54s)
[12:19:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:13] <stashbot>	 T208755: Launch A/B test for sameAs property - https://phabricator.wikimedia.org/T208755
[12:19:23] <raynor>	 Amir1, I'm done, over to you
[12:19:31] <Amir1>	 Thank you!
[12:20:31] <raynor>	 zeljkof, thanks for your presence, I feel much more confident with SWATs when you're around
[12:21:01] <zeljkof>	 raynor: I'm glad I could help by not doing anything ;)
[12:21:06] <wikibugs>	 (03PS2) 10Ladsgroup: Start reading from change_tag_def on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473433 (https://phabricator.wikimedia.org/T208846)
[12:21:15] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: jobqueue_redis: Purge role jobqueue_redis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/473029 (https://phabricator.wikimedia.org/T198220) (owner: 10Effie Mouzeli)
[12:22:17] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/13485/mwdebug2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/473411 (owner: 10Giuseppe Lavagetto)
[12:22:41] <phuedx>	 oic
[12:22:51] <phuedx>	 raynor: thanks. i won't resolve it
[12:22:57] <wikibugs>	 (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473433 (https://phabricator.wikimedia.org/T208846) (owner: 10Ladsgroup)
[12:24:19] <wikibugs>	 (03Merged) 10jenkins-bot: Start reading from change_tag_def on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473433 (https://phabricator.wikimedia.org/T208846) (owner: 10Ladsgroup)
[12:27:24] <icinga-wm>	 RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational
[12:30:33] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::vhost: remove the set_handler feature flag [puppet] - 10https://gerrit.wikimedia.org/r/473411 (owner: 10Giuseppe Lavagetto)
[12:30:43] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:473433|Start reading from change_tag_def on wikidatawiki (T208846)]] (duration: 00m 55s)
[12:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:47] <stashbot>	 T208846: Start reading from change_tag_def on wikidatawiki - https://phabricator.wikimedia.org/T208846
[12:31:28] <wikibugs>	 (03PS4) 10Ladsgroup: Revert the language of votewiki to English (en) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473148 (https://phabricator.wikimedia.org/T207560) (owner: 10Huji)
[12:32:40] <wikibugs>	 (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473148 (https://phabricator.wikimedia.org/T207560) (owner: 10Huji)
[12:32:52] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active
[12:33:53] <wikibugs>	 (03Merged) 10jenkins-bot: Revert the language of votewiki to English (en) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473148 (https://phabricator.wikimedia.org/T207560) (owner: 10Huji)
[12:35:11] <wikibugs>	 (03CR) 10jenkins-bot: Start reading from change_tag_def on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473433 (https://phabricator.wikimedia.org/T208846) (owner: 10Ladsgroup)
[12:35:16] <wikibugs>	 (03CR) 10jenkins-bot: Revert the language of votewiki to English (en) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473148 (https://phabricator.wikimedia.org/T207560) (owner: 10Huji)
[12:35:56] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:473148|Revert the language of votewiki to English (en) (T207560)]] (duration: 00m 55s)
[12:35:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:59] <stashbot>	 T207560: Carry out the 2018 fawiki elections on votewiki - https://phabricator.wikimedia.org/T207560
[12:36:34] <Amir1>	 !log EU SWAT is done
[12:36:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:27] <moritzm>	 !log installing python security updates on trusty
[12:39:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:48] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/WikimediaMaintenance/addWiki.php: Unbreak adding wiktionary (duration: 00m 53s)
[12:39:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:50] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.33.0-wmf.3/extensions/WikimediaMaintenance/addWiki.php: Unbreak adding wiktionary (duration: 00m 52s)
[12:40:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:23] <moritzm>	 !log installing python3.4 security updates on trusty (Debian already fixed)
[12:48:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:08] <Reedy>	 addshore: Params are the wrong way round :P
[12:49:23] <addshore>	 Reedy: you're the wrong way around
[12:49:26] <addshore>	 what do you mean? :P
[12:50:43] <Reedy>	 addshore: You're looking for $wgDBname in 'wiktionary'
[12:50:46] <Reedy>	 Not the other way round
[12:50:53] <addshore>	 bwhahahahaa
[12:51:22] <Reedy>	 So the earlier change probably wasn't needed
[12:51:22] <Krenair>	 classic
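A short Python sketch of the "params the wrong way round" mistake described above, assuming the check was a plain substring test. The actual addWiki.php code is not shown in the log, so both function names here are hypothetical. In PHP, `strpos` takes the haystack first and the needle second, so swapping them asks whether the whole database name occurs inside the literal `wiktionary`.

```python
def is_wiktionary_swapped(db_name: str) -> bool:
    """Buggy variant: looks for the database name inside 'wiktionary'."""
    # Operands are reversed, so a name like 'yuewiktionary' never matches.
    return db_name in "wiktionary"


def is_wiktionary(db_name: str) -> bool:
    """Intended variant: looks for 'wiktionary' inside the database name."""
    return "wiktionary" in db_name


if __name__ == "__main__":
    for name in ("yuewiktionary", "enwiki"):
        print(name, is_wiktionary_swapped(name), is_wiktionary(name))
```

The swapped version fails for every realistic database name, which fits the remark that the earlier change probably was not needed.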
[12:56:29] <moritzm>	 !log installing gettext "security" updates for trusty
[12:56:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:05] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T1300)
[13:00:49] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.33.0-wmf.3/extensions/WikimediaMaintenance/addWiki.php: Fixing addshores code... (duration: 00m 55s)
[13:00:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:47] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/WikimediaMaintenance/addWiki.php: Fixing addshores code... (duration: 00m 53s)
[13:01:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:24] <icinga-wm>	 PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:07:38] <moritzm>	 !log installing ghostscript security updates on stretch
[13:07:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:51] <Krenair>	 Reedy, it's still a little broken
[13:14:05] <Krenair>	 the newproject announcement emails it's generating say 'for a  in '
[13:14:20] <Reedy>	 That's probably because I just commented out half of the script to make it run the second half
[13:14:25] <Krenair>	 between the 'a' and 'in' is supposed to be the project name :/
[13:14:29] <Krenair>	 hah
[13:14:42] <revi>	 oh
[13:14:44] <revi>	 new wiki time
[13:14:51] <Krenair>	 classic addWiki fix
[13:15:01] <revi>	 \o/ ping me when yue.wikt loads to the real wiki
[13:15:11] <Reedy>	 Though I didn't think I commented out name/lang
[13:15:12] <Reedy>	 but meh
[13:15:49] <Krenair>	 revi, the emails used to get some arbitrary delay so they'd be working by the time the announcement went out
[13:15:53] <wikibugs>	 (03PS9) 10Reedy: Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm)
[13:15:58] <wikibugs>	 (03CR) 10Reedy: [C: 032] Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm)
[13:16:01] <Krenair>	 but that code broke and we took it out
[13:16:22] <revi>	 newproject email was broken when the last batch of new wikis was created
[13:16:23] <revi>	 IIRC
[13:16:29] <Reedy>	 revi: shit happens :)
[13:16:37] <revi>	 yeah
[13:16:38] <revi>	 lol
[13:17:03] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm)
[13:17:19] <wikibugs>	 (03PS8) 10Reedy: Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm)
[13:17:24] <wikibugs>	 (03CR) 10Reedy: [C: 032] Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm)
[13:18:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm)
[13:18:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm)
[13:18:56] <wikibugs>	 (03PS21) 10Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918)
[13:19:09] <wikibugs>	 (03CR) 10jenkins-bot: Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm)
[13:20:20] <wikibugs>	 (03PS9) 10Reedy: Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm)
[13:20:25] <wikibugs>	 (03CR) 10Reedy: [C: 032] Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm)
[13:21:04] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 928.43 seconds
[13:21:41] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm)
[13:22:05] <wikibugs>	 (03PS3) 10Reedy: Initial configuration for punjabiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468815 (https://phabricator.wikimedia.org/T204477) (owner: 10Urbanecm)
[13:22:09] <wikibugs>	 (03CR) 10Reedy: [C: 032] Initial configuration for punjabiwikimedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468815 (https://phabricator.wikimedia.org/T204477) (owner: 10Urbanecm)
[13:23:28] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for punjabiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468815 (https://phabricator.wikimedia.org/T204477) (owner: 10Urbanecm)
[13:25:50] <Reedy>	 Urbanecm: About? Can you rebase the shnwiki patch?
[13:26:33] <wikibugs>	 (03PS1) 10Reedy: Stop breaking blame for wikimedia special cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473487
[13:32:02] <wikibugs>	 (03PS16) 10Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919)
[13:32:36] <wikibugs>	 (03PS4) 10Reedy: Initial configuration for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467323 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm)
[13:32:42] <wikibugs>	 (03PS5) 10Reedy: Initial configuration for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467323 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm)
[13:33:12] <wikibugs>	 (03PS6) 10Reedy: Initial configuration for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467323 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm)
[13:33:16] <wikibugs>	 (03CR) 10Reedy: [C: 032] Initial configuration for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467323 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm)
[13:34:03] <wikibugs>	 (03CR) 10jenkins-bot: Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm)
[13:34:05] <wikibugs>	 (03CR) 10jenkins-bot: Initial configuration for punjabiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468815 (https://phabricator.wikimedia.org/T204477) (owner: 10Urbanecm)
[13:34:37] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467323 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm)
[13:34:51] <wikibugs>	 (03CR) 10jenkins-bot: Initial configuration for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467323 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm)
[13:35:18] <wikibugs>	 (03PS1) 10Reedy: Add new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473488
[13:36:24] <icinga-wm>	 PROBLEM - DPKG on scandium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[13:37:40] <wikibugs>	 (03PS2) 10Reedy: Add new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473488 (https://phabricator.wikimedia.org/T206777)
[13:37:42] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove Diamond on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/473489 (https://phabricator.wikimedia.org/T183454)
[13:37:51] <wikibugs>	 (03CR) 10Reedy: [C: 032] Add new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473488 (https://phabricator.wikimedia.org/T206777) (owner: 10Reedy)
[13:39:34] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10GTirloni)
[13:42:41] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove Diamond from Kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/473490 (https://phabricator.wikimedia.org/T183454)
[13:42:48] <wikibugs>	 (03CR) 10Effie Mouzeli: jobqueue_redis: Purge role jobqueue_redis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/473029 (https://phabricator.wikimedia.org/T198220) (owner: 10Effie Mouzeli)
[13:45:05] <logmsgbot>	 !log reedy@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided)
[13:45:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:35] <gehel>	 !log plugin and JVM upgrade completed on elasticsearch / cirrus / codfw - T209293
[13:45:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:38] <stashbot>	 T209293: Prepare a deb package with the experimental highlighter 5.5.2.4 - https://phabricator.wikimedia.org/T209293
[13:47:28] <wikibugs>	 (03PS8) 10Gehel: Increase tilerator num_workers maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos)
[13:48:10] <logmsgbot>	 !log reedy@deploy1001 Synchronized langlist: shn (duration: 00m 52s)
[13:48:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:34] <logmsgbot>	 !log reedy@deploy1001 Synchronized dblists/: new wikis! (duration: 00m 53s)
[13:49:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:37] <wikibugs>	 (03CR) 10jenkins-bot: Add new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473488 (https://phabricator.wikimedia.org/T206777) (owner: 10Reedy)
[13:50:00] <wikibugs>	 (03CR) 10Gehel: [C: 032] Increase tilerator num_workers maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/473260 (owner: 10MSantos)
[13:50:47] <logmsgbot>	 !log reedy@deploy1001 Synchronized static/images/project-logos/: (no justification provided) (duration: 00m 53s)
[13:50:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:03] <gehel>	 !log restarting tilerator on maps1004 for config change
[13:51:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:50] <logmsgbot>	 !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: new wikis (duration: 00m 53s)
[13:51:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:55] <marostegui>	 jouncebot: next
[13:51:55] <jouncebot>	 In 0 hour(s) and 8 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T1400)
[13:52:00] <icinga-wm>	 RECOVERY - DPKG on scandium is OK: All packages OK
[13:52:04] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw.php: Add pc2010 as spare [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473444 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui)
[13:53:28] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw.php: Add pc2010 as spare [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473444 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui)
[13:54:14] <icinga-wm>	 PROBLEM - Host labservices1002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:54:23] <wikibugs>	 (03PS2) 10Ema: fifo-log-demux 0.1 [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/473432 (https://phabricator.wikimedia.org/T204225)
[13:54:36] <logmsgbot>	 !log reedy@deploy1001 Synchronized multiversion/MWMultiVersion.php: (no justification provided) (duration: 00m 53s)
[13:54:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:00] <icinga-wm>	 RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.666 second response time
[13:55:14] <icinga-wm>	 RECOVERY - Host labservices1002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[13:55:51] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Add pc2010 as spare - T208383 (duration: 00m 53s)
[13:55:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:54] <stashbot>	 T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383
[13:56:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] Remove Diamond from Kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/473490 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[13:56:19] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] Remove Diamond on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/473489 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[13:56:20] <revi>	 is it just me, or does shnwiki load internal errors?
[13:56:33] <Reedy>	 Nope
[13:56:38] <Reedy>	 TZ issue
[13:56:50] <marostegui>	 fails for me too
[13:56:50] <icinga-wm>	 PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:56:51] <revi>	 ok!
[13:56:59] <wikibugs>	 (03PS1) 10Reedy: Fix shnwiki TZ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473494
[13:57:25] <revi>	 I could enter editing area but cannot save
[13:57:42] <logmsgbot>	 !log reedy@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided)
[13:57:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:50] <icinga-wm>	 RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 0.686 second response time
[13:58:08] <revi>	 lol even preview fails
[13:58:13] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10GTirloni)
[13:58:15] <wikibugs>	 (03PS2) 10Reedy: Remove shnwiki TZ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473494 (https://phabricator.wikimedia.org/T206777)
[13:58:22] <wikibugs>	 (03CR) 10Reedy: [C: 032] Remove shnwiki TZ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473494 (https://phabricator.wikimedia.org/T206777) (owner: 10Reedy)
[13:58:32] <icinga-wm>	 PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:59:04] <wikibugs>	 10Operations, 10User-Elukey: Apply interface::rps to all the mc hosts - https://phabricator.wikimedia.org/T209489 (10elukey) p:05Triage>03Normal
[13:59:46] <marostegui>	 Amir1: We have a huge increase of queries in enwiki slaves (I haven't checked other sections). I am checking the timeline, can that be related to your change?
[13:59:50] <marostegui>	 banyek: can you check other sections?
[14:00:04] <jouncebot>	 hashar: Your horoscope predicts another unfortunate MediaWiki train - European version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T1400).
[14:00:06] <Amir1>	 marostegui: mine should only affect wikidata
[14:00:26] <banyek>	 marostegui: yes
[14:00:44] <marostegui>	 Reedy: Anything from your changes that could potentially affect enwiki?
[14:01:01] <Reedy>	 Not AFAIK
[14:01:56] <marostegui>	 https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104&from=now-24h&to=now this is scary
[14:02:11] <wikibugs>	 (03PS1) 10Reedy: Remove punjabi from MWMultiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473498 (https://phabricator.wikimedia.org/T204477)
[14:02:13] <logmsgbot>	 !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Fix shnwiki TZ (duration: 00m 54s)
[14:02:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:26] <marostegui>	 addshore: you around?
[14:02:31] <revi>	 shnwiki works fine now
[14:02:37] <addshore>	 \o
[14:02:48] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10MoritzMuehlenhoff) @Krinkle You've edited https://grafana.wikimedia.org/dashboard/db/cluster-board-graphite back in October the last time, that d...
[14:02:48] <addshore>	 marostegui: what's happening?
[14:03:04] <marostegui>	 https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/473079/ was deployed around the time we had the huge increase on reads on enwiki
[14:03:07] <wikibugs>	 (03PS2) 10Reedy: Remove punjabi from MWMultiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473498 (https://phabricator.wikimedia.org/T204477)
[14:03:18] <marostegui>	 addshore: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104&from=now-24h&to=now
[14:03:33] <addshore>	 marostegui: QPS increased?
[14:03:36] <marostegui>	 yep
[14:03:38] <addshore>	 could be the SEO thing
[14:03:41] <marostegui>	 like crazy
[14:03:41] <addshore>	 *looks at the time*
[14:03:52] <marostegui>	 banyek: does it happen on other sections too?
[14:03:59] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw.php: Add pc2010 as spare [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473444 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui)
[14:04:01] <wikibugs>	 (03CR) 10jenkins-bot: Remove shnwiki TZ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473494 (https://phabricator.wikimedia.org/T206777) (owner: 10Reedy)
[14:04:02] <banyek>	 there was a peak yesterday at s3: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&from=now-24h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1123&var-port=9104
[14:04:11] <wikibugs>	 (03PS3) 10Ema: fifo-log-demux 0.1 [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/473432 (https://phabricator.wikimedia.org/T204225)
[14:04:11] <banyek>	 marostegui: I am checking
[14:04:26] <addshore>	 marostegui: just selects?
[14:04:31] <banyek>	 not yet, just one host at a time; I am on s3
[14:04:37] <revi>	 and all the new wikis don't have standard new wiki main pages, meh
[14:04:40] <marostegui>	 banyek: that doesn't look related (the one from yesterday)
[14:04:56] <marostegui>	 addshore: checking
[14:05:06] <wikibugs>	 (03PS1) 10Reedy: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473499
[14:05:06] <addshore>	 12:19 pmiazga@deploy1001: Synchronized wmf-config: SWAT: [[gerrit:473079]|Enable Schema.org page split test at 1% sampling (T208755)]] (duration: 00m 54s)
[14:05:07] <stashbot>	 T208755: Launch A/B test for sameAs property - https://phabricator.wikimedia.org/T208755
[14:05:08] <wikibugs>	 (03CR) 10Reedy: [C: 032] Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473499 (owner: 10Reedy)
[14:05:18] <wikibugs>	 (03CR) 10Ema: fifo-log-demux 0.1 (033 comments) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/473432 (https://phabricator.wikimedia.org/T204225) (owner: 10Ema)
[14:05:18] <addshore>	 perhaps
[14:05:34] <marostegui>	 addshore: can we revert?
[14:05:53] <addshore>	 phuedx: raynor ^^
[14:06:02] <addshore>	 wait, is that even on enwiki?
[14:06:05] <addshore>	 *looks*
[14:06:11] <banyek>	 s2,s3,s4,s5 good so far
[14:06:26] <addshore>	 'default' => 0.01,, so yes enwiki
[14:06:31] <wikibugs>	 (03Merged) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473499 (owner: 10Reedy)
[14:06:45] <marostegui>	 addshore: let's revert that just in case, so we can either confirm or discard
[14:06:51] <raynor>	 I
[14:06:55] <raynor>	 I'm here, let me read
[14:07:02] <banyek>	 s6 good
[14:07:28] <logmsgbot>	 !log reedy@deploy1001 Synchronized wmf-config/interwiki.php: Updating interwiki cache (duration: 02m 25s)
[14:07:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:31] <gehel>	 !log starting plugin and JVM upgrade on elasticsearch / cirrus / eqiad - T209293
[14:07:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:34] <stashbot>	 T209293: Prepare a deb package with the experimental highlighter 5.5.2.4 - https://phabricator.wikimedia.org/T209293
[14:07:45] <banyek>	 s7 has the same pattern too
[14:07:56] <addshore>	 yup, i think it must be that patch
[14:08:49] <addshore>	 raynor: okay to revert?
[14:09:10] <wikibugs>	 (03Abandoned) 10Reedy: Remove punjabi from MWMultiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473498 (https://phabricator.wikimedia.org/T204477) (owner: 10Reedy)
[14:10:04] <raynor>	 yeah, I think you can revert, we enabled A/B test only to 1% (it means 0.5% of page requests should be affected)
[14:10:13] <Reedy>	 !log Wiki created T205714 T207584 T205713 T206916
[14:10:18] * addshore will revert
[14:10:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:20] <stashbot>	 T205714: Prepare and check storage layer for yuewiktionary - https://phabricator.wikimedia.org/T205714
[14:10:21] <stashbot>	 T206916: Prepare and check storage layer for shnwiki - https://phabricator.wikimedia.org/T206916
[14:10:21] <stashbot>	 T207584: Prepare and check storage layer for punjabiwikimedia - https://phabricator.wikimedia.org/T207584
[14:10:21] <stashbot>	 T205713: Prepare and check storage layer for liwikinews - https://phabricator.wikimedia.org/T205713
[14:10:26] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: aggregate rsyslog_queue_full rate [puppet] - 10https://gerrit.wikimedia.org/r/473501 (https://phabricator.wikimedia.org/T206633)
[14:10:32] <wikibugs>	 (03PS1) 10Addshore: Revert "Prod: Enable Schema.org page split test at 1% sampling" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473502
[14:10:39] <wikibugs>	 (03PS2) 10Addshore: Revert "Prod: Enable Schema.org page split test at 1% sampling" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473502
[14:10:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: aggregate rsyslog_queue_full rate [puppet] - 10https://gerrit.wikimedia.org/r/473501 (https://phabricator.wikimedia.org/T206633) (owner: 10Filippo Giunchedi)
[14:11:21] <wikibugs>	 (03CR) 10Addshore: [C: 032] Revert "Prod: Enable Schema.org page split test at 1% sampling" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473502 (owner: 10Addshore)
[14:11:43] <addshore>	 Reedy: are you done with your syncing ? :P
[14:11:47] <Reedy>	 Yup
[14:11:54] <addshore>	 coolio
[14:11:55] <raynor>	 addshore - we do some queries (we need to load the page_random) to verify if the page is in sampling session
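A minimal sketch of the sampling check raynor describes, under stated assumptions: `page_random` is taken to be a float in [0, 1), and the sampled pages are assumed to be split evenly between a control and a treatment bucket, which would match the "1% sampling, 0.5% affected" figure given above. `assign_bucket` and the bucket names are hypothetical and are not taken from the extension's code.

```python
from typing import Optional


def assign_bucket(page_random: float, sampling_ratio: float) -> Optional[str]:
    """Return the A/B bucket for a page, or None if the page is not sampled.

    Hypothetical helper: assumes page_random is uniform in [0, 1).
    """
    if not 0.0 <= page_random < 1.0:
        raise ValueError("page_random must be in [0, 1)")
    if page_random >= sampling_ratio:
        # Outside the sampled slice: the page is untouched by the test.
        return None
    # Assumed even split: lower half of the slice is control, upper half treatment.
    return "control" if page_random < sampling_ratio / 2 else "treatment"
```

With `sampling_ratio=0.01`, only pages whose `page_random` falls in the upper half of the sampled slice would see the treatment, but every page view still needs the `page_random` value to make the decision.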
[14:12:16] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: aggregate rsyslog_queue_full rate [puppet] - 10https://gerrit.wikimedia.org/r/473501 (https://phabricator.wikimedia.org/T206633)
[14:12:40] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Prod: Enable Schema.org page split test at 1% sampling" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473502 (owner: 10Addshore)
[14:13:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] prometheus: aggregate rsyslog_queue_full rate [puppet] - 10https://gerrit.wikimedia.org/r/473501 (https://phabricator.wikimedia.org/T206633) (owner: 10Filippo Giunchedi)
[14:13:55] <addshore>	 syncing
[14:14:35] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config: Revert Prod: Enable Schema.org page split test at 1% sampling (duration: 00m 54s)
[14:14:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:44] <banyek>	 seems recovering https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104&from=now-3h&to=now
[14:15:17] <marostegui>	 we'll see now that the patch has been deployed
[14:15:52] <addshore>	 marostegui: is it possible to see what the queries were?
[14:16:09] <raynor>	 banyek, marostegui - if this is our patch, what are the next steps? We definitely have to fix our patch and use some caching so we don't do that many selects
[14:17:32] <raynor>	 https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/8b339fcb93c3ab790620c6823106740880ae2f53/client/includes/Store/Sql/PageRandomLookup.php#L42 
[14:18:02] <raynor>	 this is the query we were doing - `select page_random from page where page_id = ${}`
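A rough sketch of the caching idea raynor raises, so that repeated views of the same page do not each issue the select. Everything here is an assumption for illustration: the class name `CachedPageRandomLookup`, the in-process dict cache, and the in-memory SQLite table standing in for the `page` table. A production fix would more likely use a shared cache with expiry.

```python
import sqlite3
from typing import Dict, Optional


class CachedPageRandomLookup:
    """Hypothetical wrapper that remembers page_random per page_id."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._cache: Dict[int, Optional[float]] = {}
        self.queries = 0  # number of selects actually sent to the database

    def get(self, page_id: int) -> Optional[float]:
        if page_id in self._cache:
            return self._cache[page_id]  # cache hit: no query issued
        self.queries += 1
        row = self._conn.execute(
            "SELECT page_random FROM page WHERE page_id = ?", (page_id,)
        ).fetchone()
        value = row[0] if row else None
        self._cache[page_id] = value  # missing pages are cached as None too
        return value


# Stand-in database with a single page row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_random REAL)")
conn.execute("INSERT INTO page VALUES (1, 0.25)")
lookup = CachedPageRandomLookup(conn)
```

Calling `lookup.get(1)` twice sends one select; the second call is served from the dict.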
[14:18:44] <addshore>	 the QPS looks like it is kind of recovering, still need a few more mins to see though, I think
[14:18:51] <wikibugs>	 (03PS1) 10Volans: Add Icinga module [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884)
[14:19:00] <wikibugs>	 (03CR) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473499 (owner: 10Reedy)
[14:19:02] <wikibugs>	 (03CR) 10jenkins-bot: Revert "Prod: Enable Schema.org page split test at 1% sampling" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473502 (owner: 10Addshore)
[14:19:28] <wikibugs>	 (03CR) 10Volans: elasticsearch_cluster: multi-cluster/multi-instance support (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe)
[14:20:21] <addshore>	 marostegui: also, curious, how did you spot that? did alarms go off, or did you just happen to spot it?
[14:20:43] <wikibugs>	 (03CR) 10Gehel: remote: refactor Remote.query() API (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans)
[14:21:03] <wikibugs>	 (03PS2) 10Volans: Add Icinga module [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884)
[14:21:14] <marostegui>	 addshore: As part of my workflow I monitor tendril _pretty_ often to check the query values and to check if they are under normal values
[14:21:49] <marostegui>	 addshore: I always have tendril and icinga open on one of my monitors
[14:22:16] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.33 seconds
[14:22:28] <wikibugs>	 10Operations, 10Graphite: grafana access control - https://phabricator.wikimedia.org/T108546 (10fgiunchedi) AFAICT graphite's web interface has been behind ldap auth since 2011 (`e88fdf13` in puppet.git) but `/render` has always been open. Also nowadays you can't really edit/explore metrics in grafana unless y...
[14:23:18] <marostegui>	 We are not yet recovered
[14:23:52] <addshore>	 marostegui: might not be that patch then
[14:23:55] * addshore looks in SAL
[14:24:17] <addshore>	 it looks like it starts between 12:30 and 12:40 on db1089
[14:24:32] <hashar>	 I am holding the train https://phabricator.wikimedia.org/T209429  it smells bad :(
[14:24:35] <addshore>	 12:36 Amir1: EU SWAT is done
[14:24:35] <addshore>	 12:35 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert the language of votewiki to English (en) (T207560) (duration: 00m 55s)
[14:24:35] <addshore>	 12:30 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Start reading from change_tag_def on wikidatawiki (T208846) (duration: 00m 55s)
[14:24:36] <hashar>	 will post to wikitech-l
[14:24:36] <stashbot>	 T207560: Carry out the 2018 fawiki elections on votewiki - https://phabricator.wikimedia.org/T207560
[14:24:36] <stashbot>	 T208846: Start reading from change_tag_def on wikidatawiki - https://phabricator.wikimedia.org/T208846
[14:24:42] <marostegui>	 hashar: yes please
[14:24:55] <addshore>	 Amir1: ^^ you're in that time slot
[14:25:02] <marostegui>	 addshore: I asked Amir1 and he said it should only affect wikidata
[14:25:32] <Amir1>	 the first patch can't possibly cause this, the second one only affects wikidata
[14:25:37] <addshore>	 so, 12:19 is what we just reverted, and before that is 12:02 of reedy syncing something totally unrelated
[14:26:28] <marostegui>	 this is also affecting API hosts like db1080
[14:26:37] <addshore>	 okay, hmmm
[14:26:54] <marostegui>	 and NOT affecting recentchanges slaves
[14:27:11] <wikibugs>	 (03PS1) 10Reedy: Add shnwiki to InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473510 (https://phabricator.wikimedia.org/T206777)
[14:27:54] <raynor>	 marostegui, addshore - to make it clear, our patch was enabling the code that was doing one more SQL query (to fetch page_random value for given page_id).
[14:28:16] <addshore>	 raynor: and the revert should have stopped that right? (unless i missed something?)
[14:28:26] <raynor>	 yes, it should stop that
[14:29:10] <wikibugs>	 (03CR) 10Gehel: "minor comments inline." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans)
[14:29:20] <wikibugs>	 (03PS2) 10Reedy: Add shnwiki to InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473510 (https://phabricator.wikimedia.org/T206777)
[14:29:36] <godog>	 !log roll-restart swift-proxy in codfw to pick up statsd changes
[14:29:37] <addshore>	 marostegui: no increase in connections though? only in queries
[14:29:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:22] <marostegui>	 addshore: we do have more connections yes
[14:30:41] <banyek>	 there was a small descent in the graphs, but they're crawling up again
[14:31:52] <addshore>	 marostegui: i dont see a connection rate increase on https://grafana.wikimedia.org/dashboard/db/mysql?var-server=db1089&var-port=9104&var-dc=eqiad%20prometheus%2Fops&orgId=1&from=now-3h&to=now ?
[14:32:15] <marostegui>	 addshore: https://grafana.wikimedia.org/dashboard/db/mysql?var-server=db1089&var-port=9104&var-dc=eqiad%20prometheus%2Fops&orgId=1&from=now-24h&to=now&panelId=37&fullscreen
[14:32:31] <addshore>	 aaah process list
[14:32:35] <marostegui>	 yeh :)
[14:34:41] * addshore has run out of places to look
[14:34:51] <marostegui>	 I am checking performance schema to see if I find something interesting
[14:35:12] <_joe_>	 yeah it seems a flock of small queries rather than some huge ones
[14:35:18] <marostegui>	 yep
[14:35:22] <_joe_>	 tendril doesn't show anything significant
[14:35:41] <marostegui>	 yeah, and there is not a change on rows read patterns and things like that
[14:35:56] <_joe_>	 marostegui: uhm be prepared for a shock
[14:36:00] <_joe_>	 https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104&from=now-7d&to=now
[14:36:08] <_joe_>	 this happens daily
[14:36:16] <_joe_>	 always at the same time
[14:36:18] <marostegui>	 ??
[14:36:29] <addshore>	 O_o
[14:36:31] <_joe_>	 the query skyrocketing
[14:36:42] <_joe_>	 now, I have a candidate for that
[14:36:42] * marostegui goes to check cronjobs in mwmaint
[14:37:03] <_joe_>	 marostegui: either a cronjob, or some peculiar memcached key expiring was my guess
[14:37:13] <addshore>	 it is getting progressively bigger each day / set of days
[14:37:25] <marostegui>	 could be parsercache expirations?
[14:37:39] <_joe_>	 elukey: didn't we move the TTL of the translatewiki key to 1 day?
[14:37:53] <elukey>	 _joe_ it goes out with this week's train
[14:37:56] <_joe_>	 marostegui: why on s1 though?
[14:37:56] <addshore>	 it looks like this was also happening 3 months ago https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104&from=now-90d&to=now
[14:38:06] <marostegui>	 _joe_: yeah, it should affect all sections, good point
[14:39:00] <_joe_>	 elukey: might be related to the errors hashar was seeing, and for which he stopped the train?
[14:39:08] <_joe_>	 anyways, yes, check cronjobs
[14:39:22] <_joe_>	 else it's someone else's cronjob :P
[14:39:34] <_joe_>	 scrape_wikipedia.sh
[14:39:37] <marostegui>	 yeah, I am checking cronjobs
[14:40:08] <addshore>	 hashar: it's a shame there is no stacktrace etc for https://phabricator.wikimedia.org/T209429
[14:40:44] <_joe_>	 this has been happening since nov 6th, more or less
[14:41:15] <wikibugs>	 (03PS1) 10Filippo Giunchedi: swift: turn on statsd_exporter in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/473519 (https://phabricator.wikimedia.org/T205870)
[14:42:23] <addshore>	 _joe_: the same pattern exists in august and september
[14:42:31] <marostegui>	 pff
[14:42:35] <raynor>	 so, once we know that it's not the SEO thing, addshore - once QPS gets back to normal, could you redeploy the patch once again please?
[14:42:38] <elukey>	 _joe_ in theory it shouldn't, the error seems to be fetching a "" key
[14:43:07] <marostegui>	 why does it only affect s1
[14:43:19] <banyek>	 it happened on s7 too
[14:43:28] <banyek>	 let me check the pattern there too
[14:43:44] <addshore>	 yes, s7 https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&from=now-90d&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1086&var-port=9104
[14:43:48] <wikibugs>	 (03CR) 10Mathew.onipe: "This is good!. Just few nitpicks.." (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans)
[14:43:54] <banyek>	 yep
[14:43:58] <marostegui>	 s7 has centralauth
[14:44:15] <addshore>	 during august and sept it was always at 10:40-10:50 on s7
[14:44:44] <_joe_>	 ok, it must definitely be a cron
[14:44:48] <addshore>	 in august and sept it was also always 10:40 ish
[14:44:51] <addshore>	 ^^ for s1
[14:45:13] <addshore>	 timezones have changed now and we are seeing it at 14:40 / 14:44 ish
[14:45:32] <marostegui>	 I was thinking about refreshlinks cron, but I think that got moved to a different hour
[14:45:33] <_joe_>	 addshore: no, grafana uses utc unless you tell it otherwise
[14:45:35] <marostegui>	 let me check
[14:46:06] <addshore>	 _joe_: of course! so whatever is causing this changed when it was doing whatever it was doing :P
[14:46:11] <marostegui>	 Yeah, refreshlinks is at: 0 0 1 * *
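The schedule quoted above, `0 0 1 * *`, reads as minute 0, hour 0, day-of-month 1, any month, any weekday, so it fires once a month at midnight and would not explain a daily spike. A minimal sketch for checking a timestamp against such an expression follows; `cron_matches` is a hypothetical helper that only understands `*` and plain integers, not ranges, lists, steps, or `7` for Sunday.

```python
from datetime import datetime


def _field_matches(field: str, value: int) -> bool:
    # Only "*" and a single integer are supported in this sketch.
    return field == "*" or int(field) == value


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if the five-field cron expression would fire at `when`."""
    minute, hour, dom, month, dow = expr.split()
    cron_dow = (when.weekday() + 1) % 7  # cron counts Sunday as 0
    time_ok = (
        _field_matches(minute, when.minute)
        and _field_matches(hour, when.hour)
        and _field_matches(month, when.month)
    )
    dom_ok = _field_matches(dom, when.day)
    dow_ok = _field_matches(dow, cron_dow)
    if dom != "*" and dow != "*":
        # Classic cron rule: when both day fields are restricted, either may match.
        day_ok = dom_ok or dow_ok
    else:
        day_ok = dom_ok and dow_ok
    return time_ok and day_ok
```

For example, `cron_matches("0 0 1 * *", datetime(2018, 11, 14, 12, 30))` is false, since the 14th at 12:30 is neither midnight nor the first of the month.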
[14:48:01] <Amir1>	 Anomie put this in the deployment section
[14:48:16] <_joe_>	 addshore: can just be slower progressing through the wikis :)
[14:48:36] <_joe_>	 Amir1: he put what?
[14:48:43] <marostegui>	 Anomie will be running refreshExternallinksIndex.php for https://phabricator.wikimedia.org/T209373.
[14:48:58] <Amir1>	 https://wikitech.wikimedia.org/wiki/Deployments#Week_of_November_12th
[14:49:00] <Amir1>	 Anomie will be running refreshExternallinksIndex.php for T209373.
[14:49:01] <stashbot>	 T209373: Run maintenance/refreshExternallinksIndex.php on all wikis - https://phabricator.wikimedia.org/T209373
[14:49:01] <marostegui>	 But if this has been happening before, we can discard that I think
[14:50:04] <marostegui>	 It is definitely recovering now
[14:50:08] <marostegui>	 https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104&from=1542195716887&to=1542206985063
[14:51:00] <anomie>	 Amir1, marostegui: What's going on that that maintenance is being talked about? FYI, at the moment nothing is running for that task. I ran group 0 yesterday and it completed in 3 minutes. I was planning on running group 1 after the train window.
[14:51:06] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::appserver: install php-fpm everywhere [puppet] - 10https://gerrit.wikimedia.org/r/473231 (https://phabricator.wikimedia.org/T208433) (owner: 10Giuseppe Lavagetto)
[14:51:22] <marostegui>	 addshore: I think we can push again raynor's patch
[14:52:48] <addshore>	 marostegui: will do
[14:52:52] <marostegui>	 thanks
[14:52:57] <apergos>	 did you see anything as the wikiadmin user in the process list that looked different than the run of the mill stuff? I know they were all short-lived, but still
[14:53:04] <wikibugs>	 (03PS1) 10Addshore: Revert "Revert "Prod: Enable Schema.org page split test at 1% sampling"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473528
[14:53:10] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10mforns) Thank you @elukey!
[14:53:12] <raynor>	 \o/
[14:53:12] <wikibugs>	 (03CR) 10Addshore: [C: 032] Revert "Revert "Prod: Enable Schema.org page split test at 1% sampling"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473528 (owner: 10Addshore)
[14:53:24] <marostegui>	 apergos: nope :(
[14:53:43] <addshore>	 mutante: around much today? I want to talk about https://phabricator.wikimedia.org/T99531 again! 
[14:54:31] <raynor>	 addshore - do you have time? if not I can push it :)
[14:54:42] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "Prod: Enable Schema.org page split test at 1% sampling"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473528 (owner: 10Addshore)
[14:55:01] <addshore>	 raynor: i can
[14:55:15] <raynor>	 awesome, thx
[14:55:29] <raynor>	 let me know once it's there, I'll just quickly check it still works
[14:55:55] <wikibugs>	 (03PS2) 10Reedy: Stop breaking blame for wikimedia special cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473487
[14:56:15] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config: Prod: Enable Schema.org page split test at 1% sampling (again) (duration: 00m 54s)
[14:56:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:21] <addshore>	 yar: ^^ done
[14:56:23] <addshore>	 raynor: ^^
[14:56:56] <raynor>	 thx, let me check that
[14:57:02] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] install_server: reimage rdb2001, rdb2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472970 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli)
[14:58:16] <apergos>	 it's 7 am in sf, addshore, just bear in mind :-D
[14:58:36] <addshore>	 apergos: silly timezones :) forgot it was so early 
[14:58:51] <wikibugs>	 (03CR) 10jenkins-bot: Revert "Revert "Prod: Enable Schema.org page split test at 1% sampling"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473528 (owner: 10Addshore)
[14:58:55] <apergos>	 I have an sf clock added to my gnome clock, otherwise I would be hopeless :-D
[14:58:55] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) I have added pc2010 as spare host with the following line on db-codfw.php - we can change it if we want t...
[14:59:16] <addshore>	 i have one on my calendar, but apparently i don't look at it that often
[14:59:17] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui)
[14:59:25] <apergos>	 lol
[14:59:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] swift: turn on statsd_exporter in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/473519 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi)
[15:00:12] <wikibugs>	 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531 (10MasinAlDujailiWMDE) Did someone ask for a zone file? I have a zone file! Here, take a zone file! ;-)  {F27219415}
[15:02:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Record extended account date for nathante [puppet] - 10https://gerrit.wikimedia.org/r/473530
[15:03:25] <wikibugs>	 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531 (10Addshore) 05stalled>03Open
[15:04:11] <wikibugs>	 (03PS2) 10Muehlenhoff: Record extended account date for nathante [puppet] - 10https://gerrit.wikimedia.org/r/473530
[15:04:13] <raynor>	 addshore, it works, thank you
[15:04:20] <addshore>	 raynor: woo!
[15:04:23] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui)
[15:05:22] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) @Cmjohnson any ETA to get these racked&installed?  Thanks
[15:05:25] <wikibugs>	 (03CR) 10Effie Mouzeli: "https://puppet-compiler.wmflabs.org/compiler1002/13487/" [puppet] - 10https://gerrit.wikimedia.org/r/473029 (https://phabricator.wikimedia.org/T198220) (owner: 10Effie Mouzeli)
[15:06:12] <wikibugs>	 (03PS1) 10Andrew Bogott: Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/473531 (https://phabricator.wikimedia.org/T204745)
[15:06:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Record extended account date for nathante [puppet] - 10https://gerrit.wikimedia.org/r/473530 (owner: 10Muehlenhoff)
[15:06:30] <wikibugs>	 (03CR) 10Effie Mouzeli: "https://puppet-compiler.wmflabs.org/compiler1002/13488/rdb2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/472970 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli)
[15:06:57] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/473531 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott)
[15:07:06] <wikibugs>	 (03PS2) 10Andrew Bogott: Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/473531 (https://phabricator.wikimedia.org/T204745)
[15:07:22] <Amir1>	 !log ladsgroup@mwmaint1002:/srv/mediawiki-staging/php-1.33.0-wmf.4$ mwscript sql.php --wiki=incubatorwiki extensions/Wikibase/client/sql/entity_usage.sql (T209207)
[15:07:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:25] <stashbot>	 T209207: Enable arbitrary access on Incubator - https://phabricator.wikimedia.org/T209207
[15:07:25] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove Diamond on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/473489 (https://phabricator.wikimedia.org/T183454)
[15:08:03] <wikibugs>	 (03PS3) 10Muehlenhoff: Remove Diamond on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/473489 (https://phabricator.wikimedia.org/T183454)
[15:11:10] <wikibugs>	 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10Addshore)
[15:12:08] <icinga-wm>	 RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational
[15:12:59] <godog>	 !log roll-restart swift on ms-be1* to pick up statsd changes 
[15:13:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Remove Diamond on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/473489 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[15:13:46] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mediawiki::appserver: install php-fpm everywhere [puppet] - 10https://gerrit.wikimedia.org/r/473231 (https://phabricator.wikimedia.org/T208433)
[15:15:09] <wikibugs>	 (03PS1) 10Ladsgroup: Add incubatorwiki to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473534 (https://phabricator.wikimedia.org/T209207)
[15:16:12] <wikibugs>	 (03PS3) 10Niedzielski: Prod: increase Schema.org page split test to 5% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473221 (https://phabricator.wikimedia.org/T208755)
[15:16:14] <wikibugs>	 (03PS3) 10Niedzielski: Prod: increase Schema.org page split test to 25% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473224 (https://phabricator.wikimedia.org/T208755)
[15:16:16] <wikibugs>	 (03PS3) 10Niedzielski: Prod: increase Schema.org page split test to 50% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473225 (https://phabricator.wikimedia.org/T208755)
[15:16:18] <wikibugs>	 (03PS2) 10Niedzielski: Prod: increase Schema.org page split test to 100% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473227 (https://phabricator.wikimedia.org/T208755)
[15:16:37] <wikibugs>	 (03PS8) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454)
[15:16:44] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 032] jobqueue_redis: Purge role jobqueue_redis [puppet] - 10https://gerrit.wikimedia.org/r/473029 (https://phabricator.wikimedia.org/T198220) (owner: 10Effie Mouzeli)
[15:16:53] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Prod: increase Schema.org page split test to 50% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473225 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski)
[15:17:20] <wikibugs>	 (03PS2) 10Effie Mouzeli: jobqueue_redis: Purge role jobqueue_redis [puppet] - 10https://gerrit.wikimedia.org/r/473029 (https://phabricator.wikimedia.org/T198220)
[15:19:03] <wikibugs>	 (03CR) 10Herron: logstash: add rsyslog-shipper kafka input config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[15:26:05] <wikibugs>	 (03PS1) 10Banyek: Revert "mariadb: depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473536
[15:30:01] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: php::fpm: explicitly depend on the php-fpm package [puppet] - 10https://gerrit.wikimedia.org/r/473538
[15:30:26] <wikibugs>	 (03PS1) 10Banyek: mariadb: depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473539 (https://phabricator.wikimedia.org/T85757)
[15:31:14] <wikibugs>	 (03CR) 10Marostegui: [C: 04-1] mariadb: depool db1088 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473539 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek)
[15:31:37] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] php::fpm: explicitly depend on the php-fpm package [puppet] - 10https://gerrit.wikimedia.org/r/473538 (owner: 10Giuseppe Lavagetto)
[15:32:01] <wikibugs>	 (03PS4) 10Volans: remote: refactor Remote.query() API [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884)
[15:32:03] <wikibugs>	 (03PS3) 10Volans: Add Icinga module [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884)
[15:32:13] <wikibugs>	 (03CR) 10Volans: "done" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473213 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans)
[15:32:39] <wikibugs>	 (03CR) 10Volans: "Replies inline, some done." (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans)
[15:32:57] <wikibugs>	 (03PS2) 10Banyek: mariadb: depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473539 (https://phabricator.wikimedia.org/T85757)
[15:33:02] <icinga-wm>	 RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.973 second response time
[15:33:43] <wikibugs>	 (03PS1) 10GTirloni: cloudvps: reimage+rename labvirt1016 as cloudvirt1016 [puppet] - 10https://gerrit.wikimedia.org/r/473540 (https://phabricator.wikimedia.org/T209426)
[15:33:56] <wikibugs>	 (03PS1) 10GTirloni: cloudvps: rename+reimage labvirt1016 as cloudvirt1016 [dns] - 10https://gerrit.wikimedia.org/r/473541 (https://phabricator.wikimedia.org/T209426)
[15:34:18] <wikibugs>	 (03CR) 10Marostegui: mariadb: depool db1088 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473539 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek)
[15:34:53] <wikibugs>	 (03PS2) 10GTirloni: cloudvps: reimage+rename labvirt1016 as cloudvirt1016 [puppet] - 10https://gerrit.wikimedia.org/r/473540 (https://phabricator.wikimedia.org/T209426)
[15:35:17] <wikibugs>	 (03PS3) 10Herron: logstash::input::kafka add support for SSL/TLS options [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454)
[15:35:41] <wikibugs>	 (03PS3) 10Banyek: mariadb: depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473539 (https://phabricator.wikimedia.org/T85757)
[15:36:02] <wikibugs>	 (03CR) 10Marostegui: [C: 031] mariadb: depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473539 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek)
[15:36:19] <wikibugs>	 (03CR) 10Herron: logstash::input::kafka add support for SSL/TLS options (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[15:36:24] <icinga-wm>	 PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:36:37] <wikibugs>	 (03CR) 10GTirloni: [C: 032] cloudvps: reimage+rename labvirt1016 as cloudvirt1016 [puppet] - 10https://gerrit.wikimedia.org/r/473540 (https://phabricator.wikimedia.org/T209426) (owner: 10GTirloni)
[15:37:04] <wikibugs>	 (03CR) 10Marostegui: [C: 031] Revert "mariadb: depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473536 (owner: 10Banyek)
[15:37:20] <wikibugs>	 (03CR) 10GTirloni: [C: 032] cloudvps: rename+reimage labvirt1016 as cloudvirt1016 [dns] - 10https://gerrit.wikimedia.org/r/473541 (https://phabricator.wikimedia.org/T209426) (owner: 10GTirloni)
[15:37:30] <icinga-wm>	 PROBLEM - puppet last run on rdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:37:39] <wikibugs>	 (03PS9) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454)
[15:37:43] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 032] install_server: reimage rdb2001, rdb2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472970 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli)
[15:37:47] <wikibugs>	 (03PS1) 10Addshore: WIP DNM: Add wikiba.se [dns] - 10https://gerrit.wikimedia.org/r/473543 (https://phabricator.wikimedia.org/T99531)
[15:38:01] <wikibugs>	 (03PS3) 10Effie Mouzeli: install_server: reimage rdb2001, rdb2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472970 (https://phabricator.wikimedia.org/T206450)
[15:38:45] <wikibugs>	 (03CR) 10Banyek: [C: 032] Revert "mariadb: depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473536 (owner: 10Banyek)
[15:39:04] <banyek>	 !log repooling db2046 (T85757)
[15:39:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:07] <stashbot>	 T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[15:39:20] <wikibugs>	 (03PS2) 10Banyek: Revert "mariadb: depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473536
[15:39:23] <wikibugs>	 (03CR) 10Banyek: [V: 032 C: 032] Revert "mariadb: depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473536 (owner: 10Banyek)
[15:40:00] <icinga-wm>	 PROBLEM - Host lvs2009 is DOWN: PING CRITICAL - Packet loss = 100%
[15:40:13] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: php::fpm: require the package for pool.d too [puppet] - 10https://gerrit.wikimedia.org/r/473544
[15:40:40] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) @Vgutierrez Recabling done as you requested  on both servers
[15:40:53] <_joe_>	 ok that's papaul's work :)
[15:41:10] <icinga-wm>	 PROBLEM - puppet last run on rdb2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:41:23] <_joe_>	 jiji: ^^
[15:41:33] <jiji>	 no biggie
[15:41:39] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] php::fpm: require the package for pool.d too [puppet] - 10https://gerrit.wikimedia.org/r/473544 (owner: 10Giuseppe Lavagetto)
[15:41:40] <logmsgbot>	 !log banyek@deploy1001 Synchronized wmf-config/db-codfw.php: T85757: repool db2046 (duration: 00m 52s)
[15:41:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:47] <_joe_>	 yup just wanted to make sure you saw it :)
[15:41:56] <jiji>	 I saw it 
[15:42:49] <jiji>	 I will merge the fix 
[15:43:16] <wikibugs>	 (03CR) 10jenkins-bot: Revert "mariadb: depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473536 (owner: 10Banyek)
[15:43:22] <jiji>	 but I have a downtime for it on icinga 
[15:43:28] <jiji>	 iirc
[15:44:41] <_joe_>	 uhm that might be an icinga bug then
[15:44:54] <_joe_>	 better call volans!
[15:45:56] <volans>	 _joe_: you know I know better :) it's not the old bug
[15:46:16] <moritzm>	 there's no current downtime on rdb2*, maybe it expired?
[15:46:43] <_joe_>	 moritzm: I think it's the same icinga bug jaime encountered
[15:47:30] <icinga-wm>	 RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.368 second response time
[15:48:32] <volans>	 arcane discovered ;)
[15:48:34] <mutante>	 addshore: it looks like you are following option 2. that seems right to me. that being said, please get a review for your patch from traffic team
[15:48:36] <volans>	 it's all good
[15:49:02] <jiji>	 moritzm: yes, I had run it on einsteinium 
[15:49:09] <jiji>	 but we switched to icinga1001
[15:49:14] <jiji>	 so it was no good anymore :p
[15:50:29] <godog>	 !log roll restart swift-proxy in eqiad to apply statsd changes
[15:50:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:34] <bblack>	 how's our overall status?
[15:50:49] <bblack>	 jouncebot: now
[15:50:49] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T1400)
[15:50:57] <bblack>	 MW train still running btw?
[15:51:00] <icinga-wm>	 PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:51:05] <_joe_>	 bblack: I think hashar stopped it
[15:51:17] <wikibugs>	 (03CR) 10Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe)
[15:51:51] <jiji>	 I will restart pdfrender 
[15:52:18] <_joe_>	 thanks
[15:53:04] <jiji>	 !Restarting pdfrender on scb*.eqiad.wmnet
[15:53:07] <jiji>	 !log Restarting pdfrender on scb*.eqiad.wmnet
[15:53:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:20] <icinga-wm>	 RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.200 second response time
[15:54:39] <wikibugs>	 (03PS1) 10Banyek: mariadb: productionize dbproxy101[2-7].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367)
[15:54:58] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ` lvs2010.codfw.wmnet ` The log can...
[15:55:16] <bblack>	 I have a swap of the TLS certs in the US planned for now-ish
[15:55:25] <bblack>	 but I can hold/defer if there's other risks ongoing
[15:55:35] <bblack>	 (or not wanting interference in various graphs)
[15:57:25] <wikibugs>	 (03PS2) 10BBlack: Switch unified cert to globalsign-2018 at US edges [puppet] - 10https://gerrit.wikimedia.org/r/473211 (https://phabricator.wikimedia.org/T206804)
[15:57:30] <bblack>	 ^ that
[15:58:39] <moritzm>	 !log rebooting restbase-dev1004 for kernel security update and OpenJDK security update
[15:58:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:19] <bblack>	 assuming other risks are not sufficient to block! :)
[16:01:19] <wikibugs>	 (03PS4) 10Effie Mouzeli: install_server: reimage rdb2001, rdb2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472970 (https://phabricator.wikimedia.org/T206450)
[16:02:05] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2010.codfw.wmnet'] `  Of which those **FAILED**: ` ['lvs2010.codfw.wmnet'] `
[16:02:12] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error executing row event: Table liwikinews.echo_event doesnt exist
[16:06:38] <wikibugs>	 10Operations, 10Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (10thcipriani) >>! In T209456#4745707, @hashar wrote: > We had a lot of stack overflow errors on random pages such as gittiles, that is due to some pages being very long to prettify.  Those happened with some f...
[16:08:07] <moritzm>	 !log rebooting restbase-dev1005 for kernel security update and OpenJDK security update
[16:08:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:29] <bblack>	 !log starting replacement of GlobalSign unified TLS cert at US edges (affects all public TLS termination for US traffic edges) - T206804
[16:09:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:32] <stashbot>	 T206804: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804
[16:10:55] <bblack>	 !log disabling puppet as precaution on all caches (cumin A:cp) - T206804
[16:10:55] <wikibugs>	 (03PS1) 10Herron: add dummy logstash kafka input password to pacify PCC [labs/private] - 10https://gerrit.wikimedia.org/r/473553
[16:10:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:34] <wikibugs>	 (03CR) 10BBlack: [C: 032] Switch unified cert to globalsign-2018 at US edges [puppet] - 10https://gerrit.wikimedia.org/r/473211 (https://phabricator.wikimedia.org/T206804) (owner: 10BBlack)
[16:12:41] <wikibugs>	 (03CR) 10Herron: [V: 032 C: 032] add dummy logstash kafka input password to pacify PCC [labs/private] - 10https://gerrit.wikimedia.org/r/473553 (owner: 10Herron)
[16:12:58] <wikibugs>	 (03PS3) 10BBlack: Switch unified cert to globalsign-2018 at US edges [puppet] - 10https://gerrit.wikimedia.org/r/473211 (https://phabricator.wikimedia.org/T206804)
[16:13:32] <bblack>	 it'd be nice if wikibugs and/or log actually noted when patches were Submit-ed, instead of just when they're uploaded and/or a review vote changes
[16:15:04] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 963.41 seconds
[16:16:21] <wikibugs>	 (03CR) 10Cwhite: [C: 032] Remove Diamond from Kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/473490 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[16:16:47] <moritzm>	 !log rebooting restbase-dev1006 for kernel security update and OpenJDK security update
[16:16:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:08] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) Redirection settings are confirmed correct.  Looking around other settings in the docs.
[16:18:10] <wikibugs>	 (03PS1) 10Ema: Add module to install and configure fifo-log-demux [puppet] - 10https://gerrit.wikimedia.org/r/473554 (https://phabricator.wikimedia.org/T204225)
[16:18:12] <wikibugs>	 (03PS1) 10Ema: trafficserver: configure fifo-log-demux [puppet] - 10https://gerrit.wikimedia.org/r/473555 (https://phabricator.wikimedia.org/T204225)
[16:20:34] <icinga-wm>	 RECOVERY - Host lvs2009 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms
[16:23:18] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) Nothing.  I guess this is just more digging, then, unless both systems are somehow broken.
[16:23:27] <wikibugs>	 (03CR) 10Cwhite: "> Agreed, I don't think we need the status site, it's fine to simply" [puppet] - 10https://gerrit.wikimedia.org/r/473302 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite)
[16:23:44] <icinga-wm>	 RECOVERY - HTTPS Unified RSA on cp2018 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 341660 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 372 days)
[16:23:52] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp2006 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 318852 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 372 days)
[16:24:02] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp2018 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 341642 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 372 days)
[16:24:02] <bblack>	 incoming recovery spam for the expiring certificates, sorry!
[16:24:08] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp2025 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 339595 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 372 days)
[16:24:12] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp2012 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 334671 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 372 days)
[16:24:14] <icinga-wm>	 PROBLEM - puppet last run on lvs2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:24:14] <icinga-wm>	 PROBLEM - puppet last run on cp1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:24:19] <wikibugs>	 (03PS12) 10Cwhite: hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454)
[16:24:28] <icinga-wm>	 PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:24:32] <icinga-wm>	 RECOVERY - HTTPS Unified RSA on cp2006 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 318813 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 372 days)
[16:24:32] <icinga-wm>	 RECOVERY - HTTPS Unified RSA on cp2012 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 334652 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 372 days)
[16:24:46] <wikibugs>	 (03CR) 10Cwhite: [C: 032] hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite)
[16:24:46] <icinga-wm>	 PROBLEM - puppet last run on cp1074 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:24:54] <icinga-wm>	 PROBLEM - puppet last run on cp2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:25:00] <bblack>	 I guess "-b 21" was a little much, given the few concurrent puppetfails that are going on, we'll see
[16:25:24] <bblack>	 cp1073/4 may be something else, those are ATS test servers I'm not doing anything on..
[16:25:39] <ema>	 yes those are my fault!
[16:25:40] <moritzm>	 !log rebooting ganeti1005 for kernel security update
[16:25:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:20] <bblack>	 cp2009 doesn't actually have any local log of a puppet agent failure, in spite of the alert above, that's odd
[16:27:21] <wikibugs>	 (03PS1) 10GTirloni: cloudvps: hieradata for cloudvirt1016 [puppet] - 10https://gerrit.wikimedia.org/r/473557 (https://phabricator.wikimedia.org/T209426)
[16:27:33] <bblack>	 oh duh, also ATS :)
[16:27:44] <wikibugs>	 (03CR) 10Fsero: [V: 032 C: 031] fifo-log-demux 0.1 (031 comment) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/473432 (https://phabricator.wikimedia.org/T204225) (owner: 10Ema)
[16:28:11] <wikibugs>	 (03CR) 10GTirloni: [C: 032] cloudvps: hieradata for cloudvirt1016 [puppet] - 10https://gerrit.wikimedia.org/r/473557 (https://phabricator.wikimedia.org/T209426) (owner: 10GTirloni)
[16:28:18] <bblack>	 icinga-wm: ping?
[16:28:33] <bblack>	 there should be more recoveries, maybe they're slow to recheck at this point
[16:29:20] <icinga-wm>	 RECOVERY - puppet last run on cp1073 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[16:29:34] <icinga-wm>	 RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:29:52] <icinga-wm>	 RECOVERY - puppet last run on cp1074 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[16:30:00] <icinga-wm>	 RECOVERY - puppet last run on cp2009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[16:30:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] "I don't have a strong preference, but you have a point about the unpuppetised status page lingering around otherwise. I'd say let's merge " [puppet] - 10https://gerrit.wikimedia.org/r/473302 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite)
[16:31:31] <wikibugs>	 (03CR) 10Herron: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[16:33:20] <bblack>	 !log [Done] replacement of GlobalSign unified TLS cert at US edges complete - T206804
[16:33:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:23] <stashbot>	 T206804: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804
[16:33:44] <icinga-wm>	 RECOVERY - puppet last run on rdb2001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[16:34:07] <wikibugs>	 (03PS1) 10GTirloni: cloudvps: cleanup labvirt1016 [dns] - 10https://gerrit.wikimedia.org/r/473560 (https://phabricator.wikimedia.org/T209426)
[16:34:16] <wikibugs>	 (03PS2) 10Dzahn: wikistats (vps): use mariadb classes, fix old FIXME [puppet] - 10https://gerrit.wikimedia.org/r/470953
[16:34:30] <wikibugs>	 (03CR) 10GTirloni: [C: 032] cloudvps: cleanup labvirt1016 [dns] - 10https://gerrit.wikimedia.org/r/473560 (https://phabricator.wikimedia.org/T209426) (owner: 10GTirloni)
[16:36:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove Diamond from Elastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/473561 (https://phabricator.wikimedia.org/T183454)
[16:36:46] <wikibugs>	 (03CR) 10Dzahn: [C: 032] wikistats (vps): use mariadb classes, fix old FIXME [puppet] - 10https://gerrit.wikimedia.org/r/470953 (owner: 10Dzahn)
[16:37:22] <icinga-wm>	 RECOVERY - puppet last run on rdb2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[16:37:53] <godog>	 that [done] reminded me we could totally #hashtag !log entries :)
[16:39:56] <mutante>	 yep, it's Twitter
[16:44:38] <icinga-wm>	 RECOVERY - puppet last run on lvs2009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[16:48:12] <revi>	 can someone with deployment access take care of https://phabricator.wikimedia.org/T209495
[16:48:13] <revi>	 ?
[16:48:18] <icinga-wm>	 PROBLEM - Host cloudvirt1016 is DOWN: PING CRITICAL - Packet loss = 100%
[16:48:36] <_joe_>	 uh?
[16:48:41] <_joe_>	 is this expected?
[16:48:51] <godog>	 IIRC that host was recently reimaged
[16:48:52] <bblack>	 16:34 <+wikibugs> (CR) GTirloni: [C: 2] cloudvps: cleanup labvirt1016 [dns] ?
[16:48:56] <_joe_>	 well
[16:49:13] <akosiaris>	 should this have paged ?
[16:49:17] <_joe_>	 yes
[16:49:25] <_joe_>	 cloud hosts page when they go down
[16:49:38] <_joe_>	 but it shouldn't have paged because it should've been downtimed
[16:49:39] <herron>	 it did page me fwiw
[16:49:46] <bblack>	 I suspect the cleanup commit (which killed 1/2 hostnames for that host) may have cleaned up a hostname the icinga check was using to ping it with? I donno
[16:49:47] <_joe_>	 it did page everyone
[16:50:06] <gtirloni>	 labvirt1016 is downtimed
[16:50:23] <bblack>	 yeah but cloudvirt1016 is what paged
[16:50:23] <_joe_>	 but not cloudvirt1016 :P
[16:50:24] <icinga-wm>	 RECOVERY - Host cloudvirt1016 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[16:50:47] <gtirloni>	 cloudvirt1016 doesn't show up in icinga for me
[16:51:00] <apergos>	 hm pages
[16:51:00] <godog>	 sadly a silly race condition / limitation in icinga, you can't downtime hosts/services that don't exist yet
[16:51:16] <_joe_>	 godog: right
[16:51:23] <wikibugs>	 (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/473561 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[16:51:35] <_joe_>	 so they'd need to set notifications to 'disabled' while installing
[16:51:40] <gtirloni>	 bblack: I'm renaming a server
[16:51:53] <gtirloni>	 not sure if there's anything I can do :)
[16:51:56] <_joe_>	 gtirloni: ok when renaming a server you will need to merge a puppet patch :)
[16:52:23] <gtirloni>	 ?
[16:52:24] <_joe_>	 gtirloni: hieradata/hosts/<new-name>.yaml with profile::base::notifications: disabled
[16:52:31] <icinga-wm>	 PROBLEM - TFTP service on install2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .*
[16:52:31] <_joe_>	 before you rename it
[16:52:56] <gtirloni>	 ok, I'll review our instructions and add that. Thanks!
[16:53:10] <_joe_>	 gtirloni: that's just because your hosts page everyone
[16:53:16] <wikibugs>	 (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[16:54:21] <gtirloni>	 got it, thanks and sorry about the noise
[16:54:26] <wikibugs>	 (03CR) 10Cwhite: [C: 031] Remove Diamond from Elastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/473561 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[16:54:31] <icinga-wm>	 RECOVERY - TFTP service on install2002 is OK: PROCS OK: 1 process with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .*
[16:55:01] <_joe_>	 gtirloni: ofc you also need to remove that afterwards
[16:55:07] <_joe_>	 as you want to get paged :)
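
Editor's note: the temporary override _joe_ describes above could look roughly like the fragment below. This is only a sketch. The path pattern and the key/value pair are taken from his message, the hostname is the renamed host from this log, and whether the file needs any other keys is not stated in the conversation.

```yaml
# hieradata/hosts/cloudvirt1016.yaml
# Temporary override, merged BEFORE the rename so the new host does not page
# while it is being installed. Key and value are as quoted by _joe_; treating
# this as the whole file is an assumption.
profile::base::notifications: disabled
```

As noted in the same exchange, the override has to be removed again once the host is in service, otherwise it will never page.
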
[16:56:44] <addshore>	 _joe_: just had another issue with the docker-registry and pulling an image, and managed to get 2 people to reproduce it in 2 different locations and machines etc
[16:56:59] <addshore>	 this time 1 layer of 1 image would just refuse to download, and kept retrying
[16:57:08] <addshore>	 ticket worthy? or?
[16:57:54] <_joe_>	 addshore: I guess so
[16:58:01] * addshore goes to write it up
[16:58:08] <_joe_>	 addshore: our docker registry is bound to be redone sooner than later
[17:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T1700).
[17:00:04] <jouncebot>	 stephanebisson and Amir1: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:11] <stephanebisson>	 Hi
[17:00:18] <stephanebisson>	 I can SWAT today
[17:00:24] <Amir1>	 o/
[17:01:56] <wikibugs>	 10Operations: issue pulling 1 layer of docker-registry.wikimedia.org/releng/composer-php71:latest - https://phabricator.wikimedia.org/T209507 (10Addshore)
[17:01:59] <addshore>	 _joe_: ^^ done
[17:02:21] <wikibugs>	 10Operations, 10Core Platform Team Backlog (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Joe)
[17:04:21] <robh>	 is someone working on cloudvirt1016?  (saw notice of raid array failure and then reboot)
[17:05:43] <paladox>	 robh i think gtirloni is or andrewbogott 
[17:05:57] <robh>	 ok, i just don't like assuming it's handled, felt the need to check!
[17:08:49] <wikibugs>	 10Operations, 10Traffic: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (10BBlack) From IRC for posterity:  ` 17:01 < bblack> the straw pseudo-code still seems like a legit approach to me, but in practice there's some rough edges to it.  chiefly, that cross my mind imme...
[17:11:53] <andrewbogott>	 robh: gtirloni is renaming labvirt1016 to cloudvirt1016
[17:12:28] <logmsgbot>	 !log sbisson@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/WikimediaEvents/includes/WikimediaEventsHooks.php: SWAT: [[gerrit:473394|Fix EditAttemptStepSamplingRate variable export]] (duration: 00m 54s)
[17:12:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:08] <wikibugs>	 (03PS1) 10Bstorm: cloudstore: install with the default kernel for cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/473566 (https://phabricator.wikimedia.org/T193655)
[17:13:35] <wikibugs>	 (03PS1) 10GTirloni: cloudvps: Add cloudvirt1016 to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/473567 (https://phabricator.wikimedia.org/T209426)
[17:14:01] <robh>	 andrewbogott: no worries, i just saw alerts and was worried =]
[17:14:16] <robh>	 once confirmed it was folks doing stuff, i stopped worrying.
[17:14:22] <wikibugs>	 (03CR) 10Bstorm: [C: 032] cloudstore: install with the default kernel for cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/473566 (https://phabricator.wikimedia.org/T193655) (owner: 10Bstorm)
[17:14:45] <wikibugs>	 (03PS2) 10GTirloni: cloudvps: Add cloudvirt1016 to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/473567 (https://phabricator.wikimedia.org/T209426)
[17:16:06] <wikibugs>	 (03CR) 10GTirloni: [C: 032] cloudvps: Add cloudvirt1016 to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/473567 (https://phabricator.wikimedia.org/T209426) (owner: 10GTirloni)
[17:19:12] <logmsgbot>	 !log sbisson@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/MobileFrontend/resources/mobile.editor.common/schemaEditAttemptStep.js: SWAT: [[gerrit:473395|schemaEditAttemptStep.js: Use correct config var name for sampling rate]] (duration: 00m 54s)
[17:19:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:16] <wikibugs>	 (03CR) 10Sbisson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473534 (https://phabricator.wikimedia.org/T209207) (owner: 10Ladsgroup)
[17:19:24] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (10BBlack)
[17:20:34] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (10BBlack) 05Open>03Resolved
[17:20:39] <wikibugs>	 (03Merged) 10jenkins-bot: Add incubatorwiki to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473534 (https://phabricator.wikimedia.org/T209207) (owner: 10Ladsgroup)
[17:21:20] <stephanebisson>	 Amir1: Is your change testable?
[17:21:32] <Amir1>	 yup
[17:22:17] <stephanebisson>	 Amir1: It's now on mwdebug1001
[17:22:45] <Amir1>	 Testing
[17:23:08] <elukey>	 AndyRussG: hi! Kind reminder of the druid question :)
[17:23:20] <AndyRussG>	 elukey: thanks!
[17:23:41] <AndyRussG>	 yes .... rrrg apologies again for not getting to it yet, I'll definitely do so today  :)
[17:24:11] <elukey>	 I am pinging since your peak season is getting closer and I prefer not to rush :)
[17:25:31] <Amir1>	 stephanebisson: it's fine, please move forward
[17:25:57] <wikibugs>	 (03CR) 10jenkins-bot: Add incubatorwiki to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473534 (https://phabricator.wikimedia.org/T209207) (owner: 10Ladsgroup)
[17:26:42] <logmsgbot>	 !log sbisson@deploy1001 Synchronized dblists/wikidataclient.dblist: SWAT: [[gerrit:473534|Add incubatorwiki to wikidataclient.dblist]] (duration: 00m 48s)
[17:26:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:45] <wikibugs>	 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic: Server Access for 3 formal collaborators - https://phabricator.wikimedia.org/T209298 (10RyanSteinberg) public key: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBN1rS7OObcft7lDa9+H45kLfkdGHwlJ6rL2Fm2IPsMB  preferred shell login: ryanmax
[17:26:48] <stephanebisson>	 Amir1: done
[17:26:57] <Amir1>	 Thanks!
[17:27:01] <stephanebisson>	 And that concludes SWAT for now
[17:27:56] <wikibugs>	 (03CR) 10Cwhite: "> I don't have a strong preference, but you have a point about the" [puppet] - 10https://gerrit.wikimedia.org/r/473302 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite)
[17:31:26] <bawolff_>	 !log Running importImage.php for 'Opening ceremony of First accusation protest against presumption of guilt of judicial branch.webm' per request T209495
[17:31:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:31] <stashbot>	 T209495: Please upload large file to Wikimedia Commons - https://phabricator.wikimedia.org/T209495
[17:32:37] <hashar>	 anything wrong going on ? I have to send a hotfix for the train ( https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/473551/1/includes/libs/objectcache/MemcachedPeclBagOStuff.php )
[17:34:50] <hashar>	 I guess since SWAT has just finished, I will take over from now
[17:35:05] <_joe_>	 hashar: that's not exactly a fix
[17:35:11] <wikibugs>	 10Operations, 10Traffic: Renew Digicert Unified in 2019 - https://phabricator.wikimedia.org/T209515 (10BBlack) p:05Triage>03Normal
[17:35:15] <_joe_>	 you're transforming a non-fatal failure into a fatal one?
[17:35:33] <_joe_>	 oh no just reporting the stack trace
[17:35:35] <_joe_>	 ok
[17:35:38] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) This was just stuck at a prompt.  Stupid mistake, the output after that stage of boot was redirected to the...
[17:35:41] <hashar>	 yeah that is for the stacktrace
[17:35:43] <hashar>	 but indeed it is not a fix
[17:35:55] <_joe_>	 still: a step to get to a fix
[17:36:21] <hashar>	 yup :)
[17:36:25] <_joe_>	 I /think/ it's a limitation of mcrouter btw
[17:36:57] <hashar>	 my theory is that some piece of 1.33.0-wmf.4 code ends up trying to get a cache key with string(0) ""
[17:37:33] <arturo>	 !log T207377 downtime and reboot cloudnet1004 (cloudnet1003 is the active one already)
[17:37:37] <_joe_>	 hashar: uhm, maybe, wouldn't be so sure, but the trace should give you an answer
[17:37:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:41] <stashbot>	 T207377: Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377
[17:37:49] <wikibugs>	 10Operations, 10Traffic: Renew Digicert Unified in 2019 - https://phabricator.wikimedia.org/T209515 (10BBlack)
[17:37:56] <_joe_>	 hashar: it could be that the key is an unprintable pack of bits
[17:38:43] <wikibugs>	 10Operations, 10ops-codfw: Decommission asw-c8-codfw - https://phabricator.wikimedia.org/T209066 (10Papaul)
[17:39:01] <wikibugs>	 10Operations, 10ops-codfw: Decommission asw-c8-codfw - https://phabricator.wikimedia.org/T209066 (10Papaul)
[17:39:32] <hashar>	 _joe_: it also lists the server as ":" :/
[17:40:25] <Lucas_WMDE>	 that might be an empty host name + empty port, separated by a colon…
[17:40:44] <Lucas_WMDE>	 (though other logstash entries I saw also had a unix socket path as server)
[17:41:15] <wikibugs>	 10Operations, 10ops-codfw: Decommission asw-c8-codfw - https://phabricator.wikimedia.org/T209066 (10Papaul) 05Open>03Resolved Done
[17:44:13] <icinga-wm>	 PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.181 second response time
[17:46:39] <wikibugs>	 (03PS1) 10Andrew Bogott: Horizon: move wmf-research-tools to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/473570 (https://phabricator.wikimedia.org/T204745)
[17:47:44] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Horizon: move wmf-research-tools to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/473570 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott)
[17:48:22] <wikibugs>	 (03CR) 10Gehel: "mostly minor comments (though I would very much appreciate simplifying the tests)" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans)
[17:48:49] <wikibugs>	 10Operations, 10Traffic: Renew Digicert Unified in 2019 - https://phabricator.wikimedia.org/T209515 (10BBlack) Also, we should pre-downtime the unified ssl checks in icinga early next week before the US Thanksgiving holidays, so that nobody's pestered by a spam of WARNING alerts, which I believe are set to tri...
[17:49:12] <wikibugs>	 10Operations, 10ops-codfw: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10Papaul)
[17:53:28] <arturo>	 !log T207377 downtime and reboot cloudnet1003 (cloudnet1004 is the active one already)
[17:53:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:31] <stashbot>	 T207377: Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377
[17:58:32] <wikibugs>	 (03PS4) 10Herron: logstash::input::kafka add support for SSL/TLS options [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454)
[17:58:57] <logmsgbot>	 !log hashar@deploy1001 Started scap: php-1.33.0-wmf.4/includes/libs/objectcache/MemcachedPeclBagOStuff.php Add trace to debug memcached bad key error - T209429
[17:59:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:00] <stashbot>	 T209429: memcached error:  A BAD KEY WAS PROVIDED/CHARACTERS OUT OF RANGE  - https://phabricator.wikimedia.org/T209429
[17:59:31] <icinga-wm>	 RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.332 second response time
[17:59:59] <hashar>	 deploying a wmf.4 change and 17:59:18 Updating LocalisationCache for 1.33.0-wmf.3 using 30 thread(s) ...
[18:00:59] <wikibugs>	 (03PS5) 10Herron: logstash::input::kafka add support for SSL/TLS options [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454)
[18:01:34] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10aborrero)
[18:04:06] <wikibugs>	 (03PS6) 10Herron: logstash::input::kafka add support for SSL/TLS options [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454)
[18:05:29] <hashar>	 oh
[18:05:31] <hashar>	 I am rusty
[18:05:36] <hashar>	 ran a full scap :(
[18:05:44] <wikibugs>	 (03CR) 10Herron: [C: 032] logstash::input::kafka add support for SSL/TLS options [puppet] - 10https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[18:09:40] <wikibugs>	 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic: Server Access for 3 formal collaborators - https://phabricator.wikimedia.org/T209298 (10Afandian) public key: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCsJPyatQmAgubnM6ChTohZdEYTOfVJjzpsOtiVrBcwTOVBwEl3qcORlMEF0MMk+BdMfiMd12jmfxGWuOhzJAZ8iPDE9Bk...
[18:09:46] <wikibugs>	 10Operations, 10Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (10mmodell) So the problem seems to have started between 02:00 and 02:12 UTC. There was a fairly large spike in outgoing traffic on eth0 between 02:10 and 02:12 at which point cpu load gradually falls off as a...
[18:12:24] <addshore>	 jouncebot: now
[18:12:24] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 47 minute(s)
[18:12:30] <addshore>	 jouncebot: next
[18:12:30] <jouncebot>	 In 1 hour(s) and 47 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T2000)
[18:15:35] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1314 is CRITICAL: CRITICAL - load average: 82.02, 36.44, 22.51
[18:17:34] <wikibugs>	 (03PS1) 10Bstorm: sonofgridengine: stretch bastions want libboost-dev [puppet] - 10https://gerrit.wikimedia.org/r/473574 (https://phabricator.wikimedia.org/T200557)
[18:17:44] <wikibugs>	 10Operations, 10Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (10mmodell) @thcipriani is going to try installing https://gerrit-review.googlesource.com/admin/repos/plugins/javamelody  to hopefully collect some more useful data about the state of the JVM.  At this point we...
[18:17:46] <wikibugs>	 (03PS1) 10Andrew Bogott: nova: remove labvirt1013 and 1014 from the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/473575
[18:18:21] <addshore>	 hashar: how's your sync going? :)
[18:19:11] <wikibugs>	 (03PS10) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454)
[18:20:34] <icinga-wm>	 PROBLEM - dhclient process on cloudstore1008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.25: Connection reset by peer
[18:20:34] <icinga-wm>	 PROBLEM - Check systemd state on cloudstore1008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.25: Connection reset by peer
[18:20:48] <icinga-wm>	 PROBLEM - Disk space on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused
[18:22:26] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on cloudstore1008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.25: Connection reset by peer
[18:22:26] <icinga-wm>	 PROBLEM - puppet last run on cloudstore1008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.25: Connection reset by peer
[18:24:12] <icinga-wm>	 PROBLEM - DPKG on cloudstore1008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.25: Connection reset by peer
[18:25:14] <icinga-wm>	 RECOVERY - DPKG on cloudstore1008 is OK: All packages OK
[18:25:34] <icinga-wm>	 RECOVERY - Check systemd state on cloudstore1008 is OK: OK - running: The system is fully operational
[18:25:34] <icinga-wm>	 RECOVERY - dhclient process on cloudstore1008 is OK: PROCS OK: 0 processes with command name dhclient
[18:25:42] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] nova: remove labvirt1013 and 1014 from the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/473575 (owner: 10Andrew Bogott)
[18:27:26] <icinga-wm>	 RECOVERY - puppet last run on cloudstore1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:28:08] <icinga-wm>	 PROBLEM - MD RAID on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused
[18:29:58] <hashar>	 addshore: sorry. I ran a full sync :/
[18:30:38] <hashar>	 wait on scap-cdb-rebuild
[18:32:36] <wikibugs>	 (03PS1) 10Vgutierrez: hieradata: Add lvs2010 specific settings [puppet] - 10https://gerrit.wikimedia.org/r/473576 (https://phabricator.wikimedia.org/T209337)
[18:33:04] <logmsgbot>	 !log hashar@deploy1001 Finished scap: php-1.33.0-wmf.4/includes/libs/objectcache/MemcachedPeclBagOStuff.php Add trace to debug memcached bad key error - T209429 (duration: 34m 07s)
[18:33:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:08] <stashbot>	 T209429: memcached error:  A BAD KEY WAS PROVIDED/CHARACTERS OUT OF RANGE  - https://phabricator.wikimedia.org/T209429
[18:33:18] <icinga-wm>	 PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:33:34] <icinga-wm>	 PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:33:38] <icinga-wm>	 PROBLEM - configured eth on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused
[18:34:18] <icinga-wm>	 RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:34:34] <icinga-wm>	 RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:34:50] <hashar>	 addshore: sync completed finally
[18:35:26] <icinga-wm>	 PROBLEM - Check systemd state on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused
[18:35:26] <icinga-wm>	 PROBLEM - dhclient process on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused
[18:37:18] <icinga-wm>	 PROBLEM - puppet last run on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused
[18:37:18] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused
[18:39:01] <wikibugs>	 10Operations, 10decommission, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10jijiki)
[18:39:08] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1314 is OK: OK - load average: 13.12, 15.56, 18.65
[18:39:16] <icinga-wm>	 PROBLEM - DPKG on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused
[18:39:29] <volans>	 vgutierrez: I guess downtime expired for lvs2010
[18:39:35] <wikibugs>	 10Operations, 10decommission, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10jijiki)
[18:39:44] <vgutierrez>	 yeah
[18:39:47] <vgutierrez>	 thx volans 
[18:40:01] <volans>	 np I was about to renew it if you were not around ;)
[18:40:56] <wikibugs>	 10Operations, 10decommission, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10jijiki)
[18:42:36] <icinga-wm>	 PROBLEM - IPMI Sensor Status on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused
[18:43:58] <vgutierrez>	 I know I know
[18:44:24] <icinga-wm>	 PROBLEM - Long running screen/tmux on lvs2010 is CRITICAL: connect to address 10.192.49.7 port 5666: Connection refused
[18:44:58] <wikibugs>	 (03CR) 10BBlack: [C: 031] "LGTM as a starting point for testing our first bnxt_en LVS.  Note interface_tweaks was blindly updated for this card back in https://gerri" [puppet] - 10https://gerrit.wikimedia.org/r/473576 (https://phabricator.wikimedia.org/T209337) (owner: 10Vgutierrez)
[18:45:09] <vgutierrez>	 thx bblack 
[18:45:51] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] hieradata: Add lvs2010 specific settings [puppet] - 10https://gerrit.wikimedia.org/r/473576 (https://phabricator.wikimedia.org/T209337) (owner: 10Vgutierrez)
[18:46:51] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T208706 (10Volans) Adding analytics, Luca and Otto in case it was missed. Also puppet has issues because of RO filesystem.
[18:47:24] <wikibugs>	 10Operations, 10decommission, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10jijiki) "Reimage rdb2001, rdb2002 to stretch and change their role to spare::system" https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/472714/
[18:50:31] <icinga-wm>	 PROBLEM - Disk space on cloudstore1009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.26: Connection reset by peer
[18:51:15] <hashar>	 I am going to deploy a fix for the train blocker ( T209429 ), wait a bit to confirm, then promote group1 to 1.33.0-wmf.4
[18:51:16] <stashbot>	 T209429: memcached error:  A BAD KEY WAS PROVIDED/CHARACTERS OUT OF RANGE  - https://phabricator.wikimedia.org/T209429
[18:51:31] <icinga-wm>	 PROBLEM - ensure kvm processes are running on labvirt1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args /usr/bin/kvm
[18:52:18] <andrewbogott>	 that's me, no cause for alarm
[18:52:28] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on cloudstore1008 is OK: OK: synced at Wed 2018-11-14 18:52:26 UTC.
[18:57:05] <icinga-wm>	 RECOVERY - DPKG on lvs2010 is OK: All packages OK
[18:57:17] <icinga-wm>	 RECOVERY - MD RAID on lvs2010 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[18:57:19] <wikibugs>	 10Operations, 10Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (10hashar) For monitoring it seems we monitor the JVM with JMX. There is a task for Gerrit at:  T184086   The CI Jenkins used to have Java melody but that is apparently no more enabled.
[18:57:47] <icinga-wm>	 RECOVERY - Disk space on lvs2010 is OK: DISK OK
[18:59:39] <icinga-wm>	 RECOVERY - Check systemd state on lvs2010 is OK: OK - running: The system is fully operational
[19:01:27] <icinga-wm>	 RECOVERY - configured eth on lvs2010 is OK: OK - interfaces up
[19:02:31] <icinga-wm>	 RECOVERY - puppet last run on lvs2010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:03:17] <icinga-wm>	 RECOVERY - dhclient process on lvs2010 is OK: PROCS OK: 0 processes with command name dhclient
[19:03:19] <icinga-wm>	 PROBLEM - configured eth on cloudstore1009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.26: Connection reset by peer
[19:05:11] <icinga-wm>	 PROBLEM - Check systemd state on cloudstore1009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.26: Connection reset by peer
[19:05:11] <icinga-wm>	 PROBLEM - dhclient process on cloudstore1009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.26: Connection reset by peer
[19:05:23] <icinga-wm>	 RECOVERY - configured eth on cloudstore1009 is OK: OK - interfaces up
[19:05:49] <icinga-wm>	 RECOVERY - Disk space on cloudstore1009 is OK: DISK OK
[19:05:57] <wikibugs>	 (03Abandoned) 10Dzahn: rename tegmen.wikimedia.org to icinga2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/471897 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn)
[19:06:15] <icinga-wm>	 RECOVERY - Check systemd state on cloudstore1009 is OK: OK - running: The system is fully operational
[19:06:15] <icinga-wm>	 RECOVERY - dhclient process on cloudstore1009 is OK: PROCS OK: 0 processes with command name dhclient
[19:06:50] <wikibugs>	 (03PS1) 10Effie Mouzeli: cumin: create alias for role redis::misc [puppet] - 10https://gerrit.wikimedia.org/r/473582
[19:07:09] <icinga-wm>	 PROBLEM - Host lvs2010 is DOWN: PING CRITICAL - Packet loss = 100%
[19:07:27] <icinga-wm>	 RECOVERY - Host lvs2010 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[19:12:19] <hashar>	 I am going to deploy [mediawiki/core@wmf/1.33.0-wmf.4] JobQueue: Actually return the value from getRootJobCacheKey()
[19:12:19] <hashar>	 https://gerrit.wikimedia.org/r/473579
[19:12:43] <icinga-wm>	 RECOVERY - IPMI Sensor Status on lvs2010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:14:22] <logmsgbot>	 !log hashar@deploy1001 Synchronized php-1.33.0-wmf.4/includes/jobqueue/JobQueue.php: Actually return the value from getRootJobCacheKey() - T209429 (duration: 00m 53s)
[19:14:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:26] <stashbot>	 T209429: memcached error:  A BAD KEY WAS PROVIDED/CHARACTERS OUT OF RANGE  - https://phabricator.wikimedia.org/T209429
[19:14:30] <hashar>	 anomie: hotfix deployed thanks
[19:19:39] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:22:45] <icinga-wm>	 PROBLEM - PyBal BGP sessions are established on lvs2010 is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=codfw%2520prometheus%252Fops
[19:24:11] <hashar>	 jouncebot: next
[19:24:11] <jouncebot>	 In 0 hour(s) and 35 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T2000)
[19:24:29] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team (Watching / External): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10thcipriani) Seems like the metrics reporter plugin hasn't received any updates for 8 months now: https://gerrit.googlesource.com/plugins/metrics-reporter-...
[19:24:33] <vgutierrez>	 that's expected
[19:24:36] <vgutierrez>	 the lvs2010 stuff
[19:24:40] <vgutierrez>	 sorry about the noise
[19:24:56] <hashar>	 I am going to wait for the Americas train window, 30 minutes from now
[19:34:41] <wikibugs>	 (03Abandoned) 10Herron: logstash::input::kafka: add topics_pattern support [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[19:35:00] <wikibugs>	 10Operations, 10Gerrit: Add javamelody to gerrit - https://phabricator.wikimedia.org/T209526 (10thcipriani) p:05Triage>03High
[19:35:17] <wikibugs>	 (03PS1) 10Herron: logstash::input::kafka: add topics_pattern support [puppet] - 10https://gerrit.wikimedia.org/r/473588 (https://phabricator.wikimedia.org/T206454)
[19:35:34] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team (Kanban): Add javamelody to gerrit - https://phabricator.wikimedia.org/T209526 (10thcipriani) a:03thcipriani
[19:36:14] <wikibugs>	 (03PS1) 10Thcipriani: Add JavaMelody plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/473589 (https://phabricator.wikimedia.org/T209526)
[19:37:03] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm)
[19:37:11] <wikibugs>	 (03CR) 10Herron: "abandoned for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/473588/" [puppet] - 10https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[19:37:19] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) 05Open>03Resolved
[19:37:27] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on lvs2010 is OK: OK: synced at Wed 2018-11-14 19:37:25 UTC.
[19:42:22] <wikibugs>	 (03CR) 10Herron: "this is a continuation of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/473138/" [puppet] - 10https://gerrit.wikimedia.org/r/473588 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[19:42:30] <wikibugs>	 (03CR) 10Herron: "PCC is a noop https://puppet-compiler.wmflabs.org/compiler1002/13498/" [puppet] - 10https://gerrit.wikimedia.org/r/473588 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[19:43:29] <wikibugs>	 (03CR) 10Herron: [C: 032] logstash::input::kafka: add topics_pattern support [puppet] - 10https://gerrit.wikimedia.org/r/473588 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[19:47:19] <wikibugs>	 (03PS11) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454)
[19:56:02] <wikibugs>	 (03PS2) 10Bstorm: sonofgridengine: stretch bastions want libboost-dev [puppet] - 10https://gerrit.wikimedia.org/r/473574 (https://phabricator.wikimedia.org/T200557)
[19:57:30] <wikibugs>	 (03CR) 10Bstorm: [C: 032] sonofgridengine: stretch bastions want libboost-dev [puppet] - 10https://gerrit.wikimedia.org/r/473574 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm)
[20:00:04] <jouncebot>	 Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T2000)
[20:04:53] <wikibugs>	 (03CR) 10Herron: [C: 032] logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[20:05:01] <wikibugs>	 (03PS12) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454)
[20:08:17] <hashar>	 going to run train for group1
[20:08:52] <wikibugs>	 (03PS1) 10Hashar: group1 wikis to 1.33.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473597
[20:08:54] <wikibugs>	 (03CR) 10Hashar: [C: 032] group1 wikis to 1.33.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473597 (owner: 10Hashar)
[20:10:42] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473597 (owner: 10Hashar)
[20:12:51] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.4
[20:13:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:07] <wikibugs>	 (03PS1) 10Herron: Revert "logstash: add rsyslog-shipper kafka input config" [puppet] - 10https://gerrit.wikimedia.org/r/473598
[20:13:44] <logmsgbot>	 !log hashar@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.4 (duration: 00m 52s)
[20:13:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:15] <wikibugs>	 (03CR) 10Herron: [C: 032] "reverting because $kafka_config['brokers']['string'] expands to plaintext ports" [puppet] - 10https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[20:14:36] <wikibugs>	 (03CR) 10Herron: [C: 032] Revert "logstash: add rsyslog-shipper kafka input config" [puppet] - 10https://gerrit.wikimedia.org/r/473598 (owner: 10Herron)
[20:16:20] <wikibugs>	 (03CR) 10Cwhite: [C: 032] rename tegmen to icinga2001 in puppet, DHCP, and use stretch [puppet] - 10https://gerrit.wikimedia.org/r/471898 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn)
[20:18:51] <wikibugs>	 (03PS2) 10Cwhite: rename tegmen to icinga2001 in puppet, DHCP, and use stretch [puppet] - 10https://gerrit.wikimedia.org/r/471898 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn)
[20:20:37] <wikibugs>	 (03PS3) 10Cwhite: rename tegmen to icinga2001 in puppet, DHCP, and use stretch [puppet] - 10https://gerrit.wikimedia.org/r/471898 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn)
[20:22:50] <wikibugs>	 (03CR) 10Dzahn: [C: 031] rename tegmen to icinga2001 in puppet, DHCP, and use stretch [puppet] - 10https://gerrit.wikimedia.org/r/471898 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn)
[20:23:07] <wikibugs>	 (03CR) 10Cwhite: [C: 032] rename tegmen to icinga2001 in puppet, DHCP, and use stretch [puppet] - 10https://gerrit.wikimedia.org/r/471898 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn)
[20:23:25] <wikibugs>	 (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473597 (owner: 10Hashar)
[20:25:37] <wikibugs>	 (03PS3) 10Cwhite: rename tegmen.wikimedia.org to icinga2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/472225 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn)
[20:25:46] <hashar>	 group1 looks good so far
[20:25:53] <wikibugs>	 (03CR) 10Cwhite: [C: 032] rename tegmen.wikimedia.org to icinga2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/472225 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn)
[20:27:45] <wikibugs>	 (03PS2) 10Cwhite: smokeping: replace tegmen with icinga2001 target [puppet] - 10https://gerrit.wikimedia.org/r/471899 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn)
[20:28:24] <wikibugs>	 (03PS1) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/473607 (https://phabricator.wikimedia.org/T206454)
[20:28:43] <wikibugs>	 (03CR) 10Cwhite: [C: 032] smokeping: replace tegmen with icinga2001 target [puppet] - 10https://gerrit.wikimedia.org/r/471899 (https://phabricator.wikimedia.org/T208824) (owner: 10Dzahn)
[20:29:22] <mutante>	 XioNoX: fyi, we are renaming a smokeping target(separate from the other day)
[20:29:36] <mutante>	 since this is the same server just changing names.. we have to do it at once.. with the DNS change
[20:29:46] <mutante>	 might be an alert but should be gone soon
[20:43:05] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cwhite on cumin1001.eqiad.wmnet for hosts: ` tegmen.mgmt.codfw.wmnet ` The log can be found in `...
[20:43:07] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['tegmen.mgmt.codfw.wmnet'] `  Of which those **FAILED**: ` ['tegmen.mgmt.codfw.wmnet'] `
[20:43:33] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cwhite on cumin1001.eqiad.wmnet for hosts: ` tegmen.wikimedia.org ` The log can be found in `/va...
[20:44:46] <anomie>	 hashar: Does group 1 seem stable enough that I can start some maintenance scripts, or should I just wait for tomorrow?
[20:47:29] <wikibugs>	 (03PS2) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/473607 (https://phabricator.wikimedia.org/T206454)
[20:51:58] <wikibugs>	 (03CR) 10Volans: [C: 031] "Syntactically correct, I'll leave it to you and Moritz for the logically correct, maybe a global one that includes both roles and both DC" [puppet] - 10https://gerrit.wikimedia.org/r/473582 (owner: 10Effie Mouzeli)
[20:52:25] <wikibugs>	 (03PS1) 10Thcipriani: Add javamelody plugin and dependencies [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/473610 (https://phabricator.wikimedia.org/T209526)
[20:53:59] <wikibugs>	 (03PS3) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/473607 (https://phabricator.wikimedia.org/T206454)
[20:56:22] <wikibugs>	 (03PS1) 10Thcipriani: Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526)
[20:56:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani)
[20:57:21] <wikibugs>	 (03CR) 1020after4: [C: 032] Add JavaMelody plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/473589 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani)
[20:57:29] <wikibugs>	 (03CR) 1020after4: [C: 032] Add javamelody plugin and dependencies [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/473610 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani)
[20:58:15] <wikibugs>	 (03PS2) 10Thcipriani: Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526)
[20:58:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani)
[20:59:16] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cwhite on cumin1001.eqiad.wmnet for hosts: ` tegmen.wikimedia.org ` The log can be found in `/va...
[20:59:27] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['tegmen.wikimedia.org'] `  Of which those **FAILED**: ` ['tegmen.wikimedia.org'] `
[21:00:04] <jouncebot>	 cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor I � Unicode. All rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181114T2100).
[21:00:09] <wikibugs>	 (03PS3) 10Thcipriani: Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526)
[21:00:40] <wikibugs>	 (03PS4) 10Herron: logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/473607 (https://phabricator.wikimedia.org/T206454)
[21:00:54] <wikibugs>	 (03CR) 1020after4: [C: 031] Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani)
[21:04:16] <wikibugs>	 (03CR) 10Thcipriani: "puppet compiler run: https://puppet-compiler.wmflabs.org/compiler1002/13502/" [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani)
[21:04:35] <wikibugs>	 (03CR) 10Herron: [C: 032] logstash: add rsyslog-shipper kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/473607 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron)
[21:05:58] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cwhite on cumin1001.eqiad.wmnet for hosts: ` tegmen.wikimedia.org ` The log can be found in `/va...
[21:06:00] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['tegmen.wikimedia.org'] `  Of which those **FAILED**: ` ['tegmen.wikimedia.org'] `
[21:09:03] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cwhite on cumin1001.eqiad.wmnet for hosts: ` tegmen.wikimedia.org ` The log can be found in `/va...
[21:14:39] <icinga-wm>	 PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:15:17] <wikibugs>	 (03PS1) 10Herron: kafka_config: add ssl_string to documentation section [puppet] - 10https://gerrit.wikimedia.org/r/473614
[21:20:31] <wikibugs>	 (03PS4) 10Dzahn: Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani)
[21:20:33] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), 10User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Imarlier) @Gehel have the patches referenced above been deployed?
[21:20:43] <wikibugs>	 10Operations, 10Performance-Team, 10Wikidata, 10Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (10Imarlier) Doesn't appear to have solved the issue, but I need to verify that the patches have actually been deployed: https://logstash.wikimedia.or...
[21:20:47] <wikibugs>	 (03PS1) 10Herron: logstash: remove unnecessary newline from kafka input template [puppet] - 10https://gerrit.wikimedia.org/r/473616
[21:21:16] <wikibugs>	 (03CR) 10Thcipriani: [V: 032] Add JavaMelody plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/473589 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani)
[21:21:26] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic: Stop oversampling Asian countries - https://phabricator.wikimedia.org/T204365 (10Imarlier) 05Open>03Resolved Resolved a long time ago, just forgot to close out the ticket.
[21:21:36] <wikibugs>	 (03CR) 10Thcipriani: [V: 032] Add javamelody plugin and dependencies [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/473610 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani)
[21:21:54] <wikibugs>	 (03CR) 10Herron: [C: 032] kafka_config: add ssl_string to documentation section [puppet] - 10https://gerrit.wikimedia.org/r/473614 (owner: 10Herron)
[21:24:17] <wikibugs>	 (03PS2) 10Herron: logstash: remove unnecessary newline from kafka input template [puppet] - 10https://gerrit.wikimedia.org/r/473616
[21:25:10] <wikibugs>	 (03CR) 10Herron: [C: 032] logstash: remove unnecessary newline from kafka input template [puppet] - 10https://gerrit.wikimedia.org/r/473616 (owner: 10Herron)
[21:26:18] <wikibugs>	 (03CR) 10Dzahn: [C: 032] Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani)
[21:26:25] <wikibugs>	 (03PS5) 10Dzahn: Gerrit: JavaMelody library dependency symlink [puppet] - 10https://gerrit.wikimedia.org/r/473612 (https://phabricator.wikimedia.org/T209526) (owner: 10Thcipriani)
[21:26:54] <logmsgbot>	 !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@ab2fa18]: deploy javamelody on gerrit2001
[21:26:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:05] <logmsgbot>	 !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@ab2fa18]: deploy javamelody on gerrit2001 (duration: 00m 11s)
[21:27:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:55] <mutante>	 thcipriani: "deploying" the symlink
[21:28:10] <thcipriani>	 mutante: thank you!
[21:28:44] <logmsgbot>	 !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@ab2fa18]: deploy javamelody on cobalt
[21:28:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:54] <logmsgbot>	 !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@ab2fa18]: deploy javamelody on cobalt (duration: 00m 09s)
[21:28:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:06] <mutante>	 Notice: /Stage[main]/Gerrit::Jetty/File[/var/lib/gerrit2/review_site/lib/javamelody-deps_deploy.jar]/ensure: created
[21:30:08] <thcipriani>	 great!
[21:30:37] <wikibugs>	 10Operations, 10Performance-Team, 10Wikidata, 10Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (10Smalyshev) The patch has been deployed, and doesn't look like it prevents the issue:  ` 18:05:29.346 [update 4] WARN  org.wikidata.query.rdf.tool.U...
[21:31:53] <wikibugs>	 10Operations, 10Performance-Team, 10Wikidata, 10Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (10Smalyshev) > So, an interesting thing: in at least some of these cases, there is a web request that is making it to wikidata, and that is returning...
[21:33:12] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), 10User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Smalyshev) @Imarlier yes the patches have been deployed though we don't use RC A...
[21:34:39] <thcipriani>	 !log restart gerrit to load JavaMelody dependency library
[21:34:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:20] <wikibugs>	 10Operations, 10Performance-Team, 10Wikidata, 10Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (10Imarlier) >>! In T207718#4748289, @Smalyshev wrote: >> So, an interesting thing: in at least some of these cases, there is a web request that is ma...
[21:36:37] <d3r1ck>	 thcipriani: Nice! :)
[21:37:21] <wikibugs>	 10Operations, 10ops-codfw, 10ops-eqiad, 10ops-ulsfo: Devices with wmf* names and status active - https://phabricator.wikimedia.org/T209074 (10Cmjohnson) @ayounsi should they all have "inventory" status?
[21:39:59] <icinga-wm>	 PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 3 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy]
[21:41:05] <icinga-wm>	 PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[21:41:33] <icinga-wm>	 RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[21:44:53] <icinga-wm>	 PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:50:04] <wikibugs>	 10Operations, 10Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (10thcipriani)
[21:50:10] <wikibugs>	 10Operations, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Add javamelody to gerrit - https://phabricator.wikimedia.org/T209526 (10thcipriani) 05Open>03Resolved Now available for gerrit admins at: https://gerrit.wikimedia.org/r/monitoring
[21:58:49] <wikibugs>	 10Operations, 10Performance-Team, 10Wikidata, 10Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (10Smalyshev) > It could be -- how quickly does it retry? Immediately? Or is there a delay?  I don't think there's a delay for NoHttpResponseException...
[22:02:32] <wikibugs>	 (03PS3) 10Zoranzoki21: Add new throttle rule for Art+Feminism Event on 2018-11-17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473255 (https://phabricator.wikimedia.org/T209324)
[22:03:08] <wikibugs>	 10Operations, 10ops-codfw, 10ops-eqiad, 10ops-ulsfo: Devices with wmf* names and status active - https://phabricator.wikimedia.org/T209074 (10RobH) spares should be 'planned'  https://wikitech.wikimedia.org/wiki/Server_Lifecycle#States
[22:03:37] <wikibugs>	 (03PS1) 10Bstorm: sonofgridengine: fighting through the dependency quirks [puppet] - 10https://gerrit.wikimedia.org/r/473628 (https://phabricator.wikimedia.org/T200557)
[22:04:43] <wikibugs>	 (03CR) 10Bstorm: [C: 032] sonofgridengine: fighting through the dependency quirks [puppet] - 10https://gerrit.wikimedia.org/r/473628 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm)
[22:06:39] <icinga-wm>	 RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[22:06:39] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-Services, 10DC-Ops: labvirt1018 -> cloudvirt1018: update physical label, network port description, netbox - https://phabricator.wikimedia.org/T207319 (10Andrew) a:05Andrew>03Cmjohnson
[22:09:23] <wikibugs>	 (03PS4) 10Zoranzoki21: Disable FlaggedRevs, enable RC patrol and add rights on srwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472745 (https://phabricator.wikimedia.org/T209251)
[22:10:39] <icinga-wm>	 RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[22:11:43] <icinga-wm>	 RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[22:14:29] <mutante>	 those were gerrit (git clone) related
[22:14:45] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['icinga2001.wikimedia.org'] `  Of which those **FAILED**: ` ['icinga2001.wikimedia.org'] `
[22:17:15] <addshore>	 hi mutante :)
[22:18:27] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) just got off the phone with HP and they are stating that they are not seeing any issues with the raid battery in the logs I have sent.  They suggest it's our reporting tool.
[22:18:30] <hashar>	 anomie: sorry, I missed your ping. I think group1 is stable so far, I haven't noticed much
[22:19:14] <hashar>	 anomie: so I guess it is fine to run maintenance scripts ):
[22:19:15] <hashar>	 :)
[22:21:36] <hashar>	 I am off for some sleep &
[22:25:25] <Platonides>	 the page https://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Map_of_Virginia_highlighting_Arlington_County.svg/120px-Map_of_Virginia_highlighting_Arlington_County.svg.png is giving a "Error: 429, Too Many Requests" error
[22:25:55] <Platonides>	 I guess there should be a grafana dashboard showing the scalers load, but I don't see it
[22:26:39] <Platonides>	 maybe someone has an idea what's going on?
[22:35:01] <icinga-wm>	 PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:39:05] <bawolff>	 Often that just means a render failure followed by so many attempts that the rate limiter kicked in (in that scenario every view is a render attempt)
[22:40:25] <Platonides>	 so the 429 is just masking the real error
[22:40:37] <Platonides>	 makes sense
[22:41:47] <icinga-wm>	 RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[22:42:06] <Platonides>	 is that for the base image?
[22:42:23] <Platonides>	 it also shows a 429 if I change the url to a different size
[22:42:47] * bawolff knowledge is pre-Thumbor and outdated
[22:44:31] <icinga-wm>	 RECOVERY - Long running screen/tmux on lvs2010 is OK: OK: No SCREEN or tmux processes detected.
[22:44:32] * Platonides didn't know about thumbor…
[22:45:48] <Platonides>	 reading https://wikitech.wikimedia.org/wiki/Thumbor#Throttling I understand it is throttling the original
[22:47:55] <Platonides>	 so, for some reason, thumbor is not generating a thumbnail for this
[22:48:09] <Platonides>	 the next question should be “why?”
[22:48:26] <Platonides>	 (too many points to render? :S)
[22:51:39] <bawolff>	 If it's a huge file, running out of time, OOM, etc. would be possible reasons
[22:52:22] <bawolff>	 otherwise it would mean that librsvg is crashing for some reason
[22:52:37] <bawolff>	 I have no idea where thumbor logs are
[22:58:47] <Platonides>	 "Thumbor logs go to /srv/log/thumbor on the Thumbor servers."
[22:59:13] <Platonides>	 which isn't helpful for laymen like me :P
[22:59:35] <Platonides>	 I would expect it is also sent out
[23:12:33] <wikibugs>	 10Operations, 10JADE, 10TechCom, 10Epic, and 3 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381 (10Harej) p:05Triage>03Normal
[23:33:22] <wikibugs>	 (03PS1) 10Thcipriani: Gerrit: add basic robots.txt for proxy [puppet] - 10https://gerrit.wikimedia.org/r/473638 (https://phabricator.wikimedia.org/T209456)
[23:33:43] <icinga-wm>	 PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:39:28] <wikibugs>	 (03CR) 10Paladox: [C: 031] Gerrit: add basic robots.txt for proxy [puppet] - 10https://gerrit.wikimedia.org/r/473638 (https://phabricator.wikimedia.org/T209456) (owner: 10Thcipriani)
[23:39:31] <wikibugs>	 10Operations, 10JADE, 10TechCom, 10Epic, and 3 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381 (10awight)
[23:39:35] <wikibugs>	 10Operations, 10DBA, 10JADE, 10Epic, and 2 others: [Epic] Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (10awight) 05Open>03Resolved This was addressed for now, by an agreement between our team and SRE to not install JADE on wikis with revision table size >= 100GB.  Th...
[23:41:39] <icinga-wm>	 RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[23:49:15] <wikibugs>	 (03PS1) 10Bstorm: sonofgridengine: Try directly setting a docker install [puppet] - 10https://gerrit.wikimedia.org/r/473641 (https://phabricator.wikimedia.org/T200557)
[23:50:28] <wikibugs>	 (03CR) 10Bstorm: [C: 032] sonofgridengine: Try directly setting a docker install [puppet] - 10https://gerrit.wikimedia.org/r/473641 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm)
[23:51:29] <wikibugs>	 (03PS4) 10Volans: Add Icinga module [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884)
[23:51:57] <wikibugs>	 (03CR) 10Volans: "Done, thanks for the review, totally agree." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans)
[23:52:57] <icinga-wm>	 PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.