[00:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T0000).
[00:00:04] <jouncebot>	 dmaza: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:55] <dmaza>	 I'm here
[00:01:00] <icinga-wm>	 ACKNOWLEDGEMENT - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T209395
[00:02:48] <wikibugs>	 (PS1) Catrope: Add throttle exception for Wikimedia event on December 6th [mediawiki-config] - https://gerrit.wikimedia.org/r/476432 (https://phabricator.wikimedia.org/T210681)
[00:03:16] <RoanKattouw>	 I can do the SWAT
[00:03:23] <RoanKattouw>	 That also lets me add in my own patch :)
[00:03:46] <dmaza>	 ;)
[00:04:24] <RoanKattouw>	 dmaza: Do you also have a config patch to enable $wgEnableBlockNoticeStats, or is it just a dark deploy for now?
[00:04:39] <wikibugs>	 (CR) Catrope: [C: 2] Add throttle exception for Wikimedia event on December 6th [mediawiki-config] - https://gerrit.wikimedia.org/r/476432 (https://phabricator.wikimedia.org/T210681) (owner: Catrope)
[00:04:43] <dmaza>	 we are not enabling yet
[00:05:52] <wikibugs>	 Operations, Traffic: lvs1006 down - https://phabricator.wikimedia.org/T210683 (Dzahn)
[00:05:58] <greg-g>	 RoanKattouw: cheater :P
[00:06:01] <wikibugs>	 Operations, Traffic: lvs1006 down - https://phabricator.wikimedia.org/T210683 (Dzahn) p:Triage>Normal
[00:06:06] <wikibugs>	 (Merged) jenkins-bot: Add throttle exception for Wikimedia event on December 6th [mediawiki-config] - https://gerrit.wikimedia.org/r/476432 (https://phabricator.wikimedia.org/T210681) (owner: Catrope)
[00:06:15] <icinga-wm>	 ACKNOWLEDGEMENT - Host lvs1006 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T210683
[00:06:59] <icinga-wm>	 RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
[00:07:21] <wikibugs>	 (CR) Faidon Liambotis: [C: 2] Misc pylint fixes [software/keyholder] - https://gerrit.wikimedia.org/r/476429 (owner: Faidon Liambotis)
[00:07:53] <wikibugs>	 (Merged) jenkins-bot: Misc pylint fixes [software/keyholder] - https://gerrit.wikimedia.org/r/476429 (owner: Faidon Liambotis)
[00:08:51] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/throttle.php: T210681 (duration: 01m 04s)
[00:08:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:58] <stashbot>	 T210681: Throttle exemption for event at Wikimedia office on 2018-12-06 - https://phabricator.wikimedia.org/T210681
[00:11:53] <icinga-wm>	 PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
[00:12:17] <wikibugs>	 (CR) jenkins-bot: Add throttle exception for Wikimedia event on December 6th [mediawiki-config] - https://gerrit.wikimedia.org/r/476432 (https://phabricator.wikimedia.org/T210681) (owner: Catrope)
[00:12:54] <wikibugs>	 (CR) Dzahn: [C: -1] "yea, simpler now but still conflicting with apache module  for now https://puppet-compiler.wmflabs.org/compiler1002/13772/mwmaint1002.eqia" [puppet] - https://gerrit.wikimedia.org/r/416751 (owner: Dzahn)
[00:14:25] <wikibugs>	 (PS11) Dzahn: analytics_cluster::webserver: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/416742
[00:17:27] <dmaza>	 RoanKattouw: let me know when it's done
[00:17:41] <icinga-wm>	 RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[00:18:12] <RoanKattouw>	 dmaza: It failed Jenkins, trying again now
[00:18:55] <dmaza>	 👍
[00:23:47] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/13773/thorium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/416742 (owner: Dzahn)
[00:47:45] <dmaza>	 RoanKattouw: are you still around?
[00:47:52] <RoanKattouw>	 Yes, it just merged
[00:48:00] <RoanKattouw>	 Pulling it onto mwdebug1002 now
[00:48:49] <RoanKattouw>	 dmaza: OK, ready for testing on mwdebug now
[00:48:55] <RoanKattouw>	 Insofar as there is anything to test
[00:49:09] <dmaza>	 not really.. just wanna make sure nothing is on fire
[00:49:17] <dmaza>	 one sec
[00:52:42] <dmaza>	 RoanKattouw: everything looks good
[00:53:40] <RoanKattouw>	 OK, syncing
[00:55:38] <logmsgbot>	 !log catrope@deploy1001 Synchronized php-1.33.0-wmf.6/includes/: Add block notice stats on EditPage (T201718) (duration: 01m 14s)
[00:55:59] <dmaza>	 thank you
[00:57:42] <stashbot>	 catrope@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[00:57:43] <stashbot>	 T201718: Tracking blocks: Log when the desktop VisualEditor + 2010 wikitext editor block notice is displayed  - https://phabricator.wikimedia.org/T201718
[01:00:04] <jouncebot>	 twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T0100).
[01:04:50] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Data-Services, cloud-services-team (Kanban): Decommission labstore100[12] and their disk shelves - https://phabricator.wikimedia.org/T187456 (bd808)
[01:12:50] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Data-Services, cloud-services-team (Kanban): Decommission labstore100[12] and their disk shelves - https://phabricator.wikimedia.org/T187456 (bd808)
[01:13:04] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Data-Services, cloud-services-team (Kanban): Decommission labstore100[12] and their disk shelves - https://phabricator.wikimedia.org/T187456 (bd808)
[01:18:44] <wikibugs>	 Operations, Analytics, Security-Team, WMF-Legal, Software-Licensing: Can exfat be used in WMF production? - https://phabricator.wikimedia.org/T210667 (chasemp) Small bit of background from my perspective, I had discussed this on hangout with a few folks who I will let acknowledge their own le...
[01:24:08] <wikibugs>	 (CR) Dzahn: [C: 1] "based on old comments on PS7 "if pcc is ok feel free to merge :)" planning to go ahead with this and merge tomorrow" [puppet] - https://gerrit.wikimedia.org/r/416742 (owner: Dzahn)
[02:14:37] <wikibugs>	 (CR) BBlack: [C: 1] cache/trafficserver: replace rutherfordium with people1001, backend and director [puppet] - https://gerrit.wikimedia.org/r/475236 (https://phabricator.wikimedia.org/T210036) (owner: Dzahn)
[02:35:02] <wikibugs>	 Operations, Traffic: lvs1006 down - https://phabricator.wikimedia.org/T210683 (BBlack) Yeah I got busy and dropped this.  Console was unresponsive initially.  Reboot produced a responsive console, but wasn't able to initially ssh into the host (and no icinga recovery).  With the fresh reboot, eth0 has no...
[02:35:55] <wikibugs>	 Operations, ops-eqiad, Traffic, netops: lvs1006 down - https://phabricator.wikimedia.org/T210683 (BBlack)
[02:54:25] <wikibugs>	 Operations, ops-eqiad, Traffic, netops: lvs1006 down - https://phabricator.wikimedia.org/T210683 (ayounsi) a:Cmjohnson Port looks down (but not disabled) on the switch side, I'd say next step is for Chris to try re-seating then different cable/ports/etc.
[03:33:53] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 884.97 seconds
[04:00:24] <wikibugs>	 (CR) Andrew Bogott: [C: 1] openstack: Move Keystone DB credentials to my.cnf file [puppet] - https://gerrit.wikimedia.org/r/476109 (https://phabricator.wikimedia.org/T210404) (owner: GTirloni)
[04:03:00] <wikibugs>	 (CR) Andrew Bogott: "I don't know what phragile is (I'm probably a member because I created the project) but overall this looks OK to me.  I expect that it wil" [puppet] - https://gerrit.wikimedia.org/r/475032 (owner: Dzahn)
[04:14:29] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 192.74 seconds
[04:26:27] <icinga-wm>	 PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:31:17] <icinga-wm>	 RECOVERY - Router interfaces on cr1-esams is OK: OK: host 91.198.174.245, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:59:09] <icinga-wm>	 RECOVERY - ensure kvm processes are running on labvirt1011 is OK: PROCS OK: 1 process with regex args /usr/bin/kvm
[05:01:31] <icinga-wm>	 PROBLEM - ensure kvm processes are running on labvirt1011 is CRITICAL: PROCS CRITICAL: 0 processes with regex args /usr/bin/kvm
[05:03:07] <andrewbogott>	 those labvirt1011 pages are weird but not important
[05:52:25] <wikibugs>	 Operations, Traffic: INMARSAT geolocates to the UK, leading to requests going to esams - https://phabricator.wikimedia.org/T209785 (Reedy) Same IP going back, 161.30.203.16  ` Reedys-MacBook-Pro:~ reedy$ dig +short reflect.wikimedia.org 161.30.203.0 `  But it does seem to be going to ulsfo now, I guess a...
[06:11:30] <marostegui>	 !log Deploy schema change on s6 primary master - T86338
[06:11:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:34] <stashbot>	 T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[06:13:02] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/476451 (https://phabricator.wikimedia.org/T202167)
[06:15:01] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/OATHAuth: revert logging (loldeployingfromaplane) (duration: 00m 59s)
[06:15:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:11] <icinga-wm>	 PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
[06:20:23] <icinga-wm>	 RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[06:27:00] <wikibugs>	 (PS1) Marostegui: Revert "dump_section.py: Increase retention from 18 days to 45" [puppet] - https://gerrit.wikimedia.org/r/476452
[06:27:15] <wikibugs>	 (PS2) Marostegui: Revert "dump_section.py: Increase retention from 18 days to 45" [puppet] - https://gerrit.wikimedia.org/r/476452
[06:28:07] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/476451 (https://phabricator.wikimedia.org/T202167) (owner: Marostegui)
[06:28:20] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "dump_section.py: Increase retention from 18 days to 45" [puppet] - https://gerrit.wikimedia.org/r/476452 (owner: Marostegui)
[06:29:10] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/476451 (https://phabricator.wikimedia.org/T202167) (owner: Marostegui)
[06:29:15] <icinga-wm>	 PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:29:37] <icinga-wm>	 PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/DigiCert_High_Assurance_CA-3.crt]
[06:29:41] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/476451 (https://phabricator.wikimedia.org/T202167) (owner: Marostegui)
[06:31:07] <icinga-wm>	 PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/update-library.R]
[06:31:09] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1088 T202167 (duration: 00m 56s)
[06:31:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:13] <stashbot>	 T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[06:31:19] <marostegui>	 !log Deploy schema change on db1088 - T202167
[06:31:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:21] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - https://gerrit.wikimedia.org/r/476453
[06:33:19] <icinga-wm>	 PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modprobe.d/nf_conntrack.conf]
[06:33:28] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - https://gerrit.wikimedia.org/r/476453 (owner: Marostegui)
[06:34:30] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - https://gerrit.wikimedia.org/r/476453 (owner: Marostegui)
[06:35:35] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1088 T202167 (duration: 00m 53s)
[06:35:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:00] <marostegui>	 !log Deploy schema change on db1061 (s6 master) - T202167
[06:36:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:43] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - https://gerrit.wikimedia.org/r/476453 (owner: Marostegui)
[06:46:57] <icinga-wm>	 PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
[06:48:25] <marostegui>	 ^ that host rebooted itself (I am checking the idrac)
[06:48:31] <marostegui>	 It is booting up now
[06:49:25] <icinga-wm>	 RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
[06:53:05] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:55:10] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool pc1006 [mediawiki-config] - https://gerrit.wikimedia.org/r/476454 (https://phabricator.wikimedia.org/T208383)
[06:56:59] <icinga-wm>	 RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:57:12] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool pc1006 [mediawiki-config] - https://gerrit.wikimedia.org/r/476454 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[06:58:14] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool pc1006 [mediawiki-config] - https://gerrit.wikimedia.org/r/476454 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[06:59:15] <icinga-wm>	 RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:29] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1006 - T208383 (duration: 00m 53s)
[06:59:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:33] <stashbot>	 T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383
[06:59:42] <marostegui>	 !log Stop MySQL on pc1006 to clone pc1009 - T208383
[06:59:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:20] <icinga-wm>	 RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:00:41] <icinga-wm>	 RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:05:51] <wikibugs>	 Operations, Analytics, Security-Team, WMF-Legal, Software-Licensing: Can exfat be used in WMF production? - https://phabricator.wikimedia.org/T210667 (Joe) >>! In T210667#4783582, @Legoktm wrote: >>>! In T210667#4783289, @MoritzMuehlenhoff wrote: >> exfat-fuse itself is free software (GPL) an...
[07:08:07] <icinga-wm>	 PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
[07:09:22] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool pc1006 [mediawiki-config] - https://gerrit.wikimedia.org/r/476454 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[07:10:31] <icinga-wm>	 RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms
[07:11:45] <wikibugs>	 Operations: ms-be2047 rebooting itself - https://phabricator.wikimedia.org/T210697 (Marostegui)
[07:12:03] <wikibugs>	 Operations, ops-codfw: ms-be2047 rebooting itself - https://phabricator.wikimedia.org/T210697 (Marostegui) p:Triage>Normal
[07:22:55] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:26:34] <wikibugs>	 (PS3) Vgutierrez: gerrit: Use the certcentral managed TLS certificate [puppet] - https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050)
[07:29:54] <wikibugs>	 (CR) Vgutierrez: "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1002/13774/" [puppet] - https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050) (owner: Vgutierrez)
[07:52:36] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 1] "See my comment; this would be a -1 but Valentin guaranteed it's a temporary hack, so LGTM once you add a todo note there." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050) (owner: Vgutierrez)
[07:57:00] <godog>	 mhhh ores is in trouble
[07:57:36] <godog>	 i.e. elevated 500s
[07:57:52] <wikibugs>	 (PS4) Vgutierrez: gerrit: Use the certcentral managed TLS certificate [puppet] - https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050)
[08:01:45] <wikibugs>	 (CR) Vgutierrez: [C: 2] gerrit: Use the certcentral managed TLS certificate [puppet] - https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050) (owner: Vgutierrez)
[08:02:00] <wikibugs>	 (PS5) Vgutierrez: gerrit: Use the certcentral managed TLS certificate [puppet] - https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050)
[08:04:22] <vgutierrez>	 !log replacing TLS certificates in gerrit - T207050
[08:04:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:26] <stashbot>	 T207050: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050
[08:05:25] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 724 bytes in 0.001 second response time
[08:05:49] <vgutierrez>	 willikins:~ vgutierrez$ echo | openssl s_client -servername gerrit.wikimedia.org -connect gerrit.wikimedia.org:443 2>/dev/null | openssl x509 -noout -dates
[08:05:49] <vgutierrez>	 notBefore=Nov 28 14:45:53 2018 GMT
[08:08:35] <godog>	 I'm opening a task for ores 500s, there's exceptions in logs but I'm not sure where to start debugging
[08:09:39] <wikibugs>	 (PS2) Muehlenhoff: Remove Diamond from redis::misc systems [puppet] - https://gerrit.wikimedia.org/r/476226 (https://phabricator.wikimedia.org/T183454)
[08:13:54] <wikibugs>	 Operations, ORES, Scoring-platform-team: ORES 500s since 2018-11-29 6:25 - https://phabricator.wikimedia.org/T210701 (fgiunchedi)
[08:14:14] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Remove Diamond from redis::misc systems [puppet] - https://gerrit.wikimedia.org/r/476226 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[08:14:52] <godog>	 Amir1: around? T210701
[08:14:53] <stashbot>	 T210701: ORES 500s since 2018-11-29 6:25 - https://phabricator.wikimedia.org/T210701
[08:16:39] <icinga-wm>	 PROBLEM - Apache HTTP on mw1261 is CRITICAL: connect to address 10.64.0.56 and port 80: Connection refused
[08:17:05] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.008 second response time
[08:17:07] <icinga-wm>	 PROBLEM - HHVM rendering on mw1261 is CRITICAL: connect to address 10.64.0.56 and port 80: Connection refused
[08:17:07] <icinga-wm>	 PROBLEM - Check systemd state on mw1261 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:17:11] <_joe_>	 mw1261 is me
[08:17:15] <_joe_>	 sorry for the noise
[08:17:22] <_joe_>	 it's depooled, it will take some time to fix
[08:37:11] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.058 second response time
[08:37:13] <icinga-wm>	 RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 75945 bytes in 0.154 second response time
[08:37:17] <icinga-wm>	 RECOVERY - Check systemd state on mw1261 is OK: OK - running: The system is fully operational
[08:37:57] <icinga-wm>	 RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.070 second response time
[08:39:28] <Amir1>	 godog: I just woke up 
[08:39:42] <godog>	 Amir1: good morning!
[08:39:47] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 75985 bytes in 0.204 second response time
[08:39:52] <Amir1>	 Someone deployed a change. Where are getting it
[08:39:59] <Amir1>	 Is it prod?
[08:40:20] <godog>	 it is yeah
[08:40:39] <akosiaris>	 Amir1:  https://phabricator.wikimedia.org/T210701
[08:40:51] <Amir1>	 Shoot
[08:41:00] <akosiaris>	 it's not just itemequality btw, nothing specific to it
[08:41:07] <akosiaris>	 I also see goodfaith and so on
[08:41:14] <Amir1>	 This is on master but we didn't deploy it
[08:41:17] <_joe_>	 sorry, I have a question
[08:41:28] <_joe_>	 why don't we get any alert for this?
[08:41:45] <_joe_>	 well this can be answered later
[08:42:12] <Amir1>	 _joe_: it seems this happens for some cases and not all
[08:42:35] <_joe_>	 Amir1: looking at grafana, it seems nothing works, but  well
[08:42:36] <akosiaris>	 _joe_: there is one (albeit it's just one and not really helping). 
[08:42:49] <akosiaris>	 https://grafana.wikimedia.org/dashboard/db/ores grafana alert  CRITICAL  2018-11-29 08:41:52  0d 2h 6m 35s  3/3  CRITICAL: https://grafana.wikimedia.org/dashboard/db/ores is alerting: 5xx rate (Change prop) alert.
[08:42:56] <akosiaris>	 I missed it too btw
[08:43:00] <akosiaris>	 never really saw it
[08:43:04] <_joe_>	 so did anyone try to restart ores workers?
[08:43:04] <godog>	 there's also the availability alert, which is how I noticed
[08:43:13] <_joe_>	 it's clear something went very wrong at logrotate time
[08:43:36] <Amir1>	 500 is not in this https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=47&fullscreen&orgId=1
[08:43:45] <Amir1>	 I should add it, and then make a graph
[08:43:55] <_joe_>	 akosiaris: that alert should page imho
[08:43:56] <Amir1>	 anyway, let's get back to the main issue in hand
[08:44:21] <akosiaris>	 yeah let's write that we need better alerting in the incident report
[08:44:29] <akosiaris>	 but let's actually figure out the problem now
[08:44:35] <akosiaris>	    Active: active (running) since Wed 2018-11-28 10:13:46 UTC; 22h ago
[08:44:42] <akosiaris>	 that's the celery worker on scb1001
[08:44:44] <akosiaris>	 em
[08:44:45] <_joe_>	 ok, can I try to restart one of the workers?
[08:44:46] <akosiaris>	 ores1001
[08:44:51] <_joe_>	 akosiaris: same on ores1003
[08:44:54] <_joe_>	 22 h ago
[08:45:21] <akosiaris>	 lemme gather some stats whether all workers are in the same state
[08:45:59] <akosiaris>	 yup, all saying 22 hours ago
[08:46:06] <_joe_>	 akosiaris: so probably some release
[08:46:13] <akosiaris>	 I was expecting some issue with logrotate tbh
[08:46:21] <akosiaris>	 the 6:25 time is awfully weird
[08:46:28] <wikibugs>	 Operations, Services: Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (MoritzMuehlenhoff)
[08:46:28] <_joe_>	 I'll restart uwsgi on ores1003
[08:46:41] <_joe_>	 akosiaris: yeah let's see what logrotate does
[08:46:53] <akosiaris>	 uwsgi-ores is also since 22h ago
[08:46:55] <_joe_>	 -rw-r--r-- 1 www-data www-data      6004 Nov 29 06:25 app.log.1
[08:46:56] <akosiaris>	 across the fleet
[08:47:04] <_joe_>	 so we rotate at 6:25
[08:47:26] <_joe_>	     postrotate
[08:47:28] <_joe_>	      service uwsgi-ores reload
[08:47:30] <_joe_>	     endscript
[08:47:37] <_joe_>	 so yeah we did a reload at 6:25
[08:47:41] <akosiaris>	 it was already complaining about something else though
[08:47:43] <akosiaris>	 requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.wikidata.org', port=443): Read timed out. (read timeout=5.0)
[08:47:44] <_joe_>	 that's what caused the issue
[08:47:51] <akosiaris>	 that's before the rotate
[08:48:10] <_joe_>	 but the rotate is when all went down
[08:48:18] <Amir1>	 akosiaris: that happens from time to time
[08:48:32] <Amir1>	 https://logstash.wikimedia.org/goto/543257bf91de9e695e5344a7dc382850
[08:48:42] <_joe_>	 akosiaris: let's restart one worker; if it recovers, we can depool a few for debugging and restore the service
[08:48:45] <akosiaris>	 well, not all, there are still some scores returned
[08:48:58] <wikibugs>	 (PS12) Elukey: analytics_cluster::webserver: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/416742 (owner: Dzahn)
[08:48:59] <_joe_>	 !log restarting uwsgi-ores on ores1003
[08:49:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:03] <akosiaris>	 but those are probably from the cache
[08:49:19] <_joe_>	 akosiaris: I was about to say
[08:49:38] <akosiaris>	 I see it did not fix anything
[08:49:41] <Amir1>	 I can make the fix right now
[08:49:41] <_joe_>	 ok interestingly
[08:49:44] <_joe_>	 this solved nothing
[08:50:04] <akosiaris>	 Amir1: what fix ? have you already identified the issue ?
[08:50:06] <_joe_>	 did we change something in celery?
[08:50:13] <_joe_>	 Amir1: please do
[08:50:15] <Amir1>	 akosiaris: revert my puppet patch from yesterday
[08:50:20] <akosiaris>	 ok doing so
[08:50:36] <_joe_>	 also, why do we reload ores if we do copytruncate anyways
[08:51:00] <Amir1>	 akosiaris: https://gerrit.wikimedia.org/r/c/operations/puppet/+/476250
[08:51:24] <wikibugs>	 (PS1) Alexandros Kosiaris: Revert "ores: Remove added celery configs" [puppet] - https://gerrit.wikimedia.org/r/476458
[08:51:28] <_joe_>	 ohhh I see
[08:51:29] <wikibugs>	 (CR) Alexandros Kosiaris: [V: 2 C: 2] Revert "ores: Remove added celery configs" [puppet] - https://gerrit.wikimedia.org/r/476458 (owner: Alexandros Kosiaris)
[08:51:38] <wikibugs>	 (PS2) Alexandros Kosiaris: Revert "ores: Remove added celery configs" [puppet] - https://gerrit.wikimedia.org/r/476458
[08:51:40] <wikibugs>	 (CR) Alexandros Kosiaris: [V: 2 C: 2] Revert "ores: Remove added celery configs" [puppet] - https://gerrit.wikimedia.org/r/476458 (owner: Alexandros Kosiaris)
[08:51:58] <_joe_>	 akosiaris: need me to run puppet across all ores servers?
[08:52:04] <akosiaris>	 niah I got it
[08:52:05] <Amir1>	 The reason being is that this puppet change didn't trigger the celery to restart so its issue went unnoticed until 6 in the morning
[08:52:32] <akosiaris>	 well celery wasn't restarted either in 6:25 in the morning
[08:52:44] <Amir1>	 the more underlying problem is that the config exists in the code and it got deployed but it's under some other name "local_celery"
[08:52:48] <akosiaris>	 it was uwsgi that received the reload 
[08:52:56] <_joe_>	 Amir1: I don't think the issue is in celery tbh
[08:53:05] <Amir1>	 yeah, both of them would be affected
[08:53:18] * akosiaris running puppet 
[08:53:21] <Amir1>	 because this is about both sending and receiving it
[08:53:28] <_joe_>	 yes
[08:53:48] <akosiaris>	 ah yes celery tightly couples the consumer and the producer
[08:53:52] <_joe_>	 that's what I was about to say, it's how uwsgi sends data if you pickle
[08:54:09] <Amir1>	 https://github.com/wikimedia/ores/blob/master/config/00-main.yaml
[08:54:22] <akosiaris>	 restarting uwsgi and celery in eqiad
[08:54:24] <Amir1>	  task_serializer: 'pickle'
[08:54:27] <wikibugs>	 Operations, Core Platform Team Backlog (Next), Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (mobrovac)
[08:54:37] <Amir1>	 this hasn't been applied because it's under "local_celery"
[08:54:49] <wikibugs>	 (CR) Gehel: "> Patch Set 4: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/475093 (https://phabricator.wikimedia.org/T206639) (owner: Mathew.onipe)
[08:54:55] <_joe_>	 Amir1: I see a ton of
[08:55:05] <_joe_>	  WARNING revscoring.scoring.environment: Differences between the current environment and the environment in which the model was constructed environment were detected
[08:55:10] <wikibugs>	 (PS1) Vgutierrez: gerrit: Switch between old LE puppetization and certcentral using hiera [puppet] - https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050)
[08:55:27] <_joe_>	 is that really an info we should keep at WARNING for every score?
[08:55:35] <akosiaris>	 2018-11-29 08:55:25,073 WARNING ores.scoring_systems.celery_queue: Queue size is too full 425
[08:55:38] <akosiaris>	 ok that's good 
[08:55:49] <akosiaris>	 it means the 2 components are communicating again
[08:55:50] <_joe_>	 akosiaris: yeah things are back now
[08:56:00] <wikibugs>	 (CR) jerkins-bot: [V: -1] gerrit: Switch between old LE puppetization and certcentral using hiera [puppet] - https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: Vgutierrez)
[08:56:01] <Amir1>	 _joe_: it's not on every score, it's on every restart
[08:56:08] <Amir1>	 I want to fix it too though
[08:56:44] <akosiaris>	 to be pedantic, a start of a celery worker 
[08:56:47] <_joe_>	 akosiaris: next time maybe restart them in a rolling fashion, I saw pybal cry :P
[08:57:14] <Amir1>	 so ores was down for three hours /o\
[08:57:18] <_joe_>	 it's ok in this case, though, as the service was effectively down
[08:57:22] <akosiaris>	 _joe_: yeah I usually do that, but now we were in a state of emergency
[08:57:29] <_joe_>	 Amir1: 2 hours 30, but yes
[08:57:42] <Amir1>	 let me get out of bed, get to the office and will write a detailed incident report
[08:57:54] <_joe_>	 Amir1: take your time :)
[08:58:02] <Amir1>	 best way to start your day, I don't need coffee anymore
[08:58:07] <akosiaris>	 lol
[08:58:16] <_joe_>	 Amir1: inorite
[08:58:28] <_joe_>	 Amir1: it's even better when you get paged at 3 am though
[08:58:37] <_joe_>	 you should try it sometimes!
[08:58:39] <_joe_>	 :P
[08:58:50] <Amir1>	 :((((
[08:58:59] <wikibugs>	 (03PS2) 10Vgutierrez: gerrit: Switch between old LE puppetization and certcentral using hiera [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050)
[08:59:54] <wikibugs>	 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Gehel) >>! In T210450#4781852, @Papaul wrote: > @Gehel  In this case the racking proposal will not work since those racks are 1G rack. I will update the task descriptio...
[09:01:10] <wikibugs>	 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Gehel) The new racking proposal looks good to me (new servers are still in the same row as the previous proposal, which is all I care about).
[09:01:43] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:01:51] <wikibugs>	 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Gehel)
[09:02:48] <wikibugs>	 (03CR) 10Vgutierrez: "pcc shows the expected changes in the production environment: https://puppet-compiler.wmflabs.org/compiler1002/13776/" [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[09:04:18] <akosiaris>	 sigh, now we've reached redis.exceptions.ConnectionError: max number of clients reached
[09:04:29] <akosiaris>	 with 2,5 hours of scores backlogged ...
[09:09:47] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10Vgutierrez)
[09:09:55] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] "Well done!" [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[09:11:30] <akosiaris>	 ok fixed. increased nofile and maxclients for redis
[09:11:55] <akosiaris>	 !log increase nofile of process to 20k and maxclients to 15k to account for the backlog of ores scorings 
[09:11:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
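
The two settings raised above are linked: every Redis client connection uses a file descriptor, so `maxclients` is only usable if the process's `nofile` limit leaves room for it. Below is a small sanity check under one stated assumption: that the server keeps 32 descriptors for its own use, which is the stock Redis default as far as I know and may differ per build. The function names are illustrative; the 20k and 15k figures are the ones from the log entry.

```python
# Assumption: Redis reserves 32 file descriptors for internal use
# (listening sockets, log and persistence files). Adjust if your build differs.
REDIS_RESERVED_FDS = 32


def max_clients_allowed(nofile_limit: int, reserved: int = REDIS_RESERVED_FDS) -> int:
    """Largest maxclients value that fits under a given nofile limit."""
    return max(nofile_limit - reserved, 0)


def limits_are_consistent(nofile_limit: int, maxclients: int) -> bool:
    """True if the configured maxclients fits within the nofile limit."""
    return maxclients <= max_clients_allowed(nofile_limit)


if __name__ == "__main__":
    # Values taken from the !log entry: nofile 20k, maxclients 15k.
    print(max_clients_allowed(20_000))
    print(limits_are_consistent(20_000, 15_000))
```

With these numbers the check passes with headroom to spare, which is consistent with raising `nofile` together with `maxclients` rather than `maxclients` alone.
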
[09:13:00] <wikibugs>	 10Operations, 10Core Platform Team Backlog (Next), 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10MoritzMuehlenhoff)
[09:13:55] <wikibugs>	 (03PS1) 10Ema: cache_canary: stop using exp admission policy [puppet] - 10https://gerrit.wikimedia.org/r/476460
[09:14:11] <wikibugs>	 (03PS2) 10Ema: cache_canary: stop using exp admission policy [puppet] - 10https://gerrit.wikimedia.org/r/476460
[09:15:11] <wikibugs>	 (03CR) 10Ema: [C: 032] cache_canary: stop using exp admission policy [puppet] - 10https://gerrit.wikimedia.org/r/476460 (owner: 10Ema)
[09:31:09] <wikibugs>	 (03PS2) 10Ema: cache: stop using nhw admission policy [puppet] - 10https://gerrit.wikimedia.org/r/476311 (https://phabricator.wikimedia.org/T144187)
[09:32:15] <wikibugs>	 (03CR) 10Ema: [C: 032] cache: stop using nhw admission policy [puppet] - 10https://gerrit.wikimedia.org/r/476311 (https://phabricator.wikimedia.org/T144187) (owner: 10Ema)
[09:32:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/476393 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite)
[09:33:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] swift: Fix checks on drive/filesystem titles to allow for labs ones [puppet] - 10https://gerrit.wikimedia.org/r/402758 (https://phabricator.wikimedia.org/T184236) (owner: 10Alex Monk)
[09:34:03] <wikibugs>	 (03PS6) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381)
[09:35:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[09:36:19] <icinga-wm>	 ACKNOWLEDGEMENT - Backup of s2 in eqiad on db1115 is CRITICAL: Backup for s2 at eqiad taken more than 8 days ago: Most recent backup 2018-11-20 23:04:07 Banyek backup failed because of recloning of the backup source host. It's fixed, now the backup is ongoing
[09:39:01] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Backlog): Blubber should be able to make multi docker files per repo - https://phabricator.wikimedia.org/T210267 (10zeljkofilipin)
[09:41:26] <wikibugs>	 (03CR) 10Elukey: [C: 032] analytics_cluster::webserver: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn)
[09:41:34] <wikibugs>	 (03PS13) 10Elukey: analytics_cluster::webserver: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn)
[09:41:44] <wikibugs>	 10Operations, 10ops-codfw: ms-be2047 rebooting itself - https://phabricator.wikimedia.org/T210697 (10fgiunchedi)
[09:41:47] <wikibugs>	 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10fgiunchedi)
[09:41:55] <wikibugs>	 10Operations, 10ops-codfw: ms-be2047 rebooting itself - https://phabricator.wikimedia.org/T210697 (10fgiunchedi) Indeed, faulty hardware :(
[09:50:19] <banyek>	 labsdb1010 maintenance in 10 minutes
[09:53:19] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:53:41] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:53:47] <wikibugs>	 (03CR) 10Alex Monk: [C: 031] "Preferable to realm branching." [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[09:54:06] <wikibugs>	 (03PS2) 10Jcrespo: admin: Add Ryan Steinberg and Joe Wass access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476039 (https://phabricator.wikimedia.org/T209298)
[09:54:29] <wikibugs>	 (03PS7) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381)
[09:54:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] admin: Add Ryan Steinberg and Joe Wass access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476039 (https://phabricator.wikimedia.org/T209298) (owner: 10Jcrespo)
[09:55:24] <gehel>	 !log restarting prometheus-elasticsearch-exporter-9200 on all elastic cirrus nodes
[09:55:26] <wikibugs>	 (03PS3) 10Jcrespo: admin: Add Ryan Steinberg and Joe Wass access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476039 (https://phabricator.wikimedia.org/T209298)
[09:55:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:56] <wikibugs>	 (03CR) 10Elukey: [C: 032] "No op on thorium as far as I can see, thanks Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn)
[10:01:26] <banyek>	 !log depooling labsdb1010 due to maintenance - T209517
[10:01:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:32] <stashbot>	 T209517: Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517
[10:01:55] <wikibugs>	 10Operations, 10Core Platform Team Backlog (Next), 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10akosiaris) Does this mean we have a hard deadline of 2019-04-01 for completing the migrations? Or per the "I can backport security fixes for...
[10:01:58] <wikibugs>	 (03CR) 10Banyek: [C: 032] wiki replicas: depool labsdb1010 for upgrades [puppet] - 10https://gerrit.wikimedia.org/r/476412 (https://phabricator.wikimedia.org/T209517) (owner: 10Bstorm)
[10:02:07] <wikibugs>	 (03PS2) 10Banyek: wiki replicas: depool labsdb1010 for upgrades [puppet] - 10https://gerrit.wikimedia.org/r/476412 (https://phabricator.wikimedia.org/T209517) (owner: 10Bstorm)
[10:02:11] <wikibugs>	 (03CR) 10Banyek: [V: 032 C: 032] wiki replicas: depool labsdb1010 for upgrades [puppet] - 10https://gerrit.wikimedia.org/r/476412 (https://phabricator.wikimedia.org/T209517) (owner: 10Bstorm)
[10:07:04] <wikibugs>	 (03PS6) 10GTirloni: openstack: Move Keystone DB credentials to my.cnf file [puppet] - 10https://gerrit.wikimedia.org/r/476109 (https://phabricator.wikimedia.org/T210404)
[10:10:26] <wikibugs>	 (03CR) 10GTirloni: [C: 032] openstack: Move Keystone DB credentials to my.cnf file [puppet] - 10https://gerrit.wikimedia.org/r/476109 (https://phabricator.wikimedia.org/T210404) (owner: 10GTirloni)
[10:16:28] <arturo>	 !log T209626 icinga downtime labvirt1011 for 1 month to avoid bogus pages
[10:16:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:34] <stashbot>	 T209626: Empty labvirt1010 and 1011 before their leases expire - https://phabricator.wikimedia.org/T209626
[10:17:17] <elukey>	 !log remove zookeeper's crontabs from conf100[1-3] to fix cronspam
[10:17:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:06] <Amir1>	 just got to the office, will start writing the incident report quickly 
[10:18:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] "Looks good, all the NDAs/sign-offs are in place and the user data is fine as well." [puppet] - 10https://gerrit.wikimedia.org/r/476039 (https://phabricator.wikimedia.org/T209298) (owner: 10Jcrespo)
[10:19:20] <icinga-wm>	 ACKNOWLEDGEMENT - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Banyek T209517
[10:20:55] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:21:13] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:31:40] <wikibugs>	 (03PS26) 10Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919)
[10:32:01] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "This is superseded by I0f1578aacc181ede284cef045e66258264b143ad and can probably be abandoned." [puppet] - 10https://gerrit.wikimedia.org/r/475944 (https://phabricator.wikimedia.org/T210265) (owner: 10Mathew.onipe)
[10:32:59] <wikibugs>	 (03CR) 10Gehel: [C: 031] "LGTM, let's wait until the servers are racked to merge." [puppet] - 10https://gerrit.wikimedia.org/r/475942 (https://phabricator.wikimedia.org/T210265) (owner: 10Mathew.onipe)
[10:34:55] <icinga-wm>	 RECOVERY - Backup of s2 in eqiad on db1115 is OK: Backup for s2 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2018-11-29 09:32:11 from db1095.eqiad.wmnet:3312 (107 GB)
[10:36:43] <wikibugs>	 (03PS1) 10Marostegui: pc1009: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/476469 (https://phabricator.wikimedia.org/T208383)
[10:40:12] <wikibugs>	 (03PS1) 10DCausse: elasticsearch: add psi & omega in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/476471 (https://phabricator.wikimedia.org/T207918)
[10:42:20] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hieradata: add kafka_shipper::kafka_brokers variable to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/476029
[10:42:22] <wikibugs>	 (03PS6) 10Filippo Giunchedi: rsyslog: add UDP localhost compatibility endpoint [puppet] - 10https://gerrit.wikimedia.org/r/475352 (https://phabricator.wikimedia.org/T205851)
[10:42:24] <wikibugs>	 (03PS1) 10Filippo Giunchedi: logstash: add new logging kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/476472 (https://phabricator.wikimedia.org/T205851)
[10:42:26] <wikibugs>	 (03PS1) 10Filippo Giunchedi: logstash: copy 'severity' into 'level' where needed [puppet] - 10https://gerrit.wikimedia.org/r/476473 (https://phabricator.wikimedia.org/T205851)
[10:42:28] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "Very minor comments inline, otherwise LGTM" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: 10Mathew.onipe)
[10:45:22] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui)
[10:46:08] <wikibugs>	 (03CR) 10Marostegui: [C: 032] pc1009: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/476469 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui)
[10:51:39] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:51:41] <wikibugs>	 10Operations, 10Core Platform Team Backlog (Next), 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10MoritzMuehlenhoff) >>! In T210704#4784515, @akosiaris wrote: > Does this mean we have a hard deadline of 2019-04-01 for completing the migra...
[10:51:55] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:56:09] <wikibugs>	 (03PS7) 10Filippo Giunchedi: rsyslog: add UDP localhost compatibility endpoint [puppet] - 10https://gerrit.wikimedia.org/r/475352 (https://phabricator.wikimedia.org/T205851)
[10:56:11] <wikibugs>	 (03PS2) 10Filippo Giunchedi: logstash: add new logging kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/476472 (https://phabricator.wikimedia.org/T205851)
[10:56:13] <wikibugs>	 (03PS2) 10Filippo Giunchedi: logstash: copy 'severity' into 'level' where needed [puppet] - 10https://gerrit.wikimedia.org/r/476473 (https://phabricator.wikimedia.org/T205851)
[10:56:14] <wikibugs>	 10Operations, 10Electron-PDFs, 10Proton, 10Epic, and 4 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10phuedx)
[11:08:18] <wikibugs>	 10Puppet, 10Phabricator: Local config file contains escape characters - https://phabricator.wikimedia.org/T103924 (10Aklapper) 05Open>03declined No reply. :( Please reopen this task when clarifying what is the actual problem here and where. (I also assume this is an upstream issue?) Thanks a lot!
[11:24:16] <wikibugs>	 10Operations, 10CirrusSearch, 10Discovery-Search: Find an alternative to curl connection pooling available in HHVM - https://phabricator.wikimedia.org/T210717 (10dcausse)
[11:24:44] <wikibugs>	 10Operations, 10CirrusSearch, 10Discovery-Search: Find an alternative to curl connection pooling available in HHVM - https://phabricator.wikimedia.org/T210717 (10dcausse)
[11:24:54] <wikibugs>	 10Operations, 10Core Platform Team Backlog (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10dcausse)
[11:27:29] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:27:53] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:36:04] <wikibugs>	 (03Abandoned) 10Mathew.onipe: cirrus.yaml: add new elastic2037-elastic2054 to existing clusters [puppet] - 10https://gerrit.wikimedia.org/r/475944 (https://phabricator.wikimedia.org/T210265) (owner: 10Mathew.onipe)
[11:39:53] <wikibugs>	 10Operations, 10Traffic: Varnish won't purge thumbnails of specific file - https://phabricator.wikimedia.org/T207615 (10Aklapper) Let's close as declined as no one can reproduce it anymore?
[12:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T1200).
[12:00:04] <jouncebot>	 CFisch_WMDE and dcausse: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:22] <dcausse>	 o/
[12:00:25] <CFisch_WMDE>	 \o/
[12:00:32] <zeljkof>	 \o
[12:00:45] <wikibugs>	 (03PS1) 10Banyek: Revert "wiki replicas: depool labsdb1010 for upgrades" [puppet] - 10https://gerrit.wikimedia.org/r/476482
[12:01:00] <banyek>	 !log repooling labsdb1010 after upgrades - T209517
[12:01:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:05] <stashbot>	 T209517: Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517
[12:01:06] <zeljkof>	 dcausse: go ahead with your patches, do you want to deploy CFisch_WMDE's patch, or should I do it?
[12:01:08] <wikibugs>	 (03PS1) 10Elukey: profile::hive::client: add support for kerberos to Beeline [puppet] - 10https://gerrit.wikimedia.org/r/476483
[12:01:10] <wikibugs>	 (03PS1) 10Elukey: profile::hive::client: move beeline's erb from role to profile ns [puppet] - 10https://gerrit.wikimedia.org/r/476484
[12:01:28] <dcausse>	 zeljkof: ok deploying
[12:01:51] <wikibugs>	 (03CR) 10Banyek: [C: 032] Revert "wiki replicas: depool labsdb1010 for upgrades" [puppet] - 10https://gerrit.wikimedia.org/r/476482 (owner: 10Banyek)
[12:02:03] <wikibugs>	 (03PS2) 10Banyek: Revert "wiki replicas: depool labsdb1010 for upgrades" [puppet] - 10https://gerrit.wikimedia.org/r/476482
[12:02:06] <wikibugs>	 (03CR) 10Banyek: [V: 032 C: 032] Revert "wiki replicas: depool labsdb1010 for upgrades" [puppet] - 10https://gerrit.wikimedia.org/r/476482 (owner: 10Banyek)
[12:02:08] <wikibugs>	 (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475745 (https://phabricator.wikimedia.org/T198352) (owner: 10DCausse)
[12:02:28] <wikibugs>	 (03PS4) 10Jcrespo: admin: Add Ryan Steinberg and Joe Wass access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476039 (https://phabricator.wikimedia.org/T209298)
[12:03:25] <wikibugs>	 (03Merged) 10jenkins-bot: [cirrus] Use normal config for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475745 (https://phabricator.wikimedia.org/T198352) (owner: 10DCausse)
[12:04:01] <wikibugs>	 (03PS2) 10Elukey: profile::hive::client: add support for kerberos to Beeline [puppet] - 10https://gerrit.wikimedia.org/r/476483
[12:04:22] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] admin: Add Ryan Steinberg and Joe Wass access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476039 (https://phabricator.wikimedia.org/T209298) (owner: 10Jcrespo)
[12:05:29] <wikibugs>	 (03CR) 10Elukey: [C: 032] profile::hive::client: add support for kerberos to Beeline [puppet] - 10https://gerrit.wikimedia.org/r/476483 (owner: 10Elukey)
[12:05:49] <wikibugs>	 (03PS3) 10Elukey: profile::hive::client: add support for kerberos to Beeline [puppet] - 10https://gerrit.wikimedia.org/r/476483
[12:05:52] <wikibugs>	 (03CR) 10Elukey: [V: 032 C: 032] profile::hive::client: add support for kerberos to Beeline [puppet] - 10https://gerrit.wikimedia.org/r/476483 (owner: 10Elukey)
[12:06:08] <wikibugs>	 (03PS2) 10Elukey: profile::hive::client: move beeline's erb from role to profile ns [puppet] - 10https://gerrit.wikimedia.org/r/476484
[12:07:08] <wikibugs>	 10Operations, 10Core Platform Team Backlog (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Joe)
[12:07:57] <wikibugs>	 (03CR) 10Elukey: [C: 032] profile::hive::client: move beeline's erb from role to profile ns [puppet] - 10https://gerrit.wikimedia.org/r/476484 (owner: 10Elukey)
[12:09:22] <wikibugs>	 (03PS1) 10Jcrespo: Revert "admin: Add Ryan Steinberg and Joe Wass access to production cluster" [puppet] - 10https://gerrit.wikimedia.org/r/476485
[12:09:31] <wikibugs>	 (03PS2) 10Jcrespo: Revert "admin: Add Ryan Steinberg and Joe Wass access to production cluster" [puppet] - 10https://gerrit.wikimedia.org/r/476485
[12:09:54] <logmsgbot>	 !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: T210381: [cirrus] Use normal config for labswiki (duration: 00m 55s)
[12:09:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:00] <stashbot>	 T210381: Update mw-config to use the psi&omega elastic clusters in codfw  - https://phabricator.wikimedia.org/T210381
[12:10:15] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] Revert "admin: Add Ryan Steinberg and Joe Wass access to production cluster" [puppet] - 10https://gerrit.wikimedia.org/r/476485 (owner: 10Jcrespo)
[12:10:21] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] "Failed" [puppet] - 10https://gerrit.wikimedia.org/r/476485 (owner: 10Jcrespo)
[12:10:23] <wikibugs>	 (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475746 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[12:10:26] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "Revert "labs: Add mediainfo to federation config"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476486
[12:10:31] <wikibugs>	 (03CR) 10Ladsgroup: [C: 032] Revert "Revert "labs: Add mediainfo to federation config"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476486 (owner: 10Ladsgroup)
[12:10:39] <wikibugs>	 (03CR) 10jenkins-bot: [cirrus] Use normal config for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475745 (https://phabricator.wikimedia.org/T198352) (owner: 10DCausse)
[12:11:28] <wikibugs>	 (03Merged) 10jenkins-bot: [cirrus] multi-instance: add cirrussearch-big-indices.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475746 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[12:11:42] <wikibugs>	 (03CR) 10jenkins-bot: [cirrus] multi-instance: add cirrussearch-big-indices.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475746 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[12:12:09] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "labs: Add mediainfo to federation config"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476486 (owner: 10Ladsgroup)
[12:12:36] <wikibugs>	 (03PS1) 10Jcrespo: Revert "Revert "admin: Add Ryan Steinberg and Joe Wass access to production cluster"" [puppet] - 10https://gerrit.wikimedia.org/r/476487
[12:13:45] <logmsgbot>	 !log dcausse@deploy1001 Synchronized dblists/cirrussearch-big-indices.dblist: T210381: [cirrus] multi-instance: add cirrussearch-big-indices.dblist (duration: 00m 53s)
[12:13:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:49] <icinga-wm>	 PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): User[ryanmax],User[afandian2]
[12:14:03] <jynus>	 that is me, fixing
[12:14:05] <dcausse>	 zeljkof: I'm done
[12:14:13] <jynus>	 (a puppet run would also work)
[12:14:30] <wikibugs>	 10Operations, 10Traffic: Varnish won't purge thumbnails of specific file - https://phabricator.wikimedia.org/T207615 (10Gilles) 05Open>03declined Sure, 'till next time ;)
[12:15:04] <zeljkof>	 dcausse: great! want to deploy CFisch_WMDE's patch, or should I do it? :)
[12:15:14] <dcausse>	 zeljkof: I can
[12:15:27] <zeljkof>	 dcausse: great! please do then :)
[12:15:29] <icinga-wm>	 PROBLEM - puppet last run on people1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): User[ryanmax],User[afandian2]
[12:15:38] <wikibugs>	 (03PS2) 10Jcrespo: Revert "Revert "admin: Add Ryan Steinberg and Joe Wass access to production cluster"" [puppet] - 10https://gerrit.wikimedia.org/r/476487 (https://phabricator.wikimedia.org/T209298)
[12:15:38] <CFisch_WMDE>	 :-)
[12:17:29] <wikibugs>	 (03PS3) 10Jcrespo: Revert "Revert "admin: Add Ryan Steinberg and Joe Wass access to production cluster"" [puppet] - 10https://gerrit.wikimedia.org/r/476487 (https://phabricator.wikimedia.org/T209298)
[12:17:37] <icinga-wm>	 PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): User[ryanmax],User[afandian2]
[12:19:13] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] Revert "Revert "admin: Add Ryan Steinberg and Joe Wass access to production cluster"" [puppet] - 10https://gerrit.wikimedia.org/r/476487 (https://phabricator.wikimedia.org/T209298) (owner: 10Jcrespo)
[12:21:53] <wikibugs>	 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic, 10Patch-For-Review: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (10jcrespo)
[12:23:41] <wikibugs>	 (03CR) 10jenkins-bot: Revert "Revert "labs: Add mediainfo to federation config"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476486 (owner: 10Ladsgroup)
[12:24:50] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure: [Cloud VPS alert] Puppet failure on deployment-logstash2.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T210718 (10MarcoAurelio)
[12:26:17] <dcausse>	 CFisch_WMDE: jenkins is not happy :/
[12:26:27] <CFisch_WMDE>	 dcausse: yeah ... these browser tests
[12:26:30] <dcausse>	 should we try again?
[12:26:33] <CFisch_WMDE>	 try to trigger it again plz
[12:26:42] <dcausse>	 sure
[12:26:45] <CFisch_WMDE>	 thanks
[12:27:38] <wikibugs>	 10Puppet, 10ORES, 10Scoring-platform-team (Current), 10User-Ladsgroup, 10Wikimedia-Incident: ORES services should bind to ores config files - https://phabricator.wikimedia.org/T210719 (10Ladsgroup)
[12:29:48] <wikibugs>	 (03PS1) 10Elukey: hive-site.xml: render hive.metastore.sasl.enabled only on metastore [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476490
[12:29:50] <wikibugs>	 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic, 10Patch-For-Review: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (10jcrespo) a:03toddleroux ` Notice: /Stage[main]/Admin/Admin::Hashuser[ryanmax]/Admin::Us...
[12:30:02] <wikibugs>	 10Operations, 10Puppet, 10ORES, 10Scoring-platform-team, 10Wikimedia-Incident: Logrotate should restart services when more people are around - https://phabricator.wikimedia.org/T210720 (10Ladsgroup)
[12:30:09] <wikibugs>	 (03CR) 10Elukey: [V: 032 C: 032] hive-site.xml: render hive.metastore.sasl.enabled only on metastore [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476490 (owner: 10Elukey)
[12:30:47] <jynus>	 !log run puppet on notebook1004, people1001, rutherfordium to fix failures
[12:30:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:04] <wikibugs>	 (03PS1) 10Elukey: Update cdh module to its latest sha [puppet] - 10https://gerrit.wikimedia.org/r/476491
[12:33:53] <wikibugs>	 (03CR) 10Elukey: [C: 032] Update cdh module to its latest sha [puppet] - 10https://gerrit.wikimedia.org/r/476491 (owner: 10Elukey)
[12:34:29] <icinga-wm>	 RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[12:35:55] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10aborrero)
[12:39:11] <wikibugs>	 10Operations, 10Release-Engineering-Team, 10Scoring-platform-team: Contact number of some WMDE staff should be avalible to SRE/RelEng - https://phabricator.wikimedia.org/T210721 (10Ladsgroup)
[12:41:11] <dcausse>	 CFisch_WMDE: your change should be on mwdebug1002, is it possible for you to test?
[12:41:19] <icinga-wm>	 RECOVERY - puppet last run on people1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:41:22] <CFisch_WMDE>	 yes I will have a look
[12:43:04] <CFisch_WMDE>	 dcausse: Either my test is wrong or it's not there :-/
[12:43:13] <dcausse>	 CFisch_WMDE: looking
[12:43:27] <icinga-wm>	 RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:44:47] <dcausse>	 CFisch_WMDE: I see Html::element on line 186 at mwdebug1002 for php-1.33.0-wmf.6
[12:45:09] <dcausse>	 CFisch_WMDE: are you testing a wiki that is on wmf.6 ?
[12:45:24] <CFisch_WMDE>	 Ahh yeah good point thanks ^^'
[12:46:24] <marostegui>	 jouncebot: next
[12:46:24] <jouncebot>	 In 0 hour(s) and 13 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T1300)
[12:46:55] <CFisch_WMDE>	 dcausse: nice works, thanks
[12:47:04] <wikibugs>	 (03PS27) 10Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919)
[12:47:06] <wikibugs>	 (03CR) 10Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: 10Mathew.onipe)
[12:48:43] <logmsgbot>	 !log dcausse@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/TwoColConflict/: Fix unescaped HTML injected into conflict resolution interface (duration: 00m 53s)
[12:48:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:01] <dcausse>	 CFisch_WMDE: it's live on all servers now
[12:49:45] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-VPS, 10cloud-services-team: [Cloud VPS alert] Puppet failure on deployment-logstash2.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T210718 (10MarcoAurelio)
[12:49:45] <dcausse>	 !log EU swat done
[12:49:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:48] <CFisch_WMDE>	 \o/ works, thanks
[12:49:59] <dcausse>	 you're welcome! :)
[12:54:55] <icinga-wm>	 PROBLEM - puppet last run on scb1004 is CRITICAL: CRITICAL: Puppet has 28 failures. Last run 3 minutes ago with 28 failures. Failed resources (up to 3 shown): Exec[absent_ensure_members],Exec[ops_ensure_members],Exec[wikidev_ensure_members],Exec[adm_ensure_members]
[12:55:44] <wikibugs>	 10Operations, 10Release-Engineering-Team, 10Scoring-platform-team: Contact number of some WMDE staff should be avalible to SRE/RelEng - https://phabricator.wikimedia.org/T210721 (10WMDE-leszek) a:03WMDE-leszek I take it on me. I've briefly talked about this topic with @greg during Technical Conference. We'...
[12:55:50] <jynus>	 mm, checking that
[12:55:51] <wikibugs>	 10Operations, 10Release-Engineering-Team, 10Scoring-platform-team: Contact number of some WMDE staff should be avalible to SRE/RelEng - https://phabricator.wikimedia.org/T210721 (10WMDE-leszek) p:05Triage>03High
[12:56:33] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-VPS, 10cloud-services-team: [Cloud VPS alert] Puppet failure on deployment-logstash2.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T210718 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Yes this has been fixed by me a few hours ago...
[12:56:43] <jynus>	 Could not evaluate: Cannot allocate memory - fork(2)
[12:57:31] <jynus>	 akosiaris mobrovac: we may have memory issues on scb1004
[12:57:44] <mobrovac>	 sigh
[12:57:45] <mobrovac>	 looking
[12:57:59] <jynus>	 eventstreams maybe?
[12:58:38] <jynus>	 is that safe to restart?
[12:58:57] <mobrovac>	 yup it's eventstreams
[12:59:04] <mobrovac>	 weird, i can't restart it
[12:59:09] <mobrovac>	 don't have the rights
[12:59:11] <jynus>	 I can try
[12:59:19] <mobrovac>	 jynus: wait i'll do it from deploy1001
[12:59:25] <mobrovac>	 so to depool it first
[12:59:28] <jynus>	 ok
[12:59:33] <mobrovac>	 to minimise impact
[12:59:42] <jynus>	 that is why I asked if it was safe, I don't know the service at all
[12:59:47] <jynus>	 thanks
[13:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T1300)
[13:00:05] <icinga-wm>	 RECOVERY - puppet last run on scb1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[13:00:06] <jynus>	 you take care of it for now, standing by if you need me
[13:00:29] <logmsgbot>	 !log mobrovac@deploy1001 Started restart [eventstreams/deploy@07033d4]: Restart ES on scb1004 due to possible memory leak (again)
[13:00:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:57] <mobrovac>	 ok, mem is back again
[13:01:05] <mobrovac>	 thnx jynus for pinging
[13:01:30] <godog>	 I love how we got at least three ESes now, external storage / elastic search / event streams
[13:01:39] <akosiaris>	 lol
[13:01:52] <wikibugs>	 (03PS2) 10Aklapper: Phab: Use our custom Priority field value in tooltip on Reports page [puppet] - 10https://gerrit.wikimedia.org/r/455271 (https://phabricator.wikimedia.org/T91428)
[13:01:56] <wikibugs>	 (03PS2) 10Aklapper: Phab: Clarify that spaces are not allowed in user account names [puppet] - 10https://gerrit.wikimedia.org/r/455265 (https://phabricator.wikimedia.org/T179126)
[13:01:59] <jynus>	 don't worry, nothing happens if we restart an es* host !
[13:02:17] <jynus>	 just all wikipedias go down, but nothing happens :-)
[13:02:24] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Pool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476495 (https://phabricator.wikimedia.org/T208383)
[13:02:39] <godog>	 hheeh
[13:02:51] <wikibugs>	 (03PS3) 10Aklapper: Order list of extensions by alphabet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455188
[13:03:05] <jynus>	 in the past, all wikipedias went down and the server got reimaged and all data was lost
[13:05:03] <godog>	 "one of those days"
[13:05:22] <jynus>	 they changed the board and the BIOS init was reset
[13:05:26] <marostegui>	 !log Upgrade pc3 tendril topology - T208383
[13:05:29] <jynus>	 defaulting to network boot
[13:05:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:30] <stashbot>	 T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383
[13:05:38] <marostegui>	 jynus: Oh boy, I remember that day
[13:05:51] <jynus>	 godog: that is why we now require a puppet change to reimage a db
[13:06:19] <godog>	 jynus: indeed, I remember that day too, not fun at all
[13:06:35] <jynus>	 I mean, we have 6 copies and backups
[13:07:09] <jynus>	 over 2 datacenters, I am overdramatizing it a bit, we only lost one host
[13:07:29] <jynus>	 but indeed es and elastic hosts have been confused in the past
[13:07:50] <marostegui>	 jynus: can you give this a review? https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/476495/
[13:09:42] <wikibugs>	 (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Pool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476495 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui)
[13:09:52] <jynus>	 ^but I will ask for a followup
[13:10:02] <marostegui>	 what do you mean?
[13:10:03] <jynus>	 document the ips on the key
[13:10:06] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Pool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476495 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui)
[13:10:12] <jynus>	 because it is now confusing
[13:10:16] <marostegui>	 what do you mean?
[13:10:36] <jynus>	 # this should be something like 'pc1', 'pc2', 'pc3', but don't touch it!
[13:10:41] <marostegui>	 aaah right
[13:10:44] <hashar>	 I will deploy a hotfix for Wikibase when things are settled
[13:10:45] <marostegui>	 I was planning to create a ticket
[13:10:47] <marostegui>	 :)
[13:10:48] <jynus>	 # sharding key bla bla bla
[13:10:54] <jynus>	 no, no need to fix it
[13:10:58] <jynus>	 just add a comment
[13:11:08] <marostegui>	 jynus: I wanted to start to get some devs involved and I was planning to create a ticket to work out how to change those
[13:11:10] <jynus>	 because in 2 months we will not remember what that is
[13:11:14] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Pool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476495 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui)
[13:11:15] <jynus>	 yes, but that is aside
[13:11:49] <jynus>	 I just want a comment without touching the code of what 10.64.0.12 is and why it shouldn't be touched
[13:11:54] <marostegui>	 Ah sure :)
[13:12:02] <marostegui>	 I can add that after the 256GB 
[13:12:13] <jynus>	 however
[13:12:24] <jynus>	 or maybe you can create the ticket and note it there, too
[13:12:27] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool pc1009 in pc3 - T208383 (duration: 00m 53s)
[13:12:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:31] <stashbot>	 T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383
[13:12:32] <jynus>	 but without it people will remove it
[13:12:39] <wikibugs>	 10Operations, 10monitoring: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10fgiunchedi)
[13:12:51] <marostegui>	 You mean something like '10.64.0.12'   => '10.64.48.174', # pc1010, D3 4.4TB 256GB # 10.64.0.12 is the key, but it should be pc1 eventually
[13:12:52] <stashbot>	 D3: test - ignore - https://phabricator.wikimedia.org/D3
[13:12:54] <marostegui>	 something like that?
[13:12:58] <jynus>	 or at least document before
[13:13:23] <jynus>	 'sharding function key' -> 'server ip address'
[13:13:32] <jynus>	 and then what you proposed
[13:13:39] <marostegui>	 Ah I see what you mean
[13:13:43] <marostegui>	 I will get a draft :)
[13:13:49] <jynus>	 with a WARNING, do not change T2234234 (ticket of why)
[13:14:28] <jynus>	 the exact thing doesn't matter, it is just a comment to make sure people don't change it unless they know what they are doing
[13:14:43] <jynus>	 e.g. rob because he sees an ip that is unused
[13:14:50] <moritzm>	 !log rebooting certcentral2001 to pick up SSBD-enabled qemu/kernel update
[13:14:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:12] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10aborrero) Question: what is the warranty status of this server? would it make sense to get a more complete replacement by HP? (not just some spare pieces like disk and raid controllers)
[13:15:15] <jynus>	 and yes, normally people would ask us, but our code should outlive us :-)
[13:16:03] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Pool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476495 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui)
[13:16:54] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10aborrero)
[13:17:31] <moritzm>	 !log rebooting certcentral1001 to pick up SSBD-enabled qemu/kernel update
[13:17:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:46] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1020 - https://phabricator.wikimedia.org/T194855 (10aborrero)
[13:19:46] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) @aborrero Unfortunately it's not that simple.  Once we take delivery of a server we then have to work through technical support.  We may be at the point where th...
[13:21:01] <marostegui>	 jynus: T210725
[13:21:03] <stashbot>	 T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725
[13:22:04] <jynus>	 just a warning "DONT CHANGE THESE IPS T210725" or something would be enough
[13:22:21] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497
[13:22:22] <marostegui>	 jynus: ^ 
[13:23:19] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497
[13:23:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 (owner: 10Marostegui)
[13:23:23] <jynus>	 suggestion, put the do not change in caps and on the next line
[13:23:31] <icinga-wm>	 PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/centralcerts/apt.rsa-2048.crt]
[13:23:33] <jynus>	 so it cannot be missed
[13:24:00] <marostegui>	 right above the keys?
[13:24:15] <jynus>	 (yes, maybe a bit overboard, but look at 525+)
[13:24:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 (owner: 10Marostegui)
[13:24:30] <jynus>	 something that looks scary :-)
[13:24:33] <marostegui>	 haha
[13:24:48] <jynus>	 so people read the ticket first
[13:26:26] <wikibugs>	 (03CR) 10Hashar: [C: 031] "Given that is solely for deployment-prep , we can +2/merge this at any time.  Just make sure to rebase the repo on deployment.eqiad.wmnet " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476228 (https://phabricator.wikimedia.org/T205851) (owner: 10Filippo Giunchedi)
[13:26:42] <wikibugs>	 (03PS3) 10Marostegui: db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497
[13:26:50] <marostegui>	 I think this way is clear enough ^
[13:27:50] <wikibugs>	 10Operations, 10Puppet, 10ORES, 10Scoring-platform-team, 10Wikimedia-Incident: Logrotate should restart services when more people are around - https://phabricator.wikimedia.org/T210720 (10akosiaris) I am afraid we can't really change it. It's been at 06:25am (UTC in our case) forever and people expect th...
[13:28:09] <jynus>	 +1, marostegui
[13:28:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 (owner: 10Marostegui)
[13:28:31] <marostegui>	 jynus: thanks!
[13:29:06] <jynus>	 some tabs issue
[13:29:12] <jynus>	 CI complains
[13:29:21] <marostegui>	 yeah, I don't understand why as on my vim they are tabs :|
[13:29:44] <wikibugs>	 (03PS4) 10Marostegui: db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497
[13:30:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 (owner: 10Marostegui)
[13:31:01] <marostegui>	 this makes no sense 
[13:33:02] <wikibugs>	 (03PS5) 10Marostegui: db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497
[13:34:36] <marostegui>	 All good now \o/
[13:37:18] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 (owner: 10Marostegui)
[13:37:42] <hashar>	 marostegui: let me know when you are done and I will deploy some wikibase hotfix :)  thx!
[13:37:59] <marostegui>	 hashar: Ah sure, will not take long as soon as CI merges it!
[13:38:21] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 (owner: 10Marostegui)
[13:39:11] <hashar>	 marostegui: no worries, take your time :)
[13:39:29] <hashar>	 the CI job, I will have to look at it but it seems most of the wait time is due to php codesniffer :\\\
[13:39:37] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Clarify parsercache keys section (duration: 00m 52s)
[13:39:38] <marostegui>	 One more file and I am done
[13:39:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:23] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: add opcache tuning for php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/476499 (https://phabricator.wikimedia.org/T206341)
[13:40:25] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: tune php-fpm parameters [puppet] - 10https://gerrit.wikimedia.org/r/476500 (https://phabricator.wikimedia.org/T206341)
[13:40:27] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: armonize settings with HHVM [puppet] - 10https://gerrit.wikimedia.org/r/476501
[13:40:29] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki: configure php-fpm logging [puppet] - 10https://gerrit.wikimedia.org/r/476502
[13:40:37] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Clarify parsercache keys section (duration: 00m 53s)
[13:40:38] <marostegui>	 hashar: I am done!
[13:40:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:40] <wikibugs>	 10Operations, 10monitoring: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10fgiunchedi) Note we've been here before in {T172921} and sadly the command check timeout can be changed only globally on the icinga side, not per-service.
[13:41:52] <hashar>	 cool
[13:42:03] <wikibugs>	 10Operations, 10monitoring, 10User-fgiunchedi: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10fgiunchedi)
[13:42:21] <wikibugs>	 (03PS2) 10Filippo Giunchedi: LabsServices: ship logs locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476228 (https://phabricator.wikimedia.org/T205851)
[13:44:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] LabsServices: ship logs locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476228 (https://phabricator.wikimedia.org/T205851) (owner: 10Filippo Giunchedi)
[13:45:02] <logmsgbot>	 !log hashar@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/Wikibase: feature flag for globe coordinator formatter using kartographer - T184933 T210617 (duration: 01m 18s)
[13:45:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:08] <stashbot>	 T210617: BadMethodCallException on Wikidata item pages containing coordinates with non-Earth globes - https://phabricator.wikimedia.org/T210617
[13:45:08] <stashbot>	 T184933: Display map for geocoordinate statements - https://phabricator.wikimedia.org/T184933
[13:45:14] <wikibugs>	 (03Merged) 10jenkins-bot: LabsServices: ship logs locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476228 (https://phabricator.wikimedia.org/T205851) (owner: 10Filippo Giunchedi)
[13:45:16] <hashar>	 gre
[13:46:15] <wikibugs>	 (03PS1) 10Huji: Dissallow eliminators to block certain groups on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642)
[13:48:13] <wikibugs>	 (03CR) 10Daimona Eaytoy: Dissallow eliminators to block certain groups on fawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642) (owner: 10Huji)
[13:49:14] <icinga-wm>	 RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:49:35] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 (owner: 10Marostegui)
[13:49:37] <wikibugs>	 (03CR) 10jenkins-bot: LabsServices: ship logs locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476228 (https://phabricator.wikimedia.org/T205851) (owner: 10Filippo Giunchedi)
[13:50:09] <wikibugs>	 (03PS2) 10Huji: Dissallow eliminators to block certain groups on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642)
[13:50:51] <wikibugs>	 (03PS8) 10Gehel: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/475093 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe)
[13:51:25] <wikibugs>	 (03PS1) 10Hashar: wikidatawiki to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476504
[13:57:13] <wikibugs>	 (03CR) 10Gehel: [C: 032] profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/475093 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe)
[14:00:04] <jouncebot>	 hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T1400).
[14:00:36] <wikibugs>	 (03CR) 10Hashar: [C: 032] wikidatawiki to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476504 (owner: 10Hashar)
[14:01:37] <wikibugs>	 (03Merged) 10jenkins-bot: wikidatawiki to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476504 (owner: 10Hashar)
[14:02:59] <wikibugs>	 (03CR) 10jenkins-bot: wikidatawiki to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476504 (owner: 10Hashar)
[14:03:19] <hashar>	 godog: you have forgotten to rebase on deployment.eqiad.wmnet ! I have done it )
[14:03:28] <moritzm>	 !log uploaded nodejs 6.11.0~dfsg-1+wmf3 to apt.wikimedia.org/stretch-wikimedia (backporting the current security fixes)
[14:03:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:16] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: wikidatawiki to 1.33.0-wmf.6
[14:05:29] <wikibugs>	 (03PS1) 10Andrew Bogott: Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/476507 (https://phabricator.wikimedia.org/T204745)
[14:05:50] <hashar>	 jouncebot: next
[14:05:51] <jouncebot>	 In 2 hour(s) and 54 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T1700)
[14:06:07] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/476507 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott)
[14:06:20] <stashbot>	 hashar@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[14:06:36] <hashar>	 ;(
[14:08:08] <godog>	 hashar: indeed, thank you!
[14:09:13] <wikibugs>	 10Operations, 10Electron-PDFs, 10Proton, 10Epic, and 4 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10pmiazga) The service is ready, the remaining thing is to increase the CPU count (T197862). I'll talk with services today about this task.  There a...
[14:09:53] <wikibugs>	 (03CR) 10Daimona Eaytoy: [C: 031] Dissallow eliminators to block certain groups on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642) (owner: 10Huji)
[14:10:07] <hashar>	 !log test stashbot
[14:10:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:12] <hashar>	 ...
[14:10:40] <wikibugs>	 (03PS1) 10Hashar: all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476509
[14:10:42] <wikibugs>	 (03CR) 10Hashar: [C: 032] all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476509 (owner: 10Hashar)
[14:10:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476509 (owner: 10Hashar)
[14:11:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476509 (owner: 10Hashar)
[14:12:35] <wikibugs>	 (03PS2) 10Hashar: all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476509
[14:13:10] <hashar>	 grr
[14:13:17] <hashar>	 I screwed up the update of wikidatawiki
[14:13:37] <wikibugs>	 (03CR) 10Hashar: [C: 032] all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476509 (owner: 10Hashar)
[14:14:39] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476509 (owner: 10Hashar)
[14:15:23] <wikibugs>	 (03CR) 10Gehel: [C: 031] "LGTM, waiting to see if Volans has a last comment before merging." [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: 10Mathew.onipe)
[14:16:16] <wikibugs>	 (03CR) 10jenkins-bot: all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476509 (owner: 10Hashar)
[14:16:49] <wikibugs>	 (03PS3) 10Vgutierrez: gerrit: Switch between old LE puppetization and certcentral using hiera [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050)
[14:16:51] <wikibugs>	 (03PS1) 10Vgutierrez: certcentral: Mimick letsencrypt::cert::integrated key_group [puppet] - 10https://gerrit.wikimedia.org/r/476510 (https://phabricator.wikimedia.org/T207050)
[14:17:17] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.6
[14:17:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:15] <hashar>	 pfff
[14:23:51] <wikibugs>	 (03CR) 10Vgutierrez: "pcc shows the expected changes (0600 --> 0640) in existing certcentral clients: https://puppet-compiler.wmflabs.org/compiler1002/13780/" [puppet] - 10https://gerrit.wikimedia.org/r/476510 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[14:25:05] <wikibugs>	 (03CR) 10Mathew.onipe: maps: remove osmupdater and osmimporter hiera passwords (032 comments) [labs/private] - 10https://gerrit.wikimedia.org/r/468631 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe)
[14:26:14] <wikibugs>	 (03PS1) 10Hashar: Revert "all wikis to 1.33.0-wmf.6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476514 (https://phabricator.wikimedia.org/T206660)
[14:26:34] <wikibugs>	 (03CR) 10Hashar: [C: 032] Revert "all wikis to 1.33.0-wmf.6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476514 (https://phabricator.wikimedia.org/T206660) (owner: 10Hashar)
[14:27:29] <wikibugs>	 (03PS2) 10Mathew.onipe: maps: remove osmupdater and osmimporter hiera passwords [labs/private] - 10https://gerrit.wikimedia.org/r/468631 (https://phabricator.wikimedia.org/T206639)
[14:27:32] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Post hold because of "invalid headers" in wikimediacz-l - https://phabricator.wikimedia.org/T210223 (10herron) Hello, I notice that on WikimediaCZ-l within the "privacy options..." > "spam filtering" section in the list admin, below "legacy spam filtering" there is a `b...
[14:27:35] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "all wikis to 1.33.0-wmf.6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476514 (https://phabricator.wikimedia.org/T206660) (owner: 10Hashar)
[14:29:59] <logmsgbot>	 !log hashar@deploy1001 scap failed: average error rate on 11/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
[14:30:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:33] <hashar>	 __main__.CheckServiceError: Generic connection error: HTTPConnectionPool(host='logstash1009.eqiad.wmnet', port=9200): Max retries exceeded with url: /logstash-*/_search (Caused by ReadTimeoutError("HTTPConnectionPool(host='logstash1009.eqiad.wmnet', port=9200): Read timed out. (read timeout=10)",))
[14:30:38] <hashar>	 looks like logstash has some issue
[14:31:40] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Revert all wikis to 1.33.0-wmf.6
[14:31:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:58] <hashar>	 will file tasks for all the spam I got
[14:35:45] <wikibugs>	 (03CR) 10jenkins-bot: Revert "all wikis to 1.33.0-wmf.6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476514 (https://phabricator.wikimedia.org/T206660) (owner: 10Hashar)
[14:40:13] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: First draft of a graphoid helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/434475
[14:42:43] <wikibugs>	 (03PS3) 10Filippo Giunchedi: hieradata: add kafka_shipper::kafka_brokers variable to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/476029
[14:43:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] hieradata: add kafka_shipper::kafka_brokers variable to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/476029 (owner: 10Filippo Giunchedi)
[14:45:15] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Andrew) So it sounds like you will need dedicated hardware t...
[14:50:48] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Change-tagging, 10MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), and 3 others: Migrate tag_summary usage to change_tag and drop the table - https://phabricator.wikimedia.org/T209525 (10Banyek) Do you need anything from our side this moment @Ladsgroup ?
[14:56:13] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Change-tagging, 10MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), and 3 others: Migrate tag_summary usage to change_tag and drop the table - https://phabricator.wikimedia.org/T209525 (10Ladsgroup) >>! In T209525#4785174, @Banyek wrote: > Do you need anything from our side this mo...
[14:57:40] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Decom/return labvirt1010 and 1011 - https://phabricator.wikimedia.org/T210735 (10Andrew)
[14:58:46] <wikibugs>	 (03PS1) 10Vgutierrez: certcentral: Provide TLS certificates for lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/476521 (https://phabricator.wikimedia.org/T207050)
[14:59:28] <wikibugs>	 (03PS1) 10Andrew Bogott: Move labvirt1010/1011 to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/476522 (https://phabricator.wikimedia.org/T210735)
[15:02:04] <wikibugs>	 (03CR) 10Gehel: [V: 032 C: 032] maps: remove osmupdater and osmimporter hiera passwords [labs/private] - 10https://gerrit.wikimedia.org/r/468631 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe)
[15:02:29] <wikibugs>	 (03CR) 10Marostegui: "Can we get a puppet compiler run to make sure it is a noop on the existing hosts?" [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek)
[15:03:51] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Move labvirt1010/1011 to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/476522 (https://phabricator.wikimedia.org/T210735) (owner: 10Andrew Bogott)
[15:08:53] <wikibugs>	 (03CR) 10Banyek: "> Can we get a puppet compiler run to make sure it is a noop on the" [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek)
[15:12:37] <icinga-wm>	 PROBLEM - Ensure that passive node gets the certificates from the active node as expected on certcentral2001 is CRITICAL: FILE_AGE CRITICAL: /var/lib/certcentral/live_certs/.rsync.status is 7357 seconds old and 0 bytes
[15:12:42] <wikibugs>	 (03CR) 10Jcrespo: "Note this doesn't yet handle firewalling." [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek)
[15:13:21] <icinga-wm>	 PROBLEM - Ensure cert-sync script runs successfully in the active node on certcentral1001 is CRITICAL: FILE_AGE CRITICAL: /var/lib/certcentral/live_certs/.rsync.done is 7400 seconds old and 0 bytes
[15:13:25] <icinga-wm>	 PROBLEM - Keyholder SSH agent on certcentral1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[15:13:39] <icinga-wm>	 PROBLEM - Keyholder SSH agent on certcentral2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[15:13:52] <wikibugs>	 (03CR) 10Alex Monk: [C: 031] certcentral: Mimick letsencrypt::cert::integrated key_group [puppet] - 10https://gerrit.wikimedia.org/r/476510 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[15:13:59] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on wtp2020 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1var-server=wtp2020var-datasource=codfw%2520prometheus%252Fops
[15:14:08] <vgutierrez>	 right...
[15:14:14] <vgutierrez>	 OMW :)
[15:14:25] <icinga-wm>	 PROBLEM - puppet last run on certcentral1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/certcentral-certs-sync]
[15:15:18] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Change-tagging, 10MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), and 3 others: Migrate tag_summary usage to change_tag and drop the table - https://phabricator.wikimedia.org/T209525 (10Marostegui) Which schema change?
[15:15:33] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): Decom/return labvirt1010 and 1011 - https://phabricator.wikimedia.org/T210735 (10Andrew) a:03RobH
[15:15:45] <icinga-wm>	 RECOVERY - Keyholder SSH agent on certcentral1001 is OK: OK: Keyholder is armed with all configured keys.
[15:16:54] <icinga-wm>	 RECOVERY - Ensure cert-sync script runs successfully in the active node on certcentral1001 is OK: FILE_AGE OK: /var/lib/certcentral/live_certs/.rsync.done is 19 seconds old and 0 bytes
[15:17:16] <vgutierrez>	 side effect of restarting certcentral nodes :)
[15:17:41] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Change-tagging, 10MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), and 3 others: Migrate tag_summary usage to change_tag and drop the table - https://phabricator.wikimedia.org/T209525 (10Ladsgroup) >>! In T209525#4785305, @Marostegui wrote: > Which schema change?   `DROP TABLE tag...
[15:18:13] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] certcentral: Mimick letsencrypt::cert::integrated key_group [puppet] - 10https://gerrit.wikimedia.org/r/476510 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[15:18:16] <logmsgbot>	 !log $WHO Running Wikibase populateSitesTable.php on eswiktionary for T210732
[15:18:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:20] <stashbot>	 T210732: wiktionary: /rpc/RunSingleJob.php   CannotCreateActorException from line 2540 of /srv/mediawiki/php-1.33.0-wmf.6/includes/user/User.php: Cannot create an actor for a usable name that is not an existing user - https://phabricator.wikimedia.org/T210732
[15:18:44] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Change-tagging, 10MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), and 3 others: Migrate tag_summary usage to change_tag and drop the table - https://phabricator.wikimedia.org/T209525 (10Marostegui) >>! In T209525#4785310, @Ladsgroup wrote: >>>! In T209525#4785305, @Marostegui wro...
[15:19:03] <wikibugs>	 (03PS2) 10Vgutierrez: certcentral: Mimick letsencrypt::cert::integrated key_group [puppet] - 10https://gerrit.wikimedia.org/r/476510 (https://phabricator.wikimedia.org/T207050)
[15:19:05] <wikibugs>	 (03PS2) 10Vgutierrez: certcentral: Provide TLS certificates for lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/476521 (https://phabricator.wikimedia.org/T207050)
[15:19:07] <wikibugs>	 (03PS4) 10Vgutierrez: gerrit: Switch between old LE puppetization and certcentral using hiera [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050)
[15:19:26] <icinga-wm>	 RECOVERY - puppet last run on certcentral1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[15:19:37] <wikibugs>	 (03PS2) 10Gehel: elasticsearch: add psi & omega in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/476471 (https://phabricator.wikimedia.org/T207918) (owner: 10DCausse)
[15:22:11] <wikibugs>	 (03CR) 10BBlack: [C: 031] certcentral: Mimick letsencrypt::cert::integrated key_group [puppet] - 10https://gerrit.wikimedia.org/r/476510 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[15:22:48] <wikibugs>	 (03PS3) 10Vgutierrez: certcentral: Mimick letsencrypt::cert::integrated key_group [puppet] - 10https://gerrit.wikimedia.org/r/476510 (https://phabricator.wikimedia.org/T207050)
[15:23:14] <icinga-wm>	 RECOVERY - Ensure that passive node gets the certificates from the active node as expected on certcentral2001 is OK: FILE_AGE OK: /var/lib/certcentral/live_certs/.rsync.status is 400 seconds old and 0 bytes
[15:24:05] <logmsgbot>	 \!log anomie@mwmaint1002 Running cleanupUsersWithNoId.php on eswiktionary recentchanges for T210732
[15:24:05] <stashbot>	 T210732: wiktionary: /rpc/RunSingleJob.php   CannotCreateActorException from line 2540 of /srv/mediawiki/php-1.33.0-wmf.6/includes/user/User.php: Cannot create an actor for a usable name that is not an existing user - https://phabricator.wikimedia.org/T210732
[15:25:02] <wikibugs>	 (03CR) 10Gehel: [C: 032] elasticsearch: add psi & omega in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/476471 (https://phabricator.wikimedia.org/T207918) (owner: 10DCausse)
[15:25:04] <icinga-wm>	 RECOVERY - Keyholder SSH agent on certcentral2001 is OK: OK: Keyholder is armed with all configured keys.
[15:25:10] <wikibugs>	 (03PS3) 10Gehel: elasticsearch: add psi & omega in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/476471 (https://phabricator.wikimedia.org/T207918) (owner: 10DCausse)
[15:26:35] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] certcentral: Provide TLS certificates for lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/476521 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[15:26:43] <wikibugs>	 (03PS3) 10Vgutierrez: certcentral: Provide TLS certificates for lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/476521 (https://phabricator.wikimedia.org/T207050)
[15:27:31] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) >>! In T208383#4784872, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations),...
[15:27:39] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui)
[15:28:54] <logmsgbot>	 \!log anomie@mwmaint1002 Running Wikibase/populateSitesTable.php and cleanupUsersWithNoId.php on several other wiktionaries for T210732
[15:29:29] <Krenair>	 that's weird
[15:29:37] <Krenair>	 why is logmsgbot escaping its !log messages?
[15:29:39] <Krenair>	 anomie, ^
[15:29:56] <gehel>	 !log activating multiple elasticsearch instances on cirrus / eqiad - T207918
[15:29:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:00] <stashbot>	 T207918: Refactor current code base to support multiple elasticsearch instances/multiple elasticsearch clusters - https://phabricator.wikimedia.org/T207918
[15:30:07] <dcausse>	 gehel: thanks!^
[15:30:21] <gehel>	 dcausse: wait until it actually works to thank me :)
[15:30:27] <dcausse>	 yes sure :)
[15:31:19] <wikibugs>	 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Banyek) I'll check what are the needs for achieving these goals in the DBA perspective
[15:31:58] <anomie>	 Krenair: ... It's me trying to use the logmsg command on mwmaint1002, and weirdness with bash wanting to treat "!log" as an event but apparently at the same time including the backslash when I escape it.
[15:32:07] <anomie>	 s/logmsg/dologmsg/
[15:32:14] <wikibugs>	 (03CR) 10Imarlier: profile::mediawiki::php: add opcache tuning for php-fpm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/476499 (https://phabricator.wikimedia.org/T206341) (owner: 10Giuseppe Lavagetto)
[15:32:23] <Krenair>	 ooh
[15:32:25] <Krenair>	 right
[15:32:35] <Krenair>	 because that part is not hardcoded in logmsgbot
[15:32:42] <logmsgbot>	 !log anomie@mwmaint1002 Running Wikibase/populateSitesTable.php and cleanupUsersWithNoId.php on several other wiktionaries for T210732
[15:32:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:46] <stashbot>	 T210732: wiktionary: /rpc/RunSingleJob.php   CannotCreateActorException from line 2540 of /srv/mediawiki/php-1.33.0-wmf.6/includes/user/User.php: Cannot create an actor for a usable name that is not an existing user - https://phabricator.wikimedia.org/T210732
[15:32:52] <hashar>	 I filed a few blocking tasks (they might not all be blockers though)
[15:32:53] <anomie>	 There, that time it worked.
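The quoting problem anomie describes can be illustrated with a minimal sketch. It assumes Python's `shlex` module as a stand-in for POSIX-style shell quoting; it does not model bash history expansion itself, and the `dologmsg` command lines are hypothetical examples rather than the exact invocation used on mwmaint1002. Inside double quotes a backslash before `!` is not a recognised escape, so it is kept in the argument; inside single quotes nothing needs escaping.

```python
import shlex


def shell_words(command_line):
    """Split a command line using POSIX-style quoting rules."""
    return shlex.split(command_line, posix=True)


# Double quotes: the backslash before "!" is not a recognised escape,
# so it stays in the argument (hypothetical command line).
double_quoted = shell_words(r'dologmsg "\!log Running maintenance script"')

# Single quotes: nothing inside is special, so no backslash is needed.
single_quoted = shell_words("dologmsg '!log Running maintenance script'")

print(double_quoted[1])  # \!log Running maintenance script
print(single_quoted[1])  # !log Running maintenance script
```

Under these assumptions, single-quoting the message is the simpler way to pass a literal `!log` prefix through unchanged.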
[15:32:58] <hashar>	 anomie: thanks for the patches :)
[15:33:16] <hashar>	 I have to head back home for car maintenance, will be back in a couple hours maybe and catch up later this evening
[15:33:43] <anomie>	 hashar: That one was no patches, apparently just the need to run populateSitesTable.php when enabling Wikibase on wikis never made it into the right documentation.
[15:34:01] <hashar>	 anomie: ohhhh so that sounds like an easy fix, doesn't it? :)
[15:34:27] <hashar>	 I will redo the all group deployment later during the US train window
[15:34:56] <anomie>	 hashar: Should be fixed already with the maintenance scripts I just ran.
[15:34:58] <wikibugs>	 (03CR) 10DCausse: [C: 031] elasticsearch: configure LVS endpoint for new codfw clusters [puppet] - 10https://gerrit.wikimedia.org/r/475753 (https://phabricator.wikimedia.org/T207195) (owner: 10Gehel)
[15:35:19] <gtirloni>	 !log T196507 downtimed and powercycled cloudvirt1019
[15:35:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:23] <stashbot>	 T196507: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507
[15:36:34] <hashar>	 anomie: excellent. Thank you :)  I will catch up later this evening
[15:52:58] <wikibugs>	 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10bmansurov) @Banyek thanks for helping. While we port repositories to Gerrit, [[ https://github.com/kodchi/research-article-recommender-deploy | here ]]...
[15:54:35] <papaul>	 !log shutting down ms-be2047 for maintenance 
[15:54:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:44] <wikibugs>	 (03PS2) 10Cwhite: hiera: add cluster definition to recursor role [puppet] - 10https://gerrit.wikimedia.org/r/476393 (https://phabricator.wikimedia.org/T210486)
[16:04:23] <wikibugs>	 10Operations, 10ops-eqiad, 10Traffic, 10netops: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10BBlack) p:05Normal>03High
[16:04:55] <wikibugs>	 10Operations, 10Icinga, 10Scoring-platform-team: Add ahalfaker to ORES-related icinga contacts - https://phabricator.wikimedia.org/T210742 (10Dzahn)
[16:06:34] <wikibugs>	 (03PS3) 10Cwhite: hiera: add cluster definition to recursor role [puppet] - 10https://gerrit.wikimedia.org/r/476393 (https://phabricator.wikimedia.org/T210486)
[16:08:12] <wikibugs>	 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul)   Please replace the remaining hardware sent.  I found the hardware below out of date.        IDRAC at 3.21.21.21  CPLD at 1.4.9  BIOS at 1.0.1     Suggested action plan:     1.    Clear System Event...
[16:08:54] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10GTirloni) If the batter is installed and, as the HPE advisories suggest, the firmwares have been updated _and_ we have many other servers with this controller that are work...
[16:09:13] <wikibugs>	 (03PS3) 10Jcrespo: admin: Add jgleeson access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476004 (https://phabricator.wikimedia.org/T208432)
[16:09:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] admin: Add jgleeson access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476004 (https://phabricator.wikimedia.org/T208432) (owner: 10Jcrespo)
[16:10:57] <logmsgbot>	 !log anomie@mwmaint1002 Running Wikibase/populateSitesTable.php and cleanupUsersWithNoId.php on more wiktionaries, incubatorwiki, and sourceswiki for T210732
[16:11:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:01] <stashbot>	 T210732: wiktionary: /rpc/RunSingleJob.php   CannotCreateActorException from line 2540 of /srv/mediawiki/php-1.33.0-wmf.6/includes/user/User.php: Cannot create an actor for a usable name that is not an existing user - https://phabricator.wikimedia.org/T210732
[16:12:22] <Platonides>	 users with no id? how's that even possible? :P
[16:17:44] <anomie>	 Platonides: It used to be possible for imports and cross-wiki things like Wikidata's recentchanges entries to attribute something to a named user without that user existing locally.
[16:18:02] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10GTirloni) @robh @Cmjohnson Can we get a technician from HP on site with various parts (cards, batteries, etc) to try and fix this?
[16:18:09] <wikibugs>	 (03PS4) 10Jcrespo: admin: Add jgleeson access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476004 (https://phabricator.wikimedia.org/T208432)
[16:21:16] <Platonides>	 that's true
[16:21:36] <Platonides>	 didn't think of that
[16:21:44] <Platonides>	 although I don't see why it would be a problem :P
[16:22:39] <icinga-wm>	 RECOVERY - Host lvs1006 is UP: PING WARNING - Packet loss = 86%, RTA = 0.33 ms
[16:24:19] <wikibugs>	 10Operations, 10ops-eqiad, 10Traffic, 10netops: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10Cmjohnson) @bblack @ayounsi sfp-t was bad, replaced and the link is up
[16:27:44] <wikibugs>	 10Operations, 10ops-eqiad, 10Traffic, 10netops: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10Cmjohnson) 05Open>03Resolved
[16:28:38] <wikibugs>	 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) Did all the dell engineer recommended above. Waiting  to proceed to step 10 .
[16:31:05] <icinga-wm>	 PROBLEM - puppet last run on db1117 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:38:54] <wikibugs>	 (03PS3) 10Herron: rsyslog:input:file add multiline handling and ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324)
[16:39:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] rsyslog:input:file add multiline handling and ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron)
[16:40:14] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-VPS, 10cloud-services-team: [Cloud VPS alert] Puppet failure on deployment-logstash2.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T210718 (10MarcoAurelio) Thanks :)
[16:40:19] <wikibugs>	 (03PS4) 10Herron: rsyslog:input:file add multiline handling and ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324)
[16:40:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] rsyslog:input:file add multiline handling and ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron)
[16:43:26] <wikibugs>	 (03PS5) 10Herron: rsyslog:input:file add multiline handling and ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324)
[16:43:39] <wikibugs>	 10Operations, 10procurement, 10Discovery-Search (Current work): Setup elasticsearch on new codfw servers - https://phabricator.wikimedia.org/T210265 (10RobH)
[16:43:47] <wikibugs>	 10Operations, 10Discovery-Search (Current work): Setup elasticsearch on new codfw servers - https://phabricator.wikimedia.org/T210265 (10RobH)
[16:44:17] <wikibugs>	 10Operations, 10Discovery-Search (Current work): Setup elasticsearch on new codfw servers - https://phabricator.wikimedia.org/T210265 (10RobH) I went ahead and moved this out of S4 (as its not procurement), back into S1, and removed the #procurement project.
[16:46:52] <wikibugs>	 (03CR) 10Paladox: [C: 031] gerrit: Switch between old LE puppetization and certcentral using hiera [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[16:47:06] <wikibugs>	 (03Abandoned) 10Mathew.onipe: base::monitoring::host: added icinga prometheus check for network drops [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe)
[16:47:43] <wikibugs>	 (03Abandoned) 10Mathew.onipe: Add elasticsearch [cookbooks] - 10https://gerrit.wikimedia.org/r/462514 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe)
[16:48:23] <wikibugs>	 (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/13782/" [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron)
[16:49:49] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10herron)
[16:49:56] <wikibugs>	 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Banyek) as we were talking with @bmansurov I learned that we need to keep old data after new import is not considered working.  my recommendation is to...
[16:50:17] <wikibugs>	 (03PS6) 10Herron: rsyslog:input:file add multiline handling and ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324)
[16:50:25] <banyek>	 I'd like to ask your opinion about https://phabricator.wikimedia.org/T208622#4785750
[16:50:30] <banyek>	 tomorrow
[16:50:33] <banyek>	 today I leave
[16:50:46] <banyek>	 bye
[16:51:53] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): Decom/return labvirt1010 and 1011 - https://phabricator.wikimedia.org/T210735 (10RobH)
[16:53:01] <banyek|away>	 (nothere)
[16:57:44] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Decom/return labvirt1010 and 1011 - https://phabricator.wikimedia.org/T210735 (10RobH)
[16:58:09] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10RobH) p:05Triage>03High
[17:00:04] <jouncebot>	 godog and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T1700).
[17:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[17:02:07] <robh>	 !log decom of labvirt101[01] continuing
[17:02:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:09] <icinga-wm>	 RECOVERY - puppet last run on db1117 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:02:13] <robh>	 they shouldnt echo, but just in case...
[17:03:54] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10RobH) Switch ports on asw2-b-eqiad:   ` robh@asw2-b-eqiad> show interfaces descriptions | grep labvirt1010  ge-3/0/14       up    up   l...
[17:05:41] <wikibugs>	 (03PS1) 10Jcrespo: admin: Add addshore to graphite-admins; allow _grahite commands [puppet] - 10https://gerrit.wikimedia.org/r/476558 (https://phabricator.wikimedia.org/T208750)
[17:05:45] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10RobH)
[17:07:03] <wikibugs>	 (03CR) 10Jcrespo: "This is a first version without even looking at the puppet classes for the machine, I need to properly review it, too." [puppet] - 10https://gerrit.wikimedia.org/r/476558 (https://phabricator.wikimedia.org/T208750) (owner: 10Jcrespo)
[17:09:41] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for labvirt1010.eqiad.wmnet and performed the following actions: - Revoke...
[17:09:43] <wikibugs>	 (03PS1) 10Faidon Liambotis: openstack: make ::neutron::dmz_cidr an array [puppet] - 10https://gerrit.wikimedia.org/r/476567 (https://phabricator.wikimedia.org/T210754)
[17:10:04] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for labvirt1011.eqiad.wmnet and performed the following actions: - Revoke...
[17:10:40] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10RobH)
[17:10:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: make ::neutron::dmz_cidr an array [puppet] - 10https://gerrit.wikimedia.org/r/476567 (https://phabricator.wikimedia.org/T210754) (owner: 10Faidon Liambotis)
[17:14:05] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Smalyshev) > In the meantime, can I delete t206636-3?  Yes.
[17:14:50] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] "Mostly nits, LGTM otherwise" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron)
[17:15:10] <wikibugs>	 (03PS1) 10RobH: removing references to decom servers labvirt101[01] [puppet] - 10https://gerrit.wikimedia.org/r/476570 (https://phabricator.wikimedia.org/T210735)
[17:18:18] <wikibugs>	 (03CR) 10RobH: [C: 032] removing references to decom servers labvirt101[01] [puppet] - 10https://gerrit.wikimedia.org/r/476570 (https://phabricator.wikimedia.org/T210735) (owner: 10RobH)
[17:18:53] <wikibugs>	 (03PS16) 10DCausse: [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381)
[17:18:55] <wikibugs>	 (03PS6) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381)
[17:18:57] <wikibugs>	 (03PS6) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381)
[17:18:59] <wikibugs>	 (03PS8) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381)
[17:19:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[17:19:26] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10RobH)
[17:19:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[17:19:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[17:20:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[17:20:14] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10RobH) a:05RobH>03Cmjohnson Ok, these are ready for @cmjohnson to do the SSD smartctl secure erase on these systems.  As these are le...
[17:20:23] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10RobH)
[17:28:30] <wikibugs>	 (03CR) 10Herron: "bunch of nitpicks" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475352 (https://phabricator.wikimedia.org/T205851) (owner: 10Filippo Giunchedi)
[17:32:35] <wikibugs>	 (03CR) 10Elukey: "Added Mark/Faidon to see if this can be merged now as opposed to wait for the SRE meeting, since basically the team already approved what " [puppet] - 10https://gerrit.wikimedia.org/r/475984 (owner: 10Elukey)
[17:33:22] <wikibugs>	 (03CR) 10Herron: [C: 04-2] "> Swift produces a significant amount of logs (~1-4G/day compressed)" [puppet] - 10https://gerrit.wikimedia.org/r/475898 (https://phabricator.wikimedia.org/T63780) (owner: 10Herron)
[17:34:51] <logmsgbot>	 !log anomie@deploy1001 Synchronized php-1.33.0-wmf.6/includes/revisiondelete/RevisionDeleteUser.php: Fix RevisionDeleteUser rev_actor query for MySQL (T210628) (duration: 00m 53s)
[17:34:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:46] <wikibugs>	 10Operations, 10ORES, 10vm-requests, 10Scoring-platform-team (Current): New node request: oresrdb[12]003 - https://phabricator.wikimedia.org/T210582 (10akosiaris)
[17:41:55] <logmsgbot>	 !log anomie@deploy1001 Synchronized php-1.33.0-wmf.6/includes/revisiondelete/RevisionDeleteUser.php: Fix RevisionDeleteUser rev_actor query for MySQL, for real this time (T210628) (duration: 00m 53s)
[17:41:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:00] <logmsgbot>	 !log mforns@deploy1001 Started deploy [analytics/refinery@40b1972]: deploying refinery to refinery-source version v0.0.81
[17:43:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:23] <mutante>	 Krenair: do i remember correctly you once had a script that parsed the admin.yaml 
[17:44:50] <mutante>	 where you could easily do stuff like "are all members of group A also in group B" from the yaml
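The kind of check mutante asks about ("are all members of group A also in group B") could be sketched roughly as follows. This is not Krenair's script; the `groups` / `members` layout, the group names, and the user names are assumptions for illustration, with a plain dict standing in for the parsed admin.yaml.

```python
# Hypothetical stand-in for the parsed contents of admin.yaml.
admin_data = {
    "groups": {
        "group-a": {"members": ["alice", "bob"]},
        "group-b": {"members": ["alice", "bob", "carol"]},
    }
}


def members_missing_from(data, group_a, group_b):
    """Return the members of group_a that are not in group_b, sorted."""
    groups = data["groups"]
    a = set(groups[group_a].get("members", []))
    b = set(groups[group_b].get("members", []))
    return sorted(a - b)


missing = members_missing_from(admin_data, "group-a", "group-b")
if missing:
    print("missing from group-b:", ", ".join(missing))
else:
    print("all members of group-a are also in group-b")
```

An empty result means group A is a subset of group B; a non-empty one lists exactly which users would need to be added.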
[17:48:09] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10jcrespo) Adding DBA for the few db hosts that shouldn't be there, remove the tag when those are fixed:  * New pc* hosts * New dbstore* hosts * dbmonitor (unsure of that one, that is most likely...
[17:48:13] <wikibugs>	 10Operations, 10Proton, 10Services (doing): Increase the CPU count for proton[12]00[12] - https://phabricator.wikimedia.org/T197862 (10pmiazga)
[17:48:16] <wikibugs>	 10Operations, 10Electron-PDFs, 10Proton, 10Epic, and 4 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10pmiazga)
[17:49:00] <logmsgbot>	 !log mforns@deploy1001 Finished deploy [analytics/refinery@40b1972]: deploying refinery to refinery-source version v0.0.81 (duration: 06m 01s)
[17:49:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:41] <icinga-wm>	 PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:49:43] <icinga-wm>	 PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:54:23] <icinga-wm>	 RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:54:25] <icinga-wm>	 RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:55:45] <XioNoX>	 !log remove test netbox user from cr3-ulsfo - T205898
[17:55:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:48] <stashbot>	 T205898: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898
[17:56:15] <wikibugs>	 (03PS7) 10Herron: rsyslog: input::file add multiline handling & ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324)
[17:56:17] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10Marostegui) I can fix `regex.yaml` to add the new parsercache there, but the dbstore appearing on that list do not exist: dbstore1003 and dbstore1005
[17:56:27] <wikibugs>	 (03CR) 10Herron: rsyslog: input::file add multiline handling & ship gerrit logs to ELK (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron)
[17:57:35] <wikibugs>	 (03PS8) 10Herron: rsyslog: input::file add multiline handling & ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324)
[17:57:57] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to "stat1007" for "researchers" group - https://phabricator.wikimedia.org/T210757 (10bmansurov)
[17:58:45] <wikibugs>	 (03CR) 10Herron: [C: 032] rsyslog: input::file add multiline handling & ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron)
[17:58:50] <wikibugs>	 (03PS1) 10Ayounsi: Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476581 (https://phabricator.wikimedia.org/T205898)
[17:59:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476581 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi)
[18:00:01] <wikibugs>	 10Operations, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps - https://phabricator.wikimedia.org/T210757 (10bmansurov)
[18:00:04] <jouncebot>	 cscott, arlolra, subbu, halfak, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T1800).
[18:05:39] <wikibugs>	 (03Abandoned) 10Ayounsi: Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476581 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi)
[18:05:41] <wikibugs>	 (03CR) 10DCausse: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[18:05:58] <wikibugs>	 (03CR) 10Jgleeson: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/476004 (https://phabricator.wikimedia.org/T208432) (owner: 10Jcrespo)
[18:06:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[18:07:07] <wikibugs>	 (03PS1) 10Ayounsi: Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476583 (https://phabricator.wikimedia.org/T205898)
[18:07:29] <wikibugs>	 10Operations, 10ops-eqiad: eqiad pdu audit - https://phabricator.wikimedia.org/T210760 (10RobH) p:05Triage>03Normal
[18:07:40] <wikibugs>	 (03PS1) 10Ayounsi: Revert "Add fake ssh keys for netbox user" [labs/private] - 10https://gerrit.wikimedia.org/r/476584 (https://phabricator.wikimedia.org/T205898)
[18:08:44] <wikibugs>	 (03CR) 10Ayounsi: [C: 032] Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476583 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi)
[18:09:03] <icinga-wm>	 PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:09:05] <icinga-wm>	 PROBLEM - puppet last run on phab1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:11:54] <wikibugs>	 (03CR) 10Paladox: rsyslog: input::file add multiline handling & ship gerrit logs to ELK (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron)
[18:12:22] <wikibugs>	 (03PS2) 10Ayounsi: Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476583 (https://phabricator.wikimedia.org/T205898)
[18:12:35] <icinga-wm>	 PROBLEM - puppet last run on people1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:12:45] <wikibugs>	 (03CR) 10Herron: [C: 032] rsyslog: input::file add multiline handling & ship gerrit logs to ELK (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron)
[18:12:53] <wikibugs>	 10Operations, 10ops-eqiad: eqiad pdu audit - https://phabricator.wikimedia.org/T210760 (10RobH)
[18:13:44] <anomie>	 cscott, arlolra, subbu, halfak, and Amir1: Are you using your window today? If not, I'd like to deploy a config change.
[18:13:53] <wikibugs>	 (03PS8) 10DCausse: [cirrus] Allow configuration arrays in production services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475747 (https://phabricator.wikimedia.org/T210381)
[18:13:55] <wikibugs>	 (03PS8) 10DCausse: [cirrus] switch to explicit config in production services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475748 (https://phabricator.wikimedia.org/T210381)
[18:13:57] <wikibugs>	 (03PS8) 10DCausse: [cirrus] prepare multi-instance services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381)
[18:13:59] <wikibugs>	 (03PS17) 10DCausse: [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381)
[18:14:02] <wikibugs>	 (03PS7) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381)
[18:14:03] <subbu>	 anomie, no parsoid deploy today
[18:14:04] <wikibugs>	 (03PS7) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381)
[18:14:06] <wikibugs>	 (03PS9) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381)
[18:14:21] <Amir1>	 No deploy for ores now
[18:14:31] <icinga-wm>	 PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:15:08] <herron>	 these recent puppet last runs are my bad, pushing a fix shortly
[18:16:22] <wikibugs>	 (03PS1) 10Herron: rsyslog::input::file fix startmsg_regex data type [puppet] - 10https://gerrit.wikimedia.org/r/476587
[18:16:36] <wikibugs>	 10Operations, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps - https://phabricator.wikimedia.org/T210757 (10Dzahn) i suggested to create this access request.  per IRC chat, adding some detail  what is needed / requested here:  add the existing admin...
[18:17:27] <wikibugs>	 (03CR) 10Herron: [C: 032] rsyslog::input::file fix startmsg_regex data type [puppet] - 10https://gerrit.wikimedia.org/r/476587 (owner: 10Herron)
[18:18:07] <icinga-wm>	 PROBLEM - puppet last run on puppetdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:19:19] <wikibugs>	 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Broken elasticsearch-prometheus-exporter service on logstash nodes after reboot - https://phabricator.wikimedia.org/T210597 (10EBjune)
[18:19:24] <paladox>	 herron i now get: Detail: undefined method `empty?' for nil:NilClass
[18:19:35] <paladox>	 Filepath: /etc/puppet/modules/rsyslog/templates/input/file.erb
[18:19:37] <herron>	 yep, reverting
[18:19:57] <paladox>	 herron i think replacing .empty will fix it (ie just @var)
[18:20:35] <herron>	 need to correct the template and there’s another issue with puppet escaping the regex string as well
[18:20:49] <herron>	 I’ll revert and fix outside prod then resubmit
[18:21:06] <wikibugs>	 (03PS1) 10Herron: Revert "rsyslog: input::file add multiline handling & ship gerrit logs to ELK" [puppet] - 10https://gerrit.wikimedia.org/r/476590
[18:21:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "rsyslog: input::file add multiline handling & ship gerrit logs to ELK" [puppet] - 10https://gerrit.wikimedia.org/r/476590 (owner: 10Herron)
[18:23:18] <wikibugs>	 (03PS2) 10Herron: Revert "rsyslog: input::file add multiline handling & ship gerrit logs to ELK" [puppet] - 10https://gerrit.wikimedia.org/r/476590
[18:24:09] <wikibugs>	 10Operations, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn)
[18:24:39] <wikibugs>	 (03CR) 10Herron: [C: 032] Revert "rsyslog: input::file add multiline handling & ship gerrit logs to ELK" [puppet] - 10https://gerrit.wikimedia.org/r/476590 (owner: 10Herron)
[18:24:44] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn)
[18:25:32] <wikibugs>	 (03PS1) 10Anomie: Set comment migration stage to new on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476591 (https://phabricator.wikimedia.org/T166733)
[18:25:36] <wikibugs>	 (03PS1) 10Herron: rsyslog: input::file add multiline handling & ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/476592
[18:25:52] <wikibugs>	 (03CR) 10Anomie: [C: 032] "Deploying config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476591 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie)
[18:26:29] <wikibugs>	 (03PS2) 10Herron: rsyslog: input::file add multiline handling & ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/476592 (https://phabricator.wikimedia.org/T141324)
[18:26:40] <wikibugs>	 (03CR) 10Herron: "follow up to Ic843d3b0a1a40f831e569006776c24ec7cf54033" [puppet] - 10https://gerrit.wikimedia.org/r/476592 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron)
[18:27:27] <wikibugs>	 (03Merged) 10jenkins-bot: Set comment migration stage to new on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476591 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie)
[18:27:47] <icinga-wm>	 PROBLEM - puppet last run on phab1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:28:46] <wikibugs>	 (03CR) 10Paladox: rsyslog: input::file add multiline handling & ship gerrit logs to ELK (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/476592 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron)
[18:28:54] <logmsgbot>	 !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting comment migration to write-new/read-new on group 0 (T166733) (duration: 00m 52s)
[18:28:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:58] <stashbot>	 T166733: Deploy refactored comment storage - https://phabricator.wikimedia.org/T166733
[18:29:31] <icinga-wm>	 PROBLEM - puppet last run on puppetdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:29:46] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn) what it would mean: all members of "researchers":   `     members: [a...
[18:31:11] <wikibugs>	 (03CR) 10jenkins-bot: Set comment migration stage to new on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476591 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie)
[18:32:21] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn) The alternative is creating an entirely new group  with a better name...
[18:33:20] <wikibugs>	 10Operations, 10ops-eqiad, 10media-storage: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (10Cmjohnson) @fgiunchedi For racking this is the space I have   I can do at least 3 in A with out a problem,  I can only 2 in C and that would be the same rack (C2)  B can ha...
[18:33:39] <icinga-wm>	 RECOVERY - puppet last run on puppetdb1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[18:33:59] <wikibugs>	 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn)
[18:34:01] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn)
[18:34:43] <icinga-wm>	 RECOVERY - puppet last run on puppetdb2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[18:34:55] <icinga-wm>	 RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[18:34:58] <icinga-wm>	 RECOVERY - puppet last run on phab1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:36:01] <jdlrobson>	 hey all.. which username / password should i be using for https://logstash-beta.wmflabs.org/ - i cant seem to get access with wikitech credentials
[18:37:19] <jdlrobson>	 and if i dont have access could somebody give me access?
[18:39:48] <jynus>	 jdlrobson: https://www.mediawiki.org/wiki/Beta_Cluster#Testing_changes_on_Beta_Cluster
[18:40:49] <jdlrobson>	 jynus: you are a wonderful person. Thank you! you've saved me hours <3
[18:40:53] <wikibugs>	 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn) Also talked with bmansurov about the puppetization part. There is my pending Gerrit change for the git clone part (pending repos are on Gerrit) a...
[18:41:38] <jynus>	 lol I just searched logstash on the wiki
[18:42:11] <wikibugs>	 10Operations, 10CirrusSearch, 10Discovery-Search: Find an alternative to curl connection pooling available in HHVM - https://phabricator.wikimedia.org/T210717 (10EBernhardson) Throwing some ideas out there:  * PHP requests are stateless, and trying to share something, even an open socket, is painful. * It se...
[18:42:16] <jdlrobson>	 but your searching skills are clearly better than mine :)
[18:43:39] <icinga-wm>	 RECOVERY - puppet last run on people1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[18:45:13] <wikibugs>	 (03CR) 10Dzahn: [C: 031] "lgtm!  https://puppet-compiler.wmflabs.org/compiler1002/13785/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[18:45:35] <icinga-wm>	 RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[18:46:55] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Browser-Tests, and 2 others: Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode - https://phabricator.wikimedia.org/T210557 (10Jdlrobson) p:05High>03Unbreak!
[18:47:46] <wikibugs>	 (03PS3) 10Ayounsi: Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476583 (https://phabricator.wikimedia.org/T205898)
[18:48:38] <wikibugs>	 (03PS3) 10Herron: rsyslog: input::file add multiline handling & ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/476592 (https://phabricator.wikimedia.org/T141324)
[18:50:05] <wikibugs>	 (03CR) 10Ayounsi: [C: 032] Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476583 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi)
[18:50:06] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Nuria) What are recommendation api dumps? If they are destined for productio...
[18:50:11] <wikibugs>	 (03CR) 10Ayounsi: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13786/" [puppet] - 10https://gerrit.wikimedia.org/r/476583 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi)
[18:50:24] <wikibugs>	 (03PS4) 10Ayounsi: Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476583 (https://phabricator.wikimedia.org/T205898)
[18:51:42] <raynor>	 greg-g - I backported the MobileFrontend fix to 1.33.0-wmf.6
[18:52:19] <raynor>	 CI is in progress, it should be merged soon
[18:53:04] <wikibugs>	 (03CR) 10Ottomata: [C: 031] "Nice, is the ensure_resource( 'file' ... ) just to work around if !defined(File[...]) stuff?" [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn)
[18:53:29] <XioNoX>	 !log Netbox: remove Napalm integration
[18:53:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:33] <wikibugs>	 (03CR) 10Herron: rsyslog: input::file add multiline handling & ship gerrit logs to ELK (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/476592 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron)
[18:57:21] <icinga-wm>	 PROBLEM - Keyholder SSH agent on netmon1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[18:58:53] <icinga-wm>	 RECOVERY - puppet last run on phab1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[19:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T1900).
[19:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[19:03:19] <wikibugs>	 (03PS2) 10Ayounsi: Revert "Add fake ssh keys for netbox user" [labs/private] - 10https://gerrit.wikimedia.org/r/476584 (https://phabricator.wikimedia.org/T205898)
[19:03:40] <wikibugs>	 (03CR) 10Ayounsi: [V: 032 C: 032] Revert "Add fake ssh keys for netbox user" [labs/private] - 10https://gerrit.wikimedia.org/r/476584 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi)
[19:04:00] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10bmansurov) @Nuria we are using Spark, Wikidata dumps in Hadoop, and some Hiv...
[19:06:26] <wikibugs>	 10Operations, 10Patch-For-Review: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (10ayounsi) All test configuration for Netbox/Napalm has been removed.
[19:07:00] <hashar>	 hello again
[19:07:52] <wikibugs>	 (03PS6) 10Ayounsi: Icinga: add check_vcp (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097)
[19:09:16] <wikibugs>	 (03CR) 10Ayounsi: [C: 032] Icinga: add check_vcp (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097) (owner: 10Ayounsi)
[19:11:43] <icinga-wm>	 PROBLEM - Keyholder SSH agent on netmon2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[19:12:21] <wikibugs>	 (03CR) 10EBernhardson: [C: 031] [cirrus] prepare multi-instance services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[19:13:23] <hashar>	 raynor: hello :) I will take care of deploying your MobileFrontend patch for 1.33.0-wmf.6 :)
[19:13:29] <hashar>	 raynor: thanks for the quick fix and backport!
[19:13:56] <raynor>	 np, sorry for merging broken code earlier. I should pay a bit more attention to return types
[19:16:19] <logmsgbot>	 !log hashar@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/MobileFrontend/: RecordRevision::getUser() returns UserIdentity not int - T210737 (duration: 00m 55s)
[19:16:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:22] <stashbot>	 T210737: Various User.php: PHP Notice: Object of class User could not be converted to int - https://phabricator.wikimedia.org/T210737
[19:16:50] <hashar>	 raynor: it happens. I will roll the train to all wikis once I am done with all the hotfixes
[19:17:30] <wikibugs>	 (03CR) 10Dzahn: "thanks! yea, it's to avoid any duplicate definitions when you have multiple profiles ensuring /srv/research/ is a directory." [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn)
[19:20:15] <wikibugs>	 (03CR) 10Dzahn: "now this is just waiting on the requested repos being created on gerrit and content moved from github" [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn)
[19:21:46] <wikibugs>	 (03PS1) 10Ottomata: Use refinery-job 0.0.81 for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/476601 (https://phabricator.wikimedia.org/T210465)
[19:23:07] <wikibugs>	 (03CR) 10Dzahn: [C: 031] "wow, you already merged it. thanks :))" [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn)
[19:23:18] <wikibugs>	 (03PS1) 10Muehlenhoff: Update MOU dates for pirroh and piccardi [puppet] - 10https://gerrit.wikimedia.org/r/476602
[19:24:42] <wikibugs>	 (03CR) 10EBernhardson: [C: 031] [cirrus] prepare multi-instance services (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[19:26:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Update MOU dates for pirroh and piccardi [puppet] - 10https://gerrit.wikimedia.org/r/476602 (owner: 10Muehlenhoff)
[19:29:17] <ShakespeareFan00>	 Hi all
[19:29:44] <ShakespeareFan00>	 Is it just me or are other people reporting suspect login attempts at the moment?
[19:30:48] <wikibugs>	 (03PS1) 10Ayounsi: Icinga, assign check_vcp to all VC switches [puppet] - 10https://gerrit.wikimedia.org/r/476604 (https://phabricator.wikimedia.org/T201097)
[19:33:34] <Ebe123>	 Hello ShakespeareFan00!
[19:33:57] <wikibugs>	 (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/13787/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/476604 (https://phabricator.wikimedia.org/T201097) (owner: 10Ayounsi)
[19:33:58] <Ebe123>	 Don't know; haven't gotten one
[19:36:30] <wikibugs>	 (03CR) 10EBernhardson: [cirrus] Add temp clusters but still write to the old ones (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[19:41:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Absent NfsdCollector Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/476606 (https://phabricator.wikimedia.org/T183454)
[19:42:03] <XioNoX>	 !log remove neodymium/sarin from mgmt routers - T210612
[19:42:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:07] <stashbot>	 T210612: Remove neodymium/sarin from router ACLs - https://phabricator.wikimedia.org/T210612
[19:43:57] <wikibugs>	 (03CR) 10EBernhardson: [cirrus] Add temp clusters but still write to the old ones (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[19:45:03] <wikibugs>	 (03CR) 10GTirloni: [C: 032] Absent NfsdCollector Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/476606 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[19:45:14] <wikibugs>	 (03CR) 10EBernhardson: [C: 031] [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[19:46:43] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Epic, 10Patch-For-Review: Migrate elasticsearch scripts to spicerack cookbooks - https://phabricator.wikimedia.org/T202885 (10debt)
[19:46:47] <wikibugs>	 10Operations, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Refactor current code base to support multiple elasticsearch instances/multiple elasticsearch clusters - https://phabricator.wikimedia.org/T207918 (10debt) 05Open>03Resolved
[19:49:07] <wikibugs>	 10Operations, 10netops: Remove neodymium/sarin from router ACLs - https://phabricator.wikimedia.org/T210612 (10ayounsi) 05Open>03Resolved a:03ayounsi Removed!
[19:50:13] <wikibugs>	 (03CR) 10EBernhardson: [C: 031] [cirrus] Start using replica group settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[19:55:00] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10GTirloni)
[19:55:02] <wikibugs>	 (03CR) 10Dzahn: [C: 031] Icinga, assign check_vcp to all VC switches [puppet] - 10https://gerrit.wikimedia.org/r/476604 (https://phabricator.wikimedia.org/T201097) (owner: 10Ayounsi)
[19:56:53] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Refactor puppet WDQS module - https://phabricator.wikimedia.org/T208201 (10debt) 05Open>03Resolved
[19:58:43] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Refactor puppet WDQS module - https://phabricator.wikimedia.org/T208201 (10debt)
[19:58:52] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Refactor wdqs::gui - Separate cron tasks from the module - https://phabricator.wikimedia.org/T209257 (10debt) 05Open>03Resolved
[19:59:02] <wikibugs>	 (03CR) 10Ayounsi: [C: 032] Icinga, assign check_vcp to all VC switches [puppet] - 10https://gerrit.wikimedia.org/r/476604 (https://phabricator.wikimedia.org/T201097) (owner: 10Ayounsi)
[19:59:09] <wikibugs>	 (03PS2) 10Ayounsi: Icinga, assign check_vcp to all VC switches [puppet] - 10https://gerrit.wikimedia.org/r/476604 (https://phabricator.wikimedia.org/T201097)
[20:00:04] <jouncebot>	 Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T2000)
[20:01:10] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure: "Obama" page on Beta Cluster often responds with 503 - https://phabricator.wikimedia.org/T188913 (10Jdlrobson)
[20:01:22] <XioNoX>	 !log Apply Icinga:check_vcp to all VC switches - T201097
[20:01:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:26] <stashbot>	 T201097: Add virtual chassis port status alerting - https://phabricator.wikimedia.org/T201097
[20:01:55] <wikibugs>	 (03PS2) 10Ottomata: Use refinery-job 0.0.81 for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/476601 (https://phabricator.wikimedia.org/T210465)
[20:02:08] <wikibugs>	 (03CR) 10Ottomata: [C: 032] "tested and works fine" [puppet] - 10https://gerrit.wikimedia.org/r/476601 (https://phabricator.wikimedia.org/T210465) (owner: 10Ottomata)
[20:02:11] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Use refinery-job 0.0.81 for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/476601 (https://phabricator.wikimedia.org/T210465) (owner: 10Ottomata)
[20:07:03] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw2-c-eqiad is CRITICAL: CRIT: Down: 1 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[20:07:40] <XioNoX>	 rgr!
[20:16:14] <XioNoX>	 robh, papaul, cmjohnson1, see above, I added an Icinga check for Virtual Chassis ports, with the following runbook: https://wikitech.wikimedia.org/wiki/Network_monitoring#VCP_status
[20:16:40] <wikibugs>	 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul)
[20:16:45] <XioNoX>	 Basically offloading it all to DCops :) let me know if you have questions
[20:16:52] <cmjohnson1>	 thanks
[20:17:47] <XioNoX>	 cmjohnson1: and asw2-c-eqiad is complaining about a down port, if you want to be the 1st one to test that runbook :)
[20:18:02] <mutante>	 :) cool stuff
[20:28:07] <wikibugs>	 (03PS11) 10Dzahn: create profile::research::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622)
[20:28:47] <wikibugs>	 (03PS1) 10Ottomata: Blacklist mediawiki_revision_score from refine again until we fix problem [puppet] - 10https://gerrit.wikimedia.org/r/476615 (https://phabricator.wikimedia.org/T210465)
[20:29:55] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "adjusted repo / dir name to match: <bmansurov> And this one is pending: https://gerrit.wikimedia.org/r/#/admin/projects/research/article-r" [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn)
[20:30:00] <wikibugs>	 (03CR) 10Ottomata: [C: 032] Blacklist mediawiki_revision_score from refine again until we fix problem [puppet] - 10https://gerrit.wikimedia.org/r/476615 (https://phabricator.wikimedia.org/T210465) (owner: 10Ottomata)
[20:31:20] <wikibugs>	 (03PS3) 10Dzahn: cache/trafficserver: replace rutherfordium with people1001, backend and director [puppet] - 10https://gerrit.wikimedia.org/r/475236 (https://phabricator.wikimedia.org/T210036)
[20:33:40] <wikibugs>	 (03PS4) 10Dzahn: cache/trafficserver: replace rutherfordium with people1001, backend and director [puppet] - 10https://gerrit.wikimedia.org/r/475236 (https://phabricator.wikimedia.org/T210036)
[20:33:57] <wikibugs>	 (03PS5) 10Dzahn: cache/trafficserver: replace rutherfordium with people1001, backend and director [puppet] - 10https://gerrit.wikimedia.org/r/475236 (https://phabricator.wikimedia.org/T210036)
[20:34:55] <wikibugs>	 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul)
[20:36:11] <wikibugs>	 (03CR) 10EBernhardson: [cirrus] Cleanup transitional states (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[20:36:59] <wikibugs>	 (03CR) 10BryanDavis: [C: 031] deployment-prep: Try changing redis_lock entries to memc hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475025 (https://phabricator.wikimedia.org/T210030) (owner: 10Alex Monk)
[20:38:11] <wikibugs>	 (03CR) 10Dzahn: [C: 032] cache/trafficserver: replace rutherfordium with people1001, backend and director [puppet] - 10https://gerrit.wikimedia.org/r/475236 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn)
[20:42:42] <mutante>	 !log people.wikimedia.org is switching backends from rutherfordium to people1001, please stand by during a short maintenance period.. data has been copied  | https://wikitech.wikimedia.org/wiki/People.wikimedia.org#Backend_upgrade_November_2018 | T210036
[20:42:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:46] <stashbot>	 T210036: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036
[20:44:40] <chasemp>	 mutante: people1001 is the best server name yet
[20:45:09] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "watched puppet run and service refresh on cp1079, saw no issue" [puppet] - 10https://gerrit.wikimedia.org/r/475236 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn)
[20:45:35] <mutante>	 chasemp: hehehe, i made sure to add it to the official naming standard page
[20:46:14] <mutante>	 folks can just use "people.eqiad" without remembering a number
[20:46:24] <mutante>	 for the future
[20:46:26] <cdanis>	 👍
[20:46:50] <mutante>	 no people.codfw yet :p
[20:47:56] <wikibugs>	 (03PS2) 10Dzahn: switch people.eqiad from rutherfordium to people1001 [dns] - 10https://gerrit.wikimedia.org/r/475234
[20:48:23] <wikibugs>	 (03CR) 10Dzahn: [C: 032] switch people.eqiad from rutherfordium to people1001 [dns] - 10https://gerrit.wikimedia.org/r/475234 (owner: 10Dzahn)
[20:50:02] <mutante>	 !log people - rsynced /home one last time, switched DNS people.eqiad CNAME over, varnish change merged (T210036)
[20:50:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:06] <stashbot>	 T210036: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036
[20:54:04] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Ottomata) @Cmjohnson I don't know what the proper disk layout for these are, since they will be Cloud Virt nodes.  I doubt RAID 0 is...
[20:54:54] <wikibugs>	 (03PS1) 10Dzahn: remove peopleweb role from rutherfordium [puppet] - 10https://gerrit.wikimedia.org/r/476618 (https://phabricator.wikimedia.org/T210036)
[20:56:22] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "this removes shell access for all non-roots, the easiest way to prevent people still going to the old server" [puppet] - 10https://gerrit.wikimedia.org/r/476618 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn)
[20:57:53] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Andrew)
[20:57:57] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: delete t206636-3 VM and revert quota bumps for project wikidata-query - https://phabricator.wikimedia.org/T207101 (10Andrew) 05Open>03Resolved
[20:59:55] <wikibugs>	 (03PS2) 10Dzahn: Revert "peopleweb: allow rsync of /home from rutherfordium to people1001" [puppet] - 10https://gerrit.wikimedia.org/r/475249
[21:00:51] <wikibugs>	 (03CR) 10Dzahn: [C: 032] Revert "peopleweb: allow rsync of /home from rutherfordium to people1001" [puppet] - 10https://gerrit.wikimedia.org/r/475249 (owner: 10Dzahn)
[21:01:51] <wikibugs>	 10Operations, 10Analytics, 10Security-Team, 10WMF-Legal, 10Software-Licensing: Can exfat be used in WMF production? - https://phabricator.wikimedia.org/T210667 (10Jrogers-WMF) Hi all, commenting on this from WMF Legal.   As I understand the question and context, the issue is using a proprietary format fo...
[21:03:18] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "i did not "absent" it but there was no point when rutherfordium had the rsyncd config and will be deleted entirely" [puppet] - 10https://gerrit.wikimedia.org/r/475249 (owner: 10Dzahn)
[21:06:50] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Andrew) >>! In T207194#4787008, @Ottomata wrote: > @Cmjohnson I don't know what the proper disk layout for these are, since they wil...
[21:06:53] <wikibugs>	 (03PS1) 10MusikAnimal: Use log channel 'AbuseFilter' instead of 'AbuseFilterSlow' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476620 (https://phabricator.wikimedia.org/T210636)
[21:07:11] <wikibugs>	 (03PS2) 10Dzahn: remove rutherfordium from site, netboot, DHCP [puppet] - 10https://gerrit.wikimedia.org/r/475237 (https://phabricator.wikimedia.org/T210036)
[21:07:35] <Urbanecm>	 Anybody to create a wiki account to complete T204477?
[21:07:36] <stashbot>	 T204477: Create punjabi.wikimedia.org for Punjabi Wikimedians User Group - https://phabricator.wikimedia.org/T204477
[21:07:59] <Urbanecm>	 ping Reedy, thcipriani, no_justification ^^
[21:08:43] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Ottomata) Hmm, ok, then I think in this case RAID 0 is fine.  Since these will have Hadoop, data will be replicated across nodes 3x...
[21:10:00] <wikibugs>	 (03CR) 10Krinkle: [C: 032] Use log channel 'AbuseFilter' instead of 'AbuseFilterSlow' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476620 (https://phabricator.wikimedia.org/T210636) (owner: 10MusikAnimal)
[21:10:29] * Krinkle performs ritual to acquire lock on mwdebug1002
[21:11:05] <wikibugs>	 (03Merged) 10jenkins-bot: Use log channel 'AbuseFilter' instead of 'AbuseFilterSlow' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476620 (https://phabricator.wikimedia.org/T210636) (owner: 10MusikAnimal)
[21:11:20] <Krinkle>	 musikanimal: staging now on mwdebug1002
[21:11:45] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Andrew) As I understand it, with raid 0 if a single drive dies the whole system (and containing VM) will have to be rebuilt.  Also I...
[21:11:46] <musikanimal>	 thanks. Testing in progress
[21:12:04] <Krinkle>	 k, sync is done. 
[21:13:22] <Krinkle>	 musikanimal: so, one random little thing about logstash, I'd recommend editing the first filter bubble on that link and [x]-ing 1001 to clear the log of unrelated messages
[21:14:05] <musikanimal>	 mkay
[21:15:36] <musikanimal>	 hmm the X-Wikimedia-Debug Chrome extension isn't working
[21:16:11] <musikanimal>	 I see a tiny flyout when I click the button, not the full thing where I can select mwdebug1002, etc.
[21:17:37] <Krinkle>	 hm.. sometimes it takes a few tries to close/re-open. There's a race condition.
[21:19:33] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[21:20:27] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[21:21:08] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: faulty VC link on asw2-c-eqiad - https://phabricator.wikimedia.org/T210788 (10ayounsi) p:05Triage>03High
[21:22:02] <wikibugs>	 (03CR) 10jenkins-bot: Use log channel 'AbuseFilter' instead of 'AbuseFilterSlow' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476620 (https://phabricator.wikimedia.org/T210636) (owner: 10MusikAnimal)
[21:22:24] <musikanimal>	 got it to work, had to restart my browser
[21:22:38] <musikanimal>	 but now I'm having trouble making the filter slow enough!
[21:23:31] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] gerrit: Switch between old LE puppetization and certcentral using hiera [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[21:23:39] <wikibugs>	 (03PS5) 10Vgutierrez: gerrit: Switch between old LE puppetization and certcentral using hiera [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050)
[21:26:20] <musikanimal>	 I think there's some db caching or something going on. I've got a really awfully written filter here and it's still going really fast
[21:27:38] <musikanimal>	 Krinkle: I think we're just going to have to hope for the best
[21:28:34] <musikanimal>	 I could make the AbuseFilter target all editors, or a wider range, but I don't want to slow down their editing just to test this
[21:29:54] <musikanimal>	 I do see the AbuseFilter cache hits/misses in logstash, so I know I'm looking at the right thing, and that X-Wikimedia-Debug is working, etc.
[21:34:32] <XioNoX>	 !log removed unused vc-port on asw2-c-eqiad:fpc8 - T210788
[21:34:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:36] <stashbot>	 T210788: faulty VC link on asw2-c-eqiad - https://phabricator.wikimedia.org/T210788
[21:35:03] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw2-c-eqiad is OK: OK: UP: 22 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[21:36:06] <wikibugs>	 (03CR) 10Dzahn: [C: 032] remove rutherfordium from site, netboot, DHCP [puppet] - 10https://gerrit.wikimedia.org/r/475237 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn)
[21:36:14] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: faulty VC link on asw2-c-eqiad - https://phabricator.wikimedia.org/T210788 (10ayounsi) 05Open>03Resolved a:05Cmjohnson>03ayounsi That was actually an unused port.
[21:36:17] <wikibugs>	 (03PS3) 10Dzahn: remove rutherfordium from site, netboot, DHCP [puppet] - 10https://gerrit.wikimedia.org/r/475237 (https://phabricator.wikimedia.org/T210036)
[21:41:30] <wikibugs>	 10Operations, 10DBA, 10JADE, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10daniel) We discussed this again in the TechCom meeting the other day. If DBAs are ok with not just the new field and indexes, bu...
[21:41:59] <tzatziki>	 !log changing email for User:Mathounette
[21:42:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:03] <Krinkle>	 musikanimal: I'd say it's fine to enable on testwiki for more editors, no problem.
[21:42:15] <Krinkle>	 sorry for the delay, was distracted :)
[21:42:40] <wikibugs>	 (03PS2) 10Dzahn: remove rutherfordium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/475235 (https://phabricator.wikimedia.org/T210036)
[21:45:13] <musikanimal>	 no problem. I've created https://test.wikipedia.org/wiki/Special:AbuseFilter/189
[21:45:33] <musikanimal>	 that would normally be really, really slow
[21:47:06] <wikibugs>	 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul)
[21:49:52] <musikanimal>	 Krinkle: err wait, slow filter runtimes won't be reported for other editors, since they're not on mwdebug1002, right?
[21:49:53] <Krinkle>	 musikanimal: So.. why would it be slow?
[21:50:06] <Krinkle>	 musikanimal: That is indeed also true.
[21:50:50] <musikanimal>	 I tried editing [[Barack Obama]] on testwiki, large article. No "slow filter" entry in logstash
[21:51:10] <musikanimal>	 I think there's a lot of conditions that need to be met for a filter to have a slow run time
[21:51:20] <musikanimal>	 in production, across all wikis, there were only 50 or so a day
[21:52:51] <musikanimal>	 hard to say if it's actually working or not
[21:52:56] <Krinkle>	 OK. I'll roll it out then
[21:53:01] <Krinkle>	 It's only a log channel anyway.
[21:53:12] <Krinkle>	 We're not actually potentially making anything slow.
[21:53:13] <icinga-wm>	 PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:53:14] <musikanimal>	 yeah, and it wasn't working before, so it can't be any worse
[21:53:31] <Krinkle>	 :)
[21:53:39] <musikanimal>	 thanks!
[21:53:40] <hashar>	 all 1.33.0-wmf.6 blockers have been fixed or ruled out. So we can proceed with the last group
[21:53:52] <wikibugs>	 10Operations, 10DBA, 10JADE, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) @Marostegui Hello!  I've added a few summary columns and indexes to the link tables, and the resulting DDL would look li...
[21:54:22] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T210636 - I9ebbc625f98c314 (duration: 00m 55s)
[21:54:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:25] <stashbot>	 T210636: Slow filters logstash dashboard no longer being updated - https://phabricator.wikimedia.org/T210636
[21:54:36] * Krinkle performs ritual to release lock on mwdebug1002
[21:55:13] <icinga-wm>	 RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.570 second response time
[21:55:47] <hashar>	 Krinkle: let me know when you are done, i will resume the train next :)
[21:56:29] * Krinkle is done
[21:59:47] <hashar>	 good
[22:00:42] <wikibugs>	 (03PS1) 10Hashar: all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476764
[22:00:44] <wikibugs>	 (03CR) 10Hashar: [C: 032] all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476764 (owner: 10Hashar)
[22:01:52] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476764 (owner: 10Hashar)
[22:01:53] <icinga-wm>	 PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:03:20] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.6
[22:04:24] <stashbot>	 hashar@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[22:04:44] <hashar>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.6
[22:04:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:51] <hashar>	 who knows
[22:05:09] <icinga-wm>	 RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.096 second response time
[22:06:56] <hashar>	 bah
[22:07:11] <hashar>	 bunch of memcached error keys due to the servers on port 11213 having a timeout
[22:07:30] <hashar>	 I am afraid a new version ends up causing a large denial of service on our memcached relay ://
[22:08:17] <wikibugs>	 (03PS3) 10Andrew Bogott: deployment-prep: move lists of cache nodes out of labs.yaml hiera [puppet] - 10https://gerrit.wikimedia.org/r/475225 (owner: 10Alex Monk)
[22:08:19] <wikibugs>	 (03PS4) 10Andrew Bogott: deployment-prep: Clean up from cache-text04 -> cache-text05 migration [puppet] - 10https://gerrit.wikimedia.org/r/475227 (owner: 10Alex Monk)
[22:09:00] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] deployment-prep: move lists of cache nodes out of labs.yaml hiera [puppet] - 10https://gerrit.wikimedia.org/r/475225 (owner: 10Alex Monk)
[22:09:39] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] deployment-prep: Clean up from cache-text04 -> cache-text05 migration [puppet] - 10https://gerrit.wikimedia.org/r/475227 (owner: 10Alex Monk)
[22:10:59] <wikibugs>	 (03CR) 10jenkins-bot: all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476764 (owner: 10Hashar)
[22:11:19] <icinga-wm>	 PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:14:54] <wikibugs>	 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10colewhite) Saw a crash happen today Thu Nov 29 at 22:10Z
[22:21:23] <hashar>	 !log 1.33.0-wmf.6 is on all wikis and looks stable.
[22:21:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:29] <hashar>	 train is complete
[22:29:10] <wikibugs>	 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul)
[22:32:47] <icinga-wm>	 RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.121 second response time
[22:51:37] <wikibugs>	 (03CR) 10Paladox: "Breaks in the cloud with:" [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[22:54:55] <wikibugs>	 (03CR) 10Cwhite: [C: 032] initial commit [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/471298 (https://phabricator.wikimedia.org/T208066) (owner: 10Cwhite)
[22:57:53] <wikibugs>	 (03PS1) 10EBernhardson: Update wbsearchentities ab test configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476772 (https://phabricator.wikimedia.org/T209402)
[22:59:17] <wikibugs>	 10Operations, 10Citoid, 10Services (watching), 10VisualEditor (Current work): Decreased internationalisation of automatic citations as a result of switch to new translation-server - https://phabricator.wikimedia.org/T210806 (10Mvolz) 05Open>03stalled p:05Triage>03Normal
[23:01:35] <wikibugs>	 10Operations, 10monitoring, 10User-CDanis: graph server temperature metrics - https://phabricator.wikimedia.org/T209863 (10CDanis) Things I have learned today:   Using the labels provided by node_hwmon_sensor_labels is not that hard...  However, if you write this:  ` node_hwmon_temp_celsius{instance=~"$serve...
[23:03:00] <wikibugs>	 (03CR) 10Paladox: "Oh nvm, it broke because of https://gerrit.wikimedia.org/r/c/operations/puppet/+/475225" [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[23:03:15] <wikibugs>	 (03CR) 10Dzahn: ""it's better to break any such projects loudly rather than silently." yea, i think it worked and paladox noticed and they do exist" [puppet] - 10https://gerrit.wikimedia.org/r/475225 (owner: 10Alex Monk)
[23:05:38] <wikibugs>	 (03CR) 10Alex Monk: "Interesting. What was he using that had the deployment-prep cache hiera stuff?" [puppet] - 10https://gerrit.wikimedia.org/r/475225 (owner: 10Alex Monk)
[23:07:03] <mutante>	 paladox: ^ please add :)
[23:07:15] <wikibugs>	 (03CR) 10Paladox: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/475225 (owner: 10Alex Monk)
[23:07:26] <paladox>	 done
[23:08:05] <wikibugs>	 (03CR) 10Alex Monk: "Interesting. I suppose in that case you probably want to trust the novaproxy hosts instead of deployment-cache-* ?" [puppet] - 10https://gerrit.wikimedia.org/r/475225 (owner: 10Alex Monk)
[23:12:40] <wikibugs>	 (03PS3) 10Bstorm: sonofgridengine: set up shadow_master profile [puppet] - 10https://gerrit.wikimedia.org/r/476430 (https://phabricator.wikimedia.org/T200557)
[23:26:46] <mutante>	 !log puppetmaster: sudo puppet cert revoke rutherfordium.eqiad.wmnet; sudo puppet node clean rutherfordium.eqiad.wmnet ; sudo puppet node deactivate rutherfordium.eqiad.wmnet ; run puppet on icinga1001.. removed host from monitoring (decom for ganeti VM) (T210036)
[23:26:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:50] <Platonides>	 anomie: maybe you can create gerrit repos? 
[23:26:51] <stashbot>	 T210036: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036
[23:31:48] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Nuria) @bmansurov We do not recommend to generate these in stats boxes, stat...
[23:32:35] <wikibugs>	 10Operations, 10vm-requests: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036 (10Dzahn)
[23:33:30] <wikibugs>	 10Operations, 10vm-requests: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036 (10Dzahn)
[23:34:45] <wikibugs>	 10Operations, 10vm-requests: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036 (10Dzahn)
[23:35:39] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Nuria) @Dzahn let's please hold on on any changes, stats boxes are mean for...
[23:39:10] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn) @Nuria understood! thank you for your prompt comments and don't worry...
[23:39:56] <wikibugs>	 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn)
[23:40:01] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn) 05Open>03stalled
[23:44:38] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10bmansurov) @Nuria OK, that makes sense. I'll work with #analytics on this. @...
[23:45:54] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn) @Nuria I also have this pending gerrit change that i will put on hold...
[23:48:31] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Nuria) @Dzahn  I see, Let's abandon that change. Stats machines are used by...
[23:50:04] <wikibugs>	 (03Abandoned) 10Dzahn: create profile::research::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn)
[23:51:04] <wikibugs>	 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn)
[23:51:06] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn) 05stalled>03Invalid Ok Nuria! makes sense. I abandoned the change...
[23:54:29] <wikibugs>	 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn) see T210757#4786370  for the latest status. things have changed since Nuria pointed out hadoop should be used instead.
[23:58:13] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Nuria) Many thanks to everyone for the prompt responses.