[00:20:24] <wikibugs_>	 10Operations, 10Wikimedia-Mailing-lists: Enable CAPTCHA on mailman instances - https://phabricator.wikimedia.org/T194558#4201943 (10lfaraone) First: I personally like reCAPTCHA, and think it provides a lot of value from a security/abuse PoV. Yet we need to consider carefully whether we can deploy it on Wikimed...
[00:24:14] <wikibugs_>	 (03PS4) 10Dzahn: analytics_cluster::webserver: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416742
[00:24:52] <wikibugs_>	 (03CR) 10jenkins-bot: [V: 04-1] analytics_cluster::webserver: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn)
[00:27:30] <wikibugs_>	 (03PS5) 10Dzahn: analytics_cluster::webserver: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416742
[00:38:56] <wikibugs_>	 (03CR) 10Dzahn: "hmmm.. something still uses the apache module that also gets included here... causing a duplicate declaration.. but what is it" [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn)
[00:43:19] <wikibugs_>	 (03CR) 10Dzahn: [C: 04-1] "can anyone see where the additional usage of the apache module comes from that causes this issue?  http://puppet-compiler.wmflabs.or" [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn)
[00:44:19] <wikibugs_>	 (03CR) 10Dzahn: [C: 04-1] cache::misc: switch noc.wm,dbtree.wm backends to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/430527 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn)
[00:44:37] <wikibugs_>	 (03CR) 10Dzahn: "scheduled for May 25th" [puppet] - 10https://gerrit.wikimedia.org/r/422632 (owner: 10Dzahn)
[00:45:17] <wikibugs_>	 (03CR) 10Dzahn: [C: 04-1] "not until at least a week after May 25th, the scheduled switch day" [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn)
[00:47:35] <wikibugs_>	 (03CR) 10Dzahn: "needs the SSH key from https://phabricator.wikimedia.org/T194445#4206012" [puppet] - 10https://gerrit.wikimedia.org/r/433025 (https://phabricator.wikimedia.org/T194445) (owner: 10Herron)
[00:56:19] <wikibugs_>	 (03CR) 10Dzahn: [C: 04-1] "needs manual rebase" [puppet] - 10https://gerrit.wikimedia.org/r/400241 (owner: 10Dzahn)
[00:56:36] <wikibugs_>	 (03PS1) 10Brian Wolff: Log "security" channel at 'debug' level. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433095
[01:11:16] <wikibugs_>	 10Operations, 10Wikimedia-Mailing-lists: wikitech-l is mangling my PGP/MIME emails, causing signature validation to fail - https://phabricator.wikimedia.org/T186311#4206345 (10Platonides) Maybe you could try not ending the message with "-- Legoktm" ? Just prepending a space would do. Lines starting with a dash...
[01:11:49] <bawolff>	 I'm going to do a security related deploy
[01:12:11] <wikibugs_>	 (03CR) 10Brian Wolff: [C: 032] Log "security" channel at 'debug' level. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433095 (owner: 10Brian Wolff)
[01:13:24] <wikibugs_>	 (03Merged) 10jenkins-bot: Log "security" channel at 'debug' level. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433095 (owner: 10Brian Wolff)
[01:13:40] <wikibugs_>	 (03CR) 10jenkins-bot: Log "security" channel at 'debug' level. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433095 (owner: 10Brian Wolff)
[01:17:30] <logmsgbot>	 !log bawolff@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/433095/ log security channel (duration: 01m 02s)
[01:17:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:04] <logmsgbot>	 !log bawolff@tin Started scap: Backport https://gerrit.wikimedia.org/r/#/c/433096/ - log js loads of unregistered user js subpages
[01:37:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:40] <wikibugs_>	 10Operations, 10Wikimedia-Mailing-lists: wikitech-l is mangling my PGP/MIME emails, causing signature validation to fail - https://phabricator.wikimedia.org/T186311#4206365 (10Legoktm) >>! In T186311#4206345, @Platonides wrote: > Maybe you could try not ending the message with "-- Legoktm" ? Just prepending a...
[02:12:08] <wikibugs_>	 (03CR) 10Dzahn: "domains handling email can be seen in the list modules/role/files/exim/wikimedia_domains" [dns] - 10https://gerrit.wikimedia.org/r/429874 (https://phabricator.wikimedia.org/T193408) (owner: 10Dzahn)
[02:33:32] <logmsgbot>	 !log bawolff@tin Finished scap: Backport https://gerrit.wikimedia.org/r/#/c/433096/ - log js loads of unregistered user js subpages (duration: 56m 27s)
[02:33:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:50] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.3) (duration: 06m 32s)
[03:03:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:10:59] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Tue May 15 03:10:59 UTC 2018 (duration 7m 11s)
[03:11:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:35:55] <wikibugs_>	 (03PS1) 10Zhuyifei1999: profile::docker::flannel: Use systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433101 (https://phabricator.wikimedia.org/T190893)
[05:05:48] <_joe_>	 zhuyifei1999_: thanks for working on this
[05:05:53] <zhuyifei1999_>	 np
[05:06:05] <_joe_>	 I wasn't ignoring you last week, I was just in/out of bed with the flu 
[05:06:17] <zhuyifei1999_>	 yeah ik
[05:06:23] <zhuyifei1999_>	 you told me
[05:06:37] <_joe_>	 I didn't even remember :D
[05:06:59] <_joe_>	 I see you preserved yuvi's love for docker in the comments :D
[05:07:14] <zhuyifei1999_>	 lol
[05:09:39] <zhuyifei1999_>	 I wonder if I can test this without blowing up toolforge
[05:09:39] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] profile::docker::flannel: Use systemd::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/433101 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999)
[05:10:23] <_joe_>	 Interesting question, I *think* we have a puppet compiler for labs, but I would ask in #-cloud about that
[05:10:50] <_joe_>	 in general, I'd advise merging it only after having disabled puppet across all of toolforge
[05:11:00] <_joe_>	 and then test one server at a time
[05:11:13] <zhuyifei1999_>	 oops ^ copy-pasted the wrong thing
[05:11:18] <_joe_>	 eheh np
[05:11:38] <_joe_>	 I won't merge the patch btw, I'm too rusty on toolforge nowadays not to risk screwing it up
[05:12:08] <_joe_>	 when we created the kubernetes cluster there, I was way more familiar with it
[05:12:42] <wikibugs_>	 (03PS2) 10Zhuyifei1999: profile::docker::flannel: Use systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433101 (https://phabricator.wikimedia.org/T190893)
[05:14:23] <zhuyifei1999_>	 I guess I could just ssh into every worker and disable it
[05:14:36] * zhuyifei1999_ has no clush access, for some reason :(
[05:25:11] <_joe_>	 zhuyifei1999_: let's wait for arturo maybe? :)
[05:25:42] <zhuyifei1999_>	 andrew told me to make a project puppetmaster on toolsbeta to test it
[05:26:02] <zhuyifei1999_>	 so I'm doing that right now (finding my 2fa keys)
[05:26:57] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: [C: 031] profile::docker::flannel: Use systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433101 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999)
[05:27:56] <wikibugs_>	 10Operations, 10ops-eqiad: Degraded RAID on db1065 - https://phabricator.wikimedia.org/T194698#4206565 (10Marostegui)
[05:28:32] <wikibugs_>	 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1065 - https://phabricator.wikimedia.org/T194698#4206146 (10Marostegui) a:03Cmjohnson Please @Cmjohnson proceed and change the disk
[05:36:45] <wikibugs_>	 (03PS1) 10Marostegui: s2,s6.hosts: Add db1120 [software] - 10https://gerrit.wikimedia.org/r/433105
[05:38:43] <wikibugs_>	 10Operations, 10Wikimedia-Mailing-lists: Archive "wiki-offline-reader-l" - https://phabricator.wikimedia.org/T194575#4206571 (10Kelson) @Dzahn @Herron Could you please in addition remove my email address to the list of owner?
[05:41:11] <wikibugs_>	 (03CR) 10Marostegui: [C: 032] s2,s6.hosts: Add db1120 [software] - 10https://gerrit.wikimedia.org/r/433105 (owner: 10Marostegui)
[05:41:31] <wikibugs_>	 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4206574 (10Joe) I would suggest we do NOT disable/depool anything but the obvious outlier in the databases (we already know that timeouts on the databases woul...
[05:41:58] <wikibugs_>	 (03Merged) 10jenkins-bot: s2,s6.hosts: Add db1120 [software] - 10https://gerrit.wikimedia.org/r/433105 (owner: 10Marostegui)
[05:49:58] <wikibugs_>	 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4206575 (10Joe) Things to watch out for:  - All lvs primaries for eqiad are in row C - row C includes 30 appservers - conf1002 is in row C (etcd connections wi...
[06:08:23] <wikibugs_>	 10Operations, 10Puppet, 10Cloud-Services, 10Traffic, and 2 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724#4206611 (10Joe)
[06:13:29] <wikibugs_>	 (03PS1) 10Elukey: role::cache::misc: add Varnish config for turnilo.w.o [puppet] - 10https://gerrit.wikimedia.org/r/433112 (https://phabricator.wikimedia.org/T194427)
[06:14:19] <elukey>	 am I the only one getting a 400 for https://gerrit.wikimedia.org/r/433112 ?
[06:15:06] <elukey>	 ahahah nono early morning and I am still asleep, nevermind
[06:15:23] <elukey>	 (had custom headers set to test turnilo)
[06:18:34] <wikibugs_>	 (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11202/cp1045.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/433112 (https://phabricator.wikimedia.org/T194427) (owner: 10Elukey)
[06:22:36] <wikibugs_>	 (03PS1) 10Elukey: Add turnilo.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/433118 (https://phabricator.wikimedia.org/T194427)
[06:26:56] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: mcrouter: add support for listening on the ssl port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431736 (https://phabricator.wikimedia.org/T192370) (owner: 10Giuseppe Lavagetto)
[06:29:33] <icinga-wm>	 PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/vim/vimrc.local]
[06:32:23] <icinga-wm>	 PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_ipmi_sensor],File[/usr/lib/nagios/plugins/check_sysctl]
[06:54:06] <wikibugs_>	 (03CR) 10Muehlenhoff: [C: 031] "Three minor remarks, but looks good to me." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans)
[06:55:43] <icinga-wm>	 RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:58:34] <icinga-wm>	 RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:32:53] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on labstore1003 is CRITICAL: cluster=labsnfs device=megaraid,13 instance=labstore1003:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1003&var-datasource=eqiad%2520prometheus%252Fops
[08:13:08] <wikibugs_>	 10Operations, 10Wikimedia-Mailing-lists: Archive "wiki-offline-reader-l" - https://phabricator.wikimedia.org/T194575#4206728 (10Kelson) 05Resolved>03Open I just got an email with subject "482 Wiki-offline-reader-l moderator request(s) waiting". Please remove me from the list of owners (see my last comment).
[08:42:35] <jynus>	 !log stop db2068 for reimage
[08:42:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:53] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on cp1068 is CRITICAL: 3 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=cp1068&var-datasource=eqiad%2520prometheus%252Fops
[09:03:01] <jynus>	 !log stop db2061 for reimage
[09:03:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:52] <wikibugs_>	 (03PS1) 10Elukey: Add the community extension for Parquet [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/433131 (https://phabricator.wikimedia.org/T193712)
[09:07:44] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on db2067 is CRITICAL: cluster=mysql device=cciss,9 instance=db2067:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2067&var-datasource=codfw%2520prometheus%252Fops
[09:08:39] <jynus>	 ^we may need better coordination between SMART error monitoring and RAID monitoring
[09:12:58] <wikibugs_>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4206835 (10jcrespo) 05Resolved>03Open Potential SMART errors on that device.  ``` PROBLEM - Device not healthy -SMART- on db2067 is CRITICAL: cluster=mysql device=cciss,9 instance=db2067:9100 job=node si...
[09:15:26] <godog>	 jynus: agreed, basically double reporting ?
[09:15:56] <jynus>	 I don't think in this case, but I think it happened at others
[09:18:42] <wikibugs_>	 (03PS1) 10Joal: Add output-format parameter to sqooq cron [puppet] - 10https://gerrit.wikimedia.org/r/433133
[09:18:45] <joal>	 elukey: --^
[09:19:33] <elukey>	 joal: sqoop right? :D
[09:20:02] <joal>	 :)
[09:20:29] <elukey>	 I can change the cr's title from gerrit
[09:20:39] <joal>	 done elukey 
[09:20:47] <wikibugs_>	 (03PS2) 10Elukey: Add output-format parameter to sqoop cron [puppet] - 10https://gerrit.wikimedia.org/r/433133 (owner: 10Joal)
[09:20:51] <wikibugs_>	 (03PS3) 10Joal: Add output-format parameter to sqoop cron [puppet] - 10https://gerrit.wikimedia.org/r/433133
[09:20:54] <elukey>	 ahah
[09:20:55] <joal>	 Arf
[09:20:56] <joal>	 :)
[09:21:31] <wikibugs_>	 (03CR) 10Elukey: [C: 032] Add output-format parameter to sqoop cron [puppet] - 10https://gerrit.wikimedia.org/r/433133 (owner: 10Joal)
[09:21:40] <moritzm>	 !log upgrading app server canaries to HHVM 3.18.5+dfsg-1+wmf8+deb9u1
[09:21:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:48] <jynus>	 !log stop and restart db2088 for upgrade
[09:26:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:30] <joal>	 Hi ops-team - deploying refinery onto the hadoop cluster
[09:28:35] <jynus>	 joal allow me to suggest using ! log to best communicate those actions :-)
[09:28:48] <joal>	 Hi jynus
[09:29:05] <joal>	 jynus: scap will log - I'm just pinging as discussed the other day :)
[09:29:14] <jynus>	 ah, ok, thanks
[09:29:21] <joal>	 np :)
[09:30:54] <logmsgbot>	 !log joal@tin Started deploy [analytics/refinery@b2f4c3c]: Regular weekly deploy
[09:30:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:23] <icinga-wm>	 PROBLEM - MegaRAID on db1053 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[09:31:24] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on db1053 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T194733
[09:31:29] <wikibugs_>	 10Operations, 10ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T194733#4206922 (10ops-monitoring-bot)
[09:33:19] <wikibugs_>	 10Operations, 10ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T194733#4206922 (10jcrespo) Do not take any action, db1053 is going to be decommissioned soon.
[09:33:32] <wikibugs_>	 10Operations, 10ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T194733#4206937 (10jcrespo)
[09:35:01] <wikibugs_>	 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4206939 (10CCogdill_WMF) Thanks Casey! I'm waiting for a reply. Just bumped it, FYI.
[09:36:14] <icinga-wm>	 PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:36:32] <logmsgbot>	 !log joal@tin Finished deploy [analytics/refinery@b2f4c3c]: Regular weekly deploy (duration: 05m 38s)
[09:36:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:04] <icinga-wm>	 PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:37:14] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:39:21] <_joe_>	 akosiaris: any idea what's happening ^^ ?
[09:39:23] <icinga-wm>	 PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:40:26] <jynus>	 could it be related to the deployment?
[09:40:43] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:40:43] <icinga-wm>	 RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:41:34] <icinga-wm>	 RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:41:43] <icinga-wm>	 RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:42:01] <_joe_>	 jynus: nope, analytics doesn't have anything on kubernetes
[09:42:11] <_joe_>	 but I figured alex could be doing something there
[09:42:14] <jynus>	 I also was doing some upgrades
[09:42:28] <jynus>	 but I also guess kubernetes has no dependency on mysql
[09:42:52] <_joe_>	 not etcd, no
[09:43:00] <_joe_>	 probably some consensus troubles
[09:43:04] <_joe_>	 quickly recovered
[09:43:17] <_joe_>	 meh, we really need to move to 3.x there
[09:45:12] <jynus>	 I am going to do another upgrade, will see if it happens again
[09:47:53] <jynus>	 !log stop and restart db2091 for upgrade
[09:47:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:25] <moritzm>	 !log upgrading API server canaries to HHVM 3.18.5+dfsg-1+wmf8+deb9u1
[09:51:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:24] <wikibugs_>	 (03CR) 10Ema: [C: 031] role::cache::misc: add Varnish config for turnilo.w.o [puppet] - 10https://gerrit.wikimedia.org/r/433112 (https://phabricator.wikimedia.org/T194427) (owner: 10Elukey)
[10:04:23] <wikibugs_>	 (03CR) 10Ema: [C: 031] Add turnilo.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/433118 (https://phabricator.wikimedia.org/T194427) (owner: 10Elukey)
[10:07:11] <moritzm>	 !log installing php5 security updates on trusty
[10:07:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:24] <wikibugs_>	 (03PS2) 10Elukey: role::cache::misc: add Varnish config for turnilo.w.o [puppet] - 10https://gerrit.wikimedia.org/r/433112 (https://phabricator.wikimedia.org/T194427)
[10:12:59] <wikibugs_>	 (03CR) 10Elukey: [C: 032] role::cache::misc: add Varnish config for turnilo.w.o [puppet] - 10https://gerrit.wikimedia.org/r/433112 (https://phabricator.wikimedia.org/T194427) (owner: 10Elukey)
[10:15:06] <moritzm>	 !log installing uwsgi security update on graphite servers in eqiad
[10:15:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:52] <jynus>	 !log stop db2065 for reimage
[10:16:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:19] <wikibugs_>	 (03CR) 10Elukey: [C: 032] Add turnilo.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/433118 (https://phabricator.wikimedia.org/T194427) (owner: 10Elukey)
[10:36:04] <wikibugs_>	 (03PS1) 10MarcoAurelio: security: remove dangerous unused groups at mlwik{tionary|isource} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433136 (https://phabricator.wikimedia.org/T152296)
[10:43:02] <logmsgbot>	 !log joal@tin Started deploy [analytics/refinery@25abeec]: Fix for regular weekly deploy
[10:43:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:09] <wikibugs_>	 (03PS1) 10Urbanecm: New throttle rule for University of Edinburgh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433138
[10:45:32] <wikibugs_>	 (03PS2) 10Urbanecm: New throttle rule for University of Edinburgh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433138 (https://phabricator.wikimedia.org/T194666)
[10:46:02] <jynus>	 !log stop db2066 for reimage
[10:46:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:49] <wikibugs_>	 (03CR) 10Rxy: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433136 (https://phabricator.wikimedia.org/T152296) (owner: 10MarcoAurelio)
[10:48:38] <icinga-wm>	 PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:49:37] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:49:47] <logmsgbot>	 !log joal@tin Finished deploy [analytics/refinery@25abeec]: Fix for regular weekly deploy (duration: 06m 45s)
[10:49:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:47] <icinga-wm>	 RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:53:03] <wikibugs_>	 (03PS1) 10Elukey: role::aqs: refactor druid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/433140
[10:53:47] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:57:39] <jynus>	 !log stop db2067 for reimage
[10:57:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:02] <wikibugs_>	 (03PS2) 10Elukey: role::aqs: refactor druid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/433140
[11:00:17] <wikibugs_>	 10Operations, 10Commons, 10Wikimedia-Site-requests: Please upload large file to Wikimedia Commons - https://phabricator.wikimedia.org/T192751#4207278 (10MarcoAurelio) For deployers the instructions seems to be at https://wikitech.wikimedia.org/wiki/Uploading_large_files
[11:04:09] <wikibugs_>	 10Operations, 10Commons, 10Wikimedia-Site-requests: Please upload large file to Wikimedia Commons - https://phabricator.wikimedia.org/T192751#4207293 (10Goryeo) >>! In T192751#4207278, @MarcoAurelio wrote: > For deployers the instructions seems to be at https://wikitech.wikimedia.org/wiki/Uploading_large_fil...
[11:08:26] <wikibugs_>	 (03PS3) 10Elukey: role::aqs: refactor druid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/433140
[11:09:58] <wikibugs_>	 10Operations, 10Commons, 10Wikimedia-Site-requests: Please upload large file to Wikimedia Commons - https://phabricator.wikimedia.org/T192751#4207300 (10Zoranzoki21) >>! In T192751#4207293, @Goryeo wrote: >>>! In T192751#4207278, @MarcoAurelio wrote: >> For deployers the instructions seems to be at https://w...
[11:13:00] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Move m3 backups from db1053 to db1072 [puppet] - 10https://gerrit.wikimedia.org/r/433141 (https://phabricator.wikimedia.org/T194634)
[11:13:48] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] mariadb: Move m3 backups from db1053 to db1072 [puppet] - 10https://gerrit.wikimedia.org/r/433141 (https://phabricator.wikimedia.org/T194634) (owner: 10Jcrespo)
[11:14:34] <wikibugs_>	 (03PS4) 10Elukey: role::aqs: refactor druid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/433140
[11:15:25] <wikibugs_>	 (03PS1) 10Arturo Borrero Gonzalez: toollabs: add mono_external class [puppet] - 10https://gerrit.wikimedia.org/r/433142 (https://phabricator.wikimedia.org/T194665)
[11:17:27] <wikibugs_>	 (03PS2) 10Arturo Borrero Gonzalez: toollabs: add mono_external class [puppet] - 10https://gerrit.wikimedia.org/r/433142 (https://phabricator.wikimedia.org/T194665)
[11:21:55] <wikibugs_>	 (03PS5) 10Elukey: role::aqs: refactor druid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/433140
[11:23:42] <wikibugs_>	 10Operations, 10Commons, 10Wikimedia-Site-requests: Please upload large file to Wikimedia Commons - https://phabricator.wikimedia.org/T192751#4207342 (10Urbanecm) >>! In T192751#4207277, @Goryeo wrote: >>>! In T192751#4207156, @Urbanecm wrote: >> It is converted, it needs to be //uploaded//. This needs someo...
[11:24:59] <wikibugs_>	 (03PS6) 10Elukey: role::aqs: refactor druid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/433140
[11:26:30] <Hauskatze>	 any merciful deployer who could https://phabricator.wikimedia.org/T192751 and stop the drama, please?
[11:32:01] <wikibugs_>	 10Operations, 10DC-Ops, 10Traffic, 10monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4207372 (10fgiunchedi) >>! In T183177#4088202, @BBlack wrote: > See updates in T190540 , quite a few codfw hosts have SEL entries for uncorrectable ECC errors that...
[11:37:38] <wikibugs_>	 (03PS7) 10Elukey: role::aqs: refactor druid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/433140
[11:48:38] <wikibugs_>	 (03PS8) 10Elukey: role::aqs: refactor druid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/433140
[11:52:00] <wikibugs_>	 (03PS1) 10Filippo Giunchedi: base: alert on correctable errors over a period of time [puppet] - 10https://gerrit.wikimedia.org/r/433143 (https://phabricator.wikimedia.org/T183177)
[11:52:42] <wikibugs_>	 (03PS9) 10Elukey: role::aqs: refactor druid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/433140
[11:54:52] <wikibugs_>	 (03PS10) 10Elukey: role::aqs: refactor druid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/433140
[11:57:38] <wikibugs_>	 (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11212/aqs1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/433140 (owner: 10Elukey)
[12:00:35] <wikibugs_>	 (03PS11) 10Elukey: role::aqs: refactor druid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/433140
[12:01:14] <wikibugs_>	 (03CR) 10Elukey: [C: 032] role::aqs: refactor druid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/433140 (owner: 10Elukey)
[12:09:18] <wikibugs_>	 (03PS1) 10Elukey: role::aqs: follow up after druid's config refactoring [puppet] - 10https://gerrit.wikimedia.org/r/433148
[12:12:33] <wikibugs_>	 (03CR) 10Elukey: [C: 032] role::aqs: follow up after druid's config refactoring [puppet] - 10https://gerrit.wikimedia.org/r/433148 (owner: 10Elukey)
[12:14:18] <moritzm>	 !log uploaded intel-microcode 20180425 for jessie-wikimedia/stretch-wikimedia
[12:14:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:11] <wikibugs_>	 10Operations, 10Patch-For-Review: Re-add intel-microcode - https://phabricator.wikimedia.org/T127825#4207469 (10MoritzMuehlenhoff) We have two clusters which need updated microcode to provide support for the new IBPB instruction needed to secure KVM instances against Spectre. In addition to that keeping the mi...
[12:32:47] <wikibugs_>	 (03PS1) 10Samwilson: Deploy GlobalPreferences to test wikis and mw.org (forth time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433149
[12:34:17] <wikibugs_>	 (03PS2) 10Samwilson: Deploy GlobalPreferences to test wikis and mw.org (forth time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433149 (https://phabricator.wikimedia.org/T190425)
[12:42:30] <jynus>	 !log stop db2060 for reimage
[12:42:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:21] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Disable reimage of db206* host, reimage db205* to stretch [puppet] - 10https://gerrit.wikimedia.org/r/433151
[12:53:20] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] mariadb: Disable reimage of db206* host, reimage db205* to stretch [puppet] - 10https://gerrit.wikimedia.org/r/433151 (owner: 10Jcrespo)
[12:59:09] <wikibugs_>	 (03PS2) 10Andrew Bogott: openstack: move nova-api and nova-network functions to labnet1002 [puppet] - 10https://gerrit.wikimedia.org/r/430118 (https://phabricator.wikimedia.org/T193579) (owner: 10Rush)
[12:59:50] <andrewbogott>	 !log stopping puppet on labnet1001 and 1002, silencing icinga for T193579
[12:59:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:55] <stashbot>	 T193579: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579
[13:00:04] <jouncebot>	 andrewbogott and chasemp: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for WMCS network maintenance -- no SWAT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180515T1300).
[13:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[13:01:06] <wikibugs_>	 (03CR) 10Andrew Bogott: [C: 032] openstack: move nova-api and nova-network functions to labnet1002 [puppet] - 10https://gerrit.wikimedia.org/r/430118 (https://phabricator.wikimedia.org/T193579) (owner: 10Rush)
[13:07:42] <andrewbogott>	 !log stopping nodepool and puppet on labnodepool1001 for T193579
[13:07:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:46] <stashbot>	 T193579: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579
[13:09:27] <chasemp>	 !log disable puppet for all openstack things in eqiad
[13:09:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:54] <wikibugs_>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4207600 (10chasemp)
[13:47:59] <wikibugs_>	 (03PS1) 10Andrew Bogott: openstack: move nova-api and nova-network functions to labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/433153 (https://phabricator.wikimedia.org/T193579)
[13:48:22] <wikibugs_>	 (03CR) 10Andrew Bogott: [C: 04-2] "Saving this to merge during a maintenance window" [puppet] - 10https://gerrit.wikimedia.org/r/433153 (https://phabricator.wikimedia.org/T193579) (owner: 10Andrew Bogott)
[13:48:24] <wikibugs_>	 (03CR) 10Elukey: [C: 032] Kafka: increase group.initial.rebalance.delay.ms to 10s. [puppet] - 10https://gerrit.wikimedia.org/r/432615 (https://phabricator.wikimedia.org/T189618) (owner: 10Ppchelko)
[13:48:28] <wikibugs_>	 (03PS4) 10Elukey: Kafka: increase group.initial.rebalance.delay.ms to 10s. [puppet] - 10https://gerrit.wikimedia.org/r/432615 (https://phabricator.wikimedia.org/T189618) (owner: 10Ppchelko)
[13:48:56] <elukey>	 mobrovac: merging --^ first, then we could roll restart main codfw and verify that everything works as expected ?
[13:49:25] <mobrovac>	 elukey: don't we need to roll-restart for the zk move anyway?
[13:49:49] <elukey>	 not for codfw (it uses conf2*)
[13:50:09] <mobrovac>	 ah ok
[13:50:13] <mobrovac>	 sure, let's do it then
[13:50:24] <elukey>	 super
[13:50:56] <elukey>	 !log roll restart of kafka main codfw (kafka200[1-3]) to pick up group.initial.rebalance.delay.ms = 10s
[13:50:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:08] <elukey>	 restarting 2001 now
[13:55:08] <elukey>	 I am going to force kafka preferred-replica-election after metrics stabilize otherwise it will take a bit more for the rebalance to happen
[13:55:47] <mobrovac>	 i can also just restart changeprop in codfw
[13:56:05] <elukey>	 let's do it when the roll restart is finished to test ok?
[13:56:24] <mobrovac>	 +1
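The roll restart described above (restart one broker, wait for metrics to stabilize, move on, then force a preferred-replica election at the end) can be sketched roughly as follows. This is a minimal illustration, not the tooling actually used here: `rolling_restart` and its callable parameters (`restart`, `under_replicated`, `elect_preferred_leaders`) are hypothetical names, and "metrics stabilize" is assumed to mean the under-replicated partition count returning to 0.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    brokers: Iterable[str],
    restart: Callable[[str], None],             # hypothetical: restarts one broker
    under_replicated: Callable[[], int],        # hypothetical: current under-replicated count
    elect_preferred_leaders: Callable[[], None],  # hypothetical: forces the election
    poll_seconds: float = 0.0,
    max_polls: int = 60,
) -> List[str]:
    """Restart brokers one at a time, waiting for recovery between each."""
    done: List[str] = []
    for broker in brokers:
        restart(broker)
        # Wait until no partition is under-replicated before touching the next broker.
        for _ in range(max_polls):
            if under_replicated() == 0:
                break
            time.sleep(poll_seconds)
        else:
            raise RuntimeError(f"cluster did not recover after restarting {broker}")
        done.append(broker)
    # Only after every broker is back: move partition leadership to the preferred replicas.
    elect_preferred_leaders()
    return done
```

Passing the three operations in as callables keeps the ordering logic separate from how a broker is actually restarted or how the metric is read, which are site-specific.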
[13:58:54] <icinga-wm>	 RECOVERY - Memory correctable errors -EDAC- on kafka1023 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=kafka1023&var-datasource=eqiad%2520prometheus%252Fops
[13:59:13] <icinga-wm>	 RECOVERY - Memory correctable errors -EDAC- on wtp2020 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw%2520prometheus%252Fops
[13:59:23] <icinga-wm>	 RECOVERY - Memory correctable errors -EDAC- on wtp2013 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw%2520prometheus%252Fops
[14:00:18] <andrewbogott>	 !log rebooting labnet1001
[14:00:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:56] <elukey>	 kafka2002 done
[14:01:09] <elukey>	 (so happy that I have to restart ALL the kafka brokers today)
[14:01:29] <wikibugs_>	 (03PS2) 10Ottomata: Migrate eventbus camus job to Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/419493 (https://phabricator.wikimedia.org/T189713)
[14:02:23] <wikibugs_>	 (03CR) 10Ottomata: [C: 032] Migrate eventbus camus job to Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/419493 (https://phabricator.wikimedia.org/T189713) (owner: 10Ottomata)
[14:02:35] <mobrovac>	 haha
[14:05:27] <mobrovac>	 there are a couple of msgs of the http proxy service being unable to deliver events
[14:05:32] <mobrovac>	 just 2 logs so far though
[14:06:54] <elukey>	 2003 restarted now
[14:08:04] <mobrovac>	 ok, i'll restart CP and CP4JQ in codfw now and let's see
[14:09:12] <elukey>	 ack
[14:09:54] <logmsgbot>	 !log mobrovac@tin Started restart [changeprop/deploy@e468d8e]: Restart after Kafka settings change
[14:09:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:10] <wikibugs_>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4207712 (10chasemp)
[14:10:29] <logmsgbot>	 !log mobrovac@tin Started restart [cpjobqueue/deploy@58935d5]: Restart after Kafka settings change
[14:10:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:07] <mobrovac>	 ok done
[14:11:55] <ottomata>	 mobrovac:  just curious which settings did you change?  
[14:12:24] <elukey>	 ottomata: the group.initial.rebalance.delay.ms
[14:12:26] <elukey>	 to 10s
[14:12:29] <mobrovac>	 ottomata: it's the kafka rebalance change, i didn't change anything on the CP side
[14:12:32] <ottomata>	 hm, ya but that's a broker setting, no?
[14:12:38] <ottomata>	 don't think client needs restart for that
[14:12:44] <elukey>	 ah no we wanted to test it
[14:12:44] <elukey>	 :)
[14:12:46] <ottomata>	 ahhh
[14:12:47] <ottomata>	 cool :)
[14:12:49] <mobrovac>	 yes, but we restarted CP to force a rebalance
[14:13:39] <elukey>	 all right all seems good, starting with the zk cluster changes 
[14:14:00] <elukey>	 !log swap conf1001 with conf1004 in the zookeeper main eqiad's config + roll restart of the service
[14:14:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:10] <wikibugs_>	 (03PS7) 10Elukey: Swap conf1001 with conf1004 in Zookeeper main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/425238 (https://phabricator.wikimedia.org/T182924)
[14:15:03] <wikibugs_>	 (03CR) 10Elukey: [C: 032] Swap conf1001 with conf1004 in Zookeeper main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/425238 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey)
[14:16:58] <ottomata>	 woot
[14:18:50] <elukey>	 ok zookeeper up on conf1004, tried to connect via zkCli and it shows correctly main-eqiad's content
[14:19:47] <ottomata>	 !log temporarily disabling puppet on analytics1003 to run refine-eventbus after jumbo based camus eventbus import finishes 
[14:19:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:07] <elukey>	 applying puppet to all the conf1* nodes so they'll get the new config
[14:22:25] <wikibugs_>	 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4207782 (10Ottomata) @bblack would you mind if I assigned this to someone on your team?
[14:22:59] <wikibugs_>	 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4207783 (10BBlack) a:05Ottomata>03Vgutierrez Done :)
[14:23:31] <elukey>	 stopped and masked zookeeper on conf1001
[14:25:09] <elukey>	 new cluster up and running, conf1004 is the new leader
[14:25:13] <elukey>	 everything seems to be working fine
[14:25:57] <mobrovac>	 yay
[14:26:11] <ottomata>	 nice
[14:26:16] <mobrovac>	 elukey: that's it? do we have more swappings to do?
[14:27:04] <elukey>	 mobrovac: for today no, the docs suggest only one at a time
[14:27:11] <mobrovac>	 kk
[14:27:23] <elukey>	 now I'd proceed with kafka analytics and then kafka main
[14:27:29] <mobrovac>	 i will restart CP and CP4JQ, could you please restart the proxy service?
[14:27:48] <mobrovac>	 hm actually no, not needed
[14:28:01] <logmsgbot>	 !log mobrovac@tin Started restart [cpjobqueue/deploy@58935d5]: Restart after Kafka settings change
[14:28:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:27] <logmsgbot>	 !log mobrovac@tin Started restart [changeprop/deploy@e468d8e]: Restart after Kafka settings change
[14:28:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:30] <ottomata>	 ya hopefully none of the clients talk to ZK anymore, so they shouldn't need restart :)
[14:28:47] <ottomata>	 (due to zk change)
[14:29:00] <icinga-wm>	 PROBLEM - Host labnet1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:29:17] <elukey>	 mmm I am seeing the following in dmesg, not related to this upgrade
[14:29:20] <elukey>	 ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[14:29:28] <elukey>	 I missed it, I hope it is only a setting to tune
[14:31:54] <wikibugs_>	 (03PS7) 10Ema: numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865)
[14:32:28] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema)
[14:33:10] <icinga-wm>	 RECOVERY - Host labnet1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms
[14:35:26] <wikibugs_>	 10Operations, 10DBA: https://tendril.wikimedia.org/ IPv6 doesn't work - https://phabricator.wikimedia.org/T194766#4207806 (10BBlack) p:05Triage>03Normal
[14:35:32] <elukey>	 (still reading, going to restart kafka soon)
[14:35:52] <elukey>	 ah also I think I missed monitoring config for 1004
[14:36:24] <elukey>	 ah no different roles right
[14:39:02] <wikibugs_>	 10Operations, 10DBA: https://tendril.wikimedia.org/ IPv6 doesn't work - https://phabricator.wikimedia.org/T194766#4207818 (10jcrespo) @bblack do you have a pointer to puppet of other standalone service using the correct configuration?
[14:43:06] <wikibugs_>	 (03PS1) 10Elukey: role::prometheus::ops: add new zookeeper hosts' monitoring [puppet] - 10https://gerrit.wikimedia.org/r/433157 (https://phabricator.wikimedia.org/T182924)
[14:43:39] <wikibugs_>	 10Operations, 10Wikimedia-Mailing-lists: Archive "wiki-offline-reader-l" - https://phabricator.wikimedia.org/T194575#4207847 (10Dzahn) 05Open>03Resolved Done. I changed the admin address to no-reply@wikimedia.org.
[14:43:53] <elukey>	 godog: ---^ is it an acceptable solution for zookeeper's metrics for the next days?
[14:44:15] <godog>	 elukey: taking a look
[14:44:35] <godog>	 oh man second time today gerrit is taking forever to load
[14:46:13] <elukey>	 mobrovac: we have some issues with Burrow and mirror maker atm
[14:46:24] <wikibugs_>	 (03CR) 10Filippo Giunchedi: "LGTM as a temporary thing" [puppet] - 10https://gerrit.wikimedia.org/r/433157 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey)
[14:46:36] <elukey>	 godog: <3
[14:46:44] <mobrovac>	 ack elukey, thnx
[14:46:49] <wikibugs_>	 (03CR) 10Elukey: [C: 032] role::prometheus::ops: add new zookeeper hosts' monitoring [puppet] - 10https://gerrit.wikimedia.org/r/433157 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey)
[14:47:22] <wikibugs_>	 (03PS1) 10Dzahn: enable IPv6 for tendril [puppet] - 10https://gerrit.wikimedia.org/r/433159 (https://phabricator.wikimedia.org/T194766)
[14:47:41] <elukey>	 mobrovac: same problem as last time, when burrow restarts it doesn't save its previous state and re-reads everything from __consumer_topics
[14:48:01] <elukey>	 err __consumer_groups, don't remember the exact name :)
[14:48:54] <wikibugs_>	 (03PS1) 10Muehlenhoff: Allow enabling microcode updates gradually [puppet] - 10https://gerrit.wikimedia.org/r/433160 (https://phabricator.wikimedia.org/T127825)
[14:49:44] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Allow enabling microcode updates gradually [puppet] - 10https://gerrit.wikimedia.org/r/433160 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff)
[14:49:55] <wikibugs_>	 10Operations, 10DBA, 10Patch-For-Review: https://tendril.wikimedia.org/ IPv6 doesn't work - https://phabricator.wikimedia.org/T194766#4207806 (10Dzahn) @jcrespo ^  Add "interface::add_ip6_mapped { 'main': }" in the role class to apply to both hosts at once. That should be all that is needed.  comparison:  ~/...
[14:51:39] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] enable IPv6 for tendril [puppet] - 10https://gerrit.wikimedia.org/r/433159 (https://phabricator.wikimedia.org/T194766) (owner: 10Dzahn)
[14:51:44] <wikibugs_>	 (03PS2) 10Jcrespo: enable IPv6 for tendril [puppet] - 10https://gerrit.wikimedia.org/r/433159 (https://phabricator.wikimedia.org/T194766) (owner: 10Dzahn)
[14:52:10] <elukey>	 all right zookeeper on conf1004 has metrics now :)
[14:55:06] <elukey>	 ottomata: I'd proceed if you are ok
[14:55:43] <elukey>	 kafka analytics, kafka main, kafka jumbo, hadoop
[14:56:08] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] "Puppet gets stuck at:" [puppet] - 10https://gerrit.wikimedia.org/r/433159 (https://phabricator.wikimedia.org/T194766) (owner: 10Dzahn)
[14:56:11] <elukey>	 or mobrovac, if you want we can go directly with kafka main so you'll be free
[14:56:31] <mobrovac>	 wuh sorry what's the question?
[14:56:45] <mobrovac>	 go with what for kafka main?
[14:56:59] <elukey>	 mobrovac: yeah or wait for another kafka cluster to be restarted first
[14:57:23] <mobrovac>	 i'm still confused as to what the question is about
[14:57:36] <mutante>	 jynus: is it really stuck? that looks normal.. if it continues after that
[14:58:00] <mutante>	 maybe you got disconnected from the host after it brought up the new interface
[14:58:13] <elukey>	 mobrovac: if you want me to roll restart a less important cluster like the analytics one, verify that all is ok and then proceed with main, or if you want to do it now so you'll be free :)
[14:58:21] <jynus>	 no, it is stuck because I cannot run puppet from another connection
[14:58:32] <wikibugs_>	 (03PS1) 10Ottomata: Alert if EventStreams recentchange endpoint has no messages [puppet] - 10https://gerrit.wikimedia.org/r/433161 (https://phabricator.wikimedia.org/T174493)
[14:58:34] <mobrovac>	 oh ok elukey, haha, it took me a while :)
[14:58:39] <mobrovac>	 elukey: yeah, go with main
[14:58:40] <mutante>	 jynus: i can connect to it fine and it looks good:
[14:58:41] <mutante>	 208.80.154.82/26 
[14:58:48] <mobrovac>	 it should be fine
[14:58:52] <mutante>	 2620:0:861:3:208:80:154:82
[14:58:52] <ottomata>	 proceed! :)
[14:58:54] <mutante>	 ^ mapped
[14:58:56] <jynus>	 mutante: try running puppet
[14:59:10] <elukey>	 mobrovac: ack! So just ran puppet on kafka100[1-3], proceeding with the roll restart
[14:59:17] <mobrovac>	 kk
[14:59:51] <elukey>	 !log roll restart of kafka daemons on kafka100[1-3] to pick up new zookeeper settings and group.initial.rebalance.delay.ms = 10s
[14:59:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:06] <jynus>	 it worked the second time, but I had to cancel the puppet run
[15:00:14] <mutante>	 jynus: says it's already running. want me to wait or delete the lock file.. oh.. ok!
[15:00:18] <mutante>	 ok
[15:00:28] <mutante>	 i haven't had this issue 
[15:00:32] <mutante>	 when doing that
[15:00:43] <jynus>	 it happened on both servers
[15:00:57] <wikibugs_>	 (03PS2) 10Ottomata: Alert if EventStreams recentchange endpoint has no messages [puppet] - 10https://gerrit.wikimedia.org/r/433161 (https://phabricator.wikimedia.org/T174493)
[15:01:01] <jynus>	 if it was an automatic run, probably it would have piled up
[15:01:11] <vgutierrez>	 looks good from here, though: https://puppetboard.wikimedia.org/report/dbmonitor1001.wikimedia.org/afa3bda906c1d75846df5bc3ed2eb7df26b63c2e
[15:01:29] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Alert if EventStreams recentchange endpoint has no messages [puppet] - 10https://gerrit.wikimedia.org/r/433161 (https://phabricator.wikimedia.org/T174493) (owner: 10Ottomata)
[15:01:31] <wikibugs_>	 (03PS2) 10Muehlenhoff: Allow enabling microcode updates gradually [puppet] - 10https://gerrit.wikimedia.org/r/433160 (https://phabricator.wikimedia.org/T127825)
[15:01:56] <mutante>	 well, both have the mapped address now. so that's good
[15:02:07] <elukey>	 mobrovac: kafka1001 restarted
[15:02:59] <mutante>	 [bast2001:~] $ ping6 tendril.wikimedia.org
[15:03:07] <mutante>	 64 bytes from dbmonitor1001.wikimedia.org 
[15:03:20] <wikibugs_>	 10Operations, 10DBA, 10Patch-For-Review: https://tendril.wikimedia.org/ IPv6 doesn't work - https://phabricator.wikimedia.org/T194766#4207932 (10jcrespo) a:03BBlack Please, recheck.
[15:03:28] <mobrovac>	 elukey: looking good on our side
[15:03:44] <wikibugs_>	 (03PS3) 10Ottomata: Alert if EventStreams recentchange endpoint has no messages [puppet] - 10https://gerrit.wikimedia.org/r/433161 (https://phabricator.wikimedia.org/T174493)
[15:03:52] <wikibugs_>	 10Operations, 10DBA, 10Patch-For-Review: https://tendril.wikimedia.org/ IPv6 doesn't work - https://phabricator.wikimedia.org/T194766#4207940 (10jcrespo) @Dzahn, thanks for the patch.
[15:06:10] <wikibugs_>	 (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/11215/scb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/433161 (https://phabricator.wikimedia.org/T174493) (owner: 10Ottomata)
[15:06:59] <elukey>	 mobrovac: 1002 done
[15:09:21] <mobrovac>	 hm i see 400s from the proxy service
[15:09:27] <mobrovac>	 they seem legit though
[15:10:26] <elukey>	 1003 is the only one left (waiting a bit now for metrics to recover)
[15:11:06] <icinga-wm>	 PROBLEM - Host labnet1001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:11:32] <elukey>	 mobrovac: I am stopping now waiting for your green light
[15:11:57] <mobrovac>	 kk elukey we can go with 1003
[15:12:27] <elukey>	 super, proceeding :)
[15:13:40] <wikibugs_>	 (03PS1) 10Giuseppe Lavagetto: Add the capability to check for deprecated defines [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/433162 (https://phabricator.wikimedia.org/T194724)
[15:13:42] <wikibugs_>	 (03PS1) 10Giuseppe Lavagetto: Check for all the available variants of a hiera call [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/433163
[15:13:58] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Add the capability to check for deprecated defines [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/433162 (https://phabricator.wikimedia.org/T194724) (owner: 10Giuseppe Lavagetto)
[15:14:00] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Check for all the available variants of a hiera call [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/433163 (owner: 10Giuseppe Lavagetto)
[15:14:26] <_joe_>	 I hate you rubocop
[15:16:46] <icinga-wm>	 RECOVERY - Host labnet1001 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[15:18:16] <elukey>	 mobrovac: 1003 restarted
[15:18:53] <elukey>	 and also just forced a replica election
[15:19:49] <wikibugs_>	 10Operations, 10DBA: https://tendril.wikimedia.org/ IPv6 doesn't work - https://phabricator.wikimedia.org/T194766#4207993 (10jcrespo)
[15:20:46] <icinga-wm>	 PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:20:48] <elukey>	 !log roll restart of Kafka Analytics to pick up new zookeeper settings
[15:20:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:21] <mobrovac>	 huh again 400s from the proxy service
[15:22:29] <wikibugs_>	 (03CR) 10Addshore: Prepare Lexeme config for test.wikidata.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433145 (https://phabricator.wikimedia.org/T194250) (owner: 10Jakob)
[15:22:48] <wikibugs_>	 (03PS4) 10Ottomata: Alert if EventStreams recentchange endpoint has no messages [puppet] - 10https://gerrit.wikimedia.org/r/433161 (https://phabricator.wikimedia.org/T174493)
[15:22:50] <wikibugs_>	 (03CR) 10Ottomata: [V: 032 C: 032] Alert if EventStreams recentchange endpoint has no messages [puppet] - 10https://gerrit.wikimedia.org/r/433161 (https://phabricator.wikimedia.org/T174493) (owner: 10Ottomata)
[15:25:18] <wikibugs_>	 (03PS2) 10Giuseppe Lavagetto: Add the capability to check for deprecated defines [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/433162 (https://phabricator.wikimedia.org/T194724)
[15:25:21] <wikibugs_>	 (03PS2) 10Giuseppe Lavagetto: Check for all the available variants of a hiera call [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/433163
[15:25:38] <wikibugs_>	 (03PS1) 10Ottomata: Fix type in check_eventstreams script [puppet] - 10https://gerrit.wikimedia.org/r/433166 (https://phabricator.wikimedia.org/T174493)
[15:26:02] <wikibugs_>	 (03CR) 10Ottomata: [V: 032 C: 032] Fix type in check_eventstreams script [puppet] - 10https://gerrit.wikimedia.org/r/433166 (https://phabricator.wikimedia.org/T174493) (owner: 10Ottomata)
[15:26:27] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on db2067 is CRITICAL: cluster=mysql device=cciss,9 instance=db2067:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2067&var-datasource=codfw%2520prometheus%252Fops
[15:27:37] <wikibugs_>	 (03PS2) 10Herron: admin: add mmiller to analytics-privatedata-users and researchers [puppet] - 10https://gerrit.wikimedia.org/r/433083 (https://phabricator.wikimedia.org/T194550)
[15:28:08] <wikibugs_>	 (03CR) 10Addshore: Prepare Lexeme config for test.wikidata.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433145 (https://phabricator.wikimedia.org/T194250) (owner: 10Jakob)
[15:28:18] <wikibugs_>	 (03CR) 10Herron: [C: 032] admin: add mmiller to analytics-privatedata-users and researchers [puppet] - 10https://gerrit.wikimedia.org/r/433083 (https://phabricator.wikimedia.org/T194550) (owner: 10Herron)
[15:30:53] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1002 is CRITICAL: 12.07 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1002
[15:31:13] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1001 is CRITICAL: 14.83 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1001
[15:32:34] <elukey>	 it is already cleared out
[15:32:53] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka1002 is OK: (C)10 ge (W)5 ge 3.621 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1002
[15:33:10] <elukey>	 a bit weird though
[15:33:22] <elukey>	 because I am restarting a different cluster
[15:36:23] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1001 is CRITICAL: 14.33 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1001
[15:38:53] <mobrovac>	 weird indeed
[15:39:32] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka1001 is OK: (C)10 ge (W)5 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1001
[15:39:37] <wikibugs_>	 10Operations, 10DBA: https://tendril.wikimedia.org/ IPv6 doesn't work - https://phabricator.wikimedia.org/T194766#4208076 (10BBlack) 05Open>03Resolved Works now, thanks!
[15:41:03] <icinga-wm>	 RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[15:41:35] <wikibugs_>	 (03CR) 10Jakob: Prepare Lexeme config for test.wikidata.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433145 (https://phabricator.wikimedia.org/T194250) (owner: 10Jakob)
[15:46:00] <wikibugs_>	 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1065 - https://phabricator.wikimedia.org/T194698#4208115 (10Cmjohnson) The disk has been replaced. Please resolve once rebuild is complete
[15:50:00] <wikibugs_>	 (03PS8) 10Ema: numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865)
[15:52:15] <elukey>	 !log rolling restart of hadoop master daemons to pick up new zookeeper settings
[15:52:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:07] <ottomata>	 elukey:  you rebooted analytics kafka ya?
[15:56:13] <elukey>	 ottomata: only restarted kafka in there, not mm
[15:56:32] <ottomata>	 right
[15:56:44] <ottomata>	 i think that old 0.9 mms might need bouncing after cluster restart
[15:56:46] <wikibugs_>	 (03PS2) 10Jcrespo: mariadb: Failover dbproxy1007,8 and 9 and make them passive [dns] - 10https://gerrit.wikimedia.org/r/433015 (https://phabricator.wikimedia.org/T187962)
[15:56:48] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Move m3-slave from db1053 to db1072 [dns] - 10https://gerrit.wikimedia.org/r/433175 (https://phabricator.wikimedia.org/T194634)
[15:57:01] <elukey>	 ottomata: ah yes sorry I was about to do it after the last broker :(
[15:57:05] <wikibugs_>	 (03PS9) 10Ema: numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865)
[15:58:28] <ottomata>	 elukey:  it's a little funky that we actually need that...
[15:59:02] <wikibugs_>	 (03PS2) 10Jcrespo: mariadb: Move m3-slave from db1053 to db1072 [dns] - 10https://gerrit.wikimedia.org/r/433175 (https://phabricator.wikimedia.org/T194634)
[15:59:02] <elukey>	 ottomata: restarting for all the zk changes?
[16:00:04] <jouncebot>	 godog, moritzm, and _joe_: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180515T1600).
[16:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[16:00:06] <ottomata>	 elukey:  have to restart MM after a kafka cluster restart
[16:00:13] <ottomata>	 it shouldn't be needed, but you know, 0.9 is FLAKY
[16:00:30] <elukey>	 ahhh yes yes
[16:00:30] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] mariadb: Move m3-slave from db1053 to db1072 [dns] - 10https://gerrit.wikimedia.org/r/433175 (https://phabricator.wikimedia.org/T194634) (owner: 10Jcrespo)
[16:01:13] <ottomata>	 elukey:  i'm going to reduce sensitivity of that UnderReplicatedPartitions alert
[16:01:23] <elukey>	 ack
[16:01:28] <ottomata>	 not exactly sure what better setting might be
[16:01:33] <ottomata>	         # Alert if any undereplicated for more than 50%
[16:01:33] <ottomata>	         # of the time in the last 30 minutes.
[16:01:33] <ottomata>	         from            => '30min',
[16:01:33] <ottomata>	         percentage      => 50,
[16:01:37] <joal>	 Deploying AQS with elukey ops-team
[16:01:52] <ottomata>	 maybe percentage => 80?
[16:02:24] <wikibugs_>	 (03PS3) 10Jcrespo: mariadb: Failover dbproxy1007,8 and 9 and make them passive [dns] - 10https://gerrit.wikimedia.org/r/433015 (https://phabricator.wikimedia.org/T187962)
[16:03:03] <elukey>	 ottomata: what alert is that ? A prometheus one?
[16:03:13] <ottomata>	 oh
[16:03:16] <ottomata>	 sorry
[16:03:19] <ottomata>	 old analytics is graphite 
[16:03:23] <ottomata>	 prometheus just does
[16:03:26] <ottomata>	   # Alert on the average number of under replicated partitions over the last 30 minutes.
[16:03:31] <ottomata>	 avg_over_time(kafka_server_ReplicaManager_UnderReplicatedPartitions{${prometheus_labels}}[30m])
[16:03:42] <ottomata>	 maybe up it to 1h?
[16:03:43] <wikibugs_>	 (03PS4) 10Jcrespo: mariadb: Failover dbproxy1007,8 and 9 and make them passive [dns] - 10https://gerrit.wikimedia.org/r/433015 (https://phabricator.wikimedia.org/T187962)
[16:04:01] <wikibugs_>	 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4208208 (10Vgutierrez) Right now the TLS server allows the client to pick up the curve to use, since j8u121 (8u171-b11-1~deb9u1 is deployed on k...
[16:04:08] <ottomata>	 hm we don't really need an average?  
[16:04:13] <ottomata>	 we just want to know if there are currently any
[16:04:46] <elukey>	 yeah, and the time window must be short otherwise the bursts will alert and resolve only after a long time
[16:05:15] <logmsgbot>	 !log joal@tin Started deploy [analytics/aqs/deploy@a736558]: Deploying druid-configuration patch
[16:05:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:03] <wikibugs_>	 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4208219 (10MaxBioHazard) >so you should be able to keep using it.  Did you mean that we can change nothing in our bots, its compilation settings, and when you disable AES128 our...
[16:07:12] <ottomata>	 elukey:  maybe
[16:07:18] <ottomata>	 min_over_time [5m]
[16:07:18] <ottomata>	 ?
[16:07:34] <ottomata>	 if the min value in the last 5 mins is > X, alert
[16:07:34] <ottomata>	 ?
[16:07:40] <ottomata>	 that way it gives 5 minutes to get back to 0
[16:07:40] <ottomata>	 ?
[16:08:02] <ottomata>	 if the min is  > 0 in the last 5 minutes, warning, > 10, critical?
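To illustrate why the proposal above switches from an average to a minimum, here is a minimal sketch in plain Python. It is not the actual check: the sample series, the one-sample-per-minute spacing, and the function names are assumptions for illustration. The average thresholds (warning 5, critical 10, compared with "ge") are taken from the Icinga output earlier in the log; the minimum thresholds (> 0 warning, > 10 critical over 5 minutes) follow the proposal.

```python
from statistics import mean
from typing import List

WINDOW = 5  # number of most recent samples, assuming one sample per minute


def avg_alert(samples: List[float]) -> str:
    """Existing style: alert on the average over the whole window."""
    value = mean(samples)
    if value >= 10:
        return "CRITICAL"
    if value >= 5:
        return "WARNING"
    return "OK"


def min_alert(samples: List[float]) -> str:
    """Proposed style: alert only if the minimum over the last WINDOW samples is non-zero."""
    value = min(samples[-WINDOW:])
    if value > 10:
        return "CRITICAL"
    if value > 0:
        return "WARNING"
    return "OK"


# Hypothetical series: a short burst during a broker restart that recovers to 0.
burst = [0, 0, 40, 40, 40, 0, 0, 0, 0, 0]
# Hypothetical series: partitions that stay under-replicated the whole time.
stuck = [12] * 10

print(avg_alert(burst), min_alert(burst))  # average still fires; minimum does not
print(avg_alert(stuck), min_alert(stuck))  # both fire on a persistent problem
```

The minimum only stays above 0 if every sample in the window was above 0, so a restart burst that clears within the window never fires, while a persistent problem still does.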
[16:08:38] <wikibugs_>	 (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler02/11217/" [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema)
[16:09:36] <wikibugs_>	 (03PS1) 10Ottomata: Reduce sensitivity of Kafka Broker Under Replicated Partitions alert [puppet] - 10https://gerrit.wikimedia.org/r/433179
[16:09:38] <elukey>	 ottomata: could work, maybe 10m?
[16:09:52] <elukey>	 but I am also fine to test 5
[16:10:21] <wikibugs_>	 10Operations, 10Traffic: Enable Service in Asia Cache DC - https://phabricator.wikimedia.org/T156026#4208223 (10BBlack) 05Open>03Resolved a:03BBlack Closing this (a bit late), as service has been online for a while now.  Trailing remaining tasks re: Zero and/or further network engineering aren't really a...
[16:10:31] <elukey>	 ottomata: if you haven't started the jumbo restarts I can do them now
[16:10:38] <elukey>	 just finished analytics and hadoop
[16:11:02] <logmsgbot>	 !log joal@tin Finished deploy [analytics/aqs/deploy@a736558]: Deploying druid-configuration patch (duration: 05m 47s)
[16:11:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:19] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Make db1072, and not db1053, the passive m3 failover [puppet] - 10https://gerrit.wikimedia.org/r/433180 (https://phabricator.wikimedia.org/T194634)
[16:11:21] <wikibugs_>	 10Operations, 10Traffic: Enable Service in Asia Cache DC - https://phabricator.wikimedia.org/T156026#4208229 (10BBlack)
[16:11:27] <wikibugs_>	 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4208227 (10BBlack) 05Open>03Resolved Closing this as well, we're through the basic turn-up process.  Trailing wor...
[16:11:31] <wikibugs_>	 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4208230 (10Vgutierrez) >>! In T194380#4208219, @MaxBioHazard wrote: >>so you should be able to keep using it. >  > Did you mean that we can change nothing in our bots, its compil...
[16:11:35] <ottomata>	 elukey:  haven't restarted, let me deploy this alert change first
[16:11:47] <elukey>	 ottomata: all right starting them now! 
[16:11:48] <ottomata>	 elukey:  ya maybe 10 mins ok
[16:11:50] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] mariadb: Make db1072, and not db1053, the passive m3 failover [puppet] - 10https://gerrit.wikimedia.org/r/433180 (https://phabricator.wikimedia.org/T194634) (owner: 10Jcrespo)
[16:12:06] <elukey>	 !log roll restart kafka on kafka-jumbo to pick up new zookeeper settings
[16:12:06] <ottomata>	 elukey:  wait!  let's see if this alert change will fix flappy alert
[16:12:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:18] <ottomata>	 haven't merged yet
[16:12:32] <elukey>	 ottomata: ah yes sure, it will take me a while before doing all the brokers, it should be fine :D
[16:13:04] <ottomata>	 haha ok
[16:13:10] <wikibugs_>	 (03PS2) 10Ottomata: Reduce sensitivity of Kafka Broker Under Replicated Partitions alert [puppet] - 10https://gerrit.wikimedia.org/r/433179
[16:13:49] <wikibugs_>	 (03PS3) 10Ottomata: Reduce sensitivity of Kafka Broker Under Replicated Partitions alert [puppet] - 10https://gerrit.wikimedia.org/r/433179
[16:14:00] <wikibugs_>	 (03CR) 10Ottomata: [V: 032 C: 032] Reduce sensitivity of Kafka Broker Under Replicated Partitions alert [puppet] - 10https://gerrit.wikimedia.org/r/433179 (owner: 10Ottomata)
[16:15:23] <jynus>	 1 or 2 dbproxies will complain now
[16:15:31] <jynus>	 fixing it, no user impact
[16:17:47] <jynus>	 fixed now
[16:18:31] <wikibugs_>	 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, and 4 others: rack/setup/install wdqs10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T194184#4208249 (10Cmjohnson)
[16:19:03] <wikibugs_>	 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Access to usergroups for Marshall Miller - https://phabricator.wikimedia.org/T194550#4208251 (10herron) 05Open>03Resolved Ok @MMiller_WMF, you should be good to go!  In case you haven't seen them already, there are instructions at ht...
[16:20:40] <logmsgbot>	 !log milimetric@tin Started deploy [analytics/refinery@679cf09]: Update partition drop script after schema change
[16:20:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:41] <logmsgbot>	 !log mobrovac@tin Started restart [changeprop/deploy@e468d8e]: (no justification provided)
[16:21:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:38] <wikibugs_>	 (03PS2) 10Dzahn: admins: add mholloway to maps/tilerator/kartotherian admins [puppet] - 10https://gerrit.wikimedia.org/r/433021 (https://phabricator.wikimedia.org/T194404)
[16:22:42] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] admins: add mholloway to maps/tilerator/kartotherian admins [puppet] - 10https://gerrit.wikimedia.org/r/433021 (https://phabricator.wikimedia.org/T194404) (owner: 10Dzahn)
[16:23:55] <wikibugs_>	 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4208278 (10MaxBioHazard) I should execute this string on Toolforge console?
[16:24:23] <wikibugs_>	 (03PS1) 10Joal: Update AQS druid datasource to snapshot-postfixed [puppet] - 10https://gerrit.wikimedia.org/r/433182
[16:24:32] <joal>	 elukey: --^
[16:24:44] <icinga-wm>	 PROBLEM - Zookeeper Server on conf1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg
[16:25:07] <logmsgbot>	 !log mobrovac@tin Started restart [cpjobqueue/deploy@58935d5]: (no justification provided)
[16:25:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:35] <elukey>	 ah yes expired downtime for conf1001
[16:25:40] <elukey>	 all good :)
[16:25:44] <elukey>	 lemme ack it
[16:25:53] <icinga-wm>	 PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[zookeeper]
[16:25:54] <wikibugs_>	 10Operations, 10Reading-Infrastructure-Team-Backlog, 10SRE-Access-Requests, 10Patch-For-Review: Add Michael Holloway (Reading Infrastructure) to maps admin groups - https://phabricator.wikimedia.org/T194404#4208295 (10Dzahn) @Mholloway This is done, you have been added to the requested groups.  I ran puppe...
[16:26:43] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[16:27:14] <icinga-wm>	 PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[16:27:41] <wikibugs_>	 10Operations, 10Reading-Infrastructure-Team-Backlog, 10SRE-Access-Requests, 10Patch-For-Review: Add Michael Holloway (Reading Infrastructure) to maps admin groups - https://phabricator.wikimedia.org/T194404#4208300 (10Dzahn) 05Open>03Resolved [maps2001:~] $ id mholloway-shell uid=11963(mholloway-shell)...
[16:27:53] <logmsgbot>	 !log milimetric@tin Finished deploy [analytics/refinery@679cf09]: Update partition drop script after schema change (duration: 07m 13s)
[16:27:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:15] <wikibugs_>	 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4208305 (10Vgutierrez) >>! In T194380#4208278, @MaxBioHazard wrote: > I should execute this string on Toolforge console? And recompile mono after this?  If I'm reading the mono i...
[16:28:35] <wikibugs_>	 (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11218/aqs1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/433182 (owner: 10Joal)
[16:28:41] <wikibugs_>	 (03PS2) 10Elukey: Update AQS druid datasource to snapshot-postfixed [puppet] - 10https://gerrit.wikimedia.org/r/433182 (owner: 10Joal)
[16:29:02] <joal>	 elukey: once applied, we'll need to restart AQS (and test)
[16:29:22] <wikibugs_>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4208306 (10chasemp)
[16:30:51] <elukey>	 joal: going to depool and apply it to aqs1004 ok ?
[16:30:57] <joal>	 elukey: ack !
[16:31:29] <elukey>	 joal: done!
[16:31:56] <wikibugs_>	 (03PS1) 10Elukey: role::configcluster: decom zookeeper on conf1001 [puppet] - 10https://gerrit.wikimedia.org/r/433183 (https://phabricator.wikimedia.org/T182924)
[16:32:00] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[16:32:23] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] role::configcluster: decom zookeeper on conf1001 [puppet] - 10https://gerrit.wikimedia.org/r/433183 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey)
[16:32:30] <icinga-wm>	 RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[16:32:32] <wikibugs_>	 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4208314 (10MaxBioHazard) My bots are launched from cron. I hope, execute this string once would be enough.
[16:33:16] <wikibugs_>	 (03PS1) 10Cmjohnson: Adding DNS for wdqs10[09-10] [dns] - 10https://gerrit.wikimedia.org/r/433184 (https://phabricator.wikimedia.org/T194184)
[16:33:19] <wikibugs_>	 (03PS2) 10Elukey: role::configcluster: decom zookeeper on conf1001 [puppet] - 10https://gerrit.wikimedia.org/r/433183 (https://phabricator.wikimedia.org/T182924)
[16:33:53] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] role::configcluster: decom zookeeper on conf1001 [puppet] - 10https://gerrit.wikimedia.org/r/433183 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey)
[16:34:24] <wikibugs_>	 (03CR) 10Cmjohnson: [C: 032] Adding DNS for wdqs10[09-10] [dns] - 10https://gerrit.wikimedia.org/r/433184 (https://phabricator.wikimedia.org/T194184) (owner: 10Cmjohnson)
[16:34:55] <joal>	 elukey: look good to me !
[16:35:30] <joal>	 elukey: druid answers requests sent by AQS on the newly configured datasource, giving correct numbers
[16:35:41] <joal>	 elukey: We can continue the rollout :)
[16:35:42] <wikibugs_>	 (03PS2) 10Andrew Bogott: openstack: move nova-api and nova-network functions to labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/433153 (https://phabricator.wikimedia.org/T193579)
[16:35:51] <elukey>	 joal: ack
[16:35:55] <wikibugs_>	 (03CR) 10Andrew Bogott: [V: 032 C: 032] openstack: move nova-api and nova-network functions to labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/433153 (https://phabricator.wikimedia.org/T193579) (owner: 10Andrew Bogott)
[16:36:09] <elukey>	 !log rolling restart of aqs on aqs* nodes to pick up the new druid config
[16:36:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:50] <icinga-wm>	 PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[16:37:56] <wikibugs_>	 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4208339 (10Reedy) It won't be, just single line it  `MONO_TLS_PROVIDER=btls mono bot.exe`
[17:00:38] <icinga-wm>	 RECOVERY - toolschecker: tools nginx proxy health on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 642 bytes in 0.002 second response time
[17:00:57] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[17:01:26] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[17:01:57] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active
[17:02:26] <icinga-wm>	 RECOVERY - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1015 bytes in 8.667 second response time
[17:02:56] <icinga-wm>	 RECOVERY - Host paws.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms
[17:03:21] <wikibugs_>	 (03PS2) 10Dzahn: update MAC address of mw2139 [puppet] - 10https://gerrit.wikimedia.org/r/433185 (https://phabricator.wikimedia.org/T194426)
[17:04:11] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] update MAC address of mw2139 [puppet] - 10https://gerrit.wikimedia.org/r/433185 (https://phabricator.wikimedia.org/T194426) (owner: 10Dzahn)
[17:04:27] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active
[17:04:46] <wikibugs_>	 (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler03/11219/" [puppet] - 10https://gerrit.wikimedia.org/r/433183 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey)
[17:05:06] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[17:05:40] <wikibugs_>	 (03CR) 10Dzahn: [V: 032 C: 032] update MAC address of mw2139 [puppet] - 10https://gerrit.wikimedia.org/r/433185 (https://phabricator.wikimedia.org/T194426) (owner: 10Dzahn)
[17:06:12] <mutante>	 jenkins not voting due to labs maintenance i assume
[17:07:07] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active
[17:09:40] <wikibugs_>	 (03PS2) 10Cmjohnson: Adding dhcpd entry wdqs1009-10 [puppet] - 10https://gerrit.wikimedia.org/r/433186 (https://phabricator.wikimedia.org/T194184)
[17:10:27] <wikibugs_>	 (03CR) 10Cmjohnson: [C: 032] Adding dhcpd entry wdqs1009-10 [puppet] - 10https://gerrit.wikimedia.org/r/433186 (https://phabricator.wikimedia.org/T194184) (owner: 10Cmjohnson)
[17:11:53] <wikibugs_>	 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, and 5 others: rack/setup/install wdqs10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T194184#4208430 (10Cmjohnson)
[17:12:43] <wikibugs_>	 (03PS1) 10Ottomata: check_eventstreams - exit 2 if critical [puppet] - 10https://gerrit.wikimedia.org/r/433190
[17:15:40] <ottomata>	 !log rolling restart kafka-jumbo100[456] 
[17:15:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:07] <wikibugs_>	 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, and 5 others: rack/setup/install wdqs10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T194184#4208445 (10Cmjohnson) @robh the dhcpd file has been updated but not sure which partman recipe...feel free to add and continue with installation....
[17:17:36] <icinga-wm>	 RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[17:18:01] <wikibugs_>	 (03PS2) 10Ottomata: check_eventstreams - exit 2 if critical [puppet] - 10https://gerrit.wikimedia.org/r/433190
[17:18:01] <wikibugs_>	 (03CR) 10Ottomata: [V: 032 C: 032] check_eventstreams - exit 2 if critical [puppet] - 10https://gerrit.wikimedia.org/r/433190 (owner: 10Ottomata)
[17:21:31] <wikibugs_>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4208447 (10chasemp)
[17:21:36] <wikibugs_>	 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1065 - https://phabricator.wikimedia.org/T194698#4208449 (10jcrespo) 05Open>03Resolved ``` $ megacli -PDList -aALL | grep 'state' Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Sp...
[17:22:35] <wikibugs_>	 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4208451 (10Andrew) @Cmjohnson Just to clarify, labnet1003 and 1004 aren't in active service so you can move them anytime; just check in with @chasemp after they're moved...
[17:25:39] <icinga-wm>	 PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init]
[17:25:59] <ottomata>	 hm
[17:27:45] <twentyafterfour>	 !log branching 1.32.0-wmf.4 refs T191050
[17:27:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:49] <stashbot>	 T191050: 1.32.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T191050
[17:28:54] <wikibugs_>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4208462 (10chasemp) We ran through our normal procedure to fail traffic from labnet1002 back to labnet1001 (post move this morning).  Labnet1001 saw incoming traffic from...
[17:30:39] <icinga-wm>	 RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:36:18] <wikibugs_>	 (03PS1) 10Ottomata: Use fqdn instead of localhost for curl eventstreams check [puppet] - 10https://gerrit.wikimedia.org/r/433194 (https://phabricator.wikimedia.org/T174493)
[17:36:54] <wikibugs_>	 (03PS2) 10Herron: admin: add user seddon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/433025 (https://phabricator.wikimedia.org/T194445)
[17:37:04] <wikibugs_>	 (03PS2) 10Ottomata: Use proper path and fqdn for eventstreams check [puppet] - 10https://gerrit.wikimedia.org/r/433194 (https://phabricator.wikimedia.org/T174493)
[17:37:11] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] admin: add user seddon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/433025 (https://phabricator.wikimedia.org/T194445) (owner: 10Herron)
[17:37:32] <wikibugs_>	 (03CR) 10Ottomata: [V: 032 C: 032] Use proper path and fqdn for eventstreams check [puppet] - 10https://gerrit.wikimedia.org/r/433194 (https://phabricator.wikimedia.org/T174493) (owner: 10Ottomata)
[17:43:03] <wikibugs_>	 (03PS3) 10Herron: admin: add user seddon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/433025 (https://phabricator.wikimedia.org/T194445)
[17:47:51] <wikibugs_>	 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: mw2139 failed to boot - hardware check - https://phabricator.wikimedia.org/T194426#4208503 (10Papaul) a:05Papaul>03Dzahn @Dzahn I replaced the main board., Update the IDRAC and BIOS. it is all yours. I also installed the OS on the system.
[17:51:29] <icinga-wm>	 PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init]
[17:56:50] <wikibugs_>	 (03PS4) 10Herron: admin: add user seddon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/433025 (https://phabricator.wikimedia.org/T194445)
[18:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180515T1800)
[18:01:45] <wikibugs_>	 (03CR) 10Herron: [C: 032] admin: add user seddon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/433025 (https://phabricator.wikimedia.org/T194445) (owner: 10Herron)
[18:22:59] <icinga-wm>	 RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[18:25:08] <wikibugs_>	 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4208556 (10Vgutierrez) @MaxBioHazard please let us know when you make the change to check on our side that everything looks good :)
[18:25:38] <icinga-wm>	 PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init]
[18:36:36] <ottomata>	 weird
[18:48:42] <wikibugs_>	 (03PS3) 10Bstorm: wiki replicas: return page to a full view [puppet] - 10https://gerrit.wikimedia.org/r/433085 (https://phabricator.wikimedia.org/T174047)
[18:49:57] <wikibugs_>	 (03CR) 10Bstorm: [C: 032] wiki replicas: return page to a full view [puppet] - 10https://gerrit.wikimedia.org/r/433085 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm)
[18:51:49] <icinga-wm>	 PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init]
[18:51:49] <icinga-wm>	 RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[19:00:05] <jouncebot>	 twentyafterfour: That opportune time is upon us again. Time for a MediaWiki train deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180515T1900).
[19:03:51] <mutante>	 !log mw2139 - wmf-auto-reimage --conftool --no-verify (T194426)
[19:03:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:56] <stashbot>	 T194426: mw2139 failed to boot - hardware check - https://phabricator.wikimedia.org/T194426
[19:04:15] <wikibugs_>	 (03PS1) 10Bstorm: wiki replicas: depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/433206 (https://phabricator.wikimedia.org/T174047)
[19:04:26] <mutante>	 "Unable to run wmf-auto-reimage-host: Failed to icinga_downtime"  hmmmmrrr
[19:05:20] <mutante>	 !log mw2139 - wmf-auto-reimage --conftool --new (because it got "Failed to icinga_downtime" and has a new mainboard) (T194426)
[19:05:22] <wikibugs_>	 (03PS1) 10Urbanecm: New throttle rule for WMF Hackhathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433207 (https://phabricator.wikimedia.org/T194392)
[19:05:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:25] <wikibugs_>	 (03CR) 10Bstorm: "Is there a stretch repo?  I am hoping to bring stretch into the mix on tools eventually." [puppet] - 10https://gerrit.wikimedia.org/r/433142 (https://phabricator.wikimedia.org/T194665) (owner: 10Arturo Borrero Gonzalez)
[19:07:28] <wikibugs_>	 (03CR) 10Vgutierrez: "> Is there a stretch repo?  I am hoping to bring stretch into the mix" [puppet] - 10https://gerrit.wikimedia.org/r/433142 (https://phabricator.wikimedia.org/T194665) (owner: 10Arturo Borrero Gonzalez)
[19:20:18] <wikibugs_>	 (03CR) 10Gehel: "minor comment inline" (031 comment) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/432136 (https://phabricator.wikimedia.org/T193734) (owner: 10DCausse)
[19:20:35] <wikibugs_>	 (03CR) 10Foks: [C: 031] "Looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433136 (https://phabricator.wikimedia.org/T152296) (owner: 10MarcoAurelio)
[19:23:28] <icinga-wm>	 RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[19:26:57] <wikibugs_>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Seddon access to the analytics cluster - https://phabricator.wikimedia.org/T194445#4208771 (10herron) 05Open>03Resolved Hi @Jseddon, your shell account `seddon` has been created and added to group `analytics-privatedata-users`.  You should now...
[19:29:34] <wikibugs_>	 (03CR) 10Gehel: "minor comments inline, otherwise LGTM" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron)
[19:37:22] <wikibugs_>	 (03PS1) 1020after4: testwikis wikis to 1.32.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433209
[19:37:24] <wikibugs_>	 (03CR) 1020after4: [C: 032] testwikis wikis to 1.32.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433209 (owner: 1020after4)
[19:37:40] <wikibugs_>	 (03CR) 10BryanDavis: "I added bblack as a reviewer to get a sanity check on this approach for selective https upgrades." [puppet] - 10https://gerrit.wikimedia.org/r/432935 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis)
[19:38:53] <wikibugs_>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.32.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433209 (owner: 1020after4)
[19:39:42] <logmsgbot>	 !log twentyafterfour@tin Started scap: testwikis wikis to 1.32.0-wmf.4
[19:39:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:49] <wikibugs_>	 (03CR) 10jenkins-bot: testwikis wikis to 1.32.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433209 (owner: 1020after4)
[19:40:13] <bawolff>	 twentyafterfour: sorry, not at home, can't log in to tin to check the patch
[19:40:34] <bawolff>	 twentyafterfour: if you pastebin it somewhere i could look
[19:40:35] <twentyafterfour>	 bawolff: ok I'll paste it on phab if that'll work? 
[19:40:38] <twentyafterfour>	 ok cool
[19:41:48] <bawolff>	 I remember when i applied it in the first place there were big conflicts compared to HEAD. the version on the bug might have been easier to resolve
[19:42:30] <logmsgbot>	 !log twentyafterfour@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_770814178" --threads=10 --lang en  --quiet' returned non-zero exit status 255 (duration: 02m 47s)
[19:42:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:05] <twentyafterfour>	 bawolff: https://phabricator.wikimedia.org/P7132
[19:44:44] <twentyafterfour>	 conflicts were not complex to resolve I just wanted a second pair of eyes on it
[19:45:03] <bawolff>	 twentyafterfour: looks good
[19:45:20] <twentyafterfour>	 bawolff: thanks!
[19:49:58] <twentyafterfour>	 Fatal error: Uncaught exception 'Exception' with message '/srv/mediawiki-staging/php-1.32.0-wmf.4/extensions/CongressLookup/extension.json does not exist!' in /srv/mediawiki-staging/php-1.32.0-wmf.4/includes/registration/ExtensionRegistry.php:105
[19:50:44] <paladox>	 twentyafterfour doesn't seem it was branched
[19:50:45] <paladox>	 for wmf.4
[19:50:55] <paladox>	 https://github.com/wikimedia/mediawiki-extensions-CongressLookup/branches
[19:51:10] <twentyafterfour>	 paladox: yeah ...
[19:52:05] <twentyafterfour>	 it's not in the make-wmf-branch config 
[19:52:42] <twentyafterfour>	 I don't know why extensionregistry is looking for it
[19:53:06] <twentyafterfour>	 was this just added to mediawiki-config without adding it to the branch config? 
[19:54:39] <bawolff>	 It was deployed as a rush job so wouldn't surprise me if someone missed that
[19:54:40] <twentyafterfour>	 weird...it's in wmf.3 but I don't see a patch removing it 
[20:08:52] <paladox>	 twentyafterfour it was added, but i think as bawolff suggests, they may have overlooked adding it to that script by mistake
[20:10:48] <wikibugs_>	 (03PS1) 10Ottomata: Set jdk.tls.namedGroups=secp256r1 for Kafka TLS [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993)
[20:11:08] <wikibugs_>	 (03CR) 10Ottomata: [C: 04-1] "UNTESTED! :)" [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993) (owner: 10Ottomata)
[20:11:37] <wikibugs_>	 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4208839 (10Ottomata) So, something like ^?
[20:12:07] <wikibugs_>	 (03PS2) 10Ottomata: Set jdk.tls.namedGroups=secp256r1 for Kafka TLS [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993)
[20:12:56] <wikibugs_>	 (03PS3) 10Ottomata: Set jdk.tls.namedGroups=secp256r1 for Kafka TLS [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993)
[20:13:18] <twentyafterfour>	 damnit
[20:13:35] * twentyafterfour just submitted a security patch to gerrit
[20:19:41] <paladox>	 twentyafterfour you can delete it
[20:19:49] <twentyafterfour>	 I did
[20:22:25] <mutante>	 !log [radium:~] $ sudo apt-get autoremove
[20:22:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:27] <wikibugs_>	 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Rebuild raids on labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#4208854 (10bd808)
[20:23:38] <icinga-wm>	 RECOVERY - Disk space on furud is OK: DISK OK
[20:23:48] <icinga-wm>	 PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init]
[20:27:45] <logmsgbot>	 !log twentyafterfour@tin Started scap: testwikis to 1.32.0-wmf.4 refs T191050
[20:27:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:50] <stashbot>	 T191050: 1.32.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T191050
[20:29:54] <mutante>	 @seen wikibugs
[20:29:54] <wm-bot>	 mutante: Last time I saw wikibugs they were talking in the channel, but they are not in the channel now and I don't know why, in #mediawiki-feed at 3/12/2018 12:21:26 PM (64d8h8m27s ago)
[20:30:07] <mutante>	 yea, that ^
[20:30:18] <icinga-wm>	 PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100%
[20:31:28] <icinga-wm>	 RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms
[20:31:59] <wikibugs_>	 10Operations: replace/reinstall radium with a stretch system - https://phabricator.wikimedia.org/T194796#4208919 (10faidon) 05Open>03declined radium is super old hardware (2011 era) and its refresh is imminent, as part of T189317. No reason to spend time to reimage at this point :)
[20:33:50] <wikibugs_>	 10Operations: replace/reinstall radium with a stretch system - https://phabricator.wikimedia.org/T194796#4208923 (10Dzahn) Alright, for that case i had the "or replace the hardware with other hardware running stretch and switch the role over, then decom radium" option.
[20:35:06] <mutante>	 ah, wikibugs with underscore
[20:38:32] <wikibugs_>	 (03PS4) 10Ottomata: Set jdk.tls.namedGroups=secp256r1 for Kafka TLS [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993)
[20:42:23] <icinga-wm>	 PROBLEM - Disk space on mw2139 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:42:23] <icinga-wm>	 PROBLEM - dhclient process on mw2139 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:44:32] <icinga-wm>	 ACKNOWLEDGEMENT - HHVM processes on mw2139 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. daniel_zahn T194426
[20:46:08] <wikibugs_>	 (03PS5) 10Ottomata: Set jdk.tls.namedGroups=secp256r1 for Kafka TLS [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993)
[20:47:33] <wikibugs_>	 (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/11224/kafka-jumbo1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993) (owner: 10Ottomata)
[20:49:32] <icinga-wm>	 RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[20:49:34] <wikibugs_>	 (03CR) 10Vgutierrez: [C: 031] "Nice, the effect of this can be tested with: openssl s_client -brief -curves sect283k1:sect283r1:sect409k1:sect409r1:sect571k1:sect571r1:s" [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993) (owner: 10Ottomata)
[20:51:07] <wikibugs_>	 (03PS1) 10Dzahn: phab/phd: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724)
[20:52:23] <wikibugs_>	 (03CR) 10Paladox: [C: 031] phab/phd: base::service_unit -> systemd::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn)
[20:53:22] <wikibugs_>	 (03CR) 10Dzahn: "i should probably remove the "            provider   => $::initsystem," [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn)
[20:54:42] <icinga-wm>	 PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init]
[20:56:24] <wikibugs_>	 (03PS2) 10Dzahn: phab/phd: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724)
[20:58:34] <wikibugs_>	 (03PS3) 10Dzahn: phabricator: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724)
[20:59:52] <wikibugs_>	 (03CR) 10Paladox: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn)
[21:20:53] <icinga-wm>	 RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[21:30:17] <wikibugs_>	 (03PS1) 10Dzahn: tor: add icinga check_tcp for ORPort and DirPort [puppet] - 10https://gerrit.wikimedia.org/r/433284 (https://phabricator.wikimedia.org/T148614)
[21:30:59] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] tor: add icinga check_tcp for ORPort and DirPort [puppet] - 10https://gerrit.wikimedia.org/r/433284 (https://phabricator.wikimedia.org/T148614) (owner: 10Dzahn)
[21:31:50] <wikibugs_>	 (03PS2) 10Dzahn: tor: add icinga check_tcp for ORPort and DirPort [puppet] - 10https://gerrit.wikimedia.org/r/433284 (https://phabricator.wikimedia.org/T148614)
[21:34:04] <logmsgbot>	 !log twentyafterfour@tin Finished scap: testwikis to 1.32.0-wmf.4 refs T191050 (duration: 66m 19s)
[21:34:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:09] <stashbot>	 T191050: 1.32.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T191050
[21:52:21] <wikibugs_>	 (03PS1) 10Dzahn: introduce webperf1002 & webperf2002 [dns] - 10https://gerrit.wikimedia.org/r/433287 (https://phabricator.wikimedia.org/T194390)
[21:54:52] <wikibugs_>	 (03PS3) 10Samwilson: Deploy GlobalPreferences to test wikis and mw.org (forth time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433149 (https://phabricator.wikimedia.org/T190425)
[21:55:57] <wikibugs_>	 (03PS2) 10Dzahn: introduce webperf1002 & webperf2002 [dns] - 10https://gerrit.wikimedia.org/r/433287 (https://phabricator.wikimedia.org/T194390)
[21:58:46] <wikibugs_>	 (03PS1) 1020after4: group0 wikis to 1.32.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433288
[21:58:48] <wikibugs_>	 (03CR) 1020after4: [C: 032] group0 wikis to 1.32.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433288 (owner: 1020after4)
[21:59:59] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] introduce webperf1002 & webperf2002 [dns] - 10https://gerrit.wikimedia.org/r/433287 (https://phabricator.wikimedia.org/T194390) (owner: 10Dzahn)
[22:00:04] <jouncebot>	 MaxSem and samwilson: (Dis)respected human, time to deploy GlobalPreferences test deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180515T2200). Please do the needful.
[22:00:06] <wikibugs_>	 (Merged) jenkins-bot: group0 wikis to 1.32.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/433288 (owner: 20after4)
[22:00:08] <wikibugs_>	 (PS4) Legoktm: Deploy GlobalPreferences to test wikis and mw.org (fourth time) [mediawiki-config] - https://gerrit.wikimedia.org/r/433149 (https://phabricator.wikimedia.org/T190425) (owner: Samwilson)
[22:00:21] <wikibugs_>	 (CR) jenkins-bot: group0 wikis to 1.32.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/433288 (owner: 20after4)
[22:02:17] <wikibugs_>	 (CR) Samwilson: [C: 2] Deploy GlobalPreferences to test wikis and mw.org (fourth time) [mediawiki-config] - https://gerrit.wikimedia.org/r/433149 (https://phabricator.wikimedia.org/T190425) (owner: Samwilson)
[22:03:50] <wikibugs_>	 (Merged) jenkins-bot: Deploy GlobalPreferences to test wikis and mw.org (fourth time) [mediawiki-config] - https://gerrit.wikimedia.org/r/433149 (https://phabricator.wikimedia.org/T190425) (owner: Samwilson)
[22:04:07] <wikibugs_>	 Operations, vm-requests, Patch-For-Review: EQIAD & CODFW: 1 VM in each data center for xhprof/xhgui/other profiling tools - https://phabricator.wikimedia.org/T194390#4209076 (Dzahn) assigned IPs to webperf1002 and webperf2002:  eqiad forward webperf1001.eqiad.wmnet has address 10.64.0.215 webperf1002...
[22:06:43] <wikibugs_>	 (CR) jenkins-bot: Deploy GlobalPreferences to test wikis and mw.org (fourth time) [mediawiki-config] - https://gerrit.wikimedia.org/r/433149 (https://phabricator.wikimedia.org/T190425) (owner: Samwilson)
[22:07:14] <wikibugs_>	 Operations, SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4209078 (RobH) The email thread with legal seems to have reached a conclusion in support, so I'm now in the process of adding admin2@gofishdigital.com to the subdomains:    >>!...
[22:07:26] <samwilson>	 twentyafterfour: are you still deploying?
[22:15:09] <MaxSem>	 twentyafterfour: I see some undeployed wikiversions changes
[22:15:24] <twentyafterfour>	 still deploying yes
[22:15:57] <twentyafterfour>	 MaxSem: that's pushing the branch to group0, I stopped deploying that to fix the syntax error
[22:17:40] <twentyafterfour>	 samwilson: will be finished shortly
[22:17:56] <samwilson>	 twentyafterfour: no worries, thanks
[22:18:59] <logmsgbot>	 !log twentyafterfour@tin Synchronized php-1.32.0-wmf.4/includes/api/ApiLogin.php: fix syntax error (duration: 01m 39s)
[22:19:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:23] <icinga-wm>	 PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init]
[22:21:53] <logmsgbot>	 !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group0 to 1.32.0-wmf.4 refs T191050
[22:21:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:57] <stashbot>	 T191050: 1.32.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T191050
[22:26:10] <wikibugs_>	 Operations, SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4209110 (RobH) Open>Resolved a:RobH >>! In T192893#4209078, @RobH wrote: > The email thread with legal seems to have reached a conclusion in support, so I'm now in...
[22:29:52] <wikibugs_>	 (PS1) Samwilson: Revert "Deploy GlobalPreferences to test wikis and mw.org (fourth time)" [mediawiki-config] - https://gerrit.wikimedia.org/r/433291
[22:30:44] <Niharika>	 Wait what? Reverting already? :(
[22:31:28] <samwilson>	 Niharika: haven't deployed yet; waiting on other stuff to finish
[22:32:04] <Niharika>	 samwilson: Maybe we won't need the revert patch! :)
[22:33:33] <samwilson>	 Niharika: i shall be happy to abandon it
[22:39:09] <logmsgbot>	 !log samwilson@tin Synchronized wmf-config/InitialiseSettings.php: Deploying GlobalPreferences T190425 (duration: 01m 21s)
[22:39:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:39:14] <stashbot>	 T190425: GlobalPreferences deploy caused a significant increase in reads on s3 - https://phabricator.wikimedia.org/T190425
[22:42:05] <Niharika>	 Nothing seems to be blowing up yet, right?
[22:42:24] <Niharika>	 That chart is already pretty suspicious but not because of us.
[22:47:07] <samwilson>	 Niharika: yeah I was wondering about that earlier rise. but nothing to do with us :) and all's looking ok.
[22:47:30] <Niharika>	 Woooooo. :D
[22:52:32] <icinga-wm>	 RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[23:00:04] <jouncebot>	 addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180515T2300).
[23:00:04] <jouncebot>	 Urbanecm: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:06:18] <mutante>	 !log creating ganeti VM webperf1002.eqiad.wmnet on ganeti1004 (link: private, row: A, cpus: 4, ram: 8, disk: 50) (T194390)
[23:06:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:23] <stashbot>	 T194390: EQIAD & CODFW: 1 VM in each data center for xhprof/xhgui/other profiling tools - https://phabricator.wikimedia.org/T194390
[23:10:49] <mutante>	 !log mw2139 - reimaged, scap pull, apache-fast-test baseurls from naos, repooled with confctl  (T194426)
[23:10:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:53] <stashbot>	 T194426: mw2139 failed to boot - hardware check - https://phabricator.wikimedia.org/T194426
[23:11:04] <logmsgbot>	 !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2139.codfw.wmnet
[23:11:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:14] <wikibugs_>	 Operations, ops-codfw, DC-Ops, Patch-For-Review: mw2139 failed to boot - hardware check - https://phabricator.wikimedia.org/T194426#4209152 (Dzahn) Open>Resolved Thank you @Papaul!  Works and is in use again now. Closing ticket as resolved.  (fyi @Muehlenhoff )
[23:18:35] <wikibugs_>	 (PS1) Dzahn: admins/dzahn: add makevm.sh (create ganeti) to my ~ files [puppet] - https://gerrit.wikimedia.org/r/433296
[23:20:29] <wikibugs_>	 (CR) Dzahn: admins/dzahn: add makevm.sh (create ganeti) to my ~ files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/433296 (owner: Dzahn)
[23:26:33] <icinga-wm>	 PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init]
[23:33:35] <mutante>	 !log creating ganeti VM webperf2002.eqiad.wmnet on ganeti2004 (link: private, row: A, cpus: 4, ram: 8, disk: 50) (T194390)
[23:33:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:39] <stashbot>	 T194390: EQIAD & CODFW: 1 VM in each data center for xhprof/xhgui/other profiling tools - https://phabricator.wikimedia.org/T194390
[23:38:32] <wikibugs_>	 (PS1) Dzahn: add webperf1002/2002 as spare systems with IPv6 [puppet] - https://gerrit.wikimedia.org/r/433298 (https://phabricator.wikimedia.org/T194390)
[23:39:12] <wikibugs_>	 (CR) jerkins-bot: [V: -1] add webperf1002/2002 as spare systems with IPv6 [puppet] - https://gerrit.wikimedia.org/r/433298 (https://phabricator.wikimedia.org/T194390) (owner: Dzahn)
[23:40:10] <wikibugs_>	 (PS1) Dzahn: webperf: add IPv6 mapped address to role [puppet] - https://gerrit.wikimedia.org/r/433299
[23:40:41] <wikibugs_>	 (CR) jerkins-bot: [V: -1] webperf: add IPv6 mapped address to role [puppet] - https://gerrit.wikimedia.org/r/433299 (owner: Dzahn)