[00:20:24] Operations, Wikimedia-Mailing-lists: Enable CAPTCHA on mailman instances - https://phabricator.wikimedia.org/T194558#4201943 (lfaraone) First: I personally like reCAPTCHA, and think it provides a lot of value from a security/abuse PoV. Yet we need to consider carefully whether we can deploy it on Wikimed...
[00:24:14] (PS4) Dzahn: analytics_cluster::webserver: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/416742
[00:24:52] (CR) jerkins-bot: [V: -1] analytics_cluster::webserver: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/416742 (owner: Dzahn)
[00:27:30] (PS5) Dzahn: analytics_cluster::webserver: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/416742
[00:38:56] (CR) Dzahn: "hmmm.. something still uses the apache module that also gets included here... causing a duplicate declaration.. but what is it" [puppet] - https://gerrit.wikimedia.org/r/416742 (owner: Dzahn)
[00:43:19] (CR) Dzahn: [C: -1] "can anyone see where there the additional usage of the apache module comes from that causes this issue?
http://puppet-compiler.wmflabs.or" [puppet] - https://gerrit.wikimedia.org/r/416742 (owner: Dzahn)
[00:44:19] (CR) Dzahn: [C: -1] cache::misc: switch noc.wm,dbtree.wm backends to mwmaint1001 [puppet] - https://gerrit.wikimedia.org/r/430527 (https://phabricator.wikimedia.org/T192092) (owner: Dzahn)
[00:44:37] (CR) Dzahn: "scheduled for May 25th" [puppet] - https://gerrit.wikimedia.org/r/422632 (owner: Dzahn)
[00:45:17] (CR) Dzahn: [C: -1] "not until at least a week after May 25th, the scheduled switch day" [puppet] - https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) (owner: Dzahn)
[00:47:35] (CR) Dzahn: "needs the SSH key from https://phabricator.wikimedia.org/T194445#4206012" [puppet] - https://gerrit.wikimedia.org/r/433025 (https://phabricator.wikimedia.org/T194445) (owner: Herron)
[00:56:19] (CR) Dzahn: [C: -1] "needs manual rebase" [puppet] - https://gerrit.wikimedia.org/r/400241 (owner: Dzahn)
[00:56:36] (PS1) Brian Wolff: Log "security" channel at 'debug' level. [mediawiki-config] - https://gerrit.wikimedia.org/r/433095
[01:11:16] Operations, Wikimedia-Mailing-lists: wikitech-l is mangling my PGP/MIME emails, causing signature validation to fail - https://phabricator.wikimedia.org/T186311#4206345 (Platonides) Maybe you could try not ending the message with "-- Legoktm" ? Just prepending a space would do. Lines starting with a dash...
[01:11:49] I'm going to do a security related deploy
[01:12:11] (CR) Brian Wolff: [C: 2] Log "security" channel at 'debug' level. [mediawiki-config] - https://gerrit.wikimedia.org/r/433095 (owner: Brian Wolff)
[01:13:24] (Merged) jenkins-bot: Log "security" channel at 'debug' level. [mediawiki-config] - https://gerrit.wikimedia.org/r/433095 (owner: Brian Wolff)
[01:13:40] (CR) jenkins-bot: Log "security" channel at 'debug' level.
[mediawiki-config] - https://gerrit.wikimedia.org/r/433095 (owner: Brian Wolff)
[01:17:30] !log bawolff@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/433095/ log security channel (duration: 01m 02s)
[01:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:04] !log bawolff@tin Started scap: Backport https://gerrit.wikimedia.org/r/#/c/433096/ - log js loads of unregistered user js subpages
[01:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:40] Operations, Wikimedia-Mailing-lists: wikitech-l is mangling my PGP/MIME emails, causing signature validation to fail - https://phabricator.wikimedia.org/T186311#4206365 (Legoktm) >>! In T186311#4206345, @Platonides wrote: > Maybe you could try not ending the message with "-- Legoktm" ? Just prepending a...
[02:12:08] (CR) Dzahn: "domains handling email can be seen in the list modules/role/files/exim/wikimedia_domains" [dns] - https://gerrit.wikimedia.org/r/429874 (https://phabricator.wikimedia.org/T193408) (owner: Dzahn)
[02:33:32] !log bawolff@tin Finished scap: Backport https://gerrit.wikimedia.org/r/#/c/433096/ - log js loads of unregistered user js subpages (duration: 56m 27s)
[02:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:50] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.3) (duration: 06m 32s)
[03:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:10:59] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue May 15 03:10:59 UTC 2018 (duration 7m 11s)
[03:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:35:55] (PS1) Zhuyifei1999: profile::docker::flannel: Use systemd::service [puppet] - https://gerrit.wikimedia.org/r/433101 (https://phabricator.wikimedia.org/T190893)
[05:05:48] <_joe_> zhuyifei1999_: thanks for working on this
[05:05:53] np
[05:06:05] <_joe_> I wasn't ignoring you last week, I was just in/out of bed with the flu
[05:06:17] yeah ik
[05:06:23] you told me
[05:06:37] <_joe_> I didn't even remember :D
[05:06:59] <_joe_> I see you preserved yuvi's love for docker in the comments :D
[05:07:14] lol
[05:09:39] I wonder if I can test this without blowing up toolforge
[05:09:39] (CR) Giuseppe Lavagetto: [C: -1] profile::docker::flannel: Use systemd::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/433101 (https://phabricator.wikimedia.org/T190893) (owner: Zhuyifei1999)
[05:10:23] <_joe_> Interesting question, I *think* we have a puppet compiler for labs, but I would ask in #-cloud about that
[05:10:50] <_joe_> in general, I'd advise to merge it only after having disabled puppet across all of toolsforge
[05:11:00] <_joe_> and then test one server at a time
[05:11:13] oops ^ copy-pasted the wrong thing
[05:11:18] <_joe_> eheh np
[05:11:38] <_joe_> I won't merge the patch btw, I'm too rusty on toolsforge nowadays not to risk screwing it up
[05:12:08] <_joe_> when we created the kubernetes cluster there, I was way more familiar with it
[05:12:42] (PS2) Zhuyifei1999: profile::docker::flannel: Use systemd::service [puppet] - https://gerrit.wikimedia.org/r/433101 (https://phabricator.wikimedia.org/T190893)
[05:14:23] I guess I could just ssh into every worker and disabling it
[05:14:36] * zhuyifei1999_ has no clush access, for some reason :(
[05:25:11] <_joe_> zhuyifei1999_: let's wait for arturo maybe?
:)
[05:25:42] andrew told me to make a project puppetmaster on toolsbeta to test it
[05:26:02] so I'm doing that right now (finding my 2fa keys)
[05:26:57] (CR) Giuseppe Lavagetto: [C: 1] profile::docker::flannel: Use systemd::service [puppet] - https://gerrit.wikimedia.org/r/433101 (https://phabricator.wikimedia.org/T190893) (owner: Zhuyifei1999)
[05:27:56] Operations, ops-eqiad: Degraded RAID on db1065 - https://phabricator.wikimedia.org/T194698#4206565 (Marostegui)
[05:28:32] Operations, ops-eqiad, DBA: Degraded RAID on db1065 - https://phabricator.wikimedia.org/T194698#4206146 (Marostegui) a:Cmjohnson Please @Cmjohnson proceed and change the disk
[05:36:45] (PS1) Marostegui: s2,s6.hosts: Add db1120 [software] - https://gerrit.wikimedia.org/r/433105
[05:38:43] Operations, Wikimedia-Mailing-lists: Archive "wiki-offline-reader-l" - https://phabricator.wikimedia.org/T194575#4206571 (Kelson) @Dzahn @Herron Could you please in addition remove my email address to the list of owner?
[05:41:11] (CR) Marostegui: [C: 2] s2,s6.hosts: Add db1120 [software] - https://gerrit.wikimedia.org/r/433105 (owner: Marostegui)
[05:41:31] Operations, ops-eqiad, netops, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4206574 (Joe) I would suggest we do NOT disable/depool anything but the obvious outlier in the databases (we already know that timeouts on the databases woul...
[05:41:58] (Merged) jenkins-bot: s2,s6.hosts: Add db1120 [software] - https://gerrit.wikimedia.org/r/433105 (owner: Marostegui)
[05:49:58] Operations, ops-eqiad, netops, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4206575 (Joe) Things to watch out for: - All lvs primaries for eqiad are in row C - row C includes 30 appservers - conf1002 is in row C (etcd connections wi...
[06:08:23] Operations, Puppet, Cloud-Services, Traffic, and 2 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724#4206611 (Joe)
[06:13:29] (PS1) Elukey: role::cache::misc: add Varnish config for turnilo.w.o [puppet] - https://gerrit.wikimedia.org/r/433112 (https://phabricator.wikimedia.org/T194427)
[06:14:19] am I the only one getting a 400 for https://gerrit.wikimedia.org/r/433112 ?
[06:15:06] ahahah nono early morning and I am still sleep, nevermind
[06:15:23] (had custom headers set to test turnilo)
[06:18:34] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11202/cp1045.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/433112 (https://phabricator.wikimedia.org/T194427) (owner: Elukey)
[06:22:36] (PS1) Elukey: Add turnilo.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/433118 (https://phabricator.wikimedia.org/T194427)
[06:26:56] (CR) Giuseppe Lavagetto: mcrouter: add support for listening on the ssl port (1 comment) [puppet] - https://gerrit.wikimedia.org/r/431736 (https://phabricator.wikimedia.org/T192370) (owner: Giuseppe Lavagetto)
[06:29:33] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/vim/vimrc.local]
[06:32:23] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_ipmi_sensor],File[/usr/lib/nagios/plugins/check_sysctl]
[06:54:06] (CR) Muehlenhoff: [C: 1] "Three minor remarks, but looks good to me."
(3 comments) [puppet] - https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[06:55:43] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:58:34] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:32:53] PROBLEM - Device not healthy -SMART- on labstore1003 is CRITICAL: cluster=labsnfs device=megaraid,13 instance=labstore1003:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1003&var-datasource=eqiad%2520prometheus%252Fops
[08:13:08] Operations, Wikimedia-Mailing-lists: Archive "wiki-offline-reader-l" - https://phabricator.wikimedia.org/T194575#4206728 (Kelson) Resolved>Open I just go an email with subject "482 Wiki-offline-reader-l moderator request(s) waiting". Please remove me from the list of owners (see my last comment).
[08:42:35] !log stop db2068 for reimage
[08:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:53] PROBLEM - Memory correctable errors -EDAC- on cp1068 is CRITICAL: 3 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=cp1068&var-datasource=eqiad%2520prometheus%252Fops
[09:03:01] !log stop db2061 for reimage
[09:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:52] (PS1) Elukey: Add the community extension for Parquet [debs/druid] (debian) - https://gerrit.wikimedia.org/r/433131 (https://phabricator.wikimedia.org/T193712)
[09:07:44] PROBLEM - Device not healthy -SMART- on db2067 is CRITICAL: cluster=mysql device=cciss,9 instance=db2067:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2067&var-datasource=codfw%2520prometheus%252Fops
[09:08:39] ^we may need better coordination between SMART error monitoring and RAID monitoring
[09:12:58]
Operations, ops-codfw, DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4206835 (jcrespo) Resolved>Open Potential SMART errors on that device. ``` PROBLEM - Device not healthy -SMART- on db2067 is CRITICAL: cluster=mysql device=cciss,9 instance=db2067:9100 job=node si...
[09:15:26] jynus: agreed, basically double reporting ?
[09:15:56] I don't think in this case, but I think it happened at others
[09:18:42] (PS1) Joal: Add output-format parameter to sqooq cron [puppet] - https://gerrit.wikimedia.org/r/433133
[09:18:45] elukey: --^
[09:19:33] joal: sqoop right? :D
[09:20:02] :)
[09:20:29] I can change the cr's title from gerrit
[09:20:39] done elukey
[09:20:47] (PS2) Elukey: Add output-format parameter to sqoop cron [puppet] - https://gerrit.wikimedia.org/r/433133 (owner: Joal)
[09:20:51] (PS3) Joal: Add output-format parameter to sqoop cron [puppet] - https://gerrit.wikimedia.org/r/433133
[09:20:54] ahah
[09:20:55] Arf
[09:20:56] :)
[09:21:31] (CR) Elukey: [C: 2] Add output-format parameter to sqoop cron [puppet] - https://gerrit.wikimedia.org/r/433133 (owner: Joal)
[09:21:40] !log upgrading app server canaries to HHVM 3,18.5+dfsg-1+wmf8+deb9u1
[09:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:48] !log stop and restart db2088 for upgrade
[09:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:30] Hi ops-team - deploying refinery onto the hadoop cluster
[09:28:35] joal allow me to suggest using !
log to best communicate those actions :-)
[09:28:48] Hi jynus
[09:29:05] jynus: scap will log - I'm just pinging as discussed the other day :)
[09:29:14] ah, ok, thanks
[09:29:21] np :)
[09:30:54] !log joal@tin Started deploy [analytics/refinery@b2f4c3c]: Regular weekly deploy
[09:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:23] PROBLEM - MegaRAID on db1053 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[09:31:24] ACKNOWLEDGEMENT - MegaRAID on db1053 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T194733
[09:31:29] Operations, ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T194733#4206922 (ops-monitoring-bot)
[09:33:19] Operations, ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T194733#4206922 (jcrespo) Do not take any action, db1053 is going to be decommissioned soon.
[09:33:32] Operations, ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T194733#4206937 (jcrespo)
[09:35:01] Operations, Fundraising-Backlog, Traffic, fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4206939 (CCogdill_WMF) Thanks Casey! I'm waiting for a reply. Just bumped it, FYI.
[09:36:14] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:36:32] !log joal@tin Finished deploy [analytics/refinery@b2f4c3c]: Regular weekly deploy (duration: 05m 38s)
[09:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:04] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:37:14] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:39:21] <_joe_> akosiaris: any idea what's happening ^^ ?
[09:39:23] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:40:26] could it be related to the deployment?
[09:40:43] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:40:43] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:41:34] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:41:43] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:42:01] <_joe_> jynus: nope, analytics doesn't have anything on kubernetes
[09:42:11] <_joe_> but I figured alex could be doing something there
[09:42:14] I also was doing some upgtrades
[09:42:28] but I also guess kubernetes has not dependency on mysql
[09:42:52] <_joe_> not etcd, no
[09:43:00] <_joe_> probably some consensus troubles
[09:43:04] <_joe_> quickly recovered
[09:43:17] <_joe_> meh, we really need to move to 3.x there
[09:45:12] I am going to do another upgrade, will see if it happens again
[09:47:53] !log stop and restart db2091 for upgrade
[09:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:25] !log upgrading API server canaries to HHVM 3,18.5+dfsg-1+wmf8+deb9u1
[09:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:24] (CR) Ema: [C: 1] role::cache::misc: add Varnish config for turnilo.w.o [puppet] - https://gerrit.wikimedia.org/r/433112 (https://phabricator.wikimedia.org/T194427) (owner: Elukey)
[10:04:23] (CR) Ema: [C: 1] Add turnilo.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/433118 (https://phabricator.wikimedia.org/T194427) (owner: Elukey)
[10:07:11] !log installing php5 security updates on trusty
[10:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:24] (PS2) Elukey: role::cache::misc: add Varnish config for turnilo.w.o [puppet] - https://gerrit.wikimedia.org/r/433112 (https://phabricator.wikimedia.org/T194427)
[10:12:59] (CR) Elukey: [C: 2] role::cache::misc: add Varnish config for turnilo.w.o [puppet] - https://gerrit.wikimedia.org/r/433112 (https://phabricator.wikimedia.org/T194427) (owner: Elukey)
[10:15:06] !log installing uwsgi security update on graphite servers in eqiad
[10:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:52] !log stop db2065 for reimage
[10:16:55] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:19] (CR) Elukey: [C: 2] Add turnilo.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/433118 (https://phabricator.wikimedia.org/T194427) (owner: Elukey)
[10:36:04] (PS1) MarcoAurelio: security: remove dangerous unused groups at mlwik{tionary|isource} [mediawiki-config] - https://gerrit.wikimedia.org/r/433136 (https://phabricator.wikimedia.org/T152296)
[10:43:02] !log joal@tin Started deploy [analytics/refinery@25abeec]: Fix for regular weekly deploy
[10:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:09] (PS1) Urbanecm: New throttle rule for University of Edinburgh [mediawiki-config] - https://gerrit.wikimedia.org/r/433138
[10:45:32] (PS2) Urbanecm: New throttle rule for University of Edinburgh [mediawiki-config] - https://gerrit.wikimedia.org/r/433138 (https://phabricator.wikimedia.org/T194666)
[10:46:02] !log stop db2066 for reimage
[10:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:49] (CR) Rxy: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/433136 (https://phabricator.wikimedia.org/T152296) (owner: MarcoAurelio)
[10:48:38] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:49:37] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:49:47] !log joal@tin Finished deploy [analytics/refinery@25abeec]: Fix for regular weekly deploy (duration: 06m 45s)
[10:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:47] RECOVERY - Request latencies on argon is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:53:03] (PS1) Elukey: role::aqs: refactor druid's configuration [puppet] - https://gerrit.wikimedia.org/r/433140
[10:53:47] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:57:39] !log stop db2067 for reimage
[10:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:02] (PS2) Elukey: role::aqs: refactor druid's configuration [puppet] - https://gerrit.wikimedia.org/r/433140
[11:00:17] Operations, Commons, Wikimedia-Site-requests: Please upload large file to Wikimedia Commons - https://phabricator.wikimedia.org/T192751#4207278 (MarcoAurelio) For deployers the instructions seems to be at https://wikitech.wikimedia.org/wiki/Uploading_large_files
[11:04:09] Operations, Commons, Wikimedia-Site-requests: Please upload large file to Wikimedia Commons - https://phabricator.wikimedia.org/T192751#4207293 (Goryeo) >>! In T192751#4207278, @MarcoAurelio wrote: > For deployers the instructions seems to be at https://wikitech.wikimedia.org/wiki/Uploading_large_fil...
[11:08:26] (PS3) Elukey: role::aqs: refactor druid's configuration [puppet] - https://gerrit.wikimedia.org/r/433140
[11:09:58] Operations, Commons, Wikimedia-Site-requests: Please upload large file to Wikimedia Commons - https://phabricator.wikimedia.org/T192751#4207300 (Zoranzoki21) >>! In T192751#4207293, @Goryeo wrote: >>>! In T192751#4207278, @MarcoAurelio wrote: >> For deployers the instructions seems to be at https://w...
[11:13:00] (PS1) Jcrespo: mariadb: Move m3 backups from db1053 to db1072 [puppet] - https://gerrit.wikimedia.org/r/433141 (https://phabricator.wikimedia.org/T194634)
[11:13:48] (CR) Jcrespo: [C: 2] mariadb: Move m3 backups from db1053 to db1072 [puppet] - https://gerrit.wikimedia.org/r/433141 (https://phabricator.wikimedia.org/T194634) (owner: Jcrespo)
[11:14:34] (PS4) Elukey: role::aqs: refactor druid's configuration [puppet] - https://gerrit.wikimedia.org/r/433140
[11:15:25] (PS1) Arturo Borrero Gonzalez: toollabs: add mono_external class [puppet] - https://gerrit.wikimedia.org/r/433142 (https://phabricator.wikimedia.org/T194665)
[11:17:27] (PS2) Arturo Borrero Gonzalez: toollabs: add mono_external class [puppet] - https://gerrit.wikimedia.org/r/433142 (https://phabricator.wikimedia.org/T194665)
[11:21:55] (PS5) Elukey: role::aqs: refactor druid's configuration [puppet] - https://gerrit.wikimedia.org/r/433140
[11:23:42] Operations, Commons, Wikimedia-Site-requests: Please upload large file to Wikimedia Commons - https://phabricator.wikimedia.org/T192751#4207342 (Urbanecm) >>! In T192751#4207277, @Goryeo wrote: >>>! In T192751#4207156, @Urbanecm wrote: >> It is converted, it needs to be //uploaded//. This needs someo...
[11:24:59] (PS6) Elukey: role::aqs: refactor druid's configuration [puppet] - https://gerrit.wikimedia.org/r/433140
[11:26:30] any merciful deployer who could https://phabricator.wikimedia.org/T192751 and stop the drama, please?
[11:32:01] Operations, DC-Ops, Traffic, monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4207372 (fgiunchedi) >>! In T183177#4088202, @BBlack wrote: > See updates in T190540 , quite a few codfw hosts have SEL entries for uncorrectable ECC errors that...
[11:37:38] (PS7) Elukey: role::aqs: refactor druid's configuration [puppet] - https://gerrit.wikimedia.org/r/433140
[11:48:38] (PS8) Elukey: role::aqs: refactor druid's configuration [puppet] - https://gerrit.wikimedia.org/r/433140
[11:52:00] (PS1) Filippo Giunchedi: base: alert on correctable errors over a period of time [puppet] - https://gerrit.wikimedia.org/r/433143 (https://phabricator.wikimedia.org/T183177)
[11:52:42] (PS9) Elukey: role::aqs: refactor druid's configuration [puppet] - https://gerrit.wikimedia.org/r/433140
[11:54:52] (PS10) Elukey: role::aqs: refactor druid's configuration [puppet] - https://gerrit.wikimedia.org/r/433140
[11:57:38] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11212/aqs1004.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/433140 (owner: Elukey)
[12:00:35] (PS11) Elukey: role::aqs: refactor druid's configuration [puppet] - https://gerrit.wikimedia.org/r/433140
[12:01:14] (CR) Elukey: [C: 2] role::aqs: refactor druid's configuration [puppet] - https://gerrit.wikimedia.org/r/433140 (owner: Elukey)
[12:09:18] (PS1) Elukey: role::aqs: follow up after druid's config refactoring [puppet] - https://gerrit.wikimedia.org/r/433148
[12:12:33] (CR) Elukey: [C: 2] role::aqs: follow up after druid's config refactoring [puppet] - https://gerrit.wikimedia.org/r/433148 (owner: Elukey)
[12:14:18] !log uploaded intel-microcode 20180425 for jessie-wikimedia/stretch-wikimedia
[12:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:11] Operations, Patch-For-Review: Re-add intel-microcode - https://phabricator.wikimedia.org/T127825#4207469 (MoritzMuehlenhoff) We have two clusters which need updated microcode to provide support for the new IBPB instruction needed to secure KVM instances against Spectre. In addition to that keeping the mi...
[12:32:47] (PS1) Samwilson: Deploy GlobalPreferences to test wikis and mw.org (forth time) [mediawiki-config] - https://gerrit.wikimedia.org/r/433149
[12:34:17] (PS2) Samwilson: Deploy GlobalPreferences to test wikis and mw.org (forth time) [mediawiki-config] - https://gerrit.wikimedia.org/r/433149 (https://phabricator.wikimedia.org/T190425)
[12:42:30] !log stop db2060 for reimage
[12:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:21] (PS1) Jcrespo: mariadb: Disable reimage of db206* host, reimage db205* to stretch [puppet] - https://gerrit.wikimedia.org/r/433151
[12:53:20] (CR) Jcrespo: [C: 2] mariadb: Disable reimage of db206* host, reimage db205* to stretch [puppet] - https://gerrit.wikimedia.org/r/433151 (owner: Jcrespo)
[12:59:09] (PS2) Andrew Bogott: openstack: move nova-api and nova-network functions to labnet1002 [puppet] - https://gerrit.wikimedia.org/r/430118 (https://phabricator.wikimedia.org/T193579) (owner: Rush)
[12:59:50] !log stopping puppet on labnet1001 and 1002, silencing icinga for T193579
[12:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:55] T193579: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579
[13:00:04] andrewbogott and chasemp: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for WMCS network maintenance -- no SWAT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180515T1300).
[13:00:04] No GERRIT patches in the queue for this window AFAICS.
[13:01:06] (CR) Andrew Bogott: [C: 2] openstack: move nova-api and nova-network functions to labnet1002 [puppet] - https://gerrit.wikimedia.org/r/430118 (https://phabricator.wikimedia.org/T193579) (owner: Rush)
[13:07:42] !log stopping nodepool and puppet on labnodepool1001 for T193579
[13:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:46] T193579: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579
[13:09:27] !log disable puppet for all openstack things in eqiad
[13:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:54] Operations, ops-eqiad, Cloud-VPS, Patch-For-Review: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4207600 (chasemp)
[13:47:59] (PS1) Andrew Bogott: openstack: move nova-api and nova-network functions to labnet1001 [puppet] - https://gerrit.wikimedia.org/r/433153 (https://phabricator.wikimedia.org/T193579)
[13:48:22] (CR) Andrew Bogott: [C: -2] "Saving this to merge during a maintenance window" [puppet] - https://gerrit.wikimedia.org/r/433153 (https://phabricator.wikimedia.org/T193579) (owner: Andrew Bogott)
[13:48:24] (CR) Elukey: [C: 2] Kafka: increase group.initial.rebalance.delay.ms to 10s. [puppet] - https://gerrit.wikimedia.org/r/432615 (https://phabricator.wikimedia.org/T189618) (owner: Ppchelko)
[13:48:28] (PS4) Elukey: Kafka: increase group.initial.rebalance.delay.ms to 10s. [puppet] - https://gerrit.wikimedia.org/r/432615 (https://phabricator.wikimedia.org/T189618) (owner: Ppchelko)
[13:48:56] mobrovac: merging --^ first, then we coud roll restart main codfw and verify that everything works as expected ?
[13:49:25] elukey: don't we need to roll-restart for the zk move anyway?
[13:49:49] not for codfw (it uses conf2*)
[13:50:09] ah ok
[13:50:13] sure, let's do it then
[13:50:24] super
[13:50:56] !log roll restart of kafka main codfw (kafka200[1-3]) to pick up group.initial.rebalance.delay.ms = 10s
[13:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:08] restarting 2001 now
[13:55:08] I am going to force kafka preferred-replica-election after metrics stabilize otherwise it will take a bit more for the rebalance to happen
[13:55:47] i can also just restart changeprop in codfw
[13:56:05] let's do it when the roll restart is finished to test ok?
[13:56:24] +1
[13:58:54] RECOVERY - Memory correctable errors -EDAC- on kafka1023 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=kafka1023&var-datasource=eqiad%2520prometheus%252Fops
[13:59:13] RECOVERY - Memory correctable errors -EDAC- on wtp2020 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw%2520prometheus%252Fops
[13:59:23] RECOVERY - Memory correctable errors -EDAC- on wtp2013 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw%2520prometheus%252Fops
[14:00:18] !log rebooting labnet1001
[14:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:56] kafka2002 done
[14:01:09] (so happy that I have to restart ALL the kafka brokers today)
[14:01:29] (PS2) Ottomata: Migrate eventbus camus job to Kafka jumbo [puppet] - https://gerrit.wikimedia.org/r/419493 (https://phabricator.wikimedia.org/T189713)
[14:02:23] (CR) Ottomata: [C: 2] Migrate eventbus camus job to Kafka jumbo [puppet] - https://gerrit.wikimedia.org/r/419493 (https://phabricator.wikimedia.org/T189713) (owner: Ottomata)
[14:02:35] haha
[14:05:27] there are a couple of msgs of the http proxy service being unable to deliver events
[14:05:32]
just 2 logs so far though
[14:06:54] 2003 restarted now
[14:08:04] ok, i'll restart CP and CP4JQ in codfw now and let's see
[14:09:12] ack
[14:09:54] !log mobrovac@tin Started restart [changeprop/deploy@e468d8e]: Restart after Kafka settings change
[14:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:10] Operations, ops-eqiad, Cloud-VPS, Patch-For-Review: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4207712 (chasemp)
[14:10:29] !log mobrovac@tin Started restart [cpjobqueue/deploy@58935d5]: Restart after Kafka settings change
[14:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:07] ok done
[14:11:55] mobrovac: just curious which settings did you change?
[14:12:24] ottomata: the group.initial.rebalance.delay.ms
[14:12:26] to 10s
[14:12:29] ottomata: it's the kafka rebalance change, i didn't change anything on the CP side
[14:12:32] hm, ya but that's a broker setting, no?
[14:12:38] don't think client needs restart for that [14:12:44] ah no we wanted to test it [14:12:44] :) [14:12:46] ahhh [14:12:47] cool :) [14:12:49] yes, but we restarted CP to force a rebalance [14:13:39] all right all seems good, starting with the zk cluster changes [14:14:00] !log swap conf1001 with conf1004 in the zookeeper main eqiad's config + roll restart of the service [14:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:10] (03PS7) 10Elukey: Swap conf1001 with conf1004 in Zookeeper main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/425238 (https://phabricator.wikimedia.org/T182924) [14:15:03] (03CR) 10Elukey: [C: 032] Swap conf1001 with conf1004 in Zookeeper main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/425238 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [14:16:58] woot [14:18:50] ok zookeeper up on conf1004, tried to connect via zkCli and it shows correctly main-eqiad's content [14:19:47] !log temporarily disabling puppet on analytics1003 to run refine-eventbus after jumbo based camus eventbus import finishes [14:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:07] applying puppet to all the conf1* nodes so they'll get the new config [14:22:25] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4207782 (10Ottomata) @bblack would you mind if I assigned this to someone on your team? [14:22:59] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4207783 (10BBlack) a:05Ottomata>03Vgutierrez Done :) [14:23:31] stopped and masked zookeeper on conf1001 [14:25:09] new cluster up and running, conf1004 is the new leader [14:25:13] everything seems working fine [14:25:57] yay [14:26:11] nice [14:26:16] elukey: that's it? 
do we have more swappings to do? [14:27:04] mobrovac: for today no, the docs suggest only one at the time [14:27:11] kk [14:27:23] now I'd proceed with kafka analytics and then kafka main [14:27:29] i will restart CP and CP4JQ, could you please restart the proxy service? [14:27:48] hm actually no, not needed [14:28:01] !log mobrovac@tin Started restart [cpjobqueue/deploy@58935d5]: Restart after Kafka settings change [14:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:27] !log mobrovac@tin Started restart [changeprop/deploy@e468d8e]: Restart after Kafka settings change [14:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:30] ya hopefully none of the clients talk to ZK anymore, so they shouldn't need restart :) [14:28:47] (due to zk change) [14:29:00] PROBLEM - Host labnet1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:29:17] mmm I am seeing the following in dmesg, not related to this upgrade [14:29:20] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427) [14:29:28] I missed it, I hope it is only a setting to tune [14:31:54] (03PS7) 10Ema: numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) [14:32:28] (03CR) 10jerkins-bot: [V: 04-1] numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema) [14:33:10] RECOVERY - Host labnet1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [14:35:26] 10Operations, 10DBA: https://tendril.wikimedia.org/ IPv6 doesn't work - https://phabricator.wikimedia.org/T194766#4207806 (10BBlack) p:05Triage>03Normal [14:35:32] (still reading, going to restart kafka soon) [14:35:52] ah also I think I missed monitoring config for 1004 [14:36:24] ah no different roles right [14:39:02] 10Operations, 
10DBA: https://tendril.wikimedia.org/ IPv6 doesn't work - https://phabricator.wikimedia.org/T194766#4207818 (10jcrespo) @bblack do you have a pointer to puppet of other standalone service using the correct configuration? [14:43:06] (03PS1) 10Elukey: role::prometheus::ops: add new zookeeper hosts' monitoring [puppet] - 10https://gerrit.wikimedia.org/r/433157 (https://phabricator.wikimedia.org/T182924) [14:43:39] 10Operations, 10Wikimedia-Mailing-lists: Archive "wiki-offline-reader-l" - https://phabricator.wikimedia.org/T194575#4207847 (10Dzahn) 05Open>03Resolved Done. I changed the admin address to no-reply@wikimedia.org. [14:43:53] godog: ---^ is it an acceptable solution for zookeeper's metrics for the next days? [14:44:15] elukey: taking a look [14:44:35] oh man second time today gerrit is taking forever to load [14:46:13] mobrovac: we have some issues with Burrow and mirror maker atm [14:46:24] (03CR) 10Filippo Giunchedi: "LGTM as a temporary thing" [puppet] - 10https://gerrit.wikimedia.org/r/433157 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [14:46:36] godog: <3 [14:46:44] ack elukey, thnx [14:46:49] (03CR) 10Elukey: [C: 032] role::prometheus::ops: add new zookeeper hosts' monitoring [puppet] - 10https://gerrit.wikimedia.org/r/433157 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [14:47:22] (03PS1) 10Dzahn: enable IPv6 for tendril [puppet] - 10https://gerrit.wikimedia.org/r/433159 (https://phabricator.wikimedia.org/T194766) [14:47:41] mobrovac: same problem as last time, when burrow restarts it doesn't save its previous state and re-reads everything from __consumer_topics [14:48:01] err __consumer_groups, don't remember the exact name :) [14:48:54] (03PS1) 10Muehlenhoff: Allow enabling microcode updates gradually [puppet] - 10https://gerrit.wikimedia.org/r/433160 (https://phabricator.wikimedia.org/T127825) [14:49:44] (03CR) 10jerkins-bot: [V: 04-1] Allow enabling microcode updates gradually [puppet] - 
10https://gerrit.wikimedia.org/r/433160 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [14:49:55] 10Operations, 10DBA, 10Patch-For-Review: https://tendril.wikimedia.org/ IPv6 doesn't work - https://phabricator.wikimedia.org/T194766#4207806 (10Dzahn) @jcrespo ^ Add "interface::add_ip6_mapped { 'main': }" in the role class to apply to both hosts at once. That should be all that is needed. comparison: ~/... [14:51:39] (03CR) 10Jcrespo: [C: 032] enable IPv6 for tendril [puppet] - 10https://gerrit.wikimedia.org/r/433159 (https://phabricator.wikimedia.org/T194766) (owner: 10Dzahn) [14:51:44] (03PS2) 10Jcrespo: enable IPv6 for tendril [puppet] - 10https://gerrit.wikimedia.org/r/433159 (https://phabricator.wikimedia.org/T194766) (owner: 10Dzahn) [14:52:10] all right zookeeper on conf1004 has metrics now :) [14:55:06] ottomata: I'd proceed if you are ok [14:55:43] kafka analytics, kafka main, kafka jumbo, hadoop [14:56:08] (03CR) 10Jcrespo: [C: 032] "Puppet gets stuck at:" [puppet] - 10https://gerrit.wikimedia.org/r/433159 (https://phabricator.wikimedia.org/T194766) (owner: 10Dzahn) [14:56:11] or mobrovac, if you want we can go directly with kafka main so you'll be free [14:56:31] wuh sorry what's the question? [14:56:45] go with what for kafka main? [14:56:59] mobrovac: yeah or wait for another kafka cluster to be restarted first [14:57:23] i'm still confused as to what the question is about [14:57:36] jynus: is it really stuck? that looks normal.. 
if it continues after that [14:58:00] maybe you got disconnected from the host after it brought up the new interface [14:58:13] mobrovac: if you want me to roll restart a less important cluster like the analytics one, verify that all is ok and then proceed with main, or if you want to do it now so you'll be free :) [14:58:21] no, it is stuck because I cannot run puppet from another connection [14:58:32] (03PS1) 10Ottomata: Alert if EventStreams recentchange endpoint has no messages [puppet] - 10https://gerrit.wikimedia.org/r/433161 (https://phabricator.wikimedia.org/T174493) [14:58:34] oh ok elukey, haha, it took me a while :) [14:58:39] elukey: yeah, go with main [14:58:40] jynus: i can connect to it fine and it looks good: [14:58:41] 208.80.154.82/26 [14:58:48] it should be fine [14:58:52] 2620:0:861:3:208:80:154:82 [14:58:52] proceed!@ :) [14:58:54] ^ mapped [14:58:56] mutante: try running puppet [14:59:10] mobrovac: ack! So just ran puppet on kafka100[1-3], proceeding with the roll restart [14:59:17] kk [14:59:51] !log roll restart of kafka daemons on kafka100[1-3] to pick up new zookeeper settings and group.initial.rebalance.delay.ms = 10s [14:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:06] it worked the second time, but I had to cancel the puppet run [15:00:14] jynus: says it's already running. want me to wait or delete the lock file.. oh.. ok! 
[15:00:18] ok [15:00:28] i haven't had this issue [15:00:32] when doing that [15:00:43] it happened on both servers [15:00:57] (03PS2) 10Ottomata: Alert if EventStreams recentchange endpoint has no messages [puppet] - 10https://gerrit.wikimedia.org/r/433161 (https://phabricator.wikimedia.org/T174493) [15:01:01] if it was an automatic run, probably it would have piled up [15:01:11] looks good from here, though: https://puppetboard.wikimedia.org/report/dbmonitor1001.wikimedia.org/afa3bda906c1d75846df5bc3ed2eb7df26b63c2e [15:01:29] (03CR) 10jerkins-bot: [V: 04-1] Alert if EventStreams recentchange endpoint has no messages [puppet] - 10https://gerrit.wikimedia.org/r/433161 (https://phabricator.wikimedia.org/T174493) (owner: 10Ottomata) [15:01:31] (03PS2) 10Muehlenhoff: Allow enabling microcode updates gradually [puppet] - 10https://gerrit.wikimedia.org/r/433160 (https://phabricator.wikimedia.org/T127825) [15:01:56] well, both have the mapped address now. so that's good [15:02:07] mobrovac: kafka1001 restarted [15:02:59] [bast2001:~] $ ping6 tendril.wikimedia.org [15:03:07] 64 bytes from dbmonitor1001.wikimedia.org [15:03:20] 10Operations, 10DBA, 10Patch-For-Review: https://tendril.wikimedia.org/ IPv6 doesn't work - https://phabricator.wikimedia.org/T194766#4207932 (10jcrespo) a:03BBlack Please, recheck. [15:03:28] elukey: looking good on our side [15:03:44] (03PS3) 10Ottomata: Alert if EventStreams recentchange endpoint has no messages [puppet] - 10https://gerrit.wikimedia.org/r/433161 (https://phabricator.wikimedia.org/T174493) [15:03:52] 10Operations, 10DBA, 10Patch-For-Review: https://tendril.wikimedia.org/ IPv6 doesn't work - https://phabricator.wikimedia.org/T194766#4207940 (10jcrespo) @Dzhan, thanks for the patch. 
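The "mapped" address pasted above (208.80.154.82 becoming 2620:0:861:3:208:80:154:82) looks like it reuses each decimal IPv4 octet as one group of the IPv6 host part. A minimal Python sketch of that derivation follows. The function name `mapped_ip6` is hypothetical, the /64 prefix is an assumption read off the pasted address, and this is not claimed to be how `interface::add_ip6_mapped` is actually implemented.

```python
import ipaddress


def mapped_ip6(ipv4: str, prefix: str) -> ipaddress.IPv6Address:
    """Embed the decimal octets of `ipv4` as the last four groups of `prefix`.

    Hypothetical helper; the prefix is assumed to be a /64.
    """
    # Validate the IPv4 address and normalise its dotted form.
    octets = str(ipaddress.IPv4Address(ipv4)).split(".")
    network = ipaddress.IPv6Network(prefix)
    host_bits = 0
    for octet in octets:
        # "208" is read as hex 0x208, so it prints back as the group "208".
        host_bits = (host_bits << 16) | int(octet, 16)
    return ipaddress.IPv6Address(int(network.network_address) | host_bits)


print(mapped_ip6("208.80.154.82", "2620:0:861:3::/64"))
# prints 2620:0:861:3:208:80:154:82
```

Every decimal digit is also a valid hex digit, which is why the octets can be copied over unchanged and the result stays readable next to the IPv4 address.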
[15:06:10] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/11215/scb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/433161 (https://phabricator.wikimedia.org/T174493) (owner: 10Ottomata) [15:06:59] mobrovac: 1002 done [15:09:21] hm i see 400s from the proxy service [15:09:27] they seem legit though [15:10:26] 1003 is the only one left (waiting a bit now for metrics to recover) [15:11:06] PROBLEM - Host labnet1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:32] mobrovac: I am stopping now waiting for your green light [15:11:57] kk elukey we can go with 1003 [15:12:27] super, proceeding :) [15:13:40] (03PS1) 10Giuseppe Lavagetto: Add the capability to check for deprecated defines [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/433162 (https://phabricator.wikimedia.org/T194724) [15:13:42] (03PS1) 10Giuseppe Lavagetto: Check for all the available variants of a hiera call [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/433163 [15:13:58] (03CR) 10jerkins-bot: [V: 04-1] Add the capability to check for deprecated defines [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/433162 (https://phabricator.wikimedia.org/T194724) (owner: 10Giuseppe Lavagetto) [15:14:00] (03CR) 10jerkins-bot: [V: 04-1] Check for all the available variants of a hiera call [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/433163 (owner: 10Giuseppe Lavagetto) [15:14:26] <_joe_> I hate you rubocop [15:16:46] RECOVERY - Host labnet1001 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [15:18:16] mobrovac: 1003 restarted [15:18:53] and also just forced a replica election [15:19:49] 10Operations, 10DBA: https://tendril.wikimedia.org/ IPv6 doesn't work - https://phabricator.wikimedia.org/T194766#4207993 (10jcrespo) [15:20:46] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:20:48] !log roll restart of Kafka Analytics to pick up new zookeeper settings [15:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:21] huh again 400s from the proxy service [15:22:29] (03CR) 10Addshore: Prepare Lexeme config for test.wikidata.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433145 (https://phabricator.wikimedia.org/T194250) (owner: 10Jakob) [15:22:48] (03PS4) 10Ottomata: Alert if EventStreams recentchange endpoint has no messages [puppet] - 10https://gerrit.wikimedia.org/r/433161 (https://phabricator.wikimedia.org/T174493) [15:22:50] (03CR) 10Ottomata: [V: 032 C: 032] Alert if EventStreams recentchange endpoint has no messages [puppet] - 10https://gerrit.wikimedia.org/r/433161 (https://phabricator.wikimedia.org/T174493) (owner: 10Ottomata) [15:25:18] (03PS2) 10Giuseppe Lavagetto: Add the capability to check for deprecated defines [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/433162 (https://phabricator.wikimedia.org/T194724) [15:25:21] (03PS2) 10Giuseppe Lavagetto: Check for all the available variants of a hiera call [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/433163 [15:25:38] (03PS1) 10Ottomata: Fix type in check_eventstreams script [puppet] - 10https://gerrit.wikimedia.org/r/433166 (https://phabricator.wikimedia.org/T174493) [15:26:02] (03CR) 10Ottomata: [V: 032 C: 032] Fix type in check_eventstreams script [puppet] - 10https://gerrit.wikimedia.org/r/433166 (https://phabricator.wikimedia.org/T174493) (owner: 10Ottomata) [15:26:27] PROBLEM - Device not healthy -SMART- on db2067 is CRITICAL: cluster=mysql device=cciss,9 instance=db2067:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2067&var-datasource=codfw%2520prometheus%252Fops [15:27:37] (03PS2) 10Herron: admin: add mmiller to analytics-privatedata-users and researchers [puppet] - 
10https://gerrit.wikimedia.org/r/433083 (https://phabricator.wikimedia.org/T194550) [15:28:08] (03CR) 10Addshore: Prepare Lexeme config for test.wikidata.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433145 (https://phabricator.wikimedia.org/T194250) (owner: 10Jakob) [15:28:18] (03CR) 10Herron: [C: 032] admin: add mmiller to analytics-privatedata-users and researchers [puppet] - 10https://gerrit.wikimedia.org/r/433083 (https://phabricator.wikimedia.org/T194550) (owner: 10Herron) [15:30:53] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1002 is CRITICAL: 12.07 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1002 [15:31:13] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1001 is CRITICAL: 14.83 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1001 [15:32:34] it is already cleared out [15:32:53] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1002 is OK: (C)10 ge (W)5 ge 3.621 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1002 [15:33:10] a bit weird though [15:33:22] because I am restarting a different cluster [15:36:23] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1001 is CRITICAL: 14.33 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1001 [15:38:53] weird indeed [15:39:32] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1001 is OK: (C)10 ge (W)5 ge 0 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1001 [15:39:37] 10Operations, 10DBA: https://tendril.wikimedia.org/ IPv6 doesn't work - https://phabricator.wikimedia.org/T194766#4208076 (10BBlack) 05Open>03Resolved Works now, thanks! [15:41:03] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:41:35] (03CR) 10Jakob: Prepare Lexeme config for test.wikidata.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433145 (https://phabricator.wikimedia.org/T194250) (owner: 10Jakob) [15:46:00] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1065 - https://phabricator.wikimedia.org/T194698#4208115 (10Cmjohnson) The disk has been replaced. Please resolve once rebuild is complete [15:50:00] (03PS8) 10Ema: numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) [15:52:15] !log rolling restart of hadoop master daemons to pick up new zookeeper settings [15:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:07] elukey: you rebooted analytics kafka ya? 
[15:56:13] ottomata: only restarted kafka in there, not mm [15:56:32] right [15:56:44] i think that old 0.9 mms might need bouncing after cluster restart [15:56:46] (03PS2) 10Jcrespo: mariadb: Failover dbproxy1007,8 and 9 and make them passive [dns] - 10https://gerrit.wikimedia.org/r/433015 (https://phabricator.wikimedia.org/T187962) [15:56:48] (03PS1) 10Jcrespo: mariadb: Move m3-slave from db1053 to db1072 [dns] - 10https://gerrit.wikimedia.org/r/433175 (https://phabricator.wikimedia.org/T194634) [15:57:01] ottomata: ah yes sorry I was about to do it after the last broker :( [15:57:05] (03PS9) 10Ema: numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) [15:58:28] elukey: its a little funky that we actually need that... [15:59:02] (03PS2) 10Jcrespo: mariadb: Move m3-slave from db1053 to db1072 [dns] - 10https://gerrit.wikimedia.org/r/433175 (https://phabricator.wikimedia.org/T194634) [15:59:02] ottomata: restarting for all the zk changes? [16:00:04] godog, moritzm, and _joe_: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180515T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:06] elukey: have to restart MM after a kafka cluster restart [16:00:13] it shouldn't be needed, but you know, 0.9 is FLAKY [16:00:30] ahhh yes yes [16:00:30] (03CR) 10Jcrespo: [C: 032] mariadb: Move m3-slave from db1053 to db1072 [dns] - 10https://gerrit.wikimedia.org/r/433175 (https://phabricator.wikimedia.org/T194634) (owner: 10Jcrespo) [16:01:13] elukey: i'm going to reduce sensitiveiy of that UnderReplicatedPartitions alert [16:01:23] ack [16:01:28] not exactly sure what better setting might be [16:01:33] # Alert if any undereplicated for more than 50% [16:01:33] # of the time in the last 30 minutes. 
[16:01:33] from => '30min', [16:01:33] percentage => 50, [16:01:37] Deploying AQS with elukey ops-team [16:01:52] maybe percentage => 80? [16:02:24] (03PS3) 10Jcrespo: mariadb: Failover dbproxy1007,8 and 9 and make them passive [dns] - 10https://gerrit.wikimedia.org/r/433015 (https://phabricator.wikimedia.org/T187962) [16:03:03] ottomata: what alert is that ? A prometheus one? [16:03:13] oh [16:03:16] sorry [16:03:19] old analytics is graphite [16:03:23] promethues just does [16:03:26] # Alert on the average number of under replicated partitions over the last 30 minutes. [16:03:31] avg_over_time(kafka_server_ReplicaManager_UnderReplicatedPartitions{${prometheus_labels}}[30m] [16:03:42] maybe up it to 1h? [16:03:43] (03PS4) 10Jcrespo: mariadb: Failover dbproxy1007,8 and 9 and make them passive [dns] - 10https://gerrit.wikimedia.org/r/433015 (https://phabricator.wikimedia.org/T187962) [16:04:01] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4208208 (10Vgutierrez) Right now the TLS server allows the client to pick up the curve to use, since j8u121 (8u171-b11-1~deb9u1 is deployed on k... [16:04:08] hm we don't really need an average? [16:04:13] we just want to know if there are currently any [16:04:46] yeah, and the time window must be short otherwise the bursts will alert and solve only after a long time [16:05:15] !log joal@tin Started deploy [analytics/aqs/deploy@a736558]: Deploying druid-configuration patch [16:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:03] 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4208219 (10MaxBioHazard) >so you should be able to keep using it. Did you mean that we can change nothing in our bots, its compilation settings, and when you disable AES128 our... 
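The point about a long averaging window (a short burst alerts late and clears only much later) can be illustrated with a small simulation. This is a sketch under assumed numbers, not the real check: one sample per minute, a hypothetical burst of 72 under-replicated partitions lasting 5 minutes, and the critical threshold of 10 seen in the alerts above.

```python
def trailing_mean(samples, window):
    """Mean of the last `window` samples at each step (shorter at the start)."""
    means = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - window + 1): i + 1]
        means.append(sum(chunk) / len(chunk))
    return means


CRITICAL = 10  # threshold taken from the "ge 10" alerts in the log

# Assumed data: 30 quiet minutes, a 5 minute burst during a broker restart,
# then 55 quiet minutes.
samples = [0] * 30 + [72] * 5 + [0] * 55

minutes_raw = sum(1 for value in samples if value >= CRITICAL)
minutes_avg = sum(1 for value in trailing_mean(samples, 30) if value >= CRITICAL)

print(minutes_raw, minutes_avg)
# prints 5 26
```

With these assumed numbers the raw value is above the threshold for 5 minutes, while the 30 minute average only crosses it at the end of the burst and then stays critical for 26 minutes, which matches the flapping behaviour described during the restarts.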
[16:07:12] elukey: maybe [16:07:18] min_over_time [5m] [16:07:18] ? [16:07:34] if the min value in the last 5 mins is > X, alert [16:07:34] ? [16:07:40] that way it gives 5 minutes to get back to 0 [16:07:40] ? [16:08:02] if the min is > 0 in the last 5 minutes, warning, > 10, critical? [16:08:38] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler02/11217/" [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema) [16:09:36] (03PS1) 10Ottomata: Reduce sensitivity of Kafka Broker Under Replicated Partitions alert [puppet] - 10https://gerrit.wikimedia.org/r/433179 [16:09:38] ottomata: could work, maybe 10m? [16:09:52] but I am also fine to test 5 [16:10:21] 10Operations, 10Traffic: Enable Service in Asia Cache DC - https://phabricator.wikimedia.org/T156026#4208223 (10BBlack) 05Open>03Resolved a:03BBlack Closing this (a bit late), as service has been online for a while now. Trailing remaining tasks re: Zero and/or further network engineering aren't really a... [16:10:31] ottomata: if you haven't started the jumbo restarts I can do them now [16:10:38] just finished analytics and hadoop [16:11:02] !log joal@tin Finished deploy [analytics/aqs/deploy@a736558]: Deploying druid-configuration patch (duration: 05m 47s) [16:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:19] (03PS1) 10Jcrespo: mariadb: Make db1072, and not db1053, the passive m3 failover [puppet] - 10https://gerrit.wikimedia.org/r/433180 (https://phabricator.wikimedia.org/T194634) [16:11:21] 10Operations, 10Traffic: Enable Service in Asia Cache DC - https://phabricator.wikimedia.org/T156026#4208229 (10BBlack) [16:11:27] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4208227 (10BBlack) 05Open>03Resolved Closing this as well, we're through the basic turn-up process. 
Trailing wor... [16:11:31] 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4208230 (10Vgutierrez) >>! In T194380#4208219, @MaxBioHazard wrote: >>so you should be able to keep using it. > > Did you mean that we can change nothing in our bots, its compil... [16:11:35] elukey: haven't restarted, let me deploy this alert change first [16:11:47] ottomata: all right starting them now! [16:11:48] elukey: ya maybe 10 mins ok [16:11:50] (03CR) 10Jcrespo: [C: 032] mariadb: Make db1072, and not db1053, the passive m3 failover [puppet] - 10https://gerrit.wikimedia.org/r/433180 (https://phabricator.wikimedia.org/T194634) (owner: 10Jcrespo) [16:12:06] !log roll restart kafka on kafka-jumbo to pick up new zookeeper settings [16:12:06] elukey: wait! let's see if this alert change will fix flappy alert [16:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:18] haven't merged yet [16:12:32] ottomata: ah yes sure, it will take me a while before doing all the brokers, it should be fine :D [16:13:04] haha ok [16:13:10] (03PS2) 10Ottomata: Reduce sensitivity of Kafka Broker Under Replicated Partitions alert [puppet] - 10https://gerrit.wikimedia.org/r/433179 [16:13:49] (03PS3) 10Ottomata: Reduce sensitivity of Kafka Broker Under Replicated Partitions alert [puppet] - 10https://gerrit.wikimedia.org/r/433179 [16:14:00] (03CR) 10Ottomata: [V: 032 C: 032] Reduce sensitivity of Kafka Broker Under Replicated Partitions alert [puppet] - 10https://gerrit.wikimedia.org/r/433179 (owner: 10Ottomata) [16:15:23] 1 or 2 debproxies will complain now [16:15:31] fixing it, no user impact [16:17:47] fixed now [16:18:31] 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, and 4 others: rack/setup/install wdqs10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T194184#4208249 (10Cmjohnson) [16:19:03] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: 
Access to usergroups for Marshall Miller - https://phabricator.wikimedia.org/T194550#4208251 (10herron) 05Open>03Resolved Ok @MMiller_WMF, you should be good to go! In case you haven't seen them already, there are instructions at ht... [16:20:40] !log milimetric@tin Started deploy [analytics/refinery@679cf09]: Update partition drop script after schema change [16:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:41] !log mobrovac@tin Started restart [changeprop/deploy@e468d8e]: (no justification provided) [16:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:38] (03PS2) 10Dzahn: admins: add mholloway to maps/tilerator/kartotherian admins [puppet] - 10https://gerrit.wikimedia.org/r/433021 (https://phabricator.wikimedia.org/T194404) [16:22:42] (03CR) 10Dzahn: [C: 032] admins: add mholloway to maps/tilerator/kartotherian admins [puppet] - 10https://gerrit.wikimedia.org/r/433021 (https://phabricator.wikimedia.org/T194404) (owner: 10Dzahn) [16:23:55] 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4208278 (10MaxBioHazard) I should execute this string on Toolforge console? [16:24:23] (03PS1) 10Joal: Update AQS druid datasource to snapshot-postfixed [puppet] - 10https://gerrit.wikimedia.org/r/433182 [16:24:32] elukey: --^ [16:24:44] PROBLEM - Zookeeper Server on conf1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg [16:25:07] !log mobrovac@tin Started restart [cpjobqueue/deploy@58935d5]: (no justification provided) [16:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:35] ah yes expired downtime for conf1001 [16:25:40] all good :) [16:25:44] lemme ack it [16:25:53] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 12 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[zookeeper] [16:25:54] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10SRE-Access-Requests, 10Patch-For-Review: Add Michael Holloway (Reading Infrastructure) to maps admin groups - https://phabricator.wikimedia.org/T194404#4208295 (10Dzahn) @Mholloway This is done, you have been added to the requested groups. I ran puppe... [16:26:43] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:27:14] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:27:41] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10SRE-Access-Requests, 10Patch-For-Review: Add Michael Holloway (Reading Infrastructure) to maps admin groups - https://phabricator.wikimedia.org/T194404#4208300 (10Dzahn) 05Open>03Resolved [maps2001:~] $ id mholloway-shell uid=11963(mholloway-shell)... [16:27:53] !log milimetric@tin Finished deploy [analytics/refinery@679cf09]: Update partition drop script after schema change (duration: 07m 13s) [16:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:15] 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4208305 (10Vgutierrez) >>! In T194380#4208278, @MaxBioHazard wrote: > I should execute this string on Toolforge console? And recompile mono after this? If I'm reading the mono i... 
[16:28:35] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11218/aqs1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/433182 (owner: 10Joal) [16:28:41] (03PS2) 10Elukey: Update AQS druid datasource to snapshot-postfixed [puppet] - 10https://gerrit.wikimedia.org/r/433182 (owner: 10Joal) [16:29:02] elukey: once applied, we'll need to restart AQS (and test) [16:29:22] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4208306 (10chasemp) [16:30:51] joal: going to depool and apply it to aqs1004 ok ? [16:30:57] elukey: ack ! [16:31:29] joal: done1 [16:31:56] (03PS1) 10Elukey: role::configcluster: decom zookeeper on conf1001 [puppet] - 10https://gerrit.wikimedia.org/r/433183 (https://phabricator.wikimedia.org/T182924) [16:32:00] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:32:23] (03CR) 10jerkins-bot: [V: 04-1] role::configcluster: decom zookeeper on conf1001 [puppet] - 10https://gerrit.wikimedia.org/r/433183 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [16:32:30] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:32:32] 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4208314 (10MaxBioHazard) My bots are launched from cron. I hope, execute this string once would be enough. 
[16:33:16] (03PS1) 10Cmjohnson: Adding DNS for wdqs10[09-10] [dns] - 10https://gerrit.wikimedia.org/r/433184 (https://phabricator.wikimedia.org/T194184) [16:33:19] (03PS2) 10Elukey: role::configcluster: decom zookeeper on conf1001 [puppet] - 10https://gerrit.wikimedia.org/r/433183 (https://phabricator.wikimedia.org/T182924) [16:33:53] (03CR) 10jerkins-bot: [V: 04-1] role::configcluster: decom zookeeper on conf1001 [puppet] - 10https://gerrit.wikimedia.org/r/433183 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [16:34:24] (03CR) 10Cmjohnson: [C: 032] Adding DNS for wdqs10[09-10] [dns] - 10https://gerrit.wikimedia.org/r/433184 (https://phabricator.wikimedia.org/T194184) (owner: 10Cmjohnson) [16:34:55] elukey: look good to me ! [16:35:30] elukey: druid answers requests sent by AQS on the newly configured datasource, gicing correct numbers [16:35:41] elukey: We can continue the rollout :) [16:35:42] (03PS2) 10Andrew Bogott: openstack: move nova-api and nova-network functions to labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/433153 (https://phabricator.wikimedia.org/T193579) [16:35:51] joal: ack [16:35:55] (03CR) 10Andrew Bogott: [V: 032 C: 032] openstack: move nova-api and nova-network functions to labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/433153 (https://phabricator.wikimedia.org/T193579) (owner: 10Andrew Bogott) [16:36:09] !log rolling restart of aqs on aqs* nodes to pick up the new druid config [16:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:50] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [16:37:56] 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4208339 (10Reedy) It won't be, just single line it `MONO_TLS_PROVIDER=btls mono bot.exe` [17:00:38] RECOVERY - toolschecker: tools nginx 
proxy health on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 642 bytes in 0.002 second response time [17:00:57] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [17:01:26] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [17:01:57] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [17:02:26] RECOVERY - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1015 bytes in 8.667 second response time [17:02:56] RECOVERY - Host paws.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [17:03:21] (03PS2) 10Dzahn: update MAC address of mw2139 [puppet] - 10https://gerrit.wikimedia.org/r/433185 (https://phabricator.wikimedia.org/T194426) [17:04:11] (03CR) 10Dzahn: [C: 032] update MAC address of mw2139 [puppet] - 10https://gerrit.wikimedia.org/r/433185 (https://phabricator.wikimedia.org/T194426) (owner: 10Dzahn) [17:04:27] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [17:04:46] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler03/11219/" [puppet] - 10https://gerrit.wikimedia.org/r/433183 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [17:05:06] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [17:05:40] (03CR) 10Dzahn: [V: 032 C: 032] update MAC address of mw2139 [puppet] - 10https://gerrit.wikimedia.org/r/433185 (https://phabricator.wikimedia.org/T194426) (owner: 10Dzahn) [17:06:12] jenkins not voting due to labs maintenance i assume [17:07:07] RECOVERY - Ensure NFS exports are maintained for 
new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [17:09:40] (03PS2) 10Cmjohnson: Adding dhcpd entry wdqs1009-10 [puppet] - 10https://gerrit.wikimedia.org/r/433186 (https://phabricator.wikimedia.org/T194184) [17:10:27] (03CR) 10Cmjohnson: [C: 032] Adding dhcpd entry wdqs1009-10 [puppet] - 10https://gerrit.wikimedia.org/r/433186 (https://phabricator.wikimedia.org/T194184) (owner: 10Cmjohnson) [17:11:53] 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, and 5 others: rack/setup/install wdqs10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T194184#4208430 (10Cmjohnson) [17:12:43] (03PS1) 10Ottomata: check_eventstreams - exit 2 if critical [puppet] - 10https://gerrit.wikimedia.org/r/433190 [17:15:40] !log rolling restart kafka-jumbo100[456] [17:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:07] 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, and 5 others: rack/setup/install wdqs10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T194184#4208445 (10Cmjohnson) @robh the dhcpd file has been updated but not sure which partman recipe...feel free to add and continue with installation.... 
[17:17:36] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [17:18:01] (03PS2) 10Ottomata: check_eventstreams - exit 2 if critical [puppet] - 10https://gerrit.wikimedia.org/r/433190 [17:18:01] (03CR) 10Ottomata: [V: 032 C: 032] check_eventstreams - exit 2 if critical [puppet] - 10https://gerrit.wikimedia.org/r/433190 (owner: 10Ottomata) [17:21:31] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4208447 (10chasemp) [17:21:36] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1065 - https://phabricator.wikimedia.org/T194698#4208449 (10jcrespo) 05Open>03Resolved ``` $ megacli -PDList -aALL | grep 'state' Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Sp... [17:22:35] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4208451 (10Andrew) @Cmjohnson Just to clarify, labnet1003 and 1004 aren't in active service so you can move them anytime; just check in with @chasemp after they're moved... [17:25:39] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init] [17:25:59] hm [17:27:45] !log branching 1.32.0-wmf.4 refs T191050 [17:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:49] T191050: 1.32.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T191050 [17:28:54] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4208462 (10chasemp) We ran through our normal procedure to fail traffic from labnet1002 back to labnet1001 (post move this morning). 
Labnet1001 saw incoming traffic from... [17:30:39] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:36:18] (03PS1) 10Ottomata: Use fqdn instead of localhost for curl eventstreams check [puppet] - 10https://gerrit.wikimedia.org/r/433194 (https://phabricator.wikimedia.org/T174493) [17:36:54] (03PS2) 10Herron: admin: add user seddon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/433025 (https://phabricator.wikimedia.org/T194445) [17:37:04] (03PS2) 10Ottomata: Use proper path and fqdn for eventstreams check [puppet] - 10https://gerrit.wikimedia.org/r/433194 (https://phabricator.wikimedia.org/T174493) [17:37:11] (03CR) 10jerkins-bot: [V: 04-1] admin: add user seddon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/433025 (https://phabricator.wikimedia.org/T194445) (owner: 10Herron) [17:37:32] (03CR) 10Ottomata: [V: 032 C: 032] Use proper path and fqdn for eventstreams check [puppet] - 10https://gerrit.wikimedia.org/r/433194 (https://phabricator.wikimedia.org/T174493) (owner: 10Ottomata) [17:43:03] (03PS3) 10Herron: admin: add user seddon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/433025 (https://phabricator.wikimedia.org/T194445) [17:47:51] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: mw2139 failed to boot - hardware check - https://phabricator.wikimedia.org/T194426#4208503 (10Papaul) a:05Papaul>03Dzahn @Dzahn I replaced the main board., Update the IDRAC and BIOS. it is all yours. I also installed the OS on the system. [17:51:29] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init] [17:56:50] (03PS4) 10Herron: admin: add user seddon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/433025 (https://phabricator.wikimedia.org/T194445) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180515T1800) [18:01:45] (03CR) 10Herron: [C: 032] admin: add user seddon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/433025 (https://phabricator.wikimedia.org/T194445) (owner: 10Herron) [18:22:59] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:25:08] 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4208556 (10Vgutierrez) @MaxBioHazard please let us know when you make the change to check on our side that everything looks good :) [18:25:38] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init] [18:36:36] weird [18:48:42] (03PS3) 10Bstorm: wiki replicas: return page to a full view [puppet] - 10https://gerrit.wikimedia.org/r/433085 (https://phabricator.wikimedia.org/T174047) [18:49:57] (03CR) 10Bstorm: [C: 032] wiki replicas: return page to a full view [puppet] - 10https://gerrit.wikimedia.org/r/433085 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [18:51:49] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init] [18:51:49] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:00:05] twentyafterfour: That opportune time is upon us again. 
Time for a MediaWiki train deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180515T1900). [19:03:51] !log mw2139 - wmf-auto-reimage --conftoool --no-verify (T194426) [19:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:56] T194426: mw2139 failed to boot - hardware check - https://phabricator.wikimedia.org/T194426 [19:04:15] (03PS1) 10Bstorm: wiki replicas: depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/433206 (https://phabricator.wikimedia.org/T174047) [19:04:26] "Unable to run wmf-auto-reimage-host: Failed to icinga_downtime" hmmmmrrr [19:05:20] !log mw2139 - wmf-auto-reimage --conftoool --new (because it got "Failed to icinga_downtime" and has a new mainboard (T194426) [19:05:22] (03PS1) 10Urbanecm: New throttle rule for WMF Hackhathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433207 (https://phabricator.wikimedia.org/T194392) [19:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:25] (03CR) 10Bstorm: "Is there a stretch repo? I am hoping to bring stretch into the mix on tools eventually." [puppet] - 10https://gerrit.wikimedia.org/r/433142 (https://phabricator.wikimedia.org/T194665) (owner: 10Arturo Borrero Gonzalez) [19:07:28] (03CR) 10Vgutierrez: "> Is there a stretch repo? I am hoping to bring stretch into the mix" [puppet] - 10https://gerrit.wikimedia.org/r/433142 (https://phabricator.wikimedia.org/T194665) (owner: 10Arturo Borrero Gonzalez) [19:20:18] (03CR) 10Gehel: "minor comment inline" (031 comment) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/432136 (https://phabricator.wikimedia.org/T193734) (owner: 10DCausse) [19:20:35] (03CR) 10Foks: [C: 031] "Looks good to me." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/433136 (https://phabricator.wikimedia.org/T152296) (owner: 10MarcoAurelio) [19:23:28] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:26:57] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Seddon access to the analytics cluster - https://phabricator.wikimedia.org/T194445#4208771 (10herron) 05Open>03Resolved Hi @Jseddon, your shell account `seddon` has been created and added to group `analytics-privatedata-users`. You should now... [19:29:34] (03CR) 10Gehel: "minor comments inline, otherwise LGTM" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron) [19:37:22] (03PS1) 1020after4: testwikis wikis to 1.32.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433209 [19:37:24] (03CR) 1020after4: [C: 032] testwikis wikis to 1.32.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433209 (owner: 1020after4) [19:37:40] (03CR) 10BryanDavis: "I added bblack as a reviewer to get a sanity check on this approach for selective https upgrades." [puppet] - 10https://gerrit.wikimedia.org/r/432935 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [19:38:53] (03Merged) 10jenkins-bot: testwikis wikis to 1.32.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433209 (owner: 1020after4) [19:39:42] !log twentyafterfour@tin Started scap: testwikis wikis to 1.32.0-wmf.4 [19:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:49] (03CR) 10jenkins-bot: testwikis wikis to 1.32.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433209 (owner: 1020after4) [19:40:13] twentyafterfour: sorry, not at home cant log in to tin to check patch [19:40:34] twentyafterfour: if you pastebin it somewhere i could look [19:40:35] bawolff: ok I'll paste it on phab if that'll work? 
[19:40:38] ok cool [19:41:48] I remember when i applied it in the first place there was big conflicts compared to HEAD. the version on the bug might have been easier to resolve [19:42:30] !log twentyafterfour@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_770814178" --threads=10 --lang en --quiet' returned non-zero exit status 255 (duration: 02m 47s) [19:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:05] bawolff: https://phabricator.wikimedia.org/P7132 [19:44:44] conflicts were not complex to resolve I just wanted a second pair of eyes on it [19:45:03] twentyafterfour: looks good [19:45:20] bawolff: thanks! [19:49:58] Fatal error: Uncaught exception 'Exception' with message '/srv/mediawiki-staging/php-1.32.0-wmf.4/extensions/CongressLookup/extension.json does not exist!' in /srv/mediawiki-staging/php-1.32.0-wmf.4/includes/registration/ExtensionRegistry.php:105 [19:50:44] twentyafterfour doesn't seem it was branched [19:50:45] for wmf.4 [19:50:55] https://github.com/wikimedia/mediawiki-extensions-CongressLookup/branches [19:51:10] paladox: yeah ... [19:52:05] it's not in the make-wmf-branch config [19:52:42] I don't know why extensionregistry is looking for it [19:53:06] was this just added to mediawiki-config without adding it to the branch config? [19:54:39] It was deployed as a rush job so wouldnt surprise me if someone missed that [19:54:40] weird...it's in wmf.3 but I don't see a patch removing it [20:08:52] twentyafterfour it was added, but i think as bawolff suggests, they may have overlooked adding it to that script by mistake [20:10:48] (03PS1) 10Ottomata: Set jdk.tls.namedGroups=secp256r1 for Kafka TLS [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993) [20:11:08] (03CR) 10Ottomata: [C: 04-1] "UNTESTED! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993) (owner: 10Ottomata) [20:11:37] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4208839 (10Ottomata) So, something like ^? [20:12:07] (03PS2) 10Ottomata: Set jdk.tls.namedGroups=secp256r1 for Kafka TLS [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993) [20:12:56] (03PS3) 10Ottomata: Set jdk.tls.namedGroups=secp256r1 for Kafka TLS [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993) [20:13:18] damnit [20:13:35] * twentyafterfour just submitted a security patch to gerrit [20:19:41] twentyafterfour you can delete it [20:19:49] I did [20:22:25] !log [radium:~] $ sudo apt-get autoremove [20:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:27] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Rebuild raids on labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#4208854 (10bd808) [20:23:38] RECOVERY - Disk space on furud is OK: DISK OK [20:23:48] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init] [20:27:45] !log twentyafterfour@tin Started scap: testwikis to 1.32.0-wmf.4 refs T191050 [20:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:50] T191050: 1.32.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T191050 [20:29:54] @seen wikibugs [20:29:54] mutante: Last time I saw wikibugs they were talking in the channel, but they are not in the channel now and I don't know why, in #mediawiki-feed at 3/12/2018 12:21:26 PM (64d8h8m27s ago) [20:30:07] yea, that ^ [20:30:18] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [20:31:28] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [20:31:59] 10Operations: replace/reinstall radium with a stretch system - https://phabricator.wikimedia.org/T194796#4208919 (10faidon) 05Open>03declined radium is super old hardware (2011 era) and its refresh is imminent, as part of T189317. No reason to spend time to reimage at this point :) [20:33:50] 10Operations: replace/reinstall radium with a stretch system - https://phabricator.wikimedia.org/T194796#4208923 (10Dzahn) Alright, for that case i had the "or replace the hardware with other hardware running stretch and switch the role over, then decom radium" option. [20:35:06] ah, wikibugs with underscore [20:38:32] (03PS4) 10Ottomata: Set jdk.tls.namedGroups=secp256r1 for Kafka TLS [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993) [20:42:23] PROBLEM - Disk space on mw2139 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:42:23] PROBLEM - dhclient process on mw2139 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:44:32] ACKNOWLEDGEMENT - HHVM processes on mw2139 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
daniel_zahn T194426 [20:46:08] (03PS5) 10Ottomata: Set jdk.tls.namedGroups=secp256r1 for Kafka TLS [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993) [20:47:33] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/11224/kafka-jumbo1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993) (owner: 10Ottomata) [20:49:32] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:49:34] (03CR) 10Vgutierrez: [C: 031] "Nice, the effect of this can be tested with: openssl s_client -brief -curves sect283k1:sect283r1:sect409k1:sect409r1:sect571k1:sect571r1:s" [puppet] - 10https://gerrit.wikimedia.org/r/433214 (https://phabricator.wikimedia.org/T182993) (owner: 10Ottomata) [20:51:07] (03PS1) 10Dzahn: phab/phd: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724) [20:52:23] (03CR) 10Paladox: [C: 031] phab/phd: base::service_unit -> systemd::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [20:53:22] (03CR) 10Dzahn: "i should probably remove the " provider => $::initsystem," [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [20:54:42] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init] [20:56:24] (03PS2) 10Dzahn: phab/phd: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724) [20:58:34] (03PS3) 10Dzahn: phabricator: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724) [20:59:52] (03CR) 10Paladox: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/433281 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [21:20:53] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:30:17] (03PS1) 10Dzahn: tor: add icinga check_tcp for ORPort and DirPort [puppet] - 10https://gerrit.wikimedia.org/r/433284 (https://phabricator.wikimedia.org/T148614) [21:30:59] (03CR) 10jerkins-bot: [V: 04-1] tor: add icinga check_tcp for ORPort and DirPort [puppet] - 10https://gerrit.wikimedia.org/r/433284 (https://phabricator.wikimedia.org/T148614) (owner: 10Dzahn) [21:31:50] (03PS2) 10Dzahn: tor: add icinga check_tcp for ORPort and DirPort [puppet] - 10https://gerrit.wikimedia.org/r/433284 (https://phabricator.wikimedia.org/T148614) [21:34:04] !log twentyafterfour@tin Finished scap: testwikis to 1.32.0-wmf.4 refs T191050 (duration: 66m 19s) [21:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:09] T191050: 1.32.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T191050 [21:52:21] (03PS1) 10Dzahn: introduce webperf1002 & webperf2002 [dns] - 10https://gerrit.wikimedia.org/r/433287 (https://phabricator.wikimedia.org/T194390) [21:54:52] (03PS3) 10Samwilson: Deploy GlobalPreferences to test wikis and mw.org (forth time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433149 (https://phabricator.wikimedia.org/T190425) [21:55:57] (03PS2) 10Dzahn: introduce webperf1002 & webperf2002 [dns] - 
10https://gerrit.wikimedia.org/r/433287 (https://phabricator.wikimedia.org/T194390) [21:58:46] (03PS1) 1020after4: group0 wikis to 1.32.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433288 [21:58:48] (03CR) 1020after4: [C: 032] group0 wikis to 1.32.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433288 (owner: 1020after4) [21:59:59] (03CR) 10Dzahn: [C: 032] introduce webperf1002 & webperf2002 [dns] - 10https://gerrit.wikimedia.org/r/433287 (https://phabricator.wikimedia.org/T194390) (owner: 10Dzahn) [22:00:04] MaxSem and samwilson: (Dis)respected human, time to deploy GlobalPreferences test deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180515T2200). Please do the needful. [22:00:06] (03Merged) 10jenkins-bot: group0 wikis to 1.32.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433288 (owner: 1020after4) [22:00:08] (03PS4) 10Legoktm: Deploy GlobalPreferences to test wikis and mw.org (fourth time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433149 (https://phabricator.wikimedia.org/T190425) (owner: 10Samwilson) [22:00:21] (03CR) 10jenkins-bot: group0 wikis to 1.32.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433288 (owner: 1020after4) [22:02:17] (03CR) 10Samwilson: [C: 032] Deploy GlobalPreferences to test wikis and mw.org (fourth time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433149 (https://phabricator.wikimedia.org/T190425) (owner: 10Samwilson) [22:03:50] (03Merged) 10jenkins-bot: Deploy GlobalPreferences to test wikis and mw.org (fourth time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433149 (https://phabricator.wikimedia.org/T190425) (owner: 10Samwilson) [22:04:07] 10Operations, 10vm-requests, 10Patch-For-Review: EQIAD & CODFW: 1 VM in each data center for xhprof/xhgui/other profiling tools - https://phabricator.wikimedia.org/T194390#4209076 (10Dzahn) assigned IPs to webperf1002 and webperf2002: eqiad forward webperf1001.eqiad.wmnet has 
address 10.64.0.215 webperf1002... [22:06:43] (03CR) 10jenkins-bot: Deploy GlobalPreferences to test wikis and mw.org (fourth time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433149 (https://phabricator.wikimedia.org/T190425) (owner: 10Samwilson) [22:07:14] 10Operations, 10SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4209078 (10RobH) The email thread with legal seems to have reached a conclusion in support, so I'm now in the process of adding admin2@gofishdigital.com to the subdomains: >>!... [22:07:26] twentyafterfour: are you still deploying? [22:15:09] twentyafterfour: I see some undeployed wikiversions changes [22:15:24] still deploying yes [22:15:57] MaxSem: that's pushing the branch to group0, I stopped deploying that to fix the syntax error [22:17:40] samwilson: will be finished shortly [22:17:56] twentyafterfour: no worries, thanks [22:18:59] !log twentyafterfour@tin Synchronized php-1.32.0-wmf.4/includes/api/ApiLogin.php: fix syntax error (duration: 01m 39s) [22:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:23] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init] [22:21:53] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group0 to 1.32.0-wmf.4 refs T191050 [22:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:57] T191050: 1.32.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T191050 [22:26:10] 10Operations, 10SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4209110 (10RobH) 05Open>03Resolved a:03RobH >>! In T192893#4209078, @RobH wrote: > The email thread with legal seems to have reached a conclusion in support, so I'm now in... 
[22:29:52] (03PS1) 10Samwilson: Revert "Deploy GlobalPreferences to test wikis and mw.org (fourth time)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433291 [22:30:44] Wait what? Reverting already? :( [22:31:28] Niharika: haven't deployed yet; waiting on other stuff to finish [22:32:04] samwilson: Maybe we won't need the revert patch! :) [22:33:33] Niharika: i shall be happy to abandon it [22:39:09] !log samwilson@tin Synchronized wmf-config/InitialiseSettings.php: Deploying GlobalPreferences T190425 (duration: 01m 21s) [22:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:14] T190425: GlobalPreferences deploy caused a significant increase in reads on s3 - https://phabricator.wikimedia.org/T190425 [22:42:05] Nothing seems to be blowing up yet, right? [22:42:24] That chart is already pretty suspicious but not because of us. [22:47:07] Niharika: yeah I was wondering about that earlier rise. but nothing to do with us :) and all's looking ok. [22:47:30] Woooooo. :D [22:52:32] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180515T2300). [23:00:04] Urbanecm: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:06:18] !log creating ganeti VM webperf1002.eqiad.wmnet on ganeti1004 (link: private, row: A, cpus: 4, ram: 8, disk: 50) (T194390) [23:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:23] T194390: EQIAD & CODFW: 1 VM in each data center for xhprof/xhgui/other profiling tools - https://phabricator.wikimedia.org/T194390 [23:10:49] !log mw2139 - reimaged, scap pull, apache-fast-test baseurls from naos, repooled with confctl (T194426) [23:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:53] T194426: mw2139 failed to boot - hardware check - https://phabricator.wikimedia.org/T194426 [23:11:04] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2139.codfw.wmnet [23:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:14] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: mw2139 failed to boot - hardware check - https://phabricator.wikimedia.org/T194426#4209152 (10Dzahn) 05Open>03Resolved Thank you @Papaul! Works and is in use again now. Closing ticket as resolved. (fyi @Muehlenhoff ) [23:18:35] (03PS1) 10Dzahn: admins/dzahn: add makevm.sh (create ganeti) to my ~ files [puppet] - 10https://gerrit.wikimedia.org/r/433296 [23:20:29] (03CR) 10Dzahn: admins/dzahn: add makevm.sh (create ganeti) to my ~ files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/433296 (owner: 10Dzahn) [23:26:33] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init] [23:33:35] !log creating ganeti VM webperf2002.eqiad.wmnet on ganeti2004 (link: private, row: A, cpus: 4, ram: 8, disk: 50) (T194390) [23:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:39] T194390: EQIAD & CODFW: 1 VM in each data center for xhprof/xhgui/other profiling tools - https://phabricator.wikimedia.org/T194390 [23:38:32] (03PS1) 10Dzahn: add webperf1002/2002 as spare systems with IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/433298 (https://phabricator.wikimedia.org/T194390) [23:39:12] (03CR) 10jerkins-bot: [V: 04-1] add webperf1002/2002 as spare systems with IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/433298 (https://phabricator.wikimedia.org/T194390) (owner: 10Dzahn) [23:40:10] (03PS1) 10Dzahn: webperf: add IPv6 mapped address to role [puppet] - 10https://gerrit.wikimedia.org/r/433299 [23:40:41] (03CR) 10jerkins-bot: [V: 04-1] webperf: add IPv6 mapped address to role [puppet] - 10https://gerrit.wikimedia.org/r/433299 (owner: 10Dzahn)