[00:10:34] (PS4) Dzahn: standard: actually drop 'has_ganglia' param entirely [puppet] - https://gerrit.wikimedia.org/r/382926 (https://phabricator.wikimedia.org/T177225) [00:13:10] (PS5) Dzahn: standard: actually drop 'has_ganglia' param entirely [puppet] - https://gerrit.wikimedia.org/r/382926 (https://phabricator.wikimedia.org/T177225) [00:14:38] (CR) Dzahn: [C: 2] standard: actually drop 'has_ganglia' param entirely [puppet] - https://gerrit.wikimedia.org/r/382926 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn) [00:21:49] meh @ icinga-wm [00:29:51] (PS2) Dzahn: ganglia: delete ganglia-web classes and role [puppet] - https://gerrit.wikimedia.org/r/382932 (https://phabricator.wikimedia.org/T177225) [00:30:44] (CR) Dzahn: [C: 2] ganglia: delete ganglia-web classes and role [puppet] - https://gerrit.wikimedia.org/r/382932 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn) [00:36:40] (PS1) Dzahn: network::constants: drop uranium from monitoring hosts [puppet] - https://gerrit.wikimedia.org/r/399119 (https://phabricator.wikimedia.org/T177225) [00:38:46] (PS1) Dzahn: remove ganglia_aggregators settings from hiera [puppet] - https://gerrit.wikimedia.org/r/399120 (https://phabricator.wikimedia.org/T177225) [00:44:47] !log einsteinium: sudo systemctl restart ircecho (alias kick-icinga-wm) [00:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:17] (PS1) Dzahn: rm role/manifests/ganglia/config [puppet] - https://gerrit.wikimedia.org/r/399121 [00:49:38] (CR) jerkins-bot: [V: -1] rm role/manifests/ganglia/config [puppet] - https://gerrit.wikimedia.org/r/399121 (owner: Dzahn) [00:51:06] (PS2) Dzahn: rm role/manifests/ganglia/config [puppet] - https://gerrit.wikimedia.org/r/399121 (https://phabricator.wikimedia.org/T177225) [01:00:22] (PS1) Chad: Nightly server: let MW releasers manage Jenkins [puppet] - https://gerrit.wikimedia.org/r/399123 
[01:05:05] PROBLEM - HTTP on releases1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 13009 bytes in 0.009 second response time [01:06:18] no_justification: ^ [01:06:34] Yep yep, I know [01:09:16] PROBLEM - Check Varnish expiry mailbox lag on cp4021 is CRITICAL: CRITICAL: expiry mailbox lag is 2092413 [01:10:05] RECOVERY - HTTP on releases1001 is OK: HTTP OK: HTTP/1.1 200 OK - 19215 bytes in 0.082 second response time [01:11:42] (CR) Dzahn: [C: 2] rm role/manifests/ganglia/config [puppet] - https://gerrit.wikimedia.org/r/399121 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn) [01:15:55] (PS1) Dzahn: remove ganglia.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/399124 (https://phabricator.wikimedia.org/T177225) [01:20:25] Operations: decom uranium - https://phabricator.wikimedia.org/T183209#3846949 (Dzahn) [01:21:25] Operations, monitoring, Technical-Debt: decom uranium - https://phabricator.wikimedia.org/T183209#3846962 (Dzahn) [01:26:59] (PS1) Dzahn: remove uranium.wikimedia.org, v4 + v6 [dns] - https://gerrit.wikimedia.org/r/399125 (https://phabricator.wikimedia.org/T183209) [01:30:02] (PS1) Dzahn: uranium: remove mapped v6, add decom comment [puppet] - https://gerrit.wikimedia.org/r/399127 (https://phabricator.wikimedia.org/T183209) [01:31:15] (CR) Dzahn: [C: 2] uranium: remove mapped v6, add decom comment [puppet] - https://gerrit.wikimedia.org/r/399127 (https://phabricator.wikimedia.org/T183209) (owner: Dzahn) [01:42:57] (CR) Dzahn: "as a minimum i can definitely confirm you wouldn't be the first to let puppet execute usermod to fix this or similar:" [puppet] - https://gerrit.wikimedia.org/r/399101 (owner: Ayounsi) [01:45:24] afk now,bbl [01:50:05] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 [01:50:15] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 
198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0 [02:24:39] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.12) (duration: 05m 22s) [02:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:55] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [02:54:56] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [03:56:15] PROBLEM - Check Varnish expiry mailbox lag on cp4024 is CRITICAL: CRITICAL: expiry mailbox lag is 2055624 [04:16:15] RECOVERY - Check Varnish expiry mailbox lag on cp4024 is OK: OK: expiry mailbox lag is 0 [04:23:05] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.016 second response time [04:29:26] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.486 second response time [05:11:06] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.025 second response time [05:14:26] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.332 second response time [05:18:24] !log restarting slapd on seaborgium (in response to ldap complaints on the grid master) [05:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:26] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on 
http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.291 second response time [05:39:26] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.391 second response time [06:09:56] !log Deploy schema change on db1065 (s1 sanitarium master) with replication, so some lag will be generated on labs - T174569 [06:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:10] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [06:21:26] (PS1) Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/399138 (https://phabricator.wikimedia.org/T161294) [06:24:27] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/399138 (https://phabricator.wikimedia.org/T161294) (owner: Marostegui) [06:25:49] (Merged) jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/399138 (https://phabricator.wikimedia.org/T161294) (owner: Marostegui) [06:26:48] (CR) jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/399138 (https://phabricator.wikimedia.org/T161294) (owner: Marostegui) [06:26:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1106 - T161294 (duration: 00m 53s) [06:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:02] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [06:29:07] !log Stop replication in sync on db1100 and db1106 - T161294 [06:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:23] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/399140 [06:40:21] !log Stop replication in sync on db1106 and dbstore1002 s5 - T161294 [06:40:30] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:31] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [06:49:21] !log mobrovac@tin Started deploy [restbase/deploy@2b75a64]: Bug fix: Add the time_to_live config option to the Parsoid module [06:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:28] !log Stop replication in sync on db1106 and db2052 - T161294 [06:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:39] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [06:53:48] !log mobrovac@tin Finished deploy [restbase/deploy@2b75a64]: Bug fix: Add the time_to_live config option to the Parsoid module (duration: 04m 26s) [06:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:36] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/399140 (owner: Marostegui) [06:59:00] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/399140 (owner: Marostegui) [06:59:10] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/399140 (owner: Marostegui) [07:00:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1106 - T161294 (duration: 00m 51s) [07:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:15] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [07:16:32] (CR) Giuseppe Lavagetto: First version of the helm chart scaffolding for production services (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/392619 (https://phabricator.wikimedia.org/T177397) (owner: Giuseppe Lavagetto) [07:17:05] (PS3) Giuseppe Lavagetto: Create an envoy docker image. 
[docker-images/production-images] - https://gerrit.wikimedia.org/r/396021 [08:02:32] (PS9) Jcrespo: Update mariadb::proxy to the latest style and path locations [puppet] - https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) [08:02:34] (PS2) Jcrespo: [WIP]Quick & dirty script to check data differences between tables [puppet] - https://gerrit.wikimedia.org/r/345188 (https://phabricator.wikimedia.org/T160509) [08:03:12] (CR) jerkins-bot: [V: -1] [WIP]Quick & dirty script to check data differences between tables [puppet] - https://gerrit.wikimedia.org/r/345188 (https://phabricator.wikimedia.org/T160509) (owner: Jcrespo) [08:05:07] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw2119.codfw.wmnet [08:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:33] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw2246.codfw.wmnet [08:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:05] !log installing openssl security updates [08:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:44] !log Stop replication in sync on db2045 and db1109 - T161294 [08:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:55] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [08:30:42] (PS2) Muehlenhoff: Fix texlive dependency for stretch onwards [puppet] - https://gerrit.wikimedia.org/r/395712 [08:33:55] RECOVERY - mediawiki-installation DSH group on mw2246 is OK: OK [08:35:58] !log reimaging mw1317 (video scaler) to stretch [08:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:17] (CR) Muehlenhoff: [C: 2] Fix texlive dependency for stretch onwards [puppet] - https://gerrit.wikimedia.org/r/395712 (owner: Muehlenhoff) [08:38:03] (PS3) Jcrespo: [WIP]Quick & dirty 
script to check data differences between tables [puppet] - https://gerrit.wikimedia.org/r/345188 (https://phabricator.wikimedia.org/T160509) [08:38:08] (PS2) Filippo Giunchedi: prometheus: recording rules for redis [puppet] - https://gerrit.wikimedia.org/r/398871 (https://phabricator.wikimedia.org/T148637) [08:38:31] (CR) jerkins-bot: [V: -1] [WIP]Quick & dirty script to check data differences between tables [puppet] - https://gerrit.wikimedia.org/r/345188 (https://phabricator.wikimedia.org/T160509) (owner: Jcrespo) [08:42:19] (PS2) Jcrespo: [WIP]Initial commit of existent python scripts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/354206 [08:42:38] (CR) Filippo Giunchedi: [C: 2] prometheus: recording rules for redis [puppet] - https://gerrit.wikimedia.org/r/398871 (https://phabricator.wikimedia.org/T148637) (owner: Filippo Giunchedi) [08:46:09] (PS1) Alexandros Kosiaris: Bump puppetdb on puppet compiler to 3G [puppet] - https://gerrit.wikimedia.org/r/399145 [08:47:11] (PS1) Marostegui: Revert "Revert "db-eqiad.php: Depool db1106"" [mediawiki-config] - https://gerrit.wikimedia.org/r/399146 [08:52:53] (CR) Elukey: "pcc https://puppet-compiler.wmflabs.org/compiler02/9395/" [puppet] - https://gerrit.wikimedia.org/r/398869 (https://phabricator.wikimedia.org/T108850) (owner: Elukey) [08:52:58] (CR) Volans: "Nice! I know it's a WIP, I just left few minor comments/suggestions." 
(7 comments) [puppet] - https://gerrit.wikimedia.org/r/345188 (https://phabricator.wikimedia.org/T160509) (owner: Jcrespo) [08:53:07] (CR) Marostegui: [C: 2] Revert "Revert "db-eqiad.php: Depool db1106"" [mediawiki-config] - https://gerrit.wikimedia.org/r/399146 (owner: Marostegui) [08:55:47] (CR) Filippo Giunchedi: "LGTM, though I don't see pdns-exporter running on labservices1001 yet" [puppet] - https://gerrit.wikimedia.org/r/398867 (owner: Muehlenhoff) [08:55:49] (Merged) jenkins-bot: Revert "Revert "db-eqiad.php: Depool db1106"" [mediawiki-config] - https://gerrit.wikimedia.org/r/399146 (owner: Marostegui) [08:56:49] (CR) jenkins-bot: Revert "Revert "db-eqiad.php: Depool db1106"" [mediawiki-config] - https://gerrit.wikimedia.org/r/399146 (owner: Marostegui) [08:56:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1106 - T161294 (duration: 00m 51s) [08:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:04] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [08:57:33] (CR) Alexandros Kosiaris: [C: 2] Bump puppetdb on puppet compiler to 3G [puppet] - https://gerrit.wikimedia.org/r/399145 (owner: Alexandros Kosiaris) [08:59:31] (CR) Volans: [C: 1] "Ack that there is no hurry and we can wait Jan. Adding +1 because the patch looks good to me now. 
@herron: feel free to -2 it to ensure i" [puppet] - https://gerrit.wikimedia.org/r/398120 (https://phabricator.wikimedia.org/T182819) (owner: Herron) [09:03:54] RECOVERY - mediawiki-installation DSH group on mw2119 is OK: OK [09:03:54] ACKNOWLEDGEMENT - Host cp4032 is DOWN: PING CRITICAL - Packet loss = 100% Volans Under maintenance https://phabricator.wikimedia.org/T183176 [09:03:57] (PS1) Marostegui: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/399148 (https://phabricator.wikimedia.org/T161294) [09:04:19] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/9396/" [puppet] - https://gerrit.wikimedia.org/r/398847 (https://phabricator.wikimedia.org/T181995) (owner: Filippo Giunchedi) [09:04:27] (PS2) Filippo Giunchedi: Add nutcracker_exporter profile [puppet] - https://gerrit.wikimedia.org/r/398847 (https://phabricator.wikimedia.org/T181995) [09:04:42] (CR) Jcrespo: "Volans: aside from return values, the fact that you focus on the nitpicks and not on the fact that this is a monolithic unmaintainable mes" [puppet] - https://gerrit.wikimedia.org/r/345188 (https://phabricator.wikimedia.org/T160509) (owner: Jcrespo) [09:06:14] Operations, ops-ulsfo, Traffic: cp4032 memory error - https://phabricator.wikimedia.org/T183176#3845938 (Volans) @RobH FYI I've ack'ed the Icinga alert of the host down and set it to downtime until Fri UTC morning. 
[09:07:04] PROBLEM - DPKG on webperf1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:08:03] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/399148 (https://phabricator.wikimedia.org/T161294) (owner: Marostegui) [09:09:04] RECOVERY - DPKG on webperf1001 is OK: All packages OK [09:09:06] (CR) Alexandros Kosiaris: [C: 1] "https://puppet-compiler.wmflabs.org/compiler02/9398/ is rather happy, I 'll proceed with this and see what we get out of it" [puppet] - https://gerrit.wikimedia.org/r/398276 (https://phabricator.wikimedia.org/T182860) (owner: Alexandros Kosiaris) [09:10:42] (PS2) Alexandros Kosiaris: Populate the docker group in admin module [puppet] - https://gerrit.wikimedia.org/r/398276 (https://phabricator.wikimedia.org/T182860) [09:11:33] (Merged) jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/399148 (https://phabricator.wikimedia.org/T161294) (owner: Marostegui) [09:11:47] (CR) jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/399148 (https://phabricator.wikimedia.org/T161294) (owner: Marostegui) [09:12:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097:3315 - T161294 (duration: 00m 51s) [09:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:49] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [09:15:34] (PS3) Alexandros Kosiaris: Populate the docker group in admin module [puppet] - https://gerrit.wikimedia.org/r/398276 (https://phabricator.wikimedia.org/T182860) [09:18:40] (PS3) Elukey: profile::mariadb::misc::el::master: apply data sanitization policies [puppet] - https://gerrit.wikimedia.org/r/398869 (https://phabricator.wikimedia.org/T108850) [09:19:43] (CR) Elukey: [C: 2] profile::mariadb::misc::el::master: apply data 
sanitization policies [puppet] - https://gerrit.wikimedia.org/r/398869 (https://phabricator.wikimedia.org/T108850) (owner: Elukey) [09:20:35] (PS4) Alexandros Kosiaris: Populate the docker group in admin module [puppet] - https://gerrit.wikimedia.org/r/398276 (https://phabricator.wikimedia.org/T182860) [09:21:01] (CR) Filippo Giunchedi: [C: 1] Add Prometheus scraper configs for WDQS updater and Blazegraph [puppet] - https://gerrit.wikimedia.org/r/398865 (owner: Muehlenhoff) [09:22:25] (CR) Alexandros Kosiaris: [C: 2] Populate the docker group in admin module [puppet] - https://gerrit.wikimedia.org/r/398276 (https://phabricator.wikimedia.org/T182860) (owner: Alexandros Kosiaris) [09:26:13] PROBLEM - puppet last run on db1107 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:26:34] this is me --^ [09:27:47] (PS10) Jcrespo: Update mariadb::proxy to the latest style and path locations [puppet] - https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) [09:30:50] (PS1) Elukey: profile::mariadb::misc::eventlogging: fix group/user dependencies [puppet] - https://gerrit.wikimedia.org/r/399149 (https://phabricator.wikimedia.org/T108850) [09:31:23] (CR) Elukey: [C: 2] profile::mariadb::misc::eventlogging: fix group/user dependencies [puppet] - https://gerrit.wikimedia.org/r/399149 (https://phabricator.wikimedia.org/T108850) (owner: Elukey) [09:31:34] (PS7) ArielGlenn: rename 'otherdir' in the dumps modules [puppet] - https://gerrit.wikimedia.org/r/398034 [09:31:54] akosiaris: shall I merge? [09:32:18] elukey: no, I got it... 
I have to check the results anyway [09:32:24] sure [09:32:47] (PS11) Jcrespo: Update mariadb::proxy to the latest style and path locations [puppet] - https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) [09:33:20] (PS8) ArielGlenn: rename 'otherdir' in the dumps modules [puppet] - https://gerrit.wikimedia.org/r/398034 [09:33:57] (CR) ArielGlenn: [C: 2] rename 'otherdir' in the dumps modules [puppet] - https://gerrit.wikimedia.org/r/398034 (owner: ArielGlenn) [09:36:18] RECOVERY - puppet last run on db1107 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [09:37:24] (CR) Alexandros Kosiaris: [C: -2] "Done with a different approach in https://gerrit.wikimedia.org/r/398276" [puppet] - https://gerrit.wikimedia.org/r/398240 (https://phabricator.wikimedia.org/T182860) (owner: Hashar) [09:38:11] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure (shipyard), Patch-For-Review, Release-Engineering-Team (Kanban): Allow contint-admins to interact with docker on CI hosts - https://phabricator.wikimedia.org/T182860#3847374 (akosiaris) Open>Resolved a:akosiaris... [09:38:17] (PS12) Jcrespo: Update mariadb::proxy to the latest style and path locations [puppet] - https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) [09:39:37] (Abandoned) Hashar: contint: allow releng to interact with Docker [puppet] - https://gerrit.wikimedia.org/r/398240 (https://phabricator.wikimedia.org/T182860) (owner: Hashar) [09:41:08] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure (shipyard), Patch-For-Review, Release-Engineering-Team (Kanban): Allow contint-admins to interact with docker on CI hosts - https://phabricator.wikimedia.org/T182860#3847378 (hashar) ``` contint1001$ groups wikidev docker... 
[09:41:58] PROBLEM - DPKG on mwdebug1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:42:35] that's me ^ [09:44:19] (PS3) ArielGlenn: clean up directory setup manifests for dumps nfs and web servers [puppet] - https://gerrit.wikimedia.org/r/398095 [09:49:59] RECOVERY - DPKG on mwdebug1001 is OK: All packages OK [09:53:33] (PS13) Jcrespo: Update mariadb::proxy to the latest style and path locations [puppet] - https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) [09:54:13] (PS14) Jcrespo: Update mariadb::proxy to the latest style and path locations [puppet] - https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) [09:54:58] (CR) Filippo Giunchedi: [C: 2] Add nutcracker_exporter profile [puppet] - https://gerrit.wikimedia.org/r/398847 (https://phabricator.wikimedia.org/T181995) (owner: Filippo Giunchedi) [09:55:07] (PS3) Filippo Giunchedi: Add nutcracker_exporter profile [puppet] - https://gerrit.wikimedia.org/r/398847 (https://phabricator.wikimedia.org/T181995) [09:55:30] (CR) ArielGlenn: [C: 2] clean up directory setup manifests for dumps nfs and web servers [puppet] - https://gerrit.wikimedia.org/r/398095 (owner: ArielGlenn) [09:56:09] (PS4) Filippo Giunchedi: Add nutcracker_exporter profile [puppet] - https://gerrit.wikimedia.org/r/398847 (https://phabricator.wikimedia.org/T181995) [09:59:06] (PS4) Giuseppe Lavagetto: First version of the helm chart scaffolding for production services [deployment-charts] - https://gerrit.wikimedia.org/r/392619 (https://phabricator.wikimedia.org/T177397) [10:00:19] PROBLEM - Check systemd state on mw1187 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:00:20] PROBLEM - Check systemd state on mw1262 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:00:26] <_joe_> akosiaris: I'd merge that change, and declare that done [10:00:31] <_joe_> godog: is that you? ^^ [10:00:54] _joe_: yeah that's me :( I'm taking a look [10:00:54] <_joe_> ● prometheus-nutcracker-exporter.service loaded failed failed Prometheus Nutcracker exporter [10:00:59] <_joe_> yes :) [10:01:10] PROBLEM - Check systemd state on mw2202 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:01:19] PROBLEM - Check systemd state on mw2240 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:01:20] PROBLEM - Check systemd state on mw1188 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:01:29] PROBLEM - Check systemd state on mw2141 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:01:30] PROBLEM - Check systemd state on mw1190 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:01:33] _joe_: yeah sounds fine to me [10:01:36] sigh, it worked ok on mwdebug [10:01:39] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1317 is CRITICAL: Return code of 255 is out of bounds [10:01:39] PROBLEM - configured eth on mw1317 is CRITICAL: Return code of 255 is out of bounds [10:01:40] PROBLEM - Check systemd state on mw2156 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:01:40] PROBLEM - Check systemd state on mw2212 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:01:40] PROBLEM - Check systemd state on mw2255 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:01:40] PROBLEM - Check systemd state on mw1183 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:01:45] I'll rollback, sorry about the spam [10:01:57] <_joe_> godog: don't, let's try to fix it instead [10:02:00] PROBLEM - Check systemd state on mw2223 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:02:09] PROBLEM - Check systemd state on mw2157 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:02:09] (PS2) ArielGlenn: apachedir is available to dumps cron jobs via a bash script, use it [puppet] - https://gerrit.wikimedia.org/r/398106 [10:02:10] PROBLEM - Check systemd state on mw1309 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:02:19] <_joe_> godog: let's disable puppet wherever it didn't run instead [10:02:20] PROBLEM - Check systemd state on mw2137 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:02:21] <_joe_> lemme do it [10:02:29] PROBLEM - Check systemd state on mw2172 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:02:34] _joe_: ok [10:02:39] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:02:39] PROBLEM - Check systemd state on mw2234 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:03:09] PROBLEM - Check systemd state on mw1218 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:03:10] PROBLEM - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:03:19] PROBLEM - Check systemd state on mw1214 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:03:19] PROBLEM - Check whether ferm is active by checking the default input chain on mw1317 is CRITICAL: Return code of 255 is out of bounds [10:03:19] PROBLEM - dhclient process on mw1317 is CRITICAL: Return code of 255 is out of bounds [10:03:25] <_joe_> done [10:03:29] PROBLEM - Check systemd state on mw1322 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:03:34] <_joe_> mw1317 is being reimaged? [10:03:39] yes [10:03:40] PROBLEM - Check systemd state on mw2225 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:03:40] PROBLEM - Check systemd state on mw2144 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:03:59] PROBLEM - Check systemd state on mw1216 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:03:59] PROBLEM - Check systemd state on mw2113 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:04:00] PROBLEM - Check systemd state on mw1204 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:04:00] PROBLEM - Check systemd state on mw2177 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:04:02] <_joe_> ok [10:04:08] (CR) Marostegui: [C: 1] Update mariadb::proxy to the latest style and path locations [puppet] - https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) (owner: Jcrespo) [10:04:09] PROBLEM - Check systemd state on mw1213 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:04:09] PROBLEM - Check systemd state on mw1287 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:04:09] PROBLEM - Check systemd state on mw2122 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:04:10] PROBLEM - Check systemd state on mw1233 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:04:14] <_joe_> godog: puppet is disabled on all those systems btw [10:04:19] PROBLEM - Check systemd state on mw2251 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:04:19] PROBLEM - Check systemd state on mw2231 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:04:30] PROBLEM - Check systemd state on mw2106 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:04:30] PROBLEM - Check systemd state on mw2176 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:04:40] PROBLEM - Check systemd state on mw2201 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:04:40] PROBLEM - Check systemd state on mw1220 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:00] PROBLEM - mediawiki-installation DSH group on mw1317 is CRITICAL: Host mw1317 is not in mediawiki-installation dsh group [10:05:00] PROBLEM - DPKG on mw1317 is CRITICAL: Return code of 255 is out of bounds [10:05:00] PROBLEM - Check systemd state on mw1195 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:09] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:09] PROBLEM - Check systemd state on mw2244 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:11] _joe_: thanks! I'm trying to understand why on those nutcracker answers with connection reset by peer when asked for stats, works ok e.g. on mwdebug [10:05:19] PROBLEM - Check systemd state on mw1232 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:05:20] PROBLEM - Check systemd state on mw2252 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:20] PROBLEM - Check systemd state on mw1295 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:20] PROBLEM - Check systemd state on mw2218 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:22] <_joe_> godog: uhm [10:05:25] <_joe_> lemme see [10:05:29] PROBLEM - Check systemd state on mw2233 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:29] PROBLEM - Check systemd state on mw2245 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:29] PROBLEM - Check systemd state on thumbor2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:29] PROBLEM - Check systemd state on mw2133 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:29] PROBLEM - Check systemd state on mw2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:30] PROBLEM - Check systemd state on mw2101 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:40] PROBLEM - Check systemd state on mw2214 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:40] PROBLEM - Check systemd state on mw1208 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:40] PROBLEM - Check systemd state on mw2220 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:05:49] PROBLEM - Check systemd state on mw1284 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:05:59] PROBLEM - Check systemd state on mw1263 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:06:08] not causing real problems, right? [10:06:09] PROBLEM - Check systemd state on mw2246 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:06:26] <_joe_> jynus: nope [10:06:29] no, but noise [10:06:30] good [10:06:49] PROBLEM - Disk space on mw1317 is CRITICAL: Return code of 255 is out of bounds [10:06:49] PROBLEM - nutcracker port on mw1317 is CRITICAL: Return code of 255 is out of bounds [10:06:49] PROBLEM - Check systemd state on mw1310 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:06:59] PROBLEM - Check systemd state on mw1294 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:07:20] PROBLEM - Check systemd state on mw1293 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:07:49] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:08:29] PROBLEM - nutcracker process on mw1317 is CRITICAL: Return code of 255 is out of bounds [10:08:29] PROBLEM - HHVM processes on mw1317 is CRITICAL: Return code of 255 is out of bounds [10:08:30] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[prometheus-nutcracker-exporter] [10:09:00] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[prometheus-nutcracker-exporter] [10:09:35] <_joe_> godog: on most of the systems where it failed nutcracker returns its stats correctly [10:10:10] PROBLEM - Check systemd state on mw1203 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:10:10] PROBLEM - Check systemd state on mw1242 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:10:10] PROBLEM - HHVM rendering on mw1317 is CRITICAL: connect to address 10.64.16.198 and port 80: Connection refused [10:10:10] PROBLEM - puppet last run on mw1317 is CRITICAL: Return code of 255 is out of bounds [10:10:10] PROBLEM - Check systemd state on mw2169 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:10:12] indeed [10:11:23] (03PS1) 10Muehlenhoff: Add PowerDNS exporter to labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/399152 [10:11:49] (03CR) 10jerkins-bot: [V: 04-1] Add PowerDNS exporter to labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/399152 (owner: 10Muehlenhoff) [10:11:59] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:12:06] <_joe_> godog: uhm, yeah it seems like there is something we don't understand [10:12:14] <_joe_> I'll have to look at the code again [10:12:51] <_joe_> brb [10:12:51] looks like nutcracker closes the connection before python has had time to read all the output [10:13:12] <_joe_> yeah it's possible that's an optimization in order to preserve sockets [10:13:18] <_joe_> under high load [10:13:52] (03CR) 10ArielGlenn: [C: 032] apachedir is available to dumps cron jobs via a bash script, use it [puppet] - 10https://gerrit.wikimedia.org/r/398106 (owner: 10ArielGlenn) [10:14:28] yeah I got the code wrong, it [10:14:28] (03PS2) 10Muehlenhoff: Add PowerDNS exporter to labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/399152 [10:14:29] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:14:40] it just so happened to work in the cases I tested [10:15:07] fixing it [10:15:20] PROBLEM - Check systemd state on thumbor2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:15:20] PROBLEM - Apache HTTP on mw1317 is CRITICAL: connect to address 10.64.16.198 and port 80: Connection refused [10:15:20] PROBLEM - MD RAID on mw1317 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:15:20] PROBLEM - Check systemd state on mw1274 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:17:49] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:18:40] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:18:49] PROBLEM - Check systemd state on mw2253 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
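(Editor's note on the exchange above about nutcracker closing the connection before Python had read all the output. This is a minimal sketch of one common way to read such a stats dump: keep calling recv() until the peer closes, instead of assuming one read returns everything. It is not the actual patch from change 399154. The names read_all and fetch_stats are hypothetical, and it assumes the stats endpoint writes one JSON document and then closes the connection.)

```python
import json
import socket


def read_all(host, port, timeout=2.0, bufsize=4096):
    """Read from a TCP endpoint until the peer closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(bufsize)
            if not data:
                # An empty read means the peer closed its side: we have everything.
                break
            chunks.append(data)
    return b"".join(chunks)


def fetch_stats(host, port):
    """Hypothetical helper: parse the full dump as JSON (assumed format)."""
    return json.loads(read_all(host, port).decode("utf-8"))
```

(A single recv() can return only part of the payload, which would explain code that "just so happened to work" on lightly loaded hosts; looping until EOF does not depend on how the server splits its writes.)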
[10:19:49] PROBLEM - Check systemd state on thumbor2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:20:27] (03PS1) 10Elukey: eventlogging_purging_whitelist.tsv: remove old table [puppet] - 10https://gerrit.wikimedia.org/r/399153 [10:21:52] (03PS2) 10Elukey: eventlogging_purging_whitelist.tsv: remove old table [puppet] - 10https://gerrit.wikimedia.org/r/399153 (https://phabricator.wikimedia.org/T108850) [10:23:19] PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:23:59] PROBLEM - Check systemd state on thumbor1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:24:29] PROBLEM - puppet last run on scb1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[prometheus-nutcracker-exporter] [10:25:39] PROBLEM - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:25:41] (03PS1) 10Filippo Giunchedi: Fix nutcracker metrics fetching [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/399154 (https://phabricator.wikimedia.org/T181995) [10:26:08] _joe_: ^ [10:27:00] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:27:50] <_joe_> godog: seems ok, want me to do a serious review? [10:28:23] 10Operations, 10Cloud-Services, 10Cloud-VPS: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#3847501 (10akosiaris) a:05akosiaris>03None [10:28:51] _joe_: the code is the same as the diamond exporter now so I'll just go ahead [10:29:34] <_joe_> please do [10:30:02] PROBLEM - Check systemd state on scb2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:30:15] (03CR) 10Filippo Giunchedi: [C: 032] Fix nutcracker metrics fetching [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/399154 (https://phabricator.wikimedia.org/T181995) (owner: 10Filippo Giunchedi) [10:30:17] 10Operations, 10OTRS, 10Security: Upgrade OTRS to 5.0.26 - https://phabricator.wikimedia.org/T183228#3847505 (10akosiaris) [10:30:36] 10Operations, 10OTRS, 10Security: Upgrade OTRS to 5.0.26 - https://phabricator.wikimedia.org/T183228#3847518 (10akosiaris) 05Open>03Resolved Upgrade to 5.0.26 done. Resolving. [10:30:43] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] First version of the helm chart scaffolding for production services [deployment-charts] - 10https://gerrit.wikimedia.org/r/392619 (https://phabricator.wikimedia.org/T177397) (owner: 10Giuseppe Lavagetto) [10:33:33] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:34:02] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:34:38] (03PS1) 10Hashar: Bump Jinja2 to 2.10+ [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/399155 [10:34:51] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3847529 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['mw1330.eqiad.wmnet', 'mw1331.eqiad.wmnet... 
[10:35:10] imaging mw133[0,1] --^ [10:37:03] RECOVERY - Check systemd state on mw1204 is OK: OK - running: The system is fully operational [10:38:42] RECOVERY - Check systemd state on mw2017 is OK: OK - running: The system is fully operational [10:38:42] RECOVERY - Check systemd state on mw2172 is OK: OK - running: The system is fully operational [10:38:43] RECOVERY - Check systemd state on mw2157 is OK: OK - running: The system is fully operational [10:38:45] !log rollout updated version of prometheus-nutcracker-exporter [10:38:52] RECOVERY - Check systemd state on mw2141 is OK: OK - running: The system is fully operational [10:38:52] RECOVERY - Check systemd state on mw2101 is OK: OK - running: The system is fully operational [10:38:52] RECOVERY - Check systemd state on mw2106 is OK: OK - running: The system is fully operational [10:38:52] RECOVERY - Check systemd state on mw2176 is OK: OK - running: The system is fully operational [10:38:53] RECOVERY - Check systemd state on mw2156 is OK: OK - running: The system is fully operational [10:38:53] RECOVERY - Check systemd state on mw2201 is OK: OK - running: The system is fully operational [10:38:53] RECOVERY - Check systemd state on mw2212 is OK: OK - running: The system is fully operational [10:38:54] RECOVERY - Check systemd state on mw2144 is OK: OK - running: The system is fully operational [10:38:54] RECOVERY - Check systemd state on mw2225 is OK: OK - running: The system is fully operational [10:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:02] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational [10:39:03] RECOVERY - Check systemd state on mw2122 is OK: OK - running: The system is fully operational [10:39:12] RECOVERY - Check systemd state on mw2113 is OK: OK - running: The system is fully operational [10:39:12] RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational [10:39:12] 
RECOVERY - Check systemd state on mw2223 is OK: OK - running: The system is fully operational [10:39:12] RECOVERY - Check systemd state on mw1287 is OK: OK - running: The system is fully operational [10:39:13] RECOVERY - Check systemd state on mw2177 is OK: OK - running: The system is fully operational [10:39:13] RECOVERY - Check systemd state on mw1218 is OK: OK - running: The system is fully operational [10:39:13] RECOVERY - Check systemd state on mw1203 is OK: OK - running: The system is fully operational [10:39:14] RECOVERY - Check systemd state on mw1242 is OK: OK - running: The system is fully operational [10:39:14] RECOVERY - Check systemd state on mw1309 is OK: OK - running: The system is fully operational [10:39:22] RECOVERY - Check systemd state on mw1274 is OK: OK - running: The system is fully operational [10:39:32] RECOVERY - Check systemd state on mw2202 is OK: OK - running: The system is fully operational [10:39:32] RECOVERY - Check systemd state on thumbor2001 is OK: OK - running: The system is fully operational [10:39:32] RECOVERY - Check systemd state on mw1293 is OK: OK - running: The system is fully operational [10:39:33] RECOVERY - Check systemd state on mw1295 is OK: OK - running: The system is fully operational [10:39:33] RECOVERY - Check systemd state on mw2231 is OK: OK - running: The system is fully operational [10:39:33] RECOVERY - Check systemd state on mw2252 is OK: OK - running: The system is fully operational [10:39:33] RECOVERY - Check systemd state on mw1187 is OK: OK - running: The system is fully operational [10:39:34] RECOVERY - Check systemd state on mw1188 is OK: OK - running: The system is fully operational [10:39:34] RECOVERY - Check systemd state on mw1262 is OK: OK - running: The system is fully operational [10:39:35] RECOVERY - Check systemd state on mw1322 is OK: OK - running: The system is fully operational [10:39:35] RECOVERY - Check systemd state on mw2218 is OK: OK - running: The system is fully operational [10:39:42] 
RECOVERY - Check systemd state on mw2137 is OK: OK - running: The system is fully operational [10:39:42] RECOVERY - Check systemd state on mw2233 is OK: OK - running: The system is fully operational [10:39:42] RECOVERY - Check systemd state on mw2245 is OK: OK - running: The system is fully operational [10:39:42] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational [10:39:42] RECOVERY - Check systemd state on thumbor2003 is OK: OK - running: The system is fully operational [10:39:43] RECOVERY - Check systemd state on mw2133 is OK: OK - running: The system is fully operational [10:39:43] RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational [10:39:44] RECOVERY - Check systemd state on mw1190 is OK: OK - running: The system is fully operational [10:39:52] RECOVERY - Check systemd state on mw1208 is OK: OK - running: The system is fully operational [10:39:52] PROBLEM - puppet last run on scb1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 16 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[prometheus-nutcracker-exporter] [10:39:52] RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational [10:39:52] RECOVERY - Check systemd state on mw1183 is OK: OK - running: The system is fully operational [10:39:53] RECOVERY - Check systemd state on mw1220 is OK: OK - running: The system is fully operational [10:39:53] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational [10:39:53] RECOVERY - Check systemd state on mw2234 is OK: OK - running: The system is fully operational [10:39:54] RECOVERY - Check systemd state on mw2214 is OK: OK - running: The system is fully operational [10:39:54] RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational [10:39:55] RECOVERY - Check systemd state on mw2220 is OK: OK - running: The system is fully operational [10:40:02] RECOVERY - Check systemd state on mw2255 is OK: OK - running: The system is fully operational [10:40:02] RECOVERY - Check systemd state on mw2253 is OK: OK - running: The system is fully operational [10:40:02] RECOVERY - Check systemd state on thumbor2002 is OK: OK - running: The system is fully operational [10:40:02] RECOVERY - Check systemd state on mw1263 is OK: OK - running: The system is fully operational [10:40:03] RECOVERY - Check systemd state on mw1294 is OK: OK - running: The system is fully operational [10:40:03] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [10:40:03] RECOVERY - Check systemd state on thumbor1001 is OK: OK - running: The system is fully operational [10:40:03] RECOVERY - Check systemd state on scb2004 is OK: OK - running: The system is fully operational [10:40:04] RECOVERY - Check systemd state on mw1195 is OK: OK - running: The system is fully operational [10:40:06] \o/ [10:40:12] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational 
[10:40:13] RECOVERY - Check systemd state on mw1213 is OK: OK - running: The system is fully operational [10:40:32] RECOVERY - Check systemd state on mw1233 is OK: OK - running: The system is fully operational [10:40:32] RECOVERY - Check systemd state on mw2246 is OK: OK - running: The system is fully operational [10:40:32] RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational [10:42:12] RECOVERY - Check systemd state on mw1214 is OK: OK - running: The system is fully operational [10:43:22] RECOVERY - Check systemd state on mw1232 is OK: OK - running: The system is fully operational [10:43:32] RECOVERY - Check systemd state on mw2251 is OK: OK - running: The system is fully operational [10:43:32] RECOVERY - Check systemd state on mw2240 is OK: OK - running: The system is fully operational [10:43:52] RECOVERY - Check systemd state on mw1310 is OK: OK - running: The system is fully operational [10:44:02] RECOVERY - Check systemd state on mw1284 is OK: OK - running: The system is fully operational [10:44:02] RECOVERY - Check systemd state on mw2244 is OK: OK - running: The system is fully operational [10:44:03] RECOVERY - Check systemd state on mw1216 is OK: OK - running: The system is fully operational [10:44:05] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3847554 (10elukey) Next steps: 1) image all the hosts in https://gerrit.wikimedia.org/r/397749 and put them in production (January) 2) decom old row C appserve... [10:44:14] <_joe_> godog: can I reenable puppet then? 
[10:44:22] _joe_: yup I can do it too [10:44:32] (03PS15) 10Jcrespo: Update mariadb::proxy to the latest style and path locations [puppet] - 10https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) [10:44:38] <_joe_> done [10:44:48] nice, thanks [10:45:22] !log disabling puppet on dbproxies for 398450 deploy [10:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:42] RECOVERY - Check systemd state on mw2162 is OK: OK - running: The system is fully operational [10:45:50] (03CR) 10Jcrespo: [C: 032] Update mariadb::proxy to the latest style and path locations [puppet] - 10https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [10:45:53] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[prometheus-nutcracker-exporter] [10:46:02] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[prometheus-nutcracker-exporter] [10:46:22] PROBLEM - puppet last run on mw2233 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[prometheus-nutcracker-exporter] [10:46:23] RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time [10:46:32] PROBLEM - puppet last run on mw1310 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[prometheus-nutcracker-exporter] [10:46:32] PROBLEM - puppet last run on mw2252 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[prometheus-nutcracker-exporter] [10:46:32] PROBLEM - puppet last run on mw2251 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[prometheus-nutcracker-exporter] [10:46:43] PROBLEM - puppet last run on mw2253 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[prometheus-nutcracker-exporter] [10:47:10] <_joe_> uh [10:47:12] mhh I thought I had upgraded the exporter everywhere with this cumin query [10:47:22] 'R:class = profile::prometheus::nutcracker_exporter' [10:47:22] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[prometheus-nutcracker-exporter] [10:47:22] RECOVERY - Check systemd state on mw2169 is OK: OK - running: The system is fully operational [10:47:24] clearly not [10:47:29] !log restart zookeeper on conf2001 for jvm updates - T179943 [10:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:40] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943 [10:47:41] <_joe_> godog: those hosts are the ones where puppet didn't run maybe [10:47:53] PROBLEM - puppet last run on mw2101 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[prometheus-nutcracker-exporter] [10:48:11] godog: is it the same of 'R:Service = prometheus-nutcracker-exporter' ? 119 hosts [10:48:39] <_joe_> godog: what's the correct version? [10:48:40] yeah should be volans [10:48:42] PROBLEM - puppet last run on mw2169 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[prometheus-nutcracker-exporter] [10:48:51] _joe_: 0.2 or 0.2~trusty1 on trusty [10:48:57] 'P:prometheus::nutcracker_exporter' gets 129 hosts [10:49:12] <_joe_> godog: it's 0.2 indeed on mw1293 [10:49:26] I guess because puppet ran and failed they never updated puppetdb and thus don't return in the cumin query ? [10:49:38] <_joe_> godog: no it's something else I'd say [10:49:42] <_joe_> lemme see [10:50:22] <_joe_> which is a videoscaler btw [10:50:29] <_joe_> sorry, imagescaler [10:51:52] RECOVERY - configured eth on mw1317 is OK: OK - interfaces up [10:51:53] RECOVERY - nutcracker port on mw1317 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [10:51:53] RECOVERY - Disk space on mw1317 is OK: DISK OK [10:52:12] RECOVERY - DPKG on mw1317 is OK: All packages OK [10:52:23] RECOVERY - Check whether ferm is active by checking the default input chain on mw1317 is OK: OK ferm input default policy is set [10:52:23] RECOVERY - dhclient process on mw1317 is OK: PROCS OK: 0 processes with command name dhclient [10:52:23] RECOVERY - MD RAID on mw1317 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [10:52:42] RECOVERY - nutcracker process on mw1317 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [10:52:42] RECOVERY - HHVM processes on mw1317 is OK: PROCS OK: 6 processes with command name hhvm [10:52:46] !log upgrading pdns-recursor on nescio to 4.0.4+deb9u3~bpo8+1 (security fix) [10:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:06] (03CR) 10ArielGlenn: [C: 032] config setting to permit a list of wikis to be dumped in a specific order [dumps] - 10https://gerrit.wikimedia.org/r/398861 (owner: 10ArielGlenn) [10:53:45] !log reboot conf2001 for kernel updates - T179943 [10:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:58] T179943: Restart Analytics JVM daemons for open-jdk security updates - 
https://phabricator.wikimedia.org/T179943 [10:54:32] RECOVERY - puppet last run on scb1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:54:32] _joe_: tried a new puppet run on mw2251 and it worked, I'll force a run where it failed [10:54:40] <_joe_> yeah [10:54:42] !log ariel@tin Started deploy [dumps/dumps@2bafffe]: allow dump runs in specified wiki list order, rather than by longest to wait [10:54:45] !log ariel@tin Finished deploy [dumps/dumps@2bafffe]: allow dump runs in specified wiki list order, rather than by longest to wait (duration: 00m 02s) [10:54:47] ah snap there is also etcd, always forget [10:54:51] <_joe_> elukey: rebooting conf2001? [10:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:52] RECOVERY - puppet last run on scb1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:54:54] <_joe_> seriosuly? [10:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:10] _joe_ yes yes I am checking etcd now, but it needs to be done [10:55:45] <_joe_> elukey: and it's ok, let's just verify for instance where the mirror is running [10:55:52] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:56:02] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [10:56:13] _joe_ yes sorry I had a moment of "yeah there is only zk on those" and then as soon as I've hit "enter" I realized :D [10:56:22] RECOVERY - puppet last run on mw2233 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [10:56:24] (enter == SAL) [10:56:29] <_joe_> yeah I got that [10:56:31] <_joe_> so [10:56:32] RECOVERY - puppet last run on mw1310 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [10:56:32] RECOVERY - puppet last run on mw2251 is OK: OK: Puppet is currently 
enabled, last run 4 minutes ago with 0 failures [10:56:32] RECOVERY - puppet last run on mw2252 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [10:56:44] <_joe_> conf2001 or the whole cluster in codfw? [10:56:50] (03PS1) 10Hashar: Bump Jinja2 from 2.9.6 to 2.10 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/399157 [10:56:55] _joe_ I'll to the reboots after the freeze, only zk restarts now [10:57:10] <_joe_> elukey: yeah I think that's advisable [10:57:16] yep yep [10:57:20] <_joe_> I mean we can reboot those [10:57:22] RECOVERY - puppet last run on mw1263 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [10:57:29] <_joe_> but I'd wait if possible [10:57:36] +1, brain fault [10:57:47] <_joe_> we'll have to get the new conf* servers in prod in eqiad anyways in january [10:58:44] RECOVERY - puppet last run on mw2169 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:00:31] (03CR) 10Filippo Giunchedi: Add PowerDNS exporter to labservices1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/399152 (owner: 10Muehlenhoff) [11:00:54] PROBLEM - nova-compute process on labvirt1010 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [11:01:35] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1317 is OK: OK: synced at Tue 2017-12-19 11:01:33 UTC. [11:01:44] PROBLEM - MD RAID on mw1318 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:01:44] RECOVERY - puppet last run on mw2253 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:01:54] RECOVERY - nova-compute process on labvirt1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [11:03:47] (03PS1) 10ArielGlenn: enable dumps of big wikis to run in a fixed order [puppet] - 10https://gerrit.wikimedia.org/r/399158 [11:04:01] (03PS1) 10Alexandros Kosiaris: Move role::prometheus::k8s to profile [puppet] - 10https://gerrit.wikimedia.org/r/399159 [11:04:03] (03PS1) 10Alexandros Kosiaris: Introduce profile::prometheus::k8s::staging [puppet] - 10https://gerrit.wikimedia.org/r/399160 [11:04:41] (03CR) 10jerkins-bot: [V: 04-1] Introduce profile::prometheus::k8s::staging [puppet] - 10https://gerrit.wikimedia.org/r/399160 (owner: 10Alexandros Kosiaris) [11:05:05] PROBLEM - Check size of conntrack table on mw1318 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:05:05] PROBLEM - configured eth on mw1318 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:05:33] ^ reimage, silencing [11:06:34] RECOVERY - puppet last run on mw2101 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:10:35] (03PS2) 10Alexandros Kosiaris: Introduce profile::prometheus::k8s::staging [puppet] - 10https://gerrit.wikimedia.org/r/399160 [11:11:39] (03CR) 10ArielGlenn: [C: 04-2] "Do not merge until current dump run completes." [puppet] - 10https://gerrit.wikimedia.org/r/399158 (owner: 10ArielGlenn) [11:13:03] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3847595 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['mw1330.eqiad.wmnet'] ``` The log can be... 
[11:15:50] (03CR) 10Hashar: "Or we can solely bump it in the deploy repo has done via https://gerrit.wikimedia.org/r/#/c/399157/" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/399155 (owner: 10Hashar) [11:16:21] !log uploaded pdns-recursor 4.0.4+deb9u3~bpo8+1 to apt.wikimedia.org [11:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:32] 10Operations, 10Patch-For-Review, 10Prometheus-metrics-monitoring, 10User-fgiunchedi: Port redis statistics to Prometheus - https://phabricator.wikimedia.org/T148637#3847626 (10fgiunchedi) I "promoted" (renamed) the prometheus dashboard to "redis" and the previous to "redis-graphite": https://grafana.wikim... [11:26:22] (03PS1) 10Volans: wmf-auto-reimage: improve resume capabilities [puppet] - 10https://gerrit.wikimedia.org/r/399161 (https://phabricator.wikimedia.org/T182702) [11:26:25] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Bump Jinja2 from 2.9.6 to 2.10 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/399157 (owner: 10Hashar) [11:26:35] (03CR) 10Alexandros Kosiaris: "So, should I include this in the role ? But then codfw is going to be polling the staging cluster. Should I create a new role ? But then w" [puppet] - 10https://gerrit.wikimedia.org/r/399160 (owner: 10Alexandros Kosiaris) [11:27:53] (03CR) 10Giuseppe Lavagetto: "> So, should I include this in the role ? 
But then codfw is going to" [puppet] - 10https://gerrit.wikimedia.org/r/399160 (owner: 10Alexandros Kosiaris) [11:28:07] !log hashar@tin Started deploy [docker-pkg/deploy@09087ad]: Bumping Jinja2 2.9.6..2.10 [11:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:37] !log hashar@tin Finished deploy [docker-pkg/deploy@09087ad]: Bumping Jinja2 2.9.6..2.10 (duration: 00m 30s) [11:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:23] !log upgrading pdns-recursor on maerlant to 4.0.4+deb9u3~bpo8+1 (security fix) [11:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:37] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1331 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:31:37] PROBLEM - configured eth on mw1331 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:33:24] added downtime, new appserver --^ [11:34:03] 10Operations, 10MediaWiki-Configuration, 10discovery-system: [DRAFT] Use EtcdConfig in production to allow automation of a datacenter switch - https://phabricator.wikimedia.org/T182597#3847651 (10Volans) p:05Triage>03Normal [11:35:25] thanks godog (kill+restore topic) [11:36:14] np! [11:36:18] the kill wasn't me tho [11:40:01] hashar, volans: do you know who is in charge these days for beta cluster? looks like enwiki is broken :( https://phabricator.wikimedia.org/T183232 [11:41:06] zeljkof: me not really [11:41:15] but I can try to find out ;) [11:41:38] volans: that was my guess, but I was sure you would know more than I do :) [11:41:44] thanks [11:41:45] (03PS1) 10Filippo Giunchedi: prometheus: add nutcracker job [puppet] - 10https://gerrit.wikimedia.org/r/399163 (https://phabricator.wikimedia.org/T181995) [11:41:49] (03CR) 10Hashar: "Bumping Jinja2 in /deploy fixed it for me ( https://gerrit.wikimedia.org/r/#/c/399157/ )." 
[docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/399155 (owner: 10Hashar) [11:42:26] (03PS1) 10Jcrespo: mariadb: Preparing reimage of dbproxy1001 and setup proxy firewall [puppet] - 10https://gerrit.wikimedia.org/r/399164 (https://phabricator.wikimedia.org/T148507) [11:42:54] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Preparing reimage of dbproxy1001 and setup proxy firewall [puppet] - 10https://gerrit.wikimedia.org/r/399164 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [11:44:41] (03PS2) 10Jcrespo: mariadb: Preparing reimage of dbproxy1001 and setup proxy firewall [puppet] - 10https://gerrit.wikimedia.org/r/399164 (https://phabricator.wikimedia.org/T148507) [11:45:07] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Preparing reimage of dbproxy1001 and setup proxy firewall [puppet] - 10https://gerrit.wikimedia.org/r/399164 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [11:46:02] (03PS3) 10Alexandros Kosiaris: Introduce profile::prometheus::k8s::staging [puppet] - 10https://gerrit.wikimedia.org/r/399160 [11:47:54] (03PS3) 10Jcrespo: mariadb: Preparing reimage of dbproxy1001 and setup proxy firewall [puppet] - 10https://gerrit.wikimedia.org/r/399164 (https://phabricator.wikimedia.org/T148507) [11:48:21] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Preparing reimage of dbproxy1001 and setup proxy firewall [puppet] - 10https://gerrit.wikimedia.org/r/399164 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [11:49:26] RECOVERY - Check size of conntrack table on mw1318 is OK: OK: nf_conntrack is 0 % full [11:49:26] RECOVERY - configured eth on mw1318 is OK: OK - interfaces up [11:49:57] RECOVERY - MD RAID on mw1318 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [11:53:46] RECOVERY - configured eth on mw1331 is OK: OK - interfaces up [11:57:00] (03PS4) 10Jcrespo: mariadb: Preparing reimage of dbproxy1001 and setup proxy firewall [puppet] - 10https://gerrit.wikimedia.org/r/399164 
(https://phabricator.wikimedia.org/T148507) [11:57:44] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Preparing reimage of dbproxy1001 and setup proxy firewall [puppet] - 10https://gerrit.wikimedia.org/r/399164 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [11:59:37] !log CI: switching composer-php55 / composer-package-php55 jobs from Nodepool to Docker | https://gerrit.wikimedia.org/r/#/c/398920/ [11:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:54] (03PS5) 10Jcrespo: mariadb: Preparing reimage of dbproxy1001 and setup proxy firewall [puppet] - 10https://gerrit.wikimedia.org/r/399164 (https://phabricator.wikimedia.org/T148507) [12:00:40] PROBLEM - Check systemd state on mw1330 is CRITICAL: Return code of 255 is out of bounds [12:01:39] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1331 is OK: OK: synced at Tue 2017-12-19 12:01:31 UTC. [12:02:20] PROBLEM - configured eth on mw1330 is CRITICAL: Return code of 255 is out of bounds [12:02:20] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1330 is CRITICAL: Return code of 255 is out of bounds [12:04:09] PROBLEM - dhclient process on mw1330 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:04:09] PROBLEM - Check whether ferm is active by checking the default input chain on mw1330 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:05:49] PROBLEM - DPKG on mw1330 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:05:49] PROBLEM - mediawiki-installation DSH group on mw1330 is CRITICAL: Host mw1330 is not in mediawiki-installation dsh group [12:07:30] PROBLEM - Disk space on mw1330 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:07:30] PROBLEM - nutcracker port on mw1330 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:08:24] silenced [12:09:02] zeljkof: so it seems it should be mostly releng if I'm not mistaken [12:09:13] volans: :D [12:09:27] hashar: do you agree? ;) ^ [12:09:42] I know hashar used to work on it, but I don't think he does any more [12:10:14] and having a quick look at shinken there are a lot of things in alarm, not sure which one is the culprit. For example Puppet has been failing on deployment-mediawikiNN for 5 days due to a duplicate declaration [12:10:17] looks like it's a wikidata problem... (from the error message) [12:11:45] <_joe_> I'm definitely sure deployment-prep is taken care of by releng. [12:12:10] <_joe_> if that has changed, I didn't get the memo, or I didn't read it (both are equally possible) [12:12:40] !log CI: switching mwgate-composer-php70 job from Nodepool to Docker | https://gerrit.wikimedia.org/r/#/c/398921/ [12:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:29] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3847732 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1330.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1330.eqiad.wmnet'] ``` [12:20:06] (03PS4) 10Giuseppe Lavagetto: Create an envoy docker image.
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/396021 [12:25:17] 10Operations: After reimage Puppet order: sudo command failed - https://phabricator.wikimedia.org/T183236#3847744 (10Volans) [12:25:27] 10Operations: After reimage Puppet order: sudo command failed - https://phabricator.wikimedia.org/T183236#3847754 (10Volans) p:05Triage>03Normal [12:29:33] RECOVERY - configured eth on mw1330 is OK: OK - interfaces up [12:29:43] RECOVERY - Disk space on mw1330 is OK: DISK OK [12:29:43] RECOVERY - nutcracker port on mw1330 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:29:53] RECOVERY - Check systemd state on mw1330 is OK: OK - running: The system is fully operational [12:29:54] RECOVERY - DPKG on mw1330 is OK: All packages OK [12:30:14] RECOVERY - Check whether ferm is active by checking the default input chain on mw1330 is OK: OK ferm input default policy is set [12:30:14] RECOVERY - dhclient process on mw1330 is OK: PROCS OK: 0 processes with command name dhclient [12:31:05] (03PS3) 10ArielGlenn: dataset1001 rsync to labs of dumps can now use explicit inclusion list [puppet] - 10https://gerrit.wikimedia.org/r/336204 (https://phabricator.wikimedia.org/T154798) [12:31:36] (03CR) 10jerkins-bot: [V: 04-1] dataset1001 rsync to labs of dumps can now use explicit inclusion list [puppet] - 10https://gerrit.wikimedia.org/r/336204 (https://phabricator.wikimedia.org/T154798) (owner: 10ArielGlenn) [12:32:23] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1330 is OK: OK: synced at Tue 2017-12-19 12:32:18 UTC. 
[12:34:21] (03PS4) 10ArielGlenn: dataset1001 rsync to labs of dumps can now use explicit inclusion list [puppet] - 10https://gerrit.wikimedia.org/r/336204 (https://phabricator.wikimedia.org/T154798) [12:48:49] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3847784 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1330.eqiad.wmnet'] ``` and were **ALL** successful. [12:50:59] (03PS5) 10ArielGlenn: dataset1001 rsync to labs of dumps can now use explicit inclusion list [puppet] - 10https://gerrit.wikimedia.org/r/336204 (https://phabricator.wikimedia.org/T154798) [12:54:18] (03CR) 10Muehlenhoff: [C: 031] Jonas Kress move from ldap to shell, add to groups [puppet] - 10https://gerrit.wikimedia.org/r/398524 (https://phabricator.wikimedia.org/T182908) (owner: 10RobH) [13:03:54] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/398858 (owner: 10Muehlenhoff) [13:05:12] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw133[0-1].eqiad.wmnet [13:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:53] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1318.eqiad.wmnet [13:05:54] volans: confirmed also from my side that everything works fine with wmf-auto-reimage now, thanks! [13:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:35] elukey: good to know, thanks, I've sent a CR for some improvements on resume, JIC ;) [13:07:11] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3847801 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['mw1332.eqiad.wmnet', 'mw1333.eqiad.wmnet... 
[13:07:58] RECOVERY - HHVM rendering on mw1317 is OK: HTTP OK: HTTP/1.1 200 OK - 73999 bytes in 0.145 second response time [13:10:45] (03PS3) 10Muehlenhoff: Add Prometheus exporter for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/398858 [13:13:36] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1317.eqiad.wmnet [13:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:15] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3796474 (10dcausse) I ported elasticsearch-memory and elasticsearch-indexing. - https://grafana-admin.wikimedia.org/dashboard/db/el... [13:15:17] RECOVERY - puppet last run on mw1317 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:33:31] (03PS1) 10Gilles: Smarter Varnish slow log [puppet] - 10https://gerrit.wikimedia.org/r/399176 (https://phabricator.wikimedia.org/T181315) [13:34:01] (03CR) 10jerkins-bot: [V: 04-1] Smarter Varnish slow log [puppet] - 10https://gerrit.wikimedia.org/r/399176 (https://phabricator.wikimedia.org/T181315) (owner: 10Gilles) [13:37:26] (03PS2) 10Gilles: Smarter Varnish slow log [puppet] - 10https://gerrit.wikimedia.org/r/399176 (https://phabricator.wikimedia.org/T181315) [13:37:44] (03CR) 10Muehlenhoff: [C: 032] Add Prometheus exporter for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/398858 (owner: 10Muehlenhoff) [13:37:56] (03CR) 10jerkins-bot: [V: 04-1] Smarter Varnish slow log [puppet] - 10https://gerrit.wikimedia.org/r/399176 (https://phabricator.wikimedia.org/T181315) (owner: 10Gilles) [13:40:14] (03PS3) 10Gilles: Smarter Varnish slow log [puppet] - 10https://gerrit.wikimedia.org/r/399176 (https://phabricator.wikimedia.org/T181315) [13:41:16] (03PS2) 10Muehlenhoff: Add Prometheus scraper configs for WDQS updater and Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/398865 
[13:42:28] 10Operations, 10ops-ulsfo, 10Traffic: cp4032 memory error - https://phabricator.wikimedia.org/T183176#3845938 (10ema) >>! In T183176#3847321, @Volans wrote: > @RobH FYI I've ack'ed the Icinga alert of the host down and set it to downtime until Fri UTC morning. I've just ack'ed all related strongswan alerts... [13:43:18] (03CR) 10Muehlenhoff: [C: 032] Add Prometheus scraper configs for WDQS updater and Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/398865 (owner: 10Muehlenhoff) [13:55:13] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3847941 (10MoritzMuehlenhoff) [13:55:15] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Create Prometheus exporter for wdqs-updater - https://phabricator.wikimedia.org/T182773#3847939 (10MoritzMuehlenhoff) 05Open>03Resolved An exporter has been written, packaged and rolled out. [13:55:18] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3650139 (10MoritzMuehlenhoff) [13:55:21] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Create Prometheus exporter for Blazegraph - https://phabricator.wikimedia.org/T182857#3847942 (10MoritzMuehlenhoff) 05Open>03Resolved An exporter has been written, packaged and rolled out. 
[13:57:56] !log upgrading pdns-recursor on achernar/acamar to 4.0.4+deb9u3~bpo8+1 (security fix) [13:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:54] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/399159 (owner: 10Alexandros Kosiaris) [14:01:47] (03CR) 10Filippo Giunchedi: "LGTM (modulo jenkins' -1)" [puppet] - 10https://gerrit.wikimedia.org/r/392441 (https://phabricator.wikimedia.org/T177196) (owner: 10Alexandros Kosiaris) [14:01:52] PROBLEM - mediawiki-installation DSH group on mw1332 is CRITICAL: Host mw1332 is not in mediawiki-installation dsh group [14:01:52] PROBLEM - DPKG on mw1332 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:02:37] (03PS4) 10Gilles: Smarter Varnish slow log [puppet] - 10https://gerrit.wikimedia.org/r/399176 (https://phabricator.wikimedia.org/T181315) [14:05:01] RECOVERY - mediawiki-installation DSH group on mw1317 is OK: OK [14:05:51] RECOVERY - mediawiki-installation DSH group on mw1330 is OK: OK [14:07:01] (03CR) 10Jcrespo: [C: 031] "Looks ok to me: https://puppet-compiler.wmflabs.org/compiler02/9411/" [puppet] - 10https://gerrit.wikimedia.org/r/399164 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [14:08:20] (03PS6) 10Jcrespo: mariadb: Preparing reimage of dbproxy1001 and setup proxy firewall [puppet] - 10https://gerrit.wikimedia.org/r/399164 (https://phabricator.wikimedia.org/T148507) [14:11:49] 10Operations, 10DBA: Reimage and upgrade to stretch all proxies - https://phabricator.wikimedia.org/T183249#3848018 (10jcrespo) p:05Triage>03Normal [14:12:07] (03PS5) 10Gilles: Smarter Varnish slow log [puppet] - 10https://gerrit.wikimedia.org/r/399176 (https://phabricator.wikimedia.org/T181315) [14:12:17] 10Operations, 10DBA, 10Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#3848045 (10jcrespo) [14:12:19] 10Operations, 10DBA: Reimage and upgrade to stretch all 
proxies - https://phabricator.wikimedia.org/T183249#3848043 (10jcrespo) [14:13:35] (03PS3) 10Ema: mtail: add varnishreqstats.mtail [puppet] - 10https://gerrit.wikimedia.org/r/398819 (https://phabricator.wikimedia.org/T177199) [14:14:20] (03PS7) 10Jcrespo: mariadb: Preparing reimage of dbproxy1001 and setup proxy firewall [puppet] - 10https://gerrit.wikimedia.org/r/399164 (https://phabricator.wikimedia.org/T148507) [14:16:41] (03PS6) 10Gilles: Smarter Varnish slow log [puppet] - 10https://gerrit.wikimedia.org/r/399176 (https://phabricator.wikimedia.org/T181315) [14:16:52] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/399160 (owner: 10Alexandros Kosiaris) [14:18:48] (03CR) 10Filippo Giunchedi: mtail: add varnishreqstats.mtail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398819 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [14:23:02] (03CR) 10Gilles: "Puppet compiler: https://puppet-compiler.wmflabs.org/compiler02/9415/" [puppet] - 10https://gerrit.wikimedia.org/r/399176 (https://phabricator.wikimedia.org/T181315) (owner: 10Gilles) [14:25:01] RECOVERY - DPKG on mw1332 is OK: All packages OK [14:26:47] (03PS2) 10Alexandros Kosiaris: Move role::prometheus::k8s to profile [puppet] - 10https://gerrit.wikimedia.org/r/399159 [14:26:56] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3848099 (10jcrespo) [14:27:01] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Move role::prometheus::k8s to profile [puppet] - 10https://gerrit.wikimedia.org/r/399159 (owner: 10Alexandros Kosiaris) [14:28:26] !log disabling puppet on dbproxies for 399164 deploy [14:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:28] (03PS4) 10Ema: mtail: add varnishreqstats.mtail [puppet] - 10https://gerrit.wikimedia.org/r/398819 (https://phabricator.wikimedia.org/T177199) [14:30:09] (03CR) 
10Jcrespo: [C: 032] mariadb: Preparing reimage of dbproxy1001 and setup proxy firewall [puppet] - 10https://gerrit.wikimedia.org/r/399164 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [14:30:17] (03PS8) 10Jcrespo: mariadb: Preparing reimage of dbproxy1001 and setup proxy firewall [puppet] - 10https://gerrit.wikimedia.org/r/399164 (https://phabricator.wikimedia.org/T148507) [14:30:46] (03CR) 10Alexandros Kosiaris: Introduce profile::prometheus::k8s::staging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/399160 (owner: 10Alexandros Kosiaris) [14:30:56] (03PS4) 10Alexandros Kosiaris: Introduce profile::prometheus::k8s::staging [puppet] - 10https://gerrit.wikimedia.org/r/399160 [14:34:14] (03PS1) 10Catrope: Depool deployment-db04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399182 (https://phabricator.wikimedia.org/T183252) [14:35:35] (03CR) 10jerkins-bot: [V: 04-1] Depool deployment-db04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399182 (https://phabricator.wikimedia.org/T183252) (owner: 10Catrope) [14:35:37] (03PS5) 10Alexandros Kosiaris: Introduce profile::prometheus::k8s::staging [puppet] - 10https://gerrit.wikimedia.org/r/399160 [14:36:09] (03PS5) 10Ema: mtail: add varnishreqstats.mtail [puppet] - 10https://gerrit.wikimedia.org/r/398819 (https://phabricator.wikimedia.org/T177199) [14:37:33] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3848018 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1001.eqiad.wmnet'] ``` The log can be found in `/var/... 
[14:38:46] (03PS3) 10Muehlenhoff: Add PowerDNS exporter to labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/399152 [14:39:08] (03CR) 10Ema: mtail: add varnishreqstats.mtail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398819 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [14:40:37] (03PS1) 10Rush: openstack: labvirt role shuffle [puppet] - 10https://gerrit.wikimedia.org/r/399183 [14:42:43] (03PS2) 10Catrope: Depool deployment-db04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399182 (https://phabricator.wikimedia.org/T183252) [14:44:41] !log installing request-tracker4 update from jessie point release on ununpentium [14:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:03] (03CR) 10BBlack: [C: 031] mtail: add varnishreqstats.mtail [puppet] - 10https://gerrit.wikimedia.org/r/398819 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [14:45:11] (03CR) 10Jcrespo: [C: 031] Depool deployment-db04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399182 (https://phabricator.wikimedia.org/T183252) (owner: 10Catrope) [14:49:23] (03PS2) 10Rush: openstack: labvirt role shuffle [puppet] - 10https://gerrit.wikimedia.org/r/399183 [14:49:33] !log restarting hhvm on canary app servers to pick up security updates for openssl, icu and libx11 [14:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:43] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3848236 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1334.eqiad.wmnet', 'mw1333.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1334.eq... 
[14:51:25] (03CR) 10Catrope: [C: 032] Depool deployment-db04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399182 (https://phabricator.wikimedia.org/T183252) (owner: 10Catrope) [14:52:54] (03Merged) 10jenkins-bot: Depool deployment-db04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399182 (https://phabricator.wikimedia.org/T183252) (owner: 10Catrope) [14:53:13] (03CR) 10jenkins-bot: Depool deployment-db04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399182 (https://phabricator.wikimedia.org/T183252) (owner: 10Catrope) [14:53:40] (03CR) 10Ema: [C: 032] mtail: add varnishreqstats.mtail [puppet] - 10https://gerrit.wikimedia.org/r/398819 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [14:54:51] (03CR) 10Herron: "Thanks Volans!" [puppet] - 10https://gerrit.wikimedia.org/r/398120 (https://phabricator.wikimedia.org/T182819) (owner: 10Herron) [14:55:05] (03CR) 10Herron: [C: 04-2] "Not to be merged until after holiday break" [puppet] - 10https://gerrit.wikimedia.org/r/398120 (https://phabricator.wikimedia.org/T182819) (owner: 10Herron) [14:55:09] (03PS3) 10Rush: openstack: labvirt role shuffle [puppet] - 10https://gerrit.wikimedia.org/r/399183 [14:55:23] yw herron :) [14:55:38] (03PS1) 10Marostegui: Revert "Revert "Revert "db-eqiad.php: Depool db1106""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399186 [14:55:46] (03CR) 10Rush: [C: 032] openstack: labvirt role shuffle [puppet] - 10https://gerrit.wikimedia.org/r/399183 (owner: 10Rush) [14:55:51] (03PS2) 10Marostegui: Revert "Revert "Revert "db-eqiad.php: Depool db1106""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399186 [14:59:08] (03CR) 10Marostegui: [C: 032] Revert "Revert "Revert "db-eqiad.php: Depool db1106""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399186 (owner: 10Marostegui) [15:00:36] (03Merged) 10jenkins-bot: Revert "Revert "Revert "db-eqiad.php: Depool db1106""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399186 (owner: 10Marostegui) 
[15:00:49] (03CR) 10jenkins-bot: Revert "Revert "Revert "db-eqiad.php: Depool db1106""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399186 (owner: 10Marostegui) [15:01:40] (03PS1) 10Muehlenhoff: Add library hint for libx11 [puppet] - 10https://gerrit.wikimedia.org/r/399189 [15:01:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1106 - T161294 (duration: 00m 52s) [15:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:54] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [15:04:06] (03PS1) 10Marostegui: db-eqiad.php: Depool db1109, db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399191 (https://phabricator.wikimedia.org/T161294) [15:07:49] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1109, db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399191 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [15:08:13] (03CR) 10Muehlenhoff: [C: 032] Add library hint for libx11 [puppet] - 10https://gerrit.wikimedia.org/r/399189 (owner: 10Muehlenhoff) [15:09:27] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1109, db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399191 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [15:09:40] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1109, db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399191 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [15:10:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1099:3318 and db1109 - T161294 (duration: 00m 51s) [15:10:50] !log Stop replication in sync on db1109 and db1099:3318 - https://phabricator.wikimedia.org/T161294 [15:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:54] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [15:11:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:48] (03PS1) 10Jcrespo: dbproxy: Apply both regular and cloud only exception for 'cloud' [puppet] - 10https://gerrit.wikimedia.org/r/399194 (https://phabricator.wikimedia.org/T104699) [15:14:29] (03CR) 10Jcrespo: [C: 032] dbproxy: Apply both regular and cloud only exception for 'cloud' [puppet] - 10https://gerrit.wikimedia.org/r/399194 (https://phabricator.wikimedia.org/T104699) (owner: 10Jcrespo) [15:14:43] (03PS3) 10Ottomata: Set superset auth_settings => undef if not using ldap_proxy [puppet] - 10https://gerrit.wikimedia.org/r/396143 (https://phabricator.wikimedia.org/T166689) [15:14:49] (03CR) 10Ottomata: [V: 032 C: 032] Set superset auth_settings => undef if not using ldap_proxy [puppet] - 10https://gerrit.wikimedia.org/r/396143 (https://phabricator.wikimedia.org/T166689) (owner: 10Ottomata) [15:15:38] moritzm: ok to merge? [15:16:05] (03PS1) 10Elukey: role::druid::analytics: lower down all the Xms settings [puppet] - 10https://gerrit.wikimedia.org/r/399195 [15:17:42] jynus: yes, sorry [15:18:22] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [15:18:51] (03CR) 10Elukey: [C: 032] role::druid::analytics: lower down all the Xms settings [puppet] - 10https://gerrit.wikimedia.org/r/399195 (owner: 10Elukey) [15:19:08] (03CR) 10Mforns: [C: 031] "LVGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/399153 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [15:19:20] elukey: my thing can be merged [15:19:24] it's a no-op [15:19:32] i'll let you puppet-merge yours and mine [15:19:40] super [15:20:16] (03PS3) 10Elukey: eventlogging_purging_whitelist.tsv: remove old table [puppet] - 10https://gerrit.wikimedia.org/r/399153 (https://phabricator.wikimedia.org/T108850) [15:20:22] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [15:21:59] (03CR) 10Alexandros Kosiaris: [C: 032] network::constants: drop uranium from monitoring hosts [puppet] - 10https://gerrit.wikimedia.org/r/399119 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [15:22:05] (03PS2) 10Alexandros Kosiaris: network::constants: drop uranium from monitoring hosts [puppet] - 10https://gerrit.wikimedia.org/r/399119 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [15:22:07] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] network::constants: drop uranium from monitoring hosts [puppet] - 10https://gerrit.wikimedia.org/r/399119 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [15:22:58] (03CR) 10Alexandros Kosiaris: [C: 031] remove ganglia_aggregators settings from hiera [puppet] - 10https://gerrit.wikimedia.org/r/399120 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [15:24:43] (03CR) 10Alexandros Kosiaris: [C: 04-1] "The commit message needs some rewording.
It gives the impression we are wondering about how to do things while the commit itself is clear " [puppet] - 10https://gerrit.wikimedia.org/r/382930 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [15:24:57] (03CR) 10Alexandros Kosiaris: [C: 031] ganglia: delete the module [puppet] - 10https://gerrit.wikimedia.org/r/382933 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [15:25:54] !log labvirt10[19|20] aptitude install linux-image-4.4.0-81-generic linux-image-extra-4.4.0-81-generic; sudo update-grub; /sbin/reboot T172538 [15:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:05] T172538: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538 [15:29:49] (03Abandoned) 10Ottomata: [WIP] Add cergen module [puppet] - 10https://gerrit.wikimedia.org/r/391134 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [15:30:21] (03PS6) 10Alexandros Kosiaris: Introduce profile::prometheus::k8s::staging [puppet] - 10https://gerrit.wikimedia.org/r/399160 [15:30:23] (03PS5) 10Alexandros Kosiaris: Add prometheus::postgres_exporter class to users [puppet] - 10https://gerrit.wikimedia.org/r/392441 (https://phabricator.wikimedia.org/T177196) [15:30:57] (03PS6) 10Alexandros Kosiaris: Add prometheus::postgres_exporter class to users [puppet] - 10https://gerrit.wikimedia.org/r/392441 (https://phabricator.wikimedia.org/T177196) [15:31:29] (03CR) 10jerkins-bot: [V: 04-1] Add prometheus::postgres_exporter class to users [puppet] - 10https://gerrit.wikimedia.org/r/392441 (https://phabricator.wikimedia.org/T177196) (owner: 10Alexandros Kosiaris) [15:31:46] damn I hate you jenkins [15:33:45] rip [15:33:46] (03PS7) 10Alexandros Kosiaris: Add prometheus::postgres_exporter class to users [puppet] - 10https://gerrit.wikimedia.org/r/392441 (https://phabricator.wikimedia.org/T177196) [15:38:16] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port 
elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3848419 (10Gehel) Additional missing metrics: * elasticsearch.indices.search.groups.prefix.query_total * elasticsearch.indices.sea... [15:39:16] PROBLEM - NTP peers on achernar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [15:40:16] RECOVERY - NTP peers on achernar is OK: NTP OK: Offset 0.000159 secs [15:40:24] (03PS1) 10Ema: prometheus: add reqstats aggregation rule [puppet] - 10https://gerrit.wikimedia.org/r/399199 (https://phabricator.wikimedia.org/T177199) [15:41:46] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [15:41:46] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [15:42:06] (03CR) 10Muehlenhoff: [C: 031] prometheus: add nutcracker job [puppet] - 10https://gerrit.wikimedia.org/r/399163 (https://phabricator.wikimedia.org/T181995) (owner: 10Filippo Giunchedi) [15:42:46] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset -0.00034 secs [15:42:46] RECOVERY - NTP peers on acamar is OK: NTP OK: Offset -0.00036 secs [15:43:12] (03PS1) 10Jcrespo: mariadb: Fix bug by which the wrong role was being set up on dbproxies [puppet] - 10https://gerrit.wikimedia.org/r/399200 [15:44:30] (03CR) 10Jcrespo: [C: 032] mariadb: Fix bug by which the wrong role was being set up on dbproxies [puppet] - 10https://gerrit.wikimedia.org/r/399200 (owner: 10Jcrespo) [15:45:46] PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [15:46:38] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3848492 (10Gehel) Issue opened upstream to include those metrics: https://github.com/justwatchcom/elasticsearch_exporter/issues/115 [15:46:46] RECOVERY - NTP peers on hydrogen is OK: NTP OK: Offset -1e-06 secs [15:48:26] (03PS2) 10Ema: 
prometheus: add reqstats aggregation rule [puppet] - 10https://gerrit.wikimedia.org/r/399199 (https://phabricator.wikimedia.org/T177199) [15:51:35] (03CR) 10Ema: [C: 032] prometheus: add reqstats aggregation rule [puppet] - 10https://gerrit.wikimedia.org/r/399199 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [15:54:35] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [15:55:05] PROBLEM - NTP peers on maerlant is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [15:55:34] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3848555 (10ema) [15:55:35] RECOVERY - NTP peers on chromium is OK: NTP OK: Offset 0.008719 secs [15:56:05] RECOVERY - NTP peers on maerlant is OK: NTP OK: Offset -0.000284 secs [15:59:41] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3848559 (10Gehel) >>! In T181627#3845066, @Ottomata wrote: > I'm probably doing something wrong with the ~jessie1 and ~stretch1 ver...
[16:01:58] (03PS1) 10Elukey: role::druid::analytics::worker: review jvm configurations [puppet] - 10https://gerrit.wikimedia.org/r/399205 [16:04:19] (03CR) 10Elukey: [C: 032] eventlogging_purging_whitelist.tsv: remove old table [puppet] - 10https://gerrit.wikimedia.org/r/399153 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [16:04:30] (03PS4) 10Elukey: eventlogging_purging_whitelist.tsv: remove old table [puppet] - 10https://gerrit.wikimedia.org/r/399153 (https://phabricator.wikimedia.org/T108850) [16:04:38] (03CR) 10Elukey: [V: 032 C: 032] eventlogging_purging_whitelist.tsv: remove old table [puppet] - 10https://gerrit.wikimedia.org/r/399153 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [16:06:27] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3796474 (10MoritzMuehlenhoff) > As I understand it, Go statically links everything, so the same build should still be good for bot... 
[16:10:40] (03PS1) 10Catrope: Remove temporary read-only setting for beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399207 (https://phabricator.wikimedia.org/T183252) [16:10:59] (03CR) 10Catrope: [C: 032] Remove temporary read-only setting for beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399207 (https://phabricator.wikimedia.org/T183252) (owner: 10Catrope) [16:12:38] (03Merged) 10jenkins-bot: Remove temporary read-only setting for beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399207 (https://phabricator.wikimedia.org/T183252) (owner: 10Catrope) [16:12:52] (03CR) 10jenkins-bot: Remove temporary read-only setting for beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399207 (https://phabricator.wikimedia.org/T183252) (owner: 10Catrope) [16:14:55] (03CR) 10Alexandros Kosiaris: [C: 032] "PCC happy at https://puppet-compiler.wmflabs.org/compiler02/9423/, let's break puppet on those hosts ;-)" [puppet] - 10https://gerrit.wikimedia.org/r/392441 (https://phabricator.wikimedia.org/T177196) (owner: 10Alexandros Kosiaris) [16:15:02] (03PS8) 10Alexandros Kosiaris: Add prometheus::postgres_exporter class to users [puppet] - 10https://gerrit.wikimedia.org/r/392441 (https://phabricator.wikimedia.org/T177196) [16:15:04] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add prometheus::postgres_exporter class to users [puppet] - 10https://gerrit.wikimedia.org/r/392441 (https://phabricator.wikimedia.org/T177196) (owner: 10Alexandros Kosiaris) [16:15:43] 10Operations, 10MediaWiki-Vagrant: Import kibana package from jessie into stretch - https://phabricator.wikimedia.org/T183071#3848623 (10EBernhardson) @gehel It looks like we need to release stretch packages for all our custom elastic stuff (kibana, logstash, es, plugins?). My intuition is that since this is a... 
[16:16:52] (03PS2) 10Volans: Jonas Kress move from ldap to shell, add to groups [puppet] - 10https://gerrit.wikimedia.org/r/398524 (https://phabricator.wikimedia.org/T182908) (owner: 10RobH) [16:18:19] (03PS1) 10Giuseppe Lavagetto: puppet-compiler: fix facts update process [puppet] - 10https://gerrit.wikimedia.org/r/399210 [16:18:55] <_joe_> volans, akosiaris ^^ [16:19:25] 10Operations, 10MediaWiki-Vagrant: Import kibana package from jessie into stretch - https://phabricator.wikimedia.org/T183071#3842659 (10MoritzMuehlenhoff) >>! In T183071#3848623, @EBernhardson wrote: > @gehel It looks like we need to release stretch packages for all our custom elastic stuff (kibana, logstash,... [16:19:42] 10Operations, 10MediaWiki-Vagrant: Import kibana package from jessie into stretch - https://phabricator.wikimedia.org/T183071#3848646 (10Gehel) The elasticsearch / kibana / logstash packages have already been uploaded to our stretch repo (under the thirdparty/elastic55 component). This should fix the issue rep... [16:20:05] (03CR) 10Volans: [C: 032] Jonas Kress move from ldap to shell, add to groups [puppet] - 10https://gerrit.wikimedia.org/r/398524 (https://phabricator.wikimedia.org/T182908) (owner: 10RobH) [16:20:19] (03PS1) 10Awight: Disable the ORES UI on beta wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399211 (https://phabricator.wikimedia.org/T183266) [16:21:44] 10Operations, 10ops-ulsfo, 10Traffic: cp4032 memory error - https://phabricator.wikimedia.org/T183176#3848653 (10RobH) Error codes from ePSA test: Service Tag : 3ND3KH2 Error Code : 2000-0125 Validation : 107826 [16:22:38] !log installing ncurses updates from jessie point release [16:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:04] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Augeas[hba_create-prometheus@localhost] [16:23:23] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:23:54] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK [16:23:54] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [16:23:54] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK [16:23:54] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK [16:24:03] RECOVERY - Host cp4032 is UP: PING OK - Packet loss = 0%, RTA = 78.58 ms [16:24:03] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 114 ESP OK [16:24:03] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 114 ESP OK [16:24:04] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK [16:24:13] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 44 ESP OK [16:24:13] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 114 ESP OK [16:24:13] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 114 ESP OK [16:24:23] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [16:24:23] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 44 ESP OK [16:24:26] 10Operations, 10ops-ulsfo, 10Traffic: cp4032 memory error - https://phabricator.wikimedia.org/T183176#3848685 (10BBlack) Dell info says that code means: `The IPMI system event log is full for various reasons or logging has stopped because too many ECC errors have occurred.` [16:24:35] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [16:24:35] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 44 ESP OK [16:24:35] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [16:24:35] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [16:24:35] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 44 ESP OK [16:24:43] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK [16:24:43] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 114 ESP OK [16:24:44] RECOVERY - IPsec on cp1066 
is OK: Strongswan OK - 44 ESP OK [16:24:53] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 114 ESP OK [16:24:53] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 56 ESP OK [16:25:54] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Augeas[hba_create-prometheus@localhost] [16:27:23] PROBLEM - https://phabricator.wikimedia.org on phab1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string focus on bug not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 4297 bytes in 2.051 second response time [16:27:55] (03CR) 10Halfak: [C: 031] Disable the ORES UI on beta wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399211 (https://phabricator.wikimedia.org/T183266) (owner: 10Awight) [16:28:17] A Troublesome Encounter! [16:28:19] Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL). [16:28:23] ^ is what phab is saying [16:28:57] PROBLEM - puppet last run on maps2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Augeas[hba_create-prometheus@localhost] [16:29:17] it is now back, right? [16:29:17] <_joe_> bblack: known, jynus is on it [16:29:18] RECOVERY - https://phabricator.wikimedia.org on phab1001 is OK: HTTP OK: HTTP/1.1 200 OK - 34525 bytes in 0.297 second response time [16:29:27] <_joe_> can someone look at that augeas thing? [16:29:35] where is phabricator installed? [16:29:36] <_joe_> akosiaris: maps2001 is you? [16:29:39] I am the reason for augeas [16:29:43] <_joe_> jynus: phab1001/2001 [16:29:46] I am fixing... trying to at least [16:29:52] (03CR) 10Awight: [C: 032] "Self-merging "urgent" beta change." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/399211 (https://phabricator.wikimedia.org/T183266) (owner: 10Awight) [16:30:04] 10Operations, 10ops-ulsfo, 10Traffic: cp4032 memory error - https://phabricator.wikimedia.org/T183176#3848695 (10RobH) Yeah, it turns up nothing but the error codes for the actual failed dimm. It doesn't matter much, just helps for the part replacement. SR958387090 is the self dispatch part # for the repl... [16:30:37] so the firewall was the expected, but it is not allowing all phab clients, apparently [16:30:42] bblack: ^ new memory should be here tomorrow, and ill either put it in then or thursday =] [16:32:14] (03PS1) 10Alexandros Kosiaris: Remove single quotes from netbox prometheus [puppet] - 10https://gerrit.wikimedia.org/r/399216 [16:32:55] (03CR) 10Alexandros Kosiaris: [C: 032] Remove single quotes from netbox prometheus [puppet] - 10https://gerrit.wikimedia.org/r/399216 (owner: 10Alexandros Kosiaris) [16:33:07] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:33:27] (03Merged) 10jenkins-bot: Disable the ORES UI on beta wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399211 (https://phabricator.wikimedia.org/T183266) (owner: 10Awight) [16:33:40] (03CR) 10jenkins-bot: Disable the ORES UI on beta wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399211 (https://phabricator.wikimedia.org/T183266) (owner: 10Awight) [16:33:57] PROBLEM - puppet last run on nihal is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 seconds ago with 1 failures. Failed resources (up to 3 shown): Augeas[hba_create-prometheus@localhost] [16:34:07] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Augeas[hba_create-prometheus@localhost] [16:34:17] all these expected ^ kind of [16:34:36] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3848717 (10Ottomata) > While elasticsearch is JVM, the exporter is Go ? Huh, I thought we were talking about prometheus-jmx-expo... [16:34:39] !log manually started eventlogging cleaner on db1107 to purge/sanitize data up to 90 days ago (tmux is running for user eventlogcleaner) - T108850 [16:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:50] T108850: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850 [16:35:59] RoanKattouw: FYI I’m pushing “Remove temporary read-only setting for beta labs” on tin and tin-beta to keep in sync (and cos I have a beta config patch that comes after it). [16:36:10] awight: Thanks and sorry [16:36:38] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399217 [16:36:42] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399217 [16:36:44] RoanKattouw: :D IOU several dozen awesome late-night/weekend saves, don’t be sorry. 
[16:37:00] (03PS1) 10Alexandros Kosiaris: osm::slave: Correct dependency of prometheus [puppet] - 10https://gerrit.wikimedia.org/r/399218 [16:37:23] (03CR) 10Alexandros Kosiaris: [C: 032] osm::slave: Correct dependency of prometheus [puppet] - 10https://gerrit.wikimedia.org/r/399218 (owner: 10Alexandros Kosiaris) [16:37:27] PROBLEM - Check whether ferm is active by checking the default input chain on dbproxy1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:37:54] awight: Thursday's ORES breakage is actually directly responsible for me traveling to a freezing environment without a coat :D [16:38:22] That’s terrible news. Feel free to burn any ORES you still have clinging to your T-shirt. [16:38:31] (I had planned to go home around 4pm to pack, then go to a 6pm meeting, then sleep, but instead I ended up fighting the ORES fire from 4pm till about 5:40pm, and packing hastily around midnight, forgetting my coat) [16:38:37] And buy a new coat… [16:38:42] !log awight@tin Synchronized wmf-config/CirrusSearch-labs.php: wmf-config/CommonSettings-labs.php wmf-config/db-labs.php wmf-config/InitialiseSettings-labs.php wmf-config/interwiki-labs.php wmf-config/jobqueue-labs.php wmf-config/mc-labs.php wmf-config/mobile-labs.php wmf-config/Wikibase-labs.php Sync out labs config changes (duration: 00m 51s) [16:38:45] Oh well, I didn't like that coat anyway, so now I have a good excuse to buy a new one [16:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:57] lol Salvation Army here we come [16:39:27] PROBLEM - Check whether ferm is active by checking the default input chain on dbproxy1005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:39:28] Oh no RoanKattouw! 
[16:40:07] Thankfully it is less freezing now, it's in the mid-40s [16:40:51] halfak: I think we actually owe RoanKattouw a coat at least as nice as the ones we have :p [16:41:05] awight: Don't you live in a warm country now? [16:41:06] RoanKattouw: You ever read this Gogol short story… The Overcoat... [16:41:19] RoanKattouw: ;-) rats you got right to the bottom of that bluff [16:41:28] I dunno man. The coats I carry around cost almost as much as my bike. [16:41:39] I live in the land of necessary winter technology. [16:42:04] Nowhere can be as cold on average as Yakutsk, Russia [16:42:38] PROBLEM - Check whether ferm is active by checking the default input chain on dbproxy1010 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:42:43] SantaC, yeah, on average, MN is pretty warm. But -40 F/C is a common occurrence in a MN winter. [16:42:55] We get hot in the summer and cold in the winter. :) [16:42:59] (03CR) 10Addshore: "Woo!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399211 (https://phabricator.wikimedia.org/T183266) (owner: 10Awight) [16:43:27] Looks like Yakutsk gets a lot colder than MN though [16:43:33] (03CR) 10Addshore: "Seems lame that the solution is to disable the Ui though :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399211 (https://phabricator.wikimedia.org/T183266) (owner: 10Awight) [16:43:36] https://weatherspark.com/y/142848/Average-Weather-in-Yakutsk-Russia-Year-Round [16:43:46] freeze off your face cold [16:44:55] (03CR) 10Awight: "@addshore: I agree that this is a lame workaround, but see the task for details. 
There are two issues, one is that the models just should" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399211 (https://phabricator.wikimedia.org/T183266) (owner: 10Awight) [16:45:25] !log Defragment s7 databases on db1102 - https://phabricator.wikimedia.org/T172169 [16:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:41] (03CR) 10Catrope: "Per the task, there is a maintenance script that we could run to put the DB in a good state, but we can't run it right now because it need" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399211 (https://phabricator.wikimedia.org/T183266) (owner: 10Awight) [16:47:47] PROBLEM - Check whether ferm is active by checking the default input chain on dbproxy1011 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:48:23] (03PS1) 10Jcrespo: mariadb: Fix bug causing too restrictive firewall on some proxies [puppet] - 10https://gerrit.wikimedia.org/r/399221 [16:48:40] awight: To be completely honest I've actually forgotten which country you moved to, I just remembered that it sounded warm :) [16:49:18] RoanKattouw: hehe, not your job to worry. I’m in the Sacred Valley, Peru. Did you go back home to bikelandia for the holidays? [16:49:36] (03PS2) 10Jcrespo: mariadb: Fix bug causing too restrictive firewall on some proxies [puppet] - 10https://gerrit.wikimedia.org/r/399221 [16:49:45] It is rainy season but I’m fine with that. It’s magnificent to watch the irrigation canals here. [16:50:26] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399217 (owner: 10Marostegui) [16:50:30] I did [16:50:47] -1C on Sunday but now back up to about 6C [16:51:40] awight: Nice! When you said you were considering moving away from the suburb you were in, I didn't think you'd go quite that far :D [16:51:44] gross. Don’t go licking any chain link fences. 
[16:52:06] haha I don’t do anything half-assed. Unless it’s writing code and config patches for a top-5 website. [16:53:07] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399217 (owner: 10Marostegui) [16:53:20] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399217 (owner: 10Marostegui) [16:54:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1097:3315 - T161294 (duration: 00m 51s) [16:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:27] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [16:54:31] (03PS1) 10Rush: dumps: add wikidata-primary-sources-tool mount [puppet] - 10https://gerrit.wikimedia.org/r/399223 (https://phabricator.wikimedia.org/T183229) [16:56:10] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10User-Addshore: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3848782 (10Volans) [16:57:02] 10Operations, 10Ops-Access-Requests, 10User-Addshore: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3838373 (10Volans) 05Open>03Resolved a:03Volans All done, resolving. [16:58:26] (03CR) 10Jcrespo: [C: 032] mariadb: Fix bug causing too restrictive firewall on some proxies [puppet] - 10https://gerrit.wikimedia.org/r/399221 (owner: 10Jcrespo) [16:58:46] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3848793 (10EBernhardson) Unfortunately all of the elasticsearch-specific metrics are no exposed over jmx. We can get generic JVM in... 
[17:00:10] !log installing libxv security updates on jessie [17:00:25] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:04] (03PS1) 10Muehlenhoff: Add library hint for libxv [puppet] - 10https://gerrit.wikimedia.org/r/399227 [17:04:17] (03CR) 10Chad: [C: 032] Remove unfinished/broken branch plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399116 (owner: 10Chad) [17:06:14] (03PS1) 10Alexandros Kosiaris: postgresql::user: Differentiate augeas on type=local [puppet] - 10https://gerrit.wikimedia.org/r/399228 [17:06:45] PROBLEM - Check whether ferm is active by checking the default input chain on dbproxy1007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [17:06:46] PROBLEM - Check whether ferm is active by checking the default input chain on dbproxy1004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [17:07:40] (03PS2) 10Chad: Remove unfinished/broken branch plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399116 [17:09:14] (03CR) 10Chad: [C: 031] Remove unfinished/broken branch plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399116 (owner: 10Chad) [17:09:17] (03CR) 10Chad: [C: 032] Remove unfinished/broken branch plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399116 (owner: 10Chad) [17:10:45] (03CR) 10Alexandros Kosiaris: [C: 032] postgresql::user: Differentiate augeas on type=local [puppet] - 10https://gerrit.wikimedia.org/r/399228 (owner: 10Alexandros Kosiaris) [17:11:19] !log purging ferm from dbproxy1002, 3, 6, 9, 10 and 11 [17:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:38] (03Merged) 10jenkins-bot: Remove unfinished/broken branch plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399116 
(owner: 10Chad) [17:13:55] RECOVERY - puppet last run on nihal is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:14:16] (03CR) 10Chad: [C: 032] All kinds of pylint and other style fixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399117 (owner: 10Chad) [17:14:35] RECOVERY - Check whether ferm is active by checking the default input chain on dbproxy1005 is OK: OK ferm input default policy is set [17:15:55] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [17:17:05] PROBLEM - Check whether ferm is active by checking the default input chain on dbproxy1006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [17:17:18] (03CR) 10jenkins-bot: Remove unfinished/broken branch plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399116 (owner: 10Chad) [17:18:05] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:18:05] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:18:56] RECOVERY - puppet last run on maps2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:19:06] RECOVERY - puppet last run on nitrogen is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:19:12] (03Merged) 10jenkins-bot: All kinds of pylint and other style fixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399117 (owner: 10Chad) [17:20:32] (03CR) 10jenkins-bot: All kinds of pylint and other style fixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399117 (owner: 10Chad) [17:23:30] (03PS1) 10Ayounsi: LibreNMS: fix issue where service ircbot is declared twice [puppet] - 10https://gerrit.wikimedia.org/r/399230 [17:25:59] (03PS1) 10Elukey: Revert "role::druid::analytics: lower down all the Xms settings" [puppet] - 
10https://gerrit.wikimedia.org/r/399231 [17:26:06] (03PS2) 10Elukey: Revert "role::druid::analytics: lower down all the Xms settings" [puppet] - 10https://gerrit.wikimedia.org/r/399231 [17:26:29] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Port postgresql metrics to Prometheus - https://phabricator.wikimedia.org/T179306#3848869 (10akosiaris) Apart from netmon2001 who has some puppet issues, all other postgres dbs have now the reporter installed and seemingly working fine. Let's cr... [17:26:35] (03CR) 10Elukey: [C: 032] Revert "role::druid::analytics: lower down all the Xms settings" [puppet] - 10https://gerrit.wikimedia.org/r/399231 (owner: 10Elukey) [17:26:59] (03CR) 10Ayounsi: [C: 032] LibreNMS: fix issue where service ircbot is declared twice [puppet] - 10https://gerrit.wikimedia.org/r/399230 (owner: 10Ayounsi) [17:27:05] (03CR) 10Ayounsi: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9427/" [puppet] - 10https://gerrit.wikimedia.org/r/399230 (owner: 10Ayounsi) [17:27:17] (03PS2) 10Ayounsi: LibreNMS: fix issue where service ircbot is declared twice [puppet] - 10https://gerrit.wikimedia.org/r/399230 [17:27:33] (03CR) 10Ayounsi: [V: 032 C: 032] LibreNMS: fix issue where service ircbot is declared twice [puppet] - 10https://gerrit.wikimedia.org/r/399230 (owner: 10Ayounsi) [17:29:22] (03PS1) 10Andrew Bogott: tools exim: Fixes for our simple route-to-mail-relay setup [puppet] - 10https://gerrit.wikimedia.org/r/399233 (https://phabricator.wikimedia.org/T183171) [17:33:37] 10Operations, 10Cloud-Services, 10MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), 10cloud-services-team (Kanban): Recover "Flominator" svn account for use as a modern developer account - https://phabricator.wikimedia.org/T180813#3848896 (10bd808) 05Open>03Resolved [17:37:20] (03CR) 10Andrew Bogott: [C: 032] tools exim: Fixes for our simple route-to-mail-relay setup [puppet] - 10https://gerrit.wikimedia.org/r/399233 (https://phabricator.wikimedia.org/T183171) 
(owner: 10Andrew Bogott) [17:39:27] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3848913 (10Ottomata) Ah ok, my response was to Filippo asking how to build prometheus-jmx-exporter. [17:39:43] 10Operations, 10DBA, 10Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#3848914 (10jcrespo) Firewall has been enabled on all proxies except the active ones: ``` dbproxy1002.yaml:profile::mariadb::proxy::firewall: 'disabled' dbproxy1003.yaml:profile::ma... [17:40:34] (03PS1) 10Andrew Bogott: tools exim: ipv6 future-proof the exim config [puppet] - 10https://gerrit.wikimedia.org/r/399234 [17:42:38] (03CR) 10Andrew Bogott: [C: 032] tools exim: ipv6 future-proof the exim config [puppet] - 10https://gerrit.wikimedia.org/r/399234 (owner: 10Andrew Bogott) [17:45:52] (03PS4) 10Zoranzoki21: Redirect techblog.wikimedia.org to blog.wikimedia.org/c/technology [puppet] - 10https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: 10Framawiki) [17:46:37] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3848937 (10jcrespo) dbproxy1001 has been successfully reimaged, which joins the already upgraded to stretch dbproxy1004 and dbproxy1009 (although these one have to yet be reconfig... 
[17:58:00] nuria_: just in case you missed it, to let you know about T181952 [17:58:00] T181952: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952 [17:59:29] RECOVERY - Check Varnish expiry mailbox lag on cp4021 is OK: OK: expiry mailbox lag is 0 [18:00:13] volans: let me see [18:00:45] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3848976 (10Nuria) Updating ticket from conversation on e-mail. To grant access two things are needed: - the date at which access will expire... [18:00:56] volans: ah, sorry i thought i had updated that ticket ages ago! [18:01:10] volans: updated now, had talked to tgr|away about it already [18:01:16] volans: good reminder [18:01:30] 10Operations, 10Puppet: Trusty puppet 4 approach - https://phabricator.wikimedia.org/T182894#3837838 (10MoritzMuehlenhoff) >>! In T182894#3841972, @herron wrote: > A first stab at Trusty packages for puppet 4.8.2 and dependencies (hiera, ruby-deep-merge) have been built on boron (in /var/cache/pbuilder/result/...
[18:02:04] nuria_: that's perfect, thanks a lot [18:03:03] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3848980 (10Volans) a:05Nuria>03None [18:04:08] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3848981 (10Tgr) [18:05:53] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3807578 (10Tgr) [18:14:54] !log installing zsh update from stretch point release [18:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:21] (03CR) 10Zoranzoki21: [C: 031] Enable TemplateStyles extension on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394831 (https://phabricator.wikimedia.org/T176082) (owner: 10Jon Harald Søby) [18:25:18] PROBLEM - Disk space on kafka1023 is CRITICAL: DISK CRITICAL - free space: /var/spool/kafka/c 71227 MB (3% inode=99%): /var/spool/kafka/b 118950 MB (6% inode=99%) [18:25:54] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Investigate why ORES logs are being written to syslog despite explicit logging config. Fix. - https://phabricator.wikimedia.org/T182614#3849065 (10awight) This is deployed on the beta cluster, but isn't working. I think I accide... 
[18:26:02] ottomata: FYI ^^^^ [18:26:19] 10Operations: Integrate stretch 9.3 point update - https://phabricator.wikimedia.org/T182655#3849066 (10MoritzMuehlenhoff) These are fully rolled out: xml2 libxkbcommon python2.7 [18:28:50] yeahhhhhh [18:28:51] :) [18:28:55] elukey: ^^^ [18:31:18] PROBLEM - Disk space on kafka1023 is CRITICAL: DISK CRITICAL - free space: /var/spool/kafka/c 71270 MB (3% inode=99%): /var/spool/kafka/b 118078 MB (6% inode=99%) [18:32:49] yeah we investigated it today :) [18:33:00] ottomata: need to step away from keyboard for ~1h, brb [18:39:43] 10Operations, 10ops-eqsin, 10netops: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#3849122 (10ayounsi) 05Open>03Resolved [18:41:18] PROBLEM - Disk space on kafka1023 is CRITICAL: DISK CRITICAL - free space: /var/spool/kafka/c 71155 MB (3% inode=99%): /var/spool/kafka/b 117649 MB (6% inode=99%) [18:46:32] 10Operations, 10Patch-For-Review: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3849127 (10MoritzMuehlenhoff) Unfortunately there won't be rebuilt netinst images until the next point release: https://bugs.debian.org/cgi-bin/bugreport.c... [18:52:46] (03PS7) 10Alexandros Kosiaris: Introduce profile::prometheus::k8s::staging [puppet] - 10https://gerrit.wikimedia.org/r/399160 [18:54:17] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T182853#3849130 (10Cmjohnson) There were 2 failed disks. Replaced both and they're rebuilding Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online,... [18:57:42] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T182853#3849134 (10Marostegui) Thank you! 
[18:58:10] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3780358 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on neodymium.eqiad.wmnet for hosts: ``` ganeti1006.eqiad.wmnet ``` The log can be found in `/var/l... [18:58:42] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3849139 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['ganeti1006.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['ganeti1006.eqiad.wmnet'] ``` [19:01:03] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3849141 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on neodymium.eqiad.wmnet for hosts: ``` ganeti1006.eqiad.wmnet ``` The log can be found in `/var/l... [19:12:28] PROBLEM - HHVM rendering on mw2150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:13:19] RECOVERY - HHVM rendering on mw2150 is OK: HTTP OK: HTTP/1.1 200 OK - 73976 bytes in 0.296 second response time [19:16:28] PROBLEM - Check whether ferm is active by checking the default input chain on dbproxy1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [19:16:59] (03PS1) 10Herron: tools exim: allow relay of unqualified mails via localhost smtp [puppet] - 10https://gerrit.wikimedia.org/r/399240 (https://phabricator.wikimedia.org/T183171) [19:18:22] (03CR) 10Andrew Bogott: [C: 032] tools exim: allow relay of unqualified mails via localhost smtp [puppet] - 10https://gerrit.wikimedia.org/r/399240 (https://phabricator.wikimedia.org/T183171) (owner: 10Herron) [19:18:28] !log gerrit2001 - reboot for kernel upgrade [19:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:49] PROBLEM - puppet last run on puppetmaster1001 is 
CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:23:52] (03PS6) 10ArielGlenn: dataset1001 rsync to labs of dumps can now use explicit inclusion list [puppet] - 10https://gerrit.wikimedia.org/r/336204 (https://phabricator.wikimedia.org/T154798) [19:23:57] !log webperf1001/webperf2001 - rebooting for kernel upgrades (not used yet) [19:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:57] (03PS2) 10Dzahn: remove ganglia_aggregators settings from hiera [puppet] - 10https://gerrit.wikimedia.org/r/399120 (https://phabricator.wikimedia.org/T177225) [19:30:07] (03CR) 10Dzahn: [C: 032] remove ganglia_aggregators settings from hiera [puppet] - 10https://gerrit.wikimedia.org/r/399120 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [19:35:46] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3849185 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['ganeti1006.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['ganeti1006.eqiad.wmnet'] ``` [19:38:21] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 15 minutes ago with 0 failures [19:40:37] (03CR) 10Dzahn: [C: 032] Migrate contint::worker_localhost to a profile [puppet] - 10https://gerrit.wikimedia.org/r/398227 (owner: 10Hashar) [19:40:44] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3849219 (10Volans) @akosiaris if you're trying to reimage those as Jessie, we still have the netinst issue open, so you need to set numa=off to unblock it, see T182702. 
[19:40:46] (03PS2) 10Dzahn: Migrate contint::worker_localhost to a profile [puppet] - 10https://gerrit.wikimedia.org/r/398227 (owner: 10Hashar) [19:48:10] (03CR) 10Dzahn: [C: 032] "webserver was already down since yesterday, removing" [dns] - 10https://gerrit.wikimedia.org/r/399124 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [19:48:51] (03PS7) 10ArielGlenn: dataset1001 rsync to labs of dumps can now use explicit inclusion list [puppet] - 10https://gerrit.wikimedia.org/r/336204 (https://phabricator.wikimedia.org/T154798) [19:49:20] mutante: thanks :) [19:49:23] !log deleted ganglia.wikimedia.org from DNS - webserver was already down since yesterday - not used anymore (T177225) [19:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:34] T177225: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225 [19:49:36] ohh and Ganglia is gone!!! [19:49:50] hashar: welcome! how about the one in integration/config that linked to that :) [19:49:55] one number above [19:49:58] I merged it :) [19:50:01] oK:) [19:50:21] which is really just about chagning: include contint::foo with include profile::ci::foo :D [19:50:27] hashar: yea, very very close to gone [19:50:48] a few more merges, like 'delete the module' [19:51:04] 1814 warnings to go https://integration.wikimedia.org/ci/job/operations-puppet-wmf-style-guide/lastBuild/ :) [19:51:11] and "dont call it ganglia_clusters anymore in Hiera even though it is now unrelated" [19:51:44] hashar: :) i worked on site.pp [19:51:49] for the style count [19:51:57] like appserver includes -9 [19:52:10] ACKNOWLEDGEMENT - Disk space on kafka1023 is CRITICAL: DISK CRITICAL - free space: /var/spool/kafka/c 67310 MB (3% inode=99%): /var/spool/kafka/b 113031 MB (6% inode=99%): ottomata elukey and I will fix this tomorrow EU morning. 
[19:52:29] ah nice [19:52:58] hashar: all site.pp https://gerrit.wikimedia.org/r/#/q/topic:site-includes :p [19:54:06] I have another one for the CI boxes if you dont mind https://gerrit.wikimedia.org/r/#/c/397787/ [19:54:35] which remove XDebug from the list of Zend extensions that are enabled by default [19:54:46] because that slows down PHP :] [19:54:54] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T182853#3849238 (10Marostegui) 05Open>03Resolved All good! Thanks ``` root@db1059:~# megacli -LDInfo -L0 -a0 Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID... [19:55:27] reads.. and yea [19:55:29] RECOVERY - MegaRAID on db1059 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [19:56:09] (03PS3) 10Dzahn: contint: disable XDebug by default [puppet] - 10https://gerrit.wikimedia.org/r/397787 (https://phabricator.wikimedia.org/T175028) (owner: 10Hashar) [19:56:58] (03CR) 10Dzahn: [C: 032] contint: disable XDebug by default [puppet] - 10https://gerrit.wikimedia.org/r/397787 (https://phabricator.wikimedia.org/T175028) (owner: 10Hashar) [19:56:59] (03PS1) 10Rush: openstack: whitelist kernel versions for compute [puppet] - 10https://gerrit.wikimedia.org/r/399243 [19:57:25] (03CR) 10jerkins-bot: [V: 04-1] openstack: whitelist kernel versions for compute [puppet] - 10https://gerrit.wikimedia.org/r/399243 (owner: 10Rush) [20:00:08] mutante: the PHPUnit tests thank you in advance :D [20:00:19] [contint1001:~] $ /usr/sbin/php5query -s cli -m xdebug [20:00:19] No module matches xdebug [20:00:25] i didnt see puppet do anything [20:00:32] but the module is already not loaded [20:00:33] cause I already cherry picked it [20:00:41] and contint1001 probably doesn't have that class enabled anyway :] [20:01:01] contint1001 just host Zuul/Jenkins, all the rest is in labs/cloudVps/xxxthing :) [20:01:10] oh, i expected this one was contin1001-affecting [20:01:14] as opposed to the one before 
[20:01:23] ok [20:09:46] (03PS2) 10Dzahn: ganglia: delete the module [puppet] - 10https://gerrit.wikimedia.org/r/382933 (https://phabricator.wikimedia.org/T177225) [20:10:05] (03CR) 10jerkins-bot: [V: 04-1] ganglia: delete the module [puppet] - 10https://gerrit.wikimedia.org/r/382933 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:13:22] (03PS3) 10Dzahn: ganglia: delete the module [puppet] - 10https://gerrit.wikimedia.org/r/382933 (https://phabricator.wikimedia.org/T177225) [20:17:40] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3849296 (10Cmjohnson) Disks are wiped [20:17:59] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10hardware-requests: Decommission db104[67] - https://phabricator.wikimedia.org/T181784#3849298 (10Cmjohnson) Disks are wiped [20:18:09] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10hardware-requests: Decommission db104[67] - https://phabricator.wikimedia.org/T181784#3849299 (10Cmjohnson) [20:18:26] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3849300 (10Cmjohnson) [20:18:35] (03PS1) 10Dzahn: redis: delete ganglia monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/399248 [20:18:44] (03PS1) 10Hashar: contint: worker_localhost had the jenkins user hardcoded [puppet] - 10https://gerrit.wikimedia.org/r/399249 [20:18:44] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1015 - https://phabricator.wikimedia.org/T173570#3849302 (10Cmjohnson) [20:19:00] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1021 - https://phabricator.wikimedia.org/T181378#3849305 (10Cmjohnson) [20:19:03] (03CR) 10jerkins-bot: [V: 04-1] redis: delete ganglia monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/399248 (owner: 10Dzahn) [20:19:18] 10Operations, 10ops-eqiad, 
10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1026 - https://phabricator.wikimedia.org/T174763#3849306 (10Cmjohnson) [20:19:38] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1045 - https://phabricator.wikimedia.org/T174806#3849307 (10Cmjohnson) [20:19:54] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1049 - https://phabricator.wikimedia.org/T175264#3849308 (10Cmjohnson) [20:20:08] (03CR) 10Dzahn: [C: 032] ganglia: delete the module [puppet] - 10https://gerrit.wikimedia.org/r/382933 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:20:17] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1050 - https://phabricator.wikimedia.org/T178162#3849310 (10Cmjohnson) [20:20:22] (03PS4) 10Dzahn: ganglia: delete the module [puppet] - 10https://gerrit.wikimedia.org/r/382933 (https://phabricator.wikimedia.org/T177225) [20:21:28] mutante: despite rspec testing, I failled one of the change for contint::worker_localhost The fix up being https://gerrit.wikimedia.org/r/399249 :D [20:22:59] yea, but i need to finish this one i am at. in a minute [20:23:15] no worries, it is not that urgent :] [20:23:19] (03PS2) 10Dzahn: redis: delete ganglia monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/399248 [20:23:43] (03CR) 10jerkins-bot: [V: 04-1] redis: delete ganglia monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/399248 (owner: 10Dzahn) [20:24:54] hashar: you know what's my most common reason for -1 probably? 
The space between "Bug:" and "T12345" in the commit message :) [20:24:55] T12345: Create "annotation" namespace on Hebrew Wikisource - https://phabricator.wikimedia.org/T12345 [20:25:14] (03PS3) 10Dzahn: redis: delete ganglia monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/399248 (https://phabricator.wikimedia.org/T177225) [20:25:39] just keep doing it all the time [20:25:42] mutante: you can probalby have it checked automatically whenever you do a commit [20:25:53] or find a vim rule to highlight it in red hehe [20:26:16] (03PS2) 10Dzahn: contint: worker_localhost had the jenkins user hardcoded [puppet] - 10https://gerrit.wikimedia.org/r/399249 (owner: 10Hashar) [20:26:34] hashar: yea, that's right, i should highlight it :) [20:27:45] (03CR) 10Dzahn: [C: 032] contint: worker_localhost had the jenkins user hardcoded [puppet] - 10https://gerrit.wikimedia.org/r/399249 (owner: 10Hashar) [20:29:39] (03PS4) 10Dzahn: redis: delete ganglia monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/399248 (https://phabricator.wikimedia.org/T177225) [20:29:41] (03PS5) 10Dzahn: redis: delete ganglia monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/399248 (https://phabricator.wikimedia.org/T177225) [20:31:13] mutante: https://phabricator.wikimedia.org/P6488 :) [20:31:22] ie run the tests before doing a commit [20:32:21] ah :) thanks! [20:32:50] -j1 is to run the tasks serially, that usually makes it easier to spot the issue [20:34:53] "The page you requested was not found, or you do not have permission to view this page." eh.. Gerrit? [20:36:34] and it's fine [20:38:35] mutante: puppet fixed !:) [20:39:04] hashar: :) [20:40:11] mutante it's a draft you are trying to view. [20:40:28] ohh :) [20:40:57] i did not intend to use the draft feature [20:41:22] did you push it normally like git push? 
or did you add %draft [20:41:44] git review [20:41:50] and some rebasing [20:42:14] Using the rest api i got some details on the patches [20:42:20] how do you turn it from draft to final [20:42:22] if you cant access it [20:42:32] https://phabricator.wikimedia.org/P6489 [20:43:03] mutante requires an admin to do it i think. Though i've never switched the patch from drafts to normal patch using the command line. [20:43:25] it can just be abandoned, i wouldnt care [20:43:31] very easy to redo [20:43:55] and thanks for pointing this out, i really dont know why it became a draft [20:44:53] here's an updated view https://phabricator.wikimedia.org/P6490 [20:44:53] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3849389 (10Marostegui) 05stalled>03Open [20:45:59] ok, how do you know it's a draft from that? [20:46:44] looks like it is not a draft. [20:46:54] as the status is new and not draft. [20:47:59] hmm, ok, well, as i said, if an admin can just abandon it that's fine [20:48:07] i'll just need to get lunch for now [20:48:34] i'll just make a new one later [20:48:41] ok [20:48:56] (and this one can still be investigated) cu in a bit [20:49:04] (03PS1) 10Ottomata: Add nrpe check_newest_file_age; monitor some analytics file backups [puppet] - 10https://gerrit.wikimedia.org/r/399255 (https://phabricator.wikimedia.org/T182327) [20:49:08] (03PS1) 10Dduvall: Use sed instead of envsubst [deployment-charts] - 10https://gerrit.wikimedia.org/r/399256 [20:53:05] (03CR) 10Ottomata: [C: 032] "Looks good i think! 
https://puppet-compiler.wmflabs.org/compiler02/9433/analytics1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/399255 (https://phabricator.wikimedia.org/T182327) (owner: 10Ottomata) [20:54:53] (03PS8) 10ArielGlenn: dataset1001 rsync to labs of dumps can now use explicit inclusion list [puppet] - 10https://gerrit.wikimedia.org/r/336204 (https://phabricator.wikimedia.org/T154798) [20:55:17] mutante you can try and do curl --user username:password -X POST https://gerrit.wikimedia.org/r/a/changes/399248/abandon [20:56:33] (03PS9) 10ArielGlenn: dataset1001 rsync to labs of dumps can now use explicit inclusion list [puppet] - 10https://gerrit.wikimedia.org/r/336204 (https://phabricator.wikimedia.org/T154798) [20:57:50] (03CR) 10ArielGlenn: [C: 032] dataset1001 rsync to labs of dumps can now use explicit inclusion list [puppet] - 10https://gerrit.wikimedia.org/r/336204 (https://phabricator.wikimedia.org/T154798) (owner: 10ArielGlenn) [21:02:27] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Create Prometheus exporter for wdqs-updater - https://phabricator.wikimedia.org/T182773#3834187 (10Gehel) I updated the grafana dashboard as well: https://grafana-admin.wikimedia.org/dashboard/db/wikidata-query-s... [21:02:28] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Create Prometheus exporter for Blazegraph - https://phabricator.wikimedia.org/T182857#3836850 (10Gehel) I updated the grafana dashboard as well: https://grafana-admin.wikimedia.org/dashboard/db/wikidata-query-ser... 
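The curl command suggested at 20:55:17 could be sketched in Python roughly as follows. This is only an illustration: it builds the same authenticated POST to the `/a/changes/399248/abandon` URL quoted in the chat but does not send it. The helper names `build_abandon_request` and `parse_gerrit_json` are hypothetical, and the optional `message` field and the `)]}'` response prefix are assumptions taken from Gerrit's REST API documentation, not from this log.

```python
import base64
import json
import urllib.request

# Base URL taken from the curl command in the chat.
GERRIT_BASE = "https://gerrit.wikimedia.org/r"


def build_abandon_request(change_number, username, http_password, message=None):
    """Build (but do not send) an authenticated 'abandon' request for a change.

    Hypothetical helper: mirrors `curl --user username:password -X POST .../abandon`.
    """
    url = f"{GERRIT_BASE}/a/changes/{change_number}/abandon"
    # Assumption: an optional "message" field is accepted in the JSON body.
    body = json.dumps({"message": message} if message else {}).encode("utf-8")
    token = base64.b64encode(
        f"{username}:{http_password}".encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json; charset=UTF-8",
        },
    )


def parse_gerrit_json(raw):
    """Parse a Gerrit JSON response, dropping the leading )]}' guard line if present."""
    text = raw.decode("utf-8")
    if text.startswith(")]}'"):
        _, _, text = text.partition("\n")
    return json.loads(text)
```

To actually send it, one would pass the request to `urllib.request.urlopen` and feed the response bytes to `parse_gerrit_json`; whether the account has permission to abandon the change is a separate question, as the discussion above suggests.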
[21:04:49] (03PS2) 10Rush: openstack: whitelist kernel versions for compute [puppet] - 10https://gerrit.wikimedia.org/r/399243 [21:05:16] (03CR) 10jerkins-bot: [V: 04-1] openstack: whitelist kernel versions for compute [puppet] - 10https://gerrit.wikimedia.org/r/399243 (owner: 10Rush) [21:08:13] (03PS3) 10Rush: openstack: whitelist kernel versions for compute [puppet] - 10https://gerrit.wikimedia.org/r/399243 [21:08:56] (03PS1) 10Ottomata: Allow nagios to sudo to check analytics database backup newest file age [puppet] - 10https://gerrit.wikimedia.org/r/399260 (https://phabricator.wikimedia.org/T182327) [21:09:17] (03PS1) 10ArielGlenn: add dumpsgen user to labstore1003 for dumps cron cleanup job [puppet] - 10https://gerrit.wikimedia.org/r/399261 (https://phabricator.wikimedia.org/T154798) [21:11:36] (03CR) 10ArielGlenn: [C: 032] add dumpsgen user to labstore1003 for dumps cron cleanup job [puppet] - 10https://gerrit.wikimedia.org/r/399261 (https://phabricator.wikimedia.org/T154798) (owner: 10ArielGlenn) [21:12:37] (03PS2) 10Ottomata: Allow nagios to sudo to check analytics backup newest file age [puppet] - 10https://gerrit.wikimedia.org/r/399260 (https://phabricator.wikimedia.org/T182327) [21:14:33] (03PS3) 10Ottomata: Allow nagios to sudo to check analytics backup newest file age [puppet] - 10https://gerrit.wikimedia.org/r/399260 (https://phabricator.wikimedia.org/T182327) [21:16:02] (03PS4) 10Rush: openstack: whitelist kernel versions for compute [puppet] - 10https://gerrit.wikimedia.org/r/399243 [21:26:42] (03CR) 10Ottomata: [V: 032 C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9435/analytics1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/399260 (https://phabricator.wikimedia.org/T182327) (owner: 10Ottomata) [21:31:31] (03PS1) 10Andrew Bogott: Bigbrother: pass in a giant shell string to subprocess [puppet] - 10https://gerrit.wikimedia.org/r/399262 (https://phabricator.wikimedia.org/T183171) [21:31:55] (03CR) 10jerkins-bot: [V: 04-1] 
Bigbrother: pass in a giant shell string to subprocess [puppet] - 10https://gerrit.wikimedia.org/r/399262 (https://phabricator.wikimedia.org/T183171) (owner: 10Andrew Bogott) [21:34:40] (03CR) 10Zhuyifei1999: Bigbrother: pass in a giant shell string to subprocess (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/399262 (https://phabricator.wikimedia.org/T183171) (owner: 10Andrew Bogott) [21:47:03] (03PS5) 10Rush: openstack: whitelist kernel versions for compute [puppet] - 10https://gerrit.wikimedia.org/r/399243 [21:47:43] (03PS7) 10Rush: openstack: whitelist kernel versions for compute [puppet] - 10https://gerrit.wikimedia.org/r/399243 [21:51:18] !log removing local-as AS43821 from ams transits - T167840 [21:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:29] T167840: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840 [22:00:11] (03CR) 1020after4: [C: 031] Fix linewrap issue on wikimedia error page [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42) [22:18:31] (03PS2) 10Andrew Bogott: Bigbrother: build restart command out of a big list [puppet] - 10https://gerrit.wikimedia.org/r/399262 (https://phabricator.wikimedia.org/T183171) [22:21:15] (03PS3) 10Zoranzoki21: Add xpda.com to $wgCopyUploadDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398702 (https://phabricator.wikimedia.org/T183073) [22:30:18] (03PS3) 10Andrew Bogott: Bigbrother: build restart command out of a big list [puppet] - 10https://gerrit.wikimedia.org/r/399262 (https://phabricator.wikimedia.org/T183171) [22:31:06] (03PS4) 10Andrew Bogott: Bigbrother: build restart command out of a big list [puppet] - 10https://gerrit.wikimedia.org/r/399262 (https://phabricator.wikimedia.org/T183171) [22:33:54] (03CR) 10Andrew Bogott: [C: 031] openstack: whitelist kernel versions for compute [puppet] - 10https://gerrit.wikimedia.org/r/399243 (owner: 10Rush) [22:44:23] (03CR) 
10BryanDavis: Bigbrother: build restart command out of a big list (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/399262 (https://phabricator.wikimedia.org/T183171) (owner: 10Andrew Bogott) [22:47:56] (03PS5) 10Andrew Bogott: Bigbrother: build restart command out of a big list [puppet] - 10https://gerrit.wikimedia.org/r/399262 (https://phabricator.wikimedia.org/T183171) [22:50:01] (03PS6) 10Andrew Bogott: Bigbrother: build restart command out of a big list [puppet] - 10https://gerrit.wikimedia.org/r/399262 (https://phabricator.wikimedia.org/T183171) [22:50:34] (03CR) 10Andrew Bogott: [C: 032] Bigbrother: build restart command out of a big list [puppet] - 10https://gerrit.wikimedia.org/r/399262 (https://phabricator.wikimedia.org/T183171) (owner: 10Andrew Bogott) [22:51:47] (03PS1) 10Hashar: contint: convert Apache proxying to profiles [puppet] - 10https://gerrit.wikimedia.org/r/399311 [22:54:50] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban, 10User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2192798 (10Nuria) Can we go ahead and close ticket? [22:59:41] (03CR) 10Hashar: [C: 04-1] contint: convert Apache proxying to profiles [puppet] - 10https://gerrit.wikimedia.org/r/399311 (owner: 10Hashar) [23:00:16] (03PS2) 10Hashar: contint: convert Apache proxying to profiles [puppet] - 10https://gerrit.wikimedia.org/r/399311 [23:02:47] (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/compiler02/9437/ . 
Apparently only class get renamed:" [puppet] - 10https://gerrit.wikimedia.org/r/399311 (owner: 10Hashar) [23:04:12] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3849840 (10Nuria) 05Open>03Resolved [23:19:33] PROBLEM - parsoid on ruthenium is CRITICAL: connect to address 10.64.16.151 and port 8142: Connection refused [23:58:29] (03PS1) 10Cmjohnson: Adding entries for db1113 and 1114 T182896 [puppet] - 10https://gerrit.wikimedia.org/r/399314
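The Bigbrother change above went from "pass in a giant shell string to subprocess" to "build restart command out of a big list". The actual patch is not shown in this log, so the following is only a generic sketch of the second approach: the command is assembled as a list of arguments and run without a shell. The function name `build_command` and the arguments are hypothetical, and the Python interpreter itself stands in for the real restart command.

```python
import subprocess
import sys


def build_command(tool_name, extra_args):
    """Assemble a command as a list of arguments (hypothetical stand-in command).

    The child process here just prints the arguments it received, so the
    effect of list-style invocation is visible.
    """
    return [
        sys.executable,
        "-c",
        "import sys; print(sys.argv[1:])",
        tool_name,
        *extra_args,
    ]


def run_command(cmd):
    """Run the argument list directly, with no shell involved."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Spaces and shell metacharacters stay inside single arguments
    # because no shell ever parses the command.
    print(run_command(build_command("my tool", ["--flag; echo pwned"])))
```

With a list, each element reaches the child process as exactly one argument, so no quoting or escaping is needed; with a single shell string, the same input would have to be quoted by hand.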