[00:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171213T0000). [00:00:04] Jhs and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:53] (03CR) 10EBernhardson: [C: 032] Setup MLR AB test for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397582 (https://phabricator.wikimedia.org/T182616) (owner: 10EBernhardson) [00:03:18] kaldari: just because it seems you might know the most about it (i dunno, guessing) does https://gerrit.wikimedia.org/r/#/c/397129/1 in swat look reasonable? [00:03:22] (03Merged) 10jenkins-bot: Setup MLR AB test for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397582 (https://phabricator.wikimedia.org/T182616) (owner: 10EBernhardson) [00:03:38] (03CR) 10jenkins-bot: Setup MLR AB test for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397582 (https://phabricator.wikimedia.org/T182616) (owner: 10EBernhardson) [00:05:38] !log ebernhardson@tin Synchronized wmf-config/: SWAT: T182616: Setup MLR AB test for hewiki (duration: 01m 10s) [00:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:50] T182616: Re-run AB test for Hebrew Wikipedia (has > 1% of search traffic) with new model - https://phabricator.wikimedia.org/T182616 [00:10:34] (03CR) 10EBernhardson: [C: 032] Turn on MLR for most wikis with >1% of search traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397970 (owner: 10EBernhardson) [00:10:50] (03PS3) 10EBernhardson: Turn on MLR for most wikis with >1% of search traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397970 
[00:11:33] (03CR) 10EBernhardson: [C: 032] Turn on MLR for most wikis with >1% of search traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397970 (owner: 10EBernhardson) [00:13:09] (03Merged) 10jenkins-bot: Turn on MLR for most wikis with >1% of search traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397970 (owner: 10EBernhardson) [00:13:22] (03CR) 10jenkins-bot: Turn on MLR for most wikis with >1% of search traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397970 (owner: 10EBernhardson) [00:18:13] (03PS1) 10Dzahn: mail::mx: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/397986 (https://phabricator.wikimedia.org/T177225) [00:20:38] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.639 second response time [00:25:48] (03PS1) 10EBernhardson: Turn off a couple search ranking models that arnt ready [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397988 [00:25:59] (03CR) 10jerkins-bot: [V: 04-1] Turn off a couple search ranking models that arnt ready [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397988 (owner: 10EBernhardson) [00:26:07] (03PS2) 10EBernhardson: Turn off a couple search ranking models that arnt ready [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397988 [00:26:43] (03CR) 10Dzahn: [C: 032] "exim stats via prometheus have been done in https://phabricator.wikimedia.org/T179565 so we can do this as well now :)" [puppet] - 10https://gerrit.wikimedia.org/r/397986 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [00:26:55] (03PS3) 10EBernhardson: Turn off a couple search ranking models that arnt ready [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397988 [00:28:11] (03CR) 10jerkins-bot: [V: 04-1] Turn off a couple search ranking models that arnt ready [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/397988 (owner: 10EBernhardson) [00:28:59] (03PS4) 10EBernhardson: Turn off a couple search ranking models that arnt ready [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397988 [00:30:38] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.354 second response time [00:31:07] !log mholloway-shell@tin Started deploy [mobileapps/deploy@5832a8c]: Update mobileapps to bfc3588 [00:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:20] !log restarted requeueTranscodes.php on terbium for mp3 audio generation backfill (had dropped DB connection) [00:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:37] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [00:34:33] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Port exim statistics to Prometheus - https://phabricator.wikimedia.org/T179565#3728881 (10Dzahn) removed ganglia from mx1001/mx2001 [00:35:07] (03CR) 10EBernhardson: [C: 032] Turn off a couple search ranking models that arnt ready [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397988 (owner: 10EBernhardson) [00:35:54] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@5832a8c]: Update mobileapps to bfc3588 (duration: 04m 48s) [00:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:33] (03Merged) 10jenkins-bot: Turn off a couple search ranking models that arnt ready [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397988 (owner: 10EBernhardson) [00:36:43] (03CR) 10jenkins-bot: Turn off a couple search ranking models that arnt ready [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397988 (owner: 10EBernhardson) [00:40:33] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Turn on cirrus MLR for most wikis with >1% of search 
traffic (duration: 01m 08s) [00:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:42] !log ebernhardson@tin Synchronized php-1.31.0-wmf.12/extensions/WikimediaEvents/: SWAT: T182616: turn on second mlr ab test for hewiki (duration: 01m 08s) [00:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:51] T182616: Re-run AB test for Hebrew Wikipedia (has > 1% of search traffic) with new model - https://phabricator.wikimedia.org/T182616 [00:47:40] Jhs: last call on your swat patch [00:47:47] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 578 bytes in 18.029 second response time [00:49:48] !log ebernhardson@tin Synchronized php-1.31.0-wmf.11/extensions/WikimediaEvents/: SWAT: T182616: turn on second mlr ab test for hewiki (duration: 01m 08s) [00:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:59] T182616: Re-run AB test for Hebrew Wikipedia (has > 1% of search traffic) with new model - https://phabricator.wikimedia.org/T182616 [00:52:01] (03PS1) 10Dzahn: mariadb::parsercache: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/397990 (https://phabricator.wikimedia.org/T177225) [00:53:17] (03CR) 10Dzahn: "can i try it again with this one next? i can disable all notifications on Icinga first and promise to immediately clean up (kill gmond pro" [puppet] - 10https://gerrit.wikimedia.org/r/397990 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [00:54:45] i'm calling swat done [01:05:14] (03CR) 10EBernhardson: "that sounds great elukey, thanks!" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/395923 (https://phabricator.wikimedia.org/T182276) (owner: 10EBernhardson) [01:11:08] argh, i missed it. oh well [01:11:55] ? [01:12:22] Reedy, the swat, i was scheduled but then forgot the time. [01:12:33] Anything exciting? 
[01:12:41] nope [01:18:42] Jhs: i bet if you ask very nicely someone will still do it (maybe) [01:29:44] Zppix, eh, just added it to tomorrow's swat instead. no rush :) [01:45:40] (03PS1) 10Chad: Releases jenkins: Ensure php-curl is present (version doesn't matter) [puppet] - 10https://gerrit.wikimedia.org/r/397993 [01:55:58] * SantaC pokes legoktm [02:05:26] 10Operations, 10MediaWiki-Shell, 10Wikimedia-General-or-Unknown, 10Security: Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584#3833150 (10Legoktm) [02:13:57] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [02:14:57] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8194076 keys, up 5 minutes 13 seconds - replication_delay is 0 [02:16:57] PROBLEM - Check health of redis instance on 6380 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380 [02:17:57] RECOVERY - Check health of redis instance on 6380 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 3494384 keys, up 5 minutes 15 seconds - replication_delay is 0 [02:24:25] 10Operations, 10MediaWiki-Shell, 10Wikimedia-General-or-Unknown, 10Security: Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584#3833222 (10Legoktm) [02:24:39] 10Operations, 10MediaWiki-Shell, 10Wikimedia-General-or-Unknown, 10Security: Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584#3502923 (10Legoktm) [02:26:15] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.11) (duration: 07m 02s) [02:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:14] (03PS2) 10Andrew Bogott: puppet agent: don't call etckeeper hooks pre- and post-run [puppet] - 10https://gerrit.wikimedia.org/r/397967 
(https://phabricator.wikimedia.org/T182721) [02:29:29] (03CR) 10Andrew Bogott: [C: 032] puppet agent: don't call etckeeper hooks pre- and post-run [puppet] - 10https://gerrit.wikimedia.org/r/397967 (https://phabricator.wikimedia.org/T182721) (owner: 10Andrew Bogott) [02:33:51] 10Operations, 10MediaWiki-Shell, 10Wikimedia-General-or-Unknown, 10Security: Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584#3833259 (10Legoktm) [02:44:01] (03PS2) 10Andrew Bogott: bootstrapvz: allow default (v4) puppet packages on stretch base images. [puppet] - 10https://gerrit.wikimedia.org/r/397966 (https://phabricator.wikimedia.org/T178717) [02:45:11] (03CR) 10Andrew Bogott: [C: 032] bootstrapvz: allow default (v4) puppet packages on stretch base images. [puppet] - 10https://gerrit.wikimedia.org/r/397966 (https://phabricator.wikimedia.org/T178717) (owner: 10Andrew Bogott) [03:24:37] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 819.96 seconds [04:05:37] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 295.06 seconds [04:09:08] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [04:10:37] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [04:24:22] (03PS3) 10BryanDavis: labsdb: Point DNS at equivalent web.db.svc.eqiad.wmflabs hosts [puppet] - 10https://gerrit.wikimedia.org/r/397256 (https://phabricator.wikimedia.org/T142807) [04:29:08] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [04:30:37] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [04:45:35] soooo [04:46:56] having not slept yet, I am going to do so now 😴 [04:47:07] see folks in 7-8 hours from 
now [05:01:38] 10Operations, 10Puppet, 10cloud-services-team, 10Patch-For-Review, 10Puppet-infrastructure-modernization: Stop using etckeeper (at least before/after puppet runs) - https://phabricator.wikimedia.org/T182721#3833314 (10Andrew) 05Open>03Resolved a:03Andrew [05:01:57] (03PS11) 10Andrew Bogott: WMCS: set puppet_major_version to 4 [puppet] - 10https://gerrit.wikimedia.org/r/397711 (https://phabricator.wikimedia.org/T178717) [05:01:59] (03PS1) 10Andrew Bogott: nova fullstack: update image for testing [puppet] - 10https://gerrit.wikimedia.org/r/397997 [05:29:54] (03CR) 10Madhuvishy: [C: 031] "Do we need the c2 entry? labsdb1002 isn't even up and running afaik. Otherwise +1, happy to help merge this through tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/397256 (https://phabricator.wikimedia.org/T142807) (owner: 10BryanDavis) [05:33:28] (03CR) 10BryanDavis: "> Do we need the c2 entry?" [puppet] - 10https://gerrit.wikimedia.org/r/397256 (https://phabricator.wikimedia.org/T142807) (owner: 10BryanDavis) [05:35:36] (03PS4) 10BryanDavis: labsdb: Point DNS at equivalent web.db.svc.eqiad.wmflabs hosts [puppet] - 10https://gerrit.wikimedia.org/r/397256 (https://phabricator.wikimedia.org/T142807) [05:47:04] (03CR) 10Madhuvishy: [C: 031] labsdb: Point DNS at equivalent web.db.svc.eqiad.wmflabs hosts [puppet] - 10https://gerrit.wikimedia.org/r/397256 (https://phabricator.wikimedia.org/T142807) (owner: 10BryanDavis) [05:50:47] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [05:52:17] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [05:56:03] (03PS1) 10Muehlenhoff: Record new account expiry date for pnorman [puppet] - 10https://gerrit.wikimedia.org/r/398000 [05:58:47] (03PS3) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397869 [05:59:06] (03CR) 
10Muehlenhoff: [C: 032] Record new account expiry date for pnorman [puppet] - 10https://gerrit.wikimedia.org/r/398000 (owner: 10Muehlenhoff) [05:59:18] RECOVERY - Check Varnish expiry mailbox lag on cp4024 is OK: OK: expiry mailbox lag is 0 [05:59:53] (03Abandoned) 10Muehlenhoff: Remove firejail wrappers for timidity, lilypond and abc2ly [puppet] - 10https://gerrit.wikimedia.org/r/394927 (owner: 10Muehlenhoff) [06:06:17] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:06:33] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3833365 (10MoritzMuehlenhoff) @Marostegui , @elukey : Can you try passing numa=off to the kernel in d-i? That should work around it. There'll be a revised 3.16 package (and a re... [06:08:41] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397869 (owner: 10Marostegui) [06:08:47] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:10:06] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397869 (owner: 10Marostegui) [06:10:18] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397869 (owner: 10Marostegui) [06:12:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1097:3314 - T174569 (duration: 01m 08s) [06:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:29] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [06:19:22] (03PS1) 10Muehlenhoff: Add config file with access credentials for Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398001 (https://phabricator.wikimedia.org/T181802) [06:21:05] (03PS1) 
10Marostegui: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398002 (https://phabricator.wikimedia.org/T174569) [06:23:54] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398002 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:25:12] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398002 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:25:47] !log Deploy schema change on db1103:3314 - T174569 [06:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:07] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [06:26:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1103:3314 - T174569 (duration: 01m 11s) [06:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:03] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398002 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:39:03] (03PS1) 10Muehlenhoff: Add Prometheus exporter for RabbitMQ [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/398003 (https://phabricator.wikimedia.org/T181802) [08:20:20] (03PS18) 10Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) [08:20:22] (03PS1) 10Jcrespo: mariadb: Create profile::client for non-root mariadb clients [puppet] - 10https://gerrit.wikimedia.org/r/398007 (https://phabricator.wikimedia.org/T175672) [08:21:04] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Create profile::client for non-root mariadb clients [puppet] - 10https://gerrit.wikimedia.org/r/398007 (https://phabricator.wikimedia.org/T175672) (owner: 10Jcrespo) 
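The 5xx alerts logged between 04:09 and 06:08 report percentages (11.11%, 22.22%, 33.33%, 10.00%) that are consistent with counting how many of roughly nine or ten recent datapoints exceed a threshold. The sketch below illustrates only that arithmetic. It is not the actual Icinga/Graphite plugin: the function names, the 1% cutoff, and the sample values are assumptions chosen to reproduce the message format seen in the log.

```python
from typing import Sequence


def percent_above(values: Sequence[float], threshold: float) -> float:
    """Return the percentage of datapoints strictly above the threshold."""
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)


def classify(
    values: Sequence[float],
    warning: float = 250.0,     # threshold shown in the RECOVERY messages
    critical: float = 1000.0,   # threshold shown in the PROBLEM messages
    max_percent: float = 1.0,   # assumed cutoff, inferred from "Less than 1.00%"
) -> str:
    """Build a status line in the style of the alerts in this log."""
    crit_pct = percent_above(values, critical)
    if crit_pct >= max_percent:
        return (
            f"CRITICAL: {crit_pct:.2f}% of data above "
            f"the critical threshold [{critical}]"
        )
    warn_pct = percent_above(values, warning)
    if warn_pct >= max_percent:
        return (
            f"WARNING: {warn_pct:.2f}% of data above "
            f"the warning threshold [{warning}]"
        )
    return f"OK: Less than {max_percent:.2f}% above the threshold [{warning}]"


if __name__ == "__main__":
    # Hypothetical window of nine samples with a single spike.
    print(classify([100.0] * 8 + [1500.0]))
    # Hypothetical quiet window.
    print(classify([100.0] * 9))
```

With nine samples, one spike gives 1/9 = 11.11%, two give 22.22%, and three give 33.33%, which matches the sequence of PROBLEM messages above.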
[08:22:07] PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100% [08:22:47] PROBLEM - Host dysprosium is DOWN: PING CRITICAL - Packet loss = 100% [08:23:07] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100% [08:23:07] PROBLEM - Host etcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [08:23:08] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:23:08] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:23:08] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:23:08] PROBLEM - Host neon is DOWN: PING CRITICAL - Packet loss = 100% [08:23:18] PROBLEM - Host boron is DOWN: PING CRITICAL - Packet loss = 100% [08:23:18] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100% [08:23:36] that looks like a ganeti problem, akosiaris [08:23:47] PROBLEM - SSH on ganeti1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:24:51] of course I cannot log in [08:28:27] RECOVERY - Host releases1001 is UP: PING WARNING - Packet loss = 50%, RTA = 182.29 ms [08:28:29] RECOVERY - Host etcd1005 is UP: PING WARNING - Packet loss = 50%, RTA = 569.76 ms [08:28:29] RECOVERY - Host neon is UP: PING WARNING - Packet loss = 64%, RTA = 1.32 ms [08:28:29] RECOVERY - Host actinium is UP: PING OK - Packet loss = 0%, RTA = 24.18 ms [08:28:29] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 1.71 ms [08:28:29] RECOVERY - Host boron is UP: PING WARNING - Packet loss = 0%, RTA = 1116.96 ms [08:28:29] RECOVERY - Host etcd1002 is UP: PING WARNING - Packet loss = 66%, RTA = 12.36 ms [08:28:37] RECOVERY - SSH on ganeti1008 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [08:28:37] RECOVERY - Host dysprosium is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [08:28:37] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [08:29:20] oh, I was just migrating the instances away [08:30:37] RECOVERY - Host bohrium is UP: PING OK - Packet 
loss = 0%, RTA = 0.49 ms [08:34:44] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3833432 (10jcrespo) It happened on ganeti1008 just a few minutes ago: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=ganeti1008&var-... [08:38:01] !log migrate away from ganeti1008 [08:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:46] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3833433 (10jcrespo) I did it anyway, the server kept throwing scary messages not only to dmesg, but to my regular shell. [08:52:54] (03PS4) 10EddieGP: Delete mowiki and mowiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394846 (https://phabricator.wikimedia.org/T181923) (owner: 10MarcoAurelio) [08:54:37] (03CR) 10EddieGP: "PS3 & PS4 were both rebases due to wikiversions.json. I've now scheduled this for EU midday SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394846 (https://phabricator.wikimedia.org/T181923) (owner: 10MarcoAurelio) [08:56:21] (03CR) 10EddieGP: "Oops, that's wrong. PS3 wasn't just a rebase, I did the removal from config and deletion of the logos there." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/394846 (https://phabricator.wikimedia.org/T181923) (owner: 10MarcoAurelio) [09:05:43] (03PS2) 10Jcrespo: mariadb: Create profile::client for non-root mariadb clients [puppet] - 10https://gerrit.wikimedia.org/r/398007 (https://phabricator.wikimedia.org/T175672) [09:10:04] (03PS3) 10Jcrespo: mariadb: Create profile::client for non-root mariadb clients [puppet] - 10https://gerrit.wikimedia.org/r/398007 (https://phabricator.wikimedia.org/T175672) [09:15:00] (03PS4) 10Jcrespo: mariadb: Create profile::client for non-root mariadb clients [puppet] - 10https://gerrit.wikimedia.org/r/398007 (https://phabricator.wikimedia.org/T175672) [09:22:06] (03PS5) 10Jcrespo: mariadb: Create profile::client for non-root mariadb clients [puppet] - 10https://gerrit.wikimedia.org/r/398007 (https://phabricator.wikimedia.org/T175672) [09:23:54] (03PS6) 10Jcrespo: mariadb: Create profile::client for non-root mariadb clients [puppet] - 10https://gerrit.wikimedia.org/r/398007 (https://phabricator.wikimedia.org/T175672) [09:29:49] (03CR) 10Jcrespo: [C: 032] mariadb: Create profile::client for non-root mariadb clients [puppet] - 10https://gerrit.wikimedia.org/r/398007 (https://phabricator.wikimedia.org/T175672) (owner: 10Jcrespo) [09:36:25] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398014 [09:39:25] jynus: huh, thanks for migrating instances from ganeti1008. I 'll reimage ganeti1006 (to clean up whatever mess ltpstress has created) and move many VMs on it. 
Maybe the BIOS upgrade will have fixed something [09:39:56] wait [09:40:07] I am in the middle of a rebalancing [09:40:55] I thought it was stateful and it would rebalance things only into 5 and 7 [09:41:17] as 7 ran out of memory [09:41:47] you can access the screen on ganeti1001 [09:41:57] !log upload prometheus-elasticsearch-exporter to jessie-wikimedia - T181627 [09:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:10] T181627: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627 [09:42:57] it is stateful, it can only migrate things into hosts that are secondary for a VM [09:43:19] yes, the whole primary secondary yes [09:43:29] but it migrated things back to 1006 and 1008 [09:43:46] yeah, not surprised [09:43:56] neither were fully depooled [09:44:04] we can't afford to have them fully depooled right now [09:44:25] yeah, I understood, I just did that unintentionally [09:44:37] I wanted to migrate bohrium to 1005 [09:44:47] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3833552 (10elukey) Live hacked install1002's /srv/tftpboot/jessie-installer/pxelinux.cfg/ttyS1-115200 and didn't get the kernel panic! (credits to @fgiunchedi for the technical h... [09:44:48] I think I should have asked that explicitly [09:45:11] anyway, I 'll wait for you [09:45:21] but then I thought filling 6 wouldn't be bad [09:45:31] anyway, check the screen on ganeti1001 [09:45:39] and cancel or follow up as you wish [09:45:43] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3833556 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1111.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei... 
[09:45:49] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3833557 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1111.eqiad.wmnet'] ``` [09:45:57] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3833558 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1111.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei... [09:46:02] also, I didn't know it was going to take so much time [09:46:20] wow... you run hbal ? [09:46:30] :-) [09:46:34] ok this is going to take quite a few hours [09:46:42] 0:-) [09:46:46] 10Operations, 10Wikimedia-General-or-Unknown: Icinga has httpauth on (not accessible for public) - https://phabricator.wikimedia.org/T62112#661810 (10Sau226) Such a shame that lots of log and status hosts are shifting towards a private option without even a limited public panel. Really decreases community acco... [09:46:48] it's doing secondary movings and those require disk copying [09:47:11] let's see if I can cancel that... there is no point in rebalancing right now [09:48:49] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3833565 (10Marostegui) Trying db1111 again - will report back! [09:49:47] RECOVERY - Host db1111 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [09:49:54] things would only move to 1007, and it got out of memory [09:52:21] 10Operations, 10Discovery-Search (Current work), 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3833569 (10fgiunchedi) >>! In T181627#3828220, @Gehel wrote: >>>! In T181627#3803312, @fgiunchedi wrote: >> I tr... 
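The Ganeti exchange above turns on one constraint: an instance can only be migrated to the node that is currently its secondary, and getting it anywhere else means changing the secondary first, which copies disks. A minimal sketch of that reasoning follows. It is a toy model, not the Ganeti API: the class, the function names, and the placement of bohrium on ganeti1008/ganeti1006 are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Instance:
    """Toy model of a VM with one primary and one secondary node."""
    name: str
    primary: str
    secondary: str


def migrate(inst: Instance) -> Instance:
    """Migration swaps roles: the secondary becomes the new primary."""
    return replace(inst, primary=inst.secondary, secondary=inst.primary)


def replace_secondary(inst: Instance, new_secondary: str) -> Instance:
    """Point the secondary at another node (the slow, disk-copying step)."""
    if new_secondary == inst.primary:
        raise ValueError("secondary must differ from primary")
    return replace(inst, secondary=new_secondary)


def plan_move(inst: Instance, target: str) -> List[str]:
    """List the steps needed to make `target` the primary node."""
    if target == inst.primary:
        return []
    steps: List[str] = []
    if target != inst.secondary:
        # The target is not yet the secondary, so disks must be copied first.
        steps.append(
            f"replace secondary {inst.secondary} -> {target} (disk copy)"
        )
        inst = replace_secondary(inst, target)
    steps.append(f"migrate {inst.primary} -> {inst.secondary}")
    return steps


if __name__ == "__main__":
    # Hypothetical placement; the log does not state bohrium's nodes.
    vm = Instance("bohrium", primary="ganeti1008", secondary="ganeti1006")
    for step in plan_move(vm, "ganeti1005"):
        print(step)
```

In this model a move to the existing secondary is a single cheap step, while a move to any other node adds the disk-copy step, which is why the rebalancing discussed above was expected to take hours.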
[09:55:18] PROBLEM - OTRS SMTP on mendelevium is CRITICAL: connect to address 10.64.32.174 and port 25: Connection refused [09:55:33] that's expected ^ [09:55:47] PROBLEM - Check systemd state on mendelevium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:57:11] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3833575 (10elukey) Just got the following (I did a manual pxe boot, not using wmf-auto-reimage): ``` ┌─────────────────── [!!] Finish the installation ├───────────────────... [09:58:44] RECOVERY - Check systemd state on mendelevium is OK: OK - running: The system is fully operational [09:59:24] RECOVERY - OTRS SMTP on mendelevium is OK: SMTP OK - 0.003 sec. response time [10:00:59] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3833596 (10Marostegui) I tried db1111 again and got a kernel panic (used wmf-auto-reimage): ``` Loading Linux 3.16.0-4-amd64 ... Loading initial ramdisk ... [ 0.613068] gener... 
[10:02:44] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3833614 (10Marostegui) 05Open>03stalled This is failing due to: T182702 [10:02:59] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3831866 (10Marostegui) [10:03:05] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3833622 (10Marostegui) [10:11:29] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3833644 (10mmodell) 15d5283b7422919d85203b5ba907027f9356e421 doesn't exist in the editquality repo. Somehow the submodule pointer... [10:12:23] 10Operations, 10media-storage: Requesting access to swift for Phabricator's git-lfs storage - https://phabricator.wikimedia.org/T182085#3833650 (10fgiunchedi) Good question, I don't know if it swift in deployment prep is reachable now from other projects, worth a try though! An alternative would be to have a e... [10:13:12] 10Operations, 10Goal, 10User-Elukey, 10User-fgiunchedi: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759#3833652 (10hashar) p:05Triage>03High [10:13:26] 10Operations, 10Goal, 10User-Elukey, 10User-fgiunchedi: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197#3833663 (10hashar) [10:15:56] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3833668 (10MarcoAurelio) Sorry for the late reply and thanks @Strainu for testing, @EddieGP for your guidance during the process and @Joe fo... 
[10:16:20] 10Operations, 10Goal, 10User-Elukey, 10User-fgiunchedi: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759#3833669 (10hashar) [10:17:08] !log mobrovac@tin Started deploy [recommendation-api/deploy@ac66089]: Update to service-template-node v0.5.4 [10:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:30] (03PS1) 10Elukey: Fix kafka1023 partman's recipe [puppet] - 10https://gerrit.wikimedia.org/r/398022 [10:17:46] (03PS1) 10Jcrespo: proxysql: Add proxysql user to mysql group for tls certs access [puppet] - 10https://gerrit.wikimedia.org/r/398023 (https://phabricator.wikimedia.org/T175672) [10:18:04] (03CR) 10Elukey: [C: 032] Fix kafka1023 partman's recipe [puppet] - 10https://gerrit.wikimedia.org/r/398022 (owner: 10Elukey) [10:18:54] (03CR) 10Filippo Giunchedi: [C: 031] Add config file with access credentials for Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398001 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [10:19:28] !log mobrovac@tin Finished deploy [recommendation-api/deploy@ac66089]: Update to service-template-node v0.5.4 (duration: 02m 20s) [10:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:50] (03CR) 10Elukey: [C: 032] Remove any trace of notebook1002 records [dns] - 10https://gerrit.wikimedia.org/r/397884 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [10:20:13] 10Operations, 10Wikimedia-General-or-Unknown: Icinga has httpauth on (not accessible for public) - https://phabricator.wikimedia.org/T62112#3833680 (10akosiaris) 05Open>03declined Thanks for reminding me of this. Since there's been practically no change since my last commit, I 'll indeed close this as `dec... 
[10:21:40] (03PS1) 10Gehel: elasticsearch: deploy prometheus-elasticsearch-exporter [puppet] - 10https://gerrit.wikimedia.org/r/398025 (https://phabricator.wikimedia.org/T181627) [10:21:46] (03PS1) 10Gehel: logstash: activate prometheus elasticsearch exporter [puppet] - 10https://gerrit.wikimedia.org/r/398026 (https://phabricator.wikimedia.org/T181627) [10:21:51] (03PS1) 10Gehel: elasticsearch: activate prometheus elasticsearch exporter [puppet] - 10https://gerrit.wikimedia.org/r/398027 (https://phabricator.wikimedia.org/T181627) [10:21:55] godog: ^ [10:21:59] (03CR) 10Filippo Giunchedi: [C: 031] Add a Prometheus exporter for PDNS recursor [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394982 (owner: 10Muehlenhoff) [10:22:49] (03CR) 10Jcrespo: [C: 031] Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398014 (owner: 10Marostegui) [10:23:23] (03PS2) 10Jcrespo: proxysql: Add proxysql user to mysql group for tls certs access [puppet] - 10https://gerrit.wikimedia.org/r/398023 (https://phabricator.wikimedia.org/T175672) [10:23:23] gehel: sweet! I'm taking a look [10:24:04] godog: it looks too simple, I probably forgot something obvious.... [10:24:25] godog: I have no idea how to tell prometheus to start querying those new exporters... [10:24:58] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3833696 (10Marostegui) >>! In T182702#3833575, @elukey wrote: > Just got the following (I did a manual pxe boot, not using wmf-auto-reimage): > > ``` > ┌───────────────────... 
[10:25:15] (03CR) 10Filippo Giunchedi: elasticsearch: deploy prometheus-elasticsearch-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398025 (https://phabricator.wikimedia.org/T181627) (owner: 10Gehel) [10:26:40] gehel: heheh the exporter part should be basically that, the prometheus part you'd have to add a "job" for elasticsearch to ./modules/role/manifests/prometheus/ops.pp [10:27:27] (03CR) 10Jcrespo: [C: 032] proxysql: Add proxysql user to mysql group for tls certs access [puppet] - 10https://gerrit.wikimedia.org/r/398023 (https://phabricator.wikimedia.org/T175672) (owner: 10Jcrespo) [10:27:48] (03PS2) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add reporter script [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) [10:28:32] (03CR) 10jerkins-bot: [V: 04-1] apt: unattended-upgrades: add reporter script [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [10:32:50] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3833700 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1111.eqiad.wmnet'] ``` [10:36:58] (03PS2) 10Gehel: elasticsearch: deploy prometheus-elasticsearch-exporter [puppet] - 10https://gerrit.wikimedia.org/r/398025 (https://phabricator.wikimedia.org/T181627) [10:37:00] (03PS2) 10Gehel: logstash: activate prometheus elasticsearch exporter [puppet] - 10https://gerrit.wikimedia.org/r/398026 (https://phabricator.wikimedia.org/T181627) [10:37:02] (03PS2) 10Gehel: elasticsearch: activate prometheus elasticsearch exporter [puppet] - 10https://gerrit.wikimedia.org/r/398027 (https://phabricator.wikimedia.org/T181627) [10:40:26] 10Operations, 10Traffic, 10Patch-For-Review: Configuration for Asia Cache DC hosts - https://phabricator.wikimedia.org/T156027#2961994 (10Volans) I 
just noticed that in `late_command.sh` we have a special case for `cp[1234]*` that I guess will need to be updated to include eqsin too. Mentioning it here becau... [10:42:52] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3833707 (10akosiaris) But it does exist on tin ``` akosiaris@tin:/srv/deployment/ores/deploy/.git/modules/submodules/editquality$... [10:44:21] (03PS3) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add reporter script [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) [10:44:49] (03CR) 10jerkins-bot: [V: 04-1] apt: unattended-upgrades: add reporter script [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [10:45:22] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3833708 (10mmodell) Another thing: I'm having difficulty just cloning the editquality submodule. It's so large that git pack-obje... [10:47:47] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3833717 (10akosiaris) @mmodell is on to something though with the comment about that commit not being in the repo ``` akosiaris@t... [10:49:13] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3833719 (10mmodell) Hmm, indeed, if the object does not exist on any branch or tag then it likely won't be fetched by the "dumb" g... 
[10:52:25] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3833721 (10mmodell) Just fetching this one repo (editquality) from phabricator is causing inordinate load on the server. It's noth... [10:54:20] (03CR) 10Gehel: elasticsearch: deploy prometheus-elasticsearch-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398025 (https://phabricator.wikimedia.org/T181627) (owner: 10Gehel) [11:00:38] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3833730 (10akosiaris) Behavior is erratic as well ``` akosiaris@bast1001:~$ git clone https://phabricator.wikimedia.org/source/ed... [11:01:36] (03CR) 10Filippo Giunchedi: [C: 031] Add Prometheus exporter to role::mw_rc_irc [puppet] - 10https://gerrit.wikimedia.org/r/395766 (https://phabricator.wikimedia.org/T182196) (owner: 10Muehlenhoff) [11:01:52] (03CR) 10Filippo Giunchedi: [C: 031] Add Prometheus scraper config for ircd exporter [puppet] - 10https://gerrit.wikimedia.org/r/395767 (https://phabricator.wikimedia.org/T182196) (owner: 10Muehlenhoff) [11:02:42] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3833733 (10mmodell) Yeah that repo is 334M in the current workdir but the .git is 2.1 gigs. That doesn't seem too unreasonable bu... 
[11:07:47] (03PS6) 10ArielGlenn: clean up all references to a 'public dumps dir' on web/nfs dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/397806 [11:08:49] (03CR) 10ArielGlenn: [C: 032] clean up all references to a 'public dumps dir' on web/nfs dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/397806 (owner: 10ArielGlenn) [11:10:23] (03PS1) 10Muehlenhoff: Add Debianisation for prometheus-rabbitmq-exporter [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/398033 [11:12:50] !log mobrovac@tin Started deploy [graphoid/deploy@7979a40]: Update to service-template-node v0.5.4 [11:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:51] (03PS2) 10Muehlenhoff: Add Prometheus exporter to role::mw_rc_irc [puppet] - 10https://gerrit.wikimedia.org/r/395766 (https://phabricator.wikimedia.org/T182196) [11:16:46] !log mobrovac@tin Finished deploy [graphoid/deploy@7979a40]: Update to service-template-node v0.5.4 (duration: 03m 56s) [11:16:47] (03CR) 10Muehlenhoff: [C: 032] Add Prometheus exporter to role::mw_rc_irc [puppet] - 10https://gerrit.wikimedia.org/r/395766 (https://phabricator.wikimedia.org/T182196) (owner: 10Muehlenhoff) [11:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] "A small set of comments. Overall this is nice." 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [11:21:22] (03PS3) 10Muehlenhoff: Add Prometheus scraper config for ircd exporter [puppet] - 10https://gerrit.wikimedia.org/r/395767 (https://phabricator.wikimedia.org/T182196) [11:22:30] (03CR) 10Muehlenhoff: [C: 032] Add Prometheus scraper config for ircd exporter [puppet] - 10https://gerrit.wikimedia.org/r/395767 (https://phabricator.wikimedia.org/T182196) (owner: 10Muehlenhoff) [11:28:17] (03PS2) 10Elukey: Enable more accurate smaps based rss checking [puppet/cdh] - 10https://gerrit.wikimedia.org/r/395923 (https://phabricator.wikimedia.org/T182276) (owner: 10EBernhardson) [11:32:00] (03CR) 10Elukey: [C: 031] "After thinking a bit more about this, I don't think that having another tunable would be really that safer, since we can restart increment" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/395923 (https://phabricator.wikimedia.org/T182276) (owner: 10EBernhardson) [11:33:03] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add a Prometheus exporter for PDNS recursor [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394982 (owner: 10Muehlenhoff) [11:34:40] (03PS2) 10Muehlenhoff: Add pdns rec exporters to Prometheus scraper config [puppet] - 10https://gerrit.wikimedia.org/r/394564 (https://phabricator.wikimedia.org/T181620) [11:35:00] (03CR) 10jerkins-bot: [V: 04-1] Add pdns rec exporters to Prometheus scraper config [puppet] - 10https://gerrit.wikimedia.org/r/394564 (https://phabricator.wikimedia.org/T181620) (owner: 10Muehlenhoff) [11:41:41] !log empty ganeti1006 for reimage after ltpstress and bios upgrades T181121 [11:41:41] (03PS3) 10Muehlenhoff: Add Prometheus exporter to DNS recursors [puppet] - 10https://gerrit.wikimedia.org/r/394562 (https://phabricator.wikimedia.org/T181620) [11:41:43] (03PS3) 10Muehlenhoff: Add pdns rec exporters to Prometheus scraper config [puppet] - 10https://gerrit.wikimedia.org/r/394564 (https://phabricator.wikimedia.org/T181620) 
[11:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:51] T181121: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121 [11:42:34] (03PS5) 10EddieGP: Delete alswik(ibooks|iquote|tionary), mowik(ipedia|tionary) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394846 (https://phabricator.wikimedia.org/T181923) (owner: 10MarcoAurelio) [11:43:44] (03PS4) 10Elukey: role::cache::canary: add a test Varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/397765 [11:50:54] (03PS1) 10ArielGlenn: rename 'otherdir' in the dumps modules [puppet] - 10https://gerrit.wikimedia.org/r/398034 [11:53:24] PROBLEM - Disk space on stat1005 is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: Transport endpoint is not connected [11:53:41] elukey: ^^^ [11:54:42] (03PS5) 10Elukey: role::cache::canary: add a test Varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/397765 [11:55:59] checking.. 
[11:59:52] !log forced remount of /mnt/hdfs after OOM event on stat1005 [12:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:25] RECOVERY - Disk space on stat1005 is OK: DISK OK [12:03:58] (03PS2) 10ArielGlenn: rename 'otherdir' in the dumps modules [puppet] - 10https://gerrit.wikimedia.org/r/398034 [12:04:33] (03CR) 10jerkins-bot: [V: 04-1] rename 'otherdir' in the dumps modules [puppet] - 10https://gerrit.wikimedia.org/r/398034 (owner: 10ArielGlenn) [12:05:48] (03PS3) 10ArielGlenn: rename 'otherdir' in the dumps modules [puppet] - 10https://gerrit.wikimedia.org/r/398034 [12:07:35] (03PS1) 10Elukey: Restrict read permissions to the config file when SSL is enabled [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/398035 [12:09:31] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler03/9322/cp1008.wikimedia.org/" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/398035 (owner: 10Elukey) [12:10:32] (03PS4) 10Alexandros Kosiaris: postgresql::user: Allow password to be undefined [puppet] - 10https://gerrit.wikimedia.org/r/392437 [12:10:34] (03PS5) 10Alexandros Kosiaris: Add postgresql::prometheus class [puppet] - 10https://gerrit.wikimedia.org/r/392438 (https://phabricator.wikimedia.org/T177196) [12:13:59] (03PS6) 10Elukey: role::cache::canary: add a test Varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/397765 [12:18:25] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [12:18:35] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [12:18:35] (03PS4) 10ArielGlenn: rename 'otherdir' in the dumps modules [puppet] - 10https://gerrit.wikimedia.org/r/398034 [12:18:44] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [12:18:55] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [12:19:05] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out 
of bounds [12:19:12] (03CR) 10Elukey: "List of changes:" [puppet] - 10https://gerrit.wikimedia.org/r/397765 (owner: 10Elukey) [12:19:14] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [12:19:31] !log uploaded prometheus-ircd-exporter and prometheus-pdns-rec-exporter to apt.wikimedia.org [12:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:55] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [12:21:40] checking stat1005, surely someone crunching data [12:21:55] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds [12:24:40] (03CR) 10Muehlenhoff: [C: 032] Add Prometheus exporter to DNS recursors [puppet] - 10https://gerrit.wikimedia.org/r/394562 (https://phabricator.wikimedia.org/T181620) (owner: 10Muehlenhoff) [12:25:14] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [12:25:25] RECOVERY - Disk space on stat1005 is OK: DISK OK [12:25:35] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [12:25:44] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [12:25:55] RECOVERY - DPKG on stat1005 is OK: All packages OK [12:26:05] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [12:29:55] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:33:56] (03PS1) 10Elukey: Add fake secrets for Varnishkafka TLS configuration [labs/private] - 10https://gerrit.wikimedia.org/r/398036 [12:34:22] (03CR) 10Elukey: [V: 032 C: 032] Add fake secrets for Varnishkafka TLS configuration [labs/private] - 10https://gerrit.wikimedia.org/r/398036 (owner: 10Elukey) [12:36:00] (03CR) 10Muehlenhoff: [C: 032] Add pdns rec exporters to Prometheus scraper config [puppet] - 10https://gerrit.wikimedia.org/r/394564 
(https://phabricator.wikimedia.org/T181620) (owner: 10Muehlenhoff) [12:37:22] (03PS2) 10Muehlenhoff: Add Prometheus scraper config for Etherpad [puppet] - 10https://gerrit.wikimedia.org/r/395577 (https://phabricator.wikimedia.org/T182095) [12:41:29] (03PS3) 10Muehlenhoff: Add Prometheus scraper config for Etherpad [puppet] - 10https://gerrit.wikimedia.org/r/395577 (https://phabricator.wikimedia.org/T182095) [12:42:52] (03PS1) 10Volans: Icinga: fine tune settings for dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/398037 (https://phabricator.wikimedia.org/T170353) [12:44:53] (03PS2) 10Muehlenhoff: Add config file with access credentials for Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398001 (https://phabricator.wikimedia.org/T181802) [12:45:19] (03CR) 10Elukey: "pcc after adding the fake pem files to the labs private repo: https://puppet-compiler.wmflabs.org/compiler02/9325/cp1008.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/397765 (owner: 10Elukey) [12:46:33] (03CR) 10Muehlenhoff: [C: 032] Add config file with access credentials for Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398001 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [12:47:38] (03CR) 10Joal: [C: 031] "I agree with ottomata - sounds sane." 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/395923 (https://phabricator.wikimedia.org/T182276) (owner: 10EBernhardson) [12:47:54] (03PS3) 10Ema: prometheus: add mtail to varnish-upload job [puppet] - 10https://gerrit.wikimedia.org/r/397851 (https://phabricator.wikimedia.org/T177199) (owner: 10Filippo Giunchedi) [12:47:56] (03PS2) 10Ema: WIP: rework mtail tests [puppet] - 10https://gerrit.wikimedia.org/r/397889 (owner: 10Filippo Giunchedi) [12:47:58] (03PS2) 10Ema: varnishxcps.mtail: use prometheus labels [puppet] - 10https://gerrit.wikimedia.org/r/397876 (https://phabricator.wikimedia.org/T177199) [12:49:10] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398014 (owner: 10Marostegui) [12:51:04] (03CR) 10Nikerabbit: [C: 031] "Looks sane, didn't test." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385953 (https://phabricator.wikimedia.org/T178793) (owner: 10MarcoAurelio) [12:51:20] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398014 (owner: 10Marostegui) [12:51:32] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398014 (owner: 10Marostegui) [12:51:55] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Wed 2017-12-13 12:51:51 UTC. 
[12:53:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1103:3314 - T174569 (duration: 01m 36s) [12:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:46] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [12:53:50] !log Deploy alter table on s4 db1064 (sanitarium master) with replication, this will generate lag on labs replicas - T174569 [12:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:56] (03CR) 10Ema: [C: 031] prometheus: add mtail to varnish-upload job [puppet] - 10https://gerrit.wikimedia.org/r/397851 (https://phabricator.wikimedia.org/T177199) (owner: 10Filippo Giunchedi) [12:56:03] (03CR) 10Ema: [C: 031] WIP: rework mtail tests [puppet] - 10https://gerrit.wikimedia.org/r/397889 (owner: 10Filippo Giunchedi) [12:57:55] (03PS4) 10Ema: prometheus: add mtail to varnish-upload job [puppet] - 10https://gerrit.wikimedia.org/r/397851 (https://phabricator.wikimedia.org/T177199) (owner: 10Filippo Giunchedi) [12:58:04] (03CR) 10Ema: [V: 032 C: 032] prometheus: add mtail to varnish-upload job [puppet] - 10https://gerrit.wikimedia.org/r/397851 (https://phabricator.wikimedia.org/T177199) (owner: 10Filippo Giunchedi) [12:59:05] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/prometheus/rabbitmq-exporter.yaml] [13:00:08] (03PS3) 10Filippo Giunchedi: mtail: restructure tests [puppet] - 10https://gerrit.wikimedia.org/r/397889 (https://phabricator.wikimedia.org/T177199) [13:00:54] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3833995 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1111.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei... 
[13:00:57] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3833996 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1111.eqiad.wmnet'] ``` [13:00:59] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3833997 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1111.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei... [13:01:10] (03PS4) 10Filippo Giunchedi: mtail: restructure tests [puppet] - 10https://gerrit.wikimedia.org/r/397889 (https://phabricator.wikimedia.org/T177199) [13:01:12] (03PS3) 10Ema: varnishxcps.mtail: use prometheus labels [puppet] - 10https://gerrit.wikimedia.org/r/397876 (https://phabricator.wikimedia.org/T177199) [13:02:34] (03CR) 10Filippo Giunchedi: [C: 032] mtail: restructure tests [puppet] - 10https://gerrit.wikimedia.org/r/397889 (https://phabricator.wikimedia.org/T177199) (owner: 10Filippo Giunchedi) [13:03:18] (03PS1) 10Muehlenhoff: Fix metric family name [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/398038 [13:04:46] (03CR) 10Muehlenhoff: [V: 032 C: 032] Fix metric family name [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/398038 (owner: 10Muehlenhoff) [13:04:48] (03CR) 10Filippo Giunchedi: [C: 031] Fix metric family name [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/398038 (owner: 10Muehlenhoff) [13:06:08] (03PS4) 10Ema: varnishxcps.mtail: use prometheus labels [puppet] - 10https://gerrit.wikimedia.org/r/397876 (https://phabricator.wikimedia.org/T177199) [13:06:34] (03PS1) 10Muehlenhoff: Bump changelog [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/398039 [13:07:13] (03CR) 10Muehlenhoff: [V: 032 C: 032] Bump changelog [debs/prometheus-pdns-rec-exporter] - 
10https://gerrit.wikimedia.org/r/398039 (owner: 10Muehlenhoff) [13:08:55] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/prometheus/rabbitmq-exporter.yaml] [13:10:43] !log Deploy schema change on s4 - dbstore1002 - T174569 [13:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:54] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [13:14:28] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, there's likely much to bikeshed on the xcps metric prefix but I don't have a better idea ATM" [puppet] - 10https://gerrit.wikimedia.org/r/397876 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [13:16:24] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/prometheus/rabbitmq-exporter.yaml] [13:17:08] thcipriani|afk: could you take a look at https://gerrit.wikimedia.org/r/#/c/394551/ when you get a chance? thanks! 
[13:21:36] !log uploaded prometheus-pdns-rec-exporter 0.2-1 to apt.wikimedia.org [13:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:50] (03PS2) 10Zfilipin: Add upload_by_url to extended uploaders on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397129 (https://phabricator.wikimedia.org/T182534) (owner: 10Jon Harald Søby) [13:28:22] (03PS2) 10Zfilipin: Restrict sending mails to new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397768 (https://phabricator.wikimedia.org/T182541) (owner: 10EddieGP) [13:29:35] (03PS3) 10Zfilipin: [cirrus] tune wikidata similarity configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397855 (https://phabricator.wikimedia.org/T182293) (owner: 10DCausse) [13:30:05] (03CR) 10Filippo Giunchedi: [C: 031] Add Prometheus scraper config for Etherpad [puppet] - 10https://gerrit.wikimedia.org/r/395577 (https://phabricator.wikimedia.org/T182095) (owner: 10Muehlenhoff) [13:31:30] (03CR) 10Zoranzoki21: [C: 031] "Looks good to me, but someone else must approve" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397855 (https://phabricator.wikimedia.org/T182293) (owner: 10DCausse) [13:42:38] (03PS1) 10Muehlenhoff: Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) [13:43:12] (03CR) 10jerkins-bot: [V: 04-1] Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [13:43:30] (03PS2) 10Muehlenhoff: Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) [13:43:57] (03CR) 10jerkins-bot: [V: 04-1] Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [13:44:04] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] -
10https://gerrit.wikimedia.org/r/398045 (https://phabricator.wikimedia.org/T128546) [13:44:20] zeljkof: I saw you rebased my patch, so am I right in assuming you're going to do the swat? [13:44:35] eddiegp: correct [13:45:11] (03PS3) 10Muehlenhoff: Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) [13:45:26] zeljkof: I've put https://gerrit.wikimedia.org/r/#/c/394846/ into that swat window, but I'm not sure whether it's eligble. Tell me if you've got any questions/concerns about deploying it :) [13:46:23] It's quite big, and deleting wikis is not exactly something that is done regularly or known to run completely smoothly. [13:46:31] eddiegp: I don't think I have ever deleted projects during swat [13:46:45] so I would hesitate to do it [13:46:55] I would prefer if somebody with more experience does it [13:47:15] also, there are no references in the task on _why_ the projects are deleted [13:47:35] (03CR) 10Elukey: [V: 032 C: 032] Enable more accurate smaps based rss checking [puppet/cdh] - 10https://gerrit.wikimedia.org/r/395923 (https://phabricator.wikimedia.org/T182276) (owner: 10EBernhardson) [13:48:01] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db1034 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398046 (https://phabricator.wikimedia.org/T182556) [13:48:07] zeljkof: For the last part, that's in the subtask T169450 [13:48:10] T169450: Redirect several wikis - https://phabricator.wikimedia.org/T169450 [13:48:23] (03PS4) 10Muehlenhoff: Add Prometheus scraper config for Etherpad [puppet] - 10https://gerrit.wikimedia.org/r/395577 (https://phabricator.wikimedia.org/T182095) [13:49:02] (03Abandoned) 10Elukey: role::analytics_cluster::client: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/393563 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [13:49:07] (03CR) 10Muehlenhoff: [C: 032] Add Prometheus scraper config for Etherpad [puppet] - 
10https://gerrit.wikimedia.org/r/395577 (https://phabricator.wikimedia.org/T182095) (owner: 10Muehlenhoff) [13:49:45] zeljkof: Basically: The URIs are redirected, so the wikis shouldn't be present in mw any more, or they'll still show up on weird places due (globalauth etc) [13:50:24] But if you don't want to do that now, I'm fine with that. I can ask greg-g, probably a deployment window would be best for this instead, combining it with the other cleanup necessary. [13:51:01] (03PS1) 10Elukey: modules::cdh: update to the latest sha [puppet] - 10https://gerrit.wikimedia.org/r/398047 [13:55:47] eddiegp: another swat window would be fine with me, I just think somebody more experienced than I am should take a look first [13:56:44] (03CR) 10Elukey: [C: 032] modules::cdh: update to the latest sha [puppet] - 10https://gerrit.wikimedia.org/r/398047 (owner: 10Elukey) [13:56:51] (03PS2) 10Elukey: modules::cdh: update to the latest sha [puppet] - 10https://gerrit.wikimedia.org/r/398047 [13:57:16] (03CR) 10Alexandros Kosiaris: [C: 031] Icinga: fine tune settings for dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/398037 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [13:58:01] zeljkof: I'll move it by one day, I should be home for EU midday tomorrow. Do you have some more clue than I whom to ask for taking a look? [13:59:34] jouncebot: refresh [13:59:37] I refreshed my knowledge about deployments. [14:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171213T1400). [14:00:04] Jhs, eddiegp, dcausse, and jan_drewniak: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:58] I can SWAT today! [14:01:11] o/ [14:01:17] !log restart Yarn nodemanagers on analytics102[8,9] to apply new settings - T182276 [14:01:22] In case Jhs isn't here in time, they asked me to be around for their patch too, so I can test that :) [14:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:28] T182276: Enable more accurate smaps based RSS tracking by yarn nodemanager - https://phabricator.wikimedia.org/T182276 [14:01:31] (03PS1) 10Muehlenhoff: Fix ferm port for prometheus-etherpad-exporter [puppet] - 10https://gerrit.wikimedia.org/r/398048 [14:01:33] eddiegp: I usually do EU SWATs, so you'll have the same problem tomorrow :) [14:01:52] (03CR) 10Filippo Giunchedi: [C: 031] Fix ferm port for prometheus-etherpad-exporter [puppet] - 10https://gerrit.wikimedia.org/r/398048 (owner: 10Muehlenhoff) [14:01:54] eddiegp: greg-g, hasharAway, no_justification might know what to do [14:02:30] jan_drewniak: want to deploy your commit yourself? [14:02:35] Jhs: around for SWAT? [14:02:52] dcausse: want to deploy your commits yourself? [14:03:54] !log gnt-node remove ganeti1006 T181121 [14:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:04] T181121: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121 [14:04:08] zeljkof: Well my other alternatives are skipping university tomorrow, waiting for evening swat (1am for me) or delaying it until january. I don't like any of those :/ [14:04:15] o/ [14:04:38] zeljkof: sure, I never deployed wikibase but if it's similar to other exts I'm ok [14:04:59] dcausse: want to go first? 
[14:05:06] sure [14:05:17] dcausse: go ahead and let me know when you are done [14:05:23] ok [14:05:36] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397855 (https://phabricator.wikimedia.org/T182293) (owner: 10DCausse) [14:05:45] PROBLEM - ganeti-confd running on ganeti1006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-confd), command name ganeti-confd [14:05:47] PROBLEM - ganeti-noded running on ganeti1006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded [14:06:01] eddiegp: sorry, I really do not feel comfortable deleting projects. there are no reviews from people I trust on the commit [14:06:04] PROBLEM - ganeti-mond running on ganeti1006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond [14:06:11] zeljkof: could you do my deploy? I’m looking after my 2 year old at the moment and don’t want to risk her getting near the keyboard :P [14:06:13] (03CR) 10Muehlenhoff: [C: 032] Fix ferm port for prometheus-etherpad-exporter [puppet] - 10https://gerrit.wikimedia.org/r/398048 (owner: 10Muehlenhoff) [14:06:22] jan_drewniak: sure :) [14:07:07] jan_drewniak: I don't think we have "I broke wikipedia" shirts for 2 year olds :) [14:07:55] eddiegp: if you get a +1 from experienced people for the patch, I would deploy it [14:08:02] eddiegp: Yeah, be sure that I understand that. I don't want to blame you, just to say that I don't know when to do it if not in EU swat tomorrow. [14:08:17] zeljkof: Thanks, that's a word :) [14:08:37] I'll try to get some review from someone experienced for it then. [14:09:11] eddiegp: please do [14:09:18] deleting stuff is always scary [14:09:45] Especially if you've never done it before, I know :) [14:10:01] that too [14:10:03] do y [14:10:14] do I need to run some cleanup scripts after deployment?
[14:11:37] hm I think I need to remove the Depends-On tag in my commit msg [14:12:16] (03PS4) 10DCausse: [cirrus] tune wikidata similarity configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397855 (https://phabricator.wikimedia.org/T182293) [14:12:24] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3834125 (10MoritzMuehlenhoff) [14:12:26] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Create ircd exporter for Prometheus - https://phabricator.wikimedia.org/T182196#3834123 (10MoritzMuehlenhoff) 05Open>03Resolved A Prometheus exporter based on the previously used Diamond collector was written... [14:12:31] (03CR) 10DCausse: [cirrus] tune wikidata similarity configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397855 (https://phabricator.wikimedia.org/T182293) (owner: 10DCausse) [14:12:37] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397855 (https://phabricator.wikimedia.org/T182293) (owner: 10DCausse) [14:13:24] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3650139 (10MoritzMuehlenhoff) [14:13:28] zeljkof: Yeah, there's documentation about that on https://wikitech.wikimedia.org/wiki/Delete_a_wiki . Not exactly a cleanup script, but some sql queries to centralauth and commonswiki. Deleting a wiki isn't really automated enough :/ [14:13:58] (03Merged) 10jenkins-bot: [cirrus] tune wikidata similarity configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397855 (https://phabricator.wikimedia.org/T182293) (owner: 10DCausse) [14:13:58] eddiegp: in that case, it's not something I would do during swat. 
please ask greg-g for a deployment window [14:14:16] (My plans didn't include to do the cleanup in swat too, but if you think we could that's nice) [14:14:44] Oh, okay. Yeah, so going to do that then. [14:15:04] It's probably better to do it all at once anyways [14:17:04] (03CR) 10jenkins-bot: [cirrus] tune wikidata similarity configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397855 (https://phabricator.wikimedia.org/T182293) (owner: 10DCausse) [14:17:43] eddiegp: for 397768, I would also hesitate to deploy it, until anybody from WMF Support & Safety gave it a +1 [14:18:42] Meh, okay :( [14:18:46] (03CR) 10Zfilipin: [C: 031] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398045 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:19:03] eddiegp: sorry, I have to be careful [14:19:16] I'll probably not find anybody around for a review right now, so I'll have to move that too. [14:19:42] eddiegp: Yeah, sure ;) [14:20:12] !log dcausse@tin Synchronized wmf-config/Wikibase.php: T182293 [cirrus] tune wikidata similarity configuration 1/2 (duration: 01m 12s) [14:20:18] wow scap is very verbose [14:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:22] T182293: Tune wikidata fulltext search similarity parameters - https://phabricator.wikimedia.org/T182293 [14:20:23] gj eddie, pinging myself :D [14:20:32] dcausse: there is a task... [14:21:06] T182643 [14:21:06] T182643: cache_git_info (from e.g. scap sync-file) is way way too verbose - https://phabricator.wikimedia.org/T182643 [14:21:27] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Package PDNS Recursor collector for Prometheus and adapt metrics - https://phabricator.wikimedia.org/T181620#3834169 (10fgiunchedi) The https://grafana.wikimedia.org/dashboard/db/dns-recursors has been updated to... 
[14:21:35] dcausse: https://phabricator.wikimedia.org/T182643 [14:21:45] PROBLEM - puppet last run on lvs4007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:21:58] zeljkof: ok [14:22:08] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: T182293 [cirrus] tune wikidata similarity configuration 2/2 (duration: 01m 07s) [14:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:30] (03CR) 10Ema: role::cache::canary: add a test Varnishkafka instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397765 (owner: 10Elukey) [14:27:17] (03PS1) 10Gehel: elasticsearch: configure prometheus to collect metrics from elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/398051 (https://phabricator.wikimedia.org/T181627) [14:27:40] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: configure prometheus to collect metrics from elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/398051 (https://phabricator.wikimedia.org/T181627) (owner: 10Gehel) [14:28:21] (03PS2) 10Volans: Icinga: fine tune settings for dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/398037 (https://phabricator.wikimedia.org/T170353) [14:29:14] (03CR) 10Volans: [C: 032] Icinga: fine tune settings for dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/398037 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [14:29:55] (03PS7) 10Elukey: role::cache::canary: add a test Varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/397765 [14:29:58] !log dcausse@tin Synchronized php-1.31.0-wmf.12/extensions/Wikibase: T182293 Extract names of search fields as constants (duration: 02m 05s) [14:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:08] T182293: Tune wikidata fulltext search similarity parameters - https://phabricator.wikimedia.org/T182293 [14:30:48] zeljkof: I'm done, thanks for your patience :) [14:31:01] dcausse: no 
problem, and thanks! [14:31:05] I am taking over SWAT [14:31:23] Jhs: let me know if you are around for SWAT [14:31:49] (03PS1) 10Elukey: Move hiera filename after a profile rename [labs/private] - 10https://gerrit.wikimedia.org/r/398052 [14:31:58] jan_drewniak: I am deploying your commit, can you test it in a few minutes, when deployed? [14:32:07] (03PS3) 10Gehel: elasticsearch: deploy prometheus-elasticsearch-exporter [puppet] - 10https://gerrit.wikimedia.org/r/398025 (https://phabricator.wikimedia.org/T181627) [14:32:08] zeljkof: You've seen my note about Jhs patch above? [14:32:09] (03PS3) 10Gehel: logstash: activate prometheus elasticsearch exporter [puppet] - 10https://gerrit.wikimedia.org/r/398026 (https://phabricator.wikimedia.org/T181627) [14:32:11] (03PS3) 10Gehel: elasticsearch: activate prometheus elasticsearch exporter [puppet] - 10https://gerrit.wikimedia.org/r/398027 (https://phabricator.wikimedia.org/T181627) [14:32:13] (03PS2) 10Gehel: elasticsearch: configure prometheus to collect metrics from elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/398051 (https://phabricator.wikimedia.org/T181627) [14:32:15] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398045 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:32:17] (03CR) 10Elukey: [V: 032 C: 032] Move hiera filename after a profile rename [labs/private] - 10https://gerrit.wikimedia.org/r/398052 (owner: 10Elukey) [14:32:20] eddiegp: sorry, no [14:32:26] zeljkof: 15:01:22 eddiegp | In case Jhs isn't here in time, they asked me to be around for their patch too, so I can test that :) [14:32:49] eddiegp: oh, did not notice that at all, you are next then [14:33:02] zeljkof: yup, I'm around [14:33:04] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: configure prometheus to collect metrics from elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/398051 (https://phabricator.wikimedia.org/T181627) (owner: 10Gehel) [14:33:14] Great, ping 
me when on mwdebug :) [14:33:41] eddiegp: you can test if that patch works? I saw some comments about flickr problems, can you test for that too? [14:33:43] (03PS3) 10Gehel: elasticsearch: configure prometheus to collect metrics from elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/398051 (https://phabricator.wikimedia.org/T181627) [14:34:06] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398045 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:34:47] Umm, I'll have to look for that comment. Nope, Jhs didn't say anything about that. Just the usual "Check at Special:ListGroupRights" for a user rights change [14:34:50] 10Operations, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Create Prometheus exporter for wdqs-updater - https://phabricator.wikimedia.org/T182773#3834187 (10MoritzMuehlenhoff) p:05Triage>03High [14:36:06] (03CR) 10Elukey: "Renamed the profile to ::jumbo (previous ::duplicate) and fixed the hiera config in the labs private repo." [puppet] - 10https://gerrit.wikimedia.org/r/397765 (owner: 10Elukey) [14:36:34] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Create Prometheus exporter for Etherpad - https://phabricator.wikimedia.org/T182095#3834200 (10fgiunchedi) I've updated the dashboard at https://grafana.wikimedia.org/dashboard/db/etherpad?orgId=1 to make use of... 
[14:36:47] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398045 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:37:47] !log zfilipin@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: [[gerrit:398045|Bumping portals to master (T128546)]] (duration: 01m 08s) [14:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:58] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [14:38:57] !log zfilipin@tin Synchronized portals: SWAT: [[gerrit:398045|Bumping portals to master (T128546)]] (duration: 01m 09s) [14:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:22] jan_drewniak: portals deployed, please check and thanks for deploying with #releng! ;) [14:40:15] zeljkof: great, thanks! [14:40:33] (03CR) 10Ottomata: Restrict read permissions to the config file when SSL is enabled (032 comments) [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/398035 (owner: 10Elukey) [14:41:02] zeljkof: I agree with Jhs that I don't know why T90004 should block this, but I don't know enough about it to be certain. So if in doubt, better not deploy it, I guess. 
[14:41:02] T90004: Enable Flickr import for all users on Commons - https://phabricator.wikimedia.org/T90004 [14:41:35] eddiegp: same here :( in that case I would prefer not to deploy it [14:41:44] I'll leave a comment in gerrit [14:41:57] Except Jhs is back before the window is over and can answer that :) [14:42:23] eddiegp: sure, I'm around, I guess he'll see my pings and reply [14:43:29] (03CR) 10Zfilipin: "This was scheduled for EU SWAT today but I did not deploy it because I am not familiar with the feature, and from the comments here and in" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397129 (https://phabricator.wikimedia.org/T182534) (owner: 10Jon Harald Søby) [14:43:36] !log EU SWAT finished [14:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:50] (03CR) 10Filippo Giunchedi: [C: 031] logstash: activate prometheus elasticsearch exporter [puppet] - 10https://gerrit.wikimedia.org/r/398026 (https://phabricator.wikimedia.org/T181627) (owner: 10Gehel) [14:43:55] (03CR) 10Elukey: Restrict read permissions to the config file when SSL is enabled (032 comments) [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/398035 (owner: 10Elukey) [14:44:02] Steinsplitter: around? [14:44:22] (03CR) 10Filippo Giunchedi: [C: 031] elasticsearch: deploy prometheus-elasticsearch-exporter [puppet] - 10https://gerrit.wikimedia.org/r/398025 (https://phabricator.wikimedia.org/T181627) (owner: 10Gehel) [14:44:42] (03CR) 10Ottomata: [C: 031] role::cache::canary: add a test Varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/397765 (owner: 10Elukey) [14:45:31] eddiegp: hi :) [14:45:59] Steinsplitter: Can you help to clarify the concerns about T182534 ? [14:46:01] T182534: Add upload_by_url to Extended uploaders (Wikimedia Commons) - https://phabricator.wikimedia.org/T182534 [14:46:30] ddiegp: the license review stuff is still done on clent side (using js) or it have been moved to server side?
[14:46:38] Jhs signed it up for swat, unfortunately he can't be here right now and asked me to test it. zeljkof had some concerns to deploy it because of the comments though. [14:46:41] (03PS2) 10Elukey: Restrict read permissions to the config file when SSL is enabled [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/398035 [14:47:12] eddiegp: 5 min, i have to check. [14:48:01] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen [14:48:05] Steinsplitter: I'm unfortunately not really knowledgeable about that at all :/ Should have looked at the task comments earlier when Jhs asked me about it. [14:48:59] (03PS1) 10Volans: Icinga: escape $ not in macro [puppet] - 10https://gerrit.wikimedia.org/r/398054 (https://phabricator.wikimedia.org/T170353) [14:49:29] eddiegp: the patch needs re-write to exclude flickr upload at Special:Upload. [14:49:37] *at Sepecal:UploadWizard [14:49:49] (03CR) 10Volans: [C: 032] Icinga: escape $ not in macro [puppet] - 10https://gerrit.wikimedia.org/r/398054 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [14:50:19] Steinsplitter: Okay. So we're not going to deploy it then, I'll tell Jhs when he's around later. Thanks anyways :) [14:50:25] I'm here now! [14:50:43] Is there an issue? [14:51:04] (03PS4) 10Gehel: elasticsearch: deploy prometheus-elasticsearch-exporter [puppet] - 10https://gerrit.wikimedia.org/r/398025 (https://phabricator.wikimedia.org/T181627) [14:51:10] Yeah, something about the flickr upload (see backscroll, I just asked Steinsplitter about it) [14:51:35] Jhs: I had no time yet to reply on phabricator. upload-by-url is fine, but we need to disable the flickr import tool at Special:Upload because it ignores our blacklist.
[14:51:40] (03CR) 10Gehel: [C: 032] elasticsearch: deploy prometheus-elasticsearch-exporter [puppet] - 10https://gerrit.wikimedia.org/r/398025 (https://phabricator.wikimedia.org/T181627) (owner: 10Gehel) [14:51:42] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [14:51:42] RECOVERY - puppet last run on lvs4007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:51:55] yeah, I saw Steinsplitter's comment, but can't see how those two things are related [14:52:11] Jhs: People will upload copyvios then? [14:52:14] (03PS4) 10Gehel: logstash: activate prometheus elasticsearch exporter [puppet] - 10https://gerrit.wikimedia.org/r/398026 (https://phabricator.wikimedia.org/T181627) [14:52:39] (03PS1) 10Jcrespo: db1067: Move socket location to the default path [puppet] - 10https://gerrit.wikimedia.org/r/398055 (https://phabricator.wikimedia.org/T148507) [14:52:55] (03CR) 10Ottomata: [C: 031] Restrict read permissions to the config file when SSL is enabled [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/398035 (owner: 10Elukey) [14:53:13] Jhs: I have no idea if this has been fixed. If yes, we can merge the task(?), but i had no time yet to do the neccesary checks. (cc: zhuyifei1999_) [14:53:20] Steinsplitter, that's an issue with those users if they do that. they should be removed from that group then [14:53:31] (03CR) 10Elukey: [C: 032] Restrict read permissions to the config file when SSL is enabled [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/398035 (owner: 10Elukey) [14:53:48] (03PS2) 10Jcrespo: db1067: Move socket location to the default path [puppet] - 10https://gerrit.wikimedia.org/r/398055 (https://phabricator.wikimedia.org/T148507) [14:55:02] Steinsplitter: ? 
[14:55:28] zhuyifei1999_: your comment at T182534 [14:55:28] T182534: Add upload_by_url to Extended uploaders (Wikimedia Commons) - https://phabricator.wikimedia.org/T182534 [14:55:31] (03PS2) 10Andrew Bogott: nova fullstack: update image for testing [puppet] - 10https://gerrit.wikimedia.org/r/397997 [14:55:34] zhuyifei1999_: is this still a issue? [14:56:13] I simply fear people uploading from flickr with UW and UW reviews them client side [14:56:15] (03CR) 10Gehel: [C: 032] logstash: activate prometheus elasticsearch exporter [puppet] - 10https://gerrit.wikimedia.org/r/398026 (https://phabricator.wikimedia.org/T181627) (owner: 10Gehel) [14:56:21] (03CR) 10Andrew Bogott: [C: 032] nova fullstack: update image for testing [puppet] - 10https://gerrit.wikimedia.org/r/397997 (owner: 10Andrew Bogott) [14:56:34] (03PS3) 10Andrew Bogott: nova fullstack: update image for testing [puppet] - 10https://gerrit.wikimedia.org/r/397997 [14:56:48] should now be filtered https://commons.wikimedia.org/wiki/Special:AbuseFilter/history/70/diff/prev/1759 [14:57:41] (03PS3) 10Rush: openstack: contain classes for dependency handling [puppet] - 10https://gerrit.wikimedia.org/r/397903 (https://phabricator.wikimedia.org/T171494) [14:57:56] zhuyifei1999_: ok good, so you think we can merge the task? looks ok for me. [14:58:08] (03PS1) 10Jcrespo: mariadb: Depool db1067 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398056 (https://phabricator.wikimedia.org/T175672) [14:58:21] uh sure [14:58:36] (03PS4) 10Gehel: elasticsearch: activate prometheus elasticsearch exporter [puppet] - 10https://gerrit.wikimedia.org/r/398027 (https://phabricator.wikimedia.org/T181627) [14:58:57] Jhs/eddiegp: It appeas it is filtered by abuse filter, so we can merge it. Sorry for the confusion. 
[14:59:08] Steinsplitter, (Y) [14:59:17] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1067 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398056 (https://phabricator.wikimedia.org/T175672) (owner: 10Jcrespo) [14:59:22] zeljkof, are we to late to do that now? Still some seconds left of the window ;P [15:00:57] (03CR) 10Steinsplitter: [C: 031] "We re-checked, problematic stuff is filtere: https://commons.wikimedia.org/wiki/Special:AbuseFilter/history/70/diff/prev/1759" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397129 (https://phabricator.wikimedia.org/T182534) (owner: 10Jon Harald Søby) [15:00:59] (03CR) 10Gehel: [C: 032] elasticsearch: activate prometheus elasticsearch exporter [puppet] - 10https://gerrit.wikimedia.org/r/398027 (https://phabricator.wikimedia.org/T181627) (owner: 10Gehel) [15:01:01] (03Merged) 10jenkins-bot: mariadb: Depool db1067 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398056 (https://phabricator.wikimedia.org/T175672) (owner: 10Jcrespo) [15:01:03] (03PS1) 10Elukey: modules::varnishkafka: update to latest sha [puppet] - 10https://gerrit.wikimedia.org/r/398057 [15:01:05] (03CR) 10jenkins-bot: mariadb: Depool db1067 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398056 (https://phabricator.wikimedia.org/T175672) (owner: 10Jcrespo) [15:01:28] (03CR) 10jerkins-bot: [V: 04-1] modules::varnishkafka: update to latest sha [puppet] - 10https://gerrit.wikimedia.org/r/398057 (owner: 10Elukey) [15:02:13] (03PS2) 10Elukey: modules::varnishkafka: update to latest sha [puppet] - 10https://gerrit.wikimedia.org/r/398057 [15:02:32] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 (duration: 01m 07s) [15:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:53] (03PS4) 10Rush: openstack: contain classes for dependency handling [puppet] - 10https://gerrit.wikimedia.org/r/397903 (https://phabricator.wikimedia.org/T171494)
[15:02:56] !log upgrade and restart db1067 [15:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:02] godog: moritzm is it known that puppet is broken on labcontrol from prometheus exporter issues? [15:04:10] Error: Could not set 'present' on ensure: No such file or directory - /etc/prometheus/rabbitmq-exporter.yaml20171213-23524-18uts4n.lock at 16:/etc/puppet/modules/rabbitmq/manifests/monitor.pp [15:04:29] having a look [15:05:05] (03PS4) 10Gehel: elasticsearch: configure prometheus to collect metrics from elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/398051 (https://phabricator.wikimedia.org/T181627) [15:05:35] chasemp: puppet runs fine on labcontrol1001? [15:05:38] 10Operations, 10cloud-services-team: labcontrol* puppet broken due to prometheus rabbit exporter - https://phabricator.wikimedia.org/T182779#3834301 (10chasemp) p:05Triage>03Normal [15:05:39] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1067 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398058 [15:05:47] seems labcontrol1002 is broken and 1001 is not moritzm [15:05:50] oddly [15:06:22] 10Operations, 10cloud-services-team: labcontrol1002 puppet broken due to prometheus rabbit exporter - https://phabricator.wikimedia.org/T182779#3834324 (10chasemp) [15:06:57] k, I'll make a patch [15:07:07] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834330 (10akosiaris) A fresh clone of `http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/editquality` on bast1001 does n... 
[15:07:24] moritzm: thanks fyi T182779 [15:07:25] T182779: labcontrol1002 puppet broken due to prometheus rabbit exporter - https://phabricator.wikimedia.org/T182779 [15:08:13] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: TEST - IGNORE - volans https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen [15:08:23] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [15:08:43] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen [15:09:35] moritzm: does this ring any bells? /usr/bin/sv status /etc/sv/nfs-kernel-server' returned 1: fail: /etc/sv/nfs-kernel-server: unable to change to service directory: file does not exist [15:09:40] (03PS3) 10Jcrespo: db1067: Move socket location to the default path [puppet] - 10https://gerrit.wikimedia.org/r/398055 (https://phabricator.wikimedia.org/T148507) [15:09:54] 10Operations, 10cloud-services-team: Puppet broken on labstore1004 - https://phabricator.wikimedia.org/T182781#3834354 (10chasemp) p:05Triage>03High [15:10:08] chasemp: no.
haven't seen that [15:10:30] huh [15:10:30] Warning: Found multiple default providers for service: runit, debian; using runit [15:10:41] (03PS4) 10Jcrespo: db1067: Move socket location to the default path [puppet] - 10https://gerrit.wikimedia.org/r/398055 (https://phabricator.wikimedia.org/T148507) [15:10:46] 10Operations, 10cloud-services-team: Puppet broken on labstore1004 - https://phabricator.wikimedia.org/T182781#3834332 (10chasemp) [15:12:11] (03CR) 10Jcrespo: [C: 032] db1067: Move socket location to the default path [puppet] - 10https://gerrit.wikimedia.org/r/398055 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [15:12:23] Jhs: sorry, just saw your ping :) [15:12:28] it was late anyway [15:12:59] 10Operations, 10cloud-services-team: Puppet broken on labstore1004 - https://phabricator.wikimedia.org/T182781#3834373 (10chasemp) root@labstore1004:~# aptitude why runit i vblade-persist Depends runit (>= 1.8.0-2) [15:13:27] 10Operations, 10cloud-services-team: labcontrol1002 puppet broken due to prometheus rabbit exporter - https://phabricator.wikimedia.org/T182779#3834375 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff Fixed (directory will be shipped by the prometheus-rabbitmq-exporter package), created the... 
[15:15:27] (03PS5) 10Gehel: elasticsearch: configure prometheus to collect metrics from logstash [puppet] - 10https://gerrit.wikimedia.org/r/398051 (https://phabricator.wikimedia.org/T181627) [15:15:29] (03PS1) 10Gehel: elasticsearch: configure prometheus to collect metrics from both logstash and elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/398059 (https://phabricator.wikimedia.org/T181627) [15:15:41] godog: ^ [15:15:56] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: configure prometheus to collect metrics from both logstash and elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/398059 (https://phabricator.wikimedia.org/T181627) (owner: 10Gehel) [15:16:22] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:19:44] gehel: looks reversed? [15:19:56] gehel: ah no nevermind, I misread [15:20:01] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834452 (10mmodell) @akosiaris: scap //should// be getting the hash from the submodule pointers contained at `HEAD` of `tin.eqiad... [15:20:08] 10Operations, 10cloud-services-team: Puppet broken on labstore1004 - https://phabricator.wikimedia.org/T182781#3834453 (10chasemp) ```root@labstore1004:~# apt-get remove --purge vblade-persist Reading package lists... Done Building dependency tree Reading state information... Done The following packages were a... 
[15:20:22] (03CR) 10Filippo Giunchedi: [C: 031] elasticsearch: configure prometheus to collect metrics from logstash [puppet] - 10https://gerrit.wikimedia.org/r/398051 (https://phabricator.wikimedia.org/T181627) (owner: 10Gehel) [15:20:53] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834455 (10akosiaris) the ores submodule btw is in the exact same state and also fails to checkout ``` akosiaris@tin:/srv/deploym... [15:21:33] 10Operations, 10cloud-services-team: Puppet broken on labstore1004 - https://phabricator.wikimedia.org/T182781#3834457 (10chasemp) ```root@labstore1004:~# apt-get remove --purge runit Reading package lists... Done Building dependency tree Reading state information... Done The following packages were automatica... [15:21:51] !log remove and purge vblade-persist and runit from labstore1004 T182781 [15:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:03] T182781: Puppet broken on labstore1004 - https://phabricator.wikimedia.org/T182781 [15:22:38] (03PS3) 10Cmjohnson: Removing site.pp and dhcpd file entries for mc1001-18 T164341 [puppet] - 10https://gerrit.wikimedia.org/r/397906 [15:24:33] 10Operations, 10cloud-services-team: Puppet broken on labstore1004 - https://phabricator.wikimedia.org/T182781#3834472 (10chasemp) fyi @madhuvishy I don't understand why this behavior started recently as it appears runit would have been installed since sept 6 (thanks @akosiaris ) Very uncomfortable with thi... [15:25:07] 10Operations, 10cloud-services-team: Puppet broken on labstore1004 - https://phabricator.wikimedia.org/T182781#3834474 (10chasemp) ```root@labstore1005:~# aptitude search runit p r-cran-runit - GNU R package providing unit testing framework p runit... 
[15:26:10] (03CR) 10Cmjohnson: [C: 032] Removing site.pp and dhcpd file entries for mc1001-18 T164341 [puppet] - 10https://gerrit.wikimedia.org/r/397906 (owner: 10Cmjohnson) [15:26:12] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:26:22] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834476 (10mmodell) ah ha! I figured _something_ out at least! The 15d5283b commit is in origin/master it just hasn't been merged... [15:26:30] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834478 (10akosiaris) >>! In T181661#3834452, @mmodell wrote: > @akosiaris: scap //should// be getting the hash from the submodule... [15:27:09] (03CR) 10Elukey: "pcc https://puppet-compiler.wmflabs.org/compiler02/9333/" [puppet] - 10https://gerrit.wikimedia.org/r/398057 (owner: 10Elukey) [15:27:18] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3834483 (10MoritzMuehlenhoff) [15:27:25] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Create Prometheus exporter for Etherpad - https://phabricator.wikimedia.org/T182095#3834481 (10MoritzMuehlenhoff) 05Open>03Resolved An exporter has been written, packaged (software/debs/prometheus-etherpad-ex... 
[15:27:38] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3650139 (10MoritzMuehlenhoff) [15:28:04] (03CR) 10Elukey: "BBlack/Ema: this is a change that should happen before the new varnishkafka jumbo test instance." [puppet] - 10https://gerrit.wikimedia.org/r/398057 (owner: 10Elukey) [15:28:38] tx moritzm [15:30:05] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834498 (10mmodell) >>! In T181661#3834478, @akosiaris wrote: >>>! In T181661#3834452, @mmodell wrote: >> @akosiaris: scap //shoul... [15:30:48] (03PS12) 10Andrew Bogott: WMCS: set puppet_major_version to 4 [puppet] - 10https://gerrit.wikimedia.org/r/397711 (https://phabricator.wikimedia.org/T178717) [15:30:51] (03PS1) 10Andrew Bogott: bootstrapvz: don't create duplicate (differently-named) entries in sources.list.d [puppet] - 10https://gerrit.wikimedia.org/r/398062 [15:30:55] (03PS2) 10Marostegui: db-eqiad,db-codfw.php: Remove db1034 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398046 (https://phabricator.wikimedia.org/T182556) [15:31:15] (03CR) 10jerkins-bot: [V: 04-1] bootstrapvz: don't create duplicate (differently-named) entries in sources.list.d [puppet] - 10https://gerrit.wikimedia.org/r/398062 (owner: 10Andrew Bogott) [15:32:09] (03CR) 10Andrew Bogott: [C: 032] WMCS: set puppet_major_version to 4 [puppet] - 10https://gerrit.wikimedia.org/r/397711 (https://phabricator.wikimedia.org/T178717) (owner: 10Andrew Bogott) [15:33:07] (03PS2) 10Andrew Bogott: bootstrapvz: don't create duplicate entries in sources.list.d [puppet] - 10https://gerrit.wikimedia.org/r/398062 [15:33:53] (03CR) 10Andrew Bogott: [C: 032] bootstrapvz: don't create duplicate entries in sources.list.d [puppet] - 10https://gerrit.wikimedia.org/r/398062 
(owner: 10Andrew Bogott) [15:34:48] !log akosiaris@tin Started deploy [ores/deploy@b4f2b02]: (no justification provided) [15:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:43] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834505 (10mmodell) so @awight, can you enlighten me about your scap.cfg? Is git_rev: origin/master intentional? If then I think... [15:37:43] (03CR) 10Ema: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/397765 (owner: 10Elukey) [15:37:54] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834510 (10akosiaris) Aha! nice find. It looks like it's been there since the very beginning. See fd1067ff4da. It has undergone a... [15:39:09] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Remove db1034 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398046 (https://phabricator.wikimedia.org/T182556) (owner: 10Marostegui) [15:39:28] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3834518 (10MoritzMuehlenhoff) [15:39:30] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Package PDNS Recursor collector for Prometheus and adapt metrics - https://phabricator.wikimedia.org/T181620#3834516 (10MoritzMuehlenhoff) 05Open>03Resolved Exporter has been written, packaged (software/debs/... 
[15:39:34] !log akosiaris@tin Finished deploy [ores/deploy@b4f2b02]: (no justification provided) (duration: 04m 46s) [15:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:38] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db1034 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398046 (https://phabricator.wikimedia.org/T182556) (owner: 10Marostegui) [15:40:48] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db1034 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398046 (https://phabricator.wikimedia.org/T182556) (owner: 10Marostegui) [15:40:59] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834521 (10mmodell) I think I should add a NOTICE to scap that says something along the lines of "Deploying from non-default origi... [15:42:15] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1034 from config - T182556 (duration: 01m 09s) [15:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:26] T182556: Decommission db1034 - https://phabricator.wikimedia.org/T182556 [15:42:57] herron: are you doing 'apt get install puppet' batches today? How are things going? 
[15:43:03] I'm about ready to start doing that on VMs as well [15:43:34] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db1034 from config - T182556 (duration: 01m 07s) [15:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:01] RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:46:10] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834531 (10akosiaris) > I 've crafted a commit on tin removing that line and retrying a scap deploy from tin just for ores1004. O... [15:46:23] !log akosiaris@tin Started deploy [ores/deploy@b4f2b02]: (no justification provided) [15:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:26] 10Operations, 10DBA, 10Patch-For-Review: Decommission db1034 - https://phabricator.wikimedia.org/T182556#3834533 (10Marostegui) [15:47:35] 10Operations, 10DBA, 10Patch-For-Review: Decommission db1034 - https://phabricator.wikimedia.org/T182556#3827026 (10Marostegui) [15:47:53] !log akosiaris@tin Finished deploy [ores/deploy@b4f2b02]: (no justification provided) (duration: 01m 30s) [15:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:21] (03PS5) 10Rush: openstack: contain classes for dependency handling [puppet] - 10https://gerrit.wikimedia.org/r/397903 (https://phabricator.wikimedia.org/T171494) [15:48:35] !log akosiaris@tin Started deploy [ores/deploy@b4f2b02]: (no justification provided) [15:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:08] (03CR) 10Rush: [C: 032] openstack: contain classes for dependency handling [puppet] - 10https://gerrit.wikimedia.org/r/397903 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [15:50:55] !log akosiaris@tin Finished deploy 
[ores/deploy@b4f2b02]: (no justification provided) (duration: 02m 22s) [15:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:02] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834565 (10akosiaris) >>! In T181661#3834531, @akosiaris wrote: >> I 've crafted a commit on tin removing that line and retrying a... [15:56:55] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3834567 (10elukey) Tried to connect with install-console when the late_command failure msg is prompted, and this is the output: ``` ~ # sh /tmp/late_command + mkdir /target/roo... [15:59:09] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Patch-For-Review, 10Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3834573 (10jcrespo) @aaron After spending a whole day on this, while proxysql is tec... [15:59:26] andrewbogott going well! 1198 systems done and now experimenting with the puppetlabs trusty packages [15:59:42] cool [15:59:48] I'll start upgrading some things soon then [16:01:06] the puppetlabs 'puppet-agent' package includes a newer facter which breaks our ipaddress fact [16:01:09] sounds good! 
[16:01:24] (03PS1) 10Cmjohnson: removing dns entries for decom hosts mc1001-1018 T164341 [dns] - 10https://gerrit.wikimedia.org/r/398066 [16:02:10] we have a regex in that fact that expects trailing whitespace and from what I can tell the newer version now removes leading/trailing whitespace [16:03:05] herron: I've tried yesterday to get one of my puppetdbs to a newer version, it's a mess [16:03:25] (03PS6) 10Gehel: elasticsearch: configure prometheus to collect metrics from logstash [puppet] - 10https://gerrit.wikimedia.org/r/398051 (https://phabricator.wikimedia.org/T181627) [16:03:26] volans which package did you use? [16:03:26] of interdependencies and version incompatibilities [16:03:48] 10Operations, 10ops-codfw, 10Analytics, 10DC-Ops, 10Patch-For-Review: Decomission eventlog2001 - https://phabricator.wikimedia.org/T182397#3834590 (10Papaul) [16:04:05] the puppetlabs one, I tried various versions [16:04:27] (03CR) 10Gehel: [C: 032] elasticsearch: configure prometheus to collect metrics from logstash [puppet] - 10https://gerrit.wikimedia.org/r/398051 (https://phabricator.wikimedia.org/T181627) (owner: 10Gehel) [16:05:07] (03PS2) 10Gehel: elasticsearch: configure prometheus to collect metrics from elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/398059 (https://phabricator.wikimedia.org/T181627) [16:05:22] PROBLEM - Host labtestcontrol2003 is DOWN: PING CRITICAL - Packet loss = 100% [16:08:35] ^me [16:08:50] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834600 (10awight) @mmodell Tangential note, I've been happy using `git clone --depth 1` on personal projects. Would that make any sense for s... 
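A minimal sketch of the facter problem herron describes above (a regex that expects trailing whitespace, fed by a newer tool that strips it). The actual regex in the ipaddress fact is not shown in this log, so both patterns, the sample address, and the helper name `extract` are hypothetical and only illustrate the failure mode.

```python
import re

# Hypothetical pattern that insists on trailing whitespace after the address,
# standing in for the regex mentioned in the channel (the real one is not shown).
STRICT = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s+$")

# A more tolerant variant: surrounding whitespace is optional.
TOLERANT = re.compile(r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s*$")


def extract(pattern, raw):
    """Return the captured address, or None if the pattern does not match."""
    match = pattern.match(raw)
    return match.group(1) if match else None


# Assumed inputs: the older tool keeps trailing whitespace, the newer one strips it.
old_output = "192.0.2.10 \n"
new_output = old_output.strip()

strict_old = extract(STRICT, old_output)      # matches: whitespace is present
strict_new = extract(STRICT, new_output)      # no match: whitespace was stripped
tolerant_old = extract(TOLERANT, old_output)  # matches
tolerant_new = extract(TOLERANT, new_output)  # matches
```

Making the whitespace optional is one way to keep such a fact working across both versions; whether that is the right fix for the real fact depends on code that is not visible here.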
[16:09:01] RECOVERY - Host labtestcontrol2003 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [16:09:05] herron: o/ - do you have any idea if https://phabricator.wikimedia.org/T182702#3834567 can be related to any puppet things that you are working on ? [16:09:23] (Jessie reimages are broken) [16:10:49] <_joe_> elukey: did you try to run that script by hand? [16:10:55] (03PS1) 10Muehlenhoff: Add .gitreview file [debs/prometheus-wdqs-updater-exporter] - 10https://gerrit.wikimedia.org/r/398069 [16:11:16] _joe_ yep, apt-install puppet ends up with a 100 [16:11:17] <_joe_> oh I see [16:11:30] <_joe_> 100 means something is broken as far as packages go [16:12:07] <_joe_> elukey: you need an apt update between when you get the key for the wikimedia repo and when you install packages maybe? [16:12:24] (03PS3) 10Gehel: elasticsearch: configure prometheus to collect metrics from elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/398059 (https://phabricator.wikimedia.org/T181627) [16:12:52] <_joe_> herron: did you import the puppet packages in jesse-wikimedia, by any chance? [16:13:05] _joe_ yes, yesterdy [16:13:22] <_joe_> ok that might be it [16:13:53] <_joe_> I'd do the following, people: use the wikimedia-jessie docker image, and test installing puppet there [16:14:12] <_joe_> it's basically very similar to our install image [16:14:17] <_joe_> see if you can repro there [16:15:10] <_joe_> docker pull docker-registry.wikimedia.org/wikimedia-jessie:latest in case you wonder :P [16:16:24] godog: yup, looking [16:19:25] !log try again deleting obsolete cassandra metrics from graphite2002 - T181964 [16:19:30] thcipriani: sweet, thanks! 
[16:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:34] T181964: Clean metrics for restbase erroneus legacy tables from cassandra 3 cluster - https://phabricator.wikimedia.org/T181964 [16:21:28] (03CR) 10Gehel: [C: 032] elasticsearch: configure prometheus to collect metrics from elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/398059 (https://phabricator.wikimedia.org/T181627) (owner: 10Gehel) [16:28:41] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834679 (10mmodell) @awight: from what I understand, git has to do a lot of extra work on the server side in order to build to shallow clone. I... [16:29:55] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add .gitreview file [debs/prometheus-wdqs-updater-exporter] - 10https://gerrit.wikimedia.org/r/398069 (owner: 10Muehlenhoff) [16:30:08] !log Deploy schema change on s4 on dbstore1001 - T174569 [16:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:21] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [16:30:45] (03CR) 10Ema: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/398057 (owner: 10Elukey) [16:31:14] _joe_ tried the docker image, without apt-get update I get the 100, otherwise apt-get install puppet works [16:33:19] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3834687 (10MoritzMuehlenhoff) I think I found the problem: With the switch to puppet 4, puppet it has gained several new deps: Package: puppet Version: 4.8.2-5~bpo8+1 Depends:... 
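The docker test elukey reports above (exit code 100 without `apt-get update`, success with it) can be sketched as an ordered command sequence. This is an illustration only: `install_commands`, `run_all`, and `make_fake_runner` are hypothetical names, and the fake runner merely simulates the behaviour described in the channel rather than calling apt. The one fact taken from apt itself is that `apt-get` exits with 100 on error.

```python
def install_commands(packages, refresh_index=True):
    """Build the apt command sequence, optionally refreshing the index first."""
    commands = []
    if refresh_index:
        commands.append(["apt-get", "update"])
    commands.append(["apt-get", "install", "-y", *packages])
    return commands


def run_all(commands, runner):
    """Run commands in order; return (failed_command, exit_code) or (None, 0)."""
    for command in commands:
        exit_code = runner(command)
        if exit_code != 0:
            return command, exit_code
    return None, 0


def make_fake_runner():
    """Simulated runner (assumption): install fails with 100 unless the
    package index was refreshed earlier in the same session."""
    state = {"index_fresh": False}

    def runner(command):
        if command[:2] == ["apt-get", "update"]:
            state["index_fresh"] = True
            return 0
        if command[:2] == ["apt-get", "install"]:
            return 0 if state["index_fresh"] else 100
        return 0

    return runner


without_update = run_all(install_commands(["puppet"], refresh_index=False),
                         make_fake_runner())
with_update = run_all(install_commands(["puppet"]), make_fake_runner())
```

In a real environment the runner would be something like a thin wrapper around `subprocess.run(command).returncode`; it is injected here so the ordering logic can be exercised without touching the system.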
[16:33:33] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Patch-For-Review, 10Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3834688 (10jcrespo) I think because maintenance hosts use php (?) it should be easie... [16:33:44] herron: I think I found the problem with the puppet install during jessie d-i: https://phabricator.wikimedia.org/T182702#3834687 [16:35:30] moritzm great! I was just trying to enable jessie-backports on db1111 [16:35:41] but instead you think we should import the package? [16:36:41] during that early stage of d-i (for packages needed in late_command we should rather import it), at least we've been doing that for similar packages so far [16:37:51] ok sounds good [16:38:36] lemme know when the issue is fixed, I'll test it on kafka1023's install-console [16:38:45] and report back [16:38:51] will do [16:39:41] herron: shall I import it or are you on it? [16:39:50] I'm importing it now [16:39:52] k [16:41:40] (03CR) 10Cmjohnson: [C: 032] removing dns entries for decom hosts mc1001-1018 T164341 [dns] - 10https://gerrit.wikimedia.org/r/398066 (owner: 10Cmjohnson) [16:42:18] (03PS2) 10Jcrespo: mariadb::parsercache: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/397990 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [16:42:27] (03CR) 10Jcrespo: [C: 032] mariadb::parsercache: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/397990 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [16:42:47] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Package RabbitMQ exporter for Prometheus and adapt metrics - https://phabricator.wikimedia.org/T181802#3834746 (10fgiunchedi) I tried the first iteration of rabbitmq exporter, pasting here the metrics for referen... 
[16:43:46] (03CR) 10Jcrespo: [C: 031] "This is ok, but I would like to solve T177714 at the same time or soonish." [puppet] - 10https://gerrit.wikimedia.org/r/397912 (https://phabricator.wikimedia.org/T182713) (owner: 10Anomie) [16:46:31] (03CR) 10Jcrespo: [C: 04-1] "Actually, this is not ok, it sets port as 0 with missing ports, but that is trivial to fix." [puppet] - 10https://gerrit.wikimedia.org/r/397912 (https://phabricator.wikimedia.org/T182713) (owner: 10Anomie) [16:48:39] (03PS1) 10Muehlenhoff: Add Prometheus exporter for WDQS Updater [debs/prometheus-wdqs-updater-exporter] - 10https://gerrit.wikimedia.org/r/398072 [16:49:34] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3831866 (10herron) ruby-deep-merge has been imported to jessie-wikimedia and that appears to have solved this problem. `db1111` no longer errors out from missing deps when atte... [16:50:18] elukey ready for testing with kafka1023 [16:50:37] * elukey tests [16:51:04] (03CR) 10Filippo Giunchedi: "See inline, we'll also need to adapt metric names to be more in line with prometheus guidelines (see sample of metrics at https://phabrica" (036 comments) [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/398003 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [16:58:48] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834785 (10mmodell) [17:01:57] (03PS2) 10Jcrespo: Fix 'sql' script for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/397912 (https://phabricator.wikimedia.org/T182713) (owner: 10Anomie) [17:02:19] (03PS1) 10Muehlenhoff: Add Prometheus exporter to WDQS servers [puppet] - 10https://gerrit.wikimedia.org/r/398073 (https://phabricator.wikimedia.org/T182773) [17:07:39] (03CR) 10Jcrespo: [C: 031] "Now it works for:" [puppet] - 
10https://gerrit.wikimedia.org/r/397912 (https://phabricator.wikimedia.org/T182713) (owner: 10Anomie) [17:09:29] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: New, mysterious scap failure - https://phabricator.wikimedia.org/T182801#3834842 (10awight) p:05Triage>03High [17:09:32] 10Operations, 10Traffic, 10Interdatacenter-IPsec: Enable IPSec between datacenters - https://phabricator.wikimedia.org/T81543#3834853 (10faidon) [17:10:40] (03CR) 10Jcrespo: "Should we enable TLS already? Should we fix also T177714?" [puppet] - 10https://gerrit.wikimedia.org/r/397913 (owner: 10Anomie) [17:10:50] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834855 (10mmodell) >>! In T181661#3834679, @mmodell wrote: > @awight: from what I understand, git has to do a lot of extra work on the server... [17:12:47] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834860 (10awight) [17:13:11] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: New, mysterious scap failure - https://phabricator.wikimedia.org/T182801#3834858 (10awight) 05Open>03Invalid /srv is full. Strange that there was no error message during deployment, though... [17:13:31] herron: all gooooood! \o/ [17:13:37] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: New, mysterious scap failure - https://phabricator.wikimedia.org/T182801#3834862 (10mmodell) strange indeed. Full disk can case all sorts of weird behaviors though. [17:13:41] woohoo! [17:13:51] sorry for the headache! [17:14:28] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3834863 (10elukey) My last d-i went fine! 
I kept in place the live hack on install1002, let's puppetize it if the upstream fix will not come soon. [17:14:44] herron: nah all fine, glad that we solved it :) [17:14:54] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: New, mysterious scap failure - https://phabricator.wikimedia.org/T182801#3834866 (10awight) >>! In T182801#3834862, @mmodell wrote: > strange indeed. Full disk can case all sorts of weird behaviors though. +1 This might n... [17:21:17] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3834873 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1111.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei... [17:21:53] (03PS1) 10Awight: Add Icelandic dictionary for ORES on iswiki [puppet] - 10https://gerrit.wikimedia.org/r/398078 (https://phabricator.wikimedia.org/T181099) [17:25:57] (03PS4) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add reporter script [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) [17:25:59] (03PS1) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades scripts [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) [17:26:21] (03CR) 10jerkins-bot: [V: 04-1] apt: unattended-upgrades: add reporter script [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [17:26:26] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1067 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398058 (owner: 10Jcrespo) [17:26:32] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1067 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398058 [17:26:38] (03CR) 10jerkins-bot: [V: 04-1] apt: unattended-upgrades: add targetted upgrades scripts [puppet] - 
10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [17:27:42] (03PS15) 10TerraCodes: Remove single editor tab for plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045) [17:27:58] (03PS29) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [17:30:02] !log installing wireshark security updates [17:30:14] 10Operations, 10ops-codfw, 10Analytics, 10DC-Ops, 10Patch-For-Review: Decomission eventlog2001 - https://phabricator.wikimedia.org/T182397#3834916 (10Papaul) [17:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:52] !log awight@tin Started deploy [ores/deploy@b67bba7]: (non-production) Update ORES on new cluster [17:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:51] !log awight@tin Finished deploy [ores/deploy@b67bba7]: (non-production) Update ORES on new cluster (duration: 00m 59s) [17:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:15] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341#3834921 (10Cmjohnson) [17:35:20] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3834939 (10awight) Looks like I'm getting the same error. > commit b67bba77acb7c0ffc678201c9f3f54f198da6650 > > scap deploy -v -l "ores*" "(no... 
[17:36:13] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3834941 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1111.eqiad.wmnet'] ``` [17:38:55] (03PS2) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades scripts [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) [17:39:27] (03CR) 10jerkins-bot: [V: 04-1] apt: unattended-upgrades: add targetted upgrades scripts [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [17:39:51] 10Operations, 10ops-eqiad, 10DC-Ops: Complete decom process for server caesium - https://phabricator.wikimedia.org/T182805#3834960 (10Cmjohnson) [17:53:49] PROBLEM - puppet last run on mw2184 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tshark] [17:57:29] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tshark] [18:01:33] PROBLEM - Check the NTP synchronisation status of timesyncd on db1111 is CRITICAL: Return code of 255 is out of bounds [18:02:03] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[tshark],Package[libgsl0-dev] [18:03:14] PROBLEM - DPKG on db1111 is CRITICAL: Return code of 255 is out of bounds [18:05:03] PROBLEM - Disk space on db1111 is CRITICAL: Return code of 255 is out of bounds [18:08:24] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: connect to address 208.80.153.75 and port 35357: Connection refused [18:08:38] 10Puppet, 10cloud-services-team (Kanban): Remove role::puppet::self and related support code - https://phabricator.wikimedia.org/T182810#3835105 (10Andrew) [18:09:15] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3835120 (10Gehel) elasticsearch_exporter is deployed on all elasticsearch nodes. Still to do: * same work on jmx_exporter * update... [18:09:57] 10Puppet, 10cloud-services-team (Kanban): Remove role::puppet::self and related support code - https://phabricator.wikimedia.org/T182810#3835137 (10Andrew) [18:09:59] 10Operations, 10Cloud-Services, 10Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#3835136 (10Andrew) [18:10:13] PROBLEM - keystone public endoint port 5000 on labtestcontrol2003 is CRITICAL: connect to address 208.80.153.75 and port 5000: Connection refused [18:10:30] 10Puppet, 10cloud-services-team (Kanban): Remove role::puppet::self and related support code - https://phabricator.wikimedia.org/T182810#3835105 (10Andrew) [18:11:41] (03PS2) 10Dzahn: Releases jenkins: Ensure php-curl is present (version doesn't matter) [puppet] - 10https://gerrit.wikimedia.org/r/397993 (owner: 10Chad) [18:11:53] PROBLEM - configured eth on db1111 is CRITICAL: Return code of 255 is out of bounds [18:13:33] PROBLEM - dhclient process on db1111 is CRITICAL: Return code of 255 is out of bounds [18:14:51] moritzm: new trusty 
installs still do the "Error: Could not set 'present' on ensure: No such file or directory - /etc/prometheus/rabbitmq-exporter.yaml20171213-2288-u1r0dv.lock at 16:/etc/puppet/modules/rabbitmq/manifests/monitor.pp" dance, tha's expected still right? [18:15:23] PROBLEM - puppet last run on db1111 is CRITICAL: Return code of 255 is out of bounds [18:15:27] (03CR) 10Dzahn: [C: 032] Releases jenkins: Ensure php-curl is present (version doesn't matter) [puppet] - 10https://gerrit.wikimedia.org/r/397993 (owner: 10Chad) [18:18:43] PROBLEM - Check systemd state on db1111 is CRITICAL: Return code of 255 is out of bounds [18:20:44] (03PS1) 10Volans: wmf-auto-reimage: improve systemd-specific commands [puppet] - 10https://gerrit.wikimedia.org/r/398087 [18:22:24] RECOVERY - puppet last run on mw2204 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [18:23:53] RECOVERY - puppet last run on mw2184 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:24:25] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Investigate why ORES logs are being written to syslog despite explicit logging config. Fix. - https://phabricator.wikimedia.org/T182614#3835192 (10awight) The explanation is that Celery follows an archaic pattern of hijacking the... 
[18:24:43] PROBLEM - IPMI Sensor Status on db1111 is CRITICAL: Return code of 255 is out of bounds [18:26:23] PROBLEM - MegaRAID on db1111 is CRITICAL: Return code of 255 is out of bounds [18:26:34] (03PS1) 10Rush: openstack: cloud repo explicit apt-key update and apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/398088 (https://phabricator.wikimedia.org/T171494) [18:26:55] (03CR) 10jerkins-bot: [V: 04-1] openstack: cloud repo explicit apt-key update and apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/398088 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:27:03] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:28:18] (03PS2) 10Rush: openstack: cloud repo explicit apt-key update and apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/398088 (https://phabricator.wikimedia.org/T171494) [18:29:10] (03CR) 10Rush: [C: 032] openstack: cloud repo explicit apt-key update and apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/398088 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:29:33] (03CR) 10Ayounsi: "Addressing Alex's feedback" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [18:29:57] (03CR) 10Herron: [C: 031] wmf-auto-reimage: improve systemd-specific commands [puppet] - 10https://gerrit.wikimedia.org/r/398087 (owner: 10Volans) [18:30:07] (03PS5) 10Ayounsi: [WIP] Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [18:31:01] !log releases2001: /srv/mediawiki# rm -rf extensions/ skins/ vendor/ | clean up removed repos, let puppet clone, to match releases1001 and fix puppet run [18:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:46] (03PS1) 10Papaul: Decomm: Remove production and mgmt DNS entries for eventlog2001 [dns] - 10https://gerrit.wikimedia.org/r/398090 (https://phabricator.wikimedia.org/T182397) [18:33:42] 
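For readers puzzled by the repeated "Return code of 255 is out of bounds" alerts for db1111 above: Nagios-style plugins are expected to exit with 0 to 3, and anything else is reported as out of bounds. The sketch below is not Icinga's code; `classify` is a hypothetical helper that mirrors the message format seen in this log. A 255 often means the check command itself could not be run on the target, which would be consistent with a host in the middle of a reimage, though the log does not state the cause.

```python
# Standard Nagios plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def classify(return_code):
    """Map a plugin exit code to a state string.

    Codes outside 0..3 are reported as CRITICAL with an out-of-bounds note,
    matching the wording of the alerts in this log.
    """
    if return_code in STATES:
        return STATES[return_code]
    return f"CRITICAL: Return code of {return_code} is out of bounds"


ok_state = classify(0)
reimage_state = classify(255)
```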
hashar: i remember you fixed check_disk on docker servers, didnt you [18:34:00] (03PS2) 10Muehlenhoff: Add Prometheus exporter for RabbitMQ [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/398003 (https://phabricator.wikimedia.org/T181802) [18:34:14] (03CR) 10Muehlenhoff: Add Prometheus exporter for RabbitMQ (035 comments) [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/398003 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [18:34:20] like to skip /var/lib/docker .. or whoever did it, but it flew by in Gerrit.. and now i see an Icinga alert for that again [18:34:30] (03CR) 10Muehlenhoff: Add Prometheus exporter for RabbitMQ (031 comment) [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/398003 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [18:34:34] (03CR) 10Cmjohnson: [C: 031] Decomm: Remove production and mgmt DNS entries for eventlog2001 [dns] - 10https://gerrit.wikimedia.org/r/398090 (https://phabricator.wikimedia.org/T182397) (owner: 10Papaul) [18:35:12] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler02/9335/" [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [18:35:51] 10Operations, 10Cloud-Services, 10Mail, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3835313 (10Reedy) [18:35:57] chasemp: yeah, that's fixed once we've rolled out the new deb package, it's in review, will be done tomorrow [18:36:11] as a workaround you can simply mkdir /etc/prometheus for now [18:36:13] 10Operations, 10Release Pipeline, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), and 2 others: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3835328 (10Dzahn) 05Resolved>03Open [18:36:13] no worries then moritzm, just making sure [18:36:25] * chasemp nods [18:37:33] 10Operations, 10Release Pipeline, 
10monitoring, 10Continuous-Integration-Infrastructure (shipyard), and 2 others: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3692868 (10Dzahn) sorry to say, but there is one of these in Icinga again ...... [18:38:08] ACKNOWLEDGEMENT - Disk space on lawrencium is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/c1dc9ea10b1d7fc55a9778b0abd9894ccd3eb7520b928bf68a1a626d9304fd16/merged is not accessible: Permission denied daniel_zahn https://phabricator.wikimedia.org/T178454 [18:40:22] (03PS1) 10EBernhardson: Enable Cirrus MLR for 4 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398093 [18:41:11] 10Operations, 10Mail, 10monitoring: prometheus metrics and grafana dashboard for exim - https://phabricator.wikimedia.org/T179302#3835352 (10Dzahn) This ticket seems to be a duplicate of T179565 [18:42:47] 10Operations, 10Cloud-Services, 10Mail, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3835313 (10Bawolff) I just tried security@wikipedia.org and it does not appear that email forwards to security@wikimedia.org so we should do that too. [18:42:53] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3835367 (10Dzahn) [18:42:55] 10Operations, 10Mail, 10monitoring: prometheus metrics and grafana dashboard for exim - https://phabricator.wikimedia.org/T179302#3835364 (10Dzahn) 05Open>03Resolved a:03fgiunchedi work was done in T179565 dashboard here: https://grafana.wikimedia.org/dashboard/db/mail ganglia removed from prod mx he... 
[18:43:58] (03CR) 10Dzahn: "now unblocked by https://phabricator.wikimedia.org/T179565 , https://phabricator.wikimedia.org/T179302 and https://gerrit.wikimedia.org/r/" [puppet] - 10https://gerrit.wikimedia.org/r/382916 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:44:33] (03PS2) 10Dzahn: exim4/multiple roles: remove Ganglia exim stats [puppet] - 10https://gerrit.wikimedia.org/r/382916 (https://phabricator.wikimedia.org/T177225) [18:46:58] 10Operations, 10Cloud-Services, 10Mail, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3835379 (10Bawolff) >>! In T182812#3835360, @Bawolff wrote: > I just tried security@wikipedia.org and it does not appear that email forwards to securi... [18:48:53] 10Operations, 10Cloud-Services, 10Mail, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3835313 (10Dzahn) The exim alias file for wikipedia.org (in private repo) _does_ have this line though: ``` 21 security: security@wikimedia.org... [18:49:33] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:50:20] (03PS2) 10Volans: wmf-auto-reimage: improve systemd-specific commands [puppet] - 10https://gerrit.wikimedia.org/r/398087 [18:50:51] (03CR) 10Anomie: "> Should we enable TLS already? Should we fix also T177714?" 
[puppet] - 10https://gerrit.wikimedia.org/r/397913 (owner: 10Anomie) [18:50:53] (03PS1) 10ArielGlenn: clean up directory setup manifests for dumps nfs and web servers [puppet] - 10https://gerrit.wikimedia.org/r/398095 [18:51:08] (03CR) 10Volans: [C: 032] wmf-auto-reimage: improve systemd-specific commands [puppet] - 10https://gerrit.wikimedia.org/r/398087 (owner: 10Volans) [18:51:12] 10Operations, 10Cloud-Services, 10Mail, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3835396 (10Dzahn) gotcha @bawolff! i wrote that before i saw your reply. regarding the wmflabs.org address, there is security@wmflabs.org It al... [18:51:20] (03CR) 10jerkins-bot: [V: 04-1] clean up directory setup manifests for dumps nfs and web servers [puppet] - 10https://gerrit.wikimedia.org/r/398095 (owner: 10ArielGlenn) [18:52:18] (03CR) 10Anomie: "> Actually, this is not ok, it sets port as 0 with missing ports, but" [puppet] - 10https://gerrit.wikimedia.org/r/397912 (https://phabricator.wikimedia.org/T182713) (owner: 10Anomie) [18:52:41] (03CR) 10Dzahn: [C: 032] exim4/multiple roles: remove Ganglia exim stats [puppet] - 10https://gerrit.wikimedia.org/r/382916 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:53:21] (03PS3) 10Dzahn: exim4/multiple roles: remove Ganglia exim stats [puppet] - 10https://gerrit.wikimedia.org/r/382916 (https://phabricator.wikimedia.org/T177225) [18:54:07] 10Operations, 10Cloud-Services, 10Mail, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3835313 (10faidon) tools.wmflabs.org isn't a relay that is in production, so it is not (and cannot be) a "trusted" relay. This means that e.g. Gmail w... 
[18:54:42] (03PS2) 10ArielGlenn: clean up directory setup manifests for dumps nfs and web servers [puppet] - 10https://gerrit.wikimedia.org/r/398095 [18:55:01] (03PS4) 10Dzahn: exim4/ganglia: mx,otrs,lists,phab: rm Ganglia exim stats [puppet] - 10https://gerrit.wikimedia.org/r/382916 (https://phabricator.wikimedia.org/T177225) [18:55:53] (03PS4) 10Muehlenhoff: Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) [19:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Morning SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171213T1900). [19:00:04] stephanebisson and ebernhardson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[19:00:11] i can ship these i suppose [19:00:37] hello [19:01:25] (03CR) 10EBernhardson: [C: 032] Enable Cirrus MLR for 4 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398093 (owner: 10EBernhardson) [19:01:37] (03PS2) 10EBernhardson: Enable Cirrus MLR for 4 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398093 [19:01:46] (03CR) 10EBernhardson: [C: 032] Enable Cirrus MLR for 4 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398093 (owner: 10EBernhardson) [19:03:12] (03Merged) 10jenkins-bot: Enable Cirrus MLR for 4 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398093 (owner: 10EBernhardson) [19:03:28] (03CR) 10jenkins-bot: Enable Cirrus MLR for 4 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398093 (owner: 10EBernhardson) [19:05:29] !log phab[12]001,mx[12]001,mendelevium,fermium: rm /usr/local/bin/exim-to-gmetric and remove root's crontab lines to follow-up gerrit:382916 [19:05:35] (03CR) 10RobH: [C: 032] Decomm: Remove production and mgmt DNS entries for eventlog2001 [dns] - 10https://gerrit.wikimedia.org/r/398090 (https://phabricator.wikimedia.org/T182397) (owner: 10Papaul) [19:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:47] (03CR) 10Dzahn: "14:05 < mutante> !log phab[12]001,mx[12]001,mendelevium,fermium: rm /usr/local/bin/exim-to-gmetric and remove root's crontab lines to foll" [puppet] - 10https://gerrit.wikimedia.org/r/382916 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [19:06:42] 10Operations, 10Cloud-Services, 10Mail, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3835313 (10Legoktm) According to if a user were to sign up with the username "security", th... 
[19:07:10] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Turn on cirrus MLR for 4 more wikis (duration: 01m 09s) [19:07:13] 10Operations, 10Mail, 10Toolforge, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3835440 (10Legoktm) [19:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:41] stephanebisson: your changes are up on mwdebug1002 [19:09:51] ebernhardson: testing... [19:12:05] ebernhardson: works as expected [19:12:21] ebernhardson: Hey, can I add something to SWAT now? [19:12:57] James_F: sure [19:13:45] ebernhardson: Added – 398098 [19:13:59] !log ebernhardson@tin Synchronized php-1.31.0-wmf.12/resources/src/mediawiki.rcfilters/: SWAT: T182788: RCFilters: Fix live update (duration: 01m 08s) [19:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:10] T182788: Live update and View newest changes are broken in wmf/1.31.0-wmf.12 - https://phabricator.wikimedia.org/T182788 [19:14:17] stephanebisson: you're all synced out [19:15:05] 10Operations, 10ops-codfw, 10Analytics, 10DC-Ops: Decomission eventlog2001 - https://phabricator.wikimedia.org/T182397#3835450 (10RobH) 05Open>03Resolved a:05Papaul>03None [19:15:36] ebernhardson: indeed, all working [19:18:28] (03PS2) 10Dzahn: statsd: remove ganglia backend support [puppet] - 10https://gerrit.wikimedia.org/r/382923 (https://phabricator.wikimedia.org/T177225) [19:20:24] (03Abandoned) 10Dzahn: ganglia: add decom bash script if on trusty (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/394727 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [19:20:28] ebernhardson: I have a second one listed (https://gerrit.wikimedia.org/r/#/c/398096/). Can you do it as well? [19:20:41] stephanebisson: doh, i didn't notice. sorry i should have read better.
sure [19:21:13] no worries, I added it at T - 1 minute ;) [19:22:11] (03PS6) 10Thcipriani: Scap: scap_source correct gid [puppet] - 10https://gerrit.wikimedia.org/r/361796 [19:22:30] James_F: change is up on mwdebug1002 if there is anything to test [19:22:58] ebernhardson: Checking. [19:23:09] 10Operations, 10Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947#3835462 (10kaldari) Supposedly this is fixed by https://gitlab.gnome.org/GNOME/librsvg/commit/c70000117fb6e7dabdb77c1c8cc1067add7da6d9, which... [19:24:04] ebernhardson: Yeah, LGTM. We've got another. [19:24:05] (03PS2) 10Dzahn: mediawiki::appserver: move firewall from site to role [puppet] - 10https://gerrit.wikimedia.org/r/397636 [19:24:14] (03CR) 10Dzahn: [C: 031] "already done for API appservers" [puppet] - 10https://gerrit.wikimedia.org/r/397636 (owner: 10Dzahn) [19:24:28] ebernhardson: 398105 :-( [19:25:02] James_F: break less things :P [19:25:15] !log ebernhardson@tin Synchronized php-1.31.0-wmf.12/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.trackSubscriber.js: SWAT: VE trackSubscriber: data isn't required (duration: 01m 08s) [19:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:01] (03PS1) 10ArielGlenn: apachedir is available to dumps cron jobs via a bash script, use it [puppet] - 10https://gerrit.wikimedia.org/r/398106 [19:26:34] ebernhardson: Logging sucks. [19:26:45] ebernhardson: (But yes.)
[19:30:53] stephanebisson: second patch is up on mwdebug1002 [19:30:57] !log ppchelko@tin Started deploy [restbase/deploy@3f4bedc]: Remove references to Cassandra 2 from Parsoid storage T179417 [19:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:11] T179417: Migrate Parsoid from legacy to new storage - https://phabricator.wikimedia.org/T179417 [19:31:59] ebernhardson: looks good to me [19:34:18] !log ebernhardson@tin Synchronized php-1.31.0-wmf.12/resources/src/mediawiki.rcfilters/mw.rcfilters.UriProcessor.js: SWAT: T182734: RCLFilters: support target page with a subpage (duration: 01m 07s) [19:34:20] stephanebisson: all synced out [19:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:29] T182734: Related changes: the title of a subpage displayed cut off - https://phabricator.wikimedia.org/T182734 [19:34:53] James_F: you're up on mwdebug1002 [19:35:40] ebernhardson: all good. Thanks! [19:35:40] !log ppchelko@tin Finished deploy [restbase/deploy@3f4bedc]: Remove references to Cassandra 2 from Parsoid storage T179417 (duration: 04m 43s) [19:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:12] ebernhardson: LGTM. Might be one last one. Sorry. [19:37:25] ebernhardson: But sync for now. We'll fix later and not steal your time. [19:37:26] !log ebernhardson@tin Synchronized php-1.31.0-wmf.12/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.trackSubscriber.js: SWAT: VE trackSubscriber: Add timing data for 'loaded' state (duration: 01m 07s) [19:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:38] James_F: all synced out. Good luck :P [19:37:41] ebernhardson: Thanks. 
:-) [19:39:59] 10Operations, 10Commons, 10Multimedia, 10media-storage, 10User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101#1945898 (10Dzahn) 14:24 < Dragonfly6-7> https://commons.wikimedia.org/wiki/File:Burbuja_(1496994920).jpg... [19:40:55] 10Operations, 10Commons, 10Multimedia, 10media-storage, 10User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101#3835511 (10Dzahn) 14:24 < Dragonfly6-7> https://commons.wikimedia.org/wiki/File:Burbuja_(1496994920).jpg... [19:45:40] ebernhardson: Still around and ready to sync a one-line WikimediaEvents change? 398114 ;-) [19:58:11] (03PS1) 10Chad: group1 to wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398115 [20:00:04] no_justification: Dear deployers, time to do the MediaWiki train deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171213T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:00:20] There's always patches for the train [20:00:24] Silly bot [20:06:42] (03CR) 10Chad: [C: 032] group1 to wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398115 (owner: 10Chad) [20:06:52] no_justification: Any chance you could throw in the WikimediaEvents change? [20:07:00] Once the train is out, of course. [20:07:02] What change? 
[20:07:10] (I don't know what "the" change is :)) [20:07:19] https://gerrit.wikimedia.org/r/398114 [20:08:07] (03Merged) 10jenkins-bot: group1 to wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398115 (owner: 10Chad) [20:08:18] (03CR) 10jenkins-bot: group1 to wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398115 (owner: 10Chad) [20:10:57] !log demon@tin Synchronized php: symlink bump for wmf.12 (duration: 01m 07s) [20:11:06] 10Operations, 10Mail, 10Toolforge, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3835313 (10bd808) I created https://tools.wmflabs.org/admin/tool/security and added some folks to the maintainer list. I think a `~/.forward` can be added... [20:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:23] 10Operations, 10Services (doing), 10User-Eevans, 10User-fgiunchedi: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839#3835604 (10Eevans) >>! In T178839#3811294, @Eevans wrote: > [ ... ] > 1. Added https://github.com/aragozin/jvm-tools.git as a Git remote > 1. Merged jvmtool-umbrella-pom-0.... [20:14:27] !log demon@tin rebuilt and synchronized wikiversions files: group1 to wmf.12 [20:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:03] 10Operations, 10Puppet: custom fact interface_primary breaks under newer versions of facter - https://phabricator.wikimedia.org/T182819#3835611 (10herron) p:05Triage>03Normal [20:26:45] 10Operations, 10Mail, 10Toolforge, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3835642 (10Legoktm) >>! In T182812#3835599, @bd808 wrote: > I created https://tools.wmflabs.org/admin/tool/security and added some folks to the maintainer... 
[20:27:08] (03PS1) 10Rush: openstack: dedupe packages and reduce require_package [puppet] - 10https://gerrit.wikimedia.org/r/398118 (https://phabricator.wikimedia.org/T171494) [20:27:28] !log demon@tin Synchronized php-1.31.0-wmf.12/extensions/WikimediaEvents/extension.json: James_F made me do it (duration: 01m 08s) [20:27:38] (03CR) 10jerkins-bot: [V: 04-1] openstack: dedupe packages and reduce require_package [puppet] - 10https://gerrit.wikimedia.org/r/398118 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [20:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:41] (03PS2) 10Rush: openstack: dedupe packages and reduce require_package [puppet] - 10https://gerrit.wikimedia.org/r/398118 (https://phabricator.wikimedia.org/T171494) [20:27:50] Thanks no_justification. :-) [20:28:09] (03CR) 10jerkins-bot: [V: 04-1] openstack: dedupe packages and reduce require_package [puppet] - 10https://gerrit.wikimedia.org/r/398118 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [20:29:08] (03PS3) 10Rush: openstack: dedupe packages and reduce require_package [puppet] - 10https://gerrit.wikimedia.org/r/398118 (https://phabricator.wikimedia.org/T171494) [20:29:43] (03PS4) 10Rush: openstack: dedupe packages and reduce require_package [puppet] - 10https://gerrit.wikimedia.org/r/398118 (https://phabricator.wikimedia.org/T171494) [20:30:21] 10Operations, 10Puppet: custom fact interface_primary breaks under newer versions of facter - https://phabricator.wikimedia.org/T182819#3835676 (10herron) It looks like the regex in interface_primary expects trailing whitespace from the command output but leading/trailing whitespace is being stripped away unde... 
[20:34:05] (03PS1) 10Herron: facter: fix interface_primary under newer versions of facter [puppet] - 10https://gerrit.wikimedia.org/r/398120 (https://phabricator.wikimedia.org/T182819) [20:39:30] (03CR) 10Dzahn: [C: 032] "just like it was done before for canaries and API servers and http://puppet-compiler.wmflabs.org/9337/" [puppet] - 10https://gerrit.wikimedia.org/r/397636 (owner: 10Dzahn) [20:39:42] jouncebot: next [20:39:42] In 0 hour(s) and 20 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171213T2100) [20:41:20] (03CR) 10Dzahn: "no-op and wmf-style: total violations delta -9" [puppet] - 10https://gerrit.wikimedia.org/r/397636 (owner: 10Dzahn) [20:41:48] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:42:32] 10Operations, 10Release Pipeline, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), and 2 others: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3835695 (10hashar) @Dzahn that is on lawrencium . Can you check the content of... 
[20:44:08] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:44:47] 10Operations, 10Commons, 10Multimedia, 10media-storage: Generate a list of files that are supposed to exist but 404s - https://phabricator.wikimedia.org/T182822#3835699 (10zhuyifei1999) p:05Triage>03Normal [20:45:46] (03CR) 10Rush: [C: 032] openstack: dedupe packages and reduce require_package [puppet] - 10https://gerrit.wikimedia.org/r/398118 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [20:45:54] (03PS5) 10Rush: openstack: dedupe packages and reduce require_package [puppet] - 10https://gerrit.wikimedia.org/r/398118 (https://phabricator.wikimedia.org/T171494) [20:52:38] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:52:46] (03CR) 10Dzahn: "i see in the code comment that "aspell-id" was imported into our own repo. does that mean the same has to happen for aspell-is first befor" [puppet] - 10https://gerrit.wikimedia.org/r/398078 (https://phabricator.wikimedia.org/T181099) (owner: 10Awight) [20:53:19] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:53:50] (03CR) 10Awight: "> i see in the code comment that "aspell-id" was imported into our own repo. does that mean the same has to happen for aspell-is first bef" [puppet] - 10https://gerrit.wikimedia.org/r/398078 (https://phabricator.wikimedia.org/T181099) (owner: 10Awight) [20:53:59] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:54:01] (03CR) 10Dzahn: "https://apt.wikimedia.org/wikimedia/pool/thirdparty/a/" [puppet] - 10https://gerrit.wikimedia.org/r/398078 (https://phabricator.wikimedia.org/T181099) (owner: 10Awight) [20:54:48] (03CR) 10Hashar: "Bumping npm is T169451 and Debian report is https://bugs.debian.org/857986" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397720 (owner: 10Paladox) [20:56:08] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:56:19] (03CR) 10Awight: "@DZahn: on second thought, I think aspell-id was imported into our repo because it's not in the jessie upstream. Correct me please, I don" [puppet] - 10https://gerrit.wikimedia.org/r/398078 (https://phabricator.wikimedia.org/T181099) (owner: 10Awight) [20:56:48] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:57:58] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:57:58] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:59:18] (03PS1) 10Rush: openstack: dependency changes for require_package [puppet] - 10https://gerrit.wikimedia.org/r/398121 (https://phabricator.wikimedia.org/T171494) [20:59:30] (03PS2) 10Rush: openstack: dependency changes for require_package [puppet] - 10https://gerrit.wikimedia.org/r/398121 (https://phabricator.wikimedia.org/T171494) [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171213T2100). 
[21:00:04] No GERRIT patches in the queue for this window AFAICS. [21:00:19] (03CR) 10Rush: [C: 032] openstack: dependency changes for require_package [puppet] - 10https://gerrit.wikimedia.org/r/398121 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:00:26] Hoping to make an ORES deployment, but mutante and I are cleaning up one last detail. [21:04:18] 10Operations, 10Commons, 10Multimedia, 10media-storage: Generate a list of files that are supposed to exist but 404s - https://phabricator.wikimedia.org/T182822#3835763 (10zhuyifei1999) p:05Normal>03Triage [21:06:08] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:07:29] (03CR) 10Zoranzoki21: "What happened with this patch?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398058 (owner: 10Jcrespo) [21:07:38] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:08:19] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:08:58] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:12:59] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:12:59] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:20:08] !log ppchelko@tin Started deploy [restbase/deploy@a993556]: Do not fallback if the revision is not specified T182770 [21:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:21] T182770: meta property="dc:modified" may be absent - https://phabricator.wikimedia.org/T182770 [21:24:11] !log ppchelko@tin Finished deploy [restbase/deploy@a993556]: Do not fallback if the revision is not specified T182770 
(duration: 04m 04s) [21:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:27] 10Operations, 10MediaWiki-Platform-Team, 10TechCom-RfC, 10HHVM, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3623015 (10Krinkle) >>! In T176370#3822421, @daniel wrote: > @tstarling According to TechCom notes, this was to enter Last Call on November 22, but th... [21:30:59] (03PS2) 10Hashar: Add .gitreview [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397747 [21:31:01] (03PS3) 10Hashar: Restrict setup.py to python 3.4 or later [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397748 [21:31:03] (03PS1) 10Hashar: Do not lint .eggs/* [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398128 [21:32:58] 10Operations, 10MediaWiki-Platform-Team, 10TechCom-RfC, 10HHVM, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3835810 (10tstarling) This is now moving to last call after a TC discussion. 
[21:33:14] !log mholloway-shell@tin Started deploy [mobileapps/deploy@b8082da]: Update mobileapps to bf67c97 [21:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:33] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@b8082da]: Update mobileapps to bf67c97 (duration: 04m 19s) [21:37:34] 10Operations, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, 10Traffic: PUT blocked by Varnish - https://phabricator.wikimedia.org/T182825#3835819 (10Tgr) [21:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:31] (03CR) 10Thcipriani: [C: 032] Do not lint .eggs/* [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398128 (owner: 10Hashar) [21:54:44] (03PS1) 10Hashar: tests: migrate from nose to pytest [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398136 [21:56:20] 10Operations, 10Commons, 10Multimedia, 10media-storage: Generate a list of files that are supposed to exist but 404s - https://phabricator.wikimedia.org/T182822#3835902 (10Platonides) Maybe it would be possible to extract from swift the list of files stored there? Then no HTTP requests would be needed (unl... 
[21:58:26] !log mholloway-shell@tin Started deploy [mobileapps/deploy@e62d8e3]: Update mobileapps to ddddebb [21:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:40] (03PS1) 10Rush: openstack: consistent style for ensure present [puppet] - 10https://gerrit.wikimedia.org/r/398140 [22:00:49] (03CR) 10jerkins-bot: [V: 04-1] openstack: consistent style for ensure present [puppet] - 10https://gerrit.wikimedia.org/r/398140 (owner: 10Rush) [22:03:50] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@e62d8e3]: Update mobileapps to ddddebb (duration: 05m 24s) [22:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:50] (03CR) 10Thcipriani: "I have no desire to support git-review (I feel like it's too often a very leaky abstraction) and this change feels like implicit endorseme" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397747 (owner: 10Hashar) [22:09:11] (03CR) 10Thcipriani: [C: 031] Restrict setup.py to python 3.4 or later [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397748 (owner: 10Hashar) [22:13:38] (03CR) 10BryanDavis: "> I have no desire to support git-review (I feel like it's too often" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397747 (owner: 10Hashar) [22:16:58] (03CR) 10Thcipriani: [C: 032] Do not lint .eggs/* [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398128 (owner: 10Hashar) [22:17:28] (03Merged) 10jenkins-bot: Do not lint .eggs/* [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398128 (owner: 10Hashar) [22:17:43] (03PS4) 10Hashar: Restrict setup.py to python 3.4 or later [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397748 [22:18:56] (03PS2) 10Rush: openstack: consistent style for ensure present [puppet] - 10https://gerrit.wikimedia.org/r/398140 [22:20:12] (03CR) 10Rush: [C: 032] openstack: consistent style for ensure present [puppet] - 
10https://gerrit.wikimedia.org/r/398140 (owner: 10Rush) [22:23:17] !log ppchelko@tin Started deploy [cpjobqueue/deploy@044cd23]: Fix sha1-based deduplication [22:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:51] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@044cd23]: Fix sha1-based deduplication (duration: 00m 34s) [22:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:59] (03PS1) 10Rush: openstack: labtest and labtestn roles for net [puppet] - 10https://gerrit.wikimedia.org/r/398145 [22:26:17] (03CR) 10jerkins-bot: [V: 04-1] openstack: labtest and labtestn roles for net [puppet] - 10https://gerrit.wikimedia.org/r/398145 (owner: 10Rush) [22:26:22] (03PS2) 10Rush: openstack: labtest and labtestn roles for net [puppet] - 10https://gerrit.wikimedia.org/r/398145 [22:29:58] (03CR) 10Volans: [C: 04-1] "The regex is not equivalent to the previous one, see details inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398120 (https://phabricator.wikimedia.org/T182819) (owner: 10Herron) [22:30:43] (03CR) 10Rush: [C: 032] openstack: labtest and labtestn roles for net [puppet] - 10https://gerrit.wikimedia.org/r/398145 (owner: 10Rush) [22:32:04] (03CR) 10Thcipriani: [C: 032] Restrict setup.py to python 3.4 or later [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397748 (owner: 10Hashar) [22:32:32] (03Merged) 10jenkins-bot: Restrict setup.py to python 3.4 or later [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397748 (owner: 10Hashar) [22:33:44] (03CR) 10Dduvall: [C: 04-1] "I played around with this and mathoid locally using minikube and was able to get it working after making some changes to the templates and" (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/392619 (https://phabricator.wikimedia.org/T177397) (owner: 10Giuseppe Lavagetto) [22:36:07] PROBLEM - puppet last run on labtestnet2002 is CRITICAL: CRITICAL: Puppet has 6 
failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python-novaclient],Package[python-designateclient],Package[python-keystoneclient],Package[python-openstackclient] [22:40:55] (03CR) 10Imarlier: "Aaron, is this actually still WIP?" [puppet] - 10https://gerrit.wikimedia.org/r/392221 (owner: 10Aaron Schulz) [22:41:47] PROBLEM - https://phabricator.wikimedia.org on phab1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:42:42] welcome to the party icinga-wm [22:42:47] RECOVERY - https://phabricator.wikimedia.org on phab1001 is OK: HTTP OK: HTTP/1.1 200 OK - 34526 bytes in 0.251 second response time [22:44:52] (03CR) 10Chad: [C: 04-2] "git-review has been a mistake since day one--I regret I ever allowed it to permeate our documentation and processes. It's a terrible piece" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397747 (owner: 10Hashar) [22:45:38] (03CR) 10Chad: [C: 04-1] Add .gitreview [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397747 (owner: 10Hashar) [22:46:07] RECOVERY - puppet last run on labtestnet2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:46:23] (03PS1) 10Rush: openstack: sane class dependency handling for labtest[n] [puppet] - 10https://gerrit.wikimedia.org/r/398169 (https://phabricator.wikimedia.org/T171494) [22:50:11] (03CR) 10Rush: [C: 032] openstack: sane class dependency handling for labtest[n] [puppet] - 10https://gerrit.wikimedia.org/r/398169 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [22:51:15] !log restarting apache2 on phab1001, phabricator timing out [22:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:03] (03PS1) 10Dzahn: icinga: fix check_disk options on lawrencium [puppet] - 10https://gerrit.wikimedia.org/r/398172 (https://phabricator.wikimedia.org/T178454) [23:00:30] Hi, is there an issue with Checkuser? [23:00:50] I'm getting 500 errors. 
[23:01:51] foks: That's a lot of errors [23:01:55] I can't get edits at all [23:01:59] Reedy, lol [23:02:06] (03CR) 10Greg Grossmeier: "If other people want to use it for submitting patches/working with Gerrit we should not block them." [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397747 (owner: 10Hashar) [23:02:15] foks: I unilaterally disabled all extensions to make wikis faster. [23:02:18] You're welcome!! [23:02:21] gr8! [23:02:48] but seriously we're doing legal compliance and really need to be able to do this [23:03:02] Nothing in exception or error logs [23:03:05] DB queries timing out? [23:03:35] oh, wait [23:03:36] 2017-12-13 23:01:46 [WjGxWgpAIC0AAEHAp9oAAAAG] mw1324 commonswiki 1.31.0-wmf.12 error ERROR: [WjGxWgpAIC0AAEHAp9oAAAAG] /wiki/Special:CheckUser ErrorException from line 177 of /srv/mediawiki/php-1.31.0-wmf.12/includes/Hooks.php: PHP Error: Argument 1 passed to FlowHooks::onSpecialCheckUserGetLinksFromRow() must be an instance of CheckUser, SpecialCheckUser given {"exception_id":"WjGxWgpAIC0AAEHAp9oAAAAG","exception_url":"/wiki/Special:CheckUser","caught_by":"mwe_handler"} [23:03:51] RoanKattouw: ^ [23:04:07] Reedy, thanks for the check. :) [23:04:08] * Reedy files it as a task [23:04:12] Thanks! [23:04:20] Sorry to be a pain. [23:05:02] I can't find you on phab :( [23:05:04] https://phabricator.wikimedia.org/T182834 [23:05:14] Reedy, I'm @jrbs there [23:05:50] yeah I didn't think it through [23:05:59] Should be easy to fix checkuser though [23:06:23] Hooks::run( 'SpecialCheckUserGetLinksFromRow', [ $this, $row, &$links ] ); [23:06:30] this being CheckUser [23:06:33] Looks like it's just on commons, seems fine on English [23:06:44] probably only broken in .12 [23:07:25] public static function onSpecialCheckUserGetLinksFromRow( CheckUser $checkUser, $row, &$links ) { [23:07:26] Roll Flow back to wmf.11? [23:07:28] Reedy: 6ab26d6e47f7526708c4a9da52f6ff79373be328 [23:07:35] MatmaRex: Already fixed [23:07:36] ?
[23:07:41] no, that's the cause [23:07:45] revert that [23:08:02] lol [23:08:03] ta [23:08:10] or update the typehint in Flow [23:08:15] this is really silly, heh [23:08:30] my local copy is out of date [23:08:33] let's update flow [23:09:01] 6ab26d6e47f7526708c4a9da52f6ff79373be328 probably should have added an alias [23:09:11] although i'm not sure if that would even have worked with typehints? [23:09:48] (03PS5) 10Madhuvishy: labsdb: Point DNS at equivalent web.db.svc.eqiad.wmflabs hosts [puppet] - 10https://gerrit.wikimedia.org/r/397256 (https://phabricator.wikimedia.org/T142807) (owner: 10BryanDavis) [23:17:03] (03CR) 10Dzahn: [C: 032] icinga: fix check_disk options on lawrencium [puppet] - 10https://gerrit.wikimedia.org/r/398172 (https://phabricator.wikimedia.org/T178454) (owner: 10Dzahn) [23:17:27] foks: apparently I can change your phab username [23:17:32] Dunno if it'll work correctly :P [23:17:38] whaaaa [23:18:03] It's an option on the profiles [23:18:44] it only changes it from a point in time :) so leaves behind all references in comments and such [23:18:44] i think it will only work if the user hasn't made any changes [23:18:47] it's quite messy [23:18:55] lol [23:18:57] yeah probably not worth it [23:19:04] as cool as "foks" would be as a username [23:19:17] shame you can't just add aliases and stuff like you can with projects etc [23:21:59] (03PS6) 10BryanDavis: labsdb: Point DNS at equivalent web.db.svc.eqiad.wmflabs hosts [puppet] - 10https://gerrit.wikimedia.org/r/397256 (https://phabricator.wikimedia.org/T142807) [23:22:00] RECOVERY - Disk space on lawrencium is OK: DISK OK [23:22:01] !log deleted 22 illegal images from server [23:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:21] 10Operations, 10cloud-services-team: Puppet broken on labstore1004 - https://phabricator.wikimedia.org/T182781#3836167 (10chasemp) 05Open>03Resolved a:03chasemp [23:22:22] 10Operations, 10Release Pipeline, 10monitoring, 
10Continuous-Integration-Infrastructure (shipyard), and 2 others: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3836169 (10Dzahn) @hashar fixed by adding the right check_disk options into Hie... [23:23:36] !log reedy@tin Synchronized php-1.31.0-wmf.12/extensions/Flow/Hooks.php: unbreak CheckUser (duration: 01m 08s) [23:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:53] foks: Try again [23:24:11] 10Operations, 10Release Pipeline, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), and 2 others: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3836171 (10Dzahn) 05Open>03Resolved Current Status: OK (for 0d 0h 1m 5... [23:24:45] Reedy, success! [23:24:51] woo [23:25:40] pays to use a co-working space run by a local Internet Service Provider... has 20% packet loss and SSH session keeps freezing [23:26:30] lol [23:27:15] (03PS7) 10Rush: labsdb: Point DNS at equivalent web.db.svc.eqiad.wmflabs hosts [puppet] - 10https://gerrit.wikimedia.org/r/397256 (https://phabricator.wikimedia.org/T142807) (owner: 10BryanDavis) [23:27:21] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 41 probes of 285 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:32:17] (03CR) 10Dzahn: "i checked on packages.debian.org and indeed:" [puppet] - 10https://gerrit.wikimedia.org/r/398078 (https://phabricator.wikimedia.org/T181099) (owner: 10Awight) [23:32:21] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 10 probes of 285 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:32:32] (03PS2) 10Dzahn: Add Icelandic dictionary for ORES on iswiki [puppet] - 10https://gerrit.wikimedia.org/r/398078 (https://phabricator.wikimedia.org/T181099) (owner: 10Awight) [23:41:53] (03CR) 10Madhuvishy: [C: 032] labsdb: Point DNS at 
equivalent web.db.svc.eqiad.wmflabs hosts [puppet] - 10https://gerrit.wikimedia.org/r/397256 (https://phabricator.wikimedia.org/T142807) (owner: 10BryanDavis) [23:44:56] (03CR) 10Krinkle: [C: 04-1] "Couple of beginner questions, I probably missed something :)" [puppet] - 10https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) (owner: 10Ottomata) [23:45:15] (03CR) 10Thcipriani: "> > I have no desire to support git-review (I feel like it's too" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397747 (owner: 10Hashar) [23:54:11] !log restarted phd on phab1001 (for good measure) [23:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:27] (03CR) 10Krinkle: [C: 04-1] "(After reading comment on Phabricator)" [puppet] - 10https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) (owner: 10Ottomata)