[00:26:07] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 61310 MB (12% inode=99%)
[00:51:07] RECOVERY - Disk space on elastic1017 is OK: DISK OK
[02:00:10] !log l10nupdate@tin LocalisationUpdate failed: git pull of core failed
[02:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:17] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/title/{title}{/revision} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia
[02:46:17] check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) is CRITICAL: Test Get media in test p
[02:46:17] expected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-sections/{title}{/revision} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a t
[02:46:17] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/title/{title}{/revision} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia
[02:46:17] check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) is CRITICAL: Test Get media in test p
[02:46:18] expected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-sections/{title}{/revision} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a t
[02:46:18] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/title/{title}{/revision} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia
[02:46:19] check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) is CRITICAL: Test Get media in test p
[02:46:19] expected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-sections/{title}{/revision} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a t
[02:47:18] PROBLEM - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[02:47:37] PROBLEM - cassandra-b service on restbase-dev1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[02:47:57] PROBLEM - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is CRITICAL: connect to address 10.64.0.168 and port 9042: Connection refused
[02:47:58] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:05:38] RECOVERY - cassandra-b service on restbase-dev1004 is OK: OK - cassandra-b is active
[03:06:07] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational
[03:06:57] RECOVERY - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is OK: TCP OK - 0.000 second response time on 10.64.0.168 port 9042
[03:07:27] RECOVERY - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is OK: SSL OK - Certificate restbase-dev1004-b valid until 2018-07-20 15:08:05 +0000 (expires in 116 days)
[03:07:37] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
[03:07:37] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[03:07:37] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[03:11:48] PROBLEM - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[03:13:27] PROBLEM - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.168 and port 9042: Connection refused
[03:13:48] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:13:57] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:14:28] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[03:27:14] (PS1) BryanDavis: toolforge: Explictly allow wss to toolforge [puppet] - https://gerrit.wikimedia.org/r/421804 (https://phabricator.wikimedia.org/T130748)
[03:27:33] (CR) jerkins-bot: [V: -1] toolforge: Explictly allow wss to toolforge [puppet] - https://gerrit.wikimedia.org/r/421804 (https://phabricator.wikimedia.org/T130748) (owner: BryanDavis)
[03:29:37] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 866.17 seconds
[03:32:28] PROBLEM - cassandra-a SSL 10.64.0.167:7001 on restbase-dev1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[03:32:58] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[03:33:07] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:33:18] PROBLEM - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:37] RECOVERY - cassandra-a service on restbase-dev1006 is OK: OK - cassandra-a is active
[03:33:47] PROBLEM - cassandra-a service on restbase-dev1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[03:33:48] PROBLEM - cassandra-a CQL 10.64.0.167:9042 on restbase-dev1004 is CRITICAL: connect to address 10.64.0.167 and port 9042: Connection refused
[03:34:17] RECOVERY - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is OK: TCP OK - 1.016 second response time on 10.64.48.169 port 9042
[03:34:27] RECOVERY - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is OK: TCP OK - 0.000 second response time on 10.64.48.168 port 9042
[03:34:47] RECOVERY - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is OK: SSL OK - Certificate restbase-dev1006-a valid until 2018-07-20 15:08:10 +0000 (expires in 116 days)
[03:35:36] (CR) BryanDavis: "recheck" [puppet] - https://gerrit.wikimedia.org/r/421804 (https://phabricator.wikimedia.org/T130748) (owner: BryanDavis)
[03:35:47] RECOVERY - cassandra-a service on restbase-dev1004 is OK: OK - cassandra-a is active
[03:36:07] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational
[03:37:28] RECOVERY - cassandra-a SSL 10.64.0.167:7001 on restbase-dev1004 is OK: SSL OK - Certificate restbase-dev1004-a valid until 2018-07-20 15:08:04 +0000 (expires in 116 days)
[03:37:57] RECOVERY - cassandra-a CQL 10.64.0.167:9042 on restbase-dev1004 is OK: TCP OK - 0.000 second response time on 10.64.0.167 port 9042
[03:43:17] PROBLEM - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[03:43:47] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:45:18] PROBLEM - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.169 and port 9042: Connection refused
[03:45:44] (CR) Andrew Bogott: "This doesn't scare me much, although I wish we could restrict it to only working from deployment-prep VMs. I can't decide if creating a s" [puppet] - https://gerrit.wikimedia.org/r/421709 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk)
[03:45:58] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:46:07] PROBLEM - cassandra-b service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[03:55:37] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 178.43 seconds
[04:04:07] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[04:04:07] RECOVERY - cassandra-b service on restbase-dev1006 is OK: OK - cassandra-b is active
[04:04:48] RECOVERY - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is OK: SSL OK - Certificate restbase-dev1006-b valid until 2018-07-20 15:08:11 +0000 (expires in 116 days)
[04:05:18] RECOVERY - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is OK: TCP OK - 0.000 second response time on 10.64.48.169 port 9042
[04:21:56] (CR) Madhuvishy: [C: 1] "Nice write up, looks good to me, thank you!" [software/tools-webservice] - https://gerrit.wikimedia.org/r/421799 (owner: Kevin py)
[04:22:41] (CR) jerkins-bot: [V: -1] my understanding of webservice with start and stop actions [software/tools-webservice] - https://gerrit.wikimedia.org/r/421799 (owner: Kevin py)
[04:50:41] (PS1) KartikMistry: apertium-separable: Initial Debian packaging [debs/contenttranslation/apertium-separable] - https://gerrit.wikimedia.org/r/421808 (https://phabricator.wikimedia.org/T189075)
[05:03:27] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[05:03:57] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[05:39:39] <_joe_> !log restarting pdfrenderer on scb1001,1003
[05:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:47] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[05:40:47] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time
[06:02:48] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[06:03:07] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[06:05:45] (PS1) KartikMistry: apertium-fra: New upstream release [debs/contenttranslation/apertium-fra] - https://gerrit.wikimedia.org/r/421813 (https://phabricator.wikimedia.org/T189076)
[06:06:14] (CR) jerkins-bot: [V: -1] apertium-fra: New upstream release [debs/contenttranslation/apertium-fra] - https://gerrit.wikimedia.org/r/421813 (https://phabricator.wikimedia.org/T189076) (owner: KartikMistry)
[06:08:07] (PS1) Giuseppe Lavagetto: conf: update the netboot recipe for conf servers with SSDs [puppet] - https://gerrit.wikimedia.org/r/421814 (https://phabricator.wikimedia.org/T166081)
[06:10:48] (PS2) KartikMistry: apertium-fra: New upstream release [debs/contenttranslation/apertium-fra] - https://gerrit.wikimedia.org/r/421813 (https://phabricator.wikimedia.org/T189076)
[06:11:13] (CR) jerkins-bot: [V: -1] apertium-fra: New upstream release [debs/contenttranslation/apertium-fra] - https://gerrit.wikimedia.org/r/421813 (https://phabricator.wikimedia.org/T189076) (owner: KartikMistry)
[06:11:50] (PS1) Elukey: Assign role::spare::system to eventlog1001 [puppet] - https://gerrit.wikimedia.org/r/421816 (https://phabricator.wikimedia.org/T189566)
[06:20:17] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: /srv 61652 MB (12% inode=99%)
[06:21:11] (CR) Elukey: [C: 1] conf: update the netboot recipe for conf servers with SSDs [puppet] - https://gerrit.wikimedia.org/r/421814 (https://phabricator.wikimedia.org/T166081) (owner: Giuseppe Lavagetto)
[06:29:47] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]
[06:32:17] RECOVERY - Disk space on elastic1023 is OK: DISK OK
[06:59:47] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:02:46] (CR) Giuseppe Lavagetto: [C: 2] conf: update the netboot recipe for conf servers with SSDs [puppet] - https://gerrit.wikimedia.org/r/421814 (https://phabricator.wikimedia.org/T166081) (owner: Giuseppe Lavagetto)
[07:11:26] (PS1) Volans: Puppetboard: enable apache proxy modules [puppet] - https://gerrit.wikimedia.org/r/421823 (https://phabricator.wikimedia.org/T184563)
[07:11:28] (PS1) Volans: Puppetboard: fine tune configuration [puppet] - https://gerrit.wikimedia.org/r/421824 (https://phabricator.wikimedia.org/T184563)
[07:13:10] (PS1) KartikMistry: apertium-cat: New upstream release [debs/contenttranslation/apertium-cat] - https://gerrit.wikimedia.org/r/421825 (https://phabricator.wikimedia.org/T189076)
[07:13:46] (CR) jerkins-bot: [V: -1] apertium-cat: New upstream release [debs/contenttranslation/apertium-cat] - https://gerrit.wikimedia.org/r/421825 (https://phabricator.wikimedia.org/T189076) (owner: KartikMistry)
[07:20:15] (CR) Volans: [C: 2] "Compiler looks happy" [puppet] - https://gerrit.wikimedia.org/r/421823 (https://phabricator.wikimedia.org/T184563) (owner: Volans)
[07:20:29] (CR) Volans: [C: 2] Puppetboard: fine tune configuration [puppet] - https://gerrit.wikimedia.org/r/421824 (https://phabricator.wikimedia.org/T184563) (owner: Volans)
[07:28:52] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler03/10645/" [puppet] - https://gerrit.wikimedia.org/r/421816 (https://phabricator.wikimedia.org/T189566) (owner: Elukey)
[07:29:00] (PS2) Elukey: Assign role::spare::system to eventlog1001 [puppet] - https://gerrit.wikimedia.org/r/421816 (https://phabricator.wikimedia.org/T189566)
[07:33:33] !log stop eventlogging zmq-forwarder on eventlog1001 as part of decom process - T189566
[07:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:40] Cc: Krinkle, marlier --^
[07:33:40] T189566: Reclaim/Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566
[07:46:39] Operations, hardware-requests, Patch-For-Review: Reclaim/Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566#4079765 (elukey)
[07:47:09] Operations, hardware-requests, Patch-For-Review: Reclaim/Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566#4045706 (elukey)
[07:47:22] Operations, hardware-requests, Patch-For-Review: Reclaim/Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566#4045706 (elukey) a:elukey>None
[07:47:43] Operations, hardware-requests, Patch-For-Review: Reclaim/Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566#4045706 (elukey) Host can be decommed anytime @Cmjohnson
[07:48:58] elukey: as of Friday we’re actually still having issues with the Kafka stream. Initially the one obvious error in the logs was something on our end, which has since been fixed, but it seems the stream is still spotty for unknown reasons
[07:49:08] The main one on hafnium runs fine
[07:49:27] Even Ian’s local copy of the new coal in hafnium works fine
[07:50:19] Main one = webperf/navtiming (converted last year to Kafka, run fine)
[07:50:30] Same topics
[07:50:41] Krinkle: o/
[07:51:13] I can help in tracking down what's wrong if you have time
[07:51:28] what do you mean that the stream is spotty?
[07:51:34] RE: alerts in your mail, those are unrelated, we don’t have alerts on the coal data
[07:52:13] So the uncaught ValueError was a logic error that we fixed since and deployed
[07:52:43] The real problem is that it seems the consumer just seems to stop after a while
[07:52:47] And get no more messages
[07:53:39] After which it eventually times out and reconnects but only to time out again etc. only restarting the consume process seems to fix it
[07:53:54] Then after some hours it breaks, and enters this time out reconnect cycle
[07:54:09] See syslog on graphite1001 for the not so useful error message
[07:54:46] (PS1) Giuseppe Lavagetto: Add the SRV record for the new etcd cluster [dns] - https://gerrit.wikimedia.org/r/421832 (https://phabricator.wikimedia.org/T166081)
[07:55:04] All I know at the moment is that it happens on graphite1001 but not when run manually with a different consumer group from another host for testing. So we can’t reproduce the issue
[07:55:13] Krinkle: from https://grafana.wikimedia.org/dashboard/db/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=coal&from=now-24h&to=now I can see the coal consumer group committing offset regularly, weird
[07:55:26] (PS1) KartikMistry: apertium: Add apertium-separable package [puppet] - https://gerrit.wikimedia.org/r/421833 (https://phabricator.wikimedia.org/T189075)
[07:55:32] I’ll have to go now but I’ll check in in the morning (SF time)
[07:55:47] Krinkle: ack!
[07:56:07] (CR) jerkins-bot: [V: -1] apertium: Add apertium-separable package [puppet] - https://gerrit.wikimedia.org/r/421833 (https://phabricator.wikimedia.org/T189075) (owner: KartikMistry)
[07:56:45] (CR) Giuseppe Lavagetto: [C: 2] Add the SRV record for the new etcd cluster [dns] - https://gerrit.wikimedia.org/r/421832 (https://phabricator.wikimedia.org/T166081) (owner: Giuseppe Lavagetto)
[07:58:35] (PS2) KartikMistry: WIP: apertium: Add apertium-separable package [puppet] - https://gerrit.wikimedia.org/r/421833 (https://phabricator.wikimedia.org/T189075)
[08:01:04] (PS1) Volans: Puppetboard: ferm, open port 80 [puppet] - https://gerrit.wikimedia.org/r/421834 (https://phabricator.wikimedia.org/T184563)
[08:03:11] (PS2) Volans: Puppetboard: ferm, open port 80 [puppet] - https://gerrit.wikimedia.org/r/421834 (https://phabricator.wikimedia.org/T184563)
[08:05:45] (CR) Volans: [C: 2] "Compiler for reference:" [puppet] - https://gerrit.wikimedia.org/r/421834 (https://phabricator.wikimedia.org/T184563) (owner: Volans)
[08:07:09] (PS3) Giuseppe Lavagetto: etcd: add class for v3 basic installation [puppet] - https://gerrit.wikimedia.org/r/419358 (https://phabricator.wikimedia.org/T166081)
[08:09:05] (CR) Giuseppe Lavagetto: [C: 2] etcd: add class for v3 basic installation [puppet] - https://gerrit.wikimedia.org/r/419358 (https://phabricator.wikimedia.org/T166081) (owner: Giuseppe Lavagetto)
[08:13:10] (PS2) Giuseppe Lavagetto: etcd::v3: add basic monitoring [puppet] - https://gerrit.wikimedia.org/r/420013
[08:13:30] Operations, HHVM, User-Elukey, User-notice: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4079805 (MoritzMuehlenhoff) >>! In T189295#4075466, @Ladsgroup wrote: >>>! In T189295#4074637, @Bawolff wrote: >> Another thing to watch out for, is that...
[08:14:36] (CR) Giuseppe Lavagetto: [C: 2] etcd::v3: add basic monitoring [puppet] - https://gerrit.wikimedia.org/r/420013 (owner: Giuseppe Lavagetto)
[08:17:47] !log upgrading debdeploy across the fleet to latest release
[08:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:55] Krinkle: one thing to keep in mind is that hafnium pulls data from kafka main (kafka100[123]) meanwhile coal from jumbo
[08:28:15] two different versions of Kafka (main 0.9, jumbo 1.0)
[08:50:19] (PS1) Elukey: coal: move conf to systemd only and add dedicated logging dir [puppet] - https://gerrit.wikimedia.org/r/421838
[08:53:26] (PS2) Elukey: coal: move conf to systemd only and add dedicated logging dir [puppet] - https://gerrit.wikimedia.org/r/421838
[08:58:06] (CR) Elukey: "Pcc: https://puppet-compiler.wmflabs.org/compiler03/10649/graphite1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/421838 (owner: Elukey)
[09:02:32] (PS1) Filippo Giunchedi: Remove non-authoritative SRV puppet records [dns] - https://gerrit.wikimedia.org/r/421839 (https://phabricator.wikimedia.org/T189891)
[09:18:26] (PS1) Filippo Giunchedi: puppetmaster: install keypair for 'puppet' when running as CA [puppet] - https://gerrit.wikimedia.org/r/421842 (https://phabricator.wikimedia.org/T189891)
[09:33:45] (PS1) Muehlenhoff: Fix sort order of results [debs/debdeploy] - https://gerrit.wikimedia.org/r/421846
[09:37:25] (CR) Giuseppe Lavagetto: "We never used these records as SRV resolution was broken in puppet 3 (no caching was done by the agent, making agent runs take forever and" [dns] - https://gerrit.wikimedia.org/r/421839 (https://phabricator.wikimedia.org/T189891) (owner: Filippo Giunchedi)
[09:40:06] (PS2) Muehlenhoff: Use deterministic order for results [debs/debdeploy] - https://gerrit.wikimedia.org/r/421846
[09:42:17] (CR) Volans: [C: 1] "LGTM" [debs/debdeploy] - https://gerrit.wikimedia.org/r/421846 (owner: Muehlenhoff)
[09:42:19] (PS1) Vgutierrez: Fix dummy metrics implementation [debs/pybal] - https://gerrit.wikimedia.org/r/421847 (https://phabricator.wikimedia.org/T190527)
[09:42:45] Operations, Pybal, Traffic, Patch-For-Review: pybal 1.15.2 dies with obscure errors without python-prometheus-client - https://phabricator.wikimedia.org/T190527#4079882 (Vgutierrez) pybal metrics package attempts to make prometheus support optional, but it's obviously failing right now. The subm...
[09:43:06] (PS2) Filippo Giunchedi: puppetmaster: install keypair for 'puppet' when running as CA [puppet] - https://gerrit.wikimedia.org/r/421842 (https://phabricator.wikimedia.org/T189891)
[09:48:34] (PS3) Filippo Giunchedi: puppetmaster: install keypair for 'puppet' when running as CA [puppet] - https://gerrit.wikimedia.org/r/421842 (https://phabricator.wikimedia.org/T189891)
[09:49:01] (CR) Muehlenhoff: [C: 2] Use deterministic order for results [debs/debdeploy] - https://gerrit.wikimedia.org/r/421846 (owner: Muehlenhoff)
[09:50:21] (PS1) Muehlenhoff: Bump changelog [debs/debdeploy] - https://gerrit.wikimedia.org/r/421848
[09:53:48] (CR) Vgutierrez: [C: 1] "LGTM" [debs/pybal] - https://gerrit.wikimedia.org/r/421051 (owner: Mark Bergsma)
[09:53:58] (CR) Muehlenhoff: [C: 2] Bump changelog [debs/debdeploy] - https://gerrit.wikimedia.org/r/421848 (owner: Muehlenhoff)
[09:55:22] Puppet: Investigate using SRV records for puppet - https://phabricator.wikimedia.org/T190665#4079925 (fgiunchedi)
[09:55:41] (CR) Filippo Giunchedi: "> Patch Set 1:" [dns] - https://gerrit.wikimedia.org/r/421839 (https://phabricator.wikimedia.org/T189891) (owner: Filippo Giunchedi)
[09:56:11] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler03/10653/" [puppet] - https://gerrit.wikimedia.org/r/421842 (https://phabricator.wikimedia.org/T189891) (owner: Filippo Giunchedi)
[10:02:17] Operations, ops-eqiad, hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#4079943 (MoritzMuehlenhoff) a:Cmjohnson>RobH The server is still visible in Cumin: ``` jmm@sarin:~$ sudo cumin irid* 1 hosts will be targeted: iridium.eqiad.wmnet DRY-RUN mode enabled, abortin...
[10:05:51] Operations, Traffic, Patch-For-Review, Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4079954 (BBlack) >>! In T189252#4079025, @Liuxinyu970226 wrote: > What about Antarctica (AQ)? Not in scope here....
[10:06:24] govg: FYI ^^^ (iridium) not sure if we should have a look, might be related with the puppetdb migration
[10:06:35] godog: ^^^ (sorry govg, bad autocomplete)
[10:06:57] PROBLEM - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[10:08:38] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[10:08:57] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:08:58] PROBLEM - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.168 and port 9042: Connection refused
[10:08:58] (PS1) ArielGlenn: clean up all 'latest' links from most runs older than current run [dumps] - https://gerrit.wikimedia.org/r/421851 (https://phabricator.wikimedia.org/T189527)
[10:11:14] volans: perhaps!
[10:12:09] volans _joe_ https://gerrit.wikimedia.org/r/c/421842/ for your eyes
[10:12:59] looking
[10:13:33] thanks, I'll tame new swift machines in the meantime
[10:15:00] (CR) Filippo Giunchedi: [C: 2] site: add ms-be204[0-3] [puppet] - https://gerrit.wikimedia.org/r/421249 (https://phabricator.wikimedia.org/T189633) (owner: Filippo Giunchedi)
[10:18:24] (PS1) Giuseppe Lavagetto: puppet_ecdsacert: puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/421853
[10:18:32] <_joe_> volans, godog ^^ I fixed it
[10:18:38] (CR) Filippo Giunchedi: [C: 2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/421249 (https://phabricator.wikimedia.org/T189633) (owner: Filippo Giunchedi)
[10:19:13] <_joe_> a simple signature change
[10:19:34] (CR) Giuseppe Lavagetto: [C: 2] puppet_ecdsacert: puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/421853 (owner: Giuseppe Lavagetto)
[10:19:52] _joe_: nice
[10:20:10] (PS3) Filippo Giunchedi: site: add ms-be204[0-3] [puppet] - https://gerrit.wikimedia.org/r/421249 (https://phabricator.wikimedia.org/T189633)
[10:23:05] !log uploaded debdeploy 0.0.99.4 to apt.wikimedia (for trusty/jessie/stretch)
[10:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:37] !log upgrading debdeploy across the fleet to 0.0.99.4
[10:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:58] PROBLEM - puppet last run on ms-be2043 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 seconds ago with 1 failures. Failed resources (up to 3 shown)
[10:27:56] (PS2) Alexandros Kosiaris: Add network policy objects to the helm charts [deployment-charts] - https://gerrit.wikimedia.org/r/421486 (https://phabricator.wikimedia.org/T184923)
[10:29:09] (PS2) Alexandros Kosiaris: Add wmfdebug image [docker-images/production-images] - https://gerrit.wikimedia.org/r/421680
[10:30:42] the puppet fails on ms-be204* is me
[10:30:45] PROBLEM - puppet last run on ms-be2041 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1],Exec[mkfs-/dev/sdd1]
[10:30:56] PROBLEM - puppet last run on ms-be2040 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 54 seconds ago with 2 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1],Exec[mkfs-/dev/sdj1]
[10:31:03] (CR) Alexandros Kosiaris: [V: 2 C: 2] Give tiller the right to manage network policies [deployment-charts] - https://gerrit.wikimedia.org/r/421484 (https://phabricator.wikimedia.org/T184923) (owner: Alexandros Kosiaris)
[10:32:35] (CR) Alexandros Kosiaris: "Kubernetes 1.7 does not support egress. Kubernetes 1.8 and 1.9 do however. But until then we can't set them explicitly. So we set them in " [deployment-charts] - https://gerrit.wikimedia.org/r/421485 (https://phabricator.wikimedia.org/T184923) (owner: Alexandros Kosiaris)
[10:33:26] (CR) Alexandros Kosiaris: [V: 2 C: 2] Add wmfdebug image [docker-images/production-images] - https://gerrit.wikimedia.org/r/421680 (owner: Alexandros Kosiaris)
[10:33:46] RECOVERY - cassandra-a service on restbase-dev1006 is OK: OK - cassandra-a is active
[10:33:56] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[10:34:10] !log reboot californium for T189115
[10:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:05] RECOVERY - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is OK: SSL OK - Certificate restbase-dev1006-a valid until 2018-07-20 15:08:10 +0000 (expires in 116 days)
[10:35:15] RECOVERY - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is OK: TCP OK - 0.000 second response time on 10.64.48.168 port 9042
[10:35:56] RECOVERY - puppet last run on ms-be2040 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[10:36:56] RECOVERY - puppet last run on ms-be2043 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[10:39:52] !log reboot silver for T189115
[10:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:58] (CR) Alexandros Kosiaris: [V: 2 C: 2] Annotate namespace with a default deny policy [deployment-charts] - https://gerrit.wikimedia.org/r/421485 (https://phabricator.wikimedia.org/T184923) (owner: Alexandros Kosiaris)
[10:40:02] (CR) Alexandros Kosiaris: [V: 2 C: 2] Add network policy objects to the helm charts [deployment-charts] - https://gerrit.wikimedia.org/r/421486 (https://phabricator.wikimedia.org/T184923) (owner: Alexandros Kosiaris)
[10:41:31] (PS1) BBlack: eqsin: turn-up ID, MY, VN, NC [dns] - https://gerrit.wikimedia.org/r/421855 (https://phabricator.wikimedia.org/T189252)
[10:42:16] jouncebot: next
[10:42:16] In 0 hour(s) and 17 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180326T1100)
[10:45:45] RECOVERY - puppet last run on ms-be2041 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:48:44] (PS2) Giuseppe Lavagetto: role: add configcluster_stretch [puppet] - https://gerrit.wikimedia.org/r/420014 (https://phabricator.wikimedia.org/T166081)
[10:51:00] (PS1) Giuseppe Lavagetto: site: apply configcluster_stretch to conf1004-6 [puppet] - https://gerrit.wikimedia.org/r/421856 (https://phabricator.wikimedia.org/T166081)
[10:54:03] (PS1) Giuseppe Lavagetto: Add key for _etcd-server-ssl._tcp.v3.eqiad.wmnet.key [labs/private] - https://gerrit.wikimedia.org/r/421857
[10:55:12] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Add key for _etcd-server-ssl._tcp.v3.eqiad.wmnet.key [labs/private] - https://gerrit.wikimedia.org/r/421857 (owner: Giuseppe Lavagetto)
[10:56:20] !log installing ICU security updates for jessie/stretch
[10:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:54] zeljkof, thanks for fixing deployments page. I fixed it, but instead of saving I left the wikipage in "preview" state. Facepalm ;/
[10:57:21] raynor: :D I did not put the correct irc nick, obviously :D
[10:58:18] (PS1) ArielGlenn: Add ability to skip recombine of meta-current page content, per project [dumps] - https://gerrit.wikimedia.org/r/421858 (https://phabricator.wikimedia.org/T179059)
[10:58:20] (PS1) KartikMistry: apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - https://gerrit.wikimedia.org/r/421859 (https://phabricator.wikimedia.org/T189076)
[10:58:32] sure, I fixed that one, I just found out that didn't save the wikipage, after hitting save I got conflict
[10:58:35] (CR) jerkins-bot: [V: -1] Add ability to skip recombine of meta-current page content, per project [dumps] - https://gerrit.wikimedia.org/r/421858 (https://phabricator.wikimedia.org/T179059) (owner: ArielGlenn)
[10:58:46] (CR) jerkins-bot: [V: -1] apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - https://gerrit.wikimedia.org/r/421859 (https://phabricator.wikimedia.org/T189076) (owner: KartikMistry)
[10:58:48] (PS1) Filippo Giunchedi: [WIP] puppetmaster: adjust passenger pool size [puppet] - https://gerrit.wikimedia.org/r/421860 (https://phabricator.wikimedia.org/T184561)
[11:00:05] jan_drewniak: That opportune time is upon us again. Time for a Wikimedia Portals Update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180326T1100).
[11:00:05] No GERRIT patches in the queue for this window AFAICS.
[11:00:57] (03PS3) 10Giuseppe Lavagetto: role: add configcluster_stretch [puppet] - 10https://gerrit.wikimedia.org/r/420014 (https://phabricator.wikimedia.org/T166081) [11:00:59] (03PS2) 10Giuseppe Lavagetto: site: apply configcluster_stretch to conf1004-6 [puppet] - 10https://gerrit.wikimedia.org/r/421856 (https://phabricator.wikimedia.org/T166081) [11:01:15] (03PS4) 10BBlack: lvs - use new fact to determine bnx2x [puppet] - 10https://gerrit.wikimedia.org/r/414740 [11:04:19] (03CR) 10Elukey: [C: 031] "pcc looks good https://puppet-compiler.wmflabs.org/compiler03/10650/" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/421617 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [11:07:21] (03CR) 10BBlack: [C: 032] lvs - use new fact to determine bnx2x [puppet] - 10https://gerrit.wikimedia.org/r/414740 (owner: 10BBlack) [11:08:23] (03PS1) 10Giuseppe Lavagetto: profile::etcd::v3: fix monitoring declaration, and monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/421863 [11:09:28] elukey: thanks for the patch, will take a look this morning. I have a bunch more changes as well. 
[11:09:43] (03Draft2) 10MarcoAurelio: Enable AbuseFilter profiler at zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421862 (https://phabricator.wikimedia.org/T190663) [11:10:01] (03Draft1) 10MarcoAurelio: Enable AbuseFilter profiler at zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421862 (https://phabricator.wikimedia.org/T190663) [11:11:38] (03PS2) 10BBlack: lvs1007-12: remove LVS config bits everywhere [puppet] - 10https://gerrit.wikimedia.org/r/415044 [11:12:24] (03PS1) 10Giuseppe Lavagetto: Add fake secrets for etcd v3 [labs/private] - 10https://gerrit.wikimedia.org/r/421865 [11:12:41] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add fake secrets for etcd v3 [labs/private] - 10https://gerrit.wikimedia.org/r/421865 (owner: 10Giuseppe Lavagetto) [11:12:58] (03PS2) 10ArielGlenn: Add ability to skip recombine of meta-current page content, per project [dumps] - 10https://gerrit.wikimedia.org/r/421858 (https://phabricator.wikimedia.org/T179059) [11:14:37] (03CR) 10BBlack: [C: 032] lvs1007-12: remove LVS config bits everywhere [puppet] - 10https://gerrit.wikimedia.org/r/415044 (owner: 10BBlack) [11:15:10] (03Draft2) 10MarcoAurelio: Disable AbuseFilter from collecting IP addresses on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421864 (https://phabricator.wikimedia.org/T188862) [11:15:14] (03Draft1) 10MarcoAurelio: Disable AbuseFilter from collecting IP addresses on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421864 (https://phabricator.wikimedia.org/T188862) [11:16:25] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: service=mathoid,cluster=scb,name=scb.* [11:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:57] !log depool scb hosts for mathoid service. 
T184919 [11:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:03] T184919: Serve at least 50% of Mathoid via kubernetes - https://phabricator.wikimedia.org/T184919 [11:17:41] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-separable] - 10https://gerrit.wikimedia.org/r/421808 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [11:18:07] (03CR) 10jerkins-bot: [V: 04-1] apertium-separable: Initial Debian packaging [debs/contenttranslation/apertium-separable] - 10https://gerrit.wikimedia.org/r/421808 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [11:20:19] (03PS2) 10Filippo Giunchedi: [WIP] puppetmaster: adjust passenger pool size [puppet] - 10https://gerrit.wikimedia.org/r/421860 (https://phabricator.wikimedia.org/T184561) [11:20:38] PROBLEM - puppet last run on wezen is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[debdeploy-client] [11:21:34] (03PS4) 10Giuseppe Lavagetto: role: add configcluster_stretch [puppet] - 10https://gerrit.wikimedia.org/r/420014 (https://phabricator.wikimedia.org/T166081) [11:21:52] (03CR) 10Giuseppe Lavagetto: [C: 032] role: add configcluster_stretch [puppet] - 10https://gerrit.wikimedia.org/r/420014 (https://phabricator.wikimedia.org/T166081) (owner: 10Giuseppe Lavagetto) [11:22:22] moritzm: ^^^ debdeploy-client [11:22:56] PROBLEM - puppet last run on poolcounter1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[debdeploy-client] [11:25:00] 10Operations, 10Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4080162 (10Lucas_Werkmeister_WMDE) [11:25:35] (03PS1) 10Pmiazga: Enable mobile-only Mediawiki:MainPageCss styles for Hindi wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421867 (https://phabricator.wikimedia.org/T190101) [11:26:04] (03PS3) 10Giuseppe Lavagetto: site: apply configcluster_stretch to conf1004-6 [puppet] - 10https://gerrit.wikimedia.org/r/421856 (https://phabricator.wikimedia.org/T166081) [11:27:02] volans: that's all fine, that's usually caused by the puppet-triggered "apt-get update" failing since debdeploy held the dpkg lock [11:27:15] ack [11:28:53] (03CR) 10Giuseppe Lavagetto: [C: 032] site: apply configcluster_stretch to conf1004-6 [puppet] - 10https://gerrit.wikimedia.org/r/421856 (https://phabricator.wikimedia.org/T166081) (owner: 10Giuseppe Lavagetto) [11:29:09] (03PS2) 10Giuseppe Lavagetto: profile::etcd::v3: fix monitoring declaration, and monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/421863 [11:29:51] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler03/10662/conf1004.eqiad.wmnet/ looks good on paper!" 
[puppet] - 10https://gerrit.wikimedia.org/r/421863 (owner: 10Giuseppe Lavagetto) [11:30:35] RECOVERY - puppet last run on wezen is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:31:34] !log reboot labcontrol1002 for T189115 [11:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:58] jouncebot: next [11:31:58] In 1 hour(s) and 28 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180326T1300) [11:35:00] (03CR) 10MarcoAurelio: [C: 031] Initial configuration for euwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419171 (owner: 10Urbanecm) [11:37:25] PROBLEM - Disk space on elastic1026 is CRITICAL: DISK CRITICAL - free space: /srv 61646 MB (12% inode=99%) [11:40:15] <_joe_> dcausse: ^^ should we worry? [11:42:17] _joe_: nope, reindex still in progress [11:45:04] (03PS1) 10Alexandros Kosiaris: Fix package typo in wmfdebug image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/421870 [11:45:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix package typo in wmfdebug image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/421870 (owner: 10Alexandros Kosiaris) [11:47:01] !log reboot labcontrol100[3,4] for T189115 [11:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:00] 10Operations, 10Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4080237 (10Lucas_Werkmeister_WMDE) I think `analytics-privatedata-users` is the group I need. (As far as I can tell from the config file, that doesn’t have any sudo rights,... 
[11:52:55] RECOVERY - puppet last run on poolcounter1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:56:26] RECOVERY - Disk space on elastic1026 is OK: DISK OK [11:56:58] (03PS1) 10Volans: Puppetboard: fine tune apache config and settings [puppet] - 10https://gerrit.wikimedia.org/r/421871 (https://phabricator.wikimedia.org/T184563) [12:00:33] !log restarting HHVM on mediawiki canaries to pick up ICU security update [12:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:52] !log reboot labmon100[1,2] for T189115 [12:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:23] (03PS1) 10Alexandros Kosiaris: wmfdebug: Add the nmap utility [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/421874 [12:04:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] wmfdebug: Add the nmap utility [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/421874 (owner: 10Alexandros Kosiaris) [12:04:35] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: CRITICAL - kubelet_operational_latencies is 35668 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:05:35] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: OK - kubelet_operational_latencies is 5262 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:06:17] marlier: np! 
Feel free to send code reviews, I'll try to review them asap [12:14:41] (03CR) 10Elukey: [C: 031] Puppetboard: fine tune apache config and settings [puppet] - 10https://gerrit.wikimedia.org/r/421871 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [12:15:31] (03Abandoned) 10Elukey: mediawiki::web::modules: force dependency between apache confs [puppet] - 10https://gerrit.wikimedia.org/r/380472 (owner: 10Elukey) [12:15:40] (03Abandoned) 10Elukey: Create a prometheus rule to calculate Memcached get hit ratio [puppet] - 10https://gerrit.wikimedia.org/r/334344 (owner: 10Elukey) [12:15:49] (03CR) 10Elukey: [C: 032] Apply some consistency to source code formatting [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/421490 (owner: 10Elukey) [12:16:12] (03Abandoned) 10Elukey: phabricator: disable opcache.fastshutdown [puppet] - 10https://gerrit.wikimedia.org/r/410767 (https://phabricator.wikimedia.org/T182832) (owner: 10Elukey) [12:24:21] (03PS3) 10Elukey: cassandra: upgrade version 2.2 package settings [puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) [12:29:45] (03CR) 10Elukey: "Now it looks much better: https://puppet-compiler.wmflabs.org/compiler03/10664/" [puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) (owner: 10Elukey) [12:30:24] !log reboot labbwr100[2,3,4] for T189115 [12:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:35] !log reboot labnet100[2,3,4]* for T189115 [12:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:55] (03PS4) 10Elukey: cassandra: upgrade version 2.2 package settings for aqs [puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) [12:33:47] gerrit's git interface seems down. web is fine. is this known? [12:35:04] oh, actually - git fetch --all fails for core. works for other stuff. 
[12:38:33] https://phabricator.wikimedia.org/P6897 [12:40:28] !log reboot labservices1002 for T189115 [12:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:35] (03PS1) 10Elukey: role::aqs: enable jmx agent [puppet] - 10https://gerrit.wikimedia.org/r/421878 (https://phabricator.wikimedia.org/T184795) [12:44:24] jouncebot: next [12:44:25] In 0 hour(s) and 15 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180326T1300) [12:46:49] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install ms-be204[0-3] - https://phabricator.wikimedia.org/T189633#4080380 (10fgiunchedi) [12:47:31] !log add ms-be204[0-3] with minimal weight - T189633 [12:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:37] T189633: rack/setup/install ms-be204[0-3] - https://phabricator.wikimedia.org/T189633 [12:47:41] 10Operations, 10Gerrit: git pull fails for MW core filas with "fatal: protocol error: bad pack header" - https://phabricator.wikimedia.org/T190676#4080384 (10daniel) [12:48:40] 10Operations, 10Gerrit: git pull fails for MW core filas with "fatal: protocol error: bad pack header" - https://phabricator.wikimedia.org/T190676#4080395 (10daniel) p:05Triage>03High Bumping to high. May even be UBN. I'm trying a fresh clone now. 
[12:50:05] (03CR) 10Volans: [C: 032] Puppetboard: fine tune apache config and settings [puppet] - 10https://gerrit.wikimedia.org/r/421871 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [12:50:44] 10Operations, 10Gerrit: git pull fails for MW core filas with "fatal: protocol error: bad pack header" - https://phabricator.wikimedia.org/T190676#4080408 (10daniel) Ftr: [14:49] worked for me [14:49] https://www.irccloud.com/pastebin/U3V1obs5/ [12:53:55] 10Operations, 10Gerrit: git pull fails for MW core filas with "fatal: protocol error: bad pack header" - https://phabricator.wikimedia.org/T190676#4080422 (10daniel) A fresh clone seems to fix the problem. So no UBN. [12:54:25] 10Operations, 10Gerrit: git pull fails for MW core filas with "fatal: protocol error: bad pack header" - https://phabricator.wikimedia.org/T190676#4080423 (10daniel) [12:58:37] 10Operations, 10Gerrit: git pull fails for MW core filas with "fatal: protocol error: bad pack header" - https://phabricator.wikimedia.org/T190676#4080384 (10Paladox) Hi, please run "git remote prune origin" [12:59:48] 10Operations, 10Gerrit: git pull fails for MW core filas with "fatal: protocol error: bad pack header" - https://phabricator.wikimedia.org/T190676#4080478 (10Paladox) 05Open>03Resolved As long as you run "git remote prune origin" that should resolve the problem, please reopen if it dosen't. [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180326T1300). [13:00:05] Hauskatze and raynor: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:41] Hi there. 
[13:01:12] o/ [13:01:15] I can SWAT today [13:01:38] o/ [13:01:44] 10Operations, 10Gerrit: git pull fails for MW core filas with "fatal: protocol error: bad pack header" - https://phabricator.wikimedia.org/T190676#4080384 (10hashar) The reason is you have local branches pointing to remote branches that no more exist. Specially, the wmf/ branches are deleted after some weeks a... [13:02:45] hi zeljkof - I can't deploy, feel free to do mine [13:03:00] I also cannot deploy [13:03:06] raynor: I'll merge your commit first, since it is in an extension, and it will take a while [13:03:13] sure, thx [13:03:20] Hauskatze: I'll merge/deploy your commits in parallel [13:03:41] ok [13:03:48] 10Operations, 10Gerrit: git pull fails for MW core filas with "fatal: protocol error: bad pack header" - https://phabricator.wikimedia.org/T190676#4080495 (10daniel) Ah, thank you! Is this somehow a new thing? I wonder why I'm running into this for the first time now. [13:05:03] 10Operations, 10Gerrit: git pull fails for MW core filas with "fatal: protocol error: bad pack header" - https://phabricator.wikimedia.org/T190676#4080497 (10daniel) I feel like filing an upstream bug. "fatal: protocol error: bad pack header" is NOT a good way to say "the branch you are tracking no longer exis... [13:05:45] 10Operations, 10Gerrit: git pull fails for MW core filas with "fatal: protocol error: bad pack header" - https://phabricator.wikimedia.org/T190676#4080498 (10Paladox) >>! In T190676#4080495, @daniel wrote: > Ah, thank you! > Is this somehow a new thing? I wonder why I'm running into this for the first time no... [13:06:10] (03CR) 10Muehlenhoff: [C: 031] "That seems fine. 
It would probably be more intuitive to have a single/combined target_version/package version, but that'll hopefully resol" [puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) (owner: 10Elukey) [13:07:12] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be03 - https://phabricator.wikimedia.org/T190683#4080501 (10MarcoAurelio) [13:07:20] PROBLEM - Check status of defined EventLogging jobs on eventlog1002 is CRITICAL: CRITICAL: Stopped EventLogging jobs: eventlogging-consumer@mysql-m4-master-00 eventlogging-consumer@mysql-eventbus [13:08:23] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421721 (https://phabricator.wikimedia.org/T190619) (owner: 10MarcoAurelio) [13:08:43] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be03 - https://phabricator.wikimedia.org/T190683#4080501 (10Paladox) You should run puppet to find out what the error is please? [13:09:40] eventlogging1002 is me, running an update query on db1107 (master db) so as a precautionary measure we stopped mysql traffic [13:09:44] (03Merged) 10jenkins-bot: Add 'tboverride' to 'engineer' at ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421721 (https://phabricator.wikimedia.org/T190619) (owner: 10MarcoAurelio) [13:10:52] Hauskatze: 421721 is at mwdebug, please test and let me know if I can deploy [13:11:05] zeljkof: checking [13:11:13] (03PS1) 10Giuseppe Lavagetto: profile::etcd::v3: fix some configurations [puppet] - 10https://gerrit.wikimedia.org/r/421882 [13:11:31] zeljkof: mwdebug 1 or 2? 
[13:11:50] PROBLEM - eventlogging_sync processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [13:11:55] (03CR) 10jerkins-bot: [V: 04-1] profile::etcd::v3: fix some configurations [puppet] - 10https://gerrit.wikimedia.org/r/421882 (owner: 10Giuseppe Lavagetto) [13:11:59] Hauskatze: sorry, it's always mwdebug1002, per https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Test_Canary [13:12:23] zeljkof: okay thanks - checked - looks good [13:12:29] Hauskatze: deploying [13:13:25] (03CR) 10Zfilipin: [C: 031] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421862 (https://phabricator.wikimedia.org/T190663) (owner: 10MarcoAurelio) [13:13:33] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:421721|Add tboverride to engineer at ruwiki (T190619)]] (duration: 01m 01s) [13:13:37] eventlogging_sync is me again, I was convinced to have downtimed db1108 [13:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:40] T190619: Add TitleBlacklist override to Russian Wikipedia engineers - https://phabricator.wikimedia.org/T190619 [13:13:45] Hauskatze: deployed, please check [13:13:56] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421862 (https://phabricator.wikimedia.org/T190663) (owner: 10MarcoAurelio) [13:14:04] zeljkof: still looks good [13:14:07] ah sure I downtimed it for 10 mins [13:14:12] * elukey facepalm [13:14:37] (03PS2) 10Giuseppe Lavagetto: profile::etcd::v3: fix some configurations [puppet] - 10https://gerrit.wikimedia.org/r/421882 [13:14:40] raynor: your patch is merged, should be at mwdebug in a minute or two [13:15:08] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be03 - https://phabricator.wikimedia.org/T190683#4080550 (10MarcoAurelio) I ran puppet agent -tv but I guess I did that on the wrong folder as it started to create stuff on 
~/maurelio/.puppet -- I shall find the right place to run this. [13:15:10] (03Merged) 10jenkins-bot: Enable AbuseFilter profiler at zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421862 (https://phabricator.wikimedia.org/T190663) (owner: 10MarcoAurelio) [13:15:13] (03CR) 10jerkins-bot: [V: 04-1] profile::etcd::v3: fix some configurations [puppet] - 10https://gerrit.wikimedia.org/r/421882 (owner: 10Giuseppe Lavagetto) [13:15:17] ok, thx zeljkof [13:17:06] raynor: it's at mwdebug1002, please test and let me know if I can deploy [13:18:05] Hauskatze: 421862 is at mwdebug1002, please test and let me know if I can deploy [13:18:25] zeljkof - checking [13:18:43] zeljkof: checking [13:19:31] zeljkof: checked, looks good to me [13:19:42] (03PS3) 10Giuseppe Lavagetto: profile::etcd::v3: fix some configurations [puppet] - 10https://gerrit.wikimedia.org/r/421882 [13:19:58] (03CR) 10jenkins-bot: Add 'tboverride' to 'engineer' at ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421721 (https://phabricator.wikimedia.org/T190619) (owner: 10MarcoAurelio) [13:20:06] (03CR) 10jenkins-bot: Enable AbuseFilter profiler at zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421862 (https://phabricator.wikimedia.org/T190663) (owner: 10MarcoAurelio) [13:20:34] (03CR) 10Ottomata: Use --new.consumer for main -> jumbo mirror maker (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/421617 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [13:20:42] Hauskatze: deploying [13:20:58] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler03/10666/aqs1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/421878 (https://phabricator.wikimedia.org/T184795) (owner: 10Elukey) [13:21:09] (03PS3) 10Ottomata: Use --new.consumer for main -> jumbo mirror maker [puppet] - 10https://gerrit.wikimedia.org/r/421617 (https://phabricator.wikimedia.org/T189464) [13:21:20] (03PS3) 10Zfilipin: Disable AbuseFilter from collecting IP 
addresses on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421864 (https://phabricator.wikimedia.org/T188862) (owner: 10MarcoAurelio) [13:21:58] !log zfilipin@tin Synchronized wmf-config/abusefilter.php: SWAT: [[gerrit:421862|Enable AbuseFilter profiler at zh.wikipedia (T190663)]] (duration: 01m 00s) [13:21:59] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be03 - https://phabricator.wikimedia.org/T190683#4080558 (10MarcoAurelio) ``` maurelio@deployment-ms-be03:~$ sudo puppet agent -tv Info: Using configured environment 'future' Info: Retrieving pluginfacts Info: Retrieving plugin Info: Loadi... [13:22:03] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::etcd::v3: fix some configurations [puppet] - 10https://gerrit.wikimedia.org/r/421882 (owner: 10Giuseppe Lavagetto) [13:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:04] T190663: Enable $wgAbuseFilterProfile on zhwiki - https://phabricator.wikimedia.org/T190663 [13:22:08] Hauskatze: deployed, please check [13:22:32] zeljkof: checking [13:22:56] zeljkof: looks good [13:23:04] zeljkof: the beta one cannot be tested [13:23:09] !log temporarily stopping puppet on kafka102[023] to use --new.consumer mirrormaker consuming from end [13:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:24] (03CR) 10Ottomata: [C: 032] Use --new.consumer for main -> jumbo mirror maker [puppet] - 10https://gerrit.wikimedia.org/r/421617 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [13:23:27] Hauskatze: I do not feel comfortable deploying 421864, per https://phabricator.wikimedia.org/T188862#4075203 [13:23:30] (03PS4) 10Ottomata: Use --new.consumer for main -> jumbo mirror maker [puppet] - 10https://gerrit.wikimedia.org/r/421617 (https://phabricator.wikimedia.org/T189464) [13:23:31] however feel free to push it to mwdebug should you wish me to check if beta doesn't go down [13:23:32] (03CR) 10Ottomata: [V: 032 
C: 032] Use --new.consumer for main -> jumbo mirror maker [puppet] - 10https://gerrit.wikimedia.org/r/421617 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [13:23:47] _joe_: :) [13:23:52] puppet merge ok? [13:24:01] zeljkof: the patch fixes that, demon reverted the wrong config [13:24:06] Hauskatze: looks like it was already reverted once, I would prefer not to deploy until somebody more knowledgeable gives it a +1 [13:24:27] zeljkof: okay, I feel it is the right thing now, but I can understand [13:24:46] Hauskatze: could you please make sure somebody (like no_justification) takes a look before it's deployed? [13:24:50] <_joe_> ottomata: go on please [13:24:51] (03CR) 10Volans: [C: 031] "LGTM, but I'm not fully sure if there might be some races or puppet might overwrite those files. I'd like _joe_ to have a look too." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421842 (https://phabricator.wikimedia.org/T189891) (owner: 10Filippo Giunchedi) [13:24:56] <_joe_> sorry, someone rang at the door [13:24:56] zeljkof - my patch looks good [13:25:08] raynor: ok, deploying [13:25:15] \o/ [13:25:23] zeljkof: I could try, but maybe he'll listen to you [13:25:51] Hauskatze: I don't really know what needs to happen in that file, so I am really not a good judge :| [13:26:27] !log zfilipin@tin Synchronized php-1.31.0-wmf.26/extensions/MobileFrontend/: SWAT: [[gerrit:421359|Squash: Hygiene: Auto namespace ResourceLoader modules and Add $wgMFMobileMainPageCss config flag; Hygiene: Auto namespace ResourceLoader modules; Add $wgMFMobileMainPageCss config flag (T190101)]] (duration: 01m 01s) [13:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:32] T190101: Apply styles to mobile only for Hindi campaign - https://phabricator.wikimedia.org/T190101 [13:26:51] raynor: deployed, please check and thanks for deploying with #releng ;) [13:27:10] Hauskatze: good luck and thanks for deploying with #releng :D [13:27:34] !log 
EU SWAT finished [13:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:50] zeljkof, thanks. Every time you deploy - everything works \o/ [13:30:06] raynor: it's more up to you than me ;) [13:30:22] I just push the buttons, it's up to you what they do [13:30:30] !log rebooting app server canaries to pick up ICU security update [13:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:56] !log restarting HHVM on app server canaries to pick up ICU security update (not rebooting as logged before) [13:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:57] (03PS1) 10Giuseppe Lavagetto: etcd::v3: use the correct fact as default [puppet] - 10https://gerrit.wikimedia.org/r/421885 [13:32:49] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd::v3: use the correct fact as default [puppet] - 10https://gerrit.wikimedia.org/r/421885 (owner: 10Giuseppe Lavagetto) [13:37:44] (03PS1) 10Giuseppe Lavagetto: role::configcluster: use the same domain for the v2 and v3 tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/421887 [13:38:21] (03CR) 10Filippo Giunchedi: puppetmaster: install keypair for 'puppet' when running as CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421842 (https://phabricator.wikimedia.org/T189891) (owner: 10Filippo Giunchedi) [13:38:34] _joe_: when you get a moment ^ [13:38:42] 10Puppet, 10Beta-Cluster-Infrastructure: PROBLEM - Puppet errors on deployment-mediawiki07 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] - https://phabricator.wikimedia.org/T190632#4080601 (10MarcoAurelio) ``` maurelio@deployment-mediawiki07:~$ sudo puppet agent -tv Info: Using conf... 
[13:39:43] <_joe_> godog: yeah, when I do, sorry :P [13:40:00] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet errors on deployment-mediawiki07 - https://phabricator.wikimedia.org/T190632#4080603 (10MarcoAurelio) [13:40:25] (03CR) 10Giuseppe Lavagetto: [C: 032] role::configcluster: use the same domain for the v2 and v3 tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/421887 (owner: 10Giuseppe Lavagetto) [13:40:38] yeah no worries [13:41:46] (03CR) 10Filippo Giunchedi: [C: 032] Remove non-authoritative SRV puppet records [dns] - 10https://gerrit.wikimedia.org/r/421839 (https://phabricator.wikimedia.org/T189891) (owner: 10Filippo Giunchedi) [13:41:51] (03PS2) 10Filippo Giunchedi: Remove non-authoritative SRV puppet records [dns] - 10https://gerrit.wikimedia.org/r/421839 (https://phabricator.wikimedia.org/T189891) [13:42:31] (03PS1) 10Volans: Puppetboard: fine tune inventory [puppet] - 10https://gerrit.wikimedia.org/r/421888 (https://phabricator.wikimedia.org/T184563) [13:43:01] (03CR) 10Filippo Giunchedi: [C: 032] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/421839 (https://phabricator.wikimedia.org/T189891) (owner: 10Filippo Giunchedi) [13:43:25] (03CR) 10Volans: [C: 032] Puppetboard: fine tune inventory [puppet] - 10https://gerrit.wikimedia.org/r/421888 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [13:44:53] (03CR) 10Filippo Giunchedi: cassandra: upgrade version 2.2 package settings for aqs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) (owner: 10Elukey) [13:45:20] (03CR) 10Imarlier: [C: 031] "Only question: will journalctl still allow me to see the log files, once this change is made?" 
[puppet] - 10https://gerrit.wikimedia.org/r/421838 (owner: 10Elukey) [13:46:27] (03PS1) 10Giuseppe Lavagetto: profile::etcd::v3: make private key readable by group etcd [puppet] - 10https://gerrit.wikimedia.org/r/421890 [13:46:33] (03CR) 10Rush: "At first glance through the task etc it seems like a deployment-prep only service account would be the better model" [puppet] - 10https://gerrit.wikimedia.org/r/421709 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [13:46:36] (03PS3) 10Filippo Giunchedi: Remove non-authoritative SRV puppet records [dns] - 10https://gerrit.wikimedia.org/r/421839 (https://phabricator.wikimedia.org/T189891) [13:47:08] (03CR) 10Elukey: "> Only question: will journalctl still allow me to see the log files," [puppet] - 10https://gerrit.wikimedia.org/r/421838 (owner: 10Elukey) [13:47:24] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::etcd::v3: make private key readable by group etcd [puppet] - 10https://gerrit.wikimedia.org/r/421890 (owner: 10Giuseppe Lavagetto) [13:47:30] (03PS3) 10Elukey: coal: move conf to systemd only and add dedicated logging dir [puppet] - 10https://gerrit.wikimedia.org/r/421838 [13:47:45] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#4080616 (10MarcoAurelio) [13:47:58] (03PS4) 10Filippo Giunchedi: Remove non-authoritative SRV puppet records [dns] - 10https://gerrit.wikimedia.org/r/421839 (https://phabricator.wikimedia.org/T189891) [13:48:30] (03CR) 10Elukey: [C: 032] coal: move conf to systemd only and add dedicated logging dir [puppet] - 10https://gerrit.wikimedia.org/r/421838 (owner: 10Elukey) [13:49:18] (03CR) 10Andrew Bogott: "> a deployment-prep only service account would be the better model" [puppet] - 10https://gerrit.wikimedia.org/r/421709 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [13:52:31] 10Operations, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey, 10User-Joe: 
rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#4080629 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['conf1005.eqiad.wmnet'... [13:53:24] (03PS1) 10Ottomata: Update kafka java.security file with Java 8 u162 changes [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) [13:53:48] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [13:53:51] 10Operations, 10Patch-For-Review: Review changes to /etc/java-8-openjdk/security/java.security in Kafka from u162 update - https://phabricator.wikimedia.org/T190400#4080633 (10Ottomata) In general, the changes we made to java.security were to make the list of available algs more restrictive. So, in the case h... [13:54:18] (03PS1) 10Elukey: coal: fix rsyslog configuration [puppet] - 10https://gerrit.wikimedia.org/r/421892 [13:55:24] !log restarting CI Jenkins . Upgrades Mail plugin from 1.20 to 1.21 | T190393 [13:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:59] I was about to say.. is jenkins ok? 
:D [13:56:46] (03CR) 10Paladox: "The package needs to also be copied to stretch-wikimedia see https://phabricator.wikimedia.org/T190632#4080603" [puppet] - 10https://gerrit.wikimedia.org/r/392221 (owner: 10Aaron Schulz) [13:56:55] (03CR) 10Filippo Giunchedi: prometheus: calculate varnish requests daily/weekly averages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421505 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [13:57:08] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for prometheus-hhvm-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419447 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:57:23] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler03/10667/graphite1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/421892 (owner: 10Elukey) [14:00:10] (03CR) 10BBlack: [C: 031] Puppetboard: add varnish director entries [puppet] - 10https://gerrit.wikimedia.org/r/419763 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [14:00:39] marlier: done! You have journalctl's log in /var/log/coal/coal.log [14:00:52] (03CR) 10BBlack: [C: 031] Add puppetboard.wikimedia.org entry [dns] - 10https://gerrit.wikimedia.org/r/419800 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [14:01:10] elukey: 💯 Thank you! [14:02:17] (03CR) 10Volans: "Removing -2, service is ready to go live." [puppet] - 10https://gerrit.wikimedia.org/r/419763 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [14:02:26] marlier: if you have time, Timo this morning (EU Time) told me that you guys still have issues with coal [14:02:55] Yeah, I'm working on a couple of different things to address it. [14:03:06] super, let me know if you need my help [14:03:28] I'll have a review in, I think, an hour or so. 
[14:03:59] (03CR) 10Rush: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/421709 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [14:05:59] (03CR) 10Elukey: [C: 031] Update kafka java.security file with Java 8 u162 changes [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) (owner: 10Ottomata) [14:06:12] (03PS2) 10Vgutierrez: prometheus: calculate varnish requests daily/weekly averages [puppet] - 10https://gerrit.wikimedia.org/r/421505 (https://phabricator.wikimedia.org/T184942) [14:06:36] (03PS3) 10Filippo Giunchedi: [WIP] puppetmaster: adjust passenger pool size [puppet] - 10https://gerrit.wikimedia.org/r/421860 (https://phabricator.wikimedia.org/T184561) [14:07:12] (03CR) 10Vgutierrez: prometheus: calculate varnish requests daily/weekly averages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421505 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [14:07:36] (03PS5) 10Volans: Puppetboard: add varnish director entries [puppet] - 10https://gerrit.wikimedia.org/r/419763 (https://phabricator.wikimedia.org/T184563) [14:11:33] (03PS1) 10Alexandros Kosiaris: Update helm repository with version 0.0.2 of mathoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/421895 [14:11:35] (03PS4) 10Filippo Giunchedi: puppetmaster: adjust passenger pool size [puppet] - 10https://gerrit.wikimedia.org/r/421860 (https://phabricator.wikimedia.org/T184561) [14:11:37] (03PS1) 10Ottomata: Use profile::kafka::mirror for main -> jumbo [puppet] - 10https://gerrit.wikimedia.org/r/421896 (https://phabricator.wikimedia.org/T189464) [14:12:03] (03CR) 10Volans: [C: 032] "Compiler results for reference:" [puppet] - 10https://gerrit.wikimedia.org/r/419763 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [14:14:24] (03CR) 10Rush: [C: 031] "cool thanks, a bunch of these types of offloading of weird inclusions are probably coming and that's ok. 
all the toolforge puppet stuffs " [puppet] - 10https://gerrit.wikimedia.org/r/421515 (https://phabricator.wikimedia.org/T190135) (owner: 10Muehlenhoff) [14:14:38] (03PS2) 10Ottomata: Use profile::kafka::mirror for main -> jumbo [puppet] - 10https://gerrit.wikimedia.org/r/421896 (https://phabricator.wikimedia.org/T189464) [14:15:19] 10Operations, 10Traffic: Removing support for AES128-SHA TLS cipher - https://phabricator.wikimedia.org/T147202#4080675 (10BBlack) Since the last stats update ~6 months ago above, the overall percentage for AES128-SHA has continued its decline, from ~0.220% to ~0.0846% . We'll be looking to plan and start an... [14:15:39] (03CR) 10Rush: [C: 031] "(the only thought I had reading this was maybe just make it " toollabs::exec_environ::fonts" similar to old fonts manifest)" [puppet] - 10https://gerrit.wikimedia.org/r/421515 (https://phabricator.wikimedia.org/T190135) (owner: 10Muehlenhoff) [14:16:10] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462#4080679 (10akosiaris) [14:16:18] 10Operations, 10Mathoid, 10Prod-Kubernetes, 10Kubernetes, and 3 others: Serve at least 50% of Mathoid via kubernetes - https://phabricator.wikimedia.org/T184919#4080676 (10akosiaris) 05Open>03Resolved a:03akosiaris This has been achieved successfully and even surpassed the goal by achieving 100%. I '... [14:17:37] (03CR) 10Andrew Bogott: ">is port 5000 the only auth port open to instances?" 
[puppet] - 10https://gerrit.wikimedia.org/r/421709 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [14:18:40] (03PS3) 10Ottomata: Use profile::kafka::mirror for main -> jumbo [puppet] - 10https://gerrit.wikimedia.org/r/421896 (https://phabricator.wikimedia.org/T189464) [14:20:24] (03PS4) 10Ottomata: Use profile::kafka::mirror for main -> jumbo [puppet] - 10https://gerrit.wikimedia.org/r/421896 (https://phabricator.wikimedia.org/T189464) [14:21:40] (03CR) 10Rush: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/421709 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [14:22:40] (03PS2) 10Andrew Bogott: openstack: Permit deployment-prep-dns-manager to log in from instance subnet [puppet] - 10https://gerrit.wikimedia.org/r/421709 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [14:22:52] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 43249966 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:23:04] (03CR) 10Ottomata: "Looks good, let's try it https://puppet-compiler.wmflabs.org/compiler03/10672/kafka1020.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/421896 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [14:23:11] (03PS5) 10Ottomata: Use profile::kafka::mirror for main -> jumbo [puppet] - 10https://gerrit.wikimedia.org/r/421896 (https://phabricator.wikimedia.org/T189464) [14:23:25] (03CR) 10Ottomata: [V: 032 C: 032] Use profile::kafka::mirror for main -> jumbo [puppet] - 10https://gerrit.wikimedia.org/r/421896 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [14:24:01] (03CR) 10Andrew Bogott: [C: 032] openstack: Permit deployment-prep-dns-manager to log in from instance subnet [puppet] - 10https://gerrit.wikimedia.org/r/421709 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [14:24:07] (03PS3) 10Andrew Bogott: openstack: Permit deployment-prep-dns-manager to log in from instance subnet [puppet] 
- 10https://gerrit.wikimedia.org/r/421709 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [14:24:52] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 5922 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:25:03] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 25858232 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:27:30] (03PS1) 10Elukey: profile::restbae: add sysctl settings to improve tcp performance [puppet] - 10https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213) [14:28:19] 10Operations: Integrate stretch 9.4 point update - https://phabricator.wikimedia.org/T189435#4080733 (10MoritzMuehlenhoff) These are fully rolled out: base-files cron cups java-atk-wrapper virt-what [14:29:06] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 4304 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:29:10] (03Restored) 10Sbisson: Configure maps source for localized labels [puppet] - 10https://gerrit.wikimedia.org/r/420315 (https://phabricator.wikimedia.org/T112948) (owner: 10Sbisson) [14:29:17] (03PS5) 10Sbisson: Configure maps source for localized labels [puppet] - 10https://gerrit.wikimedia.org/r/420315 (https://phabricator.wikimedia.org/T112948) [14:29:35] (03CR) 10Filippo Giunchedi: prometheus: calculate varnish requests daily/weekly averages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421505 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [14:29:42] (03PS2) 10Gilles: Upgrade to 1.16 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/419172 (https://phabricator.wikimedia.org/T186528) [14:29:48] (03PS4) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-hhvm-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419447 (https://phabricator.wikimedia.org/T135991) [14:31:03] (03CR) 10Elukey: "Added all the experts that have 
done this work previously for very delicate systems to have an agreement about how to proceed. From the ta" [puppet] - 10https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213) (owner: 10Elukey) [14:31:19] Hi, what's the progress for getting a change into operations/puppet? Who can merge, when does it get deployed (only during puppet-swat)? [14:31:32] 10Operations, 10Puppet: remove puppet_major_version and puppetdb_major_version variables. clean up puppet master/db hieradata - https://phabricator.wikimedia.org/T190318#4080753 (10herron) [14:31:40] ( This is the patch in question: https://gerrit.wikimedia.org/r/#/c/420315/ ) [14:32:32] (03PS2) 10Elukey: profile::restbase: add sysctl settings to improve tcp performance [puppet] - 10https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213) [14:32:34] (03PS1) 10Giuseppe Lavagetto: profile::etcd:v3: fix peer port [puppet] - 10https://gerrit.wikimedia.org/r/421907 [14:32:58] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::etcd:v3: fix peer port [puppet] - 10https://gerrit.wikimedia.org/r/421907 (owner: 10Giuseppe Lavagetto) [14:34:18] (03PS5) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-hhvm-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419447 (https://phabricator.wikimedia.org/T135991) [14:35:37] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for prometheus-hhvm-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419447 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:36:07] 10Operations, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#4080780 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['conf1006.eqiad.wmnet'... 
[14:37:19] (03PS3) 10Vgutierrez: prometheus: calculate varnish requests daily/weekly averages [puppet] - 10https://gerrit.wikimedia.org/r/421505 (https://phabricator.wikimedia.org/T184942) [14:38:24] (03CR) 10Vgutierrez: prometheus: calculate varnish requests daily/weekly averages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421505 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [14:44:56] PROBLEM - puppet last run on lvs5002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:46:11] (03PS1) 10Ottomata: Add prometheus::jmx_exporter_config for main -> jumbo MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/421911 (https://phabricator.wikimedia.org/T189464) [14:47:04] (03CR) 10Elukey: [C: 031] Add prometheus::jmx_exporter_config for main -> jumbo MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/421911 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [14:47:39] (03CR) 10Ottomata: [C: 032] Add prometheus::jmx_exporter_config for main -> jumbo MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/421911 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [14:48:05] (03CR) 10Filippo Giunchedi: [C: 031] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/421505 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [14:49:16] (03PS1) 10Giuseppe Lavagetto: etcd::v3::monitoring: actually monitor the endpoint from NRPE [puppet] - 10https://gerrit.wikimedia.org/r/421912 [14:49:51] (03PS2) 10Giuseppe Lavagetto: etcd::v3::monitoring: actually monitor the endpoint from NRPE [puppet] - 10https://gerrit.wikimedia.org/r/421912 [14:49:56] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 41928272 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:50:48] (03CR) 10Vgutierrez: [C: 032] prometheus: calculate varnish requests daily/weekly averages [puppet] - 10https://gerrit.wikimedia.org/r/421505 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [14:50:58] (03PS4) 10Vgutierrez: prometheus: calculate varnish requests daily/weekly averages [puppet] - 10https://gerrit.wikimedia.org/r/421505 (https://phabricator.wikimedia.org/T184942) [14:51:16] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 51711948 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:51:56] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 6241 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:52:55] (03CR) 10Alexandros Kosiaris: [C: 031] icinga/mobileapps: add mobileapp contacts to service [puppet] - 10https://gerrit.wikimedia.org/r/421676 (https://phabricator.wikimedia.org/T189524) (owner: 10Dzahn) [14:54:12] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd::v3::monitoring: actually monitor the endpoint from NRPE [puppet] - 10https://gerrit.wikimedia.org/r/421912 (owner: 10Giuseppe Lavagetto) [14:54:16] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 4386 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:54:17] (03PS3) 10Giuseppe Lavagetto: etcd::v3::monitoring: actually monitor the endpoint from NRPE 
[puppet] - 10https://gerrit.wikimedia.org/r/421912 [14:54:19] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] etcd::v3::monitoring: actually monitor the endpoint from NRPE [puppet] - 10https://gerrit.wikimedia.org/r/421912 (owner: 10Giuseppe Lavagetto) [14:54:50] <_joe_> vgutierrez: can I merge your change? [14:54:55] i was going to ask the same [14:55:00] _joe_: go ahead .) [14:58:00] 10Operations: Extend dpkg Icinga check to also check for inconsitent apt state - https://phabricator.wikimedia.org/T190693#4080876 (10MoritzMuehlenhoff) [14:58:16] 10Operations, 10Icinga, 10monitoring: Extend dpkg Icinga check to also check for inconsitent apt state - https://phabricator.wikimedia.org/T190693#4080886 (10MoritzMuehlenhoff) p:05Triage>03Normal [15:02:24] <_joe_> heh sorry, win 19 [15:02:28] <_joe_> argh [15:05:37] 10Operations, 10Icinga, 10monitoring: Extend dpkg Icinga check to also check for inconsistent apt state - https://phabricator.wikimedia.org/T190693#4080897 (10MoritzMuehlenhoff) [15:07:23] (03PS1) 10Filippo Giunchedi: Revert "hieradata: use puppetmaster2001 as ca_server" [puppet] - 10https://gerrit.wikimedia.org/r/421917 (https://phabricator.wikimedia.org/T189891) [15:07:55] (03CR) 10Filippo Giunchedi: [C: 04-1] "DNM" [puppet] - 10https://gerrit.wikimedia.org/r/421917 (https://phabricator.wikimedia.org/T189891) (owner: 10Filippo Giunchedi) [15:08:27] (03PS4) 10Muehlenhoff: Don't include mediawiki fonts list in toollabs::exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/421515 (https://phabricator.wikimedia.org/T190135) [15:09:06] (03CR) 10Muehlenhoff: [C: 032] Don't include mediawiki fonts list in toollabs::exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/421515 (https://phabricator.wikimedia.org/T190135) (owner: 10Muehlenhoff) [15:09:26] RECOVERY - eventlogging_sync processes on db1108 is OK: PROCS OK: 1 process with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [15:10:29] 10Operations, 10Release Pipeline, 
10Release-Engineering-Team (Watching / External): Update Debian package of Blubber - https://phabricator.wikimedia.org/T190551#4080909 (10akosiaris) 05Open>03Resolved a:03akosiaris Package built and uploaded to `stretch-wikimedia` and `jessie-wikimedia`. Resolving this... [15:10:56] RECOVERY - Check status of defined EventLogging jobs on eventlog1002 is OK: OK: All defined EventLogging jobs are runnning. [15:11:15] (03PS1) 10Filippo Giunchedi: Revert "Move config-master to codfw" [dns] - 10https://gerrit.wikimedia.org/r/421918 (https://phabricator.wikimedia.org/T184562) [15:11:51] (03PS2) 10BBlack: eqsin: turn-up ID, MY, VN, NC [dns] - 10https://gerrit.wikimedia.org/r/421855 (https://phabricator.wikimedia.org/T189252) [15:12:05] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [15:12:12] (03PS1) 10Filippo Giunchedi: Revert "cache: depool puppetmaster1001 from config-master.w.o" [puppet] - 10https://gerrit.wikimedia.org/r/421919 (https://phabricator.wikimedia.org/T184562) [15:12:39] (03CR) 10Mobrovac: [C: 031] "I think these are good values for a first iteration" [puppet] - 10https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213) (owner: 10Elukey) [15:12:41] (03CR) 10BBlack: [C: 032] eqsin: turn-up ID, MY, VN, NC [dns] - 10https://gerrit.wikimedia.org/r/421855 (https://phabricator.wikimedia.org/T189252) (owner: 10BBlack) [15:14:55] RECOVERY - puppet last run on lvs5002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:17:58] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4080917 (10ayounsi) >>! 
In T189252#4079025, @Liuxinyu970226 wrote: > What about Antarctica (AQ)? In addition, Antarc... [15:18:16] 10Operations, 10Ops-Access-Requests, 10Ops-Access-Reviews, 10Patch-For-Review: Requesting access to terbium/maintenance-log-readers for bmansurov - https://phabricator.wikimedia.org/T189285#4080918 (10RobH) [15:23:22] (03PS1) 10Milimetric: Add interlanguage reportupdater job [puppet] - 10https://gerrit.wikimedia.org/r/421920 (https://phabricator.wikimedia.org/T158835) [15:26:14] (03PS1) 10Gilles: Add performance perception QuickSurvey definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421921 (https://phabricator.wikimedia.org/T187299) [15:27:36] (03CR) 10jerkins-bot: [V: 04-1] Add performance perception QuickSurvey definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421921 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [15:27:55] 10Operations, 10Gerrit: git pull fails for MW core with "fatal: protocol error: bad pack header" when local branches point to remote branches that no more exist - https://phabricator.wikimedia.org/T190676#4080932 (10Aklapper) [15:28:44] (03PS2) 10Gilles: Add performance perception QuickSurvey definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421921 (https://phabricator.wikimedia.org/T187299) [15:29:25] (03CR) 10Elukey: cassandra: upgrade version 2.2 package settings for aqs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) (owner: 10Elukey) [15:29:56] (03CR) 10jerkins-bot: [V: 04-1] Add performance perception QuickSurvey definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421921 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [15:31:30] (03PS3) 10Gilles: Add performance perception QuickSurvey definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421921 (https://phabricator.wikimedia.org/T187299) [15:32:04] (03PS5) 10Elukey: cassandra: upgrade version 2.2 package settings for aqs 
[puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) [15:36:15] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [15:38:24] (03CR) 10Elukey: "New pcc https://puppet-compiler.wmflabs.org/compiler03/10674/" [puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) (owner: 10Elukey) [15:39:12] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4080943 (10Vgutierrez) Current status: * varnishxcps **ready to be removed** (https://gerrit.wikimedia.org/r/421338) * varnishxcache is currently being u... [15:41:28] 10Operations, 10Gerrit: git pull fails for MW core with "fatal: protocol error: bad pack header" when local branches point to remote branches that no more exist - https://phabricator.wikimedia.org/T190676#4080949 (10demon) This also isn't a Gerrit bug. Can easily replicate with Github or any other git server.... [15:43:59] 10Operations, 10Gerrit: git pull fails for MW core with "fatal: protocol error: bad pack header" when local branches point to remote branches that no more exist - https://phabricator.wikimedia.org/T190676#4080953 (10demon) >>! In T190676#4080498, @Paladox wrote: > You could also do "git config remote.origin.pr... [15:48:21] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be03 - https://phabricator.wikimedia.org/T190683#4080501 (10Joe) @MarcoAurelio if you look at the code: ``` define swift::init_device($partition_nr='1') { if ($title !~ /^\/dev\/([hvs]d[a-z]+|md[0-9]+)$/) {... 
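[editor's note] The swift::init_device snippet quoted in the T190683 comment above tests the resource title against a device-path regex. What follows is a minimal sketch of that pattern only, not the Puppet code itself. It assumes Python's `re` module treats this simple expression the same way Puppet's regex engine does for single-line titles. The helper name `is_valid_device_title` is hypothetical, and the body of the elided `{...` branch is not reproduced or guessed at.

```python
import re

# Pattern copied from the swift::init_device snippet quoted in the log.
# It accepts whole-disk names (hdX, vdX, sdX) and md arrays, nothing else.
DEVICE_TITLE = re.compile(r"^/dev/([hvs]d[a-z]+|md[0-9]+)$")


def is_valid_device_title(title: str) -> bool:
    """Hypothetical helper: True when the title matches the quoted pattern.

    In the Puppet snippet the test is negated (!~), so titles for which
    this returns False are the ones that enter the elided branch.
    """
    return DEVICE_TITLE.match(title) is not None


if __name__ == "__main__":
    for candidate in ("/dev/sdb", "/dev/md0", "/dev/sdb1", "/dev/nvme0n1"):
        print(candidate, is_valid_device_title(candidate))
```

Partition names such as /dev/sdb1 and NVMe names such as /dev/nvme0n1 do not match, because the pattern allows only letters after the `[hvs]d` prefix and only digits after `md`.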
[15:48:36] (03PS1) 10Volans: Puppetboard: disable query tab, enable catalog one [puppet] - 10https://gerrit.wikimedia.org/r/421924 (https://phabricator.wikimedia.org/T184563) [15:50:35] (03PS1) 10Vgutierrez: varnish: Remove varnishxcache python daemon [puppet] - 10https://gerrit.wikimedia.org/r/421925 (https://phabricator.wikimedia.org/T184942) [15:51:04] (03CR) 10jerkins-bot: [V: 04-1] varnish: Remove varnishxcache python daemon [puppet] - 10https://gerrit.wikimedia.org/r/421925 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [15:52:47] (03PS11) 10Rduran: Add port of osc_host.sh [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 [15:52:49] (03PS5) 10Rduran: Create tests skeleton [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/420746 [15:52:51] (03PS4) 10Rduran: [WIP] Refactor and test the main OSC run method [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/421340 [15:53:23] (03PS2) 10Vgutierrez: varnish: Remove varnishxcache python daemon [puppet] - 10https://gerrit.wikimedia.org/r/421925 (https://phabricator.wikimedia.org/T184942) [15:53:39] (03CR) 10Volans: [C: 032] Puppetboard: disable query tab, enable catalog one [puppet] - 10https://gerrit.wikimedia.org/r/421924 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [15:53:52] (03CR) 10jerkins-bot: [V: 04-1] varnish: Remove varnishxcache python daemon [puppet] - 10https://gerrit.wikimedia.org/r/421925 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [15:55:21] (03CR) 10Volans: "removed -2, related patch already merged and tested" [dns] - 10https://gerrit.wikimedia.org/r/419800 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [15:55:25] (03PS3) 10Volans: Add puppetboard.wikimedia.org entry [dns] - 10https://gerrit.wikimedia.org/r/419800 (https://phabricator.wikimedia.org/T184563) [15:55:30] (03PS3) 10Vgutierrez: varnish: Remove varnishxcache python daemon [puppet] - 10https://gerrit.wikimedia.org/r/421925 
(https://phabricator.wikimedia.org/T184942) [15:55:41] (03PS1) 10Muehlenhoff: mediawiki::packages::fonts: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/421930 [15:57:11] (03PS1) 10Jcrespo: dump_section.py: Allow manual run of the backup [puppet] - 10https://gerrit.wikimedia.org/r/421931 [15:57:48] (03CR) 10jerkins-bot: [V: 04-1] dump_section.py: Allow manual run of the backup [puppet] - 10https://gerrit.wikimedia.org/r/421931 (owner: 10Jcrespo) [15:57:59] (03CR) 10Jcrespo: [C: 04-1] "Needs more fixes, do not deploy yet." [puppet] - 10https://gerrit.wikimedia.org/r/421931 (owner: 10Jcrespo) [15:59:40] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler03/10676/" [puppet] - 10https://gerrit.wikimedia.org/r/421925 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [16:00:16] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [16:01:05] who is in charge of elastic this week? [16:01:19] I think we can ping dcausse ? [16:01:35] elukey: thanks [16:01:37] latency issue is kind of expected because of the reindex [16:01:42] ok, perfect [16:01:43] there you go! :) [16:01:46] thanks dcausse [16:01:49] we were about to go into a meeting [16:02:01] and didn't want to have a 1 h outage :-) [16:04:07] (03PS5) 10Rduran: [WIP] Refactor and test the main OSC run method [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/421340 [16:04:29] hashar: is it normal that a test-prio is waiting since 8 minutes queued? 
[16:06:36] (03PS1) 10Imarlier: [WIP] coal: be smarter about consuming from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) [16:07:07] (03CR) 10jerkins-bot: [V: 04-1] [WIP] coal: be smarter about consuming from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [16:08:03] (03CR) 10Volans: [C: 032] Add puppetboard.wikimedia.org entry [dns] - 10https://gerrit.wikimedia.org/r/419800 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [16:08:45] (03PS3) 10Gilles: Upgrade to 1.16 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/419172 (https://phabricator.wikimedia.org/T186528) [16:15:40] (03PS1) 10Alexandros Kosiaris: kubernetes: Allow a few clients to reach staging API [puppet] - 10https://gerrit.wikimedia.org/r/421935 (https://phabricator.wikimedia.org/T194924) [16:18:27] volans: too many changes / jobs running i guess [16:18:38] it was the only job in test-prio [16:18:41] has completed now [16:19:00] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Update helm repository with version 0.0.2 of mathoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/421895 (owner: 10Alexandros Kosiaris) [16:19:02] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes: Allow a few clients to reach staging API [puppet] - 10https://gerrit.wikimedia.org/r/421935 (https://phabricator.wikimedia.org/T194924) (owner: 10Alexandros Kosiaris) [16:19:30] volans: and test-prio has the same precedence as gate-and-submit [16:19:46] volans: with both sharing the same pool of Docker executors [16:19:46] ah ok [16:19:53] so yeah that can be blocked from time to time [16:20:22] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Utilize the deployment pipeline (stretch) - https://phabricator.wikimedia.org/T184924#4081143 (10akosiaris) https://gerrit.wikimedia.org/r/#/c/421935/ for allowing access to staging related clients [16:21:26] (03CR) 10Elukey: [WIP] 
coal: be smarter about consuming from Kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [16:27:20] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4081159 (10Papaul) [16:34:26] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [16:34:52] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4081223 (10Papaul) @rob on asw-b3-codfw any port from ge-3/0/20 up needs to be removed and disabled from the switch. on asw-b4-codfw any port from ge-4/... [16:35:17] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4081227 (10Papaul) [16:38:07] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Release-Engineering-Team (Kanban), 10Zuul: Upload new zuul and jenkins-debian-glue packages to apt.wikimedia.org - https://phabricator.wikimedia.org/T186786#4081254 (10hashar) [16:38:18] (03PS12) 10Rduran: Add port of osc_host.sh [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 [16:38:19] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.23 (duration: 05m 03s) [16:38:20] (03PS6) 10Rduran: Create tests skeleton [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/420746 [16:38:22] (03PS6) 10Rduran: [WIP] Refactor and test the main OSC run method [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/421340 [16:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:03] mobrovac: yes it is indeed. 
But we can really easily increase them. helm upgrade --set resource.replicas=X [16:40:16] looks like we don't need to yet though [16:41:31] 10Operations, 10netops: eqiad 10G ports needs - https://phabricator.wikimedia.org/T190364#4081276 (10jcrespo) [16:42:51] <_joe_> akosiaris: we're not running service-checker in any ways on the k8s nodes, right? [16:43:03] <_joe_> s/nodes/pods/ [16:43:08] 10Operations, 10ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#4081282 (10akosiaris) 05Open>03stalled ~2 weeks with no incident yet. That's very encouraging but we 've been in that position again around the new years holidays. Given that easte... [16:43:28] (03PS1) 10BBlack: Revert "varnish: restart backends every 3.5 days" [puppet] - 10https://gerrit.wikimedia.org/r/421943 [16:43:36] (03PS2) 10BBlack: Revert "varnish: restart backends every 3.5 days" [puppet] - 10https://gerrit.wikimedia.org/r/421943 [16:43:49] _joe_: no we are not. That's one piece of functionality we lost and we need to gain it back IMHO [16:44:10] it does look like kubelet can do it btw [16:44:43] livenessProbe: [16:44:43] exec: [16:44:43] command: [16:44:43] - cat [16:44:45] 10Operations, 10monitoring, 10Patch-For-Review: Many "NRPE: Unable to read output" from "long running screen/tmux" checks in icinga - https://phabricator.wikimedia.org/T187528#4081289 (10Dzahn) 05Open>03Resolved a:03Dzahn calling resolved since it was called fixed in meeting :) if not, please just reopen [16:44:45] <_joe_> well we do monitor the lvs endpoint [16:44:45] etc etc etc [16:44:50] https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/ [16:45:01] <_joe_> I'm not sure I want to use it as a readiness probe [16:45:05] <_joe_> liveness, maybe [16:45:13] <_joe_> but we can discuss it [16:45:14] actually readiness, not liveness :P [16:45:33] <_joe_> no, I said it right, but I'll explain tomorrow [16:45:33] 10Operations, 10netops: 
eqiad 10G ports needs - https://phabricator.wikimedia.org/T190364#4071230 (10jcrespo) I've clarified the database/backups provisioning service, so that it can comfortably recover in an emergency multiple databases at the same time, in case of catastrophic failure to reduce TTR, but also... [16:45:34] <_joe_> :) [16:45:48] liveness ? so if it fails we kill the pod ? [16:45:52] <_joe_> no [16:46:00] that's what liveness does [16:46:15] <_joe_> liveness as in if it's a warning (so data changes) do not kill [16:46:20] <_joe_> kill on timeout though [16:46:25] <_joe_> but it can be risky [16:46:46] <_joe_> anwyays, we still run service-checker on the LVS endpoint [16:47:21] <_joe_> I just want to re-add the metrics reporting to statsd https://grafana.wikimedia.org/dashboard/db/service-endpoint-performance?orgId=1&var-application=mathoid&var-server=All [17:00:05] gehel: That opportune time is upon us again. Time for a Wikidata Query Service weekly deploy deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180326T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:10:49] 10Operations, 10Domains, 10Traffic, 10Wikimedia-Apache-configuration: en-wp.org certificate error - https://phabricator.wikimedia.org/T190244#4081459 (10Dzahn) [17:10:55] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#4081460 (10Dzahn) [17:13:00] 10Operations, 10Domains, 10Traffic, 10Wikimedia-Apache-configuration: en-wp.org certificate error - https://phabricator.wikimedia.org/T190244#4067490 (10Dzahn) This is blocked on T133548 (technical) and T101048 (policy). Yes, it's not just this one domain name. Though one of the tickets is about deciding... 
[17:16:47] (03PS1) 10Chad: Swap mediawiki.org to use standard docroot naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/421949 [17:23:35] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:24:03] (03PS1) 10Volans: Puppetboard: add HTTP monitoring [puppet] - 10https://gerrit.wikimedia.org/r/421952 (https://phabricator.wikimedia.org/T184563) [17:24:42] * ebernhardson should make it reindex slower next time ... [17:24:56] 10Operations, 10Traffic, 10Performance: Resources and pages occasionally take seconds to respond or fail - https://phabricator.wikimedia.org/T189085#4081498 (10BBlack) [17:24:59] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#4081495 (10BBlack) [17:25:04] 10Operations, 10Traffic, 10Patch-For-Review: varnish-be: rate of accepted sessions keeps on increasing - https://phabricator.wikimedia.org/T189892#4081499 (10BBlack) [17:26:10] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp2006.codfw.wmnet [17:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:19] 10Operations, 10DBA, 10Goal: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081506 (10jcrespo) p:05Triage>03Normal [17:26:25] ebernhardson: I guess it's not tunable on the fly... 
:) [17:26:27] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp2010.codfw.wmnet [17:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:46] (03CR) 10Volans: [C: 032] Puppetboard: add HTTP monitoring [puppet] - 10https://gerrit.wikimedia.org/r/421952 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [17:26:50] 10Operations, 10DBA, 10Goal: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081518 (10jcrespo) p:05Normal>03High [17:26:55] 10Operations, 10ops-codfw, 10Traffic: cp2006, cp2010: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4076372 (10BBlack) Depooled both today, we should do that in general as these arise. [17:27:00] volans: not really. Also whats happening right now is the internal reindex is running (5 days now) and overlapping with another ingestion process from hadoop -> elasticsearch [17:27:14] got it [17:28:07] 10Operations, 10DBA, 10Goal: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081506 (10jcrespo) [17:28:10] 10Operations, 10Traffic, 10Patch-For-Review: varnish: discard cold vcl - https://phabricator.wikimedia.org/T187778#4081531 (10BBlack) 05Open>03Resolved a:03BBlack This was fixed in https://gerrit.wikimedia.org/r/#/c/420432/ about the broader issues (which we still have, because some VCLs never go cold,... 
[17:30:44] (03PS4) 10RobH: admin: contint-admins to restart Jenkins via systemd [puppet] - 10https://gerrit.wikimedia.org/r/408555 (https://phabricator.wikimedia.org/T190277) (owner: 10Hashar) [17:31:06] 10Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): contint-admins sudo for service jenkins - https://phabricator.wikimedia.org/T190277#4081546 (10RobH) This was approved in today's SRE meeting. I'll rebase a... [17:31:08] hashar: ^ im rebasing and submitting your patch [17:31:31] (03CR) 10RobH: [C: 032] admin: contint-admins to restart Jenkins via systemd [puppet] - 10https://gerrit.wikimedia.org/r/408555 (https://phabricator.wikimedia.org/T190277) (owner: 10Hashar) [17:32:36] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:32:51] (03CR) 10Paladox: [C: 031] Gerrit 2.14.7-9-g0f04397dbd, plus some plugins [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/421463 (owner: 10Chad) [17:32:58] 10Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): contint-admins sudo for service jenkins - https://phabricator.wikimedia.org/T190277#4081552 (10RobH) 05Open>03Resolved a:03RobH This is now live on the... [17:33:14] (03PS5) 10RobH: admin: Grant bmansurov access to terbium.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/419387 (https://phabricator.wikimedia.org/T189285) (owner: 10Vgutierrez) [17:33:31] 10Operations, 10Traffic, 10Patch-For-Review: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932#4081555 (10BBlack) > The 'varnish mailbox lag' icinga alerts as implemented in the parent task have been going CRITICAL for a while and in some cases result in 503s spikes... 
[17:34:01] 10Operations, 10Traffic, 10Patch-For-Review: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932#4081560 (10BBlack) [17:34:03] (03CR) 10RobH: [C: 032] admin: Grant bmansurov access to terbium.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/419387 (https://phabricator.wikimedia.org/T189285) (owner: 10Vgutierrez) [17:34:06] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#4081558 (10BBlack) [17:34:24] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) (owner: 10Elukey) [17:35:05] 10Operations, 10Ops-Access-Requests, 10Ops-Access-Reviews, 10Patch-For-Review: Requesting access to terbium/maintenance-log-readers for bmansurov - https://phabricator.wikimedia.org/T189285#4081566 (10RobH) 05stalled>03Resolved a:03RobH @bmansurov: Your access to terbium (via maintenance-log-readers... [17:35:22] 10Operations, 10Ops-Access-Requests, 10Ops-Access-Reviews, 10Patch-For-Review: Requesting access to terbium/maintenance-log-readers for bmansurov - https://phabricator.wikimedia.org/T189285#4081570 (10RobH) a:05RobH>03None [17:35:53] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4081572 (10Vgutierrez) Blocked (still in use) varnish cachestats daemons: * varnishstatsd * varnishrls * varnishreqstats Folks, we need your help to mov... 
[17:36:10] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#4081575 (10Vgutierrez) [17:36:13] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4081574 (10Vgutierrez) 05Open>03stalled [17:38:05] 10Operations, 10Ops-Access-Requests: Requesting deployment access for samwilson - https://phabricator.wikimedia.org/T189414#4081578 (10RobH) This was approved in the SRE meeting. @Samwilson: I'll go ahead and prepare the access patches, but I wasnt sure if your wikitech account (and thus the UID we use) was... [17:42:31] 10Operations, 10DBA, 10Goal: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081606 (10jcrespo) p:05High>03Normal [17:42:39] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:52:27] (03PS1) 10Volans: Icinga: fix check_http_unauthorized definition [puppet] - 10https://gerrit.wikimedia.org/r/421958 [17:52:55] (03CR) 10Jdlrobson: Add performance perception QuickSurvey definition (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421921 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [17:54:24] (03CR) 10Volans: [C: 032] Icinga: fix check_http_unauthorized definition [puppet] - 10https://gerrit.wikimedia.org/r/421958 (owner: 10Volans) [17:59:09] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#4081674 (10BBlack) I've pulled together a few other related open tasks that belong here. It seems fairly certa... 
[17:59:54] elukey: RE coal, true that webperf/navtiming consumes a different Kafka (although I thought it was switched last week?), but Ian actually tried running the new coal from graph01 from home dir on hafnium as-is except with different consgroup and it didn’t exhibit the Problem there [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180326T1800). [18:00:04] raynor: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:25] o/ hello [18:01:52] Krinkle: heyaa, elukey is gone for the day, reading your patch, missed whatever yall were talking about earlier, but ian was talking to me about it on friday [18:03:24] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4081704 (10Pchelolo) - API Summary dashboard - this one uses `varnish.$dc.backends.be_{backend}` metric and relies on the actual backend being a part of... [18:04:22] who can SWAT? [18:06:59] raynor: I can, in 10 minutes. [18:07:05] awesome, thx [18:07:09] I'll ping you when ready. 
[18:07:44] (03PS2) 10Niharika29: Enable mobile-only Mediawiki:MainPageCss styles for Hindi wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421867 (https://phabricator.wikimedia.org/T190101) (owner: 10Pmiazga) [18:10:39] (03CR) 10Niharika29: [C: 032] Enable mobile-only Mediawiki:MainPageCss styles for Hindi wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421867 (https://phabricator.wikimedia.org/T190101) (owner: 10Pmiazga) [18:11:56] (03Merged) 10jenkins-bot: Enable mobile-only Mediawiki:MainPageCss styles for Hindi wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421867 (https://phabricator.wikimedia.org/T190101) (owner: 10Pmiazga) [18:20:22] so.. comparing tin and deploy1001.. after letting puppet do it's thing.. we get a /srv of 31G on deploy1001 [18:20:32] yet on tin it is 43G [18:20:44] should we now rsync the entire /srv for the remainder? [18:21:19] (03CR) 10jenkins-bot: Enable mobile-only Mediawiki:MainPageCss styles for Hindi wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421867 (https://phabricator.wikimedia.org/T190101) (owner: 10Pmiazga) [18:21:44] raynor: Can you test your changes? [18:21:51] They're on mwdebug1002. [18:21:52] Niharika: yes [18:21:57] I'm on it, thanks [18:22:03] I need ~10mins [18:22:25] Take your time. You're the only client today. [18:25:30] Niharika: ok, tests went pretty smooth, everything is working. Please deploy to production [18:25:31] thank you [18:25:48] raynor: On it. [18:28:00] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Enable mobile-only Mediawiki:MainPageCss styles for Hindi wiki T190101 (duration: 00m 58s) [18:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:07] T190101: Apply styles to mobile only for Hindi campaign - https://phabricator.wikimedia.org/T190101 [18:28:31] raynor: Done. [18:28:37] SWAT over. 
[18:28:43] thank you [18:29:12] (03PS1) 10Ottomata: Install python protobuf package on stat and notebook hosts for python facets [puppet] - 10https://gerrit.wikimedia.org/r/421963 [18:29:57] (03PS2) 10Ottomata: Install python protobuf package on stat and notebook hosts for python facets [puppet] - 10https://gerrit.wikimedia.org/r/421963 [18:30:26] 10Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): contint-admins sudo for service jenkins - https://phabricator.wikimedia.org/T190277#4081805 (10hashar) Looks good. Thank you everyone :] [18:35:54] 10Operations, 10Puppet, 10Goal, 10Patch-For-Review: Modernize Puppet Configuration Management (2017-18 Q3 Goal) - https://phabricator.wikimedia.org/T184561#4081825 (10Volans) [18:35:57] 10Operations, 10Puppet, 10Patch-For-Review: Investigate landscape of PuppetDB Frontends and Provision One - https://phabricator.wikimedia.org/T184563#4081823 (10Volans) 05Open>03Resolved Puppetboard is now reachable via https://puppetboard.wikimedia.org (LDAP auth), resolving. [18:36:21] 10Operations, 10Puppet, 10Goal, 10Patch-For-Review: Modernize Puppet Configuration Management (2017-18 Q3 Goal) - https://phabricator.wikimedia.org/T184561#3888176 (10Volans) [18:37:52] 10Operations, 10netops: Security audit for tftp on install1001 - https://phabricator.wikimedia.org/T122210#4081830 (10ayounsi) Things must have changed since 2015. @Dzahn @Andrew Running `install1002:~$ sudo iptables -L -v -n | grep "udp dpt:69"` I see plenty of ACLs. I think next step is to audit that list,... 
[18:39:01] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:40:43] (03PS1) 10RobH: adding Pam Drouin to ldap user section [puppet] - 10https://gerrit.wikimedia.org/r/421965 (https://phabricator.wikimedia.org/T190711) [18:41:03] (03CR) 10RobH: [C: 032] adding Pam Drouin to ldap user section [puppet] - 10https://gerrit.wikimedia.org/r/421965 (https://phabricator.wikimedia.org/T190711) (owner: 10RobH) [18:44:35] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4081850 (10Ottomata) > Reqstats-otto dashboard I haven't looked at that dashboard in years, and I doubt anyone else has either. Deleted. :) [18:46:41] !log mobrovac@tin Started deploy [restbase/deploy@908febb]: Add tag descriptions for citations and recommendations [18:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:21] (03CR) 10Ottomata: [C: 032] Install python protobuf package on stat and notebook hosts for python facets [puppet] - 10https://gerrit.wikimedia.org/r/421963 (owner: 10Ottomata) [18:53:06] (03PS2) 10RobH: adding Pam Drouin to ldap user section [puppet] - 10https://gerrit.wikimedia.org/r/421965 (https://phabricator.wikimedia.org/T190711) [18:56:35] (03PS1) 10Dduvall: ci: Refactor pipeline deps using separate CI role [puppet] - 10https://gerrit.wikimedia.org/r/421973 (https://phabricator.wikimedia.org/T188936) [19:02:51] 10Operations, 10netops: eqiad 10G ports needs - https://phabricator.wikimedia.org/T190364#4081897 (10RobH) p:05Triage>03Normal I'm simply trying to reduce our number of 'needs triage' tasks in #operations. This seems to be an issue that is either normal, or higher priority. Due to the timeline of the 10G... 
[19:05:01] 10Operations, 10Domains, 10Traffic, 10Wikimedia-Apache-configuration: en-wp.org certificate error - https://phabricator.wikimedia.org/T190244#4081905 (10RobH) p:05Triage>03Normal As SRE Clinic Duty person this week, I'm setting this to normal priority. The items blocking it are also normal priority, a... [19:05:40] 10Operations, 10hardware-requests, 10Patch-For-Review: Reclaim/Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566#4081908 (10RobH) p:05Triage>03Normal a:03RobH [19:06:09] !log mobrovac@tin Finished deploy [restbase/deploy@908febb]: Add tag descriptions for citations and recommendations (duration: 19m 28s) [19:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:01] !log mobrovac@tin Started deploy [restbase/deploy@908febb]: Add tag descriptions for citations and recommendations [19:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:34] 10Operations, 10Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947#4081935 (10kaldari) @MoritzMuehlenhoff: Any thoughts about the possibility of upgrading to librsvg 2.40.19 (or 2.40.20)? I know there's also... [19:12:51] !log mobrovac@tin Finished deploy [restbase/deploy@908febb]: Add tag descriptions for citations and recommendations (duration: 03m 49s) [19:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:39] 10Operations, 10Traffic: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4081937 (10Ragesoss) The behavior change last week was not limited to JSON pages. My app uses the mediawiki ruby API gem's `get_wikitext` [[https://github.c... 
[19:13:48] !log mobrovac@tin Started deploy [restbase/deploy@908febb]: Add tag descriptions for citations and recommendations [19:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:51] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python3-protobuf] [19:17:17] !log mobrovac@tin Finished deploy [restbase/deploy@908febb]: Add tag descriptions for citations and recommendations (duration: 03m 29s) [19:17:21] !log mobrovac@tin Started deploy [restbase/deploy@908febb]: Add tag descriptions for citations and recommendations [19:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:12] PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python3-protobuf] [19:21:36] !log mobrovac@tin Finished deploy [restbase/deploy@908febb]: Add tag descriptions for citations and recommendations (duration: 04m 16s) [19:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:41] !log mobrovac@tin Started deploy [restbase/deploy@908febb]: Add tag descriptions for citations and recommendations [19:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:03] !log mobrovac@tin Finished deploy [restbase/deploy@908febb]: Add tag descriptions for citations and recommendations (duration: 01m 22s) [19:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:47] 10Operations, 10Analytics-Kanban, 10Patch-For-Review: Review changes to /etc/java-8-openjdk/security/java.security in Kafka from u162 update - https://phabricator.wikimedia.org/T190400#4082018 (10Ottomata) p:05Triage>03Normal a:03Ottomata [19:31:21] 
10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: pybal 1.15.2 dies with obscure errors without python-prometheus-client - https://phabricator.wikimedia.org/T190527#4082022 (10RobH) p:05Triage>03Low Setting this to normal priority as part of SRE clinic duty. After IRC discussion with @jgreen, it i... [19:36:15] (03PS1) 10Chad: Disable new project listener binding until implementation is done [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/421979 [19:36:17] (03PS1) 10Chad: Add some extra things to our top menu in Gerrit [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/421980 (https://phabricator.wikimedia.org/T55433) [19:41:03] (03CR) 10Chad: [V: 032 C: 032] Add some extra things to our top menu in Gerrit [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/421980 (https://phabricator.wikimedia.org/T55433) (owner: 10Chad) [19:41:05] (03CR) 10Chad: [V: 032 C: 032] Disable new project listener binding until implementation is done [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/421979 (owner: 10Chad) [19:43:02] (03CR) 10Gilles: Add performance perception QuickSurvey definition (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421921 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [19:43:21] 10Operations, 10WMDE-QWERTY-Team-Board: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4082061 (10Lea_WMDE) @MoritzMuehlenhoff how much time in advance is needed before deploying to production for deployment prep? And when would it be possible to do so, assum... [19:43:25] 10Operations, 10DC-Ops, 10monitoring, 10User-fgiunchedi: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4082063 (10BBlack) Found a couple more via reboots: T190540. Not good that we're having uncorrected memory errors go unreported/alerted... 
[19:44:45] 10Operations, 10WMDE-QWERTY-Team-Board: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4082066 (10Lea_WMDE) [19:47:18] (03PS1) 10Ori.livneh: coal: add systemd watchdog notifier; set WatchdogSec=60 [puppet] - 10https://gerrit.wikimedia.org/r/421981 [19:48:41] 10Operations, 10DNS, 10Mail, 10Traffic, 10Patch-For-Review: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4029896 (10RobH) >>! In T189065#4036314, @gerritbot wrote: > Change 417350 had a related patch set uploaded (by Herron; owner: Herron): > [operations/dns@master]... [19:48:42] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:52:15] !log andrew@tin Started deploy [horizon/deploy@99153e4]: Rolling out fix for security groups, 421983 [19:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:51] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4082111 (10Papaul) a:05Papaul>03None @RobH if you have time can you do the switch port session. When finished assign back to me so i can finish the m... [19:54:09] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4082113 (10Papaul) a:03RobH [19:55:27] !log andrew@tin Finished deploy [horizon/deploy@99153e4]: Rolling out fix for security groups, 421983 (duration: 03m 12s) [19:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180326T2000). 
[20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:02:32] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4082144 (10RobH) a:05RobH>03Papaul So none of those interfaces had descriptions set. I had to add them all into the config (though they were in use b... [20:02:45] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4082146 (10RobH) [20:03:06] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4031698 (10RobH) [20:07:19] 10Operations, 10DNS, 10Mail, 10Traffic, 10Patch-For-Review: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4082180 (10herron) > Is there any reason we cannot merge this in advance of the greenhouse.io settings change? The change is immature as-is, unfortunately. Bef... 
[20:11:16] 10Operations, 10Office-IT: Create @wikimedia.org e-mail that just discards things sent to it - https://phabricator.wikimedia.org/T190719#4082200 (10demon) p:05Triage>03Normal [20:13:42] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:14:44] (03PS2) 10Andrew Bogott: toolforge: Explictly allow wss to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/421804 (https://phabricator.wikimedia.org/T130748) (owner: 10BryanDavis) [20:15:46] (03CR) 10Andrew Bogott: [C: 032] toolforge: Explictly allow wss to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/421804 (https://phabricator.wikimedia.org/T130748) (owner: 10BryanDavis) [20:18:27] (03CR) 10Ottomata: [C: 032] Add interlanguage reportupdater job [puppet] - 10https://gerrit.wikimedia.org/r/421920 (https://phabricator.wikimedia.org/T158835) (owner: 10Milimetric) [20:18:32] (03PS2) 10Ottomata: Add interlanguage reportupdater job [puppet] - 10https://gerrit.wikimedia.org/r/421920 (https://phabricator.wikimedia.org/T158835) (owner: 10Milimetric) [20:18:34] (03CR) 10Ottomata: [V: 032 C: 032] Add interlanguage reportupdater job [puppet] - 10https://gerrit.wikimedia.org/r/421920 (https://phabricator.wikimedia.org/T158835) (owner: 10Milimetric) [20:22:59] 10Operations, 10netops: Security audit for tftp on install1001 - https://phabricator.wikimedia.org/T122210#4082253 (10Dzahn) >Andrew wrote: >> from within the labs-vm subnet >ayounsi wrote: >> I think next step is to audit that list, especially for "higher risks" ranges, like Cloud or Sandbox. I think this i... 
[20:23:07] 10Operations, 10Wikimedia-Apache-configuration, 10Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4082254 (10Imarlier) [20:24:20] (03CR) 10Chad: [V: 032 C: 032] Gerrit 2.14.7-9-g0f04397dbd, plus some plugins [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/421463 (owner: 10Chad) [20:25:53] !log demon@tin Started deploy [gerrit/gerrit@f6c5350]: update to 2.14.7-9-g0f04397dbd [20:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:04] !log demon@tin Finished deploy [gerrit/gerrit@f6c5350]: update to 2.14.7-9-g0f04397dbd (duration: 00m 10s) [20:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:44] (03PS1) 10Papaul: DNS: Remove mgmt DNS entries for mw2097-mw2134 [dns] - 10https://gerrit.wikimedia.org/r/422054 (https://phabricator.wikimedia.org/T189111) [20:29:37] !log gerrit: restarting services to pick up bugfix [20:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:23] yay [20:31:26] it worked no_justification :) [20:31:28] https://gerrit.wikimedia.org/r/#/c/99101/ [20:31:40] Yay no more stacktraces! [20:31:44] :) [20:33:35] I also had to change someone's e-mail earlier....they had quite a few bouncing :p [20:34:03] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4082287 (10Papaul) [20:34:18] heh [20:34:29] i also wonder how someone managed to push patchset 2 after merging it [20:35:51] Note there's no CR data on that patch [20:35:56] Maybe it was manually pushed? [20:36:00] yeh [20:36:13] Then the update on the change from the later patchset caused its merge status to refresh? 
[20:36:17] Idk....that's oldddddd [20:36:21] Coulda been a bug or something too [20:36:23] it had no parents [20:36:26] Yeah [20:36:37] which was strange [20:36:50] I bet the parents of it disappeared. [20:36:57] You know....I wonder if it was the old test branch [20:36:58] yes i think so [20:37:08] We used to have a production branch and a labs or test branch or smth [20:37:34] Oh, it's debian-somethingsomething [20:37:37] yeh [20:37:38] No such branch anymore [20:37:42] i think the branch was deleted [20:37:52] Yeah, and the commit refers to parents that no longer exist [20:37:52] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#4082292 (10Papaul) p:05Low>03Normal [20:37:58] https://gerrit.wikimedia.org/r/#/c/256050 [20:38:04] Cannot display change 256050 because it has no revisions. [20:39:50] 10Operations, 10WMDE-QWERTY-Team-Board: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4082307 (10MoritzMuehlenhoff) So, if I understand this right, the wikidiff extension needs additional changes beyond what is currently deployed on production and beta, righ... [20:41:11] !log mholloway-shell@tin Started deploy [mobileapps/deploy@e223f51]: Update mobileapps to 534f95d [20:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:49] no_justification when i was looking at https://phabricator.wikimedia.org/T157898#3124564 and went to https://gerrit.wikimedia.org/r/c/13930 i found a pg bug :). 
[20:45:00] not sure if i fixed it with this, https://gerrit-review.googlesource.com/c/gerrit/+/168650 [20:46:21] Hmm [20:46:34] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@e223f51]: Update mobileapps to 534f95d (duration: 05m 23s) [20:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:29] (03PS1) 10Dzahn: deployment_server: allow rsyncing of /srv/ to new server [puppet] - 10https://gerrit.wikimedia.org/r/422058 (https://phabricator.wikimedia.org/T175288) [20:50:46] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.25 [keeping static files] (duration: 01m 26s) [20:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:35] (03PS2) 10Dzahn: deployment_server: allow rsyncing of /srv/ to new server [puppet] - 10https://gerrit.wikimedia.org/r/422058 (https://phabricator.wikimedia.org/T175288) [20:58:17] (03CR) 10Dzahn: [C: 032] deployment_server: allow rsyncing of /srv/ to new server [puppet] - 10https://gerrit.wikimedia.org/r/422058 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [20:58:29] (03PS3) 10Dzahn: icinga/mobileapps: add mobileapp contacts to service [puppet] - 10https://gerrit.wikimedia.org/r/421676 (https://phabricator.wikimedia.org/T189524) [20:58:37] (03CR) 10Dzahn: [C: 032] icinga/mobileapps: add mobileapp contacts to service [puppet] - 10https://gerrit.wikimedia.org/r/421676 (https://phabricator.wikimedia.org/T189524) (owner: 10Dzahn) [21:00:04] bawolff and Reedy: Your horoscope predicts another unfortunate Weekly Security deployment window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180326T2100). [21:00:04] No GERRIT patches in the queue for this window AFAICS. [21:01:38] unfortunate 😲 [21:04:15] ottomata: unmerged change on puppetmaster. should i go ahead? [21:04:30] i hope it's not due to sync issues between the masters. 
i am using 1001 again [21:04:44] i used 2001 last week [21:07:00] oh oops [21:07:18] merged mutante [21:07:28] thanks! i was just checking if it's the same on 2001 and it is [21:07:43] ok, all cool [21:09:01] it's not showing it as merged yet, but i'll do it [21:09:46] ottomata: merged. i guess you typed "yes" but that means "no". forces you to type "multiple" [21:09:59] ack [21:10:03] yup i did [21:10:06] didn't see the other one [21:10:10] thanks mutante [21:10:20] yw, all done now [21:11:56] (PS3) Dzahn: deployment_server: allow rsyncing of /srv/ to new server [puppet] - https://gerrit.wikimedia.org/r/422058 (https://phabricator.wikimedia.org/T175288) [21:14:02] if a file on a host is generated from a puppet template, should the agent regenerate it if it's missing? [21:14:36] ...and if not, can it be made to? [21:21:29] urandom: yea, the agent would recreate it [21:22:42] the other way around ..you can even tell it to delete everything else that is _not_ from puppet [21:24:51] (PS1) BryanDavis: toolforge: Add wikimedia.org to the CSP allowed list [puppet] - https://gerrit.wikimedia.org/r/422064 (https://phabricator.wikimedia.org/T130748) [21:27:12] mutante: so, remember that host you reimaged for us? that host, for some reason, is missing some files, /etc/cassandra-{a,b}/jvm.options for example [21:27:41] kinda weird, because that one is added conditionally for cassandra 3.x [21:28:03] urandom: dev1006 ? [21:28:07] restbase-dev1006 [21:28:12] Operations, Goal: Make services manageable by systemd - https://phabricator.wikimedia.org/T97402#4082445 (Nuria) [21:28:14] yeah, sorry [21:28:22] can you give me an example of a file [21:28:26] oh, you did [21:28:34] /etc/cassandra-a/jvm.options [21:28:56] actually, it *is* all of the files added conditionally for cassandra 3.x [21:29:36] do you already know where the relevant puppet code is? 
[21:29:39] mutante: https://github.com/wikimedia/puppet/blob/production/modules/cassandra/manifests/instance.pp#L495 [21:29:39] Operations, Analytics-Kanban, Patch-For-Review, Performance-Team (Radar), and 2 others: Deprecation of mw.errors.* metrics - https://phabricator.wikimedia.org/T188749#4082451 (Nuria) Open>Resolved [21:29:50] those 3 files are missing [21:30:20] hmm. i see .. so that value comes from hiera [21:30:31] role/common/restbase/production_ng.yaml: target_version: '3.x' [21:30:38] if ($target_version == '3.x') { [21:30:56] it's hieradata/role/common/restbase/dev_cluster.yaml no? [21:31:01] urandom: the dev cluster doesn't have the setting [21:31:06] only the production_ng role does [21:31:19] role/common/restbase/dev_cluster.yaml: target_version: 'dev' [21:31:22] it's set to 'dev' [21:31:25] but the check is for 3.x [21:32:13] * urandom sighs [21:32:19] could change the version on dev_cluster to 3.x or change "if" check [21:32:21] something must have changed here [21:32:33] yeah, probably the latter [21:32:50] the idea was to be able to test newer/different versions in dev [21:32:52] oh, look: [21:32:53] modules/cassandra/manifests/init.pp: if ($target_version in ['3.x', 'dev']) { [21:32:58] yeah, i know [21:32:59] that's already doing this but in another place [21:33:06] that's the way it all used to work [21:33:20] and the other two machines have the files [21:33:33] only the recently reimaged one doesn't [21:35:32] a1a97c5155d (Eric Evans 2017-06-08 14:44:36 -0500 495) if ($target_version == '3.x') { [21:35:36] hmm [21:35:41] ¯\_(ツ)_/¯ [21:35:44] well, let's just fix it :) [21:35:53] yeah, i have a changeset incoming [21:35:57] ok, cool [21:37:50] (PS1) Eevans: cassandra: include conf files in both 3.x and dev versions [puppet] - https://gerrit.wikimedia.org/r/422067 [21:39:21] (PS2) Dzahn: cassandra: include conf files in both 3.x and dev versions [puppet] - https://gerrit.wikimedia.org/r/422067 (owner: Eevans) 
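The mismatch found in the exchange above can be sketched outside Puppet. This is only a minimal Python illustration, not the content of change 422067: the function names and the tuple of versions are hypothetical, modeled on the two conditions quoted in the log (`$target_version == '3.x'` in instance.pp and `$target_version in ['3.x', 'dev']` in init.pp) and on the two hiera values mentioned (`'3.x'` for production_ng, `'dev'` for dev_cluster).

```python
# Hypothetical model of the two Puppet conditions quoted in the log.
# Nothing here is taken from the real manifests beyond the quoted checks.

# Assumed list, mirroring `in ['3.x', 'dev']` from init.pp.
VERSIONS_WITH_3X_CONF = ("3.x", "dev")


def strict_check(target_version: str) -> bool:
    """Mirror of `if ($target_version == '3.x')` as quoted from instance.pp."""
    return target_version == "3.x"


def inclusive_check(target_version: str) -> bool:
    """Mirror of `if ($target_version in ['3.x', 'dev'])` as quoted from init.pp."""
    return target_version in VERSIONS_WITH_3X_CONF


if __name__ == "__main__":
    # Hiera values as quoted in the log for the two roles.
    roles = {"production_ng": "3.x", "dev_cluster": "dev"}
    for role, version in roles.items():
        print(role, version, strict_check(version), inclusive_check(version))
```

Under the strict check, a host whose role sets `target_version: 'dev'` never gets the conditionally managed files, which matches what was seen on the reimaged host; under the inclusive check both roles qualify.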
[21:39:29] (CR) Dzahn: [C: 2] cassandra: include conf files in both 3.x and dev versions [puppet] - https://gerrit.wikimedia.org/r/422067 (owner: Eevans) [21:43:17] urandom: it created the files now [21:43:28] mutante: yup; thanks! [21:43:44] !log rolling restart of restbase dev environment [21:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:00] Operations, Office-IT: Create @wikimedia.org e-mail that just discards things sent to it - https://phabricator.wikimedia.org/T190719#4082200 (faidon) There's `no-reply@wikimedia.org` that just gets discarded. I'm not sure if it could be a good fit to your purpose though -- wouldn't it be possible to just... [21:47:32] (PS1) Chad: Add initial wikimedia plugin to build [software/gerrit/gerrit] (wmf/stable-2.14) - https://gerrit.wikimedia.org/r/422068 [21:47:34] (CR) Chad: [C: 2] Add initial wikimedia plugin to build [software/gerrit/gerrit] (wmf/stable-2.14) - https://gerrit.wikimedia.org/r/422068 (owner: Chad) [21:48:37] :) [21:50:12] Operations, Office-IT: Create @wikimedia.org e-mail that just discards things sent to it - https://phabricator.wikimedia.org/T190719#4082568 (demon) Open>declined Phabricator has the ability to unverify e-mails and thus disable outbound mail to it. In gerrit's case the ideal solution is being abl... [21:50:26] paravoid: Heh. 
[21:51:07] :O [21:55:53] (CR) Chad: [V: 2 C: 2] Add initial wikimedia plugin to build [software/gerrit/gerrit] (wmf/stable-2.14) - https://gerrit.wikimedia.org/r/422068 (owner: Chad) [21:59:06] (PS1) Chad: Add wikimedia plugin [software/gerrit] (stable-2.14) - https://gerrit.wikimedia.org/r/422071 [21:59:08] (CR) Chad: [C: 2] Add wikimedia plugin [software/gerrit] (stable-2.14) - https://gerrit.wikimedia.org/r/422071 (owner: Chad) [21:59:11] (CR) Chad: [V: 2 C: 2] Add wikimedia plugin [software/gerrit] (stable-2.14) - https://gerrit.wikimedia.org/r/422071 (owner: Chad) [22:02:07] (PS2) Imarlier: [WIP] coal: be smarter about consuming from Kafka [puppet] - https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) [22:02:38] (CR) jerkins-bot: [V: -1] [WIP] coal: be smarter about consuming from Kafka [puppet] - https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier) [22:03:39] (PS3) Imarlier: [WIP] coal: be smarter about consuming from Kafka [puppet] - https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) [22:09:26] !log demon@tin Started deploy [gerrit/gerrit@b14b43b]: wikimedia plugin [22:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:36] !log demon@tin Finished deploy [gerrit/gerrit@b14b43b]: wikimedia plugin (duration: 00m 10s) [22:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:55] Operations, monitoring, Patch-For-Review, Services (watching): Add Reading Infrastructure engineers to contacts for RI-maintained services - https://phabricator.wikimedia.org/T189524#4082825 (Dzahn) @Mholloway @bearND @tgr You should now receive email notifications if the service "LVS HTTP IPv4" o... 
[22:38:30] !log syncing /srv from tin.eqiad to deploy1001.eqiad (T175288) [22:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:36] T175288: setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288 [22:56:38] (CR) Madhuvishy: [C: 1] "Looks good to me, feel free to merge :)" [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657) (owner: ArielGlenn) [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180326T2300). [23:00:04] odder: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:00] Wow, just one patch this window? Nice. [23:04:51] (CR) Dzahn: [C: 2] DNS: Remove mgmt DNS entries for mw2097-mw2134 [dns] - https://gerrit.wikimedia.org/r/422054 (https://phabricator.wikimedia.org/T189111) (owner: Papaul) [23:09:48] (CR) Dzahn: "hashar: see the comments on T182832#3947150, T182832#3940279, T182832#3965012 quote "I confirmed the same issue checking another process, " [puppet] - https://gerrit.wikimedia.org/r/407958 (https://phabricator.wikimedia.org/T182832) (owner: Paladox) [23:15:53] Operations, monitoring, Patch-For-Review, Services (watching): Add Reading Infrastructure engineers to contacts for RI-maintained services - https://phabricator.wikimedia.org/T189524#4082973 (Dzahn) re: readinglists: I am not sure which Icinga check this would be. I can't find any that matches t... 
[23:21:56] * odder wondering if anyone's there to merge & deploy his patch [23:26:08] jouncebot: now [23:26:09] For the next 0 hour(s) and 33 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180326T2300) [23:30:43] o/ [23:33:10] (PS2) Niharika29: Correct high-density logos for the Dutch Low Saxon Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/421521 (https://phabricator.wikimedia.org/T190051) (owner: Odder) [23:33:15] odder: o/ [23:33:24] (CR) Niharika29: [C: 2] "SWAT." [mediawiki-config] - https://gerrit.wikimedia.org/r/421521 (https://phabricator.wikimedia.org/T190051) (owner: Odder) [23:33:27] Whoo-hoo! [23:34:51] (Merged) jenkins-bot: Correct high-density logos for the Dutch Low Saxon Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/421521 (https://phabricator.wikimedia.org/T190051) (owner: Odder) [23:36:13] odder: It's on mwdebug1002. [23:38:17] Niharika: Thanks, finally it looks as it should :-) [23:38:19] (CR) jenkins-bot: Correct high-density logos for the Dutch Low Saxon Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/421521 (https://phabricator.wikimedia.org/T190051) (owner: Odder) [23:41:16] !log niharika29@tin Synchronized static/images/project-logos/: Correct high-density logos for the Dutch Low Saxon Wikipedia T190051 (duration: 00m 59s) [23:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:23] T190051: Please update nds-nl Wikipedia logo - https://phabricator.wikimedia.org/T190051 [23:42:18] odder: Done. 
[23:42:43] Yup, thanks a lot [23:49:54] (PS1) RobH: adding samwilson to shell users [puppet] - https://gerrit.wikimedia.org/r/422081 (https://phabricator.wikimedia.org/T189414) [23:50:46] (CR) RobH: [C: 2] adding samwilson to shell users [puppet] - https://gerrit.wikimedia.org/r/422081 (https://phabricator.wikimedia.org/T189414) (owner: RobH) [23:52:45] (PS1) RobH: adding samwilson to deployment group [puppet] - https://gerrit.wikimedia.org/r/422082 (https://phabricator.wikimedia.org/T189414) [23:53:28] (CR) RobH: [C: 2] adding samwilson to deployment group [puppet] - https://gerrit.wikimedia.org/r/422082 (https://phabricator.wikimedia.org/T189414) (owner: RobH) [23:53:36] (PS3) Krinkle: beta: Combine commons, deployments, meta and zero vhost [puppet] - https://gerrit.wikimedia.org/r/398399 (owner: EddieGP) [23:53:58] (CR) Krinkle: [C: 1] beta: Combine commons, deployments, meta and zero vhost [puppet] - https://gerrit.wikimedia.org/r/398399 (owner: EddieGP) [23:55:11] Operations, Ops-Access-Requests, Patch-For-Review: Requesting deployment access for samwilson - https://phabricator.wikimedia.org/T189414#4083062 (RobH) Open>Resolved @samwilson: your deployment access has now been merged live. Please allow 30 minutes for all affected hosts to call in and re... [23:55:40] Operations, Ops-Access-Requests, Patch-For-Review: Requesting deployment access for samwilson - https://phabricator.wikimedia.org/T189414#4083064 (RobH) [23:56:12] PROBLEM - puppet last run on mw2241 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:57:01] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:57:01] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:57:11] (CR) jerkins-bot: [V: -1] beta: Combine commons, deployments, meta and zero vhost [puppet] - https://gerrit.wikimedia.org/r/398399 (owner: EddieGP) [23:57:12] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:57:12] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:57:22] PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:57:52] PROBLEM - puppet last run on mwdebug2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:58:01] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:58:21] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:58:41] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:58:41] PROBLEM - puppet last run on mw1314 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:58:41] PROBLEM - puppet last run on mw2222 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:58:51] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:58:52] PROBLEM - puppet last run on mw2195 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:59:42] PROBLEM - puppet last run on mw1316 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:59:51] PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:59:52] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:59:52] PROBLEM - puppet last run on deploy1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues