[00:05:46] robh: you care if we set this channel to registered users only? it will save noise from cross-network trouble spilling into here. only caveat is any bots will need to be identified [00:15:07] SantaC: the only issue i can see with that is Google Code In students needing in here [00:15:14] They may not have an irc account [00:15:32] Zppix: i wasn't asking you [00:15:45] I know... [00:15:51] But i was letting you know [00:15:54] Zppix: i am asking operations, you are not operations [01:10:33] i just got kickbanned and my kickban revereted [01:10:37] reverted even [01:10:43] i suppose from all the troll spam [01:11:31] SantaC: So Zppix's observation is a valid one. We do tend to have unregistered folks have to join, but if you let me know exactly what the flag is so i can revert it in a few days, then its cool =] [01:11:43] i can see the spam spreading across various channels [01:11:55] why not get them to sign up via nickserv? [01:12:00] robh: it's channel mode +r [01:12:27] SantaC: cool, feel free to set if you dont mind and i'll chat with the rest of the ops team on monday [01:12:35] and we can decide if it stays or goes [01:12:44] thank you for asking/checking! 
[01:13:25] I'm 75% sure every public document for reporting issues directs folks to #wikimedia-tech, but I'm not 100% [01:56:26] 10Operations, 10OCG-General, 10Readers-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3843559 (10Volker_E) [02:32:56] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.12) (duration: 05m 57s) [02:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:14:46] (03Draft1) 10Paladox: gerrit: Set log level for com.google.gerrit.server.plugins.PluginLoader to ERROR [puppet] - 10https://gerrit.wikimedia.org/r/398785 [03:14:50] (03PS2) 10Paladox: gerrit: Set log level for com.google.gerrit.server.plugins.PluginLoader to ERROR [puppet] - 10https://gerrit.wikimedia.org/r/398785 [03:24:07] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 724.06 seconds [03:53:07] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 156.11 seconds [06:00:59] (03CR) 10Chad: "It's harmless." 
[puppet] - 10https://gerrit.wikimedia.org/r/398785 (owner: 10Paladox) [06:11:27] (03PS1) 10Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398789 (https://phabricator.wikimedia.org/T161294) [06:12:39] (03PS2) 10Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398789 (https://phabricator.wikimedia.org/T161294) [06:14:26] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398789 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [06:15:54] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398789 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [06:17:00] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398789 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [06:17:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1106 - T161294 (duration: 00m 57s) [06:17:23] !log Stop replication in sync on db1106 and db1100 - T161294 [06:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:32] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [06:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:57] PROBLEM - Apache HTTP on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:20:18] PROBLEM - Nginx local proxy to apache on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:20:47] PROBLEM - HHVM rendering on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:20:58] (03PS1) 10Marostegui: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398790 (https://phabricator.wikimedia.org/T174569) [06:24:34] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/398791 [06:26:13] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398791 (owner: 10Marostegui) [06:27:35] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398791 (owner: 10Marostegui) [06:27:38] RECOVERY - HHVM rendering on mw1283 is OK: HTTP OK: HTTP/1.1 200 OK - 78613 bytes in 0.091 second response time [06:27:46] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398791 (owner: 10Marostegui) [06:27:47] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.031 second response time [06:28:12] (03PS2) 10Marostegui: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398790 (https://phabricator.wikimedia.org/T174569) [06:28:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1106 - T161294 (duration: 00m 56s) [06:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:09] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [06:30:07] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398790 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:30:48] PROBLEM - HHVM rendering on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:30:57] PROBLEM - Apache HTTP on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:31:32] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398790 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:31:42] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398790 (https://phabricator.wikimedia.org/T174569) (owner: 
10Marostegui) [06:32:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1073 - T174569 (duration: 00m 57s) [06:32:58] !log Deploy schema change on db1073 - T174569 [06:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:05] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [06:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:47] RECOVERY - HHVM rendering on mw1283 is OK: HTTP OK: HTTP/1.1 200 OK - 78591 bytes in 0.330 second response time [06:34:49] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.055 second response time [06:35:18] RECOVERY - Nginx local proxy to apache on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.069 second response time [06:43:15] <_joe_> !log restarted hhvm on mw1283, still the same kind of lockups [06:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:32] !log Defragment s2 databases on db1102 - T172169 [06:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:41] T172169: Compress InnoDB on db1102 - https://phabricator.wikimedia.org/T172169 [06:57:18] <_joe_> !log reenabling puppet across servers with scap [06:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:17] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100% [08:03:38] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:48] PROBLEM - Host dysprosium is DOWN: PING CRITICAL - Packet loss = 100% [08:03:50] PROBLEM - Host fermium is DOWN: PING CRITICAL - Packet loss = 100% [08:03:50] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:50] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:50] PROBLEM - Host etcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:58]
PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:58] PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100% [08:04:08] PROBLEM - Host neon is DOWN: PING CRITICAL - Packet loss = 100% [08:04:17] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [08:04:45] ganeti again? [08:04:57] PROBLEM - SSH on ganeti1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:05:08] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [08:07:27] PROBLEM - etc request latencies on argon is CRITICAL: CRITICAL - etcd_request_latencies is 770669 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:07:37] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 766585 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:07:40] unable to ssh, console stuck, issuing a rebbot [08:07:46] *reboot [08:08:57] !log powercycling ganeti1005 [08:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:27] RECOVERY - etc request latencies on argon is OK: OK - etcd_request_latencies is 3884 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:10:37] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 4642 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:10:59] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3780358 (10Volans) Powercycled ganeti1005, unable to ssh, console unresponsive. [08:11:07] akosiaris: FYI ^^^ [08:11:08] PROBLEM - Check systemd state on ganeti1005 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
[08:11:47] RECOVERY - SSH on ganeti1005 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [08:12:07] RECOVERY - Check systemd state on ganeti1005 is OK: OK - running: The system is fully operational [08:12:17] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [08:15:37] RECOVERY - Host dysprosium is UP: PING OK - Packet loss = 0%, RTA = 3.85 ms [08:15:47] RECOVERY - Host etcd1005 is UP: PING OK - Packet loss = 0%, RTA = 6.73 ms [08:15:47] RECOVERY - Host neon is UP: PING OK - Packet loss = 0%, RTA = 6.26 ms [08:15:47] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 5.84 ms [08:15:57] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 7.11 ms [08:15:57] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 7.61 ms [08:16:07] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 6.63 ms [08:16:08] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 6.85 ms [08:16:08] RECOVERY - Host actinium is UP: PING OK - Packet loss = 0%, RTA = 7.92 ms [08:16:08] RECOVERY - Host fermium is UP: PING OK - Packet loss = 0%, RTA = 7.60 ms [08:24:14] (03CR) 10Muehlenhoff: First version (035 comments) [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/398505 (https://phabricator.wikimedia.org/T181995) (owner: 10Filippo Giunchedi) [08:24:27] PROBLEM - etc request latencies on argon is CRITICAL: CRITICAL - etcd_request_latencies is 6052189 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:24:37] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 4330379 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:26:24] <_joe_> heh, the etcd cluster must have not loved the issue with ganeti1005
[08:26:27] RECOVERY - etc request latencies on argon is OK: OK - etcd_request_latencies is 3227 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:26:31] <_joe_> but I see the latencies are coming back fast [08:26:35] <_joe_> which is good [08:26:37] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 3948 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:27:19] <_joe_> (btw, that value is in nanoseconds, why do engineers resist putting units of measure everywhere - you then end up blowing up rocketships, don't you know) [08:27:31] ahahah [08:27:34] indeed [08:28:27] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [08:30:03] !log insert decryption key for 2017 Arb elections [08:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:41] 10Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Allow contint-admins to interact with docker on CI hosts - https://phabricator.wikimedia.org/T182860#3843804 (10Volans) p:05Triage>03Normal [08:35:46] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3843805 (10Volans) p:05Triage>03Normal [08:36:50] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/398303 (https://phabricator.wikimedia.org/T182702) (owner: 10Volans) [08:40:31] (03PS1) 10Giuseppe Lavagetto: puppet-compiler: extract facts from puppetDB [puppet] - 10https://gerrit.wikimedia.org/r/398795 [08:40:38] <_joe_> volans, elukey ^^ [08:40:43] <_joe_> care to take a look?
[08:41:06] (03CR) 10jerkins-bot: [V: 04-1] puppet-compiler: extract facts from puppetDB [puppet] - 10https://gerrit.wikimedia.org/r/398795 (owner: 10Giuseppe Lavagetto) [08:41:18] <_joe_> damn you jenkins [08:41:25] <_joe_> what the heck do you want [08:41:47] <_joe_> line too long, meh [08:41:47] 10Operations, 10Patch-For-Review: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3843811 (10MoritzMuehlenhoff) A revised kernel has been released at https://lists.debian.org/debian-stable-announce/2017/12/msg00002.html But the netinst... [08:53:13] (03PS3) 10Volans: base: fix dependency relationship [puppet] - 10https://gerrit.wikimedia.org/r/398303 (https://phabricator.wikimedia.org/T182702) [08:53:15] (03PS3) 10Volans: wmf-auto-reimage: generate Puppet cert if needed [puppet] - 10https://gerrit.wikimedia.org/r/398279 (https://phabricator.wikimedia.org/T182702) [08:53:46] (03CR) 10Elukey: "Overall LGTM! Thanks for fixing it!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398795 (owner: 10Giuseppe Lavagetto) [08:54:10] I've little context about the pcc but it looks good to me! 
[08:56:35] (03CR) 10Giuseppe Lavagetto: puppet-compiler: extract facts from puppetDB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398795 (owner: 10Giuseppe Lavagetto) [08:56:39] (03PS2) 10Giuseppe Lavagetto: puppet-compiler: extract facts from puppetDB [puppet] - 10https://gerrit.wikimedia.org/r/398795 [08:57:24] !log rolling restart of the Yarn nodemanagers (hadoop) on analytics10[456]* to pick up new settings - T182276 [08:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:35] T182276: Enable more accurate smaps based RSS tracking by yarn nodemanager - https://phabricator.wikimedia.org/T182276 [08:58:50] !log Stop replication in sync on db1100 and db2052 - T161294 [08:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:00] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [09:11:50] (03CR) 10Volans: "Much nicer than previous approach, few comments inline." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/398795 (owner: 10Giuseppe Lavagetto) [09:19:16] (03PS7) 10Jcrespo: Update mariadb::proxy to the latest style and path locations [puppet] - 10https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) [09:19:53] (03CR) 10Jcrespo: "If gerrit:398450 has been merged, change hiera instead."
[puppet] - 10https://gerrit.wikimedia.org/r/398508 (owner: 10Jcrespo) [09:20:18] 10Operations, 10Goal, 10User-fgiunchedi: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759#3843921 (10elukey) [09:20:44] 10Operations, 10Goal, 10User-Elukey, 10User-fgiunchedi: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759#3833652 (10elukey) [09:28:43] !log removing initial import datafiles from maps[12]001 [09:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:52] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint: maps-test2001 is low on disk space - https://phabricator.wikimedia.org/T182583#3843950 (10Gehel) Reducing cassandra replication factor frees enough space that we don't have an immediate issue anymore (compaction is running without issue). The goal being to... [09:34:14] (03CR) 10Giuseppe Lavagetto: [C: 031] base: fix dependency relationship [puppet] - 10https://gerrit.wikimedia.org/r/398303 (https://phabricator.wikimedia.org/T182702) (owner: 10Volans) [09:35:49] thanks! 
to be overcareful I just tested it in labs too and seems to work fine [09:36:05] (03CR) 10Volans: [C: 032] base: fix dependency relationship [puppet] - 10https://gerrit.wikimedia.org/r/398303 (https://phabricator.wikimedia.org/T182702) (owner: 10Volans) [09:38:25] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398800 [09:39:53] !log Deploy schema change on db1055 (already depooled) - T174569 [09:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:05] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [09:40:24] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398800 (owner: 10Marostegui) [09:40:26] !log installing openssl security updates [09:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:38] (03PS1) 10Gehel: elastic: provide elastic55 component also for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/398801 [09:40:44] (03CR) 10Volans: [C: 032] wmf-auto-reimage: generate Puppet cert if needed [puppet] - 10https://gerrit.wikimedia.org/r/398279 (https://phabricator.wikimedia.org/T182702) (owner: 10Volans) [09:41:22] (03PS3) 10Muehlenhoff: Add rabbitmq-exporter to Prometheus scraper config for labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/398428 [09:41:50] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398800 (owner: 10Marostegui) [09:42:04] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398800 (owner: 10Marostegui) [09:43:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1073 - T174569 (duration: 00m 57s) [09:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:22] (03PS1) 10Marostegui: db-eqiad.php:
Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398802 (https://phabricator.wikimedia.org/T174569) [09:45:07] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398802 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [09:46:39] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398802 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [09:46:50] (03CR) 10Muehlenhoff: [C: 031] elastic: provide elastic55 component also for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/398801 (owner: 10Gehel) [09:47:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1066 - T174569 (duration: 00m 56s) [09:47:53] !log Deploy schema change on db1066 - T174569 [09:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:58] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [09:48:05] (03CR) 10Giuseppe Lavagetto: puppet-compiler: extract facts from puppetDB (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/398795 (owner: 10Giuseppe Lavagetto) [09:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:30] (03PS2) 10Gehel: elastic: provide elastic55 component also for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/398801 [09:48:36] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398802 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [09:49:15] (03CR) 10Gehel: [C: 032] elastic: provide elastic55 component also for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/398801 (owner: 10Gehel) [09:51:00] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3844020 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on 
neodymium.eqiad.wmnet for hosts: ``` db1112.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei... [09:52:21] (03PS3) 10Giuseppe Lavagetto: puppet-compiler: extract facts from puppetDB [puppet] - 10https://gerrit.wikimedia.org/r/398795 [10:00:15] (03PS2) 10Filippo Giunchedi: First version [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/398505 (https://phabricator.wikimedia.org/T181995) [10:00:30] (03CR) 10Filippo Giunchedi: First version (035 comments) [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/398505 (https://phabricator.wikimedia.org/T181995) (owner: 10Filippo Giunchedi) [10:04:51] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3844044 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1112.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1112.eqiad.wmnet'] ``` [10:08:51] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3844056 (10fgiunchedi) I noticed the same re: jre dependencies and fixed it in https://gerrit.wikimedia.org/r/#/c/394322/ though th... [10:15:07] PROBLEM - MariaDB disk space on db1104 is CRITICAL: DISK CRITICAL - free space: / 472 MB (1% inode=97%) [10:15:07] PROBLEM - Disk space on db1104 is CRITICAL: DISK CRITICAL - free space: / 472 MB (1% inode=97%) [10:15:43] ^ that is me [10:15:46] ok [10:15:49] it is not pooled? 
[10:16:02] it is fixed now [10:16:03] RECOVERY - Disk space on db1104 is OK: DISK OK [10:16:12] it was "/" not "/srv/" [10:16:26] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3844074 (10akosiaris) [10:16:35] a, ok [10:17:06] (03PS1) 10Volans: wmf-auto-reimage: ignore exit code for cert gen [puppet] - 10https://gerrit.wikimedia.org/r/398803 (https://phabricator.wikimedia.org/T182702) [10:17:07] RECOVERY - MariaDB disk space on db1104 is OK: DISK OK [10:17:58] 10Operations, 10Scap, 10Patch-For-Review: scap 3.7.4-2 is broken - https://phabricator.wikimedia.org/T183046#3841983 (10akosiaris) [10:18:01] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: SCAP: Upload debian package version 3.7.4-1 - https://phabricator.wikimedia.org/T182347#3844077 (10akosiaris) [10:20:56] (03PS2) 10Elukey: profile::hadoop::prometheus_jmx_exporter: blacklist unwanted Mbeans [puppet] - 10https://gerrit.wikimedia.org/r/398282 (https://phabricator.wikimedia.org/T177458) [10:22:26] (03CR) 10Elukey: [C: 032] profile::hadoop::prometheus_jmx_exporter: blacklist unwanted Mbeans [puppet] - 10https://gerrit.wikimedia.org/r/398282 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [10:25:03] elukey: already merged? 
[10:25:18] * volans picking a number in the merge queue [10:25:38] (03PS1) 10Marostegui: db-eqiad.php: Depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398804 (https://phabricator.wikimedia.org/T161294) [10:27:24] (03CR) 10Muehlenhoff: [C: 031] "One nit, but looks good to me" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397749 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey) [10:27:26] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398804 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [10:27:28] (03CR) 10Volans: [C: 032] wmf-auto-reimage: ignore exit code for cert gen [puppet] - 10https://gerrit.wikimedia.org/r/398803 (https://phabricator.wikimedia.org/T182702) (owner: 10Volans) [10:27:34] (03PS2) 10Volans: wmf-auto-reimage: ignore exit code for cert gen [puppet] - 10https://gerrit.wikimedia.org/r/398803 (https://phabricator.wikimedia.org/T182702) [10:28:52] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398804 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [10:29:07] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398804 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [10:29:37] (03CR) 10Elukey: Add mw13[29-37] to site.pp and conftool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397749 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey) [10:29:58] (03PS3) 10Elukey: Add mw13[29-37] to site.pp and conftool [puppet] - 10https://gerrit.wikimedia.org/r/397749 (https://phabricator.wikimedia.org/T165519) [10:30:04] (03CR) 10Giuseppe Lavagetto: "A couple of comments from a quick review." 
(032 comments) [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/398505 (https://phabricator.wikimedia.org/T181995) (owner: 10Filippo Giunchedi) [10:30:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1109 - T161294 (duration: 00m 56s) [10:30:14] !log Stop replication on db1109 and db2045 in sync - T161294 [10:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:19] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [10:30:27] (03CR) 10jerkins-bot: [V: 04-1] Add mw13[29-37] to site.pp and conftool [puppet] - 10https://gerrit.wikimedia.org/r/397749 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey) [10:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:34] yes jenkins I know [10:30:46] you sure? :-P [10:31:49] yeah same -1 that it gave me before :D [10:32:57] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3844111 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1112.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei... 
[10:33:59] 10Operations, 10monitoring, 10Patch-For-Review: Cluster puppet variable and ganglia decommission - https://phabricator.wikimedia.org/T179395#3844112 (10Volans) p:05Triage>03Normal [10:35:16] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Disable hiera autolookups - https://phabricator.wikimedia.org/T181971#3844113 (10Volans) p:05Triage>03Normal [10:36:30] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review, 10User-Joe: Create scaffolding of services templates for deployment in production/staging - https://phabricator.wikimedia.org/T177397#3844114 (10Volans) p:05Triage>03Normal [10:38:22] (03CR) 10Muehlenhoff: [C: 031] First version [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/398505 (https://phabricator.wikimedia.org/T181995) (owner: 10Filippo Giunchedi) [10:40:57] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398806 [10:43:33] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398806 (owner: 10Marostegui) [10:45:00] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398806 (owner: 10Marostegui) [10:46:34] (03CR) 10Gehel: [C: 031] "This looks reasonable to me. I'm not sure if this is something we could deploy during our freeze or if we should wait January." 
[puppet] - 10https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones) [10:46:41] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398806 (owner: 10Marostegui) [10:46:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1109 - T161294 (duration: 00m 56s) [10:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:55] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [10:46:56] (03PS6) 10Gehel: Updates to enable short URLs for transliteration for crhwiki [puppet] - 10https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones) [10:47:04] 10Operations, 10Patch-For-Review: Install nodejs, nginx and other dependencies on francium - https://phabricator.wikimedia.org/T94457#3844128 (10Volans) [10:47:21] 10Operations, 10WMDE-Analytics-Engineering, 10Graphite, 10Patch-For-Review, and 2 others: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#3844129 (10Volans) [10:53:20] (03PS1) 10Elukey: Set numa=off to tftpboot jessie's ttyS1-115200 config [puppet] - 10https://gerrit.wikimedia.org/r/398807 (https://phabricator.wikimedia.org/T182702) [10:54:24] godog: --^ [10:54:32] (03CR) 10Giuseppe Lavagetto: [C: 031] Add mw13[29-37] to site.pp and conftool [puppet] - 10https://gerrit.wikimedia.org/r/397749 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey) [10:55:09] (03PS2) 10Giuseppe Lavagetto: Remove trendingedits discovery endpoint [dns] - 10https://gerrit.wikimedia.org/r/397745 (https://phabricator.wikimedia.org/T180384) [10:55:52] (03CR) 10Giuseppe Lavagetto: [C: 032] Remove trendingedits discovery endpoint [dns] - 10https://gerrit.wikimedia.org/r/397745 (https://phabricator.wikimedia.org/T180384) (owner: 10Giuseppe Lavagetto) [10:56:20] <_joe_> mobrovac: merging the dns change, then we need to merge your change 
(I'll re-review it just in case) [10:56:33] kk [10:57:36] <_joe_> done [10:57:42] <_joe_> what was your change again? :) [10:58:53] <_joe_> found [10:59:44] (03PS3) 10Filippo Giunchedi: First version [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/398505 (https://phabricator.wikimedia.org/T181995) [10:59:58] <_joe_> uhm your change as-is has a race condition [11:00:06] (03CR) 10Filippo Giunchedi: First version (032 comments) [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/398505 (https://phabricator.wikimedia.org/T181995) (owner: 10Filippo Giunchedi) [11:00:10] <_joe_> but we can circumvent it :) [11:00:19] (03PS3) 10Giuseppe Lavagetto: Remove the Trending Edits service from production [puppet] - 10https://gerrit.wikimedia.org/r/397571 (https://phabricator.wikimedia.org/T180384) (owner: 10Mobrovac) [11:00:41] oh? [11:01:04] <_joe_> mobrovac: if we remove conftool data before puppet has run on the lvs servers, we'd have pools with zero servers [11:01:10] <_joe_> and pybal is not happy [11:01:18] <_joe_> it will keep working but complain loudly [11:01:37] right but that's short-lived and we don't care about it, do we? [11:01:50] <_joe_> heh, it's the load balancers [11:02:03] <_joe_> your best shot at a global outage after dns [11:02:07] <_joe_> i prefer to play it safe [11:02:33] <_joe_> I'll just have conftool-sync fail with my merge, and re-do it by hand after I ran puppet on the load balancers [11:03:04] kk [11:03:08] sounds good [11:04:30] <_joe_> !log disabled notifications for trendingedits.svc T180384 [11:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:42] T180384: Turn off Trending Service - https://phabricator.wikimedia.org/T180384 [11:04:55] <_joe_> stashbot: lag? [11:04:55] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. 
[11:05:30] (03CR) 10Giuseppe Lavagetto: [C: 032] Remove the Trending Edits service from production [puppet] - 10https://gerrit.wikimedia.org/r/397571 (https://phabricator.wikimedia.org/T180384) (owner: 10Mobrovac) [11:07:39] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3844212 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1112.eqiad.wmnet'] ``` and were **ALL** successful. [11:08:33] <_joe_> !log rolling restart of pybal on the low-traffic balancers [11:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:16] (03CR) 10Filippo Giunchedi: [C: 031] Add rabbitmq-exporter to Prometheus scraper config for labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/398428 (owner: 10Muehlenhoff) [11:15:39] 10Operations, 10ops-ulsfo, 10Traffic: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3844233 (10MoritzMuehlenhoff) cp4009 and cp4018 (also also cp4013) are marked as removed from puppet, but still show in https://servermon.wikimedia.org/hosts/, that usually means that "puppet deactivate" w... [11:15:50] PROBLEM - Confd template for /srv/config-master/pybal/codfw/trendingedits on labpuppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/trendingedits is broken [11:16:51] <_joe_> wat [11:16:53] <_joe_> oh right [11:17:00] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/trendingedits on labpuppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/trendingedits is broken [11:17:07] <_joe_> not a real problem ^^ [11:17:17] <_joe_> also why is this spitting out again? 
[11:17:43] once for codfw and once for eqiad, it seems [11:18:20] !log reimaging mw1307 (video scaler) to stretch [11:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:29] (03PS4) 10Elukey: Add mw13[29-37] to site.pp and conftool [puppet] - 10https://gerrit.wikimedia.org/r/397749 (https://phabricator.wikimedia.org/T165519) [11:19:53] (03CR) 10jerkins-bot: [V: 04-1] Add mw13[29-37] to site.pp and conftool [puppet] - 10https://gerrit.wikimedia.org/r/397749 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey) [11:20:10] (03CR) 10Elukey: [V: 032 C: 032] Add mw13[29-37] to site.pp and conftool [puppet] - 10https://gerrit.wikimedia.org/r/397749 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey) [11:20:20] (03CR) 10Muehlenhoff: [C: 031] "Ack, let's do that until a revised netboot image has been released." [puppet] - 10https://gerrit.wikimedia.org/r/398807 (https://phabricator.wikimedia.org/T182702) (owner: 10Elukey) [11:20:20] !log stopping the trending edits service - T180384 [11:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:29] T180384: Turn off Trending Service - https://phabricator.wikimedia.org/T180384 [11:21:11] PROBLEM - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:21:20] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:21:30] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:22:01] PROBLEM - Check systemd state on scb2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:22:04] PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:22:05] (03CR) 10Filippo Giunchedi: Add a Prometheus exporter for PowerDNS (034 comments) [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/398447 (https://phabricator.wikimedia.org/T182970) (owner: 10Muehlenhoff) [11:22:40] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:22:41] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:22:41] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:22:42] (03CR) 10Filippo Giunchedi: [C: 031] Set numa=off to tftpboot jessie's ttyS1-115200 config [puppet] - 10https://gerrit.wikimedia.org/r/398807 (https://phabricator.wikimedia.org/T182702) (owner: 10Elukey) [11:22:50] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:23:01] PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:23:05] (03PS2) 10Elukey: Set numa=off to tftpboot jessie's ttyS1-115200 config [puppet] - 10https://gerrit.wikimedia.org/r/398807 (https://phabricator.wikimedia.org/T182702) [11:24:04] (03CR) 10Elukey: [C: 032] Set numa=off to tftpboot jessie's ttyS1-115200 config [puppet] - 10https://gerrit.wikimedia.org/r/398807 (https://phabricator.wikimedia.org/T182702) (owner: 10Elukey) [11:24:51] <_joe_> !log ran cleanup script on scb* T180384 [11:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:06] (03CR) 10Paladox: "Also it would spam logstash every minute." 
[puppet] - 10https://gerrit.wikimedia.org/r/398785 (owner: 10Paladox) [11:25:41] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational [11:25:41] RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational [11:25:41] RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational [11:25:44] (03CR) 10Filippo Giunchedi: Add Prometheus exporter for Blazegraph (031 comment) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/398272 (https://phabricator.wikimedia.org/T182857) (owner: 10Muehlenhoff) [11:25:50] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational [11:26:01] RECOVERY - Check systemd state on scb2004 is OK: OK - running: The system is fully operational [11:26:02] RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational [11:26:10] RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational [11:26:11] RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational [11:26:20] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [11:26:30] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [11:30:13] (03PS2) 10Giuseppe Lavagetto: Remove all references to trendingedits [dns] - 10https://gerrit.wikimedia.org/r/397746 (https://phabricator.wikimedia.org/T180384) [11:31:56] (03CR) 10Giuseppe Lavagetto: [C: 032] Remove all references to trendingedits [dns] - 10https://gerrit.wikimedia.org/r/397746 (https://phabricator.wikimedia.org/T180384) (owner: 10Giuseppe Lavagetto) [11:32:53] (03PS4) 10Muehlenhoff: Add Prometheus exporter for Blazegraph [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/398272 (https://phabricator.wikimedia.org/T182857) [11:33:03] (03CR) 10Muehlenhoff: Add Prometheus 
exporter for Blazegraph (031 comment) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/398272 (https://phabricator.wikimedia.org/T182857) (owner: 10Muehlenhoff) [11:36:12] (03CR) 10Filippo Giunchedi: [C: 031] Add Prometheus exporter for Blazegraph [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/398272 (https://phabricator.wikimedia.org/T182857) (owner: 10Muehlenhoff) [11:37:06] 10Operations, 10Scap, 10Patch-For-Review: scap 3.7.4-2 is broken - https://phabricator.wikimedia.org/T183046#3841983 (10akosiaris) So, I 'll add my considerable amounts of cents in this task in order to provide insight into what happened, see where did went south and figure out what we need to do to avoid it... [11:37:15] !log restarting Jenkins CI to upgrade the monitoring plugin [11:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:46] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: SCAP: Upload debian package version 3.7.4-3 - https://phabricator.wikimedia.org/T182347#3844304 (10akosiaris) [11:53:56] 10Operations, 10Scap, 10Patch-For-Review: scap 3.7.4-2 is broken - https://phabricator.wikimedia.org/T183046#3841983 (10ArielGlenn) Agree about building 3.7.4-3 for new upload. As far as the bin_dir, have you seen this? 
https://gerrit.wikimedia.org/r/#/c/398606/ [11:54:06] (03PS2) 10Muehlenhoff: Add a Prometheus exporter for PowerDNS [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/398447 (https://phabricator.wikimedia.org/T182970) [11:54:17] (03CR) 10Muehlenhoff: Add a Prometheus exporter for PowerDNS (034 comments) [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/398447 (https://phabricator.wikimedia.org/T182970) (owner: 10Muehlenhoff) [11:54:32] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [11:54:52] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [11:55:57] (03CR) 10Giuseppe Lavagetto: [C: 031] First version [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/398505 (https://phabricator.wikimedia.org/T181995) (owner: 10Filippo Giunchedi) [11:59:43] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This will change the rules for wikipedia and other sites in beta, and for just wikipedia in production." 
[puppet] - 10https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones) [12:00:05] (03PS2) 10Alexandros Kosiaris: scap: Set bin_dir globally to /usr/bin [puppet] - 10https://gerrit.wikimedia.org/r/398606 (https://phabricator.wikimedia.org/T183046) (owner: 10Chad) [12:00:14] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] scap: Set bin_dir globally to /usr/bin [puppet] - 10https://gerrit.wikimedia.org/r/398606 (https://phabricator.wikimedia.org/T183046) (owner: 10Chad) [12:01:34] (03PS4) 10Muehlenhoff: Add rabbitmq-exporter to Prometheus scraper config for labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/398428 [12:02:39] (03CR) 10Muehlenhoff: [C: 032] Add rabbitmq-exporter to Prometheus scraper config for labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/398428 (owner: 10Muehlenhoff) [12:02:56] 10Operations, 10Scap, 10Patch-For-Review: scap 3.7.4-2 is broken - https://phabricator.wikimedia.org/T183046#3844359 (10akosiaris) The 2 first actionables are in D919. I 've also merged the bin_dir configuration change above, but that's just a safeguard, as per the comment in the change "Ideally, we want t... [12:03:23] akosiaris: okay to puppet-merge your bin_dir patch along? [12:03:59] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398818 [12:04:12] PROBLEM - MD RAID on mw1307 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:05:58] ^ that's a reimage, silencing [12:09:40] 10Operations, 10Trending-Service, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3844363 (10mobrovac) [12:10:25] (03PS1) 10Ema: mtail: add varnishreqstats.mtail [puppet] - 10https://gerrit.wikimedia.org/r/398819 (https://phabricator.wikimedia.org/T177199) [12:10:58] 10Operations, 10Trending-Service, 10Reading-Infrastructure-Team-Backlog (Kanban), 10Services (done), 10User-Joe: Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3755911 (10mobrovac) 05Open>03Resolved a:05bearND>03mobrovac The service has been completely removed from producti... [12:10:59] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I don't think this unification is possible - a ton of configuration we add for www.wikipedia.org would apply to wikipedia.org as well, and" [puppet] - 10https://gerrit.wikimedia.org/r/398396 (owner: 10EddieGP) [12:12:12] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [12:12:51] (03CR) 10Alexandros Kosiaris: "Minor typo, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398795 (owner: 10Giuseppe Lavagetto) [12:12:53] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/398399 (owner: 10EddieGP) [12:13:47] moritzm: yeah, sorry I forgot about merging it [12:17:39] (03PS1) 10Alexandros Kosiaris: Bump scap to 3.7.4-3 [puppet] - 10https://gerrit.wikimedia.org/r/398822 (https://phabricator.wikimedia.org/T183046) [12:20:14] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398818 (owner: 10Marostegui) [12:20:15] !log build scap 3.7.4-3 and upload to jessie-wikimedia, stretch-wikimedia, trusty-wikimedia. 
T183046, T182347 [12:20:22] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [12:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:26] T182347: SCAP: Upload debian package version 3.7.4-3 - https://phabricator.wikimedia.org/T182347 [12:20:27] T183046: scap 3.7.4-2 is broken - https://phabricator.wikimedia.org/T183046 [12:21:38] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398818 (owner: 10Marostegui) [12:21:51] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398818 (owner: 10Marostegui) [12:24:28] (03PS1) 10Ladsgroup: Don't enable lua fine grained tracking for any wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398823 (https://phabricator.wikimedia.org/T172914) [12:25:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1066 - T174569 (duration: 03m 06s) [12:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:22] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [12:25:22] PROBLEM - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:22] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:52] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [12:26:02] PROBLEM - Nginx local proxy to apache on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:27:08] <_joe_> looking [12:27:19] <_joe_> it's the same bug I reported last week btw [12:28:12] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.037 second response time [12:28:13] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 78356 bytes in 0.093 second response time [12:28:24] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 [12:28:32] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0 [12:28:43] (03PS1) 10Marostegui: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398825 (https://phabricator.wikimedia.org/T174569) [12:28:53] RECOVERY - Nginx local proxy to apache on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.037 second response time [12:30:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398825 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [12:30:41] (03CR) 10Giuseppe Lavagetto: "I am not sure the basic setup you created would work, see my comment inline, but this surely looks promising." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397770 (owner: 10EddieGP) [12:31:24] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398825 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [12:31:37] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398825 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [12:33:22] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:34:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1099:3311 - T174569 (duration: 03m 05s) [12:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:55] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [12:35:39] !log Deploy schema change on db1099:331 and db1067 - T174569 [12:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:52] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:37:47] akosiaris: Version '3.7.4-1' for 'scap' was not found [12:37:52] (03PS1) 10Marostegui: mariadb: Add db1112 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/398826 (https://phabricator.wikimedia.org/T180788) [12:38:22] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [12:38:33] (03PS5) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) [12:40:23] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:40:53] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3844424 (10Marostegui) [12:42:30] volans: looks like it's been upped to 3.7.4-3 [12:42:32] in the repo [12:43:22] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:44:03] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:44:22] RECOVERY - MD RAID on mw1307 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [12:45:52] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:47:33] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [12:47:42] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [12:48:10] the control file for 3.7.4-3 is still wrong, it has python-semver in depends instead of suggests. that will fail on trusty. [12:48:22] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [12:49:02] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:50:30] https://phabricator.wikimedia.org/source/scap/browse/master/debian/control this has it correctly [12:50:43] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:51:02] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:51:04] akosiaris: [12:52:12] (03CR) 10Hashar: contint: allow releng to interact with Docker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398240 (https://phabricator.wikimedia.org/T182860) (owner: 10Hashar) [12:52:21] (03PS2) 10Hashar: contint: allow releng to interact with Docker [puppet] - 10https://gerrit.wikimedia.org/r/398240 (https://phabricator.wikimedia.org/T182860) [12:52:47] (03PS1) 10Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398827 (https://phabricator.wikimedia.org/T161294) [12:52:52] (03PS3) 10Hashar: contint: allow releng to interact with Docker [puppet] - 10https://gerrit.wikimedia.org/r/398240 (https://phabricator.wikimedia.org/T182860) [12:55:57] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398827 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [12:57:32] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398827 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [12:57:42] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398827 (https://phabricator.wikimedia.org/T161294) 
(owner: 10Marostegui) [12:58:03] !log Stop replication in sync on db1106 and db1100 - T161294 [12:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:14] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [13:00:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1106 - T161294 (duration: 03m 03s) [13:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:15] PROBLEM - HHVM rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:01:45] PROBLEM - Nginx local proxy to apache on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:02:34] PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:02:48] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398830 [13:03:44] RECOVERY - Nginx local proxy to apache on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 1.810 second response time [13:04:04] RECOVERY - HHVM rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 78354 bytes in 0.247 second response time [13:04:41] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398830 (owner: 10Marostegui) [13:06:15] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [13:06:35] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [13:06:54] PROBLEM - Nginx local proxy to apache on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:06:58] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398830 (owner: 10Marostegui) [13:07:08] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/398830 (owner: 10Marostegui) [13:07:15] PROBLEM - HHVM rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:13] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1106 - T161294 (duration: 03m 06s) [13:10:15] (03CR) 10Jcrespo: "Call the shard other than s4, e.g. "test-s4" or something, so the alerts,stats and name is separate." [puppet] - 10https://gerrit.wikimedia.org/r/398826 (https://phabricator.wikimedia.org/T180788) (owner: 10Marostegui) [13:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:23] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [13:11:04] RECOVERY - HHVM rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 78354 bytes in 0.113 second response time [13:14:15] PROBLEM - HHVM rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:14:19] (03PS2) 10Marostegui: mariadb: Add db1112 to s4-test [puppet] - 10https://gerrit.wikimedia.org/r/398826 (https://phabricator.wikimedia.org/T180788) [13:16:10] (03PS3) 10Marostegui: mariadb: Add db1112 to s4-test [puppet] - 10https://gerrit.wikimedia.org/r/398826 (https://phabricator.wikimedia.org/T180788) [13:16:56] (03PS4) 10Marostegui: mariadb: Add db1112 to s4-test [puppet] - 10https://gerrit.wikimedia.org/r/398826 (https://phabricator.wikimedia.org/T180788) [13:17:32] (03CR) 10Jcrespo: [C: 031] mariadb: Add db1112 to s4-test [puppet] - 10https://gerrit.wikimedia.org/r/398826 (https://phabricator.wikimedia.org/T180788) (owner: 10Marostegui) [13:18:28] (03CR) 10Filippo Giunchedi: "See inline, also labmon will need to have wikimedia.org to its search domains to be able to resolve unqualified names:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398428 (owner: 10Muehlenhoff) [13:18:36] !log Stop replication in sync on db1100 and db2052 - T161294 [13:18:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:46] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [13:20:44] !log jmm@puppetmaster1001 conftool action : set/pooled=no; selector: mw1260.eqiad.wmnet [13:20:50] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1260.eqiad.wmnet [13:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:58] !log jmm@puppetmaster1001 conftool action : set/pooled=no; selector: mw1313.eqiad.wmnet [13:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:04] RECOVERY - Nginx local proxy to apache on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 9.847 second response time [13:22:29] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3844502 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` mw1329.eqiad.wmnet ``` The log can be foun... 
[13:22:54] imaging mw1329 (new appserver) --^ [13:23:44] RECOVERY - Apache HTTP on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.071 second response time [13:24:15] RECOVERY - HHVM rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 78356 bytes in 5.904 second response time [13:25:46] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1313.eqiad.wmnet [13:25:58] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1307.eqiad.wmnet [13:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:41] (03PS7) 10Gehel: Updates to enable short URLs for transliteration for crhwiki - beta [puppet] - 10https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones) [13:27:43] (03PS1) 10Gehel: Updates to enable short URLs for transliteration for crhwiki production [puppet] - 10https://gerrit.wikimedia.org/r/398832 (https://phabricator.wikimedia.org/T23582) [13:29:18] (03CR) 10Gehel: "I split the commit in 2 parts, 1) update beta, 2) update production. 
The second commit is https://gerrit.wikimedia.org/r/#/c/398832/" [puppet] - 10https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones) [13:29:56] !log Stop replication in sync on db1109 and db2045 - T161294 [13:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:09] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [13:30:50] (03CR) 10Filippo Giunchedi: Add a Prometheus exporter for PowerDNS (031 comment) [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/398447 (https://phabricator.wikimedia.org/T182970) (owner: 10Muehlenhoff) [13:34:58] (03PS8) 10Jcrespo: Update mariadb::proxy to the latest style and path locations [puppet] - 10https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) [13:37:10] (03PS3) 10Muehlenhoff: Add a Prometheus exporter for PowerDNS [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/398447 (https://phabricator.wikimedia.org/T182970) [13:37:19] (03CR) 10Muehlenhoff: Add a Prometheus exporter for PowerDNS (031 comment) [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/398447 (https://phabricator.wikimedia.org/T182970) (owner: 10Muehlenhoff) [13:38:27] (03CR) 10Filippo Giunchedi: [C: 031] Add a Prometheus exporter for PowerDNS [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/398447 (https://phabricator.wikimedia.org/T182970) (owner: 10Muehlenhoff) [13:41:13] (03PS1) 10Jcrespo: mariadb: Add mydumper to misc:s4 databases [puppet] - 10https://gerrit.wikimedia.org/r/398836 (https://phabricator.wikimedia.org/T183123) [13:41:23] (03PS1) 10Muehlenhoff: Remove Hiera host entry for palladium [puppet] - 10https://gerrit.wikimedia.org/r/398837 [13:41:27] (03PS2) 10Jcrespo: mariadb: Add mydumper to misc:s4 databases [puppet] - 10https://gerrit.wikimedia.org/r/398836 (https://phabricator.wikimedia.org/T183123) [13:42:04] (03CR) 10Jcrespo: "Blocker of 
T183123" [puppet] - 10https://gerrit.wikimedia.org/r/398836 (https://phabricator.wikimedia.org/T183123) (owner: 10Jcrespo) [13:42:43] (03CR) 10Elukey: mariadb: Add mydumper to misc:s4 databases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398836 (https://phabricator.wikimedia.org/T183123) (owner: 10Jcrespo) [13:43:11] (03CR) 10Jcrespo: "Yes, sorry." [puppet] - 10https://gerrit.wikimedia.org/r/398836 (https://phabricator.wikimedia.org/T183123) (owner: 10Jcrespo) [13:43:37] (03PS3) 10Jcrespo: mariadb: Add mydumper to misc:m4 databases [puppet] - 10https://gerrit.wikimedia.org/r/398836 (https://phabricator.wikimedia.org/T183123) [13:44:01] (03PS4) 10Jcrespo: mariadb: Add mydumper to misc:m4 databases [puppet] - 10https://gerrit.wikimedia.org/r/398836 (https://phabricator.wikimedia.org/T183123) [13:44:35] (03CR) 10Elukey: [C: 031] mariadb: Add mydumper to misc:m4 databases [puppet] - 10https://gerrit.wikimedia.org/r/398836 (https://phabricator.wikimedia.org/T183123) (owner: 10Jcrespo) [13:46:33] 10Operations, 10Patch-For-Review: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3844595 (10Volans) The reimage scripts should be back on track and work as expected. It was tested today with a couple of reimages. I cannot exclude we'll... [13:46:34] PROBLEM - puppet last run on labtestservices2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:47:26] 10Operations, 10Scap, 10Patch-For-Review: scap 3.7.4-2 is broken - https://phabricator.wikimedia.org/T183046#3844602 (10ArielGlenn) The control file seems to be wrong. It should be this https://phabricator.wikimedia.org/source/scap/browse/master/debian/control because of a move of python-semver from depen... 
[13:47:44] Operations, ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3844604 (Volans) @akosiaris reimages should be unblocked, see T182702#3844595
[13:47:54] Operations, DBA, Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#3844605 (jcrespo)
[13:48:07] Operations, Patch-For-Review: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3844606 (Marostegui) We still have to resolve the workaround on install1102, it is still in place as far as I remember.
[13:48:11] Operations, DBA, Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#1542524 (jcrespo)
[13:48:34] (CR) Jcrespo: [C: 2] mariadb: Add mydumper to misc:m4 databases [puppet] - https://gerrit.wikimedia.org/r/398836 (https://phabricator.wikimedia.org/T183123) (owner: Jcrespo)
[13:49:05] ^ transient or master issue I guess
[13:50:18] (CR) Filippo Giunchedi: [C: 2] First version [debs/prometheus-nutcracker-exporter] - https://gerrit.wikimedia.org/r/398505 (https://phabricator.wikimedia.org/T181995) (owner: Filippo Giunchedi)
[13:50:42] chasemp: transient, I've re-run puppet just fine
[13:51:34] RECOVERY - puppet last run on labtestservices2002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[13:52:14] (PS1) Filippo Giunchedi: Add debian/ and .gitreview [debs/prometheus-nutcracker-exporter] - https://gerrit.wikimedia.org/r/398839 (https://phabricator.wikimedia.org/T181995)
[13:52:56] (PS1) Muehlenhoff: Add rabbitmq jobs to Prometheus config for labmon [puppet] - https://gerrit.wikimedia.org/r/398840
[13:55:26] (CR) Muehlenhoff: [C: 1] "Looks good (please remember to build/upload for both jessie and stretch, the video scalers are already running stretch)." [debs/prometheus-nutcracker-exporter] - https://gerrit.wikimedia.org/r/398839 (https://phabricator.wikimedia.org/T181995) (owner: Filippo Giunchedi)
[13:55:44] (CR) Filippo Giunchedi: Add Prometheus exporter to WDQS servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/398073 (https://phabricator.wikimedia.org/T182773) (owner: Muehlenhoff)
[13:56:22] (CR) Filippo Giunchedi: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/398444 (owner: Gehel)
[13:56:29] Operations, ops-eqiad, Patch-For-Review, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3844631 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1329.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1329.eqiad.wmnet'] ```
[13:57:05] ah snap this was a timeout --^
[13:57:08] (CR) Filippo Giunchedi: [C: 1] Add rabbitmq jobs to Prometheus config for labmon [puppet] - https://gerrit.wikimedia.org/r/398840 (owner: Muehlenhoff)
[13:57:09] the host reimaged fine
[13:57:19] elukey: I can have a look
[13:57:21] at the logs
[13:58:12] !log starting one-time backup of eventlogging database on db1107:/srv/backups T183123
[13:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:24] T183123: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123
[13:58:26] volans: nono my fault, the first boot was stuck because the absence of rootdelay (md not mounted for the 'usual' issue)
[13:58:40] rest worked fine
[13:58:53] ah ok
[13:58:59] but then it didn't start or run puppet didn't it?
[13:59:11] it's done by the script since puppet4 client doesn't do it itself
[13:59:18] Operations, Discovery-Search (Current work), Goal, Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3844645 (Gehel) While migrating existing grafana dashboards, it looks like some dashboards are broken and most probably unused. W...
[13:59:39] volans: ah yes I am running it manually, I'll report with other appservers (need to reimage more of them)
[13:59:51] why manually? resume the reimage from there
[13:59:51] elukey: I may be overloading the database, you may be able to tell if 16 threads is too aggressive
[14:00:33] volans: wouldn't it go through again the d-i? It is not an issue, with install console is like a minute
[14:00:44] (CR) Rush: [C: 1] "+1 w/ a note" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/398840 (owner: Muehlenhoff)
[14:00:48] jynus: thanks a lot
[14:00:53] going to monitor the db
[14:01:01] elukey: no it will not with the right options
[14:01:53] (CR) Filippo Giunchedi: [C: 2] Add debian/ and .gitreview [debs/prometheus-nutcracker-exporter] - https://gerrit.wikimedia.org/r/398839 (https://phabricator.wikimedia.org/T181995) (owner: Filippo Giunchedi)
[14:01:55] (PS2) Rush: Add rabbitmq jobs to Prometheus config for labmon [puppet] - https://gerrit.wikimedia.org/r/398840 (owner: Muehlenhoff)
[14:02:13] volans: ok, next time I'll do it, don't be upset :)
[14:02:21] you should add --no-verify --no-pxe, but there is a catch... damn it's a corner case hard to code without risking race conditions
[14:02:59] if you run puppet agent --test once to create the certificate and then run the reimage with those additional options it should work as expected
[14:03:08] elukey: you can actually check the screen/ps to know which command I used
[14:03:28] jynus: ack!
[14:06:37] jouncebot: next
[14:06:37] In 359 hour(s) and 53 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180102T1400)
[14:06:40] cool
[14:07:38] hashar: waiting for next swat: https://i.imgur.com/DNsXXq9.jpg
[14:10:24] Operations, Goal, Patch-For-Review, User-fgiunchedi, cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3844667 (fgiunchedi)
[14:11:14] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-m4-master-00 consumer/mysql-eventbus
[14:11:26] yes this is me --^
[14:11:36] I stopped mysql insertion
[14:13:10] !log temporarily stopped mysql consumers on eventlog1001 to ease a mysql backup on db1107 - T183123
[14:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:21] T183123: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123
[14:21:59] (CR) Muehlenhoff: [C: 2] Add rabbitmq jobs to Prometheus config for labmon [puppet] - https://gerrit.wikimedia.org/r/398840 (owner: Muehlenhoff)
[14:24:56] (PS4) Giuseppe Lavagetto: puppet-compiler: extract facts from puppetDB [puppet] - https://gerrit.wikimedia.org/r/398795
[14:25:30] Operations, cloud-services-team: labcontrol1002 Error: unable to connect to node rabbit@labcontrol1002: nodedown - https://phabricator.wikimedia.org/T183144#3844727 (chasemp) p:Triage>Normal
[14:26:31] Operations, Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3844753 (Volans)
[14:26:35] Operations, cloud-services-team: labcontrol1002 Error: unable to connect to node rabbit@labcontrol1002: nodedown - https://phabricator.wikimedia.org/T183144#3844727 (chasemp)
[14:28:24] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/398795 (owner: Giuseppe Lavagetto)
[14:28:47] (CR) Tjones: "Thanks @Gehel! This definitely can wait until January and we don't need to try to get around the freeze." [puppet] - https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) (owner: Tjones)
[14:29:57] apergos: damn I did not know about that ....
[14:30:20] Operations, monitoring: Monitor resource usage on a per-cgroup basis - https://phabricator.wikimedia.org/T183146#3844761 (ema)
[14:30:20] the python-semver thing I mean
[14:30:23] it was somehow that way in the -2 package too
[14:30:30] Operations, monitoring: Monitor resource usage on a per-cgroup basis - https://phabricator.wikimedia.org/T183146#3844771 (ema) p:Triage>Normal
[14:30:30] but I don't understand how the wrong file made it in
[14:30:46] the master branch is .... advisory ?
[14:30:53] ugh really?
[14:30:53] I am not sure what it is used for tbh
[14:31:01] I was told to release from the release branch
[14:31:14] and so you did, and here we are
[14:31:19] yeah
[14:31:22] twice already
[14:31:29] well meh. do you want to wait for no_justification or thcipriani|afk to show up?
[14:31:33] this rabbithole is turning out a bit deep
[14:31:36] yes
[14:31:39] they were both here for the last dive down it
[14:31:40] k
[14:31:48] What's up?
[14:31:51] hey hey
[14:31:51] scap
[14:31:53] :-)
[14:31:55] * no_justification is up absurdly early
[14:31:56] :)
[14:31:56] as usual :-P
[14:32:18] I got 3.7.4-3 ready to push but apergos tells me it's not trusty compatible
[14:32:20] so that move of the python_semver or whatever it is,
[14:32:32] to suggests
[14:32:36] that's only in master I guess
[14:32:42] maybe not in the release branch?
[14:32:56] so guess what got built into -3: the one with that package in depends
[14:32:58] I think it should be in both branches IIRC (when I looked Friday)
[14:33:05] hm
[14:33:08] (PS5) Giuseppe Lavagetto: puppet-compiler: extract facts from puppetDB [puppet] - https://gerrit.wikimedia.org/r/398795
[14:33:19] PROBLEM - Apache HTTP on mw1329 is CRITICAL: connect to address 10.64.32.66 and port 80: Connection refused
[14:33:19] PROBLEM - MD RAID on mw1329 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:33:21] https://github.com/wikimedia/scap/blob/release/debian/control#L19
[14:33:40] Suggests: git-fat, python-semver, php5-cli | php-cli | hhvm
[14:33:51] well, that's as it should be
[14:34:01] so wtf happened?
[14:34:04] you mean the package does not have it that way ?
[14:34:04] nope
[14:34:17] I ar x the deb right from /srv/thing/pool
[14:34:24] ah yes indeed
[14:34:34] Depends: python, python-configparser, python-jinja2, python-psutil, python-pygments, python-requests, python-semver, python-six, python:any (<< 2.8), python:any (>= 2.7.5-5~), python-yaml, git, bash-completion, python-conftool
[14:34:57] is it in the setup.py?
[14:35:28] (PS1) Filippo Giunchedi: Add nutcracker_exporter profile [puppet] - https://gerrit.wikimedia.org/r/398847 (https://phabricator.wikimedia.org/T181995)
[14:35:35] it would be added as dependency by ${python:Depends}
[14:35:41] sure enough: https://github.com/wikimedia/scap/blob/release/requirements.txt
[14:35:48] it's in requirements.txt
[14:35:50] ggaaahhhh
[14:35:57] so yeah... python:Depends
[14:36:26] but that's not new, right ?
[14:36:32] 3.7.4-1 had that too from what I see
[14:36:39] yes and that didn't install either
[14:36:52] Does Depends in setup.py get added to the debian package, overriding what we put in control?
[14:36:53] yeah, 3.7.3 was probably the last release that didn't have it
[14:36:53] (PS1) Elukey: role::analytics_cluster::refinery: fix logrotate config [puppet] - https://gerrit.wikimedia.org/r/398848
[14:36:54] the trusty hosts are on 3.7.3 something
[14:37:25] not "new" but there was a long time between releases.
[14:37:26] it can't be
[14:37:33] git tag --contains 76987bf7
[14:37:33] 3.5.8
[14:37:33] 3.6.0
[14:37:33] 3.7.0
[14:37:33] 3.7.1
[14:37:34] 3.7.2
[14:37:35] 3.7.3
[14:37:36] (PS2) Tjones: Updates to enable short URLs for transliteration for crhwiki production [puppet] - https://gerrit.wikimedia.org/r/398832 (https://phabricator.wikimedia.org/T23582) (owner: Gehel)
[14:37:36] 3.7.4
[14:37:53] ok I have a guess about this
[14:37:55] gimme a sec
[14:38:33] oh, yeah, I guess this goes back to May
[14:39:26] (CR) Muehlenhoff: Add Prometheus exporter to WDQS servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/398073 (https://phabricator.wikimedia.org/T182773) (owner: Muehlenhoff)
[14:39:44] hehe
[14:39:49] so I built against jessie
[14:39:51] (PS3) Muehlenhoff: Add Prometheus exporter to WDQS servers [puppet] - https://gerrit.wikimedia.org/r/398073 (https://phabricator.wikimedia.org/T182773)
[14:39:56] :-/
[14:40:02] if I build against trusty I get Suggests: git-fat, python-semver, php5-cli | php-cli | hhvm
[14:40:07] jfc
[14:40:12] rabbit holes all the way down
[14:40:29] Operations, Wikimedia-Mailing-lists: Emails send by subscribers don't arrive on the mailing list Moderators-nl - https://phabricator.wikimedia.org/T181906#3844795 (herron) I've sent a test message through just now and see it in the list archive. Did you receive this message in your email? To trace logs...
[14:40:37] Operations: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656#3844797 (ema) p:Triage>Normal
[14:40:38] ok, let's use the trusty all across the border
[14:40:43] i guess so
[14:40:45] it should be reusable everywhere
[14:40:46] sigh
[14:41:16] fourth time's a charm? :-P :-D
[14:41:48] !log upgrade pinkunicorn to latest jessie point release (8.10) T182656
[14:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:00] T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656
[14:42:19] RECOVERY - Apache HTTP on mw1329 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time
[14:44:19] Operations, Goal, Patch-For-Review, User-fgiunchedi, cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3844813 (MoritzMuehlenhoff)
[14:44:28] ok done
[14:45:38] (CR) Elukey: [C: 2] role::analytics_cluster::refinery: fix logrotate config [puppet] - https://gerrit.wikimedia.org/r/398848 (owner: Elukey)
[14:46:48] Operations, Scap, Patch-For-Review: scap 3.7.4-2 is broken - https://phabricator.wikimedia.org/T183046#3844824 (akosiaris) >>! In T183046#3844602, @ArielGlenn wrote: > The control file seems to be wrong. It should be this https://phabricator.wikimedia.org/source/scap/browse/master/debian/control be...
[14:46:49] PROBLEM - Squid on install1002 is CRITICAL: connect to address 208.80.154.22 and port 8080: Connection refused
[14:47:12] thcipriani|afk: Also, while we're at it https://phabricator.wikimedia.org/D920
[14:47:17] that's me ^
[14:47:52] Operations, Traffic, Wikidata, wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3844828 (faidon) For wikiba.se, another option (4, I guess!) is to just host it outside of the #Traffic infrastructure and with a separate...
[14:48:13] ok
[14:48:30] no_justification: lol
[14:48:32] (CR) Giuseppe Lavagetto: [C: 2] puppet-compiler: extract facts from puppetDB [puppet] - https://gerrit.wikimedia.org/r/398795 (owner: Giuseppe Lavagetto)
[14:48:44] (PS6) Giuseppe Lavagetto: puppet-compiler: extract facts from puppetDB [puppet] - https://gerrit.wikimedia.org/r/398795
[14:48:51] \o/ _joe_
[14:49:24] (CR) Herron: [C: -1] "AFAIK with this config v3 agents will silently ignore the http_ options and use the default configtimeout. I think we'll want to either ap" [puppet] - https://gerrit.wikimedia.org/r/398484 (https://phabricator.wikimedia.org/T182585) (owner: Andrew Bogott)
[14:49:28] <_joe_> amn gerrit
[14:49:37] (PS3) Tjones: Updates to enable short URLs for transliteration for crhwiki production [puppet] - https://gerrit.wikimedia.org/r/398832 (https://phabricator.wikimedia.org/T23582) (owner: Gehel)
[14:49:55] (CR) Giuseppe Lavagetto: [C: 2] puppet-compiler: extract facts from puppetDB [puppet] - https://gerrit.wikimedia.org/r/398795 (owner: Giuseppe Lavagetto)
[14:50:12] no_justification: btw.. do the debian/ tags actually get used somehow ?
[14:50:33] I didn't see git-buildpackage using them so I am trying to figure out if they are "advisory" or not
[14:51:30] Operations, Scap, Patch-For-Review: scap 3.7.4-2 is broken - https://phabricator.wikimedia.org/T183046#3841983 (thcipriani) the documentation on the prep-side for packages is at https://wikitech.wikimedia.org/wiki/How_to_deploy_code/Scap which should probably get combined with https://wikitech.wikime...
[14:52:14] (PS1) Elukey: role::analytics_cluster::refinery: follow up on logrotate rules [puppet] - https://gerrit.wikimedia.org/r/398849
[14:52:21] akosiaris: I thought the debian/ tag was the default for gbp if you don't override in gbp.conf?
[14:52:29] RECOVERY - MD RAID on mw1329 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[14:52:30] (CR) Elukey: [V: 2 C: 2] role::analytics_cluster::refinery: follow up on logrotate rules [puppet] - https://gerrit.wikimedia.org/r/398849 (owner: Elukey)
[14:53:01] mw1329 is a new appserver
[14:54:07] (PS4) Muehlenhoff: Add Prometheus exporter to WDQS servers [puppet] - https://gerrit.wikimedia.org/r/398073 (https://phabricator.wikimedia.org/T182773)
[14:54:08] thcipriani|afk: yes, but that is only being used if you pass --git-tag. otherwise it's a noop
[14:54:12] (CR) Muehlenhoff: Add Prometheus exporter to WDQS servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/398073 (https://phabricator.wikimedia.org/T182773) (owner: Muehlenhoff)
[14:54:34] but from what I gather that's what's being done from https://wikitech.wikimedia.org/wiki/How_to_deploy_code/Scap#Tag_Debian_Version
[14:54:43] ok we badly need to merge the docs as you pointed out
[14:54:44] (CR) Gehel: [C: 1] "LGTM. We could keep the previous profile::prometheus::wdqs_updater class, and include it unconditionally from profile::wdqs (I think it wo" [puppet] - https://gerrit.wikimedia.org/r/398073 (https://phabricator.wikimedia.org/T182773) (owner: Muehlenhoff)
[14:55:21] (CR) Tjones: "I've included all the variants for all the projects, but not sure what to do about the ProxyPassMatch rule that is present for some projec" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/398832 (https://phabricator.wikimedia.org/T23582) (owner: Gehel)
[14:55:33] ./modules/scap/manifests/init.pp: $version = '3.7.4-1',
[14:55:41] (CR) Muehlenhoff: [C: 2] Add a Prometheus exporter for PowerDNS [debs/prometheus-pdns-exporter] - https://gerrit.wikimedia.org/r/398447 (https://phabricator.wikimedia.org/T182970) (owner: Muehlenhoff)
[14:55:53] +100 on merging those docs basically since they basically end with: and throw it over the fence and you're done!
[14:55:59] lol nice
[14:56:31] and the fence is named specifically
[14:56:41] (CR) Gehel: "minor knitpick" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/398073 (https://phabricator.wikimedia.org/T182773) (owner: Muehlenhoff)
[14:57:30] it says "usually" :)
[14:57:42] someone care to do the honors and bump the version there?
[14:58:00] * apergos eyes thcipriani|afk
[14:58:05] * thcipriani|afk does
[14:58:07] apergos: I have a patch already ready to merge. I was just waiting on thcipriani|afk and no_justification to wake up
[14:58:12] ah ha
[14:58:20] which I am not sure they've done yet so I am giving them time
[14:58:42] it is very early in sf time
[14:58:58] unless they follow a better routine than I start looking at laptop and emails before I even had breakfast
[14:59:25] I do the same dang thing.
[14:59:36] I had coffee at least..
[15:00:38] (PS5) Muehlenhoff: Add Prometheus exporter to WDQS servers [puppet] - https://gerrit.wikimedia.org/r/398073 (https://phabricator.wikimedia.org/T182773)
[15:00:53] anyway, lemme know when you are in a fully functioning condition and I'll merge patch and deploy 3.7.4-3
[15:02:05] (PS1) Thcipriani: Scap: bump version to 3.7.4-3 [puppet] - https://gerrit.wikimedia.org/r/398853 (https://phabricator.wikimedia.org/T183046)
[15:02:13] Operations, Wikimedia-Mailing-lists: Emails send by subscribers don't arrive on the mailing list Moderators-nl - https://phabricator.wikimedia.org/T181906#3844944 (Natuur12) Your email arrived at 15:25 (Dutch local time), I send another email using a hotmail adress at 15:31 (I'm sorry but I'm not going t...
[15:03:10] * apergos whistles... "it's beginning to look a lot like scapping"....
[15:03:36] Every server you go?
[15:03:42] akosiaris: I'm around now
[15:04:00] (CR) Gehel: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/398073 (https://phabricator.wikimedia.org/T182773) (owner: Muehlenhoff)
[15:05:19] thcipriani: ok merging then
[15:05:30] (CR) Alexandros Kosiaris: [C: 2] Bump scap to 3.7.4-3 [puppet] - https://gerrit.wikimedia.org/r/398822 (https://phabricator.wikimedia.org/T183046) (owner: Alexandros Kosiaris)
[15:05:37] (PS2) Alexandros Kosiaris: Bump scap to 3.7.4-3 [puppet] - https://gerrit.wikimedia.org/r/398822 (https://phabricator.wikimedia.org/T183046)
[15:05:40] (CR) Alexandros Kosiaris: [V: 2 C: 2] Bump scap to 3.7.4-3 [puppet] - https://gerrit.wikimedia.org/r/398822 (https://phabricator.wikimedia.org/T183046) (owner: Alexandros Kosiaris)
[15:07:53] Operations, Discovery-Search (Current work), Goal, Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3845015 (EBernhardson) load testing and percentiles could certainly go away. cluster recovery might be useful at some point in th...
[15:07:57] RECOVERY - Squid on install1002 is OK: TCP OK - 0.001 second response time on 208.80.154.22 port 8080
[15:08:43] (CR) Muehlenhoff: [V: 2 C: 2] Add Prometheus exporter for Blazegraph [debs/prometheus-blazegraph-exporter] - https://gerrit.wikimedia.org/r/398272 (https://phabricator.wikimedia.org/T182857) (owner: Muehlenhoff)
[15:09:07] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:09:10] (PS3) Muehlenhoff: Add Debianisation for prometheus-blazegraph-exporter [debs/prometheus-blazegraph-exporter] - https://gerrit.wikimedia.org/r/398277
[15:09:14] (CR) Muehlenhoff: [V: 2 C: 2] Add Debianisation for prometheus-blazegraph-exporter [debs/prometheus-blazegraph-exporter] - https://gerrit.wikimedia.org/r/398277 (owner: Muehlenhoff)
[15:09:54] Operations, Discovery-Search (Current work), Goal, Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3845027 (EBernhardson) Unrelated to dashboards, but for prometheus. We will likely need a fork (or an additional custom collecto...
[15:10:58] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[15:11:16] (CR) Giuseppe Lavagetto: First version of the helm chart scaffolding for production services (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/392619 (https://phabricator.wikimedia.org/T177397) (owner: Giuseppe Lavagetto)
[15:11:19] new scap installed \o/
[15:11:38] (Abandoned) Thcipriani: Scap: bump version to 3.7.4-3 [puppet] - https://gerrit.wikimedia.org/r/398853 (https://phabricator.wikimedia.org/T183046) (owner: Thcipriani)
[15:11:50] nice
[15:13:27] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[15:13:27] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:14:11] (PS3) Giuseppe Lavagetto: First version of the helm chart scaffolding for production services [deployment-charts] - https://gerrit.wikimedia.org/r/392619 (https://phabricator.wikimedia.org/T177397)
[15:14:28] spot-checking scap on tin, everything looks as expected
[15:15:57] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[15:16:46] !log reboot labtestvirt2003
[15:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:07] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[15:20:31] Operations, Discovery-Search (Current work), Goal, Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3845060 (Ottomata) To build: https://github.com/wikimedia/operations-debs-prometheus-jmx-exporter/blob/master/debian/README.Debian
[15:21:24] Operations, Discovery-Search (Current work), Goal, Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3845066 (Ottomata) I'm probably doing somethign wrong with the ~jessie1 and ~stretch1 versions. It's JVM right? So the same bui...
[15:23:06] akosiaris: whenever scap is installed everywhere I can give a noop README sync a try to verify all's working
[15:23:28] !log uploaded prometheus-blazegraph-exporter, prometheus-wdqs-updater-exporter and prometheus-pdns-exporter to apt.wikimedia.org
[15:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:43] Operations, Discovery-Search (Current work), Goal, Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3845075 (Ottomata) Hm, looks like the last line of README.debian got cut off. Should be: USENETWORK=yes GIT_PBUILDER_AUTOCONF...
[15:25:52] (PS2) Hashar: Enable jenkins on contint1001 reboot [puppet] - https://gerrit.wikimedia.org/r/392399
[15:25:57] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:26:29] (PS6) Muehlenhoff: Add Prometheus exporter to WDQS servers [puppet] - https://gerrit.wikimedia.org/r/398073 (https://phabricator.wikimedia.org/T182773)
[15:27:27] (CR) Muehlenhoff: [C: 2] Add Prometheus exporter to WDQS servers [puppet] - https://gerrit.wikimedia.org/r/398073 (https://phabricator.wikimedia.org/T182773) (owner: Muehlenhoff)
[15:28:52] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1329.eqiad.wmnet
[15:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:48] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:31:15] thcipriani: yeah I'm guessing around ~30 mins tops we should be able to test
[15:31:46] okie doke.
[15:33:27] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:34:15] (PS1) Muehlenhoff: Add Prometheus exporter for Blazegraph [puppet] - https://gerrit.wikimedia.org/r/398858
[15:35:27] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[15:36:52] (CR) Gehel: Add Prometheus exporter for Blazegraph (1 comment) [puppet] - https://gerrit.wikimedia.org/r/398858 (owner: Muehlenhoff)
[15:37:32] !log Stop db1100 and dbstore1002 in sync - T161294
[15:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:43] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294
[15:38:27] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:39:43] (CR) Muehlenhoff: "Actually, nutcracker is also used on silver which is trusty, so this needs an Upstart job." [debs/prometheus-nutcracker-exporter] - https://gerrit.wikimedia.org/r/398839 (https://phabricator.wikimedia.org/T181995) (owner: Filippo Giunchedi)
[15:42:57] Operations, ORES, Scoring-platform-team, Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3845155 (awight)
[15:44:52] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/398847 (https://phabricator.wikimedia.org/T181995) (owner: Filippo Giunchedi)
[15:45:31] !log stop and upgrade db1107 T183123
[15:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:43] T183123: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123
[15:46:00] (CR) Marostegui: [C: 2] mariadb: Add db1112 to s4-test [puppet] - https://gerrit.wikimedia.org/r/398826 (https://phabricator.wikimedia.org/T180788) (owner: Marostegui)
[15:46:08] (PS5) Marostegui: mariadb: Add db1112 to s4-test [puppet] - https://gerrit.wikimedia.org/r/398826 (https://phabricator.wikimedia.org/T180788)
[15:46:50] Operations, Puppet, Continuous-Integration-Config, Patch-For-Review, Release-Engineering-Team (Kanban): Get rid of "import realm.pp" in manifests/site.pp - https://phabricator.wikimedia.org/T154915#3845179 (hashar) a:hashar
[15:46:55] (PS2) Hashar: test: puppet-syntax now fails on deprecation notices [puppet] - https://gerrit.wikimedia.org/r/333012 (https://phabricator.wikimedia.org/T154915)
[15:47:18] !log Stop MySQL on db1111 to copy its content to db1112 - T180788
[15:47:23] Operations, Puppet, Continuous-Integration-Config, Patch-For-Review, Release-Engineering-Team (Kanban): Get rid of "import realm.pp" in manifests/site.pp - https://phabricator.wikimedia.org/T154915#3845181 (hashar)
[15:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:28] T180788: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788
[15:47:47] PROBLEM - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[15:47:57] Operations, Puppet, Continuous-Integration-Config, Patch-For-Review, Release-Engineering-Team (Kanban): Get rid of "import realm.pp" in manifests/site.pp - https://phabricator.wikimedia.org/T154915#2928315 (hashar)
[15:48:04] (PS1) Muehlenhoff: Depend on python-dateutil [debs/prometheus-blazegraph-exporter] - https://gerrit.wikimedia.org/r/398860
[15:48:15] (PS1) ArielGlenn: config setting to permit a list of wikis to be dumped in a specific order [dumps] - https://gerrit.wikimedia.org/r/398861
[15:48:29] (CR) Muehlenhoff: [V: 2 C: 2] Depend on python-dateutil [debs/prometheus-blazegraph-exporter] - https://gerrit.wikimedia.org/r/398860 (owner: Muehlenhoff)
[15:48:31] (PS3) Hashar: test: puppet-syntax now fails on deprecation notices [puppet] - https://gerrit.wikimedia.org/r/333012 (https://phabricator.wikimedia.org/T154915)
[15:48:35] (CR) jerkins-bot: [V: -1] config setting to permit a list of wikis to be dumped in a specific order [dumps] - https://gerrit.wikimedia.org/r/398861 (owner: ArielGlenn)
[15:48:47] (CR) Alexandros Kosiaris: [C: 1] First version of the helm chart scaffolding for production services [deployment-charts] - https://gerrit.wikimedia.org/r/392619 (https://phabricator.wikimedia.org/T177397) (owner: Giuseppe Lavagetto)
[15:51:14] Operations, Patch-For-Review, Release-Engineering-Team (Watching / External), Scoring-platform-team (Current), Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#3845197 (awight) @akosiaris Just a nudge, I'm waiting for your feedback...
[15:51:48] RECOVERY - haproxy failover on dbproxy1009 is OK: OK check_failover servers up 2 down 0
[15:52:01] (PS2) ArielGlenn: config setting to permit a list of wikis to be dumped in a specific order [dumps] - https://gerrit.wikimedia.org/r/398861
[15:53:48] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/398862
[15:53:51] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/398862
[15:55:05] (CR) Muehlenhoff: Add Prometheus exporter for Blazegraph (1 comment) [puppet] - https://gerrit.wikimedia.org/r/398858 (owner: Muehlenhoff)
[15:55:30] Operations, Patch-For-Review, Prometheus-metrics-monitoring, User-fgiunchedi: Port redis statistics to Prometheus - https://phabricator.wikimedia.org/T148637#3845220 (fgiunchedi) Initial dashboard at https://grafana-admin.wikimedia.org/dashboard/db/prometheus-redis
[15:55:35] (PS2) Muehlenhoff: Add Prometheus exporter for Blazegraph [puppet] - https://gerrit.wikimedia.org/r/398858
[15:56:26] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/398862 (owner: Marostegui)
[15:57:17] (PS1) Ottomata: Add ssl_array and ssl_string entries to kafka_config [puppet] - https://gerrit.wikimedia.org/r/398863
[15:57:40] (CR) jerkins-bot: [V: -1] Add ssl_array and ssl_string entries to kafka_config [puppet] - https://gerrit.wikimedia.org/r/398863 (owner: Ottomata)
[15:58:05] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/398862 (owner: Marostegui)
[15:58:16] (PS2) Ottomata: Add ssl_array and ssl_string entries to kafka_config [puppet] - https://gerrit.wikimedia.org/r/398863
[15:58:18] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/398862 (owner: Marostegui)
[16:01:23] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1099:3311 - T174569 (duration: 03m 03s)
[16:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:36] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[16:04:14] 3 minutes, that is weird...
[16:04:34] thcipriani: we are ready to go I think
[16:04:47] yep, was just about to ping you :)
[16:04:50] * thcipriani does
[16:06:49] (phew this logging always freaks me out)
[16:08:52] Operations, Goal, Patch-For-Review, User-fgiunchedi, cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3845267 (fgiunchedi)
[16:09:25] !log thcipriani@tin Synchronized README: noop sync to test scap 3.7.4-3 (duration: 03m 02s)
[16:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:22] akosiaris: hrm, 4 hostkey verification failures (looking for which hosts)... eventually timed out on mw133{0,2,4,5}
[16:10:48] sync seemed normal except for that from the scap side
[16:11:01] thcipriani: snap it is my fault
[16:11:17] they are new, need to be reimaged to become appservers
[16:11:29] going to put them inactive
[16:11:58] ok
[16:12:25] ok, then it looks like scap is doing fine.
[16:13:00] akosiaris: thank you for your help! I'll try to get some basic docs started in the repo asap.
[16:13:03] new hosts are from mw1329->mw1337
[16:13:44] thcipriani: thanks! And sorry for all the mess
[16:14:44] elukey: ah, yup, that accounts for all 8 of the hostkey failures and timeouts, makes sense. No urgency on this from my side, just making sure failures are accounted for.
[16:15:20] (PS1) Muehlenhoff: Add Prometheus scraper configs for WDQS updater and Blazegraph [puppet] - https://gerrit.wikimedia.org/r/398865
[16:15:54] !log elukey@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw133[0-7].eqiad.wmnet
[16:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:04] akosiaris: no worries, the more people that know about scap packaging, the better it gets (I hope :))!
[16:16:07] this should have fixed the issue --^
[16:16:22] mw1329 is already up and running but pooled=no
[16:17:10] thcipriani: since we are chatting, I'd have a question for you
[16:17:26] :)
[16:17:28] I'd need to put in service some appservers and some jobrunners (the above ones)
[16:17:48] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.8 (duration: 04m 58s)
[16:17:53] the appservers can stay in pooled=no
[16:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:31] for the holidays, but the jobrunners start automatically after the first puppet run (we can stop them of course afterwards)
[16:18:33] (Abandoned) Muehlenhoff: Add Blazegraph exporter to WDQS hosts [puppet] - https://gerrit.wikimedia.org/r/398280 (owner: Muehlenhoff)
[16:18:51] wondering if should stop my reimage work and wait for january or not
[16:19:36] thcipriani: --^
[16:21:25] elukey: these machines *should* be fine after they pull the latest code in; however, the next couple weeks would be bad times to find out there was a problem with that process. If it's not urgent I'd wait, but I'm usually overly cautious about this kind of thing :)
[16:21:27] (PS1) Muehlenhoff: Add labmon Prometheus scraper config for PowerDNS [puppet] - https://gerrit.wikimedia.org/r/398867
[16:21:34] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning.
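Editor's note: the selector in the 16:15:54 conftool log entry can be checked against the host range mentioned in the discussion (mw1329 to mw1337). The sketch below is only an illustration. It assumes the selector value is interpreted as a regular expression that must match the whole host name, which the log itself does not confirm, and the `select` helper and the candidate list are made up for this example.

```python
import re

# Selector value copied from the !log entry. Treating it as a regular
# expression is an assumption about how conftool interprets it.
SELECTOR = r"mw133[0-7].eqiad.wmnet"

# Host range named in the discussion: mw1329 -> mw1337 (hypothetical list).
candidates = ["mw%d.eqiad.wmnet" % n for n in range(1329, 1338)]


def select(pattern, hosts):
    """Return the hosts whose full name matches the pattern."""
    compiled = re.compile(pattern)
    return [h for h in hosts if compiled.fullmatch(h)]


matched = select(SELECTOR, candidates)
print(len(matched), matched[0], matched[-1])
```

Under that assumption the selector covers the eight hosts mw1330 to mw1337 and leaves out mw1329, which fits the remarks that mw1329 was already running and that eight hosts accounted for the failures. The unescaped dots in the selector would match any character, which is harmless here but looser than a literal host name.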
[16:21:58] (03PS2) 10Ottomata: Add documentation for .m suffix code to pagecounts-ez doc page [puppet] - 10https://gerrit.wikimedia.org/r/395517 (https://phabricator.wikimedia.org/T180452) (owner: 10Mforns) [16:22:02] (03CR) 10Ottomata: [V: 032 C: 032] Add documentation for .m suffix code to pagecounts-ez doc page [puppet] - 10https://gerrit.wikimedia.org/r/395517 (https://phabricator.wikimedia.org/T180452) (owner: 10Mforns) [16:23:02] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.11 [keeping static files] (duration: 01m 18s) [16:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:25] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:25:44] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [16:25:54] thcipriani: ack! 
I'll ask the question to the other opsens and let you guys know :) [16:28:25] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 11 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:39:52] (03PS2) 10Herron: facter: fix interface_primary under newer versions of facter [puppet] - 10https://gerrit.wikimedia.org/r/398120 (https://phabricator.wikimedia.org/T182819) [16:39:58] (03PS2) 10Ema: mtail: add varnishreqstats.mtail [puppet] - 10https://gerrit.wikimedia.org/r/398819 (https://phabricator.wikimedia.org/T177199) [16:44:24] (03PS1) 10Elukey: profile::mariadb::misc::el::master: apply data sanitization policies [puppet] - 10https://gerrit.wikimedia.org/r/398869 (https://phabricator.wikimedia.org/T108850) [16:45:44] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [16:46:27] (03PS1) 10Filippo Giunchedi: prometheus: recording rules for redis [puppet] - 10https://gerrit.wikimedia.org/r/398871 (https://phabricator.wikimedia.org/T148637) [16:47:06] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10hardware-requests: eqiad: (8) Hadoop expansion - FY 2017 / 2018 - https://phabricator.wikimedia.org/T182628#3845408 (10Milimetric) [16:51:34] (03PS6) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) [16:53:26] (03CR) 10Herron: "Thanks for testing and good point! Updated to ( .*)?$ in ps2." 
[puppet] - 10https://gerrit.wikimedia.org/r/398120 (https://phabricator.wikimedia.org/T182819) (owner: 10Herron) [16:56:19] (03PS2) 10Elukey: profile::mariadb::misc::el::master: apply data sanitization policies [puppet] - 10https://gerrit.wikimedia.org/r/398869 (https://phabricator.wikimedia.org/T108850) [17:00:37] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10hardware-requests: eqiad: (8) Hadoop expansion - FY 2017 / 2018 - https://phabricator.wikimedia.org/T182628#3845436 (10faidon) p:05Triage>03High [17:08:10] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: SCAP: Upload debian package version 3.7.4-3 - https://phabricator.wikimedia.org/T182347#3845445 (10akosiaris) 05Open>03Resolved Done, re-resolving finally [17:08:12] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3845447 (10akosiaris) [17:48:28] 10Operations, 10Release Pipeline, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Watching / External): Switch CI Docker Storage Driver to devicemapper - https://phabricator.wikimedia.org/T178663#3845604 (10greg) [17:51:53] !log run kafka preferred-replica-election on the analytics cluster to allow kafka1023 (new node) to become a partition leader [17:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:00] 10Operations, 10monitoring, 10Patch-For-Review: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225#3800970 (10Gehel) >>! In T177225#3841695, @Dzahn wrote: > Alright, Ganglia is purged from everything across the board, except 17 hosts now! :) They are: > > 4 x maps codfw (osm/post... 
[17:58:13] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: New WDQS clusters eqiad + codfw - https://phabricator.wikimedia.org/T182991#3845654 (10EBjune) @Gehel the budget has been approved [18:00:08] 10Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Allow contint-admins to interact with docker on CI hosts - https://phabricator.wikimedia.org/T182860#3845657 (10RobH) Please note this was provisionally approv... [18:11:23] !log installing python updates from stretch point release [18:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:06] 10Operations, 10ops-eqdfw, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3588945 (10Dzahn) I would take this ticket from the "OS installation" step forward but i am unsure if the hardware troubleshooting part... [18:20:01] (03CR) 10MaxSem: [C: 031] maps: Bump maximum zoom to 19 [puppet] - 10https://gerrit.wikimedia.org/r/394948 (https://phabricator.wikimedia.org/T180907) (owner: 10Gehel) [18:24:57] 10Operations, 10ops-eqdfw, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3588945 (10MoritzMuehlenhoff) We have stretch builds of HHVM and this should work (with some minor changes maybe). We should really pro... 
[18:25:38] Operations, ops-codfw, Cloud-VPS: Connect labtestvirt2003 eth1 and eth2 interface(s) to switch fabric - https://phabricator.wikimedia.org/T183167#3845718 (chasemp) p:Triage>Normal
[18:28:58] !log installing libxkbcommon updates from stretch point release
[18:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:20] !log installing xml2 updates from stretch point release
[18:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:15] PROBLEM - DPKG on boron is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[18:46:13] boron is me, should recover soon
[18:46:15] RECOVERY - DPKG on boron is OK: All packages OK
[18:49:03] (CR) Dduvall: "Thanks for considering my comments. I'll be testing this against mathoid/minikube again today." (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/392619 (https://phabricator.wikimedia.org/T177397) (owner: Giuseppe Lavagetto)
[18:51:40] hi! quick pointer to the code where we cannonically define the base URLs for each project? looking somewhere in mediawiki-config repo.... thx in advance!!
[18:52:05] AndyRussG: $wgServer combined with $wgArticlePath / $wgScriptPath
[18:52:32] There's probably complex magic happening though, as what wiki you're visiting is based on the virtual host name
[18:53:17] bawolff: thx!!!
[18:53:32] If you want to get that info in code, you shouldn't use those variables directly though, and instead use something like wfScript()
[18:54:05] Operations, ORES, Scoring-platform-team, Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3845797 (awight) Interesting hypothesis from IRC conversation: the sine waves could be a garbage collection artifact. Python includes some to...
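Editor's note: as a rough illustration of the "$wgServer combined with $wgArticlePath" answer, the sketch below joins the two settings outside of MediaWiki. The values shown are assumed examples rather than values read from the mediawiki-config repo, the `article_url` helper is hypothetical, and the percent-encoding only approximates what MediaWiki itself does. Inside MediaWiki the advice given in the channel still applies: use the provided functions instead of the raw variables.

```python
from urllib.parse import quote

# Assumed example values. The real ones are defined per wiki in configuration.
WG_SERVER = "https://en.wikipedia.org"   # stands in for $wgServer
WG_ARTICLE_PATH = "/wiki/$1"             # stands in for $wgArticlePath


def article_url(server, article_path, title):
    """Substitute the title for the $1 placeholder and prepend the server."""
    # Spaces become underscores, then the title is percent-encoded.
    encoded = quote(title.replace(" ", "_"), safe="/:")
    return server + article_path.replace("$1", encoded)


print(article_url(WG_SERVER, WG_ARTICLE_PATH, "Main Page"))
```

The `$1` placeholder marks where the page title goes, so the base URL for a project is whatever remains once the placeholder is removed.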
[18:57:39] (CR) Dduvall: [C: -1] First version of the helm chart scaffolding for production services (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/392619 (https://phabricator.wikimedia.org/T177397) (owner: Giuseppe Lavagetto)
[18:58:22] bawolff: yeah... This is actually not code that runs in a Mediawiki context. Rather, I need to reverse lookup from CentralNotice's internal concept of "project" and "language" to the
[18:58:24] # project column of the wmf.pageview_hourly table in Hive
[18:59:07] gehel: ah, thanks for the comment on ganglia ticket :) will remove that too, i just considered it my only blocker because we didnt have replacement for postgresql stats
[18:59:11] So I'm just making a configurable lookup function with a comment to coordinate with the requisite cluster config and analytics code
[18:59:34] (PS1) Ayounsi: LibreNMS: Add an IRCbot process [puppet] - https://gerrit.wikimedia.org/r/398898
[18:59:48] Ah. So central notice's concept is probably from wfWikiId() (e.g. english wikipedia = enwiki)
[18:59:48] can't think of any more beautious options
[19:00:05] So you probably want to look up the SiteConfiguration class (Warning, there be dragons over there)
[19:00:19] although the sites table in use at wikidata, might have the data in a cleaner way
[19:01:00] (PS1) Dzahn: maps: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/398899 (https://phabricator.wikimedia.org/T177225)
[19:01:04] bawolff: heheh no it's worse, it's a special CN config variable, $wgNoticeProject, that can globs several projects together
[19:01:35] ah, yeah, that's much worse :P
[19:01:37] mmmm can't access Mediawiki php from this code, tho if there's a Mediawiki API endpoint with this kinda stuff, that might help
[19:02:51] maybe the interwiki api end point
[19:03:44] e.g.
https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=interwikimap&sifilteriw=local [19:07:06] 10Operations, 10Puppet, 10Patch-For-Review: custom fact interface_primary breaks under newer versions of facter - https://phabricator.wikimedia.org/T182819#3845832 (10herron) p:05Normal>03Low [19:12:03] (03PS1) 10Rush: openstack: only run rabbitmq cleanup on active control node [puppet] - 10https://gerrit.wikimedia.org/r/398900 (https://phabricator.wikimedia.org/T183144) [19:12:10] bawolff: hmmmm looks like that would tangle it even more [19:12:29] (03CR) 10jerkins-bot: [V: 04-1] openstack: only run rabbitmq cleanup on active control node [puppet] - 10https://gerrit.wikimedia.org/r/398900 (https://phabricator.wikimedia.org/T183144) (owner: 10Rush) [19:16:28] (03PS2) 10Dzahn: maps: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/398899 (https://phabricator.wikimedia.org/T177225) [19:17:59] (03PS3) 10Dzahn: maps: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/398899 (https://phabricator.wikimedia.org/T177225) [19:18:06] (03CR) 10Dzahn: [C: 032] maps: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/398899 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [19:20:44] (03PS2) 10Rush: openstack: only run rabbitmq cleanup on active control node [puppet] - 10https://gerrit.wikimedia.org/r/398900 (https://phabricator.wikimedia.org/T183144) [19:21:24] (03CR) 10Dzahn: [C: 031] "looks interesting and good to me. if there is an issue it's easiest to just add code incrementally since it's a new feature anyways" [puppet] - 10https://gerrit.wikimedia.org/r/398898 (owner: 10Ayounsi) [19:21:47] (03PS3) 10Rush: openstack: only run rabbitmq cleanup on active control node [puppet] - 10https://gerrit.wikimedia.org/r/398900 (https://phabricator.wikimedia.org/T183144) [19:22:33] (03CR) 10Dzahn: [C: 031] "should the config (which channel does the bot join) be a parameter in puppet? 
(for (cloud) testing)" [puppet] - 10https://gerrit.wikimedia.org/r/398898 (owner: 10Ayounsi) [19:23:28] (03CR) 10Andrew Bogott: [C: 031] openstack: only run rabbitmq cleanup on active control node [puppet] - 10https://gerrit.wikimedia.org/r/398900 (https://phabricator.wikimedia.org/T183144) (owner: 10Rush) [19:30:05] 10Operations, 10monitoring, 10Patch-For-Review: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225#3845863 (10Dzahn) >>! In T177225#3845641, @Gehel wrote: > We are not actively using ganglia for maps, so we can remove those without any issue. Cool! Thanks for confirming. I had just... [19:30:58] (03PS5) 10ArielGlenn: rename 'otherdir' in the dumps modules [puppet] - 10https://gerrit.wikimedia.org/r/398034 [19:31:14] (03CR) 10jerkins-bot: [V: 04-1] rename 'otherdir' in the dumps modules [puppet] - 10https://gerrit.wikimedia.org/r/398034 (owner: 10ArielGlenn) [19:32:31] (03Abandoned) 10Dzahn: mysql eqiad: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/394518 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [19:33:24] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Investigate why ORES logs are being written to syslog despite explicit logging config. Fix. 
- https://phabricator.wikimedia.org/T182614#3845874 (10awight) Here's a fun debugging tool, https://pypi.python.org/pypi/logging_tree [19:34:14] (03CR) 10Andrew Bogott: [C: 031] Add nutcracker_exporter profile [puppet] - 10https://gerrit.wikimedia.org/r/398847 (https://phabricator.wikimedia.org/T181995) (owner: 10Filippo Giunchedi) [19:34:16] (03CR) 10Dzahn: site: decom ganglia-web host, rm aggregators, rm phab include (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/382904 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [19:36:15] (03PS6) 10ArielGlenn: rename 'otherdir' in the dumps modules [puppet] - 10https://gerrit.wikimedia.org/r/398034 [19:36:59] (03PS2) 10Dzahn: ganglia/site: decom ganglia-web node, rm eqiad/codfw aggregators [puppet] - 10https://gerrit.wikimedia.org/r/382904 (https://phabricator.wikimedia.org/T177225) [19:46:42] 10Operations, 10ops-ulsfo, 10Traffic: cp4032 memory error - https://phabricator.wikimedia.org/T183176#3845938 (10RobH) p:05Triage>03Normal [19:48:57] 10Operations, 10DC-Ops, 10monitoring: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#3845956 (10RobH) p:05Triage>03High [19:49:57] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4032.ulsfo.wmnet [19:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:32] 10Operations, 10ops-ulsfo, 10Traffic: cp4032 memory error - https://phabricator.wikimedia.org/T183176#3845938 (10RobH) [19:51:17] 10Operations, 10ops-ulsfo, 10Traffic: cp4032 memory error - https://phabricator.wikimedia.org/T183176#3845982 (10RobH) I've created another task, T183177 to track the fact this error wasn't shown in icinga. I've also depooled the system, and will be rebooting it into the Dell ePSA to attempt to get an error... 
[19:57:41] 10Operations, 10ops-ulsfo, 10Traffic: setup/deploy dns400[12]/wmf721[56] - https://phabricator.wikimedia.org/T179204#3845999 (10RobH) [19:57:43] 10Operations, 10ops-ulsfo: apply hostname labels to dns400[12] - https://phabricator.wikimedia.org/T180077#3845997 (10RobH) 05Open>03Resolved done [19:57:52] (03PS4) 10Dzahn: Add postgresql::prometheus class to postgresql users [puppet] - 10https://gerrit.wikimedia.org/r/392441 (https://phabricator.wikimedia.org/T177196) (owner: 10Alexandros Kosiaris) [19:58:31] (03CR) 10jerkins-bot: [V: 04-1] Add postgresql::prometheus class to postgresql users [puppet] - 10https://gerrit.wikimedia.org/r/392441 (https://phabricator.wikimedia.org/T177196) (owner: 10Alexandros Kosiaris) [20:00:04] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4032_v4, cp4032_v6 [20:00:04] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4032_v4, cp4032_v6 [20:00:04] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4032_v4, cp4032_v6 [20:00:14] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4032_v4, cp4032_v6 [20:00:15] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4032_v4, cp4032_v6 [20:00:17] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4032_v4, cp4032_v6 [20:00:17] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6 [20:00:17] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4032_v4, cp4032_v6 [20:00:24] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4032_v4, cp4032_v6 [20:00:25] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4032_v4, cp4032_v6 [20:00:25] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6 [20:00:34] PROBLEM - IPsec on cp1066 
is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4032_v4, cp4032_v6 [20:00:44] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4032_v4, cp4032_v6 [20:00:44] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4032_v4, cp4032_v6 [20:00:44] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4032_v4, cp4032_v6 [20:00:54] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6 [20:00:54] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6 [20:00:54] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6 [20:00:54] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6 [20:00:54] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4032_v4, cp4032_v6 [20:00:55] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6 [20:00:55] yeah thats expected [20:00:55] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6 [20:00:59] i took down cp4032 [20:01:08] and those are just gonna happen for it [20:01:33] (if those happen and no one chimes in like i just did, then its an actual issue ) [20:03:04] 10Operations, 10ops-ulsfo: Multiple systems in ulsfo 1.22 showing PSU failures - https://phabricator.wikimedia.org/T177622#3846011 (10RobH) [20:03:06] 10Operations, 10ops-ulsfo: check lvs4002 power supply redundancy - https://phabricator.wikimedia.org/T177623#3846007 (10RobH) 05Open>03declined system has a failed psu out of warranty, nothing to be done about it other than decom and replace, which is already tracked on T164327 this system is being replac... 
[20:03:41] 10Operations, 10ops-ulsfo: Multiple systems in ulsfo 1.22 showing PSU failures - https://phabricator.wikimedia.org/T177622#3664980 (10RobH) All of the linked closed/resolved/decline psu failures are valid failures, not just loose cables. Each of the systems has either been decommissioned, or is slated to be d... [20:06:19] (03PS1) 10Dzahn: labsdb100[467]: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/398903 (https://phabricator.wikimedia.org/T177225) [20:07:18] (03CR) 10Dzahn: [C: 032] labsdb100[467]: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/398903 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:08:29] (03PS2) 10Dzahn: labsdb100[467]: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/398903 (https://phabricator.wikimedia.org/T177225) [20:12:01] (03CR) 10Dzahn: [C: 031] "the existing views already got removed and maps isnt using it either. also see reasoning on Change-Id" [puppet] - 10https://gerrit.wikimedia.org/r/382906 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:13:15] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:13:26] ^ me, will be gone in a minute, decom'ing ganglia [20:15:39] (03CR) 10Dzahn: [C: 032] postgresql: remove all ganglia support [puppet] - 10https://gerrit.wikimedia.org/r/382906 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:16:06] (03PS3) 10Dzahn: osm: remove all ganglia support [puppet] - 10https://gerrit.wikimedia.org/r/382905 (https://phabricator.wikimedia.org/T177225) [20:16:14] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:16:14] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:16:36] (03CR) 10Dzahn: [C: 031] "per "osm has been done" and this goes first in the dependencies anyways" [puppet] - 10https://gerrit.wikimedia.org/r/382905 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:17:43] (03CR) 10Dzahn: [C: 032] osm: remove all ganglia support [puppet] - 10https://gerrit.wikimedia.org/r/382905 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:19:03] (03PS2) 10Dzahn: postgresql: remove all ganglia support [puppet] - 10https://gerrit.wikimedia.org/r/382906 (https://phabricator.wikimedia.org/T177225) [20:23:17] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Investigate why ORES logs are being written to syslog despite explicit logging config. Fix. - https://phabricator.wikimedia.org/T182614#3846052 (10awight) https://github.com/wiki-ai/ores/pull/241 [20:25:56] (03PS1) 10Dzahn: osm/postgres: remove ganglia diskstat plugin inclusion [puppet] - 10https://gerrit.wikimedia.org/r/398904 (https://phabricator.wikimedia.org/T177225) [20:27:04] (03CR) 10Dzahn: [C: 032] osm/postgres: remove ganglia diskstat plugin inclusion [puppet] - 10https://gerrit.wikimedia.org/r/398904 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:28:13] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:31:12] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:31:13] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:34:58] (03PS3) 10Dzahn: ganglia/site: decom ganglia-web node, rm eqiad/codfw aggregators [puppet] - 10https://gerrit.wikimedia.org/r/382904 (https://phabricator.wikimedia.org/T177225) [20:39:54] (03PS4) 10Dzahn: ganglia/site: decom ganglia-web node, rm eqiad/codfw aggregators [puppet] - 
10https://gerrit.wikimedia.org/r/382904 (https://phabricator.wikimedia.org/T177225) [20:40:30] (03CR) 10Dzahn: [C: 032] ganglia/site: decom ganglia-web node, rm eqiad/codfw aggregators [puppet] - 10https://gerrit.wikimedia.org/r/382904 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:44:52] PROBLEM - DPKG on install2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:45:02] PROBLEM - DPKG on install1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:45:05] that's me too, i got it [20:45:41] (03PS1) 10Dzahn: site/uranium fix "spare" -> "spare::system" typo [puppet] - 10https://gerrit.wikimedia.org/r/398910 [20:46:11] (03CR) 10Dzahn: [C: 032] site/uranium fix "spare" -> "spare::system" typo [puppet] - 10https://gerrit.wikimedia.org/r/398910 (owner: 10Dzahn) [20:47:34] Wow, lots of debugging messages in scap... [20:47:40] (03PS2) 10Ayounsi: LibreNMS: Add an IRCbot process [puppet] - 10https://gerrit.wikimedia.org/r/398898 [20:48:10] !log bawolff@tin Synchronized php-1.31.0-wmf.12/extensions/TemplateData/TemplateDataBlob.php: T118682 (duration: 00m 52s) [20:48:12] PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 11 seconds ago with 2 failures. Failed resources (up to 3 shown): Package[ganglia-monitor] [20:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:32] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 34 seconds ago with 2 failures. 
Failed resources (up to 3 shown): Package[ganglia-monitor] [20:49:25] !log install1002/2002 - killing all ganglia processes, decoming aggregators [20:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:52] RECOVERY - DPKG on install2002 is OK: All packages OK [20:50:02] RECOVERY - DPKG on install1002 is OK: All packages OK [20:53:11] !log reboot labtestvirt2003 [20:53:12] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:33] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:53:36] !log ganglia.wikimedia.org shut down just now after a deprecation period - service is out of commission - T177225 [20:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:46] T177225: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225 [20:57:09] w00t! 
[20:57:25] apergos: :) [21:00:00] !log uranium - apt-get remove ganglia-webfrontend, apache2 [21:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:25] (03CR) 10Dzahn: [C: 031] LibreNMS: Add an IRCbot process [puppet] - 10https://gerrit.wikimedia.org/r/398898 (owner: 10Ayounsi) [21:06:02] (03PS3) 10Dzahn: statsd: remove ganglia backend support [puppet] - 10https://gerrit.wikimedia.org/r/382923 (https://phabricator.wikimedia.org/T177225) [21:06:50] (03CR) 10Dzahn: [C: 032] statsd: remove ganglia backend support [puppet] - 10https://gerrit.wikimedia.org/r/382923 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [21:11:40] (03PS3) 10Ayounsi: LibreNMS: Add an IRCbot process [puppet] - 10https://gerrit.wikimedia.org/r/398898 [21:14:22] (03PS2) 10Dzahn: standard: decom ganglia plugin everywhere by default [puppet] - 10https://gerrit.wikimedia.org/r/382924 (https://phabricator.wikimedia.org/T177225) [21:14:28] (03CR) 10Ayounsi: [C: 032] LibreNMS: Add an IRCbot process [puppet] - 10https://gerrit.wikimedia.org/r/398898 (owner: 10Ayounsi) [21:14:42] (03PS4) 10Ayounsi: LibreNMS: Add an IRCbot process [puppet] - 10https://gerrit.wikimedia.org/r/398898 [21:20:48] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[21:22:09] (PS1) Ayounsi: LibreNMS: IRCbot puppet fixes [puppet] - https://gerrit.wikimedia.org/r/398922
[21:22:11] Operations: Can't fetch from gerrit after updating ssh keys - https://phabricator.wikimedia.org/T183193#3846293 (kaldari)
[21:22:42] (CR) Ayounsi: [C: 2] LibreNMS: IRCbot puppet fixes [puppet] - https://gerrit.wikimedia.org/r/398922 (owner: Ayounsi)
[21:25:48] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:26:19] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:28:18] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[21:33:09] (PS1) Ayounsi: LibrenNMS IRC bot, config typo [puppet] - https://gerrit.wikimedia.org/r/398940
[21:34:16] (CR) Ayounsi: [C: 2] LibrenNMS IRC bot, config typo [puppet] - https://gerrit.wikimedia.org/r/398940 (owner: Ayounsi)
[21:58:24] Operations: Can't fetch from gerrit after updating ssh keys - https://phabricator.wikimedia.org/T183193#3846293 (Dzahn) @kaldari the Gerrit SSH key is seperate from the production shell SSH key. Please check in Gerrit web UI at https://gerrit.wikimedia.org/r/#/settings/ssh-keys
[21:58:39] kaldari: https://gerrit.wikimedia.org/r/#/settings/ssh-keys
[21:58:49] mutante: Yeah, I added it there first.
[22:00:04] eh.. ok. in that case i will add the Gerrit tag to that ticket
[22:00:23] Operations, Gerrit: Can't fetch from gerrit after updating ssh keys - https://phabricator.wikimedia.org/T183193#3846450 (Dzahn)
[22:01:31] debug1: Executing proxy command: exec ssh -a -W gerrit.wikimedia.org:29418 bast1001.wikimedia.org
[22:01:39] kaldari: ^ try that without proxying
[22:01:48] like not via bast1001 when the target is gerrit
[22:02:25] my ssh config has lines like:
[22:02:26] Host *.wikimedia.org *.wmnet !gerrit.wikimedia.org !git-ssh.wikimedia.org
[22:02:33] to use it for all except for gerrit
[22:04:30] Operations, Gerrit: Can't fetch from gerrit after updating ssh keys - https://phabricator.wikimedia.org/T183193#3846502 (Dzahn) I noticed the line ``` debug1: Executing proxy command: exec ssh -a -W gerrit.wikimedia.org:29418 bast1001.wikimedia.org ``` But it should directly connect to gerrit on that...
[22:08:09] (PS1) Ayounsi: LibreNMS: Allow librenms to write file in $install_dir [puppet] - https://gerrit.wikimedia.org/r/399101
[22:08:31] (CR) jerkins-bot: [V: -1] LibreNMS: Allow librenms to write file in $install_dir [puppet] - https://gerrit.wikimedia.org/r/399101 (owner: Ayounsi)
[22:08:57] (CR) Ayounsi: "Note that this has been tested manually and works, but not sure if it's the best way of fixing the issue."
[puppet] - 10https://gerrit.wikimedia.org/r/399101 (owner: 10Ayounsi) [22:10:07] (03PS2) 10Ayounsi: LibreNMS: Allow librenms to write file in $install_dir [puppet] - 10https://gerrit.wikimedia.org/r/399101 [22:19:37] (03PS1) 10Andrew Bogott: bigbrother: catch exceptions thrown during tool restarts [puppet] - 10https://gerrit.wikimedia.org/r/399104 (https://phabricator.wikimedia.org/T183171) [22:21:16] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: New WDQS clusters eqiad + codfw - https://phabricator.wikimedia.org/T182991#3846607 (10RobH) 05Open>03stalled p:05Triage>03Normal [22:25:15] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: New WDQS clusters eqiad + codfw - https://phabricator.wikimedia.org/T182991#3840568 (10RobH) [22:30:38] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:31:12] (03PS3) 10Dzahn: standard: decom ganglia plugin everywhere by default [puppet] - 10https://gerrit.wikimedia.org/r/382924 (https://phabricator.wikimedia.org/T177225) [22:31:18] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:31:29] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:31:39] PROBLEM - puppet last run on dysprosium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:31:59] PROBLEM - puppet last run on mw1188 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:32:19] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:32:38] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:32:39] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:32:48] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:32:59] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:33:09] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:33:09] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:33:18] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:33:19] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:33:29] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:33:58] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:33:58] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:33:58] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:34:08] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:36:51] i cant confirm these. 
it's like it was a puppetmaster issue that is already over [22:37:19] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:37:47] (03PS5) 10Dzahn: standard: decom ganglia plugin everywhere by default [puppet] - 10https://gerrit.wikimedia.org/r/382924 (https://phabricator.wikimedia.org/T177225) [22:37:48] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:37:59] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:38:29] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [22:39:08] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:43:10] (03CR) 10Dzahn: [C: 032] standard: decom ganglia plugin everywhere by default [puppet] - 10https://gerrit.wikimedia.org/r/382924 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [22:47:08] (03PS2) 10Dzahn: standard: actually drop 'has_ganglia' param entirely [puppet] - 10https://gerrit.wikimedia.org/r/382926 (https://phabricator.wikimedia.org/T177225) [22:48:58] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:49:39] 10Operations, 10Gerrit: Can't fetch from gerrit after updating ssh keys - https://phabricator.wikimedia.org/T183193#3846731 (10kaldari) I already added the key via the Gerrit web UI and confirmed that it's there. I just now updated my ssh config to the following: ``` Host bast1001.wikimedia.org ProxyComma... [22:51:21] (03PS1) 10Ladsgroup: icinga: Add scoring-team for alerts of ores-extension [puppet] - 10https://gerrit.wikimedia.org/r/399109 (https://phabricator.wikimedia.org/T154175) [22:51:30] mutante: Are you in the office by any chance? 
[22:52:10] kaldari: not anymore today, no
[22:52:22] kaldari: does "ssh-add -l" list the key?
[22:52:41] mutante: hey, do you have a minute to check this out? https://gerrit.wikimedia.org/r/#/c/399109/
[22:52:51] to exclude any issues with the agent, you could try "ssh -i /path/to/private/key kaldari@..." too
[22:52:55] mutante: nope: "The agent has no identities."
[22:53:12] https://phabricator.wikimedia.org/T179246 <-- when is this going to be fixed ?
[22:53:23] kaldari: try the "ssh-add .ssh/id_rsa"
[22:53:27] (again)
[22:54:19] Amir1: easy enough since that contact group already exists
[22:54:56] yup :)
[22:55:07] (PS2) Dzahn: icinga: Add scoring-team for alerts of ores-extension [puppet] - https://gerrit.wikimedia.org/r/399109 (https://phabricator.wikimedia.org/T154175) (owner: Ladsgroup)
[22:55:30] mutante: cool, I successfully added the ssh identity: "Identity added: .ssh/id_rsa (kaldari@WMF1838.corp.wikimedia.org)", but still get Permission denied (publickey).
[22:55:50] (CR) BryanDavis: bigbrother: catch exceptions thrown during tool restarts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/399104 (https://phabricator.wikimedia.org/T183171) (owner: Andrew Bogott)
[22:57:18] kaldari: let's try with "ssh -i" directly specifying the new key, not even using the agent .. hrmm
[22:57:28] sure ...
[22:57:35] (CR) Dzahn: [C: 2] icinga: Add scoring-team for alerts of ores-extension [puppet] - https://gerrit.wikimedia.org/r/399109 (https://phabricator.wikimedia.org/T154175) (owner: Ladsgroup)
[22:58:58] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[22:58:58] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[22:59:11] mutante: Here's what I get from that...
[22:59:14] https://www.irccloud.com/pastebin/cwUPhlev/
[23:00:38] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:01:18] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[23:01:29] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[23:01:39] RECOVERY - puppet last run on dysprosium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[23:01:59] RECOVERY - puppet last run on mw1188 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[23:02:38] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[23:02:39] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[23:03:07] kaldari: i'm trying to check server logs, hold on
[23:03:09] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[23:03:09] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[23:03:16] (PS2) Andrew Bogott: bigbrother: catch exceptions thrown during tool restarts [puppet] - https://gerrit.wikimedia.org/r/399104 (https://phabricator.wikimedia.org/T183171)
[23:03:18] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[23:03:21] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[23:04:07] kaldari - AUTH FAILURE FROM 2620:0:861:2:208:80:154:149 no-matching-key
[23:04:14] (CR) BryanDavis: [C: 1] bigbrother: catch exceptions thrown during tool restarts [puppet] - https://gerrit.wikimedia.org/r/399104 (https://phabricator.wikimedia.org/T183171) (owner: Andrew Bogott)
[23:04:16] for some reason
[23:04:31] could there be an additional character in it when doing copy/paste in gerrit web ui or something?
[23:04:41] I'll double check...
[23:04:48] mutante: Thanks!
[23:05:13] Amir1: yw
[23:05:53] kaldari: all the errors just say "no matching key"... hmm
[23:05:56] (PS3) Andrew Bogott: bigbrother: catch exceptions thrown during tool restarts [puppet] - https://gerrit.wikimedia.org/r/399104 (https://phabricator.wikimedia.org/T183171)
[23:06:46] kaldari hi, are you using rsa?
[23:06:47] (CR) Andrew Bogott: [C: 2] bigbrother: catch exceptions thrown during tool restarts [puppet] - https://gerrit.wikimedia.org/r/399104 (https://phabricator.wikimedia.org/T183171) (owner: Andrew Bogott)
[23:07:04] paladox: yes
[23:07:35] 4096 bit rsa
[23:08:10] thanks
[23:09:20] hmm looking at https://www.irccloud.com/pastebin/cwUPhlev/ it is showing "debug1: kex: algorithm: ecdh-sha2-nistp256"
[23:09:34] mutante kaldari ^^
[23:09:46] in https://gerrit.wikimedia.org/r/#/settings/ssh-keys it says Algorithm "ssh-rsa" too?
[23:10:08] yes
[23:10:15] ssh-rsa
[23:11:01] mutante: the only difference I could find is that my ssh pub key file ends with a linebreak
[23:11:39] does it start with a duplicate "ssh-rsa" ?
[23:11:50] when i look at mine there is the Algorithm field
[23:11:59] and the actual key field.. and the latter does not have the "ssh-rsa" a second time
[23:12:41] yes, the key itself is "ssh-rsa AAAA..."
[23:12:51] paladox: do you know where gerrit stores it for real?
[23:12:51] so it's duplicated in the key field
[23:12:58] aah
[23:13:01] mutante yep a git repo
[23:13:15] try removing that second ssh-rsa
[23:13:18] ok...
[23:13:55] meh, i think that wasnt it either, and it's merely a display issue
[23:14:11] when i click into that field this changes
[23:14:18] no, it says invalid ssh key without the ssh-rsa at the front
[23:15:02] do I need to delete my previous ssh key from the GUI interface?
[23:15:16] oh, well, that's a good question
[23:15:20] nope, you can add as many as you want.
[23:15:38] could it still match just against the first one and then give up?
[23:15:45] nope.
[23:15:54] I have two. and it correctly gets my second one.
[23:15:59] ok.. uhm.. where's that git repo..
[23:16:05] All-Users
[23:19:37] is cloning that and paladox is finding the right ref
[23:20:24] it's something like
[23:20:25] refs/users/65/1665
[23:20:28] i know the ending
[23:20:38] but the middle part i am not sure where it gets that number from
[23:21:14] it's the last two digits
[23:21:16] delete the old key anyways since you won't use it?
[23:21:20] (meanwhile)
[23:22:57] ah
[23:22:59] thanks
[23:23:12] mutante try git checkout origin/users/78/78
[23:23:18] git checkout origin/users/78..
[23:23:25] it wont work for me as i am not an admin and i can only checkout my own branch
[23:23:37] error: pathspec 'origin/users/78/78' did not match any file(s) known to git.
[23:23:50] hmm
[23:24:09] mutante try git branch | grep 78
[23:24:27] actually git branch -a | grep 78
[23:24:51] there's a lot
[23:24:52] remotes/origin/starred-changes/78/100378/562
[23:25:18] are we looking for 78/* ?
[23:25:22] yep, it stores all users favourite changes in the repo.
[23:25:27] and i think so
[23:25:34] deleted the old key, but didn't help
[23:25:52] kaldari try recreating the key?
[23:26:11] yea, and that newline.. just in case
[23:26:17] sure...
[23:26:19] mutante https://gerrit.wikimedia.org/r/accounts/rkaldari@wikimedia.org/sshkeys
[23:26:51] paladox: i get like an empty array
[23:27:07] ah that means he deleted all the ssh keys
[23:27:13] )]}'
[23:27:14] []
[23:27:23] yep, he needs to add a key now
[23:27:35] (PS1) Jcrespo: mysql-package: add mysql service unit derived from the mariadb [software] - https://gerrit.wikimedia.org/r/399113
[23:27:38] the brackets kind of look uneven
[23:27:41] recreated it (including the newline), but still no luck
[23:28:39] i see a key on that URL now
[23:29:23] it ends in \u003d\u003d
[23:29:55] "algorithm": "ssh-rsa",
[23:29:57] "comment": "kaldari@WMF1838.corp.wikimedia.org",
[23:29:57] "valid": true
[23:30:28] kaldari try ssh -o KexAlgorithms=diffie-hellman-group-exchange-sha256 -i .ssh/id_rsa.pub -p 29418 kaldari@gerrit.wikimedia.org -v
[23:30:30] \u003d\u003d is the "==", right?
[23:31:04] https://www.irccloud.com/pastebin/pCdWhLd0/
[23:31:05] so when i compare that to my own key
[23:31:05] (PS1) Jcrespo: [WIP]mariadb: Add mysql 8.0-compatible template [puppet] - https://gerrit.wikimedia.org/r/399115
[23:31:12] using that same URL above but with my user
[23:31:25] my key does not end in that \u003d\u003d
[23:31:31] the rest looks similar
[23:31:49] kaldari ah
[23:31:55] i see what is happening now
[23:32:01] try using a capital K
[23:32:08] ie Kaldari
[23:32:35] That works!
[23:32:38] wtf
[23:32:39] :)
[23:32:46] another user hit that too
[23:32:50] :oo
[23:33:31] Yay!!
[23:34:13] I will file a task under #gerrit on phab as upstream needs to fix this. But the task can be used as a reminder
[23:34:15] paladox saves the day
[23:34:19] lol :)
[23:34:34] even upstream bug :p
[23:34:57] Thanks paladox!
[23:35:27] you're welcome :)
[23:35:56] mutante it's to do with us setting ldap lower case setting.
[23:36:53] paladox: it reminds me of discussions on gerrit changes about that
[23:37:09] yep
[23:37:20] as you say there was that other user
[23:37:20] that broke all git clones apparently.
[23:37:27] just didnt think of that at all anymore
[23:38:04] mutante https://phabricator.wikimedia.org/T183205
[23:38:59] Operations, Gerrit: Can't fetch from gerrit after updating ssh keys - https://phabricator.wikimedia.org/T183193#3846790 (kaldari) Open>Resolved a:kaldari Turns out the username for gerrit has to be uppercase for some reason: ``` WMF1838:~ kaldari$ ssh -i .ssh/id_rsa.pub -p 29418 Kaldari@gerri...
[23:39:23] paladox: wanna update https://phabricator.wikimedia.org/T183193#3846731 with the solution and link to that and upstream? :)
[23:39:35] and thanks!
[23:39:59] also thanks mutante for help troubleshooting!
[23:40:09] (CR) Legoktm: [C: -2] "Yes" [mediawiki-config] - https://gerrit.wikimedia.org/r/309066 (https://phabricator.wikimedia.org/T85847) (owner: Legoktm)
[23:40:25] yw
[23:41:17] Operations, Gerrit: Can't fetch from gerrit after updating ssh keys - https://phabricator.wikimedia.org/T183193#3846293 (Paladox) The problem is when we set ldap lowercase earlier this year it made it lowercase throughout the ui. Though they don't allow you to use different casing under ssh / git clone...
[23:41:57] (PS1) Chad: Remove unfinished/broken branch plugin [mediawiki-config] - https://gerrit.wikimedia.org/r/399116
[23:43:05] paladox: Does this mean I have to reclone all my repos? :(
[23:43:12] nope
[23:43:23] unless you used kaldari instead of Kaldari
[23:43:38] yes, I used kaldari instead of Kaldari :P
[23:43:52] kaldari: you can fix this with .gitconfig
[23:43:58] oh good
[23:44:10] [url "ssh://Kaldari@gerrit.wikimedia.org:29418/"]
[23:44:10] insteadOf = "ssh://kaldari@gerrit.wikimedia.org:29418/"
[23:44:10] of course we never get "wrong user" or anything, it's just "key doesnt match", nice touch
[23:45:28] legoktm: Thanks. That works!
[23:45:38] filed upstream https://bugs.chromium.org/p/gerrit/issues/detail?id=8004
[23:51:36] PROBLEM - Host cp4032 is DOWN: PING CRITICAL - Packet loss = 100%
[23:53:11] (PS3) Dzahn: standard: actually drop 'has_ganglia' param entirely [puppet] - https://gerrit.wikimedia.org/r/382926 (https://phabricator.wikimedia.org/T177225)
[23:54:36] (PS1) Chad: All kinds of pylint and other style fixes [mediawiki-config] - https://gerrit.wikimedia.org/r/399117
[23:55:30] (PS2) Chad: All kinds of pylint and other style fixes [mediawiki-config] - https://gerrit.wikimedia.org/r/399117
[23:55:40] cp4032 is rob working in ulsfo
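
For reference, the two .gitconfig lines pasted at 23:44:10 can be written out as one fragment. This is only a sketch built from those two lines: the file location (~/.gitconfig) and the indentation are assumptions, and the account name should be replaced with whatever capitalization Gerrit has on record for the user.

```ini
# ~/.gitconfig (assumed location; any config file git reads would work)
# Rewrite the lowercase-username remote prefix to the capitalized one.
[url "ssh://Kaldari@gerrit.wikimedia.org:29418/"]
	insteadOf = "ssh://kaldari@gerrit.wikimedia.org:29418/"
```

With this in place, existing clones keep the lowercase URL stored in their own .git/config and git substitutes the prefix when it connects, which is why no reclone was needed.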