[00:01:11] (03CR) 10Cmjohnson: [C: 032] Adding entries for db1113 and 1114 T182896 [puppet] - 10https://gerrit.wikimedia.org/r/399314 (owner: 10Cmjohnson) [00:19:46] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1113 and db1114 - https://phabricator.wikimedia.org/T182896#3850067 (10Cmjohnson) [00:20:10] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1113 and db1114 - https://phabricator.wikimedia.org/T182896#3837888 (10Cmjohnson) @Marostegui These are ready for installs. [00:56:51] legoktm: around? [00:57:06] ish [00:57:20] i have a breaking error with DPL after upgrading to 1.30.0 [00:57:29] can you file a bug? [00:57:39] ok can i assign it to you? [00:59:36] just cc me? [01:03:31] done [01:06:51] SantaC: https://github.com/wikimedia/mediawiki-extensions-Cite/commit/d6b1bdeff5ac2ade60414ab88c6d8d5ad19852ba [01:07:02] I'd suggest it's been broken since September 2016 [01:07:17] why did it only break after upgrading then [01:07:28] What did you upgrade from? [01:07:33] 1.29.2 [01:07:53] You weren't using the correct version of Cite in 1.29? [01:07:53] https://gerrit.wikimedia.org/r/#/c/311963/ [01:07:58] It's in REL1_29 [01:08:13] isn't cite in core? [01:08:17] https://github.com/wikimedia/mediawiki-extensions-Cite/blob/REL1_29/includes/Cite.php#L105 [01:08:18] No [01:08:51] "This extension comes with MediaWiki 1.21 and above." <_< [01:09:48] That doesn't mean it's in core [01:09:50] It's just bundled [01:09:56] that's what i meant [01:10:11] someone presumably didn't upgrade to 1.29 properly [01:10:36] Or... [01:10:58] i replaced extension with 1.30 snapshot of cite and the backtrace is the same [01:11:07] Yes, I didn't say it was fixed [01:11:13] I'm just saying, it should've been broken in 1.29 [01:11:19] https://github.com/wikimedia/mediawiki-extensions-DynamicPageList/blame/master/DPL.php#L79 [01:11:21] Code in DPL hasn't changed [01:13:56] pluh. 
well we upgraded to 1.29.0 on August 2nd and to 1.29.2 on Nov 28 [01:15:47] It's definitely in https://releases.wikimedia.org/mediawiki/1.29/mediawiki-1.29.2.tar.gz [01:48:12] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 28 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [01:53:12] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 13 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [02:20:28] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.12) (duration: 05m 45s) [02:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:17] paladox: the curl command got me "Unauthorized" too [03:39:33] RECOVERY - parsoid on ruthenium is OK: HTTP OK: HTTP/1.1 200 OK - 1564 bytes in 0.093 second response time [03:45:33] (03PS1) 10Dzahn: redis: delete ganglia monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/399326 (https://phabricator.wikimedia.org/T177225) [03:53:53] (03PS5) 10Dzahn: pybal: use lvs::config not ganglia_clusters to determine if appserver [puppet] - 10https://gerrit.wikimedia.org/r/382930 (https://phabricator.wikimedia.org/T177225) [03:55:20] (03CR) 10Dzahn: [C: 04-1] hiera/wmflib: drop ganglia_clusters variable entirely? [puppet] - 10https://gerrit.wikimedia.org/r/382931 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [03:57:34] 10Operations, 10Goal, 10Technical-Debt, 10User-fgiunchedi: Reduce technical debt in metrics monitoring - https://phabricator.wikimedia.org/T177195#3850316 (10Dzahn) [03:57:37] 10Operations, 10monitoring, 10Patch-For-Review: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225#3850314 (10Dzahn) 05Open>03Resolved Ganglia has been uninstalled from the fleet, the aggregators are gone, the roles and the module is deleted, the DNS name is removed, for all pur... 
[04:01:12] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 25 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [04:06:12] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 15 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [04:08:46] (03PS1) 10BryanDavis: bigbrother: rate limit emails [puppet] - 10https://gerrit.wikimedia.org/r/399338 [04:09:10] (03CR) 10jerkins-bot: [V: 04-1] bigbrother: rate limit emails [puppet] - 10https://gerrit.wikimedia.org/r/399338 (owner: 10BryanDavis) [04:10:46] (03PS2) 10BryanDavis: bigbrother: rate limit emails [puppet] - 10https://gerrit.wikimedia.org/r/399338 [04:11:07] (03CR) 10jerkins-bot: [V: 04-1] bigbrother: rate limit emails [puppet] - 10https://gerrit.wikimedia.org/r/399338 (owner: 10BryanDavis) [04:12:02] (03PS3) 10BryanDavis: bigbrother: rate limit emails [puppet] - 10https://gerrit.wikimedia.org/r/399338 [04:13:52] (03CR) 10BryanDavis: "needs testing" [puppet] - 10https://gerrit.wikimedia.org/r/399338 (owner: 10BryanDavis) [04:35:12] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:40:12] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 12 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:58:34] (03CR) 10Andrew Bogott: [C: 04-1] "This will install the production master cert, which probably isn't what we want" [puppet] - 10https://gerrit.wikimedia.org/r/398323 (owner: 10Andrew Bogott) [05:12:23] PROBLEM - HHVM rendering on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:22] RECOVERY - HHVM rendering on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 74821 bytes in 0.142 second response time [05:43:22] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, 
Errno: 1062, Errmsg: Error Duplicate entry 179433720 for key PRIMARY on query. Default database: enwiki. [Query snipped] [06:06:32] !log reset mailman password for tawikisource T183329 [06:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:44] T183329: Reset mailing list password for tawikisource - https://phabricator.wikimedia.org/T183329 [06:10:57] (03PS1) 10Marostegui: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399340 (https://phabricator.wikimedia.org/T174569) [06:15:16] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399340 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:16:55] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399340 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:17:07] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399340 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:18:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1105:3311 - T174569 (duration: 00m 52s) [06:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:40] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [06:19:31] !log Deploy schema change on db1105:3311 - T174569 [06:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:32] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:20:42] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:21:23] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 74768 bytes in 0.104 second response time [06:21:32] RECOVERY - Nginx local proxy to apache on 
mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.035 second response time [06:31:57] <_joe_> looking [06:47:42] (03PS1) 10Marostegui: db-eqiad.php: Depool db1101:3318, repool s8 db1099 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399342 (https://phabricator.wikimedia.org/T161294) [06:49:22] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:49:49] 10Operations, 10OCG-General, 10Readers-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3850459 (10Joe) [06:49:52] 10Operations, 10OCG-General, 10Patch-For-Review, 10Services (watching): Decommission OCG from production - https://phabricator.wikimedia.org/T177931#3850458 (10Joe) 05Open>03Resolved [06:50:04] 10Operations, 10OCG-General, 10Patch-For-Review, 10Services (watching): Decommission OCG from production - https://phabricator.wikimedia.org/T177931#3675377 (10Joe) Marked as resolved, I don't have much to do here. 
[06:50:46] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1101:3318, repool s8 db1099 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399342 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [06:50:51] 10Operations, 10TechCom: Create email alias for the TechCom - https://phabricator.wikimedia.org/T181027#3850462 (10Joe) 05Open>03Resolved [06:50:58] 10Operations, 10monitoring: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3850464 (10elukey) [06:51:00] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban, 10User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3850463 (10elukey) 05stalled>03Resolved [06:51:44] 10Operations, 10Kubernetes: Operations 2017-18 Q2 Program 6 umbrella task - https://phabricator.wikimedia.org/T178325#3850467 (10Joe) [06:51:47] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review, 10User-Joe: Create scaffolding of services templates for deployment in production/staging - https://phabricator.wikimedia.org/T177397#3850466 (10Joe) 05Open>03Resolved [06:52:10] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1101:3318, repool s8 db1099 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399342 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [06:52:23] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1101:3318, repool s8 db1099 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399342 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [06:52:26] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review, 10User-Joe: Create scaffolding of services templates for deployment in production/staging - https://phabricator.wikimedia.org/T177397#3657212 (10Joe) [06:53:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1099:3318 depool db1101:3318 - T161294 (duration: 00m 51s) [06:53:42] !log Stop replication in sync on db1101:3318 and db1109 - T161294 [06:53:46] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:47] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [06:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:02] PROBLEM - HHVM rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:56:03] PROBLEM - Apache HTTP on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:56:52] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 74768 bytes in 0.108 second response time [06:56:53] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.021 second response time [07:01:12] (03PS1) 10Marostegui: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399343 (https://phabricator.wikimedia.org/T161294) [07:04:08] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399343 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [07:05:51] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399343 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [07:05:55] 10Operations, 10ops-eqiad: mw1191 ipmi-sel cpu errors - https://phabricator.wikimedia.org/T179640#3731680 (10Joe) This might be not urgent, but almost two months before having a response on this ticket is quite a long time. Since we're decommissioning this host and its peers, we can just ignore it. 
[07:06:55] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399343 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [07:08:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1096:3315 - T161294 (duration: 00m 51s) [07:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:32] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [07:09:01] !log Stop replication in sync on db1096:3315 and db1100 - T161294 [07:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:28] (03PS1) 10Jcrespo: dbproxy: Preparing to reimage dbproxy1004 [puppet] - 10https://gerrit.wikimedia.org/r/399347 (https://phabricator.wikimedia.org/T183249) [07:20:43] (03PS2) 10Jcrespo: dbproxy: Change socket location for dbproxy1004 [puppet] - 10https://gerrit.wikimedia.org/r/399347 (https://phabricator.wikimedia.org/T183249) [07:21:36] (03CR) 10Jcrespo: [C: 032] dbproxy: Change socket location for dbproxy1004 [puppet] - 10https://gerrit.wikimedia.org/r/399347 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [07:24:18] !log upgrading and restarting dbproxy1004 [07:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:32] RECOVERY - Check whether ferm is active by checking the default input chain on dbproxy1001 is OK: OK ferm input default policy is set [07:28:13] (03PS1) 10Jcrespo: dbproxy: Preparing dbproxy1005 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/399349 (https://phabricator.wikimedia.org/T183249) [07:28:53] RECOVERY - Check whether ferm is active by checking the default input chain on dbproxy1004 is OK: OK ferm input default policy is set [07:29:07] (03CR) 10Jcrespo: [C: 032] dbproxy: Preparing dbproxy1005 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/399349 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [07:35:52] 
!log starting reimage of dbproxy1005 [07:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:16] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3850501 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1005.eqiad.wmnet'] ``` The log can be found in `/var/... [07:41:01] (03PS5) 10Giuseppe Lavagetto: Create an envoy docker image. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/396021 [07:54:08] (03PS1) 10Jcrespo: dbproxy: Fix dbproxy1005 and prepare dbproxy1007 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/399350 (https://phabricator.wikimedia.org/T183249) [07:54:40] (03CR) 10Jcrespo: [C: 032] dbproxy: Fix dbproxy1005 and prepare dbproxy1007 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/399350 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [07:56:48] !log starting reimage of dbproxy1007 [07:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:07] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3850504 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1007.eqiad.wmnet'] ``` The log can be found in `/var/... 
[08:03:00] (03PS1) 10Jcrespo: dbproxy: Prepare dbproxy1008 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/399351 (https://phabricator.wikimedia.org/T183249) [08:08:39] (03PS2) 10Muehlenhoff: Add library hint for libxv [puppet] - 10https://gerrit.wikimedia.org/r/399227 [08:10:32] (03PS2) 10ArielGlenn: enable dumps of big wikis to run in a fixed order [puppet] - 10https://gerrit.wikimedia.org/r/399158 [08:10:58] (03CR) 10ArielGlenn: [C: 032] enable dumps of big wikis to run in a fixed order [puppet] - 10https://gerrit.wikimedia.org/r/399158 (owner: 10ArielGlenn) [08:11:34] (03PS3) 10Muehlenhoff: Add library hint for libxv [puppet] - 10https://gerrit.wikimedia.org/r/399227 [08:11:37] !log Stop replication in sync on db1100 and dbstore1002 - T161294 [08:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:48] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [08:13:57] (03CR) 10Muehlenhoff: [C: 032] Add library hint for libxv [puppet] - 10https://gerrit.wikimedia.org/r/399227 (owner: 10Muehlenhoff) [08:19:39] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3850525 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1005.eqiad.wmnet'] ``` and were **ALL** successful. 
[08:22:22] (03CR) 10Jcrespo: [C: 032] dbproxy: Prepare dbproxy1008 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/399351 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [08:22:30] (03PS2) 10Jcrespo: dbproxy: Prepare dbproxy1008 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/399351 (https://phabricator.wikimedia.org/T183249) [08:24:17] !log starting reimage of dbproxy1008 [08:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:22] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3850530 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1008.eqiad.wmnet'] ``` The log can be found in `/var/... [08:26:42] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3850533 (10jcrespo) All proxies reimaged except the active ones: ``` dbproxy1002.eqiad.wmnet dbproxy1003.eqiad.wmnet dbproxy1006.eqiad.wmnet dbproxy1009.eqiad.wmnet dbproxy1010.eq... [08:30:02] !log installing rsync security updates [08:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:40] PROBLEM - Check size of conntrack table on dbproxy1007 is CRITICAL: Return code of 255 is out of bounds [08:31:40] PROBLEM - dhclient process on dbproxy1007 is CRITICAL: Return code of 255 is out of bounds [08:32:38] apparently, wmf-reimage downtimes things only at random [08:33:46] RECOVERY - dhclient process on dbproxy1007 is OK: PROCS OK: 0 processes with command name dhclient [08:34:26] PROBLEM - haproxy alive on dbproxy1007 is CRITICAL: CRITICAL check_alive invalid response [08:34:37] RECOVERY - Check size of conntrack table on dbproxy1007 is OK: OK: nf_conntrack is 0 % full [08:36:33] not at random, the problem is structural: [08:36:56] icinga race condition?
[08:37:38] yeah, it sets downtime initially, but then the host record gets removed in puppet and when it gets recreated during the first puppet run, it's not downtimed again [08:37:49] we could use the new mechanism via Hiera [08:38:05] the one that is often used to permanently downtime hosts which are in setup [08:38:17] but I do not think the installer can commit to puppet [08:38:34] ack, but that's also not simple since the reimage script would need git access [08:38:59] is not as much as it cannot [08:39:01] <_joe_> or we could do it for new installs [08:39:01] as it should not [08:39:09] <_joe_> you know before installing or reimaging [08:39:10] _joe_: this is not a new install [08:39:30] <_joe_> jynus: thanks for pointing it out, how does that make a difference? :) [08:39:41] [09:39] <_joe_> or we could do it for new installs [08:39:50] <_joe_> yeah I clarified next [08:39:57] <_joe_> it's not really different in any ways [08:40:19] <_joe_> the inconvenience is needing two commits to puppet per reimage [08:40:20] then the downtime dance of the installer is a waste of time and resources [08:40:21] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3850573 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1007.eqiad.wmnet'] ``` and were **ALL** successful. [08:40:24] <_joe_> your choice :) [08:40:51] <_joe_> it could be useful to set it for when one (re)images a few servers, for just one...
no [08:40:55] my point is icinga architecture should be better and handled more dynamically [08:41:08] <_joe_> I mostly agree [08:41:11] it is not the installer problem [08:41:12] <_joe_> s/architecture// [08:41:15] it is the icinga problem [08:41:27] PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100% [08:41:27] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100% [08:41:30] <_joe_> icinga is just not good at managing transient states [08:41:33] <_joe_> oh not agian [08:41:34] oh, here we go again [08:41:36] <_joe_> *again [08:41:45] ganeti again [08:41:56] <_joe_> seems so [08:42:05] having a look [08:42:07] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:42:07] PROBLEM - Host etcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [08:42:07] PROBLEM - Host fermium is DOWN: PING CRITICAL - Packet loss = 100% [08:42:07] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100% [08:42:17] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:42:17] PROBLEM - Host dysprosium is DOWN: PING CRITICAL - Packet loss = 100% [08:42:17] PROBLEM - Host neon is DOWN: PING CRITICAL - Packet loss = 100% [08:42:27] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:42:32] (03PS1) 10ArielGlenn: use hieras settings for a few more hardcoded paths in dumps profiles [puppet] - 10https://gerrit.wikimedia.org/r/399356 [08:42:57] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1113 and db1114 - https://phabricator.wikimedia.org/T182896#3850580 (10Marostegui) [08:43:17] PROBLEM - SSH on ganeti1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:44:17] !log powercycling ganeti1005 [08:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:56] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 385465 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:45:57] PROBLEM
- Host ganeti1005 is DOWN: PING CRITICAL - Packet loss = 100% [08:46:46] RECOVERY - Host ganeti1005 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [08:46:56] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 6509 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:47:16] RECOVERY - SSH on ganeti1005 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [08:47:22] <_joe_> ganeti1005 went down repeatedly in the last week [08:47:27] <_joe_> should we depool it? [08:48:26] RECOVERY - haproxy alive on dbproxy1007 is OK: OK check_alive uptime 335s [08:49:37] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=miscvar-status_type=5 [08:50:07] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3850584 (10MoritzMuehlenhoff) Happened again on ganeti1005. This time the box froze hard, no kernel/syslog logs and also nothing over mgmt (which again points towards a hardware e... [08:50:30] and I imagine the misc's 500 are piwik's [08:50:36] RECOVERY - Host dysprosium is UP: PING OK - Packet loss = 0%, RTA = 6.43 ms [08:50:36] RECOVERY - Host etcd1005 is UP: PING OK - Packet loss = 0%, RTA = 4.77 ms [08:50:46] RECOVERY - Host neon is UP: PING OK - Packet loss = 0%, RTA = 7.44 ms [08:50:46] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 7.90 ms [08:50:50] _joe_: all of 1005-1008 went down frequently unfortunately and I'm not sure we have the capacity to depool them all [08:50:50] I don't know, is it worth it? 
[08:50:52] I really hope that the database is not corrupted [08:50:56] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 6.43 ms [08:50:56] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 6.61 ms [08:50:56] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 6.29 ms [08:50:57] (again) [08:51:06] 1006 is already removed for the reimage [08:51:06] RECOVERY - Host fermium is UP: PING OK - Packet loss = 0%, RTA = 7.18 ms [08:51:16] RECOVERY - Host actinium is UP: PING OK - Packet loss = 0%, RTA = 8.79 ms [08:51:17] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 7.57 ms [08:51:37] elukey: you should configure mysql with the same parameters as our production to avoid corruption [08:51:40] Chris updated the firmware, but it needs a reimage since Alex ran ltpstress (and it's recommended to reinstall after running it) [08:52:17] I do not think 1006 broke after upgrade? [08:52:19] jynus: definitely, super ignorant about it so I'll try to figure out how to do it [08:52:32] jynus: actually it is a mysql 5.5, not sure if crash safe slaves is available on 5.5, I believe only on 5.6 [08:52:53] marostegui: is it mysql 5.5- then there you have your problem :-D [08:52:56] jynus: 1006 is depooled so far [08:53:00] :) [08:53:08] moritzm: that would explain it :-) [08:53:18] it needs a reimage, Alex started it yesterday, but I think he ran into a problem [08:53:25] Yeah, crash safe is only available on 5.6, so... [08:53:31] (just checked) [08:53:34] seems the initial puppet run isn't completed, let's wait for him to be around [08:54:10] marostegui: I want 5.6! 
:D [08:54:19] (03CR) 10ArielGlenn: [C: 032] use hieras settings for a few more hardcoded paths in dumps profiles [puppet] - 10https://gerrit.wikimedia.org/r/399356 (owner: 10ArielGlenn) [08:54:33] elukey: it is literally 2 options: innodb_flush_log_at_trx_commit=1 and persistent replication options (different on mariadb and mysql) [08:58:39] I am around [08:59:08] so... ganeti1006 did not get reimaged yesterday? I thought luca had committed the numa=off change... lemme check what happened [08:59:59] hmm it was stuck in the d-I waiting for an [09:00:04] it's proceeding now [09:00:07] weird [09:00:35] anyway, I'll proceed with the reimage and remove VMs from the other ganeti hosts [09:00:37] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 [09:00:47] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0 [09:02:47] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=miscvar-status_type=5 [09:05:24] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3850609 (10akosiaris) >>! In T181121#3849219, @Volans wrote: > @akosiaris if you're trying to reimage those as Jessie, we still have the netinst issue open, so you need to set num...
[09:05:50] (03PS1) 10ArielGlenn: move one more hardcoded path from dumps profiles to hiera [puppet] - 10https://gerrit.wikimedia.org/r/399358 [09:07:58] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([scb2001.codfw.wmnet]) [09:07:58] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([scb2001.codfw.wmnet]) [09:08:32] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3850615 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1008.eqiad.wmnet'] ``` and were **ALL** successful. [09:10:21] (03PS1) 10Jcrespo: haproxy: Update haproxy systemd.unit to that of stretch [puppet] - 10https://gerrit.wikimedia.org/r/399359 (https://phabricator.wikimedia.org/T183249) [09:10:27] (03CR) 10ArielGlenn: [C: 032] move one more hardcoded path from dumps profiles to hiera [puppet] - 10https://gerrit.wikimedia.org/r/399358 (owner: 10ArielGlenn) [09:11:02] ^ for the above, scb2001 is currently depooled for service restarts related to the openssl update [09:11:15] but standard depooling, not sure why that alerts [09:11:42] I'll repool to see whether that helps [09:12:03] (03PS1) 10Elukey: profile::druid::monitoring: restrict jmx mbeans to query [puppet] - 10https://gerrit.wikimedia.org/r/399360 (https://phabricator.wikimedia.org/T183273) [09:12:07] (03CR) 10Marostegui: [C: 031] haproxy: Update haproxy systemd.unit to that of stretch [puppet] - 10https://gerrit.wikimedia.org/r/399359 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [09:12:15] it was depooled with "no", not removed from the config with "inactive" [09:12:58] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal [09:12:58] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal [09:13:30] (03PS2) 10Elukey: 
profile::druid::monitoring: restrict jmx mbeans to query [puppet] - 10https://gerrit.wikimedia.org/r/399360 (https://phabricator.wikimedia.org/T183273) [09:15:54] (03PS2) 10Jcrespo: haproxy: Update haproxy systemd.unit to that of stretch [puppet] - 10https://gerrit.wikimedia.org/r/399359 (https://phabricator.wikimedia.org/T183249) [09:16:17] (03CR) 10jerkins-bot: [V: 04-1] haproxy: Update haproxy systemd.unit to that of stretch [puppet] - 10https://gerrit.wikimedia.org/r/399359 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [09:18:05] (03PS3) 10Jcrespo: haproxy: Update haproxy systemd.unit to that of stretch [puppet] - 10https://gerrit.wikimedia.org/r/399359 (https://phabricator.wikimedia.org/T183249) [09:24:13] !log disable puppet on dbproxies for gerrit:399359 deployment [09:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:46] (03CR) 10Jcrespo: [C: 032] haproxy: Update haproxy systemd.unit to that of stretch [puppet] - 10https://gerrit.wikimedia.org/r/399359 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [09:25:59] (03CR) 10Filippo Giunchedi: "LGTM, see inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/399160 (owner: 10Alexandros Kosiaris) [09:26:57] (03CR) 10Filippo Giunchedi: [C: 031] profile::druid::monitoring: restrict jmx mbeans to query [puppet] - 10https://gerrit.wikimedia.org/r/399360 (https://phabricator.wikimedia.org/T183273) (owner: 10Elukey) [09:27:28] (03CR) 10Filippo Giunchedi: [C: 031] Add PowerDNS exporter to labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/399152 (owner: 10Muehlenhoff) [09:27:57] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, should work as-is once https://gerrit.wikimedia.org/r/#/c/399152 is merged" [puppet] - 10https://gerrit.wikimedia.org/r/398867 (owner: 10Muehlenhoff) [09:29:31] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:29:32] (03CR) 10Elukey: [C: 032] profile::druid::monitoring: restrict jmx mbeans to query [puppet] - 10https://gerrit.wikimedia.org/r/399360 (https://phabricator.wikimedia.org/T183273) (owner: 10Elukey) [09:29:39] (03PS3) 10Elukey: profile::druid::monitoring: restrict jmx mbeans to query [puppet] - 10https://gerrit.wikimedia.org/r/399360 (https://phabricator.wikimedia.org/T183273) [09:30:43] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399363 [09:30:45] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399363 [09:31:00] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:31:00] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:31:10] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:31:10] PROBLEM - puppet last run on mw1186 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:31:10] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:31:10] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:31:11] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [09:31:11] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [09:31:20] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:31:50] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:31:50] PROBLEM - Apache HTTP on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:50] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:32:30] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:32:31] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:33:00] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:33:20] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:33:40] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:33:41] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:34:20] nitrogen --^ [09:34:23] PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:34:44] puppetdb restarted 7m ago [09:35:29] oh, that explains it [09:36:30] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399363 (owner: 10Marostegui) [09:36:38] (03CR) 10Elukey: [C: 031] prometheus: add nutcracker job [puppet] - 10https://gerrit.wikimedia.org/r/399163 (https://phabricator.wikimedia.org/T181995) (owner: 10Filippo Giunchedi) [09:37:55] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399363 (owner: 10Marostegui) [09:38:07] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399363 (owner: 10Marostegui) [09:39:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1105:3311 - T174569 (duration: 00m 51s) [09:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:59] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [09:47:20] (03PS2) 10Filippo Giunchedi: prometheus: add nutcracker job [puppet] - 10https://gerrit.wikimedia.org/r/399163 (https://phabricator.wikimedia.org/T181995) [09:47:20] PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:49:00] PROBLEM - Nginx local proxy to apache on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:49:16] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add nutcracker job [puppet] - 10https://gerrit.wikimedia.org/r/399163 (https://phabricator.wikimedia.org/T181995) (owner: 10Filippo Giunchedi) [09:50:22] (03PS4) 10Muehlenhoff: Add PowerDNS exporter to labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/399152 [09:50:33] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Can't docker pull from docker-registry.discovery.wmnet - 
https://phabricator.wikimedia.org/T183342#3850688 (10hashar) [09:53:14] 10Operations: Reclaim lawrencium - https://phabricator.wikimedia.org/T183343#3850700 (10akosiaris) [09:53:38] 10Operations: Reclaim lawrencium - https://phabricator.wikimedia.org/T183343#3850713 (10akosiaris) [09:53:40] 10Operations, 10Performance-Team, 10Patch-For-Review: setup/install lawrencium for temp use by performance team - https://phabricator.wikimedia.org/T179968#3850714 (10akosiaris) [09:54:06] 10Operations: Reclaim lawrencium - https://phabricator.wikimedia.org/T183343#3850700 (10akosiaris) [09:54:14] (03CR) 10Muehlenhoff: [C: 032] Add PowerDNS exporter to labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/399152 (owner: 10Muehlenhoff) [09:55:05] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Can't docker pull from docker-registry.discovery.wmnet - https://phabricator.wikimedia.org/T183342#3850717 (10hashar) [09:55:19] (03PS1) 10Alexandros Kosiaris: Reclaim lawrencium [puppet] - 10https://gerrit.wikimedia.org/r/399367 (https://phabricator.wikimedia.org/T183343) [09:56:19] !log restart dbproxy1001 to test cold service start [09:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:43] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[prometheus-pdns-exporter] [09:58:44] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:59:13] labservices is me, fix is under way [09:59:14] RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:59:34] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:01:03] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:01:03] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:01:04] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:01:04] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:01:04] RECOVERY - puppet last run on mw1186 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:01:13] RECOVERY - puppet last run on nitrogen is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:01:15] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:01:36] (03PS1) 10Muehlenhoff: Add Upstart job [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/399368 [10:01:44] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:01:53] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:02:33] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:02:33] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, 
last run 4 minutes ago with 0 failures [10:02:54] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:03:21] (03CR) 10Alexandros Kosiaris: [C: 032] Reclaim lawrencium [puppet] - 10https://gerrit.wikimedia.org/r/399367 (https://phabricator.wikimedia.org/T183343) (owner: 10Alexandros Kosiaris) [10:03:43] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:04:03] (03CR) 10Muehlenhoff: [C: 032] Add Upstart job [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/399368 (owner: 10Muehlenhoff) [10:06:53] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 9 minutes ago with 0 failures [10:13:43] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:17:31] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Can't docker pull from docker-registry.discovery.wmnet - https://phabricator.wikimedia.org/T183342#3850763 (10hashar) [10:22:59] 10Operations, 10Graphite, 10Nodepool, 10Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997#3850770 (10fgiunchedi) [10:32:34] 10Operations, 10Patch-For-Review: Firewall sets not being loaded post-reboot due to a @resolve race on jessie - https://phabricator.wikimedia.org/T148986#2738609 (10jcrespo) We believe the workaround for T166653, resulting on ferm loading late also affects network state negatively, causing haproxy automatic re... 
[10:39:15] (03PS2) 10Elukey: role::druid::analytics::worker: review jvm configurations [puppet] - 10https://gerrit.wikimedia.org/r/399205 [10:49:34] (03PS1) 10Jcrespo: haproxy: Add workaround for ferm starting too late [puppet] - 10https://gerrit.wikimedia.org/r/399377 (https://phabricator.wikimedia.org/T148986) [10:53:08] 10Operations: lawrencium's iDRAC misbehaving IPMI wise - https://phabricator.wikimedia.org/T183349#3850840 (10akosiaris) [10:57:32] !log remove old kernel packages from silver.wikimedia.org to free space [10:57:39] (03CR) 10Marostegui: [C: 031] haproxy: Add workaround for ferm starting too late [puppet] - 10https://gerrit.wikimedia.org/r/399377 (https://phabricator.wikimedia.org/T148986) (owner: 10Jcrespo) [10:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:51] (03PS3) 10Elukey: role::druid::analytics::worker: review jvm configurations [puppet] - 10https://gerrit.wikimedia.org/r/399205 [11:08:39] (03CR) 10Jcrespo: [C: 032] haproxy: Add workaround for ferm starting too late [puppet] - 10https://gerrit.wikimedia.org/r/399377 (https://phabricator.wikimedia.org/T148986) (owner: 10Jcrespo) [11:11:20] !log Stop replication in sync on db1100 and db1071 - T161294 [11:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:31] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [11:14:38] (03PS4) 10Elukey: role::druid::analytics::worker: review jvm configurations [puppet] - 10https://gerrit.wikimedia.org/r/399205 [11:15:14] (03PS8) 10Alexandros Kosiaris: Introduce profile::prometheus::k8s::staging [puppet] - 10https://gerrit.wikimedia.org/r/399160 [11:15:37] (03CR) 10Alexandros Kosiaris: Introduce profile::prometheus::k8s::staging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/399160 (owner: 10Alexandros Kosiaris) [11:15:50] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Introduce profile::prometheus::k8s::staging [puppet] - 
10https://gerrit.wikimedia.org/r/399160 (owner: 10Alexandros Kosiaris) [11:16:04] (03PS9) 10Alexandros Kosiaris: Introduce profile::prometheus::k8s::staging [puppet] - 10https://gerrit.wikimedia.org/r/399160 [11:17:24] (03CR) 10Elukey: [C: 032] role::druid::analytics::worker: review jvm configurations [puppet] - 10https://gerrit.wikimedia.org/r/399205 (owner: 10Elukey) [11:17:41] (03PS5) 10Elukey: role::druid::analytics::worker: review jvm configurations [puppet] - 10https://gerrit.wikimedia.org/r/399205 [11:17:43] (03CR) 10Elukey: [V: 032 C: 032] role::druid::analytics::worker: review jvm configurations [puppet] - 10https://gerrit.wikimedia.org/r/399205 (owner: 10Elukey) [11:18:15] (03CR) 10Alexandros Kosiaris: "forgot to mention PCC was happy at https://puppet-compiler.wmflabs.org/compiler02/9430/" [puppet] - 10https://gerrit.wikimedia.org/r/399160 (owner: 10Alexandros Kosiaris) [11:18:24] !log restart dbproxy1005 to test cold service start [11:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:47] akosiaris: I can merge or I'll let you do it whenever you are ok :) [11:19:26] elukey: yes, merge please [11:19:27] I was about to [11:19:47] done! [11:21:17] 10Operations, 10Patch-For-Review: Reclaim lawrencium - https://phabricator.wikimedia.org/T183343#3850885 (10akosiaris) 05Open>03Resolved a:03akosiaris lawrencium reclaimed per last comment in T176361 [11:21:42] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] "I built this locally and tested envoy starts correctly and proxies http calls. I'll create a child image with TLS support next." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/396021 (owner: 10Giuseppe Lavagetto) [11:25:02] (03PS1) 10Alexandros Kosiaris: Instruct prometheus to gather postgresql metrics [puppet] - 10https://gerrit.wikimedia.org/r/399380 (https://phabricator.wikimedia.org/T179306) [11:25:18] PROBLEM - Disk space on boron is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/39b5b87d7ca37c35461bcc347f9f6c33914aaa4f801f102853e5744798480957/shm is not accessible: Permission denied [11:27:32] !log repool ganeti1006, rebalance row_A ganeti nodegroup. T181121 [11:27:35] let's see what happens [11:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:42] T181121: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121 [11:28:14] (03PS2) 10Jcrespo: mariadb-package: Some updates for mysql support (service unit) [software] - 10https://gerrit.wikimedia.org/r/399113 [11:28:40] 10Operations, 10ops-eqiad: Hardware errors on ganeti1005- ganeti1008 - https://phabricator.wikimedia.org/T181121#3850900 (10MoritzMuehlenhoff) [11:28:48] (03PS3) 10Jcrespo: mariadb-package: Add proper mysql support (service unit) [software] - 10https://gerrit.wikimedia.org/r/399113 [11:32:26] (03PS4) 10Jcrespo: mariadb-package: Add proper mysql support (service unit) [software] - 10https://gerrit.wikimedia.org/r/399113 [11:34:19] (03CR) 10Jcrespo: [C: 032] mariadb-package: Add proper mysql support (service unit) [software] - 10https://gerrit.wikimedia.org/r/399113 (owner: 10Jcrespo) [11:34:26] RECOVERY - Disk space on boron is OK: DISK OK [11:34:55] (03CR) 10Filippo Giunchedi: [C: 031] Instruct prometheus to gather postgresql metrics [puppet] - 10https://gerrit.wikimedia.org/r/399380 (https://phabricator.wikimedia.org/T179306) (owner: 10Alexandros Kosiaris) [11:34:59] 10Operations: lawrencium's iDRAC misbehaving IPMI wise - https://phabricator.wikimedia.org/T183349#3850907 (10akosiaris) 05Open>03Resolved 
a:03akosiaris With @Volans 's help we managed to fix it (the root cause is unknown). Per T150160 a ``` IPMI passwords getting out of sync with their iDRAC passwords.... [11:37:17] !log upgrading openssl on elastic* (along with system service restarts) [11:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:45] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Port nutcracker statistics to Prometheus - https://phabricator.wikimedia.org/T181995#3850920 (10fgiunchedi) I've updated the nutcracker dashboard at https://grafana.wikimedia.org/dashboard/db/nutcracker?orgId=1 (and moved the graphite one to "nut... [11:38:09] (03CR) 10Alexandros Kosiaris: [C: 032] Instruct prometheus to gather postgresql metrics [puppet] - 10https://gerrit.wikimedia.org/r/399380 (https://phabricator.wikimedia.org/T179306) (owner: 10Alexandros Kosiaris) [11:43:15] (03PS2) 10Muehlenhoff: Add labmon Prometheus scraper config for PowerDNS [puppet] - 10https://gerrit.wikimedia.org/r/398867 [11:44:16] (03CR) 10Muehlenhoff: [C: 032] Add labmon Prometheus scraper config for PowerDNS [puppet] - 10https://gerrit.wikimedia.org/r/398867 (owner: 10Muehlenhoff) [11:47:35] (03PS1) 10Muehlenhoff: Adapt sudo config for Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/399382 [11:48:08] (03CR) 10Filippo Giunchedi: [C: 031] Adapt sudo config for Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/399382 (owner: 10Muehlenhoff) [11:48:26] PROBLEM - Disk space on boron is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/6488a0d5af4e70349fa5cf2ae1896c6a09ba884da02d6e4a58a2e59191abd057/shm is not accessible: Permission denied [11:49:21] (03CR) 10Muehlenhoff: [C: 032] Adapt sudo config for Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/399382 (owner: 10Muehlenhoff) [11:50:29] !log create k8s-staging LVs in prometheus/eqiad - T163692 [11:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:50:41] T163692: Have puppet create Prometheus LVs - https://phabricator.wikimedia.org/T163692 [11:50:42] akosiaris: ^ I forgot that step, done now [11:59:16] (03PS2) 10Jcrespo: mariadb: Add mysql 8.0-compatible template [puppet] - 10https://gerrit.wikimedia.org/r/399115 [11:59:21] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice work! A first round of comments inline" (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/390330 (owner: 10Ayounsi) [11:59:22] RECOVERY - Disk space on boron is OK: DISK OK [12:00:02] godog: ah nice. thanks! [12:07:34] (03PS3) 10Jcrespo: mariadb: Add mysql 8.0-compatible template [puppet] - 10https://gerrit.wikimedia.org/r/399115 [12:08:12] (03PS4) 10Jcrespo: mariadb: Add mysql 8.0-compatible template [puppet] - 10https://gerrit.wikimedia.org/r/399115 [12:14:52] PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [12:17:09] (03PS1) 10ArielGlenn: fix up dumps cleanup paths for labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/399385 [12:17:29] jynus: ^ is that with dbproxy expected? 
[12:17:39] well, it is not expected [12:17:44] but it is a passive proxy [12:17:51] ah, ok [12:17:51] has no traffic at all [12:18:27] apparently, waiting 5 seconds is not enough [12:18:44] !log installing iproute2 bugfix update from stretch point release [12:18:51] https://gerrit.wikimedia.org/r/399377 [12:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:27] (03CR) 10Jcrespo: [C: 032] mariadb: Add mysql 8.0-compatible template [puppet] - 10https://gerrit.wikimedia.org/r/399115 (owner: 10Jcrespo) [12:21:41] (03CR) 10ArielGlenn: [C: 032] fix up dumps cleanup paths for labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/399385 (owner: 10ArielGlenn) [12:21:51] (03PS2) 10ArielGlenn: fix up dumps cleanup paths for labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/399385 [12:22:21] (03PS1) 10Marostegui: db-eqiad: Repool db1101:3318, db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399387 (https://phabricator.wikimedia.org/T161294) [12:22:51] (03CR) 10Paladox: "test" [puppet] - 10https://gerrit.wikimedia.org/r/355894 (owner: 10Paladox) [12:23:47] (03PS1) 10Jcrespo: haproxy: Increase workaround for ferm starting too late [puppet] - 10https://gerrit.wikimedia.org/r/399388 (https://phabricator.wikimedia.org/T148986) [12:24:36] (03PS2) 10Jcrespo: haproxy: Increase workaround for ferm starting too late [puppet] - 10https://gerrit.wikimedia.org/r/399388 (https://phabricator.wikimedia.org/T148986) [12:24:38] (03PS1) 10Alexandros Kosiaris: Move prometheus::postgres_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/399389 [12:24:55] (03CR) 10Marostegui: [C: 032] db-eqiad: Repool db1101:3318, db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399387 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [12:26:20] (03Merged) 10jenkins-bot: db-eqiad: Repool db1101:3318, db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399387 (https://phabricator.wikimedia.org/T161294) 
(owner: 10Marostegui) [12:27:36] (03CR) 10Jcrespo: [C: 032] haproxy: Increase workaround for ferm starting too late [puppet] - 10https://gerrit.wikimedia.org/r/399388 (https://phabricator.wikimedia.org/T148986) (owner: 10Jcrespo) [12:27:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1101:3318 and db1109 - T161294 (duration: 00m 51s) [12:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:56] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [12:30:14] (03CR) 10jenkins-bot: db-eqiad: Repool db1101:3318, db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399387 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [12:30:33] PROBLEM - HHVM rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:30:33] PROBLEM - Apache HTTP on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:30:43] PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:30:54] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:30:54] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:33:22] (03CR) 10Alexandros Kosiaris: [C: 032] "PCC happy at https://puppet-compiler.wmflabs.org/compiler02/9442/" [puppet] - 10https://gerrit.wikimedia.org/r/399389 (owner: 10Alexandros Kosiaris) [12:33:28] (03PS2) 10Alexandros Kosiaris: Move prometheus::postgres_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/399389 [12:33:30] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Move prometheus::postgres_exporter to a profile [puppet] - 10https://gerrit.wikimedia.org/r/399389 (owner: 10Alexandros Kosiaris) [12:33:53] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 3.480 second response time [12:33:54] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 
74311 bytes in 5.487 second response time [12:34:44] PROBLEM - Nginx local proxy to apache on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:34:50] !log Enable notifications for db1100 - T161294 [12:34:54] RECOVERY - haproxy failover on dbproxy1005 is OK: OK check_failover servers up 2 down 0 [12:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:02] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [12:35:08] apparently, it needs 10 seconds for it to work reliably [12:37:03] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:37:03] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:37:46] RECOVERY - Nginx local proxy to apache on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.960 second response time [12:37:53] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.028 second response time [12:37:53] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 74341 bytes in 0.086 second response time [12:38:39] (03PS1) 10ArielGlenn: when creating lists of dump files for rsync, don't bail on bogus error [puppet] - 10https://gerrit.wikimedia.org/r/399392 [12:40:53] PROBLEM - Nginx local proxy to apache on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:41:03] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:41:03] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:41:43] RECOVERY - Nginx local proxy to apache on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.034 second response time [12:41:53] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.030 second response time [12:41:53] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 
74342 bytes in 0.104 second response time [12:43:52] (03PS2) 10ArielGlenn: when creating lists of dump files for rsync, don't bail on bogus error [puppet] - 10https://gerrit.wikimedia.org/r/399392 [12:44:30] (03CR) 10ArielGlenn: [C: 032] when creating lists of dump files for rsync, don't bail on bogus error [puppet] - 10https://gerrit.wikimedia.org/r/399392 (owner: 10ArielGlenn) [12:54:23] Can someone take a look at https://dpaste.de/Vnh5/raw -- cannot vagrant ssh apparently [12:54:33] PROBLEM - Disk space on boron is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/e2c139bbea705c334c80d25a3a75a575302e2ca6a816bd3352e7dc5adbfb876e/shm is not accessible: Permission denied [12:55:03] RECOVERY - Nginx local proxy to apache on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.073 second response time [12:55:34] tonythomas: you may ask in -cloud, more users are familiar with it there, or you can ask bd808 [12:55:45] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.029 second response time [12:55:45] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 74342 bytes in 0.115 second response time [12:58:13] PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:58:54] PROBLEM - Apache HTTP on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:58:54] PROBLEM - HHVM rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:28] Sagan: sure. [13:00:38] bd808: should know actually. 
[13:00:43] /join #wikimedia-cloud [13:01:32] !log upgrading openssl on hadoop cluster (along with system service restarts) [13:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:57] (03PS2) 10Zoranzoki21: hieradata: use deployment-redis05 for labs jobrunner [puppet] - 10https://gerrit.wikimedia.org/r/387579 (https://phabricator.wikimedia.org/T179371) (owner: 10Filippo Giunchedi) [13:03:44] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [13:04:33] RECOVERY - Disk space on boron is OK: DISK OK [13:11:33] !log restart dbproxy1008 to test workaround is working on cold restart [13:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:14] (03PS7) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) [13:16:53] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [13:17:51] the issue seems to be "fixed" [13:20:02] !log upgrading openssl on aqs/druid clusters (along with system service restarts) [13:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:34] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3851182 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['mw1333.eqiad.wmnet'] ``` The log can be... 
[13:25:57] (03PS8) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) [13:27:18] 10Operations, 10ops-eqiad, 10Analytics, 10hardware-requests: Decommission db104[67] - https://phabricator.wikimedia.org/T181784#3851217 (10elukey) [13:35:38] !log rolling restart of aqs to pick up openssl security update [13:35:39] PROBLEM - MD RAID on mw1334 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:07] hello mw1334, let me silence you [13:37:55] (03CR) 10Faidon Liambotis: [C: 04-1] "- Use our own mirror, mirrors.wikimedia.org instead of deb.debian.org, faster, easier :)" [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [13:38:30] !log upgrading openssl on wdqs (along with system service restarts) [13:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:48] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1276.eqiad.wmnet [13:38:52] working on --^ [13:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:04] it is the recurrent hhvm issue [13:44:19] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.035 second response time [13:44:19] RECOVERY - Nginx local proxy to apache on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.038 second response time [13:44:19] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 74279 bytes in 0.100 second response time [13:49:18] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1276.eqiad.wmnet [13:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:48] RECOVERY - MD RAID on mw1334 
is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [13:59:12] paravoid: ACK with the comments, I would like to have a real-time conversation at some point [14:02:36] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1333 is CRITICAL: Return code of 255 is out of bounds [14:02:36] PROBLEM - configured eth on mw1333 is CRITICAL: Return code of 255 is out of bounds [14:04:16] PROBLEM - Check whether ferm is active by checking the default input chain on mw1333 is CRITICAL: Return code of 255 is out of bounds [14:04:16] PROBLEM - dhclient process on mw1333 is CRITICAL: Return code of 255 is out of bounds [14:04:20] and silencing mw1333 [14:05:19] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([scb2002.codfw.wmnet]) [14:05:19] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([scb2002.codfw.wmnet]) [14:05:45] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([scb2002.codfw.wmnet]) Ema Debugging [14:05:45] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([scb2002.codfw.wmnet]) Ema Debugging [14:06:46] !log temporarily shutdown kafka on kafka1023 to move some topic partitions on different disk partition (disk space usage alerts) [14:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:20] e [14:12:03] (03PS1) 10Alexandros Kosiaris: prometheus: Define $postgresql_jobs and use it [puppet] - 10https://gerrit.wikimedia.org/r/399397 [14:12:59] (03CR) 10Alexandros Kosiaris: [C: 032] prometheus: Define $postgresql_jobs and use it [puppet] - 10https://gerrit.wikimedia.org/r/399397 (owner: 10Alexandros Kosiaris) [14:13:14] !log bounce pybal on lvs2006 and clean IPVS table [14:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:30] !log upgrading openssl 
on wdqs (along with system service restarts) [14:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:17] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal [14:15:40] moritzm: ok so the pybal alerts above are due to the removal of a service (trendingedits?) [14:15:58] I had a hunch :-) [14:16:08] the service isn't defined in pybal any longer, but is still in IPVS [14:16:41] the service being `TCP 10.2.1.9:6699 wrr` [14:17:31] _joe_, mobrovac: ^ [14:17:37] I've cleaned the IPVS table on lvs2006 and the stale service is gone [14:17:57] now I'll do the same on lvs2003, stopping pybal first and waiting for traffic to failover to lvs2006 [14:19:28] (03PS1) 10Muehlenhoff: Add pdns-rec Prometheus exporter to labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/399400 [14:19:53] ema: moritzm: that was trending edits once i believe [14:20:00] but it's no longer in prod [14:20:33] we probably need to improve the icinga check (or add a new one) to look for service diff too, not only hosts [14:21:20] in this case we've noticed the problem pretty late, only now that moritz depooled some scb hosts [14:22:41] (03PS1) 10Ottomata: Fix bug in create_hdfs_user_directories.sh script [puppet/cdh] - 10https://gerrit.wikimedia.org/r/399401 (https://phabricator.wikimedia.org/T182908) [14:23:33] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#3851398 (10akosiaris) I 'll start this with a reiteration of some common... [14:23:38] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10User-Addshore: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3851399 (10Ottomata) Heya! 
There was a bug in the script that was creating your HDFS user home directory, which also... [14:23:56] (03CR) 10Filippo Giunchedi: [C: 031] Add pdns-rec Prometheus exporter to labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/399400 (owner: 10Muehlenhoff) [14:24:18] (03CR) 10Ottomata: [V: 032 C: 032] Fix bug in create_hdfs_user_directories.sh script [puppet/cdh] - 10https://gerrit.wikimedia.org/r/399401 (https://phabricator.wikimedia.org/T182908) (owner: 10Ottomata) [14:24:53] !log stop pybal on lvs2003, clean IPVS table after traffic failover to get rid of trendingedits `TCP 10.2.1.9:6699 wrr` [14:24:56] (03PS1) 10Ottomata: Bump cdh module with create_hdfs_user_directories.sh bugfix [puppet] - 10https://gerrit.wikimedia.org/r/399402 [14:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:10] (03PS2) 10Ottomata: Bump cdh module with create_hdfs_user_directories.sh bugfix [puppet] - 10https://gerrit.wikimedia.org/r/399402 [14:25:45] (03CR) 10Ottomata: [C: 032] Bump cdh module with create_hdfs_user_directories.sh bugfix [puppet] - 10https://gerrit.wikimedia.org/r/399402 (owner: 10Ottomata) [14:26:12] !log start pybal on lvs2003 [14:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:52] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [14:30:22] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal [14:30:22] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [14:30:23] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [10.0] 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [14:30:43] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [14:30:44] yes kafka I am sorry [14:30:50] 1023 will be back in a few [14:31:05] moritzm, mobrovac: done [14:31:57] thanks [14:33:12] (03PS2) 10Muehlenhoff: Add pdns-rec Prometheus exporter to labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/399400 [14:33:54] (03CR) 10Muehlenhoff: [C: 032] Add pdns-rec Prometheus exporter to labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/399400 (owner: 10Muehlenhoff) [14:38:13] (03PS1) 10Muehlenhoff: Add upstart job, we now also need the exporter on labservices1001/trusty [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/399405 [14:38:46] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add upstart job, we now also need the exporter on labservices1001/trusty [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/399405 (owner: 10Muehlenhoff) [14:41:12] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[prometheus-pdns-rec-exporter] [14:45:42] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [14:50:53] PROBLEM - Disk space on boron is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/6488a0d5af4e70349fa5cf2ae1896c6a09ba884da02d6e4a58a2e59191abd057/shm is not accessible: Permission denied [14:51:15] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3851465 (10akosiaris) [14:51:17] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Port postgresql metrics to Prometheus - https://phabricator.wikimedia.org/T179306#3851463 (10akosiaris) 05Open>03Resolved And we got our first dashboard. https://grafana.wikimedia.org/dashboard/db/postgres?orgId=1. service owners are encourag... [14:51:30] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3650139 (10akosiaris) [14:51:47] arturo: shoot :) [14:51:47] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3650139 (10akosiaris) [14:52:47] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#3851468 (10awight) Hi, thanks for your thoughts! >>! In T181071#3851398,... 
[14:52:52] RECOVERY - Disk space on boron is OK: DISK OK [14:55:52] PROBLEM - Disk space on boron is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/bc9b11e6f03eec455438aa5dab5008678040d4b831a8a070eb9fbd277040794a/shm is not accessible: Permission denied [14:56:13] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:57:32] PROBLEM - puppet last run on mw1230 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [15:02:18] 10Operations, 10ops-eqiad: mw1191 ipmi-sel cpu errors - https://phabricator.wikimedia.org/T179640#3851494 (10Cmjohnson) @joe this was a host identified for decommission and is well out of warranty. There is little I can do to fix. I had assumed that the replacements would have been installed by now. [15:03:43] (03PS1) 10Muehlenhoff: Parse two additional metrics (used by the PDNS recursor in WMCS) [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/399411 [15:05:28] (03CR) 10Muehlenhoff: [V: 032 C: 032] Parse two additional metrics (used by the PDNS recursor in WMCS) [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/399411 (owner: 10Muehlenhoff) [15:29:01] (03CR) 10Andrew Bogott: [C: 04-1] bigbrother: rate limit emails (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/399338 (owner: 10BryanDavis) [15:31:39] (03PS1) 10Muehlenhoff: Extend sudo configuration for pdns-recursor on labservices [puppet] - 10https://gerrit.wikimedia.org/r/399420 [15:32:42] (03PS2) 10Muehlenhoff: Extend sudo configuration for pdns-recursor on labservices [puppet] - 10https://gerrit.wikimedia.org/r/399420 [15:33:25] (03CR) 10Muehlenhoff: [C: 032] Extend sudo configuration for pdns-recursor on labservices [puppet] - 10https://gerrit.wikimedia.org/r/399420 (owner: 10Muehlenhoff) [15:34:02] RECOVERY - Disk space on boron is OK: DISK OK [15:38:00] (03PS1) 10Muehlenhoff: Add PowerDNS Recursor scraper config on labmon1001 [puppet] - 
10https://gerrit.wikimedia.org/r/399422 [15:38:52] !log rolling restart of remaining scb* hosts in codfw to pick up openssl update [15:39:02] PROBLEM - Disk space on boron is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/bc9b11e6f03eec455438aa5dab5008678040d4b831a8a070eb9fbd277040794a/shm is not accessible: Permission denied [15:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:23] _joe_: should we ignore boron as well for docker builds ^ ? [15:40:32] I guess so.. same treatment as for contint1001 [15:41:07] <_joe_> akosiaris: yes [15:41:21] <_joe_> akosiaris: I'm currently fighting against https://github.com/bazelbuild/bazel/issues/587 [15:41:49] (03PS1) 10Alexandros Kosiaris: Remove specific hieradata for lawrencium [puppet] - 10https://gerrit.wikimedia.org/r/399425 [15:42:10] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove specific hieradata for lawrencium [puppet] - 10https://gerrit.wikimedia.org/r/399425 (owner: 10Alexandros Kosiaris) [15:43:49] (03PS1) 10Alexandros Kosiaris: Ignore docker FS checks on builder hosts [puppet] - 10https://gerrit.wikimedia.org/r/399426 [15:44:13] PROBLEM - mediawiki-installation DSH group on mw1334 is CRITICAL: Host mw1334 is not in mediawiki-installation dsh group [15:44:27] this is a new host --^ [15:46:10] (03CR) 10Alexandros Kosiaris: [C: 032] Ignore docker FS checks on builder hosts [puppet] - 10https://gerrit.wikimedia.org/r/399426 (owner: 10Alexandros Kosiaris) [15:46:16] _joe_: done [15:46:38] (03CR) 10Krinkle: [C: 032] Get rid of clearly unloved refresh-dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398400 (owner: 10Chad) [15:46:43] (03CR) 10Krinkle: [C: 031] Get rid of clearly unloved refresh-dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398400 (owner: 10Chad) [15:47:28] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1334.eqiad.wmnet [15:47:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:44] <_joe_> akosiaris: thanks [15:54:38] !log rolling restart of scb* hosts in eqiad to pick up openssl update [15:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:53] PROBLEM - MegaRAID on db1011 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [15:57:30] (03PS4) 10Andrew Bogott: bigbrother: rate limit emails [puppet] - 10https://gerrit.wikimedia.org/r/399338 (owner: 10BryanDavis) [15:57:53] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([scb1001.eqiad.wmnet]) [15:58:13] PROBLEM - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([scb1001.eqiad.wmnet]) [16:02:23] ^ ema same cleanup needed in eqiad as you did for codfw probably [16:03:27] (03CR) 10Andrew Bogott: [C: 032] bigbrother: rate limit emails [puppet] - 10https://gerrit.wikimedia.org/r/399338 (owner: 10BryanDavis) [16:04:33] PROBLEM - Apache HTTP on mw1333 is CRITICAL: connect to address 10.64.32.35 and port 80: Connection refused [16:04:33] PROBLEM - MD RAID on mw1333 is CRITICAL: Return code of 255 is out of bounds [16:04:33] PROBLEM - nutcracker process on mw1333 is CRITICAL: Return code of 255 is out of bounds [16:04:33] PROBLEM - HHVM processes on mw1333 is CRITICAL: Return code of 255 is out of bounds [16:04:53] PROBLEM - nutcracker port on mw1333 is CRITICAL: Return code of 255 is out of bounds [16:04:53] PROBLEM - Disk space on mw1333 is CRITICAL: Return code of 255 is out of bounds [16:04:53] PROBLEM - Check systemd state on mw1333 is CRITICAL: Return code of 255 is out of bounds [16:05:05] silenced sorry [16:05:53] RECOVERY - MegaRAID on db1011 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [16:12:53] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [16:13:13] RECOVERY - PyBal IPVS 
diff check on lvs1003 is OK: OK: no difference between hosts in IPVS/PyBal [16:14:08] (03CR) 10Krinkle: [C: 04-1] Add loginwiki and wikidata to $wgLocalVirtualHosts (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392999 (https://phabricator.wikimedia.org/T117302) (owner: 10TerraCodes) [16:15:03] (03CR) 10Krinkle: [C: 04-1] "Task mentioned wikidata and loginwiki, not votewiki. mediawiki.org is fine, but do mention it in the commit message." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392999 (https://phabricator.wikimedia.org/T117302) (owner: 10TerraCodes) [16:23:39] (03PS1) 10Brian Wolff: Fill in $wgGNSMFallbackCategory based on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399433 (https://phabricator.wikimedia.org/T172875) [16:25:16] !log lvs1006: stop pybal, clean ipvs services, start pybal [16:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:11] !log lvs1003: stop pybal, clean ipvs services, start pybal [16:27:16] (03CR) 10jerkins-bot: [V: 04-1] Fill in $wgGNSMFallbackCategory based on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399433 (https://phabricator.wikimedia.org/T172875) (owner: 10Brian Wolff) [16:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:15] moritzm: done [16:31:18] (03CR) 10Filippo Giunchedi: [C: 031] Add PowerDNS Recursor scraper config on labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/399422 (owner: 10Muehlenhoff) [16:35:01] 10Operations: Package Poolcounter for Debian Stretch - https://phabricator.wikimedia.org/T183385#3851697 (10Gilles) [16:35:50] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-Joe: Design pod-level monitoring and service-level alerting - https://phabricator.wikimedia.org/T177396#3851706 (10akosiaris) [16:35:56] (03PS3) 10Andrew Bogott: horizon: Update logos and file naming [puppet] - 10https://gerrit.wikimedia.org/r/398605 (https://phabricator.wikimedia.org/T168480) 
(owner: 10BryanDavis) [16:36:47] (03CR) 10Andrew Bogott: [C: 032] horizon: Update logos and file naming [puppet] - 10https://gerrit.wikimedia.org/r/398605 (https://phabricator.wikimedia.org/T168480) (owner: 10BryanDavis) [16:37:00] (03PS2) 10Brian Wolff: Fill in $wgGNSMFallbackCategory based on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399433 (https://phabricator.wikimedia.org/T172875) [16:43:13] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [16:43:35] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [16:44:13] RECOVERY - mediawiki-installation DSH group on mw1334 is OK: OK [16:53:38] ottomata: here? have you seen this error before ? trying to build prometheus-jmx-exporter [16:53:41] ottomata: [ERROR] Could not create local repository at /nonexistent/.m2/repository -> [Help 1] [16:55:03] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [16:55:22] PROBLEM - Kafka Broker Replica Max Lag on kafka1023 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16fullscreenorgId=1 [16:55:52] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [17:04:56] (03CR) 10Chad: [C: 032] Get rid of clearly unloved refresh-dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398400 (owner: 10Chad) [17:06:23] (03Merged) 10jenkins-bot: Get rid of clearly unloved refresh-dblist [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/398400 (owner: 10Chad) [17:06:59] (03CR) 10jenkins-bot: Get rid of clearly unloved refresh-dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398400 (owner: 10Chad) [17:07:21] (03CR) 10Chad: "It's basically a duplicate of everything that's in make-wmf-branch. It's not really used by anything right now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396482 (owner: 10Umherirrender) [17:08:04] paravoid: probably tomorrow, I would like chasemp to be around [17:09:09] ok, today is a bit of a difficult day time-wise for me, but ping me and we'll see :) [17:09:22] (03CR) 10Chad: [C: 032] Remove AccountAudit from multiversion/submodules.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396478 (owner: 10Umherirrender) [17:09:30] !log demon@tin Synchronized README: noop (duration: 00m 51s) [17:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:43] er [17:09:47] tomorrow I meant, sorry [17:10:28] 10Operations, 10ops-eqiad, 10netops: unrack/decom pfw1-eqiad and pfw2-eqiad - https://phabricator.wikimedia.org/T183390#3851786 (10ayounsi) [17:10:42] (03Merged) 10jenkins-bot: Remove AccountAudit from multiversion/submodules.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396478 (owner: 10Umherirrender) [17:10:52] (03CR) 10jenkins-bot: Remove AccountAudit from multiversion/submodules.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396478 (owner: 10Umherirrender) [17:11:08] (03CR) 10Chad: [C: 032] Remove MoodBar from multiversion/submodules.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396475 (owner: 10Umherirrender) [17:11:18] (03PS3) 10Chad: Remove MoodBar from multiversion/submodules.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396475 (owner: 10Umherirrender) [17:11:31] (03PS2) 10Chad: Remove Wikidata from multiversion/submodules.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396482 (owner: 10Umherirrender) [17:12:22] 
RECOVERY - Disk space on kafka1023 is OK: DISK OK [17:12:42] PROBLEM - HHVM rendering on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:43] PROBLEM - Nginx local proxy to apache on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:02] PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:07] (03CR) 10Chad: [C: 032] Remove Wikidata from multiversion/submodules.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396482 (owner: 10Umherirrender) [17:13:33] PROBLEM - Nginx local proxy to apache on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:05] (03CR) 10jenkins-bot: Remove MoodBar from multiversion/submodules.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396475 (owner: 10Umherirrender) [17:14:32] (03Merged) 10jenkins-bot: Remove Wikidata from multiversion/submodules.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396482 (owner: 10Umherirrender) [17:14:41] (03CR) 10jenkins-bot: Remove Wikidata from multiversion/submodules.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396482 (owner: 10Umherirrender) [17:15:23] RECOVERY - Nginx local proxy to apache on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.129 second response time [17:15:49] depooled mw1316 and ran hhvm-dump-debug --full [17:15:52] let's see if it works [17:16:10] (03PS2) 10Chad: Remove old commented out $wgCollectionFormats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396536 (owner: 10Reedy) [17:16:15] (03CR) 10Chad: [C: 032] Remove old commented out $wgCollectionFormats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396536 (owner: 10Reedy) [17:16:51] !log demon@tin Synchronized multiversion/submodules.json: no-op (duration: 00m 51s) [17:16:52] RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.809 second response time [17:17:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:32] (03Merged) 10jenkins-bot: Remove old commented out $wgCollectionFormats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396536 (owner: 10Reedy) [17:17:34] (03PS1) 10Reedy: Update path to loadExitNodes.php [puppet] - 10https://gerrit.wikimedia.org/r/399434 [17:17:48] (03CR) 10jenkins-bot: Remove old commented out $wgCollectionFormats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396536 (owner: 10Reedy) [17:17:54] (03PS2) 10Chad: Fix LandingCheck indenting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396535 (owner: 10Reedy) [17:17:58] (03CR) 10Chad: [C: 032] Fix LandingCheck indenting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396535 (owner: 10Reedy) [17:19:12] PROBLEM - Nginx local proxy to apache on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:32] PROBLEM - Apache HTTP on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:37] (03Merged) 10jenkins-bot: Fix LandingCheck indenting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396535 (owner: 10Reedy) [17:19:42] PROBLEM - HHVM rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:02] PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:02] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:33] PROBLEM - Nginx local proxy to apache on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:58] !log demon@tin Synchronized wmf-config/CommonSettings.php: more n0-0ps (duration: 00m 52s) [17:21:03] (03CR) 10jenkins-bot: Fix LandingCheck indenting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396535 (owner: 10Reedy) [17:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:32] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 7.481 second response time [17:21:32] 
RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 74647 bytes in 0.086 second response time [17:22:03] RECOVERY - Nginx local proxy to apache on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.016 second response time [17:23:42] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:33] PROBLEM - Apache HTTP on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:42] PROBLEM - HHVM rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:59] (03CR) 10Chad: [C: 04-2] "I'm reluctant to do this. All of the other entries in that list are various GLAMs--this appears to just be some guys' personal website? I " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398702 (https://phabricator.wikimedia.org/T183073) (owner: 10Zoranzoki21) [17:25:23] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.023 second response time [17:25:33] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 74647 bytes in 0.097 second response time [17:26:33] RECOVERY - Nginx local proxy to apache on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.873 second response time [17:26:33] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.028 second response time [17:27:00] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: New WDQS clusters eqiad + codfw - https://phabricator.wikimedia.org/T182991#3840568 (10Lucas_Werkmeister_WMDE) Can you perhaps briefly explain how the specs compare to the existing WDQS clusters? Because I would assume that the... 
[17:27:31] (03Abandoned) 10Chad: Enable Timeless skin on 5 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377864 (https://phabricator.wikimedia.org/T154371) (owner: 10Framawiki) [17:32:02] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 74648 bytes in 0.111 second response time [17:32:29] !log restart zookeeper on conf2002 for jvm updates - T179943 [17:32:34] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-Joe: Design pod-level monitoring and service-level alerting - https://phabricator.wikimedia.org/T177396#3851847 (10akosiaris) [17:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:41] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943 [17:33:05] (03CR) 10Chad: "Dec 18 was two days ago: shall we proceed?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396311 (https://phabricator.wikimedia.org/T180614) (owner: 10Lucas Werkmeister (WMDE)) [17:33:54] (03CR) 10Chad: [C: 032] Enable Sandbox Extension at Atikamekw Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398417 (https://phabricator.wikimedia.org/T182798) (owner: 10Jayprakash12345) [17:34:46] christmas cleanup? [17:34:51] (03CR) 10Lucas Werkmeister (WMDE): "Well I was planning to wait until the “no deploys” break was over… do you think it’s low-risk enough to do now?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/396311 (https://phabricator.wikimedia.org/T180614) (owner: 10Lucas Werkmeister (WMDE)) [17:35:12] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:42] PROBLEM - Nginx local proxy to apache on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:43] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:02] addshore: Eh, I've been trying to handle the wmf-config backlog ~monthly [17:36:21] :) [17:37:03] 10Operations, 10ops-eqiad, 10netops: unrack/decom pfw1-eqiad and pfw2-eqiad - https://phabricator.wikimedia.org/T183390#3851856 (10RobH) @Cmjohnson can actually do everything I can do on the scs, so he can feel free to do all of those steps or assign to me (if he is busy with other onsite things.) Whatever... [17:37:05] depooling mw1234 and restarting hhvm [17:37:06] (03CR) 10Chad: "Holidays? No deploys? (wink wink)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396311 (https://phabricator.wikimedia.org/T180614) (owner: 10Lucas Werkmeister (WMDE)) [17:37:11] (03Merged) 10jenkins-bot: Enable Sandbox Extension at Atikamekw Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398417 (https://phabricator.wikimedia.org/T182798) (owner: 10Jayprakash12345) [17:37:20] (03CR) 10Chad: "I DON'T CARE ABOUT CHRISTMAS " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396311 (https://phabricator.wikimedia.org/T180614) (owner: 10Lucas Werkmeister (WMDE)) [17:37:21] (03CR) 10jenkins-bot: Enable Sandbox Extension at Atikamekw Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398417 (https://phabricator.wikimedia.org/T182798) (owner: 10Jayprakash12345) [17:38:01] 10Operations, 10ops-eqiad, 10hardware-requests, 10netops: unrack/decom pfw1-eqiad and pfw2-eqiad - https://phabricator.wikimedia.org/T183390#3851858 (10RobH) [17:38:30] !log restart hhvm on mw1234 [17:38:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:51] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: sandbox link on Atikamekw 'pedia (duration: 00m 52s) [17:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:24] lol no_justification [17:39:49] I'm planning to miss my flight home for the holidays anyway :p [17:40:06] (didn't wanna go in the first place) [17:40:13] Well if you're deploying random things, I have a new trivial config patch for wikinews ;) [17:40:19] I saw that one [17:40:27] It's in one of my E_TOOMANYTABS [17:40:33] RECOVERY - Nginx local proxy to apache on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.214 second response time [17:40:42] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.031 second response time [17:41:02] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 74649 bytes in 0.134 second response time [17:41:04] (03PS4) 10Zoranzoki21: Add xpda.com to $wgCopyUploadDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398702 (https://phabricator.wikimedia.org/T183073) [17:41:17] (03PS3) 10Chad: Fill in $wgGNSMFallbackCategory based on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399433 (https://phabricator.wikimedia.org/T172875) (owner: 10Brian Wolff) [17:42:07] !log restart hhvm on mw1316 [17:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:59] (03CR) 10Chad: [C: 032] Remove detail from wbcheckconstraints API response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396311 (https://phabricator.wikimedia.org/T180614) (owner: 10Lucas Werkmeister (WMDE)) [17:43:17] !log restart zookeeper on conf2003 for jvm updates - T179943 [17:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:27] T179943: Restart Analytics JVM daemons for open-jdk security updates - 
https://phabricator.wikimedia.org/T179943 [17:43:52] RECOVERY - HHVM rendering on mw1316 is OK: HTTP OK: HTTP/1.1 200 OK - 74649 bytes in 0.348 second response time [17:44:02] RECOVERY - Nginx local proxy to apache on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.873 second response time [17:44:12] RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.055 second response time [17:44:22] (03Merged) 10jenkins-bot: Remove detail from wbcheckconstraints API response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396311 (https://phabricator.wikimedia.org/T180614) (owner: 10Lucas Werkmeister (WMDE)) [17:44:27] (03CR) 10Chad: [C: 032] Fill in $wgGNSMFallbackCategory based on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399433 (https://phabricator.wikimedia.org/T172875) (owner: 10Brian Wolff) [17:45:46] (03Merged) 10jenkins-bot: Fill in $wgGNSMFallbackCategory based on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399433 (https://phabricator.wikimedia.org/T172875) (owner: 10Brian Wolff) [17:46:07] !log demon@tin Synchronized wmf-config/Wikibase-production.php: T180614 (duration: 00m 51s) [17:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:18] T180614: Remove detail and detailHTML from wbcheckconstraints response - https://phabricator.wikimedia.org/T180614 [17:46:48] (03CR) 10jenkins-bot: Remove detail from wbcheckconstraints API response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396311 (https://phabricator.wikimedia.org/T180614) (owner: 10Lucas Werkmeister (WMDE)) [17:47:27] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: T172875 (duration: 00m 51s) [17:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:38] T172875: Empty Special:Newsfeed on many wikinews - https://phabricator.wikimedia.org/T172875 [17:48:13] PROBLEM - Nginx local proxy to apache on mw1277 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:42] PROBLEM - Apache HTTP on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:43] PROBLEM - HHVM rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:56] !log new mw jobrunner in production (mw1334) - T165519 [17:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:08] T165519: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519 [17:49:30] this was not intended since I wanted to reimage only api appservers with pooled=no [17:49:43] all looks good but I thought to mention it in the sal [17:49:56] (the jr starts as soon as the first puppet run completes) [17:50:26] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1015 - https://phabricator.wikimedia.org/T173570#3533347 (10MoritzMuehlenhoff) This host still shows up in puppetdb, i.e. misses the deactivate step (e.g. visible in https://servermon.wikimedia.org/hosts/) [17:51:06] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: T172875 (second try, forgot to pull first (duration: 00m 51s) [17:51:12] RECOVERY - Nginx local proxy to apache on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.154 second response time [17:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:32] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.149 second response time [17:51:42] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 0.108 second response time [17:51:42] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1026 - https://phabricator.wikimedia.org/T174763#3572381 (10MoritzMuehlenhoff) This host still shows up in puppetdb, i.e. misses the deactivate step (e.g. 
visible in https://servermon.wikimedia.org/hosts/) [17:51:49] (03PS18) 10Chad: Add wikidata and mediawiki.org to $wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392999 (https://phabricator.wikimedia.org/T117302) (owner: 10TerraCodes) [17:52:13] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3798986 (10MoritzMuehlenhoff) This host still shows up in puppetdb, i.e. misses the deactivate step (e.g. visible in https://servermon.wikimedia.org/hosts/) [17:52:26] (03PS3) 10ArielGlenn: ability to do xmlpageslogging several pieces at a time in parallel [dumps] - 10https://gerrit.wikimedia.org/r/394857 (https://phabricator.wikimedia.org/T181935) [17:52:41] (03CR) 10jerkins-bot: [V: 04-1] ability to do xmlpageslogging several pieces at a time in parallel [dumps] - 10https://gerrit.wikimedia.org/r/394857 (https://phabricator.wikimedia.org/T181935) (owner: 10ArielGlenn) [17:56:04] (03PS1) 10Filippo Giunchedi: WIP allow labmon1001 to contact pdns exporters [puppet] - 10https://gerrit.wikimedia.org/r/399439 [17:56:41] (03CR) 10Lucas Werkmeister (WMDE): ":-O It’s a Christmas miracle!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/396311 (https://phabricator.wikimedia.org/T180614) (owner: 10Lucas Werkmeister (WMDE)) [17:57:38] <_joe_> !log depooling mw1277 for further investigation [17:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:58] (03PS4) 10ArielGlenn: ability to do xmlpageslogging several pieces at a time in parallel [dumps] - 10https://gerrit.wikimedia.org/r/394857 (https://phabricator.wikimedia.org/T181935) [18:01:22] PROBLEM - Nginx local proxy to apache on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:01:42] PROBLEM - Apache HTTP on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:01:52] PROBLEM - HHVM rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:01:52] RECOVERY - mediawiki-installation DSH group on mw1332 is OK: OK [18:06:03] PROBLEM - Disk space on mw1277 is CRITICAL: DISK CRITICAL - free space: /tmp 0 MB (0% inode=99%) [18:06:12] RECOVERY - Nginx local proxy to apache on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.038 second response time [18:06:33] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.033 second response time [18:06:42] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 0.111 second response time [18:08:07] is the puppet compiler busted? I don't seem to be able to compile even production changes https://puppet-compiler.wmflabs.org/compiler02/9445/tin.eqiad.wmnet/prod.tin.eqiad.wmnet.err [18:08:56] godog: I know it was having some problems last week... haven't heard status update on it [18:09:43] bd808: thanks! yeah afaik those were resolved and it was working again this week, maybe that changed again [18:10:45] it was indeed working even earlier today [18:11:12] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[tzdata] [18:13:03] RECOVERY - Disk space on mw1277 is OK: DISK OK [18:16:12] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:57:03] (03PS1) 10Catrope: Revert "Depool deployment-db04" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399449 [18:57:28] (03PS2) 10Catrope: Revert "Depool deployment-db04" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399449 (https://phabricator.wikimedia.org/T183252) [18:57:33] (03PS3) 10Catrope: Revert "Depool deployment-db04" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399449 (https://phabricator.wikimedia.org/T183252) [18:59:59] (03PS4) 10Catrope: Revert "Depool deployment-db04" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399449 (https://phabricator.wikimedia.org/T183252) [19:00:04] (03CR) 10Catrope: [C: 032] Revert "Depool deployment-db04" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399449 (https://phabricator.wikimedia.org/T183252) (owner: 10Catrope) [19:01:31] (03Merged) 10jenkins-bot: Revert "Depool deployment-db04" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399449 (https://phabricator.wikimedia.org/T183252) (owner: 10Catrope) [19:03:31] 10Operations, 10ops-eqdfw, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3852214 (10Dzahn) a:05Cmjohnson>03Dzahn [19:04:14] (03Abandoned) 10Ottomata: [WIP] Port statsv from kafka analytics to kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/379308 (https://phabricator.wikimedia.org/T176352) (owner: 10Ottomata) [19:15:40] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3852233 (10Dzahn) So looks like we need a new role class for "regular ores server in production" (regular as opposed to ores::redis). Because we have currently on... 
[19:20:26] (03PS1) 10Dzahn: ores: basic role for a worker-only node [puppet] - 10https://gerrit.wikimedia.org/r/399452 (https://phabricator.wikimedia.org/T165170) [19:20:52] (03CR) 10jerkins-bot: [V: 04-1] ores: basic role for a worker-only node [puppet] - 10https://gerrit.wikimedia.org/r/399452 (https://phabricator.wikimedia.org/T165170) (owner: 10Dzahn) [19:23:46] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Investigate why ORES logs are being written to syslog despite explicit logging config. Fix. - https://phabricator.wikimedia.org/T182614#3852237 (10awight) https://github.com/wiki-ai/ores/pull/243 [19:24:23] (03CR) 10Ayounsi: "thanks for the review, comments addressed!" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/390330 (owner: 10Ayounsi) [19:25:51] (03PS3) 10Ayounsi: [WIP] Bird-lg [puppet] - 10https://gerrit.wikimedia.org/r/390330 [19:26:13] (03PS2) 10Dzahn: ores: basic role for a worker-only node [puppet] - 10https://gerrit.wikimedia.org/r/399452 (https://phabricator.wikimedia.org/T165170) [19:26:36] (03CR) 10jerkins-bot: [V: 04-1] ores: basic role for a worker-only node [puppet] - 10https://gerrit.wikimedia.org/r/399452 (https://phabricator.wikimedia.org/T165170) (owner: 10Dzahn) [19:27:39] (03PS1) 10Dzahn: ores::stresstest: fix style violations [puppet] - 10https://gerrit.wikimedia.org/r/399454 [19:28:01] (03PS5) 10Ottomata: Move statsv varnishkafka and service to use main Kafka cluster(s) [puppet] - 10https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) [19:30:11] (03PS6) 10Ottomata: Move statsv varnishkafka and service to use main Kafka cluster(s) [puppet] - 10https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) [19:31:00] (03PS3) 10Dzahn: ores: basic role for a worker-only node [puppet] - 10https://gerrit.wikimedia.org/r/399452 (https://phabricator.wikimedia.org/T165170) [19:31:22] (03CR) 10jerkins-bot: [V: 04-1] ores: basic role for a worker-only 
node [puppet] - 10https://gerrit.wikimedia.org/r/399452 (https://phabricator.wikimedia.org/T165170) (owner: 10Dzahn) [19:32:06] (03PS7) 10Ottomata: Move statsv varnishkafka and service to use main Kafka cluster(s) [puppet] - 10https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) [19:32:36] (03PS4) 10Dzahn: ores: basic role for a worker-only node [puppet] - 10https://gerrit.wikimedia.org/r/399452 (https://phabricator.wikimedia.org/T165170) [19:39:55] 10Operations: "Opsonly" bastion? - https://phabricator.wikimedia.org/T114992#3852284 (10Dzahn) Nowadays iron is used for testing experimental 2fa authentication. Seperately we want to reinstall it because the hardware needs to be replaced soon. Thoughts on how to move forward? Are we getting a replacement for... [19:45:27] 10Operations: replace bast1001 (new hardware) - https://phabricator.wikimedia.org/T183412#3852302 (10Dzahn) [19:45:43] 10Operations: replace bast1001 (new hardware) - https://phabricator.wikimedia.org/T183412#3852312 (10Dzahn) a:03Dzahn [19:46:32] !log remove local-as from cr2-esams IX6 [19:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:35] 10Operations, 10procurement: esams: lvs + misc systems refresh - https://phabricator.wikimedia.org/T183413#3852318 (10RobH) p:05Triage>03Normal [20:03:16] (03CR) 10Zoranzoki21: "> I'm reluctant to do this. 
All of the other entries in that list are" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398702 (https://phabricator.wikimedia.org/T183073) (owner: 10Zoranzoki21) [20:05:21] (03PS5) 10ArielGlenn: ability to do xmlpageslogging several pieces at a time in parallel [dumps] - 10https://gerrit.wikimedia.org/r/394857 (https://phabricator.wikimedia.org/T181935) [20:13:28] (03PS12) 10Andrew Bogott: wmcs puppet: Support instance agents using the 'puppet' master hostname [puppet] - 10https://gerrit.wikimedia.org/r/398323 (https://phabricator.wikimedia.org/T183414) [20:13:30] (03PS1) 10Andrew Bogott: Puppetmaster web frontend: support specifying different certs for a hostname [puppet] - 10https://gerrit.wikimedia.org/r/399459 (https://phabricator.wikimedia.org/T183414) [20:14:14] (03CR) 10Chad: [C: 04-2] "Let me rephrase: I'm reluctant to allow this to be deployed (hence the -2 and not just -1)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398702 (https://phabricator.wikimedia.org/T183073) (owner: 10Zoranzoki21) [20:15:50] (03PS2) 10Andrew Bogott: Puppetmaster web frontend: support specifying different certs for a hostname [puppet] - 10https://gerrit.wikimedia.org/r/399459 (https://phabricator.wikimedia.org/T183414) [20:15:50] (03PS13) 10Andrew Bogott: wmcs puppet: Support instance agents using the 'puppet' master hostname [puppet] - 10https://gerrit.wikimedia.org/r/398323 (https://phabricator.wikimedia.org/T183414) [20:18:59] !log remove local-as from cr2-esams IX4 [20:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:07] (03PS6) 10ArielGlenn: ability to do xmlpageslogging several pieces at a time in parallel [dumps] - 10https://gerrit.wikimedia.org/r/394857 (https://phabricator.wikimedia.org/T181935) [20:21:50] (03CR) 10Zoranzoki21: "> Let me rephrase: I'm reluctant to allow this to be deployed (hence" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398702 (https://phabricator.wikimedia.org/T183073) (owner: 
10Zoranzoki21) [20:25:02] Hi, Chad added -2 on patch https://gerrit.wikimedia.org/r/#/c/398702 because he does not want to deploy this for personal reasons. I need help. Will this be deployed or not? [20:26:21] once it's -2, you cannot remove it without discussing it with the user. [20:27:17] All is ok with the change, but he does not want to deploy this [20:27:24] See comments in patch [20:27:44] (03CR) 10Chad: [C: 04-2] "It's not personal. We don't just add copy-by-urls because someone asks nicely. If you look at the other entries, they're all from reputabl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398702 (https://phabricator.wikimedia.org/T183073) (owner: 10Zoranzoki21) [20:27:44] He doesn't mean he doesn't want to deploy it. He just doesn't want anyone to deploy it. Per his updated comment. [20:28:03] Ok. Sorry then [20:28:36] (03CR) 10Zoranzoki21: "> It's not personal. We don't just add copy-by-urls because someone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398702 (https://phabricator.wikimedia.org/T183073) (owner: 10Zoranzoki21) [20:28:36] And it's not personal. [20:28:43] Ok. Sorry. [20:28:51] See my last comment [20:28:56] (03CR) 10Chad: [C: 04-2] "Yes, please." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398702 (https://phabricator.wikimedia.org/T183073) (owner: 10Zoranzoki21) [20:29:13] (03Abandoned) 10Zoranzoki21: Add xpda.com to $wgCopyUploadDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398702 (https://phabricator.wikimedia.org/T183073) (owner: 10Zoranzoki21) [20:30:13] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [20:30:28] no_justification: Ok, closed [20:30:35] and patch and task [20:31:14] !log T183053 update elasticsearch settings for wikidatawiki_content on codfw to use: index.refresh_interval=5s [20:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:25] T183053: New Wikidata items appear in search with a delay - https://phabricator.wikimedia.org/T183053 [20:32:27] I have a problem with "our" phabricator [20:32:43] See screenshot: http://prntscr.com/hq8b8y [20:33:07] I don't see the Phabricator logo or the pictures there [20:33:30] Example 1: http://prntscr.com/hq8boz [20:34:05] Example 2: http://prntscr.com/hq8bza [20:34:11] Should I open a task about this problem? [20:34:28] Zoranzoki21: that would be a #wikimedia-releng question [20:34:54] Zppix: Ok, I will ask there [20:36:44] Zppix: I asked here [21:00:12] (03PS7) 10ArielGlenn: ability to do xmlpageslogging several pieces at a time in parallel [dumps] - 10https://gerrit.wikimedia.org/r/394857 (https://phabricator.wikimedia.org/T181935) [21:00:13] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:09] (03PS1) 10Smalyshev: Lower refresh interval for Wikidata to 5s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399466 (https://phabricator.wikimedia.org/T183053) [21:06:13] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [21:06:17] (03PS2) 10Dzahn: redis: delete ganglia monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/399326 (https://phabricator.wikimedia.org/T177225) [21:06:53] RECOVERY - Kafka Broker Replica Max Lag on kafka1023 is OK: OK: Less than 50.00% above the threshold [1000000.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1 [21:07:24] (03CR) 10Dzahn: [C: 032]
redis: delete ganglia monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/399326 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [21:07:31] (03PS3) 10Dzahn: redis: delete ganglia monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/399326 (https://phabricator.wikimedia.org/T177225) [21:09:00] (03PS8) 10ArielGlenn: ability to do xmlpageslogging several pieces at a time in parallel [dumps] - 10https://gerrit.wikimedia.org/r/394857 (https://phabricator.wikimedia.org/T181935) [21:09:06] no_justification: could you abandon https://gerrit.wikimedia.org/r/399248 ? [21:09:09] (i can't) [21:09:15] (not sure why, heh) [21:14:47] Could not be found / no permission [21:14:50] same [21:15:05] no_justification: FYI i'm deploying analytics refinery with a bugfix for popups hive schema refinement [21:15:29] 10Operations, 10MediaWiki-Configuration, 10Availability (Multiple-active-datacenters), 10MediaWiki-Platform-Team (MWPT-Q2-Oct-Dec-2017), and 4 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3852549 (10Krinkle) [21:15:31] !log otto@tin Started deploy [analytics/refinery@548dad7]: deploying refinery v0.0.56 with JsonRefine fixes to allow Popups schema to be refined. 
This is a no-op for everything else [21:15:36] 10Operations, 10Availability (Multiple-active-datacenters), 10Performance-Team (Radar), 10Services (watching), and 3 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#3852550 (10Krinkle) [21:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:07] (03PS9) 10ArielGlenn: ability to do xmlpageslogging several pieces at a time in parallel [dumps] - 10https://gerrit.wikimedia.org/r/394857 (https://phabricator.wikimedia.org/T181935) [21:20:22] !log otto@tin Finished deploy [analytics/refinery@548dad7]: deploying refinery v0.0.56 with JsonRefine fixes to allow Popups schema to be refined. This is a no-op for everything else (duration: 04m 51s) [21:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:39] i have no idea why it is like that [21:20:49] i created it like any other (afaict) [21:27:25] (03CR) 10EBernhardson: Lower refresh interval for Wikidata to 5s (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399466 (https://phabricator.wikimedia.org/T183053) (owner: 10Smalyshev) [21:50:57] no_justification i can access the rest api for the change [21:51:05] and doesn't look like a DRAFT. [21:56:15] Weird [22:00:14] paladox: What's the refs/meta/config look like? What's the permissions on the repo? [22:00:27] on which repo? [22:00:32] operations/puppet? [22:01:07] https://phabricator.wikimedia.org/source/operations-puppet/browse/project.config/project.config;43fd5c7879fe72254e8f8dd21f63ae29f04c0e3e [22:04:55] no_justification this is what i pulled from the rest api https://phabricator.wikimedia.org/P6490 [22:05:20] mutante updated the commit msg. So somehow that broke visibility for everyone. [22:06:18] Though the owner of the patch has to always be able to see the change so somehow this feels like a bug, but not sure how to reproduce it.
[22:09:53] blames the inline editor :) [22:10:17] did the inline editor cause this? [22:10:57] gerrit 2.15 and partially 2.14 have overhauled the permission system. [22:14:57] mutante no_justification aha [22:15:00] Multiple patch sets for "108ece92f5d2de8174b221281a5b9679913ea73d": 399248,4; 399248,5 [22:15:25] that's from https://gerrit.wikimedia.org/r/changes/399248/revisions/108ece92f5d2de8174b221281a5b9679913ea73d/commit?links [22:17:37] paladox: File Not Found?! [22:17:50] yep, doesn't work in all browsers [22:18:12] mutante try curl https://gerrit.wikimedia.org/r/changes/399248/revisions/108ece92f5d2de8174b221281a5b9679913ea73d/commit?links [22:18:18] Firefox = File Not Found [22:18:21] Chromium = might be temporarily down or it may have moved permanently to a new web address. [22:18:24] ERR_INVALID_RESPONSE [22:18:51] "not all browsers" = works in curl :) yes, it does [22:18:57] but why is it "Multiple patch sets" [22:19:05] Have no idea :) [22:19:51] 2 patch sets sharing the same ref ?! [22:20:59] yep [22:51:59] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10User-Joe: Unify production and CI docker image build process - https://phabricator.wikimedia.org/T177276#3852857 (10hashar) I have finally started the conversion of the CI images to docker-pkg and even sent a few patch...
[23:04:10] (03PS1) 10Chad: gerrit replication: don't bother trying to create repos on github [puppet] - 10https://gerrit.wikimedia.org/r/399533 [23:05:31] (03CR) 10Hashar: [C: 031] gerrit replication: don't bother trying to create repos on github [puppet] - 10https://gerrit.wikimedia.org/r/399533 (owner: 10Chad) [23:06:46] (03PS2) 10Dzahn: ores::stresstest: fix style violations [puppet] - 10https://gerrit.wikimedia.org/r/399454 [23:07:38] (03CR) 10Dzahn: [C: 032] ores::stresstest: fix style violations [puppet] - 10https://gerrit.wikimedia.org/r/399454 (owner: 10Dzahn) [23:08:30] (03PS2) 10Dzahn: gerrit replication: don't bother trying to create repos on github [puppet] - 10https://gerrit.wikimedia.org/r/399533 (owner: 10Chad) [23:08:55] (03CR) 10Dzahn: [C: 032] gerrit replication: don't bother trying to create repos on github [puppet] - 10https://gerrit.wikimedia.org/r/399533 (owner: 10Chad) [23:13:43] (03PS2) 10Smalyshev: Lower refresh interval for Wikidata to 5s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399466 (https://phabricator.wikimedia.org/T183053) [23:26:17] !log restarting apache on phab1001 to deploy a hotfix for T144184 [23:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:58] (03PS3) 10Dzahn: gerrit: Set log level for com.google.gerrit.server.plugins.PluginLoader to ERROR [puppet] - 10https://gerrit.wikimedia.org/r/398785 (owner: 10Paladox) [23:41:35] (03CR) 10Dzahn: [C: 032] gerrit: Set log level for com.google.gerrit.server.plugins.PluginLoader to ERROR [puppet] - 10https://gerrit.wikimedia.org/r/398785 (owner: 10Paladox) [23:44:46] (03Abandoned) 10Dzahn: grafana: add dashboard for cloud-codfw [puppet] - 10https://gerrit.wikimedia.org/r/393698 (owner: 10Dzahn) [23:52:11] (03PS2) 10Dzahn: aptrepo: move Hiera calls into parameter of role class [puppet] - 10https://gerrit.wikimedia.org/r/397730 [23:56:06] Warning: Unknown variable: 'realm'. 
at /srv/jenkins-workspace/puppet-compiler/9446/change/src/manifests/realm.pp:21:4 hrmm [23:58:13] mutante: go.dog saw that earlier today but I don't know if he poked at it [23:58:33] bd808: ok, thanks, good to know