[00:13:31] Operations, Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947#386350 (Arthur2e5) Just spotted at https://commons.wikimedia.org/wiki/File:Mexico_1835-1846_administrative_map-en-2.svg. Uploading a transf... [00:22:13] (PS1) Jon Harald Søby: Add upload_by_url to extended uploaders on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/397129 (https://phabricator.wikimedia.org/T182534) [00:23:35] RECOVERY - HHVM rendering on mw1316 is OK: HTTP OK: HTTP/1.1 200 OK - 77833 bytes in 0.204 second response time [00:24:04] RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.027 second response time [00:24:04] RECOVERY - Nginx local proxy to apache on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.027 second response time [02:43:35] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:43:35] PROBLEM - Nginx local proxy to apache on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:43:41] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.11) (duration: 09m 21s) [02:43:44] PROBLEM - Nginx local proxy to apache on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:44:15] PROBLEM - HHVM rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:44:25] PROBLEM - HHVM rendering on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:44:25] PROBLEM - HHVM rendering on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:44:34] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:44:34] PROBLEM - Apache HTTP on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:44:44] PROBLEM - Apache HTTP on mw1285 is CRITICAL:
CRITICAL - Socket timeout after 10 seconds [02:44:44] PROBLEM - Nginx local proxy to apache on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:44:54] PROBLEM - Nginx local proxy to apache on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:04] PROBLEM - HHVM rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:15] PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:17] PROBLEM - Apache HTTP on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:24] PROBLEM - HHVM rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:25] PROBLEM - HHVM rendering on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:35] PROBLEM - Apache HTTP on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:35] PROBLEM - Nginx local proxy to apache on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:44] PROBLEM - Nginx local proxy to apache on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:54] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:54] PROBLEM - Apache HTTP on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:55] PROBLEM - HHVM rendering on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:15] PROBLEM - Nginx local proxy to apache on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:15] PROBLEM - Nginx local proxy to apache on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:15] PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:15] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:04] RECOVERY - Nginx local proxy to apache on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.058 second response 
time [02:48:44] PROBLEM - Nginx local proxy to apache on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:48:54] PROBLEM - HHVM rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:49:05] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.032 second response time [02:49:44] RECOVERY - Nginx local proxy to apache on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 7.530 second response time [02:50:34] RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 5.527 second response time [02:51:24] RECOVERY - HHVM rendering on mw1317 is OK: HTTP OK: HTTP/1.1 200 OK - 77819 bytes in 0.122 second response time [02:51:34] RECOVERY - Nginx local proxy to apache on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.031 second response time [02:51:44] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 77819 bytes in 0.166 second response time [02:57:34] PROBLEM - Check systemd state on maps-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:57:44] PROBLEM - cassandra service on maps-test2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [02:58:04] PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: connect to address 10.192.0.128 and port 9042: Connection refused [03:10:04] RECOVERY - Nginx local proxy to apache on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 9.704 second response time [03:10:07] (03CR) 10KartikMistry: "This is OK to go. Follow-up patches coming." [debs/contenttranslation/hfst] - 10https://gerrit.wikimedia.org/r/394967 (https://phabricator.wikimedia.org/T181463) (owner: 10KartikMistry) [03:10:54] (03CR) 10KartikMistry: "This is OK to go. Depends on new hfst." 
[debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/395176 (https://phabricator.wikimedia.org/T181464) (owner: 10KartikMistry) [03:11:34] RECOVERY - Check systemd state on maps-test2001 is OK: OK - running: The system is fully operational [03:11:54] RECOVERY - cassandra service on maps-test2001 is OK: OK - cassandra is active [03:12:04] RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 0.036 second response time on 10.192.0.128 port 9042 [03:13:05] PROBLEM - Nginx local proxy to apache on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:26:05] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 733.91 seconds [03:59:25] PROBLEM - HHVM rendering on mw2132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:00:14] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 285.84 seconds [04:00:15] RECOVERY - HHVM rendering on mw2132 is OK: HTTP OK: HTTP/1.1 200 OK - 77902 bytes in 0.320 second response time [04:16:14] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 6.883 second response time [04:19:15] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:27:35] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 78008 bytes in 9.503 second response time [04:27:44] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.027 second response time [04:30:35] PROBLEM - HHVM rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:30:54] PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:40:02] (03PS1) 10KartikMistry: Depends on new cg3 [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/397217 (https://phabricator.wikimedia.org/T171406) [05:10:52] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) 
[debs/contenttranslation/apertium-arg-cat] - 10https://gerrit.wikimedia.org/r/397218 (https://phabricator.wikimedia.org/T171406) [05:11:34] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-arg-cat] - 10https://gerrit.wikimedia.org/r/397218 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [05:13:33] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-arg] - 10https://gerrit.wikimedia.org/r/397219 (https://phabricator.wikimedia.org/T171406) [05:14:05] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-arg] - 10https://gerrit.wikimedia.org/r/397219 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [05:15:25] PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:16:15] RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 0.036 second response time on 10.192.0.128 port 9042 [05:17:02] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-bel] - 10https://gerrit.wikimedia.org/r/397220 (https://phabricator.wikimedia.org/T171406) [05:17:33] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-bel] - 10https://gerrit.wikimedia.org/r/397220 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [05:19:04] PROBLEM - tilerator on maps-test2001 is CRITICAL: connect to address 10.192.0.128 and port 6534: Connection refused [05:19:14] PROBLEM - tilerator on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6534: Connection refused [05:20:06] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-bel-rus] - 10https://gerrit.wikimedia.org/r/397221 (https://phabricator.wikimedia.org/T171406) [05:20:28] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) 
[debs/contenttranslation/apertium-bel-rus] - 10https://gerrit.wikimedia.org/r/397221 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [05:21:24] PROBLEM - Check health of redis instance on 6379 on maps-test2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [05:22:58] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-br-fr] - 10https://gerrit.wikimedia.org/r/397222 (https://phabricator.wikimedia.org/T171406) [05:23:25] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-br-fr] - 10https://gerrit.wikimedia.org/r/397222 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [05:26:51] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/397223 (https://phabricator.wikimedia.org/T171406) [05:27:16] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/397223 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [05:30:09] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-cat-srd] - 10https://gerrit.wikimedia.org/r/397224 (https://phabricator.wikimedia.org/T171406) [05:30:32] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-cat-srd] - 10https://gerrit.wikimedia.org/r/397224 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [05:31:54] RECOVERY - Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.502 second response time [05:33:35] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-crh] - 10https://gerrit.wikimedia.org/r/397225 (https://phabricator.wikimedia.org/T171406) [05:34:12] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) 
[debs/contenttranslation/apertium-crh] - 10https://gerrit.wikimedia.org/r/397225 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [05:34:55] PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:36:42] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/397226 (https://phabricator.wikimedia.org/T171406) [05:37:07] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/397226 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [05:38:48] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-cy-en] - 10https://gerrit.wikimedia.org/r/397227 (https://phabricator.wikimedia.org/T171406) [05:39:23] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-cy-en] - 10https://gerrit.wikimedia.org/r/397227 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [05:42:22] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/397228 (https://phabricator.wikimedia.org/T171406) [05:42:46] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/397228 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [05:44:15] PROBLEM - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Puppet has 11 failures. Last run 3 minutes ago with 11 failures. 
Failed resources (up to 3 shown): Exec[create_user-kartotherian],Exec[create_user-monitoring@maps-test2002],Exec[create_user-tileratorui],Exec[create_user-osmimporter] [05:46:01] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-dan-nor] - 10https://gerrit.wikimedia.org/r/397229 (https://phabricator.wikimedia.org/T171406) [05:46:35] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-dan-nor] - 10https://gerrit.wikimedia.org/r/397229 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [06:13:17] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-eo-es] - 10https://gerrit.wikimedia.org/r/397230 (https://phabricator.wikimedia.org/T171406) [06:13:57] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-eo-es] - 10https://gerrit.wikimedia.org/r/397230 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [06:15:13] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-eus] - 10https://gerrit.wikimedia.org/r/397231 (https://phabricator.wikimedia.org/T171406) [06:15:34] (03PS1) 10Marostegui: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397232 (https://phabricator.wikimedia.org/T178359) [06:15:40] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-eus] - 10https://gerrit.wikimedia.org/r/397231 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [06:17:25] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/397233 (https://phabricator.wikimedia.org/T171406) [06:17:53] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/397233 
(https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [06:18:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397232 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:19:14] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/397234 (https://phabricator.wikimedia.org/T171406) [06:19:36] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/397234 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [06:19:43] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397232 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:19:54] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397232 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:21:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1096:3316 to compress InnoDB there - T178359 (duration: 00m 45s) [06:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:20] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [06:22:12] !log Compress s6 on db1096 - T178359 [06:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:44] RECOVERY - Check health of redis instance on 6379 on maps-test2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 103466 keys, up 34 days 14 hours [06:26:14] RECOVERY - tilerator on maps-test2001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.091 second response time [06:26:15] RECOVERY - tilerator on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.092 second response time [06:38:00] (03PS1) 10Marostegui: 
db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397235 (https://phabricator.wikimedia.org/T174569) [06:42:34] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397235 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:43:55] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397235 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:44:15] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:45:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1060 - T174569 (duration: 00m 44s) [06:45:18] !log Deploy schema change on s2 db1060 with replication enabled, this will generate some lag on s2 on labs - T174569 [06:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:19] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [06:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:34] PROBLEM - Check Varnish expiry mailbox lag on cp4021 is CRITICAL: CRITICAL: expiry mailbox lag is 2103234 [06:47:03] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397235 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:47:35] PROBLEM - HHVM rendering on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:47:55] PROBLEM - Apache HTTP on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:48:34] PROBLEM - Nginx local proxy to apache on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:53:29] <_joe_> looking [06:57:19] (03PS1) 10BryanDavis: labsdb: Point DNS at equivalent web.db.svc.eqiad.wmflabs hosts [puppet] - 10https://gerrit.wikimedia.org/r/397256 
(https://phabricator.wikimedia.org/T142807) [06:58:07] (03CR) 10jerkins-bot: [V: 04-1] labsdb: Point DNS at equivalent web.db.svc.eqiad.wmflabs hosts [puppet] - 10https://gerrit.wikimedia.org/r/397256 (https://phabricator.wikimedia.org/T142807) (owner: 10BryanDavis) [06:58:25] RECOVERY - Nginx local proxy to apache on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.063 second response time [06:58:34] RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 77962 bytes in 1.353 second response time [06:58:54] RECOVERY - Apache HTTP on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.033 second response time [06:59:53] <_joe_> !log restarted hhvm, nginx on mw1280, hanging kernel operations [07:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:50] (03PS2) 10BryanDavis: labsdb: Point DNS at equivalent web.db.svc.eqiad.wmflabs hosts [puppet] - 10https://gerrit.wikimedia.org/r/397256 (https://phabricator.wikimedia.org/T142807) [07:15:05] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-hbs] - 10https://gerrit.wikimedia.org/r/397279 (https://phabricator.wikimedia.org/T171406) [07:15:40] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-hbs] - 10https://gerrit.wikimedia.org/r/397279 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [07:17:27] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-hbs-eng] - 10https://gerrit.wikimedia.org/r/397283 (https://phabricator.wikimedia.org/T171406) [07:17:52] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-hbs-eng] - 10https://gerrit.wikimedia.org/r/397283 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [07:19:09] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-hbs-mkd] - 
10https://gerrit.wikimedia.org/r/397284 (https://phabricator.wikimedia.org/T171406) [07:19:41] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-hbs-mkd] - 10https://gerrit.wikimedia.org/r/397284 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [07:20:56] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-hbs-slv] - 10https://gerrit.wikimedia.org/r/397285 (https://phabricator.wikimedia.org/T171406) [07:21:23] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-hbs-slv] - 10https://gerrit.wikimedia.org/r/397285 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [07:23:47] 10Operations, 10DBA: Decommission db1034 - https://phabricator.wikimedia.org/T182556#3827026 (10Marostegui) p:05Triage>03Normal [07:25:28] (03PS1) 10Marostegui: db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397295 (https://phabricator.wikimedia.org/T182556) [07:27:04] RECOVERY - Nginx local proxy to apache on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.045 second response time [07:27:04] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 77960 bytes in 0.166 second response time [07:27:24] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.106 second response time [07:29:43] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397295 (https://phabricator.wikimedia.org/T182556) (owner: 10Marostegui) [07:31:44] (03PS2) 10Marostegui: db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397295 (https://phabricator.wikimedia.org/T182556) [07:34:28] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397295 (https://phabricator.wikimedia.org/T182556) (owner: 
10Marostegui) [07:36:42] (03PS3) 10Marostegui: db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397295 (https://phabricator.wikimedia.org/T182556) [07:39:34] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 77960 bytes in 0.885 second response time [07:40:04] RECOVERY - Nginx local proxy to apache on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.274 second response time [07:40:04] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.277 second response time [07:43:34] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397295 (https://phabricator.wikimedia.org/T182556) (owner: 10Marostegui) [07:44:05] <_joe_> !log restarting hhvm on mw1189,mw1229,mw1235,mw1282,mw1285,mw1315,mw1316, all stuck with a kernel hang [07:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:05] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.032 second response time [07:45:05] RECOVERY - Nginx local proxy to apache on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.039 second response time [07:45:54] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 77962 bytes in 2.526 second response time [07:48:39] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397295 (https://phabricator.wikimedia.org/T182556) (owner: 10Marostegui) [07:48:49] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397295 (https://phabricator.wikimedia.org/T182556) (owner: 10Marostegui) [07:48:54] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 77960 bytes in 0.242 second response time [07:48:54] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 77960 bytes in 0.407 
second response time [07:49:05] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time [07:49:05] RECOVERY - Nginx local proxy to apache on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.047 second response time [07:49:05] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.040 second response time [07:49:34] RECOVERY - Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.766 second response time [07:49:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1034 - T182556 (duration: 00m 45s) [07:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:57] T182556: Decommission db1034 - https://phabricator.wikimedia.org/T182556 [07:51:05] RECOVERY - Nginx local proxy to apache on mw1315 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.036 second response time [07:51:05] RECOVERY - Apache HTTP on mw1315 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.034 second response time [07:51:44] RECOVERY - HHVM rendering on mw1315 is OK: HTTP OK: HTTP/1.1 200 OK - 77942 bytes in 0.402 second response time [07:52:05] PROBLEM - HHVM rendering on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1312 bytes in 4.520 second response time [07:52:05] PROBLEM - Apache HTTP on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1312 bytes in 9.508 second response time [07:52:05] PROBLEM - Nginx local proxy to apache on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time [07:52:44] RECOVERY - Nginx local proxy to apache on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.357 second response time [07:52:44] RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.027 
second response time [07:52:44] RECOVERY - HHVM rendering on mw1316 is OK: HTTP OK: HTTP/1.1 200 OK - 77942 bytes in 0.258 second response time [07:52:55] RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.205 second response time [07:53:04] RECOVERY - HHVM rendering on mw1317 is OK: HTTP OK: HTTP/1.1 200 OK - 77942 bytes in 0.319 second response time [07:53:05] RECOVERY - Nginx local proxy to apache on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.040 second response time [07:59:28] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] Elukey kafka1018 down - T181518 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [07:59:28] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] Elukey kafka1018 down - T181518 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [08:02:54] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100% [08:02:54] PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100% [08:03:25] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:25] PROBLEM - Host etcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:25] PROBLEM - Host ganeti1008 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:35] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:35] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:35] PROBLEM - Host dysprosium is DOWN: PING CRITICAL - Packet loss = 100% [08:03:35] PROBLEM - Host neon is DOWN: PING CRITICAL - Packet loss = 100% [08:03:35] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:43] eh? [08:04:05] wt... 
[08:04:32] ganeti1008 down sigh [08:05:43] ah sigh [08:05:45] happy happy monday [08:08:49] so in console I can see a ton of printks [08:09:08] it might be something that akosiaris set up to debug the recurrent issues with ganeti [08:10:23] <_joe_> elukey: let's bring it up again? [08:10:48] <_joe_> if you're already in console [08:10:57] _joe_ sure I can powercycle it [08:12:32] !log powercycle ganeti1008 - all vms stuck, console com2 showed a ton of printks without a clear indicator of the root cause [08:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:04] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=miscvar-status_type=5 [08:14:04] RECOVERY - Host ganeti1008 is UP: PING WARNING - Packet loss = 64%, RTA = 0.44 ms [08:15:44] RECOVERY - Host etcd1005 is UP: PING OK - Packet loss = 0%, RTA = 3.62 ms [08:15:44] RECOVERY - Host neon is UP: PING OK - Packet loss = 0%, RTA = 2.15 ms [08:15:54] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 4.68 ms [08:15:54] RECOVERY - Host dysprosium is UP: PING OK - Packet loss = 0%, RTA = 4.48 ms [08:15:54] RECOVERY - Host bohrium is UP: PING WARNING - Packet loss = 66%, RTA = 3.48 ms [08:15:54] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 4.39 ms [08:16:04] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 3.93 ms [08:16:04] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 3.60 ms [08:16:04] RECOVERY - Host actinium is UP: PING OK - Packet loss = 0%, RTA = 3.97 ms [08:21:12] Operations, ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3827151 (elukey) Ganeti1008 went down some minutes ago, a powercycle fixed it.
The console was showing up a ton of printks but I didn't find anything useful. A snippet: ``` *... [08:27:05] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=miscvar-status_type=5 [08:28:26] piwik wasn't happy [08:30:46] weird the icinga dashboard link above does not get the '&' [08:43:25] (PS1) Elukey: role::graphite::alerts::reqstat: render correctly the dashboard qs [puppet] - https://gerrit.wikimedia.org/r/397350 [08:43:58] (CR) jerkins-bot: [V: -1] role::graphite::alerts::reqstat: render correctly the dashboard qs [puppet] - https://gerrit.wikimedia.org/r/397350 (owner: Elukey) [08:44:26] (PS1) Marostegui: db103{4,9}: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/397351 (https://phabricator.wikimedia.org/T163190) [08:46:15] (PS2) Elukey: role::graphite::alerts::reqstat: render correctly the dashboard qs [puppet] - https://gerrit.wikimedia.org/r/397350 [08:46:38] (PS2) Elukey: site.pp: set notebook1002 as spare::system [puppet] - https://gerrit.wikimedia.org/r/394985 (https://phabricator.wikimedia.org/T181518) [08:46:59] (CR) Marostegui: [C: 2] db103{4,9}: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/397351 (https://phabricator.wikimedia.org/T163190) (owner: Marostegui) [08:52:38] !log Stop replication in sync on db1034 and db1039 - T163190 [08:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:48] T163190: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190 [08:55:25] (PS3) Elukey: site.pp: set notebook1002 as spare::system [puppet] - https://gerrit.wikimedia.org/r/394985 (https://phabricator.wikimedia.org/T181518) [08:56:25] (PS2) Gehel: wdqs: adding "su" directive to log rotation [puppet] -
10https://gerrit.wikimedia.org/r/396419 [08:57:09] * elukey thanks gehel for removing cronspam [08:57:31] elukey: yeah, I should have merged that Friday already... [08:57:39] (03CR) 10Gehel: [C: 032] wdqs: adding "su" directive to log rotation [puppet] - 10https://gerrit.wikimedia.org/r/396419 (owner: 10Gehel) [08:59:03] (03CR) 10Elukey: [C: 032] site.pp: set notebook1002 as spare::system [puppet] - 10https://gerrit.wikimedia.org/r/394985 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [08:59:14] (03PS4) 10Elukey: site.pp: set notebook1002 as spare::system [puppet] - 10https://gerrit.wikimedia.org/r/394985 (https://phabricator.wikimedia.org/T181518) [09:02:01] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Remove Wikidata from multiversion/submodules.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396482 (owner: 10Umherirrender) [09:03:47] !log dropping multiple leftover files from db1102 [09:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:42] !log set notebook1002 as role::spare as prep step to reimage it to kafka1023 [09:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:19] PROBLEM - cassandra service on maps-test2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [09:08:50] PROBLEM - Check systemd state on maps-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[09:08:59] PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: connect to address 10.192.0.128 and port 9042: Connection refused [09:09:49] PROBLEM - tilerator on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6534: Connection refused [09:11:39] PROBLEM - tilerator on maps-test2001 is CRITICAL: connect to address 10.192.0.128 and port 6534: Connection refused [09:11:50] RECOVERY - Check systemd state on maps-test2001 is OK: OK - running: The system is fully operational [09:11:59] RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 0.036 second response time on 10.192.0.128 port 9042 [09:12:20] RECOVERY - cassandra service on maps-test2001 is OK: OK - cassandra is active [09:12:30] PROBLEM - Check health of redis instance on 6379 on maps-test2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [09:12:39] * gehel is looking at maps-test... [09:14:30] RECOVERY - Check health of redis instance on 6379 on maps-test2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 103641 keys, up 34 days 17 hours [09:14:39] RECOVERY - tilerator on maps-test2001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.093 second response time [09:14:49] RECOVERY - tilerator on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.096 second response time [09:15:00] !log cleaning up old postgres logs on maps-test2001 [09:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:09] !log cleaning old cassandra dumps on maps-test2001 servers [09:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:19] PROBLEM - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 4 minutes ago with 9 failures. 
Failed resources (up to 3 shown): Exec[create_user-kartotherian],Exec[create_user-monitoring@maps-test2002],Exec[create_user-tileratorui],Exec[create_user-osmimporter] [09:21:19] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:28:00] (03PS1) 10Elukey: netbook.cfg: add kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/397397 (https://phabricator.wikimedia.org/T181518) [09:28:28] !log upload scap_3.7.4-1 to apt.wikimedia.org/jessie-wikimedia/main [09:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:22] <_joe_> elukey: netbooK? [09:38:35] _joe_: ahahah yes I was about to change it, weird mix between netbook and notebook [09:39:26] (03PS2) 10Elukey: netboot.cfg: add kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/397397 (https://phabricator.wikimedia.org/T181518) [09:40:07] (03CR) 10Filippo Giunchedi: "LGTM overall" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/395578 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [09:43:02] (03CR) 10Elukey: [C: 032] netboot.cfg: add kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/397397 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [10:00:59] !log stopping dbstore2001:s5 and dbstore1002 (s5) mysql replication in sync [10:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:57] !log upgrade grafana to 4.6.2 on labmon1001 - T182294 [10:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:09] T182294: Upgrade grafana to 4.6.2 - https://phabricator.wikimedia.org/T182294 [10:10:52] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-Joe: Experiment with a TLS proxy/router for pods - https://phabricator.wikimedia.org/T177394#3657130 (10Joe) I am experimenting a bit with envoy, and while it's interesting, and probably something we want to implement on the long run, I'm not sure we're... 
[10:19:58] (03PS2) 10Filippo Giunchedi: prometheus: add ores redis job [puppet] - 10https://gerrit.wikimedia.org/r/395569 (https://phabricator.wikimedia.org/T148637) [10:22:08] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add ores redis job [puppet] - 10https://gerrit.wikimedia.org/r/395569 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [10:30:21] 10Operations, 10Deployments, 10Beta-Cluster-reproducible, 10HHVM, and 2 others: Switch mwscript from Zend PHP5 to default php alternative (e.g. HHVM or PHP7) - https://phabricator.wikimedia.org/T146285#3827389 (10Joe) @tstarling AIUI we should be able to switch mwscript to use 'php' instead of 'php5' now,... [10:31:54] (03PS2) 10Volans: Icinga web: add icons for multiple notes_url items [puppet] - 10https://gerrit.wikimedia.org/r/392606 (https://phabricator.wikimedia.org/T170353) [10:31:56] (03PS2) 10Volans: Metric alarms: convert dashboad_link to array [puppet] - 10https://gerrit.wikimedia.org/r/392607 (https://phabricator.wikimedia.org/T170353) [10:32:53] (03CR) 10Filippo Giunchedi: "LGTM, though I'd like Joe or Keith opinion too" [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [10:35:48] (03CR) 10Volans: "Puppet compiler result: https://puppet-compiler.wmflabs.org/compiler02/9262/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/392606 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [10:41:50] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397472 (https://phabricator.wikimedia.org/T128546) [10:46:37] 10Operations, 10Services (doing), 10User-Eevans, 10User-fgiunchedi: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839#3827436 (10fgiunchedi) Indeed @Eevans that looks good to me. Though I'm open to suggestions on e.g. branch workflow and such, the package should be buildable with gbp now. 
[10:49:11] !log cp4021: restart varnish-be due to mbox lag [10:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:37] 10Operations, 10media-storage: Requesting access to swift for Phabricator's git-lfs storage - https://phabricator.wikimedia.org/T182085#3827438 (10fgiunchedi) Thanks for the context! We have a swift deployment-prep cluster you could use for experiments, provided there's also phabricator in deployment-prep (?)... [10:53:00] (03PS3) 10Volans: Metric alarms: convert dashboad_link to array [puppet] - 10https://gerrit.wikimedia.org/r/392607 (https://phabricator.wikimedia.org/T170353) [10:56:28] RECOVERY - Check Varnish expiry mailbox lag on cp4021 is OK: OK: expiry mailbox lag is 0 [10:59:49] (03PS4) 10Volans: Metric alarms: convert dashboad_link to array [puppet] - 10https://gerrit.wikimedia.org/r/392607 (https://phabricator.wikimedia.org/T170353) [11:00:07] jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171211T1100). Please do the needful. [11:00:07] No GERRIT patches in the queue for this window AFAICS. 
[11:00:44] (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397472 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:05:37] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397472 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:06:27] (03PS5) 10Volans: Metric alarms: convert dashboad_link to array [puppet] - 10https://gerrit.wikimedia.org/r/392607 (https://phabricator.wikimedia.org/T170353) [11:07:09] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397472 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:07:36] !log jdrewniak@tin Synchronized portals/prod/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:397472|Bumping portals to master (T128546)]] (duration: 00m 44s) [11:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:46] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [11:08:21] !log jdrewniak@tin Synchronized portals: Wikimedia Portals Update: [[gerrit:397472|Bumping portals to master (T128546)]] (duration: 00m 45s) [11:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:17] PROBLEM - Nginx local proxy to apache on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:10:32] _joe_ --^ [11:10:47] PROBLEM - Apache HTTP on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:10:47] <_joe_> elukey: uhm happening again heh? 
[11:11:00] elukey@mw1200:~$ hhvmadm check-health [11:11:00] { "load":128 [11:11:01] , "queued":100 [11:11:07] <_joe_> same issue [11:11:07] PROBLEM - HHVM rendering on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:11:31] <_joe_> !log restarting hhvm on mw1200, stuck in a kernel task [11:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:40] <_joe_> elukey: actually, since hhvmadm works [11:12:03] <_joe_> !log depooling mw1200 for investigation instead [11:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:58] RECOVERY - HHVM rendering on mw1200 is OK: HTTP OK: HTTP/1.1 200 OK - 78371 bytes in 5.610 second response time [11:13:00] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-hin] - 10https://gerrit.wikimedia.org/r/397494 (https://phabricator.wikimedia.org/T171406) [11:13:07] RECOVERY - Nginx local proxy to apache on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.030 second response time [11:13:22] <_joe_> elukey: interestingly, depooling it seems enough to recover, let's see if we can get a stack trace [11:13:25] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-hin] - 10https://gerrit.wikimedia.org/r/397494 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [11:13:38] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.025 second response time [11:16:37] (03CR) 10Volans: "Compiler results seems ok to me: https://puppet-compiler.wmflabs.org/compiler02/9267/" [puppet] - 10https://gerrit.wikimedia.org/r/392607 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [11:25:46] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-isl] - 10https://gerrit.wikimedia.org/r/397502 (https://phabricator.wikimedia.org/T171406) [11:26:03] (03CR) 10jerkins-bot: 
[V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-isl] - 10https://gerrit.wikimedia.org/r/397502 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [11:26:29] 10Operations, 10HHVM: HHVM periodically hangs - https://phabricator.wikimedia.org/T182568#3827535 (10elukey) p:05Triage>03High [11:29:16] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-isl-eng] - 10https://gerrit.wikimedia.org/r/397512 (https://phabricator.wikimedia.org/T171406) [11:29:52] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-isl-eng] - 10https://gerrit.wikimedia.org/r/397512 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [11:30:27] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 [11:31:07] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [11:32:58] (03CR) 10Volans: "I know there is some code repeated here, but it's just validation and defining a string, it doesn't create a resource, so not super easy t" [puppet] - 10https://gerrit.wikimedia.org/r/392607 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [11:37:09] (03CR) 10Volans: "recheck" [software/cumin] - 10https://gerrit.wikimedia.org/r/386399 (https://phabricator.wikimedia.org/T179002) (owner: 10Volans) [11:39:22] (03PS1) 10KartikMistry: apertium-is-sv: New upstream release and cg3 update [debs/contenttranslation/apertium-is-sv] - 10https://gerrit.wikimedia.org/r/397527 (https://phabricator.wikimedia.org/T171406) [11:39:44] (03CR) 10jerkins-bot: [V: 04-1] Logging: uniform loggers [software/cumin] - 10https://gerrit.wikimedia.org/r/386399 (https://phabricator.wikimedia.org/T179002) (owner: 10Volans) [11:39:50] (03CR) 10jerkins-bot: [V: 04-1] apertium-is-sv: New upstream release 
and cg3 update [debs/contenttranslation/apertium-is-sv] - 10https://gerrit.wikimedia.org/r/397527 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [11:44:38] 10Operations, 10ops-eqiad: Disconnect flerovium's disk shelves - https://phabricator.wikimedia.org/T181724#3827584 (10faidon) >>! In T181724#3813467, @Cmjohnson wrote: > I disconnected the disk shelves and powered down. @faidon please let me know when and if it's okay to coordinate the drop off. @Cmjohnson... [11:47:36] (03PS1) 10KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-ita] - 10https://gerrit.wikimedia.org/r/397528 (https://phabricator.wikimedia.org/T171406) [11:48:13] (03CR) 10jerkins-bot: [V: 04-1] Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-ita] - 10https://gerrit.wikimedia.org/r/397528 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [11:58:48] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [11:59:17] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [12:03:52] (03PS2) 10Alexandros Kosiaris: Bump scap version to 3.7.4-1 [puppet] - 10https://gerrit.wikimedia.org/r/396089 (owner: 1020after4) [12:04:47] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Bump scap version to 3.7.4-1 [puppet] - 10https://gerrit.wikimedia.org/r/396089 (owner: 1020after4) [12:07:37] (03CR) 1020after4: "The uploaded package still needs to be uploaded to the apt repo: https://phabricator.wikimedia.org/T182347" [puppet] - 10https://gerrit.wikimedia.org/r/396089 (owner: 1020after4) [12:08:28] PROBLEM - puppet last run on wtp1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [12:08:35] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3827651 (10mobrovac) Switching all small non-WP projects should be a non-brainer, so I'd vote for switching them + `cebwiki` and `ruwiki`. This should be safe... [12:09:43] 10Operations, 10Packaging, 10Scap: SCAP: Upload debian package version 3.7.4-1 - https://phabricator.wikimedia.org/T182347#3827653 (10akosiaris) 05Open>03Resolved a:03akosiaris Package built for both jessie and trusty and uploaded on apt.wikimedia.org. Puppet change merged as well [12:09:46] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3827656 (10akosiaris) [12:10:05] 10Operations, 10Packaging, 10Scap: SCAP: Upload debian package version 3.7.4-1 - https://phabricator.wikimedia.org/T182347#3827657 (10mmodell) Awesome thank you! [12:10:58] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:11:48] PROBLEM - puppet last run on wtp1038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:14:37] PROBLEM - puppet last run on wtp1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:14:38] PROBLEM - puppet last run on wtp1028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:14:47] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [12:15:09] (03PS2) 10Ema: mtail: port varnishxcps [puppet] - 10https://gerrit.wikimedia.org/r/395578 (https://phabricator.wikimedia.org/T177199) [12:15:27] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.cu_changes: Cant find record in cu_changes, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1070-bin.001620, end_log_pos 686955729 [12:16:18] PROBLEM - puppet last run on wtp1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:16:23] <_joe_> akosiaris: ^^ scap issues on wtp* [12:16:28] <_joe_> maybe stretch? [12:16:47] PROBLEM - puppet last run on wtp1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:18:03] yeah [12:18:06] fixing [12:18:18] PROBLEM - puppet last run on wtp1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:18:57] PROBLEM - puppet last run on wtp1033 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:19:18] PROBLEM - puppet last run on wtp1036 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:19:28] PROBLEM - puppet last run on wtp1043 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [12:20:17] PROBLEM - puppet last run on wtp1041 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [12:21:16] (03PS5) 10Volans: Logging: uniform loggers [software/cumin] - 10https://gerrit.wikimedia.org/r/386399 (https://phabricator.wikimedia.org/T179002) [12:21:17] RECOVERY - puppet last run on wtp1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:21:18] (03PS5) 10Volans: Logging: use % syntax for parameters [software/cumin] - 10https://gerrit.wikimedia.org/r/386400 (https://phabricator.wikimedia.org/T179002) [12:21:20] (03PS1) 10Volans: Tests: workaround for py.test bug [software/cumin] - 10https://gerrit.wikimedia.org/r/397533 [12:21:48] RECOVERY - puppet last run on wtp1030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:21:49] RECOVERY - puppet last run on wtp1038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:22:02] ok fixed [12:22:37] akosiaris: https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed if you want a quick fix ;) [12:23:04] already did [12:23:08] PROBLEM - puppet last run on mw2119 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [12:23:11] albeit not with the failed only [12:23:17] RECOVERY - puppet last run on wtp1048 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:23:20] and only for wtps [12:23:27] RECOVERY - puppet last run on wtp1035 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:23:37] (03CR) 10Filippo Giunchedi: mtail: port varnishxcps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/395578 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [12:23:46] hmm some mws are also stretch but I guess most of these should be ok quite soon [12:23:57] RECOVERY - puppet last run on wtp1033 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:24:18] RECOVERY - puppet last run on wtp1036 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:24:28] RECOVERY - puppet last run on wtp1032 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:24:28] RECOVERY - puppet last run on wtp1043 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:24:38] RECOVERY - puppet last run on wtp1028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:25:08] RECOVERY - puppet last run on wtp1041 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:25:46] (03CR) 10Volans: [C: 032] Tests: workaround for py.test bug [software/cumin] - 10https://gerrit.wikimedia.org/r/397533 (owner: 10Volans) [12:28:35] (03Merged) 10jenkins-bot: Tests: workaround for py.test bug [software/cumin] - 10https://gerrit.wikimedia.org/r/397533 (owner: 10Volans) [12:29:05] (03CR) 10Volans: [C: 032] Logging: uniform loggers [software/cumin] - 10https://gerrit.wikimedia.org/r/386399 (https://phabricator.wikimedia.org/T179002) (owner: 10Volans) [12:29:41] (03CR) 10jenkins-bot: Tests: workaround for py.test bug [software/cumin] - 
10https://gerrit.wikimedia.org/r/397533 (owner: 10Volans) [12:33:33] (03Merged) 10jenkins-bot: Logging: uniform loggers [software/cumin] - 10https://gerrit.wikimedia.org/r/386399 (https://phabricator.wikimedia.org/T179002) (owner: 10Volans) [12:34:18] (03CR) 10jenkins-bot: Logging: uniform loggers [software/cumin] - 10https://gerrit.wikimedia.org/r/386399 (https://phabricator.wikimedia.org/T179002) (owner: 10Volans) [12:35:37] (03CR) 10Volans: [C: 032] Logging: use % syntax for parameters [software/cumin] - 10https://gerrit.wikimedia.org/r/386400 (https://phabricator.wikimedia.org/T179002) (owner: 10Volans) [12:38:11] (03Merged) 10jenkins-bot: Logging: use % syntax for parameters [software/cumin] - 10https://gerrit.wikimedia.org/r/386400 (https://phabricator.wikimedia.org/T179002) (owner: 10Volans) [12:39:00] (03CR) 10jenkins-bot: Logging: use % syntax for parameters [software/cumin] - 10https://gerrit.wikimedia.org/r/386400 (https://phabricator.wikimedia.org/T179002) (owner: 10Volans) [12:39:45] <_joe_> !log trying to get a full core dump from hhvm on mw1200 [12:39:47] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:57] PROBLEM - Apache HTTP on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:40:57] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:42:17] PROBLEM - HHVM rendering on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:42:27] PROBLEM - Nginx local proxy to apache on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:42:37] hello mw1200 [12:42:57] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 4.981 second response time [12:43:17] RECOVERY - Nginx local proxy to apache on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently 
- 617 bytes in 1.390 second response time [12:44:08] RECOVERY - HHVM rendering on mw1200 is OK: HTTP OK: HTTP/1.1 200 OK - 78433 bytes in 4.453 second response time [12:47:01] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3827757 (10akosiaris) Found the following in kern.log ``` Dec 11 08:01:28 ganeti1008 kernel: [858993.252853] ------------[ cut here ]------------ Dec 11 08:01:28 ganeti1008 kerne... [12:48:08] RECOVERY - puppet last run on mw2119 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [12:49:57] PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:48] RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 0.036 second response time on 10.192.0.128 port 9042 [12:52:06] (03PS1) 10Elukey: Rename notebook1002 to kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/397534 (https://phabricator.wikimedia.org/T181518) [12:57:28] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3827779 (10akosiaris) Looking at grafana at the time of the page allocations stalls incident (Dec 9), the box had ~30GB memory used, and ~34GB in buffers (the block level equivale... [13:05:55] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3827811 (10elukey) Updated procedure after a chat with @Volans: 1) shutdown notebook1002 2) replace it in DNS and puppet with kafka1023 (caveat: wmf-auto-reimage will need... [13:06:07] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3827813 (10chasemp) >>! 
In T177196#3802917, @chasemp wrote: > I talked @fgiunchedi into enabling the client collect... [13:08:37] (03PS1) 10Elukey: Prepare the conditions to rename notebook1002 in kafka1023 [dns] - 10https://gerrit.wikimedia.org/r/397539 (https://phabricator.wikimedia.org/T181518) [13:21:41] (03PS1) 10KartikMistry: apertium-kaz: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-kaz] - 10https://gerrit.wikimedia.org/r/397540 (https://phabricator.wikimedia.org/T171406) [13:22:27] (03CR) 10jerkins-bot: [V: 04-1] apertium-kaz: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-kaz] - 10https://gerrit.wikimedia.org/r/397540 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [13:23:32] (03PS1) 10KartikMistry: apertium-kaz-tat: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-kaz-tat] - 10https://gerrit.wikimedia.org/r/397542 (https://phabricator.wikimedia.org/T171406) [13:24:07] (03CR) 10jerkins-bot: [V: 04-1] apertium-kaz-tat: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-kaz-tat] - 10https://gerrit.wikimedia.org/r/397542 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [13:24:19] (03PS2) 10KartikMistry: apertium: Depends on new cg3 [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/397217 (https://phabricator.wikimedia.org/T171406) [13:29:53] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3827860 (10akosiaris) ltpstress finished on ganeti1006. It did not trigger any problem. 
[13:32:17] PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:33:17] RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 7.100 second response time on 10.192.0.128 port 9042 [13:36:18] (03CR) 10Addshore: "Should Wikibase and all other extensions that were included in the Wikidata build be added to this file?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396482 (owner: 10Umherirrender) [13:40:32] 10Operations, 10DBA, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3827867 (10Marostegui) [13:40:35] 10Operations, 10Analytics-Kanban, 10DBA, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3827865 (10Marostegui) 05Open>03Resolved As per our chat, closing this. Thanks for all the hard work you've put to make this happen! [13:43:37] (03CR) 10Filippo Giunchedi: [C: 031] Icinga web: add icons for multiple notes_url items [puppet] - 10https://gerrit.wikimedia.org/r/392606 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [13:45:23] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/392607 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [13:45:44] thanks Filippo! [13:46:53] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397545 [13:46:57] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397545 [13:47:15] np volans ! 
[13:48:06] (03PS1) 10Alexandros Kosiaris: Add 3 prometheus checks for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/397546 (https://phabricator.wikimedia.org/T177395) [13:48:27] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397545 (owner: 10Marostegui) [13:49:53] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397545 (owner: 10Marostegui) [13:50:05] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397545 (owner: 10Marostegui) [13:52:07] PROBLEM - Long running screen/tmux on restbase2004 is CRITICAL: CRIT: Long running SCREEN process. (PID: 30220, 1739870s 1728000s). [13:54:42] (03CR) 10Volans: [C: 04-1] "Missing dashboard links to grafana, they're mandatory ;)" [puppet] - 10https://gerrit.wikimedia.org/r/397546 (https://phabricator.wikimedia.org/T177395) (owner: 10Alexandros Kosiaris) [13:54:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1060 - T174569 (duration: 04m 42s) [13:55:01] (03PS5) 10Revi: Create NS_PROJECT and NS_PROJECT_TALK alias for kowikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487) [13:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:03] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [13:56:37] PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:27] RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 0.036 second response time on 10.192.0.128 port 9042 [14:00:05] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. 
All rise for European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171211T1400). [14:00:05] revi, Hauskatze, and bawolff: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:10] o/ [14:00:10] Woo! [14:00:12] \o/ [14:01:11] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint: maps-test2001 is low on disk space - https://phabricator.wikimedia.org/T182583#3827938 (10Gehel) [14:01:23] o/ [14:02:50] I can SWAT. [14:02:54] Hey all. [14:03:01] good evening! [14:03:16] Evening, revi! Howdy? [14:03:21] cmjohnson1: o/ [14:03:21] ready to go :D [14:03:42] (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487) (owner: 10Revi) [14:03:56] And we're off. [14:04:28] yoohoo [14:04:47] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint: maps-test2001 is low on disk space - https://phabricator.wikimedia.org/T182583#3827938 (10MaxSem) Probably, Cassandra didn't delete old keyspaces. [14:05:03] (03Merged) 10jenkins-bot: Create NS_PROJECT and NS_PROJECT_TALK alias for kowikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487) (owner: 10Revi) [14:05:19] (03PS2) 10Niharika29: Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396714 (https://phabricator.wikimedia.org/T182506) (owner: 10Reedy) [14:05:30] (03CR) 10Niharika29: [C: 032] "SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/396714 (https://phabricator.wikimedia.org/T182506) (owner: 10Reedy) [14:06:47] (03Merged) 10jenkins-bot: Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396714 (https://phabricator.wikimedia.org/T182506) (owner: 10Reedy) [14:06:56] (03PS12) 10TerraCodes: Remove single editor tab for plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045) [14:07:00] (03CR) 10jenkins-bot: Create NS_PROJECT and NS_PROJECT_TALK alias for kowikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396569 (https://phabricator.wikimedia.org/T182487) (owner: 10Revi) [14:07:10] (03PS26) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [14:07:26] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint: maps-test2001 is low on disk space - https://phabricator.wikimedia.org/T182583#3827970 (10Gehel) @MaxSem good point! I'll check [14:08:07] 10Operations, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3827971 (10fgiunchedi) @gehel I still have to figure out what's wrong with jmx_exporter above, but what do you think re: metrics in P6392 ? [14:10:34] (03PS2) 10Alexandros Kosiaris: Add 3 prometheus checks for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/397546 (https://phabricator.wikimedia.org/T177395) [14:10:36] (03PS1) 10Alexandros Kosiaris: Add kubelet operational latencies check [puppet] - 10https://gerrit.wikimedia.org/r/397552 (https://phabricator.wikimedia.org/T177395) [14:10:40] revi: You're live on mwdebug1002. [14:10:57] ok, testin'... [14:11:03] That scap pull took ~6 minutes. 
[14:11:04] (03CR) 10jerkins-bot: [V: 04-1] Add 3 prometheus checks for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/397546 (https://phabricator.wikimedia.org/T177395) (owner: 10Alexandros Kosiaris) [14:12:09] good, 문:관리자 선거 redirects to 위키문헌:관리자 선거 just fine [14:12:20] revi: Okay, syncing it then. [14:13:34] Okay, scap seems broken. [14:13:39] Anyone from ops here? [14:13:45] duh [14:13:47] :O [14:13:54] ... [14:14:02] super real duh [14:14:16] https://www.irccloud.com/pastebin/2p2XWdRg/ [14:14:23] revi: look what you've done [14:14:27] Here's what I got when I started syncing that file. [14:14:36] bawolff: Got any ideas? [14:14:43] (._. [14:15:03] Well marostegui managed to sync something 10 minutes ago, so it must be recent [14:15:18] copying from tin to tin is right? [14:15:22] (03PS1) 10Herron: puppet: change puppet major version to 4 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/397553 (https://phabricator.wikimedia.org/T177254) [14:15:52] Hauskatze: It's copying from tin to all servers, I believe. [14:15:58] All slaves. [14:16:06] None of those actually look like actual error messages, so much as just debugging info [14:16:22] bd808: scap is broken [14:16:44] Its all, starting process, process completed, none of them are "process exploded" [14:16:57] This went on for a bit. Like the actual thing is 10 times this size. [14:17:02] Yeah. But what process? [14:17:03] I've never used scap so I don't know what I should be looking for [14:17:08] same here [14:17:11] I've never seen this before. [14:17:16] (03PS1) 10Zoranzoki21: Disable the wgTranslateNumerals at hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397554 (https://phabricator.wikimedia.org/T182584) [14:17:36] Niharika: scap sync again maybe? 
[14:17:37] It looks like its creating the gitinfo cache stuff, which is used for showing version numbers on Special:Version [14:18:07] note that you've also merged interwiki.php [14:18:14] which is in a different dir [14:18:19] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint: maps-test2001 is low on disk space - https://phabricator.wikimedia.org/T182583#3827988 (10Gehel) It looks like Cassandra does not have enough space to do compaction: ``` ERROR [CompactionExecutor:4465] 2017-12-05 09:24:41,156 CassandraDaemon.java:185 - Ex... [14:18:27] maybe cd to wmf-config and scap again to trigger both folders? [14:18:33] Hauskatze: I've merged it but not pulled it. [14:19:00] bawolff: You think it'd be safe to proceed? [14:19:19] Honestly, I'd wait until confirming with someone who knows scap [14:19:26] Right. [14:19:27] S.C.A.P.: scatter crap around production [14:19:33] (03PS2) 10TerraCodes: Disable the wgTranslateNumerals at hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397554 (https://phabricator.wikimedia.org/T182584) (owner: 10Zoranzoki21) [14:19:33] Who can we ping... [14:19:37] wmopbot: You're new. [14:19:54] 10Operations: Puppet: Setting configtimeout is deprecated - https://phabricator.wikimedia.org/T182585#3827993 (10herron) p:05Triage>03Normal [14:20:09] dereckson: ping? [14:20:52] (._. is what I can say [14:20:53] :P [14:21:03] revi: You jinxed it. :P [14:21:08] !ops [14:21:17] that's for channel ops [14:21:23] Really? Duh. [14:21:29] yeah [14:21:41] We may have to wait until san francisco people wake up [14:21:47] How does one reach ops for emergencies? [14:21:50] yeah, 06:21 [14:21:50] Hi. Please deploy this: https://gerrit.wikimedia.org/r/#/c/397554/ [14:21:54] well no [14:21:56] not really an emergency [14:21:58] scap is gone [14:22:00] Most ops folks are in EU. 
[14:22:04] (03CR) 10Jayprakash12345: [C: 031] Disable the wgTranslateNumerals at hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397554 (https://phabricator.wikimedia.org/T182584) (owner: 10Zoranzoki21) [14:22:09] MaxSem!!! You're a sight for sore eyes. [14:22:11] Deploy patch https://gerrit.wikimedia.org/r/#/c/397554/ please [14:22:19] Zoranzoki21: we're in an emergency [14:22:20] Release engineering team people would be who I'd generally talk to for something like this [14:22:23] Zoranzoki21: Hold your horses. Scap might be broken. [14:22:29] Not an actual emergency ;) [14:22:31] Zoranzoki21: no, we can't [14:22:34] Zoranzoki21: I'll ping you if we get moving again. [14:22:39] hashar: maybe you know a thing about scap? [14:22:41] well, it's semi-emergency then :P [14:22:46] please add your patch to the calendar as everyone else [14:22:47] PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:22:52] what is up ? [14:23:04] (03PS3) 10Alexandros Kosiaris: Add 3 prometheus checks for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/397546 (https://phabricator.wikimedia.org/T177395) [14:23:04] hashar: scap is down [14:23:05] (03PS2) 10Alexandros Kosiaris: Add kubelet operational latencies check [puppet] - 10https://gerrit.wikimedia.org/r/397552 (https://phabricator.wikimedia.org/T177395) [14:23:10] hashar: scap is outputting a bunch of weird debug stuff [14:23:17] Hey hashar.
I tried a snap sync and got https://www.irccloud.com/pastebin/2p2XWdRg/ [14:23:24] scap sync-file* [14:23:32] (03CR) 10jerkins-bot: [V: 04-1] Add 3 prometheus checks for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/397546 (https://phabricator.wikimedia.org/T177395) (owner: 10Alexandros Kosiaris) [14:23:41] pfff [14:23:41] I think that may be related to the scap update today [14:23:50] o/ paladox [14:23:52] I can be only now on IRC [14:23:52] Since it is debugging ores new servers [14:23:53] scap is super verbose iirc [14:24:03] so all that is just the git commands being shown [14:24:09] Zoranzoki21: That's only like 10% of the log. [14:24:14] Goes on for a while. [14:24:17] https://phabricator.wikimedia.org/T181661 [14:24:18] When patch can be deployed? [14:24:23] hashar ^^ [14:24:28] hashar: So this is normal? [14:24:30] https://phabricator.wikimedia.org/T182347 [14:24:38] RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 1.043 second response time on 10.192.0.128 port 9042 [14:24:42] Niharika: I would say yes [14:24:48] seems it is updating the git cache [14:24:53] Taking your word for it then! [14:24:57] that is used to show up the sha1 on Special:Version [14:25:08] is it deployed then? [14:25:12] * hashar looks at https://logstash.wikimedia.org/app/kibana#/dashboard/scap [14:25:33] I'm assuming not yet, cuz... https://tppr.me/Z7XJ6 [14:25:54] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Create NS_PROJECT and NS_PROJECT_TALK alias for kowikisource (T182487) (duration: 00m 56s) [14:25:59] revi: Now it is. [14:26:03] gr8 [14:26:03] \o/ [14:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:05] T182487: Create NS alias for Korean Wikisource - https://phabricator.wikimedia.org/T182487 [14:26:07] Zoranzoki21: Can you put that on the calendar? [14:26:19] there is one sure thing [14:26:24] thanks! 
[14:26:27] scap should not spurts so much output [14:26:40] seems like the logging level is wrong [14:26:40] https://phabricator.wikimedia.org/D907 [14:26:45] hashar: I know, never seen this before. [14:26:46] or someone made it way too verbose [14:26:49] Niharika: I already added on calendar [14:27:07] Niharika: For this time because Jay can help me about mwdebug [14:27:28] hashar see https://phabricator.wikimedia.org/D907 :) [14:27:29] Zoranzoki21: Alright, I'll be on it in a bit. [14:27:39] Niharika: Thank you [14:28:05] Hauskatze: https://gerrit.wikimedia.org/r/#/c/396714/2 is on mwdebug1002. [14:28:11] Niharika: checking [14:28:15] (03PS3) 10Niharika29: wm2018: sysops to add and remove 'translationadmin' from their accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396580 (https://phabricator.wikimedia.org/T182492) (owner: 10MarcoAurelio) [14:28:23] (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396580 (https://phabricator.wikimedia.org/T182492) (owner: 10MarcoAurelio) [14:28:56] there was a new scap release today. IIRC they're using a new library to wrap subprocess commands. I think the verbosity level of that library needs to be reduced is all. [14:29:13] (03PS3) 10Alexandros Kosiaris: Add kubelet operational latencies check [puppet] - 10https://gerrit.wikimedia.org/r/397552 (https://phabricator.wikimedia.org/T177395) [14:29:43] thcipriani|afk: Thank you. [14:29:44] Niharika: is it? can't see [14:29:53] (03Merged) 10jenkins-bot: wm2018: sysops to add and remove 'translationadmin' from their accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396580 (https://phabricator.wikimedia.org/T182492) (owner: 10MarcoAurelio) [14:30:07] Hauskatze: Most definitely is. How are you testing? [14:30:22] Niharika: ah, the interwiki map one, sorry I was looking at the wrong patch [14:30:26] Niharika: I can't test that [14:30:33] Hauskatze: Okay. 
[14:30:34] feel free to deploy the interwiki update [14:30:42] Reedy did it so it's okay [14:31:11] Hauskatze: Your other one is there too. [14:31:12] (03PS4) 10Alexandros Kosiaris: Add 3 prometheus checks for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/397546 (https://phabricator.wikimedia.org/T177395) [14:31:14] (03PS4) 10Alexandros Kosiaris: Add kubelet operational latencies check [puppet] - 10https://gerrit.wikimedia.org/r/397552 (https://phabricator.wikimedia.org/T177395) [14:31:14] Sorry, but is all ok now? [14:31:18] Sorry, but is all ok now? [14:31:25] yes, wait for your turn [14:31:30] Niharika: so checking the wmf2018 one [14:31:32] revi: Ok. Thank you [14:31:46] Niharika: Take it easy, I have enough time. You can give this last priority. [14:32:17] Niharika: lgtm [14:32:19] Jayprakash12345: Thanks! [14:32:33] (y) [14:32:53] !log niharika29@tin Synchronized wmf-config/interwiki.php: Update the Interwiki map - T182506 (duration: 00m 56s) [14:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:04] T182506: Run dumpInterwiki.php - https://phabricator.wikimedia.org/T182506 [14:34:30] (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397554 (https://phabricator.wikimedia.org/T182584) (owner: 10Zoranzoki21) [14:34:37] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Bureaucrats to grant and remove translationadmin rights; sysops to add and remove the same from themselves - T182492 (duration: 00m 56s) [14:34:40] Hauskatze: Both of yours are now synced.
[14:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:47] T182492: wikimania2018: 'translationadmin' to [Add|Remove]GroupsToSelf for sysops - https://phabricator.wikimedia.org/T182492 [14:34:48] thanks [14:34:57] PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:35:12] (03PS3) 10Zoranzoki21: Disable the wgTranslateNumerals at hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397554 (https://phabricator.wikimedia.org/T182584) [14:35:21] (03PS4) 10Niharika29: Disable the wgTranslateNumerals at hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397554 (https://phabricator.wikimedia.org/T182584) (owner: 10Zoranzoki21) [14:35:29] (03CR) 10Niharika29: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397554 (https://phabricator.wikimedia.org/T182584) (owner: 10Zoranzoki21) [14:35:48] RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 1.048 second response time on 10.192.0.128 port 9042 [14:35:56] Niharika29: In same time :D [14:36:23] :) [14:36:50] (03Merged) 10jenkins-bot: Disable the wgTranslateNumerals at hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397554 (https://phabricator.wikimedia.org/T182584) (owner: 10Zoranzoki21) [14:38:16] Jayprakash12345: you can do now what is needed [14:38:30] (03CR) 10Ottomata: [C: 031] kafkatee: remove Ganglia monitoring class and script [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/396088 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [14:38:58] bawolff: Your change is up on mwdebug1002 [14:39:13] Cool, testing :) [14:39:14] Zoranzoki21: Check the Numeral in https://hi.wikiversity.org/wiki/Specical:RecentChanges. [14:39:44] It it 12345 or same other [14:39:48] Jayprakash12345: Ok. Than? 
[14:39:57] Niharika: Works, thank you [14:40:31] Jayprakash12345: I cannot recognize the Hindi language [14:40:37] Zoranzoki21: If it is 123 then ok. [14:40:47] Jayprakash12345: Ok [14:40:51] Zoranzoki21: See only number [14:41:05] Jayprakash12345: Ok [14:41:36] Alright, syncing in a moment. [14:41:58] PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:43:57] RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 7.306 second response time on 10.192.0.128 port 9042 [14:45:16] Zoranzoki21: Server is too slow for me [14:45:22] Is Zoranzoki21's change supposed to be on mwdebug1002, because it doesn't look like it is [14:45:37] bawolff: Nope, it's not supposed to be there yet. [14:45:43] Ah, ok [14:45:51] Jayprakash12345: OMG [14:46:00] Jayprakash12345: But how? [14:46:14] !log niharika29@tin Synchronized php-1.31.0-wmf.11/includes/specials/SpecialUndelete.php: Revert replacing textarea in Special:Undelete with OOUI T182398 (duration: 00m 57s) [14:46:17] Zoranzoki21: Jayprakash12345 Your change is not on mwdebug yet. Please wait. [14:46:21] bawolff: You're live. [14:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:25] T182398: Special:Undelete contains egregious white space after OOUI update - https://phabricator.wikimedia.org/T182398 [14:46:28] thanks [14:46:36] Niharika: ok [14:47:29] Zoranzoki21: Jayprakash12345: Now you're on mwdebug1002. Please test. [14:47:29] ..... [14:47:37] Niharika: ok [14:47:41] Zoranzoki21: Patience is a virtue, you know. [14:47:43] Niharika: I closed and reopened the browser 15-20 times. I thought that the patch was on mwdebug. [14:47:53] Well, I never said it is. [14:47:57] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 35903248 [14:48:44] Niharika: Everything is fine. Sync please [14:48:51] Jayprakash12345: On it.
[14:48:57] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 939232 [14:49:24] (03PS2) 10Giuseppe Lavagetto: wmflib: use string for parameter of package, not symbol [puppet] - 10https://gerrit.wikimedia.org/r/395695 [14:49:37] (03PS1) 10Elukey: Allow TLS/SSL configurations [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/397557 [14:50:09] (03PS3) 10Volans: Icinga web: add icons for multiple notes_url items [puppet] - 10https://gerrit.wikimedia.org/r/392606 (https://phabricator.wikimedia.org/T170353) [14:51:34] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Disable the wgTranslateNumerals at hiwikiversity T182584 (duration: 00m 56s) [14:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:44] T182584: Disable the wgTranslateNumerals at hiwikiversity - https://phabricator.wikimedia.org/T182584 [14:51:44] Jayprakash12345: You're live. [14:51:47] All done. [14:51:51] * Niharika dusts off hands [14:52:00] Niharika: Thanks [14:52:41] (03CR) 10Volans: [C: 032] Icinga web: add icons for multiple notes_url items [puppet] - 10https://gerrit.wikimedia.org/r/392606 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [14:53:24] 10Operations, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3828220 (10Gehel) >>! In T181627#3803312, @fgiunchedi wrote: > I tried jmx_exporter on deployment-logstash2 with the results below. A few notes: the... 
[14:53:37] PROBLEM - Disk space on lawrencium is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/c1dc9ea10b1d7fc55a9778b0abd9894ccd3eb7520b928bf68a1a626d9304fd16/merged is not accessible: Permission denied [14:54:22] (03PS2) 10Elukey: Allow TLS/SSL configurations [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/397557 [14:55:58] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 47524280 [14:59:56] hi elukey [15:02:55] (03PS3) 10Elukey: Allow TLS/SSL configurations [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/397557 [15:04:26] 10Operations, 10ops-ulsfo, 10Traffic: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050#3828227 (10fgiunchedi) More details on the Prometheus part: [] Allow bast4002 as an additional Prometheus host (https://gerrit.wikimedia.org/r/#/c/393943/) [] rsync Prometheus data from bast4001 to bast40... [15:06:07] (03PS4) 10Elukey: Allow TLS/SSL configurations [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/397557 [15:08:48] (03PS1) 10Andrew Bogott: labpuppetmaster1001 and 1002: move to puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/397560 [15:09:35] (03CR) 10Andrew Bogott: [C: 032] labpuppetmaster1001 and 1002: move to puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/397560 (owner: 10Andrew Bogott) [15:12:15] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler03/9273/cp1048.eqiad.wmnet/" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/397557 (owner: 10Elukey) [15:12:33] ottomata: --^ [15:14:38] PROBLEM - DPKG on labpuppetmaster1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:15:17] PROBLEM - puppetmaster https on labpuppetmaster1001 is CRITICAL: connect to address 208.80.154.158 and port 8140: Connection refused [15:15:18] PROBLEM - puppetmaster backend https on labpuppetmaster1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 
Internal Server Error [15:15:37] PROBLEM - puppetmaster backend https on labpuppetmaster1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error [15:15:38] RECOVERY - DPKG on labpuppetmaster1001 is OK: All packages OK [15:17:31] (03CR) 10Ottomata: [C: 031] "Nits! But +1 once addressed! :)" (032 comments) [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/397557 (owner: 10Elukey) [15:18:22] ottomata: you +1'd it anyway :P [15:18:52] (03CR) 10Awight: "@akosiaris, I'm on-board with your suggestion to make the number of workers the same across all servers. Your suggestion of 48 would mean" [puppet] - 10https://gerrit.wikimedia.org/r/396064 (https://phabricator.wikimedia.org/T182249) (owner: 10Halfak) [15:19:21] Reedy: ya, then elukey can merge after he addresses without me having to review again :) [15:22:52] Reedy: Andrew is a nice guy! :D [15:23:28] PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:24:17] (03PS5) 10Elukey: Allow TLS/SSL configurations [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/397557 [15:24:21] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9268/" [puppet] - 10https://gerrit.wikimedia.org/r/395695 (owner: 10Giuseppe Lavagetto) [15:25:27] RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 3.072 second response time on 10.192.0.128 port 9042 [15:27:15] has anyone looked at ^^^^ ? [15:27:52] urandom: yes, I'm on it... (maps-test, so no emergency here) I should silence them... 
[15:28:37] gehel: kk; let me know if you need anything [15:29:16] RECOVERY - puppetmaster https on labpuppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 399 bytes in 0.052 second response time [15:29:18] urandom: we don't have enough disk space for cassandra to compact, so we're losing even more disk space :( [15:29:25] RECOVERY - puppetmaster backend https on labpuppetmaster1002 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 0.019 second response time [15:29:50] urandom: but those are test servers, with smaller disks than production. Which also should be decommissioned. [15:30:08] urandom: If you have a magic way to recover some space from cassandra, I'd love to hear about it! [15:30:08] gehel: oh, yeah [15:30:25] gehel: has reserved space been disabled on the fs? [15:30:25] RECOVERY - puppetmaster backend https on labpuppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 0.020 second response time [15:31:10] urandom: it looks like everything is already allocated [15:31:24] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler03/9274/cp1048.eqiad.wmnet/" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/397557 (owner: 10Elukey) [15:31:45] urandom: we should just reimage them with a subset of all production data [15:32:03] even better, we should do that on WMCS [15:32:36] ACKNOWLEDGEMENT - Postgres Replication Lag on maps-test2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 131857808 Gehel vacuum in progress [15:32:36] WMCS? [15:33:15] Cloud Service, previously lab ... [15:33:28] I thought that was the new name... [15:34:20] Toolserver! lets complete the renaming circle ;) [15:34:28] production data should not arrive to cloud [15:35:08] jynus: yeah, it's not actually production data. It is OSM data that we import both on production servers and on test servers.
[15:35:28] Production data was a shortcut to "a data set of the same size as production" [15:35:29] (03PS3) 10Ema: mtail: port varnishxcps [puppet] - 10https://gerrit.wikimedia.org/r/395578 (https://phabricator.wikimedia.org/T177199) [15:35:31] gehel: ah, ok [15:35:37] 10Operations, 10MediaWiki-Configuration, 10discovery-system: [DRAFT] Use EtcdConfig in production to allow automation of a datacenter switch - https://phabricator.wikimedia.org/T182597#3828259 (10Joe) [15:35:46] then there is no reason to have that in production [15:36:09] (03PS1) 10Elukey: modules::varnishafka: update to latest sha [puppet] - 10https://gerrit.wikimedia.org/r/397562 [15:36:20] !log Deploy schema change on s2 master (db1054) - T174569 [15:36:28] jynus: to have what in production? Not sure I follow... [15:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:32] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [15:36:39] gehel: just ignore my comments [15:36:58] * gehel is adding a spam filter rule for jynus :) [15:38:29] (03CR) 10Ema: mtail: port varnishxcps (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/395578 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [15:39:54] (03CR) 10Elukey: [C: 032] "pcc https://puppet-compiler.wmflabs.org/compiler03/9275/cp1048.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/397562 (owner: 10Elukey) [15:41:55] (03PS2) 10Giuseppe Lavagetto: [WiP] Create an envoy docker image. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/396021 [15:43:36] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 187184 [15:51:47] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 38749864 [15:52:07] ^ I'm silencing that one as well... 
[15:56:46] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 58488 [16:05:16] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3828368 (10BBlack) It's a pain any direction we slice this, and I'm not fond of adding new canonical domains outside the known set for indivi... [16:07:18] 10Operations, 10monitoring: Upgrade grafana to 4.6.2 - https://phabricator.wikimedia.org/T182294#3828387 (10fgiunchedi) a:03fgiunchedi [16:10:08] (03CR) 10Herron: [C: 032] puppet: change puppet major version to 4 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/397553 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [16:10:16] (03PS2) 10Herron: puppet: change puppet major version to 4 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/397553 (https://phabricator.wikimedia.org/T177254) [16:23:33] (03PS6) 10Awight: Refactor web workers for ORES [puppet] - 10https://gerrit.wikimedia.org/r/396064 (https://phabricator.wikimedia.org/T182249) (owner: 10Halfak) [16:30:02] 10Operations, 10Analytics, 10Analytics-Cluster: stat1004 - /mnt/hdfs is not accessible - https://phabricator.wikimedia.org/T182342#3828452 (10Ottomata) 05Open>03Resolved a:03Ottomata Followed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Fixing_HDFS_mount_at_/mnt/h... 
[16:31:02] (03PS7) 10Awight: Refactor web workers for ORES [puppet] - 10https://gerrit.wikimedia.org/r/396064 (https://phabricator.wikimedia.org/T182249) (owner: 10Halfak) [16:31:15] akosiaris: ^ ready for your re-review [16:34:00] (03PS1) 10Mobrovac: Remove the Trending Edits service from production [puppet] - 10https://gerrit.wikimedia.org/r/397571 (https://phabricator.wikimedia.org/T180384) [16:34:35] (03CR) 10jerkins-bot: [V: 04-1] Remove the Trending Edits service from production [puppet] - 10https://gerrit.wikimedia.org/r/397571 (https://phabricator.wikimedia.org/T180384) (owner: 10Mobrovac) [16:37:13] (03PS2) 10Mobrovac: Remove the Trending Edits service from production [puppet] - 10https://gerrit.wikimedia.org/r/397571 (https://phabricator.wikimedia.org/T180384) [16:40:55] (03CR) 10jenkins-bot: Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396714 (https://phabricator.wikimedia.org/T182506) (owner: 10Reedy) [16:40:57] (03CR) 10jenkins-bot: wm2018: sysops to add and remove 'translationadmin' from their accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396580 (https://phabricator.wikimedia.org/T182492) (owner: 10MarcoAurelio) [16:40:59] (03CR) 10jenkins-bot: Disable the wgTranslateNumerals at hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397554 (https://phabricator.wikimedia.org/T182584) (owner: 10Zoranzoki21) [16:41:05] oh hello jenkins-bot [16:41:07] Nice of you to join us [16:42:21] !log Restarting Jenkins [16:42:31] ah [16:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:33] hahaha [16:45:52] (03CR) 10Mobrovac: "PCC ok - https://puppet-compiler.wmflabs.org/compiler02/9279/" [puppet] - 10https://gerrit.wikimedia.org/r/397571 (https://phabricator.wikimedia.org/T180384) (owner: 10Mobrovac) [16:47:56] (03Abandoned) 10Ema: First stab documenting HFP/HFM cases [puppet] - 10https://gerrit.wikimedia.org/r/386895 (owner: 10BBlack) [16:48:15] RECOVERY - MariaDB 
Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [16:50:16] (03CR) 10Filippo Giunchedi: mtail: port varnishxcps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/395578 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [16:51:44] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3828534 (10Dzahn) >>! In T169450#3827175, @MarcoAurelio wrote: > @Dzahn (not sure if you handle this stuff): Sorry, i don't. [16:54:04] 10Operations, 10Security-Team, 10Wikimedia-General-or-Unknown, 10WorkType-NewFunctionality: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860#3828538 (10Dzahn) [16:54:46] 10Operations, 10Security-Team, 10Wikimedia-General-or-Unknown, 10WorkType-NewFunctionality: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860#426408 (10Dzahn) Adding Security-Team. What do you guys think about such a key nowaday... [16:59:21] 10Operations, 10Security-Team, 10Wikimedia-General-or-Unknown, 10WorkType-NewFunctionality: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860#3828554 (10Bawolff) Speaking just for myself and not the team. I think such a thing ma... 
[17:00:55] (03PS1) 10Andrew Bogott: puppetmaster: include @extra_auth_rules in v4 auth.conf [puppet] - 10https://gerrit.wikimedia.org/r/397575 (https://phabricator.wikimedia.org/T178717) [17:06:01] (03CR) 10Andrew Bogott: [C: 032] puppetmaster: include @extra_auth_rules in v4 auth.conf [puppet] - 10https://gerrit.wikimedia.org/r/397575 (https://phabricator.wikimedia.org/T178717) (owner: 10Andrew Bogott) [17:08:32] 10Operations, 10Discovery-Search (Current work), 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3828579 (10Gehel) [17:09:02] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3828587 (10Gehel) [17:09:04] 10Operations, 10monitoring, 10Discovery-Search (Current work): port elasticsearch diamond collector to prometheus - https://phabricator.wikimedia.org/T175799#3828583 (10Gehel) 05Open>03declined This is a duplicate of T181627 [17:17:47] (03CR) 10Krinkle: mediawiki/hhvm: Move fatal-error.php to Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [17:22:37] 10Operations, 10Trending-Service, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3828638 (10Joe) [17:24:19] (03PS4) 10Gehel: admin: allow Paul Norman (pnorman) to deploy kartotherian / tilerator [puppet] - 10https://gerrit.wikimedia.org/r/395481 (https://phabricator.wikimedia.org/T182066) [17:24:20] 10Operations, 10Ops-Access-Requests, 10Maps-Sprint, 10Patch-For-Review: Requesting access to deploy-service for pnorman - https://phabricator.wikimedia.org/T182066#3828642 (10Gehel) This has been approved in Ops meeting as well. 
[17:25:11] (03CR) 10Gehel: [C: 032] admin: allow Paul Norman (pnorman) to deploy kartotherian / tilerator [puppet] - 10https://gerrit.wikimedia.org/r/395481 (https://phabricator.wikimedia.org/T182066) (owner: 10Gehel) [17:36:10] (03PS13) 10TerraCodes: Remove single editor tab for plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045) [17:36:18] (03PS27) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [17:36:59] (03PS1) 10Andrew Bogott: WMCS puppetmaster: modify auth.conf extra rules to account for url changes [puppet] - 10https://gerrit.wikimedia.org/r/397579 (https://phabricator.wikimedia.org/T178717) [17:37:34] (03CR) 10jerkins-bot: [V: 04-1] WMCS puppetmaster: modify auth.conf extra rules to account for url changes [puppet] - 10https://gerrit.wikimedia.org/r/397579 (https://phabricator.wikimedia.org/T178717) (owner: 10Andrew Bogott) [17:40:30] (03CR) 10Andrew Bogott: [V: 032 C: 032] WMCS puppetmaster: modify auth.conf extra rules to account for url changes [puppet] - 10https://gerrit.wikimedia.org/r/397579 (https://phabricator.wikimedia.org/T178717) (owner: 10Andrew Bogott) [17:48:06] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:48:43] (03PS1) 10Ppchelko: Disable Redis queue for small projects and ru and ceb wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397581 (https://phabricator.wikimedia.org/T182023) [17:49:50] (03PS1) 10EBernhardson: Setup MLR AB test for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397582 [17:50:01] (03PS2) 10EBernhardson: Setup MLR AB test for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397582 [17:51:15] 10Operations, 10ORES, 10Patch-For-Review, 10Performance, 10Scoring-platform-team (Current): Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3828705 (10awight) [17:51:29] 10Operations, 10ORES, 10Patch-For-Review, 10Performance, 10Scoring-platform-team (Current): Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3817970 (10awight) p:05Triage>03Normal [17:53:06] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [18:00:04] gehel: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171211T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:00:58] jouncebot: o/ [18:01:13] TIL jouncebot has dad jokes [18:01:37] lol, that's actually kind of funny [18:02:26] !log gehel@tin Started deploy [wdqs/wdqs@353b3cb]: wdqs: GUI and updater updates [18:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:08] (03PS1) 10Andrew Bogott: profile::puppetmaster::backend: fix extra_auth_rules for backend masters [puppet] - 10https://gerrit.wikimedia.org/r/397584 (https://phabricator.wikimedia.org/T178717) [18:03:39] !log gehel@tin Finished deploy [wdqs/wdqs@353b3cb]: wdqs: GUI and updater updates (duration: 01m 14s) [18:03:40] (03CR) 10jerkins-bot: [V: 04-1] profile::puppetmaster::backend: fix extra_auth_rules for backend masters [puppet] - 10https://gerrit.wikimedia.org/r/397584 (https://phabricator.wikimedia.org/T178717) (owner: 10Andrew Bogott) [18:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:18] SMalyshev: ^ wdqs update completed, updater restarted, tests are green [18:04:32] (03CR) 10Andrew Bogott: [V: 032 C: 032] profile::puppetmaster::backend: fix extra_auth_rules for backend masters [puppet] - 10https://gerrit.wikimedia.org/r/397584 (https://phabricator.wikimedia.org/T178717) (owner: 10Andrew Bogott) [18:10:50] gehel: great [18:11:16] gehel: we should watch updater lag for a bit to see if the fix has worked [18:15:37] (03PS1) 10Andrew Bogott: shinken: replace function_hiera with call_function(:hiera) [puppet] - 10https://gerrit.wikimedia.org/r/397588 [18:16:05] SMalyshev: I'm watching the logs on wdqs2003 and the grafana dashboard... 
[18:16:38] (03CR) 10Andrew Bogott: [C: 032] shinken: replace function_hiera with call_function(:hiera) [puppet] - 10https://gerrit.wikimedia.org/r/397588 (owner: 10Andrew Bogott) [18:18:48] (03PS1) 10Andrew Bogott: Torrus: replace function_hiera with call_function(:hiera) [puppet] - 10https://gerrit.wikimedia.org/r/397590 [18:19:18] (03PS3) 10Dzahn: role::ci::slave::browsertests: Fix $redis_port by adding string [puppet] - 10https://gerrit.wikimedia.org/r/394096 (owner: 10Paladox) [18:19:22] (03CR) 10DCausse: [C: 031] Setup MLR AB test for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397582 (owner: 10EBernhardson) [18:20:10] (03CR) 10Dzahn: [C: 032] role::ci::slave::browsertests: Fix $redis_port by adding string [puppet] - 10https://gerrit.wikimedia.org/r/394096 (owner: 10Paladox) [18:21:42] (03CR) 10Dzahn: [C: 032] Move contint::browsertests to a profile [puppet] - 10https://gerrit.wikimedia.org/r/392956 (owner: 10Hashar) [18:22:35] (03CR) 10Dzahn: [C: 032] Install hhvm dev packages from the profile [puppet] - 10https://gerrit.wikimedia.org/r/392929 (owner: 10Hashar) [18:22:47] (03PS3) 10Dzahn: Install hhvm dev packages from the profile [puppet] - 10https://gerrit.wikimedia.org/r/392929 (owner: 10Hashar) [18:22:58] (03PS4) 10Dzahn: contint: Install hhvm dev packages from the profile [puppet] - 10https://gerrit.wikimedia.org/r/392929 (owner: 10Hashar) [18:24:23] 10Operations, 10ORES, 10Scoring-platform-team: Investigate why ORES logs are being written to syslog despite explicit logging config - https://phabricator.wikimedia.org/T182614#3828863 (10awight) p:05Triage>03Normal [18:24:31] (03PS3) 10Dzahn: Move contint::browsertests to a profile [puppet] - 10https://gerrit.wikimedia.org/r/392956 (owner: 10Hashar) [18:24:58] (03PS2) 10Andrew Bogott: Torrus: replace function_hiera with call_function(:hiera) [puppet] - 10https://gerrit.wikimedia.org/r/397590 [18:25:45] (03PS2) 10Bmansurov: Enable Page Previews EventLogging instrumentation [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/395053 (https://phabricator.wikimedia.org/T181493) (owner: 10Jdlrobson) [18:26:12] 10Operations, 10ORES, 10Scoring-platform-team: Investigate why ORES logs are being written to syslog despite explicit logging config. Fix. - https://phabricator.wikimedia.org/T182614#3828863 (10awight) [18:26:13] (03PS3) 10Andrew Bogott: Torrus: replace function_hiera with call_function(:hiera) [puppet] - 10https://gerrit.wikimedia.org/r/397590 [18:27:26] (03CR) 10Andrew Bogott: [C: 032] Torrus: replace function_hiera with call_function(:hiera) [puppet] - 10https://gerrit.wikimedia.org/r/397590 (owner: 10Andrew Bogott) [18:27:42] (03CR) 10Dzahn: [C: 032] Move contint::browsers to a profile [puppet] - 10https://gerrit.wikimedia.org/r/392976 (owner: 10Hashar) [18:27:51] (03PS3) 10Dzahn: Move contint::browsers to a profile [puppet] - 10https://gerrit.wikimedia.org/r/392976 (owner: 10Hashar) [18:52:30] (03CR) 10Bmansurov: "Deploying soon as the blocker has been resolved." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395053 (https://phabricator.wikimedia.org/T181493) (owner: 10Jdlrobson) [18:52:49] (03CR) 10Mobrovac: [C: 032] Disable Redis queue for small projects and ru and ceb wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397581 (https://phabricator.wikimedia.org/T182023) (owner: 10Ppchelko) [18:54:15] (03Merged) 10jenkins-bot: Disable Redis queue for small projects and ru and ceb wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397581 (https://phabricator.wikimedia.org/T182023) (owner: 10Ppchelko) [18:54:41] !log ppchelko@tin Started deploy [cpjobqueue/deploy@e1075af]: Enable htmlCacheUpdate for ceb and ru wiki and small projects T182023 [18:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:51] T182023: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023 [18:55:15] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@e1075af]: Enable htmlCacheUpdate for 
ceb and ru wiki and small projects T182023 (duration: 00m 34s) [18:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:46] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Switch cebwiki, ruwiki and small projects to Kafka for htmlCacheUpdate - T182023 (duration: 00m 57s) [18:56:48] (03CR) 10jenkins-bot: Disable Redis queue for small projects and ru and ceb wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397581 (https://phabricator.wikimedia.org/T182023) (owner: 10Ppchelko) [18:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:27] (03PS1) 10Zoranzoki21: Added throttle rule for McGill University Library [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397599 (https://phabricator.wikimedia.org/T182613) [19:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171211T1900). [19:00:04] James_F, ebernhardson, and bmansurov: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:11] here [19:00:14] \o [19:00:43] Hi [19:00:45] I am here [19:01:02] I added patch https://gerrit.wikimedia.org/r/#/c/397599/ in 19:59 for this swat [19:01:31] Who doing current swat? [19:03:46] I’m here too. [19:04:00] ebernhardson: You SWATing too? [19:04:32] James_F: i guess i can ship it out [19:04:41] Cool. 
[19:05:06] First my :D [19:05:11] I am joking [19:05:14] (03PS1) 10Ottomata: Add tbayer to statistics-admins group [puppet] - 10https://gerrit.wikimedia.org/r/397600 (https://phabricator.wikimedia.org/T182027) [19:05:22] Do how you can [19:05:37] (03CR) 10EBernhardson: [C: 032] Switch submit button from 'save' to 'publish' on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392886 (owner: 10Jforrester) [19:05:43] (03CR) 10Ottomata: [C: 032] "Discussed in ops meeting today" [puppet] - 10https://gerrit.wikimedia.org/r/397600 (https://phabricator.wikimedia.org/T182027) (owner: 10Ottomata) [19:05:54] (03PS2) 10Zoranzoki21: Added throttle rule for McGill University Library [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397599 (https://phabricator.wikimedia.org/T182613) [19:07:00] (03Draft1) 10Paladox: contint: Remove duplicate Class[Contint::Packages::Ruby] [puppet] - 10https://gerrit.wikimedia.org/r/397601 [19:07:08] (03PS2) 10Paladox: contint: Remove duplicate Class[Contint::Packages::Ruby] [puppet] - 10https://gerrit.wikimedia.org/r/397601 [19:07:16] (03CR) 10Dzahn: [C: 032] kafkatee: remove Ganglia monitoring class and script [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/396088 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [19:09:33] (03PS1) 10Ottomata: Add tbayer to statistics-admins [puppet] - 10https://gerrit.wikimedia.org/r/397602 (https://phabricator.wikimedia.org/T182027) [19:09:54] (03CR) 10Ottomata: [C: 032] "Approved in ops meeting today." [puppet] - 10https://gerrit.wikimedia.org/r/397602 (https://phabricator.wikimedia.org/T182027) (owner: 10Ottomata) [19:10:06] Is all ok with my patch? 
[19:10:13] https://gerrit.wikimedia.org/r/#/c/397599/ [19:10:34] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3829035 (10mmodell) @awight: can you try it now with the -v flag, scap should include the ssh logs in verbose output now. [19:10:43] Zoranzoki21: probably yea, i just like to merge them one at a time so they are easier to ship out individually [19:11:15] eberhardson: ok [19:11:26] ebernhandson: ok [19:11:35] (03PS2) 10EBernhardson: Switch submit button from 'save' to 'publish' on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392886 (owner: 10Jforrester) [19:11:47] (03CR) 10EBernhardson: [C: 032] Switch submit button from 'save' to 'publish' on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392886 (owner: 10Jforrester) [19:11:58] of course i always forget mediawiki-config often needs a rebase for merge to work ... [19:12:04] 10Operations, 10media-storage: Requesting access to swift for Phabricator's git-lfs storage - https://phabricator.wikimedia.org/T182085#3829042 (10mmodell) We have a phabricator instance in it's own project, however, I've never managed to maintain one in deployment-prep. Can we test it across multiple cloud pr... 
[19:13:13] (03Merged) 10jenkins-bot: Switch submit button from 'save' to 'publish' on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392886 (owner: 10Jforrester) [19:13:16] (03Draft1) 10Paladox: contint: Remove duplicate Class[Contint::Browsers] [puppet] - 10https://gerrit.wikimedia.org/r/397603 [19:13:18] (03PS2) 10Paladox: contint: Remove duplicate Class[Contint::Browsers] [puppet] - 10https://gerrit.wikimedia.org/r/397603 [19:13:25] (03CR) 10jenkins-bot: Switch submit button from 'save' to 'publish' on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392886 (owner: 10Jforrester) [19:14:01] James_F: your update is on mwdebug1002 [19:14:08] (03CR) 10Dzahn: "how is it possible that "Duplicate declaration: Class[Contint::Browsers] is already declared" happens if the previous changes were already" [puppet] - 10https://gerrit.wikimedia.org/r/397603 (owner: 10Paladox) [19:14:50] (03CR) 10Dzahn: [C: 04-1] "this change only adds one comment line" [puppet] - 10https://gerrit.wikimedia.org/r/397603 (owner: 10Paladox) [19:15:40] (03PS3) 10Zoranzoki21: contint: Remove duplicate Class[Contint::Browsers] [puppet] - 10https://gerrit.wikimedia.org/r/397603 (owner: 10Paladox) [19:16:13] ebernhardson: My computer crashed and is restarting. Can you just push it out? Config change already live on all the others… [19:16:18] James_F: ok [19:16:40] !log ebernhardson@tin Synchronized php-1.31.0-wmf.11/extensions/CirrusSearch/includes/Search/RescoreBuilders.php: SWAT: Add query string for running Cirrus MLR pre-deploy checks (duration: 00m 57s) [19:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:59] bmansurov: your patch is -1'd by pheudx, can you verify its unblocked? [19:17:04] (03PS4) 10Paladox: contint: Remove duplicate Class[Contint::Browsers] [puppet] - 10https://gerrit.wikimedia.org/r/397603 [19:17:15] ebernhardson: yes it is. 
the task has been resolved [19:17:33] I also talked to phuedx recently [19:17:38] (03PS3) 10EBernhardson: Added throttle rule for McGill University Library [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397599 (https://phabricator.wikimedia.org/T182613) (owner: 10Zoranzoki21) [19:17:43] (03CR) 10EBernhardson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397599 (https://phabricator.wikimedia.org/T182613) (owner: 10Zoranzoki21) [19:17:54] o/ [19:17:56] !log awight@tin Started deploy [ores/deploy@1c0ede0]: (non-production) Testing parallel ORES deployment, T181661 [19:17:56] beg pardon [19:17:58] i'll remove the -1 [19:18:01] phuedx: thanks! [19:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:06] T181661: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661 [19:18:23] ebernhardson: Thank you for +2 on my patch [19:19:07] !log awight@tin Finished deploy [ores/deploy@1c0ede0]: (non-production) Testing parallel ORES deployment, T181661 (duration: 01m 12s) [19:19:11] (03Merged) 10jenkins-bot: Added throttle rule for McGill University Library [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397599 (https://phabricator.wikimedia.org/T182613) (owner: 10Zoranzoki21) [19:19:13] (03CR) 10Phuedx: "FTR T180036 has been resolved." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395053 (https://phabricator.wikimedia.org/T181493) (owner: 10Jdlrobson) [19:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:18] (03CR) 10jenkins-bot: Added throttle rule for McGill University Library [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397599 (https://phabricator.wikimedia.org/T182613) (owner: 10Zoranzoki21) [19:20:04] ebernhardson: What is next for doing to patch be complete deployed? 
[19:20:14] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3829073 (10mobrovac) >>! In T182023#3827651, @mobrovac wrote: > Switching all small non-WP projects should be a non-brainer, so I'd vote for switching them + `... [19:20:52] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3829075 (10awight) Done. Logs are in scap-sync-2017-12-09-0004.log (sic., note that the dates are still misleading) Thanks for t... [19:20:52] Zoranzoki21: i have to sync it out, and watch the logs. james's patch is still syncing (sync seems a little slower than in the past and spams a bunch of logs...but i havn't done code deploys for a month or more) [19:21:19] ebernhardson: ok. Thank you for deploying patch [19:21:32] ebernhardson: Now only to wait technology to do it complete ;) [19:21:43] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: T131132: Switch submit button from save to publish on enwiki (duration: 02m 43s) [19:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:54] T131132: Re-label the "Save" button to be "Publish", to better indicate to users the outcomes of their action - https://phabricator.wikimedia.org/T131132 [19:22:14] ebernhardson: Now I [19:22:48] (03CR) 10EBernhardson: [C: 032] Enable Page Previews EventLogging instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395053 (https://phabricator.wikimedia.org/T181493) (owner: 10Jdlrobson) [19:22:51] (03PS3) 10EBernhardson: Enable Page Previews EventLogging instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395053 (https://phabricator.wikimedia.org/T181493) (owner: 10Jdlrobson) [19:22:58] (03CR) 10EBernhardson: [C: 032] Enable Page Previews EventLogging instrumentation [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/395053 (https://phabricator.wikimedia.org/T181493) (owner: 10Jdlrobson) [19:24:24] (03Merged) 10jenkins-bot: Enable Page Previews EventLogging instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395053 (https://phabricator.wikimedia.org/T181493) (owner: 10Jdlrobson) [19:24:34] !log ebernhardson@tin Synchronized wmf-config/throttle.php: SWAT: T182613 Update throttle rule for McGill University Library (duration: 00m 56s) [19:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:48] T182613: Add throttle rule for McGill University Library - https://phabricator.wikimedia.org/T182613 [19:24:51] Zoranzoki21: ok you should be all ready to go [19:25:57] bmansurov: your config change is up on mwdebug1002 [19:25:58] ebernhardson: Thank you very much [19:26:38] !log mobrovac@tin Started deploy [restbase/deploy@bce2885]: Expose the Reading Lists end points - T181107 [19:26:40] (03CR) 10jenkins-bot: Enable Page Previews EventLogging instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395053 (https://phabricator.wikimedia.org/T181493) (owner: 10Jdlrobson) [19:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:48] T181107: Deploy Reading Lists Service to production - https://phabricator.wikimedia.org/T181107 [19:26:50] ebernhardson: OK, thanks. I'll need about 5 mins to verify that events are being recorded. [19:28:05] !log mobrovac@tin Finished deploy [restbase/deploy@bce2885]: Expose the Reading Lists end points - T181107 (duration: 01m 26s) [19:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:34] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3829109 (10mmodell) Something really strange is going on... ``` debug1: Offering RSA public key: /etc/keyholder.d/deploy_servic... 
[19:28:35] !log mobrovac@tin Started deploy [restbase/deploy@bce2885]: Expose the Reading Lists end points, take #2 - T181107 [19:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:15] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/data/lists/{id}/entries/ (test for /en.wikipedia.org/v1/data/lists/{id}/entries/) is CRITICAL: Test test for /en.wikipedia.org/v1/data/lists/{id}/entries/ returned the unexpected status 404 (expecting: 200): /en.wikipedia.org/v1/data/lists/ (test for /en.wikipedia.org/v1/data/lists/) is CRITICAL: Test test for /en.wikipedia.org/v1/data/lists/ [19:30:17] cted status 500 (expecting: 200): /en.wikipedia.org/v1/data/lists/changes/since/{date} (test for /en.wikipedia.org/v1/data/lists/changes/since/{date}) is CRITICAL: Test test for /en.wikipedia.org/v1/data/lists/changes/since/{date} returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/data/lists/pages/{project}/{title} (test for /en.wikipedia.org/v1/data/lists/pages/{project}/{title}) is CRITICAL: Test test f [19:30:17] g/v1/data/lists/pages/{project}/{title} returned the unexpected status 500 (expecting: 200) [19:30:30] known ^ [19:30:52] heh, thanks. except normally icinga messages dont span 3 lines :p [19:31:07] don't recall seeing that [19:33:35] tgr: the route tests are failing ^ [19:34:07] !log mobrovac@tin Finished deploy [restbase/deploy@bce2885]: Expose the Reading Lists end points, take #2 - T181107 (duration: 05m 33s) [19:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:18] T181107: Deploy Reading Lists Service to production - https://phabricator.wikimedia.org/T181107 [19:34:36] hm, I thought I removed those [19:35:19] mobrovac: that's from x-monitor, right? 
[19:35:36] yeah, looks like some of them are missing [19:35:41] i'll add them and re-deploy [19:35:53] (03PS1) 10Dzahn: kafkatee: bump submodule [puppet] - 10https://gerrit.wikimedia.org/r/397607 [19:38:07] ottomata: hey, could you check if i'm doing it right, is this correct submodule bump or not: https://gerrit.wikimedia.org/r/#/c/397607/ in git log it seems i am also adding your last 2 changes, have you bumped it since then? [19:39:17] mutante: ya you can bump to latest [19:39:30] ottomata: ok, thanks [19:39:32] those changes will be no-ops unless someone uses [19:39:37] alright, good [19:40:21] (03CR) 10Dzahn: [C: 032] kafkatee: bump submodule [puppet] - 10https://gerrit.wikimedia.org/r/397607 (owner: 10Dzahn) [19:40:55] !log mobrovac@tin Started deploy [restbase/deploy@be7d72f]: Expose the Reading Lists end points, take #3 - T181107 [19:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:04] T181107: Deploy Reading Lists Service to production - https://phabricator.wikimedia.org/T181107 [19:41:55] (03CR) 10Dzahn: [C: 04-1] "the issue is most likely just because puppet code isn't up2date on this master. there were changes to this earlier today. please ensure pu" [puppet] - 10https://gerrit.wikimedia.org/r/397603 (owner: 10Paladox) [19:42:25] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [19:43:03] ebernhardson: looks like the change is working. I'll wait and see if I can gather more data, but so far so good. [19:43:21] ebernhardson: can you push the change to everywhere? 
[19:43:48] !log lower compaction throughput to 2 MB/s, restbase1010-{a,b,c} - T178177 [19:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:59] T178177: Investigate aberrant Cassandra columnfamily read latency of restbase101{0,2,4} - https://phabricator.wikimedia.org/T178177 [19:44:25] (03PS1) 10Jgreen: route DMARC reports for donate.wikimedia.org to dmarc@donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/397612 (https://phabricator.wikimedia.org/T182622) [19:45:33] (03PS2) 10Dzahn: mediawiki:appserver:api: move firewall to role, use profile [puppet] - 10https://gerrit.wikimedia.org/r/391731 [19:45:44] bmansurov: sure i'll ship it [19:46:09] !log ppchelko@tin Started deploy [cpjobqueue/deploy@b1beaf1]: Revert dedupe based on sha1 as well as on event ID [19:46:10] thanks [19:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:38] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@b1beaf1]: Revert dedupe based on sha1 as well as on event ID (duration: 00m 29s) [19:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:14] !log mobrovac@tin Finished deploy [restbase/deploy@be7d72f]: Expose the Reading Lists end points, take #3 - T181107 (duration: 06m 19s) [19:47:16] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: T181493: Enable Page Previews EventLogging instrumentation (duration: 00m 56s) [19:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:23] T181107: Deploy Reading Lists Service to production - https://phabricator.wikimedia.org/T181107 [19:47:25] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error [19:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:32] T181493: Enable Page Previews EventLogging instrumentation - https://phabricator.wikimedia.org/T181493 
[19:47:56] tgr: ok, monitoring fixed, the public api is now out [19:48:25] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [19:48:32] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/9282/" [puppet] - 10https://gerrit.wikimedia.org/r/391731 (owner: 10Dzahn) [19:49:03] mobrovac: thanks! will test in a few minutes [19:49:22] bmansurov: you're all synced out [19:50:17] (03CR) 10Dzahn: "9 style violations fixed - no-op on api::appservers as expected" [puppet] - 10https://gerrit.wikimedia.org/r/391731 (owner: 10Dzahn) [19:51:25] (03PS1) 10Ayounsi: Initial deb packaging [debs/python-json-logger] - 10https://gerrit.wikimedia.org/r/397615 [19:52:11] (03Abandoned) 10Ayounsi: Initial deb packaging [debs/python-json-logger] - 10https://gerrit.wikimedia.org/r/394507 (owner: 10Ayounsi) [19:56:25] mobrovac: seems all good! [19:56:39] awesome! [19:57:02] (well, I probably found a bug but it's in the backend) [19:58:11] (03PS1) 10Ottomata: Add superset password mappings for analytics mysql dbs [puppet] - 10https://gerrit.wikimedia.org/r/397616 (https://phabricator.wikimedia.org/T166689) [19:59:36] ops: any chance someone could look at https://gerrit.wikimedia.org/r/#/c/395694/ this week? [20:03:04] (03PS1) 10Ayounsi: Initial deb packaging [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/397619 [20:04:15] (03CR) 10Ottomata: [C: 032] "Getup, standup! https://puppet-compiler.wmflabs.org/compiler02/9283/thorium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/397616 (https://phabricator.wikimedia.org/T166689) (owner: 10Ottomata) [20:09:29] 10Operations, 10Ops-Access-Requests, 10Analytics-Kanban, 10Patch-For-Review: Get access to geowiki data - https://phabricator.wikimedia.org/T182027#3829238 (10Ottomata) @tbayer you should be able to sudo -u stats on stat1006 now, which will let you view `sudo -u stats ls /srv/geowiki`, etc. 
[20:10:51] (03PS2) 10Jgreen: route DMARC reports for donate.wikimedia.org to dmarc@donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/397612 (https://phabricator.wikimedia.org/T182622) [20:13:32] (03PS1) 10Ottomata: Fix bug where superset worker_class not enclosed in quotes in config [puppet] - 10https://gerrit.wikimedia.org/r/397621 [20:28:54] 10Operations, 10Ops-Access-Requests, 10Analytics-Kanban, 10Patch-For-Review: Get access to geowiki data - https://phabricator.wikimedia.org/T182027#3829266 (10Tbayer) 05Open>03Resolved It works, thanks! [20:36:36] (03PS1) 10Gergő Tisza: Add ReadingLists tables to Toolforge filter config [puppet] - 10https://gerrit.wikimedia.org/r/397623 [20:37:54] 10Operations, 10Ops-Access-Requests, 10Maps-Sprint, 10Patch-For-Review: Requesting access to deploy-service for pnorman - https://phabricator.wikimedia.org/T182066#3829280 (10RobH) 05Open>03Resolved [20:38:15] (03CR) 10Ottomata: [C: 032] Fix bug where superset worker_class not enclosed in quotes in config [puppet] - 10https://gerrit.wikimedia.org/r/397621 (owner: 10Ottomata) [20:38:30] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3807578 (10RobH) a:05Cmjohnson>03None [20:40:31] (03PS2) 10Gergő Tisza: Add ReadingLists tables to Toolforge filter config [puppet] - 10https://gerrit.wikimedia.org/r/397623 [20:43:08] (03PS1) 10Herron: puppet: change location of environment setting from [main] to [agent] [puppet] - 10https://gerrit.wikimedia.org/r/397624 (https://phabricator.wikimedia.org/T177254) [20:45:08] (03PS2) 10Herron: puppet: change location of environment setting from [main] to [agent] [puppet] - 10https://gerrit.wikimedia.org/r/397624 (https://phabricator.wikimedia.org/T177254) [20:45:18] (03PS3) 10Gergő Tisza: Add ReadingLists tables to Toolforge filter config [puppet] - 10https://gerrit.wikimedia.org/r/397623 [20:47:17] (03PS1) 10Ottomata: Fix 
superset lookup_password to work with sqla uris [puppet] - 10https://gerrit.wikimedia.org/r/397629 (https://phabricator.wikimedia.org/T166689) [20:47:28] (03CR) 10Ayounsi: "Will add the deb files to the repository when this is merged." [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/397619 (owner: 10Ayounsi) [20:47:54] (03CR) 10Ayounsi: "Will add the deb files to the repository when this is merged." [debs/python-json-logger] - 10https://gerrit.wikimedia.org/r/397615 (owner: 10Ayounsi) [20:48:23] (03CR) 10Ottomata: [C: 032] Fix superset lookup_password to work with sqla uris [puppet] - 10https://gerrit.wikimedia.org/r/397629 (https://phabricator.wikimedia.org/T166689) (owner: 10Ottomata) [20:49:14] (03PS2) 10Ayounsi: Initial deb packaging [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/397619 [20:55:54] urandom, mobrovac: catching up on the stuff I missed in the last two days of last week; I'm sorry to hear that the test didn't help :( [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171211T2100). [21:00:04] No GERRIT patches in the queue for this window AFAICS. 
[21:00:19] no parsoid deploy today [21:00:33] (03PS3) 10Herron: route DMARC reports for donate.wikimedia.org to dmarc@donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/397612 (https://phabricator.wikimedia.org/T182622) (owner: 10Jgreen) [21:02:16] (03CR) 10Herron: [C: 031] route DMARC reports for donate.wikimedia.org to dmarc@donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/397612 (https://phabricator.wikimedia.org/T182622) (owner: 10Jgreen) [21:04:01] !log mholloway-shell@tin Started deploy [mobileapps/deploy@6347d62]: Update mobileapps to 61ca333 [21:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:52] volans: :( [21:05:44] volans: i guess i didn't have a strong sense that it would, but yeah, it would have been great [21:07:11] yeah, me neither too much, given that it should help reads, not make them worse, but those black boxes are hard to predict under real life workloads [21:07:47] the other thing I noticed more recently is that writes don't have this behaviour at all [21:09:57] urandom: the list of potential differences is very long, I'm not sure if it makes sense to continue to play "spot the differences" or at this point is better to concentrate on the bottleneck(s) on those boxes and try to improve the situation [21:11:36] volans: yeah, i've been playing that game a bit today, and it's a bit of a rabbit hole [21:11:52] [btw, is NUMA at play here? 
configured/disabled/different between the hosts] [21:11:58] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@6347d62]: Update mobileapps to 61ca333 (duration: 07m 56s) [21:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:37] we have intel disks and samsung disks (the latter have been notoriously bad performance-wise), and we have HP and Dell (the former have the RAID controller, the latter not), and we have quite a matrix of combination when you factor in data-center, rack, and cluster combinations [21:13:05] volans: i dunno [21:13:12] good question [21:13:53] also, re: that unusual disk configuration you asked about in the ticket (and for which i still owe a write-up), that is different between clusters [21:14:00] and, i could see that factoring in [21:14:03] 10Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: eqiad: hadoop expansion - FY 2017 / 2018 - https://phabricator.wikimedia.org/T182628#3829362 (10Ottomata) [21:14:48] 10Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: eqiad (8): hadoop expansion - FY 2017 / 2018 - https://phabricator.wikimedia.org/T182628#3829369 (10Ottomata) [21:14:51] ok, I'm also missing completely what cassandra does internally to create the response [21:14:56] that is all... 
way too complex [21:14:59] * urandom sighs [21:15:00] 10Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: eqiad: (8) Hadoop expansion - FY 2017 / 2018 - https://phabricator.wikimedia.org/T182628#3829350 (10Ottomata) [21:15:29] 10Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: eqiad: (8) Hadoop expansion - FY 2017 / 2018 - https://phabricator.wikimedia.org/T182628#3829350 (10Ottomata) [21:15:44] we should try to at least to be able to determine if it's a problem that we see only at cassandra layer, or also at the OS layer [21:16:04] (03CR) 10Herron: [C: 032] route DMARC reports for donate.wikimedia.org to dmarc@donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/397612 (https://phabricator.wikimedia.org/T182622) (owner: 10Jgreen) [21:16:08] maybe tracking some additional data that we don't have, like IO latencies, etc... [21:24:39] We've got an puppet change for Apache (mediawiki module). Simple thing, adding a few redirects in the file designated to do redirects. But apache changes are generally not eligible for puppetSWAT according to the docs. [21:24:49] Could anybody give me a hint whom to ask about that? [21:27:25] My first guess would be bblack (Don't know if its the right person. But seems like a logical extension from varnish stuff to apache redirects) [21:27:28] Ops and pray [21:27:40] Don't forget to sacrafice a virgin [21:28:11] or ping the Clinic Duty Ops of the week ;) [21:28:15] it's a simple change [21:28:19] Doesn't matter [21:28:19] what could go wrong? [21:28:22] simple changes aren't [21:28:23] Most ops don't like doing apache changes [21:28:28] you could simply take down all the projects ;) [21:28:35] and redirects are among those that really screw stuff up when they do [21:28:47] Its only one site, there's an entire other internet out there. 
Don't fret ;) [21:28:47] because even the simplest thing can end up in a hour long outage of drama :D [21:28:52] also a) I am not on clinic duty and b) it's 11:30 at night, before you ask.... :-P [21:29:33] you were all too fast replying seriously before I finished :P [21:29:43] That was ... helpful. I'm going to ping robh (clinic duty) on the task then :D [21:30:19] very used to the finger-nose protocol [21:30:24] ya gotta be fast, Platonides [21:30:24] so apache changes can break the site, and tend to have to be reviewed more closely [21:30:35] Yeah, that's why I asked. [21:30:39] As _joe_ has been responding on the task.. You might want to ask him explicitly [21:30:52] basicaly i'd ask around the ops team and if no one responds within a week i'd put as an item on our weekly meeting =] [21:31:05] indeed, if you already have an opsen on task and resopnsive i'd checkw ith them [21:31:33] bawolff: apache isnt bblack ;] [21:31:34] It's about T169450 , the patch is https://gerrit.wikimedia.org/r/#/c/393289/ [21:31:34] T169450: Redirect several wikis - https://phabricator.wikimedia.org/T169450 [21:32:04] i mean, anyone in ops can at least direct someone to the right people, so its not 'wrong' but its not technically his =] [21:32:12] Good to know [21:32:30] joe is indeed more closely involved in apache stuff afaik and seems tos upport that by him replying on the task [21:33:08] * robh is reading said task now [21:33:55] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3829414 (10EddieGP) @Joe As you've already commented here, could you help with deployment of https://gerrit.wikimedia.org/r/#/c/39... [21:38:18] so this entire thread seems to have stalled out. 
i've flagged it, and if i cannot find someone who can review the changes then ill list in our weekly meeting as incoming request [21:38:28] (they take place every monday) [21:38:40] today's already happened [21:42:19] robh: Thanks. I don't think there's a hurry for this one. If we can't get it done this week we should maybe wait until january. I wouldn't like to have the redirects applied without the required mediawiki cleanup happening fastly afterwards. [21:42:54] (Assuming we're not doing it the next two weeks due to end-of-year deployment freeze) [21:43:05] duly noted =] [21:43:09] right, week of 18th and 25th no deploys :) [21:43:23] booooring [21:44:07] i recall in the early days xmas week was a good week for maint [21:44:12] since it was low traffic ;] [21:44:21] but now high money making :) [21:44:35] (thanks, world) [21:44:38] Reedy: We could also sync that changes out right now without any review and probably get me a sticker :P [21:44:56] fixing it is how you get any token [21:45:00] anyone can break [21:45:02] Pfft [21:45:06] eddiegp: jerkins does review now [21:45:18] Gone are the days you can easily push major syntax errors out [21:45:54] 10Operations, 10Commons, 10Multimedia, 10Traffic, and 3 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3829487 (10Bawolff) So yes, this sounds sane to me (With the caveat, I haven't looked at the multimedia code in a while). Some comments:... [21:47:56] really almost everything we do in our complex apache config should be split up and sent to two other places. non-canonical redirections should go go a traffic-level secure redirector service, and the redirects that apply to live real wiki stuff should get pushed down into mediawiki's own URL-routing layer. 
[21:48:08] (03PS1) 10Dzahn: labnet100[34]: use role::spare until used [puppet] - 10https://gerrit.wikimedia.org/r/397635 (https://phabricator.wikimedia.org/T165779) [21:48:14] neither of those destinations yet exist, but they've both been discussed and/or worked on at some past point [21:51:07] (while I'm dreaming of nice-to-have things that aren't getting worked on, we should also redo our canonical URLs to get rid of language-code subdomains and the explicit m-dot mobile subdomains) [21:52:12] greg-g: Well, anyone can click the "Revert" button in gerrit too, it's even easier than installing git and pushing the breaking change in the first place :P [21:54:15] revert button in gerrit doesnt fix a redirect loop that is cached in varnish [21:54:26] (and breaks all the sites :) [21:55:15] and if it has "easy" or "trivial" on it it kind of makes it more likely for that stuff to happen :p [21:56:30] Well I didn't use the word "trivial" on purpose ;) [21:57:21] just saying unfortunately not all things are easy to revert, most are, but there are exceptions [21:59:24] Yeah, I'm aware, as I already said on the ticket yesterday - apache and varnish are probably one of the easiest ways to break everything ;) [21:59:36] (03PS1) 10Dzahn: mediawiki::appserver: move firewall from site to role [puppet] - 10https://gerrit.wikimedia.org/r/397636 [21:59:58] the thing about redirects is that things wind up being cached... wrong things, if the redirect has some subtle error, and as mutante says, then reverting is not all that's required, there's usually some scrambling around to find the problem and then figure out exactly what to purge [22:00:04] dapatrick, bawolff, and Reedy: I, the Bot under the Fountain, allow thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171211T2200). [22:00:04] No GERRIT patches in the queue for this window AFAICS. 
[22:00:17] so we're all a bit 'once bitten twice shy' about it [22:00:23] a common fear is the rewrite is broken and then you can fix it but it's already cached and you cant just randomly start deleting all the things from cache either [22:00:30] that bot is getting a bit pompous isn't it? [22:00:38] It just loves us [22:00:50] eddiegp: what mutate said :) [22:01:46] at least people are paying attention to it :) [22:01:50] (some) [22:02:14] Yeah, as I said, I'm aware. My point was that there's usually not much more you can do to "fix" other than press the revert button for someone without shell access :) [22:02:35] volans: interesting: https://graphite.wikimedia.org/render?target=servers.restbase10{11,16}.cpu.total.iowait&from=-90d&width=1024 [22:03:18] volans: that's the legacy cluster, one node HP w/ Samsung disks, one Dell w/ Intel [22:03:30] the Dell wins again (even with different disks) [22:03:51] and this is the old version of Cassandra, the old storage strategy, and the old raid and partitioning schema [22:04:10] midnight, I'm checking back out again... [22:04:26] the impact on latency isn't the same, but maybe this has been going on all along [22:04:47] urandom: interesting! the unit on the left are seconds? 
[22:05:04] ah sorry, read now IOwait [22:05:08] ya [22:05:09] thought was latencies :D [22:05:26] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 22 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:05:33] volans: iowait has always been the biggest predictor of latency with these clusters [22:05:42] indeed [22:07:03] volans: also: https://graphite.wikimedia.org/render?target=servers.restbase20{09,12}.cpu.total.iowait&from=-90d&width=1024 [22:07:17] that is codfw, very different workload, so take that with a grain of salt [22:07:39] we removed salt from the fleet, I can take it with some grain of cumin :-P [22:07:46] (03CR) 10Halfak: [C: 031] "Looks like this will increase uwsgi workers on a few machines to 48 and drop it down from 64 to 48 on scb1003 and scb1004. If akosiaris i" [puppet] - 10https://gerrit.wikimedia.org/r/396064 (https://phabricator.wikimedia.org/T182249) (owner: 10Halfak) [22:07:54] but 2009 is HP + Intel, and 2012 is Dell + Intel [22:07:57] volans: ha! [22:08:06] and the absolute numbers are much lower [22:08:11] yes [22:08:18] that's the grain of cumin part, i think [22:08:36] codfw and the legacy cluster, so it's a very different workload going on [22:08:54] but the Intels have always given us lower iowait, apples-to-apples [22:09:20] and maybe the specs of the disks can somehow confirm this [22:09:47] the common thread remains that HP controller though [22:10:26] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 11 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:10:52] don't want to get hung too much on that, but everything continues to point there [22:11:15] HP has dell or samsung? [22:11:26] sorry I'm tired, Intel or Samsung disks? 
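[editor's note] The iowait comparison above is done by eye on rendered Graphite graphs. A minimal sketch of doing the same comparison numerically follows. It assumes the Graphite render endpoint from the log also serves `format=json` (a standard Graphite render option). The function names `render_url` and `mean_by_target` are hypothetical helpers, not part of any tooling mentioned in the channel.

```python
from urllib.parse import urlencode

# Host taken from the URLs pasted in the log; everything else is illustrative.
GRAPHITE_RENDER = "https://graphite.wikimedia.org/render"


def render_url(target, since="-90d"):
    """Build a render URL that asks Graphite for JSON instead of a PNG."""
    query = urlencode({"target": target, "from": since, "format": "json"})
    return GRAPHITE_RENDER + "?" + query


def mean_by_target(payload):
    """Average each series in a parsed Graphite JSON response.

    `payload` is the decoded JSON list: each item has a "target" name and
    "datapoints" as [value, timestamp] pairs. Null values (gaps) are skipped.
    """
    means = {}
    for series in payload:
        values = [v for v, _ts in series["datapoints"] if v is not None]
        if values:  # skip series that contain only gaps
            means[series["target"]] = sum(values) / len(values)
    return means
```

Fetching `render_url("servers.restbase20{09,12}.cpu.total.iowait")`, decoding the JSON, and passing it to `mean_by_target` would give one mean per host, so the ratio between the HP and Dell hosts can be computed directly rather than read off a graph.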
[22:12:19] it's a mess, but we have a lot of Dells + Intel and HP + Samsung [22:12:36] ok so could be the controller as the disks [22:12:37] and only one case of HP + Intel, I think [22:12:50] that's the last iowait graph [22:13:02] 2009? [22:13:07] same disks, and the one with the HP controller is still slower [22:13:10] err... more iowait [22:13:17] yes [22:14:00] that last graph 2009 and 2012 is an example of two hosts, same rack, intel disks, one HP, one Dell [22:14:51] very low absolute values, but the one is still consistenly on the order of almost double [22:14:58] double-ish [22:15:09] yeah [22:15:45] ok I have my next theory, but need to verify it first before rambling random stuff ;) [22:15:51] heh [22:21:33] <_joe_> it's not like since I answered a ticket I'm the owner of it, people [22:21:45] <_joe_> having said that, I'll take a look tomorrow [22:22:33] i never suggested ownership, my apologizes if it seemed such [22:22:53] i said i'd ask around and if i got no interested parties id bring up in ops meeting [22:23:29] but my comment of asking those who have already chatted on task doesn't seem like its a bad thing, but i can understand why its not ideal for some [22:23:36] sorry about that =] [22:24:05] Just blame me ;) [22:24:51] <_joe_> I always blame you Reedy [22:24:58] <_joe_> by default [22:25:00] Good [22:25:05] Shouldn't have it any other way [22:25:25] _joe_: Sorry, didn't want to put ownership onto you with my comment. 
It's just hard to figure out the exact opsen to ask about a thing :) [22:26:09] <_joe_> eddiegp: don't worry, the real issue is our apache config is huge and somewhat delicate to change [22:26:32] <_joe_> eddiegp: and if you get something wrong, all our caching layers make things quite hard to revert [22:26:41] We should have a huge list of urls to shove into apache-fast-test to test [22:26:49] <_joe_> Reedy: we do [22:26:59] <_joe_> not huge, but a good sample [22:27:30] <_joe_> but we need unit testing of that config, *or* moving most of it to php [22:27:36] <_joe_> or both, really [22:27:55] <_joe_> eddiegp: I have one doubt about the patch, I'll try to remember about it tomorrow [22:28:21] <_joe_> (it's very very late here and I've just been reminded I should not work at such hours :)) [22:28:26] And after deploying it... Remember what the problem is? :D [22:28:32] _joe_: don't work so late! [22:29:18] Reedy: That's how you usually do it? :P [22:29:45] _joe_: It's perfectly fine, as I said, that's not urgent. :) [22:30:21] Is gerrit just crap at diffing conf files? [22:32:41] Apparently [22:32:44] "Is gerrit just crap at ..." sounds like a rhetorical question to me. [22:32:56] https://gerrit.wikimedia.org/r/#/c/393289/10/modules/mediawiki/files/apache/sites/redirects.conf [22:33:02] Every line shouldn't be showing as some change [22:33:17] IMHO the whole gerrit UI is crap. 
[22:33:46] (03CR) 10Reedy: "What a mess gerrit makes of that .conf diff" [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [22:33:48] (03CR) 10Rush: [C: 031] labnet100[34]: use role::spare until used [puppet] - 10https://gerrit.wikimedia.org/r/397635 (https://phabricator.wikimedia.org/T165779) (owner: 10Dzahn) [22:34:48] No, it's something to do with who ran the script [22:34:55] Reedy: Actually, every line changed \n -> \r\n [22:35:01] yeah, exactly [22:35:04] * Reedy is fixing [22:35:12] bring back svn:eol-style native [22:35:32] 2 files changed, 29 insertions(+) [22:35:34] That's better [22:35:42] urandom: fancy another test? [22:35:46] :D [22:35:49] (03PS11) 10Reedy: apache: redirect several wikis per Board of Trustees and LangCom request [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [22:35:53] But yeah, gerrit should show invisible characters like that (git does). [22:36:13] eddiegp: https://gerrit.wikimedia.org/r/#/settings/diff-preferences [22:36:15] ignore whitespace? [22:36:52] (03CR) 10Reedy: "PS11 makes it not change the eol-style so a much more reasonable looking diff :)" [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [22:37:00] Reedy: "None" [22:37:03] (03CR) 10Dzahn: [C: 032] labnet100[34]: use role::spare until used [puppet] - 10https://gerrit.wikimedia.org/r/397635 (https://phabricator.wikimedia.org/T165779) (owner: 10Dzahn) [22:39:26] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [22:39:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [22:42:05] Does apache config even accept \r\n line endings? 
I was under the impression that it doesn't, in which case CI should certainly catch that. [22:43:02] the script writes the actual config file though [22:43:18] (03Draft1) 10Paladox: puppetmaster: only install puppetdb-terminus on jessie [puppet] - 10https://gerrit.wikimedia.org/r/397700 [22:43:23] (03PS2) 10Paladox: puppetmaster: only install puppetdb-terminus on jessie [puppet] - 10https://gerrit.wikimedia.org/r/397700 [22:43:41] (03PS1) 10Dzahn: mwlog: style fixes, move firewall include [puppet] - 10https://gerrit.wikimedia.org/r/397701 [22:44:57] You're right, the script should never use \r\n in the first place. But if CI doesn't catch it for this (script generated) apache config file, I imagine it neither will for other (hand-written) confs. [22:46:59] https://github.com/wikimedia/puppet/blob/production/modules/mediawiki/lib/puppet/parser/functions/compile_redirects.rb [22:47:27] Script only explicitly uses \n in 3 places [22:47:58] (03CR) 10Andrew Bogott: "I can't investigate immediately but I don't think this is the right solution. Probably we need the package on stretch (production) and do" [puppet] - 10https://gerrit.wikimedia.org/r/397700 (owner: 10Paladox) [22:48:36] (03Abandoned) 10Paladox: puppetmaster: only install puppetdb-terminus on jessie [puppet] - 10https://gerrit.wikimedia.org/r/397700 (owner: 10Paladox) [22:49:13] It seems to have some heredocs, likely those use the os defau.t [22:51:06] volans: what do you have in mind? [22:51:06] dest.puts "\n\t# Type: #{lower_camel_name}\n" [22:51:09] Can see those... [22:52:53] urandom: my idea would be to try the HBA mode of the controller, that basically bypass completely the RAID controller, cache included. So no guarantee it will be better, but it should expose the disks directly to the OS [22:53:32] volans: oh, that sounds...disruptive? [22:53:46] volans: how come we didn't do that to begin with? 
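[editor's note] On the point raised above that CI should catch `\r\n` line endings in config files: a minimal sketch of such a check, assuming it would run over the raw bytes of each file in a change. `crlf_line_numbers` and `files_with_crlf` are hypothetical names, and this is not the existing CI job.

```python
from pathlib import Path


def crlf_line_numbers(data):
    """Return the 1-based numbers of lines in `data` (bytes) ending in CRLF."""
    return [
        number
        for number, line in enumerate(data.splitlines(keepends=True), start=1)
        if line.endswith(b"\r\n")
    ]


def files_with_crlf(paths):
    """Map each path that contains CRLF endings to its offending line numbers."""
    offenders = {}
    for path in paths:
        # Read bytes, not text, so Python does not normalise the line endings.
        lines = crlf_line_numbers(Path(path).read_bytes())
        if lines:
            offenders[str(path)] = lines
    return offenders
```

A CI step could call `files_with_crlf` on the changed `.conf` files and fail the build when the result is non-empty. Reading in binary mode is the important design choice here, since text mode would silently translate `\r\n` to `\n`.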
[22:54:02] i assumed we did it this way because there was no other option [22:54:07] that's what I'm trying to understand (if it's disruptive or not given we have each disk by itself already) [22:54:22] hrmm, yeah [22:54:39] urandom: no idea about initial configuration, I had to google for a bit to ensure this model support this mode [22:55:21] yeah, they've always been this way, and it was my understanding that it was because there was no alternative, no way around the controller [22:55:40] bypassing the RAID controller would be awesome if that's possible [22:55:55] it seems so [22:57:36] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [22:57:37] so, i guess as long as the device ordering was the same when it came back up, it ought to be transparent [22:57:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [22:57:56] it's a big if ;) [22:57:59] heh [22:58:01] well [22:58:08] do you trust udev ? :D [22:58:22] i trust nothing! [22:58:43] I would pick a host you're "ok-ish" to loose the data and need to reimage and re-init the data [22:59:07] or if we have a test host with similar specs [22:59:11] we will eventually be moving all of the hosts currently in the legacy cluster, to this one [22:59:41] we need to move the remaining use-cases, and with holidays and whatnot, that might be some time off [23:00:09] but it would provide an opportunity free of risk [23:00:24] nice! also at least a reboot is needed [23:00:31] but seems it's possible to do it via hpssacli [23:00:36] Reedy: Just remembered, iirc that script sends everything to stdout. The problem might not be with the script, but how to get it's output into the conf file without adding \r\n on a Windows machine. [23:00:54] volans: oh? that does sound dangerous! 
:) [23:01:20] it says that a reboot is needed after that, so I guess it will have effect at next reboot [23:01:26] (03CR) 10Paladox: "> I haven't reviewed the 2 packages. Is ruby-mysql2 fully compatible" [puppet] - 10https://gerrit.wikimedia.org/r/391336 (owner: 10Paladox) [23:01:28] oh ok [23:01:42] I mean in bash you just do "> redirects.conf", but I have no clue how it'd be done on Windows and if it could be convinced to use "\n". [23:01:45] although I've also found people saying that that command doesn't work and gives an error :D [23:01:57] ymmv, i guess [23:02:04] as always! :D [23:02:31] You can redirect to files on windows too [23:02:35] same way [23:02:44] worse case scenario the data can be re-imported from the rest of the cluster? [23:03:06] yeah, but doing that at this stage would make me pretty unpopular, i suspect [23:03:07] eddiegp: it's probably not a problem worth fixing [23:03:23] Hmm, okay. I imagine it was copy-pasted into some editor set to use \r\n. [23:03:29] because I'm reading also cases in which you have to "delete" the existing RAID... so not sure if can be done transparently, being already JBOD it should, but who knows [23:03:31] Indeed [23:03:45] we're juggling load from the legacy cluster to the new, and then moving hosts, and then lather, rinse, and repeat [23:04:07] Well about that script, you're probably right. But imho we should have a task about CI not failing over \r\n. [23:04:11] lol [23:04:17] and the granularity of the use-cases being moved doesn't always provide as much headroom as you'd like [23:04:44] eddiegp: just find one that's about reverting from git to svn [23:05:10] so we could do it now with an existing node, but it would shutdown that work until it was done, and i'd be on santa's naughty list for sure. [23:05:51] Reedy: For that one I can probably search for "Status: declined"? 
:P [23:06:57] Depends if chad was in a "fsck git" mood [23:08:57] urandom: see query ;) [23:11:51] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [23:11:52] i <3 u svn [23:12:01] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [23:15:45] I've never used svn, and what I've heard so far about it sums up to "forget it, use git". :D [23:16:11] svn wasn't that bad [23:18:53] Repo management was easier! [23:19:52] the hugest difference between svn and git is that svn versions *files* and git versions *trees* [23:22:41] !log mholloway-shell@tin Started deploy [mobileapps/deploy@07293bc]: Update mobileapps to e290b17 [23:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:20] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [23:25:00] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [23:28:57] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@07293bc]: Update mobileapps to e290b17 (duration: 06m 16s) [23:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:06] !log smalyshev@tin Started deploy [wdqs/wdqs@f6b110f]: updater fix and GUI update [23:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:18] (03CR) 1020after4: [C: 031] "Looks sane to me" [puppet] - 10https://gerrit.wikimedia.org/r/392221 (owner: 10Aaron Schulz) [23:37:16] !log smalyshev@tin Finished deploy [wdqs/wdqs@f6b110f]: updater fix and GUI update (duration: 06m 10s) [23:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:30] (03PS1) 10Andrew Bogott: WMCS: use puppet 4 for any 
VM-hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/397711