[00:10:30] (PS29) Alex Monk: Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450)
[00:11:28] (CR) jenkins-bot: [V: -1] Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450) (owner: Alex Monk)
[00:12:13] maintain-replicas.py:187:80: E501 line too long (80 > 79 characters)
[00:12:15] * Krenair sighs
[00:12:39] (PS30) Alex Monk: Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450)
[01:54:30] PROBLEM - puppet last run on multatuli is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:19:46] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:27:12] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.20) (duration: 13m 07s)
[02:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:03] RECOVERY - cassandra service on maps-test2004 is OK: OK - cassandra is active
[02:38:25] PROBLEM - cassandra service on maps-test2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[03:03:32] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:09:54] !log just took a shit
[03:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:11:23] PROBLEM - puppet last run on lvs1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:28:59] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[03:36:29] RECOVERY - puppet last run on lvs1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:54:49] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:19:51] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:34:02] Maybe https://twitter.com/wikimediatech/status/782417315606462464 should be deleted/hidden.
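(The E501 failure jenkins-bot reported at 00:12:13 just means one physical line in maintain-replicas.py crossed flake8's default 79-character limit. A minimal sketch of the usual fix, using a made-up assignment purely for illustration, not the actual content of line 187:)

```python
# Hypothetical example only, not the real maintain-replicas.py line 187.
# A single-line definition like this would trip flake8's E501 check
# (default max-line-length is 79):
#
# replica_grants = {'labsdbuser': ['SELECT', 'SHOW VIEW'], 'maintainer': ['ALL PRIVILEGES']}

# Wrapping inside the braces keeps every physical line short and needs
# no backslash continuations:
replica_grants = {
    'labsdbuser': ['SELECT', 'SHOW VIEW'],
    'maintainer': ['ALL PRIVILEGES'],
}
```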
[05:37:12] We really should update the bot to check that the person is identified to NickServ first
[06:40:33] !admin hola alvaro molina hola alvaro molin hola alvaro molina hola alvaro molin hola alvaro molina hola alvaro molina
[06:40:35] !admin hola alvaro molina hola alvaro molin hola alvaro molina hola alvaro molin hola alvaro molina hola alvaro molina
[06:46:16] !admin hola alvaro molina hola alvaro molin hola alvaro molina hola alvaro molin hola alvaro molina hola alvaro molina
[06:46:41] I LOVE ALEXZ
[06:46:55] Shahhssh
[06:47:25] !admin hola alvaro molina hola alvaro molin hola alvaro molina hola alvaro molin hola alvaro molina hola alvaro molina
[06:50:11] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:51:50] RECOVERY - cassandra service on maps-test2003 is OK: OK - cassandra is active
[06:57:26] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 531 bytes in 0.055 second response time
[06:58:08] Operations, Ops-Access-Requests, netops: Access to network devices - https://phabricator.wikimedia.org/T147061#2682181 (Peachey88)
[06:58:41] PROBLEM - puppet last run on elastic2005 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[molly-guard],Package[ncdu]
[06:59:31] PROBLEM - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[07:20:23] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.065 second response time
[07:24:04] RECOVERY - puppet last run on elastic2005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:28:31] Operations, Discovery, Wikidata, Wikidata-Query-Service: Response times of Wikidata Query Service increasing - https://phabricator.wikimedia.org/T147130#2682204 (Gehel)
[07:29:32] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665333 (Gehel) Open>Resolved a:Gehel As @jcrespo pointed out, the current issue is different...
[07:29:49] !log silencing wdqs response time alerts, it is flapping, related to traffic - T147130
[07:29:51] T147130: Response times of Wikidata Query Service increasing - https://phabricator.wikimedia.org/T147130
[07:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:12:28] (PS2) Urbanecm: Add 1.5 and 2x logos for olowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/313658 (https://phabricator.wikimedia.org/T146745)
[08:43:22] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
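(On the 05:37:12 suggestion above, requiring NickServ identification before the bot accepts !log: a minimal sketch of one way an IRC bot could check this via the WHOIS account numeric (330). The `irc` connection object and its helpers are hypothetical; a real bot would reuse its IRC library's WHOIS handling.)

```python
# Sketch only: `irc` is a hypothetical connection object exposing send_raw()
# and iter_server_lines(); not any particular bot framework's API.

RPL_WHOISACCOUNT = "330"   # ":server 330 <me> <nick> <account> :is logged in as"
RPL_ENDOFWHOIS = "318"     # ":server 318 <me> <nick> :End of /WHOIS list."


def is_identified(irc, nick):
    """Return True if `nick` is logged in to services, judged by WHOIS numeric 330."""
    irc.send_raw("WHOIS " + nick)
    for line in irc.iter_server_lines():              # hypothetical helper
        parts = line.split()
        if len(parts) >= 4 and parts[1] == RPL_WHOISACCOUNT and parts[3] == nick:
            return True                               # services account present
        if len(parts) >= 2 and parts[1] == RPL_ENDOFWHOIS:
            return False                              # WHOIS ended without a 330 line


def handle_message(irc, sender_nick, message):
    """Only forward !log entries from identified users."""
    if message.startswith("!log") and not is_identified(irc, sender_nick):
        irc.send_raw("PRIVMSG {} :Please identify to NickServ before using !log".format(sender_nick))
        return
    # ... otherwise record the entry in the Server Admin Log as before ...
```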
[08:48:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[08:53:11] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[09:10:34] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:10:35] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:18:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:19:28] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:21:50] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[09:39:17] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:44:03] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[09:58:07] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[10:02:07] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[10:03:09] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:07:07] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:22:58] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:25:07] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[10:25:28] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:26:11] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]
[10:28:40] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[10:35:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[10:44:22] PROBLEM - puppet last run on analytics1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
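(The "X% of data above the critical threshold" alerts that dominate this stretch come from a check that samples a Graphite metric and alerts when too large a fraction of recent datapoints exceed a limit. A rough sketch of that idea, assuming Graphite's standard render API; the host, metric name, and thresholds are placeholders, not the production check's configuration:)

```python
import json
import sys
import urllib.request

GRAPHITE = "https://graphite.example.org"   # placeholder host
METRIC = "reqstats.5xx"                     # placeholder metric name
CRIT_VALUE = 1000.0                         # per-datapoint threshold seen in the alerts
CRIT_PERCENT = 10.0                         # alert if this % of datapoints exceed CRIT_VALUE


def fetch_datapoints():
    """Return the non-null values of the last 10 minutes of the metric."""
    url = "{}/render?target={}&from=-10min&format=json".format(GRAPHITE, METRIC)
    with urllib.request.urlopen(url, timeout=10) as resp:
        series = json.load(resp)
    # Graphite's JSON format is [{"target": ..., "datapoints": [[value, ts], ...]}]
    return [value for value, _ts in series[0]["datapoints"] if value is not None]


def main():
    points = fetch_datapoints()
    if not points:
        print("UNKNOWN: no datapoints returned")
        sys.exit(3)
    over = 100.0 * sum(1 for v in points if v > CRIT_VALUE) / len(points)
    if over >= CRIT_PERCENT:
        print("CRITICAL: {:.2f}% of data above the critical threshold [{}]".format(over, CRIT_VALUE))
        sys.exit(2)
    print("OK: {:.2f}% of data above the threshold [{}]".format(over, CRIT_VALUE))
    sys.exit(0)


if __name__ == "__main__":
    main()
```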
[10:44:44] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:47:04] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[10:52:03] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[10:52:34] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[11:06:53] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[11:11:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[11:19:22] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:23:32] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:23:42] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:40:19] RECOVERY - puppet last run on analytics1049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:05:30] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 658 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3387384 keys - replication_delay is 658
[12:09:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[12:16:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[12:21:31] RECOVERY - cassandra service on maps-test2003 is OK: OK - cassandra is active
[12:23:49] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:38:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[12:38:40] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
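(The rdb2006 alert at 12:05:30 is about Redis replication lag on port 6479. A minimal sketch, assuming the redis-py client, of how one could read the relevant figures from a replica; the exact lag definition used by the production check may differ, and the hostname below is assumed for illustration:)

```python
import redis  # assumes the redis-py client is installed


def replication_status(host, port):
    """Return (role, seconds since last interaction with the master) for one instance."""
    info = redis.StrictRedis(host=host, port=port, socket_timeout=5).info("replication")
    if info.get("role") != "slave":
        return info.get("role"), 0
    # master_link_status says whether the replication link is up at all;
    # master_last_io_seconds_ago is a rough proxy for how stale the replica is.
    if info.get("master_link_status") != "up":
        return "slave (link down)", float("inf")
    return "slave", info.get("master_last_io_seconds_ago", 0)


if __name__ == "__main__":
    role, lag = replication_status("rdb2006.codfw.wmnet", 6479)  # hostname assumed
    print("{}: last I/O with master {}s ago".format(role, lag))
```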
[12:50:41] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:03:10] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:31:44] <_joe_> seems like the 5xx errors are 500s from commons
[13:33:11] <_joe_> some bot requesting 0px images
[13:33:15] <_joe_> or some app
[14:44:49] PROBLEM - parsoid on wtp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:47:01] RECOVERY - parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.271 second response time
[15:33:39] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[15:38:00] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[15:44:50] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[15:51:10] (PS1) Kerberizer: Fix an invalid empty line in the global robots.txt [mediawiki-config] - https://gerrit.wikimedia.org/r/313763 (https://phabricator.wikimedia.org/T146908)
[15:52:01] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[15:55:01] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:01:41] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[16:02:10] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:06:19] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 868271 msg (=800000 warning): ocg_render_job_queue 3004 msg (=3000 critical)
[16:07:03] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 868544 msg (=800000 warning): ocg_render_job_queue 3117 msg (=3000 critical)
[16:07:39] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[16:07:40] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 868828 msg (=800000 warning): ocg_render_job_queue 3226 msg (=3000 critical)
[16:09:29] RECOVERY - cassandra service on maps-test2002 is OK: OK - cassandra is active
[16:12:01] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:19:10] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[16:21:00] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[16:25:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[16:37:24] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[16:38:12] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:12:49] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
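(_joe_'s diagnosis at 13:31-13:33, 500s from Commons caused by something requesting 0px thumbnails, is the kind of thing you can confirm by counting thumbnail requests whose width is 0. A rough sketch over a sampled access log; the file name, field layout, and the "/0px-" URL pattern are assumptions for illustration, not the actual request pipeline:)

```python
import re
from collections import Counter

# Thumbnail URLs generally look like .../thumb/<a>/<ab>/<File>/<N>px-<File>;
# a requested width of 0 cannot be rendered and comes back as an error.
ZERO_PX = re.compile(r"/thumb/.+/0px-")


def count_zero_px_clients(log_path):
    """Tally user agents requesting 0px thumbnails in a sampled access log.

    Assumes one tab-separated request per line with the URL in the 7th field
    and the user agent in the last field; adjust to the real log format.
    """
    agents = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 8:
                continue
            url, agent = fields[6], fields[-1]
            if ZERO_PX.search(url):
                agents[agent] += 1
    return agents


if __name__ == "__main__":
    for agent, hits in count_zero_px_clients("sampled-1000.log").most_common(10):
        print(hits, agent)
```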
[17:39:10] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:59:50] RECOVERY - cassandra service on maps-test2004 is OK: OK - cassandra is active
[18:39:50] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[18:44:42] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[18:59:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:10:23] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:36:43] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:50:00] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[19:54:40] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[19:56:00] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[20:00:52] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:23:15] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:47:51] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[22:11:06] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:37:51] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:42:53] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3290404 keys - replication_delay is 0
[23:23:22] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:33:11] Operations, Wikimedia-Site-requests, I18n, Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2683321 (Krenair)