[01:02:10] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1498957325 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9002661 keys, up 2 minutes 3 seconds - replication_delay is 1498957325
[01:02:20] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:20] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:03:00] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380
[01:03:00] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:03:10] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8999531 keys, up 3 minutes 3 seconds - replication_delay is 0
[01:03:10] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4293885 keys, up 3 minutes 3 seconds - replication_delay is 0
[01:03:11] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4291547 keys, up 3 minutes 4 seconds - replication_delay is 0
[01:03:50] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 8997237 keys, up 3 minutes 42 seconds - replication_delay is 0
[01:03:50] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8993149 keys, up 3 minutes 47 seconds - replication_delay is 0
[01:04:50] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:12:20] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:12:38] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[01:13:20] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed
[01:32:00] RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[01:36:40] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[01:37:40] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set
[01:42:10] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[01:42:30] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
[01:43:10] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2003 is OK: OK ferm input default policy is set
[01:46:44] Operations, Phabricator, Release-Engineering-Team: Viewing raw files on phab fails with ERROR_MESSAGE_MAIN - https://phabricator.wikimedia.org/T169454#3398729 (Paladox)
[01:48:46] Operations, Phabricator, Release-Engineering-Team: Viewing raw files on phab fails with ERROR_MESSAGE_MAIN - https://phabricator.wikimedia.org/T169454#3398741 (Paladox) p:Triage>High I am filling this in #operations because the error sound unrelated to phabricator internal code. I think this...
[01:52:41] Operations, Phabricator, Release-Engineering-Team: Viewing raw files on phab fails with ERROR_MESSAGE_MAIN - https://phabricator.wikimedia.org/T169454#3398743 (Paladox) But the rewrites seem to work for https://phabzilla.wmflabs.org/file/data/yydp2qcdrep6lrv2vlxn/PHID-FILE-wnr4rr7iatru4jo2ityf/index....
[01:54:23] (PS4) Paladox: Update npm to 2.x and nodejs to 4.x [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/303370
[01:54:32] (PS5) Paladox: Update npm to 2.x and nodejs to 4.x [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/303370
[02:01:14] (PS6) Paladox: Update npm to 2.x and nodejs to 4.x [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/303370
[02:03:52] (PS7) Paladox: Update npm to 4.x and nodejs to 6.x [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/303370
[02:07:32] (PS8) Paladox: Update npm to 4.x and nodejs to 6.x [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/303370
[02:17:30] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active
[02:20:30] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed
[02:38:20] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[02:39:20] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2003 is OK: OK ferm input default policy is set
[02:47:43] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.073 second response time
[02:47:43] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active
[02:47:50] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[03:02:10] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[03:03:10] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1002 is OK: OK ferm input default policy is set
[03:30:14] (CR) BryanDavis: [C: -2] Update npm to 4.x and nodejs to 6.x [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/303370 (owner: Paladox)
[03:31:43] (CR) BryanDavis: [C: -2] "Running a `curl | sudo` anything is not safe. We will not be doing things like that." [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/303370 (owner: Paladox)
[03:32:41] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[03:32:50] PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:33:10] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:33:40] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2002 is OK: OK ferm input default policy is set
[03:33:44] (CR) Zhuyifei1999: Update npm to 4.x and nodejs to 6.x (2 comments) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/303370 (owner: Paladox)
[04:00:20] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[04:01:10] RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[04:06:00] PROBLEM - puppet last run on graphite1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[bump nf_conntrack hash table size],Service[carbon]
[04:14:10] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=567.40 Read Requests/Sec=4307.60 Write Requests/Sec=2.50 KBytes Read/Sec=44128.00 KBytes_Written/Sec=48.80
[04:21:20] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=3.90 Read Requests/Sec=0.20 Write Requests/Sec=13.40 KBytes Read/Sec=0.80 KBytes_Written/Sec=146.40
[04:34:10] RECOVERY - puppet last run on graphite1003 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[04:55:02] Operations, DBA, Wikimedia-Language-setup, Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3398774 (Urbanecm) Great. I'll schedule the reopening for Wednesday.
[05:37:27] Operations, DBA, Wikimedia-Site-requests: Global rename of Markos90 → Mαρκος: supervision needed - https://phabricator.wikimedia.org/T169396#3398775 (Marostegui) Please ping me before you do the rename so I can monitor the DBs Thanks!
[05:44:07] Operations, ops-eqiad, DBA: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T169448#3398777 (Marostegui) p:Triage>Normal This is a s1 slave - @Cmjohnson please change the disk when you are back from holidays. If you need to get some used disks, there are some hosts scheduled for d...
[06:57:10] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89904.44 seconds
[07:12:30] PROBLEM - Nginx local proxy to apache on mw2111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:13:20] RECOVERY - Nginx local proxy to apache on mw2111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.199 second response time
[07:44:00] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[07:44:50] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:50:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:54:00] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:25:30] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time
[10:43:00] PROBLEM - Check whether ferm is active by checking the default input chain on mwlog2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[10:44:00] RECOVERY - Check whether ferm is active by checking the default input chain on mwlog2001 is OK: OK ferm input default policy is set
[10:55:01] marostegui: ping
[11:04:15] This user is now online in #wikimedia-operations. I'll let you know when they show some activity (talk, etc.)
[11:04:15] @notify marostegui
[11:12:00] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[11:13:00] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[11:16:00] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[11:24:00] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[12:44:17] !log powercycle mw2256 https://phabricator.wikimedia.org/P5662 T163346
[12:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:29] T163346: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346
[12:45:40] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.06 ms
[13:09:53] (PS1) D3r1ck01: Remove 'din' from wmgExtraLanguageNames [mediawiki-config] - https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523)
[14:02:20] PROBLEM - CPU frequency on tin is CRITICAL: CRITICAL: CPU frequency is 600 MHz (160 MHz)
[14:07:50] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:18:21] marostegui: global rename time?
[15:37:30] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:38:20] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient
[15:50:20] RECOVERY - CPU frequency on tin is OK: OK: CPU frequency is = 600 MHz (1199 MHz)
[15:57:40] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on tin is OK: Files ownership is ok.
[16:03:01] WARNING: Current database lag 13 s violates maxlag of 6 s, waiting 6 s <-- isn't 13 unusual high?
[16:04:36] TabtbyCat: meaow, today is saturday. :P
[16:06:28] Steinsplitter: no, today in Sunday
[16:06:41] yes .
[16:06:43] so. meh.
[16:06:59] heh
[16:10:10] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[16:22:18] flood of spambot registrations happening right now, sigh
[16:22:40] but how... they deal with captcha?
[16:23:10] our captcha is too easy
[16:23:17] :/
[16:23:33] or they have humans resolving it for them? idk
[16:25:10] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[16:26:30] https://anti-captcha.com/ CAPTCHA is outdated method. "ajax" confirmation is better way.
[16:27:10] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[17:07:10] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[17:08:20] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[17:15:20] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:16:10] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:47:51] Operations, Gerrit, Release-Engineering-Team: Gerrit: can lose data if it crashes - https://phabricator.wikimedia.org/T159743#3399128 (Paladox)
[17:47:54] Operations, Gerrit: Decide how to support polygerrit - https://phabricator.wikimedia.org/T158479#3399129 (Paladox)
[19:11:20] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[19:19:20] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[19:21:20] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[19:22:20] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[19:45:30] PROBLEM - Apache HTTP on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[19:45:40] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time
[19:45:40] PROBLEM - Nginx local proxy to apache on mw1204 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.009 second response time
[19:46:10] PROBLEM - HHVM rendering on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[19:46:40] RECOVERY - Nginx local proxy to apache on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.061 second response time
[19:47:10] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 74897 bytes in 0.429 second response time
[19:47:30] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.093 second response time
[19:47:40] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.052 second response time
[19:57:48] Operations, Traffic, netops: codfw row B switch upgrade - https://phabricator.wikimedia.org/T169345#3399300 (ayounsi)
[19:58:47] Operations, Traffic, netops: codfw row B switch upgrade - https://phabricator.wikimedia.org/T169345#3395298 (ayounsi) Edit: moving the maintenance to Wednesday July 12 for availability reasons.
[20:19:58] Operations, cloud-services-team, Upstream: New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS - https://phabricator.wikimedia.org/T169290#3399312 (Nemo_bis)
[20:23:18] Operations, cloud-services-team, Upstream: New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS - https://phabricator.wikimedia.org/T169290#3393693 (Nemo_bis) "A total of 16,214 non-merge changesets were pulled into the mainline repository for the 4.9 development cycle, making this cycle t...
[21:09:20] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:38:30] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[21:53:30] PROBLEM - HHVM rendering on mw2120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:54:20] RECOVERY - HHVM rendering on mw2120 is OK: HTTP OK: HTTP/1.1 200 OK - 74865 bytes in 0.302 second response time
[22:45:20] PROBLEM - HHVM rendering on mw2209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:46:10] RECOVERY - HHVM rendering on mw2209 is OK: HTTP OK: HTTP/1.1 200 OK - 74865 bytes in 0.348 second response time
[23:28:33] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:29:23] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.046 second response time
[23:31:01] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[23:32:30] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[23:32:40] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[23:35:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[23:38:30] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:39:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:40:10] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:40:40] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]