[00:15:45] !log RT (ununpentium) installing pending package upgrades [00:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:51] 10Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#3373808 (10Dzahn) 05Open>03Resolved Since we have the new system running since a couple weeks and there was only positive feedback.. i now disabled the duplicate mails still going to RT. Now they only go to our new Goo... [00:29:35] (03CR) 10Dzahn: [C: 031] "well, that's interesting. it would explain my question from above "and why does it currently work?:)" which i was wondering about since th" [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [00:31:06] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3373812 (10Dzahn) Do you guys know anything about the latest 2 comments on https://gerrit.wikimedia.org/r/#/c/359373/ ? It seems those G... [00:33:46] (03CR) 10Dzahn: "> Last I know from owners, the database change breaks gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [00:42:36] (03CR) 10Dzahn: "it's been >6 months. brought it up multiple times incl. meeting but https://phabricator.wikimedia.org/T143175 is still stalled without co" [puppet] - 10https://gerrit.wikimedia.org/r/324841 (https://phabricator.wikimedia.org/T143175) (owner: 1020after4) [00:48:48] (03CR) 10Dzahn: [C: 031] "> So as such I am not sure what is the point of this patch." [puppet] - 10https://gerrit.wikimedia.org/r/359492 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [01:02:13] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1498179727 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8467889 keys, up 2 minutes 5 seconds - replication_delay is 1498179727 [01:02:13] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1498179728 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 8572415 keys, up 2 minutes 6 seconds - replication_delay is 1498179728 [01:02:14] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:02:43] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:03:13] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8548294 keys, up 3 minutes 6 seconds - replication_delay is 0 [01:03:13] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8449893 keys, up 3 minutes 7 seconds - replication_delay is 0 [01:03:14] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 8544568 keys, up 3 minutes 7 seconds - replication_delay is 0 [01:03:43] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8543807 keys, up 3 minutes 33 seconds - replication_delay is 0 [01:04:43] (03CR) 10Dzahn: [C: 04-1] "no, it would remove purge scripts and break other things, see the diff here:" [puppet] 
- 10https://gerrit.wikimedia.org/r/347899 (owner: 10Paladox) [01:05:45] (03Abandoned) 10Dzahn: contint: Disable hhvm temporarily on stretch [puppet] - 10https://gerrit.wikimedia.org/r/359492 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [01:07:03] (03Abandoned) 10Dzahn: redis: add support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/354041 (owner: 10Paladox) [01:09:17] (03CR) 10Dzahn: [C: 04-2] "please fix commit message and explain what this is all about" [puppet] - 10https://gerrit.wikimedia.org/r/351546 (owner: 10Paladox) [01:11:06] 10Operations, 10Wikidata: 503 error raises again while trying to load a Wikidata page - https://phabricator.wikimedia.org/T140879#3373870 (10Krinkle) [01:12:08] (03CR) 10Dzahn: "since https://gerrit.wikimedia.org/r/#/c/330832/ is merged, what's the status here now?" [puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [02:54:13] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 348 MB (3% inode=75%) [04:12:13] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=602.60 Read Requests/Sec=484.40 Write Requests/Sec=0.80 KBytes Read/Sec=39559.60 KBytes_Written/Sec=18.80 [04:13:14] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=7.20 Read Requests/Sec=20.70 Write Requests/Sec=28.20 KBytes Read/Sec=248.00 KBytes_Written/Sec=962.00 [04:16:13] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=5236.10 Read Requests/Sec=3826.20 Write Requests/Sec=1.90 KBytes Read/Sec=31464.00 KBytes_Written/Sec=788.00 [04:22:42] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Verify that the codfw lvs is configured correctly for Phabricator - https://phabricator.wikimedia.org/T168699#3373930 (10mmodell) [04:23:13] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=2.60 Read Requests/Sec=4.30 Write Requests/Sec=1.50 KBytes Read/Sec=22.40 KBytes_Written/Sec=50.00 [04:23:14] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Verify that the codfw lvs is configured correctly for Phabricator - https://phabricator.wikimedia.org/T168699#3373930 (10mmodell) [04:37:11] (03PS1) 10Dzahn: icinga: add plugin to check exim queue sizes [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) [04:41:48] 10Operations, 10monitoring, 10Patch-For-Review: Check for an oversized exim4 queue indicating mail delivery failures - https://phabricator.wikimedia.org/T133110#3373984 (10Dzahn) ^ with script above: ``` [mx1001:~] $ ./check_exim_queue -w 1000 -c 5000 OK: Less than 1000 mails in exim queue. [mx1001:~] $ ./c... 
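The quoted test run of `check_exim_queue` above shows the shape of such a plugin: count the queue, compare against the `-w`/`-c` thresholds, exit with the usual Nagios codes. A minimal sketch of that idea, assuming exim's `-bpc` option is available on the mail host; this is a reconstruction, not the script from the Gerrit change (r361023).

```
#!/bin/bash
# Sketch of a Nagios-style exim queue-size check (reconstruction, not r361023 itself).
while getopts "w:c:" opt; do
  case "$opt" in
    w) WARN=$OPTARG ;;   # warning threshold, e.g. -w 1000
    c) CRIT=$OPTARG ;;   # critical threshold, e.g. -c 5000
  esac
done
COUNT=$(exim -bpc)       # exim -bpc prints the number of messages in the queue ("exim4 -bpc" on Debian)
if   [ "$COUNT" -ge "$CRIT" ]; then echo "CRITICAL: ${COUNT} mails in exim queue."; exit 2
elif [ "$COUNT" -ge "$WARN" ]; then echo "WARNING: ${COUNT} mails in exim queue."; exit 1
else echo "OK: Less than ${WARN} mails in exim queue."; exit 0
fi
```

Invoked the same way as in the quoted comment: `./check_exim_queue -w 1000 -c 5000`.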
[05:27:43] 10Operations, 10Discovery, 10Icinga, 10Maps, and 2 others: Create Icinga alert when OSM replication lags on maps - https://phabricator.wikimedia.org/T167549#3374006 (10Pnorman) [05:53:29] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1026" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361025 [05:55:00] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1026" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361025 [05:57:28] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1026" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361025 (owner: 10Marostegui) [05:58:29] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1026" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361025 (owner: 10Marostegui) [05:58:38] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1026" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361025 (owner: 10Marostegui) [05:59:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1026 - T166207 (duration: 00m 47s) [05:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:40] T166207: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207 [06:03:22] 10Operations, 10DBA, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3374039 (10jcrespo) > I propose Monday after ops meeting (17:00 UTC). I am sorry, but that is outside of my working hours. [06:03:48] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361027 [06:03:51] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361027 [06:05:37] (03CR) 10Marostegui: "> well, that's interesting. it would explain my question from above" [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [06:06:02] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361027 (owner: 10Marostegui) [06:06:39] (03CR) 10Jcrespo: [C: 04-2] "You are modifying production.sql - those should not change- the web monitors do never connect to the databases. Any existing grants in tha" [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [06:07:18] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361027 (owner: 10Marostegui) [06:07:27] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361027 (owner: 10Marostegui) [06:07:43] RECOVERY - Check systemd state on releases1001 is OK: OK - running: The system is fully operational [06:08:26] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2066 - T168354 (duration: 00m 46s) [06:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:34] T168354: dbstore2001: s5 thread isn't able to catch up with the master - https://phabricator.wikimedia.org/T168354 [06:10:43] PROBLEM - Check systemd state on releases1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
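The "Check systemd state" flaps on releases1001 above report systemd's overall state going "degraded" because one unit has failed; the `!log` entry a little further down shows the fix that was applied (a `systemctl reset-failed` for the puppet service). A generic way to investigate and clear that condition, sketched here rather than copied from the session:

```
systemctl is-system-running            # prints "degraded" while any unit is in the failed state
systemctl list-units --state=failed    # which unit tripped the Icinga check?
journalctl -u puppet.service -n 50     # puppet was the failing unit named in the !log entry below
sudo systemctl reset-failed            # clear the failed state once the cause is understood
```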
[06:11:03] (03PS1) 10Marostegui: db-eqiad.php: Add coment to db1041 running alter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361028 (https://phabricator.wikimedia.org/T166208) [06:12:48] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Add coment to db1041 running alter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361028 (https://phabricator.wikimedia.org/T166208) (owner: 10Marostegui) [06:14:01] (03Merged) 10jenkins-bot: db-eqiad.php: Add coment to db1041 running alter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361028 (https://phabricator.wikimedia.org/T166208) (owner: 10Marostegui) [06:14:09] (03CR) 10Jcrespo: [C: 04-2] "This is not only needed, it is a security hole." [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [06:15:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add comments to db1041 long running alter status - T166208 (duration: 00m 46s) [06:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:28] T166208: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208 [06:15:43] RECOVERY - Check systemd state on releases1001 is OK: OK - running: The system is fully operational [06:15:51] (03CR) 10Jcrespo: [C: 04-2] "Not only unnecessary..." [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [06:15:56] (03CR) 10jenkins-bot: db-eqiad.php: Add coment to db1041 running alter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361028 (https://phabricator.wikimedia.org/T166208) (owner: 10Marostegui) [06:17:22] !log Deploy alter table on db1041 - s7 - T166208 [06:17:23] !log releases1001 - systemctl reset-failed to clear Icinga systemd status CRIT - service puppet [06:17:29] (03Abandoned) 10Dzahn: mariadb: add GRANTs for tendril@dbmonitor1001, tendriL@dbmonitor2001 [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [06:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:33] (03CR) 10Jcrespo: "If you are so sure this works, please deploy and break gerrit, but do not expect me to fix it for you later." 
[puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [06:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:34] (03Abandoned) 10Dzahn: Gerrit: Set useUnicode=true, also change connectionCollation to utf8mb4_unicode_ci [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [06:18:50] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3374054 (10Dzahn) a:05Dzahn>03None [06:23:02] (03Abandoned) 10Dzahn: switch tendril from einsteinium to dbmonitor1001 [dns] - 10https://gerrit.wikimedia.org/r/359372 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [06:35:00] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494#3374061 (10Dzahn) [06:36:27] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494#3366319 (10Dzahn) 05Open>03Invalid [06:37:24] 10Operations: tracking task: get rid of Ubuntu (trusty) (in prod) - https://phabricator.wikimedia.org/T168495#3374063 (10Dzahn) 05Open>03Invalid [06:38:44] 10Operations, 10DBA, 10Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3374067 (10Dzahn) [06:38:48] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3374068 (10Dzahn) [06:39:03] 10Operations, 10DBA, 10Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3187493 (10Dzahn) a:05Dzahn>03None [06:46:51] 10Operations, 10DBA, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3374077 (10madhuvishy) @jcrespo no problem! let me know what time works for you :) I can do earlier on Monday too. Would 14:00 UTC work? Feel free to propose a suitable time if not. Tha... [06:49:55] (03CR) 10Dzahn: ""either are ok to me" and then "you are modiyfing production, it's a security hole" are both reviews from you on the same patch. sorry if " [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [06:52:58] (03CR) 10Jcrespo: "The grants were ok, but adding it to production.sql.erb, which applies to all databases in production was an error- I didn't notice that a" [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [07:43:56] (03PS1) 10Marostegui: maintain-views.yaml: Remove unsued views [puppet] - 10https://gerrit.wikimedia.org/r/361034 (https://phabricator.wikimedia.org/T153213) [07:46:33] (03PS2) 10Marostegui: maintain-views.yaml: Remove unused views [puppet] - 10https://gerrit.wikimedia.org/r/361034 (https://phabricator.wikimedia.org/T153213) [07:52:08] 10Operations, 10DBA, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3374119 (10jcrespo) So there are several things here- the dns change and the actual reboot. There should be a time between them. I say you (as in, anyone on your team) change dns day 1... 
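For a switchover like tendril moving from einsteinium to dbmonitor1001 (discussed throughout this stretch of the log), two quick generic checks help before and after the DNS change: confirm where the name resolves, and confirm the new backend serves a valid certificate for it. The commands below are illustrative, not ones anyone ran here, and the dbmonitor1001 public FQDN is an assumption; the TLS failure that turns up later in the log (the disabled Let's Encrypt script) is exactly what the second check would have caught.

```
# Where does the public name point right now?
dig +short tendril.wikimedia.org

# Does the new backend present a valid certificate for that name?
# (dbmonitor1001.wikimedia.org as the reachable FQDN is an assumption)
echo | openssl s_client -connect dbmonitor1001.wikimedia.org:443 \
  -servername tendril.wikimedia.org 2>/dev/null | openssl x509 -noout -subject -issuer -dates
```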
[08:06:48] (03PS1) 10Filippo Giunchedi: hieradata: fix nutcracker thumbor config [puppet] - 10https://gerrit.wikimedia.org/r/361035 [08:09:22] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: fix nutcracker thumbor config [puppet] - 10https://gerrit.wikimedia.org/r/361035 (owner: 10Filippo Giunchedi) [08:14:23] RECOVERY - nutcracker process on thumbor2001 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [08:14:33] RECOVERY - nutcracker port on mw2148 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [08:14:33] RECOVERY - nutcracker port on thumbor2001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [08:15:23] RECOVERY - nutcracker process on mw2148 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [08:21:01] 10Operations: nutcracker test config in puppet doesn't work - https://phabricator.wikimedia.org/T168705#3374171 (10fgiunchedi) [08:21:03] RECOVERY - puppet last run on mw2148 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [08:21:03] RECOVERY - puppet last run on thumbor2001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [08:27:24] (03PS1) 10Filippo Giunchedi: nutcracker: validate new config file [puppet] - 10https://gerrit.wikimedia.org/r/361039 (https://phabricator.wikimedia.org/T168705) [08:27:27] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3374198 (10jcrespo) a:03jcrespo [08:28:10] (03Restored) 10Jcrespo: switch tendril from einsteinium to dbmonitor1001 [dns] - 10https://gerrit.wikimedia.org/r/359372 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [08:28:23] (03PS2) 10Jcrespo: switch tendril from einsteinium to dbmonitor1001 [dns] - 10https://gerrit.wikimedia.org/r/359372 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [08:30:23] 10Operations, 10Wikimedia-Stream: Error on RCStream server startup for the "flash policy server" - https://phabricator.wikimedia.org/T153770#3374210 (10Aklapper) [08:32:05] (03PS4) 10Volans: CLI: improve configuration error handling [software/cumin] - 10https://gerrit.wikimedia.org/r/357234 (https://phabricator.wikimedia.org/T158747) [08:32:07] (03PS2) 10Volans: ClusterShell: allow to set a timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359466 (https://phabricator.wikimedia.org/T164838) [08:32:09] (03PS2) 10Volans: CLI: migrate to timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359467 (https://phabricator.wikimedia.org/T164838) [08:32:11] (03PS5) 10Volans: Package metadata and testing tools improvements [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) [08:32:13] (03PS1) 10Volans: Fix Pylint and other tools reported errors [software/cumin] - 10https://gerrit.wikimedia.org/r/361040 (https://phabricator.wikimedia.org/T154588) [08:37:47] (03CR) 10Jcrespo: [C: 032] switch tendril from einsteinium to dbmonitor1001 [dns] - 10https://gerrit.wikimedia.org/r/359372 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [08:38:36] !log deploying dns change to tendril [08:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:51] (03PS1) 10Jcrespo: Revert "switch tendril from einsteinium to dbmonitor1001" [dns] - 10https://gerrit.wikimedia.org/r/361041 [08:43:18] (03CR) 10Jcrespo: [C: 032] Revert "switch tendril from 
einsteinium to dbmonitor1001" [dns] - 10https://gerrit.wikimedia.org/r/361041 (owner: 10Jcrespo) [08:49:09] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3374243 (10jcrespo) The dns change doesn't work either because it causes a TLS error. This is because LE has been disabled there: https:/... [08:59:04] 10Operations, 10Gerrit, 10Release-Engineering-Team (Kanban): Setup maintenance date to reindex gerrit (offline reindex) - https://phabricator.wikimedia.org/T168670#3371480 (10MarcoAurelio) I am not sure if this is causing {T168707}. [08:59:28] (03CR) 10Elukey: [C: 031] "Looking at the logs in T168705 it seems that validate_cmd works on the tmp file, so IIUC even the first puppet run should be covered by th" [puppet] - 10https://gerrit.wikimedia.org/r/361039 (https://phabricator.wikimedia.org/T168705) (owner: 10Filippo Giunchedi) [09:06:24] RainbowSprinkles: poke [09:07:22] (03PS1) 10Marostegui: Revert "MariaDB: temporarily increase limit for space alarm" [puppet] - 10https://gerrit.wikimedia.org/r/361043 [09:07:27] (03PS2) 10Marostegui: Revert "MariaDB: temporarily increase limit for space alarm" [puppet] - 10https://gerrit.wikimedia.org/r/361043 [09:09:19] (03CR) 10Marostegui: [C: 032] Revert "MariaDB: temporarily increase limit for space alarm" [puppet] - 10https://gerrit.wikimedia.org/r/361043 (owner: 10Marostegui) [09:16:10] 10Operations, 10DBA: puppet stopped mysqld using orphan pid file from puppet agent - https://phabricator.wikimedia.org/T86482#3374297 (10Marostegui) 05Open>03Resolved After having a chat with Filippo we believe this hasn't happened again so we are closing it for now. Feel free to reopen if this is seen again. [09:19:03] (03PS1) 10Jcrespo: dbmonitor1001: Reenable let's encript generation script [puppet] - 10https://gerrit.wikimedia.org/r/361046 (https://phabricator.wikimedia.org/T149557) [09:19:05] (03PS1) 10Jcrespo: dbmonitor: Remove tendril role from einstenium and tegmen [puppet] - 10https://gerrit.wikimedia.org/r/361047 (https://phabricator.wikimedia.org/T149557) [09:21:58] (03PS1) 10Jcrespo: Revert "Revert "switch tendril from einsteinium to dbmonitor1001"" [dns] - 10https://gerrit.wikimedia.org/r/361048 [09:22:07] (03CR) 10Marostegui: [C: 031] dbmonitor1001: Reenable let's encript generation script [puppet] - 10https://gerrit.wikimedia.org/r/361046 (https://phabricator.wikimedia.org/T149557) (owner: 10Jcrespo) [09:22:38] (03CR) 10Marostegui: [C: 031] dbmonitor: Remove tendril role from einstenium and tegmen [puppet] - 10https://gerrit.wikimedia.org/r/361047 (https://phabricator.wikimedia.org/T149557) (owner: 10Jcrespo) [09:25:20] (03CR) 10Jcrespo: [C: 032] Revert "Revert "switch tendril from einsteinium to dbmonitor1001"" [dns] - 10https://gerrit.wikimedia.org/r/361048 (owner: 10Jcrespo) [09:25:41] (03PS2) 10Jcrespo: dbmonitor1001: Reenable let's encript generation script [puppet] - 10https://gerrit.wikimedia.org/r/361046 (https://phabricator.wikimedia.org/T149557) [09:27:52] !log reapplying dns change - small downtime on tendril until puppet deploy and run [09:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:28] (03CR) 10Jcrespo: [C: 032] dbmonitor1001: Reenable let's encript generation script [puppet] - 10https://gerrit.wikimedia.org/r/361046 (https://phabricator.wikimedia.org/T149557) (owner: 10Jcrespo) [09:35:55] (03CR) 10Volans: [C: 04-1] "@mutante: I've replied in the task as you asked." 
[puppet] - 10https://gerrit.wikimedia.org/r/324841 (https://phabricator.wikimedia.org/T143175) (owner: 1020after4) [09:45:57] PROBLEM - nutcracker port on thumbor2002 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [09:46:17] PROBLEM - nutcracker process on thumbor2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (nutcracker), command name nutcracker [09:46:37] PROBLEM - puppet last run on thumbor2002 is CRITICAL: CRITICAL: Puppet has 44 failures. Last run 13 minutes ago with 44 failures. Failed resources (up to 3 shown): Service[nutcracker],Exec[create-tmp-folder-8821],Exec[create-tmp-folder-8810],Exec[create-tmp-folder-8819] [09:47:47] PROBLEM - Check systemd state on thumbor2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:49:51] that's me ^ silencing [09:51:57] RECOVERY - nutcracker port on thumbor2002 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [09:52:17] RECOVERY - nutcracker process on thumbor2002 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [09:55:21] !log reboot mw1250-53 for kernel updates [09:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:55] RECOVERY - Check systemd state on thumbor2002 is OK: OK - running: The system is fully operational [10:12:52] 10Operations, 10Gerrit, 10Release-Engineering-Team (Kanban): Setup maintenance date to reindex gerrit (offline reindex) - https://phabricator.wikimedia.org/T168670#3374397 (10Paladox) It may have, we may have discovered a bug if it is indeed the reindex. 2.13 is too buggy now. [10:12:55] RECOVERY - puppet last run on thumbor2002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [10:47:55] 10Operations, 10Traffic, 10Patch-For-Review: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3374447 (10ema) I've queried prometheus as follows to find the maximum transient storage usage per cache_type/layer over the past 30 days: `max by (job,layer) (max_over_time(v... [11:13:57] PROBLEM - nutcracker port on thumbor2003 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [11:14:17] PROBLEM - nutcracker process on thumbor2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (nutcracker), command name nutcracker [11:15:37] PROBLEM - puppet last run on thumbor2003 is CRITICAL: CRITICAL: Puppet has 52 failures. Last run 1 minute ago with 52 failures. Failed resources (up to 3 shown): Service[nutcracker],Exec[create-tmp-folder-8821],Exec[create-tmp-folder-8846],Exec[create-tmp-folder-8810] [11:15:37] PROBLEM - Check systemd state on thumbor2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
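The nutcracker breakage on the thumbor hosts, and the follow-up patch "nutcracker: validate new config file" (T168705), come down to checking the rendered config before the service is restarted. Twemproxy ships a syntax check for this; a sketch, with the config path as an assumption. As a reviewer notes earlier in the log, Puppet's `validate_cmd` runs such a check against its temporary file, so even the first Puppet run is covered.

```
# Validate the rendered twemproxy/nutcracker config before restarting the service.
nutcracker --test-conf -c /etc/nutcracker/nutcracker.yml && echo "config OK"

# After a restart, confirm the proxy listens where the Icinga check expects it:
nc -z 127.0.0.1 11212 && echo "port 11212 up"
```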
[11:15:40] (03PS3) 10Hashar: contint: PHP ext build dependencies on Nodepool [puppet] - 10https://gerrit.wikimedia.org/r/342635 (https://phabricator.wikimedia.org/T134381) [11:18:43] (03PS3) 10DCausse: [WIP] Switch this repo to a deb package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) [11:21:51] ORES is going down: https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1&from=now-1h&to=now [11:22:12] :( [11:22:52] It seems it is an overload [11:22:57] PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [11:24:02] Nope, it's 100% now [11:25:04] mutante: it seems CPU on scb1001 is around 100% again [11:25:39] oh nooo, we did not deploy wmf.6? there were a few fixes in it that i wanted to go live :( [11:25:47] i can't have a swat on friday, can i? :/ [11:28:10] Ops people, please kill electron on scb100[1-4] nodes, the CPU for all of them is around 100% causing ORES to return error to ALL requests [11:28:10] https://grafana.wikimedia.org/dashboard/db/ores?panelId=5&fullscreen&orgId=1 [11:28:14] see [11:29:44] afk for now [11:35:57] RECOVERY - nutcracker port on thumbor2003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [11:36:17] RECOVERY - nutcracker process on thumbor2003 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [11:37:33] _joe_: ^ [11:39:38] Amir1: hey, _joe_ is afk at the moment, is it safe to kill electron? [11:39:53] ema: we did that last time [11:40:12] ok, looking [11:40:50] yeah I was looking too, looks like the electron workers just restarted ? [11:41:43] got better a little https://grafana.wikimedia.org/dashboard/db/ores?panelId=5&fullscreen&orgId=1 [11:44:52] I'm gonna try restarting pdfrender.service on scb1001 [11:45:12] !log scb1001: restart pdfrender.service [11:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:46] Thanks [11:45:49] I'll see if I can capture strace / ltrace from electron on scb1003 [11:46:46] cpu usage is still high on scb1001, mostly due to celery workers [11:49:08] yeah it doesn't look like electron is the culprit, at least not looking at cpu usage [11:49:42] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:49:42] I think my stracing of electron processes on 1003 might have killed them, and "fixing" the issue at the same time [11:51:34] heh on e.g. 1004 there's high cpu usage from electron alright, together with celery [11:53:04] it seems scb1003 gets too many requests comparing to other nodes [11:53:12] around ten times more [11:53:22] PROBLEM - pdfrender on scb1003 is CRITICAL: connect to address 10.64.32.153 and port 5252: Connection refused [11:53:23] PROBLEM - puppet last run on thumbor2004 is CRITICAL: CRITICAL: Puppet has 21 failures. Last run 7 minutes ago with 21 failures. 
Failed resources (up to 3 shown): Exec[create-tmp-folder-8821],Exec[create-tmp-folder-8846],Exec[create-tmp-folder-8810],Exec[create-tmp-folder-8819] [11:53:50] godog: I'm gonna try restarting pdfrender on all 4 nodes if you agree [11:53:52] going to restart that, maybe LVS pick it up [11:54:03] heh also I got the infamous xpra error for pdfrender on scb1003 [11:54:10] (I'm talking about restarting celery and uwsgi) [11:54:25] ema: sec, I'm looking at the xpra issue on 1003 [11:54:28] ok [11:55:02] !log restarted uwsgi-ores and celery-ores-worker services in scb1003 [11:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:09] why the f it takes forever to make celery restart [11:57:32] It's backing up [11:57:44] https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-3h&to=now [11:58:27] Amir1: let's do one thing at a time [11:59:08] okay, sorry [11:59:53] Amir1: so ores is backing up in the sense that it is recovering? [12:00:30] godog: can't say for sure [12:00:37] it seems it's going down again [12:01:12] PROBLEM - thumbor@8835 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8835 is inactive [12:01:22] PROBLEM - thumbor@8819 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8819 is inactive [12:01:43] godog: my impression is that lvs sends too many requests to scb1003 https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-15m&to=now [12:01:51] compare different eqiad nodes there [12:01:56] (03PS1) 10Dereckson: Use hosts in wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361055 [12:02:02] PROBLEM - thumbor@8838 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8838 is inactive [12:02:03] PROBLEM - thumbor@8821 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8821 is inactive [12:02:16] half of the requests goes to scb1003 [12:02:22] PROBLEM - thumbor@8839 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8839 is inactive [12:02:22] PROBLEM - thumbor@8805 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8805 is inactive [12:02:22] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [12:02:29] sorry about thumbor2004, silencing [12:02:32] PROBLEM - thumbor@8823 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8823 is inactive [12:02:32] PROBLEM - thumbor@8806 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8806 is inactive [12:02:32] PROBLEM - thumbor@8840 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8840 is inactive [12:02:41] Amir1: https://config-master.wikimedia.org/pybal/eqiad/ores - weights for each host in there [12:02:52] PROBLEM - thumbor@8807 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8807 is inactive [12:02:58] (03CR) 10jerkins-bot: [V: 04-1] Use hosts in wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361055 (owner: 10Dereckson) [12:03:12] (03PS2) 10Dereckson: Use hosts in wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361055 [12:03:28] elukey: compare 1003 and 1004 in my graph, it's still three times higher [12:03:34] five times [12:05:16] is it directly tied to more traffic handled or a side effect of particular scores getting delayed all of 
sudden in one host more than the others? [12:05:45] (still trying to bring up pdfrender on scb1003 btw) [12:06:55] I think it's about more traffic being sent to 1003, it seems it has been happening for quite some time [12:07:11] Amir1: https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-3h&to=now - as far as I can see scb1003 always handles more scores [12:07:16] even when it is not rxploding [12:07:40] https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-90d&to=now interesting [12:07:41] oh, okay, so that's a red herring [12:08:05] ok definitely weird https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-7d&to=now [12:09:04] I don't understand why scb1004 is so happy and getting very low number of requests [12:09:33] but it seems it's under pressure of edits, someone is editing with really bad pace in wikidata or English Wikipedia [12:09:42] let me check (ORES exclude bots) [12:10:24] yeah I was about to ask, it seems that some clients like few big clients are generating big volume [12:11:09] but again, it is has been ongoing for a while, maybe pdfelectron added more entropy and caused ores to trip over its limits [12:11:32] Amir1: what is the current status? Ores down? [12:11:52] yes [12:11:56] ORES is down [12:12:52] RECOVERY - thumbor@8807 service on thumbor2004 is OK: OK - thumbor@8807 is active [12:13:02] RECOVERY - thumbor@8838 service on thumbor2004 is OK: OK - thumbor@8838 is active [12:13:02] RECOVERY - thumbor@8821 service on thumbor2004 is OK: OK - thumbor@8821 is active [12:13:12] RECOVERY - thumbor@8835 service on thumbor2004 is OK: OK - thumbor@8835 is active [12:13:22] RECOVERY - thumbor@8839 service on thumbor2004 is OK: OK - thumbor@8839 is active [12:13:22] RECOVERY - thumbor@8805 service on thumbor2004 is OK: OK - thumbor@8805 is active [12:13:28] is there a graph for number of human edits per wiki? I couldn't find it in grafana [12:13:32] RECOVERY - thumbor@8819 service on thumbor2004 is OK: OK - thumbor@8819 is active [12:13:33] ok, and the suspected cause is scb out of cpu? [12:13:40] no there's all edits combined [12:13:42] RECOVERY - thumbor@8806 service on thumbor2004 is OK: OK - thumbor@8806 is active [12:13:42] RECOVERY - thumbor@8823 service on thumbor2004 is OK: OK - thumbor@8823 is active [12:13:42] RECOVERY - thumbor@8840 service on thumbor2004 is OK: OK - thumbor@8840 is active [12:13:43] RECOVERY - puppet last run on thumbor2004 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [12:13:57] i.e. https://grafana.wikimedia.org/dashboard/db/edit-count [12:14:41] Amir1: any idea why celery is also consuming so much cpu? [12:14:54] godog: it's under pressure [12:15:43] RECOVERY - puppet last run on thumbor2003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [12:15:56] godog: one thing that we can do for now is to redirect traffic to codfw [12:16:13] but I'm not sure if that's an option [12:17:27] Amir1: the traffic will just shift to codfw and potentially ahve the same overload there too I'm assuming? 
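To test the suspicion above that LVS favours scb1003, two things can be compared: the weight and pooled state pybal is using (the config-master URL quoted above) and how each backend answers when hit directly on the ores port. A sketch using the hostnames and port 8081 that appear in this log; the exact format of the config-master listing is not guaranteed here.

```
# Per-host weight / pooled state as pybal sees it:
curl -s https://config-master.wikimedia.org/pybal/eqiad/ores

# Probe each backend directly, bypassing LVS, and compare status codes and latency:
for h in scb1001 scb1002 scb1003 scb1004; do
  curl -s -o /dev/null -w "$h  %{http_code}  %{time_total}s\n" "http://$h.eqiad.wmnet:8081/"
done
```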
[12:17:32] (03PS3) 10Lucas Werkmeister (WMDE): Configure WikibaseQualityConstraints extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358553 [12:18:02] I think codfw has some more capacity [12:18:13] also, active/active would help us for now [12:18:22] (03CR) 10Lucas Werkmeister (WMDE): "I’ve rebased the change (conflict with the changes to add a constraints section) and slightly amended the last comment to make the fallbac" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358553 (owner: 10Lucas Werkmeister (WMDE)) [12:18:54] it may be happening that once a node is up it receives traffic for most of the nodes (since they are not passing health checks) ending up in 503s again [12:19:27] they will send 503s to grafana, I can check that [12:19:49] the other weirdness is that the health checks (at least on scb1001) are returning (HTTP/1.0 200) meanwhile the rest is 503 [12:20:09] I also checked logstash to see if there's DoS attack, but number of external requests is natural [12:20:53] I restart ores in all nodes for now [12:21:41] number of internal (precache) requests it gets is higher but ores seen way worse in the past week [12:22:03] ack, if you think that might fix for sure [12:22:14] what happens to the celery queue? it gets flushed? [12:22:32] I've called akosiaris, he's coming online [12:22:35] it sends error [12:24:38] not sure what you mean by that? [12:25:18] godog: it flushes the queue and uwsgi sends out error AFAIK [12:25:30] ah ok [12:25:48] icinga says ores is back online but when I try manually, I get 503 [12:26:08] !log restarting celery and uwsgi on all scb nodes in eqiad [12:26:15] Amir1: health checks are passing but most of the scores get 503s [12:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:50] now scb1004 looks better [12:27:10] I can see 200s [12:27:22] nope, 503s again [12:28:32] PROBLEM - Check systemd state on mw2260 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:28:52] PROBLEM - Check systemd state on mw2259 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:29:03] I'm restarting scb1004, it's natural [12:29:26] well the goal was not to get 503s again :) [12:31:40] 503 is descending https://grafana.wikimedia.org/dashboard/db/ores?panelId=9&fullscreen&orgId=1&from=now-15m&to=now [12:31:56] let's see if it goes up again [12:32:41] in that case we can do 1- stop ChangeProp 2- increase ores capacity using puppet (it will eat more memory) [12:33:01] it's precaching requests that have gone up [12:33:19] so it's a reaction to changeprop I think [12:33:39] 503s went up again [12:33:54] akosiaris: yes, we need to stop ChangeProp right now [12:34:08] Pchelolo: ^ [12:35:21] yeah we can do that [12:36:10] akosiaris: btw. The new nodes are usable now, we can do the transfer today if that's fine [12:36:53] akosiaris: how do we stop changeprop? 
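Stopping changeprop, which akosiaris does shortly after this, amounts to stopping its systemd unit on every scb host, and the log later shows why the result is worth verifying: scb1001 turned out to still be running it after the first attempt. A sketch assuming the unit is simply called `changeprop` and using cumin for the fan-out (cumin shows up elsewhere in this log); the host expression is a best guess at ClusterShell-style ranges.

```
# Stop changeprop on the eqiad scb hosts, then verify it actually stopped everywhere.
sudo cumin 'scb100[1-4].eqiad.wmnet' 'systemctl stop changeprop'
sudo cumin 'scb100[1-4].eqiad.wmnet' 'systemctl is-active changeprop'   # expect "inactive" on every host
```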
[12:37:36] I was looking for a grafana changeprod dashboard, no joy so far [12:37:38] no let's not migrate to the new nodes just because we are under pressure [12:37:51] let's actually figure this out [12:38:02] so I am stopping changeprop for a while [12:38:21] godog: it's in event bus, let me find it [12:38:52] !log disable changeprop due to ORES issues [12:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:37] ok changeprop stopped [12:39:47] let's see [12:40:10] Amir1: btw, we should stop sending INFO messages to logstash [12:40:17] it's cluttering the view for now reason [12:40:26] and are pretty useless I guess [12:40:31] unless you feel differently [12:40:32] Amir1: thanks, found it [12:41:12] akosiaris: the aggregated version of it is useful, to see for example too many requests is coming from where [12:41:23] but lots of them are indeed useless [12:41:42] PROBLEM - Check systemd state on scb2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:41:42] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:41:42] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:41:45] Amir1: yeah we need to make it a bit better, actually send to correct log level to logstash [12:41:52] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:41:52] PROBLEM - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:41:56] instead of hardcoding to info [12:42:01] all these ^ are expected [12:42:22] PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:42:32] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:42:32] PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:43:49] Amir1: so it looks like we are ok ? [12:44:14] it's still overloaded, let me restart again to flush the queue [12:46:30] (03CR) 10Alexandros Kosiaris: [C: 032] dbmonitor: Remove tendril role from einstenium and tegmen [puppet] - 10https://gerrit.wikimedia.org/r/361047 (https://phabricator.wikimedia.org/T149557) (owner: 10Jcrespo) [12:46:44] (03CR) 10Alexandros Kosiaris: [C: 031] "Looks fine to me. 
Go for it" [puppet] - 10https://gerrit.wikimedia.org/r/361047 (https://phabricator.wikimedia.org/T149557) (owner: 10Jcrespo) [12:47:41] Amir1: btw, in https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=now-15m&to=now not cached and precached scores are exactly the same right now [12:49:03] akosiaris: I'm guessing external requests got blocked somehow [12:49:16] precache ones doesn't hit varnish [12:49:35] ofc they do [12:49:44] oh you mean changeprop originated ones [12:49:50] yeah that's true, they don't [12:51:06] overload errors is going up again yay [12:51:48] so it's someone external [12:52:20] however the cluster isn't CPU bounded [12:52:49] currently it's around 30% cpu used [12:53:44] https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=now-3h&to=now&panelId=1&fullscreen [12:54:10] how on earth codfw gets external requests, that graph is wrong [12:54:16] ofc it does [12:54:30] it's active/active geolocated [12:54:43] anyone closer to codfw than eqiad will hit codfw [12:54:54] why is eqiad not getting any requests is the question [12:55:55] Amir1: ^ ? [12:56:03] eqiad should be receiving external requests as well [12:56:22] ores is active/active? [12:56:26] Are you sure [12:56:42] yes [12:56:49] for eqiad, it's because it's overloaded [12:57:02] it's since the rollback from codfw [12:57:14] we went active/active since that [12:59:13] akosiaris: one strange thing here is https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-24h&to=now [12:59:13] I see requests to links like /scores/enwiki/?models=damaging%7Cgoodfaith&revids=787097648&precache=true&format=json from varnish boxes [12:59:39] scb1004 suddenly dropped number of requests it can handle, that probably is putting strain on under nodes [12:59:57] akosiaris: these ones are from jobrunner of mediawiki [13:01:09] and why does it pass precache=true as well ? [13:01:30] I may have misunderstood that part, I expect precache=true requests coming only from changeprop [13:01:37] is that correct ? [13:01:58] no, mediawiki requests also contains precache=true [13:02:14] because they get send once the edit is made [13:03:37] so what ? a revision get's a request to be scored from both mediawiki AND changeprop ? [13:03:57] or do the requests differ ? like different models ? [13:04:15] yeah, they are for different wikis and different models [13:04:37] but they have some overlap (they didn't but now, they have since we enabled goodfaith for mediawiki too) [13:04:48] I need to get that handled [13:05:06] well, load wise it should not matter, but it does make debugging more confusing and difficult, so yes please do [13:05:33] by load-wise, I mean because we have already calculated the scores [13:05:36] hm [13:05:46] akosiaris: the thing I don't understand is that the load on ores is not different from yesterday [13:05:46] you said the models may differ, so actually that's wrong [13:06:21] because they gets triggered at almost same time, it can put pressure on the workers [13:07:58] akosiaris: do you think we should increase the capacity of ores for now? [13:08:01] https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-7d&to=now [13:08:55] I don't see how that would help right now [13:09:32] yeah, just a thought [13:09:57] so it's the backpressure system that's currently being triggered [13:10:50] should we somehow reload eqiad's ORES to make that go away ? [13:10:57] oh you already did that [13:11:01] didn't help, right ? 
[13:11:08] well, it can't hurt if I redo it [13:11:26] I did that twice [13:11:32] I can do for the third time [13:12:22] PROBLEM - HHVM rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:13:12] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 79767 bytes in 0.155 second response time [13:13:48] !log bump uwsgi-ores and celery-ores-worker on scb100* [13:13:52] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb1001.eqiad.wmnet because of too many down! [13:13:52] PROBLEM - ores on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 8081: Connection refused [13:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:02] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb1001.eqiad.wmnet because of too many down! [13:14:02] PROBLEM - ores on scb1004 is CRITICAL: connect to address 10.64.48.29 and port 8081: Connection refused [13:14:12] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb1001.eqiad.wmnet because of too many down! [13:14:17] PROBLEM - LVS HTTP IPv4 on ores.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.10 and port 8081: Connection refused [13:14:22] PROBLEM - ores on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8081: Connection refused [13:14:22] PROBLEM - ores on scb1003 is CRITICAL: connect to address 10.64.32.153 and port 8081: Connection refused [13:14:23] that's me ^ [13:14:32] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb1001.eqiad.wmnet because of too many down! [13:14:48] <_joe_> what's up with ores? [13:15:02] RECOVERY - ores on scb1002 is OK: HTTP OK: HTTP/1.0 200 OK - 3666 bytes in 4.740 second response time [13:15:03] RECOVERY - ores on scb1004 is OK: HTTP OK: HTTP/1.0 200 OK - 3666 bytes in 0.012 second response time [13:15:17] RECOVERY - LVS HTTP IPv4 on ores.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 4375 bytes in 1.605 second response time [13:15:18] it's overloaded and can't get handle the pressure [13:15:22] RECOVERY - ores on scb1001 is OK: HTTP OK: HTTP/1.0 200 OK - 3666 bytes in 0.019 second response time [13:15:22] RECOVERY - ores on scb1003 is OK: HTTP OK: HTTP/1.0 200 OK - 3666 bytes in 0.011 second response time [13:15:32] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [13:15:36] Amir1: should we try to clean the queue ? [13:15:57] I guess that would help, assuming external requests don't pile up agian [13:15:58] again* [13:16:02] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [13:16:05] yeah [13:16:19] do you want to flush redis [13:16:45] yeah I can do that [13:16:52] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [13:17:12] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy [13:21:27] !log issue flashdb on oresrdb1001:6379 [13:21:31] Amir1: ^ [13:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:47] only 2k celery tasks now [13:22:18] thanks. 
Let's see [13:22:55] I 'll wait 10 mins and then restart changeprop as well if everything is ok then [13:23:58] Amir1: /me curious, what happens when a score is requested for a revision that has already a job in the queue ? [13:24:31] itsn't not double calculated I assume, right ? [13:24:46] akosiaris: per what's Aaron implemented it's not get duplicated [13:24:59] but honestly I never checked that in prod [13:26:42] 11k jobs already in redis [13:26:47] and I haven't yet started changeprop [13:27:41] and it's overloaded again [13:29:29] so let's find who's overloading us then [13:30:58] akosiaris: I checked logstash and nothing was problematic, the load was normal [13:31:31] something is making ores extremely slow [13:33:40] (03PS14) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) [13:35:28] Amir1: looking at the logs from scb1003 [13:35:51] it looks like there is no external entity issuing too many requests [13:35:57] we are DoSing ourselves [13:36:30] 9690 user agent "MediaWiki/1.30.0-wmf.5" [13:36:30] 5840 user agent "MediaWiki/1.30.0-wmf.6" [13:36:43] that's scb1003 since 12:40 today [13:37:05] the next one is at 294 [13:37:30] akosiaris: that translates to someone making too many edit on a non-bot account [13:37:49] I checked wikidata and nothing was suspicious [13:38:27] I can check in graphite [13:38:46] 6357 scores/enwiki/ [13:38:46] 5698 scores/wikidatawiki/ [13:38:55] <_joe_> akosiaris: I bet it's someone using the mw api [13:38:59] same host, same timeperiod [13:39:11] <_joe_> which I already disabled once [13:39:19] <_joe_> let me see in api.log [13:40:01] _joe_: thanks! [13:41:22] let me check the logs [13:41:43] <_joe_> Amir1: what's the action to query ores from the action api? [13:42:17] oresreview is one of them [13:42:25] there are several cases [13:42:44] <_joe_> yeah I can't find any calls on the api logs [13:42:55] <_joe_> so I guess it's a bot not marked as such [13:43:03] <_joe_> again, not the first time :P [13:43:57] Amir1: I guess taks in redis being marked as REVOKED is expected ? [13:44:18] I guess, since we were flushing them [13:45:04] hmmm not sure how this makes sense [13:45:06] akosiaris: I see ChangePropagation/WMF requests, are you sure you disabled those? [13:45:26] (just tailing main.log in scb1003) [13:45:53] (just tailing main.log in scb1003) [13:46:00] Sorry [13:46:06] ? [13:46:13] I am definitely sure I did [13:46:16] double checking [13:46:53] maybe, it's disabled but old requests are being retried [13:47:06] hm,, no actually it was not disabled on scb1001 [13:47:15] although I am certain I disabled it on it too [13:47:28] interesting.. so service changeprop stop might actually fail ? [13:47:31] how nice [13:47:37] anyway reissued the command now [13:48:19] ok this time around it looks like it's stopped everywhere [13:48:42] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:48:42] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:49:30] another flushing of everything? [13:50:51] sure [13:51:07] dbsize [13:51:07] (integer) 31 [13:51:29] !log issue flashdb on oresrdb1001:6379 [13:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:21] Amir1: looks like it's at 109 now.. 
and stable for the last few mns [13:52:23] mins* [13:52:24] secs* [13:52:25] damn [13:52:42] (03CR) 10Elukey: "@Mforns: I removed the extra warning log about number of rows updated bigger than the batch size since now it does not help much, plus add" [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [13:56:18] (03PS1) 10Strainu: Set collation for Romanian wikis to uca-ro [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361066 [13:57:02] (03PS2) 10Strainu: Set collation for Romanian wikis to uca-ro [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) [13:58:21] Amir1: dbsize at 3,6k [13:58:32] which is way less than before [13:59:16] https://www.wikidata.org/w/index.php?title=Special:Contributions/Teolemon& [13:59:26] this user is doing lots of edits [14:00:19] that's clearly a bot [14:00:32] <_joe_> let' [14:00:41] <_joe_> s find a wikidata admin and block him [14:00:55] _joe_: I am [14:01:03] <_joe_> I suspected so :) [14:01:05] yeah I was about to say we have one :-) [14:01:12] <_joe_> Amir1: block him. [14:01:22] <_joe_> well, it [14:01:36] yeah they are in clear violation of the rules. This is a bot and it's not marked as a bot [14:01:56] <_joe_> even if they weren't, it's a production issue and we need to stop the bleeding [14:04:21] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Configure WikibaseQualityConstraints extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358553 (owner: 10Lucas Werkmeister (WMDE)) [14:05:09] Amir1: I see you blocked the user, let's see what happens [14:05:17] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Access request for Daniel Worley to analytics / hadoop - https://phabricator.wikimedia.org/T168439#3374794 (10Dworley) Yes, my wikitech email is fine. Thanks! [14:05:35] yeah, I also sent a warning to another one too [14:08:05] (03PS3) 10Giuseppe Lavagetto: Add build script plus nodejs base images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/360813 [14:08:19] (03PS1) 10Ottomata: Insert the (new) page-create events into MySQL too [puppet] - 10https://gerrit.wikimedia.org/r/361068 [14:08:29] I'd need to reboot some app/api servers for the kernel updates, should I wait a bit more ? [14:10:21] (03PS4) 10DCausse: Switch this repo to a deb package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) [14:12:09] (03CR) 10Ottomata: [C: 032] Insert the (new) page-create events into MySQL too [puppet] - 10https://gerrit.wikimedia.org/r/361068 (owner: 10Ottomata) [14:12:46] Amir1: I see no overload ORES. I would like to re-enable changeprop. any objections ? [14:13:10] akosiaris: it's getting timeout error lots of time [14:13:21] can we wait for around ten more minutes? [14:13:54] yeah [14:13:57] sure, no prob [14:15:19] akosiaris: we are overloaded again [14:15:19] https://grafana.wikimedia.org/dashboard/db/ores?panelId=9&fullscreen&orgId=1&from=now-1h&to=now [14:15:22] and I'm done [14:16:07] ? [14:16:40] dbsize [14:16:41] (integer) 11119 [14:16:44] what on earth ? [14:17:42] akosiaris: the next thing we can do is to kill all ORESFetchScore jobs in jobrunner [14:17:49] _joe_: what do you think? 
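"Kill all ORESFetchScore jobs in jobrunner", as proposed above, would be an operation on the MediaWiki job queue rather than on ORES itself. A rough sketch of sizing the problem first, assuming the WMF `mwscript` wrapper and MediaWiki's `showJobs.php` maintenance script, and taking the job type name from the chat; nothing in the log shows this actually being run. If the jobs really had to be dropped, MediaWiki's `manageJobs.php` (with a type filter and a delete action) is the kind of tool that would follow, but that step is not taken here.

```
# How many ORES-related jobs are queued / claimed / abandoned on the busiest wikis?
# (mwscript and the exact job type name are assumptions based on the discussion above)
mwscript showJobs.php --wiki=wikidatawiki --group | grep -i ores
mwscript showJobs.php --wiki=enwiki --group | grep -i ores
```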
[14:17:52] (03PS5) 10DCausse: Switch this repo to a deb package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) [14:17:55] maybe that's putting too much pressure [14:18:18] <_joe_> Amir1: it's not exactly trivial to do that, FYI [14:20:30] why codfw is down too [14:20:39] it wasn't down half an hour ago [14:22:21] sigh [14:23:30] dbsize [14:23:30] (integer) 1814124 [14:23:32] that's codfw [14:23:33] damn [14:29:59] I'm assuming the reason why celery uses so much cpu is because its queue is just big with things to process? [14:30:56] the other thing I noticed is that despite the 503 there's nothing in ores logs afaics, except the access log itself showing the 503 [14:34:10] I'm bringing Aaron [14:36:23] PROBLEM - Host mw2259 is DOWN: PING CRITICAL - Packet loss = 100% [14:36:57] that's me ^ stale host from thumbor reimaging [14:37:00] 10Operations, 10Traffic, 10Patch-For-Review: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3374918 (10ema) >>! In T164768#3374447, @ema wrote: > When it comes to upload and misc, instead, we should try to find out why transient storage usage grows up to > 50G in cert... [14:37:03] RECOVERY - Check systemd state on thumbor2003 is OK: OK - running: The system is fully operational [14:37:04] mw2260 will likely follow [14:37:13] RECOVERY - Host mw2259 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [14:37:13] RECOVERY - Check systemd state on mw2259 is OK: OK - running: The system is fully operational [14:37:25] akosiaris: it seems the job gets retried over and over again [14:37:31] for mediawiki [14:37:39] I don't know how to flush that out [14:37:53] and the queue will get bigger and bigger [14:37:53] RECOVERY - Check systemd state on mw2260 is OK: OK - running: The system is fully operational [14:38:20] so jobqueue is what ? resubmitting the same job over and over to ORES ? [14:38:26] if you check logs, recent requests from mediawiki are for revisions are for hours ago [14:38:26] and btw, that answer my question [14:38:31] from earlier [14:38:38] yeah [14:38:47] somehow those requests get resubmitted and ores does not deduplicate [14:38:55] I know it stops after 30 retries but we can't afford that [14:40:11] akosiaris: for mediawiki, when it sees a 503, it marks it for a retry after a while, the failed job doesn't even get to the queue [14:40:27] *even get to the ores celery queue [14:40:48] PROBLEM - Check whether ferm is active by checking the default input chain on bast3002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:41:18] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational [14:41:38] RECOVERY - Check whether ferm is active by checking the default input chain on bast3002 is OK: OK ferm input default policy is set [14:41:49] Amir1: so, how is the overload condition triggered ? [14:42:05] I mean what is the code that says "hey we are overloaded currently!" ? [14:42:15] cause it's clearly not related to CPU/mem/IO of boxes [14:42:18] when the redis queue gets more than 100 revisions, it stops to accept more and sends out 503 [14:42:38] 100 revisions ? [14:42:44] 100 per worker [14:42:52] ah [14:43:08] let me ask aaron [14:44:03] so 64*2*4 ? [14:44:13] 32 cores, 2 workers per core, 4 boxes [14:44:15] akosiaris: Aaron says it's 100 in total [14:45:11] so wait.. am I understanding this correctly ? we can serve at most 100 revisions scorings at any point in time ? 
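The overload behaviour described just above ("all workers busy plus about 100 tasks queued, then 503") can be eyeballed directly against the celery broker. A sketch where the Redis host and port come from the `!log` entries in this log, while the `.eqiad.wmnet` suffix and the queue key name `celery` (Celery's default) are assumptions; 100 is the threshold quoted in the chat.

```
# Rough check of the backpressure condition described above.
QLEN=$(redis-cli -h oresrdb1001.eqiad.wmnet -p 6379 llen celery)
echo "queued scoring tasks: ${QLEN} (overload threshold quoted in the chat: 100)"
if [ "${QLEN}" -gt 100 ]; then
  echo "above the threshold: new requests would be answered with 503 'server overloaded'"
fi
```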
[14:45:26] What I think worth checking is that, it seems ores is slow even when there is no tasks in the queue [14:45:31] maybe that's related to something [14:46:07] akosiaris: no, when workers are assigned to do a task it's not in queue anymore [14:46:45] so all workers are busy + 100 in queue [14:46:55] o/ [14:46:59] Amir1, just got connected. [14:47:01] 10Operations, 10Traffic, 10Patch-For-Review: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3374941 (10BBlack) >>! In T164768#3374447, @ema wrote: > On text, transient storage usage seems pretty reasonable; we could cap as follows, leaving plenty of room for spikes: >... [14:47:07] hey [14:47:15] halfak: o/ [14:47:39] hey akosiaris [14:47:59] So from what I've heard from Amir, my suspect is the uwsgi queue. [14:48:12] I think each host has a uwsgi queue. [14:48:45] uwsgi workers are having fun https://grafana.wikimedia.org/dashboard/db/ores?panelId=13&fullscreen&orgId=1 [14:48:46] Has anyone looked into what kind of requests fill up the redis queue as soon as it is cleared? [14:49:04] halfak: most of them now are mediawiki jobqueue [14:49:06] yeah me.. REVOKED staff [14:49:17] Oh... what. So non of them are doing anything? [14:49:42] akosiaris, "REVOKED"? [14:50:02] TaskRevokedError [14:50:04] I am doing a [14:50:12] get celery-task-meta-wikidatawiki:goodfaith:0.3.0:505327754 [14:50:19] that's the human readable part [14:50:28] and status: REVOKED [14:50:52] No deploys recently, right? [14:51:16] AFAIK yes [14:51:22] last on is Jun 15 [14:51:26] no deploys [14:51:29] kk [14:51:31] so no deploys [14:51:39] I've got headache [14:51:47] getting some rest and be back soon [14:51:50] o/ [14:51:59] Looks like celery is doing *something* [14:52:00] o/ [14:52:11] have a good rest, Amir1 [14:52:26] halfak: so I 've flushed the queue store at least 3 times up to now [14:52:32] with redis FLUSHDB [14:52:36] hasn't really helped [14:52:51] kk [14:53:00] dbsize in eqiad is already at 18321 keys [14:53:28] Hmm.. I'm trying to connect to 6380 in eqiad and it's not working. [14:53:32] Oh nevermind. [14:53:34] Just slow [14:56:11] Hmm.. I just got some requests to go to ORES successfully. [14:56:13] So not fully down? [14:56:22] Amir1 said we were down for the last ~3 hours [14:56:49] we 've been having on and off troubles for about that time, yes [14:56:54] heavily overloaded at times [14:57:20] I see. And the external request rate does not correspond to the load? [14:57:40] no this is self inflicted [14:57:46] it's mediawiki [14:57:50] not some external user [14:57:58] at least not by the number of requests we are seeing [14:58:03] I've got a suggestion [14:58:16] let's disable ores review tool in enwiki and wikidatawiki for now [14:58:46] and add it back later [14:58:59] when we are on the new nodes [14:59:45] Hmm... That seems drastic. But the overload errors chart is insane. [15:00:11] akosiaris, so in your opinion mediawiki requests are DOSing us? [15:00:30] (03PS6) 10DCausse: Switch this repo to a deb package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) [15:00:38] yes by a look at the logs more than 90% of the requests are coming from mediawiki [15:00:51] I checked the celery queue (which is checked for 503's) and it's totally empty in eqiad. 
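Because the Celery queue and the score cache live on two different redis instances (6379 and 6380 respectively), they have to be inspected separately. A small diagnostic sketch mirroring the dbsize/GET checks quoted above; the host name, and which of the two instances the celery-task-meta result records land on, are assumptions.

    import redis

    HOST = "localhost"  # e.g. the oresrdb service address for the relevant DC

    queue = redis.StrictRedis(host=HOST, port=6379)  # Celery broker / job queue
    cache = redis.StrictRedis(host=HOST, port=6380)  # score cache

    print("pending celery jobs:", queue.llen("celery"))
    print("queue dbsize:       ", queue.dbsize())
    print("score-cache dbsize: ", cache.dbsize())
    # Per-task status records look like the one quoted above; which instance the
    # result backend writes them to is assumed here, not confirmed.
    print(cache.get("celery-task-meta-wikidatawiki:goodfaith:0.3.0:505327754"))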
[15:00:59] yes I am wondering about that too [15:01:01] (03PS4) 10Ema: varnish: limit varnishd transient storage size [puppet] - 10https://gerrit.wikimedia.org/r/353274 (https://phabricator.wikimedia.org/T164768) [15:01:12] the queue ? no not the queue [15:01:20] the scores cache [15:01:27] the queue is in port 6379 [15:01:35] Oh woops. Thanks [15:01:40] 127.0.0.1:6379> dbsize [15:01:40] (integer) 18321 [15:01:49] but the scores cache is indeed empty in eqiad [15:01:54] and I did not empty it [15:01:55] OK yeah. Seeing 103 now. [15:02:17] That was me being an idiot. Damn it. [15:02:21] ? [15:02:30] Accidentally ran an old command while doing Ctrl-R [15:02:38] and ? flushed the cache ? [15:02:40] Apparently I cleared the cache a while back [15:02:44] oh [15:02:58] ok that's not good... [15:03:15] Thought I ctrl-C'd fast enough :/ I guess not. [15:03:18] yeah. Not helping. [15:03:28] ok, so what do we do ? [15:03:45] I'm looking at that celery queue now [15:04:20] Can we switch traffic to codfw? I don't see any overloading there [15:04:51] I guess so... [15:05:03] btw we have disabled changeprop across the fleet [15:05:13] so no precache from changeprop [15:06:34] I don't get it. All of my requests via the browser are getting through. [15:07:12] are you perhaps hitting codfw ? [15:07:18] mine don't [15:07:19] https://ores.wikimedia.org/v2/scores/wikidatawiki/?models=damaging&revids=421063984 [15:07:20] It looks like it. [15:07:28] check header [15:07:32] codfw is the only think returning scores now. [15:08:32] curl --resolve ores.wikimedia.org:443:208.80.154.224 'https://ores.wikimedia.org/v2/scores/wikidatawiki/?models=damaging&revids=421063984' [15:09:04] 404? [15:09:45] yeah typo on my part [15:09:47] (03PS2) 10Nschaaf: Stop reader surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360849 (https://phabricator.wikimedia.org/T131949) [15:09:49] curl --resolve ores.wikimedia.org:443:208.80.208.80.154.251 'https://ores.wikimedia.org/v2/scores/wikidatawiki/?models=damaging&revids=421063984' [15:09:54] { [15:09:54] "error": { [15:09:54] "code": "server overloaded", [15:09:55] "message": "Cannot process your request because the server is overloaded. Try again in afew minutes." [15:09:55] } [15:10:05] the above would make you hit eqiad [15:10:11] I get a score. [15:10:19] Am connecting from SF right now. [15:10:23] I get overload [15:10:26] So likely codfw [15:10:40] (03CR) 10Nschaaf: [C: 04-1] "Should not be merged until we're ready to stop the surveys" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360849 (https://phabricator.wikimedia.org/T131949) (owner: 10Nschaaf) [15:11:12] halfak: try curl --resolve ores.wikimedia.org:443:208.80.208.80.154.251 [15:11:15] it will force eqiad [15:11:44] I copy-pasted your last command. [15:11:50] Still getting a score. [15:11:54] weird [15:11:58] yeah. Hmm [15:13:07] how about this one ? [15:13:10] curl -i --resolve ores.wikimedia.org:443:208.80.154.251 'https://ores.wikimedia.org/v2/scores/wikidatawiki/?models=damaging&revids=421063984' [15:13:19] I am king of typos today it seems [15:13:23] Looks like the queue isn't moving at all on eqiad. Maybe the celery workers have lost connection. [15:13:27] they are doing *something* [15:13:48] I am running out of battery, I 'll be on the move for a bit to get closer to a power source [15:13:51] brb [15:13:52] Got the overload [15:13:59] kk [15:14:58] I think I want to try restarting the celery workers on the scb100* nodes. [15:15:11] And see if we can get this queue moving. 
[15:17:25] halfak: I'm here too, LMK if you need some root operations [15:18:09] godog, thanks. I think I will. Thinking about what this should look like. [15:21:44] godog, I think we want "service ores-celery restart" on scb100* [15:22:19] I'm struggling to confirm that i got the service name exactly right but there should be nothing else like that name. [15:22:54] halfak: looks like it is celery-ores-worker, I'll roll restart it [15:23:07] Thanks [15:23:16] halfak: I did the restart, you can do it too (we have sudo rights only for restarting these services) [15:23:19] you should be able too btw via sudo [15:23:23] heh I was gonna say [15:23:26] :D [15:23:39] godog, I get a password prompt :/ [15:24:15] halfak: using celery-ores-worker as the service name? mhh it should work [15:24:36] Ahh I had the name wrong. [15:25:02] Amir1, I'm watching the queue to see if *anything* changes. [15:25:12] If nothing at all, I think I have an idea about what is going on. [15:25:17] halfak: I'll go ahead and restart anyhow [15:25:44] OK godog. I have a restart for the service hanging on scb1001 right now [15:26:23] Restart finished [15:26:45] yeah! Queue is changing again [15:27:08] Fills right back up. [15:27:34] weird. So, like, the workers come back online, empty the queue but then stop processing the queue at all. [15:27:40] And it fills back up. [15:28:02] while using a ton of cpu too, afaics [15:28:22] Oh. Wait. loooks like the queue is moving now. [15:30:01] yeah. Definitely moving [15:31:27] halfak: is it possible to use celery native inspection tools btw? e.g. using the command line celery binary? [15:31:30] "binary" [15:31:49] hmm... not sure I've done that before. [15:31:58] I'm inspecting redis directly now. [15:32:29] Looks like the workers have stopped consuming from the queue. [15:32:38] ack, thanks, mostly curiosity [15:32:56] OK so here's what I think is going on: [15:33:08] Oh wait. [15:33:20] Not hypothesis time until I figure out what the celery workers are doing with all their CPU. [15:33:42] And why codfw isn't struggling with this. [15:33:55] I'm not sure why we aren't just redirecting traffic to codfw at this point. [15:33:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:33:58] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:35:08] I can help with that if needed halfak, that'd mean changing varnish to use codfw instead of eqiad ? [15:35:13] 10Operations, 10Wikimedia-Logstash: Set up a service IP for logstash - https://phabricator.wikimedia.org/T113104#3375027 (10Krinkle) [15:35:49] godog, yeah I think so. I wonder if akosiaris already decided against this. I'm kind of a latecomer to this downtime event. [15:35:58] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:36:19] But codfw is throwing no errors right now. And I can successfully make requests to it. [15:36:50] 10Operations, 10Wikimedia-Logstash: Log lines on flourine overflow at 8092 bytes. - https://phabricator.wikimedia.org/T114849#3375028 (10Krinkle) [15:36:59] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:37:36] ack, I'm looking into how to make the switch if it comes to that [15:37:58] Not quite sure where I should expect to find celery logs these days. 
"journalctl -u celery-ores-worker -f" returns "No journal files.." [15:38:54] 10Operations, 10Wikimedia-Logstash: Log lines on flourine overflow at 8092 bytes. - https://phabricator.wikimedia.org/T114849#3375053 (10Krinkle) [15:39:21] judging by what files celery has open I'd say /srv/log/ores/app.log [15:39:37] but it is an empty file, next step would be to increase its logging level [15:39:54] Wait. That shouldn't be an empty file. [15:39:58] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [15:39:59] There's always *something* in there [15:41:17] I'm down to increase logging for celery though [15:41:23] there's indeed exceptions in the previous file, app.log.1 [15:41:27] * halfak checks on where to do that. [15:41:48] Looks like we probably want to do a deploy for that. We have a log config in the deploy repo. [15:41:59] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:42:32] /srv/deployment/ores/deploy/logging_config.yaml [15:42:49] Celery logs on ERROR [15:42:53] Could set that to INFO for a bit. [15:43:11] I'm checking if that config overrides or not command line [15:43:14] ExecStart=/srv/deployment/ores/venv/bin/celery worker \ --app ores_celery.application \ --loglevel ERROR [15:43:32] but yeah we can try the config first [15:43:51] Oh... Hmm... I doubt it would override the commandline. [15:44:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:44:00] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:44:35] ok I'll try on scb1001 to change that first in the systemd service file [15:45:20] !log bounce celery-ores-worker on scb1001 with logging level INFO [15:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:40] OK. I think if we don't see something obvious here, let's just switch traffic to codfw [15:46:22] takes forever to stop celery-ores-worker heh [15:46:41] yeah it does :/ [15:47:35] ah that's because systemd sigkills it [15:47:58] * halfak tail -f's app.log [15:48:33] heh doesn't seem to have changed anything, ok to change the config too? [15:49:17] * halfak thinks. [15:49:26] Let's stop debugging and move traffic to codfw [15:50:15] Here's a past incident where we re-routed traffic: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170426-ORES [15:50:22] https://gerrit.wikimedia.org/r/#/c/350487/ [15:50:27] Looks like that rerouted to eqiad. [15:50:52] ok, I'll do the same [15:51:23] * halfak is sitting in bed still [15:51:30] I'm going to brb to go to a couple of morning things. 
[15:52:23] (03PS1) 10Filippo Giunchedi: hieradata: shift all ores traffic to codfw [puppet] - 10https://gerrit.wikimedia.org/r/361075 [15:54:05] !log Restarted Jenkins [15:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:20] PROBLEM - Apache HTTP on mw2106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:54:35] halfak: https://gerrit.wikimedia.org/r/#/c/361075 [15:55:09] RECOVERY - Apache HTTP on mw2106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.121 second response time [15:56:24] (03CR) 10Halfak: [C: 031] hieradata: shift all ores traffic to codfw [puppet] - 10https://gerrit.wikimedia.org/r/361075 (owner: 10Filippo Giunchedi) [15:56:43] back [15:56:50] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: shift all ores traffic to codfw [puppet] - 10https://gerrit.wikimedia.org/r/361075 (owner: 10Filippo Giunchedi) [15:57:49] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [15:58:37] * halfak watches codfw stuff [15:58:49] icinga things we recovered. [15:59:04] we don't :) [15:59:18] Amir1, aren't you supposed to be resting? [15:59:19] halfak: traffic should be moving shortly [15:59:19] :P [15:59:27] godog, thanks. [16:00:05] This is going to be a confusing debug, I think. [16:00:28] codfw continues being totally OK [16:01:13] halfak: I can't [16:01:18] I'm half resting [16:01:37] halfak: I'm afraid we might overload the whole codfw [16:01:37] godog, it looks like scb1001 might have recovered for a bit when you restarted it. Just a note. [16:01:48] Amir1, I don't think this is a DOS [16:01:59] This is workers making a brief connection and then losing it. [16:02:05] celery workers --> redis [16:02:10] And then the queue doesn't move at all. [16:02:15] halfak: yeah I think that's a side effect of restoring the previous loglevel to ERROR and restarting [16:02:19] I checked the logs for load several times, it looks okay to me too [16:02:44] It looks like scb1001 was the only machine to recover when we did the rolling restarts too [16:02:55] https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&panelId=3&fullscreen [16:03:11] Note that scb1001 is the only machine that produces good scores during the spikes [16:03:15] in eqiad [16:03:40] (03PS2) 10RobH: Daniel Worley shell access request [puppet] - 10https://gerrit.wikimedia.org/r/360988 [16:04:09] godog, is it possible that the rolling restarts only hit scb1001? I can't imagine that. [16:04:36] Celery is working hard on scb1002, so there's something to that. [16:04:54] I'm going to restart celery-ores-worker manually on scb1002 to see what happens. [16:04:57] I am back [16:05:02] halfak: no I don't think so either, scb1001's celery was restarted like 10 minutes ago when I re-enabled puppet there [16:05:06] took me longer than expected sorry [16:05:11] * akosiaris reading backlog [16:05:19] akosiaris: we've moved all traffic to codfw in the meantime, mostly the tl;dr [16:05:52] akosiaris, I've learned that when we restart, scb1001 is the only eqiad machine to start handling requests again for a little bit. [16:05:52] ok [16:06:05] wat [16:06:06] ? [16:06:30] Restarting the celery service on scb1002 manually right now to see if we can get it to do some scoring. 
[16:06:43] It's almost as though we didn't restart anything on scb1002-1004 from grafana [16:06:59] we 've restarted them multiple times already IIRC [16:07:24] Looks like codfw doesn't have the same issues [16:07:25] so dbsize on 6380 ores is 3863 [16:07:35] Our overload errors are falling fast [16:07:36] so at least something has been scored [16:07:53] akosiaris, yeah. It all went through scb1001 [16:07:58] Not scb1002-1004 [16:08:20] I just manually restarted scb1002 and got a few scores to go through. [16:08:33] So I think the rolling service restart might have failed somehow. [16:08:44] Doesn't make sens.e Just my operating hypothesis. [16:09:03] See https://grafana.wikimedia.org/dashboard/db/ores?panelId=3&fullscreen&orgId=1 [16:09:12] Pay special attention to scb100* [16:11:02] Looks like scb1002 cleared the eqiad celery queue [16:11:15] halfak: I tried restarting them, every time around 100 edits go through and then it's overload again [16:11:41] you can see spikes https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1&from=now-6h&to=now [16:11:45] Amir1, right. But for some reason, the last couple rolling restarts resulted in only scb1001 getting a few scores done. [16:11:52] Note which machine is scoring for each spike. [16:11:57] when the failure rate is below 100% It means I restarted them [16:13:12] Amir1, yeah. I see that. Still we are having the situation where only one machine scores after a restart [16:13:38] https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-3h&to=now [16:13:53] compare the spikes between different eqiad nodes [16:14:21] Returning scores != processing scores [16:14:25] looks like 500s have recovered from ores, since 16:02 [16:14:25] Returning comes from uwsgi [16:14:29] processing comes from celery [16:14:35] https://grafana.wikimedia.org/dashboard/db/ores?panelId=3&fullscreen&orgId=1&from=now-3h&to=now [16:14:42] * halfak keeps pointing to ^ [16:15:01] godog, I'm seeing the same. Thank you for helping turn this into a less urgent problem ^_^ [16:15:08] doing an strace on a random celery worker on scb1002 says it's doing something [16:15:10] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3375114 (10RobH) I had a brief chat in IRC with Faidon about this. The boot issue where the LVM fails to detect (due to disks not detecting) should technically be fixed by addressing its r... [16:15:27] akosiaris, confirmed here too. Not sure what it is doing but definitely a lot of something. [16:16:01] halfak: np! glad it worked *touch wood* [16:16:20] :) I like having two datacenters to help deal with this kind of weird **** [16:16:41] akosiaris, Oh! scb1002 *is* doing something [16:16:52] It's behaving well because I restarted the celery service there recently [16:16:55] Try scb1003 [16:17:03] ok looking into scb1003 [16:17:08] please don't restart anything there [16:17:29] right. Check out the runtime of those celery processes. [16:17:36] These are have not been restarted in hours! [16:18:45] stuck in a read syscall [16:19:10] on a pipe ? [16:19:27] strace -p 8712 [16:19:27] Process 8712 attached [16:19:27] read(161, [16:19:49] and 161 is ls -l /proc/8712/fd/161 [16:19:49] lr-x------ 1 www-data www-data 64 Jun 23 16:18 /proc/8712/fd/161 -> pipe:[18371897] [16:19:53] hmm [16:19:56] akosiaris, I was thinking this was a redis connection issue for reading from the celery queue. Does that jibe? 
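A hedged Python equivalent of the /proc spelunking above: list what each open file descriptor of a worker process points to, so a blocking read() can be matched to a pipe versus the redis socket. Linux-only, run as root or the process owner; the PID comes from ps/strace as in the log.

    import os
    import sys

    def describe_fds(pid):
        fd_dir = "/proc/%d/fd" % pid
        for fd in sorted(os.listdir(fd_dir), key=int):
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd closed between listdir() and readlink()
            print("fd %4s -> %s" % (fd, target))  # e.g. pipe:[18371897] or socket:[18295540]

    if __name__ == "__main__":
        describe_fds(int(sys.argv[1]))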
[16:20:00] (03CR) 10RobH: [C: 032] Daniel Worley shell access request [puppet] - 10https://gerrit.wikimedia.org/r/360988 (owner: 10RobH) [16:20:07] no it would be socket, not a pipe [16:20:14] Oh. Wat [16:20:38] 1 -> socket:[18356051] [16:20:44] that's probably the redis connection [16:20:50] lemme make sure [16:20:54] I need to head to the airport soon. [16:21:08] I'm going to go pack up and get ready to leave my hotel. [16:21:17] But I'll check back here every few minutes. [16:22:36] so, no the redis connection is 162 -> socket:[18295540] [16:22:45] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Access request for Daniel Worley to analytics / hadoop - https://phabricator.wikimedia.org/T168439#3375139 (10RobH) 05Open>03Resolved a:05RobH>03None No objections were noted & all other steps accomplished, so I've merged this access live. It'... [16:22:50] but definitely not what celery is stuck on [16:23:39] celery 8712 www-data 161r FIFO 0,10 0t0 18371897 pipe [16:23:43] so stuck on a FIFO [16:27:44] ok so it's that celery is stuck I think [16:28:23] that pipe is shared between a gazilion celery processes [16:29:42] akosiaris: CP is disabled because of ^ ? [16:29:49] mobrovac: yes [16:32:05] (03CR) 10Dzahn: [C: 031] "i see "tendril.wikimedia.org is an alias for dbmonitor1001.wikimedia.org.", so seems already switched and good to go" [puppet] - 10https://gerrit.wikimedia.org/r/361047 (https://phabricator.wikimedia.org/T149557) (owner: 10Jcrespo) [16:32:33] (03PS2) 10Jcrespo: dbmonitor: Remove tendril role from einstenium and tegmen [puppet] - 10https://gerrit.wikimedia.org/r/361047 (https://phabricator.wikimedia.org/T149557) [16:32:41] akosiaris: we've also tried to increase celery's logging with not a lot of success, I changed the level on scb1001 service file but didn't see increased logging in either app.log or journalctl [16:32:47] that was before the codfw switch [16:32:54] mobrovac, I think we can probably re-enable CP for ORES [16:33:18] codfw seems to be happy [16:33:55] FYI: scb1002 continues to happily process requests [16:34:03] in eqiad [16:34:09] akosiaris: ^ [16:34:20] !log restart celery ores worker on scb1003 [16:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:10] halfak: so I think I am narrowing down on a root cause [16:35:20] big words... [16:35:25] so let actually say what I found [16:35:50] * halfak reads intently :) [16:35:51] 32582, 32659, 32663, 32664, 32666 and some other processes in scb1004 [16:35:58] are stuck in an endless loop [16:36:04] they are consuming a ton of CPU [16:36:10] but I see nothing in strace [16:36:32] which means they are probably stuck doing something in user space [16:36:42] probably the endless loop I am thining [16:36:49] I am gonna gdb on one of them [16:36:54] and stacktrace it [16:37:24] scb1003 is back to producing scores [16:37:38] but!!! scb1001 is no longer [16:37:53] let me verify the symptoms are same over there [16:37:55] Yeah. I did do a manual restart on scb1001 earlier and it was short lived [16:38:04] But scb1002 has continued to be happy for a long time [16:38:13] scb1002 got restarted after the codfw switch [16:38:27] But it had a big queue to process and proceeded happily. [16:38:47] yeah 25580 on scb1001 exhibits the same symptoms I saw on scb1004 and scb1003 [16:39:07] I am more and more thinking endless loop... [16:39:14] gotcha. We've not seen this type of behavior before. 
[16:39:22] * akosiaris installing gdb [16:39:22] Let's see if codfw nodes show the same behavior [16:39:41] Because they *should* if it's user space, right? [16:40:12] Oh! scb2006 seems to have stopped processing requests [16:40:13] if the root cause is some external request, yes they should exhibiting problems [16:40:15] aaah [16:40:16] hmmm [16:40:18] looking [16:40:46] Sure enough, it seems like we're using a lot of CPU on that machine. [16:40:57] same symptoms [16:40:59] * halfak wonders what changed in the last couple of weeks. [16:41:03] akosiaris@scb2006:~$ sudo strace -p 30805 [16:41:03] Process 30805 attached [16:41:06] but no output [16:41:25] and the process is in Rl state and consuming 82.4% of cpu currently [16:41:44] So codfw switch is unlikely to be a long-lived solution [16:42:03] Did we do restarts of scb machines for the recent kernel issues? [16:42:30] akosiaris: why is CP up only on scb1001? [16:42:43] mobrovac: my guess is puppet run [16:42:53] halfak: yes, but not on scb1004 [16:43:00] so it can't be that [16:43:07] akosiaris: ok then i'll re-enable everywhere [16:43:15] ah yes that would be me (cp on scb1001, because of puppet run) [16:43:18] mobrovac: can you actually delay that please [16:43:20] ? [16:43:33] yeah sorry mobrovac I misspoke [16:43:38] Based on what akosiaris has learned [16:43:41] Sorry for the confusion [16:43:42] I 'd like to not have CP added to the noise while gdbing [16:43:58] akosiaris: ok i'll disable cp on scb1001 then [16:44:24] thanks [16:44:36] hm puppet isn't disabled on scb1001 [16:44:44] !log scb1001 disabling puppet [16:44:48] (03CR) 10Jcrespo: [C: 032] dbmonitor: Remove tendril role from einstenium and tegmen [puppet] - 10https://gerrit.wikimedia.org/r/361047 (https://phabricator.wikimedia.org/T149557) (owner: 10Jcrespo) [16:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:49] mobrovac: yeah my bad, I've reenabled puppet on scb1001 after changing celery loglevel [16:45:57] (03PS1) 10Framawiki: Planet-fr: Add frwiki Wikimag bulletin [puppet] - 10https://gerrit.wikimedia.org/r/361078 (https://phabricator.wikimedia.org/T168005) [16:46:00] oh ok go [16:46:07] of course puppet agent --disable doesn't say anything when puppet is already disabled [16:46:12] damn [16:46:16] 111 stack frames [16:46:25] #0 is _PyUnicode_ToLowercase [16:46:31] that's not good [16:46:50] please tell me we did not stumble across a python utf8 bug :-( [16:47:08] lol damn it. [16:47:52] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:50:54] akosiaris: ok for me to deploy a cp config change that would allow it to process other updates but not ores? [16:51:07] Oh wow. cp is totally down? [16:51:13] then revert later [16:51:20] yup halfak, alex disabled it [16:51:24] gotcha. [16:51:29] mobrovac: yeah, its fine [16:51:32] kk [16:51:51] akosiaris mobrovac halfak one note, since ores extension now does precaching for both goodfaith and damaging models, it's safe to disable CP for around ten wikis [16:52:33] kk Amir1, we can deal with this once the outage is over :) [16:53:07] yeah sure, I keep repeating it so we don't forget [16:54:53] Looks like ORES in codfw is having the same issue. 
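One way to capture the kind of backtrace being pulled by hand above is gdb's embedded Python API in batch mode. This is a sketch, not what was run during the incident: readable frames need the interpreter's debug symbols installed, and py-bt only works if gdb's libpython helpers are loaded, hence the fallback.

    # dump_bt.py -- run as:  gdb -p <celery-worker-pid> -batch -x dump_bt.py
    import gdb  # only importable from inside gdb

    gdb.execute("set pagination off")
    print(gdb.execute("bt 40", to_string=True))        # C-level frames (e.g. sre_lib.h)
    try:
        print(gdb.execute("py-bt", to_string=True))    # Python-level frames, if available
    except gdb.error:
        print("py-bt not available (libpython gdb helpers not loaded)")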
[16:55:23] PROBLEM - HTTPS-tendril on einsteinium is CRITICAL: SSL CRITICAL - failed to verify tendril.wikimedia.org against icinga.wikimedia.org [16:55:26] (03PS1) 10RobH: stat1005 install module update [puppet] - 10https://gerrit.wikimedia.org/r/361079 [16:55:35] Yup. we are fully down again [16:55:45] akosiaris, ^ [16:55:54] I'm just about to leave for the airport >:( [16:56:28] I can be here for some time [16:56:31] ok [16:56:34] We got about an hour of uptime from the CODFW switch [16:56:35] but you are a better help than me [16:56:35] I am running a bit crazy [16:56:44] (03PS2) 10RobH: stat1005 install module update [puppet] - 10https://gerrit.wikimedia.org/r/361079 [16:56:53] (03CR) 10RobH: [C: 032] stat1005 install module update [puppet] - 10https://gerrit.wikimedia.org/r/361079 (owner: 10RobH) [16:56:55] looks like a look in Modules/sre_lib.h ? [16:57:16] weird [16:57:31] Something happened at 1345 UTC that started all this. [16:59:14] perhaps a revision with a weird set of chars or something? (if we are still on the utf8 route) [16:59:17] this can't be [16:59:26] it's looping over sre_lib.h [16:59:37] mobrovac: yeah that's the only logical route I can think of [16:59:56] !log mobrovac@tin Started deploy [changeprop/deploy@1f45fae]: Temporary disable ORES (ongoing outage) [17:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:16] I would suggest to disable ores review tool in enwiki/wikidatawiki as most of hits are from there [17:00:40] but ores servers under-performing for another reason and it's not DoS [17:01:06] Amir1, I don't think that's going to help anything [17:01:12] RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational [17:01:12] RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational [17:01:13] RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational [17:01:16] !log mobrovac@tin Finished deploy [changeprop/deploy@1f45fae]: Temporary disable ORES (ongoing outage) (duration: 01m 19s) [17:01:22] RECOVERY - Check systemd state on scb2004 is OK: OK - running: The system is fully operational [17:01:22] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational [17:01:22] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational [17:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:32] RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational [17:01:52] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [17:02:02] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [17:02:05] yeah I agree but I can't think of anything else [17:02:18] halfak: go, I don't want you to lose the plane [17:02:26] I like akosiaris' trajectory [17:02:29] oh yeah! [17:02:31] * halfak runs [17:02:42] I'll log back on to check in once I get to sit down [17:02:43] o/ [17:02:51] <3 you all. Thanks for your hard work on this :) [17:07:21] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3375220 (10RobH) [17:07:44] (03CR) 10Chad: "I have no desire to break gerrit." 
[puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [17:10:07] akosiaris: what can I do to help? I'm running of ideas/things to do [17:12:25] Amir1: how are your gdb skills ? [17:12:36] I could use some help to figure out what's going on [17:12:44] even if it's just to bounce ideas back and forth [17:12:53] akosiaris: I'm here too if that helps [17:12:58] I don't know much [17:13:19] so I got gdb running with debug symbols on scb1004 [17:13:28] and source download so I know everyline that runs [17:13:31] and guess what [17:13:38] I never get to reach python code [17:13:45] I knee deep in C code [17:14:01] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3375251 (10jcrespo) So the switchover has happened- for future reference: 1) failover the dns 2) enable LE on puppet 3) run puppet. Becau... [17:14:24] have we recently changed python version? updated to a newer version? [17:14:35] ah, where you able to decode further the frames from gdb? [17:14:49] godog: yes I can actually see the code now [17:15:15] godog: I am looping through sre_lib.h as it seems [17:15:56] I think the lowest line I can see is 517 [17:15:57] sigh, how does the whole stack look like? [17:16:23] lol [17:16:26] lemme paste it [17:16:47] https://p.defau.lt/?Uqrabdz8bSQGl1EOdQ9i6g [17:16:56] 213 frames currently [17:17:49] for the love of [17:17:57] aren't .h files supposed to be header files ? [17:18:04] why is this file full of C code ? [17:18:19] maybe Amir1 is on to something about python version [17:18:55] nope [17:19:00] last update on Ct 2016 [17:19:04] Oct 2016* [17:19:12] so scratch that out [17:19:17] heh in terms of software changes on scb I can only think of recent upgrades/reboots [17:19:44] it's definitely not related to hardware [17:19:55] and I am not touching kernel space in gdb so it's not that [17:20:06] in fact I don't seem to be getting ever out of sre_lib.h [17:20:20] akosiaris: yeah not a whole lot of help from the full stack frame [17:21:01] time to start looking for known bugs [17:21:23] functions in that file are huge [17:22:11] maybe if we could fine the input to reproduce that ... [17:22:14] find* [17:22:46] I'm poking around in scb1002, there seems to be only one celery process there spinning the cpu [17:23:02] yeah that's why I am saying it's input related [17:23:22] something in re.search() ? re.match() ? [17:24:07] (gdb) finish [17:24:07] Run till exit from #0 sre_ucs1_match.lto_priv () at ../Modules/sre_lib.h:517 [17:24:11] so... 
stuck in there [17:24:19] at least we don't have to jump between files [17:24:22] PROBLEM - HHVM rendering on mw2199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:40] lol [17:25:12] RECOVERY - HHVM rendering on mw2199 is OK: HTTP OK: HTTP/1.1 200 OK - 79767 bytes in 0.295 second response time [17:27:01] ah [17:27:15] so if I force a return from the function [17:27:30] the processes completes successfully and returns to normal operations [17:27:52] but somehow this is killing the inter celery IPC [17:28:02] seems like it's being done over a pipe [17:28:49] for example after getting it unstuck from the busy loop [17:28:59] 32582 is now stuck on read(13,) [17:29:11] with that being /proc/32582/fd/13 -> pipe:[631237833] [17:29:31] and sudo lsof |grep 631237833 spewing out a ton of processes [17:30:02] anyway, my current hypothesis is some celery children get stuck on that endless loop [17:30:16] and somehow the celery IPC gets screwed up [17:30:28] resulting in practically no process actually working [17:31:14] ack, seems reasonable to me [17:31:51] so that's almost surely user input related [17:32:03] something in the input for celery makes it go caput [17:32:05] akosiaris, is that a regexp issue? fwiw, maybe an unrelated comment, but, in parsoid, we have run into a few instances of bad regexps ... with a pattern like (..+|..)* [17:32:15] that cause pathological backtracking [17:32:20] on certain inputs [17:32:29] there was also a libc update that came with the kernel update, perhaps related? [17:32:33] fixed that by changing regexps to (..|..)* .. that is removing the "+" [17:33:02] PROBLEM - pdfrender on scb1004 is CRITICAL: connect to address 10.64.48.29 and port 5252: Connection refused [17:33:03] subbu: it's definitely re related. We had a similar issues last year with perl res and utf8 [17:33:25] I thought of pathological backtracing too [17:33:29] that's me ^... stopped the wrong thing [17:33:30] so, that is a good low hanging fruit to check for .. if there is a regexp that is of that form. [17:34:02] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [17:34:03] 10Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3375285 (10RobH) p:05Low>03Normal a:05RobH>03Ottomata [17:34:29] so https://github.com/python/cpython/blob/v3.4.2/Modules/sre_lib.h#L493 [17:34:36] that's must be re.match(), right ? [17:34:57] 10Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3264256 (10RobH) System is installed, puppet/salt accepted, and ready to have software/services implemented. Assigning to @Ottomata for implementation. [17:35:11] sounds like it akosiaris [17:35:20] Amir1, https://github.com/wikimedia/parsoid/commit/9f02169f546fba140c4f01e5e7a9c6eee818f112 is what we had to fix in parsoid. [17:35:32] 10Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3375291 (10RobH) a:05Ottomata>03RobH Turns out needs jessie, taking back for reimage. [17:35:42] thanks [17:35:45] we ran into that pattern a few times before we caught them all. [17:35:56] the problem in ores is that it has tons of regexes in it [17:36:28] and disabling all regexes will take days [17:36:38] you need to rebuild all models, etc. 
[17:36:50] (03PS1) 10RobH: stat1005 needs jessie [puppet] - 10https://gerrit.wikimedia.org/r/361082 [17:36:51] If I know the exact regex, I can do something [17:37:07] i am suggesting that as one possible place to audit, not necessarily disable something. [17:37:19] Generally we shouldn't get stuck on something like this anyway. We should get timeouts. [17:38:07] Amir1, check on the new regexes for huwiki [17:38:08] anyway, take it fwiw ..i paged in and watched the discussion and noticed that akosiaris found a stack trace that looked like a regexp and got reminded of what we had run into. [17:38:11] (03CR) 10RobH: [C: 032] stat1005 needs jessie [puppet] - 10https://gerrit.wikimedia.org/r/361082 (owner: 10RobH) [17:38:15] They are all pretty simple [17:38:21] subbu: no that's actually quite helpful [17:38:25] thanks for the input [17:38:32] I was thinking more of a unicode bug [17:38:42] but that was my skewed bias from prior experience [17:38:44] Yeah. Was in lower, right? [17:39:15] We could write a script to try to lower all Unicode chars and see where it gets stuck [17:39:23] damn ... how do I print a Py_UCS4 in gdb [17:39:34] it's the pattern according to gdb [17:39:37] but it says 16 [17:39:40] not really useful [17:39:45] halfak|Mobile: isn't it related to the fart thing we did? [17:39:51] (gdb) p pattern [17:39:52] $5 = (Py_UCS4 *) 0x12b6f4d8 [17:39:52] (gdb) p *pattern [17:39:52] $6 = 16 [17:40:12] Right Amir1 [17:40:33] Those are the newest regexes [17:41:37] why it triggers such a huge down time after a week is still strange to me [17:42:27] typedef uint32_t Py_UCS4; [17:42:32] so it actually means that? 16 ? [17:42:37] I think so [17:42:41] I don' [17:42:55] akosiaris: hm - https://github.com/python/cpython/blob/6f0eb93183519024cb360162bdd81b9faec97ba6/Objects/unicodectype.c#L46 [17:42:58] Agreed, Amir [17:42:59] not easy to print that [17:43:12] First thing I want to figure out is why timeouts aren't working [17:43:15] I don't know about all the encoding of larger codepoints, but I'd assume "16" is straight up ascii 0x10 equivalent like UTF-8? [17:43:18] http://www.fileformat.info/info/unicode/char/0016/index.htm [17:43:29] ? [17:44:00] that's hex-16, I assume your 16 is in decimal [17:44:10] http://www.fileformat.info/info/unicode/char/0010/index.htm [17:44:13] yeah [17:44:27] so somehow the pattern is that ? [17:44:35] I must be missing something [17:44:44] ah.. it's the first char of the string [17:46:01] as in python internally uses the old DLE control char as a string start marker? [17:46:07] print (char *)pattern [17:46:07] $10 = 0x12b6f4d8 "\020" [17:51:03] damn I hate reading regular expression code... it's always full of gotos [17:51:13] 10Operations, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3264256 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by robh on neodymium.eqiad.wmnet for hosts: ``` ['stat1005.eq... [17:52:10] not to mention the fact it's always optimized by the compiler and line number are reorganized [17:53:17] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3375326 (10Dzahn) packages: It installed php5-mysql Apache modules: several but i checked all of them are ALSO needed by Icinga, so no r... [17:55:50] We should probably blog about this when we get it worked out. 
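On the "we should get timeouts" point above: Celery has built-in task time limits that would bound this kind of runaway worker. The sketch below is illustrative, not ORES's actual configuration (setting names follow newer Celery releases; the app, broker URL and scoring function are stand-ins). Notably, the hard limit still works when a worker is stuck inside a C-level regex loop, because the parent process kills the child; the soft limit is delivered as an in-process exception and cannot fire until the C call returns.

    from celery import Celery
    from celery.exceptions import SoftTimeLimitExceeded

    app = Celery("ores_sketch", broker="redis://localhost:6379/0")
    app.conf.task_soft_time_limit = 15   # raise SoftTimeLimitExceeded inside the task
    app.conf.task_time_limit = 30        # hard-kill the worker child after 30 seconds

    def run_feature_extraction(rev_id):
        # Stand-in for the real feature-extraction / model-scoring step.
        return {"rev_id": rev_id, "score": None}

    @app.task
    def score_revision(rev_id):
        try:
            return run_feature_extraction(rev_id)
        except SoftTimeLimitExceeded:
            return {"rev_id": rev_id, "error": "timed out"}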
[17:56:25] akosiaris: Amir1: let Pchelolo know if it will be safe to re-enable ores rules in CP, I have to run now and will be back later [17:56:36] okay [17:59:14] ok [18:03:37] (03PS1) 10Framawiki: Enable Quiz extension on huwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361084 (https://phabricator.wikimedia.org/T168471) [18:04:36] (03PS1) 10RobH: stat1006 install params [puppet] - 10https://gerrit.wikimedia.org/r/361085 [18:05:12] (03CR) 10RobH: [C: 032] stat1006 install params [puppet] - 10https://gerrit.wikimedia.org/r/361085 (owner: 10RobH) [18:06:37] I'm semi-afk for dinner [18:06:42] ping me if I can do anything [18:09:55] PROBLEM - Check systemd state on install1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:10:22] bleh, i broke dhcp on installers, fixing [18:12:15] (03PS1) 10RobH: fixing typo in dhcpd lease file [puppet] - 10https://gerrit.wikimedia.org/r/361086 [18:12:40] (03CR) 10RobH: [C: 032] fixing typo in dhcpd lease file [puppet] - 10https://gerrit.wikimedia.org/r/361086 (owner: 10RobH) [18:15:55] RECOVERY - Check systemd state on install1002 is OK: OK - running: The system is fully operational [18:16:06] akosiaris: I tried gdb'ing a few celeries on scb2001 and then going up the stack until sre_search [18:16:41] interestingly print (char *)state->beginning [18:16:47] 10Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3375411 (10RobH) p:05Low>03Normal [18:16:51] prints always the same string, regardless of the process [18:17:27] I think part of this edit https://es.wikipedia.org/w/index.php?title=Relieve_terrestre&type=revision&diff=100032810&oldid=100032572 [18:17:53] interesting [18:18:43] maybe the unbalanced { greatly confuses things? [18:19:05] the edit happened today so that might be it [18:19:20] might ? I think you on to something ! [18:19:46] heheh I don't know how to bypass that edit tho, Amir1 halfak|Mobile [18:20:43] godog: what do you mean by bypassing? [18:21:13] I mean making sure ores doesn't see it or otherwise doesn't process it [18:21:21] varnish? [18:21:34] change prop is behind varnish though [18:21:56] I don't know how to make CP don't touch the edit [18:22:09] Amir1: no I think he means being able to blacklist a revision [18:22:17] fully skip it [18:22:25] Put something in the cache for it [18:22:34] akosiaris: we don't have that functionality for now [18:22:40] halfak|Mobile: that's a good idea [18:22:43] actually halfak|Mobile had a nice idea [18:22:46] :-) [18:22:55] Just got through security [18:22:56] Amir1: can you do that ? [18:23:03] akosiaris: yeah sure [18:23:10] nice idea! [18:23:11] I 'll try in the meantime to reproduce the issue [18:23:57] I want to directly put it into redis [18:23:58] Just realized I've eaten nothing but coffee today... [18:24:59] :(((( [18:25:04] hugs dude [18:26:19] (03PS5) 10Herron: Add logrotate template to retain 60 days of exim mx logs [puppet] - 10https://gerrit.wikimedia.org/r/357723 (https://phabricator.wikimedia.org/T167333) [18:29:37] Amir1: any joy with putting that revision in cache? 
[18:29:46] I'm connected [18:30:03] trying to finish it, I need to get some samples first [18:30:12] ack, let me know if I can help [18:32:25] (03PS1) 10Dzahn: planet: give higher priority to English Apache site [puppet] - 10https://gerrit.wikimedia.org/r/361087 [18:32:40] godog akosiaris added a key and value for revision 100032810 in enwiki with some value [18:32:46] for goodfaith and damaging [18:32:59] it's in eqiad. Shall I do it for codfw? [18:33:12] yes please [18:33:25] yup [18:33:25] I don't know how the ores cache works, though the revid I see in the logs is 100032572 [18:33:37] (03CR) 10jerkins-bot: [V: 04-1] planet: give higher priority to English Apache site [puppet] - 10https://gerrit.wikimedia.org/r/361087 (owner: 10Dzahn) [18:33:40] (03PS2) 10Dzahn: planet: give higher priority to English Apache site [puppet] - 10https://gerrit.wikimedia.org/r/361087 [18:33:52] well I guess we could do both [18:33:53] (03PS2) 10Dzahn: icinga: add plugin to check exim queue sizes [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) [18:33:59] I'm doing both [18:34:00] Amir1: 100032810 ? [18:34:07] that's the new, no ? [18:34:16] from the link above [18:34:17] https://es.wikipedia.org/w/index.php?title=Relieve_terrestre&type=revision&diff=100032810&oldid=100032572 [18:34:26] ah you mean both revisions ? [18:34:26] ok [18:34:28] wait, that's eswiki [18:34:40] oh you added enwiki ? ;-) [18:34:56] Sorry [18:34:56] (03CR) 10jerkins-bot: [V: 04-1] planet: give higher priority to English Apache site [puppet] - 10https://gerrit.wikimedia.org/r/361087 (owner: 10Dzahn) [18:35:02] no worries [18:35:09] if it works we are having a party [18:35:25] * akosiaris still trying to reproduce this, I am failing for now [18:35:40] heheh yeah we should have all the parties if it works [18:36:04] eswiki has only one model [18:37:19] 10Operations, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3375431 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['stat1005.eqiad.wmnet'] ``` and were **ALL** successful. [18:37:49] added for only reverted models in both dc s [18:38:31] Amir1: ack, I'll roll restart celery in codfw [18:39:14] !log roll restart celery-ores-worker in codfw [18:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:48] godog: make sure they are actually dead [18:40:54] they would not die on me before [18:41:24] theoretically systemd should avoid that, but somehow I had to issue the command twitch [18:41:27] twice* [18:41:39] * halfak|Mobile watches on with interest [18:41:45] yeah good call, it takes a while and systemd I think resorts to sigkill [18:41:52] Once it's up there, I'm going to request them and they need to return 0.927146989310164 otherwise it means it's not hitting the cache [18:42:38] Amir1: can you try that rev on wmflabs ores? 
[18:42:58] Maybe staging instead [18:43:04] :D [18:43:11] yeah, don't want a dead ores in labs too [18:43:16] Lol [18:43:52] Amir1: roll restart completed [18:44:14] halfak|Mobile: staging is not answering to that request [18:44:23] it times out though [18:44:42] \o/ [18:44:54] A weird thing to celebrate [18:45:15] when I try it in prod, it errors that it can't understand the json from cache which is fine [18:45:16] https://ores.wikimedia.org/v2/scores/eswiki/reverted/100032572 [18:45:25] because it hits the fucking cache [18:45:59] for the record I did oresrdb.svc.codfw.wmnet:6380> MSET ores:eswiki:reverted:100032572:0.3.0 "{\"prediction\": false, \"probability\": {\"false\": 0.927146989310164, \"true\": 0.072853010OK9836}}" [18:46:02] Only problem is that changeprop will probably retry [18:46:10] No big deal [18:47:19] scb cpu under control so far [18:47:21] delimiter: line 1 column 86 (char [18:47:36] but that does not matter much [18:47:57] if we actually shortcircuited the problem I am already happy [18:48:21] I'm checking grafana before opening up the champagne [18:48:50] ah we will open the actual champagne when we reproduce and get a CVE for that :-) [18:49:08] and of course this is blog post material [18:49:25] but let's actually fix the problem at hand for sure, have a nice weekend and reproduce on Monday [18:49:48] six minutes without overload [18:49:52] heheh "processing input from the internet considered harmful" [18:49:58] (03PS3) 10Dzahn: planet: give higher priority to English Apache site [puppet] - 10https://gerrit.wikimedia.org/r/361087 [18:50:12] 10Operations, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3375476 (10RobH) a:05RobH>03Ottomata reimage complete, system ready for you to add to site.pp for specific roles! [18:50:13] seven [18:50:42] there are various celery worker processes on scb boxes that in 100% CPU [18:50:48] but leftovers [18:50:56] times are 15:57, 16:43, 14:27 [18:51:05] godog: that's what I meant ^ [18:51:10] look at scb1001 for example [18:51:17] ps auxww |grep celery | grep 'Rl' [18:51:26] www-data 25779 99.8 4.1 2594976 1352636 ? Rl 15:57 172:2 [18:51:33] and another 5 like that [18:51:42] eight [18:51:49] (03CR) 10Dzahn: [C: 032] planet: give higher priority to English Apache site [puppet] - 10https://gerrit.wikimedia.org/r/361087 (owner: 10Dzahn) [18:51:54] (03PS4) 10Dzahn: planet: give higher priority to English Apache site [puppet] - 10https://gerrit.wikimedia.org/r/361087 [18:52:38] nine [18:53:06] (03CR) 10Dzahn: [C: 031] "> In trusty and jessie it gets installed as part of the d-i base install" [puppet] - 10https://gerrit.wikimedia.org/r/360789 (owner: 10Dzahn) [18:53:19] akosiaris: in codfw celery processes seem all freshly restarted tho afaics [18:53:37] I did cumin -b2 'scb2*' 'systemctl restart celery-ores-worker' [18:54:24] godog: have you also restarted eqiad ? [18:54:36] or not ? cause if not, it's my misunderstanding [18:54:42] (gdb) print self->pattern->ob_type [18:54:42] $34 = (struct _typeobject *) 0xa02b00 [18:54:43] grrr [18:54:47] akosiaris: no I thought you'd be still looking [18:54:50] still trying to figure out the pattern [18:54:58] godog: actually... yeah don't... [18:55:08] let's not lose our guinea pig [18:55:10] good idea [18:55:12] I wish I was so my desk [18:55:17] *at [18:55:33] halfak|Mobile: don't worry, have a safe flight [18:55:41] Still aiming traffic at codfw, right? 
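The MSET pasted above got mangled by an interleaved redis "OK" reply, which is why ORES later complained about the JSON delimiter. A cleaned-up sketch of the same pre-seeding trick, using the key format visible in the log ("ores:<wiki>:<model>:<rev_id>:<model version>") on the score-cache instance; the score values are deliberately made up, and whether ORES accepts any well-formed JSON here is an assumption.

    import json
    import redis

    cache = redis.StrictRedis(host="oresrdb.svc.codfw.wmnet", port=6380)  # score cache, per the log

    key = "ores:eswiki:reverted:100032572:0.3.0"
    placeholder = {"prediction": False,
                   "probability": {"false": 0.93, "true": 0.07}}  # illustrative values
    cache.set(key, json.dumps(placeholder))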
[18:55:48] we have twelve minutes without overload [18:55:59] We made it 50 mins before [18:56:06] 10Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3375499 (10RobH) [18:56:47] My first test is going to be "why didn't this time out like it was supposed to" [18:56:51] that was a whole new datacenter, by one restart we never made more than five minutes [18:56:52] (03PS2) 10Dzahn: base: add psmisc to standard_packages [puppet] - 10https://gerrit.wikimedia.org/r/360789 [18:57:08] https://github.com/python/cpython/blob/master/Include/object.h#L346 .... how an earth am I supposed to get the pattern out of that ? [18:57:16] halfak|Mobile: we need to run lots of tests on that revision [18:57:40] that's like a struct with 50 members most of which are damned structs [18:57:46] structs all the way down [18:57:47] one thing is that one experimental node indeed times out but maybe the worker doesn't understand it or celery [18:58:56] (03CR) 10Dzahn: [C: 032] "this is already installed on all jessie and trusty, it will just add it on stretch systems that would otherwise be missing these commands " [puppet] - 10https://gerrit.wikimedia.org/r/360789 (owner: 10Dzahn) [18:58:58] akosiaris: lol, the rabbit hole has infinite depth [18:59:47] (03CR) 10Herron: "Added puppet header, changed tabs to spaces and changed to file type. Here are the compiler results after these updates https://puppet-co" [puppet] - 10https://gerrit.wikimedia.org/r/357723 (https://phabricator.wikimedia.org/T167333) (owner: 10Herron) [19:01:47] (03CR) 10Dzahn: [C: 031] Add logrotate template to retain 60 days of exim mx logs [puppet] - 10https://gerrit.wikimedia.org/r/357723 (https://phabricator.wikimedia.org/T167333) (owner: 10Herron) [19:02:56] I'm going to be afk, I'll be back in fifteen minutes [19:03:08] ok, we 'll keep an eye [19:03:33] I can see 500 only for that url now and since 18:40, nice [19:03:41] that url == https://ores.wikimedia.org/scores/eswiki/?revids=100032572&models=reverted [19:04:14] I'll be afk too for a bit, dinner [19:04:22] yeah, because the json I put for cache is not valid but who cares [19:05:32] (03PS2) 10Dzahn: Planet-fr: Add frwiki Wikimag bulletin [puppet] - 10https://gerrit.wikimedia.org/r/361078 (https://phabricator.wikimedia.org/T168005) (owner: 10Framawiki) [19:19:31] no overloads for 40 minutes [19:20:19] and the CPU usage on scb2xxx is very good [19:27:12] indeed! I think we can call it a victory [19:27:32] * godog jinxes it and runs [19:28:43] lol [19:29:14] I am gonna restart the celery workers on scb1001, scb1002, scb1003. I 'll leave scb1004 out for now [19:29:20] ack [19:29:23] I am still gdbing there [19:29:57] no joy yet with the pattern heh? 
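The leftover workers pegging a CPU were found above with ps auxww | grep celery | grep 'Rl'; here is a psutil-based sketch of the same check (psutil is a third-party library; the CPU threshold and one-second sampling window are arbitrary).

    import time
    import psutil  # third-party

    def spinning_celery_workers(cpu_threshold=90.0):
        workers = [p for p in psutil.process_iter(attrs=["cmdline"])
                   if "celery" in " ".join(p.info.get("cmdline") or [])]
        for p in workers:
            p.cpu_percent(None)        # prime the per-process CPU counters
        time.sleep(1.0)
        for p in workers:
            try:
                if p.status() == psutil.STATUS_RUNNING and p.cpu_percent(None) >= cpu_threshold:
                    print(p.pid, time.ctime(p.create_time()), " ".join(p.info.get("cmdline") or []))
            except psutil.NoSuchProcess:
                continue

    if __name__ == "__main__":
        spinning_celery_workers()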
[19:30:58] (03CR) 10Alexandros Kosiaris: [C: 031] Change wikimedia.org SPF record to soft fail (~all) [dns] - 10https://gerrit.wikimedia.org/r/358132 (https://phabricator.wikimedia.org/T133191) (owner: 10Herron) [19:31:26] (03PS3) 10Dzahn: Planet-fr: Add frwiki Wikimag bulletin [puppet] - 10https://gerrit.wikimedia.org/r/361078 (https://phabricator.wikimedia.org/T168005) (owner: 10Framawiki) [19:31:38] I'm off, have my phone with me though [19:34:39] !log restart celery-ores-workers on scb1001, scb1002, scb1003, leave scb1004 alone [19:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:01] (03CR) 10Dzahn: [C: 032] Planet-fr: Add frwiki Wikimag bulletin [puppet] - 10https://gerrit.wikimedia.org/r/361078 (https://phabricator.wikimedia.org/T168005) (owner: 10Framawiki) [19:35:13] Pchelolo: can you please undo the ORES disabling in changeprop ? [19:35:20] Nice [19:35:36] akosiaris: sure, will deploy shortly [19:35:47] Pchelolo: thanks! [19:37:13] !log ppchelko@tin Started deploy [changeprop/deploy@ffabd13]: Re-enable ORES rules back [19:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:56] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:38:20] !log ppchelko@tin Finished deploy [changeprop/deploy@ffabd13]: Re-enable ORES rules back (duration: 01m 07s) [19:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:14] (03CR) 10Hashar: [C: 031] Set collation for Romanian wikis to uca-ro [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) (owner: 10Strainu) [19:45:56] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [19:51:06] akosiaris: ok to enable puppet on scb back? [19:51:18] mobrovac: yeah I 'll do so now [19:51:22] kk thnx [19:51:35] I 'll leave it disabled on scb1004 though [19:51:39] still gdbing on that one [19:51:43] oh ok [19:53:03] so the problem is that the initial { is not included in the diff? [19:53:14] RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational [19:53:21] we don't know yet.. still trying to reproduce the issue [19:53:29] but it's clear it's that damn change [19:53:48] I am trying to figure out the damn re pattern [19:54:28] akosiaris: I can get you some of regexes that happens on eswiki only [19:54:41] and you can run it locally to see if it's the cause [19:54:44] if that helps [19:55:15] yeah that's an alternate path forward... if I don't get anywhere dumping memory with gdb I 'll ask you to probably [19:55:19] I 'll let you know [19:56:45] Amir1, can you point me to the eswiki regexps .. curious if i see something there. [19:56:59] sure [19:57:02] congrats on the victory. :) [19:57:11] Thank Alex [19:57:26] I was going to get it but saw my facebook, sorry [19:58:09] https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/spanish.py [19:58:14] subbu: akosiaris ^ [20:02:44] what was the edit that caused the problem? [20:03:26] https://es.wikipedia.org/w/index.php?title=Relieve_terrestre&type=revision&diff=100032810&oldid=100032572 subbu [20:03:38] "(?:b+l+a+h*)+", "(l+[uo]+l+)([uo]+l+)*, "j+[eaiou]+(j+[aeiou]*)*" ... [20:04:18] look at that jajajajjaja ... that seems to implicate that last regexp? 
[20:05:36] 'This particular pattern (..+)* has been responsible for several instances of non-terminating regexps in Parsoid' from https://github.com/wikimedia/parsoid/commit/9f02169f546fba140c4f01e5e7a9c6eee818f112 ... [20:05:54] so, my bet is on that one ... but, worth looking at 3 closely. [20:06:09] *at all three [20:09:15] off to a coffee shop. back online later. [20:11:57] akosiaris, Amir1 mobrovac fyi ^^^ [20:12:19] have fun [20:12:28] thnx subbu [20:12:31] oh, i meant: the regexp ;-) [20:12:35] bye for now. [20:27:55] never mind. at least in node.js, that re seems to process pretty quickly. [20:39:15] Amir1, akosiaris mobrovac actually that is exactly it .. i had forgotten the [20:39:18] 'i' flag [20:39:21] it hangs in node.js [20:39:59] https://gist.githubusercontent.com/subbuss/e1387aa11c2318c93c5639cf7e9437d2/raw/e10ccc0ecfe236da1389cc8059782d696e8fdd51/gistfile1.txt [20:41:13] replace "j+[eaiou]+(j+[aeiou]*)*" with "j+[eaiou]+(j[aeiou]*)*" and all is hunky dory [20:42:36] nice find subbu! [20:42:38] sooo subtle [20:43:15] worth auditing all regexps in ores for this pattern and fixing them. [20:43:35] fantastic [20:43:37] Thanks [20:43:42] Yeah, we should [20:43:57] some sort of CI tests also would be great [20:44:33] CI might be difficult since you don't know what patterns might trigger bad regexps. [20:49:38] subbu: nice find! [20:50:25] yeah [20:50:31] just thinking out loud [20:53:37] the problem with (j+)* is that the # of possible match paths for jjjj explodes combinatorially ... so, on failures, there is backtracking along each of those paths. [20:54:14] anyway .. it is easy to make this mistake and we learnt this in parsoid by making the same mistake many times. :) [20:55:43] I can't reproduce it in python yet though [20:56:38] akosiaris, you need the boundary terminators and the i flag .. /(^|\s)(...)(\s|$)/i .. my first attempts missed both and only when I added them both, did it trigger. [20:57:01] but, this is in node.js [20:57:42] ah hmm it's my locale I think [20:58:01] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): logstash mapping mixing up field types - https://phabricator.wikimedia.org/T165137#3375753 (10debt) 05Open>03Invalid Cool, closing as invalid, thanks for taking a look. :) [21:03:28] reproduced in python as well [21:04:09] and the fix works. [21:05:10] https://gist.githubusercontent.com/subbuss/12746448368188ce5c3f72d62de61d24/raw/bc238c7f5e4f59d4aa654811b4aa5dd1f7805811/gistfile1.txt [21:08:28] wait those regexes get compiled into one ? [21:08:53] I was trying with just a=r"j+[eaiou]+(j+[aeiou]*)*" [21:09:18] and I should have been able to reproduce it ... [21:12:30] akosiaris, that fails for me with this as well >>> p = re.compile("(^|\s)j+[eaiou]+(j+[aeiou]*)*(\s|$)", re.IGNORECASE) [21:12:56] hmmm [21:13:04] akosiaris, i assumed they get compiled into one. :) [21:13:20] i expect they are .. haven't read the code. 
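A self-contained reproduction sketch along the lines of subbu's gists above, comparing the implicated pattern with the fix (inner + removed). The test string and the subprocess timeout harness are illustrative: a long run of j's that ultimately fails to match forces the backtracking engine to try every way of splitting the run between the outer and inner j quantifiers.

    import multiprocessing as mp
    import re

    BAD   = r"(^|\s)j+[eaiou]+(j+[aeiou]*)*(\s|$)"  # nested quantifier: j+ inside (...)*
    FIXED = r"(^|\s)j+[eaiou]+(j[aeiou]*)*(\s|$)"   # the fix: drop the inner +
    TEXT  = " ja" + "j" * 64 + "!"                  # trailing "!" guarantees the match fails

    def _search(pattern):
        re.search(pattern, TEXT, re.IGNORECASE)

    def finishes_within(pattern, seconds=2.0):
        p = mp.Process(target=_search, args=(pattern,))
        p.start()
        p.join(seconds)
        if p.is_alive():
            p.terminate()
            p.join()
            return False
        return True

    if __name__ == "__main__":
        print("fixed pattern finishes within 2s:", finishes_within(FIXED))  # True
        print("bad pattern finishes within 2s:  ", finishes_within(BAD))    # False

Auditing the other language files for the same (x+...)* shape, as suggested above, is the more general follow-up.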
[21:15:42] (03PS1) 10Dzahn: planet: add HTML page/item/feed templates for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/361180 [21:16:52] (03CR) 10jerkins-bot: [V: 04-1] planet: add HTML page/item/feed templates for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/361180 (owner: 10Dzahn) [21:19:59] (03PS2) 10Dzahn: planet: add HTML page/item/feed templates for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/361180 (https://phabricator.wikimedia.org/T168490) [21:32:57] (03PS3) 10Dzahn: planet: add HTML page/item/feed templates for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/361180 (https://phabricator.wikimedia.org/T168490) [21:38:34] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 344.17 seconds [21:43:16] (03PS4) 10Dzahn: planet: add HTML page/item/feed templates for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/361180 (https://phabricator.wikimedia.org/T168490) [21:44:16] (03CR) 10jerkins-bot: [V: 04-1] planet: add HTML page/item/feed templates for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/361180 (https://phabricator.wikimedia.org/T168490) (owner: 10Dzahn) [21:46:15] (03PS5) 10Dzahn: planet: add HTML page/item/feed templates for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/361180 (https://phabricator.wikimedia.org/T168490) [21:46:18] 10Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3375927 (10RobH) a:05RobH>03Ottomata [21:46:38] 10Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3264224 (10RobH) Assigned to @Ottomata for implementation. [21:49:03] (03PS6) 10Dzahn: planet: add HTML page/item/feed templates for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/361180 (https://phabricator.wikimedia.org/T168490) [21:49:15] (03CR) 10Dzahn: [C: 032] "finally it's right http://puppet-compiler.wmflabs.org/6846/" [puppet] - 10https://gerrit.wikimedia.org/r/361180 (https://phabricator.wikimedia.org/T168490) (owner: 10Dzahn) [22:06:03] (03PS1) 10Dzahn: planet: fix syntax error, source vs. content with templates [puppet] - 10https://gerrit.wikimedia.org/r/361184 [22:08:23] (03CR) 10Dzahn: [C: 032] planet: fix syntax error, source vs. content with templates [puppet] - 10https://gerrit.wikimedia.org/r/361184 (owner: 10Dzahn) [22:08:28] (03PS2) 10Dzahn: planet: fix syntax error, source vs. 
content with templates [puppet] - 10https://gerrit.wikimedia.org/r/361184 [22:11:34] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 11.11 seconds [22:12:24] PROBLEM - HHVM rendering on mw2186 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:13:14] RECOVERY - HHVM rendering on mw2186 is OK: HTTP OK: HTTP/1.1 200 OK - 79677 bytes in 0.272 second response time [22:26:21] (03PS1) 10Dzahn: planet: fix one more content/template syntax issue [puppet] - 10https://gerrit.wikimedia.org/r/361185 [22:28:07] (03CR) 10Dzahn: [C: 032] planet: fix one more content/template syntax issue [puppet] - 10https://gerrit.wikimedia.org/r/361185 (owner: 10Dzahn) [23:08:14] (03PS3) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) [23:22:02] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/6847/mx1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [23:26:40] (03PS4) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) [23:28:47] (03PS3) 10Dzahn: gerrit: add firewall rule for nrpe (DO NOT MERGE) [puppet] - 10https://gerrit.wikimedia.org/r/351546 (owner: 10Paladox) [23:32:19] (03CR) 10Dzahn: [C: 04-2] "i just changed the commit message to what it is. what was this for? Icinga monitoring via NRPE in labscloud?" [puppet] - 10https://gerrit.wikimedia.org/r/351546 (owner: 10Paladox) [23:32:45] subbu: I 've just verified independently that you were right. I 've dumped the regexp pattern byte by byte using gdb [23:32:48] subbu: https://gist.github.com/akosiaris/69c0b67677372e471d16a652a2b8c9b3 [23:33:03] great. :) [23:33:25] so yeah you solved it nicely. thanks for the help! really appreciated [23:33:37] i submitted this PR earlier this evening .. https://github.com/wiki-ai/revscoring/compare/master...subbuss:master [23:33:56] turns out btw that without the long re I can't reproduce it [23:33:57] enwiki regexp also has some which could potentially trigger [23:34:21] yeah I guess it's extremely possible there are many that could trigger [23:35:06] akosiaris, ya .. when i looked at your stack trace top elt and googled the file, it was regexp code and immediately i had the hunch it was the same thing we had tripped on a few times. [23:35:14] :D [23:35:16] (03Abandoned) 10Paladox: icinga: Remove icinga-init.sh file, its now provided by the package [puppet] - 10https://gerrit.wikimedia.org/r/347899 (owner: 10Paladox) [23:35:27] anyway, have a good evening! :) i am heading out. it is probably late for you. [23:35:35] btw, I 've even figured out that the JAJAJAJAJAJA alone is not enough. it needs a swear word as well [23:35:43] which is also covered in the re [23:36:02] anyway I am off to bed [23:36:08] thanks for the help! have a nice time [23:36:16] good night! yw. [23:42:22] !log bounce celery-ores-worker on scb1004 [23:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:40] o/ [23:42:46] Just landed. [23:43:02] halfak|Mobile: we are ok. relax. go home, have a shower [23:43:13] (03PS1) 10Dzahn: netmon1002: add smokeping role [puppet] - 10https://gerrit.wikimedia.org/r/361191 (https://phabricator.wikimedia.org/T159756) [23:43:20] halfak|Mobile: we recap this on Monday [23:43:22] :-) [23:43:35] Ok cool.
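The audit suggested earlier for the remaining revscoring language patterns (and the note that the enwiki patterns have candidates too) could start with something as blunt as a textual scan for a quantified group whose body already contains a '+', i.e. the (..+)* shape called out in the Parsoid commit referenced above. This is a rough, hypothetical heuristic, not the revscoring test suite, and deliberately crude: a real audit would parse the pattern rather than grep its source, and, as the gdb session above showed, whether a flagged pattern actually hangs still depends on the surrounding anchors, flags, and input.

    # Hypothetical audit heuristic, not revscoring code: flag pattern sources
    # that quantify a group whose body already contains a '+'.
    import re

    # A group with no nested parentheses, containing at least one '+',
    # immediately followed by '*' or '+'. Crude on purpose: it ignores nested
    # and escaped parentheses and cannot prove a flagged pattern is dangerous.
    NESTED_PLUS = re.compile(r"\([^()]*\+[^()]*\)[*+]")

    def looks_catastrophic(pattern_source):
        return bool(NESTED_PLUS.search(pattern_source))

    if __name__ == "__main__":
        samples = [
            r"j+[eaiou]+(j+[aeiou]*)*",   # the eswiki culprit  -> flagged
            r"j+[eaiou]+(j[aeiou]*)*",    # the fix             -> passes
            r"(?:b+l+a+h*)+",             # worth a closer look -> flagged
            r"(l+[uo]+l+)([uo]+l+)*",     # worth a closer look -> flagged
        ]
        for src in samples:
            print("FLAG" if looks_catastrophic(src) else "ok  ", src)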
[23:43:47] I 'll pool eqiad again [23:45:07] (03PS1) 10Alexandros Kosiaris: Revert "hieradata: shift all ores traffic to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/361192 [23:45:43] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "hieradata: shift all ores traffic to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/361192 (owner: 10Alexandros Kosiaris) [23:47:54] (03Draft1) 10Paladox: planet: Update css and templates to be modern look [puppet] - 10https://gerrit.wikimedia.org/r/361190 [23:48:03] (03PS2) 10Paladox: planet: Update css and templates to be modern look [puppet] - 10https://gerrit.wikimedia.org/r/361190 [23:50:56] (03PS3) 10Paladox: planet: Update css and templates to be modern look [puppet] - 10https://gerrit.wikimedia.org/r/361190 [23:52:59] (03PS4) 10Paladox: planet: Update css and templates to be modern look [puppet] - 10https://gerrit.wikimedia.org/r/361190 [23:58:55] (03PS1) 10Foks: Adding font files and license [mediawiki-config/fonts] - 10https://gerrit.wikimedia.org/r/361195 (https://phabricator.wikimedia.org/T168757)