[00:15:45] !log RT (ununpentium) installing pending package upgrades [00:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:51] 10Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#3373808 (10Dzahn) 05Open>03Resolved Since we have the new system running since a couple weeks and there was only positive feedback.. i now disabled the duplicate mails still going to RT. Now they only go to our new Goo... [00:29:35] (03CR) 10Dzahn: [C: 031] "well, that's interesting. it would explain my question from above "and why does it currently work?:)" which i was wondering about since th" [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [00:31:06] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3373812 (10Dzahn) Do you guys know anything about the latest 2 comments on https://gerrit.wikimedia.org/r/#/c/359373/ ? It seems those G... [00:33:46] (03CR) 10Dzahn: "> Last I know from owners, the database change breaks gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [00:42:36] (03CR) 10Dzahn: "it's been >6 months. brought it up multiple times incl. meeting but https://phabricator.wikimedia.org/T143175 is still stalled without co" [puppet] - 10https://gerrit.wikimedia.org/r/324841 (https://phabricator.wikimedia.org/T143175) (owner: 1020after4) [00:48:48] (03CR) 10Dzahn: [C: 031] "> So as such I am not sure what is the point of this patch." [puppet] - 10https://gerrit.wikimedia.org/r/359492 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [01:02:13] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1498179727 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8467889 keys, up 2 minutes 5 seconds - replication_delay is 1498179727 [01:02:13] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1498179728 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 8572415 keys, up 2 minutes 6 seconds - replication_delay is 1498179728 [01:02:14] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:02:43] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:03:13] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8548294 keys, up 3 minutes 6 seconds - replication_delay is 0 [01:03:13] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8449893 keys, up 3 minutes 7 seconds - replication_delay is 0 [01:03:14] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 8544568 keys, up 3 minutes 7 seconds - replication_delay is 0 [01:03:43] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8543807 keys, up 3 minutes 33 seconds - replication_delay is 0 [01:04:43] (03CR) 10Dzahn: [C: 04-1] "no, it would remove purge scripts and break other things, see the diff here:" [puppet] 
- 10https://gerrit.wikimedia.org/r/347899 (owner: 10Paladox) [01:05:45] (03Abandoned) 10Dzahn: contint: Disable hhvm temporarily on stretch [puppet] - 10https://gerrit.wikimedia.org/r/359492 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [01:07:03] (03Abandoned) 10Dzahn: redis: add support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/354041 (owner: 10Paladox) [01:09:17] (03CR) 10Dzahn: [C: 04-2] "please fix commit message and explain what this is all about" [puppet] - 10https://gerrit.wikimedia.org/r/351546 (owner: 10Paladox) [01:11:06] 10Operations, 10Wikidata: 503 error raises again while trying to load a Wikidata page - https://phabricator.wikimedia.org/T140879#3373870 (10Krinkle) [01:12:08] (03CR) 10Dzahn: "since https://gerrit.wikimedia.org/r/#/c/330832/ is merged, what's the status here now?" [puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [02:54:13] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 348 MB (3% inode=75%) [04:12:13] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=602.60 Read Requests/Sec=484.40 Write Requests/Sec=0.80 KBytes Read/Sec=39559.60 KBytes_Written/Sec=18.80 [04:13:14] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=7.20 Read Requests/Sec=20.70 Write Requests/Sec=28.20 KBytes Read/Sec=248.00 KBytes_Written/Sec=962.00 [04:16:13] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=5236.10 Read Requests/Sec=3826.20 Write Requests/Sec=1.90 KBytes Read/Sec=31464.00 KBytes_Written/Sec=788.00 [04:22:42] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Verify that the codfw lvs is configured correctly for Phabricator - https://phabricator.wikimedia.org/T168699#3373930 (10mmodell) [04:23:13] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=2.60 Read Requests/Sec=4.30 Write Requests/Sec=1.50 KBytes Read/Sec=22.40 KBytes_Written/Sec=50.00 [04:23:14] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Verify that the codfw lvs is configured correctly for Phabricator - https://phabricator.wikimedia.org/T168699#3373930 (10mmodell) [04:37:11] (03PS1) 10Dzahn: icinga: add plugin to check exim queue sizes [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) [04:41:48] 10Operations, 10monitoring, 10Patch-For-Review: Check for an oversized exim4 queue indicating mail delivery failures - https://phabricator.wikimedia.org/T133110#3373984 (10Dzahn) ^ with script above: ``` [mx1001:~] $ ./check_exim_queue -w 1000 -c 5000 OK: Less than 1000 mails in exim queue. [mx1001:~] $ ./c... 
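The quoted test run of `check_exim_queue` above shows the shape of such a plugin: count the queue, compare against the `-w`/`-c` thresholds, exit with the usual Nagios codes. A minimal sketch of that idea, assuming exim's `-bpc` option is available on the mail host; this is a reconstruction, not the script from the Gerrit change (r361023).

```
#!/bin/bash
# Sketch of a Nagios-style exim queue-size check (reconstruction, not r361023 itself).
while getopts "w:c:" opt; do
  case "$opt" in
    w) WARN=$OPTARG ;;   # warning threshold, e.g. -w 1000
    c) CRIT=$OPTARG ;;   # critical threshold, e.g. -c 5000
  esac
done
COUNT=$(exim -bpc)       # exim -bpc prints the number of messages in the queue ("exim4 -bpc" on Debian)
if   [ "$COUNT" -ge "$CRIT" ]; then echo "CRITICAL: ${COUNT} mails in exim queue."; exit 2
elif [ "$COUNT" -ge "$WARN" ]; then echo "WARNING: ${COUNT} mails in exim queue."; exit 1
else echo "OK: Less than ${WARN} mails in exim queue."; exit 0
fi
```

Invoked the same way as in the quoted comment: `./check_exim_queue -w 1000 -c 5000`.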
[05:27:43] 10Operations, 10Discovery, 10Icinga, 10Maps, and 2 others: Create Icinga alert when OSM replication lags on maps - https://phabricator.wikimedia.org/T167549#3374006 (10Pnorman) [05:53:29] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1026" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361025 [05:55:00] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1026" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361025 [05:57:28] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1026" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361025 (owner: 10Marostegui) [05:58:29] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1026" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361025 (owner: 10Marostegui) [05:58:38] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1026" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361025 (owner: 10Marostegui) [05:59:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1026 - T166207 (duration: 00m 47s) [05:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:40] T166207: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207 [06:03:22] 10Operations, 10DBA, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3374039 (10jcrespo) > I propose Monday after ops meeting (17:00 UTC). I am sorry, but that is outside of my working hours. [06:03:48] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361027 [06:03:51] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361027 [06:05:37] (03CR) 10Marostegui: "> well, that's interesting. it would explain my question from above" [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [06:06:02] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361027 (owner: 10Marostegui) [06:06:39] (03CR) 10Jcrespo: [C: 04-2] "You are modifying production.sql - those should not change- the web monitors do never connect to the databases. Any existing grants in tha" [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [06:07:18] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361027 (owner: 10Marostegui) [06:07:27] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361027 (owner: 10Marostegui) [06:07:43] RECOVERY - Check systemd state on releases1001 is OK: OK - running: The system is fully operational [06:08:26] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2066 - T168354 (duration: 00m 46s) [06:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:34] T168354: dbstore2001: s5 thread isn't able to catch up with the master - https://phabricator.wikimedia.org/T168354 [06:10:43] PROBLEM - Check systemd state on releases1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
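The "Check systemd state" flaps on releases1001 above report systemd's overall state going "degraded" because one unit has failed; the `!log` entry a little further down shows the fix that was applied (a `systemctl reset-failed` for the puppet service). A generic way to investigate and clear that condition, sketched here rather than copied from the session:

```
systemctl is-system-running            # prints "degraded" while any unit is in the failed state
systemctl list-units --state=failed    # which unit tripped the Icinga check?
journalctl -u puppet.service -n 50     # puppet was the failing unit named in the !log entry below
sudo systemctl reset-failed            # clear the failed state once the cause is understood
```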
[06:11:03] (03PS1) 10Marostegui: db-eqiad.php: Add coment to db1041 running alter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361028 (https://phabricator.wikimedia.org/T166208) [06:12:48] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Add coment to db1041 running alter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361028 (https://phabricator.wikimedia.org/T166208) (owner: 10Marostegui) [06:14:01] (03Merged) 10jenkins-bot: db-eqiad.php: Add coment to db1041 running alter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361028 (https://phabricator.wikimedia.org/T166208) (owner: 10Marostegui) [06:14:09] (03CR) 10Jcrespo: [C: 04-2] "This is not only needed, it is a security hole." [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [06:15:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add comments to db1041 long running alter status - T166208 (duration: 00m 46s) [06:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:28] T166208: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208 [06:15:43] RECOVERY - Check systemd state on releases1001 is OK: OK - running: The system is fully operational [06:15:51] (03CR) 10Jcrespo: [C: 04-2] "Not only unnecessary..." [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [06:15:56] (03CR) 10jenkins-bot: db-eqiad.php: Add coment to db1041 running alter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361028 (https://phabricator.wikimedia.org/T166208) (owner: 10Marostegui) [06:17:22] !log Deploy alter table on db1041 - s7 - T166208 [06:17:23] !log releases1001 - systemctl reset-failed to clear Icinga systemd status CRIT - service puppet [06:17:29] (03Abandoned) 10Dzahn: mariadb: add GRANTs for tendril@dbmonitor1001, tendriL@dbmonitor2001 [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [06:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:33] (03CR) 10Jcrespo: "If you are so sure this works, please deploy and break gerrit, but do not expect me to fix it for you later." 
[puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [06:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:34] (03Abandoned) 10Dzahn: Gerrit: Set useUnicode=true, also change connectionCollation to utf8mb4_unicode_ci [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [06:18:50] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3374054 (10Dzahn) a:05Dzahn>03None [06:23:02] (03Abandoned) 10Dzahn: switch tendril from einsteinium to dbmonitor1001 [dns] - 10https://gerrit.wikimedia.org/r/359372 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [06:35:00] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494#3374061 (10Dzahn) [06:36:27] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494#3366319 (10Dzahn) 05Open>03Invalid [06:37:24] 10Operations: tracking task: get rid of Ubuntu (trusty) (in prod) - https://phabricator.wikimedia.org/T168495#3374063 (10Dzahn) 05Open>03Invalid [06:38:44] 10Operations, 10DBA, 10Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3374067 (10Dzahn) [06:38:48] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3374068 (10Dzahn) [06:39:03] 10Operations, 10DBA, 10Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3187493 (10Dzahn) a:05Dzahn>03None [06:46:51] 10Operations, 10DBA, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3374077 (10madhuvishy) @jcrespo no problem! let me know what time works for you :) I can do earlier on Monday too. Would 14:00 UTC work? Feel free to propose a suitable time if not. Tha... [06:49:55] (03CR) 10Dzahn: ""either are ok to me" and then "you are modiyfing production, it's a security hole" are both reviews from you on the same patch. sorry if " [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [06:52:58] (03CR) 10Jcrespo: "The grants were ok, but adding it to production.sql.erb, which applies to all databases in production was an error- I didn't notice that a" [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [07:43:56] (03PS1) 10Marostegui: maintain-views.yaml: Remove unsued views [puppet] - 10https://gerrit.wikimedia.org/r/361034 (https://phabricator.wikimedia.org/T153213) [07:46:33] (03PS2) 10Marostegui: maintain-views.yaml: Remove unused views [puppet] - 10https://gerrit.wikimedia.org/r/361034 (https://phabricator.wikimedia.org/T153213) [07:52:08] 10Operations, 10DBA, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3374119 (10jcrespo) So there are several things here- the dns change and the actual reboot. There should be a time between them. I say you (as in, anyone on your team) change dns day 1... 
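For a switchover like tendril moving from einsteinium to dbmonitor1001 (discussed throughout this stretch of the log), two quick generic checks help before and after the DNS change: confirm where the name resolves, and confirm the new backend serves a valid certificate for it. The commands below are illustrative, not ones anyone ran here, and the dbmonitor1001 public FQDN is an assumption; the TLS failure that turns up later in the log (the disabled Let's Encrypt script) is exactly what the second check would have caught.

```
# Where does the public name point right now?
dig +short tendril.wikimedia.org

# Does the new backend present a valid certificate for that name?
# (dbmonitor1001.wikimedia.org as the reachable FQDN is an assumption)
echo | openssl s_client -connect dbmonitor1001.wikimedia.org:443 \
  -servername tendril.wikimedia.org 2>/dev/null | openssl x509 -noout -subject -issuer -dates
```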
[08:06:48] (03PS1) 10Filippo Giunchedi: hieradata: fix nutcracker thumbor config [puppet] - 10https://gerrit.wikimedia.org/r/361035 [08:09:22] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: fix nutcracker thumbor config [puppet] - 10https://gerrit.wikimedia.org/r/361035 (owner: 10Filippo Giunchedi) [08:14:23] RECOVERY - nutcracker process on thumbor2001 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [08:14:33] RECOVERY - nutcracker port on mw2148 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [08:14:33] RECOVERY - nutcracker port on thumbor2001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [08:15:23] RECOVERY - nutcracker process on mw2148 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [08:21:01] 10Operations: nutcracker test config in puppet doesn't work - https://phabricator.wikimedia.org/T168705#3374171 (10fgiunchedi) [08:21:03] RECOVERY - puppet last run on mw2148 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [08:21:03] RECOVERY - puppet last run on thumbor2001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [08:27:24] (03PS1) 10Filippo Giunchedi: nutcracker: validate new config file [puppet] - 10https://gerrit.wikimedia.org/r/361039 (https://phabricator.wikimedia.org/T168705) [08:27:27] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3374198 (10jcrespo) a:03jcrespo [08:28:10] (03Restored) 10Jcrespo: switch tendril from einsteinium to dbmonitor1001 [dns] - 10https://gerrit.wikimedia.org/r/359372 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [08:28:23] (03PS2) 10Jcrespo: switch tendril from einsteinium to dbmonitor1001 [dns] - 10https://gerrit.wikimedia.org/r/359372 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [08:30:23] 10Operations, 10Wikimedia-Stream: Error on RCStream server startup for the "flash policy server" - https://phabricator.wikimedia.org/T153770#3374210 (10Aklapper) [08:32:05] (03PS4) 10Volans: CLI: improve configuration error handling [software/cumin] - 10https://gerrit.wikimedia.org/r/357234 (https://phabricator.wikimedia.org/T158747) [08:32:07] (03PS2) 10Volans: ClusterShell: allow to set a timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359466 (https://phabricator.wikimedia.org/T164838) [08:32:09] (03PS2) 10Volans: CLI: migrate to timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359467 (https://phabricator.wikimedia.org/T164838) [08:32:11] (03PS5) 10Volans: Package metadata and testing tools improvements [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) [08:32:13] (03PS1) 10Volans: Fix Pylint and other tools reported errors [software/cumin] - 10https://gerrit.wikimedia.org/r/361040 (https://phabricator.wikimedia.org/T154588) [08:37:47] (03CR) 10Jcrespo: [C: 032] switch tendril from einsteinium to dbmonitor1001 [dns] - 10https://gerrit.wikimedia.org/r/359372 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [08:38:36] !log deploying dns change to tendril [08:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:51] (03PS1) 10Jcrespo: Revert "switch tendril from einsteinium to dbmonitor1001" [dns] - 10https://gerrit.wikimedia.org/r/361041 [08:43:18] (03CR) 10Jcrespo: [C: 032] Revert "switch tendril from 
einsteinium to dbmonitor1001" [dns] - 10https://gerrit.wikimedia.org/r/361041 (owner: 10Jcrespo) [08:49:09] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3374243 (10jcrespo) The dns change doesn't work either because it causes a TLS error. This is because LE has been disabled there: https:/... [08:59:04] 10Operations, 10Gerrit, 10Release-Engineering-Team (Kanban): Setup maintenance date to reindex gerrit (offline reindex) - https://phabricator.wikimedia.org/T168670#3371480 (10MarcoAurelio) I am not sure if this is causing {T168707}. [08:59:28] (03CR) 10Elukey: [C: 031] "Looking at the logs in T168705 it seems that validate_cmd works on the tmp file, so IIUC even the first puppet run should be covered by th" [puppet] - 10https://gerrit.wikimedia.org/r/361039 (https://phabricator.wikimedia.org/T168705) (owner: 10Filippo Giunchedi) [09:06:24] RainbowSprinkles: poke [09:07:22] (03PS1) 10Marostegui: Revert "MariaDB: temporarily increase limit for space alarm" [puppet] - 10https://gerrit.wikimedia.org/r/361043 [09:07:27] (03PS2) 10Marostegui: Revert "MariaDB: temporarily increase limit for space alarm" [puppet] - 10https://gerrit.wikimedia.org/r/361043 [09:09:19] (03CR) 10Marostegui: [C: 032] Revert "MariaDB: temporarily increase limit for space alarm" [puppet] - 10https://gerrit.wikimedia.org/r/361043 (owner: 10Marostegui) [09:16:10] 10Operations, 10DBA: puppet stopped mysqld using orphan pid file from puppet agent - https://phabricator.wikimedia.org/T86482#3374297 (10Marostegui) 05Open>03Resolved After having a chat with Filippo we believe this hasn't happened again so we are closing it for now. Feel free to reopen if this is seen again. [09:19:03] (03PS1) 10Jcrespo: dbmonitor1001: Reenable let's encript generation script [puppet] - 10https://gerrit.wikimedia.org/r/361046 (https://phabricator.wikimedia.org/T149557) [09:19:05] (03PS1) 10Jcrespo: dbmonitor: Remove tendril role from einstenium and tegmen [puppet] - 10https://gerrit.wikimedia.org/r/361047 (https://phabricator.wikimedia.org/T149557) [09:21:58] (03PS1) 10Jcrespo: Revert "Revert "switch tendril from einsteinium to dbmonitor1001"" [dns] - 10https://gerrit.wikimedia.org/r/361048 [09:22:07] (03CR) 10Marostegui: [C: 031] dbmonitor1001: Reenable let's encript generation script [puppet] - 10https://gerrit.wikimedia.org/r/361046 (https://phabricator.wikimedia.org/T149557) (owner: 10Jcrespo) [09:22:38] (03CR) 10Marostegui: [C: 031] dbmonitor: Remove tendril role from einstenium and tegmen [puppet] - 10https://gerrit.wikimedia.org/r/361047 (https://phabricator.wikimedia.org/T149557) (owner: 10Jcrespo) [09:25:20] (03CR) 10Jcrespo: [C: 032] Revert "Revert "switch tendril from einsteinium to dbmonitor1001"" [dns] - 10https://gerrit.wikimedia.org/r/361048 (owner: 10Jcrespo) [09:25:41] (03PS2) 10Jcrespo: dbmonitor1001: Reenable let's encript generation script [puppet] - 10https://gerrit.wikimedia.org/r/361046 (https://phabricator.wikimedia.org/T149557) [09:27:52] !log reapplying dns change - small downtime on tendril until puppet deploy and run [09:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:28] (03CR) 10Jcrespo: [C: 032] dbmonitor1001: Reenable let's encript generation script [puppet] - 10https://gerrit.wikimedia.org/r/361046 (https://phabricator.wikimedia.org/T149557) (owner: 10Jcrespo) [09:35:55] (03CR) 10Volans: [C: 04-1] "@mutante: I've replied in the task as you asked." 
[puppet] - 10https://gerrit.wikimedia.org/r/324841 (https://phabricator.wikimedia.org/T143175) (owner: 1020after4) [09:45:57] PROBLEM - nutcracker port on thumbor2002 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [09:46:17] PROBLEM - nutcracker process on thumbor2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (nutcracker), command name nutcracker [09:46:37] PROBLEM - puppet last run on thumbor2002 is CRITICAL: CRITICAL: Puppet has 44 failures. Last run 13 minutes ago with 44 failures. Failed resources (up to 3 shown): Service[nutcracker],Exec[create-tmp-folder-8821],Exec[create-tmp-folder-8810],Exec[create-tmp-folder-8819] [09:47:47] PROBLEM - Check systemd state on thumbor2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:49:51] that's me ^ silencing [09:51:57] RECOVERY - nutcracker port on thumbor2002 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [09:52:17] RECOVERY - nutcracker process on thumbor2002 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [09:55:21] !log reboot mw1250-53 for kernel updates [09:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:55] RECOVERY - Check systemd state on thumbor2002 is OK: OK - running: The system is fully operational [10:12:52] 10Operations, 10Gerrit, 10Release-Engineering-Team (Kanban): Setup maintenance date to reindex gerrit (offline reindex) - https://phabricator.wikimedia.org/T168670#3374397 (10Paladox) It may have, we may have discovered a bug if it is indeed the reindex. 2.13 is too buggy now. [10:12:55] RECOVERY - puppet last run on thumbor2002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [10:47:55] 10Operations, 10Traffic, 10Patch-For-Review: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3374447 (10ema) I've queried prometheus as follows to find the maximum transient storage usage per cache_type/layer over the past 30 days: `max by (job,layer) (max_over_time(v... [11:13:57] PROBLEM - nutcracker port on thumbor2003 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [11:14:17] PROBLEM - nutcracker process on thumbor2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (nutcracker), command name nutcracker [11:15:37] PROBLEM - puppet last run on thumbor2003 is CRITICAL: CRITICAL: Puppet has 52 failures. Last run 1 minute ago with 52 failures. Failed resources (up to 3 shown): Service[nutcracker],Exec[create-tmp-folder-8821],Exec[create-tmp-folder-8846],Exec[create-tmp-folder-8810] [11:15:37] PROBLEM - Check systemd state on thumbor2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
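The nutcracker breakage on the thumbor hosts, and the follow-up patch "nutcracker: validate new config file" (T168705), come down to checking the rendered config before the service is restarted. Twemproxy ships a syntax check for this; a sketch, with the config path as an assumption. As a reviewer notes earlier in the log, Puppet's `validate_cmd` runs such a check against its temporary file, so even the first Puppet run is covered.

```
# Validate the rendered twemproxy/nutcracker config before restarting the service.
nutcracker --test-conf -c /etc/nutcracker/nutcracker.yml && echo "config OK"

# After a restart, confirm the proxy listens where the Icinga check expects it:
nc -z 127.0.0.1 11212 && echo "port 11212 up"
```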
[11:15:40] (03PS3) 10Hashar: contint: PHP ext build dependencies on Nodepool [puppet] - 10https://gerrit.wikimedia.org/r/342635 (https://phabricator.wikimedia.org/T134381) [11:18:43] (03PS3) 10DCausse: [WIP] Switch this repo to a deb package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) [11:21:51] ORES is going down: https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1&from=now-1h&to=now [11:22:12] :( [11:22:52] It seems it is an overload [11:22:57] PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [11:24:02] Nope, it's 100% now [11:25:04] mutante: it seems CPU on scb1001 is around 100% again [11:25:39] oh nooo, we did not deploy wmf.6? there were a few fixes in it that i wanted to go live :( [11:25:47] i can't have a swat on friday, can i? :/ [11:28:10] Ops people, please kill electron on scb100[1-4] nodes, the CPU for all of them is around 100% causing ORES to return error to ALL requests [11:28:10] https://grafana.wikimedia.org/dashboard/db/ores?panelId=5&fullscreen&orgId=1 [11:28:14] see [11:29:44] afk for now [11:35:57] RECOVERY - nutcracker port on thumbor2003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [11:36:17] RECOVERY - nutcracker process on thumbor2003 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [11:37:33] _joe_: ^ [11:39:38] Amir1: hey, _joe_ is afk at the moment, is it safe to kill electron? [11:39:53] ema: we did that last time [11:40:12] ok, looking [11:40:50] yeah I was looking too, looks like the electron workers just restarted ? [11:41:43] got better a little https://grafana.wikimedia.org/dashboard/db/ores?panelId=5&fullscreen&orgId=1 [11:44:52] I'm gonna try restarting pdfrender.service on scb1001 [11:45:12] !log scb1001: restart pdfrender.service [11:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:46] Thanks [11:45:49] I'll see if I can capture strace / ltrace from electron on scb1003 [11:46:46] cpu usage is still high on scb1001, mostly due to celery workers [11:49:08] yeah it doesn't look like electron is the culprit, at least not looking at cpu usage [11:49:42] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:49:42] I think my stracing of electron processes on 1003 might have killed them, and "fixing" the issue at the same time [11:51:34] heh on e.g. 1004 there's high cpu usage from electron alright, together with celery [11:53:04] it seems scb1003 gets too many requests comparing to other nodes [11:53:12] around ten times more [11:53:22] PROBLEM - pdfrender on scb1003 is CRITICAL: connect to address 10.64.32.153 and port 5252: Connection refused [11:53:23] PROBLEM - puppet last run on thumbor2004 is CRITICAL: CRITICAL: Puppet has 21 failures. Last run 7 minutes ago with 21 failures. 
Failed resources (up to 3 shown): Exec[create-tmp-folder-8821],Exec[create-tmp-folder-8846],Exec[create-tmp-folder-8810],Exec[create-tmp-folder-8819] [11:53:50] godog: I'm gonna try restarting pdfrender on all 4 nodes if you agree [11:53:52] going to restart that, maybe LVS pick it up [11:54:03] heh also I got the infamous xpra error for pdfrender on scb1003 [11:54:10] (I'm talking about restarting celery and uwsgi) [11:54:25] ema: sec, I'm looking at the xpra issue on 1003 [11:54:28] ok [11:55:02] !log restarted uwsgi-ores and celery-ores-worker services in scb1003 [11:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:09] why the f it takes forever to make celery restart [11:57:32] It's backing up [11:57:44] https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-3h&to=now [11:58:27] Amir1: let's do one thing at a time [11:59:08] okay, sorry [11:59:53] Amir1: so ores is backing up in the sense that it is recovering? [12:00:30] godog: can't say for sure [12:00:37] it seems it's going down again [12:01:12] PROBLEM - thumbor@8835 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8835 is inactive [12:01:22] PROBLEM - thumbor@8819 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8819 is inactive [12:01:43] godog: my impression is that lvs sends too many requests to scb1003 https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-15m&to=now [12:01:51] compare different eqiad nodes there [12:01:56] (03PS1) 10Dereckson: Use hosts in wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361055 [12:02:02] PROBLEM - thumbor@8838 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8838 is inactive [12:02:03] PROBLEM - thumbor@8821 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8821 is inactive [12:02:16] half of the requests goes to scb1003 [12:02:22] PROBLEM - thumbor@8839 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8839 is inactive [12:02:22] PROBLEM - thumbor@8805 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8805 is inactive [12:02:22] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [12:02:29] sorry about thumbor2004, silencing [12:02:32] PROBLEM - thumbor@8823 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8823 is inactive [12:02:32] PROBLEM - thumbor@8806 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8806 is inactive [12:02:32] PROBLEM - thumbor@8840 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8840 is inactive [12:02:41] Amir1: https://config-master.wikimedia.org/pybal/eqiad/ores - weights for each host in there [12:02:52] PROBLEM - thumbor@8807 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8807 is inactive [12:02:58] (03CR) 10jerkins-bot: [V: 04-1] Use hosts in wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361055 (owner: 10Dereckson) [12:03:12] (03PS2) 10Dereckson: Use hosts in wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361055 [12:03:28] elukey: compare 1003 and 1004 in my graph, it's still three times higher [12:03:34] five times [12:05:16] is it directly tied to more traffic handled or a side effect of particular scores getting delayed all of 
sudden in one host more than the others? [12:05:45] (still trying to bring up pdfrender on scb1003 btw) [12:06:55] I think it's about more traffic being sent to 1003, it seems it has been happening for quite some time [12:07:11] Amir1: https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-3h&to=now - as far as I can see scb1003 always handles more scores [12:07:16] even when it is not rxploding [12:07:40] https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-90d&to=now interesting [12:07:41] oh, okay, so that's a red herring [12:08:05] ok definitely weird https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-7d&to=now [12:09:04] I don't understand why scb1004 is so happy and getting very low number of requests [12:09:33] but it seems it's under pressure of edits, someone is editing with really bad pace in wikidata or English Wikipedia [12:09:42] let me check (ORES exclude bots) [12:10:24] yeah I was about to ask, it seems that some clients like few big clients are generating big volume [12:11:09] but again, it is has been ongoing for a while, maybe pdfelectron added more entropy and caused ores to trip over its limits [12:11:32] Amir1: what is the current status? Ores down? [12:11:52] yes [12:11:56] ORES is down [12:12:52] RECOVERY - thumbor@8807 service on thumbor2004 is OK: OK - thumbor@8807 is active [12:13:02] RECOVERY - thumbor@8838 service on thumbor2004 is OK: OK - thumbor@8838 is active [12:13:02] RECOVERY - thumbor@8821 service on thumbor2004 is OK: OK - thumbor@8821 is active [12:13:12] RECOVERY - thumbor@8835 service on thumbor2004 is OK: OK - thumbor@8835 is active [12:13:22] RECOVERY - thumbor@8839 service on thumbor2004 is OK: OK - thumbor@8839 is active [12:13:22] RECOVERY - thumbor@8805 service on thumbor2004 is OK: OK - thumbor@8805 is active [12:13:28] is there a graph for number of human edits per wiki? I couldn't find it in grafana [12:13:32] RECOVERY - thumbor@8819 service on thumbor2004 is OK: OK - thumbor@8819 is active [12:13:33] ok, and the suspected cause is scb out of cpu? [12:13:40] no there's all edits combined [12:13:42] RECOVERY - thumbor@8806 service on thumbor2004 is OK: OK - thumbor@8806 is active [12:13:42] RECOVERY - thumbor@8823 service on thumbor2004 is OK: OK - thumbor@8823 is active [12:13:42] RECOVERY - thumbor@8840 service on thumbor2004 is OK: OK - thumbor@8840 is active [12:13:43] RECOVERY - puppet last run on thumbor2004 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [12:13:57] i.e. https://grafana.wikimedia.org/dashboard/db/edit-count [12:14:41] Amir1: any idea why celery is also consuming so much cpu? [12:14:54] godog: it's under pressure [12:15:43] RECOVERY - puppet last run on thumbor2003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [12:15:56] godog: one thing that we can do for now is to redirect traffic to codfw [12:16:13] but I'm not sure if that's an option [12:17:27] Amir1: the traffic will just shift to codfw and potentially ahve the same overload there too I'm assuming? 
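To test the suspicion above that LVS favours scb1003, two things can be compared: the weight and pooled state pybal is using (the config-master URL quoted above) and how each backend answers when hit directly on the ores port. A sketch using the hostnames and port 8081 that appear in this log; the exact format of the config-master listing is not guaranteed here.

```
# Per-host weight / pooled state as pybal sees it:
curl -s https://config-master.wikimedia.org/pybal/eqiad/ores

# Probe each backend directly, bypassing LVS, and compare status codes and latency:
for h in scb1001 scb1002 scb1003 scb1004; do
  curl -s -o /dev/null -w "$h  %{http_code}  %{time_total}s\n" "http://$h.eqiad.wmnet:8081/"
done
```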
[12:17:32] (03PS3) 10Lucas Werkmeister (WMDE): Configure WikibaseQualityConstraints extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358553 [12:18:02] I think codfw has some more capacity [12:18:13] also, active/active would help us for now [12:18:22] (03CR) 10Lucas Werkmeister (WMDE): "I’ve rebased the change (conflict with the changes to add a constraints section) and slightly amended the last comment to make the fallbac" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358553 (owner: 10Lucas Werkmeister (WMDE)) [12:18:54] it may be happening that once a node is up it receives traffic for most of the nodes (since they are not passing health checks) ending up in 503s again [12:19:27] they will send 503s to grafana, I can check that [12:19:49] the other weirdness is that the health checks (at least on scb1001) are returning (HTTP/1.0 200) meanwhile the rest is 503 [12:20:09] I also checked logstash to see if there's DoS attack, but number of external requests is natural [12:20:53] I restart ores in all nodes for now [12:21:41] number of internal (precache) requests it gets is higher but ores seen way worse in the past week [12:22:03] ack, if you think that might fix for sure [12:22:14] what happens to the celery queue? it gets flushed? [12:22:32] I've called akosiaris, he's coming online [12:22:35] it sends error [12:24:38] not sure what you mean by that? [12:25:18] godog: it flushes the queue and uwsgi sends out error AFAIK [12:25:30] ah ok [12:25:48] icinga says ores is back online but when I try manually, I get 503 [12:26:08] !log restarting celery and uwsgi on all scb nodes in eqiad [12:26:15] Amir1: health checks are passing but most of the scores get 503s [12:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:50] now scb1004 looks better [12:27:10] I can see 200s [12:27:22] nope, 503s again [12:28:32] PROBLEM - Check systemd state on mw2260 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:28:52] PROBLEM - Check systemd state on mw2259 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:29:03] I'm restarting scb1004, it's natural [12:29:26] well the goal was not to get 503s again :) [12:31:40] 503 is descending https://grafana.wikimedia.org/dashboard/db/ores?panelId=9&fullscreen&orgId=1&from=now-15m&to=now [12:31:56] let's see if it goes up again [12:32:41] in that case we can do 1- stop ChangeProp 2- increase ores capacity using puppet (it will eat more memory) [12:33:01] it's precaching requests that have gone up [12:33:19] so it's a reaction to changeprop I think [12:33:39] 503s went up again [12:33:54] akosiaris: yes, we need to stop ChangeProp right now [12:34:08] Pchelolo: ^ [12:35:21] yeah we can do that [12:36:10] akosiaris: btw. The new nodes are usable now, we can do the transfer today if that's fine [12:36:53] akosiaris: how do we stop changeprop? 
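Stopping changeprop, which akosiaris does shortly after this, amounts to stopping its systemd unit on every scb host, and the log later shows why the result is worth verifying: scb1001 turned out to still be running it after the first attempt. A sketch assuming the unit is simply called `changeprop` and using cumin for the fan-out (cumin shows up elsewhere in this log); the host expression is a best guess at ClusterShell-style ranges.

```
# Stop changeprop on the eqiad scb hosts, then verify it actually stopped everywhere.
sudo cumin 'scb100[1-4].eqiad.wmnet' 'systemctl stop changeprop'
sudo cumin 'scb100[1-4].eqiad.wmnet' 'systemctl is-active changeprop'   # expect "inactive" on every host
```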
[12:37:36] I was looking for a grafana changeprod dashboard, no joy so far [12:37:38] no let's not migrate to the new nodes just because we are under pressure [12:37:51] let's actually figure this out [12:38:02] so I am stopping changeprop for a while [12:38:21] godog: it's in event bus, let me find it [12:38:52] !log disable changeprop due to ORES issues [12:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:37] ok changeprop stopped [12:39:47] let's see [12:40:10] Amir1: btw, we should stop sending INFO messages to logstash [12:40:17] it's cluttering the view for now reason [12:40:26] and are pretty useless I guess [12:40:31] unless you feel differently [12:40:32] Amir1: thanks, found it [12:41:12] akosiaris: the aggregated version of it is useful, to see for example too many requests is coming from where [12:41:23] but lots of them are indeed useless [12:41:42] PROBLEM - Check systemd state on scb2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:41:42] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:41:42] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:41:45] Amir1: yeah we need to make it a bit better, actually send to correct log level to logstash [12:41:52] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:41:52] PROBLEM - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:41:56] instead of hardcoding to info [12:42:01] all these ^ are expected [12:42:22] PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:42:32] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:42:32] PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:43:49] Amir1: so it looks like we are ok ? [12:44:14] it's still overloaded, let me restart again to flush the queue [12:46:30] (03CR) 10Alexandros Kosiaris: [C: 032] dbmonitor: Remove tendril role from einstenium and tegmen [puppet] - 10https://gerrit.wikimedia.org/r/361047 (https://phabricator.wikimedia.org/T149557) (owner: 10Jcrespo) [12:46:44] (03CR) 10Alexandros Kosiaris: [C: 031] "Looks fine to me. 
Go for it" [puppet] - 10https://gerrit.wikimedia.org/r/361047 (https://phabricator.wikimedia.org/T149557) (owner: 10Jcrespo) [12:47:41] Amir1: btw, in https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=now-15m&to=now not cached and precached scores are exactly the same right now [12:49:03] akosiaris: I'm guessing external requests got blocked somehow [12:49:16] precache ones doesn't hit varnish [12:49:35] ofc they do [12:49:44] oh you mean changeprop originated ones [12:49:50] yeah that's true, they don't [12:51:06] overload errors is going up again yay [12:51:48] so it's someone external [12:52:20] however the cluster isn't CPU bounded [12:52:49] currently it's around 30% cpu used [12:53:44] https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=now-3h&to=now&panelId=1&fullscreen [12:54:10] how on earth codfw gets external requests, that graph is wrong [12:54:16] ofc it does [12:54:30] it's active/active geolocated [12:54:43] anyone closer to codfw than eqiad will hit codfw [12:54:54] why is eqiad not getting any requests is the question [12:55:55] Amir1: ^ ? [12:56:03] eqiad should be receiving external requests as well [12:56:22] ores is active/active? [12:56:26] Are you sure [12:56:42] yes [12:56:49] for eqiad, it's because it's overloaded [12:57:02] it's since the rollback from codfw [12:57:14] we went active/active since that [12:59:13] akosiaris: one strange thing here is https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-24h&to=now [12:59:13] I see requests to links like /scores/enwiki/?models=damaging%7Cgoodfaith&revids=787097648&precache=true&format=json from varnish boxes [12:59:39] scb1004 suddenly dropped number of requests it can handle, that probably is putting strain on under nodes [12:59:57] akosiaris: these ones are from jobrunner of mediawiki [13:01:09] and why does it pass precache=true as well ? [13:01:30] I may have misunderstood that part, I expect precache=true requests coming only from changeprop [13:01:37] is that correct ? [13:01:58] no, mediawiki requests also contains precache=true [13:02:14] because they get send once the edit is made [13:03:37] so what ? a revision get's a request to be scored from both mediawiki AND changeprop ? [13:03:57] or do the requests differ ? like different models ? [13:04:15] yeah, they are for different wikis and different models [13:04:37] but they have some overlap (they didn't but now, they have since we enabled goodfaith for mediawiki too) [13:04:48] I need to get that handled [13:05:06] well, load wise it should not matter, but it does make debugging more confusing and difficult, so yes please do [13:05:33] by load-wise, I mean because we have already calculated the scores [13:05:36] hm [13:05:46] akosiaris: the thing I don't understand is that the load on ores is not different from yesterday [13:05:46] you said the models may differ, so actually that's wrong [13:06:21] because they gets triggered at almost same time, it can put pressure on the workers [13:07:58] akosiaris: do you think we should increase the capacity of ores for now? [13:08:01] https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-7d&to=now [13:08:55] I don't see how that would help right now [13:09:32] yeah, just a thought [13:09:57] so it's the backpressure system that's currently being triggered [13:10:50] should we somehow reload eqiad's ORES to make that go away ? [13:10:57] oh you already did that [13:11:01] didn't help, right ? 
[13:11:08] well, it can't hurt if I redo it [13:11:26] I did that twice [13:11:32] I can do for the third time [13:12:22] PROBLEM - HHVM rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:13:12] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 79767 bytes in 0.155 second response time [13:13:48] !log bump uwsgi-ores and celery-ores-worker on scb100* [13:13:52] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb1001.eqiad.wmnet because of too many down! [13:13:52] PROBLEM - ores on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 8081: Connection refused [13:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:02] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb1001.eqiad.wmnet because of too many down! [13:14:02] PROBLEM - ores on scb1004 is CRITICAL: connect to address 10.64.48.29 and port 8081: Connection refused [13:14:12] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb1001.eqiad.wmnet because of too many down! [13:14:17] PROBLEM - LVS HTTP IPv4 on ores.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.10 and port 8081: Connection refused [13:14:22] PROBLEM - ores on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8081: Connection refused [13:14:22] PROBLEM - ores on scb1003 is CRITICAL: connect to address 10.64.32.153 and port 8081: Connection refused [13:14:23] that's me ^ [13:14:32] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb1001.eqiad.wmnet because of too many down! [13:14:48] <_joe_> what's up with ores? [13:15:02] RECOVERY - ores on scb1002 is OK: HTTP OK: HTTP/1.0 200 OK - 3666 bytes in 4.740 second response time [13:15:03] RECOVERY - ores on scb1004 is OK: HTTP OK: HTTP/1.0 200 OK - 3666 bytes in 0.012 second response time [13:15:17] RECOVERY - LVS HTTP IPv4 on ores.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 4375 bytes in 1.605 second response time [13:15:18] it's overloaded and can't get handle the pressure [13:15:22] RECOVERY - ores on scb1001 is OK: HTTP OK: HTTP/1.0 200 OK - 3666 bytes in 0.019 second response time [13:15:22] RECOVERY - ores on scb1003 is OK: HTTP OK: HTTP/1.0 200 OK - 3666 bytes in 0.011 second response time [13:15:32] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [13:15:36] Amir1: should we try to clean the queue ? [13:15:57] I guess that would help, assuming external requests don't pile up agian [13:15:58] again* [13:16:02] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [13:16:05] yeah [13:16:19] do you want to flush redis [13:16:45] yeah I can do that [13:16:52] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [13:17:12] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy [13:21:27] !log issue flashdb on oresrdb1001:6379 [13:21:31] Amir1: ^ [13:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:47] only 2k celery tasks now [13:22:18] thanks. 
Let's see [13:22:55] I 'll wait 10 mins and then restart changeprop as well if everything is ok then [13:23:58] Amir1: /me curious, what happens when a score is requested for a revision that has already a job in the queue ? [13:24:31] itsn't not double calculated I assume, right ? [13:24:46] akosiaris: per what's Aaron implemented it's not get duplicated [13:24:59] but honestly I never checked that in prod [13:26:42] 11k jobs already in redis [13:26:47] and I haven't yet started changeprop [13:27:41] and it's overloaded again [13:29:29] so let's find who's overloading us then [13:30:58] akosiaris: I checked logstash and nothing was problematic, the load was normal [13:31:31] something is making ores extremely slow [13:33:40] (03PS14) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) [13:35:28] Amir1: looking at the logs from scb1003 [13:35:51] it looks like there is no external entity issuing too many requests [13:35:57] we are DoSing ourselves [13:36:30] 9690 user agent "MediaWiki/1.30.0-wmf.5" [13:36:30] 5840 user agent "MediaWiki/1.30.0-wmf.6" [13:36:43] that's scb1003 since 12:40 today [13:37:05] the next one is at 294 [13:37:30] akosiaris: that translates to someone making too many edit on a non-bot account [13:37:49] I checked wikidata and nothing was suspicious [13:38:27] I can check in graphite [13:38:46] 6357 scores/enwiki/ [13:38:46] 5698 scores/wikidatawiki/ [13:38:55] <_joe_> akosiaris: I bet it's someone using the mw api [13:38:59] same host, same timeperiod [13:39:11] <_joe_> which I already disabled once [13:39:19] <_joe_> let me see in api.log [13:40:01] _joe_: thanks! [13:41:22] let me check the logs [13:41:43] <_joe_> Amir1: what's the action to query ores from the action api? [13:42:17] oresreview is one of them [13:42:25] there are several cases [13:42:44] <_joe_> yeah I can't find any calls on the api logs [13:42:55] <_joe_> so I guess it's a bot not marked as such [13:43:03] <_joe_> again, not the first time :P [13:43:57] Amir1: I guess taks in redis being marked as REVOKED is expected ? [13:44:18] I guess, since we were flushing them [13:45:04] hmmm not sure how this makes sense [13:45:06] akosiaris: I see ChangePropagation/WMF requests, are you sure you disabled those? [13:45:26] (just tailing main.log in scb1003) [13:45:53] (just tailing main.log in scb1003) [13:46:00] Sorry [13:46:06] ? [13:46:13] I am definitely sure I did [13:46:16] double checking [13:46:53] maybe, it's disabled but old requests are being retried [13:47:06] hm,, no actually it was not disabled on scb1001 [13:47:15] although I am certain I disabled it on it too [13:47:28] interesting.. so service changeprop stop might actually fail ? [13:47:31] how nice [13:47:37] anyway reissued the command now [13:48:19] ok this time around it looks like it's stopped everywhere [13:48:42] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:48:42] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:49:30] another flushing of everything? [13:50:51] sure [13:51:07] dbsize [13:51:07] (integer) 31 [13:51:29] !log issue flashdb on oresrdb1001:6379 [13:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:21] Amir1: looks like it's at 109 now.. 
and stable for the last few mns [13:52:23] mins* [13:52:24] secs* [13:52:25] damn [13:52:42] (03CR) 10Elukey: "@Mforns: I removed the extra warning log about number of rows updated bigger than the batch size since now it does not help much, plus add" [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [13:56:18] (03PS1) 10Strainu: Set collation for Romanian wikis to uca-ro [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361066 [13:57:02] (03PS2) 10Strainu: Set collation for Romanian wikis to uca-ro [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) [13:58:21] Amir1: dbsize at 3,6k [13:58:32] which is way less than before [13:59:16] https://www.wikidata.org/w/index.php?title=Special:Contributions/Teolemon& [13:59:26] this user is doing lots of edits [14:00:19] that's clearly a bot [14:00:32] <_joe_> let' [14:00:41] <_joe_> s find a wikidata admin and block him [14:00:55] _joe_: I am [14:01:03] <_joe_> I suspected so :) [14:01:05] yeah I was about to say we have one :-) [14:01:12] <_joe_> Amir1: block him. [14:01:22] <_joe_> well, it [14:01:36] yeah they are in clear violation of the rules. This is a bot and it's not marked as a bot [14:01:56] <_joe_> even if they weren't, it's a production issue and we need to stop the bleeding [14:04:21] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Configure WikibaseQualityConstraints extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358553 (owner: 10Lucas Werkmeister (WMDE)) [14:05:09] Amir1: I see you blocked the user, let's see what happens [14:05:17] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Access request for Daniel Worley to analytics / hadoop - https://phabricator.wikimedia.org/T168439#3374794 (10Dworley) Yes, my wikitech email is fine. Thanks! [14:05:35] yeah, I also sent a warning to another one too [14:08:05] (03PS3) 10Giuseppe Lavagetto: Add build script plus nodejs base images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/360813 [14:08:19] (03PS1) 10Ottomata: Insert the (new) page-create events into MySQL too [puppet] - 10https://gerrit.wikimedia.org/r/361068 [14:08:29] I'd need to reboot some app/api servers for the kernel updates, should I wait a bit more ? [14:10:21] (03PS4) 10DCausse: Switch this repo to a deb package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) [14:12:09] (03CR) 10Ottomata: [C: 032] Insert the (new) page-create events into MySQL too [puppet] - 10https://gerrit.wikimedia.org/r/361068 (owner: 10Ottomata) [14:12:46] Amir1: I see no overload ORES. I would like to re-enable changeprop. any objections ? [14:13:10] akosiaris: it's getting timeout error lots of time [14:13:21] can we wait for around ten more minutes? [14:13:54] yeah [14:13:57] sure, no prob [14:15:19] akosiaris: we are overloaded again [14:15:19] https://grafana.wikimedia.org/dashboard/db/ores?panelId=9&fullscreen&orgId=1&from=now-1h&to=now [14:15:22] and I'm done [14:16:07] ? [14:16:40] dbsize [14:16:41] (integer) 11119 [14:16:44] what on earth ? [14:17:42] akosiaris: the next thing we can do is to kill all ORESFetchScore jobs in jobrunner [14:17:49] _joe_: what do you think? 
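"Kill all ORESFetchScore jobs in jobrunner", as proposed above, would be an operation on the MediaWiki job queue rather than on ORES itself. A rough sketch of sizing the problem first, assuming the WMF `mwscript` wrapper and MediaWiki's `showJobs.php` maintenance script, and taking the job type name from the chat; nothing in the log shows this actually being run. If the jobs really had to be dropped, MediaWiki's `manageJobs.php` (with a type filter and a delete action) is the kind of tool that would follow, but that step is not taken here.

```
# How many ORES-related jobs are queued / claimed / abandoned on the busiest wikis?
# (mwscript and the exact job type name are assumptions based on the discussion above)
mwscript showJobs.php --wiki=wikidatawiki --group | grep -i ores
mwscript showJobs.php --wiki=enwiki --group | grep -i ores
```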
[14:17:52] (03PS5) 10DCausse: Switch this repo to a deb package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) [14:17:55] maybe that's putting too much pressure [14:18:18] <_joe_> Amir1: it's not exactly trivial to do that, FYI [14:20:30] why codfw is down too [14:20:39] it wasn't down half an hour ago [14:22:21] sigh [14:23:30] dbsize [14:23:30] (integer) 1814124 [14:23:32] that's codfw [14:23:33] damn [14:29:59] I'm assuming the reason why celery uses so much cpu is because its queue is just big with things to process? [14:30:56] the other thing I noticed is that despite the 503 there's nothing in ores logs afaics, except the access log itself showing the 503 [14:34:10] I'm bringing Aaron [14:36:23] PROBLEM - Host mw2259 is DOWN: PING CRITICAL - Packet loss = 100% [14:36:57] that's me ^ stale host from thumbor reimaging [14:37:00] 10Operations, 10Traffic, 10Patch-For-Review: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3374918 (10ema) >>! In T164768#3374447, @ema wrote: > When it comes to upload and misc, instead, we should try to find out why transient storage usage grows up to > 50G in cert... [14:37:03] RECOVERY - Check systemd state on thumbor2003 is OK: OK - running: The system is fully operational [14:37:04] mw2260 will likely follow [14:37:13] RECOVERY - Host mw2259 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [14:37:13] RECOVERY - Check systemd state on mw2259 is OK: OK - running: The system is fully operational [14:37:25] akosiaris: it seems the job gets retried over and over again [14:37:31] for mediawiki [14:37:39] I don't know how to flush that out [14:37:53] and the queue will get bigger and bigger [14:37:53] RECOVERY - Check systemd state on mw2260 is OK: OK - running: The system is fully operational [14:38:20] so jobqueue is what ? resubmitting the same job over and over to ORES ? [14:38:26] if you check logs, recent requests from mediawiki are for revisions are for hours ago [14:38:26] and btw, that answer my question [14:38:31] from earlier [14:38:38] yeah [14:38:47] somehow those requests get resubmitted and ores does not deduplicate [14:38:55] I know it stops after 30 retries but we can't afford that [14:40:11] akosiaris: for mediawiki, when it sees a 503, it marks it for a retry after a while, the failed job doesn't even get to the queue [14:40:27] *even get to the ores celery queue [14:40:48] PROBLEM - Check whether ferm is active by checking the default input chain on bast3002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:41:18] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational [14:41:38] RECOVERY - Check whether ferm is active by checking the default input chain on bast3002 is OK: OK ferm input default policy is set [14:41:49] Amir1: so, how is the overload condition triggered ? [14:42:05] I mean what is the code that says "hey we are overloaded currently!" ? [14:42:15] cause it's clearly not related to CPU/mem/IO of boxes [14:42:18] when the redis queue gets more than 100 revisions, it stops to accept more and sends out 503 [14:42:38] 100 revisions ? [14:42:44] 100 per worker [14:42:52] ah [14:43:08] let me ask aaron [14:44:03] so 64*2*4 ? [14:44:13] 32 cores, 2 workers per core, 4 boxes [14:44:15] akosiaris: Aaron says it's 100 in total [14:45:11] so wait.. am I understanding this correctly ? we can serve at most 100 revisions scorings at any point in time ? 
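The overload behaviour described just above ("all workers busy plus about 100 tasks queued, then 503") can be eyeballed directly against the celery broker. A sketch where the Redis host and port come from the `!log` entries in this log, while the `.eqiad.wmnet` suffix and the queue key name `celery` (Celery's default) are assumptions; 100 is the threshold quoted in the chat.

```
# Rough check of the backpressure condition described above.
QLEN=$(redis-cli -h oresrdb1001.eqiad.wmnet -p 6379 llen celery)
echo "queued scoring tasks: ${QLEN} (overload threshold quoted in the chat: 100)"
if [ "${QLEN}" -gt 100 ]; then
  echo "above the threshold: new requests would be answered with 503 'server overloaded'"
fi
```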
[14:45:26] What I think worth checking is that, it seems ores is slow even when there is no tasks in the queue [14:45:31] maybe that's related to something [14:46:07] akosiaris: no, when workers are assigned to do a task it's not in queue anymore [14:46:45] so all workers are busy + 100 in queue [14:46:55] o/ [14:46:59] Amir1, just got connected. [14:47:01] 10Operations, 10Traffic, 10Patch-For-Review: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3374941 (10BBlack) >>! In T164768#3374447, @ema wrote: > On text, transient storage usage seems pretty reasonable; we could cap as follows, leaving plenty of room for spikes: >... [14:47:07] hey [14:47:15] halfak: o/ [14:47:39] hey akosiaris [14:47:59] So from what I've heard from Amir, my suspect is the uwsgi queue. [14:48:12] I think each host has a uwsgi queue. [14:48:45] uwsgi workers are having fun https://grafana.wikimedia.org/dashboard/db/ores?panelId=13&fullscreen&orgId=1 [14:48:46] Has anyone looked into what kind of requests fill up the redis queue as soon as it is cleared? [14:49:04] halfak: most of them now are mediawiki jobqueue [14:49:06] yeah me.. REVOKED staff [14:49:17] Oh... what. So non of them are doing anything? [14:49:42] akosiaris, "REVOKED"? [14:50:02] TaskRevokedError [14:50:04] I am doing a [14:50:12] get celery-task-meta-wikidatawiki:goodfaith:0.3.0:505327754 [14:50:19] that's the human readable part [14:50:28] and status: REVOKED [14:50:52] No deploys recently, right? [14:51:16] AFAIK yes [14:51:22] last on is Jun 15 [14:51:26] no deploys [14:51:29] kk [14:51:31] so no deploys [14:51:39] I've got headache [14:51:47] getting some rest and be back soon [14:51:50] o/ [14:51:59] Looks like celery is doing *something* [14:52:00] o/ [14:52:11] have a good rest, Amir1 [14:52:26] halfak: so I 've flushed the queue store at least 3 times up to now [14:52:32] with redis FLUSHDB [14:52:36] hasn't really helped [14:52:51] kk [14:53:00] dbsize in eqiad is already at 18321 keys [14:53:28] Hmm.. I'm trying to connect to 6380 in eqiad and it's not working. [14:53:32] Oh nevermind. [14:53:34] Just slow [14:56:11] Hmm.. I just got some requests to go to ORES successfully. [14:56:13] So not fully down? [14:56:22] Amir1 said we were down for the last ~3 hours [14:56:49] we 've been having on and off troubles for about that time, yes [14:56:54] heavily overloaded at times [14:57:20] I see. And the external request rate does not correspond to the load? [14:57:40] no this is self inflicted [14:57:46] it's mediawiki [14:57:50] not some external user [14:57:58] at least not by the number of requests we are seeing [14:58:03] I've got a suggestion [14:58:16] let's disable ores review tool in enwiki and wikidatawiki for now [14:58:46] and add it back later [14:58:59] when we are on the new nodes [14:59:45] Hmm... That seems drastic. But the overload errors chart is insane. [15:00:11] akosiaris, so in your opinion mediawiki requests are DOSing us? [15:00:30] (03PS6) 10DCausse: Switch this repo to a deb package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) [15:00:38] yes by a look at the logs more than 90% of the requests are coming from mediawiki [15:00:51] I checked the celery queue (which is checked for 503's) and it's totally empty in eqiad. 
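Because the Celery queue and the score cache live on two different redis instances (6379 and 6380 respectively), they have to be inspected separately. A small diagnostic sketch mirroring the dbsize/GET checks quoted above; the host name, and which of the two instances the celery-task-meta result records land on, are assumptions.

    import redis

    HOST = "localhost"  # e.g. the oresrdb service address for the relevant DC

    queue = redis.StrictRedis(host=HOST, port=6379)  # Celery broker / job queue
    cache = redis.StrictRedis(host=HOST, port=6380)  # score cache

    print("pending celery jobs:", queue.llen("celery"))
    print("queue dbsize:       ", queue.dbsize())
    print("score-cache dbsize: ", cache.dbsize())
    # Per-task status records look like the one quoted above; which instance the
    # result backend writes them to is assumed here, not confirmed.
    print(cache.get("celery-task-meta-wikidatawiki:goodfaith:0.3.0:505327754"))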
[15:00:59] yes I am wondering about that too [15:01:01] (03PS4) 10Ema: varnish: limit varnishd transient storage size [puppet] - 10https://gerrit.wikimedia.org/r/353274 (https://phabricator.wikimedia.org/T164768) [15:01:12] the queue ? no not the queue [15:01:20] the scores cache [15:01:27] the queue is in port 6379 [15:01:35] Oh woops. Thanks [15:01:40] 127.0.0.1:6379> dbsize [15:01:40] (integer) 18321 [15:01:49] but the scores cache is indeed empty in eqiad [15:01:54] and I did not empty it [15:01:55] OK yeah. Seeing 103 now. [15:02:17] That was me being an idiot. Damn it. [15:02:21] ? [15:02:30] Accidentally ran an old command while doing Ctrl-R [15:02:38] and ? flushed the cache ? [15:02:40] Apparently I cleared the cache a while back [15:02:44] oh [15:02:58] ok that's not good... [15:03:15] Thought I ctrl-C'd fast enough :/ I guess not. [15:03:18] yeah. Not helping. [15:03:28] ok, so what do we do ? [15:03:45] I'm looking at that celery queue now [15:04:20] Can we switch traffic to codfw? I don't see any overloading there [15:04:51] I guess so... [15:05:03] btw we have disabled changeprop across the fleet [15:05:13] so no precache from changeprop [15:06:34] I don't get it. All of my requests via the browser are getting through. [15:07:12] are you perhaps hitting codfw ? [15:07:18] mine don't [15:07:19] https://ores.wikimedia.org/v2/scores/wikidatawiki/?models=damaging&revids=421063984 [15:07:20] It looks like it. [15:07:28] check header [15:07:32] codfw is the only think returning scores now. [15:08:32] curl --resolve ores.wikimedia.org:443:208.80.154.224 'https://ores.wikimedia.org/v2/scores/wikidatawiki/?models=damaging&revids=421063984' [15:09:04] 404? [15:09:45] yeah typo on my part [15:09:47] (03PS2) 10Nschaaf: Stop reader surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360849 (https://phabricator.wikimedia.org/T131949) [15:09:49] curl --resolve ores.wikimedia.org:443:208.80.208.80.154.251 'https://ores.wikimedia.org/v2/scores/wikidatawiki/?models=damaging&revids=421063984' [15:09:54] { [15:09:54] "error": { [15:09:54] "code": "server overloaded", [15:09:55] "message": "Cannot process your request because the server is overloaded. Try again in afew minutes." [15:09:55] } [15:10:05] the above would make you hit eqiad [15:10:11] I get a score. [15:10:19] Am connecting from SF right now. [15:10:23] I get overload [15:10:26] So likely codfw [15:10:40] (03CR) 10Nschaaf: [C: 04-1] "Should not be merged until we're ready to stop the surveys" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360849 (https://phabricator.wikimedia.org/T131949) (owner: 10Nschaaf) [15:11:12] halfak: try curl --resolve ores.wikimedia.org:443:208.80.208.80.154.251 [15:11:15] it will force eqiad [15:11:44] I copy-pasted your last command. [15:11:50] Still getting a score. [15:11:54] weird [15:11:58] yeah. Hmm [15:13:07] how about this one ? [15:13:10] curl -i --resolve ores.wikimedia.org:443:208.80.154.251 'https://ores.wikimedia.org/v2/scores/wikidatawiki/?models=damaging&revids=421063984' [15:13:19] I am king of typos today it seems [15:13:23] Looks like the queue isn't moving at all on eqiad. Maybe the celery workers have lost connection. [15:13:27] they are doing *something* [15:13:48] I am running out of battery, I 'll be on the move for a bit to get closer to a power source [15:13:51] brb [15:13:52] Got the overload [15:13:59] kk [15:14:58] I think I want to try restarting the celery workers on the scb100* nodes. [15:15:11] And see if we can get this queue moving. 
[15:17:25] halfak: I'm here too, LMK if you need some root operations [15:18:09] godog, thanks. I think I will. Thinking about what this should look like. [15:21:44] godog, I think we want "service ores-celery restart" on scb100* [15:22:19] I'm struggling to confirm that i got the service name exactly right but there should be nothing else like that name. [15:22:54] halfak: looks like it is celery-ores-worker, I'll roll restart it [15:23:07] Thanks [15:23:16] halfak: I did the restart, you can do it too (we have sudo rights only for restarting these services) [15:23:19] you should be able too btw via sudo [15:23:23] heh I was gonna say [15:23:26] :D [15:23:39] godog, I get a password prompt :/ [15:24:15] halfak: using celery-ores-worker as the service name? mhh it should work [15:24:36] Ahh I had the name wrong. [15:25:02] Amir1, I'm watching the queue to see if *anything* changes. [15:25:12] If nothing at all, I think I have an idea about what is going on. [15:25:17] halfak: I'll go ahead and restart anyhow [15:25:44] OK godog. I have a restart for the service hanging on scb1001 right now [15:26:23] Restart finished [15:26:45] yeah! Queue is changing again [15:27:08] Fills right back up. [15:27:34] weird. So, like, the workers come back online, empty the queue but then stop processing the queue at all. [15:27:40] And it fills back up. [15:28:02] while using a ton of cpu too, afaics [15:28:22] Oh. Wait. loooks like the queue is moving now. [15:30:01] yeah. Definitely moving [15:31:27] halfak: is it possible to use celery native inspection tools btw? e.g. using the command line celery binary? [15:31:30] "binary" [15:31:49] hmm... not sure I've done that before. [15:31:58] I'm inspecting redis directly now. [15:32:29] Looks like the workers have stopped consuming from the queue. [15:32:38] ack, thanks, mostly curiosity [15:32:56] OK so here's what I think is going on: [15:33:08] Oh wait. [15:33:20] Not hypothesis time until I figure out what the celery workers are doing with all their CPU. [15:33:42] And why codfw isn't struggling with this. [15:33:55] I'm not sure why we aren't just redirecting traffic to codfw at this point. [15:33:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:33:58] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:35:08] I can help with that if needed halfak, that'd mean changing varnish to use codfw instead of eqiad ? [15:35:13] 10Operations, 10Wikimedia-Logstash: Set up a service IP for logstash - https://phabricator.wikimedia.org/T113104#3375027 (10Krinkle) [15:35:49] godog, yeah I think so. I wonder if akosiaris already decided against this. I'm kind of a latecomer to this downtime event. [15:35:58] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:36:19] But codfw is throwing no errors right now. And I can successfully make requests to it. [15:36:50] 10Operations, 10Wikimedia-Logstash: Log lines on flourine overflow at 8092 bytes. - https://phabricator.wikimedia.org/T114849#3375028 (10Krinkle) [15:36:59] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:37:36] ack, I'm looking into how to make the switch if it comes to that [15:37:58] Not quite sure where I should expect to find celery logs these days. 
"journalctl -u celery-ores-worker -f" returns "No journal files.." [15:38:54] 10Operations, 10Wikimedia-Logstash: Log lines on flourine overflow at 8092 bytes. - https://phabricator.wikimedia.org/T114849#3375053 (10Krinkle) [15:39:21] judging by what files celery has open I'd say /srv/log/ores/app.log [15:39:37] but it is an empty file, next step would be to increase its logging level [15:39:54] Wait. That shouldn't be an empty file. [15:39:58] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [15:39:59] There's always *something* in there [15:41:17] I'm down to increase logging for celery though [15:41:23] there's indeed exceptions in the previous file, app.log.1 [15:41:27] * halfak checks on where to do that. [15:41:48] Looks like we probably want to do a deploy for that. We have a log config in the deploy repo. [15:41:59] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:42:32] /srv/deployment/ores/deploy/logging_config.yaml [15:42:49] Celery logs on ERROR [15:42:53] Could set that to INFO for a bit. [15:43:11] I'm checking if that config overrides or not command line [15:43:14] ExecStart=/srv/deployment/ores/venv/bin/celery worker \ --app ores_celery.application \ --loglevel ERROR [15:43:32] but yeah we can try the config first [15:43:51] Oh... Hmm... I doubt it would override the commandline. [15:44:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:44:00] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:44:35] ok I'll try on scb1001 to change that first in the systemd service file [15:45:20] !log bounce celery-ores-worker on scb1001 with logging level INFO [15:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:40] OK. I think if we don't see something obvious here, let's just switch traffic to codfw [15:46:22] takes forever to stop celery-ores-worker heh [15:46:41] yeah it does :/ [15:47:35] ah that's because systemd sigkills it [15:47:58] * halfak tail -f's app.log [15:48:33] heh doesn't seem to have changed anything, ok to change the config too? [15:49:17] * halfak thinks. [15:49:26] Let's stop debugging and move traffic to codfw [15:50:15] Here's a past incident where we re-routed traffic: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170426-ORES [15:50:22] https://gerrit.wikimedia.org/r/#/c/350487/ [15:50:27] Looks like that rerouted to eqiad. [15:50:52] ok, I'll do the same [15:51:23] * halfak is sitting in bed still [15:51:30] I'm going to brb to go to a couple of morning things. 
[15:52:23] (03PS1) 10Filippo Giunchedi: hieradata: shift all ores traffic to codfw [puppet] - 10https://gerrit.wikimedia.org/r/361075 [15:54:05] !log Restarted Jenkins [15:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:20] PROBLEM - Apache HTTP on mw2106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:54:35] halfak: https://gerrit.wikimedia.org/r/#/c/361075 [15:55:09] RECOVERY - Apache HTTP on mw2106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.121 second response time [15:56:24] (03CR) 10Halfak: [C: 031] hieradata: shift all ores traffic to codfw [puppet] - 10https://gerrit.wikimedia.org/r/361075 (owner: 10Filippo Giunchedi) [15:56:43] back [15:56:50] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: shift all ores traffic to codfw [puppet] - 10https://gerrit.wikimedia.org/r/361075 (owner: 10Filippo Giunchedi) [15:57:49] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [15:58:37] * halfak watches codfw stuff [15:58:49] icinga things we recovered. [15:59:04] we don't :) [15:59:18] Amir1, aren't you supposed to be resting? [15:59:19] halfak: traffic should be moving shortly [15:59:19] :P [15:59:27] godog, thanks. [16:00:05] This is going to be a confusing debug, I think. [16:00:28] codfw continues being totally OK [16:01:13] halfak: I can't [16:01:18] I'm half resting [16:01:37] halfak: I'm afraid we might overload the whole codfw [16:01:37] godog, it looks like scb1001 might have recovered for a bit when you restarted it. Just a note. [16:01:48] Amir1, I don't think this is a DOS [16:01:59] This is workers making a brief connection and then losing it. [16:02:05] celery workers --> redis [16:02:10] And then the queue doesn't move at all. [16:02:15] halfak: yeah I think that's a side effect of restoring the previous loglevel to ERROR and restarting [16:02:19] I checked the logs for load several times, it looks okay to me too [16:02:44] It looks like scb1001 was the only machine to recover when we did the rolling restarts too [16:02:55] https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&panelId=3&fullscreen [16:03:11] Note that scb1001 is the only machine that produces good scores during the spikes [16:03:15] in eqiad [16:03:40] (03PS2) 10RobH: Daniel Worley shell access request [puppet] - 10https://gerrit.wikimedia.org/r/360988 [16:04:09] godog, is it possible that the rolling restarts only hit scb1001? I can't imagine that. [16:04:36] Celery is working hard on scb1002, so there's something to that. [16:04:54] I'm going to restart celery-ores-worker manually on scb1002 to see what happens. [16:04:57] I am back [16:05:02] halfak: no I don't think so either, scb1001's celery was restarted like 10 minutes ago when I re-enabled puppet there [16:05:06] took me longer than expected sorry [16:05:11] * akosiaris reading backlog [16:05:19] akosiaris: we've moved all traffic to codfw in the meantime, mostly the tl;dr [16:05:52] akosiaris, I've learned that when we restart, scb1001 is the only eqiad machine to start handling requests again for a little bit. [16:05:52] ok [16:06:05] wat [16:06:06] ? [16:06:30] Restarting the celery service on scb1002 manually right now to see if we can get it to do some scoring. 
[16:06:43] It's almost as though we didn't restart anything on scb1002-1004 from grafana [16:06:59] we 've restarted them multiple times already IIRC [16:07:24] Looks like codfw doesn't have the same issues [16:07:25] so dbsize on 6380 ores is 3863 [16:07:35] Our overload errors are falling fast [16:07:36] so at least something has been scored [16:07:53] akosiaris, yeah. It all went through scb1001 [16:07:58] Not scb1002-1004 [16:08:20] I just manually restarted scb1002 and got a few scores to go through. [16:08:33] So I think the rolling service restart might have failed somehow. [16:08:44] Doesn't make sens.e Just my operating hypothesis. [16:09:03] See https://grafana.wikimedia.org/dashboard/db/ores?panelId=3&fullscreen&orgId=1 [16:09:12] Pay special attention to scb100* [16:11:02] Looks like scb1002 cleared the eqiad celery queue [16:11:15] halfak: I tried restarting them, every time around 100 edits go through and then it's overload again [16:11:41] you can see spikes https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1&from=now-6h&to=now [16:11:45] Amir1, right. But for some reason, the last couple rolling restarts resulted in only scb1001 getting a few scores done. [16:11:52] Note which machine is scoring for each spike. [16:11:57] when the failure rate is below 100% It means I restarted them [16:13:12] Amir1, yeah. I see that. Still we are having the situation where only one machine scores after a restart [16:13:38] https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-3h&to=now [16:13:53] compare the spikes between different eqiad nodes [16:14:21] Returning scores != processing scores [16:14:25] looks like 500s have recovered from ores, since 16:02 [16:14:25] Returning comes from uwsgi [16:14:29] processing comes from celery [16:14:35] https://grafana.wikimedia.org/dashboard/db/ores?panelId=3&fullscreen&orgId=1&from=now-3h&to=now [16:14:42] * halfak keeps pointing to ^ [16:15:01] godog, I'm seeing the same. Thank you for helping turn this into a less urgent problem ^_^ [16:15:08] doing an strace on a random celery worker on scb1002 says it's doing something [16:15:10] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3375114 (10RobH) I had a brief chat in IRC with Faidon about this. The boot issue where the LVM fails to detect (due to disks not detecting) should technically be fixed by addressing its r... [16:15:27] akosiaris, confirmed here too. Not sure what it is doing but definitely a lot of something. [16:16:01] halfak: np! glad it worked *touch wood* [16:16:20] :) I like having two datacenters to help deal with this kind of weird **** [16:16:41] akosiaris, Oh! scb1002 *is* doing something [16:16:52] It's behaving well because I restarted the celery service there recently [16:16:55] Try scb1003 [16:17:03] ok looking into scb1003 [16:17:08] please don't restart anything there [16:17:29] right. Check out the runtime of those celery processes. [16:17:36] These are have not been restarted in hours! [16:18:45] stuck in a read syscall [16:19:10] on a pipe ? [16:19:27] strace -p 8712 [16:19:27] Process 8712 attached [16:19:27] read(161, [16:19:49] and 161 is ls -l /proc/8712/fd/161 [16:19:49] lr-x------ 1 www-data www-data 64 Jun 23 16:18 /proc/8712/fd/161 -> pipe:[18371897] [16:19:53] hmm [16:19:56] akosiaris, I was thinking this was a redis connection issue for reading from the celery queue. Does that jibe? 
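A hedged Python equivalent of the /proc spelunking above: list what each open file descriptor of a worker process points to, so a blocking read() can be matched to a pipe versus the redis socket. Linux-only, run as root or the process owner; the PID comes from ps/strace as in the log.

    import os
    import sys

    def describe_fds(pid):
        fd_dir = "/proc/%d/fd" % pid
        for fd in sorted(os.listdir(fd_dir), key=int):
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd closed between listdir() and readlink()
            print("fd %4s -> %s" % (fd, target))  # e.g. pipe:[18371897] or socket:[18295540]

    if __name__ == "__main__":
        describe_fds(int(sys.argv[1]))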
[16:20:00] (03CR) 10RobH: [C: 032] Daniel Worley shell access request [puppet] - 10https://gerrit.wikimedia.org/r/360988 (owner: 10RobH) [16:20:07] no it would be socket, not a pipe [16:20:14] Oh. Wat [16:20:38] 1 -> socket:[18356051] [16:20:44] that's probably the redis connection [16:20:50] lemme make sure [16:20:54] I need to head to the airport soon. [16:21:08] I'm going to go pack up and get ready to leave my hotel. [16:21:17] But I'll check back here every few minutes. [16:22:36] so, no the redis connection is 162 -> socket:[18295540] [16:22:45] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Access request for Daniel Worley to analytics / hadoop - https://phabricator.wikimedia.org/T168439#3375139 (10RobH) 05Open>03Resolved a:05RobH>03None No objections were noted & all other steps accomplished, so I've merged this access live. It'... [16:22:50] but definitely not what celery is stuck on [16:23:39] celery 8712 www-data 161r FIFO 0,10 0t0 18371897 pipe [16:23:43] so stuck on a FIFO [16:27:44] ok so it's that celery is stuck I think [16:28:23] that pipe is shared between a gazilion celery processes [16:29:42] akosiaris: CP is disabled because of ^ ? [16:29:49] mobrovac: yes [16:32:05] (03CR) 10Dzahn: [C: 031] "i see "tendril.wikimedia.org is an alias for dbmonitor1001.wikimedia.org.", so seems already switched and good to go" [puppet] - 10https://gerrit.wikimedia.org/r/361047 (https://phabricator.wikimedia.org/T149557) (owner: 10Jcrespo) [16:32:33] (03PS2) 10Jcrespo: dbmonitor: Remove tendril role from einstenium and tegmen [puppet] - 10https://gerrit.wikimedia.org/r/361047 (https://phabricator.wikimedia.org/T149557) [16:32:41] akosiaris: we've also tried to increase celery's logging with not a lot of success, I changed the level on scb1001 service file but didn't see increased logging in either app.log or journalctl [16:32:47] that was before the codfw switch [16:32:54] mobrovac, I think we can probably re-enable CP for ORES [16:33:18] codfw seems to be happy [16:33:55] FYI: scb1002 continues to happily process requests [16:34:03] in eqiad [16:34:09] akosiaris: ^ [16:34:20] !log restart celery ores worker on scb1003 [16:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:10] halfak: so I think I am narrowing down on a root cause [16:35:20] big words... [16:35:25] so let actually say what I found [16:35:50] * halfak reads intently :) [16:35:51] 32582, 32659, 32663, 32664, 32666 and some other processes in scb1004 [16:35:58] are stuck in an endless loop [16:36:04] they are consuming a ton of CPU [16:36:10] but I see nothing in strace [16:36:32] which means they are probably stuck doing something in user space [16:36:42] probably the endless loop I am thining [16:36:49] I am gonna gdb on one of them [16:36:54] and stacktrace it [16:37:24] scb1003 is back to producing scores [16:37:38] but!!! scb1001 is no longer [16:37:53] let me verify the symptoms are same over there [16:37:55] Yeah. I did do a manual restart on scb1001 earlier and it was short lived [16:38:04] But scb1002 has continued to be happy for a long time [16:38:13] scb1002 got restarted after the codfw switch [16:38:27] But it had a big queue to process and proceeded happily. [16:38:47] yeah 25580 on scb1001 exhibits the same symptoms I saw on scb1004 and scb1003 [16:39:07] I am more and more thinking endless loop... [16:39:14] gotcha. We've not seen this type of behavior before. 
[16:39:22] * akosiaris installing gdb [16:39:22] Let's see if codfw nodes show the same behavior [16:39:41] Because they *should* if it's user space, right? [16:40:12] Oh! scb2006 seems to have stopped processing requests [16:40:13] if the root cause is some external request, yes they should exhibiting problems [16:40:15] aaah [16:40:16] hmmm [16:40:18] looking [16:40:46] Sure enough, it seems like we're using a lot of CPU on that machine. [16:40:57] same symptoms [16:40:59] * halfak wonders what changed in the last couple of weeks. [16:41:03] akosiaris@scb2006:~$ sudo strace -p 30805 [16:41:03] Process 30805 attached [16:41:06] but no output [16:41:25] and the process is in Rl state and consuming 82.4% of cpu currently [16:41:44] So codfw switch is unlikely to be a long-lived solution [16:42:03] Did we do restarts of scb machines for the recent kernel issues? [16:42:30] akosiaris: why is CP up only on scb1001? [16:42:43] mobrovac: my guess is puppet run [16:42:53] halfak: yes, but not on scb1004 [16:43:00] so it can't be that [16:43:07] akosiaris: ok then i'll re-enable everywhere [16:43:15] ah yes that would be me (cp on scb1001, because of puppet run) [16:43:18] mobrovac: can you actually delay that please [16:43:20] ? [16:43:33] yeah sorry mobrovac I misspoke [16:43:38] Based on what akosiaris has learned [16:43:41] Sorry for the confusion [16:43:42] I 'd like to not have CP added to the noise while gdbing [16:43:58] akosiaris: ok i'll disable cp on scb1001 then [16:44:24] thanks [16:44:36] hm puppet isn't disabled on scb1001 [16:44:44] !log scb1001 disabling puppet [16:44:48] (03CR) 10Jcrespo: [C: 032] dbmonitor: Remove tendril role from einstenium and tegmen [puppet] - 10https://gerrit.wikimedia.org/r/361047 (https://phabricator.wikimedia.org/T149557) (owner: 10Jcrespo) [16:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:49] mobrovac: yeah my bad, I've reenabled puppet on scb1001 after changing celery loglevel [16:45:57] (03PS1) 10Framawiki: Planet-fr: Add frwiki Wikimag bulletin [puppet] - 10https://gerrit.wikimedia.org/r/361078 (https://phabricator.wikimedia.org/T168005) [16:46:00] oh ok go [16:46:07] of course puppet agent --disable doesn't say anything when puppet is already disabled [16:46:12] damn [16:46:16] 111 stack frames [16:46:25] #0 is _PyUnicode_ToLowercase [16:46:31] that's not good [16:46:50] please tell me we did not stumble across a python utf8 bug :-( [16:47:08] lol damn it. [16:47:52] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:50:54] akosiaris: ok for me to deploy a cp config change that would allow it to process other updates but not ores? [16:51:07] Oh wow. cp is totally down? [16:51:13] then revert later [16:51:20] yup halfak, alex disabled it [16:51:24] gotcha. [16:51:29] mobrovac: yeah, its fine [16:51:32] kk [16:51:51] akosiaris mobrovac halfak one note, since ores extension now does precaching for both goodfaith and damaging models, it's safe to disable CP for around ten wikis [16:52:33] kk Amir1, we can deal with this once the outage is over :) [16:53:07] yeah sure, I keep repeating it so we don't forget [16:54:53] Looks like ORES in codfw is having the same issue. 
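One way to capture the kind of backtrace being pulled by hand above is gdb's embedded Python API in batch mode. This is a sketch, not what was run during the incident: readable frames need the interpreter's debug symbols installed, and py-bt only works if gdb's libpython helpers are loaded, hence the fallback.

    # dump_bt.py -- run as:  gdb -p <celery-worker-pid> -batch -x dump_bt.py
    import gdb  # only importable from inside gdb

    gdb.execute("set pagination off")
    print(gdb.execute("bt 40", to_string=True))        # C-level frames (e.g. sre_lib.h)
    try:
        print(gdb.execute("py-bt", to_string=True))    # Python-level frames, if available
    except gdb.error:
        print("py-bt not available (libpython gdb helpers not loaded)")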
[16:55:23] PROBLEM - HTTPS-tendril on einsteinium is CRITICAL: SSL CRITICAL - failed to verify tendril.wikimedia.org against icinga.wikimedia.org [16:55:26] (03PS1) 10RobH: stat1005 install module update [puppet] - 10https://gerrit.wikimedia.org/r/361079 [16:55:35] Yup. we are fully down again [16:55:45] akosiaris, ^ [16:55:54] I'm just about to leave for the airport >:( [16:56:28] I can be here for some time [16:56:31] ok [16:56:34] We got about an hour of uptime from the CODFW switch [16:56:35] but you are a better help than me [16:56:35] I am running a bit crazy [16:56:44] (03PS2) 10RobH: stat1005 install module update [puppet] - 10https://gerrit.wikimedia.org/r/361079 [16:56:53] (03CR) 10RobH: [C: 032] stat1005 install module update [puppet] - 10https://gerrit.wikimedia.org/r/361079 (owner: 10RobH) [16:56:55] looks like a look in Modules/sre_lib.h ? [16:57:16] weird [16:57:31] Something happened at 1345 UTC that started all this. [16:59:14] perhaps a revision with a weird set of chars or something? (if we are still on the utf8 route) [16:59:17] this can't be [16:59:26] it's looping over sre_lib.h [16:59:37] mobrovac: yeah that's the only logical route I can think of [16:59:56] !log mobrovac@tin Started deploy [changeprop/deploy@1f45fae]: Temporary disable ORES (ongoing outage) [17:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:16] I would suggest to disable ores review tool in enwiki/wikidatawiki as most of hits are from there [17:00:40] but ores servers under-performing for another reason and it's not DoS [17:01:06] Amir1, I don't think that's going to help anything [17:01:12] RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational [17:01:12] RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational [17:01:13] RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational [17:01:16] !log mobrovac@tin Finished deploy [changeprop/deploy@1f45fae]: Temporary disable ORES (ongoing outage) (duration: 01m 19s) [17:01:22] RECOVERY - Check systemd state on scb2004 is OK: OK - running: The system is fully operational [17:01:22] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational [17:01:22] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational [17:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:32] RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational [17:01:52] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [17:02:02] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [17:02:05] yeah I agree but I can't think of anything else [17:02:18] halfak: go, I don't want you to lose the plane [17:02:26] I like akosiaris' trajectory [17:02:29] oh yeah! [17:02:31] * halfak runs [17:02:42] I'll log back on to check in once I get to sit down [17:02:43] o/ [17:02:51] <3 you all. Thanks for your hard work on this :) [17:07:21] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3375220 (10RobH) [17:07:44] (03CR) 10Chad: "I have no desire to break gerrit." 
[puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [17:10:07] akosiaris: what can I do to help? I'm running of ideas/things to do [17:12:25] Amir1: how are your gdb skills ? [17:12:36] I could use some help to figure out what's going on [17:12:44] even if it's just to bounce ideas back and forth [17:12:53] akosiaris: I'm here too if that helps [17:12:58] I don't know much [17:13:19] so I got gdb running with debug symbols on scb1004 [17:13:28] and source download so I know everyline that runs [17:13:31] and guess what [17:13:38] I never get to reach python code [17:13:45] I knee deep in C code [17:14:01] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3375251 (10jcrespo) So the switchover has happened- for future reference: 1) failover the dns 2) enable LE on puppet 3) run puppet. Becau... [17:14:24] have we recently changed python version? updated to a newer version? [17:14:35] ah, where you able to decode further the frames from gdb? [17:14:49] godog: yes I can actually see the code now [17:15:15] godog: I am looping through sre_lib.h as it seems [17:15:56] I think the lowest line I can see is 517 [17:15:57] sigh, how does the whole stack look like? [17:16:23] lol [17:16:26] lemme paste it [17:16:47] https://p.defau.lt/?Uqrabdz8bSQGl1EOdQ9i6g [17:16:56] 213 frames currently [17:17:49] for the love of [17:17:57] aren't .h files supposed to be header files ? [17:18:04] why is this file full of C code ? [17:18:19] maybe Amir1 is on to something about python version [17:18:55] nope [17:19:00] last update on Ct 2016 [17:19:04] Oct 2016* [17:19:12] so scratch that out [17:19:17] heh in terms of software changes on scb I can only think of recent upgrades/reboots [17:19:44] it's definitely not related to hardware [17:19:55] and I am not touching kernel space in gdb so it's not that [17:20:06] in fact I don't seem to be getting ever out of sre_lib.h [17:20:20] akosiaris: yeah not a whole lot of help from the full stack frame [17:21:01] time to start looking for known bugs [17:21:23] functions in that file are huge [17:22:11] maybe if we could fine the input to reproduce that ... [17:22:14] find* [17:22:46] I'm poking around in scb1002, there seems to be only one celery process there spinning the cpu [17:23:02] yeah that's why I am saying it's input related [17:23:22] something in re.search() ? re.match() ? [17:24:07] (gdb) finish [17:24:07] Run till exit from #0 sre_ucs1_match.lto_priv () at ../Modules/sre_lib.h:517 [17:24:11] so... 
stuck in there [17:24:19] at least we don't have to jump between files [17:24:22] PROBLEM - HHVM rendering on mw2199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:40] lol [17:25:12] RECOVERY - HHVM rendering on mw2199 is OK: HTTP OK: HTTP/1.1 200 OK - 79767 bytes in 0.295 second response time [17:27:01] ah [17:27:15] so if I force a return from the function [17:27:30] the processes completes successfully and returns to normal operations [17:27:52] but somehow this is killing the inter celery IPC [17:28:02] seems like it's being done over a pipe [17:28:49] for example after getting it unstuck from the busy loop [17:28:59] 32582 is now stuck on read(13,) [17:29:11] with that being /proc/32582/fd/13 -> pipe:[631237833] [17:29:31] and sudo lsof |grep 631237833 spewing out a ton of processes [17:30:02] anyway, my current hypothesis is some celery children get stuck on that endless loop [17:30:16] and somehow the celery IPC gets screwed up [17:30:28] resulting in practically no process actually working [17:31:14] ack, seems reasonable to me [17:31:51] so that's almost surely user input related [17:32:03] something in the input for celery makes it go caput [17:32:05] akosiaris, is that a regexp issue? fwiw, maybe an unrelated comment, but, in parsoid, we have run into a few instances of bad regexps ... with a pattern like (..+|..)* [17:32:15] that cause pathological backtracking [17:32:20] on certain inputs [17:32:29] there was also a libc update that came with the kernel update, perhaps related? [17:32:33] fixed that by changing regexps to (..|..)* .. that is removing the "+" [17:33:02] PROBLEM - pdfrender on scb1004 is CRITICAL: connect to address 10.64.48.29 and port 5252: Connection refused [17:33:03] subbu: it's definitely re related. We had a similar issues last year with perl res and utf8 [17:33:25] I thought of pathological backtracing too [17:33:29] that's me ^... stopped the wrong thing [17:33:30] so, that is a good low hanging fruit to check for .. if there is a regexp that is of that form. [17:34:02] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [17:34:03] 10Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3375285 (10RobH) p:05Low>03Normal a:05RobH>03Ottomata [17:34:29] so https://github.com/python/cpython/blob/v3.4.2/Modules/sre_lib.h#L493 [17:34:36] that's must be re.match(), right ? [17:34:57] 10Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3264256 (10RobH) System is installed, puppet/salt accepted, and ready to have software/services implemented. Assigning to @Ottomata for implementation. [17:35:11] sounds like it akosiaris [17:35:20] Amir1, https://github.com/wikimedia/parsoid/commit/9f02169f546fba140c4f01e5e7a9c6eee818f112 is what we had to fix in parsoid. [17:35:32] 10Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3375291 (10RobH) a:05Ottomata>03RobH Turns out needs jessie, taking back for reimage. [17:35:42] thanks [17:35:45] we ran into that pattern a few times before we caught them all. [17:35:56] the problem in ores is that it has tons of regexes in it [17:36:28] and disabling all regexes will take days [17:36:38] you need to rebuild all models, etc. 
[17:36:50] (03PS1) 10RobH: stat1005 needs jessie [puppet] - 10https://gerrit.wikimedia.org/r/361082 [17:36:51] If I know the exact regex, I can do something [17:37:07] i am suggesting that as one possible place to audit, not necessarily disable something. [17:37:19] Generally we shouldn't get stuck on something like this anyway. We should get timeouts. [17:38:07] Amir1, check on the new regexes for huwiki [17:38:08] anyway, take it fwiw ..i paged in and watched the discussion and noticed that akosiaris found a stack trace that looked like a regexp and got reminded of what we had run into. [17:38:11] (03CR) 10RobH: [C: 032] stat1005 needs jessie [puppet] - 10https://gerrit.wikimedia.org/r/361082 (owner: 10RobH) [17:38:15] They are all pretty simple [17:38:21] subbu: no that's actually quite helpful [17:38:25] thanks for the input [17:38:32] I was thinking more of a unicode bug [17:38:42] but that was my skewed bias from prior experience [17:38:44] Yeah. Was in lower, right? [17:39:15] We could write a script to try to lower all Unicode chars and see where it gets stuck [17:39:23] damn ... how do I print a Py_UCS4 in gdb [17:39:34] it's the pattern according to gdb [17:39:37] but it says 16 [17:39:40] not really useful [17:39:45] halfak|Mobile: isn't it related to the fart thing we did? [17:39:51] (gdb) p pattern [17:39:52] $5 = (Py_UCS4 *) 0x12b6f4d8 [17:39:52] (gdb) p *pattern [17:39:52] $6 = 16 [17:40:12] Right Amir1 [17:40:33] Those are the newest regexes [17:41:37] why it triggers such a huge down time after a week is still strange to me [17:42:27] typedef uint32_t Py_UCS4; [17:42:32] so it actually means that? 16 ? [17:42:37] I think so [17:42:41] I don' [17:42:55] akosiaris: hm - https://github.com/python/cpython/blob/6f0eb93183519024cb360162bdd81b9faec97ba6/Objects/unicodectype.c#L46 [17:42:58] Agreed, Amir [17:42:59] not easy to print that [17:43:12] First thing I want to figure out is why timeouts aren't working [17:43:15] I don't know about all the encoding of larger codepoints, but I'd assume "16" is straight up ascii 0x10 equivalent like UTF-8? [17:43:18] http://www.fileformat.info/info/unicode/char/0016/index.htm [17:43:29] ? [17:44:00] that's hex-16, I assume your 16 is in decimal [17:44:10] http://www.fileformat.info/info/unicode/char/0010/index.htm [17:44:13] yeah [17:44:27] so somehow the pattern is that ? [17:44:35] I must be missing something [17:44:44] ah.. it's the first char of the string [17:46:01] as in python internally uses the old DLE control char as a string start marker? [17:46:07] print (char *)pattern [17:46:07] $10 = 0x12b6f4d8 "\020" [17:51:03] damn I hate reading regular expression code... it's always full of gotos [17:51:13] 10Operations, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3264256 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by robh on neodymium.eqiad.wmnet for hosts: ``` ['stat1005.eq... [17:52:10] not to mention the fact it's always optimized by the compiler and line number are reorganized [17:53:17] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3375326 (10Dzahn) packages: It installed php5-mysql Apache modules: several but i checked all of them are ALSO needed by Icinga, so no r... [17:55:50] We should probably blog about this when we get it worked out. 
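On the "we should get timeouts" point above: Celery has built-in task time limits that would bound this kind of runaway worker. The sketch below is illustrative, not ORES's actual configuration (setting names follow newer Celery releases; the app, broker URL and scoring function are stand-ins). Notably, the hard limit still works when a worker is stuck inside a C-level regex loop, because the parent process kills the child; the soft limit is delivered as an in-process exception and cannot fire until the C call returns.

    from celery import Celery
    from celery.exceptions import SoftTimeLimitExceeded

    app = Celery("ores_sketch", broker="redis://localhost:6379/0")
    app.conf.task_soft_time_limit = 15   # raise SoftTimeLimitExceeded inside the task
    app.conf.task_time_limit = 30        # hard-kill the worker child after 30 seconds

    def run_feature_extraction(rev_id):
        # Stand-in for the real feature-extraction / model-scoring step.
        return {"rev_id": rev_id, "score": None}

    @app.task
    def score_revision(rev_id):
        try:
            return run_feature_extraction(rev_id)
        except SoftTimeLimitExceeded:
            return {"rev_id": rev_id, "error": "timed out"}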
[17:56:25] akosiaris: Amir1: let Pchelolo know if it will be safe to re-enable ores rules in CP, I have to run now and will be back later [17:56:36] okay [17:59:14] ok [18:03:37] (03PS1) 10Framawiki: Enable Quiz extension on huwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361084 (https://phabricator.wikimedia.org/T168471) [18:04:36] (03PS1) 10RobH: stat1006 install params [puppet] - 10https://gerrit.wikimedia.org/r/361085 [18:05:12] (03CR) 10RobH: [C: 032] stat1006 install params [puppet] - 10https://gerrit.wikimedia.org/r/361085 (owner: 10RobH) [18:06:37] I'm semi-afk for dinner [18:06:42] ping me if I can do anything [18:09:55] PROBLEM - Check systemd state on install1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:10:22] bleh, i broke dhcp on installers, fixing [18:12:15] (03PS1) 10RobH: fixing typo in dhcpd lease file [puppet] - 10https://gerrit.wikimedia.org/r/361086 [18:12:40] (03CR) 10RobH: [C: 032] fixing typo in dhcpd lease file [puppet] - 10https://gerrit.wikimedia.org/r/361086 (owner: 10RobH) [18:15:55] RECOVERY - Check systemd state on install1002 is OK: OK - running: The system is fully operational [18:16:06] akosiaris: I tried gdb'ing a few celeries on scb2001 and then going up the stack until sre_search [18:16:41] interestingly print (char *)state->beginning [18:16:47] 10Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3375411 (10RobH) p:05Low>03Normal [18:16:51] prints always the same string, regardless of the process [18:17:27] I think part of this edit https://es.wikipedia.org/w/index.php?title=Relieve_terrestre&type=revision&diff=100032810&oldid=100032572 [18:17:53] interesting [18:18:43] maybe the unbalanced { greatly confuses things? [18:19:05] the edit happened today so that might be it [18:19:20] might ? I think you on to something ! [18:19:46] heheh I don't know how to bypass that edit tho, Amir1 halfak|Mobile [18:20:43] godog: what do you mean by bypassing? [18:21:13] I mean making sure ores doesn't see it or otherwise doesn't process it [18:21:21] varnish? [18:21:34] change prop is behind varnish though [18:21:56] I don't know how to make CP don't touch the edit [18:22:09] Amir1: no I think he means being able to blacklist a revision [18:22:17] fully skip it [18:22:25] Put something in the cache for it [18:22:34] akosiaris: we don't have that functionality for now [18:22:40] halfak|Mobile: that's a good idea [18:22:43] actually halfak|Mobile had a nice idea [18:22:46] :-) [18:22:55] Just got through security [18:22:56] Amir1: can you do that ? [18:23:03] akosiaris: yeah sure [18:23:10] nice idea! [18:23:11] I 'll try in the meantime to reproduce the issue [18:23:57] I want to directly put it into redis [18:23:58] Just realized I've eaten nothing but coffee today... [18:24:59] :(((( [18:25:04] hugs dude [18:26:19] (03PS5) 10Herron: Add logrotate template to retain 60 days of exim mx logs [puppet] - 10https://gerrit.wikimedia.org/r/357723 (https://phabricator.wikimedia.org/T167333) [18:29:37] Amir1: any joy with putting that revision in cache? 
[18:29:46] I'm connected [18:30:03] trying to finish it, I need to get some samples first [18:30:12] ack, let me know if I can help [18:32:25] (03PS1) 10Dzahn: planet: give higher priority to English Apache site [puppet] - 10https://gerrit.wikimedia.org/r/361087 [18:32:40] godog akosiaris added a key and value for revision 100032810 in enwiki with some value [18:32:46] for goodfaith and damaging [18:32:59] it's in eqiad. Shall I do it for codfw? [18:33:12] yes please [18:33:25] yup [18:33:25] I don't know how the ores cache works, though the revid I see in the logs is 100032572 [18:33:37] (03CR) 10jerkins-bot: [V: 04-1] planet: give higher priority to English Apache site [puppet] - 10https://gerrit.wikimedia.org/r/361087 (owner: 10Dzahn) [18:33:40] (03PS2) 10Dzahn: planet: give higher priority to English Apache site [puppet] - 10https://gerrit.wikimedia.org/r/361087 [18:33:52] well I guess we could do both [18:33:53] (03PS2) 10Dzahn: icinga: add plugin to check exim queue sizes [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) [18:33:59] I'm doing both [18:34:00] Amir1: 100032810 ? [18:34:07] that's the new, no ? [18:34:16] from the link above [18:34:17] https://es.wikipedia.org/w/index.php?title=Relieve_terrestre&type=revision&diff=100032810&oldid=100032572 [18:34:26] ah you mean both revisions ? [18:34:26] ok [18:34:28] wait, that's eswiki [18:34:40] oh you added enwiki ? ;-) [18:34:56] Sorry [18:34:56] (03CR) 10jerkins-bot: [V: 04-1] planet: give higher priority to English Apache site [puppet] - 10https://gerrit.wikimedia.org/r/361087 (owner: 10Dzahn) [18:35:02] no worries [18:35:09] if it works we are having a party [18:35:25] * akosiaris still trying to reproduce this, I am failing for now [18:35:40] heheh yeah we should have all the parties if it works [18:36:04] eswiki has only one model [18:37:19] 10Operations, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3375431 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['stat1005.eqiad.wmnet'] ``` and were **ALL** successful. [18:37:49] added for only reverted models in both dc s [18:38:31] Amir1: ack, I'll roll restart celery in codfw [18:39:14] !log roll restart celery-ores-worker in codfw [18:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:48] godog: make sure they are actually dead [18:40:54] they would not die on me before [18:41:24] theoretically systemd should avoid that, but somehow I had to issue the command twitch [18:41:27] twice* [18:41:39] * halfak|Mobile watches on with interest [18:41:45] yeah good call, it takes a while and systemd I think resorts to sigkill [18:41:52] Once it's up there, I'm going to request them and they need to return 0.927146989310164 otherwise it means it's not hitting the cache [18:42:38] Amir1: can you try that rev on wmflabs ores? 
[18:42:58] Maybe staging instead [18:43:04] :D [18:43:11] yeah, don't want a dead ores in labs too [18:43:16] Lol [18:43:52] Amir1: roll restart completed [18:44:14] halfak|Mobile: staging is not answering to that request [18:44:23] it times out though [18:44:42] \o/ [18:44:54] A weird thing to celebrate [18:45:15] when I try it in prod, it errors that it can't understand the json from cache which is fine [18:45:16] https://ores.wikimedia.org/v2/scores/eswiki/reverted/100032572 [18:45:25] because it hits the fucking cache [18:45:59] for the record I did oresrdb.svc.codfw.wmnet:6380> MSET ores:eswiki:reverted:100032572:0.3.0 "{\"prediction\": false, \"probability\": {\"false\": 0.927146989310164, \"true\": 0.072853010OK9836}}" [18:46:02] Only problem is that changeprop will probably retry [18:46:10] No big deal [18:47:19] scb cpu under control so far [18:47:21] delimiter: line 1 column 86 (char [18:47:36] but that does not matter much [18:47:57] if we actually shortcircuited the problem I am already happy [18:48:21] I'm checking grafana before opening up the champagne [18:48:50] ah we will open the actual champagne when we reproduce and get a CVE for that :-) [18:49:08] and of course this is blog post material [18:49:25] but let's actually fix the problem at hand for sure, have a nice weekend and reproduce on Monday [18:49:48] six minutes without overload [18:49:52] heheh "processing input from the internet considered harmful" [18:49:58] (03PS3) 10Dzahn: planet: give higher priority to English Apache site [puppet] - 10https://gerrit.wikimedia.org/r/361087 [18:50:12] 10Operations, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3375476 (10RobH) a:05RobH>03Ottomata reimage complete, system ready for you to add to site.pp for specific roles! [18:50:13] seven [18:50:42] there are various celery worker processes on scb boxes that in 100% CPU [18:50:48] but leftovers [18:50:56] times are 15:57, 16:43, 14:27 [18:51:05] godog: that's what I meant ^ [18:51:10] look at scb1001 for example [18:51:17] ps auxww |grep celery | grep 'Rl' [18:51:26] www-data 25779 99.8 4.1 2594976 1352636 ? Rl 15:57 172:2 [18:51:33] and another 5 like that [18:51:42] eight [18:51:49] (03CR) 10Dzahn: [C: 032] planet: give higher priority to English Apache site [puppet] - 10https://gerrit.wikimedia.org/r/361087 (owner: 10Dzahn) [18:51:54] (03PS4) 10Dzahn: planet: give higher priority to English Apache site [puppet] - 10https://gerrit.wikimedia.org/r/361087 [18:52:38] nine [18:53:06] (03CR) 10Dzahn: [C: 031] "> In trusty and jessie it gets installed as part of the d-i base install" [puppet] - 10https://gerrit.wikimedia.org/r/360789 (owner: 10Dzahn) [18:53:19] akosiaris: in codfw celery processes seem all freshly restarted tho afaics [18:53:37] I did cumin -b2 'scb2*' 'systemctl restart celery-ores-worker' [18:54:24] godog: have you also restarted eqiad ? [18:54:36] or not ? cause if not, it's my misunderstanding [18:54:42] (gdb) print self->pattern->ob_type [18:54:42] $34 = (struct _typeobject *) 0xa02b00 [18:54:43] grrr [18:54:47] akosiaris: no I thought you'd be still looking [18:54:50] still trying to figure out the pattern [18:54:58] godog: actually... yeah don't... [18:55:08] let's not lose our guinea pig [18:55:10] good idea [18:55:12] I wish I was so my desk [18:55:17] *at [18:55:33] halfak|Mobile: don't worry, have a safe flight [18:55:41] Still aiming traffic at codfw, right? 
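The MSET pasted above got mangled by an interleaved redis "OK" reply, which is why ORES later complained about the JSON delimiter. A cleaned-up sketch of the same pre-seeding trick, using the key format visible in the log ("ores:<wiki>:<model>:<rev_id>:<model version>") on the score-cache instance; the score values are deliberately made up, and whether ORES accepts any well-formed JSON here is an assumption.

    import json
    import redis

    cache = redis.StrictRedis(host="oresrdb.svc.codfw.wmnet", port=6380)  # score cache, per the log

    key = "ores:eswiki:reverted:100032572:0.3.0"
    placeholder = {"prediction": False,
                   "probability": {"false": 0.93, "true": 0.07}}  # illustrative values
    cache.set(key, json.dumps(placeholder))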
[18:55:48] we have twelve minutes without overload [18:55:59] We made it 50 mins before [18:56:06] 10Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3375499 (10RobH) [18:56:47] My first test is going to be "why didn't this time out like it was supposed to" [18:56:51] that was a whole new datacenter, by one restart we never made more than five minutes [18:56:52] (03PS2) 10Dzahn: base: add psmisc to standard_packages [puppet] - 10https://gerrit.wikimedia.org/r/360789 [18:57:08] https://github.com/python/cpython/blob/master/Include/object.h#L346 .... how an earth am I supposed to get the pattern out of that ? [18:57:16] halfak|Mobile: we need to run lots of tests on that revision [18:57:40] that's like a struct with 50 members most of which are damned structs [18:57:46] structs all the way down [18:57:47] one thing is that one experimental node indeed times out but maybe the worker doesn't understand it or celery [18:58:56] (03CR) 10Dzahn: [C: 032] "this is already installed on all jessie and trusty, it will just add it on stretch systems that would otherwise be missing these commands " [puppet] - 10https://gerrit.wikimedia.org/r/360789 (owner: 10Dzahn) [18:58:58] akosiaris: lol, the rabbit hole has infinite depth [18:59:47] (03CR) 10Herron: "Added puppet header, changed tabs to spaces and changed to file type. Here are the compiler results after these updates https://puppet-co" [puppet] - 10https://gerrit.wikimedia.org/r/357723 (https://phabricator.wikimedia.org/T167333) (owner: 10Herron) [19:01:47] (03CR) 10Dzahn: [C: 031] Add logrotate template to retain 60 days of exim mx logs [puppet] - 10https://gerrit.wikimedia.org/r/357723 (https://phabricator.wikimedia.org/T167333) (owner: 10Herron) [19:02:56] I'm going to be afk, I'll be back in fifteen minutes [19:03:08] ok, we 'll keep an eye [19:03:33] I can see 500 only for that url now and since 18:40, nice [19:03:41] that url == https://ores.wikimedia.org/scores/eswiki/?revids=100032572&models=reverted [19:04:14] I'll be afk too for a bit, dinner [19:04:22] yeah, because the json I put for cache is not valid but who cares [19:05:32] (03PS2) 10Dzahn: Planet-fr: Add frwiki Wikimag bulletin [puppet] - 10https://gerrit.wikimedia.org/r/361078 (https://phabricator.wikimedia.org/T168005) (owner: 10Framawiki) [19:19:31] no overloads for 40 minutes [19:20:19] and the CPU usage on scb2xxx is very good [19:27:12] indeed! I think we can call it a victory [19:27:32] * godog jinxes it and runs [19:28:43] lol [19:29:14] I am gonna restart the celery workers on scb1001, scb1002, scb1003. I 'll leave scb1004 out for now [19:29:20] ack [19:29:23] I am still gdbing there [19:29:57] no joy yet with the pattern heh? 
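The leftover workers pegging a CPU were found above with ps auxww | grep celery | grep 'Rl'; here is a psutil-based sketch of the same check (psutil is a third-party library; the CPU threshold and one-second sampling window are arbitrary).

    import time
    import psutil  # third-party

    def spinning_celery_workers(cpu_threshold=90.0):
        workers = [p for p in psutil.process_iter(attrs=["cmdline"])
                   if "celery" in " ".join(p.info.get("cmdline") or [])]
        for p in workers:
            p.cpu_percent(None)        # prime the per-process CPU counters
        time.sleep(1.0)
        for p in workers:
            try:
                if p.status() == psutil.STATUS_RUNNING and p.cpu_percent(None) >= cpu_threshold:
                    print(p.pid, time.ctime(p.create_time()), " ".join(p.info.get("cmdline") or []))
            except psutil.NoSuchProcess:
                continue

    if __name__ == "__main__":
        spinning_celery_workers()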
[19:30:58] (03CR) 10Alexandros Kosiaris: [C: 031] Change wikimedia.org SPF record to soft fail (~all) [dns] - 10https://gerrit.wikimedia.org/r/358132 (https://phabricator.wikimedia.org/T133191) (owner: 10Herron) [19:31:26] (03PS3) 10Dzahn: Planet-fr: Add frwiki Wikimag bulletin [puppet] - 10https://gerrit.wikimedia.org/r/361078 (https://phabricator.wikimedia.org/T168005) (owner: 10Framawiki) [19:31:38] I'm off, have my phone with me though [19:34:39] !log restart celery-ores-workers on scb1001, scb1002, scb1003, leave scb1004 alone [19:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:01] (03CR) 10Dzahn: [C: 032] Planet-fr: Add frwiki Wikimag bulletin [puppet] - 10https://gerrit.wikimedia.org/r/361078 (https://phabricator.wikimedia.org/T168005) (owner: 10Framawiki) [19:35:13] Pchelolo: can you please undo the ORES disabling in changeprop ? [19:35:20] Nice [19:35:36] akosiaris: sure, will deploy shortly [19:35:47] Pchelolo: thanks! [19:37:13] !log ppchelko@tin Started deploy [changeprop/deploy@ffabd13]: Re-enable ORES rules back [19:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:56] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:38:20] !log ppchelko@tin Finished deploy [changeprop/deploy@ffabd13]: Re-enable ORES rules back (duration: 01m 07s) [19:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:14] (03CR) 10Hashar: [C: 031] Set collation for Romanian wikis to uca-ro [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) (owner: 10Strainu) [19:45:56] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [19:51:06] akosiaris: ok to enable puppet on scb back? [19:51:18] mobrovac: yeah I 'll do so now [19:51:22] kk thnx [19:51:35] I 'll leave it disabled on scb1004 though [19:51:39] still gdbing on that one [19:51:43] oh ok [19:53:03] so the problem is that the initial { is not included in the diff? [19:53:14] RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational [19:53:21] we don't know yet.. still trying to reproduce the issue [19:53:29] but it's clear it's that damn change [19:53:48] I am trying to figure out the damn re pattern [19:54:28] akosiaris: I can get you some of regexes that happens on eswiki only [19:54:41] and you can run it locally to see if it's the cause [19:54:44] if that helps [19:55:15] yeah that's an alternate path forward... if I don't get anywhere dumping memory with gdb I 'll ask you to probably [19:55:19] I 'll let you know [19:56:45] Amir1, can you point me to the eswiki regexps .. curious if i see something there. [19:56:59] sure [19:57:02] congrats on the victory. :) [19:57:11] Thank Alex [19:57:26] I was going to get it but saw my facebook, sorry [19:58:09] https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/spanish.py [19:58:14] subbu: akosiaris ^ [20:02:44] what was the edit that caused the problem? [20:03:26] https://es.wikipedia.org/w/index.php?title=Relieve_terrestre&type=revision&diff=100032810&oldid=100032572 subbu [20:03:38] "(?:b+l+a+h*)+", "(l+[uo]+l+)([uo]+l+)*, "j+[eaiou]+(j+[aeiou]*)*" ... [20:04:18] look at that jajajajjaja ... that seems to implicate that last regexp? 
[20:05:36] 'This particular pattern (..+)* has been responsible for several instances of non-terminating regexps in Parsoid' from https://github.com/wikimedia/parsoid/commit/9f02169f546fba140c4f01e5e7a9c6eee818f112 ... [20:05:54] so, my bet is on that one ... but, worth looking at 3 closely. [20:06:09] *at all three [20:09:15] off to a coffee shop. back online later. [20:11:57] akosiaris, Amir1 mobrovac fyi ^^^ [20:12:19] have fun [20:12:28] thnx subbu [20:12:31] oh, i meant: the regexp ;-) [20:12:35] bye for now. [20:27:55] never mind. at least in node.js, that re seems to process pretty quickly. [20:39:15] Amir1, akosiaris mobrovac actually that is exactly it .. i had forgotten the [20:39:18] 'i' flag [20:39:21] it hangs in node.js [20:39:59] https://gist.githubusercontent.com/subbuss/e1387aa11c2318c93c5639cf7e9437d2/raw/e10ccc0ecfe236da1389cc8059782d696e8fdd51/gistfile1.txt [20:41:13] replace "j+[eaiou]+(j+[aeiou]*)*" with "j+[eaiou]+(j[aeiou]*)*" and all is hunky dory [20:42:36] nice find subbu! [20:42:38] sooo subtle [20:43:15] worth auditing all regexps in ores for this pattern and fixing them. [20:43:35] fantastic [20:43:37] Thanks [20:43:42] Yeah, we should [20:43:57] some sort of CI tests also would be great [20:44:33] CI might be difficult since you don't know what patterns might trigger bad regexps. [20:49:38] subbu: nice find! [20:50:25] yeah [20:50:31] just thinking out loud [20:53:37] the problem with (j+)* is that the # of possible match paths for jjjj explodes combinatorially ... so, on failures, there is backtracking along each of those paths. [20:54:14] anyway .. it is easy to make this mistake and we learnt this in parsoid by making the same mistake many times. :) [20:55:43] I can't reproduce it in python yet though [20:56:38] akosiaris, you need the boundary terminators and the i flag .. /(^|\s)(...)(\s|$)/i .. my first attempts missed both and only when I added them both, did it trigger. [20:57:01] but, this is in node.js [20:57:42] ah hmm it's my locale I think [20:58:01] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): logstash mapping mixing up field types - https://phabricator.wikimedia.org/T165137#3375753 (10debt) 05Open>03Invalid Cool, closing as invalid, thanks for taking a look. :) [21:03:28] reproduced in python as well [21:04:09] and the fix works. [21:05:10] https://gist.githubusercontent.com/subbuss/12746448368188ce5c3f72d62de61d24/raw/bc238c7f5e4f59d4aa654811b4aa5dd1f7805811/gistfile1.txt [21:08:28] wait those regexes get compiled into one ? [21:08:53] I was trying with just a=r"j+[eaiou]+(j+[aeiou]*)*" [21:09:18] and I should have been able to reproduce it ... [21:12:30] akosiaris, that fails for me with this as well >>> p = re.compile("(^|\s)j+[eaiou]+(j+[aeiou]*)*(\s|$)", re.IGNORECASE) [21:12:56] hmmm [21:13:04] akosiaris, i assumed they get compiled into one. :) [21:13:20] i expect they are .. haven't read the code. 
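A self-contained reproduction sketch along the lines of subbu's gists above, comparing the implicated pattern with the fix (inner + removed). The test string and the subprocess timeout harness are illustrative: a long run of j's that ultimately fails to match forces the backtracking engine to try every way of splitting the run between the outer and inner j quantifiers.

    import multiprocessing as mp
    import re

    BAD   = r"(^|\s)j+[eaiou]+(j+[aeiou]*)*(\s|$)"  # nested quantifier: j+ inside (...)*
    FIXED = r"(^|\s)j+[eaiou]+(j[aeiou]*)*(\s|$)"   # the fix: drop the inner +
    TEXT  = " ja" + "j" * 64 + "!"                  # trailing "!" guarantees the match fails

    def _search(pattern):
        re.search(pattern, TEXT, re.IGNORECASE)

    def finishes_within(pattern, seconds=2.0):
        p = mp.Process(target=_search, args=(pattern,))
        p.start()
        p.join(seconds)
        if p.is_alive():
            p.terminate()
            p.join()
            return False
        return True

    if __name__ == "__main__":
        print("fixed pattern finishes within 2s:", finishes_within(FIXED))  # True
        print("bad pattern finishes within 2s:  ", finishes_within(BAD))    # False

Auditing the other language files for the same (x+...)* shape, as suggested above, is the more general follow-up.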
[21:15:42] (03PS1) 10Dzahn: planet: add HTML page/item/feed templates for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/361180 [21:16:52] (03CR) 10jerkins-bot: [V: 04-1] planet: add HTML page/item/feed templates for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/361180 (owner: 10Dzahn) [21:19:59] (03PS2) 10Dzahn: planet: add HTML page/item/feed templates for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/361180 (https://phabricator.wikimedia.org/T168490) [21:32:57] (03PS3) 10Dzahn: planet: add HTML page/item/feed templates for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/361180 (https://phabricator.wikimedia.org/T168490) [21:38:34] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 344.17 seconds [21:43:16] (03PS4) 10Dzahn: planet: add HTML page/item/feed templates for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/361180 (https://phabricator.wikimedia.org/T168490) [21:44:16] (03CR) 10jerkins-bot: [V: 04-1] planet: add HTML page/item/feed templates for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/361180 (https://phabricator.wikimedia.org/T168490) (owner: 10Dzahn) [21:46:15] (03PS5) 10Dzahn: planet: add HTML page/item/feed templates for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/361180 (https://phabricator.wikimedia.org/T168490) [21:46:18] 10Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3375927 (10RobH) a:05RobH>03Ottomata [21:46:38] 10Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3264224 (10RobH) Assigned to @Ottomata for implementation. [21:49:03] (03PS6) 10Dzahn: planet: add HTML page/item/feed templates for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/361180 (https://phabricator.wikimedia.org/T168490) [21:49:15] (03CR) 10Dzahn: [C: 032] "finally it's right http://puppet-compiler.wmflabs.org/6846/" [puppet] - 10https://gerrit.wikimedia.org/r/361180 (https://phabricator.wikimedia.org/T168490) (owner: 10Dzahn) [22:06:03] (03PS1) 10Dzahn: planet: fix syntax error, source vs. content with templates [puppet] - 10https://gerrit.wikimedia.org/r/361184 [22:08:23] (03CR) 10Dzahn: [C: 032] planet: fix syntax error, source vs. content with templates [puppet] - 10https://gerrit.wikimedia.org/r/361184 (owner: 10Dzahn) [22:08:28] (03PS2) 10Dzahn: planet: fix syntax error, source vs. 
content with templates [puppet] - 10https://gerrit.wikimedia.org/r/361184 [22:11:34] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 11.11 seconds [22:12:24] PROBLEM - HHVM rendering on mw2186 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:13:14] RECOVERY - HHVM rendering on mw2186 is OK: HTTP OK: HTTP/1.1 200 OK - 79677 bytes in 0.272 second response time [22:26:21] (03PS1) 10Dzahn: planet: fix one more content/template syntax issue [puppet] - 10https://gerrit.wikimedia.org/r/361185 [22:28:07] (03CR) 10Dzahn: [C: 032] planet: fix one more content/template syntax issue [puppet] - 10https://gerrit.wikimedia.org/r/361185 (owner: 10Dzahn) [23:08:14] (03PS3) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) [23:22:02] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/6847/mx1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [23:26:40] (03PS4) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) [23:28:47] (03PS3) 10Dzahn: gerrit: add firewall rule for nrpe (DO NOT MERGE) [puppet] - 10https://gerrit.wikimedia.org/r/351546 (owner: 10Paladox) [23:32:19] (03CR) 10Dzahn: [C: 04-2] "i just changed the commit message to what it is. what was this for? Icinga monitoring via NRPE in labscloud?" [puppet] - 10https://gerrit.wikimedia.org/r/351546 (owner: 10Paladox) [23:32:45] subbu: I 've just verified independently that you were right. I 've dumped the regexp pattern byte by byte using gdb [23:32:48] subbu: https://gist.github.com/akosiaris/69c0b67677372e471d16a652a2b8c9b3 [23:33:03] great. :) [23:33:25] so yeah you solved it nicely. thanks for the help! really appreciated [23:33:37] i submitted this PR earlier this evening .. https://github.com/wiki-ai/revscoring/compare/master...subbuss:master [23:33:56] turns out btw that without the long re I can't reproduce it [23:33:57] enwiki regexp also has some which could potentially trigger [23:34:21] yeah I guess it's extremely possible there are many that could trigger [23:35:06] akosiaris, ya .. when i looked at your stack trace top elt and googled the file, it was regexp code and immediately i had the hunch it was the same thing we had tripped on a few times. [23:35:14] :D [23:35:16] (03Abandoned) 10Paladox: icinga: Remove icinga-init.sh file, its now provided by the package [puppet] - 10https://gerrit.wikimedia.org/r/347899 (owner: 10Paladox) [23:35:27] anyway, have a good evening! :) i am heading out. it is probably late for you. [23:35:35] btw, I 've even figured out that the JAJAJAJAJAJA alone is not enough. it needs a swear word as well [23:35:43] which is also covered in the re [23:36:02] anyway I am off to bed [23:36:08] thanks for the help! have a nice time [23:36:16] good night! yw. [23:42:22] !log bounce celery-ores-worker on scb1004 [23:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:40] o/ [23:42:46] Just landed. [23:43:02] halfak|Mobile: we are ok. relax. go home, have a shower [23:43:13] (03PS1) 10Dzahn: netmon1002: add smokeping role [puppet] - 10https://gerrit.wikimedia.org/r/361191 (https://phabricator.wikimedia.org/T159756) [23:43:20] halfak|Mobile: we recap this on Monday [23:43:22] :-) [23:43:35] Ok cool.
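The audit suggested earlier for the remaining revscoring language patterns (and the note that the enwiki patterns have candidates too) could start with something as blunt as a textual scan for a quantified group whose body already contains a '+', i.e. the (..+)* shape called out in the Parsoid commit referenced above. This is a rough, hypothetical heuristic, not the revscoring test suite, and deliberately crude: a real audit would parse the pattern rather than grep its source, and, as the gdb session above showed, whether a flagged pattern actually hangs still depends on the surrounding anchors, flags, and input.

    # Hypothetical audit heuristic, not revscoring code: flag pattern sources
    # that quantify a group whose body already contains a '+'.
    import re

    # A group with no nested parentheses, containing at least one '+',
    # immediately followed by '*' or '+'. Crude on purpose: it ignores nested
    # and escaped parentheses and cannot prove a flagged pattern is dangerous.
    NESTED_PLUS = re.compile(r"\([^()]*\+[^()]*\)[*+]")

    def looks_catastrophic(pattern_source):
        return bool(NESTED_PLUS.search(pattern_source))

    if __name__ == "__main__":
        samples = [
            r"j+[eaiou]+(j+[aeiou]*)*",   # the eswiki culprit  -> flagged
            r"j+[eaiou]+(j[aeiou]*)*",    # the fix             -> passes
            r"(?:b+l+a+h*)+",             # worth a closer look -> flagged
            r"(l+[uo]+l+)([uo]+l+)*",     # worth a closer look -> flagged
        ]
        for src in samples:
            print("FLAG" if looks_catastrophic(src) else "ok  ", src)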
[23:43:47] I 'll pool eqiad again [23:45:07] (03PS1) 10Alexandros Kosiaris: Revert "hieradata: shift all ores traffic to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/361192 [23:45:43] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "hieradata: shift all ores traffic to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/361192 (owner: 10Alexandros Kosiaris) [23:47:54] (03Draft1) 10Paladox: planet: Update css and templates to be modern look [puppet] - 10https://gerrit.wikimedia.org/r/361190 [23:48:03] (03PS2) 10Paladox: planet: Update css and templates to be modern look [puppet] - 10https://gerrit.wikimedia.org/r/361190 [23:50:56] (03PS3) 10Paladox: planet: Update css and templates to be modern look [puppet] - 10https://gerrit.wikimedia.org/r/361190 [23:52:59] (03PS4) 10Paladox: planet: Update css and templates to be modern look [puppet] - 10https://gerrit.wikimedia.org/r/361190 [23:58:55] (03PS1) 10Foks: Adding font files and license [mediawiki-config/fonts] - 10https://gerrit.wikimedia.org/r/361195 (https://phabricator.wikimedia.org/T168757)